# Test T5 Bottleneck Autoencoder
Following the aborted attempt to convert wikipedia to embeddings I have switched my attention to seeing how well the autoencoder performs on different tasks.

First initialise our module.

In [1]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: List[str]) -> List[List[float]]:

        # big batches are causing us to run out of memory. Limit the size
        embeddings = list()
        for i in range(0, len(text), 100):
            end = i + 100
            if end > len(text):
                end = len(text)
            batch = text[i:end]
        
            inputs = self.tokenizer(batch, padding=True, truncation=True, return_tensors='pt').to(self.device)
            decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
            embeddings.extend(self.model(
                    **inputs,
                    decoder_input_ids=decoder_inputs['input_ids'],
                    encode_only=True,
                ).to('cpu').tolist())
        
        return embeddings

    @torch.no_grad()
    def generate_from_latent(self, latent: List[float], max_length=512, temperature=1.0) -> str:
        dummy_text = ['.']
        dummy = torch.tensor(self.embed(dummy_text)).to(device)
        latent = torch.tensor(latent).to(device)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)


  from .autonotebook import tqdm as notebook_tqdm


In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)



## Task 1: Decoding of random numbers
I want to test the coherence of the decoder output when fed random numbers as input encodings. This will give a measure of the COLA hardwired into the decoder.

In [3]:
texts = [
    'My name is John Oates, and I am a software engineer at a large technology company.',
    'Transformers are a neat way to generate pattern recognition in sequences.',
    'Religion is the study of why nature is the way it is, while science studies how nature works and when it happened.',
]

embedding = autoencoder.embed(texts)
for i in range(len(embedding)):
    reconstruction = autoencoder.generate_from_latent(embedding[i])
    print(reconstruction)
    print(embedding[i])
    print(f'Length: {len(embedding[i])} Max: {max(embedding[i])} Min: {min(embedding[i])}')


I am John Oates, and he is a software engineer at a large technology company.
[0.01991282030940056, -0.06426593661308289, -0.08330079168081284, -0.0544731542468071, 0.04063833877444267, 0.10520881414413452, 0.05505284667015076, 0.009258291684091091, 0.021132873371243477, -0.007605783175677061, -0.06777230650186539, 0.08723703771829605, 0.1629476398229599, 0.056689925491809845, 0.06832209974527359, -0.03231178969144821, 0.03981472924351692, 0.02445632591843605, 0.00010888367978623137, 0.08325071632862091, -0.04979619383811951, 0.037378259003162384, -0.011149161495268345, 0.032959774136543274, 0.0070025259628891945, -0.07656986266374588, -0.030392395332455635, -0.014880109578371048, 0.029728980734944344, 0.18170928955078125, 0.02795609086751938, -0.0036259947810322046, -0.017561130225658417, 0.027174590155482292, 0.18387097120285034, 0.05278738588094711, 0.06831362098455429, -0.07823357731103897, 0.007841477170586586, -0.07940450310707092, -0.03891848027706146, 0.05875302106142044, -0.08

So the range is roughly +-0.25. What happens if we increase the vectors?

In [4]:
for i in range(len(embedding)):
    double_embed = [v*4.0 for v in embedding[i]]
    reconstruction = autoencoder.generate_from_latent(double_embed)
    print(reconstruction)


I am John Oates, and I am a software engineer at a large technology company.
Transformers are a neat way to generate pattern recognition in sequences.
Religion is the study of why nature is what it is, while science is the study of how things work and when it happened.


So it appears that the direction means more than the magnitude.

Lets now move onto some randomness.

In [6]:
import random

In [7]:
for i in range(20):
    rand_vec = [random.uniform(-0.25, 0.25) for _ in range(1024)]
    reconstruction = autoencoder.generate_from_latent(rand_vec)
    print(f'{i}: {reconstruction}')


0: 
1: – people and technology developed there, and became living with sand.
2: Strategy:, The thin wire walk, the thin line
3: 
4: The building demonstrates the "Donation of the Carbon" in Freiburg on the occasion of the millennium.
5: .
6: in which its has and has not done for in their respective
7: But we are chasing and chasing (in the woods, with music)
8: and died in 2001 and died of cancer at 23 September 2007.
9: to the PGDS and the
10: - Kidder and Theta - The fourth personal assistant for the Iron Mock
11: //| empied upon the former and WCAM used the generic
12: , which were created during the extraction of petroleum
13: The next issue is selected and selected by the Association.
14: and were killed or became critically injured at the same time.
15: OGAE descended from that organization's old Colorado State
16: with the March 26th. The band and everyone drove to the car
17: (3)
18: : from until 2017 to 2012 – the language of
19: -


So I conclude that the decoder generates syntactically correct text, even from random numbers. That is pretty cool. More importantly, it only generates English words.

## Test 2: Changing the sentiment of a statement.

In [8]:
sentiment_texts = [
    "John is happy to be programming again.",
    "Jane loves going to Rock 'n' Soul choir."
]

In [9]:
embedding = autoencoder.embed(sentiment_texts)
for i in range(len(embedding)):
    reconstruction = autoencoder.generate_from_latent(embedding[i])
    print(reconstruction)


John is happy to be programming again.
Jane loves going to Rock 'n' Soul choir.


In [10]:
negative = autoencoder.embed("hates")
print(negative)
for i in range(len(embedding)):
    neg_vec = [v1*v2 for (v1, v2) in zip(embedding[i], negative[0])]
    reconstruction = autoencoder.generate_from_latent(neg_vec)
    print(reconstruction)


[[-0.1034824550151825, -0.06876642256975174, -0.13420964777469635, -0.08945334702730179, 0.11473014205694199, 0.18411469459533691, -0.00549500435590744, 0.020607423037290573, -0.050372637808322906, -0.21729706227779388, -0.036227621138095856, -0.037276122719049454, 0.04307132586836815, 0.06629865616559982, 0.007258169818669558, -0.02289493940770626, -0.0040655843913555145, 0.024970997124910355, 0.07110512256622314, -0.05078974738717079, 0.007019795011729002, -0.07858515530824661, -0.1131925880908966, -0.0016681329580023885, 0.023082945495843887, -0.054645512253046036, 0.018191572278738022, -0.011540539562702179, 0.019201574847102165, -0.03604283556342125, 0.05499414727091789, -0.08330504596233368, -0.07072454690933228, -0.0010814776178449392, 0.05308796837925911, -0.00027150390087626874, 0.014545335434377193, 0.09264039248228073, 0.05463188514113426, -0.0444226898252964, 0.0661279633641243, -0.04774559289216995, -0.006843614857643843, 0.07339505851268768, 0.018964972347021103, -0.09703

OK - so vector multiply is not the right answer. Lets try addition.

In [11]:
negative = autoencoder.embed("hates")
for i in range(len(embedding)):
    neg_vec = [v1+v2 for (v1, v2) in zip(embedding[i], negative[0])]
    reconstruction = autoencoder.generate_from_latent(neg_vec)
    print(reconstruction)


John hates programming again.
Jane hates going to Rock 'n' Soul choir.


Well - I did not expect that. Addition sets the sentiment. Lets test this further.

In [13]:
sentiment_texts = [
    "John is happy to be programming again.",
    "Jane loves going to Rock 'n' Soul choir.",
    "It was a bad movie, full of unrealistic characters and an untraceable plot."
]

In [14]:
sentiment = autoencoder.embed("melancholic")
old_sentiment = autoencoder.embed('happy')
for i in range(len(embedding)):
    neg_vec = [v1+v2-v3 for (v1, v2, v3) in zip(embedding[i], sentiment[0], old_sentiment[0])]
    reconstruction = autoencoder.generate_from_latent(neg_vec)
    print(reconstruction)


John melancholy releasing the programming again.
Jane loves melancholic music going into Rock 'n' Soul Chorus.


Can I determine the sentiment of a statement?

In [17]:
def dot_product(v1, v2):
    """Calculate and return the dot product of two vectors."""
    return sum([x * y for x, y in zip(v1, v2)])

def closest_dot_product_index(vectors, target_vector):
    """Find the index of the vector in 'vectors' with the largest dot product with 'target_vector'."""
    # Initial placeholders for the closest index and the minimum difference
    closest_index = -1
    max_dp = float('inf')
    
    for i, vector in enumerate(vectors):
        # Calculate the dot product between the current vector and the target vector
        dp = dot_product(vector, target_vector)
        
        # Update the closest index and minimum difference if the current vector is closer
        if dp < max_dp:
            max_dp = dp
            closest_index = i
            
    return closest_index


In [18]:
sentiments = ['good', 'bad', 'happy', 'loves']
sentiment_vec = autoencoder.embed(sentiments)
for i in range(len(embedding)):
    sentiment = sentiments[closest_dot_product_index(sentiment_vec, embedding[i])]
    print(f'{sentiment} : {sentiment_texts[i]}')


bad : John is happy to be programming again.
good : Jane loves going to Rock 'n' Soul choir.


So that does not work. It probably means that the vectors contain more than the meaning but say, the syntax as well. A dot product would be comparing these syntactic similarities as well as the semantic similarities. Is it possible to separate the syntactic from the semantic?

What does negating the vector do?

In [19]:
for i in range(len(embedding)):
    neg_vec = [-1.0 * v1 for v1 in embedding[i]]
    reconstruction = autoencoder.generate_from_latent(neg_vec)
    print(reconstruction)


Falls in from "tail "ors"Nige on
),


Negating does not work. So why does adding work?

What does a null vector do?

In [20]:
null_vec = [1.0 for _ in embedding[0]]
reconstruction = autoencoder.generate_from_latent(null_vec)
print(f"Vec: {reconstruction}")

Vec: "Grizzly", a historic stage in the French chansons populaires


Lets try masking out parts of the vector

In [21]:
import numpy as np

def mask_vec(vec, start_fraction, end_fraction = 1.0):
    start_mask_index = int(len(vec) * start_fraction)
    end_mask_index = int(len(vec) * end_fraction)
    print(f'{start_mask_index} : {end_mask_index}')
    return [vec[i] if i < start_mask_index or i > end_mask_index else 0.0 for i in range(len(vec))]

for i in range(len(embedding)):
    for frac in np.arange(0.0, 1.0, 0.1):
        reconstruction = autoencoder.generate_from_latent(mask_vec(embedding[i], 0.2, 0.2+frac))
        print(f'{i} {frac:.1} : {reconstruction}')

204 : 204
0 0e+00 : John is happy to be programming again.
204 : 307
0 0.1 : John is happy to be programming again.
204 : 409
0 0.2 : John is really happy to be back playing programming again.
204 : 512
0 0.3 : John was happy to be a programming again.
204 : 614
0 0.4 : John is now happy to be programming again.
204 : 716
0 0.5 : John: I am happy to be programming again.
204 : 819
0 0.6 : Johns – Paul was happy to be programming again.
204 : 921
0 0.7 : John V. Paul – wanted to be comfortable working for the programming again.
204 : 1024
0 0.8 : John M. Winters to be happy and excited again with programming.
204 : 1126
0 0.9 : John M. Alfreds: happy to be programming and interpreter again.
204 : 204
1 0e+00 : Jane loves going to Rock 'n' Soul choir.
204 : 307
1 0.1 : Jane loves going to Rock 'n' Soul choir.
204 : 409
1 0.2 : Jane loves going to the Rock 'n' Roll Choir.
204 : 512
1 0.3 : Jane loves going to Rock 'n' Roll church and singing in a South African choir.
204 : 614
1 0.4 : Jan

So it appears that the vector contains redundancy. 

In [22]:
v = 'This is the first sentence. I want to see if the encoding is ordered by sentences. And one last sentence to test the end.'
e = autoencoder.embed([v])
reconstruction = autoencoder.generate_from_latent(e)
print(reconstruction)

This is the first sentence. I want to see if the encoding is ordered by sentences. And one last sentence to test the end.


In [23]:
for frac in np.arange(0.0, 1.0, 0.1):
    reconstruction = autoencoder.generate_from_latent(mask_vec(e[0], frac))
    print(f'{frac:.1} : {reconstruction}')

0 : 1024
0e+00 : 
102 : 1024
0.1 : It is characterized by the sequence F. Then it is made to grow with the needle between the two lungs.
204 : 1024
0.2 : I sends the first Letters to be declared. Then the muffs by rows and the a section. Why must the poetry be for all
307 : 1024
0.3 : This is to order the letters. The first envelop is to be read. Then a northern cipher is detested and the second detestable. The last should be a sequence of words for two rows.
409 : 1024
0.4 : This is to indicate the letters are read. The first syllable is ordered by a crushing. Then the last two syllables to test and the sentence to be sentence north.
512 : 1024
0.5 : This is the first i.e. to set the sentences. Then one should try to order and to be interpreted by a red pine letter. The last sentence is the
614 : 1024
0.6 : This is the first sentence. I want to see if the sentences are encodings. Then the last sentence is to test and see the end.
716 : 1024
0.7 : This is the first sentence. Is to orde

OK - so the whole vector seems to contain info. Lets try masking from the other end.

In [24]:
for frac in np.arange(0.0, 1.0, 0.1):
    reconstruction = autoencoder.generate_from_latent(mask_vec(e[0], 0.0, frac))
    print(f'{frac:.1} : {reconstruction}')

0 : 0
0e+00 : The first sentence is the encoding. I want to see if this sentence is ordered by sentences. And the last sentence to test one end.
0 : 102
0.1 : The first sentence is. If I want to see if the sentence is ordered by encoding. And the last sentence to try this one.
0 : 204
0.2 : The first is. So if you want to find the order of sentences, one sentence is encoding. And try to order the last sentence from the end.
0 : 307
0.3 : The first is. So I want to check the order of sentences. And this is one sentence to the end.
0 : 409
0.4 : The first question is if you want to see the order of sentences. And I want to see the last one from the end.
0 : 512
0.5 : So the first question is. I want to see the order of the sentences. and encoding the last one to the start of this sentence.
0 : 614
0.6 : The first order. If you want to know how the sentence is ordered. Add one by one to the last time.
0 : 716
0.7 : Then we should see how the other is organized. And I want to end the sente