# Preprocess Wikipedia
We need to present the Wikipedia articles as a sequence of vectors whose dimension matches that of our model. BottleneckT5 is a good autoencoder that reduces text of up to 512 tokens to a single vector, with the vector size depending on the model variant. I was happiest with the performance of the large model, which produces vectors of length 1024. If we make this the dimension of our model, we can preprocess Wikipedia into a series of 1024-dimensional vectors.

BottleneckT5 will encode multiple sentences into one vector, provided the total number of tokens stays within 512. How we divide the data determines how our model works: if we feed in single whole sentences, we should expect each output to represent one sentence. As we are moving from models that take word pieces as input to one that takes whole sentences, let's be conservative at first and not encode multiple sentences per vector. This fits my notion that we are creating a model that works with ideas rather than words. That means we need to identify the sentences in each article and then create the corresponding vectors. This notebook goes through that process.

In [8]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: List[str]) -> List[List[float]]:

        # Large batches run out of memory, so embed at most 100 texts at a time
        embeddings = list()
        for i in range(0, len(text), 100):
            # Slicing clamps at the end of the list, so no explicit bound check is needed
            batch = text[i:i + 100]
        
            inputs = self.tokenizer(batch, padding=True, truncation=True, return_tensors='pt').to(self.device)
            decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
            embeddings.extend(self.model(
                    **inputs,
                    decoder_input_ids=decoder_inputs['input_ids'],
                    encode_only=True,
                ).to('cpu').tolist())
        
        return embeddings

    @torch.no_grad()
    def generate_from_latent(self, latent: List[float], max_length=512, temperature=1.0) -> str:
        dummy_text = ['.']
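        # Use the instance's device rather than relying on a global `device` variable
        dummy = torch.tensor(self.embed(dummy_text)).to(self.device)
        # as_tensor accepts either a plain list or an existing tensor
        latent = torch.as_tensor(latent).to(self.device)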
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
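
The fixed-size batching inside `embed` can be looked at in isolation. The sketch below is a hypothetical helper (`batched` is my own name, not part of BottleneckT5 or `transformers`) that yields slices of at most `batch_size` items. It relies on the fact that Python slicing clamps at the end of a list, so the final short batch needs no special handling.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        # Slicing past the end of the list is safe and returns the remainder.
        yield items[start:start + batch_size]


# 352 placeholder sentences, the size of the longest article inspected
# later in this notebook
sizes = [len(batch) for batch in batched(["x"] * 352, 100)]
print(sizes)  # [100, 100, 100, 52]
```

An article of that size would therefore be embedded in four forward passes, while most articles fit in a single one.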


In [9]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)


In the next cell we check the performance of BottleneckT5 on single sentences. The sentences are made up, so the model cannot have seen them during training; a faithful reconstruction therefore suggests that the model generalises well.

In [10]:
texts = [
    'My name is John Oates, and I am a software engineer at a large technology company.',
    'Transformers are a neat way to generate pattern recognition in sequences.',
    'Religion is the study of why nature is the way it is, while science studies how nature works and when it happened.',
]

embedding = autoencoder.embed(texts)
for i in range(len(embedding)):
    reconstruction = autoencoder.generate_from_latent(embedding[i])
    print(reconstruction)

print(type(embedding))

I am John Oates, and I am a software engineer at a large technology company.
Transformers are a neat way to generate pattern recognition in sequences.
Religion is the study of why nature is the way it is, while science studies how things work and how they happened.
<class 'list'>
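
Judging the reconstructions by eye works for three sentences, but a number is handier for larger samples. As a rough sketch, and only one of many possible measures, the standard library's `difflib.SequenceMatcher` can score word-level overlap between an original and its reconstruction. The helper name `word_similarity` is my own, and the sentence pairs are copied from the cell above. Because `generate_from_latent` samples with `do_sample=True`, reconstructions (and so the scores) can vary between runs.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, reconstruction: str) -> float:
    """Return a 0..1 similarity ratio computed over whitespace-separated words."""
    return SequenceMatcher(None, original.split(), reconstruction.split()).ratio()


pairs = [
    ("My name is John Oates, and I am a software engineer at a large technology company.",
     "I am John Oates, and I am a software engineer at a large technology company."),
    ("Transformers are a neat way to generate pattern recognition in sequences.",
     "Transformers are a neat way to generate pattern recognition in sequences."),
]
for original, reconstruction in pairs:
    print(f"{word_similarity(original, reconstruction):.2f}")
```

An exact reconstruction scores 1.0 and a paraphrase scores lower. This measures surface overlap only; it says nothing about whether the meaning was preserved.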


Before encoding everything, we first want to test the process on a small subset. Let's use the first 10 articles.

In [11]:
from datasets import load_dataset
wiki = load_dataset("olm/wikipedia", language="en", date="20240201")


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [12]:
from datasets import Dataset

# Skim off the first 10 articles for testing
test_wiki = wiki['train'][:10]
test_wiki = Dataset.from_dict(test_wiki)


In [7]:
print(wiki.column_names)
print(test_wiki.column_names)

{'train': ['id', 'url', 'title', 'text']}
['id', 'url', 'title', 'text']


In [13]:
import re

def get_sentences(article):
    # split the article into sentences
    sentences = re.split(r'(?<=[.!?;:\n\r])\s+', article)
    # remove empty sentences
    sentences = [s.strip() for s in sentences if len(s.strip()) > 0]
    # remove sentences that are too long by filtering out sentences with more than 400 words
    sentences = [s for s in sentences if len(s.split()) <= 400]
    return sentences

test_wiki_sentences = test_wiki.map(lambda x: {'sentences': get_sentences(x['text'])})

Map: 100%|██████████| 10/10 [00:00<00:00, 2070.95 examples/s]
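
It is worth seeing what the regular expression in `get_sentences` actually does on a small input. The sketch below repeats the same pattern in a standalone function (`split_sentences` is my own name and leaves out the 400-word filter) and runs it on a made-up sample. Two behaviours stand out. A single newline after a heading does not cause a split, because the newline is itself the whitespace being matched and the character before it is a letter. And abbreviations such as "Dr." are split as if they ended a sentence.

```python
import re
from typing import List


def split_sentences(article: str) -> List[str]:
    """Split after . ! ? ; : or a newline, whenever whitespace follows."""
    parts = re.split(r'(?<=[.!?;:\n\r])\s+', article)
    return [p.strip() for p in parts if p.strip()]


sample = "It was built in 1924. Only one was made.\nDevelopment\nFollowing the trials, work began."
for sentence in split_sentences(sample):
    print(repr(sentence))
# 'It was built in 1924.'
# 'Only one was made.'
# 'Development\nFollowing the trials, work began.'

print(split_sentences("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```

The first behaviour is why a section heading such as "Development" ends up glued to the sentence that follows it in the output further down.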


In [14]:
def get_embeddings_batch(sentences):
    return autoencoder.embed(sentences)

test_wiki_embeddings = test_wiki_sentences.map(lambda x: {'embeddings': get_embeddings_batch(x['sentences'])})

Map: 100%|██████████| 10/10 [00:00<00:00, 13.15 examples/s]


In [24]:
print(test_wiki_embeddings.column_names)
for e in test_wiki_embeddings:
    for i in range(len(e['sentences'])):
        print(e['sentences'][i])
        embeddings = torch.tensor(e['embeddings'][i]).to(device)  # Move embeddings to the same device
        print(embeddings)
        print(autoencoder.generate_from_latent(embeddings))
        print('-----------------')
    break

['id', 'url', 'title', 'text', 'sentences', 'embeddings']
The Vickers Vagabond was Vickers' entrant for the second Lympne light aircraft competition, held in 1924.
tensor([ 0.0124,  0.0364, -0.0052,  ..., -0.0788,  0.0109,  0.1947],
       device='cuda:0')
The Vickers Vagabond was a Vickers' entry for the second Lympne light aircraft competition, held in 1924.
-----------------
It was a conventional small biplane, with a very unusual method of trimming.
tensor([-0.1697,  0.0039, -0.0185,  ...,  0.0743,  0.0542, -0.0175],
       device='cuda:0')
It was a conventional small biplane, with a very unusual method of trimming.
-----------------
It was eliminated from the trials at an early stage and only one was built.
tensor([-1.2169e-01,  5.5488e-02,  6.1400e-03,  ..., -2.9306e-02,
         2.9215e-02, -6.4560e-05], device='cuda:0')
It was eliminated from the trials at an early stage and only one was built.
-----------------
Development
Following the first Lympne trials held in 1923 for sin

Now save the embeddings and read them back to prove that we have a working representation of the articles.

In [34]:
# Save the test_wiki_embeddings dataset to disk
test_wiki_embeddings.save_to_disk('test_wiki_embeddings')

Saving the dataset (1/1 shards): 100%|██████████| 10/10 [00:00<00:00, 1365.16 examples/s]


In [36]:
# read the dataset back from disk
test_wiki_embeddings_back = Dataset.load_from_disk('test_wiki_embeddings')
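
Reading printed tensors is an imprecise way to confirm a round trip. A minimal sketch of a numeric check follows; `vectors_match` is a hypothetical helper of mine and the tolerance is an arbitrary choice. It compares two vectors element by element.

```python
import math
from typing import Sequence


def vectors_match(a: Sequence[float], b: Sequence[float], abs_tol: float = 1e-6) -> bool:
    """True when both vectors have the same length and agree element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Assumed usage with the datasets in this notebook (not run here):
# vectors_match(test_wiki_embeddings[0]['embeddings'][0],
#               test_wiki_embeddings_back[0]['embeddings'][0])
print(vectors_match([0.1, 0.2], [0.1, 0.2]))   # True
print(vectors_match([0.0124], [0.0198]))       # False
```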

In [37]:
print(test_wiki_embeddings_back.column_names)
for e in test_wiki_embeddings_back:
    for i in range(len(e['sentences'])):
        print(e['sentences'][i])
        embeddings = torch.tensor(e['embeddings'][i]).to(device)  # Move embeddings to the same device
        print(embeddings)
        print(autoencoder.generate_from_latent(embeddings))
        print('-----------------')
    break

['id', 'url', 'title', 'text', 'sentences', 'embeddings']
The Vickers Vagabond was Vickers' entrant for the second Lympne light aircraft competition, held in 1924.
tensor([ 0.0198,  0.0262, -0.0087,  ..., -0.0770,  0.0140,  0.1913],
       device='cuda:0')
The Vickers Vagabond was a Vickers' entry for the second Lympne light aircraft competition, held in 1924.
-----------------
It was a conventional small biplane, with a very unusual method of trimming.
tensor([-0.1479,  0.0048, -0.0186,  ...,  0.0602,  0.0649, -0.0328],
       device='cuda:0')
It was a conventional small biplane, with a very unusual method of trimming.
-----------------
It was eliminated from the trials at an early stage and only one was built.
tensor([-0.1034,  0.0481, -0.0100,  ..., -0.0298,  0.0395, -0.0027],
       device='cuda:0')
It was eliminated from the trials at an early stage and only one was built.
-----------------
Development
Following the first Lympne trials held in 1923 for single-seat motor-gliders, t

OK, now let's do it for the whole dataset.

In [17]:
full_wiki = wiki['train']
print(len(full_wiki))
print(full_wiki.column_names)


6775235
['id', 'url', 'title', 'text']


In [18]:
wiki_sentences = full_wiki.map(lambda x: {'sentences': get_sentences(x['text'])})


In [19]:
wiki_embeddings = wiki_sentences.map(lambda x: {'embeddings': get_embeddings_batch(x['sentences'])})
wiki_embeddings.save_to_disk('wiki_embeddings')


Map:   3%|▎         | 171934/6775235 [7:36:36<292:16:28,  6.28 examples/s]  


KeyboardInterrupt: 

In [17]:
for i in range(3985, 4000):
    b = wiki_sentences[i]['sentences']
    print(len(b))

7
10
10
352
8
18
36
19
22
8
17
4
29
35
32


In [19]:
b = wiki_sentences[3988]['sentences']
length = sum([len(s.split()) for s in b])
print(f'{len(b)} sentences with {length} words.')

352 sentences with 5910 words.


This attempt has been abandoned because converting Wikipedia to embeddings takes too long: the first 3% (171,934 of 6,775,235 articles) took more than 7.5 hours on my RTX 3090 running at 92% utilisation, and the progress bar projected roughly another 292 hours.
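
The projection can be reproduced from the progress bar alone. The sketch below uses only the figures printed above (171,934 of 6,775,235 articles in 7:36:36) and assumes the throughput stays constant, which it may not, since article lengths vary a lot. `parse_hms` is a small helper of my own.

```python
def parse_hms(text: str) -> int:
    """Convert an 'H:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


done = 171_934            # articles embedded when the run was interrupted
total = 6_775_235         # articles in the dataset
elapsed_s = parse_hms("7:36:36")

rate = done / elapsed_s                     # articles per second
remaining_h = (total - done) / rate / 3600  # hours still to go
total_days = total / rate / 86400           # days for the full run

print(f"{rate:.2f} articles/s, {remaining_h:.0f} h remaining, {total_days:.1f} days in total")
```

This gives about 6.28 articles per second and about 292 hours remaining, in line with the progress bar, for a full run of roughly twelve and a half days.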

I have switched my attention to exploring how well the autoencoder performs on different tasks.