# Contra Bottleneck T5 XL Wikipedia Autoencoder
If you cannot generate something yourself, use the work of others. This autoencoder is by Hugging Face user thesephist.
I will load his model and see how good it is.
If it works, I will use it to learn how to build my own autoencoder based on BERT.

In [1]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)




The model comes in several sizes (small, base, large and xl). I have not done any measurements, but the bigger models seem to reconstruct text better, so I will use the biggest model available to me.

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
#device = 'cuda:1'
#autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)
#autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-small-wikipedia', device=device)
#autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-base-wikipedia', device=device)
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-xl-wikipedia', device=device)

print(autoencoder.model.config)


Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.41s/it]


T5Config {
  "_name_or_path": "thesephist/contra-bottleneck-t5-xl-wikipedia",
  "architectures": [
    "BottleneckT5LMWithPerturb"
  ],
  "auto_map": {
    "AutoModelForCausalLM": "thesephist/contra-bottleneck-t5-xl-wikipedia--bottleneck_t5.BottleneckT5LMWithPerturb"
  },
  "classifier_dropout": 0.0,
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "vocab_size": 32128
}



Now to find out how the model works. Unfortunately, I cannot find a description of it anywhere on the web. Let's see if there is code downloaded.

In [5]:
print(autoencoder.model)

BottleneckT5LMWithPerturb(
  (shared): Embedding(32128, 2048)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 2048)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=2048, out_features=2048, bias=False)
              (k): Linear(in_features=2048, out_features=2048, bias=False)
              (v): Linear(in_features=2048, out_features=2048, bias=False)
              (o): Linear(in_features=2048, out_features=2048, bias=False)
              (relative_attention_bias): Embedding(32, 32)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=2048, out_features=5120, bias=False)
              (wi_1): Linear(in_features=2048, out_features=5120, bias=False)
        

In [11]:
help(autoencoder.model)

Help on BottleneckT5LMWithPerturb in module transformers_modules.thesephist.contra-bottleneck-t5-xl-wikipedia.52fa8371ec9e0a9c0705b2e805e501ff304d7fed.bottleneck_t5 object:

class BottleneckT5LMWithPerturb(transformers.models.t5.modeling_t5.T5ForConditionalGeneration)
 |  BottleneckT5LMWithPerturb(config: transformers.models.t5.configuration_t5.T5Config)
 |  
 |  T5 Model with a `language modeling` head on top.
 |  
 |  The T5 model was proposed in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text
 |  Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan
 |  Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder decoder transformer pre-trained in a
 |  text-to-text denoising generative setting.
 |  
 |  This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
 |  library implements for all its model (such as downloading or saving, resiz
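
The module path in the help output above suggests the class comes from a downloaded `bottleneck_t5.py`. A minimal sketch of one way to find that file on disk uses the standard `inspect` module. The helper name `locate_source` is my own, and a standard-library class stands in for the model so the snippet runs without loading anything. I am assuming the remote code is cached as an ordinary `.py` file.

```python
import inspect
import json


def locate_source(obj):
    """Return (module name, source file path) for obj's class."""
    cls = obj if inspect.isclass(obj) else type(obj)
    return cls.__module__, inspect.getsourcefile(cls)


# Stand-in object from the standard library, used only for demonstration.
module_name, path = locate_source(json.JSONDecoder())
print(module_name)
print(path)

# In this notebook the same call would be (assumption: the model's remote
# code is cached locally as a normal Python file):
#   module_name, path = locate_source(autoencoder.model)
#   print(open(path).read())
```

Reading the file this way shows the actual `forward` and bottleneck code rather than the inherited T5 docstring that `help` prints.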

In [12]:
texts = [
    'The quick brown fox jumps over the lazy dog',
    'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
    'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
]

for t in texts:
    embedding = autoencoder.embed(t)
    reconstruction = autoencoder.generate_from_latent(embedding)
    print(reconstruction)


The quick brown fox jumps over the lazy dog
Here I am, my name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.
Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, and run an entire company — or even do it exactly the way you want.
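
To put a rough number on how close a reconstruction is, one option is a word-level similarity ratio from the standard library's `difflib`. This is only a sketch: the function name `word_similarity` is mine, and the ratio measures surface overlap in word order, not meaning, so it is a crude proxy for reconstruction quality.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, reconstruction: str) -> float:
    """Similarity in [0, 1] between two texts, compared word by word."""
    a = original.split()
    b = reconstruction.split()
    return SequenceMatcher(None, a, b).ratio()


# Two of the original/reconstruction pairs printed above.
pairs = [
    ('The quick brown fox jumps over the lazy dog',
     'The quick brown fox jumps over the lazy dog'),
    ('Hi there! My name is Linus, and I spend a lot of my time thinking '
     'about latent spaces of neural network models.',
     'Here I am, my name is Linus, and I spend a lot of my time thinking '
     'about latent spaces of neural network models.'),
]

scores = [word_similarity(orig, recon) for orig, recon in pairs]
for score in scores:
    print(f'{score:.3f}')
```

An exact reconstruction scores 1.0. Because `generate_from_latent` samples with `do_sample=True`, scores will vary from run to run.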


Now let's see how it goes with Wikipedia text.

In [7]:
from datasets import load_dataset
wiki = load_dataset("olm/wikipedia", language="en", date="20240201")


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [8]:
from datasets import Dataset

# Take the first 10 articles as a small test subset.
# Slicing a Dataset returns a plain dict of columns, so convert it back to a Dataset.
test_wiki = wiki['train'][:10]
test_wiki = Dataset.from_dict(test_wiki)


In [9]:
import re

def get_sentences(article):
    # Split on whitespace that follows sentence-ending punctuation (or a newline).
    sentences = re.split(r'(?<=[.!?\n\r])\s+', article)
    # Remove empty sentences.
    sentences = [s.strip() for s in sentences if len(s.strip()) > 0]

    # Greedily merge consecutive sentences while the chunk stays under
    # 512 characters (characters, not tokens).
    merged_sentences = []
    current_sentence = ''
    for sentence in sentences:
        if not current_sentence:
            # Start a new chunk without a leading space or an empty entry.
            current_sentence = sentence
        elif len(current_sentence) + len(sentence) < 512:
            current_sentence += ' ' + sentence
        else:
            merged_sentences.append(current_sentence)
            current_sentence = sentence
    # Keep the final, partially filled chunk.
    if current_sentence:
        merged_sentences.append(current_sentence)
    return merged_sentences

test_wiki_sentences = test_wiki.map(lambda x: {'sentences': get_sentences(x['text'])})

Map: 100%|██████████| 10/10 [00:00<00:00, 1638.59 examples/s]


In [10]:
for article in test_wiki_sentences:
    for sentence in article['sentences']:
        embedding = autoencoder.embed(sentence)
        reconstruction = autoencoder.generate_from_latent(embedding)
        print(f'Original      : {sentence}')
        print(f'Reconstruction: {reconstruction}')
        print(f'Embedding: {embedding}')
    print('---')

Original      :  The Vickers Vagabond was Vickers' entrant for the second Lympne light aircraft competition, held in 1924. It was a conventional small biplane, with a very unusual method of trimming. It was eliminated from the trials at an early stage and only one was built. Development
Following the first Lympne trials held in 1923 for single-seat motor-gliders, the Air Ministry organised a similar event in 1924, this time for low-powered two-seat aircraft. The engine capacity limit was set at 1,100 cc.
Reconstruction: The Vickers Vagabond was a small experimental light aircraft, with an unusual'sneeping' tail arrangement. It was an unsuccessful contender for the first Lympne competition for single seat motor-gliders, held in 1923 and a very limited number were built. Following trials in 1924 at Lympne, the Air Ministry introduced a second competition, this time for two-seat motor-gliders. The aircraft was conventionally equipped with a small high-mounted engine, so the process was el

Wow, that is very good. Now let's play around and see how maths on the embeddings affects the text.

In [13]:
sentence1 = test_wiki_sentences['sentences'][0][0]
sentence2 = test_wiki_sentences['sentences'][0][1]

print(f'{sentence1 = :}')
print(f'{sentence2 = :}')
embedding1 = autoencoder.embed(sentence1)
embedding2 = autoencoder.embed(sentence2)
sentence3 = autoencoder.generate_from_latent((embedding1 + embedding2) / 2)
print(f'{sentence3 = :}')

embedding3 = autoencoder.embed(sentence1 + sentence2)
sentence4 = autoencoder.generate_from_latent(embedding3)
print(f'{sentence4 = :}')

sentence5 = autoencoder.generate_from_latent(embedding3 - ((embedding1 + embedding2) / 2))
print(f'{sentence5 = :}')

sentence6 = autoencoder.generate_from_latent(embedding3 * 0.5)
print(f'{sentence6 = :}')



sentence1 =  The Vickers Vagabond was Vickers' entrant for the second Lympne light aircraft competition, held in 1924. It was a conventional small biplane, with a very unusual method of trimming. It was eliminated from the trials at an early stage and only one was built. Development
Following the first Lympne trials held in 1923 for single-seat motor-gliders, the Air Ministry organised a similar event in 1924, this time for low-powered two-seat aircraft. The engine capacity limit was set at 1,100 cc.
sentence2 = and, as before, the wings had to fold for easy transport and storage. The trials took place between 29 September and 4 October. Several companies built aircraft for them, including the Blackburn Bluebird, Hawker Cygnet, Supermarine Sparrow and two from Westland, the Woodpigeon and Widgeon. The Type 98 Vagabond was Vickers' entry. It was a single-bay, wire-braced biplane with wings of constant chord except towards the rounded trailing tips. The wings had equal span and carried m
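
The average `(embedding1 + embedding2) / 2` used above is the midpoint of a linear interpolation between the two embeddings. A small sketch of the general case follows. The helper names `interpolate` and `interpolation_steps` are my own, and the functions only rely on `+` and `*`, so I expect them to work on the tensors returned by `embed` as well as on the plain floats used here for demonstration.

```python
def interpolate(a, b, t):
    """Linear interpolation: t=0 returns a, t=1 returns b, t=0.5 the average."""
    return (1.0 - t) * a + t * b


def interpolation_steps(n):
    """Return n evenly spaced weights from 0.0 to 1.0 inclusive."""
    if n < 2:
        raise ValueError('need at least two steps')
    return [i / (n - 1) for i in range(n)]


# Demonstration on plain floats, so the sketch runs without the model.
weights = interpolation_steps(5)
values = [interpolate(0.0, 10.0, t) for t in weights]
print(weights)
print(values)

# In this notebook (assumes embedding1, embedding2 and autoencoder exist):
#   for t in interpolation_steps(5):
#       latent = interpolate(embedding1, embedding2, t)
#       print(t, autoencoder.generate_from_latent(latent))
```

Stepping `t` from 0 to 1 shows how the decoded text drifts from the first passage to the second, rather than only sampling the midpoint.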

So it appears that the best performance occurs when the encoding is restricted to a single sentence. I do not yet have a way to tell when a passage exceeds the limit of 512 tokens. I will have to experiment with this.
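
One possible way to check the length is to count tokens before embedding. The sketch below takes any `encode` function that maps text to a list of token ids. The helper names and the whitespace stand-in encoder are my own and exist only so the snippet runs without downloading a tokenizer. With the real tokenizer I would expect the count to include the end-of-sequence token that a T5 tokenizer normally appends.

```python
from typing import Callable, List


def token_length(text: str, encode: Callable[[str], List[int]]) -> int:
    """Number of token ids that `encode` produces for `text`."""
    return len(encode(text))


def fits_in_context(text: str, encode: Callable[[str], List[int]],
                    max_tokens: int = 512) -> bool:
    """True if `text` encodes to at most `max_tokens` tokens."""
    return token_length(text, encode) <= max_tokens


def whitespace_encode(text: str) -> List[int]:
    """Stand-in encoder: one fake id per whitespace-separated word."""
    return list(range(len(text.split())))


print(token_length('The quick brown fox', whitespace_encode))
print(fits_in_context('The quick brown fox', whitespace_encode, max_tokens=3))

# In this notebook (assumes autoencoder exists and wraps a Hugging Face tokenizer):
#   encode = lambda s: autoencoder.tokenizer(s).input_ids
#   print(token_length(sentence1, encode), fits_in_context(sentence1, encode))
```

Note that `get_sentences` caps chunks at 512 characters, not 512 tokens, so the two limits are not the same thing.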

Let's try the small model.

In [4]:
device = 'cuda:0'
autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-small-wikipedia', device=device)

tokenizer_config.json: 100%|██████████| 2.37k/2.37k [00:00<00:00, 564kB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 1.48MB/s]
special_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.69MB/s]
config.json: 100%|██████████| 875/875 [00:00<00:00, 186kB/s]
bottleneck_t5.py: 100%|██████████| 18.9k/18.9k [00:00<00:00, 6.77MB/s]
A new version of the following files was downloaded from https://huggingface.co/thesephist/contra-bottleneck-t5-small-wikipedia:
- bottleneck_t5.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
pytorch_model.bin: 100%|██████████| 382M/382M [01:37<00:00, 3.93MB/s] 
generation_config.json: 100%|██████████| 142/142 [00:00<00:00, 46.5kB/s]
