## Using the GPT-2 Tokenizer from Hugging Face
---
The goal is to tokenize large amounts of text so that I can train a word embedding model on the resulting tokens.

### 1) Import GPT2Tokenizer
---
Although we built a BPETokenizer in the other files, we'll use the GPT2Tokenizer here because 1) it gives better token representations, and 2) it is likely more optimized, though our own encoding and decoding are pretty slick now.

In [1]:
# First import GPT2Tokenizer from HuggingFace Transformers library
from transformers import GPT2Tokenizer


In [2]:
# Defining our tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [3]:
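# Tokenize sample text
text = "Your sample text goes here."
# return_tensors='pt' returns a 2D PyTorch tensor of shape (1, num_tokens)
encoded_input = tokenizer.encode(text, return_tensors='pt')
# Index [0] selects the single sequence in the batch before decoding
decoded_output = tokenizer.decode(encoded_input[0], skip_special_tokens=True)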

print(f"Encoded: {encoded_input}")
print(f"Decoded: {decoded_output}")

Encoded: tensor([[7120, 6291, 2420, 2925,  994,   13]])
Decoded: Your sample text goes here.


### 2) Load our data with a custom PyTorch Dataset
---

In [10]:
# First import os, PyTorch, and the Dataset and DataLoader classes
import os
import torch
from torch.utils.data import Dataset, DataLoader

In [11]:
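# GPT-2 has no padding token by default, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Let's create a custom Dataset class: each item is one .txt file from the directory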
class TextFileDataset(Dataset):
    def __init__(self, directory, tokenizer, max_length=512):
        # Collect every .txt file in the directory (sorted so the file order is reproducible)
        self.file_paths = sorted(os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.txt'))
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        file_path = self.file_paths[idx]
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
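        # Keep only the first max_length tokens of the file; shorter files are padded up to max_length
        encoded = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, truncation=True, padding="max_length")
        # Drop the leading batch dimension so the DataLoader can stack items itself
        return encoded.input_ids.squeeze(0), encoded.attention_mask.squeeze(0)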
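
To make the `truncation=True, padding="max_length"` step more concrete, here is a minimal pure-Python sketch of what it does to a list of token ids. The helper `pad_or_truncate` is hypothetical and is not part of the Transformers API; it only assumes right-side truncation and padding, and that the pad id is GPT-2's end-of-text id (50256) because of `tokenizer.pad_token = tokenizer.eos_token`.

```python
# Id of <|endoftext|> in the GPT-2 vocabulary; assumed to double as the pad id here.
GPT2_EOS_ID = 50256


def pad_or_truncate(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Hypothetical helper: right-truncate or right-pad ids to exactly max_length."""
    kept = list(token_ids[:max_length])      # truncation: keep the first max_length ids
    n_real = len(kept)
    n_pad = max_length - n_real              # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# The six ids from the sample sentence above, padded to length 8.
ids, mask = pad_or_truncate([7120, 6291, 2420, 2925, 994, 13], max_length=8)
print(ids)   # [7120, 6291, 2420, 2925, 994, 13, 50256, 50256]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]

# A long sequence is simply cut off after max_length ids.
long_ids, long_mask = pad_or_truncate(range(1000), max_length=8)
print(long_ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the pad token and the end-of-text token share an id, the attention mask is the only thing that tells padding apart from a real end-of-text token.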

In [12]:
# Let's define our wiki8_dataset (os.path.join keeps the relative path portable across operating systems)
wiki8_dataset = TextFileDataset(directory=os.path.join("..", "data", "wikitext8_680MB"), tokenizer=tokenizer)

In [13]:
# Defining our dataloader. Note: batch_size is the number of files (25MB chunks) per batch; each file is truncated to max_length tokens
wiki8_dataloader = DataLoader(wiki8_dataset, batch_size=1, shuffle=True)
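
Since `TextFileDataset` tokenizes each file with `truncation=True` and `max_length=512`, only the first 512 tokens of every file are returned. One possible way to use the rest of each file is to tokenize it without truncation and split the ids into fixed-size windows. The sketch below shows only the splitting step in plain Python; `chunk_token_ids` is a hypothetical helper, not something the notebook or the Transformers library defines, and the ids are stand-ins.

```python
def chunk_token_ids(token_ids, max_length=512):
    """Hypothetical helper: split a sequence of ids into consecutive windows
    of at most max_length ids each (the last window may be shorter)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# Stand-in for the ids of one tokenized file (1200 tokens).
fake_ids = list(range(1200))
chunks = chunk_token_ids(fake_ids, max_length=512)
print([len(c) for c in chunks])  # [512, 512, 176]
```

Each window could then be padded to `max_length` and treated as its own training example, so that `__len__` would count windows rather than files.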

In [14]:
# Decode the first couple of batches as a sanity check on the pipeline
for i, (input_ids, attention_mask) in enumerate(wiki8_dataloader):
    if i >= 2:  # Just look at the first 2 batches
        break
    print(f"Batch {i+1}:")
    for j in range(input_ids.size(0)):  # Loop through each item in the batch
        decoded_text = tokenizer.decode(input_ids[j], skip_special_tokens=True)
        print(f"Text {j+1}: {decoded_text}\n")

Batch 1:
Text 1: er captain james cook capitancook was a free open content travel guide the website was a wikiwiki so everyone could create or edit articles in order to share travel experiences and independent information capitancook provided more than one eight zero zero articles and photos on destinations around the world all the articles and photos were published under the gnu free documentation license on november two nine two zero zero three capitancook merged with world six six wiki communities gfdl from the creater wshun zero two two zero two four dec two zero zero four utc the comic strip barnaby by crockett johnson best known today for his children s books such as harold and the purple crayon featured an almost cherubic looking five year old and his far from cherubic fairy godfather mr o malley a short cigar smoking man with four tiny wings barnaby got in a fair number of scrapes but most of them were either of mr o malley s making or resulted in embarrassment of some sort for