## Using the GPT-2 Tokenizer from Hugging Face
---
The goal is to tokenize large amounts of text so that I can train a word embedding model on the resulting tokens.

### 1) Import GPT2Tokenizer
---
We built a BPETokenizer in the other files, but here we'll use GPT2Tokenizer instead, for two reasons: 1) its token representations are better, and 2) it is likely more optimized, though our own encoding and decoding is pretty quick now.

In [1]:
# First import GPT2Tokenizer from HuggingFace Transformers library
from transformers import GPT2Tokenizer


In [2]:
# Defining our tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [3]:
# Tokenize sample text
text = "Your sample text goes here."
encoded_input = tokenizer.encode(text, return_tensors='pt')
decoded_output = tokenizer.decode(encoded_input[0], skip_special_tokens=True)

print(f"Encoded: {encoded_input}")
print(f"Decoded: {decoded_output}")

Encoded: tensor([[7120, 6291, 2420, 2925,  994,   13]])
Decoded: Your sample text goes here.


### 2) Load our data with PyTorch's Dataset and DataLoader
---

In [4]:
# Import os, PyTorch, and the Dataset/DataLoader classes
import os
import torch
from torch.utils.data import Dataset, DataLoader

In [9]:
# GPT-2 has no padding token by default, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Custom Dataset: one item per .txt file in the directory
class TextFileDataset(Dataset):
    def __init__(self, directory, tokenizer, max_length=8):
        self.file_paths = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.txt')]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        file_path = self.file_paths[idx]
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
        # Keep only the first max_length tokens of the file; shorter files are padded
        encoded = self.tokenizer(
            text,
            return_tensors='pt',
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
        )
        return encoded.input_ids.squeeze(0), encoded.attention_mask.squeeze(0)
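
The tokenizer call in `__getitem__` combines `truncation=True` with `padding="max_length"`, so every item comes back with exactly `max_length` ids plus a matching attention mask. The sketch below mimics that behaviour in plain Python to make the two outputs concrete. It is an illustration only, not the Hugging Face implementation: `pad_to_max_length` is a hypothetical helper, and it assumes right-side padding with the GPT-2 end-of-text id (50256) as the pad id, matching `tokenizer.pad_token = tokenizer.eos_token`.

```python
# Plain-Python illustration of truncation plus right-padding to a fixed length.
# NOT the Hugging Face implementation; pad_to_max_length is a hypothetical helper.

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary, reused as the pad id


def pad_to_max_length(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = list(token_ids[:max_length])             # truncation: drop ids past max_length
    n_pad = max_length - len(kept)                  # number of pad slots needed
    input_ids = kept + [pad_id] * n_pad             # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded; a long one gets truncated.
short_ids, short_mask = pad_to_max_length([7120, 6291, 2420], max_length=8)
long_ids, long_mask = pad_to_max_length(list(range(100, 120)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because the pad id is the same as the end-of-text id, the attention mask is the only reliable way to tell padding apart from a genuine end-of-text token.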

In [10]:
# Let's define our wiki8_dataset
wiki8_dataset = TextFileDataset(directory="..\\data\\wikitext8_680MB", tokenizer=tokenizer)
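
`TextFileDataset` builds its file list with `os.listdir`, which returns entries in arbitrary order, so the mapping from index to file can differ between machines or runs. If a stable mapping matters (for example, to reproduce a run with a fixed seed), sorting the paths is one option. The sketch below shows the idea on a throwaway directory; `list_text_files` and the file names are hypothetical and not part of the notebook.

```python
import os
import tempfile


def list_text_files(directory):
    """Return the sorted paths of the .txt files directly inside `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".txt")
    )


# Small demonstration in a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["chunk_02.txt", "chunk_01.txt", "notes.md"]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")
    found = [os.path.basename(path) for path in list_text_files(tmp_dir)]

print(found)
```

Using `os.path.join` throughout also avoids hard-coding a path separator, which helps if the notebook is run on something other than Windows.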

In [11]:
# Define our DataLoader. Note: each dataset item is one 25MB chunk file (truncated to max_length tokens), so batch_size is the number of chunk files per batch
wiki8_dataloader = DataLoader(wiki8_dataset, batch_size=1, shuffle=True)

In [13]:
for i, (input_ids, attention_mask) in enumerate(wiki8_dataloader):
    if i >= 2:  # Just look at the first 2 batches
        break
    print(f"Batch {i+1}:")
    print(f"{input_ids.size(0)}")
    for j in range(min(input_ids.size(0), 10)):  # Show at most 10 items from the batch
        decoded_text = tokenizer.decode(input_ids[j], skip_special_tokens=True)
        print(f"Text {j+1}: {decoded_text}\n")

Batch 1:
1
Text 1: te forces begun to gather in shim

Batch 2:
1
Text 1: four volumes captures the full subtlety of
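
The decoded samples are only eight tokens long because `max_length=8` with truncation keeps just the start of each file and discards the rest. If the aim is to train embeddings on all of the text, one possible approach is to tokenize a file once and then slice the resulting ids into fixed-length windows. The sketch below shows that slicing step on plain lists of integers; `chunk_token_ids` is a hypothetical helper, the optional `stride` for overlapping windows is an assumption about what might be useful, and any incomplete final window is simply dropped.

```python
def chunk_token_ids(token_ids, window, stride=None):
    """Split a flat list of token ids into full windows of length `window`.

    `stride` is the step between window starts; it defaults to `window`
    (no overlap). A trailing remainder shorter than `window` is dropped.
    """
    if stride is None:
        stride = window
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(token_ids) - window + 1, stride):
        chunks.append(token_ids[start:start + window])
    return chunks


ids = list(range(20))  # stand-in for the token ids of one file

windows = chunk_token_ids(ids, window=8)                # back-to-back windows
overlapping = chunk_token_ids(ids, window=8, stride=4)  # windows that overlap by 4 ids

print(len(windows), windows)
print(len(overlapping), overlapping[-1])
```

A dataset built this way would index windows rather than files, so `__len__` would return the total number of windows instead of the number of `.txt` files.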

