<a href="https://colab.research.google.com/github/myeze/MachineLearningModels/blob/main/GPT2Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT2 Model (Early Development)

**This notebook contains a model made by Myles Ezeanii.**

I am developing a Natural Language Processing (NLP) model based on the GPT-2 architecture in order to provide capabilities such as text generation, summarization, and conversational AI to users.

This project uses a pretrained GPT-2 model and dataset available from Hugging Face, and the implementation will utilize the PyTorch library.

---

For my dataset, I chose to use a premade, large-scale collection of data from Wikipedia articles. I felt this was best as Wikipedia provides an extensive and diverse repository of information, covering a wide range of topics and domains. This makes it well-suited for applications requiring broad contextual knowledge.

---

The tokenizer I load through `AutoTokenizer` uses a Byte-Pair Encoding (BPE) algorithm. I apply it to the dataset in batches with parallel processing, so the text is converted to token IDs before the model is trained.
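
To make the Byte-Pair Encoding idea concrete, here is a minimal sketch of a single BPE merge step on a toy corpus. It is an illustration only: GPT-2's real tokenizer works on bytes with a pretrained merge table, while this sketch works on characters, and the names `most_frequent_pair`, `merge_pair`, and `toy_corpus` are hypothetical and not part of any library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (an assumption for illustration): word split into characters -> count
toy_corpus = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("newest"): 6,
    tuple("widest"): 3,
}

first_pair = most_frequent_pair(toy_corpus)
toy_after_merge = merge_pair(toy_corpus, first_pair)
print(first_pair)
print(toy_after_merge)
```

Repeating these two steps grows a vocabulary of progressively longer subword units, which is the general mechanism behind a BPE vocabulary.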


---

My overall goal for this project is to get hands-on experience building a GPT model, and potentially to create a large-scale one trained on medical records in order to recommend health resources and practices.


In [1]:
!pip install "transformers>=4.28.0" # Install a recent Hugging Face Transformers release; quoted so the shell does not treat >= as an output redirect
!pip install datasets==2.14.5 # Install the datasets library used to load the training text
!pip install torch==2.0.1 # Install PyTorch, the backend used for training

Collecting datasets==2.14.5
  Downloading datasets-2.14.5-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets==2.14.5)
  Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting xxhash (from datasets==2.14.5)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets==2.14.5)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<2023.9.0,>=2023.1.0 (from fsspec[http]<2023.9.0,>=2023.1.0->datasets==2.14.5)
  Downloading fsspec-2023.6.0-py3-none-any.whl.metadata (6.7 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
Collecting multiprocess (from datasets==2.14.5)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
  Downloading multiprocess-0.70.15-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.14.5

In [2]:
from datasets import load_dataset, list_datasets # Import data set loader and a list of datasets we can choose from

ourDataset = load_dataset('wikitext', 'wikitext-2-raw-v1') # Load the WikiText-2 (raw) dataset, which is built from Wikipedia articles
#list_datasets() # Shows all the datasets we can choose from

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/733k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/657k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

In [3]:
# Automatically choose the right tokenizer based on our model's name
from transformers import AutoTokenizer

# Load the tokenizer that matches the pretrained gpt2-medium checkpoint (I chose GPT-2 because of computational limitations)
ourTokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

# Define padding token to make shorter sequences the same length as the longest sequence in the batch of data
ourTokenizer.pad_token = ourTokenizer.eos_token # Use eos_token as pad_token

# Take a batch of examples as an input
def tokenizeFunction(examples):
    # Add padding and truncation for consistent sequence lengths
    return ourTokenizer(examples["text"], padding='max_length', truncation=True, max_length=128) # Break down the text of each example into smaller units (tokens) and return the output

# Apply the tokenize function to the dataset, process the data in batches, use parallel processing to make tokenization faster, and remove the original "text" column from the dataset
ourTokenizedDatasets = ourDataset.map(tokenizeFunction, batched=True, num_proc=4, remove_columns=["text"])

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

  table = cls._concat_blocks(blocks, axis=0)


Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]
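
To show what `padding='max_length'`, `truncation=True`, and `max_length=128` do to each example, here is a small pure-Python sketch. It is not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper, and it assumes right-side padding and a pad ID of 50256 (GPT-2's end-of-text token, which this notebook reuses as the pad token).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # how many pad slots remain
    return {
        "input_ids": kept + [pad_id] * n_pad,        # padded token IDs
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_example = pad_or_truncate([10, 11, 12], max_length=5, pad_id=50256)
long_example = pad_or_truncate(list(range(8)), max_length=5, pad_id=50256)
print(short_example)
print(long_example)
```

Every example ends up the same length, which is what lets the trainer stack them into a batch. The attention mask records which positions are real text.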

In [4]:
print(ourTokenizer.__class__.__name__)  # Our Tokenizer uses a Byte-Pair Encoding algorithm
print(ourTokenizer.vocab_size)

GPT2TokenizerFast
50257


In [7]:
!pip uninstall torchvision -y
!pip install torchvision==0.15.2 --no-cache-dir

Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121
Collecting torchvision==0.15.2
  Downloading torchvision-0.15.2-cp310-cp310-manylinux1_x86_64.whl.metadata (11 kB)
Downloading torchvision-0.15.2-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 53.5 MB/s eta 0:00:00
Installing collected packages: torchvision
Successfully installed torchvision-0.15.2


In [None]:
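from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("gpt2-medium") # Load the pretrained GPT-2 medium model

# Add labels to the dataset. For causal language modeling the labels are a copy
# of the input IDs; the model shifts them internally to predict the next token.
# Note: padded positions are copied too, so they also count toward the loss.
def add_labels(example):
    example['labels'] = example['input_ids'].copy()
    return example

ourTokenizedDatasets = ourTokenizedDatasets.map(add_labels)


training_args = TrainingArguments(
    output_dir="./results",           # Where checkpoints are saved
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",      # Evaluate at the end of every epoch
    save_strategy="epoch",            # Must match evaluation_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    prediction_loss_only=True,        # Only report the loss during evaluation
)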

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ourTokenizedDatasets["train"],
    eval_dataset=ourTokenizedDatasets["validation"],
)

trainer.train()



model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Map:   0%|          | 0/4358 [00:00<?, ? examples/s]

Map:   0%|          | 0/36718 [00:00<?, ? examples/s]

Map:   0%|          | 0/3760 [00:00<?, ? examples/s]

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss
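
The training cell sets `labels` to a copy of `input_ids`, which works because a causal language model is trained to predict each token from the ones before it. The sketch below shows the input and target pairs this implies for one short sequence. It is an illustration only: `next_token_pairs` is a hypothetical helper, and inside the real model the shift is applied to logits and labels as tensors.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token that follows it (the prediction target)."""
    return list(zip(input_ids[:-1], input_ids[1:]))

# Arbitrary token IDs chosen for illustration
example_ids = [5, 6, 7, 8]
pairs = next_token_pairs(example_ids)
for current_token, target_token in pairs:
    print(f"given ...{current_token} -> predict {target_token}")
```

A sequence of N tokens therefore yields N - 1 training targets, and the Training Loss and Validation Loss columns report the average cross-entropy over those targets.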


In [None]:
# Evaluate on the held-out test split of the tokenized dataset
results = trainer.evaluate(eval_dataset=ourTokenizedDatasets["test"])
print(results)
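
The dictionary returned by `trainer.evaluate` includes the average cross-entropy under the key `eval_loss`. A common way to read that number for a language model is perplexity, the exponential of the loss. The sketch below assumes a made-up loss value, since this notebook has not recorded one, and `perplexity_from_loss` is a hypothetical helper. Because padded positions are part of the labels here, the resulting perplexity would not be directly comparable to published GPT-2 numbers.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)

# Hypothetical loss value for illustration only, not a result from this notebook
example_loss = math.log(20.0)
print(perplexity_from_loss(example_loss))
```

With a real run this would be called as `perplexity_from_loss(results["eval_loss"])`.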

In [None]:
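import torch

# Choose a prompt
prompt = "The meaning of life is"

# Tokenize the prompt and move the tensors to the model's device
inputs = ourTokenizer(prompt, return_tensors="pt").to(trainer.model.device)

# Generate text without tracking gradients
with torch.no_grad():
    generated_ids = trainer.model.generate(
        **inputs,                                # input_ids and attention_mask
        max_length=50,                           # Total length, prompt included
        num_return_sequences=1,
        pad_token_id=ourTokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )

# Decode the generated tokens
generated_text = ourTokenizer.decode(generated_ids[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)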