# Fine tuning GPT-2 to generate tarot card meanings and interpretations

After scraping texts from a number of sources, it's time to use them to fine-tune GPT-2 to generate new tarot card meanings, interpretations, and questions to ponder.

[This article](https://medium.com/swlh/fine-tuning-gpt-2-for-magic-the-gathering-flavour-text-generation-3bafd0f9bb93) by Richard Bownes, PhD has an amazing tutorial for fine tuning the [Hugging Face](https://huggingface.co/) GPT-2 model, and I've followed it here in this notebook to train and generate texts for this project.


### Sources & References
* [Fine tuning GPT-2 for Magic the Gathering Flavour text generation](https://medium.com/swlh/fine-tuning-gpt-2-for-magic-the-gathering-flavour-text-generation-3bafd0f9bb93)
* [Fine tuning GPT-2... notebook](https://colab.research.google.com/drive/16UTbQOhspQOF3XlxDFyI28S-0nAkTzk_?authuser=1#scrollTo=U_XJVIetKN-h)
* [HuggingFace](https://huggingface.co/)

In [2]:
%%capture

# Workaround for the error: "NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968"
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [3]:
# Uninstall the current CUDA version
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'nvidia-kernel-common-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-325-updates' for glob 'nvidia*'
Note, selecting 'nvidia-346-updates' for glob 'nvidia*'
Note, selecting 'nvidia-driver-binary' for glob 'nvidia*'
Note, selecting 'nvidia-331-dev' for glob 'nvidia*'
Note, selecting 'nvidia-compute-utils-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-384-dev' for glob 'nvidia*'
Note, selecting 'nvidia-headless-525-server' for glob 'nvidia*'
Note, selecting 'nvidia-fs-prebuilt' for glob 'nvidia*'
Note, selecting 'nvidia-driver-440-server' for glob 'nvidia*'
Note, selecting 'nvidia-dkms-450-server' for glob 'nvidia*'
Note, selecting 'nvidia-headless-no-dkms-515-open' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-common' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-source-440-server' for glob 'nvidia*'
Note, selecting 'nvidia-gds' for glob 'nvidia*'
Note,

In [4]:
# Install CUDA 11.7 and cuDNN 8.5
!sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!sudo dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!sudo cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
!sudo cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-46B62B5F-keyring.gpg /usr/share/keyrings/
!wget https://developer.nvidia.com/compute/cudnn/secure/8.5.0/local_installers/11.7/cudnn-local-repo-ubuntu2204-8.5.0.96_1.0-1_amd64.deb
!sudo dpkg -i cudnn-local-repo-ubuntu2204-8.5.0.96_1.0-1_amd64.deb
!sudo apt-get update
!sudo apt-get -y install cuda-11-7

Executing: /tmp/apt-key-gpghome.7QJjUSh3WW/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
gpg: requesting key from 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub'
gpg: key A4B469963BF863CC: "cudatools <cudatools@nvidia.com>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
--2023-04-11 17:39:03--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2204.pin’


2023-04-11 17:39:03 (3.72 MB/s) - ‘cuda-ubuntu2204.pin’ saved [190/190]

--2023-04-11 17:39:03--  https://developer.download.nvidia.com/compute/cuda/11.7.0

In [None]:
# Install torch 1.13.1 with cuda and cudnn
!pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

In [None]:
# Install transformers
!pip install transformers==4.9.2 datasets

In [None]:
import pandas as pd

In [None]:
# Transformers check
!python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

# pyTorch check
import torch
print(torch.cuda.is_available())

# NVCC - GPU Check
!nvcc --version

# NVIDIA-SMI - GPU Check
!nvidia-smi

2023-04-11 16:46:39.733535: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
True
/bin/bash: nvcc: command not found
Tue Apr 11 16:46:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N

## Dataset import

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
fortunes_file = '/content/drive/MyDrive/gpt2/fortunes.txt'

with open(fortunes_file) as f:
    fortunes = f.read().split('\n')

In [None]:
fortunes[:5]

In [None]:
len(fortunes)

### Reformatting dataset into sentence chunks

The documents in this corpus vary in length, so I'll use spaCy's sentencizer to break everything down into sentence chunks for processing with GPT-2. I think this should help with batch sizing and consistency of output.
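
To give a feel for what "sentence chunks" means here, this is a minimal sketch of punctuation-based sentence splitting. It is not spaCy's implementation, only a rough stand-in that splits after `.`, `!` or `?`; the function name `naive_sentences` and the sample text are made up for illustration.

```python
import re

def naive_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace.

    Hypothetical helper for illustration only; spaCy's sentencizer is the
    tool actually used in this notebook.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [p for p in parts if p]

# Made-up sample text, not taken from the scraped corpus.
sample = "The Fool begins a journey. What are you leaving behind? Trust the first step!"
for sentence in naive_sentences(sample):
    print(sentence)
```

A splitter this simple will trip over abbreviations and ellipses, which is one reason to lean on a library component instead.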


In [None]:
from spacy.lang.en import English

nlp = English()
# spaCy v3 syntax; on spaCy v2 this would be nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe('sentencizer')

In [None]:
chunks = len(fortunes)//10
print(f'Chunk size: {chunks}')

f1 = fortunes[:chunks]
f2 = fortunes[chunks:chunks*2]
f3 = fortunes[chunks*2:chunks*3]
f4 = fortunes[chunks*3:chunks*4]
f5 = fortunes[chunks*4:chunks*5]
f6 = fortunes[chunks*5:chunks*6]
f7 = fortunes[chunks*6:chunks*7]
f8 = fortunes[chunks*7:chunks*8]
f9 = fortunes[chunks*8:chunks*9]
f10 = fortunes[chunks*9:]
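
The ten slices above can also be produced with a small helper. This is a sketch that should give the same split (nine equal chunks, with the remainder going into the last one); `split_into_chunks` is a hypothetical name and the demo list is a stand-in for `fortunes`.

```python
def split_into_chunks(items, n_chunks=10):
    """Split items into n_chunks lists; the last one absorbs any remainder."""
    size = len(items) // n_chunks
    # The first n_chunks - 1 pieces all have exactly `size` items.
    chunks = [items[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    # The final piece takes everything that is left.
    chunks.append(items[(n_chunks - 1) * size:])
    return chunks

# Stand-in data: 25 dummy items instead of the real fortunes list.
demo = list(range(25))
parts = split_into_chunks(demo, 10)
print([len(p) for p in parts])
```

With a helper like this, `f_list = split_into_chunks(fortunes)` would replace the `f1` to `f10` variables.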

In [None]:
f_list = [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10]
f_list = [' '.join(f) for f in f_list]

for f in f_list:
  print(f'Doc length: {len(f)}')

In [None]:
sentences = []

for f in f_list:
  doc = nlp(f)
  # Span.text works on both spaCy v2 and v3 (Span.string was removed in v3)
  s = [sent.text.strip() for sent in doc.sents]
  sentences.append(s)

In [None]:
sentences = [s for sublist in sentences for s in sublist]

In [None]:
sentences[0]

In [None]:
len(sentences)

## GPT-2 Setup



### Tokenizer

In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', 
                                          bos_token = '<|startoftext|>',
                                          eos_token = '<|endoftext|>',
                                          pad_token = '<|pad|>'
                                          )

tokenizer.encode("Sample Text")

In [None]:
fortunes = sentences.copy()
del sentences 

In [None]:
max_fortune = max([len(tokenizer.encode(fortune)) for fortune in fortunes])

print(f'The longest text is {max_fortune} tokens long.')

### Setting up PyTorch Dataset & Dataloaders 

In [None]:
import torch
torch.manual_seed(42)
from torch.utils.data import Dataset # this is the pytorch class import

class TarotDataset(Dataset):

  def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=max_fortune):

    self.tokenizer = tokenizer # the gpt2 tokenizer we instantiated
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:
      """
      This loop will iterate through each entry in the tarot text corpus.
      For each bit of text it will prepend it with the start of text token,
      then append the end of text token and pad to the maximum length with the 
      pad token. 
      """

      encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', 
                                 truncation=True, 
                                 max_length=max_length, 
                                 padding="max_length"
                                 )
      
      """
      Each iteration then appends the encoded tensor to one list and the
      attention mask for that encoding to another. The attention mask is
      a binary list of 1s and 0s which determines whether the language model
      should take each token into consideration or not. 
      """

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
    
  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx] 
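
To make the padding and attention mask idea from the comments above concrete, here is a minimal sketch in plain Python. It only imitates the behaviour (truncate, pad on the right, mark real tokens with 1); `pad_and_mask` is a hypothetical helper, and the token ids and the pad id of 0 are made up rather than real GPT-2 vocabulary ids.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate to max_length, right-pad with pad_id, and build the mask."""
    ids = list(token_ids)[:max_length]
    # 1 marks a real token the model should attend to.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    # 0 marks padding the model should ignore.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Made-up ids for a three-token text, padded to length 5.
ids, mask = pad_and_mask([11, 22, 33], max_length=5, pad_id=0)
print(ids)
print(mask)
```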

In [None]:
from torch.utils.data import random_split

dataset = TarotDataset(fortunes, tokenizer, max_length=max_fortune)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

f'There are {train_size} samples for training, and {val_size} samples for validation testing'

Setting up batch size and maximum token length for output text.

In [None]:
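#bs = 20 # most recent round --> this worked pretty well
bs = 18
# Note: the dataset above was already padded using the earlier value of max_fortune,
# and the generate() calls below set their own max_length, so this value is not read again.
max_fortune = 100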

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

train_dataloader = DataLoader(
    train_dataset,
    sampler = RandomSampler(train_dataset), 
    batch_size = bs
    )

validation_dataloader = DataLoader(
    val_dataset,
    sampler = SequentialSampler(val_dataset),
    batch_size = bs
    )

### Setting GPT-2 model parameters

In [None]:
import random
from transformers import GPT2LMHeadModel, GPT2Config
import numpy as np

# Loading the model configuration and setting it to the GPT2 standard settings.
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# Create the model instance and resize its token embeddings to match the tokenizer,
# which now includes the added special tokens
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)
model.resize_token_embeddings(len(tokenizer))

# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")

model.to(device)

# Optional step to enable reproducible runs.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Reset VGPU Memory and RAM Memory of Notebook
# del model
# torch.cuda.empty_cache()
#

In [None]:
# A few variables to define the training parameters of the model:
# - epochs: the number of full passes over the training data
# - warmup_steps: steps at the start of training during which the learning rate
#   ramps up linearly from 0 to its configured value
# - sample_every: every x steps we will sample the model to test the output

epochs = 4
warmup_steps = 100
sample_every = 100

In [None]:
from transformers import AdamW

optimizer = AdamW(model.parameters(), 
                  lr = 2e-4,          # Learning rate (reduced from 5e-4)
                  eps = 1e-8
                  )

In [None]:
from transformers import get_linear_schedule_with_warmup

total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps
                                            )
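
As I understand it, the linear schedule scales the base learning rate by a multiplier that climbs from 0 to 1 over the warmup steps and then falls back to 0 by the last training step. The sketch below reproduces that shape in plain Python; `linear_warmup_multiplier` is a hypothetical name and the total of 1000 steps is an assumed figure, not this notebook's actual step count.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: 0 at step 0, approaching 1 at the end of warmup.
        return step / max(1, warmup_steps)
    # Decay phase: 1 right after warmup, 0 at (and beyond) total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-4  # same base learning rate as the optimizer above
for step in (0, 50, 100, 550, 1000):
    # 100 warmup steps as above; 1000 total steps is an assumption.
    print(step, base_lr * linear_warmup_multiplier(step, 100, 1000))
```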

### Setting up the training loop!

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(total_steps))

import random
import time
import datetime

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

total_t0 = time.time()

training_stats = []

model = model.to(device)

for epoch_i in range(0, epochs):

    print(f'Beginning epoch {epoch_i + 1} of {epochs}...')
    print('--------------------------------------------------------------------\n')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()        

        outputs = model(b_input_ids,
                        labels=b_labels, 
                        attention_mask = b_masks,
                        token_type_ids=None
                        )

        loss = outputs[0]  

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every 100 batches.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print(f'Batch {step} of {len(train_dataloader)}. Loss:{batch_loss}. Time:{elapsed}')

            model.eval()

            sample_outputs = model.generate(
                                    bos_token_id=random.randint(1,30000),
                                    do_sample=True,   
                                    top_k=50, 
                                    max_length = 200,
                                    top_p=0.95, 
                                    num_return_sequences=1
                                )
            
            for i, sample_output in enumerate(sample_outputs):
                  print(f'----> Example output: {tokenizer.decode(sample_output, skip_special_tokens=True)}')
            
            model.train()

        loss.backward()

        optimizer.step()

        scheduler.step()

        progress_bar.update(1)

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)       
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print('\n--------------------------------------------------------------------')
    print(f'Average Training Loss: {avg_train_loss}. Epoch time: {training_time}')
    print('--------------------------------------------------------------------')
    
    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        with torch.no_grad():        

            outputs  = model(b_input_ids,  
                             attention_mask = b_masks,
                             labels=b_labels)
          
            loss = outputs[0]  
            
        batch_loss = loss.item()
        total_eval_loss += batch_loss        

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)    

    print('\n--------------------------------------------------------------------')
    print(f'Validation loss: {avg_val_loss}. Validation Time: {validation_time}')
    print('--------------------------------------------------------------------\n')

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
        )

print('====================================================================\n')
print('Training complete!')
print(f'Total training time: {format_time(time.time()-total_t0)}')

### Model Evaluation

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.precision', 2)
df_stats = pd.DataFrame(data=training_stats)
df_stats = df_stats.set_index('epoch')

# Use plot styling from seaborn.
sns.set(style='whitegrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.figure(figsize=(12,6))

# Plot the learning curve.
plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.xticks(list(range(1, epochs + 1)))

plt.show()

Looks like there's a bit of overfitting happening towards the end of the training loop, so I'll sample some output and decide whether or not to train again.
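
One rough way to put a number on this is to find the epoch with the lowest validation loss; if training loss keeps falling after that point while validation loss rises, that is the usual sign of overfitting. The sketch below assumes records shaped like the `training_stats` entries above. `best_epoch` is a hypothetical helper, and the loss values are invented for illustration, not results from this run.

```python
def best_epoch(stats):
    """Return the epoch number with the lowest validation loss."""
    return min(stats, key=lambda row: row['Valid. Loss'])['epoch']

# Invented numbers for illustration only; not the losses from this notebook.
example_stats = [
    {'epoch': 1, 'Training Loss': 3.20, 'Valid. Loss': 2.90},
    {'epoch': 2, 'Training Loss': 2.70, 'Valid. Loss': 2.75},
    {'epoch': 3, 'Training Loss': 2.40, 'Valid. Loss': 2.78},
    {'epoch': 4, 'Training Loss': 2.10, 'Valid. Loss': 2.85},
]

print(f'Lowest validation loss at epoch {best_epoch(example_stats)}')
```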

In [None]:
model.eval()

prompt = "<|startoftext|>"
#prompt = 'How can'
#prompt = 'The '

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

sample_outputs = model.generate(
                                generated, 
                                do_sample=True,   
                                top_k=50, 
                                max_length = 300,
                                top_p=0.95, 
                                num_return_sequences=10
                                )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))  

I'm pretty happy with this - I'll save the model and generate output for the cards in the next notebook.
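
A note on the sampling settings used above: `top_k` keeps only the k most likely next tokens, and `top_p` then keeps the smallest set of those whose cumulative probability reaches p. The sketch below shows that idea on a toy distribution. It is my own simplified version working on probabilities in a dict, not the library's code; `top_k_top_p_filter` and the four-token vocabulary are hypothetical.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the nucleus reaching top_p."""
    # Step 1: keep the top_k highest-probability tokens and renormalise.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Step 2: keep tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise what is left so it is a valid distribution to sample from.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Toy next-token distribution (made up for illustration).
toy = {'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.8)
print(filtered)
```

Lower values of either setting make the output more conservative, while higher values let rarer tokens through.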

## Saving the model

In [None]:
base_dir = '/content/drive/MyDrive/gpt2/models'
model_folder = '/model_2'

output_dir = base_dir + model_folder


model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)