# Question 4

## Installing from source

In [2]:
# installing hugging face from source
# !git clone https://github.com/huggingface/transformers.git
# sudo is for AWS (without virtualenv)
# !sudo pip3 install -e transformers/.

In [2]:
import math

import torch
from tqdm import tqdm
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [3]:
# testing
!python3 -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))"

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]


## Fine-Tuning GPT-2 model

* Trained on an AWS EC2 `g4dn.2xlarge` instance
* [Setup Guide](https://towardsdatascience.com/setting-up-pytorch-with-gpu-support-on-ec2-without-preconfigured-amis-3b101b05a765)

In [4]:
torch.cuda.empty_cache()

In [5]:
# !python3 transformers/examples/pytorch/language-modeling/run_clm.py \
#     --model_name_or_path gpt2 \
#     --train_file data/wiki2.train.txt \
#     --validation_file data/wiki2.valid.txt \
#     --per_device_train_batch_size 4 \
#     --per_device_eval_batch_size 4 \
#     --do_train \
#     --do_eval \
#     --output_dir models/gpt2/

## Inference using Fine-Tuned model

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained("models/gpt2/")
model = GPT2LMHeadModel.from_pretrained("models/gpt2/")

In [7]:
print(tokenizer)
print(model)

GPT2Tokenizer(name_or_path='models/gpt2/', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): 

In [8]:
test_text = " \n = Valkyria Chronicles III = \n \n Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 ,"

In [9]:
inputs = tokenizer(test_text, return_tensors="pt", max_length=1024, truncation=True)
inputs

{'input_ids': tensor([[  220,   198,   796,   569, 18354,  7496, 17740,  6711,   796,   220,
           198,   220,   198,  2311,    73, 13090,   645,   569, 18354,  7496,
           513,  1058,  1279,  2954,    29, 17740,   357,  4960,  1058, 10545,
           230,    99,   161,   254,   112,  5641, 44444,  9202, 25084, 24440,
         12675, 11839,    18,   837]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [10]:
with open("data/wiki2.test.txt", "r", encoding="utf-8") as file:
    wiki_test = file.read()

In [11]:
model.eval()
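# Use the GPU when one is available; fall back to the CPU otherwise
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")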
model.to(device)
print(model.device)

cuda:0


In [12]:
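# Window size; the loop below applies it to characters of the raw text
max_length = model.config.n_positions - 1
stride = 128  # step between consecutive window starts
n = 0  # number of windows evaluated
total_loss = 0.0  # running sum of per-window mean losses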

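Cross-entropy loss ($H$) is defined by:

$H = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})$

where $w_{<i}$ denotes the tokens that precede $w_i$. GPT-2 conditions each prediction on the whole preceding context, not only on $w_{i-1}$.

Perplexity ($\mathrm{PPL}$) can therefore also be defined by:

$\mathrm{PPL} = e^H$

$\mathrm{PPL} = e^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})}$

Since the GPT-2 model returns the mean cross-entropy loss when `labels` are passed, we can use the loss from the model output directly.

The test set is too long to pass through the model in one go, so we evaluate it in overlapping windows, moving the window start forward by `stride` at each step, and average the per-window losses.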
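To make the relation between cross-entropy and perplexity concrete, here is a minimal sketch in plain Python. The helper name `perplexity_from_log_probs` and the token probabilities are hypothetical, chosen only so the arithmetic is easy to check by hand; they are not taken from the model.

```python
import math


def perplexity_from_log_probs(log_probs):
    """Return exp of the mean negative log-probability (natural log)."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    cross_entropy = -sum(log_probs) / len(log_probs)
    return math.exp(cross_entropy)


# Hypothetical per-token probabilities, for illustration only.
token_probs = [0.25, 0.5, 0.125, 0.25]
log_probs = [math.log(p) for p in token_probs]

ppl = perplexity_from_log_probs(log_probs)
print(f"Perplexity: {ppl:.4f}")
```

With these values the mean negative log-probability is $2 \ln 2$, so the perplexity comes out at 4: the model is, on average, as uncertain as if it were choosing uniformly among four tokens.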

In [13]:
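# Compute the mean cross-entropy loss of each window.
# Note: `wiki_test` is a string, so `i`, `stride` and `max_length`
# count characters here, not tokens.
for i in tqdm(range(0, len(wiki_test), stride)):
    encoded_chunk = tokenizer.encode(wiki_test[i : i + max_length], return_tensors="pt")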

    encoded_chunk = encoded_chunk.to(model.device)

    with torch.no_grad():
        outputs = model(encoded_chunk, labels=encoded_chunk)
        total_loss += outputs.loss.item()
        n += 1

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9805/9805 [03:32<00:00, 46.11it/s]
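The loop above slices the raw text by characters and averages the per-window mean losses, so text in the overlapping regions contributes to several windows. One common alternative, shown here only as a sketch and not as the method used in this notebook, is to tokenize the whole text once, slide the window over token positions, and score only the tokens that no earlier window has scored. The helper `token_windows` and its arguments are hypothetical names; in a real evaluation the tokens that are not newly scored would typically have their labels set to `-100` so that the loss ignores them.

```python
def token_windows(n_tokens, max_length, stride):
    """Yield (begin, end, n_new) for overlapping windows over a token sequence.

    n_new is the number of tokens at the end of the window that no earlier
    window has covered; only those would contribute to the loss.
    """
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in the range 1..max_length")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break  # the last window already reaches the end of the sequence


# Small illustrative numbers; the notebook uses max_length=1023 and stride=128.
windows = list(token_windows(n_tokens=10, max_length=4, stride=2))
print(windows)
```

Because the `n_new` values add up to the sequence length, weighting each window's loss by its `n_new` gives every token exactly one vote in the final average.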


In [14]:
average_loss = total_loss / n

In [15]:
perplexity = torch.exp(torch.tensor(average_loss)).item()
print(f"Perplexity: {perplexity}")

Perplexity: 17.1722412109375


* The fine-tuned GPT-2 model reaches a perplexity of 17.17, significantly lower than the previous perplexity scores.
* This low perplexity suggests that the model has effectively learned the structure and nuances of the text it was trained on.
* Some reasons why the performance is good:
    - it is based on the Transformer architecture, which allows it to capture long-range dependencies in text
    - it was pretrained on a massive corpus of text data
    - it uses a self-attention mechanism to weigh the importance of each word in the context when predicting the next word
    - this model alone has 124,439,808 trainable parameters
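The parameter count can be cross-checked by hand from the layer sizes in the printed model. The sketch below is plain arithmetic and needs no libraries. Two things are assumptions rather than facts read from the output above, because the printout is truncated at the MLP: the feed-forward width `d_ff = 4 * d_model`, and the layer names `mlp.c_fc`, `mlp.c_proj` and `ln_f`, which follow the standard GPT-2 small layout. It also assumes the output head shares its weights with the token embedding, so it is not counted a second time.

```python
# Sizes read from the printed model; d_ff is an assumption (see above).
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12
d_ff = 4 * d_model


def linear(n_in, n_out):
    """Weights plus biases of one dense (Conv1D) layer."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale and shift vectors
block = (
    layer_norm                      # ln_1
    + linear(d_model, 3 * d_model)  # attn.c_attn (query, key, value)
    + linear(d_model, d_model)      # attn.c_proj
    + layer_norm                    # ln_2
    + linear(d_model, d_ff)         # mlp.c_fc (assumed name)
    + linear(d_ff, d_model)         # mlp.c_proj (assumed name)
)
embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
total = embeddings + n_layers * block + layer_norm  # plus final ln_f

print(f"{total:,}")
```

Under these assumptions the total matches the 124,439,808 figure quoted above, with roughly 85 million parameters in the 12 transformer blocks and about 39 million in the embeddings.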