# NLP Externship | Generative LLM - Model Fine-tuning

## Introduction
- Subset of the cleaned data used for training (20K observations sampled from 108,898)
- Data split 80/20 into training and test sets (16,000 / 4,000)
- Base model: 'google/flan-t5-base'
- Training and test sets tokenized
- Model optimization and hyperparameter tuning
- Training evaluation: ROUGE
- Final model evaluation: perplexity
- Final model saved to Hugging Face: https://huggingface.co/lmalarky/flan-t5-base-finetuned-python_qa

This notebook was run in Google Colab on an A100 GPU.
You will also need to set up a User Access Token to authenticate with the Hugging Face Hub.

## Initialization

In [2]:
%pip install --upgrade "transformers[torch]" tokenizers datasets evaluate rouge_score sentencepiece huggingface_hub accelerate

Collecting transformers[torch]
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting tokenizers
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecti

In [3]:
from huggingface_hub import notebook_login
notebook_login()


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import nltk
from datasets import load_dataset
import evaluate
from transformers import (
    T5Tokenizer,
    DataCollatorForSeq2Seq,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    GenerationConfig,
)
from sklearn.metrics.pairwise import cosine_similarity
import spacy

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
df = pd.read_csv('/content/drive/MyDrive/TripleTen/Externship_DataSpeak/Datasets/python_q_a_clean_score3_AandQwc150.csv',
                 index_col=[0])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 108898 entries, 0 to 108897
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id_q             108898 non-null  float64
 1   score_q          108898 non-null  float64
 2   title            108898 non-null  object 
 3   question         108898 non-null  object 
 4   id_a             108898 non-null  float64
 5   score_a          108898 non-null  float64
 6   answer_text      108898 non-null  object 
 7   context_w_title  108898 non-null  object 
 8   context_w_quest  108898 non-null  object 
 9   wc_a             108898 non-null  int64  
 10  wc_q             108898 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 10.0+ MB


In [8]:
df['title_question'] = df.title + ' ' + df.question

In [9]:
df.title_question.nunique()

105455

In [10]:
df_subset = df.sample(n=20000, random_state=0)

In [11]:
df_subset.head()

| | id_q | score_q | title | question | id_a | score_a | answer_text | context_w_title | context_w_quest | wc_a | wc_q | title_question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48686 | 13260061.0 | 1.0 | Popen.communicate escapes a string I send to s... | I am trying to spawn a process using Popen and... | 13260450.0 | 3.0 | I can't reproduce it on Ubuntu: from subproces... | Popen.communicate escapes a string I send to s... | I am trying to spawn a process using Popen and... | 61 | 64 | Popen.communicate escapes a string I send to s... |
| 104281 | 37063640.0 | 0.0 | Call to local method from list comprehension f... | I am trying to use list comprehension that cal... | 37063676.0 | 3.0 | Should this "method": def is_number(s) be def ... | Call to local method from list comprehension f... | I am trying to use list comprehension that cal... | 59 | 119 | Call to local method from list comprehension f... |
| 17336 | 4342168.0 | 2.0 | Can hasattr go multiple children deep in Python? | If I have node.child1.child2 , can I use h asa... | 4342201.0 | 6.0 | hasattr doesn't take a dotted name like that a... | Can hasattr go multiple children deep in Pytho... | If I have node.child1.child2 , can I use h asa... | 47 | 23 | Can hasattr go multiple children deep in Pytho... |
| 62220 | 17766607.0 | -3.0 | Python 3 execution order quirk with print? | Why does this work? I would think sup is passe... | 17766622.0 | 4.0 | You're using Python 2, and it's being interpre... | Python 3 execution order quirk with print? You... | Why does this work? I would think sup is passe... | 23 | 65 | Python 3 execution order quirk with print? Why... |
| 76468 | 23040236.0 | 0.0 | sqlite3 remove brackets from printed data | I have created a script that finds the last va... | 23040324.0 | 3.0 | The result of the query you execute is being r... | sqlite3 remove brackets from printed data The ... | I have created a script that finds the last va... | 72 | 143 | sqlite3 remove brackets from printed data I ha... |


In [12]:
df_subset.title_question.iloc[0]

'Popen.communicate escapes a string I send to stdin I am trying to spawn a process using Popen and send it a particular string to its stdin . I have: pipe = subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE)\npipe.communicate( my_stdin_str.encode(encoding=\'ascii\') )\npipe.stdin.close() However, the second line actually escapes the whitespace in my_stdin_str . For example, if I have: my_stdin_str="This is a string" The process will see: This\\ is\\ a\\ string How can I prevent this behaviour?'

In [13]:
df_subset.answer_text.iloc[0]

'I can\'t reproduce it on Ubuntu: from subprocess import Popen, PIPE\n\nshell_cmd = "perl -pE\'s/.\\K/-/g\'"\np = Popen(shell_cmd, shell=True, stdin=PIPE)\np.communicate("This $PATH is a string".encode(\'ascii\')) In this case shell=True is unnecessary: from subprocess import Popen, PIPE\n\ncmd = ["perl", "-pE" , "s/.\\K/-/g"]\np = Popen(cmd, stdin=PIPE)\np.communicate("This $PATH is a string".encode(\'ascii\')) Both produce the same output: T-h-i-s- -$-P-A-T-H- -i-s- -a- -s-t-r-i-n-g-'

In [14]:
df_final = df_subset[['title_question', 'answer_text']]
#df_final = df_subset[['title', 'answer_text']]
df_final.columns = ['question', 'answer']
df_final.head()

| | question | answer |
|---|---|---|
| 48686 | Popen.communicate escapes a string I send to s... | I can't reproduce it on Ubuntu: from subproces... |
| 104281 | Call to local method from list comprehension f... | Should this "method": def is_number(s) be def ... |
| 17336 | Can hasattr go multiple children deep in Pytho... | hasattr doesn't take a dotted name like that a... |
| 62220 | Python 3 execution order quirk with print? Why... | You're using Python 2, and it's being interpre... |
| 76468 | sqlite3 remove brackets from printed data I ha... | The result of the query you execute is being r... |


## Train/test split

In [64]:
train, test = train_test_split(df_final, test_size=0.2, random_state=12345)

In [65]:
print(train.shape)
print(test.shape)

(16000, 2)
(4000, 2)
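
Because `title_question` has 105,455 unique values across 108,898 rows, some question texts occur more than once, and a row-level split can place the same question in both the training and the test set. The sketch below shows one way to keep every copy of a question on the same side, using only the standard library. `group_split` is a hypothetical helper written for illustration; it is not part of scikit-learn or of this notebook, and the toy rows are invented.

```python
import random

def group_split(rows, key, test_size=0.2, seed=12345):
    """Split rows so that all rows sharing key(row) land on the same side."""
    groups = sorted({key(r) for r in rows})   # unique group keys, in a stable order
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = int(round(len(groups) * test_size))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if key(r) not in test_groups]
    test = [r for r in rows if key(r) in test_groups]
    return train, test

# Toy data (invented): the first question has two answers.
rows = [
    {"question": "q1", "answer": "a1"},
    {"question": "q1", "answer": "a2"},
    {"question": "q2", "answer": "a3"},
    {"question": "q3", "answer": "a4"},
    {"question": "q4", "answer": "a5"},
    {"question": "q5", "answer": "a6"},
]
train_rows, test_rows = group_split(rows, key=lambda r: r["question"])
overlap = {r["question"] for r in train_rows} & {r["question"] for r in test_rows}
print(len(train_rows), len(test_rows), overlap)
```

With pandas data, scikit-learn's `GroupShuffleSplit` offers similar behaviour; whether duplicates matter for the 20K sample used here would need to be checked on the actual data.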


In [67]:
from datasets import DatasetDict, Dataset

dataset = DatasetDict({
    'train': Dataset.from_pandas(train),
    'test': Dataset.from_pandas(test)
})

In [68]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', '__index_level_0__'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['question', 'answer', '__index_level_0__'],
        num_rows: 4000
    })
})

## Modeling

In [25]:
# Load the tokenizer, model, and data collator
model_checkpoint = 'google/flan-t5-base'
tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [27]:
model.generation_config

GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}

### Testing pre-trained model

In [28]:
input_ids = tokenizer(
    'How do I check whether a file exists using Python?',
    return_tensors='pt'
).input_ids

generated_ids = model.generate(input_ids=input_ids, max_new_tokens=35)
generated_ids

tensor([[    0,     3,  3626,     3,     9,     3,   102,    63,   189,   106,
          4943,     6,    25,    54,   691,   823,     3,     9,  1042,  8085,
            57,     3, 20424,     8,  1042,    31,     7,     3,  8826,    52,
             5,     1]])

In [29]:
preds = [
    tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for gen_id in generated_ids
]

In [30]:
preds

["Using a python script, you can check whether a file exists by examining the file's identifier."]

### Preprocessing / Tokenization

In [69]:
# Prefix the tasks with "answer the question"
prefix = "answer the question: "

# Define the preprocessing function
def preprocess_function(examples):
    """Add the prefix to each question, tokenize the text, and set the labels."""
    # The inputs are the prefixed questions (title + question body):
    inputs = [prefix + doc for doc in examples["question"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # The labels are the tokenized answers:
    labels = tokenizer(text_target=examples["answer"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Map the preprocessing function across dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]
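
The tokenizer calls above cut inputs at 128 tokens and targets at 512, and the sample rows shown earlier include question word counts up to 143 before the title and prefix are added. A rough way to see how often truncation applies is to count the texts that exceed the limit. The sketch below assumes a whitespace splitter as a stand-in for the real tokenizer; `truncation_rate` and `whitespace_len` are hypothetical names, and the example texts are invented.

```python
def truncation_rate(texts, count_tokens, max_length):
    """Return the fraction of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    too_long = sum(1 for t in texts if count_tokens(t) > max_length)
    return too_long / len(texts)

# Stand-in tokenizer (assumption): whitespace splitting. In the notebook one
# could pass something like: lambda s: len(tokenizer(s).input_ids)
def whitespace_len(text):
    return len(text.split())

prefix = "answer the question: "
questions = ["short question", "word " * 200, "another short one"]
rate = truncation_rate([prefix + q for q in questions], whitespace_len, max_length=128)
print(f"{rate:.2%} of inputs would be truncated")
```

SentencePiece usually produces more tokens than there are words, so a whitespace count will tend to understate the real truncation rate.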

In [70]:
# Evaluate the training progress
# Set up Rouge score for evaluation
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # decode preds and labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return result
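
To make the scores returned by `compute_metrics` easier to read, here is a simplified sketch of the two ideas it relies on: ROUGE-1 as clipped unigram overlap, and replacing the `-100` loss-masking value with the pad id before decoding. It uses whitespace tokens and no stemming, so it will not reproduce the `evaluate` numbers exactly; `rouge1_f1` and `unmask_labels` are hypothetical helpers, and the example strings and ids are invented.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F-measure: clipped unigram overlap between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def unmask_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace the loss-masking value with the pad id so the ids can be decoded."""
    return [[pad_token_id if t == ignore_index else t for t in row] for row in label_ids]

score = rouge1_f1("use os.path.exists to check the file",
                  "you can check the file with os.path.exists")
print(round(score, 3))
print(unmask_labels([[37, 1, -100, -100]], pad_token_id=0))
```

Read this way, a ROUGE-1 score near 0.19 indicates that only a small share of unigrams is common to the generated and reference answers.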

In [71]:
model_name = model_checkpoint.split("/")[-1]
model_name

'flan-t5-base'

In [72]:
import torch
torch.cuda.empty_cache()

In [73]:
# Set up training arguments
# Note: `generation_config` is not defined in the cells shown in this notebook;
# it must exist before this cell runs (for example, model.generation_config).

training_args = Seq2SeqTrainingArguments(
    output_dir=f"lmalarky/{model_name}-finetuned-python_qa",
    evaluation_strategy="epoch",
    learning_rate=1e-4,  # previously 3e-4
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    weight_decay=0.1,  # previously 0.01
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    push_to_hub=True,
    generation_config=generation_config,
)

# Set up trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | 2.0314 | 1.908315 | 0.187595 | 0.0546 | 0.148511 | 0.163996 |
| 2 | 1.9586 | 1.903143 | 0.189579 | 0.053079 | 0.148462 | 0.164334 |
| 3 | 1.923 | 1.902288 | 0.191946 | 0.053485 | 0.149189 | 0.165526 |


TrainOutput(global_step=6000, training_loss=1.9812707926432291, metrics={'train_runtime': 4261.2731, 'train_samples_per_second': 11.264, 'train_steps_per_second': 1.408, 'total_flos': 8217088229376000.0, 'train_loss': 1.9812707926432291, 'epoch': 3.0})
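
The figures in `TrainOutput` can be cross-checked against the settings above with a little arithmetic. The sketch assumes a single GPU and no gradient accumulation (neither is set in the training arguments shown), and it treats the validation loss as a mean per-token cross-entropy, so its exponential can be read as a perplexity over the answer tokens.

```python
import math

train_rows = 16000
batch_size = 8            # per_device_train_batch_size, single device assumed
epochs = 3
runtime_s = 4261.2731     # train_runtime reported in TrainOutput
final_val_loss = 1.902288 # validation loss after epoch 3

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_s = train_rows * epochs / runtime_s
steps_per_s = total_steps / runtime_s
val_perplexity = math.exp(final_val_loss)

print(total_steps)
print(round(samples_per_s, 3), round(steps_per_s, 3))
print(round(val_perplexity, 2))
```

The step count and throughput match the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`. The validation loss barely moves between epochs, which suggests the extra epochs add little for this configuration.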

In [74]:
trainer.save_model("flan-t5-base-finetuned-python_qa")

## Testing

In [29]:
index = 4
title = df_subset.title.iloc[index]
question = df_subset.question.iloc[index]
answer = df_subset.answer_text.iloc[index]

context = title + " " + question

In [30]:
print(f'Title: {title}')
print()
print(f'Question: {question}')
print()
print(f'Answer: {answer}')

Title: sqlite3 remove brackets from printed data

Question: I have created a script that finds the last value in the first row of my database import sqlite3
global SerialNum
conn = sqlite3.connect("MyFirstDB.db")
conn.text_factory = str
c = conn.cursor()
SerialNum = c.execute('select Serial from BI4000 where Serial in (Select max(Serial) from BI4000)')
print SerialNum
conn.commtt()
conn.close() the program prints the result [('00003',)] which is the last result in the current database, all the data that will be entered into the final database will be serial numbers and so it will be in order. My question is can I remove all the quotations/brackets/comma as I wish to asign this value to a variable. The program that I wish to make is a testing system that adds new entries to the database, I wish to check what the last entry is in the database so the system can continue the entries from that point.

Answer: The result of the query you execute is being represented as a Python list of Pytho

In [31]:
from transformers import pipeline

generator = pipeline("text2text-generation", model="lmalarky/flan-t5-base-finetuned-python_qa_v2")
generator(
    f"answer the question: {question}"
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[{'generated_text': "You can use the .execute() method to execute a string: serial_number = c.execute('se"}]
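
During training, the inputs were built as the prefix followed by `title + ' ' + question`, whereas the call above passes only the question body. A small helper can keep the inference prompt in the training format. `build_prompt` is a hypothetical function written for illustration, and the example strings are taken from the sample shown above.

```python
PREFIX = "answer the question: "  # same prefix as in preprocess_function

def build_prompt(title, question, prefix=PREFIX):
    """Build an inference prompt in the same format as the training inputs."""
    context = f"{title} {question}".strip()
    return prefix + context

prompt = build_prompt(
    "sqlite3 remove brackets from printed data",
    "I have created a script that finds the last value in the first row of my database",
)
print(prompt)
```

In the notebook this could be used as `generator(build_prompt(title, question), max_new_tokens=128)`. The value 128 is an arbitrary choice here; the generated text above stops mid-string, which is consistent with a short default generation length.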

## Evaluation

### Perplexity

In [32]:
def calculate_perplexity(sentence):
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

In [33]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("lmalarky/flan-t5-base-finetuned-python_qa")
model = AutoModelForSeq2SeqLM.from_pretrained("lmalarky/flan-t5-base-finetuned-python_qa")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [34]:
print(f'Perplexity of the sentence: {calculate_perplexity(question)}')

Perplexity of the sentence: 1.4784457683563232
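
Perplexity is the exponential of the mean negative log-likelihood per target token. As written, `calculate_perplexity` passes the question as both the encoder input and the labels, so the value above reflects how well the model reproduces the question text rather than how well it predicts the answer. The sketch below shows the underlying computation on per-token log-probabilities, with padding masked out. `perplexity` is a hypothetical helper and the numbers are a toy example, not model output.

```python
import math

def perplexity(token_log_probs, mask=None):
    """exp(mean negative log-likelihood) over the unmasked target tokens."""
    if mask is None:
        mask = [True] * len(token_log_probs)
    kept = [lp for lp, keep in zip(token_log_probs, mask) if keep]
    if not kept:
        raise ValueError("no target tokens to score")
    mean_nll = -sum(kept) / len(kept)
    return math.exp(mean_nll)

# Toy example: probability 0.5 for each of four target tokens;
# the last position is padding and is masked out.
log_probs = [math.log(0.5)] * 4 + [math.log(0.01)]
mask = [True, True, True, True, False]
ppl = perplexity(log_probs, mask)
print(ppl)
```

A conditional variant in the notebook would tokenize the answer as the labels, for example `labels = tokenizer(text_target=answer, return_tensors='pt').input_ids`, and pass them to `model(**inputs, labels=labels)`. Averaged over the test set, that gives a figure comparable to the exponential of the validation loss.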
