## GPT-2 Finetuning on Sentiment Classification

### Overview

- Compare the performance of different text generation models on a sentiment detection task.
- To do this, we fine-tune the text generation model GPT-2 on the training data and report its performance on the test data.
- Along the way, we learn how to fine-tune text generation (TG) models and how to apply them to an example NLP task.
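
The overview treats classification as text generation: the model is trained on a review followed by its label, and at test time it is asked to continue the prompt with the label. A minimal sketch of that framing, using the same prompt template and label mapping as the notebook's cells. The helper names (`build_training_text`, `build_prompt`, `extract_sentiment`) are hypothetical and exist only for illustration.

```python
import re

# Label ids follow the mapping used in this notebook (0 -> negative, 4 -> positive).
MAP_LABEL = {0: "negative", 4: "positive"}

def build_training_text(review: str, label: int) -> str:
    """Full sequence seen during fine-tuning: prompt plus the answer."""
    return f"<|startoftext|>Review: {review}\nSentiment: {MAP_LABEL[label]}<|endoftext|>"

def build_prompt(review: str) -> str:
    """Prefix given to the model at test time; it should continue with the label."""
    return f"<|startoftext|>Review: {review}\nSentiment:"

def extract_sentiment(generated_text: str) -> str:
    """Read the label back from decoded text (special tokens already stripped)."""
    matches = re.findall(r"\nSentiment: (.*)", generated_text)
    return matches[-1].strip() if matches else "None"

print(build_training_text("o aplicativo é ruim", 0))
print(extract_sentiment("Review: o aplicativo é ruim\nSentiment: negative"))
```

Keeping the training text and the test prompt identical up to `Sentiment:` matters, because the model only learns to emit the label after exactly that prefix.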

### Model

- GPT-2 (`gpt2` checkpoint), loaded with the Hugging Face `transformers` library

### Dataset

- reviews_final.csv - available at https://www.dropbox.com/sh/oh6hbg5p9wyldfk/AAD2jw5EcP5oKaYbQ5FB55OGa?dl=0

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Download and import packages

In [None]:

# uninstall
!pip uninstall -y wandb

# download
!pip install transformers

!pip install --upgrade accelerate

# import
import re
import json
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2

### Dataset load and prep functions

In [None]:
# Dataset class
class SentimentDataset(Dataset):
    def __init__(self, txt_list, label_list, tokenizer, max_length):
        # define variables    
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        map_label = {0:'negative', 4: 'positive'}
        # iterate through the dataset
        for txt, label in zip(txt_list, label_list):
            # prepare the text
            prep_txt = f'<|startoftext|>Review: {txt}\nSentiment: {map_label[label]}<|endoftext|>'
            # tokenize
            encodings_dict = tokenizer(prep_txt, truncation=True,
                                       max_length=max_length, padding="max_length")
            # append to list
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.labels.append(map_label[label])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx], self.labels[idx]

# Data load function
#def load_sentiment_dataset(tokenizer, random_seed = 1, file_path="/content/drive/My Drive/Colab/training.1600000.processed.noemoticon_edit.csv"):
def load_sentiment_dataset(tokenizer, random_seed = 1, file_path="/content/drive/My Drive/Colab/reviews_final.csv"):
    # load the dataset and sample 7,400 reviews
    #df = pd.read_csv(file_path, encoding='ISO-8859-1', header=None)
    #df = pd.read_csv(file_path, header=True, on_bad_lines='skip')
    df = pd.read_csv(file_path, header=None, on_bad_lines='skip')
    #df = df[[0, 5]]
    df = df[[0, 1]]
    df.columns = ['label', 'text']
    #df = df.fillna(0)
    #df['label'] = df['label'].astype(int)
    df = df.sample(7400, random_state=0)
    #df = df[df['text'].notna()]

    ##DEBUG
    #pd.set_option('display.max_rows', None)
    #print(df)



    def pick_first_n_words(string, max_words=250): # tried a few max_words, kept 250 as max tokens was < 512
        #print(string)
        split_str = string.split()
        #print(split_str[:min(len(split_str), max_words)])
        return " ".join(split_str[:min(len(split_str), max_words)])

    df['text'] = df['text'].apply(lambda x: pick_first_n_words(x))

    print("############")
    
    
    # divide into test and train
    X_train, X_test, y_train, y_test = \
              train_test_split(df['text'].tolist(), df['label'].tolist(),
              shuffle=True, test_size=0.05, random_state=random_seed, stratify=df['label'])

    # get max length
    max_length_train = max([len(tokenizer.encode(text)) for text in X_train])
    max_length_test = max([len(tokenizer.encode(text)) for text in X_test])
    max_length = max([max_length_train, max_length_test]) + 10  # margin for special tokens (sos and eos) and the prompt fillers
    max_length = max(max_length, 300)  # this is a floor, not a cap: pad to at least 300 tokens
    print(f"Setting max length as {max_length}")

    # format into SentimentDataset class
    train_dataset = SentimentDataset(X_train, y_train, tokenizer, max_length=max_length)

    # return
    return train_dataset, (X_test, y_test)
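
The length rule in `load_sentiment_dataset` can be checked in isolation: the padded length is the longest encoded review plus a margin, and never less than 300. The sketch below is an illustration under stated assumptions. `choose_max_length` is a hypothetical helper, and `toy_encode` is a stand-in that emits two tokens per word instead of the real GPT-2 BPE tokenizer.

```python
def pick_first_n_words(text: str, max_words: int = 250) -> str:
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])

def choose_max_length(texts, encode, margin: int = 10, floor: int = 300) -> int:
    """Longest encoded text plus a margin, but never less than `floor`."""
    longest = max(len(encode(t)) for t in texts)
    return max(longest + margin, floor)

def toy_encode(text: str):
    """Stand-in tokenizer (assumption): two tokens per whitespace-separated word."""
    return [piece for word in text.split() for piece in (word, "##")]

short_reviews = ["app muito bom", "não funciona"]
long_review = pick_first_n_words("palavra " * 400)  # truncated to 250 words

# Short reviews only: the floor decides the padded length.
print(choose_max_length(short_reviews, toy_encode))
# With one long review: the longest encoding plus the margin decides.
print(choose_max_length(short_reviews + [long_review], toy_encode))
```

Because every example is padded to this single length, one unusually long review raises the memory cost of every batch.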

### Load model and tokenizer; Call data Prep

In [None]:
# import 
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

# model
model_name = "gpt2"
seed = 42

# seed
torch.manual_seed(seed)

<torch._C.Generator at 0x7f32ea83a870>

In [None]:
# iterate for N trials
for trial_no in range(3):
    
    print("Loading model...")
    # load tokenizer and model
    tokenizer = GPT2Tokenizer.from_pretrained(model_name, bos_token='<|startoftext|>',
                                              eos_token='<|endoftext|>', pad_token='<|pad|>')
    model = GPT2LMHeadModel.from_pretrained(model_name).cuda()
    model.resize_token_embeddings(len(tokenizer))

    print("Loading dataset...")
    train_dataset, test_dataset = load_sentiment_dataset(tokenizer, trial_no)
    
    print("Start training...")
    # No evaluation or checkpointing happens during training, so
    # load_best_model_at_end is dropped: there would be no checkpoint to load.
    training_args = TrainingArguments(output_dir='results', num_train_epochs=10,
                                      logging_steps=10, save_strategy="no",
                                      evaluation_strategy="no",
                                      per_device_train_batch_size=2,
                                      warmup_steps=100, weight_decay=0.01, logging_dir='logs')

    # For causal LM fine-tuning the labels are the input ids themselves;
    # the model shifts them internally to predict the next token.
    # The held-out (X_test, y_test) tuple is not a Dataset, so it is not passed to Trainer.
    Trainer(model=model, args=training_args, train_dataset=train_dataset,
            data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                        'attention_mask': torch.stack([f[1] for f in data]),
                                        'labels': torch.stack([f[0] for f in data])}).train()
    
    # test
    print("Start testing...")
    # eval mode on model
    _ = model.eval()

    # compute prediction on test data
    original, predicted, all_text, predicted_text = [], [], [], []
    map_label = {0:'negative', 4: 'positive'}
    for text, label in tqdm(zip(test_dataset[0], test_dataset[1]), total=len(test_dataset[0])):
        # build the same prompt prefix used in training; it already starts
        # with the BOS token, so it is tokenized as-is (no second BOS)
        prompt = f'<|startoftext|>Review: {text}\nSentiment:'
        generated = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
        # greedy decoding: sampling options (top_k, top_p, temperature) do not apply
        sample_outputs = model.generate(generated, do_sample=False, max_length=512,
                                        pad_token_id=tokenizer.eos_token_id)
        pred_text = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
        # extract the predicted sentiment; fall back to "None" if no label was generated
        matches = re.findall("\nSentiment: (.*)", pred_text)
        pred_sentiment = matches[-1] if matches else "None"
        original.append(map_label[label])
        predicted.append(pred_sentiment)
        all_text.append(text)
        predicted_text.append(pred_text)
    #transform into dataframe
    df = pd.DataFrame({'text': all_text, 'predicted': predicted, 'original': original, 'predicted_text': predicted_text})
    df.to_csv(f"result_run_{trial_no}.csv", index=False)
    # compute f1 score
    print(f1_score(original, predicted, average='macro'))

Loading model...

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Loading dataset...
############
Setting max length as 355
Start training...




Step,Training Loss
10,16.5272
20,12.6581
30,8.0367
40,2.9323
50,1.6795
60,1.1034
70,1.2002
80,0.9583
90,1.3162
100,0.8116


Start testing...


370it [00:16, 22.37it/s]


0.8533298097251586
Loading model...


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading dataset...
############
Setting max length as 355
Start training...




Step,Training Loss
10,22.8843
20,15.0617
30,9.9911
40,2.9345
50,1.3763
60,1.6446
70,1.1242
80,1.0001
90,0.9073
100,1.0277


Start testing...


370it [00:14, 24.94it/s]


0.8537078256794752
Loading model...


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading dataset...
############
Setting max length as 355
Start training...




Step,Training Loss
10,18.9452
20,19.343
30,9.119
40,3.6492
50,1.7798
60,1.4092
70,0.9301
80,0.9033
90,0.7236
100,1.0102


Start testing...


370it [00:17, 20.90it/s]

0.8283644181988205
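
The three macro-F1 values printed above come from three different train/test splits (`random_state` 0, 1 and 2), so it is more informative to report them together than to quote a single run. A small sketch that summarizes them, with the scores copied from the output above:

```python
from statistics import mean, stdev

# Macro-F1 printed at the end of each trial in the output above.
f1_scores = [0.8533298097251586, 0.8537078256794752, 0.8283644181988205]

print(f"mean macro-F1: {mean(f1_scores):.4f}")
print(f"sample std:    {stdev(f1_scores):.4f}")
```

With only three trials the standard deviation is a rough indication of split-to-split variation, not a tight estimate.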





In [None]:
df

| | text | predicted | original | predicted_text |
|---|---|---|---|---|
| 0 | Existe a opção para pagar com o TICKET RESTAUR... | negative | negative | Review: Existe a opção para pagar com o TICKE... |
| 1 | Aplicativo inútil. Não consegue PAGAR um lanch... | positive | negative | Review: Aplicativo inútil. Não consegue PAGAR... |
| 2 | Rappi prime atualmente é a melhor assinatura d... | positive | positive | Review: Rappi prime atualmente é a melhor ass... |
| 3 | Experiência ruim principalmente no que diz pro... | negative | negative | Review: Experiência ruim principalmente no qu... |
| 4 | A aplicação funciona muito bem,os parceiros é ... | positive | positive | Review: A aplicação funciona muito bem,os par... |
| ... | ... | ... | ... | ... |
| 365 | Imprescindível para quem quer avaliar o que co... | negative | positive | Review: Imprescindível para quem quer avaliar... |
| 366 | Bom antes era uma beleza mais ai esses tempos ... | positive | negative | Review: Bom antes era uma beleza mais ai esse... |
| 367 | Nao resolvem problemas quando é solicitada aju... | positive | negative | Review: Nao resolvem problemas quando é solic... |
| 368 | Desde que foi comprado pela Magalu vai de mal ... | negative | negative | Review: Desde que foi comprado pela Magalu va... |
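
A quick way to see which kind of error dominates is to count (original, predicted) pairs. The sketch below does this for the nine rows visible in the preview of `df` only, so the counts say nothing about the full test set; on the real data the same counting could be applied to the `original` and `predicted` columns.

```python
from collections import Counter

# (original, predicted) pairs for the nine rows visible in the preview.
preview_rows = [
    ("negative", "negative"),  # row 0
    ("negative", "positive"),  # row 1
    ("positive", "positive"),  # row 2
    ("negative", "negative"),  # row 3
    ("positive", "positive"),  # row 4
    ("positive", "negative"),  # row 365
    ("negative", "positive"),  # row 366
    ("negative", "positive"),  # row 367
    ("negative", "negative"),  # row 368
]

confusion = Counter(preview_rows)
for (original, predicted), count in sorted(confusion.items()):
    print(f"original={original:<8} predicted={predicted:<8} count={count}")

# Share of preview rows where the prediction matches the original label.
accuracy = sum(c for (o, p), c in confusion.items() if o == p) / len(preview_rows)
print(f"accuracy on these rows: {accuracy:.2f}")
```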


In [None]:
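import os
os.makedirs("gpt2_10epochs", exist_ok=True)
# `model` is the model from the last trial of the loop above
model.save_pretrained("gpt2_10epochs")
# save the tokenizer too: it holds the added special tokens the resized embeddings depend on
tokenizer.save_pretrained("gpt2_10epochs")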

In [None]:
input_text = "o aplicativo é ruim "

# predict the sentiment of a single review
# the prompt already starts with the BOS token, so it is tokenized as-is
prompt = f'<|startoftext|>Review: {input_text}\nSentiment:'
generated = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
# greedy decoding: sampling options (top_k, top_p, temperature) do not apply
sample_outputs = model.generate(generated, do_sample=False, max_length=512,
                                pad_token_id=tokenizer.eos_token_id)
pred_text = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(pred_text)
# extract the predicted sentiment; fall back to "None" if no label was generated
matches = re.findall("\nSentiment: (.*)", pred_text)
pred_sentiment = matches[-1] if matches else "None"

print(pred_sentiment)

 Review: o aplicativo é ruim 
Sentiment: negative
negative


