## Classification benchmarks: `ZSL`, `Finetuned BERT`, `Unprompted GPT`, `Prompted GPT` and `Finetuned GPT` 

GPT models can also be used for classification tasks. Here I put together a classification benchmark comparing a zero-shot learning (ZSL) classifier, a finetuned BERT, and prompted / finetuned GPT models to see how they perform. I rely on transfer learning only, meaning I won't train any models myself (except for the GPT supervised fine-tuning, SFT).

OpenAI docs: 
- https://platform.openai.com/docs/guides/fine-tuning
- https://platform.openai.com/docs/guides/fine-tuning/advanced-usage
- https://github.com/openai/openai-cookbook/blob/main/examples/Fine-tuned_classification.ipynb

In [7]:
import credentials
import json
import os
os.environ["OPENAI_API_KEY"] = credentials.openai_api

import openai

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from sklearn.metrics import classification_report, confusion_matrix

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Load the sentiment dataset from Hugging Face

In [2]:
data = load_dataset('amazon_reviews_multi', 'en', split = 'validation',)

Found cached dataset amazon_reviews_multi (C:/Users/Rabay_Kristof/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


In [3]:
data = pd.DataFrame(data)

# Combine title and body into a single text field
data['review'] = data['review_title'] + '. ' + data['review_body']

# Drop neutral 3-star reviews, then binarise: 4-5 stars -> positive, 1-2 stars -> negative
data = data[data['stars'] != 3].copy()
data['sentiment'] = data['stars'].apply(lambda x: 'positive' if x >= 4 else 'negative')

# Keep only the text and the label, then take a reproducible 12.5% sample
data.drop(labels = ['review_id', 'product_id', 'reviewer_id', 'language', 'review_title', 'review_body', 'stars', 'product_category'], axis = 1, inplace = True)
data = data.sample(frac = 0.125, random_state=43)
data.reset_index(drop = True, inplace = True)

print(data.shape)
data.head(3)

(500, 2)


| | review | sentiment |
|---|---|---|
| 0 | Needed cupcake rings, ended up with breast mil... | negative |
| 1 | One Star. This is the band I received. | negative |
| 2 | Good washer. Great product especially if you l... | positive |
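The labelling rule used above, restated as a standalone function for clarity. This is only a sketch: `stars_to_sentiment` is a hypothetical helper name, and returning `None` for 3 stars stands in for the row filter in the cell above.

```python
def stars_to_sentiment(stars):
    """Binarise a 1-5 star rating; 3-star reviews are treated as neutral and dropped."""
    if stars == 3:
        return None  # neutral: excluded from the benchmark
    return "positive" if stars >= 4 else "negative"

# Apply the rule to every possible rating
labels = {s: stars_to_sentiment(s) for s in [1, 2, 3, 4, 5]}
print(labels)
```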


### 1. Zero-Shot-Classifier

Model: `MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary`

In [8]:
model_name = "MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("zero-shot-classification", model=model_name, tokenizer=tokenizer, use_fast=False)

In [11]:
candidate_labels = ['positive', 'negative']
sequence_to_classify = data['review'].tolist()

In [24]:
%%time
ZSL_output = classifier(sequence_to_classify, candidate_labels, multi_label=False)

CPU times: total: 6min 15s
Wall time: 1min 38s


In [26]:
ZSL_output[0]

{'sequence': 'Needed cupcake rings, ended up with breast milk steam bags- very unhappy. If I could give this 0 stars I would. It’s not at all what we ordered. Needed these cupcake rings for my daughters birthday party tomorrow and instead I am left with breast pump and breast milk accessory micro steam bags. Wtf.',
 'labels': ['negative', 'positive'],
 'scores': [0.9748134016990662, 0.02518662065267563]}
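
For intuition about the `scores` above: with `multi_label=False`, the zero-shot pipeline turns each candidate label into an NLI hypothesis and, as far as I understand its implementation, applies a softmax to the entailment logits across the labels. The sketch below mimics only that last step. The logits are made-up numbers and `zero_shot_scores` is a hypothetical helper, not part of `transformers`.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax the per-label entailment logits and rank labels by score, highest first."""
    peak = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [e / total for _, e in ranked],
    }

# Hypothetical entailment logits for the hypotheses built from each candidate label
result = zero_shot_scores({"positive": -1.2, "negative": 2.4})
print(result["labels"], [round(s, 3) for s in result["scores"]])
```

Because the scores are normalised across the two labels, they always sum to 1, which is why the pair above is 0.975 and 0.025.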

In [27]:
for i in ZSL_output:
    i['labels'] = i['labels'][0]
    i['scores'] = i['scores'][0]

In [34]:
ZSL_output = pd.DataFrame(ZSL_output)

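# The predictions are in the same order as `data`, so align by position;
# merging on the review text would duplicate rows if two reviews were identical
ZSL_results = pd.concat([data, ZSL_output[['labels', 'scores']]], axis = 1)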
ZSL_results.head(3)

| | review | sentiment | labels | scores |
|---|---|---|---|---|
| 0 | Needed cupcake rings, ended up with breast mil... | negative | negative | 0.974813 |
| 1 | One Star. This is the band I received. | negative | positive | 0.885268 |
| 2 | Good washer. Great product especially if you l... | positive | positive | 0.738286 |


In [37]:
print(classification_report(ZSL_results['sentiment'], ZSL_results['labels']))

              precision    recall  f1-score   support

    negative       0.92      0.87      0.89       253
    positive       0.87      0.92      0.89       247

    accuracy                           0.89       500
   macro avg       0.89      0.89      0.89       500
weighted avg       0.89      0.89      0.89       500



In [50]:
pd.DataFrame(confusion_matrix(ZSL_results['sentiment'], ZSL_results['labels'], labels = ['negative', 'positive']),
             columns = ['Pred - Neg', 'Pred - Pos'], 
             index=['True - Neg', 'True - Pos'])

| | Pred - Neg | Pred - Pos |
|---|---|---|
| True - Neg | 219 | 34 |
| True - Pos | 20 | 227 |
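
As a sanity check, the figures in the classification report can be recomputed by hand from this confusion matrix. A minimal sketch in plain Python; `per_class_metrics` is a hypothetical helper and the counts are copied from the table above.

```python
# Confusion matrix counts keyed by (true label, predicted label)
cm = {
    ("negative", "negative"): 219, ("negative", "positive"): 34,
    ("positive", "negative"): 20,  ("positive", "positive"): 227,
}
LABELS = ("negative", "positive")

def per_class_metrics(cm, label):
    """Return (precision, recall, f1) for one class."""
    tp = cm[(label, label)]
    predicted = sum(cm[(true, label)] for true in LABELS)  # column total
    actual = sum(cm[(label, pred)] for pred in LABELS)     # row total
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[(label, label)] for label in LABELS) / sum(cm.values())

for label in LABELS:
    print(label, [round(m, 2) for m in per_class_metrics(cm, label)])
# negative [0.92, 0.87, 0.89]
# positive [0.87, 0.92, 0.89]
print("accuracy", accuracy)  # accuracy 0.892
```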


### 2. Finetuned BERT

Model: `nlptown/bert-base-multilingual-uncased-sentiment`

In [68]:
pipe = pipeline("text-classification", model = 'nlptown/bert-base-multilingual-uncased-sentiment', use_fast=False)
tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}


In [71]:
%%time
FT_BERT_output = pipe(sequence_to_classify, **tokenizer_kwargs)

CPU times: total: 4min 40s
Wall time: 1min 12s


In [72]:
FT_BERT_output[0]

{'label': '1 star', 'score': 0.9395105838775635}

In [75]:
replacer = {'1 star' : 'negative', 
            '2 stars' : 'negative',
            '3 stars' : 'negative',  
            '4 stars' : 'positive',  
            '5 stars' : 'positive'}
FT_BERT_output = pd.DataFrame(FT_BERT_output).replace(replacer)

FT_BERT_results = pd.concat([data, FT_BERT_output], axis = 1)
FT_BERT_results.head(3)

| | review | sentiment | label | score |
|---|---|---|---|---|
| 0 | Needed cupcake rings, ended up with breast mil... | negative | negative | 0.939511 |
| 1 | One Star. This is the band I received. | negative | negative | 0.993209 |
| 2 | Good washer. Great product especially if you l... | positive | positive | 0.78792 |


In [76]:
print(classification_report(FT_BERT_results['sentiment'], FT_BERT_results['label']))

              precision    recall  f1-score   support

    negative       0.91      0.97      0.94       253
    positive       0.97      0.90      0.93       247

    accuracy                           0.94       500
   macro avg       0.94      0.94      0.94       500
weighted avg       0.94      0.94      0.94       500



In [77]:
pd.DataFrame(confusion_matrix(FT_BERT_results['sentiment'], FT_BERT_results['label'], labels = ['negative', 'positive']),
             columns = ['Pred - Neg', 'Pred - Pos'], 
             index=['True - Neg', 'True - Pos'])

| | Pred - Neg | Pred - Pos |
|---|---|---|
| True - Neg | 245 | 8 |
| True - Pos | 24 | 223 |


Far fewer false positives than the ZSL model (8 vs. 34) and slightly more false negatives (24 vs. 20). The extra false negatives may be due to `3 stars` predictions being mapped to negative, while they should really be neutral.
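
One possible way around the `3 stars` problem, sketched under assumptions: instead of mapping the top-1 label, sum the probability mass of the negative (1-2 star) and positive (4-5 star) classes and ignore `3 stars`. This assumes the pipeline is asked to return the scores of all five classes (in recent `transformers` versions I believe `top_k=None` does this; treat that as an assumption). `binary_from_star_scores` is a hypothetical helper and the scores below are made up.

```python
NEGATIVE = {"1 star", "2 stars"}
POSITIVE = {"4 stars", "5 stars"}

def binary_from_star_scores(all_scores):
    """Pick positive/negative by total probability mass, ignoring the '3 stars' class."""
    neg = sum(s["score"] for s in all_scores if s["label"] in NEGATIVE)
    pos = sum(s["score"] for s in all_scores if s["label"] in POSITIVE)
    # Ties fall back to negative; that choice is arbitrary
    return ("positive", pos) if pos > neg else ("negative", neg)

# Made-up scores where '3 stars' is the top-1 class but the review leans positive
example = [
    {"label": "1 star", "score": 0.05},
    {"label": "2 stars", "score": 0.15},
    {"label": "3 stars", "score": 0.45},
    {"label": "4 stars", "score": 0.30},
    {"label": "5 stars", "score": 0.05},
]
label, mass = binary_from_star_scores(example)
print(label, round(mass, 2))
```

With the top-1 mapping this example would be labelled negative; with the aggregated mass it comes out positive. I have not rerun the benchmark with this variant.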

### 3. Unprompted `GPT` model

Unprompted here means zero-shot: no examples are given in the prompt.

In [95]:
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain import LLMChain
from langchain.prompts import PromptTemplate

In [117]:
#llm = OpenAI(model_name="text-davinci-003", temperature = 0, max_tokens = 1, top_p = 1.0)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature = 0, max_tokens = 1, top_p = 1.0)

template = """Decide whether a product review's sentiment is positive or negative. Only predict positive or negative.
Product review: {review}
Sentiment:"""

prompt = PromptTemplate(input_variables=["review"], template=template)

chain = LLMChain(llm = llm, prompt = prompt)

In [118]:
REVIEW = 'I hated this TV'

print(prompt.format(review = REVIEW))
print(chain.run(REVIEW))

Decide whether a product review's sentiment is positive or negative. Only predict positive or negative.
Product review: I hated this TV
Sentiment:
Negative


In [127]:
%%time
GPT_unprompted_output = [chain.run(i).lower().strip() for i in sequence_to_classify]

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID dcaa5764183f595f6e9d19190922bc5d in your message.).
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID ca33f76389e9af6294ef7e870be72bf3 in your message.).
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can r

CPU times: total: 2.11 s
Wall time: 10min 59s


In [129]:
GPT_unprompted_results = pd.concat([data, pd.Series(GPT_unprompted_output, name = 'label')], axis = 1)
GPT_unprompted_results.head(3)

| | review | sentiment | label |
|---|---|---|---|
| 0 | Needed cupcake rings, ended up with breast mil... | negative | negative |
| 1 | One Star. This is the band I received. | negative | negative |
| 2 | Good washer. Great product especially if you l... | positive | positive |


In [131]:
GPT_unprompted_results[~GPT_unprompted_results['label'].isin(['positive', 'negative'])]

| | review | sentiment | label |
|---|---|---|---|
| 23 | Unassembled. The picture shown is of an assemb... | positive | neutral |
| 30 | a gift. Bought for my mother in law | positive | neutral |
| 56 | Be careful on the print size. It is slightly s... | positive | neutral |
| 77 | Works well. The warmer works really well. The ... | positive | mixed |
| 87 | Ok. For the price serviceable | positive | neutral |
| 89 | Worth reading. Beautifully written. Missing de... | positive | neutral |
| 106 | It works.... Well... to lower your pH more qui... | positive | neutral |
| 196 | Almost fits. The nylon material feels sturdier... | positive | neutral |
| 227 | Confused. Hi, I lost my AirPod case so I bough... | negative | neutral |
| 255 | nice set. Nice set. Set is smaller than I thou... | positive | mixed |


In [133]:
GPT_unprompted_results['label'].value_counts()

label
negative    258
positive    226
neutral      13
mixed         3
Name: count, dtype: int64

In [134]:
GPT_unprompted_results = GPT_unprompted_results[GPT_unprompted_results['label'].isin(['positive', 'negative'])]

In [135]:
print(classification_report(GPT_unprompted_results['sentiment'], GPT_unprompted_results['label']))

              precision    recall  f1-score   support

    negative       0.94      0.97      0.96       250
    positive       0.97      0.94      0.95       234

    accuracy                           0.95       484
   macro avg       0.96      0.95      0.95       484
weighted avg       0.95      0.95      0.95       484



In [136]:
pd.DataFrame(confusion_matrix(GPT_unprompted_results['sentiment'], GPT_unprompted_results['label'], labels = ['negative', 'positive']),
             columns = ['Pred - Neg', 'Pred - Pos'], 
             index=['True - Neg', 'True - Pos'])

| | Pred - Neg | Pred - Pos |
|---|---|---|
| True - Neg | 243 | 7 |
| True - Pos | 15 | 219 |


Even though I asked the model to predict only positive or negative, it answered neutral 13 times and mixed 3 times. This is fine: most items are categorized as expected, and the neutral / mixed reviews arguably deserve those labels, since they really seem neither positive nor negative.

Counting only the 484 reviews it labelled positive or negative, unprompted GPT is the best model so far, better than the finetuned BERT.
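
Since GPT is scored on 484 reviews while the other models are scored on all 500, here is a stricter comparison that counts the 16 neutral / mixed answers as errors. This is just arithmetic on the confusion matrices above, and it is deliberately harsh: those reviews arguably are neutral.

```python
total = 500

# Unprompted GPT: correct predictions among the 484 reviews it labelled pos/neg
gpt_correct = 243 + 219
gpt_scored = 484
off_label = total - gpt_scored          # 13 neutral + 3 mixed answers

gpt_acc_filtered = gpt_correct / gpt_scored  # off-label answers dropped
gpt_acc_strict = gpt_correct / total         # off-label answers counted as errors

# Finetuned BERT, scored on all 500 reviews
bert_acc = (245 + 223) / total

print(off_label)                   # 16
print(round(gpt_acc_filtered, 3))  # 0.955
print(gpt_acc_strict)              # 0.924
print(bert_acc)                    # 0.936
```

Under the strict reading, unprompted GPT lands slightly below the finetuned BERT, so the ranking depends on how the off-label answers are treated.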

Let's see if adding a few examples to the prompt (few-shot) helps the model further.

### 4. Prompted `GPT` model

Prompted here means few-shot: a couple of examples are given in the prompt.
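
To make the idea concrete, here is a minimal sketch of how a few-shot prompt can be assembled from the zero-shot template used in the previous section. The two example reviews are invented for illustration (they are not taken from the dataset), and `build_few_shot_prompt` is a hypothetical helper, not a LangChain API.

```python
# Hypothetical labelled examples, invented for illustration only
FEW_SHOT_EXAMPLES = [
    ("Stopped working after two days.", "negative"),
    ("Exactly what I needed, great quality.", "positive"),
]

INSTRUCTION = (
    "Decide whether a product review's sentiment is positive or negative. "
    "Only predict positive or negative."
)

def build_few_shot_prompt(review, examples=FEW_SHOT_EXAMPLES):
    """Prepend labelled examples to the review, leaving the final label blank."""
    shots = "\n".join(
        f"Product review: {text}\nSentiment: {label}" for text, label in examples
    )
    return f"{INSTRUCTION}\n{shots}\nProduct review: {review}\nSentiment:"

prompt_text = build_few_shot_prompt("I hated this TV")
print(prompt_text)
```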