<a href="https://colab.research.google.com/github/nandita7krishnan/llm/blob/main/Tweet_analysis_using_roberta_and_more.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Chapter 4 - Text Classification</h1>
<i>Classifying text with both representative and generative models</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb)

---

This notebook is for Chapter 4 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


# **Data**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
from datasets import load_dataset

# Load our data
data = load_dataset('csv', data_files='/content/Corona_NLP_test.csv')
data

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment'],
        num_rows: 3798
    })
})

In [6]:
data["train"][7]

{'UserName': 8,
 'ScreenName': 44960,
 'Location': 'Geneva, Switzerland',
 'TweetAt': '03-03-2020',
 'OriginalTweet': '@DrTedros "We can\x92t stop #COVID19 without protecting #healthworkers.\r\r\nPrices of surgical masks have increased six-fold, N95 respirators have more than trebled &amp; gowns cost twice as much"-@DrTedros #coronavirus',
 'Sentiment': 'Neutral'}

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [7]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [10]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["train"], "OriginalTweet")), total=len(data["train"])):
    negative_score = output[0]["score"]
    neutral_score = output[1]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, neutral_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 3798/3798 [00:47<00:00, 79.59it/s] 


In [12]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Neutral Reveiew", "Positive Review"]
    )
    print(performance)

In [13]:
data['train']['Sentiment'].cast()

{'UserName': Value('int64'),
 'ScreenName': Value('int64'),
 'Location': Value('string'),
 'TweetAt': Value('string'),
 'OriginalTweet': Value('string'),
 'Sentiment': Value('string')}

In [19]:
def to_label(data_parameter):
  if data_parameter['Sentiment'] == 'Positve' or data_parameter['Sentiment'] == 'Extremely Positive':
    data_parameter['Sentiment'] = 2
  elif data_parameter['Sentiment'] == 'Negative' or data_parameter['Sentiment'] == 'Extremely Negative':
    data_parameter['Sentiment'] = 0
  else:
    data_parameter['Sentiment'] = 1
  return data_parameter

data['train'] = data['train'].map(to_label)

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2


In [20]:
data['train']['Sentiment']

Column([0, 1, 2, 0, 1])

In [21]:
data['train'].features

{'UserName': Value('int64'),
 'ScreenName': Value('int64'),
 'Location': Value('string'),
 'TweetAt': Value('string'),
 'OriginalTweet': Value('string'),
 'Sentiment': Value('int64')}

In [25]:
evaluate_performance(data["train"]["Sentiment"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.58      0.70      0.63      1633
Neutral Reveiew       0.52      0.49      0.51      1566
Positive Review       0.57      0.34      0.43       599

       accuracy                           0.56      3798
      macro avg       0.56      0.51      0.52      3798
   weighted avg       0.56      0.56      0.55      3798



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [31]:
from datasets import load_dataset

# Load our data
data_train= load_dataset('csv', data_files='/content/Corona_NLP_train.csv', encoding='latin-1')
data_test = load_dataset('csv', data_files='/content/Corona_NLP_test.csv',  encoding='latin-1')

In [32]:
data_test

DatasetDict({
    train: Dataset({
        features: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment'],
        num_rows: 3798
    })
})

In [35]:
def to_label(data_parameter):
  if data_parameter['Sentiment'] == 'Positve' or data_parameter['Sentiment'] == 'Extremely Positive':
    data_parameter['Sentiment'] = 2
  elif data_parameter['Sentiment'] == 'Negative' or data_parameter['Sentiment'] == 'Extremely Negative':
    data_parameter['Sentiment'] = 0
  else:
    data_parameter['Sentiment'] = 1
  return data_parameter

data_train['train'] = data_train['train'].map(to_label)
data_test['train'] = data_test['train'].map(to_label)


Map:   0%|          | 0/41157 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

In [36]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data_train["train"]["OriginalTweet"], show_progress_bar=True)
test_embeddings = model.encode(data_test["train"]["OriginalTweet"], show_progress_bar=True)

Batches:   0%|          | 0/1287 [00:00<?, ?it/s]

Batches:   0%|          | 0/119 [00:00<?, ?it/s]

In [37]:
train_embeddings.shape

(41157, 768)

In [40]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data_train["train"]["Sentiment"])

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [44]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data_test['train']["Sentiment"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.65      0.65      0.65      1633
Neutral Reveiew       0.56      0.66      0.61      1566
Positive Review       0.53      0.28      0.37       599

       accuracy                           0.60      3798
      macro avg       0.58      0.53      0.54      3798
   weighted avg       0.60      0.60      0.59      3798



**Tip!**  

What would happen if we would not use a classifier at all? Instead, we can average the embeddings per class and apply cosine similarity to predict which classes match the documents best:

In [47]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data_train["train"]["Sentiment"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data_test["train"]["Sentiment"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.59      0.59      0.59      1633
Neutral Reveiew       0.55      0.43      0.48      1566
Positive Review       0.35      0.54      0.42       599

       accuracy                           0.52      3798
      macro avg       0.50      0.52      0.50      3798
   weighted avg       0.54      0.52      0.52      3798



### Zero-shot Classification

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



**Tip!**  

What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!

## **Classification with Generative Models**

### Encoder-decoder Models

In [48]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0


In [49]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data_test['train'] = data_test['train'].map(lambda example: {"t5": prompt + example['OriginalTweet']})
data_test['train']

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

Dataset({
    features: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment', 't5'],
    num_rows: 3798
})

In [54]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data_test['train'], "t5")), total=len(data_test['train'])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 2 if text == "positive" else 1)


100%|██████████| 3798/3798 [03:41<00:00, 17.18it/s]


In [55]:
evaluate_performance(data_test['train']["Sentiment"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.52      0.87      0.65      1633
Neutral Reveiew       0.00      0.00      0.00      1566
Positive Review       0.32      0.58      0.41       599

       accuracy                           0.46      3798
      macro avg       0.28      0.48      0.35      3798
   weighted avg       0.27      0.46      0.34      3798



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### ChatGPT for Classification

In [57]:
import openai

# Create client
client = openai.OpenAI(api_key="")

In [58]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
            },
        {
            "role": "user",
            "content":   prompt.replace("[DOCUMENT]", document)
            }
    ]
    chat_completion = client.chat.completions.create(
      messages=messages,
      model=model,
      temperature=0
    )
    return chat_completion.choices[0].message.content

In [60]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive, neutral,  or negative tweet:

[DOCUMENT]

If it is positive return 2 and if it is negative return 0, 1 if it's neutral. Do not give any other answers.
"""

The next step would be to run one of OpenAI's model against the entire evaluation dataset. However, only run this when you have sufficient tokens as this will call the API for the entire test dataset (1066 records).

In [61]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data_test["train"]["OriginalTweet"])]

  0%|          | 0/3798 [00:02<?, ?it/s]


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)