<h1>Chapter 4 - Text Classification</h1>
<i>Classifying text with both representation and generative models</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb)

---

This notebook is for Chapter 4 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install datasets transformers sentence-transformers openai ollama

# **Data**

In [1]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

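In [8]:
from pprint import pprint

# Inspect the first and last training examples, then the first five texts and labels.
# The `t5` column in the output below comes from the `data.map(...)` call in the
# encoder-decoder section; this cell was executed after that one.
pprint(data["train"][0, -1])
pprint(data["train"]["text"][:5])
pprint(data["train"]["label"][:5])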

{'label': [1, 0],
 't5': ['Is the following sentence positive or negative? the rock is destined '
        'to be the 21st century\'s new " conan " and that he\'s going to make '
        'a splash even greater than arnold schwarzenegger , jean-claud van '
        'damme or steven segal .',
        'Is the following sentence positive or negative? things really get '
        'weird , though not particularly scary : the movie is all portent and '
        'no content .'],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and '
          "that he's going to make a splash even greater than arnold "
          'schwarzenegger , jean-claud van damme or steven segal .',
          'things really get weird , though not particularly scary : the movie '
          'is all portent and no content .']}
['the rock is destined to be the 21st century\'s new " conan " and that he\'s '
 'going to make a splash even greater than arnold schwarzenegger , jean-claud '
 'van damme or steven s

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [2]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path, tokenizer=model_path, return_all_scores=True, device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [9]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference with the sentiment pipeline loaded above.
# NOTE: `pipe` is reassigned to a flan-t5 pipeline later in this notebook; re-running
# this cell after that point yields `generated_text` outputs and a `KeyError: 'score'`.
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # The model scores three classes: index 0 = negative, 1 = neutral, 2 = positive
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Ignore neutral and keep whichever of negative (0) or positive (1) scores higher
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)
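
The inference loop reads the negative and positive scores by position (`output[0]` and `output[2]`). A minimal sketch of a more defensive variant looks the scores up by label name instead. It assumes each pipeline result is a list of `{"label": ..., "score": ...}` dictionaries whose labels are named `negative`, `neutral`, and `positive`; the helper name `to_binary_label` and the sample scores are hypothetical.

```python
def to_binary_label(scores, negative="negative", positive="positive"):
    """Map one pipeline result (a list of label/score dicts) to 0 or 1."""
    # Index the scores by lower-cased label name instead of list position
    by_label = {item["label"].lower(): item["score"] for item in scores}
    if negative not in by_label or positive not in by_label:
        raise KeyError(
            f"expected labels {negative!r} and {positive!r}, got {sorted(by_label)}"
        )
    # Neutral is ignored; ties go to negative (0), matching np.argmax on [neg, pos]
    return int(by_label[positive] > by_label[negative])


# Hypothetical scores for a single document
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.30},
    {"label": "positive", "score": 0.60},
]
print(to_binary_label(example))
```

Because the lookup fails loudly when the expected labels are missing, passing the output of a different pipeline (for example a text-generation one) raises an error at the first document instead of producing misleading predictions.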

In [6]:
from sklearn.metrics import classification_report


def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred, target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [24]:
print(len(data["test"]["label"]), len(y_pred))
print(data["test"]["label"][:10], y_pred[:10])


evaluate_performance(data["test"]["label"], y_pred)

1066 1066
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1] [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
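
To make the columns of the report concrete, here is a small sketch that computes precision, recall, and F1 for one class directly from counts. It reuses the ten labels and predictions printed above. The function name `precision_recall_f1` is hypothetical, and this only mirrors the standard definitions; it is not a description of how scikit-learn implements them.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# First ten test labels and predictions, as printed above
sample_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sample_pred = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(sample_true, sample_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=1.00 recall=0.80 f1=0.89
```

On these ten examples every predicted positive is correct (precision 1.00), but two of the ten true positives are missed (recall 0.80).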



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [34]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
validate_embeddings = model.encode(data["validation"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:08<00:00, 32.42it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:01<00:00, 31.89it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:01<00:00, 31.69it/s]


In [35]:
print(train_embeddings.shape)
print(test_embeddings.shape)
print(validate_embeddings.shape)

(8530, 768)
(1066, 768)
(1066, 768)


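In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Calling fit() a second time would retrain from scratch and discard the first fit,
# so stack the train and validation sets and fit once on the combined data
X_train = np.vstack([train_embeddings, validate_embeddings])
y_train = list(data["train"]["label"]) + list(data["validation"]["label"])

# Train a Logistic Regression on the combined embeddings
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)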

In [46]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.84      0.86      0.85       533
Positive Review       0.85      0.84      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



In [58]:
from sklearn.metrics.pairwise import cosine_similarity

# Reading ahead: a preview of the zero-shot classification covered later in this chapter
label_embeddings = model.encode(
    ["a negative movie review", "a positive movie review"], show_progress_bar=False
)
print(label_embeddings.shape, test_embeddings.shape)

# Cosine similarity between every test document and each label description
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
print(sim_matrix.shape)
print(sim_matrix[:5])

# Assign each document the label with the highest similarity
y_pred = np.argmax(sim_matrix, axis=1)
print(y_pred[:5])
print(data["test"]["label"][:5])

evaluate_performance(data["test"]["label"], y_pred)

(2, 768) (1066, 768)
(1066, 2)
[[0.22957906 0.3515991 ]
 [0.27431846 0.3582818 ]
 [0.11462615 0.13544795]
 [0.16793126 0.28072846]
 [0.14008924 0.14751881]]
[1 1 1 1 1]
[1, 1, 1, 1, 1]
                 precision    recall  f1-score   support

Negative Review       0.83      0.76      0.79       533
Positive Review       0.78      0.85      0.81       533

       accuracy                           0.80      1066
      macro avg       0.80      0.80      0.80      1066
   weighted avg       0.80      0.80      0.80      1066



**Tip!**  

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and apply cosine similarity to predict which class matches each document best:

In [60]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(
    np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)])
)
averaged_target_embeddings = df.groupby(768).mean().values
print(averaged_target_embeddings.shape)

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

(2, 768)
                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### Zero-shot Classification

In [61]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

In [62]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [64]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



**Tip!**  

What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!
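
One way to run that experiment is to wrap the zero-shot steps in a function and call it once per pair of descriptions. The sketch below is pure Python so that it runs on its own; `zero_shot_accuracy`, `cosine`, and the toy `count_encoder` are hypothetical stand-ins and not part of the book's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_accuracy(encode, documents, labels, descriptions):
    """Accuracy when each document gets the label whose description is most similar."""
    label_vectors = encode(descriptions)
    doc_vectors = encode(documents)
    correct = 0
    for vector, label in zip(doc_vectors, labels):
        similarities = [cosine(vector, lv) for lv in label_vectors]
        predicted = similarities.index(max(similarities))
        correct += int(predicted == label)
    return correct / len(labels)


# Toy stand-in for an embedding model: counts of a few cue words
VOCAB = ["bad", "dull", "good", "charming"]


def count_encoder(texts):
    return [[text.lower().split().count(word) for word in VOCAB] for text in texts]


docs = ["a dull and bad film", "charming and good"]
gold = [0, 1]
results = {}
for descriptions in (["bad dull", "good charming"], ["bad", "good"]):
    results[tuple(descriptions)] = zero_shot_accuracy(count_encoder, docs, gold, descriptions)
    print(descriptions, results[tuple(descriptions)])
```

In the notebook, the equivalent call would pass `model.encode`, `list(data["test"]["text"])`, `list(data["test"]["label"])`, and each pair of descriptions you want to compare. The pure-Python cosine loop is slower than `cosine_similarity` but keeps the sketch free of dependencies.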

## **Classification with Generative Models**

### Encoder-decoder Models

In [3]:
# Load our model
pipe = pipeline("text2text-generation", model="google/flan-t5-small", device="cuda:0")

Device set to use cuda:0


In [4]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example["text"]})
data

Map: 100%|██████████████████████████████████| 8530/8530 [00:00<00:00, 39829.42 examples/s]
Map: 100%|██████████████████████████████████| 1066/1066 [00:00<00:00, 38209.87 examples/s]
Map: 100%|██████████████████████████████████| 1066/1066 [00:00<00:00, 37127.91 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [13]:
# Run inference
y_pred = []
outs = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    outs.append(output)
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|█████████████████████████████████████████████████| 1066/1066 [00:27<00:00, 39.03it/s]


In [20]:
pprint(data["test"][:2])

pprint(outs[:2])

evaluate_performance(data["test"]["label"], y_pred)

{'label': [1, 1],
 't5': ['Is the following sentence positive or negative? lovingly photographed '
        'in the manner of a golden book sprung to life , stuart little 2 '
        'manages sweetness largely without stickiness .',
        'Is the following sentence positive or negative? consistently clever '
        'and suspenseful .'],
 'text': ['lovingly photographed in the manner of a golden book sprung to life '
          ', stuart little 2 manages sweetness largely without stickiness .',
          'consistently clever and suspenseful .']}
[[{'generated_text': 'positive'}], [{'generated_text': 'positive'}]]
                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### ChatGPT for Classification

In [33]:
from ollama import chat


def chatgpt_generation(prompt, document, model="llama3.2"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.replace("[DOCUMENT]", document)},
    ]
    chat_completion = chat(
        messages=messages, model=model
    )
    return chat_completion.message.content

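💡 **NOTE**: The `ollama` client needs to reach a running Ollama server. When this notebook runs inside a Docker container, either run this part outside the container or adjust the container's network flags; otherwise the call fails with a connection error.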

In [34]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target for a single document
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

ConnectError: [Errno 111] Connection refused

The next step would be to run one of OpenAI's models against the entire evaluation dataset. However, only run this when you have sufficient tokens, as this will call the API once for every record in the test dataset (1066 records).
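
If tokens or time are limited, one option is to evaluate on a small class-balanced sample of the test set first. The following is a sketch only: `balanced_sample` is a hypothetical helper that is not part of the book's code, and the tiny lists stand in for `data["test"]["text"]` and `data["test"]["label"]`.

```python
import random


def balanced_sample(texts, labels, per_class, seed=42):
    """Return (texts, labels) with up to `per_class` random examples for each label."""
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every time
    picked = []
    for label in sorted(set(labels)):
        indices = [i for i, current in enumerate(labels) if current == label]
        picked.extend(rng.sample(indices, min(per_class, len(indices))))
    picked.sort()  # keep the original dataset order
    return [texts[i] for i in picked], [labels[i] for i in picked]


# Stand-in data: ten documents with alternating labels
texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5
sub_texts, sub_labels = balanced_sample(texts, labels, per_class=2)
print(sub_labels.count(0), sub_labels.count(1))
# prints: 2 2
```

Scores on a small sample are noisier than scores on all 1066 records, so treat them as a sanity check of the prompt rather than a final result.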

In [None]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

100%|██████████| 1066/1066 [13:34<00:00,  1.31it/s]


In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066
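
The `int(pred)` conversion above assumes every response is exactly `0` or `1`. Chat models sometimes add whitespace, punctuation, or a short explanation, so a more forgiving parser can help. The sketch below is an assumption about what such responses might look like, not observed output; `parse_label` and its `default` fallback are hypothetical.

```python
import re


def parse_label(response, default=None):
    """Extract a standalone 0 or 1 from a model response, else return `default`."""
    # \b ensures the digit is not part of a longer number such as "10"
    match = re.search(r"\b([01])\b", response)
    return int(match.group(1)) if match else default


# Hypothetical responses
responses = ["1", " 0\n", "Answer: 1.", "I cannot tell"]
parsed = [parse_label(response) for response in responses]
print(parsed)
# prints: [1, 0, 1, None]
```

Unparseable responses come back as `None`, so they can be counted or inspected instead of being silently mapped to one of the classes.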

