<h1>Chapter 4 - Text Classification</h1>
<i>Classifying text with both representation and generative models</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb)

---

This notebook is for Chapter 4 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install datasets transformers sentence-transformers openai

# **Data**

In [None]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [None]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [None]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"  # RoBERTa-based model fine-tuned for sentiment analysis on Twitter data

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,  # Return the score of every sentiment class, not only the top one
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [None]:
import numpy as np
from tqdm import tqdm  # Progress bar for the inference loop
from transformers.pipelines.pt_utils import KeyDataset

# Run inference on the test split
y_pred = []  # Collects one predicted label per test document
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # The model scores three classes (negative, neutral, positive); keep only the two we evaluate
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # np.argmax returns the index of the larger score: 0 for negative, 1 for positive
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:10<00:00, 102.16it/s]
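
The loop above picks the negative and positive scores by position (`output[0]` and `output[2]`). A minimal sketch of selecting them by label name instead is shown below. It assumes each pipeline result is a list of dicts with `label` and `score` keys and that the label names are `negative`, `neutral`, and `positive`; `example_output` and `binary_assignment` are hypothetical names used only for illustration.

```python
# Hypothetical pipeline result for one document. The label names are an
# assumption about the model's configuration; check them against real output.
example_output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.70},
    {"label": "positive", "score": 0.20},
]

def binary_assignment(output):
    """Return 0 (negative) or 1 (positive), ignoring the neutral score."""
    # Index the scores by label name so the result does not depend on ordering
    scores = {entry["label"].lower(): entry["score"] for entry in output}
    # Ties resolve to 0, matching np.argmax([negative, positive])
    return int(scores["positive"] > scores["negative"])

assignment = binary_assignment(example_output)
print(assignment)
```

Looking scores up by name keeps the mapping correct even if a pipeline setting changes the order in which classes are returned.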


In [None]:
from sklearn.metrics import classification_report  # Builds a per-class precision/recall/F1 report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    # y_true: the actual label of each sample in the dataset
    # y_pred: the model's predicted label for each sample
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
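
The report above lists precision, recall, and F1 per class. As a rough illustration of how those columns are derived, the sketch below recomputes them by hand on a small made-up label list. `toy_true`, `toy_pred`, and `precision_recall_f1` are hypothetical names that are not part of the notebook, and the numbers are unrelated to the results above.

```python
def precision_recall_f1(y_true, y_pred, target_class):
    """Compute precision, recall, and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    # Precision: of everything predicted as this class, how much was correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that truly is this class, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 0 = negative review, 1 = positive review
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0, 1, 1]

for cls in (0, 1):
    p, r, f = precision_recall_f1(toy_true, toy_pred, cls)
    print(cls, round(p, 2), round(r, 2), round(f, 2))
```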



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [None]:
from sentence_transformers import SentenceTransformer  # Converts text into fixed-length embedding vectors

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)
# model.encode() converts the input texts into embedding vectors.
# data["train"]["text"] selects the "text" column of the training split.
# show_progress_bar=True displays a progress bar while encoding.
# train_embeddings is a 2-D array of shape (n_train_samples, embedding_dim); for all-mpnet-base-v2, embedding_dim is 768.

In [None]:
train_embeddings.shape

(8530, 768)

In [None]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)  # clf is short for "classifier"; random_state seeds the random number generator
clf.fit(train_embeddings, data["train"]["label"])  # fit() trains the model on the embeddings and their labels

In [None]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



**Tip!**  

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and apply cosine similarity to predict which class matches each document best:

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
# np.hstack appends the labels as an extra column (index 768) next to the embeddings
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
# Group by the label column and average each group: one mean embedding per class, as a NumPy array
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
# sim_matrix has shape (n_test_samples, n_classes): one cosine similarity per document/class pair
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
# The column index of each row's maximum is the predicted label
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
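
The averaging approach above can also be written with NumPy alone, which may make the shapes easier to follow. The sketch below is one possible formulation under that assumption, run on made-up 2-D vectors rather than real embeddings. `centroid_predict` and the `toy_*` arrays are hypothetical names for illustration.

```python
import numpy as np

def centroid_predict(train_emb, train_labels, test_emb):
    """Assign each test vector to the class whose mean training vector is most similar."""
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    # One mean embedding per class, shape (n_classes, embedding_dim)
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in classes])
    # Cosine similarity is the dot product of L2-normalized vectors
    test_norm = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    cent_norm = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim_matrix = test_norm @ cent_norm.T  # shape (n_test_samples, n_classes)
    # The most similar centroid in each row gives the predicted class
    return classes[np.argmax(sim_matrix, axis=1)]

# Made-up 2-D "embeddings": class 0 points right, class 1 points up
toy_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
toy_labels = [0, 0, 1, 1]
toy_test = np.array([[0.8, 0.3], [0.2, 0.9]])

toy_pred = centroid_predict(toy_train, toy_labels, toy_test)
print(toy_pred)
```

In the notebook, the same function could be called with `train_embeddings`, `data["train"]["label"]`, and `test_embeddings`.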



### Zero-shot Classification

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)  # Index of the most similar label in each row is the predicted label

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



**Tip!**  

What would happen if you used different descriptions? Try **"A very negative movie review"** and **"A very positive movie review"** to see what happens!
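
One way to run this experiment is to wrap the zero-shot steps in a helper that accepts several sets of label descriptions and reports the accuracy of each. The sketch below is a rough outline under stated assumptions: `compare_label_sets`, `zero_shot_predict`, and `toy_encode` are hypothetical names, and `toy_encode` is a made-up word-count encoder that stands in for a real embedding model so the block runs without downloading anything.

```python
import numpy as np

def zero_shot_predict(doc_embeddings, label_embeddings):
    """Return, for each document, the index of the most cosine-similar label."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    labels = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    return np.argmax(docs @ labels.T, axis=1)

def compare_label_sets(encode, label_sets, doc_embeddings, y_true):
    """Accuracy per named label set; `encode` maps a list of strings to a 2-D array."""
    y_true = np.asarray(y_true)
    results = {}
    for name, descriptions in label_sets.items():
        y_pred = zero_shot_predict(doc_embeddings, encode(descriptions))
        results[name] = float((y_pred == y_true).mean())
    return results

# Hypothetical stand-in encoder: word counts over a tiny vocabulary
VOCAB = ["negative", "positive", "movie", "review"]

def toy_encode(texts):
    return np.array(
        [[t.lower().split().count(w) + 0.01 for w in VOCAB] for t in texts]
    )

label_sets = {
    "plain": ["A negative review", "A positive review"],
    "detailed": ["A very negative movie review", "A very positive movie review"],
}
toy_docs = toy_encode(["negative movie", "positive movie"])
results = compare_label_sets(toy_encode, label_sets, toy_docs, [0, 1])
print(results)
```

In the notebook, passing `model.encode`, `test_embeddings`, and `data["test"]["label"]` in place of the toy inputs would compare the two descriptions on the real test set.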

## **Classification with Generative Models**

### Encoder-decoder Models

In [None]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

Device set to use cuda:0


In [None]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
# example['text'] is the original review text of each sample.
# prompt + example['text'] prepends the prompt to the review, forming the model input (prompt + sentence to classify).
# The returned dict adds a new "t5" column that holds the prompted text.
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [None]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [00:46<00:00, 22.83it/s]


In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
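
The inference loop above treats every generation that is not exactly `"negative"` as positive, so an unexpected output would be counted silently as a positive prediction. A minimal sketch of checking for such cases is shown below. The `generated` list is made up for illustration, and `LABEL_MAP` and `map_generation` are hypothetical names.

```python
from collections import Counter

# Made-up generations for illustration; in the notebook these would be the
# "generated_text" values collected during inference.
generated = ["negative", "positive", "Negative ", "positive", "neutral"]

LABEL_MAP = {"negative": 0, "positive": 1}

def map_generation(text):
    """Map generated text to 0/1, or None when it is neither label."""
    return LABEL_MAP.get(text.strip().lower())

mapped = [map_generation(t) for t in generated]
# Count the normalized generations that did not match either label
unexpected = Counter(
    t.strip().lower() for t, m in zip(generated, mapped) if m is None
)
print(mapped)
print(unexpected)
```

If `unexpected` is empty on the real outputs, the original one-line mapping is safe. Otherwise the counts show which generations need handling.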



### ChatGPT for Classification

In [None]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

In [None]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
            },
        {
            "role": "user",
            "content":   prompt.replace("[DOCUMENT]", document)
            }
    ]
    # messages holds the conversation sent to the model; each message is a dict with "role" and "content" keys.
    # The "system" message sets a system-level instruction that tells the model to act as a helpful assistant.
    # The "user" message fills the [DOCUMENT] placeholder in the prompt with the actual document text.

    chat_completion = client.chat.completions.create(
      messages=messages,
      model=model,
      temperature=0  # Controls randomness (range 0 to 2); 0 favors the most likely tokens, giving more stable output
    )
    return chat_completion.choices[0].message.content

In [None]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: YOUR_KEY*HERE. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

The next step would be to run one of OpenAI's models against the entire evaluation dataset. However, only run this when you have sufficient credits, as this will call the API once for every record in the test dataset (1066 records).

In [None]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

100%|██████████| 1066/1066 [13:34<00:00,  1.31it/s]


In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066
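
Converting replies with `int(pred)` raises a `ValueError` as soon as one reply contains anything besides the digit, such as `"Answer: 1"`. A more forgiving parser could look like the sketch below. `parse_binary_label` and the `replies` list are hypothetical and only illustrate the idea.

```python
import re

def parse_binary_label(raw, default=None):
    """Extract a standalone 0 or 1 from a model reply; return `default` if none is found."""
    match = re.search(r"\b[01]\b", raw)
    return int(match.group()) if match else default

# Made-up replies covering clean, padded, wordy, and unusable answers
replies = ["1", " 0\n", "Answer: 1", "positive"]
parsed = [parse_binary_label(r) for r in replies]
print(parsed)  # [1, 0, 1, None]
```

Replies that parse to `None` can then be counted, inspected, or re-queried before calling `evaluate_performance`, instead of stopping the evaluation with an exception.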

