In [1]:
# Set up

# Introduction
In this notebook, we will adapt the approach from Chapter 4 of the "Hands-On Large Language Models" book to classify text using a new dataset. Specifically, we will use a pre-trained Transformer model to classify sentiment in Amazon product reviews. We'll explore both representation-based models and generative models, while adding our own analysis and insights along the way.


In [2]:
# %%capture
!pip install datasets transformers sentence-transformers openai


The dataset we will use is the **Amazon Polarity Dataset**. This dataset contains reviews from Amazon, categorized as either positive or negative sentiment. Each entry consists of a title, the review text, and the associated sentiment label, making it an excellent dataset for training and evaluating sentiment classification models.


In [29]:
# Load our data
from datasets import load_dataset

data = load_dataset("amazon_polarity")

# Take a random sample of 10k training examples and 2k test examples
train_sample = data["train"].shuffle(seed=42).select(range(10000))
test_sample = data["test"].shuffle(seed=42).select(range(2000))

In [30]:
# Value counts for labels in the training set
# To better understand the dataset, count how many positive and negative labels the training sample contains.
from collections import Counter

# Count the number of occurrences of each label in the training data
label_counts = Counter(train_sample["label"])
print(f"Label Counts in Training Set: {label_counts}")


Label Counts in Training Set: Counter({0: 5003, 1: 4997})


In [31]:
# Count the number of occurrences of each label in the test data
label_counts = Counter(test_sample["label"])
print(f"Label Counts in Test Set: {label_counts}")

Label Counts in Test Set: Counter({1: 1018, 0: 982})
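
The counts above show that both samples are close to balanced, which matters when we read accuracy figures later. As a small sketch, the helper below (`label_shares` is our own illustrative name, not part of any library) turns the printed counts into class shares. The counts are copied from the two outputs above.

```python
from collections import Counter

def label_shares(counts):
    """Return the fraction of examples carrying each label."""
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}

# Counts copied from the cell outputs above
train_counts = Counter({0: 5003, 1: 4997})
test_counts = Counter({1: 1018, 0: 982})

print(label_shares(train_counts))
print(label_shares(test_counts))
```

With the test sample at roughly 51% positive, a classifier that always predicts the majority class would reach about 0.51 accuracy, so that is the baseline any model here has to beat.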


In [32]:
# Let's take a quick look at a couple of examples from our dataset to understand its structure.
print(train_sample[0])
print(train_sample[1])

{'label': 0, 'title': 'Anyone who likes this better than the Pekinpah is a moron.', 'content': "All the pretty people in this film. Even the Rudy character played by Michael Madsen. This is adapted from a Jim Thompson novel for cryin' out loud! These are supposed to be marginal characters, not fashion models. Though McQueen and McGraw were attractive (but check out McQueen's crummy prison haircut) they were believable in the role. Baldwin and Bassinger seem like movie stars trying to act like hard cases. Action wise, the robbery scene in the Pekinpah version was about 100 times more exciting and suspenseful than anything in this re-make."}
{'label': 0, 'title': 'Author seems mentally unstable', 'content': 'I know that Tom Robbins has a loyal following and I started the book with high expectations. However, I did not enjoy this book as it was too much work to follow his confused logic. I think that he was under the influence during most of time that he wrote.'}


# Text Classification with Representation-Based Models

Now that we have an idea of what our data looks like, we can proceed to load a pre-trained Transformer model for text classification.

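We will use a model from the Hugging Face model hub: a DistilBERT checkpoint fine-tuned on SST-2, a binary sentiment dataset built from movie-review sentences. Because the model was not trained on Amazon reviews, the results below measure how well it transfers to a different domain.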


In [12]:
# Import the pipeline function from the transformers library
from transformers import pipeline

# Here, we use a sentiment analysis model from Hugging Face's model hub that is specifically designed for binary sentiment analysis.
model_path = "distilbert-base-uncased-finetuned-sst-2-english"


# Load the model into a pipeline for easy inference
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,  # Newer transformers releases deprecate this flag in favor of top_k=None
    device="cuda:0"  # Assumes a CUDA GPU is present; pass device="cpu" to run without one
)


# Running Sentiment Analysis on Sample Data

Let's use the loaded model to classify a sample review from the dataset and inspect the scores it returns before we evaluate it on the full test sample.


In [33]:

# Run sentiment analysis on the first review
sample_review = data["train"][0]["content"]
result = pipe(sample_review)
print(f"Review: {sample_review}")
print(f"Sentiment Analysis Result: {result}")

Review: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
Sentiment Analysis Result: [[{'label': 'NEGATIVE', 'score': 0.0008272510604001582}, {'label': 'POSITIVE', 'score': 0.9991727471351624}]]
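
The result is a nested list: one inner list per input text, holding one dictionary per label. A minimal sketch of reading it by label name rather than by position is shown below. `scores_by_label` is a hypothetical helper of ours, and the `result` value is copied from the output printed above so the snippet runs without loading the model.

```python
def scores_by_label(result):
    """Map each label name to its score for a single-input pipeline result."""
    # The outer list has one entry per input text; we passed a single review.
    return {entry["label"]: entry["score"] for entry in result[0]}

# Output copied from the cell above
result = [[
    {"label": "NEGATIVE", "score": 0.0008272510604001582},
    {"label": "POSITIVE", "score": 0.9991727471351624},
]]

scores = scores_by_label(result)
predicted = max(scores, key=scores.get)
print(predicted, round(scores[predicted], 4))  # POSITIVE 0.9992
```

Looking scores up by name keeps the code correct even if the order of the dictionaries changes, which can happen depending on how the pipeline is configured.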


# Evaluating the Model Performance

In [34]:
# Import necessary libraries
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset  # Exposes a single dataset column to the pipeline

# Run inference over the test sample and collect one predicted class per review
y_pred = []

# KeyDataset feeds the "content" column to the pipeline in batches; tqdm shows progress
for output in tqdm(pipe(KeyDataset(test_sample, "content"), batch_size=8), total=len(test_sample)):
    # With return_all_scores=True the scores arrive in label order: NEGATIVE first, POSITIVE second
    negative_score = output[0]["score"]
    positive_score = output[1]["score"]
    assignment = np.argmax([negative_score, positive_score])  # 0 = negative, 1 = positive
    y_pred.append(assignment)

# Display the first 10 predictions
print(f"First 10 Predictions: {y_pred[:10]}")


100%|██████████| 2000/2000 [00:13<00:00, 144.93it/s]

First 10 Predictions: [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]





In [35]:
from sklearn.metrics import classification_report
# To evaluate the model's performance, we will create a classification report.

# Extract true labels for the sampled data
y_true = test_sample["label"]

# Define a function to evaluate performance
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

# Evaluate the model performance
evaluate_performance(y_true, y_pred)


                 precision    recall  f1-score   support

Negative Review       0.86      0.91      0.88       982
Positive Review       0.91      0.85      0.88      1018

       accuracy                           0.88      2000
      macro avg       0.88      0.88      0.88      2000
   weighted avg       0.88      0.88      0.88      2000
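
To make the columns of the report concrete, here is a rough sketch of how precision, recall, and F1 are computed for one class. Each row of the report treats its own class as the "positive" one. The function `binary_report` and the toy labels are hypothetical and are not taken from the notebook's data.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 1, 0, 1, 0]

print(binary_report(toy_true, toy_pred, positive=1))  # row for the positive class
print(binary_report(toy_true, toy_pred, positive=0))  # row for the negative class
```

Read this way, the report above says the model recovers negative reviews more often (recall 0.91) than positive ones (recall 0.85), while its positive predictions are the more precise of the two.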



# Classification Tasks That Leverage Embeddings

In [36]:
from sentence_transformers import SentenceTransformer

# Load model
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = embedding_model.encode(train_sample["content"], show_progress_bar=True)
test_embeddings = embedding_model.encode(test_sample["content"], show_progress_bar=True)

train_embeddings.shape



(10000, 768)

In [38]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train_sample["label"])

# Predict previously unseen instances
y_pred_embeddings = clf.predict(test_embeddings)

# Evaluate the performance of the embedding-based classification
evaluate_performance(test_sample["label"], y_pred_embeddings)


                 precision    recall  f1-score   support

Negative Review       0.88      0.89      0.88       982
Positive Review       0.89      0.88      0.89      1018

       accuracy                           0.89      2000
      macro avg       0.88      0.89      0.88      2000
   weighted avg       0.89      0.89      0.89      2000



# What if we don't use a classifier at all?

Instead, we can average the training embeddings per class and use cosine similarity to assign each test document to the class whose average embedding it matches best:

In [39]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

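# Average the embeddings of all documents in each target label.
# The label is appended as an extra column, so with 768-dimensional embeddings it sits at column index 768.
# groupby sorts by label, so row 0 is the negative-class average and row 1 the positive-class average.
df = pd.DataFrame(np.hstack([train_embeddings, np.array(train_sample["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values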

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred_no_classifier = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(test_sample["label"], y_pred_no_classifier)


                 precision    recall  f1-score   support

Negative Review       0.82      0.80      0.81       982
Positive Review       0.82      0.83      0.82      1018

       accuracy                           0.82      2000
      macro avg       0.82      0.82      0.82      2000
   weighted avg       0.82      0.82      0.82      2000
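
The idea behind this classifier-free approach can be sketched in plain Python on toy data. The two-dimensional vectors below are made-up stand-ins for real embeddings, and the helper names (`cosine`, `class_centroids`, `predict`) are our own, chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_centroids(vectors, labels):
    """Average the vectors belonging to each label; returns {label: centroid}."""
    grouped = {}
    for vec, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(vec, centroids):
    """Pick the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy "embeddings" (hypothetical): label 0 points along x, label 1 along y
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
toy_labels = [0, 0, 1, 1]

centroids = class_centroids(toy_vectors, toy_labels)
print(predict([0.8, 0.3], centroids))  # 0
print(predict([0.3, 0.9], centroids))  # 1
```

Nothing is trained here beyond computing two averages, which is why the approach is cheap. It also explains the drop in accuracy relative to logistic regression: a single average vector per class is a much coarser decision rule.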



# Zero-shot Classification

In [40]:
# Create embeddings for our labels
label_embeddings = embedding_model.encode(["A negative review", "A positive review"])

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred_zero_shot = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(test_sample["label"], y_pred_zero_shot)


                 precision    recall  f1-score   support

Negative Review       0.80      0.72      0.76       982
Positive Review       0.75      0.82      0.79      1018

       accuracy                           0.77      2000
      macro avg       0.78      0.77      0.77      2000
   weighted avg       0.77      0.77      0.77      2000
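
To compare the four approaches at a glance, the accuracy values from the classification reports above can be collected in one place. This is only a convenience sketch: the numbers are copied by hand from the reports, and the names in the dictionary are our own shorthand.

```python
# Accuracy values copied from the classification reports above
accuracies = {
    "Task-specific model (DistilBERT SST-2)": 0.88,
    "Embeddings + logistic regression": 0.89,
    "Embeddings + class averages": 0.82,
    "Zero-shot label embeddings": 0.77,
}

# Sort from best to worst accuracy
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<42} {acc:.2f}")
```

The ordering follows how much labeled data each approach uses: the zero-shot variant sees no Amazon labels at all, the class averages and logistic regression both use the 10,000 labeled training reviews, and the task-specific model relies on labels from a different dataset.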

