<a href="https://colab.research.google.com/github/itsdivya1309/Machine-Learning/blob/main/LLMs/Text%20Classification/Text_Classification_Representation_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with Representation Model

Here, we'll focus on binary sentiment classification of Rotten Tomatoes movie reviews.

We can accomplish this task in two ways:

### 1. Perform classification directly with a task-specific model

### 2. Perform classification indirectly with general-purpose embeddings

We'll use pre-trained models for now.

In [1]:
# Importing the dataset
!pip install datasets

from datasets import load_dataset

# Load the Rotten Tomatoes movie review dataset
data = load_dataset('rotten_tomatoes')
data


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Using a Task-specific model

We'll use the `Twitter-RoBERTa-base for Sentiment Analysis` model. This is a RoBERTa model fine-tuned on tweets for sentiment analysis.

In [2]:
!pip install transformers



In [3]:
# Loading the model
from transformers import pipeline

# Path to our model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    top_k=None,
)


In [4]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

In [5]:
sample = data['train']['text'][0]

In [6]:
output = pipe(sample)
output

[[{'label': 'positive', 'score': 0.9073736071586609},
  {'label': 'neutral', 'score': 0.0880218893289566},
  {'label': 'negative', 'score': 0.00460449093952775}]]

In [7]:
# Get the best sentiment prediction
best_prediction = max(output[0], key=lambda x: x['score'])

# Print only the most confident sentiment
print(f"Predicted Sentiment: {best_prediction['label']} (Confidence: {best_prediction['score']:.2f})")

Predicted Sentiment: positive (Confidence: 0.91)


The model classifies text into `positive`, `negative` and `neutral` categories.

In [9]:
# A list to store predictions
y_pred = []

# Iterate through test dataset
for output in tqdm(pipe(data['test']['text']), total=len(data['test'])):
    # Convert output into a dictionary for easy lookup
    scores = {entry['label']: entry['score'] for entry in output}

    # Extract scores safely
    negative_score = scores.get('negative', 0)  # Default to 0 if not found
    positive_score = scores.get('positive', 0)

    # Get predicted class (0 = Negative, 1 = Positive)
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:00<00:00, 100705.62it/s]


### Understanding the General Output Format

When we use `pipe(text)`, the model gives us an output list where each item is a dictionary like this:

```
[{'label': 'POSITIVE', 'score': 0.98}]
```
or

```
[{'label': 'NEGATIVE', 'score': 0.85}]
```
Now, the problem is:

We don't know for sure if 'NEGATIVE' is always at index 0 and 'POSITIVE' is at index 1. The order might change depending on the model output.

Hence, we need to check the order of the class labels in the output list before assigning the scores.

We check the first prediction (`output[0]`).
If it's 'NEGATIVE', we take `output[0]['score']` as the negative score and `output[1]['score']` as the positive score. Otherwise, we swap them.

**Example Scenarios**

1. Example 1: Model Outputs NEGATIVE First

```
output = [{'label': 'NEGATIVE', 'score': 0.80}, {'label': 'POSITIVE', 'score': 0.20}]
```
`output[0]['label'] == 'NEGATIVE'`, so:

```
negative_score = 0.80  # output[0]['score']
positive_score = 0.20  # output[1]['score']
```

2. Example 2: Model Outputs POSITIVE First

```
output = [{'label': 'POSITIVE', 'score': 0.75}, {'label': 'NEGATIVE', 'score': 0.25}]
```

`output[0]['label'] == 'POSITIVE'`, so we enter the else block:

```
negative_score = 0.25  # output[1]['score']
positive_score = 0.75  # output[0]['score']
```

**The entries in the output list are ordered by their scores, highest first.**

This means the position of each label varies from text to text, so we cannot rely on a fixed index.
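
One way to sidestep the ordering problem is to look scores up by label name instead of by position. The sketch below assumes each result is a list of `{'label', 'score'}` dictionaries as shown above; the helper name `to_binary_label` is hypothetical and not part of `transformers`.

```python
def to_binary_label(output):
    """Map one pipeline result (a list of {'label', 'score'} dicts) to 0 or 1."""
    # Index scores by lower-cased label name, so the list order no longer matters
    scores = {entry['label'].lower(): entry['score'] for entry in output}

    # Labels that are missing from the output default to a score of 0
    negative_score = scores.get('negative', 0.0)
    positive_score = scores.get('positive', 0.0)

    # 0 = negative, 1 = positive; a tie goes to negative
    return int(positive_score > negative_score)


negative_first = [{'label': 'NEGATIVE', 'score': 0.80}, {'label': 'POSITIVE', 'score': 0.20}]
positive_first = [{'label': 'POSITIVE', 'score': 0.75}, {'label': 'NEGATIVE', 'score': 0.25}]

print(to_binary_label(negative_first), to_binary_label(positive_first))
```

Because the lookup ignores any extra labels, the same helper also works for a three-class output that includes `neutral`.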

In [10]:
# Evaluation
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print classification report"""
    report = classification_report(
        y_true, y_pred,
        target_names=['Negative Review', 'Positive Review']
    )
    print(report)

In [11]:
evaluate_performance(data['test']['label'], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
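
To make the columns of the report easier to read, here is a minimal pure-Python sketch of how precision, recall, F1 and support are defined for a single class. The function name `per_class_metrics` and the toy labels are assumptions made for illustration; the notebook itself relies on `classification_report`.

```python
def per_class_metrics(y_true, y_pred, target_class):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted as the class and actually the class
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    # False positives: predicted as the class but actually another class
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    # False negatives: actually the class but predicted as another class
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support


# Toy labels (0 = negative review, 1 = positive review)
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

for cls in (0, 1):
    print(cls, per_class_metrics(toy_true, toy_pred, cls))
```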



### Using another pre-trained model

Let's use the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model this time.

In [12]:
# Loading the model
model_path = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'

# Creating a pipeline
pipe_distilbert = pipeline(
    'sentiment-analysis',
    model=model_path,
    tokenizer=model_path,
    top_k=None
)


In [13]:
another_sample = data['validation']['text'][-1]
sample_label = data['validation']['label'][-1]
print('Text\n',another_sample)
print('Label: ', sample_label)

Text
 the feature-length stretch . . . strains the show's concept .
Label:  0


In [14]:
model_output = pipe_distilbert(another_sample)
model_output

[[{'label': 'NEGATIVE', 'score': 0.9998082518577576},
  {'label': 'POSITIVE', 'score': 0.00019182452524546534}]]

In [15]:
# Get the best sentiment prediction
best_prediction = max(model_output[0], key=lambda x: x['score'])

# Print only the most confident sentiment
print(f"Predicted Sentiment: {best_prediction['label']} (Confidence: {best_prediction['score']:.2f})")

Predicted Sentiment: NEGATIVE (Confidence: 1.00)


In [16]:
# Make predictions on the test data

# A list to store predictions
y_pred = []

# Iterate through test dataset
for output in tqdm(pipe_distilbert(data['test']['text']), total=len(data['test'])):
    # Ensure correct label matching
    if output[0]['label']=='NEGATIVE':
        negative_score = output[0]['score']
        positive_score = output[1]['score']
    else:
        negative_score = output[1]['score']
        positive_score = output[0]['score']
    # Get predicted class (0=Negative, 1=Positive)
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:00<00:00, 166021.61it/s]


In [17]:
evaluate_performance(data['test']['label'], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.89      0.90      0.90       533
Positive Review       0.90      0.89      0.90       533

       accuracy                           0.90      1066
      macro avg       0.90      0.90      0.90      1066
   weighted avg       0.90      0.90      0.90      1066



Both models perform well, considering that neither was trained on this dataset. DistilBERT likely performs better because it was fine-tuned on SST-2, a sentiment dataset of movie-review sentences, whereas the RoBERTa model was fine-tuned on tweets.



---

## Text Classification Using Embedding Models

We use an embedding model to generate features, which are then fed to a classifier.

In the first step, we convert our text input to embeddings using an embedding model. These embeddings are numerical representations of our input text. In the second step, these embeddings serve as the input features for a trainable classifier like Logistic Regression.
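
The two-step flow can be sketched end to end without downloading a model. Everything here is a toy stand-in: `embed` counts words from a hypothetical four-word vocabulary in place of a real embedding model, and the hand-written gradient-descent loop stands in for a library classifier such as scikit-learn's `LogisticRegression`.

```python
import math

VOCAB = ["good", "great", "bad", "boring"]  # hypothetical toy vocabulary


def embed(text):
    """Step 1 (toy stand-in): turn text into a fixed-length numeric vector."""
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Step 2: fit a logistic regression on the vectors with batch gradient descent."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Average the gradients over the batch and take one step
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(text, weights, bias):
    """Embed a new text and classify it: 1 = positive, 0 = negative."""
    x = embed(text)
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


train_texts = ["good great movie", "great fun", "bad boring plot", "boring"]
train_labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression([embed(t) for t in train_texts], train_labels)
print(predict("a great film", weights, bias), predict("bad and boring", weights, bias))
```

Only the classifier is trained; the embedding step stays fixed, which is what keeps this approach cheap compared with fine-tuning the whole model.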

Let's use the `all-mpnet-base-v2` model from the `sentence-transformers` library to generate embeddings.


In [18]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Create embeddings for train and test data
train_embeddings = model.encode(data['train']['text'], show_progress_bar=True)
test_embeddings = model.encode(data['test']['text'], show_progress_bar=True)

In [19]:
train_embeddings.shape

(8530, 768)

In [21]:
# Training a classifier on the embeddings
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=42)
classifier.fit(train_embeddings, data['train']['label'])

In [22]:
y_pred = classifier.predict(test_embeddings)
evaluate_performance(data['test']['label'], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



### What if we do not have Labelled data?

In the case where we do not have labelled data but only the labels, we can perform ***Zero-shot classification***.

We can describe our labels based on their definitions. For example, the `negative` class in movie reviews can be described as `'This is a negative movie review'`. Then, we create embeddings for these label descriptions.

So now we have *text-embeddings* and *label-embeddings*. To assign labels to texts (or documents), we can use the cosine similarity of the document-label pairs.

**Cosine similarity** is the cosine of the angle between two vectors, calculated as the dot product of the vectors divided by the product of their lengths. We can use cosine similarity to check how similar a given text is to the description of each candidate label.
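
As a rough illustration of that definition, the sketch below computes cosine similarity by hand and assigns a text to the label with the highest similarity. The two-dimensional vectors are hypothetical toy values rather than real embeddings, and the helper names `cosine_sim` and `assign_label` are made up for this example.

```python
import math


def cosine_sim(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_label(text_vec, label_vecs):
    """Return the index of the label vector most similar to the text vector."""
    sims = [cosine_sim(text_vec, label_vec) for label_vec in label_vecs]
    return sims.index(max(sims))


# Hypothetical toy vectors: index 0 = negative label, index 1 = positive label
label_vecs = [[1.0, 0.0], [0.0, 1.0]]
text_vec = [0.2, 0.9]

print(assign_label(text_vec, label_vecs))
```

Cosine similarity depends only on direction, so scaling a vector by a positive constant leaves the result unchanged.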



In [23]:
# Creating embeddings for our labels
label_embeddings = model.encode(['A negative movie review', 'A positive movie review'])

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label of each text
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data['test']['label'], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.76      0.79       533
Positive Review       0.78      0.85      0.81       533

       accuracy                           0.80      1066
      macro avg       0.80      0.80      0.80      1066
   weighted avg       0.80      0.80      0.80      1066



In [25]:
# Creating another set of embeddings for our labels
label_encodings = model.encode(['A very negative movie review', 'A very positive movie review'])
sim_matrix = cosine_similarity(test_embeddings, label_encodings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data['test']['label'], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.86      0.73      0.79       533
Positive Review       0.76      0.88      0.82       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



This shows that the wording of our label descriptions affects the results: accuracy stays at 0.80 here, but the per-class precision and recall shift.
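
When overall accuracy barely moves, one simple way to see how much two sets of label descriptions actually differ is to count the texts on which their predictions disagree. This is a minimal sketch; `disagreement_rate` is a hypothetical helper, and the prediction lists are toy values, not results from the notebook.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of examples on which two prediction lists differ."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    differing = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return differing / len(preds_a)


# Toy predictions from two hypothetical label-description variants
preds_plain = [0, 1, 1, 0, 1]
preds_very = [0, 1, 0, 0, 1]

print(disagreement_rate(preds_plain, preds_very))
```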

---