## Week 5 Large Language Models (LLMs)

Welcome to this notebook on Large Language Models! In this session, we will explore the fundamentals of working with LLMs using the Hugging Face Transformers library. We'll cover loading pre-trained models, understanding their inputs and outputs, tokenization, the generation process, and techniques to optimize generation speed. By the end of this notebook, you'll have a practical understanding of how to use an LLM for text generation.

### Setting up the Environment

Before we get into the code, we need to install the necessary Python libraries.
- `transformers`: This is the Hugging Face library that provides access to thousands of pre-trained models for Natural Language Processing (NLP), including LLMs.
- `accelerate`: This library helps in utilizing hardware accelerators like GPUs and TPUs efficiently, and is often required for running larger models or for distributed training/inference.

The following cell will install these libraries. The `%%capture` command is used to suppress the output of the pip installation process to keep the notebook clean.

In [1]:
%%capture
!pip install "transformers>=4.41.2" "accelerate>=0.31.0"

# Loading the LLM

### Loading a Pre-trained LLM and Tokenizer

To work with an LLM, we typically need two main components:
1.  **Tokenizer**: This converts our input text into a format that the model can understand (i.e., numerical IDs or tokens). It also handles converting the model's numerical output back into human-readable text.
2.  **Model**: This is the pre-trained neural network itself that performs the language understanding and generation tasks.
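
To make the tokenizer's two jobs concrete, here is a minimal sketch of an encode/decode round trip. The word-level vocabulary and the `encode`/`decode` helpers are hypothetical and only for illustration; real LLM tokenizers such as the one used by Phi-3 rely on learned subword vocabularies with tens of thousands of entries.

```python
# Hypothetical word-level vocabulary (illustration only, not the Phi-3 vocabulary).
vocab = {"<unk>": 0, "the": 1, "capital": 2, "of": 3, "france": 4, "is": 5, "paris": 6}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(text):
    """Text -> list of token IDs; unknown words map to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids):
    """List of token IDs -> text."""
    return " ".join(id_to_token[token_id] for token_id in ids)

ids = encode("The capital of France is")
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # the capital of france is
```

The model only ever sees the integer IDs; the tokenizer is what connects them to readable text in both directions.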

We will be using the `microsoft/Phi-3-mini-4k-instruct` model, a powerful and relatively small LLM suitable for instruction-following tasks.

**Key parameters in `AutoModelForCausalLM.from_pretrained`:**
-   `device_map="cuda"`: Places the model on the GPU, which significantly speeds up computation. This value requires a CUDA-capable GPU; on a machine without one, use `device_map="cpu"` (or `device_map="auto"` to let `accelerate` choose the placement).
-   `torch_dtype="auto"`: Loads the weights in the data type recorded in the model's configuration, typically a 16-bit type such as float16 or bfloat16, which uses less memory and computes faster on compatible GPUs than float32.
-   `trust_remote_code=False`: Prevents the library from executing custom Python code shipped with the model repository. Keep this set to `False` unless a model explicitly requires custom code from the Hugging Face Hub and you trust the source; the model's documentation will say when `True` is needed.
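
The effect of the data type on memory can be estimated with simple arithmetic: number of parameters times bytes per parameter. The sketch below assumes roughly 3.8 billion parameters for Phi-3-mini (the figure from its model card, not something this notebook computes) and ignores activations and the KV cache, which need additional memory.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gb(num_params, dtype):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

phi3_mini_params = 3.8e9  # assumed parameter count, taken from the model card

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_gb(phi3_mini_params, dtype):.1f} GB")
```

Under these assumptions a 16-bit data type halves the weight footprint compared with float32, which is often the difference between fitting on a single consumer or Colab GPU and not fitting at all.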

**Pipeline for Text Generation:**
The Hugging Face `pipeline` is a high-level utility that simplifies common NLP tasks. For text generation, it wraps the model and tokenizer, handling much of the pre-processing and post-processing for us.
- `return_full_text=False`: This ensures that the pipeline only returns the newly generated text, not the input prompt plus the generated text.
- `max_new_tokens=50`: This limits the length of the generated text to a maximum of 50 new tokens.
- `do_sample=False`: When `False`, the model will use greedy decoding, always picking the token with the highest probability at each step. If set to `True`, it enables various sampling strategies (like temperature, top-k, top-p) for more diverse outputs.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

Device set to use cuda


# The Inputs and Outputs of a Trained Transformer LLM

### Generating Text with the LLM

Once the LLM and tokenizer are loaded (often encapsulated in a `pipeline`), we can provide it with a **prompt**. A prompt is simply the input text that we want the model to respond to or complete. The model will then process this prompt and generate a continuation or an answer based on its training.

In the following cell, we'll define a sample prompt and use our `generator` pipeline to get the LLM's output.

In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my sincerest apologies for the unfortunate incident that occurred


### Understanding the Model Architecture

LLMs, like the Transformer model we're using, have a complex internal structure composed of layers such as embedding layers, attention mechanisms, and feed-forward networks. Printing the `model` object will display this architecture, giving us insight into its components (e.g., number of layers, hidden sizes, types of attention used). This can be useful for advanced users who want to understand the model's specifics or modify its behavior.

In [4]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, 

# Choosing a single token from the probability distribution (sampling / decoding)

### How LLMs Generate Text: From Probabilities to Tokens

At its core, an LLM, when generating text, predicts the next token in a sequence. It does this by outputting a probability distribution over its entire vocabulary for the next token position. The process generally involves:
1.  **Tokenization**: The input prompt is converted into a sequence of token IDs.
2.  **Model Forward Pass**: These token IDs are passed through the model. The main part of the model (e.g., `model.model(...)` in Hugging Face) processes these tokens and outputs hidden states (contextual representations) for each token.
3.  **Language Modeling Head**: The hidden state of the last token (or the token for which we want to predict the next one) is then passed through a "language modeling head" (`model.lm_head(...)`). This head is typically a linear layer that projects the hidden state's dimension to the vocabulary size. The output of this layer, often called **logits**, represents the unnormalized scores for each token in the vocabulary.
4.  **Softmax (Implicit)**: These logits are then (often implicitly, or as part of a generation strategy) converted into probabilities using a softmax function.
5.  **Decoding/Sampling**: A token is then chosen from this probability distribution. The simplest method is **greedy decoding**, where the token with the highest probability is selected. More advanced methods involve sampling (e.g., using temperature, top-k, or top-p sampling) to introduce randomness and diversity in the generated text.

The following cells will demonstrate these steps manually.

In [5]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input IDs to the GPU
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

### Greedy Decoding

Here, we apply `argmax(-1)` to the logits of the last token's output (`lm_head_output[0,-1]`).
- `lm_head_output[0,-1]` selects the logits for the last token in our input sequence.
- `.argmax(-1)` finds the index (which corresponds to a token ID in the vocabulary) of the token with the highest score (highest probability after softmax).
This is the essence of greedy decoding: always picking the most likely next token.

In [6]:
token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)

'Paris'
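
The step from logits to a chosen token can also be sketched without a model. The example below uses a hypothetical four-word vocabulary and made-up logits (not values from Phi-3) to contrast greedy decoding with temperature sampling; the helper names are illustrative only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    max_logit = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the largest logit (the argmax is unchanged by softmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw an index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical toy vocabulary and logits.
vocab = ["Paris", "London", "the", "a"]
logits = [5.0, 2.0, 1.0, 0.5]

print(vocab[greedy_pick(logits)])  # always "Paris"

rng = random.Random(0)  # fixed seed so the sampled sequence is repeatable
print([vocab[sample_pick(logits, temperature=1.5, rng=rng)] for _ in range(5)])
```

Greedy decoding is deterministic, whereas sampling can return lower-probability tokens; raising the temperature flattens the distribution and makes that more likely.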

### Understanding Output Shapes: Hidden States

The output from `model.model(input_ids)` (before the LM head) primarily consists of the last hidden states. The shape of this output tensor is typically `(batch_size, sequence_length, hidden_size)`.
-   `batch_size`: The number of input sequences processed simultaneously (1 in our case).
-   `sequence_length`: The number of tokens in the input sequence.
-   `hidden_size`: The dimensionality of the contextual embeddings (hidden states) for each token.

In [7]:
model_output[0].shape

torch.Size([1, 5, 3072])

### Understanding Output Shapes: Logits

The output from the language modeling head (`model.lm_head(...)`) are the logits. The shape of this tensor is typically `(batch_size, sequence_length, vocab_size)`.
-   `batch_size`: Same as above.
-   `sequence_length`: Same as above. For each token in the input sequence, the model predicts a distribution for the *next* token.
-   `vocab_size`: The total number of unique tokens the model knows. The values along this dimension are the raw scores (logits) for each token in the vocabulary being the next token.

In [8]:
lm_head_output.shape

torch.Size([1, 5, 32064])

# Speeding up generation by caching keys and values

### Optimizing Generation: Key-Value Caching

Generating text token by token is computationally expensive because, without caching, the model recomputes the attention inputs of every previous token each time it produces a new one. The Transformer's self-attention mechanism relies on Query (Q), Key (K), and Value (V) matrices.

**Key-Value (KV) Caching** is a crucial optimization technique. Once the K and V vectors have been computed for a token, they can be saved (cached) and reused when generating subsequent tokens. This avoids redundant computation for the part of the sequence that has already been processed. When generating the $(i+1)^{th}$ token, the model only needs to compute Q, K, and V for the $i^{th}$ token and can reuse the cached keys and values from tokens $1, \dots, i-1$.

Most Hugging Face models have `use_cache=True` enabled by default in their `generate` method configuration. We will explicitly compare the generation time with and without the cache to observe its impact.
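
The idea can be illustrated with a toy single-head attention layer in plain Python. Everything here (the 2-dimensional projection matrices, the token vectors, and the helper names) is hypothetical and heavily simplified: a real model has many layers and heads, and without a cache it recomputes the whole prefix through all of them. The sketch only counts key/value projections to show where the savings come from.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical projection matrices and token embeddings.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 0.3], [0.4, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

def decode_without_cache(tokens):
    """At every step, re-project keys and values for the whole prefix."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [matvec(W_K, x) for x in prefix]
        values = [matvec(W_V, x) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(matvec(W_Q, tokens[t]), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project keys and values once per token and append them to a cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        key_cache.append(matvec(W_K, x))
        value_cache.append(matvec(W_V, x))
        projections += 2
        outputs.append(attend(matvec(W_Q, x), key_cache, value_cache))
    return outputs, projections

slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)  # 20 8
```

Both variants produce the same attention outputs, but the uncached version performs a number of projections that grows quadratically with sequence length, while the cached version grows linearly. The price is the memory needed to hold the cache.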

In [9]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

### Timing Generation with KV Cache Enabled

The following cell uses the `%%timeit` magic command to measure the execution time of the `model.generate()` method with `use_cache=True` (which is often the default and explicitly stated here for clarity). This will give us a baseline for cached generation.

In [10]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


6.69 s ± 2.33 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Timing Generation with KV Cache Disabled

Now, let's measure the execution time when `use_cache=False`. We expect this to be significantly slower, especially for longer sequences, as the model will recompute keys and values for all preceding tokens at each generation step. This comparison will highlight the efficiency gains from using the KV cache.

In [11]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

31.2 s ± 421 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Text Classification

## Overview of Text Classification

In this section, we'll explore various methods for categorizing text data. Text classification is a fundamental Natural Language Processing (NLP) task with numerous applications, such as sentiment analysis, topic labeling, spam detection, and more.

We will cover:
-   Loading and preparing text datasets.
-   Using **representation models**:
    -   Employing task-specific models fine-tuned for classification.
    -   Leveraging pre-trained embeddings with traditional machine learning classifiers.
    -   Performing zero-shot classification using embeddings.
-   Using **generative models**:
    -   Prompting encoder-decoder models (like FLAN-T5) for classification.
    -   Utilizing hosted large language models (such as ChatGPT or Gemini) via APIs for classification tasks.

Let's dive in!

### Environment Setup

To run the examples in this notebook, especially if you're using a cloud environment like Google Colab, it's crucial to install the required libraries. The cell below lists the necessary packages:
- `datasets`: For easily loading and manipulating datasets.
- `transformers`: Hugging Face's library for state-of-the-art machine learning models.
- `sentence-transformers`: For easily computing embeddings for sentences and paragraphs.
- [`google-generativeai`](https://pypi.org/project/google-generativeai/) : To interact with Google Gemini.

**Important:**
-   If you haven't installed these packages in your current environment, please **run the following code cell.**
-   For optimal performance with many of these models, a **GPU is highly recommended**. If you are on Google Colab, ensure your runtime is set to use a GPU (e.g., T4) by navigating to **Runtime > Change runtime type > Hardware accelerator > GPU**.

In [12]:
%%capture
!pip install datasets transformers sentence-transformers google-generativeai

## Data
### Loading the Dataset: Rotten Tomatoes Movie Reviews

For our text classification tasks, we'll be using the "rotten_tomatoes" dataset. This dataset contains movie reviews labeled as either positive or negative, making it ideal for binary sentiment classification. The `datasets` library from Hugging Face allows us to download and load this dataset with ease.

In [15]:
!pip install -U datasets fsspec



In [1]:
from datasets import load_dataset



# Load our data
data = load_dataset("rotten_tomatoes")
data


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Exploring the Data

The `data` object is a `DatasetDict` which typically contains splits like 'train', 'validation', and 'test'. Each split is a `Dataset` object. We can see the features available (e.g., 'text' for the review and 'label' for the sentiment) and the number of rows in each split. Let's inspect a couple of examples to understand the data format better.

In [2]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

# **Text Classification with Representation Models**

### Introduction to Representation Models for Classification

Representation models are typically pre-trained models that have learned to convert text into meaningful numerical representations (embeddings). These representations can then be used for various downstream tasks, including classification. We'll first look at models that are already fine-tuned specifically for classification tasks.

## **Using a Task-specific Model**

### Leveraging Pre-Fine-tuned Classification Models

Some models on the Hugging Face Hub are already fine-tuned for specific tasks like sentiment analysis. We can use these directly without needing to train them further on our specific dataset (though fine-tuning on your specific data can often improve performance).

Here, we'll use the `cardiffnlp/twitter-roberta-base-sentiment-latest` model, which is a RoBERTa-based model fine-tuned for sentiment analysis. The Hugging Face `pipeline` function simplifies using such models for inference.

**Key parameters for the pipeline:**
- `model`: Specifies the pre-trained model to use.
- `tokenizer`: Often the same as the model path, it specifies the tokenizer compatible with the model.
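- `return_all_scores=True`: This ensures the pipeline returns scores for all possible labels (e.g., negative, neutral, positive), not just the predicted one. Recent `transformers` releases mark this argument as deprecated in favor of `top_k=None`; note that `top_k=None` returns the labels sorted by score, so an index-based lookup of a specific label would then need to match on the `label` field instead.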
- `device="cuda:0"`: Assigns the pipeline to run on the first available CUDA-enabled GPU for faster processing.

In [3]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


### Batch Inference and Score Processing

To efficiently process our entire test set, we use `KeyDataset` from `transformers.pipelines.pt_utils`. This allows the pipeline to iterate over the 'text' field of our test data.
The `cardiffnlp` model outputs scores for 'negative', 'neutral', and 'positive' sentiments. Since our dataset is binary (positive/negative), we'll compare the scores for 'negative' (index 0 in the output list) and 'positive' (index 2) and assign the label with the higher score.
`tqdm` is used to display a progress bar during inference.

In [4]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:24<00:00, 43.46it/s]


### Evaluating Model Performance

To understand how well our model is performing, we need to evaluate its predictions against the true labels. The `classification_report` from `scikit-learn` provides key metrics:
-   **Precision**: Of all the instances the model predicted as positive, what fraction was actually positive? (TP / (TP + FP))
-   **Recall (Sensitivity)**: Of all the actual positive instances, what fraction did the model correctly identify? (TP / (TP + FN))
-   **F1-score**: The harmonic mean of precision and recall, providing a single score that balances both. (2 * (Precision * Recall) / (Precision + Recall))
-   **Support**: The number of actual instances for each class.
-   **Accuracy**: The overall proportion of correctly classified instances. ((TP + TN) / Total)
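
As a worked example of these formulas, the sketch below computes the metrics by hand for one class treated as "positive". The labels are made up for illustration and `binary_metrics` is a hypothetical helper; `classification_report` reports the same quantities once per class, treating each class in turn as the positive one.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and accuracy from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(binary_metrics(y_true, y_pred))
# {'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'accuracy': 0.75}
```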

In [5]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [6]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



## **Classification Tasks that Leverage Embeddings**

### Using General-Purpose Embeddings for Classification

Instead of using a model fine-tuned for a specific classification task, we can use models that are good at generating general-purpose text embeddings. These embeddings capture the semantic meaning of the text and can then be fed into simpler, traditional machine learning classifiers or used in other ways for classification.


### Supervised Classification with Embeddings

In this approach, we perform two main steps:
1.  **Generate Embeddings**: Use a pre-trained sentence transformer model to convert each text document in our training and test sets into a fixed-size numerical vector (embedding).
2.  **Train a Classifier**: Train a standard supervised machine learning classifier (e.g., Logistic Regression, SVM) on these embeddings and their corresponding labels from the training set. Then, evaluate it on the test set embeddings.

We'll use the `sentence-transformers/all-mpnet-base-v2` model, which is known for producing high-quality sentence embeddings.

In [7]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [8]:
train_embeddings.shape

(8530, 768)

### Embedding Dimensions

The shape of `train_embeddings` shows `(number_of_documents, embedding_dimension)`. For `all-mpnet-base-v2`, the embedding dimension is 768. Each review is now represented by a vector of 768 numbers.

In [9]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [10]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



### Performance of Embedding-Based Supervised Classification

The classification report shows the performance of the Logistic Regression model trained on the sentence embeddings. This approach often yields strong results and is more flexible than using a task-specific model if you want to train on your own specific labels or dataset nuances.



What would happen if we did not train a classifier at all? Instead, we can average the embeddings per class and use cosine similarity to predict which class best matches each document:

### Alternative: Classification via Averaged Class Embeddings and Cosine Similarity

This is a simpler, non-parametric approach. Instead of training a separate classifier:
1.  For each class in the training set, calculate the average embedding of all documents belonging to that class. This gives a "prototype" embedding for each class.
2.  For each document in the test set, calculate its embedding.
3.  Predict the class by finding which class's prototype embedding is most similar (e.g., using cosine similarity) to the test document's embedding.
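
These three steps can be sketched on toy data in plain Python. The 2-dimensional "embeddings" and the helper names below are hypothetical; real sentence embeddings from `all-mpnet-base-v2` have 768 dimensions, but the arithmetic is the same.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fit_prototypes(embeddings, labels):
    """Step 1: average the embeddings of each class into one prototype vector."""
    by_label = {}
    for vector, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: mean_vector(vectors) for label, vectors in by_label.items()}

def predict(embedding, prototypes):
    """Step 3: return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

# Hypothetical toy "embeddings": class 0 points along x, class 1 along y.
train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

prototypes = fit_prototypes(train, labels)
print(predict([0.8, 0.3], prototypes), predict([0.0, 0.7], prototypes))  # 0 1
```
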
Let's see how this method performs.

In [11]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

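# Average the embeddings of all documents in each target label.
# The labels are appended as an extra last column (index 768, right after the
# 768 embedding dimensions) so that the rows can be grouped by label.
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values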

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### Zero-Shot Classification with Embeddings

Zero-shot classification aims to classify text into categories for which the model has not seen any labeled examples during its training. With sentence embeddings, we can achieve this by:
1.  Creating descriptive phrases for our target labels (e.g., "This review is positive," "This review is negative").
2.  Generating embeddings for these label descriptions using the same sentence transformer model.
3.  For each test document, generating its embedding.
4.  Calculating the cosine similarity between the document's embedding and each label description's embedding.
5.  Assigning the document to the label whose description embedding has the highest similarity.
This approach is powerful because it doesn't require any training data for the specific classes you're interested in, as long as you can describe them meaningfully.

In [12]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [14]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



### The Importance of Label Descriptions in Zero-Shot Learning

The performance of zero-shot classification can be quite sensitive to how you phrase your label descriptions. "A negative review" might yield different results than "This movie review expresses disappointment" or "A very negative movie review." Experimenting with these descriptions (a form of prompt engineering) is key to optimizing zero-shot performance.

## **Classification with Generative Models**

### Introduction to Generative Models for Classification

Generative Large Language Models (LLMs), such as T5 or GPT models, can also be adapted for classification tasks. Instead of outputting fixed representations, we prompt these models in a way that they *generate* text that indicates the class label.

### Encoder-decoder Models

### Using Encoder-Decoder Models like FLAN-T5

Encoder-decoder models, like Google's FLAN-T5 (Fine-tuned Language Net - Text-to-Text Transfer Transformer), are designed for text-to-text tasks. We can frame classification as such a task by providing the model with the input text and a question (the prompt), and instructing it to generate the class label as text (e.g., "positive" or "negative"). We'll use `google/flan-t5-small`, a smaller, more manageable version of FLAN-T5.

In [15]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

Device set to use cuda:0


In [16]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

### Prompting for Classification

The key here is how we structure the input to the FLAN-T5 model. We prepend a question, `"Is the following sentence positive or negative? "`, to each review and expect the model to generate the answer, ideally "positive" or "negative". We then map the generated text back to our numerical labels (0 or 1). Note that the inference code treats any output other than exactly "negative" as positive.
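
Because generated text can vary in casing, whitespace, or punctuation, a slightly more defensive mapping may be useful. The helper below is a hypothetical sketch (the name `text_to_label` and the `default` parameter are assumptions, not part of the notebook); it normalizes the text first and makes the fallback for unexpected answers explicit.

```python
def text_to_label(generated_text, default=1):
    """Map a generated answer to 0 (negative) or 1 (positive).

    Unrecognized answers fall back to `default`; pass default=-1 to flag them instead.
    """
    # Normalize: trim whitespace, lowercase, drop trailing punctuation
    cleaned = generated_text.strip().lower().rstrip(".!")
    if cleaned == "negative":
        return 0
    if cleaned == "positive":
        return 1
    return default

print([text_to_label(t) for t in ["negative", " Negative. ", "positive", "unsure"]])
```

Using `default=1` reproduces the behavior of treating everything that is not "negative" as positive, while `default=-1` lets you count how often the model goes off-script.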

In [17]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [00:49<00:00, 21.55it/s]


In [18]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### Leveraging Advanced API-based Models like Gemini

Very large and capable generative models, such as those in Google's Gemini series (e.g., Gemini 1.5 Flash, which we use here), can be accessed via APIs for sophisticated NLP tasks, including classification. These models often exhibit strong zero-shot or few-shot performance due to their extensive training.


In [19]:
import google.generativeai as genai
import os
from tqdm import tqdm

**Security Best Practice:** 

1.  **Get your API Key:** If you don't have one, you can obtain a Gemini API key from [Google AI Studio](https://aistudio.google.com/app/apikey).
2.  **Set it as an Environment Variable (Recommended):**
    * Store the key under the name `GEMINI_API_KEY`, either as a system environment variable or, in Colab, as a notebook secret.
3.  **Configure `genai`:** The code below reads the key from Colab Secrets via `userdata.get` and passes it to `genai.configure`. If the key is not found, it prints a message asking you to set it or to configure it directly.



A flag `is_api_key_configured` will be used to track if `genai.configure()` has been called.
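
If you run the notebook outside Colab, the key has to come from somewhere else. The sketch below is one possible lookup order, under the assumption that a regular environment variable should take precedence and Colab Secrets serve as a fallback; `load_gemini_api_key` is a hypothetical helper, not part of either library.

```python
import os

def load_gemini_api_key(name="GEMINI_API_KEY"):
    """Return the API key from the environment or Colab Secrets, or None if unavailable."""
    # 1) Regular environment variable (works locally and on servers)
    key = os.environ.get(name)
    if key:
        return key
    # 2) Colab Secrets (the google.colab package only exists inside Colab)
    try:
        from google.colab import userdata
    except ImportError:
        return None
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted
        return None

api_key_found = load_gemini_api_key() is not None
print("API key found:", api_key_found)
```

The function never prints the key itself, which helps keep it out of saved notebook outputs.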

In [1]:
import google.generativeai as genai # Ensure this is at the top if not already run
from google.colab import userdata


is_api_key_configured = False  # Set to True only after genai.configure() succeeds

try:
    # Attempt to read the API key from Colab Secrets
    gemini_api_key = userdata.get('GEMINI_API_KEY')
    genai.configure(api_key=gemini_api_key)
    is_api_key_configured = True
    print("Gemini API Key configured successfully from environment variable.")
except Exception:  # A missing secret raises a Colab-specific error rather than KeyError
    print("--------------------------------------------------------------------------------")
    print("🚨 GEMINI_API_KEY environment variable not found.")
    print("🚨 Please try one of the following options:")
    print("🚨 OPTION 1 (Recommended): Set an environment variable or Colab Secret named GEMINI_API_KEY.")
    print("🚨 OPTION 2 (For quick testing, less secure):")
    print("🚨   In the next code cell, uncomment the lines and replace 'YOUR_GEMINI_API_KEY_HERE' with your actual key.")
    print("🚨 Get an API key from Google AI Studio: https://aistudio.google.com/app/apikey")
    print("--------------------------------------------------------------------------------")

# If you need to use OPTION 2 (direct configuration), run the following cell after replacing the placeholder:

Gemini API Key configured successfully from environment variable.


-   **`generation_config`**: This dictionary holds settings that control how the model generates text.
    -   `temperature: 0`: Lower temperatures make the output more deterministic and less random. For classification, where we want a consistent answer ('0' or '1'), 0 is a good choice.
    -   `top_p: 1` and `top_k: 1`: With `top_k` set to 1, only the single most likely token is considered at each step, which reinforces the deterministic behavior.
-   **`genai.GenerativeModel(...)`**: This initializes the model. We'll wrap this in a `try-except` block to catch potential configuration errors. Be aware that an invalid API key may only surface later, on the first `generate_content` call.
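
To build intuition for what temperature does, here is the standard softmax-with-temperature formulation applied to a few made-up token scores. How Gemini applies temperature internally is not documented in this notebook, so treat this as a generic illustration; `softmax_with_temperature` and the logits are assumptions for the example.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, sharpened or flattened by temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Limit case: all probability mass on the highest-scoring token (greedy decoding)
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```

As the temperature drops, probability concentrates on the top-scoring token, which is why a temperature of 0 suits tasks that need one consistent answer.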

In [2]:
generation_config = {
  "temperature": 0,
  "top_p": 1,
  "top_k": 1,
}

model = None # Initialize model to None

if is_api_key_configured:
    try:
        model = genai.GenerativeModel(
            model_name="gemini-1.5-flash", # You can also try "gemini-1.0-pro"
            generation_config=generation_config,
            # safety_settings=... # Optional: configure safety settings if needed
        )
        print(f"Gemini model '{model.model_name}' initialized successfully.")
    except Exception as e:
        print(f"🚨 Error initializing Gemini model: {e}")
        print("🚨 This can happen if the API key is invalid, not authorized for the model, or if there are network issues.")
        print("🚨 Please double-check your API key and its permissions in Google AI Studio.")
else:
    print("🚨 Skipping model initialization because the API key was not configured.")

# Later cells check `if model:` before calling the API.

Gemini model 'models/gemini-1.5-flash' initialized successfully.


### Define a Function to Generate Text with Gemini
Let's create a helper function to make calls to the Gemini API. This will encapsulate the logic for sending a prompt and getting a response.

The `gemini_generation` function will:
1.  Take a `prompt_template` (our instruction to the model), `document_text` (the movie review), and the initialized `model_instance` as input.
2.  Construct the `full_prompt` by inserting the `document_text` into our `prompt_template`.
3.  Call `model_instance.generate_content(full_prompt)` to send the request to the Gemini API.
4.  Extract the generated text from the `response` using `response.text.strip()` (`strip()` removes any leading/trailing whitespace).
5.  Include basic error handling in case the API call fails.

In [3]:
def gemini_generation(prompt_template, document_text, model_instance):
    """Generate an output based on a prompt and an input document using Gemini."""
    if not model_instance:
        print("Error: Gemini model is not initialized.")
        return None

    full_prompt = prompt_template.replace("[DOCUMENT]", document_text)

    try:
        response = model_instance.generate_content(full_prompt)
        return response.text.strip()
    except Exception as e:
        print(f"Error during Gemini API call for document '{document_text[:50]}...': {e}")
        return None # Return None or raise error for more robust handling
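
API calls can fail transiently (rate limits, network hiccups), so a retry wrapper is a common complement to the basic error handling above. The sketch below is generic and does not call Gemini: `call_with_retries` is a hypothetical helper, the retry counts and delays are arbitrary, and the demo uses a fake function that fails twice before succeeding.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry with exponentially growing delays.

    Returns fn()'s result, or None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts:
                print(f"Giving up after {max_attempts} attempts: {e}")
                return None
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... base_delay
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a fake call that fails twice before succeeding (no real API involved)
demo_state = {"calls": 0}

def flaky_call():
    demo_state["calls"] += 1
    if demo_state["calls"] < 3:
        raise RuntimeError("simulated transient error")
    return "1"

demo_delays = []
# Record the delays instead of actually sleeping
demo_result = call_with_retries(flaky_call, sleep=demo_delays.append)
print(demo_result, demo_delays)
```

To use this with Gemini, you would wrap the raw call, for example `lambda: model_instance.generate_content(full_prompt).text.strip()`, rather than `gemini_generation` itself, because `gemini_generation` already catches exceptions and returns `None`.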

### Define the Classification Prompt
The prompt is crucial. It tells the LLM exactly what we want it to do. For classification, we want it to output only '0' or '1'.

This prompt clearly instructs the model:
-   The task: Predict if a movie review is positive or negative.
-   The input format: Where the `[DOCUMENT]` (the review text) will be.
-   The desired output format: "If it is positive return 1 and if it is negative return 0. Do not give any other answers." This specificity helps in getting consistent and easily parsable outputs.

In [4]:
prompt_template = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""
print("Classification prompt template defined.")

Classification prompt template defined.


### Test with a Single Example
Before processing many documents, let's test our setup with a single movie review.

This will:
1.  Define an example `document_example`.
2.  Call our `gemini_generation` function with the `prompt_template`, `document_example`, and our initialized `model`.
3.  Print the model's prediction. This helps verify that the API connection, model, and prompt are working as expected.

In [5]:
document_example = "This movie was an absolute delight, full of charm and wit!"
# document_example = "A rather dull and uninspired film, unfortunately." # Try a negative one too!

print(f"Testing with document: \"{document_example}\"")

if model: # Check if the model was initialized successfully
    prediction_example = gemini_generation(prompt_template, document_example, model)
    print(f"Gemini Prediction for example: {prediction_example}")
else:
    print("Skipping single example test as the model was not initialized.")

Testing with document: "This movie was an absolute delight, full of charm and wit!"
Gemini Prediction for example: 1


### Prepare for Batch Processing (Load Data and Define Evaluation)
To evaluate the model on a larger scale, we need our test dataset and an evaluation function.



In [6]:
# Confirming 'data' is accessible (it should be if loaded from your notebook)
if 'data' in locals() and 'test' in data:
    print(f"Dataset 'rotten_tomatoes' found with a test set of {len(data['test'])} samples.")
    # Example: Access the first item's text and label from the test set
    # print(f"First test sample text: {data['test'][0]['text']}")
    # print(f"First test sample label: {data['test'][0]['label']}")
else:
    print("⚠️  'data' variable (dataset) or 'data[\"test\"]' not found. Please ensure it's loaded correctly.")
    # As a fallback, if you were running this standalone, you'd uncomment:
    # from datasets import load_dataset
    # data = load_dataset("rotten_tomatoes")
    # print("Loaded 'rotten_tomatoes' dataset.")


# Placeholder for the evaluation function (ensure this is defined in your environment from your notebook)
from sklearn.metrics import classification_report # Assuming this import is in your definition

def evaluate_performance(y_true, y_pred, target_names=None):
    """Create and print the classification report."""
    if target_names is None:
        target_names = ["Negative Review", "Positive Review"]
    try:
        performance = classification_report(
            y_true, y_pred,
            target_names=target_names,
            zero_division=0 # Handles cases where a class might not have any predictions
        )
        print(performance)
    except ValueError as e:
        print(f"Error during evaluation: {e}")
        print("This can happen if y_true and y_pred have different unique labels or are empty.")

# Quick check if evaluate_performance exists
if 'evaluate_performance' not in locals():
    print("⚠️  'evaluate_performance' function not found. Please ensure it's defined from your notebook.")

⚠️  'data' variable (dataset) or 'data["test"]' not found. Please ensure it's loaded correctly.


### Batch Prediction on the Test Set (Using data["test"])
Now, let's loop through our actual `data["test"]` set, get predictions from Gemini for each review, and store them.

This process involves:
1.  Initializing an empty list `gemini_predictions` to store the results.
2.  Using `data["test"]["text"]` for the input documents and `data["test"]["label"]` for the true labels.
    We'll use a sample size here for demonstration.
3.  Iterating through each document text using `tqdm` for a progress bar.
4.  Calling `gemini_generation` (defined above) for each document.
5.  **Handling API Responses:**
    * Checking if the prediction is valid (i.e., '0' or '1').
    * Converting the valid string prediction to an integer.
    * If the model returns something unexpected or an error occurs, we print a warning and append a default value (e.g., 0). This ensures `gemini_predictions` and the true labels list will have the same length for evaluation.
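
Appending a default of 0 keeps the lists aligned but counts every failed call as a "negative" prediction, which can skew the metrics. An alternative, sketched below under the assumption that you would rather exclude unparseable outputs, is to mark them as `None` and drop those pairs before evaluation. Both helper names are hypothetical.

```python
def parse_binary_label(raw_output):
    """Return 0 or 1 if the output is a clean binary answer, else None."""
    if raw_output is None:
        return None
    # Tolerate surrounding whitespace, quotes, backticks, and a trailing period
    cleaned = raw_output.strip().strip("'\"`.")
    return int(cleaned) if cleaned in ("0", "1") else None

def keep_classified(predictions, labels):
    """Drop (prediction, label) pairs whose prediction is None; also report how many were dropped."""
    pairs = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    kept_preds = [p for p, _ in pairs]
    kept_labels = [y for _, y in pairs]
    return kept_preds, kept_labels, len(predictions) - len(pairs)

# Made-up raw outputs illustrating typical variations
raw_outputs = ["1", " 0\n", "'1'", "positive", None]
parsed = [parse_binary_label(r) for r in raw_outputs]
kept_preds, kept_labels, n_dropped = keep_classified(parsed, [1, 0, 0, 1, 1])
print(parsed, kept_preds, kept_labels, n_dropped)
```

Dropping pairs changes the support of the evaluation, so it is worth reporting the number of dropped examples alongside the metrics.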

In [7]:
# Ensure 'data' is loaded and 'model' (Gemini model) is initialized from previous steps
if 'data' in locals() and 'test' in data and model:
    print("\nStarting batch prediction with Gemini on the 'rotten_tomatoes' test set...")
    gemini_predictions = []

    # Use a small sample to save API calls and time during this walkthrough.
    # sample_size is an illustrative value; set it to len(data["test"]) for a full evaluation.
    sample_size = 100
    test_texts = data["test"]["text"][:sample_size]
    true_labels = data["test"]["label"][:sample_size]  # Slice labels consistently with the texts
    print(f"Processing {len(test_texts)} samples from data['test']...")

    for doc_text in tqdm(test_texts):
        pred_text = gemini_generation(prompt_template, doc_text, model) # Ensure prompt_template and model are from earlier steps
        if pred_text is not None and pred_text in ['0', '1']:
            gemini_predictions.append(int(pred_text))
        else:
            print(f"Warning: Unexpected or no output ('{pred_text}') for document: {doc_text[:70]}...")
            # Strategy: Append a default value (e.g., 0 for negative) to maintain list length.
            gemini_predictions.append(0) # Or another appropriate default, e.g., -1 for "unclassified"

    print(f"Batch prediction complete. Collected {len(gemini_predictions)} predictions.")
else:
    print("Skipping batch prediction: 'data' not loaded, 'data[\"test\"]' is missing, or Gemini 'model' not initialized.")
    gemini_predictions = [] # Ensure it's defined
    true_labels = []      # Ensure it's defined

Skipping batch prediction: 'data' not loaded, 'data["test"]' is missing, or Gemini 'model' not initialized.


### Evaluate Performance
Finally, let's use our `evaluate_performance` function to see how well the Gemini model performed on the selected samples from the test set.

This step will compare the `gemini_predictions` we collected against the corresponding `true_labels` from `data["test"]["label"]`. The classification report will give us metrics like precision, recall, and F1-score.

**Important:** Ensure that `gemini_predictions` and `true_labels` (the slice of `data["test"]["label"]` we used) have the same number of elements. Our error handling in the batch prediction step (appending a default value) aims to ensure this.
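
To make the reported metrics concrete, the sketch below computes precision, recall, and F1 for one class by hand on a tiny made-up example. It is only meant to show what the numbers in the classification report mean; `binary_metrics` is a hypothetical helper, and it returns 0.0 when a denominator is zero, in the same spirit as `zero_division=0`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # correctly predicted positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # predicted positive, actually not
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall, toy_f1 = binary_metrics(toy_true, toy_pred)
print(round(toy_precision, 3), round(toy_recall, 3), round(toy_f1, 3))
```

Calling the function with `positive=0` gives the corresponding row for the negative class.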

In [8]:
if 'evaluate_performance' in locals() and gemini_predictions and true_labels:
    if len(gemini_predictions) == len(true_labels):
        print("\nGemini Model Performance Evaluation (on selected samples):")
        evaluate_performance(true_labels, gemini_predictions)
    else:
        print(f"Error: Mismatch in prediction count ({len(gemini_predictions)}) and label count ({len(true_labels)}).")
        print("Cannot evaluate performance. Please check the batch prediction step for errors.")
elif not gemini_predictions or not true_labels:
    print("\nSkipping evaluation as no predictions were generated or true labels are missing for the sample.")
else:
    print("\nSkipping evaluation: `evaluate_performance` function not defined.")


Skipping evaluation as no predictions were generated or true labels are missing for the sample.
