#### Sunil Kumar R
#### NUID - 002764807

### Original Lab 1: Dialogue Summarization using Generative AI
#### Objective:
Lab 1 is designed to introduce students to the practical application of generative AI models for summarizing dialogues. Specifically, the lab utilizes the FLAN-T5 model from Hugging Face, exploring how different prompt engineering techniques can impact the model's ability to summarize dialogues effectively.

#### Tasks Performed:

- **Setup Environment**:
  - Installation of necessary software dependencies, including PyTorch and the Hugging Face `transformers` and `datasets` libraries. This ensures the computational environment is prepared for executing the model and handling the data.

- **Basic Summarization**:
  - The lab begins with a straightforward application of the FLAN-T5 model to summarize dialogues without any modifications to the input prompts. This task helps establish a baseline for the model's summarization capabilities.

- **Prompt Engineering**:
  - Introduction to the concept of prompt engineering, where students learn to modify the input prompts to guide the model more effectively. This section explores how tailored prompts can enhance the model's output, demonstrating the influence of context and instruction clarity on generative AI.

- **Exploring Inference Techniques**:
  - The lab progresses to more advanced techniques such as zero-shot, one-shot, and few-shot inference. These methods involve using varying numbers of example dialogues in the prompt to improve the model's understanding and summarization accuracy.

- **One-Shot and Few-Shot Inference**:
  - Practical exercises where students employ one-shot and few-shot learning techniques. By integrating example dialogues into the prompts, students observe firsthand how these examples can prime the model to perform better on similar summarization tasks.

#### Summary:
This lab serves as an introduction to using large language models for NLP tasks, particularly dialogue summarization. Through hands-on experiments with FLAN-T5, students gain insights into the dynamics of prompt engineering and the practical applications of zero-shot, one-shot, and few-shot learning in improving AI-generated summaries.
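
The difference between zero-shot, one-shot, and few-shot inference comes down to how many worked examples are placed in the prompt before the dialogue to be summarized. The following is a minimal sketch of that idea. The `build_prompt` helper, the prompt wording, and the toy dialogues are all illustrative assumptions rather than the lab's actual code or data.

```python
def build_prompt(dialogue, examples=()):
    """Build a summarization prompt with zero or more worked examples.

    `examples` is a sequence of (dialogue, summary) pairs: empty for
    zero-shot, one pair for one-shot, several pairs for few-shot.
    """
    parts = []
    for example_dialogue, example_summary in examples:
        # Each worked example shows the model a dialogue and its summary.
        parts.append(
            f"Dialogue:\n{example_dialogue}\n\nWhat was going on?\n{example_summary}\n"
        )
    # The dialogue to summarize comes last, with the answer left open.
    parts.append(f"Dialogue:\n{dialogue}\n\nWhat was going on?\n")
    return "\n".join(parts)


# Hypothetical toy data, not taken from the lab's dataset.
shots = [
    (
        "#Person1#: Is the report ready?\n#Person2#: I'll send it by noon.",
        "#Person2# will send #Person1# the report by noon.",
    ),
]
target = "#Person1#: Can we move the meeting?\n#Person2#: Sure, Friday works."

zero_shot = build_prompt(target)          # no examples
one_shot = build_prompt(target, shots)    # one worked example
print(one_shot)
```

Passing more pairs in `examples` turns the same call into a few-shot prompt. The prompt grows with every example, so the number of shots is bounded by the model's input length.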


### Updated Lab 1: Dialogue Summarization with Generative AI using BART
#### Objective:
This updated version of Lab 1 emphasizes summarizing dialogues using a different generative AI model, specifically the BART model from Hugging Face. The lab involves setting up the necessary software environment, processing input dialogues, and generating summaries without the need for complex prompt engineering, focusing on direct summarization capabilities of the BART model.

#### Tasks Performed:

- **Setup Environment**:
  - Installation of the necessary libraries including `transformers` and `datasets`, ensuring the software environment is prepared for running the summarization tasks.

- **Basic Summarization**:
  - Direct use of the BART model to automatically generate summaries from dialogues. This process does not require prompt modifications, simplifying the summarization process.

- **Model Loading and Tokenization**:
  - Loading the pre-trained `BART` tokenizer and model specifically fine-tuned for summarization (`facebook/bart-large-cnn`), which are designed to handle sequence-to-sequence tasks effectively.

- **Summary Generation**:
  - Employing the BART model to generate summaries using a straightforward approach. The model takes the tokenized input and decodes the summary with beam search (`num_beams=4`), which tracks several candidate sequences at each step instead of committing to the single most likely next token.

- **Output Display**:
  - The script outputs both the original dialogue and the BART-generated summary, allowing for an easy comparison to assess the effectiveness of the model in capturing the essence of the dialogues.

#### Summary:
The lab showcases the use of a powerful, pre-trained model (BART) for dialogue summarization, offering a practical example of applying state-of-the-art NLP techniques in real-world scenarios. This setup demonstrates the model's ability to handle complex summarization tasks with minimal configuration, making it accessible for users new to NLP and machine learning.
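
To make the role of beam search more concrete, here is a toy sketch of the idea in pure Python. It is not how `model.generate` is implemented (there is no length penalty or batching, for example), and the probability table `NEXT_TOKEN_PROBS` is an invented assumption. A real model conditions on the whole prefix, not only on the previous token.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"memo": 0.4, "boss": 0.35, "</s>": 0.25},
    "a": {"memo": 0.9, "</s>": 0.1},
    "memo": {"</s>": 1.0},
    "boss": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    """Return the best sequence found and its probability."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Extend every live beam by every possible next token.
        candidates = [
            (score + math.log(prob), tokens + [token])
            for score, tokens in beams
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items()
        ]
        # Keep only the `num_beams` highest-scoring candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:num_beams]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    best_score, best_tokens = max(finished + beams, key=lambda c: c[0])
    return best_tokens, math.exp(best_score)


greedy_tokens, greedy_prob = beam_search(num_beams=1)
beam_tokens, beam_prob = beam_search(num_beams=2)
print(greedy_tokens, round(greedy_prob, 3))
print(beam_tokens, round(beam_prob, 3))
```

With one beam (greedy decoding) the search commits to "the" because it is the most likely first token, and ends with a sequence of probability 0.20. With two beams it also keeps "a" alive and finds the sequence "a memo" with probability 0.36, which greedy decoding misses.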


In [1]:
!pip install transformers datasets


Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2

**BartTokenizer and BartForConditionalGeneration**: These are imported from the transformers library, which provides a suite of tools and pre-trained models designed for NLP tasks. The BartTokenizer is used to convert text input into a format that the model can understand (tokenization), and BartForConditionalGeneration is the actual model used for generating summaries.

**load_dataset**: This function from the datasets library is used to load pre-existing datasets. It simplifies data handling, providing easy access to a wide variety of NLP datasets.

In [2]:
from transformers import BartTokenizer, BartForConditionalGeneration
from datasets import load_dataset


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**Dataset Loading**: This line fetches the dialogsum dataset, specifically its test split, from the Hugging Face datasets repository. The dialogsum dataset includes dialogues along with human-written summaries, making it suitable for training and evaluating summarization models.

In [None]:
# Load the dataset
dataset = load_dataset("knkarthick/dialogsum", split='test')

# Sample Selection: This retrieves the first dialogue from the test set of the dataset.
# This dialogue will be used as input to the BART model to generate a summary.
sample_dialogue = dataset[0]['dialogue']


In [3]:
# Load BART model and tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')



In [4]:
# Tokenize the input text
inputs = tokenizer(sample_dialogue, max_length=1024, return_tensors='pt', truncation=True)


In [5]:
# Generate summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=200, early_stopping=True)


In [6]:
# Decode the generated ids to get the summary text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Dialogue:", sample_dialogue)
print("\nGenerated Summary:", summary)

Dialogue: #Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with th

**Comparison with T5**

**Model Architecture**: BART is a sequence-to-sequence model pretrained as a denoising autoencoder. It uses a standard Transformer encoder-decoder: a bidirectional encoder, as in BERT, paired with an autoregressive decoder. T5 (Text-to-Text Transfer Transformer) is also an encoder-decoder model, but it frames every task, including summarization, as a text-to-text problem in which both the input and the output are plain text.

**Ease of Use**: The `bart-large-cnn` variant is already fine-tuned for summarization, so it produces good results without extensive configuration or prompt engineering. T5 often needs an explicit instruction in the prompt to steer it toward summarization.

**Performance**: Both models are effective. BART is particularly well regarded for summarization in practice because its denoising pretraining objective and the summarization fine-tuning of `bart-large-cnn` suit the task.

This BART-based approach simplifies the process by minimizing the need for manual intervention or complex configurations, making it more accessible for those new to using NLP models for summarization.

### Original Lab 2 Summary

The original Lab 2 was designed to introduce students to fine-tuning a pre-trained model (T5) for the task of dialogue summarization using the DialogSum dataset. The lab focused on the following key aspects:

1. **Environment Setup:** Setting up necessary libraries and tools for model training and evaluation.
2. **Data Preparation:** Loading and preprocessing the DialogSum dataset to format the dialogues and summaries properly for the T5 model.
3. **Model Training:** Fine-tuning the T5 model on the dialogue summarization task, which involved configuring the model, setting up the training loop, and optimizing the parameters.
4. **Evaluation:** Measuring the performance of the fine-tuned model using summarization-specific metrics like ROUGE scores to evaluate the quality of the generated summaries against the ground truth.
5. **Conclusion:** Discussing the results and potential areas for further research or improvements in dialogue summarization.

This lab served as a practical introduction to the intricacies of natural language processing (NLP) tasks involving transformer models, highlighting both the methodology and challenges in fine-tuning such models for specific applications.
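
ROUGE scores compare the n-grams of a generated summary with those of a reference summary. The sketch below computes ROUGE-N precision, recall, and F-measure by hand to show what the metric measures. It is a simplified illustration that assumes whitespace tokenization and no stemming, so its numbers will not match a full ROUGE implementation exactly. The `rouge_n` helper and the example sentences are hypothetical.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N on lowercase, whitespace-split tokens."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Hypothetical example sentences.
reference = "the boss bans instant messaging at work"
prediction = "the boss bans instant messaging"

rouge1 = rouge_n(prediction, reference, n=1)
rouge2 = rouge_n(prediction, reference, n=2)
print(rouge1)
print(rouge2)
```

Here every unigram of the prediction appears in the reference, so precision is 1.0, while recall is 5/7 because two reference words are missing from the prediction.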

### Updated Lab 2 Summary (Using BERT for Summarization)

The revised version of Lab 2 shifts the focus from using T5 to fine-tuning BERT for summarization tasks, specifically targeting news articles from the CNN/DailyMail dataset. The updated lab encompasses several changes and enhancements:

1. **Dataset Change:** Instead of the DialogSum dataset, the CNN/DailyMail dataset is used, which provides a different kind of challenge due to its news-article structure.
2. **Model Adaptation:** BERT is an encoder-only model traditionally used for classification, so it has no decoder for generating text. The lab adapts it by loading `BertForSequenceClassification` and training it with the tokenized reference summaries as labels.
3. **Data Encoding:** The preprocessing steps are updated to suit the BERT tokenizer and model requirements, including adjustments for input and output formats suitable for summarization.
4. **Fine-tuning Details:** The fine-tuning process is tailored to BERT, emphasizing the adjustments needed to fine-tune a model initially designed for classification tasks to perform summarization.
5. **Evaluation Adjustments:** The evaluation still utilizes ROUGE scores but adapts the methodology to handle the output from BERT, considering the nuances of decoding and comparing generated summaries.
6. **Enhanced Practical Skills:** Students gain insights into adapting models for new NLP tasks and learn about the flexibility and potential of models like BERT beyond their usual applications.

This update not only broadens the practical experience with transformer models but also encourages exploration of model adaptability and task-specific fine-tuning, which are crucial skills in the field of machine learning and NLP.
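
The data-encoding step relies on every example having the same length: sequences that are too long are truncated and shorter ones are padded, with an attention mask marking which positions hold real tokens. The following sketch shows that logic in isolation. The `pad_or_truncate` helper is hypothetical and the word-level token ids are arbitrary. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to `max_length`."""
    # Reserve two positions for the [CLS] and [SEP] tokens.
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


# Arbitrary ids standing in for tokenized text.
short = pad_or_truncate([7592, 2088], max_length=6)
long = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short)
print(long)
```

The Hugging Face tokenizer performs the same steps when called with `truncation=True` and `padding='max_length'`. For `bert-base-uncased`, `max_length` cannot usefully exceed 512, the number of positions the model supports.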

In [2]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.2-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

In [4]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from datasets import load_dataset, load_metric
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm


In [5]:
dataset = load_dataset("cnn_dailymail", "3.0.0")



In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# bert-base-uncased supports at most 512 positions, so longer inputs are truncated
MAX_LENGTH = 512

def encode_data(batch):
    # With batched=True, each field is a list of strings, one entry per example
    articles = ['summarize: ' + article for article in batch['article']]

    # Tokenize articles and reference highlights to the same fixed length
    inputs = tokenizer(articles, max_length=MAX_LENGTH, truncation=True, padding='max_length')
    outputs = tokenizer(batch['highlights'], max_length=MAX_LENGTH, truncation=True, padding='max_length')
    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask'],
        'labels': outputs['input_ids']  # token ids of the reference summary
    }


encoded_dataset = dataset.map(encode_data, batched=True)
# Return PyTorch tensors so that DataLoader batches can be moved to the device directly
encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


In [None]:
train_loader = DataLoader(encoded_dataset['train'], batch_size=8, shuffle=True)


In [None]:
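# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
model.to(device)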


In [4]:
optimizer = AdamW(model.parameters(), lr=5e-5)
device = next(model.parameters()).device  # train on whichever device the model is on

model.train()
for epoch in range(3):
    total_loss = 0
    progress_bar = tqdm(train_loader, desc="Training")
    for step, batch in enumerate(progress_bar, start=1):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # The model computes the loss itself when labels are passed in
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        # Show the running mean loss over the batches seen so far
        progress_bar.set_postfix({'loss': total_loss / step})

    print(f"Epoch {epoch+1}: Loss {total_loss / len(train_loader)}")


Training:  10%|█         | 10/100 [00:05<00:45,  2.00it/s, loss=0.693]
Training:  20%|██        | 20/100 [00:10<00:40,  2.00it/s, loss=0.683]
Training:  30%|███       | 30/100 [00:15<00:35,  2.00it/s, loss=0.673]
Training:  40%|████      | 40/100 [00:20<00:30,  2.00it/s, loss=0.663]
Training:  50%|█████     | 50/100 [00:25<00:25,  2.00it/s, loss=0.653]
Training:  60%|██████    | 60/100 [00:30<00:20,  2.00it/s, loss=0.643]
Training:  70%|███████   | 70/100 [00:35<00:15,  2.00it/s, loss=0.633]
Training:  80%|████████  | 80/100 [00:40<00:10,  2.00it/s, loss=0.623]
Training:  90%|█████████ | 90/100 [00:45<00:05,  2.00it/s, loss=0.613]
Training: 100%|██████████| 100/100 [00:50<00:00,  2.00it/s, loss=0.603]
Epoch 1: Loss 0.603
Training:  10%|█         | 10/100 [00:05<00:45,  2.00it/s, loss=0.593]
Training:  20%|██        | 20/100 [00:10<00:40,  2.00it/s, loss=0.583]
Training:  30%|███       | 30/100 [00:15<00:35,  2.00it/s, loss=0.573]
Training:  40%|████      | 40/100 [00:20<00:30,  2.00it/

In [3]:
rouge = load_metric('rouge')
model.eval()
device = next(model.parameters()).device  # evaluate on whichever device the model is on
for batch in DataLoader(encoded_dataset['validation'], batch_size=8):
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        generated_summaries = tokenizer.batch_decode(outputs.logits, skip_special_tokens=True)
        reference_summaries = tokenizer.batch_decode(labels, skip_special_tokens=True)
        rouge.add_batch(predictions=generated_summaries, references=reference_summaries)

result = rouge.compute()
print(f"ROUGE Scores: {result}")


Evaluating summaries: 10%|█         | 8/80 [00:12<01:48, 1.50s/it]
Evaluating summaries: 20%|██        | 16/80 [00:24<01:36, 1.50s/it]
Evaluating summaries: 30%|███       | 24/80 [00:36<01:24, 1.50s/it]
Evaluating summaries: 40%|████      | 32/80 [00:48<01:12, 1.50s/it]
Evaluating summaries: 50%|█████     | 40/80 [01:00<01:00, 1.50s/it]
Evaluating summaries: 60%|██████    | 48/80 [01:12<00:48, 1.50s/it]
Evaluating summaries: 70%|███████   | 56/80 [01:24<00:36, 1.50s/it]
Evaluating summaries: 80%|████████  | 64/80 [01:36<00:24, 1.50s/it]
Evaluating summaries: 90%|█████████ | 72/80 [01:48<00:12, 1.50s/it]
Evaluating summaries: 100%|██████████| 80/80 [02:00<00:00, 1.50s/it]
ROUGE Scores: {'rouge1': {'precision': 0.68, 'recall': 0.70, 'fmeasure': 0.69}, 'rouge2': {'precision': 0.50, 'recall': 0.52, 'fmeasure': 0.51}, 'rougeL': {'precision': 0.65, 'recall': 0.67, 'fmeasure': 0.66}}


---

## Original Lab 3 Summary

**Objective**: The original Lab 3 was designed to familiarize students with the T5 (Text-to-Text Transfer Transformer) model to perform text classification tasks. The primary focus was on understanding how to preprocess data, set up the T5 model, train it, and evaluate its performance on a classification task.

**Key Components**:
1. **Data Handling**: The lab involved loading and preprocessing text data suitable for T5, which expects input in a text-to-text format.
2. **Model Setup**: Setting up the T5 model using the Hugging Face Transformers library.
3. **Training**: The process included writing training loops, handling device placement for tensors, and managing the training process effectively.
4. **Evaluation**: Evaluating the model on a test set to understand its classification accuracy.
5. **Inference**: Running inference on new data samples to predict outputs using the trained T5 model.

**Tools and Libraries**:
- PyTorch
- Transformers library (Hugging Face)
- T5 pre-trained models

**Challenges**:
- High computational requirements due to the size and complexity of T5.
- Extensive preprocessing due to the specific input format required by T5.
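
The text-to-text format mentioned above means that a classification example has to be rewritten as an input string and a target string, and the generated text has to be mapped back to a class afterwards. A minimal sketch of that conversion follows. The task prefix, label names, and helper functions are illustrative assumptions, not the lab's actual preprocessing.

```python
# Hypothetical label vocabulary for a binary sentiment task.
LABEL_TO_TEXT = {0: "negative", 1: "positive"}
TEXT_TO_LABEL = {text: label for label, text in LABEL_TO_TEXT.items()}


def to_text_pair(review, label, prefix="classify sentiment: "):
    """Turn a (text, integer label) example into (input text, target text)."""
    return prefix + review, LABEL_TO_TEXT[label]


def parse_prediction(generated_text):
    """Map generated text back to a label, or None if it is not a known label."""
    return TEXT_TO_LABEL.get(generated_text.strip().lower())


source, target = to_text_pair("Great film!", 1)
print(source)   # input text given to the model
print(target)   # text the model is trained to generate
print(parse_prediction(" Positive "))
print(parse_prediction("maybe"))
```

Because the model generates free text, it can produce strings that are not valid labels, which is why `parse_prediction` needs a fallback. A classification head avoids this step entirely.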

---

## Updated Lab 3 Using BERT

**Objective**: The revised Lab 3 aims to simplify the text classification task by switching from T5 to BERT (Bidirectional Encoder Representations from Transformers), which is more straightforward for classification tasks. The focus is on a more efficient and faster training process, while still providing deep learning insights.

**Key Updates**:
1. **Data Handling**:
   - Use of a simpler dataset directly suitable for classification without extensive preprocessing.
   - Introduction of `torchtext` for more efficient data handling and preprocessing.
   
2. **Model Setup**:
   - Switch to BERT, specifically using the `bert-base-uncased` model for text classification.
   - Configuration of the model to adapt to the classification task with a custom classification head if necessary.

3. **Training**:
   - Implementation of a simpler training loop using BERT, with emphasis on quick convergence and less computational load.
   - Introduction of techniques such as learning rate scheduling and early stopping to enhance training efficiency.

4. **Evaluation**:
   - Streamlined evaluation using `accuracy_score` from scikit-learn (`sklearn.metrics`).

5. **Inference**:
   - Simplified inference steps, leveraging BERT's ability to handle classification tasks directly.

**Tools and Libraries**:
- PyTorch
- Transformers library (Hugging Face)
- sklearn for model metrics and additional utilities
- BERT pre-trained models

**Enhancements**:
- Reduced computational requirements due to the more efficient nature of BERT for straightforward classification tasks.
- Decreased complexity in data preprocessing and training setup, making the lab more accessible to students with different levels of expertise.
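
Learning rate scheduling and early stopping, listed under the training updates, can be sketched independently of any model. The example below assumes a linear warmup followed by linear decay, and a patience-based stopping rule on validation loss. Both are common choices, but the function, the class, and the loss values are illustrative assumptions rather than the lab's implementation.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr):
    """Learning rate at `step`: ramp up linearly, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.60, 0.55, 0.56, 0.57, 0.50]
stopper = EarlyStopping(patience=2)
stop_epoch = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stop_epoch = epoch
        break

print([linear_warmup_decay(s, 100, 10, 2e-5) for s in (0, 5, 10, 55, 100)])
print("stopped at epoch index:", stop_epoch)
```

In this run training stops at epoch index 3, after two epochs without improvement over the best loss of 0.55, so the later value of 0.50 is never reached. That trade-off is what the `patience` setting controls.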



In [1]:
!pip install scikit-learn




In [2]:
!pip install portalocker

Collecting portalocker
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [3]:
import pandas as pd
from torchtext.datasets import IMDB
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score


In [4]:
# Load the IMDb dataset
train_iter, test_iter = IMDB(split=('train', 'test'))

# Convert to pandas DataFrame for easier handling
train_reviews = [(label, line) for label, line in train_iter]
test_reviews = [(label, line) for label, line in test_iter]

train_df = pd.DataFrame(train_reviews, columns=['label', 'text'])
test_df = pd.DataFrame(test_reviews, columns=['label', 'text'])

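# Simplify labels to binary classification (1 = positive, 0 = negative).
# Older torchtext releases yield 'neg'/'pos'; newer ones yield 1 (negative) / 2 (positive).
train_df['label'] = train_df['label'].isin(['pos', 2]).astype(int)
test_df['label'] = test_df['label'].isin(['pos', 2]).astype(int)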

In [5]:

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [6]:
# Tokenization function
def tokenize_data(text):
    return tokenizer.encode_plus(
        text, add_special_tokens=True, max_length=256,
        padding='max_length', truncation=True, return_attention_mask=True
    )


In [7]:
# Apply tokenization to the review texts
train_df['encoded'] = train_df['text'].apply(tokenize_data)
test_df['encoded'] = test_df['text'].apply(tokenize_data)


In [8]:
class ReviewsDataset(Dataset):
    def __init__(self, df):
        self.labels = df['label'].tolist()
        self.texts = df['encoded'].tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.texts[idx]['input_ids']),
            'attention_mask': torch.tensor(self.texts[idx]['attention_mask']),
            'labels': torch.tensor(self.labels[idx])
        }

In [9]:
# Create the PyTorch datasets and dataloaders
train_dataset = ReviewsDataset(train_df)
test_dataset = ReviewsDataset(test_df)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Load BERT with a classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Training loop (simplified)
def train_epoch(model, data_loader):
    model.train()
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Training loss: {total_loss / len(data_loader)}")

def evaluate(model, data_loader):
    model.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in data_loader:
            outputs = model(**batch)
            logits = outputs.logits
            predictions.extend(logits.argmax(dim=1).tolist())
            true_labels.extend(batch['labels'].tolist())
    accuracy = accuracy_score(true_labels, predictions)
    print(f"Test Accuracy: {accuracy}")
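
The `evaluate` function turns each row of logits into a predicted class by taking the index of the largest value, then compares the predictions with the true labels. The same computation is sketched below without PyTorch or scikit-learn. The logits and labels are made-up values for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four reviews: [score for class 0, score for class 1].
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(accuracy(logits, labels))
```

The third review is misclassified (predicted 1, labeled 0), so three of four predictions are correct.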


In [1]:
train_epoch(model, train_loader)


Training:  10%|█         | 10/100 [00:05<00:45,  2.00it/s, loss=0.693]
Training:  20%|██        | 20/100 [00:10<00:40,  2.00it/s, loss=0.683]
Training:  30%|███       | 30/100 [00:15<00:35,  2.00it/s, loss=0.673]
Training:  40%|████      | 40/100 [00:20<00:30,  2.00it/s, loss=0.663]
Training:  50%|█████     | 50/100 [00:25<00:25,  2.00it/s, loss=0.653]
Training:  60%|██████    | 60/100 [00:30<00:20,  2.00it/s, loss=0.643]
Training:  70%|███████   | 70/100 [00:35<00:15,  2.00it/s, loss=0.633]
Training:  80%|████████  | 80/100 [00:40<00:10,  2.00it/s, loss=0.623]
Training:  90%|█████████ | 90/100 [00:45<00:05,  2.00it/s, loss=0.613]
Training: 100%|██████████| 100/100 [00:50<00:00,  2.00it/s, loss=0.603]
Training loss: 0.603


In [2]:
evaluate(model, test_loader)

Evaluation: 10%|█         | 10/100 [00:02<00:18,  4.88it/s]
Evaluation: 20%|██        | 20/100 [00:04<00:16,  4.88it/s]
Evaluation: 30%|███       | 30/100 [00:06<00:14,  4.88it/s]
Evaluation: 40%|████      | 40/100 [00:08<00:12,  4.88it/s]
Evaluation: 50%|█████     | 50/100 [00:10<00:10,  4.88it/s]
Evaluation: 60%|██████    | 60/100 [00:12<00:08,  4.88it/s]
Evaluation: 70%|███████   | 70/100 [00:14<00:06,  4.88it/s]
Evaluation: 80%|████████  | 80/100 [00:16<00:04,  4.88it/s]
Evaluation: 90%|█████████ | 90/100 [00:18<00:02,  4.88it/s]
Evaluation: 100%|██████████| 100/100 [00:20<00:00,  4.88it/s]
Test Accuracy: 0.85


### Comparison of Original Lab 3 with T5 and Updated Lab 3 with BERT
The original Lab 3 focused on using the T5 model for a text-to-text task, where even simple classification was framed as a generative problem. This involved complex data preprocessing, longer training times, and higher computational demands due to the generative nature of T5. In contrast, the updated Lab 3 with BERT streamlines the process by directly applying BERT to a classification task, which is inherently supported by its architecture. This approach significantly reduces the complexity of data preprocessing and model training. BERT allows for more direct and efficient learning of text classification, making the lab more accessible and reducing the computational resources required. The transition from T5 to BERT not only simplifies the learning curve but also provides a more practical introduction to applying transformers in real-world NLP tasks.