# Assignment 3

**Name**: Pola Gnana Shekar <br/>
**Roll No**: 21CS10052

## FLAN-T5 Model Loading and Inspection

This section loads both **FLAN-T5 small** and **FLAN-T5 base** models with their tokenizers. FLAN-T5 is an instruction-tuned variant of T5, trained to follow natural-language task descriptions in zero-shot and few-shot settings.

For both models, we load and inspect these key features:
- **Model Architecture**: Includes layers, hidden size, and attention heads.
- **Vocabulary**: Total tokens the model can recognize.
- **Parameter Count**: Total model parameters, with base having a larger count than small.

We also load their **tokenizers** and note:
- **Vocab Size**: Vocabulary of tokens each model can use.
- **Padding and EOS Tokens**: IDs for padding sequences and ending inputs consistently.

The base model has more layers, a larger hidden size, and roughly three times as many parameters as the small model (about 248M versus 77M), so it offers more capacity at a higher compute cost. The small model is cheaper to run but has less capacity.
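
The loading cells in this section report a model vocabulary size (32128) slightly larger than the tokenizer's (32100). One plausible reading, which is an assumption here and not something the notebook confirms, is that the embedding table is padded up to a multiple of 128. The sketch below only checks that arithmetic and computes the parameter ratio from the printed counts; `round_up` is a hypothetical helper.

```python
def round_up(n: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# Values as printed by the loading cells in this section.
tokenizer_vocab = 32100
config_vocab = 32128
params_small = 76_961_152
params_base = 247_577_856

# Assumption: the embedding table is padded to a multiple of 128.
padded_vocab = round_up(tokenizer_vocab, 128)
unused_rows = config_vocab - tokenizer_vocab
param_ratio = params_base / params_small

print(f"padded vocab: {padded_vocab}, unused rows: {unused_rows}")
print(f"base/small parameter ratio: {param_ratio:.2f}")
```

If the padding assumption holds, the 28 extra embedding rows have no corresponding tokenizer entry, which is why the two vocabulary sizes differ.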

In [1]:
from transformers import AutoModelForSeq2SeqLM,AutoTokenizer

In [2]:
print("Loading Flan-T5 small model....\n")

model_small = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer_small = AutoTokenizer.from_pretrained("google/flan-t5-small")

print("Flan-T5 Small Model Features:")
print(f"Number of Layers: {model_small.config.num_layers}")
print(f"Hidden Size: {model_small.config.d_model}")
print(f"Number of Attention Heads: {model_small.config.num_heads}")
print(f"Vocabulary Size: {model_small.config.vocab_size}")
print(f"Model Size (parameters): {model_small.num_parameters()}")
print()

print("Tokenizer for Flan-T5 Small:")
print(f"Tokenizer Vocab Size: {tokenizer_small.vocab_size}")
print(f"Tokenizer Padding Token ID: {tokenizer_small.pad_token_id}")
print(f"Tokenizer EOS Token ID: {tokenizer_small.eos_token_id}")

Loading Flan-T5 small model....



Flan-T5 Small Model Features:
Number of Layers: 8
Hidden Size: 512
Number of Attention Heads: 6
Vocabulary Size: 32128
Model Size (parameters): 76961152

Tokenizer for Flan-T5 Small:
Tokenizer Vocab Size: 32100
Tokenizer Padding Token ID: 0
Tokenizer EOS Token ID: 1




In [3]:
print("Loading Flan-T5 base model....\n")

model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer_base = AutoTokenizer.from_pretrained("google/flan-t5-base")

print("Flan-T5 Base Model Features:")
print(f"Number of Layers: {model_base.config.num_layers}")
print(f"Hidden Size: {model_base.config.d_model}")
print(f"Number of Attention Heads: {model_base.config.num_heads}")
print(f"Vocabulary Size: {model_base.config.vocab_size}")
print(f"Model Size (parameters): {model_base.num_parameters()}")
print()

print("\nTokenizer for Flan-T5 Base:")
print(f"Tokenizer Vocab Size: {tokenizer_base.vocab_size}")
print(f"Tokenizer Padding Token ID: {tokenizer_base.pad_token_id}")
print(f"Tokenizer EOS Token ID: {tokenizer_base.eos_token_id}")

Loading Flan-T5 base model....



Flan-T5 Base Model Features:
Number of Layers: 12
Hidden Size: 768
Number of Attention Heads: 12
Vocabulary Size: 32128
Model Size (parameters): 247577856


Tokenizer for Flan-T5 Base:
Tokenizer Vocab Size: 32100
Tokenizer Padding Token ID: 0
Tokenizer EOS Token ID: 1


## Dataset Loading and Summary

In this section, we load and inspect the training, validation, and test datasets for NLP classification. Each dataset is loaded from TSV files without headers, and columns are renamed for consistency.

- **Training Dataset**: 
  - Loaded with columns `speech` and `label`.
  - Sample data preview provided.
  - **Statistics**:
    - Total samples: Displays the count of examples.
    - **Label Distribution**: Shows counts for each label type.

- **Validation and Test Datasets**: 
  - Loaded similarly with column renaming and sample previews.
  - **Statistics** for each:
    - Total samples.
    - Label distribution by class.

This overview helps verify dataset integrity and label balance before model training.


In [4]:
import pandas as pd

In [5]:
# Load the dataset without headers
train_df = pd.read_csv("/kaggle/input/nlpdata/NLP_ass_train.tsv", sep='\t', header=None)

# Rename the columns
train_df.columns = ['speech', 'label']

print("Sample data in NLP_train:")
print(train_df.head(), "\n")

Sample data in NLP_train:
                                              speech       label
0  we cannot continue calling ourselves feminists...      normal
1                      nawt yall niggers ignoring me      normal
2  <user> i am bit confused coz chinese ppl can n...  hatespeech
3  and this is why i end up with nigger trainee d...  hatespeech
4                  nogs jews and dykes how enriching   offensive 



In [6]:
validation_df = pd.read_csv("/kaggle/input/nlpdata/NLP_ass_valid.tsv",sep='\t', header=None)

# Rename the columns
validation_df.columns = ['speech', 'label']

print("sample data in NLP_valid:")
print(validation_df.head(),"\n")

sample data in NLP_valid:
                                              speech       label
0  this bitch in whataburger eating a burger with...  hatespeech
1  tired i can not support abortion from a moral ...  hatespeech
2  all of my exes were cute but they were hoes i ...   offensive
3  <user> <user> they visited helped provided foo...      normal
4  it funny how people offended by the word nigge...      normal 



In [7]:
test_df = pd.read_csv("/kaggle/input/nlpdata/NLP_ass_test.tsv",sep='\t',header=None)

# Rename the columns
test_df.columns = ['speech', 'label']

print("sample data in NLP_test:")
print(test_df.head(),"\n")

sample data in NLP_test:
                                              speech       label
0  i dont think im getting my baby them white 9 h...      normal
1  laura loomer raped me while screaming at me in...  hatespeech
2  <user> what did the old lady do was she a nazi...      normal
3  as much as i appreciate a format of open discu...      normal
4  sex be so good a bitch be slow stroking and cr...   offensive 



In [8]:
# Checking the label distribution in each dataset split
print("Training Set Statistics:")
print(f"Total samples: {len(train_df)}")
print("Label distribution:\n", train_df['label'].value_counts(), "\n")

print("Validation Set Statistics (TSV):")
print(f"Total samples: {len(validation_df)}")
print("Label distribution:\n", validation_df['label'].value_counts(), "\n")

print("Test Set Statistics:")
print(f"Total samples: {len(test_df)}")
print("Label distribution:\n", test_df['label'].value_counts(), "\n")


Training Set Statistics:
Total samples: 15383
Label distribution:
 label
normal        6251
hatespeech    4748
offensive     4384
Name: count, dtype: int64 

Validation Set Statistics (TSV):
Total samples: 1922
Label distribution:
 label
normal        781
hatespeech    593
offensive     548
Name: count, dtype: int64 

Test Set Statistics:
Total samples: 1924
Label distribution:
 label
normal        782
hatespeech    594
offensive     548
Name: count, dtype: int64 
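
The counts above can be turned into per-class shares to check whether the three splits have similar label distributions. The sketch below hard-codes the printed counts; `label_shares` is a hypothetical helper, not part of the notebook.

```python
# Label counts copied from the statistics printed above.
splits = {
    "train": {"normal": 6251, "hatespeech": 4748, "offensive": 4384},
    "validation": {"normal": 781, "hatespeech": 593, "offensive": 548},
    "test": {"normal": 782, "hatespeech": 594, "offensive": 548},
}

def label_shares(counts: dict) -> dict:
    """Convert raw label counts into fractions of the split total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

shares = {name: label_shares(counts) for name, counts in splits.items()}
for name, dist in shares.items():
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in dist.items())
    print(f"{name:<10} {summary}")
```

In all three splits `normal` accounts for about 40.6% of the examples, so the splits appear to be stratified by label and a majority-class predictor would reach roughly that accuracy.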



## Evaluation Metrics Calculation

This code defines a function to compute key evaluation metrics, **accuracy** and **Macro-F1 score**, between predicted and actual labels:

- **Accuracy**: Measures the overall percentage of correct predictions.
- **Macro-F1 Score**: Calculates the F1 score for each label, then averages them with equal weight, so frequent labels do not dominate the score (useful when classes are imbalanced).

### Function Overview:
- **Input**: 
  - `test_df`: DataFrame with actual labels in the `label` column.
  - `predictions`: List of model predictions.
- **Output**:
  - `accuracy`: The fraction of correctly predicted labels.
  - `macro_f1`: The Macro-F1 score, emphasizing balanced performance across labels.
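
To make the difference between the two metrics concrete, here is a small pure-Python sketch that computes both by hand on toy data. It is an illustration only: the notebook itself uses scikit-learn, and the function names `accuracy` and `macro_f1` below are hypothetical. Like `f1_score` with its default `labels`, the sketch averages over every label that appears in either list, so an unexpected prediction such as `hate` adds an extra class to the average.

```python
def accuracy(actual, predicted):
    """Fraction of positions where prediction and label agree."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def macro_f1(actual, predicted):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(actual) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        fp = sum(a != label and p == label for a, p in zip(actual, predicted))
        fn = sum(a == label and p != label for a, p in zip(actual, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: the second prediction list truncates "hatespeech" to "hate".
actual = ["normal", "hatespeech", "offensive", "normal"]
exact = ["normal", "hatespeech", "normal", "normal"]
truncated = ["normal", "hate", "normal", "normal"]

print(accuracy(actual, exact), round(macro_f1(actual, exact), 2))
print(accuracy(actual, truncated), round(macro_f1(actual, truncated), 2))
```

In this toy data, truncating `hatespeech` to `hate` lowers accuracy from 0.75 to 0.50 and Macro-F1 from 0.60 to 0.20, because the stray label adds a fourth class with an F1 of 0 to the average.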

In [9]:
from sklearn.metrics import accuracy_score, f1_score

# Calculate accuracy and Macro-F1 score
def calculate_metrics(test_df, predictions):
    actual_labels = test_df['label'].str.lower().str.replace(" ", "").tolist()
    accuracy = accuracy_score(actual_labels, predictions)
    macro_f1 = f1_score(actual_labels, predictions, average='macro')

    return accuracy, macro_f1

## Zero-Shot Prompting for Hate Speech Classification

This code implements a **zero-shot prompting approach** using FLAN-T5 models (small and base) to classify text as "normal," "hatespeech," or "offensive" without any labeled examples in the prompt.

### Code Overview:
1. **get_predictions_zs function**: Processes each sentence in the `test_df` DataFrame, applies the zero-shot prompt (`prompt_zs`), and returns model predictions.
2. **Prompt format**: The prompt asks the model to assign one of the three labels to the given text.
3. **Model Evaluation**: The `calculate_metrics` function computes accuracy and Macro-F1 scores, allowing us to assess each model’s performance.

The code displays sample predictions, providing insight into the model’s zero-shot performance on the test set.
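
The sample predictions in this section include outputs such as `hate`, which can never equal the label `hatespeech` under exact string comparison. One possible remedy, sketched below as an assumption and not as part of the notebook's pipeline, is to map raw generations onto the three canonical labels before scoring. `normalize_prediction`, `CANONICAL_LABELS`, and the `unknown` fallback are hypothetical names.

```python
CANONICAL_LABELS = ("normal", "hatespeech", "offensive")

def normalize_prediction(raw: str, fallback: str = "unknown") -> str:
    """Map a raw generation to one of the canonical labels, if possible."""
    cleaned = raw.strip().lower().replace(" ", "")
    if not cleaned:
        return fallback
    for label in CANONICAL_LABELS:
        # Accept exact matches and prefixes such as "hate" for "hatespeech".
        if label.startswith(cleaned) or cleaned.startswith(label):
            return label
    return fallback

for raw in ["hate", "Hate Speech", " offensive ", "normal", "yes", ""]:
    print(repr(raw), "->", normalize_prediction(raw))
```

The scores reported in this notebook were computed without such a mapping, so applying it would change them.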

In [10]:
def get_predictions_zs(model, tokenizer, test_df, prompt):
    all_predictions = []
    
    for _, row in test_df.iterrows():
        text = row['speech']
        
        input_text = prompt.format(text)
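        inputs = tokenizer(input_text, return_tensors="pt")
        # Note: max_new_tokens=1 keeps only the first generated token, so a
        # multi-token label may be cut short (the sample output shows "hate")
        outputs = model.generate(**inputs, max_new_tokens=1)
        # Decode and normalize the prediction (lowercase, no spaces)
        pred = tokenizer.decode(outputs[0], skip_special_tokens=True).strip().lower().replace(" ", "")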
        
        all_predictions.append(pred)
    
    return all_predictions

In [11]:
prompt_zs = "Please determine if the text below is 'normal,' 'hatespeech' or 'offensive': {}"

### Zero-Shot Classification Using FLAN-T5 Small Model

The **FLAN-T5 small model** performs zero-shot classification on the test set, using a straightforward prompt to predict labels. Accuracy and Macro-F1 scores provide a quantitative evaluation of the small model’s zero-shot classification ability and show how far it generalizes with no in-prompt examples.

In [20]:
predictions_small_zs = get_predictions_zs(model_small, tokenizer_small, test_df, prompt_zs)

print("Sample Predictions:")
for i in range(min(20, len(predictions_small_zs))):  # Print at most 20 samples
    print(f"Text: {test_df['speech'].iloc[i]} | Prediction: {predictions_small_zs[i]} | Actual Label: {test_df['label'].iloc[i]}")

Sample Predictions:
Text: i dont think im getting my baby them white 9 he has two white j and nikes not even touched | Prediction: offensive | Actual Label: normal
Text: laura loomer raped me while screaming at me in her disgusting kike language and said we must exterminate the goyim laura loomer loomergate | Prediction: offensive | Actual Label: hatespeech
Text: <user> what did the old lady do was she a nazi or white supremacist | Prediction: hate | Actual Label: normal
Text: as much as i appreciate a format of open discussion i really just want a facebook that i can say beaner on | Prediction: offensive | Actual Label: normal
Text: sex be so good a bitch be slow stroking and crying | Prediction: normal | Actual Label: offensive
Text: <user> <user> <user> with a room full of white folks | Prediction: hate | Actual Label: normal
Text: i do not blame orlando pie rats for loosing like this they are drunk from that black label 😂 😂 😂 mtn 8 ss diski wafa wafa | Prediction: offensive | Actua

In [21]:
accuracy_small_zs, macro_f1_small_zs = calculate_metrics(test_df, predictions_small_zs)

print(f"Accuracy for Flan-T5 Small Model: {accuracy_small_zs:.4f}")
print(f"Macro-F1 Score for Flan-T5 Small Model: {macro_f1_small_zs:.4f}")

Accuracy for Flan-T5 Small Model: 0.2448
Macro-F1 Score for Flan-T5 Small Model: 0.0826


### Zero-Shot Classification Using FLAN-T5 Base Model

The **FLAN-T5 base model** undergoes a similar zero-shot evaluation on the test set, with the same prompt setup. Due to its larger size and greater representational power, the base model may exhibit higher accuracy and F1 scores. This comparison with the small model enables analysis of performance gains from scaling the model.

In [14]:
predictions_base_zs = get_predictions_zs(model_base, tokenizer_base, test_df, prompt_zs)

print("Sample Predictions:")
for i in range(min(20, len(predictions_base_zs))):  # Print at most 20 samples
    print(f"Text: {test_df['speech'].iloc[i]} | Prediction: {predictions_base_zs[i]} | Actual Label: {test_df['label'].iloc[i]}")

Sample Predictions:
Text: i dont think im getting my baby them white 9 he has two white j and nikes not even touched | Prediction: normal | Actual Label: normal
Text: laura loomer raped me while screaming at me in her disgusting kike language and said we must exterminate the goyim laura loomer loomergate | Prediction: hate | Actual Label: hatespeech
Text: <user> what did the old lady do was she a nazi or white supremacist | Prediction: normal | Actual Label: normal
Text: as much as i appreciate a format of open discussion i really just want a facebook that i can say beaner on | Prediction: normal | Actual Label: normal
Text: sex be so good a bitch be slow stroking and crying | Prediction: offensive | Actual Label: offensive
Text: <user> <user> <user> with a room full of white folks | Prediction: normal | Actual Label: normal
Text: i do not blame orlando pie rats for loosing like this they are drunk from that black label 😂 😂 😂 mtn 8 ss diski wafa wafa | Prediction: normal | Actual Label

In [15]:
accuracy_base_zs, macro_f1_base_zs= calculate_metrics(test_df, predictions_base_zs)

print(f"Accuracy for Flan-T5 Base Model: {accuracy_base_zs:.4f}")
print(f"Macro-F1 Score for Flan-T5 Base Model: {macro_f1_base_zs:.4f}")

Accuracy for Flan-T5 Base Model: 0.3643
Macro-F1 Score for Flan-T5 Base Model: 0.1456


## Few-Shot Prompting for Hate Speech Classification

In this section, we use a **few-shot prompting** approach, where each model (FLAN-T5 small and base) leverages a small set of labeled examples to improve its classification accuracy.

### Code Overview:
1. **get_few_shot_examples function**: Selects 3 examples from each class (`normal`, `hatespeech`, and `offensive`) from the `train_df` to create a few-shot prompt.
2. **create_few_shot_prompt function**: Combines these examples into a full prompt, providing reference samples to guide the model’s predictions.
3. **few_shot_inference function**: For each test example, builds a prompt from the instruction, the test text, and the few-shot examples (in that order, as the code is written), then generates a one-token prediction.
4. **Model Evaluation**: We calculate accuracy and Macro-F1 scores to evaluate performance, comparing results for the FLAN-T5 small and base models.

This approach simulates minimal supervision, allowing the models to benefit from specific in-context examples.
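
In the notebook's `few_shot_inference`, the instruction and the test text come before the block of examples. A common alternative layout places the labeled examples first and ends the prompt with the text to classify, so the model's next token continues the pattern. The sketch below illustrates that alternative under stated assumptions: `build_few_shot_prompt` and the placeholder examples are hypothetical, and this layout was not used to produce the scores reported here.

```python
def build_few_shot_prompt(examples, text):
    """Place the labeled examples first, then the instruction and the query."""
    lines = ["Examples for hate speech classification:"]
    for example_text, label in examples:
        lines.append(f'Text: "{example_text}" Label: {label}')
    lines.append(
        "Using these references, determine if the text below is "
        "'normal', 'hatespeech' or 'offensive'."
    )
    lines.append(f'Text: "{text}" Label:')
    return "\n".join(lines)

# Placeholder examples standing in for rows sampled from the training set.
examples = [
    ("first placeholder sentence", "normal"),
    ("second placeholder sentence", "offensive"),
    ("third placeholder sentence", "hatespeech"),
]
prompt = build_few_shot_prompt(examples, "sentence to classify")
print(prompt)
```

Whether this ordering helps FLAN-T5 on this dataset is an open question; it would need to be measured with the same metrics.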

In [16]:
import random

# Select 3 examples from each class and shuffle them
def get_few_shot_examples(df, examples_per_class=3):
    few_shot_examples = []
    for label in df['label'].unique():
        examples = df[df['label'] == label].sample(examples_per_class, random_state=42)
        few_shot_examples.extend([(text, label) for text in examples['speech']])
    # Shuffle all selected examples randomly
    # (sampling above is seeded, but this shuffle is not, so the order can differ between runs)
    random.shuffle(few_shot_examples)
    return few_shot_examples

# Generate the full few-shot prompt with mixed examples
def create_few_shot_prompt(few_shot_examples):
    prompt_text = "Examples for Hate speech classification task:\n"
    for text, label in few_shot_examples:
        prompt_text += f"text: \"{text}\" – Label: '{label}' |\n"
    return prompt_text

# Testing function: Pass each test example through the few-shot prompt
def few_shot_inference(model, tokenizer, test_df, full_prompt):
    predictions = []
    for _, row in test_df.iterrows():
        text = row['speech']
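        # Build the prompt: the instruction and the test sentence come first,
        # and the few-shot examples (full_prompt) are appended after them
        prompt = f"\nUsing these references, Please determine if the text below is 'normal,' 'hatespeech' or 'offensive': \"{text}\"" + full_prompt
        inputs = tokenizer(prompt, return_tensors="pt")
        # Note: max_new_tokens=1 keeps only the first generated token, so a
        # multi-token label may be cut short
        outputs = model.generate(**inputs, max_new_tokens=1)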
        pred = tokenizer.decode(outputs[0], skip_special_tokens=True).strip().lower().replace(" ", "")
        predictions.append(pred)
    return predictions

few_shot_examples = get_few_shot_examples(train_df)
full_prompt = create_few_shot_prompt(few_shot_examples)

print("Complete few-shot prompt:")
print(full_prompt)

Complete few-shot prompt:
Examples for Hate speech classification task:
text: "eid mubarak to all my muslim brothers and sisters lt <number>" – Label: 'normal' |
text: "that nigga kawhi is nice bruh" – Label: 'normal' |
text: "the ex lying muzrat traitor in chief needs to swing for his crimes" – Label: 'hatespeech' |
text: "he was like ew really i fucking hate that faggot 🤣 🤣 🤣 🤣 i lost it you all it was funny as shit" – Label: 'offensive' |
text: "<user> <user> <user> no we will take care of our beloved indian muslims but we do not want any illegal muslim immigrants from pakistan and bangladesh india is only for indian muslims all hindus sikhs jains and christians" – Label: 'offensive' |
text: "<user> jesus is gay bt he the good kind of gay not like them fucking gay ponies" – Label: 'hatespeech' |
text: "refugees welcome but no irish comedies have to be poorly written now so that mexicans and asians will get the jokes yeah you know what life is too short to listen to this radiohead al

### Few-Shot Classification Using FLAN-T5 Small Model

The **FLAN-T5 small model** uses a few-shot prompt for inference, where 3 examples from each label are presented. The prompt aids the model by providing concrete examples, expected to improve classification accuracy. The calculated accuracy and Macro-F1 score assess the impact of these few-shot references on the model’s performance.

In [22]:
few_shot_predictions_small = few_shot_inference(model_small, tokenizer_small, test_df, full_prompt)

accuracy_few_shot, macro_f1_few_shot = calculate_metrics(test_df, few_shot_predictions_small)

print(f"Accuracy for Few-Shot Prompting with Flan-T5 Small Model: {accuracy_few_shot:.4f}")
print(f"Macro-F1 Score for Few-Shot Prompting with Flan-T5 Small Model: {macro_f1_few_shot:.4f}")

Accuracy for Few-Shot Prompting with Flan-T5 Small Model: 0.3025
Macro-F1 Score for Few-Shot Prompting with Flan-T5 Small Model: 0.0720


### Few-Shot Classification Using FLAN-T5 Base Model

The **FLAN-T5 base model** is evaluated with the same few-shot setup, leveraging additional parameters and complexity. By comparing results with the small model, we can observe the base model’s response to the same few-shot guidance, potentially highlighting improvements in accuracy and F1 scores due to the model’s larger capacity.

In [23]:
few_shot_predictions_base = few_shot_inference(model_base, tokenizer_base, test_df, full_prompt)

accuracy_few_shot, macro_f1_few_shot = calculate_metrics(test_df, few_shot_predictions_base)

print(f"Accuracy for Few-Shot Prompting with Flan-T5 Base Model: {accuracy_few_shot:.4f}")
print(f"Macro-F1 Score for Few-Shot Prompting with Flan-T5 Base Model: {macro_f1_few_shot:.4f}")

Accuracy for Few-Shot Prompting with Flan-T5 Base Model: 0.4064
Macro-F1 Score for Few-Shot Prompting with Flan-T5 Base Model: 0.1927


## Dataset Overlap Calculation

This code defines a function `calculate_dataset_overlap` that counts the unique sentences shared by each pair of the three datasets: training, validation, and test.

### Key Steps:
- The function converts the 'speech' columns of the input DataFrames into sets to facilitate intersection operations.
- It calculates the overlap for the following pairs:
  - **Training and Validation Sets**: Determines how many sentences are common between these two sets.
  - **Validation and Test Sets**: Identifies the shared sentences between validation and test datasets.
  - **Training and Test Sets**: Finds common sentences between the training and test sets.

Finally, the script prints the overlap counts for each of these dataset pairs, providing insight into potential data leakage or redundancy across datasets.
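
Once overlapping sentences are identified, one option is to drop them from the training data so that no test sentence is seen during training or few-shot example selection. The sketch below shows the idea on toy rows with plain Python sets; `find_overlap` and `drop_overlap` are hypothetical helpers, and the notebook itself only counts the overlaps.

```python
def find_overlap(reference, other):
    """Return the sentences that appear in both collections."""
    return set(reference) & set(other)

def drop_overlap(rows, blocked):
    """Keep only (speech, label) rows whose speech is not in `blocked`."""
    return [(speech, label) for speech, label in rows if speech not in blocked]

# Toy rows standing in for the train and test DataFrames.
train_rows = [("a b c", "normal"), ("d e f", "offensive"), ("g h i", "hatespeech")]
test_rows = [("d e f", "offensive"), ("x y z", "normal")]

shared = find_overlap((s for s, _ in train_rows), (s for s, _ in test_rows))
clean_train = drop_overlap(train_rows, shared)

print(f"shared sentences: {shared}")
print(f"training rows kept: {len(clean_train)} of {len(train_rows)}")
```

Removing from the training side keeps the test set unchanged, so scores stay comparable with earlier runs.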

In [19]:
import pandas as pd

# Function to calculate the intersection (overlap) of sentences between different datasets
def calculate_dataset_overlap(train_df, val_df, test_df):
    train_sentences = set(train_df['speech'])
    val_sentences = set(val_df['speech'])
    test_sentences = set(test_df['speech'])

    train_val_overlap = len(train_sentences.intersection(val_sentences))
    val_test_overlap = len(val_sentences.intersection(test_sentences))
    train_test_overlap = len(train_sentences.intersection(test_sentences))

    return train_val_overlap, val_test_overlap, train_test_overlap

train_val_overlap, val_test_overlap, train_test_overlap = calculate_dataset_overlap(train_df, validation_df, test_df)

print(f"Number of overlapping sentences between train and validation sets: {train_val_overlap}")
print(f"Number of overlapping sentences between validation and test sets: {val_test_overlap}")
print(f"Number of overlapping sentences between train and test sets: {train_test_overlap}")

Number of overlapping sentences between train and validation sets: 3
Number of overlapping sentences between validation and test sets: 1
Number of overlapping sentences between train and test sets: 5


## Conclusion

The performance results for **zero-shot** and **few-shot prompting** with FLAN-T5 on the hate speech classification task reveal distinct patterns across model sizes and prompting strategies:

1. **Zero-Shot Prompting**:
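   - **FLAN-T5 Small Model** achieved an accuracy of 0.2448 and a Macro-F1 score of 0.0826. The accuracy is below the 40.6% share of the majority class (`normal`, 782 of 1924 test examples), indicating limited effectiveness for zero-shot hate speech classification.
   - **FLAN-T5 Base Model** performed better, with an accuracy of 0.3643 and a Macro-F1 score of 0.1456, suggesting that the larger model generalizes better in zero-shot settings. The scores are at least partly depressed by truncated outputs: with generation limited to one new token, predictions such as `hate` never match the label `hatespeech`.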

2. **Few-Shot Prompting**:
   - **FLAN-T5 Small Model** improved in accuracy in the few-shot setup (0.3025, up from 0.2448), but its Macro-F1 score fell from 0.0826 to 0.0720, so the gain does not reflect better coverage of all three labels.
   - **FLAN-T5 Base Model** reached an accuracy of 0.4064 and a Macro-F1 score of 0.1927, up from 0.3643 and 0.1456 in the zero-shot setting. These values coincide with what a classifier predicting `normal` for every test example would score (782/1924 ≈ 0.4064), which suggests the model may be predicting `normal` almost everywhere, so the gain should be interpreted with caution.

### Summary
Overall, the **FLAN-T5 Base Model** with **few-shot prompting** scored highest among the four configurations. Its accuracy (0.4064) is nevertheless no higher than the share of the majority class in the test set, so the measured benefit of few-shot examples and increased model capacity is modest for a nuanced task like hate speech classification. Further fine-tuning or dataset-specific training would likely yield higher accuracy and Macro-F1 scores.
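
As a reference point for these numbers, the score of a trivial majority-class predictor can be derived directly from the test label counts reported in the dataset summary. The sketch below does that arithmetic; `majority_baseline` is a hypothetical helper and is not part of the notebook.

```python
# Test-set label counts as printed in the dataset summary.
test_counts = {"normal": 782, "hatespeech": 594, "offensive": 548}

def majority_baseline(counts):
    """Accuracy and Macro-F1 of always predicting the most frequent label."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    accuracy = counts[majority] / total
    # Only the majority label has true positives: precision = accuracy, recall = 1.
    f1_majority = 2 * accuracy / (1 + accuracy)
    macro_f1 = f1_majority / len(counts)  # every other label scores F1 = 0
    return majority, accuracy, macro_f1

label, acc, f1 = majority_baseline(test_counts)
print(f"{label}: accuracy={acc:.4f}, macro_f1={f1:.4f}")
# prints "normal: accuracy=0.4064, macro_f1=0.1927"
```

Any configuration that does not clearly exceed both values has not demonstrably learned to separate the three classes on this test set.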