# Extrinsic Evaluation of GloVe on Sentiment Analysis (IMDB Dataset)

## Introduction
Extrinsic evaluation is a practical way to measure the performance of word embeddings like GloVe by applying them to downstream tasks. In this lab, we will evaluate GloVe embeddings on a binary sentiment analysis task using the IMDB dataset.

### Objectives
1. Load and preprocess the IMDB dataset.
2. Load pre-trained GloVe embeddings.
3. Create a PyTorch dataset for efficient data handling.
4. Implement a DataLoader for batch processing.
5. Train a simple sentiment classification model using GloVe embeddings.
6. Evaluate the model on the test set.


In [1]:
!pip install datasets




### Step 1: Data Loading
We load the IMDB dataset with the Hugging Face `datasets` library. It provides 25,000 labeled training reviews and 25,000 labeled test reviews.

In [2]:
import datasets

train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])



### Step 2: Load Pre-trained GloVe Embeddings
GloVe embeddings provide pre-trained word vectors that we will use to initialize our model's embedding layer.

In [3]:
import gensim.downloader as api
# Load pre-trained 200-dimensional GloVe vectors (Wikipedia + Gigaword) via gensim
model_glove = api.load("glove-wiki-gigaword-200")



### Step 3: Create a PyTorch Dataset
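We create a PyTorch dataset that represents each review as a single 200-dimensional vector: the average of the GloVe vectors of its tokens, after lowercasing and removing stopwords and standalone punctuation tokens. Tokens that are missing from the GloVe vocabulary are skipped.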

In [4]:
data=train_data[0]

In [5]:
import numpy as np
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

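# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a set of English stopwords.
# Sets give constant-time membership tests, which matters when filtering 25,000 reviews.
stopwords = set(stopwords.words('english'))
punctuations = set(string.punctuation)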

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
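def extract_sent_emb(text):
  """Return the mean GloVe vector of the in-vocabulary tokens of `text`."""
  sentence_emb = np.zeros(200)
  # Lowercase, split on whitespace, drop stopwords and standalone punctuation
  tokens = [token for token in text.lower().split()
            if token not in stopwords and token not in punctuations]
  cpt = 0  # number of tokens found in the GloVe vocabulary
  for token in tokens:
    if token in model_glove:
      sentence_emb += model_glove[token]
      cpt += 1
  if cpt > 0:
    sentence_emb /= cpt  # average; stays all-zero if no token was found
  return sentence_emb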

In [9]:
extract_sent_emb(data['text']).shape

(200,)
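
To make the averaging step concrete, here is a minimal sketch on a toy vocabulary. The 3-dimensional vectors, the `toy_vectors` table, the `toy_stopwords` set, and the `mean_pool` helper are illustrative assumptions and are not part of the notebook; the logic mirrors `extract_sent_emb` but uses plain Python lists instead of GloVe and NumPy.

```python
# Toy 3-dimensional "embeddings" (hypothetical values); the notebook uses 200-dimensional GloVe vectors.
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
}
toy_stopwords = {"a", "the"}


def mean_pool(text, vectors, stop, dim=3):
    """Average the vectors of the tokens that are not stopwords and are in the vocabulary."""
    total = [0.0] * dim
    count = 0
    for token in text.lower().split():
        if token in stop or token not in vectors:
            continue  # skipped tokens do not contribute to the average
        total = [t + v for t, v in zip(total, vectors[token])]
        count += 1
    if count == 0:
        return total  # all-zero vector when nothing matched
    return [t / count for t in total]


pooled = mean_pool("A great movie unknownword", toy_vectors, toy_stopwords)
print(pooled)  # [0.5, 1.0, 1.0]

# Whitespace splitting keeps punctuation attached: "movie." is not the token "movie".
with_punct = mean_pool("great movie.", toy_vectors, toy_stopwords)
print(with_punct)  # [1.0, 0.0, 2.0]
```

The second call shows a side effect of splitting on whitespace: a word followed directly by punctuation becomes a different token, so it is dropped unless the vocabulary happens to contain that exact string.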

In [10]:
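from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Map-style dataset that returns (sentence embedding, label) pairs."""

    def __init__(self, data):
        # `data` is any split (train or test) of the IMDB dataset
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        # The embedding is computed on the fly each time an item is requested
        return extract_sent_emb(example['text']), example['label']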


In [11]:
train_dataset=CustomDataset(train_data)
test_dataset=CustomDataset(test_data)
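
`CustomDataset` computes the sentence embedding every time an item is requested, so each additional epoch repeats the same work. One possible alternative, sketched below under stated assumptions, is to featurize every example once when the dataset is built. `PrecomputedDataset` and `toy_featurize` are hypothetical names; in the notebook the class would subclass `torch.utils.data.Dataset` and the featurizer would be `extract_sent_emb`.

```python
class PrecomputedDataset:
    """Hypothetical map-style dataset that featurizes every example once, up front."""

    def __init__(self, rows, featurize):
        # rows: any sequence of {"text": ..., "label": ...} dicts
        self.items = [(featurize(row["text"]), row["label"]) for row in rows]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # No recomputation here: just return the cached pair
        return self.items[idx]


calls = []  # records every featurizer call, to show how often it runs


def toy_featurize(text):
    """Stand-in featurizer (assumption): a 1-dimensional 'embedding' holding the word count."""
    calls.append(text)
    return [float(len(text.split()))]


rows = [{"text": "good film", "label": 1}, {"text": "dull", "label": 0}]
dataset = PrecomputedDataset(rows, toy_featurize)

first = dataset[0]
again = dataset[0]  # served from the cache, no new featurizer call
print(len(dataset), first, len(calls))  # 2 ([2.0], 1) 2
```

The trade-off is memory and start-up time: 25,000 reviews with 200 float64 values each take roughly 40 MB per split, and all embeddings are computed before the first batch is available.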

### Step 4: Create a DataLoader
DataLoaders provide an efficient way to batch and shuffle data.

In [12]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# Shuffling is unnecessary at evaluation time
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [13]:
batch=next(iter(train_dataloader))

In [14]:
batch[0].shape, batch[1].shape

(torch.Size([64, 200]), torch.Size([64]))
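
As a quick sanity check on the batching arithmetic, the sketch below computes how many batches one pass over 25,000 examples yields at a batch size of 64, assuming the last, smaller batch is kept (the `DataLoader` default, `drop_last=False`). The variable names are illustrative.

```python
import math

n_examples = 25_000  # size of each IMDB split
batch_size = 64

# With the incomplete final batch kept, the batch count is the ceiling of the ratio
n_batches = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (n_batches - 1) * batch_size
print(n_batches, last_batch_size)  # 391 40
```

The final batch therefore has shape `[40, 200]` rather than `[64, 200]`, which is why the training and evaluation loops should not hard-code the batch size.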

### Step 5: Define the Model and Training Loop
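We define a simple feed-forward network that takes the 200-dimensional averaged GloVe sentence embedding as input. It has one hidden layer with a ReLU activation, followed by a linear output layer that produces a single logit for binary classification. The GloVe vectors themselves stay fixed; only the two linear layers are trained.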

In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
class TinyModel(torch.nn.Module):
    """Two-layer feed-forward classifier over a fixed-size sentence embedding."""

    def __init__(self, sent_emb_size, hidden_dim, out_dim):
        super().__init__()
        self.linear1 = torch.nn.Linear(sent_emb_size, hidden_dim)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)  # raw logits; the sigmoid is applied inside the loss
        return x

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model= TinyModel(200,64,1).to(device)
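
To get a feel for how small this model is, the following sketch counts its trainable parameters by hand. The helper `linear_params` is a hypothetical name; it relies only on the fact that a linear layer with a bias has one weight per input-output pair plus one bias per output.

```python
def linear_params(n_in, n_out):
    """Parameter count of a fully connected layer with bias."""
    return n_in * n_out + n_out


sent_emb_size, hidden_dim, out_dim = 200, 64, 1

total = linear_params(sent_emb_size, hidden_dim) + linear_params(hidden_dim, out_dim)
print(total)  # 12929
```

In the notebook, `sum(p.numel() for p in model.parameters())` should report the same number.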

In [17]:
criterion = torch.nn.BCEWithLogitsLoss()
# Adam with its default hyperparameters (learning rate 1e-3)
optimizer = torch.optim.Adam(model.parameters())
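
`BCEWithLogitsLoss` takes raw logits and applies the sigmoid internally, which PyTorch documents as more numerically stable than a separate sigmoid followed by `BCELoss`. The sketch below illustrates the idea in plain Python for a single example. The function names are hypothetical, and the stable formula shown is one standard formulation, not necessarily the exact one PyTorch uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_naive(z, y):
    """Binary cross-entropy computed from the probability sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Same loss computed directly from the logit z, without forming log(0)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# The two forms agree for moderate logits
for z, y in [(0.0, 1.0), (2.0, 1.0), (-3.0, 1.0), (1.5, 0.0)]:
    print(z, y, bce_naive(z, y), bce_with_logits(z, y))

# For an extreme logit the naive form would evaluate log(0); the stable form stays finite
print(bce_with_logits(1000.0, 0.0))  # 1000.0
```

A logit of 0 corresponds to a probability of 0.5 and a loss of log 2, about 0.693, which is a useful reference point when reading the training loss.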

In [18]:
from tqdm.notebook import tqdm

In [19]:
n_epochs=1
for epoch in range(n_epochs):
  running_loss=[]
  model.train()
  for batch in tqdm(train_dataloader):
    inputs, labels=batch
    inputs, labels=inputs.to(device).float(), labels.to(device).float()

    # Zero your gradients for every batch!
    optimizer.zero_grad()
    outputs=model(inputs).reshape(-1)


    # Compute the loss and its gradients
    loss = criterion(outputs, labels)
    loss.backward()

    # Adjust learning weights
    optimizer.step()
    running_loss.append(loss.item())
  print('Epoch {} loss {} '.format(epoch, np.mean(running_loss)))


Epoch 0 loss 0.527591282449415 


### Step 6: Model Evaluation
We evaluate the model on the test set, reporting the loss, the accuracy, and per-class precision, recall, and F1-score.

In [20]:
running_loss=[]
predictions=[]
gt=[]
model.eval()  # evaluation mode (no effect on this model, but good practice)
for batch in tqdm(test_dataloader):
    inputs, labels=batch
    inputs, labels=inputs.to(device).float(), labels.to(device).float()

    # No gradients are needed at evaluation time
    with torch.no_grad():
      outputs=model(inputs).reshape(-1)
      loss=criterion(outputs, labels)

    # Predict the positive class when the sigmoid probability is at least 0.5
    predictions.extend((torch.sigmoid(outputs).cpu().numpy()>=0.5)*1)
    gt.extend(labels.cpu().numpy())

    running_loss.append(loss.item())
print('Test loss {} '.format(np.mean(running_loss)))


Test loss 0.4567110518665265 


In [21]:
print('Accuracy :', np.mean(np.array(predictions)==np.array(gt)))

Accuracy : 0.78804
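
The prediction rule used above, thresholding the sigmoid output at 0.5, is equivalent to checking whether the logit is non-negative. The sketch below shows this on made-up logits and labels; the values are purely illustrative and are not taken from the model.

```python
import math

logits = [-2.0, -0.1, 0.0, 0.3, 4.0]  # made-up model outputs
labels = [0, 1, 1, 0, 1]              # made-up ground truth


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Thresholding the probability at 0.5 ...
preds_from_prob = [int(sigmoid(z) >= 0.5) for z in logits]
# ... gives the same decisions as thresholding the logit at 0
preds_from_logit = [int(z >= 0.0) for z in logits]

accuracy = sum(p == y for p, y in zip(preds_from_prob, labels)) / len(labels)
print(preds_from_prob, preds_from_logit, accuracy)  # [0, 0, 1, 1, 1] [0, 0, 1, 1, 1] 0.6
```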


In [22]:
from sklearn.metrics import classification_report
print(classification_report(gt, predictions, target_names=['Neg','Pos']))

              precision    recall  f1-score   support

         Neg       0.81      0.76      0.78     12500
         Pos       0.77      0.82      0.79     12500

    accuracy                           0.79     25000
   macro avg       0.79      0.79      0.79     25000
weighted avg       0.79      0.79      0.79     25000
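
For readers who want to see what the report computes, here is a minimal sketch of per-class precision, recall, and F1 on a made-up set of eight predictions. The helper `per_class_metrics` and the toy labels are assumptions for illustration; the notebook itself relies on `classification_report` from scikit-learn.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # missed examples of cls
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up ground truth
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]  # made-up predictions

neg = per_class_metrics(y_true, y_pred, 0)
pos = per_class_metrics(y_true, y_pred, 1)
print("Neg", neg)
print("Pos", pos)
```

Because the IMDB test set is balanced (12,500 reviews per class), the overall accuracy equals the mean of the two per-class recalls: (0.76 + 0.82) / 2 = 0.79, which agrees with the report above.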

