# Natural Language Processing
## Dependencies
First, install the following dependencies:

In [None]:
!pip install datasets scikit-learn huggingface_hub transformers==4.37.2 evaluate accelerate

## Dataset
We are going to use the IMDB dataset for sentiment classification.

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Subset generation
For the sake of this example, we are going to use a subset of the dataset: a stratified 20% sample of the training split (5,000 reviews, balanced across both labels).

In [2]:
dataset_subset = dataset['train'].train_test_split(test_size=0.2, stratify_by_column='label', shuffle=True, seed=42)['test']
dataset_subset, dataset_subset.to_pandas()['label'].value_counts(),

(Dataset({
     features: ['text', 'label'],
     num_rows: 5000
 }),
 label
 0    2500
 1    2500
 Name: count, dtype: int64)
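
To make the stratification concrete, here is a minimal pure-Python sketch of drawing a class-balanced subset. `stratified_subset` is a hypothetical helper written for illustration; it is not part of `datasets` and will not reproduce the exact rows chosen by `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=42):
    """Return sorted indices of a subset that keeps each label's share (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for label, indices in by_label.items():
        k = round(len(indices) * fraction)  # take the same fraction from every class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

toy_labels = [0] * 10 + [1] * 10
subset = stratified_subset(toy_labels, fraction=0.2)
print([toy_labels[i] for i in subset])  # [0, 0, 1, 1]
```

Sampling per class, rather than from the pooled data, is what keeps the 50/50 label balance seen in the output above.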

In [3]:
dataset['train'] = dataset_subset
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## The Naive Bayes Classifier in a Nutshell
We treat the probability of a word appearing in a document as independent of the probability of any other word appearing in the same document, given the class. This assumption is naive, but it greatly simplifies computing the probability that a document belongs to a class.

### Mathematical Formulation

Given a document $D$ with words $w_1, w_2, ..., w_n$ and a set of classes $C$, the Naive Bayes classifier computes the posterior probability of each class $c \in C$ given the document $D$, $P(c | D)$, as follows:

$$P(c | D) = \frac{P(c) \prod_{i=1}^{n} P(w_i | c)}{P(D)}$$

where:
- $P(c | D)$ is the posterior probability of class $c$ given document $D$,
- $P(c)$ is the prior probability of class $c$,
- $P(w_i | c)$ is the likelihood of word $w_i$ given class $c$,
- $P(D)$ is the probability of document $D$, which acts as a normalizing constant.

Since $P(D)$ is the same for all classes, we focus on maximizing $P(c) \prod_{i=1}^{n} P(w_i | c)$.
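
In practice, multiplying many small probabilities underflows, so the comparison is usually done in log space: $\arg\max_c \left[\log P(c) + \sum_{i=1}^{n} \log P(w_i | c)\right]$. The sketch below illustrates this with made-up toy probabilities; the labels `pos` and `neg` and all numbers are assumptions for illustration only.

```python
import math

# Toy, made-up probabilities for illustration only
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"great": 0.20, "boring": 0.02, "movie": 0.10},
    "neg": {"great": 0.03, "boring": 0.15, "movie": 0.10},
}

def log_score(words, label):
    """log P(c) + sum_i log P(w_i | c); P(D) is dropped because all classes share it."""
    return math.log(priors[label]) + sum(math.log(likelihoods[label][w]) for w in words)

def predict(words):
    """Pick the class with the highest log score."""
    return max(priors, key=lambda label: log_score(words, label))

print(predict(["great", "movie"]))   # pos
print(predict(["boring", "movie"]))  # neg
```

Because the logarithm is monotonic, the class that maximizes the log score is the same class that maximizes the original product.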

### Handling Out-of-Vocabulary Words with Laplace Smoothing

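A challenge in applying Naive Bayes to text classification is dealing with Out-of-Vocabulary (OOV) words, that is, words present in the test set but not seen during training. Without any correction, such a word has an estimated likelihood of zero, which drives the entire product $\prod_{i=1}^{n} P(w_i | c)$ to zero. To address this, we use Laplace smoothing, which adjusts the likelihood calculation so that no probability is ever zero.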

#### Laplace Smoothing Formula

The modified formula for $P(w_i | c)$ with Laplace smoothing is:

$$P(w_i | c) = \frac{N_{w_i, c} + \alpha}{N_c + \alpha \times |V|}$$

where:
- $N_{w_i, c}$ is the count of times word $w_i$ appears in documents of class $c$,
- $N_c$ is the total count of all words in documents of class $c$,
- $|V|$ is the size of the vocabulary,
- $\alpha$ is the smoothing parameter (typically set to 1 for add-one smoothing).

This adjustment ensures that every word, including those not seen during training, contributes to the probability calculations, allowing the classifier to make predictions even in the presence of new words.
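
As a small worked example of the formula, the sketch below computes smoothed likelihoods from toy counts for a single class. The helper name `laplace_likelihood` and the counts are assumptions made for illustration.

```python
from collections import Counter

def laplace_likelihood(word, class_counts, vocab_size, alpha=1.0):
    """(N_{w,c} + alpha) / (N_c + alpha * |V|)"""
    n_c = sum(class_counts.values())  # total word count for this class
    return (class_counts[word] + alpha) / (n_c + alpha * vocab_size)

# Toy counts for one class; the full vocabulary is assumed to hold 4 word types
counts_pos = Counter({"great": 3, "movie": 2})
vocab_size = 4

print(laplace_likelihood("great", counts_pos, vocab_size))   # (3 + 1) / (5 + 4)
print(laplace_likelihood("boring", counts_pos, vocab_size))  # (0 + 1) / (5 + 4)
```

The unseen word `boring` receives a small but non-zero probability of $1/9$ instead of zero, so it no longer wipes out the product.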

## `TODO`: Implementing NB Classifier from Scratch
Generally, your algorithm should produce a model with an accuracy of roughly 0.7 or better. The following two cells are the main part of the implementation.

In [4]:
from collections import Counter, defaultdict
def train_model(sentences, labels):
    '''
    sentences: list of strings
    labels: list of 0 or 1s

    returns: (wscores, cscores)
    - wscores: a dictionary of dicts, where wscores[label][word] is the probability of word given label
    - cscores: a dict, where cscores[label] is the probability of label
    '''
    wscores = defaultdict(Counter)
    cscores = Counter(labels)
    for sentence, label in zip(sentences, labels):
        # add up the word counts for the sentence
        wscores[label] += Counter(sentence.split())
    # since the keys of wscores[label] are the words in the sentence, we can get the total vocab size by taking the union of all the keys
    total_vocab_size = len(set.union(*(set(wscores[l].keys()) for l in wscores)))
    to_return_wscores = {}
    for label in cscores:
        # normalize label counts
        cscores[label] /= len(labels)
        # total word count for this label
        label_word_count = sum(wscores[label].values())
        for word in wscores[label]:
            wscores[label][word] = (wscores[label][word] + 1) / (label_word_count + total_vocab_size)
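        # handle oov for this label; bind the value as a default argument so each label
        # keeps its own denominator (a plain closure would see the last label's count)
        oov_prob = 1 / (label_word_count + total_vocab_size)
        to_return_wscores[label] = defaultdict(lambda p=oov_prob: p)
        to_return_wscores[label].update(dict(wscores[label]))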
    return to_return_wscores, dict(cscores)

In [5]:
import math
def inference_model_ele(wscores, cscores, sentence):
    '''
    Helper function, may vary.
    '''
    scores = Counter()
    for label in cscores:
        scores[label] = math.log(cscores[label])
        for word in sentence.split():
            scores[label] += math.log(wscores[label][word] + 1e-10)

    return scores.most_common(1)[0][0]

def inference_model(wscores, cscores, sentences):
    '''
    returns: list of 0 or 1s using our model (wscores, cscores)
    '''
    return [inference_model_ele(wscores, cscores, sentence) for sentence in sentences]

## Putting Everything Together

In [6]:
from sklearn.metrics import accuracy_score
wscores, cscores = train_model(dataset['train']['text'], dataset['train']['label'])
predictions = inference_model(wscores, cscores, dataset['test']['text'])
accuracy = accuracy_score(dataset['test']['label'], predictions)
accuracy

0.80736

## Using the Naive Bayes Classifier from `sklearn`
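The `sklearn` pipeline performs slightly better than our implementation (about 0.812 vs. 0.807 accuracy). Part of the difference likely comes from preprocessing that we did not consider: `CountVectorizer` lowercases the text, strips punctuation during tokenization, and (with `stop_words='english'`) drops common stop words, whereas our version only splits on whitespace.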

NB is also a strong baseline on small datasets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', )
# vectorizer = CountVectorizer(stop_words='english', max_features=10000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(dataset['train']['text'])
X_test = vectorizer.transform(dataset['test']['text'])

In [None]:
from sklearn.naive_bayes import MultinomialNB

y_train = dataset['train']['label']
y_test = dataset['test']['label']

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.81164


## Using Pretrained Embeddings (GloVe)
In our previous example, little information about the words themselves is encoded: each word is represented only by its count, independently of every other word. Pretrained embeddings, on the other hand, are trained on a large corpus using the contexts in which each word appears. This means that the embeddings contain more information about the words.

The following illustration is taken from [CS224N](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture02-wordvecs2.pdf). It is used to illustrate the training of word2vec, but it also illustrates the general idea of training word embeddings.

![word2vec](https://drive.google.com/uc?export=view&id=1rrACOt5WeoVbynB-pcC6h_O-ys1wySgU)
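
One way to see what this extra information looks like: words used in similar contexts tend to end up with nearby vectors, and closeness is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors, not real GloVe values, purely to show the computation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumed values, for illustration only)
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [-0.1, 0.9, -0.5],
}

print(cosine_similarity(toy_vectors["good"], toy_vectors["great"]))  # close to 1
print(cosine_similarity(toy_vectors["good"], toy_vectors["table"]))  # negative
```

Count-based features have no notion of such similarity: `good` and `great` are simply two unrelated columns.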



## Loading GloVe

In [None]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="JeremiahZ/glove", filename="glove.6B.50d.txt", local_dir=".", )
hf_hub_download(repo_id="JeremiahZ/glove", filename="run_classification.py", local_dir=".", )

In [None]:
import numpy as np
from collections import defaultdict
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector
    oov_embedding = np.mean(list(embeddings_dict.values()), axis=0)
    to_return_dict = defaultdict(lambda: oov_embedding)
    to_return_dict.update(embeddings_dict)
    return to_return_dict

glove_path = './glove.6B.50d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)
glove_embeddings['the'].shape

(50,)

## Logistic Regression with GloVe
We are going to start with a baseline model and improve upon it.

In [None]:
def text_to_embedding(text, embeddings_dict):
    words = text.split()
    embeddings = [embeddings_dict[word] for word in words]
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(50)  # We should always get an embedding since we are using defaultdict; this is out of caution

X_train_embeddings = np.array([text_to_embedding(text, glove_embeddings) for text in dataset['train']['text']])
X_test_embeddings = np.array([text_to_embedding(text, glove_embeddings) for text in dataset['test']['text']])
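
The averaging step above can be illustrated on toy data. The following pure-Python sketch uses made-up 2-dimensional vectors and a hypothetical `mean_pool` helper that mirrors the logic of `text_to_embedding` without NumPy.

```python
from collections import defaultdict

# Toy 2-d "embeddings" (made-up values); unknown words fall back to a shared OOV vector
oov_vector = [0.0, 0.0]
toy_embeddings = defaultdict(lambda: oov_vector, {"good": [1.0, 0.0], "film": [0.0, 1.0]})

def mean_pool(text, embeddings, dim=2):
    """Average the vectors of whitespace-separated tokens into one fixed-size vector."""
    vectors = [embeddings[token] for token in text.split()]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(mean_pool("good film", toy_embeddings))  # [0.5, 0.5]
print(mean_pool("film good", toy_embeddings))  # [0.5, 0.5]
```

Both orderings give the same vector, so this representation discards word order. Note also that the `glove.6B` vocabulary is lowercased, so capitalized tokens from `text.split()` fall back to the OOV vector; lowercasing the text first is one possible improvement.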

In [None]:
from sklearn.linear_model import LogisticRegression

y_train = dataset['train']['label']
y_test = dataset['test']['label']

logistic_model = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
logistic_model.fit(X_train_embeddings, y_train)

In [None]:
predictions = logistic_model.predict(X_test_embeddings)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7336


## GRU with GloVe
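First, we need to preprocess our dataset: tokenize the reviews, map tokens to integer indices, and pad every sequence to the same length. This cell uses `keras` and `torch`, which the install command at the top does not include; they are assumed to be available in the environment already.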

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import Dataset, DataLoader
import torch

# Tokenization
tokenizer = Tokenizer(num_words=10000)  # Limiting our vocabulary to the top 10,000 words
tokenizer.fit_on_texts(dataset['train']['text'])

# Convert texts to sequences of indices
X_train_seq = tokenizer.texts_to_sequences(dataset['train']['text'])
X_test_seq = tokenizer.texts_to_sequences(dataset['test']['text'])

# Pad sequences to ensure uniform length
maxlen = 100  # This can be adjusted
X_train_pad = pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, maxlen=maxlen)

# Convert labels to tensors
y_train = torch.tensor(dataset['train']['label'])
y_test = torch.tensor(dataset['test']['label'])

# Dataset
class TextDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long), self.labels[idx]

train_dataset = TextDataset(X_train_pad, y_train)
test_dataset = TextDataset(X_test_pad, y_test)

# DataLoader
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
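
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The sketch below mimics that default behavior on a single list of token ids; `pad_pre` is a hypothetical helper for illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic default pre-truncation and pre-padding for one list of token ids."""
    truncated = sequence[-maxlen:]  # keep only the last maxlen ids
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With `maxlen = 100`, this means only the last 100 tokens of a long review reach the model, and index 0 is reserved for padding.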

## `TODO`: Defining the Model

In [None]:
import torch.nn as nn
# Model instantiation
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
embedding_dim = 50  # GloVe 50d
hidden_dim = 64
output_dim = 1

class GRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, embeddings):
        '''
        1. An embedding layer
        2. A GRU layer
        3. A linear layer

        We initialize the embedding layer with the GloVe embeddings
        '''
        super(GRUModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(embeddings)  # Initialize embedding layer with GloVe embeddings
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        x = self.embedding(x)
        _, hidden = self.gru(x)
        out = self.fc(hidden.squeeze(0))
        return out

# TODO: Prepare the GloVe embedding matrix of size (vocab_size, embedding_dim)
embedding_matrix = torch.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = torch.tensor(embedding_vector, dtype=torch.float32)


model = GRUModel(vocab_size, embedding_dim, hidden_dim, output_dim, embedding_matrix)

## Train!

In [None]:
# device = torch.device()
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 3
log_steps = 10
for epoch in range(epochs):
    model.train()
    total_loss = 0
    step_count = 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs).squeeze(1)
        loss = criterion(outputs, labels.float())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        step_count += 1
        if step_count % log_steps == 0:
            print(f'Epoch {epoch+1}, Step {step_count}, Loss: {total_loss/log_steps}')
            total_loss = 0

    # print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}')


Epoch 1, Step 10, Loss: 0.6919932782649993
Epoch 1, Step 20, Loss: 0.6947023630142212
Epoch 1, Step 30, Loss: 0.6948124051094056
Epoch 1, Step 40, Loss: 0.6845528841018677
Epoch 1, Step 50, Loss: 0.6874830543994903
Epoch 1, Step 60, Loss: 0.6788933992385864
Epoch 1, Step 70, Loss: 0.6717429339885712
Epoch 1, Step 80, Loss: 0.6770426809787751
Epoch 1, Step 90, Loss: 0.6561667263507843
Epoch 1, Step 100, Loss: 0.6259108662605286
Epoch 1, Step 110, Loss: 0.6040982604026794
Epoch 1, Step 120, Loss: 0.5965084969997406
Epoch 1, Step 130, Loss: 0.6147490441799164
Epoch 1, Step 140, Loss: 0.7254113852977753
Epoch 1, Step 150, Loss: 0.6236443042755127
Epoch 2, Step 10, Loss: 0.6353871285915375
Epoch 2, Step 20, Loss: 0.6063235342502594
Epoch 2, Step 30, Loss: 0.5640124976634979
Epoch 2, Step 40, Loss: 0.5351787120103836
Epoch 2, Step 50, Loss: 0.5364161550998687
Epoch 2, Step 60, Loss: 0.5556027203798294
Epoch 2, Step 70, Loss: 0.485073384642601
Epoch 2, Step 80, Loss: 0.5257476329803467
Epoch 

## Evaluate Our Model on the Test Set

In [None]:
model.eval()
total = 0
correct = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs).squeeze(1)
        predicted = torch.round(torch.sigmoid(outputs))
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test Accuracy: {correct / total}')

Test Accuracy: 0.8214
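
The prediction rule in the evaluation loop, `torch.round(torch.sigmoid(outputs))`, amounts to thresholding the probability at 0.5, which is the same as checking the sign of the raw logit. The pure-Python sketch below shows this equivalence on a few made-up logits.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [-2.0, -0.1, 0.3, 4.0]  # made-up model outputs
by_sigmoid = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
by_sign = [1 if z > 0 else 0 for z in logits]

print(by_sigmoid)  # [0, 0, 1, 1]
print(by_sign)     # [0, 0, 1, 1]
```

This is also why the model can output a single unnormalized value and be trained with `BCEWithLogitsLoss`: the sigmoid is only needed when a probability is required.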


## `TODO`: IMDB Review Classification with DistilBERT
We are going to reuse the pipeline that Hugging Face provides in the official `transformers` library. Go through the arguments defined in `run_classification.py` and the Hugging Face [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) to write a script that trains a DistilBERT model on the IMDB dataset.

In [None]:
!python run_classification.py \
    --model_name_or_path distilbert-base-uncased \
    --dataset_name imdb \
    --remove_splits unsupervised \
    --shuffle_train_dataset \
    --text_column_names text \
    --metric_name accuracy \
    --do_eval \
    --validation_split_name test \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 64 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --output_dir ./checkpoints/imdb \
    --save_strategy no \
    --do_train

## Reference Eval Metrics
```bash
***** eval metrics *****
  eval_accuracy           =       0.87
  eval_loss               =     0.3005
  eval_runtime            = 0:03:10.01
  eval_samples            =      25000
  eval_samples_per_second =    131.568
  eval_steps_per_second   =      2.058
```