# CS: Natural Language Processing

## Hands-on Workshop - First Session

### [Vectorized Operations: Optimized Computations on NumPy Arrays](https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html)
Restricting a NumPy array's contents to a single data type comes with a great benefit: *knowing* that the array's contents are homogeneous, NumPy can delegate mathematical operations on the array to optimized, compiled C code. This process is referred to as <span style="color:cyan">*Vectorization*</span>. The outcome can be a tremendous speedup relative to the analogous computation performed in pure Python, which must painstakingly check the data type of each item as it iterates, since Python lists can hold unrestricted contents.
<br><br/>
This part of the notebook will go through the topics in order:
- [Universal Functions vs For Loops](#Universal-Functions-vs-For-Loops)

- [$L^2$ Distance](#$L^2$-Distance)

- [Positional Encoding](#Positional-Encoding)

---

In [1]:
import numpy as np
import time, torch

#### Universal Functions vs For Loops

In [2]:
n = int(1e6)
array = np.arange(n, dtype=np.float64)

sum_1 = 0
tic1 = time.time()
for i in array:
    sum_1 += i
toc1 = time.time()

tic2 = time.time()
sum_2 = array.sum()
toc2 = time.time()

print(f"Sum of the first{n:10,d} natural numbers are {sum_1:,.0f} and calculated in about{1000*(toc1-tic1):7.2f} ms using FOR loop\
    \n\nSum of the first{n:10,d} natural numbers are {sum_2:,.0f} and calculated in about{1000*(toc2-tic2):7.4f} ms using Vectorization\
    \n\n~{(toc1-tic1)/(toc2-tic2):.0f} times faster!")

Sum of the first 1,000,000 natural numbers are 499,999,500,000 and calculated in about 154.07 ms using FOR loop    

Sum of the first 1,000,000 natural numbers are 499,999,500,000 and calculated in about 0.6130 ms using Vectorization    

~251 times faster!


Timed on my `colab`, the sum is about <span style="color:orange">**250**</span> times faster when performed using NumPy's vectorized function! This should make it clear that, whenever computational efficiency is important, one should avoid explicit for loops over long sequences of data in Python (lists or NumPy arrays). NumPy provides a whole suite of vectorized functions, and the name of the game when computing over arrays of numbers is to rely on them wherever possible.
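As a small illustration of that suite, the sketch below applies two element-wise universal functions and two reductions to a tiny array, then repeats the same computation with explicit loops for comparison. The array and variable names are made up for this example.

```python
import numpy as np

x = np.arange(5, dtype=np.float64)      # [0., 1., 2., 3., 4.]

# Element-wise ufuncs: one call replaces an explicit Python loop.
squares = np.square(x)
roots = np.sqrt(x)

# Reductions collapse the array in compiled code.
total = x.sum()
mean = x.mean()

# The same results with explicit loops, for comparison only.
squares_loop = np.array([v * v for v in x])
total_loop = 0.0
for v in x:
    total_loop += v

print(squares, roots)
print(np.allclose(squares, squares_loop), total == total_loop)
```

On an array this small the loops are not noticeably slower; the gap only becomes visible at sizes like the million-element array timed above.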

#### $L^2$ Distance

Suppose that we have a matrix called $X_1$, which is a NumPy array with the shape $(N_1,D)$. This matrix has $N_1$ rows, and each row corresponds to a $D$-dimensional vector, which we call an instance. There is also another matrix named $X_2$ with the shape $(N_2,D)$. Note that $N_2$ is not necessarily equal to $N_1$.

We want to compute the $L^2$ distance between each instance in the first matrix and each instance in the second matrix. The $L^2$ distance or *Euclidean distance* between two vectors $\mathbf x,\,\mathbf y\in\mathbb R^n$ is defined as follows:
$$\mathbf d(\mathbf x,\mathbf y)=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\dots+(x_n-y_n)^2}$$
We want to make a matrix called ```pdists``` in which the $(i,j)_{th}$ entry is the $L^2$ distance between the $i_{th}$ instance in $X_1$ and the $j_{th}$ instance in $X_2$.
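
The vectorized implementation in the next cell avoids the double loop by expanding the squared distance. A short derivation, using the same notation as the definition above:

$$\mathbf d(\mathbf x,\mathbf y)^2=\sum_{k=1}^{n}(x_k-y_k)^2=\sum_{k=1}^{n}x_k^2-2\sum_{k=1}^{n}x_ky_k+\sum_{k=1}^{n}y_k^2=\lVert\mathbf x\rVert^2-2\,\mathbf x^\top\mathbf y+\lVert\mathbf y\rVert^2$$

Applied to every pair of rows at once, the middle term becomes the $(N_1,N_2)$ matrix product $X_1X_2^\top$, while the squared row norms of $X_1$ and $X_2$ are stored as an $(N_1,1)$ column and a $(1,N_2)$ row, so that broadcasting adds them to every entry.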

In [3]:
def L2_distance_naiv(X1, X2):
    N1, D1 = X1.shape
    N2, D2 = X2.shape
    assert D1==D2, "Shape Mismatch!"
    pdists = -np.ones(shape=(N1,N2), dtype=np.float64)
    for i in range(N1):
        for j in range(N2):
            z_ij = X1[i,:]-X2[j,:]
            s_ij = (z_ij**2).sum()
            pdists[i,j] = s_ij**0.5
    return pdists

def L2_distance_vect(X1, X2):
    N1, D1 = X1.shape
    N2, D2 = X2.shape
    assert D1==D2, "Shape Mismatch!"
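    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, evaluated for all pairs at once
    X1_square = np.sum(a=X1**2, axis=1, keepdims=True)    # shape (N1, 1)
    X2_square = np.sum(a=X2**2, axis=1).reshape(1, N2)    # shape (1, N2)
    X1X2 = np.matmul(X1, X2.T)                            # shape (N1, N2)
    # Clip tiny negative values caused by round-off before taking the square root
    pdists = np.maximum(X1_square - 2*X1X2 + X2_square, 0)**0.5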
    return pdists

In [4]:
X1 = np.random.randn(2023, 97)
X2 = np.random.randn(1402, 97)

tic1 = time.time()
pdists_1 = L2_distance_naiv(X1, X2)
toc1 = time.time()

tic2 = time.time()
pdists_2 = L2_distance_vect(X1, X2)
toc2 = time.time()

print(f"Running time of L2_distance_naiv is about{toc1-tic1:7.4f} sec\
    \n\nRunning time of L2_distance_vect is about{toc2-tic2:7.4f} sec\
    \n\n~{(toc1-tic1)/(toc2-tic2):.0f} times faster!")

Running time of L2_distance_naiv is about 8.7567 sec    

Running time of L2_distance_vect is about 0.0649 sec    

~135 times faster!


`L2_distance_vect` runs far faster than `L2_distance_naiv` (about 135 times in the run above). This experiment shows how vectorized matrix computations in NumPy can be much more efficient than naively using *`FOR loops`* in Python. NumPy is a powerful module with a beautiful functional API, and many of its functions are implemented in <span style="color:cyan">**C**</span>, which makes them very efficient. Nevertheless, one of NumPy's biggest disadvantages is that it cannot run on a **GPU**. Machine learning frameworks such as <span style="color:cyan">**PyTorch**</span> sidestep this problem.
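
A speedup is only useful if the result is unchanged, so it can help to cross-check the expanded formula against a direct computation on small inputs. The sketch below is one way to do this: the function names, the random test matrices, and the clipping step are assumptions of this example, not part of the timed cell above.

```python
import numpy as np

def l2_distance_broadcast(X1, X2):
    """Pairwise Euclidean distances via broadcasting (builds an (N1, N2, D) array)."""
    diff = X1[:, None, :] - X2[None, :, :]      # shape (N1, N2, D)
    return np.sqrt((diff ** 2).sum(axis=2))     # shape (N1, N2)

def l2_distance_expanded(X1, X2):
    """Pairwise Euclidean distances via ||x||^2 - 2 x.y + ||y||^2."""
    sq1 = (X1 ** 2).sum(axis=1, keepdims=True)  # shape (N1, 1)
    sq2 = (X2 ** 2).sum(axis=1)[None, :]        # shape (1, N2)
    sq_dists = sq1 - 2.0 * (X1 @ X2.T) + sq2
    # Round-off can push near-zero values slightly below zero; clip before the root.
    return np.sqrt(np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((4, 3))
D_ref = l2_distance_broadcast(A, B)
D_exp = l2_distance_expanded(A, B)
print(np.allclose(D_ref, D_exp))
```

The broadcasting version is easier to read but allocates an intermediate array of size $N_1\times N_2\times D$, which is why the matrix-product form is usually preferred for large inputs.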

#### Positional Encoding

Sinusoidal functions are an easy way to encode the position of tokens in a sequence, as used in the **Transformer** paper ([Attention Is All You Need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)). In the following cell, we consider three different implementations of these functions:
$$\textit{PE}(pos,2i)=\sin\bigg(\frac{pos}{10000^{2i/d_{model}}}\bigg),\qquad\textit{PE}(pos,2i+1)=\cos\bigg(\frac{pos}{10000^{2i/d_{model}}}\bigg)$$
where $pos$ is the position and **$i$** indexes the dimension pair $(2i,\,2i+1)$. That is, each dimension of the positional encoding corresponds to a sinusoid.
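
Since the argument of each sinusoid is $pos/10000^{2i/d_{model}}$, the pair with index $i$ has wavelength $2\pi\cdot10000^{2i/d_{model}}$, so the wavelengths form a geometric progression starting at $2\pi$. The sketch below computes them for a deliberately small, hypothetical $d_{model}=8$:

```python
import numpy as np

d_model, n = 8, 10000                       # small d_model chosen for illustration
two_i = np.arange(0, d_model, 2)            # the even indices 2i = 0, 2, 4, 6
wavelengths = 2 * np.pi * n ** (two_i / d_model)

# Consecutive sinusoid pairs differ by the constant factor n**(2/d_model).
ratios = wavelengths[1:] / wavelengths[:-1]
print(wavelengths)
print(ratios)
```

With these values the factor is $10000^{2/8}=10$, so each pair of dimensions oscillates ten times more slowly than the previous one.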

<img src="../../figs/Positional-Encoding.png" width=737>

In [5]:
def PositionalEncoding_naive(max_length, d_model, n=10000):
    PE = np.zeros((max_length,d_model))
    for pos in range(max_length):
        for j in range(0,d_model,2):
            PE[pos,j] = np.sin(pos/n**(j/d_model))
            PE[pos,j+1] = np.cos(pos/n**(j/d_model))
    return PE


def PositionalEncoding_vt_np(max_length, d_model, n=10000):
    pos = np.arange(max_length).reshape(-1,1)
    dnm = np.arange(0,d_model,2).reshape(1,-1)/d_model
    frc = pos/n**dnm

    PE = np.zeros((max_length,d_model))
    PE[:, 0::2] = np.sin(frc)
    PE[:, 1::2] = np.cos(frc)

    return PE


def PositionalEncoding_vt_pt(max_length, d_model, n=10000, device="cuda"):
    pos = torch.arange(max_length, device=device).view(-1,1)
    dnm = torch.arange(0, d_model, 2, device=device).view(1,-1)/d_model
    dnm = torch.pow(n,dnm)
    frc = pos/dnm

    PE = torch.zeros(max_length, d_model, device=device)
    PE[:, 0::2] = torch.sin(frc)
    PE[:, 1::2] = torch.cos(frc)

    return PE

In [6]:
max_length, d_model = 8192, 16384

tic1 = time.time()
pe_1 = PositionalEncoding_naive(max_length, d_model)
toc1 = time.time()

tic2 = time.time()
pe_2 = PositionalEncoding_vt_np(max_length, d_model)
toc2 = time.time()

_ = PositionalEncoding_vt_pt(max_length, d_model)   # Warm-up call: initializes CUDA before timing
tic3 = time.time()
pe_3 = PositionalEncoding_vt_pt(max_length, d_model)
toc3 = time.time()

print(f"Running time of PositionalEncoding_naive is about{toc1-tic1:7.2f} sec\
    \n\nRunning time of PositionalEncoding_vt_np is about{toc2-tic2:7.4f} sec\
    \n\nRunning time of PositionalEncoding_vt_pt is about{1e6*(toc3-tic3):7.2f} µs\
    \n\n~{(toc2-tic2)/(toc3-tic3):.0f} times faster!")

Running time of PositionalEncoding_naive is about 123.91 sec    

Running time of PositionalEncoding_vt_np is about 1.3248 sec    

Running time of PositionalEncoding_vt_pt is about 831.13 µs    

~1594 times faster!


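In this single-shot measurement, running on **CPU** with NumPy is about <span style="color:orange">**1500**</span> times slower than running on **GPU** using <span style="color:cyan">**PyTorch**</span> tensors on my `colab` virtual machine. Note, however, that CUDA kernels are launched asynchronously and the timing above does not call `torch.cuda.synchronize()`, so the 831 µs figure may mostly reflect launch overhead. The `%timeit` measurements below (1.31 s vs. 8.7 ms) suggest a speedup closer to 150 times.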
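
One way to make such timings more trustworthy is to wait for pending GPU work both before starting and before stopping the clock. The helper below is a minimal sketch of that idea: `timed` and its `sync` and `repeats` parameters are hypothetical names introduced here, and the example call at the bottom runs on the CPU so that it works without a GPU.

```python
import time

def timed(fn, *args, sync=None, repeats=5, **kwargs):
    """Return the best wall-clock time in seconds of fn(*args, **kwargs) over `repeats` runs.

    `sync` is an optional zero-argument callable invoked before starting and
    before stopping the clock, e.g. torch.cuda.synchronize for CUDA code.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()                      # finish earlier asynchronous work first
        start = time.perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()                      # wait until the timed work has really finished
        best = min(best, time.perf_counter() - start)
    return best

# CPU-only demonstration; for the GPU function in this notebook one would call
# timed(PositionalEncoding_vt_pt, max_length, d_model, sync=torch.cuda.synchronize)
elapsed = timed(sum, range(100_000))
print(f"{1e3 * elapsed:.3f} ms")
```

Taking the minimum over several runs is a design choice that reduces the influence of one-off delays; reporting the mean and standard deviation, as `%timeit` does, is an equally reasonable alternative.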

In [7]:
%timeit PositionalEncoding_vt_np(max_length, d_model)

1.31 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%timeit PositionalEncoding_vt_pt(max_length, d_model)

8.7 ms ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Fake News Detection
["Fake News Detection is a *Natural Language Processing* task that involves identifying and classifying news articles or other types of text as Real or Fake. The goal of Fake News Detection is to develop algorithms that can automatically identify and flag fake news articles, which can be used to combat misinformation and promote the dissemination of accurate information."](https://paperswithcode.com/task/fake-news-detection)
<br><br/>
This part of the notebook will go through the topics in order:
- [Load Data and Prepare Datasets](#Load-Data-and-Prepare-Datasets)

- [Create DataLoaders](#Create-DataLoaders)

- [Config the Model and Optimizer](#Config-the-Model-and-Optimizer)

- [Training Loop](#Training-Loop)

- [Validate the Results](#Validate-the-Results)

---

In [9]:
import pandas as pd
import torch, evaluate

from datasets import DatasetDict, Dataset
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader



In [10]:
# @title Hyperparameters
SEED = 42   # @param {type:"integer"}

CASING = "bert-base-uncased"    # @param ["bert-base-uncased", "bert-large-uncased"]

MAX_LENGTH = 256    # @param {type:"slider", min:128, max:512, step:128}

EPOCHS = 3  # @param {type:"slider", min:1, max:7, step:1}

BATCH_SIZE = 16
NUM_LABELS = 2

CHK_FILE = "./FakeNewsClassifier.pt"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [11]:
def set_seed(seed=SEED, device=DEVICE):
    # Reproducibility options
    import os, random, numpy
    os.environ["PYTHONHASHSEED"] = f"{seed}"
    # os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # ":16:8"
    random.seed(a=seed, version=2)
    numpy.random.seed(seed=seed)
    torch.manual_seed(seed=seed)
    if device.type=="cuda":
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.cuda.manual_seed(seed=seed)

set_seed()

#### Load Data and Prepare Datasets

In [12]:
data_dir = "../../data/"
fake_df = pd.read_csv(data_dir+"fake.csv")
fake_df.head()

| Unnamed: 0 | text |
|---|---|
| 0 | The Senate voted 51-48 this afternoon to proce... |
| 1 | So much for the SCOTUS not being political Che... |
| 2 | White House counselor Kellyanne Conway crawled... |
| 3 | Donald Trump may have decided that Russia is g... |
| 4 | Have you ever wondered where a phrase started?... |


In [13]:
fake_df, true_df = pd.read_csv(data_dir+"fake.csv"), pd.read_csv(data_dir+"true.csv")
df = pd.concat([fake_df, true_df])
df["labels"] = len(fake_df)*[0] + len(true_df)*[1]   # len() counts rows; .size counts rows × columns
df = df.sample(frac=1, replace=False, random_state=SEED)#.reset_index(drop=True)
text, labels = df.text.to_list(), df.labels.to_list()

In [14]:
tr_ratio, n_samples = 0.75, len(text)
n_tr_samples = int(tr_ratio*n_samples)

dataset = DatasetDict()
dataset["train"] = Dataset.from_dict({"text":text[:n_tr_samples], "labels":labels[:n_tr_samples]})
dataset["validate"] = Dataset.from_dict({"text":text[n_tr_samples:], "labels":labels[n_tr_samples:]})

#### Create DataLoaders

In [15]:
tokenizer = BertTokenizer.from_pretrained(CASING, do_lower_case=True)

tr_encoded_dict = tokenizer(text=dataset["train"]["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH,
                            return_tensors="pt", return_token_type_ids=False, return_attention_mask=True)
tr_dataset = TensorDataset(tr_encoded_dict["input_ids"].to(DEVICE), tr_encoded_dict["attention_mask"].to(DEVICE),
                           torch.tensor(dataset["train"]["labels"], device=DEVICE))
tr_dataloader = DataLoader(tr_dataset, batch_size=BATCH_SIZE, shuffle=True)

val_encoded_dict = tokenizer(text=dataset["validate"]["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH,
                             return_tensors="pt", return_token_type_ids=False, return_attention_mask=True)
val_dataset = TensorDataset(val_encoded_dict["input_ids"].to(DEVICE), val_encoded_dict["attention_mask"].to(DEVICE),
                            torch.tensor(dataset["validate"]["labels"], device=DEVICE))
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

#### Config the Model and Optimizer

In [16]:
warmup_ratio = 0.05
num_training_steps = EPOCHS * len(tr_dataloader)

model = BertForSequenceClassification.from_pretrained(CASING, num_labels=NUM_LABELS).to(DEVICE)
for name, prm in model.named_parameters():
    if ("embeddings" in name) or ("encoder" in name and int(name.split('.')[3])<4):
        prm.requires_grad = False

optimizer = torch.optim.Adam(params=filter(lambda prm:prm.requires_grad, model.parameters()), lr=2e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(warmup_ratio*num_training_steps),
                                            num_training_steps=num_training_steps, last_epoch=-1)
loss_criterion = torch.nn.CrossEntropyLoss(reduction="sum")
metric = evaluate.load("accuracy")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
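
To make the scheduler configured above more concrete, the sketch below reproduces the general shape of a linear warmup followed by a linear decay as a plain Python function. It is an illustration of the schedule, not the library's own code; the function name and the 100-step run are hypothetical.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier at a given optimizer step (linear warmup, then linear decay)."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical run: 100 steps in total, 5% of them used for warmup.
total, warmup = 100, 5
factors = [linear_warmup_factor(s, warmup, total) for s in range(total + 1)]
print(factors[0], factors[warmup], factors[total])
```

The multiplier scales the base learning rate (`lr=2e-5` in the cell above): it rises from 0 to 1 during warmup and then falls back to 0 at the final step.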

#### Training Loop

In [17]:
model.train()

tr_lss, tr_acc = [], []
monitor = 0   # best training accuracy so far; kept across epochs so only improvements are saved
for epoch in range(EPOCHS):
    tmp_lss, n_smpls = 0, 0
    for batch in tr_dataloader:
        optimizer.zero_grad()
        outputs, targets = model(input_ids=batch[0], attention_mask=batch[1], return_dict=True), batch[2]
        loss = loss_criterion(outputs.logits, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()
        metric.add_batch(predictions=outputs.logits.argmax(dim=1), references=targets)
        tmp_lss += loss.item()
        n_smpls += targets.numel()
    tr_lss.append(tmp_lss/n_smpls)
    tr_acc.append(100*metric.compute()["accuracy"])
    print(f"[Epoch{epoch+1:2d}/{EPOCHS}]  Training Loss:{tr_lss[-1]:7.4f}  *****  Training Accuracy:{tr_acc[-1]:6.2f}%\n")

    if monitor<tr_acc[-1]:
        monitor = tr_acc[-1]
        torch.save(model.state_dict(), CHK_FILE)

[Epoch 1/3]  Training Loss: 0.1461  *****  Training Accuracy: 94.00%

[Epoch 2/3]  Training Loss: 0.0095  *****  Training Accuracy: 99.87%

[Epoch 3/3]  Training Loss: 0.0077  *****  Training Accuracy: 99.87%



#### Validate the Results

In [18]:
model.load_state_dict(torch.load(CHK_FILE), strict=True)
model.eval()

tmp_lss, n_smpls = 0, 0
for batch in val_dataloader:
    with torch.no_grad():
        outputs, targets = model(input_ids=batch[0], attention_mask=batch[1], return_dict=True), batch[2]
        loss = loss_criterion(outputs.logits, targets)
    metric.add_batch(predictions=outputs.logits.argmax(dim=1), references=targets)
    tmp_lss += loss.item()
    n_smpls += targets.numel()
val_lss, val_acc = tmp_lss/n_smpls, 100*metric.compute()["accuracy"]
print(f"Validation Loss:{val_lss:7.4f}  *****  Validation Accuracy:{val_acc:6.2f}%")

Validation Loss: 0.0203  *****  Validation Accuracy: 99.70%
