# Natural Language Processing

## Introduction

In this chapter we begin discussing natural language processing.

We first cover TF-IDF. Next we discuss text embeddings and how to use them with feedforward neural networks.

## Introduction

Natural language processing (NLP) refers to a set of tasks where the input is unstructured text. There are many possible goals: for example, we might wish to classify the text or translate it into another language.

The first challenge we encounter is how to represent text in a format to which we can apply machine learning models.

The classical approach here is TF-IDF, so we begin by discussing it.

## TF-IDF

TF-IDF is a product of two statistics - **term frequency** (TF) and **inverse document frequency** (IDF).

First we need some terminology:

- **Term** - a single unit of text. What counts as a term depends on the task. One obvious choice is to split the text into words. However, there are usually smarter ways to define a term, depending on the language and the task. Nowadays the word **token** is more commonly used instead of term.

- **Document** - a collection of terms, for us this is usually going to be a single input row.
- **Corpus** - the set of all documents.

## TF-IDF

Denote by $f_{t, d}$ the number of times the token $t$ appears in document $d$.

Term frequency of token $t$ in document $d$ is defined to be
$$
\text{TF}(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}.
$$
I.e. the more times the token $t$ appears in the document the higher its $TF$ will be.
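
As a small illustration of the definition, here is a minimal sketch in plain Python. The function name `term_frequency` and the toy document are made up for this example, and we assume the document has already been split into tokens.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {token: TF(token, d)} for one tokenized document d."""
    counts = Counter(tokens)      # f_{t, d} for every token t in d
    total = sum(counts.values())  # sum over t' of f_{t', d}, i.e. the document length
    return {token: count / total for token, count in counts.items()}

doc = "the cat sat on the mat".split()  # hypothetical toy document
tf = term_frequency(doc)
print(tf["the"])  # "the" appears 2 times among 6 tokens
print(tf["cat"])  # "cat" appears 1 time among 6 tokens
```

Note that with this definition the term frequencies of a document always sum to one.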

## TF-IDF

The inverse document frequency of token $t$ in corpus $c$ is defined to be
$$
\text{IDF}(t, c) = \log \left(\frac{\text{total number of documents in } c}{\text{number of documents in } c \text{ that contain } t}\right).
$$
Inverse document frequency measures how rare the token is in the corpus. The fewer documents the token $t$ appears in, the higher its $IDF$ will be.
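
The same definition can be sketched in plain Python. The function name `inverse_document_frequency` and the three-document corpus are invented for illustration, and we assume the natural logarithm.

```python
import math

def inverse_document_frequency(corpus):
    """Return {token: IDF(token, c)} for a corpus given as a list of token lists."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for token in set(doc):  # count each document at most once per token
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}

# Hypothetical toy corpus of three tokenized documents
corpus = [
    "stocks rally on earnings".split(),
    "stocks fall on weak earnings".split(),
    "stocks flat".split(),
]
idf = inverse_document_frequency(corpus)
print(idf["stocks"])  # appears in all 3 documents: log(3/3) = 0
print(idf["rally"])   # appears in 1 of 3 documents: log(3/1)
```

A token that appears in every document gets an IDF of zero, so it contributes nothing to the final TF-IDF vector.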

## TF-IDF

TF-IDF of a token $t$ in document $d$ is then
$$
\text{TF-IDF}(t, d, c) = \text{TF}(t, d)\text{IDF}(t, c).
$$

We apply TF-IDF in ML by first computing the IDF of every token in the training set. We can then encode each input row as a vector whose dimension equals the number of unique tokens in the training set. Each component of this vector corresponds to one token and contains the TF-IDF of that token in the row.

In this way we represent our input row as a vector which we can then pass to a ML model.
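
The encoding step can be sketched as follows. This is only an illustration in plain Python: the helper names `fit_idf` and `tfidf_vector`, the toy training documents, and the choice to simply drop tokens that were not seen during training are all assumptions made for this example.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute the IDF of every token that occurs in the training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter(token for doc in train_docs for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}

def tfidf_vector(doc, idf, vocabulary):
    """Encode one tokenized document as a vector with one component per vocabulary token."""
    counts = Counter(doc)
    total = max(len(doc), 1)  # guard against empty documents
    # Tokens outside the vocabulary get no component (assumption of this sketch)
    return [counts[token] / total * idf[token] for token in vocabulary]

# Hypothetical training set of tokenized documents
train_docs = [
    "stocks rally on earnings".split(),
    "stocks fall on weak earnings".split(),
    "stocks flat".split(),
]
idf = fit_idf(train_docs)
vocabulary = sorted(idf)  # fixes which vector component belongs to which token

vec = tfidf_vector("stocks rally".split(), idf, vocabulary)
print(len(vec))                        # one component per unique training token
print(vec[vocabulary.index("rally")])  # TF = 1/2 times IDF = log(3)
```

The IDF values come only from the training set, so a new row is encoded with the same vocabulary and weights as the rows the model was trained on.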

## TF-IDF

Let's use TF-IDF for **sentiment analysis**. Our inputs will be tweets about specific stocks or about financial markets in general. The goal is to classify whether the sentiment is bearish (meaning pessimistic), bullish (meaning optimistic) or neutral.

The dataset can be found [here](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment).

In [1]:
#| output: false
import pandas as pd

train = pd.read_csv("hf://datasets/zeroshot/twitter-financial-news-sentiment/sent_train.csv")
test = pd.read_csv("hf://datasets/zeroshot/twitter-financial-news-sentiment/sent_valid.csv")



## TF-IDF

There is an implementation of TF-IDF in `sklearn`.
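
With its default settings, `TfidfVectorizer` does not compute exactly the textbook formula above. To our knowledge it uses raw counts for TF, a smoothed IDF of the form $\ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$, and then scales every row to unit Euclidean length. The sketch below mimics that variant in plain Python. It assumes the defaults `smooth_idf=True`, `sublinear_tf=False` and `norm="l2"`, and it uses whitespace splitting instead of the tokenizer `sklearn` actually applies, so it illustrates the weighting rather than reproducing the library.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """Smoothed, L2-normalized TF-IDF for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vocabulary = sorted(doc_freq)
    # Smoothed IDF: never zero, even for tokens that occur in every document
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]    # raw count times IDF
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid division by zero
        rows.append([x / norm for x in row])
    return vocabulary, rows

# Hypothetical toy corpus, tokenized by whitespace
docs = [
    "stocks rally on earnings".split(),
    "stocks fall on weak earnings".split(),
    "stocks flat".split(),
]
vocab, rows = sklearn_style_tfidf(docs)
print(rows[2][vocab.index("stocks")])  # nonzero although "stocks" is in every document
```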

Let's use logistic regression for classification after we transform our input using TF-IDF.

## TF-IDF

In [2]:
#| output-location: slide
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

X_train, y_train = train["text"], train["label"]
X_test, y_test = test["text"], test["label"]

def make_pipeline():
  model = LogisticRegression(random_state=34, class_weight="balanced")

  pipeline = Pipeline(
    steps=[
      ("transform", TfidfVectorizer(lowercase=True)),
      ("model", model),
    ],
  )

  return pipeline

pipeline = make_pipeline()
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.69      0.62       347
           1       0.66      0.73      0.70       475
           2       0.91      0.83      0.87      1566

    accuracy                           0.79      2388
   macro avg       0.71      0.75      0.73      2388
weighted avg       0.81      0.79      0.80      2388
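
The two summary rows of the report can be recomputed from the per-class rows. The sketch below does this for the F1 column, assuming that `macro avg` is the unweighted mean over classes and `weighted avg` is the mean weighted by support. Because the inputs are the rounded values printed above, the result only matches the report up to rounding.

```python
# Per-class F1 scores and supports copied from the report above (rounded to 2 decimals)
f1 = {0: 0.62, 1: 0.70, 2: 0.87}
support = {0: 347, 1: 475, 2: 1566}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its number of samples
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2))     # 0.73
print(round(weighted_f1, 2))  # 0.8
```

The weighted average is higher than the macro average because the largest class (label 2) is also the one the model classifies best.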



## Embeddings

TF-IDF is a good technique if you need to process a lot of data quickly and cheaply and are satisfied with mediocre performance.

One obvious drawback of TF-IDF is that we lose all information about the order of words in a sentence.

## Embeddings

In certain languages (such as English) word order is important to understand the meanings of sentences.

For example, the following two sentences would have the same TF-IDF but their meaning is different:

1. Dog chases cat.
2. Cat chases dog.
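
We can check this directly: TF-IDF depends only on how many times each token occurs, and both sentences produce the same counts. The helper `bag_of_words` below is a made-up, deliberately naive tokenizer used only for this illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, strip the trailing period, split on whitespace and count tokens."""
    return Counter(sentence.lower().rstrip(".").split())

a = bag_of_words("Dog chases cat.")
b = bag_of_words("Cat chases dog.")
print(a == b)  # True: the token counts are identical, so the TF-IDF vectors are too
```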

## Embeddings

To keep the order of words we need to pass tokens to the model sequentially. We then represent tokens using **embeddings**. That is, to every token we assign a vector in $\mathbb{R}^n.$

There are many different ways of generating token embeddings. An older, historically important example is [word2vec](https://en.wikipedia.org/wiki/Word2vec).

In our case we will let the model learn its own embeddings.

## Embeddings

To pass tokens to the model we will one-hot encode them (the dimension of the one-hot encoded vector equals the number of unique tokens in our training set) and then project this vector down to a lower-dimensional space using a linear transformation. This gives us an embedding.

The model will then be able to learn this embedding on its own by learning the weights of the matrix used to project the one hot encoded tokens.
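
Multiplying a one-hot vector by a matrix simply selects one row of that matrix, so in practice the projection can be implemented as a table lookup. The plain Python sketch below illustrates this with a made-up 4-token vocabulary and a hand-written weight matrix `W`. In a real model the entries of `W` would be learned.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(vector, matrix):
    """Row vector times matrix: (vocab_size,) x (vocab_size, embed_dim) -> (embed_dim,)."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

# Hypothetical weights: 4 tokens embedded in R^2
W = [
    [0.0, 0.0],
    [0.1, -0.3],
    [0.7, 0.2],
    [-0.5, 0.9],
]

token_index = 2
embedding = project(one_hot(token_index, len(W)), W)
print(embedding == W[token_index])  # True: the projection picks out row 2 of W
```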

## Embeddings

This is implemented in `PyTorch` in the [Embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).

In order to use it we need to convert our input string into a tensor of token indices.

Let's write a `Dictionary` class that is going to assign an index to each token we encounter in the training set and convert a string into a list of token indices.

We will use the default English tokenizer supplied by the `spaCy` package. However, note that there are much better tokenizers available.

## Embeddings

In [3]:
from spacy.lang.en import English

class Dictionary:
  def __init__(self, min_count=10, init_tokens=None):
    self.nlp = English()
    self.min_count = min_count
    self.init_tokens = init_tokens
    self.i2t, self.t2i, self.no_tokens = self._default_maps()
    self.pad_idx = 0
    self.unk_idx = 1

  def _default_maps(self):
    # <pad> - token used for padding
    # <unk> - unknown, used for tokens not encountered in dictionary building
    i2t = ['<pad>', '<unk>']
    if self.init_tokens is not None:
      i2t = [*i2t, *self.init_tokens]
    t2i = {token:index for index, token in enumerate(i2t)}
    return i2t, t2i, len(i2t)

  def build(self, corpus):
    tokens = {}
    for idx, row in enumerate(corpus):
      for token in self.nlp(row):
        if token.text.lower() not in tokens:
          tokens[token.text.lower()] = 1
        else:
          tokens[token.text.lower()] += 1
    i2t, _, _ = self._default_maps()
    self.i2t = [
      *i2t,
      *[token for token, count in tokens.items() if count >= self.min_count]
    ]
    self.t2i = {token:index for index, token in enumerate(self.i2t)}
    self.no_tokens = len(self.i2t)

  def string_to_idx(self, string, seq_length=None):
    tokens = [token.text.lower() for token in self.nlp(string) if not token.is_punct]
    return self.tokens_to_idx(tokens, seq_length)

  def tokens_to_idx(self, tokens, seq_length=None):
    idxs = [self.t2i[token] if token in self.t2i else self.unk_idx for token in tokens]
    if seq_length is not None:
      idxs = idxs + [self.pad_idx] * (seq_length - len(idxs))
      idxs = idxs[:seq_length]
    return idxs

  def idx_to_string(self, indices, ignore_pad=True):
    tokens = self.idx_to_tokens(indices, ignore_pad)
    return ' '.join(tokens)

  def idx_to_tokens(self, indices, ignore_pad=True):
    if ignore_pad:
      return [self.i2t[idx] for idx in indices if idx != self.pad_idx]
    return [self.i2t[idx] for idx in indices]

## Embeddings

One more problem is that now our inputs have variable lengths. If we want to use a feedforward NN to classify our data, then we need the input dimensions to be fixed.

To get around this we will fix the input size. The inputs that are shorter than this size will be padded with a special padding token which the model will (hopefully) learn to ignore. And longer inputs will be truncated.
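
A minimal sketch of this padding and truncation step in plain Python is shown below. The function name `pad_or_truncate` is invented for this example, and we assume that index `0` is reserved for the padding token.

```python
PAD_IDX = 0  # assumed index of the padding token

def pad_or_truncate(indices, seq_length, pad_idx=PAD_IDX):
    """Return a list of exactly seq_length token indices."""
    # Multiplying a list by a negative number gives [], so long inputs get no padding
    padded = indices + [pad_idx] * (seq_length - len(indices))
    return padded[:seq_length]

print(pad_or_truncate([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], 5))  # [5, 8, 2, 9, 4]
```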

## Embeddings

Another way to handle inputs of arbitrary length would be to use a recurrent neural network (RNN) of some sort. We will cover RNNs in a later chapter.

We also need to write a custom Dataset class to represent our data. This is easy. All we need to do is to inherit from the `Dataset` class and implement the `__getitem__` and `__len__` methods.

## Embeddings

In [4]:
#| output: false
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class Tweets(Dataset):
  def __init__(self, seq_length, train=False, dictionary=None):
    self.seq_length = seq_length

    if train:
      self.dataset = pd.read_csv("hf://datasets/zeroshot/twitter-financial-news-sentiment/sent_train.csv")
    else:
      self.dataset = pd.read_csv("hf://datasets/zeroshot/twitter-financial-news-sentiment/sent_valid.csv")

    if dictionary is None:
      self.dictionary = Dictionary()
      self.dictionary.build(self.dataset["text"])
    else:
      self.dictionary = dictionary

    self.no_tokens = self.dictionary.no_tokens

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    tokens = self.dictionary.string_to_idx(self.dataset.iloc[idx]["text"], seq_length=self.seq_length)
    tokens = torch.LongTensor(tokens)

    label = self.dataset.iloc[idx]["label"]
    label = torch.zeros(3, dtype=torch.float).scatter_(0, torch.tensor(label), value=1)

    return tokens, label

seq_length = 50
batch_size = 32

train_data = Tweets(
  seq_length=seq_length,
  train=True
)

test_data = Tweets(
  seq_length=seq_length,
  train=False,
  dictionary=train_data.dictionary
)

train_dataloader = DataLoader(
  train_data,
  batch_size=batch_size,
  shuffle=True
)

test_dataloader = DataLoader(
  test_data,
  batch_size=batch_size,
  shuffle=False
)

## CNNs For NLP

So now our input is a matrix of fixed dimension. This is very similar to an image. Could a convolutional neural network (CNN) work for our task?

Convolutional layers indeed have properties that we want:

1. They take the order of inputs into account, hence the model can take word order into account.
2. They can easily learn patterns irrespective of where they occur in the input. For example, a model trained on movie reviews should be able to learn that seeing something like "this movie is bad" anywhere in the text means that the review is negative.
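
The second property can be illustrated with a toy example in plain Python. To keep it short we make two simplifying assumptions: every token is represented by a single number instead of an embedding vector, and the filter weights in `kernel` are written by hand instead of learned. Sliding a kernel of size $k$ over a sequence of length $L$ gives $L - k + 1$ outputs, and taking the maximum over all of them discards the position of the match.

```python
def conv1d(sequence, kernel):
    """Valid 1D cross-correlation: slide the kernel over the sequence."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]

def max_over_time(features):
    """Keep only the strongest response, regardless of where it occurred."""
    return max(features)

kernel = [1.0, -1.0, 1.0]  # hypothetical filter that responds to the pattern 1, 0, 1

early = [1, 0, 1, 0, 0, 0, 0, 0]  # pattern at the start of the sequence
late = [0, 0, 0, 0, 0, 1, 0, 1]   # same pattern at the end of the sequence

print(len(conv1d(early, kernel)))            # 8 - 3 + 1 = 6 positions
print(max_over_time(conv1d(early, kernel)))  # 2.0
print(max_over_time(conv1d(late, kernel)))   # 2.0
```

Both sequences produce the same pooled feature, even though the pattern sits at different positions.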

## CNNs For NLP

Nowadays, the go-to mechanism for handling text is called attention. We will cover attention in the next chapter when we talk about transformers. For now let's stick with CNNs.

Let's build our model.

## CNNs For NLP

In [5]:
#| output-location: slide
from torch import nn
import torch.nn.functional as F

class TextClassifier(nn.Module):
  def __init__(self, no_tokens, seq_length):
    super().__init__()
    self.seq_length = seq_length
    self.no_tokens = no_tokens
    self.embed_dim = 512
    self.num_out_conv_channels = 256
    self.conv_kernel_sizes = [3, 4, 5]

    self.embedding = nn.Embedding(self.no_tokens, self.embed_dim)
    self.conv_layers = nn.ModuleList([
      nn.Sequential(
        nn.Conv2d(
          in_channels = 1,
          out_channels = self.num_out_conv_channels,
          kernel_size = (k, self.embed_dim)
        ),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=(self.seq_length-k+1, 1)),
        nn.Flatten(),
      )
       for k in self.conv_kernel_sizes
    ]) # Use ModuleList if you need a list of some layers
    self.linear_stack = nn.Sequential(
      nn.Linear(len(self.conv_kernel_sizes)*self.num_out_conv_channels, 3),
    )

  def forward(self, x):
    x = self.embedding(x)
    x = x.unsqueeze(1) # Adds channel dimension

    x = [conv_layer(x) for conv_layer in self.conv_layers]
    x = torch.cat(x, 1)

    return self.linear_stack(x)

print(TextClassifier(train_data.no_tokens, seq_length))

TextClassifier(
  (embedding): Embedding(1724, 512)
  (conv_layers): ModuleList(
    (0): Sequential(
      (0): Conv2d(1, 256, kernel_size=(3, 512), stride=(1, 1))
      (1): ReLU()
      (2): MaxPool2d(kernel_size=(48, 1), stride=(48, 1), padding=0, dilation=1, ceil_mode=False)
      (3): Flatten(start_dim=1, end_dim=-1)
    )
    (1): Sequential(
      (0): Conv2d(1, 256, kernel_size=(4, 512), stride=(1, 1))
      (1): ReLU()
      (2): MaxPool2d(kernel_size=(47, 1), stride=(47, 1), padding=0, dilation=1, ceil_mode=False)
      (3): Flatten(start_dim=1, end_dim=-1)
    )
    (2): Sequential(
      (0): Conv2d(1, 256, kernel_size=(5, 512), stride=(1, 1))
      (1): ReLU()
      (2): MaxPool2d(kernel_size=(46, 1), stride=(46, 1), padding=0, dilation=1, ceil_mode=False)
      (3): Flatten(start_dim=1, end_dim=-1)
    )
  )
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=3, bias=True)
  )
)


## CNNs For NLP

Let's also copy over the code for training for one epoch from the last chapter.

In [6]:
from tqdm import tqdm # This is a library that implements loading bars
import sys

def train_epoch(dataloader, model, loss_fn, optimizer):
  model.train() # Set model to training mode

  total_loss = 0
  total_batches = 0

  with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
    ep_tqdm.set_description("Train")
    for X, y in ep_tqdm:
      X, y = X.to(device), y.to(device)

      # Forward pass
      pred = model(X)
      loss = loss_fn(pred, y)
        
      # Backward pass
      loss.backward()
      optimizer.step()

      # Reset the computed gradients back to zero
      optimizer.zero_grad()

      # Output stats
      total_loss += loss.item() # Accumulate a plain float instead of a tensor
      total_batches += 1
      ep_tqdm.set_postfix(average_batch_loss=total_loss/total_batches)

def eval_epoch(dataloader, model, loss_fn):
  model.eval() # Set model to inference mode
  
  total_loss = 0
  total_batches = 0
  total_samples = 0
  total_correct = 0

  with torch.no_grad(): # Do not compute gradients
    with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
      ep_tqdm.set_description("Val")
      for X, y in ep_tqdm:
        X, y = X.to(device), y.to(device)
        pred = model(X)

        total_loss += loss_fn(pred, y)
        total_correct += (pred.argmax(dim=1) == y.argmax(dim=1)).type(torch.float).sum()
        total_samples += len(X)
        total_batches += 1

        ep_tqdm.set_postfix(average_batch_loss=(total_loss/total_batches).item(), accuracy=(total_correct/total_samples).item())

## CNNs For NLP

Let's train the model.

In [7]:
#| output-location: slide
# Hyperparameters
learning_rate = 0.0001
epochs = 10

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = TextClassifier(train_data.no_tokens, seq_length).to(device)

loss_fn = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Organize the training loop
for t in range(epochs):
  print(f"Epoch {t+1}\n-------------------------------")
  train_epoch(train_dataloader, model, loss_fn, optimizer)
  eval_epoch(test_dataloader, model, loss_fn)

print("Done!")

Using cuda device
Epoch 1
-------------------------------
Train: 100%|██████████| 299/299 [00:02<00:00, 128.18batch/s, average_batch_loss=0.706]
Val: 100%|██████████| 75/75 [00:00<00:00, 167.12batch/s, accuracy=0.777, average_batch_loss=0.611]
Epoch 2
-------------------------------
Train: 100%|██████████| 299/299 [00:02<00:00, 139.85batch/s, average_batch_loss=0.45] 
Val: 100%|██████████| 75/75 [00:00<00:00, 200.03batch/s, accuracy=0.796, average_batch_loss=0.53] 
Epoch 3
-------------------------------
Train: 100%|██████████| 299/299 [00:02<00:00, 139.39batch/s, average_batch_loss=0.298]
Val: 100%|██████████| 75/75 [00:00<00:00, 196.34batch/s, accuracy=0.81, average_batch_loss=0.504] 
Epoch 4
-------------------------------
Train: 100%|██████████| 299/299 [00:02<00:00, 134.70batch/s, average_batch_loss=0.191]
Val: 100%|██████████| 75/75 [00:00<00:00, 194.69batch/s, accuracy=0.814, average_batch_loss=0.487]
Epoch 5
-------------------------------
Train: 100%|██████████| 299/299 [00:02

## Practice Task

Try performing sentiment analysis on this [dataset](https://github.com/jputrius/ml_intro/tree/main/data/reviewpolarity). The inputs are movie reviews and your goal is to classify whether each review is positive or negative.