# Classification

We are going to explore and study a basic and important task of Natural Language Processiing (NLP) which is text classification. This task can help us with various tasks for example sentiments analysis, langauge detection, text author detection and text labelling are only few to name. In this Notebook, we will implement text's langauge detection task using two NLP approaches. First approach is with statistics and we will developed very famous Naive Bayes classifier. Second approach is with Machine Learning and there we will implement a very simple single layer Long Short Term Memory (LSTM) model. We will use Accuracy to compare both models performance (we could also use F1-measure but I want to keep things simple). I used vanilla Python for Naive Bayes implementation and Pytorch for LSTM. The dataset I use is from [Kaggle](https://www.kaggle.com/basilb2s/language-detection) and it consist of text with their respective langauge label. The dataset contains text from 17 different languages, however I picked only 3 langauges with the most enteries (English, French and Spanish). I used 80% data for training and remaining 20% for validation.
Any classification task can be further divided into binary or multi-class classification. Here as we picked 3 langauges so we are going to develop multi class classifier. Let's now start with Naive Bayes implementation.

### Naive Bayes

Naive Bayes is one of the important statistical algorithm for text classification. It has two assumptions:

1. _Bag-of-words assumptions:_ Position of words in text does not matter and all the words are equally important. 
2. _Conditional Independence_: Features (words) probabilities $P(w|c)$ are independent of each other given class c.

Naive Bayes is **naive** because of the **first** assumption. This is a strong assumption and **unrealistic for real data; however, this is very effective**. One thing two remember that both assumptions are incorrect, but they simplify the algorithm big time.

Some notes from lectures of Dan Jurafsky (highly recommend watching [these](https://www.youtube.com/playlist?list=PLLssT5z_DsK8HbD2sPcUIDfQ7zmBarMYv) videos from him):
- Small data ? Naive bayes is perfect
- Reasonable amount of data? go with SVM or Logistic Regression
- Huge data? Training ML models can be time consuming and here Naive Bayes will shine. Infact with enough data classifier wont even matter!

Before diving into more details for Naive Bayes algorithm, lets analyze and prepare the dataset!

In [73]:
import math
from collections import Counter

from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

In [76]:
# dataset courtesy: https://www.kaggle.com/basilb2s/language-detection
ds = pd.read_csv('Language Detection.csv')

print(f'{len(ds["Language"].unique())} languages in dataset ')
print(f'Total dataset count: {len(ds)}')

stats = { lang: len(ds.loc[ds['Language'] == lang])  for lang in ds["Language"].unique()  }
print(stats)

np.random.shuffle(ds.values)
total_ds = ds.loc[(ds['Language'] == 'Spanish') | (ds['Language'] == 'French') | (ds['Language'] == 'English')]
print(f'Total dataset count (3 languages): {len(total_ds)}')

train_ds = total_ds[:2700]
valid_ds = total_ds[2700:]

# Only considering three languages en, fr and es
ds_en = train_ds.loc[train_ds['Language'] == 'English']
ds_fr = train_ds.loc[train_ds['Language'] == 'French']
ds_es = train_ds.loc[train_ds['Language'] == 'Spanish']

vds_en = valid_ds.loc[valid_ds['Language'] == 'English']
vds_fr = valid_ds.loc[valid_ds['Language'] == 'French']
vds_es = valid_ds.loc[valid_ds['Language'] == 'Spanish']

print('\n')
print(f'Train set: \n\tTotat: {len(train_ds)} \n\tEnglish: {len(ds_en)}, French: {len(ds_fr)} Spanish: {len(ds_es)}')
print(f'Train set: \n\tTotat: {len(valid_ds)} \n\tEnglish: {len(vds_en)}, French: {len(vds_fr)} Spanish: {len(vds_es)}')

17 languages in dataset 
Total dataset count: 10337
{'English': 1385, 'Malayalam': 594, 'Hindi': 63, 'Tamil': 469, 'Portugeese': 739, 'French': 1014, 'Dutch': 546, 'Spanish': 819, 'Greek': 365, 'Russian': 692, 'Danish': 428, 'Italian': 698, 'Turkish': 474, 'Sweedish': 676, 'Arabic': 536, 'German': 470, 'Kannada': 369}
Total dataset count (3 languages): 3218


Train set: 
	Totat: 2700 
	English: 1169, French: 843 Spanish: 688
Train set: 
	Totat: 518 
	English: 216, French: 171 Spanish: 131


In [77]:
# concatenating all rows into one corpus
corpus_en = ds_en.Text.str.cat()
corpus_fr = ds_fr.Text.str.cat()
corpus_es = ds_es.Text.str.cat()

Now we have dataset for all three classes. Let's talk about our algorithm. So we want to find out probability of class _c_ given the document _d_ hence $P(c|d)$. According to the Bayes rule we can calculate it as

$$ P(c|d) = \frac{P(d|c)P(c)}{P(d)} $$

Where $P(d)$ is the probability of the given document which is always constant (as we have fixed dataset) so we can drop it. Now it becomes

$$ P(c|d) = P(d|c)P(c) $$

Now $P(c)$ is probability of the class and we also call it **prior probability**. Lets calculate it first!


$$ P(c) = \frac{\text{No. of c documents}}{\text{Total no. of documents}}$$

In [5]:
# Calculating prior probabilities
prior_en = len(ds_en)/no_of_docs
print(f"No. of en docs {len(ds_en)} P(en): {prior_en}")

prior_fr = len(ds_fr)/no_of_docs
print(f"No. of fr docs {len(ds_fr)} P(fr): {prior_fr}")

prior_es = len(ds_es)/no_of_docs
print(f"No. of es docs {len(ds_es)} P(es): {prior_es}")

No. of en docs 1173 P(en): 0.43444444444444447
No. of fr docs 840 P(fr): 0.3111111111111111
No. of es docs 687 P(es): 0.2544444444444444


Now we have our all three prior probabilities calculated. Its time to deal with $P(d|c)$ which we call **likelihood**. As _d_ is a document consists of many words _w_ we can rewrite as 

$$ P(c|d) = P(w_1 .... w_n|c)P(c) $$

Now calculating $P(w_1 .... w_n|c)$ is not trivial, but because of the assumption of __*independence*__, joint probability will become


$$ P(c|d) = P(w_1|c) P(w_2|c) ... P(w_n|c) P(c) $$

Now the only calculation we need to do is for $P(w_i|c)$ and it can be calculated as

$$P(w_i|c) = \frac{ \text{ No. of $w_i$ in c + 1} }{ \text{No. of words in c + (Vocab size + 1)} } $$

Here the _Vocab size_ is the number of unique words in all three classes (the complete corpus). We added + 1 in both the nominator and denominator to handle unknown words!

In [6]:
# Pre-processing
# Lower casing and tokenization i.e here we are simply splitting by spaces
tokens_en = corpus_en.lower().split()
tokens_fr = corpus_fr.lower().split()
tokens_es = corpus_es.lower().split()
print(len(tokens_en))
print(len(tokens_fr))
print(len(tokens_es))

24519
18629
13984


In [7]:
# Creating vacabs
vocab_en = Counter()
vocab_fr = Counter()
vocab_es = Counter()

vocab_en.update(tokens_en)
vocab_fr.update(tokens_fr)
vocab_es.update(tokens_es)

size_vocab_en = len(vocab_en) 
size_vocab_fr = len(vocab_fr) 
size_vocab_es = len(vocab_es) 


# plus one for unknown word
vocab_size = len(set(tokens_en)) + len(set(tokens_fr)) + len(set(tokens_es)) + 1

print('Vocab size: ', vocab_size)

Vocab size:  16926


In [8]:
def nb(text, cls_w_count, cls_prior, cls_vocab, vocab_size):
    tokens = text.strip().lower().split()
    res = cls_prior
    deno = cls_w_count + vocab_size
    
    for t in tokens:
        nom = cls_vocab.get(t, 0) + 1
        res *= (nom / deno)
    
    return res

In [20]:
langs = {
        'English': {'prior': prior_en, 'tokens': len(tokens_en), 'vocab': vocab_en },
        'French': {'prior': prior_fr, 'tokens': len(tokens_fr), 'vocab': vocab_fr },
        'Spanish': {'prior': prior_es, 'tokens': len(tokens_es), 'vocab': vocab_es },
    }

last_pred = -1

true_labels =[]
pred_labels =[]

for i, row in valid_ds.iterrows():
    text = row.Text
    label = row.Language
    true_labels.append(label)
    for lang in langs.keys():
        lang_obj = langs.get(lang)
        prior = lang_obj.get('prior')
        pred = nb(text, lang_obj.get('tokens'), prior, lang_obj.get('vocab'), vocab_size)
        if pred > last_pred:
            lp = pred
            pred_lang = lang
    pred_labels.append(pred_lang)

print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

     English       0.00      0.00      0.00       212
      French       0.00      0.00      0.00       174
     Spanish       0.25      1.00      0.41       132

    accuracy                           0.25       518
   macro avg       0.08      0.33      0.14       518
weighted avg       0.06      0.25      0.10       518



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


As our dataset is pretty imbalanced we have to consider `macro avg`. So with Naive Bayes we were only able to achieve 0.14 F1-score on our validation set and also it was unable to correctly classify English and French (hence, Sckit-learn giving us warning). But as you have noticed it took almost no time to train and generate predictions. Hence, quite fast!
Now lets see the same task with Deep Learning approach!

### Deep Learning Approach

So for Deep Learning we will use LSTM. I would not explain all the technicalities invovled in LSTM and deep learning (e.g batch, optimizer etc), rather I would discuss how to implement LSTM with Pytorch and how Deep Learning can help us to improve classification performance!
However, I would highly recommend very easily explained article from Olah _[Understanding LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)_. 
Lets import Pytorch's modules.

In [26]:
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset

from torchtext.vocab import build_vocab_from_iterator


from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [27]:
hyp_params = {
    "batch_size": 32,
    "embedding_dim": 125,
    "hidden_dim": 2
}

So as always, first step is to prepare data. We would create a parallel (source and target) kind of dataset where source is the text and target is the langauge of the source text. In Pytorch, we have to create a `Class` for dataset which should be inhereted from `Dataset` class provided by Pytorch and it must implement three functions: `__init__, __len__, and __getitem__`. `__len__` should return number of rows in our datasets and `__getitem__` should return a pair of `source` and `target`. We will also generate vocabulary of our dataset and encode our text and language using this vocabulary. Now let's code

In [28]:
class LangDataset(Dataset):
    def __init__(self, ds, train_vocab=None):
        self.corpus = ds

        if not train_vocab:
            self.src_vocab, self.trg_vocab = self._build_vocab()
        else:
            self.src_vocab, self.trg_vocab = train_vocab

    def __len__(self):
        return len(self.corpus)
    
    def __getitem__(self, item):
        text = self.corpus.iloc[item].Text
        lang = self.corpus.iloc[item].Language
        
        return {
            'src': self.src_vocab.lookup_indices(text.lower().split()),
            'trg': self.trg_vocab.lookup_indices([lang])
        }
    
    def _build_vocab(self):
        src_tokens = self.corpus.Text.str.cat().lower().split()
        
        src_vocab = build_vocab_from_iterator([src_tokens], specials=["<unk>","<pad>"])
        src_vocab.set_default_index(src_vocab['<unk>'])
        
        trg_vocab = build_vocab_from_iterator([['English', 'French', 'Spanish']])

        return src_vocab, trg_vocab

Now we only need to make batches then we would ready for model's training. To ease creation of batches, Pytorch provides really handy `DataLoader` class. We need to instantiate this class and provide `dataset` class, batch size and a collate function. So now what is collate function?
Collate function is a way to customize data batch process and shuffling. For this tutorial, we could avoid it and let Pytorch use its default collate function but learning it comes very handy in complex situation. So here we are just creating batch ourself, putting tensors on device and also padding them so that they are equal in length.

In [29]:
def collate_fn(batch, pad_value, device):
    trgs = []
    srcs = []
    for row in batch:
        srcs.append(torch.tensor(row["src"], dtype=torch.long).to(device))
        trgs.append(torch.tensor(row["trg"]).to(device))

    padded_srcs = pad_sequence(srcs, padding_value=pad_value)
    padded_trgs = pad_sequence(trgs, padding_value=pad_value)
    return {"src": padded_srcs, "trg": padded_trgs}
    
train_langds = LangDataset(train_ds)
valid_langds = LangDataset(valid_ds, (train_langds.src_vocab, train_langds.trg_vocab))

pad_value = train_langds.src_vocab['<pad>']

train_dt = DataLoader(train_langds, batch_size=hyp_params["batch_size"], shuffle=
                   True, collate_fn=lambda batch_size: collate_fn(batch_size, pad_value, device))

valid_dt = DataLoader(valid_langds, batch_size=hyp_params["batch_size"], shuffle=
                   True, collate_fn=lambda batch_size: collate_fn(batch_size, pad_value, device))

Finally, its time to talk about our classification model. Its really simple its has a embedding layer, then LSTM layer and at the end a linear layer. Embedding layer takes our tensors and return word embedding for them. Then we feed these embeddings to LSTM and finally we send hidden state of LSTM to a linear layer!

The question we should ask here is why we are not using output of LSTM for predictions? (I would try my best to not to go in theoritical details and explain clearly) Actually LSTM is a recurrent neural network and these network are designed to process sequential data. So LSTM takes each word at a time and produce output. So we have output for every input in the output of LSTM which we definitly we dont want. Rather, we want output after the complete sentence. So in `hidden_state` we have result of last token and this result was calculated using all the previous tokens as well. 

In [30]:
class Classifier(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim):
        super(Classifier, self).__init__()


        # Embedding is just an lookup table of size "vocab_size"
        # and each element has "embedding_size" dimension
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        # Shape --> [Sequence_length , batch_size , embedding dims]
        embedding = self.embedding(x)
        # Shape --> (output) [Sequence_length , batch_size , hidden_size]
        # Shape --> (hs, cs) [num_layers, batch_size size, hidden_size]
        outputs, (hidden_state, cell_state) = self.LSTM(embedding)
        
        linear_outputs = self.fc(hidden_state)
        
        return linear_outputs

#### Hyperparameters
Before going to the training details, I would like to mention that `batch size`, `embedding dim` and `hidden dim` are hypermeters. One can experiment with different values of them and may find better accuracy (fine-tuning).

As we are doing multi-class classification, so we will use `CrossEntropyLoss`. Now lets train our model for just 10 epochs and see: 

In [41]:
model = Classifier(len(train_langds.src_vocab), len(train_langds.trg_vocab), hyp_params["embedding_dim"], hyp_params["hidden_dim"])
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    epoch_loss = 0
    print('Epoch: ', epoch)
    for idx, batch in enumerate(train_dt):
        src = batch["src"]  # shape --> e.g. (19, 2) sentence len, batch size
        trg = batch["trg"]  # shape --> e.g. (3, 2) sentence len, batch size

        # Clear the accumulating gradients
        optimizer.zero_grad()

        # shape --> (1, 32, 3) 1, batch size, trg vocab
        output = model(src)

        # Calculate the loss value for every epoch
        # Squeezing to remove first dimension 
        loss = criterion(output.squeeze(0), trg.squeeze(0))

        # Calculate the gradients for weights & biases using back-propagation
        loss.backward()

        epoch_loss += loss.item()

        # Clip the gradient value is it exceeds > 1
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

        # Update the weights values
        optimizer.step()
    print('\tTrain loss: ', epoch_loss/len(train_dt), end=" ")
    
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(valid_dt):
            src = batch["src"]  # shape --> e.g. (19, 2) sentence len, batch size
            trg = batch["trg"]  # shape --> e.g. (3, 2) sentence len, batch size

            output = model(src)

            # Calculate the loss value for every epoch
            loss = criterion(output.squeeze(0), trg.squeeze(0))

            epoch_loss += loss.item()
    
    print('Eval loss: ', epoch_loss/len(valid_dt))

Epoch:  0
	Train loss:  1.1044260284479928 Eval loss:  1.0990194082260132
Epoch:  1
	Train loss:  1.0645176929586073 Eval loss:  1.0432960145613726
Epoch:  2
	Train loss:  1.0041135640705332 Eval loss:  0.9902160132632536
Epoch:  3
	Train loss:  0.9759966948453118 Eval loss:  0.9441788371871499
Epoch:  4
	Train loss:  0.9045599348404828 Eval loss:  0.8792884034268996
Epoch:  5
	Train loss:  0.8536732028512394 Eval loss:  0.8448967968716341
Epoch:  6
	Train loss:  0.8074105367941015 Eval loss:  0.8013440615990582
Epoch:  7
	Train loss:  0.7658839534310734 Eval loss:  0.7683879137039185
Epoch:  8
	Train loss:  0.7182312299223507 Eval loss:  0.7353206031462726
Epoch:  9
	Train loss:  0.6748375633183648 Eval loss:  0.6881032936713275


In [42]:
model.eval()

Classifier(
  (embedding): Embedding(16256, 125)
  (LSTM): LSTM(125, 2)
  (fc): Linear(in_features=2, out_features=3, bias=True)
)

In [43]:
true_labels =[]
pred_labels =[]

for i in valid_langds:
    inp = torch.tensor(i['src']).unsqueeze(1)
    trg = i['trg'][0]
    
    with torch.no_grad():
        pred = model(inp).view(-1).argmax().item()
 
    true_labels.append(trg)
    pred_labels.append(pred)

print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

           0       0.81      0.94      0.87       212
           1       0.70      0.94      0.80       174
           2       0.60      0.18      0.28       132

    accuracy                           0.75       518
   macro avg       0.70      0.69      0.65       518
weighted avg       0.72      0.75      0.70       518



Baaam, F1-score of 65 which is 4 times more than Naive Bayes. However, it is bit more difficult to code a LSTM model and it takes time to train, but just notice how drastically deep learning improved our model. 

### Inference

In [45]:
out_map = {0: 'English', 1: 'French', 2: 'Spanish'}

text = "hello, how are you?"
txt_to_ind = train_langds.src_vocab.lookup_indices(text.split())
inp_tensor = torch.tensor(txt_to_ind).unsqueeze(1)

with torch.no_grad():
    res = model(inp_tensor).view(-1).argmax().item()
out_map[res]

'English'