### Classification

In this notebook we study text classification, a basic and important task in Natural Language Processing (NLP). Text classification underlies many applications; sentiment analysis, language detection, author identification and text labelling are only a few examples. Here we implement language detection using two approaches. The first is statistical: we build the well-known Naive Bayes classifier. The second uses machine learning: we implement a very simple single-layer Long Short-Term Memory (LSTM) model. We use accuracy to compare the two models (we could also use the F1 measure, but I want to keep things simple). I used plain Python for the Naive Bayes implementation and PyTorch for the LSTM.

The dataset is from [Kaggle](https://www.kaggle.com/basilb2s/language-detection) and consists of texts with their respective language labels. It contains text from 17 different languages, but I picked only the 3 languages with the most entries (English, French and Spanish). I used 2,700 of the 3,218 rows (about 84%) for training and the remaining 518 rows (about 16%) for validation.

Any classification task is either binary or multi-class. Since we picked 3 languages, we are going to build a multi-class classifier. Let's start with the Naive Bayes implementation.
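
To make the metric concrete, here is a minimal sketch of accuracy next to macro-averaged F1 in plain Python. The helper names `accuracy` and `macro_f1` and the toy labels are illustrative assumptions, not part of the notebook's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); class c occurs in y_true,
        # so the denominator is never zero
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy labels (made up for illustration)
gold = ['English', 'English', 'French', 'Spanish']
pred = ['English', 'French', 'French', 'Spanish']
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Accuracy treats every example equally, while macro F1 gives each language the same weight regardless of how many examples it has.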

### Naive Bayes


- No data? Go with a rule-based approach
- Small data? Naive Bayes is a good fit
- Reasonable amount of data? Go with an SVM or Logistic Regression
- Huge data? Training ML models can be time-consuming, and this is where Naive Bayes shines. In fact, with enough data the choice of classifier may not even matter!
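
For reference, the classifier built in the cells below picks the language $c$ with the highest score for a text with tokens $t_1, \dots, t_n$. The notation is introduced here for illustration only: $N_c$ is the total number of training tokens in language $c$, $\mathrm{count}(t, c)$ is how often token $t$ occurs in language $c$, and $|V|$ is the vocabulary size including one slot for unknown words.

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} \frac{\mathrm{count}(t_i, c) + 1}{N_c + |V|},
\qquad
P(c) = \frac{\text{number of training texts in } c}{\text{total number of training texts}}
$$

The $+1$ in the numerator is add-one (Laplace) smoothing, which keeps a single unseen word from zeroing out the whole product.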




In [2]:
from collections import Counter
import pandas as pd
import numpy as np

In [5]:
# dataset courtesy: https://www.kaggle.com/basilb2s/language-detection
# https://github.com/jacoxu/StackOverflow
ds = pd.read_csv('Language Detection.csv')

print(f'{len(ds["Language"].unique())} languages in dataset ')
print(f'Total dataset count: {len(ds)}')

stats = { lang: len(ds.loc[ds['Language'] == lang])  for lang in ds["Language"].unique()  }
print(stats)

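# Shuffle the rows; np.random.shuffle(ds.values) is not guaranteed to shuffle the DataFrame in place
ds = ds.sample(frac=1).reset_index(drop=True)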
total_ds = ds.loc[(ds['Language'] == 'Spanish') | (ds['Language'] == 'French') | (ds['Language'] == 'English')]
print(f'Total dataset count (3 languages): {len(total_ds)}')

train_ds = total_ds[:2700]
valid_ds = total_ds[2700:]

# Only considering three languages: en, fr and es
ds_en = train_ds.loc[train_ds['Language'] == 'English']
ds_fr = train_ds.loc[train_ds['Language'] == 'French']
ds_es = train_ds.loc[train_ds['Language'] == 'Spanish']

17 languages in dataset 
Total dataset count: 10337
{'English': 1385, 'Malayalam': 594, 'Hindi': 63, 'Tamil': 469, 'Portugeese': 739, 'French': 1014, 'Dutch': 546, 'Spanish': 819, 'Greek': 365, 'Russian': 692, 'Danish': 428, 'Italian': 698, 'Turkish': 474, 'Sweedish': 676, 'Arabic': 536, 'German': 470, 'Kannada': 369}
Total dataset count (3 languages): 3218


In [448]:
# Total number of docs
no_of_docs = len(ds_en) + len(ds_fr) + len(ds_es) 

print(f"Total no of docs: {no_of_docs}")

Total no of docs: 2700


In [449]:
# Calculating prior probabilities
prior_en = len(ds_en)/no_of_docs
print(f"No. of en docs {len(ds_en)} P(en): {prior_en}")

prior_fr = len(ds_fr)/no_of_docs
print(f"No. of fr docs {len(ds_fr)} P(fr): {prior_fr}")

prior_es = len(ds_es)/no_of_docs
print(f"No. of es docs {len(ds_es)} P(es): {prior_es}")

No. of en docs 1170 P(en): 0.43333333333333335
No. of fr docs 852 P(fr): 0.31555555555555553
No. of es docs 678 P(es): 0.2511111111111111


In [450]:
# concatenating all rows into one corpus
corpus_en = ds_en.Text.str.cat()
corpus_fr = ds_fr.Text.str.cat()
corpus_es = ds_es.Text.str.cat()

In [451]:
# Pre-processing
# Lower casing and tokenization i.e here we are simply splitting by spaces
tokens_en = corpus_en.lower().split()
tokens_fr = corpus_fr.lower().split()
tokens_es = corpus_es.lower().split()
print(len(tokens_en))
print(len(tokens_fr))
print(len(tokens_es))

24523
18486
13709


In [452]:
# Creating vocabs
vocab_en = Counter()
vocab_fr = Counter()
vocab_es = Counter()

vocab_en.update(tokens_en)
vocab_fr.update(tokens_fr)
vocab_es.update(tokens_es)

size_vocab_en = len(vocab_en) 
size_vocab_fr = len(vocab_fr) 
size_vocab_es = len(vocab_es) 

print(len(tokens_en))
print()

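# Shared vocabulary size for add-one smoothing: each distinct word is counted
# once across all three languages, plus one for unknown words
vocab_size = len(set(tokens_en) | set(tokens_fr) | set(tokens_es)) + 1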

24523
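
Words shared between languages are counted once by a set union but several times when per-language vocabulary sizes are added up. A toy sketch of the difference, using made-up token lists:

```python
# Toy token lists (made up for illustration)
toy_en = ['no', 'problem', 'at', 'all']
toy_es = ['no', 'hay', 'problema']
toy_fr = ['pas', 'de', 'probleme']

# Adding up per-language vocabularies counts the shared word 'no' twice
summed = len(set(toy_en)) + len(set(toy_es)) + len(set(toy_fr))

# The union counts every distinct word exactly once
union = len(set(toy_en) | set(toy_es) | set(toy_fr))

print(summed, union)
```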



In [453]:
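def nb(text, cls_w_count, cls_prior, cls_vocab, vocab_size):
    """Unnormalised Naive Bayes score for one class, with add-one smoothing."""
    tokens = text.strip().lower().split()
    res = cls_prior
    # Denominator: total tokens in the class plus the vocabulary size
    denom = cls_w_count + vocab_size

    for t in tokens:
        # Numerator: count of the token in the class, plus one
        num = cls_vocab.get(t, 0) + 1
        res *= (num / denom)

    return res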
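
Multiplying many small probabilities can underflow to 0.0 for long texts. A common remedy is to add log-probabilities instead. The sketch below is a log-space variant of the scoring function above; the name `nb_log` and the toy counts are assumptions for illustration, and the argument order mirrors `nb`.

```python
import math
from collections import Counter


def nb_log(text, cls_w_count, cls_prior, cls_vocab, vocab_size):
    """Log of the add-one-smoothed Naive Bayes score for one class."""
    tokens = text.strip().lower().split()
    log_score = math.log(cls_prior)
    denom = cls_w_count + vocab_size
    for t in tokens:
        log_score += math.log((cls_vocab.get(t, 0) + 1) / denom)
    return log_score


# Toy class statistics (made up for illustration)
toy_vocab_en = Counter('the cat sat on the mat'.split())
toy_vocab_fr = Counter('le chat est sur le tapis'.split())
toy_vocab_size = len(set(toy_vocab_en) | set(toy_vocab_fr)) + 1

scores = {
    'English': nb_log('the cat', 6, 0.5, toy_vocab_en, toy_vocab_size),
    'French': nb_log('the cat', 6, 0.5, toy_vocab_fr, toy_vocab_size),
}
best = max(scores, key=scores.get)
print(best)
```

Because the logarithm is monotonic, the language with the highest log score is also the one with the highest probability. Log scores are negative, so a running maximum over them has to start from negative infinity rather than from -1.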

In [454]:
import math
langs = {
        'English': {'prior': prior_en, 'tokens': len(tokens_en), 'vocab': vocab_en },
        'French': {'prior': prior_fr, 'tokens': len(tokens_fr), 'vocab': vocab_fr },
        'Spanish': {'prior': prior_es, 'tokens': len(tokens_es), 'vocab': vocab_es },
    }

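total = len(valid_ds)
correct = 0

for i, row in valid_ds.iterrows():
    text = row.Text
    label = row.Language
    # Reset the best score for every row, then keep the highest-scoring language
    best_pred = float('-inf')
    pred_lang = None
    for lang, lang_obj in langs.items():
        pred = nb(text, lang_obj['tokens'], lang_obj['prior'], lang_obj['vocab'], vocab_size)
        if pred > best_pred:
            best_pred = pred
            pred_lang = lang
    if pred_lang == label:
        correct += 1

# The accuracy recorded below (141/518) comes from a run in which the best score
# was never updated, so every row was labelled 'Spanish'; rerun to refresh it.
print('Accuracy: ', correct/total)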

Accuracy:  0.2722007722007722
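
An accuracy below one in three on a three-class problem is worth a second look. A quick sanity check is to count (gold, predicted) pairs and see whether one class swallows everything. The sketch below uses made-up labels and a hypothetical helper named `confusion_counts`.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))


# Toy labels (made up): a classifier that always answers 'Spanish'
gold = ['English', 'English', 'French', 'Spanish']
predicted = ['Spanish'] * len(gold)

counts = confusion_counts(gold, predicted)
for (g, p), n in sorted(counts.items()):
    print(f'gold={g:8s} predicted={p:8s} count={n}')

# Distinct predicted labels; a single entry means one class swallows everything
predicted_labels = {p for _, p in counts}
print(predicted_labels)
```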


### Machine Learning Approach

In [455]:
import torch.nn as nn
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [520]:
from collections import Counter
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import Dataset
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm

class LangDataset(Dataset):
    def __init__(self, ds, train_vocab=None):
        self.corpus = ds

        if not train_vocab:
            self.src_vocab, self.trg_vocab = self._build_vocab()
        else:
            self.src_vocab, self.trg_vocab = train_vocab

    def __len__(self):
        return len(self.corpus)
    
    def __getitem__(self, item):
        
        lang = self.corpus.iloc[item].Language
        text = self.corpus.iloc[item].Text
        
        return {
            'src': self.src_vocab.lookup_indices(text.lower().split()),
            'trg': self.trg_vocab.lookup_indices([lang])
        }
    
    def _build_vocab(self):
        src_tokens = self.corpus.Text.str.cat().lower().split()
        
        src_vocab = build_vocab_from_iterator([src_tokens], specials=["<unk>","<pad>"])
        src_vocab.set_default_index(src_vocab['<unk>'])
        
        trg_vocab = build_vocab_from_iterator([['English', 'French', 'Spanish']])

        return src_vocab, trg_vocab
        
    
def collate_fn(batch, device):
    trgs = []
    srcs = []
    for row in batch:
        srcs.append(torch.tensor(row["src"], dtype=torch.long).to(device))
        trgs.append(torch.tensor(row["trg"]).to(device))

    padded_srcs = pad_sequence(srcs, padding_value=1)
    padded_trgs = pad_sequence(trgs, padding_value=1)
    return {"src": padded_srcs, "trg": padded_trgs}
    
train_langds = LangDataset(train_ds)
valid_langds = LangDataset(valid_ds, (train_langds.src_vocab, train_langds.trg_vocab))


train_dt = DataLoader(train_langds, batch_size=32, shuffle=True,
                      collate_fn=lambda batch: collate_fn(batch, device))

# The validation set does not need shuffling
valid_dt = DataLoader(valid_langds, batch_size=32, shuffle=False,
                      collate_fn=lambda batch: collate_fn(batch, device))


In [521]:
class Classifier(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_size, hidden_size):
        super(Classifier, self).__init__()


        # Embedding is just a lookup table with "vocab_size" rows,
        # each of dimension "embedding_size"
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.LSTM = nn.LSTM(embedding_size, hidden_size)

        # Maps the final hidden state to one logit per language; no activation
        # is applied because nn.CrossEntropyLoss expects raw logits
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Shape --> [sequence_length, batch_size, embedding_size]
        embedding = self.embedding(x)
        # Shape --> (outputs) [sequence_length, batch_size, hidden_size]
        # Shape --> (hidden_state, cell_state) [num_layers, batch_size, hidden_size]
        outputs, (hidden_state, cell_state) = self.LSTM(embedding)

        # Shape --> [num_layers, batch_size, output_size]; num_layers is 1 here
        linear_outputs = self.fc(hidden_state)
        
        return linear_outputs
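
One caveat of reading the final hidden state from padded batches is that shorter sentences are read through their `<pad>` tokens as well. A possible variant, sketched below under assumptions, packs the batch with `pack_padded_sequence` so that the final hidden state corresponds to each sentence's last real token. The class name `PackedClassifier`, the `lengths` argument and the toy sizes are hypothetical; the notebook's `collate_fn` does not return lengths.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence


class PackedClassifier(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_size, hidden_size, pad_idx=1):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embedding_size, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_size, hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, lengths):
        # x: [sequence_length, batch_size]; lengths: true length of each sentence
        embedded = self.embedding(x)
        packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
        _, (hidden_state, _) = self.lstm(packed)
        # hidden_state[-1]: [batch_size, hidden_size], taken at the last real token
        return self.fc(hidden_state[-1])


torch.manual_seed(0)
toy_model = PackedClassifier(vocab_size=20, output_size=3, embedding_size=8, hidden_size=4)
toy_model.eval()

# Two toy sentences of different lengths, padded with index 1 (<pad>)
sentences = [torch.tensor([5, 6, 7, 8]), torch.tensor([9, 10])]
lengths = torch.tensor([len(s) for s in sentences])
batch = pad_sequence(sentences, padding_value=1)  # [4, 2]

with torch.no_grad():
    logits = toy_model(batch, lengths)
    # The short sentence on its own, without any padding
    alone = toy_model(sentences[1].unsqueeze(1), torch.tensor([2]))

print(logits.shape)
print(torch.allclose(logits[1], alone[0], atol=1e-5))
```

If something like this were adopted, `collate_fn` would also need to return the unpadded length of every sentence in the batch.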

In [522]:
# Move the model to the same device as the batches produced by collate_fn
model = Classifier(len(train_langds.src_vocab), len(train_langds.trg_vocab), 125, 2).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    epoch_loss = 0
    print('Epoch: ', epoch)
    for idx, batch in enumerate(train_dt):
        src = batch["src"]  # shape --> e.g. (19, 32): sentence len, batch size
        trg = batch["trg"]  # shape --> (1, 32): one label per sentence, batch size

        # Clear the accumulated gradients
        optimizer.zero_grad()

        output = model(src)

        # Calculate the loss value for this batch
        loss = criterion(output.squeeze(0), trg.squeeze(0))

        # Calculate the gradients for weights & biases using back-propagation
        loss.backward()

        epoch_loss += loss.item()

        # Clip the gradient norm if it exceeds 1
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

        # Update the weights values
        optimizer.step()
    print('\tTrain loss: ', epoch_loss/len(train_dt))
    
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(valid_dt):
            src = batch["src"]  # shape --> e.g. (19, 32): sentence len, batch size
            trg = batch["trg"]  # shape --> (1, 32): one label per sentence, batch size

            output = model(src)

            # Calculate the loss value for this batch
            loss = criterion(output.squeeze(0), trg.squeeze(0))

            epoch_loss += loss.item()
    
    print('\tEval loss: ', epoch_loss/len(valid_dt))


Epoch:  0
	Train loss:  1.0750935905119952
	Eval loss:  1.0514680883463692
Epoch:  1
	Train loss:  0.9848196240032421
	Eval loss:  0.975800030371722
Epoch:  2
	Train loss:  0.9092971254797543
	Eval loss:  0.9175489439683802
Epoch:  3
	Train loss:  0.8410687951480641
	Eval loss:  0.8594990302534664
Epoch:  4
	Train loss:  0.7871780760147993
	Eval loss:  0.8522900798741508
Epoch:  5
	Train loss:  0.73675176115597
	Eval loss:  0.7715242785565993
Epoch:  6
	Train loss:  0.678215089265038
	Eval loss:  0.7462292243452633
Epoch:  7
	Train loss:  0.6394221372464124
	Eval loss:  0.7418948341818417
Epoch:  8
	Train loss:  0.5864623855142033
	Eval loss:  0.730551532086204
Epoch:  9
	Train loss:  0.5540207045919755
	Eval loss:  0.7163246028563556


In [524]:
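total = len(valid_ds)
correct = 0

# Run on whichever device the model lives on, without tracking gradients
model_device = next(model.parameters()).device
model.eval()
with torch.no_grad():
    for i in valid_langds:
        # Shape --> [sequence_length, 1]: a batch holding a single sentence
        inp = torch.tensor(i['src'], dtype=torch.long, device=model_device).unsqueeze(1)
        trg = i['trg'][0]

        pred = model(inp).view(-1).argmax().item()

        if pred == trg:
            correct += 1

print('Accuracy: ', correct/total)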

Accuracy:  0.6988416988416989
