# Language Classification

## Overview

My aim here is to build a language classifier for EU languages.

Proposed Approach:
1. Inspect test set
1. Create dataset for training / validation
1. Train / valid split
1. Numericalize
1. Build language classification model

## Setup

In [None]:
import sys
from pathlib import Path
import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from tqdm import tqdm
import dill  # Better version of pickle, able to save objects with lambda expressions
import copy  # Used for making a deep copy of a model
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim

import utils

In [None]:
start = time.time()
np.random.seed(1)

## Settings

In [None]:
PATH = Path('data')  # Directory for all data and temporary files
# Note: to run on the whole training set, specify PATH/'train' as the path.
# However, since that corpus is huge, the learning rate will have to be reduced.
# Also, the whole corpus will need a much higher training time.
TRAIN = PATH/'train_sampl'  # Directory for training text
TEST_FN = PATH/'test'  # Filename for test text
PATH_TMP = PATH/'tmp'  # Temporary directory to save progress

MIN_FREQ = 5  # We'll replace words with lower frequency with unknown
SEQ_LEN = 32  # Length of the sequences passed into our GRU

BS = 512  # Batch size for our RNN

EMB_SZ = 300  # Dimension of word embeddings
HIDDEN_SZ = 200  # Hidden layer dimension of the GRU
EMB_DROP = 0.25  # Dropout applied to embeddings
LAYER_DROP = 0.25  # Dropout applied after GRU

# List of languages
LANGS = sorted(p.name for p in TRAIN.iterdir())  # Sorted so label indices are stable across runs

assert torch.cuda.is_available()  # Notebook is written for GPU computations.

In [None]:
PATH_TMP.mkdir(parents=True, exist_ok=True)

## Clarify Goal

Let's first have a look at the test set we are trying to predict. It looks like a simple text classification task.

In [None]:
test = pd.read_csv(TEST_FN, sep = '\t', lineterminator='\n', header=None)
test.rename({0:'label', 1:'text'}, axis = 1, inplace=True)
test[test['label'] == 'en'].head()

Before going any further, let's apply some preprocessing. In particular, I apply the following steps:
1. Remove uninformative meta-comments, such as who is speaking.
1. Replace numbers with a generic *num* token. After all, the specific number shouldn't affect the classification results.
1. Create a special end-of-sentence (*eos*) token.
1. Replace all punctuation with a special *punc* token. 
1. Collapse adjacent whitespace. In other words, '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;' becomes '&nbsp;'.
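
The actual implementation lives in `utils.preprocess`, which is not shown in this notebook. As a rough sketch of the five steps above, here is a hypothetical `preprocess_sketch` function. The token spellings `<num>` and `<punc>`, the assumption that meta-comments are markup tags such as `<SPEAKER ID=1>`, and the exact regular expressions are all my guesses for illustration, not the real helper.

```python
import re

def preprocess_sketch(text):
    """Hypothetical stand-in for utils.preprocess (token names are assumptions)."""
    # 1. Drop meta-comments, assumed here to be markup tags like <SPEAKER ID=1>
    text = re.sub(r'<[^>]*>', ' ', text)
    # 2. Replace numbers (optionally with decimal/thousand separators) by <num>
    text = re.sub(r'\d+(?:[.,]\d+)*', ' <num> ', text)
    # 3. Mark sentence endings with <eos>
    text = re.sub(r'[.!?]+', ' <eos> ', text)
    # 4. Replace remaining punctuation by <punc> (angle brackets are left alone
    #    so the special tokens inserted above survive)
    text = re.sub(r'[,;:()"\[\]\-]', ' <punc> ', text)
    # 5. Collapse adjacent whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess_sketch('<SPEAKER ID=1> We have 3 items, okay.'))
```

The order matters in this sketch: numbers are replaced before sentence endings so that a decimal point is not mistaken for the end of a sentence.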

In [None]:
test['text'] = test['text'].apply(utils.preprocess)

Let's check a random English and German sentence after pre-processing.

In [None]:
print(test[test['label']=='en'].iloc[0]["text"])
print('---')
print(test[test['label']=='de'].iloc[0]["text"])

In [None]:
def word_count(x): return len(x.split())
def sentence_count(x): return len(x.split('<eos>')) - 1
test['text'].apply([sentence_count, word_count, len]).describe()

The target for our classification model has the following characteristics:
1. The vast majority of examples are a single sentence.
1. Most of the time, we have a decent number of words (15-33) to predict a language.
1. However, we can have as few as 3 words. This might pose a challenge if those words are not language-specific.
1. Content-wise, most sentences seem to be about parliamentary proceedings (my guess: proceedings of the EUP for more recent years).

## Preprocess Training/Validation Data

Our training set has similar content but a different format. Specifically, it is not split into sentence-level chunks like the test set; instead, we have files containing varying numbers of sentences. So we need to create a dataframe that resembles our test set.

First, we take all files for a language and concatenate them together.

In [None]:
# An example with English, we'll do all processing steps for all files below.
exampl = utils.concat_docs('en', TRAIN)
exampl[:200]

Now we apply the same pre-processing as we did to our test set, and turn the whole corpus into a list of sentences.

In [None]:
# Example continued.
exampl = utils.txt2list(utils.preprocess(exampl[:1000]))
exampl[:3]

There is one last step. The test set occasionally (although not often) contains multiple sentences per example, so we want the training set to occasionally contain multiple sentences as well. We can accomplish this by concatenating adjacent sentences with a small probability (p = 0.02).
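
To make the idea concrete, here is a minimal sketch of such a merge step. The function name `concat_random_sent_sketch` and the `seed` parameter are hypothetical; the real `utils.concat_random_sent` is not shown here and may work differently.

```python
import random

def concat_random_sent_sketch(sentences, p=0.02, seed=None):
    """Hypothetical stand-in for utils.concat_random_sent.

    With probability p, a sentence is appended to the previous example
    instead of starting a new one.
    """
    rng = random.Random(seed)  # Local generator, so global state is untouched
    out = []
    for sent in sentences:
        if out and rng.random() < p:
            out[-1] = out[-1] + ' ' + sent  # Merge with the previous example
        else:
            out.append(sent)  # Start a new example
    return out

sents = ['a <eos>', 'b <eos>', 'c <eos>']
print(concat_random_sent_sketch(sents, p=0.5, seed=0))
```

With p = 0.02, roughly one sentence in fifty gets merged into its predecessor, so most examples stay single sentences.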

Let's apply all the above steps to all languages. I will also put everything into a dataframe with an extra column giving the language label.

In [None]:
dfs = []  # List to store data frames
for lang in LANGS:
    print(' '+lang+' ', end = "")
    txt = utils.concat_docs(lang, TRAIN)  # Concatenate all files
    txt = utils.preprocess(txt)  # Apply preprocessing described in test section
    txt = utils.txt2list(txt)  # Convert to list
    txt = utils.concat_random_sent(txt, p = 0.02)  # Concatenate random adjacent sentences
    temp_df = pd.DataFrame({'text':txt})  
    temp_df['label'] = lang
    dfs.append(temp_df)
df = pd.concat(dfs)[['label', 'text']]
df.reset_index(inplace=True, drop = True)
df.head()

In [None]:
df['text'].apply([sentence_count, word_count, len]).describe()

Our resulting dataframe looks very similar to our test set. 

One difference: we don't go quite as high on the maximum words and sentences. That's not going to matter though, as later on I will truncate all text at 32 words anyway.

In [None]:
del(dfs, temp_df, txt, exampl)
with open(PATH_TMP/'df.pickle', mode = 'wb') as f:
    dill.dump(df, f)

In [None]:
#df = dill.load(open(PATH_TMP/'df.pickle', mode = 'rb'))

## Train/Validation Split

Let's split the data for training and validation. No big surprises here.

I use 1% of the data as validation. If that seems unusual, note that our dataset contains ~2 million rows, so our validation set will contain ~20k. I'm only using the validation set to monitor performance and check for over-fitting; 20k examples are more than enough for that.

In [None]:
len(df.index)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(np.array(df['text']), np.array(df['label']), 
                                                  test_size=0.01, random_state=42)
y_train

In [None]:
del(df)

## Numericalize

We need to turn our words into integer indices. Later we use these indices to look up the embeddings for each word.

Let's start by counting the number of times a word appears in our training data.

In [None]:
words = Counter()
for row in tqdm(X_train, position=0, leave=False): words.update(row.split())
words.most_common(10)

No big surprises: the most frequent words are punctuation, end-of-sentence tokens, articles and prepositions.

Now we drop all words that appear fewer than 5 times (the threshold set by MIN_FREQ). This keeps our embedding matrix smaller. Decreasing MIN_FREQ may increase the accuracy of our model, at a higher memory and computation cost.

We also add two special tokens: unknown (*unk*) and padding (*pad*). Unknown stands for any word not appearing in our list: new words in our test or validation set, as well as the words we just dropped due to low frequency. Padding will later be used to make all sequences equal length.

In [None]:
words = {k:v for k, v in tqdm(words.items(), leave = False) if v >= MIN_FREQ}
words = sorted(words, key=words.get, reverse=True)
words = ['<unk>','<pad>'] + words

In [None]:
vocab_size = len(words)
vocab_size

We have more than 300k unique words. Now we need to create a mapping from words to integers (and back). I'll use the dictionaries below to do so.

Note that unknown is mapped to 0, and padding is mapped to 1 (they are the first two elements by construction).

In [None]:
word2idx = defaultdict(lambda: 0, {o:i for i,o in enumerate(words)})
idx2word = defaultdict(lambda: '<unk>', {i:o for i,o in enumerate(words)})

Here is the first sentence of the training set converted into indices:

In [None]:
print([word2idx[w] for w in X_train[0].split()])

Now we apply the above to the whole training set. I also truncate long sentences at 32 words -- that should be more than enough to classify a language, and having more words would needlessly slow down computation.

I also pad sentences that are shorter than 32 words with the special padding token. This way, all examples are of equal length, making subsequent computations easier.
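
Since `utils.numericalize` is not shown, here is a small sketch of what truncating and padding could look like. The name `numericalize_sketch` and the `unk_idx` / `pad_idx` parameters are hypothetical. I also assume padding goes on the left, so that the last position always holds a real word; the real helper may well pad on the right instead.

```python
import numpy as np

def numericalize_sketch(texts, word2idx, maxlen=32, unk_idx=0, pad_idx=1):
    """Hypothetical stand-in for utils.numericalize (left padding is an assumption)."""
    # Start with a matrix full of padding, one row per example
    out = np.full((len(texts), maxlen), pad_idx, dtype=np.int64)
    for i, text in enumerate(texts):
        # Look up each word, fall back to <unk>, and truncate at maxlen
        idxs = [word2idx.get(w, unk_idx) for w in text.split()][:maxlen]
        if idxs:
            out[i, maxlen - len(idxs):] = idxs  # Right-align the words
    return out

toy_vocab = {'<unk>': 0, '<pad>': 1, 'a': 2, 'b': 3}
print(numericalize_sketch(['a b zzz', 'b', 'a a a a a b'], toy_vocab, maxlen=4))
```

In the toy example, `zzz` is not in the vocabulary and becomes 0 (unknown), the short sentence is filled up with 1 (padding), and the long sentence is cut off after four words.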

In [None]:
X_train = utils.numericalize(X_train, word2idx, maxlen = SEQ_LEN)
X_val = utils.numericalize(X_val, word2idx, maxlen = SEQ_LEN)
print(X_train.shape)

Of course, we can always convert our indices back. Below is our first training example converted back to text.

In [None]:
utils.de_numericalize(X_train[:1], idx2word)

We'll also replace languages with contiguous integers.

In [None]:
lang2idx = defaultdict(lambda: 0, {o:i for i,o in enumerate(LANGS)})
idx2lang = defaultdict(lambda: '<unk>', {i:o for i,o in enumerate(LANGS)})

In [None]:
print(y_train[:5])  # Before
y_train = np.array([lang2idx[x] for x in y_train])
y_val = np.array([lang2idx[x] for x in y_val])
print(y_train[:5])  # After

In [None]:
with open(PATH_TMP/'numericalized.pickle', mode = 'wb') as f:
    dill.dump([words, vocab_size, word2idx, idx2word, X_train, X_val, y_train, y_val], f)

In [None]:
#with open(PATH_TMP/'numericalized.pickle', mode = 'rb') as f:
#    (words, vocab_size, word2idx, idx2word, X_train, X_val, y_train, y_val) = dill.load(f)

In [None]:
end = time.time()
print(f'Time after pre-processing : {(end - start)/60} mins')

## Define Dataloaders

We take our training and validation data, convert them from numpy arrays to torch tensors, and put them in dataloaders. 

In [None]:
X_train = torch.from_numpy(X_train).type(torch.int64)
y_train = torch.from_numpy(y_train).type(torch.int64)
X_val = torch.from_numpy(X_val).type(torch.int64)
y_val = torch.from_numpy(y_val).type(torch.int64)

train_dl = DataLoader(TensorDataset(X_train, y_train), batch_size=BS, shuffle = True)
valid_dl = DataLoader(TensorDataset(X_val, y_val), batch_size=BS, shuffle = False)

## Define Model



In [None]:
class Lang_Detect(nn.Module):
    def __init__(self, emb_sz = EMB_SZ, vocab_size = vocab_size,
                 hidden_sz = HIDDEN_SZ, out_sz = len(LANGS), 
                 emb_drop = EMB_DROP, layer_drop = LAYER_DROP):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_sz)
        self.emb_drop = nn.Dropout(emb_drop)
        self.emb.weight.data.uniform_(-0.05, 0.05)
        self.gru = nn.GRU(emb_sz, hidden_sz)
        self.drop = nn.Dropout(layer_drop)
        self.lout = nn.Linear(hidden_sz, out_sz)
        self.hidden_sz = hidden_sz
                
    def forward(self, seq): 
        bs, _ = seq.shape
        h =  torch.zeros(1, bs, self.hidden_sz).cuda()
        embedded = self.emb(seq).transpose(0, 1)
        outputs, _ = self.gru(self.emb_drop(embedded), h)
        output = self.lout(self.drop(outputs[-1]))
        return output

In [None]:
model = Lang_Detect().cuda()

In [None]:
loss_func = nn.CrossEntropyLoss().cuda()

In [None]:
def loss_batch(xb, yb, model, loss_func, opt=None):
    '''https://github.com/fastai/fastai_v1/blob/master/dev_nb/001a_nn_basics.ipynb'''
    # Adapted from the link above: batches are moved to the GPU here;
    # yb is already 1-D, so no yb.view(-1) is needed

    loss = loss_func(model(xb.cuda()), yb.cuda())

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

In [None]:
class Learner(object):
    
    def __init__(self, model, loss_func, train_dl = None, valid_dl = None):
        self.model = model
        self.loss_func = loss_func
        self.train_dl = train_dl
        self.valid_dl = valid_dl
        self.losses = []
    
    def lr_find(self, start = 1e-6, end = 1e3, exp_smooth_param = 0.95):
        
        self.losses = []
        old_state_dict = copy.deepcopy(self.model.state_dict())
        self.model.train()
        lr = start; lrs = []; losses = []; i = 0
        for xb,yb in tqdm(self.train_dl, leave = False,
                         position = 0):
            opt = optim.Adam(self.model.parameters(), lr=lr)
            loss, _ = loss_batch(xb, yb, self.model, self.loss_func, opt)
            lrs.append(lr), losses.append(loss)
            if (lr > end) or (i > 10 and loss > 3*np.mean(losses[:i])):
                break
            lr *= 1.03; i += 1        
        self.losses = losses
        self.plot_loss(x = lrs, xlog=True, exp_smooth_param = exp_smooth_param)
        self.losses = []  # Reset list
        self.model.load_state_dict(old_state_dict)
        
    def plot_loss(self, x = None, xlog = False, exp_smooth_param = 0.95,
                 skip_edges = False):
        y_smooth = utils.exp_smooth(np.array(self.losses), exp_smooth_param)
        if skip_edges:
            y_smooth = y_smooth[10:-10]
        f, ax = plt.subplots(figsize=(5, 5))
        if xlog:
            ax.set(yscale = 'log', xscale = 'log')
        else:
            ax.set(yscale = 'log')
        if x is not None:
            if skip_edges:
                x = x[10:-10]
            ax = plt.plot(x, y_smooth)
        else:
            ax = plt.plot(y_smooth)     
            
    def fit(self, lr, epochs):
        
        opt = optim.Adam(self.model.parameters(), lr=lr)
        
        for epoch in range(epochs):
        
            # Fit model to training data
            self.model.train()
            losses, nums = zip(*[loss_batch(xb, yb, self.model, self.loss_func, opt) 
                                 for xb,yb in tqdm(self.train_dl, leave = False,
                                                  position = 0)])
            train_loss = np.sum(np.multiply(losses,nums)) / np.sum(nums)
            self.losses = self.losses+list(losses)
            
            if self.valid_dl is not None:
                
                self.model.eval()
                with torch.no_grad():
                    
                    losses,nums = zip(*[loss_batch(xb, yb, self.model, self.loss_func)
                                        for xb,yb in self.valid_dl])
                    val_loss = np.sum(np.multiply(losses,nums)) / np.sum(nums)
                    
                    val_preds = self.predict(self.valid_dl)
                    y_val = self.valid_dl.dataset.tensors[1]
                    acc = utils.accuracy(val_preds, y_val)                    
                    
                print(f'Epoch {epoch}. Training loss: {train_loss}. ' +
                      f'Validation loss: {val_loss}. Accuracy: {acc}')
                
            else:
                print(f'Epoch {epoch}. Training loss: {train_loss}.')
                        
    def predict(self, dl):
        self.model.eval()
        with torch.no_grad():
            res = [self.model(xb.cuda()).argmax(dim = -1).view(-1) for 
                   xb, _ in tqdm(dl, leave = False, position = 0)]
        return torch.cat(res)
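
The `plot_loss` method above relies on `utils.exp_smooth`, which is not shown in this notebook. As a rough idea of what such a helper could do, here is a sketch of an exponentially weighted moving average with bias correction. The name `exp_smooth_sketch`, the parameter name `beta`, and the use of bias correction are my assumptions; the real helper may differ.

```python
import numpy as np

def exp_smooth_sketch(values, beta=0.95):
    """Hypothetical stand-in for utils.exp_smooth (bias correction is an assumption)."""
    values = np.asarray(values, dtype=float)
    smoothed = np.empty_like(values)
    avg = 0.0
    for i, v in enumerate(values):
        # Running average: keep a fraction beta of the old value, add the rest from v
        avg = beta * avg + (1 - beta) * v
        # Bias correction, so early values are not pulled towards the initial 0
        smoothed[i] = avg / (1 - beta ** (i + 1))
    return smoothed

noisy = [2.0, 1.0, 3.0, 1.5, 2.5]
print(exp_smooth_sketch(noisy, beta=0.9))
```

A higher `beta` gives a smoother but more delayed curve, which is why per-batch losses are usually smoothed before plotting.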

In [None]:
learn = Learner(model, loss_func, train_dl, valid_dl)

In [None]:
preds = learn.predict(valid_dl)
utils.accuracy(preds, y_val)

In [None]:
lr = 1e-3

In [None]:
learn.fit(lr, 3)
with open(PATH_TMP/'model0.pickle', mode = 'wb') as f:
    dill.dump(learn.model.state_dict(), f)
learn.plot_loss()

In [None]:
learn.fit(lr/1e2, 3)
with open(PATH_TMP/'model1.pickle', mode = 'wb') as f:
    dill.dump(learn.model.state_dict(), f)
learn.plot_loss()

In [None]:
end = time.time()
print(f'Time after training : {(end - start)/60} mins')

## Predict Test Set

In [None]:
X_test = utils.numericalize(np.array(test['text']), word2idx, maxlen = SEQ_LEN)
y_test = np.array([lang2idx[x] for x in test['label']])

In [None]:
X_test = torch.from_numpy(X_test).type(torch.int64)
y_test = torch.from_numpy(y_test).type(torch.int64)

test_dl = DataLoader(TensorDataset(X_test, y_test), batch_size=BS, shuffle = False)

In [None]:
preds = learn.predict(test_dl)
utils.accuracy(preds, y_test)

In [None]:
test['pred'] = [idx2lang[x] for x in utils.conv2np(preds)]
test['correct'] = (test['pred'] == test['label'])*1
inc_total = len(test.index)-sum(test['correct'])
print(f"Total number of mispredicted: {inc_total}")
test.groupby(by = 'label')['correct'].agg('mean').sort_values()

In [None]:
def print_incorrect(i = None, lang = None):
    
    flag = (test['correct']==0)
    if lang is not None:
        flag = flag  & (test['label']==lang)
    if i is not None:
        ex = test[flag].iloc[i]
    else:
        ex = test[flag].sample(1).iloc[0]
    print(f'class: {ex["label"]}, ', end = "")
    print(f'predicted: {ex["pred"]}, ', end = "")
    print(f'text: {ex["text"]}, ', end = "")
    print('\n')

In [None]:
for i in range(inc_total): print_incorrect(i)

In [None]:
end = time.time()
print(f'Total time : {(end - start)/60} mins')