# Language Detection Model

I will try to build a language detection model that predicts the language whose n-gram frequency vector has the smallest Euclidean distance to that of the input text.

We will need to choose the n value. Here are a few hypotheses:

* 1-grams are just individual characters and will not retain any sequential information about the data, which seems important for language detection.

* 2-grams are pairs of characters. 2 could possibly be an adequate n value; however, 2-grams only capture immediate sequential information.

* 3-grams or 4-grams intuitively seem to be enough to capture most of the useful sequential information.

* 5-grams and larger may overfit to the training data. These will contain many whole words, becoming more like a "dictionary" which will be ineffective when there are words in the test set that are not part of the training set.
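
To make these hypotheses concrete, here is a minimal sketch of what character n-grams look like for a few values of n. The helper `char_ngrams` and the sample string are made up for illustration and are not part of the model below. Note how the 5-grams already start to look like whole words.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "the cat"
for n in (1, 2, 3, 5):
    # Larger n keeps more sequential context but yields fewer, more specific n-grams
    print(n, char_ngrams(sample, n))
```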

Let's test these.

In [1]:
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

### Let's import the training data.

In [2]:
# Amount of training data to read from each file
# (in characters, since the files are opened in text mode)
TRAINING_DATA_SIZE = 2000000

LANGUAGES = [
    'sv', 'da', 'de', 'nl', 'en', 'fr', 'es', 'pt', 'it', 'ro', 'et',
    'fi','lt', 'lv', 'pl', 'sk', 'cs', 'sl', 'hu', 'bg',  'el'
]

# File names
files = [
    "train/europarl-v7.{lang}-en.{lang}".format(lang=x)
    for x in LANGUAGES
]

# Open files
corpus_raw = [
    open(x).read(TRAINING_DATA_SIZE)
    for x in files
]

Let's remove punctuation. While it is probably not entirely useless for language detection, it increases the number of distinct n-grams, which would reduce our confidence in the n-gram frequencies.

In [3]:
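# Strip punctuation, digits, soft hyphens and newlines.
# Note: inside the character class, `!--` is a range from '!' to '-',
# so #, $, &, ', * and - are removed as well.
corpus = [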
    re.sub(r'[?”_"%()!--+,:;./\]\[\xad\n0-9\=<>]', '', x)
    for x in corpus_raw
]
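
To see what this substitution actually removes, here is a small sketch that applies the same pattern to a made-up sample string. The name `PUNCTUATION` and the sample are assumptions for illustration. Because `!--` is read as a range, apostrophes and hyphens disappear too, and some Python versions warn about `--` inside a character class, so the warning is silenced here.

```python
import re
import warnings

# Same pattern as in the cell above, stored under a hypothetical name
PUNCTUATION = r'[?”_"%()!--+,:;./\]\[\xad\n0-9\=<>]'

sample = "don't stop-now! 50% #1"

with warnings.catch_warnings():
    # Newer Python versions may emit a FutureWarning for `--` in a character class
    warnings.simplefilter("ignore", FutureWarning)
    cleaned = re.sub(PUNCTUATION, '', sample)

# Letters and spaces survive; punctuation and digits are gone
print(repr(cleaned))
```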

Now we can extract the n-grams from the training data.

In [4]:
N_VALUES = [1,2,3,4,5,6,7]

count_vectorizers = [
    CountVectorizer(ngram_range=(n, n), analyzer='char_wb')
    for n in N_VALUES
]

counts_array = [
    count_vectorizer.fit_transform(corpus)
    for count_vectorizer in count_vectorizers
]

print("n-values of {} produce {} n-grams respectively".format(
    N_VALUES, [counts.shape[1] for counts in counts_array]
))

n-values of [1, 2, 3, 4, 5, 6, 7] produce [186, 5957, 61537, 300332, 819272, 1417028, 1834018] n-grams respectively
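
The `char_wb` analyzer builds n-grams only from text inside word boundaries, padding each word with a space on either side. The sketch below approximates that behaviour in plain Python to show which n-grams get counted. `char_wb_ngrams` is a hypothetical helper, not scikit-learn's implementation, and its handling of words shorter than n is an assumption.

```python
from collections import Counter

def char_wb_ngrams(text, n):
    """Approximate word-boundary character n-grams of `text` (hypothetical helper)."""
    grams = []
    for word in text.lower().split():
        padded = " " + word + " "  # one space of padding on each side
        if len(padded) < n:
            # Assumption: a word too short for a full n-gram is kept whole
            grams.append(padded)
        else:
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# No n-gram spans two words, unlike a plain sliding window over the text
counts = Counter(char_wb_ngrams("The cat sat", 3))
print(counts)
```

Each row of a count matrix in `counts_array` plays the role of one such `Counter`, with one column per distinct n-gram seen in the training data.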


### Calculating Term Weights
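$$
\text{Term weight} = \log\left(\text{term frequency in a particular language} + 1\right)
$$

Here the term frequency is the n-gram count divided by that language's mean n-gram count.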

In [None]:
def compute_weights(counts):
    # Normalize so that every language has the same mean count
    counts = counts/counts.mean(axis=1)

    # Sublinear transform so that no term has an overwhelmingly large weight;
    # adding 1 keeps the result non-negative (log(1) = 0 for absent terms)
    counts = np.log(counts+1)

    return counts

language_weights = [compute_weights(c) for c in counts_array]
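
As a worked example of the weighting, here is the same arithmetic on a tiny made-up count vector, using plain Python lists in place of the sparse matrices. `term_weights` is a hypothetical stand-in for `compute_weights`. Dividing by the mean makes the weights independent of how much text was counted.

```python
import math

def term_weights(counts):
    """Mean-normalize the counts, then apply log(x + 1) (hypothetical helper)."""
    mean = sum(counts) / len(counts)
    return [math.log(c / mean + 1) for c in counts]

# Made-up counts for four n-grams; the mean is 2.5
weights = term_weights([8, 1, 1, 0])

# Twice as much text with the same proportions gives the same weights
doubled = term_weights([16, 2, 2, 0])

print(weights)
print(doubled)
```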

## Scoring test samples
We will treat test samples exactly the same way as our training data. Our prediction will simply be the language with the lowest Euclidean distance from the test sample.
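
The decision rule can be sketched on a toy example before applying it to the real weights. The three-dimensional profiles and the sample below are made-up numbers, and `euclidean` and `nearest_language` are hypothetical helpers.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_language(sample, profiles):
    """Return the language whose profile is closest to `sample`."""
    return min(profiles, key=lambda lang: euclidean(sample, profiles[lang]))

# Made-up term-weight profiles over three n-grams
profiles = {
    "en": [1.2, 0.1, 0.9],
    "de": [0.3, 1.1, 0.8],
}
sample = [1.0, 0.0, 1.0]

print(nearest_language(sample, profiles))
```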

In [None]:
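def predict(text, n):
    # N_VALUES is [1, 2, ..., 7], so the vectorizer and weights for n are at index n - 1
    term_weight = compute_weights(count_vectorizers[n-1].transform([text]))
    # Euclidean distance from the sample to each language's weight vector
    distances = [
        np.linalg.norm(term_weight - language_weight)
        for language_weight in language_weights[n-1]
    ]
    return LANGUAGES[np.argmin(distances)]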

def run_tests(n):
    print("\nRunning test for n = {}".format(n))
    right = 0
    wrong = 0
    # Each line of the test file has the form "<language code>\t<text>"
    with open('europarl.test') as tests:
        for line in tests:
            lang, text = line.split('\t')
            if predict(text, n) == lang:
                right += 1
            else:
                wrong += 1
    print("Error rate: {}%".format(100 * wrong / (right + wrong)))

for n in N_VALUES:
    run_tests(n)


Running test for n = 1
Error rate: 5.371428571428571%

Running test for n = 2
Error rate: 7.719047619047619%

Running test for n = 3
Error rate: 43.31428571428572%

Running test for n = 4
Error rate: 66.94285714285714%

Running test for n = 5
Error rate: 93.72857142857143%

Running test for n = 6
Error rate: 95.23809523809524%

Running test for n = 7


n = 1 gives us the lowest error rate, and the error rate climbs steadily as n grows. This is odd, since n = 1 takes into account only character frequencies. Let's look at what is going on.

In [None]:
count_vectorizers[0].transform(["This is an english sentence"]).todense()

It is clear that test samples are very short, so their vectors will contain mostly zeros. This gives a very low signal-to-noise ratio: the distances between zero and the actual language values in these dimensions do not provide much information useful for classification. Let's see what happens if we instead consider a subspace consisting only of the dimensions that are non-zero in the test sample.

In [None]:
def predict(text, n):
    term_weight = compute_weights(count_vectorizers[n-1].transform([text]))

    # Include only dimensions where the term weight is not zero
    non_zero_dimensions = term_weight != 0
    relevant_dimensions = [
        language_weight[non_zero_dimensions] for language_weight in language_weights[n-1]
    ]
    term_weight = term_weight[non_zero_dimensions]

    distances = [np.linalg.norm(term_weight-language_weight) for language_weight in relevant_dimensions]
    return LANGUAGES[np.argmin(distances)]

for n in N_VALUES:
    run_tests(n)
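
To illustrate why restricting the distance to non-zero dimensions might help, here is a toy sketch with made-up numbers. It shows one way the effect could arise, not a measurement on the real data. The sample matches the "en" profile exactly on the two n-grams it contains, but the many n-grams it lacks pull the full distance towards a language with small weights everywhere. `distance`, `nearest` and the placeholder language "xx" are hypothetical.

```python
import math

def distance(a, b, dims):
    """Euclidean distance between `a` and `b` over the dimensions in `dims`."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))

def nearest(sample, profiles, dims):
    """Return the language whose profile is closest to `sample` over `dims`."""
    return min(profiles, key=lambda lang: distance(sample, profiles[lang], dims))

# Made-up profiles over six n-grams; "xx" is a placeholder second language
profiles = {
    "en": [1.0, 0.2, 1.5, 1.5, 1.5, 1.5],
    "xx": [0.2, 1.0, 0.1, 0.1, 0.1, 0.1],
}
# A short sample that contains only the first two n-grams
sample = [1.0, 0.2, 0.0, 0.0, 0.0, 0.0]

all_dims = range(len(sample))
non_zero_dims = [i for i, w in enumerate(sample) if w != 0]

full_prediction = nearest(sample, profiles, all_dims)
masked_prediction = nearest(sample, profiles, non_zero_dims)
print(full_prediction, masked_prediction)  # prints: xx en
```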