# Keras Implementation of CLD3

Author: Pierre Nugues

Reimplementation of Google's _Compact language detector_ (CLD3) from a high-level description. Source: ``https://github.com/google/cld3``

Still missing:
* generator or train_on_batch

## A Dataset: *Tatoeba*

As a dataset, we use Tatoeba, a database of sentences with language tags. The corpus is available here: https://tatoeba.org/eng/downloads

### Formatting the Dataset

We read the dataset and split it into lines.

In [1]:
dataset_raw = open('sentences.csv', encoding='utf8').read().strip()
dataset_raw = dataset_raw.split('\n')
dataset_raw[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

We split the fields and strip any surrounding whitespace.

In [2]:
dataset_raw = list(map(lambda x: tuple(x.split('\t')), dataset_raw))
dataset_raw = list(map(lambda x: tuple(map(str.strip, x)), dataset_raw))
print(len(dataset_raw), 'texts')
dataset_raw[:3]

8023136 texts


[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]
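
The field splitting above can be made slightly more defensive. The sketch below is an assumption on our part, not the notebook's code: a hypothetical `parse_line` helper splits on at most two tabs, so that a tab inside a sentence would not create a fourth field, and it returns `None` for lines that do not have three fields.

```python
def parse_line(line):
    """Split one Tatoeba line into (id, language, text), or return None."""
    parts = line.split('\t', 2)   # at most two splits: a tab in the text is kept
    if len(parts) != 3:
        return None               # malformed line: fewer than three fields
    return tuple(field.strip() for field in parts)


sample = ['1\tcmn\t我們試試看！', '2\tcmn\t我该去睡觉了。', 'broken line']
records = [r for r in map(parse_line, sample) if r is not None]
print(records)
```

Filtering out the `None` values keeps only well-formed records, which avoids index errors in the later cells that access `record[2]`.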

We pad the texts that are shorter than three characters with spaces, so that each of them yields at least one trigram. If not done, the later steps will crash. We also truncate the texts to ``MAXLEN_TEXT`` characters.

In [3]:
MAXLEN_TEXT = 200
for i in range(len(dataset_raw)):
    record = dataset_raw[i]
    # Right-pad with spaces to at least three characters, then truncate
    text = record[2].ljust(3)[:MAXLEN_TEXT]
    dataset_raw[i] = (record[0], record[1], text)

We shuffle the dataset

In [4]:
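from random import shuffle, seed
import numpy as np
seed(1234)            # seeds Python's random module, which shuffle() uses
np.random.seed(1234)  # seeds NumPy, used later by np.random.shuffle()
shuffle(dataset_raw)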

To shorten training times, we can decimate the dataset and keep one tenth of it.

In [5]:
DECIMATE = False
if DECIMATE:
    dataset_raw = dataset_raw[:int(len(dataset_raw)/10)]
dataset_raw[:3]

[('2608667', 'rus', 'Это самое важное, главное, что ты улыбаешься.'),
 ('5169605', 'ita', 'Smetto di lavorare attorno a mezzanotte.'),
 ('3376481', 'hun', 'Futottunk a macska után.')]

Some texts have no language tag, and others are marked with the cryptic ``\N``.

### Understanding the Dataset

How many languages?

In [6]:
languages = set([x[1] for x in dataset_raw])
len(languages)

347

We count the texts per language

In [7]:
def count_texts(dataset):
    text_counts = {}
    for record in dataset:
        lang = record[1]
        if lang in text_counts:
            text_counts[lang] += 1
        else:
            text_counts[lang] = 1
    return text_counts
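
As a side note, the same counts can be obtained with `collections.Counter` from the standard library. The helper name `count_texts_counter` and the toy records are ours, used only for illustration.

```python
from collections import Counter


def count_texts_counter(dataset):
    """Count the records per language tag (second field of each record)."""
    return Counter(record[1] for record in dataset)


toy = [('1', 'cmn', '我不知道。'),
       ('2', 'eng', 'I do not know.'),
       ('3', 'eng', 'Hello.')]
counts = count_texts_counter(toy)
print(counts.most_common(1))  # [('eng', 2)]
```

A `Counter` behaves like a dictionary, so the sorting cell that follows would work unchanged with it.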

Languages with the most examples

In [8]:
text_counts = count_texts(dataset_raw)
langs = sorted(text_counts.keys(), key=text_counts.get, reverse=True)
[(lang, text_counts[lang]) for lang in langs][:25]

[('eng', 1264754),
 ('ita', 738799),
 ('rus', 732078),
 ('tur', 684619),
 ('epo', 609518),
 ('deu', 488568),
 ('fra', 402078),
 ('por', 347430),
 ('spa', 314873),
 ('hun', 269208),
 ('ber', 226376),
 ('heb', 195329),
 ('jpn', 187684),
 ('ukr', 154466),
 ('kab', 129245),
 ('fin', 106936),
 ('nld', 104337),
 ('pol', 99754),
 ('mkd', 77778),
 ('cmn', 60993),
 ('mar', 54656),
 ('dan', 43995),
 ('lit', 38408),
 ('ces', 37596),
 ('toki', 35017)]

We keep the languages that have more than 3,000 examples in the dataset. Alternatively, setting ``SMALL_LANGUAGE_SET`` to ``True`` restricts the set to French, English, and Swedish.

In [9]:
SMALL_LANGUAGE_SET = False
considered_langs = [lang for lang in langs if text_counts[lang] > 3000]
if SMALL_LANGUAGE_SET:
    considered_langs = ['fra', 'eng', 'swe']
print(considered_langs)
LANG_NBR = len(considered_langs)
LANG_NBR

['eng', 'ita', 'rus', 'tur', 'epo', 'deu', 'fra', 'por', 'spa', 'hun', 'ber', 'heb', 'jpn', 'ukr', 'kab', 'fin', 'nld', 'pol', 'mkd', 'cmn', 'mar', 'dan', 'lit', 'ces', 'toki', 'swe', 'ara', 'lat', 'ell', 'srp', 'ina', 'bul', 'pes', 'ron', 'nds', 'tlh', 'jbo', 'nob', 'tat', 'tgl', 'ind', 'bel', 'hin', 'isl', 'vie', 'lfn', 'uig', 'bre', 'tuk', 'kor', 'ile', 'eus', 'cat', 'yue', 'oci', 'hrv', 'ido', 'aze', 'ben', 'glg', 'wuu', 'mhr', 'slk', 'afr', 'avk', 'cor', 'run', 'gos', 'vol', 'est']


70

We extract the texts in these languages. This will form our dataset.

In [10]:
dataset = list(filter(lambda x: x[1] in considered_langs, dataset_raw))
print(len(dataset))
dataset[:5]

7934544


[('2608667', 'rus', 'Это самое важное, главное, что ты улыбаешься.'),
 ('5169605', 'ita', 'Smetto di lavorare attorno a mezzanotte.'),
 ('3376481', 'hun', 'Futottunk a macska után.'),
 ('240215',
  'eng',
  'Please accept our condolences on the death of your father.'),
 ('6482091', 'rus', 'Ты единственный, кто умеет это делать.')]

## Feature Extraction

### Functions to Count Character Ngrams

We use hash codes to convert the ngrams into integer indices.

The number of codes we use for each ngram order:

In [11]:
MAX_CHARS = 4096
MAX_BIGRAMS = 8192
MAX_TRIGRAMS = 8192

We normalize the counts as in CLD3

In [12]:
def normalize(d):
    sum_chars = sum(d.values())
    d = {k:v/sum_chars for k, v in d.items()}
    return d

We compute the hash code modulo the number of codes and add one, so that the value 0 stays reserved for the padding symbol in the subsequent matrices.

By default, we lowercase the characters and sort the ngrams by decreasing frequency.

In [13]:
from collections import Counter

def hash_chars(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    hash_codes = map(lambda x: hash(x) % MAX_CHARS + 1, string)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

In [14]:
def hash_bigrams(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    bigrams = [string[i:i + 2] for i in range(len(string) - 1)]
    hash_codes = map(lambda x: hash(x) % MAX_BIGRAMS + 1, bigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

In [15]:
def hash_trigrams(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    trigrams = [string[i:i + 3] for i in range(len(string) - 2)]
    hash_codes = map(lambda x: hash(x) % MAX_TRIGRAMS + 1, trigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

Google's example from the CLD3 presentation text (``https://github.com/google/cld3``):

In [16]:
print('Chars:', hash_chars('Banana'))
print('Bigrams:', hash_bigrams('Banana'))
print('Trigrams:', hash_trigrams('Banana'))

Chars: ((3232, 2755, 1698), (0.5, 0.3333333333333333, 0.16666666666666666))
Bigrams: ((5340, 70, 4333), (0.4, 0.4, 0.2))
Trigrams: ((1923, 6678, 2369), (0.5, 0.25, 0.25))
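
One caveat about the codes printed here: Python's built-in `hash()` is salted per interpreter process for strings unless `PYTHONHASHSEED` is set, so the codes differ from one session to the next and a saved model could not be reused with features computed in another session. The sketch below shows one possible process-independent replacement based on `zlib.crc32`. The names `stable_hash` and `stable_hash_ngrams` are hypothetical, and switching to them would change every code, and thus the results, shown in this notebook.

```python
import zlib
from collections import Counter

MAX_TRIGRAMS = 8192


def stable_hash(ngram, buckets):
    """Map an ngram to a code in 1..buckets, identically in every session."""
    return zlib.crc32(ngram.encode('utf-8')) % buckets + 1


def stable_hash_ngrams(string, n, buckets, lc=True):
    """Return (codes, relative frequencies), most frequent codes first."""
    if lc:
        string = string.lower()
    ngrams = [string[i:i + n] for i in range(len(string) - n + 1)]
    counts = Counter(stable_hash(g, buckets) for g in ngrams)
    total = sum(counts.values())
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    codes = tuple(code for code, _ in items)
    freqs = tuple(count / total for _, count in items)
    return codes, freqs


print('Trigrams:', stable_hash_ngrams('Banana', 3, MAX_TRIGRAMS))
```

Setting the `PYTHONHASHSEED` environment variable to a fixed value before starting Python is another way to make the built-in `hash()` repeatable.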


Another sentence

In [17]:
print('Chars:', hash_chars("Let's try something."))
print('Bigrams:', hash_bigrams("Let's try something."))
print('Trigrams:', hash_trigrams("Let's try something."))

Chars: ((2492, 402, 3662, 339, 2714, 3862, 2468, 193, 282, 119, 3165, 357, 2755, 2880, 1394), (0.15, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
Bigrams: ((7661, 3750, 2266, 1868, 3861, 3282, 7259, 2068, 7909, 3666, 6166, 5116, 2213, 464, 2423, 6111, 2403, 6853), (0.10526315789473684, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842))
Trigrams: ((7253, 7772, 6181, 4242, 1676, 7392, 1864, 3388, 2944, 3582, 4943, 1585, 3573, 8115, 7360, 917, 6702, 6306), (0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.055555

### Building the $\mathbf{X}$ lists

We compute the character, bigram, and trigram counts of the texts and we create $\mathbf{X}$ lists

In [18]:
from tqdm import tqdm
X_list_inx_chars = []
X_list_freq_chars = []
X_list_inx_bigrams = []
X_list_freq_bigrams = []
X_list_inx_trigrams = []
X_list_freq_trigrams = []

for i in tqdm(range(len(dataset))):
    k, v = hash_chars(dataset[i][-1])
    X_list_inx_chars.append(k)
    X_list_freq_chars.append(v)
    
    k, v = hash_bigrams(dataset[i][-1])
    X_list_inx_bigrams.append(k)
    X_list_freq_bigrams.append(v)
    
    k, v = hash_trigrams(dataset[i][-1])
    X_list_inx_trigrams.append(k)
    X_list_freq_trigrams.append(v)

100%|██████████| 7934544/7934544 [14:30<00:00, 9114.00it/s] 


In [19]:
print(X_list_inx_chars[:2])
print(X_list_freq_chars[:2])
print(X_list_inx_bigrams[:2])
print(X_list_freq_bigrams[:2])
print(X_list_inx_trigrams[:2])
print(X_list_freq_trigrams[:2])

[(339, 1088, 2493, 1035, 4025, 66, 580, 1594, 2751, 3053, 83, 1118, 1077, 541, 2529, 2435, 2504, 406, 1775, 2702, 883, 1394), (2492, 282, 339, 3232, 402, 2468, 119, 2755, 3737, 3662, 2701, 357, 2714, 1097, 1394)]
[(0.13333333333333333, 0.1111111111111111, 0.08888888888888889, 0.08888888888888889, 0.06666666666666667, 0.044444444444444446, 0.044444444444444446, 0.044444444444444446, 0.044444444444444446, 0.044444444444444446, 0.044444444444444446, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223, 0.022222222222222223), (0.15, 0.125, 0.125, 0.125, 0.1, 0.075, 0.05, 0.05, 0.05, 0.025, 0.025, 0.025, 0.025, 0.025, 0.025)]
[(5189, 7498, 2475, 5312, 98, 4930, 5029, 572, 6330, 4715, 6247, 7350, 5401, 766, 5515, 6410, 30, 2586, 428, 2699, 7046, 2016, 3000, 4955, 409, 3236, 4403, 7311, 3173, 721, 1487, 296, 5155, 6075, 6797, 138

### Unique ngrams

We now extract all the unique ngrams (hash codes). This part is not needed to train the model and can be skipped. We use it to check whether we have enough hash codes and to choose the lengths of the feature vectors.

In [20]:
unique_chars = set()
for x_list_inx_chars in X_list_inx_chars:
    unique_chars.update(set(x_list_inx_chars))

unique_bigrams = set()
for x_list_inx_bigrams in X_list_inx_bigrams:
    unique_bigrams.update(set(x_list_inx_bigrams))

unique_trigrams = set()
for x_list_inx_trigrams in X_list_inx_trigrams:
    unique_trigrams.update(set(x_list_inx_trigrams))

We check the hash coding capacity. Have we used all the hash codes? Will there be collisions?

In [21]:
unique_char_cnt = len(unique_chars)
unique_char_cnt

3520

In [22]:
unique_bigram_cnt = len(unique_bigrams)
unique_bigram_cnt

8192

In [23]:
unique_trigram_cnt = len(unique_trigrams)
unique_trigram_cnt

8192
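
The three counts can be read as occupancy ratios of the hash tables. The short sketch below plugs in the values reported above. It is only a reading aid: a table where every code is used means that distinct ngrams must share codes as soon as the corpus contains more ngrams than codes, and a partially used table does not rule out collisions either.

```python
capacities = {'chars': 4096, 'bigrams': 8192, 'trigrams': 8192}
observed = {'chars': 3520, 'bigrams': 8192, 'trigrams': 8192}  # counts reported above

occupancy = {name: observed[name] / capacities[name] for name in capacities}
for name, ratio in occupancy.items():
    status = 'saturated' if ratio == 1.0 else 'some codes unused'
    print(f'{name}: {ratio:.1%} of the codes used ({status})')
```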

### Length of the index vectors

How will we align the $\mathbf{X}$ matrices? 

We compute the maximal length of the feature lists, but we do not use it directly. Instead, we use a shorter length: roughly max(median, mean) plus two standard deviations.

#### Characters

What is the max length of the feature vectors for the characters?

In [24]:
char_vect_lengths = [len(x_list_inx_chars) for x_list_inx_chars in X_list_inx_chars]
max(char_vect_lengths)

121

In [25]:
import statistics
print(statistics.mean(char_vect_lengths))
print(statistics.stdev(char_vect_lengths))
statistics.median(char_vect_lengths)

16.24075145339165
3.7880676669943725


16.0

#### Bigrams

What is the max length of the feature vectors for the bigrams?

In [26]:
bigram_vect_lengths = [len(x_list_inx_bigrams) for x_list_inx_bigrams in X_list_inx_bigrams]
max(bigram_vect_lengths)

178

In [27]:
print(statistics.mean(bigram_vect_lengths))
print(statistics.stdev(bigram_vect_lengths))
statistics.median(bigram_vect_lengths)

29.839839945433535
13.815408024391429


28.0

#### Trigrams

What is the max length of the feature vectors for the trigrams?

In [28]:
trigram_vect_lengths = [len(x_list_inx_trigrams) for x_list_inx_trigrams in X_list_inx_trigrams]
max(trigram_vect_lengths)

193

In [29]:
print(statistics.mean(trigram_vect_lengths))
print(statistics.stdev(trigram_vect_lengths))
statistics.median(trigram_vect_lengths)

32.211026871865606
18.252929704574356


29.0

Here are the lengths we use: max(median, mean) + 2 × stdev, plus a small margin.

In [30]:
MAXLEN_CHARS = 30
MAXLEN_BIGRAMS = 60
MAXLEN_TRIGRAMS = 70
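
The heuristic can be written as a small function. The sketch below uses a hypothetical helper `suggest_maxlen` and then applies the same formula to the summary statistics reported above, which gives about 23.8, 57.5, and 68.7 before the margin is added.

```python
import statistics


def suggest_maxlen(lengths, n_std=2):
    """max(median, mean) plus n_std sample standard deviations."""
    center = max(statistics.median(lengths), statistics.mean(lengths))
    return center + n_std * statistics.stdev(lengths)


# (median, mean, stdev) as reported in the cells above
reported = {
    'chars': (16.0, 16.24075145339165, 3.7880676669943725),
    'bigrams': (28.0, 29.839839945433535, 13.815408024391429),
    'trigrams': (29.0, 32.211026871865606, 18.252929704574356),
}
suggested = {name: max(median, mean) + 2 * stdev
             for name, (median, mean, stdev) in reported.items()}
for name, value in suggested.items():
    print(name, round(value, 1))
```

On the full dataset, `suggest_maxlen(char_vect_lengths)` would compute the first of these values directly from the list of lengths.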

## Building $\mathbf{X}$ and $\mathbf{y}$

We can now build the matrices by converting the $\mathbf{X}$ lists into arrays and padding them.

### The $\mathbf{X}$ matrix

#### Characters

We pad the character sequences

In [31]:
from keras.preprocessing.sequence import pad_sequences
X_inx_chars = pad_sequences(X_list_inx_chars, padding='post', maxlen=MAXLEN_CHARS)
X_inx_chars[:2]

Using TensorFlow backend.


array([[ 339, 1088, 2493, 1035, 4025,   66,  580, 1594, 2751, 3053,   83,
        1118, 1077,  541, 2529, 2435, 2504,  406, 1775, 2702,  883, 1394,
           0,    0,    0,    0,    0,    0,    0,    0],
       [2492,  282,  339, 3232,  402, 2468,  119, 2755, 3737, 3662, 2701,
         357, 2714, 1097, 1394,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]], dtype=int32)
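
A detail worth keeping in mind: besides `padding`, `pad_sequences` has a `truncating` argument whose default is `'pre'`, so a sequence longer than `maxlen` loses its first items. Since the ngrams are sorted by decreasing frequency, these are the most frequent ones. The pure-Python sketch below, with a hypothetical `pad_post` helper, mimics both behaviors on a toy list. Passing `truncating='post'` to `pad_sequences` would keep the most frequent ngrams instead, although this is not what the cells in this notebook do.

```python
def pad_post(seq, maxlen, truncating='pre', value=0):
    """Pad at the end with `value`; cut sequences longer than `maxlen`."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops items from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))


codes = [11, 22, 33, 44, 55]                   # sorted by decreasing frequency
print(pad_post(codes, 3, truncating='pre'))    # [33, 44, 55]
print(pad_post(codes, 3, truncating='post'))   # [11, 22, 33]
print(pad_post(codes, 7))                      # [11, 22, 33, 44, 55, 0, 0]
```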

The character frequencies

In [32]:
X_freq_chars = pad_sequences(X_list_freq_chars, padding='post', dtype='float32', maxlen=MAXLEN_CHARS)
X_freq_chars[:2]

array([[0.13333334, 0.11111111, 0.08888889, 0.08888889, 0.06666667,
        0.04444445, 0.04444445, 0.04444445, 0.04444445, 0.04444445,
        0.04444445, 0.02222222, 0.02222222, 0.02222222, 0.02222222,
        0.02222222, 0.02222222, 0.02222222, 0.02222222, 0.02222222,
        0.02222222, 0.02222222, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.15      , 0.125     , 0.125     , 0.125     , 0.1       ,
        0.075     , 0.05      , 0.05      , 0.05      , 0.025     ,
        0.025     , 0.025     , 0.025     , 0.025     , 0.025     ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]],
      dtype=float32)

#### Bigrams

We do the same for the bigrams.

In [33]:
X_inx_bigrams = pad_sequences(X_list_inx_bigrams, padding='post', maxlen=MAXLEN_BIGRAMS)
X_inx_bigrams[:2]

array([[5189, 7498, 2475, 5312,   98, 4930, 5029,  572, 6330, 4715, 6247,
        7350, 5401,  766, 5515, 6410,   30, 2586,  428, 2699, 7046, 2016,
        3000, 4955,  409, 3236, 4403, 7311, 3173,  721, 1487,  296, 5155,
        6075, 6797, 1389,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0],
       [2687, 2213, 7622, 1212,  582, 5667, 5795, 5070, 7661, 7284,  696,
        7680, 3088, 2276, 2395, 4899, 6065, 5739, 3878, 3337, 2631, 7118,
        5853,  447, 5200, 4411, 3669, 5340, 5358, 6273,  999,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0]], dtype=int32)

In [34]:
X_freq_bigrams = pad_sequences(X_list_freq_bigrams, padding='post', dtype='float32', maxlen=MAXLEN_BIGRAMS)
X_freq_bigrams[:2]

array([[0.09090909, 0.04545455, 0.04545455, 0.04545455, 0.04545455,
        0.04545455, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.02272727, 0.02272727, 0.02272727, 0.02272727,
        0.02272727, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.07692308, 0.05128205, 0.05128205, 0.05128205, 0.05128205,
        0.05128205, 0.05128205, 0.02564103, 0.02564103, 0.02564103,
        0.02564103, 0.02564103, 0.02564103, 0.0

#### Trigrams

And the trigrams

In [35]:
X_inx_trigrams = pad_sequences(X_list_inx_trigrams, padding='post', maxlen=MAXLEN_TRIGRAMS)
X_inx_trigrams[:2]

array([[6946, 7267, 1852, 4930, 1106, 1185, 8091, 1708, 1501, 1228,  934,
        2108, 3213, 4645, 5504,  400, 1490, 2586, 2297, 4107, 3854, 6791,
        2735,  232, 8121, 5311, 7596, 1803, 4142, 2270, 4419, 1329,  471,
        1144, 6625, 7711, 4491, 1306, 5879,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [5984, 3642, 3573, 6923, 5077, 1622, 3896, 4771, 7773, 2128, 7057,
        1533,    5, 7126, 6789, 6435, 1576, 4961, 6089, 1141, 4840, 6932,
        5571, 7123, 7501, 3866, 7884, 6305, 6940, 1739,  713, 2811, 1564,
        3144, 2455,  752, 1711,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)

In [36]:
X_freq_trigrams = pad_sequences(X_list_freq_trigrams, padding='post', dtype='float32', maxlen=MAXLEN_TRIGRAMS)
X_freq_trigrams[:2]

array([[0.04651163, 0.04651163, 0.04651163, 0.04651163, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.02325581,
        0.02325581, 0.02325581, 0.02325581, 0.02325581, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.05263158, 0.02631579, 0.02631579, 0.0

### The $\mathbf{y}$ vector

We extract the responses

In [37]:
y_list = [dataset[i][1] for i in range(len(dataset))]
y_list[:10]

['rus', 'ita', 'hun', 'eng', 'rus', 'eng', 'mkd', 'dan', 'rus', 'por']

We create an index of them.

In [38]:
y_set = set(y_list)
inx2lang = dict(enumerate(y_set))
lang2inx = {v: k for k, v in inx2lang.items()}
lang2inx

{'toki': 0,
 'ile': 1,
 'bel': 2,
 'fra': 3,
 'run': 4,
 'ces': 5,
 'tlh': 6,
 'dan': 7,
 'srp': 8,
 'tur': 9,
 'bul': 10,
 'kab': 11,
 'pes': 12,
 'hin': 13,
 'ido': 14,
 'yue': 15,
 'lfn': 16,
 'lat': 17,
 'ina': 18,
 'vol': 19,
 'isl': 20,
 'lit': 21,
 'pol': 22,
 'nds': 23,
 'uig': 24,
 'glg': 25,
 'por': 26,
 'deu': 27,
 'cor': 28,
 'tgl': 29,
 'rus': 30,
 'mar': 31,
 'ita': 32,
 'tuk': 33,
 'ell': 34,
 'hun': 35,
 'spa': 36,
 'mkd': 37,
 'swe': 38,
 'ben': 39,
 'mhr': 40,
 'ara': 41,
 'nob': 42,
 'gos': 43,
 'nld': 44,
 'hrv': 45,
 'epo': 46,
 'cmn': 47,
 'ber': 48,
 'ron': 49,
 'oci': 50,
 'ind': 51,
 'aze': 52,
 'avk': 53,
 'afr': 54,
 'tat': 55,
 'slk': 56,
 'eng': 57,
 'est': 58,
 'wuu': 59,
 'cat': 60,
 'ukr': 61,
 'vie': 62,
 'jbo': 63,
 'fin': 64,
 'bre': 65,
 'eus': 66,
 'kor': 67,
 'jpn': 68,
 'heb': 69}
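
The iteration order of a set of strings is not guaranteed to be the same from one Python session to the next, so the index above may change between runs. If the model is to be saved and reused, a deterministic index is safer. The sketch below, with a hypothetical `build_label_index` helper, sorts the labels first. It would yield different indices from those shown in this notebook.

```python
def build_label_index(labels):
    """Build index-to-label and label-to-index maps in alphabetical order."""
    inx2lang = dict(enumerate(sorted(set(labels))))
    lang2inx = {lang: inx for inx, lang in inx2lang.items()}
    return inx2lang, lang2inx


inx2lang_sorted, lang2inx_sorted = build_label_index(
    ['rus', 'ita', 'hun', 'eng', 'rus'])
print(lang2inx_sorted)  # {'eng': 0, 'hun': 1, 'ita': 2, 'rus': 3}
```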

In [39]:
y_list_num = list(map(lambda x: lang2inx[x], y_list))
y_list_num[:3]

[30, 32, 35]

We encode them as one-hot vectors

In [40]:
from keras.utils import to_categorical
y = to_categorical(y_list_num)
y[:3]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.]], dtype=float32)

## Training and Validation Sets
We create a training set and a validation set.

### We shuffle the indices
We shuffle them again. This is not necessary, as we already shuffled the dataset

In [41]:
indices = list(range(X_inx_chars.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X_inx_chars = X_inx_chars[indices, :]
X_freq_chars = X_freq_chars[indices, :]
X_inx_bigrams = X_inx_bigrams[indices, :]
X_freq_bigrams = X_freq_bigrams[indices, :]
X_inx_trigrams = X_inx_trigrams[indices, :]
X_freq_trigrams = X_freq_trigrams[indices, :]
y = y[indices, :]

[1794002, 2745398, 5006546, 365595, 5105077, 6536124, 2713909, 1837034, 6722572, 2354267]


### We split the dataset

Number of training examples

In [42]:
training_examples = int(X_inx_chars.shape[0] * 0.8)
training_examples

6347635

The training set

In [43]:
X_train_inx_chars = X_inx_chars[:training_examples, :]
X_train_freq_chars = X_freq_chars[:training_examples, :]

X_train_inx_bigrams = X_inx_bigrams[:training_examples, :]
X_train_freq_bigrams = X_freq_bigrams[:training_examples, :]

X_train_inx_trigrams = X_inx_trigrams[:training_examples, :]
X_train_freq_trigrams = X_freq_trigrams[:training_examples, :]

y_train = y[:training_examples]

In [44]:
print(y_train[0])
print(X_train_inx_chars[0])
X_train_freq_chars[0]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 282  357 1698  402  339  193 2755  763 2714 3806 2468 3737 2492  119
 1394    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


array([0.125     , 0.125     , 0.08333334, 0.08333334, 0.08333334,
       0.08333334, 0.08333334, 0.04166667, 0.04166667, 0.04166667,
       0.04166667, 0.04166667, 0.04166667, 0.04166667, 0.04166667,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ],
      dtype=float32)

The validation set

In [45]:
X_val_inx_chars = X_inx_chars[training_examples:, :]
X_val_freq_chars = X_freq_chars[training_examples:, :]

X_val_inx_bigrams = X_inx_bigrams[training_examples:, :]
X_val_freq_bigrams = X_freq_bigrams[training_examples:, :]

X_val_inx_trigrams = X_inx_trigrams[training_examples:, :]
X_val_freq_trigrams = X_freq_trigrams[training_examples:, :]

y_val = y[training_examples:]

## Building the Keras Architecture

We now create an architecture resembling CLD3

In [46]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import layers, optimizers, backend

# Char frequency input
char_freq_input = Input(shape=(MAXLEN_CHARS,), dtype='float32', name='char_freq_input')

# Char index input
char_inx_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_inx_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_inx_input)

# The weighted mean
flattened_chars = backend.dot(char_freq_input, embedded_chars)[0]

# Bigram freq input
bigram_freq_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='float32', name='bigram_freq_input')

# Bigram index input
bigram_inx_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_inx_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_inx_input)

# The weighted mean
flattened_bigrams = backend.dot(bigram_freq_input, embedded_bigrams)[0]

# Trigram freq input
trigram_freq_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='float32', name='trigram_freq_input')

# Trigram index input
trigram_inx_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_inx_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_inx_input)

# The weighted mean
flattened_trigrams = backend.dot(trigram_freq_input, embedded_trigrams)[0]

flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)

dense_layer = layers.Dense(512, activation='relu')(flattened)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)

model = Model([char_inx_input, char_freq_input, 
               bigram_inx_input, bigram_freq_input, 
               trigram_inx_input, trigram_freq_input], lang_output)
model.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_inx_input (InputLayer)     [(None, 30)]         0                                            
__________________________________________________________________________________________________
bigram_inx_input (InputLayer)   [(None, 60)]         0                                            
__________________________________________________________________________________________________
trigram_inx_input (InputLayer)  [(None, 70)]         0                                            
__________________________________________________________________________________________________
char_freq_input (InputLayer)    [(None, 30)]         0                                            
______________________________________________________________________________________________
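
The comment `# The weighted mean` describes, for each text, the sum of its ngram embeddings weighted by their relative frequencies. Padded positions have a frequency of 0 and do not contribute. The pure-Python sketch below illustrates this computation for a single text with made-up two-dimensional embeddings. The helper `weighted_mean` is ours and is not part of the model.

One point worth verifying in the cell above: to our understanding, `backend.dot` applied to a 2D and a 3D tensor follows Theano-style semantics and returns a tensor of shape `(batch, batch, 64)`, so that the `[0]` index may select the frequencies of the first example of the batch for all the rows. A per-example product, for instance with `layers.Dot(axes=1)`, is a possible alternative that we have not tested here.

```python
def weighted_mean(freqs, embeddings):
    """Sum the embedding vectors of one text, weighted by ngram frequency."""
    dim = len(embeddings[0])
    return [sum(f * vec[d] for f, vec in zip(freqs, embeddings))
            for d in range(dim)]


freqs = [0.5, 0.25, 0.25, 0.0]            # the last position is padding
embeddings = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [9.0, 9.0]]
print(weighted_mean(freqs, embeddings))   # [1.5, 1.5]
```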

## Fitting the Model

In [48]:
history = model.fit(
    [X_train_inx_chars, X_train_freq_chars, 
     X_train_inx_bigrams, X_train_freq_bigrams, 
     X_train_inx_trigrams, X_train_freq_trigrams], 
    y_train, 
    epochs=3,
    validation_data=(
        [X_val_inx_chars, X_val_freq_chars, 
         X_val_inx_bigrams, X_val_freq_bigrams,
         X_val_inx_trigrams, X_val_freq_trigrams], 
        y_val))

Train on 6347635 samples, validate on 1586909 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Predicting and Evaluating

We evaluate the model.

In [49]:
scores = model.evaluate([X_val_inx_chars, X_val_freq_chars, 
                         X_val_inx_bigrams, X_val_freq_bigrams,
                         X_val_inx_trigrams, X_val_freq_trigrams], y_val)
print('Scores:', scores)
# The loss is a cross-entropy value, not a percentage; it is scaled by 100 for display only
for name, score in zip(model.metrics_names, scores):
    print("%s: %.2f%%" % (name, score * 100))

Scores: [0.0777907051800631, 0.9778311]
loss: 7.78%
acc: 97.78%

We predict the whole validation set and we get the probabilities

In [50]:
y_predicted = model.predict([X_val_inx_chars, X_val_freq_chars, 
                             X_val_inx_bigrams, X_val_freq_bigrams,
                             X_val_inx_trigrams, X_val_freq_trigrams])
print(y_predicted[:5])
print(y_val[:5])

[[7.04574880e-14 3.90093231e-13 1.09105097e-12 1.66777966e-10
  1.89596563e-17 3.15424228e-13 1.68895632e-13 1.35844103e-12
  4.96976219e-14 1.96622052e-09 4.14903707e-13 6.19145706e-02
  7.00953988e-19 1.00578479e-16 2.01899409e-15 1.24947997e-19
  2.00285308e-16 3.31493527e-14 4.17139793e-13 7.11285510e-22
  2.82210753e-15 3.97695270e-17 1.92978324e-13 3.36266892e-15
  1.37285040e-12 2.73302206e-19 5.79867154e-10 8.77869652e-11
  3.67478807e-15 4.53468182e-13 8.72249686e-08 3.85859407e-18
  2.75173967e-10 1.03844492e-15 2.86783560e-18 2.00533745e-09
  2.14996666e-11 8.21473011e-12 2.40458890e-12 2.46160194e-20
  1.04294897e-11 2.22734910e-16 7.62901027e-12 9.55917833e-20
  6.91279309e-14 1.78477545e-16 2.98484189e-11 6.98477980e-16
  9.38085258e-01 2.44814185e-15 1.46439675e-16 1.35915944e-15
  1.42681165e-16 9.80109785e-15 1.28481627e-21 5.52593301e-12
  2.91562399e-16 2.07551407e-10 7.34579811e-18 1.85386941e-17
  1.19242561e-12 9.24781862e-10 3.01527609e-16 2.15328650e-15
  2.1062

### Names of the predicted and true classes

Prediction

In [51]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['ber', 'toki', 'epo', 'ita', 'spa', 'fin', 'ita', 'rus', 'por', 'epo']

Truth

In [52]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['ber', 'toki', 'epo', 'ita', 'spa', 'fin', 'ita', 'rus', 'por', 'epo']

### The detailed F1s

We now compute the precision and recall by language as well as the micro and macro F1.

In [53]:
lang_names = sorted(list(lang2inx.keys()), key=lambda x: lang2inx[x])

In [54]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print('Micro F1:', f1_score(y_val_symb, y_pred, average='micro'))
print('Macro F1:', f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

        toki       1.00      0.98      0.99      7064
         ile       0.77      0.79      0.78      1330
         bel       0.96      0.93      0.95      2529
         fra       0.99      0.99      0.99     80048
         run       0.95      0.88      0.91       695
         ces       0.96      0.96      0.96      7408
         tlh       0.98      0.99      0.98      3379
         dan       0.89      0.92      0.90      8820
         srp       0.78      0.93      0.85      6063
         tur       1.00      1.00      1.00    136529
         bul       0.96      0.84      0.89      4723
         kab       0.83      0.73      0.78     25803
         pes       0.99      0.99      0.99      4287
         hin       0.99      0.96      0.97      2514
         ido       0.82      0.73      0.77      1005
         yue       0.89      0.94      0.91      1082
         lfn       0.83      0.82      0.82      1708
         lat       0.94    
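
To make the difference between the two averages concrete, here is a minimal pure-Python sketch of the per-class, micro, and macro F1 on a toy example. The function `f1_scores` is ours and only illustrates the definitions; the figures in this notebook come from scikit-learn. In single-label classification, the micro F1 equals the accuracy, while the macro F1 gives every language the same weight, whatever its number of texts.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1, micro F1, macro F1) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class[c] = 2 * tp[c] / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    total_tp = sum(tp.values())
    micro_denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    return per_class, micro, macro


y_true_toy = ['eng', 'eng', 'fra', 'swe']
y_pred_toy = ['eng', 'fra', 'fra', 'swe']
per_class, micro, macro = f1_scores(y_true_toy, y_pred_toy)
print(per_class)
print('Micro F1:', micro)
print('Macro F1:', macro)
```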

### Confusion Matrix

And finally the confusion matrix

In [55]:
from sklearn.metrics import confusion_matrix
print(lang2inx)
cf = confusion_matrix(y_val_symb, y_pred)
print(cf)

{'toki': 0, 'ile': 1, 'bel': 2, 'fra': 3, 'run': 4, 'ces': 5, 'tlh': 6, 'dan': 7, 'srp': 8, 'tur': 9, 'bul': 10, 'kab': 11, 'pes': 12, 'hin': 13, 'ido': 14, 'yue': 15, 'lfn': 16, 'lat': 17, 'ina': 18, 'vol': 19, 'isl': 20, 'lit': 21, 'pol': 22, 'nds': 23, 'uig': 24, 'glg': 25, 'por': 26, 'deu': 27, 'cor': 28, 'tgl': 29, 'rus': 30, 'mar': 31, 'ita': 32, 'tuk': 33, 'ell': 34, 'hun': 35, 'spa': 36, 'mkd': 37, 'swe': 38, 'ben': 39, 'mhr': 40, 'ara': 41, 'nob': 42, 'gos': 43, 'nld': 44, 'hrv': 45, 'epo': 46, 'cmn': 47, 'ber': 48, 'ron': 49, 'oci': 50, 'ind': 51, 'aze': 52, 'avk': 53, 'afr': 54, 'tat': 55, 'slk': 56, 'eng': 57, 'est': 58, 'wuu': 59, 'cat': 60, 'ukr': 61, 'vie': 62, 'jbo': 63, 'fin': 64, 'bre': 65, 'eus': 66, 'kor': 67, 'jpn': 68, 'heb': 69}
[[ 6936     5     0 ...     0     0     0]
 [    0  1045     0 ...     0     0     0]
 [    0     0  2362 ...     0     0     0]
 ...
 [    0     0     0 ...  1333     5     0]
 [    0     0     0 ...     1 37692     0]
 [    0     0     
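
To read such a matrix: scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, so `cf[i, j]` counts the validation sentences of language `i` that were predicted as language `j`. The sketch below rebuilds this layout with NumPy alone on made-up labels (the toy arrays and the names `cf_toy`, `y_true`, and `y_hat` are illustrative and not taken from the notebook):

```python
import numpy as np

# Toy labels for three classes (illustrative values, not from Tatoeba)
y_true = np.array([0, 0, 1, 1, 1, 2])
y_hat = np.array([0, 1, 1, 1, 2, 2])

n_classes = 3
cf_toy = np.zeros((n_classes, n_classes), dtype=int)
# Row = true class, column = predicted class
np.add.at(cf_toy, (y_true, y_hat), 1)
print(cf_toy)

# Recall per class: diagonal divided by the row sums
recall = np.diag(cf_toy) / cf_toy.sum(axis=1)
# Precision per class: diagonal divided by the column sums
precision = np.diag(cf_toy) / cf_toy.sum(axis=0)
print('Recall:', recall)
print('Precision:', precision)
```

The per-class precision and recall of the classification report can be derived from the matrix in this way.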

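For a few languages, we print the row of the confusion matrix and the language most frequently predicted in error, together with its share of the row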

In [56]:
languages = ['fra', 'eng', 'swe']

In [57]:
for language in languages:
    if language not in lang2inx:
        continue
    row = cf[lang2inx[language]]
    print('Language:', language)
    print('Confusions:', row)
    # The largest entry is assumed to be the diagonal (correct predictions),
    # so the second largest is the most frequent confusion
    print('Most confused:',
          inx2lang[np.argsort(row)[-2]],
          np.sort(row)[-2] / np.sum(row))
    print('====')

Language: fra
Confusions: [    0    31     0 79236     0     5     2     5     1    11     0     5
     0     0     6     0     4    28    64     0     0     0     1     1
     0    12   108    34     0     0     1     0    55     0     0    21
    76     0     4     0     0     1     1     2     7     1    53     0
     7     2    53     1     0     2     0     0     0   151     0     0
    24     0     1     4     5    15     7     0     0     0]
Most confused: eng 0.0018863681790925444
====
Language: eng
Confusions: [     2      4      0     52      4      3      8     21     10     46
      0     11      0      0      4      0      3     31     11      1
      0      3      8      6      0      2     51    104      1      7
      2      1     54      1      0     51     47      0      7      0
      0      1      3     10     61      0     27      0     19      0
      6      3      0      1      2      0      0 252745      2      0
      5      0      1      1     12     10      8
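
The loop above takes the second-largest entry of each row, which gives the most frequent confusion only when the diagonal (the correct predictions) is the largest entry. A possible variant, sketched here on a made-up matrix, excludes the diagonal explicitly. The helper name `most_confused` and the toy values are assumptions for illustration:

```python
import numpy as np

def most_confused(cf, inx, inx2lang):
    """Return (language, rate) of the most frequent wrong prediction for class inx."""
    row = cf[inx].astype(float)
    total = row.sum()
    row[inx] = -1  # exclude the correct predictions on the diagonal
    wrong = int(np.argmax(row))
    return inx2lang[wrong], cf[inx, wrong] / total

# Toy matrix where class 1 is more often misclassified than recognized
cf_toy = np.array([[90, 7, 3],
                   [4, 40, 56],
                   [1, 2, 97]])
inx2lang_toy = {0: 'fra', 1: 'eng', 2: 'swe'}
for i in range(3):
    print(inx2lang_toy[i], most_confused(cf_toy, i, inx2lang_toy))
```

For the second row, the second-largest entry would be the diagonal itself, while the masked version returns the class that actually absorbs the errors.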

## Predicting Texts

Now let us use the model to predict some short texts

In [58]:
def extract_features(sentence):
    char_items = hash_chars(sentence)
    bigram_items = hash_bigrams(sentence)
    trigram_items = hash_trigrams(sentence)
    return [pad_sequences([char_items[0]], padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([char_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([bigram_items[0]], padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([bigram_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([trigram_items[0]], padding='post', maxlen=MAXLEN_TRIGRAMS),
            pad_sequences([trigram_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_TRIGRAMS)]

In [59]:
sentence = "Banana"
extract_features(sentence)

[array([[3232, 2755, 1698,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]], dtype=int32),
 array([[0.5       , 0.33333334, 0.16666667, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ]],
       dtype=float32),
 array([[5340,   70, 4333,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,
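
The helpers `hash_chars`, `hash_bigrams`, and `hash_trigrams` are defined earlier in the notebook. As a reminder of the kind of pairs they return (hashed indices and relative frequencies), here is a rough sketch. The function name `hash_ngrams_sketch`, the use of `zlib.crc32`, the lowercasing, and the bucket count are all assumptions made for illustration, so the indices will not match the values printed above:

```python
import zlib
from collections import Counter

def hash_ngrams_sketch(text, n, n_buckets):
    """Return (indices, relative frequencies) of the n-grams of text."""
    text = text.lower()  # assumption: case folding
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams).most_common()  # most frequent n-grams first
    total = sum(c for _, c in counts)
    # Index 0 is reserved for padding, hence the + 1
    indices = [zlib.crc32(g.encode('utf8')) % n_buckets + 1 for g, _ in counts]
    freqs = [c / total for _, c in counts]
    return indices, freqs

indices, freqs = hash_ngrams_sketch('Banana', 1, 4096)
print(indices)
print(freqs)
```

With three distinct characters occurring 3, 2, and 1 times out of 6, the frequencies are 1/2, 1/3, and 1/6, as in the second array above. Because the indices come from a hash, two distinct n-grams may share the same bucket.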

In [60]:
sentence = "Salut les gars !"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'fra'

In [61]:
sentence = "Hi guys!"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'eng'

In [62]:
sentence = "Hejsan grabbar!"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'swe'
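
`np.argmax` only returns the winning language. For short texts, it can be informative to look at the runners-up and their probabilities as well. A minimal sketch, where the helper name `top_k_languages` is hypothetical and a made-up probability vector stands in for the output of `model.predict(...)`:

```python
import numpy as np

def top_k_languages(preds, inx2lang, k=3):
    """Return the k most probable (language, probability) pairs."""
    probs = np.asarray(preds).reshape(-1)  # predictions for one sentence have shape (1, n_langs)
    best = np.argsort(probs)[::-1][:k]     # indices sorted by decreasing probability
    return [(inx2lang[int(i)], float(probs[i])) for i in best]

# Made-up softmax output for four languages
inx2lang_toy = {0: 'fra', 1: 'eng', 2: 'swe', 3: 'dan'}
preds_toy = np.array([[0.05, 0.10, 0.60, 0.25]])
print(top_k_languages(preds_toy, inx2lang_toy, k=2))
```

In the notebook, the same function could be called as `top_k_languages(preds, inx2lang)` after a prediction.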

## Another Architecture

A simplified model, where we do not use the n-gram frequencies: each branch averages the embeddings of its n-gram indices

In [63]:
# Char index input
char_inx_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_inx_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_inx_input)

# The mean
flattened_chars = layers.GlobalAveragePooling1D()(embedded_chars)
#flattened_chars = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_chars)

# Bigram index input
bigram_inx_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_inx_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_inx_input)

# The mean
flattened_bigrams = layers.GlobalAveragePooling1D()(embedded_bigrams)
#flattened_bigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_bigrams)

# Trigram index input
trigram_inx_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_inx_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_inx_input)

# The mean
flattened_trigrams = layers.GlobalAveragePooling1D()(embedded_trigrams)
#flattened_trigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_trigrams)

flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)

dense_layer = layers.Dense(512, activation='relu')(flattened)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)

model2 = Model([char_inx_input, 
               bigram_inx_input, 
               trigram_inx_input], lang_output)
model2.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model2.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_inx_input (InputLayer)     [(None, 30)]         0                                            
__________________________________________________________________________________________________
bigram_inx_input (InputLayer)   [(None, 60)]         0                                            
__________________________________________________________________________________________________
trigram_inx_input (InputLayer)  [(None, 70)]         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 30, 64)       262208      char_inx_input[0][0]             
____________________________________________________________________________________________
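
In this architecture, each branch reduces a padded sequence of indices to the mean of the corresponding embeddings. With `mask_zero=True`, the padding positions (index 0) carry a mask, and as far as the `tf.keras` documentation describes, `GlobalAveragePooling1D` averages over the unmasked positions only. The NumPy sketch below illustrates the difference with a plain mean over the time axis; the embedding table and index values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_table = rng.normal(size=(10, 4))   # toy embedding table: 10 indices, 4 dimensions
inx = np.array([[3, 7, 0, 0, 0]])      # one sequence, post-padded with zeros

embedded = emb_table[inx]              # shape (1, 5, 4)
mask = (inx != 0)[..., None]           # shape (1, 5, 1), False on padding

# Masked mean: average over the real n-grams only
masked_mean = (embedded * mask).sum(axis=1) / mask.sum(axis=1)
# Plain mean: the padding embeddings dilute the average
plain_mean = embedded.mean(axis=1)

print(masked_mean)
print(plain_mean)
```

The two results differ as soon as a sequence contains padding, which is the case for nearly all short sentences.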

In [64]:
history = model2.fit(
    [X_train_inx_chars, 
     X_train_inx_bigrams, 
     X_train_inx_trigrams], 
    y_train, 
    epochs=3,
    validation_data=(
        [X_val_inx_chars, 
         X_val_inx_bigrams,
         X_val_inx_trigrams], 
        y_val))

Train on 6347635 samples, validate on 1586909 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


We evaluate this new model

In [65]:
scores = model2.evaluate([X_val_inx_chars, 
                         X_val_inx_bigrams,
                         X_val_inx_trigrams], y_val)
print('Scores:', scores)
list(map(lambda x: print("%s: %.2f%%" % (x[0], x[1] * 100)), zip(model2.metrics_names, scores)))

Scores: [0.055934741905906714, 0.9836752]
loss: 5.59%
acc: 98.37%


[None, None]

In [66]:
y_predicted = model2.predict([X_val_inx_chars, 
                             X_val_inx_bigrams,
                             X_val_inx_trigrams])
print(y_predicted[:5])
print(y_val[:5])

[[1.62915260e-16 7.68185580e-22 1.53335678e-17 2.29747193e-10
  9.84093776e-19 2.59905504e-14 2.49798751e-16 3.80058707e-16
  3.60504758e-13 4.88998558e-15 1.78505417e-16 1.51086096e-02
  4.34674492e-27 1.89399909e-17 7.38861472e-21 7.34128584e-18
  1.06722382e-14 2.28205441e-18 2.39637676e-25 3.07811060e-26
  1.40160723e-24 9.31882673e-16 1.90341788e-14 9.61095079e-20
  3.64583900e-24 6.41069195e-24 3.70318671e-14 1.09952720e-14
  2.85399516e-19 9.84719169e-15 1.39628380e-13 5.12328168e-19
  1.47078625e-13 6.22210840e-18 8.35612300e-14 1.47256111e-12
  1.29385863e-17 1.57711300e-18 4.91019808e-20 5.62566245e-20
  4.93918974e-18 7.11364422e-21 2.64097594e-18 3.74421269e-20
  1.69922843e-16 1.25943206e-21 5.95967936e-13 6.22249160e-13
  9.84891415e-01 4.47145114e-20 7.13274611e-21 2.22884329e-17
  3.84933435e-24 7.91127299e-20 3.45859720e-25 3.99961844e-18
  3.47619964e-19 1.09384966e-11 7.12028270e-17 1.19794302e-18
  3.27979132e-21 1.29149295e-15 6.68191329e-20 3.66134553e-12
  2.4728

In [67]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['ber', 'toki', 'epo', 'ita', 'spa', 'fin', 'ita', 'rus', 'por', 'epo']

In [68]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['ber', 'toki', 'epo', 'ita', 'spa', 'fin', 'ita', 'rus', 'por', 'epo']

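Although this model is simpler than the first one, it seems slightly more accurate: in the rows shown below, the F1 scores are equal or higher, for instance 0.82 against 0.78 for `kab` and 0.83 against 0.77 for `ido`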

In [69]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print('Micro F1:', f1_score(y_val_symb, y_pred, average='micro'))
print('Macro F1:', f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

        toki       1.00      1.00      1.00      7064
         ile       0.91      0.68      0.78      1330
         bel       0.97      0.95      0.96      2529
         fra       1.00      0.99      0.99     80048
         run       0.88      0.96      0.92       695
         ces       0.97      0.96      0.97      7408
         tlh       0.98      0.99      0.99      3379
         dan       0.94      0.88      0.91      8820
         srp       0.87      0.88      0.88      6063
         tur       1.00      1.00      1.00    136529
         bul       0.94      0.90      0.92      4723
         kab       0.83      0.81      0.82     25803
         pes       0.99      1.00      0.99      4287
         hin       0.99      0.98      0.99      2514
         ido       0.94      0.74      0.83      1005
         yue       0.93      0.93      0.93      1082
         lfn       0.79      0.91      0.85      1708
         lat       0.93    
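
The two summary figures weigh the classes differently. Micro F1 pools all the decisions, so frequent languages such as `tur` or `fra` dominate it, and in single-label classification it equals the accuracy. Macro F1 is the unweighted mean of the per-class F1 scores, so a small class such as `ile` counts as much as a large one. A sketch on a made-up confusion matrix (the values of `cf_toy` are illustrative only):

```python
import numpy as np

# Toy confusion matrix: rows = true class, columns = predicted class
cf_toy = np.array([[95, 5, 0],
                   [10, 80, 10],
                   [0, 5, 5]])

tp = np.diag(cf_toy).astype(float)
precision = tp / cf_toy.sum(axis=0)
recall = tp / cf_toy.sum(axis=1)
f1_per_class = 2 * precision * recall / (precision + recall)

macro_f1 = f1_per_class.mean()
micro_f1 = tp.sum() / cf_toy.sum()   # equals accuracy in the single-label case
print('Micro F1:', micro_f1)
print('Macro F1:', macro_f1)
```

Here the small third class has a low F1, which pulls the macro average well below the micro average.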

In [70]:
# Recompute the confusion matrix from model2's predictions;
# otherwise cf still holds the counts of the first model
cf = confusion_matrix(y_val_symb, y_pred)
for language in languages:
    if language not in lang2inx:
        continue
    print('Language:', language)
    print('Confusions:', cf[lang2inx[language]])
    print('Most confused:',
          inx2lang[np.argsort(cf[lang2inx[language]])[-2]], 
          np.sort(cf[lang2inx[language]])[-2] / np.sum(cf[lang2inx[language]]))
    print('====')

In [71]:
def extract_features_inx(sentence):
    char_items = hash_chars(sentence)
    bigram_items = hash_bigrams(sentence)
    trigram_items = hash_trigrams(sentence)
    return [pad_sequences([char_items[0]], padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([bigram_items[0]], padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([trigram_items[0]], padding='post', maxlen=MAXLEN_TRIGRAMS)]

In [72]:
sentence = "Banana"
extract_features_inx(sentence)

[array([[3232, 2755, 1698,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]], dtype=int32),
 array([[5340,   70, 4333,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0]], dtype=int32),
 array([[1923, 6678, 2369,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,   

In [73]:
sentence = "Salut les gars !"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'fra'

In [74]:
sentence = "Hello guys!"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'eng'

In [75]:
sentence = "Hejsan grabbar!"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'swe'