# Keras Implementation of CLD3

Author: Pierre Nugues

Reimplementation of Google's _Compact language detector_ (CLD3) from a high-level description. Source: ``https://github.com/google/cld3``

Still missing:
* weighted average of embeddings, instead of average
* generator or train_on_batch
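
The first missing item can be illustrated outside Keras. The sketch below uses plain NumPy and assumes that the weights are the normalized ngram frequencies computed later in this notebook. The table `E` and the functions `plain_average` and `weighted_average` are hypothetical names introduced for the illustration; this is not the CLD3 code.

```python
import numpy as np

def plain_average(E, ids):
    """Unweighted mean of the embedding rows selected by ids."""
    return E[ids].mean(axis=0)

def weighted_average(E, freqs):
    """Sum of embedding rows weighted by relative frequencies.

    freqs maps a hash code to its relative frequency (values sum to 1).
    """
    ids = np.array(list(freqs.keys()))
    weights = np.array(list(freqs.values()))
    return weights @ E[ids]

# Toy embedding table: 5 codes (row 0 would be padding), dimension 2
E = np.arange(10, dtype=float).reshape(5, 2)
freqs = {1: 0.75, 3: 0.25}
print(plain_average(E, list(freqs)))   # both codes count equally
print(weighted_average(E, freqs))      # code 1 dominates
```

The plain mean ignores how often an ngram occurs in the text, whereas the weighted sum lets frequent ngrams contribute more.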

## A Dataset: *Tatoeba*

Tatoeba is a database of texts with language tags. The corpus is available here: https://tatoeba.org/eng/downloads

### Understanding the Dataset

We read the dataset and split it into lines

In [1]:
dataset_raw = open('sentences.csv', encoding='utf8').read().strip()
dataset_raw = dataset_raw.split('\n')
dataset_raw[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

We split each line into its tab-separated fields and strip surrounding whitespace from every field

In [2]:
dataset_raw = list(map(lambda x: tuple(x.split('\t')), dataset_raw))
dataset_raw = list(map(lambda x: tuple(map(str.strip, x)), dataset_raw))
print(len(dataset_raw), 'texts')
dataset_raw[:3]

8023136 texts


[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

We pad texts shorter than three characters with spaces, so that every text contains at least one trigram. Without this padding, training crashes

In [3]:
for i, record in enumerate(dataset_raw):
    text = record[2]
    # Right-pad with spaces up to three characters; longer texts are unchanged
    if len(text) < 3:
        dataset_raw[i] = (record[0], record[1], text.ljust(3))

We shuffle the dataset

In [4]:
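from random import shuffle, seed
import numpy as np
seed(1234)  # seeds Python's random module, which shuffle() uses
np.random.seed(1234)  # seeds NumPy, used later by np.random.shuffle
shuffle(dataset_raw)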

We can decimate the dataset, keeping one tenth of the texts, to reduce training time

In [5]:
DECIMATE = True
if DECIMATE:
    dataset_raw = dataset_raw[:int(len(dataset_raw)/10)]
dataset_raw[:3]

[('424134', 'rus', 'Столкнулись два грузовика.'),
 ('911425', 'nds', 'Se hett en Dochter, de Mary heet.'),
 ('3884161', 'tur', 'Yakışıklı olduğumu düşünüyor musun?')]

We collect the set of language codes. Some texts have no language code, and others are marked with `\N`

In [6]:
languages = set([x[1] for x in dataset_raw])
len(languages)

307

We count the texts per language

In [7]:
def count_texts(dataset):
    text_counts = {}
    for record in dataset:
        lang = record[1]
        if lang in text_counts:
            text_counts[lang] += 1
        else:
            text_counts[lang] = 1
    return text_counts
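
The same counts can be obtained with `collections.Counter`, which the notebook already uses later for the ngrams. A minimal sketch, where `count_texts_counter` and the toy records are hypothetical names used only for this illustration:

```python
from collections import Counter

def count_texts_counter(dataset):
    """Count records per language code (second field of each record)."""
    return Counter(record[1] for record in dataset)

sample = [('1', 'cmn', 'a'), ('2', 'eng', 'b'), ('3', 'eng', 'c')]
counts = count_texts_counter(sample)
print(counts.most_common(1))
```

`Counter.most_common` returns the entries sorted by decreasing count, which replaces the explicit sort in the next cell.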

Languages with the most examples

In [8]:
text_counts = count_texts(dataset_raw)
langs = sorted(text_counts.keys(), key=text_counts.get, reverse=True)
[(lang, text_counts[lang]) for lang in langs][:25]

[('eng', 126397),
 ('ita', 73657),
 ('rus', 73470),
 ('tur', 67894),
 ('epo', 61384),
 ('deu', 49034),
 ('fra', 40191),
 ('por', 34628),
 ('spa', 31495),
 ('hun', 26840),
 ('ber', 22608),
 ('heb', 19673),
 ('jpn', 18936),
 ('ukr', 15496),
 ('kab', 12891),
 ('fin', 10540),
 ('nld', 10362),
 ('pol', 9972),
 ('mkd', 7813),
 ('cmn', 6066),
 ('mar', 5519),
 ('dan', 4412),
 ('lit', 3825),
 ('ces', 3736),
 ('toki', 3567)]

We keep the languages that have more than 3,000 examples in the dataset. Setting `SMALL_LANGUAGE_SET` to `True` restricts the set to French, English, and Swedish instead

In [9]:
SMALL_LANGUAGE_SET = False
considered_langs = [lang for lang in langs if text_counts[lang] > 3000]
if SMALL_LANGUAGE_SET:
    considered_langs = ['fra', 'eng', 'swe']
print(considered_langs)
LANG_NBR = len(considered_langs)
LANG_NBR

['eng', 'ita', 'rus', 'tur', 'epo', 'deu', 'fra', 'por', 'spa', 'hun', 'ber', 'heb', 'jpn', 'ukr', 'kab', 'fin', 'nld', 'pol', 'mkd', 'cmn', 'mar', 'dan', 'lit', 'ces', 'toki', 'ara', 'swe', 'lat', 'ell', 'srp']


30

We extract the texts in these languages

In [10]:
dataset = list(filter(lambda x: x[1] in considered_langs, dataset_raw))
print(len(dataset))
dataset[:5]

756648


[('424134', 'rus', 'Столкнулись два грузовика.'),
 ('3884161', 'tur', 'Yakışıklı olduğumu düşünüyor musun?'),
 ('6875215', 'hun', 'Nem szenvedtem már eleget?'),
 ('7488095', 'spa', '¿Viste mi diccionario de italiano?'),
 ('5672242',
  'epo',
  'Tomo diras, ke li estas tro laca por relegi la lecionojn.')]

## Feature Extraction

### Functions to Count Character Ngrams

We convert each ngram into an integer index by hashing it

The number of hash codes we use for each ngram order

In [11]:
MAX_CHARS = 4096
MAX_BIGRAMS = 4096
MAX_TRIGRAMS = 4096

We normalize the counts as in CLD3

In [12]:
def normalize(d):
    sum_chars = sum(d.values())
    d = {k:v/sum_chars for k, v in d.items()}
    return d

We compute the hash code modulo the table size and add one, so that the codes lie between 1 and the table size (4096 here). The value 0 is reserved as the padding symbol in the subsequent matrices

In [13]:
from collections import Counter

def hash_chars(string, lc=True):
    if lc:
        string = string.lower()
    hash_codes = map(lambda x: hash(x) % MAX_CHARS + 1, string)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d

In [14]:
def hash_bigrams(string, lc=True):
    if lc:
        string = string.lower()
    bigrams = [string[i:i + 2] for i in range(len(string) - 1)]
    hash_codes = map(lambda x: hash(x) % MAX_BIGRAMS + 1, bigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d

In [15]:
def hash_trigrams(string, lc=True):
    if lc:
        string = string.lower()
    trigrams = [string[i:i + 3] for i in range(len(string) - 2)]
    hash_codes = map(lambda x: hash(x) % MAX_TRIGRAMS + 1, trigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d
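
Python's built-in `hash` is salted per interpreter process for strings unless `PYTHONHASHSEED` is set, so the codes produced by the functions above can differ from one session to the next. The sketch below shows one possible deterministic alternative based on `zlib.crc32`. The names `stable_code` and `stable_hash_ngrams` are hypothetical, and CRC32 is an assumption made for this illustration, not the hash used by CLD3.

```python
import zlib
from collections import Counter

MAX_CODES = 4096  # same table size as in the notebook

def stable_code(ngram, max_codes=MAX_CODES):
    """Deterministic code in 1..max_codes; 0 stays free for padding."""
    return zlib.crc32(ngram.encode('utf-8')) % max_codes + 1

def stable_hash_ngrams(string, n, lc=True):
    """Relative frequencies of the hashed character n-grams of a string."""
    if lc:
        string = string.lower()
    ngrams = [string[i:i + n] for i in range(len(string) - n + 1)]
    counts = Counter(stable_code(g) for g in ngrams)
    total = sum(counts.values())
    return {code: cnt / total for code, cnt in counts.items()}

features = stable_hash_ngrams("Let's try something.", 2)
print(len(features))
```

With a deterministic hash, a saved model can be reused in a new session, since the same ngram always maps to the same embedding row.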

In [16]:
hash_chars("Let's try something.")

{1654: 0.05,
 2304: 0.1,
 218: 0.15,
 2550: 0.05,
 2485: 0.1,
 3562: 0.1,
 2327: 0.05,
 3459: 0.05,
 3208: 0.05,
 1064: 0.05,
 902: 0.05,
 933: 0.05,
 3095: 0.05,
 881: 0.05,
 2384: 0.05}

In [17]:
hash_bigrams("Let's try something.")

{1965: 0.05263157894736842,
 829: 0.10526315789473684,
 3227: 0.05263157894736842,
 904: 0.05263157894736842,
 2287: 0.05263157894736842,
 2979: 0.05263157894736842,
 1347: 0.05263157894736842,
 500: 0.05263157894736842,
 2934: 0.05263157894736842,
 4036: 0.05263157894736842,
 3216: 0.05263157894736842,
 1015: 0.05263157894736842,
 2406: 0.05263157894736842,
 3975: 0.05263157894736842,
 2177: 0.05263157894736842,
 882: 0.05263157894736842,
 3160: 0.05263157894736842,
 4038: 0.05263157894736842}

In [18]:
hash_trigrams("Let's try something.")

{1688: 0.05555555555555555,
 654: 0.05555555555555555,
 3663: 0.05555555555555555,
 807: 0.05555555555555555,
 392: 0.05555555555555555,
 1911: 0.05555555555555555,
 1619: 0.05555555555555555,
 3270: 0.05555555555555555,
 209: 0.05555555555555555,
 3876: 0.05555555555555555,
 623: 0.05555555555555555,
 89: 0.05555555555555555,
 2482: 0.05555555555555555,
 3280: 0.05555555555555555,
 3686: 0.05555555555555555,
 1009: 0.05555555555555555,
 414: 0.05555555555555555,
 2474: 0.05555555555555555}

### Counting the ngrams in the dataset

We add the character, bigram, and trigram counts to the texts. The format is:
`(text_id, language_id, text, char_cnt, bigram_cnt, trigram_cnt)`

In [19]:
dataset_feat = list(map(lambda x: x + (hash_chars(x[-1]), hash_bigrams(x[-1]), hash_trigrams(x[-1])), 
                              dataset))
dataset_feat[:2]

[('424134',
  'rus',
  'Столкнулись два грузовика.',
  {1238: 0.07692307692307693,
   2141: 0.038461538461538464,
   3533: 0.07692307692307693,
   734: 0.07692307692307693,
   2495: 0.07692307692307693,
   834: 0.038461538461538464,
   2720: 0.07692307692307693,
   3473: 0.07692307692307693,
   972: 0.038461538461538464,
   3562: 0.07692307692307693,
   163: 0.038461538461538464,
   1963: 0.07692307692307693,
   21: 0.07692307692307693,
   3268: 0.038461538461538464,
   754: 0.038461538461538464,
   439: 0.038461538461538464,
   2384: 0.038461538461538464},
  {1337: 0.04,
   2120: 0.04,
   2578: 0.04,
   2501: 0.04,
   610: 0.04,
   1297: 0.04,
   1174: 0.04,
   197: 0.04,
   1318: 0.04,
   3016: 0.04,
   1897: 0.04,
   1492: 0.04,
   1026: 0.04,
   1681: 0.04,
   1808: 0.04,
   2214: 0.04,
   3251: 0.04,
   1094: 0.04,
   3057: 0.04,
   791: 0.04,
   1417: 0.04,
   1354: 0.04,
   3159: 0.04,
   479: 0.04,
   83: 0.04},
  {2389: 0.041666666666666664,
   3029: 0.041666666666666664,
   4

In [20]:
dataset_feat[0][3].keys()

dict_keys([1238, 2141, 3533, 734, 2495, 834, 2720, 3473, 972, 3562, 163, 1963, 21, 3268, 754, 439, 2384])

In [21]:
dataset_feat[0][4].keys()

dict_keys([1337, 2120, 2578, 2501, 610, 1297, 1174, 197, 1318, 3016, 1897, 1492, 1026, 1681, 1808, 2214, 3251, 1094, 3057, 791, 1417, 1354, 3159, 479, 83])

### Unique ngrams

We now extract all the unique ngrams (hash codes).

In [22]:
unique_chars = set([item for text in dataset_feat for item in list(text[3].keys())])

In [23]:
unique_bigrams = set([item for text in dataset_feat for item in list(text[4].keys())])

In [24]:
unique_trigrams = set([item for text in dataset_feat for item in list(text[5].keys())])

## Building the Keras Architecture

### Embeddings

The number of distinct character hash codes observed in the dataset

In [25]:
unique_char_cnt = len(unique_chars)
unique_char_cnt

2711

What is the max length of the feature vectors for the characters?

In [26]:
def length(i, j):
    return len(dataset_feat[i][j].keys())

max([length(i, 3) for i in range(len(dataset_feat))])

86

In [27]:
import statistics
print(statistics.mean([length(i, 3) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 3) for i in range(len(dataset_feat))]))
statistics.median([length(i, 3) for i in range(len(dataset_feat))])

16.28117962381451
3.7617453433411927


16.0

In [28]:
unique_bigram_cnt = len(unique_bigrams)
unique_bigram_cnt

4096

What is the max length of the feature vectors for the bigrams?

In [29]:
max([length(i, 4) for i in range(len(dataset_feat))])

320

In [30]:
print(statistics.mean([length(i, 4) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 4) for i in range(len(dataset_feat))]))
statistics.median([length(i, 4) for i in range(len(dataset_feat))])

29.966491684376354
14.0720490075001


28.0

In [31]:
unique_trigram_cnt = len(unique_trigrams)
unique_trigram_cnt

4096

What is the max length of the feature vectors for the trigrams?

In [32]:
max([length(i, 5) for i in range(len(dataset_feat))])

681

In [33]:
print(statistics.mean([length(i, 5) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 5) for i in range(len(dataset_feat))]))
statistics.median([length(i, 5) for i in range(len(dataset_feat))])

32.37771063955763
19.158929900629644


29.0

In [34]:
MAXLEN_CHARS = 20
MAXLEN_BIGRAMS = 50
MAXLEN_TRIGRAMS = 70
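
These maximum lengths are a trade-off between the statistics above and the size of the input matrices. One way to check a choice is to measure the share of texts that fit without truncation. A minimal sketch, where `coverage` and the toy lengths are hypothetical:

```python
def coverage(lengths, maxlen):
    """Fraction of sequences whose length does not exceed maxlen."""
    lengths = list(lengths)
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)

toy_lengths = [12, 16, 16, 19, 25]
print(coverage(toy_lengths, 20))
```

In this notebook, the call would be `coverage([length(i, 3) for i in range(len(dataset_feat))], MAXLEN_CHARS)` for the characters, and similarly with columns 4 and 5 for the bigrams and trigrams.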

### The architecture

In [36]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import layers, optimizers, backend

# Char input
char_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_input)
flattened_chars = layers.GlobalAveragePooling1D()(embedded_chars)
#flattened_chars = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_chars)

# Bigram input
bigram_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_input)
flattened_bigrams = layers.GlobalAveragePooling1D()(embedded_bigrams)
#flattened_bigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_bigrams)

# Trigram input
trigram_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_input)
flattened_trigrams = layers.GlobalAveragePooling1D()(embedded_trigrams)
#flattened_trigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_trigrams)

flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)
dense_layer = layers.Dense(512, activation='relu')(flattened)
# dense_layer = layers.Dropout(0.6)(dense_layer)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)
model = Model([char_input, bigram_input, trigram_input], lang_output)
model.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input (InputLayer)         [(None, 20)]         0                                            
__________________________________________________________________________________________________
bigram_input (InputLayer)       [(None, 50)]         0                                            
__________________________________________________________________________________________________
trigram_input (InputLayer)      [(None, 70)]         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 20, 64)       262208      char_input[0][0]                 
______________________________________________________________________________________________
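
With `mask_zero=True`, the pooling layer is expected to average only the positions that are not padding. The NumPy sketch below illustrates this masked mean on a toy batch. The function `masked_mean` is a hypothetical name and the code is an illustration of the idea, not the Keras implementation.

```python
import numpy as np

def masked_mean(embedded, ids):
    """Average embedding vectors over positions where ids != 0.

    embedded: array of shape (batch, steps, dim)
    ids:      array of shape (batch, steps); 0 marks padding
    """
    mask = (ids != 0).astype(embedded.dtype)[..., np.newaxis]
    summed = (embedded * mask).sum(axis=1)   # padded positions contribute 0
    counts = mask.sum(axis=1)                # number of real positions
    return summed / counts

ids = np.array([[3, 5, 0, 0]])
embedded = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
print(masked_mean(embedded, ids))
```

The two padded positions are ignored, so the result is the mean of the first two vectors only. A row that contains only padding would divide by zero in this sketch.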

## Building $\mathbf{X}$ and $\mathbf{y}$

### The $\mathbf{X}$ matrix

We extract the keys (hash codes) from the corpus

In [37]:
X_list_chars = [list(dataset_feat[i][3].keys()) for i in range(len(dataset_feat))]
print(X_list_chars[:2])

[[1238, 2141, 3533, 734, 2495, 834, 2720, 3473, 972, 3562, 163, 1963, 21, 3268, 754, 439, 2384], [3459, 2850, 1862, 917, 94, 1654, 3562, 3208, 961, 3737, 45, 1064, 1796, 3095, 2327, 2485, 3480]]


In [38]:
X_list_bigrams = [list(dataset_feat[i][4].keys()) for i in range(len(dataset_feat))]
print(X_list_bigrams[:2])

[[1337, 2120, 2578, 2501, 610, 1297, 1174, 197, 1318, 3016, 1897, 1492, 1026, 1681, 1808, 2214, 3251, 1094, 3057, 791, 1417, 1354, 3159, 479, 83], [2147, 2641, 270, 2620, 2472, 3742, 2276, 982, 711, 2150, 464, 1850, 273, 2569, 381, 2881, 3536, 96, 1324, 2494, 1355, 593, 3828, 2250, 4073, 2566, 1673, 159, 2574, 2320, 3423, 3712]]


In [39]:
X_list_trigrams = [list(dataset_feat[i][5].keys()) for i in range(len(dataset_feat))]
print(X_list_trigrams[:2])

[[2389, 3029, 4051, 1655, 1843, 2569, 3251, 3513, 2104, 1547, 2186, 139, 2469, 458, 2874, 3386, 1820, 1067, 3828, 2773, 1062, 2142, 2704, 626], [3699, 1488, 3524, 1013, 1977, 2519, 2324, 2975, 23, 3975, 1057, 3167, 3628, 1236, 2069, 846, 2029, 679, 3870, 2485, 237, 3312, 3721, 1039, 2819, 1046, 709, 2128, 3816, 280, 722, 2430, 1010]]


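We pad the sequences with zeros at the end (`padding='post'`) up to the maximum lengths defined above. Sequences longer than the maximum length are truncated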

In [40]:
from keras.preprocessing.sequence import pad_sequences
X_chars = pad_sequences(X_list_chars, padding='post', maxlen=MAXLEN_CHARS)
X_chars[:2]

Using TensorFlow backend.


array([[1238, 2141, 3533,  734, 2495,  834, 2720, 3473,  972, 3562,  163,
        1963,   21, 3268,  754,  439, 2384,    0,    0,    0],
       [3459, 2850, 1862,  917,   94, 1654, 3562, 3208,  961, 3737,   45,
        1064, 1796, 3095, 2327, 2485, 3480,    0,    0,    0]],
      dtype=int32)
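
The behavior of `pad_sequences` with `padding='post'` can be mimicked in a few lines of plain Python. The sketch assumes the default `truncating='pre'`, which keeps the last `maxlen` items of a sequence that is too long. The function `pad_post` is a hypothetical name.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'.

    Assumes maxlen > 0.
    """
    seq = list(seq)[-maxlen:]                     # keep the last maxlen items
    return seq + [value] * (maxlen - len(seq))    # append the padding

print(pad_post([7, 8, 9], 5))
print(pad_post([1, 2, 3, 4, 5, 6], 4))
```

Since the hash codes come from dictionary keys in order of first occurrence, truncation from the front drops the ngrams seen first in the text.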

In [41]:
X_bigrams = pad_sequences(X_list_bigrams, padding='post', maxlen=MAXLEN_BIGRAMS)
X_bigrams[:2]

array([[1337, 2120, 2578, 2501,  610, 1297, 1174,  197, 1318, 3016, 1897,
        1492, 1026, 1681, 1808, 2214, 3251, 1094, 3057,  791, 1417, 1354,
        3159,  479,   83,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [2147, 2641,  270, 2620, 2472, 3742, 2276,  982,  711, 2150,  464,
        1850,  273, 2569,  381, 2881, 3536,   96, 1324, 2494, 1355,  593,
        3828, 2250, 4073, 2566, 1673,  159, 2574, 2320, 3423, 3712,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)

In [42]:
X_trigrams = pad_sequences(X_list_trigrams, padding='post', maxlen=MAXLEN_TRIGRAMS)
X_trigrams[:2]

array([[2389, 3029, 4051, 1655, 1843, 2569, 3251, 3513, 2104, 1547, 2186,
         139, 2469,  458, 2874, 3386, 1820, 1067, 3828, 2773, 1062, 2142,
        2704,  626,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [3699, 1488, 3524, 1013, 1977, 2519, 2324, 2975,   23, 3975, 1057,
        3167, 3628, 1236, 2069,  846, 2029,  679, 3870, 2485,  237, 3312,
        3721, 1039, 2819, 1046,  709, 2128, 3816,  280,  722, 2430, 1010,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)

### The $\mathbf{y}$ vector

We extract the responses, i.e. the language code of each text

In [43]:
y_list = [dataset_feat[i][1] for i in range(len(dataset_feat))]
y_list[:10]

['rus', 'tur', 'hun', 'spa', 'epo', 'jpn', 'eng', 'epo', 'heb', 'jpn']

In [44]:
y_set = set(y_list)
inx2lang = dict(enumerate(y_set))
lang2inx = {v: k for k, v in inx2lang.items()}
lang2inx

{'eng': 0,
 'por': 1,
 'fin': 2,
 'epo': 3,
 'ukr': 4,
 'nld': 5,
 'hun': 6,
 'ber': 7,
 'tur': 8,
 'rus': 9,
 'ita': 10,
 'cmn': 11,
 'jpn': 12,
 'lit': 13,
 'ara': 14,
 'dan': 15,
 'ces': 16,
 'spa': 17,
 'heb': 18,
 'toki': 19,
 'swe': 20,
 'ell': 21,
 'mkd': 22,
 'deu': 23,
 'mar': 24,
 'srp': 25,
 'fra': 26,
 'lat': 27,
 'kab': 28,
 'pol': 29}
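
The iteration order of a set of strings can change from one session to the next, so the indices above are not guaranteed to be the same in another run. A minimal sketch of a reproducible mapping that sorts the labels first; `build_label_maps` and the `_sorted` variables are hypothetical names:

```python
def build_label_maps(labels):
    """Return (inx2lang, lang2inx) with indices assigned in sorted order."""
    inx2lang = dict(enumerate(sorted(set(labels))))
    lang2inx = {lang: inx for inx, lang in inx2lang.items()}
    return inx2lang, lang2inx

inx2lang_sorted, lang2inx_sorted = build_label_maps(['rus', 'tur', 'eng', 'rus'])
print(lang2inx_sorted)
```

A stable mapping matters as soon as the model is saved and reloaded, because the output units are tied to these indices.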

In [45]:
y_list_num = list(map(lambda x: lang2inx[x], y_list))
y_list_num[:3]

[9, 8, 6]

We encode them as one-hot vectors

In [46]:
from keras.utils import to_categorical
y = to_categorical(y_list_num)
y[:3]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32)
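
One-hot encoding amounts to selecting rows of an identity matrix. A small NumPy sketch of the idea, where `one_hot` is a hypothetical helper and not the Keras function:

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a float32 matrix with a single 1 per row."""
    return np.eye(num_classes, dtype='float32')[np.asarray(indices)]

print(one_hot([2, 0], 3))
```

Giving the number of classes explicitly keeps the width of the matrix equal to the number of output units, even if the highest index is absent from the data.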

In [None]:
"""history = model.fit([X_chars, X_bigrams, X_trigrams], y, 
                    epochs=3,
                   validation_split=0.2)"""

## Training and Validation Sets

### We shuffle the indices

In [48]:
indices = list(range(X_chars.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X_chars = X_chars[indices, :]
X_bigrams = X_bigrams[indices, :]
X_trigrams = X_trigrams[indices, :]
y = np.array(y)[indices]

[347592, 272856, 708757, 679749, 20495, 471207, 10591, 12316, 174363, 90159]


### We split the dataset

In [49]:
training_examples = int(X_chars.shape[0] * 0.8)

X_train_chars = X_chars[:training_examples, :]
X_train_bigrams = X_bigrams[:training_examples, :]
X_train_trigrams = X_trigrams[:training_examples, :]

y_train = y[:training_examples]

X_val_chars = X_chars[training_examples:, :]
X_val_bigrams = X_bigrams[training_examples:, :]
X_val_trigrams = X_trigrams[training_examples:, :]

y_val = y[training_examples:]

## Fitting the Model

In [50]:
X_train_chars[0]

array([2664, 2850,  933, 3562,  218, 2304, 3095, 2976, 3208, 1654, 1802,
       2327, 3373,  961, 3737, 2384,    0,    0,    0,    0], dtype=int32)

In [51]:
history = model.fit([X_train_chars, X_train_bigrams, X_train_trigrams], 
                    y_train, 
                    epochs=3,
                    validation_data=([X_val_chars, X_val_bigrams, X_val_trigrams], y_val))

Train on 605318 samples, validate on 151330 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
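
The call above needs all the matrices in memory at once. The second item listed as missing at the top of this notebook, a generator, could look like the sketch below. The function `batch_generator` and the toy arrays are hypothetical; the batches it yields could, for instance, be fed one by one to `model.train_on_batch`.

```python
import numpy as np

def batch_generator(inputs, targets, batch_size):
    """Yield (list_of_input_batches, target_batch), one pass over the data."""
    n = targets.shape[0]
    for start in range(0, n, batch_size):
        end = start + batch_size
        yield [x[start:end] for x in inputs], targets[start:end]

# Toy data: two inputs with 5 rows each
X_a = np.arange(10).reshape(5, 2)
X_b = np.arange(15).reshape(5, 3)
y_toy = np.arange(5)
batches = list(batch_generator([X_a, X_b], y_toy, batch_size=2))
print(len(batches))
```

The last batch is smaller when the number of rows is not a multiple of the batch size.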


## Predicting and Evaluating

In [53]:
y_predicted = model.predict([X_val_chars, X_val_bigrams, X_val_trigrams])
print(y_predicted[:2])
print(y_val[:2])

# evaluate the model
scores = model.evaluate([X_val_chars, X_val_bigrams, X_val_trigrams], y_val)
print('Scores:', scores)
list(map(lambda x: print("%s: %.2f%%" % (x[0], x[1] * 100)), zip(model.metrics_names, scores)))

[[3.8083164e-07 2.8835956e-10 2.3617963e-10 1.8801352e-10 1.0107810e-13
  1.2335891e-11 6.7405428e-09 6.4613956e-01 6.6538174e-05 2.0980557e-09
  3.3848793e-10 1.9140409e-06 6.9774308e-07 2.5517316e-08 2.7330510e-10
  2.2839211e-09 1.9437271e-11 1.2512748e-10 2.6236424e-13 3.4532625e-16
  2.7099106e-10 7.3032055e-14 5.9510047e-13 7.2904336e-09 1.3161155e-14
  4.7616758e-11 1.3070505e-09 5.3681006e-08 3.5379082e-01 9.8696962e-13]
 [6.7519935e-15 2.0320501e-08 1.1011691e-08 3.1073802e-07 9.9984944e-01
  5.5877412e-12 5.6106331e-10 2.8939251e-09 1.1841821e-11 5.6283725e-06
  1.0962262e-07 5.2893211e-07 5.0946074e-09 4.7430365e-10 9.9601803e-12
  2.3856728e-17 1.2085843e-07 2.8845143e-08 1.4241751e-04 7.9641209e-13
  1.5744631e-12 1.9381034e-17 3.9636756e-09 7.5677650e-14 2.9867769e-07
  1.0283632e-06 1.5745734e-13 4.0543791e-15 1.2137009e-11 3.4414029e-09]]
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 

[None, None]

### Indices of the predicted and true classes

Prediction

In [66]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['ber', 'ukr', 'eng', 'tur', 'deu', 'ita', 'tur', 'rus', 'kab', 'ber']

Truth

In [67]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['ber', 'ukr', 'eng', 'tur', 'deu', 'ita', 'tur', 'rus', 'kab', 'ber']

### The detailed F1s

In [80]:
lang2inx

{'eng': 0,
 'por': 1,
 'fin': 2,
 'epo': 3,
 'ukr': 4,
 'nld': 5,
 'hun': 6,
 'ber': 7,
 'tur': 8,
 'rus': 9,
 'ita': 10,
 'cmn': 11,
 'jpn': 12,
 'lit': 13,
 'ara': 14,
 'dan': 15,
 'ces': 16,
 'spa': 17,
 'heb': 18,
 'toki': 19,
 'swe': 20,
 'ell': 21,
 'mkd': 22,
 'deu': 23,
 'mar': 24,
 'srp': 25,
 'fra': 26,
 'lat': 27,
 'kab': 28,
 'pol': 29}

In [84]:
lang_names = sorted(list(lang2inx.keys()), key=lambda x: lang2inx[x])

In [85]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print(f1_score(y_val_symb, y_pred, average='micro'))
print(f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

         eng       0.99      1.00      1.00     25105
         por       0.97      0.99      0.98      6952
         fin       0.99      0.99      0.99      2175
         epo       0.99      1.00      0.99     12157
         ukr       0.96      0.97      0.97      3069
         nld       0.98      0.97      0.98      2086
         hun       0.99      0.99      0.99      5308
         ber       0.80      0.93      0.86      4594
         tur       1.00      0.99      1.00     13468
         rus       0.99      0.99      0.99     14629
         ita       0.99      0.99      0.99     14844
         cmn       0.95      0.93      0.94      1239
         jpn       0.99      0.99      0.99      3874
         lit       0.96      0.98      0.97       743
         ara       0.99      0.99      0.99       714
         dan       0.94      0.96      0.95       887
         ces       0.97      0.97      0.97       785
         spa       0.99    
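
The micro and macro averages printed by the cell above can be related to a confusion matrix whose rows are the true classes and whose columns are the predicted classes. The sketch below recomputes them with NumPy on a toy matrix; `f1_from_confusion` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def f1_from_confusion(cf):
    """Return (per_class_f1, micro_f1, macro_f1) from a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    cf = np.asarray(cf, dtype=float)
    tp = np.diag(cf)
    fp = cf.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = cf.sum(axis=1) - tp   # belonging to the class, but missed
    denom = 2 * tp + fp + fn
    per_class = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return per_class, micro, per_class.mean()

per_class, micro, macro = f1_from_confusion([[8, 2], [1, 9]])
print(per_class, micro, macro)
```

With exactly one label per text, the micro-averaged F1 equals the accuracy, while the macro average gives every language the same weight regardless of its number of examples.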

### Confusion Matrix

In [87]:
from sklearn.metrics import confusion_matrix
print(lang2inx)
cf = confusion_matrix(y_val_symb, y_pred)
print(cf)

{'eng': 0, 'por': 1, 'fin': 2, 'epo': 3, 'ukr': 4, 'nld': 5, 'hun': 6, 'ber': 7, 'tur': 8, 'rus': 9, 'ita': 10, 'cmn': 11, 'jpn': 12, 'lit': 13, 'ara': 14, 'dan': 15, 'ces': 16, 'spa': 17, 'heb': 18, 'toki': 19, 'swe': 20, 'ell': 21, 'mkd': 22, 'deu': 23, 'mar': 24, 'srp': 25, 'fra': 26, 'lat': 27, 'kab': 28, 'pol': 29}
[[25059     6     1     2     0     4     5     5     1     0     7     0
      0     0     0     3     0     2     0     0     0     0     0     1
      0     0     6     0     3     0]
 [    0  6879     1     7     0     0     0     1     2     0    21     0
      0     1     0     1     1    27     0     0     0     0     0     0
      0     0     2     8     1     0]
 [    1     0  2144     4     0     2     0     2     4     0     4     0
      1     6     0     0     0     0     0     0     0     0     0     1
      0     0     3     1     2     0]
 [    7     8     2 12113     0     0     1     3     0     0     7     0
      0     2     0     1     0     6     0

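For a few languages, we print the row of the confusion matrix and the language most often predicted in place of the true one. The code takes the second-largest count in the row, which assumes that the largest count is the correct class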

In [88]:
languages = ['fra', 'eng', 'swe']

In [89]:
for language in languages:
    if language not in lang2inx:
        continue
    print('Language:', language)
    print('Confusions:', cf[lang2inx[language]])
    print('Most confused:',
          inx2lang[np.argsort(cf[lang2inx[language]])[-2]], 
          np.sort(cf[lang2inx[language]])[-2] / np.sum(cf[lang2inx[language]]))
    print('====')

Language: fra
Confusions: [   9    7    0    1    0    0    0    2    0    0   10    0    0    3
    0    0    0    7    0    0    0    0    0    0    0    0 8028    4
    3    0]
Most confused: ita 0.0012385434728758979
====
Language: eng
Confusions: [25059     6     1     2     0     4     5     5     1     0     7     0
     0     0     0     3     0     2     0     0     0     0     0     1
     0     0     6     0     3     0]
Most confused: ita 0.00027882891854212307
====
Language: swe
Confusions: [  9   0   3   2   0   2   2   3   1   0   0   0   0   2   0  27   0   0
   0   0 606   0   0   3   0   2   0   0   1   0]
Most confused: dan 0.04072398190045249
====
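
The loop above relies on the diagonal entry being the largest value of its row. A sketch that does not need this assumption masks the diagonal before looking for the maximum; `most_confused` and `toy_cf` are hypothetical names.

```python
import numpy as np

def most_confused(cf, true_inx):
    """Return (predicted_inx, rate) of the most frequent error for a class."""
    row = np.asarray(cf[true_inx], dtype=float).copy()
    total = row.sum()
    row[true_inx] = -1          # exclude the correct predictions
    wrong_inx = int(np.argmax(row))
    return wrong_inx, row[wrong_inx] / total

toy_cf = np.array([[50, 3, 7],
                   [30, 10, 20],
                   [0, 1, 99]])
print(most_confused(toy_cf, 1))
```

In the toy matrix, class 1 is mostly predicted as class 0, a case where taking the second-largest value of the row would point to class 2 instead.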


## Prediction

In [90]:
sentence = "Salut les gars !"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'fra'

In [91]:
sentence = "Hi guys!"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'eng'

In [92]:
sentence = "Hejsan grabbar!"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'hun'
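
The three cells above repeat the same encoding steps. They could be factored into a helper such as the sketch below, which builds one zero-padded row per feature type with NumPy. The names `encode` and `toy_chars` are hypothetical, and `toy_chars` is a stand-in featurizer for the demonstration, not the hashing used in this notebook.

```python
import numpy as np

def encode(sentence, featurizers, maxlens):
    """Build one zero-padded int32 array of shape (1, maxlen) per featurizer.

    Each featurizer maps a string to a dict whose keys are hash codes.
    Padding goes at the end; sequences that are too long keep their last
    maxlen codes.
    """
    arrays = []
    for featurize, maxlen in zip(featurizers, maxlens):
        codes = list(featurize(sentence).keys())[-maxlen:]
        row = np.zeros((1, maxlen), dtype='int32')
        row[0, :len(codes)] = codes
        arrays.append(row)
    return arrays

# Stand-in featurizer for the demonstration only
def toy_chars(s):
    return {ord(c) % 50 + 1: 1 for c in s}

X = encode("abc", [toy_chars], [5])
print(X[0])
```

In this notebook, the call would be `encode(sentence, [hash_chars, hash_bigrams, hash_trigrams], [MAXLEN_CHARS, MAXLEN_BIGRAMS, MAXLEN_TRIGRAMS])`, and the resulting list can be passed to `model.predict`.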