# Keras Implementation of CLD3

Author: Pierre Nugues

A reimplementation of Google's _Compact Language Detector v3_ (CLD3) from its high-level description. Source: ``https://github.com/google/cld3``

Still missing:
* weighted average of embeddings, instead of average
* generator or train_on_batch

## A Dataset: *Tatoeba*

Tatoeba is a database of texts with language tags. The corpus is available here: https://tatoeba.org/eng/downloads

### Understanding the Dataset

We read the dataset and split it into lines.

In [1]:
# Read the whole file and close it properly, then split it into lines
with open('sentences.csv', encoding='utf8') as f:
    dataset_raw = f.read().strip().split('\n')
dataset_raw[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

We split each line into its fields and strip any surrounding whitespace.

In [2]:
# One tuple per line: (text_id, language_id, text), each field stripped
dataset_raw = [tuple(field.strip() for field in line.split('\t'))
               for line in dataset_raw]
print(len(dataset_raw), 'texts')
dataset_raw[:3]

8023136 texts


[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

We pad texts shorter than three characters with spaces. Without this step, training crashes.

In [3]:
# Pad with trailing spaces so that every text yields at least one trigram
for i, record in enumerate(dataset_raw):
    if len(record[2]) < 3:
        dataset_raw[i] = (record[0], record[1], record[2].ljust(3))

We shuffle the dataset

In [4]:
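from random import shuffle, seed
import numpy as np
seed(1234)            # seeds Python's random module, which shuffle() uses
np.random.seed(1234)  # seeds NumPy, used later to shuffle the indices
shuffle(dataset_raw)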

We can decimate the dataset, keeping one tenth of the shuffled texts, to shorten training times.

In [5]:
DECIMATE = False
if DECIMATE:
    dataset_raw = dataset_raw[:int(len(dataset_raw)/10)]
dataset_raw[:3]

[('4116828', 'ara', 'حمّد الشمس!'),
 ('3422297', 'eng', 'Tom rarely goes to church.'),
 ('5875935', 'ukr', 'Я одягнув фартук.')]

We collect the set of languages. Some texts have no language tag, and others are marked with `\N`.

In [6]:
languages = set([x[1] for x in dataset_raw])
len(languages)

347

We count the texts per language

In [7]:
from collections import Counter

def count_texts(dataset):
    # Count the texts per language code (field 1 of each record)
    return Counter(record[1] for record in dataset)

Languages with the most examples

In [8]:
text_counts = count_texts(dataset_raw)
langs = sorted(text_counts.keys(), key=text_counts.get, reverse=True)
[(lang, text_counts[lang]) for lang in langs][:25]

[('eng', 1264754),
 ('ita', 738799),
 ('rus', 732078),
 ('tur', 684619),
 ('epo', 609518),
 ('deu', 488568),
 ('fra', 402078),
 ('por', 347430),
 ('spa', 314873),
 ('hun', 269208),
 ('ber', 226376),
 ('heb', 195329),
 ('jpn', 187684),
 ('ukr', 154466),
 ('kab', 129245),
 ('fin', 106936),
 ('nld', 104337),
 ('pol', 99754),
 ('mkd', 77778),
 ('cmn', 60993),
 ('mar', 54656),
 ('dan', 43995),
 ('lit', 38408),
 ('ces', 37596),
 ('toki', 35017)]

We keep the languages that have more than 3,000 examples in the dataset. Alternatively, setting `SMALL_LANGUAGE_SET` to `True` restricts the set to French, English, and Swedish.

In [9]:
SMALL_LANGUAGE_SET = False
considered_langs = [lang for lang in langs if text_counts[lang] > 3000]
if SMALL_LANGUAGE_SET:
    considered_langs = ['fra', 'eng', 'swe']
print(considered_langs)
LANG_NBR = len(considered_langs)
LANG_NBR

['eng', 'ita', 'rus', 'tur', 'epo', 'deu', 'fra', 'por', 'spa', 'hun', 'ber', 'heb', 'jpn', 'ukr', 'kab', 'fin', 'nld', 'pol', 'mkd', 'cmn', 'mar', 'dan', 'lit', 'ces', 'toki', 'swe', 'ara', 'lat', 'ell', 'srp', 'ina', 'bul', 'pes', 'ron', 'nds', 'tlh', 'jbo', 'nob', 'tat', 'tgl', 'ind', 'bel', 'hin', 'isl', 'vie', 'lfn', 'uig', 'bre', 'tuk', 'kor', 'ile', 'eus', 'cat', 'yue', 'oci', 'hrv', 'ido', 'aze', 'ben', 'glg', 'wuu', 'mhr', 'slk', 'afr', 'avk', 'cor', 'run', 'gos', 'vol', 'est']


70

We extract the texts in these languages

In [10]:
dataset = list(filter(lambda x: x[1] in considered_langs, dataset_raw))
print(len(dataset))
dataset[:5]

7934544


[('4116828', 'ara', 'حمّد الشمس!'),
 ('3422297', 'eng', 'Tom rarely goes to church.'),
 ('5875935', 'ukr', 'Я одягнув фартук.'),
 ('5012913', 'por', 'Foi uma história estranha.'),
 ('108532', 'jpn', '彼は義務の観念がすっかりなくなっている。')]

## Feature Extraction

### Functions to Count Character N-grams

We convert the n-grams into integer indices with hash codes.

The number of hash codes we use for each n-gram order:

In [11]:
MAX_CHARS = 4096
MAX_BIGRAMS = 4096
MAX_TRIGRAMS = 4096

We normalize the counts as in CLD3

In [12]:
def normalize(d):
    sum_chars = sum(d.values())
    d = {k:v/sum_chars for k, v in d.items()}
    return d

We compute the hash code and add one so that no code equals 0, which is reserved as the padding symbol in the subsequent matrices.

In [13]:
from collections import Counter

def hash_chars(string, lc=True):
    if lc:
        string = string.lower()
    hash_codes = map(lambda x: hash(x) % MAX_CHARS + 1, string)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d

In [14]:
def hash_bigrams(string, lc=True):
    if lc:
        string = string.lower()
    bigrams = [string[i:i + 2] for i in range(len(string) - 1)]
    hash_codes = map(lambda x: hash(x) % MAX_BIGRAMS + 1, bigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d

In [15]:
def hash_trigrams(string, lc=True):
    if lc:
        string = string.lower()
    trigrams = [string[i:i + 3] for i in range(len(string) - 2)]
    hash_codes = map(lambda x: hash(x) % MAX_TRIGRAMS + 1, trigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    return d

In [16]:
hash_chars("Let's try something.")

{1720: 0.05,
 912: 0.1,
 1174: 0.15,
 1286: 0.05,
 1626: 0.1,
 2081: 0.1,
 1644: 0.05,
 1625: 0.05,
 962: 0.05,
 549: 0.05,
 3287: 0.05,
 2507: 0.05,
 298: 0.05,
 3436: 0.05,
 2847: 0.05}

In [17]:
hash_bigrams("Let's try something.")

{612: 0.05263157894736842,
 212: 0.10526315789473684,
 1199: 0.05263157894736842,
 1637: 0.05263157894736842,
 3578: 0.05263157894736842,
 834: 0.05263157894736842,
 1967: 0.05263157894736842,
 354: 0.05263157894736842,
 2240: 0.05263157894736842,
 4024: 0.05263157894736842,
 1038: 0.05263157894736842,
 52: 0.05263157894736842,
 3847: 0.05263157894736842,
 2421: 0.05263157894736842,
 1845: 0.05263157894736842,
 3235: 0.05263157894736842,
 406: 0.05263157894736842,
 1295: 0.05263157894736842}

In [18]:
hash_trigrams("Let's try something.")

{2225: 0.05555555555555555,
 3547: 0.05555555555555555,
 3312: 0.05555555555555555,
 2471: 0.05555555555555555,
 707: 0.05555555555555555,
 1164: 0.05555555555555555,
 2432: 0.05555555555555555,
 2986: 0.05555555555555555,
 1721: 0.05555555555555555,
 642: 0.05555555555555555,
 50: 0.05555555555555555,
 3620: 0.05555555555555555,
 3686: 0.05555555555555555,
 1255: 0.05555555555555555,
 1127: 0.05555555555555555,
 3327: 0.05555555555555555,
 280: 0.05555555555555555,
 3610: 0.05555555555555555}
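
One caveat with the hashing functions above: Python's built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so the codes printed here will generally differ in another session, and a saved model would then receive different codes at prediction time. A minimal sketch of a session-independent alternative, assuming `zlib.crc32` as the hash; the names `stable_hash` and `stable_hash_ngrams` are hypothetical and not part of the notebook:

```python
import zlib
from collections import Counter

MAX_TRIGRAMS = 4096  # same table size as in the notebook

def stable_hash(ngram, modulo):
    """Map an n-gram to a code in 1..modulo, identical in every Python process."""
    return zlib.crc32(ngram.encode('utf-8')) % modulo + 1

def stable_hash_ngrams(string, n, modulo, lc=True):
    """Relative frequencies of the hashed character n-grams of a string."""
    if lc:
        string = string.lower()
    ngrams = [string[i:i + n] for i in range(len(string) - n + 1)]
    counts = Counter(stable_hash(g, modulo) for g in ngrams)
    total = sum(counts.values())
    return {code: cnt / total for code, cnt in counts.items()}

features = stable_hash_ngrams("Let's try something.", 3, MAX_TRIGRAMS)
print(features)
```

Swapping the hash changes every code, so the outputs shown in this notebook would not be reproduced digit for digit.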

### Counting the ngrams in the dataset

We add the character, bigram, and trigram counts to the texts. The format is:
`(text_id, language_id, text, char_cnt, bigram_cnt, trigram_cnt)`

In [19]:
dataset_feat = list(map(lambda x: x + (hash_chars(x[-1]), hash_bigrams(x[-1]), hash_trigrams(x[-1])), 
                              dataset))
dataset_feat[:2]

[('4116828',
  'ara',
  'حمّد الشمس!',
  {107: 0.09090909090909091,
   1924: 0.18181818181818182,
   2548: 0.09090909090909091,
   3809: 0.09090909090909091,
   2081: 0.09090909090909091,
   1499: 0.09090909090909091,
   4001: 0.09090909090909091,
   876: 0.09090909090909091,
   725: 0.09090909090909091,
   1642: 0.09090909090909091},
  {64: 0.1,
   1215: 0.1,
   70: 0.1,
   823: 0.1,
   423: 0.1,
   224: 0.1,
   4086: 0.1,
   2854: 0.1,
   2700: 0.1,
   1239: 0.1},
  {21: 0.1111111111111111,
   3898: 0.1111111111111111,
   3814: 0.1111111111111111,
   1660: 0.1111111111111111,
   1512: 0.1111111111111111,
   2122: 0.1111111111111111,
   484: 0.1111111111111111,
   3715: 0.1111111111111111,
   1440: 0.1111111111111111}),
 ('3422297',
  'eng',
  'Tom rarely goes to church.',
  {1174: 0.07692307692307693,
   962: 0.11538461538461539,
   549: 0.038461538461538464,
   2081: 0.15384615384615385,
   1644: 0.11538461538461539,
   641: 0.038461538461538464,
   912: 0.07692307692307693,
   1720

In [20]:
dataset_feat[0][3].keys()

dict_keys([107, 1924, 2548, 3809, 2081, 1499, 4001, 876, 725, 1642])

In [21]:
dataset_feat[0][4].keys()

dict_keys([64, 1215, 70, 823, 423, 224, 4086, 2854, 2700, 1239])

### Unique ngrams

We now extract all the unique ngrams (hash codes).

In [22]:
unique_chars = set([item for text in dataset_feat for item in list(text[3].keys())])

In [23]:
unique_bigrams = set([item for text in dataset_feat for item in list(text[4].keys())])

In [24]:
unique_trigrams = set([item for text in dataset_feat for item in list(text[5].keys())])

## Building the Keras Architecture

### Embeddings

The number of distinct character hash codes in the dataset

In [25]:
unique_char_cnt = len(unique_chars)
unique_char_cnt

3507

What is the max length of the feature vectors for the characters?

In [26]:
def length(i, j):
    return len(dataset_feat[i][j].keys())

max([length(i, 3) for i in range(len(dataset_feat))])

121

In [27]:
import statistics
print(statistics.mean([length(i, 3) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 3) for i in range(len(dataset_feat))]))
statistics.median([length(i, 3) for i in range(len(dataset_feat))])

16.240892608321285
3.7984960581432814


16.0

In [28]:
unique_bigram_cnt = len(unique_bigrams)
unique_bigram_cnt

4096

What is the max length of the feature vectors for the bigrams?

In [29]:
max([length(i, 4) for i in range(len(dataset_feat))])

333

In [30]:
print(statistics.mean([length(i, 4) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 4) for i in range(len(dataset_feat))]))
statistics.median([length(i, 4) for i in range(len(dataset_feat))])

29.877754915720423
14.138026362815763


28.0

In [31]:
unique_trigram_cnt = len(unique_trigrams)
unique_trigram_cnt

4096

What is the max length of the feature vectors for the trigrams?

In [32]:
max([length(i, 5) for i in range(len(dataset_feat))])

748

In [33]:
print(statistics.mean([length(i, 5) for i in range(len(dataset_feat))]))
print(statistics.stdev([length(i, 5) for i in range(len(dataset_feat))]))
statistics.median([length(i, 5) for i in range(len(dataset_feat))])

32.24697600769496
19.16247970967665


29.0

In [34]:
MAXLEN_CHARS = 20
MAXLEN_BIGRAMS = 50
MAXLEN_TRIGRAMS = 70
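
The three maximum lengths are well below the longest feature vectors measured above (121, 333, and 748), so some texts will be truncated. One way to judge such a cutoff is to measure the share of texts that fit within it. The sketch below uses made-up lengths; the `coverage` helper is hypothetical, and in the notebook the lengths would come from `[length(i, 3) for i in range(len(dataset_feat))]`:

```python
def coverage(lengths, maxlen):
    """Fraction of sequences that fit within maxlen without truncation."""
    return sum(l <= maxlen for l in lengths) / len(lengths)

# Hypothetical toy lengths, not taken from the dataset
toy_lengths = [10, 15, 16, 16, 18, 20, 22, 30]
print(coverage(toy_lengths, 20))  # 0.75
```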

### The architecture

In [35]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import layers, optimizers, backend

# Char input: padded hash codes -> 64-dim embeddings -> average (padding masked)
char_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_input)
flattened_chars = layers.GlobalAveragePooling1D()(embedded_chars)
#flattened_chars = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_chars)

# Bigram input
bigram_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_input)
flattened_bigrams = layers.GlobalAveragePooling1D()(embedded_bigrams)
#flattened_bigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_bigrams)

# Trigram input
trigram_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_input)
flattened_trigrams = layers.GlobalAveragePooling1D()(embedded_trigrams)
#flattened_trigrams = layers.Lambda(lambda x: backend.mean(x, axis=1))(embedded_trigrams)

# Concatenate the three averaged embeddings, then classify with one hidden layer
flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)
dense_layer = layers.Dense(512, activation='relu')(flattened)
# dense_layer = layers.Dropout(0.6)(dense_layer)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)
model = Model([char_input, bigram_input, trigram_input], lang_output)
model.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input (InputLayer)         [(None, 20)]         0                                            
__________________________________________________________________________________________________
bigram_input (InputLayer)       [(None, 50)]         0                                            
__________________________________________________________________________________________________
trigram_input (InputLayer)      [(None, 70)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 20, 64)       262208      char_input[0][0]                 
______________________________________________________________________________________________
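
The model averages the embeddings of each input with equal weights, whereas the list of missing features at the top of this notebook mentions a weighted average. A minimal pure-Python sketch of the difference, assuming the weights would be the normalized frequencies returned by the hashing functions; the 2-dimensional embedding table and both helper names are made up for illustration:

```python
# Hypothetical toy embedding table: hash code -> 2-dimensional vector
embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0]}
# Normalized counts in the format returned by hash_chars(): code -> frequency
freqs = {1: 0.75, 2: 0.25}

def plain_average(freqs, embeddings):
    """Equal-weight mean of the embeddings of the codes that occur."""
    vecs = [embeddings[code] for code in freqs]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def weighted_average(freqs, embeddings):
    """Mean of the embeddings weighted by the relative frequency of each code."""
    dim = len(next(iter(embeddings.values())))
    out = [0.0] * dim
    for code, weight in freqs.items():
        for j, x in enumerate(embeddings[code]):
            out[j] += weight * x
    return out

print(plain_average(freqs, embeddings))     # [0.5, 0.5]
print(weighted_average(freqs, embeddings))  # [0.75, 0.25]
```

With equal weights, a code seen once counts as much as a code seen many times; the weighted version lets the frequent code dominate.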

## Building $\mathbf{X}$ and $\mathbf{y}$

### The $\mathbf{X}$ matrix

We extract the keys (hash codes) from the corpus

In [36]:
X_list_chars = [list(dataset_feat[i][3].keys()) for i in range(len(dataset_feat))]
print(X_list_chars[:2])

[[107, 1924, 2548, 3809, 2081, 1499, 4001, 876, 725, 1642], [1174, 962, 549, 2081, 1644, 641, 912, 1720, 1625, 3436, 1626, 146, 3287, 2503, 2847]]


In [37]:
X_list_bigrams = [list(dataset_feat[i][4].keys()) for i in range(len(dataset_feat))]
print(X_list_bigrams[:2])

[[64, 1215, 70, 823, 423, 224, 4086, 2854, 2700, 1239], [1556, 52, 1340, 1016, 1512, 3940, 1959, 2670, 2771, 2240, 2046, 2248, 3001, 836, 3578, 834, 1553, 1832, 3127, 1008, 1982, 516, 3510]]


In [38]:
X_list_trigrams = [list(dataset_feat[i][5].keys()) for i in range(len(dataset_feat))]
print(X_list_trigrams[:2])

[[21, 3898, 3814, 1660, 1512, 2122, 484, 3715, 1440], [309, 3069, 3183, 1941, 1134, 3592, 469, 1722, 3063, 859, 803, 271, 36, 2472, 707, 789, 2385, 667, 1082, 2526, 1814, 1762, 2440, 2699]]


We then pad the sequences with zeros at the end, up to the maximum lengths chosen above

In [39]:
# Use the Keras bundled with TensorFlow, as for the model definition
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_chars = pad_sequences(X_list_chars, padding='post', maxlen=MAXLEN_CHARS)
X_chars[:2]

array([[ 107, 1924, 2548, 3809, 2081, 1499, 4001,  876,  725, 1642,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [1174,  962,  549, 2081, 1644,  641,  912, 1720, 1625, 3436, 1626,
         146, 3287, 2503, 2847,    0,    0,    0,    0,    0]],
      dtype=int32)

In [40]:
X_bigrams = pad_sequences(X_list_bigrams, padding='post', maxlen=MAXLEN_BIGRAMS)
X_bigrams[:2]

array([[  64, 1215,   70,  823,  423,  224, 4086, 2854, 2700, 1239,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [1556,   52, 1340, 1016, 1512, 3940, 1959, 2670, 2771, 2240, 2046,
        2248, 3001,  836, 3578,  834, 1553, 1832, 3127, 1008, 1982,  516,
        3510,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)

In [41]:
X_trigrams = pad_sequences(X_list_trigrams, padding='post', maxlen=MAXLEN_TRIGRAMS)
X_trigrams[:2]

array([[  21, 3898, 3814, 1660, 1512, 2122,  484, 3715, 1440,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [ 309, 3069, 3183, 1941, 1134, 3592,  469, 1722, 3063,  859,  803,
         271,   36, 2472,  707,  789, 2385,  667, 1082, 2526, 1814, 1762,
        2440, 2699,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)
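
Note that `pad_sequences` also truncates: with `padding='post'` and the default `truncating='pre'`, a sequence longer than `maxlen` loses its first items. This does happen here, since the longest character feature vector has 121 codes while `MAXLEN_CHARS` is 20. The sketch below mimics that behavior in plain Python; `pad_post` is a hypothetical stand-in for illustration, not the Keras function:

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'.

    Assumes maxlen >= 1.
    """
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen items
    return seq + [value] * (maxlen - len(seq))   # fill the end with the padding value

print(pad_post([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```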

### The $\mathbf{y}$ vector

We extract the labels (the language codes)

In [42]:
y_list = [dataset_feat[i][1] for i in range(len(dataset_feat))]
y_list[:10]

['ara', 'eng', 'ukr', 'por', 'jpn', 'ita', 'fra', 'eng', 'ita', 'eng']

In [43]:
y_set = set(y_list)
inx2lang = dict(enumerate(y_set))
lang2inx = {v: k for k, v in inx2lang.items()}
lang2inx

{'rus': 0,
 'eus': 1,
 'lat': 2,
 'hrv': 3,
 'heb': 4,
 'vol': 5,
 'glg': 6,
 'pes': 7,
 'tlh': 8,
 'mar': 9,
 'ber': 10,
 'epo': 11,
 'toki': 12,
 'ron': 13,
 'fin': 14,
 'nob': 15,
 'lfn': 16,
 'nld': 17,
 'wuu': 18,
 'nds': 19,
 'srp': 20,
 'kor': 21,
 'pol': 22,
 'deu': 23,
 'eng': 24,
 'vie': 25,
 'cat': 26,
 'aze': 27,
 'dan': 28,
 'avk': 29,
 'ita': 30,
 'spa': 31,
 'bel': 32,
 'tgl': 33,
 'mkd': 34,
 'est': 35,
 'ben': 36,
 'mhr': 37,
 'jpn': 38,
 'oci': 39,
 'uig': 40,
 'tuk': 41,
 'jbo': 42,
 'tat': 43,
 'swe': 44,
 'bre': 45,
 'kab': 46,
 'ukr': 47,
 'hin': 48,
 'por': 49,
 'ile': 50,
 'ara': 51,
 'cmn': 52,
 'run': 53,
 'slk': 54,
 'ind': 55,
 'afr': 56,
 'ces': 57,
 'lit': 58,
 'ido': 59,
 'cor': 60,
 'yue': 61,
 'ina': 62,
 'bul': 63,
 'gos': 64,
 'isl': 65,
 'tur': 66,
 'ell': 67,
 'hun': 68,
 'fra': 69}

In [44]:
y_list_num = list(map(lambda x: lang2inx[x], y_list))
y_list_num[:3]

[51, 24, 47]
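
The index mapping above comes from iterating over a set, and the iteration order of a set of strings can change from one Python session to the next. A model saved in one session could then be decoded with the wrong labels in another. A minimal sketch of a reproducible mapping, assuming the labels are simply sorted; it runs on the ten labels printed earlier and uses `sample_`-prefixed names so as not to clash with the notebook's variables:

```python
sample_labels = ['ara', 'eng', 'ukr', 'por', 'jpn', 'ita', 'fra', 'eng', 'ita', 'eng']

# Sorting the distinct labels fixes the index of each language across sessions
sample_inx2lang = dict(enumerate(sorted(set(sample_labels))))
sample_lang2inx = {lang: inx for inx, lang in sample_inx2lang.items()}
sample_nums = [sample_lang2inx[lang] for lang in sample_labels]

print(sample_lang2inx)
print(sample_nums[:3])  # [0, 1, 6]
```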

We encode them as one-hot vectors

In [45]:
from tensorflow.keras.utils import to_categorical
y = to_categorical(y_list_num)
y[:3]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.]], dtype=float32)

In [46]:
"""history = model.fit([X_chars, X_bigrams, X_trigrams], y, 
                    epochs=3,
                   validation_split=0.2)"""

'history = model.fit([X_chars, X_bigrams, X_trigrams], y, \n                    epochs=3,\n                   validation_split=0.2)'

## Training and Validation Sets

### We shuffle the indices

In [47]:
indices = list(range(X_chars.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X_chars = X_chars[indices, :]
X_bigrams = X_bigrams[indices, :]
X_trigrams = X_trigrams[indices, :]
y = np.array(y)[indices]

[1794002, 2745398, 5006546, 365595, 5105077, 6536124, 2713909, 1837034, 6722572, 2354267]


### We split the dataset

In [48]:
training_examples = int(X_chars.shape[0] * 0.8)

X_train_chars = X_chars[:training_examples, :]
X_train_bigrams = X_bigrams[:training_examples, :]
X_train_trigrams = X_trigrams[:training_examples, :]

y_train = y[:training_examples]

X_val_chars = X_chars[training_examples:, :]
X_val_bigrams = X_bigrams[training_examples:, :]
X_val_trigrams = X_trigrams[training_examples:, :]

y_val = y[training_examples:]

## Fitting the Model

In [49]:
X_train_chars[0]

array([ 632, 2503, 1174,  641, 1626, 1816, 3436, 2081, 3942,  962, 1644,
       2847,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)

In [50]:
history = model.fit([X_train_chars, X_train_bigrams, X_train_trigrams], 
                    y_train, 
                    epochs=3,
                    validation_data=([X_val_chars, X_val_bigrams, X_val_trigrams], y_val))

Train on 6347635 samples, validate on 1586909 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Predicting and Evaluating

In [51]:
y_predicted = model.predict([X_val_chars, X_val_bigrams, X_val_trigrams])
print(y_predicted[:2])
print(y_val[:2])

# evaluate the model
scores = model.evaluate([X_val_chars, X_val_bigrams, X_val_trigrams], y_val)
print('Scores:', scores)
list(map(lambda x: print("%s: %.2f%%" % (x[0], x[1] * 100)), zip(model.metrics_names, scores)))

[[8.29119604e-34 2.06736934e-29 6.38253725e-29 4.93435618e-37
  9.85048162e-32 0.00000000e+00 0.00000000e+00 1.02865266e-26
  1.76564762e-34 4.02209204e-33 3.07658352e-18 4.25958699e-24
  9.27007677e-37 1.46505669e-36 1.40878355e-32 2.63277501e-19
  0.00000000e+00 1.00000000e+00 2.08595164e-31 3.37861718e-11
  6.21861170e-27 1.72227218e-29 1.79640911e-28 2.35137519e-21
  1.92654682e-19 7.97181881e-37 4.27708297e-35 0.00000000e+00
  1.01088055e-16 5.38404625e-32 5.82194742e-30 1.93429564e-32
  4.05722683e-33 1.18793228e-38 5.78134080e-36 1.18771087e-37
  8.75117237e-26 1.45657480e-29 2.36645729e-28 7.64476208e-35
  1.61438659e-36 2.98597700e-27 1.62687678e-26 3.09930782e-32
  1.20343621e-22 2.77655788e-33 7.37242169e-21 1.00032606e-34
  1.81362810e-30 1.54944651e-28 0.00000000e+00 2.20623508e-28
  1.58873267e-22 0.00000000e+00 1.94891035e-38 1.13907377e-36
  1.50884300e-14 1.95457018e-29 8.73848467e-36 0.00000000e+00
  0.00000000e+00 7.94853075e-29 0.00000000e+00 1.22689496e-38
  5.4842

[None, None]

### Indices of the predicted and true classes

Prediction

In [52]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['nld', 'epo', 'jpn', 'eng', 'fra', 'rus', 'epo', 'por', 'heb', 'fin']

Truth

In [53]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['nld', 'epo', 'jpn', 'eng', 'fra', 'rus', 'epo', 'por', 'heb', 'fin']

### The detailed F1s

In [55]:
lang_names = sorted(list(lang2inx.keys()), key=lambda x: lang2inx[x])

In [56]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print(f1_score(y_val_symb, y_pred, average='micro'))
print(f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

         rus       0.99      1.00      0.99    146566
         eus       0.89      0.92      0.91      1224
         lat       0.95      0.95      0.95      6682
         hrv       0.60      0.31      0.41      1024
         heb       1.00      1.00      1.00     38896
         vol       0.96      0.90      0.93       601
         glg       0.94      0.48      0.64       865
         pes       1.00      0.99      0.99      4353
         tlh       0.98      0.99      0.98      3514
         mar       0.99      1.00      1.00     10747
         ber       0.88      0.91      0.89     45489
         epo       1.00      1.00      1.00    122253
        toki       0.99      1.00      1.00      6909
         ron       0.99      0.93      0.96      3792
         fin       0.99      0.99      0.99     21341
         nob       0.88      0.68      0.77      2711
         lfn       0.89      0.83      0.86      1630
         nld       0.98    
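
The two averages printed after the report can differ considerably when the classes are imbalanced, as they are here. The micro average pools all decisions, which for single-label classification equals the accuracy, while the macro average gives every language the same weight. A small pure-Python illustration on made-up labels; the `f1` helper is hypothetical and only meant to show the arithmetic:

```python
def f1(y_true, y_pred, label):
    """F1 score of one class, computed from its TP, FP, and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: a frequent class predicted well, a rare class predicted poorly
y_true = ['eng'] * 8 + ['hrv'] * 2
y_pred = ['eng'] * 9 + ['hrv']

labels = sorted(set(y_true))
macro = sum(f1(y_true, y_pred, label) for label in labels) / len(labels)
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print('micro: %.3f  macro: %.3f' % (micro, macro))  # micro: 0.900  macro: 0.804
```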

### Confusion Matrix

In [57]:
from sklearn.metrics import confusion_matrix
print(lang2inx)
cf = confusion_matrix(y_val_symb, y_pred)
print(cf)

{'rus': 0, 'eus': 1, 'lat': 2, 'hrv': 3, 'heb': 4, 'vol': 5, 'glg': 6, 'pes': 7, 'tlh': 8, 'mar': 9, 'ber': 10, 'epo': 11, 'toki': 12, 'ron': 13, 'fin': 14, 'nob': 15, 'lfn': 16, 'nld': 17, 'wuu': 18, 'nds': 19, 'srp': 20, 'kor': 21, 'pol': 22, 'deu': 23, 'eng': 24, 'vie': 25, 'cat': 26, 'aze': 27, 'dan': 28, 'avk': 29, 'ita': 30, 'spa': 31, 'bel': 32, 'tgl': 33, 'mkd': 34, 'est': 35, 'ben': 36, 'mhr': 37, 'jpn': 38, 'oci': 39, 'uig': 40, 'tuk': 41, 'jbo': 42, 'tat': 43, 'swe': 44, 'bre': 45, 'kab': 46, 'ukr': 47, 'hin': 48, 'por': 49, 'ile': 50, 'ara': 51, 'cmn': 52, 'run': 53, 'slk': 54, 'ind': 55, 'afr': 56, 'ces': 57, 'lit': 58, 'ido': 59, 'cor': 60, 'yue': 61, 'ina': 62, 'bul': 63, 'gos': 64, 'isl': 65, 'tur': 66, 'ell': 67, 'hun': 68, 'fra': 69}
[[146035      0      0 ...      0      0      0]
 [     0   1129      2 ...      0      4      0]
 [     1      4   6343 ...      0      3     40]
 ...
 [     0      0      0 ...   6069      0      0]
 [     1      1      5 ...      0  53

The most frequent confusions for some languages

In [58]:
languages = ['fra', 'eng', 'swe']

In [59]:
for language in languages:
    if language not in lang2inx:
        continue
    print('Language:', language)
    print('Confusions:', cf[lang2inx[language]])
    print('Most confused:',
          inx2lang[np.argsort(cf[lang2inx[language]])[-2]], 
          np.sort(cf[lang2inx[language]])[-2] / np.sum(cf[lang2inx[language]]))
    print('====')

Language: fra
Confusions: [    1     1     8     0     0     0     0     0     5     0    10     4
     0     4     6     1     5     3     0     0     1     0     0    10
    27     0    13     0     8    10    70    44     0     2     0     0
     0     0     0    39     0     0     0     0     2    15     3     0
     0    22    22     0     1     1     1     0     3     5     4     2
     3     0    15     0     1     0     2     0     2 80189]
Most confused: ita 0.0008688636504685657
====
Language: eng
Confusions: [     2      2     21      0      1      3      0      0      7      0
     27      5      1      4     20      2      2     55      0     15
      4      0      7     36 252676      8     11      0     45      6
     58     37      0     12      0      2      0      0      0      0
      0      4      0      0      9     18      5      0      2     29
     21      1      2      4      2      3     21     13      3      3
      8      0     21      0      4      0     30
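
The loop above takes the second-largest entry of each row, which assumes that the diagonal (the correct predictions) is the largest. A sketch that excludes the true class explicitly and returns the top few confusions, written on a made-up 3x3 matrix with rows as true classes and columns as predictions; `top_confusions` is a hypothetical helper and assumes the row is not all zeros:

```python
def top_confusions(cf, labels, true_label, k=3):
    """Most frequent wrong predictions for one true class, as (label, rate) pairs."""
    row = cf[labels.index(true_label)]
    total = sum(row)
    wrong = [(labels[j], n / total) for j, n in enumerate(row)
             if labels[j] != true_label and n > 0]
    return sorted(wrong, key=lambda pair: pair[1], reverse=True)[:k]

# Toy confusion matrix, not taken from the results above
labels = ['eng', 'fra', 'ita']
cf = [[90, 4, 6],
      [1, 95, 4],
      [2, 8, 90]]
print(top_confusions(cf, labels, 'fra'))  # [('ita', 0.04), ('eng', 0.01)]
```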

## Prediction

In [60]:
sentence = "Salut les gars !"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'fra'

In [61]:
sentence = "Hello guys!"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'eng'

In [62]:
sentence = "Hejsan grabbar!"
preds= model.predict(
    [pad_sequences([list(hash_chars(sentence).keys())], padding='post', maxlen=MAXLEN_CHARS),
    pad_sequences([list(hash_bigrams(sentence).keys())], padding='post', maxlen=MAXLEN_BIGRAMS),
    pad_sequences([list(hash_trigrams(sentence).keys())], padding='post', maxlen=MAXLEN_TRIGRAMS)])
inx2lang[np.argmax(preds)]

'dan'
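
The three prediction cells above repeat the same steps, so they could be folded into one helper. The sketch below is an assumption about how this might look, not code from the notebook: `predict_language` and its parameters are hypothetical, and the model is passed in as a plain `predict` callable so that the example runs without TensorFlow.

```python
def predict_language(sentence, predict, featurizers, maxlens, inx2lang):
    """Encode one sentence, call the model, and return the most probable label."""
    inputs = []
    for featurize, maxlen in zip(featurizers, maxlens):
        codes = list(featurize(sentence).keys())[-maxlen:]    # truncate at the start
        inputs.append([codes + [0] * (maxlen - len(codes))])  # batch of one, zero-padded
    probs = predict(inputs)[0]
    best = max(range(len(probs)), key=probs.__getitem__)      # argmax without NumPy
    return inx2lang[best]

# Stand-ins for the notebook's hashing functions and trained model
toy_featurizer = lambda s: {ord(c) % 50 + 1: 1.0 for c in s.lower()}
toy_predict = lambda inputs: [[0.1, 0.9]]
print(predict_language("Salut !", toy_predict, (toy_featurizer,), (6,),
                       {0: 'eng', 1: 'fra'}))  # fra
```

In the notebook, `predict` would presumably be something like `lambda xs: model.predict([np.array(x) for x in xs])`, with `(hash_chars, hash_bigrams, hash_trigrams)` as featurizers and the three `MAXLEN_*` constants as lengths.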