# Keras Implementation of CLD3

Author: Pierre Nugues

Reimplementation of Google's _Compact language detector_ (CLD3) from a high-level description. Source: ``https://github.com/google/cld3``

Still missing:
* generator or train_on_batch

## A Dataset: *Tatoeba*

As dataset, we use Tatoeba: A database of texts with language tags. The corpus is available here: https://tatoeba.org/eng/downloads

### Formatting the Dataset

We read the dataset and we split the lines

In [1]:
dataset_raw = open('sentences.csv', encoding='utf8').read().strip()
dataset_raw = dataset_raw.split('\n')
dataset_raw[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

We split the fields and we remove possible whitespaces

In [2]:
dataset_raw = list(map(lambda x: tuple(x.split('\t')), dataset_raw))
dataset_raw = list(map(lambda x: tuple(map(str.strip, x)), dataset_raw))
print(len(dataset_raw), 'texts')
dataset_raw[:3]

8023136 texts


[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

We pad strings that are less than three characters. If not done, training will crash. We also limit the length of the texts.

In [3]:
MAXLEN_TEXT = 200
for i in range(len(dataset_raw)):
    if len(dataset_raw[i][2]) == 0:
        dataset_raw[i] = (dataset_raw[i][0], dataset_raw[i][1], dataset_raw[i][2] + '   ')
    if len(dataset_raw[i][2]) == 1: 
        dataset_raw[i] = (dataset_raw[i][0], dataset_raw[i][1], dataset_raw[i][2] + '  ')
    if len(dataset_raw[i][2]) == 2:
        dataset_raw[i] = (dataset_raw[i][0], dataset_raw[i][1], dataset_raw[i][2] + ' ')
    dataset_raw[i] = (dataset_raw[i][0], dataset_raw[i][1], dataset_raw[i][2][:MAXLEN_TEXT])

We shuffle the dataset

In [4]:
from random import shuffle, seed
import numpy as np
np.random.seed(1234)
shuffle(dataset_raw)

We can decimate the dataset to have faster training times

In [5]:
DECIMATE = False
if DECIMATE:
    dataset_raw = dataset_raw[:int(len(dataset_raw)/10)]
dataset_raw[:3]

[('3246907', 'ita', 'Non avete altro di cui parlare?'),
 ('6959233', 'fra', "Quelqu'un a volé mon déjeuner."),
 ('1180391', 'deu', 'Ich habe nochmal überlegt und meine Meinung geändert.')]

The languages. Some texts have no language tag, and some others are marked with the cryptic \\\\N

### Understanding the Dataset

How many languages?

In [6]:
languages = set([x[1] for x in dataset_raw])
len(languages)

347

We count the texts per language

In [7]:
def count_texts(dataset):
    text_counts = {}
    for record in dataset:
        lang = record[1]
        if lang in text_counts:
            text_counts[lang] += 1
        else:
            text_counts[lang] = 1
    return text_counts

Languages with the most examples

In [8]:
text_counts = count_texts(dataset_raw)
langs = sorted(text_counts.keys(), key=text_counts.get, reverse=True)
[(lang, text_counts[lang]) for lang in langs][:25]

[('eng', 1264754),
 ('ita', 738799),
 ('rus', 732078),
 ('tur', 684619),
 ('epo', 609518),
 ('deu', 488568),
 ('fra', 402078),
 ('por', 347430),
 ('spa', 314873),
 ('hun', 269208),
 ('ber', 226376),
 ('heb', 195329),
 ('jpn', 187684),
 ('ukr', 154466),
 ('kab', 129245),
 ('fin', 106936),
 ('nld', 104337),
 ('pol', 99754),
 ('mkd', 77778),
 ('cmn', 60993),
 ('mar', 54656),
 ('dan', 43995),
 ('lit', 38408),
 ('ces', 37596),
 ('toki', 35017)]

We consider languages that have more than 3,000 examples in the dataset or we only use those in French, English, and Swedish

In [9]:
SMALL_LANGUAGE_SET = False
considered_langs = [lang for lang in langs if text_counts[lang] > 3000]
if SMALL_LANGUAGE_SET:
    considered_langs = ['fra', 'eng', 'swe']
print(considered_langs)
LANG_NBR = len(considered_langs)
LANG_NBR

['eng', 'ita', 'rus', 'tur', 'epo', 'deu', 'fra', 'por', 'spa', 'hun', 'ber', 'heb', 'jpn', 'ukr', 'kab', 'fin', 'nld', 'pol', 'mkd', 'cmn', 'mar', 'dan', 'lit', 'ces', 'toki', 'swe', 'ara', 'lat', 'ell', 'srp', 'ina', 'bul', 'pes', 'ron', 'nds', 'tlh', 'jbo', 'nob', 'tat', 'tgl', 'ind', 'bel', 'hin', 'isl', 'vie', 'lfn', 'uig', 'bre', 'tuk', 'kor', 'ile', 'eus', 'cat', 'yue', 'oci', 'hrv', 'ido', 'aze', 'ben', 'glg', 'wuu', 'mhr', 'slk', 'afr', 'avk', 'cor', 'run', 'gos', 'vol', 'est']


70

We extract the texts in these languages. This will form our dataset.

In [10]:
dataset = list(filter(lambda x: x[1] in considered_langs, dataset_raw))
print(len(dataset))
dataset[:5]

7934544


[('3246907', 'ita', 'Non avete altro di cui parlare?'),
 ('6959233', 'fra', "Quelqu'un a volé mon déjeuner."),
 ('1180391', 'deu', 'Ich habe nochmal überlegt und meine Meinung geändert.'),
 ('5465336', 'epo', 'Ŝia nomo estas amuza.'),
 ('339801',
  'spa',
  'Están haciendo planes para colonizar el estado de Missouri.')]

## Feature Extraction

### Functions to Count Characters Ngrams

We use hash codes to convert ngrams

The number of codes we use

In [11]:
MAX_CHARS = 4096
MAX_BIGRAMS = 8192
MAX_TRIGRAMS = 8192

We normalize the counts as in CLD3

In [12]:
def normalize(d):
    sum_chars = sum(d.values())
    d = {k:v/sum_chars for k, v in d.items()}
    return d

We compute the hash code and we add one to avoid a value of 0 as it is padding symbol in the subsequent matrices.

By default, we set the characters in lowercase and we sort the ngrams by frequency order.

In [13]:
from collections import Counter

def hash_chars(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    hash_codes = map(lambda x: hash(x) % MAX_CHARS + 1, string)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

In [14]:
def hash_bigrams(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    bigrams = [string[i:i + 2] for i in range(len(string) - 1)]
    hash_codes = map(lambda x: hash(x) % MAX_BIGRAMS + 1, bigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

In [15]:
def hash_trigrams(string, lc=True, freq_sort=True):
    if lc:
        string = string.lower()
    trigrams = [string[i:i + 3] for i in range(len(string) - 2)]
    hash_codes = map(lambda x: hash(x) % MAX_TRIGRAMS + 1, trigrams)
    d = dict(Counter(hash_codes))
    d = normalize(d)
    if freq_sort:
        k, v = zip(*sorted(d.items(), key=lambda x: x[1], reverse=True))
    else:
        k, v = zip(*d.items())
    return k, v

Google's example in CLD3's presentation's text (``https://github.com/google/cld3``)

In [91]:
print('Chars:', hash_chars('Banana'))
print('Bigrams:', hash_bigrams('Banana'))
print('Trigrams:', hash_trigrams('Banana'))

Chars: ((208, 324, 541), (0.5, 0.3333333333333333, 0.16666666666666666))
Bigrams: ((4366, 3769, 8123), (0.4, 0.4, 0.2))
Trigrams: ((7381, 5853, 1734), (0.5, 0.25, 0.25))


Another sentence

In [17]:
print('Chars:', hash_chars("Let's try something."))
print('Bigrams:', hash_bigrams("Let's try something."))
print('Trigrams:', hash_trigrams("Let's try something."))

Chars: ((1798, 3017, 1439, 1376, 1222, 2656, 582, 1139, 3558, 3207, 1137, 1998, 324, 360, 3942), (0.15, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
Bigrams: ((7361, 6888, 5951, 4396, 170, 2820, 7025, 5300, 2004, 2358, 2661, 3195, 3169, 295, 7062, 4618, 6726, 3741), (0.10526315789473684, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842))
Trigrams: ((2381, 1664, 7147, 2312, 6912, 1265, 7845, 2982, 6866, 328, 4308, 6065, 6090, 7248, 3746, 1145, 4175, 6034), (0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.05555555555555555, 0.0555

### Building the $\mathbf{X}$ lists

We compute the character, bigram, and trigram counts of the texts and we create $\mathbf{X}$ lists

In [18]:
from tqdm import tqdm
X_list_inx_chars = []
X_list_freq_chars = []
X_list_inx_bigrams = []
X_list_freq_bigrams = []
X_list_inx_trigrams = []
X_list_freq_trigrams = []

for i in tqdm(range(len(dataset))):
    k, v = hash_chars(dataset[i][-1])
    X_list_inx_chars.append(k)
    X_list_freq_chars.append(v)
    
    k, v = hash_bigrams(dataset[i][-1])
    X_list_inx_bigrams.append(k)
    X_list_freq_bigrams.append(v)
    
    k, v = hash_trigrams(dataset[i][-1])
    X_list_inx_trigrams.append(k)
    X_list_freq_trigrams.append(v)

100%|██████████| 7934544/7934544 [14:26<00:00, 9161.50it/s]  


In [19]:
print(X_list_inx_chars[:2])
print(X_list_freq_chars[:2])
print(X_list_inx_bigrams[:2])
print(X_list_freq_bigrams[:2])
print(X_list_inx_trigrams[:2])
print(X_list_freq_trigrams[:2])

[(1376, 208, 3017, 582, 324, 3558, 1798, 1222, 1998, 3393, 3769, 3050, 3416, 1087, 585), (3416, 1376, 3017, 324, 499, 1222, 3558, 2112, 2656, 208, 3393, 3207, 3769, 2809, 582, 3942)]
[(0.16129032258064516, 0.12903225806451613, 0.0967741935483871, 0.0967741935483871, 0.06451612903225806, 0.06451612903225806, 0.06451612903225806, 0.06451612903225806, 0.06451612903225806, 0.03225806451612903, 0.03225806451612903, 0.03225806451612903, 0.03225806451612903, 0.03225806451612903, 0.03225806451612903), (0.13333333333333333, 0.13333333333333333, 0.1, 0.1, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333, 0.03333333333333333)]
[(7312, 2879, 3512, 7948, 4768, 5074, 1556, 8133, 7361, 112, 6891, 5723, 707, 7025, 6774, 7800, 3904, 3949, 7235, 7220, 3832, 1329, 2610, 7198, 5408, 4609, 5108), (676, 3234, 5074, 5858, 6986, 34

### Unique ngrams

We now extract all the unique ngrams (hash codes). This part is not necessary to train the model and it could be skipped. We use it to determine if we have enough hash codes and the feature vector lengths.

In [20]:
unique_chars = set()
for x_list_inx_chars in X_list_inx_chars:
    unique_chars.update(set(x_list_inx_chars))

unique_bigrams = set()
for x_list_inx_bigrams in X_list_inx_bigrams:
    unique_bigrams.update(set(x_list_inx_bigrams))

unique_trigrams = set()
for x_list_inx_trigrams in X_list_inx_trigrams:
    unique_trigrams.update(set(x_list_inx_trigrams))

We check the hash coding capacity. Have we used all the hash codes? Will there be collisions?

In [21]:
unique_char_cnt = len(unique_chars)
unique_char_cnt

3488

In [22]:
unique_bigram_cnt = len(unique_bigrams)
unique_bigram_cnt

8192

In [23]:
unique_trigram_cnt = len(unique_trigrams)
unique_trigram_cnt

8192

### Length of the index vectors

How will we align the $\mathbf{X}$ matrices? 

We will compute the maximal length of the exhaustive feature lists and we use: max(median, mean) plus two standard deviations (roughly).

#### Characters

What is the max length of the feature vectors for the characters?

In [24]:
char_vect_lengths = [len(x_list_inx_chars) for x_list_inx_chars in X_list_inx_chars]
max(char_vect_lengths)

121

In [25]:
import statistics
print(statistics.mean(char_vect_lengths))
print(statistics.stdev(char_vect_lengths))
statistics.median(char_vect_lengths)

16.216995960952513
3.767972324072555


16.0

#### Bigrams

What is the max length of the feature vectors for the bigrams?

In [26]:
bigram_vect_lengths = [len(x_list_inx_bigrams) for x_list_inx_bigrams in X_list_inx_bigrams]
max(bigram_vect_lengths)

180

In [27]:
print(statistics.mean(bigram_vect_lengths))
print(statistics.stdev(bigram_vect_lengths))
statistics.median(bigram_vect_lengths)

29.785678672901682
13.752314194807772


28.0

#### Trigrams

What is the max length of the feature vectors for the trigrams?

In [28]:
trigram_vect_lengths = [len(x_list_inx_trigrams) for x_list_inx_trigrams in X_list_inx_trigrams]
max(trigram_vect_lengths)

192

In [29]:
print(statistics.mean(trigram_vect_lengths))
print(statistics.stdev(trigram_vect_lengths))
statistics.median(trigram_vect_lengths)

32.21706263145053
18.270657997285003


29.0

Here are the maximal lengths: max(median, mean) + 2 x stdev and a small margin.

In [30]:
MAXLEN_CHARS = 30
MAXLEN_BIGRAMS = 60
MAXLEN_TRIGRAMS = 70

## Building $\mathbf{X}$ and $\mathbf{y}$

We can now build the matrices by converting the $\mathbf{X}$ lists into arrays and padding them.

### The $\mathbf{X}$ matrix

#### Characters

We pad the character sequences

In [31]:
from keras.preprocessing.sequence import pad_sequences
X_inx_chars = pad_sequences(X_list_inx_chars, padding='post', maxlen=MAXLEN_CHARS)
X_inx_chars[:2]

Using TensorFlow backend.


array([[1376,  208, 3017,  582,  324, 3558, 1798, 1222, 1998, 3393, 3769,
        3050, 3416, 1087,  585,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0],
       [3416, 1376, 3017,  324,  499, 1222, 3558, 2112, 2656,  208, 3393,
        3207, 3769, 2809,  582, 3942,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]], dtype=int32)

The character frequencies

In [32]:
X_freq_chars = pad_sequences(X_list_freq_chars, padding='post', dtype='float32', maxlen=MAXLEN_CHARS)
X_freq_chars[:2]

array([[0.16129032, 0.12903225, 0.09677419, 0.09677419, 0.06451613,
        0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.03225806,
        0.03225806, 0.03225806, 0.03225806, 0.03225806, 0.03225806,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.13333334, 0.13333334, 0.1       , 0.1       , 0.06666667,
        0.06666667, 0.06666667, 0.06666667, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.03333334, 0.03333334, 0.03333334,
        0.03333334, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]],
      dtype=float32)

#### Bigrams

We do the same for the bigrams.

In [33]:
X_inx_bigrams = pad_sequences(X_list_inx_bigrams, padding='post', maxlen=MAXLEN_BIGRAMS)
X_inx_bigrams[:2]

array([[7312, 2879, 3512, 7948, 4768, 5074, 1556, 8133, 7361,  112, 6891,
        5723,  707, 7025, 6774, 7800, 3904, 3949, 7235, 7220, 3832, 1329,
        2610, 7198, 5408, 4609, 5108,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0],
       [ 676, 3234, 5074, 5858, 6986, 3461, 1053, 5661, 7312, 1518, 3678,
        6711, 7530, 1372, 2467, 1016, 1910, 4768, 3904, 6679, 7697,  796,
        5322, 1030, 7451, 3252,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0]], dtype=int32)

In [34]:
X_freq_bigrams = pad_sequences(X_list_freq_bigrams, padding='post', dtype='float32', maxlen=MAXLEN_BIGRAMS)
X_freq_bigrams[:2]

array([[0.06666667, 0.06666667, 0.06666667, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.03333334, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.03333334, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.03333334, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.03333334, 0.03333334, 0.03333334,
        0.03333334, 0.03333334, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.06896552, 0.06896552, 0.06896552, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.0

#### Trigrams

And the trigrams

In [35]:
X_inx_trigrams = pad_sequences(X_list_inx_trigrams, padding='post', maxlen=MAXLEN_TRIGRAMS)
X_inx_trigrams[:2]

array([[ 499, 6750, 1770, 2488, 6574,  498, 8058, 3663,  301, 1326, 7967,
        4412, 2643, 5409, 1759, 6589, 1243, 2992, 5966, 5363,  241, 2102,
        3884, 4721, 2077, 1693, 2964, 2081, 6522,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [2509, 6276, 4046, 3162, 4184, 2306, 3028,  439, 1770, 4132,  334,
        2884,  874, 7645, 6768, 1903, 1515, 7786, 6750,  206, 5262, 6470,
        2310, 4941, 1525,  227,  639, 6782,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)

In [36]:
X_freq_trigrams = pad_sequences(X_list_freq_trigrams, padding='post', dtype='float32', maxlen=MAXLEN_TRIGRAMS)
X_freq_trigrams[:2]

array([[0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.03448276,
        0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.03571429, 0.03571429, 0.03571429, 0.0

### The $\mathbf{y}$ vector

We extract the responses

In [37]:
y_list = [dataset[i][1] for i in range(len(dataset))]
y_list[:10]

['ita', 'fra', 'deu', 'epo', 'spa', 'fra', 'deu', 'ile', 'tur', 'heb']

We create an index of them.

In [38]:
y_set = set(y_list)
inx2lang = dict(enumerate(y_set))
lang2inx = {v: k for k, v in inx2lang.items()}
lang2inx

{'cat': 0,
 'afr': 1,
 'run': 2,
 'gos': 3,
 'pol': 4,
 'uig': 5,
 'fra': 6,
 'lat': 7,
 'mhr': 8,
 'nob': 9,
 'dan': 10,
 'nds': 11,
 'vie': 12,
 'ron': 13,
 'fin': 14,
 'hun': 15,
 'ukr': 16,
 'bul': 17,
 'ind': 18,
 'avk': 19,
 'vol': 20,
 'pes': 21,
 'glg': 22,
 'bel': 23,
 'slk': 24,
 'eng': 25,
 'bre': 26,
 'wuu': 27,
 'ido': 28,
 'isl': 29,
 'ben': 30,
 'cmn': 31,
 'tat': 32,
 'toki': 33,
 'jbo': 34,
 'oci': 35,
 'mar': 36,
 'ara': 37,
 'kab': 38,
 'kor': 39,
 'lfn': 40,
 'est': 41,
 'ita': 42,
 'eus': 43,
 'nld': 44,
 'mkd': 45,
 'rus': 46,
 'ile': 47,
 'tuk': 48,
 'ber': 49,
 'spa': 50,
 'deu': 51,
 'heb': 52,
 'epo': 53,
 'ces': 54,
 'jpn': 55,
 'tgl': 56,
 'srp': 57,
 'ell': 58,
 'yue': 59,
 'por': 60,
 'hrv': 61,
 'ina': 62,
 'lit': 63,
 'swe': 64,
 'tur': 65,
 'tlh': 66,
 'aze': 67,
 'cor': 68,
 'hin': 69}

In [39]:
y_list_num = list(map(lambda x: lang2inx[x], y_list))
y_list_num[:3]

[42, 6, 51]

We encode them as one-hot vectors

In [40]:
from keras.utils import to_categorical
y = to_categorical(y_list_num)
y[:3]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.]], dtype=float32)

## Training and Validation Sets
We create a training and a validation sets

### We shuffle the indices
We shuffle them again. This is not necessary, as we already shuffled the dataset

In [41]:
indices = list(range(X_inx_chars.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X_inx_chars = X_inx_chars[indices, :]
X_freq_chars = X_freq_chars[indices, :]
X_inx_bigrams = X_inx_bigrams[indices, :]
X_freq_bigrams = X_freq_bigrams[indices, :]
X_inx_trigrams = X_inx_trigrams[indices, :]
X_freq_trigrams = X_freq_trigrams[indices, :]
y = y[indices, :]

[1794002, 2745398, 5006546, 365595, 5105077, 6536124, 2713909, 1837034, 6722572, 2354267]


### We split the dataset

Number of training examples

In [42]:
training_examples = int(X_inx_chars.shape[0] * 0.8)
training_examples

6347635

The training set

In [43]:
X_train_inx_chars = X_inx_chars[:training_examples, :]
X_train_freq_chars = X_freq_chars[:training_examples, :]

X_train_inx_bigrams = X_inx_bigrams[:training_examples, :]
X_train_freq_bigrams = X_freq_bigrams[:training_examples, :]

X_train_inx_trigrams = X_inx_trigrams[:training_examples, :]
X_train_freq_trigrams = X_freq_trigrams[:training_examples, :]

y_train = y[:training_examples]

In [96]:
print(y_train[0])
print(X_train_inx_chars[0])
X_train_freq_chars[0]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1376  582 3017 1137 1798 2809  360  208 2934 3393 3558 3769 3942    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


array([0.1923077 , 0.15384616, 0.11538462, 0.11538462, 0.11538462,
       0.03846154, 0.03846154, 0.03846154, 0.03846154, 0.03846154,
       0.03846154, 0.03846154, 0.03846154, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ],
      dtype=float32)

The validation set

In [44]:
X_val_inx_chars = X_inx_chars[training_examples:, :]
X_val_freq_chars = X_freq_chars[training_examples:, :]

X_val_inx_bigrams = X_inx_bigrams[training_examples:, :]
X_val_freq_bigrams = X_freq_bigrams[training_examples:, :]

X_val_inx_trigrams = X_inx_trigrams[training_examples:, :]
X_val_freq_trigrams = X_freq_trigrams[training_examples:, :]

y_val = y[training_examples:]

# Building the Keras Architecture

We now create an architecture resembling CLD3

In [45]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import layers, optimizers, backend

# Char frequency input
char_freq_input = Input(shape=(MAXLEN_CHARS,), dtype='float32', name='char_freq_input')

# Char index input
char_inx_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_inx_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_inx_input)

# The weighted mean
flattened_chars = backend.dot(char_freq_input, embedded_chars)[0]

# Bigram freq input
bigram_freq_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='float32', name='bigram_freq_input')

# Bigram index input
bigram_inx_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_inx_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_inx_input)

# The weighted mean
flattened_bigrams = backend.dot(bigram_freq_input, embedded_bigrams)[0]

# Trigram freq input
trigram_freq_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='float32', name='trigram_freq_input')

# Trigram index input
trigram_inx_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_inx_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_inx_input)

# The weighted mean
flattened_trigrams = backend.dot(trigram_freq_input, embedded_trigrams)[0]

flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)

dense_layer = layers.Dense(512, activation='relu')(flattened)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)

model = Model([char_inx_input, char_freq_input, 
               bigram_inx_input, bigram_freq_input, 
               trigram_inx_input, trigram_freq_input], lang_output)
model.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_inx_input (InputLayer)     [(None, 30)]         0                                            
__________________________________________________________________________________________________
bigram_inx_input (InputLayer)   [(None, 60)]         0                                            
__________________________________________________________________________________________________
trigram_inx_input (InputLayer)  [(None, 70)]         0                                            
__________________________________________________________________________________________________
char_freq_input (InputLayer)    [(None, 30)]         0                                            
______________________________________________________________________________________________

## Fitting the Model

In [46]:
"""history = model.fit([X_chars, X_bigrams, X_trigrams], y, 
                    epochs=3,
                   validation_split=0.2)"""

'history = model.fit([X_chars, X_bigrams, X_trigrams], y, \n                    epochs=3,\n                   validation_split=0.2)'

In [48]:
history = model.fit(
    [X_train_inx_chars, X_train_freq_chars, 
     X_train_inx_bigrams, X_train_freq_bigrams, 
     X_train_inx_trigrams, X_train_freq_trigrams], 
    y_train, 
    epochs=3,
    validation_data=(
        [X_val_inx_chars, X_val_freq_chars, 
         X_val_inx_bigrams, X_val_freq_bigrams,
         X_val_inx_trigrams, X_val_freq_trigrams], 
        y_val))

Train on 6347635 samples, validate on 1586909 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Predicting and Evaluating

We evaluate the model.

In [97]:
scores = model.evaluate([X_val_inx_chars, X_val_freq_chars, 
                         X_val_inx_bigrams, X_val_freq_bigrams,
                         X_val_inx_trigrams, X_val_freq_trigrams], y_val)
print('Scores:', scores)
list(map(lambda x: print("%s: %.2f%%" % (x[0], x[1] * 100)), zip(model.metrics_names, scores)))

Scores: [0.076903877699372, 0.9778626]
loss: 7.69%
acc: 97.79%


[None, None]

We predict the whole validation set and we get the probabilities

In [98]:
y_predicted = model.predict([X_val_inx_chars, X_val_freq_chars, 
                             X_val_inx_bigrams, X_val_freq_bigrams,
                             X_val_inx_trigrams, X_val_freq_trigrams])
print(y_predicted[:5])
print(y_val[:5])

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 1.95579659e-25
  1.21977515e-20 8.12010985e-32 3.36048942e-20 6.49913382e-31
  0.00000000e+00 0.00000000e+00 2.77644821e-35 3.83825223e-28
  0.00000000e+00 1.27309104e-21 4.31282118e-32 1.72053009e-23
  8.73432520e-19 3.46895509e-24 0.00000000e+00 0.00000000e+00
  0.00000000e+00 9.88139164e-29 0.00000000e+00 2.33776544e-18
  0.00000000e+00 4.72338386e-20 5.96887584e-32 6.04027685e-38
  0.00000000e+00 0.00000000e+00 3.23008743e-35 3.97554853e-34
  9.03226323e-27 0.00000000e+00 3.81964463e-35 1.94402977e-32
  3.13051901e-26 2.00930390e-29 9.99999642e-01 4.82056492e-37
  0.00000000e+00 0.00000000e+00 1.35817392e-29 0.00000000e+00
  2.77424462e-21 5.46700589e-28 2.42822219e-18 0.00000000e+00
  0.00000000e+00 3.61760527e-07 2.04510976e-27 2.49610926e-23
  0.00000000e+00 1.60811790e-27 2.02422896e-29 1.08059305e-31
  0.00000000e+00 2.18431677e-30 0.00000000e+00 0.00000000e+00
  1.61889934e-22 0.00000000e+00 0.00000000e+00 3.50571469e-37
  1.0661

#### Names of the predicted and true classes

Prediction

In [50]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['kab', 'deu', 'fra', 'spa', 'epo', 'deu', 'por', 'hun', 'eng', 'heb']

Truth

In [51]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['kab', 'deu', 'fra', 'spa', 'epo', 'deu', 'por', 'hun', 'eng', 'heb']

### The detailed F1s

We now compute the precision and recall by language as well as the micro and macro F1.

In [52]:
lang_names = sorted(list(lang2inx.keys()), key=lambda x: lang2inx[x])

In [53]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print('Micro F1:', f1_score(y_val_symb, y_pred, average='micro'))
print('Macro F1', f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

         cat       0.88      0.76      0.81      1212
         afr       0.79      0.76      0.77       754
         run       0.80      0.96      0.88       730
         gos       0.95      0.69      0.80       662
         pol       0.99      0.99      0.99     20176
         uig       1.00      0.99      1.00      1530
         fra       0.99      0.99      0.99     80124
         lat       0.95      0.90      0.93      6570
         mhr       0.94      0.94      0.94       844
         nob       0.88      0.60      0.71      2784
         dan       0.87      0.94      0.90      8765
         nds       0.92      0.92      0.92      3471
         vie       1.00      0.99      0.99      2093
         ron       0.98      0.94      0.96      3867
         fin       0.99      0.99      0.99     21335
         hun       0.99      0.99      0.99     53914
         ukr       0.97      0.97      0.97     30964
         bul       0.89    

### Confusion Matrix

And finally the confusion matrix

In [54]:
from sklearn.metrics import confusion_matrix
print(lang2inx)
cf = confusion_matrix(y_val_symb, y_pred)
print(cf)

{'cat': 0, 'afr': 1, 'run': 2, 'gos': 3, 'pol': 4, 'uig': 5, 'fra': 6, 'lat': 7, 'mhr': 8, 'nob': 9, 'dan': 10, 'nds': 11, 'vie': 12, 'ron': 13, 'fin': 14, 'hun': 15, 'ukr': 16, 'bul': 17, 'ind': 18, 'avk': 19, 'vol': 20, 'pes': 21, 'glg': 22, 'bel': 23, 'slk': 24, 'eng': 25, 'bre': 26, 'wuu': 27, 'ido': 28, 'isl': 29, 'ben': 30, 'cmn': 31, 'tat': 32, 'toki': 33, 'jbo': 34, 'oci': 35, 'mar': 36, 'ara': 37, 'kab': 38, 'kor': 39, 'lfn': 40, 'est': 41, 'ita': 42, 'eus': 43, 'nld': 44, 'mkd': 45, 'rus': 46, 'ile': 47, 'tuk': 48, 'ber': 49, 'spa': 50, 'deu': 51, 'heb': 52, 'epo': 53, 'ces': 54, 'jpn': 55, 'tgl': 56, 'srp': 57, 'ell': 58, 'yue': 59, 'por': 60, 'hrv': 61, 'ina': 62, 'lit': 63, 'swe': 64, 'tur': 65, 'tlh': 66, 'aze': 67, 'cor': 68, 'hin': 69}
[[ 922    0    0 ...    0    0    0]
 [   0  570    0 ...    0    0    0]
 [   0    0  702 ...    0    0    0]
 ...
 [   0    0    0 ...  863    0    0]
 [   0    0    0 ...    0  726    0]
 [   0    0    0 ...    0    0 2294]]


The most frequent confusions for some languages

In [55]:
languages = ['fra', 'eng', 'swe']

In [56]:
for language in languages:
    if language not in lang2inx:
        continue
    print('Language:', language)
    print('Confusions:', cf[lang2inx[language]])
    print('Most confused:',
          inx2lang[np.argsort(cf[lang2inx[language]])[-2]], 
          np.sort(cf[lang2inx[language]])[-2] / np.sum(cf[lang2inx[language]]))
    print('====')

Language: fra
Confusions: [   25     1     1     0     1     0 79430    27     0     1     2     2
     0     4     2    14     0     0     0     4     0     0     1     0
     0   130    14     0     4     0     0     0     0     1     2    46
     0     0     4     0     5     1    99     1     9     0     0    17
     0    18    91    29     0    23     1     0     1     3     0     0
    63     0    35     1     0     9     2     0     0     0]
Most confused: eng 0.0016224851480205681
====
Language: eng
Confusions: [     3     13      7      2     15      0     56     34      0      0
     14     15      0      4     12     20      0      0      3      4
      0      0      0      0      0 252564      6      0      9      0
      0      1      0      6      0      1      0      0     10      0
      2      6     81      0     52      0      0     11      1     22
     42     93      0     23      5      0      1      3      0      0
     33      0     13      2      6     22      7

## Predicting Texts

Now let us use the model to predict some short texts

In [57]:
def extract_features(sentence):
    char_items = hash_chars(sentence)
    bigram_items = hash_bigrams(sentence)
    trigram_items = hash_trigrams(sentence)
    return [pad_sequences([char_items[0]], padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([char_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([bigram_items[0]], padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([bigram_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([trigram_items[0]], padding='post', maxlen=MAXLEN_TRIGRAMS),
            pad_sequences([trigram_items[1]], dtype='float32', padding='post', maxlen=MAXLEN_TRIGRAMS)]

In [101]:
sentence = "Banana"
extract_features(sentence)

[array([[208, 324, 541,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0]], dtype=int32),
 array([[0.5       , 0.33333334, 0.16666667, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ]],
       dtype=float32),
 array([[4366, 3769, 8123,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,

In [102]:
sentence = "Salut les gars !"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'fra'

In [103]:
sentence = "Hi guys!"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'eng'

In [104]:
sentence = "Hejsan grabbar!"
preds = model.predict(extract_features(sentence))
inx2lang[np.argmax(preds)]

'dan'

## Another Architecture

A simplifed model, where we do not use the ngram frequencies

In [62]:
# Char index input
char_inx_input = Input(shape=(MAXLEN_CHARS,), dtype='int32', name='char_inx_input')
embedded_chars = layers.Embedding(MAX_CHARS + 1, 64, mask_zero=True)(char_inx_input)

# The weighted mean
flattened_chars = layers.GlobalAveragePooling1D()(embedded_chars)

# Bigram index input
bigram_inx_input = Input(shape=(MAXLEN_BIGRAMS,), dtype='int32', name='bigram_inx_input')
embedded_bigrams = layers.Embedding(MAX_BIGRAMS + 1, 64, mask_zero=True)(bigram_inx_input)

# The weighted mean
flattened_bigrams = layers.GlobalAveragePooling1D()(embedded_bigrams)

# Trigram index input
trigram_inx_input = Input(shape=(MAXLEN_TRIGRAMS,), dtype='int32', name='trigram_inx_input')
embedded_trigrams = layers.Embedding(MAX_TRIGRAMS + 1, 64, mask_zero=True)(trigram_inx_input)

# The weighted mean
flattened_trigrams = layers.GlobalAveragePooling1D()(embedded_trigrams)

flattened = layers.concatenate([flattened_chars, flattened_bigrams, flattened_trigrams], axis=-1)

dense_layer = layers.Dense(512, activation='relu')(flattened)
lang_output = layers.Dense(LANG_NBR, activation='softmax')(dense_layer)

model2 = Model([char_inx_input, 
               bigram_inx_input, 
               trigram_inx_input], lang_output)
model2.compile(optimizer='nadam', loss='categorical_crossentropy', metrics=['acc'])
model2.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_inx_input (InputLayer)     [(None, 30)]         0                                            
__________________________________________________________________________________________________
bigram_inx_input (InputLayer)   [(None, 60)]         0                                            
__________________________________________________________________________________________________
trigram_inx_input (InputLayer)  [(None, 70)]         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 30, 64)       262208      char_inx_input[0][0]             
____________________________________________________________________________________________

In [63]:
history = model2.fit(
    [X_train_inx_chars, 
     X_train_inx_bigrams, 
     X_train_inx_trigrams], 
    y_train, 
    epochs=3,
    validation_data=(
        [X_val_inx_chars, 
         X_val_inx_bigrams,
         X_val_inx_trigrams], 
        y_val))

Train on 6347635 samples, validate on 1586909 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


We evaluate this new model

In [105]:
scores = model2.evaluate([X_val_inx_chars, 
                         X_val_inx_bigrams,
                         X_val_inx_trigrams], y_val)
print('Scores:', scores)
list(map(lambda x: print("%s: %.2f%%" % (x[0], x[1] * 100)), zip(model.metrics_names, scores)))

Scores: [0.05736815190206216, 0.9836935]
loss: 5.74%
acc: 98.37%


[None, None]

In [106]:
y_predicted = model2.predict([X_val_inx_chars, 
                             X_val_inx_bigrams,
                             X_val_inx_trigrams])
print(y_predicted[:5])
print(y_val[:5])

[[2.26628581e-38 0.00000000e+00 1.13446156e-34 1.70438958e-31
  4.01623771e-33 1.31119443e-17 6.80804981e-17 2.65705128e-27
  1.16511981e-32 2.13565283e-34 1.03613338e-31 2.26511209e-29
  5.25932629e-25 1.59582940e-28 1.46535750e-31 9.02106790e-30
  4.02487991e-22 3.43379150e-32 5.07940595e-24 2.41717550e-37
  4.01911506e-35 2.73342065e-17 0.00000000e+00 4.98672451e-27
  0.00000000e+00 8.68558142e-21 5.72089356e-32 1.31907237e-22
  0.00000000e+00 2.70744764e-38 8.59820916e-24 3.59971215e-15
  7.73362204e-25 1.08744549e-32 9.56700920e-22 6.26743949e-26
  1.33666579e-20 5.88734062e-16 9.99992967e-01 4.05791351e-24
  1.57189526e-31 0.00000000e+00 6.47023838e-26 1.46521844e-30
  1.74412693e-26 2.26482518e-23 9.89106904e-19 1.96691115e-38
  3.53252461e-38 7.05105276e-06 6.34146361e-26 1.91933125e-25
  2.11844344e-26 1.92480937e-24 9.80019834e-30 1.61290777e-24
  0.00000000e+00 3.21921871e-26 3.74367185e-19 1.17527554e-24
  3.32372522e-27 0.00000000e+00 2.99842560e-31 4.22535211e-28
  7.1713

In [65]:
y_pred = np.argmax(y_predicted, axis=-1)
list(map(inx2lang.get, y_pred))[:10]

['kab', 'deu', 'fra', 'spa', 'epo', 'deu', 'por', 'hun', 'eng', 'heb']

In [66]:
y_val_symb = np.argmax(y_val, axis=-1)
list(map(inx2lang.get, y_val_symb))[:10]

['kab', 'deu', 'fra', 'spa', 'epo', 'deu', 'por', 'hun', 'eng', 'heb']

Although this model is simpler than the first one, it seems slightly more accurate

In [67]:
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_val_symb, y_pred, target_names=lang_names))
print('Micro F1:', f1_score(y_val_symb, y_pred, average='micro'))
print('Macro F1', f1_score(y_val_symb, y_pred, average='macro'))

              precision    recall  f1-score   support

         cat       0.95      0.78      0.85      1212
         afr       0.83      0.78      0.81       754
         run       0.93      0.93      0.93       730
         gos       0.87      0.77      0.81       662
         pol       1.00      0.99      0.99     20176
         uig       1.00      0.99      1.00      1530
         fra       1.00      0.99      1.00     80124
         lat       0.96      0.94      0.95      6570
         mhr       0.98      0.95      0.97       844
         nob       0.84      0.76      0.80      2784
         dan       0.93      0.91      0.92      8765
         nds       0.98      0.90      0.94      3471
         vie       0.99      0.99      0.99      2093
         ron       0.98      0.95      0.97      3867
         fin       0.99      0.98      0.99     21335
         hun       0.99      1.00      1.00     53914
         ukr       0.99      0.98      0.98     30964
         bul       0.95    

In [68]:
for language in languages:
    if language not in lang2inx:
        continue
    print('Language:', language)
    print('Confusions:', cf[lang2inx[language]])
    print('Most confused:',
          inx2lang[np.argsort(cf[lang2inx[language]])[-2]], 
          np.sort(cf[lang2inx[language]])[-2] / np.sum(cf[lang2inx[language]]))
    print('====')

Language: fra
Confusions: [   25     1     1     0     1     0 79430    27     0     1     2     2
     0     4     2    14     0     0     0     4     0     0     1     0
     0   130    14     0     4     0     0     0     0     1     2    46
     0     0     4     0     5     1    99     1     9     0     0    17
     0    18    91    29     0    23     1     0     1     3     0     0
    63     0    35     1     0     9     2     0     0     0]
Most confused: eng 0.0016224851480205681
====
Language: eng
Confusions: [     3     13      7      2     15      0     56     34      0      0
     14     15      0      4     12     20      0      0      3      4
      0      0      0      0      0 252564      6      0      9      0
      0      1      0      6      0      1      0      0     10      0
      2      6     81      0     52      0      0     11      1     22
     42     93      0     23      5      0      1      3      0      0
     33      0     13      2      6     22      7

In [69]:
def extract_features_inx(sentence):
    char_items = hash_chars(sentence)
    bigram_items = hash_bigrams(sentence)
    trigram_items = hash_trigrams(sentence)
    return [pad_sequences([char_items[0]], padding='post', maxlen=MAXLEN_CHARS),
            pad_sequences([bigram_items[0]], padding='post', maxlen=MAXLEN_BIGRAMS),
            pad_sequences([trigram_items[0]], padding='post', maxlen=MAXLEN_TRIGRAMS)]

In [70]:
sentence = "Banana"
extract_features_inx(sentence)

[array([[208, 324, 541,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0]], dtype=int32),
 array([[4366, 3769, 8123,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0]], dtype=int32),
 array([[7381, 5853, 1734,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
  

In [71]:
sentence = "Salut les gars !"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'fra'

In [108]:
sentence = "Hello guys!"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'eng'

In [73]:
sentence = "Hejsan grabbar!"
preds = model2.predict(extract_features_inx(sentence))
inx2lang[np.argmax(preds)]

'swe'