# Assignment #3: A simple language classifier

#### Author: Hicham Mohamad (hi8826mo-s)

## Table of Contents
1. [Scope of the lab](#t1)
2. [Programming: Extracting the features](#t2)
3. [Programming: Building $\mathbf{X}$](#t3)
4. [Programming: Building $\mathbf{y}$](#t4)
5. [Programming: Building the Model](#t5)
6. [Predicting](#t6)
7. [Predict the language of a text](#t7)
8. [Submission](#t8) 

## Objectives

In this assignment, you will implement a **language detector** inspired by Google's _Compact language detector_, version 3 (CLD3): https://github.com/google/cld3. **CLD3** is written in C++ and its code is available from GitHub. The objectives of the assignment are to:
* Write a program to classify languages
* Use neural networks
* Know what a classifier is
* Write a short report of 1 to 2 pages to describe your program. You will notably comment on the performance you obtained and how you could improve it.

## Description 

### System Overview

Read the GitHub description of CLD3, https://github.com/google/cld3, (_Model_ section). In your individual report you will:
1. Summarize the system in two or three sentences;
2. Outline the CLD3 overall architecture in a figure. Use building blocks only and do not specify the parameters.

## Imports 

In [24]:
import bz2
import json
import os
import numpy as np
import requests
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, classification_report
from sklearn.metrics import confusion_matrix

### Dataset

As a dataset, we will use **Tatoeba**, https://tatoeba.org/eng/downloads. It consists of more than 8 million short texts in 347 languages and it is available in one file called `sentences.csv`.

The dataset is structured in this way: There is one text per line, where each line consists of the three following fields separated by tabulations and ended by a carriage return:
```
sentence id [tab] language code [tab] text [cr]
```
Each text (sentence) has a unique id and has a language code that follows the ISO 639-3 standard (see below). 
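
As a minimal sketch, one such line could be parsed as follows, assuming the three fields are separated by tabs as described above. The function name `parse_line` is hypothetical, and splitting on the first two tabs only is a defensive assumption so that a stray tab inside the text would not create extra fields.

```python
def parse_line(line):
    """Split one Tatoeba line into (sentence_id, language_code, text)."""
    # Drop the line terminator, then split on the first two tabs only
    sentence_id, lang, text = line.rstrip('\r\n').split('\t', 2)
    return sentence_id, lang, text.strip()


print(parse_line("1276\teng\tLet's try something.\n"))
# ('1276', 'eng', "Let's try something.")
```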

**NOTE**: 

CLD3 is a **neural network model** for language identification. The code package on GitHub contains the inference code and a trained model. The inference code extracts character **ngrams** from the input text and computes the **fraction of times** each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
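
The "banana" example can be checked with a few lines of Python. This is only an illustration of the fraction computation; it ignores any preprocessing and hashing that the real CLD3 code performs.

```python
from collections import Counter

text = "banana"
# The four trigrams of "banana": 'ban', 'ana', 'nan', 'ana'
trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
fractions = {t: c / len(trigrams) for t, c in Counter(trigrams).items()}

print(fractions)
# {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
```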

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer. The remaining components of the network are a hidden (Rectified linear) layer and a softmax layer.

To get a language prediction for the input text, we simply perform a forward pass through the network.
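
The forward pass through a hidden rectified linear layer and a softmax layer can be sketched in plain Python. All the weights below are made-up toy values, and the input vector merely stands in for the concatenated averaged embeddings; none of this comes from the trained CLD3 model.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights, bias, x):
    """Compute weights @ x + bias for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, W1, b1, W2, b2):
    hidden = relu(dense(W1, b1, x))
    return softmax(dense(W2, b2, hidden))


# Hypothetical toy parameters: 3 inputs, 2 hidden units, 3 languages
W1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-0.5, 0.5], [0.2, 0.3]]
b2 = [0.0, 0.0, 0.0]

probs = forward([0.2, 0.5, 0.3], W1, b1, W2, b2)
# The predicted class is the index with the highest probability
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```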

### Scope of the lab <a name="t1"/>

In this lab, you will consider three languages only: French (fra), English (eng), and Swedish (swe). Below is an excerpt of the **Tatoeba dataset** limited to these three languages: 

```
1276    eng     Let's try something.
1277    eng     I have to go to sleep.
1280    eng     Today is June 18th and it is Muiriel's birthday!
...
1115    fra     Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
1279    fra     Je ne supporte pas ce type.
1441    fra     Pour une fois dans ma vie je fais un bon geste... Et ça ne sert à rien.
...
337413  swe     Vi trodde att det var ett flygande tefat.
341910  swe     Detta är huset jag bodde i när jag var barn.
341938  swe     Vi hade roligt på stranden igår.
...
```
Tatoeba is updated continuously. The examples from this dataset come from a corpus your instructor downloaded on September 15, 2020.

### Understanding the $\mathbf{X}$ matrix (feature matrix)

You will now investigate the CLD3 features:
 * What are the **features** CLD3 extracts from each text?
 * Create manually a simplified $\mathbf{X}$ matrix where you will **represent the 9 texts with CLD3 features**. You will use a restricted set of features: You will only consider the letters _a_, _b_, and _n_ and the bigrams _an_, _ba_, and _na_. You will ignore the rest of the letters and bigrams as well as the trigrams. Your matrix will have 9 rows and 6 columns, each column will contain these counts: `[#a, #b, #n, #an, #ba, #na]`.

The original CLD3 description uses **relative frequencies** (counts of a letter divided by the total counts of letters in the text). Here, you will use the **raw counts**. To help you, your instructor filled in the fourth row of the matrix, corresponding to the first text in French. Fill in the rest. You will include this matrix in your report. 

$\mathbf{X} =
\begin{bmatrix}
0& 0& 1& 0& 0& 0\\
1& 0& 0& 0& 0& 0\\
3& 1& 2& 1& 0& 0\\
8& 0& 8& 1& 0& 0\\
1& 0& 1& 0& 0& 0\\
4& 1& 6& 1& 0& 0\\
4& 0& 1& 1& 0& 0\\
5& 2& 2& 0& 1& 0\\
2& 0& 2& 1& 0& 0\\
\end{bmatrix}$
; $\mathbf{y} =
\begin{bmatrix}
     \text{eng} \\
     \text{eng}\\
     \text{eng}\\
    \text{fra}\\
   \text{fra}  \\
     \text{fra}\\
    \text{swe}\\
 \text{swe}   \\
 \text{swe}   
\end{bmatrix}$
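
A row of this matrix can be checked programmatically. The sketch below is one possible way to do it; the helper name `raw_counts` is hypothetical, and lowercasing the text before counting is an assumption.

```python
FEATURES = ['a', 'b', 'n', 'an', 'ba', 'na']


def raw_counts(text, features=FEATURES):
    """Count each feature (letter or bigram) in the lowercased text."""
    text = text.lower()
    row = []
    for feat in features:
        n = len(feat)
        # Slide a window of the feature's length over the text
        row.append(sum(1 for i in range(len(text) - n + 1)
                       if text[i:i + n] == feat))
    return row


print(raw_counts("banana"))
# [3, 1, 2, 2, 1, 2]
print(raw_counts("Today is June 18th and it is Muiriel's birthday!"))
# [3, 1, 2, 1, 0, 0]
```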

## Programming: Extracting the features <a name="t2"/>

Before you start programming, download the Tatoeba dataset.

### Loading and filtering the dataset

Run the code to read the dataset and split it into lines. You may have to change the path.

In [2]:
#dataset = open('../../corpus/sentences.csv', encoding='utf8').read().strip()
dataset = open('sentences/sentences.csv', encoding='utf8').read().strip()
dataset = dataset.split('\n')
dataset[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

Run the code to **split the fields** and remove possible **whitespaces**

In [3]:
dataset = list(map(lambda x: tuple(x.split('\t')), dataset))
#dataset[:3]
dataset = list(map(lambda x: tuple(map(str.strip, x)), dataset))
dataset[:3]

[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

In [4]:
dataset[0][1]

'cmn'

- Write the code to extract the **French, English, and Swedish** texts. You will call the resulting dataset: `dataset_small`

In [5]:
# Write your code here
# Keep only the sentences whose language code is English, French, or Swedish
dataset_small = list(filter(lambda sentence: sentence[1] in ['eng', 'fra', 'swe'], dataset))
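
Since the full file holds several million lines, an alternative is to filter while reading, line by line, instead of loading everything first. The following is a sketch under that assumption; the generator name `filter_languages` is hypothetical, and an `io.StringIO` object stands in for the open file.

```python
import io


def filter_languages(lines, languages):
    """Yield (id, lang, text) tuples whose language code is in languages."""
    languages = set(languages)  # set membership tests are fast
    for line in lines:
        fields = tuple(f.strip() for f in line.rstrip('\n').split('\t'))
        # Skip malformed lines that do not have exactly three fields
        if len(fields) == 3 and fields[1] in languages:
            yield fields


# Stand-in for an open file: a file object also iterates line by line
sample = io.StringIO(
    "1\tcmn\t我們試試看！\n"
    "1276\teng\tLet's try something.\n"
    "1279\tfra\tJe ne supporte pas ce type.\n"
)
small = list(filter_languages(sample, ['eng', 'fra', 'swe']))
print(small)
```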

In [6]:
dataset_small[:5]

[('1115',
  'fra',
  "Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent."),
 ('1276', 'eng', "Let's try something."),
 ('1277', 'eng', 'I have to go to sleep.'),
 ('1279', 'fra', 'Je ne supporte pas ce type.'),
 ('1280', 'eng', "Today is June 18th and it is Muiriel's birthday!")]

### Functions to Count Characters Ngrams

- Write a function `count_chars(string, lc=True)` to count characters (**unigrams**) of a string. You will set the text in **lowercase** if `lc` is set to `True`. As in CLD3, you will return the **relative frequencies** of the unigrams, i.e. counts of a letter divided by the total counts of letters in the text.

In [7]:
# Write your code here
def count_chars(string, lc=True):
    frequency = {}
    if lc:
        string = string.lower()

    # Count the occurrences of each character
    for char in string:
        if char in frequency:
            frequency[char] += 1
        else:
            frequency[char] = 1

    # Convert the counts into relative frequencies
    frequency = {k: v / len(string) for k, v in frequency.items()}
    return frequency

In [8]:
def count_chars2(string, lc=True):
    counts = {}
    if lc:
        string = string.lower()
    # The loop must run whether or not the text was lowercased
    for char in string:
        if char in counts:
            counts[char] += 1
        else:
            counts[char] = 1

    sum_chars = sum(counts.values())
    counts = {k: v / sum_chars for k, v in counts.items()}
    return counts


- Write a function `count_bigrams(string, lc=True)` to count the characters **bigrams** of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the bigrams.

In [9]:
# Write your code here
def count_bigrams(string, lc=True):
    frequency = {}
    if lc:
        string = string.lower()

    # Slide a window of two characters over the string
    bigrams = [string[i:i + 2] for i in range(len(string) - 1)]
    for bigram in bigrams:
        frequency[bigram] = frequency.get(bigram, 0) + 1

    # Normalize by the number of bigrams, not by the string length,
    # so that the relative frequencies sum to 1
    frequency = {k: v / len(bigrams) for k, v in frequency.items()}
    return frequency

- Write a function `count_trigrams(string, lc=True)` to count the characters **trigrams** of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the trigrams.

In [10]:
# Write your code here
def count_trigrams(string, lc=True):
    frequency = {}
    if lc:
        string = string.lower()

    # Slide a window of three characters over the string
    trigrams = [string[i:i + 3] for i in range(len(string) - 2)]
    for trigram in trigrams:
        frequency[trigram] = frequency.get(trigram, 0) + 1

    # Normalize by the number of trigrams, not by the string length,
    # so that the relative frequencies sum to 1
    frequency = {k: v / len(trigrams) for k, v in frequency.items()}
    return frequency
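
The three functions share the same structure, so they could be folded into one. The sketch below is a possible generalization, not part of the assignment; the name `count_ngrams` is hypothetical. It normalizes by the number of n-grams, which agrees with the expected values 1/19 for bigrams and 1/18 for trigrams given in the next section.

```python
from collections import Counter


def count_ngrams(string, n, lc=True):
    """Relative frequencies of the character n-grams of a string."""
    if lc:
        string = string.lower()
    ngrams = [string[i:i + n] for i in range(len(string) - n + 1)]
    total = len(ngrams)
    return {ngram: cnt / total for ngram, cnt in Counter(ngrams).items()}


freqs = count_ngrams("Let's try something.", 2)
print(freqs['le'], freqs['et'])
# 0.05263157894736842 0.10526315789473684
```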

In [11]:
count_chars("Let's try something.")

{'l': 0.05,
 'e': 0.1,
 't': 0.15,
 "'": 0.05,
 's': 0.1,
 ' ': 0.1,
 'r': 0.05,
 'y': 0.05,
 'o': 0.05,
 'm': 0.05,
 'h': 0.05,
 'i': 0.05,
 'n': 0.05,
 'g': 0.05,
 '.': 0.05}

In [12]:
count_bigrams("Let's try something.")

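{'le': 0.05263157894736842,
 'et': 0.10526315789473684,
 "t'": 0.05263157894736842,
 "'s": 0.05263157894736842,
 's ': 0.05263157894736842,
 ' t': 0.05263157894736842,
 'tr': 0.05263157894736842,
 'ry': 0.05263157894736842,
 'y ': 0.05263157894736842,
 ' s': 0.05263157894736842,
 'so': 0.05263157894736842,
 'om': 0.05263157894736842,
 'me': 0.05263157894736842,
 'th': 0.05263157894736842,
 'hi': 0.05263157894736842,
 'in': 0.05263157894736842,
 'ng': 0.05263157894736842,
 'g.': 0.05263157894736842}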

In [13]:
count_trigrams("Let's try something.")

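{'let': 0.05555555555555555,
 "et'": 0.05555555555555555,
 "t's": 0.05555555555555555,
 "'s ": 0.05555555555555555,
 's t': 0.05555555555555555,
 ' tr': 0.05555555555555555,
 'try': 0.05555555555555555,
 'ry ': 0.05555555555555555,
 'y s': 0.05555555555555555,
 ' so': 0.05555555555555555,
 'som': 0.05555555555555555,
 'ome': 0.05555555555555555,
 'met': 0.05555555555555555,
 'eth': 0.05555555555555555,
 'thi': 0.05555555555555555,
 'hin': 0.05555555555555555,
 'ing': 0.05555555555555555,
 'ng.': 0.05555555555555555}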

### Counting the ngrams in the dataset (modified: only unigrams)

You will now **extract the features** from each text. For this, add the character, bigram, and trigram **relative frequencies** to the texts using this format:
`(text_id, language_id, text, char_cnt, bigram_cnt, trigram_cnt)`.

From the datapoint:
`('1276', 'eng', "Let's try something.")`,
you must return:

`('1276', 'eng', "Let's try something.", 
  {'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 'n': 0.05, 'g': 0.05, '.': 0.05},
  {'le': 0.05263157894736842, 'et': 0.10526315789473684, "t'": 0.05263157894736842, "'s": 0.05263157894736842, 's ': 0.05263157894736842, ' t': 0.05263157894736842, 'tr': 0.05263157894736842, 'ry': 0.05263157894736842, 'y ': 0.05263157894736842, ' s': 0.05263157894736842, 'so': 0.05263157894736842, 'om': 0.05263157894736842, 'me': 0.05263157894736842, 'th': 0.05263157894736842, 'hi': 0.05263157894736842, 'in': 0.05263157894736842, 'ng': 0.05263157894736842, 'g.': 0.05263157894736842},
  {'let': 0.05555555555555555, "et'": 0.05555555555555555, "t's": 0.05555555555555555, "'s ": 0.05555555555555555, 's t': 0.05555555555555555, ' tr': 0.05555555555555555, 'try': 0.05555555555555555, 'ry ': 0.05555555555555555, 'y s': 0.05555555555555555, ' so': 0.05555555555555555, 'som': 0.05555555555555555, 'ome': 0.05555555555555555, 'met': 0.05555555555555555, 'eth': 0.05555555555555555, 'thi': 0.05555555555555555, 'hin': 0.05555555555555555, 'ing': 0.05555555555555555, 'ng.': 0.05555555555555555})`

- You will store the **extracted features** in a **list** that you will call `dataset_small_feat`

#### NOTE (modified task)
In this version of the assignment, add ONLY the character relative frequencies to the texts, not the bigram and trigram ones. The format is therefore `(text_id, language_id, text, char_cnt)`, without `bigram_cnt` and `trigram_cnt`.

In [16]:
# Write your code here
# we can only concatenate tuple (not "dict") to tuple
dataset_small_feat = list(map(lambda sentence: sentence + (count_chars(sentence[2]), ), 
                              dataset_small))

In [17]:
dataset_small_feat[:2]

[('1115',
  'fra',
  "Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.",
  {'l': 0.044444444444444446,
   'o': 0.05555555555555555,
   'r': 0.05555555555555555,
   's': 0.07777777777777778,
   'q': 0.022222222222222223,
   'u': 0.044444444444444446,
   "'": 0.011111111111111112,
   'i': 0.06666666666666667,
   ' ': 0.16666666666666666,
   'a': 0.08888888888888889,
   'd': 0.022222222222222223,
   'e': 0.05555555555555555,
   'm': 0.011111111111111112,
   'n': 0.08888888888888889,
   'é': 0.022222222222222223,
   'v': 0.011111111111111112,
   't': 0.05555555555555555,
   'c': 0.022222222222222223,
   'f': 0.011111111111111112,
   'ê': 0.011111111111111112,
   ',': 0.011111111111111112,
   'g': 0.011111111111111112,
   'ç': 0.011111111111111112,
   'p': 0.011111111111111112,
   '.': 0.011111111111111112}),
 ('1276',
  'eng',
  "Let's try something.",
  {'l': 0.05,
   'e': 0.1,
   't': 0.15,
   "'": 0.05,
   's': 0.1,
   ' ': 0.1,
   'r': 0.05,
  

The unigram frequencies

In [18]:
# dataset_small_feat is of this format: (text_id, language_id, text, char_cnt)
dataset_small_feat[0][3].items()

dict_items([('l', 0.044444444444444446), ('o', 0.05555555555555555), ('r', 0.05555555555555555), ('s', 0.07777777777777778), ('q', 0.022222222222222223), ('u', 0.044444444444444446), ("'", 0.011111111111111112), ('i', 0.06666666666666667), (' ', 0.16666666666666666), ('a', 0.08888888888888889), ('d', 0.022222222222222223), ('e', 0.05555555555555555), ('m', 0.011111111111111112), ('n', 0.08888888888888889), ('é', 0.022222222222222223), ('v', 0.011111111111111112), ('t', 0.05555555555555555), ('c', 0.022222222222222223), ('f', 0.011111111111111112), ('ê', 0.011111111111111112), (',', 0.011111111111111112), ('g', 0.011111111111111112), ('ç', 0.011111111111111112), ('p', 0.011111111111111112), ('.', 0.011111111111111112)])

The bigram frequencies (NOTE: this is cancelled)

In [19]:
#dataset_small_feat[0][4].items()

## Programming: Building $\mathbf{X}$ <a name="t3"/>

You will now build the $\mathbf{X}$ matrix. In this assignment, you will only consider **unigrams** to speed up the training step. **This means that you will set aside the character bigrams and trigrams**.

When you are done with the lab requirements, feel free to improve the program and include bigrams and trigrams. To add bigrams, a possible method is to add the bigram dictionary to the unigram one using **update** and then to extract the resulting dictionary. You can easily extend this to trigrams. Feel free to use another method if you want.
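
The merge with `update` can be illustrated on two small dictionaries. The values below are toy numbers chosen for the example, not frequencies computed from the dataset. Copying the unigram dictionary first is a design choice that leaves the original features untouched; unigram and bigram keys cannot collide because they have different lengths.

```python
# Hypothetical toy feature dictionaries
unigrams = {'a': 0.5, 'n': 0.5}
bigrams = {'an': 0.5, 'na': 0.5}

# Build a new dictionary so the original unigram features stay untouched
features = dict(unigrams)
features.update(bigrams)

print(features)
# {'a': 0.5, 'n': 0.5, 'an': 0.5, 'na': 0.5}
```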

In [20]:
INCLUDE_BIGRAMS = False
if INCLUDE_BIGRAMS:
    for i in range(len(dataset_small_feat)):
        dataset_small_feat[i][3].update(dataset_small_feat[i][4])

### Vectorizing the features

The CLD3 architecture uses **embeddings**. In this lab, we will simplify it and we will use a **feature vector** instead consisting of the **character frequencies**. For example, you will represent the text:

`"Let's try something."`

with:

`{'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 
 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 
 'n': 0.05, 'g': 0.05, '.': 0.05}`

To create the $\mathbf{X}$ matrix, we need to transform the dictionaries of `dataset_small_feat` into **numerical vectors**. The `DictVectorizer` class from the **scikit-learn** library (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) has two methods, `fit()` and `transform()`, and a combination of both, `fit_transform()`, to convert dictionaries into such vectors.

You will now write the code to:

1. Extract the character frequency dictionaries from `dataset_small_feat`, stored at index 3 of each tuple, and collect them in a list;
2. Convert the list of dictionaries into an $\mathbf{X}$ matrix using `DictVectorizer`.
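
To see what `DictVectorizer` does, here is a small sketch on two toy dictionaries. The variable names `toy_vec`, `X_toy`, and `X_new` and the values are made up for the illustration; `sparse=False` is only used to make the result easy to inspect.

```python
from sklearn.feature_extraction import DictVectorizer

train_dicts = [{'a': 0.5, 'b': 0.5}, {'a': 1.0}]

toy_vec = DictVectorizer(sparse=False)    # dense output, easier to print
X_toy = toy_vec.fit_transform(train_dicts)  # learns the columns, then converts
print(toy_vec.feature_names_)             # one column per key seen in fit
print(X_toy)                              # missing keys become 0.0

# New data must reuse the fitted columns; the unseen key 'z' is ignored
X_new = toy_vec.transform([{'a': 0.25, 'z': 0.75}])
print(X_new)
```

The last call shows why new texts should go through `transform()` only: calling `fit_transform()` again would rebuild the columns and break the correspondence with a model trained on the original ones.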

#### Extracting the character frequencies

Produce a new list of datapoints with the **unigrams** only. Each item in this list will be a dictionary. You will call it `X_cat`

In [21]:
# Write your code here
# The character frequency dictionary is stored at index 3 of each datapoint
X_cat = [sentence[3] for sentence in dataset_small_feat]

In [22]:
X_cat[:2]

[{'l': 0.044444444444444446,
  'o': 0.05555555555555555,
  'r': 0.05555555555555555,
  's': 0.07777777777777778,
  'q': 0.022222222222222223,
  'u': 0.044444444444444446,
  "'": 0.011111111111111112,
  'i': 0.06666666666666667,
  ' ': 0.16666666666666666,
  'a': 0.08888888888888889,
  'd': 0.022222222222222223,
  'e': 0.05555555555555555,
  'm': 0.011111111111111112,
  'n': 0.08888888888888889,
  'é': 0.022222222222222223,
  'v': 0.011111111111111112,
  't': 0.05555555555555555,
  'c': 0.022222222222222223,
  'f': 0.011111111111111112,
  'ê': 0.011111111111111112,
  ',': 0.011111111111111112,
  'g': 0.011111111111111112,
  'ç': 0.011111111111111112,
  'p': 0.011111111111111112,
  '.': 0.011111111111111112},
 {'l': 0.05,
  'e': 0.1,
  't': 0.15,
  "'": 0.05,
  's': 0.1,
  ' ': 0.1,
  'r': 0.05,
  'y': 0.05,
  'o': 0.05,
  'm': 0.05,
  'h': 0.05,
  'i': 0.05,
  'n': 0.05,
  'g': 0.05,
  '.': 0.05}]

#### Vectorize `X_cat`

Convert your `X_cat` matrix into a **numerical representation** using `DictVectorizer`: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html

**NOTE**: The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

In [25]:
# Write your code here
# Encoding the features
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_cat)

## Programming: Building $\mathbf{y}$ <a name="t4"/>

You will now convert the list of **language symbols** into a $\mathbf{y}$ vector

Extract the **language symbols** from `dataset_small_feat` and call the resulting list `y_cat`

In [26]:
# Write your code here
y_cat = list(map(lambda sentence: sentence[1], dataset_small_feat))

In [27]:
y_cat[:5]

['fra', 'eng', 'eng', 'fra', 'eng']

Build two indices mapping the symbols to integers and the integers to symbols. Both indices will be **dictionaries** that you will call: `lang2inx` and `inx2lang`.

In [28]:
# Write your code here
y_symbols = ['fra', 'eng', 'swe']
inx2lang = dict(enumerate(y_symbols))

# we build an inverted dictionary
lang2inx = {v:k for k, v in inx2lang.items()}

In [29]:
inx2lang

{0: 'fra', 1: 'eng', 2: 'swe'}

In [30]:
lang2inx

{'fra': 0, 'eng': 1, 'swe': 2}

Convert your `y_cat` vector into a numerical vector. Call this vector `y`.

In [31]:
# Write your code here
y = list(map(lambda lang: lang2inx[lang], y_cat))
#y = [lang2inx[i] for i in y_cat]

In [32]:
y[:5]

[0, 1, 1, 0, 1]

## Programming: Building the Model <a name="t5"/>

Create a neural network using sklearn with a **hidden layer** of 50 nodes and a **relu activation** layer: https://scikit-learn.org/stable/modules/neural_networks_supervised.html. Set the maximal number of **iterations** to 5, in the beginning, and **verbose** to True. Use the default values for the rest. You will call your classifier `clf`.

In [38]:
# Write your code here
#classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
clf = MLPClassifier(activation='relu', hidden_layer_sizes=(50,), 
                    verbose=True, max_iter=5)

In [39]:
clf

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(50,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=True, warm_start=False)

### Training and Validation Sets

You will now split the dataset into a training set and a validation set.

#### We shuffle the indices

In [40]:
indices = list(range(X.shape[0]))

np.random.shuffle(indices)
print(indices[:10])

X = X[indices, :]
y = np.array(y)[indices]
print(X.shape)
print(y.shape)

[514315, 1749829, 740705, 1459710, 739350, 1535081, 1839597, 1581195, 441083, 792085]
(1844225, 354)
(1844225,)


#### We split the dataset

In [41]:
training_examples = int(X.shape[0] * 0.8)

X_train = X[:training_examples, :]
y_train = y[:training_examples]

X_val = X[training_examples:, :]
y_val = y[training_examples:]
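
The shuffle above is not seeded, so each run produces a different split. A seeded variant makes the split reproducible. The sketch below uses only the standard library and returns index lists; the function name `split_indices` and its default values are assumptions for the example.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Return (train_idx, val_idx) from a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local generator, reproducible
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
# 8 2
```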

### Fitting the model

Fit the model on the training set

#### Notes: MLPClassifier

MLPClassifier, Multi-layer Perceptron classifier, trains iteratively since at each time step the partial derivatives of the **loss function** with respect to the model parameters are computed to update the parameters.

It can also have a **regularization** term added to the loss function that shrinks model parameters to prevent **overfitting**.

This implementation works with data represented as dense numpy arrays or sparse scipy arrays of floating point values.

In [42]:
# Write your code here
clf.fit(X_train, y_train)

Iteration 1, loss = 0.12518509
Iteration 2, loss = 0.07273527
Iteration 3, loss = 0.06674220
Iteration 4, loss = 0.06253729
Iteration 5, loss = 0.05981266




MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(50,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=True, warm_start=False)

## Predicting <a name="t6"/>

Predict the `X_val` languages. You will call the result `y_val_pred`

In [43]:
# Write your code here
y_val_pred = clf.predict(X_val)

In [44]:
print(y_val_pred[:10])
print(y_val[:10])

[1 0 0 1 1 2 0 1 1 0]
[1 0 0 1 1 2 0 1 1 0]


#### Evaluating

In [45]:
# evaluate the model
accuracy_score(y_val, y_val_pred)

0.9788203717008499

In [46]:
print(classification_report(y_val, y_val_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_val_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_val_pred, average='macro'))

             precision    recall  f1-score   support

        fra       0.97      0.95      0.96     87715
        eng       0.98      0.99      0.99    273517
        swe       0.97      0.89      0.93      7613

avg / total       0.98      0.98      0.98    368845

Micro F1: 0.9788203717008499
Macro F1 0.9583552353995707


### Confusion Matrix

In [47]:
confusion_matrix(y_val, y_val_pred)

array([[ 83759,   3915,     41],
       [  2852, 270482,    183],
       [    77,    744,   6792]], dtype=int64)
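
The figures of the classification report can be recovered from this confusion matrix, where rows are the true classes and columns the predicted ones. The sketch below recomputes the accuracy and the per-class precision and recall in plain Python from the values printed above.

```python
# Confusion matrix from the validation run (rows: true, columns: predicted)
cm = [[83759, 3915, 41],
      [2852, 270482, 183],
      [77, 744, 6792]]
labels = ['fra', 'eng', 'swe']

total = sum(sum(row) for row in cm)
# Correct predictions lie on the diagonal
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print('accuracy', round(accuracy, 4))

for i, label in enumerate(labels):
    recall = cm[i][i] / sum(cm[i])                    # row sum: true count
    precision = cm[i][i] / sum(row[i] for row in cm)  # column sum: predicted count
    print(label, round(precision, 2), round(recall, 2))
```

The lower recall for Swedish (0.89) corresponds to the 744 Swedish sentences predicted as English, out of 7,613.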

- Increase the number of iterations to improve the **score**. You may also change the **parameters**.

## Predict the language of a text <a name="t7"/>

Now predict the languages of the strings below.

In [48]:
docs = ["Salut les gars !", "Hejsan grabbar!", "Hello guys!", "Hejsan tjejer!"]

- Create **feature vectors** from this list. Call this matrix `X_test`

In [49]:
# Write your code here
# Extract the character frequency dictionaries from the strings list 'docs'
X_docs = list(map(lambda sentence: count_chars(sentence), docs))
#X_docs[:5]

In [50]:
# Convert the list of dictionaries into a matrix with the DictVectorizer
# fitted on the training data. Use transform(), not fit_transform(), so that
# the columns match those the classifier was trained on.
X_test = vec.transform(X_docs)
print(X_test.shape)

(4, 354)


- And run the **prediction** that you will store in a variable called `pred_languages`

In [53]:
# Write your code here
pred_languages = clf.predict(X_test)
pred_languages = [inx2lang.get(i) for i in pred_languages ]


In [54]:
pred_languages

['fra', 'swe', 'eng', 'swe']

## Submission <a name="t8"/>

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [59]:
STIL_ID = ["hi8826mo-s"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "3_language_detector_HichamMohamad.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the predicted languages.

In [60]:
ANSWER = json.dumps({'pred_langs': pred_languages})
ANSWER

'{"pred_langs": ["fra", "swe", "eng", "swe"]}'

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [61]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [62]:
ASSIGNMENT = 3
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [63]:
# Open the compressed notebook in a with-block so the file is closed afterwards
with open(SUBMISSION_NOTEBOOK_PATH, "rb") as notebook_file:
    res = requests.post("https://vilde.cs.lth.se/edan20checker/submit",
                        files={"notebook_file": notebook_file},
                        data={
                            "stil_id": STIL_ID,
                            "assignment": ASSIGNMENT,
                            "answer": ANSWER,
                            "api_key": API_KEY,
                        },
                        verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': '0fc40478b302739bcea019cabc596cf85f88c764f3cc49d0bba05f2efef9940f62629668fc1a0ad4d0727ad1bf9e322fbd44f839df6408fada9fd6a398daf46f',
 'submission_id': 'a91e248b-de96-44a4-a49e-7465878e8337'}

## Postscript from Pierre Nugues

I created this assignment from an examination I wrote last year for the course on applied machine learning. I simplified it from the `README.md` on GitHub, https://github.com/google/cld3. I found the C++ code difficult to understand and I reimplemented a Keras/TensorFlow version of it from this `README`. Should you be interested, you can find it here: https://github.com/pnugues/language-detector.