**Language Detection using FastText**

**Reference :** https://fasttext.cc/docs/en/supervised-tutorial.html

**Drive Link (data, prediction and models) :** https://drive.google.com/drive/folders/1UWe1KH3Hyppc1U52b13k_v7P1uRwt16e?usp=sharing

**Dataset :** http://www.statmt.org/europarl/

I cleaned the data, generated a csv file from each language corpus, and then merged these csv files into a single multi-class dataset (one language label per sentence) for supervised training.
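The merge step is not shown in this notebook, so the following is only a minimal sketch of how it could look. The helper name `merge_language_frames`, the toy sentences, and the cleaning rule (strip whitespace, drop empty rows) are assumptions for illustration; only the `lang` and `text` column names come from the cells below.

```python
import pandas as pd

def merge_language_frames(sentences_by_lang):
    """Stack per-language sentence lists into one (lang, text) DataFrame."""
    frames = []
    for lang, sentences in sentences_by_lang.items():
        # The scalar language code is broadcast to every sentence of that corpus
        frames.append(pd.DataFrame({"lang": lang, "text": sentences}))
    merged = pd.concat(frames, ignore_index=True)
    # Hypothetical cleaning: trim whitespace and drop rows that end up empty
    merged["text"] = merged["text"].str.strip()
    return merged[merged["text"] != ""].reset_index(drop=True)

# Toy input standing in for the per-language corpora
toy = {
    "en": ["Resumption of the session", "   "],
    "da": ["Genoptagelse af sessionen"],
}
merged = merge_language_frames(toy)
print(merged)
```

Each row carries exactly one language code, which is what the label formatting step later in the notebook relies on.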

In [None]:
%cd drive/MyDrive

/content/drive/MyDrive


In [None]:
!mkdir lang_detect
%cd lang_detect

/content/drive/MyDrive/lang_detect


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('data/europarl.csv',names = ['lang','text'])
df.head()

Unnamed: 0,lang,text
0,bg,Състав на Парламента: вж. протоколи
1,bg,Одобряване на протокола от предишното заседани...
2,bg,Състав на Парламента: вж. протоколи
3,bg,Проверка на пълномощията: вж. протоколи
4,bg,Внасяне на документи: вж. протоколи


**Dataset Info**

In [None]:
print("Languages in the dataset : \n", df['lang'].unique())
print("\n")
print("Number of unique languages in the dataset : " , len(df['lang'].unique()) )

Languages in the dataset : 
 ['da' 'lt' 'en' 'hu' 'it' 'es' 'el' 'lv' 'sv' 'cs' 'pt' 'sl' 'bg' 'sk'
 'ro' 'pl' 'de' 'fi' 'et' 'nl' 'fr']


Number of unique languages in the dataset :  21


**Shuffle the rows**

In [None]:
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)

In [None]:
df.head()

Unnamed: 0,lang,text
0,da,men indtil nu har portugal haft seks finansier...
1,lt,"gerb. pirmininke, apie tai taip pat yra kalbam..."
2,en,"secondly, on religion and identity, there is a..."
3,da,en fremtrædende afrikansk leder kom med en meg...
4,hu,nem vagyok a domináns piaci helyzetben lévő re...


**FastText Classification**


Normalization and Label Formatting

In [None]:
def normalize_text(row):
    # Build one fastText line: "__label__<lang> , <text>", with runs of whitespace collapsed
    label = "__label__" + str(row['lang'])
    text = str(row['text'])
    return ' '.join(( label + ' , ' + text ).split())

df['normalized'] = df.apply(normalize_text, axis=1)

In [None]:
df.head()

Unnamed: 0,lang,text,normalized
0,da,for det første værdien af stærke myndighedsorg...,"__label__da , for det første værdien af stærke..."
1,nl,aan de orde is de mondelinge vraag (b5-0491/20...,"__label__nl , aan de orde is de mondelinge vra..."
2,hu,"(fr) elnök úr, három ügyet szeretnék megemlíte...","__label__hu , (fr) elnök úr, három ügyet szere..."
3,bg,Относно: Микрокредити,"__label__bg , Относно: Микрокредити"
4,pl,"zważywszy, że energia jądrowa nie jest energią...","__label__pl , zważywszy, że energia jądrowa ni..."


In [None]:
from sklearn.model_selection import train_test_split
train,test = train_test_split(df , test_size = 0.25 , random_state = 1 )

In [None]:
print(train)

       lang  ...                                         normalized
354030   es  ...  __label__es , es muy importante que las autori...
643945   da  ...  __label__da , vi mener ikke, at der er retsgru...
85908    ro  ...  __label__ro , după cum au menţionat câţiva din...
227782   it  ...  __label__it , sono quindi soddisfatto che si a...
169251   lv  ...  __label__lv , es jau esmu runājis ar lielāko d...
...     ...  ...                                                ...
491263   sk  ...            __label__sk , - správa: pervenche berčs
791624   nl  ...  __label__nl , het tweede principe is dat de ve...
470924   sk  ...  __label__sk , zloženie parlamentu: pozri zápis...
491755   en  ...     __label__en , that concludes the joint debate.
128037   da  ...  __label__da , hr. formand, kære kolleger, det ...

[624228 rows x 3 columns]

In [None]:
print(test)

       lang  ...                                         normalized
183899   lv  ...  __label__lv , tajā pat laikā es lūdzu eiropas ...
713947   el  ...  __label__el , Σε καμία περίπτωση δεν μπορούμε ...
261445   sv  ...  __label__sv , herr talman, föredraganden har y...
149882   pt  ...  __label__pt , em relação à primeira proposta, ...
176906   es  ...  __label__es , por último, el cuarto pilar: el ...
...     ...  ...                                                ...
156441   es  ...  __label__es , en cuanto a si la enmienda es o ...
649789   lt  ...  __label__lt , aš suprantu, kad atsakant į bend...
325587   da  ...  __label__da , hr. formand, eu's medlemsstaters...
730582   hu  ...  __label__hu , iparunk a világ legnagyobb autóg...
358002   de  ...  __label__de , deshalb fordern wir die aussetzu...

[208077 rows x 3 columns]

In [None]:
train = train['normalized']
test = test['normalized']

In [None]:
np.savetxt('data/europarl.train', train.values, fmt="%s")
np.savetxt('data/europarl.eval', test.values, fmt="%s")

Preprocessing for FastText

In [None]:
!cat data/europarl.train | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > data/europarl.pp.train

In [None]:
!cat data/europarl.eval | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > data/europarl.pp.eval
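For text that never passes through the shell (for example, strings handed directly to the Python bindings), the same preprocessing can be approximated in Python. This is a sketch, not part of the original pipeline: the function name `preprocess` is hypothetical, and it assumes that `tr` in this environment works on bytes and therefore lowercases only ASCII letters, leaving Cyrillic and Greek capitals untouched.

```python
import re
import string

# Same character class as the sed expression: . \ ! ? , ' / ( )
PUNCT = re.compile(r"([.\\!?,'/()])")
# ASCII-only lowercasing, mirroring the assumed byte-wise behaviour of tr
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def preprocess(line):
    """Pad the listed punctuation with spaces, then lowercase ASCII letters."""
    return PUNCT.sub(r" \1 ", line).translate(ASCII_LOWER)

print(repr(preprocess("Hello, World!")))
```

Using `str.lower()` instead would also lowercase non-ASCII letters, which may not match how the training file was prepared.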


Training FastText Classifier

In [None]:
!git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.22 MiB | 11.11 MiB/s, done.
Resolving deltas: 100% (2417/2417), done.
Checking out files: 100% (526/526), done.


In [None]:
%cd fastText/
!make

/content/drive/My Drive/lang_detect/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/vector.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/model.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/utils.cc
c++ -pthread -std=c++1

In [None]:
%cd ..

/content/drive/My Drive/lang_detect


FastText will generate two files:

1) europarl.bin : the learned model, containing the optimized parameters for predicting the language label of a given text.

2) europarl.vec : a text file containing the learned vocabulary (about 1.28 million words in the run below) and the corresponding embeddings.
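To make the `.vec` layout concrete, here is a small reader sketch. It relies on the usual fastText text format: a header line with the number of words and the vector dimension, followed by one line per word with its space-separated values. The function name `read_vec` and the two-word sample are illustrative assumptions, not output from this model.

```python
import io

def read_vec(handle, limit=None):
    """Read a fastText .vec stream into (n_words, dim, {word: vector})."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # stop early; the real file holds over a million words
        parts = line.rstrip().split(" ")
        # First field is the word, the last `dim` fields are its embedding
        vectors[parts[0]] = [float(x) for x in parts[-dim:]]
    return n_words, dim, vectors

# Tiny in-memory stand-in for model/europarl.vec (dim 3 instead of 50)
sample = io.StringIO("2 3\nde 0.1 0.2 0.3 \nla -0.5 0.0 0.25 \n")
n_words, dim, vectors = read_vec(sample)
print(n_words, dim, vectors["la"])
```

On the real file, passing a small `limit` avoids loading the whole vocabulary into memory.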

In [None]:
!./fastText/fasttext supervised -input data/europarl.pp.train -output model/europarl -lr 0.5 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss one-vs-all

Read 45M words
Number of words:  1277035
Number of labels: 21
Progress: 100.0% words/sec/thread:  246539 lr:  0.000000 avg.loss:  0.055461 ETA:   0h 0m 0s


Evaluating the model on the Validation Set

In [None]:
!./fastText/fasttext test model/europarl.bin data/europarl.pp.eval

N	208077
P@1	0.99
R@1	0.99


Using FastText, I obtain the following results on the Validation set :

Precision : 99%

Recall : 99%

F1 Score  : 99%
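`fasttext test` reports only P@1 and R@1; the F1 score is derived from them as their harmonic mean. Because every example here has exactly one gold label and one predicted label, precision and recall coincide, and so does F1. A minimal sketch (the helper name `f1_score` is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported by `fasttext test` on the validation set
f1 = f1_score(0.99, 0.99)
print(round(f1, 2))  # 0.99
```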

In [None]:
!./fastText/fasttext predict model/europarl.bin  data/europarl.pp.eval > prediction/europarl.pp.eval.predict

Evaluating the model on the Test Set

In [None]:
# Loading the test dataset
test = pd.read_csv('data/europarl.test', sep='\t', names=['lang', 'text'])

# Normalizing the text in the test dataset so it conforms to the `fastText` format
test['normalized'] = test.apply(lambda row: normalize_text(row), axis=1)

# Finally, let's shuffle the examples and save the final test dataset
test = test.reindex(np.random.permutation(test.index)).reset_index(drop=True)
np.savetxt('data/europarl_normalized.test', test['normalized'].values, fmt='%s')


In [None]:
# Preprocess the normalized test set for FastText (same sed/tr pipeline as for the training data)
!cat data/europarl_normalized.test | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > data/europarl_normalized.pp.test


In [None]:
!./fastText/fasttext test model/europarl.bin data/europarl_normalized.pp.test
!./fastText/fasttext predict model/europarl.bin  data/europarl_normalized.pp.test > prediction/europarl_normalized.pp.test.predict

N	20828
P@1	0.993
R@1	0.993


Using FastText, I obtain the following results on the Test set :

Precision : 99.3%

Recall : 99.3%

F1 Score : 99.3%
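The aggregate scores do not show which languages are confused with each other. Since the `predict` command writes one label per input line, the prediction file can be compared line by line with the labelled file. The sketch below assumes that one-label-per-line layout; the helper name `per_language_accuracy` and the toy lines are illustrative.

```python
from collections import Counter

def per_language_accuracy(gold_lines, predicted_lines):
    """Return {label: accuracy} by pairing labelled lines with predicted labels."""
    total, correct = Counter(), Counter()
    for gold_line, pred_line in zip(gold_lines, predicted_lines):
        tokens = gold_line.split(maxsplit=1)
        if not tokens:
            continue  # skip blank lines
        gold = tokens[0]           # first token is "__label__<lang>"
        total[gold] += 1
        if pred_line.strip() == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy data in the same shape as the .pp.test and .predict files
gold = ["__label__en , hello there", "__label__en , good morning", "__label__da , hej"]
pred = ["__label__en", "__label__da", "__label__da"]
scores = per_language_accuracy(gold, pred)
print(scores)
```

With the files from this notebook, the two arguments would be the open handles of `data/europarl_normalized.pp.test` and `prediction/europarl_normalized.pp.test.predict`.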

**Using the model for Further Prediction**

In [None]:
!pip install fasttext

In [None]:
import fasttext  as ft
model = ft.load_model('/content/drive/MyDrive/lang_detect/model/europarl.bin')
text = [
          "Ah, bueno, me alegro de que me preguntes eso",       # Spanish
          "Have you gone crazy? Are you a witch or not?",       # English
          "Quem vê cara não vê coração.",                       # Portuguese
          "Опознай Родината, за да я обикнеш!",                 # Bulgarian
          "Vær den forandring, som du ønsker at se i verden."   # Danish
]
model.predict(text)



([['__label__es'],
  ['__label__en'],
  ['__label__pt'],
  ['__label__bg'],
  ['__label__da']],
 [array([0.997295], dtype=float32),
  array([0.9981424], dtype=float32),
  array([1.00001], dtype=float32),
  array([1.00001], dtype=float32),
  array([0.743178], dtype=float32)])
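The raw return value is a pair of parallel lists, which is awkward to read. A small helper can turn it into `(language, probability)` pairs. This is a sketch: the name `tidy_predictions` is hypothetical, and clipping probabilities to 1.0 is only a cosmetic choice for values such as `1.00001` in the output above.

```python
def tidy_predictions(labels, probs, prefix="__label__"):
    """Pair each top label (prefix removed) with its probability, capped at 1.0."""
    results = []
    for label_list, prob_list in zip(labels, probs):
        label = label_list[0]
        lang = label[len(prefix):] if label.startswith(prefix) else label
        results.append((lang, min(float(prob_list[0]), 1.0)))
    return results

# Values copied from the output above (plain lists stand in for the numpy arrays)
labels = [["__label__es"], ["__label__en"], ["__label__pt"], ["__label__bg"], ["__label__da"]]
probs = [[0.997295], [0.9981424], [1.00001], [1.00001], [0.743178]]
tidy = tidy_predictions(labels, probs)
for lang, prob in tidy:
    print(lang, prob)
```

In the notebook this would be called as `tidy_predictions(*model.predict(text))`.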