# Language detection system

Iván Vallés Pérez - 2018

This notebook builds a very simple language detection system for small datasets. It computes tf-idf features from the provided data, together with a set of features derived from Wikipedia articles in the target languages.
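
To make the tf-idf idea concrete, here is a minimal pure-Python sketch of the weighting. The actual features are built inside `src.modeling.train_model`, which is not shown in this notebook, so the use of character trigrams, the `log(N / df) + 1` form of the idf, and the helper names `char_ngrams` and `tfidf` are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(documents, n=3):
    """Compute one {ngram: weight} dict per document.

    tf  = occurrences of the n-gram in the document / n-grams in the document
    idf = log(N / df) + 1, where df is the number of documents containing it
    (hypothetical choices; the notebook's real settings are not shown)
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Iterating a Counter yields each distinct n-gram once per document.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(documents)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            gram: (freq / total) * (math.log(n_docs / doc_freq[gram]) + 1.0)
            for gram, freq in c.items()
        })
    return weights


docs = ["Can machines think?", "Kunnen machines denken?", "Kan masjiene dink?"]
features = tfidf(docs)
```

In the first sentence, the trigram `"thi"` occurs in no other document, so it receives a higher weight than `"mac"`, which is shared with the Dutch sentence. N-grams that are specific to one language are the ones a classifier can exploit.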

In [None]:
%cd ..

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os

import pandas as pd
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from src.common_paths import get_data_path, get_output_path
from src.modeling import train_model

np.random.seed(655321)

In [3]:
# Data load and train-test splits
df = pd.read_csv(os.path.join(get_data_path(), "lang_data.csv"), encoding="utf-8", sep=",").fillna({"text": "empty"})
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)

x_train, x_test, y_train, y_test = train_test_split(df.text, df.language, stratify=df.language)
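
The `stratify=df.language` argument keeps the share of each language roughly equal in the training and test sets, which matters here because the classes are imbalanced. The sketch below illustrates the idea in pure Python. It is not scikit-learn's implementation; the helper name `stratified_split` and the toy class counts are made up for illustration, and the 25% test fraction mirrors the default of `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class contributes ~test_fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class independently
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced label list (not the contents of lang_data.csv).
labels = ["English"] * 80 + ["Afrikaans"] * 16 + ["Nederlands"] * 4
train_idx, test_idx = stratified_split(labels)
```

With these toy counts the test set receives 20 English, 4 Afrikaans and 1 Nederlands sample, so even the smallest class is represented on both sides of the split.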

In [4]:
# Train the model
model = train_model(x_train, y_train)

Loading wikipedia language: enwiki
Loading wikipedia language: nlwiki
Loading wikipedia language: afwiki

In [5]:
# Show performance metrics
# Note: the predictions are passed first, i.e. as (y_pred, y_true), whereas
# scikit-learn documents the argument order (y_true, y_pred).
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)

print("=============== Results Training set ===============")
print("Accuracy: {}\n".format(accuracy_score(pred_train, y_train)))
print(classification_report(pred_train, y_train))
print("===================================================")

print("\n -- \t -- \t -- \t -- \t -- \t -- \t --\n")

print("================= Results Test set ================")
print("Accuracy: {}\n".format(accuracy_score(pred_test, y_test)))
print(classification_report(pred_test, y_test))
print("===================================================")

# Persist the trained model to the output folder
joblib.dump(model, os.path.join(get_output_path(), 'model.pkl'))

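=============== Results Training set ===============
Accuracy: 0.972757162987318

             precision    recall  f1-score   support

  Afrikaans       0.93      1.00      0.97       470
    English       0.99      0.99      0.99      1544
 Nederlands       0.97      0.57      0.72       115

avg / total       0.97      0.97      0.97      2129

===================================================

 -- 	 -- 	 -- 	 -- 	 -- 	 -- 	 --

================= Results Test set ================
Accuracy: 0.956338028169014

             precision    recall  f1-score   support

  Afrikaans       0.92      0.94      0.93       163
    English       0.98      0.99      0.99       513
 Nederlands       0.70      0.47      0.56        34

avg / total       0.95      0.96      0.95       710

===================================================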

['C:\\Users\\Ivan Valles Perez\\Desktop\\exercise\\language_detection\\output\\model.pkl']
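
A note on reading the reports above: the metrics cell passes `model.predict(...)` as the first argument, while scikit-learn's `accuracy_score` and `classification_report` document the order `(y_true, y_pred)`. Accuracy is symmetric, so it is unaffected, but per-class precision and recall trade places when the order is reversed. The pure-Python sketch below shows this on toy labels; the helper `precision_recall` and the label lists are made up for illustration and are not taken from `lang_data.csv`.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of `label`, with arguments in (y_true, y_pred) order."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the label was predicted
    actual = sum(t == label for t in y_true)     # how often the label truly occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels: one true Nederlands sample, but Nederlands is predicted twice.
y_true = ["Nederlands", "Afrikaans", "Afrikaans", "Afrikaans", "English"]
y_pred = ["Nederlands", "Nederlands", "Afrikaans", "Afrikaans", "English"]

correct_order = precision_recall(y_true, y_pred, "Nederlands")  # (0.5, 1.0)
swapped_order = precision_recall(y_pred, y_true, "Nederlands")  # (1.0, 0.5)
```

Under this reading, the `recall` column printed for Nederlands would correspond to what is usually called precision, and the `support` column would count predictions per class rather than true samples.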

## Hand-testing the model

In [29]:
model.predict(["Can machines think?"])[0]

'English'

In [27]:
model.predict(["Kunnen machines denken?"])[0]

'Nederlands'

In [28]:
model.predict(["Kan masjiene dink?"])[0]

'Afrikaans'