<a href="https://colab.research.google.com/github/mdjamina/m1_ml_lang_detector/blob/main/ML_lang_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus
[Tatoeba](https://tatoeba.org/fr/downloads) is a collection of sentences and translations.

## Downloading the corpus

In [None]:
!wget https://downloads.tatoeba.org/exports/sentences.tar.bz2

--2022-03-09 14:04:07--  https://downloads.tatoeba.org/exports/sentences.tar.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160000593 (153M) [application/octet-stream]
Saving to: ‘sentences.tar.bz2’


2022-03-09 14:04:24 (10.1 MB/s) - ‘sentences.tar.bz2’ saved [160000593/160000593]



Extract the downloaded archive:

In [None]:
!tar -xf /content/sentences.tar.bz2

## Loading the corpus

In [None]:
langs = ['eng', 'pol', 'deu', 'fra', 'spa', 'ita', 'tur', 'por', 'ara', 'ckb', 'rus', 'ukr']

In [None]:
import pandas as pd

# sentences.csv is tab-separated with no header row: sentence id, language code, text
data = pd.read_csv('/content/sentences.csv', sep='\t', header=None)
data.columns = ['id', 'lang', 'content']


In [None]:
data

Unnamed: 0,id,lang,content
0,1,cmn,我們試試看！
1,2,cmn,我该去睡觉了。
2,3,cmn,你在干什麼啊？
3,4,cmn,這是什麼啊？
4,5,cmn,今天是６月１８号，也是Muiriel的生日！
...,...,...,...
10252186,10703392,sat,ᱤᱧᱤᱡ ᱥᱮᱛᱟ ᱫᱚ ᱵᱷᱩ ᱭᱟᱭ ᱾
10252187,10703393,deu,Stell dich an.
10252188,10703394,deu,Stellt euch an.
10252189,10703395,eng,"I’m sure it’s boring, but I’m not sure at whic..."
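
The `langs` list is defined above but is not applied in any of the cells shown here. If the intent is to restrict the corpus to those languages, a minimal sketch could look like the following. The `sample` frame is a hypothetical stand-in for `data`, and `subset` is an illustrative name.

```python
import pandas as pd

# Target languages, copied from the `langs` list defined earlier in the notebook.
langs = ['eng', 'pol', 'deu', 'fra', 'spa', 'ita', 'tur', 'por', 'ara', 'ckb', 'rus', 'ukr']

# Hypothetical stand-in for the full `data` frame (same three columns).
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'lang': ['cmn', 'deu', 'eng'],
    'content': ['我們試試看！', 'Stell dich an.', 'Hello.'],
})

# Keep only the rows whose language code is in the target list.
subset = sample[sample['lang'].isin(langs)].reset_index(drop=True)
print(subset['lang'].tolist())
```

On the real corpus the same filter would be `data[data['lang'].isin(langs)]`.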


## Data cleaning

In [None]:
# Identify the columns that contain NaN (missing) values
data.isnull().any()

id         False
lang       False
content    False
dtype: bool

In [None]:
# Drop the rows whose language code is missing
data.dropna(subset = ["lang"], inplace=True)

## Grouping dialects under their main language

In [None]:
# TODO
# ISO language site, with the groupings of languages
# Download the ISO CSV file, which maps languages (arb=dz, ara=egy, ...), and load it into a pandas DataFrame

In [None]:
!wget https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3-macrolanguages.tab

--2022-03-09 14:00:16--  https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3-macrolanguages.tab
Resolving iso639-3.sil.org (iso639-3.sil.org)... 172.67.29.248, 104.22.10.254, 104.22.11.254, ...
Connecting to iso639-3.sil.org (iso639-3.sil.org)|172.67.29.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5012 (4.9K)
Saving to: ‘iso-639-3-macrolanguages.tab’


2022-03-09 14:00:17 (27.8 MB/s) - ‘iso-639-3-macrolanguages.tab’ saved [5012/5012]



In [None]:
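# header=None loads the file's own header line as data row 0 (visible in the output below);
# pass header=0 instead to use that line as the column names
macro_lang = pd.read_csv('/content/iso-639-3-macrolanguages.tab', sep='\t', header=None)
macro_lang.columns = ['M_Id', 'I_Id', 'I_Status']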

macro_lang

Unnamed: 0,M_Id,I_Id,I_Status
0,M_Id,I_Id,I_Status
1,aka,fat,A
2,aka,twi,A
3,ara,aao,A
4,ara,abh,A
...,...,...,...
450,zho,,A
451,zho,wuu,A
452,zho,yue,A
453,zza,diq,A
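
Two details of the table above are worth noting. Row 0 repeats the header, and row 450 has an empty `I_Id`, which is likely the code `nan` (Min Nan Chinese) being parsed by pandas as a missing value. The sketch below shows one way to read such a file and map individual codes to their macrolanguage. It is an assumption about how the TODO could be completed, not the notebook's method; `sample_tab`, `to_macro`, and `grouped` are hypothetical names, and the inline excerpt stands in for the downloaded file.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same tab-separated layout as iso-639-3-macrolanguages.tab.
sample_tab = "M_Id\tI_Id\tI_Status\nara\tarz\tA\nzho\tcmn\tA\nzho\tnan\tA\nzho\tyue\tA\n"

# header=0 uses the first line as column names;
# keep_default_na=False stops the code "nan" from being read as a missing value.
macro = pd.read_csv(io.StringIO(sample_tab), sep='\t', header=0, keep_default_na=False)

# Lookup table: individual language code -> macrolanguage code.
to_macro = dict(zip(macro['I_Id'], macro['M_Id']))

# Codes without a macrolanguage (here 'eng') are left unchanged.
codes = pd.Series(['cmn', 'nan', 'arz', 'eng'])
grouped = codes.map(to_macro).fillna(codes)
print(grouped.tolist())
```

Applied to the corpus, the last step would become `data['lang'].map(to_macro).fillna(data['lang'])`.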


In [None]:
a = pd.DataFrame()


# Pre-processing

In [None]:
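# Number of sentences per language, as a two-column frame: 'lang' and 'count'
data_counts = data['lang'].value_counts().rename_axis('lang').reset_index(name='count')

In [None]:
# Languages with fewer than 100 sentences
data_counts[data_counts['count'] < 100]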
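
The cell above only inspects the rare languages. If they are meant to be removed before training, one possible sketch is shown below. The `sample` frame and the `min_count` threshold are hypothetical; the notebook itself looks at a threshold of 100.

```python
import pandas as pd

# Hypothetical stand-in for `data`: two frequent languages and one rare one.
sample = pd.DataFrame({
    'lang': ['eng', 'eng', 'eng', 'fra', 'fra', 'oji'],
    'content': ['a', 'b', 'c', 'd', 'e', 'f'],
})

min_count = 2  # illustrative threshold for this toy data

# Count sentences per language, then keep the languages that reach the threshold.
counts = sample['lang'].value_counts()
frequent = counts[counts >= min_count].index
filtered = sample[sample['lang'].isin(frequent)]
print(filtered['lang'].value_counts().to_dict())
```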

## TODO

In [None]:
from sklearn.model_selection import train_test_split 

# random_state fixes the shuffle so the same train/test split is produced on every run;
# 20% of the corpus is held out for evaluation
x_train, x_test, y_train, y_test = train_test_split(data['content'], data['lang'], test_size=0.20, random_state=1)

In [None]:
y_train.value_counts()

eng    1256800
rus     721945
ita     642206
tur     573363
epo     545285
        ...   
oji          1
urh          1
ryu          1
sot          1
cyo          1
Name: lang, Length: 400, dtype: int64
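
The counts above are heavily imbalanced, and some languages appear only once. A stratified split keeps the class proportions equal in the train and test sets, but `train_test_split` raises an error when `stratify` is given a class with a single member, so the rarest languages would have to be removed first. A minimal sketch on hypothetical toy data (`texts` and `labels` are illustrative):

```python
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy corpus: 10 sentences per language.
texts = [f"sentence {i}" for i in range(20)]
labels = ['eng'] * 10 + ['fra'] * 10

# stratify=labels preserves the 50/50 language ratio in both subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=1, stratify=labels
)
print(len(x_tr), len(x_te))
```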

# Model

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF features (TfidfVectorizer defaults) fed to a linear SVM
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])


model.fit(x_train, y_train)  
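
With its default settings, `TfidfVectorizer` builds features from whole words. Character n-grams are a commonly used alternative for language identification, since they also capture spelling patterns in short or unseen words. The sketch below is not the configuration used in this notebook; the `ngram_range`, the toy sentences, and the name `char_model` are assumptions chosen for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# Same pipeline shape as above, but with character n-grams (1 to 3 characters)
# taken inside word boundaries instead of whole-word tokens.
char_model = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3))),
    ('clf', LinearSVC()),
])

# Hypothetical toy training data.
toy_texts = ['Stell dich an.', 'Stellt euch an.',
             'I am sure it is boring.', 'It is time to sleep.']
toy_labels = ['deu', 'deu', 'eng', 'eng']

char_model.fit(toy_texts, toy_labels)
pred = char_model.predict(['Stell euch an.'])
print(pred[0])
```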




In [None]:
predictions = model.predict(x_test)

In [None]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[ 0  0  0 ...  0  0  1]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 ...
 [ 0  0  0 ... 95  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0 16]]


In [None]:
print(metrics.classification_report(y_test,predictions)) 

In [None]:
print(metrics.accuracy_score(y_test,predictions))

0.948265262087597
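
Because a few languages account for most of the sentences, overall accuracy is dominated by the frequent classes. The macro-averaged F1 score weights every language equally and can expose weak performance on rare ones. A small sketch on hypothetical labels (`y_true` and `y_pred` are made up for illustration):

```python
from sklearn import metrics

# Hypothetical case: the classifier always answers 'eng'.
y_true = ['eng', 'eng', 'eng', 'eng', 'fra']
y_pred = ['eng', 'eng', 'eng', 'eng', 'eng']

acc = metrics.accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without emitting a warning.
macro_f1 = metrics.f1_score(y_true, y_pred, average='macro', zero_division=0)

print(round(acc, 3), round(macro_f1, 3))  # → 0.8 0.444
```

On the real predictions the equivalent call would be `metrics.f1_score(y_test, predictions, average='macro', zero_division=0)`.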


In [None]:
model.predict(['그러나 주로 개인적인 용도로 이용되는 위키도 있는데, 이를 개인 위키라고 한다.'])[0]

'kor'

In [None]:
model.predict(['ויקי יכולה להיות שיטה טובה לשיתוף ידע בקהילות שפועלות באמצעות האינטרנט או בתוך חברות מסחריות. היא חלופה'])[0]

'heb'

In [None]:
model.predict(['einsetzt. Zudem nutzen auch viele '])[0]

'deu'

In [None]:
model.predict(['এটা নতুনকৈ সৃষ্টি কৰিব পাৰি বা ইতিপূৰ্বে থকা পৃষ্ঠা এটা সম্পাদনা কৰিব পাৰি'])[0]

'asm'

In [None]:


model.predict(['से प्रत्येक एक विशिष्ट भाषा से संबंधित है। विकिपीडिया के अलावा, सार्वजनिक और निजी दोनों उपयोग में सैकड़ों हजारों अ'])[0]

'mar'
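
The last input appears to be Hindi, yet it is labelled `mar` (Marathi); both languages are written in Devanagari. To see how close the competing labels are, the decision scores of the linear SVM can be ranked. The sketch below uses a hypothetical toy model, and `top_labels` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy model with three languages, same pipeline shape as the notebook's.
toy_texts = ['Stell dich an.', 'Stellt euch an.',
             'I am sure it is boring.', 'It is time to sleep.',
             'Il est temps de dormir.', 'Je suis certain que non.']
toy_labels = ['deu', 'deu', 'eng', 'eng', 'fra', 'fra']
toy_model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
toy_model.fit(toy_texts, toy_labels)

def top_labels(model, text, k=3):
    """Return the k labels with the highest decision scores for one text."""
    scores = model.decision_function([text])[0]
    classes = model.named_steps['clf'].classes_
    # Stable sort on the negated scores: highest score first.
    order = np.argsort(-scores, kind='stable')[:k]
    return [(classes[i], float(scores[i])) for i in order]

ranked = top_labels(toy_model, 'Stell euch an.')
print(ranked)
```

With the notebook's model, the call would be `top_labels(model, text)` for the sentence in question.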