#### > 24/04/2020
### > Gozde Orhan

# Language Detection, Turkish or English?

This project proposes a machine learning system that classifies words by language, namely Turkish or English.

The project was carried out for a case study in which two csv files were provided: Dictionary_Turkish and Dictionary_English. Unfortunately, the files may not be shared publicly. Both files contain well-formed, genuine Turkish and English words that are 4 to 6 characters long.

The proposed system is an *SVM classifier* preceded by a simple if-else statement.

- First, if the string contains any of the letters q, w or x, the word is classified as English.
- Otherwise, the word is fed into the classifier to be labelled.
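
The two steps above can be sketched as a small function. This is only a minimal sketch, not the notebook's exact code: `classify_word` and the `model_predict` callable are hypothetical names, with `model_predict` standing in for the trained classifier (assumed to return 0 for Turkish and 1 for English, matching the language codes used later).

```python
ENGLISH_ONLY_LETTERS = set("qwx")  # letters that are not part of the Turkish alphabet


def classify_word(word, model_predict):
    """Return 'English' or 'Turkish' for a single word.

    model_predict is a hypothetical stand-in for the trained classifier:
    it takes a lowercase word and returns 0 (Turkish) or 1 (English).
    """
    word = word.lower()
    # Step 1: rule-based shortcut for q, w and x
    if any(ch in ENGLISH_ONLY_LETTERS for ch in word):
        return "English"
    # Step 2: otherwise defer to the classifier
    return "English" if model_predict(word) == 1 else "Turkish"


# Usage with a dummy model that always answers Turkish (0)
always_turkish = lambda w: 0
print(classify_word("Taxi", always_turkish))  # English (decided by the rule)
print(classify_word("masa", always_turkish))  # Turkish (decided by the dummy model)
```

Keeping the rule in front of the model means the classifier never has to learn these three letters from the data.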

The system does a great job detecting English words; however, it is less successful at detecting Turkish words.

The classifier and its parameters were selected by comparing the following classifiers: Decision Tree, Random Forest, AdaBoost, GradientBoosting and SVM, in terms of *precision and recall scores for the Turkish class*. The top-performing classifier, chosen by the best F1 score for the Turkish class, was SVM with a polynomial kernel of degree 2.

## Required packages

In [1]:
import numpy as np
import pandas as pd
import re #regular expression, imported to check whether a string contains whitespace

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE #imported to generate synthetic data for minority class

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, recall_score, confusion_matrix, precision_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

Using TensorFlow backend.


## Pre-processing

In [2]:
#Read csv files into pandas dataframe
turkish_df = pd.read_csv('Dictionary_Turkish.csv', keep_default_na=False)
english_df = pd.read_csv('Dictionary_English.csv', keep_default_na=False)

#Number of words (rows) in datasets
print('Number of words in Turkish dataset: '+ str(len(turkish_df)),
      'Number of words in English dataset: '+ str(len(english_df)), sep = '\n')

Number of words in Turkish dataset: 2235
Number of words in English dataset: 21666


In [3]:
#Preview dataset
turkish_df.head(2)

| | Kelimeler |
|---|---|
| 0 | frak |
| 1 | fasit |


In [4]:
#Preview dataset
english_df.head(2)

| | Words |
|---|---|
| 0 | flying |
| 1 | fillet |


In [5]:
#Column name of turkish_df changed for convenience
turkish_df = turkish_df.rename(columns={'Kelimeler': 'Words'})

#Words are changed to lowercase for consistency
turkish_df['Words']=turkish_df['Words'].str.lower()
english_df['Words']=english_df['Words'].str.lower()

In [6]:
#Drop duplicate, if any
turkish_df = turkish_df.drop_duplicates(subset='Words',keep='first', inplace=False)
english_df = english_df.drop_duplicates(subset='Words',keep='first', inplace=False)

#Number of words (rows) in datasets
print('Number of words in Turkish dataset: '+ str(len(turkish_df)),
      'Number of words in English dataset: '+ str(len(english_df)), sep = '\n')

Number of words in Turkish dataset: 2235
Number of words in English dataset: 21666


There were no duplicates.

In [7]:
#Control missing data
print('Turkish dataset: ', turkish_df.isnull().sum(), sep = '\n')

print()

print('English dataset: ', english_df.isnull().sum(), sep = '\n')

Turkish dataset: 
Words    0
dtype: int64

English dataset: 
Words    0
dtype: int64


There were no empty cells.

In [8]:
#Data preparation, assign language codes to words
turkish_df.insert(1, "Lang_Code", "0")
english_df.insert(1, "Lang_Code", "1")

In [9]:
#Inspect word lengths to see whether they are distinctive enough
print('Turkish dataset: ', turkish_df["Words"].apply(len).describe(), sep = '\n')

print()

print('English dataset: ', english_df["Words"].apply(len).describe(), sep = '\n')

Turkish dataset: 
count    2235.000000
mean        5.268009
std         0.709465
min         4.000000
25%         5.000000
50%         5.000000
75%         6.000000
max         6.000000
Name: Words, dtype: float64

English dataset: 
count    21666.000000
mean         5.416459
std          0.726006
min          4.000000
25%          5.000000
50%          6.000000
75%          6.000000
max          6.000000
Name: Words, dtype: float64


Word length was not a distinctive feature because both corpora contain words 4 to 6 letters long.

In [10]:
#Combine datasets into a single dataset consisting of TR and ENG words
combined_df = pd.concat([turkish_df, english_df])

#Drop words that appear in both languages - some existed; the TR copies are kept because TR words are already scarcer
combined_df = combined_df.drop_duplicates(subset='Words', keep='first', inplace=False)

#Shuffle data to avoid any ordering bias
combined_df = combined_df.sample(frac=1)

#Reset index
combined_df = combined_df.reset_index(drop=True)

print(len(combined_df))

23767


In [11]:
#Drop strings containing whitespace (e.g. uc iki)
for i in range(len(combined_df)):
    st = combined_df.at[i,'Words']
    if re.search(r"\s", st):
        combined_df = combined_df.drop([i])

#Reset index        
combined_df = combined_df.reset_index(drop=True)  

print(len(combined_df))

23756


In order to feed words into the classifiers, the words were transformed into vectors. Each vector has 26x6 = 156 elements because the maximum word length is 6 and the alphabet has 26 characters (Turkish characters such as ç, ö, ü, etc. are omitted).

Example (parentheses are for illustrative purposes):

- the letter 'a' is represented as (**1**0000000000000000000000000)
- 'masa' is represented as (000000000000**1**0000000000000)(**1**0000000000000000000000000)(000000000000000000**1**0000000)(**1**0000000000000000000000000)(00000000000000000000000000)(00000000000000000000000000)
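
The encoding described above can be sketched as a standalone function. This is a minimal sketch under stated assumptions rather than the notebook's own cell: the name `one_hot_word` and the two constants are hypothetical, and raising an error for unsupported characters or overly long words is a choice made here for illustration.

```python
ALPHABET_SIZE = 26  # a-z only; Turkish-specific letters are not supported
MAX_LEN = 6         # longest word length in the datasets


def one_hot_word(word):
    """Encode a lowercase a-z word of up to MAX_LEN letters as 156 zeros and ones."""
    if len(word) > MAX_LEN:
        raise ValueError("word is longer than MAX_LEN letters")
    vector = [0] * (ALPHABET_SIZE * MAX_LEN)
    for slot, letter in enumerate(word):
        index = ord(letter) - ord("a")  # 0 for 'a', 25 for 'z'
        if not 0 <= index < ALPHABET_SIZE:
            raise ValueError(f"unsupported character: {letter!r}")
        # Each letter position owns its own block of 26 entries
        vector[slot * ALPHABET_SIZE + index] = 1
    return vector


encoded = one_hot_word("masa")
print(len(encoded))                                 # 156
print([i for i, bit in enumerate(encoded) if bit])  # [12, 26, 70, 78]
```

Unused letter positions simply stay at zero, which is what pads words shorter than six letters.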

In [12]:
for i in range(len(combined_df)):   
    
    word = combined_df.at[i,'Words'] #get word
    n = len(word)
    vec = ""
    
    for j in range(n):
        letter = word[j]
        pos = ord(letter)-97 #alphabet index: 0 for 'a', 25 for 'z'
        vector = (str(0)*pos) + str(1) + str(0)*(25-pos) #26-character one-hot string for this letter
        vec = vec + vector #append to the word's string
        
    if n <= 6:
        remaining = 6-n
        vec = vec + str(0)*26*remaining #pad with 0's if word is shorter than 6

    result = [int(let) for let in str(vec) if let.isdigit()]
    
    for k in range(len(result)):
        combined_df.at[i, k] = (result[k])

combined_df.head()

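| | Words | Lang_Code | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jitney | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | urnful | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | itch | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | rose | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | asemia | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |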


## Training

In [13]:
#Get digitized vector columns as features
features = combined_df.iloc[:, 2:]

#Get Lang_Code as target
labels = combined_df.iloc[:, 1]

#Split dataset - 85% training and 15% test
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = .15)

y_train.value_counts() #THERE IS A HUGE IMBALANCE!

1    18329
0     1863
Name: Lang_Code, dtype: int64

In [14]:
#To reduce the bias caused by class imbalance, the over-sampling method SMOTE is used
sm = SMOTE(sampling_strategy='auto', random_state=95)
x_bal, y_bal = sm.fit_resample(x_train, y_train)

y_bal.value_counts() #now the class balance is 1:1

1    18329
0    18329
Name: Lang_Code, dtype: int64
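
For intuition, SMOTE builds each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on toy vectors. It is a simplified illustration, not imblearn's implementation: the helper `interpolate` and the toy vectors are hypothetical, and the neighbour search and random choice of the interpolation factor are left out.

```python
def interpolate(sample, neighbour, lam):
    """Return sample + lam * (neighbour - sample), for a factor 0 <= lam <= 1."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]


# Two toy one-hot vectors (hypothetical, not taken from the dataset)
a = [1, 0, 0, 0]
b = [0, 0, 1, 0]

print(interpolate(a, b, 0.25))  # [0.75, 0.0, 0.25, 0.0]
```

One consequence worth keeping in mind: points created this way generally contain fractional values, so the synthetic Turkish rows are not valid one-hot encodings of actual words.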

In [15]:
#Encoding labels: fit the encoder once, then reuse the same mapping for every split
Encoder = LabelEncoder()
y_train = Encoder.fit_transform(y_train)
y_test = Encoder.transform(y_test)
y_bal = Encoder.transform(y_bal)

In [16]:
classifiers = [DecisionTreeClassifier(), RandomForestClassifier(), AdaBoostClassifier(), 
               GradientBoostingClassifier(), SVC(C=1, kernel='poly', degree=2)]

In [17]:
for j in classifiers:
    
    name = j.__class__.__name__
    print('=================================== ' + str(name) + ' ===================================')
    
    #Fit classifier
    classifier = j
    classifier.fit(x_bal,y_bal)

    #Predict the labels on test set
    predictions = classifier.predict(x_test)

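    #Get evaluation metrics and print
    #Note: scikit-learn metrics expect (y_true, y_pred). The calls below pass
    #(predictions, y_test), so per-class recall and precision are printed swapped
    #and the confusion matrix has predictions on rows and true labels on columns.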
    print(' ')
    print('Accuracy Score  -> ',accuracy_score(predictions, y_test))
    print('F1 Score with   -> ',f1_score(predictions, y_test, average=None))
    print('Recall Score    -> ',recall_score(predictions, y_test, average=None))
    print('Precision Score -> ',precision_score(predictions, y_test, average=None))
    print(' ')
    print('Confusion matrix: ', confusion_matrix(predictions, y_test), sep='\n')

 
Accuracy Score  ->  0.8866442199775533
F1 Score with   ->  [0.44198895 0.93691443]
Recall Score    ->  [0.44077135 0.93720712]
Precision Score ->  [0.4432133  0.93662192]
 
Confusion matrix: 
[[ 160  203]
 [ 201 3000]]
 
Accuracy Score  ->  0.9172278338945006
F1 Score with   ->  [0.43809524 0.95532334]
Recall Score    ->  [0.70121951 0.92764706]
Precision Score ->  [0.31855956 0.98470184]
 
Confusion matrix: 
[[ 115   49]
 [ 246 3154]]
 
Accuracy Score  ->  0.7623456790123457
F1 Score with   ->  [0.35589354 0.8542921 ]
Recall Score    ->  [0.24528302 0.951341  ]
Precision Score ->  [0.64819945 0.77521074]
 
Confusion matrix: 
[[ 234  720]
 [ 127 2483]]
 
Accuracy Score  ->  0.819023569023569
F1 Score with   ->  [0.38512869 0.89389702]
Recall Score    ->  [0.29360465 0.94471488]
Precision Score ->  [0.55955679 0.84826725]
 
Confusion matrix: 
[[ 202  486]
 [ 159 2717]]
 
Accuracy Score  ->  0.877665544332211
F1 Score with   ->  [0.50341686 0.93024   ]
Recall Score    ->  [0.42746615 0

## Test - Evaluation

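Since every classifier performed well on English words, the comparison focused mainly on precision and recall for the Turkish class. SVM outperformed the other classifiers on that class, with the highest F1 score (~0.50) and printed recall and precision of roughly 42% and 60%. Note that the metric functions are called as `metric(predictions, y_test)`, the reverse of scikit-learn's `(y_true, y_pred)` order, so the printed recall and precision values are exchanged: the ~42% figure is SVM's Turkish precision and the ~60% figure is its Turkish recall. Accuracy and F1 are unaffected by the argument order.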
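
As a sanity check, the Turkish-class scores can be recomputed by hand from a confusion matrix. The sketch below uses the first matrix printed in the Training section, which comes from the first classifier in the list (the decision tree). It assumes that rows are predicted labels and columns are true labels, which follows from the call `confusion_matrix(predictions, y_test)`; the helper name `turkish_scores` is hypothetical.

```python
def turkish_scores(matrix):
    """Precision, recall and F1 for class 0 (Turkish).

    matrix[i][j] = number of words predicted as class i whose true class is j.
    """
    true_pos = matrix[0][0]
    predicted_turkish = sum(matrix[0])            # row 0: everything predicted Turkish
    actual_turkish = matrix[0][0] + matrix[1][0]  # column 0: everything truly Turkish
    precision = true_pos / predicted_turkish
    recall = true_pos / actual_turkish
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = turkish_scores([[160, 203], [201, 3000]])
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.4408 0.4432 0.442
```

The F1 value agrees with the printed 0.44198895, while the precision computed here (0.4408) is the number shown on the printed "Recall Score" line and the recall (0.4432) is the number on the "Precision Score" line.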

## Future Work

The proposed ML system is notably successful at identifying English words, which may indicate that, with a larger Turkish corpus, the classifier would be able to detect the language successfully. In addition, the word-length constraint in the vectorization is a limitation, since it prevents the algorithm from detecting the language of longer words.

## Now it's your turn to test! Enjoy! :)

In [18]:
#Interactive block of code that lets users enter their own words for the algorithm to classify!

#Allows users to enter 10 words; run the cell again if further testing is required
count = 0
valid = True

while valid==True:
    if count==10:
        valid = False
    else:
        word = input('Enter word to predict:\n')

        #Since our algorithm deals with max 6-length words, check length
        if len(word) <= 6:
            word = word.lower()
            n = len(word)
            vec = ''
            chars = set('wxq')

            if any((c in chars) for c in word): #if string contains w, x or q
                print('English') #count is incremented once below, after the if/else
                
            else:
                #Vectorize
                for j in range(n):
                    letter = word[j]
                    pos = ord(letter)-97 #alphabet index: 0 for 'a', 25 for 'z'
                    vector = (str(0)*pos) + str(1) + str(0)*(25-pos) #fill 1 and 0's for existing letters
                    vec = vec + vector #update empty string

                if n <= 6:
                    remaining = 6-n
                    vec = vec + str(0)*26*remaining #fill 0's if word is shorter than 6

                result = [int(let) for let in str(vec) if let.isdigit()]
                result = (np.asarray(result)).reshape(1,-1)
                
                
                prediction = classifier.predict(result) #feed word into the last fitted classifier (SVC)

                if (prediction[0] == 0):
                    print('Turkish')
                else:
                    print('English')
            print('\n')
            count+=1
        else:
            count+=1
            print('Word must be less than ' + str(7) + ' letters long')
            print('\n')

Enter word to predict:
mavi
English


Enter word to predict:
kahve
Turkish


Enter word to predict:
cam
Turkish


Enter word to predict:
remote
English


Enter word to predict:
chair
English


Enter word to predict:
black
English


Enter word to predict:
sehpa
English


Enter word to predict:
kapi
Turkish


Enter word to predict:
guzel
Turkish


Enter word to predict:
islev
Turkish


