## Language Detection

As a human, you can easily recognize the languages you know. For example, I can readily identify Hindi, Marathi, Sanskrit, Japanese, and English, but even as an Indian I cannot identify every Indian language. This is where automatic language identification helps. Google Translate, one of the most widely used translation services in the world, includes a machine learning model that detects the source language when you don't know which language you are translating from.
<br>

The most important part of training a language detection model is data. The more data you have for each language, the more accurate the model is likely to be. The dataset used here comes from Kaggle. It covers 22 popular languages with 1,000 sentences each, which makes it a reasonable dataset for training a language detection model with machine learning. [Kaggle link to get the Dataset](https://www.kaggle.com/saadsikander/movies-ratings)

### Loading Language Detection Data

In [1]:
# Loading library
import pandas as pd
import numpy as np

In [None]:
# # Convert TXT to CSV
# # Read the given text file and create a dataframe
# df = pd.read_csv("Language Detection.txt")
# # Store this dataframe in a CSV file
# df.to_csv('Language Detection.csv', index = None)
# # Read the data back from the CSV file
# df_data_csv = pd.read_csv("Language Detection.csv")
# df_data_csv.head()

In [4]:
data = pd.read_table("Language Detection.txt", delimiter=",")
data.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
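
The conversion cell above is commented out because the text file is already comma-separated. As a minimal sketch of why the conversion is unnecessary, the snippet below reads a small in-memory sample with both `pd.read_table(..., delimiter=",")` and `pd.read_csv`. The sample rows and variable names are hypothetical and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same comma-separated layout as the dataset
sample_text = "Text,language\nhello world,English\nbonjour le monde,French\n"

# read_table with a comma delimiter parses the text the same way read_csv does
via_table = pd.read_table(io.StringIO(sample_text), delimiter=",")
via_csv = pd.read_csv(io.StringIO(sample_text))

print(via_table.equals(via_csv))
print(via_csv)
```

Either call can therefore load the text file directly, without writing an intermediate CSV file.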


In [5]:
# Checking For Null values
data.isnull().sum()

Text        0
language    0
dtype: int64

In [6]:
# See all the languages present in this dataset.
data["language"].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

### Language Detection Model

#### CountVectorize X
Convert the text data to a matrix of token counts, `X`. <br>
 [Learn more about CountVectorizer.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [7]:
# Loading library
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
x = np.array(data["Text"])
y = np.array(data["language"])

In [10]:
# Convert a collection of text documents to a matrix of token counts
cv = CountVectorizer()
X = cv.fit_transform(x)

#### Split the data into training and test sets

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
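
One optional variation, not used in the notebook above, is to pass `stratify` so that every language keeps the same share in the training and test sets. The sketch below uses hypothetical toy labels to show the effect.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 samples for each of two languages
toy_texts = np.array([f"sample {i}" for i in range(12)])
toy_labels = np.array(["en"] * 6 + ["fr"] * 6)

# stratify keeps the class proportions identical in both splits
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42, stratify=toy_labels
)

train_counts = Counter(tr_y.tolist())
test_counts = Counter(te_y.tolist())
print(train_counts, test_counts)
```

Because the dataset here is perfectly balanced (1000 sentences per language), a plain random split is already close to balanced, so stratification is a safeguard rather than a requirement.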

#### Training a machine learning model for the task of Language Detection

Language detection is a multiclass classification problem, so the model is trained with the Multinomial Naïve Bayes algorithm, which often performs well on text classification with count features.
<br>
 [Learn more about Multinomial Naïve Bayes algorithm.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [12]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.953168044077135
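
`model.score` reports a single overall accuracy, which can hide languages that are often confused with one another. As a hedged sketch, a confusion matrix breaks the result down per language. The labels and predictions below are hypothetical and are not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = ["English", "English", "French", "French", "Dutch"]
y_pred = ["English", "Dutch", "French", "French", "Dutch"]
labels = ["Dutch", "English", "French"]

# Rows are true languages, columns are predicted languages (in `labels` order)
cm = confusion_matrix(y_true, y_pred, labels=labels)
acc = accuracy_score(y_true, y_pred)

print(cm)   # [[1 0 0]
            #  [1 1 0]
            #  [0 0 2]]
print(acc)  # 0.8
```

In the notebook, the same calls could be applied to `y_test` and `model.predict(X_test)` to see which of the 22 languages account for the misclassifications.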

#### Testing the model to detect the language of a text by taking a user input

In [15]:
# Example input (French for "Hello, how are you"): Bonjour comment vas-tu
user = input("Enter a Text: ")
# Use a new name so the `data` DataFrame loaded earlier is not overwritten
features = cv.transform([user])
output = model.predict(features)
print(output)

['French']
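
`predict` returns only the single most likely language. As a sketch of how to see how confident the classifier is, `predict_proba` can be used to rank all candidate languages. The tiny training set and the helper `rank_languages` below are hypothetical and stand in for the real `cv` and `model`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set, for illustration only
toy_texts = [
    "hello how are you",
    "good morning friend",
    "bonjour comment vas tu",
    "merci beaucoup mon ami",
]
toy_langs = ["English", "English", "French", "French"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_langs)


def rank_languages(text):
    """Return (language, probability) pairs, most likely first."""
    probs = toy_model.predict_proba(toy_cv.transform([text]))[0]
    ranked = sorted(zip(toy_model.classes_, probs), key=lambda p: p[1], reverse=True)
    return [(str(lang), float(p)) for lang, p in ranked]


ranking = rank_languages("bonjour mon ami")
print(ranking)
```

A low top probability is a hint that the input is short, mixed, or in a language the model has not seen.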


### Saving Model

#### Saving the model using pickle

In [16]:
# Loading library
import pickle

In [17]:
# Save the model to a file in the current working directory
Pkl_Filename = "Language_Detection.pkl"
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(model, file)

In [18]:
# Load the Model back from file
Pkl_Filename = "Language_Detection.pkl" 
with open(Pkl_Filename, 'rb') as file:  
    LD_Model = pickle.load(file)
LD_Model

MultinomialNB()
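
Note that the pickle file above stores only the classifier. The fitted `CountVectorizer` (`cv`) is still needed to turn new text into features, so a fresh session that loads only `Language_Detection.pkl` cannot make predictions. One possible approach, sketched here with a hypothetical toy training set, is to wrap both steps in a scikit-learn `Pipeline` and pickle that single object. The round trip uses `pickle.dumps`/`pickle.loads` in memory to avoid writing a file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set, for illustration only
toy_texts = [
    "hello how are you",
    "good morning friend",
    "bonjour comment vas tu",
    "merci beaucoup mon ami",
]
toy_langs = ["English", "English", "French", "French"]

# The pipeline holds the vectorizer and the classifier together
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_langs)

# Serialize and restore the whole pipeline in one step
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored object accepts raw text directly
prediction = restored.predict(["bonjour mon ami"])
print(prediction)
```

The same pattern works with `pickle.dump` and `pickle.load` on a file, as in the cells above. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.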

In [20]:
# Check prediction
# Example input (Russian for "Testing the model"): Тестирование модели
user = input("Enter a Text: ")
# The loaded model still relies on the fitted CountVectorizer `cv` from this session
features = cv.transform([user])
output = LD_Model.predict(features)
print(output)

['Russian']
