# Language Detection with Machine Learning
Aman Kharwal
October 30, 2021

Language detection is a natural language processing task where we identify the language of a text or document. This notebook will guide you through training a machine learning model for language detection using Python.

## Introduction
A few years ago, using machine learning for language identification was difficult because of a lack of data. With abundant data now available, several powerful models can be used for this task. We will use a Kaggle dataset covering 22 languages, with 1,000 text samples per language.

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the dataset
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")
data.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


## Dataset Overview
Let's check the dataset for any null values and the languages present.


In [5]:
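# In a notebook, only the last expression in a cell is displayed,
# so wrap the null check in print() if you want to see its result too.
data.isnull().sum()
data["language"].value_counts()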

language
Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: count, dtype: int64

## Preparing the Data
Next, we convert each text into a bag-of-words count vector with `CountVectorizer` and then split the data into training (67%) and test (33%) sets.

In [7]:
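# Texts are the features, language names are the labels
x = np.array(data["Text"])
y = np.array(data["language"])

# Convert each text into a sparse vector of token counts
cv = CountVectorizer()
X = cv.fit_transform(x)
# Hold out 33% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)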
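
To make the vectorization step concrete, here is a minimal sketch on a toy corpus. The three sentences and the `toy_` names are made up for illustration and are not taken from the dataset. With its default settings, `CountVectorizer` lowercases the text, keeps tokens of two or more word characters, and produces one column per distinct token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_texts = ["the cat sat", "the dog sat down", "el gato"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)  # sparse matrix: rows = texts, columns = tokens

# Columns are ordered alphabetically by token
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_X.toarray())
```

Each row counts how often every vocabulary token occurs in one text, which is the representation the classifier is trained on.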

## Training the Model
We will use the Multinomial Naïve Bayes algorithm to train our model.
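
As a reminder of what this model computes (the standard textbook formulation, not something specific to this notebook): for a text with token counts $n_w$, Multinomial Naïve Bayes picks the language $c$ that maximizes the log prior plus the count-weighted log likelihoods of the tokens. The symbols are notation introduced here: $N_{wc}$ is the total count of token $w$ in the training texts of language $c$, $N_c = \sum_w N_{wc}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter, which defaults to 1 in scikit-learn's `MultinomialNB`.

$$
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n_w \log P(w \mid c) \Big],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha\,|V|}
$$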

In [9]:
model = MultinomialNB()
model.fit(X_train, y_train)
# Mean accuracy on the held-out test set
model.score(X_test, y_test)

0.953168044077135
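
In the cells above, the vectorizer is fit on all texts before the split, so its vocabulary also includes words that occur only in the test set. One common alternative, sketched here on a made-up toy corpus rather than the Kaggle data, is to wrap both steps in a scikit-learn pipeline so the vocabulary is learned from the training texts only. The sentences and the `toy_` and `pipe` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (not from the dataset)
toy_train_texts = [
    "the cat sat on the mat",
    "the dog is in the house",
    "el gato está en la casa",
    "el perro come en la mesa",
]
toy_train_labels = ["English", "English", "Spanish", "Spanish"]

# The pipeline fits the vectorizer and the classifier on training texts only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_train_texts, toy_train_labels)

# Raw strings go in; vectorization happens inside the pipeline
toy_predictions = pipe.predict(["the dog sat", "el gato come"])
print(toy_predictions)
```

With the notebook's data, the same idea would be to split `x` and `y` first and then call `pipe.fit` on the training part. The resulting score may differ slightly from the one shown above.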

## Testing the Model
Let's test the model by detecting the language of a user input text.

In [11]:
user = input("Enter a Text: ")
# Reuse the fitted vectorizer; a separate name avoids overwriting the `data` DataFrame
features = cv.transform([user]).toarray()
output = model.predict(features)
print(output)

Enter a Text:  merhaba kardeşim deneme bu


['Turkish']
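
For reuse outside an interactive cell, the prediction step can be wrapped in a small function. The sketch below is one way to do it: `detect_language` is a hypothetical helper name, and the tiny training set is made up so the block runs on its own. It also uses `predict_proba` to report how confident the model is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data so the example is self-contained
toy_texts = ["good morning my friend", "thank you very much",
             "günaydın arkadaşım", "çok teşekkür ederim"]
toy_labels = ["English", "English", "Turkish", "Turkish"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

def detect_language(text, vectorizer, model):
    """Return (predicted language, probability) for a single text."""
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])

label, confidence = detect_language("çok teşekkür", toy_cv, toy_model)
print(label, round(confidence, 3))
```

In the notebook, the call would be `detect_language(user, cv, model)`. Note that a text whose words are all unseen during training becomes an all-zero vector, so the prediction then reflects only the class priors.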


## Summary
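Using machine learning for language identification has become easier with the availability of comprehensive datasets. In this notebook, a Multinomial Naïve Bayes model trained on bag-of-words counts reached about 95% accuracy on the held-out test set, which gives a simple way to detect languages using Python.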