Let’s start the task of language detection with machine learning by importing the necessary Python libraries:

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

and the dataset:

In [None]:
data = pd.read_csv('LanguageDetectionData/dataset.csv')
display(data.head())

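Let's check the column types and non-null counts of our data:

In [None]:
# info() prints its summary directly and returns None, so it needs no display() wrapper
data.info()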

Let’s check whether this dataset contains any null values:

In [None]:
display(data.isna().sum())

Now let’s have a look at all the languages present in this dataset:

In [None]:
data['language'].value_counts()

This dataset contains 22 languages with 1000 sentences from each language, so the classes are perfectly balanced, and it has no missing values. No further cleaning is needed before we use it to train a machine learning model.
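
The balance and missing-value checks above can also be expressed as explicit conditions rather than read off the printed output. The following is a minimal sketch on a hypothetical four-row frame (`toy`) that only borrows the real dataset's column names; it is not the dataset itself.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (same column names).
toy = pd.DataFrame({
    "Text": ["hello world", "good morning", "hola mundo", "buenos dias"],
    "language": ["English", "English", "Spanish", "Spanish"],
})

counts = toy["language"].value_counts()

n_languages = int(counts.size)               # number of distinct languages
is_balanced = bool(counts.nunique() == 1)    # True if every language has the same row count
has_missing = bool(toy.isna().any().any())   # True if any cell in the frame is null

print(n_languages, is_balanced, has_missing)
```

Running the same three expressions on `data` would confirm the claims about the real dataset in one line of output.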

Now let’s convert the text into word-count features with CountVectorizer and split the data into training and test sets:

In [None]:
x = np.array(data['Text'])
y = np.array(data['language'])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)
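
To make the vectorizing step more concrete, here is a small sketch of what `CountVectorizer` produces. The two sentences and the names `docs` and `toy_cv` are made up for illustration; the defaults used here lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only to illustrate the bag-of-words encoding.
docs = ["the cat sat", "the dog sat down"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)   # sparse matrix: one row per sentence, one column per word

vocab = sorted(toy_cv.vocabulary_)    # learned words; columns follow this alphabetical order
dense = counts.toarray()              # dense copy, only for inspecting a tiny example

# Words that were not seen during fit (and one-letter tokens such as "a") are ignored.
unseen = toy_cv.transform(["a bird sat"]).toarray()

print(vocab)
print(dense)
print(unseen)
```

Each row is a vector of word counts, which is the input format that the Naïve Bayes model in the next step expects.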

As this is a multiclass classification problem, I will use the Multinomial Naïve Bayes algorithm to train the language detection model. It is designed for count features such as word counts, and it is fast and often performs well on text classification tasks:

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)
# Fraction of test sentences classified correctly (a value between 0 and 1)
accuracy = model.score(X_test, y_test)
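
One thing to be aware of: the cell that builds `X` fits the vectorizer on the full dataset before splitting, so the vocabulary also reflects the test sentences. A possible variant, sketched here under the assumption that you want the vectorizer fitted on training text only, is to chain both steps with scikit-learn's `make_pipeline`. The sentences and labels below are hypothetical toy data, not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the notebook these would come from splitting the raw text.
train_texts = [
    "the cat sat on the mat",
    "where is the train station",
    "el gato duerme en la casa",
    "donde esta la estacion de tren",
]
train_labels = ["English", "English", "Spanish", "Spanish"]
test_texts = ["the station is closed", "la casa del gato"]
test_labels = ["English", "Spanish"]

# The pipeline fits the vectorizer on the training text only,
# then reuses that vocabulary to transform the test text.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

predictions = list(pipeline.predict(test_texts))
toy_accuracy = pipeline.score(test_texts, test_labels)

print(predictions, toy_accuracy)
```

With the real data, the same idea would mean calling `train_test_split` on the raw `x` and `y` arrays first and passing the text splits to the pipeline.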

Now let’s use this model to detect the language of a text entered by the user:

In [None]:
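user = input('Enter a Text: ')
# Use a new name so the `data` DataFrame loaded earlier is not overwritten
features = cv.transform([user])
output = model.predict(features)
print(f'Predicted Language: {output[0]}')
# accuracy is a fraction between 0 and 1; the % format spec converts it to a percentage
print(f'Accuracy: {accuracy:.2%}')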
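
If you want to reuse the prediction step outside an interactive cell, it can be wrapped in a small function. The sketch below is one way to do it: `detect_language` is a hypothetical helper name, the training sentences are toy data, and the returned probability comes from `predict_proba`, which is the model's own estimate rather than a calibrated confidence.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def detect_language(text, vectorizer, classifier):
    """Return (predicted language, model probability) for one piece of text."""
    features = vectorizer.transform([text])          # reuse the fitted vocabulary
    probs = classifier.predict_proba(features)[0]    # one probability per known language
    best = probs.argmax()
    return str(classifier.classes_[best]), float(probs[best])


# Hypothetical toy training data so the example runs on its own.
texts = [
    "the cat sat on the mat",
    "where is the train station",
    "el gato duerme en la casa",
    "donde esta la estacion de tren",
]
labels = ["English", "English", "Spanish", "Spanish"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(texts), labels)

language, probability = detect_language("the cat is on the train", toy_cv, toy_model)
print(language, round(probability, 3))
```

In the notebook itself, the call would be `detect_language(user, cv, model)` with the vectorizer and model trained above.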