<a href="https://colab.research.google.com/github/joereuben/language-identification/blob/main/Multinomial_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The Objective**
I am tasked with writing a language identification program that can detect the language of text given to it. The output of the prediction will be the language code.


---

I will be using character bigrams for vectorization and the multinomial Naive Bayes algorithm as the classifier.
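To make the idea of character bigrams concrete, here is a minimal sketch in plain Python. The helper name `char_bigrams` is invented for illustration and is not part of this notebook; the actual vectorization further down is done by scikit-learn.

```python
from collections import Counter

def char_bigrams(text):
    """Return every pair of adjacent characters in text (hypothetical helper)."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Count how often each bigram occurs in a single word
counts = Counter(char_bigrams("banana"))
print(counts)
```

Different languages favour different bigrams, which is what lets a classifier tell them apart from the counts alone.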

# Importing Libraries
In this program, I will be using four txt files containing text in four different languages. The texts have to be cleaned and vectorized before the classifier is trained on them.

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix,classification_report

# Preprocessing Function
A function for cleaning up our data before feeding it into the algorithm. It normalizes sentence-ending punctuation, removes quotation marks, colons and semicolons, and splits the text into a list of sentences

In [40]:
def process_data(file):
    # Read the whole file as UTF-8 text
    with open(file, "r", encoding="utf-8") as f:
        data = f.read()

    # Turn "?" and "!" into "." so every sentence ends the same way
    data = data.replace("?", ".")
    data = data.replace("!", ".")
    # Drop guillemets, colons and semicolons
    data = data.replace("»", "")
    data = data.replace("«", "")
    data = data.replace(":", "")
    data = data.replace(";", "")
    # Collapse ellipses and treat line breaks as sentence ends
    data = data.replace("...", ".")
    data = data.replace("…", ".")
    data = data.replace("\n", ".")
    # Collapse double spaces, then drop the remaining quotation marks
    data = data.replace("  ", " ")
    data = data.replace("\"", "")
    data = data.replace("„", "")

    # Split on ".", trim whitespace and discard empty strings
    sentences = [s.strip() for s in data.split(".")]
    return [s for s in sentences if s != ""]
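To see what the cleaning steps above produce, here is a rough sketch that applies similar rules to an in-memory string instead of a file. The function `split_sentences` and the sample text are assumptions made for this example. The regular expressions only approximate the chain of `replace` calls and are not a drop-in replacement.

```python
import re

def split_sentences(text):
    """Hypothetical string-based variant of the cleaning steps."""
    # Remove quotation marks, colons and semicolons
    text = re.sub(r'[»«:;"„]', "", text)
    # Treat ?, !, ellipses, line breaks and runs of full stops as sentence ends
    text = re.sub(r"[?!…\n]|\.{2,}", ".", text)
    # Collapse repeated spaces
    text = re.sub(r" {2,}", " ", text)
    # Split on ".", trim whitespace and discard empty strings
    return [s.strip() for s in text.split(".") if s.strip()]

sample = "Wer ist da? «Ich»; sagte er...\nGut!"
print(split_sentences(sample))
```

Each resulting sentence becomes one training example, so the number of sentences per file determines how many samples each language contributes.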

# Data Reading
Loading the txt files that contain our sample data. Each file is cleaned by the function above. The resulting sentences are combined into one array, and a second array holds the language code for each sentence

In [41]:
italian = process_data("il fratello maggiore sta guardando.txt")
english = process_data("big brother is watching.txt")
german = process_data("großer bruder schaut zu.txt")
portuguese = process_data("irmão mais velho está assistindo.txt")

X = np.array(italian + english + german + portuguese)
y = np.array(['it']*len(italian) + ['en']*len(english) + ['de']*len(german) + ['pt']*len(portuguese))
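Because each file contributes a different number of sentences, it can be worth checking how many samples each language ends up with. The sketch below uses small made-up lists in place of the real files; only the way `X` and `y` are built mirrors the cell above.

```python
import numpy as np

# Toy stand-ins for the sentence lists returned by process_data
italian = ["ciao a tutti", "buona sera"]
english = ["hello there", "good evening", "see you soon"]

X = np.array(italian + english)
y = np.array(["it"] * len(italian) + ["en"] * len(english))

# Count the number of sentences per language code
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print(label, count)
```

A strongly unbalanced count would suggest that the classifier sees far more examples of one language than another.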


# Training
Splitting the dataset into training and test sets, with 33% of the sentences held out for testing, before working with the model

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
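One optional variation, not used in this notebook, is to pass `stratify=y` so that each language keeps the same proportion in both splits. The sketch below assumes a toy dataset invented for the example.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 6 "English" and 6 "German" placeholder sentences
X = np.array([f"sentence {i}" for i in range(12)])
y = np.array(["en"] * 6 + ["de"] * 6)

# stratify=y keeps the class proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(Counter(y_test.tolist()))
```

Whether stratification matters here depends on how unevenly the four text files are sized.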

# Vectorizer
Using the CountVectorizer to count how many times each character bigram occurs in a sentence. Then we can create the pipeline that vectorizes our data and passes the counts to the Naive Bayes model

In [51]:
cnt = CountVectorizer(analyzer = 'char',ngram_range=(2,2))

pipeline = Pipeline([
   ('vectorizer',cnt),  
   ('model',MultinomialNB())
])

Fitting the pipeline on the training set and calculating predictions on the test set

In [44]:
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)

In [45]:
# confusion_matrix(y_test, y_pred)

In [46]:
# print(classification_report(y_test, y_pred))
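The two cells above are commented out, so no evaluation results are recorded here. As a sketch of how those calls behave, the example below trains the same kind of pipeline on a tiny corpus invented for illustration and prints a confusion matrix and classification report. The numbers it produces say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for this sketch only
train_texts = ["the weather is nice today", "where is the train station",
               "das wetter ist heute schön", "wo ist der bahnhof"]
train_labels = ["en", "en", "de", "de"]
test_texts = ["the station is over there", "der bahnhof ist dort"]
test_labels = ["en", "de"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(2, 2))),
    ("model", MultinomialNB()),
])
toy_pipeline.fit(train_texts, train_labels)
predicted = toy_pipeline.predict(test_texts)

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(test_labels, predicted, labels=["de", "en"])
print(cm)
print(classification_report(test_labels, predicted,
                            labels=["de", "en"], zero_division=0))
```

Off-diagonal entries in the matrix show which pairs of languages get confused with each other.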

Creating a dictionary with the language codes as keys and the language names as values to give a more friendly output on predictions

In [47]:
languages = {"en":"English", "it":"Italian", "de":"German", "pt":"Portuguese"}

# Tests
A test prediction with a sentence in English. The expected prediction is 'en', which the languages dictionary maps to "English"

In [48]:
lang = pipeline.predict(["""Braverman lived in France for two years, as an Erasmus 
Programme student and then as an Entente Cordiale Scholar, where she completed 
a master's degree in European and French law at Panthéon-Sorbonne University. """])

print("This text is in", languages[lang[0]])

This text is in English


A test prediction with a sentence in German. Expected outcome is 'de'

In [49]:
lang = pipeline.predict(["Braverman lebte zwei Jahre in Frankreich, als Studentin im Erasmus-Programm und dann als Stipendiatin der Entente Cordiale, wo sie einen Master-Abschluss in europäischem und französischem Recht an der Universität Panthéon-Sorbonne absolvierte"])
print("This text is in", languages[lang[0]])

This text is in German


A test prediction with a sentence in Portuguese. Expected outcome is 'pt'

In [50]:
lang = pipeline.predict(["Braverman morou na França por dois anos, como aluna do Programa Erasmus e depois como Entente Cordiale Scholar, onde completou um mestrado em direito europeu e francês na Universidade Panthéon-Sorbonne"])
print("This text is in", languages[lang[0]])

This text is in Portuguese
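The three tests repeat the same predict-and-look-up pattern, which could be wrapped in a small helper. The sketch below is one possible shape for it: the function name `identify_language`, the toy training data and the fallback to the raw code are all assumptions made for this example. It also reports the highest class probability from `predict_proba` as a rough confidence value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def identify_language(model, text, names):
    """Return (language name, top class probability) for text (hypothetical helper)."""
    code = model.predict([text])[0]
    confidence = float(model.predict_proba([text]).max())
    # Fall back to the raw code if it has no friendly name
    return names.get(code, code), confidence

# Toy model, trained on made-up sentences for this sketch only
toy_model = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(2, 2))),
    ("model", MultinomialNB()),
])
toy_model.fit(
    ["the weather is nice today", "where is the train station",
     "das wetter ist heute schön", "wo ist der bahnhof"],
    ["en", "en", "de", "de"],
)

toy_names = {"en": "English", "de": "German"}
name, confidence = identify_language(toy_model, "where is the weather", toy_names)
print("This text is in", name, "with probability", round(confidence, 3))
```

Naive Bayes probabilities tend to be overconfident, so the value is better read as a ranking signal than as a calibrated likelihood.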
