# Multi-Language Detection

## Install Packages

In [1]:
# !pip install datasets
# !pip install scikit-learn
# !pip install neattext
# !pip install scikit-multilearn
# pickle is part of the Python standard library, so it needs no install

## Load Dataset from Hugging Face

In [2]:
from datasets import load_dataset
import pandas as pd

In [3]:
dataset = load_dataset("papluca/language-identification", "default")
trainingData = dataset['train']

df = trainingData.to_pandas()
df

Unnamed: 0,labels,text
0,pt,"os chefes de defesa da estónia, letónia, lituâ..."
1,bg,размерът на хоризонталната мрежа може да бъде ...
2,zh,很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把...
3,th,สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ...
4,ru,Он увеличил давление .
...,...,...
69995,ja,本格的なゲーミングヘッドホンでした。 今まで使ってた1万円するパナソニックのヘッドホンは何だ...
69996,el,"Ναι , ξέρω ένα που είναι ακόμα έτσι , αλλά αυτ..."
69997,ur,اور مجھے اس ملک کے بارے میں معلوم نہیں ہے کہ گ...
69998,es,Se me rompió uno al sacarlo del cargador. Cali...


## Data Pre-Processing

### Clean Data

In [4]:
import neattext as nt
import neattext.functions as nfx

In [5]:
pd.options.display.max_colwidth=999

# Define the languages to keep
languages_to_keep = ['en', 'es', 'ru', 'de', 'fr', 'it']
# Take an explicit copy so the assignment below does not write to a slice of the original frame
# (this avoids pandas' SettingWithCopyWarning)
df = df[df['labels'].isin(languages_to_keep)].copy()

# Cap all text lengths at 50 characters to avoid long texts and help the model recognize shorter inputs
df['text'] = df['text'].str[:50]

df


Unnamed: 0,labels,text
4,ru,Он увеличил давление .
11,es,Un producto de una calidad y capacidad increíbles
13,it,Una donna sta affettando della carne.
15,de,"Alles in allem ein super schönes Teil, deshalb die"
17,de,Einer Freundin Geschenk da sie Flugbegleiterin ist
...,...,...
69979,es,"Tuve que devolverlos, el auricular izquierdo se de"
69987,it,Il governo israeliano autorizza nuovi insediamenti
69989,fr,Satisfaite de mon achat qui correspond au descript
69992,es,Muy bueno para gente con alergias


### Restructure Data for Multi-Label Classification
Since we want the model to recognize every language present in a string, we will use multi-label classification. The data is currently set up for multi-class classification, where each input is assigned exactly one label. We want it to look something like this:

| Text                                              | en | es | fr | hi |
| ------------------------------------------------- | -- | -- | -- | -- |
| test an english phrase                            |  1 |  0 |  0 |  0 |
| probar una frase en ingles                        |  0 |  1 |  0 |  0 |
| test an english phrase probar una frase en ingles |  1 |  1 |  0 |  0 |

In [6]:
import numpy as np

In [7]:
# Create a set of unique labels to represent all the languages in the dataset
unique_labels = df['labels'].unique()
num_rows = len(df)
num_cols = len(unique_labels) + 1  
reformatted_data = np.zeros((num_rows, num_cols), dtype=object)

reformatted_data[:, 0] = df['text']

# Fill the boolean values for labels
for i, label in enumerate(unique_labels):
    reformatted_data[:, i+1] = (df['labels'] == label)

# Create a DataFrame from the NumPy array
reformatted_df = pd.DataFrame(reformatted_data, columns=['text'] + list(unique_labels))

reformatted_df

Unnamed: 0,text,ru,es,it,de,fr,en
0,Он увеличил давление .,True,False,False,False,False,False
1,Un producto de una calidad y capacidad increíbles,False,True,False,False,False,False
2,Una donna sta affettando della carne.,False,False,True,False,False,False
3,"Alles in allem ein super schönes Teil, deshalb die",False,False,False,True,False,False
4,Einer Freundin Geschenk da sie Flugbegleiterin ist,False,False,False,True,False,False
...,...,...,...,...,...,...,...
20995,"Tuve que devolverlos, el auricular izquierdo se de",False,True,False,False,False,False
20996,Il governo israeliano autorizza nuovi insediamenti,False,False,True,False,False,False
20997,Satisfaite de mon achat qui correspond au descript,False,False,False,False,True,False
20998,Muy bueno para gente con alergias,False,True,False,False,False,False
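
As a more compact alternative to the loop above, the same one-column-per-language layout can be sketched with `pd.get_dummies`. The helper name `to_multi_hot` and the toy frame are assumptions made for illustration; note that `get_dummies` orders the label columns alphabetically rather than by first appearance.

```python
import pandas as pd

def to_multi_hot(frame, text_col='text', label_col='labels'):
    """Turn a single label column into one 0/1 indicator column per label."""
    # One indicator column per distinct label, stored as integers
    indicators = pd.get_dummies(frame[label_col], dtype=int)
    # Put the text first, followed by the indicator columns
    result = pd.concat([frame[[text_col]], indicators], axis=1)
    return result.reset_index(drop=True)

# Toy data standing in for the filtered dataset
toy = pd.DataFrame({
    'labels': ['en', 'es', 'en'],
    'text': ['test an english phrase',
             'probar una frase en ingles',
             'another english phrase'],
})
multi_hot = to_multi_hot(toy)
print(multi_hot)
```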


### Create Data for Multi-Language Recognition
The current dataset provides only one language per item in the text column. To train a model for multi-language recognition, we need to create new rows whose text contains multiple languages, with the labels marked accordingly. To do this, we draw random samples of rows, append the text of one sample to another, and set the label of every language that was combined.

In [8]:
pd.options.display.max_colwidth=999

sample_size = 5000
lang1 = reformatted_df.sample(n=sample_size)
lang2 = reformatted_df.sample(n=sample_size)
lang3 = reformatted_df.sample(n=sample_size)

# Generate rows with 2 languages
cap_length = 50
data_rows = [
    [(lang1['text'].iloc[index])[:cap_length] + ' ' + (lang2['text'].iloc[index])[:cap_length]] +
    (lang1.iloc[index, 1:] | lang2.iloc[index, 1:]).astype(bool).tolist()
    for index in range(len(lang1))
]

new_data = pd.DataFrame(data_rows, columns=['text'] + list(unique_labels))

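# Generate rows with 3 languages
# Note: slicing the combined text to cap_length keeps mostly the first sample's text,
# so the second sample's label can stay set even when its text has been cut off.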
data_rows = [
    [(new_data['text'].iloc[index])[:cap_length] + ' ' + (lang3['text'].iloc[index])[:cap_length]] +
    (new_data.iloc[index, 1:] | lang3.iloc[index, 1:]).astype(bool).tolist()
    for index in range(len(new_data))
]

more_new_data = pd.DataFrame(data_rows, columns=['text'] + list(unique_labels))
more_new_data

Unnamed: 0,text,ru,es,it,de,fr,en
0,В 1880-х годах в промышленность начала свою деятел Wurden nicht wie auf dem Foto in einer festen Verp,True,False,False,True,False,False
1,"Cette graisse protège bien la chaîne et tiens long В вода , используемая при испытаниях на токсичност",True,True,False,False,True,False
2,de momento las pulseras están bien y dentro del us Llego bien de tiempo y se instala fácil. Pero el c,True,True,False,False,False,False
3,Después de tenerlas conectadas todas las Navidades nel luglio 2006 il primo ministro giordano maaruf,True,True,True,False,False,False
4,Due mucche da latte che bevono da uno stagno. It i Die Farben sind wirklich schön! NUR dauert das eee,False,False,True,True,False,True
...,...,...,...,...,...,...,...
4995,Buen producto calidad precio.lo volveré a comprar В этом заявлении приводятся оценочные данные о пот,True,True,False,False,False,True
4996,Haiti è un buco di merda. La manopla funciona y a This product was okay but had to air it out for we,False,True,True,False,False,True
4997,"Sieht sehr gut aus, ist aber definitiv zu klein. I These are gorgeous and comfortable. Would definite",True,False,False,True,False,True
4998,"Preis/Leistung ist Top, Versand lief schnell. Gut A couple of the zippers were broken. For the price",False,False,False,True,False,True
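
In the output above, some rows carry a label whose text is not actually present (row 1 is marked `es` but contains only French and Russian text). This happens because the already-combined text is sliced back to `cap_length` before the third sample is appended. A minimal sketch of one way to keep text and labels in agreement is shown below; the helper `combine_rows` and the toy frames are assumptions made for illustration, not part of the original notebook, and the resulting rows are longer than the ones generated above.

```python
import pandas as pd

def combine_rows(left, right, label_cols):
    """Pair rows by position: join their texts and OR their label columns."""
    # Sampled frames keep their original index, so align them by position first
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    combined = left[label_cols].astype(bool) | right[label_cols].astype(bool)
    # Each individual text is already capped, so no further slicing is applied here
    combined.insert(0, 'text', left['text'] + ' ' + right['text'])
    return combined

# Toy frames standing in for the sampled lang1 / lang2 / lang3 frames
cols = ['ru', 'es', 'en']
a = pd.DataFrame({'text': ['Он увеличил давление .'],
                  'ru': [True], 'es': [False], 'en': [False]})
b = pd.DataFrame({'text': ['Muy bueno para gente con alergias'],
                  'ru': [False], 'es': [True], 'en': [False]})
c = pd.DataFrame({'text': ['test an english phrase'],
                  'ru': [False], 'es': [False], 'en': [True]})

two = combine_rows(a, b, cols)      # rows with up to 2 languages
three = combine_rows(two, c, cols)  # rows with up to 3 languages
print(three)
```

With the notebook's variables, this would be called as `combine_rows(lang1, lang2, list(unique_labels))`, assuming the sampled frames have the same number of rows.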


### Combine New Data with Existing

In [9]:
pd.options.display.max_colwidth=999

multi_df = pd.concat([reformatted_df, new_data, more_new_data])
# Convert the boolean label columns to integers (0/1) for model training
multi_df[unique_labels] = multi_df[unique_labels].astype(int)
multi_df

Unnamed: 0,text,ru,es,it,de,fr,en
0,Он увеличил давление .,1,0,0,0,0,0
1,Un producto de una calidad y capacidad increíbles,0,1,0,0,0,0
2,Una donna sta affettando della carne.,0,0,1,0,0,0
3,"Alles in allem ein super schönes Teil, deshalb die",0,0,0,1,0,0
4,Einer Freundin Geschenk da sie Flugbegleiterin ist,0,0,0,1,0,0
...,...,...,...,...,...,...,...
4995,Buen producto calidad precio.lo volveré a comprar В этом заявлении приводятся оценочные данные о пот,1,1,0,0,0,1
4996,Haiti è un buco di merda. La manopla funciona y a This product was okay but had to air it out for we,0,1,1,0,0,1
4997,"Sieht sehr gut aus, ist aber definitiv zu klein. I These are gorgeous and comfortable. Would definite",1,0,0,1,0,1
4998,"Preis/Leistung ist Top, Versand lief schnell. Gut A couple of the zippers were broken. For the price",0,0,0,1,0,1


### Create Train/Test Split

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X = multi_df['text']
y = multi_df[unique_labels]
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Build and Train Model
Using the scikit-multilearn (`skmultilearn`) library, we will try a few different methods and compare the results, as described [here](https://medium.com/analytics-vidhya/an-introduction-to-multi-label-text-classification-b1bcb7c7364c).

NOTE: Training currently runs into out-of-memory errors. Attempting to run on a more powerful machine.

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from sklearn.metrics import accuracy_score

### Binary Relevance

#### Train Model

In [13]:
bin_rel_model = Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('clf', BinaryRelevance(LogisticRegression(solver='sag'))),
            ])

bin_rel_model.fit(X_train, y_train)

#### Save Model for Future Use

In [16]:
from pickle import dump, load

In [18]:
# Use a context manager so the file handle is closed once the model is written
with open('model.pkl', 'wb') as f:
    dump(bin_rel_model, f)

#### Model Performance

In [19]:
predictions = bin_rel_model.predict(X_test)

predictions.toarray()

array([[0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       ...,
       [0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0]], dtype=int32)

In [20]:
accuracy_score(y_test, predictions)

0.8011612903225807
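
For multi-label targets, `accuracy_score` reports subset accuracy: a row counts as correct only when every one of its labels matches. A per-label view such as Hamming loss can complement it. The sketch below uses toy arrays (not the notebook's data) and plain NumPy to show the difference; the function names are illustrative.

```python
import numpy as np

def subset_accuracy(y_true, y_pred):
    """Fraction of rows where all labels match exactly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).all(axis=1).mean())

def hamming_loss(y_true, y_pred):
    """Fraction of individual label cells that are wrong."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true != y_pred).mean())

# Toy labels: the second row misses one of its two languages
y_true_toy = [[1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]]
y_pred_toy = [[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0]]

print(subset_accuracy(y_true_toy, y_pred_toy))  # 2 of 3 rows match exactly
print(hamming_loss(y_true_toy, y_pred_toy))     # 1 of 9 label cells is wrong
```

A single missed language makes the whole row count as wrong under subset accuracy, so the reported score is a strict measure for the mixed-language rows. `sklearn.metrics.hamming_loss` offers the same per-label measure; the sparse predictions returned here would likely need `.toarray()` first.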

#### Some Manual Testing

In [30]:
# Prediction Function
def detect(text: str) -> list[str]:
    prediction_array = bin_rel_model.predict([text]).toarray()[0]
    return [unique_labels[i] for i in range(len(prediction_array)) if prediction_array[i] == 1]

In [39]:
detect("test an english phrase that is long enough")

['en']

In [32]:
detect("протестируйте более длинную английскую фразу с большим количеством символов, пока она не сможет распознать")

['ru']

In [33]:
detect("test a longer english phrase with more characters until it can recognize prueba una frase en inglés más larga con más caracteres hasta que pueda reconocerla")

['es', 'en']

In [37]:
detect("Guten Tag! Wie geht es Ihnen? Как вы себя чувствуете? I hope you're having einen wunderbaren Tag! Это прекрасный день, не так ли? It's a beautiful day, isn't it? Я надеюсь, что вы наслаждаетесь этим солнечным днем! Spring is in the air, and it feels so refreshing! Весна приносит новую энергию и радость жизни! Es ist eine Freude, die wärmende Sonne zu spüren und die blühenden Blumen zu sehen.")

['ru', 'de', 'en']

### Classifier Chain

#### Train Model

In [40]:
class_chain_model = Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('clf', ClassifierChain(LogisticRegression(solver='sag'))),
            ])

class_chain_model.fit(X_train, y_train)

#### Model Performance

In [42]:
class_chain_predictions = class_chain_model.predict(X_test)

class_chain_predictions.toarray()

array([[0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])

In [43]:
accuracy_score(y_test, class_chain_predictions)

0.8020645161290323

### Label Powerset
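Label Powerset treats every distinct combination of labels as its own class, so it can only predict combinations that appear in the training data. Best practice is therefore to include all possible label combinations in the training data. Our generated data does not guarantee this, so this method is not likely to provide good results.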
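
To make the limitation concrete, the sketch below lists the distinct label combinations in a toy label matrix and compares them with the number of possible non-empty combinations. The helper `label_combinations` and the toy data are assumptions made for illustration.

```python
def label_combinations(y):
    """Return the distinct label combinations (rows) of a binary label matrix."""
    return sorted({tuple(int(v) for v in row) for row in y})

# Toy label matrix with 3 labels; the real data has 6
y_toy = [[1, 0, 0],
         [0, 1, 0],
         [1, 1, 0],
         [1, 0, 0]]

seen = label_combinations(y_toy)
n_labels = len(y_toy[0])
possible = 2 ** n_labels - 1  # every non-empty combination of labels

print(f"{len(seen)} of {possible} combinations seen in training")

# A combination missing from training cannot be predicted as a single class
unseen_example = (0, 1, 1)
print(unseen_example in seen)
```

With six languages there are 63 possible non-empty combinations. Assuming `y_train` is the DataFrame from the split above, `label_combinations(y_train.to_numpy())` would show how many of them the training data covers.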

#### Train Model

In [44]:
powerset_model = Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('clf',  LabelPowerset(LogisticRegression(max_iter=120))),
            ])

powerset_model.fit(X_train, y_train)

#### Model Performance

In [46]:
powerset_predictions = powerset_model.predict(X_test)

powerset_predictions.toarray()

array([[0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       ...,
       [0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0]], dtype=int64)

In [47]:
accuracy_score(y_test, powerset_predictions)

0.7162580645161291

## Conclusion
We got decent results training a multi-label classification model to identify multiple languages within a given string: accuracy on the test split was about 0.80 for Binary Relevance and Classifier Chain, and about 0.72 for Label Powerset. There were some limitations, and I ran into a few issues during training. I limited the number of languages the model supports to avoid some encoding issues and to save memory (I will look to improve on this in the future). I also capped the text length to save memory and to let shorter inputs be detected successfully. This trimming could likely be improved by cutting on word count instead of character count (for applicable languages).
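
As a rough sketch of that last idea, the helper below keeps the character cap but backs up to the last complete word instead of cutting mid-word. The name `truncate_at_word` is hypothetical, and the approach assumes a space-delimited language; text without spaces falls back to a plain character cut.

```python
def truncate_at_word(text: str, max_chars: int = 50) -> str:
    """Cap text at max_chars without splitting the final word when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut already ends on a word boundary if the next character is whitespace
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word
    head, sep, _ = cut.rpartition(' ')
    # No space found: fall back to the plain character cut
    return head.rstrip() if sep else cut

print(truncate_at_word("hello wonderful world", 10))
print(truncate_at_word("hello world again", 11))
```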

### Future Work
In later projects, I hope to create a Python library that uses the model created here to provide multi-language detection to users who might need this niche functionality. This might be as simple as loading the model and using the `detect` function defined in this project.