# **Language Identification**

## Abstract

Language identification is the task of determining the language of a given text. This notebook implements a language classifier that takes a piece of text as input and outputs a label corresponding to the predicted language.

## Table of Contents

>[Language Identification](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=K7Nmylf8rXTn)

>>[Abstract](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=Upo7YGEQrXRF)

>>[Table of Contents](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=cIWRfWDUtRAW)

>>[Setup](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=uY-DPgCprXOt)

>>[Imports](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=XRnJhRxArXKW)

>>[Download the Dataset](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=kEKXyRbhtXM5)

>>[Load the Dataset into DataFrames](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=tSnF_yeGtidv)

>>[Preprocess the Dataset](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=M0d9p6cYbu9l)

>>[Text Vectorization](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=v8J6FnCQ1JXF)

>>[Model Creation](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=742Ajk5C2v_b)

>>[Model Training](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=uE0H0LMF2vwt)

>>[Model Evaluation](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=vM1zJpQn20AG)

>>[Model Inference](#folderId=1WfdtJTLa8SuoyexztTWl78sU9abBg09q&updateTitle=true&scrollTo=7guHSVHn2ysj)



## Setup

In [1]:
!pip install datasets -q

## Imports

In [2]:
import string

import numpy as np
import pandas as pd
from datasets import load_dataset

In [3]:
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Import every Keras symbol from the same package to avoid mixing `keras` and `tensorflow.keras`
from tensorflow.keras import Model
from tensorflow.keras.layers import TextVectorization, Input, Embedding, Flatten, Dense
from tensorflow.keras.utils import to_categorical

## Download the Dataset

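The notebook cites the [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/) as its data source; the code itself loads the `papluca/language-identification` dataset from the Hugging Face Hub, which ships with train, validation, and test splits.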

In [4]:
dataset = load_dataset("papluca/language-identification")


In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 70000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
})

## Load the Dataset into DataFrames

In [6]:
label_encoder = LabelEncoder()

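def one_hot_encode(labels):
  # Fit the encoder only once, on the first split passed in (the training labels),
  # so that every split shares the same label-to-index mapping
  if not hasattr(label_encoder, "classes_"):
    label_encoder.fit(labels)
  y = label_encoder.transform(labels)
  y = to_categorical(
      y,
      num_classes=len(label_encoder.classes_)
  )

  return y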
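
As a rough illustration of what the encoding step produces, the following standard-library sketch mimics the two stages: mapping each language code to an integer id (sorted order, as `LabelEncoder` does) and expanding that id into a one-hot row. The helper names `fit_classes` and `one_hot` are hypothetical and are not part of the notebook.

```python
def fit_classes(labels):
    """Return the sorted unique labels; a label's position is its integer id."""
    return sorted(set(labels))


def one_hot(labels, classes):
    """Turn each label into a row with a single 1.0 at the label's id."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows


# Fit on the "training" labels once, then reuse the mapping for any other split
classes = fit_classes(["pt", "bg", "zh", "bg"])
encoded = one_hot(["bg", "zh"], classes)
print(classes)
print(encoded)
```

Fitting the mapping once and reusing it keeps the column order identical across the train, validation, and test labels.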

In [7]:
# Dataframes for each data split 
df_train = pd.DataFrame(dataset["train"])
df_val = pd.DataFrame(dataset["validation"])
df_test = pd.DataFrame(dataset["test"])

In [8]:
# One-hot encoded labels
y_train = one_hot_encode(df_train["labels"])
y_val = one_hot_encode(df_val["labels"])
y_test = one_hot_encode(df_test["labels"])

In [9]:
df_train

| | labels | text |
|---|---|---|
| 0 | pt | os chefes de defesa da estónia, letónia, lituâ... |
| 1 | bg | размерът на хоризонталната мрежа може да бъде ... |
| 2 | zh | 很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把... |
| 3 | th | สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ... |
| 4 | ru | Он увеличил давление . |
| ... | ... | ... |
| 69995 | ja | 本格的なゲーミングヘッドホンでした。 今まで使ってた1万円するパナソニックのヘッドホンは何だ... |
| 69996 | el | Ναι , ξέρω ένα που είναι ακόμα έτσι , αλλά αυτ... |
| 69997 | ur | اور مجھے اس ملک کے بارے میں معلوم نہیں ہے کہ گ... |
| 69998 | es | Se me rompió uno al sacarlo del cargador. Cali... |


In [10]:
df_train[df_train["labels"] == "it"]

| | labels | text |
|---|---|---|
| 13 | it | Una donna sta affettando della carne. |
| 53 | it | L'India e il Pakistan rimangono fuori dal trat... |
| 60 | it | L'Egitto impone lo stato di emergenza dopo 95 ... |
| 85 | it | Un animale marrone peloso sta dietro ad alcune... |
| 117 | it | Due ragazze brune si siedono in cima a una mot... |
| ... | ... | ... |
| 69946 | it | Sondaggi aperti alle elezioni presidenziali russe |
| 69948 | it | Il ministero egiziano esorta ancora una volta ... |
| 69952 | it | Inoltre, le aziende di tutto lo Utah si stanno... |
| 69956 | it | Nel 2001 sono passati alla Syndia attraverso l... |


In [11]:
num_languages = len(df_train["labels"].unique())
print(f"Total number of languages in the training dataset: {num_languages}")

Total number of languages in the training dataset: 20


## Preprocess the Dataset

In [12]:
def preprocess(text):

  # Lowercase the text
  text = text.lower()

  # Remove punctuation
  text = text.translate(str.maketrans("", "", string.punctuation))

  return text

In [13]:
df_train["preprocessed_text"] = df_train["text"].apply(preprocess)
df_val["preprocessed_text"] = df_val["text"].apply(preprocess)
df_test["preprocessed_text"] = df_test["text"].apply(preprocess)

In [14]:
df_train

| | labels | text | preprocessed_text |
|---|---|---|---|
| 0 | pt | os chefes de defesa da estónia, letónia, lituâ... | os chefes de defesa da estónia letónia lituâni... |
| 1 | bg | размерът на хоризонталната мрежа може да бъде ... | размерът на хоризонталната мрежа може да бъде ... |
| 2 | zh | 很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把... | 很好，以前从不去评价，不知道浪费了多少积分，现在知道积分可以换钱，就要好好评价了，后来我就把... |
| 3 | th | สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ... | สำหรับ ของเก่า ที่ จริงจัง ลอง honeychurch ... |
| 4 | ru | Он увеличил давление . | он увеличил давление |
| ... | ... | ... | ... |
| 69995 | ja | 本格的なゲーミングヘッドホンでした。 今まで使ってた1万円するパナソニックのヘッドホンは何だ... | 本格的なゲーミングヘッドホンでした。 今まで使ってた1万円するパナソニックのヘッドホンは何だ... |
| 69996 | el | Ναι , ξέρω ένα που είναι ακόμα έτσι , αλλά αυτ... | ναι ξέρω ένα που είναι ακόμα έτσι αλλά αυτό ... |
| 69997 | ur | اور مجھے اس ملک کے بارے میں معلوم نہیں ہے کہ گ... | اور مجھے اس ملک کے بارے میں معلوم نہیں ہے کہ گ... |
| 69998 | es | Se me rompió uno al sacarlo del cargador. Cali... | se me rompió uno al sacarlo del cargador calid... |
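
The Chinese and Japanese rows above still contain punctuation such as `，` and `。` after preprocessing, because `string.punctuation` only covers ASCII characters. One possible alternative, sketched here under the assumption that all Unicode punctuation should be dropped, filters on the Unicode general category instead. The function name `preprocess_unicode` is hypothetical, and this variant is not what the notebook's model was trained with.

```python
import string
import unicodedata


def preprocess_unicode(text: str) -> str:
    """Lowercase the text and drop every character whose Unicode category is punctuation (P*)."""
    text = text.lower()
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


sample = "很好，以前。"

# ASCII-only stripping, as in the notebook's preprocess(): full-width punctuation survives
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)

# Category-based stripping removes the full-width comma and the ideographic full stop
print(preprocess_unicode(sample))
print(preprocess_unicode("Amo il mio cane!"))
```

Note that the two approaches are not strict supersets of each other: symbols such as `$` or `+` belong to `string.punctuation` but fall under Unicode symbol categories (S*), so this sketch would keep them.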


## Text Vectorization

In [15]:
vocabulary_size = 20000
max_seq_len = 32

text_vectorization = TextVectorization(
    max_tokens=vocabulary_size,
    output_sequence_length=max_seq_len
)

text_vectorization.adapt(df_train["preprocessed_text"])
text_vectorization_vocabulary = text_vectorization.get_vocabulary()

In [16]:
text_vectorization_vocabulary[:20]

['',
 '[UNK]',
 '्',
 'de',
 'a',
 'la',
 'the',
 'i',
 'que',
 'na',
 'на',
 'in',
 'ya',
 'и',
 'un',
 'в',
 'die',
 'it',
 'es',
 'के']
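
To make the vocabulary listing above easier to interpret, here is a simplified standard-library sketch of the general idea behind the vectorization step: tokens are split on whitespace, ranked by frequency, and mapped to integer ids, with id 0 reserved for padding and id 1 for unknown tokens. The helper names `build_vocabulary` and `vectorize` and the alphabetical tie-breaking are assumptions made for illustration; this is not the `TextVectorization` implementation.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Rank whitespace-separated tokens by frequency; reserve '' (padding) and '[UNK]'."""
    counts = Counter(token for text in texts for token in text.split())
    # Ties are broken alphabetically here purely to keep the sketch deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["", "[UNK]"] + [token for token, _ in ranked[: max_tokens - 2]]


def vectorize(text, vocabulary, seq_len):
    """Map tokens to ids, truncate to seq_len, and pad with 0 on the right."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


vocabulary = build_vocabulary(["a b a", "a c"], max_tokens=4)
print(vocabulary)
print(vectorize("a c b", vocabulary, seq_len=5))
print(vectorize("a a a", vocabulary, seq_len=2))
```

Whitespace splitting also explains why languages written without spaces between words, such as Chinese or Japanese, tend to yield very long "tokens" that rarely reappear in the vocabulary.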

In [17]:
# Vectorized text 
x_train = text_vectorization(df_train["preprocessed_text"])
x_val = text_vectorization(df_val["preprocessed_text"])
x_test = text_vectorization(df_test["preprocessed_text"])

In [18]:
x_train.shape

TensorShape([70000, 32])

## Model Creation

In [19]:
inputs = Input(shape=(max_seq_len,), name="inputs_layer")  # trailing comma makes the shape a one-element tuple

embedding_layer = Embedding(vocabulary_size, 32, name="embedding_layer")(inputs)

flatten_layer = Flatten(name="flatten_layer")(embedding_layer)

outputs = Dense(units=num_languages, activation="softmax", name="outputs_layer")(flatten_layer)

model = Model(inputs, outputs)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs_layer (InputLayer)   [(None, 32)]              0         
                                                                 
 embedding_layer (Embedding)  (None, 32, 32)           640000    
                                                                 
 flatten_layer (Flatten)     (None, 1024)              0         
                                                                 
 outputs_layer (Dense)       (None, 20)                20500     
                                                                 
Total params: 660,500
Trainable params: 660,500
Non-trainable params: 0
_________________________________________________________________
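
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the hyperparameters used in the notebook; `embedding_dim` is simply a name given here to the value 32 passed to `Embedding`.

```python
vocabulary_size = 20000
embedding_dim = 32      # second argument of Embedding(...) in the notebook
max_seq_len = 32
num_languages = 20

# One embedding vector per vocabulary entry
embedding_params = vocabulary_size * embedding_dim

# Flatten concatenates the 32 token embeddings into a single vector
flattened_units = max_seq_len * embedding_dim

# Dense layer: one weight per (input unit, class) pair plus one bias per class
dense_params = flattened_units * num_languages + num_languages

total_params = embedding_params + dense_params
print(embedding_params, flattened_units, dense_params, total_params)
```

Almost all of the 660,500 parameters sit in the embedding table, so the vocabulary size is the main lever on model size.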


## Model Training

In [20]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, validation_data=(x_val, y_val))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7efc1ae65c50>
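
For readers unfamiliar with the loss passed to `model.compile`, the following standard-library sketch shows how categorical cross-entropy compares a softmax output with a one-hot label. The function names and the clipping constant `eps` are assumptions chosen for illustration; Keras computes the same quantity on batched tensors.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probabilities))


num_languages = 20

# A model that has learned nothing assigns equal probability to every language
uniform = softmax([0.0] * num_languages)
target = [1.0] + [0.0] * (num_languages - 1)
loss_untrained = categorical_crossentropy(target, uniform)
print(loss_untrained)
```

With 20 equally likely classes the loss equals ln(20), roughly 3.0, which is a useful reference point when reading the first-epoch loss of a 20-way classifier.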

## Model Evaluation

In [39]:
def decode_labels(y):
  # Convert one-hot (or probability) rows into integer class indices
  return np.argmax(y, axis=1)

In [45]:
y_true = decode_labels(y_test)
y_pred = decode_labels(model.predict(x_test))

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96       500
           1       0.98      0.99      0.99       500
           2       1.00      1.00      1.00       500
           3       1.00      0.99      1.00       500
           4       1.00      1.00      1.00       500
           5       1.00      0.99      0.99       500
           6       0.99      1.00      1.00       500
           7       1.00      0.91      0.95       500
           8       0.99      0.98      0.98       500
           9       0.36      0.55      0.43       500
          10       1.00      0.98      0.99       500
          11       0.88      0.98      0.93       500
          12       0.98      0.98      0.98       500
          13       1.00      0.96      0.98       500
          14       0.98      0.98      0.98       500
          15       1.00      0.15      0.27       500
          16       0.97      0.95      0.96       500
          17       0.98    
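
The columns of the report above can be reproduced from two lists of class indices. The sketch below computes precision, recall, F1, and support per class using only the standard library; the function name `per_class_metrics` and the toy labels are made up for illustration and do not come from the notebook's test set.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} from two label lists."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = (precision, recall, f1, tp + fn)
    return metrics


# Toy labels: one example of class 0 is misclassified as class 1
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
report = per_class_metrics(toy_true, toy_pred)
for cls, values in report.items():
    print(cls, values)
```

Read this way, class 15 in the report (precision 1.00, recall 0.15) is rarely predicted but correct whenever it is, while class 9 (precision 0.36, recall 0.55) is predicted far more often than it actually occurs, which suggests it absorbs errors from other classes.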

## Model Inference

In [46]:
df_test[:10]

| | labels | text | preprocessed_text |
|---|---|---|---|
| 0 | nl | Een man zingt en speelt gitaar. | een man zingt en speelt gitaar |
| 1 | nl | De technologisch geplaatste Nasdaq Composite I... | de technologisch geplaatste nasdaq composite i... |
| 2 | es | Es muy resistente la parte trasera rígida y lo... | es muy resistente la parte trasera rígida y lo... |
| 3 | it | "In tanti modi diversi, l'abilità artistica de... | in tanti modi diversi labilità artistica dei m... |
| 4 | ar | منحدر يواجه العديد من النقاشات المتجهه إزاء ال... | منحدر يواجه العديد من النقاشات المتجهه إزاء ال... |
| 5 | ru | Через каждые сто градусов пятна краски меняют ... | через каждые сто градусов пятна краски меняют ... |
| 6 | tr | Sözlüğün yanı sıra, ortalama modern okuyucu iç... | sözlüğün yanı sıra ortalama modern okuyucu içi... |
| 7 | nl | Verschillende mensen op motorfietsen op een ma... | verschillende mensen op motorfietsen op een ma... |
| 8 | fr | Bonjour, Le produit est conforme à la descript... | bonjour le produit est conforme à la descripti... |
| 9 | es | No funciona lo he devuelto, no hace nada | no funciona lo he devuelto no hace nada |


In [47]:
predictions = np.argmax(model.predict(x_test[:10]), axis=1)
label_encoder.inverse_transform(predictions)



array(['nl', 'nl', 'es', 'it', 'ar', 'ru', 'tr', 'nl', 'fr', 'es'],
      dtype=object)

In [51]:
inference_sentences = [
    "i love my dog",
    "amo il mio cane!"
    ]

for sentence in inference_sentences:
  s = preprocess(sentence)
  s = text_vectorization([s])  # wrap in a list so the model receives a batch of one sequence
  s_pred = model.predict(s)
  s_lang_index = np.argmax(s_pred, axis=1)
  s_lang = label_encoder.inverse_transform(s_lang_index)

  print(f"Sentence: {sentence}\nPredicted Language: {s_lang}\n\n")

Sentence: i love my dog
Predicted Language: ['en']


Sentence: amo il mio cane!
Predicted Language: ['it']
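
The inference loop reports only the single most likely language. When a confidence estimate is useful, the probability row returned by `model.predict` can be ranked instead. The sketch below shows the idea with the standard library; the helper `top_k`, the three-language class list, and the probability values are all hypothetical placeholders rather than outputs of the trained model.

```python
def top_k(probabilities, class_names, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical class list and softmax output for one sentence
class_names = ["en", "it", "nl"]
probabilities = [0.10, 0.85, 0.05]

best = top_k(probabilities, class_names, k=2)
print(best)
```

In the notebook, `label_encoder.classes_` would play the role of `class_names`, and one row of `model.predict(...)` would supply `probabilities`.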


