Steps: \
1. Preprocess the data \
2. Train the model using fasttext \
3. Using trained model to predict \
4. Using FastText pretrained model. \
5. Using Google Translator to translate into English



**Preprocessing data**

In [1]:
import pandas as pd
import numpy as np
import math

In [None]:
df = pd.read_csv('/content/europarl.csv', names=['lang', 'text'], encoding='latin1')

In [12]:
df.head(3)

Unnamed: 0,lang,text
0,bg,Ð¡ÑÑÑÐ°Ð² Ð½Ð° ÐÐ°ÑÐ»Ð°Ð¼ÐµÐ½ÑÐ°: Ð²Ð¶. ...
1,bg,ÐÐ´Ð¾Ð±ÑÑÐ²Ð°Ð½Ðµ Ð½Ð° Ð¿ÑÐ¾ÑÐ¾ÐºÐ¾Ð»Ð° Ð...
2,bg,Ð¡ÑÑÑÐ°Ð² Ð½Ð° ÐÐ°ÑÐ»Ð°Ð¼ÐµÐ½ÑÐ°: Ð²Ð¶. ...


In [44]:
df.shape

(832305, 3)

Shuffle data

In [13]:
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)

In [23]:
df[:2]

Unnamed: 0,lang,text
0,bg,Ð Ð¼Ð¾Ð¼ÐµÐ½ÑÐ° ÑÐµÐºÐ°Ñ ÑÐ°Ð·Ð¸ÑÐºÐ²Ð°Ð...
1,pl,"jeÅli chodzi o daphne (ochrona dzieci, kobiet..."


Data normalization and label formatting.

In [41]:
def normalize_text(row):

    label = '__label__' + str(row['lang'])
    txt = str(row['text'])

    return ' '.join(( label + ' , ' + txt ).split())

In [42]:
df['normalized'] = df.apply( lambda row: normalize_text(row), axis=1 )

In [43]:
df['normalized'].head(2)

0    __label__bg , Ð Ð¼Ð¾Ð¼ÐµÐ½ÑÐ° ÑÐµÐºÐ°Ñ ÑÐ...
1    __label__pl , jeÅli chodzi o daphne (ochrona ...
Name: normalized, dtype: object

Split the data into 75% train and 25% test

In [50]:
split = math.floor(len(df)* 0.75)
train = df['normalized'][:split].copy()
test = df['normalized'][split:].copy()

In [52]:
np.savetxt('/content/europarl.train', train.values, fmt="%s")
np.savetxt('/content/europarl.eval', test.values, fmt="%s")


**Using Fasttext to train the model**

In [2]:
!git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Enumerating objects: 3946, done.[K
remote: Counting objects: 100% (1005/1005), done.[K
remote: Compressing objects: 100% (153/153), done.[K
remote: Total 3946 (delta 904), reused 862 (delta 851), pack-reused 2941[K
Receiving objects: 100% (3946/3946), 8.26 MiB | 33.54 MiB/s, done.
Resolving deltas: 100% (2511/2511), done.


In [3]:
%%bash
cd fastText/
make

c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/vector.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/model.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/utils.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -

In [4]:
%%bash
mkdir -p model

In [30]:
%%bash
/content/fastText/fasttext supervised -input /content/europarl.train -output /content/model/europarl

Read 1M wordsRead 2M wordsRead 3M wordsRead 4M wordsRead 5M wordsRead 6M wordsRead 7M wordsRead 8M wordsRead 9M wordsRead 10M wordsRead 11M wordsRead 12M wordsRead 13M wordsRead 14M wordsRead 15M wordsRead 16M wordsRead 17M wordsRead 18M wordsRead 19M wordsRead 20M wordsRead 21M wordsRead 22M wordsRead 23M wordsRead 24M wordsRead 25M wordsRead 26M wordsRead 27M wordsRead 28M wordsRead 29M wordsRead 30M wordsRead 31M wordsRead 32M wordsRead 33M wordsRead 34M wordsRead 35M wordsRead 36M wordsRead 37M wordsRead 37M words
Number of words:  1800628
Number of labels: 21
Progress:   0.1% words/sec/thread:  112859 lr:  0.099927 avg.loss:  3.060272 ETA:   0h 2m17sProgress:   0.2% words/sec/thread:  122845 lr:  0.099838 avg.loss:  3.060119 ETA:   0h 2m 6sProgress:   0.3% words/sec/thread:  136572 lr:  0.099732 avg.loss:  3.035014 ETA:   0h 1m53sProgress:   0.4% words/sec/thread:  159491 lr:  0.099584 avg.loss:  2.825322 ETA:   0h 1m37sProgress:   0.6% w

Test the accuracy of the trained model in the evaluation data set.

In [34]:
%%bash
/content/fastText/fasttext test /content/model/europarl.bin /content/europarl.eval

N	202305
P@1	0.989
R@1	0.989


The accuracy is 98.9%.

Predict using the trained model

In [7]:
!pip3 install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/68.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m68.8/68.8 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199678 sha256=e3bcb46b023231d3c6342b8456bb613619644b907ba012bbc7d886551c8de941
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [35]:
import fasttext as ft

In [36]:
model = ft.load_model('/content/model/europarl.bin')



In [38]:
text = open('/content/europarl.test').readlines()

In [40]:
predictions = model.predict(text)

ValueError: ignored

In [None]:
predictions[:10]

Test the accuracy of the trained model on the test data set

In [41]:
def normalize_text(row):
    """make text exmaple """
    label = '__label__' + str(row['lang'])
    txt = str(row['text'])

    return ' '.join(( label + ' , ' + txt ).split())

In [43]:
test = pd.read_csv('/content/europarl.test', sep='\t', names=['lang', 'text'])
test['normalized'] = test.apply(lambda row: normalize_text(row), axis=1)
test = test.reindex(np.random.permutation(test.index)).reset_index(drop=True)
np.savetxt('/content/europarl_normalized.test', test['normalized'].values, fmt='%s')

In [44]:
%%bash
/content/fastText/fasttext test /content/model/europarl.bin /content/europarl_normalized.test

N	20828
P@1	0.981
R@1	0.981


The accuracy on training dataset is 98.1%.

**Fasttext pretrained model**

In [52]:
pretrained_text = ("Hago contenido con mucho esfuerzo, "
          "sería muy motivador si pudieras"
          "suscribirte a mi canal")

In [45]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

--2023-07-31 14:42:24--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.165.83.35, 18.165.83.79, 18.165.83.91, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.165.83.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘lid.176.bin’


2023-07-31 14:42:24 (247 MB/s) - ‘lid.176.bin’ saved [131266198/131266198]



In [47]:
model_loc = '/content/lid.176.bin'
model = ft.load_model(model_loc)



In [53]:
print(pretrained_text)

Hago contenido con mucho esfuerzo, sería muy motivador si pudierassuscribirte a mi canal


In [54]:
result = model.predict(pretrained_text)
print(result)

(('__label__es',), array([0.99334759]))


Pretrained model predicted the text as Spanish(language code 'es'), the probability is 0.99.

**Google Translate: translate into English**

In [57]:
!pip install googletrans==3.1.0a0

Collecting googletrans==3.1.0a0
  Downloading googletrans-3.1.0a0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting httpx==0.13.3 (from googletrans==3.1.0a0)
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.1/55.1 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
Collecting hstspreload (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading hstspreload-2023.1.1-py3-none-any.whl (1.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
Collecting chardet==3.* (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m133.4/133.4 kB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting idna==2.* (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading idna-2.10-py2.py3-none-any.whl (58 kB)
[2K     [90m━━━━━━━━━━━━

In [58]:
from googletrans import Translator

In [59]:
translator = Translator()

In [60]:
result = translator.translate(pretrained_text, lang_tgt='en')

In [61]:
print(result)

Translated(src=es, dest=en, text=I make content with a lot of effort, it would be very motivating if you could subscribe to my channel, pronunciation=I make content with a lot of effort, it would be very motivating if you could subscribe to my channel, extra_data="{'translat...")
