## FastText Word Vectors
**fastText** is an open-source, free, lightweight library developed by Facebook AI Research for efficient text classification and representation learning. It is based on the word2vec algorithm but extends it to handle subword information. This means that fastText can capture the meaning of words even if they are not present in the training data by breaking them down into their constituent subwords and learning representations for them. 

**FastText** can be used for a variety of NLP tasks, including text classification, sentiment analysis, and language modeling. It is particularly useful for languages with complex morphology, where words can be formed by combining multiple morphemes or subwords. FastText has been shown to achieve state-of-the-art performance on several benchmark datasets and is widely used in industry and academia.

<img src = "img.png" width = "700px" height = "500px"></img>

* So based on feeding charecters instead of word, the **fastText** technique don't have OOV problem.

<img src = "img1.png" width = "700px" height = "500px"></img>

* For custom word embedding the **fastText** is the first choice.

* Here we first download pre-trained **fastText model** and use it and then we train a custom fastText model based on our own dataset.

In [1]:
# So first we import fasttext:
import fasttext

In [3]:
# Next we load the dataset:
model_en = fasttext.load_model("E://model//cc.en.300.bin")    # this model is trained using fasttext.



In [4]:
# Now to know which methods are available for this model:
dir(model_en)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_labels',
 '_words',
 'f',
 'get_analogies',
 'get_dimension',
 'get_input_matrix',
 'get_input_vector',
 'get_label_id',
 'get_labels',
 'get_line',
 'get_meter',
 'get_nearest_neighbors',
 'get_output_matrix',
 'get_sentence_vector',
 'get_subword_id',
 'get_subwords',
 'get_word_id',
 'get_word_vector',
 'get_words',
 'is_quantized',
 'labels',
 'predict',
 'quantize',
 'save_model',
 'set_args',
 'set_matrices',
 'test',
 'test_label',
 'words']

In [5]:
# So one of the methods is get_nearest_neighbors which list us the nearst words to a given word:
model_en.get_nearest_neighbors('microphone')

[(0.8392396569252014, 'microphones'),
 (0.8321380019187927, 'mic'),
 (0.7451598644256592, 'microphone.The'),
 (0.729803740978241, 'mics'),
 (0.7280287742614746, 'microphone.'),
 (0.7029080986976624, 'Microphone'),
 (0.6825612187385559, 'mic.'),
 (0.6728053092956543, 'earpiece'),
 (0.6616590619087219, 'microphone-'),
 (0.611439049243927, 'headset')]

In [6]:
# You can get word vector for each individual word:
model_en.get_word_vector("better")

array([-0.06745417, -0.00459055,  0.0292529 ,  0.06398505, -0.05890018,
       -0.02680354,  0.0673181 , -0.01874053, -0.01697014,  0.01742896,
       -0.05997003,  0.01556631, -0.01342714,  0.00979408,  0.00383289,
       -0.00058665,  0.06775226,  0.01824615, -0.02152427,  0.02759496,
       -0.01638171, -0.00524545, -0.01547334,  0.00694389, -0.04726338,
       -0.04980147,  0.0288943 , -0.03506989,  0.08879457,  0.00633675,
        0.04459114, -0.00759133, -0.02217289,  0.04462425,  0.06456985,
        0.02746628, -0.0354324 , -0.01573211, -0.0217766 ,  0.00482742,
        0.01072012, -0.01384184, -0.06619801,  0.02753797, -0.01134966,
        0.02921755, -0.00332457,  0.00827356,  0.00517734,  0.01100364,
       -0.01660618,  0.0219216 , -0.02292698,  0.0007693 , -0.02968867,
       -0.03125088, -0.0202588 , -0.02813895, -0.0620019 , -0.0088591 ,
        0.00561525, -0.05416813, -0.04537176, -0.05977495, -0.02932148,
        0.0088325 , -0.0464927 , -0.02898412, -0.01737385, -0.00

In [7]:
# Shape of each vector will be 300:
model_en.get_word_vector("better").shape

(300,)

In [8]:
# Next method is get_anologies which find relationship between first two words and predict it for the 3rd word:
model_en.get_analogies("berlin","germany","france")

[(0.7303731441497803, 'paris'),
 (0.6408537030220032, 'france.'),
 (0.6393311023712158, 'avignon'),
 (0.6316676139831543, 'paris.'),
 (0.5895596742630005, 'montpellier'),
 (0.5884554386138916, 'rennes'),
 (0.5850598812103271, 'grenoble'),
 (0.5832924246788025, 'london'),
 (0.5806092619895935, 'strasbourg'),
 (0.574320375919342, 'Paris.')]

In [11]:
model_en.get_analogies("berlin","germany","Afghanistan")

[(0.7053670883178711, 'Kabul'),
 (0.6333202123641968, 'Jalalabad'),
 (0.6297706365585327, 'Kandahar'),
 (0.6237329244613647, 'Afghan'),
 (0.6198960542678833, 'Afghanistan.The'),
 (0.614515483379364, 'Aghanistan'),
 (0.6143754124641418, 'Afghanistan.In'),
 (0.6122112274169922, 'Herat'),
 (0.5971596837043762, 'Afghanistan.He'),
 (0.583417534828186, 'Helmand')]

In [12]:
# or
model_en.get_analogies("driving","car","phone")

[(0.610385537147522, 'texting'),
 (0.5203558802604675, 'phone-calling'),
 (0.5153835415840149, 'cellphone'),
 (0.5135326981544495, 'cell-phone'),
 (0.5117910504341125, 'dialing'),
 (0.5087355971336365, 'texing'),
 (0.5079342722892761, 'text-messaging'),
 (0.500900387763977, 'txting'),
 (0.4960441589355469, 'texting.'),
 (0.4951859414577484, 'Texting')]

### Custom train word embeddings on indian food receipes 😋
dataset credits: https://www.kaggle.com/datasets/sooryaprakash12/cleaned-indian-recipes-dataset

In [14]:
# Let's import pandas and read the CSV file:
import pandas as pd

df = pd.read_csv("Cleaned_Indian_Food_Dataset.csv")
df.sample(5)

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,TotalTimeInMins,Cuisine,TranslatedInstructions,URL,Cleaned-Ingredients,image-url,Ingredient-count
2342,Grated Beetroot Chilli Paratha Recipe,"3 Green Chillies,1/2 teaspoon Cumin powder (Je...",40,Indian,To prepare Grated Beetroot Chilli Paratha Reci...,https://www.archanaskitchen.com/grated-beetroo...,"salt,cumin powder (jeera),wheat flour,ghee,red...",https://www.archanaskitchen.com/images/archana...,8
4703,Mushroom Stuffed Ravioli With Burnt Butter Sau...,"1 tablespoon Parsley leaves - chopped,4-5 Butt...",55,Italian Recipes,For the ravioli dough: To begin making the rav...,https://www.archanaskitchen.com/mushroom-stuff...,"virgin olive oil,cloves garlic,onion,black pep...",https://www.archanaskitchen.com/images/archana...,21
5567,Bharwa Bhindi And Pyaaz Ki Sabzi Recipe,"Salt - as required,Water - as required to spri...",55,Rajasthani,To begin making the Bharwa Bhindi And Pyaaz Ki...,https://www.archanaskitchen.com/bharwa-bhindi-...,"salt,coriander (dhania) leaves,green chilli,re...",https://www.archanaskitchen.com/images/archana...,12
3763,Cabbage Palya (Recipe In Hindi),"2 cups cabbage - finely chopped, 2 tablespoons...",40,South Indian Recipes,"To make the cabbage palya, first cut the cabba...",https://www.archanaskitchen.com/cabbage-palya-...,"tomato,coriander (dhania) leaves,salt,mustard ...",https://www.archanaskitchen.com/images/archana...,12
5009,Cucumber Mor Kuzhambu Recipe (Cucumber Curry),"2 to 3 tablespoons Hung Curd (Greek Yogurt),1 ...",35,South Indian Recipes,To begin making the Cucumber Mor Kuzhambu reci...,https://www.archanaskitchen.com/cucumber-mor-k...,"salt,coconut scrapped,ginger,hung curd (greek ...",https://www.archanaskitchen.com/images/archana...,9


In [15]:
# So here from this CSV file we just need 'TranslatedInstructions' column:
df.TranslatedInstructions[0]

'To begin making the Masala Karela Recipe,de-seed the karela and slice.\nDo not remove the skin as the skin has all the nutrients.\nAdd the karela to the pressure cooker with 3 tablespoon of water, salt and turmeric powder and pressure cook for three whistles.\nRelease the pressure immediately and open the lids.\nKeep aside.Heat oil in a heavy bottomed pan or a kadhai.\nAdd cumin seeds and let it sizzle.Once the cumin seeds have sizzled, add onions and saute them till it turns golden brown in color.Add the karela, red chilli powder, amchur powder, coriander powder and besan.\nStir to combine the masalas into the karela.Drizzle a little extra oil on the top and mix again.\nCover the pan and simmer Masala Karela stirring occasionally until everything comes together well.\nTurn off the heat.Transfer Masala Karela into a serving bowl and serve.Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family.\n'

In [16]:
# Next we use regular expressions (regex) for cleaning the text. First we import regex and then create a function to remove
# all the extra symbols, spaces ...
import re

def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(r'[ \n]+', ' ', text)
    return text.strip().lower() 

In [17]:
# So next we apply the function on the entire column using map function:
df.TranslatedInstructions = df.TranslatedInstructions.map(preprocess)

In [18]:
# Now we'll print the individual cell to see the text is cleaned or not:
df.TranslatedInstructions[0]

'to begin making the masala karela recipe de seed the karela and slice do not remove the skin as the skin has all the nutrients add the karela to the pressure cooker with 3 tablespoon of water salt and turmeric powder and pressure cook for three whistles release the pressure immediately and open the lids keep aside heat oil in a heavy bottomed pan or a kadhai add cumin seeds and let it sizzle once the cumin seeds have sizzled add onions and saute them till it turns golden brown in color add the karela red chilli powder amchur powder coriander powder and besan stir to combine the masalas into the karela drizzle a little extra oil on the top and mix again cover the pan and simmer masala karela stirring occasionally until everything comes together well turn off the heat transfer masala karela into a serving bowl and serve serve masala karela along with panchmel dal and phulka for a weekday meal with your family'

In [19]:
# The way fasttext work, whenever you train the model you need to have a specific format file. We just need raw text because
# this is unsupervised training (continous bag of words 'CBOW' and skip-grams are unsupervised form of training). So we 
# this column into a text file.
# Now in food_receipes.txt every line is one receipes.
df.to_csv("food_receipes.txt", columns=["TranslatedInstructions"], header=None, index=False)

In [20]:
# So now training the model is very very simple. fasttext has a method called 'train_unsupervised' and when you supply the 
# raw text to the method it will train the model.
# Now what is doing this method is, it's going all over the text and it's using unsupervised approach, we studied two 
# approaches CBOW and skip-grams, by default it's using skip-gram. so it will takes a pair words, for example 'karela red 
# chilli', here the middle word will be target word and the other two words will be context.
model = fasttext.train_unsupervised("food_receipes.txt")

In [21]:
# Now if we find similar words to a given word, it will most accurately find them comparing with previous model which was 
# trained on general Wikipedia text.
model.get_nearest_neighbors("paneer")

[(0.6259598135948181, 'tikka'),
 (0.6213865876197815, 'nawabi'),
 (0.6091339588165283, 'bhurji'),
 (0.5993036031723022, 'tandoori'),
 (0.5893545746803284, 'kulcha'),
 (0.5847510099411011, 'tikkas'),
 (0.5834465026855469, 'tofu'),
 (0.5817282795906067, 'shashlik'),
 (0.5760965347290039, 'makhanwala'),
 (0.5663161873817444, 'reshmi')]

In [22]:
model.get_nearest_neighbors("chutney")

[(0.9385766386985779, 'chutneys'),
 (0.7264912724494934, 'dhaniya'),
 (0.7180773615837097, 'imli'),
 (0.7073185443878174, 'khajur'),
 (0.6787099242210388, 'brahmi'),
 (0.6748732924461365, 'ratalu'),
 (0.6746129989624023, 'madurai'),
 (0.6730553507804871, 'south'),
 (0.6638494729995728, 'pudina'),
 (0.6520487070083618, 'chitranna')]

In [23]:
model.get_nearest_neighbors("halwa")

[(0.7761377692222595, 'khoya'),
 (0.7603527903556824, 'sheera'),
 (0.7088071703910828, 'mawa'),
 (0.7058563232421875, 'rabri'),
 (0.6968113780021667, 'badam'),
 (0.6908909678459167, 'kheer'),
 (0.6873621940612793, 'kesari'),
 (0.6864914298057556, 'burfi'),
 (0.6679328680038452, 'mohan'),
 (0.6669387221336365, 'peda')]

In [25]:
# The vector shape is:
model.get_word_vector("dosa").shape    # We can change it to 300 or whatever we want.

(100,)

https://fasttext.cc/docs/en/unsupervised-tutorial.html for details on parameters in train_unsupervised function. Based on the need one can use following parameters for fine tunning,

1. epochs = Default value is 5. Epoch is how many times it will loop over the same dataset for the training
2. lr = Learning rate
3. thread = Number of threads for the training

### Let's play with Pashto Language pre-trained model

In [9]:
model_pa = fasttext.load_model("E://model//cc.ps.300.bin") 



In [10]:
model_pa.get_nearest_neighbors("ښه")   # 'better'

[(0.35875511169433594, 'راغلاست'),
 (0.3346844017505646, 'بد'),
 (0.32112669944763184, 'ښې'),
 (0.3061031699180603, 'ترا'),
 (0.29902970790863037, 'سيهغه'),
 (0.28519681096076965, 'چوئي'),
 (0.28469809889793396, 'دښه'),
 (0.2829279899597168, 'انشأ'),
 (0.2791367471218109, 'هم'),
 (0.2775557041168213, 'دا')]

In [11]:
model_pa.get_nearest_neighbors("ښوونځی")    # 'School'

[(0.7729200720787048, 'اوښوونځی'),
 (0.700410783290863, 'ښوونځیوته'),
 (0.6793037056922913, 'ښوونخی'),
 (0.6609989404678345, 'ښوونځیمتعالیه'),
 (0.6547590494155884, 'ښوونځو'),
 (0.6524228453636169, 'شوونځی'),
 (0.6488434076309204, 'ښونځی'),
 (0.642431914806366, 'ښوونځ'),
 (0.636870801448822, 'شونځی'),
 (0.6325881481170654, 'ښوونځېو')]

In [12]:
model_en.get_nearest_neighbors("هیواد")  # 'country'

[(0.6082401871681213, 'هیوادڅخه'),
 (0.6079029440879822, 'یوهیواد'),
 (0.606614351272583, 'هیوادنه'),
 (0.5960032343864441, 'هرهیواد'),
 (0.5951245427131653, 'هیوادکی'),
 (0.5737670063972473, 'هیوادوه'),
 (0.5737472176551819, 'هیوادکې'),
 (0.5695239305496216, 'هیوادني'),
 (0.5644078850746155, 'هیوادکي'),
 (0.5520766973495483, 'هیوادپه')]

In [13]:
model_pa.get_nearest_neighbors("پوهنتون")   # 'university'

[(0.7608622312545776, 'پوهنتونكاردان'),
 (0.7469274997711182, 'پوهنتوندعوت'),
 (0.7419348955154419, 'پوهنتونمريم'),
 (0.7280351519584656, 'پوهنتونسلام'),
 (0.7244547605514526, 'پوهنتونکمبریج'),
 (0.718766987323761, 'پوهنتونخاتم'),
 (0.7186954617500305, 'پوهنتوني'),
 (0.7164924740791321, 'پوهنتونکال'),
 (0.7035044431686401, 'پوهنتونآف'),
 (0.6881459355354309, 'پوهنتونرڼا')]

In [14]:
model_pa.get_nearest_neighbors("کابل")   # Kabul

[(0.49844151735305786, 'رسنۍ:کابل'),
 (0.4903295338153839, '۱۳۶۳کال'),
 (0.48934024572372437, 'پرکابل'),
 (0.489153653383255, 'باستانشناسی'),
 (0.4876821041107178, 'چهاراسیاب'),
 (0.48572057485580444, 'دفارمسي'),
 (0.4856768250465393, 'کوټوالۍ'),
 (0.4856499433517456, 'جلالاباد'),
 (0.48431241512298584, 'آساد'),
 (0.48382672667503357, 'پولیتخنیک')]

In [17]:
model_pa.get_analogies("کتاب","چلول","موټر")

[(0.4749954342842102, 'کتابپلورنځي'),
 (0.4077494442462921, 'کتابفروش'),
 (0.3944847881793976, 'کتابګوټی'),
 (0.39322584867477417, 'الإسفار'),
 (0.3837553560733795, 'الماخوذات'),
 (0.38224607706069946, 'الکتاب'),
 (0.37697646021842957, 'الارشميدس'),
 (0.3747272193431854, 'کتابځای'),
 (0.37194523215293884, 'کتابپلورلو'),
 (0.3717959523200989, 'الکتانی')]

In [20]:
model_pa.get_word_vector("پوهنتون").shape

(300,)