**Word2Vec**
- **Training Unit:** The basic unit for training the neural network is a word.
- **Architectures:**
  1. **CBOW (Continuous Bag-of-Words):**
    - Goal: Predict a target word given its context words.
    - Example: [The, cat, on, the] → Predict "sat" (for "The cat sat on the mat", with window size 2).
    - Analogous to "filling in the blank."
  2. **Skip-Gram:**
    - Goal: Predict context words given a target word.
    - Example: "sat" → Predict [The, cat, on, the] (with window size 2, for "The cat sat on the mat").
    - Works well for rare words.

- **Limitation:** Fails for out-of-vocabulary (OOV) words (e.g., "unseenword" not in training data).
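
The difference between the two architectures is easiest to see in the training pairs they produce. The snippet below is a minimal sketch in plain Python (the helper name `training_pairs` is made up for illustration and is not part of Word2Vec itself); it lists the pairs for "The cat sat on the mat" with a window size of 2.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for one tokenised sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBOW: the whole context predicts the target.
        cbow.append((context, target))
        # Skip-gram: the target predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


tokens = "the cat sat on the mat".split()
cbow, skipgram = training_pairs(tokens, window=2)

print(cbow[2])                                       # CBOW pair for "sat"
print([pair for pair in skipgram if pair[0] == "sat"])  # skip-gram pairs for "sat"
```

For the target "sat", CBOW gets one example (four context words in, one word out), while skip-gram gets four separate (target, context) examples.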

---

**FastText**
- **Training Unit:** The basic unit for training is a character n-gram rather than a whole word.
  - Represents a word as the sum of its character n-grams, with `<` and `>` marking the start and end of the word (e.g., "apple" → `<ap`, `app`, `ppl`, `ple`, `le>` for n=3).
  - Handles OOV words by breaking them into known subword units.
- **Architectures:**
  - Extends Word2Vec’s Skip-Gram/CBOW but operates on subword units.
  - Example: For the word "jumping", FastText uses n-grams of several lengths (3 to 6 characters by default), such as jum, umpi, mpin, ping.
- **Advantages:**
  - Robust for morphologically rich languages (e.g., Turkish, Finnish).
  - Better embeddings for rare or misspelled words.
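
To make the subword idea concrete, here is a small sketch that extracts character n-grams with boundary markers. The function `char_ngrams` is a hypothetical helper written for this note; the real library additionally hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("apple", minn=3, maxn=3))
print(char_ngrams("jumping", minn=3, maxn=4))
```

An unseen word such as "jumpings" shares most of these n-grams with "jumping", which is why a subword model can still place it near related words.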

# FastText Pre-trained Models

I'm using Google Colab, which does not come with the fasttext library, so we need to install it first. We also need to download the pre-trained English model: the compressed file is about 4.2 GB and the uncompressed model is around 7 GB, so let's see whether we can download it.

In [4]:
!pip install fasttext



In [5]:
# Download the English pre-trained model (compressed .bin file)
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

# Uncompress the file
!gzip -d cc.en.300.bin.gz

--2025-02-06 20:19:07--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.164.78.121, 18.164.78.72, 18.164.78.81, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.164.78.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4503593528 (4.2G) [application/octet-stream]
Saving to: ‘cc.en.300.bin.gz’


2025-02-06 20:19:33 (163 MB/s) - ‘cc.en.300.bin.gz’ saved [4503593528/4503593528]



In [54]:
import fasttext

# Load the pre-trained model
model_en = fasttext.load_model('cc.en.300.bin')

We can list the methods available on the loaded model with dir(model_en).

In [55]:
model_en.get_nearest_neighbors('man')

[(0.7658417224884033, 'woman'),
 (0.6753811836242676, 'man.He'),
 (0.6618252396583557, 'guy'),
 (0.65586918592453, 'man.The'),
 (0.6558194160461426, 'man--he'),
 (0.6558161377906799, 'man.When'),
 (0.6423407196998596, 'gentleman'),
 (0.6419808864593506, 'man--a'),
 (0.6405567526817322, 'woman.He'),
 (0.6402797102928162, 'man.That')]
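
Several of these neighbours ('man.He', 'man--he', ...) are tokens with punctuation glued in rather than real words. One simple way to tidy the list, sketched below, is to keep only purely alphabetic words. The helper `keep_clean_words` is hypothetical, and the filter is a rough heuristic: it would also drop legitimate hyphenated words.

```python
def keep_clean_words(neighbors):
    """Keep only (score, word) pairs whose word is purely alphabetic."""
    return [(score, word) for score, word in neighbors if word.isalpha()]


# First few neighbours of 'man', copied from the output above.
neighbors = [
    (0.7658417224884033, "woman"),
    (0.6753811836242676, "man.He"),
    (0.6618252396583557, "guy"),
    (0.65586918592453, "man.The"),
    (0.6558194160461426, "man--he"),
    (0.6558161377906799, "man.When"),
    (0.6423407196998596, "gentleman"),
]

clean = keep_clean_words(neighbors)
print(clean)
```

With the real model, the same function can be applied directly to the list returned by `model_en.get_nearest_neighbors('man')`.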

In [56]:
# Each word vector has 300 dimensions
print(model_en.get_word_vector('man').shape)
model_en.get_word_vector('man')[:10]

(300,)


array([ 0.20510274, -0.11743978, -0.01554255,  0.1793493 , -0.22804922,
       -0.12455802,  0.1232509 ,  0.05384731,  0.04117953, -0.00567295],
      dtype=float32)

In [57]:
model_en.get_analogies('paris', 'france', 'australia')

[(0.7186220288276672, 'sydney'),
 (0.6982361674308777, 'melbourne'),
 (0.6714049577713013, 'brisbane'),
 (0.6550672054290771, 'adelaide'),
 (0.6140936017036438, 'australian'),
 (0.6055131554603577, 'singapore'),
 (0.5983640551567078, 'auckland'),
 (0.5921200513839722, 'sydney.'),
 (0.5818527936935425, 'queensland'),
 (0.5817395448684692, 'australia.')]
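
The analogy query can be read as vector arithmetic: "paris" minus "france" plus "australia" should land near the word that is to Australia what Paris is to France. Below is a minimal sketch of that idea using made-up 3-dimensional toy vectors, not the real 300-dimensional embeddings. It assumes the common formulation A - B + C ranked by cosine similarity; the library's exact normalisation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c, k=3):
    """Rank words by similarity to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(scored, reverse=True)[:k]


# Toy vectors (invented): axes are roughly "France", "Australia", "city".
toy = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "australia": [0.0, 1.0, 0.0],
    "sydney": [0.0, 1.0, 1.0],
    "london": [0.0, 0.0, 1.0],
    "banana": [0.2, 0.1, 0.0],
}

result = analogy(toy, "paris", "france", "australia")
print(result)
```

In the toy space the query vector equals the "sydney" vector, so it ranks first, mirroring the top result from the real model.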

In [58]:
model_en.get_nearest_neighbors('chutney')

[(0.8078702092170715, 'chutneys'),
 (0.7138292789459229, 'thokku'),
 (0.701572060585022, 'Chutney'),
 (0.6875490546226501, 'achaar'),
 (0.684525728225708, 'piccalilli'),
 (0.6737173199653625, 'raita'),
 (0.6715506911277771, 'chatni'),
 (0.6610829830169678, 'chutney.'),
 (0.6505922675132751, 'gojju'),
 (0.6398508548736572, 'kasundi')]

In [59]:
model_en.get_nearest_neighbors('halwa')

[(0.8563978672027588, 'kheer'),
 (0.8392286896705627, 'burfi'),
 (0.8193163871765137, 'Halwa'),
 (0.7894062995910645, 'kesari'),
 (0.778471827507019, 'payasam'),
 (0.7706475853919983, 'burfis'),
 (0.7590622901916504, 'laddoo'),
 (0.7504664659500122, 'ladoo'),
 (0.7471016645431519, 'rabdi'),
 (0.7396334409713745, 'laddu')]

In [63]:
model_en.get_nearest_neighbors('saragva')[:3]

[(0.5384978652000427,
  'ReportsTabloidCrimeYakuzaTokyoGinzaIkebukuroKabukichoRoppongiShibuyaShimbashiShinjukuUenoJapanChibaFukuokaKobeKyotoNagoyaOkinawaOsakaSaitamaYokohamaSportsBaseballHorse'),
 (0.5373231768608093,
  'NoidaVaranasiBareillyMathuraAligarhMoradabadSaharanpurBijnorJaunpurGorakhpurMuzaffarnagarSultanpurDehradunHaridwarNainitalRoorkeeGarhwalBardhamanMurshidabadHooghlyMedinipurNorth'),
 (0.5331498980522156,
  'NagarBhiwaniKarnalKurukshetraMahendragarhSirsaPanipatJindJhajjarRewariSolanShimlaKangraHamirpurMandiJammuSrinagarRanchiJamshedpurMangaloreMysoreBelgaumGulbargaTumkurBijapurDavanagereDharwadShimogaUdupiHassanBidarHubliKolarBagalkotKannadaChitradurgaMandyaGadagBellaryRaichurThiruvananthapuramThrissurErnakulamMalappuramKochiKottayamKannurKozhikodeKollamPalakkadPathanamthittaCalicutTrivandrumAlappuzhaKasaragodBhopalIndoreGwaliorJabalpurUjjainSagarChhatarpurPuneNagpurAurangabadNashikKolhapurAhmed')]

**If a word is out of vocabulary for the model, as 'saragva' is here, the nearest neighbours it returns are garbage values.**
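
Since the model still returns something for an unknown word, it can help to check the vocabulary before trusting the neighbours. The sketch below wraps that check in a hypothetical helper, `neighbors_if_known`; it relies on the fasttext model method `get_word_id`, which returns -1 for a word that is not in the vocabulary.

```python
def neighbors_if_known(model, word, k=10):
    """Return nearest neighbours only when `word` is in the model's vocabulary."""
    # get_word_id returns -1 when the word is not in the model's dictionary.
    if model.get_word_id(word) == -1:
        return []
    return model.get_nearest_neighbors(word, k=k)
```

Usage with the model loaded earlier would be `neighbors_if_known(model_en, 'saragva')`, which gives an empty list instead of noise, assuming 'saragva' really is missing from the pre-trained vocabulary.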

# Training custom word embeddings on Indian food recipes

- dataset credits: https://www.kaggle.com/datasets/sooryaprakash12/cleaned-indian-recipes-dataset

In [37]:
import pandas as pd

df = pd.read_csv('/content/Cleaned_Indian_Food_Dataset.csv')

df.head()

| | TranslatedRecipeName | TranslatedIngredients | TotalTimeInMins | Cuisine | TranslatedInstructions | URL | Cleaned-Ingredients | image-url | Ingredient-count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Masala Karela Recipe | 1 tablespoon Red Chilli powder,3 tablespoon Gr... | 45 | Indian | To begin making the Masala Karela Recipe,de-se... | https://www.archanaskitchen.com/masala-karela-... | salt,amchur (dry mango powder),karela (bitter ... | https://www.archanaskitchen.com/images/archana... | 10 |
| 1 | Spicy Tomato Rice (Recipe) | 2 teaspoon cashew - or peanuts, 1/2 Teaspoon ... | 15 | South Indian Recipes | To make tomato puliogere, first cut the tomato... | https://www.archanaskitchen.com/spicy-tomato-r... | tomato,salt,chickpea lentils,green chilli,rice... | https://www.archanaskitchen.com/images/archana... | 12 |
| 2 | Ragi Semiya Upma Recipe - Ragi Millet Vermicel... | 1 Onion - sliced,1 teaspoon White Urad Dal (Sp... | 50 | South Indian Recipes | To begin making the Ragi Vermicelli Recipe, fi... | https://www.archanaskitchen.com/ragi-vermicell... | salt,rice vermicelli noodles (thin),asafoetida... | https://www.archanaskitchen.com/images/archana... | 12 |
| 3 | Gongura Chicken Curry Recipe - Andhra Style Go... | 1/2 teaspoon Turmeric powder (Haldi),1 tablesp... | 45 | Andhra | To begin making Gongura Chicken Curry Recipe f... | https://www.archanaskitchen.com/gongura-chicke... | tomato,salt,ginger,sorrel leaves (gongura),fen... | https://www.archanaskitchen.com/images/archana... | 15 |
| 4 | Andhra Style Alam Pachadi Recipe - Adrak Chutn... | oil - as per use, 1 tablespoon coriander seed... | 30 | Andhra | To make Andhra Style Alam Pachadi, first heat ... | https://www.archanaskitchen.com/andhra-style-a... | tomato,salt,ginger,red chillies,curry,asafoeti... | https://www.archanaskitchen.com/images/archana... | 12 |


In [38]:
df.shape

(5938, 9)

In [39]:
df['TranslatedInstructions'][0]

'To begin making the Masala Karela Recipe,de-seed the karela and slice.\nDo not remove the skin as the skin has all the nutrients.\nAdd the karela to the pressure cooker with 3 tablespoon of water, salt and turmeric powder and pressure cook for three whistles.\nRelease the pressure immediately and open the lids.\nKeep aside.Heat oil in a heavy bottomed pan or a kadhai.\nAdd cumin seeds and let it sizzle.Once the cumin seeds have sizzled, add onions and saute them till it turns golden brown in color.Add the karela, red chilli powder, amchur powder, coriander powder and besan.\nStir to combine the masalas into the karela.Drizzle a little extra oil on the top and mix again.\nCover the pan and simmer Masala Karela stirring occasionally until everything comes together well.\nTurn off the heat.Transfer Masala Karela into a serving bowl and serve.Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family.\n'

This text contains punctuation, extra whitespace, and newline characters ('\n'). We need to remove them from the dataset, and we will use regular expressions to do it.

In [42]:
import re

def preprocessing(text):
  # Replace every character that is neither a word character nor whitespace (punctuation etc.) with a space
  text = re.sub(r'[^\w\s]', ' ', text)

  # Collapse runs of spaces and newlines into a single space
  text = re.sub(r'[ \n]+', ' ', text)

  # strip() removes leading and trailing whitespace; lower() converts the text to lowercase
  return text.strip().lower()
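
To see what each step of the cleaning does, here is a small sketch that applies the same two substitutions one at a time. The sample string is invented for illustration, in the style of the recipe text.

```python
import re

# Made-up sample with punctuation, a missing space after a full stop, and newlines.
sample = "Keep aside.Heat oil,\nthen add cumin seeds!\n"

# Step 1: punctuation becomes a space, so "aside.Heat" splits into two words.
step1 = re.sub(r'[^\w\s]', ' ', sample)

# Step 2: runs of spaces and newlines collapse into a single space.
step2 = re.sub(r'[ \n]+', ' ', step1)

# Step 3: trim the ends and lowercase everything.
cleaned = step2.strip().lower()

print(repr(step1))
print(repr(step2))
print(repr(cleaned))
```

Replacing punctuation with a space rather than deleting it matters here: deleting the full stop would fuse "aside" and "Heat" into one token.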

In [44]:
df['TranslatedInstructions'] = df['TranslatedInstructions'].map(preprocessing)
df.head()

| | TranslatedRecipeName | TranslatedIngredients | TotalTimeInMins | Cuisine | TranslatedInstructions | URL | Cleaned-Ingredients | image-url | Ingredient-count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Masala Karela Recipe | 1 tablespoon Red Chilli powder,3 tablespoon Gr... | 45 | Indian | to begin making the masala karela recipe de se... | https://www.archanaskitchen.com/masala-karela-... | salt,amchur (dry mango powder),karela (bitter ... | https://www.archanaskitchen.com/images/archana... | 10 |
| 1 | Spicy Tomato Rice (Recipe) | 2 teaspoon cashew - or peanuts, 1/2 Teaspoon ... | 15 | South Indian Recipes | to make tomato puliogere first cut the tomatoe... | https://www.archanaskitchen.com/spicy-tomato-r... | tomato,salt,chickpea lentils,green chilli,rice... | https://www.archanaskitchen.com/images/archana... | 12 |
| 2 | Ragi Semiya Upma Recipe - Ragi Millet Vermicel... | 1 Onion - sliced,1 teaspoon White Urad Dal (Sp... | 50 | South Indian Recipes | to begin making the ragi vermicelli recipe fir... | https://www.archanaskitchen.com/ragi-vermicell... | salt,rice vermicelli noodles (thin),asafoetida... | https://www.archanaskitchen.com/images/archana... | 12 |
| 3 | Gongura Chicken Curry Recipe - Andhra Style Go... | 1/2 teaspoon Turmeric powder (Haldi),1 tablesp... | 45 | Andhra | to begin making gongura chicken curry recipe f... | https://www.archanaskitchen.com/gongura-chicke... | tomato,salt,ginger,sorrel leaves (gongura),fen... | https://www.archanaskitchen.com/images/archana... | 15 |
| 4 | Andhra Style Alam Pachadi Recipe - Adrak Chutn... | oil - as per use, 1 tablespoon coriander seed... | 30 | Andhra | to make andhra style alam pachadi first heat o... | https://www.archanaskitchen.com/andhra-style-a... | tomato,salt,ginger,red chillies,curry,asafoeti... | https://www.archanaskitchen.com/images/archana... | 12 |


In [45]:
df['TranslatedInstructions'][0]

'to begin making the masala karela recipe de seed the karela and slice do not remove the skin as the skin has all the nutrients add the karela to the pressure cooker with 3 tablespoon of water salt and turmeric powder and pressure cook for three whistles release the pressure immediately and open the lids keep aside heat oil in a heavy bottomed pan or a kadhai add cumin seeds and let it sizzle once the cumin seeds have sizzled add onions and saute them till it turns golden brown in color add the karela red chilli powder amchur powder coriander powder and besan stir to combine the masalas into the karela drizzle a little extra oil on the top and mix again cover the pan and simmer masala karela stirring occasionally until everything comes together well turn off the heat transfer masala karela into a serving bowl and serve serve masala karela along with panchmel dal and phulka for a weekday meal with your family'

We now have clean raw text. To build the model, we only need that one column, which we write to a text file and train on. The training is unsupervised, since both continuous bag-of-words (CBOW) and skip-gram are unsupervised methods.

In [47]:
# Write one cleaned recipe per line, without a header row or the index column
df.to_csv('food_receipes.txt', columns=['TranslatedInstructions'], header=False, index=False)

By default, fastText trains a skip-gram model; to use CBOW instead, pass "cbow" as the model parameter. The default word-vector dimension is 100, which can be changed with "dim" (for example "dim = 300"). The default number of epochs is 5 ("epoch = 5") and the default learning rate is 0.05 ("lr = 0.05"); both can be changed by passing the parameter explicitly.
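
The options described above can be gathered in one place so that overrides are explicit. This is a minimal sketch: `DEFAULTS`, `build_params`, and `train_embeddings` are hypothetical names introduced here, and only the four options mentioned in the text are covered (the library accepts more).

```python
# Defaults as described in the text. Note that the keyword is `epoch`, not `epochs`.
DEFAULTS = {"model": "skipgram", "dim": 100, "epoch": 5, "lr": 0.05}


def build_params(**overrides):
    """Merge user overrides into the default training options."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def train_embeddings(corpus_path, **overrides):
    """Train an unsupervised fastText model on a one-document-per-line text file."""
    import fasttext  # imported here so build_params can be used without the library

    return fasttext.train_unsupervised(corpus_path, **build_params(**overrides))


params = build_params(model="cbow", dim=300)
print(params)
```

With the library installed, a CBOW model with 300-dimensional vectors would then be trained with `train_embeddings("food_receipes.txt", model="cbow", dim=300)`.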

In [49]:
# To train CBOW instead of the default skip-gram:
# model = fasttext.train_unsupervised("food_receipes.txt", model="cbow")
model = fasttext.train_unsupervised("food_receipes.txt")

In [60]:
model.get_nearest_neighbors('chutney')

[(0.9411395788192749, 'chutneys'),
 (0.7474164366722107, 'dhaniya'),
 (0.7241044044494629, 'khajur'),
 (0.7203799486160278, 'pudina'),
 (0.7160243988037109, 'imli'),
 (0.6955734491348267, 'ratalu'),
 (0.6909687519073486, 'sippe'),
 (0.6894674301147461, 'pudi'),
 (0.676975429058075, 'mavinakayi'),
 (0.6698997616767883, 'mullu')]

In [61]:
model.get_nearest_neighbors('halwa')

[(0.7772210240364075, 'khoya'),
 (0.7146174907684326, 'sheera'),
 (0.7145729064941406, 'rabri'),
 (0.6892331838607788, 'burfi'),
 (0.6825693845748901, 'badam'),
 (0.6824300289154053, 'halbai'),
 (0.6692677736282349, 'mohan'),
 (0.6689146161079407, 'kesari'),
 (0.65965336561203, 'mawa'),
 (0.6557658314704895, 'kheer')]
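
It is interesting to compare these neighbours with the ones the pre-trained model gave for the same word. The sketch below uses the two 'halwa' lists printed by the two models (words only, scores dropped) and finds the overlap; `shared_neighbors` is a hypothetical helper.

```python
def shared_neighbors(words_a, words_b):
    """Words present in both neighbour lists, compared case-insensitively."""
    set_a = {word.lower() for word in words_a}
    set_b = {word.lower() for word in words_b}
    return sorted(set_a & set_b)


# Neighbours of 'halwa' from the pre-trained model (model_en).
pretrained = ["kheer", "burfi", "Halwa", "kesari", "payasam",
              "burfis", "laddoo", "ladoo", "rabdi", "laddu"]

# Neighbours of 'halwa' from the model trained on the recipes.
custom = ["khoya", "sheera", "rabri", "burfi", "badam",
          "halbai", "mohan", "kesari", "mawa", "kheer"]

shared = shared_neighbors(pretrained, custom)
print(shared)
```

Only three words are shared. The pre-trained list is mostly other sweets and spelling variants, while the recipe-trained list also picks up ingredients such as khoya, badam, and mawa that appear alongside halwa in cooking instructions.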

In [64]:
model.get_nearest_neighbors('saragva')

[(0.8631535768508911, 'fansi'),
 (0.8501744866371155, 'bhoplya'),
 (0.8421438336372375, 'agathi'),
 (0.8389979600906372, 'bhuga'),
 (0.8367308974266052, 'kumani'),
 (0.8366513848304749, 'sukhi'),
 (0.8364609479904175, 'chawli'),
 (0.8359675407409668, 'olya'),
 (0.8283489942550659, 'kalyana'),
 (0.8257309794425964, 'mezhukkupuratti')]