# Food embeddings made using ingredients
>"Visualize food recipes based on their clustering as per ingredients"

- toc:true
- badges: true
- comments: true
- author: Pushkar G. Ghanekar
- categories: [python, exploratory-data-analysis, data-visualization, web-scrapping]

# Get data for food items and recipes 

I use a Kaggle dataset containing 6000+ recipes from https://www.archanaskitchen.com/. Using this data as a base collection of recipes representing most of Indian food, I analyze which spices occur most frequently and which spices are most connected to each other.

* Dataset of Indian recipes: 6000+ recipes scraped from https://www.archanaskitchen.com/ | [Link to the dataset](https://www.kaggle.com/kanishk307/6000-indian-food-recipes-dataset)

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
#----- PLOTTING PARAMS ----# 
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}

plot_params = {
'font.size' : 22,
'axes.titlesize' : 24,
'axes.labelsize' : 20,
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
}
 
plt.rcParams.update(plot_params)

In [3]:
# Functions for PMF and CDF, we will come to those later in the notebook 
def pmf(pandas_series):
    # Returns an (n, 2) array: unique values and their relative frequencies
    values, counts = np.unique(pandas_series, return_counts=True)
    return np.c_[values, counts / counts.sum()]

def cdf(pandas_series):
    # Returns an (n, 2) array: unique values and the cumulative probability P(X <= value).
    # The running sum includes the current value, so the last entry is 1.
    pmf_data = pmf(pandas_series)
    return np.c_[pmf_data[:, 0], np.cumsum(pmf_data[:, 1])]
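
As a quick sanity check of what a PMF and a CDF look like on discrete data, here is a minimal sketch on a toy series. The values are hypothetical (think of them as ingredient counts for six recipes), and the CDF uses the inclusive convention P(X ≤ x), under which the last value is 1.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): number of ingredients in six recipes
toy = pd.Series([3, 5, 5, 7, 7, 7])

values, counts = np.unique(toy, return_counts=True)
pmf_vals = counts / counts.sum()   # P(X = x) for each unique x
cdf_vals = np.cumsum(pmf_vals)     # P(X <= x), inclusive running sum

for v, p, c in zip(values, pmf_vals, cdf_vals):
    print(f"x={v}: pmf={p:.3f}, cdf={c:.3f}")
```

The PMF entries sum to 1 and the CDF is non-decreasing, which are two cheap checks worth running on real data as well.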

### Read the dataset

In [4]:
food_df = pd.read_csv('./IndianFoodDatasetCSV.csv')

In [5]:
food_df.columns

Index(['Srno', 'RecipeName', 'TranslatedRecipeName', 'Ingredients',
       'TranslatedIngredients', 'PrepTimeInMins', 'CookTimeInMins',
       'TotalTimeInMins', 'Servings', 'Cuisine', 'Course', 'Diet',
       'Instructions', 'TranslatedInstructions', 'URL'],
      dtype='object')

In [6]:
food_df.shape

(6871, 15)

In [7]:
food_df.sample(4)

Unnamed: 0,Srno,RecipeName,TranslatedRecipeName,Ingredients,TranslatedIngredients,PrepTimeInMins,CookTimeInMins,TotalTimeInMins,Servings,Cuisine,Course,Diet,Instructions,TranslatedInstructions,URL
5216,7494,Mango Musk Melon Chia Seeds Pudding Recipe,Mango Musk Melon Chia Seeds Pudding Recipe,"2 cups Musk Melon - pulp,2 Mango (Ripe) - alph...","2 cups Musk Melon - pulp,2 Mango (Ripe) - alph...",10,15,25,2,Continental,World Breakfast,Vegetarian,To begin with Mango Musk Melon Chia Seeds Pudd...,To begin with Mango Musk Melon Chia Seeds Pudd...,https://www.archanaskitchen.com/mango-musk-mel...
63,71,Vegan Chickpea Omelette Recipe (Spiced Chickpe...,Vegan Chickpea Omelette Recipe (Spiced Chickpe...,"1 cup Gram flour (besan),1/2 cup Coconut milk,...","1 cup Gram flour (besan),1/2 cup Coconut milk,...",20,30,50,4,North Indian Recipes,North Indian Breakfast,High Protein Vegetarian,To begin making the Vegan Chickpea Omelette Re...,To begin making the Vegan Chickpea Omelette Re...,http://www.archanaskitchen.com/vegan-chickpea-...
5692,8640,मैंगलोर फिश करी रेसिपी - Mangalore Fish Curry ...,Mangalore Fish Curry Recipe,"500 ग्राम किंग फिश - काट ले,1 प्याज - बारीक का...","500 ग्राम किंग फिश - काट ले,1 प्याज - बारीक का...",10,30,40,4,Mangalorean,Lunch,Non Vegeterian,मैंगलोर फिश करी रेसिपी बनाने के लिए सबसे पहले ...,मैंगलोर फिश करी रेसिपी बनाने के लिए सबसे पहले ...,https://www.archanaskitchen.com/mangalore-fish...
6104,9733,Thai Red Curry with Chicken and Brinjal Recipe,Thai Red Curry with Chicken and Brinjal Recipe,"300 grams Chicken - boneless pieces,400 ml Coc...","300 grams Chicken - boneless pieces,400 ml Coc...",5,30,35,4,Thai,Lunch,Non Vegeterian,To begin making the Vegetarian Thai Red Curry ...,To begin making the Vegetarian Thai Red Curry ...,https://www.archanaskitchen.com/thai-red-curry...


In [8]:
food_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6871 entries, 0 to 6870
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Srno                    6871 non-null   int64 
 1   RecipeName              6871 non-null   object
 2   TranslatedRecipeName    6871 non-null   object
 3   Ingredients             6865 non-null   object
 4   TranslatedIngredients   6865 non-null   object
 5   PrepTimeInMins          6871 non-null   int64 
 6   CookTimeInMins          6871 non-null   int64 
 7   TotalTimeInMins         6871 non-null   int64 
 8   Servings                6871 non-null   int64 
 9   Cuisine                 6871 non-null   object
 10  Course                  6871 non-null   object
 11  Diet                    6871 non-null   object
 12  Instructions            6871 non-null   object
 13  TranslatedInstructions  6871 non-null   object
 14  URL                     6871 non-null   object
dtypes: int64(5), object(10)

In [9]:
# dropping miscellaneous columns (NaN entries are handled further below)
columns_to_drop = ['Srno', 'RecipeName', 'CookTimeInMins', 'TotalTimeInMins', 'Servings', 'Instructions', 'TranslatedInstructions', 'Ingredients', 'URL']
food_df = food_df.drop(columns=columns_to_drop)

In [10]:
food_df.shape

(6871, 6)

## Check out null entries

In [11]:
food_df.columns[food_df.isnull().any()]

Index(['TranslatedIngredients'], dtype='object')

In [12]:
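# NOTE: .any() without axis=1 reduces over rows and flags *columns*, so np.where
# returns the position of the null-containing column (1) and .iloc then picks row 1,
# which has no nulls. The row-wise mask with any(axis=1) further below gives the intended rows.
food_df.iloc[ np.where(food_df.isnull().any())[0] ]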

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,PrepTimeInMins,Cuisine,Course,Diet
1,Spicy Tomato Rice (Recipe),"2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,South Indian Recipes,Main Course,Vegetarian


Column-wise view of number of null entries in the dataframe

In [13]:
null_columns=food_df.columns[food_df.isnull().any()]
food_df[null_columns].isnull().sum()

TranslatedIngredients    6
dtype: int64

View rows with at least one null entry

In [14]:
is_NaN = food_df.isnull()
row_has_NaN = is_NaN.any(axis=1)
food_df[row_has_NaN]

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,PrepTimeInMins,Cuisine,Course,Diet
287,Pear And Walnut Salad Recipe,,10,Continental,Appetizer,Vegetarian
1262,Spinach and Cottage Cheese Eggless Ravioli Recipe,,25,Italian Recipes,Dinner,Vegetarian
1809,Thai Jasmine Sticky Rice Recipe,,30,North Indian Recipes,Lunch,Vegetarian
1827,Classic Pavakkai Stir Fry Recipe (Bitter Gourd...,,30,Greek,Appetizer,Vegetarian
5386,Urulaikizhangu Puli Thokku Recipe (South India...,,15,South Indian Recipes,Lunch,Vegetarian
5586,Mashed Peas Recipe,,20,North Indian Recipes,North Indian Breakfast,Vegetarian


In [15]:
food_df = food_df.dropna()
print(food_df.shape)

(6865, 6)


In [16]:
# The data also includes international and Indian-inspired cuisines;
# as a first cut, keep only cuisines with more than 50 recipes
cuisin_counts = food_df['Cuisine'].value_counts()
cuisin_counts_more_than_50 = cuisin_counts[cuisin_counts > 50]

In [17]:
food_df_top_cuisine = food_df.loc[ food_df['Cuisine'].isin(list(cuisin_counts_more_than_50.index))  ] #Dropping entries in `food_df` which have non-ind

In [18]:
food_df_top_cuisine.shape

(6075, 6)

In [19]:
food_df_top_cuisine['Cuisine'].value_counts()

Indian                   1157
Continental              1020
North Indian Recipes      936
South Indian Recipes      681
Italian Recipes           235
Bengali Recipes           175
Maharashtrian Recipes     173
Kerala Recipes            163
Tamil Nadu                156
Karnataka                 149
Fusion                    135
Rajasthani                123
Mexican                   119
Andhra                    118
Gujarati Recipes﻿         115
Goan Recipes               99
Punjabi                    84
Chettinad                  74
Asian                      72
Thai                       66
Chinese                    59
Kashmiri                   59
French                     54
Middle Eastern             53
Name: Cuisine, dtype: int64

In [20]:
south_indian_tag = ['Chettinad', 'Andhra', 'Karnataka', 'Tamil Nadu', 'Kerala Recipes', 'South Indian Recipes']
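# NOTE: the 'Gujarati Recipes' label in the data appears to carry a trailing invisible
# character, so the plain string below may not match it
central_indian_tag = ['Gujarati Recipes', 'Goan Recipes', 'Rajasthani', 'Maharashtrian Recipes']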
north_indian_tag = ['North Indian Recipes', 'Punjabi', 'Kashmiri']

In [21]:
food_df_top_cuisine

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,PrepTimeInMins,Cuisine,Course,Diet
0,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,Indian,Side Dish,Diabetic Friendly
1,Spicy Tomato Rice (Recipe),"2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,South Indian Recipes,Main Course,Vegetarian
2,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian
3,Gongura Chicken Curry Recipe - Andhra Style Go...,"500 grams Chicken,2 Onion - chopped,1 Tomato -...",15,Andhra,Lunch,Non Vegeterian
4,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"1 tablespoon chana dal, 1 tablespoon white ura...",10,Andhra,South Indian Breakfast,Vegetarian
...,...,...,...,...,...,...
6866,Goan Mushroom Xacuti Recipe,"20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक ...",15,Goan Recipes,Lunch,Vegetarian
6867,Sweet Potato & Methi Stuffed Paratha Recipe,"1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद ...",30,North Indian Recipes,North Indian Breakfast,Diabetic Friendly
6868,Ullikadala Pulusu Recipe | Spring Onion Curry,150 grams Spring Onion (Bulb & Greens) - chopp...,5,Andhra,Side Dish,Vegetarian
6869,Kashmiri Style Kokur Yakhni Recipe-Chicken Coo...,"1 kg Chicken - medium pieces,1/2 cup Mustard o...",30,Kashmiri,Lunch,Non Vegeterian


In [22]:
food_df_top_cuisine.loc[food_df_top_cuisine['Cuisine'].isin(south_indian_tag), 'Combined_cuisine'] = 'South Indian'
food_df_top_cuisine.loc[food_df_top_cuisine['Cuisine'].isin(central_indian_tag), 'Combined_cuisine'] = 'Central Indian'
food_df_top_cuisine.loc[food_df_top_cuisine['Cuisine'].isin(north_indian_tag), 'Combined_cuisine'] = 'North Indian'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)
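
The message above is pandas' `SettingWithCopyWarning`: the filtered frame was derived from another frame, so pandas cannot tell whether the assignment is meant for the original or for the subset. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` when creating the subset. The toy frame and its values below are hypothetical stand-ins for `food_df`.

```python
import pandas as pd

# Toy frame standing in for food_df (hypothetical values)
toy_df = pd.DataFrame({
    "Cuisine": ["Andhra", "Punjabi", "Thai", "Andhra"],
    "Diet": ["Vegetarian", "Vegetarian", "Non Vegeterian", "Vegetarian"],
})

# Take an explicit copy so later assignments target an independent frame
subset = toy_df.loc[toy_df["Cuisine"].isin(["Andhra", "Punjabi"])].copy()

# Label rows by region; unmatched rows would stay NaN
subset.loc[subset["Cuisine"] == "Andhra", "Combined_cuisine"] = "South Indian"
subset.loc[subset["Cuisine"] == "Punjabi", "Combined_cuisine"] = "North Indian"
print(subset)
```

The original `toy_df` is left untouched, and the assignments on `subset` run without the warning.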


In [34]:
food_df_top_cuisine.loc[ ~food_df_top_cuisine['Combined_cuisine'].isnull() ]['Combined_cuisine'].value_counts()

South Indian      1341
North Indian      1079
Central Indian     395
Name: Combined_cuisine, dtype: int64
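
A quick arithmetic check on these counts: 395 equals 99 + 123 + 173 (Goan, Rajasthani, Maharashtrian), so the 115 recipes listed under Gujarati Recipes do not seem to be included. One plausible cause is an invisible character (U+FEFF) trailing that label in the raw data, which would make an exact string match fail. The sketch below uses a hypothetical toy series to show the effect and one way to normalize labels before matching.

```python
import pandas as pd

# Toy labels (hypothetical); the first carries a trailing U+FEFF
cuisine = pd.Series(["Gujarati Recipes\ufeff", "Goan Recipes", "Rajasthani"])
central_indian_tag = ["Gujarati Recipes", "Goan Recipes", "Rajasthani", "Maharashtrian Recipes"]

# Exact matching on the raw labels misses the first entry
print(cuisine.isin(central_indian_tag).tolist())

# Remove U+FEFF explicitly (str.strip alone does not treat it as whitespace),
# then trim ordinary leading/trailing whitespace
cleaned = cuisine.str.replace("\ufeff", "", regex=False).str.strip()
print(cleaned.isin(central_indian_tag).tolist())
```

Applying the same normalization to the `Cuisine` column before building the region tags would be one way to bring those rows into the grouping, assuming the invisible character is indeed the cause.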

In [35]:
food_df_top_cuisine.head(5)

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,PrepTimeInMins,Cuisine,Course,Diet,Combined_cuisine
0,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,Indian,Side Dish,Diabetic Friendly,
1,Spicy Tomato Rice (Recipe),"2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,South Indian Recipes,Main Course,Vegetarian,South Indian
2,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,South Indian
3,Gongura Chicken Curry Recipe - Andhra Style Go...,"500 grams Chicken,2 Onion - chopped,1 Tomato -...",15,Andhra,Lunch,Non Vegeterian,South Indian
4,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"1 tablespoon chana dal, 1 tablespoon white ura...",10,Andhra,South Indian Breakfast,Vegetarian,South Indian


In [36]:
food_df_top_cuisine['Course'].value_counts()

Lunch                      1542
Side Dish                   871
Snack                       823
Dinner                      691
Dessert                     610
Appetizer                   539
World Breakfast             253
Main Course                 239
South Indian Breakfast      237
North Indian Breakfast      109
Indian Breakfast             90
Vegetarian                   36
One Pot Dish                 23
High Protein Vegetarian       5
Brunch                        3
Vegan                         3
Eggetarian                    1
Name: Course, dtype: int64

In [37]:
breakfast_tag = ['South Indian Breakfast', 'North Indian Breakfast', 'Indian Breakfast', 'World Breakfast']

In [38]:
food_df_top_cuisine['Combined_Course'] = food_df_top_cuisine['Course'].copy()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  food_df_top_cuisine['Combined_Course'] = food_df_top_cuisine['Course'].copy()


In [39]:
food_df_top_cuisine.loc[ food_df_top_cuisine['Course'].isin(breakfast_tag), 'Combined_Course'] = 'Breakfast'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [40]:
food_df_top_cuisine = food_df_top_cuisine.reset_index(drop=True)

## Drop non-English entries for consistency

In [41]:
# Some rows in `TranslatedIngredients` still contain non-English (non-ASCII) text
def filter_english(string):
    try:
        string.encode('utf-8').decode('ascii')
        out = True
    except UnicodeDecodeError: 
        out = False
    return out

In [42]:
# Dropping rows whose ingredients are in a language other than English
df = food_df_top_cuisine.loc[ food_df_top_cuisine['TranslatedIngredients'].apply(filter_english) ]

In [43]:
df.shape

(5457, 8)
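
The filter treats "English" as "pure ASCII". A minimal sketch of the same idea using the built-in `str.isascii()` (Python 3.7+), which should behave the same on ordinary text, is shown below. The sample strings are illustrative; the third one is hypothetical and shows that an English entry containing an accented letter is dropped as well.

```python
samples = [
    "1 cup Gram flour (besan),1/2 cup Coconut milk",  # plain ASCII
    "500 ग्राम किंग फिश - काट ले",                      # Devanagari
    "2 Jalapeño peppers",                             # English with an accented letter
]

for text in samples:
    # True only when every character is in the ASCII range
    print(text.isascii(), "|", text)
```

This is a coarse but cheap heuristic: it removes Hindi rows reliably, at the cost of also removing the occasional English row with typographic or accented characters.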

In [44]:
df = df.reset_index(drop=True)

In [45]:
df.head(10)

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,PrepTimeInMins,Cuisine,Course,Diet,Combined_cuisine,Combined_Course
0,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,Indian,Side Dish,Diabetic Friendly,,Side Dish
1,Spicy Tomato Rice (Recipe),"2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,South Indian Recipes,Main Course,Vegetarian,South Indian,Main Course
2,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,South Indian,Breakfast
3,Gongura Chicken Curry Recipe - Andhra Style Go...,"500 grams Chicken,2 Onion - chopped,1 Tomato -...",15,Andhra,Lunch,Non Vegeterian,South Indian,Lunch
4,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"1 tablespoon chana dal, 1 tablespoon white ura...",10,Andhra,South Indian Breakfast,Vegetarian,South Indian,Breakfast
5,Pudina Khara Pongal Recipe (Rice and Lentils C...,"1 cup Rice - soaked for 20 minutes,1/2 cup Yel...",10,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,South Indian,Breakfast
6,Mexican Style Black Bean Burrito Recipe,"4 Tortillas,1/4 cup Black beans - soaked overn...",10,Mexican,Lunch,Vegetarian,,Lunch
7,Spicy Crunchy Masala Idli Recipe,"10 Idli - cut into strips,1 cup Green Bell Pep...",10,South Indian Recipes,Snack,Vegetarian,South Indian,Snack
8,Cauliflower Leaves Chutney (Recipe in Hindi),"1 cup cabbage leaves, 3/4 cup tomatoes, 18 gra...",5,South Indian Recipes,Side Dish,Vegetarian,South Indian,Side Dish
9,Homemade Baked Beans Recipe (Wholesome & Healthy),250 grams Dry beans - (such as cannellini or s...,60,Fusion,High Protein Vegetarian,Vegetarian,,High Protein Vegetarian


In [None]:
string_words_common = ['to taste', 'as required', 'tablespoon', 'teaspoon', 'cup', 'cups']

In [77]:
# Text cleaning function 
import re
def clean_symbols(row_entry):
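    # Collapse every run of non-letter characters (symbols, digits, underscores)
    # into a single space, so neighbouring words are never glued together
    return re.sub(r'[\W\d_]+', ' ', row_entry).strip().lower()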

In [78]:
clean_symbols(df['TranslatedIngredients'][0])

'karela bitter gourd pavakkai deseeded salt to taste onion thinly sliced tablespoon gram flour besan teaspoons turmeric powder haldi tablespoon red chilli powder teaspoons cumin seeds jeera tablespoon coriander powder dhania tablespoon amchur dry mango powder sunflower oil as required'

In [79]:
df['TranslatedIngredients'].map(clean_symbols)

0       karela bitter gourd pavakkai deseeded salt to ...
1       cups rice cooked tomatoes teaspoons bc belle b...
2       cups rice vermicelli noodles thin onion sliced...
3       grams chicken onion chopped tomato chopped gre...
4       tablespoon chana dal tablespoon white urad dal...
                              ...                        
5452    cups paneer homemade cottage cheese crumbled t...
5453    cup risotto cooked risotto recipe below cup pa...
5454    cup quinoa cup sugar teaspoon cardamom powder ...
5455    grams spring onion bulb greens chopped cup tam...
5456    kg chicken medium pieces cup mustard oil ghee ...
Name: TranslatedIngredients, Length: 5457, dtype: object
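
The cleaned strings still contain measurement words such as "cups" and "tablespoon". A minimal sketch of how a list like `string_words_common` might be applied is given below. The plural forms in the list and the `drop_common_words` helper are assumptions added for illustration; they are not part of the notebook so far.

```python
import re

# Measurement words and filler phrases (assumed list; plurals added for illustration)
string_words_common = ['to taste', 'as required', 'tablespoon', 'tablespoons',
                       'teaspoon', 'teaspoons', 'cup', 'cups']

# Longest alternatives first so 'cups' is tried before 'cup';
# \b restricts matches to whole words (so 'cupcake' is left alone)
alternatives = sorted(string_words_common, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, alternatives)) + r')\b')

def drop_common_words(cleaned_text):
    """Remove measurement words from text that is already lowercased and symbol-free."""
    without = pattern.sub(' ', cleaned_text)
    return ' '.join(without.split())  # collapse repeated spaces

example = 'salt to taste onion thinly sliced tablespoon gram flour besan teaspoons turmeric powder'
print(drop_common_words(example))
```

On the real data this could be chained after `clean_symbols`, for example `df['TranslatedIngredients'].map(clean_symbols).map(drop_common_words)`.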