## Importing Libraries

In [10]:
import pandas as pd
import numpy as np
import requests
from urllib.request import urlopen
import time
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import nltk
#nltk.download('brown')

## Task 1 : Web Scraping ##

The website I have used for the project is 'www.allrecipes.com'. To scrape all the recepies from the website, I first iterated over the first 11 pages of the website, and pulled the recipies in every page. 

In [2]:
recipe_list = []

for i in range(1,11):
    url = 'https://www.allrecipes.com/?page='+str(i)
    
    response = urlopen(url)
    my_html = response.read()
    response.close()
    
    soup = BeautifulSoup(my_html, 'html.parser')
    attribute_list = soup.find_all('a')

    for link in attribute_list:
        sublink = str(link.get('href'))
        if '/recipe/' in sublink:
            recipe_list.append(sublink)

recipe_list = list(set(recipe_list))
len(recipe_list)

100

__recipe_list__ now contains sublinks for every single recipe in my list. There are a total of 100 recipies in my list.

Here is an example of a sublink for a recipe.

In [3]:
recipe_list[10]

'https://www.allrecipes.com/recipe/229150/cheesy-amish-breakfast-casserole/'

Next, I will visit the sublink for every recipe, and extract the name of the recipe and all its ingredients. I will then make a dataframe for the url of the recipe, the name and the ingredient (with one separate row for every ingredient).

In [4]:
df = pd.DataFrame(columns=['url','name','ingredient'])

for link in recipe_list:
    url = link

    response = urlopen(url)
    my_html = response.read()
    response.close()

    soup = BeautifulSoup(my_html, 'html.parser')

    title = soup.find('h1').text

    attribute_list = soup.findAll('li', attrs={'class':"ingredients-item"})

    for line in attribute_list:
        new_row = {'url':url, 'name':title, 'ingredient':' '.join(line.text.split())}
        df = df.append(new_row, ignore_index=True)

df.head()

Unnamed: 0,url,name,ingredient
0,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,1 pound lean ground beef
1,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,1 (12 ounce) bottle chili sauce
2,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,¼ cup water
3,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,1 (1.25 ounce) package chili seasoning mix
4,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,1 tablespoon yellow mustard


Here is the shape of the raw dataframe.

In [5]:
df.shape

(708, 3)

In [7]:
#Saving the file in the folder
df.to_csv(r'rawData.csv', index=False)

## Task 2 : Data Cleaning ##

In [8]:
# Saving the original list of ingredients for comparison purposes later

ogingredients = df.ingredient

As a part of the Data Cleaning process, I will perform the following operations:
1. Convert all the documents in a lowercase format
2. Get rid of everything that is not an alphabet (numbers, special characters, etc.)
3. Lemmatize the words to their noun forms
4. Build a modified list of english stopwords which also include the terms for cooking related measurements and actions
5. Get rid of all the words that are not nouns or adjectives

In [11]:
#Converting all words to lowercase
df['ingredient'] = df.ingredient.apply(lambda x: " ".join(x.lower() for x in x.split())) 

#Removes Punctuation from all sentences
df.ingredient = df.ingredient.str.replace('[^a-zA-Z_\s]',' ')

#Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

#Lemmatizes all the words
df.ingredient = df.ingredient.apply(lambda x: " ".join(lemmatizer.lemmatize(x, wordnet.NOUN) for x in x.split()))

#Removing stop words
#Get all the english stop words except for the word 'all'
stop = stopwords.words('english')
stop.remove('all')

#Define a list of all the possible words that are used for measurement
measurements = ['piece', 'cup','tablespoon','length', 'bunch', 'pound', 'ounce', 'dash', 'teaspoon','quart','inch','degree','optional','half','package', 'chunk', 'pinch','f','c','envelope','small','bulk','large']

#Define a set of all the actions (instructions)
actions = ['cut','chopped','diced','cubed','taste','packed','slice','use','peeled','pitted','mashed']

#Growing the stop words list by adding in measurements and actions
stop.extend(measurements+actions)

#Removing all the words in the new stop word list
df.ingredient =  df.ingredient.apply(lambda x: " ".join(x for x in x.split() if x not in stop))

#Attempting the use of the wordnet library for removing everything except nouns
#df_temp['method1'] = df_temp.clean_ingredient.apply(lambda x: " ".join([n for n,t in [(w, wn.synsets(w)[0].pos()) for w in x.split()] if t in ['n']]))

#Removing words except nouns and adjectives
df.ingredient = df.ingredient.apply(lambda x: " ".join([n for n,t in [j[0] for j in [nltk.pos_tag([i]) for i in x.split()]] if t in ['NN','NNS','JJ']]))

df.head()

Unnamed: 0,url,name,ingredient
0,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,lean ground beef
1,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,bottle chili sauce
2,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,water
3,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,chili mix
4,https://www.allrecipes.com/recipe/236876/hot-d...,Hot Dogs with Coney Sauce,yellow mustard


Let us now compare the original ingredient list with the new cleaned ingredient list.

In [18]:
set(zip(ogingredients,df.ingredient))

{('active dry yeast', 'active dry yeast'),
 ('almond', 'almond'),
 ('asian chile pepper sauce', 'asian chile pepper sauce'),
 ('asian dark sesame oil', 'asian dark sesame oil'),
 ('avocado', 'avocado'),
 ('bacon', 'bacon'),
 ('bag frozen mexican style corn', 'bag frozen mexican style corn'),
 ('bag mexican cheese blend', 'bag mexican cheese blend'),
 ('baking powder', 'baking powder'),
 ('baking soda', 'baking soda'),
 ('banana', 'banana'),
 ('basil', 'basil'),
 ('basil leaf', 'basil leaf'),
 ('bay leaf', 'bay leaf'),
 ('beef broth', 'beef broth'),
 ('beef chuck roast', 'beef chuck roast'),
 ('beef flank steak thick diagonal', 'beef flank steak thick diagonal'),
 ('beef hot dog', 'beef hot dog'),
 ('beef stock', 'beef stock'),
 ('beer', 'beer'),
 ('black bean', 'black bean'),
 ('black olive', 'black olive'),
 ('black pepper', 'black pepper'),
 ('boneless pork chop', 'boneless pork chop'),
 ('boneless pork shoulder roast', 'boneless pork shoulder roast'),
 ('boneless skinless chicken br

As you can see in the list, some of the items are repeated with a different name (for e.g. clove garlic and clove fresh garlic are the same ingredient). With further modifications in the code, we can correct these problems as well. However, I want to keep this code as general as possible.

In [13]:
#Saving the file in the folder
df.to_csv(r'cleanData.csv', index=False)

## Task 3 : Calculating

Finally, I grouped the data by the ingredients to get the number of times the ingredient has occured in my data. Next, I divided this count by the number of unique recipies I have in my list. This gave me the proportion of the total unique recipies that have the ingredient in them.

In [15]:
df_res = df.groupby('ingredient').count().sort_values(by='url', ascending=False)[['name']].reset_index()
df_res = df_res.rename({'name':'count', 'ingredient':'word'}, axis=1)
total = len(df.name.unique())
df_res['proportion'] = df_res['count'].apply(lambda x:x/total)
df_top10 = df_res.head(10)
df_top10

Unnamed: 0,word,count,proportion
0,salt,38,0.550725
1,purpose flour,31,0.449275
2,butter,31,0.449275
3,white sugar,29,0.42029
4,egg,20,0.289855
5,water,16,0.231884
6,onion,15,0.217391
7,brown sugar,14,0.202899
8,clove garlic,13,0.188406
9,ground black pepper,13,0.188406


These results make sense as well.

Salt is used in almost all the recipies (expect for deserts), which is why its present in a very high proporton of 0.55.

Next in the line are flour, butter and white sugar - all three which are again very popular in all the dishes.

In [16]:
#Saving the file in the folder
df_top10.to_csv(r'results.csv', index=False)