## Monty Morawan

#### I will use the Wikibooks Cookbook website, https://en.wikibooks.org/wiki/Cookbook:Recipes, to scrape the recipes that start with the letter 'B' and store the information in a dataframe where, as closely as I can manage, each row represents a single ingredient of a recipe.

In [1]:
## Imports 
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
import pandas as pd
import numpy as np
from fractions import Fraction

In [2]:
## Gets the website page from its URL
url = "https://en.wikibooks.org/wiki/Cookbook:Recipes"
response = requests.get(url)

In [3]:
## Retrieves content from the website page
wiki = response.content
## Reads the html website page content and creates a navigable tree structure
wikiSoup = BeautifulSoup(wiki, "html.parser")
## Selects div called mw-parser-output that contains all recipes listed
DIV_mw_parser_output = wikiSoup.find("div", class_ = "mw-parser-output")
## Selects section that contains list of recipes that start with the letter 'B' 
ul_b_foods_section = DIV_mw_parser_output.find_all("ul")[3]
## Selects list of all recipes that start with the letter 'B' 
## On the website, there are 58 of these recipes
ul_b_foods = ul_b_foods_section.find_all("li")

#### The code above scrapes all recipes that start with the letter 'B' from the website page. There are 58 of these recipes on the website page. 
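#### A minimal sketch of the same selection on a small inline page, so it can be tried without a network request. The `sample_html` string, its two links, and the `recipes` list are hypothetical stand-ins, not the real page. The sketch counts the list items with `len` instead of assuming a fixed number, and skips items without a link.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page; the real page has more lists and items.
sample_html = """
<div class="mw-parser-output">
  <ul>
    <li><a href="/wiki/Cookbook:Baati">Baati</a></li>
    <li><a href="/wiki/Cookbook:Baba_ganoush">Baba ganoush</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find("div", class_="mw-parser-output")
## Picks a list by position (index 0 here; the notebook uses index 3 on the real page)
items = container.find_all("ul")[0].find_all("li")

recipes = []
for item in items:
    link = item.find("a")
    ## Skips list items that carry no usable link
    if link is None or not link.get("href"):
        continue
    ## Joins the relative href to the site root to get the full recipe URL
    recipes.append((link.text, urljoin("https://en.wikibooks.org/", link["href"])))

print(len(recipes), "recipes found")
```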

In [4]:
## Creates an empty recipe data list that will hold information 
## for each row in the data frame that we want to create
recipe_data = []

## Each recipe in our list links to a separate page
## dedicated to that recipe, so I need each recipe's
## hyperlink URL.
## The line below is the beginning of every recipe's
## URL, which I join to the recipe's unique URL
## section to get the full URL
url_start = 'https://en.wikibooks.org/'

## The code below goes through each recipe and
## each of its ingredients to extract the recipe name
## and the ingredient's text, quantity, unit, and name.
## This information, together with an arbitrary
## unique recipe id number for each recipe, is then
## added to the recipe_data list, where each element
## in the list represents an ingredient for a recipe

## Loops through every recipe in the list (58 when this notebook was run)
for i in range(len(ul_b_foods)):
    ## Retrieves the section for the current recipe
    ## from the main website page
    food_i = ul_b_foods[i]
    ## Selects the unique hyperlink text that sends user
    ## to the current recipe's page
    food_i_name = food_i.find("a").text
    ## Joins the unique URL text from above
    ## to the beginning URL text to create
    ## the full URL to the recipe's page
    url_food_i = urljoin(url_start, food_i.a["href"])
    ## Gets the website page for the current recipe
    response_food_i = requests.get(url_food_i)
    ## Retrieves content from the website page
    wiki_recipe_i = response_food_i.content
    ## Reads the html website page content and creates a navigable tree structure
    wiki_recipe_i_soup = BeautifulSoup(wiki_recipe_i, "html.parser")
    ## Selects div that contains information for recipe
    DIV_mw_parser_output_i = wiki_recipe_i_soup.find("div", class_ = "mw-parser-output")
    
    ## The try/except lets the loop skip any recipe whose page raises an
    ## error (usually because of inconsistent formatting) and move on to the next one
    try:
        ## Selects section that contains list of ingredients for the current recipe
        ingredients_recipe_i = DIV_mw_parser_output_i.find_all("ul")[0]
        ## Selects each ingredient for the current recipe
        ingredient_j_recipe_i = ingredients_recipe_i.find_all('li')
        ## Stores the number of ingredients for the current recipe
        num_ingredients_recipe_i = len(ingredient_j_recipe_i)
        
        ## The following code within the for loop goes through each
        ## ingredient and adds its recipe_id (index i of overarching for
        ## loop), recipe name, text, quantity, unit, and name to the recipe_data
        ## list
        ## The line below loops through each ingredient of the current recipe
        for j in range(num_ingredients_recipe_i):
            ## Selects current ingredient
            ingredient_j = ingredient_j_recipe_i[j]
            ## Retrieves text of current ingredient
            ingredient_j_text = ingredient_j.text
            ## Creates a list of strings where each element is a word from
            ## the ingredient's text
            ingredient_j_text_split = ingredient_j_text.split()
            ## Stores the number of words in the ingredient's text
            ingredient_j_text_split_length = len(ingredient_j_text_split)
            ## Stores the first word in the text list, which usually corresponds
            ## to the quantity
            ingredient_j_quantity = ingredient_j_text_split[0]
            ## Stores the second word in the text list, which usually corresponds
            ## to the unit
            ingredient_j_unit = ingredient_j_text_split[1]
            
            ## The following if/else retrieves and stores the name of the ingredient.
            ## The full name of the ingredient was generally a hyperlink, so when the
            ## ingredient has a hyperlink, its text is stored as the ingredient name.
            ## When there is no hyperlink, the last word in the text list is stored
            ## instead, since the name usually comes last in the text
            
            ## If there is no hyperlink
            if ingredient_j.find("a") is None:
                ## Stores last word in the text list as the ingredient name
                ingredient_j_name = ingredient_j_text_split[-1]
            ## If there is a hyperlink
            else:
                ## Stores hyperlink text as the ingredient name
                ingredient_j_name = ingredient_j.find("a").text
                
            ## Stores ingredient information (recipe_id, recipe name, text, quantity, unit, and name) 
            ## for current recipe in the recipe_data list as a single element
            recipe_data.append([i,food_i_name,ingredient_j_text,ingredient_j_quantity,ingredient_j_unit,ingredient_j_name])
        
    ## If the current recipe raises an error, most likely because of
    ## inconsistent formatting on its page, print a message and continue
    except Exception:
        print('Inconsistent recipe page format encountered')
        
    ## Stops code from running for 1 second so that I am not scraping too fast 
    time.sleep(1)

Inconsistent recipe page format encountered
Inconsistent recipe page format encountered
Inconsistent recipe page format encountered
Inconsistent recipe page format encountered
Inconsistent recipe page format encountered
Inconsistent recipe page format encountered


#### The code above pulls the list of ingredients for each recipe and, for each ingredient, stores its recipe_id (an arbitrary number that is unique to each recipe), recipe name, text, quantity, unit, and ingredient name in a list. Each element in this list represents a single ingredient for a recipe.
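#### The per-ingredient step can be sketched as a small function and tried on inline HTML. The function name `extract_ingredient` and the sample list are hypothetical. Unlike the loop above, this sketch returns None for a missing word instead of raising an error, which is one possible way to keep one-word items from discarding a whole recipe.

```python
from bs4 import BeautifulSoup

def extract_ingredient(li):
    """Return (text, quantity, unit, name) for one <li>; None marks a missing word."""
    text = li.get_text()
    words = text.split()
    ## First word is assumed to be the quantity, second word the unit
    quantity = words[0] if len(words) > 0 else None
    unit = words[1] if len(words) > 1 else None
    ## Prefers the hyperlink text as the name, falling back to the last word
    link = li.find("a")
    if link is not None:
        name = link.get_text()
    else:
        name = words[-1] if words else None
    return text, quantity, unit, name

## Hypothetical ingredient list: one linked item, one plain item, one single-word item
sample = BeautifulSoup(
    '<ul><li>1/2 teaspoon <a href="/wiki/Cookbook:Salt">salt</a></li>'
    "<li>3 tablespoons oil</li><li>Water</li></ul>",
    "html.parser",
)
rows = [extract_ingredient(li) for li in sample.find_all("li")]
for row in rows:
    print(row)
```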

In [5]:
## Stores the information from the recipe_data list in a dataframe called 'recipe_df' with columns 
## id, name, text, quantity, unit, and ingredient, where each row represents an ingredient for a recipe
recipe_df = pd.DataFrame(recipe_data, columns = ['id','name','text','quantity','unit','ingredient'])

## The following two lines print the number of recipes that were scraped
num_recipes = len(recipe_df.name.unique())
print('There are', num_recipes, 'recipes in this dataframe')

## Displays some rows from the dataframe
recipe_df

There are 53 recipes in this dataframe


Unnamed: 0,id,name,text,quantity,unit,ingredient
0,0,Baati,1 pound (450 g) wheat flour,1,pound,wheat flour
1,0,Baati,1/2 teaspoon salt,1/2,teaspoon,salt
2,0,Baati,3 tablespoons oil,3,tablespoons,oil
3,0,Baati,1 1/2 teaspoons ghee (for serving),1,1/2,ghee
4,1,Baba ganoush,"1 medium-large eggplant, any variety, 1 to 1½ ...",1,medium-large,eggplant
...,...,...,...,...,...,...
395,56,Butter Tart,1/4 teaspoon salt,1/4,teaspoon,salt
396,57,Buuz,1 Ingredients,1,Ingredients,1 Ingredients
397,57,Buuz,2 Preparation,2,Preparation,2 Preparation
398,57,Buuz,3 Cooking,3,Cooking,3 Cooking


#### The code above creates a dataframe from the list built in the previous cell. Its columns are id, name, text, quantity, unit, and ingredient, and each row represents an ingredient for a recipe. One issue that I cannot easily solve is that the ingredient text is not formatted consistently. Usually the quantity is listed first, followed by the unit and then the ingredient name, with the quantity and unit each being a single word or number, so I retrieved these values by their position in the text. However, some ingredients list these properties in a different order, and others use several words or numbers for a single property, which my code cannot handle because it assumes one word or number per property. For example, "1 1/2 teaspoons ghee (for serving)" is stored with a quantity of 1 and a unit of 1/2. Some ingredients do not list a quantity at all. As a result, some rows in my dataframe have the wrong values in certain columns. Additionally, I was not able to scrape every recipe because of inconsistent recipe page formats, so the dataframe contains only 53 recipes. Some rows from the dataframe are printed above.
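#### One possible way to handle quantities written as mixed numbers is sketched below using `Fraction`, which the notebook already imports. This is not what the notebook does. The names `parse_quantity` and `parse_ingredient` are hypothetical, and the sketch only recognizes ASCII digits at the start of the text, so Unicode fractions such as ½ and ranges such as "1 to 2" are left unparsed.

```python
import re
from fractions import Fraction

## Matches a mixed number (1 1/2), a fraction (1/2), or an integer (3)
## at the start of the ingredient text
_QUANTITY_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+)\b")

def parse_quantity(text):
    """Return (quantity, remainder); quantity is None when no leading number is found."""
    match = _QUANTITY_RE.match(text)
    if match is None:
        return None, text.strip()
    ## Summing the parts turns "1 1/2" into Fraction(3, 2)
    quantity = sum(Fraction(part) for part in match.group(1).split())
    return quantity, text[match.end():].strip()

def parse_ingredient(text):
    """Return (quantity, unit); unit is the word after the quantity, if any."""
    quantity, remainder = parse_quantity(text)
    words = remainder.split()
    ## Only treats the next word as a unit when something follows it,
    ## so "3 eggs" is not given "eggs" as its unit
    unit = words[0] if quantity is not None and len(words) > 1 else None
    return quantity, unit

for sample_text in ["1 1/2 teaspoons ghee (for serving)", "1/2 teaspoon salt", "salt to taste"]:
    print(sample_text, "->", parse_ingredient(sample_text))
```

#### The word after the quantity is still only a guess at the unit: for "1 medium-large eggplant" this sketch would report "medium-large", so a list of known units would be needed to go further.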

In [6]:
## Stores the recipe_df dataframe in a csv file 
recipe_df.to_csv('recipe.csv',index=False)

#### The code above saves the dataframe to a CSV file named recipe.csv. I consulted the pandas documentation, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html, while writing it.