More often than not, data published on the web are not available as a structured dataset such as those we used in the other labs. Retrieving the data requires going through the web pages, examining the HTML code and extracting the information. This technique is known as [Web Scraping](https://en.wikipedia.org/wiki/Web_scraping). Clearly, it can be extremely tedious.

We will work with the Python module [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), which provides a nice set of tools for extracting information from web pages. The information that we will extract is located in the [BBC Food Recipes database](https://www.bbc.co.uk/food/recipes). We wish to retrieve the available recipes and store them in a document database where each recipe becomes a separate document. We will use [MongoDB](https://www.mongodb.com/what-is-mongodb) and the Database-as-a-Service provider [mLab](https://mlab.com/).

The first important step before diving into the code is to use your browser to inspect the HTML code of the pages that we will scrape. All major browsers offer developer tools that present the HTML code in a human-readable format, for example the [Mozilla Firefox Developer Tools](https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Examine_and_edit_HTML). We will work with the following pages:
* [Ingredients Index from A to Z](http://www.bbc.co.uk/food/ingredients)
* [List of Ingredients](http://www.bbc.co.uk/food/ingredients/by/letter/a)
* [List of Recipes for a specific Ingredient](http://www.bbc.co.uk/food/acidulated_water)
* [Recipe](https://www.bbc.co.uk/food/recipes/roman-style_saltimbocca_44940)

For each recipe we wish to extract and store as a document the following information:
* Name
* URL to BBC web site
* Preparation Time
* Cooking Time
* Servings
* List of Ingredients
* Related Recipes

## Scraping data

We will start with the first page in order to retrieve the list of ingredients for a specific index letter, namely the [List of Ingredients for Letter A](http://www.bbc.co.uk/food/ingredients/by/letter/a).

If you inspect the HTML code of the page (via your browser) you will identify an [HTML ordered list tag](https://www.w3schools.com/tags/tag_ol.asp) containing one [HTML list item tag](https://www.w3schools.com/tags/tag_li.asp) for each ingredient. The URL of each ingredient's page is included in an [HTML link tag](https://www.w3schools.com/tags/tag_a.asp). Here is a short extract:

```html
<ol class="resources foods grid-view">
                <li class="resource food" id="ackee">
                    <a href="/food/ackee">
                        <img src="http://ichef.bbci.co.uk/food/ic/food_16x9_111/foods/fruit_and_vegetables_16x9.jpg" alt="ackee" width="111" height="63">
                        Ackee                    </a>
                                    </li>
  ....                                    
                                    </ol>
```
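
The `href` in the extract is relative to the site root (`/food/ackee`). As a small sketch, assuming the page URL used in this lab as the base, such a link can be resolved to an absolute URL with the standard library function `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

# Base page on which the relative link was found (taken from this lab).
base_url = 'http://www.bbc.co.uk/food/ingredients/by/letter/a'

# The href value as it appears in the HTML extract.
href = '/food/ackee'

# A leading slash replaces the whole path of the base URL.
absolute_url = urljoin(base_url, href)
print(absolute_url)
```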

The above information can be easily extracted from the web site using the [find_all](http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python) method of BeautifulSoup.

We first retrieve the contents of the page http://www.bbc.co.uk/food/ingredients/by/letter/a using [requests](http://docs.python-requests.org/en/master/#), a Python library that makes working with the HTTP protocol simple and straightforward.

In [1]:
import requests

url = 'http://www.bbc.co.uk/food/ingredients/by/letter/a'
response = requests.get(url)

The next step is to pass the contents of the HTTP response to BeautifulSoup so that they are parsed and converted into a Python object.

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

We are now ready to search the processed HTML page using the *find_all* method. We are looking for the ```<a>``` tags that contain the address of each individual ingredient. Searching for every ```<a>``` tag lists all the links contained in the page, which are many more than the ones we are looking for. So we narrow down the search to those whose address starts with ```/food/```. We do this by inspecting the *href* attribute of the ```<a>``` tag.

In [3]:
for link in soup.find_all('a'):
    if (link.get('href').startswith('/food/')):
        print(link.get('href'))

/food/
/food/
/food/
/food/recipes/
/food/seasons
/food/occasions
/food/cuisines
/food/dishes
/food/chefs
/food/programmes
/food/ingredients
/food/techniques
/food/about
/food/my/favourites
/food/ingredients
/food/ingredients/by/letter/b
/food/ingredients/by/letter/c
/food/ingredients/by/letter/d
/food/ingredients/by/letter/e
/food/ingredients/by/letter/f
/food/ingredients/by/letter/g
/food/ingredients/by/letter/h
/food/ingredients/by/letter/i
/food/ingredients/by/letter/j
/food/ingredients/by/letter/k
/food/ingredients/by/letter/l
/food/ingredients/by/letter/m
/food/ingredients/by/letter/n
/food/ingredients/by/letter/o
/food/ingredients/by/letter/p
/food/ingredients/by/letter/q
/food/ingredients/by/letter/r
/food/ingredients/by/letter/s
/food/ingredients/by/letter/t
/food/ingredients/by/letter/u
/food/ingredients/by/letter/v
/food/ingredients/by/letter/w
/food/ingredients/by/letter/y
/food/ingredients/by/letter/z
/food/acidulated_water
/food/ackee
/food/acorn_squash
/food/aduki_beans


Judging from the above output, it is clear that we need to refine the list further in order to exclude the links that are not related to ingredients:
* The /food/ links
* The links pointing to the Ingredient Index
* The links pointing to anchors (#) within an ingredient page

In [4]:
for link in soup.find_all('a'):
    linkURL = link.get('href')
    if (linkURL.startswith('/food/')):
        if (linkURL == '/food/'):
            continue
            
        if (linkURL.find('#related-foods') > 0):
            continue
            
        if (linkURL.startswith('/food/ingredients')):
            continue
            
        print(linkURL)

/food/recipes/
/food/seasons
/food/occasions
/food/cuisines
/food/dishes
/food/chefs
/food/programmes
/food/techniques
/food/about
/food/my/favourites
/food/acidulated_water
/food/ackee
/food/acorn_squash
/food/aduki_beans
/food/egg_liqueur
/food/agar-agar
/food/ale
/food/aleppo_pepper
/food/alfalfa_sprouts
/food/allspice
/food/almond
/food/almond_essence
/food/almond_extract
/food/almond_milk
/food/amaranth
/food/amaretti
/food/anchovy
/food/anchovy_essence
/food/angelica
/food/bitters
/food/anise
/food/apple
/food/apple_chutney
/food/apple_juice
/food/apple_sauce
/food/apricot
/food/apricot_jam
/food/arborio_rice
/food/arbroath_smokie
/food/argan_oil
/food/arrowroot
/food/artichoke
/food/asafoetida
/food/asparagus
/food/aubergine
/food/avocado


We narrowed down the list, yet there are still some links that are not related to ingredients. These do not seem to follow a pattern, so we will have to list them one by one.

In [5]:
unrelated = ['/food/recipes/', '/food/seasons', '/food/occasions', '/food/cuisines', '/food/dishes', 
             '/food/chefs', '/food/programmes', '/food/techniques', '/food/about', '/food/my/favourites']

for link in soup.find_all('a'):
    linkURL = link.get('href')
    if (linkURL.startswith('/food/')):
        if (linkURL == '/food/'):
            continue
            
        if (linkURL.find('#related-foods') > 0):
            continue
            
        if (linkURL.startswith('/food/ingredients')):
            continue

        if (linkURL in unrelated):
            continue
            
        print(linkURL)

/food/acidulated_water
/food/ackee
/food/acorn_squash
/food/aduki_beans
/food/egg_liqueur
/food/agar-agar
/food/ale
/food/aleppo_pepper
/food/alfalfa_sprouts
/food/allspice
/food/almond
/food/almond_essence
/food/almond_extract
/food/almond_milk
/food/amaranth
/food/amaretti
/food/anchovy
/food/anchovy_essence
/food/angelica
/food/bitters
/food/anise
/food/apple
/food/apple_chutney
/food/apple_juice
/food/apple_sauce
/food/apricot
/food/apricot_jam
/food/arborio_rice
/food/arbroath_smokie
/food/argan_oil
/food/arrowroot
/food/artichoke
/food/asafoetida
/food/asparagus
/food/aubergine
/food/avocado


We are now ready to convert this piece of code into a function that returns the extracted links in the form of a list. Each ingredient is stored without the leading `/food/` prefix, which is what the slice `[6:]` removes. This way we will be able to go through the letters of the alphabet one by one and collect all the ingredients.

In [6]:
def extractIngredients(soup):
    ingredients = []
    unrelated = ['/food/recipes/', '/food/seasons', '/food/occasions', '/food/cuisines', '/food/dishes', 
                 '/food/chefs', '/food/programmes', '/food/techniques', '/food/about', '/food/my/favourites']

    for link in soup.find_all('a', href=True):  # only anchors that have an href
        linkURL = link.get('href')
        if (linkURL.startswith('/food/')):
            if (linkURL == '/food/'):
                continue

            if (linkURL.find('#related-foods') > 0):
                continue

            if (linkURL.startswith('/food/ingredients')):
                continue

            if (linkURL in unrelated):
                continue

            ingredients.append(linkURL[6:])
    
    return ingredients
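
The filtering rules inside `extractIngredients` can also be expressed as a small standalone predicate, which makes them easy to check without fetching any page. This is only a sketch: the names `is_ingredient_link` and `ingredient_name` are hypothetical and not part of the lab code, and the sample links are made up.

```python
PREFIX = '/food/'

# Navigation links that share the /food/ prefix but are not ingredients.
UNRELATED = {'/food/recipes/', '/food/seasons', '/food/occasions',
             '/food/cuisines', '/food/dishes', '/food/chefs',
             '/food/programmes', '/food/techniques', '/food/about',
             '/food/my/favourites'}


def is_ingredient_link(href):
    """Return True if href looks like a link to an ingredient page."""
    if not href or not href.startswith(PREFIX) or href == PREFIX:
        return False
    if '#related-foods' in href or href.startswith('/food/ingredients'):
        return False
    return href not in UNRELATED


def ingredient_name(href):
    """Strip the '/food/' prefix; equivalent to href[6:] here."""
    return href[len(PREFIX):]


# Made-up sample of href values, including a missing one (None).
links = ['/food/', '/food/seasons', '/food/ingredients/by/letter/b',
         '/food/ackee', None, '/food/almond#related-foods']
names = [ingredient_name(h) for h in links if is_ingredient_link(h)]
print(names)
```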

[Going through all the letters of the alphabet](https://stackoverflow.com/questions/17182656/how-do-i-iterate-through-the-alphabet-in-python-please) and extracting the ingredients listed on the website is now a simple iteration.

In [7]:
from string import ascii_lowercase

ingredients = []
for letter in ascii_lowercase:
    url = 'http://www.bbc.co.uk/food/ingredients/by/letter/'+letter
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    ingredients += extractIngredients(soup)

Depending on your Internet connection this might take a while.

In [8]:
len(ingredients)

1108

In [9]:
ingredients[0:15]

['acidulated_water',
 'ackee',
 'acorn_squash',
 'aduki_beans',
 'egg_liqueur',
 'agar-agar',
 'ale',
 'aleppo_pepper',
 'alfalfa_sprouts',
 'allspice',
 'almond',
 'almond_essence',
 'almond_extract',
 'almond_milk',
 'amaranth']

## Extracting List of Recipes

Now that we have extracted all the ingredients listed in the BBC Food recipes database, we move on to extracting the recipes by scraping the page of each ingredient.

Again, we start by exploring the page with our favorite browser. Once again the information (recipes) that we are looking for can be found within the ```<a>``` tags, in particular in the *href* attribute. This time, however, we look for links that start with */food/recipes/*.

In [10]:
url = 'http://www.bbc.co.uk/food/acidulated_water'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

In [11]:
for link in soup.find_all('a'):
    linkURL = link.get('href')
    if (linkURL.startswith('/food/recipes/')):
        if (linkURL == '/food/recipes/'):
            continue
            
        print(link.get('href'))

/food/recipes/barbecue_baby_back_ribs_42228
/food/recipes/derry_duck_and_64105
/food/recipes/roasted_salt_marsh_lamb_13899
/food/recipes/roman-style_saltimbocca_44940
/food/recipes/honeyandzaatarglazed_91314
/food/recipes/terrineofcapricorngo_81701
/food/recipes/search?keywords=acidulated water


This time we are done much faster. Notice that a link to the *search* page is also included. Let's turn this into a function as well, one that also skips the search link, so that we can extract the recipes for all ingredients in one go.

In [12]:
def extractRecipes(soup):
    recipes = []
    
    for link in soup.find_all('a', href=True):  # only anchors that have an href
        linkURL = link.get('href')
        if (linkURL.startswith('/food/recipes/')):
            if (linkURL == '/food/recipes/'):
                continue
                
            if (linkURL.find('/search') > 0):
                continue                                        

            recipes.append(link.get('href')[14:])
    
    return recipes

Now we are ready to go through all the ingredients we extracted previously and one by one also retrieve the recipes. However, as we do this, we can also store the index of ingredients and build the reverse index for recipes.

## Building a Reverse Index

For the indexes we work with dictionaries. As we go through the list we maintain two dictionaries: one keyed by ingredient and one keyed by recipe. For each key we store a list of all the items connected to it. So in the first dictionary each ingredient maps to a list of recipes, and in the second each recipe maps to a list of ingredients.
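
As a small illustration of the two dictionaries, here is a sketch on made-up data rather than scraped pages. The variable names and the sample recipes are hypothetical; `dict.setdefault` is used as a compact way to create the list the first time a recipe is seen.

```python
# Made-up sample data: ingredient -> recipes that use it.
sample = {
    'almond': ['cake_1', 'tart_2'],
    'apple': ['tart_2'],
}

index_ingredients = {}
index_recipes = {}

for ingredient, found in sample.items():
    # Forward index: ingredient -> list of recipes.
    index_ingredients[ingredient] = found
    # Reverse index: recipe -> list of ingredients.
    for recipe in found:
        index_recipes.setdefault(recipe, []).append(ingredient)

print(index_recipes)
```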

In [15]:
indexIngredients = {}
indexRecipes = {}

recipes = []
for item in ingredients:
    url = 'http://www.bbc.co.uk/food/' + item
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    thislist = extractRecipes(soup)
    
    # produce short debug to keep track of progress
    print(item, len(thislist))
    
    recipes += thislist
    
    # Update Ingredients index
    indexIngredients[item] = thislist
    
    # update Recipes index    
    for recipe in thislist:
        recipeIngredients = indexRecipes.get(recipe, [])        
        recipeIngredients.append(item)
        indexRecipes[recipe] = recipeIngredients

acidulated_water 6
ackee 4
acorn_squash 4
aduki_beans 1
egg_liqueur 1
agar-agar 7
ale 17
aleppo_pepper 1
alfalfa_sprouts 4
allspice 36
almond 38
almond_essence 9
almond_extract 14
almond_milk 1
amaranth 13
amaretti 11
anchovy 22
anchovy_essence 6
angelica 3
bitters 4
anise 5
apple 45
apple_chutney 9
apple_juice 28
apple_sauce 10
apricot 24
apricot_jam 15
arborio_rice 14
arbroath_smokie 7
argan_oil 1
arrowroot 18
artichoke 19
asafoetida 11
asparagus 24
aubergine 23
avocado 25
bacon 32
bagel 12
baguette 22


ConnectionError: HTTPConnectionPool(host='www.bbc.co.uk', port=80): Max retries exceeded with url: /food/baked_beans (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fde9583b908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

Once again, this will take a while. This time, however, because the list of ingredients is quite long, it is likely that an error (such as the connection error above) interrupts the process and forces us to restart. We need to implement error handling using Python's [errors and exceptions](https://docs.python.org/3/tutorial/errors.html) mechanism.

We will use a very basic repeat mechanism that examines the items one by one; every time an error is encountered we repeat the request until the page is eventually scraped properly.
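
A retry loop of this kind can also be bounded, so that a page that never loads does not block the run forever. The sketch below is a variant under stated assumptions, not the lab's code: `fetch_with_retry`, its parameters and the simulated `flaky_fetch` are hypothetical names, and the pause between attempts is an added design choice.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # in real use, catch a narrower class
            print('Attempt', attempt, 'failed:', error)
            if attempt == attempts:
                raise  # give up and let the caller see the error
            time.sleep(delay)


# Stand-in for a page download that fails twice before succeeding.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'page contents'


result = fetch_with_retry(flaky_fetch, attempts=5, delay=0.01)
print(result, calls['count'])
```

In the scraping loop, `fetch` would be a function that downloads and parses one ingredient page.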

In [16]:
indexIngredients = {}
indexRecipes = {}

recipes = []
for item in ingredients:
    repeat = True
    while (repeat):
        try:
            url = 'http://www.bbc.co.uk/food/' + item
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "lxml")
            thislist = extractRecipes(soup)
            
        except requests.exceptions.RequestException:
            # network problem: report it and let the while loop try again
            print("Error encountered while collecting", item)
        
        else:
            repeat = False
                        
    # produce short debug to keep track of progress
    print(item, len(thislist))

    recipes += thislist

    # Update Ingredients index
    indexIngredients[item] = thislist

    # update Recipes index    
    for recipe in thislist:
        recipeIngredients = indexRecipes.get(recipe, [])        
        recipeIngredients.append(item)
        indexRecipes[recipe] = recipeIngredients

acidulated_water 6
ackee 4
acorn_squash 4
aduki_beans 1
egg_liqueur 1
agar-agar 7
ale 17
aleppo_pepper 1
alfalfa_sprouts 4
allspice 36
almond 38
almond_essence 9
almond_extract 14
almond_milk 1
amaranth 13
amaretti 11
anchovy 22
anchovy_essence 6
angelica 3
bitters 4
anise 5
apple 45
apple_chutney 9
apple_juice 28
apple_sauce 10
apricot 24
apricot_jam 15
arborio_rice 14
arbroath_smokie 7
argan_oil 1
arrowroot 18
artichoke 19
asafoetida 11
asparagus 24
aubergine 23
avocado 25
bacon 32
bagel 12
baguette 22
baked_beans 6
cakes_and_baking 16
baking_powder 31
balsamic_vinegar 36
bamboo_shoots 11
banana 32
banana_bread 7
barbary_duck 3
barbecue_sauce 11
barley 9
basil 37
basmati_rice 24
bay_boletes 0
bay_leaf 36
bean 16
beansprouts 16
bechamel_sauce 8
beef 17
beef_consomme 2
beef_dripping 13
beef_mince 13
beef_ribs 7
beef_rump 5
beef_sausage 6
beef_stock 20
beef_tomato 15
beer 20
beetroot 36
berry 20
betel_leaves 0
beurre_manie 1
bicarbonate_of_soda 26
bilberries 0
birds-eye_chillies 12
bisc

In [17]:
len(recipes)

15482

In [18]:
recipes[0:5]

['barbecue_baby_back_ribs_42228',
 'derry_duck_and_64105',
 'roasted_salt_marsh_lamb_13899',
 'roman-style_saltimbocca_44940',
 'honeyandzaatarglazed_91314']

In [19]:
len(indexIngredients)

1108

In [20]:
indexIngredients['acidulated_water']

['barbecue_baby_back_ribs_42228',
 'derry_duck_and_64105',
 'roasted_salt_marsh_lamb_13899',
 'roman-style_saltimbocca_44940',
 'honeyandzaatarglazed_91314',
 'terrineofcapricorngo_81701']

In [21]:
len(indexRecipes)

5198

In [22]:
indexRecipes['roman-style_saltimbocca_44940']

['acidulated_water',
 'anchovy',
 'broad_beans',
 'caster_sugar',
 'escalope',
 'globe_artichoke',
 'lettuce',
 'mint',
 'pancetta',
 'pear',
 'pea',
 'pecorino_cheese',
 'plain_flour',
 'prosciutto',
 'sage',
 'spring_onion',
 'vegetable_stock',
 'white_wine']

## Extracting Recipes

Now we are ready to go into each recipe and extract the information that we look for. Again, we use our browser to examine the HTML code:
* The name of the recipe can be extracted from the title of the page
* It seems that Preparation Time, Cooking Time and Number of Servings are all [HTML paragraph tags](https://www.w3schools.com/tags/tag_p.asp) that carry a custom attribute *itemprop*.
 * For Preparation Time the attribute is set to "prepTime"
 * For Cooking Time the attribute is set to "cookTime"
 * For Number of Servings the attribute is set to "recipeYield"

In [23]:
url = 'https://www.bbc.co.uk/food/recipes/roman-style_saltimbocca_44940'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

Extracting the title is straightforward.

In [24]:
soup.title.contents[0]

'BBC Food - Recipes - Roman-style saltimbocca with caramelised pear, fried sage and vignarola'

In [25]:
title = soup.title.contents[0][21:]
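
The slice `[21:]` relies on the prefix `'BBC Food - Recipes - '` being exactly 21 characters long. As a sketch, the prefix can instead be stripped only when it is actually present; `strip_title_prefix` is a hypothetical helper name, not part of the lab code.

```python
PREFIX = 'BBC Food - Recipes - '  # 21 characters, hence the slice [21:]


def strip_title_prefix(page_title):
    """Remove the site prefix if present, otherwise return the title as is."""
    if page_title.startswith(PREFIX):
        return page_title[len(PREFIX):]
    return page_title


full = ('BBC Food - Recipes - Roman-style saltimbocca with '
        'caramelised pear, fried sage and vignarola')
print(len(PREFIX))
print(strip_title_prefix(full))
```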

The other fields are retrieved using the *find_all* method.

In [26]:
for tag in soup.find_all(itemprop='prepTime'):
    print(tag.contents[0])

30 mins to 1 hour


So now we are ready to build our final function that extracts the contents of each recipe.

In [27]:
def extractRecipe(soup):
    recipe = {'title': soup.title.contents[0][21:]}       
    
    for tag in soup.find_all(itemprop='prepTime'):
        recipe['preptime'] = tag.contents[0]

    for tag in soup.find_all(itemprop='cookTime'):
        recipe['cookTime'] = tag.contents[0]
        
    for tag in soup.find_all(itemprop='recipeYield'):
        recipe['recipeYield'] = tag.contents[0]  
        
    return recipe

So now we can do the final step: go through all the recipes and scrape the data.

In [None]:
dataRecipes = {}
for recipe in recipes:
    url = 'https://www.bbc.co.uk/food/recipes/' + recipe
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    dataRecipes[recipe] = extractRecipe(soup)
    dataRecipes[recipe]['url'] = url

#### Exercise
* Implement simple error handling for the above iteration to make sure that scraping continues even in the case of connectivity errors.

In [29]:
dataRecipes['roman-style_saltimbocca_44940']

{'cookTime': '30 mins to 1 hour',
 'preptime': '30 mins to 1 hour',
 'recipeYield': 'Serves 4',
 'title': 'Roman-style saltimbocca with caramelised pear, fried sage and vignarola',
 'url': 'https://www.bbc.co.uk/food/recipes/roman-style_saltimbocca_44940'}
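
The scraped values are free text. If a numeric value is needed later, for example for the number of servings, a regular expression is one option. This is a sketch only: `parse_servings` is a hypothetical helper, and it assumes the yield text contains the number as digits, as in `'Serves 4'`; other formats used on the site have not been checked here.

```python
import re


def parse_servings(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


print(parse_servings('Serves 4'))
print(parse_servings('Serves many'))
```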

## Store scraped data on a Document Store

We now wish to store the contents of the dataset in a document database where each recipe becomes a separate document. We will use [MongoDB](https://www.mongodb.com/what-is-mongodb) and the Database-as-a-Service provider [mLab](https://mlab.com/).

A step-by-step introduction to the Data API can be found on the [Data Manipulation using Lazio Bar/Restaurant Dataset](lab-restaurants/ADM%20Lab%20-%20Restaurants.ipynb) laboratory page.

**Make sure you insert your own API key in the code below**

In [30]:
params = {'apiKey': '<Paste your api key HERE>'}
url = 'https://api.mlab.com/api/1/databases'
response = requests.get(url, params)

In [31]:
response.text

'[ "adm" , "adm2017" , "ds" , "seed" ]'

In [32]:
import json

dbname = 'adm2017'
collection = 'recipes'
url = 'https://api.mlab.com/api/1/databases/' + dbname + '/collections/' + collection
headers = {'content-type': 'application/json'}
data = json.dumps(dataRecipes['roman-style_saltimbocca_44940'])
response = requests.post(url, data=data, params=params, headers=headers)

After this simple test to make sure that everything works as it should, let's now upload all the extracted information along with the indexes to our document store. Here we can use the batch upload of documents discussed in the [Data Manipulation using Lazio Bar/Restaurant Dataset](lab-restaurants/ADM%20Lab%20-%20Restaurants.ipynb) laboratory.

We need to convert the dictionary into a list of documents. In the process we can also include in each recipe the list of ingredients used by using the reverse index that we created.

In [33]:
recipesList = []
for key,item in dataRecipes.items():
    item['ingredients'] = indexRecipes[key]
    recipesList.append(item)

Let's see what this looks like.

In [34]:
recipesList[1]

{'cookTime': '10 to 30 mins',
 'ingredients': ['almond',
  'chicken_breast',
  'cucumber',
  'olive_oil',
  'paprika',
  'parsley',
  'salt',
  'sunflower_oil',
  'whipping_cream'],
 'preptime': 'less than 30 mins',
 'recipeYield': 'Serves 4',
 'title': 'Chicken and cucumber en papillote with toasted almonds',
 'url': 'https://www.bbc.co.uk/food/recipes/chicken_and_cucumber_en_93986'}

Now we are ready to batch upload all the recipes to mLab.

In [35]:
data = json.dumps(recipesList)
response = requests.post(url, data=data, params=params, headers=headers)
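
For a long list, a single request may become very large. One possible approach, sketched below, is to split the list into smaller batches and post each batch separately. The helper name `chunks` and the batch size are assumptions for illustration; no particular size limit of the service is implied.

```python
import json


def chunks(items, size):
    """Yield consecutive slices of items with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up documents standing in for recipesList.
documents = [{'title': 'recipe %d' % n} for n in range(5)]

# Serialise each batch separately.
batches = [json.dumps(batch) for batch in chunks(documents, 2)]
print(len(batches))

# Each element of `batches` could then be sent with
# requests.post(url, data=batch, params=params, headers=headers)
```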

## Exercises
* Upload all the ingredients along with the index of recipes.
* Extend the recipes scraping by including:
 * Name of the chef
 * Name of show where the recipe was presented
 * Link to the image
* Extend the recipes scraping by also extracting the preparation instructions.
* Update recipe documents in mLab with the additional information retrieved.
* Create a new index for recipes related to:
 * Cuisines based on *http://www.bbc.co.uk/food/cuisines*
 * Seasons based on *http://www.bbc.co.uk/food/seasons*
 * Occasions based on *http://www.bbc.co.uk/food/occasions*