More often than not, data published on the web are not available as a structured dataset like those we used in the other labs. Retrieving the data requires going through the web pages, examining the HTML code, and extracting the information. This technique is known as [Web Scraping](https://en.wikipedia.org/wiki/Web_scraping). Clearly, it can be extremely tedious.

We will work with the Python module [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), which provides a nice set of tools for extracting information from web pages. The information that we will extract is located in the [BBC Food Recipes database](https://www.bbc.co.uk/food/recipes). We wish to retrieve the available recipes and store them in a document database, where each recipe becomes a separate document. We will use [MongoDB](https://www.mongodb.com/what-is-mongodb) and the Database-as-a-Service provider [mLab](https://mlab.com/).

The first important step, before diving into the code, is to use your browser to inspect the HTML code of the pages that we will scrape. All major browsers offer developer tools that present the HTML code in a human-readable format, for example the [Mozilla Firefox Developer Tools](https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Examine_and_edit_HTML). We will work with the following pages:
* [Ingredients Index from A to Z](http://www.bbc.co.uk/food/ingredients)
* [List of Ingredients](http://www.bbc.co.uk/food/ingredients/by/letter/a)
* [List of Recipes for a specific Ingredient](http://www.bbc.co.uk/food/acidulated_water)
* [Recipe](https://www.bbc.co.uk/food/recipes/roman-style_saltimbocca_44940)

For each recipe we wish to extract the following information and store it as a document:
* Name
* URL to BBC web site
* Preparation Time
* Cooking Time
* Servings
* List of Ingredients
* Related Recipes
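
As a rough sketch of the target, one recipe could be represented as a Python dictionary like the one built below. The field names (`name`, `url`, `prep_time` and so on) and the helper `make_recipe_document` are assumptions chosen for illustration, and the recipe name is a placeholder rather than scraped data.

```python
def make_recipe_document(name, url, prep_time=None, cooking_time=None,
                         servings=None, ingredients=None, related=None):
    """Build one recipe document. All field names are hypothetical."""
    return {
        "name": name,
        "url": url,
        "prep_time": prep_time,
        "cooking_time": cooking_time,
        "servings": servings,
        # Copy the sequences so each document owns its own lists.
        "ingredients": list(ingredients or []),
        "related_recipes": list(related or []),
    }


# Placeholder values only: nothing here was fetched from the BBC site.
doc = make_recipe_document(
    name="<recipe name>",
    url="https://www.bbc.co.uk/food/recipes/roman-style_saltimbocca_44940",
)
print(sorted(doc))
```

Fields that have not been scraped yet stay `None` (or an empty list), so a partially filled document still has the same shape as a complete one.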

## Scraping data

We start by retrieving the list of ingredients for a specific index letter, beginning with the [List of Ingredients for Letter A](http://www.bbc.co.uk/food/ingredients/by/letter/a).

If you inspect the HTML code of the page (via your browser) you will find an [HTML ordered list tag](https://www.w3schools.com/tags/tag_ol.asp) containing one [HTML list item tag](https://www.w3schools.com/tags/tag_li.asp) for each ingredient. The URL of each ingredient's page is included in an [HTML link tag](https://www.w3schools.com/tags/tag_a.asp). Here is a short extract:

```html
<ol class="resources foods grid-view">
                <li class="resource food" id="ackee">
                    <a href="/food/ackee">
                        <img src="http://ichef.bbci.co.uk/food/ic/food_16x9_111/foods/fruit_and_vegetables_16x9.jpg" alt="ackee" width="111" height="63">
                        Ackee                    </a>
                                    </li>
  ....                                    
                                    </ol>
```

The above information can easily be extracted from the page using the [find_all](http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python) method of BeautifulSoup.
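
To make this concrete, here is a minimal sketch that parses a tidied copy of the extract above, without touching the network. It assumes the built-in `html.parser` backend, and the variable names (`extract`, `ingredients`) are illustrative only.

```python
from bs4 import BeautifulSoup

# A tidied copy of the extract shown above, with a single ingredient.
extract = """
<ol class="resources foods grid-view">
  <li class="resource food" id="ackee">
    <a href="/food/ackee">
      <img src="http://ichef.bbci.co.uk/food/ic/food_16x9_111/foods/fruit_and_vegetables_16x9.jpg" alt="ackee" width="111" height="63">
      Ackee
    </a>
  </li>
</ol>
"""

extract_soup = BeautifulSoup(extract, "html.parser")

ingredients = []
# class_="food" matches every <li> whose class list contains "food".
for item in extract_soup.find_all("li", class_="food"):
    link = item.find("a", href=True)
    if link is not None:
        # get_text(strip=True) drops the whitespace around the name.
        ingredients.append((link.get_text(strip=True), link["href"]))

print(ingredients)  # [('Ackee', '/food/ackee')]
```

Searching for the `<li>` items first, rather than for every `<a>` tag, keeps navigation links out of the result.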

We first retrieve the contents of the page http://www.bbc.co.uk/food/ingredients/by/letter/a using [requests](http://docs.python-requests.org/en/master/#), a Python library that makes working with the HTTP protocol simple and straightforward.

In [1]:
import requests

url = 'http://www.bbc.co.uk/food/ingredients/by/letter/a'
response = requests.get(url)
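
A request can fail or hang, so it may be worth wrapping the call. The sketch below is one possible approach, not part of the original lab: `fetch_html` is a hypothetical helper, the 10 second timeout is an arbitrary choice, and the `get` parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP errors.

    `get` defaults to requests.get; passing a stand-in callable makes it
    possible to try the function without a network connection.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

With such a helper, the cell above would become `html = fetch_html(url)`, and a missing page would raise an error instead of silently returning an error page to the parser.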

The next step is to pass the contents of the HTTP response to BeautifulSoup, which parses them and converts them into a Python object.

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")
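
The `"lxml"` parser used above requires the separate `lxml` package. If it might not be installed, a fallback to Python's built-in `html.parser` could look like the sketch below, where `make_soup` is a hypothetical helper name. BeautifulSoup raises `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse `html` with lxml if available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python.
        return BeautifulSoup(html, "html.parser")


demo_soup = make_soup("<p>Hello <b>world</b></p>")
print(demo_soup.b.get_text())  # world
```

The two parsers can build slightly different trees for malformed HTML, so results on real pages may differ between them.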

We are now ready to search the parsed HTML page using the *find_all* method. We are looking for the ```<a>``` tags that contain the address of each individual ingredient. Searching for every ```<a>``` tag lists all the links on the page, which are many more than the ones we are looking for. So we narrow down the search by inspecting the *href* attribute of each ```<a>``` tag and keeping only the links that start with ```/food/```.

In [14]:
# href=True skips <a> tags without an href, where link.get('href') would be None
for link in soup.find_all('a', href=True):
    if link['href'].startswith('/food/'):
        print(link['href'])

/food/
/food/
/food/
/food/recipes/
/food/seasons
/food/occasions
/food/cuisines
/food/dishes
/food/chefs
/food/programmes
/food/ingredients
/food/techniques
/food/about
/food/my/favourites
/food/ingredients
/food/ingredients/by/letter/b
/food/ingredients/by/letter/c
/food/ingredients/by/letter/d
/food/ingredients/by/letter/e
/food/ingredients/by/letter/f
/food/ingredients/by/letter/g
/food/ingredients/by/letter/h
/food/ingredients/by/letter/i
/food/ingredients/by/letter/j
/food/ingredients/by/letter/k
/food/ingredients/by/letter/l
/food/ingredients/by/letter/m
/food/ingredients/by/letter/n
/food/ingredients/by/letter/o
/food/ingredients/by/letter/p
/food/ingredients/by/letter/q
/food/ingredients/by/letter/r
/food/ingredients/by/letter/s
/food/ingredients/by/letter/t
/food/ingredients/by/letter/u
/food/ingredients/by/letter/v
/food/ingredients/by/letter/w
/food/ingredients/by/letter/y
/food/ingredients/by/letter/z
/food/acidulated_water
/food/ackee
/food/acorn_squash
/food/aduki_beans
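
The output above contains repeated entries and site-relative paths. One way to tidy it up, sketched here with the standard library only, is to drop duplicates while keeping the order and then resolve each path against the page URL. The `hrefs` list is a small sample copied from the output, and the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = "http://www.bbc.co.uk/food/ingredients/by/letter/a"

# A small sample of the hrefs printed above, including repeats.
hrefs = [
    "/food/",
    "/food/",
    "/food/ingredients",
    "/food/ingredients",
    "/food/acidulated_water",
    "/food/ackee",
]

# dict.fromkeys keeps the first occurrence of each href, in order.
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves a site-relative path against the page it came from.
absolute_urls = [urljoin(base_url, href) for href in unique_hrefs]

for absolute_url in absolute_urls:
    print(absolute_url)
```

The absolute URLs can then be passed directly to `requests.get` when visiting each ingredient page.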
