# Web Scraping with Python
All websites are different and each needs a customised approach when it comes to scraping, but we can generally identify 3 main scenarios:
1. Website has a publicly available and well documented API. The dream. (rarely comes true)
2. Website has all the juicy data present in the source code. Go to your website of choice, right-click anywhere on the page and select View Page Source. If you can find your data of interest somewhere within that HTML code then you're in luck.
3. You can see data on the website but when you view the source code, it's not there. Huh?? That most likely means the data is being loaded dynamically via JavaScript. We'll have to try and reverse-engineer the website's non-public API.

In this tutorial we'll cover scenarios 2 & 3.

Requirements:
- Python >= 3.6
- Chrome browser (if you're using something else you'll have to figure out the names of equivalent tools)

## Scenario 2: extracting data from website's HTML code.
### First, let's import some libraries

In [1]:
import pandas as pd  # we'll use pandas to display our scraped data in a table and export to csv / excel etc.
import requests  # requests library allows us to request data from a url using the HTTP GET and POST methods
from bs4 import BeautifulSoup  # beautiful soup is an HTML parser
from urllib import parse  # we'll use this to make sure our search terms are url-friendly

I want to bake some muffins! In fact I love muffins so much, I want to save all recipes from the BBC Good Food website 🧁

I went to https://www.bbc.co.uk/food and searched for `muffins`, then copied the url from my browser.

In [2]:
url = "https://www.bbc.co.uk/food/search?q=muffins&page=1"
headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

What's this `headers` thing? Some websites don't like being accessed by scripts (the BBC is actually pretty lenient in that regard). If a website denies you access, you may be able to get through by pretending to be a browser and passing a `headers` parameter in your request. To find your own "user-agent" value, right-click anywhere on the page and select "Inspect". In the inspect pane on the right, go to the Network tab. Now reload the page so that some network activity shows up. Click on any item in the list - a panel should open on the right with the first tab being "Headers". Scroll down to "Request Headers" - your user-agent info should be there.

Now we want to send a GET request to that url and hopefully get some data back

In [3]:
# I'm passing my two variables as arguments into the requests.get() method
r = requests.get(url=url, headers=headers)
print(f"Status code: {r.status_code}")
print(f"Content data type: {type(r.content)}")

Status code: 200
Content data type: <class 'bytes'>


HTTP status codes indicate whether the request was successful or not. You can check what each code means on this [Wiki page](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
- 1xx informational response – the request was received, continuing process
- 2xx successful – the request was successfully received, understood, and accepted
- 3xx redirection – further action needs to be taken in order to complete the request
- 4xx client error – the request contains bad syntax or cannot be fulfilled
- 5xx server error – the server failed to fulfil an apparently valid request
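
In a script you rarely need the whole table, only which family a code falls into. Here's a minimal sketch using the standard library's `http.HTTPStatus`; the helper name `describe_status` is my own invention, not part of `requests`.

```python
from http import HTTPStatus

# The first digit of the status code tells us which family it belongs to
STATUS_FAMILIES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    family = STATUS_FAMILIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK" for 200
    except ValueError:
        phrase = "non-standard code"  # the code is not in the standard library's list
    return f"{code} {phrase} ({family})"


print(describe_status(200))
print(describe_status(404))
```

If you'd rather have the script stop on a failed request, `requests` responses also have a `raise_for_status()` method that raises an exception for 4xx and 5xx codes.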

So far so good, we got an OK response! But the content of our response is stored as bytes - we'll need to decode it into a string before we can continue.
Most websites use "utf-8", which is also what `.decode()` defaults to, but you might come across a website using a different encoding (try Russian websites in Cyrillic!). Learn more about [Character Encoding](https://en.wikipedia.org/wiki/Character_encoding).
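
To see why the encoding matters, here's a small sketch that doesn't touch the network. The sample string and the choice of `cp1251` (a legacy Cyrillic code page) are just for illustration.

```python
# Pretend this is r.content from a site that uses a legacy Cyrillic encoding
raw = "Рецепт маффинов".encode("cp1251")

try:
    raw.decode("utf-8")  # wrong guess: these bytes are not valid UTF-8
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False

text = raw.decode("cp1251")  # right guess: we get the original string back

print(f"utf-8 worked: {utf8_worked}")
print(f"cp1251 gives: {text}")
```

With `requests` you can check `r.encoding` to see which encoding the library inferred from the response headers, and `r.text` gives you the content already decoded with it.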

In [4]:
# let's decode our content
html = r.content.decode(encoding='utf-8')
# and now we can make a soup from it! BS makes sense of all the HTML tags in our content so that we can fish them out of the soup XD
soup = BeautifulSoup(html, "html.parser")

This is where right-click + Inspect (RCI) becomes your best friend. Right now our soup contains _a lot_ of stuff we don't need: navigation bars, ads, featured content, footer etc. We need to find the smallest element (HTML tag) that encompasses all of our recipes to narrow down our search area. RCI on the first recipe on the page and keep moving up the HTML tag tree until you find the first tag that holds all recipes. Note the tag name and any other attributes that make it uniquely identifiable.

The two methods we'll be using the most are `.find()` and `.findAll()` (or `.find_all()` - `findAll` is the older camelCase alias, both do the same thing). `.find()` returns the first element that matches the given arguments, or `None` if nothing matches; `.find_all()` returns a list of all matching elements.
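
Here's a tiny self-contained sketch of the difference between the two, run against a made-up HTML snippet rather than the real BBC page (the `card` and `featured` class names are invented for the example).

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a real page
html = """
<ul>
  <li class="card featured">Lemon muffins</li>
  <li class="card">Chai muffins</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() gives back the first match only
first = soup.find("li", attrs={"class": "card"})

# find_all() gives back every match; a single class name matches
# any tag that has it among its classes
all_cards = soup.find_all("li", attrs={"class": "card"})

# when nothing matches, find() returns None and find_all() an empty list
missing = soup.find("li", attrs={"class": "no-such-class"})

print(first.text)
print([card.text for card in all_cards])
print(missing)
```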

In [5]:
# we're looking for a div tag. to pick out the right one, we pass an attrs dict with its class name
recipe_book = soup.find("div", attrs={"class": "gel-layout gel-layout--equal promo-collection"})

If you print the `recipe_book` variable you should see just the bit of HTML with all our recipes - no more full page junk! Now we need to find all recipe containers but rather than searching in our soup that contains the full page, let's just look in the `recipe_book`. I did RCI on the individual recipe card to find my container tag and attributes.

In [6]:
# I used the findAll() method so my result is a list (technically a 'bs4.element.ResultSet' but you can treat it like a list)
recipes = recipe_book.findAll("div", attrs={"class": "gel-layout__item gel-1/2 gel-1/4@xl"})
type(recipes)

bs4.element.ResultSet

In [7]:
# Let's try to pull some info from an individual recipe
recipe = recipes[0]
title = recipe.find("h3").text  # I can use the .text attribute to pull out text from the tag
# Recipe author info has this annoying "by Joe Bloggs" format. I want just the name, so I'll chain a string .replace() method and get rid of the "by " bit
author = recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}).text.replace("by ", "")
url = recipe.find("a")["href"] # some attributes like href (the url reference inside <a> tag) can be accessed via square brackets notation, kind of like a dictionary
print(f"Title: {title}\nAuthor: {author}\nURL: {url}")

Title: Lemon muffins
Author: Sarah Cook
URL: /food/recipes/lemon_muffins_01522


In [8]:
# One thing to note - if you try to access the .text attribute on an element that doesn't exist or can't be found, you'll get an error
recipe.find("span", attrs={"class": "class_that_doesn't_exists"}).text

AttributeError: 'NoneType' object has no attribute 'text'

In [9]:
# to make your script error-proof, first check that the element was found (i.e. the result is not None) before you try to extract any attributes from it
if recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}) is not None:
    author = recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}).text
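
The check above runs the same `.find()` twice. One way to avoid that is a small helper - a sketch only, `text_or_none` is a name I made up. It also strips the "by " bit from the start of the string only, because `.replace("by ", "")` removes it everywhere (an author called "Abby Smith" would come out as "AbSmith"). To keep the example runnable without a page to parse, a `SimpleNamespace` stands in for the tag that `.find()` would return.

```python
from types import SimpleNamespace


def text_or_none(element, prefix=""):
    """Return the element's text without a leading prefix, or None if the element is missing."""
    if element is None:
        return None
    text = element.text.strip()
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]  # drop the prefix from the start only
    return text


# Stand-ins for what recipe.find(...) gives back: an object with .text, or None
found = SimpleNamespace(text="by Abby Smith")
missing = None

print(text_or_none(found, prefix="by "))
print(text_or_none(missing, prefix="by "))
```

In the real script you'd call it as `text_or_none(recipe.find("span", attrs={...}), prefix="by ")`.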

Now that we know what we're looking for, we can put all of these operations into a loop to extract data from all recipes in our list.

In [10]:
# we need a master list that will sit outside of our loop
recipes_master = []
for recipe in recipes:
    # we're going to do this for each recipe in our recipes list...
    title = recipe.find("h3").text
    if recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}) is not None:
        author = recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}).text.replace("by ", "")
    else:
        author = None  # since we're using this variable later, we need to give it some value even if it's None
    url = recipe.find("a")["href"]
    # let's organise our recipe data into a nice dictionary
    recipe_dict = {"title": title,
                  "author": author,
                  "url": url}
    # now we can append it into a master list
    recipes_master.append(recipe_dict)

# let's have a little peek into our master list
recipes_master[:5]

[{'title': 'Lemon muffins',
  'author': 'Sarah Cook',
  'url': '/food/recipes/lemon_muffins_01522'},
 {'title': 'Pumpkin muffins',
  'author': 'Lorraine Pascale',
  'url': '/food/recipes/pumpkin_and_rosemary_11109'},
 {'title': 'Chai muffins',
  'author': 'Nigella Lawson',
  'url': '/food/recipes/chai_muffins_06244'},
 {'title': 'Banana muffins',
  'author': 'Jill Dupleix',
  'url': '/food/recipes/bananamuffins_71268'},
 {'title': 'Healthy banana muffins',
  'author': 'Fiona Hunter',
  'url': '/food/recipes/banana_muffins_51549'}]

Nice! But that's just one page of search results - let's see if we can grab all recipes from all pages. First, we need to know how many pages there are.

There's a lot going on in the next cell. I look for all `<a>` tags with the class used by the pagination buttons and assign them to `pages_lst`.
Then I use a list comprehension to keep it on one line:
- I want .text for every element (page) of the list
- but only if .text is not an empty string
- because I need to convert those numbers stored as strings to integers
- so that I can get the max value from the list - the total number of pages

In [11]:
pages_lst = soup.findAll("a", attrs={"class": "pagination__link gel-pica-bold"})
total_pages = max([int(page.text) for page in pages_lst if page.text != ""])
print(f"Total pages: {total_pages}")

Total pages: 3
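
The one-liner works for this page, but `int()` would raise a `ValueError` on a link with text like "Next", and `max()` raises one on an empty list (for example if a search has no pagination links at all). A slightly more defensive sketch, working on plain strings so it runs on its own; `max_page` is a made-up helper name and falling back to 1 page is my assumption.

```python
def max_page(labels):
    """Return the highest page number found in a list of link texts, or 1 if there is none."""
    numbers = [int(label) for label in labels if label.strip().isdecimal()]
    return max(numbers) if numbers else 1


print(max_page(["", "1", "2", "3", "Next"]))
print(max_page([]))
```

With the soup from above you'd call it as `max_page([page.text for page in pages_lst])`.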


Now I can put it all in another loop to cycle through the pages. For each page we'll extract the HTML and run our smaller loop to get all recipes.

I've also pulled the search term out into a variable in case I want to search for something other than muffins. Remember that this needs to be url-friendly (i.e. spaces need to be replaced with `%20` etc.) so we'll use the urllib parse module.

In [12]:
search_term = "muffins"
search_url = parse.quote(search_term)
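
"muffins" has nothing to escape, so here's a quick sketch with a trickier search term (made up for the example) to show what the parse module actually does. `quote_plus` and `urlencode` are two related functions from the same module.

```python
from urllib import parse

term = "banana & walnut muffins"

# quote() escapes spaces as %20 and the ampersand as %26
quoted = parse.quote(term)

# quote_plus() does the same but uses + for spaces
quoted_plus = parse.quote_plus(term)

# urlencode() builds a whole query string from a dict
query = parse.urlencode({"q": term, "page": 2})

print(quoted)
print(quoted_plus)
print(query)
```

Without escaping, the `&` in the search term would be read as the start of a new url parameter.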

In [13]:
# the master list has to be moved outside the big loop, otherwise it would get overwritten with each new page
recipes_master = []

for i in range(0,total_pages):
    # range() starts counting at 0 but the site's pages start at 1, so we use i+1 in the url
    url = f"https://www.bbc.co.uk/food/search?q={search_url}&page={i+1}"
    r = requests.get(url, headers=headers)
    # pass the parser explicitly (as in cell 4) so bs4 doesn't have to guess one and emit a warning
    soup = BeautifulSoup(r.content.decode(), "html.parser")
    recipe_book = soup.find("div", attrs={"class": "gel-layout gel-layout--equal promo-collection"})
    recipes = recipe_book.findAll("div", attrs={"class": "gel-layout__item gel-1/2 gel-1/4@xl"})
    
    for recipe in recipes:
        title = recipe.find("h3").text
        if recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}) is not None:
            author = recipe.find("span", attrs={"class": "promo__subtitle gel-long-primer"}).text.replace("by ", "")
        else:
            author = None
        url = recipe.find("a")["href"]
        recipe_dict = {"title": title,
                      "author": author,
                      "url": url}
        recipes_master.append(recipe_dict)
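
An alternative to adding 1 to `i` in the loop above is to start the range at 1. A small sketch that builds the page urls up front; `page_urls` is a made-up helper name and the url pattern is the one used in this tutorial.

```python
from urllib import parse


def page_urls(search_term, total_pages):
    """Build the search url for every results page, counting pages from 1."""
    quoted = parse.quote(search_term)
    return [
        f"https://www.bbc.co.uk/food/search?q={quoted}&page={page}"
        for page in range(1, total_pages + 1)  # the end of a range is exclusive, hence + 1
    ]


for page_url in page_urls("muffins", 3):
    print(page_url)
```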

We now have all the recipes and their links in our list. Let's put them in a DataFrame.

In [14]:
df = pd.DataFrame(recipes_master)
df.head()

|   | title | author | url |
|---|---|---|---|
| 0 | Lemon muffins | Sarah Cook | /food/recipes/lemon_muffins_01522 |
| 1 | Pumpkin muffins | Lorraine Pascale | /food/recipes/pumpkin_and_rosemary_11109 |
| 2 | Chai muffins | Nigella Lawson | /food/recipes/chai_muffins_06244 |
| 3 | Banana muffins | Jill Dupleix | /food/recipes/bananamuffins_71268 |
| 4 | Healthy banana muffins | Fiona Hunter | /food/recipes/banana_muffins_51549 |


### Next Steps
What about getting recipe details: ingredients and instructions? We have all the urls, so we can loop through the individual recipes. The urls we scraped are relative (they start with `/food/...`), so we prepend the site's base url first.

In [15]:
base_url = "https://www.bbc.co.uk"
recipe_urls = df["url"].tolist()
for recipe_url in recipe_urls:
    url = base_url + recipe_url
    # here goes code to scrape the individual recipe page
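
String concatenation works here because every scraped url starts with a slash. If you're not sure whether a site gives you relative or absolute links, `urljoin` from the standard library handles both. A quick sketch (the second path is an absolute version of one of our recipe urls, added for illustration):

```python
from urllib.parse import urljoin

base_url = "https://www.bbc.co.uk"
scraped = [
    "/food/recipes/lemon_muffins_01522",                      # relative, as scraped above
    "https://www.bbc.co.uk/food/recipes/chai_muffins_06244",  # already absolute
]

# urljoin() prepends the base to relative paths and leaves absolute urls alone
full_urls = [urljoin(base_url, path) for path in scraped]

for full_url in full_urls:
    print(full_url)
```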

## Scenario 3: reverse-engineering a non-public API.
On some websites the data you see in your browser does not appear in the source code. This is usually because it's being loaded dynamically via a JavaScript call to the website's (non-public, as in hidden and undocumented) API. We can figure out how to plug into that kind of API by analysing the network calls the website makes.
Let's look at our favourite Verra registry and try to get the CCB data at https://registry.verra.org/app/search/CCB.

### 1. Inspect network activity
RCI anywhere on the page and go to the Network tab. Now click on the Search button to load the results. You should see some activity in the network panel.

Find `search?maxResults=2000&$count=true&$skip=0&$top=50` and click on it: in the Headers tab / General section you'll see some basic information about the call. We can see the request URL and that the request method was POST. This is different from the GET method we used previously; we're not just asking for a response, we're sending some data _to_ the API to receive a response. So what are we sending?

Scroll down in the Headers pane to the Request Payload section. This is what we're sending! Click on view source and copy the contents. Now that we have our request URL and the payload we can try and see if the API will give us a correct response.
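
To make "sending a payload" a bit more concrete: when you pass a dict to `requests.post()` via its `json` argument, the library turns the dict into a JSON string and sets the `Content-Type` header to `application/json`. The sketch below builds the same kind of request by hand with the standard library, without actually sending anything. The url is a placeholder and the payload is cut down to one key for illustration.

```python
import json
from urllib import request

api_url = "https://example.com/api/search"  # placeholder, not a real endpoint
payload = {"program": "CCB"}                # trimmed-down payload for illustration

# The request body is just the dict serialised to JSON and encoded as bytes
body = json.dumps(payload).encode("utf-8")

# Build (but don't send) a POST request carrying that body
req = request.Request(
    api_url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method())
print(req.data)
```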

In [47]:
url = "https://registry.verra.org/uiapi/resource/resource/search?maxResults=2000&$count=true&$skip=0"
payload = {"program":"CCB",
           "resourceStatuses":
           ["CCB_EX_PROJECT_WITHDRAWN","CCB_EX_UNDER_VALIDATION","CCB_EX_UNDER_VAL_AND_VER","CCB_EX_UNDER_VRF",
            "CCB_EX_VALIDATION_APPROVED","CCB_EX_VALIDATION_EXPIRED","CCB_EX_VRF_APPV_REQUESTED","CCB_EX_VERIFICATION_APPROVED",
            "CCB_EX_VERIFICATION_EXPIRED","CCB_EX_VRF_PBL_CMT_PERIOD_REQ"]}
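
The query string in the browser's call also carries `$skip` and `$top`. These look like OData-style paging parameters (an offset and a page size), but that's my assumption based on the names - the API is undocumented. Under that assumption, here's a sketch of how you could work out the offsets for fetching results page by page; `page_offsets` and `build_search_url` are made-up helper names.

```python
from urllib import parse

BASE = "https://registry.verra.org/uiapi/resource/resource/search"


def page_offsets(total_count, page_size):
    """Return the $skip value for each page (assumes $skip is an offset)."""
    return list(range(0, total_count, page_size))


def build_search_url(skip, top):
    """Build the search url for one page (assumes $top is the page size)."""
    params = {"maxResults": 2000, "$count": "true", "$skip": skip, "$top": top}
    # safe="$" keeps the dollar signs readable instead of escaping them as %24
    return BASE + "?" + parse.urlencode(params, safe="$")


print(page_offsets(201, 50))
print(build_search_url(50, 50))
```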

In [49]:
# unlike the BBC GET request, we're using a POST request here and passing our payload via the json argument
r = requests.post(url, json=payload)
r.status_code

200

In [50]:
data = r.json()
data

{'totalCount': 201,
 'countExceeded': False,
 '@count': 201,
 'value': [{'program': 'Climate, Community, and Biodiversity Standards',
   'resourceIdentifier': '142',
   'resourceName': 'Reforestation of degraded grasslands in Uchindile & Mapanda, Tanzania',
   'proponent': 'GREEN RESOURCES',
   'operator': None,
   'designee': None,
   'protocolCategories': 'Afforestation, Reforestation and Revegetation',
   'resourceStatus': 'Verification approved',
   'country': 'Tanzania',
   'estAnnualEmissionReductions': 25000,
   'region': 'Africa',
   'version': 'CCB Second Edition',
   'protocols': None},
  {'program': 'Climate, Community, and Biodiversity Standards',
   'resourceIdentifier': '514',
   'resourceName': 'Promoting Sustainable Development through Natural Rubber Tree Plantations in Guatemala',
   'proponent': 'PICA DE HULE NATURAL, S.A.',
   'operator': None,
   'designee': None,
   'protocolCategories': 'Afforestation, Reforestation and Revegetation',
   'resourceStatus': 'Verific

In [51]:
# Our juicy projects data is a list within the 'value' key. Let's have a look at the first 3 projects
data["value"][:3]

[{'program': 'Climate, Community, and Biodiversity Standards',
  'resourceIdentifier': '142',
  'resourceName': 'Reforestation of degraded grasslands in Uchindile & Mapanda, Tanzania',
  'proponent': 'GREEN RESOURCES',
  'operator': None,
  'designee': None,
  'protocolCategories': 'Afforestation, Reforestation and Revegetation',
  'resourceStatus': 'Verification approved',
  'country': 'Tanzania',
  'estAnnualEmissionReductions': 25000,
  'region': 'Africa',
  'version': 'CCB Second Edition',
  'protocols': None},
 {'program': 'Climate, Community, and Biodiversity Standards',
  'resourceIdentifier': '514',
  'resourceName': 'Promoting Sustainable Development through Natural Rubber Tree Plantations in Guatemala',
  'proponent': 'PICA DE HULE NATURAL, S.A.',
  'operator': None,
  'designee': None,
  'protocolCategories': 'Afforestation, Reforestation and Revegetation',
  'resourceStatus': 'Verification approved',
  'country': 'Guatemala',
  'estAnnualEmissionReductions': 46434,
  'regio

In [52]:
# Now we can make a dataframe from it
df = pd.DataFrame(data["value"])
df.head()

|   | program | resourceIdentifier | resourceName | proponent | operator | designee | protocolCategories | resourceStatus | country | estAnnualEmissionReductions | region | version | protocols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Climate, Community, and Biodiversity Standards | 142 | Reforestation of degraded grasslands in Uchind... | GREEN RESOURCES |  |  | Afforestation, Reforestation and Revegetation | Verification approved | Tanzania | 25000.0 | Africa | CCB Second Edition |  |
| 1 | Climate, Community, and Biodiversity Standards | 514 | Promoting Sustainable Development through Natu... | PICA DE HULE NATURAL, S.A. |  |  | Afforestation, Reforestation and Revegetation | Verification approved | Guatemala | 46434.0 | Latin America | CCB Second Edition | Biodiversity Gold |
| 2 | Climate, Community, and Biodiversity Standards | 562 | The Kasigau Corridor REDD Project – Phase I Ru... | Wildlife Works Carbon LLC |  |  | Reduced Emissions from Deforestation and Degra... | Verification approved | Kenya | 251432.0 | Africa | CCB Second Edition | Climate Gold, Biodiversity Gold |
| 3 | Climate, Community, and Biodiversity Standards | 576 | Restoration of degraded areas and reforestatio... | Asorpar Ltd. |  |  | Afforestation, Reforestation and Revegetation | Verification approved | Colombia | 80000.0 | Latin America | CCB Second Edition |  |
| 4 | Climate, Community, and Biodiversity Standards | 594 | TIST Program in Kenya, VCS 001 | Clean Air Action Corporation |  |  | Afforestation, Reforestation and Revegetation | Verification approved | Kenya | 14701.0 | Africa | CCB Second Edition | Community Gold |
