![sslogo](https://github.com/stratascratch/stratascratch.github.io/raw/master/assets/sslogo.jpg)

# Web scraping in Python

Scraping means extracting useful data from web pages, which are written in a markup language called HTML. To scrape data from the HTML tree, we first have to download the web page to our machine.

We will use the following packages to achieve the tasks in this lesson:
- [`requests`](http://docs.python-requests.org/en/master/)
- [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)

### Install the packages using pip

In [None]:
!pip install requests

In [None]:
!pip install beautifulsoup4

### Import the modules

In [None]:
import numpy as np
import pandas as pd
import requests
import bs4
import lxml.etree as xml

## Basic concepts

### Fetch webpage contents using requests

To download a web page we use the `get` function from `requests`. It accepts many optional arguments, but the only required one is the URL of the page you want to retrieve.

In [None]:
URL = "https://github.com/requests/requests"

requests.get(URL)

<Response [200]>

The result of this call is a `Response` object. The number 200 is its [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes); 200 means OK, that is, the request succeeded.
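As a small offline sketch of how a status code can be checked in code, the block below builds a `Response` object by hand. This is only a stand-in for what `requests.get` returns, used to avoid a network call, and the URL is hypothetical. It shows the `ok` property and the `raise_for_status` method.

```python
import requests

# Hand-built stand-in for a real response; normally requests.get returns this.
response = requests.Response()
response.status_code = 404
response.url = "https://example.com/missing"  # hypothetical URL

# `ok` is False for 4xx and 5xx status codes.
print(response.ok)

# raise_for_status() turns an error status code into an exception.
try:
    response.raise_for_status()
    failed = False
except requests.HTTPError as error:
    failed = True
    print(error)
```

Calling `raise_for_status()` right after `get` is one way to stop early instead of parsing an error page by accident.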

In [None]:
requests.get(URL, {}).text



To get the HTML as a string we use the `text` property of the Response object.

Before we go further, you should know that requests for a web page often fail. There are many possible errors and even more causes, but the most common cases are:
- You used a wrong URL.
- The website is down. To confirm this, try opening it in a browser.
- The website blocks bots and scraping agents. If this happens, look at the `headers` parameter of the `get` method and try sending a browser-like `User-Agent` header. A plausible `User-Agent` usually helps; if it doesn't, there is no universal fix and you will have to investigate further.
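A minimal sketch of the points above, assuming a helper named `fetch_html` (a hypothetical name, not part of `requests`) and an arbitrary browser-like `User-Agent` string. The request at the bottom is only prepared, not sent, so the headers can be inspected without network access.

```python
import requests

# Browser-like header; the exact string is an assumption, any realistic one may do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP error codes."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


# Build (but do not send) a request to see which headers would go out.
prepared = requests.Request(
    "GET", "https://example.com", headers=BROWSER_HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

With network access, `fetch_html(URL)` would return the same string as `requests.get(URL).text`, but it fails loudly on an error status and does not hang forever thanks to `timeout`.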

We can convert that text into a `BeautifulSoup` object.

#### Example 1

Create a `BeautifulSoup` object.

In [None]:
web_page = bs4.BeautifulSoup(requests.get(URL, {}).text, "lxml")

In [None]:
web_page

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://assets-cdn.github.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-01356238c65ce56a395237b592b58668.css" integrity="sha512-qQ+v+W1uJYfDMrQ/cwCVI+AGTsn1yi4rCU6KX45obe52BoF+WiHNeQ11u63iJA05vyivY57xNbhAsyK4/j1ZIQ==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-f01d758edeec501660dbed3e681f6493.css" integrity="sha512-V9a64JRnkUg/Cpl1MyEG/fDlLG4NnmKpmq

Web pages are trees of elements nested one inside the other.

For example:
- html
  - body
      - div
      - div
      - div
      
We say that body is a child of html and html is a parent of body, and that the 3 div are children of body. The 3 div are siblings. This terminology matters because the method names in bs4 follow it. 
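The parent, child and sibling terminology can be tried on a tiny hand-written page. This is a sketch that uses Python's built-in `html.parser` instead of `lxml`, so it runs without extra packages; the HTML snippet is made up for illustration.

```python
import bs4

html = "<html><body><div>one</div><div>two</div><div>three</div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

body = soup.body

# html is the parent of body
parent_name = body.parent.name

# the three div elements are the children of body
divs = [child for child in body.children if isinstance(child, bs4.Tag)]

# the second div is the next sibling of the first div
second_text = divs[0].find_next_sibling("div").text

print(parent_name, len(divs), second_text)
```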

Before you go scraping, open the website in your browser's Inspector view to see the nesting hierarchy of the page's elements.

Generally all web pages have two main sections called `head` and `body`:
- `head` is where a lot of metadata lives
- `body` is what you see on the screen; it contains all the links, tables and images.

#### Example 2

Let's find the title of the web page we pulled using the `head` and `title` elements.

In [None]:
web_page.head.title

<title>GitHub - requests/requests: Python HTTP Requests for Humans™ ✨🍰✨</title>

We can navigate the tree by going element by element. You need to know the element names (html, head, div, span, p, a and so on) but don't worry if you don't. Look at the webpage in the inspector view in your browser and you can see the full path to the element of interest.

To get the text we need to use the `text` property of elements.

In [None]:
web_page.head.title.text

'GitHub - requests/requests: Python HTTP Requests for Humans™ ✨🍰✨'
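One detail worth knowing about this dotted navigation: each step returns only the first matching element, and `None` when there is no match. A small offline sketch on a made-up page, using the built-in `html.parser`:

```python
import bs4

html = """
<html>
  <head><title>Demo page</title></head>
  <body><p>first</p><p>second</p></body>
</html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

title_text = soup.head.title.text   # text of the <title> element
first_paragraph = soup.body.p.text  # only the first <p> is returned
missing = soup.body.table           # there is no <table>, so this is None

print(title_text, first_paragraph, missing)
```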

#### Example 3

Let's go into the body of the github page we accessed.

In [None]:
web_page.body

<body class="logged-out env-production">
<div class="position-relative js-header-wrapper ">
<a class="px-2 py-4 bg-blue text-white show-on-focus js-skip-to-content" href="#start-of-content" tabindex="1">Skip to content</a>
<div class="pjax-loader-bar" id="js-pjax-loader-bar"><div class="progress"></div></div>
<header class="Header header-logged-out position-relative f4 py-3" role="banner">
<div class="container-lg d-flex px-3">
<div class="d-flex flex-justify-between flex-items-center">
<a aria-label="Homepage" class="header-logo-invertocat my-0" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark; experiment:site_header_dropdowns; group:control" href="https://github.com/">
<svg aria-hidden="true" class="octicon octicon-mark-github" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.0

It is full of elements like `<a>` or `<ul>` or `<li>` or `<div>` or `<span>` etc.

The majority of that is noise to us because we want to find the numbers which describe this repository.

#### Example 4

Get the numbers we want.

When you open the Inspector view you will see that the element we care about, `<ul class="numbers-summary">`, sits below a chain of 9 nested parent elements.

We can use the approach we used above to reach it but there is a faster way using the `find_all` method:
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

But before we jump into it you must learn about element attributes like `class="numbers-summary"`.

Element attributes behave like a Python dictionary, and in bs4 they actually are one: it is available as the `attrs` property of an element.

The attributes are standardized. There are many of them but the 3 most important ones for scraping are:
- `id`, which is a unique identifier of an element within the web page
- `class`, which is a way to style multiple elements in the same way
- `href`, which is used mostly on `<a>` elements and holds the URL the element links to.
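To make the dictionary analogy concrete, here is a short sketch on a single made-up `<a>` element, parsed with the built-in `html.parser`. Note that bs4 stores `class` as a list, because an element can have several classes.

```python
import bs4

html = '<a id="home" class="btn btn-primary" href="/index.html">Home</a>'
link = bs4.BeautifulSoup(html, "html.parser").a

# All attributes as a dictionary
print(link.attrs)

# Dictionary-style access; a missing key raises KeyError
href = link["href"]

# .get() returns None for a missing attribute instead of raising
title = link.get("title")

# class is multi-valued, so it comes back as a list
classes = link["class"]
```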

The result of `find_all` is a list of elements from the original page that satisfy some criteria; each one carries its whole subtree, so it can be treated as a small web page of its own. There are many such criteria and you can see them in the documentation, but for now we use the `attrs` filter.

In our example this filter returns every `<ul>` element that has a `class` attribute containing the value `numbers-summary`.

There is only one such element, so we take the first item of the list.

In [None]:
sub_web_page = web_page.find_all(name="ul", attrs={"class": "numbers-summary"})[0]

To get the numbers we care about we need to filter our sub web page further. Again we use `find_all`, the workhorse method you will reach for most of the time.

In [None]:
sub_web_page.find_all("span")

[<span class="num text-emphasized">
                 5,491
               </span>, <span class="num text-emphasized">
               12
             </span>, <span class="num text-emphasized">
               133
             </span>, <span class="num text-emphasized">
       512
     </span>]

We get a list of single-element sub web pages which we need to transform into numbers.

We will use a list comprehension and do some string cleaning.

You will almost always need to clean strings when web scraping, so it pays to learn regular expressions.

In [None]:
[int(wp.text.strip("\n ").replace(",", "")) 
    for wp in sub_web_page.find_all("span")]

[5491, 12, 133, 512]
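Because the live GitHub page may change over time, the same steps can be rehearsed offline. The sketch below runs `find_all` with an `attrs` filter and then the string cleaning on a hand-written snippet; the markup is a simplified assumption modelled on the output above, and the parser is the built-in `html.parser`.

```python
import bs4

html = """
<ul class="numbers-summary">
  <li><span class="num text-emphasized">
        5,491
      </span> commits</li>
  <li><span class="num text-emphasized">
        12
      </span> branches</li>
</ul>
<ul class="other"><li><span>99</span></li></ul>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Only the <ul> whose class contains "numbers-summary" is kept.
summary = soup.find_all(name="ul", attrs={"class": "numbers-summary"})[0]

# Strip whitespace, drop the thousands separator, convert to int.
numbers = [int(span.text.strip().replace(",", ""))
           for span in summary.find_all("span")]
print(numbers)
```

The `99` in the second list is ignored because the search for `span` elements starts from `summary`, not from the whole page.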

#### Example 5

Get all the tags from the github page. The tags are `python`, `http`, `forhumans` etc.

In [None]:
URL = "https://github.com/requests/requests"

web_page_text = requests.get(URL).text

web_page = bs4.BeautifulSoup(web_page_text, "lxml")

web_page

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://assets-cdn.github.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-01356238c65ce56a395237b592b58668.css" integrity="sha512-qQ+v+W1uJYfDMrQ/cwCVI+AGTsn1yi4rCU6KX45obe52BoF+WiHNeQ11u63iJA05vyivY57xNbhAsyK4/j1ZIQ==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-f01d758edeec501660dbed3e681f6493.css" integrity="sha512-V9a64JRnkUg/Cpl1MyEG/fDlLG4NnmKpmq

We are looking for the element `<div class="list-topics-container f6 mt-1">`.

Notice that class can have many entries, for example 3 as seen in this `div`.

When filtering by class we can use any single class, we don't have to list them all.

When we want to target a single element it is better to use `find`, which has the same parameters as `find_all` but returns only the first matching sub web page (or `None` if nothing matches).
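A quick offline check of both claims, matching on any one of several classes and the difference between `find` and `find_all`, on a made-up snippet parsed with the built-in `html.parser`:

```python
import bs4

html = '<div class="list-topics-container f6 mt-1"><a>python</a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# Any single class is enough to match the element.
match = soup.find(name="div", attrs={"class": "f6"})
matched_classes = match["class"]

# find returns None when nothing matches; find_all returns an empty list.
no_match = soup.find(name="div", attrs={"class": "missing"})
no_matches = soup.find_all(name="div", attrs={"class": "missing"})

print(matched_classes, no_match, no_matches)
```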

Here we also use the `children` property for the first time.

In our list comprehension we filter by type and drop all `NavigableString`s, which are plain strings rather than elements.

To understand the difference between a `NavigableString` and a `Tag` (`Tag` is bs4's name for an element), look at this example.

```
    <div>
        I am some text and of type Navigable String.
        
        <a> I am a child I am of type Tag when I am an element but I am Navigable String when I am text </a>
   </div>
```
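The same distinction can be seen by printing the type of each child. This is a sketch on a made-up snippet with the built-in `html.parser`; it uses `isinstance(child, bs4.Tag)` to keep elements, which is one alternative to comparing against `NavigableString`.

```python
import bs4

html = "<div>Some text <a>a link</a> more text</div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# The div has three children: text, an element, and text again.
kinds = [type(child).__name__ for child in soup.div.children]
print(kinds)

# Keep only real elements.
tags_only = [child for child in soup.div.children
             if isinstance(child, bs4.Tag)]
tag_names = [tag.name for tag in tags_only]
print(tag_names)
```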

We keep only elements and access their text property and clean.

In [None]:
tags_elements = web_page.find(name="div", attrs={"class": "list-topics-container"})

tags_text = [elem.text.strip("\n ") 
             for elem in tags_elements.children 
             if type(elem) != bs4.NavigableString]

tags_text

['python',
 'http',
 'forhumans',
 'requests',
 'kennethreitz',
 'python-requests',
 'client']

#### Example 6

Achieve the same with less code.

We filter on `href` of all links in the webpage which satisfy our regular expression.

The `compile` method from `re` turns a regexp pattern into an object which can be used for matching.

If we passed only a string here it would look for exact matches to `href` not regexp matches.
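The difference between a plain string and a compiled pattern can be checked on a few made-up links (parsed with the built-in `html.parser`):

```python
import re

import bs4

html = (
    '<a href="/topics/python">python</a>'
    '<a href="/topics/http">http</a>'
    '<a href="/topics/">all topics</a>'
    '<a href="/about">about</a>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# A plain string must equal the attribute value exactly.
exact = [a.text for a in soup.find_all("a", attrs={"href": "/topics/python"})]

# A compiled pattern matches every href it can be found in.
by_regex = [a.text for a in
            soup.find_all("a", attrs={"href": re.compile("/topics/.+")})]

print(exact)
print(by_regex)
```

The `/topics/` link is not matched by the pattern because `.+` requires at least one character after the slash.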

The thing to learn here, and from the whole lesson, is that there is no single rule for how to do web scraping. Use whatever works and takes you the least time to think of and type.

In [None]:
import re

[e.text.strip("\n ") for e in 
 web_page.find_all(name="a", 
                   attrs={"href": re.compile("/topics/.+")})]

['python',
 'http',
 'forhumans',
 'requests',
 'kennethreitz',
 'python-requests',
 'client']

#### Example 7

We will now scrape a table element.

Tables in HTML have the form:
- table
    - tbody
        - tr (table row)
            - td (table cell)
            - td
            - td
            
The element from which we start is:

`<table class="files js-navigation-container js-active-navigation-container" data-pjax="">`

As always use the inspector to see how the webpage is made and how we deconstruct it.
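Before running this against the live page, the row-and-cell walk can be rehearsed on a small hand-written table. This is a sketch only: the markup and the `/demo/...` paths are hypothetical, and the built-in `html.parser` is used instead of `lxml`.

```python
import bs4

html = """
<table class="files">
  <tbody>
    <tr><td class="content"><a href="/demo/tree/master/docs">docs</a></td></tr>
    <tr><td class="content"><a href="/demo/blob/master/README.md">
        README.md
    </a></td></tr>
    <tr><td class="content"></td></tr>
  </tbody>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find(name="table", attrs={"class": "files"})

files = []
for row in table.find_all("tr"):  # find_all returns only elements, no strings
    link = row.find(name="a")
    if link is None:              # a row without a link is skipped
        continue
    files.append(("https://github.com" + link["href"], link.text.strip()))

print(files)
```

Using `find_all("tr")` instead of iterating over `children` avoids having to skip `NavigableString` items by hand.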

In [None]:
files_table = web_page.find(name="table", attrs = {"class": "files"}).tbody.children

files = []

for file_row in files_table:
    # ignore all navigable strings
    if type(file_row) == bs4.NavigableString:
        continue
        
    content = file_row.find(name="td", attrs={"class": "content"})\
                      .find(name="a")
        
    # If we didn't find the link ignore this element
    if content is None:
        continue
    
    # Get the href attribute
    href      = content.attrs["href"]    
    file_name = content.text.strip("\n ")
    
    files.append(("https://github.com" + href, file_name))
    
files

[('https://github.com/requests/requests/tree/master/.github', '.github'),
 ('https://github.com/requests/requests/tree/master/_appveyor', '_appveyor'),
 ('https://github.com/requests/requests/tree/master/docs', 'docs'),
 ('https://github.com/requests/requests/tree/master/ext', 'ext'),
 ('https://github.com/requests/requests/tree/master/requests', 'requests'),
 ('https://github.com/requests/requests/tree/master/tests', 'tests'),
 ('https://github.com/requests/requests/blob/master/.coveragerc',
  '.coveragerc'),
 ('https://github.com/requests/requests/blob/master/.gitignore', '.gitignore'),
 ('https://github.com/requests/requests/blob/master/.travis.yml',
  '.travis.yml'),
 ('https://github.com/requests/requests/blob/master/AUTHORS.rst',
  'AUTHORS.rst'),
 ('https://github.com/requests/requests/blob/master/CODE_OF_CONDUCT.md',
  'CODE_OF_CONDUCT.md'),
 ('https://github.com/requests/requests/blob/master/CONTRIBUTING.md',
  'CONTRIBUTING.md'),
 ('https://github.com/requests/requests/blob/m

#### Example 8

Scrape multiple web pages.

We will find all files which are present in the `requests` GitHub repository at root level and one level below it.

To do so we list the entries at root level and then follow the link of each entry.

In [None]:
def find_files(url):
    web_page = bs4.BeautifulSoup(requests.get(url).text, "lxml")
    
    files_table = web_page.find(name="table", attrs = {"class": "files"}).tbody.children

    files = []

    for file_row in files_table:
        # ignore all navigable strings
        if type(file_row) == bs4.NavigableString:
            continue

        content = file_row.find(name="td", attrs={"class": "content"})\
                          .find(name="a")

        # If we didn't find the link ignore this element
        if content is None:
            continue

        # Get the href attribute
        href      = content.attrs["href"]    
        file_name = content.text.strip("\n ")

        files.append(("https://github.com" + href, file_name))

    return files
    
for path, name in find_files(URL):
    print(name)
    
    # We wrap this in a try/except block because we don't know what we might find;
    # we build our scraper iteratively from error messages and the Inspector view.
    # For example, we never check whether a link points to a file or a directory
    # and always assume a directory, which raises an error for files.
    try:
        other_files = find_files(path)

        for path2, name2 in other_files:
            print(name2)
    except Exception as e:
        print("ERROR:" + str(e))
        print(path)

.github
ISSUE_TEMPLATE
ISSUE_TEMPLATE.md
_appveyor
install.ps1
docs
_static
_templates
_themes
community
dev
user
Makefile
api.rst
conf.py
index.rst
make.bat
ext
requests-logo.ai
requests-logo.svg
requests
__init__.py
__version__.py
_internal_utils.py
adapters.py
api.py
auth.py
certs.py
compat.py
cookies.py
exceptions.py
help.py
hooks.py
models.py
packages.py
sessions.py
status_codes.py
structures.py
utils.py
tests
testserver
__init__.py
compat.py
conftest.py
test_help.py
test_hooks.py
test_lowlevel.py
test_packages.py
test_requests.py
test_structures.py
test_testserver.py
test_utils.py
utils.py
.coveragerc
ERROR:'NoneType' object has no attribute 'tbody'
https://github.com/requests/requests/blob/master/.coveragerc
.gitignore
ERROR:'NoneType' object has no attribute 'tbody'
https://github.com/requests/requests/blob/master/.gitignore
.travis.yml
ERROR:'NoneType' object has no attribute 'tbody'
https://github.com/requests/requests/blob/master/.travis.yml
AUTHORS.rst
ERROR:'None
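The `'NoneType' object has no attribute 'tbody'` errors come from file pages, which have no files table. One possible way to handle them, sketched here offline, is to return an empty list when the table is missing. The `PAGES` dictionary is a hypothetical stand-in for the network (URL to HTML), and the markup is simplified; with real pages the `fetch` argument would download the HTML instead.

```python
import bs4

# Hypothetical stand-in for the network: URL -> HTML.
PAGES = {
    "/repo": '<table class="files"><tbody>'
             '<tr><td class="content"><a href="/repo/docs">docs</a></td></tr>'
             '<tr><td class="content"><a href="/repo/README">README</a></td></tr>'
             '</tbody></table>',
    "/repo/docs": '<table class="files"><tbody>'
                  '<tr><td class="content"><a href="/repo/docs/index.rst">'
                  'index.rst</a></td></tr></tbody></table>',
    "/repo/README": "<p>A file page without a files table.</p>",
}


def find_files(url, fetch=PAGES.get):
    """Return (href, name) pairs from the files table, or [] if there is none."""
    soup = bs4.BeautifulSoup(fetch(url), "html.parser")
    table = soup.find(name="table", attrs={"class": "files"})
    if table is None:  # a file page, not a directory listing
        return []
    files = []
    for cell in table.find_all(name="td", attrs={"class": "content"}):
        link = cell.find(name="a")
        if link is not None:
            files.append((link["href"], link.text.strip()))
    return files


listing = []
for path, name in find_files("/repo"):
    listing.append(name)
    for _, child_name in find_files(path):
        listing.append("  " + child_name)

print("\n".join(listing))
```

Whether to return an empty list or let the exception surface is a design choice: the silent version gives cleaner output, while the `try`/`except` version makes unexpected pages more visible while you are still building the scraper.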

## Full Example

This is an example scraper that collects GDP figures (per the IMF) from Wikipedia.

The web page is located at:
- https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

In the end we will have a pandas data frame of GDPs for each country in 2017.

In [None]:
WP_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Send a browser-like User-Agent header so the request looks like it comes from a regular browser
web_page = bs4.BeautifulSoup(requests.get(WP_URL, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
}).text, "lxml")

imf_table = web_page.find_all(name="table", attrs={"class": "wikitable"})[0]

# Get the column names of our dataframe. 
# `children` is an iterator and to index it we must first convert it to a list.
columns = list(imf_table.tbody.children)[0]
columns = [elem.text.strip("\n ") 
           for elem in columns 
           if type(elem) != bs4.NavigableString]

rows = []

for i, row in enumerate(imf_table.tbody.find_all("tr")):
    # Skip the header
    if i <= 1 or type(row) == bs4.NavigableString:
        continue

    tds = row.find_all("td")
    
    rank         = tds[0].text
    country_name = tds[1].text
    gdp          = tds[2].text
    
    rows.append((rank, country_name, gdp))
    
data_frame = pd.DataFrame(rows, columns=columns)

data_frame.head()

| | Rank | Country | GDP(US$MM) |
|---|---|---|---|
| 0 | 1 | United States | 19,390,600\n |
| 1 | — | European Union[n 1][19] | 17,308,862\n |
| 2 | 2 | China[n 2] | 12,014,610\n |
| 3 | 3 | Japan | 4,872,135\n |
| 4 | 4 | Germany | 3,684,816\n |


The values in our data frame are not clean yet, and every column has type `object`.

Our next step is to clean our data.

In [None]:
import re

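def clean_country(c):
    try:
        # Keep the text before the last "[...]" footnote marker
        return re.findall(pattern=r"(.+)\[.+", string=c)[0].strip("\xa0")
    except IndexError:
        # No footnote marker found, return the name unchanged
        return c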
    
# One call removes one set of brackets [] so two calls to fix EU if EU is fixable...
data_frame.Country = data_frame.Country.apply(clean_country)
data_frame.Country = data_frame.Country.apply(clean_country)
                    
data_frame.rename({"GDP(US$MM)": "GDP"}, inplace=True, axis=1)
data_frame.GDP = data_frame.GDP.apply(lambda gdp: gdp.replace("\n", "").replace(",", ""))

# Fix 77,460/Na for Syria
data_frame.GDP = data_frame.GDP.replace("77460/Na", "77460")

data_frame.GDP = data_frame.GDP.astype(np.float64)

In [None]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 3 columns):
Rank       193 non-null object
Country    193 non-null object
GDP        193 non-null float64
dtypes: float64(1), object(2)
memory usage: 4.6+ KB


In [None]:
data_frame.head()

| | Rank | Country | GDP |
|---|---|---|---|
| 0 | 1 | United States | 19390600.0 |
| 1 | — | European Union | 17308862.0 |
| 2 | 2 | China | 12014610.0 |
| 3 | 3 | Japan | 4872135.0 |
| 4 | 4 | Germany | 3684816.0 |
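As an alternative sketch, the same cleaning can be written with vectorized pandas string methods instead of `apply`. The three rows below are copied from the uncleaned output shown earlier so the example runs without network access. The pattern `\[.*?\]` removes every bracketed footnote marker in one pass, which is a different technique from the two `clean_country` calls used above.

```python
import pandas as pd

raw = pd.DataFrame({
    "Country": ["United States", "European Union[n 1][19]", "China[n 2]"],
    "GDP(US$MM)": ["19,390,600\n", "17,308,862\n", "12,014,610\n"],
})

clean = pd.DataFrame({
    # Remove all "[...]" footnote markers, then surrounding whitespace
    "Country": raw["Country"]
        .str.replace(r"\[.*?\]", "", regex=True)
        .str.strip(),
    # Drop thousands separators and the trailing newline, then convert
    "GDP": raw["GDP(US$MM)"]
        .str.replace(",", "", regex=False)
        .str.strip()
        .astype("float64"),
})

print(clean)
```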


## Your very own scraper

Pick a website of your choice, scrape its data into a pandas data frame and clean it afterwards.