# Playground Web Scraping

This notebook contains a few snippets and first chunks of code to understand how web scraping works in Python.

In [2]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Before sending our first request to scrape a webpage, we need to define a few helper functions.

In [3]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content type of the response is some kind of HTML/XML, return
    the raw response body (bytes); otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() so a missing Content-Type header does not raise a KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
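
To see how the response check behaves without touching the network, here is a minimal sketch that applies the same rule to stand-in objects. The helper name `looks_like_html` and the `SimpleNamespace` stand-ins are assumptions made for illustration; they only mimic the `status_code` and `headers` attributes that the check reads.

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """Hypothetical helper: True for a 200 response whose Content-Type mentions HTML."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


# Stand-in objects that expose only the two attributes the check uses
html_page = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'text/html; charset=UTF-8'})
json_api = SimpleNamespace(status_code=200,
                           headers={'Content-Type': 'application/json'})
not_found = SimpleNamespace(status_code=404,
                            headers={'Content-Type': 'text/html'})
no_header = SimpleNamespace(status_code=200, headers={})

for label, resp in [('html_page', html_page), ('json_api', json_api),
                    ('not_found', not_found), ('no_header', no_header)]:
    print(label, looks_like_html(resp))
```

Only the first stand-in passes: the others fail on the content type, the status code, or the missing header.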

Let's test these functions to see if we can obtain valid HTML code.

In [4]:
raw_html = simple_get('https://nu.nl')
len(raw_html)

337703

In [5]:
no_html = simple_get('https://nu.nl/bestaatniet')
no_html is None

True

Now that we have the raw HTML data, **BeautifulSoup** can help us extract information from it. Let's scrape a list of the greatest mathematicians of all time.
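
Before running this on a live page, it may help to see the same calls on a small inline document. The sketch below assumes a tiny, well-formed HTML string (`sample_html`, made up for illustration) and uses `select`, which takes a CSS selector and returns a list of matching tags.

```python
from bs4 import BeautifulSoup

# A tiny, well-formed document used only for illustration
sample_html = """
<html><body>
  <h1>Example</h1>
  <ul>
    <li>Ada Lovelace</li>
    <li>Alan Turing</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Collect the text of every <li> element in document order
items = [li.text for li in soup.select('li')]
print(items)

# select() always returns a list, so index into it for a single element
heading = soup.select('h1')[0].text
print(heading)
```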

In [6]:
math_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')
html = BeautifulSoup(math_html, 'html.parser')

# Enumerate all elements found in the list
for i, li in enumerate(html.select('li')):
    print(i, li.text)

0  Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

1  Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

2  Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

3  Leonhard Euler
 Bernhard Riemann

4  Bernhard Riemann

5  Henri Poincaré
 Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

6  Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

7  Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

8  David Hilbert
 Gottfried W. Leibniz

9  Gottfried W. Leibniz

10  Alexandre Grothendieck
 Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

11  Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

12  Évariste Galois
 John von Neumann
 René Descartes

13  John von Neumann
 René Descartes

14  René Descartes

15  Karl W. T. Weierstrass
 Srinivasa Ramanujan
 Hermann K. H. Weyl
 Peter G. L. Dirichlet
 Niels Abel

16  Srinivasa Ramanujan
 Hermann K. H. Weyl
 

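Observe that the names above are sometimes listed together: the text of a list item can also contain the names of the items that follow it. We write a function that splits these names and returns them in a list, one entry per mathematician.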

In [7]:
def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)
    
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                name = name.strip()
                # Skip lines that are empty once whitespace is stripped
                if name:
                    names.add(name)
        return list(names)
    
    raise Exception('Error retrieving contents at {}'.format(url))
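
The splitting and de-duplication step can be tried offline on a few text blocks shaped like the output above. This is a minimal sketch; the helper name `split_names` and the `sample` list are assumptions for illustration, and the result is sorted only to make the output predictable.

```python
def split_names(li_texts):
    """Hypothetical helper: split text blocks on newlines and keep unique, stripped names."""
    names = set()
    for text in li_texts:
        for line in text.split('\n'):
            line = line.strip()
            if line:  # ignore blank lines
                names.add(line)
    return sorted(names)


# Overlapping blocks, similar to what the nested list items produce
sample = [
    ' Isaac Newton\n Archimedes\n Carl F. Gauss\n',
    ' Archimedes\n Carl F. Gauss\n',
    ' Carl F. Gauss\n',
]

unique_names = split_names(sample)
print(unique_names)  # ['Archimedes', 'Carl F. Gauss', 'Isaac Newton']
```

Using a set means a name that appears in several blocks is stored only once.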

In [9]:
names = get_names()

We can get a popularity score for each mathematician by checking the number of hits that person's Wikipedia page receives. Given a mathematician's name, the following function returns this score.

In [10]:
def get_hits_on_name(name):
    """
    Accepts a `name` of a mathematician and returns the number
    of hits that mathematician's Wikipedia page received in the 
    last 60 days, as an `int`
    """
    # url_root is a template string that is used to build a URL.
    url_root = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/{}'
    response = simple_get(url_root.format(name))

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')

        # Use .get() so anchors without an href attribute are skipped safely
        hit_link = [a for a in html.select('a')
                    if 'latest-60' in a.get('href', '')]

        if len(hit_link) > 0:
            # Strip commas
            link_text = hit_link[0].text.replace(',', '')
            try:
                # Convert to integer
                return int(link_text)
            except ValueError:
                log_error("couldn't parse {} as an `int`".format(link_text))

    log_error('No pageviews found for {}'.format(name))
    return None
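
Names containing spaces or accented characters end up in the URL path. `requests` normally encodes such characters itself, so the sketch below is mainly a way to make the final URL explicit, for example when logging it. The helper name `build_url` and the choice to replace spaces with underscores are assumptions, not something the notebook does, and encoding alone does not help when the name differs from the article title.

```python
from urllib.parse import quote

URL_ROOT = 'https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/{}'


def build_url(name):
    """Hypothetical helper: percent-encode a name before inserting it into the URL."""
    # Assumption: use underscores instead of spaces, as Wikipedia URLs do
    title = name.strip().replace(' ', '_')
    return URL_ROOT.format(quote(title))


plain = build_url('Isaac Newton')
accented = build_url('Henri Poincaré')
print(plain)
print(accented)
```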

In [11]:
# Create empty list with final results
results = []

for name in names:
    try:
        hits = get_hits_on_name(name)
        if hits is None:
            hits = -1
        results.append((hits, name))
    except Exception:
        # Record a sentinel so the failed name still appears in the results
        results.append((-1, name))
        log_error('error encountered while processing {}, skipping'.format(name))

No pageviews found for Hermann K. H. Weyl
No pageviews found for Karl W. T. Weierstrass
No pageviews found for Panini  of Shalatula
No pageviews found for Muhammed al-Khowârizmi
No pageviews found for Hermann G. Grassmann
No pageviews found for F. Gotthold Eisenstein
No pageviews found for Bháscara (II) Áchárya
No pageviews found for William R. Hamilton
No pageviews found for Adrien M. Legendre
No pageviews found for Gottfried W. Leibniz
No pageviews found for F.E.J. Émile Borel
No pageviews found for Peter G. L. Dirichlet
No pageviews found for Omar al-Khayyám
No pageviews found for M. E. Camille Jordan
No pageviews found for James J. Sylvester
No pageviews found for Ernst E. Kummer
No pageviews found for Alhazen ibn al-Haytham
No pageviews found for F. L. Gottlob Frege
No pageviews found for Leonardo `Fibonacci'


The `results` list contains the number of hits for each mathematician. If no hits could be found for a mathematician, the hit count is set to -1. We now sort the results and show the five most popular mathematicians.

In [13]:
# Sort by hit count, highest first
results.sort(reverse=True)

# Slicing is safe even when there are fewer than 5 results
top_marks = results[:5]

for (mark, mathematician) in top_marks:
    print('{} with {} pageviews'.format(mathematician, mark))

Albert Einstein with 979742 pageviews
Isaac Newton with 478203 pageviews
Srinivasa Ramanujan with 464945 pageviews
Aristotle with 402373 pageviews
Galileo Galilei with 351093 pageviews
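
When only the top few entries are needed, `heapq.nlargest` from the standard library is an alternative to sorting the whole list. The sketch below is a swapped-in technique, not what the notebook uses; `sample_results` is a small hand-made list that reuses values from the output in this notebook.

```python
import heapq

# (hits, name) pairs; -1 marks a name with no pageview count
sample_results = [
    (478203, 'Isaac Newton'),
    (-1, 'Gottfried W. Leibniz'),
    (979742, 'Albert Einstein'),
    (464945, 'Srinivasa Ramanujan'),
]

# Tuples compare element by element, so the hit count decides the order
top_two = heapq.nlargest(2, sample_results)

for hits, name in top_two:
    print('{} with {} pageviews'.format(name, hits))
```

Because the sentinel -1 is smaller than any real count, names without results naturally fall to the bottom of the ranking.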


However, we could not find results for the following mathematicians:

In [16]:
no_results = [res for res in results if res[0] == -1]

print('We did not find results for the following {} mathematicians:'.format(len(no_results)))
for (mark, mathematician) in no_results:
    print(' - {}'.format(mathematician))

We did not find results for the following 19 mathematicians:
 - William R. Hamilton
 - Peter G. L. Dirichlet
 - Panini  of Shalatula
 - Omar al-Khayyám
 - Muhammed al-Khowârizmi
 - M. E. Camille Jordan
 - Leonardo `Fibonacci'
 - Karl W. T. Weierstrass
 - James J. Sylvester
 - Hermann K. H. Weyl
 - Hermann G. Grassmann
 - Gottfried W. Leibniz
 - F.E.J. Émile Borel
 - F. L. Gottlob Frege
 - F. Gotthold Eisenstein
 - Ernst E. Kummer
 - Bháscara (II) Áchárya
 - Alhazen ibn al-Haytham
 - Adrien M. Legendre
