# Retrieving data from the web

# requests

The first task is to retrieve some data from the Internet. Python has several libraries that were developed over the years to do exactly that: urllib in the standard library (which absorbed Python 2's urllib2), and the third-party urllib3.

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.
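To give a sense of what "low-level" means here, the following is a minimal sketch of preparing a POST request with the standard library's urllib. The URL and form fields are placeholders invented for illustration, and the request is only built, never sent.

```python
from urllib import parse, request

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://example.org/login"
fields = {"user": "alice", "lang": "en"}

# The form data must be URL-encoded and converted to bytes by hand.
body = parse.urlencode(fields).encode("utf-8")

# Build the request object; nothing is sent until request.urlopen() is called.
post_req = request.Request(url, data=body, method="POST")
post_req.add_header("Content-Type", "application/x-www-form-urlencoded")

print(post_req.get_method())  # POST
print(post_req.data)          # b'user=alice&lang=en'
```

Every step (encoding, byte conversion, headers) is explicit, which is the kind of bookkeeping a higher-level library takes care of.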

Luckily, as with most tasks in Python, someone has developed a library that simplifies these tasks: requests. In reality, the requests made in this assignment are fairly simple and could easily be done using the standard library. However, it is better to get acquainted with requests as soon as possible, since you will probably need it in the future.

In [1]:
import requests

Now that the requests library has been imported into our namespace, we can use the functions it offers.

In this case we'll use the appropriately named get function to issue a GET request. This is equivalent to typing a URL into your browser and hitting enter.

In [2]:
req = requests.get("https://en.wikipedia.org/wiki/University_of_Cambridge")
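A GET request can fail (a typo in the URL, a server error, a network timeout), and requests does not raise an exception for HTTP error codes unless asked to. The sketch below wraps the call in a small helper; the name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the assignment.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    # Without a timeout, requests may wait indefinitely for a response.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://en.wikipedia.org/wiki/University_of_Cambridge") should return the same kind of string that req.text holds later in this notebook.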

Another very nifty Python function is dir. You can use it to list all the attributes (properties and methods) of an object.

In [33]:
dir(req)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
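Most of the names in that list start with an underscore and are internal. A list comprehension can filter them out; the sketch below uses a tiny stand-in class (invented for this example) so that it runs without a network connection.

```python
class Example:
    """A tiny stand-in object with one public and one private attribute."""

    def __init__(self):
        self.status = "ok"
        self._cache = None

    def close(self):
        pass


obj = Example()

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(obj) if not name.startswith("_")]
print(public_names)  # ['close', 'status']
```

Applied to req, the same comprehension would leave names such as 'status_code', 'text' and 'url'.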

In [34]:
page = req.text

In [35]:
type(page)

str

# BeautifulSoup
Parsing data would be a breeze if we could always use well-formatted data sources, such as CSV, JSON, or XML. Some formats, such as HTML, are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

In [36]:
from bs4 import BeautifulSoup

BeautifulSoup can deal with HTML or XML data, so the next line parses the contents of the page variable using Python's built-in HTML parser ('html.parser') and assigns the result to the soup variable.

In [37]:
soup = BeautifulSoup(page,'html.parser')

In [38]:
type(soup)

bs4.BeautifulSoup
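To see the forgiving behaviour in isolation, here is a small sketch that feeds BeautifulSoup a deliberately malformed fragment (made up for this example). How exactly the tree is repaired can differ between parsers, so only the parts that are safe to rely on are printed.

```python
from bs4 import BeautifulSoup

# A deliberately sloppy fragment: neither <p> tag is ever closed.
broken_html = "<html><body><p>first paragraph<p>second paragraph</body></html>"

soup_demo = BeautifulSoup(broken_html, "html.parser")

# Both paragraphs are still found, even though the markup is malformed.
paragraphs = soup_demo.find_all("p")
print(len(paragraphs))           # 2
print(paragraphs[1].get_text())  # second paragraph
```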

# Display the title of the webpage

In [45]:
soup.title

<title>University of Cambridge - Wikipedia</title>

In [46]:
soup.title.get_text()

'University of Cambridge - Wikipedia'

# Display all p tags from the webpage

You may use the find_all method! (findAll, used below, is an older alias for the same method.)


In [47]:
x = soup.findAll('p')

# Count the p tags present

In [48]:
len(soup.findAll('p'))

155

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "UCAS Admission Statistics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy. Here soup.table is the first table tag on the page:

In [55]:
soup.table['class']

['infobox', 'vcard']

# Create a nested list containing classes of all the table tags

In [56]:
new_table = []
for table in soup.findAll('table',class_= True):
    new_table.append(table['class'])

print(new_table)


[['infobox', 'vcard'], ['wikitable', 'floatright'], ['infobox'], ['mbox-small', 'plainlinks', 'sistersitebox'], ['mbox-small', 'plainlinks', 'sistersitebox'], ['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner'], ['nowraplinks', 'navbox-subgroup'], ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'], ['nowraplinks', 'hlist', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'], ['nowraplinks', 'navbox-subgroup'], ['nowraplinks', 'navbox-subgroup'], ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'], ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'], ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'], ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'], ['nowraplinks', 'hlist', 'navbox-inner']]
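The behaviour of the class attribute is easier to see on a small document. The sketch below uses a made-up fragment with three tables, one of which has no class at all, to show why the value is a list and why class_=True is needed in the loop.

```python
from bs4 import BeautifulSoup

# A made-up fragment with three tables; the second has no class attribute.
html = """
<table class="infobox vcard"><tr><td>a</td></tr></table>
<table><tr><td>b</td></tr></table>
<table class="wikitable floatright"><tr><td>c</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

# class is multi-valued, so Beautiful Soup returns it as a list of strings.
first_classes = demo.table["class"]

# class_=True keeps only the tables that have a class attribute;
# indexing a tag without one (tag["class"]) would raise KeyError.
all_classes = [t["class"] for t in demo.find_all("table", class_=True)]

# Searching for a single class matches any table that carries it.
wikitables = demo.find_all("table", class_="wikitable")

print(first_classes)    # ['infobox', 'vcard']
print(all_classes)      # [['infobox', 'vcard'], ['wikitable', 'floatright']]
print(len(wikitables))  # 1
```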


As mentioned, we will be using the UCAS Admission Statistics table for this lab. The next cell contains the HTML elements of said table. We will render it in different parts of the notebook to make it easier to follow along with the parsing steps.

# Check the classes and find the UCAS Admission Statistics Table

Use the find method to find the table by its class, store the resulting tag in html_soup, and store its string form in table_html.

In [57]:
html_soup = soup.find('table', class_="wikitable floatright")
table_html = str(html_soup)

In [58]:
table_html

'<table class="wikitable floatright" style="font-size:85%; text-align:center;">\n<caption>UCAS Admission Statistics\n</caption>\n<tbody><tr>\n<th>\n</th>\n<th>2017\n</th>\n<th>2016\n</th>\n<th>2015\n</th>\n<th>2014\n</th>\n<th>2013\n</th></tr>\n<tr>\n<td><b>Applications</b><sup class="reference" id="cite_ref-104"><a href="#cite_note-104">[104]</a></sup>\n</td>\n<td>17,235\n</td>\n<td>16,795\n</td>\n<td>16,505\n</td>\n<td>16,970\n</td>\n<td>16,330\n</td></tr>\n<tr>\n<td><b>Offer Rate (%)</b><sup class="reference" id="cite_ref-105"><a href="#cite_note-105">[105]</a></sup>\n</td>\n<td>31.2\n</td>\n<td>33.8\n</td>\n<td>33.5\n</td>\n<td>32.5\n</td>\n<td>32.2\n</td></tr>\n<tr>\n<td><b>Enrols</b><sup class="reference" id="cite_ref-106"><a href="#cite_note-106">[106]</a></sup>\n</td>\n<td>3,480\n</td>\n<td>3,440\n</td>\n<td>3,430\n</td>\n<td>3,425\n</td>\n<td>3,355\n</td></tr>\n<tr>\n<td><a href="/wiki/Yield_(college_admissions)" title="Yield (college admissions)"><b>Yield (%)</b></a>\n</td>\n

In [59]:
from IPython.display import HTML

HTML(table_html)

| | 2017 | 2016 | 2015 | 2014 | 2013 |
|---|---|---|---|---|---|
| Applications[104] | 17235.0 | 16795.0 | 16505.0 | 16970.0 | 16330.0 |
| Offer Rate (%)[105] | 31.2 | 33.8 | 33.5 | 32.5 | 32.2 |
| Enrols[106] | 3480.0 | 3440.0 | 3430.0 | 3425.0 | 3355.0 |
| Yield (%) | 64.7 | 60.6 | 62.0 | 62.1 | 63.8 |
| Applicant/Enrolled Ratio | 4.95 | 4.88 | 4.81 | 4.95 | 4.87 |
| Average Entry Tariff[107][note 1] | | 226.0 | 592.0 | 600.0 | 601.0 |


# Extract the rows from the UCAS Admission Statistics table and store them in the rows variable

In [60]:
rows = html_soup.find_all('tr')
print(rows)

[<tr>
<th>
</th>
<th>2017
</th>
<th>2016
</th>
<th>2015
</th>
<th>2014
</th>
<th>2013
</th></tr>, <tr>
<td><b>Applications</b><sup class="reference" id="cite_ref-104"><a href="#cite_note-104">[104]</a></sup>
</td>
<td>17,235
</td>
<td>16,795
</td>
<td>16,505
</td>
<td>16,970
</td>
<td>16,330
</td></tr>, <tr>
<td><b>Offer Rate (%)</b><sup class="reference" id="cite_ref-105"><a href="#cite_note-105">[105]</a></sup>
</td>
<td>31.2
</td>
<td>33.8
</td>
<td>33.5
</td>
<td>32.5
</td>
<td>32.2
</td></tr>, <tr>
<td><b>Enrols</b><sup class="reference" id="cite_ref-106"><a href="#cite_note-106">[106]</a></sup>
</td>
<td>3,480
</td>
<td>3,440
</td>
<td>3,430
</td>
<td>3,425
</td>
<td>3,355
</td></tr>, <tr>
<td><a href="/wiki/Yield_(college_admissions)" title="Yield (college admissions)"><b>Yield (%)</b></a>
</td>
<td>64.7
</td>
<td>60.6
</td>
<td>62.0
</td>
<td>62.1
</td>
<td>63.8
</td></tr>, <tr>
<td><b>Applicant/Enrolled Ratio</b>
</td>
<td>4.95
</td>
<td>4.88
</td>
<td>4.81
</td>
<td>4.95
</td

# lambda expressions

We will then use a lambda expression to replace newline characters with spaces. A lambda expression is a concise way to define a small, single-expression function inline, much as a list comprehension is a concise way to build a list that would otherwise need a loop.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts, but for the purposes of this class we can ignore those differences.

In [62]:
rem_nl = lambda s: s.replace("\n", " ")
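For comparison, here is a short sketch that puts the lambda next to an equivalent named function (rem_nl_def is a name made up for this example) and shows the more typical use of a lambda as an argument to another function.

```python
# The lambda from the cell above, repeated so this snippet runs on its own.
rem_nl = lambda s: s.replace("\n", " ")


# An equivalent named function.
def rem_nl_def(s):
    return s.replace("\n", " ")


sample = "2017\n"
print(repr(rem_nl(sample)))                  # '2017 '
print(rem_nl(sample) == rem_nl_def(sample))  # True

# Lambdas are most useful when passed directly to another function.
cells = ["2016\n", "\n", "2017\n"]
cleaned = list(map(lambda s: s.replace("\n", " "), cells))
print(cleaned)                               # ['2016 ', ' ', '2017 ']
```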

# Extract the columns from the UCAS Admission Statistics table and store them in the columns variable

In [72]:
th = html_soup.find_all('th')

# Clean the text of each header cell and skip the empty top-left corner cell
columns = []
for cell in th:
    text = rem_nl(cell.get_text()).strip()
    if text:
        columns.append(text)

print(columns)

['2017', '2016', '2015', '2014', '2013']


# Extract the indexes from the rows variable

Store them in a variable named indexes. In this table each row label is the bold (b) element in the first cell of its row.

In [73]:
index = html_soup.findAll('b')

indexes = []
for b in index:
    indexes.append(b.get_text())
    
print(indexes)

['Applications', 'Offer Rate (%)', 'Enrols', 'Yield (%)', 'Applicant/Enrolled Ratio', 'Average Entry Tariff']


In [74]:
HTML(table_html)

| | 2017 | 2016 | 2015 | 2014 | 2013 |
|---|---|---|---|---|---|
| Applications[104] | 17235.0 | 16795.0 | 16505.0 | 16970.0 | 16330.0 |
| Offer Rate (%)[105] | 31.2 | 33.8 | 33.5 | 32.5 | 32.2 |
| Enrols[106] | 3480.0 | 3440.0 | 3430.0 | 3425.0 | 3355.0 |
| Yield (%) | 64.7 | 60.6 | 62.0 | 62.1 | 63.8 |
| Applicant/Enrolled Ratio | 4.95 | 4.88 | 4.81 | 4.95 | 4.87 |
| Average Entry Tariff[107][note 1] | | 226.0 | 592.0 | 600.0 | 601.0 |


In [75]:
td = html_soup.find_all('td')

# Clean the text of every cell
cells = [rem_nl(x.get_text()).strip() for x in td]

# Each row has 6 cells and the first one is the row label, so drop every 6th cell
cells = [c for pos, c in enumerate(cells) if pos % 6 != 0]

# Convert to floats, treating 'n/a' as a missing value
values = []
for c in cells:
    if c == 'n/a':
        values.append(None)
    else:
        values.append(float(c.replace(',', '')))

print(values)

[17235.0, 16795.0, 16505.0, 16970.0, 16330.0, 31.2, 33.8, 33.5, 32.5, 32.2, 3480.0, 3440.0, 3430.0, 3425.0, 3355.0, 64.7, 60.6, 62.0, 62.1, 63.8, 4.95, 4.88, 4.81, 4.95, 4.87, None, 226.0, 592.0, 600.0, 601.0]
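The conversion step can also be packaged as a small helper. This is a sketch only: the name to_number is made up, and treating an empty cell as missing is an extra assumption that the cell above does not make.

```python
def to_number(text):
    """Convert a table cell such as '17,235' to a float, or None if missing."""
    # Remove surrounding whitespace (including newlines) and thousands separators.
    cleaned = text.strip().replace(",", "")
    # 'n/a' marks a missing value; an empty cell is treated the same (assumption).
    if cleaned in ("", "n/a"):
        return None
    return float(cleaned)


print(to_number("17,235\n"))  # 17235.0
print(to_number(" 31.2 "))    # 31.2
print(to_number("n/a"))       # None
```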


The problem with the list above is that the values lost their grouping.

The zip function is used to combine two sequences element-wise. In Python 3 it returns a lazy iterator, so list(zip([1,2,3], [4,5,6])) would return [(1, 4), (2, 5), (3, 6)].

This is the first time we see a container bounded by parentheses. This is a tuple, which you can think of as an immutable list (meaning you can't add, remove, or change its elements). Otherwise tuples work just like lists and can be indexed, sliced, etc.
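A short sketch of both ideas, zip and tuples, using only built-ins:

```python
pairs = list(zip([1, 2, 3], [4, 5, 6]))
print(pairs)      # [(1, 4), (2, 5), (3, 6)]

first = pairs[0]
print(first[1])   # 4     (indexing works as with a list)
print(first[:1])  # (1,)  (slicing returns another tuple)

# Tuples are immutable: item assignment raises TypeError.
try:
    first[0] = 99
except TypeError as err:
    print("cannot modify a tuple:", err)
```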

In [76]:
stacked_values = zip(*[values[i::5] for i in range(len(columns))])
print(stacked_values)

<zip object at 0x00000215FD652D88>
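The zip object printed above is a lazy iterator, which is why only its memory address is shown. The sketch below repeats the regrouping idea on a small made-up list so that each step can be inspected; flat and n_cols are stand-ins for values and the number of columns.

```python
# A flat list standing in for `values`: 2 rows of 3 columns, read row by row.
flat = [10, 11, 12, 20, 21, 22]
n_cols = 3

# flat[i::n_cols] picks every n_cols-th item starting at i, i.e. column i.
cols = [flat[i::n_cols] for i in range(n_cols)]
print(cols)           # [[10, 20], [11, 21], [12, 22]]

# zip(*cols) transposes the columns back into rows.
grouped = zip(*cols)
print(list(grouped))  # [(10, 11, 12), (20, 21, 22)]

# The iterator is now exhausted; a second pass yields nothing.
leftover = list(grouped)
print(leftover)       # []

# Slicing in steps of n_cols gives the same rows more directly.
rows_direct = [tuple(flat[i:i + n_cols]) for i in range(0, len(flat), n_cols)]
print(rows_direct)    # [(10, 11, 12), (20, 21, 22)]
```

Because a zip object can only be consumed once, wrapping it in list(...) is a safer choice when the grouped values are needed more than once.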


---------

# Pandas data structures

## DataFrames

To recap, we now have three data structures holding our column names, our row (index) names, and our values grouped by index.

We will now load this data into a Pandas DataFrame. The loading process is pretty straightforward, and all we need to do is tell Pandas which container goes where.

In [87]:
import pandas as pd

# Create the DataFrame
### Use stacked_values, columns and indexes to create the admission statistics DataFrame
#### Name the DataFrame df

In [78]:
df = pd.DataFrame(stacked_values, index=indexes, columns=columns)
df.head(6)

| | 2017 | 2016 | 2015 | 2014 | 2013 |
|---|---|---|---|---|---|
| Applications | 17235.0 | 16795.0 | 16505.0 | 16970.0 | 16330.0 |
| Offer Rate (%) | 31.2 | 33.8 | 33.5 | 32.5 | 32.2 |
| Enrols | 3480.0 | 3440.0 | 3430.0 | 3425.0 | 3355.0 |
| Yield (%) | 64.7 | 60.6 | 62.0 | 62.1 | 63.8 |
| Applicant/Enrolled Ratio | 4.95 | 4.88 | 4.81 | 4.95 | 4.87 |
| Average Entry Tariff | | 226.0 | 592.0 | 600.0 | 601.0 |
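The same construction on a tiny, made-up dataset makes it easier to see which argument ends up where. All labels and numbers below are invented for illustration.

```python
import pandas as pd

# Two rows of three values each, with made-up labels.
data = [(1.0, 2.0, 3.0), (4.0, None, 6.0)]
row_labels = ["first", "second"]
col_labels = ["a", "b", "c"]

demo_df = pd.DataFrame(data, index=row_labels, columns=col_labels)

print(demo_df.shape)               # (2, 3)
print(demo_df.loc["first", "b"])   # 2.0
print(demo_df["b"].isna().sum())   # 1  (the None is treated as missing)
```

Passing index and columns by keyword avoids having to remember their positional order.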


In [79]:
df.dtypes

2017    float64
2016    float64
2015    float64
2014    float64
2013    float64
dtype: object

# Replace the NaN value with 0

In [81]:
df_clean = df.fillna(value = 0)
df_clean

| | 2017 | 2016 | 2015 | 2014 | 2013 |
|---|---|---|---|---|---|
| Applications | 17235.0 | 16795.0 | 16505.0 | 16970.0 | 16330.0 |
| Offer Rate (%) | 31.2 | 33.8 | 33.5 | 32.5 | 32.2 |
| Enrols | 3480.0 | 3440.0 | 3430.0 | 3425.0 | 3355.0 |
| Yield (%) | 64.7 | 60.6 | 62.0 | 62.1 | 63.8 |
| Applicant/Enrolled Ratio | 4.95 | 4.88 | 4.81 | 4.95 | 4.87 |
| Average Entry Tariff | 0.0 | 226.0 | 592.0 | 600.0 | 601.0 |


Now our table looks good!
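Note that fillna keeps the row and substitutes a value, whereas dropna removes the row entirely. A small sketch on made-up numbers shows the difference and how the choice affects a later mean:

```python
import pandas as pd

# Made-up data: the second row has a missing value in the first column.
demo = pd.DataFrame(
    {"2017": [10.0, None], "2016": [20.0, 40.0]},
    index=["complete row", "row with gap"],
)

dropped = demo.dropna()        # removes every row that contains a NaN
filled = demo.fillna(value=0)  # keeps the row, replacing NaN with 0

print(dropped.shape)           # (1, 2)
print(filled.shape)            # (2, 2)

# mean() skips NaN by default, so filling with 0 changes the result.
print(demo["2017"].mean())     # 10.0
print(filled["2017"].mean())   # 5.0
```

Which option is appropriate depends on whether a 0 is a meaningful value for the quantity in question.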

---

### NumPy

Pandas is awesome, but it is built on top of another library that we will use extensively during the course. NumPy implements new data types and vectorized functions.



In [82]:
import numpy as np

In [83]:
df_clean.values

array([[1.7235e+04, 1.6795e+04, 1.6505e+04, 1.6970e+04, 1.6330e+04],
       [3.1200e+01, 3.3800e+01, 3.3500e+01, 3.2500e+01, 3.2200e+01],
       [3.4800e+03, 3.4400e+03, 3.4300e+03, 3.4250e+03, 3.3550e+03],
       [6.4700e+01, 6.0600e+01, 6.2000e+01, 6.2100e+01, 6.3800e+01],
       [4.9500e+00, 4.8800e+00, 4.8100e+00, 4.9500e+00, 4.8700e+00],
       [0.0000e+00, 2.2600e+02, 5.9200e+02, 6.0000e+02, 6.0100e+02]])

In [84]:
type(df_clean.values)

numpy.ndarray
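"Vectorized" means that one expression is applied to every element of an array without an explicit loop. As an illustrative sketch, the snippet below multiplies the 2017 and 2016 application counts by the corresponding offer rates from the table; reading the product as an approximate number of offers is an interpretation added here, not something the source computes.

```python
import numpy as np

# Applications and offer rates (%) for 2017 and 2016, copied from the table.
applications = np.array([17235.0, 16795.0])
offer_rate_pct = np.array([31.2, 33.8])

# One expression operates on every element; no explicit loop is needed.
offers = applications * offer_rate_pct / 100
print(offers.round(0))

# The same calculation with a plain Python loop, for comparison.
offers_loop = [a * r / 100 for a, r in zip(applications, offer_rate_pct)]
print(np.allclose(offers, offers_loop))  # True
```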

In [85]:
mean = df_clean['2017'].mean()
mean

3469.308333333334

In [86]:
SD = df_clean.std()
SD

2017    6883.955362
2016    6685.915854
2015    6536.376041
2014    6721.802172
2013    6465.609210
dtype: float64
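One detail worth knowing when comparing results across the two libraries: the pandas std method computes the sample standard deviation (dividing by n - 1) by default, while NumPy's np.std computes the population standard deviation (dividing by n) unless ddof=1 is passed. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]  # the mean is 5

# NumPy default: population standard deviation (ddof=0).
print(np.std(data))  # 2.0

# pandas default: sample standard deviation (ddof=1).
sample_sd = pd.Series(data).std()
print(np.isclose(sample_sd, np.std(data, ddof=1)))  # True
```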