<div style="display: block; width: 100%; height: 100px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        Department of Digital Humanities
        <br />
        5AAVC210 Introduction to Programming in Python
    </span>
    <br />
   
</p>


</div>

# Notebook 3: Files, Requests and Statistics 

With this Notebook, our focus is on getting what you need in place to help you do your final assessment. This is a long and complicated notebook, but you have nearly three weeks to do it! **It's due in on 17th March before 4pm.**

There are three main parts to this Notebook, so that's nearly a week for each part. 

This time we will focus on opening files and doing something with the contents, making requests to web pages, and using the data returned from them for research. We'll also say hello to the standard `statistics` library, and look at handy functions for calculating the mean, median and mode. By the end of this, you should be comfortable doing the following.

* Importing files and working with the data in them
* Making an HTTP GET request to a remote web page and receiving a result back.
* Using BeautifulSoup to parse a web page and get basic data out of an HTML tree.
* Using the statistics module to calculate the mean, median and mode.

If you're a bit unsure about the statistics bit and would like a re-cap of the terms, [check out this article](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/mean-median-basics/a/mean-median-and-mode-review).

To this point in the module, all of the data we've used has been provided to us ready-made. This week, we'll be getting some of that data from a file, and some from the real world, by taking it from the Internet.

# PART 1

## Using files

You are going to write a program to tell users about a specific country. A file called `countryinfo.py`, which is included in the .zip file along with this Notebook, is a list of dictionaries detailing the majority of countries in the world. 

Make sure you have this `countryinfo.py` file in the same folder as your Jupyter Notebook.

Dealing with dicts inside a list sounds complex. The first part of this tutorial should help you understand the structure: https://canvas.instructure.com/courses/1133362/pages/book-5-dot-4-python-combining-lists-and-dictionaries

I've imported the `countryinfo.py` file in the cell below, alongside the pprint() function, which works just like print(), but makes complex types easier to read by putting spaces and new lines in them. Run it and see the contents of the file. (You can also open the file in a text editor to view it.)

In [None]:
# we're importing a variable here from a local file
from countryinfo import countries
from pprint import pprint # this is the "pretty print" function

pprint(countries)

When you run pprint(countries) you can see all the dict items with their keys and values. (These are all contained within a list called countries, which is what we imported, above.)
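If the list-of-dicts structure still feels abstract, a tiny hand-made version may help. The sketch below uses a made-up `sample_countries` list; its keys and values are assumptions for illustration only and may not match what is in `countryinfo.py`.

```python
# Hypothetical sample data: the keys and values are illustrative only
sample_countries = [
    {'name': 'Chile', 'capital': 'Santiago', 'continent': 'South America'},
    {'name': 'Kenya', 'capital': 'Nairobi', 'continent': 'Africa'},
]

# A list is indexed by position, and a dict is indexed by key
first = sample_countries[0]
print(first['name'])
print(first['capital'])

# dict.get() returns a default value instead of raising a KeyError
# when the key is missing
print(first.get('population', 'not recorded'))
```

Square brackets with a number pick an item out of the list, while square brackets with a string pick a value out of a dict.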

### EXERCISE 1
<font color=purple> Write a program which:

Asks the user which country they wish to find out about (NB: they'll have to spell it correctly and start it with a capital letter). **[10 marks]**

Looks to see if the country is within a dict item in the list, displaying a message if no such country is found. **[5 marks]**

If the country name is found, prints out facts about the country using the information in countryinfo.py. **[15 marks]**
    </font>

For example, if your user entered 'Chile', your program would return something like:
You asked about: Chile

Capital city is: 

Santiago 

Timezone is: 

['America/Santiago', 'Pacific/Easter'] 

Continent is: 

South America

**Hint**
In the cell below I've given you a way of looping through all the dict items in the `countries` list and extracting the value stored under the `name` key of each one. Play around with this and see how it works.

In [None]:
for item in countries: 
    countryname = item['name']  
    print(countryname) 

In [None]:
# your code here. You can use multiple cells if you wish.


# PART 2

## Making Requests

As you might expect, it's very rare to have data readily available to us as a neat list or dictionary in Python. Usually, we'll need to take our data *from* somewhere, and more often than not, that data is likely to be on the internet or in a file.

But first, a recap of the way that the internet works: HTTP requests and responses.

Whenever we want to visit a web page, we type the URL into our browser, hit enter, and then the web page appears. But what happens "under the hood?" If we remember our History of Networked Technologies module then we should know that:

* The URL is a Uniform Resource Locator, which informs the browser of the protocol, location and path of the resource we want to get. So, for example, `http://www.bbc.co.uk/news` is actually an instruction to our browser to use the `http` protocol to look on the `www.bbc.co.uk` server (which it finds using DNS, the Domain Name System) for a resource called `/news`.
* HTTP stands for Hyper-Text Transfer Protocol, and was developed by [Tim Berners-Lee at CERN in 1989](http://info.cern.ch/Proposal.html).
* The browser makes an HTTP request to the server. The request must be one of several different methods, the most common of which are `GET` and `POST`. Today we're only looking at `GET` requests.
* The server receives the request, and looks for the resource on the server. It will send a response with a status code and a body.
    * If the resource is found, the server will send a response with status code **200**.
    * If the resource is not found, the server will send a response with status code **404**, perhaps with a page saying "not found".
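As a quick aside, the URL breakdown in the first bullet point can be checked with Python's standard `urllib.parse` module. This is only a minimal sketch for inspecting a URL; it does not make a request.

```python
from urllib.parse import urlsplit

# split the example URL from the list above into its parts
parts = urlsplit('http://www.bbc.co.uk/news')

print(parts.scheme)   # the protocol
print(parts.netloc)   # the server
print(parts.path)     # the resource on that server
```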
    
So, how do we do this in Python?

Well, like all things, there are several different methods we can use, but the most common and often most effective method is to use the `requests` library. Now, let's see an example of making a request.

In [None]:
# import the library
import requests

# let's specify a URL, the KCL Library Services page
u = 'https://www.kcl.ac.uk/library/index.aspx'

# and now make a GET request
r = requests.get(u)

# and see the status code
print(r.status_code)

When you run the code above, you should see the number `200` printed out. Hoorah, we made a successful request. And just to prove it, let's make a nonsensical request.

In [None]:
# let's specify a URL that doesn't exist
u_bad = 'https://www.kcl.ac.uk/notthispapge'

# and now make a GET request
r_bad = requests.get(u_bad)

# and see the status code
print(r_bad.status_code)

Uh oh, seems they haven't gotten around to making that page yet. As you can see, we got an error, a `404` meaning `NOT FOUND`.
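If you ever forget what a status code means, the standard library can tell you. The sketch below uses `http.HTTPStatus` to look up the official phrase for the two codes we have seen so far; it works offline and is separate from the `requests` library.

```python
from http import HTTPStatus

# look up the standard phrase for each status code
for code in (200, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)
```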

### But What Does it Mean?

Okay, so with that first URL we've got a response back, but what did it say? Let's look at a special property of the response, called its `text`, and see if we can make sense of it.

In [None]:
print(r.text)

Hopefully you all recognise this output as HTML (hyper-text markup language), but it's going to take us a very long time to figure out what it says and what it means. Luckily, there's also a library we can use to start to pick apart some of that HTML and to take something more meaningful out of it.

That library is called BeautifulSoup, so named because it takes a messy "soup" of text, and turns it into something more structured (and beautiful). 

Let's imagine we wanted to get the names of some of the navigation links from this KCL website. The Library Services page has been set up so that these links are structured as items in an HTML list. There are a lot of these list items in the web page, so we are going to choose a selection: those with a specific attribute value (more on this below). The code below should do that for us. Run it now and we'll see what happens.

In [None]:
# import BeautifulSoup
from bs4 import BeautifulSoup

# create a new BeautifulSoup object from our text
# naming the parser explicitly avoids a warning and keeps results consistent
tree = BeautifulSoup(r.text, 'html.parser')

# Find all the list items <li> on the page with the role of listitem
li_elements = tree.findAll('li', {'role': 'listitem'})
print(li_elements)

Wait, what happened there? Well, the important thing lies in this line here:

    li_elements = tree.findAll('li', {'role': 'listitem'})
    
With the new tree we've created (some people call this object soup, or something similar – the name really doesn't matter), we've used the `findAll()` method to find the part of the web page that we're interested in. By looking at the page's source code in our web browser, we can see that the list items (which are being used as links) are contained in `li` elements.

However, we want to find specific links. Often a site marks elements out with an attribute, such as a *class* that helps to style the element in the way the designer wants. Here the attribute we are interested in is `role`. So, we supplied two arguments to `findAll()`: a string saying what kind of element we want, and a dictionary saying what attributes that element must have (in this case, a `role` of `listitem`). `findAll()` then found all the possible matches, and returned them to us as a Python list.

However, that output is still a little messy and hard to read. Ideally, we just want the list items without the extra tags in the way. Luckily, BeautifulSoup is way ahead of us, and has a way for us to do this: the `text` property. For example, see the code below.

In [None]:
# loop through the list items and print the text of each one
# (using a separate loop variable so we don't overwrite the list itself)
for li in li_elements:
    print(li.text)
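Because the live page may change over time, it can help to try the same technique on a small piece of HTML that you control. The sketch below uses a made-up HTML snippet (the link names are assumptions, not taken from the KCL page) and `find_all()`, which is the current spelling of `findAll()` and behaves the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for illustration only
html = """
<ul>
  <li role="listitem"><a href="/a">Opening hours</a></li>
  <li role="listitem"><a href="/b">Borrowing</a></li>
  <li><a href="/c">Unrelated link</a></li>
</ul>
"""

tree = BeautifulSoup(html, 'html.parser')

# only <li> elements whose role attribute is "listitem" are matched
matches = tree.find_all('li', {'role': 'listitem'})

# .text gives the text inside each element, without the tags
labels = [li.text.strip() for li in matches]
print(labels)
```

The third `<li>` has no `role` attribute, so it is left out of the results.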

It's a lot to take in, so have a read through some online tutorials before you go on to the next bit. There are some listed on the KEATS page for you. Here are some more that introduce this from the basics:

Dataquest's Tutorial: Python Web Scraping Tutorial using Beautiful Soup:<br /> 
https://www.dataquest.io/blog/web-scraping-tutorial-python/ 

ScrapeHero's Tutorial: Build a web scraper for Reddit using Python and BeautifulSoup <br /> 
https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-2-build-a-scraper-for-reddit/ 

Excellent! Now we should know everything we need to tackle the next exercise!

### EXERCISE 2
<font color=purple>Wikipedia is a fascinating resource. This page, for example, is a List of Common Misconceptions: https://en.wikipedia.org/wiki/List_of_common_misconceptions . 
Your task is to use Python and Beautiful Soup to scrape the page and extract the subheadings, i.e. Food and cooking; Law, crime, and military; Literature; etc. <br /> 

**[25 marks]**</font>

In [None]:
# Your code here


# PART 3

## The `statistics` Module

Now we're going to use London Borough data to do some basic statistics. 

Now that we're more advanced in our Python knowledge, and we know about libraries, functions and complex types, this section shouldn't need too much introduction.

Python has a statistics library which we can use to work out basic statistics based on complex data types. For example, if I wanted to find out some statistics about the ages of students, I could do the following.

In [None]:
# let's get the ages of students
student_ages = [19, 20, 19, 21, 22, 19, 20, 20, 24, 19, 20, 20, 19, 20, 21]

# import our function 
from statistics import mean

# work out the mean
mean_age = mean(student_ages)
print('The mean age of a student in the class is', mean_age)

Likewise, for the median:

In [None]:
from statistics import median

median_age = median(student_ages)
print('The median age is', median_age)

And for the mode:

In [None]:
from statistics import mode

mode_age = mode(student_ages)
print('The most common age is', mode_age)
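One thing to watch out for with the mode is a tie, where two or more values are equally common. From Python 3.8 onwards, `mode()` returns the first of the tied values it encounters, while `multimode()` returns all of them as a list. The sketch below uses a made-up list of shoe sizes as an assumption for illustration.

```python
from statistics import mode, multimode

# Hypothetical data in which 6 and 7 both appear twice
shoe_sizes = [5, 6, 6, 7, 7, 8]

# multimode() lists every value tied for most common
print(multimode(shoe_sizes))

# mode() returns a single value, the first of the tied values
print(mode(shoe_sizes))
```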

Have a go at some online tutorials on statistics in Python: <br />
https://pythonforundergradengineers.com/statistics-in-python-using-the-statistics-module.html <br />
https://www.dataquest.io/blog/basic-statistics-with-python-descriptive-statistics/ <br />


### EXERCISE 3
<font color=purple>Using code, import the London Boroughs dataset and work out the mean and median population, area and population density (number of people per square mile) of all London Boroughs. **[30 marks]**</font>

In [None]:
from boroughs import boroughs

# Your code here

<font color=purple>Now, work out which Borough has each of the following:
    
* The highest population **[2 marks]**
* The lowest population **[2 marks]**
* The greatest population density **[2 marks]**
* The lowest population density **[2 marks]**
* The greatest area **[2 marks]**

*HINT: refresh your knowledge with https://www.tutorialspoint.com/list-methods-in-python-in-not-in-len-min-max* 

</font>



In [None]:
# Your code here

<font color=purple> Finally, what are the mean, median and mode length of the **name** (i.e. of the word) of a London Borough? **[5 marks]** </font>

In [None]:
# your code here

# Well done!

We've very nearly covered everything you need for your final assignment.   
