# This Week’s Overview

This week we're going to apply a number of the concepts covered in class in order to read a remote CSV file, turn it into data, and then perform some simple analyses on it. We're going to first do it 'by hand' (building the tools we need) and then using the 'pandas' package that gives us a very different way to do things.

## Learning Outcomes

By the end of this practical you should:
- Have read a remote data file
- Have written a function to derive some statistics
- Have imported a package
- Have made use of methods

# Taking a Second to Think

To a computer, reading data from a remote location (e.g. a web site halfway around the world) is not really any different from reading a file that's sitting on your desktop: you just need to set it up a bit differently and, in Python, make use of a different _package_. In all cases (local and remote) Python gives you a 'filehandle' that supports certain operations like 'open a filehandle', 'read a line', 'close a filehandle'. It's called a filehandle because it's something that gives you a 'grip' on a file-like object.
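To make the 'file-like object' idea concrete, here is a minimal sketch that runs the same line-counting loop over a real file on disk and over an in-memory stand-in for a remote response. The function name `count_lines` and the two-row sample text are made up for illustration, and the sketch is written so that it runs under Python 3.

```python
import io
import os
import tempfile

def count_lines(filehandle):
    """Count the lines available from any file-like object."""
    total = 0
    for _ in filehandle:  # 'read a line' until there are none left
        total += 1
    return total

# Made-up sample text standing in for a small CSV file.
text = u"id,Name\n1,Greater London\n2,Greater Manchester\n"

# 1. A real file on disk (written to a temporary location first).
path = os.path.join(tempfile.mkdtemp(), "cities.csv")
with io.open(path, "w", encoding="utf-8") as fh:
    fh.write(text)
with io.open(path, "r", encoding="utf-8") as fh:  # opened here, closed on exit
    lines_on_disk = count_lines(fh)

# 2. An in-memory object that behaves like a file.
lines_in_memory = count_lines(io.StringIO(text))

print(lines_on_disk, lines_in_memory)
```

The point is that `count_lines` never needs to know where the lines come from: anything you can loop over line by line will do.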

I also want you to understand how we're approaching the problem set out above: we want to read a remote file (i.e. a text file halfway round the planet), turn it into a local data structure (i.e. a list or a dictionary), and then perform operations on it using functions (e.g. calculate the mean, find the easternmost city, etc.).

_**But**_, we _don't_ try to do all of this at once by writing a whole lot of code and then hoping for the best; when you're tackling a problem like this you break it down into separate, simpler steps, and then tick them off one by one. So first we'll get Python _reading_ remote data, then we'll _convert_ text into data, and finally we'll _analyse_ the data using functions.
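As a rough sketch of what 'one step at a time' can look like, here are three small functions, one per step, run against an in-memory copy of the first three rows of the file linked in this practical. The function names (`read_lines`, `to_rows`, `mean_of`) are hypothetical and this is only one of many ways to split the work; it is written to run under Python 3.

```python
import io

# The header and first three rows of the CSV used in this practical.
SAMPLE = u"""id,Name,Rank,Longitude,Latitude,Population
1,Greater London,1,-18162.92767,6711153.709,9787426
2,Greater Manchester,2,-251761.802,7073067.458,2553379
3,West Midlands,3,-210635.2396,6878950.083,2440986
"""

def read_lines(filehandle):
    """Step 1: read raw lines, stripping trailing whitespace."""
    return [line.rstrip() for line in filehandle]

def to_rows(lines):
    """Step 2: split each non-empty line on commas; row 0 is the header."""
    rows = [line.split(",") for line in lines if line]
    return rows[0], rows[1:]

def mean_of(header, rows, column):
    """Step 3: average one numeric column, looked up by name."""
    idx = header.index(column)
    values = [float(row[idx]) for row in rows]
    return sum(values) / len(values)

lines = read_lines(io.StringIO(SAMPLE))
header, rows = to_rows(lines)
mean_population = mean_of(header, rows, "Population")
print(mean_population)
```

Each function can be checked on its own before you move on to the next. Note that splitting on commas is a simplification: it would break on fields that contain quoted commas, which Python's `csv` module handles for you.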

_**Note**_: you might also find it helpful to take a close look at the URL that we are reading by pasting this link into a web browser: http://www.reades.com/CitiesWithWikipediaData-simple.csv. 

# Step 1: Reading a Remote File

So, as I said, we are going to parse a data file hosted on a remote web site. This is just the first step: we are going to build from it towards more substantial exercises and, eventually, you could easily request megabytes of data in real-time according to flexibly-specified parameters!

You'll probably need to do a quick Google in order to make sense of what you're about to do; I'd suggest "`read remote CSV file Python urllib2`".

In [7]:
import urllib2

help(urllib2.urlopen)

url = "http://www.reades.com/CitiesWithWikipediaData-simple.csv"
print("URL: " + url + ".\n")

response = urllib2.urlopen(url) # ???
for line in response: # ???
    print line.rstrip()

Help on function urlopen in module urllib2:

urlopen(url, data=None, timeout=<object object>, cafile=None, capath=None, cadefault=False, context=None)

URL: http://www.reades.com/CitiesWithWikipediaData-simple.csv.

id,Name,Rank,Longitude,Latitude,Population
1,Greater London,1,-18162.92767,6711153.709,9787426
2,Greater Manchester,2,-251761.802,7073067.458,2553379
3,West Midlands,3,-210635.2396,6878950.083,2440986
4,West Yorkshire,4,-185959.3022,7145450.207,1777934
5,Glasgow,5,-473845.2389,7538620.144,1209143
6,Liverpool,6,-340595.1768,7063197.083,864122
7,South Hampshire,7,-174443.8647,6589419.084,855569
8,Tyneside,8,-187604.3647,7356018.207,774891
9,Nottingham,9,-131672.2399,6979298.895,729977
10,Sheffield,10,-163545.3257,7055177.403,685368


If you've managed to get the code above to run and have received 11 rows of text in response to your `urlopen` query then congratulations: you've now read a text file sitting on a server in, I think, Alberta, Canada, and Python _didn't care_.

The last row should be `10,Sheffield,10,-163545.3257,7055177.403,685368`.

### URLLIB2

In this particular case, I 'gave' you the fact that you'd need to make use of the `urllib2` package in order to read the file. But you could almost certainly have Googled this for yourself using something like 'Python read file on server'...

`urllib2` is a very useful library, but compared to pandas (which we'll see later), it's pretty simple since it just sends a 'request' to a web site and 'reads' the results. You can also do things like submit a form (e.g. you could submit hundreds of applications for a free TV if someone ran a competition... that's why they have 'captchas' now).
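Here is a small sketch of that 'request, then read' pattern wrapped up in a function. Two things are assumptions of mine rather than part of the practical: the function name `read_remote_lines`, and the `opener` parameter, which lets us swap in a fake `urlopen` so that the example runs without touching the network. I'm also assuming the remote file is UTF-8 text. The import fallback is there because `urllib2` only exists in Python 2; in Python 3 the same `urlopen` function lives in `urllib.request`.

```python
import io

try:                     # Python 2
    from urllib2 import urlopen
except ImportError:      # Python 3: urllib2 became urllib.request
    from urllib.request import urlopen

def read_remote_lines(url, opener=urlopen):
    """Return the decoded, right-stripped lines of the resource at `url`.

    `opener` is a hypothetical hook for this sketch: anything that takes a
    URL and returns a file-like object yielding bytes.
    """
    response = opener(url)
    try:
        # Assumes the remote text is UTF-8 encoded.
        return [line.decode("utf-8").rstrip() for line in response]
    finally:
        response.close()

def fake_opener(url):
    """Stand-in for urlopen so the example needs no network access."""
    return io.BytesIO(b"id,Name\n1,Greater London\n")

lines = read_remote_lines("http://example.invalid/cities.csv", opener=fake_opener)
print(lines)
```

Calling `read_remote_lines(url)` without the `opener` argument would use the real `urlopen` and go out over the network.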

Anyway, that's some background; let's move on to Step 2!

# Step 2: Turning Text into Data

We now need to work on turning that text into useful data. You'll notice that we 

3.1.2 Write a function to get the first n lines of a file? Need to get them using `help()` and `dir()`, so a wrapper function that works like `head -n` would be a good example.
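One possible shape for such a function is sketched below. The name `head` and the default of five lines are my own choices, and the in-memory sample stands in for a real filehandle; it is written to run under Python 3.

```python
import io
from itertools import islice

def head(filehandle, n=5):
    """Return the first n lines of a file-like object, newline-stripped."""
    # islice stops after n items, so the rest of the file is never read.
    return [line.rstrip("\n") for line in islice(filehandle, n)]

# Made-up sample standing in for an open file or remote response.
sample = io.StringIO(u"id,Name\n1,Greater London\n2,Greater Manchester\n")
first_two = head(sample, 2)
print(first_two)
```

If the file has fewer than `n` lines, you simply get back everything there is.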
3.1.3 Now let’s put the pieces together to write some code to find the population of the Liverpool Metropolitan Area. You will need:
- A for loop that iterates over the ‘data’ value of the dictionary
- A way to find the metro area of Liverpool in your list of lists (LoLs)
- A way to retrieve the population value(s) from the list once you find them.
- Note that there is more than one city that is part of the Liverpool Metro Area!
- Here’s the clue:
 
The answer should be 1,354,842.
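For reference, a minimal sketch of the loop might look like the following. Everything about the data here is an assumption: I've guessed at a dictionary with `header` and `data` keys, a column called `MetroArea`, and three rows with invented figures, so the total printed is _not_ the answer quoted above.

```python
# Hypothetical structure and invented figures, for illustration only.
cities = {
    "header": ["Name", "MetroArea", "Population"],
    "data": [
        ["Liverpool", "Liverpool", "500000"],
        ["Birkenhead", "Liverpool", "300000"],
        ["Gloucester", "Gloucester-Cheltenham", "150000"],
    ],
}

# Look the column positions up by name rather than hard-coding numbers.
metro_col = cities["header"].index("MetroArea")
pop_col = cities["header"].index("Population")

population = 0
for row in cities["data"]:
    if row[metro_col] == "Liverpool":
        # Values read from a text file are strings, so convert first.
        population += int(row[pop_col])

print(population)
```

Because more than one row can belong to the same metro area, the loop adds to a running total instead of stopping at the first match.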
3.1.4	Optional: if you would like a pretty-printed number then you will need to change things as follows:

In [None]:
# The next 2 lines normally go at the top of the file,
# and 'en_UK' is not an option because the UK locale is named 'en_GB'
import locale
locale.setlocale(locale.LC_ALL, 'en_US')
# The next line replaces int(population)
locale.format_string("%d", population, grouping=True)
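If all you want is a comma as the thousands separator, a lighter-weight alternative (not the locale-based approach above, just a different technique) is Python's built-in format specification, which needs no locale setup:

```python
population = 1354842

# The ',' format specifier inserts a comma between groups of three digits.
pretty = "{:,}".format(population)
print(pretty)  # prints 1,354,842
```

The trade-off is that this always uses a comma, whereas the `locale` module follows whatever conventions the chosen locale defines.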

# Functions & Maths
4.1.1 Everything we do from here on out will be modelled on 5.1.6 in some way, so even if it gets more complex you can always refer back to your answer to 5.1.6!
4.1.2	Let’s convert the code that gave us the population of the Liverpool Metro Area into a function! We’re going to do this in several steps so that you can see how code can be made more useful and re-usable, and also how you can use incremental change to update your code.
4.1.3	OK, so step 1 is just to get a function working that will give us the population of the Liverpool Metro Area as a float. Recall that functions start with a def and then the function name and any input parameters, and that they can end with either a return value, or with ‘nothing’ (i.e. they don’t give the user back any information). So here’s what you’re aiming for:
  
You will need to figure out what goes in between, but it’s code you’ve already written!
4.1.4	Now, let’s say we want to find the population of the Gloucester-Cheltenham area… how would we change the function we just wrote to do this?
- The easy way would be to just change the text in the middle of the function so that it looks for “Gloucester-Cheltenham” in …, but that doesn’t seem like a very good way to do it, does it? What if we next want to look for London’s population, or Glasgow’s?
- The better way is to think about passing an input parameter to the function so that it can look for any area, not just Liverpool! Here’s your clue:
 
- Notice that we set a default parameter value here. In the long run it might make more sense not to have any default, since it’s not clear why we’d want to know the population of the Liverpool area by default, but this keeps the behaviour roughly the same. So if you call `populationOfMetroArea()` then you will get the population of the Liverpool Metro Area, and if you call `populationOfMetroArea('Gloucester-Cheltenham')` then you will get the population of a different Metro Area; in this case it will be 266,500.
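A sketch of what such a function could look like is below. As before, the `cities` dictionary, its `MetroArea` column and the figures in it are invented for illustration, so the totals will not match the real answers; the function reads from that module-level variable only because the calls shown above take no data argument.

```python
# Hypothetical structure and invented figures, for illustration only.
cities = {
    "header": ["Name", "MetroArea", "Population"],
    "data": [
        ["Liverpool", "Liverpool", "500000"],
        ["Birkenhead", "Liverpool", "300000"],
        ["Gloucester", "Gloucester-Cheltenham", "150000"],
    ],
}

def populationOfMetroArea(metro="Liverpool"):
    """Sum, as a float, the population of every row in the given metro area."""
    metro_col = cities["header"].index("MetroArea")
    pop_col = cities["header"].index("Population")
    total = 0.0
    for row in cities["data"]:
        if row[metro_col] == metro:
            total += float(row[pop_col])
    return total

print(populationOfMetroArea())                         # uses the default
print(populationOfMetroArea("Gloucester-Cheltenham"))  # overrides it
```

Notice that the body is the same loop as before: the only real change is that the name being searched for now arrives as a parameter.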
4.1.5	But how do we know what Metro Areas there are in this data set? Let’s work towards writing another function to return unique values, but as before let’s start with something simple…
- We need a list to store all of the values
- We need a for loop to look through the data
- And we need to make the list into a `set()` when we’re done.
- Here’s your clue:
 
4.1.6	Can you think how you would write 6.1.5 using a dictionary instead? [Note: 6.1.5 and 6.1.6 are largely equivalent, but with a very big file with relatively few keys then 6.1.6 would be faster and more efficient. If you think you know why, please post to the #practical channel on Slack!]
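The two approaches can be sketched side by side as follows. The rows and the column position are invented for illustration, and I've left the question of _why_ one might be faster for you to work out.

```python
# Invented rows: [Name, MetroArea, Population]
data = [
    ["Liverpool", "Liverpool", "500000"],
    ["Birkenhead", "Liverpool", "300000"],
    ["Gloucester", "Gloucester-Cheltenham", "150000"],
]
metro_col = 1  # position of the (hypothetical) MetroArea column

# Version 1: collect every value in a list, then de-duplicate with set().
values = []
for row in data:
    values.append(row[metro_col])
unique_from_list = set(values)

# Version 2: use dictionary keys, which can only ever appear once.
seen = {}
for row in data:
    seen[row[metro_col]] = True
unique_from_dict = set(seen.keys())

print(sorted(unique_from_list))
print(sorted(unique_from_dict))
```

Both versions end up with the same two metro areas; they differ only in what they store along the way.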
4.1.7	Let’s turn 6.1.5 into a function so that we can get the unique Metro Area values. By now, this should be starting to make sense, so I’m just going to give you this:
 
4.1.8	Up to here, we’ve ‘hard-coded’ the field that we want to process, so now let’s make it a bit more flexible by allowing it to find unique values for any column! Here’s what you need to consider:
- We need to add input parameters to the function…
  - One of these will have to be the data
  - The other will have to be the column to search
- But there shouldn’t be much else that needs changing (except the name):
  
- Now find the unique values for the ‘Name’ column in the data set. You should get 73 unique city names.
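A minimal sketch of the generalised function, assuming the same hypothetical `header`/`data` dictionary as a stand-in for the real data set (so you will see three names here, not 73), and with `uniqueValues` as a made-up name:

```python
def uniqueValues(data, column):
    """Return the set of distinct values found in the named column."""
    idx = data["header"].index(column)  # raises ValueError if not found
    return set(row[idx] for row in data["data"])

# Hypothetical structure and invented figures, for illustration only.
cities = {
    "header": ["Name", "MetroArea", "Population"],
    "data": [
        ["Liverpool", "Liverpool", "500000"],
        ["Birkenhead", "Liverpool", "300000"],
        ["Gloucester", "Gloucester-Cheltenham", "150000"],
    ],
}

names = uniqueValues(cities, "Name")
metros = uniqueValues(cities, "MetroArea")
print(len(names), len(metros))
```

Passing both the data and the column name in as parameters means the same function works for any column of any data set with this layout.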
# Revisiting/Revising Shakespeare
5.1.1	Read with pandas? Can you get most-used words in a histogram? Compare to stuff from another playwright? Should we recommend a particular library and have them learn how it works?

# Parsing a Text File

Update the call to readTxt() and check that line 2 of the data is:
http://www.gutenberg.org/cache/epub/100/pg100.txt