<div style="text-align:center">
<h1>Functions & Packages</h1>
<h2>7SSG2059 Geocomputation 2016/17</h2>
<h3>_Jon Reades_</h3>
</div>

## This Week’s Overview

This week we're going to apply a number of the concepts covered in class in order to read a remote CSV file, turn it into data, and then perform some simple analyses on it. We're going to first do it 'by hand' (building the tools we need) and then do it using the 'pandas' package that gives us a very different way to do things.

## Learning Outcomes

By the end of this practical you should:
- Have read a remote data file
- Have written a function to derive some statistics
- Have imported a package
- Have made use of methods

## Tackling Programming Problems

I also want you to understand how to approach the kind of problem I've just set: turning a text file into data and performing some analysis.

_**Note**_: you might also find it helpful to take a close look at the URL that we are reading by pasting this link into a web browser: https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/CitiesWithWikipediaData-simple.csv. 

### Analyse the Problem

The first step to writing a program is thinking about your goal and the steps required to achieve that. We _**don't**_ write programs like we write essays: all at once by writing a whole lot of code and then hoping for the best when we hit 'submit'. 

When you're tackling a programming problem you break it down into separate, simpler steps, and then tick them off one by one. Doing this gets easier as you become more familiar with programming, but it remains crucial and, in many cases, good programmers in large companies spend more time on _design_ than they do on actual _coding_.

Some steps in a program are done so many times by so many people that, eventually, someone writes a _package_ that bundles up those operations into something easy to use that saves _you_ having to figure out the gory details. Reading a file (even one on a computer halfway round the world) is one of those things. Analysing a data file for you is probably not.

![xkcd: Easy vs. Hard](https://imgs.xkcd.com/comics/tasks.png)

### Packages & Functions

To a computer, reading data from a remote location (e.g. a web site halfway around the world) is not really any different from reading data that's sitting on your desktop: to simplify things a great deal, it really just needs to know the location of the file and an appropriate _protocol_ for accessing that file (_e.g._ http, https, ftp, local...) and then a clever programming language like Python will typically have packages that can take care of the rest.

In all cases -- local and remote -- you use the package to handle the hard bit of knowing how to actually 'read' data (because all files are just `1`s and `0`s of data) at the _device_ level and then Python gives you back a 'file handle' that helps you to achieve things like 'read a line' or 'close an open file'. You can think of a filehandle as something that gives you a 'grip' on a file-like object no matter where or what it is, and the package is the way that this magic is achieved.
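To make the idea of a 'grip' on a file-like object a bit more concrete, here is a minimal sketch that runs without any network connection. It assumes Python's built-in `io.StringIO` as a stand-in for a real file (it is not what we use later in this practical), and the text it wraps is a cut-down version of the cities file.

```python
import io

# io.StringIO wraps a string in a file-like object, so we can
# practise with a 'file handle' without touching the disk or network.
handle = io.StringIO(u"id,Name,Rank\n1,Greater London,1\n")

firstLine = handle.readline()   # 'read a line'
print(firstLine.rstrip())

for line in handle:             # carries on from where readline() stopped
    print(line.rstrip())

handle.close()                  # 'close an open file'
print(handle.closed)
```

The same three operations (read a line, loop over the remaining lines, close) are what you'd expect from most file-like objects, wherever the data actually lives.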

#### Q. How is a package different from a function?

#### A. A package is just a bundle of useful functions. 

There's a little more to it than this (as we'll see later in the term), but the simplest way to think of this is that when we type `import foo` then we are including functions from the `foo` package in our code. You'll see how this works below.
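Since `foo` isn't a real package, here is a small sketch using `math` from Python's standard library as a stand-in (my choice for illustration, not something we need for this practical). It shows the basic pattern: import the package, then reach the functions inside it with a dot.

```python
import math   # Part of Python's standard library

# Functions inside the package are reached with a dot
root = math.sqrt(16)
print(root)

# Packages can provide values as well as functions
print(math.pi)
```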

The point of the bundle-of-functions is that they can help us to achieve quite a lot very quickly since we don't need to reinvent the wheel and can just make use of someone else's code. In the same way that we won't mark you down for Googling the answer to a coding question, we won't mark you down for using someone else's package to help you get going with your programming. _**That's the whole point!**_

Often, if you're not sure where to start, Google (or StackOverflow) is the place to go:

<pre>
[how to read text file on web server python](https://www.google.co.uk/search?q=how+to+read+text+file+on+web+server+python&oq=how+to+read+text+file+on+web+server+python&aqs=chrome..69i57.629j0j7&sourceid=chrome&ie=UTF-8)
</pre>

Boom!

### Back to Analysing the Programme

OK, so back to our problem: 

* First we want to read a remote file (i.e. a text file somewhere on the planet), 
* Then we want to turn it into a local data structure (i.e. a list or a dictionary), 
* Finally we want to perform some calculations on the data (e.g. calculate the mean, find the easternmost city, etc.).

We can tackle each of those in turn, getting the first bit working, then adding the second bit, etc. It's just like using Lego to build something: you take the same pieces and assemble them in different ways to produce different things.

## Step 1: Reading a Remote File

So, as I said, in Step \#1 we are going to download a file hosted on a remote web site at:

* https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/CitiesWithWikipediaData-simple.csv (this can also be stored as a bit-link to make it easier to copy+paste and avoid _really_ long lines: http://bit.ly/2vrUFKi)

We aren't going to try to turn it into data or otherwise make 'sense' of it, we just want to **get** it. We are then going to build from this first step towards more substantial exercises and, eventually, you could easily request Megabytes of data in real-time according to flexibly-specified parameters!

You'll probably need to do a quick Google in order to make sense of what you're about to do; I'd suggest 
```
read remote CSV file Python urllib2
```

### Getting help with packages

Of course, just knowing that you need `urllib2` doesn't help you to _use_ it. In addition to finding example code on StackOverflow, you can also ask the package itself for help with `dir` and `help`.

#### dir

The 'Dive into Python' web site will tell you that "dir returns a list of the attributes and methods of any object". That introduces yet another term ('modules') that we don't want to get into right now, but _everything_ in Python is an object and so `dir` will give you help with packages, variables, functions... you name it.

What **`dir`** gives you is information about things you can potentially do: it's like navigating the menu of a web site -- you aren't yet looking at the information you need, you're trying to figure out if the site even _has_ what you need. So `dir` on a package will give you a list of the functions (and any variables) that the person who created the package has provided.

Typically, the information given by `dir` is highly abbreviated and is really just a prelude to using `help`.

#### help

The `help` function gives you the actual detail you need about how to use a particular function: what are the inputs, what are the outputs, and what will the function actually _do_?

Let's see this in action!

In [None]:
import urllib2
print("dir on urllib2 returns:\n")
print(dir(urllib2))
print("\n\n")
print("help on urlopen returns:\n")
print(help(urllib2.urlopen)) # Notice!

You can also get help in Jupyter by typing `?urllib2.urlopen` in a code block and then hitting 'run'.

In [None]:
?urllib2.urlopen

### Let's get started!

In [None]:
import urllib2

# Given what I've written above, what do you 
# think the value of 'url' should be? What
# type of variable is it? int or string? 
url = ???
print("URL: '" + url + "'.\n")

# Read the URL and copy the results to 
# a variable called 'response'.
response = urllib2.urlopen(url)

print("Variable is of type: '" + response.__class__.__name__ + "'.\n")

Now that we've got a response variable let's try to print out the contents of the file:

In [None]:
# The 'response' can behave like a list in 
# a for loop... we can create a temporary
# variable called line, and each time we 
# ask 'response' it will give us a new line.
for line in response:
    print ???.rstrip()

If you've managed to get the code above to run after fixing the '???' and have received 11 rows of text in response to your `urlopen` query then, congratulations: you've now read a text file sitting on a server in, I think, Alberta, Canada and Python _didn't care_.

The last row should be `10,Sheffield,10,-163545.3257,7055177.403,685368`.

### What is URLLIB2?

In this particular case, I 'gave' you the fact that you'd need to make use of the `urllib2` package in order to read the file. But you could certainly have Googled this for yourself using something like 'Python read file on server' or 'Python read remote file'...

Urllib2 is a very useful library (another, less technical, name for a 'package'), but compared to pandas (which we'll see next week), it's pretty simple since it just sends a 'request' to a web site and 'reads' the results. But on top of that you can build much more complex things that even pandas can't handle:

* You could submit hundreds of online applications for a free TV if someone ran a competition... (though that's why they have 'captchas' now).

* You could automate boring stuff such as the conversion of data between two Census years using [GeoConvert](http://geoconvert.mimas.ac.uk/) -- which is exactly what Phil Hubbard and I did for a recent article on gentrification in London that used more than 30 data sets and 175 variables, _all_ of which needed conversion. Rather than do this manually by uploading files, I wrote a package that automated this whole process using urllib2.

Anyway, that's some background, let's move on to step 2! 

## Step 2: Turning Text into Data

We now need to work on turning the response we got to our `urllib2` request into useful data. You'll notice that we are dealing with a _CSV_ (Comma-Separated Value) file and that the format is quite simple since none of the rows have fields that *themselves* contain commas. So to turn this into data we just need to _split_ the row into separate fields using the commas.

In the code below, `dir('string')` lists the available functions for strings (because `'string'` is itself a _String_); we could just as easily have written `dir('foo')` or `dir('supercalifragilisticexpialidocious')` because 'foo' and 'supercalifragilisticexpialidocious' are also strings and so have the _same_ functions available.

In the output below, the functions that start and end with `__` are generally considered private, so you can skip over these and focus on the ones further down that are designed to be useful to programmers. Can you spot the method that is most likely to be useful?
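If the full `dir` output feels overwhelming, one option is to filter out the `__` names before printing. This is just a sketch of the idea; the variable name `publicNames` is my own and `startswith` is one of the string methods you'll find in the list.

```python
publicNames = []
for name in dir('string'):
    # Skip anything that starts with a double underscore
    if not name.startswith('__'):
        publicNames.append(name)

print(publicNames)
```

What's left is a much shorter list of methods that are intended for everyday use.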

### A Brief Interlude

Just in case you need help pronouncing supercalifragilisticexpialidocious:

[![Mary Poppins](http://img.youtube.com/vi/tRFHXMQP-QU/0.jpg)](http://www.youtube.com/watch?v=tRFHXMQP-QU)

Remember that you can find out what _methods_ are supported by a string using `dir(<string>)`:
```python
dir('supercalifragilisticexpialidocious')
```
I'm going to save you some time (_this_ time!) and tell you that we're interested in the `split` method. Why not use the `help` function to figure out how to make use of it?

In [None]:
help('supercalifragilisticexpialidocious'.split)

Now, how would you use `split` to turn this word into a list like this: 
```python
['sup','rcalifragilisticexpialidocious']
```

In [None]:
if ['sup','rcalifragilisticexpialidocious']=='supercalifragilisticexpialidocious'.split(???):
    print("You got it!")
else:
    print("Not yet!")

In [None]:
# Some other methods
print('supercalifragilisticexpialidocious'.upper())
print('supercalifragilisticexpialidocious'.title())

OK, so you've tracked down the way to split a string using a delimiter and _even_ how to limit the number of 'words' that come out of the split operation. You might want to make a mental note of some of the other useful functions that are available for strings: `splitlines`, `upper`, `lower`, `rstrip`, and the whole `is...` set of functions. We work a lot with strings, so it's handy to get to know the readily-available methods well.
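Here's a quick sketch of a few of those methods in action. The `text` value is made up for illustration (two short rows with some stray whitespace), so treat it as a sandbox rather than as real data from the file.

```python
text = "  Sheffield,10\nNottingham,9  \n"

lines = text.splitlines()          # Break the text up on newlines
print(lines)

cleaned = lines[1].rstrip()        # Drop trailing whitespace only
print(cleaned)

print(cleaned.upper())
print(cleaned.lower())

print("685368".isdigit())          # Every character is a digit
print("-163545.3257".isdigit())    # '-' and '.' are not digits
```

Notice that `isdigit` is quite strict: a perfectly good negative or decimal number fails the test because of the `-` and `.` characters.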

Let's test string splitting using our sample data (the last line of the 'simple' CSV file) to make sure it works the way we think it does... We want to turn the string below into a list like this:
```
['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']
```

In [None]:
test = "10,Sheffield,10,-163545.3257,7055177.403,685368".split(???)
print test

Cool! But, first, a question: why do you think that I consider 
```python
['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']
```
to be data and not
```python
"10,Sheffield,10,-163545.3257,7055177.403,685368"
```

Here's a clue:
```python
print("The population of " + myList[1] + " is " + myList[5])
```

Now that you've figured out how to make use of the appropriate method using `help` and a simple test, it's time to revise the code above so that it turns the remote file into data. You can hopefully see how we're breaking a complex problem down into a set of _increments_, each of which is a bit easier to write and understand. 

Now, remember that ultimately we want to make use of the data in this file, so simply printing it back out isn't particularly helpful. What we really want to do is stash the data we've read in some kind of _data structure_ that resembles the CSV file but is easier and faster for the computer to navigate.

Instinctively\* it seems like we need:
1. To keep the rows in order
2. To keep the columns in order

With what we've covered in previous sessions and what we've covered in class, what approach might allow us to do this? _Hint:_ is it likely to be a dictionary? 

_\* We'll see why instinct isn't always right later..._

In [None]:
import urllib2 # You don't need to keep reimporting urllib2 if you've done it above, 
               # but this is helpful if you want to skip the earlier stuff and jump
               # straight to a section further down the page in the future.

url = "http://bit.ly/2vrUFKi" # A bit-link to save space

cityData = [] # Somewhere to store the data

response = urllib2.urlopen(url) 
for line in response: 
    cityData.append( line.rstrip().split(',') )

print cityData # Check it worked!

If it worked, then you should have this output:
```python
[['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], ['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], ['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], ['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986'], ['4', 'West Yorkshire', '4', '-185959.3022', '7145450.207', '1777934'], ['5', 'Glasgow', '5', '-473845.2389', '7538620.144', '1209143'], ['6', 'Liverpool', '6', '-340595.1768', '7063197.083', '864122'], ['7', 'South Hampshire', '7', '-174443.8647', '6589419.084', '855569'], ['8', 'Tyneside', '8', '-187604.3647', '7356018.207', '774891'], ['9', 'Nottingham', '9', '-131672.2399', '6979298.895', '729977'], ['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']]
```
To you that might look a lot _worse_ than the data that you originally had, but to a computer that list-of-lists is something it can work with; check it out:

In [None]:
for c in cityData:                                     # For each row in the list
    print("The population of " + c[1] + " is " + c[5]) # Print out the name and population

You need to be careful assuming that, just because it's hard for you to read, it's also hard for a computer to read! We're going to see how this _really_ makes a difference to the power of the `pandas` library to do all sorts of clever statistical analysis next week.

## More on Functions

Everything we do from here on out will be modelled on the code that we've just finished, so if you get lost you can always come back to this step and start over! Sometimes, you can tie yourself in knots thinking about a problem and it ends up being easier to throw everything out and start again with a simpler approach or new angle...

We can write functions in much the same way that we developed the code above: incrementally. Rather than just sitting down, typing out your function, and hoping for the best, it's often easier to write the code first and _then_ turn it into a function! Let's try this for the code we've written so far by creating a function that will access _any_ URL and return a 'simple' list-of-lists that represents any CSV data stored at the URL's location. 

### Designing a function

The first stage in writing a function is to figure out what inputs and outputs it should have. For this function that is fairly straightforward:
* We give the function a URL that we want it to read in
* The function gives us back a list-of-lists containing the CSV data

We should also give it a name that is fairly obvious to anyone else who comes along and tries to read our code; how about: `readRemoteCSV`?
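To illustrate the 'code first, function second' idea on something much smaller, here is a sketch that uses a hypothetical `countFields` function (not part of this practical). The input is one line of text and the output is the number of fields in it.

```python
# Step 1: get it working as plain code
line = "10,Sheffield,10,-163545.3257,7055177.403,685368"
fields = line.split(',')
print(len(fields))

# Step 2: wrap the same lines in a function.
# Input: one line of text. Output: the number of fields.
def countFields(line):
    """
    Counts the comma-separated fields in a line.
    """
    fields = line.split(',')
    return len(fields)

print(countFields("1,Greater London,1"))
```

The body of the function is the same code as in Step 1: the only changes are the `def` line that names the input and the `return` that hands back the output.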

### Creating a function

This is a good point to hit Google or StackOverflow for some help. We're trying to `"write a function in Python"` so why not search for that? I quickly found some useful hints on sites like TutorialsPoint and, obviously, in the online Python documentation itself.

I've created the sketch of a function below but have left out quite a bit that you'll have to fill in by searching on your own.

In [None]:
def readRemoteCSV(url):
    """
    Reads a remote CSV file and returns
    a list-of-lists containing the data.
    """
    urlData = [] # Somewhere to store the data

    # You've seen all of this before -- we're just
    # doing the work inside a function now, instead
    # of as standalone code.
    response = urllib2.urlopen(???)
    for line in response: 
        urlData.append( line.rstrip().split(???) )
        
    # Where did we store the data? Isn't 
    # that the thing that we want to return?
    return ???

print "URL 1:\n"
data1 = readRemoteCSV("http://bit.ly/2vrUFKi") # Converted to Bit.ly link to save space
print data1

# Now...

print "URL 2:\n"
data2 = readRemoteCSV("http://bit.ly/2iIK9bA") # Converted to Bit.ly link to save space
print data2


### Review!

OK, let's just take a look at that again: do you see how, by packaging up the code as a function, we have made it more useful and more re-usable? We now have a little snippet of code that we can _call_ to process _any_ valid URL. So we called `readRemoteCSV` once to read the 'Cities-simple.csv' file, and again to read the much larger 'Cities.csv' file.

We could read 1,000 other URLs using the same function! Now our code is a lot cleaner because we don't need to keep reading (and writing) lots of lines of code about calling remote files and parsing them. 

The other big gain is that it's also a lot easier to maintain our code now because if we want to _change_ the way that we parse remote CSV files then we only need to do it in _one place_ and then everywhere that we use this function benefits from the improvement... which is what we're going to do now.

## Using the CSV library

Our little CSV function is already useful, but it's a little naive: we are implicitly _assuming_ that none of the fields can contain a comma. Why is that? Before you continue reading, take a moment to think about what `split(',')` does and why it won't work well with a line of data that looks like this:
```python
11,"Cardiff,Caerdydd",11,51.483333,-3.183333,447287
```
Let's try it:

In [None]:
'11,"Cardiff,Caerdydd",11,51.483333,-3.183333,447287'.split(',')

Do you see the problem now? Will this code still work:

```python
for c in cityData:
    print("The population of " + c[1] + " is " + c[5])
```

This is where using code that someone _else_ has written and contributed is helpful: we don't need to think through how to deal with this sort of thing ourselves, we can just import the library that we need and make use of its functionality. In the _simple_ file there are no examples of this issue, but there are in the _full_ data set: we always try to start simple and build from there...

I've given you the skeleton of the answer below, but you'll need to do a little Googling to find out how to `"read csv urllib2 python"`.

In [None]:
import urllib2
import csv

# Redefine the function
def readRemoteCSV(url):
    """
    Reads a remote CSV file and returns
    a list-of-lists containing the data.
    """
    urlData = [] # Somewhere to store the data

    response = urllib2.urlopen(url)
    reader   = ???
    for row in reader: 
        urlData.append( ??? )
        
    return urlData

print "URL 1:\n"
data1 = readRemoteCSV("http://bit.ly/2vrUFKi")
print data1

The advantage of this switch (from `split` to using the `csv` library) is that the csv library knows how to deal with fields that contain commas (or newlines!) and so is much more flexible and consistent than our naive `split` approach. The vast majority of _common_ tasks (reading certain types of files, getting remote files, etc.) have libraries that do exactly what you want without you needing to write much code yourself to take advantage of them. You should always have a look around online to see if a library exists before thinking that you need to write everything/anything from scratch. The tricky part is knowing what words to use for your search and how to read the answers that you find...
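To see the difference without needing a network connection, here is a small sketch that feeds the Cardiff line from above to both approaches. It relies on the fact that `csv.reader` will accept any iterable of text lines, so a plain list works; the variable names are mine.

```python
import csv

# csv.reader accepts any iterable of text lines, so a list works
lines = ['11,"Cardiff,Caerdydd",11,51.483333,-3.183333,447287']

naive = lines[0].split(',')
print(len(naive))          # One field too many

for row in csv.reader(lines):
    parsed = row
    print(len(parsed))     # The six fields we expected
    print(parsed[1])       # The quoted field, with its comma intact
```

The `csv` library treats the double-quotes as wrapping a single field, which is why the comma inside them doesn't trigger a split.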

## Calculating the Mean

Now I'd like you to write a function that will enable you to calculate the mean city size from data retrieved from _any_ URL that contains city data. So you should be able to call _one_ function that will work for both `Cities.csv` and `Cities-simple.csv`. You'll need to look closely at how the two files are 'laid out': where is the population column, and how would we iterate over the rows to find the mean?

### Designing the Function

OK, we know that `readRemoteCSV` will give us back a list-of-lists: the 'big list' contains a large number of 'small lists', each of which represents a row in the data set. Let's break this down:
* We know that we will have a LoL (List-of-Lists) to work from
* We know that each 'small list' represents a row in the data
* We know that the position of the column of interest might change from data set to data set, but it won't change _within_ a data set
* We know that we'll need to convert every value to a... `float`? `int`? Let's assume `float` just to be safe.
* We know that we'll need to sum up the values
* We know that we'll need to keep track of how many rows of data there are

As before, let's start by working it out _as code_, and then package it up _as a function_ once it's working.

In [None]:
# The starting point... using the data retrieved
# from the function that we wrote above...
for row in data1:
    print row

In [None]:
# Now let's track a particular column
col = 3
for row in ???:
    print row[???]

In [None]:
# And now let's figure out the mean
col   = 3
total = 0 # What's the sum of the values?
count = 0 # How many values have we read?
for row in data1:
    print row[col]
    value = row[col]
    count += 1
    total += value

Ooops, that last one didn't work so well. How would you fix this?

Here are _two_ hints:
1. What is the count when you are reading the first row (which contains the column name)?
2. If the value is a `string`, how do you convert it to an `int` or `float`?

_P.S._ I also broke one very important thing deliberately so you will need your debugging skills...

In [None]:
# And now let's figure out the mean
col   = 5   # Which column to read?
total = 0.0 # What's the sum of the values?
count = 0   # How many values have we read?

for row in data1:
    #print(row[col]) # Uncomment to debug
    if count > 0:
        value = float(???)
        total += total
    count += 1

print total/(count-1) # count includes the header row, so subtract 1

### Debugging

You'll notice that there's a line in the above that says:
```python
    #print(row[col]) # Uncomment to debug
```
This is a really common technique used by programmers to figure out what's happening in their code. Many people will spend _more_ time debugging than they will writing the code in the first place! One of the most important ways to debug is simply to print out the values of whatever you're working with: are you seeing what you expected to see? are there values that you hadn't counted on? are all of the values printed out or are some missing? And so on... To 'turn on' debugging, all we have to do is remove the `#` in front of the print statement and everything will start printing out as we process the file. Simple. And useful.
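A plain `print` can hide the difference between a number and a string that _looks_ like a number, so a slightly more informative version of the same trick is to print the `repr` and the type as well. This is only a sketch: the `row` below is the last line of the simple file typed in by hand.

```python
row = ['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']
col = 5

# repr() keeps the quotes, so a string is easy to tell apart from a number
print(repr(row[col]))

# ... and the type confirms it
valueType = type(row[col]).__name__
print(valueType)
```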

OK, we're almost there now: we know _how_ to calculate the mean for any column that is numeric (dealing with non-numeric columns would be nice but that's just adding extra difficulty to this notebook). Now we want to package this up as a function so that you can just write `calcMean(...)` and get back an answer!

Here's a skeleton to get you started:

In [None]:
def calcMean(data, col):
    "Take a list-of-lists and derive the mean for a specified column."
    total = 0.0
    count = 0
    
    for row in data:
        # print(row[col]) # Uncomment to debug
        if count > 0:
            value = float(???)
            total += value
        count += 1
    
    return total/(count-1) # count includes the header row, so subtract 1

data1 = readRemoteCSV("http://bit.ly/2vrUFKi")
print "Mean of simple file populations is: " + str(calcMean(data1,5))

# Now...

data2 = readRemoteCSV("http://bit.ly/2iIK9bA")
print "Mean of big file populations is: " + str(calcMean(data2,5))

And there you go! Done.

You've written two functions: one to read a remote file from a URL, and one to calculate the mean for a simple CSV file of _any_ size. I hope you'll agree that that is pretty handy, but that it's also pretty awkward: we're not doing any type-checking (to see if something is an integer, float, or string) and if we get it wrong the whole thing 'blows up' on us. It's also just kind of _inelegant_ since we have all of these counters (`total` and `count`) to keep track of things... is this what's really going on behind the scenes?
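As a rough sketch of what a little type-checking could look like, here is a hypothetical helper called `toFloat` (my name, not part of the practical) that tries the conversion and returns `None` instead of blowing up. It assumes that a failed conversion raises a `ValueError`, which is what `float` does when it's given a string it can't parse.

```python
def toFloat(text):
    """
    Hypothetical helper: returns text as a float,
    or None if it doesn't look like a number.
    """
    try:
        return float(text)
    except ValueError:
        return None

print(toFloat('685368'))
print(toFloat('Population'))
```

A function like `calcMean` could then skip any value that comes back as `None`, though that is one design choice among several.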

## Wrap-Up: Thinking About Data

I've said before that the way a computer 'thinks' and the way that we think don't always line up naturally. Experienced programmers can think their way _around_ a problem by working _with_ the computer, rather than against it. Let's apply this approach to the parsing of CSV files.

### What's an _Appropriate_ Data Structure?

As you saw when I asked you to calculate the mean population of British cities twice -- once using the simple file, and once using the bigger, more complex file -- there is a 'problem': our list-of-lists isn't very easy to navigate. Not only _might_ the location of the Population column be different in the two files (as it was, deliberately), but when we want to work out the mean we need to step through a lot of irrelevant data as well (in our for loop we need to skip past the name, latitude, longitude, etc.). 

And if I asked you to find the _largest_ city you'd need to do even more work. That doesn't make much sense since this should all be easier and faster in code than in Excel, but right now it's _harder_ and _slower_! When you get into situations like this (having to write a lot of code to do something that should be fast and easy) it is often the case that you've got the wrong _data structure_. 

So how does the experienced programmer get around this? 'Simple': she realises that the data is organised the wrong way! We humans naturally tend to think in rows of data: London has the following _attributes_ (population, location, etc.), and York has a different set of attributes. So we read across the row because that's the easiest way for us to read it. But, in short, a list-of-lists does _not_ seem to be the right way to store this data!

Crucially, a computer doesn't have to work that way. For a computer, it's as easy to read _down_ a column as it is to read _across_ a row. In fact, it's easier, because each column has a consistent _type_ of data: one column contains names (strings), another column contains populations (integers), and other columns contain other types of data (floats, etc.). 

Better still, the order of the columns often doesn't matter as long as we know what they are called: it's easier to ask for the 'population column' than it is to ask for the 6th column since, for all we know, the population column might be in a different place for different files but they are all (relatively) likely to use the 'population' label for the column itself.
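Even with a list-of-lists you can get part of the way there by asking the header row where a column lives. The sketch below uses the header and last row of the simple file, typed in by hand.

```python
header = ['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population']
row    = ['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']

col = header.index('Population')   # Look the position up by name
print(col)
print(row[col])
```

If the columns were in a different order then `col` would change, but the code that uses it wouldn't need to.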

### A Dictionary of Lists

So, if we don't care about column order, only row order, then a dictionary of lists would be a nice way to handle things. And why should we care about column order? With our two CSV files above we already saw what a pain it was to fix things when the layout of the columns changed from one data set to the next. If, instead, we can just reference the 'population' column then it doesn't matter where that column actually is. Why is that?

Well, here are the first four rows of data from the simple city file as a list-of-lists:

```python
['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], 
['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], 
['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], 
['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986']
```

Now, here's how it would look as a dictionary of lists organised by _column_, not by row:

```python
myData = {
    'id'         : [1, 2, 3],
    'Name'       : ['London', 'Manchester', 'West Midlands'],
    'Rank'       : [1, 2, 3],
    'Longitude'  : [-18162.92767, -251761.802, -210635.2396],
    'Latitude'   : [6711153.709, 7073067.458, 6878950.083],
    'Population' : [9787426, 2553379, 2440986],
}

```
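For illustration, here is one way you might get from a list-of-lists to that layout. It's a sketch built around a hypothetical helper called `toColumns` (my name), it assumes that the first row holds the column names, and it leaves every value as a string rather than converting to numbers as in the example above.

```python
rows = [
    ['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'],
    ['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'],
    ['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'],
]

def toColumns(rows):
    """
    Hypothetical helper: turns a list-of-lists (header row first)
    into a dictionary of lists keyed on column name.
    """
    header  = rows[0]
    columns = {}
    for name in header:
        columns[name] = []          # One empty list per column
    for row in rows[1:]:            # Skip the header row
        for i in range(len(header)):
            columns[header[i]].append(row[i])
    return columns

cols = toColumns(rows)
print(cols['Name'])
print(cols['Population'])
```

Either way, the result is organised by column: each key gives you one complete column as a list.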

What does this do better? Well, for starters, we know that everything in the 'Name' column will be a string, and that everything in the 'Longitude' column is a float, while the 'Population' column contains integers. So that's made life easier already. But let's test this out and see how it works.

In [None]:
myData = {
    'Name'       : ['London','Manchester','West Midlands'],
    'Rank'       : [1, 2, 3],
    'Longitude'  : [-18162.92767, -251761.802, -210635.2396],
    'Latitude'   : [6711153.709, 7073067.458, 6878950.083],
    'Population' : [9787426, 2553379, 2440986],
}

# Find the population of Manchester
pop = myData['Population'][myData['Name'].index('Manchester')]
print("The population of Manchester is: " + str(pop))

# Find the easternmost city
city = myData['Name'][myData['Longitude'].index(max(myData['Longitude']))]
print("The easternmost city is: " + str(city))

# Find the mean population of the cities
import numpy as np # Need to import a useful package
mean = np.mean(myData['Population'])
print("The mean population is: " + str(mean))

## Review!

There's a _lot_ of content to process in the code above, so do _not_ rush blindly on if this is confusing. **Stop. Think it through. Talk it out with your neighbour and lecturers.** 

We'll go through each one in turn, but they nearly all work in the same way and the really key thing is that you'll notice that we no longer have to write any loops ourselves (which are slow in Python), just `index` (which does the searching for us and is very fast).

If you want to have a stab at writing the code to print out the 2nd most populous city-region then knock yourself out. _Or_, read on through the rest of the notebook and then come back to this:

In [None]:
# Print out the name of the 2nd most populous city-region
city = myData???
print("The second most populous city is: " + str(city))

### The Population of Manchester

The code can look pretty daunting, so let's break it down into two parts.

What would you get if you ran just this code?
```python
myData['Population'][0]
```
Remember that this is a dictionary-of-lists (DoL). So, Python first looks for a key named `Population` in the myData dictionary. It finds out that the value associated with this key is a _list_ (`[9787426, 2553379, 2440986]`). In this example, it just pulls out the first value (index 0), which is `9787426`. Does that make sense?

Now, to the second part:
```python
myData['Name'].index('Manchester')
```

This is very similar: we look in the dictionary for the key `Name` and find that that's _also_ a list (`['London','Manchester','West Midlands']`, since you asked). If you don't remember what `index` does, don't worry, here's the output from Python's `help()` function:
```
Help on built-in function index:

index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
```
So all we're doing is asking Python to find out the index of 'Manchester' in the list associated with the dictionary key 'Name' _instead_ of just sticking in a `0` to get the first index value. Putting these two things back together what we're doing is:

* Finding the index (i.e. **row**) of 'Manchester' in the Name column,
* Using that index to read a value out of the Population column.
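
The same two steps can be written out one at a time. This is a minimal sketch that repeats a cut-down copy of `myData` (only the two columns we need) so that it runs on its own; the variable name `row` is just an illustrative choice.

```python
# A cut-down copy of myData so that this block runs on its own
myData = {
    'Name'       : ['London', 'Manchester', 'West Midlands'],
    'Population' : [9787426, 2553379, 2440986],
}

# Step 1: find the row number of 'Manchester' in the Name column
row = myData['Name'].index('Manchester')
print(row)  # 1

# Step 2: use that row number to read from the Population column
pop = myData['Population'][row]
print("The population of Manchester is: " + str(pop))
```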

Notice the complete _absence_ of a for loop?

Does that make sense? If it does then you should be having a kind of an Alice-through-the-Looking-Glass moment, because what we've done by taking a column view, rather than a row view, is to make Python's `index()` method do the work for us. Instead of having to look through each row for a field that matches 'Name' and then check to see if it's 'Manchester', we've pointed Python at the right column immediately and asked it to find the match (which it can do very quickly). Once we have a match then we _also_ have the row number to go and do the lookup in the 'Population' column because the index _is_ the row number!
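
To make the contrast concrete, here is a sketch of the same lookup done both ways. The list-of-lists layout and the names `rows`, `columns`, `popFromRows` and `popFromColumns` are assumptions made for illustration (your own list-of-lists from earlier in the practical may well be arranged differently), and 'Atlantis' is simply a made-up name that isn't in the data.

```python
# Row view: a list-of-lists with a header row (illustrative layout)
rows = [
    ['Name', 'Population'],
    ['London', 9787426],
    ['Manchester', 2553379],
    ['West Midlands', 2440986],
]

# We have to work out which field is which, then check every row in turn
nameCol = rows[0].index('Name')
popCol  = rows[0].index('Population')
popFromRows = None
for r in rows[1:]:
    if r[nameCol] == 'Manchester':
        popFromRows = r[popCol]
        break

# Column view: a dictionary-of-lists holding the same data
columns = {
    'Name'       : ['London', 'Manchester', 'West Midlands'],
    'Population' : [9787426, 2553379, 2440986],
}
popFromColumns = columns['Population'][columns['Name'].index('Manchester')]

print(popFromRows == popFromColumns)  # True

# One catch: index() raises a ValueError if the value isn't in the list
try:
    columns['Name'].index('Atlantis')
except ValueError:
    print("No city with that name")
```

Note that `index` still has to search the list behind the scenes; the difference is that the searching happens inside Python's built-in list code rather than in a loop that we have to write and debug ourselves.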

### The Easternmost City

Where this approach really comes into its own is on problems that involve maths. To figure out the easternmost city in this list we need to find the _maximum_ Longitude and then use _that_ value to look up the city name. So let's do the same process of pulling this apart into two steps:

It should be _pretty_ obvious what this does:
```python
myData['Name'][0]
```

But we don't just want the first city in the list, we want the one with the highest longitude. So to achieve that we need to replace the `0` with an index that we found by looking in the `Longitude` list.
```python
myData['Longitude'].index(max(myData['Longitude']))
```

Ugh, that's still a little hard to read, isn't it? Let's write it down another way to make it easier to read:

```python
myData['Longitude'].index(
    max(myData['Longitude'])
)
```

There's the same `.index` which tells us that Python is going to look for something in the list associated with the `Longitude` key. All we've done is change what's _inside_ that index function to `max(myData['Longitude'])`. This is telling Python to find the _maximum_ value in the `myData['Longitude']` list. So to explain this in three steps, what we're doing is:
* Finding the maximum value in the Longitude column (we know there must be one, but we don't know what it is!),
* Finding the index (position) of that maximum value in the Longitude column (now that we know what the value is!),
* Using that index to read a value out of the Name column.
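
Those three steps can be sketched one line at a time. This assumes the same data as `myData` (cut down here to the two columns we need so that the block runs on its own); `maxLon` and `row` are illustrative names.

```python
# A cut-down copy of myData so that this block runs on its own
myData = {
    'Name'      : ['London', 'Manchester', 'West Midlands'],
    'Longitude' : [-18162.92767, -251761.802, -210635.2396],
}

# Step 1: find the largest (i.e. easternmost) longitude
maxLon = max(myData['Longitude'])

# Step 2: find the row number of that value in the Longitude column
row = myData['Longitude'].index(maxLon)

# Step 3: read the name from the same row of the Name column
city = myData['Name'][row]
print("The easternmost city is: " + str(city))
```

If two cities happened to share the maximum longitude, `index` would return the first of them, because it always returns the first match.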

I _am_ a geek, but that's pretty cool, right? In one line of code we managed to quickly find out where the data we needed was even though it involved three discrete steps. Remember how much work it was to find the mean when you were still thinking in _rows_, not _columns_?

### The Average City Size

Yeah, let's try that too.

Here we're going to 'cheat' a little bit: rather than writing our own function, we're going to import a package and use someone _else's_ function. The `numpy` package contains a _lot_ of useful functions that we can call on (if you don't believe me, add "`dir(np)`" on a new line after the `import` statement), and one of them calculates the average of a list or array of data.
```python
import numpy as np # Need to import a useful package
mean = np.mean(myData['Population'])
```
This is where our new approach really comes into its own: because all of the population data is in one place (a.k.a. a _series_ or column), we can just throw the whole list into the `np.mean` function rather than having to use all of those convoluted loops and counters. Simples, right?
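
If you're curious what a mean function has to do for us, the calculation can be sketched without any package at all now that the data is in a single list. This is only an illustration (the names `population`, `total` and `count` are made up here), and `np.mean` should give the same value for this list.

```python
# The Population column from myData, repeated so this block runs on its own
population = [9787426, 2553379, 2440986]

# The mean is just the total divided by the number of values
total = sum(population)
count = len(population)
mean  = total / float(count)  # float() guards against integer division in older Pythons
print("The mean population is: " + str(mean))
```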

## Review!

So the _really_ clever bit in all of this isn't switching from a list-of-lists to a dictionary-of-lists, it's recognising that the latter is a _better_ way to work _with_ the data that we're trying to analyse and that there are useful functions that we can exploit to do the heavy lifting for us. Simply by changing the way that we stored the data in a 'data structure' (i.e. complex arrangement of lists, dictionaries, and variables) we were able to do away with lots of for loops and counters and conditions, and reduce many difficult operations to something that could be done on one line! 

## Formatting a Number

A final, handy trick if you want to output numbers in a _pretty_ way is to understand how the `format` method associated with string objects works. Different cultures format numbers differently: the English use commas as the thousands separator and a full-stop as the decimal separator but the French, naturally, do it their own way.

Here's an example:

In [None]:
print("{:,.2f}".format(mean))

import locale
locale.setlocale(locale.LC_ALL, 'fr_FR') # Raises locale.Error if this locale isn't installed on your machine
print(locale.format_string('%0.2f', mean, grouping=True, monetary=True))

Unfortunately, most programming languages were written by Anglophones and so most applications will happily output Anglo-formatted numbers but are rather less happy doing anyone else's.

More on `format` in the _Python String Format Cookbook_: https://mkaz.tech/python-string-format.html
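
As a quick taste of what the format specification can do, here are a few variations on the example above. The value of `mean` is typed in by hand so that the block runs on its own, and the variable names are illustrative.

```python
mean = 4927263.666666667  # the mean population of our three cities

withCommas = "{:,.2f}".format(mean)    # thousands separator, 2 decimal places
rounded    = "{:,.0f}".format(mean)    # thousands separator, no decimal places
plain      = "{:.1f}".format(mean)     # no separator, 1 decimal place
padded     = "{:>15,.2f}".format(mean) # right-aligned in a field 15 characters wide

print(withCommas)  # 4,927,263.67
print(rounded)     # 4,927,264
print(plain)       # 4927263.7
print(padded)
```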

# Writing a Script

If all of this has made sense to you then the _final_ step in this practical should be easy... Nah, I'm just kidding: there is a _lot_ to make sense of in this practical and you shouldn't worry if it hasn't all become clear to you just yet. But what I _can_ say is that you should read and re-read this until it _does_ make sense. Talk to your classmates. Talk to your teachers. This is the practical that unlocks everything that comes after: how to write code that can be re-used, how to make use of code written by other people, and a little bit about how to _think like a computer_. 

The _best_ way to be sure that you've understood the concepts here is to write a standalone Python script that takes any URL, parses it as a CSV file, and returns a dictionary-of-lists on which you can perform simple numerical analyses like the ones we just did above. You shouldn't need to write much in the way of new code because what you need to do involves _remixing_ the code that we've just been working with! The _only_ thing that will really require some thought is how to change the `readRemoteCSV` function from returning a list-of-lists to returning a dictionary-of-lists. That won't be easy, but it shouldn't be impossible either. Remember that you can always start by breaking the problem down into little steps: How do you turn the header row into dictionary keys? How do you keep track of which column is associated with which key?

Do this now...