# Making Sense of Data with Pandas

Pandas stands for 'Python Data Analysis Library'; it is designed to provide data scientists working in Python with a set of powerful tools to load, transform, and process large-ish data sets. As a result, it has become something of a *de facto* standard, and many of the tutorials and lessons that you can find online will make use of pandas at some point.

You will want to bookmark [the documentation](http://pandas.pydata.org/pandas-docs/stable/) since you will undoubtedly need to refer to it fairly regularly. _Note_: this link is to the most recent, stable release. If you are using an older version of pandas then you'll need to track down the appropriate version from the [home page](http://pandas.pydata.org).

You can always check what version you have installed like this:
```python
import pandas as pd
print(pd.__version__)
```
*Note*: this approach won't necessarily work with _every_ package, but it will work with many of them. Names that start and end with '`__`' (so-called 'dunder' names) are reserved by Python for special purposes; they are not strictly **private**, but any interaction with them should still be approached very, very carefully.

Anyway, the main elements of pandas with which you interact directly are: 
1. the DataFrame; 
2. the Series;
3. the Index. 

Let's take a look:

In [None]:
import pandas as pd
help(pd.DataFrame)

On second thought, let's never do that again. Well, at least not _that_ way! You'll have noticed that the help documentation for the DataFrame is not just a bit longer than anything we've seen before, it's massively longer. There's probably quite a lot of intimidating terminology in there too... Right from the start we get things like "Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)." 

Here's the thing: in the [last notebook](https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/Practical-4-Functions%2C%20Packages%20and%20Methods.ipynb) we came close to inventing something a lot like pandas from scratch. 

So you already _know_ what's going on, or at least have an analogy that you can use to make sense of it. Pandas takes a column-view of data in the same way that our Dictionary-of-Lists did, it's just got a lot more features. That's why the documentation is so much more forbidding and why pandas is so much more powerful.

But at its heart, a pandas data frame ('df' for short) is just a collection of data series (i.e. columns) with an index. Each Series is like one of our column-lists from the last notebook. And the df is like the overarching dictionary that held the collection of data series (serieses?) together. OK? You've seen this before.
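To make the analogy concrete, here is a minimal sketch that builds a tiny data frame from a Dictionary-of-Lists. The three rows are copied from the city data used in this notebook; the variable names (`cities`, `mini_df`) are made up for illustration.

```python
import pandas as pd

# A Dictionary-of-Lists: each key is a column name, each list is a column
cities = {
    'Name':       ['Greater London', 'Greater Manchester', 'West Midlands'],
    'Rank':       [1, 2, 3],
    'Population': [9787426, 2553379, 2440986],
}

# pandas turns each list into a Series and adds a default index (0, 1, 2)
mini_df = pd.DataFrame(cities)

print(type(mini_df['Population']))   # each column is a pandas Series
print(sorted(mini_df.columns))       # the dictionary keys became column names
print(list(mini_df.index))           # the automatically created index
```

So the dictionary keys play the role of column names, and each list becomes one Series inside the data frame.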

Let's try it with last week's data!

In [2]:
import pandas as pd
df = pd.read_csv('http://www.reades.com/CitiesWithWikipediaData.csv')

df.head()

Unnamed: 0,id,Name,Rank,Population,Longitude,Latitude,Area,Density,Subs,MetroArea
0,1,Greater London,1,9787426,-18162.92767,6711153.709,1737.9,5630.0,"London Boroughs, Hemel Hempstead, Watford, Wok...",London
1,2,Greater Manchester,2,2553379,-251761.802,7073067.458,630.3,4051.0,"Manchester, Salford, Bolton, Stockport, Oldham...",Manchester
2,3,West Midlands,3,2440986,-210635.2396,6878950.083,598.9,4076.0,"Birmingham, Wolverhampton, West Bromwich, Dudl...",Birmingham
3,4,West Yorkshire,4,1777934,-185959.3022,7145450.207,487.8,3645.0,"Leeds, Bradford, Wakefield, Huddersfield, Dews...",Leeds-Bradford
4,5,Glasgow,5,1209143,-473845.2389,7538620.144,368.5,3390.0,"Glasgow, Paisley, Clydebank",Glasgow


Check it out!

Instead of having to write a 'readRemoteCSV' function and then manually create a Dictionary-of-Lists from that remote file, we just told pandas to read it for us and it automagically converted it to a data structure that we could view. You'll notice that it even figured out where the column names were. 

All we did with `df.head()` was to ask it to print out the first 5 rows of data. If we wanted to only see the first two rows it would be `df.head(2)`. This is pretty handy, right? 

Also, it deliberately mimics the Unix command-line tool `head` (i.e. `head -5 CitiesWithWikipediaData.csv`). So you've learned two tools for the price of one!

Let's try a few more things:

In [3]:
df.describe()



Unnamed: 0,id,Rank,Population,Longitude,Latitude,Area,Density
count,72.0,72.0,72.0,72.0,72.0,68.0,68.0
mean,36.5,37.0,501933.7,-160383.721227,6918399.0,118.266176,3995.794118
std,20.92845,21.590295,1199693.0,162696.710564,272655.0,231.859227,489.163635
min,1.0,1.0,106940.0,-659942.9337,6512204.0,24.8,3107.0
25%,18.75,18.75,148281.2,-257802.2659,6714778.0,,
50%,36.5,36.5,219622.0,-157710.4947,6863271.0,,
75%,54.25,54.75,373739.5,-50190.238135,7086331.0,,
max,72.0,74.0,9787426.0,142539.1149,7791034.0,1737.9,5630.0


You'll probably have seen a fairly prominent warning ("Invalid value encountered in percentile"), and if you look closely you'll see that there are some fields that report things like '`NaN`' in some of the rows. These are related, but let's take a step back for a second: by calling `describe` we were able to produce an eight-figure summary (count, mean, standard deviation, minimum, three quartiles, and maximum) for _most_ of the columns in the data! That's a pretty handy way to summarise what's in there, right?

So, just by calling `describe`...
1. We've asked Python to describe the data frame and it has returned a set of columns with descriptive metrics for each.
2. Note what is _missing_ from this list: where are 'Name', 'MetroArea', and a couple of the other columns? Can you think why they weren't reported in the descriptives?
3. For the other columns, notice the `NaN`; this is short-hand for 'Not a Number' and it flags up a potential problem. When we are dealing with numeric columns NaNs are an issue because it's hard to know what to do: is something that isn't a number something that should be ignored? Is it a major problem? Is it a 'we don't know the value' or 'we couldn't read the value'? Those are different problems!
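The `count` row above hints at where the NaNs come from: Area and Density report 68 values while the other columns report 72. Here is a small sketch of how you might inspect missing values yourself; the Series below is a made-up stand-in (two real areas from the table, two deliberately missing), not the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two of the four areas are missing
area = pd.Series([1737.9, np.nan, 598.9, np.nan],
                 index=['A', 'B', 'C', 'D'])

print(area.isnull())        # True wherever the value is missing
print(area.isnull().sum())  # how many values are missing
print(area.count())         # count() only counts the non-missing values
print(area.mean())          # mean() skips NaN by default
print(area.dropna())        # a copy with the missing rows removed
```

Whether dropping the missing rows is the right thing to do depends on _why_ they are missing, which is exactly the question raised in point 3.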

Of course, maybe you don't want the report for all columns, maybe you're just interested in one column:

In [4]:
print(df.Population.describe())

count    7.200000e+01
mean     5.019337e+05
std      1.199693e+06
min      1.069400e+05
25%      1.482812e+05
50%      2.196220e+05
75%      3.737395e+05
max      9.787426e+06
Name: Population, dtype: float64


So now we have the same information, but only for the Population column. We have to do this a _little_ differently because describing the DataFrame does some clever formatting when you're using Jupyter notebooks, and describing a Series requires us to print out the result. Also notice that `dtype` at the end: that tells us the _data type_ is a 64-bit float. You can have strings, floats, integers, booleans, etc. in a DataFrame.

But the really crucial thing is that this introduces _one_ of the two ways that we access a Series in pandas: `<data frame>.<series name>.method`. So we could get similar information on the Name column with:
```python
df.Name.describe()
```
And so forth.

In [5]:
print(df.Name.describe())
print(" ")
print("The mean is: " + str(df.Population.mean()))

count             72
unique            72
top       Chelmsford
freq               1
Name: Name, dtype: object
 
The mean is: 501933.666667


Notice that describing a text column gives us an 'object' data type because a String is a complex object, not a simple float or int.
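If you want to see the data type of every column at once, a data frame has a `dtypes` attribute. The sketch below uses three rows copied from the city data; the frame name `sample` is just for illustration, and note that newer pandas releases may report text columns with a dedicated string type rather than 'object'.

```python
import pandas as pd

# Three rows copied from the city data: one text column, two numeric ones
sample = pd.DataFrame({
    'Name':       ['Glasgow', 'West Yorkshire', 'West Midlands'],
    'Population': [1209143, 1777934, 2440986],
    'Density':    [3390.0, 3645.0, 4076.0],
})

print(sample.dtypes)              # one data type per column
print(sample['Name'].describe())  # text summary: count, unique, top, freq
```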

And notice too that we can ask the df directly for a derived variable (such as the mean) just by asking the Series to do the work for us: `<data frame>.<series>.method()`. You might want to have a [look at the documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#series) to see what other methods are available for a data series. It's rather a long list.

# Data Series & Indexing

A DataFrame is composed of one or more data series (columns) and an index: a non-data column that is useful for finding individual observations. In our 'city data' data set, the index would be the city names themselves because the names _aren't_ data in the usual sense: you can't calculate a mean from them and they aren't categorical variables (e.g. 'Metro' vs 'Town') that we'd use for grouping. They are unique non-data values, so that's your index.

## Creating your own index

Ordinarily, the data making up a df are read directly from a file. Unless you tell it otherwise, pandas simply numbers the rows (0, 1, 2, ...) and uses those numbers as the index, which is why the rows in the output above are labelled 0 to 4. But you are not bound by that default: you can set any column as the index, or even create one of your own!
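As a rough sketch of the options, the code below reads a tiny in-memory CSV (three rows copied from the city data, standing in for a real file) and shows the default numeric index alongside two ways of using the `Name` column instead. The variable names are made up for illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file
csv_text = u"""Name,Rank,Population
Greater London,1,9787426
Greater Manchester,2,2553379
West Midlands,3,2440986
"""

# Default behaviour: pandas numbers the rows 0, 1, 2, ...
by_number = pd.read_csv(io.StringIO(csv_text))

# Ask for a specific column to be the index while reading...
by_name = pd.read_csv(io.StringIO(csv_text), index_col='Name')

# ...or promote a column to the index afterwards
also_by_name = by_number.set_index('Name')

print(by_number.index.tolist())
print(by_name.index.tolist())
print(by_name.loc['West Midlands', 'Population'])
```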

For instance, let's say that you wanted a series containing only the latitudes of British cities. You could create a new Series with a custom index as follows:
```python
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620], 
    index = ['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)
```
In this case, the index is a list of cities and it would, generally, make it quite quick to look up the latitude of any of the cities listed. You are never limited to _only_ looking up values by index, but this is usually faster.

In [6]:
import pandas as pd
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620], 
    index = ['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)
print("Type of myLatitudes: "      + str(type(myLatitudes)))
print("Access like a dictionary: " + str(myLatitudes['Liverpool']))
print("Access like a method: "     + str(myLatitudes.Liverpool))

myLatitudes.Bristol = 555000  # an integer, to match the other values in the Series

print("Updated latitude: " + str(myLatitudes.Bristol))

Type of myLatitudes: <class 'pandas.core.series.Series'>
Access like a dictionary: 7063197
Access like a method: 7063197
Updated latitude: 555000


You'll notice that we also just accessed the Series in two different ways -- understanding the strengths and weaknesses of these two approaches is really important:

1. The 'method' approach (`<df>.<series name>` and `<df>.<index name>`) makes for code that is easy to read. A good example of that would be the `df.Population.mean()` that we saw above.

2. The 'dictionary' approach (`<df>['<series name>']` and `<df>['<index value>']`) is helpful when there is potentially ambiguity about what you want Python to do (you shouldn't run into this problem very often), but it's mainly about being able to access or modify a _range_ of values... as we'll see below.
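Here is a small sketch of the kind of ambiguity mentioned in point 2. The data frame is hypothetical: one column name clashes with an existing DataFrame method and the other contains a space, so neither can be reached safely with the 'method' approach.

```python
import pandas as pd

# Hypothetical frame whose column names are awkward as attributes
tricky = pd.DataFrame({
    'count':      [3, 5, 8],        # clashes with the DataFrame.count() method
    'Metro Area': ['A', 'B', 'C'],  # contains a space
})

print(callable(tricky.count))         # True: this is the method, not the column
print(tricky['count'].tolist())       # the bracket form always means the column
print(tricky['Metro Area'].tolist())  # and it copes with spaces in names
```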

## Loc and iloc

So, what about if you wanted to select several values from the df at the same time? How do you select, say, rows in the range from 0 to 2, or select Bristol and Glasgow in one go? 

Here's how:

In [8]:
# Access like a list
print(myLatitudes.iloc[0:2])

print("\n")

# Access a range
print(myLatitudes.loc['Reading':])

print("\n")

# Access non-sequential values
print(myLatitudes.loc[ ['Bristol','Glasgow'] ])

Liverpool    7063197
Bristol       555000
dtype: int64


Reading    6703134
Glasgow    7538620
dtype: int64


Bristol     555000
Glasgow    7538620
dtype: int64


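A simple mnemonic for loc and iloc is that iloc is about using _integers_ (i == integer) to help you find something in the data frame by its position (like working with a list), while loc is about using the index labels _directly_ (more like working with a dictionary). One difference to watch for: an iloc slice stops _before_ its end position, whereas a loc slice _includes_ its end label.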
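The sketch below puts the two side by side on a fresh copy of the latitude Series (called `lat` here so that it doesn't interfere with `myLatitudes`), and shows how the end of a slice is treated differently by each.

```python
import pandas as pd

lat = pd.Series(
    [7063197, 6708480, 6703134, 7538620],
    index=['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)

by_position = lat.iloc[0:2]                    # positions 0 and 1 (end excluded)
by_label    = lat.loc['Liverpool':'Reading']   # labels, end included

print(by_position.index.tolist())
print(by_label.index.tolist())
print(lat.iloc[-1])                                # last value, like a list
print(lat.loc[['Glasgow', 'Liverpool']].tolist())  # any labels, in any order
```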

*Note*: there is a [lot more](http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#selection-by-position) that you can do with this.

### A Challenge for You!

If all of this has made some kind of sense, why not spend a few minutes exploring the CSV data from last week using pandas. Try the following:

1. What's the mean population?
2. What's the standard deviation of the population?
3. What's the highest rank (i.e. smallest city) in the data set?
4. Can you figure out how to calculate a z-score using one line of pandas-enabled code?

Use the coding block below for your exploration.

# Adding a New Series

Finally, and building on everything we've seen so far, to add a new series to an existing data frame we use the dictionary-like syntax:
```python
df['NewSeriesName'] = pd.Series(...Series definition...)
``` 
See how familiar that syntax is? `df['NewSeriesName']` is _exactly_ like creating and assigning a new key/value pair to a dictionary! The only difference here is that the 'value' we store in the dictionary is a Series object, and not a simple variable (String, int, float).
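A minimal sketch of what that might look like in practice, using two rows from the city data. The new column names (`CalcDensity`, `Rank`) and the frame name `towns` are chosen for illustration only.

```python
import pandas as pd

towns = pd.DataFrame(
    {'Population': [9787426, 2553379], 'Area': [1737.9, 630.3]},
    index=['Greater London', 'Greater Manchester']
)

# Assign the result of a calculation on existing columns
towns['CalcDensity'] = towns['Population'] / towns['Area']

# Assign a Series: values are matched to rows by index label, not by position
towns['Rank'] = pd.Series({'Greater Manchester': 2, 'Greater London': 1})

print(towns)
```

Notice that the `Rank` values were supplied in the 'wrong' order but still ended up on the right rows, because pandas lines them up using the index.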

# Working with Data

One of the first things that we do when working with any new data set is to familiarise ourselves with it. There are a _huge_ number of ways to do this, but there are no shortcuts to:
* Reading about the data (how it was collected, what the sample size was, etc.)
* Reviewing any accompanying metadata (data about the data, column specs, etc.)
* Looking at the data itself at the row- and column-levels
* Producing descriptive statistics 
* Visualising the data using plots

In fact, you should use _all_ of these together to really understand where the data came from, how it was handled, and whether there are gaps or other problems. If you're wondering which comes first, I've always liked this approach: _start with a chart_. We're _not_ going to do that here because, first, I want you to get a handle on pandas itself!

For the remainder of this module we're going to be working with two types of data: data about people (Socio-economic Classification) and data about the environment (weather). We've selected two very different types of data on purpose:
1. Because we know that some of you have interests in the human environment, and others in the natural
2. Because these are very different types of data with very different properties
3. Because we'll see that _similar_ workflows can be used with each!

What we want to highlight is that computational approaches are _highly transferable_ between contexts. The mean or median is not _less_ relevant in one context than another, it's just more or less appropriate as a tool for understanding the data!

We'll see the Socio-economic Classification data next week and focus on the weather API data this week.

## Weather Data 

The UK's Met Office is a world-leading weather and climate research centre, and even if it doesn't always seem like their forecasts are very accurate that's because Britain's weather is inherently _unpredictable_. They've also done a lot of work to make their weather data widely available to people like us.

I probably don't need to say a _lot_ about weather data because you've probably been making use of forecasts for much of your life! But it's _still_ worth understanding something about how weather data is gathered and reported: many organisations operate weather stations where data on wind speed, temperature, rain, and amount of sun are collected and then transmitted to a server to be integrated into a larger data set of weather _observations_ at a national or global scale. Of course, any _one_ station might be in the 'wrong' place (somewhere shady or protected from the rain) or it might even break down, but the idea is that if you have enough of them you can collect a pretty good range of data for the country and begin to look for patterns and, potentially, make predictions.

We will be accessing data from the MetOffice for a couple of different locations where observations such as the ones below are collected:
* <Param name="F" units="C">Feels Like Temperature (units: degrees Celsius)</Param>
* <Param name="G" units="mph">Wind Gust (units: mph)</Param>
* <Param name="H" units="%">Screen Relative Humidity (units: percent)</Param> 
* <Param name="T" units="C">Temperature (units: degrees Celsius)</Param> 
* <Param name="V" units="">Visibility (units: km?)</Param> 
* <Param name="D" units="compass">Wind Direction (units: compass degrees)</Param>  
* <Param name="S" units="mph">Wind Speed (units: mph)</Param> 
* <Param name="U" units="">Max UV Index (units: index value)</Param> 
* <Param name="W" units="">Weather Type (units: categorical)</Param> 
* <Param name="Pp" units="%">Precipitation Probability (units: percent)</Param>

These observations are not only associated with a particular station (where did we see/will we see these values?); they are also associated with _either_ a particular time in the past (when were they collected?) or, if they're forecasts, with a particular time in the future (when do we expect to see them?).

So although weather data might seem more 'objective' than data on social class (though for obvious reasons it turns out that both are just attempts to capture data about reality, not reality itself), it may also turn out to be very complex to store and manage because of the temporal element _and_ because it's not just a count of one thing: each of these observations uses a very different set of units.

To really get to grips with the MetOffice API you will need to RTM (Read The Manual): http://www.metoffice.gov.uk/media/pdf/3/0/DataPoint_API_reference.pdf

----
# Getting Weather Data

Because the weather is changing all the time, so is the data! And, 'worse', it's becoming obsolete: the forecast from 2 years ago isn't particularly useful to us now. *And* asking for "yesterday's weather" depends on the day that we're asking! When you have data that is always changing from minute to minute or day to day then you use an API (Application Programming Interface) to access it: the API knows that "yesterday's weather" means "work out what day it is right now and then get the weather from the day before", and it also knows that "give me the current weather from station X" means "look up station X and find the latest weather report that I've received". In other words, an API is  designed with programmatic, dynamic interaction in mind right from the start.

Helpfully, the MetOffice provides a lot of documentation about their API (I'd suggest bookmarking it): http://www.metoffice.gov.uk/datapoint/support/api-reference

This type of data requires a lot more research up front to work with, but it's very flexible once you know how to 'speak API' because you can _customise_ the API request (the thing you want to know) to obtain _only_ the data you're interested in instead of being 'stuck' with what the provider wants to give you.

## Obtaining an API Key

The first step to working with the API from the MetOffice is to obtain an API key: [do that here](http://www.metoffice.gov.uk/datapoint/API).

## Making an API Request

We then use the key as part of an API request: the process by which we _ask_ for data. We're going to show you the code and output first and then we'll talk through the steps involved. But, first, you'll need to replace "???" with the API key provided to you by the MetOffice.

In [9]:
import json, requests

api_key   = "8e5675c8-cd82-49b4-ac61-793dc71c3fac" # your API key
api_url   = "http://datapoint.metoffice.gov.uk/public/data/" # base URL
obs_json  = "val/wxobs/all/json/" # observations URL
fcs_json  = "val/wxfcs/all/json/" # forecasts URL

heathrow = str(3772)  # heathrow airport weather station

payload = {'res': 'hourly', 'key': api_key}
r = requests.get(api_url + obs_json + heathrow, params=payload)

#check the call - need some proper try, except stuff here
print(r.url)

#check the output
print(r.json())

http://datapoint.metoffice.gov.uk/public/data/val/wxobs/all/json/3772?res=hourly&key=8e5675c8-cd82-49b4-ac61-793dc71c3fac
{u'SiteRep': {u'DV': {u'type': u'Obs', u'dataDate': u'2016-10-15T09:00:00Z', u'Location': {u'elevation': u'25.0', u'name': u'HEATHROW', u'i': u'3772', u'country': u'ENGLAND', u'lon': u'-0.4491', u'Period': [{u'Rep': [{u'D': u'ENE', u'Pt': u'R', u'H': u'78.9', u'P': u'1006', u'S': u'6', u'T': u'11.1', u'W': u'7', u'V': u'13000', u'Dp': u'7.6', u'$': u'540'}, {u'D': u'ENE', u'Pt': u'F', u'H': u'73.4', u'P': u'1006', u'S': u'6', u'T': u'12.3', u'W': u'7', u'V': u'14000', u'Dp': u'7.7', u'$': u'600'}, {u'D': u'ENE', u'Pt': u'F', u'H': u'61.9', u'P': u'1006', u'S': u'7', u'T': u'14.6', u'W': u'8', u'V': u'20000', u'Dp': u'7.4', u'$': u'660'}, {u'D': u'E', u'Pt': u'F', u'H': u'66.5', u'P': u'1006', u'S': u'8', u'T': u'13.8', u'W': u'8', u'V': u'17000', u'Dp': u'7.7', u'$': u'720'}, {u'D': u'ENE', u'Pt': u'F', u'H': u'63.9', u'P': u'1005', u'S': u'9', u'T': u'13.8', u'W': 

OK, now let's make sense of this:
```python
import json, requests

api_key   = "???" # your API key
api_url   = "http://datapoint.metoffice.gov.uk/public/data/" # base URL
obs_json  = "val/wxobs/all/json/" # observations URL
fcs_json  = "val/wxfcs/all/json/" # forecasts URL
```
So, first we import two new modules: one that makes requests to a web server, and one that will parse JSON responses from the server in order to turn them into something that we can use.

Then we set up some default values that will allow us to build our request to the MetOffice server. The comments help us to remember what each of these variables is.

Now let's do the actual work:
```python
heathrow = str(3772)  # heathrow airport weather station

payload = {'res': 'hourly', 'key': api_key}
r = requests.get(api_url + obs_json + heathrow, params=payload)

# Check the call - need some proper try, except stuff here
print(r.url)

# Check the output
print(r.json())
```
We want the data for Heathrow Airport: we have to request it using a unique identifier (3772 in this case) because that's easier for the computer to handle than a long, potentially ambiguous string. For instance, if you asked for 'London' what would you get? The City of London? Greater London? 

We can then assemble the request URL by combining the base URL, the observations path, the site ID, and the parameters. In this case the parameters are the temporal resolution of the data (`res`, here the hourly observations) and our API key.
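If it helps to see the assembly spelled out, the sketch below builds the same kind of URL by hand using only the standard library, without sending anything to the server. It is an approximation of what `requests` does with `params`, not its actual implementation; `"YOUR-KEY"` is a placeholder and the order of the parameters may differ from the URL printed by `r.url`.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

api_url  = "http://datapoint.metoffice.gov.uk/public/data/"  # base URL
obs_json = "val/wxobs/all/json/"                             # observations path
site_id  = str(3772)                                         # Heathrow

# A list of pairs keeps the order predictable; "YOUR-KEY" is a placeholder
params = [('key', 'YOUR-KEY'), ('res', 'hourly')]

# Everything after the '?' is the query string built from the parameters
full_url = api_url + obs_json + site_id + "?" + urlencode(params)
print(full_url)
```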

The last two steps are just about printing out the reply... It's pretty hard to figure out what that reply means, but it's actually just a kind of dictionary. That's it. It looks like a mess, but it _is_ a dictionary and the only thing that is entirely new is the fact that every string has the letter 'u' in front of it. That 'u' means 'Unicode' and it is just a special kind of string that supports accents, Chinese characters, emojis, and just about anything else that you can think of...
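To convince yourself that the reply really is 'just a dictionary', here is a small sketch that parses a hand-trimmed fragment in the same shape as the reply. The fragment is heavily cut down for illustration; `r.json()` does much the same parsing for us on the full reply.

```python
import json

# A hand-trimmed fragment in the same shape as the reply above
reply_text = '{"SiteRep": {"DV": {"type": "Obs", "Location": {"name": "HEATHROW", "i": "3772"}}}}'

reply = json.loads(reply_text)   # parse the text into Python objects

print(type(reply))                                  # a dictionary
print(sorted(reply['SiteRep']['DV'].keys()))        # the keys one level down
print(reply['SiteRep']['DV']['Location']['name'])   # drill down key by key
```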

It might be a little easier to read if we just look at the description.

In [10]:
pdesc = r.json()['SiteRep']['Wx']
pdat  = r.json()['SiteRep']['DV']

print(pdesc)

{u'Param': [{u'units': u'mph', u'name': u'G', u'$': u'Wind Gust'}, {u'units': u'C', u'name': u'T', u'$': u'Temperature'}, {u'units': u'm', u'name': u'V', u'$': u'Visibility'}, {u'units': u'compass', u'name': u'D', u'$': u'Wind Direction'}, {u'units': u'mph', u'name': u'S', u'$': u'Wind Speed'}, {u'units': u'', u'name': u'W', u'$': u'Weather Type'}, {u'units': u'hpa', u'name': u'P', u'$': u'Pressure'}, {u'units': u'Pa/s', u'name': u'Pt', u'$': u'Pressure Tendency'}, {u'units': u'C', u'name': u'Dp', u'$': u'Dew Point'}, {u'units': u'%', u'name': u'H', u'$': u'Screen Relative Humidity'}]}


Notice how the above also looks a lot like a mix of Python dictionaries and lists: '{' and '['.

## Using recursion to explore data dictionaries

We've seen dictionaries-of-lists and dictionaries-of-dictionaries before! We know how these work, but they've never been very easy to work with because we had to write lots and lots of nested loops:

```python
for key1 in bigDictionary:
    for key2 in bigDictionary[key1]:
        for key3 in bigDictionary[key1][key2]:
            ... And so on ...
```

And if we have to add checks on each one of these `keys` to see if it is a list, a dictionary, or a simple float/int then this code would explode in complexity and become very, very hard to follow.

But there is another way. It's a concept called _recursion_. 

Let's imagine that we have to deal with lists-of-lists (because those are a bit simpler to think about) but we don't know in advance how many lists there are inside of each list; e.g.:
```python
myList = [
    ['Value 1',
        ['Value 1.a.i', 'Value 1.a.ii'],
        ['Value 1.b.i', 'Value 1.b.ii', 
            ['Value 1.b.ii.I', 'Value 1.b.ii.II']],
        'Value 1.c'],
    ['Value 2'],
    'Value 3'
]
```
What a nightmare! That's hard to even _read_, let alone know how to process! But recursion allows us to reframe this problem as something that is _almost_ simple (it's certainly elegant): we need a function that steps through a list one element at a time and then:
* if the element is a simple value (float, int or string) then it prints it out, 
* if the element is a list then the function _calls itself_ on the nested list! 
In other words, when our list-reading-function finds a new list _inside_ the list it is currently reading, then it calls itself and passes in the list-inside-the-list.

That explanation probably _still_ doesn't make much sense, but take a look at the code below.

**Really, really look**:

In [31]:
def outputList(l, depth):
    # Step through the list, recursing into any nested list or dictionary
    for value in l:
        if isinstance(value, list):
            outputList(value, depth+1)
        elif isinstance(value, dict):
            outputDict(value, depth+1)
        else:
            # str() so that floats and ints can be printed too
            print("\t" * depth + "l-Value: " + str(value))
    print("\n")

def outputDict(d, depth):
    # Print each key, then recurse into (or print) the value stored under it
    for key, value in d.items():
        print("\t" * depth + "d-Key: " + str(key))
        if isinstance(value, list):
            outputList(value, depth+1)
        elif isinstance(value, dict):
            outputDict(value, depth+1)
        else:
            print("\t" * depth + "  d-Value: " + str(value))
    print("\n")

So, `outputList` takes a list `l` and then steps through each element of that list. If it encounters an element that is a list, it calls `outputList` and passes it the list that it just found. If it encounters an element that is a dictionary, it calls `outputDict` and passes it the dictionary that it just found. If it encounters a simple value (the `else`) then it just prints it out.

`outputDict` works the same way.

Now, what's going on with `depth`? That variable is the one that demonstrates actual recursion. You can see that we output `"\t" * depth` as part of our print statement; that will print out `depth` tab spaces. You'll also notice that every time we recurse (call `outputList` or `outputDict` _again_) that we increment (increase) depth by 1. So this helps us to make the formatting legible so that we can see where each embedded list or dictionary actually sits within the data.
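If two functions calling each other is a bit much to hold in your head, the stripped-down sketch below handles lists-of-lists only. The function name `walk` and the small `nested` list are made up for illustration; instead of printing as it goes, it collects `(depth, value)` pairs so that you can see exactly what the recursion found and at what depth.

```python
def walk(items, depth=0):
    """Return a list of (depth, value) pairs for a list-of-lists."""
    found = []
    for item in items:
        if isinstance(item, list):
            # A list inside the list: call ourselves, one level deeper
            found.extend(walk(item, depth + 1))
        else:
            # A simple value: record it along with the current depth
            found.append((depth, item))
    return found

nested = ['Value 3', ['Value 2', ['Value 1.a.i', 'Value 1.a.ii']]]

for depth, value in walk(nested):
    print("\t" * depth + str(value))
```

The function never needs to know how deeply the lists are nested: each call only deals with one level and hands anything deeper to another call of itself.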

It's probably better that we just see it in action... In our case we know that we're starting with a dictionary so we would ask `outputDict` to start outputting the content of `pdesc` (the rePly DESCription). `outputDict` then takes each of the key/value pairs in turn, looks at the value to see if _it_ is a dictionary or list or (by default) string and takes appropriate action. Don't get too stressed out if it doesn't make sense just yet, but it's such a powerful concept that it's definitely worth getting to grips with it.

In [32]:
outputDict(pdesc, 0)

d-Key: Param
		d-Key: units
		  d-Value: mph
		d-Key: name
		  d-Value: G
		d-Key: $
		  d-Value: Wind Gust


		d-Key: units
		  d-Value: C
		d-Key: name
		  d-Value: T
		d-Key: $
		  d-Value: Temperature


		d-Key: units
		  d-Value: m
		d-Key: name
		  d-Value: V
		d-Key: $
		  d-Value: Visibility


		d-Key: units
		  d-Value: compass
		d-Key: name
		  d-Value: D
		d-Key: $
		  d-Value: Wind Direction


		d-Key: units
		  d-Value: mph
		d-Key: name
		  d-Value: S
		d-Key: $
		  d-Value: Wind Speed


		d-Key: units
		  d-Value: 
		d-Key: name
		  d-Value: W
		d-Key: $
		  d-Value: Weather Type


		d-Key: units
		  d-Value: hpa
		d-Key: name
		  d-Value: P
		d-Key: $
		  d-Value: Pressure


		d-Key: units
		  d-Value: Pa/s
		d-Key: name
		  d-Value: Pt
		d-Key: $
		  d-Value: Pressure Tendency


		d-Key: units
		  d-Value: C
		d-Key: name
		  d-Value: Dp
		d-Key: $
		  d-Value: Dew Point


		d-Key: units
		  d-Value: %
		d-Key: name
		  d-Value: H
		d-Key: $
		  d-Value: Screen Relativ

OK, so what we have is:

* A dictionary saved in the variable `pdesc` (parameter-description)
* It contains one key only: `Param`
* `pdesc['Param']` is a list of dictionaries

How do I know this? I investigated...

In [22]:
print("Type for pdesc['Param']: " + str(type(pdesc['Param'])))

print("Type for pdesc['Param'][0]: " + str(type(pdesc['Param'][0])))

print("Contents of pdesc['Param'][0]: " + str(pdesc['Param'][0]))

Type for pdesc['Param']: <type 'list'>
Type for pdesc['Param'][0]: <type 'dict'>
Contents of pdesc['Param'][0]: {u'units': u'mph', u'name': u'G', u'$': u'Wind Gust'}


So the point here is that we now have little bundles of information about the data the MetOffice is giving back to us: the parameter description dictionary tells us, for instance, that the name 'G' in the data-part of the reply is data about 'wind gusts' given in miles per hour ('mph'). We can do the same for every other parameter.
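One way to 'do the same for every other parameter' is to re-key the list of dictionaries by parameter code. The sketch below uses three entries copied from the parameter description above; the names `param_list` and `lookup` are made up for illustration.

```python
# Three entries copied from the parameter description printed above
param_list = [
    {'units': 'mph', 'name': 'G', '$': 'Wind Gust'},
    {'units': 'C',   'name': 'T', '$': 'Temperature'},
    {'units': 'm',   'name': 'V', '$': 'Visibility'},
]

# Re-key the list by parameter code so that we can look things up directly
lookup = {}
for p in param_list:
    lookup[p['name']] = (p['$'], p['units'])

description, units = lookup['T']
print("T is " + description + " measured in " + units)
```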

Now, let's see what we get when we look at the reply:

In [33]:
outputDict(pdat, 0)

d-Key: type
  d-Value: Obs
d-Key: dataDate
  d-Value: 2016-10-15T09:00:00Z
d-Key: Location
	d-Key: elevation
	  d-Value: 25.0
	d-Key: name
	  d-Value: HEATHROW
	d-Key: i
	  d-Value: 3772
	d-Key: country
	  d-Value: ENGLAND
	d-Key: lon
	  d-Value: -0.4491
	d-Key: Period
			d-Key: Rep
					d-Key: D
					  d-Value: ENE
					d-Key: Pt
					  d-Value: R
					d-Key: H
					  d-Value: 78.9
					d-Key: P
					  d-Value: 1006
					d-Key: S
					  d-Value: 6
					d-Key: T
					  d-Value: 11.1
					d-Key: W
					  d-Value: 7
					d-Key: V
					  d-Value: 13000
					d-Key: Dp
					  d-Value: 7.6
					d-Key: $
					  d-Value: 540


					d-Key: D
					  d-Value: ENE
					d-Key: Pt
					  d-Value: F
					d-Key: H
					  d-Value: 73.4
					d-Key: P
					  d-Value: 1006
					d-Key: S
					  d-Value: 6
					d-Key: T
					  d-Value: 12.3
					d-Key: W
					  d-Value: 7
					d-Key: V
					  d-Value: 14000
					d-Key: Dp
					  d-Value: 7.7
					d-Key: $
					  d-Value: 600


					d-Key: D
					  d-Value: 

Right, so that's a lot more complex isn't it? But we can make sense of it in the same incremental way...

We can start off by noticing that there are some useful _generic_ fields:

* `pdat['dataDate']` will give us the date and time of the data in the reply.
* `pdat['type']` tells us that we're looking at _Obs_-ervations

And so on. The really interesting one in there is the 'Location'... let's investigate:

In [24]:
print(pdat['Location']['name'])
print(pdat['Location']['elevation'])
print(pdat['Location']['lat'])
print(pdat['Location']['lon'])

HEATHROW
25.0
51.479
-0.4491


That leaves us with the rather nasty-looking `pdat['Location']['Period']`:

In [25]:
pdat['Location']['Period']

[{u'Rep': [{u'$': u'540',
    u'D': u'ENE',
    u'Dp': u'7.6',
    u'H': u'78.9',
    u'P': u'1006',
    u'Pt': u'R',
    u'S': u'6',
    u'T': u'11.1',
    u'V': u'13000',
    u'W': u'7'},
   {u'$': u'600',
    u'D': u'ENE',
    u'Dp': u'7.7',
    u'H': u'73.4',
    u'P': u'1006',
    u'Pt': u'F',
    u'S': u'6',
    u'T': u'12.3',
    u'V': u'14000',
    u'W': u'7'},
   {u'$': u'660',
    u'D': u'ENE',
    u'Dp': u'7.4',
    u'H': u'61.9',
    u'P': u'1006',
    u'Pt': u'F',
    u'S': u'7',
    u'T': u'14.6',
    u'V': u'20000',
    u'W': u'8'},
   {u'$': u'720',
    u'D': u'E',
    u'Dp': u'7.7',
    u'H': u'66.5',
    u'P': u'1006',
    u'Pt': u'F',
    u'S': u'8',
    u'T': u'13.8',
    u'V': u'17000',
    u'W': u'8'},
   {u'$': u'780',
    u'D': u'ENE',
    u'Dp': u'7.1',
    u'H': u'63.9',
    u'P': u'1005',
    u'Pt': u'F',
    u'S': u'9',
    u'T': u'13.8',
    u'V': u'20000',
    u'W': u'8'},
   {u'$': u'840',
    u'D': u'ENE',
    u'Dp': u'7.5',
    u'H': u'65.2',
    u'P': 

Again, however, if we don't panic then we can make sense of it! First, let's look at the big pieces:

* It's pretty obvious that there's a set of dictionaries in there -- we can see the '{...}'!
* We can also see things that look like readings: 'D', 'Dp', 'H', 'P'...
* We can also see two rather useful-looking bits of information: something that says 'Day' and something that looks like a timestamp (e.g. '2016-10-14Z')
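Since this notebook is about pandas, it is worth noting that a list of dictionaries with matching keys maps quite naturally onto a data frame. The sketch below uses two readings copied from the output above, trimmed to six keys for brevity; the names `readings` and `obs` are made up, and the meaning of the `$` key (which looks like it could be minutes after midnight) is an assumption to check against the API manual.

```python
import pandas as pd

# Two readings copied from the output above (every value arrives as a string)
readings = [
    {'$': '540', 'D': 'ENE', 'H': '78.9', 'P': '1006', 'S': '6', 'T': '11.1'},
    {'$': '600', 'D': 'ENE', 'H': '73.4', 'P': '1006', 'S': '6', 'T': '12.3'},
]

obs = pd.DataFrame(readings)   # one row per reading, one column per key

# Convert the numeric-looking columns from text to numbers
for col in ['$', 'H', 'P', 'S', 'T']:
    obs[col] = pd.to_numeric(obs[col])

print(obs)
print(obs['T'].mean())   # now that T is numeric we can summarise it
```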

Let's work on this some more by trial-and-error:

In [38]:
outputDict(pdat['Location']['Period'][0], 0)

d-Key: Rep
		d-Key: D
		  d-Value: ENE
		d-Key: Pt
		  d-Value: R
		d-Key: H
		  d-Value: 78.9
		d-Key: P
		  d-Value: 1006
		d-Key: S
		  d-Value: 6
		d-Key: T
		  d-Value: 11.1
		d-Key: W
		  d-Value: 7
		d-Key: V
		  d-Value: 13000
		d-Key: Dp
		  d-Value: 7.6
		d-Key: $
		  d-Value: 540


		d-Key: D
		  d-Value: ENE
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 73.4
		d-Key: P
		  d-Value: 1006
		d-Key: S
		  d-Value: 6
		d-Key: T
		  d-Value: 12.3
		d-Key: W
		  d-Value: 7
		d-Key: V
		  d-Value: 14000
		d-Key: Dp
		  d-Value: 7.7
		d-Key: $
		  d-Value: 600


		d-Key: D
		  d-Value: ENE
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 61.9
		d-Key: P
		  d-Value: 1006
		d-Key: S
		  d-Value: 7
		d-Key: T
		  d-Value: 14.6
		d-Key: W
		  d-Value: 8
		d-Key: V
		  d-Value: 20000
		d-Key: Dp
		  d-Value: 7.4
		d-Key: $
		  d-Value: 660


		d-Key: D
		  d-Value: E
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 66.5
		d-Key: P
		  d-Value: 1006
		d-Key: S
		  d-Value: 8
		d

In [39]:
outputDict(pdat['Location']['Period'][1], 0)

d-Key: Rep
		d-Key: D
		  d-Value: NE
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 82.8
		d-Key: P
		  d-Value: 1004
		d-Key: S
		  d-Value: 5
		d-Key: T
		  d-Value: 11.2
		d-Key: W
		  d-Value: 7
		d-Key: V
		  d-Value: 9000
		d-Key: Dp
		  d-Value: 8.4
		d-Key: $
		  d-Value: 0


		d-Key: D
		  d-Value: ENE
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 82.3
		d-Key: P
		  d-Value: 1004
		d-Key: S
		  d-Value: 3
		d-Key: T
		  d-Value: 11.2
		d-Key: W
		  d-Value: 8
		d-Key: V
		  d-Value: 9000
		d-Key: Dp
		  d-Value: 8.3
		d-Key: $
		  d-Value: 60


		d-Key: D
		  d-Value: NE
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 83.4
		d-Key: P
		  d-Value: 1004
		d-Key: S
		  d-Value: 5
		d-Key: T
		  d-Value: 11.2
		d-Key: W
		  d-Value: 8
		d-Key: V
		  d-Value: 9000
		d-Key: Dp
		  d-Value: 8.5
		d-Key: $
		  d-Value: 120


		d-Key: D
		  d-Value: E
		d-Key: Pt
		  d-Value: F
		d-Key: H
		  d-Value: 85.5
		d-Key: P
		  d-Value: 1004
		d-Key: S
		  d-Value: 3
		d-Key: T


OK, now we know that `pdat['Location']['Period']` is a list of daily reports. How do I know that? Because when I asked for the first item in the list I got an answer with yesterday's date, and when I asked for the second item in the list I got something containing today's date! And _within_ each of those is _another_ list that contains a set of reports about the weather at Heathrow!

The _last_ clue in there is that one of the parameters is changing in an unusual way: we can guess what H (Humidity), P (Pressure) and most of the rest are from having output `pdesc` above, but the '$' is always a multiple of 60. Can you guess why?

Let's see if we can turn this into something useful... Fix the '???' so that the code prints out the temperature readings at Heathrow.

In [45]:
for d in pdat['Location']['Period']: # d is short for day
    print("Date: " + d['value'])
    for i in d['Rep']: # i is short for time interval
        print("Temperature at " + str(i['$']) + " is " + str(i[???]))

2016-10-14Z
Temperature at 540 is 11.1
Temperature at 600 is 12.3
Temperature at 660 is 14.6
Temperature at 720 is 13.8
Temperature at 780 is 13.8
Temperature at 840 is 13.9
Temperature at 900 is 14.0
Temperature at 960 is 13.7
Temperature at 1020 is 13.1
Temperature at 1080 is 12.5
Temperature at 1140 is 12.3
Temperature at 1200 is 11.9
Temperature at 1260 is 11.9
Temperature at 1320 is 11.6
Temperature at 1380 is 11.5
2016-10-15Z
Temperature at 0 is 11.2
Temperature at 60 is 11.2
Temperature at 120 is 11.2
Temperature at 180 is 10.1
Temperature at 240 is 9.2
Temperature at 300 is 9.6
Temperature at 360 is 8.7
Temperature at 420 is 10.5
Temperature at 480 is 10.9
Temperature at 540 is 12.8


Can you explain why there are two days in there and why the '$' values don't overlap?

Use the coding area below to print out the humidity values over the same period of time...

# Turning API data into a Pandas DataFrame

I've done a little searching online and no one has posted code to do this for us, so we'll have to put together everything that we learned in the past few weeks _as well as_ some new ideas about how to deal with new types of data...

In [55]:
from datetime import datetime, timedelta 

if pdat['type'] != "Obs":
    print("Errrr, these aren't observations!")

# Ignore the time part as we're getting data with values in minutes
# after midnight so we want to reset this to 00:00:00Z
obsDate = datetime.strptime(pdat['dataDate'].split("T")[0],'%Y-%m-%d')

print(obsDate)

2016-10-15 00:00:00
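
The '$' values count minutes after midnight, which is why the date is reset to 00:00:00 above: adding the minutes as a `timedelta` then yields a full timestamp. Here is a minimal sketch using one hard-coded date and one '$' value taken from the earlier output (the variable names are illustrative only):

```python
from datetime import datetime, timedelta

# Midnight on the day of the report; the trailing 'Z' is matched literally
midnight = datetime.strptime('2016-10-14Z', '%Y-%m-%dZ')

# '$' arrives as a string, so convert it to a whole number of minutes first
minutes = int('540')
timestamp = midnight + timedelta(minutes=minutes)

print(timestamp)            # 2016-10-14 09:00:00
print(divmod(minutes, 60))  # (9, 0), i.e. 9 hours and 0 minutes
```

Because `timedelta` does the arithmetic, there is no need to split the minutes into hours by hand; `divmod` is only there to show where 09:00 comes from.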


In [63]:
for d in pdat['Location']['Period']: # d is short for day
    dataDate = datetime.strptime(d['value'],'%Y-%m-%dZ') # Convert date to datetime object
    print("Date: " + str(dataDate)) # Print for debugging
    for i in d['Rep']: # i is short for time interval
        obsDate = dataDate + timedelta(minutes = int(i['$'])) # Add the time in minutes to the datetime
        # print(obsDate) # Debug!
        print("Temperature at " + str(obsDate) + " is " + str(i['T']))

Date: 2016-10-14 00:00:00
Temperature at 2016-10-14 09:00:00 is 11.1
Temperature at 2016-10-14 10:00:00 is 12.3
Temperature at 2016-10-14 11:00:00 is 14.6
Temperature at 2016-10-14 12:00:00 is 13.8
Temperature at 2016-10-14 13:00:00 is 13.8
Temperature at 2016-10-14 14:00:00 is 13.9
Temperature at 2016-10-14 15:00:00 is 14.0
Temperature at 2016-10-14 16:00:00 is 13.7
Temperature at 2016-10-14 17:00:00 is 13.1
Temperature at 2016-10-14 18:00:00 is 12.5
Temperature at 2016-10-14 19:00:00 is 12.3
Temperature at 2016-10-14 20:00:00 is 11.9
Temperature at 2016-10-14 21:00:00 is 11.9
Temperature at 2016-10-14 22:00:00 is 11.6
Temperature at 2016-10-14 23:00:00 is 11.5
Date: 2016-10-15 00:00:00
Temperature at 2016-10-15 00:00:00 is 11.2
Temperature at 2016-10-15 01:00:00 is 11.2
Temperature at 2016-10-15 02:00:00 is 11.2
Temperature at 2016-10-15 03:00:00 is 10.1
Temperature at 2016-10-15 04:00:00 is 9.2
Temperature at 2016-10-15 05:00:00 is 9.6
Temperature at 2016-10-15 06:00:00 is 8.7
Tempe

In [72]:
def processMetOfficeObservations(loc): 
    """
    Process a series of 'reports' for a single
    location using the datetime object as the 
    reference time against which to build the 
    timedelta (i.e. we start from midnight and 
    the timedelta is the number of minutes past 
    midnight)
    """
    observations = []
    
    for d in loc['Period']: # d for day
        dt = datetime.strptime(d['value'],'%Y-%m-%dZ') # Convert date to datetime object
    
        # Now deal with the actual observations
        for report in d['Rep']:
            minutes_after_midnight = int(report['$'])
            ts = dt + timedelta(minutes=minutes_after_midnight)
            
            for key in ['D','Pt']:
                if key not in report:
                    report[key] = u""
            for key in ['W','V','S','G']:
                if key not in report or report[key] == "":
                    report[key] = "0"
            for key in ['T','Dp','H']:
                if key not in report or report[key] == "":
                    report[key] = "0.0"          
            
            observations.append([ str(ts), int(report['W']), int(report['V']), float(report['T']), str(report['D']), 
                int(report['S']), int(report['G']), str(report['Pt']), float(report['Dp']), float(report['H']) ])
        
    return observations

print(processMetOfficeObservations(pdat['Location']))

[['2016-10-14 09:00:00', 7, 13000, 11.1, 'ENE', 6, 0, 'R', 7.6, 78.9], ['2016-10-14 10:00:00', 7, 14000, 12.3, 'ENE', 6, 0, 'F', 7.7, 73.4], ['2016-10-14 11:00:00', 8, 20000, 14.6, 'ENE', 7, 0, 'F', 7.4, 61.9], ['2016-10-14 12:00:00', 8, 17000, 13.8, 'E', 8, 0, 'F', 7.7, 66.5], ['2016-10-14 13:00:00', 8, 20000, 13.8, 'ENE', 9, 0, 'F', 7.1, 63.9], ['2016-10-14 14:00:00', 8, 20000, 13.9, 'ENE', 8, 0, 'F', 7.5, 65.2], ['2016-10-14 15:00:00', 8, 20000, 14.0, 'ENE', 9, 0, 'F', 7.5, 64.8], ['2016-10-14 16:00:00', 8, 27000, 13.7, 'ENE', 9, 0, 'F', 6.6, 62.1], ['2016-10-14 17:00:00', 8, 18000, 13.1, 'ENE', 8, 0, 'F', 7.4, 68.2], ['2016-10-14 18:00:00', 8, 15000, 12.5, 'ENE', 9, 0, 'R', 7.2, 70.0], ['2016-10-14 19:00:00', 8, 16000, 12.3, 'ENE', 8, 0, 'R', 7.3, 71.4], ['2016-10-14 20:00:00', 8, 13000, 11.9, 'NE', 5, 0, 'R', 7.8, 75.9], ['2016-10-14 21:00:00', 8, 12000, 11.9, 'NE', 6, 0, 'F', 8.0, 77.0], ['2016-10-14 22:00:00', 8, 12000, 11.6, 'NE', 6, 0, 'F', 8.0, 78.6], ['2016-10-14 23:00:00', 
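
A list of lists like this is already close to what pandas expects. As a rough sketch, and not necessarily the route the rest of this notebook takes, two rows copied from the output above can be handed to `pd.DataFrame` together with a list of column names. The labels in `columns` are purely illustrative: they reuse the Met Office field codes in the order the function appends them, plus a made-up `timestamp` label for the first value.

```python
import pandas as pd

# Two rows copied from the output above
rows = [
    ['2016-10-14 09:00:00', 7, 13000, 11.1, 'ENE', 6, 0, 'R', 7.6, 78.9],
    ['2016-10-14 10:00:00', 7, 14000, 12.3, 'ENE', 6, 0, 'F', 7.7, 73.4],
]

# Hypothetical column labels, in the same order as the values in each row
columns = ['timestamp', 'W', 'V', 'T', 'D', 'S', 'G', 'Pt', 'Dp', 'H']

df = pd.DataFrame(rows, columns=columns)

# Parse the timestamp strings and use them as the row index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')

print(df.shape)        # (rows, columns) once the timestamp has become the index
print(df['T'].mean())  # mean temperature across the two readings
```

Using the timestamp as the index means each row is labelled by when the observation was made, which is usually what you want for time-based data.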