<center><h1>7SSG2059 Geocomputation 2016/17</h1></center>

<h1><center>Practical 7: Handling Time Series Data with Pandas</center></h1>

<p><center><i>James Millington, 07 November 2016</i></center>


## Overview
In this practical we will look at some of the details of how to handle time series data. In particular we will work with the Heathrow Weather data that you can use for your final assignment. Following all the steps in this notebook will format and clean the data from the Met Office API ready for you to do correlation, regression and time-series analyses. The full code will be provided later if there's anything you're unsure of.

## Met Office Weather Data

### About the Data
The data you will be working with in this practical are hourly observations of weather variables from the UK Met Office. The Met Office collects hourly observation data for 140 locations across the UK and makes the data freely available online along with several other products. Specifically, these data are for observations at Heathrow Airport (site 3772). Details about these data (including units of measurement) are available from [the Met Office website](http://www.metoffice.gov.uk/datapoint/product/uk-hourly-site-specific-observations).

The data are available on KEATS in the file `HeathrowWeather2016.csv`. This data file was derived from a daily ‘scrape’ of the Met Office data feed via its API (Application Programming Interface) using Python code (which will be made available on KEATS for your information). The `HeathrowWeather2016.csv` data are in a very similar format to what you worked with in Practical 5; the main difference is that these data cover 1 Jan 2016 all the way through until early November 2016. You can use these data for your final report.

We will also use the file `WeatherTypes.csv`. Download both `WeatherTypes.csv` and `HeathrowWeather2016.csv` from KEATS to _the same directory where you have saved this notebook file_.  

#### Task 
Load the `HeathrowWeather2016.csv` data file into a pandas DataFrame named `df` taking account of the fact that there is a header row in the file. Check you understand the dimensions and contents (e.g. dtype) using methods learned in past weeks.

In [None]:
import pandas as pd
df = pd.read_csv("HeathrowWeather2016.csv", header=0) 
df.info()
df.head()

You should note that each record (row) is an observation for a particular hour of a particular day. Hence these data could be called a _time series_. The date and time of the observation are provided in columns of the data (`Series` of the `DataFrame`). 

In Practical 5 you looked at various ways to work with the Met Office data (collected via API) in Pandas. This included setting a temporal Index and Data Types of Series. We'll need to do some similar manipulation with this data set to get it ready for analysis. 

### Setting a DatetimeIndex
The `pandas` package has [lots of functionality for working with time series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html), including the ability to recognise dates and times and use them as an index for DataFrames and Series. This data type is known as a `DatetimeIndex`, which you can read more about [here](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex). Later, we’ll see how this temporal index is useful for selecting particular records from a DataFrame or Series.

To set the index using the dates and times in our `DataFrame` we use the following code:

In [None]:
dt = pd.to_datetime(df.Date + df.Time, format='%Y-%m-%d%H:%M')
df.index = dt
df.info()

The first line above uses the `to_datetime` function to parse the concatenated `Date` and `Time` Series, with `format` specifying how the date-time strings are laid out. The format `'%Y-%m-%d%H:%M'` may seem a little strange (for multiple reasons!):
1. You can read about the meaning of the symbols by checking the table [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior).
2. If you opened the `HeathrowWeather2016.csv` file in Excel the date would have the format `dd/mm/yyyy` - but actually this is Excel 'helpfully' formatting the date cells for us. When opening the csv file in a text editor you can see the date column really has format `yyyy-mm-dd` (which hopefully helps you to understand why we used `%Y-%m-%d` and not `%d/%m/%Y`). Beware opening csv data in Excel!

The second line then assigns the resulting `DatetimeIndex` to our `DataFrame` (check this using `df.info()` for example).

The third line then prints information about the DataFrame - note how the index has changed from previously. 
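
To see why the format string has no space between `%d` and `%H`, here is a minimal sketch on a toy DataFrame. The values are made up and the name `toy` is only for illustration; the column layout is assumed to mirror the `Date` and `Time` columns described above.

```python
import pandas as pd

# Toy data laid out like the Date and Time columns (values are made up)
toy = pd.DataFrame({
    "Date": ["2016-05-02", "2016-05-02", "2016-05-02"],
    "Time": ["00:00", "02:00", "03:00"],
    "Temperature": [9.1, 8.7, 8.5],
})

# Adding the two string Series joins them with no separator,
# e.g. "2016-05-02" + "00:00" gives "2016-05-0200:00"
joined = toy["Date"] + toy["Time"]
print(joined[0])

# The format string therefore also has no separator between %d and %H
toy.index = pd.to_datetime(joined, format="%Y-%m-%d%H:%M")
print(toy.index)
```

If the format string did not match the joined text, `to_datetime` would raise an error rather than guess, which is a useful safety check.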

### Selecting Time Periods

With the `DatetimeIndex` set we can now use it to select data for specific periods of time. For example, to select data for the first seven days in May 2016:

In [None]:
may_df = df['2016-05-01':'2016-05-07'] 
print(may_df.head())

Or for the first 12 hours of 2nd May:

In [None]:
may02_12hr_df = df['2016-05-02 00:00':'2016-05-02 11:00']
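
As a minimal sketch of how these date-string selections behave, the toy example below builds three complete days of made-up hourly data (not the Heathrow file) and selects from it. It uses `.loc`, which newer pandas versions require when selecting rows with a single date string, and `freq="h"`, the lowercase spelling of the hourly alias.

```python
import pandas as pd

# Three complete days of made-up hourly values
idx = pd.date_range(start="2016-05-01 00:00", periods=72, freq="h")
toy = pd.DataFrame({"Temperature": list(range(72))}, index=idx)

# A date string on its own selects every record on that day
one_day = toy.loc["2016-05-02"]

# A slice between two date-time strings includes BOTH end points
morning = toy.loc["2016-05-02 00:00":"2016-05-02 11:00"]

print(len(one_day), len(morning))
```

Note that, unlike ordinary Python list slices, the end label of the slice is included in the result.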

#### Task
Use the code block below to print the DataFrame just created for the 12 hour period on 2nd May. What do you notice is wrong with it? _[Hint: how many records are there in this DataFrame?]_

The sharp-eyed among you will have noticed that there is a missing record for 1am (check if you didn’t spot it). There may have been some issues with the measurement instrument that prevented the data being recorded for this hour. 

_Is this a problem for this particular day only? What about other days? What quick ways could you use to check if it is a problem throughout the dataset?_
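
One possible approach (a sketch only, on made-up data with 01:00 deliberately left out) is to build the full set of hours you would expect and ask which of them are absent from the index. The names `expected` and `missing` are just for illustration.

```python
import pandas as pd

# Toy hourly index with 01:00 deliberately missing (made-up values)
idx = pd.to_datetime(["2016-05-02 00:00", "2016-05-02 02:00", "2016-05-02 03:00"])
toy = pd.DataFrame({"Temperature": [9.1, 8.7, 8.5]}, index=idx)

# Every hour we would expect between the earliest and latest observation
expected = pd.date_range(start=toy.index.min(), end=toy.index.max(), freq="h")

# Hours that are expected but not present in the data
missing = expected.difference(toy.index)
print(missing)
```

Counting the records per day and looking for days with fewer than 24 would be another quick check.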

### Reindexing
This issue arises because above we told pandas to use the Date and Time values in the Series to provide the `DatetimeIndex`, so only hours that were actually recorded appear in the index. But we can force pandas to include rows for missing hours of data by [re-indexing the DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) so that all hours are represented. 

To reindex we need to know the start value of the index, the end value, and how the values in between should be specified. We know the values should be every hour but how could we use code to quickly identify start and end values? The code below gets you started (replace `???`):

In [None]:
firstDate = df.index[???]
print(firstDate)

_Hint: what is the integer index for the first row of the DataFrame?_ You should get `2015-12-31 14:00:00` when you print `firstDate`.

In the code block below replace `???` to access the last date in the DataFrame.

In [None]:
lastDate = df.index[???(df.Date) - 1]
print(lastDate)

_Hint: what function returns the number of rows (records) in the column (Series)?_ You should get `2016-11-02 15:00:00` when you print `lastDate` 

We should now be able to use the `firstDate` and `lastDate` variables with the pandas `reindex` method so that all hours will be present in the DataFrame. Run the next line of code to do this. 

Do you get an error? You should! 

In [None]:
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

### Duplicates!
A quick bit of googling for `ValueError: cannot reindex from a duplicate axis` provides [a SO answer](http://stackoverflow.com/a/27242735) that suggests we may have duplicates in our DataFrame. Duplicate records can be another issue with automatically recorded data. 

Finding and removing these duplicate records would take some time for our 7328 rows of data if we had to do this manually in Excel. However, because we are using code we can do this very easily with the [pandas `drop_duplicates` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html):

In [None]:
df = df.drop_duplicates()  
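
To see what is going on, here is a small sketch on made-up data in which one record has been scraped twice. It assumes, as in our data, that the `Date` and `Time` columns are still present, because `drop_duplicates` compares the column values of each row rather than the index.

```python
import pandas as pd

# Toy data where the 02:00 record appears twice (made-up values)
toy = pd.DataFrame({
    "Date": ["2016-05-02", "2016-05-02", "2016-05-02"],
    "Time": ["00:00", "02:00", "02:00"],
    "Temperature": [9.1, 8.7, 8.7],
})
toy.index = pd.to_datetime(toy["Date"] + toy["Time"], format="%Y-%m-%d%H:%M")

# is_unique is False while any timestamp is repeated
print(toy.index.is_unique)

# duplicated() flags each timestamp that has already been seen
n_dupes = toy.index.duplicated().sum()
print(n_dupes)

# Rows that are identical in every column are reduced to one copy
deduped = toy.drop_duplicates()
print(deduped.index.is_unique)
```

If two rows shared a timestamp but differed in some other column, `drop_duplicates` would keep both, so checking `index.is_unique` afterwards is a sensible habit.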

Now run the `reindex` code again:

In [None]:
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

Hopefully you got no error this time!

Essentially, what this is doing is specifying that all rows in the DataFrame should have a `DatetimeIndex` with one hour (`1H`) between each value of the index. The `fill_value = None` means that rows for any hours missing in the original data will be given 'no data' values.
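
The effect is easiest to see on a tiny example. The sketch below uses made-up data with 01:00 missing and hard-coded start and end times (in the practical these come from `firstDate` and `lastDate`); `freq="h"` is the lowercase spelling of the hourly alias.

```python
import pandas as pd

# Toy data with no record for 01:00 (made-up values)
idx = pd.to_datetime(["2016-05-02 00:00", "2016-05-02 02:00", "2016-05-02 03:00"])
toy = pd.DataFrame({"Temperature": [9.1, 8.7, 8.5]}, index=idx)

# The complete hourly index we want the data to have
full_hours = pd.date_range(start="2016-05-02 00:00", end="2016-05-02 03:00", freq="h")

# Rows for hours not in the original data are created and filled with NaN
filled = toy.reindex(index=full_hours)
print(filled)
```

The original three observations are untouched; only the new 01:00 row holds a 'no data' value.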

We can see what effect this has had by re-selecting our 2nd May data to check it now contains 24 hours:

In [None]:
may02_df = df.loc['2016-05-02']  # .loc selects rows by index label
print(may02_df)

### Intuitive Labels
In the data variables you may have noticed the one called `WeatherType` which contains a bunch of integers. To work out what these values mean, we need to go back to the Met Office website to [read the documentation](http://www.metoffice.gov.uk/datapoint/support/documentation/code-definitions). Each `WeatherType` value corresponds to a text label (and [weather map symbol](http://www.metoffice.gov.uk/guide/weather/symbols)!) for a particular type of weather, the key for which is provided on KEATS for you in `WeatherTypes.csv`

#### Task
Enter code in the next code block to read `WeatherTypes.csv` into a DataFrame named `wt`, taking account of headers. Then print out the top few lines of `wt` to check the contents.

The format of `WeatherTypes.csv` is often known as a look-up table – we can ‘look up’ values to find their corresponding label. This is a similar structure to Python dictionaries (think about it). To make things easier for ourselves we can convert the `WeatherType` values to text labels using the [map method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html). 

First however, we need to set the index of the `wt` DataFrame so that the map method will work properly. 

#### Task
Using the code above as a guide, set the index of `wt` from the `Value` column in `wt` (replace `???`):

In [None]:
wt.??? = wt['Value']  

_Hint: what is the pandas attribute that accesses a DataFrame index?_

If you set the index correctly, the output of `wt.info()` should look like:
    
`<class 'pandas.core.frame.DataFrame'>
Float64Index: 32 entries, nan to 30.0
Data columns (total 2 columns):
Value          31 non-null float64
Description    32 non-null object
dtypes: float64(1), object(1)
memory usage: 768.0+ bytes`

In [None]:
df['WeatherType'] = df['WeatherType'].map(wt['Description'])

Here, we have mapped the value in the `Description` Series of `wt` that corresponds to the value in the `WeatherType` Series of `df` and assigned it to the `WeatherType` Series in `df`. Check the result of this in df now by printing the top few lines of `df`. 

Note how the `WeatherType` column now contains descriptions rather than numerical values. These text labels will be useful later when producing plots, figures, summaries, etc. 

Given their utility, map-type functions are found in many programming languages, and are often combined with a reduce function (so-called [MapReduce](https://en.wikipedia.org/wiki/MapReduce) programming models). You may find map useful in future so make sure you understand what it does (in pandas at least). Remember it is important to set the index of the look-up table DataFrame before using map.
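
Here is a minimal sketch of the look-up idea on its own. The table below is a hypothetical three-row stand-in for `WeatherTypes.csv` (the values and labels are for illustration only), and the code `99` is included to show what happens when a value has no entry in the look-up table.

```python
import pandas as pd

# Hypothetical look-up table laid out like WeatherTypes.csv
lookup = pd.DataFrame({
    "Value": [0, 1, 7],
    "Description": ["Clear night", "Sunny day", "Cloudy"],
})

# map() matches against the INDEX of the look-up Series,
# so the codes must be the index and the labels the values
descriptions = lookup.set_index("Value")["Description"]

codes = pd.Series([1, 7, 0, 99])
labels = codes.map(descriptions)
print(labels)
```

Codes that are not found in the look-up index come back as missing values rather than raising an error, so it is worth checking for unexpected gaps after mapping.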

### Missing Values
You may have noticed in the `WeatherType` variable that some values corresponded to _‘Not available’_ or _‘Not used’_. These are missing values (also known as ‘no data’ values) that have no actual useful data attached to them. 

This is slightly different from the missing records issue above, as these missing values are for particular variables in records that have values for all other variables (i.e. data is only missing for certain cells in a row, not entire rows). Such values can often occur due to technical error for example and you should always check if there are missing values in a data set. 

Given that there are often problems with missing values in data, pandas (and many other libraries and languages) has [functionality to deal with missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html). Python itself has a `None` object that denotes a lack of value. To convert the _‘Not available’_ and _‘Not used’_ `WeatherType` values to _‘no data’_ in the `df` DataFrame we can do the following: 

In [None]:
# .loc[rows, column]: the boolean test picks the rows, 'WeatherType' picks the column
df.loc[df.WeatherType == "Not available", 'WeatherType'] = None
df.loc[df.WeatherType == "Not used", 'WeatherType'] = None

Here, we use the Python `None` object to tell pandas these values in the `WeatherType` Series should be changed to missing data. You should be able to see the effect this has had on `df` (the easiest way is to re-select and display the data for 2nd May and look at the `WeatherType` column) - do that in the code block below.
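
The same idea can be tried out on a toy DataFrame (made-up values). This sketch uses the label-based `.loc` indexer and, as an alternative to two separate assignments, the `isin` method to handle both labels in one step.

```python
import pandas as pd

toy = pd.DataFrame({
    "WeatherType": ["Cloudy", "Not available", "Sunny day", "Not used"],
    "Temperature": [9.1, 8.7, 8.5, 8.9],
})

# True for rows whose WeatherType is one of the 'no data' labels
no_data = toy.WeatherType.isin(["Not available", "Not used"])

# The mask selects the rows; the column label selects which cells to change
toy.loc[no_data, "WeatherType"] = None

# Count the missing values in each column
print(toy.isnull().sum())
```

Only the `WeatherType` cells are changed; the `Temperature` values in the same rows are left alone.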

#### Brief Aside...
You may also have noticed there are entire days missing in the data (because the API scraper didn't work), as indicated in the original csv file by `NA`. The [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the `read_csv` method explains that by default the following values are interpreted as 'no data': `‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.` If you have some other representation of 'no data' value in your data (e.g. -999), you can use the `na_values` argument with `read_csv`.
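
As a quick sketch of the difference `na_values` makes, the example below reads the same made-up csv text twice from memory (using `io.StringIO` rather than a file on disk). The `-999` code is a hypothetical 'no data' marker, not something that appears in the Heathrow data.

```python
import io
import pandas as pd

# Made-up csv text: one NA and two -999 'no data' codes
csv_text = u"Time,Temperature,WindGust\n00:00,9.1,NA\n01:00,-999,12\n02:00,8.5,-999\n"

# By default only NA (and the other default strings) becomes missing
default = pd.read_csv(io.StringIO(csv_text))

# na_values adds -999 to the list of values treated as missing
custom = pd.read_csv(io.StringIO(csv_text), na_values=[-999])

print(default.isnull().sum())
print(custom.isnull().sum())
```

Without `na_values`, the `-999` entries would be read as real measurements and would badly distort any averages or plots.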

### Setting Appropriate Data Types
Finally, another sensible thing to do when working with data is to ensure the computer knows what data types (`dtype`) are contained in a DataFrame. We can use `df.info()` to check the current state. 

In [None]:
df.info()

For example, you should see five Series have dtype `object` - pandas knows these Series are not numerical but does not know what else to do with them. We will leave the Date and Time columns as they are as we have already used these to set the `DatetimeIndex` (so pandas knows we are dealing with dates and times). The remaining `object` Series actually reflect categories and we should change their type to reflect this. You can read more about categorical data in pandas [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html).

The code below uses the `astype` method to set one of the remaining three `object` Series to categories. 

In [None]:
df.WeatherType = df.WeatherType.astype('category')

#### TASK
Set the two other `object` Series to `category`  and all the remaining Series to `float`. Use code from Practical 5 to help you. When completed, printing `df.info()` should produce the following:

`<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7370 entries, 2015-12-31 14:00:00 to 2016-11-02 15:00:00
Freq: H
Data columns (total 13 columns):
LocID               7320 non-null float64
Date                7320 non-null object
Time                7320 non-null object
WeatherType         6840 non-null category
Visibility          6840 non-null float64
Temperature         6840 non-null float64
WindDirection       6832 non-null category
WindSpeed           6840 non-null float64
WindGust            6840 non-null float64
Pressure            6840 non-null float64
PressureTendency    6834 non-null category
DewPoint            6840 non-null float64
Humidity            6840 non-null float64
dtypes: category(3), float64(8), object(2)
memory usage: 655.2+ KB`


NB: You may be wondering why you have been asked to set `WindGust` and `Visibility` to float type and not integer type. This is because pandas [cannot handle](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data-casting-rules-and-indexing) the conversion of 'no data' values for integer and boolean types. The explanation for this is [rather technical](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#choice-of-na-representation), so just remember to use float rather than integer for any Series that may contain missing values. 
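
You can see this behaviour for yourself with a short sketch on made-up wind gust values. (Newer pandas releases also offer a nullable `Int64` extension type, but it is not used in this practical.)

```python
import pandas as pd

# A Series of whole numbers is stored as an integer type
gust = pd.Series([12, 15, 9])
print(gust.dtype)

# Reindexing adds a row with no data; pandas switches the Series to float
# because the standard integer type cannot hold a 'no data' value
gust_with_gap = gust.reindex([0, 1, 2, 3])
print(gust_with_gap.dtype)

# Trying to force the Series back to integer fails while the gap remains
try:
    gust_with_gap.astype(int)
    cast_ok = True
except ValueError:
    cast_ok = False
print(cast_ok)
```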

### Onwards...
You’re finally all done with setting up your data, and now we can move on to actually doing some analysis! We'll look at how to plot and analyse these time series data next week. 

This initial data manipulation phase can often take some time but it is always important and it is often useful to learn about your data. You could write the data you have manipulated to file to read back in next week: 

In [None]:
df.to_pickle("CleanedHeathrowData2016.pkl") 

'[Pickling](https://docs.python.org/2/library/pickle.html)' is Python's way of writing data to disk so it can be read back into memory in exactly the same format as it was previously held. Above we have used pandas' `to_pickle` [method](http://pandas.pydata.org/pandas-docs/stable/io.html#pickling) to write out exactly what is held in memory; we would then use `read_pickle` to read the file back into memory. 

In [None]:
df2 = pd.read_pickle("CleanedHeathrowData2016.pkl")

Add code to the code block above to print the first few lines of `df2` and `df` to check they are identical. 
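
Besides comparing the printed output by eye, pandas can make the comparison for you. The sketch below round-trips a small made-up DataFrame through a pickle file in a temporary directory (the file name `toy.pkl` is just for illustration) and uses the `equals` method to compare the two copies.

```python
import os
import tempfile
import pandas as pd

# Small made-up DataFrame with an hourly index and a category column
idx = pd.date_range(start="2016-05-02 00:00", periods=3, freq="h")
toy = pd.DataFrame({"WeatherType": ["Cloudy", None, "Sunny day"],
                    "Temperature": [9.1, None, 8.5]}, index=idx)
toy["WeatherType"] = toy["WeatherType"].astype("category")

# Write to a pickle file in a temporary directory, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
toy.to_pickle(path)
back = pd.read_pickle(path)

# equals() compares values, dtypes and index in one go
same = toy.equals(back)
print(same)
print(back.dtypes)
```

Notice that the `category` dtype and the datetime index both survive the round trip.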

The `.pkl` file is not very transferable between different software however, so we might also write to csv format to open in a text editor or Excel:

In [None]:
df.to_csv("CleanedHeathrowData2016.csv", index = False)

Check the `to_csv` method documentation to check what `index = False` does and to find out other arguments you might pass to the method. 

The data just written to csv file has duplicate records removed, all hours are represented, and it has intuitive labels for weather types. However, when you read this csv file back into pandas you would still need to set the `DatetimeIndex` and data types. 
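
To illustrate what is lost and what has to be restored, here is a sketch of a csv round trip on made-up data, held in memory with `io.StringIO` rather than written to disk. Unlike the code above, it assumes we choose to write the index out as a column, under the hypothetical label `DateTime`, so that it can be parsed back into dates on reading.

```python
import io
import pandas as pd

# Small made-up DataFrame with an hourly index and a category column
idx = pd.date_range(start="2016-05-02 00:00", periods=3, freq="h")
toy = pd.DataFrame({"WeatherType": ["Cloudy", None, "Sunny day"],
                    "Temperature": [9.1, None, 8.5]}, index=idx)
toy["WeatherType"] = toy["WeatherType"].astype("category")

# Write to an in-memory text buffer, keeping the index as a labelled column
buffer = io.StringIO()
toy.to_csv(buffer, index_label="DateTime")
buffer.seek(0)

# On reading, ask pandas to use that column as the index and parse it as dates
restored = pd.read_csv(buffer, index_col="DateTime", parse_dates=True)

# csv stores plain text only, so the category dtype has to be set again
restored["WeatherType"] = restored["WeatherType"].astype("category")
print(restored.dtypes)
```

This is the main trade-off between the two formats: csv is readable almost anywhere but forgets the index type and data types, whereas a pickle remembers everything but is only really useful from Python.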

Or we could go back to the original data and execute a script that contains all the commands needed to manipulate the data in one go (that script will be provided ready for Practical 8).