


# Practical 7: Handling Time Series Data with Pandas



## Overview
In this practical we will look at some of the details of how to handle time series data. In particular we will work with the Heathrow Weather data that you can use for your final assignment. Following all the steps in this notebook will format and clean the data from the Met Office API ready for you to do correlation, regression and time-series analyses. The full code will be provided later if there's anything you're unsure of.

## Met Office Weather Data

### About the Data
The data you will be working with in this practical are hourly observations of weather variables from the UK Met Office. The Met Office collects hourly observation data for 140 locations across the UK and makes these data freely available online along with several other products. Specifically, these data are for observations at Heathrow Airport (site 3772). Details about these data (including units of measurement) are available from [the Met Office website](http://www.metoffice.gov.uk/datapoint/product/uk-hourly-site-specific-observations).

The data are available on KEATS in the file `HeathrowWeather2016.csv`. This data file was derived from a daily ‘scrape’ of the Met Office data feed via its API (Application Programming Interface) using Python code (which will be made available on KEATS for your information). The `HeathrowWeather2016.csv` data are in a very similar format to what you worked with in Practical 5; the main difference is that these data cover 1 Jan 2016 all the way through until early November 2016. You can use these data for your final report.

We will also use the file `WeatherTypes.csv`. Download both `WeatherTypes.csv` and `HeathrowWeather2016.csv` from KEATS to _the same directory where you have saved this notebook file_.  

#### Task 
Load the `HeathrowWeather2016.csv` data file into a pandas DataFrame named `df` taking account of the fact that there is a header row in the file. Check you understand the dimensions and contents (e.g. dtype) using methods learned in past weeks.

In [None]:
import pandas as pd
df = pd.read_csv("HeathrowWeather2016.csv", header=0) 
df.info()
df.head()

You should note that each record (row) is an observation for a particular hour of a particular day. Hence these data could be called a _time series_. The date and time of the observation are provided in columns of the data (`Series` of the `DataFrame`). 

In Practical 5 you looked at various ways to work with the Met Office data (collected via API) in Pandas. This included setting a temporal Index and Data Types of Series. We'll need to do some similar manipulation with this data set to get it ready for analysis. 

### Setting a DateTimeIndex
The `pandas` package has [lots of functionality for working with time series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html), including the ability to recognise dates and times and use them as an index for DataFrames and Series.  This data type is known as a `DateTimeIndex`, which you can read more about [here](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex). Later, we’ll see how this temporal index is useful for selecting particular records from a DataFrame or Series.

To set the index using the dates and times in our `DataFrame` we use the following code:

In [None]:
dt = pd.to_datetime(df.Date + df.Time, format='%Y-%m-%d%H:%M')
df.index = dt
df.info()

The first line above uses the `to_datetime` function with the Date and Time Series joined together, and specifies the format in which the dates and times are written so that pandas can parse them. The format `'%Y-%m-%d%H:%M'` may seem a little strange (for multiple reasons!):
1. You can read about the meaning of the symbols by checking the table [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior)
2. If you opened the `HeathrowWeather2016.csv` file in Excel the date would have the format `dd/mm/yyyy` - but actually this is Excel 'helpfully' formatting the date cells for us. When opening the csv file in a text editor you can see the date column really has format `yyyy-mm-dd` (which hopefully helps you to understand why we used `%Y-%m-%d` and not `%d/%m/%Y`). Beware opening csv data in Excel!

The second line then assigns the `DateTimeIndex` to our `DataFrame` (check this using `df.info()` for example). 

The third line then prints information about the DataFrame - note how the index has changed from previously. 
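To see what the format string is doing, here is a minimal sketch on made-up values (the toy `Date` and `Time` entries below are assumptions, not taken from the real file). Adding the two string Series joins them with no separator, which is why the format has no space between `%d` and `%H`.

```python
import pandas as pd

# Made-up values standing in for the Date and Time columns of the CSV file
toy = pd.DataFrame({"Date": ["2016-05-02", "2016-05-02"],
                    "Time": ["00:00", "13:00"]})

# Adding two string Series joins them element by element, with no separator
joined = toy.Date + toy.Time
print(joined.tolist())        # ['2016-05-0200:00', '2016-05-0213:00']

# The format string describes the layout of the joined text so it can be parsed
parsed = pd.to_datetime(joined, format="%Y-%m-%d%H:%M")
print(parsed.tolist())
```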

### Selecting Time Periods

With the `DateTimeIndex` set we can now use it to select data for specific periods of time. For example, to select data for the first seven days in May 2016:

In [None]:
may_df = df['2016-05-01':'2016-05-07'] 
print(may_df.head())

Or for the first 12 hours of 2nd May:

In [None]:
may02_12hr_df = df['2016-05-02 00:00':'2016-05-02 11:00']

#### Task
Use the code block below to print the DataFrame just created for the 12 hour period in May. What do you notice is wrong with it? _[Hint: how many records are there in this DataFrame?]_

The sharp-eyed among you will have noticed that there is a missing record for 1am (check if you didn’t spot it). There may have been some issues with the measurement instrument that prevented the data being recorded for this hour. 

_Is this a problem for this particular day only? What about other days? What quick ways could you use to check if it is a problem throughout the dataset?_

### Reindexing
This issue arises because above we told pandas to use the Date and Time values in the Series to provide the DateTimeIndex. But we can force pandas to include rows for missing hours of data by [re-indexing the DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) so that all hours are represented. 

To reindex we need to know the start value of the index, the end value, and how the values in between should be specified. We know the values should be every hour but how could we use code to quickly identify start and end values? The code below gets you started (replace `???`):

In [None]:
firstDate = df.index[???]
print(firstDate)

_Hint: what is the integer index for the first row of the DataFrame?_ You should get `2015-12-31 14:00:00` when you print `firstDate`

In the code block below replace `???` to access the last date in the DataFrame.

In [None]:
lastDate = df.index[???(df.Date) - 1]
print(lastDate)

_Hint: what function returns the number of rows (records) in the column (Series)?_ You should get `2016-11-02 15:00:00` when you print `lastDate` 

We should now be able to use the `firstDate` and `lastDate` variables with the pandas `reindex` method so that all hours will be present in the DataFrame. Run the next line of code to do this. 

Do you get an error? You should! 

In [None]:
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

### Duplicates!
A quick bit of googling for `ValueError: cannot reindex from a duplicate axis` provides [a SO answer](http://stackoverflow.com/a/27242735) that suggests we may have duplicates in our DataFrame. Duplicate records can be another issue with automatically recorded data. 

Finding and removing these duplicate records would take some time for our 7328 rows of data if we had to do this manually in Excel. However, because we are using code we can do this very easily with the [pandas `drop_duplicates` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html):

In [None]:
df = df.drop_duplicates()  

Now run the `reindex` code again:

In [None]:
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

Hopefully you got no error this time!
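As an illustration of what went wrong, the sketch below builds a tiny DataFrame with one repeated record and shows how to count repeated index values before and after dropping duplicates. The values are invented, and it assumes (as in this practical) that the duplicates are exact copies of whole rows; `drop_duplicates` compares the column values, not the index.

```python
import pandas as pd

# Invented records: the first two rows are exact copies of each other
toy = pd.DataFrame({"Date": ["2016-05-02", "2016-05-02", "2016-05-02"],
                    "Time": ["00:00", "00:00", "01:00"],
                    "Temperature": [9.1, 9.1, 8.7]})
toy.index = pd.to_datetime(toy.Date + toy.Time, format="%Y-%m-%d%H:%M")

# Count index values that repeat an earlier one
print(toy.index.duplicated().sum())

# Drop rows whose column values repeat an earlier row, then re-check the index
deduped = toy.drop_duplicates()
print(deduped.index.is_unique)
```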

Essentially, what this is doing is specifying that all rows in the DataFrame should have a `DateTimeIndex` with one hour (`1H`) between each value of the index. The `fill_value = None` means that rows for any hours missing in the original data will be given 'no data' values.
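The effect of reindexing can be seen on a toy example with one missing hour. This is only a sketch: the temperatures are invented, and the hourly step is written as `pd.Timedelta(hours=1)` here simply as another way of asking for a one hour frequency.

```python
import pandas as pd

# Three invented observations with the 01:00 record missing
idx = pd.to_datetime(["2016-05-02 00:00", "2016-05-02 02:00", "2016-05-02 03:00"])
toy = pd.DataFrame({"Temperature": [9.1, 8.4, 8.0]}, index=idx)

# Build the complete set of hourly labels between the first and last observation
full_hours = pd.date_range(start=toy.index[0], end=toy.index[-1],
                           freq=pd.Timedelta(hours=1))

# Reindexing keeps existing rows and adds a 'no data' row for the missing hour
toy_full = toy.reindex(full_hours)
print(toy_full)
```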

We can see what effect this has had by re-selecting our 2nd May data to check it now contains 24 hours:

In [None]:
may02_df = df.loc['2016-05-02']   #use .loc to select all rows for this date
print(may02_df)

### Intuitive Labels
In the data variables you may have noticed the one called `WeatherType` which contains a bunch of integers. To work out what these values mean, we need to go back to the Met Office website to [read the documentation](http://www.metoffice.gov.uk/datapoint/support/documentation/code-definitions). Each `WeatherType` value corresponds to a text label (and [weather map symbol](http://www.metoffice.gov.uk/guide/weather/symbols)!) for a particular type of weather, the key for which is provided on KEATS for you in `WeatherTypes.csv`

#### Task
Enter code in the next code block to read `WeatherTypes.csv` into a DataFrame named `wt`, taking account of headers. Then print out the top few lines of `wt` to check the contents

The format of `WeatherTypes.csv` is often known as a look-up table – we can ‘look up’ values to find their corresponding label. This is a similar structure to Python dictionaries (think about it). To make things easier for ourselves we can convert the `WeatherType` values to text labels using the [map method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html). 

First however, we need to set the index of the `wt` DataFrame so that the map method will work properly. 

#### Task
Using the code above as a guide, set the index of `wt` from the `Value` column in `wt` (replace `???`):

In [None]:
wt.??? = wt['Value']  

_Hint: what is the pandas method that accesses a DataFrame index?_

If you set the index correctly, the output of `wt.info()` should look like:
    
`<class 'pandas.core.frame.DataFrame'>
Float64Index: 32 entries, nan to 30.0
Data columns (total 2 columns):
Value          31 non-null float64
Description    32 non-null object
dtypes: float64(1), object(1)
memory usage: 768.0+ bytes`

In [None]:
df['WeatherType'] = df['WeatherType'].map(wt['Description'])

Here, we have mapped the value in the `Description` Series of `wt` that corresponds to the value in the `WeatherType` Series of `df` and assigned it to the `WeatherType` Series in `df`. Check the result of this in df now by printing the top few lines of `df`. 

Note how the `WeatherType` column now contains descriptions rather than numerical values. These text labels will be useful later when producing plots, figures, summaries, etc. 

Given their utility, map-type functions are found in many programming languages, and are often combined with a reduce function (so-called [MapReduce](https://en.wikipedia.org/wiki/MapReduce) programming models). You may find map useful in future so make sure you understand what it does (in pandas at least). Remember that it is important to set the index of the look-up table DataFrame before using map.
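A small sketch of the look-up idea, using invented codes and labels (these are not the Met Office key). The look-up Series is built directly with the codes as its index; passing a plain dictionary to `map` gives the same result, which is the similarity to dictionaries mentioned earlier.

```python
import pandas as pd

# Invented look-up: the index holds the codes, the values hold the labels
lookup = pd.Series(["Label A", "Label B", "Label C"], index=[0, 1, 2])

# Invented codes to be translated
codes = pd.Series([2, 0, 1, 2])

# map looks each code up in the index of 'lookup' and returns the matching label
labels = codes.map(lookup)
print(labels.tolist())          # ['Label C', 'Label A', 'Label B', 'Label C']

# A dictionary with the same keys and values behaves the same way
as_dict = {0: "Label A", 1: "Label B", 2: "Label C"}
print(codes.map(as_dict).tolist())
```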

### Missing Values
You may have noticed in the `WeatherType` variable that some values corresponded to _‘Not available’_ or _‘Not used’_. These are missing values (also known as ‘no data’ values) that have no actual useful data attached to them. 

This is slightly different from the missing records issue above, as these missing values are for particular variables in records that have values for all other variables (i.e. data is only missing for certain cells in a row, not entire rows). Such values can often occur due to technical error for example and you should always check if there are missing values in a data set. 

Given that there are often problems with missing values in data, pandas (and many other libraries and languages) has [functionality to deal with missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html). Python itself has a `None` object that denotes a lack of value. To convert the _‘Not available’_ and _‘Not used’_ `WeatherType` values to _‘no data’_ in the `df` DataFrame we can do the following: 

In [None]:
df.loc[df.WeatherType == "Not available", 'WeatherType'] = None   #.loc replaces the deprecated .ix indexer
df.loc[df.WeatherType == "Not used", 'WeatherType'] = None

Here, we use the Python `None` object to tell pandas these values in the `WeatherType` Series should be changed to missing data. You should be able to see the effect this has had on `df` (the easiest way is to re-select and display the data for 2nd May and look at the `WeatherType` column) - do that in the code block below

#### Brief Aside...
You may also have noticed there are entire days missing in the data (because the API scraper didn't work), as indicated in the original csv file by `NA`. The [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the `read_csv` method explains that by default the following values are interpreted as 'no data': `‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.` If you have some other representation of 'no data' value in your data (e.g. -999), you can use the `na_values` argument with `read_csv`.
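As a sketch of the `na_values` argument, the example below reads a tiny made-up CSV (held in memory rather than in a file) in which `-999` stands for 'no data'. The column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV text in which -999 marks a missing temperature
csv_text = "Time,Temperature\n00:00,9.1\n01:00,-999\n02:00,8.4\n"

# Tell read_csv to treat -999 as 'no data' in addition to the default markers
toy = pd.read_csv(io.StringIO(csv_text), header=0, na_values=[-999])
print(toy)
```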

### Setting Appropriate Data Types
Finally, another sensible thing to do when working with data is to ensure the computer knows what data types (`dtype`) are contained in a DataFrame. We can use `df.info()` to check the current state. 

In [None]:
df.info()

For example, you should see five Series have dtype `object` - pandas knows these Series are not numerical but does not know what else to do with them. We will leave the Date and Time columns as they are as we have already used these to set the `DateTimeIndex` (so pandas knows we are dealing with dates and time). The remaining `object` Series actually reflect categories and we should change their type to reflect this. You can read more about categorical data in pandas [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html).

The code below uses the `astype` method to set one of the remaining three `object` Series to categories. 

In [None]:
df.WeatherType = df.WeatherType.astype('category')

#### Task
Set the two other `object` Series to `category`  and all the remaining Series to `float`. Use code from Practical 5 to help you. When completed, printing `df.info()` should produce the following:

`<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7370 entries, 2015-12-31 14:00:00 to 2016-11-02 15:00:00
Freq: H
Data columns (total 13 columns):
LocID               7320 non-null float64
Date                7320 non-null object
Time                7320 non-null object
WeatherType         6840 non-null category
Visibility          6840 non-null float64
Temperature         6840 non-null float64
WindDirection       6832 non-null category
WindSpeed           6840 non-null float64
WindGust            6840 non-null float64
Pressure            6840 non-null float64
PressureTendency    6834 non-null category
DewPoint            6840 non-null float64
Humidity            6840 non-null float64
dtypes: category(3), float64(8), object(2)
memory usage: 655.2+ KB`


NB: You may be wondering why you have been asked to set `WindGust` and `Visibility` to float type and not integer type. This is because pandas [cannot handle](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data-casting-rules-and-indexing) the conversion of 'no data' values for integer and boolean types. The explanation for this is [rather technical](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#choice-of-na-representation), and you should just remember to always use float rather than integer in pandas. 
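The casting behaviour can be seen in a short sketch with invented numbers: a Series of whole numbers starts with an integer dtype, but as soon as a 'no data' value is introduced (here by reindexing to a label that has no value) pandas converts it to float.

```python
import pandas as pd

# Invented whole-number values with no gaps
counts = pd.Series([10, 20, 30])
print(counts.dtype)                  # an integer dtype

# Reindexing to include a label with no data introduces a 'no data' value (NaN)
with_gap = counts.reindex([0, 1, 2, 3])
print(with_gap.dtype)                # now a float dtype
print(with_gap.tolist())
```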

### Onwards...
You’re finally all done with setting up your data, and now we can move on to actually doing some analysis! We'll look at how to plot and analyse these time series data next week. 

This initial data manipulation phase can often take some time but it is always important and it is often useful to learn about your data. You could write the data you have manipulated to file to read back in next week: 

In [None]:
df.to_pickle("CleanedHeathrowData2016.pkl") 

'[Pickling](https://docs.python.org/2/library/pickle.html)' is Python's way of outputting data to disk so it can be read back into memory in exactly the same format as it was previously held in memory. Above we have used pandas' `to_pickle` [method](http://pandas.pydata.org/pandas-docs/stable/io.html#pickling) to write out exactly what is held in memory; we would then use `read_pickle` to read the file back into memory. 

In [None]:
df2 = pd.read_pickle("CleanedHeathrowData2016.pkl")

Add code to the code block above to print the first few lines of `df2` and `df` to check they are identical. 

The `.pkl` file is not very transferable between different software however, so we might also write to csv format to open in a text editor or Excel:

In [None]:
df.to_csv("CleanedHeathrowData2016.csv", index = False)

Check the `to_csv` method documentation to check what `index = False` does and to find out other arguments you might pass to the method. 

The data just written to csv file has duplicate records removed, all hours are represented, and it has intuitive labels for weather types. However, when you read this csv file back into pandas you would still need to set the DateTimeIndex and data types. 
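To illustrate why the index and data types need setting again, here is a sketch of a round trip through csv format on a toy DataFrame. The values and labels are invented, and the csv is written to an in-memory buffer rather than a file so nothing is left on disk.

```python
import io
import pandas as pd

# Invented data with a temporal index and a categorical column
idx = pd.to_datetime(["2016-05-02 00:00", "2016-05-02 01:00", "2016-05-02 02:00"])
toy = pd.DataFrame({"WeatherType": ["Cloudy", "Sunny", "Cloudy"],
                    "Temperature": [9.1, 8.7, 8.4]}, index=idx)
toy["WeatherType"] = toy["WeatherType"].astype("category")

# Write to csv without the index, then read the text straight back in
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
toy_back = pd.read_csv(buffer, header=0)

print(toy.dtypes)         # WeatherType is a category here
print(toy_back.dtypes)    # the category dtype has not survived the round trip
print(type(toy_back.index))
```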

Or we could go back to the original data and execute a script that contains all the commands needed to manipulate the data in one go (that script will be provided ready for Practical 8).

# Practical 8: Correlation and Time Series



## Overview
This week we will work mainly with the Met Office data introduced in Practical 5. The Exercises at the end will require you to use some of the techniques learned in the main part of the practical and apply them to NS-SeC data. 

You should create a temporary working directory (folder) on the Desktop of your operating system with the name _GeocompPracX_, where _X_ is the week number. So for example, for today’s practical you should create a temporary working directory named _GeocompPrac8_ on the Desktop. 

The temporary folder you have created will be the ‘working directory’ for this session of work. Once you have finished this session you can move this directory anywhere you like, but bear in mind this will change the path of the directory. You can think of the path of a directory (or a file) as an address to locate it within the computer hard drives’ directory structure.

Download the `HeathrowWeather2016.csv` and `WeatherTypes.csv` (from _Week 7_ section) and `Data_NSSHRP_UNIT_URESPOP.csv` (from the Week 8 section) on KEATS and move them to your Week 8 working directory.

Run the code provided for you, add code to code blocks if they are empty, and edit text blocks to provide answers to questions. Save your notebook frequently. Answers and solutions will be provided before next week's class.  

## Met Office Weather Data

### Preparing the Data
Before we can start analysing the Met Office data we need to ensure that it has been formatted and cleaned - which is what Practical 7 was all about. If you didn't complete Practical 7 there's no need to work through it now, but you should go back to it to check you understand the concepts at a later date.

So you have two options to get started with the Met Office data:   
1. Run the code from Practical 7 on the original .csv data to format and clean again 
2. Load the data you pickled at the end of Practical 7 directly into memory (you can only do this if you did complete Practical 7)

The code for option 1 is provided in a script file on KEATS but also provided in the code block below. If you want to use **option 1** (which you _must_ if you didn't complete Practical 7) run the code below: 

In [None]:
import pandas as pd
import os

#set the path to the data directory (only use this if the csv file is NOT saved in the same directory as this notebook file)
#path = os.path.join(os.path.expanduser("~"),"Google Drive","Teaching","2016-17","Undergrad","Geocomp","Week7")
#os.chdir(path)

#read file
df = pd.read_csv("HeathrowWeather2016.csv", header=0)   

#drop duplicates
df = df.drop_duplicates()  

#set DateTimeIndex
dt = pd.to_datetime(df.Date + df.Time, format='%Y-%m-%d%H:%M')
df.index = dt

#then reindex to ensure a row for every hour
firstDate = df.index[0]   
lastDate = df.index[len(df.Date) - 1]
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

#read WeatherTypes.csv, map labels and set NA
wt = pd.read_csv('WeatherTypes.csv', header=0)  
wt.index = wt['Value']  
df['WeatherType'] = df['WeatherType'].map(wt['Description'])
df.loc[df.WeatherType == "Not available", 'WeatherType'] = None   #.loc replaces the deprecated .ix indexer
df.loc[df.WeatherType == "Not used", 'WeatherType'] = None

#set appropriate data types (note NA not handled by integer so use float for all numerical)
for c in ['WindDirection','WeatherType','PressureTendency']:
    df[c] = df[c].astype('category')
for c in ['DewPoint','Humidity','Temperature','WindGust','Visibility']:
    df[c] = df[c].astype('float')
    
#check data!
df.info()           #info() prints its summary directly
print(df.head())

Alternatively, the code for **option 2** (to read pickled data) is as follows:

In [None]:
import pandas as pd

df = pd.read_pickle("CleanedHeathrowData2016.pkl")   #or change the filename to whatever you used when pickling

## Exploratory Data Analysis
### Plotting Distributions
An important aspect of data to consider in initial data analyses is the distribution of data. Plotting histograms is particularly useful for variables of continuous data type. The code below prints a distribution plot for a single variable. Note how we add the `dropna()` method to ensure seaborn can handle the missing data. 

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt    #Plotting library used by seaborn, see http://matplotlib.org/users/pyplot_tutorial.html
%matplotlib inline  

fig = sb.distplot(df['Temperature'].dropna())   #print a distribution plot for a single variable

To automate the plotting of histograms for the continuous data variables in the Met Office data, let’s create a subset of the Series that contain only numerical dtype variables. 

In [None]:
df_num = df.select_dtypes(include=['float64'])
df_num = df_num.drop('LocID', axis = 1)   #axis 1 is columns (Series)
df_num.info()   #info() prints its summary directly

The code above uses the `select_dtypes` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) to select only those types specified in the `include` argument (in this case a list of `dtypes` containing only `float64`). However, as `LocID` is a `float64` we need to remove this manually, using the `drop` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) on line two above (think about why it doesn't make sense to plot the location id for a single location).  
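The same two steps can be tried on a toy DataFrame to see exactly which columns survive. The column values below are invented for illustration.

```python
import pandas as pd

# Invented data: two float columns and one categorical column
toy = pd.DataFrame({"LocID": [3772.0, 3772.0],
                    "Temperature": [9.1, 8.7],
                    "WindDirection": ["SW", "W"]})
toy["WindDirection"] = toy["WindDirection"].astype("category")

# Keep only float64 columns, then drop the identifier column by name
toy_num = toy.select_dtypes(include=["float64"])
toy_num = toy_num.drop("LocID", axis=1)     # axis 1 is columns
print(list(toy_num.columns))                # ['Temperature']
```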

We can now simply loop through the columns in this new DataFrame, plotting the distribution for each and saving them to individual files in the directory where your data are saved:

In [None]:
#loop to automatically plot multiple variables saving to file
for name in df_num.columns:    
    fig = plt.figure(name)                                                 #create a matplotlib figure with name specified by column 'name'
    fig = sb.distplot(df_num[name].dropna())                               #use seaborn to plot the distribution of the DataFrame column 'name' (added dropna as problem with Seaborn 0.7.1 recognising NAs)
    plt.savefig('{0}_Distribution.png'.format(name), bbox_inches='tight')  #save the figure to file
    plt.close()  

Check you understand how this loop is working and check the image files were created properly on your hard disk (no output will be produced in the notebook here, you'll need to check your hard disk with a file browser). Saving images to file like this is useful for when you want to insert a figure in a report or essay in Word (for example).

We could now look through each image file to compare distributions. However, this involves opening all the various files and switching between them. An easier way to quickly check and compare distributions for multiple variables in a data set is what is known in seaborn as a `pairplot`, which you can read about [here](http://seaborn.pydata.org/tutorial/distributions.html#visualizing-pairwise-relationships-in-a-dataset).

The code below creates the pairplot. When you run the code _be patient!_ It may take a little while for the pairplot to finish as there’s lots of data here to be plotted. As shown below, the pairplot is a set of scatter plots for each pair of variables with a histogram for each individual variable. Pretty nice. 

In [None]:
fig = sb.pairplot(df_num.dropna(axis = 0))   #note need to drop rows of data with no data

To save the pairplot as an image file on your hard disk (for example to include later in a report in Word) we need to add some additional code (using the [matplotlib](http://matplotlib.org/users/pyplot_tutorial.html) package):

In [None]:
fig = plt.figure('Pairplot')                         #create a plot object
fig = sb.pairplot(df_num.dropna(axis = 0))           #add the seaborn pairplot for our dataset
plt.savefig('Pairplot.png', bbox_inches='tight')     #save the plot with the given filename
plt.close()                                          #stop writing to the plot

#### Task
Edit this text block (double-click) to answer the following questions:

1)	Which variable has a distribution most like a normal distribution?

A: 

2)	Which variable(s) are heavily negatively skewed?

A: 

3)	How would you describe the distribution of the Wind Gust variable?

A: 


### Editing Variables
The `WindGust` and `Pressure` variables have quite different distributions from the others. For `WindGust` this is because of the nature of the physical variable, but for `Pressure` it's likely because of some erroneous values. 

Given the non-continuous nature of wind gusts, let's drop the `WindGust` variable from the dataframe of numerical variables using the `drop` method:  

In [None]:
df_num = df_num.drop('WindGust', axis = 1)   #axis 1 is columns (Series)

We specify `axis = 1` to tell Pandas to look for `WindGust` in the columns of the dataframe (`axis = 0` would be the rows).

Look back to your pairplot at the `Pressure` variable. The problem with `Pressure` appears to be that there are some zero values. An air pressure of zero does not seem feasible so let's convert any zero to 'missing data':

In [None]:
df_num.loc[df_num.Pressure == 0, 'Pressure'] = None   #name the column so only Pressure values are changed, not entire rows

Check you understand how the code above accesses only the zero values in the `Pressure` Series and sets them to missing values (`None`). 
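A minimal sketch on made-up values of how a boolean mask combined with a column label in `.loc` limits an assignment to the matching cells of one column. The toy DataFrame and the name `is_zero` are assumptions for illustration.

```python
import pandas as pd

# Invented observations: the second row has an infeasible zero pressure
toy = pd.DataFrame({"Pressure": [1012.0, 0.0, 1009.0],
                    "Temperature": [9.1, 8.7, 8.4]})

# Boolean Series: True for rows where Pressure is zero
is_zero = toy.Pressure == 0

# Rows are chosen by the mask, the column by its label, so only Pressure changes
toy.loc[is_zero, "Pressure"] = None
print(toy)
```

Without the column label, the assignment would apply to every column of the selected rows.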

#### Task
Plot the distribution of the `Pressure` variable to check that the zeros have been removed and to see how the variable is distributed. Use the following code as a starter, changing `???`, then run the code.

In [None]:
fig = sb.???(df_num['???'].dropna())   

### Pairplot
#### Task
Create a pairplot of the variables in the `df_num` dataframe to see the effect of removing `WindGust` and setting zeros in `Pressure` to no data. Use code from above to help you. 

#### Task
Looking at your pairplot above answer the following questions (edit this text block):

Q: Do you think there are any clear relationships between the raw values of these variables?

A: 

Q: What might be the reasons for the lines of data points in the `Visibility` data (and to some extent the lines in the `WindSpeed` data)?

A: 

### Boxplot
Another way to visualise distributions that we already know is to use boxplots. For example, maybe we want to know how wind speed varies for observations with and without gusts of wind. To do this, we can first create a new variable in our original dataframe to classify observations depending on whether they were for an hour with a wind gust or not: 

In [None]:
df['WindGustBin'] = pd.Series(df['WindGust'] > 0)        #create a boolean dtype

The code above creates a new Pandas Series named `WindGustBin`, populated by `True` or `False` values by evaluating the conditional in the (). This new series has a boolean `dtype` as we can see from the information about the data frame:

In [None]:
df.info()   #info() prints its summary directly
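As a small illustration of how the comparison builds the boolean Series, the sketch below applies the same test to three invented gust values, one of which is missing. Note that a 'no data' value compared with zero gives `False`, so hours with no gust reading are classed as having no gust.

```python
import pandas as pd

# Invented gust values: none, a gust, and a missing reading
gust = pd.Series([0.0, 25.0, float("nan")])

# The comparison is evaluated element by element
has_gust = gust > 0
print(has_gust.tolist())     # [False, True, False]
print(has_gust.dtype)        # bool
```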

Now we can use this new variable to classify values in a box plot: 

In [None]:
ax = sb.boxplot(x="WindGustBin", y="WindSpeed", data=df, palette="deep", linewidth=1)

plt.subplots_adjust(bottom=0.15)                                                  #http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplots_adjust
plt.xlabel('Wind Gust')
plt.xticks([0,1], ['No', 'Yes'])
plt.ylabel('Wind Speed (mph)')
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label]):                         #http://stackoverflow.com/a/14971193
    item.set_fontsize(20)
for item in (ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(14)

If you want to save this figure to an image file on your hard disk, you would add the following code to the lines above:

In [None]:
plt.savefig('WindGustBoxplot.png', bbox_inches='tight')
plt.close()

_Note: if you run the two lines of code above alone and not immediately following the plotting code (e.g. in the same code block in a notebook) your image file will be empty!_

### Extracting Date and Times
In the data preparation above (and in Week 7), right after we loaded the data into our DataFrame, we set the DataFrame index to a `DateTimeIndex` object. This allowed us to select observations based on their date/time. We can also use the DateTimeIndex in reverse, to find the date/time of particular observations. 

For example, let’s say we want to find the date/time of the warmest observation in the data set. We can use the `sort_values` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) and then access the index of the top record:

In [None]:
#get date and time of warmest recording
warm_dt = df_num.sort_values(['Temperature'], ascending = False).index[0]
print(warm_dt)

Because the index of our DataFrame is a `DateTimeIndex` here it returns us the date/time of the record. 
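The sort-then-take-the-first-label idea can be checked on a toy example with invented temperatures. The sketch also shows the pandas `idxmax` method, which is a different route to the same label and is included here only for comparison.

```python
import pandas as pd

# Four invented hourly temperatures starting at midnight
idx = pd.date_range("2016-05-02 00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Temperature": [9.1, 12.3, 15.0, 11.2]}, index=idx)

# Sort warmest first, then read the index label of the top row
warmest = toy.sort_values(["Temperature"], ascending=False).index[0]
print(warmest)

# idxmax returns the index label of the largest value directly
print(toy["Temperature"].idxmax())
```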

Let’s continue this approach to identify what the actual warmest temperature was and then print this nicely with the date/time:

In [None]:
#use warm_dt to get temperature at this time
warm_temp = df_num.Temperature[warm_dt]
print("The highest temperature of {0} degC was observed at {1}".format(warm_temp,warm_dt))

This is good, but what if we don’t like the format the date and time are given by the `DateTimeIndex` we set? If we import the datetime package we can use the `strftime` method to convert to other formats:

In [None]:
import datetime
warm_date = warm_dt.strftime("%d-%m-%Y")
warm_time = warm_dt.strftime("%H.%M")
print("The highest temperature of {0} degC was observed at {1} on {2}".format(warm_temp, warm_time, warm_date))

You can find the full list of date and time format strings [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior). 
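A quick sketch of how the format codes change the text produced, using an invented date and time rather than a value from the data.

```python
import pandas as pd

# An invented timestamp for illustration
stamp = pd.Timestamp("2016-07-19 15:00")

date_text = stamp.strftime("%d-%m-%Y")      # day-month-year
time_text = stamp.strftime("%H.%M")         # 24-hour clock with a dot separator
month_text = stamp.strftime("%B %Y")        # full month name and year

print(date_text)     # 19-07-2016
print(time_text)     # 15.00
print(month_text)    # July 2016 (month name depends on locale; English assumed here)
```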

#### Task
Identify the date and time of the record with the lowest air pressure in a similar way to the code above, and print nicely with the minimum pressure value.

## Time Series Plots
Now we are back to thinking about dates and times we can return to thinking about our data as a time series and how we might visualise and summarise this. Plotting our time series is very easy using the pandas `plot` [method](http://pandas.pydata.org/pandas-docs/stable/visualization.html#basic-plotting-plot).

For example:

In [None]:
#plot the time series
df_num['Temperature'].plot()
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Note how pandas has recognized this is a time series (because of the `DateTimeIndex`) and formatted the x-axis labels nicely for us (we needed to add the y-axis label ourselves on the second line).

Let's create a subset of the data so we can view our data at a higher temporal resolution. For example, let's look at the month of May:

In [None]:
df_num_may = df_num.loc['2016-05']     #create a new df for May only (use .loc to select rows by date)

df_num_may['Temperature'].plot()
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

There are two main things to note here:

1. We can see the diurnal cycle of change in temperature
2. We can see there are missing data
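
The May subset above relies on what pandas calls partial string indexing: with a `DatetimeIndex`, a date string that names only a year, month or day selects every row falling inside that period. The sketch below shows the idea on a synthetic hourly DataFrame; the name `toy` and its values are made up for illustration and are not the Heathrow data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for all of 2016 (a leap year), not the Heathrow data
idx = pd.date_range("2016-01-01", periods=366 * 24, freq=pd.offsets.Hour())
toy = pd.DataFrame({"Temperature": np.arange(len(idx), dtype=float)}, index=idx)

may = toy.loc["2016-05"]                      # every row in May
one_day = toy.loc["2016-05-14"]               # every row on a single day
week = toy.loc["2016-05-09":"2016-05-15"]     # a slice; both end dates are included in full

print(may.shape)
print(one_day.shape)
print(week.shape)
```

Using `.loc` makes it explicit that rows (not columns) are being selected.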

### Moving Average
A useful way to ‘smooth’ the time series so we can see more general patterns is to use a moving average (or rolling mean; for a refresher [read](http://www.emathzone.com/tutorials/basic-statistics/method-of-moving-averages.html) or [watch](https://www.youtube.com/watch?v=HkoyhK0swPk) here). 

We can use the pandas `rolling` [method](http://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-windows) to calculate our moving average:

In [None]:
myWindow = 24                                                           
mw = df_num_may['Temperature'].rolling(window=myWindow,center=True).mean()
print(type(mw))

The first line above creates a variable to hold the size of the window we want to use to calculate our running mean. We then pass this to the `rolling` method applied to the Series we want. The object produced (`mw`) is another Series of values which we can add to our plot with the raw data:

In [None]:
#plot raw data with running mean
df_num_may['Temperature'].plot()
mw.plot(style='-r', label = "{0}hr Moving Window".format(myWindow))
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Note how the code above specifies a different colour for the moving average and uses the matplotlib `legend` method to determine where the legend should be placed, using the `loc` argument – details [here](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend) and guide [here](http://matplotlib.org/users/legend_guide.html). Also note how string formatting is used to automatically include the size of the moving window.
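
To see what the `center=True` argument used in the rolling mean above is doing, here is a minimal sketch on a tiny made-up Series (the names `s`, `centred` and `trailing` are hypothetical). A centred window places each mean in the middle of its window, whereas the default places it at the window's last position; in both cases positions without a full window are left as `NaN`.

```python
import pandas as pd

# Tiny made-up series so the arithmetic is easy to check by hand
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

centred = s.rolling(window=3, center=True).mean()   # mean labelled at the window's middle
trailing = s.rolling(window=3).mean()               # mean labelled at the window's end

print(centred.tolist())
print(trailing.tolist())
```

For a plot such as the one above, the centred version keeps the smoothed line aligned in time with the raw data rather than lagging behind it.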

### Interpolation
To deal with the second observation (missing data), one thing we can do to fill gaps (i.e. fill the ‘no data’ rows with numeric values) is to interpolate values from either side of the gap. The pandas package has some [useful interpolation methods](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation). We’ll use the `interpolate` method to fill the gaps in our weather data, creating a new DataFrame to hold the new data:

In [None]:
df_num_int = df_num.interpolate()
df_num_int_may = df_num_int.loc['2016-05']
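
With no arguments, `interpolate` fills each gap by drawing a straight line between the values on either side of it. The sketch below illustrates this on a six-value made-up Series (not the Heathrow data; the names `s`, `filled` and `capped` are hypothetical) and also shows the optional `limit` argument, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Made-up hourly values with two gaps (not the Heathrow data)
idx = pd.date_range("2016-05-01", periods=6, freq=pd.offsets.Hour())
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 12.0], index=idx)

filled = s.interpolate()            # default is linear interpolation across each gap
capped = s.interpolate(limit=1)     # fill at most one consecutive missing value per gap

print(filled.tolist())
print(capped.tolist())
```

A straight line across a long gap cannot reproduce a daily cycle, so a cap like this is one possible way to leave the longest gaps unfilled.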

Let’s check exactly what this interpolation method is doing by looking at line plots of the Temperature variable after the interpolation (for May) and comparing it to the raw data above:

In [None]:
df_num_int_may['Temperature'].plot()

It's not perfect, and we can see that for some of the bigger gaps the diurnal cycle is not well represented in the interpolated values. However, it's good enough for now (especially if we were to do this for the entire year of data), so let's re-do our running mean with these data and see what it produces.

In [None]:
#replot
df_num_int['Temperature'].plot()
myWindow = 24   
mw = df_num_int['Temperature'].rolling(window=myWindow,center=True).mean()
mw.plot(style='-r', label = "{0}hr Moving Window".format(myWindow))

plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Looking good!

## Correlation
The pandas library has a set of [computational methods](http://pandas.pydata.org/pandas-docs/stable/computation.html) that can be applied to DataFrames and Series. These include the `cov` and `corr` methods, which calculate covariance and correlation for all pairs of variables in a DataFrame, producing a Covariance Matrix and a Correlation Matrix respectively:

In [None]:
#covariance
covmat = df_num.cov()
print("Covariance matrix:\n{0}\n".format(covmat))

#Pearson correlation
corrmat = df_num.corr()
print("Pearson correlation coefficient matrix:\n{0}\n".format(corrmat))

#Spearman correlation
corspmat = df_num.corr(method = "spearman")
print("Spearman rank correlation coefficient matrix:\n{0}\n".format(corspmat))

#### Task 
Run the code above now, take a look at the output, and do two things:

1. Check you understand the structure of the matrices
2. Look back to your lecture notes and check you understand why the numbers are different between the matrices

#### Task
Based on the distribution of data for the variables (assessed via plots), which correlation coefficient is likely to be most useful for each of the following pairs? Edit this block to give your answer to two decimal places.

i) Temperature vs WindSpeed

A: 

ii) Humidity vs Visibility

A: 

### Plotting Correlation
Hopefully, to answer the previous question you referred back to one of the pairplots above. You might also have used a `jointplot` – this is similar to a pairplot but is just for two variables:

In [None]:
#jointplots
sb.jointplot(x="Visibility", y="Humidity", data=df_num) 

Note that by default the Pearson correlation coefficient is presented; you can prevent this using the `stat_func` argument (`stat_func = None`), or use the same argument to specify a different function. For example, if you want to display the Spearman coefficient you need to import the `scipy.stats` library and use `spearmanr`:

In [None]:
from scipy.stats import spearmanr
sb.jointplot(x="Visibility", y="WindSpeed", kind = 'hex', data=df_num, stat_func=spearmanr)

Also see how the `kind` argument allows you to change how the data points are presented.
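
The `spearmanr` function can also be called directly, which may be useful if your installed version of seaborn does not accept `stat_func` (we understand the argument was dropped in releases later than the 0.7.0 used here). The sketch below uses randomly generated data with hypothetical column names `x` and `y`, not the weather variables. It drops missing values first because, by default, `spearmanr` returns `nan` if any are present.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Randomly generated illustration data (not the Heathrow data)
rng = np.random.RandomState(0)
toy = pd.DataFrame({"x": rng.uniform(1, 10, 200)})
toy["y"] = np.exp(toy["x"] / 2.0) + rng.normal(0, 1, 200)
toy.loc[5, "y"] = np.nan              # introduce one missing value

clean = toy.dropna()                  # remove rows with missing values first
rho, p_value = spearmanr(clean["x"], clean["y"])

print("Spearman rho = {0:.2f} (p = {1:.3g})".format(rho, p_value))
```

The coefficient should match what `clean.corr(method="spearman")` reports for the same pair; the p-value is extra information that the pandas method does not provide.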

Another way to visualise your correlation matrices is to use a `heatmap` plot:

In [None]:
sb.heatmap(corspmat, vmax=.8, square=True)
plt.title("Spearman Rank Correlation")

Check you understand how this is presenting your correlation matrix data.

### Autocorrelation
There's no need to go into the details of Autocorrelation here (and you can skip straight to the Exercises if you like), but note that Pandas has functionality to investigate this. For example to produce a [lag plot](http://pandas.pydata.org/pandas-docs/stable/visualization.html#lag-plot):   

In [None]:
from pandas.plotting import lag_plot    #pandas.tools.plotting is deprecated from pandas 0.20
lag_plot(df_num_int['Temperature'], lag = 1)

Or for an [autocorrelation plot](http://pandas.pydata.org/pandas-docs/stable/visualization.html#autocorrelation-plot):

In [None]:
from pandas.plotting import autocorrelation_plot    #pandas.tools.plotting is deprecated from pandas 0.20
autocorrelation_plot(df_num_int['Temperature'])

Alternatively, `acorr` [from matplotlib](http://matplotlib.org/examples/pylab_examples/xcorr_demo.html) provides more control for an autocorrelation plot:

In [None]:
plt.acorr(df_num_int['Temperature'], maxlags = 128, linestyle = "solid", usevlines = False, normed=True)
plt.show()
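
If you would like numbers rather than a plot, the pandas Series method `autocorr` returns the correlation between a Series and a copy of itself shifted by a given lag. The sketch below applies it to an idealised, made-up temperature signal with a perfect 24-hour cycle (hypothetical name `toy`, not the Heathrow data). For such a signal we would expect strong positive correlation at a lag of 24 hours and strong negative correlation at 12 hours.

```python
import numpy as np
import pandas as pd

# Idealised made-up signal: ten days of hourly values with a perfect daily cycle
hours = np.arange(24 * 10)
toy = pd.Series(15 + 5 * np.sin(2 * np.pi * hours / 24.0))

# Correlation of the series with itself shifted by 1, 12 and 24 hours
for lag in (1, 12, 24):
    print("lag {0:>2}: {1:.2f}".format(lag, toy.autocorr(lag=lag)))
```

Real observations are noisier, so the values for the interpolated Heathrow temperatures will not be this clean.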

## Exercises

The exercises below require you to apply some of the techniques presented above to the NS-SeC data set. As the NS-SeC data are not temporal, the exercises using them will not address time series plotting, which will instead be practised with some more examples from the Met Office data.

**Ensure the NS-SeC data are downloaded and saved in the same directory as this notebook** and run the next two lines to load the data into memory. Then complete the tasks.

In [None]:
ns_df = pd.read_csv("Data_NSSHRP_UNIT_URESPOP.csv", header=0, skiprows=[1], usecols=range(0,14))   #read csv with headers, skipping notes row and no data column 15
ns_df.columns = ["CDU_ID","GEO_CODE","GEO_LABEL",      
              "GEO_TYPE", "GEO_TYP2", "Total",        
              "Group1","Group2","Group3","Group4",    
              "Group5","Group6","Group7","Group8"]


### Task 1
Create a new data frame from `ns_df` that contains only Series with integer data type. Name this new dataframe `ns_df_num` and drop `CDU_ID` and `Total` series from it.  

### Task 2 
From `ns_df_num` create a boxplot for Population grouped by NS-SeC Group. Provide appropriate axis labels and tick marks so that the plot looks like the image below. Write the plot to an image file named 'NS-SeC_Population_Boxplot.png'.

![Task 2 output](https://kingsgeocomputation.files.wordpress.com/2016/11/geocomp2016_week8_ns-secboxplot1.png)

### Task 3
Create a pairplot from `ns_df_num`

### Task 4
Create a correlation matrix to identify the pair of Groups in the NS-SeC data that have the strongest correlation. You will need to decide what correlation coefficient is appropriate given the distribution of data observed in the pairplot you created in the previous task. Edit this text block to provide your answer.

Answer: 

### Task 5
Create a heatmap to verify your answer to the previous task. What do you notice about patterns of correlation between groups? Are there any sharp changes in correlation? Why? (Provide your answer below by editing this text block.)

A: 

### Task 6
Create a boxplot to check if there is any relationship between wind direction and temperature. Rotate the x-axis labels so they are vertical. Your boxplot should look something like that shown below. 

![Task 6 output](https://kingsgeocomputation.files.wordpress.com/2016/11/geocomp2016_week8_winddtemp_boxplot.png)

### Task 7
Create a time series plot of humidity for July 2016 with a 48 hour moving window imposed on the raw data.


### Credits!

#### Contributors:
The following individuals have contributed to these teaching materials: James Millington (james.millington@kcl.ac.uk)

#### License
These teaching materials are licensed under a mix of the MIT and CC-BY licenses...

#### Acknowledgements:
Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

#### Potential Dependencies:
This notebook may depend on the following libraries: datetime (?), matplotlib (1.5.1), os (?), pandas (0.20.3), scipy (0.19.0), seaborn (0.7.0)
