<h1><center>7SSG2059 Geocomputation 2016/17</center></h1>
<h1><center>Practical 8: Correlation and Time Series</center></h1>
<p><center><i>James Millington, 12 November 2016</i></center>

## Overview
This week we will work mainly with the Met Office data introduced in Practical 5. The Exercises at the end will require you to use some of the techniques learned in the main part of the practical and apply them to NS-SeC data. 

You should create a temporary working directory (folder) on the Desktop of your operating system with the name _GeocompPracX_, where _X_ is the week number. So for example, for today’s practical you should create a temporary working directory named _GeocompPrac8_ on the Desktop. 

The temporary folder you have created will be the ‘working directory’ for this session of work. Once you have finished this session you can move this directory anywhere you like, but bear in mind this will change the path of the directory. You can think of the path of a directory (or a file) as an address that locates it within the directory structure of the computer's hard drive.

Download the `MetOfficeData.csv`, `WeatherTypes.csv` and `Data_NSSHRP_UNIT_URESPOP.csv` files from the Week 8 section and move them to your Week 8 working directory.

Run the code provided for you, add code to code blocks if they are empty, and edit text blocks to provide answers to questions. Save your notebook frequently. Answers and solutions will be provided before next week's class.  

# Met Office Weather Data

## Preparing the Data
Before we can start analysing the Met Office data we need to ensure that it has been formatted and cleaned - which is what Practical 7 was all about. If you didn't complete Practical 7 there's no need to work through it now, but you should go back to it to check you understand the concepts at a later date.

So you have two options to get started with the Met Office data:   
1. Run the code from Practical 7 on the original .csv data to format and clean again 
2. Load the data you pickled at the end of Practical 7 directly into memory (you can only do this if you did complete Practical 7)

The code for option 1 is provided in a script file on KEATS but also provided in the code block below. If you want to use **option 1** (which you _must_ if you didn't complete Practical 7) run the code below: 

In [None]:
import pandas as pd
import os

#set the path to the data directory (only use this if the csv file is NOT saved in the same directory as this notebook file)
#path = os.path.join(os.path.expanduser("~"),"Google Drive","Teaching","2016-17","Undergrad","Geocomp","Week7")
#os.chdir(path)

#read file
df = pd.read_csv("HeathrowWeather2016.csv", header=0)   

#drop duplicates
df = df.drop_duplicates()  

#set DateTimeIndex
dt = pd.to_datetime(df.Date + df.Time, format='%Y-%m-%d%H:%M')
df.index = dt

#then reindex to ensure a row for every hour
firstDate = df.index[0]   
lastDate = df.index[len(df.Date) - 1]
df = df.reindex(index = pd.date_range(start = firstDate, end = lastDate, freq = '1H'), fill_value= None)

#read WeatherTypes.csv, map labels and set NA
wt = pd.read_csv('WeatherTypes.csv', header=0)  
wt.index = wt['Value']  
df['WeatherType'] = df['WeatherType'].map(wt['Description'])
df.ix[df.WeatherType == "Not available", 'WeatherType'] = None
df.ix[df.WeatherType == "Not used", 'WeatherType'] = None 

#set appropriate data types (note NA not handled by integer so use float for all numerical)
for c in ['WindDirection','WeatherType','PressureTendency']:
    df[c] = df[c].astype('category')
for c in ['DewPoint','Humidity','Temperature','WindGust','Visibility']:
    df[c] = df[c].astype('float')
    
#check data!
print df.info()
print df.head()

Alternatively, the code for **option 2** (to read pickled data) is as follows:

In [None]:
import pandas as pd

df = pd.read_pickle("CleanedHeathrowData2016.pkl")   #or change the filename to whatever you used when pickling
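
The reindexing step in option 1 is easier to follow on a toy example. The sketch below is not the Heathrow data: the three temperature values and the names `toy`, `full_index` and `toy_hourly` are made up for illustration, and the hourly frequency is passed as a `pd.Timedelta` rather than the `'1H'` string used above.

```python
import pandas as pd

# Toy record with the 02:00 observation missing (values are made up)
times = pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 03:00"])
toy = pd.DataFrame({"Temperature": [10.0, 11.0, 13.0]}, index=times)

# Build the complete hourly index between the first and last observation
full_index = pd.date_range(start=toy.index[0], end=toy.index[-1],
                           freq=pd.Timedelta(hours=1))

# Reindexing adds a row of missing values (NaN) for any hour with no record
toy_hourly = toy.reindex(full_index)
print(toy_hourly)
```

The reindexed frame has four rows, and the row for 02:00 holds a missing value. This is why the cleaned Met Office data contain 'no data' rows that we need to handle later with `dropna()`.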

## Exploratory Data Analysis
### Plotting Distributions
An important aspect of data to consider in initial data analyses is the distribution of data. Plotting histograms is particularly useful for variables of continuous data type. The code below prints a distribution plot for a single variable. Note how we add the `dropna()` method to ensure seaborn can handle the missing data. 

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt    #Plotting library used by seaborn, see http://matplotlib.org/users/pyplot_tutorial.html
%matplotlib inline  

fig = sb.distplot(df['Temperature'].dropna())   #print a distribution plot for a single variable (df_num is not created until the next code block)

To automate the plotting of histograms for the continuous data variables in the Met Office data, let’s create a subset of the Series that contain only numerical dtype variables. 

In [None]:
df_num = df.select_dtypes(include=['float64'])
df_num = df_num.drop('LocID', axis = 1)   #axis 1 is columns (Series)
print df_num.info()

The code above uses the `select_dtypes` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) to select only those types specified in the `include` argument (in this case a list of `dtypes` containing only `float64`). However, as `LocID` is a `float64` we need to remove it manually, using the `drop` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) on line two above (think about why it doesn't make sense to plot the location id for a single location).  
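
As a minimal sketch of the same two steps, the example below applies `select_dtypes` and `drop` to a tiny made-up DataFrame. The column values and the name `toy` are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: an identifier stored as a float, a measurement, and a text label
toy = pd.DataFrame({
    "LocID": [3772.0, 3772.0, 3772.0],
    "Temperature": [10.5, 11.0, 12.2],
    "WeatherType": ["Cloudy", "Sunny", "Cloudy"],
})

# Keep only the float64 columns (this drops the text column)
toy_num = toy.select_dtypes(include=["float64"])

# LocID is float64 but is not a measurement, so remove it by name from the columns
toy_num = toy_num.drop("LocID", axis=1)

print(list(toy_num.columns))
```

Only `Temperature` remains: `WeatherType` is excluded by its dtype and `LocID` is removed by name.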

We can now simply loop through the columns in this new DataFrame, plotting the distribution for each and saving them to individual files in the directory where your data are saved:

In [None]:
#loop to automatically plot multiple variables saving to file
for name in df_num.columns:    
    fig = plt.figure(name)                                                 #create a matplotlib figure with name specified by column 'name'
    fig = sb.distplot(df_num[name].dropna())                               #use seaborn to plot the distribution of the DataFrame column 'name' (added dropna as problem with Seaborn 0.7.1 recognising NAs)
    plt.savefig('{0}_Distribution.png'.format(name), bbox_inches='tight')  #save the figure to file
    plt.close()  

Check you understand how this loop is working and check the image files were created properly on your hard disk (no output will be produced in the notebook here, you'll need to check your hard disk with a file browser). Saving images to file like this is useful for when you want to insert a figure in a report or essay in Word (for example).

We could now look through each image file to compare distributions. However, this involves opening all the various files and switching between them. An easier way to quickly check and compare distributions for multiple variables in a data set is what is known in seaborn as a `pairplot`, which you can read about [here](http://seaborn.pydata.org/tutorial/distributions.html#visualizing-pairwise-relationships-in-a-dataset).

The code below creates the pairplot. When you run the code _be patient!_ It may take a little while for the pairplot to finish as there’s lots of data here to be plotted. As shown below, the pairplot is a set of scatter plots for each pair of variables with a histogram for each individual variable. Pretty nice. 

In [None]:
fig = sb.pairplot(df_num.dropna(axis = 0))   #note need to drop rows of data with no data

To save the pairplot as an image file on your hard disk (for example to include later in a report in Word) we need to add some additional code (using the [matplotlib](http://matplotlib.org/users/pyplot_tutorial.html) package):

In [None]:
fig = plt.figure('Pairplot')                         #create a plot object
fig = sb.pairplot(df_num.dropna(axis = 0))           #add the seaborn pairplot for our dataset
plt.savefig('Pairplot.png', bbox_inches='tight')     #save the plot with the given filename
plt.close()                                          #stop writing to the plot

#### Task
Edit this text block (double-click) to answer the following questions:

1)	Which variable has a distribution most like a normal distribution?

A: 

2)	Which variable(s) are heavily negatively skewed?

A: 

3)	How would you describe the distribution of the Wind Gust variable?

A: 


### Editing Variables
The `WindGust` and `Pressure` variables have quite different distributions from the others. For `WindGust` this is because of the nature of the physical variable, but for `Pressure` it's likely because of some erroneous values. 

Given the non-continuous nature of wind gusts, let's drop the `WindGust` variable from the dataframe of numerical variables using the `drop` method:  

In [None]:
df_num = df_num.drop('WindGust', axis = 1)   #axis 1 is columns (Series)

We specify `axis = 1` to tell Pandas to look for `WindGust` in the columns of the dataframe (`axis = 0` would be the rows).

Look back to your pairplot at the `Pressure` variable. The problem with `Pressure` appears to be that there are some zero values. An air pressure of zero does not seem feasible so let's convert any zero to 'missing data':

In [None]:
df_num.loc[df_num.Pressure == 0, 'Pressure'] = None   #select rows where Pressure is zero, but change only the Pressure column

Check you understand how the code above accesses only the zero values in the `Pressure` Series and sets them to missing values (`None`). 
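
To see why the column matters when assigning through `.loc`, here is a small sketch on made-up values. The names `toy` and `is_zero` are hypothetical, and `np.nan` is used as the missing value here (an assumption of this sketch; the notebook code uses `None`).

```python
import numpy as np
import pandas as pd

# Made-up data with one implausible zero pressure reading
toy = pd.DataFrame({"Pressure": [1012.0, 0.0, 1009.0],
                    "Temperature": [10.5, 11.0, 12.2]})

# Boolean mask: True for the rows where Pressure is zero
is_zero = toy["Pressure"] == 0

# Naming the column restricts the assignment to Pressure only;
# omitting it would blank every column in the matching rows
toy.loc[is_zero, "Pressure"] = np.nan

print(toy)
```

After the assignment the zero pressure is missing, while the temperature recorded in the same row is untouched.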

#### Task
Plot the distribution of the `Pressure` variable to check that the zeros have been removed and to see how the variable is distributed. Use the following code as a starter, changing `???`, then run the code.

In [None]:
fig = sb.???(df_num['???'].dropna())   

### Pairplot
#### Task
Create a pairplot of the variables in the `df_num` dataframe to see the effect of removing `WindGust` and setting zeros in `Pressure` to no data. Use code from above to help you. 

#### Task
Looking at your pairplot above answer the following questions (edit this text block):

Q: Do you think there are any clear relationships between the raw values of these variables?

A: 

Q: What might be the reasons for the lines of data points in the `Visibility` data (and to some extent the lines in the `WindSpeed` data)?

A: 

### Boxplot
Another familiar way to visualise distributions is the boxplot. For example, maybe we want to know how wind speed varies for observations with and without gusts of wind. To do this, we can first create a new variable in our original dataframe to classify observations depending on whether they were for an hour with a wind gust or not: 

In [None]:
df['WindGustBin'] = pd.Series(df['WindGust'] > 0)        #create a boolean dtype

The code above creates a new Pandas Series named `WindGustBin`, populated by `True` or `False` values by evaluating the conditional in the (). This new series has a boolean `dtype` as we can see from the information about the data frame:

In [None]:
print df.info()
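
One detail worth checking is what the comparison does with missing values. The sketch below uses four made-up gust readings (the names `gust` and `gust_bin` are hypothetical) to show that a comparison involving a missing value evaluates to `False`.

```python
import numpy as np
import pandas as pd

# Made-up gust readings: no gust, a gust, a missing record, another gust
gust = pd.Series([0.0, 35.0, np.nan, 41.0])

# Element-wise comparison returns a boolean Series of the same length
gust_bin = gust > 0

print(gust_bin.dtype)
print(gust_bin.tolist())
```

The missing record is classified as `False`, so in a plot grouped by this variable, hours with no gust record are counted together with hours that had no gust.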

Now we can use this new variable to classify values in a box plot: 

In [None]:
ax = sb.boxplot(x="WindGustBin", y="WindSpeed", data=df, palette="deep", linewidth=1)

plt.subplots_adjust(bottom=0.15)                                                  #http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplots_adjust
plt.xlabel('Wind Gust')
plt.xticks([0,1], ['No', 'Yes'])
plt.ylabel('Wind Speed (mph)')
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label]):                         #http://stackoverflow.com/a/14971193
    item.set_fontsize(20)
for item in (ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(14)

If you want to save this figure to an image file on your hard disk, you would add the following code to the lines above:

In [None]:
plt.savefig('WindGustBoxplot.png', bbox_inches='tight')
plt.close()

_Note: if you run the two lines of code above alone and not immediately following the plotting code (e.g. in the same code block in a notebook) your image file will be empty!_

### Extracting Date and Times
In the data preparation above (and in Week 7), right after we loaded the data into our DataFrame, we set the DataFrame index to a `DateTimeIndex` object. This allowed us to select observations based on their date/time. We can also use the `DateTimeIndex` in reverse, to find the date/time of particular observations. 

For example, let’s say we want to find the date/time of the warmest observation in the data set. We can use the `sort_values` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) and then access the index of the top record:

In [None]:
#get date and time of warmest recording
warm_dt = df_num.sort_values(['Temperature'], ascending = False).index[0]
print warm_dt

Because the index of our DataFrame is a `DateTimeIndex` here it returns us the date/time of the record. 
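
The same idea on a three-row toy DataFrame (made-up temperatures; the names `toy`, `warmest` and `same` are hypothetical) shows the sort-then-take-first-index pattern. It also shows that the pandas `idxmax` method, which is not used elsewhere in this practical, gives the same answer directly.

```python
import pandas as pd

# Made-up afternoon temperatures on three consecutive days
times = pd.to_datetime(["2016-07-18 15:00", "2016-07-19 15:00", "2016-07-20 15:00"])
toy = pd.DataFrame({"Temperature": [28.1, 33.4, 30.2]}, index=times)

# Sort from warmest to coolest, then read the index label of the first row
warmest = toy.sort_values(["Temperature"], ascending=False).index[0]

# idxmax returns the index label of the maximum value without sorting
same = toy["Temperature"].idxmax()

print(warmest)
```

Both approaches return the timestamp for 15:00 on 19 July 2016, the row holding the largest value.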

Let’s continue this approach to identify what the actual warmest temperature was and then print this nicely with the date/time:

In [None]:
#use warm_dt to get temperature at this time
warm_temp = df_num.Temperature[warm_dt]
print "The highest temperature of {0} degC was observed at {1}".format(warm_temp,warm_dt)

This is good, but what if we don’t like the format in which the date and time are given by the `DateTimeIndex` we set? If we import the datetime package we can use the `strftime` method to convert to other formats:

In [None]:
import datetime
warm_date = warm_dt.strftime("%d-%m-%Y")
warm_time = warm_dt.strftime("%H.%M")
print "The highest temperature of {0} degC was observed at {1} on {2}".format(warm_temp, warm_time, warm_date)

You can find the full list of date and time format strings [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior). 
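
As a quick illustration of how format strings map to output, the sketch below formats a single made-up timestamp in three ways. The name `stamp` and the chosen date are assumptions for the example.

```python
import pandas as pd

# A single made-up timestamp to format
stamp = pd.Timestamp("2016-07-19 15:00")

as_date = stamp.strftime("%d-%m-%Y")          # day-month-year
as_time = stamp.strftime("%H.%M")             # 24-hour clock with a dot separator
as_both = stamp.strftime("%Y/%m/%d %H:%M")    # year first, then time

print(as_date)   # 19-07-2016
print(as_time)   # 15.00
print(as_both)   # 2016/07/19 15:00
```

Each `%` code is replaced by one component of the date or time, and any other characters in the format string (hyphens, dots, slashes) are copied through unchanged.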

#### Task
Identify the date and time of the record with the lowest air pressure in a similar way to the code above, and print nicely with the minimum pressure value.

## Time Series Plots
Now we are back to thinking about dates and times we can return to thinking about our data as a time series and how we might visualise and summarise this. Plotting our time series is very easy using the pandas `plot` [method](http://pandas.pydata.org/pandas-docs/stable/visualization.html#basic-plotting-plot).

For example:

In [None]:
#plot the time series
df_num['Temperature'].plot()
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Note how pandas has recognized this is a time series (because of the `DateTimeIndex`) and formatted the x-axis labels nicely for us (we needed to add the y-axis label ourselves on the second line).

Let's create a subset of the data so we can view our data at a higher temporal resolution. For example, let's look at the month of May:

In [None]:
df_num_may = df_num.loc['2016-05']     #create a new df for May only (.loc selects rows by their date label)

df_num_may['Temperature'].plot()
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

There are two main things to note here:

1. We can see the diurnal cycle of change in temperature
2. We can see there are missing data
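
The May subset above relies on selecting rows from a `DateTimeIndex` with a partial date string. A minimal sketch with a made-up hourly Series (the names `hours`, `toy` and `may` are hypothetical) shows what such a selection returns.

```python
import pandas as pd

# Made-up hourly index running from the end of April to the start of June
hours = pd.date_range(start="2016-04-30 00:00", end="2016-06-01 23:00",
                      freq=pd.Timedelta(hours=1))
toy = pd.Series(range(len(hours)), index=hours)

# A year-month string selects every row whose timestamp falls in that month
may = toy.loc["2016-05"]

print("{0} to {1}: {2} rows".format(may.index[0], may.index[-1], len(may)))
```

The result runs from 00:00 on 1 May to 23:00 on 31 May, giving 31 x 24 = 744 hourly rows. The April and June rows are excluded.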

### Moving Average
A useful way to ‘smooth’ the time series so we can see more general patterns is to use a moving average (or rolling mean; for a refresher [read](http://www.emathzone.com/tutorials/basic-statistics/method-of-moving-averages.html) or [watch](https://www.youtube.com/watch?v=HkoyhK0swPk) here). 

We can use the pandas `rolling` [method](http://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-windows) to calculate our moving average:

In [None]:
myWindow = 24                                                           
mw = df_num_may['Temperature'].rolling(window=myWindow,center=True).mean()
print type(mw)

The first line above creates a variable to hold the size of the window we want to use to calculate our running mean. We then pass this to the `rolling` method applied to the Series we want. The object produced (`mw`) is another Series of values which we can add to our plot with the raw data:

In [None]:
#plot raw data with running mean
df_num_may['Temperature'].plot()
mw.plot(style='-r', label = "{0}hr Moving Window".format(myWindow))
plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Note how the code above specifies a different colour for the moving average and uses the matplotlib `legend` method to determine where the legend should be placed, using the `loc` argument – details [here](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend) and guide [here](http://matplotlib.org/users/legend_guide.html). Also note how string formatting is used to automatically include the size of the moving window.
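
To check what `rolling` with `center=True` computes, it can help to try it on a few numbers small enough to verify by hand. The sketch below uses a window of 3 rather than 24 and a made-up Series; the names `s` and `smooth` are hypothetical.

```python
import pandas as pd

# Five made-up values, small enough to check by hand
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the value itself and one neighbour either side
smooth = s.rolling(window=3, center=True).mean()

print(smooth.tolist())
```

The middle three values are 2.0, 3.0 and 4.0 (for example, (1 + 2 + 3) / 3 = 2.0). The first and last values are missing because a full window is not available at the ends of the series. The same thing happens at the start and end of the 24-hour moving average, and wherever the window contains missing data.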

### Interpolation
To deal with the second point noted above (missing data), one thing we can do to fill gaps (i.e. fill the ‘no data’ rows with numeric values) is to interpolate from the values on either side of each gap. The pandas package has some [useful interpolation methods](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation). We’ll use the `interpolate` method to fill the gaps in our weather data, creating a new dataframe to hold the new data:

In [None]:
df_num_int = df_num.interpolate()
df_num_int_may = df_num_int.loc['2016-05']   #.loc selects rows by their date label
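
With no arguments, `interpolate` uses linear interpolation and treats the rows as equally spaced. A minimal sketch on made-up values (the names `s` and `filled` are hypothetical) shows the effect on a two-row gap.

```python
import numpy as np
import pandas as pd

# Made-up temperatures with a two-hour gap in the middle
s = pd.Series([10.0, np.nan, np.nan, 16.0])

# Default (linear) interpolation draws a straight line across the gap
filled = s.interpolate()

print(filled.tolist())
```

The gap is filled with 12.0 and 14.0, evenly spaced between the known values of 10.0 and 16.0. A straight line cannot reproduce a daily rise and fall in temperature, which is worth keeping in mind for longer gaps.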

Let’s check exactly what this interpolation method is doing by looking at line plots of the Temperature variable after the interpolation (for May) and comparing it to the raw data above:

In [None]:
df_num_int_may['Temperature'].plot()

It's not perfect and we can see that for some of the bigger gaps the diurnal cycle is not well represented in the interpolated values. However, it's good enough for now (especially if we were to do this for the entire year of data), so let's re-do our running mean with these data and see what it produces.

In [None]:
#replot
df_num_int['Temperature'].plot()
myWindow = 24   
mw = df_num_int['Temperature'].rolling(window=myWindow,center=True).mean()
mw.plot(style='-r', label = "{0}hr Moving Window".format(myWindow))

plt.ylabel('Temperature')    
plt.xlabel('Date')
plt.legend(loc = 0)

Looking good!

## Correlation
The pandas library has a set of [computational methods](http://pandas.pydata.org/pandas-docs/stable/computation.html) that can be applied to DataFrames and Series. This includes the `cov` and `corr` methods to calculate covariance and correlation for all pairs of variables in a DataFrame, known as a Covariance Matrix and a Correlation Matrix respectively:

In [None]:
#covariance
covmat = df_num.cov()
print "Covariance matrix:", '\n', covmat, '\n'

#Pearson correlation
corrmat = df_num.corr()
print "Pearson correlation coefficient matrix:", '\n', corrmat, '\n'

#Spearman correlation
corspmat = df_num.corr(method = "spearman")
print "Spearman rank correlation coefficient matrix:", '\n', corspmat, '\n'

#### Task 
Run the code above now and take a look at the output and do two things:

1. Check you understand the structure of the matrices
2. Look back to your lecture notes and check you understand why the numbers are different between the matrices
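
A toy example may help with the second point. In the sketch below (made-up data; the names `toy`, `pearson`, `spearman` and `rank_pearson` are hypothetical) `y` always increases with `x`, but not in a straight line. The Spearman coefficient is also reproduced by hand as the Pearson coefficient of the ranked values.

```python
import pandas as pd

# Made-up data: y always increases with x, but along a curve (y = x cubed)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

# Pearson measures how close the points are to a straight line
pearson = toy.corr().loc["x", "y"]

# Spearman measures how consistently the rank order is preserved
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is the Pearson coefficient calculated on ranks instead of raw values
rank_pearson = toy.rank().corr().loc["x", "y"]

print("Pearson: {0:.2f}, Spearman: {1:.2f}".format(pearson, spearman))
```

Spearman is 1 because the ordering of `y` matches the ordering of `x` exactly, whereas Pearson is below 1 because the relationship is curved. Each matrix is also symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.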

#### Task
Based on the distribution of data for the variables (assessed via plots), which correlation coefficient is likely to be most useful for each of the following pairs? Edit this text block to give your answer, reporting the coefficient to two decimal places. 

i) Temperature vs WindSpeed

A: 

ii) Humidity vs Visibility

A: 

### Plotting Correlation
Hopefully to answer the previous question you referred back to one of the pairplots above. You might also have used a `jointplot` – this is similar to a pairplot but is just for two variables:

In [None]:
#jointplots
sb.jointplot(x="Visibility", y="Humidity", data=df_num) 

Note that by default the Pearson correlation coefficient is presented; you can suppress this with the `stat_func` argument (`stat_func = None`) or use the same argument to specify a different function. For example, if you want to display the Spearman coefficient you need to import the `scipy.stats` library and use `spearmanr`:

In [None]:
from scipy.stats import spearmanr
sb.jointplot(x="Visibility", y="WindSpeed", kind = 'hex', data=df_num, stat_func=spearmanr)

Also see how the `kind` argument allows you to change how the data points are presented.

Another way to visualise your correlation matrices is to use a `heatmap` plot:

In [None]:
sb.heatmap(corspmat, vmax=.8, square=True)
plt.title("Spearman Rank Correlation")

Check you understand how this is presenting your correlation matrix data.

### Autocorrelation
There's no need to go into the details of Autocorrelation here (and you can skip straight to the Exercises if you like), but note that Pandas has functionality to investigate this. For example to produce a [lag plot](http://pandas.pydata.org/pandas-docs/stable/visualization.html#lag-plot):   

In [None]:
from pandas.tools.plotting import lag_plot
lag_plot(df_num_int['Temperature'], lag = 1)

Or for an [autocorrelation plot](http://pandas.pydata.org/pandas-docs/stable/visualization.html#autocorrelation-plot):

In [None]:
from pandas.tools.plotting import autocorrelation_plot
autocorrelation_plot(df_num_int['Temperature'])

Alternatively `acorr` [from matplotlib](http://matplotlib.org/examples/pylab_examples/xcorr_demo.html) provides more control for an autocorrelation plot:

In [None]:
plt.acorr(df_num_int['Temperature'], maxlags = 128, linestyle = "solid", usevlines = False, normed=True)
plt.show()
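
If you want a number rather than a plot, the pandas `autocorr` method (not used elsewhere in this practical) returns the correlation between a Series and a copy of itself shifted by a given lag. The sketch below applies it to two made-up Series; the names `trend`, `flip`, `r_trend` and `r_flip` are hypothetical.

```python
import pandas as pd

# Made-up series: one rising steadily, one alternating between two values
trend = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
flip = pd.Series([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Correlate each value with the value one step earlier
r_trend = trend.autocorr(lag=1)
r_flip = flip.autocorr(lag=1)

print("trend: {0:.2f}, flip: {1:.2f}".format(r_trend, r_flip))
```

The steadily rising series has a lag-1 autocorrelation of 1 (each value is perfectly predicted by the previous one), while the alternating series has a lag-1 autocorrelation of -1. A lag plot shows the same relationship as a scatter of each value against the previous value.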

## Exercises

The exercises below require you to apply some of the techniques presented above to the NS-SeC data set. As the NS-SeC data are not temporal, these exercises do not address time series plotting; that is instead practised in some further examples using the Met Office data.

**Ensure the NS-SeC data are downloaded and saved in the same directory as this notebook** and run the next two lines to load the data into memory. Then complete the tasks. 

In [None]:
ns_df = pd.read_csv("Data_NSSHRP_UNIT_URESPOP.csv", header=0, skiprows=[1], usecols=range(0,14))   #read csv with headers, skipping notes row and no data column 15
ns_df.columns = ["CDU_ID","GEO_CODE","GEO_LABEL",      
              "GEO_TYPE", "GEO_TYP2", "Total",        
              "Group1","Group2","Group3","Group4",    
              "Group5","Group6","Group7","Group8"]


### Task 1
Create a new data frame from `ns_df` that contains only Series with integer data type. Name this new dataframe `ns_df_num` and drop `CDU_ID` and `Total` series from it.  

### Task 2 
From `ns_df_num` create a boxplot for Population grouped by NS-SeC Group. Provide appropriate axis labels and tick marks so that the plot looks like the image below. Write the plot to an image file named 'NS-SeC_Population_Boxplot.png' 

![Task 2 output](https://kingsgeocomputation.files.wordpress.com/2016/11/geocomp2016_week8_ns-secboxplot1.png)

### Task 3
Create a pairplot from `ns_df_num`

### Task 4
Create a correlation matrix to identify the pair of Groups in the NS-SeC data that have the strongest correlation. You will need to decide what correlation coefficient is appropriate given the distribution of data observed in the pairplot you created in the previous task. Edit this text block to provide your answer.

Answer: 

### Task 5
Create a heatmap to verify your answer to the previous task. What do you notice about patterns of correlation between groups? Are there any sharp changes in correlation? Why? (provide your answer below by editing this text block)

A: 

### Task 6
Create a boxplot to check if there is any relationship between wind direction and temperature. Rotate the x-axis labels so they are vertical. Your boxplot should look something like that shown below. 

![Task 6 output](https://kingsgeocomputation.files.wordpress.com/2016/11/geocomp2016_week8_winddtemp_boxplot.png)

### Task 7
Create a time series plot of humidity for July 2016 with a 48 hour moving window imposed on the raw data.