<center><h1>7SSG2059 Geocomputation 2016/17</h1></center>

<center><h1>Practical 10b: Analysis of Relationships in NS-SeC and Values LSOA Data</h1></center>

<p><center><i>James Millington, 27 November 2016</i></center>


## Overview

Practical 10 is split into two notebooks - one examining relationships in Heathrow Weather and Air Quality data (10a) and one examining relationships in NS-SeC and house prices data (10b). The two notebooks are self-contained and can be used independently. You should decide which sets of data you most likely want to use for your final report, and work through the corresponding notebook during supervised practical time. This will give you the basics of analyses that you can then build on for your final report. Of course, you are welcome to work through both notebooks, although you are unlikely to be able to complete both during class time. 

## Helper Functions

Before getting to the data and code in this notebook, you should first run the code in the next three code blocks. These code blocks:

1. import packages required for functionality in the remainder of the notebook and set `matplotlib` font parameters 
2. define a function to help interpret OLS regression output
3. define a function to plot a histogram to file

Take a quick look at the code when running these blocks, but don't spend too much time as we will return to look at the function definitions more closely later in the notebook. 

In [None]:
#import packages required for functionality below and set matplotlib font parameters 
import os
import pandas as pd
import seaborn as sb      
import numpy as np
import matplotlib.pyplot as plt    #see http://matplotlib.org/users/pyplot_tutorial.html
import statsmodels.api as sm       #see http://statsmodels.sourceforge.net/stable/  

#set matplotlib font params
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['xtick.labelsize'] = 14
plt.rcParams['ytick.labelsize'] = 14

%matplotlib inline

In [None]:
#define function to help interpret OLS regression output
def mod_diagnostics(model, data):
    
    """
    Output to file model diagnostics for an OLS model
    
    Input:
        model - statsmodels.regression.linear_model.OLS object
        data  - pandas.DataFrame containing data for model
        
    Output:
        XX-XX-OLS_SampleXX_Summary.txt contains the model summary output
        XX-XX-OLS_SampleXX_ResidHist.png is histogram of the residuals
        XX-XX-OLS_SampleXX_StdResid.png is a plot of standardised residuals against fitted values
        
        if model is univariate: XX-XX_OLS_SampleXX_Regression.png is a scatter plot with regression line
        
    Requires:
        statsmodels.api
        pandas
        numpy
        matplotlib.pyplot
    """
    
    fitted = model.fit()
    dep = model.endog_names
    indep_names = ""
    
    #create a string containing list of indep names for output files
    for name in model.exog_names[1:]:            #we don't want 0 element as that is the intercept
        indep_names += "{0}_".format(name)


    #Want to include name of DataFrame in the output filename but currently DataFrame does not have a name attribute
    #So for now use nobs from fitted  (Dan potential solution: pass data in a dictionary and access the label)
    samplesize = str(int(fitted.nobs))
    
    f1 = open("{0}-{1}OLS_Sample{2}_Summary.txt".format(dep, indep_names, samplesize), "w")
    f1.write(fitted.summary().as_text())
    f1.close()

    #calculate standardized residuals ourselves
    fitted_sr = (fitted.resid / np.std(fitted.resid)) 

    #Histogram of residuals
    ax = plt.hist(fitted.resid)
    plt.xlabel('Residuals')
    plt.savefig('{0}-{1}OLS-Sample{2}_ResidHist.png'.format(dep, indep_names, samplesize), bbox_inches='tight')
    plt.close()

    #standardized residuals vs fitted values
    ax = plt.plot(fitted.fittedvalues, fitted_sr, 'bo')
    plt.axhline(linestyle = 'dashed', c = 'black')
    plt.xlabel('Fitted Values')
    plt.ylabel('Standardized Residuals')                
    plt.savefig('{0}-{1}OLS-Sample{2}_StdResid.png'.format(dep, indep_names, samplesize), bbox_inches='tight')
    plt.close()
  
    
    if(len(model.exog_names) == 2):  #univariate model (with intercept)
            
        indep = model.exog_names[1]
        
        #scatter plot with regression line 
        ax = plt.plot(data[indep], data[dep], 'bo')
        x = np.arange(data[indep].min(), data[indep].max(), 0.1)    #list of values to plot the regression line using
        plt.plot(x, fitted.params[1]*x + fitted.params[0], '-', c = 'black')  #plot a line using the standard equation with parms from the model
        
        plt.xlabel(indep)
        plt.ylabel(dep)                
        plt.savefig('{0}-{1}OLS_Sample{2}_Regression.png'.format(dep, indep, samplesize), bbox_inches='tight')
        plt.close()


In [None]:
#define function to plot histogram to file
def plot_hist(series):
    
    """
    Output to file a simple histogram
    
    Input:
        series - pandas.Series containing data (may also be able to take a numpy array)
        
    Output:
        XX-SampleXX-Hist.png - the histogram image
        
    Requires:
        pandas
        matplotlib.pyplot
    """

    out_name = "{0}-Sample{1}-Hist.png".format(series.name, len(series))
    plt.hist(series.dropna())
    plt.xlabel(series.name)
    plt.ylabel('Count')
    plt.savefig(out_name, bbox_inches='tight')           #save the figure
    plt.close()

## NS-SeC and Amenity Values

The additional data you can use in conjunction with the NS-SeC data are found in `LSOA_ValuesData_London.csv` on KEATS. There are a variety of additional factors that you are free to explore, and you can read about them in the `AdditionalDataOverview.pdf` document also on KEATS. Smith (2010) used similar data in their study which will also likely help you to think about possible analyses you might make for your final report (e.g. between house prices and socio-economic indicators of LSOAs). 

These data are for housing and other amenity values for LSOAs in London. Consequently, we'll also use only NS-SeC data for London from now on - LSOA NS-SeC data for London only can be found in `Data_NSSHRP_UNIT_URESPOP_London.csv` on KEATS.

In Practical 9 we used code to load the two data files for London LSOAs into memory as pandas `dataframes`, tidied up their column names and dropped rows with missing data. This code is copied in the next code block - you will need to run this code if you have not already done so to create the `LondonLSOAData.pkl` file. However, if you have already created `LondonLSOAData.pkl` you can skip that code and simply load the data into memory (following code block).

In [None]:
##ONLY run this code block if you have NOT already created LondonLSOAData.pkl in Practical 9

import pandas as pd

#read NS-SeC data
nsCN = ["CDU_ID","GEO_CODE","GEO_LABEL","F2084","F2085","F2094","F2102","F2107","F2114","F2119","F2127","F2133","F2136"]  
nsDF = pd.read_csv('Data_NSSHRP_UNIT_URESPOP_London.csv', header=0, skiprows=[1], usecols=nsCN)   #read csv with headers, skipping notes row and no data column 15
nsDF.columns = ["CDU_ID","GEO_CODE","GEO_LABEL","Total","Group1","Group2","Group3","Group4","Group5","Group6","Group7","Group8","NC"]  
nsDF = nsDF.dropna(axis = 0)  #drop rows with missing data

#read Additional Values Data
valCN = ["lsoa11cd","median_price","avg_distance_to_station","positive_area","moderate_area","negative_area"]
valDF = pd.read_csv('LSOA_ValuesData_London.csv', header=0, usecols = valCN)  
valDF = valDF.dropna(axis = 0)  #drop rows with missing data

#rename 'lsoa11cd' to 'GEO_CODE'!
valDF.columns = ["GEO_CODE","MedPrice","MeanStationDist", "PosArea", "ModArea", "NegArea"]  

#merge the two data frames 
nsvalDF = pd.merge(nsDF, valDF, on = 'GEO_CODE')

#write data to file
nsvalDF.to_csv("LondonLSOAData.csv")
nsvalDF.to_pickle("LondonLSOAData.pkl") 
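
Note that `pd.merge` performs an inner join by default, so any LSOA whose `GEO_CODE` appears in only one of the two files is dropped from the merged DataFrame. A minimal sketch with made-up codes and values (the toy frames and their contents are illustrative assumptions, not the KEATS data):

```python
import pandas as pd

# toy stand-ins for nsDF and valDF (codes and values are invented for illustration)
toy_ns = pd.DataFrame({"GEO_CODE": ["A1", "A2", "A3"], "Group1": [10, 20, 30]})
toy_val = pd.DataFrame({"GEO_CODE": ["A2", "A3", "A4"], "MedPrice": [250000, 400000, 320000]})

# default how='inner': keep only codes present in BOTH frames
toy_merged = pd.merge(toy_ns, toy_val, on="GEO_CODE")
print(toy_merged)

# comparing row counts before and after is a quick check on how much data a merge drops
print(len(toy_ns), len(toy_val), len(toy_merged))
```

Comparing the lengths of the input and merged DataFrames in this way is one quick check that a merge has not silently discarded more rows than expected.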

In [None]:
## run this code if you HAVE already created LondonLSOAData.pkl
#import pandas as pd                            #already imported above but would be needed otherwise
nsvalDF = pd.read_pickle("LondonLSOAData.pkl")  #assumes file is saved in the same folder as this notebook file

Note: the previous code block assumes `LondonLSOAData.pkl` is saved in the same folder as this notebook file, but it is possible to read (and write) data to other folders by specifying the 'path' we want to use. The following code block shows one way to do this (as James discussed in the Week 9 lecture). **ONLY** run the next code block if you want to read data from a location other than the folder in which this notebook file is saved - it is more for your information for future use. 

In [None]:
##ONLY run this code if you want to read data from a location other than the folder in which this notebook file is saved
#import os             #already imported above but would be needed otherwise

#set the path to the directory where we want to read and save from
path = os.path.join(os.path.expanduser("~"),"Google Drive","Teaching","2016-17","Undergrad","Geocomp","Week9")
os.chdir(path)

#the following line would now read the pkl file from my Week 9 folder (specified above)
#nsvalDF = pd.read_pickle("LondonLSOAData.pkl")
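
An alternative to changing the working directory is to build the full file path and pass it straight to the read or write call. A minimal sketch, assuming a temporary folder stands in for your own data folder (`data_dir`, `pkl_path` and `toy_df` are hypothetical names used only for illustration):

```python
import os
import tempfile
import pandas as pd

# a temporary folder stands in for wherever your data are stored
data_dir = tempfile.mkdtemp()

# os.path.join builds a path with the correct separator for the operating system
pkl_path = os.path.join(data_dir, "ToyData.pkl")

toy_df = pd.DataFrame({"GEO_CODE": ["A1", "A2"], "MedPrice": [250000, 400000]})
toy_df.to_pickle(pkl_path)              # write to the folder without calling os.chdir
toy_loaded = pd.read_pickle(pkl_path)   # read back from the same full path
print(toy_loaded.equals(toy_df))
```

Passing a full path leaves the working directory unchanged, so other code cells that rely on relative filenames are unaffected.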

### House Price Data

As mentioned in lectures, House Price data are often heavily skewed with a long tail (many smaller values and very, very few large values).  

Check the distribution of House Price data in the `nsvalDF` DataFrame by creating two plots:

1.	One plot using the `plot_hist` function defined above (note that this function prints a histogram to an image file on disk)
2.	One plot using the seaborn `distplot` function

In [None]:
#your plotting code here


As you can see from your plots, our House Price data are indeed skewed. When working with House Price data we often use a transformed version of the data: the logarithm of the original values. One reason is that reducing the skew of the data overcomes some of the problems of heteroscedasticity; another is that linear regression models will generally fit better to more normally distributed data.

Let's create a new column in the `nsvalDF` DataFrame that contains the natural logarithm of House Price:

In [None]:
nsvalDF['LogMedPrice'] = np.log(nsvalDF["MedPrice"])

Now let's check the new distribution as we just did for the un-transformed data: 

In [None]:
plot_hist(nsvalDF["LogMedPrice"])  #uses function defined above (writes to file)

sb.distplot(nsvalDF["LogMedPrice"], kde = False)
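
To see why the log transformation helps, the skewness of a sample can be compared before and after transforming. The sketch below uses simulated log-normal values rather than the real house price data (the distribution parameters are arbitrary assumptions) together with the pandas `skew` method:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)          # fixed seed so the example is repeatable

# simulated right-skewed 'prices' (log-normal); parameters are arbitrary
toy_price = pd.Series(rng.lognormal(mean=12.5, sigma=0.5, size=5000), name="ToyPrice")
toy_log_price = np.log(toy_price)

# skewness near 0 indicates a roughly symmetric distribution
print("skew of raw values: {0:.2f}".format(toy_price.skew()))
print("skew of log values: {0:.2f}".format(toy_log_price.skew()))
```

The same two `skew` calls could be applied to `nsvalDF["MedPrice"]` and `nsvalDF["LogMedPrice"]` to put a number on what the histograms show.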

To explore how house price might be related to the NS-SeC groups let's create jointplots of the logarithm of Median House Price against the population of each NS-SeC group. The code below shows how you might do this efficiently for all groups with a loop (saving plots to image files on disk):

In [None]:
groups = ["Total","Group1","Group2","Group3","Group4","Group5","Group6","Group7","Group8"]

for group in groups:
    
    sb.jointplot(nsvalDF['LogMedPrice'], nsvalDF[group])                          
    plt.savefig('LogMedPrice_{0}_JointPlot.png'.format(group), bbox_inches='tight')      
    plt.close()                                                        

Looking at these jointplots we can see there is quite a strong positive relationship between Group1 population and the log-transformed house price data. This seems a good candidate for our first simple linear regression model.

### Fitting a Regression Model
To fit a regression we use the `OLS` function in the `statsmodels` package. We can use something like the following (note we imported `statsmodels.api` with alias `sm` above):

In [None]:
#create OLS object
logMP_G1_mod = sm.OLS.from_formula("LogMedPrice ~ Group1", data = nsvalDF) 
#fit the regression
fitted_logMP_G1_mod = logMP_G1_mod.fit()

Note how there are two steps to fitting the regression model. First, we create an OLS model object by specifying the 'formula' and the data to use - see that 'formula' does not use the `=` symbol and instead relates variables using `~`. 

The second line above then actually fits the regression model (using the `fit` method) and puts this in a 'fitted model' object. We can then get a summary of the regression model using the `summary` method with the fitted model object:

In [None]:
print(fitted_logMP_G1_mod.summary())

#### Task
Take a little while to check you understand what is happening in the code above, then answer the following questions from the output of the summary _[edit this text block to answer]_ :

Q1:	How much variation in LSOA log-transformed median house price is explained by the population of NS-SeC Group 1? 

**A1:**

Q2:	Does the confidence interval for the Group 1 parameter value encompass zero?

**A2:**

Q3:	Do you think the normality assumption of the residuals has been violated?

**A3:**

Q4:	How many observations were used to fit the model? 

**A4:**

Q5: Given your answer to Q4, do you care about your answer to Q3? _[Hint: think about what Lumley et al. (2002) discuss]_ 

**A5:**

Q6: Replace ??? to characterise the effect size of Group 1 population on Median House Price. _Hint: remember the dependent variable is log(price), read [this](http://stats.stackexchange.com/a/18639) CV answer and see Table 2 of Lin et al. (2013)_

**A6:** “For every one person more in Group 1, we would expect median house price to ??? [increase/decrease] by ??? %” 
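
When the dependent variable is a natural logarithm, a slope coefficient `b` corresponds to a percentage change of `100 * (exp(b) - 1)` in the untransformed variable for a one-unit increase in the predictor. A minimal sketch of this conversion, using a hypothetical coefficient value that is not taken from the model above:

```python
import numpy as np

b = 0.002                                  # hypothetical slope from a log(y) ~ x model

pct_change = 100 * (np.exp(b) - 1)         # exact percentage change in y per unit of x
approx_pct_change = 100 * b                # common approximation, reasonable when b is small

print("exact: {0:.4f}%  approx: {1:.4f}%".format(pct_change, approx_pct_change))
```

For small coefficients the two values are almost identical, which is why `100 * b` is often quoted directly; the difference grows as `b` gets larger.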



Given the number of observations we may not care about whether the residuals are normally distributed, but we do still need to check the assumption that the residuals have constant variance (i.e. that there is no heteroscedasticity). Note that the Durbin-Watson statistic in the summary is a test for autocorrelation in the residuals, not heteroscedasticity, so the most direct check here is to look at plots of the residuals. Rather than type out code to do this we can just use the `mod_diagnostics` function defined for you above.

### More Model Diagnostics

Go and look at the `mod_diagnostics` function now to check you understand it. Identify where it does the following:
- Fits the model passed to it 
- Writes the fit model summary to file
- Saves the histogram of the residuals to file using a string format
- Creates a scatter plot of standardized residuals
- Only creates a scatter plot with a regression line when the number of independent variables is equal to 1 (and think about why we don’t try to do this when we have more than one independent variable).
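
The output filenames in `mod_diagnostics` are built by joining the independent variable names and substituting them into a template with the string `format` method. A stand-alone sketch of that step, using made-up variable names and sample size in place of a real model object:

```python
# made-up stand-ins for model.endog_names, model.exog_names and fitted.nobs
dep = "LogMedPrice"
exog_names = ["Intercept", "Group1Pct", "Group3Pct"]
samplesize = str(4000)

# build "Group1Pct_Group3Pct_" by skipping element 0 (the intercept)
indep_names = ""
for name in exog_names[1:]:
    indep_names += "{0}_".format(name)

out_name = "{0}-{1}OLS_Sample{2}_Summary.txt".format(dep, indep_names, samplesize)
print(out_name)
```

Knowing how the name is assembled makes it easier to find the right output file on disk once you have fitted several models.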

Now we can use the `mod_diagnostics` function to create diagnostic plots for the `logMP_G1_mod` model. Remember two things about the `mod_diagnostics` function:
1. it takes an OLS model object as an argument, NOT a fitted model object
2. it writes its output to file (so you'll need to check your hard disk for output) 

In [None]:
mod_diagnostics(logMP_G1_mod, nsvalDF)

From the plots you've just created you should be able to see that there is no heteroscedasticity in the residuals. We should be pretty happy with this model!  Remember that by using the `mod_diagnostics` function the summary of the model has been saved to a text file for later consultation. 

### Improving the Model

Although we can be quite happy with the model we have fit, we should have a think about how we can improve it. For example, how could you improve your statement to answer Q6 above about effect size? Is _absolute_ number of people useful for comparing between LSOAs of differing population size? It would probably be better to consider the number of people in Group 1 as a percentage of the total population of the LSOA. Not only would this make comparison between LSOAs more intuitive but it may also improve the predictive power of the model.

#### Task 
Fit a second simple linear regression to predict log-transformed median house price from the percentage of people in NS-SeC Group 1 in an LSOA. 

Take the following steps:
1.	Create a column in `nsvalDF` containing the Group 1 population as a percentage of the Total LSOA population (create the new column as we did for `LogMedPrice`, and name it `Group1Pct` because code later in this notebook uses that name)
2.	Fit the model using code similar to that provided for you above in the _Fitting a Regression Model_ section
3.	Create diagnostic plots using the `mod_diagnostics` function (using the code in the _More Model Diagnostics_ section as a guide)

#### Task

Use your results from the last task to answer the following questions _[edit this text block]_:

Q7:	Does this model explain more or less variation in log-transformed median house price than the previous model? By how much? 

**A7: **

Q8:	Do you think this model violates the assumption of no heteroscedasticity in the residuals? 

**A8:  **

Q9:	Replace ??? to explain the effect size:

**A9:** “For every one percent more people in Group 1, median house price ??? [increases/decreases] by ??? %”


### Multivariate Model

Another way we might improve our model to gain even more explanatory power is to add a second independent variable (making it a multivariate model). For example, maybe we can use the additional Group percentages to improve the model. 

To do this we first need to create additional columns in our DataFrame containing Group percentages. The quickest way to do this is a loop:

In [None]:
groups = ["Group2","Group3","Group4","Group5","Group6","Group7","Group8"]
for group in groups:
    #derive each group's population as a percentage of the Total population
    nsvalDF[group+'Pct'] = 100 * (nsvalDF[group]/nsvalDF["Total"])

Including additional predictors is pretty straightforward to do with the formula syntax of the `statsmodels` package:

In [None]:
logP_G1pct_G3pct_mod = sm.OLS.from_formula("LogMedPrice ~ Group1Pct + Group3Pct", data = nsvalDF)  


#### Task

After fitting the `logP_G1pct_G3pct_mod` model created above (remember to use the `fit` method), answer the following questions _[edit this text block]_:

Q10: Does the multivariate model improve the amount of variance explained? 

**A10: **

Q11: Is the relationship between Group 3 population and house price positive or negative? Explain why this might make sense. 

**A11: **

Q12: Are you confident both independent variables in this model have an effect on the dependent variable? Why? 

**A12: **


## Final Project

Think about what the results above tell us about how the house prices of an area are related to the socio-economic composition of that area.

Start thinking about your final project and what data you might analyse for it. 


### References
- Lin et al. (2013) Too Big to Fail: Large Samples and the p-Value Problem _Information Systems Research_ 24 906–917 DOI: [10.1287/isre.2013.0480](http://dx.doi.org/10.1287/isre.2013.0480)
- Smith, D. (2010) _Valuing housing and green spaces: Understanding local amenities, the built environment and house prices in London._ London: Greater London Authority. 