# Introduction
The purpose of this notebook is to examine the Ames Home Prices dataset and identify the most likely features and feature transformations to support a regression model capable of estimating home prices.

The first step is to look at what's available in the training data.  Each feature is described in the "Data" section of the Kaggle competition site, but let's read the dataset into a pandas DataFrame and look at the column headers.

In [None]:
# Import the various libraries we'll be using in this notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import statsmodels.api as sm

# Graphics libraries:
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Input data files are available in the "../input/" directory.
df = pd.read_csv('../input/train.csv')

print('In the Ames Home Prices training set, there are: ')
print('    ', df.shape[0], 'rows of data')
print('with  ', df.shape[1] , 'features for each row')
df.columns

So there are only 1460 rows of data in the training set, which isn't a lot in the data science world.  However, each row has 81 columns: an Id, the SalePrice target, and 79 descriptive features.  That's a lot of detail to consider!

# SalePrice - The Principal Feature
The variable we will ultimately try to predict is the Sales Price, so let's start by taking a close look at this feature.  The best place to start is to have pandas give us a description of the general statistical attributes of this variable:


In [None]:
df['SalePrice'].describe()

In [None]:
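# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(df['SalePrice'], kde=True);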


This looks a bit sharper and more skewed (to the right) than you would expect for a normal distribution.  "Kurtosis" measures how heavy the tails of a distribution are, which often shows up as a sharper central peak.  A normal distribution has a kurtosis of 3.0, but pandas reports *excess* kurtosis (kurtosis minus 3), so a normal distribution scores 0.0 there.  For a normal distribution, the skew would also be 0.0. A skewness value > 0 means that the right tail is longer, with most of the values bunched on the left.

Let's see how our Ames sales data compares to a normal curve.

In [None]:
print('Kurtosis = ', df['SalePrice'].kurtosis())
print('Skew     = ', df['SalePrice'].skew())

The Sales Price is indeed much more sharply peaked than "normal", and skewed significantly to the right.
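
As a rough calibration for these two statistics, the sketch below compares a synthetic normal sample with a synthetic right-skewed (log-normal) one. Neither sample comes from the Ames data, and the seed, sample size, and distribution parameters are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic samples for illustration only (not the Ames data)
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=100_000))

for name, s in [("normal", normal_sample), ("log-normal", right_skewed)]:
    # Series.kurtosis() uses Fisher's definition: a normal distribution scores ~0
    print(f"{name:>10}: skew = {s.skew():.3f}, excess kurtosis = {s.kurtosis():.3f}")
```

The normal sample should land near zero on both measures, while the log-normal sample, with its long right tail, should show a clearly positive skew and excess kurtosis.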

# Correlations
Now, for the most important analysis:  what is highly correlated with the sales price?  We can see this graphically with a correlation-matrix heat map, or by printing out the correlation values themselves.  Let's do both.  Note:  Concentrate on the last row of the heat map to see the correlations of all the other features with Sales Price.

In [None]:

sns.set(style="white")

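# Compute the correlation matrix (numeric columns only; text columns cannot be correlated)
corr = df.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle (np.bool was removed from NumPy; use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True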

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.15, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .75})
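
To make the masking step concrete, here is a minimal sketch on a made-up 3x3 symmetric matrix (the values are invented, not taken from the Ames data). It uses `np.triu`, which should give the same mask as the `np.triu_indices_from` approach in the cell above: `True` marks the cells the heat map hides.

```python
import numpy as np

# Small symmetric matrix standing in for a correlation matrix (made-up values)
toy_corr = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# True marks cells to hide: the diagonal and everything above it
toy_mask = np.triu(np.ones_like(toy_corr, dtype=bool))
print(toy_mask)

# Cells that remain visible: the strictly lower triangle, read row by row
visible = toy_corr[~toy_mask]
print(visible)
```

Because a correlation matrix is symmetric, hiding the upper triangle loses no information and makes the plot easier to read.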

In [None]:
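# Restrict to numeric columns so corr() does not fail on the text features
df.select_dtypes(include='number').corr()['SalePrice'].sort_values(ascending=False)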

The features most highly correlated with Sales Price include:
* **OverallQual**: Overall material and finish quality
* **GrLivArea**: Above-ground living area in square feet
* **GarageCars**: Size of garage in car capacity
* **GarageArea**: Size of garage in square feet
* **TotalBsmtSF**: Total square feet of basement area
* **1stFlrSF**: Area of the 1st floor in square feet
* **FullBath**: The number of full bathrooms above grade
* **TotRmsAbvGrd**: Total rooms above grade (not including bathrooms)
* **YearBuilt**: Original construction date
* **YearRemodAdd**: Remodel date

The two features with the most negative correlation to Sales Price (both weak in magnitude compared with the positive correlations above) are:
* **KitchenAbvGr**: Number of kitchens above grade
* **EnclosedPorch**: Enclosed porch area in square feet
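
Sorting by signed correlation pushes negative relationships to the bottom of the list regardless of their strength. One possible alternative is to rank by absolute value. The sketch below does this on toy data; the column names (`area`, `age`, `noise_col`, `price`) and the generating formula are hypothetical and only stand in for the Ames features.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns; "price" plays the role of SalePrice
rng = np.random.default_rng(1)
n = 500
area = rng.normal(1500, 400, n)
age = rng.normal(40, 15, n)
noise_col = rng.normal(0, 1, n)
price = 100 * area - 800 * age + rng.normal(0, 20_000, n)
toy = pd.DataFrame({"area": area, "age": age, "noise_col": noise_col, "price": price})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["price"].drop("price")

# Order by magnitude so negative relationships are ranked by strength, not sign
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real training set, the same ranking could be applied to the `SalePrice` column of the numeric correlation matrix.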


Let's take a closer look at some of these key features.

# Overall Quality ("OverallQual")
The OverallQual feature is an integer/discrete value, ranging from 1 to 10.   We can see this by printing the first few rows of this column, and asking pandas to describe it.

In [None]:
print(df.head(10)['OverallQual'])
df['OverallQual'].describe()

Let's use a Seaborn box and swarm plot to look at this feature in more detail.

In [None]:
data = df[['OverallQual', 'SalePrice']]
fig, ax = plt.subplots(figsize=(8, 6))
# Draw the boxes first so the individual points stay visible on top
sns.boxplot(x='OverallQual', y='SalePrice', data=data, ax=ax)
sns.swarmplot(x='OverallQual', y='SalePrice', data=data, color='.25', ax=ax)
ax.set_ylim(0, 800000);

Note that there aren't very many homes with the highest quality (9 or 10), and the prices of homes rated 10 are widely spread, so at the top of the scale the quality rating alone is a weaker guide to Sales Price than it is at the other quality values.
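
The visual impression from the plot can be checked numerically by grouping on the quality rating. The sketch below uses a tiny made-up frame that borrows the Ames column names, so the numbers are illustrative only. On the real data, the same `groupby` call on `df` would give the per-rating counts and spreads.

```python
import pandas as pd

# Tiny made-up frame using the same column names as the Ames training set
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 5, 9, 9, 10, 10],
    "SalePrice": [140_000, 150_000, 145_000, 155_000,
                  350_000, 390_000, 200_000, 600_000],
})

# Per-quality count, median and standard deviation of the sale price
summary = toy.groupby("OverallQual")["SalePrice"].agg(["count", "median", "std"])
print(summary)
```

A small `count` combined with a large `std` is the numeric counterpart of a short, widely scattered column in the swarm plot.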

# Above-Ground Living Area ("GrLivArea")

In [None]:
ax = sns.regplot(x="GrLivArea", y="SalePrice", data=df)


In [None]:
X = df["GrLivArea"]
y = df["SalePrice"]

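# Note the argument order: statsmodels takes the response first, OLS(y, X).
# No constant column is added to X, so this fits a line through the origin.
model = sm.OLS(y, X).fit()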
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

That's a lot of information. Let's examine some of the key information provided in this table.  
* The model used to fit the relationship is Ordinary Least Squares (OLS), which means that it fits an estimated regression line by minimizing the sum of the squared vertical distances of each point from the line.
* Date and time indicate when the model was run.
* DF's are the degrees of freedom: "the number of values in the final calculation of a statistic that are free to vary."
* An R-Squared value of 0.919 looks like a very strong fit, but it needs care. Because this model has no intercept, statsmodels reports the *uncentered* R-squared, which measures variation around zero rather than around the mean price. It is therefore not the share of price variance explained by this single variable, and it cannot be compared directly with the correlation printed earlier.
* The coef value (118.0691) indicates that each additional square foot of GrLivArea is associated with a \$118 higher predicted price. Because the line is forced through the origin, this amounts to saying that the average house sells for about \$118 per square foot of above-ground living area.
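
To see how much the missing intercept matters for R-squared, the sketch below fits the same synthetic data twice with plain NumPy least squares: once through the origin and once with a column of ones. The data, seed, and coefficients are assumptions made up for illustration, not Ames values. In statsmodels, the intercept version would correspond to passing `sm.add_constant(X)` instead of `X`.

```python
import numpy as np

# Synthetic data (not Ames): price = 20000 + 100 * area + noise
rng = np.random.default_rng(2)
area = rng.uniform(500, 3000, 1000)
price = 20_000 + 100 * area + rng.normal(0, 50_000, 1000)

def fit(design, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return beta, float(resid @ resid)

# Through the origin: a single column, as in sm.OLS(y, X) with no constant
beta_origin, rss_origin = fit(area[:, None], price)
r2_uncentered = 1 - rss_origin / float(price @ price)

# With an intercept: prepend a column of ones
design = np.column_stack([np.ones_like(area), area])
beta_const, rss_const = fit(design, price)
r2_centered = 1 - rss_const / float(((price - price.mean()) ** 2).sum())

print(f"no intercept:   slope = {beta_origin[0]:.1f}, uncentered R^2 = {r2_uncentered:.3f}")
print(f"with intercept: slope = {beta_const[1]:.1f}, centered R^2 = {r2_centered:.3f}")
```

The uncentered figure comes out noticeably higher because it credits the model for explaining the distance of prices from zero, not just their variation around the mean.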

In [None]:
test_df = pd.read_csv('../input/test.csv')

# Baseline prediction: apply the fitted price-per-square-foot coefficient to the test set
test_df['SalePrice'] = 118.0691*test_df['GrLivArea']
test_df.to_csv('submission.csv', columns=['Id', 'SalePrice'], index=False)