# Predicting home prices in Ames, IA

I wrote this notebook to learn some Python: the language itself, how to process data with it, and how to perform machine learning tasks with it. It may also produce some decent predictions of house prices in Ames, IA.

This data is of particular interest to me, because my wife and I are currently going through the home-buying process on both sides of the table - we've put our current house on the market, and we have a new house under contract.

## 1 Import libraries

In [1]:
import pandas as pd
import numpy as np

## 2 Read in the data

In [14]:
train = pd.read_csv("./Data/train.csv", header=0)
test = pd.read_csv("./Data/test.csv", header=0)
# Only a cell's last expression is rendered, so the preview goes last.
train.head(5)

## 3 Data cleaning and exploration

Let's check to what extent we need to clean up the data before throwing a model at it. Let's also pick the features we are going to use, in order to cut down on the amount of cleaning we have to do - we can always expand it later.

### 3.1 Missing values

It looks like several columns have a lot of missing values. According to the data dictionary provided, this is expected: these entries should be interpreted as the absence of a given feature rather than as an unknown value. However, we still have to do something with them, because most scikit-learn estimators reject missing values. scikit-learn also doesn't accept the alpha (string-valued) fields directly, and these appear to be categorical in nature.
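As a minimal sketch of one way to handle both issues, the snippet below works on a small made-up frame rather than the real training data. The column names `PoolQC` and `Fence` come from the dataset, but the rows, the `"None"` placeholder label, and the helper name `fill_absent` are illustrative assumptions, not something the data dictionary prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few rows of the training data (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, "Ex"],
    "Fence": ["MnPrv", np.nan, np.nan, "GdPrv"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

def fill_absent(df, columns, label="None"):
    """Return a copy where NaN in the given columns becomes an explicit category."""
    out = df.copy()
    for col in columns:
        # Replace NaN with a placeholder label, then store as a pandas categorical.
        out[col] = out[col].fillna(label).astype("category")
    return out

filled = fill_absent(toy, ["PoolQC", "Fence"])

# One-hot encode the categorical columns so every feature is numeric.
encoded = pd.get_dummies(filled, columns=["PoolQC", "Fence"])
print(encoded.columns.tolist())
```

Treating "absent" as its own category keeps the information that a house has no pool or fence, instead of discarding those rows or columns. Whether that is the right choice for every column is a judgment call that depends on the data dictionary.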

In [19]:
print("Training set rows:", len(train))
print("Test set rows:", len(test))

Training set rows: 1460
Test set rows: 1459


In [12]:
train['SalePrice'].isnull().sum()

0

No missing sale prices either, so that's good.

### 3.2 Data exploration

There are 81 columns of data and 1460 rows. We don't want to go too nuts with the number of features here: with only 1460 houses, there will not be enough of them in any one combination of feature values to obtain stable predictions.
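To illustrate how quickly the data thins out, here is a small sketch on synthetic data (not the Ames set). It counts how many rows share each combination of feature values as more categorical features are included. The column names `f0` to `f5`, the four levels per feature, and the random seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_rows = 1460  # same row count as the training set
# Six synthetic categorical features with four levels each (hypothetical).
synthetic = pd.DataFrame(
    rng.integers(0, 4, size=(n_rows, 6)),
    columns=[f"f{i}" for i in range(6)],
)

for k in range(1, 7):
    cols = list(synthetic.columns[:k])
    # Number of rows falling into each observed combination of values.
    group_sizes = synthetic.groupby(cols).size()
    print(k, "features:", len(group_sizes), "combinations,",
          "median rows per combination:", group_sizes.median())
```

Each added feature multiplies the number of possible combinations, so the number of houses available to estimate a price for any one combination shrinks accordingly.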

Furthermore, I have some familiarity with R, where I would recognize many of the alpha fields as factors. pandas now has this capability through its categorical dtype, so we'll experiment with that once we've decided which features to work with. Let's count the missing values anyway, to see if there are features we can ignore right away.

In [26]:
missing = len(train) - train.count()

In [33]:
missing.sort_values(ascending=False)[0:10]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
dtype: int64

So most houses do not have a pool, and most don't have junk like extra garages, tennis courts, or elevators. In addition, most don't have alley access and don't have a fence. I have two enormous dogs, a Saint Bernard and a Greater Swiss Mountain Dog mix, so a fence would be a must for me, but I suspect most buyers weigh that against their own situation.
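One way to act on counts like the ones above is to express them as a fraction of all rows and flag the columns that are mostly empty. The following is a minimal sketch on a toy frame standing in for `train`. The 0.5 threshold and the names `missing_fraction` and `mostly_missing` are arbitrary illustrative choices, not decisions made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Fraction of rows with a missing value, per column.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # arbitrary cutoff for "mostly missing"
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()

# Candidate feature set with the mostly-empty columns set aside.
reduced = toy.drop(columns=mostly_missing)
print(mostly_missing)
print(reduced.columns.tolist())
```

Setting a column aside this way is only a first pass. Since a missing value here often means the feature is absent, a mostly-empty column may still carry useful signal.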