# Predicting House Sale Prices
In this project, we'll explore ways to iteratively improve a simple linear regression model using feature transformation, feature selection, and cross-validation.

## Data
The housing data covers Ames, Iowa, USA, from 2006 to 2010 and can be found [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627).

Detailed descriptions of the columns can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [2]:
houses = pd.read_csv('AmesHousing.tsv', delimiter='\t')
houses.head(5)

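| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |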


In [3]:
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    # Holdout validation: the first 1460 rows train the model, the rest test it.
    train = df.iloc[:1460]
    test = df.iloc[1460:]

    # Keep only numeric columns, since LinearRegression needs numeric input.
    numerical_train = train.select_dtypes(include=['integer', 'float'])
    numerical_test = test.select_dtypes(include=['integer', 'float'])

    target = 'SalePrice'
    features = numerical_train.columns.drop(target)

    lr = linear_model.LinearRegression()
    lr.fit(train[features], train[target])

    predictions = lr.predict(test[features])
    # mean_squared_error expects (y_true, y_pred).
    rmse = np.sqrt(mean_squared_error(test[target], predictions))

    return rmse
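
The value returned by `train_and_test` is the root mean squared error (RMSE): the square root of the average squared difference between actual and predicted prices, expressed in the same units as `SalePrice`. As a minimal sketch of that calculation, here it is with NumPy only. The helper name `rmse_by_hand` and the prices are hypothetical and chosen purely for illustration.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up prices: every prediction is off by exactly 10,000.
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 290000]

example_rmse = rmse_by_hand(actual, predicted)
print(example_rmse)  # 10000.0
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.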

In [4]:
first_rmse = train_and_test(houses)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
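
The message suggests that at least one numeric column passed to `LinearRegression` contains missing values. One way to confirm this is to count nulls per numeric column. The sketch below uses a small made-up frame as a stand-in for `houses`, and the helper name `numeric_null_counts` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def numeric_null_counts(df):
    """Null counts for numeric columns that have at least one missing value."""
    numeric = df.select_dtypes(include='number')
    counts = numeric.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy stand-in for `houses`; the values are made up for illustration.
toy = pd.DataFrame({
    'Lot Frontage': [141.0, np.nan, 81.0, np.nan],
    'Gr Liv Area': [1656, 896, 1329, 2110],
    'Garage Area': [528.0, 730.0, np.nan, 522.0],
    'Street': ['Pave', 'Pave', None, 'Pave'],
    'SalePrice': [215000, 105000, 172000, 244000],
})

null_counts = numeric_null_counts(toy)
print(null_counts)
```

Running the same helper on `houses` should list the numeric columns that need handling before the model can be fit.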

## Feature Engineering
Now we'll do some basic null-value handling.

First, let's drop any column in which 5% or more of the values are missing, **for now**.

In [5]:
houses = houses.loc[:, houses.isnull().sum() < 0.05*houses.shape[0]]
print(houses.shape)

(2930, 71)
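
The filter above keeps a column only when its count of missing values is below 5% of the row count. As a minimal sketch, the same rule can be written with per-column missing fractions and checked on a toy frame. The helper name `drop_sparse_columns` and its `max_missing_frac` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.05):
    """Keep only columns whose share of missing values is below the threshold."""
    missing_frac = df.isnull().mean()  # fraction of nulls per column
    return df.loc[:, missing_frac < max_missing_frac]

# Toy frame with 40 rows: 'a' has no nulls, 'b' has 1 (2.5%), 'c' has 4 (10%).
toy = pd.DataFrame({
    'a': np.arange(40, dtype=float),
    'b': [np.nan] + [1.0] * 39,
    'c': [np.nan] * 4 + [2.0] * 36,
})

kept = drop_sparse_columns(toy)
print(kept.columns.tolist())  # ['a', 'b']
```

Making the threshold a parameter makes it easy to revisit the 5% cutoff later.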
