# Predicting House Sale Prices

I will use housing data for the city of Ames, Iowa, from 2006 to 2010 to build models that predict house sale prices in that city from a given set of features. The dataset can be downloaded [here](https://dsserver-prod-resources-1.s3.amazonaws.com/235/AmesHousing.txt), and you can learn more about the columns [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

I will begin by setting up a pipeline that will let us quickly iterate on different models.

In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import seaborn as sns

pd.options.display.max_columns = 999

houses = pd.read_csv('AmesHousing.tsv', delimiter = '\t')

In [40]:
def transform_features(df):
    # returns DF. Will update later
    return df

def select_features(df):
    # returns 'Gr Liv Area' and 'SalePrice' columns. Will update later
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    
    # Set first 1460 rows to train, set remaining rows to test
    train = df[0:1460]
    test = df[1460:]
    
    # Select only numeric data
    train_num = train.select_dtypes(include=['integer', 'float'])
    test_num = test.select_dtypes(include=['integer', 'float'])
    
    # Select all numerical columns returned from select_features except 'SalePrice', which we want to predict
    features = select_features(train_num).columns.drop('SalePrice')
    
    # Trains a model using selected features, tests it on test data, and returns root mean squared error from test
    lr = LinearRegression()
    lr.fit(train_num[features], train_num['SalePrice'])
    predictions = lr.predict(test_num[features])
    rmse = mean_squared_error(predictions, test_num['SalePrice'])**(1/2)
    return rmse

In [41]:
rmse = train_and_test(houses)
print(rmse)

57088.25161263909
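
For reference, the number printed above is a root mean squared error (RMSE), expressed in the same units as `SalePrice` (dollars). A minimal sketch of the computation, using made-up prices that are not taken from the dataset:

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# RMSE: square the errors, average them, then take the square root
errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.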


Now we will update `transform_features` to:
- remove features we don't want to use in the model, based on number of missing values or data leakage
- transform features into the proper format
- create new features by combining other features

Handle missing values:
- Columns with > 5% missing values: drop column
- Numerical columns with < 5% missing data: fill null values with mode of column
- Text columns: drop all columns that contain any missing values

New features:
- Years before sale - difference between Year Sold and Year Built
- Years since remodel - difference between Year Remod/Add and Year Sold

Drop columns that:
- aren't useful for ML (PID, Order)
- Leak data about the final sale (Mo Sold, Sale Condition, Sale Type, Yr Sold)


I will first try these steps on a separate df, then move them into the function.

In [4]:
# Drop cols with > 5% missing values
df = houses
num_missing = df.isnull().sum()
drop_missing = num_missing[num_missing > .05 * len(df)]
df = df.drop(drop_missing.index, axis=1)

In [5]:
# Fill in numerical columns with mode of column
num_cols = df.select_dtypes(include=['int', 'float']).isnull().sum()
num_cols[num_cols > 0]

Mas Vnr Area      23
BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64

In [6]:
fixable_cols = num_cols[(num_cols < .05 * len(df)) & (num_cols > 0)]
fixable_cols

Mas Vnr Area      23
BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64

In [7]:
replacement_vals = df[fixable_cols.index].mode().to_dict(orient='records')[0]
df = df.fillna(replacement_vals)
df.select_dtypes(include=['int', 'float']).isnull().sum()

Order              0
PID                0
MS SubClass        0
Lot Area           0
Overall Qual       0
Overall Cond       0
Year Built         0
Year Remod/Add     0
Mas Vnr Area       0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Total Bsmt SF      0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Bsmt Full Bath     0
Bsmt Half Bath     0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Garage Cars        0
Garage Area        0
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
Misc Val           0
Mo Sold            0
Yr Sold            0
SalePrice          0
dtype: int64
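
To make the `replacement_vals` line above easier to follow, here is a small sketch on a made-up frame (the `toy` data is hypothetical). `DataFrame.mode()` returns a DataFrame rather than a single row, because a column can have several tied modes; taking the first record gives one fill value per column, which `fillna` then applies column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Garage Cars": [2.0, 2.0, 1.0, np.nan],
    "Bsmt Full Bath": [0.0, 1.0, 0.0, np.nan],
})

# mode() returns a DataFrame; its first row holds the first mode of each column
modes = toy.mode()
replacement_vals = modes.to_dict(orient="records")[0]
print(replacement_vals)

# Passing a dict to fillna fills each column with its own value
filled = toy.fillna(replacement_vals)
print(filled.isnull().sum().sum())
```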

In [8]:
# remove all text cols with missing values from df
text_null = df.select_dtypes(include=['object']).isnull().sum()
drop_null = text_null[text_null > 0]
df = df.drop(drop_null.index, axis=1)

In [9]:
df.isnull().sum().value_counts()

0    64
dtype: int64

All null values are now gone from df.

Now let's work on adding the new features, 'Years Before Sale' and 'Years Since Remod'.

In [10]:
years_sold = df['Yr Sold'] - df['Year Built']
years_since_remodel = df['Yr Sold'] - df['Year Remod/Add']
print(years_sold[years_sold < 0])
print(years_since_remodel[years_since_remodel < 0])

2180   -1
dtype: int64
1702   -1
2180   -2
2181   -1
dtype: int64


Rows 1702, 2180, and 2181 have a negative value for years before sale or years since remodel, which is not possible, so we drop them.

In [11]:
df['Years Before Sale'] = years_sold
df['Years Since Remod'] = years_since_remodel

df = df.drop([1702, 2180, 2181], axis=0)

## No longer need original year columns
df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)
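
The row labels 1702, 2180, and 2181 are specific to this file. As a sketch of a more general approach (not what this notebook does), the invalid rows could be filtered with a boolean mask instead of hard-coded labels. The `toy` frame below is hypothetical:

```python
import pandas as pd

# Hypothetical toy data; the last two rows are "sold" before being built or remodeled
toy = pd.DataFrame({
    "Yr Sold": [2008, 2009, 2007, 2007],
    "Year Built": [1990, 2005, 2008, 2000],
    "Year Remod/Add": [2000, 2005, 2009, 2008],
})

years_before_sale = toy["Yr Sold"] - toy["Year Built"]
years_since_remod = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only rows where both derived features are non-negative
valid = (years_before_sale >= 0) & (years_since_remod >= 0)
cleaned = toy[valid]
print(cleaned.index.tolist())
```

A mask like this would keep working if the input file gained or lost rows, at the cost of silently dropping any new invalid rows.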

Drop columns that:
- aren't useful for ML (PID, Order)
- Leak data about the final sale (Mo Sold, Sale Condition, Sale Type, Yr Sold)


In [12]:
df = df.drop(['PID', 'Order', 'Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1)


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 58 columns):
MS SubClass          2927 non-null int64
MS Zoning            2927 non-null object
Lot Area             2927 non-null int64
Street               2927 non-null object
Lot Shape            2927 non-null object
Land Contour         2927 non-null object
Utilities            2927 non-null object
Lot Config           2927 non-null object
Land Slope           2927 non-null object
Neighborhood         2927 non-null object
Condition 1          2927 non-null object
Condition 2          2927 non-null object
Bldg Type            2927 non-null object
House Style          2927 non-null object
Overall Qual         2927 non-null int64
Overall Cond         2927 non-null int64
Roof Style           2927 non-null object
Roof Matl            2927 non-null object
Exterior 1st         2927 non-null object
Exterior 2nd         2927 non-null object
Mas Vnr Area         2927 non-null float64
Exter Qual    

We are now ready to update the transform_features function

In [42]:
def transform_features(df1):
    df = df1
    
    # Drop cols with > 5% missing values
    num_missing = df.isnull().sum()
    drop_missing = num_missing[num_missing > .05 * len(df)]
    df = df.drop(drop_missing.index, axis=1)
    
    # Fill in numerical columns with mode of column
    num_cols = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_cols = num_cols[(num_cols < .05 * len(df)) & (num_cols > 0)]
    replacement_vals = df[fixable_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_vals)
    
    # Drop cols with any missing text values
    text_null = df.select_dtypes(include=['object']).isnull().sum()
    drop_null = text_null[text_null > 0]
    df = df.drop(drop_null.index, axis=1)
    
    # Add 'Years Before Sale' and 'Years Since Remodel'
    years_sold = df['Yr Sold'] - df['Year Built']
    years_since_remodel = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remodel'] = years_since_remodel

    df = df.drop([1702, 2180, 2181], axis=0)

    ## No longer need original year columns
    df = df.drop(["Year Built", "Year Remod/Add"], axis = 1)
    
    # Drop columns that aren't useful for ML or would leak data about final sale
    df = df.drop(['PID', 'Order', 'Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1)
    
    return df

Now that we've cleaned and transformed the features in the dataset, let's move on to feature selection for the numerical features.

In [15]:
numerical = df.select_dtypes(include = ['int', 'float'])
num_corr = numerical.corr()['SalePrice'].abs().sort_values(ascending=False)
num_corr

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648361
Total Bsmt SF        0.644012
Garage Area          0.641425
1st Flr SF           0.635185
Years Before Sale    0.558979
Full Bath            0.546118
Years Since Remod    0.534985
Mas Vnr Area         0.506983
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.439284
Wood Deck SF         0.328183
Open Porch SF        0.316262
Half Bath            0.284871
Bsmt Full Bath       0.276258
2nd Flr SF           0.269601
Lot Area             0.267520
Bsmt Unf SF          0.182751
Bedroom AbvGr        0.143916
Enclosed Porch       0.128685
Kitchen AbvGr        0.119760
Screen Porch         0.112280
Overall Cond         0.101540
MS SubClass          0.085128
Pool Area            0.068438
Low Qual Fin SF      0.037629
Bsmt Half Bath       0.035875
3Ssn Porch           0.032268
Misc Val             0.019273
BsmtFin SF 2         0.006127
Name: SalePrice, dtype: float64

In [16]:
# Let's only keep values with a correlation coefficient > 0.4. 
# In the function, though, we will make the correlation coefficient an input
num_corr[num_corr > .4]

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648361
Total Bsmt SF        0.644012
Garage Area          0.641425
1st Flr SF           0.635185
Years Before Sale    0.558979
Full Bath            0.546118
Years Since Remod    0.534985
Mas Vnr Area         0.506983
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.439284
Name: SalePrice, dtype: float64

In [17]:
# Only keep data with corr. coefficient > 0.4:
df = df.drop(num_corr[num_corr < 0.4].index, axis=1)

Now let's find the columns that are categorical and convert them to dummies.

In [18]:
# Create a list of column names from documentation that are *meant* to be categorical
cat_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]



In [19]:
# Which categorical columns have we not yet dropped?
cat_cols = []
for col in cat_features:
    if col in df.columns:
        cat_cols.append(col)
        
# How many unique values in each categorical column?
unique_counts = df[cat_cols].apply(lambda col: len(col.value_counts()))
# Cutoff at 10 unique values; this becomes an argument to the function
drop_nonunique_cols = unique_counts[unique_counts > 10].index
df = df.drop(drop_nonunique_cols, axis=1)

In [20]:
# Select remaining text columns (by name) and convert to categorical
text_cols = df.select_dtypes(include=['object']).columns
for col in text_cols:
    df[col] = df[col].astype('category')
    
# Create dummy columns & drop previous text cols
df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols, axis=1)
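
As a small illustration of what the dummy conversion above produces, here is a sketch on a made-up frame (the `toy` data and its values are hypothetical). Each category becomes its own indicator column, named after the original column and the category value:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 900, 2000],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Convert the text column to the category dtype
toy["Street"] = toy["Street"].astype("category")

# Build indicator columns from the categorical columns, then drop the original
dummies = pd.get_dummies(toy.select_dtypes(include=["category"]))
result = pd.concat([toy, dummies], axis=1).drop(columns=["Street"])
print(result.columns.tolist())
```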

Update 'select_features' function

In [43]:
def select_features(df1, corr_coef = 0.4, unique_threshold = 10):    
    # first, transform the features using transform_features
    df = transform_features(df1)
    
    # Select only numerical types and generate correlation coefficients
    numerical = df.select_dtypes(include = ['int', 'float'])
    num_corr = numerical.corr()['SalePrice'].abs().sort_values(ascending=False)
    
    # Only keep data with corr. coefficient > corr_coef:
    df = df.drop(num_corr[num_corr < corr_coef].index, axis=1)

    # Create a list of column names from documentation that are *meant* to be categorical
    cat_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    
    # Which categorical columns have we not yet dropped?
    cat_cols = []
    for col in cat_features:
        if col in df.columns:
            cat_cols.append(col)

    # How many unique values in each categorical column?
    unique_counts = df[cat_cols].apply(lambda col: len(col.value_counts()))
    # Cutoff at unique_threshold
    drop_nonunique_cols = unique_counts[unique_counts > unique_threshold].index
    df = df.drop(drop_nonunique_cols, axis=1)
    
    # Select remaining text columns (by name) and convert to categorical
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        df[col] = df[col].astype('category')

    # Create dummy columns & drop previous text cols
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols, axis=1)
    
    return df

Now let's update the train_and_test function.

We will add a parameter k that controls the type of cross validation that occurs.
- When k == 0, perform holdout validation (what we already have implemented)
    - Select first 1460 rows and assign to train, assign rest to test
    - Train on train, test on test
    - Return RMSE
- When k == 1, perform simple cross validation
    - Shuffle ordering of rows in df
    - Select first 1460 rows and assign to fold_one, assign remaining to fold_two
    - Train on fold_one and test on fold_two
    - Train on fold_two and test on fold_one
    - Return average RMSE
- Else, implement k-fold cross validation using k folds
    - Perform k-fold cross validation using k folds
    - Return average RMSE
    - *With k = 2 this is equivalent to the simple cross validation above; it is implemented separately to show that I understand the concepts
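
As a rough illustration of the k-fold bullet above, the splitting logic can be sketched with NumPy alone. This is a simplified stand-in for scikit-learn's `KFold`, not its actual implementation, and the helper name `k_fold_indices` is made up for this example:

```python
import numpy as np

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one forms the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Each row lands in exactly one test fold
splits = list(k_fold_indices(n_rows=10, k=5))
all_test = np.sort(np.concatenate([test for _, test in splits]))
print(len(splits), all_test.tolist())
```

With two folds, each half is used once for training and once for testing, which is the simple cross validation procedure described above. Note that scikit-learn's `KFold` requires `n_splits` to be at least 2.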

In [44]:
from sklearn.model_selection import KFold

def train_and_test(df1, k = 0, corr_coef = 0.4, unique_threshold = 10):
    df = select_features(df1, corr_coef, unique_threshold)
    numeric = df.select_dtypes(include=['int', 'float'])
    features = numeric.columns.drop('SalePrice')
    lr = LinearRegression()
    
    if k == 0:
        
        # Set first 1460 rows to train, set remaining rows to test
        train = df[0:1460]
        test = df[1460:]

        # Train a model using selected features
        lr.fit(train[features], train['SalePrice'])
        
        # Test model on test data
        predictions = lr.predict(test[features])
        
        # Return root mean squared error from test
        rmse = mean_squared_error(predictions, test['SalePrice'])**(1/2)
        return rmse
    
    
    elif k == 1:
        # Shuffle the rows, then split the shuffled frame into two folds
        shuffled = df.sample(frac=1)
        fold_one = shuffled[0:1460]
        fold_two = shuffled[1460:]
        
        # Train on fold_one and test on fold_two
        lr.fit(fold_one[features], fold_one['SalePrice'])
        predict_two = lr.predict(fold_two[features])
        rmse_two = mean_squared_error(predict_two, fold_two['SalePrice'])**(1/2)
        
        # Train on fold_two and test on fold_one
        lr.fit(fold_two[features], fold_two['SalePrice'])
        predict_one = lr.predict(fold_one[features])
        rmse_one = mean_squared_error(predict_one, fold_one['SalePrice'])**(1/2)
        
        # Return average RMSE
        avg_rmse = (rmse_one + rmse_two) / 2
        return avg_rmse
        
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmses = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            rmse = mean_squared_error(test["SalePrice"], predictions)**(1/2)
            rmses.append(rmse)
        avg_rmse = np.mean(rmses)
        return avg_rmse
        
        

In [45]:
train_and_test(houses)

36623.53562910476

In [47]:
min_rmse = 100000
k=0
corr_coef = 0
unique_threshold = 0

for k_ in range(0,10):
    for corr in range(0, 8):
        for ut in range(5, 51, 5):
            # corr counts tenths, so the threshold passed to the model is corr/10
            rmse = train_and_test(houses, k_, corr/10, ut)
            if rmse < min_rmse:
                min_rmse = rmse
                k = k_
                corr_coef = corr/10
                unique_threshold = ut

print('Min RMSE: ', min_rmse)
print('k: ', k, '\nCorrelation coefficient: ', corr_coef, '\nUnique Threshold: ', unique_threshold)

Min RMSE:  30943.69301164943
k:  8 
Correlation coefficient:  0 
Unique Threshold:  45


The code above takes quite a while to run, but it produced the following values:
- Min RMSE:  30943.69301164943
- k:  8 
- Correlation coefficient:  0 
- Unique Threshold:  45

At best, the model predicted sale prices with an RMSE of approximately $30,944.

The best result comes from using 8 folds, including all numerical columns regardless of how strongly they correlate with SalePrice, and keeping categorical columns with at most 45 unique values.
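
The nested loops in the search above can also be written as a single pass over `itertools.product`. The sketch below uses a made-up scoring function, `fake_rmse`, in place of `train_and_test` so that it runs on its own; only the parameter ranges match the ones used above:

```python
from itertools import product

def fake_rmse(k, corr_coef, unique_threshold):
    """Stand-in for train_and_test; returns an arbitrary, made-up score."""
    return abs(k - 4) * 100 + abs(corr_coef - 0.3) * 1000 + abs(unique_threshold - 20)

# All combinations of k, correlation cutoff, and unique-value threshold
grid = product(range(0, 10), [c / 10 for c in range(0, 8)], range(5, 51, 5))

# min() with a key function keeps the parameter tuple with the lowest score
best_params = min(grid, key=lambda params: fake_rmse(*params))
best_k, best_corr, best_ut = best_params
print(best_k, best_corr, best_ut, fake_rmse(*best_params))
```

Since the k-fold branch shuffles rows randomly, passing a fixed `random_state` to `KFold` would make repeated searches directly comparable.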