# Predicting House Sale Prices

We're going to look at predicting house sale prices from a selection of other features (to be determined), using linear regression.

Current status: Transforming features.

## Reading the data

The data is a tab-delimited file available from https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt, with a data dictionary at https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
%matplotlib inline

data = pd.read_csv('AmesHousing.txt',delimiter='\t')

data.head()


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


## Pipeline version of process
### Transform > Select > Train and test

These three cells (will) contain the whole pipeline packaged into functions, to allow quick reuse. The reasoning behind the steps taken here is reproduced in full below. This pipeline structure mimics what a productionized version might look like, with each step able to receive new training data as well as new data to be prepared for prediction.

In [2]:
def transform_features(inframe):
    
    # Remove columns with more than 25% of values missing.
    transformed = inframe.drop(['Alley','Fireplace Qu','Pool QC','Fence','Misc Feature'],axis=1)
    
    # PID (parcel identification number) and Order (observation number) aren't predictive
    transformed = transformed.drop(['PID','Order'],axis=1)
    
    # Return a new frame; the input frame is left unchanged
    return transformed

In [3]:
def select_features(in_df):
    selected_train = in_df[['Gr Liv Area','SalePrice']] 
    return selected_train

In [4]:
def train_and_test(inframe=data):
    lrm = LinearRegression()
    train_data = select_features(inframe)
    
    mean_s_errors = cross_val_score(lrm,
                                    train_data.drop('SalePrice',axis=1),
                                    train_data['SalePrice'],
                                    scoring='neg_mean_squared_error',
                                    cv=10)
    # Root mean squared errors for each fold
    r_ms_errors = [abs(m)**(1/2) for m in mean_s_errors]
    
    # Average root mean squared error across all folds
    avg_rms_error = np.mean(r_ms_errors)
    
    return r_ms_errors, avg_rms_error

In [5]:
rmses, armse = train_and_test()
print(rmses,'\n')
print(armse)


[55364.966905101006, 67414.767057828198, 51652.028143302261, 61142.428190830913, 46974.79071976852, 69710.357300782242, 52492.991532255481, 61806.901095779387, 52299.554294257054, 47351.675966573712] 

56621.0461206
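
As a rough illustration of how the three steps might eventually be chained, here is a minimal sketch on synthetic data (not the Ames set). The names `transform_sketch`, `select_sketch`, `train_and_test_sketch` and `toy` are hypothetical stand-ins, and the toy price formula is invented purely so the code has something to fit. Because `neg_mean_squared_error` returns negated MSEs, the sign is flipped before taking the square root.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def transform_sketch(frame):
    # Hypothetical stand-in for transform_features: drop identifier columns if present
    return frame.drop(columns=['PID', 'Order'], errors='ignore')

def select_sketch(frame):
    # Same single feature as select_features above
    return frame[['Gr Liv Area', 'SalePrice']]

def train_and_test_sketch(frame, folds=5):
    prepared = select_sketch(transform_sketch(frame))
    scores = cross_val_score(LinearRegression(),
                             prepared.drop('SalePrice', axis=1),
                             prepared['SalePrice'],
                             scoring='neg_mean_squared_error',
                             cv=folds)
    # Scores are negated MSEs, so flip the sign before taking the root
    fold_rmses = np.sqrt(-scores)
    return fold_rmses, fold_rmses.mean()

# Synthetic data: price is a noisy linear function of living area (an assumption)
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=100)
toy = pd.DataFrame({'Order': np.arange(100),
                    'PID': np.arange(100) + 1000,
                    'Gr Liv Area': area,
                    'SalePrice': 100 * area + rng.normal(0, 5000, size=100)})

fold_rmses, avg_rmse = train_and_test_sketch(toy)
print(fold_rmses, '\n')
print(avg_rmse)
```

Passing the frame through the transform step first means any cleaning added there later reaches the model without changing the training function.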


# Exploratory Cells

## Transformations

### Numerical nulls and data leaks 

In [6]:
# First looking for features with large numbers of nulls (more than 25%)
# We'll remove these wholesale, assuming nothing too vital jumps out.

num_val_counts = data.isnull().sum()
num_val_counts[num_val_counts > 0.25*data.shape[0]]

Alley           2732
Fireplace Qu    1422
Pool QC         2917
Fence           2358
Misc Feature    2824
dtype: int64

In [7]:
# Looks ok, so dropping these. Starting new set for transformed data, in case we want to check data again.
t_data = data[num_val_counts[num_val_counts <= 0.25*data.shape[0]].index]

In [8]:
# Reassign to drop the five missing columns
num_val_counts = t_data.isnull().sum()

# Now we'll take a look at columns with fewer but still non-zero nulls.
t_data[num_val_counts[num_val_counts > 0].index].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 22 columns):
Lot Frontage      2440 non-null float64
Mas Vnr Type      2907 non-null object
Mas Vnr Area      2907 non-null float64
Bsmt Qual         2850 non-null object
Bsmt Cond         2850 non-null object
Bsmt Exposure     2847 non-null object
BsmtFin Type 1    2850 non-null object
BsmtFin SF 1      2929 non-null float64
BsmtFin Type 2    2849 non-null object
BsmtFin SF 2      2929 non-null float64
Bsmt Unf SF       2929 non-null float64
Total Bsmt SF     2929 non-null float64
Electrical        2929 non-null object
Bsmt Full Bath    2928 non-null float64
Bsmt Half Bath    2928 non-null float64
Garage Type       2773 non-null object
Garage Yr Blt     2771 non-null float64
Garage Finish     2771 non-null object
Garage Cars       2929 non-null float64
Garage Area       2929 non-null float64
Garage Qual       2771 non-null object
Garage Cond       2771 non-null object
dtypes: float64(11), object(11)

From the data dictionary we can see all the numeric columns here are ordinal or continuous, so we can replace nulls in these columns with the mean for the column without creating nonsense values.

In [9]:
# Turning off chained assignment warning
pd.options.mode.chained_assignment = None  # default='warn'

# Select columns to replace
cols = t_data[num_val_counts[num_val_counts > 0].index].select_dtypes(include=['float64']).columns.tolist()

# Fillna on selected columns
t_data[cols] = t_data.loc[:,cols].fillna(t_data.loc[:,cols].mean())

t_data.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,...,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,IR1,Lvl,AllPub,...,0,0,0,0,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,Reg,Lvl,AllPub,...,0,0,120,0,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,IR1,Lvl,AllPub,...,0,0,0,0,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,Reg,Lvl,AllPub,...,0,0,0,0,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,IR1,Lvl,AllPub,...,0,0,0,0,0,3,2010,WD,Normal,189900
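
The mean fill can be checked on a small hand-made frame where the expected values are easy to verify. This is only a sketch: `toy` and `float_cols` are hypothetical names and the values are invented. Working on an explicit `.copy()` is one way to avoid the chained assignment warning without switching it off.

```python
import numpy as np
import pandas as pd

# Invented values, chosen so the column means are obvious
toy = pd.DataFrame({'Lot Frontage': [60.0, np.nan, 80.0],
                    'Garage Area': [400.0, 500.0, np.nan],
                    'Garage Type': ['Attchd', None, 'Detchd']}).copy()

# Only the float columns are filled; string columns are left for later
float_cols = toy.select_dtypes(include=['float64']).columns.tolist()
toy[float_cols] = toy[float_cols].fillna(toy[float_cols].mean())

print(toy)
```

The null in 'Lot Frontage' becomes 70.0 and the null in 'Garage Area' becomes 450.0, while the null in 'Garage Type' is untouched.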


We'll return to the string fields separately. First, a brief diversion to remove columns that may leak information about the final sale, i.e. columns holding data we won't have when making a prediction, such as the month and year of sale ('Mo Sold' and 'Yr Sold'). We'll drop all of these. The documentation also shows that PID and Order won't be useful to us, so we'll drop those too.

In [10]:
leak_cols = ['Sale Type','Sale Condition','Mo Sold','Yr Sold','PID','Order']
t_data = t_data.drop(leak_cols,axis=1)

### Nominal features

In [17]:
# Columns listed as nominal in the dictionary (minus those we've already dropped)
nominals = ['MS SubClass','MS Zoning','Street','Land Contour','Lot Config',
            'Neighborhood','Condition 1','Condition 2','Bldg Type','House Style',
            'Roof Style','Roof Matl','Exterior 1st','Exterior 2nd','Mas Vnr Type',
            'Foundation','Heating','Central Air','Garage Type']

# A couple of stats on each column. Also stored in dictionary.
noms = {}

for n in nominals:
    counts = data[n].value_counts(dropna=False)
    
    # Dictionary will contain the fraction of rows in the most common category
    # and the number of unique values (nulls counted as a value)
    noms[n] = [counts.max()/counts.sum(),counts.shape[0]]
    print(n,'\n % of rows with single value ',counts.max()/counts.sum(),'\n unique vals ',counts.shape[0],'\n')

MS SubClass 
 % of rows with single value  0.368259385666 
 unique vals  16 

MS Zoning 
 % of rows with single value  0.775767918089 
 unique vals  7 

Street 
 % of rows with single value  0.99590443686 
 unique vals  2 

Land Contour 
 % of rows with single value  0.898634812287 
 unique vals  4 

Lot Config 
 % of rows with single value  0.730375426621 
 unique vals  5 

Neighborhood 
 % of rows with single value  0.151194539249 
 unique vals  28 

Condition 1 
 % of rows with single value  0.860750853242 
 unique vals  9 

Condition 2 
 % of rows with single value  0.98976109215 
 unique vals  8 

Bldg Type 
 % of rows with single value  0.827645051195 
 unique vals  5 

House Style 
 % of rows with single value  0.505460750853 
 unique vals  8 

Roof Style 
 % of rows with single value  0.792150170648 
 unique vals  6 

Roof Matl 
 % of rows with single value  0.985324232082 
 unique vals  8 

Exterior 1st 
 % of rows with single value  0.350170648464 
 unique vals  16 

Exterior

Columns where nearly all rows share the same value won't be much use in the model, so they can probably be discounted. We'll drop those here, using a threshold of 90% of rows in a single category.

In [18]:
print(t_data.shape)
for n in noms:
    if noms[n][0] > 0.9:
        # Drop from our transformed data and also from nominals list
        t_data.drop(n,axis=1,inplace=True)
        nominals.remove(n)
print(t_data.shape)

(2930, 71)
(2930, 66)
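
The same near-constant check can be written as a small helper. The sketch below uses an invented two-column frame; `dominant_share`, `toy`, `to_drop` and `reduced` are hypothetical names, and the 0.9 threshold simply mirrors the one chosen above.

```python
import pandas as pd

def dominant_share(series):
    # Fraction of rows taken by the most common value (nulls counted as a value)
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Invented data: 'Street' is 95% one value, 'Lot Config' is 60% one value
toy = pd.DataFrame({'Street': ['Pave'] * 19 + ['Grvl'],
                    'Lot Config': ['Inside'] * 12 + ['Corner'] * 8})

threshold = 0.9
to_drop = [col for col in toy.columns if dominant_share(toy[col]) > threshold]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.shape)
```

Building the list of columns first and dropping once avoids mutating the frame inside the loop.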


Next we'll think about whether or not we want to transform the other nominal columns into dummy numerical columns. Scanning back over the list a few cells up, a few columns have a large number of categories (Neighborhood has 28, and MS SubClass and Exterior 1st have 16 each), but they tend to be fairly important. The numbers aren't in the hundreds, so I'm inclined to keep them all.

In [21]:
# Change to category type
for n in nominals:
    t_data[n] = t_data[n].astype('category')

# Get dummy cols and combine to dataset
dummy_cols = pd.get_dummies(t_data[nominals])
t_data = pd.concat([t_data,dummy_cols],axis=1)

# Drop the original columns
t_data.drop(nominals,axis=1,inplace=True)
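
To see what the dummy encoding produces, here is a minimal sketch on an invented three-row frame. `toy`, `dummies` and `encoded` are hypothetical names, and the values are made up for illustration.

```python
import pandas as pd

# Invented rows with one nominal column and one numeric column
toy = pd.DataFrame({'Bldg Type': ['1Fam', 'Duplex', '1Fam'],
                    'Lot Area': [9000, 12000, 8500]})
toy['Bldg Type'] = toy['Bldg Type'].astype('category')

# One indicator column per category, named '<column>_<category>'
dummies = pd.get_dummies(toy[['Bldg Type']])
encoded = pd.concat([toy, dummies], axis=1).drop(columns=['Bldg Type'])

print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so a column with k categories adds k columns to the dataset.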

### Ordinal string columns

String columns with a meaningful order will be mapped to numeric values.

In [22]:
t_data.select_dtypes(include=['object']).columns.tolist()

['Lot Shape',
 'Utilities',
 'Land Slope',
 'Exter Qual',
 'Exter Cond',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Heating QC',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Paved Drive']
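
One possible way to do the mapping is sketched below for two of the quality columns. It assumes the five-level scale given in the data dictionary (Ex, Gd, TA, Fa, Po, from best to worst); the numeric codes 1 to 5, the name `quality_scale` and the `toy` frame are my own choices for illustration, not something fixed by the dataset.

```python
import pandas as pd

# Assumed ordering from the data dictionary: Po < Fa < TA < Gd < Ex
quality_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Invented rows, not taken from the Ames data
toy = pd.DataFrame({'Exter Qual': ['TA', 'Gd', 'Ex'],
                    'Kitchen Qual': ['Fa', 'TA', 'Gd']})

for col in ['Exter Qual', 'Kitchen Qual']:
    # Values missing from the dictionary (including nulls) come back as NaN
    toy[col] = toy[col].map(quality_scale)

print(toy)
```

Columns with a different set of levels, such as 'Lot Shape' or 'Paved Drive', would need their own dictionaries, and any nulls left after mapping still need a decision on how to fill them.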