# House Prices Attempt I: Linear Regression

For my first non-physics machine learning project I will be testing out various techniques on a straightforward house prices dataset. The first attempt will be a simple linear regression. The plan is to use one hot encoding (etc.) to deal with the categorical data, which should be easy to do with the **pandas** toolset. I also expect that some ordinal features will not lend themselves well to a simple linear model, so I will also try making up some nonlinear features (*e.g.* (Overall Quality)^2). I expect that the number of features may become unwieldy, but we will see :).

Standard imports:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

Load raw training data:

In [2]:
data = pd.read_csv("../data/train.csv")
print(data.shape)

(1460, 81)


OK, so there are $m = 1460$ training examples and $n = 79$ features (ignoring 'Id' and the target 'SalePrice'). I'm pretty sure that after one hot encoding $n$ will be approaching $m$. This may be bad (more parameters per training example means more room to overfit), but I'm going to allow it just to see what happens.
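
To make that guess concrete before encoding anything, the column count after one hot encoding can be estimated from the number of distinct values in each categorical column. The sketch below uses a tiny made-up frame; the `toy` data and the choice of `categorical` columns are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

categorical = ["MSZoning", "Street"]

# Each categorical column becomes one indicator column per distinct value.
# nunique() skips NaN, which matches get_dummies' default of not encoding NaN.
n_dummy = int(toy[categorical].nunique().sum())

# Columns that are left alone by the encoding.
n_other = toy.shape[1] - len(categorical)

n_after = n_dummy + n_other
print(n_after)
```

Running the same count on the real frame with the full list of categorical columns would show how close $n$ actually gets to $m$.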

## One hot encoding
Here are the categorical features that I have decided to encode:

In [3]:
ohe_categories = [
    'MSSubClass',
    'MSZoning',
    'Street',
    'Alley',
    'LotShape',
    'LandContour',
    'LotConfig',
    'Neighborhood',
    'Condition1',
    'Condition2',
    'BldgType',
    'HouseStyle',
    'RoofStyle',
    'RoofMatl',
    'Exterior1st',
    'Exterior2nd',
    'MasVnrType',
    'Foundation',
    'Heating',
    'Electrical',
    'GarageType',
    'MiscFeature',
    'SaleType',
    'SaleCondition'
]
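# Note: get_dummies only encodes object/category columns by default, so the
# integer-coded MSSubClass passes through as a single numeric column.
data_augmented_1 = pd.get_dummies(data[ohe_categories])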
data_augmented_1.head()

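| | MSSubClass | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | Street_Grvl | Street_Pave | Alley_Grvl | Alley_Pave | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 60 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 70 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 60 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |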
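
In the output above, `MSSubClass` is still a single column of building-class codes rather than a set of indicator columns. Here is a minimal sketch of how integer-coded categories behave in `pd.get_dummies`, and one way to force the encoding with the `columns` argument. The `toy` frame is an assumption made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): one integer-coded category, one string category.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Default behaviour: only the string column is encoded; MSSubClass stays numeric.
default = pd.get_dummies(toy)

# Naming the columns explicitly encodes them regardless of their dtype.
forced = pd.get_dummies(toy, columns=["MSSubClass", "Street"])

print(list(default.columns))
print(list(forced.columns))
```

Casting the column to strings first (`toy["MSSubClass"].astype(str)`) would be another route to the same result; which is preferable is a matter of taste.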


I will store each batch of treated data in numbered *data_augmented* variables (*data_augmented_1*, *data_augmented_2*, ...) as I go.

## Ordinal data mapping

I will map the ordinal data to numbers, just using the natural numbers for now. (The categorical accessor **pd.Series.cat** is the pandas way of doing this; the code below uses a quick list lookup instead.) Here are the categories that will need mapping.

In [4]:
odm_ordered_lists = {
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],
    'LandSlope': ['Sev', 'Mod', 'Gtl'],
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtExposure': ['No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'BsmtFinType2': ['Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'], 
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
    'FireplaceQu': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageFinish': ['Unf', 'RFn','Fin'],
    'GarageQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'PavedDrive': ['Y', 'P', 'N'],  # note: listed best to worst, unlike the other lists
    'PoolQC': ['Fa', 'TA', 'Gd', 'Ex'],
    'Fence': ['MnWw', 'GdWo', 'MnPrv', 'GdPrv'],    
}

def index_exclude_nan(ordered_list, value):
    """Return the 1-based position of value in ordered_list, or 0 for NaN.

    NaNs are set to 0 since not having something is probably bad.
    Adding 1 to the index keeps real categories as natural numbers.
    """
    if isinstance(value, str):
        return ordered_list.index(value) + 1
    return 0


def odm(series, ordered_list):
    """Quick and dirty way to map each series entry to its ordered_list index."""
    return [index_exclude_nan(ordered_list, x) for x in series]

data_augmented_2 = pd.DataFrame()
for cat in odm_ordered_lists:
    data_augmented_2[cat] = odm(data[cat], odm_ordered_lists[cat])

data_augmented_2.head()

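| | Utilities | LandSlope | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | KitchenQual | Functional | FireplaceQu | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 3 | 4 | 3 | 4 | 3 | 1 | 6 | 1 | 4 | 8 | 0 | 2 | 3 | 3 | 1 | 0 | 0 |
| 1 | 4 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | 1 | 3 | 8 | 3 | 2 | 3 | 3 | 1 | 0 | 0 |
| 2 | 4 | 3 | 4 | 3 | 4 | 3 | 2 | 6 | 1 | 4 | 8 | 3 | 2 | 3 | 3 | 1 | 0 | 0 |
| 3 | 4 | 3 | 3 | 3 | 3 | 4 | 1 | 5 | 1 | 4 | 8 | 4 | 1 | 3 | 3 | 1 | 0 | 0 |
| 4 | 4 | 3 | 4 | 3 | 4 | 3 | 3 | 6 | 1 | 4 | 8 | 3 | 2 | 3 | 3 | 1 | 0 | 0 |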
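
For comparison, the same kind of mapping can be sketched with pandas ordered categoricals, whose integer codes follow the order of the category list. This is an alternative sketch, not what the cell above does; the toy series `s` is made up, and `quality_order` simply copies one of the quality lists from the dictionary.

```python
import numpy as np
import pandas as pd

quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]  # worst to best

# Toy series with one missing entry (hypothetical values).
s = pd.Series(["Gd", "TA", np.nan, "Ex"])

# Ordered categorical: codes run 0..len-1 for known levels and are -1 for NaN.
cat = pd.Categorical(s, categories=quality_order, ordered=True)

# Shift by one so that NaN -> 0 and 'Po' -> 1, matching the convention above.
mapped = pd.Series(cat.codes + 1, index=s.index)

print(mapped.tolist())
```

One design difference to keep in mind: `list.index` raises a `ValueError` on a label that is not in the list, which acts as a cheap sanity check on the category spellings, so it is worth checking how the categorical route treats unexpected labels before relying on it.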
