This notebook compares the linear models Ridge, Lasso and ElasticNet by cross-validation, tuning the regularisation strength $\alpha$ for each. At the time of publication this script scores in the top ~15% of submissions. Measured by RMSE on the log of the sale price, ElasticNet comes out slightly ahead of the Lasso and Ridge models. Some basic preprocessing and feature creation is also included, but it should be emphasised that the median imputation used for missing values is very crude. For example, an area feature may be missing because the property does not have that feature (e.g. a pool), in which case it would make more sense to set the value to zero. Feature creation consists of taking the square root of each numerical area feature.
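
To make the imputation caveat concrete, here is a minimal sketch comparing zero-filling with median-filling on a toy frame. The values and the `area_features` list are made up for illustration and are not taken from `train.csv`; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): NaN is read here as "the property lacks this feature"
df = pd.DataFrame({
    "PoolArea": [0.0, 512.0, np.nan, np.nan],
    "GarageArea": [480.0, np.nan, 300.0, 620.0],
})

area_features = ["PoolArea", "GarageArea"]  # assumed list of area columns

# Option 1: treat a missing area as "no such feature"
zero_filled = df[area_features].fillna(0.0)

# Option 2: the cruder median imputation used in this notebook
median_filled = df[area_features].fillna(df[area_features].median())

# Square-root features built on the zero-filled values
sqrt_features = np.sqrt(zero_filled).add_prefix("sq")

print(zero_filled)
print(median_filled)
print(sqrt_features)
```

With median imputation, a house with no pool would be assigned a pool of typical size, which is the distortion described above.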

In [99]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv("train.csv",index_col="Id")
test = pd.read_csv("test.csv",index_col="Id")

def print_full(x):
    """
    Full printing of dataframes for error checking
    """
    pd.set_option('display.max_columns', 999)
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_columns')
    pd.reset_option('display.max_rows')

def clean(df):
    """
    Cleans NaNs and creates new features
    """
    
    # List of new features to be created: (new_feature, original_feature, transform_function)
    transform = [("sqLotArea","LotArea",np.sqrt),
                 ("sqGrLivArea","GrLivArea",np.sqrt),
                 ("sqBsmtFinSF1","BsmtFinSF1",np.sqrt),
                 ("sqBsmtFinSF2","BsmtFinSF2",np.sqrt),
                 ("sqBsmtUnfSF","BsmtUnfSF",np.sqrt),
                 ("sqTotalBsmtSF","TotalBsmtSF",np.sqrt),
                 ("sq1stFlrSF","1stFlrSF",np.sqrt),
                 ("sq2ndFlrSF","2ndFlrSF",np.sqrt),
                 ("sqLotFrontage","LotFrontage",np.sqrt),
                 ("sqMasVnrArea","MasVnrArea",np.sqrt),
                 ("sqPoolArea","PoolArea",np.sqrt),
                 ("sqGarageArea","GarageArea",np.sqrt),
                 ("sqWoodDeckSF","WoodDeckSF",np.sqrt),
                 ("sqOpenPorchSF","OpenPorchSF",np.sqrt),
                 ("sqEnclosedPorch","EnclosedPorch",np.sqrt),
                ]
    
    # Find categorical and numerical features (from df itself rather than the global train)
    categoricals = df.select_dtypes(include=["object"]).columns.values
    numericals = df.select_dtypes(include=["number"]).columns.values
    
    # Remove NaNs... bear in mind this is a rough script. I recommend a more intelligent way of doing this,
    # for example a feature like LotFrontage may be NaN because the property has no Lot Frontage, so it makes
    # more sense to set this to zero instead of imputing the median value as shown below.
    
    # Transform to create new features, scale using MinMaxScaler
    for (new_feature, original_feature, f) in transform:
        filled = df[original_feature].fillna(df[original_feature].median()).astype(float)
        # .values yields a NumPy array; a pandas Series cannot be reshaped directly
        df[new_feature] = MinMaxScaler().fit_transform(f(filled).values.reshape(-1, 1)).ravel()
    # Scale and remove NaNs for numerical features by imputing median value
    for feature in numericals:
        filled = df[feature].fillna(df[feature].median()).astype(float)
        df[feature] = MinMaxScaler().fit_transform(filled.values.reshape(-1, 1)).ravel()
    # Impute NaNs for categorical features with the most frequent value
    for feature in categoricals:
        df[feature] = df[feature].fillna(df[feature].value_counts().idxmax())
    # Perform one hot encoding on the categorical features
    for cat in categoricals:
        dummies = pd.get_dummies(df[cat])
        dummies.columns = [col_name + cat for col_name in dummies.columns.values]            
        df = df.drop(cat,axis=1)
        df = df.join(dummies)
    return df

target = train["SalePrice"] # Note that we will take the Log of this when fitting - check the histogram of this feature
train = train.drop("SalePrice",axis=1)

dd = clean(pd.concat([train,test]))

train = dd[:len(train)]
test = dd[len(train):]
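
Because the target is log-transformed before fitting, the cross-validation scores reported in this notebook are RMSE values on log(SalePrice). The following is a minimal pure-Python sketch of that metric; the function name `rmse_log` and the prices are invented for illustration.

```python
import math

def rmse_log(actual, predicted):
    """RMSE between log(actual) and log(predicted); all prices must be positive."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices, only to exercise the function
actual = [100000.0, 200000.0, 300000.0]
predicted = [110000.0, 190000.0, 300000.0]
print(rmse_log(actual, predicted))
```

Working on the log scale means the error depends on the ratio between predicted and actual price, so a 10% miss on a cheap house counts as much as a 10% miss on an expensive one.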

In [100]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LassoLars
# sklearn.cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn.model_selection import cross_val_score

def rmse_cv(model): # Cross val using the competition scoring metric
    # "neg_mean_squared_error" returns negated MSE, hence the minus sign
    return np.sqrt(-cross_val_score(model, train, np.log(target), scoring="neg_mean_squared_error", cv=5))

print("=====Lasso Regression RMSE=====")
for a in [1.0e-4,3.0e-4,1.0e-3,3.0e-3]:
    print(a, rmse_cv(Lasso(alpha = a)).mean())
    
print("=====Ridge Regression RMSE w/ alphas =====")
for a in np.arange(1,10,1):
    print(a, rmse_cv(Ridge(alpha = a)).mean())
    
print("=====ElasticNet RMSE w/ alphas =====")
#for a in [3.0e-4,1.0e-3,3.0e-3,1.0e-4]:
for a in np.arange(3.0e-4,3.0e-3,1.0e-4):
    print(a, rmse_cv(ElasticNet(alpha = a,max_iter=10000)).mean())

    
print("=====LassoLars w/ alphas =====")
print(rmse_cv(LassoLars(alpha=0.000496269234175)).mean())
    
# Best RMSE is ElasticNet with alpha = 0.0009 (same max_iter as in the cross validation)
best_model = ElasticNet(alpha=0.0009, max_iter=10000).fit(train,np.log(target))

# Output to CSV
test["SalePrice"] = np.exp(best_model.predict(test))
test[["SalePrice"]].to_csv("submit.txt")

=====Lasso Regression RMSE=====
0.0001 0.133060344119
0.0003 0.127280305333
0.001 0.129134911203
0.003 0.146925504614
=====Ridge Regression RMSE w/ alphas =====
1 0.136784674567
2 0.135535092275
3 0.135105758434
4 0.134995672138
5 0.135054231221
6 0.135215174773
7 0.135443219182
8 0.135717369914
9 0.13602419713
=====ElasticNet RMSE w/ alphas =====
0.0003 0.129778843502
0.0004 0.128521524153
0.0005 0.127809961627
0.0006 0.12731380798
0.0007 0.126960267957
0.0008 0.126761552613
0.0009 0.126753655608
0.001 0.126861278601
0.0011 0.127057642503
0.0012 0.127349421042
0.0013 0.127710818952
0.0014 0.128135595479
0.0015 0.128616046301
0.0016 0.129119230004
0.0017 0.129623901034
0.0018 0.130111067187
0.0019 0.130612719001
0.002 0.131124826853
0.0021 0.131646049595
0.0022 0.132179311413
0.0023 0.132717100711
0.0024 0.133206910762
0.0025 0.133684209551
0.0026 0.134160481608
0.0027 0.134631426208
0.0028 0.135097740364
0.0029 0.135579483798
=====LassoLars w/ alphas =====
0.142594922935
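
The alpha values above were scanned with a manual loop. As an alternative sketch, and not what this notebook actually ran, scikit-learn's `ElasticNetCV` can search the same grid in one call. The data below is synthetic (`make_regression`) and stands in for the cleaned training matrix. `ElasticNetCV` picks the alpha with the lowest mean squared error across folds, whereas the loop averages per-fold RMSE, so the two can disagree slightly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for the cleaned training data; not the housing data
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Same alpha grid as the manual ElasticNet loop
alphas = np.arange(3.0e-4, 3.0e-3, 1.0e-4)

# l1_ratio=0.5 is also the default mixing parameter of ElasticNet
search = ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10000)
search.fit(X, y)

print("selected alpha:", search.alpha_)
```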


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
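
The message above is pandas' SettingWithCopyWarning. It is most likely triggered by the assignment `test["SalePrice"] = ...`, because `test` was obtained by slicing `dd`. A minimal sketch of one way to avoid it is to take explicit copies when splitting; the toy frame and the names `n_train`, `train_part` and `test_part` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the combined, cleaned data (made-up values)
dd = pd.DataFrame({"sqLotArea": [0.1, 0.4, 0.7, 0.9]}, index=[1, 2, 3, 4])
n_train = 2

# Explicit copies are independent of dd, so adding a column is unambiguous
train_part = dd.iloc[:n_train].copy()
test_part = dd.iloc[n_train:].copy()

test_part["SalePrice"] = [150000.0, 210000.0]  # placeholder predictions
print(test_part)
```

Another option is to leave `test` untouched and build the submission as a separate DataFrame indexed by `test.index`.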


In [98]:

# train = pd.read_csv("train.csv",index_col="Id")
# print(train["LotFrontage"].isnull().sum())
# print(len(train["LotFrontage"]))