I prefer to approach the problem in stages, starting with a "quick and dirty" (but complete) model and then improving it gradually. I like to see the bigger picture before focusing on the details of the analysis. The dataset is described by its author (Dean De Cock) in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf. It has 34 numeric variables (20 continuous and 14 discrete) and 46 categorical variables (23 nominal and 23 ordinal), and it is small enough to examine in a spreadsheet at first.

After reading the data into pandas, we pick out the qualitative variables, fill their missing values with a placeholder category, and replace them with integer codes (using sklearn's LabelEncoder). At that point every column is numeric, and we transform all of them using ln(x+1); any remaining missing values are filled with 0. Then we fit a lasso regression to the log of the sale price using sklearn's LassoLarsCV, which computes the lasso path with the LARS algorithm and chooses the regularization strength by cross-validation. Such a model takes fewer than 20 lines of code and produces a prediction that ranks better than average (about 730 out of 1800 submissions, with a test error of 0.128 and about the same train error).
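As a small aside on the ln(x+1) transform: a feature that can be exactly zero cannot be passed through a plain logarithm, since ln(0) is undefined, whereas ln(0+1) = 0. A minimal sketch with made-up values (the array below is illustrative and not taken from the dataset):

```python
import numpy as np

# Made-up feature values, including a zero (illustrative only)
x = np.array([0.0, 1.0, 99.0, 1499.0])

transformed = np.log1p(x)         # ln(x + 1); maps 0 to 0 instead of -inf
restored = np.expm1(transformed)  # inverse transform, exp(t) - 1

print(transformed)
print(restored)
```

`np.expm1` is the exact inverse of `np.log1p`, so the transform loses no information about the original values.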

In [1]:
import numpy as np
import pandas as pd 
import sklearn.linear_model as linear_model
from sklearn.preprocessing import LabelEncoder

train=pd.read_csv("../input/train.csv")
test=pd.read_csv("../input/test.csv")

y_train = train['SalePrice']
train = pd.concat((train,test)).reset_index(drop=True)
train.drop(['Id','SalePrice'], axis = 1, inplace = True)

# Categorical columns: treat NaN as its own category, then encode as integers
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
train[qualitative] = train[qualitative].fillna('Missing')
for c in qualitative:
    train[c] = LabelEncoder().fit_transform(train[c].values)

# After encoding no column has dtype 'object', so this selects every column
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']
for item in quantitative:
    train[item] = np.log1p(train[item].values)  # ln(x + 1)

X_train = train[:len(y_train)].fillna(0)
X_test = train[len(y_train):].fillna(0)
                        
model = linear_model.LassoLarsCV()
model.fit(X_train, np.log(y_train))

prediction = pd.DataFrame({"Id": test["Id"], "SalePrice": np.exp(model.predict(X_test))})
prediction.to_csv('house_submission1.csv', index=False)   

# Train error: RMSE between log(SalePrice) and the log-scale predictions
print(np.sqrt(np.mean(np.square(np.log(y_train) - model.predict(X_train)))))


0.12953649785
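
The number printed above is the root-mean-square error between the log of the actual prices and the model's log-scale predictions. A minimal sketch of that metric as a reusable function, checked on made-up numbers (the name `rmse_log` and the sample prices are illustrative assumptions, not part of the original notebook):

```python
import numpy as np

def rmse_log(y_true, y_pred_log):
    """RMSE between log(y_true) and predictions already on the log scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    return np.sqrt(np.mean(np.square(np.log(y_true) - y_pred_log)))

# Made-up prices, for illustration only
prices = np.array([100000.0, 200000.0, 150000.0])
perfect = np.log(prices)             # predictions that match exactly
off_by_10pct = np.log(prices * 1.1)  # every prediction 10% too high

print(rmse_log(prices, perfect))
print(rmse_log(prices, off_by_10pct))
```

Because the error is measured on the log scale, a uniform 10% overestimate gives an error of ln(1.1), roughly 0.095, regardless of how expensive the houses are. An error near 0.13 therefore corresponds to typical relative deviations on the order of 13 to 14 percent.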
