# **House Prices: Advanced Regression Techniques**
<p>By <a href="https://www.linkedin.com/in/jmperafan/">Juan Manuel Perafan</a></p>

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Based on the [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [34]:
# download dataset using the Kaggle API
!kaggle competitions download house-prices-advanced-regression-techniques -f train.csv -p data -q
!kaggle competitions download house-prices-advanced-regression-techniques -f test.csv -p data -q

Traceback (most recent call last):
  File "c:\users\perafju\appdata\local\continuum\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\perafju\appdata\local\continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\perafju\AppData\Local\Continuum\anaconda3\Scripts\kaggle.exe\__main__.py", line 4, in <module>
  File "c:\users\perafju\appdata\local\continuum\anaconda3\lib\site-packages\kaggle\__init__.py", line 23, in <module>
    api.authenticate()
  File "c:\users\perafju\appdata\local\continuum\anaconda3\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 149, in authenticate
    self.config_file, self.config_dir))
OSError: Could not find kaggle.json. Make sure it's located in C:\Users\perafju\.kaggle. Or use the environment method.
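
The traceback above means the Kaggle CLI found no API credentials. It accepts either a `kaggle.json` token file in its config directory or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (the "environment method" mentioned in the error). Below is a minimal sketch of a pre-flight check you could run before the download cell. `has_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package, and the default `~/.kaggle` location is an assumption based on the path in the error message.

```python
import os
from pathlib import Path


def has_kaggle_credentials(config_dir=None, env=None):
    """Return True if a kaggle.json file or both credential variables exist.

    Hypothetical helper for illustration only; it does not validate the token.
    """
    env = os.environ if env is None else env

    # Assumption: the token lives in ~/.kaggle unless KAGGLE_CONFIG_DIR is set
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    token_file = Path(config_dir) / "kaggle.json"

    # The "environment method": both variables must be non-empty
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))

    return token_file.is_file() or has_env


print("Kaggle credentials found:", has_kaggle_credentials())
```

If this prints `False`, create an API token from your Kaggle account page and place the downloaded `kaggle.json` in the config directory before rerunning the download cell.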

In [27]:
# Import packages
import pandas as pd
import numpy as np

# Scoring your data
from scipy import stats
from sklearn.metrics import mean_squared_error as MSE

# Pre-processing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# We are using lasso to identify the most important features
from sklearn.linear_model import Lasso

# Read data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

### Union train and test datasets into a combined dataset
This makes data preprocessing and cleaning easier. If I don't union both datasets, I have to apply each transformation twice. At a later stage, I will split them again using the SalePrice column, which is NaN for every house in the test dataset.

In [28]:
# Union train + test
combined = pd.concat([train, test])

# Check the data
combined.head()

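|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500.0 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500.0 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500.0 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000.0 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000.0 |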
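
To illustrate the union-then-split idea on something small, here is a sketch with two toy frames. The column names come from the dataset, but the rows are made-up values and `toy_train`, `toy_test`, and `is_test` are names I introduce only for this example. It assumes no training row is missing its SalePrice, which is what makes the NaN check a safe way to tell the two sets apart.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only)
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "SalePrice": [208500.0, 181500.0],
})
toy_test = pd.DataFrame({
    "Id": [3, 4],
    "LotArea": [11000, 9500],
})

# Union: the test rows get NaN in SalePrice because the column is absent there
toy_combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Split again: a missing SalePrice marks a test row
is_test = toy_combined["SalePrice"].isnull()
train_again = toy_combined[~is_test]
test_again = toy_combined[is_test]

print(train_again["Id"].tolist(), test_again["Id"].tolist())
```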


### Replace NaN with "Missing" or 0
I am replacing every NaN with the word "Missing" if the column is categorical, or with 0 if it is numeric. There are probably smarter ways of doing this, but I just wanted a quick and dirty approach for now.

In [29]:
# For every column except SalePrice
for col in combined.columns.drop('SalePrice'):

    # If the column holds strings (object dtype), fill NaN with the word 'Missing'
    if combined[col].dtype == 'O':
        combined[col] = combined[col].fillna('Missing')

    # If the column is numeric, fill NaN with 0
    else:
        combined[col] = combined[col].fillna(0)

# Check that every column except SalePrice has no NaN left
combined.isnull().mean()

Id               0.000000
MSSubClass       0.000000
MSZoning         0.000000
LotFrontage      0.000000
LotArea          0.000000
                   ...   
MoSold           0.000000
YrSold           0.000000
SaleType         0.000000
SaleCondition    0.000000
SalePrice        0.499829
Length: 81, dtype: float64
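
As one example of a slightly smarter approach, numeric gaps could be filled with the column median instead of 0, so a missing lot frontage is not read as a frontage of zero. This is only a sketch on toy data, not what the notebook runs: `fill_missing` is a hypothetical helper, and treating every non-numeric column as categorical is an assumption.

```python
import numpy as np
import pandas as pd


def fill_missing(df, target="SalePrice"):
    """Fill categorical NaN with 'Missing' and numeric NaN with the median.

    Hypothetical helper; the target column is left untouched.
    """
    out = df.copy()
    feature_cols = out.columns.drop(target)

    # Assumption: every non-numeric feature is categorical
    cat_cols = out[feature_cols].select_dtypes(exclude="number").columns
    num_cols = out[feature_cols].select_dtypes(include="number").columns

    out[cat_cols] = out[cat_cols].fillna("Missing")
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out


toy = pd.DataFrame({
    "Alley": ["Pave", np.nan, "Grvl"],
    "LotFrontage": [60.0, np.nan, 80.0],
    "SalePrice": [140000.0, np.nan, np.nan],
})
filled = fill_missing(toy)
print(filled)
```

Whether the median beats 0 depends on the column: for fields where NaN means "the house has none of this", 0 may still be the better choice.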

In [33]:
# One-hot encode every categorical column into indicator (0/1) columns
combined = pd.get_dummies(combined)

combined.shape

(2919, 313)
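
Encoding the combined dataset is where the union pays off. If train and test were encoded separately, a category that appears in only one of them would produce different columns in each. The sketch below shows this on toy data; the frames and the variable names are illustrative, and only the `MSZoning` codes are taken from the dataset.

```python
import pandas as pd

# "RM" appears only in the toy train set, "FV" only in the toy test set
toy_train = pd.DataFrame({"MSZoning": ["RL", "RM"]})
toy_test = pd.DataFrame({"MSZoning": ["RL", "FV"]})

# Encoding separately: each frame only knows its own categories
separate_train = pd.get_dummies(toy_train)
separate_test = pd.get_dummies(toy_test)

# Encoding the union: both parts share one set of columns
joint = pd.get_dummies(pd.concat([toy_train, toy_test], ignore_index=True))
joint_train, joint_test = joint.iloc[:2], joint.iloc[2:]

print("separate:", list(separate_train.columns), list(separate_test.columns))
print("joint:   ", list(joint_train.columns), list(joint_test.columns))
```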

In [32]:
# Splitting datasets into train + test again
train = combined[combined['SalePrice'].notnull()]
test = combined[combined['SalePrice'].isnull()]

# Features must not contain the target: leaving SalePrice in X lets the
# model read the answer directly, which gives a misleading score of ~1.0
# with a single non-zero coefficient
X = train.drop(columns = ['SalePrice'])
y = train['SalePrice']

# Creating a train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X
    , y
    , random_state = 0
)

# Fitting the model
model = Lasso(alpha = 298.4).fit(X_train, y_train)

# Model's output: R^2 on both splits and the number of features Lasso kept
print("Train score:", model.score(X_train, y_train))
print("Test score:", model.score(X_test, y_test))
np.sum(model.coef_ != 0)
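
To see which features Lasso keeps rather than only how many, the coefficients can be paired with the column names. The sketch below uses synthetic data in which only two of ten features drive the target. The feature names, the coefficients 3 and -2, and `alpha = 0.1` are all assumptions chosen for the illustration, not values from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only feature_0 and feature_1 influence the target
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 10)),
    columns=[f"feature_{i}" for i in range(10)],
)
y = 3 * X["feature_0"] - 2 * X["feature_1"] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks the coefficients of uninformative features to exactly 0
lasso = Lasso(alpha=0.1).fit(X, y)

# Pair coefficients with column names and keep the non-zero ones
coefs = pd.Series(lasso.coef_, index=X.columns)
selected = coefs[coefs != 0].index.tolist()
print(selected)
```

Applied to the notebook's model, the same pattern would be a Series built from the fitted model's `coef_` and the training columns, which makes a leaked target easy to spot because it shows up as the only surviving feature.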

In [None]:
# Prepare submission in the shape Kaggle expects: Id, SalePrice
submission = pd.DataFrame({
    'Id': test['Id']
    , 'SalePrice': model.predict(test.drop(columns = ['SalePrice']))
})

# Write the submission to csv without the DataFrame index
submission.to_csv('data/submission.csv', index = False)
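
The competition expects a csv with the header `Id,SalePrice` and one row per test house. A quick way to check the shape before uploading is to write a couple of rows to an in-memory buffer, as sketched below. `to_submission_csv` is a hypothetical helper and the prices are made-up numbers.

```python
import io

import pandas as pd


def to_submission_csv(ids, predictions):
    """Return the submission as csv text (hypothetical helper)."""
    submission = pd.DataFrame({"Id": ids, "SalePrice": predictions})
    buffer = io.StringIO()
    # index=False keeps the DataFrame index out of the file
    submission.to_csv(buffer, index=False)
    return buffer.getvalue()


# Made-up predictions for two test houses
csv_text = to_submission_csv([1461, 1462], [120000.0, 150000.5])
print(csv_text)
```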

In [None]:
# Make submission using csv
# !kaggle competitions submit house-prices-advanced-regression-techniques -f data/submission.csv -m "My submission message"

# Check all previous submissions
# !kaggle competitions submissions -c house-prices-advanced-regression-techniques