A SYSTEMATIC APPROACH OF REGRESSION MODELS USING SKLEARN AND TENSORFLOW

Christian Masdeval - November 2017


In this kernel I will try to bring to you a resume of the analysis I have made in the House Prices dataset. My objective was to explore as many different predective models as possible trying to find the best one. It is divided in the following way:

    Understanding and cleaning the data
        Linear Regression Models
            Linear Regression
            Stochastic Gradient Descent
            Ridge
            Lasso
            Elastic Net
    Suport Vector Machine
    Neural Network
    Random Forests
    Gradient Boosting
    Deep Learning with TensorFlow
    Ensemble Technics

1 . Understanding and cleaning the data

First thing first. As mentioned by several authors, much of the work in make a good model is related to a good feature engineering process. It take a while until we start gain some significant intuition about what is going on but some steps are always present. I will describe what I did trying to stabilize the data and how I got to this decision. Anyway, I think the goals of this step that are always present in any dataset and should guide us are:

    Deal with missing values
    Select the best features

One more thing. As our objective is to analyze the test.csv file, create the submission file and upload it to kaggle, one premisse I have used was not to remove any single line from the test set. We must submit a result file with exactly the number of lines of the original test file.

What is the data

There are excellent kernels showing how to get a glimpse of the data (like Comprehensive data exploration with Python). Most of them have used programming languages like R ore Python to extract useful visualizations. I would also suggest to use the Weka tool to do that. I could get excellent insights about the data via Weka.

We should also look the data dictonary when available. In our case it exists and bring important informations:

    *The categorical and numerial fields
    *The description of each one
    *That there is a category called NA to denote the absence of some characteristic (and Python also use NA)
    *That there are some categorical fields that have only numerical values (and Python will interpret them as numbers)


Categorical field with numerical values

    OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor    


Value NA being used as a category

    Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access

Missing values

Maybe the most common problem in real datasets is the presence of null values and most of the implementations of the regression algorithms break in the presence of them. Sometimes, instead of a blank in the position, a wildcard is used like ? or NA. However, as we see above, in the House Prices example the value NA is being used to denote other kind of information, what is a problem. So, our first attempt will be to exchange these NA values to the value 'No', denoting the absence of some characteristic.

Fisrt, lets see what are the missing values

In [15]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
import seaborn as sns
import warnings
from scipy.stats import norm
from scipy import stats
warnings.filterwarnings('ignore')

np.set_printoptions(threshold=np.nan) #to print all the elements of a matrix
pd.set_option('display.max_rows', 2000)#to print all the elements of a data frame

%matplotlib inline
    
df = pd.read_csv("train.csv",na_values=['?',''],delimiter=',',delim_whitespace=False)
df_test = pd.read_csv("test.csv",na_values=['?',''],delimiter=',',delim_whitespace=False)


In [16]:
print("Shape of training set: ", df.shape)
print("Shape of test set: ", df_test.shape)
data_aux = df.append(df_test) #merging the two datasets to facilitate the cleaning
print("Missing values before remove NA: " , data_aux.columns[data_aux.isnull().any()])

Shape of training set:  (1460, 81)
Shape of test set:  (1459, 80)
Missing values before remove NA:  Index(['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'Electrical', 'Exterior1st', 'Exterior2nd',
       'Fence', 'FireplaceQu', 'Functional', 'GarageArea', 'GarageCars',
       'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',
       'KitchenQual', 'LotFrontage', 'MSZoning', 'MasVnrArea', 'MasVnrType',
       'MiscFeature', 'PoolQC', 'SalePrice', 'SaleType', 'TotalBsmtSF',
       'Utilities'],
      dtype='object')


Now, lets replace NA for No where appropriate and search for the missing values again.

In [17]:
#Alley
data_aux.Alley.fillna(inplace=True,value='No')

#BsmtQual
data_aux.BsmtQual.fillna(inplace=True,value='No')

#BsmtCond
data_aux.BsmtCond.fillna(inplace=True,value='No')

#BsmtExposure
data_aux.BsmtExposure.fillna(inplace=True,value='No')

#BsmtFinType1
data_aux.BsmtFinType1.fillna(inplace=True,value='No')

#BsmtFinType2
data_aux.BsmtFinType2.fillna(inplace=True,value='No')

#FireplaceQu
data_aux.FireplaceQu.fillna(inplace=True,value='No')    

#GarageType
data_aux.GarageType.fillna(inplace=True,value='No')

#GarageFinish
data_aux.GarageFinish.fillna(inplace=True,value='No')

#GarageQual 
data_aux.GarageQual.fillna(inplace=True,value='No')
    
#GarageCond
data_aux.GarageCond.fillna(inplace=True,value='No')

#PoolQC
data_aux.PoolQC.fillna(inplace=True,value='No')
    
#Fence
data_aux.Fence.fillna(inplace=True,value='No')

#MiscFeature
data_aux.MiscFeature.fillna(inplace=True,value='No')
    
print("Missing values after insert No, i.e., real missing values: " , data_aux.columns[data_aux.isnull().any()])


Missing values after insert No, i.e., real missing values:  Index(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF',
       'Electrical', 'Exterior1st', 'Exterior2nd', 'Functional', 'GarageArea',
       'GarageCars', 'GarageYrBlt', 'KitchenQual', 'LotFrontage', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'SalePrice', 'SaleType', 'TotalBsmtSF',
       'Utilities'],
      dtype='object')


These are the real missing values. Among the columns that were returned, we now going to treat the numeric fields. Note that these numeric fields had NA to denote a missing value. So, the dataset was using the same code to encode two different informations what would cause a lot of confusion!

In [18]:
#Numeric fields    

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)  #insert the mean   
    
#BsmtFinSF1
#df.dropna(axis=0,subset=['BsmtFinSF1'],inplace=True)
#df['BsmtFinSF1'] = imp.fit_transform(df['BsmtFinSF1'].reshape(1,-1)).transpose()    
data_aux.BsmtFinSF1.fillna(inplace=True,value=0)
    
#BsmtFinSF2
#df.dropna(axis=0,subset=['BsmtFinSF2'],inplace=True)
#df['BsmtFinSF2'] = imp.fit_transform(df['BsmtFinSF2'].reshape(1,-1)).transpose()    
data_aux.BsmtFinSF2.fillna(inplace=True,value=0)
    
#BsmtUnfSF
#df.dropna(axis=0,subset=['BsmtUnfSF'],inplace=True)
#df.drop('BsmtUnfSF',axis=1,inplace=True)
#df['BsmtUnfSF'] = imp.fit_transform(df['BsmtUnfSF'].reshape(1,-1)).transpose()    
data_aux.BsmtUnfSF.fillna(inplace=True,value=0)
    
#TotalBsmtSF
#df.dropna(axis=0,subset=['TotalBsmtSF'],inplace=True)
#df['TotalBsmtSF'] = imp.fit_transform(df['TotalBsmtSF'].reshape(1,-1)).transpose()    
data_aux.TotalBsmtSF.fillna(value=0,inplace=True)
    
#BsmtFullBath - apenas na base de teste tem NA.Nao posso remover a linha
#df['BsmtFullBath'] = imp.fit_transform(df['TotalBsmtSF'].reshape(1,-1)).transpose()    
data_aux.BsmtFullBath.fillna(inplace=True,value=0)
    
#BsmtHalfBath- apenas na base de teste tem NA.Nao posso remover a linha
#df['BsmtHalfBath'] = imp.fit_transform(df['BsmtHalfBath'].reshape(1,-1)).transpose()    
data_aux.BsmtHalfBath.fillna(inplace=True,value=0)
        
#GarageCars
#df.dropna(axis=0,subset=['GarageCars'],inplace=True)
#df['GarageCars'] = imp.fit_transform(df['GarageCars'].reshape(1,-1)).transpose()    
data_aux.GarageCars.fillna(value=0,inplace=True)
    
#GarageArea
#df.dropna(axis=0,subset=['GarageArea'],inplace=True)
#df['GarageArea'] = imp.fit_transform(df['GarageArea'].reshape(1,-1)).transpose()    
data_aux.GarageArea.fillna(value=0,inplace=True)
        
#LotFrontage 
data_aux['LotFrontage'].fillna(inplace=True,value=0)
    
#GarageYrBlt - remove the hole column
data_aux.GarageYrBlt.fillna(inplace=True,value=0)
   
#MasVnrArea 
data_aux.MasVnrArea.fillna(inplace=True,value=0)
    

These eleven fields had null values. My first impulse was to fill these missing values with the mean (average) of the whole values of each column or drop the hole column. Latter, i realized that would be better to set zero instead, as this approch could cause less bias. 

Next, we have to deal with the categorical values. In such cases i decided to fill with the most common value of each column, once more trying to cause the less bias as possible. 

In [19]:
#####Categorial fields


#KitchenQual
data_aux.KitchenQual = data_aux.KitchenQual.mode()[0]

#Functional
data_aux.Functional = data_aux.Functional.mode()[0]

#Utilities
data_aux.Utilities = data_aux.Utilities.mode()[0]  
    
#SaleType
data_aux.SaleType  = data_aux.SaleType.mode()[0]
    
#Exterior1st- nao posso remover linhas do teste
data_aux.Exterior1st = data_aux.Exterior1st.mode()[0]

#Exterior2nd
data_aux.Exterior2nd = data_aux.Exterior2nd.mode()[0]       

#Electrical - remove the records where the value is NA
data_aux.Electrical = df['Electrical'].mode()[0]

#MSZoning   - tem NA apenas na base de teste. Como nao posso remover linhas removo a coluna   
data_aux.MSZoning = data_aux.MSZoning.mode()[0]
     
#MasVnrType - remove the records where the value is NA 
data_aux.MasVnrType=df['MasVnrType'].mode()[0]


print("Missing values after all: " , data_aux.columns[data_aux.isnull().any()])


Missing values after all:  Index(['SalePrice'], dtype='object')


We can see that now that there are no missing value. Of course SalePrice does not count because as we merge train and test, this column became empty for the test records.

There is only one more thing to tackle. I have noticed this when i was navigating into Weka and looking the data. Categorical features where the categories are expressed by numbers are treated as quantitative values. This is an undesirable behaviour as we are planning to convert categorical to dummy values and this will not work for this columns. As the sklearn.feature_extraction documentation explain 

    When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature “f” that can take on the values “ham” and “spam” will become two features in the output, one signifying “f=ham”, the other “f=spam”.

    However, note that this transformer will only do a binary one-hot encoding when feature values are of type string. If categorical features are represented as numeric values such as int, the DictVectorizer can be followed by OneHotEncoder to complete binary one-hot encoding.
    


In [20]:
#Converting numeric columns to nominal before applying dummy convertion
#After converting to String they will be treated as categorical

# MSSubClass as str
print(data_aux['MSSubClass'].dtype)
data_aux['MSSubClass'] = data_aux['MSSubClass'].astype("str")
print(data_aux['MSSubClass'].dtype)

    
# Converting OverallCond to str
data_aux.OverallCond = data_aux.OverallCond.astype("str")

# KitchenAbvGr to categorical
data_aux['KitchenAbvGr'] = data_aux['KitchenAbvGr'].astype("str")
    
# Year and Month to categorical
data_aux['YrSold'] = data_aux['YrSold'].astype("str")
data_aux['MoSold'] = data_aux['MoSold'].astype("str")    


int64
object


The next step in transforming our dataset in one more suitable for being used by predictive models is to re encode the categorical columns. 

In [22]:
data_final = pd.get_dummies(data_aux)

data_train = data_final.iloc[:-df_test.shape[0],:]
data_train.to_csv('train_no_categorical.csv')
print("New shape train:" , np.shape(data_train))
print("Indice da coluna SalePrice no novo dataset" , data_train.columns.get_loc('SalePrice'))

data_test = data_final.iloc[:df_test.shape[0],:]

data_test.to_csv('test_no_categorical.csv')
data_test.drop('SalePrice',inplace=True,axis=1)
print("New shape test:" , np.shape(data_test))

print("Null values train \n", data_train.columns[data_train.isnull().any()])
print("Null values test \n", data_test.columns[data_test.isnull().any()])


print("Columns only in test set : " , data_test.columns.difference(data_train.columns))
print("Columns only in train set : " , data_train.columns.difference(data_test.columns))

New shape train: (1460, 286)
Indice da coluna SalePrice no novo dataset 26
New shape test: (1459, 285)
Null values train 
 Index([], dtype='object')
Null values test 
 Index([], dtype='object')
Columns only in test set :  Index([], dtype='object')
Columns only in train set :  Index(['SalePrice'], dtype='object')


Voilà! We now have a new dataset with no null values. I have printed some diagnostic informations for us. 

    *Note that the difference in the number of columns between the train and test set is only one. Is is the expected if all the transformations we did were well done. This extra column in the train set is due to SalePrice.
    *There are the same number of rows as in the begining. 