
<br><br><br><br>
# <center> Machine Learning Level 2
### Missing Value Introduction
Source: https://www.kaggle.com/dansbecker/handling-missing-values

In [1]:
import pandas as pd
main_file_path = 'Data/train.csv'
data = pd.read_csv(main_file_path)

In [2]:
print(data.isnull().sum())

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

### Null Value Solutions
##### 1) A Simple Option: Drop Columns with Missing Values

In [3]:
original_data = pd.DataFrame(data)

In [4]:
data_without_missing_values = original_data.dropna(axis=1)
data_without_missing_values.head()

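| | Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 8450 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 9600 | Pave | Reg | Lvl | AllPub | FR2 | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 11250 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 9550 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 14260 | Pave | IR1 | Lvl | AllPub | FR2 | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | WD | Normal | 250000 |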


##### 2) A Better Option: Imputation
* Imputation fills in the missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.
* The default behavior fills in the mean value for imputation. Statisticians have researched more complex strategies, but those complex strategies typically give no benefit once you plug the results into sophisticated machine learning models.
* One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.

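* Warning: Sklearn imputation only works on numeric values. (The Imputer class used below was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer instead.)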
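
As a rough illustration of the Pipeline point above, the sketch below chains a mean imputer with a random forest so that both are fitted as one estimator. It assumes scikit-learn 0.20 or later, where the imputer lives at `sklearn.impute.SimpleImputer`; the tiny arrays are made-up values, not the housing data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy numeric data with gaps (made-up values, for illustration only)
X_toy = np.array([[65.0, 8450.0],
                  [np.nan, 9600.0],
                  [68.0, np.nan],
                  [60.0, 9550.0]])
y_toy = np.array([208500.0, 181500.0, 223500.0, 140000.0])

# Imputation and the model are fitted together as a single estimator
pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestRegressor(n_estimators=10, random_state=0))
pipeline.fit(X_toy, y_toy)

# Column means learned during fit; these are what replace the NaNs
learned_means = pipeline.named_steps["simpleimputer"].statistics_
print(learned_means)

# A new row with a missing value is imputed with the training mean
preds = pipeline.predict(np.array([[np.nan, 10000.0]]))
print(preds)
```

Because the imputer sits inside the pipeline, the column means are learned from the training rows only and reused when predicting on new rows.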

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [6]:
# isolating numeric values
numeric_columns  = list(original_data.select_dtypes(include=[int, float]).columns)

In [7]:
from sklearn.preprocessing import Imputer
my_imputer =  Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data[numeric_columns])
data_with_imputed_values

array([[  1.00000000e+00,   6.00000000e+01,   6.50000000e+01, ...,
          2.00000000e+00,   2.00800000e+03,   2.08500000e+05],
       [  2.00000000e+00,   2.00000000e+01,   8.00000000e+01, ...,
          5.00000000e+00,   2.00700000e+03,   1.81500000e+05],
       [  3.00000000e+00,   6.00000000e+01,   6.80000000e+01, ...,
          9.00000000e+00,   2.00800000e+03,   2.23500000e+05],
       ..., 
       [  1.45800000e+03,   7.00000000e+01,   6.60000000e+01, ...,
          5.00000000e+00,   2.01000000e+03,   2.66500000e+05],
       [  1.45900000e+03,   2.00000000e+01,   6.80000000e+01, ...,
          4.00000000e+00,   2.01000000e+03,   1.42125000e+05],
       [  1.46000000e+03,   2.00000000e+01,   7.50000000e+01, ...,
          6.00000000e+00,   2.00800000e+03,   1.47500000e+05]])

##### 3) An Extension to Imputation

In [8]:
# make a copy to avoid changing original data (when Imputing)
new_data = original_data[numeric_columns].copy()

In [9]:
# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]

for col in cols_with_missing:
  new_data[col + '_was_missing'] = new_data[col].isnull()
  
# Imputation
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)

### Example - Comparing All Null Value Solutions
* Question: What does out-of-sample MAE score mean?
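
A hedged illustration of the question above: mean absolute error (MAE) averages the absolute differences between predictions and true values, and "out-of-sample" means it is computed on rows the model did not see while fitting. The prices below are made up, and `mean_abs_error` is a hypothetical helper, not part of scikit-learn.

```python
def mean_abs_error(y_true, y_pred):
    """Average of |true - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out (test) prices and a model's predictions for them
held_out_prices = [200000, 150000, 300000]
predicted_prices = [210000, 140000, 330000]

# Errors are 10000, 10000 and 30000, so the MAE is their average
out_of_sample_mae = mean_abs_error(held_out_prices, predicted_prices)
print(out_of_sample_mae)
```
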
##### Basic Problem Set-up

In [10]:
import pandas as pd
# Load data
melb_data = pd.read_csv('Data/melb_data.csv')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

##### Creating Function to Measure Quality of An Approach
Question: What does random_state=0 mean?
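
One way to probe this question, as a minimal sketch: `train_test_split` shuffles rows before splitting, and fixing `random_state` seeds that shuffle so the same split comes back on every call. The toy list `rows` is an assumption for illustration.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))

# Same seed: the shuffle before splitting is identical on every call
train_a, test_a = train_test_split(rows, test_size=0.3, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=0)

same_split = (train_a == train_b) and (test_a == test_b)
print(same_split)
```
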

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors, 
                                                    melb_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
  model = RandomForestRegressor()
  model.fit(X_train, y_train)
  preds = model.predict(X_test)
  return (mean_absolute_error(y_test, preds))

##### Get Model Score from Dropping Columns with Missing Values

In [12]:
cols_with_missing = [col for col in X_train.columns 
                     if X_train[col].isnull().any()]
                     
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print('Mean Absolute Error from dropping columns with Missing Values:')
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))


Mean Absolute Error from dropping columns with Missing Values:
348343.721526


##### Get Model Score from imputation

In [13]:
from sklearn.preprocessing import Imputer

my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print('Mean Absolute Error from Imputation:')
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Mean Absolute Error from Imputation:
203677.770266


##### Get Score from Imputation with Extra Columns Showing What Was Imputed

In [14]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing =  (col for col in X_train.columns
                     if X_train[col].isnull().any())

for col in cols_with_missing:
  imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
  imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
  
  
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
# Reuse the statistics learned from the training data; do not refit on the test set
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Mean Absolute Error from Imputation while Tracking What Was Imputed:
202816.342127


* In this case, the extension didn't make a big difference. As mentioned before, this can vary widely from one dataset to the next (largely determined by whether rows with missing values are intrinsically like or unlike those without missing values).

### Using Categorical Data with One Hot Encoding
Source: https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding

Question: Why use One Hot Encoding over Label Encoding?

In [15]:
# Setup Code

# Read the data
import pandas as pd
train_data = pd.read_csv('Data/train.csv')
test_data = pd.read_csv('Data/test.csv')

# Drop the houses where the target is missing
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Author Notes:
## Since missing values isn't the focus of this tutorial, we use the simplest
## possible approach, which drops these columns. 
## For more detail (and a better approach) to missing values, see
## https://www.kaggle.com/dansbecker/handling-missing-values


cols_with_missing = [col for col in train_data.columns
                    if train_data[col].isnull().any()]


candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)


low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

* "cardinality" means the number of unique values in a column.
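
A small sketch of how cardinality can be checked with pandas, using `DataFrame.nunique()`. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values) with one low- and one high-cardinality column
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "Address": ["1 Elm St", "2 Oak Ave", "3 Pine Rd", "4 Ash Ln"],
})

# Number of distinct values per column
cardinality = toy.nunique()
print(cardinality)
```

Here "Street" has two distinct values, so it would pass a low-cardinality filter like the `nunique() < 10` check in the setup code, while a column such as "Address" grows with the number of rows and would produce one dummy column per house.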

In [16]:
train_predictors.dtypes.sample(10)

SaleType        object
LotArea          int64
ExterQual       object
TotalBsmtSF      int64
Heating         object
ScreenPorch      int64
HouseStyle      object
LandContour     object
ExterCond       object
BsmtFullBath     int64
dtype: object

In [17]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_training_predictors

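| | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 8450 | 7 | 5 | 2003 | 2003 | 706 | 0 | 150 | 856 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 20 | 9600 | 6 | 8 | 1976 | 1976 | 978 | 0 | 284 | 1262 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 60 | 11250 | 7 | 5 | 2001 | 2002 | 486 | 0 | 434 | 920 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 70 | 9550 | 7 | 5 | 1915 | 1970 | 216 | 0 | 540 | 756 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 60 | 14260 | 8 | 5 | 2000 | 2000 | 655 | 0 | 490 | 1145 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 50 | 14115 | 5 | 5 | 1993 | 1995 | 732 | 0 | 64 | 796 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 20 | 10084 | 8 | 5 | 2004 | 2005 | 1369 | 0 | 317 | 1686 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 60 | 10382 | 7 | 6 | 1973 | 1973 | 859 | 32 | 216 | 1107 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 50 | 6120 | 7 | 5 | 1931 | 1950 | 0 | 0 | 952 | 952 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 190 | 7420 | 5 | 6 | 1939 | 1950 | 851 | 0 | 140 | 991 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |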


Comparing one-hot encoded model vs non-categorical model

In [18]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
  # multiply by -1 to get a positive MAE; by sklearn convention the scorer returns the negated value
  return -1 * cross_val_score(RandomForestRegressor(50),
                             X, y, scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

Mean Absolute Error when Dropping Categoricals: 18461
Mean Absolute Error with One-Hot Encoding: 18127


### Applying to Multiple Files

In [19]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

# This was cool, I was not aware of this
final_train, final_test = one_hot_encoded_training_predictors.align(
  one_hot_encoded_test_predictors, join='left', axis=1)

* The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset.) The argument join='left' specifies that we will do the equivalent of SQL's left join. That means, if there are ever columns that show up in one dataset and not the other, we will keep exactly the columns from our training data. The argument join='inner' would do what SQL databases call an inner join, keeping only the columns showing up in both datasets. That's also a sensible choice.
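
The difference between the two join modes can be seen on a toy example. This is a minimal sketch with made-up frames (`train_toy`, `test_toy`) in which each dataset contains one category the other lacks.

```python
import pandas as pd

# "blue" appears only in training, "green" only in test (made-up data)
train_toy = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "blue", "red"]})
test_toy = pd.DataFrame({"size": [4, 5], "color": ["red", "green"]})

train_encoded = pd.get_dummies(train_toy)
test_encoded = pd.get_dummies(test_toy)

# join='left': keep exactly the training columns, in the training order
left_train, left_test = train_encoded.align(test_encoded, join="left", axis=1)
print(list(left_test.columns))

# join='inner': keep only the columns present in both frames
inner_train, inner_test = train_encoded.align(test_encoded, join="inner", axis=1)
print(list(inner_test.columns))
```

With `join='left'`, a column that exists only in the training data (here `color_blue`) comes back filled with NaN in the aligned test frame, so it usually needs filling, for example with 0, before the frame is passed to a model.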

### One-Hot Encoding Conclusion

* Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding and this can be added to a Pipeline. Unfortunately, in older versions it doesn't handle text or object values, which is a common use case (since scikit-learn 0.20, OneHotEncoder accepts string categories directly).


* Applications To Text for Deep Learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.


* Categoricals with Many Values: Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
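
The Pipelines point above can be sketched roughly as follows. This assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string categories and `ColumnTransformer` is available; the toy frames, the step names ("onehot", "prep", "rf") and the unseen "Dirt" category are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (made-up values)
train_toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})
y_toy = [208500, 181500, 223500, 140000]

# A new row containing a category never seen during fitting
new_rows = pd.DataFrame({"LotArea": [10000], "Street": ["Dirt"]})

# One-hot encode the text column, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Street"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])
model.fit(train_toy, y_toy)

# handle_unknown='ignore' encodes the unseen category as all zeros
preds = model.predict(new_rows)
print(preds)
```

Because the encoder remembers the training categories, the columns stay consistent between fitting and prediction without a separate align step.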

### Learning to Use XGBoost
Source: https://www.kaggle.com/dansbecker/learning-to-use-xgboost

* XGBoost is the leading model for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). XGBoost models dominate many Kaggle competitions.


* To reach peak accuracy, XGBoost models require more knowledge and model tuning than techniques like Random Forest. 

* XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm (scikit-learn has another version of this algorithm, but XGBoost has some technical advantages). What are Gradient Boosted Decision Trees? We'll walk through a diagram.

### [Image Here]

* We go through cycles that repeatedly build new models and combine them into an ensemble model. We start the cycle by calculating the errors for each observation in the dataset. We then build a new model to predict those errors. We add predictions from this error-predicting model to the "ensemble of models."

* To make a prediction, we add the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.

* There's one piece outside that cycle. We need some base prediction to start the cycle. In practice, the initial predictions can be pretty naive. Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.

* This process may sound complicated, but the code to use it is straightforward. We'll fill in some additional explanatory details in the model tuning section below.
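
The cycle described above can be sketched in plain Python. This is a simplified illustration, not XGBoost's actual implementation: it assumes a single feature, uses a one-split tree (a "stump") as the model built in each cycle, takes the mean of the targets as the naive base prediction, and runs on made-up data. The helper names `fit_stump`, `boost` and `mae` are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split tree on a single feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_cycles=10):
    """Build an additive ensemble: base prediction plus error-predicting stumps."""
    base = sum(ys) / len(ys)                 # naive starting prediction
    ensemble = []
    preds = [base] * len(ys)
    for _ in range(n_cycles):
        errors = [y - p for y, p in zip(ys, preds)]   # current errors
        stump = fit_stump(xs, errors)                 # new model predicts them
        ensemble.append(stump)                        # add it to the ensemble
        preds = [p + stump(x) for x, p in zip(xs, preds)]
    # A prediction is the base value plus the sum of all models' outputs
    return lambda x: base + sum(model(x) for model in ensemble)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up one-feature dataset
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5, 9.0, 9.5]

predict = boost(xs, ys)
base_mae = mae(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mae = mae(ys, [predict(x) for x in xs])
print(base_mae, boosted_mae)
```

The training error of the ensemble ends up well below that of the base prediction alone, which is the effect the cycle is designed to produce. Real implementations add refinements, such as shrinking each model's contribution, that this sketch leaves out.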


### Example - XGBoost

In [23]:
# Example Setup
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer

data = pd.read_csv('Data/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# .values replaces as_matrix(), which was removed in pandas 1.0
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)

my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

Question: What does the verbose argument do in the fit function?

In [24]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False keeps fit from printing evaluation updates with each cycle
my_model.fit(train_X, train_y, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [26]:
# make predictions 
predictions = my_model.predict(test_X)

from sklearn.metrics import mean_absolute_error
print('Mean Absolute Error:' + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error:15371.197089


##### Model Tuning