### Encoding Categorical Features
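Real-world data rarely arrives model-ready, so practical data science almost always involves preprocessing. This section encodes the categorical features of the Ames housing data so that XGBoost can use them.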

In [18]:
import pandas as pd
import xgboost as xgb
import numpy as np

In [19]:
# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
ames_housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3970/datasets/ames_unprocessed_data.csv")

In [20]:
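# Reorder the columns: numeric features first, then the five categorical columns, with SalePrice last
columns_rearranged = ["MSSubClass","LotFrontage","LotArea"] + ames_housing_data.columns.tolist()[7:-2] + ["MSZoning","PavedDrive"] + ames_housing_data.columns.tolist()[4:7] + ["SalePrice"]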
ames_housing_data = ames_housing_data[columns_rearranged]
X,y = ames_housing_data.iloc[:,:-1],ames_housing_data.iloc[:,-1]

In [21]:
ames_housing_data.shape

(1460, 21)

In [22]:
ames_housing_data.describe()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | Remodeled | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | Fireplaces | GarageArea | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 0.476712 | 1515.463699 | 0.425342 | 0.057534 | 1.565068 | 0.382877 | 2.866438 | 0.613014 | 472.980137 | 180921.19589 |
| std | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 0.499629 | 525.480383 | 0.518911 | 0.238753 | 0.550916 | 0.502885 | 0.815778 | 0.644666 | 213.804841 | 79442.502883 |
| min | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 0.0 | 334.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34900.0 |
| 25% | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 0.0 | 1129.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 334.5 | 129975.0 |
| 50% | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 0.0 | 1464.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 1.0 | 480.0 | 163000.0 |
| 75% | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 1.0 | 1776.75 | 1.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1.0 | 576.0 | 214000.0 |
| max | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 1.0 | 5642.0 | 3.0 | 2.0 | 3.0 | 2.0 | 8.0 | 3.0 | 1418.0 | 755000.0 |


In [23]:
ames_housing_data.describe().info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 16 columns):
MSSubClass      8 non-null float64
LotFrontage     8 non-null float64
LotArea         8 non-null float64
OverallQual     8 non-null float64
OverallCond     8 non-null float64
YearBuilt       8 non-null float64
Remodeled       8 non-null float64
GrLivArea       8 non-null float64
BsmtFullBath    8 non-null float64
BsmtHalfBath    8 non-null float64
FullBath        8 non-null float64
HalfBath        8 non-null float64
BedroomAbvGr    8 non-null float64
Fireplaces      8 non-null float64
GarageArea      8 non-null float64
SalePrice       8 non-null float64
dtypes: float64(16)
memory usage: 1.1+ KB


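__Observation__: The LotFrontage column has 259 missing values (1,201 non-null out of 1,460 rows). Five columns, MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle, are categorical and therefore absent from the describe() output. They need to be encoded numerically before XGBoost can use them.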
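
Both observations can be checked programmatically. The following is a minimal sketch, assuming a small hypothetical frame named toy in place of the Ames data (its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Ames columns
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()

# Columns that are not numeric are the candidates for categorical encoding
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column[missing_per_column > 0])
print(non_numeric_columns)
```

Applied to ames_housing_data, the same two calls should report the missing LotFrontage entries and list the five categorical columns.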

### LabelEncoder
Encode the categorical columns in the dataset with __LabelEncoder__ so that they are represented numerically. <br>
The data has five categorical columns:
- MSZoning
- PavedDrive
- Neighborhood
- BldgType
- HouseStyle

Scikit-learn's LabelEncoder class converts the values in a categorical column into integers.

In [24]:
# required to submit soln in DataCamp shell
df = ames_housing_data.copy()

In [25]:
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

In [26]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)

In [33]:
categorical_mask[10:25]

FullBath        False
HalfBath        False
BedroomAbvGr    False
Fireplaces      False
GarageArea      False
MSZoning         True
PavedDrive       True
Neighborhood     True
BldgType         True
HouseStyle       True
SalePrice       False
dtype: bool

In [28]:
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()

In [29]:
categorical_columns

['MSZoning', 'PavedDrive', 'Neighborhood', 'BldgType', 'HouseStyle']

In [34]:
# Print the head of the categorical columns
print(df[categorical_columns].head())

  MSZoning PavedDrive Neighborhood BldgType HouseStyle
0       RL          Y      CollgCr     1Fam     2Story
1       RL          Y      Veenker     1Fam     1Story
2       RL          Y      CollgCr     1Fam     2Story
3       RL          Y      Crawfor     1Fam     2Story
4       RL          Y      NoRidge     1Fam     2Story


In [35]:
# Create LabelEncoder object: le
le = LabelEncoder()

In [36]:
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))

In [37]:
# Print the head of the categorical columns
print(df[categorical_columns].head())

   MSZoning  PavedDrive  Neighborhood  BldgType  HouseStyle
0         3           2             5         0           5
1         3           2            24         0           2
2         3           2             5         0           5
3         3           2             6         0           5
4         3           2            15         0           5
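
Because the single le object above is refit on every column, it presumably only remembers the classes of the last column it saw. If the integer codes need to be mapped back to labels later, one option is to keep one encoder per column. This is a minimal sketch on a hypothetical toy frame, not part of the original exercise:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two of the categorical columns
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"],
    "PavedDrive": ["Y", "Y", "N", "Y", "P"],
})

# Fit one encoder per column and keep it, so each mapping can be inverted later
encoders = {}
codes = {}
for column in toy.columns:
    encoders[column] = LabelEncoder()
    codes[column] = encoders[column].fit_transform(toy[column].tolist())
encoded = pd.DataFrame(codes)

# classes_ lists the original labels in code order (position = integer code)
print(list(encoders["Neighborhood"].classes_))
print(encoded)

# Map codes back to the original labels
print(encoders["Neighborhood"].inverse_transform(encoded["Neighborhood"]))
```

LabelEncoder assigns codes in sorted label order, which is why the codes carry no meaning beyond identity.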


### OneHotEncoder
In the categorical columns of this dataset, there is no natural ordering between the entries. For example, LabelEncoder encoded the CollgCr Neighborhood as 5, Veenker as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No, and letting the model assume such an ordering may result in poor performance.

As a result, one more step is needed: apply one-hot encoding to create binary, or "dummy", variables. Scikit-learn offers OneHotEncoder for this.

In [39]:
# required to submit soln in DataCamp shell
df = ames_housing_data.copy()

In [40]:
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

In [41]:
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)

# Create OneHotEncoder: ohe
# (categorical_features is a legacy argument that was removed in scikit-learn 0.22)
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)

In [43]:
ohe

OneHotEncoder(categorical_features=MSSubClass      False
LotFrontage     False
LotArea         False
OverallQual     False
OverallCond     False
YearBuilt       False
Remodeled       False
GrLivArea       False
BsmtFullBath    False
BsmtHalfBath    False
FullBath        False
HalfBath        False
BedroomAbvGr    False
Fireplaces      False
GarageArea      False
MSZoning         True
PavedDrive       True
Neighborhood     True
BldgType         True
HouseStyle       True
SalePrice       False
dtype: bool,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values='auto', sparse=False)

In [44]:
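# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
# This call fails in this run: df was re-copied from the raw data above, so the categorical
# columns hold strings again, while this legacy encoder expects the LabelEncoder integer codes
df_encoded = ohe.fit_transform(df)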

ValueError: could not convert string to float: '1Story'

In [76]:
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])

In [75]:
# Print the shape of the original DataFrame
print(df.shape)

# Print the shape of the transformed array
print(df_encoded.shape)
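
Since scikit-learn 0.20, OneHotEncoder accepts string categories directly, so the LabelEncoder step is not required with newer releases. The following is a minimal sketch of that usage, assuming two small hypothetical arrays in place of the Ames columns; it is not the call used in the cells above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins: two categorical columns (MSZoning, PavedDrive) and one numeric (LotArea)
categorical = np.array([["RL", "Y"], ["RM", "Y"], ["RL", "N"]], dtype=object)
numeric = np.array([[8450.0], [9600.0], [11250.0]])

# Encode only the categorical columns; the default output is sparse, so densify it
ohe = OneHotEncoder(handle_unknown="ignore")
dummies = ohe.fit_transform(categorical).toarray()

# Put the numeric column back next to the dummy columns
encoded = np.hstack([numeric, dummies])

print(ohe.categories_)  # categories found per input column, in sorted order
print(encoded.shape)
```

Each distinct category becomes its own 0/1 column, so the transformed array is wider than the input.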

## DictVectorizer
LabelEncoder followed by OneHotEncoder can be simplified by using a [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html).

Using a DictVectorizer on a DataFrame that has been converted to a dictionary yields label encoding as well as one-hot encoding in one go.

When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on.  However, note that this transformer will only do a binary one-hot encoding when feature values are of type string. If categorical features are represented as numeric values such as int, the DictVectorizer can be followed by OneHotEncoder to complete binary one-hot encoding.
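
The behaviour described above can be seen on two hypothetical records (the values are made up for illustration): the string feature is expanded into one column per value, while the numeric feature passes through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, in the shape produced by DataFrame.to_dict("records")
records = [
    {"LotArea": 8450, "MSZoning": "RL"},
    {"LotArea": 9600, "MSZoning": "RM"},
]

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(records)

# String features become "name=value" columns; numeric features keep their name
print(dv.feature_names_)
print(encoded)
```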

In [45]:
# required to submit soln in DataCamp shell
df = ames_housing_data.copy()

# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer

# Convert df into a dictionary: df_dict
df_dict = df.to_dict('records')

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)

# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)

# Print the resulting first five rows
print(df_encoded[:5,:])

# Print the vocabulary
print(dv.vocabulary_)

## Preprocessing within Pipeline
Having seen the individual steps needed to process the Ames housing data, take the much cleaner and more succinct DictVectorizer approach and place it alongside an XGBRegressor inside a scikit-learn pipeline.

In [53]:
# Import necessary modules
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer

# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
ames_housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3970/datasets/ames_unprocessed_data.csv")

# Create X and y from the data
columns_rearranged = ["MSSubClass","LotFrontage","LotArea"] + ames_housing_data.columns.tolist()[7:-2] + ["MSZoning","PavedDrive"] + ames_housing_data.columns.tolist()[4:7] + ["SalePrice"]

ames_housing_data = ames_housing_data[columns_rearranged]

# Copy the features so that the fillna below does not write into a slice of ames_housing_data
X, y = ames_housing_data.iloc[:, :-1].copy(), ames_housing_data.iloc[:, -1]

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Set up the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)), ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps=steps)

# Convert the feature DataFrame into a list of dictionaries, as DictVectorizer expects
X = X.to_dict('records')

# Fit the pipeline
pipeline.fit(X, y)

Pipeline(memory=None,
     steps=[('ohe_onestep', DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=False)), ('xgb_model', XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_ch...
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])

## Cross-validating XGBoost Model with Categorical Vars Encoded within Pipeline
Evaluate the full pipeline (preprocessing plus model) with 10-fold cross-validation.

In [54]:
# Import necessary modules
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
ames_housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3970/datasets/ames_unprocessed_data.csv")

# Create X and y from the data
columns_rearranged = ["MSSubClass","LotFrontage","LotArea"] + ames_housing_data.columns.tolist()[7:-2] + ["MSZoning","PavedDrive"] + ames_housing_data.columns.tolist()[4:7] + ["SalePrice"]

ames_housing_data = ames_housing_data[columns_rearranged]

# Copy the features so that the fillna below does not write into a slice of ames_housing_data
X, y = ames_housing_data.iloc[:, :-1].copy(), ames_housing_data.iloc[:, -1]

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Set up the pipeline steps: steps
# ("reg:squarederror" is the current name of the objective formerly called "reg:linear")
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps=steps)

# Convert the feature DataFrame into a list of dictionaries, as DictVectorizer expects
X = X.to_dict('records')

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X, y, scoring="neg_mean_squared_error", cv=10, n_jobs=-1)

# Print the mean of the 10 per-fold RMSE values
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))


10-fold RMSE:  29867.6037207
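
The scores returned for neg_mean_squared_error are negated MSE values, one per fold, which is why they are made positive before the square root is taken. The sketch below uses two hypothetical fold scores to show that step, and to show that averaging the per-fold RMSE values (as done above) is not the same as taking the square root of the average MSE:

```python
import numpy as np

# Hypothetical scores, as returned by scoring="neg_mean_squared_error"
neg_mse_scores = np.array([-4.0, -16.0])

# Per-fold RMSE: flip the sign, then take the square root
per_fold_rmse = np.sqrt(-neg_mse_scores)

# Mean of the per-fold RMSE values (the quantity printed above)
mean_of_fold_rmse = per_fold_rmse.mean()

# Alternative: RMSE computed from the average MSE across folds
rmse_of_mean_mse = np.sqrt(-neg_mse_scores.mean())

print(per_fold_rmse, mean_of_fold_rmse, rmse_of_mean_mse)
```

Either summary is reasonable; the point is only to state which one is being reported.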


## Kidney disease case study I: Categorical Imputer (sklearn_pandas)
This case study explores pipelines on a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, and many of its values are missing. The goal is to predict who has chronic kidney disease, given various blood indicators as features.

This example introduces a new library, __sklearn_pandas__. This library permits one to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, sklearn_pandas provides the __CategoricalImputer()__ class, which imputes missing categorical values directly, and the __DataFrameMapper()__ class, which applies any sklearn-compatible transformer to DataFrame columns and returns either a NumPy array or a DataFrame.

A transformer called __Dictifier__ encapsulates the conversion of a DataFrame with .to_dict("records"), so that the conversion does not have to be done explicitly and works inside a pipeline.

The list of feature names is in kidney_feature_names, the target name is in kidney_target_name, the features in X, and the target in y.

Task: apply the CategoricalImputer to impute all of the categorical columns in the dataset, using the numeric imputation mapper as a template. Note the keyword arguments input_df=True and df_out=True, which make the mapper work with DataFrames instead of arrays. By default, the transformers are passed a NumPy array of the selected columns as input, and as a result the output of the DataFrameMapper is also an array. Scikit-learn transformers have historically been designed to work with NumPy arrays, not pandas DataFrames, even though their basic indexing interfaces are similar.

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn_pandas import DataFrameMapper, CategoricalImputer
# Imputer is the legacy name; scikit-learn 0.22+ provides sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer, LabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

ModuleNotFoundError: No module named 'sklearn_pandas'

In [2]:
# Dictifier: a transformer that wraps DataFrame.to_dict("records") so that
# DictVectorizer can be used in a pipeline that receives a DataFrame
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        # Arrays are wrapped in a DataFrame first, so their keys become column positions
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        return pd.DataFrame(X).to_dict("records")

In [21]:
# load data (may have to re-point once in github)

kidney_columns = ["age","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot",
                  "hemo","pcv","wc","rc","htn","dm","cad","appet","pe","ane","class"]

kidney_data = pd.read_csv('C:\\Users\\seanf\\Documents\\XG_Boost_DataCamp\\data\\chronic_kidney_disease.csv',
                         header = None,names = kidney_columns, na_values = "?")

In [19]:
# kidney_columns = kidney_columns[:5] + kidney_columns[9:18] + kidney_columns[5:9] + kidney_columns[18:]

# kidney_data = kidney_data[kidney_columns]

In [29]:
# remove label from list of feature names
kidney_feature_names = kidney_data.columns.tolist()[:-1]

# create list containing label (target) 'class'
kidney_target_name = kidney_data.columns.tolist()[-1]

In [30]:
# assign training data and label to X, y
X,y = kidney_data[kidney_feature_names], kidney_data[kidney_target_name]

In [3]:
# for submission to DataCamp Exercise
# Import necessary modules
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer

# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], Imputer(strategy="median")) for numeric_feature in non_categorical_columns],
    input_df=True,
    df_out=True
    )

# Apply categorical imputer (df_out=True, matching the numeric mapper, so that a DataFrame is returned)
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, CategoricalImputer()) for category_feature in categorical_columns],
    input_df=True,
    df_out=True
    )

ModuleNotFoundError: No module named 'sklearn_pandas'

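The import fails with ModuleNotFoundError: No module named 'sklearn_pandas'. <br>
It seems conda does not have a sklearn_pandas package that supports Windows 10, so the cells below that depend on it (marked In [None]) were not executed.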
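
Where sklearn_pandas cannot be installed, one possible alternative, which is not the approach used in the exercise, is to perform the same two imputations with scikit-learn's own SimpleImputer and ColumnTransformer. The following is a minimal sketch on a hypothetical toy frame; the column names are borrowed from the kidney data, but the values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for a few kidney-disease columns with missing values
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "bp": [80.0, 70.0, np.nan, 70.0],
    "rbc": pd.Series(["normal", np.nan, "normal", "abnormal"], dtype=object),
    "htn": pd.Series(["yes", "no", np.nan, "no"], dtype=object),
})
numeric_columns = ["age", "bp"]
categorical_columns = ["rbc", "htn"]

# Median for numeric columns, most frequent value for categorical columns
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_columns),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_columns),
])

# The transformer returns an array; rebuild a DataFrame with the same column order
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=numeric_columns + categorical_columns,
)
print(imputed)
```

The output columns follow the order in which the transformers are listed, which is why the column names are rebuilt from the two lists.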

## Kidney disease case study II: Feature Union (FeatureUnion)
Having separately imputed the numeric and the categorical columns, the task is now to use scikit-learn's __FeatureUnion__ to concatenate their results, which are contained in two separate transformer objects, __numeric_imputation_mapper__ and __categorical_imputation_mapper__.

Just like with pipelines, pass it a list of (string, transformer) tuples, where the first element of each tuple is the name of the transformer.

In [None]:
# for submission to DataCamp Exercise
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])
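
To illustrate what FeatureUnion does with its (name, transformer) tuples, here is a minimal sketch that swaps the two mappers for simple FunctionTransformer steps on a hypothetical array. Each transformer sees the same input, and the outputs are concatenated column-wise:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def first_column(X):
    # Keep only the first column, as a 2-D array
    return X[:, [0]]

def doubled(X):
    # Multiply every value by two
    return X * 2

# Hypothetical input with two rows and two columns
X_toy = np.array([[1.0, 10.0], [2.0, 20.0]])

union = FeatureUnion([
    ("first", FunctionTransformer(first_column)),
    ("doubled", FunctionTransformer(doubled)),
])

# One column from "first" followed by two columns from "doubled"
combined = union.fit_transform(X_toy)
print(combined)
```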

## Kidney disease case study III: Full pipeline
Piece together all of the transforms, along with an XGBClassifier, to build the full pipeline.

Besides the numeric_categorical_union created in the previous exercise, there are two other transforms needed: the Dictifier() transform and the DictVectorizer().

Task: After creating the pipeline, cross-validate it to see how well it performs.

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn_pandas import DataFrameMapper, CategoricalImputer
# Imputer is the legacy name; scikit-learn 0.22+ provides sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer, LabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Dictifier: a transformer that wraps DataFrame.to_dict("records") so that
# DictVectorizer can be used in a pipeline that receives a DataFrame
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        # Arrays are wrapped in a DataFrame first, so their keys become column positions
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        return pd.DataFrame(X).to_dict("records")

In [4]:
# load data (may have to re-point once in github)

kidney_columns = ["age","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot",
                  "hemo","pcv","wc","rc","htn","dm","cad","appet","pe","ane","class"]

kidney_data = pd.read_csv('C:\\Users\\seanf\\Documents\\XG_Boost_DataCamp\\data\\chronic_kidney_disease.csv',
                         header = None,names = kidney_columns, na_values = "?")

# remove label from list of feature names
kidney_feature_names = kidney_data.columns.tolist()[:-1]

# create list containing label (target) 'class'
kidney_target_name = kidney_data.columns.tolist()[-1]

# assign training data and label to X, y
X,y = kidney_data[kidney_feature_names], kidney_data[kidney_target_name]

In [None]:
# Create full pipeline
pipeline = Pipeline([
    ("featureunion", numeric_categorical_union),
    ("dictifier", Dictifier()), 
    ('vectorizer', DictVectorizer(sort=False)),
    ('clf',xgb.XGBClassifier()),
    ])

# Perform cross-validation
# (X holds the feature columns only; the mappers select those same columns by name)
cross_val_scores = cross_val_score(estimator = pipeline, 
                                   X = X, 
                                   y = y, 
                                   scoring ="roc_auc", 
                                   cv = 3)

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))

## Tuning XGBoost Hyperparameters in a Pipeline II
Perform a randomized search and identify the best hyperparameters.

In [None]:
# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator = pipeline, 
                                        param_distributions = gbm_param_grid, 
                                        cv = 2, 
                                        n_iter = 2, 
                                        verbose = 1, 
                                        scoring = 'roc_auc')

# Fit the estimator
randomized_roc_auc.fit(X, y)

# Compute metrics
print("Best score: ", randomized_roc_auc.best_score_)
print("Best model: ", randomized_roc_auc.best_estimator_)
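
The keys in gbm_param_grid follow scikit-learn's convention for addressing a step's parameters inside a pipeline: the step name, two underscores, then the parameter name. The following is a minimal sketch of that convention, assuming a DecisionTreeClassifier as a stand-in for the XGBClassifier so that it runs without xgboost:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical two-step pipeline; "clf" plays the role of the XGBClassifier step
toy_pipeline = Pipeline([
    ("vectorizer", DictVectorizer(sparse=False)),
    ("clf", DecisionTreeClassifier()),
])

# Every tunable parameter is exposed as "<step name>__<parameter name>"
clf_params = sorted(name for name in toy_pipeline.get_params() if name.startswith("clf__"))
print(clf_params)

# The same names are accepted by set_params, which is what the search uses internally
toy_pipeline.set_params(clf__max_depth=3)
print(toy_pipeline.named_steps["clf"].max_depth)
```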
