See LICENSE for N1 and N2 for software license for this notebook.

## Classification

In this lecture, I will bring together the feature engineering techniques that we have covered in this course to tackle a classification problem. This will give you an idea of the end-to-end pipeline for building a machine learning model for classification.

I will:
- build a gradient boosted tree
- use feature-engine for the feature engineering steps
- set up an entire engineering and prediction pipeline using a Scikit-learn Pipeline

============================================================================

## In this demo:

We will use the Titanic dataset. Please refer to the lecture **Datasets** in Section 1 of the course for instructions on how to download and prepare this dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for the model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# for feature engineering
from feature_engine import imputation as mdi
from feature_engine import discretisation as dsc
from feature_engine import encoding as ce

In [None]:
# load dataset

cols = [
    'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin',
    'embarked', 'survived'
]

data = pd.read_csv('./data/titanic.csv', usecols=cols)

data.head()

### Types of variables (Section 2)

Let's find out what types of variables there are in this dataset.

In [None]:
# let's inspect the type of variables in pandas

data.dtypes

There are categorical and numerical variables.

In [None]:
# let's inspect the variable values

for var in data.columns:
    print(var, data[var].unique()[0:20], '\n')

There are continuous and discrete variables, and also mixed variables.

In [None]:
# make lists of variable types

# numerical: discrete vs continuous
discrete = [var for var in data.columns if data[var].dtype!='O' and var!='survived' and data[var].nunique()<10]
continuous = [var for var in data.columns if data[var].dtype!='O' and var!='survived' and var not in discrete]

# mixed
mixed = ['cabin']

# categorical
categorical = [var for var in data.columns if data[var].dtype=='O' and var not in mixed]

print('There are {} discrete variables'.format(len(discrete)))
print('There are {} continuous variables'.format(len(continuous)))
print('There are {} categorical variables'.format(len(categorical)))
print('There are {} mixed variables'.format(len(mixed)))

In [None]:
discrete

In [None]:
continuous

In [None]:
categorical

In [None]:
mixed

### Variable characteristics (Section 3)

In [None]:
# missing data

data.isnull().mean()

There is missing data in our variables.

In [None]:
# cardinality (number of different categories)

data[categorical+mixed].nunique()

Some variables are highly cardinal.

In [None]:
# outliers

data[continuous].boxplot(figsize=(10,4))

In [None]:
# outliers in discrete
data[discrete].boxplot(figsize=(10,4))
plt.show()

Some variables show outliers or unusual values.

In [None]:
# values bigger than 3 are rare for parch

data['parch'].value_counts()

In [None]:
# feature magnitude

data.describe()

Features are on different scales, but this is not relevant for gradient boosted trees, because tree splits depend only on the ordering of the values within each feature.
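
To illustrate why scale does not matter for tree-based models, here is a minimal sketch on made-up data. It uses a single `DecisionTreeClassifier` as a stand-in for the gradient boosted trees used in this notebook, and the feature values, the target rule and the scaling factor of 1000 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical integer-valued features on very different scales
X = np.column_stack([
    rng.integers(0, 4, size=200),    # small range, like a family-size count
    rng.integers(1, 500, size=200),  # large range, like a ticket price
])

# Hypothetical target that depends on both features
y = ((X[:, 0] > 1) ^ (X[:, 1] > 250)).astype(int)

# same data, with the second column multiplied by 1000
X_scaled = X * np.array([1, 1000])

# fit one tree on each version of the data, with identical settings
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# compare the predictions of both trees on their own training data
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print('Identical predictions:', same)
```

Rescaling a feature moves the split thresholds but does not change which observations fall on each side of a split, so the two trees partition the data in the same way.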

### Engineering mixed type of variables (Section 11)

Extract numerical and categorical parts of variables.
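
As a minimal sketch of what this extraction does, the snippet below applies it to a few made-up cabin values. The values in `cabins` are hypothetical and serve only to show the behaviour, including what happens with a missing value and with an entry that lists more than one cabin.

```python
import pandas as pd

# Hypothetical cabin values, chosen only to illustrate the extraction
cabins = pd.Series(['C85', 'B57 B59', None, 'E46'])

# first run of digits in the string, cast to float so that gaps stay as NaN
cabin_num = cabins.str.extract(r'(\d+)', expand=False).astype('float')

# first character of the string, that is, the leading letter
cabin_cat = cabins.str[0]

print(pd.DataFrame({
    'cabin': cabins,
    'cabin_num': cabin_num,
    'cabin_cat': cabin_cat,
}))
```

Note that when an entry lists several cabins, only the first number and the first letter are kept.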

In [None]:
# Cabin
data['cabin_num'] = data['cabin'].str.extract(r'(\d+)') # captures the first run of digits (raw string avoids an invalid escape warning)
data['cabin_num'] = data['cabin_num'].astype('float')
data['cabin_cat'] = data['cabin'].str[0] # captures the first letter

# show dataframe
data.head()

Now that we have extracted the numerical and categorical parts, we can discard the mixed variable cabin.

In [None]:
# drop original mixed

data.drop(['cabin'], axis=1, inplace=True)

In [None]:
# separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.1,  # proportion of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

### Missing data imputation (Section 4)

In [None]:
# numerical

X_train.select_dtypes(exclude='O').isnull().mean()

In [None]:
# categorical

X_train.select_dtypes(include='O').isnull().mean()

Imputation methods I will perform:

- Numerical: arbitrary value imputation
- Categorical: add missing label imputation

Because I will build a Gradient Boosted tree, I am not particularly worried about disturbing linearity or distributions of variables.
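
The two imputation methods can be sketched in plain pandas as follows. This is only an illustration on a made-up frame: the values in `toy` are hypothetical, and the label `'Missing'` is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a gap in one numerical and one categorical column
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0],
    'embarked': ['S', np.nan, 'C'],
})

# arbitrary value imputation: a number outside the observed range flags the gap
toy['age'] = toy['age'].fillna(-1)

# missing label imputation: the gap becomes a category of its own
toy['embarked'] = toy['embarked'].fillna('Missing')

print(toy)
```

Both methods keep the information that the value was missing, which a tree can use as a signal when it splits on that variable.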

### Categorical encoding and rare labels (Section 6)

In [None]:
# check cardinality again

X_train[['cabin_cat', 'sex', 'embarked']].nunique()

Now that I have split cabin into its numerical and categorical parts, the cardinality of the categorical part is much lower.

In [None]:
# check variable frequency

var = 'cabin_cat'
(X_train[var].value_counts() / len(X_train)).sort_values()

Categories T and G appear in only a few observations, so I will group them into a single rare category.
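
The idea behind rare label grouping can be sketched in plain pandas. The deck letters and their counts below are made up, and the 5% threshold `tol` and the label `'Rare'` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical deck letters; 'T' is deliberately infrequent
decks = pd.Series(['C'] * 60 + ['B'] * 39 + ['T'] * 1)

# assumed frequency threshold below which a category counts as rare
tol = 0.05

# fraction of observations per category
freq = decks.value_counts(normalize=True)

# categories whose frequency falls below the threshold
rare = freq[freq < tol].index

# keep frequent categories as they are, replace the rest with one label
grouped = decks.where(~decks.isin(rare), 'Rare')

print(grouped.value_counts())
```

In a real workflow the frequencies should be learned from the training set only and then applied to the test set, which is what fitting the step inside a pipeline achieves.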

### Discretisation or Variable transformation (Sections 7 and 8)

Let's inspect the variable distributions.

In [None]:
# numerical

X_train.select_dtypes(exclude='O').hist(bins=30, figsize=(8,8))
plt.show()

For decision trees, the variable distribution is not so important, so in principle we don't need to change it. Decision trees are also robust to outliers.

### Putting it all together

In [None]:
titanic_pipe = Pipeline([

    # missing data imputation - section 4
    ('imputer_num',
     mdi.ArbitraryNumberImputer(arbitrary_number=-1,
                                variables=['age', 'fare', 'cabin_num'])),
    ('imputer_cat',
     mdi.CategoricalImputer(variables=['embarked', 'cabin_cat'])),

    # categorical encoding - section 6
    ('encoder_rare_label',
     ce.RareLabelEncoder(tol=0.01,
                         n_categories=6,
                         variables=['cabin_cat'])),
    ('categorical_encoder',
     ce.OrdinalEncoder(encoding_method='ordered',
                       variables=['cabin_cat', 'sex', 'embarked'])),

    # Gradient Boosted machine
    ('gbm', GradientBoostingClassifier(random_state=0))
])

In [None]:
# let's fit the pipeline and make predictions
titanic_pipe.fit(X_train, y_train)

X_train_preds = titanic_pipe.predict_proba(X_train)[:,1]
X_test_preds = titanic_pipe.predict_proba(X_test)[:,1]

In [None]:
# a peek into the prediction values
X_train_preds

In [None]:
print('Train set')
print('GBM roc-auc: {}'.format(roc_auc_score(y_train, X_train_preds)))

print('Test set')
print('GBM roc-auc: {}'.format(roc_auc_score(y_test, X_test_preds)))

In [None]:
# let's explore the importance of the features

importance = pd.Series(titanic_pipe.named_steps['gbm'].feature_importances_)
importance.index = data.drop('survived', axis=1).columns
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(12,6))
plt.show()