# Sensitivity Analysis

This notebook tests four alterations of the bank dataset to examine the robustness of the best model obtained in the Capstone notebook. Each alteration shifts the numeric features up or down by one or two of their standard deviations; all other preprocessing steps are held constant. The best model then predicts on each altered dataset, and its performance is evaluated to gauge its sensitivity.

To do so, we will design a pipeline that alters the data, preprocesses it into the form the best model expects, and evaluates the model's performance.

## Data Preparation

The preprocessing steps below are copied verbatim from the Capstone notebook. Only the name of the Data Frame variable has been changed from 'df' to 'data', to avoid confusion when feeding it through the pipeline designed below.

In [8]:
# import data
import pandas as pd

data = pd.read_csv('bank-full.csv', sep=';')

In [9]:
# drop 'duration' column from dataset
data = data.drop('duration', axis=1)

In [10]:
# define numerical and categorical labels
numerical = ['age', 'balance', 'campaign', 'pdays', 'previous']
categorical = ['job', 'day', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']

In [11]:
# define mapping function
def day2week_mapper(day):
    if 1 <= day and day <= 7:
        return 1
    elif 8 <= day and day <= 14:
        return 2
    elif 15 <= day and day <= 21:
        return 3
    elif 22 <= day and day <= 31:
        return 4
    else: return 0 #unnecessary but it closes the if-statement

# apply mapping   
data['day'] = data['day'].apply(day2week_mapper)
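
The day-to-week mapping above can also be written without a Python-level function call per row. The following is only a sketch on a toy series, assuming `pd.cut` with right-closed bins is an acceptable substitute; the names `days`, `weeks_loop`, and `weeks_cut` are illustrative and not part of the Capstone notebook.

```python
import pandas as pd

def day2week_mapper(day):
    # Same rule as the notebook's mapper, using chained comparisons
    if 1 <= day <= 7:
        return 1
    elif 8 <= day <= 14:
        return 2
    elif 15 <= day <= 21:
        return 3
    elif 22 <= day <= 31:
        return 4
    return 0

# Toy input: every valid day of the month
days = pd.Series(range(1, 32))

# Row-by-row version, as in the notebook
weeks_loop = days.apply(day2week_mapper)

# Vectorized version: bins are (0, 7], (7, 14], (14, 21], (21, 31]
weeks_cut = pd.cut(days, bins=[0, 7, 14, 21, 31], labels=[1, 2, 3, 4])

# Both approaches agree on days 1 to 31
print(weeks_cut.astype(int).tolist() == weeks_loop.tolist())
```

One difference to keep in mind: for values outside 1 to 31, `pd.cut` produces a missing value, whereas the mapper returns 0.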

In [12]:
# cast the categorical columns to the 'category' dtype
data[categorical] = data[categorical].apply(lambda x: x.astype('category'))

In [13]:
# subtract 1 from all the values in the 'campaign' column
data['campaign'] = data['campaign'] - 1

In [14]:
# method to count outliers in a series 
def get_outliers(feature, k):
    feature_outliers = pd.DataFrame()
    Q3 = data[feature].quantile(0.75)
    Q1 = data[feature].quantile(0.25)
    iqr = Q3 - Q1
    for i in range(data.index.size):
        client = data.loc[i]
        if client[feature] < Q1 - k * iqr or client[feature] > Q3 + k * iqr:
            feature_outliers = feature_outliers.append(client)
    return feature_outliers
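
The loop above relies on `DataFrame.append`, which was removed in pandas 2.0, so it only runs on older pandas versions. The same IQR rule can be expressed as a boolean mask. The sketch below uses toy data and a hypothetical helper name, `iqr_outlier_mask`; it is not the Capstone notebook's code.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies more than k * IQR outside [Q1, Q3]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data standing in for two numeric columns of the bank dataset
toy = pd.DataFrame({
    'age': [30, 32, 35, 31, 33, 95],
    'balance': [100, 120, 90, 110, -5000, 105],
})

# A row is an outlier if it is flagged in at least one numeric column
mask = iqr_outlier_mask(toy['age']) | iqr_outlier_mask(toy['balance'])
toy_outliers = toy[mask]
print(toy_outliers)
```

Combining the per-feature masks with `|` keeps each flagged row once, identified by its index. This is not quite the same as `drop_duplicates`, which compares row values and would therefore also merge two distinct clients whose rows happen to be identical.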

In [15]:
# Select all outliers in the dataframe and drop duplicate rows. 
outliers = pd.DataFrame()

for feature in numerical:
    outliers = outliers.append(get_outliers(feature, 1.5))
    
outliers = outliers.drop_duplicates()

# Number of unique outliers in the data
print(outliers.shape[0], 'outliers in the data')

14804 outliers in the data


In [16]:
# Calculate and report new value counts of target variable after having dropped outliers
val_counts_dropped = {'yes': data.y.value_counts()[1] - outliers.y.value_counts()[1], 
                      'no': data.y.value_counts()[0] - outliers.y.value_counts()[0]}

print(val_counts_dropped)
print('New ratio of "yes" to "no" is:', '1:{:.1f}'.format(val_counts_dropped['no'] / val_counts_dropped['yes']))

{'yes': 2684, 'no': 27723}
New ratio of "yes" to "no" is: 1:10.3


In [17]:
# Separate outliers with 'yes' and 'no' target variables
outliers_y = outliers.loc[outliers.y == 'yes']
outliers_n = outliers.drop(outliers_y.index.values)
assert outliers_y.index.size == outliers.y.value_counts()[1] and outliers_n.index.size == outliers.y.value_counts()[0]

In [19]:
# Create a Data Frame with only the rows that contain the target variable 'yes' and check for equality
data_y = data.loc[data.y == 'yes']
assert data_y.index.size == data.y.value_counts()[1]

# Adaptation of method to count outliers in a series (this time using data_y instead of data)
def get_outliers_y(feature, k):
    feature_outliers = pd.DataFrame()
    Q3 = data_y[feature].quantile(0.75)
    Q1 = data_y[feature].quantile(0.25)
    iqr = Q3 - Q1
    for i in data_y.index.values:
        client = data_y.loc[i]
        if client[feature] < Q1 - k * iqr or client[feature] > Q3 + k * iqr:
            feature_outliers = feature_outliers.append(client)
    return feature_outliers

In [20]:
# Select all 'outliers' in data_y and drop duplicate rows. 
outliers_y = pd.DataFrame()

for feature in numerical:
    outliers_y = outliers_y.append(get_outliers_y(feature, 2))
    
outliers_y = outliers_y.drop_duplicates()

# Number of unique outliers in the data
print(outliers_y.shape[0], '"yes" outliers in the data')

1235 "yes" outliers in the data


In [21]:
# drop 'no' outliers
data = data.drop(outliers_n.index.values)

# drop 'yes' outliers
data = data.drop(outliers_y.index.values)

In [22]:
# Reset Data Frame index for better presentation from now onwards
data.reset_index(drop=True, inplace=True)

In [23]:
# Replace -1 with new value in data.pdays
data.pdays = data.pdays.replace(to_replace=-1, value=1.5*data.pdays.max())

With the data cleaned, we can now alter it and feed it to the trained model obtained from the Capstone notebook. After each alteration, the remaining preprocessing steps still need to be carried out, so it is useful to build a single pipeline that automates all of this.

## Streamlining Preprocessing Sub-steps

#### Preprocessing Sub-steps

In [24]:
# Imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import numpy as np

def transform_skewed(df):
    # load features into a dummy data frame
    df_test = df[numerical]
    
    # Initiate a dictionary and possible values of lambda
    std = {}
    lmbda = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]

    # Iterate over numerical features and each value of lambda for each feature
    for feature in numerical:
        feature_std = {}
        for i in lmbda:
            dummy = df_test[feature]
            if i == 0.0:
                feature_std[str(i)] = dummy.apply(lambda x: np.log(x + 0.01)).std()
            else:
                feature_std[str(i)] = dummy.apply(lambda x: (x + 0.01) ** i).std()
        std[str(feature)] = feature_std
        
    # Select best value of lambda for each feature
    best_lmbda = dict((feature, min(std[feature], key=(lambda k: std[feature][k]))) for feature in numerical)

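    # Transform skewed features with best value of lambda.
    # NOTE: the log transform scored above for lambda == 0.0 is not applied here;
    # a best lambda of 0.0 gives (x + 0.01) ** 0.0, i.e. a constant 1.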
    for feature in numerical:
        df[feature] = df[feature].apply(lambda x: (x + 0.01) ** float(best_lmbda[feature]))
        
    return df

def normalize(df):
    # Initialize a scaler, then apply it to the features
    scaler = MinMaxScaler() # default=(0, 1)
    df[numerical] = scaler.fit_transform(df[numerical])
    
    return df

def one_hot(df):
    # Replace values in target variable
    target = df.y.replace(to_replace=['yes', 'no'], value=[1, 0])

    # Drop target variable from data set before one-hot encoding
    features = df.drop('y', axis=1)

    # One-hot encode features
    features = pd.get_dummies(features)

    return target, features

def pca_red(features):
    # Create PCA object that seeks to create 27 components
    pca = PCA(n_components=27, random_state=42)
    pca.fit(features)

    # Transform features using PCA
    features_red = pca.transform(features)
    
    return features_red
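
One property of these sub-steps is worth keeping in mind when reading the sensitivity results: min-max scaling that is refit on the altered data cancels any constant shift, so a shift can only affect the scaled features through the nonlinear power transform applied before scaling. The sketch below illustrates this with plain NumPy on toy values; `min_max` is a hypothetical stand-in that applies the same formula as `MinMaxScaler` with its default range, and the exponent 0.5 is just one of the candidate lambdas.

```python
import numpy as np

def min_max(x):
    """Scale values to [0, 1] with (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Toy numeric feature and a copy shifted by +2 standard deviations
x = np.array([18.0, 25.0, 40.0, 61.0, 95.0])
shifted = x + 2 * x.std(ddof=1)  # ddof=1 matches the pandas default

# Scaling alone: the shift cancels out
same_after_scaling = np.allclose(min_max(x), min_max(shifted))

# Power transform followed by scaling: the shift no longer cancels
differs_with_power = not np.allclose(
    min_max((x + 0.01) ** 0.5),
    min_max((shifted + 0.01) ** 0.5),
)

print(same_after_scaling, differs_with_power)
```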

#### Automating Alteration and Preprocessing

In [36]:
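# Alter numeric features: shift each one by `std` of its own standard deviations.
# NOTE: this modifies `df` in place (as do transform_skewed and normalize).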
def alter(df, std):
    for feature in numerical:
        df[feature] = df[feature] + std * df[feature].std()
    return df

# Complete preprocessing
def preprocess(df):
    df = transform_skewed(df)
    df = normalize(df)
    target, features = one_hot(df)
    features_red = pca_red(features)
    
    return target, features_red
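
Because `alter` assigns into the Data Frame it receives, the caller's data changes as well. If the original data should stay untouched, the shift can be applied to a copy. This is a minimal sketch on toy data; `shift_numeric` is a hypothetical name, not a function from the Capstone notebook.

```python
import pandas as pd

def shift_numeric(df, columns, k):
    """Return a copy of df with each listed column shifted by k of its own standard deviations."""
    shifted = df.copy()
    for col in columns:
        shifted[col] = df[col] + k * df[col].std()
    return shifted

# Toy frame: one numeric and one categorical column
toy = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'job': ['a', 'b', 'c']})

toy_plus1 = shift_numeric(toy, ['age'], k=1)

# The original frame is untouched; only the copy is shifted
print(toy['age'].tolist())
print(toy_plus1['age'].tolist())
```

Adding a constant moves the mean of a column by `k` standard deviations but leaves its standard deviation unchanged.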

#### Final Pipeline to Alter and Predict

In [38]:
def alter_and_predict(df, std):
    df = alter(df, std)
    target, features_red = preprocess(df)
    scores = best_model.evaluate(features_red, target)
    names = best_model.metrics_names
    print('For a change of +' + str(std) + ' std, the performance of the best model is:\n',
          dict(zip(names, scores)))

## Load Model

The model defined below has the architecture that was selected as 'best' in the Capstone notebook. Its trained weights are then loaded from file.

In [27]:
# Import necessary keras elements
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import ReduceLROnPlateau
from keras import regularizers

Using TensorFlow backend.


In [28]:
### FULL CREDIT TO ARSENY KRAVCHENKO FOR THE CODE BELOW

import numpy as np
from sklearn.metrics import fbeta_score
from keras import backend as K


def f_2(y_true, y_pred, threshold_shift=0):
    beta = 2

    # just in case of hipster activation at the final layer
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred, 0, 1)))

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall + K.epsilon())
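
The metric above is the F-beta score with beta = 2, which weights recall more heavily than precision. As a plain-Python illustration of the same formula, computed from raw counts rather than Keras tensors, consider the sketch below; the function name `f_beta` and the counts are made up for the example.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives, and false negatives."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_squared = beta ** 2
    return (beta_squared + 1) * precision * recall / (beta_squared * precision + recall)

# Illustrative counts: precision = 0.6, recall = 0.75
score = f_beta(tp=30, fp=20, fn=10)
print(round(score, 4))
```

With beta = 2 the score sits closer to recall than to precision, which suits a setting where missing a 'yes' client is considered costlier than a false alarm.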

In [30]:
# Architecture
best_model = Sequential()
best_model.add(Dense(units=27, input_dim=27, activation='relu'))
best_model.add(Dense(10, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
best_model.add(Dropout(0.2)) 
best_model.add(Dense(15, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
best_model.add(Dropout(0.2)) 
best_model.add(Dense(20, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
best_model.add(Dropout(0.2))
best_model.add(Dense(1, activation='sigmoid'))

# Compile
best_model.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=[f_2, 'accuracy'])

# Load weights
best_model.load_weights('weights/model_2.best.hdf5')

## Sensitivity Test

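We will now use the pipeline constructed above to test the sensitivity of the model, shifting the values of the numerical features by +1, +2, -1, and -2 standard deviations. Note that `alter`, `transform_skewed`, and `normalize` all modify the Data Frame they receive in place, so each call below operates on `data` in whatever state the earlier calls left it, unless `data` is rebuilt or a copy (`data.copy()`) is passed instead.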
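
One way to apply every shift to the same starting data is to loop over the shifts and work on a fresh copy each time. The sketch below is an assumption-laden illustration, not the notebook's pipeline: `run_shifts` and `evaluate_fn` are hypothetical names, and a simple column mean stands in for the model evaluation.

```python
import pandas as pd

def run_shifts(df, columns, shifts, evaluate_fn):
    """Evaluate evaluate_fn on independently shifted copies of df."""
    results = {}
    for k in shifts:
        shifted = df.copy()  # leave the caller's frame untouched
        shifted[columns] = df[columns] + k * df[columns].std()
        # '{:+d}' prints an explicit sign, e.g. '+1' or '-2'
        results['{:+d} std'.format(k)] = evaluate_fn(shifted)
    return results

# Toy data and a stand-in "evaluation" (the mean of one column)
toy = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'balance': [0.0, 100.0, 200.0]})
results = run_shifts(toy, ['age', 'balance'], [1, 2, -1, -2],
                     evaluate_fn=lambda d: d['age'].mean())
print(results)
```

In the notebook, `evaluate_fn` could wrap `preprocess` and `best_model.evaluate`, so that each shift is preprocessed and scored starting from the same cleaned `data`.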

### +1 Std

In [39]:
alter_and_predict(data, std=1)

For a change of +1 std, the performance of the best model is:
 {'loss': 0.5914005916106693, 'f_2': 0.18203985381664878, 'acc': 0.7207099474462662}


### +2 Std

In [52]:
alter_and_predict(data, std=2)

For a change of +2 std, the performance of the best model is:
 {'loss': 0.69749638551987, 'f_2': 0.1619456995886197, 'acc': 0.6749535827799981}


### -1 Std

In [53]:
alter_and_predict(data, std=-1)


For a change of +-1 std, the performance of the best model is:
 {'loss': 0.5343055053000281, 'f_2': 0.18042761148223552, 'acc': 0.7548856090883344}


### -2 Std

In [54]:
alter_and_predict(data, std=-2)


For a change of +-2 std, the performance of the best model is:
 {'loss': 0.516189042806464, 'f_2': 0.18851469662381398, 'acc': 0.7759071026213928}
