# Ramia_Assignment2

**Part 1: Data Preparation**

Seed value for random number generators to obtain reproducible results:

In [1]:
RANDOM_SEED = 1

Import base packages into the namespace for this program:

In [2]:
import numpy as np
import pandas as pd
import nbconvert

Import the dataset and drop observations with missing values:

In [3]:
bank = pd.read_csv('bank.csv', sep = ';').dropna()
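
One caveat on the step above: dropna() only removes rows containing true missing values (NaN). The following is a minimal sketch on a made-up three-row sample (not the real bank.csv) suggesting that placeholder strings such as 'unknown', which appear in the contact and poutcome columns, survive the call and would need separate handling. The sample rows and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same semicolon-delimited layout
sample_csv = io.StringIO(
    "age;contact;poutcome\n"
    "30;cellular;unknown\n"
    "33;unknown;failure\n"
    ";cellular;failure\n"  # empty age field is parsed as NaN
)
sample = pd.read_csv(sample_csv, sep=';')

rows_before = len(sample)
cleaned = sample.dropna()
rows_after = len(cleaned)

# 'unknown' is an ordinary string, so it has to be counted explicitly
unknown_counts = cleaned[['contact', 'poutcome']].eq('unknown').sum()

print(rows_before, rows_after)
print(unknown_counts)
```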

Observe some descriptive characteristics of the data:

In [4]:
print('Shape: ',bank.shape)
print('\nFeatures:\n\n',bank.columns.values)
print('\nHead:\n\n',bank.head().to_string(line_width=78))

Shape:  (4521, 17)

Features:

 ['age' 'job' 'marital' 'education' 'default' 'balance' 'housing' 'loan'
 'contact' 'day' 'month' 'duration' 'campaign' 'pdays' 'previous'
 'poutcome' 'response']

Head:

    age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no   
1   33     services  married  secondary      no     4789     yes  yes   
2   35   management   single   tertiary      no     1350     yes   no   
3   30   management  married   tertiary      no     1476     yes  yes   
4   59  blue-collar  married  secondary      no        0     yes   no   

    contact  day month  duration  campaign  pdays  previous poutcome  \
0  cellular   19   oct        79         1     -1         0  unknown   
1  cellular   11   may       220         1    339         4  failure   
2  cellular   16   apr       185         1    330         1  failure   
3   unknown    3   jun       199         4     -1         0  unknown  

Mapping function to convert text no/yes to integer 0/1:

In [5]:
convert_to_binary = {'no' : 0, 'yes' : 1}

default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)
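
A point worth checking with this kind of mapping: values missing from the dictionary become NaN rather than raising an error. The sketch below uses a made-up series (the 'Yes' entry is a hypothetical typo, not something observed in the data) to show one way to count unmapped values.

```python
import pandas as pd

convert_to_binary = {'no': 0, 'yes': 1}

# Hypothetical column containing one unexpected value
raw = pd.Series(['no', 'yes', 'Yes', 'no'])
mapped = raw.map(convert_to_binary)

# Values absent from the dictionary become NaN instead of raising an error
n_unmapped = int(mapped.isna().sum())

print(mapped.tolist())
print('Unmapped values:', n_unmapped)
```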

Gather three explanatory variables and response into a numpy array.
Here we use .T to obtain the transpose for the structure we want.

In [6]:
model_data = np.array([np.array(default),
                       np.array(housing),
                       np.array(loan),
                       np.array(response)]).T
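
As an aside, the same (observations, variables) layout can probably be obtained without the explicit transpose. This sketch uses small stand-in arrays, not the real columns, to compare the two approaches.

```python
import numpy as np

# Small stand-ins for the mapped columns
default = np.array([0, 0, 1])
housing = np.array([0, 1, 1])
loan = np.array([0, 1, 0])
response = np.array([0, 0, 1])

# Approach used above: stack as rows, then transpose
stacked_then_transposed = np.array([default, housing, loan, response]).T

# Alternative: treat each 1-D array as a column directly
column_stacked = np.column_stack([default, housing, loan, response])

print(column_stacked.shape)  # (observations, variables)
print(np.array_equal(stacked_then_transposed, column_stacked))
```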

**Part 2: Model Building and Evaluation**

In [7]:
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

Specify the set of classifiers being evaluated:

In [8]:
names = ["Naive_Bayes", "Logistic_Regression"]
classifiers = [BernoulliNB(alpha=1.0, binarize=0.5, 
                           class_prior = [0.5, 0.5], fit_prior=False), 
               LogisticRegression(solver='lbfgs')]

Specify the k-fold cross-validation design.
Here we will employ ten-fold cross-validation.

In [9]:
N_FOLDS = 10

Set up a numpy array for storing results and assign the K-Fold cross-validation to an object:

In [10]:
cv_results = np.zeros((N_FOLDS, len(names)))

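# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling
kf = KFold(n_splits = N_FOLDS, shuffle=False)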
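
Without shuffling, K-Fold assigns consecutive blocks of rows to the test folds, and the first n_samples % n_splits folds receive one extra observation. The sketch below reproduces the expected fold sizes with NumPy alone; np.array_split is used here as a stand-in for the splitter, on the assumption that it partitions the indices the same way in this case.

```python
import numpy as np

n_samples, n_folds = 4521, 10
indices = np.arange(n_samples)

# Consecutive, nearly equal blocks; stand-in for an unshuffled k-fold split
test_blocks = np.array_split(indices, n_folds)
test_sizes = [len(block) for block in test_blocks]
train_sizes = [n_samples - size for size in test_sizes]

print(test_sizes)
print(train_sizes)
```

Because the blocks are consecutive, any ordering in the source file (for example by date) carries over into the folds.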

Iterate through the folds, fitting and evaluating both the Naive Bayes and Logistic Regression classifiers on each one.

Within each fold, every classifier is fit on the training set and evaluated on the test set.

**Note:** 0:model_data.shape[1]-1 slices the columns holding the explanatory variables, and model_data.shape[1]-1 is the index of the response column.
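
The slicing described in the note can also be written with negative indices. This is a small sketch on a toy array (not the real model_data) showing that the two spellings select the same columns.

```python
import numpy as np

# Toy array: three explanatory columns followed by the response column
toy = np.array([[1, 0, 1, 0],
                [0, 1, 0, 1]])

# Spelling used in this notebook
X_long = toy[:, 0:toy.shape[1]-1]
y_long = toy[:, toy.shape[1]-1]

# Equivalent spelling with negative indices
X_short = toy[:, :-1]  # every column except the last
y_short = toy[:, -1]   # the last column only

print(np.array_equal(X_long, X_short), np.array_equal(y_long, y_short))
```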

In [11]:
index_for_fold = 0  # fold initialization
for train_index, test_index in kf.split(model_data):
    print('\nFold index:', index_for_fold,
          '------------------------------------------')   
    X_train = model_data[train_index, 0:model_data.shape[1]-1]
    X_test = model_data[test_index, 0:model_data.shape[1]-1]
    y_train = model_data[train_index, model_data.shape[1]-1]
    y_test = model_data[test_index, model_data.shape[1]-1]   
    print('\nShape of input data for this fold:',
          '\nData Set: (Observations, Variables)')
    print('X_train:', X_train.shape)
    print('X_test:',X_test.shape)
    print('y_train:', y_train.shape)
    print('y_test:',y_test.shape)

    index_for_method = 0  # method initialization
    for name, clf in zip(names, classifiers):
        print('\nClassifier evaluation for:', name)
        print('\nScikit Learn method:', clf)
        clf.fit(X_train, y_train)
        y_test_predict = clf.predict_proba(X_test)
        fold_method_result = roc_auc_score(y_test, y_test_predict[:,1]) 
        print('\nArea under ROC curve:', fold_method_result)
        cv_results[index_for_fold, index_for_method] = fold_method_result
        index_for_method += 1
  
    index_for_fold += 1


Fold index: 0 ------------------------------------------

Shape of input data for this fold: 
Data Set: (Observations, Variables)
X_train: (4068, 3)
X_test: (453, 3)
y_train: (4068,)
y_test: (453,)

Classifier evaluation for: Naive_Bayes

Scikit Learn method: BernoulliNB(alpha=1.0, binarize=0.5, class_prior=[0.5, 0.5], fit_prior=False)

Area under ROC curve: 0.5878522062732588

Classifier evaluation for: Logistic_Regression

Scikit Learn method: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Area under ROC curve: 0.5878522062732588

Fold index: 1 ------------------------------------------

Shape of input data for this fold: 
Data Set: (Observations, Variables)
X_train: (4069, 3)
X_test: (452, 3)
y_train: (4069,)
y_test: (452,)

Classifier evaluation for: Naive_Bayes

[remaining fold output truncated]

Create a dataframe from the results of the cross-validation:

In [12]:
cv_results_df = pd.DataFrame(cv_results)
cv_results_df.columns = names

Observe the average results across the ten folds of cross-validation for each classifier:

In [13]:
print('\n----------------------------------------------')
print('Average results from ', N_FOLDS, '-fold cross-validation\n',
      '\nMethod                 Area under ROC Curve', sep = '')     
print(cv_results_df.mean()) 


----------------------------------------------
Average results from 10-fold cross-validation

Method                 Area under ROC Curve
Naive_Bayes            0.611060
Logistic_Regression    0.611733
dtype: float64


*Neither classifier performs particularly well. Each achieves an area under the ROC curve of only about 0.61, which is modestly better than the 0.50 expected from a random classifier. My belief is that having only three binary predictors, which allow at most eight distinct predicted scores, and a limited sample size is to blame for the poor performance. Future studies should aim to expand the sample size and explore more predictors, such as demographic and socioeconomic variables. Further, a more powerful classifier, such as a Random Forest, may be able to provide more predictive accuracy.*
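
To make the metric concrete: area under the ROC curve can be read as the share of (responder, non-responder) pairs in which the responder receives the higher score, with ties counted as half. The sketch below computes that directly on made-up labels and scores; the function name pairwise_auc is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Made-up labels and scores for illustration
y = [0, 0, 1, 1, 0, 1]
s = [0.05, 0.09, 0.09, 0.16, 0.16, 0.19]
auc = pairwise_auc(y, s)

# A model that gives everyone the same score cannot rank anyone
auc_constant = pairwise_auc(y, [0.1] * len(y))

print(auc, auc_constant)
```

With few distinct scores, many pairs are tied, which limits how far the value can rise above 0.5.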

**Part 3: Prediction**

Select method and apply to a small sample of test cases:

In [14]:
my_default = np.array([1, 1, 1, 1, 0, 0, 0, 0], np.int32)
my_housing = np.array([1, 1, 0, 0, 1, 1, 0, 0], np.int32)
my_loan = np.array([1, 0, 1, 0, 1, 0, 1, 0], np.int32)

my_X_test = np.vstack([my_default, my_housing, my_loan]).T
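
The eight test cases above are every combination of the three binary predictors. As a sketch, the same grid could be generated rather than typed by hand; the name grid is an assumption for illustration.

```python
from itertools import product
import numpy as np

# All 2**3 combinations of (default, housing, loan), with 1 listed before 0
grid = np.array(list(product([1, 0], repeat=3)), dtype=np.int32)

print(grid)
```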

Define your training sample. In this case our "training" sample is the full data set.

In [15]:
X_train = model_data[:, 0:model_data.shape[1]-1]
y_train = model_data[:, model_data.shape[1]-1]

Fit the logistic regression to the full data set. In a real-world application, you would hold out a test sample before fitting in order to estimate how well the model generalizes.

In [16]:
clf = LogisticRegression(solver='lbfgs')
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Use your model to predict the likelihood of opening a term deposit account for each of the test cases:

In [17]:
y_my_test_predict = clf.predict_proba(my_X_test)
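
For a two-class logistic regression, the predicted probabilities come from passing a linear score through the logistic function. The sketch below illustrates that calculation; the coefficients and intercept are hypothetical values chosen for illustration, not the ones fitted in this notebook, and logistic_proba is a made-up helper name.

```python
import numpy as np

def logistic_proba(X, coef, intercept):
    """Return columns [P(no), P(yes)] from a linear score via the logistic function."""
    z = X @ coef + intercept
    p_yes = 1.0 / (1.0 + np.exp(-z))
    return np.column_stack([1.0 - p_yes, p_yes])

# Hypothetical coefficients, not the values fitted in this notebook
coef = np.array([0.2, -0.6, -0.7])
intercept = -1.6

X = np.array([[0, 0, 0],   # no default, no housing loan, no personal loan
              [0, 1, 1]])  # no default, housing loan, personal loan
proba = logistic_proba(X, coef, intercept)

print(proba)
```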

Create a dataframe from the results and observe each test case's probability of opening an account vs. not opening an account.

In [18]:
my_targeting_df = pd.DataFrame(np.hstack([my_X_test, y_my_test_predict]))
my_targeting_df.columns = ['default', 'housing', 'loan', 
                           'predict_NO', 'predict_YES']
print('\n\nLogistic regression model predictions for test cases:')
print(my_targeting_df) 



Logistic regression model predictions for test cases:
   default  housing  loan  predict_NO  predict_YES
0      1.0      1.0   1.0    0.945432     0.054568
1      1.0      1.0   0.0    0.892155     0.107845
2      1.0      0.0   1.0    0.900835     0.099165
3      1.0      0.0   0.0    0.812643     0.187357
4      0.0      1.0   1.0    0.953128     0.046872
5      0.0      1.0   0.0    0.906623     0.093377
6      0.0      0.0   1.0    0.914250     0.085750
7      0.0      0.0   0.0    0.835815     0.164185
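
The indicator columns print as 1.0 and 0.0 because np.hstack converts the integer inputs to floats when combining them with the probabilities. If integer indicators are preferred, one option is to build the dataframe column by column, as in this sketch (two rows only, with probabilities copied from the first and last rows of the table above).

```python
import numpy as np
import pandas as pd

X = np.array([[1, 1, 1],
              [0, 0, 0]], dtype=np.int32)
# Probabilities copied from the first and last rows of the table above
proba = np.array([[0.945432, 0.054568],
                  [0.835815, 0.164185]])

# Start from the integer indicators, then attach the float probabilities
df = pd.DataFrame(X, columns=['default', 'housing', 'loan'])
df['predict_NO'] = proba[:, 0]
df['predict_YES'] = proba[:, 1]

print(df.dtypes)
print(df)
```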


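Score your model on how well it classifies the training data:

Note that score reports mean accuracy, not area under the ROC curve, so the value below is not directly comparable to the cross-validation results above. Every predicted probability of a "yes" response in the table above is below 0.5, so the model classifies all cases as "no", and the accuracy largely reflects the share of non-responders in the data rather than any ability to identify responders. Because the model is also scored on the same data it was fit to, the figure is an optimistic estimate of performance on new data.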

In [19]:
clf.score(X_train, y_train)

0.8847600088476001
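
As a rough check on the accuracy figure, the sketch below shows that a classifier which always predicts the majority class scores exactly the majority-class share. The toy labels are made up, and the 4,000-of-4,521 split is an assumption inferred from the reported score rather than a count shown in this notebook.

```python
import numpy as np

# Toy labels with heavy imbalance: mostly 0 ("no")
y_true = np.array([0] * 9 + [1] * 1)
y_pred = np.zeros_like(y_true)  # always predict "no"

accuracy = np.mean(y_pred == y_true)
majority_share = np.mean(y_true == 0)
print(accuracy, majority_share)

# Assumption inferred from the reported score, not a count shown in the notebook:
# 4,000 of the 4,521 observations are "no"
inferred_share = 4000 / 4521
print(inferred_share)
```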

Average the predicted probabilities for each group in the "test" data. Because each combination of predictors appears exactly once, the group means equal the individual predictions, sorted by group:

In [20]:
my_targeting_df.groupby(['default','housing','loan']).mean()

default  housing  loan  predict_NO  predict_YES
    0.0      0.0   0.0    0.835815     0.164185
    0.0      0.0   1.0    0.914250     0.085750
    0.0      1.0   0.0    0.906623     0.093377
    0.0      1.0   1.0    0.953128     0.046872
    1.0      0.0   0.0    0.812643     0.187357
    1.0      0.0   1.0    0.900835     0.099165
    1.0      1.0   0.0    0.892155     0.107845
    1.0      1.0   1.0    0.945432     0.054568


*According to these results, the customers most likely to respond to the promotional mailers are those who currently have neither a home loan nor a personal loan with the bank. Further, whether or not the customer has previously defaulted on a loan changes these probabilities only slightly.*
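
One way to read off the ordering is to sort the groups by predicted probability of a "yes" response. The sketch below re-enters the logistic regression values from the table above by hand; the names targeting and ranked are assumptions for illustration.

```python
import pandas as pd

# Values copied from the logistic regression table above
targeting = pd.DataFrame({
    'default':     [1, 1, 1, 1, 0, 0, 0, 0],
    'housing':     [1, 1, 0, 0, 1, 1, 0, 0],
    'loan':        [1, 0, 1, 0, 1, 0, 1, 0],
    'predict_YES': [0.054568, 0.107845, 0.099165, 0.187357,
                    0.046872, 0.093377, 0.085750, 0.164185],
})

# Highest predicted response probability first
ranked = targeting.sort_values('predict_YES', ascending=False).reset_index(drop=True)

print(ranked)
```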

*These results are suspicious, to say the least. For the reasons outlined above, I would not trust them when attempting to roll out a targeted marketing campaign.*

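Do the same as above for the Naive Bayes classifier. Note that this cell uses BernoulliNB with its default settings (class priors fitted from the data), not the equal-prior configuration evaluated in cross-validation: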

In [21]:
clf = BernoulliNB()
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [22]:
y_my_test_predict = clf.predict_proba(my_X_test)

In [23]:
my_targeting_df = pd.DataFrame(np.hstack([my_X_test, y_my_test_predict]))
my_targeting_df.columns = ['default', 'housing', 'loan', 
                           'predict_NO', 'predict_YES']
print('\n\nNaive Bayes model predictions for test cases:')
print(my_targeting_df) 



Naive Bayes model predictions for test cases:
   default  housing  loan  predict_NO  predict_YES
0      1.0      1.0   1.0    0.947911     0.052089
1      1.0      1.0   0.0    0.896225     0.103775
2      1.0      0.0   1.0    0.904384     0.095616
3      1.0      0.0   0.0    0.817810     0.182190
4      0.0      1.0   1.0    0.953537     0.046463
5      0.0      1.0   0.0    0.906885     0.093115
6      0.0      0.0   1.0    0.914286     0.085714
7      0.0      0.0   0.0    0.835042     0.164958


In [24]:
clf.score(X_train, y_train)

0.8847600088476001

In [25]:
my_targeting_df.groupby(['default','housing','loan']).mean()

default  housing  loan  predict_NO  predict_YES
    0.0      0.0   0.0    0.835042     0.164958
    0.0      0.0   1.0    0.914286     0.085714
    0.0      1.0   0.0    0.906885     0.093115
    0.0      1.0   1.0    0.953537     0.046463
    1.0      0.0   0.0    0.817810     0.182190
    1.0      0.0   1.0    0.904384     0.095616
    1.0      1.0   0.0    0.896225     0.103775
    1.0      1.0   1.0    0.947911     0.052089
