<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Logistic Regression Lab
## Exercise with bank marketing data

_Authors: Sam Stack (DC)_


## Introduction
- Data from the UCI Machine Learning Repository: data, [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
- **Goal**: Predict whether a customer will purchase a bank product marketed over the phone
- `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website

## Step 1: Read the data into Pandas

In [None]:
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score, log_loss

bank = pd.read_csv('../../data/bank.csv')
bank.head()

**Target `y` is encoded as follows:**
- No: 0
- Yes: 1

In [None]:
# check the results of y
bank['y'].value_counts()


## Step 2: Prepare at least three features
- Include both numeric and categorical features
- Choose features that you think might be related to the response (based on intuition or exploration)
- Think about how to handle missing values (encoded as "unknown")

In [None]:
# I'm going to take six features and build two separate models:
# age, job, marital, education, contact, and day_of_week.
# A correlation matrix or heat map would probably help in finding useful features,
# but that is difficult with the number of categorical features in this data.
# Even once they are converted to dummy variables, comparing all of the features
# would still be a computationally expensive process.

# There was no formal EDA behind my selection; I just wanted to use random features.

In [None]:
features = ['age','job','marital','education','contact','day_of_week','y']

for feat in features:
    if feat != 'age':
        print(bank[feat].value_counts())

**Qualitative data analysis**  
There are some unknown values in `education`, `marital`, and `job`. We could assume that the 39 unknowns in `job` are most likely `admin` professions, or that the 11 unknowns in `marital` are most likely `married` (unfortunate that they are uncertain about it).

Personally, I'm going to drop the unknowns, as I do not want to incorporate any additional bias into the data itself.  
- Going forward, a sounder method of replacing unknowns is to build models that predict them, for example with K-nearest neighbors; that way you fill in each unknown using the most similar observations you have.
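The K-nearest-neighbors idea above can be sketched on a toy frame. This is only an illustration under assumptions: the values are made up, `age` is the only predictor, and the names `toy`, `known`, `missing`, and `toy_filled` are hypothetical. It is not the approach used in the rest of this lab, which drops the unknowns.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy frame (made-up values) with one categorical column containing 'unknown'.
toy = pd.DataFrame({
    'age':     [25, 27, 30, 45, 48, 52, 26, 50],
    'marital': ['single', 'single', 'single', 'married',
                'married', 'married', 'unknown', 'unknown'],
})

known = toy[toy['marital'] != 'unknown']
missing = toy[toy['marital'] == 'unknown']

# Fit on the rows where the label is known, using the other feature(s).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(known[['age']], known['marital'])

# Fill each unknown with the majority label among its nearest neighbors.
toy_filled = toy.copy()
toy_filled.loc[missing.index, 'marital'] = knn.predict(missing[['age']])
print(toy_filled)
```

On real data you would likely use several predictors and scale the numeric ones first, since KNN relies on distances.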

In [None]:
# creating the sub dataframe with only the features we're using
bank_a =  bank[features]

# getting rid of unknowns - there are more sophisticated ways to drop these, but this works.

is_ed_unk = bank_a['education'] != 'unknown'
bank_a = bank_a[is_ed_unk]

is_job_unk = bank_a['job'] != 'unknown'
bank_a = bank_a[is_job_unk]

is_married_unk = bank_a['marital'] != 'unknown'
bank_a = bank_a[is_married_unk]
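As an aside, the three filters above can also be written as a single mask. The sketch below uses a made-up toy frame (the names `toy`, `cols`, and `toy_clean` are hypothetical) rather than the bank data, and it assumes missing values are always encoded as the exact string `'unknown'`.

```python
import pandas as pd

# Toy frame (made-up values): only the first row has no 'unknown' entries.
toy = pd.DataFrame({
    'education': ['basic.4y', 'unknown', 'high.school', 'university.degree'],
    'job':       ['admin.', 'services', 'unknown', 'technician'],
    'marital':   ['married', 'single', 'single', 'unknown'],
})

cols = ['education', 'job', 'marital']

# True for every row that has 'unknown' in at least one of the columns.
has_unknown = toy[cols].eq('unknown').any(axis=1)

# Keep only the rows without any unknowns.
toy_clean = toy[~has_unknown]
print(toy_clean)
```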

The data is ready to be dummied, but I'll wait until we're about to model.


## Step 3: Model building
- Use cross-validation to evaluate the logistic regression model with your chosen features.  
    You can use any (combination) of the following metrics to evaluate.
    - [Classification/Accuracy Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
    - [Confusion Matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
    - [ROC curves and area under a curve (AUC)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
    - [Log loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)
- Try to increase the AUC by selecting different sets of features
    - *Bonus*: Experiment with hyperparameters such as regularization.
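For reference, k-fold cross-validation with scikit-learn might look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the dummied bank features (an assumption made so the example is self-contained), and the names `X_demo`, `y_demo`, and `auc_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the bank features (assumption:
# roughly 90% negatives, similar to the class balance in this lab).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Stratified folds keep the yes/no ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring='roc_auc' evaluates each fold on continuous model scores,
# not on hard 0/1 predictions.
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring='roc_auc'
)
print(auc_scores.mean(), auc_scores.std())
```

Averaging over folds gives a less split-dependent estimate than a single train/test split.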

**Build a Model**  
*Model 1, using `age`, `job`, `education`, and `day_of_week`*

In [None]:
# md = ModelData.  get_dummies leaves numeric columns such as age and y unchanged
bank_md1 = pd.get_dummies(bank_a[['age','job','education','day_of_week','y']], drop_first = True)

# default hyperparameters for the first model
LogReg1 = LogisticRegression()

# X and y features
X1 = bank_md1.drop('y', axis =1)
y1 = bank_md1['y']



# hold out a test set with train_test_split (a single split, not k-fold cross-validation)
x_train1, x_test1, y_train1, y_test1 = train_test_split(X1,y1, random_state =42)

# fit model
LogReg1.fit(x_train1, y_train1)

**Get the Coefficient for each feature.**
- Be sure to make note of interesting findings.

*It seems that `job_entrepreneur` carries the largest coefficient.*

In [None]:
name = bank_md1.columns.drop('y')

coef = LogReg1.coef_[0]

pd.DataFrame([name,coef],index = ['Name','Coef']).transpose()
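One way to make coefficients easier to read is to convert them to odds ratios. The sketch below uses hypothetical coefficient values chosen purely for illustration (they are not results from this lab); the names `demo_coefs` and `summary` are also made up.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only (not results from this lab).
demo_coefs = pd.Series({'age': 0.01, 'job_student': 0.70, 'contact_telephone': -0.50})

# exp(coef) is the multiplicative change in the odds of y = 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(demo_coefs)

summary = pd.DataFrame({'coef': demo_coefs, 'odds_ratio': odds_ratios})

# Order rows by absolute coefficient size so the strongest effects come first.
summary = summary.reindex(demo_coefs.abs().sort_values(ascending=False).index)
print(summary)
```

Keep in mind that raw coefficients are only comparable across features on similar scales; `age` in years and a 0/1 dummy are not on the same scale.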

**Use the Model to predict on x_test and evaluate the model using metric(s) of Choice.**

In [None]:
# predict with model
y_pred = LogReg1.predict(x_test1)

In [None]:
print(f'Accuracy: {accuracy_score(y_test1,y_pred)}')

**Accuracy Score**

- That's a pretty good score, wouldn't you say? Almost 90%! Remember the distribution of classes, though. In our entire dataset there are 3668 "No" and 451 "Yes" out of a total of 4119 observations. If we guessed that nobody was going to convert, and therefore predicted 'No' every time, we would be correct 89% of the time (according to our data). This accuracy is therefore barely better than the baseline, and such an insignificant difference could simply come from how our train/test split grouped the data.
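The baseline figure can be checked directly from the class counts quoted above. The variable names here are just for illustration.

```python
# Class counts quoted in the text above.
n_no, n_yes = 3668, 451
n_total = n_no + n_yes

# Accuracy of always predicting the majority class ('No').
baseline_accuracy = n_no / n_total

print(f'Total observations: {n_total}')
print(f'Baseline accuracy: {baseline_accuracy:.4f}')
```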

**Confusion Matrix**

It looks like we have 880 true negatives and 99 false negatives. In other words, all our model is doing is predicting 'no' every time.


In [None]:
print(f'Confusion Matrix:\n {confusion_matrix(y_test1,y_pred)}')
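To read the four cells by name, the matrix can be unpacked with `ravel()`. The sketch below uses made-up toy labels (not the lab's test set) in which every prediction is 0, mirroring the "always no" behavior described above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up): 6 negatives and 2 positives, with every prediction 0.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0]

# labels=[0, 1] fixes the 2x2 shape even when one class is never predicted.
# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

recall = tp / (tp + fn)  # share of actual positives the model found
print(f'TN={tn} FP={fp} FN={fn} TP={tp} recall={recall}')
```

A recall of zero is the signature of a model that never predicts the positive class, however good its accuracy looks.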

**ROC AUC**

The area under the ROC curve is 0.5, which is completely worthless: our model gains no more insight than random guessing. Note that this score is computed from the hard class predictions, so a model that predicts 'no' for every observation scores exactly 0.5. Going back to the accuracy score, we can now conclude that its minuscule improvement over the baseline is caused by our train/test split.

In [None]:
print(f'ROC-AUC Score: {roc_auc_score(y_test1,y_pred)}')
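One caveat about the score above: ROC AUC measures how well a model ranks observations, so it is usually computed from predicted probabilities rather than from hard 0/1 labels. The sketch below uses made-up toy values (not this lab's data) to show how the two inputs can give very different answers.

```python
from sklearn.metrics import roc_auc_score

# Toy example (made-up values): one positive among four observations.
y_true_demo = [0, 0, 0, 1]
hard_labels = [0, 0, 0, 0]           # every prediction is 'no'
scores = [0.10, 0.20, 0.15, 0.30]    # predicted probability of class 1

# Constant predictions cannot rank anything, so the AUC is 0.5.
auc_from_labels = roc_auc_score(y_true_demo, hard_labels)

# The positive observation has the highest score, so the ranking is perfect.
auc_from_scores = roc_auc_score(y_true_demo, scores)

print(auc_from_labels, auc_from_scores)
```

In this lab, the analogous input would be the second column of `LogReg1.predict_proba(x_test1)`.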

**Log Loss**

In [None]:
print(f'Log Loss: {log_loss(y_test1,y_pred)}')
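Log loss is likewise meant to be fed probabilities. The sketch below is a hand-rolled binary log loss written for illustration (the helper `binary_log_loss` and the toy values are assumptions, not part of the lab); it shows how hard 0/1 predictions are punished as if they were fully confident probabilities.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels and P(class 1)."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true_demo = [0, 0, 0, 1]

# Hard labels behave like probabilities of exactly 0 or 1.
loss_hard = binary_log_loss(y_true_demo, [0, 0, 0, 0])

# Softer probabilities for the same observations.
loss_soft = binary_log_loss(y_true_demo, [0.1, 0.1, 0.1, 0.4])

print(loss_hard, loss_soft)
```

A single confidently wrong prediction dominates the hard-label loss, which is why passing `predict_proba` output to `log_loss` is usually more informative.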

### Model 2: Using `age`, `job`, `marital`, `education`, `contact`, and `day_of_week` to predict whether the customer bought or not

In [None]:
# md = ModelData.  get_dummies leaves numeric columns such as age and y unchanged
bank_md2 = pd.get_dummies(bank_a, drop_first = True)

# default hyperparameters for the second model
LogReg2 = LogisticRegression()

# X and y features
X2 = bank_md2.drop('y', axis =1)
y2 = bank_md2['y']

# hold out a test set with train_test_split (a single split, not k-fold cross-validation)
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2,y2, random_state =42)

# fit model
LogReg2.fit(x_train2, y_train2)

In [None]:
y_pred2 = LogReg2.predict(x_test2)

In [None]:
# Evaluate the metrics
print(f'Accuracy: {accuracy_score(y_test2,y_pred2)}')
print()
print(f'Confusion Matrix:\n {confusion_matrix(y_test2,y_pred2)}')
print()
print(f'ROC-AUC Score: {roc_auc_score(y_test2,y_pred2)}')
print()
print(f'Log Loss: {log_loss(y_test2,y_pred2)}')



None of the metrics really changed. It looks like the features we have aren't very helpful...


### Is your model not performing very well?

Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.

Use the `LogisticRegression.predict_proba()` method to get the probabilities.

Recall from the lesson that the first probability is for class 0 and the second is for class 1.
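The column order can be confirmed from the fitted model's `classes_` attribute. The sketch below fits a tiny made-up dataset (the names `X_demo`, `y_demo`, and `p_yes` are hypothetical) just to show the shape and ordering of the output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, binary target.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

print(model.classes_)   # column order of predict_proba
print(proba.shape)      # (n_samples, n_classes)

# Each row sums to 1; column 1 is P(y = 1).
p_yes = proba[:, 1]
print(p_yes)
```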

In [None]:
y_pred_prob = LogReg2.predict_proba(x_test2)

y_pred_prob

**Visualize the distribution**

In [None]:
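import matplotlib.pyplot as plt
%matplotlib inline

# transpose so that row 0 holds P(class 0) and row 1 holds P(class 1)
y_pred_prob_t = y_pred_prob.transpose()

plt.hist(y_pred_prob_t[0])
plt.title('Predicted probability of class 0')
plt.show()

plt.hist(y_pred_prob_t[1])
plt.title('Predicted probability of class 1')
plt.show()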

**Calculate a new threshold and use it to convert predicted probabilities to output classes**

Let's try lowering the threshold so that any predicted probability above 20% counts as class 1.

In [None]:
y_pred3=[]
for prob in y_pred_prob_t[1]:
    if prob > .20:
        y_pred3.append(1)
    else:
        y_pred3.append(0)
        
print(len(y_pred3))
print(len(y_test2))

In [None]:
y_pred3.count(1)  # the model now actually makes some positive predictions
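The same thresholding can be done without a loop. The sketch below uses made-up probabilities in the two-column layout that `predict_proba` returns; the names `proba_demo` and `labels_demo` are hypothetical.

```python
import numpy as np

# Made-up probabilities in the (n_samples, 2) layout that predict_proba returns.
proba_demo = np.array([[0.95, 0.05],
                       [0.75, 0.25],
                       [0.80, 0.20],
                       [0.40, 0.60]])

threshold = 0.20

# Boolean comparison on column 1 (P(class 1)), cast to 0/1 labels.
labels_demo = (proba_demo[:, 1] > threshold).astype(int)

print(labels_demo)
print(labels_demo.sum())  # number of predicted positives
```

Note that the comparison is strict, so a probability of exactly 0.20 is still assigned to class 0, matching the loop above.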

**Evaluate the model metrics now**

In [None]:
print(f'Accuracy: {accuracy_score(y_test2,y_pred3)}')
print()
print(f'Confusion Matrix:\n {confusion_matrix(y_test2,y_pred3)}')
print()
print(f'ROC-AUC Score: {roc_auc_score(y_test2,y_pred3)}')
print()
print(f'Log Loss: {log_loss(y_test2,y_pred3)}')


## Step 4: Build a model using all of the features.

In [None]:
bank_all = pd.get_dummies(bank, drop_first = True)


In [None]:
# L2-regularized model with a small C (small C means strong regularization)
LogReg3 = LogisticRegression(penalty='l2',C=0.01)

# X and y features
X3 = bank_all.drop('y', axis =1)
y3 = bank_all['y']

# hold out a test set with train_test_split (a single split, not k-fold cross-validation)
x_train3, x_test3, y_train3, y_test3 = train_test_split(X3,y3, random_state =42)

# fit model
LogReg3.fit(x_train3, y_train3)

In [None]:
y_pred3 = LogReg3.predict(x_test3)

In [None]:
# Evaluate the metrics
print(f'Accuracy: {accuracy_score(y_test3,y_pred3)}')
print()
print(f'Confusion Matrix:\n {confusion_matrix(y_test3,y_pred3)}')
print()
print(f'ROC-AUC Score: {roc_auc_score(y_test3,y_pred3)}')
print()
print(f'Log Loss: {log_loss(y_test3,y_pred3)}')



## Bonus: Use Regularization to optimize your model.

In [None]:
# X and y features
X = bank_all.drop('y', axis =1)
y = bank_all['y']

# hold out a test set with train_test_split (a single split, not k-fold cross-validation)
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state =42)

cees = [0.01, 0.1, 1.0, 10, 100]

print('ROC : C')
for c in cees:
    logreg = LogisticRegression(penalty='l2', C=c, max_iter=2500) # set max_iter to avoid warning
    logreg.fit(x_train,y_train)
    y_pred = logreg.predict(x_test)
    roc = roc_auc_score(y_test, y_pred)  # roc_auc_score was imported directly from sklearn.metrics
    print(roc," : ", c)

In [None]:
# look in a narrower range of C values
cees = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 ,1.8, 1.9]

for c in cees:
    logreg = LogisticRegression(penalty='l2', C=c, max_iter=3000)
    logreg.fit(x_train,y_train)
    y_pred = logreg.predict(x_test)
    roc = roc_auc_score(y_test, y_pred)
    print(roc," : ", c)
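The loops above pick `C` by looking at the test-set score, which can make the final number optimistic. One alternative is to let `GridSearchCV` choose `C` by cross-validation on the training data only. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained); the names `X_demo`, `y_demo`, `param_grid`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the dummied bank data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Same coarse grid of C values as in the first loop above.
param_grid = {'C': [0.01, 0.1, 1.0, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=2500),
    param_grid, scoring='roc_auc', cv=5,
)
search.fit(x_tr, y_tr)  # cross-validates on the training data only

print(search.best_params_, search.best_score_)

# The held-out test set is scored once, after C has been chosen.
print(search.score(x_te, y_te))
```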