# Final Project Submission
* Student name: Miguel Santana
* Student pace: Full Time
* Scheduled project review date/time: 10/14/2020, 12-12:45pm
* Instructor name: James Irving
* Blog post URL: TBA

# Project Methodology & Goal
**A Portuguese financial institution provided data from several direct telemarketing campaigns, with the goal of predicting whether a client will subscribe to a term deposit. The following is an in-depth analysis of the client, campaign, social, economic and additional features that help predict a subscription. The analysis culminates in actionable business recommendations for improving the subscription rate.**  

Dataset Citation:

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Available via:
* UCI's machine learning repository http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

* Kaggle https://www.kaggle.com/henriqueyamahata/bank-marketing 

# Data Analysis and Modeling
OSEMN Framework
* Obtain
* Scrub
* Explore
* Model
* INterpret

# OBTAIN

## Python Project Libraries
Importing Packages & Processing our Dataset

In [None]:
# Math, Visualizations, Cleaning and Analysis
import pandas as pd # data cleaning and manipulation
import numpy as np # numerical operations  
import seaborn as sns # visualizations / plt.style.use('seaborn-poster') 
#sns.set(style='whitegrid')
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline 
# pd.set_option('display.max_columns',0)
# pd.set_option('display.float_format', lambda x: '%.3f' % x)
import warnings
warnings.filterwarnings("ignore")

# Machine Learning / Reporting
import sklearn
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

In [None]:
df = pd.read_csv('bank-additional-full.csv', delimiter=';')

## Variable Key
The dataset includes the following client, campaign, social, economic and other attributes:

**Client Data**
* Age
* Job Type
* Marital Status
* Education 
* Default **(client credit in default)**
* Housing **(client housing loan)**
* Loan **(client personal loan)**

**Current Campaign | Last Contact** 
* Contact Type
* Month 
* Day of Week 
* Duration (in seconds)

**Other Attributes:**
* Campaign (number of contacts/this campaign)
* Pdays (days since last contacted/previous campaign)
* Previous (contacts performed before this campaign/this client)
* Poutcome (previous campaign outcome)

**Social & Economic Context Attributes**
* Emp.var.rate (quarterly employment variation rate)
* Cons.price.idx (monthly indicator - consumer price index)
* Cons.conf.idx (monthly indicator - consumer confidence index)
* Euribor3m (daily indicator - euribor 3 month rate)
* Nr.employed (quarterly indicator - number of employees)

**Output/Target**
* y (has the client subscribed a term deposit?)

In [None]:
df = df.rename(columns={'y':'term_deposit'})

In [None]:
df.info()

# SCRUB

In [None]:
for col in df.columns: # preliminary view of value counts per column
    # head(5) returns every value when a column has fewer than five unique values
    print(col, df[col].value_counts().head(5))
    print('\n') # Break up the output between columns

## Null & Unknown Values

In [None]:
print('Missing values : ', df.isnull().sum().values.sum()) # null values
print('\n')
print('Unique values: \n', df.nunique()) # unique values per column
# df.isnull().values.any() 

<div class="alert alert-success">

There are no missing values in the dataset. There is a substantial number of "unknown" values, which will need to be addressed.  

</div> 

## Addressing Unknown Variables

In [None]:
print('Unknown Values Per Column')
for col in df.columns: # count of 'unknown' entries per column
    counts = df[col].value_counts()
    if 'unknown' in counts.index: # columns without the label are skipped
        print(col, counts['unknown'])

<div class="alert alert-success">

In order to narrow down my dataset I will drop the "unknown" values from each column except for 'default'. After processing both updates ('unknown' & outlier removal) I should maintain over 90% of my original dataset.

</div> 
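
As a rough sketch of the same idea, the filter can also be applied to several columns in one pass. The toy frame and the `cols_to_check` name below are made up for illustration and are not part of the bank dataset.

```python
import pandas as pd

# Toy frame standing in for the bank data (values are invented)
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'technician', 'services'],
    'marital': ['married', 'single', 'unknown', 'single'],
    'default': ['no', 'unknown', 'no', 'unknown'],
})

cols_to_check = ['job', 'marital']  # 'default' is deliberately left out

# Keep rows where none of the checked columns equals 'unknown'
mask = (toy[cols_to_check] != 'unknown').all(axis=1)
toy_clean = toy.loc[mask].copy()

retained = len(toy_clean) / len(toy) * 100
print(f'{retained:.1f} percent of rows retained')
```

Taking a `.copy()` of the filtered frame keeps later column assignments from touching a view of the original data.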

In [None]:
orig_len = len(df)
orig_len

In [None]:
# Dropping unknown variables from various columns
def drop_unknown(df, col_name):
    start_len = len(df)
    new_df = df.loc[(df[col_name] != 'unknown')]
    print(f'There were {start_len - len(new_df)} unknown values removed from {col_name}')
    return new_df

In [None]:
remove_job = drop_unknown(df, 'job')

In [None]:
remove_marital = drop_unknown(remove_job, 'marital')

In [None]:
remove_education = drop_unknown(remove_marital, 'education')

In [None]:
remove_housing = drop_unknown(remove_education, 'housing')

In [None]:
remove_loan = drop_unknown(remove_housing, 'loan')

In [None]:
df = remove_loan
print(f'There were {orig_len - len(remove_loan)} rows with unknown values removed from the dataset')

## Outlier Removal

The outlier removal will be centered on age

In [None]:
# IQR Outlier Removal Function
def iqr_outlier_rem(df, col_name):
    start_len = len(df)
    Q1 = df[col_name].quantile(0.25)
    Q3 = df[col_name].quantile(0.75)
    IQR = Q3-Q1 # Finding interquartile range
    lower_threshold  = Q1-1.5*IQR
    upper_threshold = Q3+1.5*IQR
    new_df = df.loc[(df[col_name] > lower_threshold) & (df[col_name] < upper_threshold)]
    print(f'There were {start_len - len(new_df)} outliers removed from {col_name}')
    return new_df

In [None]:
df = iqr_outlier_rem(df, 'age')

In [None]:
print(df.shape)
df['age'].describe()

In [None]:
print(f'The new dataframe represents {(len(df)/orig_len) * 100:.2f} percent of the original dataset.')

<div class="alert alert-success">

The dataset now represents clients between the ages of 17 and 69 with a specific job status, marital status, education level and housing or personal loan status per bank records. 

</div> 

In [None]:
# ['default', 'housing', 'loan', 'term_deposit'] binary columns

# Exploratory Data Analysis

## Exploring Last Contact

In [None]:
ax = sns.countplot(x='term_deposit',data=df)
ax.set(xlabel='Subscribed Term Deposit', ylabel='Count', title='Term Deposit')
for a in ax.patches:
        ax.annotate('{:.2f}%'.format((a.get_height()/df.shape[0])*100), (a.get_x()+0.3, a.get_height()))

axb = df.groupby(['month', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
axb.set(xlabel='Month', ylabel='Term Deposit Count', title='Term Deposit X Contact Month')
for ab in axb.patches:
    axb.text(ab.get_x()+0.05, ab.get_height()+20,str(ab.get_height()))

ax2 = df.groupby(['day_of_week', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax2.set(xlabel='Day', ylabel='Term Deposit Count', title='Term Deposit X Contact Day of Week')
for b in ax2.patches:
    ax2.text(b.get_x()+0.05, b.get_height()+20,str(b.get_height()))

ax3 = df.groupby(['contact', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax3.set(xlabel='Contact Type', ylabel='Term Deposit Count', title='Term Deposit X Contact Type')
for c in ax3.patches:
    ax3.text(c.get_x()+0.05, c.get_height()+20,str(c.get_height()))

    
# plt.subplots_adjust(wspace=0.5)
# plt.show()

<div class="alert alert-success">

Observations
* There is a wide distribution of data when considering term deposits per month. The months with the most even distribution of yes/no are March, September, October and December.
* There is no distinguishable difference in term deposits per day of the week. 
* The data illustrates higher odds of a term deposit subscription when clients are reached by cell phone rather than by landline telephone. This may reflect changes in technology, as fewer consumers maintain an active landline at home. 

</div>
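
Because the bar charts show raw counts, the yes/no balance per category can be easier to compare as a rate. The sketch below uses a small invented frame with the same column names (`month`, `term_deposit`); it is only an illustration, not output from the bank data.

```python
import pandas as pd

# Invented rows that reuse the notebook's column names
toy = pd.DataFrame({
    'month': ['mar', 'mar', 'may', 'may', 'may', 'may'],
    'term_deposit': ['yes', 'no', 'no', 'no', 'no', 'yes'],
})

# Row-normalised crosstab: each month's yes/no shares sum to 1
rates = pd.crosstab(toy['month'], toy['term_deposit'], normalize='index')
print(rates['yes'].sort_values(ascending=False))
```

The same call with `day_of_week` or `contact` in place of `month` would give the corresponding rates for the other two charts.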

## Exploring Age, Job, Education & Marital Status

In [None]:
df_fresh = df.copy()
df_fresh['age_bin'] = df_fresh['age'].apply(lambda x: '[17, 25)' if x < 25 
                                else '[25, 35)' if x < 35 
                                else '[35, 45)' if x < 45
                                else '[45, 55)' if x < 55 # ages 45-54 get their own bin
                                else '[55, 65)' if x < 65
                                else '65+')

In [None]:
# plt.figure(figsize=(20,20))

ax = df_fresh.groupby(['age_bin', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax.set(xlabel='Age Group', ylabel='Term Deposit Count', title='Term Deposit X Age')
for a in ax.patches:
    ax.text(a.get_x()+0.05, a.get_height()+20,str(a.get_height()))
    
ax2 = df.groupby(['job', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax2.set(xlabel='Job Type', ylabel='Term Deposit Count', title='Term Deposit X Job Type')
for b in ax2.patches:
    ax2.text(b.get_x()+0.05, b.get_height()+20,str(b.get_height()))

ax3 = df.groupby(['education', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax3.set(xlabel='Education Level', ylabel='Term Deposit Count', title='Term Deposit X Education Completed')
for c in ax3.patches:
    ax3.text(c.get_x()+0.05, c.get_height()+20,str(c.get_height()))

ax4 = df.groupby(['marital', 'term_deposit']).size().unstack().plot(kind='bar', stacked=False,figsize=(6,5))
ax4.set(xlabel='Marital Status', ylabel='Term Deposit Count', title='Term Deposit X Marital Status')
for d in ax4.patches:
    ax4.text(d.get_x()+0.05, d.get_height()+20,str(d.get_height()))

# plt.subplots_adjust(wspace=0.5)
# plt.show()

<div class="alert alert-success">

Observations
* The data is unbalanced with just under 11 percent of clients subscribing for a term deposit.
* Some client jobs hold a closer no/yes ratio regarding term deposits. The four jobs with the largest variation are: admin, blue-collar, services and technician.  
* The gap between no and yes variables appears to grow the more education a client has completed. Professional course and basic 6 year are the exception. 
* The majority of the dataset is represented by married couples with the greatest no/yes ratio existing in the "divorced" category.

</div>
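
An alternative to the chained conditional used for `age_bin` is `pd.cut`, which takes explicit bin edges so that every range gets its own label. This is a minimal sketch on made-up ages; the final edge of 120 is an assumed upper bound, not a value from the dataset.

```python
import pandas as pd

ages = pd.Series([17, 24, 25, 44, 45, 54, 55, 64, 65, 69])

# Left-closed bins [17, 25), [25, 35), ...; the last edge is an assumed upper bound
edges = [17, 25, 35, 45, 55, 65, 120]
labels = ['[17, 25)', '[25, 35)', '[35, 45)', '[45, 55)', '[55, 65)', '65+']

# right=False makes each interval include its left edge and exclude its right edge
age_bin = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_bin.tolist())
```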

In [None]:
dfclean = df.copy() # clean copy for final model analysis
yesdf = df[df['term_deposit'] == 'yes']
nodf = df[df['term_deposit'] == 'no']

<div class="alert alert-success">

The dataset will be separated in order to show features that are representative of a "yes" in our target variable.

</div> 

# Data Preparation

## Notable Data Columns / Feature Conversions 
* Education - variables will be converted to numerical
* Duration - a duration of zero directly ties to the target 'no' (remove before modeling)
* Pdays - a value of 999 means the customer has not been contacted. 
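
Since 999 is a sentinel rather than a real number of days, one option is to carry that information in a separate indicator column. The sketch below is an assumption about how this could be done; the `previously_contacted` column name is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'pdays': [999, 3, 6, 999]})

# Hypothetical helper column: 1 if the client was reached in a previous campaign
toy['previously_contacted'] = (toy['pdays'] != 999).astype(int)
print(toy)
```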

In [None]:
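# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may not modify df under recent pandas versions
df['education'] = df['education'].replace({'illiterate': 0, 
                         'basic.4y': 4, 'basic.6y': 6, 
                         'basic.9y': 9, 'high.school': 12, 
                         'professional.course': 14, 
                         'university.degree': 16})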

# df['education'].value_counts()

## Dropping Features

In [None]:
# drop duration per dataset guidelines / drop columns with negative values
#df = df.drop(['duration', 'emp.var.rate', 'cons.conf.idx'], axis=1)

In [None]:
# drop duration per dataset guidelines
df = df.drop(['duration'], axis=1)

## Multicollinearity

In [None]:
corr = df.select_dtypes(include='number').corr() # correlation across numeric columns only
# corr
fig, ax = plt.subplots(figsize=(18,26))
mask = np.triu(np.ones_like(corr, dtype=bool)) # np.bool was removed from NumPy
sns.heatmap(corr, mask=mask, square=True, annot=True, cmap="YlGnBu")
#xticklabels=labels, yticklabels=labels)
#plt.xticks(rotation=-45, fontsize=16)
ax.patch.set_edgecolor('black')  
ax.patch.set_linewidth('1')
ax.set_title("Correlation & Heat Map", fontsize=15, fontfamily="serif")
plt.show()

In [None]:
# dropping features to address multicollinearity 
df = df.drop(['emp.var.rate', 'euribor3m'], axis=1)
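
Reading strongly correlated pairs off a large heat map is error-prone, so a short listing can help. The sketch below runs on synthetic columns, and the 0.75 threshold is an assumed cut-off rather than one stated in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame: 'b' is built to track 'a' closely, 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a * 2 + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})

threshold = 0.75  # assumed cut-off for "highly correlated"

corr_abs = toy.corr().abs()
# Keep the strict upper triangle so each pair is listed once
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```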

## Pre-Processing

### Column Names

In [None]:
# Cleaning Column Names
subs = [(' ', '_'),('.0',''),('.',''),('-','_')]

def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

df.columns = [col_formatting(col) for col in df.columns]

### One-Hot Encoding

In [None]:
# One Hot Encode
df = pd.get_dummies(df, drop_first=True)

In [None]:
# Converting uint8 datatypes back to categorical variables 
for cat_cols in df.iloc[:,8:].columns:
         df[cat_cols] = df[cat_cols].astype('category')

In [None]:
df.info()

In [None]:
# Identify X, y
y = df['term_deposit_yes']
X = df.drop(['term_deposit_yes'], axis=1) 

### Standardize 

In [None]:
# standardize the data
scaler = StandardScaler() # transform "X" features
X_scaled = scaler.fit_transform(X)
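
The cell above fits the scaler on every row. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. All names ending in `_toy`, `_tr` or `_te` are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# stratify keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

scaler_toy = StandardScaler()
X_tr_scaled = scaler_toy.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler_toy.transform(X_te)      # test rows reuse those statistics
```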

### Test/Train Split

In [None]:
# Test/Train split on the standardized features
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0) 

# Model

In [None]:
# Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
def model_visuals (model, X_test, y_test):
    '''Plots the confusion matrix and ROC-AUC plot'''
    fig, axes = plt.subplots(figsize = (12, 6), ncols = 2)  # confusion matrix
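    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display classes below replace them
    metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, normalize = 'true', 
                          cmap = 'Blues', ax = axes[0])
    axes[0].set_title('Confusion Matrix');
    # ROC-AUC Curve
    roc_auc = metrics.RocCurveDisplay.from_estimator(model, X_test, y_test,ax=axes[1])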
    axes[1].plot([0,1],[0,1],ls=':')
    axes[1].set_title('ROC-AUC Plot')
    axes[1].grid()
    axes[1].legend()
    fig.tight_layout()
    plt.show()

## Logistic Regression

In [None]:
%%time 
# Observe time lapse
logreg_clf = LogisticRegression()
logreg_model = logreg_clf.fit(X_train, y_train)
logreg_prediction = logreg_clf.predict(X_test)

# scikit-learn metrics take (y_true, y_pred); reversing them swaps precision and recall
lrs = round(accuracy_score(y_test, logreg_prediction)*100,2)
print('Accuracy Percentage', lrs)
print('\n')
print(classification_report(y_test, logreg_prediction), '\n\n')
model_visuals (logreg_clf, X_test, y_test)

## Random Forest

In [None]:
%%time 

ranfor_clf = RandomForestClassifier() # random forest 
ranfor_model = ranfor_clf.fit(X_train, y_train)
ranfor_prediction = ranfor_clf.predict(X_test)

# metrics take (y_true, y_pred)
random_forest_score = round(accuracy_score(y_test, ranfor_prediction)*100,2)
print('Accuracy Percentage', random_forest_score)
print('\n')

print(classification_report(y_test, ranfor_prediction), '\n\n')
model_visuals (ranfor_clf, X_test, y_test)

## Support Vector Machine

In [None]:
%%time 

svm_clf = SVC()
svm_model = svm_clf.fit(X_train, y_train)
svm_prediction = svm_clf.predict(X_test)

# metrics take (y_true, y_pred)
svm_score = round(accuracy_score(y_test, svm_prediction)*100,2)
print('Accuracy Percentage', svm_score)
print('\n')

print(classification_report(y_test, svm_prediction), '\n\n')
model_visuals (svm_clf, X_test, y_test)

## K-Nearest Neighbors

In [None]:
%%time 

knn_clf = KNeighborsClassifier()
knn_model = knn_clf.fit(X_train, y_train)
knn_prediction = knn_clf.predict(X_test)

# metrics take (y_true, y_pred)
knn_score = round(accuracy_score(y_test, knn_prediction)*100,2)
print('Accuracy Percentage', knn_score)
print('\n')

print(classification_report(y_test, knn_prediction), '\n\n')
model_visuals (knn_clf, X_test, y_test)

## Gaussian Naive Bayes

In [None]:
%%time 

gaussian_clf = GaussianNB() # Gaussian naive Bayes
gaussian_model = gaussian_clf.fit(X_train, y_train)
gaussian_prediction = gaussian_clf.predict(X_test)

# metrics take (y_true, y_pred)
gaussian_score = round(accuracy_score(y_test, gaussian_prediction)*100,2)
print('Accuracy Percentage', gaussian_score)
print('\n')

print(classification_report(y_test, gaussian_prediction), '\n\n\n')
model_visuals (gaussian_clf, X_test, y_test)

## Decision Tree

In [None]:
%%time 

dectree_clf = DecisionTreeClassifier() # Decision Tree 
dectree_model = dectree_clf.fit(X_train, y_train)
dectree_prediction = dectree_clf.predict(X_test)

# metrics take (y_true, y_pred)
decision_tree_score = round(accuracy_score(y_test, dectree_prediction)*100,2)
print('Accuracy Percentage', decision_tree_score)
print('\n')

print(classification_report(y_test, dectree_prediction), '\n\n')
model_visuals (dectree_clf, X_test, y_test)

## Gradient Boosting

In [None]:
%%time 

gb_clf = GradientBoostingClassifier()
gb_model = gb_clf.fit(X_train, y_train)
gb_prediction = gb_clf.predict(X_test)

# metrics take (y_true, y_pred)
gbclf_score = round(accuracy_score(y_test, gb_prediction)*100,2)
print('Accuracy Percentage', gbclf_score)
print('\n')

print(classification_report(y_test, gb_prediction), '\n\n')
model_visuals (gb_clf, X_test, y_test)

## Adaboost

In [None]:
%%time 

adabst_clf = AdaBoostClassifier()
adabst_model = adabst_clf.fit(X_train, y_train)
adabst_prediction = adabst_clf.predict(X_test)

# metrics take (y_true, y_pred)
adabst_score = round(accuracy_score(y_test, adabst_prediction)*100,2)
print('Accuracy Percentage', adabst_score)
print('\n')

print(classification_report(y_test, adabst_prediction), '\n\n')
model_visuals (adabst_clf, X_test, y_test)

# Interpret Models

In [None]:
# results dataframe
models = pd.DataFrame({
    'Model': ['Logistic Regression',
              'KNN', 
              'Random Forest', 
              'Gaussian Naive Bayes',
              'Support Vector Machine (SVC)', 
              'Decision Tree', 
              'AdaBoostClassifier', 
              'GradientBoostingClassifier',
             ],
    'Score': [lrs, 
              knn_score, 
              random_forest_score, 
              gaussian_score,
              svm_score, 
              decision_tree_score,
              adabst_score, 
              gbclf_score, 
             ]})

models.sort_values(by='Score', ascending=False) #sorting by score

<div class="alert alert-success">

It looks like the Gradient Boosting Classifier is the most accurate, with the AdaBoost Classifier a close second. Decision Tree comes in last, approximately 7 percentage points lower than the most accurate model. 


It is worth noting that the unbalanced target leads me to compare models on the precision, recall and F1 score of class "1" (target "yes") rather than on accuracy alone. Fortunately, the Gradient Boosting Classifier and the AdaBoost Classifier performed well, with recall scores over 0.65 and F1 scores over 0.32. 

</div>
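
To make the class-1 metrics concrete, here is a small worked sketch on hand-made labels (not model output from this notebook). With an unbalanced target, accuracy can look reasonable while recall on the minority class stays low.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 4 of 10 cases are positive, the "model" finds only one of them
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=1)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred, pos_label=1)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label=1)                # harmonic mean of the two

print(accuracy, precision, recall, round(f1, 3))
```

Here there is one true positive, one false positive and three false negatives, so precision is 1/2 and recall is 1/4 even though six of the ten predictions are correct.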

## Gradient Boosting Classifier with GridSearchCV
Let's see if we can improve our results at all using GridSearch.

``` Python
# Grid Search Parameters
learn_rates = [0.05, 0.1]
max_depths = [2, 3]
min_samples_leaf = [5,10]
min_samples_split = [5,10]

param_grid = {'learning_rate': learn_rates,
              'max_depth': max_depths,
              'min_samples_leaf': min_samples_leaf,
              'min_samples_split': min_samples_split}

grid_search = GridSearchCV(GradientBoostingClassifier(), 
                           param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
```

```
GridSearchCV(cv=5, estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.05, 0.1], 'max_depth': [2, 3],
                         'min_samples_leaf': [5, 10],
                         'min_samples_split': [5, 10]},
             return_train_score=True)

```

``` Python
print(grid_search.score(X_train, y_train))
print(grid_search.best_params_)
```

``` 
0.9074869490517412
{'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 5}
```

<div class="alert alert-success">

GridSearchCV did not raise the accuracy of the model; note that the score printed above is computed on the training set. Let's see if we can improve the accuracy by addressing our class imbalance problem.

</div>
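
One possible variation, sketched below on synthetic data, is to have the grid search optimise F1 instead of accuracy and to report the score on held-out rows. The data, the reduced grid and the `n_estimators=20` setting are assumptions chosen to keep the example fast; they are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 10 percent positives)
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

search = GridSearchCV(GradientBoostingClassifier(n_estimators=20, random_state=0),
                      param_grid={'max_depth': [2, 3]},
                      scoring='f1', cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print('mean CV F1 :', search.best_score_)
print('test-set F1:', search.score(X_te, y_te))  # uses the same F1 scorer
```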

## Class Imbalance

In [None]:
# Visualizing the class balance of the target
plt.bar(['No Term Deposit', 'Term Deposit'], df.term_deposit_yes.value_counts().values, facecolor = 'blue',  linewidth=0.5)
plt.title('Target Variable (Subscribed Term Deposits)', fontsize=16)
plt.xlabel('Classes')
plt.ylabel('Total Count')
plt.show()

<div class="alert alert-success">

The graph above highlights a high level of class imbalance. We can try to remedy this using SMOTE.

</div>

## SMOTE (Synthetic Minority Over-sampling Technique)

In [None]:
# Separate target and features
y = df['term_deposit_yes']
X = df.drop(['term_deposit_yes'], axis=1) 

In [None]:
X_scaled = scaler.fit_transform(X) #scale data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0) #test/train split

In [None]:
from imblearn.over_sampling import SMOTE #import smote

In [None]:
smote = SMOTE(random_state=0) #random state 0 for consistency

In [None]:
%%time

X_res, y_res = smote.fit_resample(X_train, y_train) #fitting smote to our train sets

In [None]:
# Updated bar graph to display the class counts post-SMOTE
plt.bar(['Term Deposit', 'No Term Deposit'], [sum(y_res), len(y_res)-sum(y_res)], facecolor = 'green',  linewidth=0.5)
plt.title('Post-SMOTE Target Variable (Subscribed Term Deposits)\n', fontsize=16)
plt.xlabel('Classes')
plt.ylabel('Total Count')
plt.show()

<div class="alert alert-success">

It looks like our classes are balanced now. Let's proceed.

</div>
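
For intuition about where the extra minority rows come from, the sketch below reproduces the interpolation step at the core of SMOTE in a simplified form: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. It is an illustration on random data with an assumed `k = 3`, not the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))  # toy minority-class rows

k = 3  # assumed neighbour count for this sketch
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)     # column 0 is each point itself

n_new = 5
base = rng.integers(0, len(X_min), size=n_new)         # rows to start from
neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one random neighbour each
lam = rng.random(size=(n_new, 1))                      # interpolation weight in [0, 1)

# Synthetic rows sit between each base row and its chosen neighbour
X_new = X_min[base] + lam * (X_min[neigh] - X_min[base])
print(X_new.shape)
```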

In [None]:
%%time 

gb_clf = GradientBoostingClassifier()
gb_model = gb_clf.fit(X_train, y_train)
gb_prediction = gb_clf.predict(X_test)

# metrics take (y_true, y_pred)
gbclf_score = round(accuracy_score(y_test, gb_prediction)*100,2)
print('Accuracy Percentage', gbclf_score)
print('\n')

print(classification_report(y_test, gb_prediction), '\n\n')
model_visuals (gb_clf, X_test, y_test)

In [None]:
%%time

# Top 2 + 2 to save time
adabst_clf = AdaBoostClassifier()
gb_clf = GradientBoostingClassifier()
ranfor_clf = RandomForestClassifier()
dectree_clf = DecisionTreeClassifier()

classifiers = [adabst_clf, gb_clf, ranfor_clf, dectree_clf] # Classifiers
classifiers_names = ['AdaBoost', 'Gradient Boost', 'Random Forest', 'Decision Tree'] # respective names

In [None]:
%%time 

# iterating through classifiers and appending accuracy to list of scores
scores = []
for clf in classifiers:
    clf.fit(X_res, y_res)
    scores.append(round(clf.score(X_test, y_test), 3))

In [None]:
# dataframe to compare results
dfsmote = pd.DataFrame({'Model': classifiers_names, 'Score': scores})

dfsmote.sort_values(by='Score', ascending=False) # sorting models by score

<div class="alert alert-success">

The models scored lower on accuracy than they did initially, so SMOTE did not improve model performance on this metric. 

</div>

# Feature Importance of Top Classifiers

Now that we know which classifiers have the most accuracy with our data, let's compare which features were the most important in the top two models: **Gradient Boosting Classifier and Adaboost.**

## Gradient Boosting Classifier

In [None]:
# Feature Importance
gb_feature = pd.DataFrame({'Importance': gb_model.feature_importances_, 'Column': X.columns})
gb_feature = gb_feature.sort_values(by='Importance', ascending=False) 
print('Gradient Boosting Top 25 Features')
gb_feature[:25] # top 25 features

In [None]:
gb_feature = gb_feature[:25] # top 25 features
gb_feature.plot(kind='barh', x='Column', y='Importance', figsize=(20, 10), cmap = 'coolwarm')
plt.title('Gradient Boosting Feature Importance \n', fontsize=16)
plt.show()

## Adaboost Features

In [None]:
# Feature Importance
ada_feature = pd.DataFrame({'Importance': adabst_model.feature_importances_, 'Column': X.columns})
ada_feature = ada_feature.sort_values(by='Importance', ascending=False) 
print('Adaboost Top 25 Features')
ada_feature[:25] 

In [None]:
ada_feature = ada_feature[:25] # top 25 features
ada_feature.plot(kind='barh', x='Column', y='Importance', figsize=(20, 10), cmap = 'coolwarm')
plt.title('Adaboost Feature Importance \n', fontsize=16)
plt.show()

## Overlap

In [None]:
# Creating lists / top 25 features in each classifier 
gb = gb_feature.Column.unique() 
ada = ada_feature.Column.unique()

In [None]:
set(gb) & set(ada) # items appearing in both lists

<div class="alert alert-success">

There are 19 total items that appear in both classifier feature lists. 

Top rated features in both lists include:
* **Number of employees | quarterly indicator (nremployed)**
* **Age**
* **Campaign | (number of contacts / this campaign)**
* **Consumer Confidence Index | Cons.conf.idx (monthly indicator)**
* **Days since last campaign contact | Pdays**
* **Contact Type Telephone**

</div>
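
The overlap step can be sketched on small made-up importance tables. The `top_k` value and the toy frames below are hypothetical stand-ins for `gb_feature` and `ada_feature`; the list comprehension keeps the shared features in Gradient Boosting rank order, which a plain set intersection does not.

```python
import pandas as pd

# Made-up importance tables standing in for gb_feature and ada_feature
gb_toy = pd.DataFrame({'Column': ['age', 'pdays', 'campaign', 'previous'],
                       'Importance': [0.40, 0.30, 0.20, 0.10]})
ada_toy = pd.DataFrame({'Column': ['pdays', 'age', 'education', 'campaign'],
                        'Importance': [0.35, 0.30, 0.20, 0.15]})

top_k = 3  # hypothetical cut-off
gb_top = gb_toy.nlargest(top_k, 'Importance')['Column']
ada_top = set(ada_toy.nlargest(top_k, 'Importance')['Column'])

# Shared features, listed in the Gradient Boosting ranking order
shared = [col for col in gb_top if col in ada_top]
print(shared)
```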

In [None]:
# yesdf.head()

In [None]:
number_employees = yesdf['nr.employed'].value_counts()

# Plot straight from the Series: the column name produced by
# pd.DataFrame(value_counts) differs between pandas versions
x_counts = number_employees.index
y_counts = number_employees.values

fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
graph_number_employees = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
plt.title('Subscriber Term Deposits X Number of Employees (Quarterly)', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.xlabel('Number of Employees (Quarterly)', fontdict={'fontsize': 16})
plt.show()

In [None]:
age = yesdf['age'].value_counts()

# Plot straight from the value_counts Series
x_counts = age.index
y_counts = age.values

fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
graph_df_age = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
plt.title('Subscriber Term Deposits X Age', fontdict={'fontsize': 16})
plt.xlabel('Age', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.show()

In [None]:
conf_idx = yesdf['cons.conf.idx'].value_counts()

# Plot straight from the value_counts Series
x_counts = conf_idx.index
y_counts = conf_idx.values

fig, ax = plt.subplots()
fig.set_size_inches(24, 16)
graph_conf_idx = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
for item in graph_conf_idx.get_xticklabels():
    item.set_rotation(90)
plt.title('Subscriber Term Deposits X Consumer Confidence Index', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.xlabel('Consumer Confidence Index', fontdict={'fontsize': 16})
plt.show()

In [None]:
campaign = yesdf['campaign'].value_counts()

# Plot straight from the value_counts Series
x_counts = campaign.index
y_counts = campaign.values

fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
graph_df_campaign = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
plt.title('Subscriber Term Deposits X Current Campaign Contacts', fontdict={'fontsize': 16})
plt.xlabel('Number of Contacts this Campaign', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.show()

In [None]:
pdays = yesdf['pdays'].value_counts()

# Plot straight from the value_counts Series
x_counts = pdays.index
y_counts = pdays.values

fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
graph_df_pdays = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
plt.title('Subscriber Term Deposits X Last Contact', fontdict={'fontsize': 16})
plt.xlabel('Days Since Last Contact', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.show()

In [None]:
contact = yesdf['contact'].value_counts()

# Plot straight from the value_counts Series
x_counts = contact.index
y_counts = contact.values

fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
graph_df_contact = sns.barplot(x=x_counts, y=y_counts, palette='winter_r')
plt.title('Subscriber Term Deposits X Contact Type', fontdict={'fontsize': 16})
plt.xlabel('Contact Type', fontdict={'fontsize': 16})
plt.ylabel('Number of Term Deposits', fontdict={'fontsize': 16})
plt.show()

In [None]:
yesdf.columns

# Conclusion Business Insights and Future Work

The dataset and the business insights drawn from it apply specifically to a client base with a recorded job type, marital status and education level. Additionally, the models and features reflect clients between 17 and 69 years of age. 

**Subscriber term deposits are highest when the quarterly number of employees is at least 5099 but not over 5228. In addition, the consumer price index should be between 92.89 and 93.08. Lastly, clients between the ages of 30 and 40 are the most likely to subscribe to term deposits.**

Future work: In order to more accurately define the boundaries of our features it is important to understand what customs and cultural influences are tied to this dataset (Portuguese banking info). For example: knowing the average level of education, the geographic locations of client residences and information on financial markets in this region may alter the way we perceive each of these variables.