# AI Saturdays Training Exercise - Bank Classifier
The 'VERYCH' Bank's marketing campaigns rely on customer data. This data set is too large for a data analyst to extract useful information from it manually to support decision-making. The bank has therefore enlisted your help to make sense of the data with machine learning.

## Dataset

This data set relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

The objective is to predict whether the client will subscribe (yes/no) to a term deposit, building a classification model using decision trees.

## Summary of data
### Categorical Variables :
job : admin., technician, services, management, retired, blue-collar, unemployed, entrepreneur, housemaid, unknown, self-employed, student

marital : married, single, divorced

education: secondary, tertiary, primary, unknown

default : yes, no

housing : yes, no

loan : yes, no

deposit : yes, no (Dependent Variable)

contact : unknown, cellular, telephone

month : jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec

poutcome: unknown, other, failure, success


### Numerical Variables:
age

balance

day

duration

campaign -> number of times that a customer has been contacted in this campaign

pdays -> days that have passed since the last contact, -1 if a customer has not been contacted

previous -> number of times a customer has been contacted prior to the campaign

In [2]:
# First, we import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn import datasets
from io import StringIO
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
%matplotlib inline

In [None]:
# We open the csv and create a dataframe from it.
df = pd.read_csv('bank.csv')

# Show number of rows and columns of the dataframe
print("Rows: " + str(df.shape[0]) + " Cols: " + str(df.shape[1]))

Rows: 11162 Cols: 17


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes
5,42,management,single,tertiary,no,0,yes,yes,unknown,5,may,562,2,-1,0,unknown,yes
6,56,management,married,tertiary,no,830,yes,yes,unknown,6,may,1201,1,-1,0,unknown,yes
7,60,retired,divorced,secondary,no,545,yes,no,unknown,6,may,1030,1,-1,0,unknown,yes
8,37,technician,married,secondary,no,1,yes,no,unknown,6,may,608,1,-1,0,unknown,yes
9,28,services,single,secondary,no,5090,yes,no,unknown,6,may,1297,3,-1,0,unknown,yes


**TODO** by you: Show the first 10 rows 

In [None]:
# Write your code here

**TODO**  Find the number of unique values in each column

In [None]:
# Write your code here

76
12
3
4
2
3805
2
2
3
31
12
1428
36
472
34
4
2


**TODO** Print the general information of the dataframe

In [None]:
# Write your code here

**TODO** Print the general statistics description of the dataframe

In [None]:
# Write your code here

**TODO** Print the age distribution

In [None]:
# Write your code here
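
The exploration steps above can be sketched as follows. This is only an illustration on a hypothetical five-row sample (the values are made up and `sample`, `csv_text` and the result names are assumptions); in the exercise you would run the same calls on `df`.

```python
import io
import pandas as pd

# Hypothetical mini-sample in the same layout as bank.csv (values are made up).
csv_text = """age,job,marital,balance,deposit
59,admin.,married,2343,yes
56,admin.,married,45,yes
41,technician,married,1270,no
28,services,single,5090,no
41,management,single,0,yes
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_rows = sample.head(10)      # first 10 rows (fewer if the frame is shorter)
unique_counts = sample.nunique()  # number of distinct values in each column
sample.info()                     # column dtypes and non-null counts
stats = sample.describe()         # count, mean, std and quartiles of numeric columns

# A simple tabular view of the age distribution: how often each age occurs.
age_counts = sample["age"].value_counts().sort_index()

print(unique_counts)
print(age_counts)
```

For a graphical view of the age distribution, `sample["age"].plot.hist()` or seaborn's `histplot` are possible options.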

### Now, we transform the categorical data into numerical

In [6]:
# We first copy the dataframe to avoid information loss
bank_data = df.copy()


**TODO**   Display the number of people that made a deposit split by their job category

In [8]:
# Write your code here

admin.          :   631
technician      :   840
services        :   369
management      :  1301
retired         :   516
blue-collar     :   708
unemployed      :   202
entrepreneur    :   123
housemaid       :   109
unknown         :    34
self-employed   :   187
student         :   269
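
One possible way to produce a per-job count like the one above is to filter the subscribers and count them with `value_counts`. The sketch below uses a hypothetical toy frame (`toy` and its rows are made up; only the column names follow the dataset), so its numbers do not match the real output.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "job": ["admin.", "admin.", "student", "student", "retired"],
    "deposit": ["yes", "no", "yes", "yes", "no"],
})

# Keep only the customers who subscribed, then count them per job.
deposits_by_job = toy.loc[toy["deposit"] == "yes", "job"].value_counts()

for job, count in deposits_by_job.items():
    print(f"{job:<15} : {count:>5}")
```

Calling `toy["job"].value_counts()` without the filter gives the size of each job category regardless of the deposit outcome.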


**TODO**  Print the different types of work categories and how many items are in each 
Tip: investigate value_counts

In [None]:
# Write your code here

**TODO**  Combine similar jobs in categories
Tip: Research the function replace
['management', 'admin.'] ->  'white-collar'
['services','housemaid'] -> 'pink-collar'
['retired', 'student', 'unemployed', 'unknown'] ->  'other'

In [None]:
# Write your code here
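
As a hedged illustration of the grouping described above, `Series.replace` accepts a list of old labels and a single new label. The series `jobs` below is a hypothetical toy column; in the exercise the same calls would be applied to `bank_data['job']`.

```python
import pandas as pd

# Hypothetical toy column using job labels from the dataset.
jobs = pd.Series(["management", "admin.", "services", "housemaid",
                  "retired", "student", "unemployed", "unknown", "technician"])

# Each call maps a list of old labels onto one new label.
grouped = jobs.replace(["management", "admin."], "white-collar")
grouped = grouped.replace(["services", "housemaid"], "pink-collar")
grouped = grouped.replace(["retired", "student", "unemployed", "unknown"], "other")

print(grouped.value_counts())
```

Labels that are not listed, such as 'technician', are left unchanged.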

In [None]:
# We now count the values again to see how the information has turned out
bank_data.job.value_counts()

#### poutcome

In [9]:
# We check its value counts
bank_data.poutcome.value_counts()

unknown    8326
failure    1228
success    1071
other       537
Name: poutcome, dtype: int64

**TODO**  Combine unknown and other, since 'other' is not compatible with 'success' and 'failure'

In [None]:
# Tip, reuse the function replace
bank_data['poutcome'] = # Write your code here
bank_data.poutcome.value_counts()

#### TODO: Drop the 'contact' column since it does not provide relevant information 

In [None]:
# Write your code here


#### default

In [None]:
# We map the 'yes'/'no' values of 'default' to 1/0 in a new column, then drop the original
bank_data['default_cat'] = bank_data['default'].map({'yes': 1, 'no': 0})
bank_data.drop('default', axis=1, inplace=True)

#### housing, loan, deposit

Now do the same with the following variables

In [None]:
# values for "housing" : yes/no


In [None]:
# values for "loan" : yes/no


In [None]:
# values for "deposit" : yes/no


In [None]:
# pdays: number of days that passed after the customer was last contacted from a previous campaign
# -1 means the client was not previously contacted

print("Customers that have not been contacted before:", len(bank_data[bank_data.pdays==-1]))
print("Maximum values on pdays    :", bank_data['pdays'].max())

In [None]:
# Replace the -1 value of pdays with a large value (e.g. 10000) to reflect that the customer has not been contacted in a long time (similar to never contacted)


In [None]:
# Create a new column, recent_pdays, holding the inverse of pdays (1 / pdays).
# Recently contacted customers then get the largest values and never-contacted
# customers (pdays = 10000) the smallest, and every value stays positive.

# Once done, delete pdays
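
The pdays transformation described in the comments above might look like the following sketch. It runs on a hypothetical toy frame (`toy` is made up) and assumes that pdays never equals 0, since the inverse would otherwise be infinite.

```python
import pandas as pd

# Hypothetical toy values: -1 marks customers who were never contacted.
toy = pd.DataFrame({"pdays": [-1, 5, 100, -1, 1]})

# Never-contacted customers get a large placeholder value.
toy["pdays"] = toy["pdays"].replace(-1, 10000)

# Inverse of pdays: recent contacts approach 1, never-contacted rows become 0.0001.
# Assumption: no row has pdays == 0.
toy["recent_pdays"] = 1 / toy["pdays"]

# The original column is no longer needed.
toy = toy.drop(columns="pdays")

print(toy)
```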

### **TODO:** Convert to dummy values

Convert the categorical variables to dummies: to treat the rows as numerical vectors, the categorical variables must be turned into numeric ones.

To do this, each categorical variable is replaced with as many new columns as it has distinct values; each new column contains 1 if the row had that value and 0 otherwise.

*Tip: investigate the function get_dummies*
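
A minimal sketch of this encoding, on a hypothetical three-row frame (`toy` and its values are made up), shows how each categorical column expands into one indicator column per distinct value:

```python
import pandas as pd

# Hypothetical toy rows with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [59, 41, 28],
    "marital": ["married", "single", "divorced"],
    "education": ["secondary", "tertiary", "secondary"],
})

# Each listed column is replaced by indicator columns named <column>_<value>.
dummies = pd.get_dummies(toy, columns=["marital", "education"])

print(dummies.columns.tolist())
print(dummies.shape)
```

Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers; both behave as 0 and 1 in arithmetic.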

In [None]:
# Write your code here
bank_with_dummies = 

In [None]:
# Check how it has changed the structure (dimensions) of the dataframe
bank_with_dummies.shape


In [None]:
bank_with_dummies.describe()


**TODO:** 
Scatterplot with age and money in the account (balance). You can do it with pandas and its plot function or with seaborn's scatterplot

In [None]:
# Write your code below

What is your interpretation of the results?

**TODO:** Display a Histogram with poutcome_success variable

In [None]:
# Write your code below

In [None]:
# People who signed up for a deposit
bank_with_dummies[bank_data.deposit_cat == 1].describe()


In [None]:
# Now we want to showcase a bar diagram displaying the deposited amount by job. Look into barplot from seaborn
plt.figure(figsize = (10,6))


### **TODO:** Establish relationships between features

In [None]:
# show variable correlation matrix
# Hint: Explore the pandas corr and seaborn heatmap functions

In [None]:
# Show the pairwise relationships between the variables as a matrix of scatter plots,
# which is useful for spotting linear relationships

# Hint: explore pd.plotting.scatter_matrix
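
To illustrate what the correlation matrix contains, here is a small sketch on hypothetical data (`toy` is made up so that the relationships are exactly linear). In the exercise the same call would be made on `bank_with_dummies`.

```python
import pandas as pd

# Hypothetical toy data with perfectly linear relationships.
toy = pd.DataFrame({
    "age": [25, 35, 45, 55, 65],
    "balance": [100, 300, 500, 700, 900],  # rises linearly with age
    "campaign": [5, 4, 3, 2, 1],           # falls linearly with age
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr()

print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render the matrix as a colored grid, and `pd.plotting.scatter_matrix(toy)` would draw the pairwise scatter plots.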

In [None]:
# Split the data into training and test sets by a certain proportion (experiment!). Use train_test_split from sklearn, imported at the top ;)

In [None]:
# Define a classifier
from sklearn.neighbors import KNeighborsClassifier

k = #...

neigh = KNeighborsClassifier(n_neighbors=k)

# Train the classifier with the train dataset
neigh.fit(# train data X, train data y)

# Predict values for the independent test variables
neigh.predict(# the test data)

# Calculate accuracy (score returns the mean accuracy on the given test data)
neigh.score(# test data X, test data y)
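
The split, fit, predict and score steps fit together roughly as in the sketch below. It uses synthetic, well-separated data rather than the bank features, and the values `test_size=0.25`, `random_state=42` and `n_neighbors=5` are arbitrary assumptions to experiment with, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the real features and target.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(6.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Hold out 25% of the rows for testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the classifier on the training part only.
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)

# Predict the held-out rows and measure the mean accuracy on them.
predictions = neigh.predict(X_test)
accuracy = neigh.score(X_test, y_test)

print(f"Test accuracy: {accuracy:.2f}")
```

With the bank data, `X` would be the dummy-encoded feature columns and `y` the `deposit_cat` column.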
