# Bank marketing report


## Dataset description
---
The dataset has been downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

The data relate to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not.
The classification goal is to predict whether the client will subscribe to a term deposit.

**Cardinality (# of instances)** = 45211<br>**Dimensionality (# of attributes)** = 17<br>**Attributes**:
- Client related attributes:
    1. **Age** (**numeric**)
    2. **Job** (**categorical**: *admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services*) 
    3. **Marital**: marital status (**categorical**: *married, divorced, single;* **note**: *divorced* means divorced or widowed)
    4. **Education** (**categorical**: *unknown, secondary, primary, tertiary*)
    5. **Default**: has credit in default? (**binary**: *yes, no*)
    6. **Balance**: average yearly balance, in euros (**numeric**) 
    7. **Housing**: has housing loan? (**binary**: *yes, no*)
    8. **Loan**: has personal loan? (**binary**: *yes, no*)
- Related with the last contact of the current campaign:
    9. **Contact**: contact communication type (**categorical**: *unknown, telephone, cellular*) 
    10. **Day**: last contact day of the month (**numeric**)
    11. **Month**: last contact month of year (**categorical**: *jan, feb, mar, ..., nov, dec*)
    12. **Duration**: last contact duration, in seconds (**numeric**)
- Other attributes:
    13. **Campaign**: number of contacts performed during this campaign and for this client (**numeric**, includes last contact)
    14. **Pdays**: number of days that passed by after the client was last contacted from a previous campaign (**numeric**, -1 means client was not previously contacted)
    15. **Previous**: number of contacts performed before this campaign and for this client (**numeric**)
    16. **Poutcome**: outcome of the previous marketing campaign (**categorical**: *unknown, other, failure, success*)
- Output variable (desired target):
    17. **Subscription**: has the client subscribed a term deposit? (**binary**: *yes, no*)
    
The dataset is also complete: it has no missing values, since preprocessing was already done before the data was published. Note, however, that several categorical attributes use an explicit *unknown* label.
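The completeness claim can be checked in two steps: counting true nulls, and separately counting the explicit *unknown* labels, which pandas does not treat as missing. The sketch below is only an illustration on a small hypothetical frame (`sample` and its rows are made up, not taken from the file); on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical three-row sample with the same kind of columns as the dataset.
sample = pd.DataFrame({
    "Age": [58, 44, 33],
    "Job": ["management", "technician", "unknown"],
    "Education": ["tertiary", "unknown", "unknown"],
})

# True missing values (NaN/None) per column.
missing_per_column = sample.isna().sum()

# Explicit "unknown" labels per column; these are ordinary strings,
# so isna() does not count them.
unknown_per_column = sample.isin(["unknown"]).sum()

print(missing_per_column)
print(unknown_per_column)
```

Keeping the two counts apart makes it clear that "no missing values" refers to nulls only, while *unknown* remains a regular category.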


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("./input/bank-full.csv", delimiter=';')
data.rename(columns={'y': 'Subscription'}, inplace=True)
data.columns = [x.capitalize() for x in data.columns]

data.head(10)

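| | Age | Job | Marital | Education | Default | Balance | Housing | Loan | Contact | Day | Month | Duration | Campaign | Pdays | Previous | Poutcome | Subscription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 5 | 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 6 | 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 7 | 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 8 | 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 9 | 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |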


### Population characteristics
---

The marketing campaign aims at getting the most out of the bank's clients.

We will now take a closer look at the main characteristics of the different client profiles, using the attributes that describe them.

#### Age
---

Let's start by looking at the age distribution using a box plot.

In [2]:
%%HTML
<div class='tableauPlaceholder' id='viz1546248094547' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Ageboxplot&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Ageboxplot' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Ageboxplot&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546248094547');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The two columns represent the two possible outcomes of the marketing campaign, that is, whether or not the client subscribed to the term deposit, so we can compare the age distribution of the two classes.

The blue boxes indicate the middle 50% of the data (that is, the middle two quartiles of the data's distribution).
The *whiskers* (top and bottom lines) extend to the most extreme points lying within 1.5 times the **interquartile range** (**IQR**) $r = Q3-Q1$ of the box edges.

First, we see that ages range over $[18, 95]$ and that the **average** is around 41 years for both classes. The upper and lower whiskers indicate which values are outliers. For example, in the class **no**, the middle $50\%$ of the population lies in the range $[33, 48]$, so $r = 48 - 33 = 15$ and the whisker limits are $Q1-1.5*r = 10.5$ and $Q3+1.5*r = 70.5$. Since no client is younger than 18, the whiskers end at 18 and 70, and values outside the range $[18, 70]$ are considered outliers.
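The whisker rule described above can be sketched in a few lines of pandas. The helper name `iqr_fences` and the toy `ages` series are illustrative assumptions, not part of the original notebook; with quartiles of 33 and 48 the same formula yields limits of 10.5 and 70.5.

```python
import pandas as pd


def iqr_fences(values: pd.Series) -> tuple[float, float]:
    """Return the (lower, upper) outlier limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - 1.5 * iqr), float(q3 + 1.5 * iqr)


# Toy values, not taken from the dataset.
ages = pd.Series([1, 2, 3, 4, 5, 30])
lower, upper = iqr_fences(ages)

# Points outside the limits are the ones a box plot draws individually.
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers.tolist())
```

Note that `quantile` uses linear interpolation by default, so other tools (Tableau included) may report slightly different quartiles on the same data.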

To get a better idea of the age distribution, look at the bar chart below.

In [3]:
%%HTML
<div class='tableauPlaceholder' id='viz1546250593802' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Agedistribution&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Agedistribution' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Agedistribution&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546250593802');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

As we can see, the bank mostly contacted people from 25 to 60 years old, perhaps expecting them to be more interested in term deposits than younger and older people.

Age alone might not be very informative, so we will now look at the other attributes, broken down by age.

In [4]:
%%HTML
<div class='tableauPlaceholder' id='viz1546436632541' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Dashboard3&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Dashboard3' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Dashboard3&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546436632541');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Interesting facts:
- **Age - Marital** : Until the age of 31 there are more *single* clients; from 32 onwards there are more *married* clients (*divorced* clients are not considered in this comparison). The youngest divorced customer is 24 years old. From the age of 61, there are few to no *single* people.
- **Age - Default** : All defaulters are between 21 and 60 years old, with a single outlier aged 71.
- **Age - Housing** : Both classes (*yes* and *no*) grow at nearly the same pace until the age of 30 (with the exception of the 27-29 range). The *yes* class then keeps growing until it peaks at the age of 32 with 1,325 people, after which it decreases, meeting the *no* class at the age of 47. After the age of 52 there are always fewer people with a housing loan than without one. The youngest clients with a housing loan are aged 20, whereas the oldest is 78.
- **Age - Loan** : The youngest clients with a personal loan are aged 20 (3 in total), whereas the oldest are 72 (also 3 in total). The ratio between clients with and without a personal loan is highest in the 40-60 age range, with a maximum of 25.08%. An exception is the age of 25, with a 23.13% ratio.
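Per-age figures like the ones above can be reproduced with a cross-tabulation. The following is a minimal sketch on a hypothetical `sample` frame; it assumes the ratio is computed as loan holders divided by non-holders, which the dashboard does not state explicitly.

```python
import pandas as pd

# Hypothetical sample; the real analysis would use the `data` frame instead.
sample = pd.DataFrame({
    "Age":  [25, 25, 25, 40, 40, 40, 40, 40],
    "Loan": ["yes", "no", "no", "yes", "no", "no", "no", "no"],
})

# One row per age, one column per answer, cells hold client counts.
counts = pd.crosstab(sample["Age"], sample["Loan"])

# Assumed definition: holders / non-holders, expressed as a percentage.
ratio_pct = counts["yes"] / counts["no"] * 100
print(ratio_pct)
```

Ages with no client in the *no* column would produce a division by zero (`inf`), so sparse ages at the extremes of the range need to be handled separately.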

### Job
---

Let's start by looking at the job distribution for each age.

In [5]:
%%HTML
<div class='tableauPlaceholder' id='viz1546441027436' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Dashboard4&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Dashboard4' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Dashboard4&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546441027436');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

In [6]:
%%HTML
<div class='tableauPlaceholder' id='viz1546441787313' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Jobsperagerange&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Jobsperagerange' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Jobsperagerange&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546441787313');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Interesting facts:
- In the [20-29] range there are already 3 *retired* people, the youngest of which is only 24.
- The youngest *entrepreneurs* are 23 years old.
- In the [40-49] range there are still 14 *students*, the oldest of which is 48.
- Most of the young clients ([20-29] range) contacted by the bank are employed: just 160 are *unemployed*.
- In the [80-89] range there are still 3 active workers (2 *management*, 1 *entrepreneur*), and, even more interestingly, in the [70-79] range there are still 6 *technicians* and 4 *blue-collar* workers, neither of which is a sedentary job.

Now let's have a look at the dashboard below.

In [7]:
%%HTML
<div class='tableauPlaceholder' id='viz1546443611019' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;KY&#47;KYFYYB3RC&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='path' value='shared&#47;KYFYYB3RC' /> <param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;KY&#47;KYFYYB3RC&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546443611019');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The largest groups of clients are *blue-collar* and *management*; then come *technician*, *services* and *admin.* Only 2.88% of the clients are *unemployed*.

Regarding the **Sum of Balance**, the top 3 categories are the same as before (as we could expect, since the more clients there are, the higher the sum of their balances is likely to be), but in a different order. In fact, *managers* have more money on average than *blue-collar* workers.

It's interesting to notice that *entrepreneurs* and *unemployed* clients have nearly the same **average balance**. One possible reason is that *entrepreneurs* tend to keep investing their money, leaving less in their personal bank account, while *unemployed* clients might be more inclined to save the money they have, since they don't have a job.
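The three measures discussed above (number of clients, sum of balance and average balance per job) can be computed in one grouped aggregation. This is a sketch on made-up rows; the output column names (`Clients`, `Sum_balance`, `Avg_balance`, `Share_pct`) are illustrative choices, not names used by the dashboard.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
sample = pd.DataFrame({
    "Job": ["management", "management", "blue-collar",
            "blue-collar", "blue-collar", "unemployed"],
    "Balance": [3000, 1000, 500, 700, 300, 1500],
})

# Count, total and mean balance per job, largest group first.
job_summary = (
    sample.groupby("Job")["Balance"]
    .agg(Clients="count", Sum_balance="sum", Avg_balance="mean")
    .sort_values("Clients", ascending=False)
)

# Share of each job over all clients, as a percentage.
job_summary["Share_pct"] = job_summary["Clients"] / len(sample) * 100
print(job_summary)
```

In this toy sample the largest group (*blue-collar*) has the lowest average balance, which shows why a ranking by sum can differ from a ranking by count.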

### Balance
---

Here is an additional chart showing the box plot of the balances for each job.

In [8]:
%%HTML
<div class='tableauPlaceholder' id='viz1547130280240' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Balance-job&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Balance-job' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Balance-job&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1547130280240');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Now it would be interesting to see the age of different groups of balance, so we create 3 groups (*negative*, *<10k*, *>=10k*).
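One way to build these three groups in pandas is `pd.cut` with open-ended bins. This is a hedged sketch: the `balances` values are invented, and placing a balance of exactly 0 in the *<10k* group is an assumption, since the text does not say where zero belongs.

```python
import numpy as np
import pandas as pd

# Invented balances covering each boundary case.
balances = pd.Series([-200, 0, 9999, 10000, 45000], name="Balance")

balance_group = pd.cut(
    balances,
    bins=[-np.inf, 0, 10_000, np.inf],
    labels=["negative", "<10k", ">=10k"],
    right=False,  # left-closed bins: [-inf, 0), [0, 10000), [10000, inf)
)
print(balance_group.tolist())
```

With `right=False` the value 10000 falls in *>=10k* and 0 falls in *<10k*, matching the group names.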

In [9]:
%%HTML
<div class='tableauPlaceholder' id='viz1546446956058' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Balanceboxplot&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Balanceboxplot' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Balanceboxplot&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546446956058');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

As we can see, the widest age range belongs to the "average" class (*<10k*), whereas the "poor" class (*negative* balance) ranges from 20 to 63 years old: no client older than 63 has a balance below zero.

### Subscriptions
---

Here the total number of **Subscriptions** will be examined.

Let's see the count of the **Subscriptions** per **Age** and **Job**.

In [10]:
%%HTML
<div class='tableauPlaceholder' id='viz1546450091616' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-agedb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Subs-agedb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-agedb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546450091616');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

It's interesting to see that the ratio between *yes* and *no* is higher for young and old people, whereas from 25 to 60 years old the term deposit doesn't seem to attract as many clients. This might also depend on the smaller size of the former groups, which makes high ratios easier to reach.
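To judge how much the group size influences these ratios, the subscription rate can be shown next to the number of clients in each age band. The sketch below uses invented rows and decade-wide bands (for example 20 stands for [20-29]); both are assumptions made for illustration.

```python
import pandas as pd

# Invented rows, not taken from the dataset.
sample = pd.DataFrame({
    "Age": [19, 22, 24, 35, 36, 38, 41, 45, 67, 72],
    "Subscription": ["yes", "no", "yes", "no", "no",
                     "no", "no", "yes", "yes", "no"],
})

# Lower bound of the decade each age falls in: 24 -> 20, 67 -> 60.
age_band = (sample["Age"] // 10 * 10).rename("Age_band")

# True where the client subscribed; the mean of booleans is the yes-rate.
subscribed = sample["Subscription"].eq("yes")

by_band = subscribed.groupby(age_band).agg(Clients="count", Yes_rate="mean")
print(by_band)
```

Bands with very few clients can reach rates of 0 or 1 from a single answer, which is the small-group effect mentioned above.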

In [11]:
%%HTML
<div class='tableauPlaceholder' id='viz1546450153212' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-Jobsdb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Subs-Jobsdb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-Jobsdb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546450153212');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The highest number of subscriptions comes from the *management* class, but looking at the percentages, *students* and *retired* people are more inclined to subscribe to the term deposit. On the other hand, *entrepreneurs*, *housemaids*, *services* and *blue-collar* workers are not so interested, with an average successful subscription rate of 8.30%. Maybe the bank should direct its marketing campaign's effort towards other groups of workers.

Let's analyze other attributes.

In [12]:
%%HTML
<div class='tableauPlaceholder' id='viz1546450858151' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subsperfeatures&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Subsperfeatures' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subsperfeatures&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546450858151');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

As we can observe, clients that have active **loans** of any type are less inclined to subscribe. In both cases, the subscription rate almost doubles for clients without a loan.

The **marital status** doesn't seem to be very discriminative for the subscription outcome, since the maximum difference is 4.83%.

More **educated** clients have a higher tendency to get the term deposit, whereas less educated clients are less interested. This outcome is affected by the fact that 55.26% of those who studied only up to *primary* school are *blue-collar* workers, a class which, as previously shown, has a low subscription rate.
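A share such as "percentage of *primary*-educated clients who are *blue-collar*" is a row-normalised cross-tabulation. The following sketch uses a small invented `sample` frame, so its numbers do not match the real figure.

```python
import pandas as pd

# Invented rows, not taken from the dataset.
sample = pd.DataFrame({
    "Education": ["primary", "primary", "primary", "primary",
                  "tertiary", "tertiary"],
    "Job": ["blue-collar", "blue-collar", "blue-collar", "housemaid",
            "management", "technician"],
})

# normalize="index" divides each row by its total, so every
# education level sums to 100 after scaling.
share_pct = pd.crosstab(
    sample["Education"], sample["Job"], normalize="index"
) * 100
print(share_pct.round(2))
```

Using `normalize="columns"` instead would answer the reverse question, namely which education levels make up each job.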

Here below is a chart showing the distribution of **Education** vs. **Job**.

In [13]:
%%HTML
<div class='tableauPlaceholder' id='viz1546530192730' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Job-Edudb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Job-Edudb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Job-Edudb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546530192730');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

### Campaign related attributes
---

The attributes that are strictly related to the campaign itself, and not to the client, are the following: **Contact**, **Day**, **Month**, **Duration**, **Campaign**, **Pdays**, **Previous**, **Poutcome**.

Among these attributes I decided not to consider the day, since it is just the numerical day of the month and thus not very informative. All the other attributes might be interesting, so let's see what we can discover.

First of all we will analyze the data relative to the **Contact type** and the **Call duration**, to see whether we can infer something useful.

In [14]:
%%HTML
<div class='tableauPlaceholder' id='viz1546530438367' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subspercallsdb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Subspercallsdb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subspercallsdb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546530438367');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The **Contact type** on its own doesn't seem to be a very interesting feature regarding the **Subscription** outcome. In fact, both the *cellular* and *telephone* yield almost the same results on this aspect, so it will likely not be a very informative field.

On the other hand, the **Call duration** is very indicative of the final **Subscription** outcome. The longer the call, the higher the positive **Subscription** rate, which is what we might have expected: uninterested people do not want to waste time on the phone, so they hang up very soon (in the *1-250* duration group, 93,61% of the customers answered *no*), whereas interested people take the time to have things explained better before making the final decision.

Let's now have a look at how many times in total a given client was contacted before accepting or rejecting the term deposit.
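The ratio per duration group can also be recomputed directly from the raw records. The following is only a minimal sketch: the 250-second bucket width is an assumption chosen to match the *1-250* group, and the `(duration, subscribed)` pairs are made up and do not come from the dataset.

```python
from collections import defaultdict

def subscription_rate_by_duration(records, bucket_size=250):
    """Return {bucket_start: share of 'yes'} for (duration_seconds, subscribed) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for duration, subscribed in records:
        # Durations 1..250 fall in bucket 1, 251..500 in bucket 251, and so on.
        bucket = ((max(duration, 1) - 1) // bucket_size) * bucket_size + 1
        totals[bucket] += 1
        positives[bucket] += int(subscribed)
    return {b: positives[b] / totals[b] for b in sorted(totals)}

# Made-up records, NOT taken from the dataset.
toy_records = [(30, False), (120, False), (200, True), (240, False),
               (300, True), (480, False), (700, True), (900, True)]
bucket_size = 250
rates = subscription_rate_by_duration(toy_records, bucket_size)
for start, rate in rates.items():
    print(f"{start}-{start + bucket_size - 1}s: {rate:.0%} yes")
```

On the real data the same function would be fed the *Duration* and *Subscription* columns.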

In [15]:
%%HTML
<div class='tableauPlaceholder' id='viz1546530481538' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Ncontacs-subsdb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Ncontacs-subsdb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Ncontacs-subsdb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546530481538');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

As we see from the **first plot**, people who said *yes* needed far fewer calls, and thus hesitated less, than people who said *no*. Only in 5 cases did the *yes* class need from 20 to 32 calls to decide, while the average shows that ~13 calls were enough. People who said *no* took on average twice as many, with ~26 calls, reaching an incredible 63 in one extreme case.

In the **second plot** we have the ratio of people that said *yes* for each **# of contacts** count: people tended to say *yes* more often after a few calls, then the trend descends, with a final growing phase that is due to the low number of samples for each **# of contacts**.

Now it would be interesting to see how many of the clients that subscribed in a previous campaign of the bank did the same in the current one.

In [16]:
%%HTML
<div class='tableauPlaceholder' id='viz1546530525689' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;poutcome-subs&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='path' value='views&#47;Tableau_report_0&#47;poutcome-subs?:embed=y&amp;:display_count=y' /> <param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;poutcome-subs&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546530525689');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The results are similar to what we expected. People who embraced a former campaign were strongly inclined to say *yes* again, maybe because they had a good experience with the bank and gained more trust in it, or because they are simply trusting people. People who previously said *no* and now said *yes* are 12,61%, which, in my opinion, is not a bad number, because the bank was probably trying to win them back and succeeded with more than 1 in 10. It's important to notice that the clients who say *no* to the campaigns are far more numerous than those who say *yes*, thus 12% of those clients is a good result.

The next chart shows the **Positive subscription ratio** with respect to the days elapsed from the last contact for a previous campaign.

In [17]:
%%HTML
<div class='tableauPlaceholder' id='viz1546530567523' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;elapseddb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;elapseddb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;elapseddb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546530567523');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

As we can see from the graph, the trend is mainly descending, but there are growing phases. The first one is in the [0-90] range, where we reach a very high rate (52,76%): for the first three months, the subscription rate increases as time passes. After this period the downward trend starts, with very low rates averaging ~15%, with the exception of the 180-day mark (6 months), where we get close to the best ratios. Then, after a year, the subscriptions suddenly ramp up to 60,61%, even reaching a whopping 90,91% when 1 year and 2 months have elapsed. After 470 days the data is not very useful, since we only deal with outliers: the trend fluctuates heavily, and both high and low values might be misleading.

Last, but not least, let's look at the distribution of the **Subscriptions** for each **Month** of the year.

In [18]:
%%HTML
<div class='tableauPlaceholder' id='viz1546596850838' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-monthdb&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;Subs-monthdb' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;Subs-monthdb&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1546596850838');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

One interesting thing about the chart is that the months in which the bank was not pushing very hard for the campaign were the most successful: *March*, *September*, *October* and *December* always had well below 1000 contacts, yet they had the highest **Subscription** rates, with an average of 47,24% and a best of 51,99% in *March*. The most active periods are late spring and summer (always with more than 4000 contacts), and it is in this period that we have the highest rejection rates.

There seems to be an **inverse relationship** between the number of contacts and the success rate: the higher the former, the lower the latter, and vice versa.

## Correlations
---

Now let's see how all the different features of our dataset are related to each other. To do so, we can have a quick look at the correlation matrix.

Before showing it, it's better to introduce two fundamental concepts in statistics: **covariance** and **correlation coefficient**.

The *covariance* measures how much two variables vary together, that is, it describes how one of the two variables changes when the other varies. When both increase (or decrease) together, the covariance is positive ($\mathrm {Cov} (X,Y)>0$). On the other hand, if they behave in opposite ways, so that when one increases the other decreases and vice versa, the covariance is negative ($\mathrm {Cov} (X,Y)<0$). If we do not observe any such regular linear behavior, the covariance is zero ($\mathrm {Cov} (X,Y)=0$). We define the covariance in this way: $${\displaystyle \mathrm {Cov} (X,Y)=\mathbb {E} { [}{ (}X-\mathbb {E} [X]{ )}(Y-\mathbb {E} [Y]{ )}{ ]}}$$

The **Pearson correlation coefficient** ${\displaystyle \ \rho _{X,Y}}$ is defined as the *covariance* between <b>X</b> and <b>Y</b> divided by the product of the *standard deviations* of <b>X</b> and <b>Y</b>. If ${\displaystyle \ \rho _{X,Y}>0}$ the two variables are **positively correlated** or **directly correlated**. Instead, if ${\displaystyle \ \rho _{X,Y}<0}$ the two variables are **negatively correlated** or **inversely correlated**. Finally, if ${\displaystyle \ \rho _{X,Y}=0}$ the two variables are **not correlated**.
$$\rho_{X,Y} = \frac{Cov(X,Y)}{\sqrt{Var(X)*Var(Y)}} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}\sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}\,\,\,,\,\,{\displaystyle \ \rho _{X,Y} \in [-1, 1]}$$

where the **standard deviation** is defined as the square root of the **variance**, defined as follows:
$${\displaystyle Var(X) = \sigma _{X}^{2}=\mathbb {E} { [}{ (}X-\mathbb {E} [X]{ )}^{2}{ ]}}$$

Usually in statistics we prefer the *Pearson correlation coefficient* since it always lies in the range $[-1, 1]$, so it's independent of the scale of the variables.

Below we can observe the **Correlation Matrix** calculated using the *Pearson correlation coefficient*.
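To make the scale argument concrete, here is a small sketch that computes both quantities by hand from the definitions above. The `age` and `balance_eur` values are toy numbers chosen for illustration, not rows of the dataset.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: E[(X - E[X]) (Y - E[Y])]."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # Var(X) = Cov(X, X)
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Toy values, chosen only to illustrate the effect of rescaling.
age = [25, 35, 45, 55, 65]
balance_eur = [100, 300, 250, 600, 800]
balance_cents = [b * 100 for b in balance_eur]

# The covariance grows by a factor of 100, the correlation does not change.
print(covariance(age, balance_eur), covariance(age, balance_cents))
print(pearson(age, balance_eur), pearson(age, balance_cents))
```

Expressing the balance in cents multiplies the covariance by 100, while the correlation coefficient stays the same, which is why the latter is easier to compare across pairs of features.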

In [20]:
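import pandas as pd
# On plotly >= 4 this module lives in the separate chart_studio package
# (import chart_studio.plotly as py).
import plotly.plotly as py
import plotly.graph_objs as go

# One-hot encode the categorical columns (target included), then compute the
# pairwise Pearson correlations.
#data_corr = pd.get_dummies(data.loc[:, data.columns != 'Subscription']).corr()
data_corr = pd.get_dummies(data).corr()
# Use a separate name so that the `data` DataFrame is not overwritten.
heatmap = [go.Heatmap(z=data_corr.values.tolist(), x=list(data_corr), y=list(data_corr))]

layout = go.Layout(
    autosize=False,
    width=1000,
    height=1000,
    title='Correlation Matrix'
)

fig = go.Figure(data=heatmap, layout=layout)
py.iplot(fig)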

Unfortunately, from this graph we cannot find any really interesting correlation, apart from the obvious ones (for example, Education = *Tertiary* has a good positive correlation with Job = *Management*, and there is a positive correlation between *Age* and *Job_retired*).


# Preprocessing
---

In our dataset we can clearly see that there is a huge difference between the number of people who subscribed to a term deposit and the number of those who did not. In fact, 39922/45211 (88,3%) did not, against 5289/45211 (11,7%) who did: this is a very big imbalance, which can bias our models towards the majority class.

In [21]:
%%HTML
<div class='tableauPlaceholder' id='viz1547693912160' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;TotalYesNo&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Tableau_report_0&#47;TotalYesNo' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ta&#47;Tableau_report_0&#47;TotalYesNo&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1547693912160');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

In order to tackle this problem I chose the **random undersampling** technique: I kept as much information as I could from the minority class (all the 5289 samples) and, to level things out, I took the same number of samples from the other class by *random sampling without replacement*.
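The idea can be sketched in a few lines of plain Python. This is not the pipeline that was actually used for the report; the row layout, the `random_undersample` helper and the seed are assumptions made for illustration, and only the class counts (39922 and 5289) come from the dataset.

```python
import random

def random_undersample(rows, label_of, seed=0):
    """Keep every minority-class row and an equally sized random draw
    (without replacement) from each larger class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    minority_size = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        # random.sample draws without replacement, so no row is duplicated.
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)
    return balanced

# Placeholder rows with the same class counts as the dataset.
rows = [{"id": i, "Subscription": "no"} for i in range(39922)]
rows += [{"id": i, "Subscription": "yes"} for i in range(39922, 45211)]

balanced = random_undersample(rows, lambda r: r["Subscription"])
print(len(balanced))  # 2 * 5289 rows
```

The price of this approach is that most of the majority-class rows are discarded, which is acceptable here because more than five thousand samples per class remain.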


## Dealing with categorical data
---

As we could see from the initial overview of the dataset, we have many categorical attributes that are not directly usable with some of the algorithms that I will use. For this reason I decided to apply a technique called **dummy coding**, which maps each categorical value into a numerical representation. More precisely, it creates a new column for each value of each categorical feature, assigning a value of 1 only to the column representing the actual categorical value and 0 to all the others. So, for example, the *Month* attribute was transformed into 12 new columns, one for each value (*Month_jan*, *Month_feb*, etc.). Here's an image detailing how this method works.

![Dummy coding](https://cdn-images-1.medium.com/max/1600/1*pCnqiKj-Hrdn7FajO0sh1Q.png)
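The same mapping can be written out by hand to show what happens to each row. This is a sketch on three made-up rows; the `dummy_code` helper is hypothetical, and `pd.get_dummies`, which is used for the correlation matrix in this report, performs the equivalent transformation on a whole DataFrame.

```python
def dummy_code(rows, column):
    """Replace `column` with one 0/1 column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            # Only the column matching the row's actual value is set to 1.
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded

# Made-up rows for illustration.
toy = [
    {"Age": 30, "Month": "jan"},
    {"Age": 41, "Month": "feb"},
    {"Age": 52, "Month": "jan"},
]
for row in dummy_code(toy, "Month"):
    print(row)
```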

Here we can see the complete preprocessing pipeline that has been followed in almost all circumstances, with the exception of the Random Forest, where I did not perform the dummy coding, so that the resulting trees could be much more interpretable.


![Preprocessing image](https://i.imgur.com/vzQqE3q.png?1)

# Classification
---
Now it's time to actually explore the data in a more analytic way, trying to create models that predict whether a given customer is a potentially good one or not. I decided to apply four of the classification algorithms that we studied throughout the course:
- **Random forest**
- **Logistic regression**
- **k-Nearest Neighbor**
- **Support Vector Machines**

Along with these classification methods, I'd like to use **Principal Component Analysis** to create a new representation of the data, especially for space related algorithms (KNN and SVM).

### Performance metrics
---

Before exploring all the different classification methods, I will briefly describe all the metrics that have been taken into consideration to evaluate the performance of the models.

- **Accuracy**, is the ratio of correctly predicted observations to the total number of observations. $$Accuracy = \frac{TP+TN}{TN+TP+FN+FP} = \frac{\#\,correct\,predictions}{\#\,all\,samples}$$
- **Recall**, is the ratio of relevant instances that have been retrieved over the total amount of relevant instances. More simply, how many of the actual positives our model captures by labeling them as positive (True Positive). $$Recall = \frac{TP}{TP+FN} = \frac{\#\,correctly\,predicted\,POSITIVE\,classes}{\#\,all\,possible\,POSITIVE\,instances}$$
- **Precision**, is the ratio of relevant instances that have been retrieved over the total amount of retrieved instances. More simply, how precise the model is: out of the samples predicted as positive, how many are actually positive. $$Precision = \frac{TP}{TP+FP} = \frac{\#\,correctly\,predicted\,POSITIVE\,classes}{\#\,all\,POSITIVE\,predictions}$$
where: 
$$TP\,=\,True\,Positive\qquad FP\,=\,False\,Positive\qquad TN\,=\,True\,Negative\qquad FN\,=\,False\,Negative$$
- **F1 Score**, is the harmonic mean of Precision and Recall. $$F1\,Score = \frac{2\times Precision\times Recall}{Precision+Recall}$$
- **Area Under the Curve (AUC) Receiver Operating Characteristics (ROC)**, is a performance measurement for binary classification problems at various threshold settings, telling how capable the model is of distinguishing between the two classes. ![AUC_ROC](https://cdn-images-1.medium.com/max/1000/1*pk05QGzoWhCgRiiFbz-oKQ.png) Here is an example of what a ROC curve looks like (credits to Sarang Narkhede). TPR is the True Positive Rate (aka **Recall**, already defined) whereas FPR is the **False Positive Rate** defined as: $$FPR = 1-Specificity = \frac{FP}{TN+FP}$$ where: $$Specificity = \frac{TN}{TN+FP}$$The higher the value of AUC, the better our model is at distinguishing one class from the other.

*Accuracy* is very good for evaluating performance on balanced datasets, but with imbalanced datasets it can be very misleading. This is why, whenever we deal with imbalanced test sets, we will pay more attention to the *F1 Score*, because it gives a more useful indication of how the model performs. It is also the measure to look at when errors have different weights.
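The formulas above translate directly into code. The sketch below uses hypothetical confusion-matrix counts and scores (they are not results from this report); the `auc` helper relies on the standard interpretation of AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "fpr": fpr}

def auc(scores_pos, scores_neg):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print("auc:", auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a small check but not for the full dataset.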

### Model validation technique
---
In order to obtain a more reliable estimate of the performance, I decided to use **K-fold cross validation**. This technique helps to detect **underfitting** and **overfitting**, since every model is evaluated on data that it has not seen during training.

In *K-Folds Cross Validation* we split our training data (in my case 70% of the whole balanced dataset) into k different subsets (or folds). We then use k-1 subsets to train our model and leave the last subset (or the last fold) as validation data. We repeat this so that each fold is used once for validation, average the results over the folds, and then finalize our model.
<img src="https://cdn-images-1.medium.com/max/2400/1*J2B_bcbd1-s1kpWOu_FZrg.png" alt="K-fold cross val" width="800"/>

This method gives us a good *bias-variance* tradeoff in the estimate of the model's performance, provided that the <b>K</b> parameter is set properly. A value that is too low drifts towards a *validation-set* approach, that is, splitting the dataset in two halves and using one for training and the other for validation/testing, which leads to a <b>highly biased</b> estimate because each model sees only a fraction of the data. On the other hand, a value of K that is too high approaches *Leave-One-Out Cross Validation (LOOCV)*, the extreme case with K = N, where N is the cardinality of the dataset: at each iteration we use almost all the dataset for training and do validation/testing on one observation. In addition to being more computationally expensive, this leads to an estimate with **high variance**.

Given the former reasons, I decided to perform my experiments with `K=10`.
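The fold logic can be sketched as follows. This is an illustration of the technique, not the tool used for the experiments: `k_fold_indices`, `cross_validate` and `dummy_fit_and_score` are hypothetical names, and the scoring callback is a placeholder for "train on the k-1 folds, evaluate on the held-out fold".

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

def cross_validate(n_samples, fit_and_score, k=10):
    """Average the validation score returned by fit_and_score(train, valid)."""
    scores = [fit_and_score(train, valid)
              for train, valid in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

def dummy_fit_and_score(train_idx, valid_idx):
    # Placeholder: a real version would fit a model on train_idx
    # and return its score on valid_idx.
    return len(valid_idx) / (len(train_idx) + len(valid_idx))

splits = list(k_fold_indices(20, k=10))
print([len(valid) for _, valid in splits])
print(cross_validate(20, dummy_fit_and_score, k=10))
```

With `K=10`, each model is trained on 90% of the training data and validated on the remaining 10%.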

## Random forest
---

![RF](https://c.mql5.com/2/33/image1__1.png)

Random forest is currently one of the most popular and widely used machine learning algorithms; it leverages the clever idea of using an ensemble of *Decision trees*. At training time it creates a multitude of *Decision trees* (whose number is set as a hyperparameter), each built in a different way; at test time, when a classification is performed, the class predicted by most of the trees is assigned to the unlabelled sample. This diversity among the trees generally results in a better model.

More precisely, in RF we first apply the concept of **bagging** (also known as *bootstrap aggregating*): for each Decision Tree that we want to build, we randomly sample with replacement from the observations of the training set, virtually creating *new* training sets. By doing so, we allow the same samples to appear multiple times in the bootstrap set and some samples to be excluded: this improves the training phase, since it helps to reduce the variance of the model and to avoid overfitting, because each Decision Tree is built on a slightly different data distribution with respect to the other trees.

With this set of data, we build each of the Decision Trees considering just a random subset of features (and not all of them) to perform the splits: given $m$ features, we only consider $\sqrt {m}$ random ones to choose the attribute from. By doing so we significantly reduce the weight of those features that would otherwise be too influential. In fact, building trees that use different sets of features makes them less correlated with each other.

In order to classify a sample, the observation is given to all of the decision trees that make up the forest; then, each of them will yield a result and, by **majority vote**, the class that was most voted by all the trees is assigned to the sample.

Moreover, at the end of the training stage, we can identify the different importances of the features, so that we can recognize which features tend to be more informative than the others when it comes to discriminating one class from another.
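The three ingredients described above (bootstrap sampling, random feature subsets and majority voting) can be sketched independently of any tree implementation. The helper names and the seed are assumptions made for illustration; the feature names are taken from the dataset description.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, rng):
    """Pick about sqrt(m) candidate features for a split."""
    size = max(1, round(len(features) ** 0.5))
    return rng.sample(features, size)

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))  # stand-ins for training observations
features = ["Age", "Job", "Marital", "Education", "Default",
            "Balance", "Housing", "Loan", "Contact"]

sample = bootstrap_sample(rows, rng)
print("bootstrap sample:", sample)                 # some rows repeat
print("left out:", sorted(set(rows) - set(sample)))
print("candidate features:", random_feature_subset(features, rng))
print("forest prediction:", majority_vote(["yes", "no", "yes", "yes", "no"]))
```

The rows left out of a bootstrap sample are the ones that make each tree see a slightly different data distribution.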

### Building a decision tree
---

To create a decision tree we adopt a greedy strategy, because finding the optimal tree is *NP-hard*. At each step a binary split is computed by selecting the locally best attribute, so the final tree is a combination of local optima and is not necessarily the global optimum. But how do we find the *best attribute*?

In order to answer this question we have to introduce the concept of **Gini Index**. The *Gini index* is defined by $$G = \sum_{k=1}^{K}\hat{p}_{mk}(1 - \hat{p}_{mk})\,,\quad G \in \left[0, 1 - \frac{1}{K}\right]$$ where $\hat{p}_{mk}$ represents the proportion of training observations in the $m$-th region that are from the $k$-th class ($0 \le \hat{p}_{mk} \le 1$). This index expresses the degree of *purity* of a node: the lower the value, the more the observations in the node belong to a single class. To then evaluate the goodness of the split we calculate the Gini Split: $$GINI_{split} = \sum_{i=1}^M \frac{n_i}{n}G_i$$ where $M$ is the number of obtained nodes (in our case $M=2$), $n_i$ is the number of records at child $i$ and $n$ is the number of records in the parent node. So this second measure takes into consideration all the child nodes to evaluate the overall goodness of the split. The lower this value, the better the split is.
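As a quick check of the two formulas, the sketch below computes the Gini index of a node and the Gini Split of two candidate binary splits. The label lists are toy examples and the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_split(children):
    """Weighted average of the children's Gini indexes."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)

parent = ["yes"] * 5 + ["no"] * 5
mixed_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
pure_split = [["yes"] * 5, ["no"] * 5]

print("parent:", gini(parent))                 # perfectly mixed two-class node
print("mixed split:", gini_split(mixed_split))
print("pure split:", gini_split(pure_split))   # both children are pure
```

The pure split scores lower than the mixed one, so it would be the one selected by the greedy procedure.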

An alternative to the *Gini index* can be the **Entropy**, defined by $$D = -\sum_{k=1}^{K}\hat{p}_{mk}\text{log}\hat{p}_{mk}$$
Knowing the *entropy* we can calculate the **Information gain**, which measures the reduction in *entropy* achieved when the split is performed: our goal is to maximize this measure. The formula is $$IG = D_p - \sum_{i = 1}^k \frac{n_i}{n} D_i$$where $D_p$ is the entropy of the parent node $p$, $D_i$ is the entropy of partition $i$, $n$ is the number of samples in the parent node to be partitioned into the $k$ partitions that we want to obtain ($k=2$ in our case, since it's a binary split) and $n_i$ is the number of samples in partition $i$.

Information gain has the problem of being biased in favor of attributes with many values over those with few values (think about *Job*, which has many more possible values than *Marital status*): a split on *Job* produces many more small subsets, leading to a higher $IG$, but we don't know if this attribute is actually better than, for example, the binary ones. To overcome this drawback, there's the **Gain ratio**, which introduces a new term called *split information* that is sensitive to how broadly and uniformly the attribute splits the data $$SPLITinfo = -\sum_{i=1}^M \frac{n_i}{n}log\frac{n_i}{n}$$ This term adjusts the $IG$ by the entropy of the partitioning: a higher entropy of the partitioning (that is, a large number of small partitions) is penalized. *Gain ratio* is defined as follows $$GainRATIO = \frac{IG}{SPLITinfo}$$

To build a decision tree we iteratively search for the best attribute and value on which to split, perform the split, and repeat the procedure in the generated sub-trees. The process is halted whenever we reach a fixed maximum *depth* or when no split yields a certain minimum gain.
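The bias of the information gain, and the way the gain ratio corrects it, can be seen on a tiny example. The sketch below assumes base-2 logarithms (the formulas above do not fix the base) and uses toy label lists; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

def split_info(children):
    """Entropy of the partitioning itself (sizes only, labels ignored)."""
    n = sum(len(ch) for ch in children)
    return -sum(len(ch) / n * math.log2(len(ch) / n) for ch in children if ch)

def gain_ratio(parent, children):
    return information_gain(parent, children) / split_info(children)

parent = ["yes"] * 4 + ["no"] * 4
binary_split = [["yes"] * 4, ["no"] * 4]    # one clean binary split
many_way_split = [[label] for label in parent]  # one sample per partition

for name, split in [("binary", binary_split), ("many-way", many_way_split)]:
    print(name, information_gain(parent, split), gain_ratio(parent, split))
```

Both splits reach the same information gain because all their partitions are pure, but the gain ratio prefers the binary one, since the many-way split pays for its eight tiny partitions through a larger split information.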

There are different prepruning criteria for the expansion of the tree: 
- **Minimal Gain**, which is the minimum amount of Information Gain $IG$ that we want to obtain from a split. If $IG < minimal\, gain$ then we stop going any deeper with the tree in that branch. The higher this value, the fewer splits we will have, resulting in a smaller tree.
- **Minimal size for split**, which is the minimum number of samples in a node for it to be considered for a split. Again, the higher this value, the fewer nodes will be split, resulting in a smaller tree.
- **Minimal Leaf size**, which is the same as the size for split, but related to leaves.
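Taken together with the maximum depth, these criteria amount to a simple check performed before each split. The sketch below is one plausible way to combine them and is not the exact logic of the tool used for the experiments; the `should_split` helper is hypothetical, and its default values mirror the hyperparameters reported in this section.

```python
def should_split(n_samples, depth, best_gain, child_sizes,
                 max_depth=40, min_gain=0.001,
                 min_size_for_split=10, min_leaf_size=2):
    """Return True if a node may be split under the prepruning criteria."""
    if depth >= max_depth:
        return False                      # tree is already deep enough
    if n_samples < min_size_for_split:
        return False                      # node too small to be split
    if best_gain < min_gain:
        return False                      # best split is not informative enough
    if min(child_sizes) < min_leaf_size:
        return False                      # a resulting leaf would be too small
    return True

print(should_split(n_samples=100, depth=3, best_gain=0.05, child_sizes=[60, 40]))
print(should_split(n_samples=8, depth=3, best_gain=0.05, child_sizes=[4, 4]))
print(should_split(n_samples=100, depth=3, best_gain=0.0005, child_sizes=[60, 40]))
print(should_split(n_samples=100, depth=3, best_gain=0.05, child_sizes=[99, 1]))
```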

After performing a very intensive grid search to find the best hyperparameters of the Random Forest, I found the following values to be the best performing ones:
- `number of trees = 150`
- `criterion = gain_ratio`
- `maximal depth = 40`
- `minimal gain = 0.001`
- `minimal leaf size = 2`
- `minimal size for split = 10`

Here is a screenshot of one of the grid searches that I performed over some of the hyperparameters:

![Grid search](https://imgur.com/186xHjg.png)

## Process

In this case, before applying the algorithm, I did not perform dummy coding since trees can deal with categorical features and, instead, it's much better to keep them categorical so that the model is much more easily comprehensible: this, in fact, is one of the main advantages of tree-based models, the **interpretability**.

![RF PROCESS](https://imgur.com/4bq4KhB.png)
![RF Cross validation](https://imgur.com/GFTAfLj.png)

## Performance evaluation

### Validation set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>3032</td>
    <td>670</td>
  </tr>
  <tr>
    <th>True YES<br></th>
    <td>430</td>
    <td>3272</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>85.14%</td>
    <td>88.38%</td>
    <td>83.02%</td>
    <td>85.61%</td>
    <td>0.918</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/c3YZYfw.png" alt="ROC Random Forest valid" width="500"/>

### Test set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>1310</td>
    <td>277</td>
  </tr>
  <tr>
    <th>True YES<br></th>
    <td>192</td>
    <td>1395</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>85.22%</td>
    <td>87.90%</td>
    <td>83.43%</td>
    <td>85.61%</td>
    <td>0.920</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/unnHqf7.png" alt="ROC Random Forest test" width="500"/>

### Feature importance

Random forests also allow us to understand how influential each feature is for the final prediction.

In this plot we can observe the different weights of the features: the most important is the *duration*, followed by *age* and *balance*.

Here we can see the weights, where each weight is given by the sum of improvements the selection of a given attribute provided at a node. The amount of improvement is dependent on the chosen criterion: in our case the *information gain* is taken into consideration for the calculation of the values.

![RF importance](https://imgur.com/Uil17i6.png)

As we can see, the most important features are the numerical ones (since they have a higher variability and, thus, carry more information on their own), whereas categorical attributes with a low cardinality turn out to be less important. I personally did not expect such a result, since I thought that some attributes, even if binary (yes/no), would be very important for the final outcome.


## Logistic Regression
---

The Logistic Regression model is a supervised learning method that estimates the probability of the desired outcome (usually the positive class) given the input data; this probability is then used to perform classification.
We define $x_i$ as the $n$-dimensional feature vector of a given sample and $\beta_{0},\,\boldsymbol{\beta} = (\beta_{1}, ..., \beta_{n})^T$ as the model parameters. The dimensions are all the features of the dataset. The *logistic function* is defined as follows:

$$p(X) = P(Y=1\vert x_i)= \frac{e^{\,\beta_{0} + x_i^T\boldsymbol{\beta}}}{1+e^{\,\beta_{0} + x_i^T\boldsymbol{\beta}}} = \frac{1}{1+e^{\,-(\beta_{0} + x_i^T\boldsymbol{\beta})}}$$

where $p(X) = P(Y=1\vert x_i)\,,\,p(X) \in [0, 1]$ is the *conditional probability* of the class $Y=1$ (yes) to occur, given the observation $x_i$. This function is referred to as the **sigmoid function**. This allows us to always have values from 0 to 1 for the probability: as we can see from the graph, high values for a certain observation (along the x axis) will return us a value very close to 1, but never beyond that.

![Sigmoid](https://i0.wp.com/www.stokastik.in/wp-content/uploads/2017/07/sigmoid.png?resize=400%2C300)

As with all Machine Learning algorithms, the goal of training is to estimate the parameters so as to obtain accurate predictions. Since we do not know the true values of $\beta_{0}$ and $\boldsymbol{\beta}$, we estimate them from the data, obtaining $\hat{\beta}_{0}$ and $\boldsymbol{\hat{\beta}} = (\hat{\beta}_{1}, ..., \hat{\beta}_{n})^T$.

To do so we use the **likelihood function**, which is defined as follows:

$$\ell(\beta_{0},\boldsymbol{\beta}) = \prod\limits_{i: y_i=1}p(x_i)\prod\limits_{i': y_{i'}=0}(1-p(x_{i'}))$$

This is the function to be maximized. What the formula means is that, given $N$ samples with labels $0$ and $1$ (no and yes), we want to estimate $\hat{\beta}_{0}$ and $\boldsymbol{\hat{\beta}}$ so that:
- for all samples labelled as $1$, the predicted probability $\hat{p}(x_i)$ is as close to 1 as possible.
- for all samples labelled as $0$, the predicted probability $\hat{p}(x_{i'})$ is as close to 0 as possible, so that $1 - \hat{p}(x_{i'})$ is as close to 1 as possible.
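In practice the logarithm of the likelihood is usually maximized instead, since it turns the product into a sum and has the same maximizer. The sketch below evaluates it for two sets of hypothetical predicted probabilities, to show that confident and correct predictions score higher.

```python
import math


def log_likelihood(probabilities, labels):
    """Sum of log p_i for samples with y_i = 1 and log(1 - p_i) for y_i = 0."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(probabilities, labels)
    )


labels = [1, 1, 0, 0]
# Hypothetical predictions: confident and correct for every sample.
good_fit = log_likelihood([0.9, 0.8, 0.2, 0.1], labels)
# Hypothetical predictions: close to guessing, partly on the wrong side.
poor_fit = log_likelihood([0.6, 0.4, 0.5, 0.7], labels)
```

Here `good_fit` is larger (closer to zero) than `poor_fit`, which is exactly what the maximization rewards.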

Rearranging the *logistic function* we obtain the *odds*, the ratio between the probability that an event occurs and the probability that it does not:

$$odds = \frac{p(X)}{1-p(X)} = e^{\,\beta_{0} + x_i^T\boldsymbol{\beta}}\;\;,\;\;odds \in [0, \infty)$$
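
The calculation is short. Writing $z$ for the linear term in the exponent (a shorthand introduced here only for readability), the complement of the probability and the resulting ratio are:

$$1 - p(X) = 1 - \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{z}} \quad\Longrightarrow\quad \frac{p(X)}{1-p(X)} = \frac{e^{z}}{1+e^{z}}\cdot\left(1+e^{z}\right) = e^{z}$$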


Taking the logarithm of both sides of the odds gives the *logit* or *log-odds*, which is linear in the features and can take any real value:

$$logit = \log{\frac{p(X)}{1-p(X)}} = \beta_{0} + x_i^T\boldsymbol{\beta}\;\;,\;\;logit \in (-\infty, +\infty)$$

## Process

![LR PROCESS](https://imgur.com/8UoxXZW.png)
![LR cross validation](https://imgur.com/zYWKmm2.png)

## Performance evaluation

### Validation set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>3110</td>
    <td>592</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>687</td>
    <td>3015</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>82.73%</td>
    <td>81.44%</td>
    <td>83.60%</td>
    <td>82.49%</td>
    <td>0.908</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/ZnmCtBT.png" alt="ROC Logistic Regression Validation" width="500"/>
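
As a sanity check, the metrics can be recomputed from the four counts of the validation confusion matrix. The sketch assumes that each row of the matrix holds one true class (both rows sum to 3702) and each column one predicted class; under that reading the reported accuracy and recall are reproduced exactly, and precision and F1 to within about 0.01 percentage points.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 for the positive (YES) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Counts from the validation confusion matrix, read with rows as the true
# class and columns as the predicted class (an assumption, see the lead-in).
metrics = classification_metrics(tn=3110, fp=592, fn=687, tp=3015)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```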

### Test set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>1322</td>
    <td>265</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>277</td>
    <td>1310</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>82.55%</td>
    <td>81.72%</td>
    <td>83.17%</td>
    <td>82.86%</td>
    <td>0.909</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/HYKcrEv.png" alt="ROC Logistic Regression Test" width="500"/>


## Principal Component Analysis
---
Before exploring the last two algorithms used for classification, we have to introduce Principal Component Analysis, since the two models work on data produced by the PCA.

PCA is an unsupervised *dimensionality reduction* method that is often used for data visualization. When faced with a large set of correlated variables, it allows us to summarize the set with a smaller number of representative variables that collectively explain most of the variability in the original data. This reduces the dimensionality of the dataset and also makes it possible to display high-dimensional data in 2 or 3 dimensions. PCA can be seen as pursuing two equivalent goals:
- **maximize the variance** of the projected data, so that we capture as much information from the data as possible.
- **minimize the distance** between each data point and its projection, so that, by having projections as close to the actual data points as possible, we obtain a reliable new representation which is not too distorted.

<img src="http://alexhwilliams.info/itsneuronalblog/img/pca/pca_two_views.png" alt="PCA min max" width="500"/>


A *principal component* of a set of features $X_1,X_2,...,X_p$ is a **normalized linear combination** of the features
$$Z_i = \phi_{1i}X_1 + \phi_{2i}X_2 + ... + \phi_{pi}X_p$$
where $Z_i$ is the $i$-th principal component, $p$ is the number of features and $\phi_{ji}$ is the *loading* of feature $j$ on component $i$. The combination is *normalized* because of the constraint on the sum of the squared loadings: $\displaystyle\sum_{j=1}^p {\phi_{ji}^2} = 1$. This constraint is needed because otherwise arbitrarily large loadings would lead to an arbitrarily large variance. Equivalently, the constraint forces each loading vector $\phi_i$ to have unit length, i.e. to lie on the hypersphere of radius 1.

In order to obtain the values of the **loadings** of the **principal component loading vector** $\phi_{i} = (\phi_{1i}, \phi_{2i},...,\phi_{pi})^T$ we have to solve the optimization problem
$$\underset{\phi_{11}, \ ...,\  \phi_{p1}}{maximize} \bigg\{ \displaystyle{  \frac{1}{n} \sum_{j=1}^n{ \bigg( \sum_{d=1}^p{\phi_{d1}x^{j}_{d}}  \bigg)^2} } \bigg\}\,\, subject\,to \,\,\displaystyle\sum_{d=1}^p {\phi_{d1}^2} = 1$$
where $j$ indexes the $n$ observations and $d$ indexes the $p$ dimensions. The innermost sum computes the linear combination of the single data point $x^j$ with the loading vector (its projection on the first component), which is then squared; the outermost sum averages this quantity over all observations. When the features are centered, this average is the sample variance of the first component.

Geometrically, the loading vector defines the direction in feature space along which the data vary the most. After having found the first principal component, we calculate the next one by looking for the direction, orthogonal to the first, that has the maximum variance in the residual space. We repeat this for all PCs, so that the principal components end up **sorted** in descending order (by eigenvalue), from the most informative PC (the one with the highest explained variance) to the least informative. Mathematically, imposing that every loading vector is orthogonal to all the others makes the various $Z_i$ **uncorrelated**: this is crucial to be sure that by eliminating the least informative PCs we do not lose any of the information carried by the other PCs. Beware that we could not do the same with low-variance original features: they might not be informative by themselves, but they might be correlated with other variables, and eliminating them could result in a large information loss.

To calculate PCA on a dataset $X$ of size $n\times m$ ($n$ samples, $m$ features), we follow these steps:
- **Standardize** the data. Each value $x$ is transformed into $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of its feature, so that every feature ends up with mean $0$ and standard deviation $1$.
- Compute the **Covariance matrix** $Σ$ $$Σ = \displaystyle{\frac{1}{n} \sum_{i=1}^n\big(x_i-\overline{x}\big)\big(x_i-\overline{x}\big)^T}$$ where $\overline{x} = \displaystyle{\frac{1}{n} \sum_{i=1}^n x_i}$ is the *mean vector*, whose entries are the means of the feature columns of $X$. The *covariance matrix* $Σ$ is symmetric and positive semi-definite, which means that $z^TΣz \ge 0$ for any vector $z$, so all eigenvalues of $Σ$ are non-negative.
- Compute the corresponding *eigenvectors* and *eigenvalues* via **Eigen decomposition**. We can factor the covariance matrix as $$Σ=UΛU^T$$ where $U$ is the matrix containing the *eigenvectors* of $Σ$ and $Λ$ is a diagonal matrix with the *eigenvalues* on its diagonal. The *eigenvectors* express the directions of the spread of the data, whereas the *eigenvalues* indicate the magnitude of the spread.<br>We then select the $k$ eigenvectors from $U$ with the $k$ highest eigenvalues: each *eigenvector* is a principal component and the related *eigenvalue* is the variance that the component expresses.
- An alternative to computing the *Covariance matrix* and then performing *Eigen decomposition* is to perform **Singular Value Decomposition** directly on the standardized data: we can factor a matrix $X$ of shape $n \times m$ into $$X=UΣV^T$$ where, in this step only, $U$ is an $n \times n$ orthogonal matrix (**left singular vectors**), $Σ$ an $n \times m$ rectangular diagonal matrix (with the **singular values** on its diagonal) and $V^T$ an $m \times m$ orthogonal matrix (**right singular vectors**).<img src="https://imgur.com/iK980PK.png" alt="SVD" width="500"/><br>We can then calculate $$X^TX = (UΣV^T)^TUΣV^T = VΣ^TU^TUΣV^T = VΣ^TΣV^T$$ Since $X^TX$ is symmetric and positive semi-definite and, for centered data, equals $n$ times the covariance matrix, its eigenvectors are orthogonal and coincide with the right singular vectors contained in $V$, while its eigenvalues are non-negative and equal to the squared singular values $\sigma_j^2$. We then take the first $k$ PCs that we are interested in, sorting by $\sigma_j^2$ (which is proportional to the variance). We typically choose $k$ so that around $85\%$ of the variance is retained, measured by the ratio $$\frac{\sum_{j=1}^k \sigma_j^2}{\sum_{j=1}^m \sigma_j^2}$$ where $k$ is the number of components that are kept and $m$ is the total number of components.
- Lastly, after we have obtained $U_k$ with the first $k$ components (the first $k$ eigenvectors of the covariance matrix or, equivalently, the first $k$ columns of $V$ from the SVD), we project the dataset $X$ on the new feature space: $$Y_k = XU_k$$ where $Y_k$ has size $n \times k$, with $n$ the cardinality of the dataset, $X$ has size $n \times m$, with $m$ the number of features, and $U_k$ has size $m \times k$.
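
The steps can be sketched with NumPy as follows. This is an illustration on randomly generated data (an assumption, not the bank marketing dataset) and not the process actually used for the report; it also checks numerically that the eigendecomposition and SVD routes give the same variances.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 strongly correlated features.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Step 1: standardize every feature to zero mean and unit standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

# Step 2: covariance matrix (1/n convention) and its eigendecomposition.
cov = Z.T @ Z / n
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]            # re-sort in descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Alternative: SVD of the standardized data. The rows of Vt are the right
# singular vectors; squared singular values divided by n are the variances.
_, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
svd_variances = singular_values**2 / n

# Step 3: keep the first k components and project the data onto them.
k = 2
Y_k = Z @ eigenvectors[:, :k]  # shape (n, k)
explained = eigenvalues[:k].sum() / eigenvalues.sum()
```

The eigenvectors from the two routes agree up to sign, which is why either can be used for the projection.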


By performing these steps we obtain a new representation of our data that, if plotted, shows the highest variance along the PC1 (first Principal Component) and then the second highest variance along the PC2 axis. All the PCs are orthogonal to each other.

![PC_data_representation](https://i.imgur.com/v4azQzr.png)

Here is a plot of the *cumulative explained variance*, which shows how the overall explained variance grows with the number of PCs. From this plot we can look for a threshold that limits the number of PCs to use while still giving a reasonable representation of the data. The maximum number of PCs that can be computed is $min(\#\,features, \#\,samples)$.

<img src="https://i.imgur.com/RvER3dt.png" alt="Cumulative explained variance" width="700"/>

As we see from the plot, 18 PCs already account for $\sim85\%$ of the total variance of the dataset: by reducing the dimensionality from 50 to 18 we would save considerable memory and computation time without losing much information. It is also worth noticing that from PC36 onwards the PCs carry little to no information, which means that we could represent the dataset with these 36 PCs and still retain almost all the information in the original data.
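
Reading the threshold off the plot can also be done programmatically. The helper below is a minimal sketch with a hypothetical name and hypothetical eigenvalues (not those of the bank marketing dataset); it returns the smallest number of components whose cumulative explained-variance ratio reaches a given threshold.

```python
def components_for_threshold(variances, threshold=0.85):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`.

    `variances` must hold one value per component, sorted in descending order.
    """
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Hypothetical eigenvalues, used only to illustrate the selection rule.
example_variances = [4.0, 3.0, 2.0, 0.5, 0.3, 0.2]
k = components_for_threshold(example_variances)
```

With these values the first three components explain 90% of the variance, so `k` is 3 for the default 85% threshold.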

In the table below we can also see the **eigenvectors** (loading vectors) of the principal components. An *eigenvector* indicates, for each feature, how much it contributes to that specific principal component: the higher the absolute value, the higher the contribution. The values are in the range $[-1, 1]$. Also, if we plotted the features as vectors using these values, we might get a hint on the correlation between two features: a small angle between the two suggests that they are positively correlated, an angle close to $180^\circ$ suggests a negative correlation, and an angle close to $90^\circ$ suggests little correlation.

In our case, since we performed the dummy coding, for each eigenvector we will see that for binary features (housing yes/no, etc...) we have the same values in absolute value, but with opposite signs.

<img src="https://i.imgur.com/w5535Yn.png" alt="Eigenvectors PCA" width="800"/>

Here I also show the eigenvectors obtained when the categorical features are converted to numerical ones by simply mapping each value to a unique integer.

![Eigenvectors PCA unique integers](https://i.imgur.com/KPx2ask.png)

As a result we can still observe that the *Housing* feature plays a very important role for the contribution to the PC1.

### Process

This is the process executed every time we want to perform PCA

![PCA PROCESS](https://imgur.com/U6ccZ3c.png)


## Support Vector Machine
---

Support Vector Machine is an algorithm that finds the optimal hyperplane between the points of two classes, such that the distance from the decision boundary to the nearest points (called the *margin*) is maximized. Sometimes (actually most of the time) the classes are not linearly separable, so to overcome this problem we represent the data $\bf x$ in a higher dimensional space ${\bf x} \longrightarrow \phi({\bf x})$ in which they can be linearly separated, resulting in a non-linear boundary in the original space.

Before diving into a more detailed explanation of SVM, let's first have a look at the **Support Vector Classifier**. The *support vector classifier* aims at finding the hyperplane defined by $f(X) = {\beta}_0 + {\beta}_1x_1 + {\beta}_2x_2 + ... + {\beta}_dx_d = 0$. In a two-dimensional space the hyperplane is a flat one-dimensional subspace (a line), in a three-dimensional space it is a flat two-dimensional subspace (a plane), and so on. The hyperplane divides the space, and thus the data lying in it, into two separate groups, that is the two classes. In a linearly separable scenario with two classes $\{-1, +1\}$, a data point identified by $X = (x_1, x_2, ..., x_d)^T$ will be classified as:
- $+1$ if $f(X) > 0$, since the point $X$ lies on the side of class $(+1)$
- $-1$ if $f(X) < 0$, since the point $X$ lies on the side of class $(-1)$

Given such a scenario there are infinitely many hyperplanes that would do the job: how do we find the best one, then? Via the concept of **margin**. The margin is the minimum distance between the hyperplane and the training points of each class that are closest to it: these points are called *support vectors* because they "support" the hyperplane, also known as the *decision boundary*. Interestingly, the hyperplane depends directly on the support vectors, but not on the other data points: moving any of the other observations would not affect the separating hyperplane, provided that the movement does not cause the observation to cross the boundary set by the margin. The classifier that performs the task of finding the best hyperplane is called the **Maximal Margin Classifier** when the problem is linearly separable: the *Support Vector Classifier* is a generalization of it, which allows us to perform the task even in scenarios that are not linearly separable.

Consider the task of constructing the maximal margin hyperplane based on a set of $n$ training observations $x_1,...,x_n \in \mathbb{R}^d$ and associated class labels $y_1,...,y_n \in \{-1, 1\}$. Briefly, the maximal margin hyperplane is the solution to the optimization problem: $$\underset{\beta{}_0,\beta{}_1,..,\beta{}_d}{\mathrm{maximize}} \ \ M$$ $$subject\,to \displaystyle\sum_{j=1}^d {{\beta}_j}^2 = 1$$ $${y}_i(\beta_0 + \beta_1x_{i1} + ... + \beta_d{x}_{id})\ge M \ \ \forall i = 1,..., n$$  The constraints ensure that each observation is on the correct side of the hyperplane and at least a distance $M$ from it. Hence, $M$ represents the margin of our hyperplane, and the optimization problem chooses $\beta_0,\beta_1,...,\beta_d$ to maximize $M$.

In the more common case in which the data are not linearly separable we have to use **soft margins**: we "relax" the constraint that obliged each point to be on the right side of the margin, so that we can still perform the separation task even if some points are on the wrong side of the margin (still leading to a correct classification) or, in some cases, even on the wrong side of the hyperplane (leading to a wrong classification). This is done mathematically by adding one **slack variable** per sample, $\epsilon_1, ..., \epsilon_n$, which modifies the former optimization problem:
$$\underset{\beta{}_0,\beta{}_1,..,\beta{}_d,\,\epsilon_1,..,\epsilon_n}{\mathrm{maximize}} \ \ M$$ $$subject\,to \displaystyle\sum_{j=1}^d {{\beta}_j}^2 = 1$$ $${y}_i(\beta_0 + \beta_1x_{i1} + ... + \beta_d{x}_{id})\ge M(1-\epsilon_i) \ \ \forall i = 1,..., n$$ $$\epsilon_i \ge 0, \displaystyle\sum_{i=1}^n {{\epsilon}_i} \le C$$
As we can see, the constraints change slightly because of the introduction of the slack variables $\epsilon_i$ and of the parameter $C$. The meaning of $\epsilon_i$ is:
- if $\epsilon_i = 0$, the point $x_i$ is on the right side of the margin
- if $0 < \epsilon_i < 1$, the point $x_i$ is on the wrong side of the margin, but on the correct side of the hyperplane
- if $\epsilon_i > 1$, the point $x_i$ is on the wrong side of the hyperplane

On the other hand, $C$ is the upper bound on the sum of all the $\epsilon_i$, i.e. a budget for the total amount of margin violation that we tolerate. Since a misclassified point has $\epsilon_i > 1$, at most $C$ training points can lie on the wrong side of the hyperplane. If $C = 0$ there is no slack, so no point is allowed to be on the wrong side of the margin (we are back to the *Maximal Margin Classifier*, with a **hard margin**); if $C > 0$, the greater the value, the higher the slack and, consequently, the more "relaxed" the margin (**soft margin**).
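
The three cases for $\epsilon_i$ can be illustrated numerically. The sketch below computes the smallest slack that satisfies the relaxed constraint for a given hyperplane; the hyperplane, the margin and the points are hypothetical values, and treating $\epsilon_i = 1$ (a point exactly on the hyperplane) together with the second case is a choice made only for this illustration.

```python
def slack(x, y, beta0, beta, margin):
    """Smallest epsilon >= 0 satisfying y * f(x) >= margin * (1 - epsilon)."""
    f_x = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return max(0.0, 1.0 - y * f_x / margin)


def describe(epsilon):
    """Map a slack value to one of the three cases in the list."""
    if epsilon == 0:
        return "right side of the margin"
    if epsilon <= 1:  # epsilon == 1 means the point lies exactly on the hyperplane
        return "wrong side of the margin, correct side of the hyperplane"
    return "wrong side of the hyperplane"


# Hypothetical hyperplane x1 = 0 with unit-norm beta and margin M = 1.
beta0, beta, M = 0.0, [1.0, 0.0], 1.0
points = [([2.0, 0.0], +1), ([0.5, 1.0], +1), ([-0.5, 0.0], +1)]
slacks = [slack(x, y, beta0, beta, M) for x, y in points]
cases = [describe(epsilon) for epsilon in slacks]
total_slack = sum(slacks)  # the constraint requires this not to exceed C
```

With these points the total slack is 2.0, so this hyperplane would only be feasible for a budget $C \ge 2$.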

Now we can go a step further and see that the linear *Support Vector Classifier* can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i \rangle$$
where there are $n$ parameters $\alpha_i$, one per training observation, which are non-zero only for the support vectors: for this reason, if we define $S$ as the set of indices of the support vectors, we can rewrite the solution function in the form
$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle$$
which usually involves far fewer terms than the previous one. Moreover, we can replace the inner product $\langle x, x_i \rangle$ with a generalization of it, called a *kernel* $K(x, x_i)$, which is a function that quantifies the similarity of two observations: as long as there is some higher dimensional space in which $K$ corresponds to the dot product between $\phi({\bf x}_i)$ and $\phi({\bf x}_j)$, it can be used for the classification without ever computing $\phi$ explicitly. Here are some examples of the most used kernels:
- Linear kernel: $ K({\bf x}_1,{\bf x}_2) = {\bf x}_1 \cdot {\bf x}_2$
- Radial Basis Function: $ K({\bf x}_1,{\bf x}_2) = e^{-\gamma({\Vert {\bf x}_1 - {\bf x}_2 \Vert}^2)}$
- Polynomial kernel: $ K({\bf x}_1,{\bf x}_2) = ({\bf x}_1\cdot{\bf x}_2+1)^d$
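
The three kernels and the kernelized decision function can be written in a few lines. This is a minimal sketch with hypothetical support vectors and coefficients; it is not the solver itself, which would be needed to find the $\alpha_i$ and $\beta_0$.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def rbf_kernel(a, b, gamma):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(a, b, degree):
    return (dot(a, b) + 1) ** degree


def decision_function(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum over the support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(
        alpha * kernel(x, sv) for alpha, sv in zip(alphas, support_vectors)
    )


# Hypothetical points and coefficients, used only to exercise the functions.
a, b = [1.0, 2.0], [3.0, 0.0]
score = decision_function(a, [a, b], [1.0, -1.0], 0.5, linear_kernel)
```

Swapping `linear_kernel` for, say, `lambda u, v: rbf_kernel(u, v, gamma=0.008)` changes the shape of the decision boundary without changing the form of `decision_function`.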

Since SVM relies on the geometry of the feature space to find the boundaries, I applied PCA before using the model.

After performing a grid search on the hyperparameters and the different available kernels, the best ones were:
- `C = 10`
- `gamma = 0.008`
- `kernel = rbf`

## Process

![SVM PROCESS](https://imgur.com/9AlNff3.png)
![SVM cross validation](https://imgur.com/EmIMtWU.png)

## Performance evaluation

### Validation set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>3060</td>
    <td>642</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>609</td>
    <td>3093</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>83.10%</td>
    <td>83.55%</td>
    <td>82.83%</td>
    <td>83.14%</td>
    <td>0.908</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/F2HroqP.png" alt="ROC SVM valid" width="500"/>

### Test set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>1299</td>
    <td>288</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>221</td>
    <td>1366</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>83.96%</td>
    <td>86.07%</td>
    <td>82.59%</td>
    <td>84.29%</td>
    <td>0.917</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/5Q3NCTu.png" alt="ROC SVM test" width="500"/>

## K Nearest Neighbors
---

![KNN](https://cdn-images-1.medium.com/max/1600/0*Sk18h9op6uK9EpT8.)

This technique is instance-based, meaning that we use the stored training records to predict the class label of new samples. The parameter **K** is the number of closest points (nearest neighbors) used to perform the classification. This model has the advantage that it requires no training: we simply use the stored data to compute the label. The drawback is that, for a dataset with high cardinality, classification can take a long time, since each prediction requires computing the distance to every stored sample.

The steps to assign a label to a new sample are:
- Compute the distance to all the samples in the dataset
- Identify the K nearest neighbors
- Assign the majority class among them to the sample

In order to calculate the distance between two points $p$ and $q$ we can use the Euclidean distance, defined as:
$$d(p, q) = \sqrt{\sum_{i=1}^N (p_{i} - q_{i})^2}$$

There is a variation of the K-NN algorithm where, instead of just taking the K closest points and voting for the class, we use a weighted vote, introducing the weight $w = \frac{1}{d^2}$ so that closer points have a bigger relevance for the final decision of the class.
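
The three steps and the weighted variation fit in a short function. The sketch below is a plain-Python illustration on hypothetical toy points, not the implementation used for the report; the small constant that guards against a zero distance is an assumption added for robustness.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train, query, k, weighted=False):
    """Predict the label of `query` from `train`, a list of (point, label) pairs."""
    # Steps 1 and 2: compute all the distances and keep the k nearest neighbors.
    neighbors = sorted(
        (euclidean(point, query), label) for point, label in train
    )[:k]
    # Step 3: majority vote, optionally weighted by 1 / d^2.
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weighted:
            # Guard against division by zero when the query equals a stored point.
            votes[label] += 1.0 / max(distance**2, 1e-12)
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical toy data: two "no" points near the query, three "yes" points far away.
train = [
    ([0.0, 0.0], "no"),
    ([0.0, 1.0], "no"),
    ([3.0, 3.0], "yes"),
    ([3.0, 4.0], "yes"),
    ([4.0, 3.0], "yes"),
]
plain_vote = knn_predict(train, [1.0, 1.0], k=5)
weighted_vote = knn_predict(train, [1.0, 1.0], k=5, weighted=True)
```

With `k = 5` the plain vote follows the three distant "yes" points, while the weighted vote follows the two nearby "no" points, which shows how the weighting reduces the influence of far neighbors when K is large.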

Since this algorithm looks at the spatial distance, I applied PCA to the whole dataset first, and then ran the algorithm with a value of `k = 37`, which I found to be the best performing one after doing a grid search. It is worth noticing that the parameter K is very important: if it is too small we give too much importance to noise points, if it is too big the neighborhood might include too many points from other classes.

## Process

![KNN PROCESS](https://imgur.com/OXxRQyw.png)
![kNN cross validation](https://imgur.com/EmIMtWU.png)

## Performance evaluation

### Validation set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>3166</td>
    <td>536</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>1334</td>
    <td>2368</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>74.74%</td>
    <td>63.96%</td>
    <td>81.61%</td>
    <td>71.68%</td>
    <td>0.842</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/dpvEQgl.png" alt="ROC KNN valid" width="500"/>

### Test set
<center><b>Confusion Matrix</b></center>
<table>
  <tr>
    <th></th>
    <th>Predicted NO</th>
    <th>Predicted YES</th>
  </tr>
  <tr>
    <th>True NO</th>
    <td>1363</td>
    <td>224</td>
  </tr>
  <tr>
    <th>True YES</th>
    <td>551</td>
    <td>1036</td>
  </tr>
</table>
<br>
<center><b>Performance</b></center>
<table>
  <tr>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
  <tr>
    <td>75.58%</td>
    <td>65.28%</td>
    <td>82.22%</td>
    <td>72.78%</td>
    <td>0.846</td>
  </tr>
</table>
<center><br><b>ROC</b></center><img src="https://imgur.com/IBzP00J.png" alt="ROC KNN test" width="500"/>



## Model comparison
---

After having seen in detail each of the different models used to perform classification, it's time to see which performed the best on the test set.

<table>
  <tr>
      <th></th>
    <th>Accuracy</th>
    <th>Recall</th>
    <th>Precision</th>
    <th>F1 Score</th>
    <th>AUC</th>
  </tr>
    <tr>
    <th>Random Forest</th>
    <td>85.22%</td>
    <td>87.90%</td>
    <td>83.43%</td>
    <td>85.61%</td>
    <td>0.920</td>
    </tr>
    <tr>
    <th>Support Vector Machine</th>
    <td>83.96%</td>
    <td>86.07%</td>
    <td>82.59%</td>
    <td>84.29%</td>
    <td>0.917</td>
    </tr>
    <tr>
    <th>Logistic Regression</th>
    <td>82.55%</td>
    <td>81.72%</td>
    <td>83.17%</td>
    <td>82.86%</td>
    <td>0.909</td>
    </tr>
    <tr>
    <th>K-Nearest Neighbors</th>
    <td>75.58%</td>
    <td>65.28%</td>
    <td>82.22%</td>
    <td>72.78%</td>
    <td>0.846</td>
    </tr>
</table>

As we can see the best performing model is Random Forest followed by SVM and Logistic Regression.

The two models that used PCA did not perform as well (although they did not perform badly either), maybe because for this dataset, which has many categorical attributes and few numerical ones, distance-based algorithms are not the best choice, and it is better to stick with models that use the data as they are, just as Decision Trees do. In fact, while I did not have big expectations for KNN, I thought that SVM would do better, since it is usually a very robust algorithm that works well in many scenarios: here it was not bad, but not great either.

Here is also the plot comparing all the different ROCs, where we see that the best performing models have a higher Area Under the Curve.

SVM vs. KNN (using PCA)
<img src="https://imgur.com/EheIfzR.png" alt="ROC comparison"/>


Logistic Regression vs. Random Forest (not using PCA)
<img src="https://imgur.com/hQEISc5.png" alt="ROC comparison"/>


## Conclusions
---

We could see from this study that, apart from KNN, all algorithms perform quite well. The best one is Random Forest and, for this case study, I think it would be the way to go if the bank decided to use these models for decision making. In fact, even if RF does not produce a decision instantaneously, deciding whether a client is a good prospect is not something that must be done in real time, so we can afford to wait for the computation. Moreover, the model is easy to interpret, so even a manager can quickly identify the most influential attributes to look at in the future. In other contexts SVM might be a better choice, since it is quicker at classifying while still yielding pretty good accuracy.

I expected RF to perform very well because my dataset has many categorical labels which a tree-based model can treat very easily, without the need of dummy variables.