# Bank Predictions

# Step 1: Define

### Overall Problem and Motivation

The goal of the analysis is to predict whether a client will subscribe to a [term deposit](https://www.investopedia.com/terms/t/termdeposit.asp) for a banking institution. Banks want term deposits so that they have a more consistent stream of capital to fund other investments they wish to make. Clients who buy term deposits are guaranteed a fixed, relatively low interest rate. Compared with a normal checking or savings account, the interest rate is higher, but there is a penalty if the client wishes to withdraw their money prematurely. Term deposits are a low-risk investment for the client, but the reward is appropriately low.

Banks would like to know if there are trends in which clients buy term deposits so that they can focus their time and resources contacting those potential investors rather than some who might never want to make this type of investment. I want to find the most significant traits which would lead a client to subscribe to a term deposit.

One of the biggest influences on term deposits is interest rates. In general, the higher the interest rates, the more the client will earn from the term deposit. Conversely, when interest rates drop, then the economy is generally doing better. Potential clients might see more potential gains through the stock market rather than term deposits with a bank. 

### The Data

The data was provided by

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

and can be found [here](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#)


The file I will be utilizing is called bank-additional-full.csv. It contains 41,188 clients and 20 features. It has additional features and more clients than the dataset which was originally created for this analysis. The 20 features are:
* age (numeric)
* job: Type of job (categorical, might end up being ordinal)
    - 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'
* marital: marital status (categorical)
    - 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed
* education: (categorical, might be ordinal)
    - 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'
* default: is credit in default or not (categorical)
* housing: has a housing loan or not (categorical)
* loan: has a personal loan or not (categorical)
* contact: contact communication type (categorical)
    - 'cellular','telephone'
* month: last contact month of the year (categorical)
    - 'jan', 'feb', 'mar', ..., 'nov', 'dec'
* day_of_week: last contact day of week (categorical)
    - 'mon', 'tue', 'wed', 'thu', 'fri'
* duration: duration of last contact (I will explain why this feature needs to be removed)
* campaign: number of contacts performed during this campaign for this client (numeric)
    - includes last contact with the client
* pdays: number of days that passed by after the client was last contacted from a __previous__ campaign (numeric)
    - 999 means the client was never previously contacted
* previous: number of contacts performed before this campaign and for this client (numeric, includes last contact)
* poutcome: outcome of the previous marketing campaign (categorical)
    -  'failure', 'nonexistent', 'success'
    
Note that the rest of the feature variables shown below are social and economic context attributes
* emp.var.rate: employment variation rate - quarterly indicator (numeric)
* cons.price.idx: consumer price index - monthly indicator (numeric)
* cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* euribor3m: euribor 3 month rate - daily indicator (numeric)
* nr.employed: number of employees - quarterly indicator (numeric)

The target variable is:
* y: has the client subscribed to a term deposit (binary)
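
As a rough sketch of how these features might be read in, the snippet below parses two made-up rows and converts the `pdays` sentinel 999 into `None`. I am assuming the file is semicolon-delimited with a header row and that the target is stored as 'yes'/'no'. The sample rows, the reduced column set, and the helper name `parse_row` are illustrative and not taken from the dataset.

```python
import csv
import io

# Two made-up rows with a subset of the columns (assumption: the real file
# is semicolon-delimited with a header row and a 'yes'/'no' target).
SAMPLE = """age;job;pdays;previous;poutcome;y
56;housemaid;999;0;nonexistent;no
37;services;6;1;success;yes
"""

NEVER_CONTACTED = 999  # sentinel described in the feature list


def parse_row(row):
    """Convert numeric fields and map the pdays sentinel to None."""
    pdays = int(row["pdays"])
    return {
        "age": int(row["age"]),
        "job": row["job"],
        "pdays": None if pdays == NEVER_CONTACTED else pdays,
        "previous": int(row["previous"]),
        "poutcome": row["poutcome"],
        "subscribed": 1 if row["y"] == "yes" else 0,
    }


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter=";")
clients = [parse_row(row) for row in reader]
print(clients[0]["pdays"], clients[1]["pdays"])
```

Mapping 999 to a missing value keeps a model from treating "never contacted" as a very long gap of 999 days.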


#### Dropping 'duration'
The variable *duration* will need to be removed since it is known only after the phone call to the client has ended. This information cannot possibly be known before the call is made, and therefore cannot be used to predict whether the call will be successful. The goal of the model is to predict the success of a phone call __before__ the call has been made.
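
A minimal sketch of this step, assuming each client is held as a plain dictionary of features. The record below and the names `LEAKY_FEATURES` and `drop_leaky` are hypothetical and used only for illustration.

```python
# Features that are only known after the call has ended.
LEAKY_FEATURES = {"duration"}


def drop_leaky(row):
    """Return a copy of the row without features unavailable before the call."""
    return {name: value for name, value in row.items() if name not in LEAKY_FEATURES}


# Hypothetical client record for illustration.
client = {"age": 41, "contact": "cellular", "duration": 261, "campaign": 1}
features = drop_leaky(client)
print(sorted(features))
```

If the data were held in a pandas DataFrame instead, `DataFrame.drop(columns=["duration"])` would do the same job.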

#### Explaining employment variation rate ("emp.var.rate")
Employment variation rate tracks how much companies are hiring during a given quarter. This metric can be viewed as proportional to how companies view the economy at each quarter. If the economy is perceived to be doing well, then the *emp.var.rate* will go up and vice versa. It can then be inferred that if the *emp.var.rate* is up, interest rates will be down, and so will the rate at which clients subscribe to term deposits.

#### Explaining consumer price index ("cons.price.idx")
[Consumer Price Index (CPI)](https://www.investopedia.com/terms/c/consumerpriceindex.asp) measures the average change over time in the prices that consumers pay for a basket of goods and services. CPI is a monthly indicator used to track inflation in a country. Any sudden change in CPI can be disastrous for an economy, causing either hyperinflation or severe deflation. The CPI is a key indicator of changes in the interest rate, a relationship which banks generally hold to be inversely proportional. Further reading on the relationship between inflation and interest rates can be found [here](https://www.investopedia.com/ask/answers/12/inflation-interest-rate-relationship.asp).

#### Explaining consumer confidence index ("cons.conf.idx")
The [Consumer Confidence Index (CCI)](https://www.investopedia.com/terms/c/cci.asp) measures how *consumers* feel about the near future of the economy on a monthly basis. It tries to predict whether consumers will have faith in the market and spend, or whether they will be skeptical and save. When consumers have faith in the market, it can be reasoned that the economy will generally grow, making interest rates fall. This produces an inversely proportional relationship between interest rates and CCI, which implies an inversely proportional relationship between CCI and term deposit subscription rates. For reference, the CCI hit record lows after the 2008 housing market collapse.

#### Explaining euribor 3 month rate ("euribor3m")
The euribor 3 month rate is the interest rate at which a subset of European banks lend one another funds with a 3 month maturity. This rate is used to inform European banks about the interbank interest rates in the rest of Europe. If there are significant changes in the euribor 3 month rate, then there is a high likelihood that interest rates across Europe are changing as well. More in-depth detail on Euribor can be found [here](https://www.euribor-rates.eu/what-is-euribor.asp).

### Evaluation Metric
The evaluation metric I will be using is __Area Under the Curve (AUC)__, which is the area underneath the Receiver Operating Characteristic (ROC) curve. This curve shows how the model's *recall* varies with its *specificity*. Recall measures the model's ability to predict a positive outcome. In this problem, it is the fraction of subscribed clients who are correctly predicted. Specificity measures the model's ability to predict the negative outcomes. In this problem, it is the fraction of clients who did not subscribe that were correctly predicted to ignore the advertisement campaign. The formulas for both measurements are:

$$
\begin{equation}
recall = \frac{\Sigma \text{TruePositive}}{\Sigma \text{TruePositive} + \Sigma \text{FalseNegative}}
\label{eq:recall}
\tag{1}
\end{equation}
$$

$$
\begin{equation}
specificity = \frac{\Sigma \text{TrueNegative}}{\Sigma \text{TrueNegative} + \Sigma \text{FalsePositive}}
\label{eq:specificity}
\tag{2}
\end{equation}
$$

Two things to note are that recall and specificity both range from 0 to 1, and that there is a trade-off between them. A simple model which always predicted positive would have a recall of 1 and a specificity of 0. The inverse model, which always predicted negative, would have a recall of 0 and a specificity of 1. The absolute perfect model would have both recall and specificity of 1.
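
The two extreme models can be checked directly against equations (1) and (2). The sketch below is illustrative only: the labels are made up and `recall_specificity` is a hypothetical helper name.

```python
def recall_specificity(y_true, y_pred):
    """Compute recall and specificity from binary labels (1 = subscribed).

    Assumes y_true contains at least one positive and one negative,
    otherwise one of the denominators would be zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up outcomes: 3 subscribers among 8 clients.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
always_positive = [1] * len(y_true)
always_negative = [0] * len(y_true)

print(recall_specificity(y_true, always_positive))  # (1.0, 0.0)
print(recall_specificity(y_true, always_negative))  # (0.0, 1.0)
```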

The AUC of a model which randomly guesses positives and negatives would be 0.5, since it would rank a subscribing client above a non-subscribing one only about half of the time. This is a useful baseline against which to compare classification models. The absolute perfect model would have an AUC of 1.
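
To illustrate both reference points, here is a small sketch that computes AUC through its pairwise ranking interpretation: the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as half. This is equivalent to the area under the ROC curve, but it is not necessarily how a library would compute it. The synthetic labels, the 10% positive rate, and the function name `auc` are assumptions made for the example.

```python
import random


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
# Synthetic labels with roughly 10% positives (an arbitrary choice).
y_true = [1 if rng.random() < 0.1 else 0 for _ in range(2000)]

random_scores = [rng.random() for _ in y_true]   # scores unrelated to labels
perfect_scores = [float(t) for t in y_true]      # every positive outranks every negative

print(round(auc(y_true, random_scores), 2))  # close to 0.5
print(auc(y_true, perfect_scores))           # 1.0
```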

I am using AUC as a metric instead of standard accuracy because AUC gives a more in-depth analysis of the classification results by comparing recall and specificity. Accuracy can vary heavily with the data given, especially in marketing campaigns where the target is generally skewed towards negative responses. A simple example would be a marketing campaign with a 98% rejection rate. A simple model classifying all instances as rejected would have an accuracy of 98%, which sounds great except for the fact that the model is deeply flawed: it never identifies a single subscriber.
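
The 98% example can be worked through in a few lines. The counts below are made up to match that rejection rate.

```python
# Made-up campaign: 2 subscribers and 98 rejections.
y_true = [1] * 2 + [0] * 98
# A model that classifies every client as rejected.
y_pred = [0] * len(y_true)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(accuracy, recall)  # 0.98 0.0
```

The accuracy looks excellent while the recall shows that no subscriber was found.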

### Output
The output of my model will be a csv file named "subscription_predictions.csv" containing a single column filled with 1s and 0s. The 1s will represent the clients who are predicted to subscribe to a term deposit during this marketing campaign. The 0s will represent the clients who are predicted not to subscribe.
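
A minimal sketch of writing such a file with the standard `csv` module. The predictions are placeholders, the file is written to a temporary directory for the example, and leaving out a header row is my assumption since the format above only specifies a single column.

```python
import csv
import os
import tempfile

# Placeholder predictions; the real ones would come from the model.
predictions = [0, 1, 0, 0, 1]

# Written to a temporary directory here so the example leaves no stray files.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "subscription_predictions.csv")

with open(out_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    for label in predictions:
        writer.writerow([label])  # one prediction per row, no header

# Read the file back to confirm the round trip.
with open(out_path, newline="") as handle:
    written = [int(row[0]) for row in csv.reader(handle)]

print(written == predictions)
```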