# Bank Marketing

The <b>bank-marketing.csv</b> data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to. The ultimate goal is to predict whether the client will subscribe to a term deposit (variable y). This is a classic binary classification problem: the task is to distinguish clients who will subscribe from those who won't.

Dataset reference:
- S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

#### Variable description:

- 1 age (numeric)
- 2 job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services") 
- 3 marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
- 4 education (categorical: "unknown","secondary","primary","tertiary")
- 5 default: has credit in default? (binary: "yes","no")
- 6 balance: average yearly balance, in euros (numeric) 
- 7 housing: has housing loan? (binary: "yes","no")
- 8 loan: has personal loan? (binary: "yes","no")
   
#### Related to the last contact of the current campaign:
- 9 contact: contact communication type (categorical: "unknown","telephone","cellular") 
- 10 day: last contact day of the month (numeric)
- 11 month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- 12 duration: last contact duration, in seconds (numeric)

#### other attributes:
- 13 campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- 14 pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- 15 previous: number of contacts performed before this campaign and for this client (numeric)
- 16 poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):
- 17 y - has the client subscribed to a term deposit? (binary: "yes","no")

In [1]:
import pandas as pd

In [11]:
from __future__ import division #Using Python 2, so importing this to ensure division allows for floats

In [3]:
df = pd.read_csv('bank-marketing.csv')

In [7]:
df

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
5,35,management,single,tertiary,no,747,no,no,cellular,23,feb,141,2,176,3,failure,no
6,36,self-employed,married,tertiary,no,307,yes,no,cellular,14,may,341,1,330,2,other,no
7,39,technician,married,secondary,no,147,yes,no,cellular,6,may,151,2,-1,0,unknown,no
8,41,entrepreneur,married,tertiary,no,221,yes,no,unknown,14,may,57,2,-1,0,unknown,no
9,43,services,married,primary,no,-88,yes,yes,cellular,17,apr,313,1,147,2,failure,no


### Read the dataset and answer the following questions.

### Question

Extract all column names. Count the number of columns (using code).

In [4]:
list(df.columns) #List of Column Names

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'y']

In [5]:
len(df.columns) #Counts the number of Columns

17

In [6]:
df.info() # Confirm Column Names & number of columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age          4521 non-null int64
job          4521 non-null object
marital      4521 non-null object
education    4521 non-null object
default      4521 non-null object
balance      4521 non-null int64
housing      4521 non-null object
loan         4521 non-null object
contact      4521 non-null object
day          4521 non-null int64
month        4521 non-null object
duration     4521 non-null int64
campaign     4521 non-null int64
pdays        4521 non-null int64
previous     4521 non-null int64
poutcome     4521 non-null object
y            4521 non-null object
dtypes: int64(7), object(10)
memory usage: 600.5+ KB


### Question

Is data in the correct format? By that we mean, do you see integers and floats where you expect them to be the case? If not, convert them into the correct format. Also make sure that all entries are non-null.

In [None]:
#Yes, the data is in the correct format. The seven numeric columns are int64, the ten categorical columns are object, and df.info() reports 4521 non-null entries for every column.
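The check above was done by reading the `df.info()` output. A minimal sketch of doing the same check in code is given here, using a small made-up frame (the `toy` values are illustrative and not taken from bank-marketing.csv) in which one numeric column has deliberately been stored as text:

```python
import pandas as pd

# Made-up rows for illustration; 'balance' is stored as text on purpose
toy = pd.DataFrame({
    'age': [30, 33, 35],
    'balance': ['1787', '4789', '1350'],
    'job': ['unemployed', 'services', 'management'],
})

# Count missing values per column; all zeros means every entry is non-null
null_counts = toy.isnull().sum()

# Convert a numeric column that was read as text into a numeric dtype
toy['balance'] = pd.to_numeric(toy['balance'])

print(null_counts)
print(toy.dtypes)
```

On the real data no conversion was needed, so only the null count would apply.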

### Question

Provide a general summary statistics of the entire dataset. The describe() method is what you would want to use.

In [8]:
df.describe()

,age,balance,day,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,15.915284,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,8.247667,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,1.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,9.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,16.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,21.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,31.0,3025.0,50.0,871.0,25.0


### Question

The data type of columns like job, marital, education etc. is called categorical data. List all the different categories in the job column.

In [9]:
list(df.job.unique())

['unemployed',
 'services',
 'management',
 'blue-collar',
 'self-employed',
 'technician',
 'entrepreneur',
 'admin',
 'student',
 'housemaid',
 'retired',
 'unknown']

### Question

In one line of code, provide a count of the number of people who were unemployed and owned a home.

In [13]:
len(df.loc[(df.job == 'unemployed') & (df.housing == 'yes')]) #'housing' flags a housing loan; used here as a proxy for owning a home

58
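The same kind of count can also be obtained by summing the boolean mask directly. A small sketch on made-up rows (the `toy` frame and its values are illustrative only):

```python
import pandas as pd

# Made-up rows for illustration; not taken from bank-marketing.csv
toy = pd.DataFrame({
    'job': ['unemployed', 'unemployed', 'unemployed', 'services'],
    'housing': ['yes', 'yes', 'no', 'yes'],
})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (toy.job == 'unemployed') & (toy.housing == 'yes')

# True counts as 1, so summing the mask counts the matching rows
count = int(mask.sum())
print(count)
```

Summing the mask avoids building the intermediate filtered frame, which is a matter of taste at this data size.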

### Question

What is the education level of a typical blue-collar worker? Explore value_counts() method.

In [14]:
df2 = df.loc[df.job == 'blue-collar'] #Extract a new data frame consisting of blue-collar workers

In [15]:
df2.education.value_counts() #Return count of education level of blue-collar workers

secondary    524
primary      369
unknown       41
tertiary      12
Name: education, dtype: int64

In [None]:
#A typical blue-collar worker has secondary education 

In [19]:
df2.education.value_counts(normalize=True)*100 #Using secondary because it covers over 55% of blue-collar workers

secondary    55.391121
primary      39.006342
unknown       4.334038
tertiary      1.268499
Name: education, dtype: float64
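Reading the top row of `value_counts()` gives the answer by eye. If the most common category is needed as a value in code, one option is `idxmax()` on the counts. A minimal sketch on made-up rows (the `toy` frame is illustrative, not the real data):

```python
import pandas as pd

# Made-up rows for illustration; not taken from bank-marketing.csv
toy = pd.DataFrame({
    'job': ['blue-collar', 'blue-collar', 'blue-collar', 'management', 'management'],
    'education': ['secondary', 'secondary', 'primary', 'tertiary', 'tertiary'],
})

blue_collar = toy.loc[toy.job == 'blue-collar']

# idxmax returns the index label (here, the category) with the highest count
typical_education = blue_collar.education.value_counts().idxmax()
print(typical_education)
```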

### Question

How many who are unemployed have an outstanding loan? Is that percentage more than that of the employed ones? 

In [None]:
#Defining unemployed as those with the unemployed job role. 

In [None]:
#Defining outstanding loan in regards to personal loan

In [22]:
len(df.loc[(df.job == 'unemployed') & (df.loan == 'yes')]) 

13

In [23]:
df.loc[df.loan == 'yes'].job.value_counts(normalize=True)*100 #Share of each job among clients who have a personal loan

blue-collar      22.575977
management       17.366136
technician       17.221418
admin            13.169320
services         10.709117
entrepreneur      5.933430
retired           4.630970
self-employed     4.341534
unemployed        1.881331
housemaid         1.881331
student           0.144718
unknown           0.144718
Name: job, dtype: float64

In [None]:
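# No. Unemployed clients make up only 1.88% of all personal-loan holders, well below blue-collar (22.58%), management (17.37%) and the other large employed categories. Note that this is each job's share of loan holders, not the loan rate within each job group.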
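The question can also be read as asking for the loan rate within each group: the fraction of unemployed clients who hold a loan versus the fraction of employed clients who do. A minimal sketch of that comparison follows, on a small made-up frame. The `toy` values are illustrative, and treating every job other than 'unemployed' as employed is an assumption (it lumps in 'student', 'retired' and 'unknown'):

```python
import pandas as pd

# Made-up rows for illustration; not taken from bank-marketing.csv
toy = pd.DataFrame({
    'job': ['unemployed', 'unemployed', 'unemployed', 'unemployed',
            'services', 'services', 'management', 'management'],
    'loan': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no'],
})

# Assumption: only the 'unemployed' job value counts as unemployed
is_unemployed = toy.job == 'unemployed'
has_loan = toy.loan == 'yes'

# The mean of a boolean Series is the fraction of True values
unemployed_loan_pct = has_loan[is_unemployed].mean() * 100
employed_loan_pct = has_loan[~is_unemployed].mean() * 100

print(unemployed_loan_pct, employed_loan_pct)
```

Running the same two lines on `df` would give the within-group percentages for the real data.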

### Question

What percent of clients subscribed to the term deposit (column y)? 

In [None]:
#Percent of clients subscribed to the term deposit = number of 'yes' values in the y column / total rows in that column, multiplied by 100.

In [12]:
(len(df.loc[df.y == 'yes', :]) / len(df.y)) * 100 

11.523999115239992

In [24]:
df.y.value_counts(normalize = True )*100 #Alternate method

no     88.476001
yes    11.523999
Name: y, dtype: float64

In [None]:
# 11.52% subscribed to the term deposit

### Question

What percent of married clients subscribed to the term deposit? Is that more or less than that for single folks? 

Revised Wording: "From all the clients, what percent of married clients subscribed to the term deposit? Is that more or less than that for single folks?"

In [27]:
df.loc[df.marital == 'married'].y.value_counts(normalize = True)*100

no     90.096532
yes     9.903468
Name: y, dtype: float64

In [None]:
# 9.90% of married clients subscribed to term deposit, so that is less than that for single folks. 

In [28]:
df.loc[df.marital == 'single'].y.value_counts(normalize = True)*100

no     86.036789
yes    13.963211
Name: y, dtype: float64

In [None]:
# 13.96% or rounding to 14% of single folks subscribed to term deposit.
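The two filters above can be folded into a single step by grouping on marital status, which also returns the rate for divorced clients. A minimal sketch on made-up rows (the `toy` frame is illustrative, not the real data):

```python
import pandas as pd

# Made-up rows for illustration; not taken from bank-marketing.csv
toy = pd.DataFrame({
    'marital': ['married', 'married', 'married', 'married', 'single', 'single'],
    'y': ['no', 'no', 'no', 'yes', 'yes', 'no'],
})

# Turn y into True/False, then average within each marital status
subscribed_pct = (toy.y == 'yes').groupby(toy.marital).mean() * 100
print(subscribed_pct)
```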

### Question

<p>Ask an interesting question of this data set and provide a solution to answer that.</p>
<p>The goal is to help teach your fellow students all possible questions we can collectively ask of this dataset. Your question should be as clear as possible and as short as possible. Try to avoid asking questions that are too trivial or obvious.</p>

<p>This is a bonus point question. The only way not to get full points is if you do one of the following:
<ul>
<li>You do not perform this task.
<li>You do not include a solution. 
</ul>
</p>
<p>If the solution you provide is incorrect, you will still receive full points. But you must make an honest effort to get it right.
</p>

Write your question here: Where more than 5 contacts were made, what percentage of clients subscribed to the term deposit?

In [None]:
Write your solution here: Around 7% # Steps below

In [25]:
df.loc[df.campaign > 5].y.value_counts(normalize = True)*100

no     92.811839
yes     7.188161
Name: y, dtype: float64
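To see whether the 5-contact threshold is a meaningful cut-off, the subscription rate can be compared across several bands of contact counts. A rough sketch using `pd.cut` on made-up rows follows. The `toy` values and the band edges (1-2, 3-5, 6+) are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up rows for illustration; not taken from bank-marketing.csv
toy = pd.DataFrame({
    'campaign': [1, 2, 5, 6, 8, 12],
    'y': ['yes', 'yes', 'no', 'no', 'yes', 'no'],
})

# Hypothetical bands; intervals are right-inclusive: (0, 2], (2, 5], (5, inf]
bands = pd.cut(toy.campaign, bins=[0, 2, 5, float('inf')],
               labels=['1-2', '3-5', '6+'])

# Percentage of 'yes' within each band
rate_by_band = (toy.y == 'yes').groupby(bands, observed=False).mean() * 100
print(rate_by_band)
```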