Data Details: http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html

## Data wrangling and EDA

1. Read the data
2. See the definition of functions
3. Check the first few and last few rows of the dataframe
4. Display all columns
6. Display chosen columns only
7. Count the number of rows and columns in a dataframe
8. Summarize the numeric columns
9. Rename a few columns
 

In [1]:
import pandas as pd
import numpy as np

In [2]:
## 1. Read the data
## 2. See the definition of functions: press Shift + Tab with the cursor inside the function call

path = 'https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv'

banking_df = pd.read_csv(path)
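
The Shift + Tab tooltip only exists inside a notebook. As a rough alternative for scripts or a plain Python shell, the standard-library `inspect` module can show the same information. This is a minimal sketch; the variable names are illustrative only.

```python
import inspect

import pandas as pd

# Programmatic alternative to the Shift + Tab tooltip:
# list the parameters that pd.read_csv accepts.
signature = inspect.signature(pd.read_csv)
param_names = list(signature.parameters)
print(param_names[:5])

# The first line of the docstring is a one-sentence summary of the function.
summary = inspect.getdoc(pd.read_csv).splitlines()[0]
print(summary)
```

In a notebook, `pd.read_csv?` or `help(pd.read_csv)` prints the full docstring.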

In [3]:
## 3. Check the first few and last few rows of the dataframe
banking_df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | blue-collar | married | basic.4y | unknown | yes | no | cellular | aug | thu | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.963 | 5228.1 | 0 |
| 1 | 53 | technician | married | unknown | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.021 | 5195.8 | 0 |
| 2 | 28 | management | single | university.degree | no | yes | no | cellular | jun | thu | ... | 3 | 6 | 2 | success | -1.7 | 94.055 | -39.8 | 0.729 | 4991.6 | 1 |
| 3 | 39 | services | married | high.school | no | no | no | cellular | apr | fri | ... | 2 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | 0 |
| 4 | 55 | retired | married | basic.4y | no | yes | no | cellular | aug | fri | ... | 1 | 3 | 1 | success | -2.9 | 92.201 | -31.4 | 0.869 | 5076.2 | 1 |
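
Step 3 also mentions the last few rows, which `tail()` returns. The sketch below uses a small stand-in dataframe (`sample_df`, a hypothetical name) built from a subset of the five rows shown above, so it runs without downloading the CSV.

```python
import pandas as pd

# Small stand-in built from the five rows shown above (subset of columns).
sample_df = pd.DataFrame({
    "age": [44, 53, 28, 39, 55],
    "job": ["blue-collar", "technician", "management", "services", "retired"],
    "education": ["basic.4y", "unknown", "university.degree", "high.school", "basic.4y"],
    "y": [0, 0, 1, 0, 1],
})

first_rows = sample_df.head(2)  # first two rows
last_rows = sample_df.tail(2)   # last two rows

print(first_rows)
print(last_rows)
```

Both methods default to five rows when called without an argument, so `banking_df.tail()` would show the last five rows of the full dataset.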


## Data Dictionary

1. age (numeric)  
     
2. job: type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)  
  
3. marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)  
  
4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)  
  
5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)  
  
6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)  
  
7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)  

8. contact: contact communication type (categorical: “cellular”, “telephone”)  

9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)  

10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)  

11. duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the target (e.g., if duration=0 then y=0). The duration is not known before a call is made, and once the call ends, y is known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.  


12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)  

13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)  

14. previous: number of contacts performed before this campaign and for this client (numeric)  

15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)  

16. emp_var_rate: employment variation rate (numeric)  

17. cons_price_idx: consumer price index (numeric)  

18. cons_conf_idx: consumer confidence index (numeric)  

19. euribor3m: euribor 3 month rate (numeric)  

20. nr_employed: number of employees (numeric)  

21. y: whether the client subscribed to a term deposit (1/0); this is the target variable
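
The note on `duration` in the data dictionary suggests leaving that column out of any realistic model. A minimal sketch of that step follows, assuming a features/target split is wanted; the names `sample_df`, `features`, and `target` are illustrative, and the values are taken from the first three rows of the dataset.

```python
import pandas as pd

# Stand-in with a few numeric columns from the first three rows of the dataset.
sample_df = pd.DataFrame({
    "age": [44, 53, 28],
    "duration": [210, 138, 339],
    "campaign": [1, 1, 3],
    "y": [0, 0, 1],
})

# `duration` is only known after the call ends, so it is excluded from the
# features along with the target column itself.
features = sample_df.drop(columns=["duration", "y"])
target = sample_df["y"]

print(features.columns.tolist())
```

`drop` returns a new dataframe, so `sample_df` still contains `duration` for benchmarking.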

In [4]:
## 4. Display all columns
pd.set_option("display.max_columns", 22)
banking_df.head()

## Display options are also available for rows (e.g. "display.max_rows")

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | blue-collar | married | basic.4y | unknown | yes | no | cellular | aug | thu | 210 | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.963 | 5228.1 | 0 |
| 1 | 53 | technician | married | unknown | no | no | no | cellular | nov | fri | 138 | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.021 | 5195.8 | 0 |
| 2 | 28 | management | single | university.degree | no | yes | no | cellular | jun | thu | 339 | 3 | 6 | 2 | success | -1.7 | 94.055 | -39.8 | 0.729 | 4991.6 | 1 |
| 3 | 39 | services | married | high.school | no | no | no | cellular | apr | fri | 185 | 2 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | 0 |
| 4 | 55 | retired | married | basic.4y | no | yes | no | cellular | aug | fri | 137 | 1 | 3 | 1 | success | -2.9 | 92.201 | -31.4 | 0.869 | 5076.2 | 1 |
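
`pd.set_option` changes the display setting for the rest of the session. If the change should apply to a single cell only, `pd.option_context` is one possible alternative. The sketch below uses a hypothetical 30-column dataframe (`wide_df`) to illustrate; `None` removes the column limit altogether.

```python
import pandas as pd

# Hypothetical wide dataframe with 30 columns.
wide_df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

before_max_columns = pd.get_option("display.max_columns")

# The settings apply only inside the `with` block and are restored afterwards.
with pd.option_context("display.max_columns", None, "display.max_rows", 10):
    inside_max_columns = pd.get_option("display.max_columns")
    print(wide_df)

after_max_columns = pd.get_option("display.max_columns")
```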


In [5]:
## 6. Display chosen columns only

banking_df[['job', 'education', 'y']].head()

| | job | education | y |
|---|---|---|---|
| 0 | blue-collar | basic.4y | 0 |
| 1 | technician | unknown | 0 |
| 2 | management | university.degree | 1 |
| 3 | services | high.school | 0 |
| 4 | retired | basic.4y | 1 |
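
Besides passing a list of names, columns can also be chosen by data type with `select_dtypes`. A small sketch on a stand-in dataframe (`sample_df`, built from the rows shown earlier) illustrates both approaches.

```python
import pandas as pd

# Stand-in built from the first five rows of the dataset (subset of columns).
sample_df = pd.DataFrame({
    "age": [44, 53, 28, 39, 55],
    "job": ["blue-collar", "technician", "management", "services", "retired"],
    "education": ["basic.4y", "unknown", "university.degree", "high.school", "basic.4y"],
    "y": [0, 0, 1, 0, 1],
})

# Select by name; .loc makes the "all rows, these columns" intent explicit.
by_name = sample_df.loc[:, ["job", "education", "y"]]

# Select by type: numeric columns, and everything that is not numeric.
numeric_cols = sample_df.select_dtypes(include="number")
non_numeric_cols = sample_df.select_dtypes(exclude="number")

print(numeric_cols.columns.tolist())
print(non_numeric_cols.columns.tolist())
```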


In [6]:
## 7. Count number of rows and columns in a dataframe
banking_df.shape

(41188, 21)
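
`shape` is a plain `(rows, columns)` tuple, so it can be unpacked into two variables. A minimal sketch on a stand-in dataframe (`sample_df` is a hypothetical name):

```python
import pandas as pd

# Stand-in dataframe with 5 rows and 4 columns.
sample_df = pd.DataFrame({
    "age": [44, 53, 28, 39, 55],
    "campaign": [1, 1, 3, 2, 1],
    "pdays": [999, 999, 6, 999, 3],
    "y": [0, 0, 1, 0, 1],
})

n_rows, n_cols = sample_df.shape  # unpack the (rows, columns) tuple
row_count = len(sample_df)        # len() gives the number of rows only

print(f"{n_rows} rows, {n_cols} columns")
```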

In [7]:
## 8. Summary of numeric columns
banking_df.describe()

| | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 | 0.112654 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 | 0.316173 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 | 0.0 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 | 0.0 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 | 0.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 | 0.0 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 | 1.0 |
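
The `pdays` mean of about 962 in the summary is dominated by the value 999, which the data dictionary defines as "client was not previously contacted" rather than a real number of days. One possible way to handle this, sketched below on the five `pdays` values shown earlier, is to treat 999 as missing before summarizing. Whether that is the right treatment depends on the later analysis.

```python
import numpy as np
import pandas as pd

# pdays values from the first five rows of the dataset.
pdays = pd.Series([999, 999, 6, 999, 3], name="pdays")

naive_mean = pdays.mean()  # inflated by the 999 sentinel

# Treat 999 as "not previously contacted" (missing) instead of a day count.
pdays_clean = pdays.replace(999, np.nan)
share_not_contacted = pdays_clean.isna().mean()
clean_mean = pdays_clean.mean()  # NaN values are skipped by default

print(naive_mean, clean_mean, share_not_contacted)
```

On the full dataset the same idea would be `banking_df['pdays'].replace(999, np.nan).describe()`.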


In [8]:
## 9. Rename a few columns
banking_df = banking_df.rename(columns = {'euribor3m': 'euribor_3m'})
banking_df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor_3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | blue-collar | married | basic.4y | unknown | yes | no | cellular | aug | thu | 210 | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.963 | 5228.1 | 0 |
| 1 | 53 | technician | married | unknown | no | no | no | cellular | nov | fri | 138 | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.021 | 5195.8 | 0 |
| 2 | 28 | management | single | university.degree | no | yes | no | cellular | jun | thu | 339 | 3 | 6 | 2 | success | -1.7 | 94.055 | -39.8 | 0.729 | 4991.6 | 1 |
| 3 | 39 | services | married | high.school | no | no | no | cellular | apr | fri | 185 | 2 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | 0 |
| 4 | 55 | retired | married | basic.4y | no | yes | no | cellular | aug | fri | 137 | 1 | 3 | 1 | success | -2.9 | 92.201 | -31.4 | 0.869 | 5076.2 | 1 |
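
`rename` also accepts a function, which helps when many columns follow the same pattern. The sketch below assumes a hypothetical dataframe (`dotted_df`) whose columns use dots, as some versions of this dataset do, and converts them to underscores after applying the explicit mapping from the cell above.

```python
import pandas as pd

# Hypothetical empty dataframe whose column names use dots.
dotted_df = pd.DataFrame(columns=["age", "emp.var.rate", "cons.price.idx", "euribor3m"])

renamed_df = (
    dotted_df
    # Explicit mapping for one-off renames.
    .rename(columns={"euribor3m": "euribor_3m"})
    # Function applied to every column name: replace dots with underscores.
    .rename(columns=lambda name: name.replace(".", "_"))
)

print(renamed_df.columns.tolist())
```

Names missing from the mapping are left unchanged, and `rename` returns a new dataframe unless the result is assigned back.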


## Section 2: Data Exploration and cleaning

1. Distribution of categorical variables
2. 