Data fields

- customer_id - an anonymous id unique to a given customer

- age - the age of a customer (numeric)

- job - type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

- marital - marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

- education - education level of the customer (categorical: 'primary','secondary','tertiary','unknown')

- default - has credit in default? (categorical: 'no','yes','unknown')

- balance - account balance of the customer (numeric)

- housing - has housing loan? (categorical: 'no','yes','unknown')

- loan - has personal loan? (categorical: 'no','yes','unknown')

- contact - contact communication type (categorical: 'cellular','telephone','unknown')

- day - day of the month of the last contact (numeric)

- month - month of the last contact (categorical: 'jan' to 'dec')

- duration - last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

- campaign - number of contacts performed during this campaign and for this client (numeric, includes last contact)

- pdays - number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)

- previous - number of contacts performed before this campaign and for this client (numeric)

- poutcome - outcome of the previous marketing campaign (categorical: 'failure','other','success','unknown')

- deposit - the target variable: has the client subscribed to a term deposit? ('yes': 1, 'no': 0)
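The note on `duration` and the "not previously contacted" sentinel in `pdays` can be turned into a small preprocessing step. The following is only a sketch under assumptions: `prepare_features` and the `was_contacted_before` column are hypothetical names, and the sentinel is passed as a parameter (the sample rows in this notebook show -1).

```python
import pandas as pd


def prepare_features(df: pd.DataFrame, pdays_sentinel: int = -1) -> pd.DataFrame:
    """Return a copy of df without the leakage-prone duration column (hypothetical helper)."""
    out = df.copy()
    # duration is only known after the call has ended, so drop it for a realistic model.
    out = out.drop(columns=["duration"], errors="ignore")
    # Flag clients that were contacted in a previous campaign (assumed sentinel value).
    out["was_contacted_before"] = (out["pdays"] != pdays_sentinel).astype(int)
    return out


# Toy rows for illustration only, not the real training data.
toy = pd.DataFrame({
    "age": [31, 62],
    "duration": [164, 187],
    "pdays": [-1, 180],
    "deposit": [1, 1],
})
features = prepare_features(toy)
print(features.columns.tolist())
```

Keeping `duration` in a separate benchmark run, as the field description suggests, only requires skipping the `drop` step.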

In [2]:
import numpy as np
import pandas as pd

In [3]:
dataframe_train = pd.read_csv("dataset/Train.csv")
print(dataframe_train.info())
dataframe_train.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8929 entries, 0 to 8928
Data columns (total 18 columns):
customer_id    8929 non-null object
age            8929 non-null int64
job            8929 non-null object
marital        8929 non-null object
education      8929 non-null object
default        8929 non-null object
balance        8929 non-null int64
housing        8929 non-null object
loan           8929 non-null object
contact        8929 non-null object
day            8929 non-null int64
month          8929 non-null object
duration       8929 non-null int64
campaign       8929 non-null int64
pdays          8929 non-null int64
previous       8929 non-null int64
poutcome       8929 non-null object
deposit        8929 non-null int64
dtypes: int64(8), object(10)
memory usage: 1.2+ MB
None


Unnamed: 0,customer_id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,RGD002844,31,management,married,tertiary,no,81,yes,no,cellular,29,oct,164,2,-1,0,unknown,1
1,RGD003806,62,retired,married,secondary,no,569,no,no,cellular,3,aug,187,2,180,6,success,1
2,RGD008310,35,technician,married,tertiary,no,432,no,no,cellular,12,aug,104,8,-1,0,unknown,0
3,RGD001840,43,management,married,tertiary,no,1429,yes,no,cellular,7,may,1030,1,169,3,success,1
4,RGD005881,29,blue-collar,married,primary,no,25,yes,no,unknown,4,jun,188,2,-1,0,unknown,0


In [4]:
dataframe_train.columns

Index(['customer_id', 'age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'],
      dtype='object')

In [5]:
dataframe_train.columns.values

array(['customer_id', 'age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'], dtype=object)

### Exploratory Data Analysis (EDA)

In [6]:
def examine_dataframe(df):
    """Print value counts for object columns and summary statistics for the rest."""
    for name in df.columns:
        print("--------------------------")
        print(df[name].dtype)
        if df[name].dtype == np.dtype("O"):  # object (string) columns
            print(df[name].value_counts())
            print("Name: ", name)
        else:
            print(df[name].describe())  # numeric columns

examine_dataframe(dataframe_train)

--------------------------
object
RGD001422     1
RGD002083     1
RGD003417     1
RGD004189     1
RGD004472     1
RGD004225     1
RGD00122      1
RGD001753     1
RGD003045     1
RGD002513     1
RGD004715     1
RGD00318      1
RGD008117     1
RGD009615     1
RGD001221     1
RGD002154     1
RGD001817     1
RGD008523     1
RGD003874     1
RGD00319      1
RGD005236     1
RGD008264     1
RGD004972     1
RGD006397     1
RGD009197     1
RGD001896     1
RGD00925      1
RGD001177     1
RGD00604      1
RGD009080     1
             ..
RGD006180     1
RGD004287     1
RGD008858     1
RGD006443     1
RGD005635     1
RGD0010321    1
RGD001717     1
RGD001419     1
RGD00812      1
RGD002181     1
RGD004140     1
RGD006468     1
RGD003658     1
RGD009652     1
RGD002694     1
RGD001716     1
RGD007689     1
RGD0010397    1
RGD007250     1
RGD002383     1
RGD0010960    1
RGD0010220    1
RGD005364     1
RGD001534     1
RGD005912     1
RGD004884     1
RGD0011064    1
RGD0079       1
RGD005101     1
RGD001
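For identifier-like columns such as `customer_id`, `value_counts()` prints one line per distinct value, which is why the listing above is so long. A possible variant, sketched here with the hypothetical names `summarize_column` and `max_levels`, reports only the number of distinct values when a column has many levels:

```python
import pandas as pd


def summarize_column(col: pd.Series, max_levels: int = 20):
    """Summarize one column (hypothetical helper, not part of the notebook)."""
    if pd.api.types.is_numeric_dtype(col):
        # Numeric columns: count, mean, std, quartiles, min and max.
        return col.describe()
    n_unique = col.nunique()
    if n_unique > max_levels:
        # Too many levels to be informative, e.g. an id column.
        return f"{n_unique} distinct values (counts not shown)"
    return col.value_counts()


# Toy data for illustration only.
toy = pd.DataFrame({
    "customer_id": [f"ID{i}" for i in range(30)],
    "marital": ["married", "single"] * 15,
    "age": list(range(20, 50)),
})
for name in toy.columns:
    print("--------------------------")
    print("Name:", name)
    print(summarize_column(toy[name]))
```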

In [7]:
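# Columns to pass to get_dummies; the object-typed ones are one-hot encoded
cols = ["age", "job", "marital", "education", "default",
        "housing", "loan", "month", "duration", "campaign", "previous", "poutcome"]
data1 = dataframe_train[cols]
data_dummies = pd.get_dummies(data1)
# Note: the numeric columns in cols (age, duration, campaign, previous) pass through
# get_dummies unchanged, so they appear twice after this concat.
result_df = pd.concat([data_dummies, dataframe_train], axis=1)
result_df.head(12)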

Unnamed: 0,age,duration,campaign,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,...,loan,contact,day,month,duration.1,campaign.1,pdays,previous.1,poutcome,deposit
0,31,164,2,0,0,0,0,0,1,0,...,no,cellular,29,oct,164,2,-1,0,unknown,1
1,62,187,2,6,0,0,0,0,0,1,...,no,cellular,3,aug,187,2,180,6,success,1
2,35,104,8,0,0,0,0,0,0,0,...,no,cellular,12,aug,104,8,-1,0,unknown,0
3,43,1030,1,3,0,0,0,0,1,0,...,no,cellular,7,may,1030,1,169,3,success,1
4,29,188,2,0,0,1,0,0,0,0,...,no,unknown,4,jun,188,2,-1,0,unknown,0
5,38,360,1,0,0,1,0,0,0,0,...,no,cellular,12,may,360,1,-1,0,unknown,0
6,51,75,7,0,0,1,0,0,0,0,...,yes,unknown,16,jun,75,7,-1,0,unknown,0
7,54,207,1,0,0,0,0,0,1,0,...,no,cellular,5,jun,207,1,-1,0,unknown,1
8,52,477,2,0,0,0,0,0,1,0,...,no,cellular,23,apr,477,2,-1,0,unknown,0
9,31,161,3,9,0,0,1,0,0,0,...,yes,cellular,29,jan,161,3,188,9,failure,0
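Because `pd.get_dummies` passes numeric columns through unchanged, concatenating its result with the full frame repeats them (the `duration.1`, `campaign.1` and `previous.1` columns above). One way to avoid the repeats, sketched on a toy frame with made-up rows, is to drop the overlapping columns before concatenating while keeping the raw categorical columns for later group-bys:

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [31, 62, 35],
    "job": ["management", "retired", "technician"],
    "deposit": [1, 1, 0],
})

cols = ["age", "job"]
dummies = pd.get_dummies(toy[cols])

# Columns present in both frames are the numeric ones that get_dummies left untouched.
duplicated = dummies.columns.intersection(toy.columns)
rest = toy.drop(columns=duplicated)

encoded = pd.concat([dummies, rest], axis=1)
print(encoded.columns.tolist())
```

With unique column names, selecting `encoded["age"]` returns a single Series rather than a two-column frame.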


In [8]:
result_df.columns.values # Performed one-hot encoding so that I can use it in EDA

array(['age', 'duration', 'campaign', 'previous', 'job_admin.',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed',
       'job_services', 'job_student', 'job_technician', 'job_unemployed',
       'job_unknown', 'marital_divorced', 'marital_married',
       'marital_single', 'education_primary', 'education_secondary',
       'education_tertiary', 'education_unknown', 'default_no',
       'default_yes', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes',
       'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown', 'customer_id', 'age', 'job',
       'marital', 'education', 'default', 'balance', 'housing', 'loan',
       'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'], dtype=object)

In [10]:
# deposit is already encoded as 0/1 (int64), so comparing it with 'yes' would set every row to 0
result_df['output'] = result_df['deposit'].astype(int)
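Comparing an integer column with the string `'yes'` never matches, so a mapping that accepts both encodings is safer. A minimal sketch, assuming the labels are either `'yes'`/`'no'` or 1/0; `encode_target` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def encode_target(labels: pd.Series) -> pd.Series:
    """Map 'yes'/'no' or 1/0 labels to integers 1/0 (hypothetical helper)."""
    mapping = {"yes": 1, "no": 0, 1: 1, 0: 0}
    encoded = labels.map(mapping)
    if encoded.isna().any():
        # Fail loudly instead of silently turning unknown labels into 0.
        raise ValueError("unexpected label values")
    return encoded.astype(int)


int_labels = pd.Series([1, 0, 1])
str_labels = pd.Series(["yes", "no", "yes"])

# The string comparison yields 0 for every integer label.
naive = int_labels.apply(lambda x: 1 if x == "yes" else 0)

print(naive.tolist())
print(encode_target(int_labels).tolist())
print(encode_target(str_labels).tolist())
```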

### Comparing Qualitative vs Quantitative Analysis

#### age vs output

In [15]:
grouped = result_df.groupby("deposit")
# result_df contains two "age" columns (see the concat above), so each statistic is listed twice
age = grouped["age"].describe()
age = age.unstack(level=-1)
print(age)

            deposit
age  count  0          4716.000000
            1          4213.000000
     mean   0            40.828456
            1            41.766200
     std    0            10.240731
            1            13.584954
     min    0            18.000000
            1            18.000000
     25%    0            33.000000
            1            31.000000
     50%    0            39.000000
            1            38.000000
     75%    0            48.000000
            1            51.000000
     max    0            87.000000
            1            95.000000
     count  0          4716.000000
            1          4213.000000
     mean   0            40.828456
            1            41.766200
     std    0            10.240731
            1            13.584954
     min    0            18.000000
            1            18.000000
     25%    0            33.000000
            1            31.000000
     50%    0            39.000000
            1            38.000000
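The repeated block of statistics in the output above comes from `result_df` holding two `age` columns. A sketch of the same summary on a frame with a single `age` column (toy rows, not the notebook's data) gives one row per target class:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "age": [31, 62, 35, 43, 29, 38],
    "deposit": [1, 1, 0, 1, 0, 0],
})

# One row per deposit class, one column per statistic (count, mean, std, ...).
age_by_deposit = toy.groupby("deposit")["age"].describe()
print(age_by_deposit)
```

Running the group-by on the original training frame instead of the concatenated one would have the same effect, since that frame has only one `age` column.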


#### month vs output

In [17]:
grouped = result_df.groupby("deposit")
month = grouped["month"].describe()
month = month.unstack(level=-1)
print(month)

        deposit
count   0          4716
        1          4213
unique  0            12
        1            12
top     0           may
        1           may
freq    0          1547
        1           750
dtype: object


In [18]:
import seaborn as sns

# Joint relative frequencies: each cell is divided by the total number of rows
frequencies = pd.crosstab(result_df["month"], result_df["deposit"]).apply(lambda r: r / len(result_df))
print(frequencies)

sns.heatmap(frequencies)

deposit         0         1
month                      
apr      0.030575  0.051854
aug      0.074812  0.057341
dec      0.001120  0.008960
feb      0.029343  0.040094
jan      0.018143  0.013103
jul      0.079404  0.056221
jun      0.059245  0.048494
mar      0.002800  0.022287
may      0.173256  0.083996
nov      0.048494  0.035838
oct      0.006384  0.028335
sep      0.004592  0.025311
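The table above holds joint frequencies (each cell divided by the total number of rows), so months with many contacts dominate it. The `normalize` argument of `pd.crosstab` gives the same joint table directly and, with `normalize="index"`, a per-month subscription rate. A sketch on toy rows, not the notebook's data:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "month": ["may", "may", "may", "aug", "aug", "oct"],
    "deposit": [0, 0, 1, 0, 1, 1],
})

# Joint frequencies: all cells sum to 1.
joint = pd.crosstab(toy["month"], toy["deposit"], normalize=True)

# Row-normalized: each month sums to 1, so column 1 is the subscription rate.
rate = pd.crosstab(toy["month"], toy["deposit"], normalize="index")

print(joint)
print(rate)
```

Either table can be passed to `sns.heatmap` once seaborn has been imported as `sns`.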