# DS-SF-25 | Lab 09 | Introduction to Logistic Regression

In [119]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 40)

from sklearn import linear_model, model_selection  # cross_validation was removed in scikit-learn 0.20

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [100]:
df = pd.read_csv(os.path.join('..', 'datasets', 'bank-marketing.csv'))

In [101]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


> The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution.  The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

Attributes Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical)
    - `month`: last contact month of year (categorical)
    - `day`: last contact day of the month (numeric)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute highly affects the output target (e.g., if `duration = 0` then `y = 'no'`). Yet, the duration is not known before a call is performed.  Also, after the end of the call y is obviously known.  Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - `-1` means the client was not previously contacted
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed to a term deposit? (binary)

> Our goal is to develop a model that best predicts the outcome `y`, the success of the marketing campaign.

> ## Question 1.  Remove the categorical variables with the largest number of distinct values

In [102]:
df.apply(pd.Series.nunique)
#df.previous.unique()

age           67
job           12
marital        3
education      4
default        2
            ... 
campaign      32
pdays        292
previous      24
poutcome       4
y              2
dtype: int64

In [103]:
df.job.value_counts()
df.marital.value_counts()

married     2797
single      1196
divorced     528
Name: marital, dtype: int64

Answer: TODO
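
One possible way to approach Question 1 is to count distinct values among the non-numeric columns only, since `df.apply(pd.Series.nunique)` mixes numeric and categorical columns. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not rows from `bank-marketing.csv`) and assumes the categorical columns are the ones pandas does not treat as numeric.

```python
import pandas as pd

# Toy data (illustrative values only, not the lab dataset).
toy = pd.DataFrame({
    'age': [30, 33, 35, 30],
    'job': ['unemployed', 'services', 'management', 'management'],
    'marital': ['married', 'married', 'single', 'married'],
    'month': ['oct', 'may', 'apr', 'jun'],
})

# Keep the non-numeric columns, then count distinct values per column.
categorical = toy.select_dtypes(exclude='number')
distinct_counts = categorical.nunique().sort_values(ascending=False)
print(distinct_counts)

# The column(s) at the top of the list are the candidates to drop.
to_drop = distinct_counts.index[:1].tolist()
reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```

How many columns to drop (here just one) is a judgment call that depends on how many dummy variables you are willing to add to the model.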

> ## Question 2.  Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

In [104]:
# default
df.default.value_counts()
df['default_no']=df.default.apply(lambda value: 1 if value== "no" else 0)

In [105]:
# housing
df.housing.value_counts()
df['housing_yes']=df.housing.apply(lambda value: 1 if value=="yes" else 0)

In [106]:
#loan
df.loan.value_counts()
df['loan_no']=df.loan.apply(lambda value: 1 if value=="no" else 0)
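
The same recoding can be written without `apply` and a lambda by comparing the whole column at once. This is a sketch on made-up values (the `toy` frame is hypothetical); it produces the same 0/1 columns as the cells above.

```python
import pandas as pd

# Toy yes/no columns (illustrative values only).
toy = pd.DataFrame({
    'default': ['no', 'no', 'yes', 'no'],
    'housing': ['yes', 'no', 'yes', 'yes'],
})

# Check which value is the most frequent before choosing the coding.
print(toy['default'].value_counts())

# Vectorized comparison: True/False converted to 1/0.
toy['default_no'] = (toy['default'] == 'no').astype(int)
toy['housing_yes'] = (toy['housing'] == 'yes').astype(int)

# Same result as the apply/lambda version used in the notebook.
via_apply = toy['default'].apply(lambda value: 1 if value == 'no' else 0)
print((via_apply == toy['default_no']).all())
```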

> ## Question 3.  Create binary/dummy variables for the other categorical variables

In [107]:
df['is_married']=(df.marital=="married")*1
df['education_is_secondary']=(df.education=="secondary")*1
df['education_is_tertiary']=(df.education=="tertiary")*1
df['education_is_primary']=(df.education=="primary")*1
df['contact_is_cellular']=(df.contact=="cellular")*1
df['contact_is_telephone']=(df.contact=="telephone")*1
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,default_no,housing_yes,loan_no,is_married,education_is_secondary,education_is_tertiary,education_is_primary,contact_is_cellular,contact_is_telephone
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no,1,0,1,1,0,0,1,1,0
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no,1,1,0,1,1,0,0,1,0
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no,1,1,1,0,0,1,0,1,0
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no,1,1,0,1,0,1,0,0,0
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no,1,1,1,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no,1,1,1,1,1,0,0,1,0
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no,0,1,0,1,0,1,0,0,0
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no,1,0,1,1,1,0,0,1,0
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no,1,0,1,1,1,0,0,1,0
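
As an alternative to writing one line per dummy, `pd.get_dummies` can generate them in a single call. The sketch below runs on a hypothetical `toy` frame and assumes `unknown` is the level left out as the baseline, which is what the hand-written dummies above imply.

```python
import pandas as pd

# Toy categorical columns (illustrative values only).
toy = pd.DataFrame({
    'education': ['primary', 'secondary', 'tertiary', 'unknown', 'secondary'],
    'contact': ['cellular', 'cellular', 'cellular', 'unknown', 'telephone'],
})

# One 0/1 column per level, named like the notebook's columns.
dummies = pd.get_dummies(toy, columns=['education', 'contact'],
                         prefix_sep='_is_', dtype=int)

# Drop one level per variable so it acts as the baseline category.
dummies = dummies.drop(columns=['education_is_unknown', 'contact_is_unknown'])
print(dummies.columns.tolist())
```

Leaving one level out per categorical variable avoids perfectly collinear dummies, which is the issue Question 4 asks about.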


In [108]:
#drop categorical variables
df.drop('job',axis=1,inplace=True)

In [109]:
df.drop('marital', axis=1, inplace=True)
df.drop('education',axis=1,inplace=True)
df.drop('default',axis=1,inplace=True)
df.drop('housing',axis=1,inplace=True)
df.drop('loan',axis=1,inplace=True)
df.drop('contact',axis=1,inplace=True)

In [110]:
df['y']=(df.y=="yes")*1

In [112]:
df.y.value_counts()

0    4000
1     521
Name: y, dtype: int64

> ## Question 4.  What should be your baseline for these binary variables (namely, which binary variables should you not include in your model)?

df.mean()

In [121]:
df.mean().sort_values(ascending=False)

balance                   1422.657819
duration                   263.961292
age                         41.170095
pdays                       39.766645
day                         15.915284
campaign                     2.793630
default_no                   0.983190
loan_no                      0.847158
contact_is_cellular          0.640566
is_married                   0.618668
housing_yes                  0.566025
previous                     0.542579
education_is_secondary       0.510064
education_is_tertiary        0.298607
education_is_primary         0.149967
y                            0.115240
contact_is_telephone         0.066578
dtype: float64

In [None]:
# note: there seem to be few data points for education and contact

In [113]:
corr=df.corr()
corr

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,default_no,housing_yes,loan_no,is_married,education_is_secondary,education_is_tertiary,education_is_primary,contact_is_cellular,contact_is_telephone
age,1.000000,0.083820,-0.017853,-0.002367,-0.005148,-0.008894,-0.003511,0.045092,0.017885,-0.193888,0.011250,0.275139,-0.106872,-0.094042,0.224938,-0.061794,0.183060
balance,0.083820,1.000000,-0.008677,-0.015950,-0.009976,0.009437,0.026196,0.017905,0.070886,-0.050227,0.071349,0.017158,-0.076574,0.076487,-0.001551,0.000240,0.034025
day,-0.017853,-0.008677,1.000000,-0.024629,0.160706,-0.094352,-0.059114,-0.011244,0.013261,-0.031291,0.004879,-0.001438,0.007745,0.007465,-0.020851,0.017850,0.053527
duration,-0.002367,-0.015950,-0.024629,1.000000,-0.068382,0.010380,0.018080,0.401118,0.011615,0.015740,0.004997,-0.036436,0.023179,-0.017779,-0.003640,0.016191,-0.021180
campaign,-0.005148,-0.009976,0.160706,-0.068382,1.000000,-0.093137,-0.067833,-0.061147,0.012348,-0.003574,-0.017120,0.022000,-0.019510,0.022631,0.009746,-0.018435,0.026571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
education_is_secondary,-0.106872,-0.076574,0.007745,0.023179,-0.019510,0.011899,-0.008410,-0.028744,-0.024901,0.111368,-0.078139,0.000319,1.000000,-0.665751,-0.428571,-0.028720,-0.022239
education_is_tertiary,-0.094042,0.076487,0.007465,-0.017779,0.022631,0.000377,0.026977,0.056649,0.021407,-0.098624,0.043434,-0.107669,-0.665751,1.000000,-0.274062,0.148305,-0.046299
education_is_primary,0.224938,-0.001551,-0.020851,-0.003640,0.009746,-0.019708,-0.020439,-0.027420,0.006734,-0.000956,0.016574,0.135892,-0.428571,-0.274062,1.000000,-0.117882,0.074205
contact_is_cellular,-0.061794,0.000240,0.017850,0.016191,-0.018435,0.223347,0.167604,0.118761,0.002449,-0.164820,-0.008159,-0.054726,-0.028720,0.148305,-0.117882,1.000000,-0.356533


Answer: TODO

In [117]:
df.corr().y.sort_values()

housing_yes              -0.104683
is_married               -0.064643
campaign                 -0.061147
education_is_secondary   -0.028744
education_is_primary     -0.027420
                            ...   
pdays                     0.104087
previous                  0.116714
contact_is_cellular       0.118761
duration                  0.401118
y                         1.000000
Name: y, dtype: float64

> ## Question 5.  Which input variable in the dataset seems to predict the outcome quite well?  Why?

In [None]:
# see above

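Answer: `duration`.  Its correlation with `y` (0.40) is by far the largest of any input.  However, the duration of a call is only known once the call is over, at which point `y` is known too, so it is dropped before modeling.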

In [122]:
df.drop('duration', axis = 1, inplace = True)

> ## Question 6.  Split the dataset into a training set (60%) and a testing set (the rest)

In [123]:
train_df = df.sample(frac = .6, random_state = 0)
test_df = df.drop(train_df.index)
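
Because only about 12% of the rows have `y = 1`, a purely random split can leave the two sets with slightly different class balances. One hedged alternative is to sample within each class. The sketch below uses a made-up frame (`toy` is hypothetical) and assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Toy frame with a 10% positive rate (illustrative, not the lab data).
toy = pd.DataFrame({'x': range(100), 'y': [1] * 10 + [0] * 90})

# Sample 60% of the rows within each class of y.
train = toy.groupby('y').sample(frac=0.6, random_state=0)

# The testing set is everything that was not sampled.
test = toy.drop(train.index)

# Both splits keep the same share of positives.
print(train['y'].mean(), test['y'].mean())
```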

> ## Question 7.  Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`

In [140]:
train_X=train_df[['is_married','default_no','balance','housing_yes','loan_no','campaign','pdays','previous']]
train_y=train_df.y
test_X=test_df[['is_married','default_no','balance','housing_yes','loan_no','campaign','pdays','previous']]
test_y=test_df.y

In [132]:
model=linear_model.LogisticRegression().fit(train_X,train_y)
print(model.coef_)
print(model.intercept_)

[[ -4.40597569e-01  -7.62370993e-01   2.87553876e-05  -7.58355354e-01
    5.46078513e-01  -6.49684034e-02   1.71576267e-03   7.75282970e-02]]
[-1.12985647]


> ## Question 8.  What is your training error?  What is your generalization error?  Does it make sense?

In [145]:
print("training accuracy:", model.score(train_X, train_y))
print("testing accuracy:", model.score(test_X, test_y))

training accuracy: 0.883155178769
testing accuracy: 0.884955752212


Answer: TODO
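
A useful reference point for Question 8 is the accuracy of a model that always predicts the majority class. The sketch below computes it from the class counts printed by `df.y.value_counts()` earlier in the notebook. Note that this is the baseline on the full dataset; the training and testing sets will each have a slightly different class balance.

```python
# Class counts reported by df.y.value_counts() earlier in the notebook.
n_no, n_yes = 4000, 521

# Accuracy of always predicting y = 0 (the majority class).
baseline_accuracy = n_no / float(n_no + n_yes)
print(round(baseline_accuracy, 4))

# Accuracies reported by model.score above.
reported_train, reported_test = 0.883155178769, 0.884955752212
print(reported_train - baseline_accuracy)
print(reported_test - baseline_accuracy)
```

Both reported accuracies are within a fraction of a percentage point of this baseline, which suggests the model is doing little more than predicting `0` for nearly everyone.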

> ## Question 9.  Interpret your coefficients. (At least `marital_single`, `campaign`, and `default`).  Does your interpretation make sense?

In [None]:
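# the odds of subscribing are about 36% lower for a married person (exp(-0.44) is about 0.64)
# every additional contact in this campaign lowers the odds by about 6% (exp(-0.065) is about 0.94)
# the odds are about 53% lower for someone who has not defaulted (exp(-0.76) is about 0.47)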

Answer: TODO
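
In a logistic regression, each coefficient is a change in log-odds, so exponentiating it gives the multiplicative change in the odds for a one-unit increase in that variable. The sketch below does this arithmetic for three of the coefficients printed by `model.coef_` above; the dictionary keys follow the column order used to build `train_X`.

```python
import math

# Coefficients copied from the model.coef_ output above.
coefficients = {
    'is_married': -4.40597569e-01,
    'default_no': -7.62370993e-01,
    'campaign': -6.49684034e-02,
}

# exp(beta) is the odds ratio for a one-unit increase in the variable.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items()):
    pct_change = (ratio - 1) * 100
    print('%s: odds ratio %.3f (%.1f%% change in odds)' % (name, ratio, pct_change))
```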

> ## Question 10.  What is your prediction for a 30-year-old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted?

In [147]:
# X =df[['is_married','default_no','balance','housing_yes','loan_no','campaign','pdays','previous']]
X_sample=[[0,1,1000,1,1,0,0,0]]  # predict expects a 2D array: one row per sample
print(model.predict(X_sample))
print(model.predict_proba(X_sample))

[0]
[[ 0.88852366  0.11147634]]




Answer: TODO
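
The probability returned by `predict_proba` can be reproduced by hand: take the intercept plus the dot product of the coefficients and the inputs, then pass the result through the logistic function. The sketch below uses the coefficient and intercept values printed in Question 7 and the same input row as the cell above (`x_sample` and `p_yes` are names introduced here for illustration).

```python
import math

# Values copied from the model.intercept_ and model.coef_ output above.
intercept = -1.12985647
coef = [-4.40597569e-01, -7.62370993e-01, 2.87553876e-05, -7.58355354e-01,
        5.46078513e-01, -6.49684034e-02, 1.71576267e-03, 7.75282970e-02]

# Same column order as train_X: is_married, default_no, balance,
# housing_yes, loan_no, campaign, pdays, previous.
x_sample = [0, 1, 1000, 1, 1, 0, 0, 0]

# Linear predictor (log-odds), then the logistic function.
log_odds = intercept + sum(b * x for b, x in zip(coef, x_sample))
p_yes = 1.0 / (1.0 + math.exp(-log_odds))
print(round(p_yes, 4))
```

The result agrees with the second entry of the `predict_proba` output, and since it is below 0.5 the predicted class is `0`.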

> ## Question 11.  Normalize your variables (You can reuse the function we used from a previous lab)

In [148]:
def normalize(x):
    # Min-max scaling to [0, 1]; local names avoid shadowing the built-in min/max
    x_min = x.min()
    x_max = x.max()
    return (x - x_min) / (x_max - x_min)

In [150]:
features = ['is_married','default_no','balance','housing_yes','loan_no','campaign','pdays','previous']
train_df[features] = train_df[features].apply(normalize)
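
The cell above rescales the training set only. If the testing set is to be scored with a model fit on normalized data, it needs the same transformation, using the minimum and range learned from the training set. A minimal sketch on made-up numbers (the `train` and `test` frames here are hypothetical, not `train_df` and `test_df`):

```python
import pandas as pd

# Toy training and testing frames (illustrative values only).
train = pd.DataFrame({'balance': [0.0, 500.0, 2000.0], 'campaign': [1.0, 2.0, 5.0]})
test = pd.DataFrame({'balance': [1000.0, 3000.0], 'campaign': [3.0, 1.0]})

# Learn the scaling parameters from the training set only.
col_min = train.min()
col_range = train.max() - col_min

# Apply the same parameters to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range
print(test_scaled)
```

Testing values can fall outside [0, 1] when they lie beyond the training range, which is expected.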

> ## Question 12.  Let's do some regularization.  Use 10-fold cross validation to find the best tuning parameter `C`

(Hint: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

(Hint 2: First try C = 10 ^ i with i = -10 ... 10)

In [161]:
ii=list(range(-10,11))
ii

[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [164]:
# Candidate values of C: 10^-10, ..., 10^10 (a `return` is only valid inside a function)
cs = [10.0 ** i for i in ii]

Answer: TODO
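
One possible shape for the search in Question 12 is a loop over the candidate values that scores each one with 10-fold cross-validation and keeps the best mean accuracy. The sketch below runs on synthetic data (`X`, `y`, and the random seed are made up for illustration) and uses `cross_val_score` from `sklearn.model_selection`; on the lab data you would pass the normalized training features and `train_y` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the first two of four features.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Candidate values C = 10^i for i = -10 ... 10.
cs = [10.0 ** i for i in range(-10, 11)]

mean_scores = []
for c in cs:
    # Smaller C means stronger regularization in scikit-learn.
    model = LogisticRegression(C=c, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=10)
    mean_scores.append(scores.mean())

# Keep the C with the highest mean cross-validated accuracy.
best_c = cs[int(np.argmax(mean_scores))]
print(best_c, max(mean_scores))
```

`np.argmax` returns the first maximum, so ties are resolved in favor of the smaller `C`, that is, the more strongly regularized model.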

> ## Question 13.  Now use the best `C` you found above and repeat your analysis; look over your coefficients

In [None]:
# TODO

> ## Question 14.  If you want to drop 3 variables from your analysis, which variables will you choose?

In [None]:
# TODO

Answer: TODO