# Assignment 09: Logistic Regression

In [633]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

from sklearn import linear_model, cross_validation

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

## Probability, Odds, and Odds Ratios

**Probability:** The number of ways that an event can occur divided by the total number of possible outcomes.

The probability of drawing a red card from a standard deck of cards is 26/52 (50 percent).
The probability of drawing a club from that deck is 13/52 (25 percent).

> ### Question 1.  What's the probability of getting heads in a fair coin flip?

In [634]:
p=1./2.
print p

0.5


In [635]:
# The probability is 50% (1/2)

**Odds:** The ratio of the number of ways an event can occur to the number of ways it cannot occur.

For example, using the same events as above, the odds for:
- Drawing a red card from a standard deck of cards is 1:1; and
- Drawing a club from that deck is 1:3.
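
As a quick check on the card examples, the conversion between probability and odds can be sketched in plain Python. The helper names `prob_to_odds` and `odds_to_prob` are illustrative and not part of the assignment.

```python
from fractions import Fraction

def prob_to_odds(p):
    """Odds in favor of an event with probability p (requires p < 1)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability of an event given the odds in its favor."""
    return odds / (1 + odds)

# Exact fractions avoid floating-point noise in these small examples.
p_red = Fraction(26, 52)    # drawing a red card
p_club = Fraction(13, 52)   # drawing a club

odds_red = prob_to_odds(p_red)     # 1, i.e. 1:1
odds_club = prob_to_odds(p_club)   # 1/3, i.e. 1:3
```

Converting back with `odds_to_prob(odds_club)` recovers the original probability of 1/4.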

> ### Question 2.  What are the odds of getting heads in a fair coin flip?

In [636]:
p/(1-p)

1.0

> ### Question 3.  Suppose that 18 out of 20 patients in an experiment lost weight while using diet A, while 16 out of 20 lost weight using diet B.  What's the probability of weight loss with diet A?  What are the odds?

In [637]:
pA=18./20.

In [638]:
pB=16./20.

In [639]:
oA=pA/(1-pA)

In [640]:
print "Probability ", pA
print "Odds ", oA

Probability  0.9
Odds  9.0


> ### Question 4.  What's the probability of weight loss with diet B?  What are the odds?

In [641]:
oB=pB/(1-pB)

In [642]:
print "Probability ",pB
print "Odds ",oB

Probability  0.8
Odds  4.0


> ### Question 5.  What's the odds ratio?

In [643]:
print "Odds ratio: ", oA/oB

Odds ratio:  2.25
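
The same result can be reached directly from the counts, since the odds are simply successes divided by failures. This is a minimal sketch; `odds_from_counts` is a hypothetical helper name.

```python
def odds_from_counts(successes, total):
    """Odds = number of successes divided by number of failures."""
    failures = total - successes
    return float(successes) / failures

odds_a = odds_from_counts(18, 20)   # 18 lost weight : 2 did not
odds_b = odds_from_counts(16, 20)   # 16 lost weight : 4 did not
odds_ratio = odds_a / odds_b
print("Odds ratio (A vs. B): {0}".format(odds_ratio))
```

Note that the odds ratio (2.25) is much larger than the ratio of the two probabilities (0.9 / 0.8 = 1.125), so the two should not be read interchangeably.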


## Bank Marketing

In [644]:
df = pd.read_csv(r'C:\Users\Lauren\GA\Class Repo\DS-SF-32\lessons\lesson-9/dataset-09-bank-marketing.csv')

In [645]:
pd.set_option('display.max_columns', 50)
df


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,c
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


In [646]:
df.shape


(4521, 17)

> The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution.  The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`, which appears as column `c` in the file loaded above).

Attributes Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical)
    - `month`: last contact month of year (categorical)
    - `day`: last contact day of the month (numeric)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute highly affects the output target (e.g., if `duration = 0` then `y = 'no'`). Yet, the duration is not known before a call is performed.  Also, after the end of the call y is obviously known.  Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - `-1` means the client was not previously contacted
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed to a term deposit? (binary; stored as column `c` in this file)

> Our goal is to develop a model that best predicts the outcome `y`, the success of the marketing campaign.

> ## Question 6.  Remove the categorical variables with the largest number of distinct values

In [647]:
pd.set_option('display.max_rows', 50)
df.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
c            object
dtype: object

In [648]:
df.apply(pd.Series.nunique)

age            67
job            12
marital         3
education       4
default         2
balance      2353
housing         2
loan            2
contact         3
day            31
month          12
duration      875
campaign       32
pdays         292
previous       24
poutcome        4
c               2
dtype: int64

In [649]:
# Eliminate duration, day, month, and job for now

In [650]:

df = df.drop(['duration', 'day', 'month', 'job'], axis=1)


In [651]:

df.shape

(4521, 13)

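Answer: Among the categorical (`object`) columns, `job` and `month` have the most distinct values (12 each), so they are removed. `duration` is also removed because, as noted above, it is not known before a call is made, and `day` is set aside for now.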

> ## Question 7.  Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

In [652]:
print "default: ", df['default'].value_counts().idxmax(),df['default'].value_counts().max()
print "housing: ", df['housing'].value_counts().idxmax(),df['housing'].value_counts().max()
print "loan: ", df['loan'].value_counts().idxmax(),df['loan'].value_counts().max()
print "c: ", df['c'].value_counts().idxmax(),df['c'].value_counts().max()

default:  no 4445
housing:  yes 2559
loan:  no 3830
c:  no 4000
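
One way to read the rule in Question 7 is sketched below: the most frequent value becomes 0, and when that value is `yes` the column is renamed with a `_no` suffix so that 1 always means what the name says. This reading, the helper name `recode_yes_no`, and the toy lists are assumptions for illustration; the cells that follow take a different route and build both versions of every column.

```python
from collections import Counter

def recode_yes_no(name, values):
    """Return (column_name, codes) with 0 assigned to the most frequent value."""
    most_frequent = Counter(values).most_common(1)[0][0]
    if most_frequent == 'no':
        # 'no' is the reference level: yes -> 1, name unchanged.
        return name, [1 if v == 'yes' else 0 for v in values]
    # 'yes' is the reference level: no -> 1, and the name gains "_no".
    return name + '_no', [1 if v == 'no' else 0 for v in values]

# Toy data (hypothetical, not taken from the dataset).
loan_name, loan_codes = recode_yes_no('loan', ['no', 'no', 'yes'])
housing_name, housing_codes = recode_yes_no('housing', ['yes', 'no', 'yes'])
```

Under this reading `housing_no` equals 1 for a client without a housing loan, which keeps later coefficient interpretations aligned with the column name.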


In [653]:
df['default_no']=df['default']
df['housing_no']=df['housing']
df['loan_no']=df['loan']
df['c_no']=df['c']

In [654]:
df.head()

Unnamed: 0,age,marital,education,default,balance,housing,loan,contact,campaign,pdays,previous,poutcome,c,default_no,housing_no,loan_no,c_no
0,30,married,primary,no,1787,no,no,cellular,1,-1,0,unknown,no,no,no,no,no
1,33,married,secondary,no,4789,yes,yes,cellular,1,339,4,failure,no,no,yes,yes,no
2,35,single,tertiary,no,1350,yes,no,cellular,1,330,1,failure,no,no,yes,no,no
3,30,married,tertiary,no,1476,yes,yes,unknown,4,-1,0,unknown,no,no,yes,yes,no
4,59,married,secondary,no,0,yes,no,unknown,1,-1,0,unknown,no,no,yes,no,no


In [655]:
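# 'no' is the most frequent value for default, loan, and c: yes -> 1, no -> 0.
# 'yes' is the most frequent value for housing: yes -> 0, no -> 1.
df['default'].replace(['yes', 'no'], [1, 0],inplace=True)
df['housing'].replace(['yes', 'no'], [0, 1],inplace=True)
df['loan'].replace(['yes', 'no'], [1, 0],inplace=True)
df['c'].replace(['yes', 'no'], [1, 0],inplace=True)

# The *_no columns are the complements of the columns above.
# Caution: housing_no is therefore 1 when the client HAS a housing loan.
df['default_no'].replace(['yes', 'no'], [0, 1],inplace=True)
df['housing_no'].replace(['yes', 'no'], [1, 0],inplace=True)
df['loan_no'].replace(['yes', 'no'], [0, 1],inplace=True)
df['c_no'].replace(['yes', 'no'], [0, 1],inplace=True)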

In [656]:
df.head()

Unnamed: 0,age,marital,education,default,balance,housing,loan,contact,campaign,pdays,previous,poutcome,c,default_no,housing_no,loan_no,c_no
0,30,married,primary,0,1787,1,0,cellular,1,-1,0,unknown,0,1,0,1,1
1,33,married,secondary,0,4789,0,1,cellular,1,339,4,failure,0,1,1,0,1
2,35,single,tertiary,0,1350,0,0,cellular,1,330,1,failure,0,1,1,1,1
3,30,married,tertiary,0,1476,0,1,unknown,4,-1,0,unknown,0,1,1,0,1
4,59,married,secondary,0,0,0,0,unknown,1,-1,0,unknown,0,1,1,1,1


In [657]:
df.dtypes

age            int64
marital       object
education     object
default        int64
balance        int64
housing        int64
loan           int64
contact       object
campaign       int64
pdays          int64
previous       int64
poutcome      object
c              int64
default_no     int64
housing_no     int64
loan_no        int64
c_no           int64
dtype: object

> ## Question 8.  Create binary/dummy variables for the other categorical variables

In [660]:
df=df.join(pd.get_dummies(df['marital'],prefix="marital"))
df=df.join(pd.get_dummies(df['education'],prefix="education"))
df=df.join(pd.get_dummies(df['contact'],prefix="contact"))
df=df.join(pd.get_dummies(df['poutcome'],prefix="poutcome"))



In [661]:
df.head()

Unnamed: 0,age,marital,education,default,balance,housing,loan,contact,campaign,pdays,previous,poutcome,c,default_no,housing_no,loan_no,c_no,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,contact_cellular,contact_telephone,contact_unknown,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,married,primary,0,1787,1,0,cellular,1,-1,0,unknown,0,1,0,1,1,0,1,0,1,0,0,0,1,0,0,0,0,0,1
1,33,married,secondary,0,4789,0,1,cellular,1,339,4,failure,0,1,1,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0
2,35,single,tertiary,0,1350,0,0,cellular,1,330,1,failure,0,1,1,1,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0
3,30,married,tertiary,0,1476,0,1,unknown,4,-1,0,unknown,0,1,1,0,1,0,1,0,0,0,1,0,0,0,1,0,0,0,1
4,59,married,secondary,0,0,0,0,unknown,1,-1,0,unknown,0,1,1,1,1,0,1,0,0,1,0,0,0,0,1,0,0,0,1


> ## Question 9.  What should be your baseline for these binary variables (namely, which binary variables should you not include in your model)?

In [662]:
print df.marital.value_counts()
print
print df.education.value_counts()
print
print df.contact.value_counts()
print
print df.poutcome.value_counts()

married     2797
single      1196
divorced     528
Name: marital, dtype: int64

secondary    2306
tertiary     1350
primary       678
unknown       187
Name: education, dtype: int64

cellular     2896
unknown      1324
telephone     301
Name: contact, dtype: int64

unknown    3705
failure     490
other       197
success     129
Name: poutcome, dtype: int64


The most frequent category of each variable should be left out as the baseline:
- `marital`: `married`
- `education`: `secondary`
- `contact`: `cellular`
- `poutcome`: `unknown`

Since such a high proportion of `poutcome` is `unknown` (3705 of 4521), it might be best to leave that variable out altogether.
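
The baseline choice can be sketched as a small helper that, for each variable, drops the dummy of the most frequent level and keeps the rest. The counts are copied from the output above; `split_dummies` is a hypothetical name, and the column names assume the `prefix_level` pattern that `pd.get_dummies` produced earlier.

```python
# Value counts copied from the output above.
counts = {
    'marital': {'married': 2797, 'single': 1196, 'divorced': 528},
    'education': {'secondary': 2306, 'tertiary': 1350, 'primary': 678, 'unknown': 187},
    'contact': {'cellular': 2896, 'unknown': 1324, 'telephone': 301},
    'poutcome': {'unknown': 3705, 'failure': 490, 'other': 197, 'success': 129},
}

def split_dummies(variable, value_counts):
    """Return (baseline_column, columns_to_keep) for one categorical variable."""
    baseline = max(value_counts, key=value_counts.get)
    keep = sorted(variable + '_' + level
                  for level in value_counts if level != baseline)
    return variable + '_' + baseline, keep

for variable in sorted(counts):
    baseline, keep = split_dummies(variable, counts[variable])
    print("{0}: drop {1}, keep {2}".format(variable, baseline, keep))
```

Dropping one level per variable avoids perfectly collinear dummy columns, and each remaining coefficient is then read relative to the dropped level.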

In [663]:
df.drop(['marital', 'education', 'contact'], axis = 1, inplace = True)

> ## Question 10.  What input variable in the dataset seems to predict the outcome quite well.  Why?

In [664]:
df.corr().c_no.sort_values()


c                     -1.000000
poutcome_success      -0.283481
contact_cellular      -0.118761
previous              -0.116714
housing               -0.104683
pdays                 -0.104087
loan_no               -0.070517
education_tertiary    -0.056649
poutcome_other        -0.051908
marital_single        -0.045815
age                   -0.045092
marital_divorced      -0.034840
contact_telephone     -0.025878
balance               -0.017905
poutcome_failure      -0.014556
default               -0.001303
default_no             0.001303
education_unknown      0.008870
education_primary      0.027420
education_secondary    0.028744
campaign               0.061147
marital_married        0.064643
loan                   0.070517
housing_no             0.104683
contact_unknown        0.139399
poutcome_unknown       0.162038
c_no                   1.000000
Name: c_no, dtype: float64

Taken individually, `poutcome_success` has the strongest correlation with the outcome (-0.28 with `c_no`, which is +0.28 with `c`). This makes sense: if the previous campaign succeeded with a client, the current one probably will too.
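
The table above is sorted by correlation with `c_no`, which is the complement of `c`, so every sign is flipped relative to the outcome itself. A minimal sketch on toy data (the lists below are hypothetical, not rows from the dataset) shows the flip:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: outcome c, its complement c_no, and a binary feature.
c = [0, 0, 1, 0, 1, 1, 0, 0]
c_no = [1 - v for v in c]
feature = [0, 0, 1, 0, 0, 1, 0, 0]

r_c = pearson(feature, c)         # positive on this toy data
r_c_no = pearson(feature, c_no)   # same magnitude, opposite sign
```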

> ## Question 11.  Split the dataset into a training set (60%) and a testing set (the rest)

In [665]:
# In scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

In [666]:
list(df)

['age',
 'default',
 'balance',
 'housing',
 'loan',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'c',
 'default_no',
 'housing_no',
 'loan_no',
 'c_no',
 'marital_divorced',
 'marital_married',
 'marital_single',
 'education_primary',
 'education_secondary',
 'education_tertiary',
 'education_unknown',
 'contact_cellular',
 'contact_telephone',
 'contact_unknown',
 'poutcome_failure',
 'poutcome_other',
 'poutcome_success',
 'poutcome_unknown']

In [667]:
X=df[['age',
 'default_no',
 'balance',
 'housing_no',
 'loan',
 'campaign',
 'pdays',
 'previous',
 'marital_divorced',
 'marital_single']]

In [668]:
y=df['c']

In [669]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
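
For intuition, a 60/40 hold-out split can be sketched with the standard library alone. This is not scikit-learn's implementation: the shuffle will differ from `train_test_split`, and rounding the test size up is an assumption made here. `holdout_split` is a hypothetical helper.

```python
import math
import random

def holdout_split(rows, test_size=0.4, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(math.ceil(test_size * len(rows)))
    return rows[n_test:], rows[:n_test]

# Row indices stand in for the 4521 rows of the dataset.
train_rows, test_rows = holdout_split(range(4521))
print("train: {0}, test: {1}".format(len(train_rows), len(test_rows)))
```

Fixing the seed plays the same role as `random_state=0` above: rerunning the split yields the same partition.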

> ## Question 12.  Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`?

In [670]:
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()

In [671]:
# Note: fit() returns the fitted estimator, so y_predict is the same object as lm
y_predict=lm.fit(X_train,y_train)

In [672]:
print lm.coef_
print lm.intercept_


[[  6.70558229e-03  -5.41777809e-01  -2.33909148e-05  -7.12270430e-01
   -8.79073022e-01  -1.57486820e-01   1.68850516e-03   1.26090279e-01
    1.71463864e-01   3.21230082e-01]]
[-1.22553278]


> ## Question 13.  What is your training error?  What is your generalization error?  Does it make sense?

In [673]:
train_succ=y_predict.score(X_train, y_train)

In [674]:
print "Training Error: ", (1-train_succ)

Training Error:  0.114675516224


In [675]:
gen_succ=y_predict.score(X_test,y_test)

In [676]:
print "Generalization Error: ", (1-gen_succ)

Generalization Error:  0.119955776672
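
To judge whether these errors make sense, it helps to compare them with a model that always predicts `no`. The sketch below uses the counts printed earlier (4521 rows, 4000 of them `no`); the baseline is computed on the whole dataset, so the class shares inside each split may differ slightly.

```python
n_total = 4521            # rows in the dataset (df.shape)
n_no = 4000               # rows with c == 'no' (value counts in Question 7)

# Error of a model that always predicts "no".
baseline_error = float(n_total - n_no) / n_total

train_error = 0.114675516224   # printed above
test_error = 0.119955776672    # printed above

print("Baseline error: {0:.4f}".format(baseline_error))
print("Train error:    {0:.4f}".format(train_error))
print("Test error:     {0:.4f}".format(test_error))
```

Both errors sit close to the baseline, which suggests the model predicts `no` for nearly every client. The generalization error being slightly higher than the training error is the expected pattern, but accuracy alone says little on data this imbalanced.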


> ## Question 14.  Interpret your coefficients. (At least `marital_single`, `campaign`, and `default`).  Does your interpretation  make sense?

In [677]:
# exp(coef) - 1 is the fractional change in the odds for a one-unit increase
unlog=(np.exp(lm.coef_[0])-1)

zip(X, unlog)


[('age', 0.0067281150390294808),
 ('default_no', -0.41828684234210556),
 ('balance', -2.3390641204601437e-05),
 ('housing_no', -0.5094707799121847),
 ('loan', -0.58483241544637588),
 ('campaign', -0.14571192699364255),
 ('pdays', 0.0016899314879310623),
 ('previous', 0.13438457464614562),
 ('marital_divorced', 0.18704124658419663),
 ('marital_single', 0.37882278701624617)]

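These values are changes in the *odds* of subscribing (not in the probability) for a one-unit increase in each variable, holding the others fixed.

Someone who does NOT have credit in default (`default_no = 1`) has 42% lower odds of subscribing than someone with credit in default. People who make bad choices continue to make bad choices?

The per-unit effects of `balance` (per dollar) and `pdays` (per day) are tiny: about -0.002% and +0.17% in the odds, respectively. The latter means that each additional day since the client was last contacted in a previous campaign raises the odds by about 0.17%.

`housing_no` was coded in Question 7 as 1 for clients who HAVE a housing loan (`yes` was mapped to 1), so its coefficient means that someone with a housing loan has 51% lower odds of subscribing.

Someone who has a personal loan has 58% lower odds of subscribing.

For each additional contact during this campaign, the odds of subscribing fall by about 15% (i.e., if they said no the first time, they more likely meant it than not).

Each additional contact before this campaign (`previous`) raises the odds by about 13%.

Compared with married clients (the baseline), single and divorced clients have 37.9% and 18.7% higher odds of subscribing, respectively.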
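
The percentages above come from exponentiating the coefficients, and they apply to the odds rather than to the probability. The sketch below works through `campaign` using the coefficient printed earlier; the helper names and the 11.5% starting probability are illustrative assumptions.

```python
import math

def odds_change(coef, units=1):
    """Fractional change in the odds when the input rises by `units`."""
    return math.exp(coef * units) - 1

def shift_probability(p, coef, units=1):
    """New probability after scaling the odds of p by exp(coef * units)."""
    odds = p / (1 - p) * math.exp(coef * units)
    return odds / (1 + odds)

coef_campaign = -1.57486820e-01   # from lm.coef_ printed above

one_more = odds_change(coef_campaign)        # about -0.146
five_more = odds_change(coef_campaign, 5)    # effects multiply, they do not add

# Starting from a hypothetical 11.5% chance of subscribing:
p_after = shift_probability(0.115, coef_campaign)   # about 0.10
```

A 14.6% drop in the odds corresponds here to a drop from 11.5% to roughly 10% in probability, which is why "less likely" is best stated in terms of odds.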

> ## Question 15.  What is your prediction for a 30 years old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted

In [678]:
X.head()

Unnamed: 0,age,default_no,balance,housing_no,loan,campaign,pdays,previous,marital_divorced,marital_single
0,30,1,1787,0,0,1,-1,0,0,0
1,33,1,4789,1,1,1,339,4,0,0
2,35,1,1350,1,0,1,330,1,0,1
3,30,1,1476,1,1,4,-1,0,0,0
4,59,1,0,1,0,1,-1,0,0,0


In [679]:
# pdays=-1: this file marks "never contacted before" with -1 (see df.head()), not 999
pX=[[30,1,1000,1,0,3,-1,0,0,1]]

In [680]:
print lm.predict(pX)

[0]


The model predicts class `0`: this client is not expected to subscribe, so the campaign would be unsuccessful for her.
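
The predicted class follows from the estimated probability falling below 0.5. As a rough check, the sketch below recomputes that probability by hand from the coefficients and intercept printed in Question 12 (rounded as displayed, so the values are approximate). It evaluates the profile with `pdays` set to 999 and to -1, the value carried by the rows in `df.head()` whose `previous` is 0.

```python
import math

# Coefficients and intercept copied from the lm.coef_ / lm.intercept_ output.
coef = [6.70558229e-03, -5.41777809e-01, -2.33909148e-05, -7.12270430e-01,
        -8.79073022e-01, -1.57486820e-01, 1.68850516e-03, 1.26090279e-01,
        1.71463864e-01, 3.21230082e-01]
intercept = -1.22553278

def predict_proba(x):
    """P(c = 1 | x) under the logistic model: sigmoid of the linear score."""
    score = intercept + sum(b * v for b, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-score))

# Column order: age, default_no, balance, housing_no, loan, campaign,
# pdays, previous, marital_divorced, marital_single
p_999 = predict_proba([30, 1, 1000, 1, 0, 3, 999, 0, 0, 1])
p_neg1 = predict_proba([30, 1, 1000, 1, 0, 3, -1, 0, 0, 1])
print("pdays=999: {0:.3f}   pdays=-1: {1:.3f}".format(p_999, p_neg1))
```

Both probabilities are below 0.5, so the predicted class is 0 either way, although the choice of `pdays` encoding changes the estimate considerably.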