# DS-SF-42 | 10 | Logistic Regression | Assignment | Answer Key

In [2]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

from sklearn import linear_model, cross_validation

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')



## Probability, Odds, and Odds Ratios

**Probability:** The number of ways that an event can occur divided by the total number of possible outcomes.

The probability of drawing a red card from a standard deck of cards is 26/52 (50 percent).
The probability of drawing a club from that deck is 13/52 (25 percent).

> ### Question 1.  What's the probability of getting heads in a fair coin flip?

In [3]:
# One way over two outcomes

p = 1 / 2.

print p

0.5


**Odds:** The ratio of the number of ways an event can occur to the number of ways it cannot occur.

For example, using the same events as above, the odds for:
- Drawing a red card from a standard deck of cards is 1:1; and
- Drawing a club from that deck is 1:3.
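The conversion between probability and odds used in these examples can be sketched as a pair of small helpers. The function names `odds_from_probability` and `probability_from_odds` are illustrative only and do not come from any library.

```python
def odds_from_probability(p):
    """Odds in favor of an event that has probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Probability of an event whose odds in favor are `odds`."""
    return odds / (1 + odds)


# The two card examples from the text.
red_card_odds = odds_from_probability(26 / 52.)  # odds of 1:1
club_odds = odds_from_probability(13 / 52.)      # odds of 1:3

print(red_card_odds)
print(club_odds)
print(probability_from_odds(club_odds))
```

Odds of 1:3 correspond to the single number 1/3, which is why the club example prints a value below 1 while the red card example prints exactly 1.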

> ### Question 2.  What are the odds of getting heads in a fair coin flip?

In [4]:
p / (1 - p)

1.0

> ### Question 3.  Suppose that 18 out of 20 patients in an experiment lost weight while using diet A, while 16 out of 20 lost weight using diet B.  What's the probability of weight loss with diet A?  What are the odds?

In [5]:
prob_A = 18 / 20.
odds_A = prob_A / (1 - prob_A)

print prob_A
print odds_A

0.9
9.0


> ### Question 4.  What's the probability of weight loss with diet B?  What are the odds?

In [6]:
prob_B = 16 / 20.
odds_B = prob_B / (1 - prob_B)

print prob_B
print odds_B

0.8
4.0


> ### Question 5.  What's the odds ratio?

In [7]:
odds_A / odds_B

2.25
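The same odds ratio can be computed directly from the success and failure counts of the two diets. This is a minimal sketch; `odds_ratio` is a hypothetical helper name, not a library function.

```python
def odds_ratio(successes_a, failures_a, successes_b, failures_b):
    """Odds ratio of group A to group B from success/failure counts."""
    odds_a = successes_a / float(failures_a)
    odds_b = successes_b / float(failures_b)
    return odds_a / odds_b


# Diet A: 18 lost weight, 2 did not.  Diet B: 16 lost weight, 4 did not.
diet_odds_ratio = odds_ratio(18, 2, 16, 4)

print(diet_odds_ratio)
```

An odds ratio above 1 means the odds of weight loss are higher with diet A than with diet B. For a binary input in a logistic regression, the exponentiated coefficient is an odds ratio of this kind.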

## Bank Marketing

In [38]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-11-bank-marketing.csv'))

In [39]:
df

| | age | job | marital | education | default | ... | campaign | pdays | previous | poutcome | c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | ... | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | ... | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | ... | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | ... | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | ... | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no | ... | 5 | -1 | 0 | unknown | no |
| 4517 | 57 | self-employed | married | tertiary | yes | ... | 1 | -1 | 0 | unknown | no |
| 4518 | 57 | technician | married | secondary | no | ... | 11 | -1 | 0 | unknown | no |
| 4519 | 28 | blue-collar | married | secondary | no | ... | 4 | 211 | 3 | other | no |


> The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution.  The classification goal is to predict if the client will subscribe a term deposit (variable y).

Attributes Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical)
    - `month`: last contact month of year (categorical)
    - `day`: last contact day of the month (numeric)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute highly affects the output target (e.g., if `duration = 0` then `y = 'no'`). Yet, the duration is not known before a call is performed.  Also, after the end of the call y is obviously known.  Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - 999 means client was not previously contacted (in the rows displayed above, such clients appear with `pdays = -1` instead)
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed a term deposit? (binary)

> Our goal is to develop a model that best predicts the outcome `y` (stored as column `c` in this dataset), the success of the marketing campaign.

> ## Question 6.  Remove the categorical variable with the largest number of distinct values

In [40]:
len(df.job.value_counts())

12

Answer: `job` is the variable with the largest number of distinct values (12).

In [41]:
df.drop('job', axis = 1, inplace = True)

> ## Question 7.  Recode all `yes`/`no` categorical variables so that `0` is the most frequent value and `1` is the other.  If `1` ends up meaning `no`, append `"_no"` to the variable name; otherwise leave the name unchanged

### `default`

In [42]:
df.default.value_counts()

no     4445
yes      76
Name: default, dtype: int64

In [43]:
df.default = df.default.apply(lambda value: 0 if value == 'no' else 1)

In [44]:
df.default.value_counts()

0    4445
1      76
Name: default, dtype: int64

### `housing`

In [45]:
df.housing.value_counts()

yes    2559
no     1962
Name: housing, dtype: int64

In [46]:
df['housing_no'] = df.housing.apply(lambda value: 0 if value == 'yes' else 1)

In [47]:
df.drop('housing', axis = 1, inplace = True)

In [48]:
df.housing_no.value_counts()

0    2559
1    1962
Name: housing_no, dtype: int64

### `loan`

In [49]:
df.loan.value_counts()

no     3830
yes     691
Name: loan, dtype: int64

In [50]:
df.loan = df.loan.apply(lambda value: 0 if value == 'no' else 1)

In [51]:
df.loan.value_counts()

0    3830
1     691
Name: loan, dtype: int64

### `c`

In [52]:
df.c.value_counts()

no     4000
yes     521
Name: c, dtype: int64

In [53]:
df.c = df.c.apply(lambda value: 0 if value == 'no' else 1)

In [54]:
df.c.value_counts()

0    4000
1     521
Name: c, dtype: int64

> ## Question 8.  Create binary/dummy variables for the other categorical variables

In [55]:
marital_df = pd.get_dummies(df.marital, prefix = 'marital')
education_df = pd.get_dummies(df.education, prefix = 'education')
contact_df = pd.get_dummies(df.contact, prefix = 'contact')

df = df.join([marital_df, education_df, contact_df])

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 26 columns):
age                    4521 non-null int64
marital                4521 non-null object
education              4521 non-null object
default                4521 non-null int64
balance                4521 non-null int64
loan                   4521 non-null int64
contact                4521 non-null object
day                    4521 non-null int64
month                  4521 non-null object
duration               4521 non-null int64
campaign               4521 non-null int64
pdays                  4521 non-null int64
previous               4521 non-null int64
poutcome               4521 non-null object
c                      4521 non-null int64
housing_no             4521 non-null int64
marital_divorced       4521 non-null uint8
marital_married        4521 non-null uint8
marital_single         4521 non-null uint8
education_primary      4521 non-null uint8
education_secondary    4521 

> ## Question 9.  What should be your baseline for these binary variables (namely, which binary variables should you not include in your model)?

In [26]:
df.marital.value_counts()

married     2797
single      1196
divorced     528
Name: marital, dtype: int64

In [27]:
df.education.value_counts()

secondary    2306
tertiary     1350
primary       678
unknown       187
Name: education, dtype: int64

In [28]:
df.contact.value_counts()

cellular     2896
unknown      1324
telephone     301
Name: contact, dtype: int64

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 22 columns):
age                    4521 non-null int64
default                4521 non-null int64
balance                4521 non-null int64
loan                   4521 non-null int64
day                    4521 non-null int64
month                  4521 non-null object
campaign               4521 non-null int64
pdays                  4521 non-null int64
previous               4521 non-null int64
poutcome               4521 non-null object
c                      4521 non-null int64
housing_no             4521 non-null int64
marital_divorced       4521 non-null uint8
marital_married        4521 non-null uint8
marital_single         4521 non-null uint8
education_primary      4521 non-null uint8
education_secondary    4521 non-null uint8
education_tertiary     4521 non-null uint8
education_unknown      4521 non-null uint8
contact_cellular       4521 non-null uint8
contact_telephone      4521 non

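Answer: Use `married` (`marital`), `secondary` (`education`), and `cellular` (`contact`) as the baselines, since each is the most frequent value of its variable.  That means leaving `marital_married`, `education_secondary`, and `contact_cellular` out of the model.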
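A baseline is needed because the dummies of one variable always sum to 1, which makes the full set perfectly collinear with the intercept. The sketch below builds the dummies and drops the most frequent level; `dummies_without_baseline` is a hypothetical helper and the `toy` series stands in for a column such as `df.marital`.

```python
import pandas as pd


def dummies_without_baseline(series, prefix):
    """Dummy-code `series` and drop the column of its most frequent value."""
    baseline = series.value_counts().idxmax()
    dummies = pd.get_dummies(series, prefix=prefix)
    kept = dummies.drop(prefix + '_' + str(baseline), axis=1)
    return kept, baseline


toy = pd.Series(['married', 'single', 'married', 'divorced'])
marital_dummies, marital_baseline = dummies_without_baseline(toy, 'marital')

print(marital_baseline)
print(list(marital_dummies.columns))
```

Each remaining coefficient is then read relative to the baseline level, for example single versus married.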

In [36]:
# errors = 'ignore' keeps this cell safe to re-run once the columns are gone
df.drop(['marital', 'education', 'contact'], axis = 1, inplace = True, errors = 'ignore')

> ## Question 10.  Which input variable in the dataset seems to predict the outcome quite well?  Why?

In [30]:
df.corr().c.sort_values(ascending = False).head()

c                   1.000000
duration            0.401118
contact_cellular    0.118761
previous            0.116714
housing_no          0.104683
Name: c, dtype: float64

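Answer: `duration`, which has by far the strongest correlation with `c`.  As noted in the documentation (you've read it, right?), the duration is only known once the call is over, when the outcome is known too, so we should discard it when building a realistic predictive model.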

In [31]:
df.drop('duration', axis = 1, inplace = True)

In [32]:
df

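| | age | default | balance | loan | day | ... | education_tertiary | education_unknown | contact_cellular | contact_telephone | contact_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 0 | 1787 | 0 | 19 | ... | 0 | 0 | 1 | 0 | 0 |
| 1 | 33 | 0 | 4789 | 1 | 11 | ... | 0 | 0 | 1 | 0 | 0 |
| 2 | 35 | 0 | 1350 | 0 | 16 | ... | 1 | 0 | 1 | 0 | 0 |
| 3 | 30 | 0 | 1476 | 1 | 3 | ... | 1 | 0 | 0 | 0 | 1 |
| 4 | 59 | 0 | 0 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 33 | 0 | -333 | 0 | 30 | ... | 0 | 0 | 1 | 0 | 0 |
| 4517 | 57 | 1 | -3313 | 1 | 9 | ... | 1 | 0 | 0 | 0 | 1 |
| 4518 | 57 | 0 | 295 | 0 | 19 | ... | 0 | 0 | 1 | 0 | 0 |
| 4519 | 28 | 0 | 1137 | 0 | 6 | ... | 0 | 0 | 1 | 0 | 0 |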


> ## Question 11.  Split the dataset into a training set (60%) and a testing set (the rest)

In [31]:
train_df = df.sample(frac = .6, random_state = 0)
test_df = df.drop(train_df.index)

> ## Question 12.  Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing` (as `housing_no`), `loan`, `campaign`, `pdays`, and `previous`

In [32]:
train_df.columns

Index([u'age', u'default', u'balance', u'loan', u'day', u'month', u'campaign',
       u'pdays', u'previous', u'poutcome', u'c', u'housing_no',
       u'marital_divorced', u'marital_married', u'marital_single',
       u'education_primary', u'education_secondary', u'education_tertiary',
       u'education_unknown', u'contact_cellular', u'contact_telephone',
       u'contact_unknown'],
      dtype='object')

In [33]:
names_X = ['age', 'marital_single', 'marital_divorced',
    'default', 'balance', 'housing_no',
    'loan', 'campaign', 'pdays', 'previous']

def X_c(df):
    X = df[ names_X ]
    c = df.c
    return X, c

train_X, train_c = X_c(train_df)
test_X, test_c = X_c(test_df)

In [34]:
model = linear_model.LogisticRegression().\
    fit(train_X, train_c)

print model.intercept_
print model.coef_

[-2.87337543]
[[  9.28109163e-03   4.63880352e-01   4.51928003e-01  -2.77116205e-03
    2.38113081e-05   6.89237037e-01  -6.67197003e-01  -6.27832989e-02
    1.66268724e-03   7.64833853e-02]]


> ## Question 13.  What is your training error?  What is your generalization error?  Does it make sense?

In [35]:
print 'training misclassification =', 1 - model.score(train_X, train_c)
print 'testing  misclassification =', 1 - model.score(test_X, test_c)

training misclassification = 0.116844821231
testing  misclassification = 0.114491150442


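Answer: Here the generalization (testing) error, about 11.4%, is slightly lower than the training error, about 11.7%.  We would normally expect the opposite, since the model is fit to the training set.  The gap is small, though, and for a simple model with only ten inputs it is plausibly due to the random split, so the result still makes sense.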
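One way to put these error rates in context is a majority-class baseline: a classifier that always predicts `0`. The sketch below uses the class counts of `c` printed earlier; note that they cover the whole dataset rather than the training or testing split, so the comparison is only approximate.

```python
# Class counts of the recoded outcome `c` (whole dataset).
n_no, n_yes = 4000, 521

# Always predicting the majority class ("no") is wrong on every "yes" row.
baseline_error = n_yes / float(n_no + n_yes)

# Misclassification rates copied from the output above.
training_error = 0.116844821231
testing_error = 0.114491150442

print(baseline_error)
print(training_error - baseline_error)
print(testing_error - baseline_error)
```

Differences close to zero would suggest that, at the default 0.5 threshold, the model rarely predicts `1` and does little better than the baseline on accuracy alone.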

> ## Question 14.  Interpret your coefficients (at least `marital_single`, `campaign`, and `default`).  Does your interpretation make sense?

In [1]:
# Relative change in the odds per unit increase of each input: exp(coefficient) - 1
zip(names_X, np.exp(model.coef_[0]) - 1)

Answer: The odds that single individuals are targeted successfully by this campaign are 59% higher than for married people.  Perhaps married people need to consult with their spouse, whereas singles can decide on their own.

Every extra time a client is contacted, the odds of the campaign succeeding decrease by 6%; maybe clients get frustrated over time.

The odds of success for clients who have defaulted are only about 0.3% lower than for those who have not (`exp(-0.00277) - 1`), which is practically no effect.  With only 76 defaulters in the dataset, this coefficient likely tells us little either way.
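The percentage changes in the odds can be recomputed from the coefficients printed under Question 12, without the fitted `model` object. This is a sketch that copies those printed values by hand; `odds_change` is an illustrative name.

```python
import math

names = ['age', 'marital_single', 'marital_divorced', 'default', 'balance',
         'housing_no', 'loan', 'campaign', 'pdays', 'previous']

# Coefficients copied from the printed `model.coef_`, in the order of `names`.
coefficients = [9.28109163e-03, 4.63880352e-01, 4.51928003e-01,
                -2.77116205e-03, 2.38113081e-05, 6.89237037e-01,
                -6.67197003e-01, -6.27832989e-02, 1.66268724e-03,
                7.64833853e-02]

# exp(coefficient) is the odds ratio for a one-unit increase of the input;
# subtracting 1 turns it into a relative change in the odds.
odds_change = {name: math.exp(coefficient) - 1
               for name, coefficient in zip(names, coefficients)}

for name in ('marital_single', 'campaign', 'default'):
    print('{0}: {1:+.1%}'.format(name, odds_change[name]))
```

For a numeric input such as `campaign`, the change applies per additional contact; for a dummy such as `marital_single`, it compares that group with the baseline group.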

> ## Question 15.  What is your prediction for a 30-year-old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted?

In [37]:
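# Values follow the order of names_X:
# age, marital_single, marital_divorced, default, balance,
# housing_no, loan, campaign, pdays, previous
predict_X = [ [30, 1, 0, 0, 1000, 1, 0, 3, 999, 0] ]

print model.predict(predict_X)
print model.predict_proba(predict_X)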

[1]
[[ 0.48633823  0.51366177]]


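Answer: The model predicts that the campaign will succeed for this client (class `1`), with an estimated probability of about 51%.  Note that the input encodes "never contacted before" as `pdays = 999`, following the documentation, and assumes `campaign = 3`.  Since the rows displayed earlier use `pdays = -1` for clients who were not previously contacted, the prediction is sensitive to that choice.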
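The predicted probability can be reproduced by hand from the printed intercept and coefficients, which makes it clear how each input enters the prediction. This is a sketch that copies those printed values; `predicted_probability` is a hypothetical helper, and the alternative input with `pdays = -1` is an assumption based on the sentinel value seen in the displayed rows.

```python
import math

# Intercept and coefficients copied from the printed model output, in the
# order: age, marital_single, marital_divorced, default, balance,
# housing_no, loan, campaign, pdays, previous.
intercept = -2.87337543
coefficients = [9.28109163e-03, 4.63880352e-01, 4.51928003e-01,
                -2.77116205e-03, 2.38113081e-05, 6.89237037e-01,
                -6.67197003e-01, -6.27832989e-02, 1.66268724e-03,
                7.64833853e-02]


def predicted_probability(features):
    """Logistic regression prediction: sigmoid of intercept + coefficients . features."""
    log_odds = intercept + sum(c * v for c, v in zip(coefficients, features))
    return 1 / (1 + math.exp(-log_odds))


# Same input as in the answer above (pdays = 999).
x = [30, 1, 0, 0, 1000, 1, 0, 3, 999, 0]
p_success = predicted_probability(x)
print(p_success)

# Assumption: encode "never contacted" as pdays = -1, as in the displayed data.
x_alt = list(x)
x_alt[8] = -1
p_alt = predicted_probability(x_alt)
print(p_alt)
```

Because `pdays` has a positive coefficient, moving it from 999 to -1 lowers the log-odds by roughly 1.66, which is enough to move the predicted class from `1` to `0` at the default 0.5 threshold.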