# Compare logistic regression and kNN

This is an open-ended lab.

You will:

1. Load in the humor styles dataset (create target, normalize predictors)
- Do EDA on predictors
- Select predictors of interest
- Load KNeighborsClassifier and LogisticRegression from sklearn
- Compare performance between the two using stratified cross-validation
- [Optional bonus] Plot the results of kNN vs. Logistic regression using the plotting functions I wrote yesterday and today. You may have to modify the functions to work for you.

---

### Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

1. I usually don't laugh or joke with other people.
2. If I feel depressed, I can cheer myself up with humor.
3. If someone makes a mistake, I will tease them about it.
4. I let people laugh at me or make fun of me at my expense more than I should.
5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
6. Even when I'm alone, I am often amused by the absurdities of life.
7. People are never offended or hurt by my sense of humor.
8. I will often get carried away in putting myself down if it makes family or friends laugh.
9. I rarely make other people laugh by telling funny stories about myself.
10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
13. I laugh and joke a lot with my closest friends.
14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
15. I do not like it when people use humor as a way of criticizing or putting someone down.
16. I don't often say funny things to put myself down.
17. I usually don't like to tell jokes or amuse people.
18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
21. I enjoy making people laugh.
22. If I am feeling sad or upset, I usually lose my sense of humor.
23. I never participate in laughing at others even if all my friends are doing it.
24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
25. I don't often joke around with my friends.
26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
27. If I don't like someone, I often use humor or teasing to put them down.
28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
29. I usually can't think of witty things to say when I'm with other people.
30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes (a Likert scale) that correspond to different answers. A sixth code, -1, indicates that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
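
Because `-1` is a code rather than a point on the scale, leaving it in place would distort means and distances. Here is a minimal sketch of one way to handle it, using a small made-up frame. The column names `Q1`-`Q3` and the values are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy responses; -1 marks "did not select an answer".
responses = pd.DataFrame({
    'Q1': [2, -1, 5],
    'Q2': [3, 4, -1],
    'Q3': [1, 2, 3],
})

# Treat the -1 code as missing instead of as a value below "Never".
cleaned = responses.replace(-1, np.nan)

# Count missing answers per question, then keep only complete rows.
missing_per_question = cleaned.isnull().sum()
complete = cleaned.dropna()

print(missing_per_question)
print(complete)
```

Dropping incomplete rows is only one option; whether to drop or impute is up to you.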
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from a drop-down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate they thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

---

### 1. Load humor styles dataset

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
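# Note: these two modules were removed in scikit-learn 0.20. On newer versions,
# import all three names from sklearn.model_selection instead (where
# StratifiedKFold takes n_splits and a separate split(X, y) call).
from sklearn.cross_validation import cross_val_score, StratifiedKFold
from sklearn.grid_search import GridSearchCV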

In [2]:
hsq = pd.read_csv('../Datasets/hsq_data.csv')

---

### 2. Create a target and predictor matrix

Target and predictors are up to you. 

In [5]:
hsq.head(2).T

|     | 0   | 1   |
|-----|-----|-----|
| Q1  | 2.0 | 2.0 |
| Q2  | 2.0 | 3.0 |
| Q3  | 3.0 | 2.0 |
| Q4  | 1.0 | 2.0 |
| Q5  | 4.0 | 4.0 |
| Q6  | 5.0 | 4.0 |
| Q7  | 4.0 | 4.0 |
| Q8  | 3.0 | 3.0 |
| Q9  | 4.0 | 4.0 |
| Q10 | 3.0 | 3.0 |


In [38]:
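# Target: 1 if the subject rated their own accuracy above 80, else 0
y = hsq['accuracy'].map(lambda x: 1 if x > 80 else 0)
# Predictors: the first 32 columns (Q1-Q32); .iloc replaces the deprecated .ix
x = hsq.iloc[:, :32].values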

---

### 3. Perform any EDA you deem relevant on your predictors and target

In [39]:
y.value_counts()

1    767
0    304
Name: accuracy, dtype: int64
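
The class counts above also fix the accuracy that any model has to beat: always predicting the majority class. A quick check, using only the two counts printed above:

```python
# Class counts copied from the value_counts() output above.
n_positive = 767  # accuracy rating > 80
n_negative = 304

# Accuracy of a model that always predicts the majority class.
baseline = max(n_positive, n_negative) / float(n_positive + n_negative)
print(round(baseline, 6))  # 0.716153
```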

---

### 4. Perform stratified cross-validation on a KNN classifier and logistic regression.

1. Gridsearch the best KNN parameters.

Note: cross_val_score conveniently does stratification for you when you have a categorical target. :/ So much for forcing you to practice StratifiedKFold...
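
To see the stratification note in action, here is a sketch using the current scikit-learn API (`sklearn.model_selection`, which replaced `sklearn.cross_validation` from version 0.18 on). The data are synthetic, generated with `make_classification`, and only stand in for the lab's predictors. Passing an integer `cv` with a classifier should give the same folds as an explicit, unshuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the humor styles data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.3, 0.7], random_state=0)

model = LogisticRegression(max_iter=1000)

# An integer cv with a classifier and a categorical target is stratified.
implicit_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The same thing spelled out with an explicit splitter.
explicit_scores = cross_val_score(model, X_demo, y_demo,
                                  cv=StratifiedKFold(n_splits=5))

print(np.allclose(implicit_scores, explicit_scores))
```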

In [40]:
# Set up our grid search parameters
search_parameters = {
    'n_neighbors':  range(1,101), 
    'weights':      ['uniform', 'distance']
}

# Initialize a blank model object
knn = KNeighborsClassifier()

# Initialize gridsearch
grid = GridSearchCV(knn, search_parameters, cv=5)
knn_best = grid.fit(x,y).best_estimator_
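
In current scikit-learn, `GridSearchCV` lives in `sklearn.model_selection`. A minimal sketch of the same kind of search on synthetic data follows; the smaller `n_neighbors` range is only there to keep the example fast, and the `_demo` names are made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the humor styles data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

param_grid = {
    'n_neighbors': list(range(1, 16)),
    'weights': ['uniform', 'distance'],
}

# 5-fold (stratified) search over every parameter combination.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)   # chosen combination
print(search.best_score_)    # its mean cross-validated accuracy

# By default the best model is refit on all of X_demo, y_demo.
knn_demo_best = search.best_estimator_
```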

In [41]:
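# Old-API StratifiedKFold (sklearn.cross_validation): takes the labels directly
# and iterates over (train, test) index arrays
cv_indices = StratifiedKFold(y, n_folds=5)

knn_scores = []
log_scores = []

log = LogisticRegression()

for train, test in cv_indices:
    x_train = x[train, :]
    x_test = x[test, :]
    # .iloc selects by position, which is what the fold indices are
    y_train = y.iloc[train]
    y_test = y.iloc[test]

    # Refit both models on this fold's training split
    knn_best.fit(x_train, y_train)
    log.fit(x_train, y_train)

    # Score on the held-out split
    knn_scores.append(knn_best.score(x_test, y_test))
    log_scores.append(log.score(x_test, y_test))

print(knn_scores)
print(log_scores)
print('KNN Score:  %s' % np.mean(knn_scores))
print('Log Score:  %s' % np.mean(log_scores))
# Baseline: the share of the positive (majority) class
print('Baseline:  %s' % np.mean(y))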

[0.73023255813953492, 0.71162790697674416, 0.72429906542056077, 0.71962616822429903, 0.71361502347417838]
[0.68372093023255809, 0.67906976744186043, 0.71495327102803741, 0.72429906542056077, 0.71361502347417838]
KNN Score:  0.719880144447
Log Score:  0.703131611519
Baseline:  0.716153127918
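
For reference, here is the same fold-by-fold loop in the `sklearn.model_selection` API, where the splitter is built first and the data are passed to `split()`. This is a sketch on synthetic data with an arbitrary `n_neighbors=15`, so its scores are unrelated to the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the humor styles data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.3, 0.7], random_state=0)

models = {
    'knn': KNeighborsClassifier(n_neighbors=15),
    'log': LogisticRegression(max_iter=1000),
}
scores = {name: [] for name in models}

# New API: configure the splitter, then hand the data to split().
folds = StratifiedKFold(n_splits=5)
for train_idx, test_idx in folds.split(X_demo, y_demo):
    for name, model in models.items():
        model.fit(X_demo[train_idx], y_demo[train_idx])
        scores[name].append(model.score(X_demo[test_idx], y_demo[test_idx]))

for name in models:
    print(name, np.mean(scores[name]))

# Majority-class rate: the accuracy of always predicting the larger class.
print('baseline', max(np.mean(y_demo), 1 - np.mean(y_demo)))
```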


---

### 5. Regularization with logistic regression

Since logistic regression _is_ a regression, it can use the Lasso and Ridge penalties.

The `penalty` keyword argument can be set to `l2` for Ridge and `l1` for Lasso. 

Note: the solver has to support the penalty you pick. Set `solver='liblinear'` if you're going to use the Lasso penalty! (Newer scikit-learn versions also offer `saga` for this.)

**`C` is the regularization strength for LogisticRegression, but IT IS THE INVERSE OF ALPHA: `C = 1/alpha`, so a smaller `C` means stronger regularization. I don't know why they did this; it's stupid.**
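
A small sketch of what "inverse" means in practice, on synthetic standardized data: a smaller `C` is a stronger penalty, so the coefficients shrink. The two `C` values are arbitrary, and the data are not the humor styles data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized so the penalty treats columns equally.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

coef_size = {}
for C in [0.01, 100.0]:
    # liblinear handles both the l1 (Lasso) and l2 (Ridge) penalties.
    ridge = LogisticRegression(penalty='l2', C=C, solver='liblinear')
    lasso = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    ridge.fit(X_demo, y_demo)
    lasso.fit(X_demo, y_demo)
    coef_size[C] = {
        'ridge_abs_sum': np.abs(ridge.coef_).sum(),
        'lasso_abs_sum': np.abs(lasso.coef_).sum(),
        'lasso_zeros': int((lasso.coef_ == 0).sum()),
    }
    print(C, coef_size[C])
```

The `lasso_zeros` count is the part to watch: the Lasso penalty can set coefficients exactly to zero, which Ridge generally does not.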

1. Select everything but your target to be predictors.
- Normalize the predictors!
- Gridsearch the LogisticRegression with regularization.
- Gridsearch the KNN.
- Compare their cross-validated accuracies.

In [66]:
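#1
# Everything except the target column is a predictor
colvars = [col for col in hsq.columns if col != 'accuracy']
x = hsq[colvars].values

#2
from sklearn.preprocessing import StandardScaler

# Standardize each column to mean 0 and standard deviation 1
s = StandardScaler()

norm_x = s.fit_transform(x)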

In [47]:
#3
# Set up our grid search parameters
search_parameters = {
    'penalty':  ['l1','l2'],
    'solver':  ['liblinear']
}

# Initialize a blank model object
log = LogisticRegression()

# Initialize gridsearch
grid = GridSearchCV(log, search_parameters, cv=5)
log_best = grid.fit(norm_x,y).best_estimator_
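
The grid above varies only the penalty type, so `C` stays at its default of 1. Here is a sketch of a search that also tunes the strength, written against the current `sklearn.model_selection` API on synthetic data. The `C` grid is an arbitrary choice, and wrapping the scaler in a `Pipeline` (so it is refit on each training fold) is a design choice of this sketch, not something the lab does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the humor styles data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('log', LogisticRegression(solver='liblinear')),
])

# Step-name prefix 'log__' routes each parameter to the LogisticRegression step.
param_grid = {
    'log__penalty': ['l1', 'l2'],
    'log__C': np.logspace(-3, 3, 7),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```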

In [48]:
# Set up our grid search parameters
search_parameters = {
    'n_neighbors':  range(1,101), 
    'weights':      ['uniform', 'distance']
}

# Initialize a blank model object
knn = KNeighborsClassifier()

# Initialize gridsearch
grid = GridSearchCV(knn, search_parameters, cv=5)
knn_best = grid.fit(norm_x,y).best_estimator_

In [50]:
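# Old-API StratifiedKFold (sklearn.cross_validation): takes the labels directly
# and iterates over (train, test) index arrays
cv_indices = StratifiedKFold(y, n_folds=5)

knn_scores = []
log_scores = []

for train, test in cv_indices:
    x_train = norm_x[train, :]
    x_test = norm_x[test, :]
    # .iloc selects by position, which is what the fold indices are
    y_train = y.iloc[train]
    y_test = y.iloc[test]

    # Refit the grid-searched models on this fold's training split
    knn_best.fit(x_train, y_train)
    log_best.fit(x_train, y_train)

    # Score on the held-out split
    knn_scores.append(knn_best.score(x_test, y_test))
    log_scores.append(log_best.score(x_test, y_test))

print(knn_scores)
print(log_scores)
print('KNN Score:  %s' % np.mean(knn_scores))
print('Log Score:  %s' % np.mean(log_scores))
# Baseline: the share of the positive (majority) class
print('Baseline:  %s' % np.mean(y))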

[0.7069767441860465, 0.68837209302325586, 0.71028037383177567, 0.70560747663551404, 0.69483568075117375]
[0.69767441860465118, 0.67441860465116277, 0.71028037383177567, 0.71962616822429903, 0.70892018779342725]
KNN Score:  0.701214473686
Log Score:  0.702183950621
Baseline:  0.716153127918


---

### 6. Explain why that regularization for logistic regression may have been chosen. Print out the most important variables for predicting your target from logistic regression.

In [63]:
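# Use a separate loop variable so the predictor matrix x is not overwritten
predictors = [col for col in hsq.columns if col != 'accuracy']
results = pd.DataFrame({'Coef': log_best.coef_[0],
                        'Predictor': predictors,
                        'Abs Coef': np.abs(log_best.coef_[0])})
# Largest absolute coefficients first
results.sort_values(by='Abs Coef', ascending=False)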

|    | Abs Coef | Coef      | Predictor   |
|----|----------|-----------|-------------|
| 18 | 0.233267 | 0.233267  | Q19         |
| 32 | 0.227369 | 0.227369  | affiliative |
| 29 | 0.226971 | 0.226971  | Q30         |
| 19 | 0.168119 | -0.168119 | Q20         |
| 21 | 0.167344 | 0.167344  | Q22         |
| 20 | 0.154513 | -0.154513 | Q21         |
| 13 | 0.139602 | -0.139602 | Q14         |
| 26 | 0.134658 | -0.134658 | Q27         |
| 4  | 0.127458 | 0.127458  | Q5          |
| 16 | 0.12716  | -0.12716  | Q17         |


The fairly weak Ridge penalty (l2 at the default `C=1`) suggests to me that the predictors are reasonably independent, with some multicollinearity, and that all of them are reasonably useful for out-of-sample prediction.

---

### 7. Re-run a (non-regularized) logistic regression with only centered predictors (not normalized). Interpret the baseline probability and the effect of one of your predictors.

**sklearn's LogisticRegression actually uses l2 Ridge regularization by default with `C=1`! To "turn it off" set `C=1e10`.**

1. Fit the logistic regression using centered predictors.
- Write a function to turn coefficient results to probability (logistic transformation).
- Describe the baseline probability.
- Plot the distribution of one of your predictors.
- Describe based on the coefficient of the predictor the effect on probability of your target variable.
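
The logistic transformation in step 2 maps a log-odds value `z` to a probability, `p = 1 / (1 + exp(-z))`. A minimal sketch follows; the intercept and coefficient are made-up numbers for illustration, not fitted values from this lab.

```python
import numpy as np

def logistic(z):
    """Convert log-odds z to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted values, for illustration only.
intercept = 1.0   # log-odds when every centered predictor is 0 (at its mean)
coef = -0.2       # change in log-odds per one-point increase in a predictor

baseline_prob = logistic(intercept)
shifted_prob = logistic(intercept + coef)

print(baseline_prob)                  # probability at the predictor means
print(shifted_prob - baseline_prob)   # effect of a one-point increase
```

Note that the change in probability depends on where you start: the same coefficient moves the probability most near 0.5 and least near 0 or 1.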

In [69]:
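# Center each predictor on its own column mean
# (x.mean() with no axis would subtract one global mean from every column)
center_x = x - x.mean(axis=0)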

Yes, the baseline probability is different than the mean of the target. This can happen! It's not wrong.
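
One way to see why: with centered predictors the intercept is the log-odds for a subject at the mean of every predictor, and the logistic curve is nonlinear, so the probability at the average predictor values is not the average of the individual probabilities. A sketch on simulated data, where the true intercept of 1 and slope of 3 are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Simulate one predictor and a binary target from a known logistic model.
x_sim = rng.normal(size=5000)
true_prob = 1.0 / (1.0 + np.exp(-(1.0 + 3.0 * x_sim)))
y_sim = (rng.uniform(size=5000) < true_prob).astype(int)

# Center by the column mean, then fit with regularization effectively off.
X_centered = (x_sim - x_sim.mean()).reshape(-1, 1)
model = LogisticRegression(C=1e10, max_iter=1000)
model.fit(X_centered, y_sim)

baseline_prob = 1.0 / (1.0 + np.exp(-model.intercept_[0]))
print(baseline_prob)   # probability for a subject at the predictor mean
print(y_sim.mean())    # overall share of positives
```

In this simulation the baseline probability comes out noticeably higher than the share of positives, even though the model is correctly specified.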