# Compare logistic regression and kNN

This is an open-ended lab.

You will:

1. Load in the humor styles dataset (create target, normalize predictors)
- Do EDA on predictors
- Select predictors of interest
- Load KNeighborsClassifier and LogisticRegression from sklearn
- Compare performance between the two using stratified cross-validation
- [Optional bonus] Plot the results of kNN vs. Logistic regression using the plotting functions I wrote yesterday and today. You may have to modify the functions to work for you.

---

### Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

1. I usually don't laugh or joke with other people.
2. If I feel depressed, I can cheer myself up with humor.
3. If someone makes a mistake, I will tease them about it.
4. I let people laugh at me or make fun of me at my expense more than I should.
5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
6. Even when I'm alone, I am often amused by the absurdities of life.
7. People are never offended or hurt by my sense of humor.
8. I will often get carried away in putting myself down if it makes family or friends laugh.
9. I rarely make other people laugh by telling funny stories about myself.
10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
13. I laugh and joke a lot with my closest friends.
14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
15. I do not like it when people use humor as a way of criticizing or putting someone down.
16. I don't often say funny things to put myself down.
17. I usually don't like to tell jokes or amuse people.
18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
21. I enjoy making people laugh.
22. If I am feeling sad or upset, I usually lose my sense of humor.
23. I never participate in laughing at others even if all my friends are doing it.
24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
25. I don't often joke around with my friends.
26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
27. If I don't like someone, I often use humor or teasing to put them down.
28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
29. I usually can't think of witty things to say when I'm with other people.
30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes (a Likert scale) that correspond to different answers. A sixth code, -1, indicates that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
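
Because -1 is a code rather than a real answer, leaving it in place would pull question averages downward. A minimal sketch of one way to handle it, using a made-up two-question frame (`responses` and its values are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey: two question columns, -1 = no answer
responses = pd.DataFrame({"Q1": [1, 5, -1, 3], "Q2": [2, -1, 4, 4]})

# Recode the "did not select an answer" sentinel as NaN
cleaned = responses.replace(-1, np.nan)

n_missing = cleaned.isnull().sum()   # unanswered rows per question
raw_mean = responses["Q1"].mean()    # treats -1 as if it were an answer
q1_mean = cleaned["Q1"].mean()       # NaN rows are skipped by default

print(n_missing)
print(raw_mean, q1_mean)             # 2.0 3.0
```

Whether to drop, impute, or keep those rows is a modeling decision; the sketch only makes the missing answers visible.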
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from a drop-down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate subjects thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

---

### 1. Load humor styles dataset

In [12]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Cross-validation and grid-search utilities live in sklearn.model_selection
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split

In [13]:
hsq = pd.read_csv('/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/humor_styles/hsq_data.csv')

In [34]:
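# Binary target: 1 = male, 0 = everyone else (female, other, declined)
hsq['clean_gender'] = hsq['gender'].map(lambda x: 1 if x == 1 else 0)
hsq.head()

# Four humor-style scores, self-reported accuracy, and the target
# ('agressive' is how the column is spelled in the dataset)
subset = hsq[['affiliative', 'selfenhancing', 'agressive', 'selfdefeating', 'accuracy', 'clean_gender']]
subset.isnull().sum()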

affiliative      0
selfenhancing    0
agressive        0
selfdefeating    0
accuracy         0
clean_gender     0
dtype: int64

---

### 2. Create a target and predictor matrix

Target and predictors are up to you. 

In [37]:
# Target: gender (1 = male). Predictors: the four humor-style scores plus accuracy.
# The trailing "- 1" drops patsy's intercept column from X.
formula = 'clean_gender ~ affiliative + selfenhancing + agressive + selfdefeating + accuracy - 1'
y, X = patsy.dmatrices(formula, subset)
print(X.shape, type(X), y.shape, type(y))

print(subset.clean_gender.value_counts())

(1071, 5) <class 'patsy.design_info.DesignMatrix'> (1071, 1) <class 'patsy.design_info.DesignMatrix'>
1    581
0    490
Name: clean_gender, dtype: int64
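
The class counts above set the bar any classifier has to clear: always predicting the majority class is already right more than half the time. A quick sketch of that arithmetic, with the counts copied from the output (the variable names are just for illustration):

```python
# Class counts copied from the value_counts() output above
n_class_1, n_class_0 = 581, 490

# Accuracy of a "model" that always predicts the more common class
baseline_accuracy = max(n_class_1, n_class_0) / float(n_class_1 + n_class_0)
print(round(baseline_accuracy, 4))  # 0.5425
```

Cross-validated accuracies later in the lab are easier to judge against this number than against 0.5.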


---

### 3. Perform any EDA you deem relevant on your predictors and target
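
One possible starting point, sketched on synthetic stand-in data (the `demo` frame and its numbers are invented; swap in `subset` from the cells above to run it on the real data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two predictors and the binary target
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "affiliative": rng.normal(4.0, 0.7, size=200),
    "selfenhancing": rng.normal(3.4, 0.7, size=200),
    "clean_gender": rng.randint(0, 2, size=200),
})

summary = demo.describe()                            # ranges, means, spread
group_means = demo.groupby("clean_gender").mean()    # predictor means per class
correlations = demo.corr()                           # pairwise linear relationships

print(summary)
print(group_means)
print(correlations)
```

Large differences in `group_means` hint at predictors that separate the classes; large off-diagonal values in `correlations` hint at redundant predictors.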

---

### 4. Perform stratified cross-validation on a KNN classifier and logistic regression.

1. Gridsearch the best KNN parameters.

Note: when you pass an integer `cv` and the estimator is a classifier, cross_val_score conveniently stratifies the folds for you. :/ So much for forcing you to practice StratifiedKFold...
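
For practice anyway, here is a small sketch showing that an explicit `StratifiedKFold` gives the same scores as `cv=5` for a classifier. The data comes from `make_classification`, so `X_demo` and `y_demo` are stand-ins rather than the lab's matrices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# Integer cv: stratification happens implicitly for classifiers
scores_int = cross_val_score(knn, X_demo, y_demo, cv=5)

# Explicit stratified splitter with the same number of folds
skf = StratifiedKFold(n_splits=5)
scores_skf = cross_val_score(knn, X_demo, y_demo, cv=skf)
same = np.allclose(scores_int, scores_skf)

# Each test fold keeps roughly the overall share of class 1
fold_ratios = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]

print(same)
print(fold_ratios, y_demo.mean())
```

Building the splitter yourself becomes useful once you want shuffling or a fixed `random_state`, which the integer shortcut does not expose.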

In [47]:
knn_blank = KNeighborsClassifier()

params = {
    'n_neighbors': range(1, 50),
    'weights': ['uniform', 'distance'],
}

# 5-fold grid search over k and the neighbor weighting scheme
estimator = GridSearchCV(knn_blank, params, cv=5)
results = estimator.fit(X, np.ravel(y))
print(results.best_params_)
print(results.best_score_)


{'n_neighbors': 35, 'weights': 'uniform'}
0.566760037348


In [53]:
# Cross-validate the grid-search winner (k=35)
knnC = KNeighborsClassifier(n_neighbors=35, weights='uniform')
scores = cross_val_score(knnC, X, np.ravel(y), cv=5)

print(scores.mean())

# Cross-validate a small k (k=3) for comparison
knnC = KNeighborsClassifier(n_neighbors=3, weights='uniform')
scores = cross_val_score(knnC, X, np.ravel(y), cv=5)

print(scores.mean())
#logres_model = logres.fit(X,np.ravel(y))
#scores = cross_val_score(logres_model,X,y, cv=None)

0.566781134536
0.535018474245
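
The cell above only scores kNN; the logistic regression lines are commented out. A hedged sketch of scoring both models the same way, on synthetic data (`X_demo`, `y_demo`, and the `models` dict are illustrative names, and the scores it prints say nothing about the humor styles data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the predictor matrix and binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=3)

models = {
    "knn": KNeighborsClassifier(n_neighbors=35, weights="uniform"),
    "logreg": LogisticRegression(max_iter=1000),
}

# Same 5-fold (stratified) cross-validation for every model
mean_scores = {}
for name, model in models.items():
    mean_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(mean_scores)
```

Note that `cross_val_score` takes an unfitted estimator and refits it on each training fold, so there is no need to call `.fit` first.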


---

### 5. Regularization with logistic regression

Since logistic regression _is_ a regression, it can use the Lasso and Ridge penalties.

The `penalty` keyword argument can be set to `l2` for Ridge and `l1` for Lasso. 

Note: you must choose a solver that supports it, such as `solver='liblinear'` (`'saga'` also works in recent versions), if you're going to use the Lasso penalty!

**`C` sets the regularization strength for LogisticRegression, but IT IS THE INVERSE OF ALPHA: `C = 1/alpha`, so a smaller `C` means a stronger penalty. I don't know why they did this; it's stupid.**
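
To see the inverse relationship in action, the sketch below fits the default (ridge-penalized) model at several values of `C` and records the size of the coefficient vector. The data is synthetic and `coef_norms` is just a helper name for this example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

coef_norms = {}
for C in [1e-4, 1e-2, 1, 100]:
    # Default penalty is ridge (l2); only the strength changes here
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.coef_)

for C in sorted(coef_norms):
    print(C, coef_norms[C])
```

The smallest `C` should print the smallest norm: the penalty is strongest there, so the coefficients are pulled hardest toward zero.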

1. Select everything but your target to be predictors.
- Normalize the predictors!
- Gridsearch the LogisticRegression with regularization.
- Gridsearch the KNN.
- Compare their cross-validated accuracies.

In [95]:
from sklearn.preprocessing import StandardScaler

# Use every column except the raw and recoded gender as predictors
columns = [x for x in hsq.columns.values if x not in ('gender', 'clean_gender')]
print(columns)
formula = 'clean_gender ~ ' + ' + '.join(columns)

y, X = patsy.dmatrices(formula, hsq)

# Normalize the predictors; the target is only flattened to 1-D, not scaled
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)
y_norm = np.ravel(y)

# Grid search the logistic regression over penalty type and strength
logres = LogisticRegression(solver='liblinear')

params = {
    'C': [0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
    'penalty': ['l1', 'l2'],
}

estimator = GridSearchCV(logres, params, cv=5)
results = estimator.fit(X_norm, y_norm)
print(results.best_params_)
#{'penalty': 'l2', 'C': 1e-05}

# Grid search the kNN over k and the neighbor weighting scheme
knn_blank = KNeighborsClassifier()

params = {
    'n_neighbors': range(1, 50),
    'weights': ['uniform', 'distance'],
}

estimator = GridSearchCV(knn_blank, params, cv=5)
results = estimator.fit(X_norm, y_norm)
print(results.best_params_)
#{'n_neighbors': 42, 'weights': 'uniform'}

# Compare cross-validated accuracy of the two tuned models
logres = LogisticRegression(solver='liblinear', penalty='l2', C=0.00001)
scores = cross_val_score(logres, X_norm, y_norm, cv=5)
print(scores.mean())

knnC = KNeighborsClassifier(n_neighbors=42, weights='uniform')
scores = cross_val_score(knnC, X_norm, y_norm, cv=5)
print(scores.mean())

['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11', 'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21', 'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 'Q32', 'affiliative', 'selfenhancing', 'agressive', 'selfdefeating', 'age', 'accuracy']
{'penalty': 'l2', 'C': 1e-05}
{'n_neighbors': 42, 'weights': 'uniform'}
0.588250380352
0.605020647685
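
The cell above scales the full matrix once, before cross-validation, so every fold's scaling has seen its own validation rows. A variant, not what the cell does, is to put the scaler inside a `Pipeline` so it is refit on each training fold. A minimal sketch on synthetic data, where the step names `scale` and `logreg` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # refit on each training fold
    ("logreg", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"logreg__C": [0.001, 0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

With only centering and scaling the difference is usually small, but the pipeline form carries over unchanged to preprocessing steps where leakage matters more.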



---

### 6. Explain why that regularization for logistic regression may have been chosen. Print out the most important variables for predicting your target from logistic regression.

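The grid search picked ridge with `C=1e-05`. Because `C` is the inverse of the penalty strength, that is a very strong penalty, not a weak one: the coefficients are shrunk nearly to zero. That suggests the predictors carry only weak out-of-sample signal for gender, which fits the cross-validated accuracy of about 0.59 against a majority-class rate of about 0.54 (581 of 1071). Ridge winning over lasso may also mean that no small subset of predictors stands out enough for the rest to be dropped.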
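
For the "most important variables" part, one common approach is to rank standardized coefficients by absolute value. A sketch on synthetic data where the answer is known in advance (the feature names `strong`, `weak`, and `noise` are invented for the example):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: one strong predictor, one weak predictor, one pure noise
rng = np.random.RandomState(0)
names = ["strong", "weak", "noise"]
X_demo = rng.normal(size=(500, 3))
logits = 2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1]
y_demo = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Standardize so coefficient magnitudes are comparable across predictors
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(StandardScaler().fit_transform(X_demo), y_demo)

# Rank predictors by absolute coefficient, keeping the sign for interpretation
ranking = pd.Series(model.coef_[0], index=names)
ranking = ranking.reindex(ranking.abs().sort_values(ascending=False).index)
print(ranking)
```

On the lab data the same pattern would use the fitted grid-search model and the patsy column names; with a penalty as strong as `C=1e-05`, expect every coefficient to be tiny, so the ranking is only a relative ordering.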

---

### 7. Re-run a (non-regularized) logistic regression with only centered predictors (not normalized). Interpret the baseline probability and the effect of one of your predictors.

**sklearn's LogisticRegression actually uses l2 Ridge regularization by default with `C=1`! To "turn it off" set `C=1e10`.**

1. Fit the logistic regression using centered predictors.
- Write a function to turn coefficient results to probability (logistic transformation).
- Describe the baseline probability.
- Plot the distribution of one of your predictors.
- Describe based on the coefficient of the predictor the effect on probability of your target variable.

Yes, the baseline probability is different than the mean of the target. This can happen! It's not wrong.
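
The steps above can be sketched on synthetic data as follows. The `logistic` helper, the single age-like predictor, and the true coefficients are all assumptions made for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Turn a log-odds value into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: one predictor with mean 30, true slope 0.2 on the log-odds scale
rng = np.random.RandomState(0)
x = rng.normal(loc=30.0, scale=5.0, size=(400, 1))
true_logit = -0.5 + 0.2 * (x[:, 0] - 30.0)
y_demo = (rng.uniform(size=400) < logistic(true_logit)).astype(int)

# Center (but do not scale) the predictor, then fit with the penalty effectively off
x_centered = x - x.mean(axis=0)
model = LogisticRegression(C=1e10, max_iter=1000)
model.fit(x_centered, y_demo)

intercept = model.intercept_[0]
slope = model.coef_[0, 0]

baseline = logistic(intercept)          # P(y=1) when the predictor is at its mean
plus_one = logistic(intercept + slope)  # P(y=1) one unit above the mean

print(baseline, plus_one, y_demo.mean())
```

With centered predictors the intercept is the log-odds for a subject at the mean of every predictor, so `baseline` is that subject's probability. It generally differs from `y_demo.mean()` because the logistic function is nonlinear: the probability at the average predictor value is not the average of the probabilities.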