<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating Classification Models on Humor Styles Data

---

In this lab you will practice evaluating classification models (logistic regression in particular) on data from a "Humor Styles" survey.

The survey is designed to measure which "style" of humor each subject has. Your goal is to classify gender using the responses to the survey.

## Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

    1. I usually don't laugh or joke with other people.
    2. If I feel depressed, I can cheer myself up with humor.
    3. If someone makes a mistake, I will tease them about it.
    4. I let people laugh at me or make fun of me at my expense more than I should.
    5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
    6. Even when I'm alone, I am often amused by the absurdities of life.
    7. People are never offended or hurt by my sense of humor.
    8. I will often get carried away in putting myself down if it makes family or friends laugh.
    9. I rarely make other people laugh by telling funny stories about myself.
    10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
    11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
    12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
    13. I laugh and joke a lot with my closest friends.
    14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
    15. I do not like it when people use humor as a way of criticizing or putting someone down.
    16. I don't often say funny things to put myself down.
    17. I usually don't like to tell jokes or amuse people.
    18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
    19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
    20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
    21. I enjoy making people laugh.
    22. If I am feeling sad or upset, I usually lose my sense of humor.
    23. I never participate in laughing at others even if all my friends are doing it.
    24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
    25. I don't often joke around with my friends.
    26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
    27. If I don't like someone, I often use humor or teasing to put them down.
    28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
    29. I usually can't think of witty things to say when I'm with other people.
    30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
    31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
    32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes on a Likert scale, each corresponding to a different answer. A separate code indicates that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from a drop-down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate subjects thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

### 1. Load the data and perform any EDA and cleaning you think is necessary.

It is worth reading over the description of the data columns above for this.

In [2]:
hsq = pd.read_csv('../../../../resource-datasets/humor_styles/hsq_data.csv')

In [3]:
hsq.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,...,4,2,2,4.0,3.5,3.0,2.3,25,2,100
1,2,3,2,2,4,4,4,3,4,3,...,4,3,1,3.3,3.5,3.3,2.4,44,2,90
2,3,4,3,3,4,4,3,1,2,4,...,5,4,2,3.9,3.9,3.1,2.3,50,1,75
3,3,3,3,4,3,5,4,3,-1,4,...,5,3,3,3.6,4.0,2.9,3.3,30,2,85
4,1,4,2,2,3,5,4,1,4,4,...,5,4,2,4.1,4.1,2.9,2.0,52,1,80


In [4]:
hsq.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
count,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,...,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0
mean,2.02521,3.34267,3.078431,2.8338,3.59944,4.152194,3.277311,2.535014,2.582633,2.869281,...,3.945845,2.767507,2.838469,4.010644,3.375537,2.956583,2.762745,70.966387,1.455649,87.542484
std,1.075782,1.112898,1.167877,1.160252,1.061281,0.979315,1.099974,1.23138,1.22453,1.205013,...,1.135189,1.309601,1.233889,0.708479,0.661533,0.41087,0.645982,1371.989249,0.522076,12.038483
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,1.3,0.0,0.0,0.0,14.0,0.0,2.0
25%,1.0,3.0,2.0,2.0,3.0,4.0,3.0,2.0,2.0,2.0,...,3.0,2.0,2.0,3.6,2.9,2.8,2.3,18.5,1.0,80.0
50%,2.0,3.0,3.0,3.0,4.0,4.0,3.0,2.0,2.0,3.0,...,4.0,3.0,3.0,4.1,3.4,3.0,2.8,23.0,1.0,90.0
75%,3.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,3.0,4.0,...,5.0,4.0,4.0,4.5,3.8,3.3,3.1,31.0,2.0,95.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.1,5.0,5.0,5.0,44849.0,3.0,100.0


In [5]:
hsq1 = hsq.copy()
hsq1.age.unique()

array([   25,    44,    50,    30,    52,    27,    34,    18,    33,
          26,    29,    36,    21,    20,    23,    70,    17,    16,
          39,    61,    69,    22,    38,    24,    14,    40,    62,
          51,    35,    46,    42,    19,    32,    15,    37,    45,
          28,    49,    31,    64,    54,    68,    48,    60,    43,
          41,    53,    58,   242,   151,    55,    67,    47,    56,
          59,    66,  2670,    57, 44849])
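
The output above shows implausible ages (151, 242, 2670, 44849), and the `describe()` table shows the `-1` "no answer" code among the responses. The following is a minimal cleaning sketch on a toy frame; the `toy` values are made up for illustration, and whether to drop or impute the missing answers afterwards is left open.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hsq (hypothetical values, same column conventions).
toy = pd.DataFrame({
    "Q1": [2, -1, 3, 4],
    "Q2": [5, 3, -1, 1],
    "age": [25, 44, 30, 44849],
    "gender": [2, 1, 3, 1],
})

question_cols = [c for c in toy.columns if c.startswith("Q")]

# Treat the -1 "no answer" code as missing rather than as a scale value.
cleaned = toy.copy()
cleaned[question_cols] = cleaned[question_cols].replace(-1, np.nan)

# Drop implausible ages and keep only male (1) / female (2) respondents.
cleaned = cleaned[(cleaned["age"] < 100) & cleaned["gender"].isin([1, 2])]

n_missing = int(cleaned[question_cols].isna().sum().sum())
print(cleaned)
print("Missing responses remaining:", n_missing)
```

Leaving `-1` in place would let the model treat "no answer" as a value below "Never or very rarely true", which is probably not intended.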

### 2. Set up a predictor matrix to predict `gender` (only male vs. female)

Choice of predictors is up to you. Justify which variables you include.

In [10]:
# Keep plausible ages and only male (1) / female (2) respondents.
hsq2 = hsq1[(hsq1.age < 100) & (hsq1.gender < 3) & (hsq1.gender > 0)]

In [11]:
y = hsq2.gender

In [12]:
X = hsq2.iloc[:, 0:32]  # the 32 question columns, Q1 through Q32

In [15]:
y.unique()


array([2, 1])

In [16]:
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X, y)
print('Logreg intercept:', logreg.intercept_)
print('Logreg coef(s):', logreg.coef_)
print('Logreg predicted probabilities:\n',
      logreg.predict_proba(X.iloc[:20, :]))


Logreg intercept: [-0.00681334]
Logreg coef(s): [[-0.16588357 -0.0112434   0.02441069 -0.12331563 -0.12255179 -0.01439628
   0.09489175  0.01346129 -0.17943148  0.10414146 -0.10739441 -0.0699568
  -0.04054762 -0.02618937  0.32398275 -0.08226194  0.14912382  0.02843625
  -0.04526799 -0.14688563  0.02011174  0.0309076  -0.06134705  0.04171899
  -0.07434897 -0.10072189  0.05229257  0.12522163  0.11320027  0.11484742
   0.01768282 -0.0514806 ]]
Logreg predicted probabilities:
 [[0.4109474  0.5890526 ]
 [0.54919189 0.45080811]
 [0.52398679 0.47601321]
 [0.31385943 0.68614057]
 [0.39909682 0.60090318]
 [0.49358859 0.50641141]
 [0.7123967  0.2876033 ]
 [0.51441004 0.48558996]
 [0.43337362 0.56662638]
 [0.8102242  0.1897758 ]
 [0.39077021 0.60922979]
 [0.74712569 0.25287431]
 [0.49136932 0.50863068]
 [0.64021451 0.35978549]
 [0.61660485 0.38339515]
 [0.52927774 0.47072226]
 [0.54717157 0.45282843]
 [0.53575238 0.46424762]
 [0.735029   0.264971  ]
 [0.57851633 0.42148367]]


In [19]:
y.value_counts() / len(y)  # baseline is 54.8%; the model's predicted probabilities do not show much better performance

1    0.548815
2    0.451185
Name: gender, dtype: float64

In [21]:
logreg.score(X, y)  # for a logistic model, score means accuracy:
# the model predicts the correct label only about 63.4% of the time, which is still not very useful.

0.6341232227488152

### 3. Fit a Logistic Regression model and compare your cross-validated accuracy to the baseline.

In [25]:
from sklearn.model_selection import cross_val_score
accs = cross_val_score(logreg, X, y, cv=10)
print(accs)
print(np.mean(accs))


[0.58490566 0.66037736 0.56603774 0.52830189 0.5754717  0.55660377
 0.60952381 0.59047619 0.63809524 0.50961538]
0.581940873591817
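
For reference, the mean cross-validated accuracy above (about 0.582) is roughly 3 percentage points above the 0.549 majority-class baseline. One way to make that comparison explicit is to cross-validate a majority-class classifier on the same folds. This is a sketch on synthetic data: `X_demo` and `y_demo` are stand-ins generated by `make_classification`, not the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=10).mean()

# Same folds, real model.
model = LogisticRegression(solver="lbfgs", max_iter=1000)
model_acc = cross_val_score(model, X_demo, y_demo, cv=10).mean()

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"logistic accuracy: {model_acc:.3f}")
print(f"improvement:       {model_acc - baseline_acc:.3f}")
```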


### 4. Create a 50-50 train-test split. Fit the model on the training data and get the predictions and predicted probabilities on the test data.

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)


In [32]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
logreg.fit(X_train, y_train)
print('Logreg intercept:', logreg.intercept_)
print('Logreg coef(s):', logreg.coef_)
print('Logreg predicted probabilities:\n',
      logreg.predict_proba(X_train.iloc[:20, :]))

(527, 32) (527,)
(528, 32) (528,)
Logreg intercept: [0.15234476]
Logreg coef(s): [[-0.13530662  0.11387695 -0.03339743 -0.18016455 -0.06245739 -0.03625247
   0.04812582 -0.16983769 -0.15076175 -0.08555681 -0.13835816 -0.10339659
  -0.13104694 -0.03168305  0.37773253 -0.25344161  0.11842657  0.14383018
  -0.06027062 -0.11255341  0.18128376  0.01239012 -0.14795583  0.00899901
  -0.102986   -0.00693224  0.06765283  0.23609587  0.1813141   0.11527139
   0.01473543 -0.00538234]]
Logreg predicted probabilities:
 [[0.30149682 0.69850318]
 [0.53729877 0.46270123]
 [0.72103184 0.27896816]
 [0.59884607 0.40115393]
 [0.51456286 0.48543714]
 [0.32207039 0.67792961]
 [0.41658554 0.58341446]
 [0.80128874 0.19871126]
 [0.37122638 0.62877362]
 [0.46077811 0.53922189]
 [0.56679514 0.43320486]
 [0.82740004 0.17259996]
 [0.91647167 0.08352833]
 [0.60344869 0.39655131]
 [0.6189977  0.3810023 ]
 [0.51782553 0.48217447]
 [0.34880673 0.65119327]
 [0.72209191 0.27790809]
 [0.56152268 0.43847732]
 [0.45815656 

In [33]:
logreg.score(X_train, y_train)  # training accuracy: about 3 percentage points better than the previous result

0.6603415559772297

### 5. Manually calculate the true positives, false positives, true negatives, and false negatives.

In [None]:
# A:
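
One possible approach, sketched on small hand-made arrays: `y_true` and `y_pred` below are illustrative values, not model output, and label 1 (male) is assumed to be the "positive" class. In the lab they would be replaced by `y_test` and `logreg.predict(X_test)`.

```python
import numpy as np

# Illustrative labels using the survey coding (1=male, 2=female).
y_true = np.array([1, 2, 1, 1, 2, 2, 1, 2])
y_pred = np.array([1, 1, 2, 1, 2, 1, 1, 2])
positive = 1  # assumption: label 1 is treated as the positive class

tp = int(np.sum((y_true == positive) & (y_pred == positive)))
fp = int(np.sum((y_true != positive) & (y_pred == positive)))
tn = int(np.sum((y_true != positive) & (y_pred != positive)))
fn = int(np.sum((y_true == positive) & (y_pred != positive)))

print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```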

### 6. Construct the confusion matrix. 

In [None]:
# A:
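
A sketch using `sklearn.metrics.confusion_matrix` on small hand-made arrays (illustrative values, not model output). Passing `labels=[1, 2]` fixes the row and column order, with true labels on the rows and predicted labels on the columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Illustrative labels using the survey coding (1=male, 2=female).
y_true = np.array([1, 2, 1, 1, 2, 2, 1, 2])
y_pred = np.array([1, 1, 2, 1, 2, 1, 1, 2])

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

cm_df = pd.DataFrame(cm,
                     index=["true_male", "true_female"],
                     columns=["pred_male", "pred_female"])
print(cm_df)
```

With label 1 as the positive class, the top-left cell holds the true positives and the bottom-left cell the false positives.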

### 7. Print out the false positive count as you change your threshold for predicting label 1.

In [None]:
# A:
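
A sketch of a threshold sweep, assuming synthetic data in place of the survey: `X_demo` and `y_demo` are generated stand-ins, recoded to 1/2 to mimic the gender coding. A sample is predicted as label 1 when its predicted probability for class 1 meets the threshold, so the false positive count can only fall or stay flat as the threshold rises.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)

# predict_proba columns follow model.classes_; pick the column for label 1.
col_1 = list(model.classes_).index(1)
proba_1 = model.predict_proba(X_te)[:, col_1]

fp_counts = []
for threshold in np.linspace(0.1, 0.9, 9):
    pred_is_1 = proba_1 >= threshold
    # False positive: predicted label 1 but the true label is not 1.
    fp = int(np.sum(pred_is_1 & (y_te != 1)))
    fp_counts.append(fp)
    print(f"threshold={threshold:.1f}  false positives={fp}")
```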

### 8. Plot an ROC curve using your predicted probabilities on the test data.

Calculate the area under the curve.

> *Hint: go back to the lesson to find code for plotting the ROC curve.*

In [None]:
from sklearn.metrics import roc_curve, auc

In [None]:
# A:
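
A minimal ROC sketch on synthetic data (`X_demo` and `y_demo` are generated stand-ins, recoded to 1/2). Because the labels are 1 and 2 rather than 0 and 1, `roc_curve` needs an explicit `pos_label`; label 1 is assumed to be the positive class here.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)

# Score each test row by its predicted probability of label 1.
col_1 = list(model.classes_).index(1)
proba_1 = model.predict_proba(X_te)[:, col_1]

fpr, tpr, thresholds = roc_curve(y_te, proba_1, pos_label=1)
roc_auc = auc(fpr, tpr)

# In a notebook with %matplotlib inline the figure renders at the end of the cell.
fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```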

### 9. Cross-validate a logistic regression with a Ridge penalty.

Logistic regression can also use the Ridge penalty. Sklearn's [`LogisticRegressionCV`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) class will help you cross-validate an appropriate regularization strength.

**Important `LogisticRegressionCV` arguments:**
- `penalty`: this can be one of `'l1'` or `'l2'`. L1 is the Lasso, and L2 is the Ridge.
- `Cs`: How many different (automatically-selected) regularization strengths should be tested.
- `cv`: How many cross-validation folds should be used to test regularization strength.
- `solver`: When using the lasso penalty, this should be set to `'liblinear'`

> **Note:** The `C` regularization strength is the *inverse* of alpha. That is to say, `C = 1./alpha`

In [None]:
from sklearn.linear_model import LogisticRegressionCV

In [None]:
# A:
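
A sketch of the cross-validated Ridge fit on synthetic data (`X_demo` and `y_demo` are generated stand-ins). L2 is the default penalty of `LogisticRegressionCV`, so `penalty` is left unset; `np.ravel` is used when reading `C_` as a precaution in case its shape differs between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)

# Cs=10 tests ten regularization strengths; L2 (Ridge) is the default penalty.
ridge_cv = LogisticRegressionCV(Cs=10, cv=5, solver="lbfgs", max_iter=1000)
ridge_cv.fit(X_tr, y_tr)

best_C = float(np.ravel(ridge_cv.C_)[0])
test_acc = ridge_cv.score(X_te, y_te)

print("best C:", best_C)
print("equivalent alpha (1 / C):", 1.0 / best_C)
print("test accuracy:", test_acc)
```

A small `best_C` means strong regularization (large alpha); a large one means the penalty is doing little.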

#### 9.A Calculate the predicted labels and predicted probabilities on the test set with the Ridge logistic regression.

In [None]:
# A:

#### 9.B Construct the confusion matrix for the Ridge LR.

In [None]:
# A:
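
A sketch that computes the predicted labels and probabilities from 9.A and feeds the labels into a confusion matrix. The data is synthetic (`X_demo` and `y_demo` are generated stand-ins), and the model is refit here only so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)
ridge_cv = LogisticRegressionCV(Cs=10, cv=5, solver="lbfgs", max_iter=1000)
ridge_cv.fit(X_tr, y_tr)

pred_labels = ridge_cv.predict(X_te)
pred_proba = ridge_cv.predict_proba(X_te)  # columns follow ridge_cv.classes_

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_te, pred_labels, labels=ridge_cv.classes_)
cm_df = pd.DataFrame(cm,
                     index=[f"true_{c}" for c in ridge_cv.classes_],
                     columns=[f"pred_{c}" for c in ridge_cv.classes_])
print(cm_df)
```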

### 10. Plot the ROC curve for the original and Ridge logistic regressions on the same plot.

Which performs better?

In [None]:
# A:
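
A sketch that overlays both curves, again on synthetic stand-in data with label 1 assumed positive. Note that scikit-learn's `LogisticRegression` already applies an L2 penalty with `C=1.0` by default, so this comparison is between the default strength and a cross-validated one.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)

models = {
    "default LR (C=1)": LogisticRegression(solver="lbfgs", max_iter=1000),
    "Ridge LR (CV-tuned C)": LogisticRegressionCV(
        Cs=10, cv=5, solver="lbfgs", max_iter=1000),
}

fig, ax = plt.subplots(figsize=(6, 6))
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    col_1 = list(model.classes_).index(1)
    scores = model.predict_proba(X_te)[:, col_1]
    fpr, tpr, _ = roc_curve(y_te, scores, pos_label=1)
    aucs[name] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.3f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
print(aucs)
```

Small AUC differences on a single 50-50 split may not be meaningful; repeating the split with different seeds gives a sense of the variation.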

### 11. Cross-validate a Lasso logistic regression.

**Hint:**
- `penalty` must be set to `'l1'`
- `solver` must be set to `'liblinear'`

> **Note:** The lasso penalty can be considerably slower. You may want to try fewer Cs or use fewer cv folds.

In [None]:
# A:
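
A sketch of the Lasso fit using the arguments from the hint, on synthetic stand-in data. Fewer `Cs` and folds are used to keep it quick. Recent scikit-learn releases may emit a deprecation warning for the `penalty` argument, so check this against the installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the survey data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1)

# L1 (Lasso) penalty requires a solver that supports it, such as liblinear.
lasso_cv = LogisticRegressionCV(Cs=5, cv=3, penalty="l1", solver="liblinear")
lasso_cv.fit(X_tr, y_tr)

best_C = float(np.ravel(lasso_cv.C_)[0])
n_zero = int(np.sum(lasso_cv.coef_ == 0))
test_acc = lasso_cv.score(X_te, y_te)

print("best C:", best_C)
print("coefficients set exactly to zero:", n_zero, "of", lasso_cv.coef_.size)
print("test accuracy:", test_acc)
```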

### 12. Make the confusion matrix for the Lasso model.

In [None]:
# A:

### 13. Plot all three logistic regression models on the same ROC plot.

Which is the best (if any)?

In [None]:
# A:

### 14. Look at the coefficients for the Lasso logistic regression model. Which variables are the most important ones?

In [None]:
# A:
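
One way to rank the variables is by absolute coefficient size, which is reasonable here because all 32 questions share the same 1 to 5 scale. The sketch below uses synthetic data with hypothetical column names `Q1` to `Q32`, so the ranking it prints says nothing about the real survey. For a binary model, a positive coefficient raises the log-odds of the second entry in `classes_` (label 2 under the survey coding).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in with hypothetical column names Q1..Q32.
feature_names = [f"Q{i}" for i in range(1, 33)]
X_raw, y_demo = make_classification(
    n_samples=600, n_features=32, n_informative=6,
    n_clusters_per_class=1, random_state=0)
X_demo = pd.DataFrame(X_raw, columns=feature_names)
y_demo = y_demo + 1  # recode 0/1 to 1/2 to mimic the gender coding

lasso_cv = LogisticRegressionCV(Cs=5, cv=3, penalty="l1", solver="liblinear")
lasso_cv.fit(X_demo, y_demo)

# One coefficient per question; sort by magnitude, keeping the sign.
coefs = pd.Series(lasso_cv.coef_[0], index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked.head(10))
print("dropped by the Lasso:", list(coefs[coefs == 0].index))
```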