## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer: 
1. Is left-handedness more common among artistic people?
2. Is an aggressive personality associated with left-handedness?
3. Are left-handed people more analytical?


---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression, LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import os
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


In [2]:
lefthand = pd.read_csv('/Users/pwalesdi/4.01-lab-classification_model_comparison/data_1.txt', sep='\t')
lefthand.head()


Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [3]:
lefthand.groupby('hand')['hand'].agg(['count'])


Unnamed: 0_level_0,count
hand,Unnamed: 1_level_1
0,11
1,3542
2,452
3,179


In [4]:
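# Note: range(1, 44) yields Q1 through Q43, so Q44 is not included; use range(1, 45) to include it.
features = ['Q%s' % i for i in range(1,44)]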
lefthandcorr_1 = lefthand[features].corr()

In [41]:
lefthand.corr()[['hand']].sort_values('hand', ascending=False)

Unnamed: 0,hand
hand,1.0
age,0.030882
Q26,0.018077
education,0.017367
Q25,0.015104
orientation,0.014769
Q5,0.01439
Q21,0.013723
Q31,0.011751
Q17,0.006905


In [6]:
# plt.figure(figsize = (40, 35))
# mask = np.zeros_like(lefthandcorr_1)
# mask[np.triu_indices_from(mask)] = True
# sns.set(font_scale = 2)
# ax = sns.heatmap(lefthandcorr_1, mask=mask, annot=True, cmap='RdYlBu', vmax=1, vmin=-1, 
# square=False, linewidths=1.5,  cbar_kws={"shrink": 1.0}, xticklabels='auto')
# plt.xticks(rotation=45);

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

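Answer: Classification. The target is a discrete category (left-handed or not) rather than a continuous quantity, so we are predicting class membership.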

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer: 
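
As a rough illustration of why scale matters for a distance-based model such as $k$-NN, the sketch below compares Euclidean distances before and after z-scoring. The three respondents and their values are hypothetical (a 1-5 survey answer paired with an elapsed time in seconds); they are not taken from the lab data.

```python
import math
import statistics

# Hypothetical toy respondents: (survey answer on a 1-5 scale, elapsed time in seconds).
rows = [(1, 230.0), (5, 240.0), (1, 1070.0)]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(data):
    """Z-score each column: subtract the column mean, divide by the column std."""
    columns = list(zip(*data))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in data]

# Rows 0 and 1 give opposite survey answers; rows 0 and 2 give the same answer.
raw_01 = euclidean(rows[0], rows[1])
raw_02 = euclidean(rows[0], rows[2])

scaled = standardize(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_01:.2f} vs {raw_02:.2f}")
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```

On the raw values, the elapsed-time column dominates the distance simply because its units are larger. After standardizing, both columns contribute on a comparable footing.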

### 7. Give an example of when we might not standardize our variables.

Answer: 

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer: 

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: 

In [7]:
# sns.pairplot(lefthand, hue='hand');

In [8]:
# Keep only right-handed (1) and left-handed (2) respondents;
# drop missing (0) and ambidextrous (3). Copy to avoid chained-assignment warnings.
lefthand = lefthand.loc[lefthand['hand'].isin([1, 2])].copy()

# Recode the target so that 0 = right-handed and 1 = left-handed.
lefthand['hand'] = lefthand['hand'].map({1: 0, 2: 1})


In [68]:
lefthand.groupby('hand')['hand'].agg(['count'])


Unnamed: 0_level_0,count
hand,Unnamed: 1_level_1
0,3542
1,452



### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: 
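
One concern with an even $k$ in a two-class problem is that the neighbor vote can tie. The sketch below is a minimal illustration with hypothetical neighbor labels; the `vote` helper is made up for this example and is not part of scikit-learn.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return the winning label, or None when the top two labels are tied."""
    ranked = Counter(neighbor_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical labels of the nearest neighbors (0 = right-handed, 1 = left-handed).
even_k = [0, 1, 1, 0]       # k = 4: two votes each
odd_k = [0, 1, 1, 0, 0]     # k = 5: three votes to two

winner_even = vote(even_k)
winner_odd = vote(odd_k)

print(f"k=4 winner: {winner_even}")
print(f"k=5 winner: {winner_odd}")
```

With an odd $k$ and two classes a tie cannot occur. With an even $k$, a tied prediction ends up being decided by whatever tie-breaking rule the implementation applies rather than by the data.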

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.


> Instantiate and fit your models.




In [10]:
lefthand['hand']

1       0
2       1
3       1
5       0
6       0
7       0
8       0
9       1
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      1
21      0
22      0
23      0
24      0
25      0
26      0
27      0
29      1
30      0
31      0
32      0
33      0
34      1
35      0
36      0
38      0
39      0
40      0
41      0
42      0
43      0
44      0
45      0
46      0
47      0
48      0
49      0
50      0
51      1
52      0
53      0
54      0
55      1
56      0
57      0
58      0
59      0
60      0
61      1
62      0
63      0
64      0
65      0
67      0
68      1
69      0
70      1
71      1
73      0
74      0
75      0
77      1
78      0
79      0
80      1
81      0
82      0
83      1
84      0
85      1
86      1
87      0
88      0
89      0
90      0
91      0
92      1
93      0
94      0
95      1
96      0
97      0
98      0
99      0
100     0
101     0
102     0
103     0
104     0
105     0
106     1


In [11]:
X=lefthand[features]
y=lefthand['hand']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn3 = KNeighborsClassifier(n_neighbors=3)
knn5 = KNeighborsClassifier(n_neighbors=5)
knn15 = KNeighborsClassifier(n_neighbors=15)
knn25 = KNeighborsClassifier(n_neighbors=25)
print(cross_val_score(knn3, X_train, y_train, cv=5).mean())
print(cross_val_score(knn5, X_train, y_train, cv=5).mean())
print(cross_val_score(knn15, X_train, y_train, cv=5).mean())
print(cross_val_score(knn25, X_train, y_train, cv=5).mean())

0.854754269006501
0.8741215850274425
0.8871461056424401
0.8871461056424401


In [72]:
knn3.fit(X_train, y_train)
knn5.fit(X_train, y_train)
knn15.fit(X_train, y_train)
knn25.fit(X_train, y_train)

print(f'Knn 3: {knn3.score(X_train, y_train)}')
print(knn5.score(X_train, y_train))
print(knn15.score(X_train, y_train))
print(knn25.score(X_train, y_train))
print('\n')
print(knn3.score(X_test, y_test))
print(knn5.score(X_test, y_test))
print(knn15.score(X_test, y_test))
print(knn25.score(X_test, y_test))



Knn 3: 0.8958263772954925
0.889482470784641
0.8871452420701169
0.8871452420701169


0.8558558558558559
0.8718718718718719
0.8858858858858859
0.8858858858858859


In [13]:
print(X_train.shape)
print(X_test.shape)

(2995, 43)
(999, 43)


In [14]:
# For each fitted model, count how often each predicted probability occurs
# for class 1 (left-handed) and for class 0 (right-handed).
for model in (knn3, knn5, knn15, knn25):
    proba = pd.DataFrame(model.predict_proba(X_test))
    print(proba.groupby(1)[1].agg(['count']))
    print(proba.groupby(0)[0].agg(['count']))


          count
1              
0.000000    672
0.333333    288
0.666667     39
          count
0              
0.333333     39
0.666667    288
1.000000    672
     count
1         
0.0    506
0.2    365
0.4    113
0.6     14
0.8      1
     count
0         
0.2      1
0.4     14
0.6    113
0.8    365
1.0    506
          count
1              
0.000000    129
0.066667    300
0.133333    299
0.200000    169
0.266667     73
0.333333     26
0.400000      3
          count
0              
0.600000      3
0.666667     26
0.733333     73
0.800000    169
0.866667    299
0.933333    300
1.000000    129
      count
1          
0.00     29
0.04    125
0.08    248
0.12    232
0.16    200
0.20    100
0.24     47
0.28     14
0.32      3
0.36      1
      count
0          
0.64      1
0.68      3
0.72     14
0.76     47
0.80    100
0.84    200
0.88    232
0.92    248
0.96    125
1.00     29


Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: 

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer:

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
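
The hint about specifying $\alpha$ can be sketched as follows, assuming the common convention that scikit-learn's `C` is the inverse of the regularization strength, $C = 1/\alpha$. The helper `alpha_to_c` and the `model_kwargs` dictionary are hypothetical names introduced for illustration; the resulting keyword arguments are the kind that could be passed to `LogisticRegression`.

```python
def alpha_to_c(alpha):
    """Convert a penalty strength alpha into the inverse strength C = 1 / alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# Keyword arguments for the four requested models (l1 = LASSO, l2 = Ridge).
# The liblinear solver is used here because it supports both penalties.
model_kwargs = {
    (penalty, alpha): {"penalty": penalty, "C": alpha_to_c(alpha), "solver": "liblinear"}
    for penalty in ("l1", "l2")
    for alpha in (1, 10)
}

for key, kwargs in model_kwargs.items():
    print(key, kwargs)
```

Under this convention, $\alpha = 10$ corresponds to `C=0.1`, not `C=10`. A larger $\alpha$ means a stronger penalty and therefore a smaller `C`.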

In [15]:
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)
print(f'Intercept: {LogReg.intercept_}')
print('')
print(f'Coefficient: {LogReg.coef_}')
print('')
print(f'Exponentiated Coefficient: {np.exp(LogReg.coef_)}')




Intercept: [-0.73645188]

Coefficient: [[-0.00742977 -0.04699472 -0.02072117 -0.08772433  0.06190146  0.01852482
   0.026479   -0.1512281  -0.07371314  0.09729255  0.02355302 -0.02609583
  -0.02224154  0.0202397   0.00157273  0.02621971  0.02692335  0.02326376
  -0.04747419 -0.04799327 -0.05686107 -0.11065629 -0.03460374  0.0130884
   0.0366283   0.10629215  0.02588428  0.01945747  0.06786362  0.05647171
   0.03730987 -0.04060414 -0.02522169 -0.02156501  0.02603763  0.00324536
  -0.03901551  0.08766737 -0.07721802 -0.06194837 -0.07073831 -0.0424616
  -0.11139356]]

Exponentiated Coefficient: [[0.99259777 0.95409243 0.97949204 0.91601336 1.06385751 1.01869747
  1.02683268 0.85965159 0.92893813 1.10218278 1.02383258 0.97424172
  0.97800398 1.02044591 1.00157396 1.02656647 1.02728906 1.02353647
  0.95363509 0.9531402  0.94472531 0.8952464  0.96598812 1.01317442
  1.03730739 1.11214674 1.02622219 1.019648   1.07021935 1.05809668
  1.03801462 0.96020916 0.97509372 0.97866585 1.02637957 1.00

In [16]:
print(f'Logreg predicted values: {LogReg.predict(X_train.head())}')
print(f'Logreg predicted probabilities: {LogReg.predict_proba(X_train.head())}')


Logreg predicted values: [0 0 0 0 0]
Logreg predicted probabilities: [[0.85846189 0.14153811]
 [0.90611194 0.09388806]
 [0.92950981 0.07049019]
 [0.84417505 0.15582495]
 [0.88558955 0.11441045]]


In [29]:
from sklearn.metrics import confusion_matrix
preds = LogReg.predict(X_test)
confusion_matrix(y_test, # True values.
                 preds)  # Predicted values.


array([[885,   0],
       [114,   0]])

In [33]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

In [34]:
spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

Specificity: 1.0


In [35]:
sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Sensitivity: 0.0
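
To see how these metrics sit next to accuracy, the sketch below recomputes all three from the confusion matrix counts reported for the default logistic regression (`[[885, 0], [114, 0]]`). The counts are hard-coded here so the snippet runs on its own.

```python
# Counts copied from the confusion matrix output: [[885, 0], [114, 0]].
tn, fp, fn, tp = 885, 0, 114, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)    # share of left-handed respondents correctly identified
specificity = tn / (tn + fp)    # share of right-handed respondents correctly identified

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
```

The model reaches roughly 88.6% accuracy while identifying none of the left-handed respondents, because it predicts the majority class for every row.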


In [63]:

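# Note: C is the inverse of regularization strength (C = 1/alpha).
# C=1 corresponds to alpha = 1, but C=10 corresponds to alpha = 0.1;
# alpha = 10 would be specified as C=0.1.
LogRegL1 = LogisticRegression(penalty='l1', C=1, solver='liblinear')
LogRegL2 = LogisticRegression(penalty='l2', C=1, solver='liblinear')
LogRegL1_10 = LogisticRegression(penalty='l1', C=10, solver='liblinear')
LogRegL2_10 = LogisticRegression(penalty='l2', C=10, solver='liblinear')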

LogRegL1.fit(X_train, y_train)
LogRegL2.fit(X_train, y_train)
LogRegL1_10.fit(X_train, y_train)
LogRegL2_10.fit(X_train, y_train)

print(LogRegL1.score(X_test, y_test))
print(LogRegL2.score(X_test, y_test))
print(LogRegL1_10.score(X_test, y_test))
print(LogRegL2_10.score(X_test, y_test))

print(LogRegL1.coef_)

0.8858858858858859
0.8858858858858859
0.8858858858858859
0.8858858858858859
[[-0.00569758 -0.04638346 -0.01853951 -0.08578679  0.0611854   0.01498284
   0.02313173 -0.15160932 -0.07244508  0.09335324  0.02060853 -0.02588533
  -0.02263529  0.01699182  0.          0.02431994  0.02555358  0.02070302
  -0.04698055 -0.04648088 -0.05736181 -0.10935238 -0.03332322  0.00988677
   0.03288868  0.10455288  0.00971169  0.01825499  0.06540237  0.0542085
   0.03288798 -0.03912474 -0.02451964 -0.02074129  0.02332908  0.
  -0.03805837  0.08445194 -0.07508504 -0.05909796 -0.06913453 -0.04056245
  -0.09535763]]


In [51]:
cvs_L_1 = cross_val_score(LogRegL1, X_train, y_train, cv=7).mean()
cvs_L_10 = cross_val_score(LogRegL1_10, X_train, y_train, cv=7).mean()
cvs_r_1 = cross_val_score(LogRegL2, X_train, y_train, cv=7).mean()
cvs_r_10 = cross_val_score(LogRegL2_10, X_train, y_train, cv=7).mean()
print(cvs_L_1)
print(cvs_L_10)
print(cvs_r_1)
print(cvs_r_10)

0.8871465960261951
0.8871465960261951
0.8871465960261951
0.8871465960261951


---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer: I don't think that these X variables do a good job of predicting the y variable. On the surface, the questions don't have anything to do with whether someone is left-handed or right-handed. One could imagine that something deeper might emerge from the analysis, but the five variables most correlated with y, listed below, have almost no relationship with it. Age has the highest correlation, and if a person's age is more correlated with left-handedness than any of the questions asked, it seems reasonable to assume that the questions have essentially no predictive power.

age	0.030882
Q26	0.018077
education	0.017367
Q25	0.015104
orientation	0.014769



### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

| Model | C | N-Neighbors | CV Score | Test Score |
| --- | --- | --- | --- | --- |
| KNeighbors | N/A | 3 | 0.902 | 0.848 |
| KNeighbors | N/A | 5 | 0.889 | 0.870 |
| KNeighbors | N/A | 15 | 0.887 | 0.885 |
| KNeighbors | N/A | 25 | 0.887 | 0.885 |
| Logistic Reg l1 | 1 | N/A | 0.887 | 0.885 |
| Logistic Reg l1 | 10 | N/A | 0.887 | 0.885 |
| Logistic Reg l2 | 1 | N/A | 0.887 | 0.885 |
| Logistic Reg l2 | 10 | N/A | 0.887 | 0.885 |



### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer: The $k$-NN model with $k = 3$ shows evidence of overfitting. You can see this in the gap between the training score and the test score: the cross-validated score on the training set is 0.902, but the score on the test set is 0.848. A smaller $k$ tends toward overfitting because each prediction depends on only a few nearby points.

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer: As you increase the value of $k$ in $k$-NN, variance decreases and bias increases. With smaller values of $k$ the model is overfit and has much higher variance. We saw this in our results above: when $k = 3$ the difference between the training score and the testing score was roughly 5 percentage points, whereas when $k = 15$ and $k = 25$ the difference was about 0.2 percentage points.
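
The two extremes of $k$ can be sketched with a tiny hand-rolled $k$-NN on hypothetical one-dimensional data. This is an illustration only; `knn_predict` and `training_accuracy` are made-up helpers, not the scikit-learn implementation used in the lab.

```python
from collections import Counter

# Hypothetical 1-D training data: (feature value, label). Label 1 is the minority class.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 0),
         (6.0, 0), (7.0, 1), (8.0, 0), (9.0, 0)]

def knn_predict(x, data, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def training_accuracy(data, k):
    """Share of training points whose own label is recovered by k-NN."""
    hits = sum(knn_predict(x, data, k) == label for x, label in data)
    return hits / len(data)

acc_k1 = training_accuracy(train, 1)            # each point is its own nearest neighbor
acc_kn = training_accuracy(train, len(train))   # every point votes, so the majority always wins

print(f"k=1 training accuracy: {acc_k1:.3f}")
print(f"k=n training accuracy: {acc_kn:.3f}")
```

With $k = 1$ the model memorizes the training set (low bias, high variance). With $k$ equal to the number of training points it always predicts the majority class (high bias, low variance).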

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: You can increase the value of $k$, reduce the number of features (for example through feature selection or dimensionality reduction), or try a different distance function or neighbor weighting when running $k$-NN.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer: There isn't strong evidence of overfitting in any of the logistic regression models. The variance is relatively low, as we can see from the similar scores we receive on the training and testing datasets.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer: Increasing $C$ weakens the regularization, which increases variance and decreases bias. Decreasing $C$ strengthens the regularization, which decreases variance and increases bias.
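
One common way to write the penalized objective makes the role of $C$ explicit. This is a sketch assuming labels coded as $y_i \in \{-1, +1\}$, weights $w$, and intercept $c$; the exact scaling can differ between library versions. For the L2 (Ridge) penalty:

$$\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + c\right)\right)\right)$$

and for the L1 (LASSO) penalty:

$$\min_{w, c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + c\right)\right)\right)$$

Because $C$ multiplies the data-fit term, a large $C$ lets the fit dominate the penalty (weaker regularization), while a small $C$ lets the penalty dominate (stronger regularization).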

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer: As I change the hyperparameter $C$, the coefficients change very little. This means that changing the hyperparameter doesn't change how the model behaves, which is consistent with the features having little predictive power for left-handedness.

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Change the value of C to be lower, scale your data, use fewer variables in your model.

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer: I would use logistic regression because you can look at the coefficients of the different features and see how each one relates to the predicted odds of left-handedness. $k$-NN just looks at the closest observations to make predictions and gives no per-feature summary.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [66]:
print(LogRegL1.coef_)

[[-0.00569758 -0.04638346 -0.01853951 -0.08578679  0.0611854   0.01498284
   0.02313173 -0.15160932 -0.07244508  0.09335324  0.02060853 -0.02588533
  -0.02263529  0.01699182  0.          0.02431994  0.02555358  0.02070302
  -0.04698055 -0.04648088 -0.05736181 -0.10935238 -0.03332322  0.00988677
   0.03288868  0.10455288  0.00971169  0.01825499  0.06540237  0.0542085
   0.03288798 -0.03912474 -0.02451964 -0.02074129  0.02332908  0.
  -0.03805837  0.08445194 -0.07508504 -0.05909796 -0.06913453 -0.04056245
  -0.09535763]]


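Answer: The coefficient for `Q1` is about -0.0057. Holding the other features constant, a one-point increase in the `Q1` response lowers the log-odds of being left-handed by about 0.0057, which multiplies the odds by roughly $e^{-0.0057} \approx 0.994$. In other words, respondents with a higher response to `Q1` (the statement about having studied how to win at gambling) are only SLIGHTLY less likely to be left-handed.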
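
A logistic regression coefficient is on the log-odds scale, so exponentiating it gives an odds ratio. The minimal sketch below does this for the `Q1` coefficient, with the value copied by hand from the printed `LogRegL1.coef_` output.

```python
import math

# Coefficient for Q1 from the LASSO model (C=1), copied from the printed output.
coef_q1 = -0.00569758

odds_ratio = math.exp(coef_q1)          # multiplicative change in odds per one-point increase
pct_change = (odds_ratio - 1) * 100     # the same change expressed as a percentage

print(f"Odds ratio for Q1: {odds_ratio:.4f}")
print(f"Change in odds per one-point increase: {pct_change:.2f}%")
```

An odds ratio this close to 1 indicates that the `Q1` response carries almost no information about handedness in this model.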

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer: I would pick the baseline model, which always predicts the majority class (right-handed). Left-handed respondents make up 452 out of 3994, or about 11.3%, so this baseline is correct about 88.7% of the time. I would use this because the other models do a very poor job of predicting left-handedness and do not improve on it. This is not surprising: as we discussed earlier, the questions from this survey have very little predictive power.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

In [75]:
X=lefthand[['Q2', 'Q4', 'Q13', 'Q14', 'Q16', 'Q40', 'Q44']]
y=lefthand['hand']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn3 = KNeighborsClassifier(n_neighbors=3)
knn5 = KNeighborsClassifier(n_neighbors=5)
knn15 = KNeighborsClassifier(n_neighbors=15)
knn25 = KNeighborsClassifier(n_neighbors=25)
print(cross_val_score(knn3, X_train, y_train, cv=5).mean())
print(cross_val_score(knn5, X_train, y_train, cv=5).mean())
print(cross_val_score(knn15, X_train, y_train, cv=5).mean())
print(cross_val_score(knn25, X_train, y_train, cv=5).mean())
print('')

knn3.fit(X_train, y_train)
knn5.fit(X_train, y_train)
knn15.fit(X_train, y_train)
knn25.fit(X_train, y_train)

print(f'Knn 3: {knn3.score(X_train, y_train)}')
print(knn5.score(X_train, y_train))
print(knn15.score(X_train, y_train))
print(knn25.score(X_train, y_train))
print('\n')
print(knn3.score(X_test, y_test))
print(knn5.score(X_test, y_test))
print(knn15.score(X_test, y_test))
print(knn25.score(X_test, y_test))


LogRegL1 = LogisticRegression(penalty='l1', C=1, solver='liblinear')
LogRegL2 = LogisticRegression(penalty='l2', C=1, solver='liblinear')
LogRegL1_10 = LogisticRegression(penalty='l1', C=10, solver='liblinear')
LogRegL2_10 = LogisticRegression(penalty='l2', C=10, solver='liblinear')

LogRegL1.fit(X_train, y_train)
LogRegL2.fit(X_train, y_train)
LogRegL1_10.fit(X_train, y_train)
LogRegL2_10.fit(X_train, y_train)

print(LogRegL1.score(X_test, y_test))
print(LogRegL2.score(X_test, y_test))
print(LogRegL1_10.score(X_test, y_test))
print(LogRegL2_10.score(X_test, y_test))

print(LogRegL1.coef_)

0.8634331876055038
0.8784671814599955
0.8871461056424401
0.8871461056424401

Knn 3: 0.8958263772954925
0.889482470784641
0.8871452420701169
0.8871452420701169


0.8558558558558559
0.8718718718718719
0.8858858858858859
0.8858858858858859
0.8858858858858859
0.8858858858858859
0.8858858858858859
0.8858858858858859
[[-0.05218497 -0.08923711 -0.02971047  0.01727901  0.0402436  -0.0407415
   0.00718819]]


Answer: It's pretty hard to interpret these coefficients. I tested whether left-handedness is more common in artistic people. The coefficients are all close to zero, which suggests that the questions related to a person being more artistic have essentially no power to predict left-handedness.

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
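
For the last bonus item, one possible query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the handedness column is assumed to be named `hand`, the rows inserted are hypothetical, and an in-memory SQLite database stands in for the `data` database.

```python
import sqlite3

# In-memory stand-in for the `data` database with an `inventory` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (hand INTEGER)")

# Hypothetical rows: 1 = right, 2 = left, 3 = both, 0 = missing.
sample = [1] * 5 + [2] * 2 + [3] + [0]
conn.executemany("INSERT INTO inventory (hand) VALUES (?)", [(h,) for h in sample])

# Count respondents per class label.
query = """
SELECT hand, COUNT(*) AS n
FROM inventory
GROUP BY hand
ORDER BY hand;
"""
rows = conn.execute(query).fetchall()
conn.close()

for hand, n in rows:
    print(hand, n)
```

The `GROUP BY hand` clause yields one row per class label, and `COUNT(*)` gives the number of respondents in each group.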