## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 
- However, there are some additional questions along the way that don't fit neatly into the one main example we'll walk through. Any question that isn't explicitly part of the main example is marked with **(detour)** at the start of the question.

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other. 

> You might find it helpful to check out the codebook in the repo for some inspiration.

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [68]:
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

%matplotlib inline

In [40]:
df = pd.read_table('../datasets/data.csv')
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

># ***Answer:***
># ***Do we actually need the data?***
># ***Can collection of the data be optional?***
># ***Maybe give and option of "Prefer not to answer"***


---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [41]:
df.shape

(4184, 56)

In [42]:
df.isnull().sum().sum()

0

In [43]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Q1,4184.0,1.962715,1.360291,0.0,1.0,1.0,3.0,5.0
Q2,4184.0,3.829589,1.551683,0.0,3.0,5.0,5.0,5.0
Q3,4184.0,2.846558,1.664804,0.0,1.0,3.0,5.0,5.0
Q4,4184.0,3.186902,1.476879,0.0,2.0,3.0,5.0,5.0
Q5,4184.0,2.86544,1.545798,0.0,1.0,3.0,4.0,5.0
Q6,4184.0,3.672084,1.342238,0.0,3.0,4.0,5.0,5.0
Q7,4184.0,3.216539,1.490733,0.0,2.0,3.0,5.0,5.0
Q8,4184.0,3.184512,1.387382,0.0,2.0,3.0,4.0,5.0
Q9,4184.0,2.761233,1.511805,0.0,1.0,3.0,4.0,5.0
Q10,4184.0,3.522945,1.24289,0.0,3.0,4.0,5.0,5.0


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

># ***Answer:***
># ***Classification. The output is discrete.***

### (detour) 6. While this isn't the problem we set out to solve, suppose I wanted to predict the exact age of the respondent using Q1 - Q44 as my predictors. Would this be a classification or regression problem? Why?

># ***Answer:***
># ***Regression. Age is continuous.***

### 7. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

># ***Answer:***
># ***We standardize our variables to put everything on the same scale.***

># ***If we tried to pridict income of a household based strictly on number of kids and lot size, the coefficiant of number of kids would be large compared to the coefficiant of lot size.***

### 8. Give an example of when we might not standardize our variables.

># ***Answer:*** 
># ***Similarly scaled variables.***

### 9. Based on your answers to 7 and 8, do you think we should standardize our predictor variables in this case? Why or why not?

># ***Answer:***
># ***No, scales are similar.***

### 10. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: 

In [44]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [46]:
df['0=right,1=left'] = [1 if i == 2 else 0 for i in df['hand']]
df['0=right,1=left'].value_counts()

0    3732
1     452
Name: 0=right,1=left, dtype: int64

In [47]:
df = df[df['hand'] != 0].reset_index()

### 11. The professor for whom you work suggests that you set $k = 4$. Why might this be a bad idea in this specific case?

># ***Answer:***
># ***Even numbers can lead to misleading results. Ties and such.***
># ***Best to use odd. 3 or 5 is recommended.***

### 12. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [50]:
X = df.drop(columns=['index', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand', '0=right,1=left'], axis = 1)

y = df['0=right,1=left']

In [51]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

In [53]:
k_3 = KNeighborsClassifier(n_neighbors = 3)
k_3.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [54]:
k_5 = KNeighborsClassifier(n_neighbors = 5)
k_5.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [55]:
k_15 = KNeighborsClassifier(n_neighbors = 15)
k_15.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=2,
           weights='uniform')

In [56]:
k_25 = KNeighborsClassifier(n_neighbors = 25)
k_25.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=25, p=2,
           weights='uniform')

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

># ***Answer:***
># ***C : float, default: 1.0***

### 14. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features? Well, the answer is (as always), **it depends**. What is one reason you would standardize? What is one reason you would not standardize?

># ***Answer:***
># ***An example of when I would standardize in logistic regression is when I also wanted to regularize.***
># ***An example of when I would not standardize in logistic regression is when scales are already similar.***

### 15. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [59]:
lasso_1 = LogisticRegression(penalty = 'l1', C = 0.1)
lasso_1.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [60]:
lasso_10 = LogisticRegression(penalty = 'l1', C = 1.0)
lasso_10.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [61]:
ridge_1 = LogisticRegression(penalty = 'l2', C = 0.1)
ridge_1.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [62]:
ridge_10 = LogisticRegression(penalty = 'l2', C = 1.0)
ridge_10.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

---
## Step 5: Evaluate the model(s).

### 16. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not?

># ***Answer:***
># ***No. What do psych factors have to do with hand dominance?*** 

### 17. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)

In [65]:
print('training accuracy k = 3 ' + str(k_3.score(X_train, y_train)))
print('testing accuracy k = 3 ' + str(k_3.score(X_test, y_test)))

print('training accuracy k = 5 ' + str(k_5.score(X_train, y_train)))
print('testing accuracy k = 5 ' + str(k_5.score(X_test, y_test)))

print('training accuracy k = 15 ' + str(k_15.score(X_train, y_train)))
print('testing accuracy k = 15 ' + str(k_15.score(X_test, y_test)))

print('training accuracy k = 25 ' + str(k_25.score(X_train, y_train)))
print('testing accuracy k = 25 ' + str(k_25.score(X_test, y_test)))

print('logreg training accuracy lasso a = 1 ' + str(lasso_1.score(X_train, y_train)))
print('logreg testing accuracy lasso a = 1 ' + str(lasso_1.score(X_test, y_test)))

print('logreg training accuracy lasso a = 10 ' + str(lasso_10.score(X_train, y_train)))
print('logreg testing accuracy lasso a = 10 ' + str(lasso_10.score(X_test, y_test)))

print('logreg training accuracy ridge a = 1 ' + str(ridge_1.score(X_train, y_train)))
print('logreg testing accuracy ridge a = 1 ' + str(ridge_1.score(X_test, y_test)))

print('logreg training accuracy ridge a = 10 ' + str(ridge_10.score(X_train, y_train)))
print('logreg testing accuracy ridge a = 10 ' + str(ridge_10.score(X_test, y_test)))

training accuracy k = 3 0.9156279961649089
testing accuracy k = 3 0.8496168582375478
training accuracy k = 5 0.8996484499840205
testing accuracy k = 5 0.8735632183908046
training accuracy k = 15 0.8951741770533717
testing accuracy k = 15 0.8812260536398467
training accuracy k = 25 0.8951741770533717
testing accuracy k = 25 0.8812260536398467
logreg training accuracy lasso a = 1 0.8954937679769894
logreg testing accuracy lasso a = 1 0.8812260536398467
logreg training accuracy lasso a = 10 0.8954937679769894
logreg testing accuracy lasso a = 10 0.8812260536398467
logreg training accuracy ridge a = 1 0.8954937679769894
logreg testing accuracy ridge a = 1 0.8812260536398467
logreg training accuracy ridge a = 10 0.8951741770533717
logreg testing accuracy ridge a = 10 0.8812260536398467


| training | output | testing | output|
|:-:|:-:|:-:|:-:|
|training accuracy k = 3 | 0.9156279961649089|testing accuracy k = 3 | 0.8496168582375478|
|training accuracy k = 5 | 0.8996484499840205|testing accuracy k = 5 | 0.8735632183908046|
|training accuracy k = 15 | 0.8951741770533717|testing accuracy k = 15 | 0.8812260536398467|
|training accuracy k = 25 | 0.8951741770533717|testing accuracy k = 25 | 0.8812260536398467|
|logreg training accuracy lasso a = 1 | 0.8954937679769894|logreg testing accuracy lasso a = 1 | 0.8812260536398467|
|logreg training accuracy lasso a = 10 | 0.8954937679769894|logreg testing accuracy lasso a = 10 | 0.8812260536398467|
|logreg training accuracy ridge a = 1 | 0.8954937679769894|logreg testing accuracy ridge a = 1 | 0.8812260536398467|
|logreg training accuracy ridge a = 10 | 0.8951741770533717|logreg testing accuracy ridge a = 10 | 0.8812260536398467|

### 18. In which of your $k$-NN models is there evidence of overfitting? How do you know?

># ***Answer:***
># ***k = 3 and k = 5***
># ***The trainings score better than the testing.***

### 19. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

># ***Answer:***
># ***As k decreases, our bias decreases and our variance increases.***
># ***As k increases, our bias increases and our variance decreases.***


### 20. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

># ***Answer:***
># ***- Reduce predictors***
># ***- Increase k***
># ***- Try logreg***

### 21. In which of your logistic regression models is there evidence of overfitting? How do you know?

># ***Answer:***
># ***None.*** 
># ***All of the testing scores are better than the training scores.***

### 22. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

># ***Answer:***
># ***As C goes down our variance decreases and our bias increases.***
># ***As C goes up our variance increase and our bias decreases.***

### 23. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What might this mean in the context of this problem?

># ***Answer:***
># ***Nothing seems to change as C changes.***
># ***Likely need better X variables.***

### 24. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

># ***Answer:***
># ***Get more data.***
># ***Remove some features.***
># ***Try lasso.***

---
## Step 6: Answer the problem.

### 25. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

># ***Answer:***
># ***Logreg do to estimating coefs.***

### 26. Select a logistic regression model. Interpret the coefficient for `Q1`.

In [66]:
lasso_10.coef_

array([[-0.01978977, -0.04362395, -0.04693673, -0.06088196,  0.10951742,
         0.02875591, -0.00321948, -0.14268086, -0.04024473,  0.06593426,
        -0.02369211,  0.        , -0.04471947,  0.00456571, -0.00972691,
         0.01272543,  0.04393458, -0.01434006, -0.04057943, -0.05023702,
        -0.02359995, -0.05404905, -0.02922946, -0.01236549,  0.06027034,
         0.07572057, -0.02629942,  0.02626792,  0.02470894,  0.00690672,
         0.04863997, -0.01209977, -0.04623348,  0.01756564,  0.02339307,
        -0.06290626, -0.03631335,  0.09053533, -0.07633563, -0.09311776,
        -0.05866217, -0.06455702, -0.08424839, -0.01491415]])

># ***Coef Q1 = -0.0198***

### 27. If you have to select one model overall to be your *best* model, which model would you select? Why?

># ***Answer:***
># ***Logreg do to estimating coefs.***

### 28. BONUS: 
### Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer these for the professor based on the model you selected!

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following:
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
- Fit and evaluate one or more of the generalized linear models discussed above.
- Create a plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?