## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

1) As education level increases, how does the probability of being left-handed change?

2) If a person enjoys dancing alone, how does the probability of being left-handed change?

3) If a person hates shopping, how does the probability of being left-handed change?

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PolynomialFeatures, OrdinalEncoder

from category_encoders import OneHotEncoder

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [2]:
left_hand = pd.read_csv(filepath_or_buffer = './data.csv', sep='\t')
left_hand.head()

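| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | country | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 4 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |
| 1 | 1 | 5 | 1 | 4 | 2 | 5 | 5 | 4 | 1 | 5 | ... | CA | 2 | 1 | 14 | 1 | 2 | 2 | 6 | 1 | 1 |
| 2 | 1 | 2 | 1 | 1 | 5 | 4 | 3 | 2 | 1 | 4 | ... | NL | 2 | 2 | 30 | 4 | 1 | 1 | 1 | 1 | 2 |
| 3 | 1 | 4 | 1 | 5 | 1 | 4 | 5 | 4 | 3 | 5 | ... | US | 2 | 1 | 18 | 2 | 2 | 5 | 3 | 2 | 2 |
| 4 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |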
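
The hint in question 2 comes down to the file's delimiter. Below is a minimal sketch of how one might check the delimiter before calling `pd.read_csv()`, assuming the first line of the file is a header. The `guess_delimiter` helper and the inline sample text are hypothetical stand-ins, since `data.csv` itself is not reproduced here.

```python
import csv
import io

def guess_delimiter(header_line, candidates=(",", "\t", ";", "|")):
    """Return the candidate separator that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)

# Hypothetical three-line sample standing in for the top of data.csv.
sample = "Q1\tQ2\tQ3\thand\n4\t1\t5\t3\n1\t5\t1\t1\n"

delimiter = guess_delimiter(sample.splitlines()[0])
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))

print(repr(delimiter))  # the separator to pass as sep= to pd.read_csv
print(rows[0])          # header split into individual column names
```

Counting separators in the header is a crude heuristic, but it is usually enough to tell a tab-separated file from a comma-separated one.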


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

I would keep the following columns hidden/anonymous:
1) Gender

2) Sexual Orientation

3) Religion

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [3]:
left_hand.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

In [4]:
left_hand.describe()

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | testelapse | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | ... | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 |
| mean | 1.962715 | 3.829589 | 2.846558 | 3.186902 | 2.86544 | 3.672084 | 3.216539 | 3.184512 | 2.761233 | 3.522945 | ... | 479.994503 | 1.576243 | 1.239962 | 30.370698 | 2.317878 | 1.654398 | 1.833413 | 5.013623 | 2.394359 | 1.190966 |
| std | 1.360291 | 1.551683 | 1.664804 | 1.476879 | 1.545798 | 1.342238 | 1.490733 | 1.387382 | 1.511805 | 1.24289 | ... | 3142.178542 | 0.494212 | 0.440882 | 367.201726 | 0.874264 | 0.640915 | 1.303454 | 1.970996 | 2.184164 | 0.495357 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.0 | 1.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2.0 | 1.0 | 3.0 | ... | 186.0 | 1.0 | 1.0 | 18.0 | 2.0 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 |
| 50% | 1.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 4.0 | ... | 242.0 | 2.0 | 1.0 | 21.0 | 2.0 | 2.0 | 1.0 | 6.0 | 2.0 | 1.0 |
| 75% | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | ... | 324.25 | 2.0 | 1.0 | 27.0 | 3.0 | 2.0 | 2.0 | 6.0 | 2.0 | 1.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 119834.0 | 2.0 | 2.0 | 23763.0 | 4.0 | 3.0 | 5.0 | 7.0 | 7.0 | 3.0 |


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Classification. The target is a discrete category (which hand a person uses) rather than a continuous quantity. As recorded, `hand` takes three distinct values, so we are predicting class membership, not a number.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Generally, we standardize when features are measured on different scales or in different units. $k$-nearest neighbors relies on distances between observations, so a feature with a large numeric range would otherwise dominate the distance calculation. The Titanic dataset from a previous lab is a good example: it includes age, number of siblings, ticket price, and so on. These features have different ranges and units, so they need to be put on the same scale.
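
As a rough illustration of what standardizing does, here is a minimal sketch in plain Python. The `standardize` helper and the age and fare values are made up for this example and are not taken from the Titanic data; in the notebook itself, `StandardScaler` would play this role.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]

# Hypothetical values on very different scales (made up for illustration).
ages = [22.0, 38.0, 26.0, 35.0]
fares = [7.25, 71.28, 7.93, 53.10]

ages_z = standardize(ages)
fares_z = standardize(fares)

# Before scaling, the spread of `fares` is several times that of `ages`;
# after scaling, both columns have a standard deviation of 1.
print(round(pstdev(ages), 2), round(pstdev(fares), 2))
print(round(pstdev(ages_z), 2), round(pstdev(fares_z), 2))
```

Once both columns share the same spread, neither one dominates a Euclidean distance simply because of its units.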

### 7. Give an example of when we might not standardize our variables.

We wouldn't standardize our variables when every feature is on the same scale or measurement.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

No, because questions 1-44 are all based on the same answer scale of 1-5.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

There are 11 zeros in `hand`, and 0 does not correspond to any valid answer to the question. Since there is no way to determine the true handedness of these respondents, I will drop those rows.

In [5]:
left_hand['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [6]:
hand_cleaned = left_hand[left_hand['hand'] > 0]
hand_cleaned['hand'].value_counts()

1    3542
2     452
3     179
Name: hand, dtype: int64
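
If the goal is a strict "left-handed or not" target, the remaining three codes could be collapsed into a binary label. The sketch below uses a small made-up list in place of the `hand` column, and it assumes that code 2 means "left-handed"; that mapping is an assumption and should be confirmed against the codebook before use.

```python
# Hypothetical stand-in for the `hand` column; 0 marks a missing answer.
hand = [1, 2, 3, 0, 1, 2, 1]

# ASSUMPTION: code 2 means "left-handed". Confirm against the codebook.
LEFT_CODE = 2

# Drop missing answers, then collapse to 1 = left-handed, 0 = not left-handed.
answered = [h for h in hand if h != 0]
is_left = [int(h == LEFT_CODE) for h in answered]

print(is_left)
```

With pandas, the same idea would be a comparison of the column against the chosen code followed by `.astype(int)`.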

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

We shouldn't use an even number for $k$ because votes can tie (for example, two neighbors in each of two classes), and the tie is then not resolved in any principled way. With two classes, an odd $k$ guarantees a clear winner. With three classes, as in the `hand` column here, an odd $k$ makes ties less likely but does not rule them out.
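
A small sketch of the voting step shows where ties come from. The `vote` helper and the neighbor labels are hypothetical; this is not how sklearn implements the vote internally, only an illustration of the counting.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winning labels, vote count); more than one winner means a tie."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners, top

# Hypothetical neighbor labels (1, 2, 3 are the three `hand` classes).
print(vote([1, 1, 2, 2]))   # k = 4: two classes share the top count
print(vote([1, 1, 2]))      # k = 3: a single winner
print(vote([1, 2, 3]))      # k = 3 with three classes: still a tie
```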

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 7$, and one with $k = 9$.

> Instantiate and fit your models using GridSearchCV.

In [7]:
X = hand_cleaned.iloc[:, :44]
y = hand_cleaned['hand']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 8)

In [9]:
params = {'n_neighbors': [3, 5, 7, 9]}

In [10]:
gs = GridSearchCV(KNeighborsClassifier(), params)

In [11]:
gs.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [3, 5, 7, 9]})

In [12]:
gs.best_params_

{'n_neighbors': 9}

In [13]:
pd.DataFrame(gs.cv_results_).sort_values(by = 'rank_test_score')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.038377 | 0.000808 | 0.174098 | 0.006459 | 9 | {'n_neighbors': 9} | 0.848243 | 0.851438 | 0.841853 | 0.846645 | 0.8448 | 0.846596 | 0.003221 | 1 |
| 2 | 0.03858 | 0.001966 | 0.167106 | 0.002316 | 7 | {'n_neighbors': 7} | 0.848243 | 0.84984 | 0.837061 | 0.841853 | 0.8432 | 0.844039 | 0.004594 | 2 |
| 1 | 0.037178 | 0.001719 | 0.169303 | 0.005197 | 5 | {'n_neighbors': 5} | 0.840256 | 0.840256 | 0.833866 | 0.832268 | 0.8368 | 0.836689 | 0.003255 | 3 |
| 0 | 0.039575 | 0.004962 | 0.174901 | 0.007819 | 3 | {'n_neighbors': 3} | 0.835463 | 0.832268 | 0.785942 | 0.821086 | 0.808 | 0.816552 | 0.018085 | 4 |
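
To make the summary columns less opaque, the sketch below recomputes `mean_test_score` and `std_test_score` for `n_neighbors = 9` from the five per-fold accuracies reported in the results. It assumes the reported standard deviation is the population form (dividing by the number of folds), which is consistent with the rounded figures.

```python
from statistics import mean, pstdev

# Per-fold accuracies for n_neighbors = 9, copied from the grid search results.
fold_scores = [0.848243, 0.851438, 0.841853, 0.846645, 0.8448]

mean_score = mean(fold_scores)
std_score = pstdev(fold_scores)   # population standard deviation (divide by n)

print(round(mean_score, 6), round(std_score, 6))
```

The grid search ranks candidates by this mean, so `best_params_` is simply the setting with the highest average accuracy across folds.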


### 12. How does a null model do? Create a null model. Does it do better than our best KNN model?

In [14]:
hand_cleaned['hand'].value_counts(normalize = True)

1    0.848790
2    0.108315
3    0.042895
Name: hand, dtype: float64

The majority class accounts for 84.9% of the data, so a null model that always predicts class 1 is 84.9% accurate. Our best KNN model has a mean cross-validated accuracy of 84.6%, so it does about the same as the null model and no better.
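
The null-model figure can be reproduced directly from the class counts. This is a minimal sketch using the counts and the grid search score reported earlier in this lab; the variable names are chosen here for illustration.

```python
from collections import Counter

# Class counts from the value_counts() output (zeros already dropped).
counts = Counter({1: 3542, 2: 452, 3: 179})

majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(counts.values())

best_knn_cv_accuracy = 0.846596  # mean_test_score for n_neighbors = 9

print(majority_class, round(baseline_accuracy, 6))
print(best_knn_cv_accuracy > baseline_accuracy)
```

In sklearn, `DummyClassifier(strategy="most_frequent")` offers the same baseline as a fitted estimator, which can be convenient when scoring it on the same train/test split.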

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. 

Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Yes, regularization is applied by default. The default parameters are `penalty='l2'` and `C=1.0`.
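
For reference, one common way to write the L2-penalized logistic regression objective in the binary case is shown below. It assumes labels $y_i \in \{-1, +1\}$, a weight vector $w$, and an intercept $c$; these symbols are notation chosen here, not taken from the lab.

$$
\min_{w,\,c}\; \frac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
$$

Because $C$ multiplies the data-fit term, a smaller $C$ puts relatively more weight on the penalty. This is why $C$ is described as the inverse of regularization strength.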

### 14. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

No, because they are all on the same 5-point scale.

### 15. Let's use logistic regression to predict whether the person is left-handed.


Be sure to use the same train/test split with your data as with your kNN model above!

Search over the following hyperparameters:

- l2 regularization with C = [.001, .01, .1, 1, 10, 100]


In [16]:
params = {'C': [.001, .01, .1, 1, 10, 100]}
gs2 = GridSearchCV(LogisticRegression(), params, n_jobs = -1)
gs2.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]})

In [17]:
gs2.best_params_

{'C': 0.001}

In [18]:
pd.DataFrame(gs2.cv_results_).sort_values(by = 'rank_test_score')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.378937 | 0.047097 | 0.00725 | 0.007078 | 0.001 | {'C': 0.001} | 0.84984 | 0.84984 | 0.848243 | 0.848243 | 0.8496 | 0.849153 | 0.000749 | 1 |
| 1 | 0.343727 | 0.009883 | 0.009374 | 0.007654 | 0.01 | {'C': 0.01} | 0.84984 | 0.848243 | 0.848243 | 0.848243 | 0.8496 | 0.848834 | 0.000728 | 2 |
| 2 | 0.352595 | 0.015973 | 0.003125 | 0.00625 | 0.1 | {'C': 0.1} | 0.84984 | 0.848243 | 0.845048 | 0.848243 | 0.848 | 0.847875 | 0.001558 | 3 |
| 3 | 0.340606 | 0.020726 | 0.009372 | 0.007652 | 1.0 | {'C': 1} | 0.84984 | 0.848243 | 0.845048 | 0.848243 | 0.848 | 0.847875 | 0.001558 | 3 |
| 4 | 0.331229 | 0.018221 | 0.009375 | 0.012499 | 10.0 | {'C': 10} | 0.84984 | 0.848243 | 0.845048 | 0.846645 | 0.848 | 0.847555 | 0.001613 | 5 |
| 5 | 0.328105 | 0.019765 | 0.009372 | 0.007652 | 100.0 | {'C': 100} | 0.84984 | 0.848243 | 0.845048 | 0.846645 | 0.848 | 0.847555 | 0.001613 | 5 |


---
## Step 5: Evaluate the model(s).

### 16. Before calculating any score on your data, take a step back. 

Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

I do not think the $X$ variables will do a good job of predicting the $Y$ variable. The questions seem arbitrary and unrelated to which hand a person uses. I expect the features to add little to the scores: the model will be correct about as often as someone who guesses right-handed every single time.

### 17. Using accuracy as your metric, evaluate the best of your models on both the training (mean validation) and testing sets. Put your scores below. 

In [22]:
knn = KNeighborsClassifier(n_neighbors = 9)
knn.fit(X_train, y_train)
knn.score(X_train, y_train)

0.849153084052413

In [23]:
knn.score(X_test, y_test)

0.8467432950191571

In [29]:
logr = LogisticRegression(C = 0.001, max_iter = 1000)
logr.fit(X_train, y_train)
logr.score(X_train, y_train)

0.849153084052413

In [28]:
logr.score(X_test, y_test)

0.8477011494252874

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

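If $k$ is small, the model has low bias and high variance, because each prediction depends on only a few nearby points. As $k$ increases, predictions are averaged over more neighbors, so the variance decreases but the bias increases.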
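
The two extremes of $k$ can be seen in a toy example. The sketch below is a hand-rolled one-dimensional $k$-NN on made-up data (the `knn_predict` and `training_accuracy` helpers are hypothetical, not sklearn functions). With $k = 1$ every training point is its own nearest neighbor, so the model reproduces the training labels exactly. With $k$ equal to the number of points, every prediction collapses to the majority class.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (1-D)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def training_accuracy(xs, ys, k):
    """Share of training points whose own label the model reproduces."""
    hits = sum(knn_predict(xs, ys, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical 1-D data: mostly class 1, with a few scattered class 2 points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 2, 1, 1, 1, 2, 1, 1, 1]

for k in (1, 3, len(xs)):
    print(k, training_accuracy(xs, ys, k))
```

A perfect training score at $k = 1$ is a sign of memorization (high variance), while the majority-only predictions at the largest $k$ show the model ignoring local structure (high bias).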

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

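1) Increase $k$ so that each prediction is based on more neighbors.

2) Use cross-validation to choose $k$, so the choice reflects performance on held-out folds rather than on the training data.

3) Adjust the `weights` parameter: uniform weighting gives every neighbor an equal vote, whereas distance weighting lets the closest points dominate.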

### 20. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:

### 21. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:

### 22. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

### 23. How might you deal with the imbalanced dataset?

Answer: 

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:

### 25. Instantiate, fit, and score a logistic regression model with no regularization. Interpret the coefficient for `Q1`.

Answer:

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
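
To make the second and third bullets concrete, here is a minimal sketch of sensitivity and specificity for a model that always predicts the majority class. The labels are made up, and the `sensitivity_specificity` helper is hypothetical; `1` stands for "left-handed" only for the purpose of this example.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: 1 = left-handed, 0 = not left-handed (made-up data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_majority = [0] * len(y_true)   # a null model that never predicts "left"

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, always_majority, positive=1)

print(accuracy, sens, spec)
```

Accuracy looks respectable here even though the model never identifies a single left-handed person, which is the core concern with unbalanced classes.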
