## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

### Possible Specific and Conclusively Answerable Problems

1. Are individuals who give others handmade gifts, like to dance, and decorate their things more likely to be left-handed than people who do not?

2. Are males more likely to be left-handed than females?

3. Are individuals with more education more likely to be left-handed than those with less?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [364]:
import pandas as pd 

In [365]:
pd.set_option('display.max_columns', 60)
df = pd.read_csv('data.csv', delimiter='\t' )
print(df.shape)
df.head()

(4184, 56)


Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

1. Any association involving gender could upset people. For instance, if I say "Males are more likely to be _____ than females," then depending on the word that fills the blank, some group will be unhappy.

2. Religion is also a sensitive topic. Religion is shaped by nurture rather than nature, so any link with handedness is more likely to be a correlation than a cause, and collecting it as an independent variable might not be necessary.

3. Likewise, race is a sensitive subject. For reasons similar to those for gender and religion, reporting findings by race could cause unhappiness.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [366]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

In [367]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,2.748805,2.852772,2.657505,3.33413,3.168021,2.93021,2.564771,3.424952,2.928537,3.639818,2.867591,3.595124,3.861138,3.337237,1.999761,3.001434,2.730641,2.624044,2.543738,2.894359,3.002151,2.869503,2.741874,3.022228,3.074092,2.61066,3.465344,2.798757,2.569312,2.984226,3.385277,2.704828,2.676386,2.736616,347.808556,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,1.443078,1.556284,1.559575,1.522866,1.501683,1.575544,1.61901,1.413236,1.493122,1.414569,1.360858,1.354475,1.291425,1.426095,1.290747,1.48061,1.485883,1.481709,1.611428,1.477968,1.420032,1.659141,1.40567,1.562694,1.5464,1.409707,1.52146,1.413584,1.621772,1.483752,1.423055,1.544345,1.523097,1.471845,5908.901681,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,3.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,6.0,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,4.0,3.0,4.0,3.0,4.0,4.0,3.0,1.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,4.0,3.0,2.0,3.0,4.0,3.0,3.0,3.0,12.0,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,5.0,5.0,5.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,35.0,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,252063.0,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [368]:
pd.set_option('display.max_rows', 60)
df.isnull().sum().sort_values(ascending=False)[:5]

Q1     0
Q2     0
Q31    0
Q32    0
Q33    0
dtype: int64

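There are no null values in the dataframe. Thank you General Assembly. Note, however, that `df.describe()` reports a minimum of 0 for the question columns, so 0 may be the code for a skipped answer (check the codebook).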
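A null count of zero does not rule out missing answers when a survey stores them as a number. The sketch below assumes that `0` marks a skipped item, which the `min` row of `df.describe()` and the `0` label for missing handedness in the bonus question both suggest but do not confirm. The `toy` frame is made up for illustration and is not the lab data.

```python
import pandas as pd

# Hypothetical toy responses; 0 is ASSUMED to mean "no answer".
toy = pd.DataFrame({
    "Q1": [4, 0, 1, 5],
    "Q2": [1, 5, 0, 0],
    "hand": [1, 2, 0, 3],
})

# isnull() finds nothing because every cell holds an integer.
null_counts = toy.isnull().sum()

# Counting zeros exposes the answers that were probably skipped.
zero_counts = (toy == 0).sum()

print(null_counts.to_dict())
print(zero_counts.to_dict())
```

Running the same `(df == 0).sum()` on the real dataframe would show how many responses per column fall into this category.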

Apart from `country`, which is the only column stored as text, every column has an integer dtype. The categorical answers were already encoded as integers at the survey level. Thank you General Assembly for making my life easier.

However, I still need to check the unique values in the `country` column.

In [369]:
print(len(df['country'].unique()))
df['country'].unique()

94


array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

I will do a pairplot for the question dataset.

In [370]:
import seaborn as sns
import matplotlib.pyplot as plt


%matplotlib inline

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

**Answer:** 

It would be a classification problem because the dependent variable (whether or not a person is left-handed) is a binary categorical variable. The answer is yes or no rather than a continuous quantity.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

**Answer:** 

We standardize our variables when the features are on different scales and the model is sensitive to scale. For example, when working with house data as in Project 2, square footage can run into the thousands while other features take much smaller values.

$k$-nearest neighbors relies on distances between observations, so without scaling, the feature with the largest range would dominate the distance and drown out the others.
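To make the effect of scale on distances concrete, here is a minimal sketch in plain Python. The three houses and their values are hypothetical, and `standardize` is a hand-written stand-in for what `StandardScaler` does (subtract the mean, divide by the standard deviation).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

# Hypothetical houses: square footage and number of bedrooms.
sqft = [1000.0, 1010.0, 3000.0]
beds = [1.0, 5.0, 1.0]

raw = list(zip(sqft, beds))
scaled = list(zip(standardize(sqft), standardize(beds)))

# Unscaled: square footage dominates, so the bedroom difference barely counts.
raw_01 = euclidean(raw[0], raw[1])
raw_02 = euclidean(raw[0], raw[2])

# Scaled: both features contribute on comparable terms.
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1) = {raw_01:.2f}, d(0,2) = {raw_02:.2f}")
print(f"scaled: d(0,1) = {scaled_01:.2f}, d(0,2) = {scaled_02:.2f}")
```

Before scaling, house 1 looks like a near-twin of house 0 despite having four more bedrooms. After scaling, the two distances are of similar size.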

### 7. Give an example of when we might not standardize our variables.

**Answer:** 

We might not standardize when the features are already on the same scale. The best example is the dataset for this lab: all questions use the same 5-point Likert scale, so their values are directly comparable without scaling.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

**Answer:** 

We do not need to standardize, as Q1 - Q44 all follow the same 5-point Likert scale.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

**Import Libraries**

In [371]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

%matplotlib inline

In [372]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,US,2,1,22,3,1,1,3,2,3


**Cleaning Data**

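FYI: The `country` column is stored as text and needs cleaning (some entries, such as `A1`, `A2` and `EU`, do not look like ordinary country codes). I will settle that later.

For now I will map each country code to an integer. Note that this is label encoding, not one-hot encoding.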

In [373]:
country_list = df['country'].unique()

In [374]:
country_list

array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

In [375]:
print(country_list[93])
print(np.where(country_list=='NL')[0][0])


SA
2


Country Label-Encoding Dictionary

In [376]:
# Map each country code to its position in country_list.
country_index_dict = {country: idx for idx, country in enumerate(country_list)}
country_index_dict

{'US': 0,
 'CA': 1,
 'NL': 2,
 'GR': 3,
 'GB': 4,
 'KR': 5,
 'SE': 6,
 'NO': 7,
 'DE': 8,
 'NZ': 9,
 'CH': 10,
 'RO': 11,
 'IL': 12,
 'IN': 13,
 'ZA': 14,
 'TR': 15,
 'JM': 16,
 'AU': 17,
 'BE': 18,
 'PL': 19,
 'CZ': 20,
 'RS': 21,
 'TW': 22,
 'A2': 23,
 'MX': 24,
 'PH': 25,
 'ES': 26,
 'AT': 27,
 'JP': 28,
 'IT': 29,
 'SG': 30,
 'MY': 31,
 'HK': 32,
 'FR': 33,
 'EU': 34,
 'DK': 35,
 'AE': 36,
 'EC': 37,
 'TH': 38,
 'IE': 39,
 'PK': 40,
 'BR': 41,
 'ID': 42,
 'EG': 43,
 'NI': 44,
 'FI': 45,
 'CN': 46,
 'RU': 47,
 'SI': 48,
 'AR': 49,
 'PT': 50,
 'LB': 51,
 'DO': 52,
 'PF': 53,
 'LT': 54,
 'BG': 55,
 'GE': 56,
 'CL': 57,
 'SK': 58,
 'EE': 59,
 'KE': 60,
 'UZ': 61,
 'LV': 62,
 'BB': 63,
 'BN': 64,
 'PR': 65,
 'HR': 66,
 'NP': 67,
 'A1': 68,
 'PE': 69,
 'UA': 70,
 'HU': 71,
 'VN': 72,
 'TZ': 73,
 'KH': 74,
 'UY': 75,
 'VE': 76,
 'IS': 77,
 'MP': 78,
 'CO': 79,
 'JO': 80,
 'TN': 81,
 'KW': 82,
 'CY': 83,
 'FJ': 84,
 'LK': 85,
 'VI': 86,
 'ZW': 87,
 'IM': 88,
 'ZM': 89,
 'QA': 90,
 'DZ': 91

Map the Dictionary to Label-Encode the Country Data

In [377]:
df['country']=df['country'].map(country_index_dict)

Check the Label-Encoded Country Data

In [378]:
df['country']

0        0
1        1
2        2
3        0
4        0
        ..
4179     0
4180     0
4181    19
4182     0
4183     9
Name: country, Length: 4184, dtype: int64
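Mapping countries to integers gives a single numeric column, but the numbers imply an order and spacing (US = 0, CA = 1, NL = 2) that mean nothing. For a distance-based model such as $k$-NN, one-hot encoding is a common alternative. The sketch below contrasts the two on a hypothetical four-row frame; it is an illustration, not the encoding used in the rest of this lab.

```python
import pandas as pd

# Hypothetical sample of the country column.
toy = pd.DataFrame({"country": ["US", "CA", "NL", "US"]})

# Integer labels: one column, order of first appearance.
codes = {c: i for i, c in enumerate(toy["country"].unique())}
toy["country_code"] = toy["country"].map(codes)

# One-hot encoding: one 0/1 column per country, no implied ordering.
one_hot = pd.get_dummies(toy["country"], prefix="country", dtype=int)

print(toy["country_code"].tolist())
print(one_hot.columns.tolist())
print(one_hot.sum(axis=1).tolist())
```

With 94 distinct codes in the real data, one-hot encoding would add 94 columns, which is a trade-off to weigh before including `country` as a predictor.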

Let's perform the whole train/test split sequence with the newly encoded dataset.

This model will include all columns except the `hand` column. I have not yet done the following:
1. Only include the question columns in the predictor dataframe
2. Recode the `hand` column as left-handed versus not left-handed

In [379]:
X = df.drop(['hand'], axis='columns')
y = df['hand']

In [380]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

knn = KNeighborsClassifier()

In [381]:
cross_val_score(knn, X_train, y_train, cv=10)



array([0.82686567, 0.8358209 , 0.84179104, 0.83880597, 0.83880597,
       0.8358209 , 0.83880597, 0.84131737, 0.82634731, 0.83532934])

In [382]:
cross_val_score(knn, X_train, y_train, cv=10).mean()



0.8359710429886495

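The mean cross-validation accuracy is about 0.836. Keep in mind that the target here is still the raw `hand` column with all of its categories, not a left-handed indicator.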

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: 

Even values of $k$ are not ideal for a two-class problem because the neighbors' votes can split evenly (for example, 2 versus 2 when $k = 4$), leaving no clear majority class. An odd $k$ avoids such ties.
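The tie problem can be sketched with a simple majority vote. The helper `vote` and the neighbor labels below are hypothetical; they only illustrate the counting, not how scikit-learn resolves ties internally.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (labels with the most votes, that vote count)."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners, top

# Hypothetical labels of the nearest neighbors (1 = left-handed, 0 = not).
four_neighbors = [1, 1, 0, 0]
five_neighbors = [1, 1, 0, 0, 0]

print(vote(four_neighbors))  # two labels share the top count: a tie
print(vote(five_neighbors))  # a single label wins
```

With two classes, any odd $k$ guarantees a single winner, which is why the lab asks for $k = 3, 5, 15, 25$.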

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

OK, I performed the default train/test split above. Now I will scale the data.

I will redo the train/test split and use `StandardScaler`.


**The Default Model**

In [383]:
X = df.drop(['hand'], axis='columns')
y = df['hand']

In [384]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

In [385]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [386]:
knn = KNeighborsClassifier()

In [387]:
knn.fit(X_train_sc, y_train)

In [388]:
cross_val_score(knn, X_train_sc, y_train, cv=10)



array([0.8358209 , 0.82686567, 0.84179104, 0.81791045, 0.83880597,
       0.82985075, 0.8358209 , 0.83233533, 0.82934132, 0.84730539])

In [389]:
cross_val_score(knn, X_train_sc, y_train, cv=10).mean()



0.8335847707569934

## Four Separate Models

Four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.


### Binarize the Hand Column

"What hand do you use to write with?" 	1=Right, 2=Left, 3=Both

We need to focus on left-handed respondents, so we recode left-handed as 1 and everything else as 0.

In [390]:
df['left']=[1 if i==2 else 0 for i in df['hand']]

In [391]:
df['left'].value_counts(normalize=True)

0    0.891969
1    0.108031
Name: left, dtype: float64

About 11% of the respondents are left-handed. Let us see whether the question responses help predict this.

In [392]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand,left
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,0,2,1,22,3,1,1,3,2,3,0
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,1,2,1,14,1,2,2,6,1,1,0
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,2,2,2,30,4,1,1,1,1,2,1
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,0,2,1,18,2,2,5,3,2,2,1
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,0,2,1,22,3,1,1,3,2,3,0


Now I will only keep the question columns in the X dataframe.

In [393]:
from sklearn import model_selection

In [394]:
# X = df.iloc[:, 0:44] # only include the question columns

# Drop every non-question column. There is no 'index' column in df
# (the row index is not a column), so it must not be listed here.
X = df.drop(columns=['introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand', 'left'])

y = df['left']

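## KIV

`model_selection` is the scikit-learn submodule that provides `train_test_split` and `cross_val_score`. Setting `test_size=.25` holds out 25% of the rows for testing, which is also the default when no size is given.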
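As a hedged illustration of what `test_size=.25` and `stratify=y` do together, the sketch below splits a hypothetical imbalanced target (the 200 rows are made up, not the lab data) and checks that both parts keep the same share of positives.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 left-handed (1) out of 200 respondents.
y = [1] * 20 + [0] * 180
X = [[i] for i in range(200)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 25% of the rows go to the test set; stratify keeps 10% positives in each part.
print(len(y_train), len(y_test))
print(Counter(y_train), Counter(y_test))
```

Stratifying matters here because left-handed respondents are rare, so an unstratified split could leave the test set with too few of them to evaluate on.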

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.25, random_state=42, stratify=y)

In [None]:
k_3 = KNeighborsClassifier(n_neighbors = 3)
k_3.fit(X_train, y_train)

k_5 = KNeighborsClassifier(n_neighbors = 5)
k_5.fit(X_train, y_train)

k_15 = KNeighborsClassifier(n_neighbors = 15)
k_15.fit(X_train, y_train)

k_25 = KNeighborsClassifier(n_neighbors = 25)
k_25.fit(X_train, y_train)

k_30 = KNeighborsClassifier(n_neighbors = 30)
k_30.fit(X_train, y_train)

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: 

Yes. According to the documentation, the default is L2 (ridge) regularization with `C=1.0`.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer:

I would not scale, as the features are already on the same scale: the questions all use the same 5-point scale.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

Note: In `sklearn.linear_model.LogisticRegression`, the L1 penalty corresponds to LASSO and the L2 penalty to Ridge.

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

In [None]:
from sklearn.linear_model import LogisticRegression


In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.25, random_state=42, stratify=y)

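`C` in the `LogisticRegression` object is the

**inverse of regularization strength**, that is, $C = 1/\alpha$; it must be a positive float. Like in support vector machines, smaller values specify stronger regularization. So $\alpha = 1$ corresponds to `C=1` and $\alpha = 10$ corresponds to `C=0.1`.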
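Since the lab states the models in terms of $\alpha$ while scikit-learn expects `C`, a tiny conversion helper can prevent mixing the two up. `alpha_to_C` is a hypothetical name introduced here for illustration; it is not part of scikit-learn.

```python
def alpha_to_C(alpha):
    """Convert a regularization strength alpha into the inverse strength C."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# The two strengths requested in the lab.
for alpha in (1, 10):
    print(f"alpha = {alpha:>2} -> C = {alpha_to_C(alpha)}")
```

The stronger penalty ($\alpha = 10$) therefore maps to the smaller value, `C=0.1`.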

**Version 1**

In [None]:
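# C is the inverse of alpha, so alpha = 1 -> C = 1 and alpha = 10 -> C = 0.1.
# The 'l1' penalty needs a solver that supports it, such as liblinear.
lasso_1 = LogisticRegression(C=1, penalty='l1', solver='liblinear')
lasso_1.fit(X_train, y_train)

lasso_10 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
lasso_10.fit(X_train, y_train)

# The default lbfgs solver handles 'l2'. It may warn about convergence
# (see the output below), in which case max_iter can be raised.
ridge_1 = LogisticRegression(penalty='l2', C=1.0)
ridge_1.fit(X_train, y_train)

ridge_10 = LogisticRegression(penalty='l2', C=0.1)
ridge_10.fit(X_train, y_train)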

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The LASSO models above did not work initially and kept returning "ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty." Passing `solver='liblinear'`, which supports the L1 penalty, fixed it.

Source: https://stackoverflow.com/questions/60868629/valueerror-solver-lbfgs-supports-only-l2-or-none-penalties-got-l1-penalty

In [None]:
from sklearn.linear_model import Ridge

from sklearn.linear_model import Lasso

**Version 2**

In [None]:
# Instantiate. Note: Ridge and Lasso are linear regression models, not classifiers.
ridge_model_1 = Ridge(alpha=1)
ridge_model_10 = Ridge(alpha=10)
lasso_model_1 = Lasso(alpha=1)
lasso_model_10 = Lasso(alpha=10)


# Fit.
ridge_model_1.fit(X_train, y_train)
ridge_model_10.fit(X_train, y_train)
lasso_model_1.fit(X_train, y_train)
lasso_model_10.fit(X_train, y_train)

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer:

The X variables are answers to questions that are largely independent of each other. Moreover, these questions are non-physiological, so they should contribute little to predicting whether or not a person is left-handed. I therefore expect the scores to be no better than always guessing the majority class.

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer:

### Evaluate KNN Models

#### **Version 1**

In [None]:
print(f"K3 Score is: {cross_val_score(k_3, X_train, y_train, cv=10).mean()}")

print(f"K5 Score is: {cross_val_score(k_5, X_train, y_train, cv=10).mean()}")

print(f"K15 Score is: {cross_val_score(k_15, X_train, y_train, cv=10).mean()}")

print(f"K25 Score is: {cross_val_score(k_25, X_train, y_train, cv=10).mean()}")

print(f"K30 Score is: {cross_val_score(k_30, X_train, y_train, cv=10).mean()}")

K3 Score is: 0.8590603751945605
K5 Score is: 0.876641476202179
K15 Score is: 0.8916594986483167
K25 Score is: 0.8916594986483167
K30 Score is: 0.8916594986483167


#### **Version 2**

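Why are the outputs so different? Version 1 reports the mean cross-validated accuracy, where each fold is scored on rows the model did not see during fitting. Version 2 scores each fitted model directly on the full training set and on the test set.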

In [None]:
print(k_3.score(X_train, y_train))
print(k_5.score(X_train, y_train))
print(k_15.score(X_train, y_train))
print(k_25.score(X_train, y_train))

0.9034835410674337
0.8938958133589006
0.8916586768935763
0.8916586768935763


In [None]:
print(k_3.score(X_test, y_test))
print(k_5.score(X_test, y_test))
print(k_15.score(X_test, y_test))
print(k_25.score(X_test, y_test))

0.8448275862068966
0.8745210727969349
0.8917624521072797
0.8917624521072797


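The outputs look odd: as $k$ grows, the training and testing accuracies both settle at about 0.8917, which is very close to the share of respondents who are not left-handed (0.892).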
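An accuracy near 0.89 is what a model earns by never predicting "left-handed" at all. The sketch below shows this majority-class baseline on a hypothetical target whose balance is rounded from the `value_counts` output for `left`; the 1,000 rows are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical target with roughly the same balance as `left` (1 = left-handed).
y_true = [0] * 892 + [1] * 108

# A "model" that ignores the features and always predicts the majority class.
always_zero = [0] * len(y_true)

baseline = accuracy(y_true, always_zero)
print(baseline)
```

Any model on this data should be judged against that baseline rather than against 0.5.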

### Evaluate Logistic Regression Models

### Version 1

#### Ridge

In [None]:
# Evaluate the logistic regression models using accuracy (what .score returns for classifiers).
print(ridge_1.score(X_train, y_train))
print(ridge_1.score(X_test, y_test))

print(ridge_10.score(X_train, y_train))
print(ridge_10.score(X_test, y_test))

0.891978267817194
0.8917624521072797
0.891978267817194
0.8917624521072797


#### Lasso

In [None]:
print(lasso_1.score(X_train, y_train))
print(lasso_1.score(X_test, y_test))

print(lasso_10.score(X_train, y_train))
print(lasso_10.score(X_test, y_test))

0.891978267817194
0.8917624521072797
0.891978267817194
0.8917624521072797


### Version 2

#### Ridge

In [None]:
# Evaluate the linear Ridge models using R2 (what .score returns for regressors).
print(ridge_model_1.score(X_train, y_train))
print(ridge_model_1.score(X_test, y_test))

print(ridge_model_10.score(X_train, y_train))
print(ridge_model_10.score(X_test, y_test))

0.019756529815492807
-0.0069736831148241585
0.019756266345273388
-0.006907591870189078


#### Lasso

In [None]:
print(lasso_model_1.score(X_train, y_train))
print(lasso_model_1.score(X_test, y_test))

print(lasso_model_10.score(X_train, y_train))
print(lasso_model_10.score(X_test, y_test))

0.0
-1.1157326640365284e-07
0.0
-1.1157326640365284e-07


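The Version 2 outputs look weird because `Ridge` and `Lasso` are linear regression models, so `.score()` returns $R^2$ rather than accuracy. Values near zero mean the models explain almost none of the variation in `left`. Only the Version 1 `LogisticRegression` models answer the question as asked.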
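To see how an $R^2$ of 0 and a high accuracy can describe much the same behaviour, here is a small plain-Python sketch. It assumes a heavily penalized linear model predicts the training mean for every row (consistent with the `0.0` training score above), and compares that with a classifier that always predicts the majority class. The ten labels are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical 0/1 target with 10% positives.
y_true = [0] * 9 + [1]

# ASSUMPTION: a fully shrunk linear model predicts the mean for every row.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)

# A classifier that always predicts the majority class.
class_pred = [0] * len(y_true)

print(r_squared(y_true, mean_pred))
print(accuracy(y_true, class_pred))
```

Neither model uses the features, yet one metric reports zero and the other reports ninety percent, so the two versions' scores cannot be compared directly.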

### Output Markdown Table

- KNN version 2
- Logistic Regression Version 1

|Model|k value|Penalty|Alpha|Training Accuracy|Testing Accuracy|
|---|---|---|---|---|---|
|K3|3|-|-|0.9037603569152326|0.8738049713193117|
|K5|5|-|-|0.8951561504142767|0.8881453154875717|
|K15|15|-|-|0.8919694072657743|0.8919694072657743|
|K25|25|-|-|0.8919694072657743|0.8919694072657743|
|Log Reg|-|RIDGE|1|0.8922880815806246|0.8919694072657743|
|Log Reg|-|RIDGE|10|0.8919694072657743|0.8919694072657743|
|Log Reg|-|LASSO|1|0.8922880815806246|0.8919694072657743|
|Log Reg|-|LASSO|10|0.8922880815806246|0.8922880815806246|

My outputs look weird, since nearly every model lands on the majority-class share of about 0.892, but I will use them.


---

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer:

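Overfitting shows up as a training score that is clearly higher than the testing score.

K3 and K5 show evidence of overfitting: their training accuracy exceeds their testing accuracy.

K15 and K25 show no gap at all. However, their scores equal the share of the majority class, which suggests they simply predict "not left-handed" for everyone rather than fitting well.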
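The size of the train/test gap can be computed directly from the $k$-NN rows of the table above. This is a small sketch; the scores are copied from that table and nothing else is assumed.

```python
# (training accuracy, testing accuracy) copied from the k-NN rows of the table.
scores = {
    3: (0.9037603569152326, 0.8738049713193117),
    5: (0.8951561504142767, 0.8881453154875717),
    15: (0.8919694072657743, 0.8919694072657743),
    25: (0.8919694072657743, 0.8919694072657743),
}

# A positive gap means the model does better on data it has already seen.
gaps = {k: train - test for k, (train, test) in scores.items()}

for k, gap in gaps.items():
    print(f"k = {k:>2}: train - test = {gap:.4f}")
```

The gap shrinks from about 0.03 at $k = 3$ to zero at $k = 15$ and $k = 25$.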

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer:

As $k$ increases, predictions are averaged over more neighbors, so the model is more stable (lower variance) but less flexible (higher bias). As $k$ decreases, bias falls but variance rises, because each prediction depends on only a few nearby points.
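The two extremes can be illustrated with a hand-written 1-D nearest-neighbors classifier. This is a minimal sketch on hypothetical data, not scikit-learn's implementation: with $k = 1$ every training point is its own nearest neighbor, while with $k$ equal to the sample size every prediction is the overall majority class.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Predict the majority label among the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def training_accuracy(train_x, train_y, k):
    """Accuracy of the k-NN rule when scored on its own training points."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y
        for x, y in zip(train_x, train_y)
    )
    return hits / len(train_x)

# Hypothetical 1-D data: mostly class 0, with two isolated class-1 points.
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

for k in (1, 3, 11):
    print(k, round(training_accuracy(train_x, train_y, k), 3))
```

At $k = 1$ the training data is memorized perfectly (low bias, high variance). At $k = 11$ the isolated class-1 points are always outvoted (high bias, low variance).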

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer:

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

Answer:

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
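For the bonus item on sensitivity and specificity, the sketch below computes both by hand for a classifier that always predicts "not left-handed". The labels are hypothetical, with a class balance rounded from the `left` column, and `confusion_counts` is a helper written for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

# Hypothetical labels: about 11% left-handed, and a model that never predicts 1.
y_true = [0] * 892 + [1] * 108
y_pred = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # share of left-handed people correctly found
specificity = tn / (tn + fp)   # share of other people correctly identified
accuracy = (tp + tn) / len(y_true)

print(sensitivity, specificity, accuracy)
```

Accuracy alone hides the fact that such a model finds no left-handed respondents at all, which is the core concern with unbalanced classes.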