## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [1]:
# Are left-handed people more likely to suffer from
# mental illness than right-handed people?

# Are left-handed people more likely to be introverted
# than right-handed people?

# Are left-handed people more creative and imaginative
# than right-handed people?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [2]:
# library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,classification_report,\
accuracy_score, precision_score, recall_score


In [3]:
df = pd.read_csv('data.csv',sep='\t')
df

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,3,5,4,5,2,4,2,2,2,5,...,US,1,1,18,2,1,1,6,2,1
4180,1,5,1,5,1,4,2,4,1,4,...,US,1,1,18,2,2,1,3,2,1
4181,3,2,2,4,5,4,5,2,2,5,...,PL,2,2,22,2,1,1,6,1,1
4182,1,3,4,5,1,3,3,1,1,3,...,US,2,1,16,1,2,5,1,1,1
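
If you are not sure which separator a ".csv" file really uses, one option is to let the standard library guess it before handing the file to pandas. The sketch below is only an illustration: the three-line `sample` string is a made-up stand-in for the top of `data.csv`, not the real file contents.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for the first few lines of data.csv.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# csv.Sniffer guesses the delimiter from a text sample;
# here we restrict the candidates to comma, tab, and semicolon.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
print(repr(dialect.delimiter))

# Pass the detected delimiter on to pandas.
sample_df = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter)
print(sample_df.shape)
```

With a real file, you would read the first few kilobytes into `sample` instead of hard-coding a string.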


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

### Background research
1. Left-Handed vs. Right-Handed (Medically Reviewed by Melinda Ratini, DO, MS on November 02, 2021) [Reference1](https://www.webmd.com/brain/ss/slideshow-left-handed-vs-right)
- Left-handers often hold more than their share of slots in creative professions.
- There’s a well-established link between left-handedness and mental conditions like schizophrenia, which can cause hallucinations and impaired thinking.

**Note** - Schizophrenia is a serious mental disorder in which people interpret reality abnormally

2. Five personality traits of left-handed people [Reference2](https://www.asianage.com/life/more-features/240719/five-personality-traits-of-left-handed-people.html)
- **Left-handers are more creative**: This is possibly because left-handers have a dominant right brain, the side of the brain that is associated with creativity and imagination. Another possible reason is that left-handed people are used to figuring out their way around tools from a young age. Scissors, cups, everything is generally made for right-handed people.
- **Left-handers are more likely to suffer from mental illness**: They are more prone to mental illness than their right-handed counterparts.
- **Left-handers hear speech differently**: Sound is processed differently in different parts of the brain. As lefties are right-brain dominant, they perceive sound differently.
- **Left-handed people tend to be more fearful**: This is possibly a result of interacting with a world created mostly by righties for righties.

3. Handedness and Personality [Reference3](https://www.researchgate.net/publication/15262165_Handedness_and_Personality)
- Left-handed people tend to be more introverted than right-handed people.

4. What Being Left or Right-Handed Reveals About Your Thoughts [Reference4](https://www.powerofpositivity.com/left-right-handed-says-think/)
- **Interesting facts about left-handed people**: Interestingly enough, being left or right-handed might be coincidentally connected to sexual preference.

**Note** - Gender and orientation may help to improve the performance of the model

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [4]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [5]:
# Lowercase the column names
df.columns = df.columns.str.lower()

# Shape of dataframe
print(f'This dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

This dataset has 4184 rows and 56 columns.


In [6]:
df.dtypes # Almost all of the columns are int64
# df.isnull().sum() # There are no missing values

q1              int64
q2              int64
q3              int64
q4              int64
q5              int64
q6              int64
q7              int64
q8              int64
q9              int64
q10             int64
q11             int64
q12             int64
q13             int64
q14             int64
q15             int64
q16             int64
q17             int64
q18             int64
q19             int64
q20             int64
q21             int64
q22             int64
q23             int64
q24             int64
q25             int64
q26             int64
q27             int64
q28             int64
q29             int64
q30             int64
q31             int64
q32             int64
q33             int64
q34             int64
q35             int64
q36             int64
q37             int64
q38             int64
q39             int64
q40             int64
q41             int64
q42             int64
q43             int64
q44             int64
introelapse     int64
testelapse

In [7]:
df.columns

Index(['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'q10', 'q11',
       'q12', 'q13', 'q14', 'q15', 'q16', 'q17', 'q18', 'q19', 'q20', 'q21',
       'q22', 'q23', 'q24', 'q25', 'q26', 'q27', 'q28', 'q29', 'q30', 'q31',
       'q32', 'q33', 'q34', 'q35', 'q36', 'q37', 'q38', 'q39', 'q40', 'q41',
       'q42', 'q43', 'q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [8]:
# Drop all unnecessary columns
df = df.drop(columns = ['introelapse','country','fromgoogle','engnat',
                        'education','race', 'religion','testelapse','age'])

In [9]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [10]:
# Target variable: 'hand'
# 1: Right-handed
# 2: Left-handed
# 3: Both

# Why would you drop 0?
# A: It is a small % of values and we aren't clear on what 0 means,
# but we think it's likely a stand-in for a missing value.


# What do we do about the value 3 (ambidextrous people)?
# We're unable to determine their dominant hand.
# It is a larger % of 'hand' values, but dropping it is not an
# overwhelming loss of rows or information.

# Options: exclude them, keep ambidextrous people as a category of
# their own, or combine them with another category
df['hand'].value_counts(normalize=True)

1    0.846558
2    0.108031
3    0.042782
0    0.002629
Name: hand, dtype: float64

In [11]:
# filter only left-hander, right-hander (hand = 1 and 2)
df = df[(df['hand'] > 0) & (df['hand'] < 3)]

In [12]:
# The codebook tells us 1 = Disagree, 3 = Neutral, 5 = Agree
# What might 2 and 4 represent?
# 2 = slightly disagree
# 4 = slightly agree
# What does 0 mean?
# Possibly a non-answer
df['q1'].value_counts()

1    2436
3     458
4     435
2     369
5     292
0       4
Name: q1, dtype: int64

In [13]:
df.shape

(3994, 47)

In [14]:
(df == 0).sum()

q1               4
q2              11
q3              11
q4              16
q5              14
q6              11
q7              17
q8               9
q9              13
q10             10
q11             15
q12             20
q13             14
q14             21
q15             14
q16             19
q17             19
q18             24
q19             11
q20             12
q21             17
q22             17
q23             21
q24             22
q25             17
q26             19
q27             17
q28             17
q29             11
q30             12
q31             15
q32             10
q33             15
q34             16
q35             13
q36             15
q37             17
q38             19
q39             13
q40             11
q41             13
q42             19
q43             21
q44             17
gender          74
orientation    115
hand             0
dtype: int64

In [15]:
# Drop every row that contains a 0 in any column
df = df.loc[(df != 0).all(axis = 1)]

# Check that no 0 values remain in the dataframe
(df == 0).sum().sort_values(ascending = False).head()

q1     0
q36    0
q27    0
q28    0
q29    0
dtype: int64

In [16]:
# The remaining columns are q1-q44, gender, orientation,
# and the target variable hand.
df.columns

Index(['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'q10', 'q11',
       'q12', 'q13', 'q14', 'q15', 'q16', 'q17', 'q18', 'q19', 'q20', 'q21',
       'q22', 'q23', 'q24', 'q25', 'q26', 'q27', 'q28', 'q29', 'q30', 'q31',
       'q32', 'q33', 'q34', 'q35', 'q36', 'q37', 'q38', 'q39', 'q40', 'q41',
       'q42', 'q43', 'q44', 'gender', 'orientation', 'hand'],
      dtype='object')

In [17]:
# Change value in hand column
# Right-hander: 1 to 0
# Left-hander: 2 to 1
df['hand'] = df['hand'] - 1
df['hand'].value_counts(normalize=True)

0    0.891822
1    0.108178
Name: hand, dtype: float64

In [18]:
df['hand'].value_counts()

0    3108
1     377
Name: hand, dtype: int64

### Calculate and interpret the baseline accuracy rate:

In [19]:
df['hand'].value_counts(normalize=True)

0    0.891822
1    0.108178
Name: hand, dtype: float64
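
To interpret these proportions: a model that always predicts the majority class (right-handed, `0`) would be correct about 89.2% of the time, so that is the accuracy any real model has to beat. A minimal sketch of the same idea with scikit-learn's `DummyClassifier`, assuming a label vector rebuilt from the class counts printed above (`y_demo` and `X_demo` are illustrative names, not variables from this notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild a label vector with the class counts shown above:
# 3108 right-handed (0) and 377 left-handed (1).
y_demo = np.array([0] * 3108 + [1] * 377)

# Placeholder features; the dummy model ignores them entirely.
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(round(baseline_accuracy, 4))
```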

### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [20]:
# regression: a regression model predicts a continuous target
# value, for example a housing price.

# classification: a classification model predicts a category
# (a discrete value), for example the species of an iris flower.
# Sometimes binary, sometimes multiclass.

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [21]:
# High bias -> underfitting
# High variance -> overfitting

# k is the number of neighbors that kNN uses to decide which
# class label a data point belongs to

# smaller k --> may lead to high variance (overfitting)
# optimal k --> 'the sweet spot' --> neither underfitting nor overfitting,
# with good predictive power without being computationally
# expensive
# larger k --> may lead to high bias (underfitting)
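
The effect of $k$ can be seen on a small synthetic problem. The sketch below does not use the handedness data; it assumes a toy dataset from `make_classification` and compares training and testing accuracy for a very small, a moderate, and a very large $k$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to illustrate the trend.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

scores = {}
for k in [1, 15, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each k.
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(k, scores[k])
```

With $k = 1$ every training point is its own nearest neighbor, so the training score is perfect; a large gap between that and the testing score is the signature of high variance.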

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [22]:
# predictor variable --> features or X

# We need to standardize the predictors because kNN is a
# distance-based model, so the scale/magnitude of the features
# impacts performance and output.

# Standardization (StandardScaler) -->
# rescales the values in each feature column so that
# each column has a mean of 0 and a standard deviation of 1
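
As a quick check of what `StandardScaler` does, the sketch below scales two made-up columns on very different scales (a 1-5 survey answer and an age in years; the values are hypothetical) and confirms that each ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: a 1-5 survey answer and an age in years.
X_demo = np.array(
    [[1, 14], [3, 22], [5, 30], [2, 18], [4, 65]], dtype=float
)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After scaling, each column has mean 0 and (population) std 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Without scaling, the age column would dominate any Euclidean distance between two rows.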

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [23]:
# explanatory variable --> features or X

# Possibly not this time, as all of the features we will use
# range from 1-5. They are already on the same or a very similar scale.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [24]:
# Options

# 1. Use the default
# 2. Guess
# 3. Loop/iterate through a range of k options and select the
#    optimal one based on score
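
Option 3 is usually done with cross-validation rather than by scoring on the test set. A minimal sketch, assuming synthetic data from `make_classification` and the same candidate values of $k$ used later in this lab:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to illustrate the search.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, random_state=42
)

candidate_ks = [3, 5, 15, 25]

# 5-fold cross-validated accuracy for each candidate k.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_ks},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 4))
```

On imbalanced data such as this lab's, a `scoring` choice other than accuracy (for example `"recall"` or `"f1"`) may be a better guide.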

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [25]:
# The default regularization for logistic regression is ridge regularization
# (l2).

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

In [26]:
# C : float, default=1.0
#    Inverse of regularization strength; must be a positive float.
#    Like in support vector machines 

# Smaller values specify stronger regularization.
# C = 1/alpha
# If alpha = 1, what is C? 1
# If alpha = 0.1, what is C? 10

# What does a higher C mean? Less regularization
# What does a lower C mean? MORE regularization
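
The inverse relationship can be checked empirically. The sketch below assumes synthetic data and a small hypothetical helper, `alpha_to_c`, that applies the `C = 1/alpha` convention from the notes above; it then shows that the coefficients shrink as `alpha` grows (that is, as `C` falls).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=42
)


def alpha_to_c(alpha):
    """Hypothetical helper: convert regularization strength alpha to C."""
    return 1.0 / alpha


coef_norms = {}
for alpha in [0.01, 1, 100]:
    # Default penalty is l2 (ridge).
    model = LogisticRegression(C=alpha_to_c(alpha), max_iter=1000)
    model.fit(X_demo, y_demo)
    # Overall size of the coefficient vector for this alpha.
    coef_norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, round(coef_norms[alpha], 4))
```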

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [27]:
# Why and when do we regularize?

# Why? To avoid error due to high variance (overfitting)
# When? When high model complexity is causing overfitting.

# C is high --> less regularization --> more prone to overfitting
# C is low --> MORE regularization --> less prone to overfitting, but if
# overdone, we could end up underfitting

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [28]:
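# Interpretable?
# LogReg has coefficients (statistical parameters), which are sometimes
# referred to as betas. Each coefficient describes how the log-odds of
# the outcome change for a one-unit increase in that feature.
# kNN is non-parametric and has no coefficients
# or statistical parameters.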


# Explainable to a non-technical individual?
# Conceptually, kNN isn't too hard to explain to a wide audience

# Explaining relationships in a meaningful and actionable way?
# kNN fails and LogReg is better
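
One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below assumes synthetic data, and the feature names `q1`-`q4` are hypothetical labels attached to the synthetic columns, not the real survey questions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate coefficient interpretation.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    random_state=42,
)
feature_names = ["q1", "q2", "q3", "q4"]  # hypothetical labels

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(beta) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=feature_names)
print(odds_ratios.round(3))
```

An odds ratio above 1 means the feature pushes predictions toward class 1, and below 1 away from it. kNN offers no comparable per-feature summary.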

In [29]:
# Don't regularize unless you need to.
# Turn penalty off

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be the answers to Q1-Q44.

In [30]:
# Set X and y
X = df.drop(columns = ['hand','gender','orientation'])
y = df['hand']

# Optional: one-hot encode gender and orientation
# (this only works if they are not dropped from X above)
#X = pd.get_dummies(data = X, columns = ['gender','orientation'])

# Split the train/test: X before y, train before test
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42)

In [31]:
# Checking the distribution of class labels in y_train
# and y_test to check balance of data
y_train.value_counts(normalize=True)*100

0    89.169537
1    10.830463
Name: hand, dtype: float64

In [32]:
y_test.value_counts(normalize=True)*100

0    89.220183
1    10.779817
Name: hand, dtype: float64

In [33]:
# Scale the X values
# (used by the regularized logistic regression models)

# Instantiate
sc = StandardScaler()

# Fit only X_train, transform both X_train and X_test
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

#### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [34]:
k_score = pd.DataFrame(columns = ['k','train_score','test_score','pred_one'])

for n, k in enumerate([3,5,15,25]):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)
    y_preds = knn.predict(X_test)
    minority_class_preds = (y_preds == 1)
    pred_one = pd.Series(minority_class_preds).sum()
    train_score = knn.score(X_train, y_train)
    test_score = knn.score(X_test, y_test)
    k_score.loc[n] = [k, train_score, test_score, pred_one]

In [35]:
k_score

Unnamed: 0,k,train_score,test_score,pred_one
0,3.0,0.903559,0.849771,37.0
1,5.0,0.894757,0.872706,17.0
2,15.0,0.891695,0.892202,0.0
3,25.0,0.891695,0.892202,0.0


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [36]:
# Are any of our kNN models overfit? Not severely; k = 3 shows the
# largest gap (train 0.904 vs. test 0.850).
# Are any of our kNN models underfit? Not necessarily, but
# the test scores for the higher k values equal the baseline accuracy,
# and no model beats it.
# The models appear good on the majority class, bad on the minority class.

# The challenge here is not so much the algorithm or the value of k
# but the imbalanced class split, along with a likely tenuous
# relationship between X and y
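
Accuracy alone hides this kind of failure. The sketch below assumes a hypothetical 90/10 set of test labels and a model that always predicts the majority class, then compares accuracy with recall and the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test labels with a 90/10 class split.
y_true = np.array([0] * 900 + [1] * 100)

# A model that always predicts the majority class (0).
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)  # recall for the positive class (1)
cm = confusion_matrix(y_true, y_pred)

print(acc, rec)
print(cm)  # rows are true classes, columns are predicted classes
```

The accuracy looks respectable while recall on the minority class is zero, which is the same pattern as the $k = 15$ and $k = 25$ models above.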

In [37]:
# The baseline accuracy:
# the accuracy we would get if the algorithm simply predicted the
# majority class for every observation
# Algorithmic equivalent --> DummyClassifier set to 'most_frequent'
y.value_counts(normalize=True).mul(100)[0]

89.18220946915352

In [38]:
y_test.value_counts(normalize=True).mul(100)[0]

89.22018348623854

In [39]:
# Options...

# Addressing the class label imbalance (changing k at a 90/10 split
# made no significant difference in the models' predictive accuracy)

# Find more left-handed people to survey
# Oversample the minority class
# Undersample the majority class
# SMOTE (Synthetic Minority Over-sampling Technique) -->
# algorithmically generates synthetic minority-class cases so that
# the dataset is more balanced
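
Random oversampling of the minority class can be sketched with `sklearn.utils.resample`. The `train` frame below is a made-up 90/10 example rather than this lab's data, and in practice the resampling should be applied to the training split only, never to the test split.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with a 90/10 class split.
train = pd.DataFrame({"q1": range(100), "hand": [0] * 90 + [1] * 10})

majority = train[train["hand"] == 0]
minority = train[train["hand"] == 1]

# Sample the minority rows with replacement until they match the majority.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["hand"].value_counts())
```

SMOTE itself lives in the separate `imbalanced-learn` package, which is why this sketch uses plain resampling instead.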

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.

In [43]:
# Instantiate models (C = 1/alpha, so alpha = 1 -> C = 1, the default,
# and alpha = 10 -> C = 0.1)
loglasso_1 = LogisticRegression(penalty='l1', solver = 'saga')
loglasso_10 = LogisticRegression(penalty='l1', C = 0.1, solver = 'saga')
logridge_1 = LogisticRegression(solver = 'saga')
logridge_10 = LogisticRegression(C = 0.1, solver = 'saga')

log_model_list = [loglasso_1, loglasso_10, logridge_1, logridge_10]
logreg_score = pd.DataFrame(columns = ['penalty','alpha','train_score','test_score','pred_one'])

# Fit each model on the scaled data and record its scores
for i, model in enumerate(log_model_list):
    model.fit(X_train_sc, y_train)
    train_score = round(model.score(X_train_sc, y_train),4)
    test_score = round(model.score(X_test_sc, y_test),4)
    # Number of test observations predicted as left-handed (class 1)
    y_preds_one = (model.predict(X_test_sc) == 1).sum()
    logreg_score.loc[i] = [model.penalty, 1/model.C,
                           train_score, test_score, y_preds_one]

In [44]:
logreg_score

Unnamed: 0,penalty,alpha,train_score,test_score,pred_one
0,l1,1.0,0.8917,0.8922,0
1,l1,10.0,0.8917,0.8922,0
2,l2,1.0,0.8917,0.8922,0
3,l2,10.0,0.8917,0.8922,0


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [45]:
# baseline for y_test
y_test.value_counts(normalize=True)

0    0.892202
1    0.107798
Name: hand, dtype: float64

In [None]:
# All of the logistic regression models have an accuracy equal to the baseline accuracy.
# They predict no left-handed cases at all (pred_one = 0), likely because of the imbalanced data.
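
One low-effort thing to try in this situation is reweighting the classes inside the model. The sketch below assumes synthetic 90/10 data from `make_classification` (not the handedness data) and compares minority-class recall with and without `class_weight='balanced'`. It illustrates the option; it is not a claim about what would happen on this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 90% class 0, 10% class 1).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # 'balanced' weights each class inversely to its frequency.
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te))
    print(label, round(recalls[label], 3))
```

If the features carry little signal about the target, as may be the case for handedness, reweighting can raise recall only at the cost of many false positives.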

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?

In [None]:
# None of the models predicts the minority class (left-handed) well.
# The main problem in this project is the imbalanced data.