## Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [2]:
# 1. Is handedness associated with dancing alone?
#    Hypothesis: Left-handed individuals are more likely to dance alone than right-handed individuals.

# 2. Is handedness associated with having handicrafts as a hobby?
#    Hypothesis: Right-handed individuals are more likely to have handicrafts as a hobby than left-handed individuals.

# 3. Is handedness associated with a preference for dangerous activities?
#    Hypothesis: Left-handed individuals are more likely to enjoy dangerous activities than right-handed individuals.

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [4]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
from sklearn import metrics

In [5]:
data = pd.read_csv('data.csv',delimiter='\t')

In [6]:
data.head()

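|   | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | country | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|----|----|----|----|----|----|----|----|----|-----|-----|---------|------------|--------|-----|-----------|--------|-------------|------|----------|------|
| 0 | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 4 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |
| 1 | 1 | 5 | 1 | 4 | 2 | 5 | 5 | 4 | 1 | 5 | ... | CA | 2 | 1 | 14 | 1 | 2 | 2 | 6 | 1 | 1 |
| 2 | 1 | 2 | 1 | 1 | 5 | 4 | 3 | 2 | 1 | 4 | ... | NL | 2 | 2 | 30 | 4 | 1 | 1 | 1 | 1 | 2 |
| 3 | 1 | 4 | 1 | 5 | 1 | 4 | 5 | 4 | 3 | 5 | ... | US | 2 | 1 | 18 | 2 | 2 | 5 | 3 | 2 | 2 |
| 4 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |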


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

In [8]:
data.dtypes

Q1              int64
Q2              int64
Q3              int64
Q4              int64
Q5              int64
Q6              int64
Q7              int64
Q8              int64
Q9              int64
Q10             int64
Q11             int64
Q12             int64
Q13             int64
Q14             int64
Q15             int64
Q16             int64
Q17             int64
Q18             int64
Q19             int64
Q20             int64
Q21             int64
Q22             int64
Q23             int64
Q24             int64
Q25             int64
Q26             int64
Q27             int64
Q28             int64
Q29             int64
Q30             int64
Q31             int64
Q32             int64
Q33             int64
Q34             int64
Q35             int64
Q36             int64
Q37             int64
Q38             int64
Q39             int64
Q40             int64
Q41             int64
Q42             int64
Q43             int64
Q44             int64
introelapse     int64
testelapse

In [9]:
data.isnull().sum()[data.isnull().sum()!=0]

Series([], dtype: int64)

In [10]:
data.shape

(4184, 56)

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [12]:
# Features: the first 44 columns (Q1-Q44); target: the handedness code
X = data.iloc[:, :44]
y = data['hand']

### Calculate and interpret the baseline accuracy rate:

In [14]:
data['hand'].value_counts(normalize = True).mul(100)

hand
1    84.655832
2    10.803059
3     4.278203
0     0.262906
Name: proportion, dtype: float64

In [15]:
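# The target should only take the values 1, 2, and 3,
# so drop the rows where hand == 0.
# Note: X and y were created before this filter, so they still include those rows.
data = data[data['hand']!=0]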

In [16]:
data['hand'].value_counts(normalize = True).mul(100)

hand
1    84.878984
2    10.831536
3     4.289480
Name: proportion, dtype: float64
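
The baseline accuracy is the score you would get by always predicting the most common class. A minimal sketch of that calculation follows; the `hand` series here is a hypothetical toy stand-in for `data['hand']`, not the lab data, and the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['hand'] (assumption: not the real dataset).
hand = pd.Series([1] * 85 + [2] * 11 + [3] * 4, name="hand")

# Share of each class; the largest share is the baseline accuracy.
class_shares = hand.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should score above this number on held-out data.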

### Short answer questions:

In this lab, you'll use K-nearest neighbors and logistic regression to model handedness based on psychological factors. 

Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [81]:
# Regression
# Output : Continuous value
# Loss functions: MSE, RMSE, MAE
# Evaluation Metrics: R^2

# Classification
# Output : Discrete value
# Evaluation Metrics: Accuracy, precision, recall

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [20]:
# Small k: Low bias, high variance (captures detail, risk of overfitting).
# Large k: High bias, low variance (more stable, risk of underfitting).
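
One way to see this tradeoff is to compare training and test accuracy across several values of $k$. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the handedness data); the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, slightly noisy classification data (assumption: illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

scores_by_k = {}
for k in [1, 5, 25, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # A large gap between train and test accuracy points to high variance.
    scores_by_k[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k}: train={scores_by_k[k][0]:.3f}, test={scores_by_k[k][1]:.3f}")
```

With $k = 1$ each training point is its own nearest neighbor, so training accuracy is perfect; as $k$ grows, the train and test scores typically move closer together.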

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [22]:
# kNN classifies by distance, so features with larger numeric ranges would otherwise dominate.
# Standardizing ensures that each feature contributes equally to the distance metric.
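
A small sketch of why scale matters for distances: the array below is hypothetical, with one column on a 1-5 survey scale and one on an age-like scale. Before scaling, the age-like column dominates the Euclidean distance; after `StandardScaler`, both columns contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [survey answer on a 1-5 scale, age-like value].
X_demo = np.array([[1.0, 20.0], [5.0, 21.0], [1.0, 60.0], [3.0, 40.0]])

def euclidean(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Row 0 vs. row 1 differ mostly in the survey answer; row 0 vs. row 2 differ only in age.
raw_01 = euclidean(X_demo[0], X_demo[1])
raw_02 = euclidean(X_demo[0], X_demo[2])

X_scaled = StandardScaler().fit_transform(X_demo)
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}, d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}, d(0,2)={scaled_02:.2f}")
```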

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [24]:
# We don’t need to standardize for this problem: Q1-Q44 are all on a similar scale, so no single feature dominates the distance metric.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [26]:
# Manually, or with a grid search over candidate k values scored by cross-validation
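
A grid search over $k$ can be sketched with scikit-learn's `GridSearchCV`. The data below is synthetic (an assumption for illustration), and the candidate values simply mirror the ones used later in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the real features and target (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Try each candidate k with 5-fold cross-validation and keep the best mean accuracy.
param_grid = {"n_neighbors": [3, 5, 15, 25]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```

In practice the search would be fit on the training split only, so the test set stays untouched for the final evaluation.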

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [28]:
# L2 penalty (the same penalty used in Ridge regression)

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

In [30]:
# They have an inverse relationship.
# Higher values of C (e.g., C = 10 or C = 100) mean weaker regularization.
# Lower values of C (e.g., C = 0.1 or C = 0.01) mean stronger regularization.
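
The inverse relationship can be checked by fitting the same model with several values of `C` and comparing the size of the coefficients. This is a sketch on synthetic, standardized data (an assumption, not the lab data); smaller `C` should shrink the coefficient vector more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: illustrative only), standardized so the penalty treats features alike.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.001, 0.1, 10]:
    # Default penalty is L2; smaller C means a stronger penalty.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: ||coef|| = {coef_norms[C]:.4f}")
```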

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [32]:
# Answer here:
# Higher regularization strength 
# Reduces variance ---> avoid overfitting.
# Increases bias --->  leading to underfitting

# Lower regularization strength
# Reduces bias ---> avoid underfitting.
# Increases variance --->  leading to overfitting

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [34]:
# Logistic regression is based on a mathematical framework and provides coefficient interpretation. 
# In contrast, k-nearest neighbors relies solely on the proximity of each data point.
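
As a rough illustration of coefficient interpretation, exponentiating a logistic regression coefficient gives an odds ratio: the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. The sketch below uses synthetic data and hypothetical column names (`Q1`-`Q4` here are placeholders, not the survey items).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with hypothetical column names (assumption).
X_arr, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["Q1", "Q2", "Q3", "Q4"])

model = LogisticRegression().fit(X_demo, y_demo)

# exp(coefficient) = odds ratio per one-unit increase in the feature.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_demo.columns)
print(odds_ratios.sort_values(ascending=False))
```

A $k$-nearest neighbors model has no comparable per-feature summary, since its predictions depend on which training points happen to be nearby.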

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your features should be:

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=42,  test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3347, 44), (837, 44), (3347,), (837,))

In [37]:
y_train.value_counts(normalize=True).mul(100)

hand
1    84.642964
2    10.815656
3     4.272483
0     0.268898
Name: proportion, dtype: float64

In [38]:
y_test.value_counts(normalize=True).mul(100)

hand
1    84.707288
2    10.752688
3     4.301075
0     0.238949
Name: proportion, dtype: float64

#### Create and fit four separate $k$-nearest neighbors models: 
- one with $k = 3$
- one with $k = 5$
- one with $k = 15$
- one with $k = 25$:

In [40]:
k_values = [3,5,15,25]

In [41]:
def find_optimal_k(X, y, k_values):
    # Calculate accuracy for each k
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        cross_score = cross_val_score(knn, X, y, cv=5).mean()
        print(f" K= {k}, cross validation score: {cross_score:.4f}")
find_optimal_k(X, y, k_values)

 K= 3, cross validation score: 0.8064
 K= 5, cross validation score: 0.8341
 K= 15, cross validation score: 0.8463
 K= 25, cross validation score: 0.8466


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [43]:
def k_model(k_list):
    for k in k_list:
        knn = KNeighborsClassifier(n_neighbors=k)
        print(f'K = {k}')

        # Fit model
        knn.fit(X_train, y_train)

        # Score model
        print(f"Training Score: {knn.score(X_train, y_train):.4f}")
        print(f"Test Score: {knn.score(X_test, y_test):.4f}\n")

# Call function with consistently formatted data
k_model(k_values)

K = 3
Training Score: 0.8611
Test Score: 0.8220

K = 5
Training Score: 0.8488
Test Score: 0.8399

K = 15
Training Score: 0.8464
Test Score: 0.8471

K = 25
Training Score: 0.8464
Test Score: 0.8471



In [44]:
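# Baseline = 0.8487 after dropping hand == 0 (0.8466 on the unfiltered data that X and y were built from)
# K = 3 overfits the most: training score 0.8611 vs. test score 0.8220
# K = 15 and 25 are not overfit: train and test scores are nearly equal (0.8464 vs. 0.8471) and match
# the majority-class share of each split, which suggests these models mostly predict the majority class (high bias)
# No model beats the baseline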
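
When accuracy sits at the baseline, it helps to check whether a model is predicting anything other than the majority class. A minimal sketch, assuming hypothetical toy data: `DummyClassifier` always predicts the most frequent class, and the confusion matrix and balanced accuracy make that behavior visible.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Hypothetical imbalanced toy data (assumption: not the lab data).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(200, 5))        # answers on a 1-5 scale
y_demo = np.array([1] * 170 + [2] * 22 + [3] * 8)  # imbalanced class labels

# A classifier that ignores the features and always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)

acc = dummy.score(X_demo, y_demo)
bal_acc = balanced_accuracy_score(y_demo, pred)
cm = confusion_matrix(y_demo, pred, labels=[1, 2, 3])

print(f"accuracy={acc:.4f}, balanced accuracy={bal_acc:.4f}")
print(cm)  # every prediction falls in the first column
```

Running the same two checks on a fitted kNN or logistic regression model would show whether it ever predicts a minority class.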

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as used above with kNN.

In [83]:
alphas = [1, 10]
# scikit-learn uses C = 1 / alpha, the inverse of regularization strength
C_values = [1 / alpha for alpha in alphas]

# LASSO (L1 penalty) logistic regression models
for C in C_values:
    lasso_model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    lasso_model.fit(X_train, y_train)
    print(f"LASSO Model (alpha={1 / C}) Train Score: {lasso_model.score(X_train,y_train):.4f}")
    print(f"LASSO Model (alpha={1 / C}) Test Score: {lasso_model.score(X_test,y_test):.4f}\n")

# Ridge (L2 penalty) logistic regression models
for C in C_values:
    ridge_model = LogisticRegression(penalty='l2', C=C, solver='liblinear')
    ridge_model.fit(X_train, y_train)
    print(f"Ridge Model (alpha={1 / C}) Train Score: {ridge_model.score(X_train,y_train):.4f}")
    print(f"Ridge Model (alpha={1 / C}) Test Score: {ridge_model.score(X_test,y_test):.4f}\n")

LASSO Model (alpha=1.0) Train Score: 0.8464
LASSO Model (alpha=1.0) Test Score: 0.8471

LASSO Model (alpha=10.0) Train Score: 0.8458
LASSO Model (alpha=10.0) Test Score: 0.8471

Ridge Model (alpha=1.0) Train Score: 0.8464
Ridge Model (alpha=1.0) Test Score: 0.8471

Ridge Model (alpha=10.0) Train Score: 0.8464
Ridge Model (alpha=10.0) Test Score: 0.8471



### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [48]:
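# Baseline = 0.8487
# None of the four models looks overfit: train and test scores are nearly identical (about 0.846 vs. 0.8471)
# Changing the penalty or alpha barely moves the scores, which sit at the majority-class share (high bias, if anything)
# No model beats the baseline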

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? 

What are the "best" models?

In [50]:
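# Candidates: K-nearest neighbors with k = 3, 5, 15, 25 and logistic regression with L1 or L2 penalty and alpha = 1, 10.
# No model beats the baseline, so none is worth moving forward with as is.
# The highest test accuracy (0.8471) comes from kNN with k = 15 or 25 and from all four logistic regression models,
# but that only matches the majority-class share of the test set; kNN with k = 5 scores 0.8399 and k = 3 scores 0.8220.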