## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer: 

Right handed may imply that the right brain lobe activities dominates. So based on the right-brain function, I am wondering if the left-handed people have:

1. art awareness,
2. stronger creativity and imagination, and
3. low interest in prefer math or science

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [3]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [4]:
# Read data, delimiter is 'tab' or '\t'
df = pd.read_csv('data.csv', delimiter='\t')

In [5]:
# Check the first five rows
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:

1. Since there are personal/privacy data, the collection process and data should be anonymous. 
2. If possible, consider carefully if personal private data are necessary. Not to collect unnecessary data.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [6]:
# check data size
df.shape

(4184, 56)

In [7]:
# Check data type and null info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
Q1             4184 non-null int64
Q2             4184 non-null int64
Q3             4184 non-null int64
Q4             4184 non-null int64
Q5             4184 non-null int64
Q6             4184 non-null int64
Q7             4184 non-null int64
Q8             4184 non-null int64
Q9             4184 non-null int64
Q10            4184 non-null int64
Q11            4184 non-null int64
Q12            4184 non-null int64
Q13            4184 non-null int64
Q14            4184 non-null int64
Q15            4184 non-null int64
Q16            4184 non-null int64
Q17            4184 non-null int64
Q18            4184 non-null int64
Q19            4184 non-null int64
Q20            4184 non-null int64
Q21            4184 non-null int64
Q22            4184 non-null int64
Q23            4184 non-null int64
Q24            4184 non-null int64
Q25            4184 non-null int64
Q26            418

In [8]:
# confirm there is no null value
df.isna().sum().sum()

0

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: 

This would be a classification problem because the expected model output is binary.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer: 

When the scale of the variables impact the model parameters or model output, we want to standardize the variables to equalize the contribution from each variable to the model. 

**Example 1: KNN Model**
1. KNN model is distance based model. 
2. Distance is calculated based on the absolute difference of each variables.
3. Because viariables with different units have different scale, their contribution to the distance varies.
4. Standardizing the variable (normally using the z-score of the variable) equalizes the contribution from each variables to the same measures. 

**Example 2: Ridge/Lasso**
1. The scale of the variables affects the regularization term in Ridge/Lasso model. The parameter (i.e., beta) for variable with larger scale gets penalized heavier.  
2. Standardizing the variable equalizes the penalty to each variables.

### 7. Give an example of when we might not standardize our variables.

Answer: 

**Simple Linear Regression** (does not have regulaziation based on the variable scale) do not need data standardization.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer:

**No, we do not.** Altough we are running KNN which is a distance based model, Q1-Q44 were valued using the same score system. Therefore, the contributions from all question/feature to the distance are already equal.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: 

1. Based on codebook.txt, the hand column should only have three unique values, which are 1 (right), 2 (left), and 3 (both). Since there is no good way to estimate the value if the value is missing or incorrect, data with values other than [1, 2, or 3] is disqualified for this study and should be deleted.

2. Since the problem is whether a person is left-handed. Map "2"(left-handed) to 1 and the rest to 0

#### 9.1 Data Cleaning

In [9]:
# Check unique value
df['hand'].unique()

array([3, 1, 2, 0])

In [10]:
# build a function for deleting disqualified data
'''Delete rows with bad data (data not in the list_of_good_data)
   Input: 
         df: pandas DataFrame, target dataframe
         feat: str, feature of interest
         list_of_good_data: list of good data
         summary (optional): boolean, to print the removed bad data
   return: dataframe
'''
def del_bad_data(df, feat, list_of_good_data, summary=False):  
    if df[feat].nunique() > len(list_of_good_data):  # if true, there is unexpected value
        bad_data = ~df[feat].isin(list_of_good_data) # mask for bad data
        print(f'Rows before cleaning: {df.shape[0]}')
        print(f'Number of disqualified and removed data: {df[bad_data].shape[0]}')
   
        if summary == True:
            print(df[bad_data])  # print the bad data   
        
        df.drop(df[bad_data].index, inplace=True) # remove the bad data and return the clean df
        print(f'Rows after cleaning: {df.shape[0]}')
        return df
    
    else:
        print('Your data is clean! Nothing to delete')
        return df

In [11]:
# Clean the data
df = del_bad_data(df, 'hand', [1,2,3])

Rows before cleaning: 4184
Number of disqualified and removed data: 11
Rows after cleaning: 4173


#### 9.2 Mapping

In [12]:
df['hand'] = df['hand'].apply(lambda x: 1 if x == 2 else 0)
df['hand'].unique()

array([0, 1])

In [13]:
df.shape

(4173, 56)

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: 

Where a person is left-handed is a binary problem. For binary outcome, using even number for k may lead to unconclusive model output (e.g, '0' = '1' = 2). Using odd number for k instead.

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [14]:
# Determine X and y
X = df.iloc[:, 0:44]
y = df['hand']

In [15]:
X.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44
0,4,1,5,1,5,1,5,1,4,1,...,5,1,1,1,5,5,5,1,5,1
1,1,5,1,4,2,5,5,4,1,5,...,4,4,4,4,1,3,1,4,4,5
2,1,2,1,1,5,4,3,2,1,4,...,2,2,4,2,1,4,2,2,2,2
3,1,4,1,5,1,4,5,4,3,5,...,5,1,3,4,1,2,1,1,1,3
4,5,1,5,1,5,1,5,1,3,1,...,5,1,1,1,5,5,5,1,5,1


In [16]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)

In [17]:
# k
ks = [3, 5, 15, 25]

In [20]:
# function for model with k values
'''run knn model
   input:
        X_train: DataFrame, training features
        X_test: DataFrame, testing features
        y_train: Series, training targets
        y_test: Series, testing targets
        k: int, n_neighbor
    return:
        y_pred (list), cv_score (float), r2_score_train (float), r2_score_test (float)
   
   '''
def run_knn(X_train, X_test, y_train, y_test, k):
    print(f'When k = {k}:')
    
    knn = KNeighborsClassifier(n_neighbors=k) # Instantiate model
    knn.fit(X_train, y_train)                 # Fit model
    
    cv_score = cross_val_score(knn, X_train, y_train, cv=10).mean() # cross validation
    accuracy_train = knn.score(X_train, y_train) # R2 score for training
    accuracy_test = knn.score(X_test, y_test) # R2 score for testing
    y_pred = knn.predict(X_test)              # predict y
    cm = confusion_matrix(y_test, y_pred)     # confustion matrix
    cm_df = pd.DataFrame(cm, columns = ['pred_right', 'pred_left'], 
                             index = ['actual_right', 'actual_left'])
    print(f' The cross validation score is: {cv_score}')
    print(f' The train data accuracy is: {accuracy_train}')
    print(f' The testing data accuracy is: {accuracy_test}')
    print(f' The confusion matrix: \n{cm_df}\n')
    return y_pred, cv_score, accuracy_train, accuracy_test, cm

In [21]:
# Run model with k values
y_preds_knn = pd.DataFrame()
cm_knn = []
cv_scores_knn = []
accuracy_train_knn = []
accuracy_test_knn = []

for k in ks:
    y_pred, cv_score, accuracy_train, accuracy_test, cm = run_knn(X_train, X_test, y_train, y_test, k) # run model
    y_preds_knn[k] = y_pred 
    cv_scores_knn.append(cv_score)
    accuracy_train_knn.append(accuracy_train)
    accuracy_test_knn.append(accuracy_test)
    cm_knn.append(cm)

When k = 3:
 The cross validation score is: 0.8600229376587205
 The train data accuracy is: 0.9060402684563759
 The testing data accuracy is: 0.8486590038314177
 The confusion matrix: 
              pred_right  pred_left
actual_right         882         49
actual_left          109          4

When k = 5:
 The cross validation score is: 0.8772814778405833
 The train data accuracy is: 0.8935762224352828
 The testing data accuracy is: 0.8735632183908046
 The confusion matrix: 
              pred_right  pred_left
actual_right         909         22
actual_left          110          3

When k = 15:
 The cross validation score is: 0.8916594986483167
 The train data accuracy is: 0.8916586768935763
 The testing data accuracy is: 0.8917624521072797
 The confusion matrix: 
              pred_right  pred_left
actual_right         931          0
actual_left          113          0

When k = 25:
 The cross validation score is: 0.8916594986483167
 The train data accuracy is: 0.8916586768935763
 The 

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: 

Yes, the regularization is applied by default. 

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer:

No. Although Logistic Regression applies regularization, we do not need to standardize because all our features are valued using the same scale system.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [22]:
# Define regularization and alpha
regs = ['l1', 'l2'] # 'l1' = Lasso, 'l2' = Ridge
alphas = [1, 10]

In [23]:
# Function to run Logistic Regression
'''Input:
        X_train: DataFrame, X training data
        X_test: Dataframe, X testinng data
        y_train: pd Series, y training data
        y_test: pd Series, y training dta
        reg: str ('l1': lasso, or 'l2': ridge)
        alpha: regulation coefficient
'''

def run_lr(X_train, X_test, y_train, y_test, reg, alpha):
    print('Regularization: Lasso' if reg == 'l1' else 'Regularization: Ridge')
    print(f'  When alpha = {alpha}')
    lr = LogisticRegression(penalty=reg, C=(1/alpha), solver='liblinear')
    lr.fit(X_train, y_train)
    cv_score = cross_val_score(lr, X_train, y_train, cv=10).mean()
    y_pred = lr.predict(X_test)
    accuracy_train = lr.score(X_train, y_train)
    accuracy_test = lr.score(X_test, y_test)
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm, columns = ['pred_right', 'pred_left'], 
                                                           index = ['actual_right', 'actual_left'])
    coef = lr.coef_
    print(f'    The cross validation score is: {cv_score}')
    print(f'    The train data accuracy is: {accuracy_train}')
    print(f'    The testing data accracy is: {accuracy_test}')
    print(f'    The confusion matrix: \n{cm_df}\n')
    return y_pred, cv_score, accuracy_train, accuracy_test, cm, coef

In [24]:
# Run model with penalty and alpha values
y_preds_lr = []
cv_scores_lr = []
r2_scores_train_lr = []
r2_scores_test_lr = []
cm_lr = []
coef_lr_alpha = {}
coef_lr = {}

for reg in regs:
    for alpha in alphas:
        y_pred_lr, cv_score_lr, accuracy_train_lr, accuracy_test_lr, cm, coef = \
                 run_lr(X_train, X_test, y_train, y_test, reg, alpha)
        y_preds_lr.append(y_pred)
        cv_scores_lr.append(cv_score)
        r2_scores_train_lr.append(accuracy_train)
        r2_scores_test_lr.append(accuracy_test)
        cm_lr.append(cm)
        coef_lr_alpha[str(alpha)] = coef.ravel()
    coef_lr[reg] = coef_lr_alpha

Regularization: Lasso
  When alpha = 1
    The cross validation score is: 0.891978987466208
    The train data accuracy is: 0.891978267817194
    The testing data accracy is: 0.8917624521072797
    The confusion matrix: 
              pred_right  pred_left
actual_right         931          0
actual_left          113          0

Regularization: Lasso
  When alpha = 10
    The cross validation score is: 0.891978987466208
    The train data accuracy is: 0.891978267817194
    The testing data accracy is: 0.8917624521072797
    The confusion matrix: 
              pred_right  pred_left
actual_right         931          0
actual_left          113          0

Regularization: Ridge
  When alpha = 1
    The cross validation score is: 0.8916594986483167
    The train data accuracy is: 0.891978267817194
    The testing data accracy is: 0.8917624521072797
    The confusion matrix: 
              pred_right  pred_left
actual_right         931          0
actual_left          113          0

Regulari

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer:

I think therea are too many features in the X varriables (44 of them), which makes the true impact very unclear. Too many features usually causes the model to be overfit. A overfit model usually have low bias and high variance. A high variance model usually does poor job on predicting y variable. 

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer: 
$$ precision = (TP + TN) / All Prediction$$


**k-NN Model**

**K Value**|**TP**|**TN**|**FP**|**FN**|**Precision**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
3|4|882|49|109|0.848659004
5|3|909|22|110|0.873563218
15|0|931|0|113|0.891762452
25|0|931|0|113|0.891762452

**Logistic Regression**

**Model**|**Alpha**|**TP**|**TN**|**FP**|**FN**|**Precision**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Lasso|1|0|931|0|113|0.891762452
Lasso|10|0|931|0|113|0.891762452
Ridge|1|0|931|0|113|0.891762452
Ridge|10|0|931|0|113|0.891762452

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer:

When k = 3 and k = 5, the k-NN models are overfitting. The accuracy for training data is higher than accuracy for the testing data.

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer:

As k increase, 
1. bias increase because accuracy drops
2. variance decrease because the accuracy are getting closer. Also the results becomes more consistent for k > 15.

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

1. Reduce the number of featues.
2. Choose the right K.
3. Increase the size of the dataset.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer:

Since my logistic regression has low variance, I don't think my models are overfitting.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:

Based on the documentary, C is the inverse of regularization strengh. Therefore, the higher the C is, the weaker the regularization (lower punishment). Bias tends to be low and variance tends to be high. Low C has the opposite affects. 

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:

In this model, the change of C has no impact to the fit and coefficients in the model.

Larger Alpha actually decreses the coefficients. This makes sense because the higher regularization decreases the sensitivity of the features (by shrinking the coefficient / or making the line flatter).

In [29]:
coef_lr

{'l1': {'1': array([-0.03306998,  0.01481416,  0.04778889, -0.08656532,  0.05177639,
         -0.06143164,  0.00746298, -0.16962719, -0.04274797,  0.02341422,
          0.01535305,  0.04075753, -0.02567381,  0.03105701, -0.02238169,
          0.02857004,  0.02124619, -0.02161584, -0.04156337, -0.05254099,
         -0.1002466 , -0.07123465, -0.03089169, -0.02166746, -0.00698333,
          0.14128654,  0.08317938,  0.01977726,  0.03642763,  0.0324297 ,
          0.0145622 , -0.03817322,  0.00701537, -0.05049121,  0.03001425,
         -0.02433674, -0.04516399,  0.09215549, -0.05355036, -0.07777899,
         -0.04639833, -0.06308872, -0.14682989, -0.02801699]),
  '10': array([-0.03378078,  0.01274694,  0.04686127, -0.08484905,  0.05090015,
         -0.06154275,  0.00557998, -0.16796954, -0.04359203,  0.02046491,
          0.01316768,  0.03760781, -0.02791098,  0.02965609, -0.0213712 ,
          0.02732356,  0.02179564, -0.02219455, -0.04263888, -0.05317671,
         -0.09989373, -0.0717379

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

1. Reduce the number of features
2. Strenghening the regularization
3. Increase the size of dataset.

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:

I would use logistic regression becuase logistic regression is parametric model. The coefficient or beta quantifies the change of left-handedness with respect to 1 unit change of the feature. 

k-NN is non-parametric model and therefore cannot be used for this purpose.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

Answer:

In [416]:
# Extract the coefficient of Q1 in Lasso regularization with alpha = 1
Q1_coef = coef_lr["l1"]["1"][0]

print(f'The coefficient for Q1 is {Q1_coef}')
print(f'One unit increase/decrease in Q1 will decrease/increase the odds of being left-handed by {np.exp(Q1_coef)}\
 times')

The coefficient for Q1 is -0.03306998192365425
One unit increase/decrease in Q1 will decrease/increase the odds of being left-handed by 0.9674708517486124 times


The logistic function can be written as: $p/(1-p) = e^{beta*Q1}*e^{beta*Q2}$.....
So one unit change in Q1 would decrease the odds by $e^{-0.033}$ or 0.97 times.

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

I would choose a parametric model like Logistic Regression with Regularization (Alpha = 10, or C = 0.1). With parametric informaiton with stronger regularization (stronger penalty for overfitting), I will be able to understand and explain what features are more important than the other. 

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

My third conclusion is that left-handed people have low interest in math or science. The coefficient is negative which is consistent with my conclusion.

The coefficient means the more the person like the math class, the less likely he/she is left-handed.

In [30]:
# Coefficient for Q13: I would prefer a class in mathematics to a class in pottery.
coef_lr['l1']['1'][12]

-0.025673805434385053

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)