## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Import LogisticRegression and LinearRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression, LinearRegression

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer:
1. Are people more likely to be left-handed based on their personality?
2. Do left-handed people with certain personalities question more than, less than, or about the same as right-handed people?
3. Do left-handed people with violent personalities question more than, less than, or about the same as right-handed people?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [2]:
df = pd.read_csv('data.csv',sep='\t')
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3
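One way to discover the separator before calling `pd.read_csv()` is to inspect the first few lines of the file. The sketch below uses the standard library's `csv.Sniffer` on an in-memory sample. The sample text and the `detect_delimiter` helper are hypothetical stand-ins for `data.csv`, not part of the lab.

```python
import csv

def detect_delimiter(sample_text, candidates=",\t;"):
    """Guess which candidate character separates the fields in sample_text."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Hypothetical sample mimicking the first lines of a tab-separated file.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"
delimiter = detect_delimiter(sample)
print(repr(delimiter))
```

The detected character could then be passed as the `sep` argument of `pd.read_csv()`.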


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:
1. Let participants take surveys in a private room to protect their identity.
2. Let participants take surveys on a computer, as handwritten notes may give away their identity.
3. Let participants take surveys anonymously.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [3]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [4]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [5]:
df.isnull().sum().head()

Q1    0
Q2    0
Q3    0
Q4    0
Q5    0
dtype: int64

In [6]:
df.shape

(4184, 56)

In [7]:
df['country'].unique()

array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

In [8]:
df['age'].value_counts()

18     369
17     357
16     329
20     316
19     311
      ... 
73       1
85       1
409      1
78       1
77       1
Name: age, Length: 66, dtype: int64

In [9]:
df['education'].value_counts()

2    2055
3    1086
1     546
4     446
0      51
Name: education, dtype: int64

In [10]:
df['gender'].value_counts()

2    2212
1    1586
3     304
0      82
Name: gender, dtype: int64

In [11]:
df['orientation'].value_counts()

1    2307
2     833
5     349
3     335
4     237
0     123
Name: orientation, dtype: int64

In [12]:
df['race'].value_counts()

6    2793
1     393
2     383
7     342
3     168
0      66
4      33
5       6
Name: race, dtype: int64

In [13]:
df['religion'].value_counts()

1    1857
2    1222
7     623
0     187
6     103
3      80
4      62
5      50
Name: religion, dtype: int64

In [14]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: This would be a classification problem because the question is whether or not the person is left-handed, which is categorical and unordered.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

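Answer: We standardize our variables so that no single feature has an outsized influence on a model simply because it is measured on a larger scale. This matters most for models that rely on distances, such as $k$-NN.

For example, GPA and GRE scores are measured on very different scales, so we would standardize them before using them together as predictors.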
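To see why scale matters for a distance-based model, the sketch below compares how much each feature contributes to a Euclidean distance before and after z-scoring. The GPA and GRE values are hypothetical, and the helper names are chosen here for illustration.

```python
import math
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical applicants: GPA on a small scale, GRE on a scale in the hundreds.
gpa = [2.0, 4.0, 3.0, 3.9]
gre = [600.0, 610.0, 700.0, 500.0]

raw_points = list(zip(gpa, gre))
scaled_points = list(zip(standardize(gpa), standardize(gre)))

# Applicants 0 and 1 differ a lot in GPA (2 points) but little in GRE (10 points).
raw_distance = euclidean(raw_points[0], raw_points[1])
scaled_distance = euclidean(scaled_points[0], scaled_points[1])

# Share of the squared distance that comes from GPA, before and after scaling.
raw_gpa_share = (gpa[0] - gpa[1]) ** 2 / raw_distance ** 2
scaled_gpa_share = (scaled_points[0][0] - scaled_points[1][0]) ** 2 / scaled_distance ** 2
print(round(raw_gpa_share, 3), round(scaled_gpa_share, 3))
```

On the raw scale the GRE difference dominates the distance even though the GPA gap is the meaningful one; after standardizing, the GPA gap accounts for most of it.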

### 7. Give an example of when we might not standardize our variables.

Answer: We might not standardize our variables when they are all measured on the same scale.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer: I don't think we should standardize our predictor variables, because they are all measured on a scale of 1 to 5, with the numbers carrying the same meaning for each variable.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: There are 11 '0' responses to the question asking if the person is right-handed, left-handed, or both. Those rows should be filtered out.

In [15]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [16]:
df = df[df.hand != 0]

In [17]:
df['hand'].value_counts()

1    3542
2     452
3     179
Name: hand, dtype: int64
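The question asks whether a person is left-handed, which is a yes/no target, while `hand` still holds three codes after filtering. One option is to recode it as a binary variable. The sketch below assumes, following the class labels listed in the BONUS section, that 2 means left-handed; the `to_left_handed` helper and the sample values are hypothetical. In pandas the same idea could be written as `(df['hand'] == 2).astype(int)`.

```python
def to_left_handed(hand_codes, left_code=2):
    """Map each hand code to 1 if it equals left_code (left-handed), else 0."""
    return [1 if code == left_code else 0 for code in hand_codes]

# Hypothetical handful of responses: 1 = right, 2 = left, 3 = both.
hand = [3, 1, 2, 2, 3, 1]
y_binary = to_left_handed(hand)
print(y_binary)
```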

In [18]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: Setting $k = 4$ could be a bad idea because an even $k$ can produce tied votes among the neighbors. In addition, the dataset contains more than 4,000 responses, and a $k$ this small risks overfitting the model.
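With two classes, an even $k$ can split the neighbors' vote evenly, leaving the prediction to an arbitrary tie-breaking rule. The sketch below illustrates this with hypothetical neighbor labels; the `vote` helper is written here for illustration and is not how any particular library resolves ties.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winning label, whether the top two labels tied)."""
    counts = Counter(neighbor_labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return top_label, tied

# Hypothetical labels of the nearest neighbors for one test point
# (1 = left-handed, 0 = not left-handed).
four_neighbors = [1, 0, 0, 1]
five_neighbors = [1, 0, 0, 1, 0]

print(vote(four_neighbors))
print(vote(five_neighbors))
```

With four neighbors the vote is split two to two, so the "winner" depends only on ordering; with five neighbors there is a clear majority.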

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler


In [20]:
type(df)

pandas.core.frame.DataFrame

In [21]:
target = df['hand']

In [22]:
col_list = list(df.columns)
features = col_list[0:44]

In [23]:
X = df[features]
y = df['hand']

In [24]:
y.value_counts(normalize=True)

1    0.848790
2    0.108315
3    0.042895
Name: hand, dtype: float64
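These proportions give a useful reference point: a model that ignores the features and always predicts the most common class would already be right about 85% of the time. A minimal sketch of that baseline, using the class counts from the `value_counts()` output after the 0 responses were dropped:

```python
# Class counts taken from the value_counts() output above (hand != 0).
hand_counts = {1: 3542, 2: 452, 3: 179}

total = sum(hand_counts.values())
majority_class = max(hand_counts, key=hand_counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = hand_counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 6))
```

Any model's accuracy is worth reading against this number rather than against zero.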

In [25]:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

# k = 3

In [26]:
knn = KNeighborsClassifier(n_neighbors=3)

In [27]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [28]:
knn.score(X_train,y_train)

0.8660914030041547

In [29]:
knn.score(X_test,y_test)

0.8199233716475096

In [30]:
cross_val_score(knn,X_train,y_train,cv=3).mean()

0.8175156132402824

# k = 5

In [31]:
knn = KNeighborsClassifier()

In [32]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [33]:
knn.score(X_train,y_train)

0.8536273569830617

In [34]:
knn.score(X_test,y_test)

0.8477011494252874

In [35]:
cross_val_score(knn,X_train,y_train,cv = 3).mean()

0.8357326149345085

# k = 15

In [36]:
knn = KNeighborsClassifier(n_neighbors=15)

In [37]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

In [38]:
knn.score(X_train,y_train)

0.8488334931287952

In [39]:
knn.score(X_test,y_test)

0.8486590038314177

In [40]:
cross_val_score(knn,X_train,y_train,cv=3).mean()

0.8488337070026014

# k = 25

In [41]:
knn = KNeighborsClassifier(n_neighbors=25)

In [42]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='uniform')

In [43]:
knn.score(X_train,y_train)

0.8488334931287952

In [44]:
knn.score(X_test,y_test)

0.8486590038314177

In [45]:
cross_val_score(knn,X_train,y_train,cv = 3).mean()

0.8488337070026014

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: Yes, there is default regularization: an L2 (ridge-style) penalty, `penalty='l2'`, with `C=1.0`.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer: It depends. Because the model is regularized by default, we should standardize our features if they were measured on different scales. If they are all measured on the same scale, as Q1 - Q44 are here, standardizing is not necessary.

### 14. Let's use linear regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Instantiate and fit your model.

In [46]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [47]:
lr = LinearRegression()

In [48]:
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [49]:
lr.score(X_train,y_train)

0.028878241982779992

In [50]:
lr.score(X_test,y_test)

0.007721890623609062

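Linear regression is a poor fit for this problem. Its `score` method returns $R^2$, which is only about 0.03 on the training set and 0.008 on the test set, so the features explain almost none of the variation in `hand`. Because `hand` is a categorical label rather than a quantity, a classifier is more appropriate.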
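For reference, the value returned by `LinearRegression.score` is the coefficient of determination. In standard notation (the symbols $y_i$, $\hat{y}_i$, and $\bar{y}$ are introduced here for illustration, denoting the observed values, the predictions, and the mean of the observed values):

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

A value near 0 means the predictions are barely better than always predicting the mean $\bar{y}$.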

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

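Answer: I do not think the responses to Q1 - Q44 will be good predictors of someone being left-handed, since there is no obvious reason these survey answers would relate to handedness. I expect scores close to what we would get by always guessing the most common class. This is based on my opinion, so I may be proven wrong.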

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer:

# Model 1

In [51]:
knn = KNeighborsClassifier(n_neighbors=3)

In [52]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [53]:
cross_val_score(knn,X_train,y_train,cv = 3).mean()

0.8175156132402824

In [54]:
cross_val_score(knn,X_test,y_test,cv = 3).mean()

0.8199278800165136

# Model 2

In [55]:
knn = KNeighborsClassifier(n_neighbors=5)

In [56]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [57]:
cross_val_score(knn,X_train,y_train,cv = 3).mean()

0.8357326149345085

In [58]:
cross_val_score(knn,X_test,y_test,cv = 3).mean()

0.8390850630521506

# Model 3

In [59]:
knn = KNeighborsClassifier(n_neighbors=15)

In [60]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

In [61]:
cross_val_score(knn,X_train,y_train,cv=3).mean()

0.8488337070026014

In [62]:
cross_val_score(knn,X_test,y_test,cv=3).mean()

0.8486609258203087

# Model 4

In [63]:
knn = KNeighborsClassifier(n_neighbors=25)

In [64]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='uniform')

In [65]:
cross_val_score(knn,X_train,y_train,cv=3).mean()

0.8488337070026014

In [66]:
cross_val_score(knn,X_test,y_test,cv=3).mean()

0.8486609258203087

# Model 5

In [70]:
logreg = LogisticRegression(penalty = 'l1', C = 1)

In [71]:
logreg.fit(X_train,y_train)



LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [72]:
cross_val_score(logreg,X_train,y_train,cv=3).mean()



0.8488337070026014

In [73]:
cross_val_score(logreg,X_test,y_test,cv=3).mean()



0.8438716221519802

# Model 6

In [74]:
logreg = LogisticRegression(penalty = 'l1', C = (1/10))

In [75]:
logreg.fit(X_train, y_train)



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [76]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()



0.848513809369844

In [77]:
cross_val_score(logreg, X_test, y_test, cv = 3).mean()



0.8486609258203087

# Model 7

In [78]:
logreg = LogisticRegression(penalty = 'l2', C = 1)

In [79]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [80]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()



0.8488337070026014

In [81]:
cross_val_score(logreg, X_test, y_test, cv = 3).mean()



0.8400401728897818

# Model 8

In [82]:
logreg = LogisticRegression(penalty = 'l2', C = (1/10))

In [83]:
logreg.fit(X_train, y_train)



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [84]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()



0.8488337070026014

In [85]:
cross_val_score(logreg, X_test, y_test, cv = 3).mean()



0.8457928517389158

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

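Answer: Model 1 ($k = 3$) shows the clearest evidence of overfitting: its training accuracy from `score` (0.866) is noticeably higher than its testing accuracy (0.820). Models 3 and 4, with `n_neighbors` set to 15 and 25 respectively, score only slightly higher on the training set than on the test set, so any overfitting there is minimal.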

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer: As $k$ increases, bias goes up and variance goes down.
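The two extremes make the tradeoff concrete: with $k = 1$ every training point is its own nearest neighbor, so the model memorizes the training set, while with $k$ equal to the number of training points every prediction is just the majority class. The sketch below is a minimal pure-Python $k$-NN on a hypothetical toy dataset; it is written for illustration and is not the scikit-learn implementation.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of query by majority vote among its k nearest training points."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: distance(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def training_accuracy(points, labels, k):
    """Share of training points that the k-NN rule labels correctly."""
    hits = sum(knn_predict(points, labels, p, k) == y for p, y in zip(points, labels))
    return hits / len(points)

# Hypothetical 2-D training set: five points of class 0 and two of class 1.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (5, 5), (0.5, 0.5)]
labels = [0, 0, 0, 0, 0, 1, 1]

for k in (1, 3, len(points)):
    print(k, round(training_accuracy(points, labels, k), 3))
```

Perfect training accuracy at $k = 1$ reflects memorization (low bias, high variance); at the largest $k$ the model can only ever output the majority class (high bias, low variance).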

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Remove features
2. Gather more training data
3. Increase $k$

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer: Models 7 and 8 show slight evidence of overfitting because their `cross_val_score` results on the training set are higher than their results on the test set.

# NOTE: In order to answer questions 21 through 23, you'll need knowledge of regularization, which we'll learn on Wednesday morning!

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

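Answer: As $C$ increases, bias decreases and variance increases, because $C$ is the inverse of the regularization strength (often written $\alpha$). Equivalently, as $\alpha$ increases, bias increases and variance decreases.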
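One way to see the role of $C$ is to write out a penalized logistic regression objective. The form below is a sketch in generic notation (labels $y_i \in \{-1, +1\}$, weights $w$, intercept $b$, penalty $r(w)$); it follows the general shape described in the scikit-learn user guide, but the symbols are chosen here for illustration:

$$\min_{w,\,b} \; r(w) + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right), \qquad r(w) = \tfrac{1}{2}\lVert w \rVert_2^2 \ \text{(L2)} \quad \text{or} \quad r(w) = \lVert w \rVert_1 \ \text{(L1)}$$

Because $C$ multiplies the data-fit term, a larger $C$ makes the penalty relatively less important (weaker regularization), and a smaller $C$ makes it relatively more important (stronger regularization).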

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer: As $C$ decreases (stronger regularization), the coefficients shrink toward zero, the model becomes less overfit, and the `cross_val_score` on the training set moves closer to the `cross_val_score` on the test set.

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Remove features
2. Regularize
3. Reduce C

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

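Answer: Logistic regression, because it learns a linear classifier with one coefficient per feature that we can interpret, whereas $k$-NN gives no direct measure of how much each feature matters.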

### 25. Select your original logistic regression model (the one with no regularization). Interpret the coefficient for `Q1`.

Answer:

In [88]:
logreg = LogisticRegression(penalty = 'l1', C = (1/10))

In [89]:
logreg.fit(X_train,y_train)



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [90]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()



0.848513809369844

In [91]:
cross_val_score(logreg, X_test, y_test, cv = 3).mean()



0.8486609258203087

In [92]:
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logreg.coef_))], axis = 1)

In [93]:
coefficients.head()

Unnamed: 0,0,0.1,1,2
0,Q1,0.0,-0.01625,0.0
1,Q2,0.007939,-0.013981,0.0
2,Q3,-0.053256,0.0,0.109249
3,Q4,0.0,-0.060247,0.12859
4,Q5,-0.037735,0.066434,0.0


In [94]:
np.exp(-0.01625)

0.9838813189766874

In [95]:
100 * ((np.exp(-0.01625))-1)

-1.6118681023312598

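Answer: The coefficient of -0.01625 belongs to the second class in `hand` (left-handed). For every one-unit increase in Q1, the odds that the respondent is left-handed rather than not decrease by about 1.61%, holding the other features constant.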
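A logistic regression coefficient acts on the odds, not directly on the probability, so a 1.61% change in the odds is not the same as a 1.61 percentage-point change in probability. The sketch below converts between the two. The `shift_probability` helper is hypothetical, and the starting probability is used only as an illustrative reference point (the share of class 2 in `hand`).

```python
import math

def shift_probability(probability, coefficient, delta=1.0):
    """Apply a logistic-regression coefficient to a starting probability.

    The coefficient acts on the log-odds, so the odds are multiplied by
    exp(coefficient * delta) and then converted back to a probability.
    """
    odds = probability / (1.0 - probability)
    new_odds = odds * math.exp(coefficient * delta)
    return new_odds / (1.0 + new_odds)

coefficient_q1 = -0.01625  # value read from the coefficient table
start = 0.108315           # illustrative starting probability (share of class 2)

odds_ratio = math.exp(coefficient_q1)
after_one_unit = shift_probability(start, coefficient_q1)
print(round(odds_ratio, 4), round(after_one_unit, 4))
```

The odds ratio matches the `np.exp(-0.01625)` value computed in the notebook, while the resulting change in probability is much smaller than 1.61 percentage points.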

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

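Answer: I would pick the lasso (L1-penalized) logistic regression model because it performed the best so far: with $C = 0.1$, its `cross_val_score` results on the training and test sets are nearly identical (about 0.849).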

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer: The model doesn't seem to be able to answer any of my questions.



### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
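As a starting point for the sensitivity/specificity and unbalanced-classes items, the sketch below shows how a "model" that always predicts the majority class can look accurate while never identifying a single left-handed person. The labels are hypothetical, and the `sensitivity_specificity` helper is written here for illustration.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical labels: 1 = left-handed, 0 = not left-handed (imbalanced on purpose).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(accuracy, sensitivity, specificity)
```

High accuracy paired with zero sensitivity is the pattern to watch for when one class dominates the data.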