## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [None]:
# 1. What proportion of respondents are left-handed?
# 2. Is there an association between handedness and the answers to the personality questions (Q1-Q44)?
# 3. How do the answers to the personality questions differ between left-handed and right-handed respondents?


---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [78]:
# library imports
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [14]:
df = pd.read_csv('data.csv', sep='\t')  # the file is tab-separated despite its .csv extension

In [15]:
df.head()

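| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | country | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 4 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |
| 1 | 1 | 5 | 1 | 4 | 2 | 5 | 5 | 4 | 1 | 5 | ... | CA | 2 | 1 | 14 | 1 | 2 | 2 | 6 | 1 | 1 |
| 2 | 1 | 2 | 1 | 1 | 5 | 4 | 3 | 2 | 1 | 4 | ... | NL | 2 | 2 | 30 | 4 | 1 | 1 | 1 | 1 | 2 |
| 3 | 1 | 4 | 1 | 5 | 1 | 4 | 5 | 4 | 3 | 5 | ... | US | 2 | 1 | 18 | 2 | 2 | 5 | 3 | 2 | 2 |
| 4 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |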


In [16]:
df.columns


Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

In [17]:
df.drop(columns=['introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion'],inplace=True)

In [18]:
# Background questions to research:
# - What, if anything, distinguishes left-handed people psychologically?
# - Do left-handed people score higher on measures of creativity and imagination?


### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [19]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'hand'],
      dtype='object')

In [20]:
df.head()

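| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | Q36 | Q37 | Q38 | Q39 | Q40 | Q41 | Q42 | Q43 | Q44 | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 4 | 1 | ... | 1 | 1 | 1 | 5 | 5 | 5 | 1 | 5 | 1 | 3 |
| 1 | 1 | 5 | 1 | 4 | 2 | 5 | 5 | 4 | 1 | 5 | ... | 4 | 4 | 4 | 1 | 3 | 1 | 4 | 4 | 5 | 1 |
| 2 | 1 | 2 | 1 | 1 | 5 | 4 | 3 | 2 | 1 | 4 | ... | 2 | 4 | 2 | 1 | 4 | 2 | 2 | 2 | 2 | 2 |
| 3 | 1 | 4 | 1 | 5 | 1 | 4 | 5 | 4 | 3 | 5 | ... | 1 | 3 | 4 | 1 | 2 | 1 | 1 | 1 | 3 | 2 |
| 4 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | ... | 1 | 1 | 1 | 5 | 5 | 5 | 1 | 5 | 1 | 3 |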


In [21]:
df.dtypes

Q1      int64
Q2      int64
Q3      int64
Q4      int64
Q5      int64
Q6      int64
Q7      int64
Q8      int64
Q9      int64
Q10     int64
Q11     int64
Q12     int64
Q13     int64
Q14     int64
Q15     int64
Q16     int64
Q17     int64
Q18     int64
Q19     int64
Q20     int64
Q21     int64
Q22     int64
Q23     int64
Q24     int64
Q25     int64
Q26     int64
Q27     int64
Q28     int64
Q29     int64
Q30     int64
Q31     int64
Q32     int64
Q33     int64
Q34     int64
Q35     int64
Q36     int64
Q37     int64
Q38     int64
Q39     int64
Q40     int64
Q41     int64
Q42     int64
Q43     int64
Q44     int64
hand    int64
dtype: object

In [22]:
df.isnull().sum()

Q1      0
Q2      0
Q3      0
Q4      0
Q5      0
Q6      0
Q7      0
Q8      0
Q9      0
Q10     0
Q11     0
Q12     0
Q13     0
Q14     0
Q15     0
Q16     0
Q17     0
Q18     0
Q19     0
Q20     0
Q21     0
Q22     0
Q23     0
Q24     0
Q25     0
Q26     0
Q27     0
Q28     0
Q29     0
Q30     0
Q31     0
Q32     0
Q33     0
Q34     0
Q35     0
Q36     0
Q37     0
Q38     0
Q39     0
Q40     0
Q41     0
Q42     0
Q43     0
Q44     0
hand    0
dtype: int64

### Calculate and interpret the baseline accuracy rate:

In [25]:
from sklearn.dummy import DummyClassifier

X=df.drop(columns='hand')
y=df['hand']
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

DummyClassifier(strategy='most_frequent')

In [26]:
dummy_clf.score(X, y)

0.8465583173996176

In [28]:
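# A model that always predicts the most frequent class (right-handed) is correct about 84.66% of the time.
# This is the baseline accuracy that any real model should beat.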
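
The baseline can also be worked out by hand as the share of the most frequent class. The sketch below assumes the per-class counts reported by the `value_counts()` outputs later in this lab (train plus test); the variable names are illustrative.

```python
import pandas as pd

# Per-class counts of `hand`, summed from the train and test value_counts()
# outputs shown later in this lab (an assumption that they cover all rows).
counts = pd.Series({1: 2657 + 885, 2: 339 + 113, 3: 134 + 45, 0: 8 + 3}, name="hand")

majority_class = counts.idxmax()                   # most frequent class label
baseline_accuracy = counts.max() / counts.sum()    # share of that class

print(majority_class, round(baseline_accuracy, 4))
```

This gives the same value as `DummyClassifier(strategy="most_frequent")` scored on the full data.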

### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

- Regression: predicting a numeric (continuous) target $y$ from features $X$.
- Classification: predicting a categorical target $y$ (e.g. yes/no, 0/1) from features $X$.

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

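- The bias-variance tradeoff is about finding the sweet spot between a model that is too simple (high bias, underfit) and one that is too flexible (high variance, overfit).
- In KNN, $k$ is the number of nearest neighbors that vote on each prediction.
- A small $k$ gives a flexible model with low bias and high variance. A large $k$ averages over more neighbors, which lowers variance but raises bias.

Choosing $k$ therefore means choosing a point on the bias-variance tradeoff.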
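
As a rough illustration of how $k$ changes model flexibility, the sketch below fits KNN with a very small, a moderate, and a very large $k$ on synthetic data. The data from `make_classification` and all variable names are assumptions for illustration, not the lab's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (a stand-in, not the lab's data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

scores = {}
for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k = {k:>3}: train = {scores[k][0]:.3f}, test = {scores[k][1]:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so the training accuracy is perfect while the test accuracy is lower: a sign of high variance. Larger values of $k$ narrow that gap at the cost of a less flexible model.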

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

Because KNN is distance-based: features measured on larger scales dominate the distance calculation unless all features are put on a comparable scale.

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

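Yes, as a precaution, because KNN calculates distances between observations and is highly sensitive to the magnitude of each feature. That said, Q1-Q44 appear to share the same 1-5 response scale, so standardizing is likely to have only a small effect here.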
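
A minimal sketch of standardizing inside a pipeline, assuming two hypothetical features on very different scales (a 1-5 survey answer and an elapsed time in seconds). The data and names are illustrative, not the lab's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (not the lab's data)
answer = rng.integers(1, 6, size=200)        # survey answer, 1-5
seconds = rng.integers(10, 2000, size=200)   # elapsed time in seconds
X_mixed = np.column_stack([answer, seconds]).astype(float)
y_mixed = (answer >= 3).astype(int)          # label depends only on the survey answer

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The scaler is refit inside each cross-validation fold, so no held-out data leaks in.
print("raw   :", cross_val_score(raw_knn, X_mixed, y_mixed, cv=5).mean())
print("scaled:", cross_val_score(scaled_knn, X_mixed, y_mixed, cv=5).mean())

# After standardizing, each column has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_mixed)
```

Without scaling, the seconds column dominates the Euclidean distance even though it carries no information about the label in this toy setup.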

#### How do we settle on $k$ for a $k$-nearest neighbors model?

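The best $k$ changes from problem to problem. The scikit-learn default is 5, but $k$ is usually chosen by comparing cross-validated scores across several candidate values.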
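
One common way to choose $k$ is a cross-validated grid search. The sketch below assumes synthetic data and a candidate grid borrowed from the values used later in this lab; it is an illustration, not the lab's required approach.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 15, 25]},  # candidate values of k
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`best_params_` holds the value of `n_neighbors` with the highest mean cross-validated accuracy.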

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

L2 regularization 

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

`C` is the inverse of regularization strength and must be a positive float. As in support vector machines, smaller values of `C` specify stronger regularization.

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

As regularization strength increases, the model becomes simpler: bias increases and variance decreases, so overfitting is reduced. If a model is already underfit (high bias), more regularization won't help. In general, regularization is a tool for dealing with overfitting.
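
A small sketch of this effect, assuming synthetic data and scikit-learn's default L2 penalty: as `C` shrinks (stronger regularization), the fitted coefficients are pulled toward zero. The data and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=2)

coef_norms = {}
for C in (100, 1, 0.01):  # weak, default, and strong regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

Smaller coefficients mean a less flexible decision boundary, which is the higher-bias, lower-variance end of the tradeoff.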

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

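Logistic regression estimates one coefficient per feature in a formula for the log-odds (logit) of the outcome. Each coefficient can be read directly as the change in log-odds for a one-unit increase in that feature, holding the others constant. KNN has no coefficients: its predictions come from votes among nearby observations, so there is no comparable summary of each feature's effect.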
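
As an illustration of reading coefficients, the sketch below fits a logistic regression to a toy dataset and converts the coefficients to odds ratios. The features `Q_a` and `Q_b` and the way the target is generated are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical 1-5 survey answers (not the lab's data)
X_toy = pd.DataFrame({
    "Q_a": rng.integers(1, 6, 500),
    "Q_b": rng.integers(1, 6, 500),
})

# Hypothetical binary target: more likely to be 1 as Q_a increases, unrelated to Q_b
prob = 1 / (1 + np.exp(-(X_toy["Q_a"] - 3)))
y_toy = (rng.random(500) < prob).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_toy.columns)
print(odds_ratios)
```

An odds ratio above 1 means the odds of the positive class rise as that answer increases; a value near 1 means the feature has little effect.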


---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be the answers to Q1-Q44, as noted above.

In [42]:
from sklearn.model_selection import train_test_split,cross_val_score
knn=KNeighborsClassifier()
knn

KNeighborsClassifier()

In [None]:
# knn.score() needs a fitted model plus X and y, so scoring is done after the train-test split and fit below.

In [33]:
X.shape

(4184, 44)

In [34]:
y.shape

(4184,)

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify = y)

In [39]:
y_train.value_counts()

1    2657
2     339
3     134
0       8
Name: hand, dtype: int64

In [40]:
y_test.value_counts()

1    885
2    113
3     45
0      3
Name: hand, dtype: int64

In [44]:
cross_val_score(knn,X_train,y_train,cv=8).mean()

0.8333292763670354

### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [45]:
knn.fit(X_train,y_train)

KNeighborsClassifier()

In [46]:
knn.score(X_train,y_train)

0.848629700446144

In [47]:
knn.score(X_test, y_test)

0.8326959847036329

In [54]:
knn1=KNeighborsClassifier(n_neighbors=3)

In [55]:
knn1.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=3)

In [56]:
knn1.score(X_train,y_train)

0.8658381134480561

In [57]:
knn1.score(X_test, y_test)

0.8068833652007649

In [58]:
knn2=KNeighborsClassifier(n_neighbors=5)

In [59]:
knn2.fit(X_train,y_train)

KNeighborsClassifier()

In [60]:
knn2.score(X_train,y_train)

0.848629700446144

In [61]:
knn2.score(X_test,y_test)

0.8326959847036329

In [62]:
knn3=KNeighborsClassifier(n_neighbors=15)

In [63]:
knn3.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=15)

In [64]:
knn3.score(X_train,y_train)

0.8467176545570427

In [65]:
knn3.score(X_test,y_test)

0.8460803059273423

In [66]:
knn4=KNeighborsClassifier(n_neighbors=25)

In [67]:
knn4.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=25)

In [68]:
knn4.score(X_train,y_train)

0.8467176545570427

In [69]:
knn4.score(X_test,y_test)

0.8460803059273423
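
The four fits above follow the same pattern, so they can also be written as a loop. A minimal sketch, assuming a synthetic imbalanced dataset in place of the lab's `X` and `y`; the names `X_tr`, `X_te`, and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the lab's X and y (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85], random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=4
)

results = {}
for k in (3, 5, 15, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = {
        "train": model.score(X_tr, y_tr),  # accuracy on the training split
        "test": model.score(X_te, y_te),   # accuracy on the held-out split
    }

for k, accuracy in results.items():
    print(k, accuracy)
```

Keeping the scores in one dictionary makes it easier to compare train and test accuracy across values of $k$.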

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [74]:
# .score() on a classifier returns accuracy, not R^2
print(f'train_acc_knn1 = {knn1.score(X_train,y_train)},test_acc_knn1 = {knn1.score(X_test,y_test)}')
print(f'train_acc_knn2 = {knn2.score(X_train,y_train)},test_acc_knn2 = {knn2.score(X_test,y_test)}')
print(f'train_acc_knn3 = {knn3.score(X_train,y_train)},test_acc_knn3 = {knn3.score(X_test,y_test)}')
print(f'train_acc_knn4 = {knn4.score(X_train,y_train)},test_acc_knn4 = {knn4.score(X_test,y_test)}')


train_acc_knn1 = 0.8658381134480561,test_acc_knn1 = 0.8068833652007649
train_acc_knn2 = 0.848629700446144,test_acc_knn2 = 0.8326959847036329
train_acc_knn3 = 0.8467176545570427,test_acc_knn3 = 0.8460803059273423
train_acc_knn4 = 0.8467176545570427,test_acc_knn4 = 0.8460803059273423


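knn1 ($k = 3$) is 86.6% accurate on the training data and 80.7% accurate on the test data. The drop from train to test suggests the model is overfit (high variance), so a larger $k$, i.e. a less flexible model, is needed.
An 86.58% training accuracy corresponds to 2,717 correct predictions out of 3,138 training samples.


knn2 ($k = 5$) is 84.9% accurate on the training data and 83.3% accurate on the test data, so it is slightly overfit.

knn3 ($k = 15$) and knn4 ($k = 25$) are both 84.7% accurate on the training data and 84.6% accurate on the test data. These scores equal the share of class 1 in `y_train` (2657/3138) and in `y_test` (885/1046), which is what a model that predicts class 1 for every observation would score.

Compared with the baseline accuracy from the dummy classifier (84.66%), none of the four models beats the baseline. knn3 and knn4 come closest, essentially matching it.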
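
When a model's accuracy sits right at the baseline, it is worth checking whether it simply predicts the majority class. A minimal sketch, assuming a hypothetical imbalanced target and a `DummyClassifier` standing in for the fitted model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced labels, roughly like the lab's target (not the real data)
y_true = np.array([1] * 85 + [2] * 11 + [3] * 4)
X_placeholder = np.zeros((100, 1))  # the features are irrelevant to a dummy model

majority_model = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_pred = majority_model.predict(X_placeholder)

# Which classes does the model ever predict, and how often?
print(np.unique(y_pred, return_counts=True))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)
```

If only the first column of the confusion matrix is non-zero, the model never predicts the minority classes, and its accuracy is just the majority-class share. The same check can be run on any fitted classifier by swapping in its predictions.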

---

## Step 4 & 5 Modeling: logistic regression

### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.
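
On the hint about $\alpha$: scikit-learn's `LogisticRegression` takes `C`, the inverse of regularization strength, rather than $\alpha$. A minimal sketch, assuming the convention $C = 1/\alpha$; `logistic_for_alpha` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.linear_model import LogisticRegression


def logistic_for_alpha(alpha, penalty):
    """Hypothetical helper: build a model for a given alpha, assuming C = 1 / alpha."""
    return LogisticRegression(C=1 / alpha, penalty=penalty, solver="liblinear")


# One unfitted model per (penalty, alpha) combination named in the heading above
models = {
    (penalty, alpha): logistic_for_alpha(alpha, penalty)
    for penalty in ("l1", "l2")   # "l1" is LASSO, "l2" is Ridge
    for alpha in (1, 10)
}

for key, model in models.items():
    print(key, "-> C =", model.C)
```

Under this convention, $\alpha = 10$ corresponds to `C=0.1`, not `C=10`.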

In [120]:
lr1 = LogisticRegression(C=1, penalty="l1", solver="liblinear")  # LASSO (L1) with alpha = 1, so C = 1/alpha = 1

In [121]:
# A regression model that uses L1 regularization is called lasso regression, and one that uses L2 is called ridge regression.
# The key difference is the penalty term: L1 penalizes the sum of absolute coefficients, L2 the sum of squared coefficients.

In [122]:
lr1

LogisticRegression(C=1, penalty='l1', solver='liblinear')

In [123]:
lr1.fit(X_train, y_train)

LogisticRegression(C=1, penalty='l1', solver='liblinear')

In [124]:
print(lr1.score(X_train, y_train))
print(lr1.score(X_test, y_test))

0.847036328871893
0.8460803059273423


In [125]:
lr2 = LogisticRegression(C=10, penalty="l1", solver="liblinear")  # LASSO (L1); note that C = 10 corresponds to alpha = 1/C = 0.1, so alpha = 10 would need C = 0.1

In [126]:
lr2.fit(X_train, y_train)

LogisticRegression(C=10, penalty='l1', solver='liblinear')

In [127]:
print(lr2.score(X_train, y_train))
print(lr2.score(X_test, y_test))

0.847036328871893
0.8451242829827916


In [128]:
lr3 = LogisticRegression(C=1, penalty='l2', solver='liblinear')  # Ridge (L2) with alpha = 1, so C = 1/alpha = 1

In [129]:
lr3.fit(X_train, y_train)

LogisticRegression(C=1, solver='liblinear')

In [130]:
print(lr3.score(X_train, y_train))
print(lr3.score(X_test, y_test))

0.847036328871893
0.8460803059273423


In [131]:
lr4 = LogisticRegression(C=10, penalty='l2', solver='liblinear')  # Ridge (L2); note that C = 10 corresponds to alpha = 1/C = 0.1, so alpha = 10 would need C = 0.1

In [132]:
lr4.fit(X_train, y_train)

LogisticRegression(C=10, solver='liblinear')

In [133]:
print(lr4.score(X_train, y_train))
print(lr4.score(X_test, y_test))

0.847036328871893
0.8460803059273423


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [134]:
# .score() on a classifier returns accuracy, not R^2
print(f'train_acc_lr1 = {lr1.score(X_train,y_train)},test_acc_lr1 = {lr1.score(X_test,y_test)}')
print(f'train_acc_lr2 = {lr2.score(X_train,y_train)},test_acc_lr2 = {lr2.score(X_test,y_test)}')
print(f'train_acc_lr3 = {lr3.score(X_train,y_train)},test_acc_lr3 = {lr3.score(X_test,y_test)}')
print(f'train_acc_lr4 = {lr4.score(X_train,y_train)},test_acc_lr4 = {lr4.score(X_test,y_test)}')


train_acc_lr1 = 0.847036328871893,test_acc_lr1 = 0.8460803059273423
train_acc_lr2 = 0.847036328871893,test_acc_lr2 = 0.8451242829827916
train_acc_lr3 = 0.847036328871893,test_acc_lr3 = 0.8460803059273423
train_acc_lr4 = 0.847036328871893,test_acc_lr4 = 0.8460803059273423


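All four models are about 84.7% accurate on the training data and 84.5% to 84.6% accurate on the test data. The train and test scores are nearly identical, so none of the models looks overfit, but none beats the baseline accuracy (84.66%) either. The scores are roughly what we would expect from models that predict the majority class for almost every observation.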
