## Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [None]:
# Using responses to the 44 survey statements (Q1-Q44), which personality traits are left-handed people more likely to display than right-handed people?

In [None]:
# This data was collected from an interactive version of the Open Sex Role Inventory in 2014.

# The following items were rated on a five point scale, with the labels 1=Disagree, 3=Neutral, 5=Agree:

# Q1	I have studied how to win at gambling.
# Q2	I have thought about dying my hair.
# Q3	I have thrown knives, axes or other sharp things.
# Q4	I give people handmade gifts.
# Q5	I have day dreamed about saving someone from a burning building.
# Q6	I get embarrassed when people read things I have written.
# Q7	I have been very interested in historical wars.
# Q8	I know the birthdays of my friends.
# Q9	I like guns.
# Q10	I am happiest when I am in my bed.
# Q11	I did not work very hard in school.
# Q12	I use lotion on my hands.
# Q13	I would prefer a class in mathematics to a class in pottery.
# Q14	I dance when I am alone.
# Q15	I have thought it would be exciting to be an outlaw.
# Q16	When I was a child, I put on fake concerts and plays with my friends.
# Q17	I have considered joining the military.
# Q18	I get dizzy when I stand up sharply.
# Q19	I do not think it is normal to get emotionally upset upon hearing about the deaths of people you did not know.
# Q20	I sometimes feel like crying when I get angry.
# Q21	I do not remember birthdays.
# Q22	I save the letters I get.
# Q23	I playfully insult my friends.
# Q24	I oppose medical experimentation with animals.
# Q25	I could do an impressive amount of push ups.
# Q26	I jump up and down in excitement sometimes.
# Q27	I think a natural disaster would be kind of exciting.
# Q28	I wear a blanket around the house.
# Q29	I have burned things up with a magnifying glass.
# Q30	I think horoscopes are fun.
# Q31	I don't pack much luggage when I travel.
# Q32	I have thought about becoming a vegetarian.
# Q33	I hate shopping.
# Q34	I have kept a personal journal.
# Q35	I have taken apart machines just to see how they work.
# Q36	I take lots of pictures of my activities.
# Q37	I have played a lot of video games.
# Q38	I leave nice notes for people now and then.
# Q39	I have set fuels, aerosols or other chemicals on fire, just for fun.
# Q40	I really like dancing.
# Q41	I take stairs two at a time.
# Q42	I bake sweets just for myself sometimes.
# Q43	I think a natural disaster would be kind of exciting.
# Q44	I decorate my things (e.g. stickers on laptop).

# On the next page the following questions were administered:

# engnat	" Is English you native language?" 1=Yes, 2=No
# age	"What is your age?", entered as text (ages <  13 not recorded)
# education	"How much education have you completed?" 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
# gender	1=Male, 2=Female, 3=Other
# orientation	1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
# race	1=Mixed race, 2=Asian, 3=Black, 4=Native American, 5=Native Australian, 6=White, 7=Other
# religion	1=Atheist/Agnostic, 2=Christian, 3=Muslim, 4=Jewish, 5=Hindu, 6=Buddhist, 7=Other
# hand	"What hand do you use to write with?" 	1=Right, 2=Left, 3=Both

# The following technical data was also obtained:

# country	where the users computer was located (using MaxMind GeoIPLite), ISO country code
# fromgoogle 1=HTTP_referer contained '.google.', 2=it did not
# introelapse	how many seconds from when the introduction page was loaded until the user started the test
# testelapse	how many seconds from when the test was started until the page with the test items was submitted


In [None]:
# The columns after Q44 are: engnat, age, education, gender, orientation, race, religion, hand, country, fromgoogle, introelapse, testelapse
# In total, 12 extra columns plus 44 questions = 56 columns.

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [None]:
# library imports

In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#import knn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# Import logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

#from sklearn.metrics import accuracy_score

In [3]:
# Step 1: Read the CSV file as a text file
with open('data.csv', 'r') as file:
    lines = file.readlines()

# Step 2: Replace the tab delimiters with commas
cleaned_lines = [line.replace('\t', ',') for line in lines]

# Step 3: Write the cleaned lines back to a temporary file
with open('data_cleaned_file.csv', 'w') as file:
    file.writelines(cleaned_lines)

# Step 4: Read the cleaned file into a DataFrame
df = pd.read_csv('data_cleaned_file.csv')
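
Because the cleaning step above only swaps tabs for commas, the file appears to be tab-delimited. A shorter route is to pass `sep='\t'` to `pd.read_csv`. The sketch below uses a made-up three-row sample held in memory, not the real `data.csv`; the name `sample_df` and the sample values are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics a tab-delimited file (not the real data).
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n1\t2\t2\n"

# sep="\t" tells pandas that fields are separated by tabs, so no temporary file is needed.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t")
print(sample_df.shape)
```

On the real file the equivalent call would be `pd.read_csv('data.csv', sep='\t')`, assuming no field contains a literal tab.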

In [5]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [None]:
df.columns

In [None]:
df.shape #4184 rows

In [None]:
df.isnull().sum() #no null values

In [102]:
df['hand'].value_counts() # 1 = right, 2 = left, 3 = both; 11 rows have 0, which is not in the codebook. About 10.8% are left-handed.

hand
1    3542
2     452
3     179
0      11
Name: count, dtype: int64

In [None]:
df['hand'].value_counts(normalize=True)

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

In [7]:
# Drop the demographic columns; only Q1-Q44 (features) and `hand` (target) are needed.

df.drop(columns=['fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion'], inplace=True)

In [None]:
df.columns #check

In [9]:
df.drop(columns=['introelapse', 'testelapse', 'country'], inplace=True) # also drop the timing and country columns

In [11]:
df.columns #check

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'hand'],
      dtype='object')

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

## Cleaning

In [None]:
duplicates = df[df.duplicated()] #no duplicates
duplicates

In [None]:
# Valid answers are on a five-point scale: 1 = Disagree, 3 = Neutral, 5 = Agree.
# Loop through each column and print its unique values.
for column in df.columns:
    unique_vals = df[column].unique()
    print(f'The unique values in {column} are: {unique_vals}')
# Observed values range from 0 to 5, but 0 is not a valid answer in the codebook.
# It could mean a skipped question or some other non-response.
# Without access to the survey designer, I assume that 0 means missing.

In [None]:
# Loop through each column and count the number of 0s
zero_counts = {}
for col in df.columns:
    zero_counts[col] = (df[col] == 0).sum()

# Print the counts
for column, count in zero_counts.items():
    print(f'{column}: {count}')
# Observation: some columns have more zeroes than others.
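
The same counts can be obtained without an explicit loop by comparing the whole DataFrame to 0. This is a minimal sketch on a made-up frame; `toy`, `zeros_per_column`, and `rows_with_zero` are illustrative names, not part of the lab.

```python
import pandas as pd

# Hypothetical toy frame; 0 stands in for a missing answer.
toy = pd.DataFrame({"Q1": [0, 0, 5], "Q2": [1, 0, 2], "hand": [1, 2, 1]})

# (toy == 0) is a boolean frame; summing counts the True values per column.
zeros_per_column = (toy == 0).sum()

# any(axis=1) flags rows with at least one 0; summing counts those rows.
rows_with_zero = (toy == 0).any(axis=1).sum()

print(zeros_per_column)
print(rows_with_zero)
```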

In [13]:
df.isin([0]).any(axis=1).sum() # 390 rows contain at least one 0: (390/4184)*100 = 9.3% of all rows, so I will drop them.

390

In [15]:
no_zero_df = df[(df != 0).all(axis=1)] #reassigned to no_zero_df 
no_zero_df.isin([0]).any(axis=1).sum() #check if no zeroes

0

In [17]:
no_zero_df.shape

(3794, 45)

In [19]:
no_zero_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3794 entries, 0 to 4183
Data columns (total 45 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Q1      3794 non-null   int64
 1   Q2      3794 non-null   int64
 2   Q3      3794 non-null   int64
 3   Q4      3794 non-null   int64
 4   Q5      3794 non-null   int64
 5   Q6      3794 non-null   int64
 6   Q7      3794 non-null   int64
 7   Q8      3794 non-null   int64
 8   Q9      3794 non-null   int64
 9   Q10     3794 non-null   int64
 10  Q11     3794 non-null   int64
 11  Q12     3794 non-null   int64
 12  Q13     3794 non-null   int64
 13  Q14     3794 non-null   int64
 14  Q15     3794 non-null   int64
 15  Q16     3794 non-null   int64
 16  Q17     3794 non-null   int64
 17  Q18     3794 non-null   int64
 18  Q19     3794 non-null   int64
 19  Q20     3794 non-null   int64
 20  Q21     3794 non-null   int64
 21  Q22     3794 non-null   int64
 22  Q23     3794 non-null   int64
 23  Q24     3794 non-n

## What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? 

In [None]:
# ANSWER:

# External sources: https://www.bbc.co.uk/newsround/53739189#:~:text=%22If%20you%20are%20left%2Dhanded,good%20at%20art%20or%20music.

# 1. Left-handed people are said to be creative, unique, and good at the arts
# 2. good at sports
# 3. good at problem-solving
# 4. brain connectivity: "Scientists say the two sides of the brain were better connected in lefties and more co-ordinated, particularly in the areas that involve using language."

# Features chosen from the columns, to test whether left-handed people are more creative. Creativity questions:

# Q4: I give people handmade gifts.
# Q14: I dance when I am alone.
# Q16: When I was a child, I put on fake concerts and plays with my friends.
# Q34: I have kept a personal journal.
# Q36: I take lots of pictures of my activities.
# Q40: I really like dancing.
# Q44: I decorate my things (e.g. stickers on laptop).

# I predict that left-handed people will agree with these statements more strongly than right-handed people.

### Calculate and interpret the baseline accuracy rate:

In [107]:
# Baseline accuracy: always predicting the majority class (not left-handed, hand_2 = 0) is correct for about 89.6% of rows.

#features signalling creativity: questions 4, 14, 16, 34, 36, 40, 44

#dummy the hand column
dummified_df =pd.get_dummies(no_zero_df, columns=['hand'], drop_first=True) #dropped right-hand

In [113]:
dummified_df['hand_2'] = dummified_df['hand_2']*1 #change dummies to int
dummified_df['hand_3'] = dummified_df['hand_3']*1
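
The baseline accuracy is the share of the most common class, since that is what a model scores if it always predicts that class. A minimal sketch, using a made-up label series (`toy_y`) rather than the real `hand_2` column:

```python
import pandas as pd

# Hypothetical labels: 1 = left-handed, 0 = not left-handed (not the real data).
toy_y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Share of each class; the largest share is the baseline accuracy.
class_shares = toy_y.value_counts(normalize=True)
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(f"Always predicting {majority_class} is correct {baseline_accuracy:.0%} of the time")
```

In the notebook the same two lines could be applied to `dummified_df['hand_2']`.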

In [119]:
dummified_df

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,hand_2,hand_3
0,4,1,5,1,5,1,5,1,4,1,...,1,1,5,5,5,1,5,1,0,1
1,1,5,1,4,2,5,5,4,1,5,...,4,4,1,3,1,4,4,5,0,0
2,1,2,1,1,5,4,3,2,1,4,...,4,2,1,4,2,2,2,2,1,0
3,1,4,1,5,1,4,5,4,3,5,...,3,4,1,2,1,1,1,3,1,0
4,5,1,5,1,5,1,5,1,3,1,...,1,1,5,5,5,1,5,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4178,1,2,1,1,3,4,4,1,1,2,...,3,1,1,1,2,1,2,1,0,0
4179,3,5,4,5,2,4,2,2,2,5,...,4,3,4,2,3,4,2,5,0,0
4181,3,2,2,4,5,4,5,2,2,5,...,5,1,2,2,5,1,2,1,0,0
4182,1,3,4,5,1,3,3,1,1,3,...,1,1,1,5,5,1,3,3,0,0


### Short answer questions:

In this lab, you'll use K-nearest neighbors and logistic regression to model handedness based on psychological factors. 

Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [None]:
# Answer here: 
# Regression: the target is a continuous numeric value.

# Classification: the target is one of a set of distinct classes or groups.

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [None]:
# Answer here:

# Lower k: lower bias, because the model can follow the pattern of the training data closely,
# but higher variance, because predictions become sensitive to fluctuations in the training data.
# This raises the risk of overfitting, i.e. capturing noise as if it were a true pattern.

# Higher k: higher bias, because averaging over more neighbors smooths the decision boundary and may miss subtler patterns,
# but lower variance, because the model is more robust to variations in the training data.
# This reduces the risk of overfitting but can underfit by oversimplifying.

# In conclusion, a lower k fits the training data very closely (low bias, high variance), while a higher k gives a smoother, more stable model that might miss finer details (high bias, low variance).
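
The tradeoff described above can be seen by comparing training and testing accuracy for a very small and a fairly large $k$. This is a sketch on synthetic data from `make_classification`, not the survey data; the sample size, noise level (`flip_y`), and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (not the survey data).
X_bv, y_bv = make_classification(
    n_samples=400, n_features=7, n_informative=3, flip_y=0.2, random_state=0
)
X_bv_train, X_bv_test, y_bv_train, y_bv_test = train_test_split(
    X_bv, y_bv, test_size=0.33, random_state=0, stratify=y_bv
)

# For each k, record (training accuracy, testing accuracy).
bv_scores = {}
for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_bv_train, y_bv_train)
    bv_scores[k] = (knn.score(X_bv_train, y_bv_train), knn.score(X_bv_test, y_bv_test))
    print(f"k={k}: train={bv_scores[k][0]:.3f}, test={bv_scores[k][1]:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so training accuracy is perfect; the gap to the testing accuracy is the symptom of high variance.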

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [None]:
# Answer here:
# We standardize predictor variables when using k-nearest neighbors so that all features contribute equally to the distance calculations.
# k-nearest neighbors uses a distance metric (such as Euclidean distance) to find the nearest neighbors.
# If predictors have very different scales, larger-scale variables can dominate the distance calculation, overshadowing the smaller-scale variables.
# Standardizing transforms each feature to have a mean of 0 and a standard deviation of 1, so features are compared fairly,
# which tends to give more reliable predictions.
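
A small numeric sketch of this effect: with one feature measured in years and another on a 1-5 scale, the nearest neighbor of a point can change once both features are standardized. The four points below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age in years, answer on a 1-5 scale].
points = np.array([
    [20.0, 1.0],  # query point
    [25.0, 1.0],  # same answer, 5 years older
    [21.0, 5.0],  # opposite answer, 1 year older
    [60.0, 3.0],  # far away in age
])

def nearest_to_first(rows):
    """Return the row index (1, 2, ...) closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(points)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(points))

print(nearest_raw, nearest_scaled)
```

On the raw values the age column decides the neighbor; after scaling, the large difference in the survey answer counts as much as a large difference in age.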

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [None]:
# Answer here:

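# Yes, though it matters less here: every question uses the same 1-5 scale, so no feature dominates because of its units.
# Standardizing still evens out differences in spread between questions, so each one contributes comparably to the distance.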

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [None]:
# Answer here: 

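# Try a range of k values (for example 1, 3, 5, ...) rather than relying on a single guess.
# Compare them by cross-validated accuracy on the training data, i.e. how well the model predicts class labels it has not seen, and keep the k with the best score.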
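
One way to automate this search is scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data rather than the survey data, and wraps the scaler and the classifier in a `Pipeline` so that scaling is refit inside each fold; the candidate values of $k$ and the names `pipe`, `grid`, and `best_k` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (not the survey data).
X_gs, y_gs = make_classification(n_samples=300, n_features=7, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try each candidate k with 5-fold cross-validation and keep the most accurate one.
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 15, 25]}, cv=5, scoring="accuracy")
grid.fit(X_gs, y_gs)

best_k = grid.best_params_["knn__n_neighbors"]
print(best_k, round(grid.best_score_, 3))
```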

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [None]:
# Answer here:
# The default is penalty='l2', i.e. Ridge-style (L2) regularization.

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

In [None]:
# Answer here:
# C : float, default=1.0: Inverse of regularization strength; must be a positive float. 
#Like in support vector machines, smaller values specify stronger regularization.
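
Because `C` is the inverse of the regularization strength, a penalty strength written as $\alpha$ is commonly translated as `C = 1 / alpha`. The sketch below only constructs the estimators to show the mapping; the exact scaling between $\alpha$ and `C` is treated here as the lab's convention, and the dictionary `models` is an illustrative name.

```python
from sklearn.linear_model import LogisticRegression

# Build one estimator per (penalty, alpha) pair, converting alpha to C = 1 / alpha.
models = {}
for penalty in ("l1", "l2"):
    for alpha in (1, 10):
        models[(penalty, alpha)] = LogisticRegression(
            penalty=penalty,
            C=1 / alpha,          # larger alpha means smaller C, i.e. stronger regularization
            solver="liblinear",   # liblinear supports both l1 and l2 penalties
        )

for (penalty, alpha), model in models.items():
    print(f"penalty={penalty}, alpha={alpha} -> C={model.C}")
```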

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [None]:
# Answer here:
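# Stronger regularization shrinks the coefficients, which increases bias and decreases variance (less overfitting, more risk of underfitting).
# Weaker regularization does the opposite: lower bias, higher variance.
# In practice, a large gap between training and testing accuracy signals high variance; similar scores signal low variance.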

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [None]:
# Answer here:
# Logistic regression estimates one coefficient per feature. Each coefficient is the change in the log-odds of the outcome for a one-unit increase in that feature, holding the others constant.
# Example from a grad-school admissions model: if GPA increases by one unit and all else is held constant, the odds of admission are about 221 times as high.

# kNN is a nonparametric model: there are no coefficients for the predictors,
# and the estimate is not represented by a formula of our predictor variables.
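
To make the interpretation concrete, the coefficients of a fitted logistic regression can be exponentiated to get odds ratios. This is a sketch on synthetic data, not the survey or the admissions example; the feature count and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with four features (not the survey data).
X_or, y_or = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

model_or = LogisticRegression().fit(X_or, y_or)

# coef_ holds the change in log-odds per one-unit increase in each feature;
# exponentiating turns that into a multiplicative change in the odds.
odds_ratios = np.exp(model_or.coef_[0])

for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds multiplied by {ratio:.2f} per one-unit increase")
```

A ratio above 1 means the odds of the positive class rise with the feature; below 1 means they fall. kNN offers no comparable per-feature summary.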

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your features should be:

In [123]:
dummified_df.isnull().sum()

Q1        0
Q2        0
Q3        0
Q4        0
Q5        0
Q6        0
Q7        0
Q8        0
Q9        0
Q10       0
Q11       0
Q12       0
Q13       0
Q14       0
Q15       0
Q16       0
Q17       0
Q18       0
Q19       0
Q20       0
Q21       0
Q22       0
Q23       0
Q24       0
Q25       0
Q26       0
Q27       0
Q28       0
Q29       0
Q30       0
Q31       0
Q32       0
Q33       0
Q34       0
Q35       0
Q36       0
Q37       0
Q38       0
Q39       0
Q40       0
Q41       0
Q42       0
Q43       0
Q44       0
hand_2    0
hand_3    0
dtype: int64

In [125]:
dummified_df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,hand_2,hand_3
0,4,1,5,1,5,1,5,1,4,1,...,1,1,5,5,5,1,5,1,0,1
1,1,5,1,4,2,5,5,4,1,5,...,4,4,1,3,1,4,4,5,0,0
2,1,2,1,1,5,4,3,2,1,4,...,4,2,1,4,2,2,2,2,1,0
3,1,4,1,5,1,4,5,4,3,5,...,3,4,1,2,1,1,1,3,1,0
4,5,1,5,1,5,1,5,1,3,1,...,1,1,5,5,5,1,5,1,0,1


In [135]:
dummified_df[['Q4', 'Q14', 'Q16', 'Q34', 'Q36', 'Q40', 'Q44']]

Unnamed: 0,Q4,Q14,Q16,Q34,Q36,Q40,Q44
0,1,5,1,5,1,5,1
1,4,4,4,4,4,3,5
2,1,3,1,4,2,4,2
3,5,3,5,5,1,2,3
4,1,5,1,5,1,5,1
...,...,...,...,...,...,...,...
4178,1,4,1,1,2,1,1
4179,5,2,4,5,3,2,5
4181,4,2,2,2,1,2,1
4182,5,4,3,5,1,5,3


In [137]:
X= dummified_df[['Q4', 'Q14', 'Q16', 'Q34', 'Q36', 'Q40', 'Q44']] # dataframe
y = dummified_df['hand_2'] #series

In [147]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

In [149]:
y_train.value_counts(normalize=True).mul(100)

hand_2
0    89.571035
1    10.428965
Name: proportion, dtype: float64

#### Create and fit four separate $k$-nearest neighbors models: 
- one with $k = 3$
- one with $k = 5$
- one with $k = 15$
- one with $k = 25$:

In [151]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test) # reuse the training-set mean and std; do not refit the scaler on the test set
knn3 = KNeighborsClassifier(n_neighbors=3)

In [153]:
knn5 = KNeighborsClassifier(n_neighbors=5)

In [155]:
knn15 = KNeighborsClassifier(n_neighbors=15)

In [157]:
knn25 = KNeighborsClassifier(n_neighbors=25)

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [159]:
knn3.fit(X_train_sc, y_train)

In [161]:
cross_val_score(knn3, X_train_sc, y_train, cv=10).mean()

0.8724949822448664

In [163]:
cross_val_score(knn3, X_test_sc, y_test, cv=10).mean()

0.8746920634920634

In [165]:
knn5.fit(X_train_sc, y_train)

In [167]:
cross_val_score(knn5, X_train_sc, y_train, cv=10).mean()

0.8866589470433842

In [169]:
cross_val_score(knn5, X_test_sc, y_test, cv=10).mean()

0.8930603174603174

In [171]:
knn15.fit(X_train_sc, y_train)

In [173]:
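# Note: this cell scores the test set, exactly like the next cell; the training-set version would use X_train_sc and y_train.
cross_val_score(knn15, X_test_sc, y_test, cv=10).mean()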

0.8954539682539684

In [175]:
cross_val_score(knn15, X_test_sc, y_test, cv=10).mean()

0.8954539682539684

In [177]:
knn25.fit(X_train_sc, y_train)

In [179]:
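# Note: this cell scores the test set, exactly like the next cell; the training-set version would use X_train_sc and y_train.
cross_val_score(knn25, X_test_sc, y_test, cv=10).mean()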

0.8954539682539684

In [181]:
cross_val_score(knn25, X_test_sc, y_test, cv=10).mean()

0.8954539682539684

In [183]:
knn3.score(X_train_sc, y_train)

0.9110586383313656

In [185]:
knn5.score(X_train_sc, y_train)

0.898465171192444

In [187]:
knn15.score(X_train_sc, y_train)

0.8957103502558048

In [191]:
knn25.score(X_train_sc, y_train)

0.8957103502558048

In [None]:
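# Are any of your models overfit or underfit?
# k=3: training accuracy (0.911) is well above its cross-validated accuracy (0.872), which points to overfitting.
# k=15 and k=25: training accuracy (0.8957) equals the share of the majority class in y_train, which is what a model scores when it predicts "not left-handed" for every row, so they are likely underfit.

In [None]:
# Do any of your models beat the baseline accuracy rate?
# The baseline is about 89.6%. knn3 reaches 91% only on the data it was trained on; its cross-validated accuracy is 87%.
# None of the cross-validated scores (0.872 to 0.895) clearly beats the baseline.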
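
A compact way to line up all four models against the baseline is to loop over $k$ and score each fitted model on both splits. The sketch below uses synthetic, imbalanced data (about 90% in one class) in place of the survey data, and `DummyClassifier` to compute the majority-class baseline; all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the handedness labels (not the real data).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=7, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42, stratify=y_syn
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr).score(X_te, y_te)

# For each k, record (training accuracy, testing accuracy).
results = {}
for k in (3, 5, 15, 25):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}, baseline={baseline:.3f}")
```

A model is only worth keeping if its testing accuracy, not its training accuracy, exceeds the baseline.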

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as used above with kNN.

In [None]:
# Lasso (penalty='l1') with alpha = 1, i.e. C = 1/alpha = 1.0

In [68]:
logreg = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)

In [70]:
logreg.score(X_train, y_train)

0.8527182866556837

In [72]:
logreg.score(X_test, y_test)

0.852437417654809

In [None]:
# Lasso (penalty='l1') with C = 10.0. Note that C is the inverse of alpha, so this is alpha = 0.1; alpha = 10 would need C = 0.1.

In [74]:
logreg = LogisticRegression(penalty='l1', C=10.0, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)

In [76]:
logreg.score(X_train, y_train)

0.8527182866556837

In [78]:
logreg.score(X_test, y_test)

0.852437417654809

In [None]:
# Ridge (penalty='l2') with alpha = 1, i.e. C = 1/alpha = 1.0

In [80]:
logreg = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)

In [82]:
logreg.score(X_train, y_train)

0.8527182866556837

In [84]:
logreg.score(X_test, y_test)

0.852437417654809

In [None]:
# Ridge (penalty='l2') with C = 10.0. Note that C is the inverse of alpha, so this is alpha = 0.1; alpha = 10 would need C = 0.1.

In [86]:
logreg = LogisticRegression(penalty='l2', C=10.0, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)

In [88]:
logreg.score(X_train, y_train)

0.8527182866556837

In [90]:
logreg.score(X_test, y_test)

0.852437417654809

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [None]:
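# All four models score the same: 0.8527 on the training set and 0.8524 on the test set.
# Train and test accuracy are nearly equal, so there is no sign of overfitting, but changing the penalty or C makes no difference,
# which suggests each model predicts the majority class for every row (underfit). None beats the baseline.
# These cells have earlier execution counts (68-90) than the train-test split above (147), so they should be re-run on the current split. I should also try adding more features.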

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? 

What are the "best" models?

In [None]:
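# None of the models clearly beats the baseline on held-out data, so none is ready to move forward as is.
# kNN with k=3 had the highest training accuracy, but that score reflects overfitting. Next steps: use more of Q1-Q44 as features and run GridSearchCV to optimize k.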