<a href="https://colab.research.google.com/github/lukehdez95/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# Data Wrangling

In [7]:
#inspecting dataframe
df.head(30)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
5,Other,1/28/2016,,,,6.99,4.0,,,,,,3.0,4.0,5.0,3.5,2.5,2.5,2.5,4.0,1.0,,,,,x,x,,x,,x,,,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,False
6,California,1/30/2016,3.0,2.9,,7.19,1.5,,,,,,2.0,3.0,3.0,2.0,2.5,2.5,,2.0,3.0,,,x,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
7,Carnitas,1/30/2016,,,,6.99,4.0,,,,,,2.5,3.0,3.0,2.5,3.0,3.5,,2.5,3.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
8,California,2/1/2016,3.0,3.7,x,9.25,3.5,,,,,,2.0,4.5,4.5,3.5,1.5,3.0,3.5,4.0,2.0,,,x,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
9,Asada,2/6/2016,4.0,4.1,,6.25,3.5,,,,,,2.5,1.5,1.5,3.0,4.5,3.0,1.5,2.0,4.5,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Date            421 non-null    object 
 2   Yelp            87 non-null     float64
 3   Google          87 non-null     float64
 4   Chips           26 non-null     object 
 5   Cost            414 non-null    float64
 6   Hunger          418 non-null    float64
 7   Mass (g)        22 non-null     float64
 8   Density (g/mL)  22 non-null     float64
 9   Length          283 non-null    float64
 10  Circum          281 non-null    float64
 11  Volume          281 non-null    float64
 12  Tortilla        421 non-null    float64
 13  Temp            401 non-null    float64
 14  Meat            407 non-null    float64
 15  Fillings        418 non-null    float64
 16  Meat:filling    412 non-null    float64
 17  Uniformity      419 non-null    flo

In [9]:
# Drop the Yelp, Google, Mass, Density, and NonSD columns: they have too few values, and I don't think imputing their mean would make for good data.
# Queso is dropped too, since it has no non-null values.
col_to_drop = ['Yelp', 'Google', 'Density (g/mL)', 'Mass (g)', 'NonSD', 'Queso']
df.drop(columns=col_to_drop,inplace=True)
df.head()

Unnamed: 0,Burrito,Date,Chips,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,,6.49,3.0,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,,5.45,3.5,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,4.85,1.5,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,5.25,2.0,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,x,6.59,4.0,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [10]:
# Replace capital 'X' with lowercase 'x' so both spellings are treated as the same value when I encode later
df.replace(to_replace="X",value="x",inplace=True)
df['Chips'].unique()

array([nan, 'x', 'Yes', 'No'], dtype=object)

In [11]:
# These object-type features hold several kinds of responses, and I want to account for all of them:
# 'x' or 'Yes' becomes 1, and NaN or 'No' becomes 0. The columns are then already numeric,
# which saves me some features when I one-hot encode later.
df.replace(to_replace="Yes",value="x",inplace=True)
df.replace(to_replace="x",value=1,inplace=True)
# Assign the result back instead of calling fillna(inplace=True) on an iloc selection,
# which may act on a temporary copy and leave df unchanged
df['Chips'] = df['Chips'].fillna(0)
df.replace(to_replace='No',value=0,inplace=True)
# Columns 17 through 51 are Unreliable through Zucchini
flag_cols = df.columns[17:52]
df[flag_cols] = df[flag_cols].fillna(0)
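
The same presence/absence encoding can be sketched in one step with `DataFrame.isin`. This is an illustrative alternative on a made-up frame (`toy` and its values are hypothetical, not the burrito data), assuming that `'x'`, `'X'`, and `'Yes'` all mean "present" and everything else means "absent":

```python
import pandas as pd

# Hypothetical sample with the same kinds of responses seen in the Chips column
toy = pd.DataFrame({
    'Chips': [None, 'x', 'Yes', 'No', 'X'],
    'Guac':  ['x', None, 'X', None, 'x'],
})

# True where the cell marks the ingredient as present, False otherwise
# (missing values and 'No' both come out False), then cast to 0/1
encoded = toy.isin(['x', 'X', 'Yes']).astype(int)
print(encoded)
```

This avoids depending on the order of the `replace` and `fillna` calls, at the cost of treating any unexpected value as "absent".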

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 53 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Burrito        421 non-null    object 
 1   Date           421 non-null    object 
 2   Chips          421 non-null    int64  
 3   Cost           414 non-null    float64
 4   Hunger         418 non-null    float64
 5   Length         283 non-null    float64
 6   Circum         281 non-null    float64
 7   Volume         281 non-null    float64
 8   Tortilla       421 non-null    float64
 9   Temp           401 non-null    float64
 10  Meat           407 non-null    float64
 11  Fillings       418 non-null    float64
 12  Meat:filling   412 non-null    float64
 13  Uniformity     419 non-null    float64
 14  Salsa          396 non-null    float64
 15  Synergy        419 non-null    float64
 16  Wrap           418 non-null    float64
 17  Unreliable     421 non-null    float64
 18  Beef      

# Train/ Val/ Test

In [13]:
date = '1/18/2016'
def date_yr(TIMEDATE):
  return int(TIMEDATE.split('/')[2])
date_yr(date)

2016
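
An alternative to splitting the string by hand is to let pandas parse the dates. A minimal sketch, assuming every date is in month/day/year form like the rows shown earlier; the `dates` Series is made-up sample data:

```python
import pandas as pd

# Hypothetical sample in the same m/d/Y format as the Date column
dates = pd.Series(['1/18/2016', '2/6/2017', '12/30/2018'])

# Parse with an explicit format, then pull out the year via the .dt accessor
years = pd.to_datetime(dates, format='%m/%d/%Y').dt.year
print(years.tolist())
```

Parsing to real datetimes also makes it possible to filter on full dates later, not just on the year.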

In [14]:
df['Date_yr'] = df['Date'].apply(date_yr)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 54 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Burrito        421 non-null    object 
 1   Date           421 non-null    object 
 2   Chips          421 non-null    int64  
 3   Cost           414 non-null    float64
 4   Hunger         418 non-null    float64
 5   Length         283 non-null    float64
 6   Circum         281 non-null    float64
 7   Volume         281 non-null    float64
 8   Tortilla       421 non-null    float64
 9   Temp           401 non-null    float64
 10  Meat           407 non-null    float64
 11  Fillings       418 non-null    float64
 12  Meat:filling   412 non-null    float64
 13  Uniformity     419 non-null    float64
 14  Salsa          396 non-null    float64
 15  Synergy        419 non-null    float64
 16  Wrap           418 non-null    float64
 17  Unreliable     421 non-null    float64
 18  Beef      

In [15]:
train = df[df['Date_yr'] == 2016]
val = df[df['Date_yr'] == 2017]
test = df[df['Date_yr'] >= 2018]
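
The assignment asks for training on reviews from 2016 and earlier, while the filter above keeps 2016 exactly. A sketch of the inclusive version on a made-up frame (the `toy` data is hypothetical, not the burrito data):

```python
import pandas as pd

# Hypothetical reviews, including a couple from before 2016
toy = pd.DataFrame({
    'Date_yr': [2011, 2015, 2016, 2016, 2017, 2018, 2019],
    'Great':   [False, True, False, True, True, False, True],
})

train = toy[toy['Date_yr'] <= 2016]   # 2016 and earlier
val   = toy[toy['Date_yr'] == 2017]
test  = toy[toy['Date_yr'] >= 2018]   # 2018 and later

# Every row should land in exactly one split
assert len(train) + len(val) + len(test) == len(toy)
print(len(train), len(val), len(test))
```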

In [16]:
target = 'Great'
features = ['Burrito', 'Chips', 'Cost', 'Hunger', 'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Unreliable', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']
# Queso is left out because it has no non-null values
X_train = train[features]
X_val = val[features]
X_test = test[features]
y_train = train[target]
y_val = val[target]
y_test = test[target]

# Baseline

In [17]:
y_train.value_counts(normalize=True) #Percentages of each outcome, whether burrito was great or not

False    0.591216
True     0.408784
Name: Great, dtype: float64

In [18]:
majority_rating = y_train.mode()[0]
y_pred = [majority_rating] * len(y_train) #Using the baseline of all burritos being not great 
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred) # The training accuracy with the baseline of all burritos being not great

0.5912162162162162

In [19]:
# Validation accuracy with the baseline of all burritos being not great
y_pred = [majority_rating] * len(y_val)
accuracy_score(y_val, y_pred)

0.5529411764705883
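
The hand-built majority-class baseline above can also be expressed with scikit-learn's `DummyClassifier`. A small sketch on made-up labels (the arrays here are hypothetical, not the burrito data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels; the majority class in training is False
y_train = np.array([False, False, False, True, True])
y_val = np.array([False, True, True, False])

# DummyClassifier ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

train_acc = baseline.score(X_train, y_train)
val_acc = baseline.score(X_val, y_val)
print('Baseline training accuracy', train_acc)
print('Baseline validation accuracy', val_acc)
```

Because it follows the usual `fit`/`score` interface, the baseline can be swapped in wherever a real model would go.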

# Work

In [20]:
#imports 
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler


In [21]:
# One-hot encode to split the Burrito feature into one column per category
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)


In [22]:
X_train_encoded.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Chips,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,1,0,0,0,0,0,6.49,3.0,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0,0,0,0,0,5.45,3.5,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,1,0,0,0,0,4.85,1.5,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,1,0,0,0,5.25,2.0,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,0,0,0,0,1,6.59,4.0,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# Impute missing values with each feature's mean (learned from the training set only)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [24]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [25]:
#Logistic Regression
model = LogisticRegressionCV()
model.fit(X_train_scaled, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
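
The warning above says the solver hit the default `max_iter=100` before converging. A minimal sketch of raising the limit, using synthetic data from `make_classification` in place of the burrito features (the values `max_iter=1000` and `cv=5` are arbitrary choices for illustration, not tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Give the solver more iterations per fit than the default of 100
model = LogisticRegressionCV(max_iter=1000, cv=5)
model.fit(X_scaled, y)

train_accuracy = model.score(X_scaled, y)
print('Training Accuracy', train_accuracy)
```

If the warning persists on the real data even with a larger limit, the scikit-learn pages linked in the warning discuss alternative solvers.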

In [26]:
#Accuracy
print('Training Accuracy', model.score(X_train_scaled, y_train))
print('Validation Accuracy', model.score(X_val_scaled, y_val))

Training Accuracy 0.9155405405405406
Validation Accuracy 0.8
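
For the "get your coefficients" stretch goal, one possible approach is to pair `coef_` with the feature names in a pandas Series. A sketch on synthetic data; the `feature_0` style names are placeholders, not the burrito columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with placeholder feature names
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)

# For a binary target, coef_ has shape (1, n_features), so take row 0
coefficients = pd.Series(model.coef_[0], index=feature_names).sort_values()
print(coefficients)
```

Calling `coefficients.plot.barh()` on the result would give a quick bar chart, assuming matplotlib is installed.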


In [27]:
#Final Answer
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)
print('Test Accuracy', model.score(X_test_scaled,y_test))

Test Accuracy 0.7894736842105263


# These next sections...
Show how I originally went about everything, based on my notes. In this version I used SelectKBest and got a higher validation accuracy (about 0.86, versus 0.80 above), though a slightly lower training accuracy. I got stuck at the end because I didn't know how to transform the test data so that it included the one-hot encoding, and I ran out of time.

# One Hot Encoding/ Imputing/ Feature Selection

In [28]:
from category_encoders import OneHotEncoder
transformer_1 = OneHotEncoder(use_cat_names=True)
transformer_1.fit(X_train)



OneHotEncoder(cols=['Burrito'], drop_invariant=False, handle_missing='value',
              handle_unknown='value', return_df=True, use_cat_names=True,
              verbose=0)

In [29]:
XT_train = transformer_1.transform(X_train)
XT_val = transformer_1.transform(X_val)

In [30]:
XT_train # Here I noticed that there were two kinds of 'x' values. I want all the capital Xs to be lowercase, so I'm going back into my code to fix this

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Chips,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,1,0,0,0,0,0,6.49,3.0,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0,0,0,0,0,5.45,3.5,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,1,0,0,0,0,4.85,1.5,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,1,0,0,0,5.25,2.0,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,0,0,0,0,1,6.59,4.0,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,1,0,0,0,0,0,5.65,3.0,19.5,22.0,0.75,4.0,1.5,2.0,3.0,4.2,4.0,3.0,2.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
297,0,0,0,1,0,0,5.49,3.0,19.0,20.5,0.64,4.5,5.0,2.0,2.0,2.5,3.5,3.0,2.5,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
298,1,0,0,0,0,0,7.75,4.0,20.0,21.0,0.70,3.5,2.5,3.0,3.3,1.4,2.3,2.2,3.3,4.5,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299,0,0,1,0,0,0,7.75,4.0,19.5,21.0,0.68,4.0,4.5,2.0,2.0,3.5,3.5,2.0,2.0,4.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [31]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
XT_train_imputed = imputer.fit_transform(XT_train)
XT_val_imputed = imputer.transform(XT_val)

In [32]:
from sklearn.feature_selection import SelectKBest
# Keep the 9 best features; the default score function, f_classif, suits a classification target
transformer_2 = SelectKBest(k=9)
transformer_2.fit(XT_train_imputed,y_train)

XTT_train_imputed = transformer_2.transform(XT_train_imputed)
XTT_val_imputed = transformer_2.transform(XT_val_imputed)
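
To see which columns `SelectKBest` actually kept, the boolean mask from `get_support()` can be applied to the column names. A sketch on synthetic data (the `feature_0` style names and `k=4` are placeholders for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data wrapped in a DataFrame so the columns have names
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(12)])

# Score each feature against the target and keep the top 4
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask aligned with the input columns
selected_names = X.columns[selector.get_support()]
print(list(selected_names))
```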

# Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(XTT_train_imputed, y_train)
print('Training Accuracy', log_reg.score(XTT_train_imputed, y_train))
print('Validation Accuracy', log_reg.score(XTT_val_imputed, y_val))

Training Accuracy 0.8885135135135135
Validation Accuracy 0.8588235294117647


In [34]:
print('Test Accuracy', log_reg.score(X_test,y_test))

ValueError: ignored
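
The error above comes from scoring the raw `X_test`, which has not been through the encoding, imputing, and feature-selection steps that the model was trained on. One way to keep those steps together is a scikit-learn pipeline. The sketch below is an assumption-laden illustration, not the notebook's method: it uses made-up data (`toy`, with a made-up target), scikit-learn's own `OneHotEncoder` in place of `category_encoders`, and arbitrary values for `k` and `max_iter`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, two numeric columns
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'Burrito': rng.choice(['California', 'Asada', 'Other'], size=n),
    'Cost': rng.normal(7, 1, size=n),
    'Synergy': rng.uniform(1, 5, size=n),
})
toy.loc[::10, 'Cost'] = np.nan      # sprinkle in some missing values
y = toy['Synergy'] > 3.5            # made-up target for the sketch

X_train, X_test = toy.iloc[:90], toy.iloc[90:]
y_train, y_test = y.iloc[:90], y.iloc[90:]

# Each step is fit on the training data only; scoring the test set
# re-applies the same fitted transformations automatically
pipeline = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
        remainder='passthrough',
        sparse_threshold=0,         # always return a dense array
    ),
    SimpleImputer(strategy='mean'),
    SelectKBest(k=3),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print('Test Accuracy', test_accuracy)
```

With the objects already fitted in this notebook, the equivalent manual route would be to pass `X_test` through `transformer_1.transform`, then `imputer.transform`, then `transformer_2.transform`, in that order, and score `log_reg` on the result.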