Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

# The Burrito column has 132 unique types. Pick the 4 main types, find all
# strings that contain them, and map each to a uniform value.
# Everything else becomes 'Other', leaving 5 unique values.
california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [39]:
df.head(10)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
5,Other,1/28/2016,,,,6.99,4.0,,,,,,3.0,4.0,5.0,3.5,2.5,2.5,2.5,4.0,1.0,,,,,x,x,,x,,x,,,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,False
6,California,1/30/2016,3.0,2.9,,7.19,1.5,,,,,,2.0,3.0,3.0,2.0,2.5,2.5,,2.0,3.0,,,x,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
7,Carnitas,1/30/2016,,,,6.99,4.0,,,,,,2.5,3.0,3.0,2.5,3.0,3.5,,2.5,3.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
8,California,2/1/2016,3.0,3.7,x,9.25,3.5,,,,,,2.0,4.5,4.5,3.5,1.5,3.0,3.5,4.0,2.0,,,x,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
9,Asada,2/6/2016,4.0,4.1,,6.25,3.5,,,,,,2.5,1.5,1.5,3.0,4.5,3.0,1.5,2.0,4.5,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [65]:
# We want those NaN/x columns to be 0 or 1
import numpy as np

# Only convert Chips, Unreliable, NonSD, and the 35 ingredient columns
# (columns[-38:-1] covers Unreliable, NonSD, and the 35 ingredients)
bool_cols = list(df.columns[-38:-1])
bool_cols.append('Chips')

df[bool_cols] = df[bool_cols].replace('x', 1)
df[bool_cols] = df[bool_cols].replace('X', 1)
df[bool_cols] = df[bool_cols].replace(np.nan, 0)
df.head(10)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,0,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
1,California,1/24/2016,3.5,3.3,0,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
2,Carnitas,1/24/2016,,,0,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
3,Asada,1/24/2016,,,0,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
4,California,1/27/2016,4.0,3.8,1,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True
5,Other,1/28/2016,,,0,6.99,4.0,,,,,,3.0,4.0,5.0,3.5,2.5,2.5,2.5,4.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
6,California,1/30/2016,3.0,2.9,0,7.19,1.5,,,,,,2.0,3.0,3.0,2.0,2.5,2.5,,2.0,3.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
7,Carnitas,1/30/2016,,,0,6.99,4.0,,,,,,2.5,3.0,3.0,2.5,3.0,3.5,,2.5,3.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
8,California,2/1/2016,3.0,3.7,1,9.25,3.5,,,,,,2.0,4.5,4.5,3.5,1.5,3.0,3.5,4.0,2.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
9,Asada,2/6/2016,4.0,4.1,0,6.25,3.5,,,,,,2.5,1.5,1.5,3.0,4.5,3.0,1.5,2.0,4.5,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
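
The same 'x' / NaN to 1 / 0 conversion can also be written as one case-insensitive comparison per column, which produces integer columns directly. This is a minimal sketch on made-up data; the `toy` frame and `flag_cols` list are illustrative and not part of the burrito dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics the 'x' / 'X' / NaN flag columns
toy = pd.DataFrame({
    'Chips': ['x', np.nan, 'X'],
    'Beef': [np.nan, 'x', 'x'],
    'Cost': [6.49, 5.45, 4.85],
})
flag_cols = ['Chips', 'Beef']  # hypothetical subset of flag columns

for col in flag_cols:
    # Lowercase the strings, then compare with 'x'.
    # Missing values compare unequal, so they become 0.
    toy[col] = toy[col].str.lower().eq('x').astype(int)

print(toy)
```

Because every flag column ends up as a plain integer column, none of them should be picked up later by an encoder that targets object columns.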


## Assignment Work

### Do train/validate/test split. 

Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [69]:
# Convert to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
df['Date'].describe()

count                     421
unique                    169
top       2016-08-30 00:00:00
freq                       29
first     2011-05-16 00:00:00
last      2026-04-25 00:00:00
Name: Date, dtype: object

In [70]:
# Wait, we have data from when?
cutoff = pd.to_datetime('2020-01-01')
df[df['Date'] >= cutoff]

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
77,California,2026-04-25,,,0,8.0,4.0,,,21.59,,,4.5,5.0,5.0,5.0,4.5,5.0,3.0,5.0,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True


In [0]:
# I'm guessing that's a typo and they meant 2016. I won't be using the date as
# a feature, so I'll leave it alone, but note that this row lands in the test set.

# Let's actually split this
train = df[df['Date'] < pd.to_datetime('2017-01-01')]
test = df[df['Date'] >= pd.to_datetime('2018-01-01')]
val = df[(df['Date'] >= pd.to_datetime('2017-01-01')) & 
         (df['Date'] < pd.to_datetime('2018-01-01'))]

In [113]:
# Check resulting shape and ensure no rows were lost
print('Original shape:', df.shape)
print('Test shape:', test.shape)
print('Val shape:', val.shape)
print('Train shape:', train.shape)
print('Test + Train + Val rows:', len(test)+len(val)+len(train))

Original shape: (421, 59)
Test shape: (38, 59)
Val shape: (85, 59)
Train shape: (298, 59)
Test + Train + Val rows: 421
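
The three date comparisons can also be expressed with the year alone through the `.dt.year` accessor. The sketch below uses a made-up frame (`toy` and its values are illustrative) and repeats the no-rows-lost check. One caveat under this approach: a row with a missing date would fall into none of the three sets, so the check is worth keeping.

```python
import pandas as pd

# Made-up reviews spanning the three periods
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-12-31', '2018-01-01', '2019-03-15']),
    'Great': [True, False, True, False, True, False],
})

year = toy['Date'].dt.year
train_toy = toy[year <= 2016]  # 2016 and earlier
val_toy = toy[year == 2017]    # 2017 only
test_toy = toy[year >= 2018]   # 2018 and later

# Every row should land in exactly one split
assert len(train_toy) + len(val_toy) + len(test_toy) == len(toy)
print(len(train_toy), len(val_toy), len(test_toy))
```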


### Begin with baselines for classification.

In [114]:
# Find the majority value
target = 'Great'
majority_value = train[target].mode()[0]
print('The majority value is', majority_value)

The majority value is False


In [115]:
# Find baseline scores
from sklearn.metrics import accuracy_score

y_pred = [majority_value] * len(train)
train_acc = accuracy_score(train[target], y_pred)
print(f'The training majority baseline is {train_acc*100:.02f}%')

y_pred = [majority_value] * len(val)
val_acc = accuracy_score(val[target], y_pred)
print(f'The validation majority baseline is {val_acc*100:.02f}%')

The training majority baseline is 59.06%
The validation majority baseline is 55.29%
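
scikit-learn also ships a majority-class baseline, `DummyClassifier(strategy='most_frequent')`, which exposes the same `fit` / `score` interface as a real model. A small sketch on made-up labels (the `*_toy` arrays are illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: False is the majority class in the training set
y_train_toy = np.array([False, False, False, True, True])
y_val_toy = np.array([False, True, True, False])

# The dummy model ignores the features, so placeholders are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_baseline = baseline.score(X_train_toy, y_train_toy)
val_baseline = baseline.score(X_val_toy, y_val_toy)
print(f'Train baseline: {train_baseline:.2f}, validation baseline: {val_baseline:.2f}')
```

As in the manual version, the majority class is learned from the training labels only and then applied to the validation labels.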


### Use scikit-learn for logistic regression.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

In [0]:
# Drop the date, I am choosing to ignore that column when predicting the quality
train = train.drop('Date', axis=1)
val = val.drop('Date', axis=1)
test = test.drop('Date', axis=1)

In [0]:
# One-hot encode the burrito types (Chips also gets encoded, into
# Chips_0.0 / Chips_1.0); now all columns are numeric
encoder = ce.OneHotEncoder(use_cat_names=True)
train = encoder.fit_transform(train)
val = encoder.transform(val)
test = encoder.transform(test)

In [0]:
# Put in a function so I can experiment with it

# Defined outside the function so the fitted objects can be reused on the test data
imputer = SimpleImputer()
scaler = StandardScaler()
model = LogisticRegression()

def run_logistic_regression(train, val, features, target, 
                            print_results=True):
  """
  train: the training data to use
  val: the validation data to use
  features: list, the features to use
  target: str, the target column to use
  print_results: bool, whether or not to print out the accuracy

  returns the validation accuracy (the metric used to compare feature sets)
  """
  X_train = train[features]
  y_train = train[target]

  X_val = val[features]
  y_val = val[target]

  # Impute the data
  X_train = imputer.fit_transform(X_train)
  X_val = imputer.transform(X_val)

  # Scale data
  X_train = scaler.fit_transform(X_train)
  X_val = scaler.transform(X_val)
  
  # Fit the model
  model.fit(X_train, y_train)

  val_acc = model.score(X_val, y_val)
  if print_results:
    train_acc = model.score(X_train, y_train)
    print(f'Training accuracy: {train_acc*100:.02f}%')
    print(f'Validation accuracy: {val_acc*100:.02f}%')

  return val_acc
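
One of the stretch goals is scikit-learn pipelines, and the impute, scale, fit sequence above maps onto one directly. The sketch below is one possible variant, not the approach used in this notebook: `fit_and_score` is a hypothetical helper, and the data is made up. It builds a fresh pipeline on every call and returns it, so nothing depends on shared global objects.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_and_score(X_train, y_train, X_val, y_val):
    """Fit a fresh impute -> scale -> logistic regression pipeline.

    Returns the fitted pipeline and its validation accuracy.
    """
    pipeline = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
    pipeline.fit(X_train, y_train)
    return pipeline, pipeline.score(X_val, y_val)

# Made-up, cleanly separated data with a couple of missing values
X_train_toy = np.array([[1.0, 2.0], [1.5, np.nan], [2.0, 1.0],
                        [4.0, 5.0], [4.5, 4.0], [np.nan, 4.5]])
y_train_toy = np.array([False, False, False, True, True, True])
X_val_toy = np.array([[1.2, 1.5], [4.2, 4.8]])
y_val_toy = np.array([False, True])

pipeline, val_acc = fit_and_score(X_train_toy, y_train_toy, X_val_toy, y_val_toy)
print(f'Validation accuracy: {val_acc:.2f}')
```

Because the returned pipeline carries its own fitted imputer and scaler, scoring a held-out set later is a single `pipeline.score(X_test, y_test)` call on the raw features.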

### Get your model's validation accuracy. 

(Multiple times if you try multiple iterations.)

In [121]:
# Remind myself what I'm working with
train.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Yelp,Google,Chips_0.0,Chips_1.0,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,1,0,0,0,0,3.5,4.2,1,0,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
1,1,0,0,0,0,3.5,3.3,1,0,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
2,0,1,0,0,0,,,1,0,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
3,0,0,1,0,0,,,1,0,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
4,1,0,0,0,0,4.0,3.8,0,1,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True


In [132]:
# I'll be using this syntax to select most column names
train.columns[15:25]

Index(['Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
       'Uniformity', 'Salsa', 'Synergy', 'Wrap'],
      dtype='object')

In [244]:
# Run regression using the main 10 ratings the blog (source) uses
features = list(train.columns[15:25])
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 89.60%
Validation accuracy: 88.24%


In [249]:
# Run test with just the 35 ingredient columns
features = list(train.columns[-36:-1])
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 68.12%
Validation accuracy: 52.94%


In [250]:
# Run test with the 10 metrics, 35 ingredients, Unreliable and NonSD
features = list(train.columns[-48:-1])
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 91.95%
Validation accuracy: 78.82%


In [126]:
# Just burrito type?
features = list(train.columns[:5])
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 59.06%
Validation accuracy: 55.29%


In [136]:
# ALL THE COLUMNS
all_features = list(train.columns[:-1]) # still gotta leave out that target
run_logistic_regression(train, val, all_features, 'Great');

Training accuracy: 92.95%
Validation accuracy: 76.47%


In [146]:
# What about finding the most common ingredients?
ingredient_cols = list(train.columns[-36:-1])
train[ingredient_cols].sum().sort_values(ascending=False)[:15]

Beef          168.0
Cheese        149.0
Pico          143.0
Guac          139.0
Fries         119.0
Sour cream     85.0
Pork           43.0
Sauce          37.0
Rice           33.0
Beans          32.0
Chicken        20.0
Shrimp         20.0
Onion          17.0
Cilantro       15.0
Avocado        13.0
dtype: float64

In [148]:
# Top 10 ingredients
top_ingredients = ['Beef', 'Cheese', 'Pico', 'Guac', 'Fries', 
                   'Sour cream', 'Pork', 'Sauce', 'Rice', 'Beans']
run_logistic_regression(train, val, top_ingredients, 'Great');

Training accuracy: 63.09%
Validation accuracy: 54.12%


In [153]:
# Top ingredients plus the 10 review metrics
features = top_ingredients + list(train.columns[15:25])
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 90.27%
Validation accuracy: 82.35%


In [240]:
# The features most strongly correlated with 'Great' (row 0 is 'Great' itself)
most_correlated = (train.corr()['Great'].abs().sort_values(ascending=False)
                                                .reset_index()['index'][:11])
most_correlated

0            Great
1          Synergy
2         Fillings
3             Meat
4     Meat:filling
5            Salsa
6         Tortilla
7       Uniformity
8             Temp
9       Unreliable
10            Yelp
Name: index, dtype: object

In [241]:
# Basically SelectKBest here...
results = []
most_correlated = (train.corr()['Great'].abs().sort_values(ascending=False)
                                                .reset_index()['index'])
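# most_correlated[0] is 'Great' itself, so there are len(most_correlated) - 1 features
for k in range(1, len(most_correlated)):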
  features = most_correlated[1:k+1]
  acc = run_logistic_regression(train, val, features, 'Great', 
                                print_results=False)
  results.append([k, acc])

import plotly.express as ex
results = np.array(results)
ex.bar(x=results.T[0], y=results.T[1], 
       labels={'y':'validation accuracy', 'x':'k features'})

In [217]:
# I'm a little surprised my highest score was with only 4, but I'll take it
most_correlated[1:5]

1         Synergy
2        Fillings
3            Meat
4    Meat:filling
Name: index, dtype: object

In [231]:
# That top result
features = ['Synergy', 'Fillings', 'Meat', 'Meat:filling']
run_logistic_regression(train, val, features, 'Great');

Training accuracy: 85.57%
Validation accuracy: 91.76%
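
Another stretch goal is getting the coefficients. With standardized inputs, pairing `coef_` with the feature names in a pandas Series gives a roughly comparable ranking. The sketch below uses synthetic data with hypothetical feature names (`strong`, `weak`, `negative`) rather than the burrito columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ['strong', 'weak', 'negative']  # hypothetical names

# Synthetic data generated from known effects: +2.0, +0.3, -1.0
rng = np.random.default_rng(42)
n = 300
X_toy = rng.normal(size=(n, 3))
logits = 2.0 * X_toy[:, 0] + 0.3 * X_toy[:, 1] - 1.0 * X_toy[:, 2]
y_toy = logits + rng.logistic(size=n) > 0

# Scale first so the coefficient sizes are on a common footing
X_scaled = StandardScaler().fit_transform(X_toy)
toy_model = LogisticRegression().fit(X_scaled, y_toy)

# coef_ has shape (1, n_features) for a binary target
coefficients = pd.Series(toy_model.coef_[0], index=feature_names).sort_values()
print(coefficients)
```

Calling `coefficients.plot.barh()` on the resulting Series would give a quick bar chart if matplotlib is installed.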


### Get your model's test accuracy. 
(One time, at the end.)

In [232]:
# IMPORTANT - run this only right after fitting on this same feature set.
# The imputer, scaler, and model are shared globals that hold whatever was
# fit last, so fitting on different features first will break this cell.

# Same procedure used in the function
features = ['Synergy', 'Fillings', 'Meat', 'Meat:filling']
X_test = test[features]
y_test = test['Great']

# Impute and scale the data
X_test = imputer.transform(X_test)
X_test = scaler.transform(X_test)

# Get accuracy
test_acc = model.score(X_test, y_test)
print(f'Test accuracy: {test_acc*100:.02f}%')

Test accuracy: 73.68%


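That test accuracy was so low that I went back and double-checked everything, and as far as I can tell the procedure is right: the model was fit only on the training set. Two things probably explain the gap. First, I picked the 4-feature set because it had the highest validation accuracy out of the many feature sets I tried, so the 91.76% validation score is optimistically biased. Second, the test set has only 38 rows, so each single prediction moves the test accuracy by about 2.6 percentage points.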
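
To get a rough sense of how noisy an accuracy from 38 rows is, here is a sketch of a normal-approximation interval. The correct-prediction counts are inferred from the reported percentages (28 of 38 is 73.68%, 78 of 85 is 91.76%). `accuracy_interval` is a hypothetical helper, and the approximation is crude at these sample sizes.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation interval for an accuracy estimate.

    Returns (accuracy, lower bound, upper bound), clipped to [0, 1].
    """
    p = n_correct / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Counts inferred from the reported accuracies and split sizes
for name, correct, total in [('test', 28, 38), ('validation', 78, 85)]:
    p, low, high = accuracy_interval(correct, total)
    print(f'{name}: {p:.2%} (approx. 95% interval {low:.2%} to {high:.2%})')
```

The test interval spans a little under 30 percentage points, so a single test score from this split is a loose estimate.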