Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment ðŸŒ¯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

## Train / Validate / Test Split

In [113]:
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['Date'].dt.year.value_counts()

2016    296
2017     85
2018     27
2019     10
2026      1
2015      1
2011      1
Name: Date, dtype: int64

In [114]:
# I AM GOING TO ASSUME THIS IS A TYPO FOR 2016
df[(df['Date'].dt.year == 2026)]

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
77,California,2026-04-25,,,,8.0,4.0,,,21.59,,,4.5,5.0,5.0,5.0,4.5,5.0,3.0,5.0,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
df.loc[77, 'Date'] = pd.Timestamp('2016-04-25')

In [0]:
train = df[(df['Date'].dt.year <= 2016)]
validate = df[(df['Date'].dt.year == 2017)]
test = df[(df['Date'].dt.year >= 2018)]

## Baselines

In [0]:
target = 'Great'
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

In [118]:
# 41% of 'train' burrito's rated as 'Great'
y_train.value_counts(normalize=True)

False    0.588629
True     0.411371
Name: Great, dtype: float64

In [119]:
# ALTERNATE IMPLEMENTATION
from sklearn.metrics import accuracy_score
majority = y_train.mode()[0]
y_pred = [majority] * len(y_train)
accuracy_score(y_train, y_pred)

0.5886287625418061

## Logistic Regression

In [0]:
features = ['Yelp', 'Google', 'Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat', 
            'Fillings']
X_train = train[features]
X_validate = validate[features]
X_test = test[features]

In [0]:
from sklearn.impute import SimpleImputer

In [0]:
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_validate_imputed = imputer.transform(X_validate)

In [0]:
from sklearn.linear_model import LogisticRegression

In [126]:
log_model = LogisticRegression()
log_model.fit(X_train_imputed, y_train)
accuracy = log_model.score(X_validate_imputed, y_validate)
print(f'Validation Accuracy: {accuracy*100:.2f}%')

Validation Accuracy: 85.88%


In [125]:
X_test_imputed = imputer.transform(X_test)
log_model.fit(X_test_imputed, y_test)
test_accuracy = log_model.score(X_test_imputed, y_test)
print(f'Test Accuracy: {test_accuracy*100:.2f}%')

Test Accuracy: 78.38%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [127]:
# I THINK WE NEED TO SCALE THE DATA HERE
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test_imputed)
log_model.fit(X_test_scaled, y_test)
test_accuracy = log_model.score(X_test_scaled, y_test)
print(f'Test Accuracy: {test_accuracy*100:.2f}%')

Test Accuracy: 75.68%
