Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [7]:
#Import dependencies

from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

In [8]:
# Look at data

print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [9]:
df['Great'].value_counts()

False    239
True     182
Name: Great, dtype: int64

In [10]:
# Get mean baseline mae

baseline = df['Great'].value_counts(normalize = True)[0]
df['Baseline'] = baseline

print(df['Great'].value_counts(normalize = True), '\n\n',
      df['Baseline'].value_counts())


False    0.567696
True     0.432304
Name: Great, dtype: float64 

 0.567696    421
Name: Baseline, dtype: int64


In [11]:
# Drop the temporary 'Baseline' column
df.drop('Baseline', axis = 1, inplace = True)

In [12]:
# Convert 'Date' column to date object and give each
# attribute it's own column

df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format = True)

df['Year'] = df.Date.dt.year
df['Month'] = df.Date.dt.month
df['Day'] = df.Date.dt.day

In [13]:
# Get null value totals for each column and drop
# any column missing more than 10 percent of it's
# data.

null_cols = df.isna().sum().index
null_vals = df.isna().sum()

for i in null_cols:

    if null_vals[i]/df.shape[0] > 0.10:
        df.drop(i, axis = 1, inplace = True)

In [14]:
# Check remaining null value totals in each column

df.isna().sum()

Burrito          0
Date             0
Cost             7
Hunger           3
Tortilla         0
Temp            20
Meat            14
Fillings         3
Meat:filling     9
Uniformity       2
Salsa           25
Synergy          2
Wrap             3
Great            0
Year             0
Month            0
Day              0
dtype: int64

In [15]:
# Perform train/validate/test split

train = df[df['Date'].dt.year < 2017].drop('Date', axis = 1)
validate = df[df['Date'].dt.year == 2017].drop('Date', axis = 1)
test = df[df['Date'].dt.year > 2017].drop('Date', axis = 1)


In [16]:
# Create training feature matrix and target vector 

X_train = train.drop('Great', axis = 1)
y_train = train['Great']

X_train.shape, y_train.shape

((298, 15), (298,))

In [17]:
encoder = OrdinalEncoder()
imputer = SimpleImputer()

pipe = [encoder, imputer]


model_1_params = {'fit_intercept' : True,
                   'random_state' : 42,
                         'n_jobs' : -1}

model_1 = LogisticRegression(**model_1_params)

In [18]:
def pre_proc(df, stage, pipe):
    '''
    df    : Pandas DataFrame of the training, validation,
            or test data.

    stage : One of two stages of pre-processing data.
            Valid values are 'train' or 'test' and value
            passed must be a string.

    pipe  : List of transformers to be used within
            this function at each stage of processing.


    This function is written specifically for the burrito
    dataset and assumes the 'Date' column has been dropped,
    whether or not the date information is transferred to
    columns in numeric format.

    Returned Values

    X       :   Matrix of processed features ready for the model.
    
    y       :   Target vector ready for the model.

    pipe    :   List of trained transformer objects.
    '''

# Get the current stage for proper routing.
    if stage == 'train':
        # Create feature matrix and target vector
        X = df.drop('Great', axis = 1)
        y = df['Great']

        # Encode the 'Burrito' column using OrdinalEncoder
        X['Burrito'] = pipe[0].fit_transform(X['Burrito'].values.reshape(-1,1))

        # Impute missing values with SimpleImputer
        X = pipe[1].fit_transform(X)

    elif stage == 'test':
        # Create feature matrix and target vector
        X = df.drop('Great', axis = 1)
        y = df['Great']

        # Encode the 'Burrito' column using OrdinalEncoder
        X['Burrito'] = pipe[0].transform(X['Burrito'].values.reshape(-1,1))

        # Impute missing values with SimpleImputer
        X = pipe[1].fit_transform(X)

    print(stage,'\n\n')
    
    return X, y, pipe

In [19]:
def scoring(X, y, model):
    '''
        X : Feature matrix as a datatype valid for both the model and 
              sklearn's cross_val_score.

        y : Target vector as a datatype valid for both the model and 
              sklearn's cross_val_score.

    model : Trained LogisticRegression model from sklearn.

    This function takes the feature matrix and target vector for scoring
    using the built-in accuracy scorer in the LogisticRegression
    model and by using cross_val_score from sklearn.metrics. It then 
    prints those values to the output but does not return anything.
    '''
    print('Accuracy Score: ', model.score(X, y))
    print('Cross Val Score: ', cross_val_score(model, X, y).mean())

In [20]:
# Get training feature matrix and target vector
X_train, y_train, pipe = pre_proc(train, 'train', pipe)

train 




In [21]:
# Train the model
model_1.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2', random_state=42,
                   solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [22]:
# Get the training scores for the training data
scoring(X_train, y_train, model_1)

Accuracy Score:  0.889261744966443
Cross Val Score:  0.8692090395480225


In [23]:
# Get validation feature matrix and target vector
X_val, y_val, pipe = pre_proc(validate, 'test', pipe)

test 




In [24]:
# Predict values on the validation data
y_pred_1_val = model_1.predict(X_val)

In [25]:
# Get the scores for the model on the validation data
scoring(X_val, y_val, model_1)

Accuracy Score:  0.8352941176470589
Cross Val Score:  0.7647058823529411


In [26]:
# Get the test feature matrix and target vector
X_test, y_test, pipe = pre_proc(test, 'test', pipe)

test 




In [27]:
# Make predictions using the test data
y_pred_1_test = model_1.predict(X_test)

In [28]:
# Score the model on the testing data
scoring(X_test, y_test, model_1)

Accuracy Score:  0.7631578947368421
Cross Val Score:  0.7642857142857142
