<a href="https://colab.research.google.com/github/mariokart345/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [7]:
# Convert the ingredient flags: 'x'/'X' become 1 and NaN becomes 0
df.iloc[:, 21:58] = df.iloc[:, 21:58].replace({'x': 1, 'X': 1}).fillna(0)
# Do the same for 'Chips', which also contains 'Yes'/'No' (clean it before splitting)
df['Chips'] = df['Chips'].replace({'x': 1, 'X': 1, 'Yes': 1, 'No': 0}).fillna(0)

In [8]:
# Convert 'Date' from object to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Split the dataframe into train/val/test by review date
train = df[df['Date'] <= '12/31/2016']
val = df[(df['Date'] >= '1/1/2017') & (df['Date'] <= '12/31/2017')]
test = df[df['Date'] >= '1/1/2018']
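
The same split can also be written against the calendar year, which avoids string date literals. This is a minimal sketch on a small made-up frame (`demo` and its values are hypothetical, not rows from the burrito data):

```python
import pandas as pd

# Hypothetical stand-in for the burrito dataframe
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2016-05-01", "2016-12-31", "2017-03-15", "2018-01-01", "2019-07-04"]
    ),
    "Great": [True, False, True, True, False],
})

# Compare on the year rather than on full date strings
year = demo["Date"].dt.year
train_demo = demo[year <= 2016]
val_demo = demo[year == 2017]
test_demo = demo[year >= 2018]

print(len(train_demo), len(val_demo), len(test_demo))
```

With either approach, a row whose date is missing lands in none of the three splits, because comparisons against a missing date evaluate to False.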

In [9]:
# Baseline: always predict the majority class from the training set
from sklearn.metrics import accuracy_score
y_train = train['Great']
y_val = val['Great']
majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)
print(f'Accuracy of Training:{accuracy_score(y_train,y_pred)}')
y_pred = [majority_class]*len(y_val)
print(f'Accuracy of Val:{accuracy_score(y_val, y_pred)}')

Accuracy of Training:0.5906040268456376
Accuracy of Val:0.5529411764705883
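
An equivalent way to get a majority-class baseline is scikit-learn's `DummyClassifier`. The sketch below uses tiny made-up label series (`y_train_demo`, `y_val_demo` are hypothetical stand-ins for `train['Great']` and `val['Great']`), so its numbers are not the ones printed above:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels standing in for train['Great'] and val['Great']
y_train_demo = pd.Series([True, True, True, False, False])
y_val_demo = pd.Series([True, False, False, True])

# Share of each class in training; the largest share is the training baseline
class_shares = y_train_demo.value_counts(normalize=True)
print(class_shares)

# DummyClassifier ignores the features, so a placeholder column is enough
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit([[0]] * len(y_train_demo), y_train_demo)

train_acc = baseline.score([[0]] * len(y_train_demo), y_train_demo)
val_acc = baseline.score([[0]] * len(y_val_demo), y_val_demo)
print(f"Baseline train accuracy: {train_acc:.2f}")
print(f"Baseline val accuracy: {val_acc:.2f}")
```

Note that the majority class is learned from the training labels only and then applied unchanged to the validation labels, which is the same logic as the cell above.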


In [10]:
# Import and set up the encoder, imputer, feature selector, and estimator
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from category_encoders import OneHotEncoder
from sklearn.feature_selection import SelectKBest
encode = OneHotEncoder(use_cat_names=True,cols=['Burrito'])
imputer = SimpleImputer()
SKBest = SelectKBest(k=10)
model = LogisticRegression(solver='lbfgs')



In [11]:
# Drop the target and the columns that cannot be processed
X_train = train.drop(columns=['Great', 'Date', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume'])
X_val = val.drop(columns=['Great', 'Date', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume'])

In [12]:
# Encode, impute, select the 10 best features, then fit the logistic regression.
# Each transformer is fit on the training data only and reused on validation.
X_train_encoded = encode.fit_transform(X_train)
X_val_encoded = encode.transform(X_val)
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)
X_train_skb = SKBest.fit_transform(X_train_imputed,y_train)
X_val_skb = SKBest.transform(X_val_imputed)
model.fit(X_train_skb,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
model.score(X_val_skb,y_val)

0.8470588235294118
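
The impute, select, and fit steps above can be chained with a scikit-learn pipeline, one of the stretch goals. The following is only a sketch on synthetic data: `X_demo`, `y_demo`, and the column names are made up, `StandardScaler` is an extra step that the cells above do not use, and the one-hot encoding step is left out to keep the example free of extra dependencies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: five numeric features, label driven by the first two
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)]
)
y_demo = (X_demo["f0"] + X_demo["f1"]) > 0
X_demo.iloc[::10, 2] = np.nan  # inject some missing values for the imputer

# Each step is fit on the training rows only when pipe.fit is called
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    SelectKBest(f_classif, k=3),
    LogisticRegression(solver="lbfgs"),
)
pipe.fit(X_demo.iloc[:150], y_demo.iloc[:150])

# score() applies the fitted transformers, then reports accuracy
val_accuracy = pipe.score(X_demo.iloc[150:], y_demo.iloc[150:])
print(f"Validation accuracy: {val_accuracy:.3f}")
```

A pipeline keeps the fit/transform bookkeeping in one object, so the validation and test sets go through exactly the same fitted steps with a single call.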

In [15]:
model.intercept_,model.coef_

(array([-28.49244446]),
 array([[ 0.79787398,  0.68938057,  1.46425905,  1.39669262,  1.26741402,
          0.04264891,  0.33105671,  1.69345748,  0.58834215, -1.00109906]]))
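
The coefficient array above is unlabeled because `SelectKBest` returns a plain array. One way to recover the names is the selector's `get_support()` mask. The sketch below uses synthetic data with made-up column names (`feat_0` and so on); in this notebook the analogous expression would presumably be `X_train_encoded.columns[SKBest.get_support()]`, assuming the encoder returned a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the label depends on feat_0 (+) and feat_1 (-)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 6)), columns=[f"feat_{i}" for i in range(6)]
)
noise = rng.normal(scale=0.5, size=300)
y_demo = (2 * X_demo["feat_0"] - X_demo["feat_1"] + noise) > 0

# Select features, then fit on the reduced array
selector = SelectKBest(k=3)
X_sel = selector.fit_transform(X_demo, y_demo)
clf = LogisticRegression(solver="lbfgs").fit(X_sel, y_demo)

# get_support() is a boolean mask over the original columns
selected = X_demo.columns[selector.get_support()]
coefs = pd.Series(clf.coef_[0], index=selected).sort_values()
print(coefs)
```

From there, `coefs.plot.barh()` would give the coefficient plot mentioned in the stretch goals.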

In [18]:
# Prepare the test set with the transformers already fitted on the training data
y_test = test['Great']
X_test = test.drop(columns=['Great', 'Date', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume'])
X_test_encoded = encode.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_skb = SKBest.transform(X_test_imputed)

In [19]:
model.predict(X_test_skb)

array([ True,  True, False,  True, False, False,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True, False,
       False, False, False,  True, False,  True,  True, False, False,
        True,  True,  True, False,  True,  True, False,  True,  True,
        True,  True])
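
The assignment asks for a test accuracy, which the predictions above do not yet provide. In this notebook that would likely be `accuracy_score(y_test, model.predict(X_test_skb))` or, equivalently, `model.score(X_test_skb, y_test)`. The sketch below shows the calculation on made-up labels only (`y_true_demo` and `y_pred_demo` are hypothetical), so the number it prints is not the model's test accuracy:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only
y_true_demo = [True, True, False, True, False]
y_pred_demo = [True, False, False, True, True]

# Accuracy is the share of predictions that match the true labels
test_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(f"Test accuracy: {test_accuracy:.2f}")
```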