
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

In [62]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Logistic Regression

Do you like burritos? 🌯 You're in luck then, because in this project you'll create a model to predict whether a burrito is `'Great'`.

The dataset for this assignment comes from [Scott Cole](https://srcole.github.io/100burritos/), a San Diego-based data scientist and burrito enthusiast. 

## Directions

The tasks for this project are the following:

- **Task 1:** Import the `csv` file using the `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify the `wrangle` function.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build `model_logr` using a pipeline that includes three transformers and a `LogisticRegression` predictor. Train the model on `X_train` and `y_train`.
- **Task 7:** Calculate the training and test accuracy scores for your model.
- **Task 8:** Create a horizontal bar chart showing the 10 most influential features for your model.
- **Task 9:** Demonstrate and explain the differences between `model_logr.predict()` and `model_logr.predict_proba()`.

**Note** 

You should limit yourself to the following libraries:

- `category_encoders`
- `matplotlib`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [63]:
import category_encoders as ce
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [78]:
def wrangle(filepath):
    # Import w/ DateTimeIndex
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')
    
    # Drop unrated burritos
    df.dropna(subset=['overall'], inplace=True)
    
    # Derive binary classification target:
    # We define a 'Great' burrito as having an
    # overall rating of 4 or higher, on a 5 point scale
    df['Great'] = (df['overall'] >= 4).astype(int)
    
    # Drop high cardinality categoricals
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL', 'Neighborhood'])
    
    # Drop columns to prevent "leakage"
    df = df.drop(columns=['Rec', 'overall'])

    # Collect the low-cardinality non-numeric columns to encode
    # (the 'x'/'X' ingredient flags, plus 'Chips')
    binary_columns = []
    for col in df.select_dtypes(exclude=np.number).columns.to_list():
        if len(df[col].value_counts()) < 10:
            binary_columns.append(col)

    # 'Chips' also contains a 'No'; treat it as missing so it encodes to 0
    df.loc[df['Chips'] == "No", "Chips"] = float('nan')

    # Encode the binary columns: NaN -> 0, anything else ('x', 'X', 'Yes') -> 1
    for col in binary_columns:
        df[col] = [0 if pd.isna(val) else 1 for val in df[col]]

    # Create four new features from 'Burrito'
    # (exact match on the name after stripping whitespace)
    four_new_feat = {'California':'california', 'Carne asada':'asada', 'Surf & Turf':'surf', 'Carnitas':'carnitas'}
    for key, value in four_new_feat.items():
        df[value] = [1 if name.strip() == key else 0 for name in df['Burrito']]
    
    return df

filepath = DATA_PATH + 'burritos/burritos.csv'

**Task 1:** Use the above `wrangle` function to import the `burritos.csv` file into a DataFrame named `df`.

In [79]:
filepath = DATA_PATH + 'burritos/burritos.csv'
df = wrangle(filepath)

In [66]:
df.describe(exclude='number') #before wrangle, plain df

Unnamed: 0,Burrito,Chips,Reviewer,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
count,421,26,420,33,7,179,158,154,159,127,92,51,21,21,6,36,35,11,7,7,1,8,38,7,15,17,4,7,2,4,4,1,5,3,3,2,13,3,1
unique,132,4,106,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,1
top,California,x,Scott,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x
freq,101,21,147,33,5,137,127,114,128,102,67,36,20,17,4,26,27,9,5,4,1,6,33,6,9,9,3,5,2,4,4,1,5,3,3,2,13,2,1


In [67]:
df.describe() #before wrangle, plain df

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso,Great
count,87.0,87.0,414.0,418.0,22.0,22.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,0.0,421.0
mean,3.887356,4.167816,7.067343,3.495335,546.181818,0.675277,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,,0.432304
std,0.475396,0.373698,1.506742,0.812069,144.445619,0.080468,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,,0.495985
min,2.5,2.9,2.99,0.5,350.0,0.56,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,,0.0
25%,3.5,4.0,6.25,3.0,450.0,0.619485,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,,0.0
50%,4.0,4.2,6.99,3.5,540.0,0.658099,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,,0.0
75%,4.0,4.4,7.88,4.0,595.0,0.721726,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,,1.0
max,4.5,5.0,25.0,5.0,925.0,0.865672,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,,1.0


In [68]:
df.info() #before wrangle, plain df

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421 entries, 2016-01-18 to 2019-08-27
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Yelp            87 non-null     float64
 2   Google          87 non-null     float64
 3   Chips           26 non-null     object 
 4   Cost            414 non-null    float64
 5   Hunger          418 non-null    float64
 6   Mass (g)        22 non-null     float64
 7   Density (g/mL)  22 non-null     float64
 8   Length          283 non-null    float64
 9   Circum          281 non-null    float64
 10  Volume          281 non-null    float64
 11  Tortilla        421 non-null    float64
 12  Temp            401 non-null    float64
 13  Meat            407 non-null    float64
 14  Fillings        418 non-null    float64
 15  Meat:filling    412 non-null    float64
 16  Uniformity      419 non-null    float64
 17  Salsa           

During your exploratory data analysis, note that there are several columns whose data type is `object` but that seem to be a binary encoding. For example, `df['Beef'].head()` returns:

```
0      x
1      x
2    NaN
3      x
4      x
Name: Beef, dtype: object
```

**Task 2:** Change the `wrangle` function so that these columns are properly encoded as `0`s and `1`s. Be sure your code handles upper- and lowercase `X`s, as well as `NaN`s.
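
One way to approach this encoding, sketched on a made-up frame (the values below are illustrative, not taken from `burritos.csv`): strip and lowercase the strings, then mark presence as `1` and everything else, including `NaN`, as `0`. Whether `'Yes'`/`'No'` entries such as those in `Chips` should count is a judgment call; this sketch assumes `'yes'` means present. The helper name `encode_binary` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the ingredient columns (hypothetical values)
toy = pd.DataFrame({
    'Beef': ['x', 'X', np.nan, 'x'],
    'Chips': ['x', 'No', 'Yes', np.nan],
})

def encode_binary(col):
    """Return 1 where the cell marks presence ('x', 'X', 'yes'), else 0."""
    cleaned = col.str.strip().str.lower()
    # NaN is not in the list, so it becomes False and then 0
    return cleaned.isin(['x', 'yes']).astype(int)

encoded = toy.apply(encode_binary)
print(encoded)
```

Normalizing case first means `x` and `X` never need separate handling.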

In [69]:
#AFTER ORIGINAL WRANGLE 

In [70]:
"""
binary_columns = []
for i in df.select_dtypes(exclude=np.number).columns.to_list():
  if len(df[i].value_counts()) < 10:
    binary_columns.append(i)
    print (df[i].value_counts())
"""

x      21
X       3
No      1
Yes     1
Name: Chips, dtype: int64
x    33
Name: Unreliable, dtype: int64
x    5
X    2
Name: NonSD, dtype: int64
x    137
X     42
Name: Beef, dtype: int64
x    127
X     31
Name: Pico, dtype: int64
x    114
X     40
Name: Guac, dtype: int64
x    128
X     31
Name: Cheese, dtype: int64
x    102
X     25
Name: Fries, dtype: int64
x    67
X    25
Name: Sour cream, dtype: int64
x    36
X    15
Name: Pork, dtype: int64
x    20
X     1
Name: Chicken, dtype: int64
x    17
X     4
Name: Shrimp, dtype: int64
x    4
X    2
Name: Fish, dtype: int64
x    26
X    10
Name: Rice, dtype: int64
x    27
X     8
Name: Beans, dtype: int64
x    9
X    2
Name: Lettuce, dtype: int64
x    5
X    2
Name: Tomato, dtype: int64
x    4
X    3
Name: Bell peper, dtype: int64
x    1
Name: Carrots, dtype: int64
x    6
X    2
Name: Cabbage, dtype: int64
x    33
X     5
Name: Sauce, dtype: int64
x    6
X    1
Name: Salsa.1, dtype: int64
x    9
X    6
Name: Cilantro, dtype: int64
x    9
X

In [71]:
#df.loc[df['Chips'] == "No", "Chips"] = float('nan')

In [72]:
#encode the binary_columns 
"""
for i in df.select_dtypes(exclude=np.number).columns.to_list():
  if len(df[i].value_counts()) < 10:
    df[i] = [0 if pd.isna(j) else 1 for j in df[i]]
"""

In [73]:
"""
for i in binary_columns:
  print (df[i].value_counts())
  print (df[i].value_counts().sum()) #all 421
"""

0    396
1     25
Name: Chips, dtype: int64
421
0    388
1     33
Name: Unreliable, dtype: int64
421
0    414
1      7
Name: NonSD, dtype: int64
421
0    242
1    179
Name: Beef, dtype: int64
421
0    263
1    158
Name: Pico, dtype: int64
421
0    267
1    154
Name: Guac, dtype: int64
421
0    262
1    159
Name: Cheese, dtype: int64
421
0    294
1    127
Name: Fries, dtype: int64
421
0    329
1     92
Name: Sour cream, dtype: int64
421
0    370
1     51
Name: Pork, dtype: int64
421
0    400
1     21
Name: Chicken, dtype: int64
421
0    400
1     21
Name: Shrimp, dtype: int64
421
0    415
1      6
Name: Fish, dtype: int64
421
0    385
1     36
Name: Rice, dtype: int64
421
0    386
1     35
Name: Beans, dtype: int64
421
0    410
1     11
Name: Lettuce, dtype: int64
421
0    414
1      7
Name: Tomato, dtype: int64
421
0    414
1      7
Name: Bell peper, dtype: int64
421
0    420
1      1
Name: Carrots, dtype: int64
421
0    413
1      8
Name: Cabbage, dtype: int64
421
0    383
1     38
Na

In [74]:
df['Burrito'].value_counts()

California                  101
Carne asada                  29
California                   26
Carnitas                     23
Surf & Turf                  14
                           ... 
Colimas burrito               1
Veg Out                       1
California (only cheese)      1
Shrimp california             1
California Chipotle           1
Name: Burrito, Length: 132, dtype: int64

In [75]:
# Conduct your exploratory data analysis here
# And modify the `wrangle` function above.
"""
four_new_feat = {'California':'california', 'Carne asada':'asada', 'Surf & Turf':'surf', 'Carnitas':'carnitas'}
for key, value in four_new_feat.items():
  df[value] = [1 if i.strip() == key else 0 for i in df['Burrito']]
"""

In [80]:
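# List the new columns explicitly; `four_new_feat` only exists inside `wrangle`
df[['california', 'asada', 'surf', 'carnitas']].head()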

Unnamed: 0_level_0,california,asada,surf,carnitas
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-01-18,1,0,0,0
2016-01-24,1,0,0,0
2016-01-24,0,0,0,1
2016-01-24,0,1,0,0
2016-01-27,1,0,0,0


If you explore the `'Burrito'` column of `df`, you'll notice that it's a high-cardinality categorical feature. You'll also notice that there's a lot of overlap between the categories. 

**Stretch Goal:** Change the `wrangle` function above so that it engineers four new features: `'california'`, `'asada'`, `'surf'`, and `'carnitas'`. Each row should have a `1` or `0` based on the text information in the `'Burrito'` column. For example, here's how the first 5 rows of the dataset would look.

| **Burrito** | **california** | **asada** | **surf** | **carnitas** |
| :---------- | :------------: | :-------: | :------: | :----------: |
| California  |       1        |     0     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |
|  Carnitas   |       0        |     0     |    0     |      1       |
| Carne asada |       0        |     1     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |

**Note:** Be sure to also drop the `'Burrito'` column once you've engineered your new features.
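
A minimal sketch of the text-matching idea, assuming a case-insensitive substring match is what is wanted. The keyword mapping mirrors the four feature names above, and the toy rows are invented. Unlike an exact comparison, substring matching also flags names such as `'Shrimp california'`.

```python
import pandas as pd

# Invented burrito names, including odd casing and trailing whitespace
toy = pd.DataFrame({'Burrito': ['California ', 'Carnitas', 'carne asada',
                                'Surf & Turf', 'Shrimp california']})

# New column name -> keyword to search for (keywords are an assumption)
keywords = {'california': 'california', 'asada': 'asada',
            'surf': 'surf', 'carnitas': 'carnitas'}

names = toy['Burrito'].str.lower()
for col, word in keywords.items():
    # regex=False treats the keyword as plain text
    toy[col] = names.str.contains(word, regex=False).astype(int)

# Drop the original high-cardinality column once the flags exist
toy = toy.drop(columns='Burrito')
print(toy)
```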

In [82]:
df.describe()

Unnamed: 0,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great,california,asada,surf,carnitas
count,87.0,87.0,421.0,414.0,418.0,22.0,22.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,0.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0
mean,3.887356,4.167816,0.059382,7.067343,3.495335,546.181818,0.675277,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,0.078385,0.016627,0.425178,0.375297,0.365796,0.377672,0.301663,0.218527,0.12114,0.049881,0.049881,0.014252,0.085511,0.083135,0.026128,0.016627,0.016627,0.002375,0.019002,0.090261,0.016627,0.035629,0.04038,0.009501,0.016627,0.004751,0.009501,0.009501,0.002375,,0.011876,0.007126,0.007126,0.004751,0.030879,0.007126,0.002375,0.432304,0.301663,0.068884,0.033254,0.057007
std,0.475396,0.373698,0.23662,1.506742,0.812069,144.445619,0.080468,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,0.269096,0.128022,0.494958,0.484776,0.482226,0.485382,0.459526,0.413739,0.326678,0.217959,0.217959,0.118668,0.279973,0.276415,0.159706,0.128022,0.128022,0.048737,0.136696,0.286897,0.128022,0.185585,0.197083,0.097125,0.128022,0.068842,0.097125,0.097125,0.048737,,0.108459,0.084214,0.084214,0.068842,0.173195,0.084214,0.048737,0.495985,0.459526,0.253557,0.179513,0.232132
min,2.5,2.9,0.0,2.99,0.5,350.0,0.56,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.5,4.0,0.0,6.25,3.0,450.0,0.619485,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.2,0.0,6.99,3.5,540.0,0.658099,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.4,0.0,7.88,4.0,595.0,0.721726,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
max,4.5,5.0,1.0,25.0,5.0,925.0,0.865672,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [83]:
df.describe(exclude='number')

Unnamed: 0,Burrito,Reviewer
count,421,420
unique,132,106
top,California,Scott
freq,101,147


# II. Split Data

**Task 3:** Split your dataset into the feature matrix `X` and the target vector `y`. You want to predict `'Great'`.

In [93]:
# All numeric columns except the target are used as features
features = df.select_dtypes(include=np.number).columns.to_list()
features.remove('Great')

target = 'Great'

X = df[features]
y = df[target]

**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from 2016 through 2017. 
- Your test set should include data from 2018 and later.
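
Because the index is a `DatetimeIndex`, a split like this can also be written with a single date cutoff. The sketch below uses invented dates and values; the cutoff of `'2018-01-01'` follows from the two bullets above. Note that any rows dated before 2016 would also land in the training set with this form.

```python
import pandas as pd

# Toy frame with a DatetimeIndex (dates and values are made up)
idx = pd.to_datetime(['2016-03-01', '2017-12-31', '2018-01-01', '2019-08-27'])
X_toy = pd.DataFrame({'Cost': [6.5, 7.0, 7.5, 8.0]}, index=idx)
y_toy = pd.Series([0, 1, 1, 0], index=idx, name='Great')

cutoff = pd.Timestamp('2018-01-01')
train_mask = X_toy.index < cutoff   # everything before 2018

X_train_toy, y_train_toy = X_toy[train_mask], y_toy[train_mask]
X_test_toy, y_test_toy = X_toy[~train_mask], y_toy[~train_mask]

print(len(X_train_toy), len(X_test_toy))
```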

In [98]:
# Train on 2016-2017, test on 2018 and later.
# (A single month can be selected the same way, e.g.
#  y[(y.index.month == 4) & (y.index.year == 2019)].)
train_mask = X.index.year.isin([2016, 2017])
test_mask = X.index.year > 2017
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_mask], y[test_mask]

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Determine the majority class in `y_train` and the percentage of your training observations it represents.
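
The accuracy of always guessing the majority class equals that class's share of the observations, so `value_counts(normalize=True)` gives the baseline directly. A sketch on an invented target vector:

```python
import pandas as pd

# Invented training target: six 0s and four 1s
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name='Great')

shares = y_toy.value_counts(normalize=True)  # sorted by share, largest first
majority_class = shares.index[0]
baseline = shares.iloc[0]

print('Majority class:', majority_class)
print('Baseline accuracy:', baseline)
```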

In [100]:
from sklearn.metrics import accuracy_score

# .mode() returns a Series, so take its first entry to get the majority class
y_train_mode = y_train.mode()[0]
baseline_acc = accuracy_score(y_train, [y_train_mode] * len(y_train))
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5826771653543307


# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_logr`, and fit it to your training data. Your pipeline should include:

- a `OneHotEncoder` transformer for categorical features, 
- a `SimpleImputer` transformer to deal with missing values, 
- a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) transformer (which often improves performance in a logistic regression model), and 
- a `LogisticRegression` predictor.
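
The overall shape of such a pipeline might look like the sketch below. It is fit on randomly generated numeric data, so the encoder step is left out here; with the burrito data, the `OneHotEncoder` from `category_encoders` would be the first step. The toy data and the name `model_sketch` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random toy features with a few missing values (not the burrito data)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y_toy = (X_toy['a'] + X_toy['b'] > 0).astype(int)
X_toy.iloc[::10, 2] = np.nan

model_sketch = make_pipeline(
    SimpleImputer(),        # fills NaNs with the column mean by default
    StandardScaler(),       # rescales each feature to mean 0, variance 1
    LogisticRegression(),
)
model_sketch.fit(X_toy, y_toy)

# For classifiers, .score() returns mean accuracy
train_acc = model_sketch.score(X_toy, y_toy)
print('Training accuracy:', train_acc)
```

`make_pipeline` names each step after its lowercased class, so the fitted estimator is reachable as `model_sketch.named_steps['logisticregression']`.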

In [None]:
# OneHotEncoder is available as ce.OneHotEncoder (category_encoders was imported above)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


model_logr = ...

# V. Check Metrics

**Task 7:** Calculate the training and test accuracy scores for `model_logr`.

In [None]:
training_acc = ...
test_acc = ...

print('Training Accuracy:', training_acc)
print('Test Accuracy:', test_acc)

# VI. Communicate Results

**Task 8:** Create a horizontal bar chart that plots the 10 most important coefficients for `model_logr`, sorted by absolute value.

**Note:** Since you created your model using a `Pipeline`, you'll need to use the [`named_steps`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) attribute to access the coefficients in your `LogisticRegression` predictor. Be sure to look at the shape of the coefficients array before you combine it with the feature names.
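
A sketch of the coefficient lookup on toy data. For a binary problem, `coef_` has shape `(1, n_features)`, so the first row is taken before pairing it with the column names. The step name `'logisticregression'` assumes the pipeline was built with `make_pipeline`; a pipeline with explicit step names would use those instead. One thing to watch for: `SimpleImputer` with its default strategy drops a column that is entirely missing in the training data, which would leave fewer coefficients than columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random toy data with 12 features (invented, not the burrito data)
rng = np.random.default_rng(0)
cols = [f'feat_{i}' for i in range(12)]
X_toy = pd.DataFrame(rng.normal(size=(200, 12)), columns=cols)
y_toy = (X_toy['feat_0'] - X_toy['feat_1'] > 0).astype(int)

model_sketch = make_pipeline(StandardScaler(), LogisticRegression())
model_sketch.fit(X_toy, y_toy)

coefs = model_sketch.named_steps['logisticregression'].coef_
print(coefs.shape)  # 2-D array: one row, one column per feature

# Pair the single row with feature names, keep the 10 largest magnitudes
importances = pd.Series(coefs[0], index=cols)
order = importances.abs().sort_values().tail(10).index
top10 = importances.loc[order]
print(top10)
```

Calling `top10.plot.barh()` on the result draws the horizontal bar chart, with the largest-magnitude coefficient at the top.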

In [None]:
# Create your horizontal barchart here.

There is more than one way to generate predictions with `model_logr`. For instance, you can use [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression) or [`predict_proba`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression.predict_proba).

**Task 9:** Generate predictions for `X_test` using both `predict` and `predict_proba`. Then below, write a summary of the differences in the output for these two methods. You should answer the following questions:

- What data type do `predict` and `predict_proba` output?
- What are the shapes of their outputs?
- What numerical values are in the output?
- What do those numerical values represent?
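
The questions above can be explored with a few lines like these, shown here on toy data rather than `X_test`. The names `clf`, `X_toy`, and `y_toy` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature dataset (invented)
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

labels = clf.predict(X_toy)        # hard class labels
probs = clf.predict_proba(X_toy)   # one probability per class

print(type(labels), labels.shape)  # NumPy array, shape (n_samples,)
print(type(probs), probs.shape)    # NumPy array, shape (n_samples, n_classes)

# Each row of probabilities sums to 1; columns follow clf.classes_
print(clf.classes_, probs.sum(axis=1))

# For a binary model, predict() matches thresholding P(class 1) at 0.5
thresholded = (probs[:, 1] > 0.5).astype(int)
print(thresholded)
```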

In [None]:
# Write code here to explore the differences between `predict` and `predict_proba`.

**Give your written answer here:**

```


```