# Naive Bayes, meet Naive Humans

> This is a story of how you can be led astray by machine learning, courtesy of me being pissy about [FOIA Predictor](https://datadotworld.shinyapps.io/foia_shiny_app/), which was inspired by [some work by CJS students](https://www.cjr.org/analysis/foia-request-how-to-study.php).

> Also let's talk about the [fake news challenge code](https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/feature_engineering.py)

I love cooking, but I hate actually reading a recipe to see what cuisine it is. I only cook... Italian food, let's say. If only there were a machine that could process the recipe for me!

"Soma, Soma!" you exclaim. "We just learned about **Naive Bayes**, I bet you can use it to automatically classify recipes!"

Okay, cool, I trust machines to accomplish anything: **let's do it!**

## Step 1: Preparing our data

### Step 1.1: Read in our data

This time it's just a CSV.

In [1]:
import pandas as pd

df = pd.read_csv("recipes.csv")
df.head()

| | cuisine | id | ingredient_list |
|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |


### Step 1.2: Creating a label column

It needs to be a number, right? Let's say everything with a cuisine of "italian" is going to be `1` and everything with another cuisine is going to be `0`.

In [2]:
df['is_italian'] = (df['cuisine'] == 'italian').astype(int)

In [3]:
df[df.is_italian == 1].head(4)

| | cuisine | id | ingredient_list | is_italian |
|---|---|---|---|---|
| 7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 |
| 9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 |
| 10 | italian | 5875 | pimentos, sweet pepper, dried oregano, olive o... | 1 |
| 12 | italian | 2698 | Italian parsley leaves, walnuts, hot red peppe... | 1 |


### Step 1.3: Create our features dataframe

I'm going to predict what cuisine our recipe is based on **only three ingredients**, because I know something about cooking.

What are a few ingredients that say a lot about whether a dish is Italian?

In [4]:
df.ingredient_list.str.contains("tomato").astype(int)

0        1
1        1
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        1
10       0
11       0
12       0
13       1
14       0
15       1
16       0
17       0
18       0
19       0
20       0
21       1
22       0
23       0
24       0
25       0
26       1
27       0
28       0
29       0
        ..
39744    0
39745    0
39746    1
39747    0
39748    0
39749    1
39750    0
39751    0
39752    0
39753    0
39754    0
39755    1
39756    0
39757    0
39758    0
39759    0
39760    0
39761    0
39762    0
39763    0
39764    1
39765    0
39766    0
39767    0
39768    0
39769    0
39770    0
39771    0
39772    0
39773    1
Name: ingredient_list, Length: 39774, dtype: int64

In [5]:
features_df = pd.DataFrame({
    'has_tomatoes': df.ingredient_list.str.contains('tomato').astype(int),
    'has_olive_oil': df.ingredient_list.str.contains('olive oil').astype(int),
    'has_soy_sauces': df.ingredient_list.str.contains('soy sauce').astype(int)
})
features_df.head(3)

| | has_olive_oil | has_soy_sauces | has_tomatoes |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
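One thing worth knowing about `str.contains`: it does substring matching, so `"tomato"` also catches "tomatoes" and "tomato paste", and it is case-sensitive unless told otherwise. Here's a small sketch of that behavior. The ingredient lists are made up for illustration, not rows from `recipes.csv`.

```python
import pandas as pd

# Toy ingredient lists (made up for illustration, not rows from recipes.csv)
ingredients = pd.Series([
    "chopped tomatoes, fresh basil, garlic",
    "tomato paste, olive oil, salt",
    "soy sauce, ginger, rice",
    "Tomato puree, onion",
])

# Substring match: "tomato" also matches "tomatoes" and "tomato paste"
has_tomato = ingredients.str.contains("tomato").astype(int)
print(has_tomato.tolist())

# The match is case-sensitive by default, so "Tomato puree" is missed above;
# case=False picks it up
has_tomato_any_case = ingredients.str.contains("tomato", case=False).astype(int)
print(has_tomato_any_case.tolist())
```

Whether you want the looser match depends on your ingredient: "tomato" is probably safe, but a short string like "rice" would also match "licorice".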


## Step 2: Using the classifier

### Step 2.1: Import the classifier

What kind of Naive Bayes classifier are we going to use?

In [6]:
# BernoulliNB is good for true/false, for short passages

from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
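To get a feel for what `BernoulliNB` does with 0/1 features, here's a minimal sketch on a tiny hand-made dataset. The rows, the labels, and the name `toy_clf` are all assumptions for illustration; nothing here comes from `recipes.csv`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up training data: columns are [has_tomatoes, has_olive_oil, has_soy_sauce]
X_toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
])
# 1 = Italian, 0 = not Italian (made-up labels)
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

toy_clf = BernoulliNB()
toy_clf.fit(X_toy, y_toy)

# Smoothed estimates of P(ingredient is present | class), one row per class
print(np.exp(toy_clf.feature_log_prob_))

# Two new "recipes": tomatoes + olive oil, and soy sauce only
new_recipes = np.array([[1, 1, 0], [0, 0, 1]])
print(toy_clf.predict(new_recipes))
print(toy_clf.predict_proba(new_recipes))
```

The classifier multiplies those per-ingredient probabilities together as if the ingredients were independent of each other, which is the "naive" part.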

### Step 2.2: Split our data into test and train data

In [7]:
# train_test_split will split our data into two parts
from sklearn.model_selection import train_test_split

# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values,
    df.is_italian,
    test_size=0.2)

# the first parameter is our FEATURES, passed as a plain array via features_df.values
# the second parameter is the LABEL as a number (so 0/1, not the cuisine name)
# test_size=0.2 means 80% training, 20% testing
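Because `train_test_split` shuffles randomly, the scores will wobble a little every time the notebook is re-run. A hedged sketch of two optional parameters that tame this, using made-up data (`X_demo` and `y_demo` are stand-ins, not the recipe features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 binary features, 20% positive labels
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,   # fixes the shuffle so scores are repeatable
    stratify=y_demo,   # keeps the 20/80 label mix in both halves
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

The cells in this notebook don't set either parameter, which is why the numbers you get won't match the ones printed here exactly.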

### Step 2.3: Train the classifier

In [8]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Step 3: Testing our classifier

Let's test it against the test data!

In [9]:
clf.score(X_test, y_test)

0.79534883720930227

And how about the training data?

In [10]:
clf.score(X_train, y_train)

0.78852258084792104

## Now it's your turn!

At each table, you'll need to

* Pick a cuisine (or two, if you'd like!)
* Create another label column based on that cuisine
* Pick two or three or four or five or more ingredients you think are representative of your cuisine selection
* Create another features dataframe using those ingredients
* Train and test a classifier

Remember that your ingredients can signal the **presence of the cuisine** or the **absence of a cuisine** - if I were doing Japanese food, "miso" and "cheese" would be good options because they'd point firmly in one direction or the other - "YES this is Japanese" and "NO this is not Japanese."
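If you'd like to try several cuisines quickly, the steps in the list above can be bundled into one function. This is only a sketch: `score_cuisine` is a hypothetical helper, and the tiny `toy` dataframe is made up so the example runs on its own. Swap in the real `df` to use it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB


def score_cuisine(recipes, cuisine, ingredients, random_state=0):
    """Train a BernoulliNB on ingredient flags and return test accuracy.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    # Label column: 1 if the recipe is the chosen cuisine, 0 otherwise
    labels = (recipes["cuisine"] == cuisine).astype(int)
    # One 0/1 feature column per ingredient
    features = pd.DataFrame({
        "has_" + name.replace(" ", "_"):
            recipes["ingredient_list"].str.contains(name, regex=False).astype(int)
        for name in ingredients
    })
    X_train, X_test, y_train, y_test = train_test_split(
        features.values, labels, test_size=0.2, random_state=random_state
    )
    model = BernoulliNB()
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Tiny made-up stand-in for recipes.csv, just to show the call
toy = pd.DataFrame({
    "cuisine": ["japanese", "italian", "japanese", "french", "japanese",
                "italian", "mexican", "japanese", "french", "italian"],
    "ingredient_list": [
        "miso, tofu, scallions", "cheese, tomatoes, basil",
        "miso, dashi, rice", "cheese, butter, cream",
        "miso, mirin, salmon", "cheese, olive oil, pasta",
        "cheese, tortillas, beans", "miso, seaweed, tofu",
        "cheese, eggs, butter", "cheese, garlic, pasta",
    ],
})

toy_score = score_cuisine(toy, "japanese", ["miso", "cheese"])
print(toy_score)
```

With the real data the call would look like `score_cuisine(df, "japanese", ["miso", "cheese"])`.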

In [11]:
df.cuisine.unique()

array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican', 'spanish',
       'italian', 'mexican', 'chinese', 'british', 'thai', 'vietnamese',
       'cajun_creole', 'brazilian', 'french', 'japanese', 'irish',
       'korean', 'moroccan', 'russian'], dtype=object)

In [12]:
df['is_indian'] = (df['cuisine'] == 'indian').astype(int)

In [13]:
features_df = pd.DataFrame({
    'has_curry': df.ingredient_list.str.contains('curry').astype(int),
    'has_ginger': df.ingredient_list.str.contains('ginger').astype(int),
    'has_turmeric': df.ingredient_list.str.contains('turmeric').astype(int),
    'has_ghee': df.ingredient_list.str.contains('ghee').astype(int)
})
features_df.head(3)

| | has_curry | has_ghee | has_ginger | has_turmeric |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |


In [14]:
# BernoulliNB is good for true/false, for short passages

from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()

In [15]:
# train_test_split will split our data into two parts
from sklearn.model_selection import train_test_split

# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values,
    df.is_indian,
    test_size=0.2)

# the first parameter is our FEATURES, passed as a plain array via features_df.values
# the second parameter is the LABEL as a number (so 0/1, not the cuisine name)
# test_size=0.2 means 80% training, 20% testing

In [16]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [17]:
clf.score(X_test, y_test)

0.94267756128221247

In [18]:
df.is_italian.value_counts(normalize=True)

0    0.802937
1    0.197063
Name: is_italian, dtype: float64

In [19]:
df.is_indian.value_counts(normalize=True)

0    0.924498
1    0.075502
Name: is_indian, dtype: float64

In [20]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='constant', constant=1)

In [21]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.075424261470773094
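That dummy always guesses "yes, Indian", so its score is just the share of Indian recipes. Flip it around and always guess the *majority* class, and you get the baseline a real classifier has to beat. Going by the `value_counts` above, always guessing "not Italian" would be right roughly 80% of the time (our Italian classifier scored about 0.795), and always guessing "not Indian" roughly 92% of the time (our Indian classifier scored about 0.943). The sketch below shows both dummies on made-up labels with a similar imbalance; `y_fake` and `X_fake` are stand-ins, not the recipe data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the same imbalance as is_indian (7.5% positive)
y_fake = np.array([1] * 75 + [0] * 925)
X_fake = np.zeros((len(y_fake), 1))  # DummyClassifier ignores the features

# Always predicts 1, like the dummy in the cell above
always_yes = DummyClassifier(strategy="constant", constant=1).fit(X_fake, y_fake)
# Always predicts whichever label is most common in the training labels
always_majority = DummyClassifier(strategy="most_frequent").fit(X_fake, y_fake)

yes_score = always_yes.score(X_fake, y_fake)            # share of positives
majority_score = always_majority.score(X_fake, y_fake)  # share of the majority class
print(yes_score, majority_score)
```

A high accuracy number only means something once you know what "guess the same thing every time" would have scored.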

## Label Encoders: What if we want more than `is_italian`?

A **LabelEncoder** will convert labels to numbers for you.

It has two parts: **fit** and **transform**.

* **fit** learns all of the possible labels
* **transform** takes a list of categories and converts them into numbers

In [22]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

In [37]:
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates 
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

LabelEncoder()

In [38]:
# Get the labels out as numbers
le.transform(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

array([1, 2, 2, 2, 3, 0])
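Why is `'orange'` a 1 and `'blue'` a 0? The encoder sorts the unique labels and numbers them in that order. A small sketch using the same colors shows where to look up the mapping and how to turn numbers back into labels (`enc` and `codes` are just names picked for this example):

```python
from sklearn import preprocessing

colors = ['orange', 'red', 'red', 'red', 'yellow', 'blue']

enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(colors)  # fit and transform in one step

# Sorted unique labels; each label's position in this array is its number
print(enc.classes_)

# Turn numbers back into the original labels
print(enc.inverse_transform(codes))
```

`inverse_transform` comes in handy later, when a classifier hands you back a number and you want to know which cuisine it meant.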

In [39]:
df.cuisine.head(10)

0          greek
1    southern_us
2       filipino
3         indian
4         indian
5       jamaican
6        spanish
7        italian
8        mexican
9        italian
Name: cuisine, dtype: object

In [26]:
# Send the label encoder each and every cuisine


In [27]:
# What does it give back when .transform-d?

In [28]:
# Add it back into the dataframe as cuisine_label


In [29]:
# Check value_counts of each to see if they match