# Food Classification

### Purpose
1. Classify each item into a USDA food group
2. Pre-populate the 'tags' associated with each order to find useful attributes and groupings.
3. Create a circular machine-learning approach to constantly refine and re-score likely food groups and other emergent categories from updates and new tags - new tags should inform new food groups and vice versa. 'Important' groups should emerge based upon how often they are used in rules

### Part 1. Cleaning Dataset

In [1]:
import pandas as pd
import numpy as np

In [4]:
# Can replace this with an S3 reference
df = pd.read_csv('C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_2018_3_15_processed_comma.csv', delimiter = ',',encoding='ISO-8859-1')

  interactivity=interactivity, compiler=compiler, result=result)


Throughout this analysis, the format of food descriptions will all depend on the source of the data. Printing the unique values of the source column below will be a useful reference. 

In [5]:
df.source.unique()

array(['diet_facts_restaurants', 'diet_facts_brands', '700', 'fat_secret',
       'usda_raw_ingred', 'fat_secret_all_search', 'fat_secret_recipes',
       'usda_branded'], dtype=object)

In [6]:
df = df[df.source!='700']

## Self-Labeling Food Groups
1. Sampling 20k an exporting for self-labeling
2. Importing USDA food groups and matching
3. Importing fat-secret classifications as 'tags'
4. Build a master 'labeled dataset (include 'label source')

##### Self-Labeling Export

In [None]:
df_short = df[['food_description', 'brand', 'food_type_grp', 'ingredients_list']]
samp_df = df_short.sample(20000)

In [None]:
#Export for self-labeling
samp_df.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food classifications/sample_foods_to_label.csv", index=False)

##### Importing USDA Food groups and matching

Exact Matching

In [7]:
fd_grps = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/usda_food_groups.csv")

In [8]:
fd_grps

Unnamed: 0,Food,Group,Subgroup
0,"Apples, cooked or canned",Fruit,Whole Fruit
1,"Applesauce, canned, unsweetened, without vitam...",Fruit,Whole Fruit
2,"Apples, dried",Fruit,Whole Fruit
3,"Apple, dried, sulfured, uncooked",Fruit,Whole Fruit
4,"Apples, raw",Fruit,Whole Fruit
5,"Apple, raw, with skin",Fruit,Whole Fruit
6,Applesauce,Fruit,Whole Fruit
7,"Applesauce, canned, unsweetened, without vitam...",Fruit,Whole Fruit
8,"Apricot, cooked or canned",Fruit,Whole Fruit
9,"Apricot, canned, water pack",Fruit,Whole Fruit


In [18]:
full_df = df_short.merge(fd_grps, how='inner',left_on='food_description', right_on='Food')

In [21]:
full_df.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/labeled_data/usda_exact_match.csv", index=False)

Fuzzy Matching

In [22]:
import difflib

In [28]:
# Set index in fd grps
fd_grps=fd_grps.set_index(['Food'])

In [27]:
df_short=df_short.set_index(['food_description'])

In [None]:
df_short.index= df_short.index.map(lambda x: difflib.get_close_matches(x, fd_grps.index,n=1,cutoff=0.8))

In [None]:
fuzzy_grps = df_short.join(fd_grps)

In [None]:
fuzzy_grps.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/labeled_data/usda_fuzzy_match_.8.csv", index=False)

##### Importing Fat Secret 'Food Groups' as tags

In [3]:
fs_tags = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/fat_secret_groups_in_progress.csv")

In [4]:
fs_tags.head()

Unnamed: 0,food,food_group,food_label,food_sub_group,calcium_perc,calories,carb_g,cholesterol_mg,fat_g,fiber_g,iron_perc,protein,saturated_fat_g,serving_size,sodium_mg,sugar_g,vit_a_perc,vit_c_perc
0,Plain or Vegetarian Baked Beans,Beans & Legumes,,Baked Beans,9,239.0,53.7,0,0.94,10.4,17,12.06,0.224,1 cup,856,22.96,5,0
1,Baked Beans with Pork,Beans & Legumes,,Baked Beans,13,268.0,50.55,18,3.92,13.9,24,13.13,1.515,1 cup,1047,-,9,8
2,Baked Beans with Franks,Beans & Legumes,,Baked Beans,12,368.0,39.86,16,17.02,17.9,25,17.48,6.092,1 cup,1114,16.91,4,10
3,Baked Beans with Beef,Beans & Legumes,,Baked Beans,12,322.0,44.98,59,9.18,-,24,16.97,4.461,1 cup,1264,-,11,8
4,Baked Beans with Pork and Sweet Sauce,Beans & Legumes,,Baked Beans,15,283.0,53.43,18,3.64,10.6,23,13.38,1.27,1 cup,845,21.66,0,12


Perform Like To Like Transformation From FS Food Groups to USDA Food Groups

In [13]:
fs_grps = fs_tags['food_group'].unique()
fs_sub_grps = fs_tags['food_sub_group'].unique()

In [22]:
fs_grps_on = fs_grps['food_group', 'food_sub_group']

  if __name__ == '__main__':


IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [21]:
fs_grps_df = fs_grps['food_group', 'food_sub_group'].drop_duplicates()

  if __name__ == '__main__':


IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [6]:
fs_grps

array(['Beans & Legumes', 'Beverages', 'Breads & Cereals',
       'Cheese, Milk & Dairy', 'Eggs', 'Fast Food', 'Fish & Seafood',
       'Fruit', 'Meat', 'Nuts & Seeds', 'Pasta, Rice & Noodles', 'Salads',
       'Sauces, Spices & Spreads', 'Snacks', 'Soups',
       'Sweets, Candy & Desserts', 'Vegetables', 'Other'], dtype=object)

In [14]:
fs_sub_grps

array(['Baked Beans', 'Beans', 'Black Beans', 'Chickpeas', 'Green Beans',
       'Kidney Beans', 'Lentils', 'Lima Beans', 'Pinto Beans', 'Quinoa',
       'Refried Beans', 'Tofu', 'Alcohol', 'Apple Juice', 'Beer',
       'Cappuccino', 'Chocolate Milk', 'Cocktails', 'Cocoa', 'Coffee',
       'Cranberry Juice', 'Drink Mixes', 'Energy Drinks', 'Fruit Punch',
       'Ice Cream Sodas', 'Iced Coffee', 'Iced Tea', 'Juice', 'Latte',
       'Lemonade', 'Milk Shakes', 'Orange Juice', 'Red Wine', 'Root Beer',
       'Smoothies', 'Sodas', 'Soy Milk', 'Tea', 'Vegetable Juice', 'Vodka',
       'Water', 'White Wine', 'Wine', 'Bagels', 'Biscuits', 'Bread',
       'Breadsticks', 'Buns', 'Cereal', 'Cornbread', 'Croissants',
       'English Muffins', 'Flatbread', 'Focaccia', 'Garlic Bread',
       'Granola', 'Muesli', 'Multigrain Bread', 'Naan', 'Oatmeal', 'Oats',
       'Pita Bread', 'Potato Bread', 'Raisin Bread', 'Rolls', 'Rye Bread',
       'Scones', 'Sourdough Bread', 'Toast', 'Tortillas', 'Wheat Bre

In [9]:
usda_sub_grps =fd_grps['Subgroup'].unique()

In [11]:
usda_sub_grps

array(['Whole Fruit', 'Fruit Juice', 'Dark Green Vegetables', 'Legumes',
       'Red and Orange Vegetables', 'Starchy Vegetables',
       'Other Vegetables', 'Whole Grain', 'Refined Grains', 'Milk',
       'Yogurt', 'Cheese', 'Soymilk', 'Eggs', 'High Omega-3 Fish',
       'Low Omega-3 Fish', 'Nuts and Seeds', 'Poultry', 'Red Meats',
       'Processed Soy Products', 'Oils', 'Solid Fats'], dtype=object)

## Food Group Classifier

Features:
1. One-hot encoded tokenized words from food description
2. One-hot encoded tokenized bi-grams from food description
3. All nutritional information
4. Word vector of food desription
5. Fuzzy-match score of string similarity to major food words

Steps:
1. Cluster on nutritional info to find likely groups
2. K-nearest neighbors on nutritional info as tag/food group classification
3. More intricate food classifier algorithms

### Tokenizing and Phrasing Food Description 

1. Tokenize
2. 2-gram phrases
3. Remove brand tokens or bi-gram
4. TF/IDF of each token or big-gram
5. Create a word vector for each description
6. One-hot encode since this is a finite dictionary

In [6]:
df.food_description[df.source=='usda_branded'][0:5]

235133                         AARDVARK HABENERO HOT SAUCE,
235134            AARON'S BEST, OVEN ROASTED TURKEY BREAST,
235135    A&B AMERICAN STYLE, MORE HEAT SMALL BATCH PEPP...
235136            A&B AMERICAN STYLE, ORGANIC PEPPER SAUCE,
235137            A&B AMERICAN STYLE, PEPPER SAUCE, GARLIC,
Name: food_description, dtype: object

These descriptions often contain the brand of the item, which is going to be less relevant to the meaning wanted out of the word vector exercise

In [7]:
df[['food_description', 'brand']][df.source=='usda_branded'][0:15]

Unnamed: 0,food_description,brand
235133,"AARDVARK HABENERO HOT SAUCE,",Secret Aardvark Trading Company
235134,"AARON'S BEST, OVEN ROASTED TURKEY BREAST,",Agri Star Meat & Poultry LLC
235135,"A&B AMERICAN STYLE, MORE HEAT SMALL BATCH PEPP...",A & B AMERICAN STYLE LLC
235136,"A&B AMERICAN STYLE, ORGANIC PEPPER SAUCE,",A & B AMERICAN STYLE LLC
235137,"A&B AMERICAN STYLE, PEPPER SAUCE, GARLIC,",Namaste Foods
235138,"A&B AMERICAN STYLE, SMALL BATCH PEPPER SAUCE, ...",A & B AMERICAN STYLE LLC
235139,"A. BAUER'S, PREPARED MUSTARD,",August Bauer's Sons Inc.
235140,"ABBA-ZABA, SNACK SIZE BITES CANDY,","Annabelle Candy Co., Inc."
235141,"ABBA-ZABA'S, TAFFY, WILD STRAWBERRY, SOUR,","Annabelle Candy Co., Inc."
235142,"ABBEY FARM, RHUBARB & GINGER PRESERVE,",Bewley Irish Imports


Unhelpful! Some brands are nicely embedded in the food description, seperated by a comma. Others contain only a part of the brand name in the product description (such as Secret Aardvark Trading Company's Aardvark Habenero Hot Sauce). But a fuzzy match on the brand name within a food description would likely also strip away helpful words like 'Candy' or 'Beverage'. 

So ultimately it doesn't look like there are helpful transformations for this subset

In [19]:
df.food_description[400000:400100]

400000                         PREMIUM MEATS & CHEESES,
400001                         PREMIUM MEATS & CHEESES,
400002                      PREMIUM MILK CHOCOLATE BAR,
400003                                    PREMIUM MILK,
400004                         PREMIUM MEATS & CHEESES,
400005                         PREMIUM MEATS & CHEESES,
400006       PREMIUM MEATS HONEY ROASTED TURKEY BREAST,
400007       PREMIUM MEATS HONEY ROASTED TURKEY BREAST,
400008                              PREMIUM MEAT SNACK,
400009        PREMIUM MEATS OVEN ROASTED TURKEY BREAST,
400010                                   PREMIUM MEATS,
400011         PREMIUM MEDITERRANEAN STYLE FETA CHEESE,
400012                           PREMIUM MEDIUM RELISH,
400013                       PREMIUM MERLOT WINE JELLY,
400014                     PREMIUM MEYER ROLLED WAFERS,
400015                       PREMIUM MICROWAVE POPCORN,
400016                          PREMIUM MILD BBQ SAUCE,
400017                                   SELTZER

## Using Nutritional Information To Supplement Grouping

1. Cluster items by nutritional profile. (k-means)
2. Use distance from each group to help classify
3. Use basic classification to predict food group according to self-labeled item

## Part 2: Vector Space Modeling

In [5]:
import gensim



Creating an baseline Word2Vec model with no negative sampling

In [6]:
text_8_path = 'C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/text8/text8'

In [7]:
text_8 = gensim.models.word2vec.Text8Corpus(text_8_path)

In [14]:
phrases = gensim.models.phrases.Phrases(text_8)

In [15]:
food_phrases = gensim.models.Word2Vec(phrases)



TypeError: 'int' object is not iterable

In [10]:
food_names.most_similar('eggs benedict')

NameError: name 'food_names' is not defined

In [16]:
food_names.most_similar('apple')

  if __name__ == '__main__':


[('macintosh', 0.7894327640533447),
 ('amiga', 0.7420880794525146),
 ('intel', 0.7306458353996277),
 ('ibm', 0.7269545793533325),
 ('amd', 0.690711259841919),
 ('atari', 0.6872498989105225),
 ('pc', 0.679271936416626),
 ('nintendo', 0.674188494682312),
 ('hypercard', 0.6720134615898132),
 ('microsoft', 0.6710932850837708)]

In [17]:
food_names.most_similar('apple pie')

  if __name__ == '__main__':


KeyError: "word 'apple pie' not in vocabulary"

In [23]:
df['food_description'][0]

'Cilantro Lime Dressing'

Iterating to See How Many Raw Descriptions Are Supported In the Text8 Corpus

In [26]:
food_vec = []
results_holder = []
for i in range(len(df)):
    try:
        word_vec = food_names[df['food_description'][1]]
        food = df['food_description'][1]
        food_vec = food_vec.append(food),
        results_holder = results_holder.append(word_vec[0:10])
    except:
        pass



In [None]:
print(len(food_vec))

Two learnings here. First and foremost, this corpus is not specific enough for the exercise. 'Apple' should be returning 'fruit' and 'snack' but not 'macintosh' and 'microsoft'. Even if I had been able to nicely process my food descriptions cleanly, the results from this corpus still leave much to be desired.

Second, the model (and corpus) needs to support phrases. Entering 'apple pie' returned no results even though there is almost certainly an apple pie Wikipedia entry in these results. 

Phrases can be supported by using Negative Sampling the final network layer rather than heirarchical softmax. We can test simple implementation by retraining on the text_8 corpus, using negative sampling with 5 noise words included.

In [36]:
food_phrases = gensim.models.Word2Vec(text_8, hs=0, negative = 2)

In [23]:
food_names.most_similar('oz')

  if __name__ == '__main__':


[('wizard', 0.7509090900421143),
 ('carol', 0.702263593673706),
 ('judy', 0.6979846954345703),
 ('betty', 0.6864558458328247),
 ('biopic', 0.6815564632415771),
 ('potter', 0.6814549565315247),
 ('gloria', 0.680546760559082),
 ('doc', 0.679688572883606),
 ('remake', 0.6795153021812439),
 ('mister', 0.6771171689033508)]

## Part3: Serving Sizes

A human meal planner can look at a serving size '1 can' and decide 'I can eat half a can now, and half a can later'. A machine can perform at least a somewhat similar decision if and only if it can understand that '1' is the amount and 'can' is the unit. This exercise will attempt to use text processing to seperate values and units, and at best to normalize units to the same type (i.e. 'oz' and 'ounces'). 

To start with we investigate a sample of the 'serving size' column below.

In [9]:
import nltk

In [10]:
df.serving_size_raw.head()

0    1 egg cream
1    1 egg cream
2        1 order
3        1 order
4        1 order
Name: serving_size_raw, dtype: object

So egg cream is probably not going to be a standard unit of measurement. But is there a way to recognize when a unit stands for '1 item'?

On the other hand 'order' should be a word often repeated. We can start with word frequency as a good indication of true 'units' vs descriptions of items. 

Before getting a bag of words of the serving size units, stripping away the values would be helpful

In [None]:
df['ss_numbers_raw'] = ''
for i in range((len(df)-1)):
    line_numbers = []
    line = df.iloc[i,6]
    if type(line) == str:
        for t in line.split():
            try:
                line_numbers.append(float(t))   
            except ValueError:
                pass
        df['ss_numbers_raw'][i]= line_numbers
    else:
        df['ss_numbers_raw'][i]=line

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [19]:
df[['ss_numbers_raw', 'serving_size_raw']][10000:10010]

Unnamed: 0,ss_numbers_raw,serving_size_raw
10000,"[8.0, 240.0]",8 fl oz 240 mL
10001,"[12.0, 355.0]",12 fl oz 355 mL
10002,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10003,"[8.0, 240.0]",8 fl oz 240 mL
10004,"[8.0, 240.0]",8 fl oz 240 mL
10005,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10006,"[8.0, 240.0]",2/5 bottle 8 fl oz 240 mL
10007,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10008,"[8.0, 240.0]",8 fl oz 240 mL
10009,"[8.0, 240.0]",8 fl oz 240 mL


In [3]:
df['ss_strings_raw'] = ''
for i in range(len(df.serving_size_raw)):
    line_strings = []
    line = df.iloc[i,6]
    if type(line) == str:
        for t in line.split():
            try:
                float(t)
            except ValueError:
                line_strings.append(t)
        df['ss_strings_raw'][i]= line_strings
    else:
        df['ss_strings_raw'][i]=''

NameError: name 'df' is not defined

In [21]:
df[['ss_strings_raw', 'serving_size_raw']][10000:10010]

Unnamed: 0,ss_numbers_raw,serving_size_raw
10000,"[fl, oz, mL]",8 fl oz 240 mL
10001,"[fl, oz, mL]",12 fl oz 355 mL
10002,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10003,"[fl, oz, mL]",8 fl oz 240 mL
10004,"[fl, oz, mL]",8 fl oz 240 mL
10005,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10006,"[2/5, bottle, fl, oz, mL]",2/5 bottle 8 fl oz 240 mL
10007,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10008,"[fl, oz, mL]",8 fl oz 240 mL
10009,"[fl, oz, mL]",8 fl oz 240 mL


In [38]:
df.columns.values

array(['food_key', 'food_description', 'brand', 'food_type_grp', 'source',
       'ingredients_list', 'serving_size_raw', 'serving_size_val',
       'serving_size_unit', 'calories', 'protein_g', 'fat_g',
       'saturated_fat_g', 'carb_g', 'fiber_g', 'sugar_g', 'sodium_mg',
       'cholesterol_mg', 'calcium_mg', 'iron_mg', 'vit_a_mcg', 'vit_c_mg',
       'ss_numbers_raw'], dtype=object)

In [None]:
# First 
for i in range((len(df.serving_size_raw)-1)):
    line_strings = []
    line_floats = []
    line = df.iloc[i,6]
    if type(line) == str:
        for t in line.split():
            try:
                float(t)
                line_floats.append(t)
            except ValueError:
                line_strings.append(t)
        if len(line_floats) > 1:
            text1 = line[(line.find(line_floats[0])+1):(line.find(line_floats[1])-1)].strip()
            df.serving_size_val.iloc[i] = float(line_floats[0])
            df.serving_size_unit.iloc[i] = text1
        elif len(line_floats) == 1:
            
            df.serving_size_val.iloc[i] = float(line_floats[0])
            df.serving_size_unit.iloc[i] = line[(line.find(line_floats[0])+1):len(line)].strip()
        else:
            df.serving_size_unit.iloc[i] = line
    else:
        df.serving_size_val.iloc[i] = line

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [12]:
df.serving_size_unit.iloc[1]

'egg cream'

In [9]:
type(line_floats[0])

float

In [11]:
test = df.iloc[10004,6]

In [36]:
line_strings = []
line_floats = []
for t in test.split():
    try:
        float(t)
        line_floats.append(t)
    except ValueError:
        line_strings.append(t)
if len(line_floats) > 1:
    text1 = test[(test.find(line_floats[0])+1):(test.find(line_floats[1])-1)].strip()

In [37]:
text1

'fl oz'

In [None]:
df.to_csv('C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_2018_3_16_processed.csv', delimiter = ',',encoding='ISO-8859-1')