# Food Classification

### Purpose
1. Classify each item into a USDA food group
2. Pre-populate the 'tags' associated with each order to find useful attributes and groupings.
   * Can pre-populate tags with machine learning or social media scraping techniques
3. Create a circular machine-learning approach to constantly refine and re-score likely food groups and other emergent categories from updates and new tags - new tags should inform new food groups and vice versa. 'Important' groups should emerge based upon how often they are used in rules

### Part 1. Cleaning Dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Can replace this with an S3 reference
df = pd.read_csv('C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_2018_3_15_processed_comma.csv', delimiter = ',',encoding='ISO-8859-1')

  interactivity=interactivity, compiler=compiler, result=result)


Throughout this analysis, the format of food descriptions will all depend on the source of the data. Printing the unique values of the source column below will be a useful reference. 

In [5]:
df.source.unique()

array(['diet_facts_restaurants', 'diet_facts_brands', '700', 'fat_secret',
       'usda_raw_ingred', 'fat_secret_all_search', 'fat_secret_recipes',
       'usda_branded'], dtype=object)

In [3]:
df = df[df.source!='700']

## Self-Labeling Food Groups
1. Sampling 20k an exporting for self-labeling
2. Importing USDA food groups and matching
3. Importing fat-secret classifications as 'tags'
4. Build a master 'labeled dataset (include 'label source')

##### Self-Labeling Export

In [65]:
df_short = df[['food_description', 'brand', 'food_type_grp', 'ingredients_list']]
samp_df = df_short.sample(20000)

In [8]:
#Export for self-labeling
samp_df.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food classifications/sample_foods_to_label.csv", index=False)

##### Importing USDA Food groups and matching

Exact Matching

In [60]:
fd_grps = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/usda_food_groups.csv")

In [61]:
fd_grps

Unnamed: 0,Food,Group,Subgroup
0,"Apples, cooked or canned",Fruit,Whole Fruit
1,"Applesauce, canned, unsweetened, without vitam...",Fruit,Whole Fruit
2,"Apples, dried",Fruit,Whole Fruit
3,"Apple, dried, sulfured, uncooked",Fruit,Whole Fruit
4,"Apples, raw",Fruit,Whole Fruit
5,"Apple, raw, with skin",Fruit,Whole Fruit
6,Applesauce,Fruit,Whole Fruit
7,"Applesauce, canned, unsweetened, without vitam...",Fruit,Whole Fruit
8,"Apricot, cooked or canned",Fruit,Whole Fruit
9,"Apricot, canned, water pack",Fruit,Whole Fruit


In [37]:
full_df = df_short.merge(fd_grps, how='inner',left_on='food_description', right_on='Food')

KeyError: 'Food'

In [21]:
full_df.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/labeled_data/usda_exact_match.csv", index=False)

Fuzzy Matching

In [17]:
import difflib

In [39]:
fd_grps

Unnamed: 0_level_0,Group,Subgroup
Food,Unnamed: 1_level_1,Unnamed: 2_level_1
"Apples, cooked or canned",Fruit,Whole Fruit
"Applesauce, canned, unsweetened, without vitamin C",Fruit,Whole Fruit
"Apples, dried",Fruit,Whole Fruit
"Apple, dried, sulfured, uncooked",Fruit,Whole Fruit
"Apples, raw",Fruit,Whole Fruit
"Apple, raw, with skin",Fruit,Whole Fruit
Applesauce,Fruit,Whole Fruit
"Applesauce, canned, unsweetened, without vitamin C",Fruit,Whole Fruit
"Apricot, cooked or canned",Fruit,Whole Fruit
"Apricot, canned, water pack",Fruit,Whole Fruit


In [38]:
# Set index in fd grps
fd_grps=fd_grps.set_index(['Food'])

KeyError: 'Food'

In [40]:
#df_short=df_short.set_index(['food_description'])

In [4]:
def matcher(x):
    l = difflib.get_close_matches(x, fd_grps.Food,n=1,cutoff=0.8)
    if len(l)==0:
        return np.nan
    else:
        return l[0]

In [64]:
df_short

Unnamed: 0,brand,food_type_grp,ingredients_list
,5 & Diner,restaurant,
,5 & Diner,restaurant,
,5 & Diner,restaurant,two poached eggs on top of sliced ham and an E...
,5 & Diner,restaurant,
,5 & Diner,restaurant,"battered cod served with French fries, corn on..."
,5 & Diner,restaurant,
,5 & Diner,restaurant,tender slices of roast beef served on a hoagie...
,5 & Diner,restaurant,
,5 & Diner,restaurant,
,5 & Diner,restaurant,


In [5]:
df_short['matched']= df_short.food_description.map(lambda x: matcher(x))

NameError: name 'df_short' is not defined

In [50]:
fuzzy_grps = df_short.merge(fd_grps, how="inner", left_on='matched',right_on='Food')

In [59]:
fuzzy_grps.filter(np.nan)

TypeError: 'float' object is not iterable

In [None]:
fuzzy_grps[]

In [54]:
fd_grps

Unnamed: 0_level_0,Group,Subgroup
Food,Unnamed: 1_level_1,Unnamed: 2_level_1
"Apples, cooked or canned",Fruit,Whole Fruit
"Applesauce, canned, unsweetened, without vitamin C",Fruit,Whole Fruit
"Apples, dried",Fruit,Whole Fruit
"Apple, dried, sulfured, uncooked",Fruit,Whole Fruit
"Apples, raw",Fruit,Whole Fruit
"Apple, raw, with skin",Fruit,Whole Fruit
Applesauce,Fruit,Whole Fruit
"Applesauce, canned, unsweetened, without vitamin C",Fruit,Whole Fruit
"Apricot, cooked or canned",Fruit,Whole Fruit
"Apricot, canned, water pack",Fruit,Whole Fruit


In [51]:
len(fuzzy_grps)

469828

In [None]:
fuzzy_grps.to_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/labeled_data/usda_fuzzy_match_.8.csv", index=False)

##### Importing Fat Secret 'Food Groups' as tags

In the model df, will one-hot encode tags also. 

In [11]:
fs_tags = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/fat_secret_groups_in_progress.csv")

In [34]:
fs_tags_short = fs_tags[['food','food_group','food_label','food_sub_group']]

Would like to perform Like To Like Transformation From FS Food Groups to USDA Food Groups - they don't often match easily. Will use one-hot-encoding first

In [24]:
usda_sub_grps =fd_grps[['Subgroup']].drop_duplicates()

In [11]:
df = df.merge(fs_tags_short, how='left',left_on='food_description', right_on='food')

NameError: name 'df' is not defined

In [50]:
df.columns.values[-4] = 'tag1'
df.columns.values[-3] = 'tag2'
df.columns.values[-2] = 'tag3'
df.columns.values[-1] = 'tag4'

In [51]:
df.columns.values

array(['food_key', 'food_description', 'brand', 'food_type_grp', 'source',
       'ingredients_list', 'serving_size_raw', 'serving_size_val',
       'serving_size_unit', 'calories', 'protein_g', 'fat_g',
       'saturated_fat_g', 'carb_g', 'fiber_g', 'sugar_g', 'sodium_mg',
       'cholesterol_mg', 'calcium_mg', 'iron_mg', 'vit_a_mcg', 'vit_c_mg',
       'tag1', 'tag2', 'tag3', 'tag4'], dtype=object)

Matching Ingredients To USDA Groups

In [6]:
df_ingreds = df.ingredients_list[(df.ingredients_list.notnull())&(df.food_type_grp=='recipe')]

In [8]:
ingred_items = []
for ingred in df_ingreds:
    l = ingred.replace('{','').replace('}','').split(',')
    for i in l:
        ingred_items.append(i)
ingred_set = list(set(ingred_items))

In [9]:
ingred_set

['',
 ' Enriched Dry Pasta',
 '2% Fat Milk',
 ' Cocoa Powder',
 ' 0% Greek Style Yogurt',
 ' Hydrogenated)',
 'Dried Apricots (Uncooked',
 ' Lean Ground Turkey 93/7',
 ' Yellow & Green Peppers',
 ' Slip Skin)',
 ' White Tuna Fish (Drained Solids In Water',
 ' Milk Chocolate Candies',
 'Sea Scallops',
 ' Green Tomatoes',
 'Organic Rice Milk',
 ' Brazil Nuts',
 ' Anchovy (Drained Solids In Oil',
 ' Reduced Sodium Black Beans',
 ' Dry Whole Wheat Pasta',
 'Monterey Cheese',
 'Spanish Olives',
 ' Cream Of Mushroom Soup (with Equal Volume Water',
 ' Trimmed to 1/8" Fat',
 ' Common Cabbage (Freshly Harvest)',
 ' Chicken Meat (Broilers or Fryers',
 ' Natural Finely Shredded Mexican Style Four Cheese',
 ' Roasted Unsalted Cashew Nuts',
 ' Nonfat Plain Greek Yogurt (170g)',
 'Blanched Almond Flour',
 ' Wheat Germ',
 ' Low Sodium Tomato Sauce',
 'Butter Oil',
 ' 100% Canola Oil',
 ' Extra Firm Silken Tofu',
 ' Unsalted Butter Stick',
 ' Frozen Shrimp',
 ' Dry or Hard Salami',
 'Chopped Walnuts',

In [10]:
ingred_set_df = pd.DataFrame(ingred_set,columns=['name'])

In [11]:
ingred_set_df

Unnamed: 0,name
0,
1,Enriched Dry Pasta
2,2% Fat Milk
3,Cocoa Powder
4,0% Greek Style Yogurt
5,Hydrogenated)
6,Dried Apricots (Uncooked
7,Lean Ground Turkey 93/7
8,Yellow & Green Peppers
9,Slip Skin)


In [12]:
usda_grps = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/labeled_data/usda_exact_match.csv")

In [13]:
exact_ingred_match = usda_grps.merge(ingred_set_df, how='inner', right_on='name', left_on='food_description')

In [14]:
exact_ingred_match['food_description'].unique()

array(['Apple Juice', 'Shrimp', 'Walnuts', 'Tomato Juice', 'White Rice',
       'Pecans', 'Bamboo Shoots', 'Lemon Juice', 'Lime Juice', 'Garlic',
       'Lentils'], dtype=object)

##### Building A Classification Dataset

First getting all tokens for one-hot encoding

In [9]:
import nltk

Real quick, going to remove a comma if that is the last string in the food description

In [54]:
df.food_description[0:10]

0    Egg Cream, Chocolate flavored
1      Egg Cream, Vanilla flavored
2                    Eggs Benedict
3     Eggs Maximilian with Chorizo
4                   Fish and Chips
5                       Fish Tacos
6                       French Dip
7                     French Fries
8        French Onion Steak Dinner
9                     French Toast
Name: food_description, dtype: object

In [8]:
def comma_remover(x):
    if x[len(x)-1]==',':
        x = x.replace(',','')
    return x

In [9]:
df['food_description'] = df.food_description.map(lambda x: comma_remover(x))

Getting Bag of Words
* Tokens food description
* B-grams for food description where bi-grams occur more than 3 times
* Tokens for ingredients

Sklearn has a process for this, but is running into memory error. So will do manually

In [5]:
from sklearn.preprocessing import MultiLabelBinarizer
import itertools
import numpy as np
import nltk

In [6]:
stops = set(nltk.corpus.stopwords.words('english'))

In [7]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

In [8]:
food_desc_doc = [tokenizer.tokenize(x) for x in df['food_description']]

In [9]:
food_desc_doc_no_stop = [word for word in food_desc_doc if word not in list(stops)]

In [71]:
len(food_desc_doc_no_stop)

469828

In [72]:
len(df)

469828

In [10]:
food_desc_set =[]
for line in food_desc_doc_no_stop:
    for word in line:
        food_desc_set.append(word)
food_desc_set = set(food_desc_set)

In [11]:
food_desc_list = list(food_desc_set)

One-hot encoding manually

In [12]:
print("The longest food desc is "+str(max(map(lambda x: len(x),food_desc_doc)))+" words.")

The longest food desc is 39 words.


To use sklearn's one-hot encoding, all vecs must be same size. So will do manually

In [13]:
# Declaring empty one hot dataframe
food_desc_list.append('food_key')
hot_df = pd.DataFrame(columns=food_desc_list)

In [62]:
hot_df.head()

Unnamed: 0,CABLE,LUNDBERG,liv,00038000941450,NAZOOK,ODEN,CAREMEL,BMP,DEBBIE,PAANI,...,00024100126606,EMMY,PROPER,Zags,Feta,TURRO,Churrascaritas,Granular,CUDDLY,food_key


In [77]:
line_vec = list(np.repeat(0,len(food_desc_list)))

In [78]:
len(line_vec)

47290

In [None]:
# So slow, it might need to be done in Spark
for i in range(len(df)):
    line_vec = list(np.repeat(0,len(food_desc_list)))
    for word in food_desc_doc_no_stop[i]:
        line_vec[food_desc_list.index(word)] =1
    line_vec[len(line_vec)-1] = df['food_key'].iloc[i]
    hot_df.append([line_vec], ignore_index=False)
    if i%1000 == 0:
        print(i)

  result = result.union(other)


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


Bigrams

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BiGramCollocationFinder.from_words(food_desc_doc_no_stop)
finder.nbest(bigram_measures.pmi, 100) # Find top 100 bigrams

Trigrams

In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TriGramCollocationFinder.from_words(food_desc_doc_no_stop)
finder.nbest(Trigram_measures.pmi, 100) # Find top 100 bigrams

IDF

Vector Space Model

In [79]:
food_desc_toks = itertools.chain.from_iterable(food_desc_doc)

In [80]:
#Word To ID:
food_desc_by_id = {token: x for x, token in enumerate(set(food_desc_toks))}

In [82]:
len(food_desc_by_id)

52726

In [91]:
# Convert back to each token for each id
food_desc_ids = [[food_desc_by_id[token] for token in food_desc_tok] for food_desc_tok in food_desc_doc]

In [97]:
food_desc_vec = MultiLabelBinarizer()

In [98]:
food_desc_hot = food_desc_vec.fit_transform(food_desc_ids)

MemoryError: 

In [33]:
df.head()

Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,calories,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
0,234617,"Egg Cream, Chocolate flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,,,211.0,...,1.5,42.87,0.0,6.5,191.41,12.5,180.0,2.16,250.0,1.2
1,234618,"Egg Cream, Vanilla flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,,,205.0,...,1.5,44.5,0.0,44.5,85.0,12.5,180.0,0.0,250.0,1.2
2,234619,Eggs Benedict,5 & Diner,restaurant,diet_facts_restaurants,two poached eggs on top of sliced ham and an E...,1 order,,,702.0,...,24.12,36.44,1.35,10.01,1864.89,562.57,220.0,4.14,2000.0,21.0
3,234620,Eggs Maximilian with Chorizo,5 & Diner,restaurant,diet_facts_restaurants,,1 order,,,1353.0,...,27.58,75.24,6.3,6.98,2033.99,751.81,440.0,7.38,4150.0,79.2
4,234621,Fish and Chips,5 & Diner,restaurant,diet_facts_restaurants,"battered cod served with French fries, corn on...",1 order,,,672.0,...,1.09,77.55,1.08,0.94,1904.2,109.72,140.0,4.86,650.0,3.6


Getting an error because of different lengths per list. https://stackoverflow.com/questions/42391165/how-to-one-hot-encode-variant-length-features

Now getting a memory error... Use Keras?

In [73]:
enumerate(set(test_tok))

<enumerate at 0x197a1f30>

In [None]:
bow = ''

In [None]:
all_tokens = nltk.tokenize(df['food_description'])

## Food Group Classifier

Features:
1. One-hot encoded tokenized words from food description
2. One-hot encoded tokenized bi-grams from food description
3. All nutritional information
4. Word vector of food desription
5. Fuzzy-match score of string similarity to major food words

Steps:
1. Cluster on nutritional info to find likely groups
2. K-nearest neighbors on nutritional info as tag/food group classification
3. More intricate food classifier algorithms

### Tokenizing and Phrasing Food Description 

1. Tokenize
2. 2-gram phrases
3. Remove brand tokens or bi-gram
4. TF/IDF of each token or big-gram
5. Create a word vector for each description
6. One-hot encode since this is a finite dictionary

In [6]:
df.food_description[df.source=='usda_branded'][0:5]

235133                         AARDVARK HABENERO HOT SAUCE,
235134            AARON'S BEST, OVEN ROASTED TURKEY BREAST,
235135    A&B AMERICAN STYLE, MORE HEAT SMALL BATCH PEPP...
235136            A&B AMERICAN STYLE, ORGANIC PEPPER SAUCE,
235137            A&B AMERICAN STYLE, PEPPER SAUCE, GARLIC,
Name: food_description, dtype: object

These descriptions often contain the brand of the item, which is going to be less relevant to the meaning wanted out of the word vector exercise

In [7]:
df[['food_description', 'brand']][df.source=='usda_branded'][0:15]

Unnamed: 0,food_description,brand
235133,"AARDVARK HABENERO HOT SAUCE,",Secret Aardvark Trading Company
235134,"AARON'S BEST, OVEN ROASTED TURKEY BREAST,",Agri Star Meat & Poultry LLC
235135,"A&B AMERICAN STYLE, MORE HEAT SMALL BATCH PEPP...",A & B AMERICAN STYLE LLC
235136,"A&B AMERICAN STYLE, ORGANIC PEPPER SAUCE,",A & B AMERICAN STYLE LLC
235137,"A&B AMERICAN STYLE, PEPPER SAUCE, GARLIC,",Namaste Foods
235138,"A&B AMERICAN STYLE, SMALL BATCH PEPPER SAUCE, ...",A & B AMERICAN STYLE LLC
235139,"A. BAUER'S, PREPARED MUSTARD,",August Bauer's Sons Inc.
235140,"ABBA-ZABA, SNACK SIZE BITES CANDY,","Annabelle Candy Co., Inc."
235141,"ABBA-ZABA'S, TAFFY, WILD STRAWBERRY, SOUR,","Annabelle Candy Co., Inc."
235142,"ABBEY FARM, RHUBARB & GINGER PRESERVE,",Bewley Irish Imports


Unhelpful! Some brands are nicely embedded in the food description, seperated by a comma. Others contain only a part of the brand name in the product description (such as Secret Aardvark Trading Company's Aardvark Habenero Hot Sauce). But a fuzzy match on the brand name within a food description would likely also strip away helpful words like 'Candy' or 'Beverage'. 

So ultimately it doesn't look like there are helpful transformations for this subset

In [19]:
df.food_description[400000:400100]

400000                         PREMIUM MEATS & CHEESES,
400001                         PREMIUM MEATS & CHEESES,
400002                      PREMIUM MILK CHOCOLATE BAR,
400003                                    PREMIUM MILK,
400004                         PREMIUM MEATS & CHEESES,
400005                         PREMIUM MEATS & CHEESES,
400006       PREMIUM MEATS HONEY ROASTED TURKEY BREAST,
400007       PREMIUM MEATS HONEY ROASTED TURKEY BREAST,
400008                              PREMIUM MEAT SNACK,
400009        PREMIUM MEATS OVEN ROASTED TURKEY BREAST,
400010                                   PREMIUM MEATS,
400011         PREMIUM MEDITERRANEAN STYLE FETA CHEESE,
400012                           PREMIUM MEDIUM RELISH,
400013                       PREMIUM MERLOT WINE JELLY,
400014                     PREMIUM MEYER ROLLED WAFERS,
400015                       PREMIUM MICROWAVE POPCORN,
400016                          PREMIUM MILD BBQ SAUCE,
400017                                   SELTZER

## Using Nutritional Information To Supplement Grouping

1. Cluster items by nutritional profile. (k-means)
2. Use distance from each group to help classify
3. Use basic classification to predict food group according to self-labeled item

## Part 2: Vector Space Modeling

In [5]:
import gensim



Creating an baseline Word2Vec model with no negative sampling

In [6]:
text_8_path = 'C:/Users/J/Desktop/Businesses/Meal_Maker/Food Classifications/text8/text8'

In [7]:
text_8 = gensim.models.word2vec.Text8Corpus(text_8_path)

In [14]:
phrases = gensim.models.phrases.Phrases(text_8)

In [15]:
food_phrases = gensim.models.Word2Vec(phrases)



TypeError: 'int' object is not iterable

In [10]:
food_names.most_similar('eggs benedict')

NameError: name 'food_names' is not defined

In [16]:
food_names.most_similar('apple')

  if __name__ == '__main__':


[('macintosh', 0.7894327640533447),
 ('amiga', 0.7420880794525146),
 ('intel', 0.7306458353996277),
 ('ibm', 0.7269545793533325),
 ('amd', 0.690711259841919),
 ('atari', 0.6872498989105225),
 ('pc', 0.679271936416626),
 ('nintendo', 0.674188494682312),
 ('hypercard', 0.6720134615898132),
 ('microsoft', 0.6710932850837708)]

In [17]:
food_names.most_similar('apple pie')

  if __name__ == '__main__':


KeyError: "word 'apple pie' not in vocabulary"

In [23]:
df['food_description'][0]

'Cilantro Lime Dressing'

Iterating to See How Many Raw Descriptions Are Supported In the Text8 Corpus

In [26]:
food_vec = []
results_holder = []
for i in range(len(df)):
    try:
        word_vec = food_names[df['food_description'][1]]
        food = df['food_description'][1]
        food_vec = food_vec.append(food),
        results_holder = results_holder.append(word_vec[0:10])
    except:
        pass



In [None]:
print(len(food_vec))

Two learnings here. First and foremost, this corpus is not specific enough for the exercise. 'Apple' should be returning 'fruit' and 'snack' but not 'macintosh' and 'microsoft'. Even if I had been able to nicely process my food descriptions cleanly, the results from this corpus still leave much to be desired.

Second, the model (and corpus) needs to support phrases. Entering 'apple pie' returned no results even though there is almost certainly an apple pie Wikipedia entry in these results. 

Phrases can be supported by using Negative Sampling the final network layer rather than heirarchical softmax. We can test simple implementation by retraining on the text_8 corpus, using negative sampling with 5 noise words included.

In [36]:
food_phrases = gensim.models.Word2Vec(text_8, hs=0, negative = 2)

In [23]:
food_names.most_similar('oz')

  if __name__ == '__main__':


[('wizard', 0.7509090900421143),
 ('carol', 0.702263593673706),
 ('judy', 0.6979846954345703),
 ('betty', 0.6864558458328247),
 ('biopic', 0.6815564632415771),
 ('potter', 0.6814549565315247),
 ('gloria', 0.680546760559082),
 ('doc', 0.679688572883606),
 ('remake', 0.6795153021812439),
 ('mister', 0.6771171689033508)]

## Part3: Serving Sizes

A human meal planner can look at a serving size '1 can' and decide 'I can eat half a can now, and half a can later'. A machine can perform at least a somewhat similar decision if and only if it can understand that '1' is the amount and 'can' is the unit. This exercise will attempt to use text processing to seperate values and units, and at best to normalize units to the same type (i.e. 'oz' and 'ounces'). 

To start with we investigate a sample of the 'serving size' column below.

In [19]:
df[['ss_numbers_raw', 'serving_size_raw']][10000:10010]

Unnamed: 0,ss_numbers_raw,serving_size_raw
10000,"[8.0, 240.0]",8 fl oz 240 mL
10001,"[12.0, 355.0]",12 fl oz 355 mL
10002,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10003,"[8.0, 240.0]",8 fl oz 240 mL
10004,"[8.0, 240.0]",8 fl oz 240 mL
10005,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10006,"[8.0, 240.0]",2/5 bottle 8 fl oz 240 mL
10007,"[1.0, 12.0, 355.0]",1 can 12 fl oz 355 mL
10008,"[8.0, 240.0]",8 fl oz 240 mL
10009,"[8.0, 240.0]",8 fl oz 240 mL


In [3]:
df['ss_strings_raw'] = ''
for i in range(len(df.serving_size_raw)):
    line_strings = []
    line = df.iloc[i,6]
    if type(line) == str:
        for t in line.split():
            try:
                float(t)
            except ValueError:
                line_strings.append(t)
        df['ss_strings_raw'][i]= line_strings
    else:
        df['ss_strings_raw'][i]=''

NameError: name 'df' is not defined

In [21]:
df[['ss_strings_raw', 'serving_size_raw']][10000:10010]

Unnamed: 0,ss_numbers_raw,serving_size_raw
10000,"[fl, oz, mL]",8 fl oz 240 mL
10001,"[fl, oz, mL]",12 fl oz 355 mL
10002,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10003,"[fl, oz, mL]",8 fl oz 240 mL
10004,"[fl, oz, mL]",8 fl oz 240 mL
10005,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10006,"[2/5, bottle, fl, oz, mL]",2/5 bottle 8 fl oz 240 mL
10007,"[can, fl, oz, mL]",1 can 12 fl oz 355 mL
10008,"[fl, oz, mL]",8 fl oz 240 mL
10009,"[fl, oz, mL]",8 fl oz 240 mL


In [38]:
df.columns.values

array(['food_key', 'food_description', 'brand', 'food_type_grp', 'source',
       'ingredients_list', 'serving_size_raw', 'serving_size_val',
       'serving_size_unit', 'calories', 'protein_g', 'fat_g',
       'saturated_fat_g', 'carb_g', 'fiber_g', 'sugar_g', 'sodium_mg',
       'cholesterol_mg', 'calcium_mg', 'iron_mg', 'vit_a_mcg', 'vit_c_mg',
       'ss_numbers_raw'], dtype=object)

In [16]:
df['serving_size_raw'] = df['serving_size_raw'].fillna('1 item')

In [17]:
df.head()

Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,calories,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
0,234617,"Egg Cream, Chocolate flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,1 item,,211.0,...,1.5,42.87,0.0,6.5,191.41,12.5,180.0,2.16,250.0,1.2
1,234618,"Egg Cream, Vanilla flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,1 item,,205.0,...,1.5,44.5,0.0,44.5,85.0,12.5,180.0,0.0,250.0,1.2
2,234619,Eggs Benedict,5 & Diner,restaurant,diet_facts_restaurants,two poached eggs on top of sliced ham and an E...,1 order,1 item,,702.0,...,24.12,36.44,1.35,10.01,1864.89,562.57,220.0,4.14,2000.0,21.0
3,234620,Eggs Maximilian with Chorizo,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1 item,,1353.0,...,27.58,75.24,6.3,6.98,2033.99,751.81,440.0,7.38,4150.0,79.2
4,234621,Fish and Chips,5 & Diner,restaurant,diet_facts_restaurants,"battered cod served with French fries, corn on...",1 order,1 item,,672.0,...,1.09,77.55,1.08,0.94,1904.2,109.72,140.0,4.86,650.0,3.6


In [19]:
df = df[0:500]

In [29]:
df['serving_size_raw'] = df['serving_size_raw'].fillna('1 item')
for i in range((len(df.serving_size_raw))):
    line_strings = []
    line_floats = []
    float_strs = []
    frac_orig = 0
    line = df.iloc[i,6]
    if type(line) == str:
        for t in line.split():
        ##### Splits All Into Strings and Floats 
            try:
                float(t)
                line_floats.append(float(t))
                float_strs.append(t)
            except ValueError:
                try: 
                    float(Fraction(t))
                    line_floats.append(float(Fraction(t)))
                    float_strs.append(t)
                except ValueError:
                    line_strings.append(t)
        ##### If more than 1 value, set first value to unit, and set unit to the string between that value and the next value
        if len(line_floats) > 1:
            df.serving_size_val.iloc[i] = line_floats[0]
            # This way it doesn't seperate 'fl and 'oz'
            text1 = line[(line.find(float_strs[0])+len(float_strs[0])):(line.find(float_strs[1])-1)].strip()
            df.serving_size_unit.iloc[i] = text1
        ##### If
        elif len(line_floats) == 1:
            df.serving_size_val.iloc[i] = line_floats[0] # 
            df.serving_size_unit.iloc[i] = line[(line.find(float_strs[0])+len(float_strs[0])):len(line)].strip()
        ##### Empty line floats - set value to 1 and unit to the full string
        else:
            df.serving_size_val.iloc[i] = 1
            df.serving_size_unit.iloc[i] = line
    else:
        #################
        #     df is a single number or non-string
        #################
        df.serving_size_val.iloc[i] = line
df.to_csv('C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_processed_ss.csv')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [30]:
df

Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,calories,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
0,234617,"Egg Cream, Chocolate flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,1.0,egg cream,211.0,...,1.50,42.87,0.00,6.50,191.41,12.50,180.0,2.16,250.0,1.2
1,234618,"Egg Cream, Vanilla flavored",5 & Diner,restaurant,diet_facts_restaurants,,1 egg cream,1.0,egg cream,205.0,...,1.50,44.50,0.00,44.50,85.00,12.50,180.0,0.00,250.0,1.2
2,234619,Eggs Benedict,5 & Diner,restaurant,diet_facts_restaurants,two poached eggs on top of sliced ham and an E...,1 order,1.0,order,702.0,...,24.12,36.44,1.35,10.01,1864.89,562.57,220.0,4.14,2000.0,21.0
3,234620,Eggs Maximilian with Chorizo,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1.0,order,1353.0,...,27.58,75.24,6.30,6.98,2033.99,751.81,440.0,7.38,4150.0,79.2
4,234621,Fish and Chips,5 & Diner,restaurant,diet_facts_restaurants,"battered cod served with French fries, corn on...",1 order,1.0,order,672.0,...,1.09,77.55,1.08,0.94,1904.20,109.72,140.0,4.86,650.0,3.6
5,234622,Fish Tacos,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1.0,order,821.0,...,15.92,56.91,5.16,4.01,1458.57,164.34,730.0,3.06,1450.0,10.2
6,234623,French Dip,5 & Diner,restaurant,diet_facts_restaurants,tender slices of roast beef served on a hoagie...,1 sandwich,1.0,sandwich,474.0,...,5.01,56.00,3.00,4.00,1850.00,50.00,100.0,4.86,0.0,0.0
7,234624,French Fries,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1.0,order,649.0,...,5.01,92.21,8.01,0.04,1389.71,0.00,0.0,2.88,200.0,24.0
8,234625,French Onion Steak Dinner,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1.0,order,661.0,...,13.89,17.84,2.73,2.60,2406.28,157.99,60.0,5.58,1050.0,5.4
9,234626,French Toast,5 & Diner,restaurant,diet_facts_restaurants,,1 order,1.0,order,665.0,...,4.28,91.09,6.16,33.48,854.90,362.40,270.0,5.94,450.0,0.6


In [27]:
df['serving_size_raw'].iloc[313]

'1 piece  25g  1/4 order'

In [28]:
line_floats

['1', 0.25]

In [19]:
df.serving_size_val[df.serving_size_val.isnull()] = 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [21]:
df[100000:100010]

Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,calories,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
100000,334695,Silky Smooth Milk Chocolate Promises - Almond,Dove,grocery,fat_secret,,5 pieces (39g),5.0,pieces (39g),210.0,...,7.00,21.00,2.00,19.00,20.00,5.00,60.0,0.36,0.0,0.0
100001,334696,Milk Chocolate Truffle Eggs,Dove,grocery,fat_secret,,1 egg (25.5g),1.0,egg (25.5g),150.0,...,7.00,14.00,1.00,13.00,15.00,5.00,20.0,0.00,0.0,0.0
100002,334697,Roasted Almonds,Dove,grocery,fat_secret,,13 pieces (39g),13.0,3 pieces (39g),210.0,...,6.00,19.00,3.00,14.00,10.00,5.00,40.0,1.08,0.0,0.0
100003,334698,Cookies & Creme,Dove,grocery,fat_secret,,5 pieces (37g),5.0,pieces (37g),200.0,...,7.00,21.00,0.00,20.00,65.00,10.00,80.0,0.00,0.0,0.0
100004,334699,Sugar Free Dove Rich Dark Chocolates with Rasp...,Dove,grocery,fat_secret,,5 peices (40g),5.0,peices (40g),190.0,...,10.00,6.00,3.00,0.00,0.00,5.00,0.0,1.08,0.0,0.0
100005,334700,Vanilla Chocolate Chunk Ice Cream,Dove,grocery,fat_secret,,1/2 cup (65g),1.0,1/2 cup (65g),180.0,...,7.00,17.00,0.00,15.00,35.00,30.00,80.0,0.00,200.0,0.0
100006,334701,Chocolate Chai Tea,Dove,grocery,fat_secret,,1 pouch (34g),1.0,pouch (34g),140.0,...,3.00,26.00,1.00,22.00,110.00,0.00,80.0,0.00,0.0,0.0
100007,334702,Silky Smooth White & Milk Chocolate Swirl,Dove,grocery,fat_secret,,9 pieces,9.0,pieces,230.0,...,9.00,25.00,1.00,24.00,40.00,10.00,60.0,0.00,100.0,0.0
100008,334703,Mint Chocolate Chunk,Dove,grocery,fat_secret,,1/2 cup (69g),1.0,1/2 cup (69g),180.0,...,7.00,17.00,1.00,14.00,35.00,30.00,60.0,0.00,300.0,0.0
100009,334704,Silky Smooth Dark Chocolate Bar - Cranberry Al...,Dove,grocery,fat_secret,,1 bar (33g),1.0,bar (33g),170.0,...,6.00,20.00,2.00,16.00,10.00,5.00,20.0,0.72,0.0,0.0


In [22]:
df_full = df

In [23]:
df = df[100000:100010]

In [24]:
df

Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,calories,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
100000,334695,Silky Smooth Milk Chocolate Promises - Almond,Dove,grocery,fat_secret,,5 pieces (39g),5.0,pieces (39g),210.0,...,7.0,21.0,2.0,19.0,20.0,5.0,60.0,0.36,0.0,0.0
100001,334696,Milk Chocolate Truffle Eggs,Dove,grocery,fat_secret,,1 egg (25.5g),1.0,egg (25.5g),150.0,...,7.0,14.0,1.0,13.0,15.0,5.0,20.0,0.0,0.0,0.0
100002,334697,Roasted Almonds,Dove,grocery,fat_secret,,13 pieces (39g),13.0,3 pieces (39g),210.0,...,6.0,19.0,3.0,14.0,10.0,5.0,40.0,1.08,0.0,0.0
100003,334698,Cookies & Creme,Dove,grocery,fat_secret,,5 pieces (37g),5.0,pieces (37g),200.0,...,7.0,21.0,0.0,20.0,65.0,10.0,80.0,0.0,0.0,0.0
100004,334699,Sugar Free Dove Rich Dark Chocolates with Rasp...,Dove,grocery,fat_secret,,5 peices (40g),5.0,peices (40g),190.0,...,10.0,6.0,3.0,0.0,0.0,5.0,0.0,1.08,0.0,0.0
100005,334700,Vanilla Chocolate Chunk Ice Cream,Dove,grocery,fat_secret,,1/2 cup (65g),1.0,1/2 cup (65g),180.0,...,7.0,17.0,0.0,15.0,35.0,30.0,80.0,0.0,200.0,0.0
100006,334701,Chocolate Chai Tea,Dove,grocery,fat_secret,,1 pouch (34g),1.0,pouch (34g),140.0,...,3.0,26.0,1.0,22.0,110.0,0.0,80.0,0.0,0.0,0.0
100007,334702,Silky Smooth White & Milk Chocolate Swirl,Dove,grocery,fat_secret,,9 pieces,9.0,pieces,230.0,...,9.0,25.0,1.0,24.0,40.0,10.0,60.0,0.0,100.0,0.0
100008,334703,Mint Chocolate Chunk,Dove,grocery,fat_secret,,1/2 cup (69g),1.0,1/2 cup (69g),180.0,...,7.0,17.0,1.0,14.0,35.0,30.0,60.0,0.0,300.0,0.0
100009,334704,Silky Smooth Dark Chocolate Bar - Cranberry Al...,Dove,grocery,fat_secret,,1 bar (33g),1.0,bar (33g),170.0,...,6.0,20.0,2.0,16.0,10.0,5.0,20.0,0.72,0.0,0.0


In [36]:
line_strings = []
line_floats = []
for t in test.split():
    try
        float(t)
        line_floats.append(t)
    except ValueError:
        line_strings.append(t)
if len(line_floats) > 1:
    text1 = test[(test.find(line_floats[0])+1):(test.find(line_floats[1])-1)].strip()

In [14]:
df.to_csv('C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_2018_4_18_processed.csv')

In [16]:
df_test = pd.read_csv("C:/Users/J/Desktop/Businesses/Meal_Maker/Scraped_Data/combined_nutrition_small/nutrition_sm_2018_4_18_processed.csv", encoding='ISO-8859-1')

  interactivity=interactivity, compiler=compiler, result=result)


In [18]:
df_test[100000:1000010]

Unnamed: 0.1,Unnamed: 0,food_key,food_description,brand,food_type_grp,source,ingredients_list,serving_size_raw,serving_size_val,serving_size_unit,...,saturated_fat_g,carb_g,fiber_g,sugar_g,sodium_mg,cholesterol_mg,calcium_mg,iron_mg,vit_a_mcg,vit_c_mg
100000,100000,334695,Silky Smooth Milk Chocolate Promises - Almond,Dove,grocery,fat_secret,,5 pieces (39g),5.0,pieces (39g),...,7.00,21.00,2.00,19.00,20.00,5.00,60.0,0.36,0.0,0.0
100001,100001,334696,Milk Chocolate Truffle Eggs,Dove,grocery,fat_secret,,1 egg (25.5g),1.0,egg (25.5g),...,7.00,14.00,1.00,13.00,15.00,5.00,20.0,0.00,0.0,0.0
100002,100002,334697,Roasted Almonds,Dove,grocery,fat_secret,,13 pieces (39g),13.0,3 pieces (39g),...,6.00,19.00,3.00,14.00,10.00,5.00,40.0,1.08,0.0,0.0
100003,100003,334698,Cookies & Creme,Dove,grocery,fat_secret,,5 pieces (37g),5.0,pieces (37g),...,7.00,21.00,0.00,20.00,65.00,10.00,80.0,0.00,0.0,0.0
100004,100004,334699,Sugar Free Dove Rich Dark Chocolates with Rasp...,Dove,grocery,fat_secret,,5 peices (40g),5.0,peices (40g),...,10.00,6.00,3.00,0.00,0.00,5.00,0.0,1.08,0.0,0.0
100005,100005,334700,Vanilla Chocolate Chunk Ice Cream,Dove,grocery,fat_secret,,1/2 cup (65g),,1/2 cup (65g),...,7.00,17.00,0.00,15.00,35.00,30.00,80.0,0.00,200.0,0.0
100006,100006,334701,Chocolate Chai Tea,Dove,grocery,fat_secret,,1 pouch (34g),1.0,pouch (34g),...,3.00,26.00,1.00,22.00,110.00,0.00,80.0,0.00,0.0,0.0
100007,100007,334702,Silky Smooth White & Milk Chocolate Swirl,Dove,grocery,fat_secret,,9 pieces,9.0,pieces,...,9.00,25.00,1.00,24.00,40.00,10.00,60.0,0.00,100.0,0.0
100008,100008,334703,Mint Chocolate Chunk,Dove,grocery,fat_secret,,1/2 cup (69g),,1/2 cup (69g),...,7.00,17.00,1.00,14.00,35.00,30.00,60.0,0.00,300.0,0.0
100009,100009,334704,Silky Smooth Dark Chocolate Bar - Cranberry Al...,Dove,grocery,fat_secret,,1 bar (33g),1.0,bar (33g),...,6.00,20.00,2.00,16.00,10.00,5.00,20.0,0.72,0.0,0.0
