# Recipe cuisine classification - training and deployment

### Table of contents

* [Description](#description)
* [Getting the data from S3](#get_data)
* [Text Processing](#process)
  * [Writing a Custom Transformer](#custom_tx)
  * [Words to Remove](#words_remove)
* [Create SKLearn Pipeline](#pipeline)
* [Deploy the model](#deploy)
* [Test the prediction function](#prediction_request)
* [Endpoint cleanup](#endpoint_cleanup)

## Description <a class="anchor" id="description"></a>
This SageMaker notebook trains a logistic regression model on recipe data that was scraped using Lambda functions. More detail on this project is in my GitHub repo and blog post. The fitted model is pickled and served as an endpoint. Prediction examples are shown inline in the notebook.

In [2]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Getting the data from S3 <a class="anchor" id="get_data"></a>
The scraped recipe data is in an S3 bucket, *recipes-allrecipes*. Each recipe is stored as a JSON file. The code below retrieves these JSON files, reads each into a pandas DataFrame, and groups the ingredients so that each row is one recipe and the `ingredients` column holds a list of all of that recipe's ingredients. Grouping leaves the other fields in the index, so to get all the columns on the same level I flatten the DataFrame with `.to_records()`: `test_df = pd.DataFrame(recipe_df.to_records())`.
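
To make the flattening step concrete, here is a minimal sketch on made-up data (the three toy rows and the reduced set of columns are assumptions, not the scraped recipes). It shows that the group-by leaves the key fields in a MultiIndex and that `.to_records()` brings them back as ordinary columns; `reset_index()` is the more common way to get the same layout.

```python
import pandas as pd

# Toy long-format data: one row per (recipe, ingredient) pair. Values are made up.
rows = pd.DataFrame({
    "id": [1, 1, 2],
    "title": ["Dal", "Dal", "Crepes"],
    "category": ["indian", "indian", "french"],
    "ingredients": ["lentils", "cumin", "flour"],
})

# Collect each recipe's ingredients into a list; the group keys become a MultiIndex.
grouped = rows.groupby(["id", "title", "category"])["ingredients"].apply(list).to_frame()

# Flatten: the index levels become regular columns again.
flat = pd.DataFrame(grouped.to_records())

# The more common way to reach the same column layout.
flat_alt = grouped.reset_index()

print(list(flat.columns))
print(flat.loc[0, "ingredients"])
```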

In [3]:
import boto3
import json
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('recipes-allrecipes')

bucket_recipes = "recipes-allrecipes"
recipe_df = None
try:
    for obj in my_bucket.objects.all():
        key = obj.key
        if key.endswith('.json'):  # only read .json files
            data_location = 's3://{}/{}'.format(bucket_recipes, key)
            df = pd.read_json(data_location)
            # one row per recipe, with its ingredients collected into a list
            df = df.groupby(['id', 'title', 'description', 'rating', 'reviews', 'url', 'category'])['ingredients'].apply(list).to_frame()

            # append inside the .json check so a non-JSON key cannot re-append the previous df
            if recipe_df is None:
                recipe_df = df
            else:
                recipe_df = pd.concat([recipe_df, df])

except KeyError:
    print("No objects in the s3 bucket")

In [4]:
#flatten the dataframe
test_df=pd.DataFrame(recipe_df.to_records())

In [5]:
test_df.count()

id             3653
title          3653
description    3653
rating         3653
reviews        3653
url            3653
category       3653
ingredients    3653
dtype: int64

## Text Processing <a class="anchor" id="process"></a>
Now that the data is retrieved, I need to process it so the model can train on it. I reuse the work from my local Jupyter notebook and adapt it to run here in SageMaker.

### Writing a Custom Transformer <a class="anchor" id="custom_tx"></a>
When working locally, I did not make predictions on new data: I split the data I had scraped into train/test sets only to measure model performance. Since I will be using a scikit-learn pipeline for the implementation and plan to use it to predict on new data (in the API endpoint), I need to adapt the `parse_recipes` function so that it can run as a pipeline step. This is not strictly necessary, but I would rather have everything in one pipeline.

A `FunctionTransformer` helps introduce arbitrary, stateless transforms (simple functions) into a `Pipeline`, while a class built on `TransformerMixin` can also keep state between `fit` and `transform`. Using a `TransformerMixin` class is overkill for what I need to do, but since I was able to get it to work with the recipe data, I am using it for now.
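
As a rough illustration of the two options, the sketch below wraps the same stateless step once with `FunctionTransformer` and once as a `TransformerMixin` class. The `join_ingredients` function and the `JoinIngredients` class are hypothetical stand-ins, much simpler than the recipe parsing used in this notebook.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


def join_ingredients(recipes):
    """Hypothetical stateless step: one lowercase string per recipe."""
    return [" ".join(ingredient.lower() for ingredient in recipe) for recipe in recipes]


class JoinIngredients(TransformerMixin, BaseEstimator):
    """The same step written as a class; fit() has nothing to learn."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return join_ingredients(X)


recipes = [["Basil", "Tomato"], ["Soy sauce", "Ginger", "Garlic"]]

# validate=False leaves the list of lists untouched instead of converting it to an array.
via_function = FunctionTransformer(join_ingredients, validate=False).fit_transform(recipes)
via_class = JoinIngredients().fit_transform(recipes)

print(via_function)
print(via_class)
```

Both objects can be used as a `Pipeline` step. The class version is longer, but it leaves room to store something in `fit` later if the transform ever needs state.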



In [36]:
import string
from sklearn.base import BaseEstimator, TransformerMixin

# Note: lemmatizer, measures, common_remove, data_leaks and useless_singles are
# defined in the "Words to Remove" cell, which must be run before this class is used.
class TransformRecipe(TransformerMixin, BaseEstimator):
    """Turn each recipe (a list of ingredient strings) into one cleaned string."""
    def fit(self, X, y=None):
        # stateless: nothing to learn from the data
        return self

    def transform(self, X, **kwargs):
        recipe_string_list = []

        for recipe in X:
            ingr_list = []
            recipe_string = ''  # stays empty if the recipe has no ingredients
            for ingredient in recipe:
                words = ingredient.split()
                # strip punctuation, keep alphabetic tokens only, lowercase, lemmatize
                words = [''.join(c for c in word if c not in string.punctuation) for word in words]
                words = [word for word in words if word.isalpha()]
                words = [word.lower() for word in words]
                words = [lemmatizer.lemmatize(word) for word in words]
                # drop measures, filler words and cuisine names that would leak the label
                words = [word for word in words if word not in measures]
                words = [word for word in words if word not in common_remove]
                words = [word for word in words if word not in data_leaks]
                # get rid of any blanks
                words = list(filter(None, words))
                if len(words) <= 3:
                    ingr_list.append(' '.join(words))

                words = [word for word in words if word not in useless_singles]
                # easiest way to deal with any duplicates for now
                ingr_list = list(set(ingr_list))
                if len(words) > 3:  # handle rare case
                    ingr_list.append(' '.join(words))
                recipe_string = ' '.join(ingr_list)

            recipe_string_list.append(recipe_string)

        return recipe_string_list

### Words to Remove <a class="anchor" id="words_remove"></a>
This code and these word lists are copied over as is.

In [8]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

#Source for list below
#https://en.wikipedia.org/wiki/Cooking_weights_and_measures
#https://thebakingpan.com/ingredient-weights-and-measures/
measures=['litres','liter','millilitres','mL','grams','g', 'kg','teaspoon','tsp', 'tablespoon','tbsp','fluid', 'ounce','oz','fl.oz', 'cup','pint','pt','quart','qt','gallon','gal','smidgen','drop','pinch','dash','scruple','dessertspoon','teacup','cup','c','pottle','gill','dram','wineglass','coffeespoon','pound','pounded','lb','tbsp','plus','firmly', 'packed','lightly','level','even','rounded','heaping','heaped','sifted','bushel','peck','stick','chopped','sliced','halves', 'shredded','slivered','sliced','whole','paste','whole','fresh', 'peeled', 'diced','mashed','dried','frozen','fresh','peeled','candied','no', 'pulp','crystallized','canned','crushed','minced','julienned','clove','head', 'small','large','medium', 'torn', 'cleaned', 'degree']

measures = [lemmatizer.lemmatize(m) for m in measures]
#some of these include data leakage words, like 'italian' - ok to remove after including bigrams
data_leaks = ['italianstyle', 'french','thai', 'chinese', 'mexican','spanish','indian','italian']

common_remove=['ground','to','taste', 'and', 'or',  'can',  'into', 'cut', 'grated', 'leaf','package','finely','divided','a','piece','optional','inch','needed','more','drained','for','flake','dry','thinly','cubed','bunch','cube','slice','pod','beaten','seeded','uncooked','root','plain','heavy','halved','crumbled','sweet','with','hot','room','temperature','trimmed','allpurpose','deveined','bulk','seasoning','jar','food','if','bag','mix','in','each','roll','instant','double','such','frying','thawed','whipping','stock','rinsed','mild','sprig','freshly','toasted','link','boiling','cooked','unsalted','container',
'cooking','thin','lengthwise','warm','softened','thick','quartered','juiced','pitted','chunk','melted','cold','coloring','puree','cored','stewed','floret','coarsely','the','blanched','zested','sweetened','powdered','garnish','dressing','soup','at','active','lean','chip','sour','long','ripe','skinned','fillet','from','stem','flaked','removed','stalk','unsweetened','cover','crust', 'extra', 'prepared', 'blend', 'of', 'ring',  'undrained', 'about', 'zest', ' ', '', 'spray', 'round', 'herb', 'seasoned', 'wedge', 'bitesize', 'broken', 'square', 'freshly', 'thickly', 'diagonally']
common_remove = [lemmatizer.lemmatize(c) for c in common_remove]
data_leaks = [lemmatizer.lemmatize(d) for d in data_leaks]
# due to using bigrams not including 
useless_singles=['','black','white','red','yellow','seed','breast','confectioner','sundried','broth','bell','baby','juice','crumb','sauce','condensed','smoked','basmati','extravirgin','brown','clarified', 'soy', 'filling', 'pine', 'virgin', 'romano', 'heart', 'shell', 'thigh', 'boneless','skinless','split', 'dark', 'wheat', 'light', 'green', 'vegetable', 'curry', 'orange', 'garam', 'sesame', 'strip', 'sea', 'canola', 'mustard','powder', 'ice', 'bay', 'roasted', 'loaf', 'roast', 'powder']
useless_singles = [lemmatizer.lemmatize(u) for u in useless_singles]

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
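
To see what these lists do to a single ingredient line, here is a simplified sketch. It assumes tiny stand-in word lists and skips lemmatization entirely, so it is not the exact cleaning applied in `TransformRecipe`, only the same filtering idea.

```python
import string

# Tiny stand-in lists for illustration; the real lists above are much longer.
measures = {"cup", "pound", "chopped"}
common_remove = {"finely", "ground", "or", "for"}


def clean_ingredient(ingredient):
    """Strip punctuation, keep alphabetic tokens, lowercase, drop listed words."""
    words = ingredient.split()
    words = ["".join(c for c in word if c not in string.punctuation) for word in words]
    words = [word.lower() for word in words if word.isalpha()]
    return [word for word in words if word not in measures and word not in common_remove]


print(clean_ingredient("1/4 cup finely chopped cilantro"))
print(clean_ingredient("1 pound ground beef"))
```

Quantities such as `1/4` disappear because they are not alphabetic once the slash is removed, and the measure and filler words are dropped by the list lookups, leaving only the ingredient name.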


## Create SKLearn Pipeline <a class="anchor" id="pipeline"></a>
Here I put together the pipeline, using the class-weighted logistic regression model. The `TransformRecipe` class I wrote above is the first step of the pipeline. It was easier for me to work with the data as a list of lists instead of a pandas DataFrame, and I can pass a list or a list of lists as input to the API, so since it works I won't change it now.
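
For readers unfamiliar with class weighting, the sketch below shows what `class_weight='balanced'` computes, using made-up label counts rather than the recipe categories. Each class gets the weight `n_samples / (n_classes * count)`, so rarer classes count more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, deliberately imbalanced labels (not the real category counts).
labels = np.array(["italian"] * 6 + ["indian"] * 3 + ["thai"] * 1)
classes = np.unique(labels)  # sorted: indian, italian, thai

# Weights as scikit-learn derives them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# The same formula written out by hand.
counts = np.array([np.sum(labels == c) for c in classes])
manual = len(labels) / (len(classes) * counts)

for name, weight in zip(classes, weights):
    print(name, round(float(weight), 3))
```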

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,StratifiedKFold, train_test_split
import pprint
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

skf=StratifiedKFold(n_splits=3)
pp = pprint.PrettyPrinter(indent=4)


print("\nresults for LR count vector")
lr_pipe = Pipeline([('parse_recipe_text', TransformRecipe()),
    ('vect', CountVectorizer(ngram_range=(1, 2))), 
    ('lr', LogisticRegression( max_iter=1000,random_state=123, 
    class_weight='balanced',multi_class='multinomial', solver='lbfgs'))])


cv=cross_validate(lr_pipe, test_df['ingredients'], test_df['category'].values, scoring='f1_weighted', 
                         cv=skf, return_train_score=True )

pp.pprint(cv)

X_train, X_test, y_train, y_test = train_test_split(test_df['ingredients'].values, test_df['category'], test_size=0.25, stratify=test_df['category'].tolist())

lr_pipe.fit(X_train, y_train)

# Pickle the pipeline after calling fit.
# sklearn.externals.joblib was removed in scikit-learn 0.23; import the standalone joblib package instead.
import joblib
joblib.dump(lr_pipe, 'lr_pipe.pkl')

# This puts the pickled model into an S3 bucket
key = 'model.pkl'
bucket='lrpickle'
url = 's3n://{}/{}'.format(bucket, key)
boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('lr_pipe.pkl')
print('Done writing to {}'.format(url))
        


results for LR count vector
{   'fit_time': array([2.373698  , 2.86831737, 2.66116476]),
    'score_time': array([0.63286567, 0.77275896, 0.68454027]),
    'test_score': array([0.85739242, 0.85650425, 0.85730251]),
    'train_score': array([0.99426657, 0.99266337, 0.99468617])}
Done writing to s3n://lrpickle/model.pkl
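
Whatever serves the model has to load the pickle back before it can predict. The sketch below round-trips a small stand-in pipeline through `joblib` in a temporary directory; the toy documents and labels are assumptions, and no S3 access is involved. For the real `lr_pipe`, the `TransformRecipe` class and the word lists it uses must also be importable in the process that calls `joblib.load`, because pickle stores a reference to the class rather than its code.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, made up for illustration.
docs = [
    "basil tomato mozzarella",
    "cumin coriander ginger",
    "tomato basil garlic",
    "ginger cumin turmeric",
]
labels = ["italian", "indian", "italian", "indian"]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(docs, labels)

# Dump to a temporary file and load it back, as a serving process would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipe.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

same_predictions = list(restored.predict(docs)) == list(toy_pipe.predict(docs))
print(same_predictions)
```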



## Deploy the model <a class="anchor" id="deploy"></a>

If you want to serve the model using SageMaker, the commented-out code below comes from the SageMaker template notebook that this one started from.

In [11]:
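# In the template, `sklearn` is presumably the SageMaker estimator object, not the
# scikit-learn package, so there is no `sklearn.deploy` module to import.
# predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.t2.medium")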

## Test the prediction function <a class="anchor" id="prediction_request"></a>



In [12]:

test_list = [
    ['1 egg, lightly beaten', '1 pound ground beef', '1 tomato, finely chopped',
              '1 red onion, finely chopped','1/4 cup finely chopped cilantro','1/4 cup finely chopped mint',
             '2 teaspoons ginger-garlic paste','2 teaspoons coriander seeds, crushed','1 teaspoon salt',
             '3/4 teaspoon ground cumin','3/4 teaspoon ground cayenne pepper',
              '1/4 cup vegetable oil for frying, or more as needed', '2 tomatoes, sliced into rounds'],
    ['vanilla ice cream', 'shortcake', 'sliced strawberries', 'whipped cream']
    ]


lr_pipe.predict(test_list)



array(['indian', 'french'], dtype='<U7')

Not bad ... The first recipe is actually a northern Pakistani one, but of the 6 classes, Indian would be the best match.
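
To judge how close a call like this was, `predict_proba` gives a score for every class instead of only the winner. This is a minimal sketch on a made-up toy pipeline, not on `lr_pipe`, so the printed numbers say nothing about the recipe model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, made up for illustration.
docs = [
    "basil tomato mozzarella", "tomato basil garlic",
    "cumin coriander ginger", "ginger cumin turmeric",
    "butter cream shallot", "cream butter tarragon",
]
labels = ["italian", "italian", "indian", "indian", "french", "french"]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(docs, labels)

query = ["cumin ginger tomato"]
proba = toy_pipe.predict_proba(query)[0]

# Pair each class label with its probability and sort from most to least likely.
ranked = sorted(zip(toy_pipe.classes_, proba), key=lambda pair: pair[1], reverse=True)
for cuisine, p in ranked:
    print(cuisine, round(float(p), 3))
```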

## Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up if you used SageMaker to deploy.

In [None]:
# sklearn.delete_endpoint()

