# Exploring Favourite Recipes

[AllRecipes](http://allrecipes.com/) is a recipe website where people can mark recipes as 'favourites'. A student named Jeremy Cohen [scraped some of this data for an excellent machine learning project](http://www.jeremymcohen.net/posts/taste/), and we'll use his dataset to demonstrate unsupervised machine learning with MLDB.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [2]:
from pymldb import Connection
mldb = Connection("http://localhost/")

The sequence of procedures below is based on the one explained in the [Mapping Reddit](/doc/nblink.html#_demos/Mapping Reddit) demo notebook.

In [3]:
print mldb.put('/v1/procedures/import_rcp', {
    "type": "import.text",
    "params": {
        "headers": ["user_id", "recipe_id"],
        "dataFileUrl": "https://raw.githubusercontent.com/jmcohen/taste/master/data/favorites.csv",
        "outputDataset": "rcp_raw",
        "runOnCreation": True
    }
})

print mldb.post('/v1/procedures', {
    "id": "rcp_import",
    "type": "transform",
    "params": {
        "inputData": "select pivot(recipe_id, 1) as * named user_id from rcp_raw group by user_id",
        "outputDataset": "recipes",
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
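
To make the second procedure more concrete, here is a minimal pure-Python sketch of what `pivot(recipe_id, 1) ... group by user_id` produces: one sparse row per user, with a column per favourited recipe set to 1. The ids and the `pivot_favourites` helper are hypothetical and only illustrate the reshaping; MLDB does this inside the `transform` procedure.

```python
from collections import defaultdict

# Hypothetical (user_id, recipe_id) pairs, standing in for rows of rcp_raw
favourites = [
    ("u1", "r10"),
    ("u1", "r42"),
    ("u2", "r42"),
    ("u3", "r7"),
]

def pivot_favourites(pairs):
    """Group pairs by user and build one sparse row per user: {recipe_id: 1}."""
    rows = defaultdict(dict)
    for user_id, recipe_id in pairs:
        rows[user_id][recipe_id] = 1
    return dict(rows)

user_rows = pivot_favourites(favourites)
```

Each key of `user_rows` plays the role of a row name in the `recipes` dataset, and each inner dictionary holds that user's non-empty columns.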


In [4]:
print mldb.post('/v1/procedures', {
    "id": "rcp_svd",
    "type" : "svd.train",
    "params" : {
        "trainingData": "select * from recipes",
        "columnOutputDataset" : "rcp_svd_embedding_raw",
        "runOnCreation": True
    }
})

print mldb.put('/v1/procedures/rcp_clean_svd', {
    'type': 'transform',
    'params': {
        'inputData': '''select * named jseval(
                            'return s.substr(0, s.indexOf("|"))',
                            's', rowName()) from rcp_svd_embedding_raw''',
        'outputDataset': {'id': 'rcp_svd_embedding',
                          'type': 'embedding',
                          'params': {'metric': 'cosine'}},
        'runOnCreation': True
    }
})

print mldb.post('/v1/procedures', {
    "id" : "rcp_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from rcp_svd_embedding",
        "outputDataset" : "rcp_kmeans_clusters",
        "centroidsDataset" : "rcp_kmeans_centroids",
        "numClusters" : 15,
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
<Response [201]>
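
For readers who have not seen k-means before, the sketch below shows the general idea behind the clustering step on a handful of made-up 2-D points. It is plain Lloyd's algorithm with Euclidean distance and caller-supplied starting centroids; those are simplifying assumptions for illustration, not a description of how `kmeans.train` is implemented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / float(len(members)) for dim in zip(*members))
    return assignment, centroids

# Made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = kmeans(points, centroids=[points[0], points[3]])
```

In the notebook, the points are the rows of `rcp_svd_embedding`, and the assignments and centroids end up in `rcp_kmeans_clusters` and `rcp_kmeans_centroids`.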


Let's load up the recipe names so we can see what we clustered. First we'll read the file and then extract and clean the recipe names.

In [5]:
print mldb.put('/v1/procedures/import_rcp_names_raw', {
    'type': 'import.text',
    'params': {
        'dataFileUrl': 'https://raw.githubusercontent.com/jmcohen/taste/master/data/recipes.csv',
        'outputDataset': "rcp_names_raw",
        'delimiter':'',
        'quotechar':'',
        'runOnCreation': True
    }
})

print mldb.put('/v1/procedures/rcp_names_import', {
    'type': 'transform',
    'params': {
        'inputData': '''
            select jseval(
               'return s.substr(s.indexOf(",") + 1)
                .replace(/&#34;/g, "")
                .replace(/&#174;/g, "");',
            's', lineText) as name
            named implicit_cast(rowName()) - 1
            from rcp_names_raw
        ''',
        'outputDataset': 'rcp_names',
        'runOnCreation': True
    }
})


<Response [201]>
<Response [201]>
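
The `jseval` expression in the second procedure can be read as the following Python, shown here as a rough equivalent for clarity. The sample line is made up, and `clean_recipe_name` is a hypothetical helper, not part of MLDB or `pymldb`.

```python
def clean_recipe_name(line_text):
    """Mirror the JavaScript: drop everything up to the first comma,
    then strip the HTML entities for the double quote and the (R) sign."""
    # str.find returns -1 when there is no comma, so the slice starts at 0,
    # which matches the behaviour of indexOf(",") + 1 in the JavaScript
    name = line_text[line_text.find(",") + 1:]
    return name.replace("&#34;", "").replace("&#174;", "")

# Hypothetical input line in the "something,name" shape the query assumes
cleaned = clean_recipe_name("123,Brand&#174; &#34;Best&#34; Brownies")
```

Here `cleaned` is `'Brand Best Brownies'`. Commas inside the name itself are preserved, because only the first comma is used as the cut point.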


In [6]:
def recipe_name(row_name):
    """ simple utility that returns the name of a recipe from its id (rowName() in the datasets) """
    return mldb.get('/v1/query', q="select name from rcp_names where rowName() = '{}'".format(row_name),
                   format='aos').json()[0]['name']

Let's look at the closest recipes to each cluster centroid to try to get a sense of what the clusters mean.

In [7]:
import pandas as pd
centroids = mldb.get('/v1/query', q="select * from rcp_kmeans_centroids order by implicit_cast(rowName())",
                      format='aos', rowNames="false").json()
rows = []
for c in centroids:
    neighbours = mldb.get('/v1/datasets/rcp_svd_embedding/routes/neighbours', numNeighbours=10, **c).json()
    rows.append([recipe_name(n[0]) for n in neighbours])
pd.DataFrame(rows)

centroid,0,1,2,3,4,5,6,7,8,9
0,Traditional Christmas Cheese Ball,Superb Sauteed Mushrooms,Old School Mac n' Cheese,Beer Burgers,Healthier Oven Roasted Potatoes,Mistakenly Zesty Pork Chops,Stuffed Potatoes,Noodles Alfredo,Bacon and Egg Breakfast Tarts,Fantastic Chicken Burgers
1,"Fried Cabbage with Bacon, Onion, and Garlic",Sausage Flowers,Ruby Drive Sloppy Joes,Sesame Noodle Salad,Tomato Bacon Squares,Italian Subs - Restaurant Style,Man-Lovin' Potatoes,The Meatball that Fell Off the Table,Johnny Marzetti Casserole,Sassy Tailgate Sandwiches
2,Vegan Red Lentil Soup,"Spinach, Red Lentil, and Bean Curry",Lentils And Spinach,Vegan Split Pea Soup I,Spinach Chickpea Curry,Moroccan-Style Stuffed Acorn Squash,Swiss Chard with Garbanzo Beans and Fresh Toma...,Mock Tuna Salad,Mediterranean Chickpea Salad II,Tomato-Curry Lentil Stew
3,Cheese Grits,Teriyaki Mushrooms,Country Scalloped Potatoes,Hot Artichoke Dip with Sun-Dried Tomatoes,Monte Cristo Sandwich,Tangy Sliced Pork Sandwiches,Three Cheese Macaroni with Tomatoes,Ricotta Cheese Pancakes,Chicken Creole,Berry Cobbler
4,Coffee Shake,Chocolate Wontons,Tasty Salad Seasoning,Fabulous Fargozas,Orange Juice Cake,Fugi Salad,Amber's Peanut Butter,Maple-Vanilla Syrup,Bananas in Caramel Sauce,We Be Jammin' Jamaican Banana Bread
5,Chicago Dip,Baked Potato Dip,Cheesy Potato Casserole,Fantastic Mexican Dip,Zippy Egg Casserole,Mini Reubens,Baked Potato Salad I,Spaghetti Salad I,Mary's Roasted Red Pepper Dip,Aunt Phyllis' Magnificent Cheese Ball
6,Lemon Chiffon Cake,Whipped Cream Filling,Pro Ganache,Strawberry Cream Roll,Glorious Sponge Cake,Fabulous Fudge Chocolate Cake,Wedding Cake Frosting,One Bowl Buttercream Frosting,Stabilized Whipped Cream Icing,Lemon Gold Cake
7,Chap Chee Noodles,African Curry,Asian Barbequed Steak,Killer Shrimp Soup,Baked Pork Spring Rolls,Adrienne's Tom Ka Gai,Thai Noodles,Sharon's Scrumptious Souvlaki,Barbequed Thai Style Chicken,Chinese Spicy Hot And Sour Soup
8,Angel's Chunky Chicken Salad,Taco Salad II,Cheesy Fried Potatoes,Incredible Potato Casserole,Baked Ham,Broccoli Casserole I,Breaded Parmesan Chicken,Cold Tuna Macaroni Salad,Meat Filled Manicotti,"Ham, Potato, and Cheese Soup"
9,Apple Coffee Cake With Brown Sugar Sauce,Blueberry Coffee Cake I,Blueberry Cream Cheese Pound Cake I,Cranberry Upside-Down Coffee Cake,Apple Bundt Cake,Pear Bread I,Apple Honey Bundt Cake,Cherry Pound Cake,Apple Butter Spice Cake,Apple Cake and Butter Sauce
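
Since `rcp_svd_embedding` was created with the `cosine` metric, the neighbour lookup can be pictured as ranking rows by cosine similarity to the centroid. The sketch below does that by brute force on a tiny made-up embedding; the vectors, names, and helper functions are all hypothetical, and the real `neighbours` route may well use a faster index internally.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, embedding, num_neighbours):
    """Return the names of the rows most similar to `query`, best first."""
    ranked = sorted(
        embedding,
        key=lambda name: cosine_similarity(query, embedding[name]),
        reverse=True,
    )
    return ranked[:num_neighbours]

# Made-up 2-D embedding: two "savoury" rows and two "cake" rows
embedding = {
    "lentil soup": (0.9, 0.1),
    "chickpea curry": (0.8, 0.3),
    "sponge cake": (0.1, 0.9),
    "fudge cake": (0.2, 0.8),
}
centroid = (1.0, 0.2)
closest = nearest_neighbours(centroid, embedding, num_neighbours=2)
```

With these numbers, `closest` contains the two savoury rows, which point in nearly the same direction as the centroid.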


Not super informative. Let's try to extract the most characteristic words used in the recipe names for each cluster.

We'll start by preprocessing the recipe names a bit: stripping some punctuation and converting to lowercase.

Then, for each cluster, we'll count the words taken from the recipe names after filtering out stop words. This is all done in one (big) query.

In [8]:
print mldb.put('/v1/functions/filter_stopwords', {'type': 'filter_stopwords'})

print mldb.put('/v1/procedures/sum_words_per_cluster', {
    'type': 'transform',
    'params': {
        'inputData': """
        SELECT
        sum(
            filter_stopwords({
                words: {
                    tokenize(lower(name), {splitchars:' ''()",', min_token_length: 4}) as *
                }
            })[words]
        ) as *
        NAMED cluster
        FROM merge(rcp_names, rcp_kmeans_clusters)
        GROUP BY cluster
        """,
        'outputDataset': 'rcp_cluster_word_counts',
        'runOnCreation': True
    }
})

<Response [201]>
<Response [201]>
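
The big query packs several steps into one expression, so here is a hedged pure-Python walk-through of the same pipeline: lowercase, split on the `splitchars` characters, drop tokens shorter than `min_token_length`, drop stop words, and sum the counts per cluster. The stop-word list and the cluster numbers below are made up for illustration; MLDB's `filter_stopwords` function uses its own list.

```python
import re
from collections import Counter, defaultdict

SPLIT_CHARS = " '()\","   # same characters as `splitchars` in the query
MIN_TOKEN_LENGTH = 4      # same value as `min_token_length` in the query
STOP_WORDS = {"with", "from", "this", "that"}  # tiny hypothetical list

def tokenize(name):
    """Lowercase, split on the separator characters, filter short and stop words."""
    tokens = re.split("[" + re.escape(SPLIT_CHARS) + "]+", name.lower())
    return [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH and t not in STOP_WORDS]

def word_counts_per_cluster(named_recipes):
    """Sum token counts over all recipe names assigned to each cluster."""
    counts = defaultdict(Counter)
    for name, cluster in named_recipes:
        counts[cluster].update(tokenize(name))
    return counts

# (recipe name, cluster) pairs; the cluster numbers are made up
recipes = [
    ("Vegan Red Lentil Soup", 2),
    ("Lentils And Spinach", 2),
    ("Lemon Chiffon Cake", 6),
    ("Apple Cake with Butter Sauce", 6),
]
counts = word_counts_per_cluster(recipes)
```

In this toy input, "cake" is counted twice for cluster 6, while "red", "and", and "with" are dropped by the length and stop-word filters.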


Here is what the created dataset looks like:

In [9]:
mldb.query('select * from rcp_cluster_word_counts order by implicit_cast(rowName())')

_rowName,absolutely,accidental,adam,addicting,ahead,aimee,albondigas,alfredo,alla,almond,...,rump,sarge,sauerbraten,slower,splendicious,superheated,texas-style,tipsy,western-style,western
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,...,,,,,,,,,,
1,2.0,,,,,,,3.0,,,...,,,,,,,,,,
2,1.0,,,,1.0,,,1.0,,,...,,,,,,,,,,
3,1.0,,,,,,,2.0,1.0,5.0,...,,,,,,,,,,
4,3.0,,1.0,,,,,2.0,,10.0,...,,,,,,,,,,
5,1.0,,,,2.0,,,7.0,,1.0,...,,,,,,,,,,
6,,,,,,,,,,1.0,...,,,,,,,,,,
7,2.0,,,,,,2.0,1.0,1.0,1.0,...,,,,,,,,,,
8,,1.0,,,,,,9.0,,1.0,...,,,,,,,,,,
9,2.0,,,,,,,,,6.0,...,,,,,,,,,,


We can use this to create a TF-IDF score for each word in each cluster. Basically, this score gives us an idea of the relative importance of each word in a given cluster.

In [10]:
print mldb.put('/v1/procedures/train_tfidf', {
     'type': 'tfidf.train',
     'params': {
         'trainingData': 'select * from rcp_cluster_word_counts',
         'modelFileUrl': 'file://rcp_tfidf.idf',
         'functionName': 'rcp_tfidf',
         'runOnCreation': True
    }
})

print mldb.put('/v1/procedures/apply_tfidf', {
     'type': 'transform',
     'params': {
         'inputData': 'select rcp_tfidf({input: {*}})[output] as * from rcp_cluster_word_counts',
         'outputDataset': 'rcp_cluster_word_scores',
         'runOnCreation': True
    }
})


<Response [201]>
<Response [201]>


In the resulting dataset, the counts have been replaced by a score.

In [11]:
mldb.query("select * from rcp_cluster_word_scores order by implicit_cast(rowName())")

_rowName,absolutely,accidental,adam,addicting,ahead,aimee,albondigas,alfredo,alla,almond,...,rump,sarge,sauerbraten,slower,splendicious,superheated,texas-style,tipsy,western-style,western
0,5.912902,7.00971,7.00971,7.414874,6.722329,7.414874,7.00971,11.252242,6.317465,5.712832,...,,,,,,,,,,
1,11.825804,,,,,,,16.878364,,,...,,,,,,,,,,
2,5.912902,,,,6.722329,,,5.626121,,,...,,,,,,,,,,
3,5.912902,,,,,,,11.252242,6.317465,28.564162,...,,,,,,,,,,
4,17.738707,,7.00971,,,,,11.252242,,57.128323,...,,,,,,,,,,
5,5.912902,,,,13.444658,,,39.382848,,5.712832,...,,,,,,,,,,
6,,,,,,,,,,5.712832,...,,,,,,,,,,
7,11.825804,,,,,,14.01942,5.626121,6.317465,5.712832,...,,,,,,,,,,
8,,7.00971,,,,,,50.635091,,5.712832,...,,,,,,,,,,
9,11.825804,,,,,,,,,34.276994,...,,,,,,,,,,
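
In the table above, each score scales linearly with the original count (for `absolutely`, counts of 1, 2, and 3 become 5.912902, 11.825804, and 17.738707), which is what you would expect from "count times a per-word weight". The sketch below uses one common textbook weighting, the natural log of the number of clusters divided by the number of clusters containing the word. That particular formula is an assumption chosen for illustration; it is not meant to reproduce the exact weights `tfidf.train` computed here.

```python
import math

def tfidf_scores(cluster_counts):
    """Score = raw count * ln(N / number of clusters containing the word)."""
    num_clusters = len(cluster_counts)
    # Document frequency: in how many clusters does each word appear?
    doc_freq = {}
    for counts in cluster_counts.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        cluster: {
            word: count * math.log(num_clusters / float(doc_freq[word]))
            for word, count in counts.items()
        }
        for cluster, counts in cluster_counts.items()
    }

# Made-up word counts for three clusters
cluster_counts = {
    0: {"cake": 1, "lentil": 4},
    1: {"cake": 5},
    2: {"cake": 2, "noodles": 3},
}
scores = tfidf_scores(cluster_counts)
```

With this weighting, a word that appears in every cluster ("cake" here) scores zero everywhere, however frequent it is, while a word confined to one cluster keeps a high score.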


If we transpose that dataset, each word becomes a row and each cluster a column, so we can sort by a cluster's column to get its highest-scoring words and display them nicely in a word cloud.

In [12]:
import json
from ipywidgets import interact 
from IPython.display import IFrame, display
html = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/d3.layout.cloud.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/wordcloud.js"></script>
<body> <script>drawCloud(%s)</script> </body>
"""

@interact 
def cluster_word_cloud(cluster=(0, len(centroids) - 1)):  # slider over valid cluster ids
    num_words = 20
    cluster_words = mldb.get(
        '/v1/query',
        q="""
            SELECT rowName() as text
            FROM transpose(rcp_cluster_word_scores)
            ORDER BY "{0}" DESC
            LIMIT {1}
          """.format(cluster, num_words),
        format='aos',
        rowNames=0
    ).json()
    for i,x in enumerate(cluster_words):
        x['size'] = num_words - i
    display( IFrame("data:text/html," + (html % json.dumps(cluster_words)).replace('"',"'"), 850, 350) )

Much better!
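
The ranking that feeds the word cloud can also be sketched without MLDB. The snippet below sorts one cluster's words by score and applies the same sizing rule as the cell above (`num_words - i`). The scores are rounded from a few columns of the cluster 4 row shown earlier, and `top_words` is a hypothetical helper.

```python
def top_words(word_scores, cluster, num_words):
    """word_scores maps cluster -> {word: score}; return word-cloud entries."""
    scored = word_scores[cluster]
    ranked = sorted(scored, key=scored.get, reverse=True)[:num_words]
    # Same sizing rule as the notebook cell: the best word gets the largest size
    return [{"text": word, "size": num_words - i} for i, word in enumerate(ranked)]

# A few scores for cluster 4, rounded from the table of TF-IDF scores
word_scores = {"4": {"almond": 57.1, "absolutely": 17.7, "alfredo": 11.3, "adam": 7.0}}
cloud = top_words(word_scores, "4", num_words=3)
```

The resulting list has the same `text` and `size` keys that the notebook passes to `drawCloud`.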

## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).