# Exploring Favourite Recipes

[AllRecipes](http://allrecipes.com/) is a recipe website where users can mark recipes as 'favourites'. A student named Jeremy Cohen [scraped some of this data for an excellent machine learning project](http://www.jeremymcohen.net/posts/taste/), and we'll use his dataset to demonstrate unsupervised machine learning with MLDB.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection("http://localhost/")

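The sequence of procedures below is based on the one explained in the [Mapping Reddit](/doc/nblink.html#_demos/Mapping Reddit) demo notebook: we import the favourites as a sparse user-by-recipe dataset, train an SVD to embed the recipes, and cluster that embedding with k-means.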

In [2]:
print mldb.put('/v1/datasets/rcp_raw', {
    "type": "text.csv.tabular",
    "params": {
        "headers": ["user_id", "recipe_id"],
        "dataFileUrl": "https://raw.githubusercontent.com/jmcohen/taste/master/data/favorites.csv"
    }
})

print mldb.post('/v1/procedures', {
    "id": "rcp_import",
    "type": "transform",
    "params": {
        "inputData": "select pivot(recipe_id, 1) as * named user_id from rcp_raw group by user_id",
        "outputDataset": "recipes",
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
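To make the `pivot(recipe_id, 1) ... group by user_id` query a little more concrete, here is a minimal plain-Python sketch of the same reshaping idea. The `favourites` pairs and the `pivot_favourites` helper are made up for illustration; this is not how MLDB implements `pivot`.

```python
from collections import defaultdict

# Hypothetical (user_id, recipe_id) pairs standing in for rows of favorites.csv.
favourites = [
    ("u1", "r10"),
    ("u1", "r42"),
    ("u2", "r42"),
    ("u3", "r7"),
]

def pivot_favourites(pairs):
    """Group the pairs by user and emit one sparse row per user: {recipe_id: 1}."""
    rows = defaultdict(dict)
    for user_id, recipe_id in pairs:
        rows[user_id][recipe_id] = 1
    return dict(rows)

user_rows = pivot_favourites(favourites)
```

Each user ends up as one row whose columns are the recipes they marked as favourites, which is the shape the SVD step expects.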


In [16]:
print mldb.post('/v1/procedures', {
    "id": "rcp_svd",
    "type" : "svd.train",
    "params" : {
        "trainingData": "select * from recipes",
        "columnOutputDataset" : "rcp_svd_embedding_raw",
        "runOnCreation": True
    }
})

print mldb.put('/v1/procedures/rcp_clean_svd', {
    'type': 'transform',
    'params': {
        'inputData': '''select * named jseval(
                            'return s.substr(0, s.indexOf("|"))',
                            's', rowName()) from rcp_svd_embedding_raw''',
        'outputDataset': {'id': 'rcp_svd_embedding',
                          'type': 'embedding',
                          'params': {'metric': 'cosine'}},
        'runOnCreation': True}})

NB_CLUSTERS=8

print mldb.post('/v1/procedures', {
    "id" : "rcp_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from rcp_svd_embedding",
        "outputDataset" : "rcp_kmeans_clusters",
        "centroidsDataset" : "rcp_kmeans_centroids",
        "numClusters" : NB_CLUSTERS,
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
<Response [201]>
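For readers who have not met k-means before, the sketch below shows the basic assign-then-update loop on a handful of 2-D points. It is only an illustration: the points, the starting centroids and the helper names are invented, it uses Euclidean distance, and MLDB's `kmeans.train` may initialise and iterate differently.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its points (unchanged if it has none)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(
            sum(m[d] for m in members) / float(len(members))
            for d in range(len(old))))
    return new_centroids

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy data: two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

In the notebook, the points are the rows of `rcp_svd_embedding`, and the labels and centroids land in `rcp_kmeans_clusters` and `rcp_kmeans_centroids`.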


Let's load up the recipe names so we can see what we clustered. First we'll read the file and then extract and clean the recipe names.

In [21]:
print mldb.put('/v1/datasets/rcp_names_raw', {
    'type': 'text.line',
    'params': {
        'dataFileUrl': 'https://raw.githubusercontent.com/jmcohen/taste/master/data/recipes.csv'
    }
})

print mldb.put('/v1/procedures/rcp_names_import', {
    'type': 'transform',
    'params': {
        'inputData': '''
            select jseval(
                   's = s.substr(s.indexOf(",") + 1);
                    s = s.replace(/&#34;/g, "\\"");
                    s = s.replace(/&#174;/g, "(R)");
                    return s;',
                    's', lineText) as name
            named implicit_cast(rowName()) - 1
            from rcp_names_raw
        ''',
        'outputDataset': 'rcp_names',
        'runOnCreation': True}})


<Response [201]>
<Response [201]>
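The `jseval` expression above may be easier to follow as ordinary Python. This is a rough equivalent of the JavaScript snippet only; the sample line is invented and `clean_recipe_line` is a hypothetical helper, not part of MLDB or `pymldb`.

```python
def clean_recipe_line(line):
    """Drop everything up to the first comma, then decode two HTML entities."""
    # Like the JavaScript version, a line without a comma is kept whole
    # (find() returns -1, so the slice starts at 0).
    name = line[line.find(",") + 1:]
    name = name.replace("&#34;", '"')
    name = name.replace("&#174;", "(R)")
    return name

# Made-up input line in the "id,name" layout the query assumes.
example = clean_recipe_line("123,Milky Way&#174; &#34;Easy&#34; Icing")
```

Here `example` is `Milky Way(R) "Easy" Icing`.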


In [22]:
def recipe_name(row_name):
    """ simple utility that returns the name of a recipe from its id (rowName() in the datasets) """
    return mldb.get('/v1/query', q="select name from rcp_names where rowName() = '{}'".format(row_name),
                   format='aos').json()[0]['name']

Let's look at the closest recipes to each cluster centroid to try to get a sense of what the clusters mean.

In [23]:
import pandas as pd
centroids = mldb.get('/v1/query', q="select * from rcp_kmeans_centroids order by implicit_cast(rowName())",
                      format='aos', rowNames="false").json()
rows = []
for c in centroids:
    neighbours = mldb.get('/v1/datasets/rcp_svd_embedding/routes/neighbours', numNeighbours=10, **c).json()
    rows.append([recipe_name(n[0]) for n in neighbours])
pd.DataFrame(rows)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Beer Burgers | Superb Sauteed Mushrooms | Traditional Christmas Cheese Ball | Noodles Alfredo | Mistakenly Zesty Pork Chops | Caprese Burger | All American Meatloaf | Autumn Pork Chops | Healthier Oven Roasted Potatoes | Inwood Hamburgers |
| 1 | Snickerdoodle Cake I | PHILLY Blackforest Stuffed Cupcakes | Chewy Red Raspberry Squares | Milky Way(R) Cupcake Icing | Hockey Pucks | Cinnamon Sugar Butter Cookies II | Jellybean Bark | Pink Ladies | Ribboned Fudge Cake | Cinnamon Coffee Frosting |
| 2 | Vegan Red Lentil Soup | Spinach, Red Lentil, and Bean Curry | Moroccan-Style Stuffed Acorn Squash | Lentils And Spinach | Swiss Chard with Garbanzo Beans and Fresh Toma... | Mediterranean Chickpea Salad II | Vegan Split Pea Soup I | Fava Bean Breakfast Spread | Spinach Chickpea Curry | Mock Tuna Salad |
| 3 | Grilled Gyro Burgers | Cheese Grits | Three Cheese Macaroni with Tomatoes | Southwestern Cauliflower and Ham Soup | Chicken Creole | Country Scalloped Potatoes | Teriyaki Mushrooms | Southwestern Caesar Salad | Alison's Colcannon | Stuffed and Wrapped Chicken Breast |
| 4 | Incredible Potato Casserole | Baked Ham | Cheesy Fried Potatoes | The Best Chicken Salad Ever | Cheese Biscuits I | Angel's Chunky Chicken Salad | Taco Salad II | Jim's Macaroni Salad | Broccoli Casserole I | Ham, Potato, and Cheese Soup |
| 5 | Apple Coffee Cake With Brown Sugar Sauce | Cinnamon Sugar Biscotti | Cherry Pound Cake | Emily's Famous Chocolate Shortbread Cookies | Linda's Monster Cookies | Big Guy Strawberry Pie | Cream Cheese Coffee Cake II | Blueberry Coffee Cake I | Brown Sugar Cream Cheese Frosting | Farm Macaroons |
| 6 | Cup of Everything Cookies | Maple-Vanilla Syrup | Coffee Shake | We Be Jammin' Jamaican Banana Bread | Colonial Brown Bread | Gold Fever Chicken Wing Sauce | Orange Juice Cake | Chocolate Wontons | Irish Coffee | Easy Whipped Cream |
| 7 | Chap Chee Noodles | Gyros Burgers | Asian Barbequed Steak | Thit Bo Xao Dau | Key West Penne | African Curry | Killer Shrimp Soup | Korean Spicy Chicken and Potato (Tak Toritang) | Baked Pork Spring Rolls | Greek Souzoukaklia |
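The neighbour lookup above relies on the `cosine` metric chosen for the `rcp_svd_embedding` dataset. As a sketch of what "closest under cosine similarity" means, here is a brute-force version in plain Python. The three 2-D vectors and the function names are invented for illustration; the real embedding has many more dimensions and MLDB does not necessarily search this way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(query, embedding, n):
    """Return the names of the n vectors most similar to the query."""
    ranked = sorted(embedding,
                    key=lambda name: cosine_similarity(query, embedding[name]),
                    reverse=True)
    return ranked[:n]

# Toy embedding: two savoury recipes pointing one way, a dessert pointing another.
embedding = {
    "burger": (1.0, 0.1),
    "meatloaf": (0.9, 0.2),
    "cupcake": (0.1, 1.0),
}
closest = nearest((1.0, 0.0), embedding, 2)
```

With a centroid-like query of `(1.0, 0.0)`, the two savoury recipes come back first and the cupcake is left out.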


Not super informative. Let's try to extract the most characteristic words used in the recipe names for each cluster.

We'll start by preprocessing the recipe names a bit: removing some punctuation and converting everything to lowercase.

In [24]:
print mldb.put('/v1/procedures/rcp_names_preproc', {
    'type': 'transform',
    'params': {
        'inputData': '''
            select jseval(
                   's = s.replace(/[(),"]/g, "");
                    s = s.replace(/'' /g, "");
                    s = s.replace(/ ''/g, "");
                    s = s.toLowerCase();
                    return s;',
                    's', name) as name
            from rcp_names
        ''',
        'outputDataset': 'rcp_names_clean',
        'runOnCreation': True}})

<Response [201]>


Then, for each cluster, we count the words in its recipe names after filtering out stop words and passing the remaining words through a stemmer. This is all done in one (big) query.

In [25]:
print mldb.put('/v1/functions/stem', {
    'type': 'stemmer',
    'params': {
        'language': 'english'}})

print mldb.put('/v1/functions/filter_stopwords', {'type': 'filter_stopwords'})

print mldb.put('/v1/procedures/sum_words_per_cluster', {
    'type': 'transform',
    'params': {
        'inputData': """
        SELECT
        sum(
            stem(
                filter_stopwords({
                    words: {
                        tokenize(name, {splitchars:' ()"'}) as *
                    }
                })
            )[words]
        ) as *
        NAMED cluster
        FROM merge(rcp_names_clean, rcp_kmeans_clusters)
        GROUP BY cluster
        """,
        'outputDataset': 'rcp_cluster_word_counts',
        'runOnCreation': True}})

<Response [201]>
<Response [201]>
<Response [201]>
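The big query can be read as "join names to clusters, split names into words, drop stop words, and add up the counts per cluster". Here is a minimal Python sketch of that idea, under a few assumptions: the tiny `STOP_WORDS` set and the sample data are made up, stemming is left out to keep it short, and the join is a plain lookup by recipe id.

```python
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stop word list, not the one MLDB uses.
STOP_WORDS = {"and", "with", "the", "of"}

def count_words_per_cluster(names, clusters):
    """Return {cluster: Counter of words} over the recipes that have a cluster."""
    counts = defaultdict(Counter)
    for recipe_id, name in names.items():
        if recipe_id not in clusters:
            continue
        words = [w for w in name.split(" ") if w and w not in STOP_WORDS]
        counts[clusters[recipe_id]].update(words)
    return counts

# Made-up cleaned names and cluster assignments, keyed by recipe id.
names = {
    "1": "beer burgers",
    "2": "caprese burgers with basil",
    "3": "cherry pound cake",
}
clusters = {"1": 0, "2": 0, "3": 5}
word_counts = count_words_per_cluster(names, clusters)
```

Each cluster gets one row of word counts, which mirrors the layout of `rcp_cluster_word_counts`.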


Here is what the created dataset looks like:

In [29]:
mldb.query('select * from rcp_cluster_word_counts order by implicit_cast(rowName())')

| _rowName | & | 20 | absolut | accident | adam | addict | ahead | aime | aioli | albondiga | ... | won!texa | would'ha | xao | yakisoba | yia | yum | yung | z-zayt | zwiebelkuchen | zydeco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | | | | | | | | | | |
| 1 | | | 1 | | | 1.0 | | | | | ... | | | | | | | | | | |
| 2 | 1.0 | | 2 | | | 1.0 | 1.0 | | | | ... | | | | | | | | | | |
| 3 | | | 2 | | | | | | 1.0 | | ... | | | | | | | | | | |
| 4 | 2.0 | | 3 | 1.0 | | | 1.0 | | | | ... | | | | | | | | | | |
| 5 | | | 3 | | | 1.0 | 1.0 | | | | ... | | | | | | | | | | |
| 6 | | | 2 | | 1.0 | | | | | | ... | | | | | | | | | | |
| 7 | 1.0 | | 4 | | | | | | | 2.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


We can use this to create a TF-IDF score for each word in each cluster. This score gives us an idea of the relative importance of each word in a given cluster.

In [30]:
print mldb.put('/v1/procedures/train_tfidf', {
     'type': 'tfidf.train',
     'params': {
         'trainingData': 'select * from rcp_cluster_word_counts',
         'modelFileUrl': 'file://rcp_tfidf.idf',
         'runOnCreation': True}})

print mldb.put('/v1/functions/rcp_tfidf', {
     'type': 'tfidf',
     'params': {
         'modelFileUrl': 'file://rcp_tfidf.idf'}})

print mldb.put('/v1/procedures/apply_tfidf', {
     'type': 'transform',
     'params': {
         'inputData': 'select rcp_tfidf({input: {*}})[output] as * from rcp_cluster_word_counts',
         'outputDataset': 'rcp_cluster_word_scores',
         'runOnCreation': True}})


<Response [201]>
<Response [201]>
<Response [201]>


In the resulting dataset, the counts have been replaced by a score.

In [31]:
mldb.query("select * from rcp_cluster_word_scores order by implicit_cast(rowName())")

| _rowName | & | 20 | absolut | accident | adam | addict | ahead | aime | aioli | albondiga | ... | won!texa | would'ha | xao | yakisoba | yia | yum | yung | z-zayt | zwiebelkuchen | zydeco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.942218 | 7.386471 | 11.769119 | 6.981315 | 6.981315 | 6.471109 | 6.471109 | 7.386471 | 6.981315 | 6.981315 | ... | | | | | | | | | | |
| 1 | | | 5.88456 | | | 6.471109 | | | | | ... | | | | | | | | | | |
| 2 | 6.471109 | | 11.769119 | | | 6.471109 | 6.471109 | | | | ... | | | | | | | | | | |
| 3 | | | 11.769119 | | | | | | 6.981315 | | ... | | | | | | | | | | |
| 4 | 12.942218 | | 17.653679 | 6.981315 | | | 6.471109 | | | | ... | | | | | | | | | | |
| 5 | | | 17.653679 | | | 6.471109 | 6.471109 | | | | ... | | | | | | | | | | |
| 6 | | | 11.769119 | | 6.981315 | | | | | | ... | | | | | | | | | | |
| 7 | 6.471109 | | 23.538239 | | | | | | | 13.962631 | ... | 7.386471 | 7.386471 | 7.386471 | 7.386471 | 14.772942 | 7.386471 | 7.386471 | 7.386471 | 7.386471 | 7.386471 |
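To give a feel for how a TF-IDF score is built from counts like these, here is a small sketch using the textbook weighting `count * log(N / df)`, where `N` is the number of clusters and `df` is the number of clusters containing the word. This is an assumption made for illustration: with this formula a word present in every cluster scores zero, whereas the table above gives such words (for example `absolut`) a nonzero score, so MLDB evidently applies a different weighting. The sample counts are invented.

```python
import math

def tfidf(counts_by_cluster):
    """Score each word as count * log(number of clusters / clusters containing it)."""
    n_clusters = len(counts_by_cluster)
    doc_freq = {}
    for counts in counts_by_cluster.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        cluster: {
            word: count * math.log(float(n_clusters) / doc_freq[word])
            for word, count in counts.items()
        }
        for cluster, counts in counts_by_cluster.items()
    }

# Made-up stemmed word counts for two clusters.
counts = {
    0: {"burger": 3, "easi": 2},
    1: {"cake": 4, "easi": 1},
}
scores = tfidf(counts)
```

The word shared by both clusters is pushed down, while the words specific to one cluster keep a score proportional to their count, which is the behaviour we want for picking characteristic words.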


If we transpose that dataset, we can get the highest-scoring words for each cluster and display them nicely in a word cloud.

In [36]:
print mldb.put('/v1/datasets/rcp_cluster_word_scores_t', {
    'type': 'transposed',
    'params': {
        'dataset': {'id':'rcp_cluster_word_scores'}}})

<Response [201]>
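Transposing simply swaps rows and columns, so that each word becomes a row and each cluster a column; sorting by one cluster's column then yields that cluster's top words. A minimal sketch with nested dictionaries follows. The scores and the helper names are invented, and this is not how the `transposed` dataset type is implemented.

```python
def transpose(rows):
    """Turn {row: {column: value}} into {column: {row: value}}."""
    out = {}
    for row_name, columns in rows.items():
        for column, value in columns.items():
            out.setdefault(column, {})[row_name] = value
    return out

def top_words(transposed, cluster, n):
    """Return the n words with the highest score in the given cluster."""
    scored = [(word, cols[cluster])
              for word, cols in transposed.items() if cluster in cols]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]

# Made-up scores: clusters as rows, words as columns.
scores = {
    "0": {"burger": 9.5, "chop": 7.0, "easi": 1.0},
    "1": {"cake": 8.0, "easi": 1.5},
}
scores_t = transpose(scores)
best = top_words(scores_t, "0", 2)
```

This is the same "order by one cluster's column, then limit" pattern that the word cloud query uses against `rcp_cluster_word_scores_t`.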


In [49]:
import json
from ipywidgets import interact 
from IPython.display import IFrame, display
html = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/d3.layout.cloud.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/wordcloud.js"></script>
<body> <script>drawCloud(%s)</script> </body>
"""

@interact 
def cluster_word_cloud(cluster=(0, NB_CLUSTERS - 1)):  # clusters are numbered 0 to NB_CLUSTERS - 1
    num_words = 20
    cluster_words = mldb.get(
        '/v1/query',
        q="""
            SELECT rowName() as text
            FROM rcp_cluster_word_scores_t
            ORDER BY "{0}" DESC
            LIMIT {1}
          """.format(cluster, num_words),
        format='aos',
        rowNames=0
    ).json()
    for i,x in enumerate(cluster_words):
        x['size'] = num_words - i
    display( IFrame("data:text/html," + (html % json.dumps(cluster_words)).replace('"',"'"), 850, 350) )

Much better!

## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).