# Exploring Favourite Recipes

[AllRecipes](http://allrecipes.com/) is a recipe website, where people can mark certain recipes as 'favourites'. A student named Jeremy Cohen [scraped some of this data for an excellent machine learning project](http://www.jeremymcohen.net/posts/taste/) and we'll use his dataset to demo how to do some unsupervised machine learning with MLDB.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection()

The sequence of procedures below is based on the one explained in the [Mapping Reddit](/doc/nblink.html#_demos/Mapping Reddit) demo notebook.

In [3]:
mldb.v1.datasets("rcp_raw").put({
    "type": "text.csv.tabular",
    "params": {
        "headers": ["user_id", "recipe_id"],
        "dataFileUrl": "https://raw.githubusercontent.com/jmcohen/taste/master/data/favorites.csv"
    }
})

mldb.v1.procedures.post({
    "id": "rcp_import",
    "type": "transform",
    "params": {
        "inputDataset": "rcp_raw",
        "outputDataset": "recipes",
        "select": "pivot(recipe_id, 1) as *",
        "groupBy": "user_id", 
        "rowName": "user_id",
        "runOnCreation": True
    }
})

mldb.v1.procedures.post({
    "id": "rcp_svd",
    "type" : "svd.train",
    "params" : {
        "trainingDataset": "recipes",
        "columnOutputDataset" : "rcp_svd_embedding",
        "runOnCreation": True
    }
})

mldb.v1.procedures.post({
    "id" : "rcp_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingDataset" : "rcp_svd_embedding",
        "outputDataset" : "rcp_kmeans_clusters",
        "centroidsDataset" : "rcp_kmeans_centroids",
        "numClusters" : 20,
        "runOnCreation": True
    }
})

mldb.v1.procedures.post({
    "id": "rcp_tsne",
    "type" : "tsne.train",
    "params" : {
        "trainingDataset" : "rcp_svd_embedding",
        "rowOutputDataset" : "rcp_tsne_embedding",
        "runOnCreation": True
    }
})

mldb.v1.datasets("rcp_merged").put({
    "type": "merged",
    "params": {
        "datasets": [
            {"id": "rcp_svd_embedding"}, 
            {"id": "rcp_tsne_embedding"}, 
            {"id": "rcp_kmeans_clusters"}
        ]
    }
})
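
To make the `rcp_import` step a little more concrete: as far as we understand it, `pivot(recipe_id, 1)` grouped by `user_id` turns the long list of (user, recipe) pairs into one sparse row per user, with one column per favourited recipe set to 1. Here is a minimal plain-Python sketch of that reshaping on made-up pairs. `pivot_favourites` is a hypothetical helper written for illustration, not an MLDB function.

```python
from collections import defaultdict

def pivot_favourites(pairs):
    """Group (user_id, recipe_id) pairs into one sparse row per user.

    Hypothetical helper for illustration only; it mimics what we expect
    pivot(recipe_id, 1) grouped by user_id to produce.
    """
    rows = defaultdict(dict)
    for user_id, recipe_id in pairs:
        # The column name is the recipe id, the value is the constant 1.
        rows[user_id][recipe_id] = 1
    return dict(rows)

# Made-up favourites, not taken from the real dataset.
pairs = [(10, 3), (10, 7), (11, 3), (12, 5)]
sparse_rows = pivot_favourites(pairs)
print(sparse_rows)
```

Each user ends up with only the columns for recipes they favourited, which is the kind of sparse input the SVD step then compresses into a dense embedding.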

Let's load up the recipe names so we can see what we clustered.

In [4]:
import numpy as np
import pandas as pd
import urllib2 
names = []
for i, l in enumerate(urllib2.urlopen("https://raw.githubusercontent.com/jmcohen/taste/master/data/recipes.csv")):
    pieces = l.strip().split(",")
    names.append(",".join(pieces[1:]).replace("&#34;", '"').replace("&#39;", "'"))

names = np.array(names)

def recipe_names(row_names):
    return names[ [int(x.split("|")[0]) for x in row_names] ]
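
The cell above decodes HTML entities in the recipe names with chained `replace` calls. If you are running on Python 3, the standard library's `html.unescape` is one alternative that handles these and other entities in a single call. The sketch below assumes each line has the shape `id,name`, where the name may itself contain commas. `parse_recipe_line` and the sample line are made up for illustration.

```python
import html

def parse_recipe_line(line):
    """Split one 'id,name' line and decode HTML entities in the name.

    Hypothetical helper; assumes the first comma-separated field is the
    recipe id and everything after it (commas included) is the name.
    """
    recipe_id, _, name = line.strip().partition(",")
    return recipe_id, html.unescape(name)

# Made-up line in the same shape as the recipes file.
example = "42,Rick&#39;s &#34;Special&#34; Soup, Extra Creamy\n"
parsed = parse_recipe_line(example)
print(parsed)
```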
 

Let's look at the closest recipes to each cluster centroid to try to get a sense of what the clusters mean.
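
As a rough picture of what "closest to a centroid" means, the sketch below ranks a few made-up 2-D points by Euclidean distance to a made-up centroid. MLDB's `neighbours` route works on the full SVD embedding and may use a different metric or search strategy, so treat this only as an illustration. The `closest` function and all the data are hypothetical.

```python
import math

def closest(points, centroid, k=2):
    """Return the names of the k points nearest to centroid (Euclidean)."""
    def dist(coords):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(coords, centroid)))
    # Sort point names by their distance to the centroid, nearest first.
    ranked = sorted(points, key=lambda name: dist(points[name]))
    return ranked[:k]

# Made-up embedding coordinates, not real SVD output.
points = {
    "bread": (0.9, 0.1),
    "rolls": (1.0, 0.2),
    "curry": (0.1, 0.9),
}
nearest = closest(points, centroid=(1.0, 0.0))
print(nearest)
```

The cell below does the same kind of lookup through MLDB, once per centroid.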

In [5]:
centroids = mldb.query("select * from rcp_kmeans_centroids order by implicit_cast(rowName())")
import json
from IPython.display import display
rows = []
for x in json.loads(centroids.to_json(orient="records")):
    result = mldb.v1.datasets("rcp_svd_embedding").routes.neighbours.get(**x)
    rows.append(recipe_names(r[0] for r in result.json()))
pd.DataFrame(rows)
    

,0,1,2,3,4,5,6,7,8,9
0,A Good Easy Garlic Chicken,Golden Potato Soup,Mexican Casserole,Tomato Chicken Parmesan,Baked Potato Soup V,Baked Potato Soup I,Fried Chicken with Creamy Gravy,Cheddar Baked Chicken,Stuffed Peppers,Oven Fried Chicken III
1,American Lasagna,Baked Potato Soup I,Kentucky Butter Cake,Bodacious Broccoli Salad,Bacon and Tomato Cups,BLT Dip,Chocolate Lovers' Favorite Cake,Day Before Mashed Potatoes,Honey Bun Cake I,Hot Pizza Dip
2,Chicken and Biscuit Casserole,Classic Goulash,Shrimp Fettuccine Alfredo,Fettuccine with Sweet Pepper-Cayenne Sauce,Cheesy Chicken Meatballs,Creamy Chicken on Linguine,Rice Balls a la Tim,Famous Chicken Francaise,Feta and Bacon Stuffed Chicken with Onion Mash...,"Best Ever Sausage with Peppers, Onions, and Beer!"
3,Cream Filled Cupcakes,Rick's Special Buttercream Frosting,Fluffy Peanut Butter Frosting,Dark Chocolate Cake I,Whipped Cream Cream Cheese Frosting,Extreme Chocolate Cake,Lemon Cake with Lemon Filling and Lemon Butter...,Creamy Chocolate Frosting,Chocolate Cookie Cheesecake,Fudge Puddles
4,Easy Spicy Roasted Potatoes,Swiss Chicken Casserole II,Garlic Butter,Potato Chips,Garlic Cheese Chicken Rollups,Cheddar Ranch Dip,Awesome Roast Beef,Fresh Tomato Salsa,Crispy Herb Baked Chicken,Charleston Breakfast Casserole
5,Super-Delicious Zuppa Toscana,Bacon Ranch Pasta Salad,Sesame Pasta Chicken Salad,Cheesy Ranch Potato Bake,Cabbage Fat-Burning Soup,Pepperoni Bread,Fresh Strawberry Upside Down Cake,Fudge Puddles,Hot Pizza Dip,Easy Cream Cheese Danish
6,Pesto Cheesy Chicken Rolls,Chicken Fajita Melts,Lasagna Alfredo,Prize Winning Baby Back Ribs,Southern Pulled Pork,The Best Chicken Fried Steak,Stuffed Chicken Valentino,Grill Master Chicken Wings,Amazing Crusted Chicken,Slow Cooker Italian Chicken Alfredo
7,French Baguettes,French Bread Rolls to Die For,Peppy's Pita Bread,Crispy and Creamy Doughnuts,French Bread,English Muffins,Burger or Hot Dog Buns,Simple Whole Wheat Bread,Best Bread Machine Bread,Cinnamon Rolls III
8,Shrimp Scampi Bake,Cajun Seafood Pasta,Grilled Shrimp Scampi,Basil Shrimp,Cioppino,Grilled Marinated Shrimp,Firecracker Grilled Alaska Salmon,Creamy Pesto Shrimp,Angel Hair Pasta with Shrimp and Basil,My Best Clam Chowder
9,Chicken Tikka Masala,Chicken Makhani (Indian Butter Chicken),Curried Coconut Chicken,Naan,Peanut Butter Noodles,Kung Pao Chicken,Vietnamese Fresh Spring Rolls,Mulligatawny Soup I,Asian Lettuce Wraps,Sean's Falafel and Cucumber Sauce


Not super informative. Let's try to extract the most characteristic words used in the recipe names for each cluster.

In [6]:
df = mldb.query("select * from rcp_merged")
ordered_names = recipe_names(df.index.values)
documents = [" ".join(ordered_names[df.cluster.astype(int).values == i]) for i in range(len(centroids))]
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import rankdata
v = CountVectorizer(min_df=0.2, stop_words ="english")
td = v.fit_transform(documents).A
words = np.array(v.get_feature_names())

f = lambda x: rankdata(-x)
overall = td.sum(axis=0)
all_words = [list(words[(f(row) / f(overall-row)).argsort()]) for row in td]
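
The scoring in the last line of the cell above is compact, so here is a plain-Python sketch of the same idea on made-up counts. Each word gets a rank inside the cluster (1 = most frequent) and a rank in the rest of the corpus, and words are sorted by the ratio of the two, so a word that is common in the cluster but rare elsewhere comes first. `rank_descending` is a hand-written stand-in for `rankdata(-x)` with average ranks for ties, and `characteristic_words` is a hypothetical helper.

```python
def rank_descending(counts):
    """Rank counts so the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    ranks = [0.0] * len(counts)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next count equals the current one.
        while end + 1 < len(order) and counts[order[end + 1]] == counts[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1  # 1-based average rank of the tie group
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks

def characteristic_words(words, cluster_counts, rest_counts):
    """Order words by (rank in cluster) / (rank elsewhere), smallest first."""
    inside = rank_descending(cluster_counts)
    outside = rank_descending(rest_counts)
    scores = [a / b for a, b in zip(inside, outside)]
    order = sorted(range(len(words)), key=lambda i: scores[i])
    return [words[i] for i in order]

# Made-up counts for one cluster and for all the other clusters combined.
words = ["chicken", "bread", "shrimp"]
ranked = characteristic_words(words, [5, 1, 9], [40, 30, 2])
print(ranked)
```

Here "shrimp" comes first because it is the top word inside the toy cluster and the least frequent one outside it.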

In [7]:
import json
from ipywidgets import interact 
from IPython.display import IFrame, display
html = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/d3.layout.cloud.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/wordcloud.js"></script>
<body> <script>drawCloud(%s)</script> </body>
"""
@interact 
def cluster_word_cloud(cluster=(0, len(all_words)-1)):
    num_words = 20
    cluster_words = [dict(size=num_words-i, text=w) for i, w in enumerate(all_words[cluster][:num_words])]
    display( IFrame("data:text/html," + (html % json.dumps(cluster_words)).replace('"',"'"), 850, 350) )


Much better!

## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).