# Exploring Favourite Recipes

[AllRecipes](http://allrecipes.com/) is a recipe website where people can mark certain recipes as 'favourites'. A student named Jeremy Cohen [scraped some of this data for an excellent machine learning project](http://www.jeremymcohen.net/posts/taste/), and we'll use his dataset to demo some unsupervised machine learning with MLDB.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection()

The sequence of procedures below is based on the one explained in the [Mapping Reddit](/doc/nblink.html#_demos/Mapping Reddit) demo notebook.

In [2]:
print mldb.v1.datasets("rcp_raw").put({
    "type": "text.csv.tabular",
    "params": {
        "headers": ["user_id", "recipe_id"],
        "dataFileUrl": "https://raw.githubusercontent.com/jmcohen/taste/master/data/favorites.csv"
    }
})

print mldb.v1.procedures.post({
    "id": "rcp_import",
    "type": "transform",
    "params": {
        "inputDataset": "rcp_raw",
        "outputDataset": "recipes",
        "select": "pivot(recipe_id, 1) as *",
        "groupBy": "user_id", 
        "rowName": "user_id",
        "runOnCreation": True
    }
})

print mldb.v1.procedures.post({
    "id": "rcp_svd",
    "type" : "svd.train",
    "params" : {
        "trainingData": "select * from recipes",
        "columnOutputDataset" : "rcp_svd_embedding",
        "runOnCreation": True
    }
})

print mldb.v1.procedures.post({
    "id" : "rcp_kmeans",
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from rcp_svd_embedding",
        "outputDataset" : "rcp_kmeans_clusters",
        "centroidsDataset" : "rcp_kmeans_centroids",
        "numClusters" : 20,
        "runOnCreation": True
    }
})

print mldb.v1.procedures.post({
    "id": "rcp_tsne",
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from rcp_svd_embedding",
        "rowOutputDataset" : "rcp_tsne_embedding",
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
<Response [201]>
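
As a rough mental model of the `rcp_import` step, here is a minimal pure-Python sketch of what grouping by `user_id` and pivoting on `recipe_id` is assumed to produce: one sparse row per user, with a 1 in the column of each favourited recipe. The sample pairs and the `pivot_favourites` helper are hypothetical and only illustrate the shape of the data; this is not how MLDB implements the transform.

```python
from collections import defaultdict

# Hypothetical (user_id, recipe_id) pairs standing in for rows of favorites.csv
favourites = [
    ("u1", "10"),
    ("u1", "42"),
    ("u2", "42"),
    ("u3", "7"),
    ("u3", "10"),
]

def pivot_favourites(pairs):
    """Group pairs by user and put a 1 in the column named after each recipe."""
    rows = defaultdict(dict)
    for user_id, recipe_id in pairs:
        rows[user_id][recipe_id] = 1
    return dict(rows)

recipes = pivot_favourites(favourites)
for user_id in sorted(recipes):
    print("%s -> %s" % (user_id, sorted(recipes[user_id].items())))
```

Recipes that a user never favourited are simply absent from that user's row, which keeps the user-by-recipe matrix sparse.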


Let's load up the recipe names so we can see what we clustered.

In [3]:
import numpy as np
import pandas as pd
import urllib2 
names = []
for i, l in enumerate(urllib2.urlopen("https://raw.githubusercontent.com/jmcohen/taste/master/data/recipes.csv")):
    pieces = l.strip().split(",")
    names.append(",".join(pieces[1:]).replace("&#34;", '"').replace("&#39;", "'"))

names = np.array(names)

def recipe_names(row_names):
    return names[ [int(x.split("|")[0]) for x in row_names] ]
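
The parsing above drops the first field of each line and undoes HTML entities by hand. A hedged alternative, assuming Python 3 (the notebook itself targets Python 2) and assuming the name is everything after the first comma, is to split once and let the standard library's `html.unescape` decode any entity. The sample lines and the `parse_recipe_line` helper are made up for illustration.

```python
import html  # Python 3 standard library

def parse_recipe_line(line):
    """Split off the leading field and decode HTML entities in the remainder."""
    # partition splits on the first comma only, so commas inside the name survive
    _first, _sep, name = line.strip().partition(",")
    return html.unescape(name)

# Hypothetical lines in the same "<first field>,<name>" layout
sample_lines = [
    "0,Mom&#39;s &#34;Best&#34; Chili\n",
    "1,Peppers, Onions, and Beer\n",
]
parsed = [parse_recipe_line(line) for line in sample_lines]
for name in parsed:
    print(name)
```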
 

Let's look at the closest recipes to each cluster centroid to try to get a sense of what the clusters mean.

In [4]:
import json

# Fetch the k-means centroids, ordered by their row name cast to a number
centroids = mldb.query("select * from rcp_kmeans_centroids order by implicit_cast(rowName())")
rows = []
for x in json.loads(centroids.to_json(orient="records")):
    # Pass each centroid's coordinates as parameters to the neighbours route
    result = mldb.v1.datasets("rcp_svd_embedding").routes.neighbours.get(**x)
    rows.append(recipe_names(r[0] for r in result.json()))
pd.DataFrame(rows)
    

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italian Dressing Mix | Potato Chips | Broccoli with Garlic Butter and Cashews | Garlic Butter | Egg Noodles | Roasted Garlic Cauliflower | Absolutely Fabulous Greek/House Dressing | Italian Peas | Fried Rice Restaurant Style | Ranch Dressing II |
| 1 | Magic Peanut Butter Middles | Chewy Peanut Butter Brownies | Fudge Puddles | No Bake Chocolate Oat Bars | Frosted Banana Bars | Fluffy Peanut Butter Frosting | Caramel Shortbread Squares | Oatmeal Peanut Butter Cookies III | Cookie Balls | Delicious Raspberry Oatmeal Cookie Bars |
| 2 | Addictive Sweet Potato Burritos | Vegetarian Chickpea Sandwich Filling | Black Bean and Salsa Soup | Spaghetti Squash I | The Best Vegetarian Chili in the World | Black Bean Vegetable Soup | California Grilled Veggie Sandwich | Lentil Soup | Sean's Falafel and Cucumber Sauce | Delicious Black Bean Burritos |
| 3 | Kentucky Butter Cake | Cream Cheese Sugar Cookies | Autumn Cheesecake | Irish Cream Bundt Cake | Cinnamon Rolls III | Chocolate Lovers' Favorite Cake | Honey Bun Cake I | Bake Sale Lemon Bars | Award Winning Peaches and Cream Pie | Golden Rum Cake |
| 4 | Chicken and Biscuit Casserole | Classic Goulash | Shrimp Fettuccine Alfredo | Fettuccine with Sweet Pepper-Cayenne Sauce | Creamy Chicken on Linguine | Cheesy Chicken Meatballs | Famous Chicken Francaise | Rice Balls a la Tim | Feta and Bacon Stuffed Chicken with Onion Mash... | Best Ever Sausage with Peppers, Onions, and Beer! |
| 5 | Delicious Raspberry Oatmeal Cookie Bars | Pumpkin Gingerbread | Apple Squares | Apple Strudel Muffins | Oatmeal Peanut Butter Cookies | Pumpkin Apple Streusel Muffins | Morning Glory Muffins I | Health Nut Blueberry Muffins | Low-Fat Blueberry Bran Muffins | Outrageous Chocolate Chip Cookies |
| 6 | Baked Slow Cooker Chicken | Slow Cooker Scalloped Potatoes with Ham | Slow Cooker Enchiladas | Slow Cooker Salisbury Steak | Slow Cooker Creamy Potato Soup | Slow Cooker Ham | Tangy Slow Cooker Pork Roast | Marie's Easy Slow Cooker Pot Roast | Cabbage Rolls II | Pork Chops for the Slow Cooker |
| 7 | Apple Crumb Pie | Apple Turnovers | Apple Squares | Crustless Cranberry Pie | Churros | Lemon Pie Bars | Ladyfingers | Chocolate Truffle Cookies | Triple Berry Crisp | French Breakfast Muffins |
| 8 | Swiss Chicken Casserole II | Garlic Cheese Chicken Rollups | Crispy Herb Baked Chicken | Awesome Roast Beef | Manicotti Italian Casserole | Brunch Enchiladas | Chicken Breasts with Lime Sauce | Slow Cooker London Broil | Chicken Scarpariello | Kona Chicken |
| 9 | Chocolate Covered Strawberries | Toasted Garlic Bread | Garlic Butter | Sauteed Apples | Baked French Fries I | Sarah's Applesauce | Great Garlic Bread | Oven Roasted Red Potatoes | Oven Roasted Potatoes | Strawberry Oatmeal Breakfast Smoothie |
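
To make the idea of "closest recipes to a centroid" concrete, here is a small pure-Python sketch of a nearest-neighbour lookup. It assumes plain Euclidean distance, which the notebook does not state for MLDB's `neighbours` route, and it uses a made-up 2-D embedding; `nearest_to_centroid` is a hypothetical helper, not part of `pymldb`.

```python
import math

def nearest_to_centroid(centroid, embedding, k=3):
    """Return the k row names whose vectors are closest (Euclidean) to centroid."""
    def distance(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, centroid)))
    ranked = sorted(embedding, key=lambda name: distance(embedding[name]))
    return ranked[:k]

# Hypothetical 2-D embedding; the real SVD embedding has more dimensions
embedding = {
    "garlic bread": (0.9, 0.1),
    "garlic butter": (1.0, 0.0),
    "apple pie": (0.0, 1.0),
    "lemon bars": (0.1, 0.9),
}
closest = nearest_to_centroid((0.95, 0.0), embedding, k=2)
print(closest)
```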


Not super informative. Let's try to extract the most characteristic words used in the recipe names for each cluster.

In [5]:
df = mldb.query("select * from merge(rcp_svd_embedding, rcp_tsne_embedding, rcp_kmeans_clusters)")
ordered_names = recipe_names(df.index.values)
documents = [" ".join(ordered_names[df.cluster.astype(int).values == i]) for i in range(len(centroids))]
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import rankdata
v = CountVectorizer(min_df=0.2, stop_words ="english")
td = v.fit_transform(documents).A
words = np.array(v.get_feature_names())

f = lambda x: rankdata(-x)
overall = td.sum(axis=0)
all_words = [list(words[(f(row) / f(overall-row)).argsort()]) for row in td]
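
The scoring in the last line is compact, so here is a hedged, dependency-free sketch of the same idea on toy counts: rank each word by how often it appears inside the cluster, rank it again by how often it appears everywhere else, and sort by the ratio so that words common in the cluster but rare elsewhere come first. `average_ranks` is a hypothetical stand-in for `scipy.stats.rankdata` with its default tie handling, and the words and counts are invented.

```python
def average_ranks(values):
    """Rank values ascending from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1  # mean of 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def characteristic_words(words, cluster_counts, overall_counts):
    """Order words by (rank inside the cluster) / (rank in the other clusters)."""
    rest = [o - c for o, c in zip(overall_counts, cluster_counts)]
    in_cluster = average_ranks([-c for c in cluster_counts])  # rank 1 = most frequent
    elsewhere = average_ranks([-r for r in rest])
    scores = [a / b for a, b in zip(in_cluster, elsewhere)]
    order = sorted(range(len(words)), key=lambda i: scores[i])
    return [words[i] for i in order]

# Invented counts for one cluster and for the whole corpus
words = ["chicken", "cookie", "slow", "cooker"]
cluster_counts = [1, 0, 9, 8]
overall_counts = [30, 25, 10, 9]
ranked = characteristic_words(words, cluster_counts, overall_counts)
print(ranked)
```

Ranks start at 1, so the ratio never divides by zero, even for words that do not occur outside the cluster.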

In [6]:
import json
from ipywidgets import interact 
from IPython.display import IFrame, display
html = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/d3.layout.cloud.js"></script>
<script src="http://opensource.datacratic.com/mldb-demo-resources/wordcloud.js"></script>
<body> <script>drawCloud(%s)</script> </body>
"""
@interact 
def cluster_word_cloud(cluster=[0,len(all_words)-1]):
    num_words = 20
    cluster_words = [dict(size=num_words-i, text=w) for i, w in enumerate(all_words[cluster][:num_words])]
    display( IFrame("data:text/html," + (html % json.dumps(cluster_words)).replace('"',"'"), 850, 350) )


Much better!

## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).