## Example with unstructured data, using *MongoDB*

This example was removed from Week 1 to keep the introductory material from being overwhelming.

[Dataset](https://www.kaggle.com/c/whats-cooking) from _Yummly_.

In [1]:
import json
import pandas as pd
from smart_open import open
with open('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/cooking.json') as f:
    data = json.load(f)

print("There are", len(data), "items that look like this:")
data[:1]

There are 39774 items that look like this:


[{'id': 10259,
  'cuisine': 'greek',
  'ingredients': ['romaine lettuce',
   'black olives',
   'grape tomatoes',
   'garlic',
   'pepper',
   'purple onion',
   'seasoning',
   'garbanzo beans',
   'feta cheese crumbles']}]

In [6]:
#@title Connect to a MongoDB database
#@markdown This cell connects to a remote MongoDB instance.
#@markdown We'll create our own databases later in the quarter;
#@markdown for now, the code won't run for you.

!pip install dnspython pymongo
from urllib.parse import quote_plus
from pymongo import MongoClient
from getpass import getpass
user = "dbUser" #@param {type:"string"}
cluster_url = "cluster0.ga5s0.mongodb.net" #@param {type:"string"}
mongopw = getpass('Enter your MongoDB password for "{}":\n'.format(user))

client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(quote_plus(user), quote_plus(mongopw), cluster_url))

db = client.week1
# This deletes the collection if it already exists - i.e., if Dr. O
# already ran the below code, this lets us pretend we have a blank slate again!
out = db.drop_collection('cooking')

Enter your MongoDB password for "dbUser":
········
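
The connection cell above passes the user name and password through `quote_plus` because characters such as `@`, `:`, and `/` have special meaning inside a URI. Here is a minimal sketch of that step. The helper name `build_uri`, the password, and the host `cluster0.example.mongodb.net` are all made up for illustration, and nothing here connects to a server.

```python
from urllib.parse import quote_plus

def build_uri(user, password, cluster_url):
    """Build a mongodb+srv URI with percent-encoded credentials (hypothetical helper)."""
    return "mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(
        quote_plus(user), quote_plus(password), cluster_url
    )

# Made-up credentials and host, for illustration only
uri = build_uri("dbUser", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
```

Without the encoding, the `@` in the password would be read as the separator between the credentials and the host name.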


In [7]:
# 'collection' in MongoDB is like a 'table' in SQL
collection = db.cooking
result = collection.insert_many(data)

How many records are there in the collection 'cooking' of the database 'week1'?

In [8]:
collection.count_documents({})

39774

With the data in MongoDB, we can query the semi-structured data with similar flexibility to structured data.

For example, we can unwind the ingredients, so that there is one record for every ingredient of every data point.

In [11]:
pipeline = [
     {"$unwind": "$ingredients"},
     {"$limit": 5}
]
agg = collection.aggregate(pipeline)

# Print first five results
pd.DataFrame(agg)

| | _id | id | cuisine | ingredients |
|---|---|---|---|---|
| 0 | 61d49a3726a96584b3b4c42f | 10259 | greek | romaine lettuce |
| 1 | 61d49a3726a96584b3b4c42f | 10259 | greek | black olives |
| 2 | 61d49a3726a96584b3b4c42f | 10259 | greek | grape tomatoes |
| 3 | 61d49a3726a96584b3b4c42f | 10259 | greek | garlic |
| 4 | 61d49a3726a96584b3b4c42f | 10259 | greek | pepper |
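
To see what `$unwind` is doing, the same idea can be sketched in plain Python without a database. The `unwind` function below is a hypothetical stand-in, not part of PyMongo, and `sample` is the first recipe trimmed to three ingredients.

```python
# The first recipe from the dataset, trimmed to three ingredients
sample = [
    {"id": 10259, "cuisine": "greek",
     "ingredients": ["romaine lettuce", "black olives", "grape tomatoes"]},
]

def unwind(records, field):
    """Yield one copy of each record per element of record[field] (illustrative only)."""
    for record in records:
        # Records with a missing or empty list produce no rows
        for value in record.get(field, []):
            yield {**record, field: value}

rows = list(unwind(sample, "ingredients"))
for row in rows:
    print(row)
```

Each output row keeps the recipe's `id` and `cuisine` but holds a single ingredient string instead of the full list.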


Why unwind? With one record per ingredient, it is easy to group the records and count how often each ingredient appears.

In [12]:
pipeline = [
     {"$unwind": "$ingredients"},
     {"$group": {"_id": "$ingredients", "count": {"$sum": 1}}},
     {"$limit": 5}
]
agg = collection.aggregate(pipeline)
pd.DataFrame(agg)

| | _id | count |
|---|---|---|
| 0 | dr. pepper | 2 |
| 1 | dried Thai chili | 5 |
| 2 | peach sorbet | 1 |
| 3 | bread dough | 26 |
| 4 | roasted white sesame seeds | 6 |
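
The `$unwind` plus `$group` combination is roughly a tally over every ingredient of every recipe. As a rough in-memory comparison, here is a sketch using `collections.Counter`. The three recipes in `sample` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Made-up recipes, for illustration only
sample = [
    {"cuisine": "greek", "ingredients": ["garlic", "feta cheese crumbles"]},
    {"cuisine": "italian", "ingredients": ["garlic", "olive oil"]},
    {"cuisine": "mexican", "ingredients": ["garlic", "olive oil", "salt"]},
]

# The nested loop plays the role of $unwind; Counter plays the role of $group with $sum: 1
counts = Counter(
    ingredient
    for recipe in sample
    for ingredient in recipe["ingredients"]
)
print(counts)
```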


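Unwind > Group > Sort: without a `$sort` stage, the five groups come back in no particular order. Sorting by `count` in descending order (`-1`) before the limit returns the most common ingredients.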

In [13]:
pipeline = [
     {"$unwind": "$ingredients"},
     {"$group": {"_id": "$ingredients", "count": {"$sum": 1}}},
     {"$sort": {"count": -1}},
     {"$limit": 5}
]
agg = collection.aggregate(pipeline)
pd.DataFrame(agg)

| | _id | count |
|---|---|---|
| 0 | salt | 18049 |
| 1 | onions | 7972 |
| 2 | olive oil | 7972 |
| 3 | water | 7457 |
| 4 | garlic | 7380 |
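
The order of the stages matters: sorting has to happen before the limit, otherwise the limit keeps whichever groups happen to come first. A small sketch of sort-then-limit in plain Python, using a handful of counts copied from the outputs in this notebook:

```python
# Counts copied from the notebook outputs, deliberately out of order
counts = {
    "bread dough": 26,
    "salt": 18049,
    "dr. pepper": 2,
    "water": 7457,
    "garlic": 7380,
}

# Like {"$sort": {"count": -1}} followed by {"$limit": 3}
top3 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top3)
```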


Which cuisine and ingredient pairs are most common in this dataset? Grouping on a compound `_id` counts each combination separately.

In [14]:
pipeline = [
     {"$unwind": "$ingredients"},
     {"$group": {"_id": {
         "ingredients": "$ingredients", "cuisine": "$cuisine"
     }, "count": {"$sum": 1}}},
     {"$sort": {"count": -1}},
     {"$limit": 5}
]
agg = collection.aggregate(pipeline)
pd.DataFrame(agg)[:5]

| | _id | count |
|---|---|---|
| 0 | {'ingredients': 'salt', 'cuisine': 'italian'} | 3454 |
| 1 | {'ingredients': 'olive oil', 'cuisine': 'itali... | 3111 |
| 2 | {'ingredients': 'salt', 'cuisine': 'mexican'} | 2720 |
| 3 | {'ingredients': 'salt', 'cuisine': 'southern_us'} | 2290 |
| 4 | {'ingredients': 'salt', 'cuisine': 'indian'} | 1934 |
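
Because the group key is a nested document, the `_id` column holds whole dictionaries, which is why pandas truncates it. One possible way to tidy this up is to flatten each result before building the DataFrame. The `flatten` helper below is hypothetical, and `results` holds two documents copied from the output above.

```python
# Two result documents copied from the output above
results = [
    {"_id": {"ingredients": "salt", "cuisine": "italian"}, "count": 3454},
    {"_id": {"ingredients": "salt", "cuisine": "mexican"}, "count": 2720},
]

def flatten(doc):
    """Merge the nested _id fields with the other top-level fields (illustrative helper)."""
    flat = dict(doc["_id"])
    flat.update({key: value for key, value in doc.items() if key != "_id"})
    return flat

flat_rows = [flatten(doc) for doc in results]
for row in flat_rows:
    print(row)
```

Passing `flat_rows` to `pd.DataFrame` should then give separate `ingredients`, `cuisine`, and `count` columns.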
