# Week 8: Aggregations and MapReduce

# Today

- Small useful skills
- MapReduce
- Aggregations in MongoDB

# Announcements, Updates


## Review

# Toy Box of Useful Skills

## File Paths

`.` - Current folder

`..` - Parent folder

`folder1/folder2` - Directory named `folder2`, within a `folder1` directory, within the current directory

`../data` - Go up one directory, then into the 'data' folder
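A minimal sketch of how these relative paths resolve, using Python's standard `posixpath` module so that the result is the same on any operating system. The folder names are the ones from the examples above, and nothing is read from disk.

```python
import posixpath

# '.' refers to the current folder, so it drops out when the path is normalized
same_folder = posixpath.normpath('./folder1/folder2')
print(same_folder)

# '..' steps up one level: starting in folder1/folder2, '../data' lands in folder1/data
combined = posixpath.join('folder1/folder2', '../data')
resolved = posixpath.normpath(combined)
print(resolved)
```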

## `groupby(...).apply()`

In [1]:
import pandas as pd
example = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
example

   A  B  C
0  a  1  4
1  a  2  6
2  b  3  5


In [2]:
example.groupby('A').apply(lambda x: x.C.max() - x.B.min())

A
a    5
b    2
dtype: int64

## JSON Auto-indent in Jupyter / Colab

Add a line break:

- after open braces
- after commas
- before and after close braces

In [3]:
{ "test": { 
    "key1": "value", 
    "key2": "value" }
}

{'test': {'key1': 'value', 'key2': 'value'}}

## Geographic Queries in MongoDB

In [1]:
from pymongo import MongoClient

# Loading my credentials from a file
with open('credentials.txt', mode='r') as f:
    user, mongopw, cluster_url = [l.strip() for l in f.readlines()]

client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(user, mongopw, cluster_url))
db = client.week10

Create a collection called 'fastfood', and tell Mongo that the 'geometry' key of each document is geographic:

In [2]:
db.fastfood.create_index([('geometry', '2dsphere')])

'geometry_2dsphere'

Import data of fast food restaurants. (Data not posted publicly - you can download from files section of Canvas)

*Thanks Jennings Anderson, PhD student in the EPIC lab, CU Boulder*

In [3]:
import json
with open('../data/fastfood.geojson', encoding='utf-8') as f:
    jsondata = json.load(f)
db.fastfood.insert_many(jsondata)

<pymongo.results.InsertManyResult at 0x1cdf5306d48>

In [None]:
# Delete all documents if needed
db.fastfood.delete_many({})

In [4]:
db.fastfood.estimated_document_count()

42073

Here's what the data looks like:

In [10]:
db.fastfood.find_one({})

{'_id': ObjectId('5ce1aaa15138fc0461a457f5'),
 'type': 'Feature',
 'properties': {'@id': 'relation/1059567',
  'amenity': 'fast_food',
  'building': 'yes',
  'cuisine': 'american',
  'name': 'Sonic Drive-In',
  'type': 'multipolygon'},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-96.1059697, 35.9987409],
    [-96.1059488, 35.9982857],
    [-96.1059014, 35.9982873],
    [-96.1059078, 35.9984123],
    [-96.1058381, 35.9984147],
    [-96.1058317, 35.9982896],
    [-96.1057768, 35.9982915],
    [-96.1057978, 35.9987468],
    [-96.1058574, 35.9987447],
    [-96.1058517, 35.9986359],
    [-96.1059198, 35.9986336],
    [-96.1059255, 35.9987424],
    [-96.1059697, 35.9987409]]]},
 'id': 'relation/1059567'}

To find restaurants near me, create a point according to the GeoJSON schema:

In [11]:
my_location =  { 'type' : "Point" ,
                 'coordinates' : [-104.961896, 39.676617] 
                }

In [13]:
query = {
    'geometry': {
        '$near': {
            '$geometry' : my_location
        }
    }
}
results = db.fastfood.find(query).limit(10)
list(results)

[{'_id': ObjectId('5ce1aaa15138fc0461a4b090'),
  'type': 'Feature',
  'properties': {'@id': 'node/631845194',
   'amenity': 'fast_food',
   'name': "Ben and Jerry's Ice Cream"},
  'geometry': {'type': 'Point', 'coordinates': [-104.9597847, 39.6785848]},
  'id': 'node/631845194'},
 {'_id': ObjectId('5ce1aaa15138fc0461a4b0ac'),
  'type': 'Feature',
  'properties': {'@id': 'node/638620276',
   'addr:housenumber': '2081',
   'addr:street': 'South University Boulevard',
   'amenity': 'fast_food',
   'name': "Mustard's Last Stand"},
  'geometry': {'type': 'Point', 'coordinates': [-104.9596249, 39.6789283]},
  'id': 'node/638620276'},
 {'_id': ObjectId('5ce1aaa15138fc0461a4b5a7'),
  'type': 'Feature',
  'properties': {'@id': 'node/976518442',
   'amenity': 'fast_food',
   'description': 'sandwich shop',
   'name': "Jimmy John's"},
  'geometry': {'type': 'Point', 'coordinates': [-104.959803, 39.6790585]},
  'id': 'node/976518442'},
 {'_id': ObjectId('5ce1aaa15138fc0461a4af5f'),
  'type': 'Feat

See Jennings Anderson's website for [examples of web tools built with Mongo and Geospatial data](https://www.townsendjennings.com/maps/)

![Screenshot of map tools](../images/jennings-maps.png)
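The `$near` query above returns the closest documents no matter how far away they are. As a hedged sketch, the query can also be bounded with `$maxDistance`, which is measured in meters for GeoJSON points on a `2dsphere` index. The helper name `near_query` and the 1000 meter radius are assumptions for illustration, not part of the original notebook.

```python
def near_query(lon, lat, max_meters=None):
    """Build a $near query on the 'geometry' field (hypothetical helper)."""
    # GeoJSON points list longitude first, then latitude
    point = {"type": "Point", "coordinates": [lon, lat]}
    near = {"$geometry": point}
    if max_meters is not None:
        # Only match documents within this many meters of the point
        near["$maxDistance"] = max_meters
    return {"geometry": {"$near": near}}

query = near_query(-104.961896, 39.676617, max_meters=1000)
print(query)

# With a live connection, this would be used the same way as before:
# results = db.fastfood.find(query).limit(10)
```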

## Geographic data in Pandas

There's a spin-off from Pandas called `GeoPandas`.

[Tutorial: Getting Started on Geospatial Analysis with Python, GeoJSON and GeoPandas](https://www.twilio.com/blog/2017/08/geospatial-analysis-python-geojson-geopandas.html)

# MapReduce

MapReduce is a framework for processing really large datasets, distributed across multiple threads, processes, or machines.

Two parts:

*Map*: Split the input into segments, and process each segment independently.

*Reduce*: Simplify the individually processed segments into one output.

(Reduce is not simply the 'combine' step that we saw with split-apply-combine.)

### Archetypal Example: Counting Words

![MapReduce Example](../images/mapreduce_example2.png)

*via http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf*
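The word-count example can be sketched in plain Python. This is only an illustration of the idea on a single machine, with made-up input text; a real MapReduce system would run the map calls on separate workers.

```python
from collections import Counter
from functools import reduce

# Three input segments (made-up example text)
documents = ["deer bear river", "car car river", "deer car bear"]

def map_words(text):
    """Map: emit a (word, 1) pair for every word in one segment."""
    return [(word, 1) for word in text.split()]

def reduce_counts(totals, pair):
    """Reduce: fold one (word, count) pair into the running totals."""
    word, count = pair
    totals[word] += count
    return totals

# Each segment is mapped independently, then all pairs are reduced to one result
mapped = [pair for doc in documents for pair in map_words(doc)]
word_counts = reduce(reduce_counts, mapped, Counter())
print(dict(word_counts))
```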

Some examples to consider:
    
- *Sorting*: How would we sort a *reeaaally* big list? e.g. Sort everything on Amazon by price.
- *Searching*: How do we determine how often a word shows up in 1 billion web pages?

MapReduce generally works by simplifying the problem to key-value pairs. 

MongoDB Example, via https://docs.mongodb.com/manual/tutorial/map-reduce-examples/: **Return the Total Price Per Customer**

Data structure:
```
{
     cust_id: "abc123",
     ord_date: new Date("Oct 04, 2012"),
     status: 'A',
     price: 25,
     items: [ { sku: "mmm", qty: 5, price: 2.5 },
              { sku: "nnn", qty: 5, price: 2.5 } ]
}
```

- Map step: return key-value pair of `cust_id: price`
- Reduce step: sum all the values for alike keys
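The two steps above can be sketched in plain Python. The orders here are made-up sample data that keep only the two fields the steps need; this is an illustration of the logic, not MongoDB's own map-reduce API.

```python
from collections import defaultdict

# Made-up orders, reduced to the fields used by the map step
orders = [
    {"cust_id": "abc123", "price": 25},
    {"cust_id": "abc123", "price": 10},
    {"cust_id": "xyz789", "price": 7},
]

# Map step: emit a (cust_id, price) pair for each order
pairs = [(order["cust_id"], order["price"]) for order in orders]

# Collect the values that share a key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce step: sum the values for each key
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)
```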

**Calculate Average Quantity Per Item**

Data structure:
```
{
     cust_id: "abc123",
     ord_date: new Date("Oct 04, 2012"),
     status: 'A',
     price: 25,
     items: [ { sku: "mmm", qty: 5, price: 2.5 },
              { sku: "nnn", qty: 5, price: 2.5 } ]
}
```

- Map step: return key value pairs where the key is the 'sku' of each item, and the value is an object of `{count:1, quantity:  X}`
- Reduce step: Sum count and quantity for each key
- Finalize: Divide quantity/count for an average
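A plain-Python sketch of the map, reduce, and finalize steps listed above, using made-up orders that keep only the `sku` and `qty` of each item. It illustrates the logic rather than MongoDB's own map-reduce API.

```python
# Made-up orders; each has a list of items as in the data structure above
orders = [
    {"items": [{"sku": "mmm", "qty": 5}, {"sku": "nnn", "qty": 5}]},
    {"items": [{"sku": "mmm", "qty": 1}]},
]

# Map: one (sku, {count, quantity}) pair per item in every order
pairs = [
    (item["sku"], {"count": 1, "quantity": item["qty"]})
    for order in orders
    for item in order["items"]
]

# Reduce: sum count and quantity for each sku
reduced = {}
for sku, value in pairs:
    running = reduced.setdefault(sku, {"count": 0, "quantity": 0})
    running["count"] += value["count"]
    running["quantity"] += value["quantity"]

# Finalize: divide quantity by count to get the average
averages = {sku: v["quantity"] / v["count"] for sku, v in reduced.items()}
print(averages)
```

Carrying both `count` and `quantity` through the reduce step matters because partial results can be combined by addition, whereas partial averages cannot simply be averaged again.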

![](../images/mapreduce_example.jpg)

*https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm*

# Lab Exercises

Work on the first three questions of the lab.

Snowball help:
Once you get an answer look around and see if anybody is still trying to get there. Help them out!

# Data Aggregation Pipeline

The *aggregation pipeline* in MongoDB chains data processing steps together: each stage takes the documents produced by the previous stage and passes its output on to the next.

In [98]:
from IPython.display import HTML
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_tc6h037o&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_gxm0ypu7" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

Things you may want to do:

- **match**: Select a subset of data (as you can do with 'find')
- **sort**: Order data by the values of a certain key
- **group**: Group data based on a key - like 'groupby' in Pandas
- **limit**: Trim the number of documents in the dataset
- **unwind**: Deconstruct an array, so that there is a document for every value of the array
- **project**: Select specific fields (like with the second argument to 'find')

These are in fact the names of *stages* of the pipeline:

- **\$match**: Select a subset of data (as you can do with 'find')
- **\$sort**: Order data by the values of a certain key
- **\$group**: Group data based on a key - like 'groupby' in Pandas
- **\$limit**: Trim the number of documents in the dataset
- **\$unwind**: Deconstruct an array, so that there is a document for every value of the array
- **\$project**: Select specific fields (like with the second argument to 'find')

Let's consider these in detail, in practice.

In [6]:
db = client.week9
db.cooking.find_one({})

{'_id': ObjectId('5ec70b6b585a751f26094e32'),
 'cuisine': 'greek',
 'id': 10259,
 'ingredients': ['romaine lettuce',
  'black olives',
  'grape tomatoes',
  'garlic',
  'pepper',
  'purple onion',
  'seasoning',
  'garbanzo beans',
  'feta cheese crumbles']}

Basics of the aggregations pipeline:

`db.collectionName.aggregate(pipeline)`

where

```python
pipeline = [
    stage1,
    stage2,
    ...
    and_so_on
]
```

# $match

Same as `find`, but a good place to start

In [99]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_oxirky6a&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_1zl9hx01" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

In [10]:
pipeline = [
    {
        "$match": { "cuisine": "greek" }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': ['romaine lettuce',
   'black olives',
   'grape tomatoes',
   'garlic',
   'pepper',
   'purple onion',
   'seasoning',
   'garbanzo beans',
   'feta cheese crumbles']},
 {'_id': ObjectId('5ec70b98585a751f26094e8f'),
  'cuisine': 'greek',
  'id': 34471,
  'ingredients': ['ground pork',
   'finely chopped fresh parsley',
   'onions',
   'salt',
   'vinegar',
   'caul fat']}]

# $sort

Provide an object where the field names to sort by are the keys, and the values specify the direction: `1` for ascending order or `-1` for descending order.

Here, we sort in reverse alphabetical order on 'cuisine' (so 'vietnamese' comes first) - we'll try something more useful shortly.

In [12]:
pipeline = [
    {
        "$sort": { "cuisine": -1 }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:1]

[{'_id': ObjectId('5ec70b98585a751f26094e45'),
  'cuisine': 'vietnamese',
  'id': 8152,
  'ingredients': ['soy sauce',
   'vegetable oil',
   'red bell pepper',
   'chicken broth',
   'yellow squash',
   'garlic chili sauce',
   'sliced green onions',
   'broccolini',
   'salt',
   'fresh lime juice',
   'cooked rice',
   'chicken breasts',
   'corn starch']}]

# $project

Select the fields that you want in the results (with `1`), or exclude fields (with `0`).

In [44]:
pipeline = [
    { "$match": { "cuisine": "greek" }  },
    {
        "$project": {"_id": 0, "ingredients": 0}
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'cuisine': 'greek', 'id': 10259}, {'cuisine': 'greek', 'id': 34471}]

In [21]:
pipeline = [
    { "$match": { "cuisine": "greek" } },
    { "$project": {"cuisine": 1}  }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'_id': ObjectId('5ec70b6b585a751f26094e32'), 'cuisine': 'greek'},
 {'_id': ObjectId('5ec70b98585a751f26094e8f'), 'cuisine': 'greek'}]

`$project` is usually for your own benefit (it's more readable!), and that's a fine reason to use it. If you're only focused on one or two pieces of information, it's easier to see that information with `$project`.

## Lab Exercise

Work on the next question of the lab (writing an aggregation pipeline to answer the previous question).

# $limit

Same as `limit(n)`.

In [100]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_aukx8gcg&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_uh9di9z4" width="640" height="360" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

In [42]:
pipeline = [
    {
        "$match": { "cuisine": "greek" }
    },
    {
        "$limit": 1
    }
]

results = db.cooking.aggregate(pipeline)
len(list(results))

1

# $unwind

Expand each item in a list to its own document.

Before:
    
```python
[{
  'cuisine': 'greek',
  'ingredients': ['romaine lettuce',
   'black olives',
   'feta cheese crumbles']
}]
```

After:

```python
[
    {'cuisine': 'greek', 'ingredients': 'romaine lettuce'},
    {'cuisine': 'greek', 'ingredients': 'black olives'},
    {'cuisine': 'greek', 'ingredients': 'feta cheese crumbles'}
]
```
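The before and after above can be reproduced with a small plain-Python sketch. The function name `unwind` is just an illustration of the idea, not a pymongo call, and it assumes every document has the list field; MongoDB's own `$unwind` by default drops documents where the field is missing or the array is empty.

```python
def unwind(documents, field):
    """Yield one copy of each document per element of the list in `field`."""
    for doc in documents:
        for value in doc[field]:
            # Copy the document, replacing the list with a single value
            yield {**doc, field: value}

recipes = [{
    'cuisine': 'greek',
    'ingredients': ['romaine lettuce', 'black olives', 'feta cheese crumbles']
}]

unwound = list(unwind(recipes, 'ingredients'))
for doc in unwound:
    print(doc)
```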

Step by step: what do the results below represent?

In [52]:
pipeline = [
    {"$match": {"cuisine": "greek" }},
    { "$limit": 1 },
    { "$unwind": '$ingredients' }
]
 
results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'romaine lettuce'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'black olives'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'grape tomatoes'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'garlic'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'pepper'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'purple onion'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'seasoning'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'garbanzo beans'},
 {'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'gre

**When you're referring to a field in a value (rather than a *key*), precede the name with '\$'**

'cuisine' is referred to in a key here:

```
{ "$match": { "cuisine": "greek" } }
```

'ingredients' is referred to in a value:

```
{ "$unwind": "$ingredients" }
```

*How does this differ?*

```
pipeline = [
    { "$match": { "cuisine": "greek" } },
    { "$limit": 1 },
    { "$unwind": '$ingredients' }
]
```
vs.

In [54]:
pipeline = [
    { "$match": { "cuisine": "greek" } },
    { "$unwind": '$ingredients' },
    { "$limit": 1 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': ObjectId('5ec70b6b585a751f26094e32'),
  'cuisine': 'greek',
  'id': 10259,
  'ingredients': 'romaine lettuce'}]

```python
[
    { "$match": { "cuisine": "greek" } },
    { "$unwind": '$ingredients' },
    { "$limit": 1 }
]
```

This is quick, because MongoDB is smart - it sees that there's a `$limit` after the `$unwind`, so internally it optimizes the search and doesn't unwind all 40k results.
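To see why stage order changes the result, here is a plain-Python analogy with made-up recipes. It is a sketch of the idea only: the `unwind` generator and `islice` stand in for `$unwind` and `$limit`, and say nothing about how MongoDB implements them internally.

```python
from itertools import islice

def unwind(documents, field):
    """Yield one copy of each document per element of the list in `field`."""
    for doc in documents:
        for value in doc[field]:
            yield {**doc, field: value}

recipes = [
    {'id': 1, 'ingredients': ['lettuce', 'olives', 'feta']},
    {'id': 2, 'ingredients': ['pork', 'parsley']},
]

# $limit 1, then $unwind: one recipe, expanded into all of its ingredients
limit_then_unwind = list(unwind(islice(recipes, 1), 'ingredients'))

# $unwind, then $limit 1: only the first unwound document survives
unwind_then_limit = list(islice(unwind(recipes, 'ingredients'), 1))

print(len(limit_then_unwind), len(unwind_then_limit))
```

Because the generator is lazy, the second version stops after producing one document and never touches the second recipe, which is loosely analogous to the shortcut described above.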

### Exercises

Do the next question in the lab. (How much do the following categories show up?)

# $group

It's our split-apply-combine pattern in MongoDB.

In [101]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_dw7mhcxx&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_42xbiiba" width="640" height="360" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

In [55]:
pipeline = [
    { 
        "$group": {
            "_id": "$cuisine",
            "num_matching": { "$sum": 1 }
        }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': 'greek', 'num_matching': 1175},
 {'_id': 'british', 'num_matching': 804},
 {'_id': 'southern_us', 'num_matching': 4320},
 {'_id': 'cajun_creole', 'num_matching': 1546},
 {'_id': 'spanish', 'num_matching': 989},
 {'_id': 'mexican', 'num_matching': 6438},
 {'_id': 'italian', 'num_matching': 7838},
 {'_id': 'french', 'num_matching': 2646},
 {'_id': 'japanese', 'num_matching': 1423},
 {'_id': 'filipino', 'num_matching': 755},
 {'_id': 'irish', 'num_matching': 667},
 {'_id': 'moroccan', 'num_matching': 821},
 {'_id': 'thai', 'num_matching': 1539},
 {'_id': 'chinese', 'num_matching': 2673},
 {'_id': 'korean', 'num_matching': 830},
 {'_id': 'vietnamese', 'num_matching': 825},
 {'_id': 'russian', 'num_matching': 489},
 {'_id': 'jamaican', 'num_matching': 526},
 {'_id': 'brazilian', 'num_matching': 467},
 {'_id': 'indian', 'num_matching': 3003}]

Sort and limit:

In [58]:
pipeline = [
    { 
        "$group": {
            "_id": "$cuisine",
            "num_matching": { "$sum": 1 }
        } 
    },
    {
        "$sort": {"num_matching": -1 }
    },
    {
        "$limit": 3
    }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': 'italian', 'num_matching': 7838},
 {'_id': 'mexican', 'num_matching': 6438},
 {'_id': 'southern_us', 'num_matching': 4320}]

In [64]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": "$ingredients",
            "count": { "$sum": 1 }
        }
    },
    { "$sort": {"count": -1 }  },
    { "$limit": 10 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': 'salt', 'count': 18049},
 {'_id': 'onions', 'count': 7972},
 {'_id': 'olive oil', 'count': 7972},
 {'_id': 'water', 'count': 7457},
 {'_id': 'garlic', 'count': 7380},
 {'_id': 'sugar', 'count': 6434},
 {'_id': 'garlic cloves', 'count': 6237},
 {'_id': 'butter', 'count': 4848},
 {'_id': 'ground black pepper', 'count': 4785},
 {'_id': 'all-purpose flour', 'count': 4632}]

`$group` has the following pattern:

```python
"$group": {
            "_id": GROUPING_CONDITIONS,
            FIELD: ACCUMULATOR
        }
```

You always need an `_id`. It can be a string such as `"$cuisine"` (to group by a single field), an object where the keys are new names and the values are the fields that you're grouping by, or `None`, which groups the entire dataset into a single group.

The `_id` can be an object with multiple values.

In [69]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": {"cuisine": "$cuisine", "ingredients": "$ingredients"},
            "num_matching": { "$sum": 1 }
        }
    },
    { "$sort": {"num_matching": -1 }  },
    { "$limit": 4 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': {'cuisine': 'italian', 'ingredients': 'salt'}, 'num_matching': 3454},
 {'_id': {'cuisine': 'italian', 'ingredients': 'olive oil'},
  'num_matching': 3111},
 {'_id': {'cuisine': 'mexican', 'ingredients': 'salt'}, 'num_matching': 2720},
 {'_id': {'cuisine': 'southern_us', 'ingredients': 'salt'},
  'num_matching': 2290}]

Here I gave the id keys the same names as the incoming fields:
    
```
"_id": {"cuisine": "$cuisine", "ingredients": "$ingredients"}
```

But they can be named whatever you like. e.g.

```
"_id": {"foo": "$cuisine", "bar": "$ingredients"}
```

Count total number of ingredients by grouping on `None`:

In [71]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": None,
            "num_matching": { "$sum": 1 }
        }
    },
    { "$sort": {"num_matching": -1 }  },
    { "$limit": 3 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': None, 'num_matching': 428275}]
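For comparison, the three kinds of `_id` shown above can be mimicked in plain Python with `collections.Counter`. This is a sketch on made-up recipes to show what each grouping counts; it is not how MongoDB computes the result.

```python
from collections import Counter

# Made-up recipes with the same shape as the cooking collection
recipes = [
    {'cuisine': 'greek', 'ingredients': ['feta', 'salt']},
    {'cuisine': 'greek', 'ingredients': ['salt']},
    {'cuisine': 'italian', 'ingredients': ['salt', 'olive oil']},
]

# Like "_id": "$cuisine" -> one count per cuisine
by_cuisine = Counter(r['cuisine'] for r in recipes)

# Like $unwind, then "_id": {"cuisine": ..., "ingredients": ...}
# -> one count per (cuisine, ingredient) combination
by_pair = Counter((r['cuisine'], i) for r in recipes for i in r['ingredients'])

# Like $unwind, then "_id": None -> a single count over everything
total = sum(1 for r in recipes for _ in r['ingredients'])

print(by_cuisine, by_pair, total)
```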

## Lab Exercise

Next question - writing an aggregation pipeline that needs '$group' to count categories in the data.

In [102]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_vayll4b0&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_arvdpbxu" width="640" height="360" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

## Groupby operators

- `$sum`
  - Using `{ "$sum": 1 }` returns a count, but you can also sum a numeric set of values with `{ "$sum": "$keyName" }`
- `$avg`
- `$first`
- `$last`
- `$min`
- `$max`
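As a rough guide to what each operator computes, here is a plain-Python sketch over made-up rows, where `n` is a hypothetical numeric field. The dictionary keys mirror the operator names; the code is an analogy, not pymongo.

```python
from collections import defaultdict
from statistics import mean

# Made-up documents with a numeric field 'n' (hypothetical name)
rows = [
    {'cuisine': 'greek', 'n': 9},
    {'cuisine': 'greek', 'n': 6},
    {'cuisine': 'irish', 'n': 4},
]

# Collect the values of 'n' for each group, in the order they arrive
groups = defaultdict(list)
for row in rows:
    groups[row['cuisine']].append(row['n'])

# Apply each accumulator to every group's values
summary = {
    key: {
        'sum': sum(values),      # like {"$sum": "$n"}
        'avg': mean(values),     # like {"$avg": "$n"}
        'first': values[0],      # like {"$first": "$n"}
        'last': values[-1],      # like {"$last": "$n"}
        'min': min(values),      # like {"$min": "$n"}
        'max': max(values),      # like {"$max": "$n"}
    }
    for key, values in groups.items()
}
print(summary)
```

Note that `$first` and `$last` depend on the order in which documents reach `$group`, so they are mostly useful after a `$sort` stage.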

How would you get the average number of ingredients per cuisine type?

- First step: how do you get a count of ingredients for each individual recipe?


In [167]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": {'cuisine':'$cuisine', 'id': '$id'},
            "num_ingredients": { "$sum": 1 }
        }
    },
    { 
        "$group": {
            "_id": {'cuisine':'$_id.cuisine'},
            "average_ingredients": { "$avg": "$num_ingredients" }
        }
    },
    {
        "$sort": { "average_ingredients": 1}
    }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': {'cuisine': 'irish'}, 'average_ingredients': 9.299850074962519},
 {'_id': {'cuisine': 'brazilian'}, 'average_ingredients': 9.5203426124197},
 {'_id': {'cuisine': 'southern_us'}, 'average_ingredients': 9.634953703703705},
 {'_id': {'cuisine': 'british'}, 'average_ingredients': 9.708955223880597},
 {'_id': {'cuisine': 'japanese'}, 'average_ingredients': 9.735066760365426},
 {'_id': {'cuisine': 'french'}, 'average_ingredients': 9.817838246409675},
 {'_id': {'cuisine': 'italian'}, 'average_ingredients': 9.909032916560347},
 {'_id': {'cuisine': 'filipino'}, 'average_ingredients': 10.0},
 {'_id': {'cuisine': 'greek'}, 'average_ingredients': 10.182127659574467},
 {'_id': {'cuisine': 'russian'}, 'average_ingredients': 10.224948875255624},
 {'_id': {'cuisine': 'spanish'}, 'average_ingredients': 10.42366026289181},
 {'_id': {'cuisine': 'mexican'}, 'average_ingredients': 10.87744641192917},
 {'_id': {'cuisine': 'korean'}, 'average_ingredients': 11.28433734939759},
 {'_id': {'cuisine': 'c