# Scripting Week 9: MongoDB and Semi-Structured Data

In [1]:
import pandas as pd

## Agenda

- Lab questions from last week
- JSON
- MongoDB
- Work time - lab and final projects
- Out at 5:30

## Announcements

## Library and Data Curation Skills

From *Bishop, B., Cowan, M., Collier, H., Mayernik, M., & Organisciak, P. (2023). Job Analyses of Earth Science Data Managers: A Survey Validation of Competencies to Inform Curricula in Research Data Management Education. Journal of Education for Library and Information Science, 64(2), 104–119. https://doi.org/10.3138/jelis-2021-0023*

Please indicate any software and tools you use in your position (check all that apply)

- Jupyter Notebook 23
- RStudio 22
- Any GIS 21
- NetCDF 17
- Other 14
- Scrum 8
- Apache Spark 4
- Apache Hadoop 3
- Total Selections 112

---
Other: Eclipse (2), IDLE (2), Notepad ++ (2), GIT (2), Excel (2), Confluence (2), Kibana, Oracle, FreeFileSync, GeoMapApp, PostgreSQL, Panoply, ToolsUI, Linux shell, Google Docs, Zoom, RazorSQL, GitHub, JIRA, VisualFoxPro

Please indicate any programming languages and/or scripting that you use in your position (check all that apply)

- Markup Languages (e.g., HTML, XML)  34
- Python 26
- R 24
- Other 18
- JavaScript 13
- Total Selections 115

---
Other: LAMP, API, ETL, PERL, SQL (2), GRASS GIS, Visual FoxPro (2), C, C#, PHP (2), BASH shell (2), Java (4), MATLAB

*From Nickoal Eichmann-Kalwara: [Digital Scholarship Activities at CU Boulder: Preliminary Results of a Campus Survey (2019)](https://osf.io/vpdhc/)*

![](https://github.com/organisciak/Scripting-Course/blob/master/images/questions1.png?raw=true)

![](https://github.com/organisciak/Scripting-Course/blob/master/images/questions2.png?raw=true)

# Semi-structured data

Semi-structured data has a structure, but it is specific to the needs of the data.

Tabular data always has two dimensions - rows and columns (aka 'records' and 'fields') - semi-structured data is less predictable.

### Common Semi-Structured Formats

JSON
- Hierarchical
- Organized around key/value pairs and lists of values

XML
- Hierarchical, enclosed
- Organized with tags, properties, and values

### XML, briefly

XML has tags and attributes, representing elements, and content

![Example XML](../images/xml1.png)

TAGS

- `<catalog>`
- `<book>`
- `<author>`

ELEMENTS
- Everything from `<book>` to `</book>`

ATTRIBUTES
- The ‘id’ of `<book>`

CONTENT
- “Gambardella, Matthew”
- “XML Developer’s Guide”
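To see tags, attributes, and content from code, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`. The XML string is a hypothetical stand-in modeled on the catalog example above; the `id` value `bk101` and the `<title>` tag are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet, modeled on the catalog/book example above
xml_text = """
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
</catalog>
"""

root = ET.fromstring(xml_text)        # root element: <catalog>
book = root.find("book")              # first <book> element under the root
author = book.find("author").text     # content of <author>
title = book.find("title").text       # content of <title>

print(root.tag)                       # the tag name of the root element
print(book.attrib["id"])              # the 'id' attribute of <book>
print(author)
print(title)
```

Tags become `.tag`, attributes live in the `.attrib` dictionary, and content is in `.text`.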

## JSON

JSON is made up of a few data value types:

- string
- number (like a float in Python)
- object (named collection, like a dictionary in Python)
- array  (unnamed collection, like a list in Python)
- boolean
- null

Our 'containers' are objects and arrays. They can hold any of the other data types, *including other arrays and objects*.
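A quick sketch of how each JSON type lands in Python when parsed with the built-in `json` module. The key names here are made up for illustration. One detail worth noticing: JSON has a single number type, but Python's `json` module gives you an `int` when the number has no decimal part.

```python
import json

# One value of each JSON type (keys are arbitrary, for illustration only)
text = '{"s": "hi", "n": 1.5, "i": 3, "o": {"k": "v"}, "a": [1, 2], "b": true, "z": null}'
data = json.loads(text)

# Print the Python type that each JSON value was converted to
for key, value in data.items():
    print(key, type(value).__name__, value)
```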


### Object

Key value pairs, surrounded by curly braces.

- `:` separates key and value, `,` separates the item pairs.
- Keys have to be strings, values can be anything.

### Arrays

Nearly identical to lists in Python. Values, surrounded by square brackets.

- `,` separates items.
- Values can be any data type.

### Objects 

basic form:

```json
{
  "name": "Jill"
}
```

Adding key-value pairs:
    
```json
{
  "name": "Jill",
  "profession": "Juggler"
}
```

Values can be other types:
    
```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51
}
```

JSON is language-agnostic, so some things are different than in Python. e.g. *true / false* (lowercase) rather than Python's *True / False*:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true
}
```

The values can be arrays and objects - that's how things start to get deep:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"]
}
```

Nested Object:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": {
      "Tri-City Juggling Symposium 2010": "Technical Program, Silver",
      "Ballympics 2009": "Best Overall, Runner-up"
  }
}
```

List of objects, why not?

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": [
        {
            "event": "Tri-City Juggling Symposium",
             "year": 2010,
             "awards": ["Technical Program, Silver"]
        },
        {
            "event": "Ballympics",
             "year": 2009,
             "awards": ["Best Overall, Runner-up"]
        }
    ]
}
```

Whitespace doesn't matter:

```json
{"name": "Jill", "profession": "Juggler", "age": 51, "active": true, "tools": ["ball", "handkerchief", "flaming motorcycle"], "awards": [{"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]}, {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}]}
```
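One way to convince yourself: parse a compact version and a spread-out version of the same JSON and compare. This is a small sketch with a shortened record, using the built-in `json` module.

```python
import json

compact = '{"name": "Jill", "tools": ["ball", "handkerchief"]}'
spaced = """
{
  "name": "Jill",
  "tools": [
    "ball",
    "handkerchief"
  ]
}
"""

# Both strings parse to the same Python dictionary
same = json.loads(compact) == json.loads(spaced)
print(same)
```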

### Reading in Python

In [2]:
person = {"name": "Jill", "profession": "Juggler", "age": 51, "active": True,
          "tools": ["ball", "handkerchief", "flaming motorcycle"], 
          "awards": [
              {"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]},
              {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}
          ]}

Note that Python True/False is capitalized (though JSON as saved to file uses true/false)
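The built-in `json` module handles that translation in both directions. A minimal sketch, using a trimmed-down record; the `manager` key is a made-up addition to show `None` becoming `null`.

```python
import json

# 'manager' is a hypothetical key, added only to show None -> null
person = {"name": "Jill", "active": True, "manager": None}

text = json.dumps(person)       # Python -> JSON string
print(text)                     # {"name": "Jill", "active": true, "manager": null}

round_trip = json.loads(text)   # JSON string -> Python
print(round_trip == person)
```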

In [3]:
person['name']

'Jill'

In [4]:
a = person['tools']
a[0]

'ball'

In [5]:
person['awards'][0]['event']

'Tri-City Juggling Symposium'

### Exercises

Develop a JSON structure for representing the following:

- a directory of current mayors, organized by cities and subdivided by state
    - Try with 3 cities from 2 states
- a collection of movies, each with basic metadata (e.g. title, director, year) and with viewer opinion information (ratings or reviews)
   - Try with *Black Panther* and *Avengers: Endgame*

# MongoDB

A semi-structured non-relational (NoSQL) database, which uses a JSON-like format for storing information.

### Terminology

<center>**Database Management System (MongoDB)**</center>
<center>*can have one or more*</center>

<center>**Databases**</center>

<center>*can have one or more*</center>
<center>**Collections**</center>

<center>*can have one or more*</center>
<center>**Documents**</center>

Compare to relational databases - How do the terms align?

*Documents* are JSON-like objects, *Collections* are lists of those objects.

```
[
    {
      "name": "Jill",
      "profession": "Juggler"
    },
    { "name": "Jack",
      "profession": "Unemployed"
    }
]
```

Connecting to a database called 'scripting'. I'll put all my collections there this week.

In [47]:
from pymongo import MongoClient

# Loading my credentials from a file
with open('credentials.txt', mode='r') as f:
    user, mongopw, cluster_url = [l.strip() for l in f.readlines()]

client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(user, mongopw, cluster_url))
db = client.scripting

In [49]:
# Reset for future years
x = db.drop_collection('example')
x = db.drop_collection('example2')

New `example` collection:

In [50]:
collection = db.example

## Insert One

In [51]:
doc = {
  "name": "Jill",
  "age": 51,
  "profession": "Juggler"
}
collection.insert_one(doc) 

InsertOneResult(ObjectId('682b57d938c40268da87d94f'), acknowledged=True)

## Retrieving Documents

MongoDB does retrieval with `find` or `find_one`, using a JSON-query.

`find_one` gives a JSON object back:

In [52]:
collection.find_one(
    {"age": 51}
)

{'_id': ObjectId('682b57d938c40268da87d94f'),
 'name': 'Jill',
 'age': 51,
 'profession': 'Juggler'}

`find` returns an instance of an object that can be iterated over: 

In [53]:
results = collection.find({'name': 'Jill'})
results

<pymongo.synchronous.cursor.Cursor at 0x10f067d90>

In [54]:
results = collection.find({'name': 'Jill'})
for result in results:
    print(result)

{'_id': ObjectId('682b57d938c40268da87d94f'), 'name': 'Jill', 'age': 51, 'profession': 'Juggler'}


*Try iterating over the `results` variable again without doing another search. What happens?*

In [55]:
for result in results:
    print(result)

The `cursor` is a pointer to the database, so you're not holding all the data in memory - you're just pointing to it. Good for really large datasets!
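The empty second loop is the same thing you see with any one-pass iterator in Python. As a rough analogy (this is a plain Python iterator standing in for a cursor, not pymongo itself):

```python
# Stand-in for a cursor: a plain Python iterator over two fake documents
fake_cursor = iter([{"name": "Jill"}, {"name": "Jack"}])

first_pass = [doc["name"] for doc in fake_cursor]
second_pass = [doc["name"] for doc in fake_cursor]  # nothing left to read

print(first_pass)
print(second_pass)
```

Once the iterator has been walked to the end, there is nothing more to hand back, so you need to run the search again to get a fresh one.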

Alternately, you can convert everything to a list, but don't do this if you have a lot of data!

In [56]:
results = collection.find({'name': 'Jill'})
list_of_results = list(results)
list_of_results

[{'_id': ObjectId('682b57d938c40268da87d94f'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'}]

## Insert Many

In [57]:
docs = [{ "name": "Jack", "age": 50, "profession": "Unemployed" },
        { "name": "Jun Ho", "age": 34, "profession": "Juggler" }
       ]
collection.insert_many(docs)

InsertManyResult([ObjectId('682b57e038c40268da87d950'), ObjectId('682b57e038c40268da87d951')], acknowledged=True)

In [58]:
results = collection.find({"profession": "Juggler"})
list(results)

[{'_id': ObjectId('682b57d938c40268da87d94f'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('682b57e038c40268da87d951'),
  'name': 'Jun Ho',
  'age': 34,
  'profession': 'Juggler'}]

### Count

In [59]:
collection.estimated_document_count()

3

## Retrieving Everything

*What would our search be to get everything?*

In [60]:
results = collection.find({})
list(results)

[{'_id': ObjectId('682b57d938c40268da87d94f'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('682b57e038c40268da87d950'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'},
 {'_id': ObjectId('682b57e038c40268da87d951'),
  'name': 'Jun Ho',
  'age': 34,
  'profession': 'Juggler'}]

## `Find` by Example

Exact match:

In [61]:
results = collection.find({
            "age": 50
        })
list(results)

[{'_id': ObjectId('682b57e038c40268da87d950'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

Multiple conditions:

In [62]:
results = collection.find({
            "age": 50,
            "profession": "Unemployed"
        })
list(results)

[{'_id': ObjectId('682b57e038c40268da87d950'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

Greater than (`$gt`), less than (`$lt`), greater than or equal to (`$gte`), less than or equal to (`$lte`):

In [63]:
results = collection.find({
            "age": { "$gt": 50 }
        })
list(results)

[{'_id': ObjectId('682b57d938c40268da87d94f'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'}]

In [64]:
results = collection.find({
            "age": { "$gte": 50 }
        })
list(results)

[{'_id': ObjectId('682b57d938c40268da87d94f'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('682b57e038c40268da87d950'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

```json
{ "$gt": 50 }
```

The `$` tells MongoDB that this is a special operator, not simply a field named `gt`

And more:
    
- Combine expressions with `$or`, `$and`
- Match against multiple values with `$in`, `$all`
    
*What do these do? Can we figure out how to use them?*
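As a starting point, here is the shape of an `$in` query and an `$or` query. The two query dictionaries are the kind of thing you would pass to `collection.find(...)`. So the sketch runs without a database, it includes a toy `matches` function (an assumption for illustration, covering only a few operators, and not how MongoDB works internally) applied to copies of our three documents.

```python
docs = [
    {"name": "Jill", "age": 51, "profession": "Juggler"},
    {"name": "Jack", "age": 50, "profession": "Unemployed"},
    {"name": "Jun Ho", "age": 34, "profession": "Juggler"},
]

def matches(doc, query):
    """Toy matcher for a small subset of query operators (illustration only)."""
    for key, cond in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif key == "$and":
            if not all(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            value = doc.get(key)
            for op, arg in cond.items():
                if op == "$in" and value not in arg:
                    return False
                if op == "$gt" and not (value is not None and value > arg):
                    return False
        elif doc.get(key) != cond:
            return False
    return True

# $in: the field's value must be one of the listed values
in_query = {"name": {"$in": ["Jill", "Jack"]}}
# $or: at least one of the listed sub-queries must match
or_query = {"$or": [{"age": {"$gt": 50}}, {"name": "Jun Ho"}]}

in_names = [d["name"] for d in docs if matches(d, in_query)]
or_names = [d["name"] for d in docs if matches(d, or_query)]
print(in_names)
print(or_names)
```

Note where the operator goes: `$in` sits inside the value for a field, while `$or` sits at the top level and takes a list of complete queries.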

### Specifying the return fields

The 2nd argument to `find` or `find_one` specifies which fields to include:

In [66]:
results = collection.find({}, { "name": 1 })
list(results)

[{'_id': ObjectId('682b57d938c40268da87d94f'), 'name': 'Jill'},
 {'_id': ObjectId('682b57e038c40268da87d950'), 'name': 'Jack'},
 {'_id': ObjectId('682b57e038c40268da87d951'), 'name': 'Jun Ho'}]

In [67]:
results = collection.find({}, { "age": 1, "name": 1 })
list(results)  

[{'_id': ObjectId('682b57d938c40268da87d94f'), 'name': 'Jill', 'age': 51},
 {'_id': ObjectId('682b57e038c40268da87d950'), 'name': 'Jack', 'age': 50},
 {'_id': ObjectId('682b57e038c40268da87d951'), 'name': 'Jun Ho', 'age': 34}]

Or by exclusion:

In [68]:
results = collection.find({}, { "_id": 0 })
list(results)

[{'name': 'Jill', 'age': 51, 'profession': 'Juggler'},
 {'name': 'Jack', 'age': 50, 'profession': 'Unemployed'},
 {'name': 'Jun Ho', 'age': 34, 'profession': 'Juggler'}]

*Why use Mongo rather than an SQL DB?*

# <center>Nested Fields and Arrays</center>

New collection:

In [69]:
collection = db.example2

In [70]:
doc = {
  "name": "Jill",
  "profile": {
      "age": 51,
      "hobbies": ["jogging", "juggling"]
  },
}
collection.insert_one(doc)

InsertOneResult(ObjectId('682b57ed38c40268da87d952'), acknowledged=True)

In [71]:
doc = {
  "name": "Jack",
  "profile": {
      "age": 50,
      "hobbies": ["internet commenting"]
  }
}
collection.insert_one(doc)

InsertOneResult(ObjectId('682b57ee38c40268da87d953'), acknowledged=True)

In [72]:
doc = {
  "name": "Ju Ho",
  "profile": {
      "age": 34,
      "hobbies": ["petting dogs", "juggling"]
  }
}
collection.insert_one(doc)

InsertOneResult(ObjectId('682b57ee38c40268da87d954'), acknowledged=True)

In [73]:
collection.estimated_document_count()

3

'dot' notation: separate nested field names with a period. e.g. `profile.age`

In [74]:
results = collection.find( { "profile.age": 51} )
list(results)

[{'_id': ObjectId('682b57ed38c40268da87d952'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}]

In [75]:
results = collection.find( { "profile.hobbies": "juggling"} )
list(results)

[{'_id': ObjectId('682b57ed38c40268da87d952'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('682b57ee38c40268da87d954'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}}]
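If it helps to picture what the dotted path is doing, here is a plain Python sketch. `get_path` is a hypothetical helper written for this example, not part of pymongo. It walks one level deeper for each part of the path. The last line mirrors what the results above suggest about arrays: a query for a single value matches when the array contains that value.

```python
def get_path(doc, dotted):
    """Follow a dotted path like 'profile.age' through nested dictionaries."""
    value = doc
    for part in dotted.split("."):
        value = value[part]
    return value

doc = {"name": "Jill", "profile": {"age": 51, "hobbies": ["jogging", "juggling"]}}

age = get_path(doc, "profile.age")
has_juggling = "juggling" in get_path(doc, "profile.hobbies")
print(age, has_juggling)
```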

# Data Aggregation Pipeline

An *aggregation* in MongoDB is a pipeline that chains data processing actions together.

Things you may want to do:

- **match**: Select a subset of data (as you can do with 'find')
- **sort**: Order data by the values of a certain key
- **group**: Group data based on a key - like 'groupby' in Pandas
- **limit**: Trim the number of documents in the dataset
- **unwind**: Deconstruct an array, so that there is a document for every value of the array
- **project**: Select specific fields (like with the second argument to 'find')

These are in fact the names of *stages* of the pipeline:

- **\$match**: Select a subset of data (as you can do with 'find')
- **\$sort**: Order data by the values of a certain key
- **\$group**: Group data based on a key - like 'groupby' in Pandas
- **\$limit**: Trim the number of documents in the dataset
- **\$unwind**: Deconstruct an array, so that there is a document for every value of the array
- **\$project**: Select specific fields (like with the second argument to 'find')

In [76]:
db = client.scripting
db.cooking.find_one({})

{'_id': ObjectId('682b56f789dffff004d7c272'),
 'id': 10259,
 'cuisine': 'greek',
 'ingredients': ['romaine lettuce',
  'black olives',
  'grape tomatoes',
  'garlic',
  'pepper',
  'purple onion',
  'seasoning',
  'garbanzo beans',
  'feta cheese crumbles']}

Basics of the aggregations pipeline:

`db.collectionName.aggregate(pipeline)`

where

```python
pipeline = [
    stage1,
    stage2,
    ...
    and_so_on
]
```

# $match

Same as `find`, but a good place to start

In [39]:
pipeline = [
    {
        "$match": { "cuisine": "greek" }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'_id': ObjectId('682b56f789dffff004d7c272'),
  'id': 10259,
  'cuisine': 'greek',
  'ingredients': ['romaine lettuce',
   'black olives',
   'grape tomatoes',
   'garlic',
   'pepper',
   'purple onion',
   'seasoning',
   'garbanzo beans',
   'feta cheese crumbles']},
 {'_id': ObjectId('682b56f789dffff004d7c2cf'),
  'id': 34471,
  'cuisine': 'greek',
  'ingredients': ['ground pork',
   'finely chopped fresh parsley',
   'onions',
   'salt',
   'vinegar',
   'caul fat']}]

# $sort

Provide an object where the keys are the field names to sort by, and the values are `1` (ascending order) or `-1` (descending order).

Here, we sort in reverse alphabetical order on 'cuisine' - we'll try something more useful shortly.

In [40]:
pipeline = [
    {
        "$sort": { "cuisine": -1 }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:1]

[{'_id': ObjectId('682b56f789dffff004d7d071'),
  'id': 156,
  'cuisine': 'vietnamese',
  'ingredients': ['top round steak',
   'baking powder',
   'fish sauce',
   'water',
   'canola oil',
   'black peppercorns',
   'sugar',
   'frozen banana leaf',
   'fresh dill',
   'tapioca starch']}]

# $project

Select the columns that you want in the results, or exclude columns.

In [41]:
pipeline = [
    { "$match": { "cuisine": "greek" }  },
    {
        "$project": {"_id": 0, "ingredients": 0}
    }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'id': 10259, 'cuisine': 'greek'}, {'id': 34471, 'cuisine': 'greek'}]

In [42]:
pipeline = [
    { "$match": { "cuisine": "greek" } },
    { "$project": {"cuisine": 1}  }
]

results = db.cooking.aggregate(pipeline)
list(results)[:2]

[{'_id': ObjectId('682b56f789dffff004d7c272'), 'cuisine': 'greek'},
 {'_id': ObjectId('682b56f789dffff004d7c2cf'), 'cuisine': 'greek'}]

`$project` is usually for your benefit (it's more readable!), but that's not bad! If you're only focused on one or two pieces of information, it's easier to see that information with `$project`

# $limit

Same as calling `limit(n)` on a cursor: keep only the first *n* documents.

In [43]:
pipeline = [
    {
        "$match": { "cuisine": "greek" }
    },
    {
        "$limit": 1
    }
]

results = db.cooking.aggregate(pipeline)
len(list(results))

1

# $unwind

Expand each item in a list to its own document.

Before:
    
```python
[{
  'cuisine': 'greek',
  'ingredients': ['romaine lettuce',
   'black olives',
   'feta cheese crumbles']
}]
```

After:

```python
[
    {'cuisine': 'greek', 'ingredients': 'romaine lettuce'},
    {'cuisine': 'greek', 'ingredients': 'black olives'},
    {'cuisine': 'greek', 'ingredients': 'feta cheese crumbles'}
]
```

Step by step: what do the results below represent?

In [44]:
pipeline = [
    {"$match": {"cuisine": "greek" }},
    { "$limit": 2 },
    {"$project": {"_id":0, "id":0 }}
]
  
results = db.cooking.aggregate(pipeline)
list(results)

[{'cuisine': 'greek',
  'ingredients': ['romaine lettuce',
   'black olives',
   'grape tomatoes',
   'garlic',
   'pepper',
   'purple onion',
   'seasoning',
   'garbanzo beans',
   'feta cheese crumbles']},
 {'cuisine': 'greek',
  'ingredients': ['ground pork',
   'finely chopped fresh parsley',
   'onions',
   'salt',
   'vinegar',
   'caul fat']}]

**When you're referring to a field in a value (rather than a *key*), precede the name with '\$'**

'cuisine' is referred to in a key here:

```
{ "$match": { "cuisine": "greek" } }
```

'ingredients' is referred to in a value:

```
{ "$unwind": "$ingredients" }
```
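To make the before/after picture of `$unwind` concrete, here is a plain Python sketch of the same transformation. The `unwind` function is a hypothetical stand-in written for this example. In MongoDB, you would put `{ "$unwind": "$ingredients" }` in the pipeline instead.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of its list-valued field."""
    for doc in docs:
        for item in doc[field]:
            # Copy the document, replacing the list with a single item
            yield {**doc, field: item}

recipes = [{
    "cuisine": "greek",
    "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"],
}]

unwound = list(unwind(recipes, "ingredients"))
for row in unwound:
    print(row)
```

One recipe with three ingredients becomes three documents, each with a single ingredient, which is what makes counting ingredients with `$group` possible.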

# $group

It's our split-apply-combine pattern in MongoDB.

In [45]:
pipeline = [
    { 
        "$group": {
            "_id": "$cuisine",
            "num_matching": { "$sum": 1 }
        }
    }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': 'jamaican', 'num_matching': 526},
 {'_id': 'mexican', 'num_matching': 6438},
 {'_id': 'greek', 'num_matching': 1175},
 {'_id': 'vietnamese', 'num_matching': 825},
 {'_id': 'chinese', 'num_matching': 2673},
 {'_id': 'filipino', 'num_matching': 755},
 {'_id': 'cajun_creole', 'num_matching': 1546},
 {'_id': 'southern_us', 'num_matching': 4320},
 {'_id': 'indian', 'num_matching': 3003},
 {'_id': 'french', 'num_matching': 2646},
 {'_id': 'japanese', 'num_matching': 1423},
 {'_id': 'brazilian', 'num_matching': 467},
 {'_id': 'russian', 'num_matching': 489},
 {'_id': 'moroccan', 'num_matching': 821},
 {'_id': 'spanish', 'num_matching': 989},
 {'_id': 'italian', 'num_matching': 7838},
 {'_id': 'irish', 'num_matching': 667},
 {'_id': 'korean', 'num_matching': 830},
 {'_id': 'thai', 'num_matching': 1539},
 {'_id': 'british', 'num_matching': 804}]

In [39]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": "$ingredients",
            "count": { "$sum": 1 }
        }
    },
    { "$sort": {"count": -1 }  },
    { "$limit": 10 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': 'salt', 'count': 18049},
 {'_id': 'olive oil', 'count': 7972},
 {'_id': 'onions', 'count': 7972},
 {'_id': 'water', 'count': 7457},
 {'_id': 'garlic', 'count': 7380},
 {'_id': 'sugar', 'count': 6434},
 {'_id': 'garlic cloves', 'count': 6237},
 {'_id': 'butter', 'count': 4848},
 {'_id': 'ground black pepper', 'count': 4785},
 {'_id': 'all-purpose flour', 'count': 4632}]

`$group` follows the following pattern:

```python
"$group": {
            "_id": GROUPING_CONDITIONS,
            FIELD: ACCUMULATOR
        }
```

You always need an `_id`. It can be a string (to group by a single column), an object where the keys are new names and the values are the fields that you're grouping by, or `None`, which groups the entire dataset into a single group.
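The three kinds of `_id` can be compared using plain Python counting. This is a rough sketch on three made-up recipes, not the real `cooking` collection. The comments show the `$group` stage that each line is imitating.

```python
from collections import Counter

# Three made-up recipes, in the same shape as the cooking collection
recipes = [
    {"cuisine": "greek", "ingredients": ["feta cheese crumbles", "garlic"]},
    {"cuisine": "greek", "ingredients": ["onions", "salt"]},
    {"cuisine": "italian", "ingredients": ["olive oil", "salt"]},
]

# Like {"$group": {"_id": "$cuisine", "num_matching": {"$sum": 1}}}
by_cuisine = Counter(doc["cuisine"] for doc in recipes)

# Like {"$group": {"_id": None, "num_matching": {"$sum": 1}}}: one group for everything
overall = len(recipes)

# Like an object _id of {"cuisine": "$cuisine", "ingredients": "$ingredients"},
# applied after unwinding the ingredients
by_pair = Counter(
    (doc["cuisine"], ingredient)
    for doc in recipes
    for ingredient in doc["ingredients"]
)

print(by_cuisine)
print(overall)
print(by_pair[("greek", "salt")])
```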

The `_id` can be an object with multiple values.

In [40]:
pipeline = [
    { "$unwind": "$ingredients" },
    { 
        "$group": {
            "_id": {"cuisine": "$cuisine", "ingredients": "$ingredients"},
            "num_matching": { "$sum": 1 }
        }
    },
    { "$sort": {"num_matching": -1 }  },
    { "$limit": 4 }
]

results = db.cooking.aggregate(pipeline)
list(results)

[{'_id': {'cuisine': 'italian', 'ingredients': 'salt'}, 'num_matching': 3454},
 {'_id': {'cuisine': 'italian', 'ingredients': 'olive oil'},
  'num_matching': 3111},
 {'_id': {'cuisine': 'mexican', 'ingredients': 'salt'}, 'num_matching': 2720},
 {'_id': {'cuisine': 'southern_us', 'ingredients': 'salt'},
  'num_matching': 2290}]

## Groupby operators

- `$sum`
  - Using `{ "$sum": 1 }` returns a count, but you can also sum a numeric set of values with `{ "$sum": "$keyName" }`
- `$avg`
- `$first`
- `$last`
- `$min`
- `$max`
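Here is a plain Python sketch of what these accumulators compute, using the three people from the earlier `example` collection grouped by profession. The output field names (`total_age`, `avg_age`, and so on) are made up for illustration. In a pipeline, the rough equivalent would be a `$group` stage with `"_id": "$profession"` and accumulators such as `{ "$sum": "$age" }` and `{ "$avg": "$age" }`.

```python
from collections import defaultdict

people = [
    {"name": "Jill", "age": 51, "profession": "Juggler"},
    {"name": "Jack", "age": 50, "profession": "Unemployed"},
    {"name": "Jun Ho", "age": 34, "profession": "Juggler"},
]

# Split: collect the ages for each profession, in document order
ages_by_profession = defaultdict(list)
for person in people:
    ages_by_profession[person["profession"]].append(person["age"])

# Apply: one value per accumulator, per group
summary = {
    profession: {
        "total_age": sum(ages),            # like $sum on a field
        "avg_age": sum(ages) / len(ages),  # like $avg
        "first_age": ages[0],              # like $first
        "last_age": ages[-1],              # like $last
        "min_age": min(ages),              # like $min
        "max_age": max(ages),              # like $max
    }
    for profession, ages in ages_by_profession.items()
}

print(summary["Juggler"])
```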

# Bonus: MapReduce

MapReduce is a framework for processing really large datasets, distributed across multiple threads, processes, or machines.

Two parts:

*Map*: Split the input into segments, to do something on it.

*Reduce*: Simplify the individually processed segments into one output.

(reduce is not simply 'combine' as we saw with split-apply-combine)

### Archetypal Example: Counting Words

![MapReduce Example](https://github.com/organisciak/Scripting-Course/blob/master/images/mapreduce_example2.png?raw=true)

*via http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf*

Some examples to consider:
    
- *Sorting*: How would we sort a *reeaaally* big list? e.g. Sort everything on Amazon by price.
- *Searching*: How do we determine how much a word shows up in 1 billion web pages?

MapReduce generally works by simplifying the problem to key-value pairs. 
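A single-machine sketch of the word counting idea, with everything boiled down to key-value pairs. The input lines are made up, and a real MapReduce system would spread the map and reduce work across many workers rather than running it in one loop.

```python
from collections import defaultdict

# Made-up input, one string per "segment" of the data
lines = ["deer bear river", "car car river", "deer car bear"]

# Map: emit a (word, 1) pair for every word in every line
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: gather the values emitted for each key
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce: collapse each key's values into a single count
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)
```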

MongoDB Example, via https://docs.mongodb.com/manual/tutorial/map-reduce-examples/: **Return the Total Price Per Customer**

Data structure:
```
{
     cust_id: "abc123",
     ord_date: new Date("Oct 04, 2012"),
     status: 'A',
     price: 25,
     items: [ { sku: "mmm", qty: 5, price: 2.5 },
              { sku: "nnn", qty: 5, price: 2.5 } ]
}
```

- Map step: return key-value pair of `cust_id: price`
- Reduce step: sum all the values that share the same key
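The two steps above can be sketched in plain Python. The orders here are made up and trimmed to the two fields we need, and `map_order` and `reduce_prices` are hypothetical names for this example.

```python
from collections import defaultdict

# Made-up orders, keeping only the fields used in this example
orders = [
    {"cust_id": "abc123", "price": 25},
    {"cust_id": "abc123", "price": 10},
    {"cust_id": "def456", "price": 40},
]

def map_order(order):
    """Map step: turn one order into a (cust_id, price) pair."""
    return order["cust_id"], order["price"]

def reduce_prices(prices):
    """Reduce step: sum all the prices emitted for one customer."""
    return sum(prices)

# Gather the mapped values by key, then reduce each key's values
emitted = defaultdict(list)
for order in orders:
    key, value = map_order(order)
    emitted[key].append(value)

totals = {key: reduce_prices(values) for key, values in emitted.items()}
print(totals)
```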

**Calculate Average Quantity Per Item**

Data structure:
```
{
     cust_id: "abc123",
     ord_date: new Date("Oct 04, 2012"),
     status: 'A',
     price: 25,
     items: [ { sku: "mmm", qty: 5, price: 2.5 },
              { sku: "nnn", qty: 5, price: 2.5 } ]
}
```

- Map step: return key value pairs where the key is the 'sku' of each item, and the value is an object of `{count:1, quantity:  X}`
- Reduce step: Sum count and quantity for each key
- Finalize: Divide quantity/count for an average
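The same three steps as a plain Python sketch, again on made-up orders trimmed to the fields we need. `reduce_values` is a hypothetical name for this example.

```python
from collections import defaultdict

# Made-up orders, keeping only the item fields used in this example
orders = [
    {"items": [{"sku": "mmm", "qty": 5}, {"sku": "nnn", "qty": 5}]},
    {"items": [{"sku": "mmm", "qty": 1}]},
]

# Map: for each item, emit sku -> {count: 1, quantity: qty}
emitted = defaultdict(list)
for order in orders:
    for item in order["items"]:
        emitted[item["sku"]].append({"count": 1, "quantity": item["qty"]})

def reduce_values(values):
    """Reduce: sum count and quantity across all values for one sku."""
    return {
        "count": sum(v["count"] for v in values),
        "quantity": sum(v["quantity"] for v in values),
    }

reduced = {sku: reduce_values(values) for sku, values in emitted.items()}

# Finalize: divide quantity by count to get the average per sku
averages = {sku: r["quantity"] / r["count"] for sku, r in reduced.items()}
print(averages)
```

Carrying both `count` and `quantity` through the reduce step matters: an average of averages would be wrong if the reduce ran on partial batches, but sums can always be added together safely.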