# MongoDB

MongoDB is a document database. It stores JSON-like documents (internally in a binary format called BSON).

- [Documentation](https://docs.mongodb.com)
- [Query selectors](https://docs.mongodb.com/manual/reference/operator/query/#query-selectors)

Note that MongoDB also provides a GUI via [MongoDB Compass](https://www.mongodb.com/products/compass) that might be useful when you are getting familiar with MongoDB. However, we will focus only on `pymongo`.

## Concepts

- What a document database is
- Why document databases
- Collections ~ tables
- Documents ~ rows
- Joins are possible, but it is more common to embed nested objects
- [Basic data manipulation: CRUD](https://docs.mongodb.com/manual/crud/)
- Using `find`
- Simple summaries
- Using the `aggregate` method and setting up pipelines
- Geospatial queries
- Creating indexes to speed up queries

In [1]:
from pymongo import MongoClient, GEOSPHERE
from bson.objectid import ObjectId
from bson.son import SON

In [2]:
import requests
from bson import json_util

In [3]:
import collections
from pathlib import Path

In [4]:
import os

In [5]:
from pprint import pprint

### Background to NoSQL

CAP
- Consistent
- Available
- Partition tolerant

CAP theorem - when a network partition occurs, a distributed database can be either consistent *or* available, but not both

ACID
- Atomic
- Consistent
- Isolated
- Durable

BASE
- Basically available
- Soft state
- Eventually consistent




## Prelude - JSON

Example of a data structure for a Person from [Wikipedia](https://en.wikipedia.org/wiki/JSON)

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}
```

REST APIs deliver data in the form of JSON, so it is quite natural to store data in that form too if you are producing or consuming data from a REST API.
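
As a small illustration of how JSON maps onto Python objects (and hence onto the dictionaries that `pymongo` works with), the sketch below parses an abbreviated version of the Person record with the standard `json` module. The shortened record is an assumption made only to keep the example compact.

```python
import json

# Abbreviated version of the Person example above (shortened for illustration)
raw = '''
{
  "firstName": "John",
  "isAlive": true,
  "age": 27,
  "address": {"city": "New York", "state": "NY"},
  "phoneNumbers": [{"type": "home", "number": "212 555-1234"}],
  "children": [],
  "spouse": null
}
'''

# JSON text -> Python dict
person = json.loads(raw)

# JSON true/false/null become Python True/False/None
print(type(person).__name__, person['isAlive'], person['spouse'])

# Nested objects become nested dicts, arrays become lists
print(person['address']['city'], person['phoneNumbers'][0]['type'])

# Serializing and parsing again gives back an equal structure
round_trip = json.loads(json.dumps(person))
print(round_trip == person)
```

Because the parsed result is an ordinary nested `dict`, it has the same shape as the documents inserted later in this notebook.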

## Set up

This connects to the MongoDB daemon

In [6]:
client = MongoClient('mongodb:27017')

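We first drop any existing `starwars` database so that we start from a clean state, and then specify the database to use. It does not matter if the database does not exist yet.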

In [7]:
client.drop_database('starwars')

In [8]:
db = client.starwars

This specifies a `collection`

In [9]:
people = db.people

Check what collections are in the database. Note that the `people` collection is only created when the first document is inserted.

In [10]:
db.list_collection_names()

[]

## Get Data

In [11]:
base_url = 'http://swapi.dev/api/'

In [12]:
# Build the URL by string concatenation; os.path.join is meant for file paths
resp = requests.get(base_url + 'people/1')
data = resp.json()

In [13]:
data

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'https://swapi.dev/api/planets/1/',
 'films': ['https://swapi.dev/api/films/1/',
  'https://swapi.dev/api/films/2/',
  'https://swapi.dev/api/films/3/',
  'https://swapi.dev/api/films/6/'],
 'species': [],
 'vehicles': ['https://swapi.dev/api/vehicles/14/',
  'https://swapi.dev/api/vehicles/30/'],
 'starships': ['https://swapi.dev/api/starships/12/',
  'https://swapi.dev/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'https://swapi.dev/api/people/1/'}

We will fetch details of the homeworld and starships as a nested document.

In [14]:
def get_nested(d):
    d['homeworld']  = requests.get(d['homeworld']).json()
    urls = d['starships']
    starships = [requests.get(url).json() for url in urls]
    d['starships']  = starships
    return d

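The REST API returns numbers as strings, so we convert them to integers where possible.

In [15]:
def convert_str(x):
    # Return an int if the string is a valid integer, else the original string
    try:
        return int(x)
    except ValueError:
        return x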

def to_num(data):
    for key in data:
        val = data[key]
        if isinstance(val, str):
            data[key] = convert_str(val)
        elif isinstance(val, dict):
            for k, v in val.items():
                if isinstance(v, str):
                    val[k] = convert_str(v)
        elif isinstance(val, list):
            for i, item in enumerate(val):
                if isinstance(item, str):
                    data[key][i] = convert_str(item)
                elif isinstance(item, dict):
                    for k, v in item.items():
                        if isinstance(v, str):
                            data[key][i][k] = convert_str(v)      
    return data
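
The `to_num` function above handles a fixed nesting depth (top level, one level of dictionaries, and one level of lists). As an alternative sketch, the conversion can be written recursively so that it works at any depth. The name `to_num_recursive` is hypothetical, and unlike `to_num` it returns a new structure instead of modifying its argument in place.

```python
def to_num_recursive(obj):
    """Return a copy of obj with integer-like strings converted to int."""
    if isinstance(obj, dict):
        return {k: to_num_recursive(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_num_recursive(v) for v in obj]
    if isinstance(obj, str):
        try:
            return int(obj)
        except ValueError:
            # Not an integer (e.g. '19BBY' or 'unknown'): keep the string
            return obj
    # Numbers, booleans and None are returned unchanged
    return obj


# Small hypothetical record in the shape returned by the API
sample = {
    'height': '172',
    'birth_year': '19BBY',
    'homeworld': {'population': '200000', 'climate': 'arid'},
    'starships': [{'crew': '1', 'cost_in_credits': 'unknown'}],
}
converted = to_num_recursive(sample)
print(converted)
```

Strings that are not valid integers, such as `'19BBY'` or `'unknown'`, are left untouched, which matches the behavior of `convert_str`.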

In [16]:
data = to_num(get_nested(data))

In [17]:
data

{'name': 'Luke Skywalker',
 'height': 172,
 'mass': 77,
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': {'name': 'Tatooine',
  'rotation_period': 23,
  'orbital_period': 304,
  'diameter': 10465,
  'climate': 'arid',
  'gravity': '1 standard',
  'terrain': 'desert',
  'surface_water': 1,
  'population': 200000,
  'residents': ['https://swapi.dev/api/people/1/',
   'https://swapi.dev/api/people/2/',
   'https://swapi.dev/api/people/4/',
   'https://swapi.dev/api/people/6/',
   'https://swapi.dev/api/people/7/',
   'https://swapi.dev/api/people/8/',
   'https://swapi.dev/api/people/9/',
   'https://swapi.dev/api/people/11/',
   'https://swapi.dev/api/people/43/',
   'https://swapi.dev/api/people/62/'],
  'films': ['https://swapi.dev/api/films/1/',
   'https://swapi.dev/api/films/3/',
   'https://swapi.dev/api/films/4/',
   'https://swapi.dev/api/films/5/',
   'https://swapi.dev/api/films/6/'],
  'created': '201

## Insertion

### Single inserts

In [18]:
result = people.insert_one(data)

In [19]:
db.list_collection_names()

['people']

### Bulk inserts

We load some previously retrieved values from file to avoid hitting the SWAPI server repeatedly.

In [20]:
import pickle

with open('data/sw.pickle', 'rb') as f:
    xs = pickle.load(f)
xs

[{'name': 'C-3PO',
  'height': 167,
  'mass': 75,
  'hair_color': 'n/a',
  'skin_color': 'gold',
  'eye_color': 'yellow',
  'birth_year': '112BBY',
  'gender': 'n/a',
  'homeworld': {'name': 'Tatooine',
   'rotation_period': 23,
   'orbital_period': 304,
   'diameter': 10465,
   'climate': 'arid',
   'gravity': '1 standard',
   'terrain': 'desert',
   'surface_water': 1,
   'population': 200000,
   'residents': ['http://swapi.dev/api/people/1/',
    'http://swapi.dev/api/people/2/',
    'http://swapi.dev/api/people/4/',
    'http://swapi.dev/api/people/6/',
    'http://swapi.dev/api/people/7/',
    'http://swapi.dev/api/people/8/',
    'http://swapi.dev/api/people/9/',
    'http://swapi.dev/api/people/11/',
    'http://swapi.dev/api/people/43/',
    'http://swapi.dev/api/people/62/'],
   'films': ['http://swapi.dev/api/films/1/',
    'http://swapi.dev/api/films/3/',
    'http://swapi.dev/api/films/4/',
    'http://swapi.dev/api/films/5/',
    'http://swapi.dev/api/films/6/'],
   'creat

In [21]:
result = people.insert_many(xs)

In [22]:
result.inserted_ids

[ObjectId('615237b605539c8b7e3ebf19'),
 ObjectId('615237b605539c8b7e3ebf1a'),
 ObjectId('615237b605539c8b7e3ebf1b'),
 ObjectId('615237b605539c8b7e3ebf1c'),
 ObjectId('615237b605539c8b7e3ebf1d'),
 ObjectId('615237b605539c8b7e3ebf1e'),
 ObjectId('615237b605539c8b7e3ebf1f'),
 ObjectId('615237b605539c8b7e3ebf20'),
 ObjectId('615237b605539c8b7e3ebf21')]

## Queries

In [23]:
people.find_one(
    # search criteria
    {'name': 'Luke Skywalker'}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
)

{'_id': ObjectId('615237147ab53903716d3176'),
 'name': 'Luke Skywalker',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue'}

In [24]:
for p in people.find(
    # search criteria
    {}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
):
    print(p)

{'_id': ObjectId('615237147ab53903716d3176'), 'name': 'Luke Skywalker', 'hair_color': 'blond', 'skin_color': 'fair', 'eye_color': 'blue'}
{'_id': ObjectId('615237187ab53903716d3177'), 'name': 'C-3PO', 'hair_color': 'n/a', 'skin_color': 'gold', 'eye_color': 'yellow'}
{'_id': ObjectId('615237187ab53903716d3178'), 'name': 'R2-D2', 'hair_color': 'n/a', 'skin_color': 'white, blue', 'eye_color': 'red'}
{'_id': ObjectId('615237187ab53903716d3179'), 'name': 'Darth Vader', 'hair_color': 'none', 'skin_color': 'white', 'eye_color': 'yellow'}
{'_id': ObjectId('615237187ab53903716d317a'), 'name': 'Leia Organa', 'hair_color': 'brown', 'skin_color': 'light', 'eye_color': 'brown'}
{'_id': ObjectId('615237187ab53903716d317b'), 'name': 'Owen Lars', 'hair_color': 'brown, grey', 'skin_color': 'light', 'eye_color': 'blue'}
{'_id': ObjectId('615237187ab53903716d317c'), 'name': 'Beru Whitesun lars', 'hair_color': 'brown', 'skin_color': 'light', 'eye_color': 'blue'}
{'_id': ObjectId('615237187ab53903716d317d'

### Using object ID

Note that an `ObjectId` is NOT a string. You must convert a string to an `ObjectId` before using it in a query.

From the official docs, the `ObjectId` consists of

- a 4-byte value representing the seconds since the [Unix epoch](https://en.wikipedia.org/wiki/Unix_time),
- a 5-byte random value, and
- a 3-byte counter, starting with a random value.

In particular, because the timestamp comes first, sorting by `ObjectId` values generated across different machines gives an approximate time ordering.
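
To make the layout concrete, here is a minimal sketch that splits the hexadecimal form of an `ObjectId` into the three parts listed above, using only the standard library. The helper name `objectid_parts` is hypothetical; in practice, the `bson` package exposes the timestamp as `ObjectId.generation_time`.

```python
from datetime import datetime, timezone


def objectid_parts(hex_id):
    """Split a 24-character ObjectId hex string into its three fields."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError('an ObjectId is exactly 12 bytes')
    seconds = int.from_bytes(raw[:4], 'big')  # 4-byte big-endian timestamp
    return {
        'created': datetime.fromtimestamp(seconds, tz=timezone.utc),
        'random': raw[4:9].hex(),                   # 5-byte random value
        'counter': int.from_bytes(raw[9:], 'big'),  # 3-byte counter
    }


# One of the ids shown in this notebook
parts = objectid_parts('615237b605539c8b7e3ebf19')
print(parts['created'].isoformat())
print(parts['random'], parts['counter'])
```

Ids created in the same bulk insert share the same random part and differ only in the counter (and possibly the timestamp), which is why the ids listed earlier look so similar.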

In [25]:
result.inserted_ids[0]

ObjectId('615237b605539c8b7e3ebf19')

In [26]:
people.find_one(
    result.inserted_ids[0],
    {'name': True, 'hair_color': True, 'skin_color': True, 'eye_color': True}
)

### Bulk queries

The general `find` method returns a cursor, where each entry is a dictionary.

In [29]:
for person in people.find(
    {'gender': 'male'}
):
    print(person['name'])

Luke Skywalker
Darth Vader
Owen Lars
Biggs Darklighter
Obi-Wan Kenobi


You can also explicitly define the projection.

In [37]:
for x in people.find(
    {'gender': 'male'},             
    {
        '_id': False,
        'name': True,
        'gender': True
    }
): 
    pprint(x)

{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'male', 'name': 'Darth Vader'}
{'gender': 'male', 'name': 'Owen Lars'}
{'gender': 'male', 'name': 'Biggs Darklighter'}
{'gender': 'male', 'name': 'Obi-Wan Kenobi'}
{'gender': 'male', 'name': 'Luke Skywalker'}


#### Using regex search

In [33]:
for x in people.find(
    {
        'name': {'$regex': '^L'},
    },
    {
        'name': True, 
        'gender': True, 
        '_id': False
    }
):
    pprint(x)

{'gender': 'female', 'name': 'Leia Organa'}
{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'male', 'name': 'Luke Skywalker'}


The above example passes the pattern as a string to the `$regex` operator, which MongoDB interprets as a Perl-compatible regular expression. You can also use compiled Python regular expressions with `pymongo`.

In [38]:
import re

name_pat = re.compile(r'^l', re.IGNORECASE)

In [39]:
for x in people.find(
    {
        'name': name_pat,
    },
    {
        'name': True,
        'gender': True,
        '_id': False
    }
):
    pprint(x)

{'gender': 'female', 'name': 'Leia Organa'}
{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'male', 'name': 'Luke Skywalker'}


#### Using relational operators

In [40]:
for x in people.find(
    {
        'mass': {'$lt': 100},
    },
    {
        'name': True, 
        'mass': True, 
        '_id': False
    }
):
    pprint(x)

{'mass': 77, 'name': 'Luke Skywalker'}
{'mass': 75, 'name': 'C-3PO'}
{'mass': 32, 'name': 'R2-D2'}
{'mass': 49, 'name': 'Leia Organa'}
{'mass': 75, 'name': 'Beru Whitesun lars'}
{'mass': 32, 'name': 'R5-D4'}
{'mass': 84, 'name': 'Biggs Darklighter'}
{'mass': 77, 'name': 'Obi-Wan Kenobi'}
{'mass': 77, 'name': 'Luke Skywalker'}


In [41]:
mass_range = {'$lt': 100, '$gt': 50}

In [42]:
for x in people.find(
    {
        'mass': mass_range,
    },
    {
        'name': True, 
        'mass': True,
        '_id': False
    }
):
    pprint(x)

{'mass': 77, 'name': 'Luke Skywalker'}
{'mass': 75, 'name': 'C-3PO'}
{'mass': 75, 'name': 'Beru Whitesun lars'}
{'mass': 84, 'name': 'Biggs Darklighter'}
{'mass': 77, 'name': 'Obi-Wan Kenobi'}
{'mass': 77, 'name': 'Luke Skywalker'}


#### Nested search

Nowadays, many relational databases allow you to store data as JSON columns. However, document databases make it convenient to search nested fields using dot notation, such as `homeworld.name`.

In [43]:
for x in people.find(
    {
        'homeworld.name': 'Tatooine',
    },
    {
        'name': True, 
        'species.name': True, 
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker', 'species': []}
{'name': 'C-3PO', 'species': []}
{'name': 'Darth Vader', 'species': []}
{'name': 'Owen Lars', 'species': []}
{'name': 'Beru Whitesun lars', 'species': []}
{'name': 'R5-D4', 'species': []}
{'name': 'Biggs Darklighter', 'species': []}
{'name': 'Luke Skywalker', 'species': []}


#### Matching multiple criteria

This is quite subtle. By default, when several criteria refer to fields inside an array of sub-documents, each criterion may be satisfied by a different array element. Here `Obi-Wan Kenobi` is returned because each of the 3 conditions is matched by one or more of his starships, even though none of his starships matches all 3 criteria.

In [44]:
for x in people.find(
    {
        'starships.cost_in_credits': {'$lt': 250000},
        'starships.max_atmosphering_speed': {'$gt': 500},
        'starships.passengers': {'$gt': 0}
    },
    {
        'name': True, 
        'starships.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits': 149999,
                'max_atmosphering_speed': 1050,
                'name': 'X-wing',
                'passengers': 0},
               {'cost_in_credits': 240000,
                'max_atmosphering_speed': 850,
                'name': 'Imperial shuttle',
                'passengers': 20}]}
{'name': 'Obi-Wan Kenobi',
 'starships': [{'cost_in_credits': 180000,
                'max_atmosphering_speed': 1150,
                'name': 'Jedi starfighter',
                'passengers': 0},
               {'cost_in_credits': 125000000,
                'max_atmosphering_speed': 1050,
                'name': 'Trade Federation cruiser',
                'passengers': 48247},
               {'cost_in_credits': 'unknown',
                'max_atmosphering_speed': 1050,
                'name': 'Naboo star skiff',
                'passengers': 3},
               {'cost_in_credits': 320000,
                'max_atmosphering_speed': 1500,
                'name': 'Jedi Interceptor',
                'passengers': 0},
               {'cost_in_credits': 168000,
                'max_atmosphering_speed': 1100,
                'name': 'Belbullab-22 starfighter',
                'passengers': 0}]}
{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits

In [45]:
for x in people.find(
    {'name': 'Obi-Wan Kenobi'},
    {
        'starships.name': True,
        'starships.cost_in_credits': True,
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        '_id': False
    }
):
    pprint(x)

{'starships': [{'cost_in_credits': 180000,
                'max_atmosphering_speed': 1150,
                'name': 'Jedi starfighter',
                'passengers': 0},
               {'cost_in_credits': 125000000,
                'max_atmosphering_speed': 1050,
                'name': 'Trade Federation cruiser',
                'passengers': 48247},
               {'cost_in_credits': 'unknown',
                'max_atmosphering_speed': 1050,
                'name': 'Naboo star skiff',
                'passengers': 3},
               {'cost_in_credits': 320000,
                'max_atmosphering_speed': 1500,
                'name': 'Jedi Interceptor',
                'passengers': 0},
               {'cost_in_credits': 168000,
                'max_atmosphering_speed': 1100,
                'name': 'Belbullab-22 starfighter',
                'passengers': 0}]}


#### Matching multiple criteria simultaneously

To find someone with a single starship that matches all 3 conditions, we need to use the `$elemMatch` operator.

In [46]:
for x in people.find(
    {
        'starships': {
            '$elemMatch': { 
                'cost_in_credits': {'$lt': 250000},
                'max_atmosphering_speed': {'$gt': 500},
                'passengers': {'$gt': 1}
            }
        }
    },
    {
        'name': True, 
        'starships.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits': 149999,
                'max_atmosphering_speed': 1050,
                'name': 'X-wing',
                'passengers': 0},
               {'cost_in_credits': 240000,
                'max_atmosphering_speed': 850,
                'name': 'Imperial shuttle',
                'passengers': 20}]}
{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits': 149999,
                'max_atmosphering_speed': 1050,
                'name': 'X-wing',
                'passengers': 0},
               {'cost_in_credits': 240000,
                'max_atmosphering_speed': 850,
                'name': 'Imperial shuttle',
                'passengers': 20}]}
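
The difference between the two kinds of matching can be emulated in plain Python. The sketch below is not how MongoDB evaluates queries internally; it only mirrors the semantics, using Obi-Wan Kenobi's starships as printed earlier in this notebook. It assumes that a numeric comparison never matches a string such as `'unknown'`, and it uses `passengers > 0` for both styles so that they can be compared directly.

```python
# Obi-Wan Kenobi's starships, copied from the output shown earlier
starships = [
    {'name': 'Jedi starfighter', 'cost_in_credits': 180000,
     'max_atmosphering_speed': 1150, 'passengers': 0},
    {'name': 'Trade Federation cruiser', 'cost_in_credits': 125000000,
     'max_atmosphering_speed': 1050, 'passengers': 48247},
    {'name': 'Naboo star skiff', 'cost_in_credits': 'unknown',
     'max_atmosphering_speed': 1050, 'passengers': 3},
    {'name': 'Jedi Interceptor', 'cost_in_credits': 320000,
     'max_atmosphering_speed': 1500, 'passengers': 0},
    {'name': 'Belbullab-22 starfighter', 'cost_in_credits': 168000,
     'max_atmosphering_speed': 1100, 'passengers': 0},
]


def is_number(v):
    # Strings such as 'unknown' never satisfy a numeric comparison
    return isinstance(v, (int, float)) and not isinstance(v, bool)


conditions = [
    lambda s: is_number(s['cost_in_credits']) and s['cost_in_credits'] < 250000,
    lambda s: is_number(s['max_atmosphering_speed']) and s['max_atmosphering_speed'] > 500,
    lambda s: is_number(s['passengers']) and s['passengers'] > 0,
]

# Dot-notation style: each condition may be met by a different starship
matches_across = all(any(cond(s) for s in starships) for cond in conditions)

# $elemMatch style: one starship must meet every condition
matches_same = any(all(cond(s) for cond in conditions) for s in starships)

print(matches_across, matches_same)
```

The first flag corresponds to the query that returned Obi-Wan Kenobi, and the second to the `$elemMatch` query that did not.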


## Aggregate Queries

In [47]:
people.count_documents({'homeworld.name': 'Tatooine'})

8

In [48]:
people.distinct('homeworld.name')

['Alderaan', 'Naboo', 'Stewjon', 'Tatooine']

### Using aggregate

The `aggregate` method runs a **pipeline** of stages, typically using the `$group` operator to summarize results. The stages are executed in order, and the output of each stage is passed as input to the next.

Filter and count

In [49]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'count': {'$sum': 1}}},
]

In [50]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Tatooine', 'count': 8}


Filter and find total mass

In [51]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [52]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Tatooine', 'total_mass': 676}


Total mass of all members of a planet

In [53]:
cmds = [
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [54]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Stewjon', 'total_mass': 77}
{'_id': 'Tatooine', 'total_mass': 676}
{'_id': 'Alderaan', 'total_mass': 49}
{'_id': 'Naboo', 'total_mass': 32}


Filter, group, and sort.

In [55]:
cmds = [
     {
         '$match': {
             'mass': {
                 '$lt': 100
                     }
         },
     },
     {
         '$group': {
             '_id': '$homeworld.name',
             'total_mass': {'$sum': '$mass'},
             'avg_mass': {'$avg': '$mass'}
         },
     },
     {
        '$sort': { 
            'avg_mass': -1
        }
     }
]

In [56]:

for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Stewjon', 'avg_mass': 77.0, 'total_mass': 77}
{'_id': 'Tatooine', 'avg_mass': 70.0, 'total_mass': 420}
{'_id': 'Alderaan', 'avg_mass': 49.0, 'total_mass': 49}
{'_id': 'Naboo', 'avg_mass': 32.0, 'total_mass': 32}
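
To see what the `$match`, `$group`, and `$sort` stages do, the same computation can be sketched in plain Python. The rows below are the people with `mass < 100` and their homeworlds as they appear in the outputs of this notebook (including the duplicate Luke Skywalker entry); the in-memory list stands in for the collection and is an assumption of this sketch.

```python
from collections import defaultdict

# (name, homeworld, mass) for people with known mass under 100
rows = [
    ('Luke Skywalker', 'Tatooine', 77),
    ('C-3PO', 'Tatooine', 75),
    ('R2-D2', 'Naboo', 32),
    ('Leia Organa', 'Alderaan', 49),
    ('Beru Whitesun lars', 'Tatooine', 75),
    ('R5-D4', 'Tatooine', 32),
    ('Biggs Darklighter', 'Tatooine', 84),
    ('Obi-Wan Kenobi', 'Stewjon', 77),
    ('Luke Skywalker', 'Tatooine', 77),
]

# $match: keep only rows that satisfy the filter
matched = [r for r in rows if r[2] < 100]

# $group: collect the masses for each homeworld
groups = defaultdict(list)
for _, homeworld, mass in matched:
    groups[homeworld].append(mass)

summary = [
    {'_id': hw, 'total_mass': sum(ms), 'avg_mass': sum(ms) / len(ms)}
    for hw, ms in groups.items()
]

# $sort: descending by average mass (the -1 in the pipeline)
summary.sort(key=lambda d: d['avg_mass'], reverse=True)

for doc in summary:
    print(doc)
```

Each stage consumes the output of the previous one, which is the same data flow that the pipeline list expresses declaratively.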


#### SQL equivalent (approximate)

```sql
SELECT homeworld.name, AVG(mass) AS avg_mass, SUM(mass) AS total_mass
FROM people
JOIN homeworld
ON people.homeworld_id = homeworld.homeworld_id
WHERE mass < 100
GROUP BY homeworld.name
ORDER BY avg_mass DESC
```

### Using MapReduce

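With `MapReduce` you get the full power of JavaScript, but it is more complex and often less efficient. You should use `aggregate` in preference to `map_reduce` in most cases. Note that `mapReduce` is deprecated as of MongoDB 5.0 and that `Collection.map_reduce` was removed in PyMongo 4.0, so the examples below require older versions.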

- In the map stage, you create a (key, value) pair
- In the reduce stage, you perform a reduction (e.g. sum) of the values associated with each key

#### Motivating Python example

In [57]:
from functools import reduce

In [58]:
eye_color = ['blue', 'blue', 'green', 'brown', 'grey', 'green', 'blue']

In [59]:
res = [(x, 1) for x in eye_color]
res

[('blue', 1),
 ('blue', 1),
 ('green', 1),
 ('brown', 1),
 ('grey', 1),
 ('green', 1),
 ('blue', 1)]

In [60]:
d = {}
for k, v in res:
    d[k] = d.get(k, 0) + v
d

{'blue': 3, 'green': 2, 'brown': 1, 'grey': 1}

#### Map-reduce example in Mongo

In [61]:
from bson.code import Code

Count the number of people by eye color

In [62]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result1'
)

In [63]:
for doc in result.find():
    pprint(doc)

{'_id': 'blue-gray', 'value': 1.0}
{'_id': 'red', 'value': 2.0}
{'_id': 'brown', 'value': 2.0}
{'_id': 'blue', 'value': 4.0}
{'_id': 'yellow', 'value': 2.0}


The output is also stored in the `result1` collection we specified.

In [64]:
list(db.result1.find())

[{'_id': 'blue-gray', 'value': 1.0},
 {'_id': 'red', 'value': 2.0},
 {'_id': 'brown', 'value': 2.0},
 {'_id': 'blue', 'value': 4.0},
 {'_id': 'yellow', 'value': 2.0}]

Using JavaScript Array functions to simplify code.

In [65]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result2'
)

In [66]:
for doc in result.find():
    pprint(doc)

{'_id': 'yellow', 'value': 2.0}
{'_id': 'red', 'value': 2.0}
{'_id': 'blue-gray', 'value': 1.0}
{'_id': 'brown', 'value': 2.0}
{'_id': 'blue', 'value': 4.0}


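Find average mass by gender. Note that MongoDB may call the reduce function more than once for the same key, feeding it its own earlier output, so averaging inside the reducer is only reliable when each key is reduced in a single call.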

In [67]:
mapper = Code('''
function() {
    emit(this.gender, this.mass);
}
''')

reducer = Code('''
function (key, values) {
    return Array.avg(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result3'
)

In [68]:
for doc in result.find():
    pprint(doc)

{'_id': 'n/a', 'value': 46.333333333333336}
{'_id': 'female', 'value': 62.0}
{'_id': 'male', 'value': 95.16666666666667}
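
Averaging inside a reducer deserves some care, because a reduce function may be applied to partial results, and an average of averages is generally not the overall average. The sketch below illustrates the issue in plain Python with hypothetical mass values, and shows the usual workaround of carrying `(sum, count)` pairs through the reduction and dividing only at the end.

```python
from functools import reduce

# Hypothetical masses for a single key
masses = [60, 80, 100, 40]
true_average = sum(masses) / len(masses)


def avg(values):
    return sum(values) / len(values)


# Naive approach: reduce the first three values, then reduce that
# partial result together with the remaining value
naive_average = avg([avg(masses[:3]), masses[3]])


def combine(a, b):
    # Safe approach: (sum, count) pairs can be combined in any grouping
    return (a[0] + b[0], a[1] + b[1])


pairs = [(m, 1) for m in masses]
partial = reduce(combine, pairs[:3])
total, count = combine(partial, pairs[3])
safe_average = total / count

print(true_average, naive_average, safe_average)
```

In MongoDB's map-reduce, the final division would typically be done in a `finalize` function; with `aggregate`, the `$avg` accumulator takes care of this.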


Count the number of members in each species

In [69]:
mapper = Code('''
function() {
    this.species.map(function(z) {
      emit(z.name, 1);
    })
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result3'
)

In [70]:
for doc in result.find():
    pprint(doc)

{'_id': None, 'value': 3.0}


#### Using the `aggregate` method

See if you can convert the above MapReduce queries to `aggregate` method calls. An example is provided.

In [71]:
cmds = [
    {
         '$group': {
             '_id': '$eye_color',
             'count': {'$sum': 1},
         },
     },
     {
        '$sort': { 
            '_id': 1
        }
     }
]

In [72]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'blue', 'count': 4}
{'_id': 'blue-gray', 'count': 1}
{'_id': 'brown', 'count': 2}
{'_id': 'red', 'count': 2}
{'_id': 'yellow', 'count': 2}
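
As one possible answer to the exercise, the pipelines below sketch `aggregate` versions of two of the MapReduce queries. They are only constructed here, not executed, since running them requires a MongoDB server; the variable names are hypothetical. The species pipeline groups on the raw array elements, so its result depends on how `species` is stored in your documents.

```python
from pprint import pprint

# Average mass by gender (compare with the Array.avg reducer above)
avg_mass_by_gender = [
    {'$group': {'_id': '$gender', 'avg_mass': {'$avg': '$mass'}}},
    {'$sort': {'_id': 1}},
]

# Count members per species: $unwind emits one document per array element
species_counts = [
    {'$unwind': '$species'},
    {'$group': {'_id': '$species', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]

pprint(avg_mass_by_gender)
pprint(species_counts)

# With a live connection, each pipeline would be passed to
# people.aggregate(...), e.g. people.aggregate(avg_mass_by_gender)
```

A pipeline is just a list of dictionaries, so it can be built, inspected, and reused like any other Python data structure.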


## Geospatial queries

You specify queries using [GeoJSON Objects](https://docs.mongodb.com/manual/reference/geojson/)

- Point
- LineString
- Polygon
- MultiPoint
- MultiLineString
- MultiPolygon
- GeometryCollection

In [73]:
crime = db.crime

In [74]:
import json

In [75]:
path = 'data/crime-mapping.geojson'

with open(path) as f:
    datastore = json.load(f)

In [76]:
results = crime.insert_many(datastore['features'])

In [77]:
crime.find_one({})

{'_id': ObjectId('615238e477b88ecd24b8bd5a'),
 'geometry': {'type': 'Point', 'coordinates': [-78.78200313, 35.760212065]},
 'type': 'Feature',
 'properties': {'ucr': '2650',
  'domestic': 'N',
  'period': ['Everything', 'Last Year'],
  'street': 'KILDAIRE FARM RD',
  'radio': 'Everything,Last Year',
  'time_to': -62135553600,
  'crime_type': 'ALL OTHER - ESCAPE FROM CUSTODY OR RESIST ARREST',
  'district': 'D3',
  'phxrecordstatus': None,
  'lon': -78.78200313,
  'timeframe': ['Last Year'],
  'crimeday': 'THURSDAY',
  'phxstatus': None,
  'location_category': 'TOWN OWNED',
  'violentproperty': 'All Other',
  'residential_subdivision': 'SHOPPES OF KILDAIRE',
  'offensecategory': 'All Other Offenses',
  'chrgcnt': None,
  'time_from': -62135553600,
  'map_reference': 'P027',
  'date_to': '11/30/2017',
  'lat': 35.760212065,
  'phxcommunity': 'No',
  'crime_category': 'ALL OTHER',
  'activity_date': None,
  'beat_number': '112',
  'record': 3145,
  'incident_number': '17010528',
  'apartm

In [78]:
crime.find_one({},
              {
                  'geometry': 1,
                  '_id': 0,
              }
              )

{'geometry': {'type': 'Point', 'coordinates': [-78.78200313, 35.760212065]}}

In [79]:
crime.create_index([('geometry', GEOSPHERE)])

'geometry_2dsphere'

List 5 crimes near the location

In [80]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        'geometry' : SON([('$near', {'$geometry' : loc})])
    },
    {
        '_id': 0,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
).limit(5):
    pprint(doc)

{'properties': {'crime_type': 'ALL OTHER - ESCAPE FROM CUSTODY OR RESIST '
                              'ARREST',
                'date_from': '2017-11-30'}}
{'properties': {'crime_type': 'COUNTERFEITING - USING',
                'date_from': '2017-09-25'}}
{'properties': {'crime_type': 'FRAUD - CREDIT CARD/ATM',
                'date_from': '2017-10-16'}}
{'properties': {'crime_type': 'LARCENY - FROM MOTOR VEHICLE',
                'date_from': '2018-06-12'}}
{'properties': {'crime_type': 'ALL OTHER - PROBATION/PAROLE VIOLATION',
                'date_from': '2017-11-16'}}
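
For GeoJSON points on a `2dsphere` index, distances such as `$maxDistance` are expressed in meters. To get a feel for the scale, the sketch below estimates the great-circle distance between the query location and another point from this dataset using the haversine formula. The Earth radius used here is an assumption, so the result is only approximately what MongoDB would compute.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (an approximation)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# GeoJSON order is [longitude, latitude]
origin = (-78.78200313, 35.760212065)  # the query location
other = (-78.78102423, 35.7607323)     # another point in the crime data

distance = haversine_m(*origin, *other)
print(round(distance), 'meters; within 200 m:', distance <= 200)
```

Note that GeoJSON stores coordinates as longitude first, then latitude, which is the reverse of the order commonly used in everyday notation.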


List crimes committed nearby (within 200 m)

In [81]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        'geometry' : SON([('$near', {'$geometry' : loc, '$minDistance': 1e-6, '$maxDistance': 200})]),
    },
    {
        '_id': 0,
        'geometry.coordinates': 1,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
):
    pprint(doc)

{'geometry': {'coordinates': [-78.78102423, 35.7607323]},
 'properties': {'crime_type': 'ASSAULT - SIMPLE - ALL OTHER',
                'date_from': '2018-02-14'}}
{'geometry': {'coordinates': [-78.78102423, 35.7607323]},
 'properties': {'crime_type': 'ASSAULT - SIMPLE - ALL OTHER',
                'date_from': '2018-02-14'}}
{'geometry': {'coordinates': [-78.78102423, 35.7607323]},
 'properties': {'crime_type': 'ASSAULT - SIMPLE - ALL OTHER',
                'date_from': '2018-02-14'}}
{'geometry': {'coordinates': [-78.78102423, 35.7607323]},
 'properties': {'crime_type': 'ASSAULT - SIMPLE - ALL OTHER',
                'date_from': '2018-02-14'}}
{'geometry': {'coordinates': [-78.78131931, 35.761138061]},
 'properties': {'crime_type': 'VANDALISM - GRAFFITI',
                'date_from': '2018-07-20'}}
{'geometry': {'coordinates': [-78.78131931, 35.761138061]},
 'properties': {'crime_type': 'VANDALISM - GRAFFITI',
                'date_from': '2018-07-20'}}
{'geometry': {'coordinates':

## Indexes

Just as with relational databases, you can add indexes to speed up search. Note that while reads become faster, writes become slower. There is always a trade-off.
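
The trade-off can be illustrated with a toy model in plain Python. This is a sketch of the idea rather than of MongoDB's actual B-tree implementation: a collection scan examines every document, while a sorted index lets a binary search jump directly to the matching keys. The document list and function names are hypothetical.

```python
import bisect

# A tiny stand-in for the people collection
docs = [{'name': n} for n in [
    'Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader',
    'Leia Organa', 'Luke Skywalker',
]]


def collection_scan(docs, name):
    """Check every document; return (matches, documents examined)."""
    hits, examined = [], 0
    for d in docs:
        examined += 1
        if d['name'] == name:
            hits.append(d)
    return hits, examined


# The "index": sorted (key, position) pairs, which must be maintained
# on every write; that is where the extra write cost comes from
index = sorted((d['name'], pos) for pos, d in enumerate(docs))
keys = [k for k, _ in index]


def index_scan(docs, name):
    """Binary-search the index; only matching documents are fetched."""
    lo = bisect.bisect_left(keys, name)
    hi = bisect.bisect_right(keys, name)
    hits = [docs[pos] for _, pos in index[lo:hi]]
    return hits, hi - lo


scan_hits, scan_examined = collection_scan(docs, 'Luke Skywalker')
ix_hits, ix_examined = index_scan(docs, 'Luke Skywalker')
print(scan_examined, ix_examined)
```

In the `explain()` output, the corresponding numbers appear as `totalDocsExamined` and `totalKeysExamined`.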

In [82]:
people.find({}).explain

<bound method Cursor.explain of <pymongo.cursor.Cursor object at 0x7f6c94ab86d0>>

In [83]:
people.find({'name': 'Luke Skywalker'}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'starwars.people',
  'indexFilterSet': False,
  'parsedQuery': {'name': {'$eq': 'Luke Skywalker'}},
  'winningPlan': {'stage': 'FETCH',
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'name': 1},
    'indexName': 'name_1',
    'isMultiKey': False,
    'multiKeyPaths': {'name': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'name': ['["Luke Skywalker", "Luke Skywalker"]']}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 2,
  'executionTimeMillis': 0,
  'totalKeysExamined': 2,
  'totalDocsExamined': 2,
  'executionStages': {'stage': 'FETCH',
   'nReturned': 2,
   'executionTimeMillisEstimate': 0,
   'works': 3,
   'advanced': 2,
   'needTime': 0,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'docsExamined': 2,
   'alreadyHasObj': 0,
   'inputStage': {'stage':

In [84]:
people.create_index('name')

'name_1'

In [85]:
people.find({'name': 'Luke Skywalker'}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'starwars.people',
  'indexFilterSet': False,
  'parsedQuery': {'name': {'$eq': 'Luke Skywalker'}},
  'winningPlan': {'stage': 'FETCH',
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'name': 1},
    'indexName': 'name_1',
    'isMultiKey': False,
    'multiKeyPaths': {'name': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'name': ['["Luke Skywalker", "Luke Skywalker"]']}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 2,
  'executionTimeMillis': 0,
  'totalKeysExamined': 2,
  'totalDocsExamined': 2,
  'executionStages': {'stage': 'FETCH',
   'nReturned': 2,
   'executionTimeMillisEstimate': 0,
   'works': 3,
   'advanced': 2,
   'needTime': 0,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'docsExamined': 2,
   'alreadyHasObj': 0,
   'inputStage': {'stage':