# Introduction

## Using Jupyter

Jupyter is a front end to the [IPython](https://ipython.org) interactive shell, and offers IDE-like features.  A notebook is made up of two main types of cell: [Markdown](https://en.wikipedia.org/wiki/Markdown) cells (such as this one), which allow Markdown or HTML to be written, and code cells (like the next one), which can be run in real time.

To execute code in a cell, press `Ctrl` + `Enter`, click on the `[ > ]Run` button in the main menu, or press `Shift` + `Enter` if you wish to execute the code and then move on to the next cell (creating it if it does not already exist).

## Setup

MongoDB is a NoSQL database, which has a core API in JavaScript, and a series of other APIs in different languages.  The one we are going to use is the Python API, [PyMongo](https://api.mongodb.com/python/current/).  The MongoDB instance on your VM is started by default.

PyMongo is a package that contains tools to work with MongoDB from Python. We have installed it in the VM provided to you. 

The code below imports the `MongoClient` class from the `pymongo` module, which we will use to manage all our connections.  We're connecting to localhost, where MongoDB is listening on port 27017.  There are more options; the format of the connection string is documented at https://docs.mongodb.com/manual/reference/connection-string/.

In [1]:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')

If you do not get any errors, you have confirmed that the PyMongo library has been successfully installed and configured in the VM.  We will now check whether the connection is correct.  The following code calls a function which returns a list of all current databases.  If your Mongo instance is still empty, it should be something like `['admin', 'local']`.

In [2]:
client.list_database_names()

ServerSelectionTimeoutError: localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it.

Next, we are going to create a database object `db`, which is a property of the `client` object.  MongoDB creates databases lazily, so a database can be accessed like this even if it does not yet exist; it is actually created the first time data is written to it.

A database can be accessed by using "dot" notation (i.e., `client.dbname`), or dictionary notation (i.e., `client['dbname']`).  The same applies to collections.
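Dot notation only works when the name is something Python will accept after a dot, so dictionary notation is the safer choice for unusual names.  The sketch below uses only the standard library to check this; `needs_dict_notation` is a hypothetical helper written for illustration, and `my-db` is a made-up name.  It does not cover names which collide with an attribute the object already has.

```python
import keyword

def needs_dict_notation(name):
    """Return True if `name` cannot be written after a dot in Python source."""
    return not name.isidentifier() or keyword.iskeyword(name)

for name in ['test', 'test_collection', 'my-db', 'class']:
    print(name, needs_dict_notation(name))
```

Names flagged by this check would have to be accessed as `client['my-db']` rather than with a dot.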

Create a database called `test` in a variable called `db`.  Using that variable, create a collection called `test_collection` in a variable called `collection` as follows.  Run the code in the following cell (there should not be any output).

In [4]:
# Create database and collection objects for convenience
db = client.test
collection = db.test_collection

# Using MongoDB

## Inserting data

MongoDB data is stored as BSON (binary JSON), which is essentially JSON with some additional optimisations, so data is inserted as JSON-style objects.  In Python, you use a `dict` for a single document or a `list` of dicts for several documents, and then call `insert_one` or `insert_many` respectively on the collection.
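One practical difference between BSON and plain JSON is that BSON has a native date type, so PyMongo can store a Python `datetime` directly, whereas the standard `json` module cannot serialise one.  The following sketch uses only the standard library to illustrate this; the documents and the `is_plain_json` helper are made up for the example.

```python
import json
from datetime import datetime

# A single document is a dict; several documents are a list of dicts.
single_doc = {'name': 'Example', 'joined': datetime(2018, 11, 16, 19, 35)}
many_docs = [{'n': i} for i in range(3)]

def is_plain_json(doc):
    """Return True if `doc` can be serialised by the standard json module."""
    try:
        json.dumps(doc)
        return True
    except TypeError:
        return False

print(is_plain_json(single_doc))                  # False: datetime is not plain JSON
print(all(is_plain_json(d) for d in many_docs))   # True: only ints and strings
```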

In [5]:
# Create an object and insert into the `test_collection`
single_obj = {'name': 'Amber', 'star_sign': 'Capricorn', 'favourite_song': 'The Load-Out'}
collection.insert_one(single_obj)
single_obj_2 = {'name': 'Huw', 'star_sign': 'Libra', 'favourite_song': 'The masses against the classes'}
collection.insert_one(single_obj_2)
single_obj_3 = {'name': 'Robert', 'star_sign': 'Leo', 'favourite_song': 'Bad day'}
collection.insert_one(single_obj_3)

<pymongo.results.InsertOneResult at 0x7f4dc59281b8>

We will look at querying data in more detail below, but for now, to see whether the object was successfully inserted into the collection, run the code below.  `find_one()` always returns the first document which matches the query.  You will notice that even though we didn't specify an `_id`, one has been added.  This is a unique identifier for the document in the collection.

In [6]:
collection.find_one() # returns a single document matching the query condition

{u'_id': ObjectId('5bef1c045ccae9018a4f55c5'),
 u'favourite_song': u'The Load-Out',
 u'name': u'Amber',
 u'star_sign': u'Capricorn'}

In [7]:
collection.find_one({'name':'Robert'})

{u'_id': ObjectId('5bef1c045ccae9018a4f55c7'),
 u'favourite_song': u'Bad day',
 u'name': u'Robert',
 u'star_sign': u'Leo'}

Remember that with MongoDB you do not have to specify a schema or create a collection in advance; the collection is created automatically.  Documents in the same collection don't need to share a layout, and can be entirely different objects.  Consider the following:

In [8]:
from datetime import datetime
obj1 = {'Meaning of life': 42}
obj2 = {'ABC': 'DEF', 'time': datetime.now()}
collection.insert_one(obj1)
collection.insert_one(obj2)

<pymongo.results.InsertOneResult at 0x7f4dc5928560>

We can also use `insert_many`, which accepts a list of dicts.  In the cell below, create a list of dicts called `many_objects`, and pass it to `insert_many`.  The loop below that iterates over all the documents in the collection.

In [9]:
# YOUR CODE HERE
many_objects = [{'age': x} for x in range(20, 30)]
collection.insert_many(many_objects)

# See what has been inserted into the collection
for doc in collection.find():
    print(doc)

{u'favourite_song': u'The Load-Out', u'_id': ObjectId('5bef1c045ccae9018a4f55c5'), u'name': u'Amber', u'star_sign': u'Capricorn'}
{u'favourite_song': u'The masses against the classes', u'_id': ObjectId('5bef1c045ccae9018a4f55c6'), u'name': u'Huw', u'star_sign': u'Libra'}
{u'favourite_song': u'Bad day', u'_id': ObjectId('5bef1c045ccae9018a4f55c7'), u'name': u'Robert', u'star_sign': u'Leo'}
{u'Meaning of life': 42, u'_id': ObjectId('5bef1c1b5ccae9018a4f55c8')}
{u'_id': ObjectId('5bef1c1b5ccae9018a4f55c9'), u'ABC': u'DEF', u'time': datetime.datetime(2018, 11, 16, 19, 35, 55, 819000)}
{u'age': 20, u'_id': ObjectId('5bef1c315ccae9018a4f55ca')}
{u'age': 21, u'_id': ObjectId('5bef1c315ccae9018a4f55cb')}
{u'age': 22, u'_id': ObjectId('5bef1c315ccae9018a4f55cc')}
{u'age': 23, u'_id': ObjectId('5bef1c315ccae9018a4f55cd')}
{u'age': 24, u'_id': ObjectId('5bef1c315ccae9018a4f55ce')}
{u'age': 25, u'_id': ObjectId('5bef1c315ccae9018a4f55cf')}
{u'age': 26, u'_id': ObjectId('5bef1c315ccae9018a4f55d0')}

## Importing and querying data

For this part of the exercise, we will use a sample dataset provided by Mongo for a documentation tutorial.  The following cell runs `mongoimport`, a command-line tool which comes with Mongo for importing data.  Because these are shell commands rather than Python, the cell uses a Jupyter "magic", which requires that the first line be `%%bash`.  The code does the following:

- Downloads the JSON file from the URL, and saves it as ./primer-dataset.json
- Imports ./primer-dataset.json into the `restaurants` collection of the `test` database, dropping any collection of that name which already exists
- Deletes the file.

Click on the following cell, and execute it:

In [10]:
%%bash
# Use wget to download the data
wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
# mongoimport is the Mongo command to import data.  
# It specifies the database, collection and format, and import file
# --drop means it's going to drop any collection with the same name which already exists
mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json
# Delete the JSON file we just downloaded
rm ./primer-dataset.json

--2018-11-16 19:36:29--  https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11874761 (11M) [text/plain]
Saving to: ‘primer-dataset.json’


Change the variable `collection` to refer to the new collection `restaurants`, and inspect the general format of the data by adding the code below to find the first record of the collection:

In [None]:
# YOUR CODE HERE



We used the [`collection.find()`](https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.find) function earlier to return all the documents we inserted into our `test` collection.  Without any arguments, `find()` will return a cursor over all the documents in the collection.  To refine queries, however, the search can be filtered by the addition of a first parameter.

The `filter` parameter is a dict of the form `{key: value}`, which selects the documents where `key` = `value`.  For example, to find all bakeries in the city, we would use the filter `{'cuisine': 'Bakery'}`.
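As a rough mental model of how an equality filter selects documents, the sketch below applies a filter dict to a few made-up documents in plain Python.  It is a simplification for illustration only, not how MongoDB evaluates queries: the `matches` helper and the sample data are assumptions, and it ignores operators, arrays and nested fields.

```python
def matches(doc, flt):
    """Return True if every key in `flt` has the same value in `doc`."""
    return all(key in doc and doc[key] == value for key, value in flt.items())

sample_docs = [
    {'name': 'A', 'cuisine': 'Bakery', 'borough': 'Bronx'},
    {'name': 'B', 'cuisine': 'Bakery', 'borough': 'Queens'},
    {'name': 'C', 'cuisine': 'Pizza', 'borough': 'Bronx'},
]

bakeries = [d['name'] for d in sample_docs if matches(d, {'cuisine': 'Bakery'})]
# Several keys in one filter must all match.
bronx_bakeries = [d['name'] for d in sample_docs
                  if matches(d, {'cuisine': 'Bakery', 'borough': 'Bronx'})]
print(bakeries)        # ['A', 'B']
print(bronx_bakeries)  # ['A']
```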

**WARNING** Unlike the Mongo command line interface, if you loop over the output of a `find()` query and print it, it will continue to output results until the whole result set has been printed.  This can cause the browser to crash, particularly if the result set is large.

`collection.count_documents(filter)` is a useful way of checking how many documents a query will return (older code uses `find().count()` for this), and `find_one()` shows the general structure of a result.  If you use `find()`, make sure you either include the `limit` argument, or have a counter or other condition to break out of your printing loop!

In [None]:
# Wrap the cursor in list() so that the (at most five) documents are displayed
list(collection.find({'cuisine': 'Bakery'}).limit(5))

Note that `count()` is deprecated in the MongoDB drivers compatible with the 4.0 features; as a result, we will use `count_documents()` (PyMongo's name for `countDocuments()`) in the following exercises.  (For more information, see https://docs.mongodb.com/manual/reference/method/db.collection.count/ and http://api.mongodb.com/python/current/changelog.html.)

In [None]:
collection.count_documents({'cuisine': 'Bakery'})

A filter can have as many conditions as you like, and will assume that you are using an AND condition, unless you specify otherwise (as shown later with `$or`).  In the cell below, write a query to return the number of establishments with a cuisine of `Hamburgers` in the borough of Manhattan.

In [None]:
# YOUR CODE HERE
# All establishments with: 
# * a cuisine of 'Hamburgers' 
# * in the borough of 'Manhattan'



In [None]:
# List every distinct value of the `cuisine` field in the collection
collection.distinct('cuisine')

### Sub-documents

A valid JSON-style "document" can have another JSON document inside it.  We use "dot" notation to access the fields of these sub-documents.  For example, to get all the restaurants in a certain zipcode, you would run code as follows:

In [None]:
from pprint import pprint
cursor = collection.find({'address.zipcode': '10462'}, limit=5)
for c in cursor:
    pprint(c)
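To make the "dot" notation concrete, the following sketch resolves a dotted key against a nested dict in plain Python.  It is only an illustration of the idea: `get_path` is a hypothetical helper, the sample document is made up, and it does not handle arrays the way MongoDB does.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as 'address.zipcode' through nested dicts."""
    current = doc
    for part in dotted_key.split('.'):
        if not isinstance(current, dict) or part not in current:
            return None  # the path does not exist in this document
        current = current[part]
    return current

restaurant = {'name': 'Sample Diner',
              'address': {'street': 'Example Ave', 'zipcode': '10462'}}
print(get_path(restaurant, 'address.zipcode'))  # 10462
print(get_path(restaurant, 'address.country'))  # None
```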

### Operators

MongoDB has a series of [operators](https://docs.mongodb.com/manual/reference/operator/query/) which allow us to build more sophisticated filters for our queries.  There are too many to cover individually, and the syntax varies from one operator to another, so there is no general rule; instead, we will go over a few important examples here.  Make sure you check the [documentation](https://docs.mongodb.com/manual/reference/operator/query/) for the usage of each one.

#### [`$or`](https://docs.mongodb.com/manual/reference/operator/query/or/#op._S_or)

Performs a logical **OR** operation on a list of filter expressions, matching documents which satisfy at least one of them, as in the code below:

In [None]:
filter = {"$or": [{"cuisine": "Polynesian"}, {"cuisine": "Hawaiian"}]}
for f in collection.find(filter):
    pprint(f)

#### [`$regex`](https://docs.mongodb.com/manual/reference/operator/query/regex/#op._S_regex)

The `$regex` operator searches for a regular expression on a particular field.  Within the filter field, the named field (a key) takes a dict as a value.  

For example, to count all restaurants whose name starts with the word "Pretzel", you can do the following:

In [None]:
filter = {"name": {"$regex": '^Pretzel'}}
collection.count_documents(filter)

There are other ways to use regular expressions in PyMongo: for example, you can pass a pattern compiled with Python's [`re`](https://docs.python.org/3/library/re.html) module as the value in the filter.  In the Mongo shell itself, the simplest syntax is to enclose the regular expression inside `/` characters, as in `{"name": /^Pretzel/}`, but that is JavaScript syntax and doesn't work in PyMongo.
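As a small sketch of the `re` approach, the code below compiles a pattern, checks it locally against a few made-up names, and builds a filter dict which could be passed to `find()` or `count_documents()`.  The sample names and the `regex_filter` variable are assumptions for illustration; the final call is left commented out because it needs a running database.

```python
import re

# Compiled pattern: names which start with "Pretzel" (case-sensitive)
pattern = re.compile(r'^Pretzel')

# Check the pattern locally before using it in a query
sample_names = ['Pretzel Time', 'Auntie Pretzel Stand', 'pretzel house']
local_matches = [n for n in sample_names if pattern.search(n)]
print(local_matches)  # ['Pretzel Time']

# The compiled pattern can be used as the value in a filter dict
regex_filter = {'name': pattern}
# collection.count_documents(regex_filter)  # requires a live connection
```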

Using the `$regex` operator, find all restaurants in the borough of Brooklyn whose name ends in "Bar".

HINT: The regex character for the end of a string is `$`

In [None]:
# YOUR CODE HERE


#### [`$gt`](https://docs.mongodb.com/manual/reference/operator/query/gt/#op._S_gt)

The `$gt` operator matches documents where the value of a field is greater than a given value.

For example, consider this code which counts the restaurants which have had at least one score of more than 12:

In [None]:
filter = {'grades.score': {'$gt': 12}}
collection.count_documents(filter)
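Because `grades` is an array of sub-documents, a restaurant matches this filter when any one of its grades has a score above the threshold.  The plain-Python sketch below illustrates that "at least one element" behaviour on made-up data; `any_score_gt` is a hypothetical helper, not part of PyMongo.

```python
def any_score_gt(doc, threshold):
    """True if at least one entry in doc['grades'] has a score > threshold."""
    return any(g.get('score') is not None and g['score'] > threshold
               for g in doc.get('grades', []))

sample = [
    {'name': 'A', 'grades': [{'score': 5}, {'score': 14}]},
    {'name': 'B', 'grades': [{'score': 9}, {'score': 11}]},
    {'name': 'C', 'grades': []},
]
matching = [d['name'] for d in sample if any_score_gt(d, 12)]
print(matching)  # ['A']
```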

Using one of the other comparison operators, find all restaurants which had a grade awarded on 15 December 2012.  You'll need to create a [`datetime`](https://docs.python.org/2/library/datetime.html#datetime-objects) object in Python.

In [None]:
# YOUR CODE HERE


## Organising output

So far, we have seen two of the arguments to `find()` and related functions: the `filter`, which allows us to select the criteria for documents in the collection, and the `limit`, which limits the number of results.  You should read the documentation for the function fully in your own time, but for now, we will go over two other arguments which are for organising output: field selection and sorting.

The field selection, or `projection`, argument comes after the (optional) filter, and is either:

* A list of fields to include (plus `_id`, which is included by default)
* A dict mapping field names to True/False, to include or exclude them

For example, to display only the name of the restaurant:

In [None]:
filter = {'cuisine': 'Brazilian'}
fields = {'_id': False, 'name': True}
collection.find_one(filter, fields)
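The effect of an inclusion-style projection can be sketched in plain Python as below.  This is a simplified model for illustration: `project` is a hypothetical helper, the document is made up, and exclusion-style projections are not handled.

```python
def project(doc, fields):
    """Keep fields marked True; keep '_id' unless it is explicitly False."""
    wanted = {k for k, include in fields.items() if include}
    if fields.get('_id', True):
        wanted.add('_id')
    return {k: v for k, v in doc.items() if k in wanted}

doc = {'_id': 1, 'name': 'Sample Grill', 'cuisine': 'Brazilian', 'borough': 'Queens'}
print(project(doc, {'_id': False, 'name': True}))  # {'name': 'Sample Grill'}
print(project(doc, {'name': True}))                # {'_id': 1, 'name': 'Sample Grill'}
```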

The `sort` argument is a list of `(field name, direction)` pairs.  Sorting can be done either with a named parameter when calling `find()`, or with the cursor's own [`sort()`](https://api.mongodb.com/python/current/api/pymongo/cursor.html#pymongo.cursor.Cursor.sort) method.

For example, to sort the results alphabetically by name, consider the following code:

In [None]:
import pymongo
# The ASCENDING and DESCENDING constants have values of 1 (ASCENDING) and -1 (DESCENDING)
sort = [('name', pymongo.ASCENDING)]
for d in collection.find(filter, projection=fields, sort=sort):
    pprint(d)
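When the sort list contains more than one pair, earlier pairs take priority and later ones break ties.  The sketch below imitates this in plain Python on made-up documents; `apply_sort` is a hypothetical helper, and the two constants are defined locally with the same values as `pymongo.ASCENDING` and `pymongo.DESCENDING` so that the example runs without a database.

```python
ASCENDING, DESCENDING = 1, -1  # same values as the pymongo constants

def apply_sort(docs, sort_spec):
    """Order docs by a list of (field, direction) pairs, first pair first."""
    result = list(docs)
    # Sort by the least significant key first; Python's sort is stable.
    for key, direction in reversed(sort_spec):
        result.sort(key=lambda d: d[key], reverse=(direction == DESCENDING))
    return result

docs = [
    {'name': 'Beta', 'borough': 'Queens'},
    {'name': 'Alpha', 'borough': 'Queens'},
    {'name': 'Gamma', 'borough': 'Bronx'},
]
ordered = apply_sort(docs, [('borough', ASCENDING), ('name', DESCENDING)])
print([d['name'] for d in ordered])  # ['Gamma', 'Beta', 'Alpha']
```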

# MongoDB Aggregation Framework

The most common usage for the aggregation framework is to perform group operations such as sum, count or average.  The framework works as a pipeline, with a series of different stages where the data are transformed in each one.

At its simplest, this can be used to obtain output like min, max, count, avg on a collection as follows:

In [None]:
group = {
    '$group': {
        '_id': None, 
        'size': {'$sum': 1},
        'min': {'$min': '$restaurant_id'},
        'max': {'$max': '$restaurant_id'}
    }
}

cursor = collection.aggregate([group])
for c in cursor:
    print(c)

Note that it has an `'_id': None` key/value pair in it.  It is compulsory for a `$group` stage to have an `_id`, which indicates what the documents are grouped by.  In this case, we haven't grouped them at all (every document falls into a single group); however, it can also be used for more complex output where documents are grouped according to a field.
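The role of `_id` in a `$group` stage can be illustrated with a plain-Python counterpart, shown below on made-up documents.  `group_count` is a hypothetical helper which only counts documents per group; it is a sketch of the idea rather than of how the aggregation framework is implemented.

```python
from collections import defaultdict

def group_count(docs, key=None):
    """Count docs per value of `key`; key=None puts everything in one group."""
    counts = defaultdict(int)
    for doc in docs:
        group_id = doc.get(key) if key is not None else None
        counts[group_id] += 1
    return [{'_id': gid, 'count': n} for gid, n in counts.items()]

docs = [{'cuisine': 'Pizza'}, {'cuisine': 'Bakery'}, {'cuisine': 'Pizza'}]
print(group_count(docs))             # [{'_id': None, 'count': 3}]
print(group_count(docs, 'cuisine'))  # one result per distinct cuisine
```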

### Aggregation example

Consider this example of finding the breakdown of how many restaurants of each type there are in the Bronx.  We would need to go through the following stages:

- Identify restaurants which are in the Bronx
- Group the restaurants by type to get the count
- Sort the results in a sensible way

The code to perform this query is below:

In [None]:
# Restrict the results to only establishments in the Bronx.
# '$match' names the pipeline stage; its dictionary uses the same filter syntax as find()
match = {
    "$match": {"borough": "Bronx"}
}

# '$group' names the pipeline stage
# _id is the field to group by (like SQL GROUP BY)
# count is the name of the field that will hold the result
# $sum adds up the given value for each document in the group, so 1 counts the documents
group = {
    '$group': {'_id': '$cuisine', 'count': {'$sum': 1}}
}

# '$sort' names the pipeline stage
# count is the field to sort by, and -1 (pymongo.DESCENDING) means descending order
sort = {
    '$sort': {'count': pymongo.DESCENDING}
}

cursor = collection.aggregate([match, group, sort])
for c in cursor:
    print(c)



This is a simple query, which shows some of the basic stages of the aggregation pipeline.  It can be improved as follows:

* We can change the name of the `_id` in the output back to `cuisine` using the `$project` stage
* We can change the order of the output to be sorted in alphabetical order as well
* We can limit the results to include results only with a count of 20 or more

Implement those stages in the cells below.

In [None]:
# YOUR CODE HERE


pipeline = [match, group]  # More code here...
cursor = collection.aggregate(pipeline)
for c in cursor:
    print(c)


In [None]:
# YOUR CODE HERE FOR sort

pipeline.append(sort)
cursor = collection.aggregate(pipeline)
for c in cursor:
    pprint(c)


In [None]:
# YOUR CODE HERE FOR `count` >= 20

cursor = collection.aggregate(pipeline)
for c in cursor:
    pprint(c)


## Challenge

How would you work out the percentage of each type of cuisine out of all selected restaurants?

In [None]:
# YOUR CODE HERE...

# Where next?

* Dealing with array data
* Update, delete, and drop
* Setting up authentication
* Sharding databases