# Basic CRUD in MongoDB

This Notebook will take you through a few basic operations with a small MongoDB database, just to see how the basic CRUD (Create, Read, Update, Delete) operations work.

We're using the [PyMongo](http://api.mongodb.org/python/current/) module to allow Python to connect to MongoDB and run queries on it.

The Notebooks in the module should describe most of the features of PyMongo you need, but you may also wish to refer to the [PyMongo API documentation](http://api.mongodb.org/python/current/api/index.html). The PyMongo package provides a fairly thin wrapper on MongoDB, so with a bit of digging, you may also find the [MongoDB reference](http://docs.mongodb.org/manual/reference/) useful. The reference book *MongoDB: The Definitive Guide* provides additional context and background.

This Notebook only covers the basics of CRUD operations. You'll use more sophisticated queries in Parts 15 and 16.

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database and query its contents (depending on what usage rights you have been granted). As with the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https:tm351.open.ac.uk), or a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run whichever of the following cells is appropriate to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https:tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine.
- Your user name and (character escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
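
As a rough sketch of how these parts fit together, the following builds a connection string of the shape described above. The helper name `build_connection_string` and the example credentials are hypothetical, and the module's own setup cells may assemble the string differently. The standard library function `quote_plus` handles the character escaping of the user name and password.

```python
from urllib.parse import quote_plus


def build_connection_string(user=None, password=None,
                            host='localhost', port=27017, authsource=None):
    """Assemble a MongoDB connection string (hypothetical helper, for illustration)."""
    credentials = ''
    if user is not None and password is not None:
        # Escape characters such as '@' or '/' that would otherwise break the URL
        credentials = f'{quote_plus(user)}:{quote_plus(password)}@'
    options = f'?authsource={authsource}' if authsource else ''
    return f'mongodb://{credentials}{host}:{port}/{options}'


# Local server: no credentials needed
local_string = build_connection_string()

# Remote server: made-up credentials, with the authentication database named
remote_string = build_connection_string('abc123', 'p@ss/word', authsource='user-data')
```

Printing `local_string` and `remote_string` shows the two shapes of connection string, one for each kind of server.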

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## Mongo databases and collections

In MongoDB, data is stored as a *collection* of JSON documents. Data for different applications can be stored in different collections. These collections are themselves stored in a database within the MongoDB server.

The variable `DB_NAME` contains the name of a database to use. For the remote VCE on `tm351.open.ac.uk`, a writable database has been created for you, using your OUCU as the database name.

If you are using a local VCE, you can call the database anything you like! We have set the variable `DB_NAME` to the value `test_db` for this notebook, but you can use a different name if you want.


We connect to a database by indexing the MongoDB client object with the database name. If the database does not already exist, then it will be created for you, provided you have appropriate permissions (which you will have on a local VCE, but not the remote VCE):

In [None]:
mongo_db=mongo_client[DB_NAME]

Having created the database (or connected to an existing database), we can set up a collection of documents. To demonstrate this functionality, we will set up a simple collection inside the database which we have created. The data we're going to be using in this notebook relates to cast members of the popular BBC *Doctor Who* program, which has been running for a long, long, time...

First, we will delete any existing instances of the doctor who collection (for example, in case you have partially run this notebook previously):

In [None]:
mongo_db.drop_collection('doctor_who_collection')

In [None]:
dw_collection = mongo_db['doctor_who_collection']

This command should have created a new collection called `doctor_who_collection` inside your database. The previous command `mongo_db.drop_collection('doctor_who_collection')` will have removed the collection `doctor_who_collection` if it already existed. Unlike SQL, if the collection is not present, MongoDB does not raise an error, but returns a structure which encodes the failure to drop the collection (in this case, in the key-value pair `'errmsg': 'ns not found'`).

Having created the new collection, we can check to see what sort of object each of these represents, noting that each reports `dict` as its `document_class`:

In [None]:
# Check the database object

mongo_db

In [None]:
# Check the collection object

dw_collection

Note that database and collection creation in MongoDB is *lazy*: the database and collection aren't actually created in the DBMS until the first document is written. As a result, it is possible that if anything has gone wrong up to this point, you won't find out until you actually try interacting with the database.
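
One way to observe this lazy behaviour is to ask the server which collections it actually holds. The sketch below assumes that `db` is a PyMongo `Database` object such as `mongo_db`; the helper name `collection_exists` is our own, not part of PyMongo.

```python
def collection_exists(db, name):
    """Return True if the server already holds a collection called `name`.

    `db` is assumed to be a PyMongo Database object, such as `mongo_db`.
    """
    # list_collection_names() asks the server, so it only reports
    # collections that have really been created
    return name in db.list_collection_names()
```

With the objects from this notebook, `collection_exists(mongo_db, 'doctor_who_collection')` would be expected to return `False` at this point, and `True` once the first document has been inserted.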

## Create - Adding items to the database

`pymongo` doesn't return data directly as `pandas` DataFrames. Instead, the documents MongoDB stores are typically JSON data structures.

These map most naturally onto Python `dict`s, but it's often quite straightforward to convert the returned data to a tabular, `pandas` DataFrame form.

This means we can separate our concerns somewhat:

- firstly, how do we store, search for and retrieve data from the Mongo database (a MongoDB query question);
- secondly, how do we get responses from the database into a form we are familiar with and can happily work with (for example, `pandas` DataFrames; this is a Python data wrangling question);
- thirdly, how do we manipulate the data to generate charts from it, statistically analyse it, and so on (that is: how do we analyse the `pandas` represented data).

So let's load in `pandas`, along with any other packages that might be useful.

In [None]:
import pandas as pd

from datetime import datetime

The MongoDB database stores documents as JSON objects, but we pass in a Python `dict`. The  `pymongo` package will then handle the data type conversion for us automatically.

Note that keys in a document have to be strings, but the values can be almost anything (strings, numerics, datetime objects, lists, dicts, etc.).

PyMongo automatically handles most of the translation between simple Python data structures and the JSON structures that Mongo uses. This table summarises the main equivalences.

| Document DB term | JSON structure | Python structure |
|------------------|----------------|------------------|
| Document or sub-document | Object | dict  |
| List | Array | list |
| Key | String | string |
| String | String | string |
| Number | Number | int or float, depending |
| Date | Date | datetime.datetime object |
| Object IDs | BSON ObjectId | BSON ObjectId |

MongoDB uses BSON, a binary version of JSON, internally. You can generally ignore this, except when you want to create new ObjectIds for documents.
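
To make the table concrete, here is a small illustrative document that uses several of these types, together with a check that every key is a string. The `tags`, `details` and `added` values are made up for the example, and `keys_are_strings` is a hypothetical helper rather than anything PyMongo provides.

```python
from datetime import datetime

# Illustrative document only: the 'tags', 'details' and 'added' values are made up
example_doc = {
    'name': 'William',              # string
    'birthyear': 1908,              # number
    'tags': ['actor', 'doctor'],    # list, stored as an array
    'details': {'incarnation': 1},  # dict, stored as a sub-document
    'added': datetime(2020, 1, 1),  # datetime, stored as a date
}


def keys_are_strings(value):
    """Recursively check that every dict key inside `value` is a string."""
    if isinstance(value, dict):
        return all(isinstance(k, str) and keys_are_strings(v)
                   for k, v in value.items())
    if isinstance(value, list):
        return all(keys_are_strings(item) for item in value)
    return True
```

A check like `keys_are_strings(example_doc)` can be a handy guard before inserting data that has come from elsewhere.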

Let's insert a few simple documents into our test collection, just to get started. The command we need to use to insert a single JSON document into a collection takes the form: `mongo_client[DBNAME][COLLECTIONNAME].insert_one(document)`, where `mongo_client` is a `pymongo.MongoClient` object. We have already set up the collection; the steps we ran are equivalent to:
```python
dw_collection = mongo_client[DB_NAME]['doctor_who_collection']
```
so to insert a single document we just use:

In [None]:
# Insert a single document
dw_collection.insert_one({'name': 'William', 'birthyear': 1908})

Now that we have added a document, we can use the `count_documents` method on the collection to see that the collection now contains one document:

In [None]:
dw_collection.count_documents({})

To add several documents at a time, use the `insert_many` method:

In [None]:
dw_collection.insert_many([{'name': 'Patrick', 'birthyear': 1920},
                           {'name': 'Jon', 'birthyear': 1919},
                           {'name': 'Tom', 'birthyear': 1934},
                           {'name': 'Peter', 'birthyear': 1951},
                           {'name': 'Colin', 'birthyear': 1943},
                           {'name': 'Sylvester', 'birthyear': 1943},
                           {'name': 'Paul', 'birthyear': 1959}])

As before, we can use `count_documents` to see how many documents we now have in the collection:

In [None]:
dw_collection.count_documents({})

### Inserting from a DataFrame

Suppose we have several more data items where the data is presented in a dataframe, perhaps because it has been generated as part of an analysis elsewhere in *pandas*. For example, we might have a dataframe with one column of names and another of birth years:

In [None]:
modern_doctors_df=pd.DataFrame({'name': ['Christopher', 'David', 'Matt', 'Peter', 'Jodie'],
                                'birthyear':[1964, 1971, 1982, 1958, 1982]})
modern_doctors_df

How can we easily load these into the database as separate records?

From the dataframe, we can generate a list of dicts that we could then add to the database. The `.to_dict(orient='records')` method will generate such a list from a dataframe:

In [None]:
modern_doctors_df.to_dict(orient='records')

This is now in the correct format for us to add the records using `.insert_many` again:

In [None]:
dw_collection.insert_many(modern_doctors_df.to_dict(orient='records'))

And we can now see that another 5 documents have been added to the collection:

In [None]:
dw_collection.count_documents({})

## Read: Retrieving items from the database

### Retrieving a single item from the database
The `.find_one()` method will return a single (arbitrary) document from a collection:

```
client.DBNAME.COLLECTIONNAME.find_one()
client.DBNAME.COLLECTIONNAME.find_one(SELECT)
```

Note that Mongo automatically adds an `_id` field to each document. (You can override this if you really want to, but we won't bother.)

In [None]:
dw_collection.find_one()

The `pymongo` library does the type conversion from the stored MongoDB document back to a Python `dict` for us:

In [None]:
type(dw_collection.find_one())

### Selection in MongoDB

If we give a dict of some key-value pairs, `find_one()` will return a document that matches them.

This is how MongoDB handles **selection**, which we have already seen for pandas DataFrames and tables in SQL. It allows us to specify only the documents we're interested in.

So if we want to find a document in the collection `dw_collection` which has a key `name` with value `Peter`, we would use the following:

In [None]:
dw_collection.find_one({'name': 'Peter'})

### Activity 1

Find a document in the collection `dw_collection` for someone born in 1943.

In [None]:
# Enter your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Query the database by selecting records that match a specific birth year: 

In [None]:
dw_collection.find_one({'birthyear': 1943})

#### End of Activity 1

----------------------------------------

### Retrieving multiple items from the database

`find()` will find all the documents that match the query, and returns a cursor that can be iterated over to retrieve the documents one by one.

```
client.DBNAME.COLLECTIONNAME.find()
client.DBNAME.COLLECTIONNAME.find(SELECT)
client.DBNAME.COLLECTIONNAME.find(SELECT, PROJECTION)
```

Again, the query acts to **select** the documents we want.

Let's find all the documents in the collection `dw_collection` which have a key `name` whose value is `Peter`:

In [None]:
dw_collection.find({'name': 'Peter'})

Rather than returning a document, the `.find` method returns a PyMongo `Cursor` object.

This acts like an iterable in most cases, so we can print all the values by using a `for` loop:

In [None]:
cursor = dw_collection.find({'name': 'Peter'})
for p in cursor:
    print(p)

The selection has found both the documents where the value of `name` is `Peter`.

Because the cursor usually behaves as an iterable, we can also convert the results directly to a list:

In [None]:
list(dw_collection.find({'name': 'Peter'}))

We can iterate directly over the cursor, but this is a one-pass-only process. The cursor remembers where it is in the set of results and carries on from there. For instance, if we create a new cursor and print the items:

In [None]:
cursor = dw_collection.find({'name': 'Peter'})
for p in cursor:
    print(p)

we can't then reuse the cursor: all the results have already been processed:

In [None]:
for p in cursor:
    print(p)
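
The same one-pass behaviour can be seen with an ordinary Python iterator, which needs no database connection. This is only an analogy for how the cursor behaves, using two documents that mirror the `Peter` records above (without their `_id` fields).

```python
# A plain iterator standing in for the cursor (an analogy, not a real Cursor)
results = iter([{'name': 'Peter', 'birthyear': 1951},
                {'name': 'Peter', 'birthyear': 1958}])

# The first pass consumes every item
first_pass = [doc['birthyear'] for doc in results]

# The second pass finds nothing left to iterate over
second_pass = [doc['birthyear'] for doc in results]
```

If the results are needed more than once, and are small enough, storing `list(cursor)` in a variable keeps them available for reuse.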

In certain cases, this manual handling of cursors may not appear to be very useful: why not just return all the data from a query at the same time so we can work with it, in memory, in a *pandas* dataframe?

One reason is the question of resources: bandwidth, memory and "compute power". If the dataset is very large, we are likely to run into problems if we try to download the entire set of returned documents all at once; we are also likely to hit problems storing them all in memory at the same time. MongoDB is designed to be able to rapidly store extremely large datasets, so for many of its use cases, the results of a typical query would not fit in the memory of a standard desktop or laptop computer.

A third resource constraint applies to actually performing computations over the data. If we do not have a powerful computer, it could take a very long time to perform the whole computation.

When connecting to a database server on a remote server, it makes sense for us to try to perform as much computation as possible on that remote server via our database queries, rather than consuming:

- network capacity to download large amounts of data (which takes time);
- large amounts of local memory to hold the downloaded data, and
- large amounts of local computational effort to process that data.
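
When some local processing is unavoidable, iterating over the cursor lets us handle one document at a time rather than holding the whole result set in memory. The following is a minimal sketch of that streaming style; `mean_birthyear` is a hypothetical helper that accepts any iterable of documents, which could be a PyMongo cursor.

```python
def mean_birthyear(docs):
    """Average the `birthyear` field over an iterable of documents.

    Only a running total and a count are kept, so memory use stays
    constant however many documents the iterable yields.
    """
    total = 0
    count = 0
    for doc in docs:
        total += doc['birthyear']
        count += 1
    return total / count if count else None
```

Note that every document is still transferred over the network in this approach, so it addresses the memory constraint but not the bandwidth one.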

To identify how many documents match the query, we can perform a count operation directly on the database. Given the same filter term as before, `{'name': 'Peter'}`, the `.count_documents()` method returns a count of the matching documents.

```
client.DBNAME.COLLECTIONNAME.count_documents({})
client.DBNAME.COLLECTIONNAME.count_documents(SELECT)
```

For example:

In [None]:
dw_collection.count_documents({'name': 'Peter'})

To count all the documents in a collection, pass an empty set of filter terms (`{}`), as we did at the beginning of the notebook:

In [None]:
dw_collection.count_documents({})

### Projection in MongoDB

An optional second argument to `find()` specifies the key-value pairs to return. If you give a list of keys, `find()` will return just those plus the `_id`. 

This is how PyMongo does **projection**, returning only some parts of the found documents.

In [None]:
list(dw_collection.find({'name': 'Peter'}, ['birthyear']))

Once again, you'll notice that Mongo returns the document `_id`. It always does that unless we specifically ask it not to. 

The 'list of keys' notation, `['birthyear']`, is a convenient shorthand for the full specification for which keys to return. 

Mongo actually expects an object that specifies which keys to include or exclude. If the keys in that projection specification have a value `True`, the key is included; if they have a value `False`, they're excluded. (See _MongoDB: The Definitive Guide_ for details).

The previous query is more comprehensively specified as:

In [None]:
list(dw_collection.find({'name': 'Peter'}, {'birthyear': True}))

or, to exclude the `_id`:

In [None]:
list(dw_collection.find({'name': 'Peter'}, {'birthyear': True, '_id': False}))
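
The relationship between the list shorthand and the full specification can be sketched as a small conversion function. The name `projection_from_keys` is hypothetical; it simply builds the kind of `dict` used in the two queries above.

```python
def projection_from_keys(keys, include_id=True):
    """Expand a list of keys into an explicit projection specification."""
    # Every listed key is marked for inclusion
    projection = {key: True for key in keys}
    if not include_id:
        # The _id is returned by default, so it must be excluded explicitly
        projection['_id'] = False
    return projection
```

For example, `projection_from_keys(['birthyear'], include_id=False)` produces the specification used in the last query.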

### Activity 2

According to the collection `dw_collection`, how many people were born in 1943? What are their names? Why might you want to return just a count of the number of responses for a query?

Remember, the query returns a cursor object so you will need to find some way of retrieving the individual names from that object.

How many people in the collection `dw_collection` were born in 1943?


In [None]:
# Write your code in this cell

What are their names?


In [None]:
# Write your code in this cell

Why might you want to return just a count of the number of responses for a query?

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To count how many documents relate to a particular query, use the `.count_documents()` collection method with the appropriate selection criteria:

In [None]:
# How many people were born in 1943?
dw_collection.count_documents({'birthyear': 1943})

To find just the names of the people born in the desired year, we need to *select* on the year and then return just the names as a *projection* of the result.

Casting the returned cursor object to a list will display all the results.

In [None]:
# What are the names of people born in 1943?
list(dw_collection.find({'birthyear': 1943}, {'name': 1, '_id': 0}))

Assume, for a moment, that the cursor object returned from a query *does not* provide a simple way of finding the length of the object. In such a case, you would have to "consume" the whole of the cursor in order to find its size, for example by casting it to a list and then finding the length of the list. This would be inefficient, and may also cause performance issues if your response contains millions of items. Looking up the number of documents in a response before *casting* it to a list might be a sensible precaution.

More generally, we might want to know how many items might be returned from a query so we can take steps to process that volume of results appropriately.

#### End of Activity 2

---------------------------------------------

### Limiting the number of returned objects

We can also limit the number of documents returned by `find()` by passing the `limit` keyword argument alongside the selection criteria. This is very useful when exploring large datasets whilst we are developing queries.

The `limit` keyword is explicitly stated and is used in the following way:

`client.DBNAME.COLLECTIONNAME.find(SELECT, PROJECTION, limit=N)`

In [None]:
list(dw_collection.find({}, ['name', 'birthyear'], limit=3))

Note that the ordering of the returned documents is arbitrary, so in this example we cannot dictate which three documents will be retrieved.
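
If a predictable selection is needed, one option is to ask for the results to be sorted before the limit is applied. The sketch below assumes that `collection` is a PyMongo `Collection` such as `dw_collection`, and uses the `sort` keyword argument of `find()`; the names `find_kwargs` and `earliest_born` are our own.

```python
ASCENDING = 1  # PyMongo also exposes this value as pymongo.ASCENDING

# Keyword arguments for find(): the same selection and projection as above,
# plus an explicit sort so that "the first three" is better defined
find_kwargs = {
    'filter': {},
    'projection': ['name', 'birthyear'],
    'sort': [('birthyear', ASCENDING)],
    'limit': 3,
}


def earliest_born(collection):
    """Return up to three documents with the smallest `birthyear` values.

    `collection` is assumed to be a PyMongo Collection, such as `dw_collection`.
    """
    return list(collection.find(**find_kwargs))
```

Documents that share the same `birthyear` may still come back in an arbitrary order relative to each other.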

### Conditions in selection

In SQL, we can use the `WHERE` clause to select rows according to one or more conditions. In MongoDB, we use the dictionary format of the query to express the selection. For example, if we wanted to find all the documents in `dw_collection` which have a value of `birthyear` which is less than 1950, we would use the following: 

In [None]:
list(dw_collection.find({'birthyear': {'$lt': 1950}}, ['name', 'birthyear']))

The condition is expressed as the dict:
```python
{'$lt': 1950}
```

The set of available comparison operators is:

|condition | meaning|
|---|---|
|`'$eq'` | is equal to (=) |
|`'$gt'` | is greater than ($>$) |
|`'$gte'` | is greater than or equal to ($\geq$) |
|`'$in'` | is in (a list) ($\in$) |
|`'$lt'` | is less than ($<$) |
|`'$lte'` | is less than or equal to ($\leq$) |
|`'$ne'` | is not equal to ($\neq$) |
|`'$nin'` | is not in ($\not\in$) |

Note that in the expression, the operator must be written as a quoted string.

To combine these operators, MongoDB also provides a collection of logical connectives:

|connective | meaning|
|---|---|
|`'$and'` | logical and ($\wedge$) |
|`'$or'` | logical or ($\vee$) |
|`'$not'` | logical negation ($\neg$) |
|`'$nor'` | negation of or ($\neg\vee$) |

`$and`, `$or` and `$nor` should be applied to a list of expressions, whereas `$not` is applied to a single operator expression.

So for example, if we wanted to find the documents with a value of `birthyear` greater than 1940 *and* less than 1960, we would use:

In [None]:
list(dw_collection.find({'$and': [{'birthyear': {'$gt': 1940}}, {'birthyear': {'$lt':1960}}]}, ['name', 'birthyear']))
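
To see how such query dictionaries are interpreted, here is a toy, pure-Python evaluator for a small subset of the operators. It is only a sketch for building intuition: it is not how MongoDB implements matching, it handles top-level keys only, and it does not treat missing fields faithfully. It also shows that several operators placed on the same key in one condition are combined with a logical and, so the range query can be written more compactly. The sample documents repeat the names and birth years inserted earlier.

```python
import operator

# Comparison operators supported by this toy evaluator (a small subset)
COMPARISONS = {
    '$eq': operator.eq, '$ne': operator.ne,
    '$gt': operator.gt, '$gte': operator.ge,
    '$lt': operator.lt, '$lte': operator.le,
    '$in': lambda value, options: value in options,
    '$nin': lambda value, options: value not in options,
}


def matches(doc, query):
    """Return True if `doc` satisfies `query` (toy semantics only)."""
    for key, condition in query.items():
        if key == '$and':
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif key == '$or':
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif key == '$nor':
            if any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # e.g. {'birthyear': {'$gt': 1940, '$lt': 1960}}: every operator must hold
            if key not in doc:
                return False
            if not all(COMPARISONS[op](doc[key], arg)
                       for op, arg in condition.items()):
                return False
        elif doc.get(key) != condition:
            # Plain value: an exact-match test, as in {'name': 'Peter'}
            return False
    return True


docs = [{'name': 'William', 'birthyear': 1908}, {'name': 'Patrick', 'birthyear': 1920},
        {'name': 'Jon', 'birthyear': 1919}, {'name': 'Tom', 'birthyear': 1934},
        {'name': 'Peter', 'birthyear': 1951}, {'name': 'Colin', 'birthyear': 1943},
        {'name': 'Sylvester', 'birthyear': 1943}, {'name': 'Paul', 'birthyear': 1959}]

explicit_and = {'$and': [{'birthyear': {'$gt': 1940}}, {'birthyear': {'$lt': 1960}}]}
compact_range = {'birthyear': {'$gt': 1940, '$lt': 1960}}

explicit_names = [d['name'] for d in docs if matches(d, explicit_and)]
compact_names = [d['name'] for d in docs if matches(d, compact_range)]
```

Both forms select the same documents here, and either dictionary could be passed to `find()` as the selection.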

Finally, we might find it helpful to cast the output into a pandas DataFrame, rather than having a list of dicts. The constructor for a dataframe will automatically map the key-value pairs in each dict into appropriate columns in a DataFrame:

In [None]:
pd.DataFrame(dw_collection.find({'$and': [{'birthyear': {'$gt': 1940}}, {'birthyear': {'$lt':1960}}]}, ['name', 'birthyear']))

## Update: Changing the values in the database

Things have moved on a bit from the situation described in the original edition of _MongoDB: The Definitive Guide_.

There are now several commands for changing a document. For example, `replace_one()` takes two arguments: a specification of the document to update (in the same way as `find()`) and the document to replace it with. The entirety of the document is replaced with the one given. If multiple documents match the query, an arbitrary one is replaced.

```python
    client.DBNAME.COLLECTIONNAME.replace_one(SELECT, REPLACEMENT_DOCUMENT)
```

In most cases, you'll want `update_one()` or `update_many()`. These both take two arguments: a specification of the document(s) to update, and a description of the changes to make to those documents.

The changes are specified with the operator `$set`:

```python
    client.DBNAME.COLLECTIONNAME.update_one(SELECT, {'$set': UPDATED_DOCUMENT})
    client.DBNAME.COLLECTIONNAME.update_many(SELECT, {'$set': UPDATED_DOCUMENT})
```

The update methods let us find one or more documents that match the search criteria, and then change existing fields in the document, or add additional fields.
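
The difference between replacing a document and updating it with `$set` can be mimicked with plain Python `dict`s. This is only an illustration of the effect on the stored document, with a made-up `_id` value; no database calls are involved.

```python
# A stand-in for a stored document (the _id value is made up)
original = {'_id': 1, 'name': 'Patrick', 'birthyear': 1920}
change = {'surname': 'Troughton'}

# Effect of replace_one: the stored document becomes the replacement,
# keeping only the original _id
after_replace = {'_id': original['_id'], **change}

# Effect of update_one with $set: the named fields are added or
# overwritten, and all the other fields are kept
after_set = {**original, **change}
```

After a replacement, the `name` and `birthyear` fields are gone, which is why `$set` is usually the safer choice when only part of a document needs to change.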

For example, suppose we wanted to add a surname to a particular record, for example by adding `Troughton` to the record for the actor Patrick Troughton.

We can view the document using `find_one`:

In [None]:
dw_collection.find_one({'name': 'Patrick'})

To update the document, we use `$set` to add the new key and value:

In [None]:
r = dw_collection.update_one({'name': 'Patrick'}, {'$set': {'surname': 'Troughton'}})
r

All of these operations return an `UpdateResult` object, which can be interrogated to find what effect the update had on the collection.

In [None]:
r.matched_count, r.modified_count

In this case, one document was found, and one was updated.

If we now look for the updated document, we see the change:

In [None]:
pd.DataFrame(dw_collection.find({'name': 'Patrick'}))

To update every document that matches the query, use the `update_many()` method.

The following update simply tags some documents with the arbitrary `tagged_you` key.

In [None]:
r = dw_collection.update_many({'name': 'Peter'}, {'$set': {'tagged_you': True}})
(r.matched_count, r.modified_count)

If we preview all the documents, you will see the additional tags on particular records:

In [None]:
list(dw_collection.find())

You can see that the two documents which matched the selection criterion were updated with the new key and value.

In the dataframe representation, the untagged records have a null indicator (`NaN`), as do the records where no surname is declared.

In [None]:
pd.DataFrame(dw_collection.find({}))

Note that this *view* contains data that is *not* explicitly in the original database. There is no `"tagged_you"` field in the original document associated with *Patrick Troughton*.

In other cases, there may be `NaN` value items that do form part of the database record, but we will be unable to distinguish those values from the artefacts added to our tabular representation just by looking at the dataframe.
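
Working with the documents themselves, rather than the dataframe, lets us tell these cases apart. The sketch below uses made-up documents (Tom's explicit `None` value is hypothetical) to contrast "the key is present" with "the key has a value". On the database side, MongoDB's `$exists` operator expresses the first of these tests.

```python
# Made-up documents: Tom's explicit None value is hypothetical, for illustration
docs = [
    {'name': 'Patrick', 'surname': 'Troughton'},
    {'name': 'Peter', 'tagged_you': True},
    {'name': 'Tom', 'tagged_you': None},
]

# Documents in which the key is present at all, whatever its value
names_with_field = [d['name'] for d in docs if 'tagged_you' in d]

# Documents in which the key is present and holds a non-null value
names_with_value = [d['name'] for d in docs if d.get('tagged_you') is not None]

# A selection that could be passed to find() to match only documents
# that actually contain the key
exists_filter = {'tagged_you': {'$exists': True}}
```

In a dataframe built from these three documents, Patrick and Tom would both typically show a missing value for `tagged_you`, even though only Patrick's document lacks the key.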

#### Updating Many Documents With Distinct Values

Using the `.update_many()` approach is limited in that it will only apply the same change to each matched document.

If we want to update each document with a different value, for example by adding an "age" field to show how old in years they would be today, calculated (roughly!) from each person's birthyear, we have to specify each document in turn in the update.

One efficient way of accessing each document uniquely is to query all the records to obtain their indexed `_id` values, and then use the document's `_id` to reference each document, uniquely, in turn.

We could retrieve *all* the fields of every document, using `dw_collection.find()`, or be more focused in what we return: requesting all the records (that is, selecting everything and anything: `{}`) whilst at the same time limiting the projection to just the fields we need: `['_id','birthyear']`.

In [None]:
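from datetime import datetime

# Look up the current year once, rather than on every iteration
current_year = datetime.now().year

for p in dw_collection.find({}, ['_id', 'birthyear']):
    # Use the unique _id to target exactly one document per update
    dw_collection.update_one({'_id': p['_id']},
                             {'$set': {'age': current_year - p['birthyear']}})

list(dw_collection.find())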

And viewing as a DataFrame:

In [None]:
pd.DataFrame(dw_collection.find())

### Activity 3

Classify the people into two groups. Those born in 1945 or earlier should be labelled as `'age': 'older'`, while the others should be labelled as `'age': 'younger'`.

Store the results in a new DataFrame.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To update the records uniquely, we need to iterate over all the records, then make a decision as to how to update each one.

To display the results as a *pandas* DataFrame, we pass the MongoDB results cursor to the DataFrame constructor:

In [None]:
for p in dw_collection.find():
    if p['birthyear'] <= 1945:
        dw_collection.update_one({'_id': p['_id']}, {'$set': {'age': 'older'}})
    else:
        dw_collection.update_one({'_id': p['_id']}, {'$set': {'age': 'younger'}})

pd.DataFrame(dw_collection.find())
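
Because every document in each group receives the same label, an alternative is to let the database do the work with two `update_many()` calls, one per group, using the comparison operators introduced earlier. This is a sketch rather than the module's own solution; `label_age_groups` is a hypothetical helper, and `collection` is assumed to be a PyMongo `Collection` such as `dw_collection`.

```python
CUTOFF_YEAR = 1945

# Selections for the two groups, using the comparison operators
older_filter = {'birthyear': {'$lte': CUTOFF_YEAR}}
younger_filter = {'birthyear': {'$gt': CUTOFF_YEAR}}


def label_age_groups(collection):
    """Label each group with a single server-side update per group.

    Returns the number of documents modified in each group.
    """
    older = collection.update_many(older_filter, {'$set': {'age': 'older'}})
    younger = collection.update_many(younger_filter, {'$set': {'age': 'younger'}})
    return older.modified_count, younger.modified_count
```

This avoids downloading each document and sending a separate update for it, which matters more as the collection grows.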

#### End of Activity 3

-------------------------------------------

## Delete: Removing items from the database

### Removing Components of Stored Documents

To delete part of a record, use the `$unset` modifier ([docs](https://docs.mongodb.com/manual/reference/operator/update/unset/)):

```
client.DBNAME.COLLECTIONNAME.update_one(SELECT, {'$unset': {KEY:VALUE}})
client.DBNAME.COLLECTIONNAME.update_many(SELECT, {'$unset': {KEY:VALUE}})
```

For example, in the following case, we can remove the arbitrary `tagged_you` elements:

In [None]:
r=dw_collection.update_many({'name': 'Peter'}, {'$unset': {'tagged_you': ''}})
(r.matched_count, r.modified_count)

In [None]:
pd.DataFrame(dw_collection.find())

Note that it does not matter *what* `VALUE` we pass in the `dict` to the `$unset` operator (we can pass in any value, it is simply ignored), but getting the correct `KEY` value(s) is critical.  
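
Since only the keys matter, the update specification can be generated from a list of key names. The helper `unset_spec` below is hypothetical; it builds the `dict` that would be passed as the second argument to `update_one()` or `update_many()`.

```python
def unset_spec(*keys):
    """Build an update specification that removes the named keys."""
    # The empty string is just a placeholder: $unset ignores the value
    return {'$unset': {key: '' for key in keys}}
```

For example, `unset_spec('tagged_you')` gives the same specification as the one used in the cell above.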

To leave things as we started, we will drop the collection:

In [None]:
mongo_db.drop_collection('doctor_who_collection')

If you are working on a local VCE, you can also drop the database you created (if you are working on the remote VCE, you do not have permission to drop your database):

In [None]:
# Will not work on the remote VCE
mongo_client.drop_database(DB_NAME)

## What Next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.2 Working With Embedded Documents`.