# Working With Embedded Documents

Values in documents can themselves be documents. For instance, we can encapsulate each person's name in a nested, or embedded, sub-document.

In this notebook, we will explore how MongoDB can be used to store documents inside other documents, as well as how we might view the results in a notebook using *pandas* dataframes.

In [None]:
# Standard imports

import pandas as pd

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell below to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database server is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
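
As a rough illustration of how these parts fit together, the following sketch assembles a connection string from its components. The helper `build_connection_string` and the example credentials are hypothetical and are not part of the module setup; using `urllib.parse.quote_plus` is one common way of character-escaping a user name and password.

```python
from urllib.parse import quote_plus


def build_connection_string(user=None, password=None,
                            host='localhost:27017', authsource=None):
    """Assemble a MongoDB connection string (hypothetical helper, for illustration only)."""
    credentials = ''
    if user is not None and password is not None:
        # Escape characters such as '@' or '/' that have a special meaning in a URI
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    # The authentication database is only named when one is supplied
    options = f"/?authsource={authsource}" if authsource else "/"
    return f"mongodb://{credentials}{host}{options}"


# Local server: no credentials required
print(build_connection_string())
# mongodb://localhost:27017/

# Remote server: made-up credentials, escaped before use
print(build_connection_string('example_user', 'p@ss/word', authsource='user-data'))
# mongodb://example_user:p%40ss%2Fword@localhost:27017/?authsource=user-data
```

The sketch deliberately returns a new string rather than overwriting `MONGO_CONNECTION_STRING`, so running it does not affect the connection used in the rest of the notebook.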

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## Documents within documents

In this notebook, we will look at how to handle embedded documents. So far we have seen some simple JSON-style documents; now let's use the format a little more seriously to create some embedded documents.

As before, we'll set up a collection of the actors who have played Dr Who, but this time, rather than using a string to store their name, we will use a new document. This will also be represented using a Python `dict`.

Again, we'll get a reference to the database from the client:

In [None]:
mongo_db=mongo_client[DB_NAME]

and as before, create a collection named `doctor_who_collection`, dropping any existing versions which might still be in the database:

In [None]:
mongo_db.drop_collection('doctor_who_collection')

# Create a new version of the collection:
dw_collection=mongo_db['doctor_who_collection']

We can use `insert_many` to add the new documents to the database:

In [None]:
dw_collection.insert_many([{'name':{'forename': 'William', 'surname': 'Hartnell'}, 'birthyear': 1908},
     {'name':{'forename': 'Patrick', 'surname': 'Troughton'}, 'birthyear': 1920},
     {'name':{'forename': 'Jon', 'surname': 'Pertwee'}, 'birthyear': 1919},
     {'name':{'forename': 'Tom', 'surname': 'Baker'}, 'birthyear': 1934},
     {'name':{'forename': 'Peter', 'surname': 'Davison'}, 'birthyear': 1951},
     {'name':{'forename': 'Colin', 'surname': 'Baker'}, 'birthyear': 1943},
     {'name':{'forename': 'Sylvester', 'surname': 'McCoy'}, 'birthyear': 1943},
     {'name':{'forename': 'Paul', 'surname': 'McGann'}, 'birthyear': 1959},
     {'name':{'forename': 'Christopher', 'surname': 'Eccleston'}, 'birthyear': 1964},
     {'name':{'forename': 'David', 'surname': 'Tennant'}, 'birthyear': 1971},
     {'name':{'forename': 'Matt', 'surname': 'Smith'}, 'birthyear': 1982},
     {'name':{'forename': 'Peter', 'surname': 'Capaldi'}, 'birthyear': 1958},
     {'name':{'forename': 'Jodie', 'surname': 'Whittaker'}, 'birthyear': 1982}])

print(f"Collection contains {dw_collection.count_documents({})} documents")

Let's see how that changes a single record:

In [None]:
dw_collection.find_one()

We now have a document that contains a document. Indeed, every document in the collection contains a document under the `name` key.

Previously, we used the fact that `find()` returns a cursor that yields dictionaries to cast the output of `find()` into a DataFrame. However, if we naively create a DataFrame from the MongoDB results in this case, we see the subdocuments recorded as Python dictionaries inside a single column:

In [None]:
pd.DataFrame(dw_collection.find({}))

Helpfully, pandas gives us a function to handle these embedded documents more easily. By default, nested documents appear as Python dictionaries within columns that match the top-level items in the original set of results.

A more robust way to create a dataframe from the results cursor is to use the `pandas.json_normalize` function, which flattens each nested field into a column of its own:

In [None]:
dw_normalised_df=pd.json_normalize(list(dw_collection.find()))
dw_normalised_df

We can now use the dotted column names to select any columns in the dataframe that we want:

In [None]:
dw_normalised_df['name.surname']
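
The difference between the two approaches can also be seen without a database connection. The following sketch uses a small, hand-written list of dictionaries (`sample_docs`, an illustrative stand-in for the results of `find()`, so there is no `_id` field) and compares the column names produced by each approach. The `sep` parameter of `json_normalize` changes the separator used in the flattened column names.

```python
import pandas as pd

# Hand-written stand-in for a list of MongoDB result documents
sample_docs = [
    {'name': {'forename': 'William', 'surname': 'Hartnell'}, 'birthyear': 1908},
    {'name': {'forename': 'Patrick', 'surname': 'Troughton'}, 'birthyear': 1920},
]

# Naive approach: the whole subdocument ends up in a single 'name' column
naive_df = pd.DataFrame(sample_docs)
print(sorted(naive_df.columns))

# Normalised approach: one column per nested field, named by dotted path
flat_df = pd.json_normalize(sample_docs)
print(sorted(flat_df.columns))

# The separator can be changed if dotted column names are inconvenient
underscore_df = pd.json_normalize(sample_docs, sep='_')
print(sorted(underscore_df.columns))
```

An underscore separator can be handy if you later want to refer to columns using attribute-style access, which does not work with names that contain a full stop.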

## Querying subdocuments in MongoDB

Moving away from dataframes, if we want to make a query based on the contents of a subdocument, the selection term should match the whole subdocument. For example, the following query gets a match:

In [None]:
dw_collection.find_one({'name':{'forename': 'William', 'surname': 'Hartnell'}})

But if we try a simple, naive search against just *part* of that subdocument, specified as a `dict`, we don't get any matches:

In [None]:
dw_collection.find_one({'name':{'forename': 'William'}})

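That is, the match fails because we needed to match the whole of the subdocument `{'forename': 'William', 'surname': 'Hartnell'}`. A whole-subdocument match in MongoDB is an exact match, which means that the order of the fields matters as well as their values.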

However, we can use a dot notation to construct a path to an element in a subdocument. In this case, we *do* get the match:

In [None]:
dw_collection.find_one({'name.forename': 'William'})
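
To see why the two forms of query behave differently, it can help to model them in plain Python. The sketch below is a simplified illustration only, not how MongoDB is implemented: the hypothetical helper `resolve_path` follows a dotted path through nested dictionaries, whereas the whole-subdocument test is a straightforward comparison of dictionaries. MongoDB's real matching rules also cover arrays, which this model ignores.

```python
def resolve_path(document, dotted_path):
    """Follow a dotted path through nested dicts.

    Returns None if any step of the path is missing (so a missing field
    and a field explicitly set to None are not distinguished here).
    """
    current = document
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


doc = {'name': {'forename': 'William', 'surname': 'Hartnell'}, 'birthyear': 1908}

# Whole-subdocument comparison: a partial dict does not equal the full subdocument
print(doc['name'] == {'forename': 'William'})            # False

# Dotted path: only the named element of the subdocument is compared
print(resolve_path(doc, 'name.forename') == 'William')   # True
```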

Let's add another field to some of the records: a list of notable stories for each person. We use the dot notation to identify the record to update via keys deep inside the sub-document, and we add the stories as a list:

In [None]:
r = dw_collection.update_one({'name.forename': 'William', 'name.surname': 'Hartnell'},
        {'$set': {'episodes': ['An Unearthly Child', 'The Daleks', 'The Tenth Planet']}})
(r.matched_count, r.modified_count)

In [None]:
dw_collection.find_one({'name.forename': 'William'})

There's lots more information on this in *MongoDB: The Definitive Guide*, the [MongoDB documentation](http://docs.mongodb.org/manual/reference/), and the [PyMongo documentation](http://api.mongodb.org/python/current/api/index.html).

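Note that this use of dotted notation means that keys containing full stops are problematic in MongoDB, because a full stop in a key is read as a path separator in queries. Depending on the versions of PyMongo and of the MongoDB server that you are using, trying to insert a document with a key containing a full stop may raise an error: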

In [None]:
dw_collection.insert_one({'illegal.key':'test'})
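
One way to avoid the problem is to check a document for such keys before trying to insert it. The following is a minimal sketch in plain Python; the helper `find_dotted_keys` is hypothetical and is not part of PyMongo. It only descends into nested dictionaries, so subdocuments held inside lists are not checked.

```python
def find_dotted_keys(document, prefix=()):
    """Return the path (as a tuple of keys) to every key containing a full stop."""
    found = []
    for key, value in document.items():
        path = prefix + (key,)
        if '.' in key:
            found.append(path)
        # Recurse into embedded documents (nested dicts only, not lists)
        if isinstance(value, dict):
            found.extend(find_dotted_keys(value, path))
    return found


print(find_dotted_keys({'illegal.key': 'test'}))
# [('illegal.key',)]

print(find_dotted_keys({'name': {'forename': 'William', 'surname': 'Hartnell'}}))
# []
```

Returning the full path to each offending key, rather than just `True` or `False`, makes it easier to report exactly where a document needs to be fixed.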

## Clean up

As before, we will drop this test collection to leave the database as we found it at the start of the notebook:

In [None]:
mongo_db.drop_collection(dw_collection)

and if you are working on a local VCE, you can also drop the database you created (if you are working on the remote VCE, you do not have permission to drop your database):

In [None]:
# Will not work on the remote VCE
mongo_client.drop_database(DB_NAME)

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.3 Importing Data into MongoDB`.