# Basic CRUD

This Notebook will take you through a few basic operations with a dummy database, just to see how the basic CRUD (Create, Read, Update, Delete) operations work.

We're using the [PyMongo](http://api.mongodb.org/python/current/) module to allow Python to connect to MongoDB. The Notebooks in the module will describe most of the features of PyMongo you need, but you should refer to the [API documentation](http://api.mongodb.org/python/current/api/index.html) as necessary to understand the detail and nuance of PyMongo. PyMongo is also a fairly thin wrapper on MongoDB, so you may need to refer to the [MongoDB reference](http://docs.mongodb.org/manual/reference/) for some of the details and *MongoDB: The Definitive Guide* for context and background.

This Notebook only covers the basics of CRUD operations. You'll use more sophisticated queries in Parts 15 and 16.

In [1]:
# Import the required libraries

import pymongo
from datetime import datetime

import pandas as pd

In [2]:
# Open a connection to the Mongo server
client = pymongo.MongoClient('mongodb://localhost:27351/')

In [3]:
try:
    client.drop_database(dw_db)
except NameError:
    print("DB doesn't exist yet.")

DB doesn't exist yet.


In [4]:
# Create a doctorwho database and a test_collection within it.
dw_db = client.doctorwho
tc = dw_db.test_collection

Note that database and collection creation in MongoDB is *lazy*: the database and collection aren't actually created in the DBMS until the first document is written.

## Data structures and conversion

PyMongo automatically handles most of the translation between Python data structures and the JSON structures that Mongo uses. This table summarises the main equivalences.

| Document DB term | JSON structure | Python structure |
|------------------|----------------|------------------|
| Document or sub-document | Object | dict  |
| List | Array | list |
| Key | String | string |
| String | String | string |
| Number | Number | int or float, depending |
| Date | Date | datetime.datetime object |
| Object IDs | BSON ObjectId | BSON ObjectId |

MongoDB uses BSON, a binary version of JSON, internally. You can generally ignore this, except when you want to create new ObjectIds for documents.
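
As a rough illustration of the table above, the following sketch builds a plain Python document that uses each of the listed types. The field names and the `keys_are_strings` helper are made up for illustration, and ObjectIds are left out because Mongo normally generates them on insertion.

```python
from datetime import datetime

# A hypothetical document: every key is a str, and the values use the
# Python types from the table above.
sample_doc = {
    'title': 'An Unearthly Child',                          # String -> str
    'parts': 4,                                             # Number -> int
    'viewers': 4.4,                                         # Number -> float
    'first_aired': datetime(1963, 11, 26),                  # Date -> datetime.datetime
    'companions': ['Susan', 'Barbara', 'Ian'],              # List -> list
    'doctor': {'number': 1, 'actor': 'William Hartnell'},   # Sub-document -> dict
}


def keys_are_strings(value):
    """Return True if every dict key, including those nested in dicts and lists, is a str."""
    if isinstance(value, dict):
        return all(isinstance(k, str) and keys_are_strings(v)
                   for k, v in value.items())
    if isinstance(value, list):
        return all(keys_are_strings(item) for item in value)
    return True


print(keys_are_strings(sample_doc))
```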

## Create
Let's insert a few simple documents, just to get started.

Note that keys in a document have to be strings, but the values can be almost anything.

In [8]:
# Insert a single document
tc.insert_one({'name': 'William', 'birthyear': 1908})

# Insert a few (zip pairs up the items of several lists, yielding tuples)
for n, b, c in zip('Patrick Jon Tom Peter Colin Sylvester Paul Christopher David Matt Peter'.split(),
                   [1920, 1919, 1934, 1951, 1943, 1943, 1959, 1964, 1971, 1982, 1958],
                   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]):
    tc.insert_one({'name': n, 'birthyear': b, 'c': c})
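
If you'd rather send the documents as a batch, PyMongo's `insert_many()` accepts a list of dicts. This is a minimal sketch, assuming the same `tc` collection as above; the helper name `build_people` is made up for illustration, and the database call is left commented out so that the block runs without a server.

```python
def build_people(names, birthyears):
    """Pair each name with a birth year and number the documents from 1."""
    return [{'name': n, 'birthyear': b, 'c': c}
            for c, (n, b) in enumerate(zip(names, birthyears), start=1)]


people = build_people('Patrick Jon Tom'.split(), [1920, 1919, 1934])
print(people[0])

# With a live connection, the whole list could then be inserted in one call:
# tc.insert_many(people)
```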

## Read
`find_one()` will return a single (arbitrary) document. Note that Mongo automatically adds an `_id` field to each document. You can override this if you really want to, but we won't bother.

In [9]:
tc.find_one()

{'_id': ObjectId('5aa8e3190fd01f166dd2099d'),
 'birthyear': 1908,
 'name': 'William'}

The `pymongo` library does the type conversion for us:

In [14]:
type(tc.find_one())

dict

If we give a dict of some key-value pairs, `find_one()` will return a document that matches them.

This is **selecting**, choosing only the documents we're interested in.

In [15]:
tc.find_one({'name': 'Peter'})

{'_id': ObjectId('5aa8e3190fd01f166dd209a1'),
 'birthyear': 1951,
 'c': 4,
 'name': 'Peter'}

### Activity 1
Find a document for someone born in 1943.

The solution is in the [`14.1solutions`](14.1solutions.ipynb) Notebook.

In [13]:
tc.find_one({'birthyear': 1943})

{'_id': ObjectId('5aa8e3190fd01f166dd209a2'),
 'birthyear': 1943,
 'c': 5,
 'name': 'Colin'}

In [9]:
# Insert your solution here

`find()` will find all the documents that match the query, and returns a cursor that can be iterated over to retrieve the documents one by one. Again, the query acts to **select** the documents we want.

In [17]:
tc.find({'name': 'Peter'})

<pymongo.cursor.Cursor at 0x7f3959075c18>

This returns a PyMongo `cursor` object, which acts like an iterable in most cases. 

If we convert the `cursor` to a `list`, we get a list of `dict`s that are the found documents. (Just look at a few, to save waiting for Jupyter to display the result.)

In [16]:
list(tc.find({'name': 'Peter'}))

[{'_id': ObjectId('5aa8e3190fd01f166dd209a1'),
  'birthyear': 1951,
  'c': 4,
  'name': 'Peter'},
 {'_id': ObjectId('5aa8e3190fd01f166dd209a8'),
  'birthyear': 1958,
  'c': 11,
  'name': 'Peter'},
 {'_id': ObjectId('5aa8e3390fd01f166dd209ad'),
  'birthyear': 1951,
  'c': 4,
  'name': 'Peter'},
 {'_id': ObjectId('5aa8e3390fd01f166dd209b4'),
  'birthyear': 1958,
  'c': 11,
  'name': 'Peter'}]

We can iterate directly over the cursor, but this is a one-pass-only process. The cursor remembers where it is in the set of results and carries on from there. For instance, if we ask for some items and print them:

In [18]:
cursor = tc.find({'name': 'Peter'})
for p in cursor:
    print(p)

{'birthyear': 1951, 'name': 'Peter', '_id': ObjectId('5aa8e3190fd01f166dd209a1'), 'c': 4}
{'birthyear': 1958, 'name': 'Peter', '_id': ObjectId('5aa8e3190fd01f166dd209a8'), 'c': 11}
{'birthyear': 1951, 'name': 'Peter', '_id': ObjectId('5aa8e3390fd01f166dd209ad'), 'c': 4}
{'birthyear': 1958, 'name': 'Peter', '_id': ObjectId('5aa8e3390fd01f166dd209b4'), 'c': 11}


…and then try to print them again, we get nothing: all the results have already been processed.

In [15]:
for p in cursor:
    print(p)

This manual handling of cursors isn't that useful, as we'll generally be consuming the entire set of returned documents all at once. But if the dataset is very large, you'll need to repeatedly consume and process part of the results, summarising as you go. Parts 15 and 16 will show you other ways of handling very large datasets in Mongo.
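
The 'consume part of the results, summarising as you go' idea can be sketched as follows. This is only an illustration: the function name `summarise_in_chunks` and the choice of averaging `birthyear` are assumptions, and the function works on any iterable of dicts, so a PyMongo cursor such as `tc.find()` could be passed in its place.

```python
from itertools import islice


def summarise_in_chunks(cursor, chunk_size=1000):
    """Consume an iterable of documents chunk by chunk, keeping only running totals."""
    cursor = iter(cursor)
    count = 0
    total = 0
    while True:
        # Pull at most chunk_size documents; earlier chunks are discarded
        chunk = list(islice(cursor, chunk_size))
        if not chunk:
            break
        count += len(chunk)
        total += sum(doc['birthyear'] for doc in chunk)
    mean = total / count if count else None
    return count, mean


docs = [{'birthyear': 1908}, {'birthyear': 1920}, {'birthyear': 1919}]
print(summarise_in_chunks(docs, chunk_size=2))
```

Only one chunk is held in memory at a time, which is the point of processing a large result set this way.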

The cursor can tell us how many documents will match the query.

In [16]:
tc.find({'name': 'Peter'}).count()

2

In [17]:
tc.find().count()

12
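
Newer PyMongo releases (3.7 onwards) provide `count_documents()` on the collection, and `Cursor.count()` was removed in PyMongo 4.0. If the cells above fail on your installation, a sketch like the following may help. The helper name `count_matching` is made up for illustration, and it only assumes a collection object such as `tc`.

```python
def count_matching(collection, query=None):
    """Count the documents matching `query` (all documents if no query is given)."""
    return collection.count_documents(query if query is not None else {})


# Usage, assuming the `tc` collection from earlier in this Notebook:
# count_matching(tc, {'name': 'Peter'})
# count_matching(tc)
```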

An optional second argument to `find()` specifies the key-value pairs to return. If you give a list of keys, `find()` will return just those plus the `_id`. 

This is how PyMongo does **projection**, returning only some parts of the found documents.

In [18]:
for p in tc.find({'name': 'Peter'}, ['birthyear']):
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d3057532'), 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057539'), 'birthyear': 1958}


You'll notice that Mongo returns the document `_id`. It always does that unless we specifically ask it not to. 

The 'list of keys' notation is shorthand for the full projection specification. Mongo expects a dict of keys to include or exclude: a key with the value `1` is included, and a key with the value `0` is excluded. A single projection must be either all inclusions or all exclusions (all 1s or all 0s); the one exception is the `_id` key, which can be excluded alongside inclusions. (See _MongoDB: The Definitive Guide_ for details.)

For instance, the previous query could be specified as:

In [19]:
for p in tc.find({'name': 'Peter'}, {'birthyear': 1}):
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d3057532'), 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057539'), 'birthyear': 1958}


or, to exclude the `_id`:

In [20]:
for p in tc.find({'name': 'Peter'}, {'birthyear': 1, '_id': 0}):
    print(p)

{'birthyear': 1951}
{'birthyear': 1958}
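
The relationship between the list shorthand and the full projection dict can be sketched in plain Python. The helper name `projection_from_keys` is made up for illustration; it simply builds the dict you would pass as the second argument to `find()`.

```python
def projection_from_keys(keys, include_id=True):
    """Expand the list-of-keys shorthand into an explicit projection dict."""
    projection = {key: 1 for key in keys}
    if not include_id:
        # _id is the one key that may be excluded alongside inclusions
        projection['_id'] = 0
    return projection


print(projection_from_keys(['birthyear']))
print(projection_from_keys(['birthyear'], include_id=False))
```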


### Activity 2
How many people were born in 1943? What are their names?


The solution is in the [`14.1solutions`](14.1solutions.ipynb) Notebook.

In [23]:
tc.find({'birthyear':1943},{'name':1, '_id':0}).count()

2

In [24]:
for p in tc.find({'birthyear':1943},{'name':1, '_id':0}):
    print(p)

{'name': 'Colin'}
{'name': 'Sylvester'}


In [None]:
# Insert your solution here

We can also limit the number of documents returned by `find()` with the `limit` keyword argument. This is very useful when we're exploring large datasets and developing queries.

The ordering of documents is arbitrary, so in the example below we can't dictate which three documents will be retrieved.

In [28]:
for p in tc.find({'birthyear': {'$lt': 1950}}, ['name', 'birthyear']):
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'birthyear': 1919}
{'_id': ObjectId('5a987afd0fd01f47d3057531'), 'name': 'Tom', 'birthyear': 1934}
{'_id': ObjectId('5a987afd0fd01f47d3057533'), 'name': 'Colin', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057534'), 'name': 'Sylvester', 'birthyear': 1943}


In [26]:
for p in tc.find({'birthyear': {'$lt': 1950}}, ['name', 'birthyear'], limit=3):
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'birthyear': 1919}


### Reading into DataFrames
We can create a _pandas_ DataFrame from a `list` of `dicts`, where each key in the `dict`s becomes a column in the DataFrame. This means we can convert the results of the `find()` directly into a _pandas_ DataFrame. (Note the `list` lurking in the middle.)

In [30]:
pd.DataFrame(list(tc.find({})))

|   | _id | birthyear | name |
|---|-----|-----------|------|
| 0 | 5a987afd0fd01f47d305752e | 1908 | William |
| 1 | 5a987afd0fd01f47d305752f | 1920 | Patrick |
| 2 | 5a987afd0fd01f47d3057530 | 1919 | Jon |
| 3 | 5a987afd0fd01f47d3057531 | 1934 | Tom |
| 4 | 5a987afd0fd01f47d3057532 | 1951 | Peter |
| 5 | 5a987afd0fd01f47d3057533 | 1943 | Colin |
| 6 | 5a987afd0fd01f47d3057534 | 1943 | Sylvester |
| 7 | 5a987afd0fd01f47d3057535 | 1959 | Paul |
| 8 | 5a987afd0fd01f47d3057536 | 1964 | Christopher |
| 9 | 5a987afd0fd01f47d3057537 | 1971 | David |


In [32]:
pd.DataFrame(list(tc.find({},{'_id':0})))

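|   | birthyear | name |
|---|-----------|------|
| 0 | 1908 | William |
| 1 | 1920 | Patrick |
| 2 | 1919 | Jon |
| 3 | 1934 | Tom |
| 4 | 1951 | Peter |
| 5 | 1943 | Colin |
| 6 | 1943 | Sylvester |
| 7 | 1959 | Paul |
| 8 | 1964 | Christopher |
| 9 | 1971 | David |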


## Update

Things have moved on a bit from the _MongoDB: The Definitive Guide_ book. 

There are now several commands for changing a document. `replace_one()` takes two arguments: a specification of the document to replace (in the same form as a `find()` query) and the document to replace it with. The entire document is replaced with the one given. If multiple documents match the query, an arbitrary one is replaced.

In most cases, you'll want `update_one()` or `update_many()`. These both take two arguments: a specification of the document(s) to update, and a description of the changes to make to those documents. The changes are specified by `$set`, `$push`, and similar operations.

All of these operations return an `UpdateResult`, which can be interrogated to find what effect the update had on the collection.
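
To make the effect of the update operators concrete, here is a rough model of what `$set` and `$unset` do to a single document, written against plain Python dicts. It only handles top-level keys and is an illustration, not how MongoDB implements updates; the function name `apply_update` is made up.

```python
def apply_update(doc, update):
    """Return a copy of `doc` with top-level $set and $unset changes applied."""
    result = dict(doc)
    for key, value in update.get('$set', {}).items():
        result[key] = value          # add the key, or overwrite its value
    for key in update.get('$unset', {}):
        result.pop(key, None)        # the value given to $unset is ignored
    return result


person = {'name': 'Patrick', 'birthyear': 1920}
with_surname = apply_update(person, {'$set': {'surname': 'Troughton'}})
print(with_surname)
print(apply_update(with_surname, {'$unset': {'surname': ''}}))
```

Unlike `replace_one()`, which swaps the whole document, these operators leave every key they don't mention untouched.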

Let's say we want to add a surname to one of the records:

In [33]:
patrick = tc.find_one({'name': 'Patrick'})
print(patrick)

{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'birthyear': 1920}


In [34]:
r = tc.update_one({'name': 'Patrick'}, {'$set': {'surname': 'Troughton'}})
r.matched_count, r.modified_count

(1, 1)

(One document found, one updated.)

If we now look for the updated document, we see the change:

In [35]:
for p in tc.find({'name': 'Patrick'}):
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'surname': 'Troughton', 'birthyear': 1920}


To update every document that matches the query, use the `update_many()` operation. (This update just tags some documents with the `multi_updated` key.)

In [36]:
r = tc.update_many({'name': 'Peter'}, {'$set': {'multi_updated': True}})
r.matched_count, r.modified_count

(2, 2)

In [37]:
for p in tc.find():
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'surname': 'Troughton', 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'birthyear': 1919}
{'_id': ObjectId('5a987afd0fd01f47d3057531'), 'name': 'Tom', 'birthyear': 1934}
{'multi_updated': True, '_id': ObjectId('5a987afd0fd01f47d3057532'), 'name': 'Peter', 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057533'), 'name': 'Colin', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057534'), 'name': 'Sylvester', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057535'), 'name': 'Paul', 'birthyear': 1959}
{'_id': ObjectId('5a987afd0fd01f47d3057536'), 'name': 'Christopher', 'birthyear': 1964}
{'_id': ObjectId('5a987afd0fd01f47d3057537'), 'name': 'David', 'birthyear': 1971}
{'_id': ObjectId('5a987afd0fd01f47d3057538'), 'name': 'Matt', 'birthyear': 1982}
{'multi_updated': True, '_id': ObjectId('5a

You can see that the two Peters were updated. 

We can remove the additional key with the `$unset` modifier (the value we're updating it to is ignored):

In [39]:
tc.update_many({'name': 'Peter'}, {'$unset': {'multi_updated': ''}})
for p in tc.find():
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'surname': 'Troughton', 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'birthyear': 1919}
{'_id': ObjectId('5a987afd0fd01f47d3057531'), 'name': 'Tom', 'birthyear': 1934}
{'_id': ObjectId('5a987afd0fd01f47d3057532'), 'name': 'Peter', 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057533'), 'name': 'Colin', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057534'), 'name': 'Sylvester', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057535'), 'name': 'Paul', 'birthyear': 1959}
{'_id': ObjectId('5a987afd0fd01f47d3057536'), 'name': 'Christopher', 'birthyear': 1964}
{'_id': ObjectId('5a987afd0fd01f47d3057537'), 'name': 'David', 'birthyear': 1971}
{'_id': ObjectId('5a987afd0fd01f47d3057538'), 'name': 'Matt', 'birthyear': 1982}
{'_id': ObjectId('5a987afd0fd01f47d3057539'), 'name': 'Peter', 'bi

The 'many' approach can only give the same value to each matching document. If we want to give a different value to each document, we have to specify each document in turn in the update. This is efficient if we use the document's `_id`, as that's indexed:

In [40]:
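# `datetime` is the class imported at the top of this Notebook;
# a bare `import datetime` here would shadow it.
this_year = datetime.now().year
for p in tc.find():
    tc.update_one({'_id': p['_id']}, {'$set': {'age': this_year - p['birthyear']}})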
for p in tc.find():
    print(p)

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'age': 110, 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'surname': 'Troughton', 'age': 98, 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'age': 99, 'birthyear': 1919}
{'_id': ObjectId('5a987afd0fd01f47d3057531'), 'name': 'Tom', 'age': 84, 'birthyear': 1934}
{'_id': ObjectId('5a987afd0fd01f47d3057532'), 'name': 'Peter', 'age': 67, 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057533'), 'name': 'Colin', 'age': 75, 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057534'), 'name': 'Sylvester', 'age': 75, 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057535'), 'name': 'Paul', 'age': 59, 'birthyear': 1959}
{'_id': ObjectId('5a987afd0fd01f47d3057536'), 'name': 'Christopher', 'age': 54, 'birthyear': 1964}
{'_id': ObjectId('5a987afd0fd01f47d3057537'), 'name': 'David', 'age': 47, 'birthyear': 1971}
{'_id': ObjectId('5a987afd0fd01f47d3

### Activity 3
Classify the people into two groups. Those born in 1945 or earlier should be labelled as `'age': 'old'`, while the others should be labelled as `'age': 'young'`.

Store the results in a new DataFrame.

The solution is in the [`14.1solutions`](14.1solutions.ipynb) Notebook.

In [38]:
# Insert your solution here

In [45]:
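# '$lte' includes 1945 itself, matching "1945 or earlier"
tc.update_many({'birthyear': {'$lte': 1945}}, {'$set': {'age': 'old'}})
tc.update_many({'birthyear': {'$gt': 1945}}, {'$set': {'age': 'young'}})
for p in tc.find():
    print(p)

# Store the results in a new DataFrame, as the activity asks
age_df = pd.DataFrame(list(tc.find({}, {'_id': 0})))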

{'_id': ObjectId('5a987afd0fd01f47d305752e'), 'name': 'William', 'age': 'old', 'birthyear': 1908}
{'_id': ObjectId('5a987afd0fd01f47d305752f'), 'name': 'Patrick', 'surname': 'Troughton', 'age': 'old', 'birthyear': 1920}
{'_id': ObjectId('5a987afd0fd01f47d3057530'), 'name': 'Jon', 'age': 'old', 'birthyear': 1919}
{'_id': ObjectId('5a987afd0fd01f47d3057531'), 'name': 'Tom', 'age': 'old', 'birthyear': 1934}
{'_id': ObjectId('5a987afd0fd01f47d3057532'), 'name': 'Peter', 'age': 'young', 'birthyear': 1951}
{'_id': ObjectId('5a987afd0fd01f47d3057533'), 'name': 'Colin', 'age': 'old', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057534'), 'name': 'Sylvester', 'age': 'old', 'birthyear': 1943}
{'_id': ObjectId('5a987afd0fd01f47d3057535'), 'name': 'Paul', 'age': 'young', 'birthyear': 1959}
{'_id': ObjectId('5a987afd0fd01f47d3057536'), 'name': 'Christopher', 'age': 'young', 'birthyear': 1964}
{'_id': ObjectId('5a987afd0fd01f47d3057537'), 'name': 'David', 'age': 'young', 'birthyear': 1971}

## Embedded documents

Values in documents can themselves be documents. For instance, we can encapsulate each person's name in a sub-document.

In [21]:
tc.drop()
# Insert a few
for f, s, b in zip('William Patrick Jon Tom Peter Colin Sylvester Paul Christopher David Matt Peter'.split(),
                   'Hartnell Troughton Pertwee Baker Davison Baker McCoy McGann Eccleston Tennant Smith Capaldi'.split(),
                [1908, 1920, 1919, 1934, 1951, 1943, 1943, 1959, 1964, 1971, 1982, 1958]):
    tc.insert_one({'name': {'forename': f, 'surname': s}, 'birthyear': b})
for p in tc.find():
    print(p)

{'birthyear': 1908, 'name': {'forename': 'William', 'surname': 'Hartnell'}, '_id': ObjectId('5aa8e54e0fd01f166dd209b5')}
{'birthyear': 1920, 'name': {'forename': 'Patrick', 'surname': 'Troughton'}, '_id': ObjectId('5aa8e54e0fd01f166dd209b6')}
{'birthyear': 1919, 'name': {'forename': 'Jon', 'surname': 'Pertwee'}, '_id': ObjectId('5aa8e54e0fd01f166dd209b7')}
{'birthyear': 1934, 'name': {'forename': 'Tom', 'surname': 'Baker'}, '_id': ObjectId('5aa8e54e0fd01f166dd209b8')}
{'birthyear': 1951, 'name': {'forename': 'Peter', 'surname': 'Davison'}, '_id': ObjectId('5aa8e54e0fd01f166dd209b9')}
{'birthyear': 1943, 'name': {'forename': 'Colin', 'surname': 'Baker'}, '_id': ObjectId('5aa8e54e0fd01f166dd209ba')}
{'birthyear': 1943, 'name': {'forename': 'Sylvester', 'surname': 'McCoy'}, '_id': ObjectId('5aa8e54e0fd01f166dd209bb')}
{'birthyear': 1959, 'name': {'forename': 'Paul', 'surname': 'McGann'}, '_id': ObjectId('5aa8e54e0fd01f166dd209bc')}
{'birthyear': 1964, 'name': {'forename': 'Christopher', '

We can also include a list of notable stories for each person. Note the use of the dot notation to identify keys in a sub-document.

In [None]:
r = tc.update_one({'name.forename': 'William', 'name.surname': 'Hartnell'},
        {'$set': {'episodes': ['An Unearthly Child', 'The Daleks', 'The Tenth Planet']}})
r.matched_count, r.modified_count

In [None]:
tc.find_one({'name.forename': 'William'})
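
The dot notation can be pictured as a path through nested dicts. The following plain-Python sketch follows such a path; the helper name `get_by_path` is made up for illustration and is not part of PyMongo.

```python
def get_by_path(doc, dotted_key):
    """Follow a dotted key such as 'name.forename' through nested dicts."""
    value = doc
    for part in dotted_key.split('.'):
        if not isinstance(value, dict) or part not in value:
            return None              # the path doesn't exist in this document
        value = value[part]
    return value


doc = {'name': {'forename': 'William', 'surname': 'Hartnell'}, 'birthyear': 1908}
print(get_by_path(doc, 'name.forename'))
print(get_by_path(doc, 'name.middle'))
```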

There's lots more information on this in *MongoDB: The Definitive Guide*, the [MongoDB documentation](http://docs.mongodb.org/manual/reference/), and the [PyMongo documentation](http://api.mongodb.org/python/current/api/index.html).

# Importing data
Now we've seen the basics of how Mongo works, let's import and process a larger dataset.

The rest of this Notebook shows you how to import data from CSV files into MongoDB. In this example, we're using the data from the [Ultimate Doctor Who](http://www.ultimatedoctorwho.com/) site, though with some modifications to remove duplicate column names in the file.

First, let's take a look at the data in the CSV file.

In [51]:
!head data/Ultimate_Doctor_Who_resave.csv

Story ID,Year,Season,Title,No. of parts,Pt. 1 air date,Pt. 1 viewers (in millons),Pt. 2 air date,Pt.2 viewers,Pt. 3 air date,Pt.3 viewers,Pt. 4 air date,Pt.4 viewers,Pt.5 air date,Pt. 5 viewers,Pt.6 air date,Pt.6 viewers,Pt. 7 air date,Pt.7 viewers,pt. 8 air date,pt. 8 viewers,pt. 9 air date,pt. 9 viewers,pt. 10 air date,pt. 10 viewers,pt. 11 air date,pt. 11 viewers,pt. 12 air date,pt. 12 viewers,Type of Broadcast,Doctor Number,Doctor,Guest Doctor(s),Companion 1,Companion 2,Companion 3,Companion 4,Companion 5,Companion 6,Companion 7,Companion 8,Appearance of UNIT,Recurring Villains,Firsts
1,1963,1,An Unearthly Child,4,11/26/63,4.4,11/30/63,5.9,12/7/63,6.9,12/14/63,6.4,,,,,,,,,,,,,,,,,Serial,1,William Hartnell,,Susan Foreman,Barbara Wright,Ian Chesterton,,,,,,,,
2,1964,1,The Daleks,7,12/21/63,6.9,12/28/13,6.4,1/4/64,8.9,1/11/64,9.9,1/18/64,9.9,1/25/64,10.4,2/1/64,10.4,,,,,,,,,,,Serial,1,William Hartnell,,Susan Foreman,Barbara Wright,Ian Chesterton,,,,,,,Daleks,
3,1964,1,The Edge of D

The command to import files into Mongo is `mongoimport`. It imports a file into a specified collection in the specified database. It takes a number of parameters, but these are the most useful to you:

* `--drop` drops the collection if it already exists
* `--db` and `--collection` specify where the imported data should go
* `--type` gives the format of the file (`csv` here)
* `--headerline` indicates that the first line in the file contains the column names, which will be used as keys for the created documents
* `--ignoreBlanks` means that keys with empty values will not be created in the imported documents
* `--file` tells `mongoimport` where the data resides.

In [54]:
!/usr/bin/mongoimport --port 27351 --drop --db doctorwho --collection episodes \
    --type csv --headerline --ignoreBlanks \
    --file data/Ultimate_Doctor_Who_resave.csv

2018-03-01T22:41:58.493+0000	connected to: localhost:27351
2018-03-01T22:41:58.493+0000	dropping: doctorwho.episodes
2018-03-01T22:41:58.511+0000	imported 244 documents


Note: the cell above calls `/usr/bin/mongoimport` because the path in the original Notebook pointed at the wrong location on this machine.

In [55]:
# Open the imported database and collection.
episodes = dw_db.episodes

In [56]:
episodes.find().count()

244

In [57]:
episodes.find_one()

{'Companion 1': 'Susan Foreman',
 'Companion 2': 'Barbara Wright',
 'Companion 3': 'Ian Chesterton',
 'Doctor': 'William Hartnell',
 'Doctor Number': 1,
 'No': {' of parts': 4},
 'Pt': {' 1 air date': '11/26/63',
  ' 1 viewers (in millons)': 4.4,
  ' 2 air date': '11/30/63',
  ' 3 air date': '12/7/63',
  ' 4 air date': '12/14/63',
  '2 viewers': 5.9,
  '3 viewers': 6.9,
  '4 viewers': 6.4},
 'Season': 1,
 'Story ID': 1,
 'Title': 'An Unearthly Child',
 'Type of Broadcast': 'Serial',
 'Year': 1963,
 '_id': ObjectId('5a9881b6126840fb6995808c')}

Note that `mongoimport` treats dots in the column names as names of keys within sub-documents, so the column name 'No. of parts' becomes a sub-document within a 'No' key.
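
The nesting seen above can be mimicked in plain Python, which makes it easier to predict what a given column name will turn into. This sketch only reproduces the observed behaviour; it is not `mongoimport`'s own code, and the function name `nest_dotted_keys` is made up.

```python
def nest_dotted_keys(flat):
    """Split each key on '.' and build nested dicts, as seen in the imported documents."""
    nested = {}
    for key, value in flat.items():
        parts = key.split('.')
        target = nested
        for part in parts[:-1]:
            target = target.setdefault(part, {})   # descend, creating levels as needed
        target[parts[-1]] = value
    return nested


row = {'No. of parts': 4, 'Pt. 1 air date': '11/26/63',
       'Pt.2 viewers': 5.9, 'Title': 'An Unearthly Child'}
print(nest_dotted_keys(row))
```

Note that the text after the dot is kept exactly as it is, so a column called 'Pt. 1 air date' produces a key with a leading space.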

## Cleaning
As with most imported data, this dataset needs some cleaning to massage it into shape. For instance, we might want to collect the various companions into one list in the document, while deleting the individual fields.

`$push` adds an item to a list (and creates it if it doesn't exist). `$unset` removes a key from a document. This next cell will remove the separate `Companion` key-values and push them into a list of `Companions`.

In [58]:
for e in episodes.find():
    for key in list(e.keys()):
        if key.startswith('Companion '):
            episodes.update_one({'_id': e['_id']}, {'$push': {'Companions': e[key]},
                                                '$unset': {key: 1}})

In [59]:
episodes.find_one()

{'Companions': ['Ian Chesterton', 'Susan Foreman', 'Barbara Wright'],
 'Doctor': 'William Hartnell',
 'Doctor Number': 1,
 'No': {' of parts': 4},
 'Pt': {' 1 air date': '11/26/63',
  ' 1 viewers (in millons)': 4.4,
  ' 2 air date': '11/30/63',
  ' 3 air date': '12/7/63',
  ' 4 air date': '12/14/63',
  '2 viewers': 5.9,
  '3 viewers': 6.9,
  '4 viewers': 6.4},
 'Season': 1,
 'Story ID': 1,
 'Title': 'An Unearthly Child',
 'Type of Broadcast': 'Serial',
 'Year': 1963,
 '_id': ObjectId('5a9881b6126840fb6995808c')}
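
An alternative is to build the whole list in Python and send a single update per document. The sketch below only constructs the update; the helper name `companions_update` is made up for illustration. It sorts the keys so that the companions keep their column order, and it uses `$set`, so it overwrites any existing `Companions` list rather than appending to it.

```python
def companions_update(doc):
    """Build one update that gathers the 'Companion N' fields into a 'Companions' list."""
    keys = sorted(k for k in doc if k.startswith('Companion '))
    if not keys:
        return None                  # nothing to tidy in this document
    return {'$set': {'Companions': [doc[k] for k in keys]},
            '$unset': {k: '' for k in keys}}


episode = {'Title': 'An Unearthly Child',
           'Companion 2': 'Barbara Wright',
           'Companion 1': 'Susan Foreman',
           'Companion 3': 'Ian Chesterton'}
print(companions_update(episode))

# With a live connection, assuming the `episodes` collection from above:
# for e in episodes.find():
#     update = companions_update(e)
#     if update:
#         episodes.update_one({'_id': e['_id']}, update)
```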

### Activity 4
* Note that this activity is optional, and is more of a programming exercise than really teaching you much about MongoDB. You'll miss nothing by just looking at the solution.

Create a list of sub-documents, one for each part. Each part sub-document should contain the part number, air date, and number of viewers. For example:

`'Parts': [{'Number': 1, 'Air date': datetime.datetime(1963, 11, 26, 0, 0), 'Viewers': 4.4},
    {'Number': 2, 'Air date': datetime.datetime(1963, 11, 30, 0, 0), 'Viewers': 5.9},
    {'Number': 3, 'Air date': datetime.datetime(1963, 12, 7, 0, 0), 'Viewers': 6.9},
    {'Number': 4, 'Air date': datetime.datetime(1963, 12, 14, 0, 0), 'Viewers': 6.4}]`

Note that parts are sometimes identified by `Pt`, sometimes by `pt`, and that everything's case sensitive.

Note that two-digit years follow the POSIX convention, so '11/26/63' is read as 26 November 2063 and you'll need to fiddle with the years. The magic incantation is:

    d = datetime.strptime('11/26/63', '%m/%d/%y')
    d = d.replace(year=(d.year - 100))

(hoping that the year isn't 2400 or 2000).
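
Subtracting 100 years from every date would break any two-digit year that is already parsed correctly. One possible refinement, offered as a sketch rather than the solution's method, is to shift only the dates that land in the future. The function name `parse_air_date` and its `latest_year` parameter are made up for illustration.

```python
from datetime import datetime


def parse_air_date(text, latest_year=None):
    """Parse an 'm/d/yy' date, moving years that land in the future back a century."""
    if latest_year is None:
        latest_year = datetime.now().year
    d = datetime.strptime(text, '%m/%d/%y')
    if d.year > latest_year:
        # e.g. '63' is parsed as 2063, which can't be a past air date
        d = d.replace(year=d.year - 100)
    return d


print(parse_air_date('11/26/63', latest_year=2018))   # 1963-11-26 00:00:00
print(parse_air_date('3/26/05', latest_year=2018))    # 2005-03-26 00:00:00
```

The same caveat applies as for the incantation above: `replace()` raises an error if the shifted year has no 29 February.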

Finally, note that if you update documents while iterating over many of them, Mongo may decide to return the updated document to you later in the same iteration (i.e. you may end up processing the same document more than once). Either check whether a document has been updated before you start processing it again, or use a _snapshot query_ which doesn't exhibit this behaviour. The format for snapshot queries is to include `modifiers={"$snapshot": True}` as a keyword parameter to `find()`.
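
If the `modifiers` option isn't available in your PyMongo version, another way to avoid processing a document twice is to collect the `_id`s before making any changes. This is a sketch under that assumption; the function name `update_each_once` and its `make_update` callback are made up for illustration, and it only relies on `find()`, `find_one()` and `update_one()`.

```python
def update_each_once(collection, make_update, query=None):
    """Apply the update returned by `make_update(doc)` to each matching document once.

    The _ids are gathered into a list before any update is made, so a
    document that is rewritten during the loop cannot turn up again.
    """
    ids = [d['_id'] for d in collection.find(query or {}, {'_id': 1})]
    for _id in ids:
        doc = collection.find_one({'_id': _id})
        if doc is None:
            continue                 # deleted since the _ids were gathered
        update = make_update(doc)
        if update:
            collection.update_one({'_id': _id}, update)
    return len(ids)


# Usage, assuming the `episodes` collection and a function that builds
# the update for one document:
# update_each_once(episodes, lambda e: {'$set': {'checked': True}})
```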

The solution is in the [`14.1solutions`](14.1solutions.ipynb) Notebook.

In [None]:
# Try your code here

## Clean up
Drop this test database.

In [60]:
client.drop_database(dw_db)

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.2 Introduction to accidents`.