# Importing data

Now that we've seen the basics of how Mongo works, let's import and process a larger dataset.

In this notebook, you will look at some of the issues in importing data from CSV files into MongoDB. We'll use the data from the [Ultimate Doctor Who](http://www.ultimatedoctorwho.com/) site, though with some modifications to remove duplicate column names in the file.

In [None]:
# Standard imports

import pandas as pd

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database and query its contents (depending on what usage rights you have been granted). For the database notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell in the following sections to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, these rights would not generally be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character-escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
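
The parts listed above can be put together as in the following minimal sketch. The helper `build_connection_string` and the example credentials are hypothetical, and the exact layout of the string used in the module environment is an assumption. The sketch only illustrates how the pieces fit, and how `urllib.parse.quote_plus` escapes characters such as `@` or `/` in a password so that they cannot be confused with the separators in the string.

```python
from urllib.parse import quote_plus


def build_connection_string(user=None, pwd=None, host='localhost',
                            port=27017, auth_source=None):
    '''Assemble a MongoDB connection string from its parts.
    (Hypothetical helper, for illustration only.)'''
    credentials = ''
    if user is not None and pwd is not None:
        # Escape reserved characters in the user name and password
        credentials = f'{quote_plus(user)}:{quote_plus(pwd)}@'
    # The authentication database is only named if one is supplied
    options = f'/?authsource={auth_source}' if auth_source else ''
    return f'mongodb://{credentials}{host}:{port}{options}'


# Local server: no user name or password
local_string = build_connection_string()

# Remote server: made-up credentials plus the authentication database
remote_string = build_connection_string('abc123', 'p@ss/word',
                                        auth_source='user-data')

print(local_string)
print(remote_string)
```

Note how the `@` and `/` in the made-up password appear as `%40` and `%2F` in the second string.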

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## Importing data from a CSV file

First, let's take a look at the data in the CSV file. The data is in the file `Ultimate_Doctor_Who_resave.csv` in the `data` directory.

In [None]:
dr_who_df=pd.read_csv('data/Ultimate_Doctor_Who_resave.csv')

dr_who_df

As before, we start by getting a reference to the database from the client:

In [None]:
mongo_db=mongo_client[DB_NAME]

And create a collection called `dr_who_collection`:

In [None]:
mongo_db.drop_collection('dr_who_collection')
dw_collection=mongo_db['dr_who_collection']

Now, what happens when we try to insert the data in the dataframe into the MongoDB collection?

In [None]:
dw_collection.insert_many(dr_who_df.to_dict(orient='records'))

So we see that the first problem is that several of the column names contain full stops, which, as we saw in notebook `14.2 Working With Embedded Documents`, is not permitted in MongoDB.

One solution to this might simply be to replace any full stops in the column headings with a safer character such as an underscore:

In [None]:
dr_who_safe_df=dr_who_df.rename(lambda s:s.replace('.', '_'), axis='columns')

dr_who_safe_df.head()

In many circumstances, this may be the most sensible approach, but in this case there are some better techniques we can use.

Looking at the column headings which contain a full stop, we can see that these are all parts of sequences (e.g. the different parts of a given series).
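
As a minimal sketch, the headings containing a full stop can be picked out with a list comprehension. The miniature `sample_df` here is a hypothetical stand-in with a handful of headings of the kind found in the file. On the real data you would apply the same comprehension to `dr_who_df.columns`.

```python
import pandas as pd

# Hypothetical, empty DataFrame with a few representative column headings
sample_df = pd.DataFrame(columns=['Story ID', 'Title', 'Pt. 1 air date',
                                  'Pt.2 viewers', 'Companion 1'])

# Keep only the headings which contain a full stop
dotted_columns = [c for c in sample_df.columns if '.' in c]

print(dotted_columns)
```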

So rather than deal with the full stops at this point, let's begin by inserting only that information which isn't part of a numbered list:

In [None]:
dr_who_df[['Story ID',
 'Year',
 'Season',
 'Title',
 'Type of Broadcast',
 'Doctor Number',
 'Doctor',
 'Guest Doctor(s)',
 'Appearance of UNIT',
 'Recurring Villains',
 'Firsts']]

As we saw in notebook `14.1 Basic CRUD`, we can insert this data into the MongoDB database by first converting it to a list of dictionaries with the `.to_dict(orient='records')` method:

In [None]:
dw_collection.insert_many(dr_who_df[['Story ID',
 'Year',
 'Season',
 'Title',
 'Type of Broadcast',
 'Doctor Number',
 'Doctor',
 'Guest Doctor(s)',
 'Appearance of UNIT',
 'Recurring Villains',
 'Firsts']].to_dict(orient='records'))

In [None]:
dw_collection.count_documents({})

In [None]:
dw_collection.find_one()

We notice at this point that some fields contain a NULL value (shown here as `nan`). We should remove these values: in a MongoDB database, a NULL should be represented simply by the lack of the relevant key, rather than by an explicit NULL value. In this case, we can make a comparison with the Python expression `float('nan')`:

In [None]:
pd.DataFrame(dw_collection.find({'Guest Doctor(s)': float('nan')}))

To remove the NULL values from the `Guest Doctor(s)` column, we use `$unset`, as described in notebook `14.1 Basic CRUD`:

In [None]:
dw_collection.update_many({'Guest Doctor(s)': float('nan')}, {'$unset': {'Guest Doctor(s)':''}})

If we now look at an early story, we find that the document no longer contains the  `Guest Doctor(s)` key. This is how NULL is represented in MongoDB:

In [None]:
dw_collection.find_one({'Story ID': 1.0})

To see whether a document actually contains a given key, use the condition `{'$exists': True}`. So to find a document which does contain a key and value for  `Guest Doctor(s)`, use:

In [None]:
dw_collection.find_one({'Guest Doctor(s)': {'$exists': True}})

And to find a document which does not contain the key, just use the condition `{'$exists': False}`:

In [None]:
dw_collection.find_one({'Guest Doctor(s)': {'$exists': False}})

### Activity 1

For the rest of the documents in `dw_collection`, remove any NULL values.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

As all documents were created with the same set of keys, we can take the keys of any arbitrary document to find the remaining fields from which NULL values need to be removed:

In [None]:
dw_collection.find_one({}).keys()

We can use this list to clear the NULL values from the rest of the collection:

In [None]:
for key in dw_collection.find_one({}).keys():
    dw_collection.update_many({key: float('nan')}, {'$unset': {key:''}})

If we now look at the first document again, we should see that the NULL entries have been removed:

In [None]:
dw_collection.find_one({'Story ID': 1.0})
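
An alternative, sketched here under stated assumptions, is to strip the NULL values from the records *before* they are inserted, so that the keys never reach the database. The helper `drop_null_fields` and the sample rows are hypothetical. The sketch also assumes that every value is a scalar, since `pd.notnull` returns an array rather than a single truth value when given a list.

```python
import pandas as pd


def drop_null_fields(records):
    '''Return copies of the records with any NULL (NaN or None)
    fields left out. Assumes all values are scalars.'''
    return [{key: value for key, value in record.items()
             if pd.notnull(value)}
            for record in records]


# Made-up sample rows: the first has no guest doctor
sample_df = pd.DataFrame({'Story ID': [1, 2],
                          'Title': ['Story A', 'Story B'],
                          'Guest Doctor(s)': [float('nan'), 'Somebody']})

clean_records = drop_null_fields(sample_df.to_dict(orient='records'))

print(clean_records)
```

The resulting list could then be passed to `insert_many` in place of the raw output of `.to_dict(orient='records')`.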

#### End of Activity 1

---------------------------------------------

### Arrays in MongoDB

Some of the fields in the database would be better represented as an ordered collection than by a JSON subdocument of key-value pairs. For this data, we can use a list, or an array.

If we look at the column names in the original CSV file, we see that there are a number of columns for the doctor's companions in any given story:

In [None]:
dr_who_df.keys()

The columns `Companion 1`, `Companion 2` and so on allow for up to 8 companions to be included for each story, but require several named columns in a CSV file, many of which will be mostly populated with NULL values. A more natural way of representing the companions in each story would be with an array.

To include the companions in each story in an array, we can write a function which takes a row of a DataFrame (which is a pandas Series) and returns a list of the companions. We can then use `'$set'` to add that list to the relevant document in the database:

In [None]:
def find_companions(row_ss):
    '''Takes a row from the dr_who_df dataframe
    and returns a list of all companions'''
    out=[]
    for c in row_ss.index:
        # Keep only the non-NULL values from the "Companion N" columns
        if c.startswith('Companion') and pd.notnull(row_ss[c]):
            out.append(row_ss[c])
    return out

For example, the first row of the dataframe is:

In [None]:
dr_who_df.iloc[0]

and the list of companions is then:

In [None]:
find_companions(dr_who_df.iloc[0])

To add the companion data to the database collection, apply the function to all rows in the DataFrame, and use `'$set'` to add to the database:

In [None]:
for idx in dr_who_df.index:
    row=dr_who_df.loc[idx]
    dw_collection.update_one({'Story ID':row['Story ID']}, 
                             {'$set': {'Companions':find_companions(row)}})

In [None]:
dw_collection.find_one()

We can do a similar process with the information about each of the parts. Rather than have keys named `'Pt. 1 air date'`, `'Pt.3 viewers'` and the like, we can create an array of subdocuments for each story. Then each subdocument can contain the air date and viewer numbers for the particular episode. Note that this also means that all the column names containing full stops will have been handled.

A (rather verbose) function for extracting the relevant episode information might be as follows:

In [None]:
def get_part_info(row_ss):
    '''Takes a row from the dr_who_df dataframe
    and returns a list of the part information'''
    parts_ls=[]
    
    if not(pd.isnull(row_ss['Pt. 1 air date'])):
        parts_ls.append({'number':1,
                         'air_date':row_ss['Pt. 1 air date'],
                         'viewers':row_ss['Pt. 1 viewers (in millons)']})
        
    if not(pd.isnull(row_ss['Pt. 2 air date'])):
        parts_ls.append({'number':2, 
                         'air_date':row_ss['Pt. 2 air date'],
                         'viewers':row_ss['Pt.2 viewers']})
        
    if not(pd.isnull(row_ss['Pt. 3 air date'])):
        parts_ls.append({'number':3,
                         'air_date':row_ss['Pt. 3 air date'],
                         'viewers':row_ss['Pt.3 viewers']})
    
    if not(pd.isnull(row_ss['Pt. 4 air date'])):
        parts_ls.append({'number':4,
                         'air_date':row_ss['Pt. 4 air date'],
                         'viewers':row_ss['Pt.4 viewers']})
    
    if not(pd.isnull(row_ss['Pt.5 air date'])):
        parts_ls.append({'number':5,
                         'air_date':row_ss['Pt.5 air date'],
                         'viewers':row_ss['Pt. 5 viewers']})
    
    if not(pd.isnull(row_ss['Pt.6 air date'])):
        parts_ls.append({'number':6,
                         'air_date':row_ss['Pt.6 air date'],
                         'viewers':row_ss['Pt.6 viewers']})
    
    if not(pd.isnull(row_ss['Pt. 7 air date'])):
        parts_ls.append({'number':7,
                         'air_date':row_ss['Pt. 7 air date'],
                         'viewers':row_ss['Pt.7 viewers']})
    
    if not(pd.isnull(row_ss['pt. 8 air date'])):
        parts_ls.append({'number':8, 
                         'air_date':row_ss['pt. 8 air date'],
                         'viewers':row_ss['pt. 8 viewers']})
    
    if not(pd.isnull(row_ss['pt. 9 air date'])):
        parts_ls.append({'number':9,
                         'air_date':row_ss['pt. 9 air date'],
                         'viewers':row_ss['pt. 9 viewers']})
    
    if not(pd.isnull(row_ss['pt. 10 air date'])):
        parts_ls.append({'number':10,
                         'air_date':row_ss['pt. 10 air date'],
                         'viewers':row_ss['pt. 10 viewers']})
    
    if not(pd.isnull(row_ss['pt. 11 air date'])):
        parts_ls.append({'number':11,
                         'air_date':row_ss['pt. 11 air date'],
                         'viewers':row_ss['pt. 11 viewers']})
    
    if not(pd.isnull(row_ss['pt. 12 air date'])):
        parts_ls.append({'number':12,
                         'air_date':row_ss['pt. 12 air date'],
                         'viewers':row_ss['pt. 12 viewers']})
    
    return parts_ls
        

**Note:** *If you are confident programming in Python, you might be thinking that there are far more efficient ways to work through the cases, for example by matching and iterating across the column names. This is certainly true, and you should feel free to adapt the code according to your coding level and style.*
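
One possible version of that idea is sketched here. The function name `get_part_info_compact` is hypothetical, and the regular expression assumes that every relevant heading starts with `Pt.` or `pt.`, followed by optional spaces, the part number, and then either `air date` or `viewers`, as in the headings used by `get_part_info`. The sample row uses made-up values.

```python
import re

import pandas as pd

# Matches headings such as 'Pt. 1 air date', 'Pt.2 viewers', 'pt. 10 viewers'
PART_COLUMN = re.compile(r'^pt\.\s*(\d+)\s+(air date|viewers)', re.IGNORECASE)


def get_part_info_compact(row_ss):
    '''Takes a row (a pandas Series) and returns a list of the
    part information, found by matching the column names.'''
    parts = {}
    for col in row_ss.index:
        match = PART_COLUMN.match(str(col))
        if match is None:
            continue
        number = int(match.group(1))
        key = 'air_date' if match.group(2).lower() == 'air date' else 'viewers'
        # Create the subdocument for this part the first time it is seen
        parts.setdefault(number, {'number': number})[key] = row_ss[col]
    # Keep only the parts which have an air date, in part order
    return [parts[n] for n in sorted(parts)
            if pd.notnull(parts[n].get('air_date'))]


# Made-up sample row with two populated parts and one empty part
sample_row = pd.Series({'Story ID': 1,
                        'Pt. 1 air date': '1/1/70',
                        'Pt. 1 viewers (in millons)': 4.4,
                        'Pt. 2 air date': '1/8/70',
                        'Pt.2 viewers': 5.9,
                        'Pt. 3 air date': float('nan'),
                        'Pt.3 viewers': float('nan')})

result = get_part_info_compact(sample_row)
print(result)
```

As in the verbose version, a part is included only if it has an air date, and its viewer figure is kept even if it is NULL.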

If we call the `get_part_info` function on the first row of the DataFrame, we receive a list of four parts, with the air date and viewer numbers (in millions) for each of those parts.

In [None]:
get_part_info(dr_who_df.iloc[0])

And as with the companion arrays, we can add these to the database:

In [None]:
for idx in dr_who_df.index:
    row=dr_who_df.loc[idx]
    dw_collection.update_one({'Story ID':row['Story ID']}, 
                             {'$set': {'Parts':get_part_info(row)}})

We can now see that the parts are represented by an array of subdocuments:

In [None]:
dw_collection.find_one()

So if we want to find the details of the episode which was aired on 25th April, 1964, we use the dotted notation, as described in notebook `14.2 Working With Embedded Documents`.

In [None]:
dw_collection.find_one({'Story ID':5, 'Parts.air_date':'4/25/64'})

## Discussion

We have made a reasonable first attempt at converting the initial CSV file into an appropriate structure for a MongoDB database. Where a value does not exist for a particular document, we have removed the key rather than have a NULL entry. Similarly, where the CSV file had many columns to represent fields with multiple values, we have represented the information with an array.

Something to be wary of when using MongoDB with pandas is that the DataFrame will still create columns for missing data. If we run a `.find({})` query on the collection now, and cast the result into a DataFrame, the resulting DataFrame has columns for `Firsts`, `Guest Doctor(s)` and so on, even though these keys do not appear in the majority of documents:

In [None]:
pd.DataFrame(dw_collection.find({})).head()

So it is important to remember that it may be possible to misinterpret how the DataFrame represents the results of the query.
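
The same effect can be reproduced without a database, as in this minimal sketch. The two dictionaries are made-up stand-ins for documents returned by a query, and only the second contains a `Guest Doctor(s)` key.

```python
import pandas as pd

# Made-up documents: the first has no 'Guest Doctor(s)' key at all
documents = [{'Story ID': 1, 'Title': 'Story A'},
             {'Story ID': 2, 'Title': 'Story B', 'Guest Doctor(s)': 'Somebody'}]

docs_df = pd.DataFrame(documents)
print(docs_df)

# The key is absent from the first document...
print('Guest Doctor(s)' in documents[0])

# ...but the DataFrame shows a NULL value in that position
print(pd.isnull(docs_df.loc[0, 'Guest Doctor(s)']))
```

The DataFrame has to fill every cell of the column, so an absent key and an explicit NULL look identical once the documents have been loaded into pandas.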

Cleaning data is often an iterative process: the more you look at your dataset, the dirtier you may realise it is. If you have a known document schema, it provides a good benchmark against which to check your data. But in many cases, you will have to use your own judgement about which data type best represents each field.

For example, this dataset still has several problems. There are some spelling mistakes in the entries (for example, "Patrick Traughton" is listed as a guest doctor, rather than "Patrick Troughton"). We have not converted the (string) entries into more appropriate types, such as floats or datetimes. And, despite our earlier efforts, there are still NULL entries in the set.
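
As an illustration of the type conversion issue, the air dates are stored as strings such as `'4/25/64'`, with a two-digit year. The following is a minimal sketch of one way to convert them. The helper `parse_air_date` and the cut-off `pivot_year` are assumptions made for illustration. The cut-off is needed because Python's `%y` directive reads the years 00 to 68 as 2000 to 2068, so `'64'` would otherwise become 2064.

```python
from datetime import datetime


def parse_air_date(text, pivot_year=2030):
    '''Convert a 'month/day/two-digit year' string into a datetime.
    Years after pivot_year are assumed to belong to the 1900s.
    (Hypothetical helper; the cut-off is an assumption.)'''
    parsed = datetime.strptime(text, '%m/%d/%y')
    if parsed.year > pivot_year:
        # '%y' has placed the date in the 2000s: move it back a century
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


print(parse_air_date('4/25/64'))
print(parse_air_date('3/26/05'))
```

A `datetime` value of this kind is stored by `pymongo` as a date rather than a string, which would allow date comparisons in queries.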

*If you are interested, you could try to further clean the data if there are other ways of storing it that you would like to explore.*

*Remember, the view of the data presented by the dataframe may differ from the data that is actually stored in the database documents.*

## Clean up

Finally, we can drop this test collection:

In [None]:
mongo_client[DB_NAME].drop_collection('dr_who_collection')

If you are working on a local VCE, you can also drop the database you created (if you are working on the remote VCE, you do not have permission to drop your database):

In [None]:
# Will not work on the remote VCE
mongo_client.drop_database(DB_NAME)

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.4 Introduction to the accidents database`.