# Introduction to the accidents dataset
In this Notebook, you'll take your first look at the accidents dataset. To keep things manageable, we'll only be looking at the accidents for 2012.

You can read more about the dataset in the VLE course materials (Part 14 section 4).

The dataset provided in the database is derived from the [road safety data](https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data) accident datasets published by the Department for Transport under the Open Government Licence.

This dataset contains details of every recorded road traffic accident in Britain in 2012. The data is anonymised, but the date, time, and location of each accident are recorded, along with the number and severity of casualties.

The aim of the activities described in this notebook is to reinforce some of the ways of querying data with MongoDB and recap some ways of using *pandas* to analyse the data.

As ever, we need to load in some essential libraries.

In [None]:
# Standard imports
import pandas as pd

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell in the following sections to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character-escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
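
As a rough illustration of how these parts fit together, the following sketch assembles a connection string by hand. The credentials are placeholders, the variable name `example_connection_string` is hypothetical, and the exact layout of the string used in the module's own setup cells is an assumption; the point is only that characters such as `@` or `/` in a password need to be escaped before they are embedded in the string.

```python
from urllib.parse import quote_plus

# Placeholder credentials, for illustration only
DB_USER = 'your_oucu'
DB_PWD = 'p@ss/word'

# Escape characters such as '@', ':' and '/' that have a special meaning in a URI
example_connection_string = 'mongodb://{}:{}@localhost:27017/?authsource=user-data'.format(
    quote_plus(DB_USER), quote_plus(DB_PWD))

print(example_connection_string)
```

On a locally hosted server, where no credentials are needed, the user name, password and `authsource` parts would simply be left out.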

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## Exploring the Database

We are going to make use of the `accidents` database throughout the MongoDB practical activities, so we'll start by having a quick poke around to familiarise ourselves with it and see what sorts of data it contains. This also provides a good opportunity to review how to make various sorts of query over a MongoDB collection.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

*__A note on speed limits__*

*The speed limit data in this dataset shows the speed limit of the road at the location of the accident. It says nothing about the speed of any particular vehicle, so you can't use this data to infer anything about whether speeding causes more accidents.*

*However, it may be reasonable to assume that vehicles will often be going faster in a 60mph zone than in a 30mph zone. (Or is it? I once had a 5mph accident in stop-start traffic on a national speed limit dual carriageway...)*

### Retrieving a single document
`find_one()` is the basic method for returning a single document from a collection. With no arguments, it just returns the first document it finds (chosen arbitrarily by Mongo) as a Python `dict`.

In [None]:
accidents_collection.find_one()

That's quite a bit of data...

Even though we are just selecting an arbitrary accident report, it does give us a feel for what some of the data looks like.

We can also inspect the top-level document keys (which we might think of as a document equivalent of tabular column headings) to see what sorts of field are available to us:

In [None]:
accidents_collection.find_one().keys()

Note that some of the fields may themselves contain documents (which is to say, `dict`s).
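
One quick way to spot such nested fields is to check the Python type of each value. The sketch below uses a made-up document (the `loc` field and all of the values are hypothetical stand-ins); with a live connection you could apply the same loop to the result of `accidents_collection.find_one()`.

```python
# Made-up document, loosely shaped like a query result
sample_doc = {
    'Accident_Index': 'A1',
    'Speed_limit': 30,
    'loc': {'type': 'Point', 'coordinates': [0.0, 0.0]},
}

# Report the Python type held under each top-level key
for key, value in sample_doc.items():
    print(key, type(value).__name__)

# Keys whose values are themselves documents (dicts)
nested_keys = [key for key, value in sample_doc.items() if isinstance(value, dict)]
print(nested_keys)
```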

When running a query, we may also want to get a feel for what sorts of values are provided for each key.

Tools do exist for trying to generate data schemas from the contents of databases, but they can be unreliable. A better reference is usually the data dictionary documentation if such a thing is published with the dataset.

### Querying On a Particular Field
Let's focus our query a bit more by pulling out an accident that happened in a 70 mph zone.

We do that by providing an element that must match a corresponding document element in the database.

Remember, in general, the pattern for running queries is:

```python
pymongo.MongoClient[DBNAME][COLLECTIONNAME].find_one(SELECT, PROJECTION)
```

Alternatively, to find multiple documents, we use `.find()` rather than `.find_one()`, in which case a `limit` argument is also available:

```python
pymongo.MongoClient[DBNAME][COLLECTIONNAME].find(SELECT, PROJECTION, limit=N)
```

Let's select a document based on a particular speed limit criterion (feel free to try other queries of your own):

In [None]:
accidents_collection.find_one({'Speed_limit': 70})

### Querying over Multiple SELECT items - Logical AND

If we give more than one key-value pair in the query document, the returned document must match all of them: a logical AND.

For instance, to find an accident in a 30mph zone that involved two vehicles and one casualty, we specify that information in the query document:

In [None]:
accidents_collection.find_one({'Speed_limit': 30, 'Number_of_Vehicles': 2, 'Number_of_Casualties': 1})

Recall that we can limit the key-value pairs returned by specifying the second argument to `find_one()`. 
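
The projection can be written either as a list of field names or as a `dict`. The sketch below builds both forms for the same pair of fields; the variable names are illustrative only. One practical difference is that the `dict` form lets you suppress the `_id` field, which MongoDB otherwise returns by default.

```python
fields = ['Accident_Index', 'Speed_limit']

# List form: include these fields ('_id' is also returned by default)
projection_as_list = fields

# Dict form: 1 includes a field; '_id': 0 suppresses the default '_id' field
projection_as_dict = {field: 1 for field in fields}
projection_as_dict['_id'] = 0

print(projection_as_list)
print(projection_as_dict)
```

Either form could then be passed as the second argument to `find_one()` or `find()`.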

The following query combines *selection* (the speed limit, one casualty, and two vehicles) and *projection* (only retrieving some parts of the document).

In [None]:
accidents_collection.find_one({'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2},
                              ['Accident_Index', 'Speed_limit', 'Number_of_Casualties', 'Number_of_Vehicles'])

### Counting and Finding Multiple Documents

If we want to count how many documents match a particular query, we use the `.count_documents()` collection method:

```python
pymongo.MongoClient[DBNAME][COLLECTIONNAME].count_documents({})
pymongo.MongoClient[DBNAME][COLLECTIONNAME].count_documents(SELECT)
```

If we want to find more than one document at a time, we use the imaginatively named `.find()` method.

The `.count_documents()` and `.find()` methods both take similarly structured SELECT arguments.

For example, let's count the accidents in a 30mph zone that had a single casualty and involved two vehicles:

In [None]:
accidents_collection.count_documents({'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2})

With this dataset, the `limit` query keyword is extremely useful when exploring, as it stops us being overwhelmed by data. Let's create a small DataFrame to pick out a few attributes of a few accidents of this type.

Recall that the output of `find()` is an iterator of `dict`s. If we convert the iterator to a `list`, we can create a DataFrame directly.
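
One consequence of working with an iterator is that it can generally be consumed only once. The sketch below uses a plain Python generator as a stand-in for a query cursor (the function name and the documents are made up), so it illustrates the general behaviour of iterators rather than anything specific to `pymongo`.

```python
def fake_cursor():
    """Stand-in for a query cursor: yields one dict per matching document."""
    yield {'Accident_Index': 'A1', 'Road_Type': 6}
    yield {'Accident_Index': 'A2', 'Road_Type': 3}

cursor = fake_cursor()
rows = list(cursor)      # consumes the iterator
leftover = list(cursor)  # nothing is left on a second pass

print(len(rows), len(leftover))
```

If you want to reuse the results of a query, it is therefore usually simplest to keep hold of the list (or the DataFrame built from it) rather than the cursor itself.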

In [None]:
pd.DataFrame(accidents_collection.find({'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2},
                                       ['Accident_Index', 'Accident_Severity', 'Road_Type','Weather_Conditions'],
                                       limit=10))

The `SELECT` conditions may use a range of comparison tests other than simple equality tests. For example, if a particular field is associated with values that support inequality tests (such as numerics, or strings), we can test using inequalities:

- less than: `{'$lt': VALUE}`
- less than or equal to: `{'$lte': VALUE}`
- greater than: `{'$gt': VALUE}`
- greater than or equal to: `{'$gte': VALUE}`

SELECT expressions that make use of inequalities take the form:

```
{FIELD: {INEQUALITY: VALUE}}
```

For example:

```
accidents_collection.count_documents({'Speed_limit': {'$gte': 50}})
```
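
To get a feel for how such expressions are interpreted, here is a deliberately simplified matcher written in plain Python. It is not how MongoDB evaluates queries and it handles only equality and the four inequality operators listed above; the `matches` function and the sample documents are hypothetical, included purely to illustrate the semantics.

```python
import operator

# Map a few MongoDB comparison operators onto Python comparisons
COMPARISONS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}

def matches(document, select):
    """Return True if `document` satisfies every condition in `select`."""
    for field, condition in select.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # {FIELD: {OPERATOR: VALUE, ...}}: every operator must hold
            if not all(COMPARISONS[op](value, target)
                       for op, target in condition.items()):
                return False
        elif value != condition:
            # {FIELD: VALUE}: simple equality
            return False
    return True

# Made-up documents, for illustration only
sample_docs = [
    {'Accident_Index': 'A1', 'Speed_limit': 20},
    {'Accident_Index': 'A2', 'Speed_limit': 30},
    {'Accident_Index': 'A3', 'Speed_limit': 60},
]

fast = [d['Accident_Index'] for d in sample_docs
        if matches(d, {'Speed_limit': {'$gte': 50}})]
mid_range = [d['Accident_Index'] for d in sample_docs
             if matches(d, {'Speed_limit': {'$gte': 30, '$lt': 60}})]

print(fast)
print(mid_range)
```

The second query shows that several operators can be combined on a single field to express a range.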

For more examples, and a comprehensive list of available comparison operators, see the [query selectors - comparison operators](https://docs.mongodb.com/manual/reference/operator/query-comparison/#query-selectors-comparison) section of the MongoDB documentation.

Other operators are also available, such as logical operators ([docs](https://docs.mongodb.com/manual/reference/operator/query/#logical)). For example:

```
{ '$and': [ { <expression1> }, { <expression2> } , ... , { <expressionN> } ] }
{ '$or': [ { <expression1> }, { <expression2> }, ... , { <expressionN> } ] }
{ FIELD: { '$not': { <operator-expression> } } }
```
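
As a small sketch of how these templates translate into Python, the following builds a few query documents as ordinary `dict`s. The particular conditions are arbitrary examples, and the variable names are illustrative only.

```python
# Accidents in either a 20mph or a 70mph zone
either_limit = {'$or': [{'Speed_limit': 20}, {'Speed_limit': 70}]}

# ...that also involved more than two vehicles
query = {'$and': [either_limit, {'Number_of_Vehicles': {'$gt': 2}}]}

# Speed limit NOT greater than or equal to 50
# ($not also matches documents that have no Speed_limit field at all)
not_fast = {'Speed_limit': {'$not': {'$gte': 50}}}

print(query)
print(not_fast)

# With a live connection, these could be passed to a query method, for example:
# accidents_collection.count_documents(query)
```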


Existence and logical typing tests ([docs](https://docs.mongodb.com/manual/reference/operator/query/#element)) are also provided:

```
{ FIELD: { '$exists': True } }
{ FIELD: { '$exists': False } }
{ FIELD: { '$type': TYPE } }
```
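
Written out as Python `dict`s, such tests might look like the sketch below. The choice of `Speed_limit` as the field, and of `'int'` as the type alias, is an assumption made for illustration; the type actually stored in the database may differ.

```python
# Documents that have a Speed_limit field
has_limit = {'Speed_limit': {'$exists': True}}

# Documents that lack a Speed_limit field
no_limit = {'Speed_limit': {'$exists': False}}

# Documents whose Speed_limit is stored as a 32-bit integer
int_limit = {'Speed_limit': {'$type': 'int'}}

for example_query in (has_limit, no_limit, int_limit):
    print(example_query)

# With a live connection, for example:
# accidents_collection.count_documents(no_limit)
```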


You may also find it useful to revise the 'Query criteria' section in _MongoDB: The Definitive Guide_.

*Although the following activity is quite a short one, I would encourage you to spend a few minutes exploring your own queries using various combinations of the above operators just to get a feel for them.*

### Activity 1
How many accidents were there where the speed limit was less than 30mph?

For accidents where the speed limit was less than 30mph, create a DataFrame that holds the accident index, number of vehicles, and number of casualties for each accident.

In [None]:
# Enter your code in this cell


#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

A query of the following form will count the documents where the speed limit was less than 30mph.

In [None]:
accidents_collection.count_documents({'Speed_limit': {'$lt': 30}})

We can add a projection clause to limit the fields returned in the response, and by casting the cursor to a list of `dict`s, generate a *pandas* DataFrame from it:

In [None]:
pd.DataFrame(accidents_collection.find({'Speed_limit': {'$lt': 30}}, 
                                 ['Accident_Index', 'Number_of_Vehicles',
                                  'Number_of_Casualties'])).head()

#### End of Activity 1

----------------------------------------------------------------

## Expanding the codes
A lot of the information in this dataset is recorded as numeric codes rather than human-readable labels. The `labels` collection contains the human-readable labels for all the codes used in the accident descriptions, derived from a data dictionary supplied with the original dataset.

In [None]:
labels = accidents_db['labels']

labels

The label classes are described by the `label` field.

We can find the distinct values associated with a field by calling the `.distinct()` method on the collection and specifying the field we want to return the distinct values for:

In [None]:
labels.distinct('label')

Associated with each label is a range of codes. For example, here are the codes for decoding the `Road_Type` labels:

In [None]:
labels.find_one({'label': 'Road_Type'})

We can review all the codes associated with a particular label by indexing into the result dictionary:

In [None]:
labels.find_one({'label': 'Road_Type'})['codes']

We can then further index into this dictionary to ask for the label of a particular type of road, given its code value:

In [None]:
labels.find_one({'label': 'Road_Type'})['codes']['6']

### Expanded Description Labels

Some keys in the documents are themselves quite cryptic, although human-readable expansions of them are once again provided in the `labels` collection.

In [None]:
# What does 'FdHGVR4' mean?
labels.find_one({'label': 'FdHGVR4'})

In [None]:
expanded_labels = pd.DataFrame(labels.find({'expanded': {"$exists": True}},
                                                  {'label':1, 'expanded':1, '_id':0}))
expanded_labels

Creating a dictionary for these lookups is trivial:

In [None]:
expanded_labels.set_index('label').to_dict()['expanded']

## Annotating MongoDB Query Results with Human Readable Labels

Armed with information that decodes the codes back into human-readable terms, we should now be able to repeat the query we made earlier over two-vehicle accidents in a 30mph speed area with a single casualty, decoding the original `Accident_Severity`, `Road_Type` and `Weather_Conditions` coded values back to human-readable labels.

Recall, we use a SELECT expression of the form:

```
{'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2}
```

and a projection of the form:

```
['Accident_Index', 'Accident_Severity', 'Road_Type','Weather_Conditions']
```

With our lookup dictionaries available we can now generate results with human readable labels by annotating items retrieved from queries onto the MongoDB `accidents` collection.

First, let's create a simple dataframe with some sample results:

In [None]:
slow_accidents = accidents_collection.find({'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2}, 
                                ['Accident_Index', 'Accident_Severity', 'Road_Type', 'Weather_Conditions'],
                                limit=10)

slow_accidents_df = pd.DataFrame(slow_accidents)
slow_accidents_df

The *pandas* `.map()` Series method allows us to look up the values within a column in a `dict`, replacing each value with the corresponding dictionary entry.

This means we can map a dictionary of values retrieved from the `labels` collection onto the `accidents` results, although we do need to make sure that we match types as well as values: the *Road_Type* values in the `accidents` results are typed as integers, whereas the keys retrieved from the `labels` collection are strings:

In [None]:
slow_accidents_df.dtypes

To map the dictionary values against a dataframe column, we use a construction of the form:

```python
df[COLUMN].map(DICT)
```

The `DICT` keys should match the values in the dataframe `COLUMN`. A *Series* will be returned containing the values from the dictionary associated with the keys referenced from the `COLUMN`.
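
The following self-contained sketch shows why the types matter. The dictionary and DataFrame are small illustrative stand-ins rather than values read from the database: the dictionary has string keys, as the `codes` retrieved from the `labels` collection do, while the column holds integers.

```python
import pandas as pd

# Illustrative stand-in for a `codes` dictionary; note that the keys are strings
road_type_codes = {'3': 'Dual carriageway', '6': 'Single carriageway'}

# Illustrative stand-in for some query results; the codes here are integers
df = pd.DataFrame({'Accident_Index': ['A1', 'A2', 'A3'],
                   'Road_Type': [6, 3, 6]})

# Mapping the integer column directly finds no matching keys, giving NaN throughout
unmatched = df['Road_Type'].map(road_type_codes)

# Casting to str first lets the values line up with the dictionary keys
df['Road_Type_'] = df['Road_Type'].astype(str).map(road_type_codes)

print(unmatched)
print(df)
```

Any value with no matching key is mapped to `NaN` rather than raising an error, so a column full of `NaN`s after a `.map()` call is often a sign of a type mismatch.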

In [None]:
# Make sure we type the keys correctly
slow_accidents_df['Road_Type_'] = slow_accidents_df['Road_Type'].astype(str).map(labels.find_one({'label': 'Road_Type'})['codes'])
slow_accidents_df

More generally:

In [None]:
label = 'Road_Type'

slow_accidents_df[label].astype(str).map(labels.find_one({'label': label})['codes'])

### Activity 2

Create a DataFrame containing ten sample accident results and displaying the labels of the accident severity, road type, and weather conditions. For example, your code should display printed rows of the form:
 
|Accident_Index|Accident_Severity|Road_Type|Weather_Conditions|
|----|-----|-----|-----|
|201201BS70001|Slight|Single carriageway|Fine no high winds|

Start by creating a DataFrame of the accidents, and then use the `labels` to create a human readable form.


In [None]:
# Enter your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

First, generate the sample accidents dataframe and make sure the columns are presented in the desired order:

In [None]:
sample_accidents = accidents_collection.find({'Speed_limit': 30, 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2}, 
                                {'Accident_Index':1, 'Accident_Severity':1,
                                 'Road_Type':1, 'Weather_Conditions':1, '_id':0},
                                limit=10)

sample_accidents_df = pd.DataFrame(sample_accidents)

#Order the columns as required
sample_accidents_df = sample_accidents_df[["Accident_Index", "Accident_Severity",
                                           "Road_Type", "Weather_Conditions"]]

sample_accidents_df

To display the decoded values, we need to look up the codes for each label in the `labels` collection and map each code value to its label. The following function will either replace the column with the mapped values or, if `replace` is `False`, create a new derived column with a trailing underscore in its name:

In [None]:
def map_labels(df, label, replace=True):
    """Map labels onto a column."""
    labeled = label if replace else label+'_'
    df[labeled] = df[label].astype(str).map(labels.find_one({'label': label})['codes'])

Now we need to map the accident severity, road type, and weather conditions labels:

In [None]:
map_labels(sample_accidents_df, "Accident_Severity")
map_labels(sample_accidents_df, "Road_Type")
map_labels(sample_accidents_df, "Weather_Conditions")

sample_accidents_df

#### End of Activity 2

--------------------------------------------

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.5 Investigating the accident data`.