# Investigating the accident data

In the previous notebook you saw how the `accident` database contains two collections: `accidents` and `labels`. The `labels` collection supports a range of labels that decode various code values that are used in the `accidents` collection.

In this part of the notebook, we will use *exploratory data visualisation* techniques to explore the accidents database using a combination of data queries on the MongoDB accidents database, and analysis methods applied to *pandas* DataFrames.

Let's start by loading in our required packages:

In [None]:
# Standard imports
import pandas as pd

import seaborn as sns

We also need to set up a connection to MongoDB in the usual way:

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https:tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell to set your credentials in the following cells.

#### Connecting to the database on [tm351.open.ac.uk](https:tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Such rights would not generally be granted on a multi-user database unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character-escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
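
As a rough illustration of how these parts fit together, the following sketch assembles a connection string by hand. The credential values are made up, and the string actually used in this module is set by the credential cells above, so treat the layout shown here as an assumption rather than the module's configuration. The standard-library function `urllib.parse.quote_plus` does the character escaping.

```python
from urllib.parse import quote_plus

# Made-up credentials, for illustration only
example_user = "example_user"
example_pwd = "p@ss/word"  # contains characters that must be escaped

# Remote-style string: escaped user name and password, then host:port,
# then the option that names the authentication database
remote_connection_string = (
    f"mongodb://{quote_plus(example_user)}:{quote_plus(example_pwd)}"
    "@localhost:27017/?authsource=user-data"
)

# Local-style string: no credentials are needed
local_connection_string = "mongodb://localhost:27017/"

print(remote_connection_string)
print(local_connection_string)
```

Escaping matters because characters such as `@` and `/` have their own meaning inside a connection string.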

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

We'll be plotting some charts, so increase the default plot size to make things easier to read:

In [None]:
# Set a larger plot size than the default
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Severities at one speed
We'll start our investigation of the data with something easily visualised: *what are the relative proportions of accident severities in accidents occurring in 30mph zones?*

Note that the basic `find()` just returns documents: it can't do any aggregation and doesn't have anything like SQL's `GROUP BY` clause.

*In the Week 15 activities, we'll look at Mongo's aggregation pipelines, where we can get MongoDB to do this kind of group based, aggregation processing for us. But for now, let's stick to generating summary statistics from the raw data by using pandas' aggregation methods.*
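
To see what grouping in *pandas* looks like without involving the database, here is a minimal sketch in which a short, made-up list of dictionaries stands in for the documents a `find()` call would return. Grouping on a column and counting the rows in each group plays the role of SQL's `GROUP BY` with `COUNT(*)`.

```python
import pandas as pd

# Made-up documents standing in for the results of a find() query
example_docs = [
    {'Accident_Severity': 3},
    {'Accident_Severity': 3},
    {'Accident_Severity': 2},
    {'Accident_Severity': 3},
    {'Accident_Severity': 1},
]

example_df = pd.DataFrame(example_docs)

# Group the rows by severity code and count the rows in each group
example_counts = example_df.groupby('Accident_Severity').size()
print(example_counts)
```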

*__A note on speed limits__*

*The speed limit data in this dataset shows the speed limit of the road at the location of the accident. It says nothing about the speed of any particular vehicle, so you can't use this data to infer anything about whether speeding causes more accidents.*

*However, it may be reasonable to assume that vehicles will often be going faster in a 60mph zone than in a 30mph zone. (Or is it? I once had a 5mph accident in stop-start traffic on a national speed limit dual carriageway...)*

For our data investigation, we'll grab the data corresponding to accidents in 30mph zones.

*Note that this may take some time to run.*

In [None]:
# Build a DataFrame, one row for each accident
severities_30_df = pd.DataFrame(accidents_collection.find({'Speed_limit': 30}, ['Accident_Severity']))
severities_30_df.head(3)

Then we'll calculate the summary numbers, counts of how many accidents were associated with each accident severity in a dataset of accidents at 30 mph:

In [None]:
# Count the number of each severity
severities_30_ss = severities_30_df['Accident_Severity'].value_counts()
severities_30_ss

The index values of the resulting Series correspond to the coded accident severity values.

Now let's plot the value counts as a bar chart so we can visually compare them:

In [None]:
severities_30_ss.plot(kind='bar');

The bar chart uses the index values for the x-axis labels and draws the bars in the order in which they appear in the Series (`value_counts()` returns them sorted by decreasing count). To display the chart with the bars ordered by severity code (that is, by the x-axis values), we can sort the Series by its index.

In [None]:
severities_30_ss.sort_index(inplace=True)
severities_30_ss
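
The effect of sorting by index is easiest to see on a small Series. In this sketch the severity codes are invented: `value_counts()` lists the most frequent code first, while `sort_index()` reorders the result by the code itself.

```python
import pandas as pd

# Invented severity codes: three of code 3, two of code 2, one of code 1
example_codes = pd.Series([3, 2, 3, 1, 3, 2])

# value_counts() orders the result by decreasing count
by_count = example_codes.value_counts()
print(by_count.index.tolist())   # [3, 2, 1]

# sort_index() returns a copy ordered by the index values
by_code = by_count.sort_index()
print(by_code.index.tolist())    # [1, 2, 3]
```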

We can now plot the chart with values mapped against the severity. Recent updates to the plotting library may uniquely colour each bar, so in passing let's just make sure that all the columns are plotted in the same colour.

If we end the plotting statement with a `;`, we can suppress the display of the matplotlib object type.

In [None]:
severities_30_ss.sort_index().plot(kind='bar', color='royalblue');

The chart is still quite cryptic: it's not obvious how to read the numeric coded values along the horizontal x-axis. 

We can query the `labels` collection to find the appropriate human-readable labels:

In [None]:
labels=accidents_db['labels']

In [None]:
label = 'Accident_Severity'

severity_labels = labels.find_one({'label': label})['codes']
severity_labels

The chart uses the row index values for the x-axis labels, so we need to update the coded severity values that form the current index to the decoded labels.

We do this by mapping the dictionary containing the appropriate labels onto the index, casting the index values to strings along the way so that they match the dictionary keys:

In [None]:
severities_30_ss.index = severities_30_ss.index.astype(str).map(severity_labels)
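
The cast to `str` matters because the severity codes in the index are numbers, whereas the keys of the labels dictionary are strings. The following sketch shows the difference using an invented labels dictionary and invented counts (the real labels come from the `labels` collection and may differ): without the cast nothing matches, and every label comes back as missing.

```python
import pandas as pd

# Invented stand-in for a labels dictionary with string keys
example_labels = {'1': 'Fatal', '2': 'Serious', '3': 'Slight'}

# Invented counts indexed by integer severity code
example_counts = pd.Series([10, 200, 3000], index=[1, 2, 3])

# Integer index values do not match the string keys, so every lookup fails
unmatched = example_counts.index.map(example_labels)
print(unmatched.isna().all())

# Casting the index to strings first lets each code find its label
matched = example_counts.index.astype(str).map(example_labels)
print(matched.tolist())
```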

If we now plot the chart, we should see the correct labels:

In [None]:
severities_30_ss.plot(kind='bar', color='royalblue');

### Activity 1
What are the numbers of accidents at each severity in 60mph zones?

Create a DataFrame containing the numbers of accidents at each severity in 60mph zones, and then show your results as a bar chart with the bars given meaningful labels.


In [None]:
# Enter your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Select the accidents at 60mph, cast them to a dataframe and then calculate the value counts for each severity, ordering the dataframe by increasing coded severity value:

In [None]:
# Build a DataFrame, one row for each accident
severities_60_df = pd.DataFrame(accidents_collection.find({'Speed_limit': 60}, ['Accident_Severity']))

# Count the number of each severity
severities_60_ss = severities_60_df['Accident_Severity'].value_counts()
severities_60_ss.sort_index(inplace=True)
severities_60_ss

Relabel the dataframe index and plot the chart.

In [None]:
severities_60_ss.index = severities_60_ss.index.astype(str).map(severity_labels)
severities_60_ss.plot(kind='bar', color='royalblue');

#### End of Activity 1

--------------------------------------

## Severities across speeds
It's a bit tedious doing one speed at a time. Let's summarise all the data and put it in a *pandas* DataFrame so we can see it all together.

You may or may not remember the *pandas* `crosstab()` method that we review in the notebook `04.1 Crosstabs and pivot tables.ipynb`. As a refresher, the `crosstab()` method provides a convenient way of counting the occurrences of one column value or index value with respect to another.

We call `pd.crosstab()` with two required arguments, `index` and `columns`, which we will refer to here as `x` and `y`: `pd.crosstab(x, y)`.

The `x` value identifies the column in the original dataframe whose unique values we want to use as row index values, and the `y` value identifies the column whose unique values we want to use as column headings. At the intersection of each row and column in the crosstab dataframe is a count of the number of rows in the original table in which that row value and that column value occur together.

As ever, the explanation sounds complicated; seeing it in action often makes things clearer.
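
A small, made-up table may help. In the sketch below, each row of `example_df` is an invented accident, and the values are for illustration only. Passing the speed limit column as `index` and the severity column as `columns` produces one row per speed limit and one column per severity code, with each cell counting the matching rows.

```python
import pandas as pd

# Invented accidents: one row each, with a speed limit and a severity code
example_df = pd.DataFrame({
    'Speed_limit':       [30, 30, 30, 60, 60, 30],
    'Accident_Severity': [3,  3,  2,  3,  1,  3],
})

# Rows: unique speed limits; columns: unique severity codes; cells: counts
example_xt = pd.crosstab(example_df['Speed_limit'],
                         example_df['Accident_Severity'])
print(example_xt)
```

Combinations that never occur, such as a severity 1 accident at 30mph in this invented data, appear as a count of 0.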

Let's start by building a DataFrame with one row for each accident.

In [None]:
# Build a DataFrame, one row for each accident
severity_by_speed_df = pd.DataFrame(accidents_collection.find({}, ['Speed_limit', 'Accident_Severity']))
severity_by_speed_df.head()

We can then count the number of rows at each speed/severity combination with a _pandas_ `crosstab`.

*Remember, `pd.crosstab()` takes two columns, using the unique values from the first as the index values in a reshaped dataframe and the unique values from the second as the new dataframe's column headings. The cell values give a count of how many times each (index_value, column_value) pair occurred in the original dataframe.*

In [None]:
# Count the number of each severity
severity_by_speed_df = pd.crosstab(severity_by_speed_df['Speed_limit'], 
                                   severity_by_speed_df['Accident_Severity'])
severity_by_speed_df

We can make the DataFrame more informative by relabelling the columns away from severity code values to severity labels.

If we cast the list of column names to a *pandas.Series*, suitably typed, we can make use of the `.map()` method:

In [None]:
# Relabel the columns
severity_by_speed_df.columns = pd.Series(severity_by_speed_df.columns).astype(str).map(severity_labels)
severity_by_speed_df

Plotting the chart from the DataFrame creates, by default, a dodged bar chart, with columns grouped by speed limit (that is, by the DataFrame's index values) and each colour-coded bar representing the values in a particular column.

In [None]:
severity_by_speed_df.plot(kind='bar',
                          title='Accident severity by speed')

As an aside, it's sometimes useful to move the graph legend off to the side. Use the following reformulation of the plotting command to do so, grabbing a matplotlib axis object from the dataframe generated chart and then operating on the axis to set the location of the legend.

In [None]:
ax = severity_by_speed_df.plot(kind='bar',
                               title='Accident severity by speed')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

As you can see, there are a lot more accidents in 30mph zones than at any other speed limit. We'll have to account for that when we do some more detailed analysis later.
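
One possible way of accounting for the differing totals, sketched here on invented counts rather than the real data, is to convert each row to proportions so that every speed limit sums to 1. The `normalize='index'` argument of `pd.crosstab()` does this at the point the table is built. This illustrates the idea only; it is not necessarily the approach taken later in the module.

```python
import pandas as pd

# Invented accidents, heavily weighted towards the 30mph limit
example_df = pd.DataFrame({
    'Speed_limit':       [30, 30, 30, 30, 30, 30, 30, 30, 60, 60],
    'Accident_Severity': [3,  3,  3,  3,  3,  3,  2,  2,  3,  2],
})

# normalize='index' divides each row by its total, giving row proportions
example_props = pd.crosstab(example_df['Speed_limit'],
                            example_df['Accident_Severity'],
                            normalize='index')
print(example_props)
```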

### Activity 2

What are the relative proportions of accident severities by junction type (roundabout, crossroads, etc)? Display your results in a meaningful way.

Note that this is a more in-depth activity than the previous example, so I suggest breaking it down into several steps.

- What are the different types of junction?
  - you can search the `labels` collection with a regular expression filter to find label categories that refer to junctions:
    - regular expression search filter: `{"label": {"$regex" : ".*unction.*"}}`
  - Identify which label corresponds to the junction type by reviewing the decoded values associated with each junction related label
- What different severities happen at each junction type?
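
To get a feel for what the suggested regular expression will match, you can try it out locally with Python's `re` module before running it against the database. In this sketch the two label names containing "unction" are made up, and Python's `re` syntax is assumed to behave like MongoDB's for a pattern this simple. Leaving off the initial letter means that both `Junction` and `junction` match.

```python
import re

pattern = re.compile(r".*unction.*")

# The two names containing "unction" are made up for illustration
candidate_labels = ['Accident_Severity', 'Speed_limit',
                    'Example_Junction_Type', 'example_junction_flag']

# Keep the names in which the pattern is found
matching_labels = [name for name in candidate_labels if pattern.search(name)]
print(matching_labels)
```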

We have broken down the task into several subtasks.

First, identify the junction related labels.

In [None]:
# Enter your code in this cell

Next, preview the `labels` values associated with junction related labels to decide which junction related label we want.


In [None]:
# Enter your code in this cell

Create a crosstab across the appropriate junction label and accident severity.


In [None]:
# Enter your code in this cell

Display results table with appropriate index values and column headings.


In [None]:
# Enter your code in this cell

Finally, display results chart.

In [None]:
# Enter your code in this cell

*Comment on your results in this cell.*

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

First, identify the junction types by finding sets of labels that refer to junctions:

In [None]:
list(labels.find({"label":{"$regex" : ".*unction.*"}}, {'label':1, '_id':0}))

So it seems there are three sorts of thing that relate to junctions, but which one do we want? Let's look at the decoded values associated with each one:

In [None]:
list(labels.find({"label":{"$regex" : ".*unction.*"}}, {'label':1, 'codes':1, '_id':0}))

`Junction_Detail` seems to be the one we want.

We can now build a crosstab table that summarises the counts for each `Junction_Detail`/`Accident_Severity` combination:

In [None]:
# Build a DataFrame, one row for each accident
severity_by_junction_df = pd.DataFrame(accidents_collection.find({}, ['Junction_Detail', 'Accident_Severity']))

# Count the number of each severity
severity_by_junction_df = pd.crosstab(severity_by_junction_df['Junction_Detail'], 
                                      severity_by_junction_df['Accident_Severity'])
severity_by_junction_df

The index values and column headings are coded values. It would be easier to read the table if we mapped the associated human-readable labels to them:

In [None]:
label = 'Junction_Detail'

junction_labels = labels.find_one({'label': label})['codes']

# Relabel the index to the junction types
severity_by_junction_df.index = severity_by_junction_df.index.astype(str).map(junction_labels)
severity_by_junction_df.index.name = 'Junction type'

# Relabel the columns
severity_by_junction_df.columns = pd.Series(severity_by_junction_df.columns).astype(str).map(severity_labels)
severity_by_junction_df.columns.name = 'Accident severity'
severity_by_junction_df

Now we can plot a bar chart directly from the dataframe to visually compare the results: 

In [None]:
severity_by_junction_df.plot(kind='bar',
                             title='Accident severity by junction type')

Again, it's difficult to judge if the proportions of accident severities are different for different junction types. We'll look at this later.

#### End of Activity 2

-----------------------------------------

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

This completes the required notebook activities for this week.

Two optional notebooks are also provided that demonstrate how you might apply statistical measures to your investigations using well supported statistical analysis Python packages.