# Grouping and summarising operations in aggregation pipelines

In the previous notebook, you were introduced to simple aggregation pipelines, focussing on how pipeline stages could be created to filter records through selection, projection and limit operators, as well as "unwinding" lists into separate documents, one per array (list) element.

In this notebook, you will see how we can perform grouping and summarising operations within a pipeline.

This can be particularly useful because it keeps the summarising operations close to the data. Rather than having to return a large number of records, hold them in memory as a *pandas* dataframe and then create a summary table from the dataframe, we can just retrieve the summary table directly from the pipeline.

Let's start in the normal way by loading in some required packages:

In [None]:
# Standard imports
import pandas as pd

import seaborn as sns

from pandas.api.types import CategoricalDtype

Open a connection to the MongoDB server, then open the accidents database and set up references to the `accidents` and `labels` collections:

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell below to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
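
As a minimal sketch of how the "character escaped" password fits into such a string, the following uses `quote_plus` from the Python standard library. The credentials and the names `example_user`, `example_pwd` and `example_connection_string` are invented for illustration, and the exact layout of the string is an assumption based on the parts listed above rather than the module's own setup code.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
example_user = "ab1234"
example_pwd = "p@ss:word/1"

# Escape characters such as '@', ':' and '/' that have a
# special meaning inside a connection string
escaped_user = quote_plus(example_user)
escaped_pwd = quote_plus(example_pwd)

# Assumed layout of a connection string for the remote server
example_connection_string = (
    f"mongodb://{escaped_user}:{escaped_pwd}@localhost:27017/?authsource=user-data"
)
print(example_connection_string)
```

Without the escaping step, the `@` in the password would be read as the separator between the credentials and the host name.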

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many documents it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

We will also specify the `labels` collection:

In [None]:
labels=accidents_db['labels']

We'll be plotting some charts, so increase the default plot size to make things easier to read:

In [None]:
# Set a larger plot size than the default
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Grouping in aggregation pipelines

A `$group` operator can be applied to the records in a pipeline to perform summary aggregation operations over the group.

For example, the following  pipeline:

- finds all the accidents at 30mph or above;
- groups the accidents by speed limit and finds the number of accidents at each speed limit within the group.

```python
pipeline = [{'$match': {'Speed_limit': {'$gte': 30}}},
            {'$group': {'_id': '$Speed_limit',
                        'num_accidents': {'$sum': 1}}}]
```

In this case, the first step of the pipeline retrieves records of accidents that occurred at 30 miles per hour or greater, and the second step groups the results as follows:

- group the accidents using an `_id` value set as the value being grouped on (which is to say, the value of the speed limit, `{'_id': '$Speed_limit'}`);
- having grouped the documents, apply a so-called "accumulator expression" across the documents in each group. In this case, the `num_accidents` total for a group has 1 (one) added to it (`{'$sum': 1}`) for each document in that group.

Other accumulator operators include `$min` and `$max` (for a full list, see the [MongoDB aggregation pipeline group/accumulator operator docs](https://docs.mongodb.com/manual/reference/operator/aggregation/#accumulators-group)).

It may look quirky, but this is the idiom for counting members in a group.
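
To see what the idiom amounts to, here is a rough plain-Python equivalent of the two pipeline stages. It runs over a handful of made-up documents (`sample_docs`) rather than the real collection, and it is only a sketch of the behaviour, not how MongoDB implements it.

```python
from collections import Counter

# Made-up documents standing in for the accidents collection
sample_docs = [
    {'Speed_limit': 20}, {'Speed_limit': 30}, {'Speed_limit': 30},
    {'Speed_limit': 40}, {'Speed_limit': 30}, {'Speed_limit': 60},
]

# $match: keep documents with a speed limit of 30 or above
matched = [d for d in sample_docs if d['Speed_limit'] >= 30]

# $group with {'$sum': 1}: each group's total starts at zero
# and has 1 added for every document that falls into the group
num_accidents = Counter()
for doc in matched:
    num_accidents[doc['Speed_limit']] += 1

# Sorted here for readability; $group itself does not promise an order
sample_results = [{'_id': speed, 'num_accidents': count}
                  for speed, count in sorted(num_accidents.items())]
print(sample_results)
```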

In [None]:
# Pull out all the accidents at 30mph or above
pipeline = [{'$match': {'Speed_limit': {'$gte': 30}}},
            # Group by speed
            {'$group': {'_id': '$Speed_limit',
                        'num_accidents': {'$sum': 1}}}]

# Show totals for each speed.
results = list(accidents_collection.aggregate(pipeline))
results

We can now put the results in a _pandas_ DataFrame.

In [None]:
results_df = pd.DataFrame(results).set_index('_id')

# Rename the _id index (representing the group key)
# to something more meaningful
results_df.index.name = 'speed_limit'
results_df

*For many of the aggregation pipeline activities below, build up the pipeline in stages. The `$limit` operator is your friend here: it will allow you to see what the pipeline produces without being overwhelmed by thousands of items.*
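
One way to follow this advice is with a small helper that appends a `$limit` stage to whatever pipeline you have built so far. The helper name `with_limit` and the variable names are hypothetical, introduced only for this sketch.

```python
def with_limit(pipeline, n=5):
    """Return a copy of the pipeline with a $limit stage appended."""
    return list(pipeline) + [{'$limit': n}]

# A pipeline under construction
draft_pipeline = [{'$match': {'Speed_limit': {'$gte': 30}}}]

# The preview version stops after three documents;
# the draft pipeline itself is left unchanged
preview_pipeline = with_limit(draft_pipeline, 3)
print(preview_pipeline)
```

In the notebook, you could then inspect the intermediate output with something like `list(accidents_collection.aggregate(with_limit(draft_pipeline)))`.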

### Activity 1
Find all the accidents at 30mph or above, group them by speed limit and accident severity, and find the number of accidents at each speed limit/severity combination.

Hint: If you give multiple keys for a single `$group` operation, it will return one group for each combination of those keys.

In [None]:
# Insert your solution here.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To create the pipeline, we need to match speed limits greater than or equal to 30 mph and then group the results by speed and severity. We can then count the number of items in each group by adding 1 for each item.

In [None]:
# Pull out all the accidents at 30mph or above, group by speed and severity, 
#   and show totals at each speed/severity combination.
_pipeline1 = [{'$match': {'Speed_limit': {'$gte': 30}}},
              {'$group': {'_id': {'Speed_limit': '$Speed_limit',
                                  'Accident_Severity': '$Accident_Severity'},
                          'num_accidents': {'$sum': 1}}}]

_results1 = list(accidents_collection.aggregate(_pipeline1))

display(_results1[:3])


####  End of Activity 1

-----------------------------------

### Creating a DataFrame from the Multi-Attribute Group

Using `json_normalize` to cast the result to a *pandas* dataframe flattens the multiple index terms to columns with `_id.` prefix column names:
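
The flattening can be pictured with a plain-Python sketch that handles a single level of nesting. The function `flatten_one_level` and the sample document are made up for illustration; `json_normalize` itself is more general than this.

```python
def flatten_one_level(doc, sep='.'):
    """Flatten one level of nested dictionaries into prefixed keys."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested keys are prefixed with the name of the parent key
            for sub_key, sub_value in value.items():
                flat[f'{key}{sep}{sub_key}'] = sub_value
        else:
            flat[key] = value
    return flat

# A made-up result document with a compound group key
sample_group = {'_id': {'Speed_limit': 30, 'Accident_Severity': 3},
                'num_accidents': 12}

flat_group = flatten_one_level(sample_group)
print(flat_group)
```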

In [None]:
#Use the sample results from the activity answer
severity_speed_df = _results1
# Alternatively, use your own results

severity_speed_df = pd.json_normalize(severity_speed_df)

severity_speed_df.head()

We can replace the column names with something a little tidier by using a `dict` comprehension to create a lookup from `_id.` prefixed column names to clean column names:

In [None]:
# For column names starting with _id,
# create a lookup from that column name to a cleaned column name
column_renames = {c:c.replace('_id.', '') for c in severity_speed_df.columns if c.startswith('_id.')}

severity_speed_df.rename(columns=column_renames, inplace=True)
severity_speed_df.head()

We can also relabel the accident severity with meaningful labels in the normal way, taking the opportunity to cast them as ordered categorical items as we do so:
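
The cast to `str` before the mapping matters because field names in a MongoDB document are always strings, so a lookup stored as a document is presumably keyed on `'1'`, `'2'`, `'3'` rather than on integers. The sketch below uses an assumed lookup (`sample_severity_labels`) and made-up codes to show the difference in plain Python.

```python
# Assumed shape of the 'codes' lookup: the keys are strings
sample_severity_labels = {'1': 'Fatal', '2': 'Serious', '3': 'Slight'}

# Made-up severity codes as they might appear in the results
sample_codes = [3, 1, 2, 3]

# Integer codes do not match the string keys, so nothing is found
unmatched = [sample_severity_labels.get(code) for code in sample_codes]

# Casting each code to a string first lets the lookup succeed
matched_labels = [sample_severity_labels.get(str(code)) for code in sample_codes]

# An explicit ordering, from least to most severe, plays the role
# that the ordered categorical type plays in pandas
severity_order = ['Slight', 'Serious', 'Fatal']
sorted_labels = sorted(matched_labels, key=severity_order.index)

print(unmatched)
print(matched_labels)
print(sorted_labels)
```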

In [None]:
# Get accident severity labels
accident_severity_labels = labels.find_one({'label': 'Accident_Severity'})['codes']


# Map the accident severity codes (cast to strings) to their text labels
severity_speed_df['Accident_Severity'] = severity_speed_df['Accident_Severity'].astype(str).map(accident_severity_labels)

# Then map the accident severity labels to an ordered category type
severity_speed_df['Accident_Severity'] = \
    severity_speed_df['Accident_Severity'].astype(CategoricalDtype(['Slight','Serious', 'Fatal'],
                                                            ordered=True))

severity_speed_df.head()

### Visualising Long Format Data With `seaborn.barplot()`

With the data in this long form, we can plot it directly using the `seaborn.barplot()` function, using the speed limit as the grouping value on the *x*-axis, the number of accidents as the bar height on the *y*-axis and the accident severity for the colour (*hue*) grouping:

In [None]:
ax = sns.barplot(x="Speed_limit", y="num_accidents",
                 hue="Accident_Severity", data=severity_speed_df)

### Activity 2
Using an aggregation pipeline, group the accidents by number of vehicles and number of casualties. From each group, find the number of accidents for each combination of vehicle and casualty number.

Visualise the data using a *seaborn* scatterplot chart, `sns.scatterplot()`. Use the number of vehicles for the *x*-axis, the number of casualties on the *y*-axis, and the number of accidents for the *size*.

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Run the pipeline to generate the data and unroll the cursor response into a list of results:

In [None]:
_pipeline2 = [{'$group': {'_id': {'Number_of_Casualties': '$Number_of_Casualties', 
                                  'Number_of_Vehicles': '$Number_of_Vehicles'},
                          'num_accidents': {'$sum': 1}}}]

_results2 = list(accidents_collection.aggregate(_pipeline2))
_results2[:3]

Create a dataframe from the results using whatever method you prefer. Here, I use the `json_normalize` route and then rename the columns:

In [None]:
_results2_long_df = pd.json_normalize(_results2)
_results2_long_df.columns = [c.replace('_id.', '') for c in _results2_long_df.columns]
_results2_long_df.head()

Visualise the data using *seaborn*:

In [None]:
from matplotlib.ticker import MaxNLocator

ax = sns.scatterplot(x="Number_of_Vehicles", y="Number_of_Casualties",
                size=_results2_long_df["num_accidents"].to_list(),
                data=_results2_long_df)


# Only use integer tick labels
ax.axes.get_xaxis().set_major_locator(MaxNLocator(integer=True)) 
ax.axes.get_yaxis().set_major_locator(MaxNLocator(integer=True)) ;

(Note that we have converted the `size` argument to a list because of a known bug in [matplotlib and seaborn](https://stackoverflow.com/questions/63443583/seaborn-valueerror-zero-size-array-to-reduction-operation-minimum-which-has-no).)

####  End of Activity 2

-----------------------------------

### Activity 3 (optional)

Using an aggregation pipeline, group accidents by severity and junction type (`Junction_Detail`) and find the number of accidents for each combination of junction and severity.

Store the results in a long DataFrame, casting any `NaN` results to 0 and using meaningful severity and junction detail text labels (e.g. Fatal, Serious; Roundabout, Crossroads) in place of numerical code values.

Visualise the resulting dataframe as an appropriately ordered grouped bar chart.

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Start by retrieving the data into a list of dictionaries:

In [None]:
_pipeline4 = [{'$group': {'_id': {'Junction_Detail': '$Junction_Detail',
                                  'Accident_Severity': '$Accident_Severity'},
                          'num_accidents': {'$sum': 1}}}]

_results4 = list(accidents_collection.aggregate(_pipeline4))
_results4[:3]

Convert the list to a dataframe:

In [None]:
_results4_long_df = pd.json_normalize(_results4)
_results4_long_df.columns = [c.replace('_id.', '') for c in _results4_long_df.columns]
_results4_long_df.head(3)

Map the numerical codes to human readable labels, as appropriate:

In [None]:
junction_detail_labels = labels.find_one({'label': 'Junction_Detail'})['codes']

_results4_long_df['Accident_Severity'] = \
    _results4_long_df['Accident_Severity'].astype(str).map(accident_severity_labels)

# Use the categorical datatype
_results4_long_df['Accident_Severity'] = \
    _results4_long_df['Accident_Severity'].astype(CategoricalDtype(['Slight','Serious', 'Fatal'],
                                                            ordered=True))

_results4_long_df['Junction_Detail'] = \
    _results4_long_df['Junction_Detail'].astype(str).map(junction_detail_labels)

_results4_long_df.head(3)

Visualise the result, using the `hue_order` to appropriately order the accident severities:

In [None]:
ax = sns.barplot(x="num_accidents", y="Junction_Detail",
                 hue="Accident_Severity",
                 hue_order=['Slight', 'Serious', 'Fatal'],
                 data=_results4_long_df)

####  End of Activity 3

-----------------------------------

## Optional *pandas* data wrangling practice examples

*Optional activities to provide practice on using pandas datashaping `.pivot()` operations with pipeline results data. The example solutions may also be useful as worked examples, so just remember they're available here in case you want to check back on them in future!*

### Activity 4 (optional)

Reshaping and plotting the `severity_speed_df` data using a *pandas* `.pivot()`.

*Click the arrow in the margin or run this cell to reveal this optional practice activity.*

Convert the long format `severity_speed_df` dataframe containing number of accidents for each combination of speed limit and severity, into a wide format dataframe with columns corresponding to accident severity and an index, in decreasing order, of speed limits.

Visualise the result as a bar chart.

*Hint: use a pandas `pivot` to reshape the data.*

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Pivot the long form dataframe into a wide form using accident severities as the column values:

In [None]:
_results3_wide_df = severity_speed_df.pivot(index='Speed_limit',
                                            columns='Accident_Severity',
                                            values='num_accidents')
_results3_wide_df.head(3)

Sort the index of the dataframe in descending speed order:

In [None]:
_results3_wide_df.sort_index(ascending=False, inplace=True)
_results3_wide_df.head(3)

We can visualise the wide format dataframe directly using a *pandas* `bar` chart plotting method. The function groups items by row, assigning each column to its own color-designated bar, with groups indexed by the row index value:

In [None]:
ax = _results3_wide_df.plot(kind='bar')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));

####  End of Activity 4

-----------------------------------

### Activity 5 (optional)

More reshaping of pipeline results data using a *pandas* `.pivot()`, this time with the data from Activity 2.

*Click the arrow in the margin or run this cell to reveal this optional practice activity.*

Using your dataframe from Activity 2, where you grouped the accidents by number of vehicles and number of casualties, reshape the dataframe to a wide format using the number of vehicles for the columns and number of casualties as the index. 

Replace any `NaN` fields with 0 values.

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Reshape the table from long to wide form using a pivot:

In [None]:
_results4_wide_df = _results2_long_df.pivot(index='Number_of_Casualties',
                                              columns='Number_of_Vehicles',
                                              values='num_accidents')
_results4_wide_df.head()

Replace the `NaN` values with `0` values:

In [None]:
_results4_wide_df.fillna(0, inplace=True)
_results4_wide_df.head()

####  End of Activity 5

-----------------------------------

## Projections on pipeline `$group` results

As well as limiting the return of particular fields in a document, the `$project` operator also allows us to project elements returned from the `$group` phase in two additional ways:

- firstly, onto newly named variables
- secondly, onto calculated values derived from the group members.

Let's start by defining a grouping operation:

In [None]:
group = {'$group': {'_id': {'Speed_limit': '$Speed_limit'}}}

pipeline = [group]

list(accidents_collection.aggregate(pipeline))

### Renaming group `_id` fields

We might then project the `_id.Speed_limit` value returned from a grouping operation onto the simpler `Speed_limit` value, which makes for a cleaner output.

In [None]:
project = {'$project': {'Speed_limit': '$_id.Speed_limit'}}

pipeline = [group, project]

list(accidents_collection.aggregate(pipeline))

In an expanded view, the pipeline looks like this:

```python
pipeline = [{'$group': {'_id': {'Speed_limit': '$Speed_limit'}}}, 
            {'$project': {'Speed_limit': '$_id.Speed_limit'}}]
```

We can also suppress the display of the `_id` field by setting its return value to the "false" `0` value in the projection step:

In [None]:
pipeline = [group,
            {'$project': {'Speed_limit': '$_id.Speed_limit',
                          '_id':0}}]

list(accidents_collection.aggregate(pipeline))

## Sorting Items In the Pipeline
We can sort the output by adding a `$sort` step to the pipeline, naming the field(s) we want to sort on and giving each a direction: `1` for ascending order or `-1` for descending order:
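
In a `$sort` stage, the value given for each field sets the direction: `1` sorts in ascending order and `-1` in descending order. The following plain-Python sketch mimics a single-key sort over some made-up group documents; the function `sort_stage` is hypothetical and exists only to illustrate the two directions.

```python
# Made-up documents of the kind a $group/$project pair might return
sample_groups = [{'Speed_limit': 60}, {'Speed_limit': 20}, {'Speed_limit': 40}]

def sort_stage(docs, spec):
    """Mimic a single-key $sort: 1 is ascending, -1 is descending."""
    (field, direction), = spec.items()
    return sorted(docs, key=lambda d: d[field], reverse=(direction == -1))

ascending = sort_stage(sample_groups, {'Speed_limit': 1})
descending = sort_stage(sample_groups, {'Speed_limit': -1})

print(ascending)
print(descending)
```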

In [None]:
sort = {'$sort':{'Speed_limit': 1}}

Run this as part of a pipeline:

In [None]:
# Create the pipeline
pipeline = [group, project, sort]

# Run the pipeline
list(accidents_collection.aggregate(pipeline))

Note that you can use multiple projections within the same pipeline. For example, if we know all we are interested in is the `Speed_limit` field, we can start the pipeline with a projection onto just that element:

In [None]:
pipeline = [{'$project': {'Speed_limit':1}},
            {'$group': {'_id': {'Speed_limit': '$Speed_limit'}}},
            {'$project': {'Speed_limit': '$_id.Speed_limit',
                          '_id':0}}]
list(accidents_collection.aggregate(pipeline))

## Performing Calculations as Part of the Projection

As well as filtering and renaming elements, the `$project` operator also provides a means by which we can perform calculations over document values using a range of aggregation operators, including arithmetic expression operators, conditional and comparison operators and date, text and string operators.

For example, aggregation operators, which take the form `{<operator>: [ <argument1>, <argument2> ... ]}` or `{ <operator>: <argument1>}`, include, but are not limited to:

- `$add`, `$subtract`, `$multiply`, `$divide`
- `$eq`, `$gt`, `$gte`, `$lt`, `$lte`, `$ne`
- `$dateFromString`, `$dayOfMonth`, `$dayOfWeek`, `$dayOfYear`
- `$sin`, `$cos`, `$tan`, `$asin` etc.
- `$toInt`, `$toString` etc


*For a full list, see the [MongoDB aggregation pipeline operator docs](https://docs.mongodb.com/manual/reference/operator/aggregation/).*

For example, we could find the speed limit in km per hour rather than miles per hour by multiplying the speed limit in mph by 1.61.

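*Note that a field renamed in a `$project` stage, such as `$Speed_limit`, is not available to other expressions within that same `$project` stage, so the calculation has to refer to `$_id.Speed_limit`.*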

In [None]:
pipeline = [group,
            {'$project': {'Speed_limit': '$_id.Speed_limit',
                          '_id': 0,
                          'Speed_limit_kph': {'$multiply': ['$_id.Speed_limit', 1.61]}}},
            sort]

list(accidents_collection.aggregate(pipeline))[:2]

Or in an expanded form:

In [None]:
pipeline = [{'$group': {'_id': {'Speed_limit': '$Speed_limit'}}},
            {'$project': {'Speed_limit': '$_id.Speed_limit',
                          '_id': 0,
                          'Speed_limit_kph': {'$multiply': ['$_id.Speed_limit', 1.61]}}},
            {'$sort': {'Speed_limit': 1}}]

list(accidents_collection.aggregate(pipeline))[:2]

### Activity 6
Use an aggregation pipeline to find the "average" number of vehicles and casualties per accident at each speed limit. Replace the `_id` of each speed limit group with the plain `Speed_limit`.

Store the results in a DataFrame with the averages as the columns and speed limits as the index, sorted by increasing speed limit.

Visualise the data using an appropriately defined bar chart.

Hint: Use `$group` operator with the `$sum` accumulator expression to find the total vehicles and casualties, then use `$project` to find the averages and rename the ID.

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Create a group for each speed limit, using `$sum` accumulator expressions over `$Number_of_Casualties` and `$Number_of_Vehicles` to find the total number of each, along with a count of the accidents in the group.

In [None]:
_group = {'$group': {'_id': {'Speed_limit': '$Speed_limit'},
                     'total_casualties': {'$sum': '$Number_of_Casualties'},
                     'total_vehicles': {'$sum': '$Number_of_Vehicles'},
                     'num_accidents': {'$sum': 1}}}

_pipeline5a = [_group]

list(accidents_collection.aggregate(_pipeline5a))[:2]

Find the average number of casualties per accident by dividing the total number of casualties by the total number of accidents, and likewise for the average number of vehicles per accident.

Also use the projection to rename the `$_id.Speed_limit` field and remove the `_id` field.
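
As a worked example of the arithmetic, the sketch below applies the same division and renaming to a single made-up group document in plain Python. The totals are invented, chosen only so that the averages are easy to check by hand.

```python
# A made-up document of the kind the $group stage returns
sample_group_doc = {'_id': {'Speed_limit': 30},
                    'total_casualties': 27,
                    'total_vehicles': 36,
                    'num_accidents': 18}

n = sample_group_doc['num_accidents']

# Equivalent of the $project stage: divide the totals by the count,
# pull the speed limit out of _id and leave _id itself out
projected = {
    'average_casualties': sample_group_doc['total_casualties'] / n,
    'average_vehicles': sample_group_doc['total_vehicles'] / n,
    'Speed_limit': sample_group_doc['_id']['Speed_limit'],
}
print(projected)
```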

In [None]:
_project = {'$project': {'average_casualties': {'$divide': ['$total_casualties', '$num_accidents']},
                         'average_vehicles': {'$divide': ['$total_vehicles', '$num_accidents']},
                         'Speed_limit': '$_id.Speed_limit',
                         '_id': 0}}

_pipeline5b = [_group, _project]

list(accidents_collection.aggregate(_pipeline5b))[:2]

A final element of the pipeline sorts the resulting documents by `$Speed_limit`.

In [None]:
_sort = {'$sort': {'Speed_limit': 1}}

_pipeline5c = [_group, _project, _sort]

_results5 = list(accidents_collection.aggregate(_pipeline5c))
_results5[:2]

Put the results in a DataFrame. The speed limit index value will allow us to group bars by this dimension in our bar chart.

In [None]:
_results5_df = pd.json_normalize(_results5)

# Set the index
_results5_df.set_index('Speed_limit', inplace=True)
_results5_df

Finally, use a simple *pandas* plot to visualise the results using a bar chart:

In [None]:
ax = _results5_df.plot(kind='bar')

# Put the legend outside the chart so it doesn't occlude the bars
ax.legend(bbox_to_anchor=(1, 0.5));

####  End of Activity 6

-----------------------------------

### Activity 7 (optional)

Use an aggregation pipeline to find the number of casualties for each combination of casualty severity and casualty age band.

Store the results in a DataFrame with the severities as the columns and age bands as the index. The columns and index should contain the text labels (e.g. 21-25, 46-55; Fatal, Slight), not the codes.

Hint: Use `$unwind` to examine each `casualty` sub-document in turn.

In [None]:
# Insert your solution here.

# You may find it convenient to construct your solution over several code cells.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Start off by defining a pipeline that unwinds the `$Casualties` lists and then groups the result by casualty age band (`Age_Band_of_Casualty`) and severity (`Casualty_Severity`), using the `$sum` accumulator expression to count the number of records in each group. Because of the unwind, each record counted is a casualty, even though the count is stored under the name `num_accidents`.

Sort the results by the casualty age band for sensible charting.

In [None]:
_pipeline6 = [{'$unwind': '$Casualties'},
              {'$group': {'_id': {'Age_Band_of_Casualty': '$Casualties.Age_Band_of_Casualty',
                                  'Casualty_Severity': '$Casualties.Casualty_Severity'},
                          'num_accidents': {'$sum': 1}}},
              {'$sort': {'_id.Age_Band_of_Casualty': 1}}]

_results6 = list(accidents_collection.aggregate(_pipeline6))
_results6[:3]

Make sure we have the human readable labels to hand.

We might also consider creating an ordered categorical type for the age bands:

In [None]:
casualty_severity_labels = labels.find_one({'label': 'Casualty_Severity'})['codes']

age_band_casualty_labels = labels.find_one({'label': 'Age_Band_of_Casualty'})['codes']

# Create an ordered category type for age bands
ordered_age_bands = [age_band_casualty_labels[k] for k in sorted(age_band_casualty_labels.keys(), key=int)]
age_band_category = CategoricalDtype(ordered_age_bands, ordered=True)

Put the result into a dataframe, rename the columns, and map the codes to human readable labels:

In [None]:
_results6_long_df = pd.json_normalize(_results6)

#Rename the columns
_results6_long_df.columns = [c.replace('_id.', '') for c in _results6_long_df.columns]


#Map the codes to human readable labels and thence to ordered categorical labels
_results6_long_df['Age_Band_of_Casualty'] = \
    _results6_long_df['Age_Band_of_Casualty'].astype(str).map(age_band_casualty_labels).astype(age_band_category)

_results6_long_df['Casualty_Severity'] = \
    _results6_long_df['Casualty_Severity'].astype(str).map(casualty_severity_labels)

_results6_long_df.head(3)

Finally, create a bar chart of the results:

In [None]:
ax = sns.barplot(x="num_accidents", y="Age_Band_of_Casualty",
                 hue="Casualty_Severity",
                 data=_results6_long_df)

ax.legend(bbox_to_anchor=(1, 0.5));

####  End of Activity 7

-----------------------------------

## Binning Data

In the `accidents` data, the age band of driver and casualty fields represent "binned" ranges. The `Number_of_Vehicles` and `Number_of_Casualties` fields give actual counts, but as we have seen, in some situations it may also be useful to bin these into different groups.

For example, we might categorise accidents based on the number of vehicles involved in the accident, defining the categories: single vehicle crashes, two car accidents, multiple vehicle incidents (up to and including five vehicles, for example) and "huge pile ups" of six vehicles or more.

*In a formal report, you should probably be more circumspect in the naming of the different categories!*

We can allocate documents to different bins using the aggregation pipeline `$bucket` operator.

This requires items to be grouped on a particular value and then allocated to bounded groups based on an itemised boundary list. The list is parsed to create pairwise bounded ranges where the lower bound values are inclusive and the upper bound values are exclusive.

For example, the boundary list `[0, 1, 2, 3, 6, 999]` identifies the bins:

- `[0, 1)`: no vehicles (0 inclusive to 1 exclusive)
- `[1, 2)`: 1 vehicle (1 inclusive to 2 exclusive)
- `[2, 3)`: 2 vehicles (2 inclusive to 3 exclusive)
- `[3, 6)`: 3-5 vehicles (3 inclusive to 6 exclusive)
- `[6, 999)`: 6 vehicles and up (6 inclusive to 999 exclusive, which we set to be greater than the maximum number of vehicles. We are guessing at an upper bound here, although we could set one explicitly based on the actual maximum vehicle count.)
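
The inclusive lower bound and exclusive upper bound can be checked with a small plain-Python sketch that uses the standard library `bisect` module. The function `bucket_id` and the sample vehicle counts are made up for illustration. The sketch labels each bin by its lower bound, as `$bucket` does with its `_id` values, and raises an error for values outside the boundaries, which is roughly how `$bucket` behaves when no `default` bucket is given.

```python
from bisect import bisect_right
from collections import Counter

boundaries = [0, 1, 2, 3, 6, 999]

def bucket_id(value, boundaries):
    """Return the inclusive lower bound of the bin that holds value."""
    if not boundaries[0] <= value < boundaries[-1]:
        raise ValueError(f'{value} falls outside the boundaries')
    # bisect_right finds the first boundary greater than value;
    # the boundary just before it is the lower bound of the bin
    return boundaries[bisect_right(boundaries, value) - 1]

# Made-up vehicle counts, one per accident
sample_vehicle_counts = [1, 2, 2, 3, 5, 6, 12, 1]

bucket_counts = Counter(bucket_id(n, boundaries) for n in sample_vehicle_counts)
print(sorted(bucket_counts.items()))
```

Here, 5 falls in the bin labelled 3 and 6 falls in the bin labelled 6, which confirms that each upper bound is exclusive.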

In [None]:
bucket = {'$bucket':
          {'groupBy': "$Number_of_Vehicles",
           'boundaries': [0, 1, 2, 3, 6, 999],
           'output':
               { "count": { '$sum': 1 } }
          }
         }

list(accidents_collection.aggregate([bucket]))

We can check these with some counts over equivalent queries:

In [None]:
single_vehicle_accidents = accidents_collection.count_documents({"Number_of_Vehicles":1})
two_vehicle_accidents = accidents_collection.count_documents({"Number_of_Vehicles":2})
six_and_up_vehicle_accidents = accidents_collection.count_documents({"Number_of_Vehicles": {"$gte":6}})

print(f'''There were {single_vehicle_accidents} single vehicle accidents, \
{two_vehicle_accidents} two vehicle accidents and \
{six_and_up_vehicle_accidents} accidents involving six or more vehicles.
''')

*Do the numbers match?*

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `15.5 Introducing the Roads collection`.