# Introducing the `roads` collection

Thus far, you have explored the `accidents` collection of road traffic accident data and cross-referenced items in it with the `labels` metadata collection, which provides human-readable labels for the numerical codes used in the `accidents` data. This notebook introduces a new dataset, the `roads` collection.

The `roads` collection contains road-related metadata and daily road traffic flow for a wide range of roads in the UK road network. The values represent *the number of vehicles that travel past (in both directions) the location on an average day of the year* and are recorded *for every junction-to-junction link on the motorway and 'A' road network, and for some minor roads in Great Britain*.

*The data was originally published by the Department for Transport (DfT) under an Open Government Licence and then loaded by the TM351 module team into MongoDB. For more details, see [roadtraffic.dft.gov.uk](https://roadtraffic.dft.gov.uk/).*

You are encouraged to explore the dataset by writing your own queries if there are particular questions about the traffic flows that you are curious about.

As ever, let's load in some required packages:

In [None]:
# Standard imports
import pandas as pd

# Seaborn for charts...
import seaborn as sns

# folium for maps...
import folium

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell in the following sections to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character-escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
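As a minimal sketch of how a connection string with these parts might be assembled, the following uses only the Python standard library. The helper `build_connection_string()`, the user name and the password are all hypothetical placeholders for illustration; they are not part of `pymongo` and are not the values used in the module environment.

```python
from urllib.parse import quote_plus

def build_connection_string(user=None, password=None,
                            host="localhost", port=27017,
                            authsource=None):
    """Assemble a MongoDB connection string (hypothetical helper)."""
    credentials = ""
    if user is not None and password is not None:
        # Escape characters such as '@' and '/' that have a special meaning in a URI
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    options = f"?authsource={authsource}" if authsource else ""
    return f"mongodb://{credentials}{host}:{port}/{options}"

# Remote-style string: placeholder credentials plus an authentication database
remote_style = build_connection_string("example_user", "p@ss/word",
                                       authsource="user-data")

# Local-style string: no credentials and no authsource option
local_style = build_connection_string()

print(remote_style)
print(local_style)
```

Escaping the password matters because an unescaped `@` or `/` would otherwise be read as a separator within the connection string.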

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

We will also specify the `labels` collection:

In [None]:
labels=accidents_db['labels']

We'll be plotting some charts, so increase the default plot size to make things easier to read:

In [None]:
# Set a larger plot size than the default
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Looking at the structure of a `roads` collection document

So far, we have seen the `accidents` and `labels` collections in the accidents database. In this notebook, we will look at the `roads` collection. Let's set up the collection:

In [None]:
roads_collection=accidents_db['roads']

To get an idea of what sort of data might be contained in the `roads` collection, let's ask the most basic question of all: *what's in a single 'road' document?* We'll answer it using the single document returned by `find_one()`, which we might assume to be representative of documents across the collection as a whole:

In [None]:
roads_collection.find_one()

As a flattened dataframe, the document looks like this:

In [None]:
pd.json_normalize( [roads_collection.find_one()] )

The record describes a road link — a stretch of road between two junctions — with average daily flows for different vehicle types that passed along that link during a particular sample period.

Road links have two ends (`A-Junction` and `B-Junction`) which are either junctions or region boundaries, and are located in a particular local authority area (`ONS LA Name`).

The `Fd...` keys are the number of vehicles of a particular class of vehicle that passed this point during a particular sample period (in the forward direction, but there's no 'reverse' direction specified).

The `LenNet` measure gives the distance (in km) of the road link; `RCat` describes the road category using one of a set of road category codes, and the `Road` field gives the road number.

What do the codes mean? The decoded human readable label references already form part of the `labels` collection, so we can cross-reference those just as we looked up `accidents` code labels previously.

Specifically, we can create a lookup for human readable labels associated with the `RCat` (road category) codes:

In [None]:
road_category_labels = labels.find_one({'label': 'RCat'})['codes']
road_category_labels

The `roads` collection also includes some keys whose meaning is given by expanded human readable labels, for example the field names starting `Fd`, such as `FdLGV` and `FdHGVA6`.

One way to access these is to create a lookup dictionary of expanded codes created from a dataframe generated from a query on the `labels` collection documents containing `expanded` label elements.

For example, create the dataframe:

In [None]:
expanded_labels = pd.json_normalize(labels.find({'expanded': {"$exists": True}},
                                                  {'label':1, 'expanded':1, '_id':0}))
expanded_labels.head()

Create a lookup dictionary, keyed by the `label` values:

In [None]:
expanded_name = expanded_labels.set_index('label').to_dict()['expanded']
expanded_name

We can then expand a human readable form of a key code directly:

In [None]:
expanded_name['FdAll_MV'], expanded_name['FdHGVA6']

The following function will map encoded column names onto human readable labels:

In [None]:
def readable_column_names(columns):
    """Map column codes onto human readable labels."""
    column_names = []
    for k in columns:
        # Fall back to the original code if no expanded label exists
        label = expanded_name[k] if k in expanded_name else k
        column_names.append(label)
    # As a one-liner using a list comprehension:
    # column_names = [expanded_name.get(k, k) for k in columns]
    return column_names

We can then annotate a dataframe created from documents pulled from the `roads` collection as follows:

In [None]:
df = pd.json_normalize(list(roads_collection.find(limit=4)))
df.columns = readable_column_names(df.columns)
df
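The fallback behaviour of `readable_column_names()` can be illustrated without a database connection. The following is a minimal sketch: `toy_labels` and its label text are made up for illustration and stand in for the real `expanded_name` lookup, and `readable_names()` is a hypothetical variant that takes the lookup as an argument.

```python
# Hypothetical lookup standing in for expanded_name; the label text is invented
toy_labels = {'FdCar': 'Cars (illustrative label)',
              'RCat': 'Road category'}

def readable_names(columns, lookup):
    """Return the expanded label for each code, or the code itself if none exists."""
    return [lookup.get(code, code) for code in columns]

toy_columns = ['Road', 'RCat', 'FdCar']
renamed = readable_names(toy_columns, toy_labels)
print(renamed)
```

Codes with no entry in the lookup (here, `Road`) pass through unchanged, so the function is safe to apply to any set of column names.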

## Exploring the `roads` collection data

As well as looking at the `roads` data on a map view, we can also analyse the data in various numerical or statistical ways. In this section we'll review some of the ways we might explore any arbitrary numerical dataset, although still framing our questions very much in the context of the data we have to hand.

For example, we might review the distribution of road segment lengths, the profile of roads within a particular district or area, or the distributions of traffic flows across different parts of the road network.

Let's start by having a look at some of the numbers associated with the traffic flow data contained in the `roads` collection. We'll start by analysing the data as represented in a *pandas* dataframe.

To obtain the data we need in order to create the dataframe, we'll use the aggregation pipeline technique you met in a previous notebook.

We'll begin by considering how many examples there are of each type of road link, and what the average length of each of them is.

The `RCat` field gives the road category, which we'll count instances of. The `LenNet` field gives the length of the road link, from which we can find the average length.

In the grouping step, we can make use of the `$avg` averaging accumulator operator to find the group averages.

As well as the grouping stage, we'll tidy up the attribute names returned from the pipeline using a final `$project` stage.

We can run the basic pipeline as follows:

In [None]:
pipeline = [{'$group': {'_id': '$RCat',
                        'length': {'$avg': '$LenNet'},
                        'count': {'$sum': 1}}},
            {'$project': {'RCat': '$_id',
                          '_id': 0, 'length': 1, 'count': '$count'}}]

results = list(roads_collection.aggregate(pipeline))
results

Note that in the `$project` statement, we can retain a field in the projection without renaming it by declaring it as `'name': '$name'` or `'name': 1`.
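To make the grouping logic concrete, here is a minimal plain-Python sketch of what the `$group` and `$project` stages compute. The documents in `toy_links` are invented for illustration (the codes and lengths are not taken from the collection), and the loop is only an analogy for the work MongoDB does on the server.

```python
from collections import defaultdict

# Invented documents; the real collection holds many more fields per road link
toy_links = [{'RCat': 'PU', 'LenNet': 1.0},
             {'RCat': 'PU', 'LenNet': 3.0},
             {'RCat': 'TR', 'LenNet': 10.0}]

# $group: collect the LenNet values for each distinct RCat value
lengths_by_category = defaultdict(list)
for link in toy_links:
    lengths_by_category[link['RCat']].append(link['LenNet'])

# $avg and {'$sum': 1}, followed by the $project renaming of _id to RCat
toy_results = [{'RCat': rcat,
                'length': sum(lengths) / len(lengths),
                'count': len(lengths)}
               for rcat, lengths in lengths_by_category.items()]

print(toy_results)
```

The real pipeline output held in `results` has the same shape: one dictionary per road category, with `RCat`, `length` and `count` keys.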

We can also cast the result to a *pandas* dataframe and map the codes in the normal way:

In [None]:
road_lengths = pd.json_normalize(list(roads_collection.aggregate(pipeline)))

# Map code values
road_lengths['RCat'] = road_lengths['RCat'].map(road_category_labels)

# Map column labels
road_lengths.columns = readable_column_names(road_lengths.columns)

road_lengths

We can plot this data using a simple scatterplot to show the average road segment length for a particular road category against the number of segments of that category.

We can also add an annotation label to each point by applying a labelling function to each row of the dataset.

In [None]:
ax = sns.scatterplot(x="length", y="count", data=road_lengths)

def add_chart_label(row, ax, x, y, label):
    """Add a label to a point at a particular location."""
    ax.text(row[x], row[y], row[label], horizontalalignment='left')

road_lengths.apply(add_chart_label, x='length', y='count',
                   label='Road category', ax=ax, axis=1);

Unsurprisingly, rural road links are longer than urban road links. There are more "principal" than "trunk" road links, probably because "trunk" roads are designated major routes.

But looking at the labels, what might the "principal motorways" relate to?

The `.distinct(field, query)` Mongo collection method ([docs](https://docs.mongodb.com/manual/reference/method/db.collection.distinct/)) allows us to review the unique (that is, *distinct*) values for a particular `field` retrieved from a specified `query`.

You might recall that we had a road category `PM`:

In [None]:
road_category_labels

*In passing, you might notice that `rural` roads are identified by the letter `R` in the second character position in the code, `urban` roads by the letter `U` in the same position, and motorways by the letter `M`. The first letter appears to identify the road class (principal, trunk, U, B or C).*

So let's use that to satisfy our curiosity and see just what might be considered as a "principal motorway":

In [None]:
roads_collection.distinct('Road', {'RCat': 'PM'})

Hmm... All others are presumably *trunk motorways*.
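As a rough illustration of what `.distinct(field, query)` returns, the following plain-Python sketch applies the same idea to a handful of invented documents. The road numbers `X1` to `X3` are placeholders, the `TM` code is assumed by analogy with the naming pattern noted above, and the hypothetical helper `distinct_values()` only handles simple equality queries.

```python
# Invented documents; road numbers and categories are placeholders
toy_links = [{'Road': 'X1', 'RCat': 'PM'},
             {'Road': 'X1', 'RCat': 'PM'},
             {'Road': 'X2', 'RCat': 'PM'},
             {'Road': 'X3', 'RCat': 'TM'}]

def distinct_values(documents, field, query):
    """Unique values of `field` over documents matching every key/value in `query`."""
    matches = (doc for doc in documents
               if all(doc.get(key) == value for key, value in query.items()))
    # Sorted here only to make the output reproducible
    return sorted({doc[field] for doc in matches if field in doc})

pm_roads = distinct_values(toy_links, 'Road', {'RCat': 'PM'})
print(pm_roads)
```

Repeated road numbers collapse to a single entry, which is why `.distinct()` is convenient for listing the roads in a category when each road is split into many links.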

### Distribution of road link lengths

Reconsidering the road link lengths, one thing the average lengths shown so far don't tell us about is the distribution of lengths of different road links, so let's explore that.

#### Generating summary statistics

If we pull back the lengths for every road link and category into a *pandas* dataframe, we can easily run some summary statistics over the data using the *pandas* dataframe `.describe()` method. Let's just grab the data into a simple dataframe directly from a simple query:

In [None]:
road_lengths_df = pd.DataFrame(roads_collection.find({}, {'CP':1, 'RCat':1, 'LenNet':1, '_id':0}))
road_lengths_df.head(3)

And then review the summary statistics of the data contained in the dataframe:

In [None]:
road_lengths_df.describe()

We can also generate similar statistics using an aggregation pipeline, which avoids having to download a large number of results into memory in a potentially very long *pandas* dataframe:

In [None]:
group = {'$group': {'_id': None,
                    'length': {'$avg': '$LenNet'},
                    'count': {'$sum': 1},
                    'std': {'$stdDevPop': '$LenNet'},
                    'min': {'$min': '$LenNet'}
                   }}

project = {'$project': {'length': '$length',
                        'count': '$count',
                        'std': '$std',
                        'min': '$min',
                        '_id':0}}

pipeline = [group, project]

list(roads_collection.aggregate(pipeline))
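One difference to keep in mind when comparing the two sets of statistics: `$stdDevPop` computes the *population* standard deviation, whereas the `std` row reported by the *pandas* `.describe()` method is the *sample* standard deviation (MongoDB provides `$stdDevSamp` for the latter). The following minimal sketch uses the standard library and a short list of invented lengths to show that the two measures differ:

```python
import statistics

# Invented road link lengths, used only to compare the two measures
toy_lengths = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: divides by n (compare $stdDevPop)
population_sd = statistics.pstdev(toy_lengths)

# Sample standard deviation: divides by n - 1 (compare .describe() and $stdDevSamp)
sample_sd = statistics.stdev(toy_lengths)

print(population_sd, sample_sd)
```

For a collection with many documents the two values will be close, but they will not generally be identical.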

From the long *pandas* dataframe containing the road segment length for each count point, we can use the `.hist()` method to plot a histogram of the road length values:

In [None]:
road_lengths_df['LenNet'].hist();

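The *seaborn* `distplot()` function provides an alternative plot, optionally overlaying a continuous *kernel density estimate (kde)* model of the distribution on top of a histogram. Note that `distplot()` has been deprecated since *seaborn* 0.11 in favour of `histplot()` and `displot()`.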

*Pass `kde=True` into the plot to display an overplotted continuous model.*

In [None]:
sns.distplot(road_lengths_df['LenNet'],
             # Specify the bin break points for binning the data
             bins=[0, 6, 12, 16, 22, 35, 100],
             kde=True);

#### Binning data using a `$bucket` pipeline stage

We can also use an aggregation pipeline to "bin" or "bucket" counts into different road length ranges using an aggregation pipeline `$bucket` stage.

For example, we might bin the road lengths into the ranges `[0, 6), [6, 12), [12, 16), [16, 22), [22, 35), [35, 100)`, where the lower bound values are inclusive and the upper bound values are exclusive:

In [None]:
bucket = {'$bucket':
          {'groupBy': "$LenNet",
           'boundaries': [0, 6, 12, 16, 22, 35, 100],
           'output':
               { "count": { '$sum': 1 } }
          }
         }

list(roads_collection.aggregate([bucket]))

We can then cast this to a dataframe and plot a bar chart from the result:

*It's not quite a histogram, but it gets the idea of the distribution across, just as long as you remember that the bars represent different size bins...!*

*Also note that the axes could be labeled a little more clearly. As it stands, this chart is probably okay as a sketch for helping you get a quick overview of the data, but it's not really appropriate as a publication ready chart.*

In [None]:
df = pd.json_normalize(list(roads_collection.aggregate([bucket])))
sns.barplot(x='_id', y='count', color='royalblue', data=df);
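To make the `$bucket` boundary rule concrete, here is a minimal plain-Python sketch of the same binning logic. The values in `toy_lengths` are invented for illustration and `bucket_lower_bound()` is a hypothetical helper. As in `$bucket`, each bucket is identified by its inclusive lower bound, buckets with no members are omitted from the output, and a value outside the boundaries is treated as an error (which is what `$bucket` does when no `default` bucket is specified).

```python
from bisect import bisect_right
from collections import Counter

boundaries = [0, 6, 12, 16, 22, 35, 100]

# Invented lengths, chosen to include values that sit exactly on a boundary
toy_lengths = [0.4, 5.9, 6.0, 11.2, 16.0, 34.9, 35.0]

def bucket_lower_bound(value, boundaries):
    """Return the inclusive lower bound of the [lower, upper) bucket containing value."""
    if not boundaries[0] <= value < boundaries[-1]:
        raise ValueError(f"{value} falls outside the bucket boundaries")
    return boundaries[bisect_right(boundaries, value) - 1]

counts = Counter(bucket_lower_bound(v, boundaries) for v in toy_lengths)

# Mirror the shape of the $bucket output: one document per non-empty bucket
bucketed = [{'_id': lower, 'count': counts[lower]} for lower in sorted(counts)]
print(bucketed)
```

Note that `6.0` lands in the `[6, 12)` bucket rather than `[0, 6)`, and that no entry appears for the empty `[12, 16)` bucket.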

#### Interpreting the data

From the summary statistics and the visualisations, it seems that the majority of road links are very short, with a few that are longer.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to notebook `15.6 Working with roads location data`.