# Working with `roads` location data

In a previous notebook, you saw how to plot markers on an interactive `folium` map using the results of location-based MongoDB queries on the `accidents` collection.

In this notebook, you will see how to run location-based queries as part of an aggregation pipeline, and how to plot the results on a map in a similar way.

Begin by loading in some essential packages:

In [None]:
# Standard imports
import pandas as pd

# Seaborn for charts...
import seaborn as sns

# folium for maps...
import folium

And connect to the database:

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell below to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
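
As a rough sketch of how these parts fit together, the following builds a connection string by hand. The helper name `build_mongo_uri` and the example credentials are hypothetical and are not part of the module software; `urllib.parse.quote_plus` from the standard library does the character escaping of the user name and password.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, user=None, password=None,
                    authsource=None):
    """Assemble a MongoDB connection string (hypothetical helper)."""
    credentials = ""
    if user is not None and password is not None:
        # Escape characters such as '@' or ':' that would otherwise break the URI
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    uri = f"mongodb://{credentials}{host}:{port}/"
    if authsource:
        # Tell the server which database holds the user credentials
        uri += f"?authsource={authsource}"
    return uri


# Local server: no credentials needed
local_uri = build_mongo_uri()

# Remote server: made-up credentials, for illustration only
remote_uri = build_mongo_uri(user="abc123", password="p@ss:word",
                             authsource="user-data")

print(local_uri)
print(remote_uri)
```

In the notebook itself, the value of `MONGO_CONNECTION_STRING` is set for you by the credentials cells, so a helper like this is only needed if you want to see how the pieces combine.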

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

We will also specify the `labels` collection:

In [None]:
labels=accidents_db['labels']

and the `roads` collection:

In [None]:
roads_collection=accidents_db['roads']

We'll be plotting some charts, so increase the default plot size to make things easier to read:

In [None]:
# Set a larger plot size than the default
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Plotting some `roads` census location points

To start with, let's just plot some road segments on the map to see where they are. We'll reuse some of the map-making procedures from a previous notebook, adding a few more parameters to function definitions, where appropriate, to make them more general.

The `add_marker()` function will add a solid, circular marker of specified `color` and `radius` to a provided *folium* map (`m`) at the specified location. The `latitude` and `longitude` parameters can be used to specify the column headings associated with the co-ordinates. An optional tooltip is given as a format string that is filled in from the dataframe row. For example, `tooltip="Road number: {Road}"`.

The `add_marker()` function is typically called by `apply`ing it to each row of a dataframe.

In [None]:
def add_marker(row, m, color='red', radius=50, tooltip=None,
               latitude='Lat', longitude='Lon'):
    """Add a marker to a folium map."""
    # Optionally add a tooltip
    if tooltip and isinstance(tooltip, str):
        tooltip = tooltip.format(**row)
    else:
        tooltip = None
    folium.Circle(location=[row[latitude], row[longitude]], tooltip=tooltip,
                  color=color, radius=radius, fill=True, fill_opacity=1.0).add_to(m)
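
The tooltip handling in `add_marker()` relies on `str.format()` being given the row's values as keyword arguments. Here is a minimal sketch of that step, using a plain dictionary with made-up values in place of a dataframe row:

```python
# A made-up row; a pandas Series can be unpacked with ** in the same way
row = {"Road": "A5", "Lat": 52.0, "Lon": -0.7}

tooltip_template = "Road number: {Road}"

# str.format(**row) fills each {name} in the template from the row's keys
tooltip = tooltip_template.format(**row)

print(tooltip)
```

Any column in the row can be used in the template, so a template such as `"{Road} ({Lat}, {Lon})"` would work equally well for this row.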

The first thing to note about the collection itself is that not every road segment has a location as given by `loc`, latitude and longitude, or Northing and Easting co-ordinates. We may have to bear that in mind when doing geographic analysis of the roads dataset:

In [None]:
locations_exist = roads_collection.count_documents({'loc': {'$exists': True}})
locations_dont_exist = roads_collection.count_documents({'loc': {'$exists': False}})

print(f'''
{locations_exist} documents have location data, \
{locations_dont_exist} documents do not have location data.
''')

Let's start by getting hold of a sample of several hundred road locations using an aggregation pipeline.

We can use a `$project` step to grab the latitude and longitude values explicitly out of the `loc.coordinates` element:

In [None]:
# Select documents with locations
select = {"$match": {'loc': {'$exists': True}}}

# Limit the number of documents we want to retrieve
limit = {'$limit':  500}

# Define a projection on the returned results
roads_project = {"$project": {"CP":1, "Road":1, "_id":0,
                              'Lon': {'$arrayElemAt': ['$loc.coordinates', 0]},
                              'Lat': {'$arrayElemAt': ['$loc.coordinates', 1]},
                              'loc.coordinates':1
                             }}

# Create the pipeline
pipeline = [select, limit, roads_project]

# Run the pipeline
sampled_locations = pd.json_normalize(list(roads_collection.aggregate(pipeline)))

# Set the count point as the index
sampled_locations.set_index('CP', inplace=True)

sampled_locations.head()
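
To make the projection more concrete, the sketch below mimics in plain Python what the two `$arrayElemAt` expressions do to a GeoJSON point. The document is invented for illustration and the helper `project_lon_lat` is hypothetical; MongoDB carries out the real projection on the server.

```python
def project_lon_lat(doc):
    """Mimic the Lon/Lat part of the $project stage (illustrative helper)."""
    coordinates = doc["loc"]["coordinates"]
    return {
        "CP": doc["CP"],
        "Road": doc["Road"],
        # GeoJSON stores points as [longitude, latitude]
        "Lon": coordinates[0],   # {'$arrayElemAt': ['$loc.coordinates', 0]}
        "Lat": coordinates[1],   # {'$arrayElemAt': ['$loc.coordinates', 1]}
    }


# An invented road census document
example_doc = {"CP": 1234, "Road": "A5",
               "loc": {"type": "Point", "coordinates": [-0.75, 52.04]}}

projected = project_lon_lat(example_doc)
print(projected)
```

The point to remember is the ordering: longitude comes first in the stored co-ordinates, whereas `folium` expects `[latitude, longitude]`.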

We can plot these on a map by applying the `add_marker` function to each row of the dataframe:

In [None]:
# Convert to a plain [lat, lon] list so folium does not index the Series by position
AVERAGE_LOCATION = sampled_locations[['Lat', 'Lon']].median().tolist()
    
m = folium.Map(location=AVERAGE_LOCATION,
               width=500, height=800, zoom_start=6)

sampled_locations.apply(add_marker, m=m, color='blue', axis=1)
m

When I ran the query, it clearly showed that the road data covers Great Britain (England, Scotland and Wales), but nothing in Northern Ireland.

## Using a pipeline to find nearby road census locations

Suppose that we wanted to use a pipeline to look up road census locations around, for example, Milton Keynes.

Recall that we can look up the co-ordinates of a specific location using a geocoding service:

In [None]:
import geopy

geocoder = geopy.Nominatim(user_agent="tm351-geocoding")

mk_geo = geocoder.geocode("The Open University, Walton Hall, Milton Keynes, UK")

mk_geo.longitude, mk_geo.latitude
# (-0.7092748093945007, 52.02453775)

Create a simpler reference for those co-ordinates:

In [None]:
ou_lonlat = [mk_geo.longitude, mk_geo.latitude]

Having got a target location in hand, let's now try to search around it.

Amongst its various geo-query tools, MongoDB provides a way for us to search for the nearest location using a `$geoNear` aggregation operation ([docs](https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/)) as the first step in a pipeline operation.

The stage returns documents ordered from nearest to farthest from a specified point. When the point is given as GeoJSON, distances are in meters:

```python
'$geoNear': {
             'near': <GeoJSON_point in the form { type: "Point", coordinates: [ 0, 50 ] }>,
             'distanceField': fieldname, # specifies a new field containing the distance found
             'spherical' :True
} 
```

An optional `maxDistance` parameter sets the maximum distance, in meters, that a returned document can be from the specified point.

An optional `query` parameter allows you to pass in a query that limits the documents that are used as the basis for discovering nearby locations.
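
To get a feel for what "nearest first, within a maximum distance" means, here is a small pure Python sketch. It is not how MongoDB implements `$geoNear` (the server uses a geospatial index); it approximates spherical distance with the haversine formula, and the Earth radius, the helper names and the sample points are all assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius (assumption of this sketch)


def haversine_m(lonlat_a, lonlat_b):
    """Great-circle distance in meters between two [lon, lat] points."""
    lon_a, lat_a, lon_b, lat_b = map(radians, [*lonlat_a, *lonlat_b])
    h = (sin((lat_b - lat_a) / 2) ** 2
         + cos(lat_a) * cos(lat_b) * sin((lon_b - lon_a) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def geo_near(docs, near, max_distance=None):
    """Return docs sorted nearest first, each with a 'distance' field added."""
    results = [{**doc, "distance": haversine_m(near, doc["coordinates"])}
               for doc in docs]
    if max_distance is not None:
        # Drop anything farther away than the maximum distance
        results = [doc for doc in results if doc["distance"] <= max_distance]
    return sorted(results, key=lambda doc: doc["distance"])


# Invented points, given as [longitude, latitude]
docs = [{"name": "far", "coordinates": [-0.70, 52.20]},
        {"name": "near", "coordinates": [-0.71, 52.03]},
        {"name": "middle", "coordinates": [-0.75, 52.04]}]

nearby = geo_near(docs, near=[-0.709, 52.025], max_distance=5000)
print([(doc["name"], round(doc["distance"])) for doc in nearby])
```

The added `distance` key plays the role of the field named by `distanceField`, and `max_distance` plays the role of `maxDistance`.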

The following example finds the road census points in the `roads` collection that lie within 5 km of the specified location, nearest first.

In [None]:
near_query = {'$geoNear': {'near': {'type': "Point",
                                    'coordinates': [ mk_geo.longitude, mk_geo.latitude ]},
                           'distanceField': 'distance',
                           'maxDistance': 5000,
                           'spherical': True}}

# Add the distance field
roads_project["$project"]['distance'] = 1

# Create the pipeline
pipeline = [near_query, limit, roads_project]

# Run the pipeline
mk_5km_census = pd.json_normalize(list(roads_collection.aggregate(pipeline)))
mk_5km_census.head(3)
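
If you find yourself writing several `$geoNear` stages, it can help to build them with a small function. The sketch below only assembles a dictionary, so it can be run without a database connection; the helper name `make_geonear_stage` and the example filter on road `A5` are hypothetical.

```python
def make_geonear_stage(lonlat, max_distance=None, query=None,
                       distance_field="distance"):
    """Build a $geoNear stage as a plain dictionary (hypothetical helper)."""
    stage = {"near": {"type": "Point", "coordinates": list(lonlat)},
             "distanceField": distance_field,
             "spherical": True}
    if max_distance is not None:
        stage["maxDistance"] = max_distance   # in meters
    if query is not None:
        stage["query"] = query                # restricts the candidate documents
    return {"$geoNear": stage}


# Example: points within 5 km, restricted to one (made-up) road number
stage = make_geonear_stage([-0.709, 52.025], max_distance=5000,
                           query={"Road": "A5"})
print(stage)
```

The resulting dictionary could then be used as the first element of a pipeline list, in the same position as `near_query` in the cell above.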

Let's plot these road census locations on a map.

The projection already includes the road number (`Road`), so we can also add a tooltip that displays the road number when we hover the cursor over a map marker.

In [None]:
# Define a template for the tooltip
# In this case, just display the road number
tooltip = '{Road}'

# Convert to a plain [lat, lon] list so folium does not index the Series by position
AVERAGE_LOCATION = mk_5km_census[['Lat', 'Lon']].median().tolist()
    
m = folium.Map(location=AVERAGE_LOCATION,
               width=600, height=600, zoom_start=10)

mk_5km_census.apply(add_marker, m=m, color='blue',
                    tooltip=tooltip, axis=1)

m

Let's also add some accidents to the mix:

In [None]:
# Define a projection on the returned results
accidents_project = {"$project": {"Accident_Index":1, "distance":1, "_id":0,
                                  "coords": "$loc.coordinates",
                                  'Lon': {'$arrayElemAt': ['$loc.coordinates', 0]},
                                  'Lat': {'$arrayElemAt': ['$loc.coordinates', 1]}}}


# Create the pipeline
pipeline = [near_query, limit, accidents_project]


# Run the pipeline
mk_5km_accidents = pd.json_normalize(list(accidents_collection.aggregate(pipeline)))
mk_5km_accidents.head(3)

We can overplot these on the same map as the road census locations:

In [None]:
mk_5km_accidents.apply(add_marker, m=m, color='red',
                       tooltip='{Accident_Index}', axis=1)

m

The map suggests that not all roads have traffic monitoring locations, or traffic flow data, associated with them.

## Looking up road census points for multiple accidents 

The two helper functions below wrap the `$geoNear` stage so that it can be run once for a single point, or once for each row of a dataframe. The per-row results are then combined and de-duplicated on the count point (`CP`).

In [None]:
def get_nearby_locations(collection, coords, projection=None, maxdist=500):
    """Get locations near a particular location from a specified collection."""
    _nearby_query = {'$geoNear': {'near': {'type': "Point",
                                             'coordinates': coords},
                                    'distanceField': 'distance',
                                    'spherical': True,
                                    'maxDistance': maxdist}}
    
    _pipeline =  [_nearby_query] if projection is None else [_nearby_query, projection]
    return pd.json_normalize(list(collection.aggregate(_pipeline)))


# Look up accidents (not road census points) within 2 km of the OU
near_the_OU_accidents = get_nearby_locations(accidents_collection, ou_lonlat,
                                             projection=accidents_project, maxdist=2000)
near_the_OU_accidents

In [None]:
def row_find_nearby_locations(row, collection, coords_col, projection=None, maxdist=500):
    """Look up nearby locations in a collection from a document."""
    return get_nearby_locations(collection, row[coords_col],
                                projection=projection, maxdist=maxdist)


combined_roads_locations = near_the_OU_accidents.apply(row_find_nearby_locations, collection=roads_collection,
                                                          coords_col='coords',
                                                          projection=roads_project, axis=1)


unique_roads_locations = pd.concat(combined_roads_locations.to_list()).drop_duplicates(subset=['CP'])
unique_roads_locations

In [None]:
unique_roads_locations.apply(add_marker, m=m, color='blue', axis=1)

m

## What next?

This completes the notebooks for part 15.