# Searching within a geographical area

In previous notebooks, you have seen how we can search for accident records based on the presence of a local highway authority code. Searching within predefined administrative geographies using such codes, or other geography codes (such as local authority district codes, `Local_Authority_(District)`), is often very convenient, but sometimes we need to be more specific and search around a particular location, as described by its co-ordinates.

The field of geographical information systems (GIS) is a large one, with many techniques for searching in a location-sensitive way. The simple approaches to location-based searching introduced in this notebook should give you a first inkling of what is possible within such systems.

First, we need to include some essential Python packages for connecting to the MongoDB and reshaping the data we might retrieve from it. We're also going to work with time as well as location data...

In [None]:
# Standard imports
import pandas as pd

import seaborn as sns

import folium

Open a connection to the MongoDB database and define some references to the required collections:

## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate cell in the following sections to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
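
As a rough illustration of how these parts fit together, the sketch below assembles a connection string from its components. The helper name `build_mongo_uri` and the example credentials are hypothetical, and the string set up in your own environment may have been built differently. The standard library function `urllib.parse.quote_plus` is used here to escape characters such as `@` or `:` in the user name and password.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, password=None, authsource=None):
    """Assemble a MongoDB connection string (hypothetical helper)."""
    credentials = ""
    if user is not None and password is not None:
        # Escape reserved characters in the user name and password
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    uri = f"mongodb://{credentials}{host}:{port}/"
    if authsource is not None:
        # Tell MongoDB which database holds the authentication details
        uri += f"?authsource={authsource}"
    return uri


# Local server: no credentials needed
print(build_mongo_uri("localhost", 27017))
# mongodb://localhost:27017/

# Remote server: made-up credentials, with the password escaped
print(build_mongo_uri("localhost", 27017, "abc123", "p@ss:word", "user-data"))
# mongodb://abc123:p%40ss%3Aword@localhost:27017/?authsource=user-data
```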

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

We will also specify the `labels` collection:

In [None]:
labels=accidents_db['labels']

We'll be plotting some charts, so increase the default plot size to make things easier to read:

In [None]:
# Set a larger plot size than the default
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Creating Spatial Geo-Data Indexes in MongoDB

As with many database management systems, MongoDB provides a certain level of support for spatial datasets.

To demonstrate this, we will create a small test collection in the database `DB_NAME`, as described in notebook `14.1 Basic CRUD`.

A new collection can be created with a spatial index declared on a particular field in the following way:

In [None]:
from pymongo import GEO2D

# Set up a test collection called geodemo
geo_collection = mongo_client[DB_NAME]['geodemo']

# Create a collection within geodemo with a spatial 
# index relating to a specified field
geo_collection['locations'].create_index([("location", GEO2D)])

Add a couple of test records:

In [None]:
geo_collection['locations'].insert_one({'location': [ -0.7589607, 52.0429797],
                             'name': 'Milton Keynes'})

geo_collection['locations'].insert_one({'location':  [ -0.1276474, 51.5073219],
                             'name': 'London'})

We can now perform a simple query to look up records that are "near" to a specified location.

For example:

In [None]:
geo_collection['locations'].find_one({"location": {"$near": [-0.75, 52]}})

Drop the collection to leave things a bit tidier:

In [None]:
mongo_client[DB_NAME].drop_collection('geodemo.locations')

## Searching By Geographical Location

In situations where we are provided with datasets containing geographical area codes, it is trivial to search for records associated with a particular area code. But what if we want to search within a certain distance of a specified location, or within a particular geographical area defined by a shapefile or boundaryfile?

*Location based searches are performed against spatially indexed collections. You can check what indexes exist over a collection by running a command of the form `COLLECTION.index_information()`, such as `roads.index_information()`.*

Let's zoom in a bit on Milton Keynes, the home of the Open University. Recall that we can search for records contained within a specified closed polygon, or within a distance of a specified location.

You have already seen how we can run a simple `{"location": {"$near": [LON, LAT]}}` query, but other types of query are available too.

Bounded area queries take the form:

```python
{'loc': {'$geoWithin': {'$geometry': geojson_shape}}}
```
And distance based queries take the form:

```python
{'loc': {'$nearSphere':
           {'$geometry': 
             {'type': 'Point', 
              'coordinates': [lon, lat]},
              '$maxDistance': 2000}}}  # distance in meters
```
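
To get a feel for what a distance based query does, the following sketch computes great-circle distances in plain Python using the haversine formula and keeps only the points that lie within a given radius. It is an illustration that assumes a spherical Earth with a mean radius of 6,371 km, not a description of how MongoDB implements `$nearSphere`. The function and variable names are made up for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_distance(points, centre, max_distance_m):
    """Keep the [lon, lat] points no further than max_distance_m from centre."""
    return [p for p in points if haversine_m(*centre, *p) <= max_distance_m]


mk_point = [-0.7589607, 52.0429797]      # Milton Keynes
london_point = [-0.1276474, 51.5073219]  # London

# Distance between the two test locations, in kilometres
print(round(haversine_m(*mk_point, *london_point) / 1000, 1), "km")

# Only Milton Keynes lies within 10km of the point [-0.75, 52]
print(within_distance([mk_point, london_point], [-0.75, 52], 10_000))
```

A database performs this kind of filtering against a spatial index rather than by testing every record in turn, which is why the index matters for large collections.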

We can also search within a rectangular bounded area using a construction of the form:

```python
{'$geoWithin': {
        '$box': [ [bottom_left_lon, bottom_left_lat],
                  [upper_right_lon, upper_right_lat] ]
        }
}
```
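
As a sketch of how such a rectangular query might be assembled, the hypothetical helper below builds the complete query dictionary for a named field from the two corner points. The helper name `box_query` and the field name `location` are assumptions made for illustration, and the corner values are simply a rough box around Central Milton Keynes.

```python
def box_query(field, bottom_left, upper_right):
    """Build a $geoWithin/$box query for the given field (hypothetical helper).

    Both corners are [longitude, latitude] pairs.
    """
    if bottom_left[0] > upper_right[0] or bottom_left[1] > upper_right[1]:
        raise ValueError("bottom_left must be below and to the left of upper_right")
    return {field: {'$geoWithin': {'$box': [list(bottom_left),
                                            list(upper_right)]}}}


# The resulting dictionary could be passed to a collection's find() method
print(box_query('location', [-0.78, 52.02], [-0.7, 52.08]))
```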

*Additional geographical search phrases are allowed. See the [`$geoWithin` documentation](https://docs.mongodb.com/manual/reference/operator/query/geoWithin/) for details.*

Let's search within a specified distance of a central location.

You can find the location of a particular address using a *geocoder*. The [`geopy`](https://geopy.readthedocs.io) Python package, which is installed in the TM351 VCE, provides access to a range of geocoding services, including [Nominatim](https://geopy.readthedocs.io/en/stable/#nominatim), the OpenStreetMap geocoding service.

Let's focus our search on Milton Keynes (MK), and in particular on the location of the Open University campus at Walton Hall.

In [None]:
import geopy
geocoder = geopy.Nominatim(user_agent="tm351-geocoding")

mk_geo = geocoder.geocode("The Open University, Walton Hall, Milton Keynes, UK")

mk_geo

We can access the latitude and longitude co-ordinates directly from the response object:

In [None]:
mk_geo.longitude, mk_geo.latitude
# If there are problems with the geocoder, use: (-0.7092748093945007, 52.02453775)

To plot all the accidents that occurred within 10km of the OU campus, we need to find those accidents. The following MongoDB query searches for records within a specified distance in meters (`$maxDistance`) of a specified location.

Note that the MongoDB query expects the geographical co-ordinates to be provided as `[longitude, latitude]` whereas the `folium` map, for example, expects map centering co-ordinates in the form `[latitude, longitude]`. (Which is to say - always read the docs to make sure you present the co-ordinates in the required order...)
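
One way to reduce the risk of mixing up the two conventions is to keep each location in a single named structure and convert explicitly at the point of use. The sketch below is only a suggestion: the `GeoPoint` class and its method names are hypothetical, and the co-ordinates are the fallback values for the OU campus given in the comment above.

```python
from typing import NamedTuple


class GeoPoint(NamedTuple):
    """A location with named latitude and longitude fields (hypothetical helper)."""
    lat: float
    lon: float

    def as_mongo(self):
        """Co-ordinates in the [longitude, latitude] order used in MongoDB queries."""
        return [self.lon, self.lat]

    def as_folium(self):
        """Co-ordinates in the [latitude, longitude] order used by folium."""
        return [self.lat, self.lon]


ou_campus = GeoPoint(lat=52.02453775, lon=-0.7092748093945007)

print(ou_campus.as_mongo())
# [-0.7092748093945007, 52.02453775]
print(ou_campus.as_folium())
# [52.02453775, -0.7092748093945007]
```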

In [None]:
query = {'loc': 
          {'$nearSphere':
           {'$geometry': 
            {'type': 'Point', 
             'coordinates': [mk_geo.longitude, mk_geo.latitude]},
            '$maxDistance': 10000}}}

mk_accidents = pd.json_normalize(list(accidents_collection.find(query,
                                                        ['loc.coordinates',
                                                         'Accident_Index'])))

# Set an appropriate index to stash the columns we want to preserve
mk_accidents.set_index(['Accident_Index'], inplace=True)

# Split the latitude and longitude co-ordinates into separate columns
mk_accidents = mk_accidents['loc.coordinates'].apply(pd.Series)
mk_accidents.columns = ['Lon', 'Lat']

# Reset the index to unpack the stashed values
mk_accidents.reset_index(inplace=True)

mk_accidents.head()

To plot the accidents on the map, we will use a variation of the `add_marker` function that we defined in notebooks `02.2.3  Data file formats - other` and `5.2 Getting started with maps - folium`:

In [None]:
def add_marker(row, fmap):
    """Add a marker to a specific map."""
    
    lat = row['Lat']
    lon = row['Lon']
    
    folium.Circle(location=[lat, lon], radius=10,
                  color='red',
                  fill_opacity=0.8).add_to(fmap)

Now plot the accidents:

In [None]:
# Use the median accident location, as a plain [lat, lon] list, to centre the map
AVERAGE_LOCATION = mk_accidents[['Lat', 'Lon']].median().tolist()

m = folium.Map(AVERAGE_LOCATION, zoom_start=10)

mk_accidents.apply(add_marker, fmap=m, axis=1)

m

### Activity 1

Plot the accidents within 15km of your home or workplace.

Do the accidents occur where you would expect?

*Before generating the map, use your personal local knowledge to record in this cell two or three locations where you might expect accident hotspots to be.*

In [None]:
# Insert your solution here.

# You may find it useful to use several code cells to structure your answer.

*From the locations plotted on the map, are the accidents concentrated where you expected? Did anything about the distribution of the accidents surprise you? (Remember, the dataset relates to accidents several years ago.)*

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Let's see what accidents there were within 10km of Newport, in the centre of the Isle of Wight. (I wonder in advance if there are any accidents from the car ferry that are located in the middle of the Solent!)

We can look up some central co-ordinates using the Nominatim geocoder:

In [None]:
newport_iw = geocoder.geocode("Newport, Isle of Wight, UK")
newport_iw

Look up the accidents:

In [None]:
iw_region = {'loc':
             {'$nearSphere':
              {'$geometry':
               {'type': 'Point',
                'coordinates': [newport_iw.longitude, newport_iw.latitude]},
                '$maxDistance': 10000}}}

iw_accidents = pd.json_normalize(list(accidents_collection.find(iw_region,
                                                         ['loc.coordinates',
                                                          'Accident_Index'])))

# Set an appropriate index to stash the columns we want to preserve
iw_accidents.set_index(['Accident_Index'], inplace=True)

# Split the latitude and longitude co-ordinates into separate columns
iw_accidents = iw_accidents['loc.coordinates'].apply(pd.Series)
iw_accidents.columns = ['Lon', 'Lat']

# Reset the index to unpack the stashed values
iw_accidents.reset_index(inplace=True)

iw_accidents.head()

And plot them:

In [None]:
m = folium.Map([newport_iw.latitude, newport_iw.longitude], zoom_start=11)

# And plot the accidents
iw_accidents.apply(add_marker, fmap=m, axis=1)

m

####  End of Activity 1

-----------------------------------

## Searching within an arbitrarily defined area

We can also search for points within a bounded area defined as a closed polygon made up from a list of longitude-latitude pairs, with the first co-ordinate pair matching the last co-ordinate pair to ensure the shape is closed.

The following list defines a crude bounding rectangle around Central Milton Keynes. We'll use it to filter the set of accidents more closely.

In [None]:
mk_area = [[-0.78, 52.08],
           [-0.7, 52.08],
           [-0.7, 52.02],
           [-0.78, 52.02],
           [-0.78, 52.08]]

The shape can also be used as part of a minimal `geojson` datastructure to represent the area as a geographical object.

*`geojson` is a simple JSON flavoured datastructure for describing lines and shapes for geographical use.*

The geojson object is recognised both by MongoDB for making area based queries, which means we can search for locations within the area, and by `folium`, which means we can display it on a map.

In [None]:
milton_keynes = {'type': 'Polygon',
                 'coordinates': [mk_area]}
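
Since the first and last co-ordinate pairs must match, it can be convenient to let a small function close the ring for us. The helper below is a hypothetical sketch rather than part of any library: it wraps a list of `[longitude, latitude]` pairs as a GeoJSON `Polygon` and appends the first pair to the end if the ring is not already closed.

```python
def make_polygon(ring):
    """Wrap [lon, lat] pairs as a GeoJSON Polygon, closing the ring if needed."""
    ring = [list(pair) for pair in ring]
    if len(ring) < 3:
        raise ValueError("a polygon needs at least three distinct corners")
    if ring[0] != ring[-1]:
        # Repeat the first corner at the end so the shape is closed
        ring.append(list(ring[0]))
    return {'type': 'Polygon', 'coordinates': [ring]}


# The four corners of the Central Milton Keynes rectangule, left unclosed
open_ring = [[-0.78, 52.08], [-0.7, 52.08], [-0.7, 52.02], [-0.78, 52.02]]

polygon = make_polygon(open_ring)
print(len(polygon['coordinates'][0]))
# 5
```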

As an alternative to using the [`$nearSphere`](https://docs.mongodb.com/manual/reference/operator/query/nearSphere/) query, we can search within the area using the following MongoDB construction, which is based on the [`$geoWithin`](https://docs.mongodb.com/manual/reference/operator/query/geoWithin/) search filter:

In [None]:
mk_region_query = {'loc': {'$geoWithin': {'$geometry': milton_keynes}}}

mk_region_accidents = pd.json_normalize(list(accidents_collection.find(mk_region_query,
                                                         ['loc.coordinates',
                                                          'Accident_Index'])))

# Set an appropriate index to stash the columns we want to preserve
mk_region_accidents.set_index(['Accident_Index'], inplace=True)

# Split the latitude and longitude co-ordinates into separate columns
mk_region_accidents = mk_region_accidents['loc.coordinates'].apply(pd.Series)
mk_region_accidents.columns = ['Lon', 'Lat']

# Reset the index to unpack the stashed values
mk_region_accidents.reset_index(inplace=True)

mk_region_accidents.head()
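
To make the idea of an "inside the shape" test more concrete, the sketch below implements the classic ray-casting check in plain Python. It treats the co-ordinates as points on a flat plane, which is an assumption that is only reasonable for small areas, and it is not a description of how MongoDB evaluates `$geoWithin`. The function name and the test points are chosen for illustration.

```python
def point_in_polygon(lon, lat, ring):
    """Planar ray-casting test for a point against a closed ring of [lon, lat] pairs."""
    inside = False
    # Walk each edge of the closed ring in turn
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does a horizontal ray from the point cross this edge's latitude range?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# The closed rectangle around Central Milton Keynes
ring = [[-0.78, 52.08], [-0.7, 52.08], [-0.7, 52.02],
        [-0.78, 52.02], [-0.78, 52.08]]

print(point_in_polygon(-0.7092748, 52.0245378, ring))  # OU campus
# True
print(point_in_polygon(-0.1276474, 51.5073219, ring))  # London
# False
```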

To centre the map, we can find the midpoint of the search area by inspection of the bounding box co-ordinates:

In [None]:
mk_centre_lat = (52.02 + 52.08) / 2
mk_centre_lon = (-0.78 + -0.7) / 2
mk_centre = [mk_centre_lat, mk_centre_lon]
mk_centre
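
Rather than typing the corner values in by hand, the same midpoint can be derived from the list of polygon corners. The function below is a hypothetical sketch: it takes the `[longitude, latitude]` pairs of a ring and returns the centre of their bounding box in the `[latitude, longitude]` order that `folium` expects.

```python
def bbox_centre(ring):
    """Return the [lat, lon] midpoint of the bounding box around [lon, lat] pairs."""
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return [(min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2]


area = [[-0.78, 52.08], [-0.7, 52.08], [-0.7, 52.02],
        [-0.78, 52.02], [-0.78, 52.08]]

# Approximately [52.05, -0.74], subject to floating point rounding
print(bbox_centre(area))
```

For an irregular shape the bounding box midpoint is not necessarily inside the polygon, but it is usually good enough for centring a map.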

Let's plot a bounding box showing the extent of the search area and then use the shape bounded query to find and plot markers associated with accidents located within it:

In [None]:
m = folium.Map(location=mk_centre, width=500, height=800, zoom_start=12)

# Plot the bounding box
folium.GeoJson(milton_keynes).add_to(m)

# And plot the accidents
mk_region_accidents.apply(add_marker, fmap=m, axis=1)

m

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `15.3 Introducing aggregation pipelines`.