## Udacity Data Engineering Capstone-Project
The UK Police publish open data about police forces, crimes and outcomes in England, Wales and 
Northern Ireland at [data.police.uk](https://data.police.uk). The data can be filtered by region, city, neighbourhood, 
coordinates, officer name, crime category, etc. within the United Kingdom. 

The purpose of this notebook is to explore the crime data exposed by the UK Police API.
Following the API [documentation](https://data.police.uk/docs/), the list of forces is explored first.

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from pyspark.sql import SparkSession

### Forces Related
The following force-related responses from the API are explored:
1. Forces
2. Specific forces
3. Senior Officers

#### 1. Forces
A call to this API returns a list of all the police forces available via the API, except the British Transport Police, which is excluded from the list. The unique force identifiers obtained here are used in requests for force-specific data via other methods.

In [2]:
forces = requests.get('https://data.police.uk/api/forces').json()

In [3]:
forces[0]

{'id': 'avon-and-somerset', 'name': 'Avon and Somerset Constabulary'}

As can be seen, the request for the list of forces returns only the `id` and `name` of each force.
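Since the force `id` is the key to the other force-specific requests, a minimal sketch of how it plugs into the endpoint URLs used later in this notebook is given here. The helper name `force_endpoints` and the constant `BASE_URL` are illustrative assumptions, not part of the API.

```python
BASE_URL = "https://data.police.uk/api"


def force_endpoints(force_id):
    """Return the force-specific URLs explored in this notebook (hypothetical helper)."""
    return {
        "force": f"{BASE_URL}/forces/{force_id}",
        "people": f"{BASE_URL}/forces/{force_id}/people",
        "neighbourhoods": f"{BASE_URL}/{force_id}/neighbourhoods",
    }


# First record returned by the forces request above
forces_sample = [{"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"}]
for force in forces_sample:
    print(force_endpoints(force["id"])["people"])
```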

#### 2. Specific forces
A call to this API returns the description of a specific force.
According to the documentation, the `id` of the force is needed to request this information.

In [4]:
specific_force = requests.get('https://data.police.uk/api/forces/leicestershire').json()

In [5]:
specific_force

{'description': None,
 'url': 'http://www.leics.police.uk/',
 'engagement_methods': [{'url': 'http://www.facebook.com/leicspolice',
   'type': 'facebook',
   'description': None,
   'title': 'facebook'},
  {'url': 'http://www.twitter.com/leicspolice',
   'type': 'twitter',
   'description': None,
   'title': 'twitter'},
  {'url': 'http://www.youtube.com/leicspolice',
   'type': 'youtube',
   'description': None,
   'title': 'youtube'},
  {'url': 'https://www.leics.police.uk/news/leicestershire/news/GetNewsRss/',
   'type': 'rss',
   'description': None,
   'title': 'rss'},
  {'url': '', 'type': 'telephone', 'description': None, 'title': 'telephone'}],
 'telephone': '101',
 'id': 'leicestershire',
 'name': 'Leicestershire Police'}

As can be seen from the results above and from the documentation, this record contains the following fields:

* **description**: Description of the force
* **url**: Force website URL
* **engagement_methods**: Ways to keep informed.
    * **url**: Method website URL
    * **type**: Method type (e.g. `facebook`, `twitter`)
    * **description**: Method description
    * **title**: Method title
* **telephone**: Force telephone number
* **id**: Unique Force identifier
* **name**: Force name

The `engagement_methods` field is a list of nested records. During the ETL process it will be unnested, keeping only `url` and `title`.
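The unnesting described above could look roughly like the sketch below. It is only an illustration, not the final ETL code: the function name `unnest_engagement_methods` and the extra `force_id` column are assumptions, and the sample is a trimmed copy of the Leicestershire response shown earlier.

```python
def unnest_engagement_methods(force):
    """Return one flat row per engagement method, keeping only url and title."""
    return [
        {"force_id": force["id"], "title": method.get("title"), "url": method.get("url")}
        for method in force.get("engagement_methods") or []
    ]


# Trimmed copy of the response for the 'leicestershire' force
sample_force = {
    "id": "leicestershire",
    "name": "Leicestershire Police",
    "engagement_methods": [
        {"url": "http://www.facebook.com/leicspolice", "type": "facebook",
         "description": None, "title": "facebook"},
        {"url": "", "type": "telephone", "description": None, "title": "telephone"},
    ],
}

engagement_rows = unnest_engagement_methods(sample_force)
print(engagement_rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame` if a tabular view is needed.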

#### 3. Senior Officers
A call to this API returns the senior officers of a specific force.
According to the documentation, the `id` of the force is needed to request its senior officers.

In [6]:
senior_officer = requests.get('https://data.police.uk/api/forces/leicestershire/people').json()

In [7]:
pd.DataFrame(senior_officer)

| | bio | contact_details | name | rank |
|---|---|---|---|---|
| 0 | `<p>Simon Cole QPM took up his position as Chie...` | `{'twitter': 'http://www.twitter.com/CCLeicsPol...` | Simon Cole | Chief Constable |
| 1 | `<p>Rob has served with Leicestershire Police f...` | `{'twitter': 'http://www.twitter.com/DCCLeicsPo...` | Rob Nixon | Deputy Chief Constable |
| 2 | `<p>Julia Debenham joined Leicestershire Police...` | `{'twitter': 'http://www.twitter.com/AccLeicsPo...` | Julia Debenham | Assistant Chief Constable |
| 3 | `<p>David Sandall has served with Leicestershir...` | `{}` | David Sandall | Assistant Chief Constable |


In [8]:
pd.DataFrame(list(pd.DataFrame(senior_officer).contact_details))

| | twitter |
|---|---|
| 0 | http://www.twitter.com/CCLeicsPolice |
| 1 | http://www.twitter.com/DCCLeicsPolice |
| 2 | http://www.twitter.com/AccLeicsPolice |
| 3 | |


The results above show that the returned schema differs from the documentation, so let's request the senior officers of every force to see all the fields that actually occur.

In [9]:
senior_officers = []
for force in forces:
    try:
        # Fetch the senior officers of one force and add them to the combined list
        officers = requests.get(f"https://data.police.uk/api/forces/{force['id']}/people").json()
        senior_officers.extend(officers)
    except Exception as e:
        # Report the failure and carry on with the next force
        print(e)

In [10]:
pd.DataFrame(senior_officers).head()

| | bio | contact_details | name | rank |
|---|---|---|---|---|
| 0 | | `{}` | Darren Martland | Chief Constable |
| 1 | | `{}` | Jo Farrell | Chief Constable |
| 2 | | `{}` | Dave Orford | Deputy Chief Constable |
| 3 | | `{}` | Gary Ridley | Assistant Chief Officer |
| 4 | | `{}` | Dale Checksfield | Special Chief Officer |


In [11]:
pd.DataFrame(list(pd.DataFrame(senior_officers).contact_details))

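| | twitter | telephone | website | email |
|---|---|---|---|---|
| 0 | | | | |
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| ... | ... | ... | ... | ... |
| 127 | | | | |
| 128 | | | | |
| 129 | | | | |
| 130 | | | | |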


As can be seen from the results above, `senior_officers` contains the following fields:
* **bio**: Senior officer biography (if available)
* **contact_details**: Contact details for the senior officer
    * **twitter**: Twitter profile URL
    * **telephone**: Telephone number
    * **website**: Website address
    * **email**: Email address
* **name**: Name of the person
* **rank**: Force rank

The `contact_details` field is a nested dictionary. During the ETL process it will be unnested into the columns `twitter`, `telephone`, `website` and `email`.
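A minimal sketch of that unnesting step follows, under the assumption that missing contact keys should simply become `None`. The names `flatten_officer` and `CONTACT_FIELDS` and the added `force_id` column are hypothetical; the sample records reuse values from the Leicestershire output above, with the truncated `bio` omitted.

```python
CONTACT_FIELDS = ("twitter", "telephone", "website", "email")


def flatten_officer(officer, force_id):
    """Flatten one senior-officer record into a single-level dictionary."""
    contact = officer.get("contact_details") or {}
    row = {
        "force_id": force_id,  # assumption: keep track of which force the officer belongs to
        "name": officer.get("name"),
        "rank": officer.get("rank"),
        "bio": officer.get("bio"),
    }
    # Every contact column is always present, even when the API omits the key
    for field in CONTACT_FIELDS:
        row[field] = contact.get(field)
    return row


sample_officers = [
    {"name": "Simon Cole", "rank": "Chief Constable",
     "contact_details": {"twitter": "http://www.twitter.com/CCLeicsPolice"}},
    {"name": "David Sandall", "rank": "Assistant Chief Constable", "contact_details": {}},
]

officer_rows = [flatten_officer(officer, "leicestershire") for officer in sample_officers]
print(officer_rows[0]["twitter"])
```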

### Neighbourhood Related
The following neighbourhood-related responses from the API are explored:
1. Neighbourhoods
2. Specific neighbourhood
3. Neighbourhood boundary

#### 1. Neighbourhoods
A call to this API returns the list of neighbourhoods for a force. According to the documentation, the `id` of the force is needed to request its neighbourhoods.

In [12]:
neighborhoods = requests.get("https://data.police.uk/api/leicestershire/neighbourhoods").json()

In [13]:
pd.DataFrame(neighborhoods).head()

| | id | name |
|---|---|---|
| 0 | NC04 | City Centre |
| 1 | NC66 | Cultural Quarter |
| 2 | NC67 | Riverside |
| 3 | NC68 | Clarendon Park |
| 4 | NE09 | Belgrave South |


As can be seen from the results above and from the documentation, this record contains the following fields:

* **id**: Police force specific team identifier.
* **name**: Name for the neighbourhood

#### 2. Specific neighbourhood
A call to this API returns the details of a specific neighbourhood of a force. According to the documentation, both the `id` of the neighbourhood and the `id` of the force are needed for this request.

In [14]:
specific_neighborhood = requests.get("https://data.police.uk/api/leicestershire/NC04").json()

In [15]:
specific_neighborhoods = []
for neighborhood in neighborhoods[:10]:  # only the first 10 neighbourhoods are sampled
    try:
        specific_neighborhood = requests.get(f"https://data.police.uk/api/leicestershire/{neighborhood['id']}").json()
        specific_neighborhoods.append(specific_neighborhood)
    except Exception as e:
        # Report the failure and carry on with the next neighbourhood
        print(e)

In [16]:
pd.DataFrame(specific_neighborhoods).head()

| | url_force | contact_details | name | links | centre | locations | description | id | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | `http://www.leics.police.uk/local-policing/city...` | `{'twitter': 'http://www.twitter.com/centrallei...` | City Centre | `[{'url': 'http://www.leicester.gov.uk/', 'desc...` | `{'latitude': '52.6389', 'longitude': '-1.13619'}` | `[{'name': 'Mansfield House (Leicester)', 'long...` | `<p>The Castle neighbourhood is a diverse cover...` | NC04 | 0 |
| 1 | `http://www.leics.police.uk/local-policing/cult...` | `{'twitter': 'http://twitter.com/centralleicsNP...` | Cultural Quarter | `[]` | `{'latitude': '52.6337', 'longitude': '-1.12435'}` | `[{'name': 'Mansfield House (Leicester)', 'long...` | | NC66 | 0 |
| 2 | `http://www.leics.police.uk/local-policing/rive...` | `{'twitter': 'http://twitter.com/centralleicsNP...` | Riverside | `[]` | `{'latitude': '52.6274', 'longitude': '-1.13566'}` | `[{'name': 'Mansfield House (Leicester)', 'long...` | | NC67 | 0 |
| 3 | `http://www.leics.police.uk/local-policing/clar...` | `{'twitter': 'http://www.twitter.com/lpclarendo...` | Clarendon Park | `[]` | `{'latitude': '52.6202', 'longitude': '-1.12315'}` | `[{'name': 'Mansfield House (Leicester)', 'long...` | | NC68 | 0 |
| 4 | `http://www.leics.police.uk/local-policing/belg...` | `{'twitter': 'http://www.twitter.com/LPBelgrave...` | Belgrave South | `[{'url': 'http://www.leicester.gov.uk/', 'desc...` | `{'latitude': '52.6474', 'longitude': '-1.11783'}` | `[{'name': 'Keyham Lane (Leicester)', 'longitud...` | `<p>The Belgrave South neighbourhood boasts the...` | NE09 | 0 |


As can be seen from the result above, there are four nested fields: `contact_details`, `links`, `centre` and `locations`. Let's explore them to get a better understanding of each field.

In [17]:
df = pd.DataFrame(specific_neighborhoods)

##### contact_details field

In [18]:
pd.DataFrame(list(df.contact_details)).head()

| | twitter | facebook | telephone | email |
|---|---|---|---|---|
| 0 | http://www.twitter.com/centralleicsNPA | https://www.facebook.com/CentralLeicsNPA/ | 101 | centralleicester.npa@leicestershire.pnn.police.uk |
| 1 | http://twitter.com/centralleicsNPA | http://www.facebook.com/CentralLeicsNPA | 101 | centralleicester.npa@leicestershire.pnn.police.uk |
| 2 | http://twitter.com/centralleicsNPA | http://www.facebook.com/CentralLeicsNPA | 101 | centralleicester.npa@leicestershire.pnn.police.uk |
| 3 | http://www.twitter.com/lpclarendonpark | https://www.facebook.com/LPClarendonPark/ | 101 | centralleicester.npa@leicestershire.pnn.police.uk |
| 4 | http://www.twitter.com/LPBelgrave | https://www.facebook.com/LPBelgraveRushey/ | 101 | eastleicester.npa@leicestershire.pnn.police.uk |


##### links field

In [19]:
pd.DataFrame(list(df.links.explode().dropna())).head()

| | url | description | title |
|---|---|---|---|
| 0 | http://www.leicester.gov.uk/ | | Leicester City Council |
| 1 | http://www.leicester.gov.uk/ | | Leicester City Council |
| 2 | `http://leicspolice.wordpress.com/category/lpu-...` | | Keyham Lane LPU Blog |
| 3 | http://www.leicester.gov.uk/ | | Leicester City Council |
| 4 | `http://leicspolice.wordpress.com/category/lpu-...` | | Keyham Lane LPU Blog |


##### centre field

In [20]:
pd.DataFrame(list(df.centre)).head()

| | latitude | longitude |
|---|---|---|
| 0 | 52.6389 | -1.13619 |
| 1 | 52.6337 | -1.12435 |
| 2 | 52.6274 | -1.13566 |
| 3 | 52.6202 | -1.12315 |
| 4 | 52.6474 | -1.11783 |


##### locations field

In [21]:
pd.DataFrame(list(df.locations.explode().dropna())).head()

| | name | longitude | postcode | address | latitude | type | description |
|---|---|---|---|---|---|---|---|
| 0 | Mansfield House (Leicester) | | LE1 3GG | `74 Belgrave Gate\n, Leicester` | | station | |
| 1 | Mansfield House (Leicester) | | LE1 3GG | `74 Belgrave Gate\n, Leicester` | | station | |
| 2 | Mansfield House (Leicester) | | LE1 3GG | `74 Belgrave Gate\n, Leicester` | | station | |
| 3 | Mansfield House (Leicester) | | LE1 3GG | `74 Belgrave Gate\n, Leicester` | | station | |
| 4 | Keyham Lane (Leicester) | | LE5 1FY | `Colin Grundy Drive\n, Off Keyham Lane\n, Leice...` | | station | |


As can be seen from the results above and from the documentation, a specific neighbourhood contains the following fields:

* **url_force**: URL for the neighbourhood on the Force's website
* **contact_details**: Ways to get in touch with the neighbourhood officers
    * **twitter**: Twitter profile URL
    * **facebook**: Facebook profile URL 
    * **telephone**: Telephone number 
    * **email**: Email address
* **name**: Name of the neighbourhood
* **welcome_message**: An introduction message for the neighbourhood (listed in the documentation, but not among the columns returned above)
* **links**
    * **url**: URL
    * **description**: Description of the link (if available)
    * **title**: Title of the link
* **centre**: Centre point locator for the neighbourhood.
    * **latitude**: Centre point latitude
    * **longitude**: Centre point longitude
* **locations**: Any associated locations with the neighbourhood, e.g. police stations
    * **name**: Name (if available)
    * **longitude**: Location longitude
    * **latitude**: Location latitude
    * **postcode**: Postcode of the location
    * **address**: Location address
    * **type**: Type of location, e.g. 'station' (police station)
    * **description**: Description of the location
* **population**: Population of the neighbourhood
* **id**: Police force specific team identifier. This identifier is not unique and may also be used by a different force.
* **description**: Description (if available)

The fields `contact_details`, `links`, `centre` and `locations` are nested. During the ETL process they will be unnested.
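One possible shape for this unnesting is sketched below: the single-valued nested fields (`centre`, `contact_details`) become columns of the neighbourhood row, while the list-valued `locations` field becomes its own set of rows. The function names and the output column names are assumptions. Because the neighbourhood `id` is not unique across forces, the sketch carries a `force_id` alongside it. The sample is a trimmed copy of the `NC04` record shown above.

```python
def to_float(value):
    """Convert the API's string coordinates to float; missing values become None."""
    return float(value) if value not in (None, "") else None


def flatten_neighbourhood(neighbourhood, force_id):
    """Return (neighbourhood_row, location_rows) for one specific-neighbourhood record."""
    centre = neighbourhood.get("centre") or {}
    contact = neighbourhood.get("contact_details") or {}
    row = {
        "force_id": force_id,
        "id": neighbourhood.get("id"),
        "name": neighbourhood.get("name"),
        "centre_latitude": to_float(centre.get("latitude")),
        "centre_longitude": to_float(centre.get("longitude")),
        "twitter": contact.get("twitter"),
        "telephone": contact.get("telephone"),
    }
    # One row per associated location, linked back by (force_id, neighbourhood_id)
    location_rows = [
        {
            "force_id": force_id,
            "neighbourhood_id": neighbourhood.get("id"),
            "name": location.get("name"),
            "postcode": location.get("postcode"),
            "type": location.get("type"),
        }
        for location in neighbourhood.get("locations") or []
    ]
    return row, location_rows


sample_neighbourhood = {
    "id": "NC04",
    "name": "City Centre",
    "centre": {"latitude": "52.6389", "longitude": "-1.13619"},
    "contact_details": {"twitter": "http://www.twitter.com/centralleicsNPA", "telephone": "101"},
    "locations": [
        {"name": "Mansfield House (Leicester)", "postcode": "LE1 3GG", "type": "station",
         "latitude": None, "longitude": None},
    ],
}

neighbourhood_row, location_rows = flatten_neighbourhood(sample_neighbourhood, "leicestershire")
print(neighbourhood_row)
print(location_rows)
```

The `links` field could be handled the same way as `locations`.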

#### 3. Neighbourhood Boundary
A call to this API returns the boundary of a neighbourhood. According to the documentation, both the `id` of the neighbourhood and the `id` of the force are needed for this request.

In [22]:
neighborhood_boundary = requests.get('https://data.police.uk/api/leicestershire/NC04/boundary').json()

In [23]:
neighborhood_boundary[0]

{'latitude': '52.6394052587', 'longitude': '-1.1458618876'}

As can be seen from the results above and from the documentation, the response is a list of points, each containing the following fields:

* **latitude**
* **longitude**
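Both coordinates arrive as strings, so an ETL step would probably convert them to floats and record the position of each point, since the order of points defines the boundary. The sketch below assumes that layout. The function name `boundary_to_rows` and the `seq` column are hypothetical, and the sample contains only the single point shown above.

```python
def boundary_to_rows(boundary, force_id, neighbourhood_id):
    """Turn a boundary response into ordered rows with numeric coordinates."""
    return [
        {
            "force_id": force_id,
            "neighbourhood_id": neighbourhood_id,
            "seq": position,  # assumption: keep the original point order
            "latitude": float(point["latitude"]),
            "longitude": float(point["longitude"]),
        }
        for position, point in enumerate(boundary)
    ]


# First point of the NC04 boundary response shown above
sample_boundary = [{"latitude": "52.6394052587", "longitude": "-1.1458618876"}]

boundary_rows = boundary_to_rows(sample_boundary, "leicestershire", "NC04")
print(boundary_rows)
```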

### Crime Related
The following crime-related responses from the API are explored:
1. Street level crimes
2. Outcomes for a specific crime

#### 1. Street level crimes

A call to this API returns crimes at street level, either within a 1 mile radius of a single point or within a custom area. According to the documentation, the date and the latitude and longitude of the point are required for the API call.
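As a small illustration of how the three parameters fit together, the sketch below assembles the request URL for the single-point case. The helper name `street_crimes_url` and its `category` default are assumptions made for this example; only the `lat`, `lng` and `date` parameters and the `all-crime` path used in this notebook are taken from the source.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.police.uk/api"


def street_crimes_url(lat, lng, date, category="all-crime"):
    """Build the street-level crimes URL for a point and a month (YYYY-MM)."""
    query = urlencode({"lat": lat, "lng": lng, "date": date})
    return f"{BASE_URL}/crimes-street/{category}?{query}"


crimes_url = street_crimes_url(52.629729, -1.131592, "2017-03")
print(crimes_url)
```

Passing the same dictionary as `params=` to `requests.get` would be an equivalent alternative to building the query string by hand.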

In [24]:
crimes = requests.get('https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-03').json()

In [25]:
pd.DataFrame(crimes).head()

| | category | location_type | location | context | outcome_status | persistent_id | id | location_subtype | month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | anti-social-behaviour | Force | `{'latitude': '52.623514', 'street': {'id': 882...` | | | | 56865079 | | 2017-03 |
| 1 | anti-social-behaviour | Force | `{'latitude': '52.625201', 'street': {'id': 882...` | | | | 56862275 | | 2017-03 |
| 2 | anti-social-behaviour | Force | `{'latitude': '52.631090', 'street': {'id': 883...` | | | | 56866879 | | 2017-03 |
| 3 | anti-social-behaviour | Force | `{'latitude': '52.631131', 'street': {'id': 883...` | | | | 56866876 | | 2017-03 |
| 4 | anti-social-behaviour | Force | `{'latitude': '52.629264', 'street': {'id': 883...` | | | | 56866874 | | 2017-03 |


As can be seen from the result above and from the documentation, this record contains the following fields:
* **category**: Category of the crime
* **persistent_id**: 64-character unique identifier for that crime.
* **month**: Month of the crime
* **location**: Approximate location of the incident
    * **latitude**: Latitude of the location
    * **longitude**: Longitude of the location
    * **street**: The approximate street where the crime occurred
        * **id**: Unique identifier for the street
        * **name**: Name of the location
* **context**: Extra information about the crime
* **id**: ID of the crime. This ID only relates to the API, it is NOT a police identifier
* **location_type**: The type of the location. Either Force or BTP: Force indicates a normal police force location; BTP indicates a British Transport Police location. BTP locations fall within normal police force boundaries.
* **location_subtype**: For BTP locations, the type of location at which this crime was recorded
* **outcome_status**: The category and date of the latest recorded outcome for the crime
    * **category**: Category of the outcome
    * **date**: Date of the outcome

The fields `location` and `outcome_status` are nested. During the ETL process they will be unnested.
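A minimal sketch of this unnesting is shown below, assuming a flat record with one column per leaf field. The function name `flatten_crime` and the output column names are assumptions. The `outcome_status` field is empty for the rows shown above, so the sketch treats a missing or `None` value as "no outcome yet". The sample record is the crime object that appears in the outcomes response later in this notebook, because the street-level rows above are truncated.

```python
def flatten_crime(crime):
    """Flatten one crime record, including its nested location and outcome_status."""
    location = crime.get("location") or {}
    street = location.get("street") or {}
    outcome = crime.get("outcome_status") or {}  # may be None or absent
    return {
        "id": crime.get("id"),
        "persistent_id": crime.get("persistent_id"),
        "category": crime.get("category"),
        "month": crime.get("month"),
        "context": crime.get("context"),
        "location_type": crime.get("location_type"),
        "location_subtype": crime.get("location_subtype"),
        "latitude": location.get("latitude"),
        "longitude": location.get("longitude"),
        "street_id": street.get("id"),
        "street_name": street.get("name"),
        "outcome_category": outcome.get("category"),
        "outcome_date": outcome.get("date"),
    }


sample_crime = {
    "category": "violent-crime",
    "location_type": "Force",
    "location": {
        "latitude": "52.639814",
        "street": {"id": 883235, "name": "On or near Sanvey Gate"},
        "longitude": "-1.139118",
    },
    "context": "",
    "persistent_id": "590d68b69228a9ff95b675bb4af591b38de561aa03129dc09a03ef34f537588c",
    "id": 56880258,
    "location_subtype": "",
    "month": "2017-05",
}

crime_row = flatten_crime(sample_crime)
print(crime_row["street_name"])
```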

#### 2. Outcomes for a specific crime
A call to this API returns the outcomes (case history) for the specified crime. 
According to the documentation, this request needs the crime ID (`persistent_id`), which is returned by the street-level crimes request.

In [26]:
outcomes = requests.get('https://data.police.uk/api/outcomes-for-crime/590d68b69228a9ff95b675bb4af591b38de561aa03129dc09a03ef34f537588c').json()

In [27]:
outcomes

{'outcomes': [{'category': {'code': 'under-investigation',
    'name': 'Under investigation'},
   'date': '2017-05',
   'person_id': None},
  {'category': {'code': 'formal-action-not-in-public-interest',
    'name': 'Formal action is not in the public interest'},
   'date': '2017-06',
   'person_id': None},
  {'category': {'code': 'unable-to-prosecute',
    'name': 'Unable to prosecute suspect'},
   'date': '2017-11',
   'person_id': None}],
 'crime': {'category': 'violent-crime',
  'location_type': 'Force',
  'location': {'latitude': '52.639814',
   'street': {'id': 883235, 'name': 'On or near Sanvey Gate'},
   'longitude': '-1.139118'},
  'context': '',
  'persistent_id': '590d68b69228a9ff95b675bb4af591b38de561aa03129dc09a03ef34f537588c',
  'id': 56880258,
  'location_subtype': '',
  'month': '2017-05'}}

As can be seen from the result above and from the documentation, this record contains the following fields:
* **outcomes**: A list of categories and dates of each outcome
    * **category**: Category of the outcome
        * **code**: Internal code
        * **name**: Human-readable name
    * **date**: Date of the outcome
    * **person_id**: An identifier for the suspect/offender, where available.
* **crime**: Crime information
    * **category**: Category of the crime
    * **location_type**: The type of the location. Either Force or BTP: Force indicates a normal police force location; BTP indicates a British Transport Police location. BTP locations fall within normal police force boundaries.
    * **persistent_id**: 64-character unique identifier for that crime.
    * **month**: Month of the crime
    * **location**: Approximate location of the incident
        * **latitude**: Latitude of the location
        * **longitude**: Longitude of the location
        * **street**: The approximate street where the crime occurred
            * **id**: Unique identifier for the street
            * **name**: Name of the location
    * **context**: Extra information about the crime
    * **id**: ID of the crime. This ID only relates to the API, it is NOT a police identifier
   

The fields `outcomes` and `crime` are nested and will be unnested in the ETL phase.
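One way this could be done is sketched here: each entry in `outcomes` becomes a row keyed by the crime's `persistent_id`, so the case history can later be joined back to the crime table. The function name `flatten_outcomes` and the output column names are assumptions. The sample is the response shown above with the `crime` object trimmed to its identifier.

```python
def flatten_outcomes(response):
    """Return one flat row per outcome, keyed by the crime's persistent_id."""
    crime = response.get("crime") or {}
    rows = []
    for outcome in response.get("outcomes") or []:
        category = outcome.get("category") or {}
        rows.append({
            "persistent_id": crime.get("persistent_id"),
            "outcome_code": category.get("code"),
            "outcome_name": category.get("name"),
            "date": outcome.get("date"),
            "person_id": outcome.get("person_id"),
        })
    return rows


sample_response = {
    "outcomes": [
        {"category": {"code": "under-investigation", "name": "Under investigation"},
         "date": "2017-05", "person_id": None},
        {"category": {"code": "formal-action-not-in-public-interest",
                      "name": "Formal action is not in the public interest"},
         "date": "2017-06", "person_id": None},
        {"category": {"code": "unable-to-prosecute", "name": "Unable to prosecute suspect"},
         "date": "2017-11", "person_id": None},
    ],
    "crime": {
        "persistent_id": "590d68b69228a9ff95b675bb4af591b38de561aa03129dc09a03ef34f537588c",
    },
}

outcome_rows = flatten_outcomes(sample_response)
for row in outcome_rows:
    print(row["date"], row["outcome_code"])
```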