## Introductory Tutorial on Visualizing Reported Pot Holes in Toronto

This is an introductory tutorial on data mining and data exploration. I am a Data Scientist currently working in Canada's payments industry. The amount of work needed to gather data for analytics and research varies with the task at hand. For example, if you are looking at payment processes involving user transactions and you work for the company providing the payment service, you may not need to mine raw data from an external source. If you are working on your own projects or writing tutorials like I am, it is easiest to start with publicly available data.

For this tutorial, I will be using [open data](https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e2634d40-0dbf-4d91-12fa-83f039307e93) provided by 311 Toronto. This API contains information about pot holes reported by residents of the city of Toronto. This winter we have had some deep freezes followed by warm days, a pattern known to give rise to new pot holes. I am interested to see which areas of the city have a high concentration of reported pot holes. I am also interested in the workload this puts on the city, measured as the number of days taken to investigate reported pot holes. Once the reports are investigated, it is also interesting to look at how long the expected time of repair is. To meet these objectives, I will take you through a step-by-step guide to data mining, data cleaning, and visualization.

### Things you need:
1. Python (I am using Python 3.6.6)
2. pip (in order to install necessary packages)

### 1. Install and Import the Modules We Need
- The requests package is used to make API requests to 311 Toronto. We are using the "GET" method. If you are new to REST and to gathering data by making requests to APIs, I suggest you read through [this](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Introduction).
- The json package is used to take the response sent back to us from our request to the 311 Toronto API and put it into JSON format. JSON stands for JavaScript Object Notation, and with this package we can parse and manipulate JSON objects.
- Python's datetime library is very useful, and not just for data science projects. Dates appear often and require extensive manipulation. In this tutorial, we define date ranges for the API request that we are making. Converting string dates to datetime objects requires this package.
- Pandas has been the most important tool in the majority of my data science projects, and it is one of the most useful Python libraries to get familiar with if you are going to continue on this path of data science / data analytics.
- mplleaflet is the visualization library we are going to use to display our data. It uses matplotlib and leaflet to display longitudes and latitudes on a map object. For more information, take a look [here](https://github.com/jwass/mplleaflet).

#### Installing Modules and their corresponding versions

With pip, install the following, or use the requirements.txt posted on the GitHub for this project.

    - pandas (I am using version 0.24.1)
    - matplotlib (I am using version 3.0.2)
    - mplleaflet

If you are using Linux, you can run the commands below from the Jupyter notebook that accompanies this tutorial.

In [None]:
!pip install pandas
!pip install matplotlib
!pip install mplleaflet

#### Import Modules

In [1]:
import requests 
import json
import datetime
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
%matplotlib inline
import mplleaflet

### 2. Deciding on a date range

We know that pot holes are most problematic during the freeze-thaw season in late winter and spring. Knowing this, it would be interesting to look at data from this winter, because we have had alternating cold and warm days. We have defined our date parameters below, but feel free to grab this notebook from my GitHub and change the dates around for more insight.

In [2]:
# date range parameters
start_date = "2018-10-01 00:00:00"
end_date = "2019-02-16 00:00:00"

### 3. Finesse those Dates to Datetime
We have defined our variables as strings. Now we are going to use the datetime module to convert the strings into datetime objects. Note that, according to the 311 Toronto API Readme, dates are accepted only when they comply with the W3C (ISO 8601) date format. I had some issues achieving this with datetime.isoformat() alone and used a little hack to add the 'Z' at the end. If you have a better suggestion, please let me know!

In [3]:
def time(input):
    # datetime objects only need the trailing 'Z'
    if type(input) == datetime.datetime:
        return input.isoformat() + 'Z'
    # anything else must be a "YYYY-MM-DD HH:MM:SS" string
    try:
        dt = datetime.datetime.strptime(input, '%Y-%m-%d %H:%M:%S')
    except (TypeError, ValueError):
        raise Exception('Please pass properly formatted datetime parameter. E.g."2018-10-01 10:01:30"')
    return dt.isoformat() + 'Z'
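
On the "better suggestion" front, one possible alternative is to format the timestamp with strftime instead of appending 'Z' to isoformat(). This is only a minimal sketch and has not been checked against the 311 Toronto API; the helper name to_w3c_utc is hypothetical, and it assumes the naive datetimes are meant to be read as UTC, as the trailing 'Z' implies.

```python
import datetime

def to_w3c_utc(value):
    """Format a datetime (or a "YYYY-MM-DD HH:MM:SS" string) as YYYY-MM-DDTHH:MM:SSZ."""
    if isinstance(value, str):
        value = datetime.datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
    # strftime never emits microseconds, whereas isoformat() appends them
    # whenever the microsecond field is non-zero
    return value.strftime('%Y-%m-%dT%H:%M:%SZ')

print(to_w3c_utc("2018-10-01 00:00:00"))
print(to_w3c_utc(datetime.datetime(2019, 2, 16, 10, 1, 30, 123456)))
```

The second call shows the one practical difference: a datetime carrying microseconds still produces a timestamp with whole seconds only.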

### 3. Understanding the imposed API Limit (1000 records)

The 311 Toronto API returns at most 1000 records per response. Size limits and rate limits are more common with APIs than you may think: they keep the servers from being overloaded by a flood of requests so that they can provide a good quality of service. Our date range above covers about four and a half months (138 days). I checked that the number of reported pot holes per month is usually around 1k, which is close to the imposed limit. We are therefore going to partition our date range into 30-day periods and make one request for each chunk, one after another.

In [4]:
def data_partitions(start, end):
    '''dealing with the 1k api limit'''
    # accept either datetime objects or "YYYY-MM-DD HH:MM:SS" strings
    start_date = start
    end_date = end
    if type(start) != datetime.datetime:
        start_date = datetime.datetime.strptime(start, '%Y-%m-%d %H:%M:%S')
    if type(end) != datetime.datetime:
        end_date = datetime.datetime.strptime(end, '%Y-%m-%d %H:%M:%S')
    
    days_total = end_date - start_date
    print(days_total)
    if days_total.days > 30:
        new_end_dates = [start_date]
        rounds = days_total.days // 30
        for i in range(rounds):
            new_end_dates.append(new_end_dates[-1] + datetime.timedelta(days=29))
            new_end_dates.append(new_end_dates[-1] + datetime.timedelta(days=1))
        if new_end_dates[-1] != end_date:
            new_end_dates.append(end_date)
        else:
            end_date = end_date + datetime.timedelta(hours=23)
            new_end_dates.append(end_date)
        return new_end_dates
    else:
        return [start_date, end_date]

In [5]:
partitions = data_partitions(start_date, end_date)
partitions

138 days, 0:00:00


[datetime.datetime(2018, 10, 1, 0, 0),
 datetime.datetime(2018, 10, 30, 0, 0),
 datetime.datetime(2018, 10, 31, 0, 0),
 datetime.datetime(2018, 11, 29, 0, 0),
 datetime.datetime(2018, 11, 30, 0, 0),
 datetime.datetime(2018, 12, 29, 0, 0),
 datetime.datetime(2018, 12, 30, 0, 0),
 datetime.datetime(2019, 1, 28, 0, 0),
 datetime.datetime(2019, 1, 29, 0, 0),
 datetime.datetime(2019, 2, 16, 0, 0)]

Using the function above, we get a flat list of alternating start and end dates. Since we do not want overlapping days pulling the same data, we deliberately constructed our partitions this way. That is, we want the first chunk of days to run from 2018-10-01 to 2018-10-30, and we want to **make sure** the next chunk starts from 2018-10-31 instead of 2018-10-30. Using this odd / even relationship in the list above, we will construct our ranges.

In [6]:
# elements at even indices (0, 2, 4, ...) are the range starts
start_ranges = partitions[::2]
# elements at odd indices (1, 3, 5, ...) are the range ends
end_ranges = partitions[1::2]
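
An alternative way to build the ranges, sketched below, is to generate (start, end) pairs directly instead of slicing a flat list. This is not the approach used in this tutorial: the hypothetical helper date_windows makes each window end exactly where the next one begins, so it assumes you are willing to drop any record that shows up twice at a boundary (for example by comparing service_request_id).

```python
import datetime

def date_windows(start, end, days=30):
    """Yield (window_start, window_end) pairs that cover start..end in order."""
    step = datetime.timedelta(days=days)
    current = start
    while current < end:
        # the last window is shortened so it never runs past the end date
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end

windows = list(date_windows(datetime.datetime(2018, 10, 1),
                            datetime.datetime(2019, 2, 16)))
for window_start, window_end in windows:
    print(window_start.date(), '->', window_end.date())
```

Because the pairs come out already matched, they can be passed to a request function one pair at a time without any odd / even bookkeeping.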

### 4. Fetch the Actual Data

It is time to make the actual API request to 311 Toronto. Our base URL includes parameters such as service_code=CSROWR-12, which specifies that we only want data for reported pot holes.

In [7]:
data_clob = []
def data(start_range, end_range):
    '''request one date partition and store the parsed JSON response in data_clob'''
    sd = start_range.isoformat() + 'Z'
    ed = end_range.isoformat() + 'Z'
    base_url = "https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&"
    url = base_url + 'start_date=' + sd + '&end_date=' + ed
    print(url)
    # .json() parses the response body into Python dictionaries and lists
    response = requests.get(url).json()
    data_clob.append(response)
    return response

In [8]:
# one request per (start, end) pair
for start_range, end_range in zip(start_ranges, end_ranges):
    data(start_range, end_range)

https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&start_date=2018-10-01T00:00:00Z&end_date=2018-10-30T00:00:00Z
https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&start_date=2018-10-31T00:00:00Z&end_date=2018-11-29T00:00:00Z
https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&start_date=2018-11-30T00:00:00Z&end_date=2018-12-29T00:00:00Z
https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&start_date=2018-12-30T00:00:00Z&end_date=2019-01-28T00:00:00Z
https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&start_date=2019-01-29T00:00:00Z&end_date=2019-02-16T00:00:00Z
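
The query string can also be assembled from a dictionary rather than by string concatenation. The sketch below uses urlencode from the standard library; ENDPOINT and build_url are hypothetical names, and safe=':' is passed only so that the colons in the timestamps stay unescaped and the result matches the URLs printed above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://secure.toronto.ca/webwizard/ws/requests.json"

def build_url(start_iso, end_iso):
    """Build the request URL for one date partition from a dict of parameters."""
    params = {
        'jurisdiction_id': 'toronto.ca',
        'service_code': 'CSROWR-12',  # reported pot holes only
        'start_date': start_iso,
        'end_date': end_iso,
    }
    return ENDPOINT + '?' + urlencode(params, safe=':')

print(build_url('2018-10-01T00:00:00Z', '2018-10-30T00:00:00Z'))
```

Keeping the parameters in a dictionary makes it easier to swap the service code or add another filter later without touching the string-building logic.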


### 5. Clean the Data

We know we have a giant list of responses from the get requests we made earlier. Let's take a quick look at what this looks like.

In [9]:
# uncomment below to print the result of the first pull
#data_clob[0]

We only looked at the first data_clob item, and we see that it is a nested JSON object containing records of service requests for reported pot holes. The values we are interested in are in the list paired with the key "service_requests". We are going to iterate through every data_clob object and pull this list out.

In [10]:
# each data_clob item is a dictionary whose 'service_requests' key holds the list of records
data_set = []
for result in data_clob:
    # extend() combines the partitioned lists into a single list as we go
    # (no variable named data here, so the data() function above stays usable)
    data_set.extend(result['service_requests'])
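
Because each response is capped at 1000 records, it can be worth checking whether any partition came back full before combining them, since a full partition may have been truncated. The sketch below runs on a small made-up stand-in for data_clob (sample_clob, API_LIMIT and the record values are assumptions for illustration) and flattens the lists with itertools.chain.

```python
from itertools import chain

API_LIMIT = 1000  # assumed cap per response, as described in this tutorial

# made-up stand-in for data_clob: one dictionary per partition
sample_clob = [
    {'service_requests': [{'service_request_id': 'a1'}, {'service_request_id': 'a2'}]},
    {'service_requests': [{'service_request_id': 'b1'}]},
]

# flag partitions whose size reaches the cap: they may be missing records
full_partitions = [i for i, result in enumerate(sample_clob)
                   if len(result['service_requests']) >= API_LIMIT]
print('partitions at the limit:', full_partitions)

# chain.from_iterable joins the per-partition lists without building
# intermediate lists the way sum(list_of_lists, []) does
flat = list(chain.from_iterable(r['service_requests'] for r in sample_clob))
print(len(flat))
```

If a partition is flagged, a reasonable follow-up is to re-request that date range in smaller chunks.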

### 6. Construct Base DataFrame

This is where the magic of Pandas comes into play. Pandas can read your data from a bunch of formats, such as CSV files, dictionaries, and lists, and put it into a data frame for you. I frequently use Pandas with a handy tool called SQLAlchemy to connect to databases, and Pandas can also read SQL.

In [11]:
df_raw = pd.DataFrame.from_dict(data_set)
# head() shows the first five rows; pass an int to see more, e.g. df_raw.head(10)
df_raw.head()

Unnamed: 0,address,address_id,agency_responsible,description,expected_datetime,lat,long,media_url,requested_datetime,service_code,service_name,service_notice,service_request_id,status,status_notes,updated_datetime,zipcode
0,"1430 Sheppard Ave W, North York, Sheppard Publ...",10133016.0,311 Toronto,,2020-04-29T17:03:00-04:00,43.743649,-79.49185,,2018-10-29T17:03:00-04:00,CSROWR-12,Road - Pot hole,,101005572079,closed,Completed - The request has been concluded.,2018-10-30T13:03:00-04:00,
1,"Rathburn Rd / Melbert Rd, Etobicoke",13467170.0,311 Toronto,,2018-11-02T15:04:00-04:00,43.650021,-79.58274,,2018-10-29T15:04:00-04:00,CSROWR-12,Road - Pot hole,,101005571876,closed,Completed - The request has been concluded.,2018-11-07T14:00:00-05:00,
2,"Winterton Dr / Lloyd Manor Rd, Etobicoke",13463541.0,311 Toronto,,2018-11-02T15:04:00-04:00,43.671929,-79.555053,,2018-10-29T15:04:00-04:00,CSROWR-12,Road - Pot hole,,101005571889,closed,Completed - The request has been concluded.,2018-11-07T14:01:00-05:00,
3,GARDINER EXPRESS / Lake Shore Blvd E / Booth A...,13466078.0,311 Toronto,,,43.653533,-79.340792,,2018-10-29T14:04:00-04:00,CSROWR-12,Road - Pot hole,,101005571749,closed,"Cancelled - The request may be a duplicate, wo...",,
4,"GARDINER EXPRESS / Lake Shore Blvd W, former T...",13468963.0,311 Toronto,,2019-01-27T14:01:00-05:00,43.633417,-79.435976,,2018-10-29T14:01:00-04:00,CSROWR-12,Road - Pot hole,,101005571645,closed,Completed - The request has been concluded.,2018-10-30T11:04:00-04:00,


### 7. Data Post-Processing
From the readme doc posted by 311 Toronto we know the following:
- agency_responsible: always set to 311 Toronto
- service_notice: not returned
- zipcode: not returned

Based on this information, we will clean up the dataframe by dropping the corresponding columns.

In [12]:
df_delta_days = df_raw.drop(['agency_responsible', 'service_notice', 'zipcode'], axis=1)

### 8. Actual Calculations

##### Calculate Difference in Days between Updated Case Date and Expected Date

In [13]:
df_delta_days['requested_datetime'] = pd.to_datetime(df_raw.requested_datetime, utc=True)
df_delta_days['updated_datetime'] = pd.to_datetime(df_raw.updated_datetime, utc=True)
df_delta_days['expected_datetime'] = pd.to_datetime(df_raw.expected_datetime, utc=True)

### 9. How long does the city take to respond and investigate?
Since the time between requested_datetime and updated_datetime indicates the investigation period, this is an interesting parameter to calculate. We will call it investigation_days. We also compute repair_days, the number of days between updated_datetime and expected_datetime.

In [14]:
df_delta_days['investigation_days'] = df_delta_days.updated_datetime - df_delta_days.requested_datetime

In [15]:
df_delta_days['repair_days'] = df_delta_days.expected_datetime.values.astype('datetime64[D]') - df_delta_days.updated_datetime.values.astype('datetime64[D]')
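
Both new columns hold timedeltas, which print nicely but can be awkward to average or plot. One option, sketched here on a tiny made-up frame (the timestamps and the investigation_days_float column name are assumptions for illustration), is to convert them to plain numbers of days.

```python
import pandas as pd

df = pd.DataFrame({
    'requested_datetime': pd.to_datetime(
        ['2018-10-24T15:04:00Z', '2018-10-22T22:02:00Z'], utc=True),
    'updated_datetime': pd.to_datetime(
        ['2018-10-26T11:05:00Z', '2018-10-26T12:03:00Z'], utc=True),
})
df['investigation_days'] = df.updated_datetime - df.requested_datetime

# total_seconds() keeps the fractional part of a day,
# while .dt.days keeps only the whole-day component
df['investigation_days_float'] = df.investigation_days.dt.total_seconds() / 86400
print(df[['investigation_days', 'investigation_days_float']])
```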

#### Further cleaning - drop nulls

In [16]:
df_delta_days = df_delta_days.dropna()
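
Note that dropna() with no arguments removes every row that has a missing value in any column, including optional ones such as description and media_url. If you would rather keep reports that simply lack a photo or a description, one option is to restrict the check to the columns the calculations depend on. A minimal sketch on made-up rows (all values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'description': [None, 'Near the east curb', None],
    'lat': [43.74, 43.66, 43.65],
    'long': [-79.49, -79.44, -79.34],
    'investigation_days': pd.to_timedelta(['1 days', '3 days', None]),
})

# strict: a missing value in any column drops the row
strict = df.dropna()
# lenient: only the columns needed for the map and the calculations are checked
lenient = df.dropna(subset=['lat', 'long', 'investigation_days'])
print(len(strict), len(lenient))
```

Which variant is appropriate depends on whether the later analysis actually needs the optional columns.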

In [17]:
df_delta_days.head()

Unnamed: 0,address,address_id,description,expected_datetime,lat,long,media_url,requested_datetime,service_code,service_name,service_request_id,status,status_notes,updated_datetime,investigation_days,repair_days
93,"5910 Finch Ave E, Scarborough, Ward: Scarborou...",6947314.0,On n. curb 1m e. of e. pedestrian crossing.,2022-10-24 15:04:00+00:00,43.817674,-79.224815,http://seeclickfix.com/files/issue_images/0117...,2018-10-24 15:04:00+00:00,CSROWR-12,Road - Pot hole,101005564805,closed,Completed - The request has been concluded.,2018-10-26 11:05:00+00:00,1 days 20:01:00,1459 days
136,"1058 Dufferin St, former Toronto, Ward: Davenp...",10180527.0,Directly in front of the porch of 1058 Dufferi...,2019-01-20 23:02:00+00:00,43.661142,-79.436122,http://seeclickfix.com/files/issue_images/0117...,2018-10-22 22:02:00+00:00,CSROWR-12,Road - Pot hole,101005562198,closed,Completed - The request has been concluded.,2018-10-26 12:03:00+00:00,3 days 14:01:00,86 days
151,"155 McNicoll Ave, North York, Ward: Willowdale...",30033326.0,Patchwork not finished at this spot even thoug...,2018-10-25 18:00:00+00:00,43.798982,-79.358084,http://seeclickfix.com/files/issue_images/0117...,2018-10-21 18:00:00+00:00,CSROWR-12,Road - Pot hole,101005560224,closed,Completed - The request has been concluded.,2018-11-14 11:03:00+00:00,23 days 17:03:00,-20 days
229,"4801 Dufferin St, North York, G. Ross Lord Par...",11714766.0,40m s. of Dolomite on e. curb bordering sewer ...,2019-01-15 17:00:00+00:00,43.778172,-79.468741,http://seeclickfix.com/files/issue_images/0117...,2018-10-17 16:00:00+00:00,CSROWR-12,Road - Pot hole,101005554783,closed,Completed - The request has been concluded.,2018-10-23 19:04:00+00:00,6 days 03:04:00,84 days
230,"4239 Dufferin St, North York, Ward: York Centr...",5336716.0,30m n. of Overbook near e. curb.,2019-01-15 17:00:00+00:00,43.762137,-79.46504,http://seeclickfix.com/files/issue_images/0117...,2018-10-17 16:00:00+00:00,CSROWR-12,Road - Pot hole,101005554751,closed,Completed - The request has been concluded.,2018-10-23 19:04:00+00:00,6 days 03:04:00,84 days


In [18]:
# number of total records
df_delta_days.shape[0]

151

One thing to notice is that some records show over 1400 days for the expected repair. This might be because expected_datetime is an auto-populated field that gets filled under certain conditions and then updated at a later date. We can't be sure since it is not our data, but it is something to keep in mind.

### 10. Visualization Investigation

We want to know the average number of days for investigations and for repairs, and to use each average as a threshold. Any record that took less than or equal to the average time to investigate we will treat as a fast investigation; the rest we will treat as slow. We follow the same logic for the repair days.

In [19]:
# Load longitude, latitude data
# investigations: slow means longer than the mean investigation time
mean_days = df_delta_days['investigation_days'].mean()
slow_inv = df_delta_days['investigation_days'] > mean_days
slow_long = df_delta_days[slow_inv].long
slow_lat = df_delta_days[slow_inv].lat

# quick investigations (less than or equal to the mean)
fast_long = df_delta_days[~slow_inv].long
fast_lat = df_delta_days[~slow_inv].lat

# repairs: slow means longer than the mean expected repair time
mean_repairs = df_delta_days['repair_days'].mean()
slow_rep = df_delta_days['repair_days'] > mean_repairs
slow_long_r = df_delta_days[slow_rep].long
slow_lat_r = df_delta_days[slow_rep].lat

# quick repairs (less than or equal to the mean)
fast_long_r = df_delta_days[~slow_rep].long
fast_lat_r = df_delta_days[~slow_rep].lat
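
Since a few records show expected repair times of more than 1400 days, the mean can be pulled well above a typical value. As an alternative threshold, the sketch below compares the mean with the median on made-up repair times (the numbers are invented for illustration; only the 1459-day outlier echoes the sample output above).

```python
import pandas as pd

repair_days = pd.Series(pd.to_timedelta([84, 86, 84, -20, 1459], unit='D'))
days = repair_days.dt.days  # whole days as plain integers

mean_repairs = days.mean()
median_repairs = days.median()
print(mean_repairs, median_repairs)

# count how many records each threshold labels as slow
slow_by_mean = (days > mean_repairs).sum()
slow_by_median = (days > median_repairs).sum()
print(slow_by_mean, slow_by_median)
```

With the mean, only the outlier itself counts as slow here, whereas the median splits the records closer to the middle.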

Plot our findings above on a map.

In [20]:
# plot with matplotlib
fig,ax=plt.subplots(figsize=(10,10))
plt.plot(slow_long_r, slow_lat_r, 'rs') # slow repair
plt.plot(fast_long_r, fast_lat_r, 'gs') # fast repair
plt.plot(slow_long, slow_lat, 'b.') # slow investigation
plt.plot(fast_long, fast_lat, 'k.') # fast investigation
# display mplleaflet within notebook
#mplleaflet.display()

mplleaflet.show(tiles='cartodb_positron', path='pot_holes.html')

### Final Visualization
We see slow repairs for the areas near the downtown core. We see that the city is fast at investigating reports on the NW side of the city.

The scope of this tutorial was to cover the steps for gathering data and preparing it for analysis. We did not delve much into the analysis itself, but the unexplored columns in our Pandas dataframe are now cleaned and available for you to explore on your own. One interesting idea would be to look at the photos of pot holes submitted with the reports (under the media_url column). Image analysis is a fascinating area of data science, with many open source projects to get involved with.