## Introductory Tutorial on Visualizing Reported Pot Holes in Toronto

In this tutorial, I cover data mining and data exploration. I am a Data Scientist currently working in Canada's payments industry. The amount of work needed to gather data for analytics and research varies with the task at hand. For example, if you are looking at payment processes involving user transactions and you work for the company providing the payment service, you may not need to mine raw data from an external source. If you are working on your own projects or writing tutorials like I am, it is easiest to start with publicly available data.

I will be using [open data](https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e2634d40-0dbf-4d91-12fa-83f039307e93) provided by 311 Toronto. This API contains information about pot holes reported by residents of Toronto. This winter we have had some deep freezes followed by warm days, which give rise to new pot holes. I want to see which areas of the city have a high concentration of reported pot holes. Another question is the workload this puts on the city, measured as the number of days required to investigate reported pot holes. Once a report has been investigated, it is also interesting to look at how long the expected time of repair is. To meet these objectives, I will take you through a step-by-step guide on data mining, cleaning, and visualization.

### Things you need:
1. Python (I am using Python 3.6.6)
2. pip (in order to install necessary packages)

### 1. Install and Import the Modules We Need
- The requests package is used to make API requests to 311 Toronto's server. We are using the "GET" method. If you are new to REST and to gathering data by making requests to APIs, I suggest you read through [this](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Introduction).
- Python's datetime library is very useful, and not just for data science projects. Dates appear often and require extensive manipulation. In this tutorial, we define date ranges for the API request that we are making, and converting string dates to datetime objects requires this library.
- Pandas has been the most important tool in the majority of my data science projects, and it is one of the most useful Python libraries to get familiar with if you are going to continue on this path of data science / data analytics.
- mplleaflet is the visualization library we are going to use to display our data. It uses matplotlib and leaflet to display longitudes and latitudes on a map object. For more information, take a look [here](https://github.com/jwass/mplleaflet).
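
As a quick illustration of the string-to-datetime conversion mentioned in the list above, the sketch below parses a date string with `datetime.datetime.strptime` and shifts it with `datetime.timedelta`. The sample date and the variable names are only illustrative.

```python
import datetime

# format string matching dates such as "2018-11-01 00:00:00"
DATE_FORMAT = '%Y-%m-%d %H:%M:%S'

# convert the string into a datetime object
parsed = datetime.datetime.strptime("2018-11-01 00:00:00", DATE_FORMAT)

# timedelta lets us move a datetime forward or backward
thirty_days_later = parsed + datetime.timedelta(days=30)

print(parsed.isoformat())
print(thirty_days_later.isoformat())
```

Once the dates are datetime objects, arithmetic such as adding 30 days handles month lengths for you.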

#### Installing Modules and their corresponding versions
Install the following with pip, or use the requirements.txt posted in the GitHub repository for this project.

    - requests
    - pandas (I am using version 0.24.1)
    - matplotlib (I am using version 3.0.2)
    - mplleaflet

If you are using Linux, you can run the commands below directly from the Jupyter notebook that accompanies this tutorial.

In [None]:
!pip install requests
!pip install pandas
!pip install matplotlib
!pip install mplleaflet

#### Import Modules

In [None]:
import requests 
import datetime
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
%matplotlib inline
import mplleaflet

### 2. Deciding on a date range

We know that pot holes are problematic during the season of freezing and thawing around late winter and spring. Knowing this, it would be interesting to look at data from this winter because we have had alternating cold and warm days. We have defined our date parameters below, but feel free to grab this notebook from my GitHub and change the dates around for more insight.

In [None]:
# date range parameters
start_date = "2018-11-01 00:00:00"
end_date = "2019-03-04 00:00:00"

### 3. Understanding the imposed API Limit (1000 records)

The 311 Toronto API returns at most 1000 records per response. Size limits and rate limits are more common in APIs than you may think; they keep the servers from being overloaded by large requests so that the service stays responsive for everyone. Our date range above covers roughly four months of data. I checked that the average number of recorded pot holes per month is usually around 1k, which roughly matches the imposed limit. We are therefore going to partition our date range into 30-day periods and make one request for each chunk.
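
To get a feel for how many requests this means, here is a small back-of-the-envelope sketch. It only does the arithmetic for the date range defined in section 2 and is not part of the partitioning function itself; the variable names are illustrative.

```python
import datetime
import math

start = datetime.datetime(2018, 11, 1)
end = datetime.datetime(2019, 3, 4)

# total span of the requested date range in days
total_days = (end - start).days

# each request covers at most 30 days, so round up
requests_needed = math.ceil(total_days / 30)

print(total_days, requests_needed)
```

So the range splits into four full 30-day chunks plus a short remainder at the end.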

In [None]:
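def data_partitions(start, end):
    '''Split the date range into 30-day chunks to stay under the 1k API limit.

    Returns a flat list that alternates chunk starts and chunk ends.'''
    date_format = '%Y-%m-%d %H:%M:%S'
    # accept either datetime objects or strings written in date_format
    start_date = start if isinstance(start, datetime.datetime) else datetime.datetime.strptime(start, date_format)
    end_date = end if isinstance(end, datetime.datetime) else datetime.datetime.strptime(end, date_format)

    days_total = end_date - start_date
    print(days_total)
    if days_total.days > 30:
        new_end_dates = [start_date]
        rounds = days_total.days // 30
        for i in range(rounds):
            # chunk end: 29 days after the chunk start
            new_end_dates.append(new_end_dates[-1] + datetime.timedelta(days=29))
            # next chunk start: the following day, so chunks do not overlap
            new_end_dates.append(new_end_dates[-1] + datetime.timedelta(days=1))
        if new_end_dates[-1] != end_date:
            new_end_dates.append(end_date)
        else:
            # the last start coincides with the end date, so extend the end by 23 hours
            end_date = end_date + datetime.timedelta(hours=23)
            new_end_dates.append(end_date)
        return new_end_dates
    else:
        return [start_date, end_date]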

In [None]:
partitions = data_partitions(start_date, end_date)
partitions

The function above gives us a flat list of alternating start and end dates. Since we do not want overlapping days pulling the same data, the partitions are constructed so that each chunk starts the day after the previous chunk ends. For example, the first chunk runs from 2018-11-01 to 2018-11-30, and the next chunk **must** start on 2018-12-01 rather than 2018-11-30. Using this even / odd relationship in the list above, we construct our ranges.

In [None]:
# even-indexed items are the chunk start dates
start_ranges = partitions[::2]
# odd-indexed items are the chunk end dates
end_ranges = partitions[1::2]
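
If the slicing above looks cryptic, here is a minimal sketch on a toy list. The strings are placeholders standing in for the datetime objects, and I am assuming the list alternates starts and ends as described.

```python
# toy stand-in for the partition list (assumed shape:
# alternating chunk starts and chunk ends)
toy_partitions = ['start_1', 'end_1', 'start_2', 'end_2', 'start_3', 'end_3']

toy_starts = toy_partitions[::2]   # items at indexes 0, 2, 4
toy_ends = toy_partitions[1::2]    # items at indexes 1, 3, 5

# zip pairs each start with its matching end
toy_ranges = list(zip(toy_starts, toy_ends))
print(toy_ranges)
```

Each tuple in `toy_ranges` corresponds to one request we will make.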

### 4. Fetch the Actual Data

It is time to make the actual API request to 311 Toronto. Our base URL includes parameters such as service_code=CSROWR-12, which specifies that we only want data for reported pot holes.

In [None]:
def data(start_range, end_range):
    sd = start_range.isoformat() + 'Z'
    ed = end_range.isoformat() + 'Z'
    base_url = "https://secure.toronto.ca/webwizard/ws/requests.json?jurisdiction_id=toronto.ca&service_code=CSROWR-12&"
    url = base_url+'start_date'+'='+sd+'&'+'end_date'+'='+ed
    print(url)
    return requests.get(url).json()    
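
As an alternative to concatenating strings by hand, the query string could be assembled from a dictionary with the standard library's `urllib.parse.urlencode`. This is only a sketch: `build_url` is a hypothetical helper that is not part of the notebook, and it makes no network call. Note that `urlencode` percent-encodes the colons in the timestamps; I am assuming the server decodes these in the usual way, which I have not verified against the 311 endpoint.

```python
import datetime
from urllib.parse import urlencode

def build_url(start_range, end_range):
    '''Hypothetical helper: build the request URL from a dict of parameters.'''
    endpoint = "https://secure.toronto.ca/webwizard/ws/requests.json"
    params = {
        'jurisdiction_id': 'toronto.ca',
        'service_code': 'CSROWR-12',  # reported pot holes
        'start_date': start_range.isoformat() + 'Z',
        'end_date': end_range.isoformat() + 'Z',
    }
    # urlencode joins the pairs with '&' and escapes special characters
    return endpoint + '?' + urlencode(params)

example_url = build_url(datetime.datetime(2018, 11, 1),
                        datetime.datetime(2018, 11, 30))
print(example_url)
```

Keeping the parameters in a dictionary makes it easier to add or change one later without touching the string-building logic.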

In [None]:
# one request per (start, end) pair
data_clob = [data(start, end) for start, end in zip(start_ranges, end_ranges)]

### 5. Clean the Data

We know we have a giant list of responses from the get requests we made earlier. Let's take a quick look at what this looks like.

In [None]:
# uncomment below to print the result of the first pull
#data_clob[0]

We only pulled the first data_clob item and we see that it is a nested JSON containing records of service requests on reported pot holes. We know that the value we are interested in is the list object that is paired to the key "service_requests" as shown above. We are going to iterate through every data clob object and pull this list out.

In [None]:
# data clob is a nested dictionary always starting with the key 'service_requests' -- clean and get only the values for this.
data_set = []
for result in data_clob:
    val = result['service_requests']
    data_set.append(val)

# combine partitioned lists into a single list object
data_set = sum(data_set, [])
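
Because of the 1000-record limit discussed in section 3, it can be worth checking whether any single chunk came back full, since that would suggest records were cut off. Below is a minimal sketch of such a check. `chunks_at_limit` is a hypothetical helper, and the toy responses only imitate the shape of `data_clob`.

```python
API_LIMIT = 1000  # record cap described in section 3

def chunks_at_limit(responses, limit=API_LIMIT):
    '''Return the indexes of responses whose record count reached the limit.'''
    return [
        i for i, response in enumerate(responses)
        if len(response.get('service_requests', [])) >= limit
    ]

# toy responses standing in for data_clob
toy_clob = [
    {'service_requests': [{}] * 12},
    {'service_requests': [{}] * 1000},
]
print(chunks_at_limit(toy_clob))
```

If the check flags a chunk, one option would be to request that date range again in smaller pieces.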

### 6. Construct Base DataFrame

This is where the magic of Pandas comes into play. Pandas can read your data from a number of formats, such as CSV files, dictionaries, and lists, and put it into a data frame for you.

In [None]:
# data_set is a list of dictionaries, one per service request
df_raw = pd.DataFrame(data_set)
# head() shows the first five rows; pass an integer to see more, e.g. df_raw.head(10)
df_raw.head()

### 7. Data Post-Processing
From the readme doc posted by 311 Toronto we know the following:
- agency_responsible: always set to 311 Toronto
- service_notice: not returned
- zipcode: not returned

Based on this information, we will clean up the dataframe by dropping the corresponding columns. Passing axis=1 tells drop to remove columns; the default, axis=0, would look for rows with those labels instead.

In [None]:
df_delta_days = df_raw.drop(['agency_responsible', 'service_notice', 'zipcode'], axis=1)
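
To see what the `axis` argument does, here is a small sketch on a made-up dataframe. The values are invented for illustration. It also shows the `columns=` keyword, which does the same thing as `axis=1` but reads more explicitly.

```python
import pandas as pd

# made-up single-row dataframe with one column we do not need
toy = pd.DataFrame({'lat': [43.65], 'long': [-79.38], 'zipcode': [None]})

by_axis = toy.drop(['zipcode'], axis=1)      # axis=1 -> drop a column
by_keyword = toy.drop(columns=['zipcode'])   # same result, more explicit

print(list(by_axis.columns))
```

Note that `drop` returns a new dataframe and leaves `toy` unchanged, which is why the notebook assigns the result to a new variable.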

### 8. Actual Calculations

##### Calculate Difference in Days between Updated Case Date and Expected Date

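The date fields arrive as strings, so we convert them with pd.to_datetime before doing any date arithmetic. Passing utc=True returns timezone-aware timestamps normalized to UTC, which keeps the three columns on the same time base when we subtract them.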

In [None]:
df_delta_days['requested_datetime'] = pd.to_datetime(df_raw.requested_datetime, utc=True)
df_delta_days['updated_datetime'] = pd.to_datetime(df_raw.updated_datetime, utc=True)
df_delta_days['expected_datetime'] = pd.to_datetime(df_raw.expected_datetime, utc=True)
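
Here is a small sketch of what `utc=True` does, using two made-up timestamps that describe the same instant with different UTC offsets. I am not claiming the 311 data uses these exact offsets; the point is only that after conversion both values share one time base.

```python
import pandas as pd

# the same instant written with two different UTC offsets (made-up values)
raw = pd.Series(['2019-01-01T10:00:00-05:00', '2019-01-01T15:00:00+00:00'])

# utc=True converts every value to a timezone-aware UTC timestamp
stamps = pd.to_datetime(raw, utc=True)

print(stamps.dt.tz)           # timezone of the converted column
print(stamps[1] - stamps[0])  # difference between the two instants
```

Since both strings refer to the same moment, the difference comes out as zero once they are normalized to UTC.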

### 9. How long does the city take to respond and investigate?
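The time between requested_datetime and updated_datetime indicates the investigation period, so this is an interesting quantity to calculate. We will call it investigation_days. Assigning to a column name that does not exist yet adds a new column to the dataframe. In the same way, we add repair_days: the number of whole days between updated_datetime and expected_datetime.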

In [None]:
df_delta_days['investigation_days'] = df_delta_days.updated_datetime - df_delta_days.requested_datetime

In [None]:
df_delta_days['repair_days'] = df_delta_days.expected_datetime.values.astype('datetime64[D]') - df_delta_days.updated_datetime.values.astype('datetime64[D]')

#### Further cleaning - drop nulls

In [None]:
df_delta_days = df_delta_days.dropna()

In [None]:
df_delta_days.head()

In [None]:
# number of total records
# shape is a (rows, columns) tuple, so shape[0] is the row count
df_delta_days.shape[0]

One thing to notice is that some records show over 1400 days for expected repair. This might be because it is an auto-populated field that gets filled under certain conditions and then updated at a later date. We can't be sure since it is not our data, but it is something to keep in mind.

### 10. Visualization Investigation

We want to know the average length of investigation and repair in days, and use it as a threshold. Any record that took less than or equal to the average time to investigate is treated as a fast investigation; anything longer is slow. We follow the same logic for the repair days.

In [None]:
# Load longitude, latitude data
# investigations: slow if longer than the mean, fast otherwise
mean_days = df_delta_days['investigation_days'].values.mean()
slow_inv = df_delta_days['investigation_days'].values > mean_days
fast_inv = df_delta_days['investigation_days'].values <= mean_days
slow_long = df_delta_days[slow_inv].long
slow_lat = df_delta_days[slow_inv].lat
fast_long = df_delta_days[fast_inv].long
fast_lat = df_delta_days[fast_inv].lat

# repairs: same threshold logic applied to repair_days
mean_repairs = df_delta_days['repair_days'].values.mean()
slow_rep = df_delta_days['repair_days'].values > mean_repairs
fast_rep = df_delta_days['repair_days'].values <= mean_repairs
slow_long_r = df_delta_days[slow_rep].long
slow_lat_r = df_delta_days[slow_rep].lat
fast_long_r = df_delta_days[fast_rep].long
fast_lat_r = df_delta_days[fast_rep].lat
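
The thresholding idea is easier to follow on a tiny example. The sketch below uses three made-up records and the pandas `Series.mean()` method on a timedelta column; the coordinates and durations are invented for illustration.

```python
import pandas as pd

# three made-up records with investigation times of 1, 2 and 9 days
toy = pd.DataFrame({
    'investigation_days': pd.to_timedelta([1, 2, 9], unit='D'),
    'long': [-79.40, -79.38, -79.35],
    'lat': [43.65, 43.66, 43.70],
})

# the mean of a timedelta column is itself a Timedelta
mean_days = toy['investigation_days'].mean()

# boolean mask: True where the record is slower than average
is_slow = toy['investigation_days'] > mean_days

slow = toy[is_slow]
fast = toy[~is_slow]   # everything at or below the mean
print(mean_days, len(slow), len(fast))
```

Here the mean is 4 days, so only the 9-day record counts as slow. Negating the mask with `~` is equivalent to the `<=` comparison as long as there are no missing values, which holds for us because we dropped nulls earlier.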

Plot our findings above on a map.

In [None]:
# plot with matplotlib
fig,ax=plt.subplots(figsize=(10,10))
plt.plot(slow_long_r, slow_lat_r, 'rs') # slow repair
plt.plot(fast_long_r, fast_lat_r, 'gs') # fast repair
plt.plot(slow_long, slow_lat, 'b.') # slow investigation
plt.plot(fast_long, fast_lat, 'k.') # fast investigation
# display mplleaflet within notebook
#mplleaflet.display()

mplleaflet.show(path='pot_holes.html')

### Final Visualization
We see slow repairs in the areas near the downtown core. We also see that the city is fast at investigating reports in the NW side of the city.

The scope of this tutorial was to cover the steps for gathering data and preparing it for analysis. We did not delve much into the analysis itself, but the unexplored columns in our Pandas dataframe are now cleaned and available for you to explore on your own. One interesting next step would be to look at the photos of pot holes submitted with the reports (under the media_url column). Image analysis is a fascinating area of data science with many open source projects to get involved with.