<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

<table align="left">
    <tr><td>
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a></td><td>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</td>
    </tr>
    <tr><td>Jacques Roy, Byte Size Data Science</td><td> </td></tr>
    </table>

# Time Series Exploration
In this notebook, we look at time series to get a feel for what they are.

In [None]:
# youtube video related to this notebook
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/fa7rHy7YmWU?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)

In [None]:
# Libraries needed in the notebook
# import urllib3, requests, json
import pandas as pd
import numpy as np
import datetime as dt
import dateutil.parser

# import io

# pd.set_option('display.max_colwidth', -1)

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches
%matplotlib inline


## Getting the data
In this section, we get data from the city of Chicago.

Once we have the data, we can save it in a file in the project so we don't have to download it again.
This is recommended since this data will be used in future notebooks.

We'll get the data dynamically using the Socrata API. See also: <a href="https://github.com/jacquesroy/byte-size-data-science/blob/master/Notebooks/W005-FindingData.ipynb" target="_blank">W005-FindingData.ipynb</a>

The dataset attributes include:

|  |  |  |  |
| :--- | :--- | :--- | :--- 
| `alignment` | `beat_of_occurrence` | *crash_date* | `crash_date_est_i`
| *crash_day_of_week* | *crash_hour* | *crash_month* | `crash_record_id`
|`crash_type` | `damage` | `date_police_notified` | `device_condition`
| `dooring_i` | `first_crash_type` | `hit_and_run_i` | `injuries_fatal`
| `injuries_incapacitating` | `injuries_no_indication` | | 
| `injuries_non_incapacitating` | `injuries_reported_not_evident` | | |
| `injuries_total` | `injuries_unknown` | `intersection_related_i` | `lane_cnt`
| **`latitude`** | `lighting_condition` | `location` | **`longitude`**
| `most_severe_injury` | `num_units` | `photos_taken_i` | 
| `posted_speed_limit` | `prim_contributory_cause` | `private_property_i` | 
| `rd_no` | `report_type` | `road_defect` | `roadway_surface_cond`
| `sec_contributory_cause` | `statements_taken_i` | `street_direction` | 
| **`street_name`** | **`street_no`** | `traffic_control_device` | `trafficway_type`
| `weather_condition` | `work_zone_i` | `work_zone_type` | `workers_present_i` 

For our purpose, we limit ourselves to a small subset of these attributes.

In [None]:
# Library used to read datasets
# https://github.com/xmunoz/sodapy
!pip install sodapy > pipsodapy.txt 2>&1

from sodapy import Socrata

In [None]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)

### Limit the data we read
We'll limit ourselves to 2019 and 2020.
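
The notebook reads the filtered result in pages of 10,000 rows. As a minimal sketch of that paging pattern, the helper below keeps requesting pages until an empty one comes back. The names `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical and only stand in for a real request, so the example runs without network access.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]

def fetch_all(fetch_page: Callable[[int, int], List[Row]],
              page_size: int = 10000) -> Iterator[Row]:
    """Yield every row by requesting consecutive pages until one is empty."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield from page
        offset += page_size

# Stand-in for a remote dataset (made-up rows, no network access needed)
fake_rows = [{"crash_hour": str(i % 24)} for i in range(25)]

def fake_fetch(offset: int, limit: int) -> List[Row]:
    # Return at most `limit` rows starting at position `offset`
    return fake_rows[offset:offset + limit]

rows = list(fetch_all(fake_fetch, page_size=10))
print(len(rows))
```

With the Socrata client used in this notebook, `fetch_page` could wrap a call such as `client.get("85ca-t3if", select=select, where=where, offset=offset, limit=limit)`.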

In [None]:
from datetime import date

# We use fixed dates (rather than today's date) so that future comparisons stay reproducible
two_years = (date(2019,1,1)).strftime('%Y-%m-%d')
where = "crash_date >= '{}' and crash_date < '{}'".format(two_years, date(2021,1,1))
select = "crash_date,crash_month,crash_day_of_week,crash_hour,injuries_fatal,injuries_total"

In [None]:
# https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if
select = "crash_date,crash_month,crash_day_of_week,crash_hour,injuries_fatal,injuries_total"

# Read the data in pages of 10,000 rows until an empty page is returned
pages = []
offset = 0
result = client.get("85ca-t3if", select=select, where=where, offset=offset, limit=10000)
while len(result) > 0:
    pages.append(pd.DataFrame(result))
    offset += 10000
    result = client.get("85ca-t3if", select=select, where=where, offset=offset, limit=10000)
# DataFrame.append was removed in pandas 2.0, so we concatenate the pages instead
crashes_df = pd.concat(pages, sort=True, ignore_index=True)

print("Number of records: {}, number of columns: {}".format(crashes_df.shape[0], crashes_df.shape[1]))

### Convert to proper data types
The data returned through this interface is only character strings. We need to convert them to the proper types.

The `crash_date` includes hour, minute, and second. We only need the date part.

Note that `crash_day_of_week` starts with 1 for Sunday.
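
As a small illustration of these conversions, the sketch below uses a made-up two-row frame of strings (not real crash records). It also checks the Sunday = 1 convention against pandas, which numbers Monday as 0. The `day_names` list and the `derived_dow` formula are assumptions added for the example.

```python
import pandas as pd

# Made-up rows shaped like the API output: every value is a string or missing
raw = pd.DataFrame({
    "crash_date": ["2019-01-01T17:05:00.000", "2019-01-06T08:30:00.000"],
    "crash_day_of_week": ["3", "1"],
    "injuries_total": ["2", None],
})

typed = raw.copy()
# Parse the timestamp and keep only the date part
typed["crash_date"] = pd.to_datetime(typed["crash_date"]).dt.floor("D")
typed["crash_day_of_week"] = typed["crash_day_of_week"].astype(int)
# Missing counts become NaN first, then 0, then integers
typed["injuries_total"] = pd.to_numeric(typed["injuries_total"]).fillna(0).astype(int)

# pandas counts Monday as 0; the dataset counts Sunday as 1
derived_dow = (typed["crash_date"].dt.dayofweek + 1) % 7 + 1

day_names = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]
typed["day_name"] = typed["crash_day_of_week"].map(lambda d: day_names[d - 1])
print(typed)
```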

In [None]:
# injuries_fatal and injuries_total are converted to floats because of issues with NaN/Null values
crashes_df2 = crashes_df.astype({'crash_date': 'datetime64[ns]' , 'crash_month': int, 'crash_day_of_week': int,
                                 'crash_hour': int,'injuries_fatal': float, 'injuries_total': float})
crashes_df2['crash_date'] = crashes_df2['crash_date'].dt.floor('D') # Keep only the date part

crashes_df2['injuries_fatal'] = crashes_df2['injuries_fatal'].fillna(0)
crashes_df2['injuries_total'] = crashes_df2['injuries_total'].fillna(0)
crashes_df2['injuries_fatal'] = crashes_df2['injuries_fatal'].astype(int)
crashes_df2['injuries_total'] = crashes_df2['injuries_total'].astype(int)

crashes_df2.dtypes

## Write Data to Cloud Storage
You need to import your cloud storage credentials.

The easiest way is to open the file section on the right and use `Insert to code`
to add the credentials to an empty cell.<br/>
Make sure the name of the variable is `credentials`
and not `credentials_1`.

In [None]:
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials = {
    'IAM_SERVICE_ID': 'iam-ServiceId-9d918923-e68f-46df-b2ab-3c23479e0bee',
    'IBM_API_KEY_ID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.cloud.ibm.com/oidc/token',
    'BUCKET': 'bscs2project-donotdelete-pr-vafl0cosn5bcq1',
    'FILE': 'titanic.csv'
}

In [None]:
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

In [None]:
# Save crashes
crashes_df2.to_csv('crashes.csv',index=False)
# Save the file to COS
cos.upload_file('crashes.csv', Bucket=credentials['BUCKET'],Key='crashes.csv')
!rm crashes.csv

### Adding the file to the project
At this point, the file is added to your cloud storage. You need to also add it to your project.

When you are in the project, open the files tab, select `crashes.csv`, and add it to the project.

## Read the Data from the Cloud Object Storage
If the file was already created, we can read it from Cloud Object Storage instead of going through Socrata.

Select an empty cell, then open the files tab, use `Insert to code`, and select `pandas DataFrame`.

In [None]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_f4a37e7b459c4faa8b0ceeeb172a28a7 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='CvdDbJdbPIpTlcNRUrp6',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_f4a37e7b459c4faa8b0ceeeb172a28a7.get_object(Bucket='bscs2project-donotdelete-pr-vafl0cosn5bcq1',Key='crashes.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

crashes_df2 = pd.read_csv(body)
crashes_df2['crash_date'] = pd.to_datetime(crashes_df2['crash_date'])
print('Number of records: {}'.format(crashes_df2.shape[0]))
crashes_df2.head()

## Grouping the data - Daily
We need to group the data per day (for now) and differentiate between fatalities, injuries, and only property damage.

For future analysis, we could also add the number of injuries and number of fatalities per day.
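
One possible way to add those per-day counts is sketched below on a made-up four-crash frame. The column names `people_injured` and `people_killed` are hypothetical and chosen only for this example. Each crash falls into exactly one of the three categories, so `total` is the number of crashes per day.

```python
import pandas as pd

# Made-up crashes: three on January 1 and one on January 2
toy = pd.DataFrame({
    "crash_date": pd.to_datetime(["2019-01-01", "2019-01-01",
                                  "2019-01-01", "2019-01-02"]),
    "injuries_total": [0, 2, 1, 0],
    "injuries_fatal": [0, 0, 1, 0],
})

# One flag per crash, mutually exclusive
toy["fatalities"] = (toy.injuries_fatal > 0).astype(int)
toy["injuries"] = ((toy.injuries_total > 0) & (toy.injuries_fatal == 0)).astype(int)
toy["damage"] = ((toy.injuries_total == 0) & (toy.injuries_fatal == 0)).astype(int)

daily = toy.groupby("crash_date").agg(
    fatalities=("fatalities", "sum"),           # crashes with at least one death
    injuries=("injuries", "sum"),               # crashes with injuries but no death
    damage=("damage", "sum"),                   # crashes with property damage only
    people_injured=("injuries_total", "sum"),   # sum of injuries_total per day
    people_killed=("injuries_fatal", "sum"),    # sum of injuries_fatal per day
)
daily["total"] = daily[["fatalities", "injuries", "damage"]].sum(axis=1)
print(daily)
```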

In [None]:
# Add columns to reflect the type of accident
crashes_df2['fatalities'] = (crashes_df2.injuries_fatal > 0).astype(int)
crashes_df2['injuries'] = ((crashes_df2.injuries_total > 0 ) & (crashes_df2.injuries_fatal == 0)).astype(int)
crashes_df2['damage'] = ((crashes_df2.injuries_total == 0) & (crashes_df2.injuries_fatal == 0)).astype(int)
crashes_df2.head()

In [None]:
# Aggregate accidents 
crashes_df3 = crashes_df2[['crash_date','crash_day_of_week','crash_month','fatalities','injuries','damage']].\
                        groupby(by='crash_date').agg({'crash_day_of_week': 'min','crash_month': 'min',\
                                                      'fatalities': 'sum','injuries': 'sum','damage': 'sum' } )
# Add a total column to have the total number of accidents per day
crashes_df3['total'] = crashes_df3[['fatalities','injuries','damage']].sum(axis=1)
crashes_df3.head()

## Plot the daily accidents

In [None]:
# offset = int((9 - crashes_df3.iloc[0].crash_day_of_week) % 7) # Offset to start on a Monday
offset = 0

plt.figure(figsize=(18,6))
crashes_df3['total'][offset:].plot(kind='line',style='.-',grid=True)
plt.title('Chicago Daily accident totals from {} to {}'.format(crashes_df3.index[offset].date(),crashes_df3.index[crashes_df3.shape[0] - 1].date()) )
plt.show()

### First observation
From plotting the time series, we can see that something significant happened in March 2020.

This coincides with the beginning of the pandemic lockdown.

We also see that from April to July 2020, the daily accidents went back up almost to the pre-pandemic level. 
We can see three separate sections to the time series:
- beginning to March 2020
- March 2020 to July 2020
- July 2020 to the end

Still, even after July 2020, the series remains in an uncertain period.


## Working on 2019
For now, we'll look at the data from 2019.

In [None]:
df2019 = crashes_df3[(crashes_df3.index >= '2019-01-01') & (crashes_df3.index < '2020-01-01')].copy(deep=True)

### Add linear regression for the total number of accidents 
We can then see if there is an upward or downward trend.
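
To show how a fitted slope exposes a trend, here is a minimal sketch on synthetic daily counts (not the Chicago data). It uses `np.polyfit` as a lightweight alternative to scikit-learn; the level of 300, the rise of 0.05 per day, and the noise are made-up values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", "2019-12-31", freq="D")

# Synthetic daily counts: a level of 300, a rise of 0.05 per day, plus noise
totals = pd.Series(
    300 + 0.05 * np.arange(len(days)) + rng.normal(0, 5, len(days)),
    index=days,
)

# Fit a straight line against the day number (0, 1, 2, ...)
x = np.arange(len(totals))
slope, intercept = np.polyfit(x, totals.to_numpy(), deg=1)
trend = pd.Series(intercept + slope * x, index=totals.index)

change = trend.iloc[-1] - trend.iloc[0]
print(f"slope per day: {slope:.3f}, change over the year: {change:.1f}")
```

A slope close to zero means the fitted line is nearly flat. A positive slope indicates an upward trend and a negative slope a downward one.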

In [None]:
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()  # create object for the class
X = pd.Series(range(df2019.shape[0])).values.reshape(df2019.shape[0],1)
Y = df2019['total'].values.reshape(df2019.shape[0],1)

linear_regressor.fit(X, Y)  # perform linear regression
Y_pred = linear_regressor.predict(X)  # make predictions
df2019['regression'] = np.reshape(Y_pred, Y_pred.shape[0])

In [None]:
# offset = int((9 - df2019.iloc[0].crash_day_of_week) % 7) # If we wanted to start on a Monday
offset = 0

plt.figure(figsize=(18,6))
df2019['total'][offset:].plot(kind='line',style='.-',grid=True)
df2019['regression'][offset:].plot(kind='line', color='red',grid=True) # linewidth=3

plt.title('Daily accident totals from {} to {}'.format(df2019.index[offset].date(),df2019.index[df2019.shape[0] - 1].date()) )
plt.show()

In [None]:
# See if the regression is flat
print(
    "Regression at the beginning: {:7.3f} , end: {:7.3f}, difference: {:4.3f}".format(
        df2019.regression.iloc[0],
        df2019.regression.iloc[-1],
        df2019.regression.iloc[-1] - df2019.regression.iloc[0])
)

## Distribution
To look at the distribution, we first need to adjust the totals based on the trend (regression slope).
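
As a sketch of what adjusting for the trend means, the example below removes a fitted line from a synthetic series and then checks that the remaining slope is essentially zero. The data (a rise of 2 per day plus a small wiggle) is made up for illustration.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2019-01-01", periods=10, freq="D")
x = np.arange(len(days))

# Synthetic totals: a rise of 2 per day plus a small alternating wiggle
totals = pd.Series(100 + 2 * x + np.where(x % 2 == 0, 1, -1), index=days)

slope, intercept = np.polyfit(x, totals.to_numpy(), deg=1)
regression = pd.Series(intercept + slope * x, index=days)

# Subtract the growth of the fitted line since the first day
detrended = totals - (regression - regression.iloc[0])

# Refit: the slope of the adjusted series should be essentially zero
residual_slope = np.polyfit(x, detrended.to_numpy(), deg=1)[0]
print(f"slope before: {slope:.3f}, slope after: {abs(residual_slope):.3f}")
```

Subtracting `regression - regression.iloc[0]` keeps the series at its starting level, so the histogram stays on the original scale.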

In [None]:
plt.figure(figsize=(18,6))
# Subtract the growth of the regression line since the first day to remove the trend
(df2019.total - (df2019.regression - df2019.regression.iloc[0])).hist(bins=20)

### Observation:
- This looks like a normal distribution

## Visualize outliers
For example, we can see that January 12 and November 12 are obviously outliers. Let's be more formal.

#### <font color='red'>Warning!</font>

We can do that because the data appears to be stable throughout the year.<br/>
This is not possible in all cases. For example, if we had both years (2019,2020), that would not work.

We'll see more on that later.

In [None]:
standard_dev = df2019.total.std().item()
median = df2019.total.median().item()

In [None]:
# offset = int((9 - df2019.iloc[0].crash_day_of_week) % 7)
offset = 0
df2019_cnt = df2019.shape[0] - offset

median_ser = pd.Series([median] * df2019_cnt)
median_ser.index = df2019.index[offset:]

plt.figure(figsize=(18,6))
df2019['total'][offset:].plot(kind='line',style='.-',grid=True)
df2019['regression'][offset:].plot(kind='line', color='red',grid=True,legend=False,label="regr") # linewidth=3
median_ser.plot(kind='line', color='cyan',grid=True,legend=False,label="median") # linewidth=3

(median_ser + standard_dev).plot(kind='line', color='olive',grid=True,legend=False,label="+1-std") # linewidth=3
(median_ser - standard_dev).plot(kind='line', color='olive',grid=True,legend=False, label="") # linewidth=3

(median_ser + (2 * standard_dev)).plot(kind='line', color='green',grid=True,legend=False,label="2-std") # linewidth=3
(median_ser - (2 * standard_dev)).plot(kind='line', color='green',grid=True,legend=False, label="") # linewidth=3

(median_ser + (3 * standard_dev)).plot(kind='line', color='orange',grid=True,legend=False,label="3-std") # linewidth=3
(median_ser - (3 * standard_dev)).plot(kind='line', color='orange',grid=True,legend=False, label="") # linewidth=3

plt.title('Daily accident totals from {} to {}'.format(df2019.index[0].date(),df2019.index[df2019.shape[0] - 1].date()) )
plt.show()

### Observations:
In a normal distribution (which we seem to have), we see:
- Within one standard deviation   : 68.3% of values
- Within two standard deviations  : 95.4% of values
- Within three standard deviations: 99.7% of values

We have a few points past three standard deviations and several more past two.<br/>
Of course, some of them are in the fewer accidents category so we're happy with that.<br/>
We may want to find out why these values are outliers.
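
The shares quoted above follow from the normal distribution: the probability of falling within k standard deviations is erf(k/√2). The sketch below computes them and then flags values that lie more than k standard deviations from the median, as in the plot. The helper `flag_outliers` and the toy series are hypothetical, added only for illustration.

```python
import math
import pandas as pd

# Theoretical share of values within k standard deviations of the center
for k in (1, 2, 3):
    share = math.erf(k / math.sqrt(2))
    print(f"within {k} standard deviation(s): {share * 100:.1f}%")

def flag_outliers(series: pd.Series, k: float = 2.0) -> pd.Series:
    """Return the values lying more than k standard deviations from the median."""
    center = series.median()
    spread = series.std()
    return series[(series - center).abs() > k * spread]

# Made-up daily totals with one obvious spike
toy = pd.Series([300, 310, 295, 305, 298, 302, 450, 301, 299, 304])
outliers = flag_outliers(toy, k=2)
print(outliers)
```

On the notebook's data, a call such as `flag_outliers(df2019.total, k=3)` would list the days to investigate.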

## Are some days of the week worse than others?

In [None]:
days=['Sunday', 'Monday','Tuesday','Wednesday','Thursday', 'Friday','Saturday']
avg_df = df2019[['crash_day_of_week','total']].groupby('crash_day_of_week').mean()
avg_df.index = days
print(avg_df)
avg_df.plot(kind='bar', legend=False)

## Is the worst always Friday?
To compare the days of the week on one chart, we first need to set all the dates to the same day of their week.
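
As a minimal sketch of that alignment, the example below shifts every date in two made-up weeks to the Tuesday of its Sunday-to-Saturday week, using the dataset's convention of 1 for Sunday. The toy `total` values are placeholders.

```python
import pandas as pd

# Two full Sunday-to-Saturday weeks of made-up totals
days = pd.date_range("2019-01-06", "2019-01-19", freq="D")
toy = pd.DataFrame({"total": range(len(days))}, index=days)

# Dataset convention: 1 = Sunday ... 7 = Saturday (pandas uses 0 = Monday)
toy["crash_day_of_week"] = (toy.index.dayofweek + 1) % 7 + 1

# Shift each date by its distance from Tuesday (day 3)
aligned = toy.copy()
shift = pd.to_timedelta(toy["crash_day_of_week"].to_numpy() - 3, unit="D")
aligned.index = toy.index - shift

print(aligned.index.unique())
```

After the shift, all seven days of a week share one x position, so each day of the week can be drawn as its own line over the same dates.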

In [None]:
# Set all the dates to the same day of the week (Tuesday here)
df2019_2 =df2019.copy(deep=True)
df2019_2.index = df2019.index - pd.to_timedelta((df2019['crash_day_of_week'] - 3), unit='D')

In [None]:
plt.figure(figsize=(18,10))
df2019_2[df2019_2.crash_day_of_week == 1]['total'].plot(kind='line',grid=True,legend=True,label="Sunday")
df2019_2[df2019_2.crash_day_of_week == 2]['total'].plot(kind='line',grid=True,legend=True,label="Monday")
df2019_2[df2019_2.crash_day_of_week == 3]['total'].plot(kind='line',grid=True,legend=True,label="Tuesday")
df2019_2[df2019_2.crash_day_of_week == 4]['total'].plot(kind='line',grid=True,legend=True,label="Wednesday")
df2019_2[df2019_2.crash_day_of_week == 5]['total'].plot(kind='line',grid=True,legend=True,label="Thursday")
df2019_2[df2019_2.crash_day_of_week == 6]['total'].plot(kind='line',grid=True,legend=True,label="Friday")
df2019_2[df2019_2.crash_day_of_week == 7]['total'].plot(kind='line',grid=True,legend=True,label="Saturday")

plt.title('Daily accident totals from {} to {} by day of the week'.format(df2019.index[0].date(),df2019.index[df2019.shape[0] - 1].date()) )
plt.show()

## Stationary time series?
Can we consider 2019 a stationary time series?

Many time series analysis techniques assume a stationary series.

To check this, we use the augmented Dickey-Fuller test.

See: https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test

We are using a confidence level of 95% or better.

In [None]:
import statsmodels.graphics.tsaplots as sgt
import statsmodels.tsa.stattools as sts
from statsmodels.tsa.seasonal import seasonal_decompose

In [None]:
# Function for color display
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))
    
def is_stationary(adf, name) :
    if (adf[1] < 0.05) :
        if (adf[0] < adf[4]['1%']) :
            print('The {} time series is stationary within the 1% margin'.format(name))
        elif (adf[0] < adf[4]['5%']) :
            print('The {} time series is stationary within the 5% margin'.format(name))
        else :
            printmd("The {} time series is <span style='color:{}'>**NOT**</span> stationary".format(name,'red'))
    else :
        printmd("The {} time series is <span style='color:{}'>**NOT**</span> stationary".format(name,'red'))
    return

In [None]:
adf = sts.adfuller(df2019.total)
print('adf: {}\npvalue: {}\nusedlag: {}\nnobs: {}'.format(adf[0],adf[1],adf[2],adf[3]))
print('critical values: {}\nicbest: {}'.format(adf[4],adf[5]))
is_stationary(adf, 'total')

### Info:
The `pvalue` is below 0.05 and `adf` is smaller than the 5% critical value.