<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#What-Are-Time-Series-Data?" data-toc-modified-id="What-Are-Time-Series-Data?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What Are Time Series Data?</a></span><ul class="toc-item"><li><span><a href="#Some-Examples" data-toc-modified-id="Some-Examples-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Some Examples</a></span></li><li><span><a href="#Uses-for-Time-Series" data-toc-modified-id="Uses-for-Time-Series-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Uses for Time Series</a></span></li><li><span><a href="#Example-Data" data-toc-modified-id="Example-Data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Example Data</a></span></li></ul></li><li><span><a href="#Datetime-Objects" data-toc-modified-id="Datetime-Objects-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Datetime Objects</a></span><ul class="toc-item"><li><span><a href="#Setting-Datetime-Objects-as-the-Index" data-toc-modified-id="Setting-Datetime-Objects-as-the-Index-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Setting Datetime Objects as the Index</a></span></li><li><span><a href="#Investigating-Time-Series-with-Datetime-Objects" data-toc-modified-id="Investigating-Time-Series-with-Datetime-Objects-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Investigating Time Series with Datetime Objects</a></span></li></ul></li><li><span><a href="#Resampling-Techniques" data-toc-modified-id="Resampling-Techniques-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Resampling Techniques</a></span><ul class="toc-item"><li><span><a href="#Aside:-Deeper-Exploration" data-toc-modified-id="Aside:-Deeper-Exploration-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Aside: Deeper Exploration</a></span></li></ul></li><li><span><a href="#Visualizing-Time-Series" 
data-toc-modified-id="Visualizing-Time-Series-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Visualizing Time Series</a></span><ul class="toc-item"><li><span><a href="#Showing-Changes-Over-Time" data-toc-modified-id="Showing-Changes-Over-Time-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Showing Changes Over Time</a></span><ul class="toc-item"><li><span><a href="#Line-Plot" data-toc-modified-id="Line-Plot-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Line Plot</a></span></li><li><span><a href="#Dot-Plot" data-toc-modified-id="Dot-Plot-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Dot Plot</a></span></li><li><span><a href="#Grouping-Plots" data-toc-modified-id="Grouping-Plots-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Grouping Plots</a></span><ul class="toc-item"><li><span><a href="#All-Annual-Separated" data-toc-modified-id="All-Annual-Separated-5.1.3.1"><span class="toc-item-num">5.1.3.1&nbsp;&nbsp;</span>All Annual Separated</a></span></li><li><span><a href="#All-Annual-Together" data-toc-modified-id="All-Annual-Together-5.1.3.2"><span class="toc-item-num">5.1.3.2&nbsp;&nbsp;</span>All Annual Together</a></span></li></ul></li></ul></li><li><span><a href="#Showing-Distributions" data-toc-modified-id="Showing-Distributions-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Showing Distributions</a></span><ul class="toc-item"><li><span><a href="#Histogram" data-toc-modified-id="Histogram-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Histogram</a></span></li><li><span><a href="#Density" data-toc-modified-id="Density-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Density</a></span></li><li><span><a href="#Box-Plot" data-toc-modified-id="Box-Plot-5.2.3"><span class="toc-item-num">5.2.3&nbsp;&nbsp;</span>Box Plot</a></span></li><li><span><a href="#Heat-Maps" data-toc-modified-id="Heat-Maps-5.2.4"><span class="toc-item-num">5.2.4&nbsp;&nbsp;</span>Heat Maps</a></span><ul 
class="toc-item"><li><span><a href="#Example-of-how-heat-maps-are-useful" data-toc-modified-id="Example-of-how-heat-maps-are-useful-5.2.4.1"><span class="toc-item-num">5.2.4.1&nbsp;&nbsp;</span>Example of how heat maps are useful</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#EXTRAS" data-toc-modified-id="EXTRAS-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>EXTRAS</a></span><ul class="toc-item"><li><span><a href="#EDA" data-toc-modified-id="EDA-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>EDA</a></span></li></ul></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 1000)

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Objectives

- Understand the use case for time series data
- Manipulate datetime objects
- Understand different resampling techniques
- Implement different visualization techniques for time series data

# What Are Time Series Data?

> We can say data is a **time series** when the temporal information is a key focus of the data.

Data in a time series can stem from historical data or data that is dependent on past values.

## Some Examples

- Stock prices
- Atmospheric changes over the course of decades
- Audio samples
- Heart rate data

## Uses for Time Series

- Understand some underlying process
- Forecasting (what we'll mostly focus on)
- Imputation (filling missing "past" data)
- Anomaly detection

## Example Data

In [None]:
# Define a function that will help us load and
# clean up a dataset.

def load_trend(trend_name='football', country_code='us'):
    df = pd.read_csv('data/google-trends_'
                     + trend_name + '_'
                     + country_code
                     + '.csv').iloc[1:, :]
    df.columns = ['counts']
    df['counts'] = df['counts'].str.replace('<1', '0').astype(int)
    return df

In [None]:
df = load_trend(**{'trend_name': 'data-science', 'country_code': 'us'})
df.head()

Now we can do this with multiple time series!

In [None]:
trends = [
    {'trend_name': 'data-science', 'country_code': 'us'},
    {'trend_name': 'football', 'country_code': 'us'},
    {'trend_name': 'football', 'country_code': 'uk'},
    {'trend_name': 'coronavirus', 'country_code': 'us'},
    {'trend_name': 'trump', 'country_code': 'us'},
    {'trend_name': 'taxes', 'country_code': 'us'},
    {'trend_name': 'avengers', 'country_code': 'us'}
]

In [None]:
trend_dfs = [load_trend(**trend) for trend in trends]

# Datetime Objects

Datetime objects make our time series modeling lives easier.  They will allow us to perform essential data prep tasks with a few lines of code.  

We need our time series **index** to be datetime objects, since our models will rely on being able to identify the previous chronological value.

There is a `datetime` [library](https://docs.python.org/2/library/datetime.html) in the Python standard library, and `pandas` has its own datetime types (such as `Timestamp` and `DatetimeIndex`) as well as a `to_datetime()` function.

For time series modeling, the first step often is to make sure that the index is a datetime object.
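As a minimal sketch of that first step, the snippet below converts a column of date strings into a `DatetimeIndex` with `pd.to_datetime()`. The toy DataFrame and its column names (`Date`, `value`) are made up for illustration and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain strings
toy = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'value': [10, 12, 9],
})

# Convert the strings to datetimes, use them as the index,
# and drop the now-redundant string column
toy_ts = toy.set_index(pd.to_datetime(toy['Date'])).drop(columns='Date')

print(type(toy_ts.index).__name__)   # DatetimeIndex
print(toy_ts.index.month.tolist())   # [1, 1, 1]
```

Once the index is a `DatetimeIndex`, attributes such as `.month` and `.year` are available on the whole index at once.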

## Setting Datetime Objects as the Index

There are a few ways to **reindex** our series to datetime. 

We can use the `pandas.to_datetime()` function:

In [None]:
ts_no_datetime = pd.read_csv('data/Gun_Crimes_Heat_Map.csv')

In [None]:
ts_no_datetime.head()

In [None]:
ts_no_datetime.index

In [None]:
ts = ts_no_datetime.set_index(pd.to_datetime(ts_no_datetime['Date']), drop=True)

> Alternatively, we can parse the dates directly on import

In [None]:
ts = pd.read_csv('data/Gun_Crimes_Heat_Map.csv', index_col='Date', parse_dates=True)

In [None]:
print(f"Now our index is a {type(ts.index)}")

In [None]:
ts.head()

## Investigating Time Series with Datetime Objects

Datetime objects include aspects of the date as attributes, like month and year:

In [None]:
ts.index[0]

In [None]:
ts.index[0].month

In [None]:
ts.index[0].year

We can also use the date to directly slice the DataFrame

In [None]:
# Only data from 2021 onward (the slice includes 2021 itself)
ts.loc['2021':]

In [None]:
# Only data from this time period
ts['2020-02-01 00:00':'2020-02-01 01:00']

We can easily see now whether offenses happen, for example, during business hours.

In [None]:
fig, ax = plt.subplots()

# Extract the hour of day directly from the DatetimeIndex
ts['hour'] = ts.index.hour
# Flag offenses reported from 9:00 up to (but not including) 17:00
ts['business_hours'] = ts['hour'].between(9, 16)

# The mean of a boolean column is the share of True values
bh_ratio = ts['business_hours'].mean()

counts = ts['business_hours'].value_counts()
sns.barplot(x=counts.index, y=counts.values, ax=ax)

ax.set_title(f'{bh_ratio: 0.2%} of Offenses\n Happen Btwn 9 and 5');

# Resampling Techniques

> **Resampling** allows us to convert the time series into a particular frequency

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

With a Datetime index, we also have new abilities, such as **resampling**.

To create our timeseries, we will count the number of gun offenses reported per day.

In [None]:
ts.resample('D')

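There are many possible units for resampling, each with its own alias. (pandas 2.2 and later deprecate some of the spellings below: for example, `M` becomes `ME`, `A` becomes `YE`, and `H` becomes `h`.)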

<table style="display: inline-block">
    <caption style="text-align: center"><strong>TIME SERIES OFFSET ALIASES</strong></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>B</td><td>business day frequency</td></tr>
<tr><td>C</td><td>custom business day frequency (experimental)</td></tr>
<tr><td>D</td><td>calendar day frequency</td></tr>
<tr><td>W</td><td>weekly frequency</td></tr>
<tr><td>M</td><td>month end frequency</td></tr>
<tr><td>SM</td><td>semi-month end frequency (15th and end of month)</td></tr>
<tr><td>BM</td><td>business month end frequency</td></tr>
<tr><td>CBM</td><td>custom business month end frequency</td></tr>
<tr><td>MS</td><td>month start frequency</td></tr>
<tr><td>SMS</td><td>semi-month start frequency (1st and 15th)</td></tr>
<tr><td>BMS</td><td>business month start frequency</td></tr>
<tr><td>CBMS</td><td>custom business month start frequency</td></tr>
<tr><td>Q</td><td>quarter end frequency</td></tr>
<tr><td></td><td><font color=white>intentionally left blank</font></td></tr></table>

<table style="display: inline-block; margin-left: 40px">
<caption style="text-align: center"></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>BQ</td><td>business quarter end frequency</td></tr>
<tr><td>QS</td><td>quarter start frequency</td></tr>
<tr><td>BQS</td><td>business quarter start frequency</td></tr>
<tr><td>A</td><td>year end frequency</td></tr>
<tr><td>BA</td><td>business year end frequency</td></tr>
<tr><td>AS</td><td>year start frequency</td></tr>
<tr><td>BAS</td><td>business year start frequency</td></tr>
<tr><td>BH</td><td>business hour frequency</td></tr>
<tr><td>H</td><td>hourly frequency</td></tr>
<tr><td>T, min</td><td>minutely frequency</td></tr>
<tr><td>S</td><td>secondly frequency</td></tr>
<tr><td>L, ms</td><td>milliseconds</td></tr>
<tr><td>U, us</td><td>microseconds</td></tr>
<tr><td>N</td><td>nanoseconds</td></tr></table>

When resampling, we have to provide a rule to resample by, and an **aggregate function**.

**To upsample** is to increase the frequency of the data of interest.  
**To downsample** is to decrease the frequency of the data of interest.
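To make the two directions concrete, here is a small sketch on a made-up series (the values and dates are assumptions for illustration only). Downsampling needs an aggregate function such as `sum`, while upsampling creates empty slots that have to be filled somehow; forward fill is used here as one option.

```python
import pandas as pd

# Hypothetical toy series: one value every 12 hours for three days
idx = pd.date_range('2020-01-01', periods=6, freq='12h')
toy = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample: 12-hourly -> daily, aggregating each day's values with sum
daily = toy.resample('D').sum()
print(daily.tolist())   # [3, 7, 11]

# Upsample: 12-hourly -> 6-hourly; the new slots have no data, so we
# forward-fill them from the most recent observation
six_hourly = toy.resample('6h').ffill()
print(len(six_hourly))  # 11
```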

For our purposes, we will downsample and count the number of occurrences per day.

In [None]:
ts.resample('D').count()

Our time series will consist of a series of counts of gun reports per day.

In [None]:
# ID is unimportant. count() tallies non-null values, so any fully populated column gives the same counts.
ts = ts.resample('D').count()['ID']

In [None]:
ts

Let's visualize our timeseries with a plot.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ts.index, ts.values)
ax.set_title('Gun Crimes per day in Chicago')
ax.set_ylabel('Reported Gun Crimes');

There seems to be some abnormal activity happening towards the end of our series.

**[sun-times](https://chicago.suntimes.com/crime/2020/6/8/21281998/chicago-deadliest-day-violence-murder-history-police-crime)**

## Aside: Deeper Exploration

In [None]:
ts.sort_values(ascending=False)[:10]

Let's treat the span of days from 5-31 to 6-03 as outliers. 

There are several ways to do this, but let's first remove the outliers, and then populate an empty array with the original date range. That will introduce us to the `pandas.date_range()` function.

In [None]:
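# Drop the outlier days (counts of 90 or more)
daily_count = ts[ts < 90]
# Build the complete daily date range spanning the remaining data
ts_dr = pd.date_range(daily_count.index[0], daily_count.index[-1])
# Start from an all-NaN series over the full range, then fill in the kept days
ts_daily = pd.Series(np.nan, index=ts_dr)
ts = ts_daily.fillna(daily_count)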

In [None]:
ts

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ts.plot(ax=ax)
ax.set_title('Gun Crimes in Chicago with Deadliest Days Removed');

Let's zoom in on that week:

In [None]:
fig, ax = plt.subplots()
ax.plot(ts[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax.tick_params(rotation=45)
ax.set_title('We have some gaps now');

With a datetime index, pandas gives us several options for filling those gaps:

In [None]:
# .ffill()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5))
ax1.plot(ts.ffill()[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax1.tick_params(rotation=45)
ax1.set_title('Forward Fill')

ax2.plot(ts[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax2.tick_params(rotation=45)
ax2.set_title('Original');

In [None]:
# .bfill()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5))
ax1.plot(ts.bfill()[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax1.tick_params(rotation=45)
ax1.set_title('Back Fill')

ax2.plot(ts[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax2.tick_params(rotation=45)
ax2.set_title('Original');

In [None]:
# .interpolate()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5))
ax1.plot(ts.interpolate()[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax1.tick_params(rotation=45)
ax1.set_title('Interpolation')

ax2.plot(ts[(ts.index > '2020-05-20') 
                 & (ts.index < '2020-06-07')]
       )
ax2.tick_params(rotation=45)
ax2.set_title('Original');
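To compare the three fill strategies numerically rather than visually, here is a small sketch on a made-up series with a two-day gap. The dates and values are assumptions chosen so the differences are easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a two-day gap
idx = pd.date_range('2020-06-01', periods=5, freq='D')
toy = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

comparison = pd.DataFrame({
    'original': toy,
    'ffill': toy.ffill(),              # carry the last known value forward
    'bfill': toy.bfill(),              # pull the next known value backward
    'interpolate': toy.interpolate(),  # straight line between neighbors
})
print(comparison)
```

Forward fill repeats 10 across the gap, back fill repeats 40, and linear interpolation fills in 20 and 30.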

Let's proceed with the interpolated data.

In [None]:
ts = ts.interpolate()
ts.isna().sum()

Now that we've cleaned up a few data points, let's downsample to the week level.  

In [None]:
ts_weekly = ts.resample('W').mean()

In [None]:
ts_weekly.plot();

# Visualizing Time Series

There can be a lot of information to be found in time series! Visualizations can help us tease out this information into something we can more easily observe.

## Showing Changes Over Time

We can identify patterns and trends with visualizations.

In [None]:
# New York Stock Exchange average monthly returns [1961-1966] from curriculum
nyse = pd.read_csv("data/NYSE_monthly.csv")
col_name= 'Month'
nyse[col_name] = pd.to_datetime(nyse[col_name])
nyse.set_index(col_name, inplace=True)

In [None]:
display(nyse.head(10))
# info() prints its summary and returns None, so call it directly
nyse.info()

### Line Plot

In [None]:
nyse.plot(figsize = (16,6))
plt.show()

### Dot Plot

In [None]:
nyse.plot(figsize = (16,6), style="*")
plt.show()

> Note the difference between this and the line plot.
>
> When might you want a dot vs a line plot?

### Grouping Plots

What if we wanted to look at year-to-year changes (e.g., temperature throughout many years)?

There are a couple options to choose from.

#### All Annual Separated

In [None]:
# Group the monthly returns by calendar year and check the size of each group
nyse.groupby(nyse.index.year).size()

In [None]:
# Annual Frequency
year_groups = nyse.groupby(pd.Grouper(freq ='A'))

#Create a new DataFrame and store yearly values in columns 
nyse_annual = pd.DataFrame()

for yr, group in year_groups:
    nyse_annual[yr.year] = group.values.ravel()
    
# Plot the yearly groups as subplots
nyse_annual.plot(figsize = (13,8), subplots=True, legend=True)
plt.show()

#### All Annual Together

In [None]:
# Plot overlapping yearly groups 
nyse_annual.plot(figsize = (15,5), subplots=False, legend=True)
plt.show()
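The loop used to build `nyse_annual` only works when every year has the same number of observations. As an alternative sketch (not the lesson's method), the same one-column-per-year layout can be built with `DataFrame.pivot()`, which aligns rows by month and leaves `NaN` where a year is missing a value. The toy data and the `year`, `month`, and `value` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering two full years
idx = pd.date_range('2019-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'value': np.arange(24)}, index=idx)

# Add explicit year and month columns, then pivot:
# one row per month, one column per year
by_year = (
    toy.assign(year=toy.index.year, month=toy.index.month)
       .pivot(index='month', columns='year', values='value')
)
print(by_year.shape)   # (12, 2)
```

The result can be plotted the same way, for example with `by_year.plot(subplots=True)`.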

## Showing Distributions

Sometimes the distribution of the values is important.

What are some reasons?

- Checking for normality (for stat testing)
- First check on raw & transformed data
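Alongside the plots, a quick numeric summary can support a normality check. As a rough sketch, pandas' `Series.skew()` and `Series.kurt()` report skewness and excess kurtosis, both of which are near 0 for normally distributed data. The sample here is simulated and is not the lesson's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.normal(size=10_000))

# For normally distributed data both statistics should be close to 0
print(f"skewness:        {sample.skew():.3f}")
print(f"excess kurtosis: {sample.kurt():.3f}")
```

Values far from 0 suggest a transformation may be worth trying before running tests that assume normality.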

### Histogram

In [None]:
nyse.hist(figsize = (10,6))
plt.show()

In [None]:
# Use fewer bins to make it more obvious whether the distribution looks normal
nyse.hist(figsize = (10,6), bins = 7)
plt.show()

### Density

In [None]:
nyse.plot(kind='kde', figsize = (15,10))
plt.show()

### Box Plot

- Shows distribution over time
- Can help show outliers
- Seasonal trends

In [None]:
# Generate a box and whiskers plot for the nyse_annual dataframe
nyse_annual.boxplot(figsize = (12,7))
plt.show()

### Heat Maps

Use color to show patterns throughout a time period for data

#### Example of how heat maps are useful

In [None]:
df_temp = pd.read_csv(
    'data/min_temp.csv',        # Data to read
    index_col=0,                # Use the first column as index ('Date')
    parse_dates=True,           # Have Pandas parse the dates
    infer_datetime_format=True, # Make Pandas try to parse dates automatically (deprecated since pandas 2.0)
    dayfirst=True               # Important to know format is DD/MM
)

In [None]:
display(df_temp.head())
# info() prints its summary and returns None, so call it directly
df_temp.info()

In [None]:
# Create a new DataFrame and store yearly values in columns for temperature
temp_annual = pd.DataFrame()

for yr, group in df_temp.groupby(pd.Grouper(freq ='A')):
    temp_annual[yr.year] = group.values.ravel()

##### Plotting each line plot in a subplot

Let's use our strategy in plotting multiple line plots to see if we can see a pattern:

In [None]:
# Plot the yearly groups as subplots
temp_annual.plot(figsize = (16,8), subplots=True, legend=True)
plt.show()

You will likely have a hard time seeing exactly what the temperature shift is throughout the year (if it even exists!).

We can try plotting all the lines together to see if a pattern is more obvious in our visual.

##### Plotting all line plots in one plot

In [None]:
# Plot overlapping yearly groups 
temp_annual.plot(figsize = (15,5), subplots=False, legend=True)
plt.show()

That's great: we can see that the temperature decreases in the middle of the year! But now we have sacrificed the ability to observe any pattern for an individual year.

This is where using a heat map can help visualize patterns throughout the year for temperature! And of course, the heat map can be used for more than just temperature related data.

##### And finally, using a heat map to visualize a pattern

In [None]:
# Rows are years, columns are days of the year
year_matrix = temp_annual.T
plt.matshow(year_matrix, interpolation=None, aspect='auto', cmap=plt.cm.Spectral_r)
plt.show()

☝🏼 Look at that beautiful visual pattern! Makes me want to weep with joy for all the information density available to us!

# Level Up

## EDA

Let's import some data on **gun violence in Chicago**.

[source](https://data.cityofchicago.org/Public-Safety/Gun-Crimes-Heat-Map/iinq-m3rg)

In [None]:
ts = pd.read_csv('data/Gun_Crimes_Heat_Map.csv')

In [None]:
ts.head()

Let's look at some summary stats:

In [None]:
print(f"There are {ts.shape[0]} records in our timeseries")

In [None]:
# Definitely some messy input in our Description data
ts['Description'].value_counts()

In [None]:
height = ts['Description'].value_counts()[:10]
offense_names = ts['Description'].value_counts()[:10].index

fig, ax = plt.subplots()
# Recent seaborn versions require x and y to be passed as keywords
sns.barplot(x=height.values, y=offense_names, color='r', ax=ax)
ax.set_title('Mostly Handgun offenses');

In [None]:
# Mostly non-domestic offenses

fig, ax = plt.subplots()
sns.barplot(x=ts['Domestic'].value_counts().index,
            y=ts['Domestic'].value_counts().values,
            palette=['r', 'b'], ax=ax
           )

ax.set_title("Overwhelmingly Non-Domestic Offenses");

In [None]:
# Share of offenses that result in an arrest (assumes 'Arrest' is a boolean column)

arrest_rate = ts['Arrest'].mean()

fig, ax = plt.subplots()

sns.barplot(x=ts['Arrest'].value_counts().index,
            y=ts['Arrest'].value_counts().values,
            palette=['r', 'g'], ax=ax
           )

ax.set_title(f'{arrest_rate: 0.2%} of Total Cases\n Result in Arrest');

In [None]:
fig, ax = plt.subplots()
sns.barplot(x=ts['Year'].value_counts().index,
            y=ts['Year'].value_counts().values,
            color='r', ax=ax
           )

ax.set_title("Offenses By Year");

While this does show some interesting information that will be relevant to our time series analysis, we are going to get more granular.