# Sect 37: Intro to Time Series

- online-ds-ft-070620
- 10/15/20

## Learning Objectives:

- Learn how to load time series data into pandas
- Learn how to plot time series in pandas
- Learn how to resample at different time frequencies
- Learn about types of time series trends and how to remove them.
- Learn about seasonal decomposition

- Prepare a time series dataset to use for modeling next class

## Questions?

# Intro to Time Series

## References

- [Pandas Timeseries Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- [Timeseries Offset Aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)
- [Anchored Offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets)
- [pandas.Timestamp API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html)

## Working with Time Series

In [None]:
# !pip install -U fsds 
from fsds.imports import *

import warnings
warnings.filterwarnings('ignore')

## Setting figures to timeseries-friendly
mpl.rcParams['figure.figsize'] = (12,6)

import os,sys

In [None]:
sns.__version__

### Creating a Time Series from a DataFrame

In [None]:
df = pd.read_csv('baltimore_crime_2020.csv',low_memory=False,
                usecols=range(12))
#df = fs.datasets.load_ts_baltimore_crime_full(read_csv_kwds={
#     'low_memory':False,'usecols':list(range(12)) })
df.head()

## Preparing Data for Time Series Visualization

- The index must be a `DatetimeIndex`

In [None]:
df.index

> #### Question: We need to make a datetime index from the CrimeDate and CrimeTime columns. How might we do that?
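
One possible answer, sketched on a tiny hand-made frame rather than the real file: join the two columns as strings and pass the result to `pd.to_datetime`. The sample rows, the `demo` name, and the exact formats of `CrimeDate` and `CrimeTime` are assumptions; check them against `df.head()` before relying on the `format` string.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV; the formats are assumed
demo = pd.DataFrame({
    'CrimeDate': ['01/02/2020', '01/01/2020', '01/03/2020'],
    'CrimeTime': ['14:30:00', '08:05:00', '23:59:00'],
    'Description': ['SHOOTING', 'BURGLARY', 'SHOOTING'],
})

# Join the two string columns, then parse them in one pass
combined = demo['CrimeDate'] + ' ' + demo['CrimeTime']
demo['datetime'] = pd.to_datetime(combined, format='%m/%d/%Y %H:%M:%S',
                                  errors='coerce')  # unparseable -> NaT

# Use the parsed column as the index and sort it chronologically
demo = demo.set_index('datetime').sort_index()
print(type(demo.index).__name__)
```

With `errors='coerce'`, malformed timestamps become `NaT` instead of raising, so it is worth checking `demo.index.isna().sum()` afterwards.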

In [None]:
# Make datetime variable from two columns


In [None]:
## Set dataframe index to be time series


In [None]:
## Preview the dataframe to confirm index

In [None]:
## Save a copy of the original dataframe (IF IT DOESN'T ALREADY EXIST)
### 


In [None]:
# ## Identify columns to drop/keep
keep_cols = ['Description','Longitude','Latitude','District','Neighborhood']

##  Remake df from df_org using keep_cols


In [None]:
## Inspect the value_counts for the different types of crimes

# display with an inline-barplot inside your df


In [None]:
## Get list of crimes to iterate through


In [None]:
## Let's get just Shootings in a new df


In [None]:
## Make a new SHOOTING column that is an integer


In [None]:
## make a dict of all crime types' DataFrames 

## For each crime type
    
    ## Get the group df
    
    ## Create a new column for that crime as we did for SHOOTINGS above
    
    ## Save the group_df into the CRIMES dict
    
## Display the keys
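
The steps outlined in the cell above can be sketched as follows. This is a minimal illustration on made-up rows, not the class solution; `demo` and `CRIMES_demo` are hypothetical names chosen so they do not clash with the notebook's `df` and `CRIMES`.

```python
import pandas as pd

# Hypothetical event-level data: one row per reported crime
idx = pd.to_datetime(['2020-01-01 08:00', '2020-01-01 21:00',
                      '2020-01-02 10:00', '2020-01-03 02:00'])
demo = pd.DataFrame({'Description': ['SHOOTING', 'BURGLARY',
                                     'SHOOTING', 'HOMICIDE']}, index=idx)

CRIMES_demo = {}
for crime in demo['Description'].unique():
    # Get the rows for this crime type only
    group_df = demo[demo['Description'] == crime].copy()
    # Each row is one event, so a column of 1s can be summed when resampling
    group_df[crime] = 1
    CRIMES_demo[crime] = group_df

print(CRIMES_demo.keys())
```

Storing an integer column named after the crime makes a later `.resample(...).sum()` return event counts directly.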

# Visualizing Time Series

In [None]:
## Check out SHOOTING key


In [None]:
### save the shooting column as ts


In [None]:
## Plot shooting


> #### Q: What went wrong? What are we looking at?

## Resampling Time Series

In [None]:
## Resample to daily data


### Time series Frequencies


#### Pandas Frequency Aliases

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases


| Alias | Description |
| --- | --- |
| B | business day frequency |
| C | custom business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency (15th and end of month) |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency (1st and 15th) |
| BMS | business month start frequency |
| CBMS | custom business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A, Y | year end frequency |
| BA, BY | business year end frequency |
| AS, YS | year start frequency |
| BAS, BYS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
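
To see what a frequency alias does in practice, here is a small sketch on a hypothetical event series (the timestamps and the `events` name are invented). Each row is a single event with value 1, which is why plotting the raw series is uninformative and why counts only appear after resampling. Recent pandas releases (2.2 and later) deprecate several of the single-letter aliases listed above, such as `M`, `Q`, `A`, `H`, `T` and `S`, in favor of `ME`, `QE`, `YE`, `h`, `min` and `s`; the sketch sticks to `D` and `W`, which are unaffected.

```python
import pandas as pd

# Hypothetical event series: every row is a single event with value 1
idx = pd.to_datetime(['2020-01-01 08:00', '2020-01-01 21:00',
                      '2020-01-03 10:00', '2020-01-09 02:00'])
events = pd.Series(1, index=idx, name='SHOOTING')

# Daily counts: days with no events appear with a sum of 0
daily = events.resample('D').sum()

# Weekly counts: 'W' bins end on Sunday by default
weekly = events.resample('W').sum()

print(daily)
print(weekly)
```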

#### Compare Resampled ts

In [None]:
## Plot the same ts as different frequencies
## Specify freq codes
freq_codes = ['D','3D','W','M','Q','A']

## select ts from CRIMES
ts = CRIMES['SHOOTING']['SHOOTING']

## For each freq code, resample and plot


In [None]:
## Repeat the above loop, but plot it all on one figure


## Visualize all CRIMES as "D" Freq

> **Loop through CRIMES and resample and plot all crimes**

In [None]:
## Plot all crimes the same way (daily sum)


## Slicing With Time Series

- Make sure your index is sorted first.
- Feed in 2 dates as strings for slicing.
- Always use .loc when slicing dates
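
The three rules above can be sketched on a synthetic daily series (the data and the `ts_demo` name are made up for illustration):

```python
import pandas as pd

# Hypothetical daily series spanning three years
idx = pd.date_range('2013-01-01', '2015-12-31', freq='D')
ts_demo = pd.Series(range(len(idx)), index=idx)

# Shuffle, then sort: slicing by date label needs a sorted index
ts_demo = ts_demo.sample(frac=1, random_state=0).sort_index()

only_2014 = ts_demo.loc['2014']                     # a single year
from_2014 = ts_demo.loc['2014':]                    # 2014 onward
june_2014 = ts_demo.loc['2014-06-01':'2014-06-30']  # both ends inclusive

print(len(only_2014), len(from_2014), len(june_2014))
```

Unlike positional slicing, date-string slices with `.loc` include the end label.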

In [None]:
## Slice 2014:


### Using Dictionaries for Time Series preprocessing

In [None]:
## Save all crimes from 2014 on with freq=D in new TS dict

## For each crime
    ## Resample and slice and save ts



In [None]:
## Check shooting


### Visualize all ts with the different frequency codes

In [None]:
## Plot the same ts as different frequencies
freq_codes = ['D','3D','W','M','Q']

for freq in freq_codes:
    fig, ax = plt.subplots()

    for crime,ts in TS.items():
        ts.loc['2015':'2019'].resample(freq).sum().plot(title=f"Freq Code = {freq}",ax=ax)
        
    ax.legend(bbox_to_anchor=(1.05,1),loc='upper left')

### Save Final TS and `ts_df`

In [None]:
## SAVE FINAL CHOICES FOR YEAR AND FREQUENCY TO TS 
TS = {}

## Fill in each crime's processed time series 
for crime,ts in CRIMES.items():
    
    ## Slice out years and resample and sum 
    TS[crime] = ts.sort_index().loc['2014':][crime].resample('D').sum()
    
## Make TS into a df
ts_df = pd.concat(TS,axis=1)
ts_df.head()

In [None]:
## Check For Null Values


In [None]:
## Show rows with null values


> More columns benefit from forward-filling than from back-filling, so we forward-fill (`ffill`) first and then use `dropna` to remove the few days at the beginning of 2014 that still have no data.
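
A minimal sketch of that fill-then-drop sequence, using an invented two-column frame (`demo`, `A` and `B` are placeholders, not columns of `ts_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps at the start and in the middle
idx = pd.date_range('2014-01-01', periods=6, freq='D')
demo = pd.DataFrame({'A': [np.nan, 1, np.nan, 3, 4, np.nan],
                     'B': [np.nan, np.nan, 2, 2, np.nan, 5]}, index=idx)

filled = demo.ffill()          # carry the last known value forward
print(filled.isna().sum())     # only the leading gaps remain

clean = filled.dropna()        # drop the leading rows that could not be filled
print(clean.isna().sum().sum())
```

Forward-filling cannot fill a gap that has no earlier value, which is why the leading rows still need `dropna`.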

In [None]:
## FFill null values

## check for nulls


In [None]:
## Drop remaining nulls


In [None]:
## Final null check


In [None]:
## Save df to csv for time series modeling next class
# ts_df.to_csv('baltimore_crime_2020_ts_070620ft.csv')

# Time Series Trends

### Types of Trends

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-removing-trends-online-ds-ft-100719/master/images/new_trendseasonal.png" width=80%>

### Stationarity

<div style="text-align:center;font-size:2em">Mean</div>
    
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_mean_nonstationary.png" width=70%>
<br><br>
<div style="text-align:center;font-size:3em">Variance</div>
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_cov_nonstationary.png" width=70%>

In [None]:
## Lab Function
def stationarity_check(TS,plot=True,col=None):
    """From: https://learn.co/tracks/data-science-career-v2/module-4-a-complete-data-science-project-using-multiple-regression/working-with-time-series-data/time-series-decomposition
    """
    
    # Import adfuller
    from statsmodels.tsa.stattools import adfuller

    # Select the series to test: a single column if `col` is given
    series = TS[col] if col is not None else TS

    # Perform the (augmented) Dickey-Fuller test
    dftest = adfuller(series)
 
    if plot:
        # Calculate rolling statistics on the same series that is tested
        rolmean = series.rolling(window = 8, center = False).mean()
        rolstd = series.rolling(window = 8, center = False).std()

        #Plot rolling statistics:
        fig = plt.figure(figsize=(12,6))
        orig = plt.plot(series, color='blue',label='Original')
        mean = plt.plot(rolmean, color='red', label='Rolling Mean')
        std = plt.plot(rolstd, color='black', label = 'Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
#     plt.show(block=False)
    
    # Print Dickey-Fuller test results
    print ('Results of Dickey-Fuller Test:')

    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
        
    dfoutput['sig'] = dfoutput['p-value']<.05
    display (dfoutput.round(3))
    
    return dfoutput
#     return dftest

In [None]:
ts = TS['SHOOTING']
ts
# ts = ts.resample('D').sum().asfreq('D')

In [None]:
ts.plot()

In [None]:
res = stationarity_check(ts);

# Removing Trends (cont'd next class)

- .diff()
- subtract rolling mean
- seasonal decomposition
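
As a preview of the first two approaches, here is a hedged sketch on a synthetic series with a linear trend and a weekly pattern. The data, the 7-day window, and the names `ts_demo`, `ts_diff` and `ts_detrended` are assumptions made for illustration; seasonal decomposition is covered in the next section.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a linear upward trend plus a repeating weekly pattern
idx = pd.date_range('2020-01-01', periods=56, freq='D')
trend = np.arange(56) * 0.5
weekly = np.tile([0, 1, 2, 3, 2, 1, 0], 8)
ts_demo = pd.Series(trend + weekly, index=idx)

# Option 1: first-order differencing (each value minus the previous one)
ts_diff = ts_demo.diff().dropna()

# Option 2: subtract a rolling mean; the first window-1 values are NaN
roll_mean = ts_demo.rolling(window=7).mean()
ts_detrended = (ts_demo - roll_mean).dropna()

print(ts_demo.std(), ts_diff.std(), ts_detrended.std())
```

Both results lose observations at the start (one for `.diff()`, `window - 1` for the rolling mean), so drop the resulting NaNs before running a stationarity test.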

## Seasonal Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(ts)#.plot();
decomp.plot();

In [None]:
decomp.resid

In [None]:
decomp.seasonal

# APPENDIX

In [None]:
ax = ts_df.loc['2017':,'BURGLARY'].plot()
ax.legend(bbox_to_anchor=(1.05,1),loc='upper left')
ax.set(ylabel='# of Crimes',xlabel='Year by Day')

## Time Series Calculations


### Rolling Statistics

In [None]:
## Plot the rolling mean of ts_df for different window sizes
# freq_codes = ['D','3D','W','M','Q']
windows = [3,5,7,30,90]
for window in windows:
    
    fig,ax = plt.subplots()
    ts_df.rolling(window).mean().plot(title= f"Window = {window}",ax=ax)
#     ts.rolling(window).mean().plot(title=title)
        
    ax.legend(bbox_to_anchor=(1.05,1),loc='upper left')

In [None]:
ts = TS['HOMICIDE']
ts

In [None]:
ts = ts.resample('W').sum()
ts

In [None]:
## ts is weekly here, so a window of 2 covers two weeks
ts_mean = ts.rolling(window=2).mean()
ax = ts.plot(label='Time Series')
ts_mean.plot(label='Rolling 2-Week Average',ax=ax)
ax.legend()

In [None]:
## Exponentially weighted moving average (halflife of 2 periods)
# ts.plot()
ts.ewm(halflife=2).mean()#sum()#.plot()

In [None]:
## Use adfuller to test for stationarity
ts.plot()

## Rolling Windows 

In [None]:
ts_df.rolling(1).mean()

In [None]:
# ts_df.rolling(7).mean().plot()

## Using Datetime objects and apply statements

In [None]:
df_ = CRIMES['HOMICIDE'].reset_index().dropna()
display(df_.head())
df_.dtypes

In [None]:
df_['index'][0]

In [None]:
df_['index'].dt.year#.strftime('%y')

In [None]:
df_.isna().sum()

In [None]:
df_['index'].dt.year

In [None]:
df_['index'].map(lambda x: x.strftime('%Y'))

In [None]:
test = df_.loc[0,'index']
test

In [None]:
test.strftime('%y')

### Date Str Formatting




Formatting follows the Python datetime <strong><a href='http://strftime.org/'>strftime</a></strong> codes.<br>
The following examples are based on <tt>datetime.datetime(2001, 2, 3, 16, 5, 6)</tt>:
<br><br>

<table style="display: inline-block">  
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%Y</td><td>Year with century as a decimal number.</td><td>2001</td></tr>
<tr><td>%y</td><td>Year without century as a zero-padded decimal number.</td><td>01</td></tr>
<tr><td>%m</td><td>Month as a zero-padded decimal number.</td><td>02</td></tr>
<tr><td>%B</td><td>Month as locale’s full name.</td><td>February</td></tr>
<tr><td>%b</td><td>Month as locale’s abbreviated name.</td><td>Feb</td></tr>
<tr><td>%d</td><td>Day of the month as a zero-padded decimal number.</td><td>03</td></tr>  
<tr><td>%A</td><td>Weekday as locale’s full name.</td><td>Saturday</td></tr>
<tr><td>%a</td><td>Weekday as locale’s abbreviated name.</td><td>Sat</td></tr>
<tr><td>%H</td><td>Hour (24-hour clock) as a zero-padded decimal number.</td><td>16</td></tr>
<tr><td>%I</td><td>Hour (12-hour clock) as a zero-padded decimal number.</td><td>04</td></tr>
<tr><td>%p</td><td>Locale’s equivalent of either AM or PM.</td><td>PM</td></tr>
<tr><td>%M</td><td>Minute as a zero-padded decimal number.</td><td>05</td></tr>
<tr><td>%S</td><td>Second as a zero-padded decimal number.</td><td>06</td></tr>
</table>
<table style="display: inline-block">
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%#m</td><td>Month as a decimal number. (Windows)</td><td>2</td></tr>
<tr><td>%-m</td><td>Month as a decimal number. (Mac/Linux)</td><td>2</td></tr>
<tr><td>%#x</td><td>Long date</td><td>Saturday, February 03, 2001</td></tr>
<tr><td>%#c</td><td>Long date and time</td><td>Saturday, February 03, 2001 16:05:06</td></tr>
</table>  
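
A short sketch of a few of these codes, using the same reference moment as the tables. Only the portable, locale-independent codes are shown; the variable names are arbitrary.

```python
from datetime import datetime

# Same reference moment used in the tables above
moment = datetime(2001, 2, 3, 16, 5, 6)

date_only = moment.strftime('%m-%d-%Y')   # zero-padded month-day-year
time_only = moment.strftime('%H:%M:%S')   # 24-hour clock
two_digit_year = moment.strftime('%y')    # year without century

print(date_only, time_only, two_digit_year)
```

pandas `Timestamp` objects accept the same codes through their own `.strftime` method.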
    

In [None]:
test = ts.index[0]
print(test)
test


In [None]:
print(test.strftime("%m-%d-%Y"))
print(test.strftime("%H:%M:%S"))  # portable equivalent of "%T", which is not supported on every platform

In [None]:
TS['SHOOTING']

## A: Groupby Indexing

In [None]:
ts.groupby(pd.Grouper(freq='M')).sum().plot(subplots=True)
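
For comparison, grouping with `pd.Grouper(freq=...)` bins a datetime-indexed series in the same way as `.resample(...)`. The sketch below checks this on a synthetic daily series of ones (hypothetical data); it uses the `W` alias so that it does not depend on how a given pandas version treats `M`.

```python
import pandas as pd

# Hypothetical daily series: one event per day for four weeks
idx = pd.date_range('2020-01-01', periods=28, freq='D')
ts_demo = pd.Series(1, index=idx)

# Two ways of getting weekly totals
by_grouper = ts_demo.groupby(pd.Grouper(freq='W')).sum()
by_resample = ts_demo.resample('W').sum()

print(by_grouper)
print((by_grouper.values == by_resample.values).all())
```

`pd.Grouper` is mainly useful when the time bins need to be combined with other grouping keys in a single `groupby` call.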