# Sect 37: Intro to Time Series

- online-ds-pt-041320
- 10/21/20

## Learning Objectives:

- Learn how to load time series data into pandas
- Learn how to plot timeseries in pandas
- Learn how to resample at different time frequencies
- Learn about types of time series trends and how to remove them.
- Learn about seasonal decomposition

- Prepare a time series dataset to use for modeling next class

## Questions?

## Announcements

> For section 38: Time Series Models, you **NEED to do one of the appendix labs in order to fully learn time series modeling**.
- Appendix > More Time Series > SARIMA Lab

# Intro to Time Series

## References

- [Pandas Timeseries Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- ['Timeseries Offset Aliases'](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)
- [Anchored Offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets)


- [pandas.Timestamp API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html)

## Working with Time Series

In [None]:
# !pip install -U fsds 
from fsds.imports import *

import warnings
warnings.filterwarnings('ignore')

## Setting figures to timeseries-friendly
mpl.rcParams['figure.figsize'] = (12,6)

import os,sys

In [None]:
sns.__version__

### Creating a Time Series from a DataFrame

In [None]:
df = pd.read_csv('baltimore_crime_2020.csv',low_memory=False,
                usecols=range(12))

df

In [None]:
## Check info
df.info()

## Preparing Data for Time Series Visualization

- Index must be a `datetimeindex`

In [None]:
## Check the index


> #### Question: We need to make a datetime index from the CrimeDate and CrimeTime columns. How might we do that?

In [None]:
# Make datetime variable from two columns


In [None]:
## Set dataframe index to be time series
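
One possible approach, sketched on a toy frame rather than the real crime file: join the date and time strings, parse them with `pd.to_datetime`, and set the result as the index. The column names mirror `CrimeDate` and `CrimeTime`, but the values and the `format` string are assumptions; check how the dates actually look in the CSV before reusing it.

```python
import pandas as pd

# Toy stand-in for the crime data (values and date format are assumed)
toy = pd.DataFrame({
    'CrimeDate': ['01/02/2020', '01/01/2020', '01/03/2020'],
    'CrimeTime': ['23:30:00', '08:15:00', '12:00:00'],
    'Description': ['SHOOTING', 'LARCENY', 'SHOOTING'],
})

# Join the two string columns, then parse with an explicit format
toy['datetime'] = pd.to_datetime(toy['CrimeDate'] + ' ' + toy['CrimeTime'],
                                 format='%m/%d/%Y %H:%M:%S')

# Use the parsed column as the index and sort chronologically
toy = toy.set_index('datetime').sort_index()
print(toy.index)
```

Sorting the index right away keeps later date slicing predictable.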


In [None]:
## Save a copy of the original dataframe as df_orig (IF IT DOESN'T ALREADY EXIST)


In [None]:
## Remaking df using only some of the columns
keep_cols = ['Description','Longitude','Latitude','District','Neighborhood']
df = df_orig[keep_cols].copy()
df

In [None]:
## Inspect the value_counts for the different types of crimes

# display with an inline-barplot inside your df


In [None]:
## Let's get just Shootings in a new df


In [None]:
## Make a new SHOOTING column that is an integer


In [None]:
## Get list of crimes to iterate through


In [None]:
## make a dict of all crime types' DataFrames 

## For each crime type

    ## Get the group df

    
    ## Create a new column for that crime as we did for SHOOTINGS above

    
    ## Save the group_df into the CRIMES dict (and slice out crime column)
    
## Display the keys


In [None]:
## Check out the value for LARCENY
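
A minimal sketch of the per-crime dictionary described in the cells above, using a toy frame with a datetime index. The name `crimes_toy` and the data are hypothetical; the pattern is one integer column of 1s per crime type, so that summing later gives event counts.

```python
import pandas as pd

# Toy event-level data: one row per reported crime (values are made up)
idx = pd.to_datetime(['2020-01-01 08:15', '2020-01-01 21:00',
                      '2020-01-02 23:30', '2020-01-03 12:00'])
toy = pd.DataFrame({'Description': ['LARCENY', 'SHOOTING', 'SHOOTING', 'LARCENY']},
                   index=idx)

crimes_toy = {}
for crime in toy['Description'].unique():
    # Get the rows for this crime type only
    group_df = toy.loc[toy['Description'] == crime].copy()
    # Flag each event with an integer 1 in a column named after the crime
    group_df[crime] = 1
    # Keep just that column (a Series) in the dict
    crimes_toy[crime] = group_df[crime]

print(crimes_toy.keys())
```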


# Visualizing Time Series

In [None]:
## Save SHOOTING as ts


In [None]:
## Plot shooting


> #### Q: What went wrong? What are we looking at?

## Resampling Time Series

In [None]:
## Resample to daily data
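
A small illustration of why resampling helps, on made-up timestamps: an event-level series holds a 1 for every row, so a line plot of it is flat. Summing within daily bins turns the events into counts, and days with no events become 0.

```python
import pandas as pd

# Three toy events; every value is 1 because each row is a single event
events = pd.Series(1, index=pd.to_datetime(['2020-01-01 08:15',
                                            '2020-01-01 21:00',
                                            '2020-01-03 12:00']),
                   name='SHOOTING')

# Count events per calendar day
daily = events.resample('D').sum()
print(daily)
```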


### Time series Frequencies


#### Pandas Frequency Aliases

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases


|Alias	| Description|
| --- | --- |
|B |	business day frequency|
|C |	custom business day frequency|
|D |	calendar day frequency|
|W |	weekly frequency|
|M |	month end frequency|
|SM |	semi-month end frequency (15th and end of month)|
|BM |	business month end frequency|
|CBM |	custom business month end frequency|
|MS |	month start frequency|
|SMS |	semi-month start frequency (1st and 15th)|
|BMS |	business month start frequency|
|CBMS |	custom business month start frequency|
|Q |	quarter end frequency|
|BQ |	business quarter end frequency|
|QS |	quarter start frequency|
|BQS |	business quarter start frequency|
|A, Y |	year end frequency|
|BA, BY |	business year end frequency|
|AS, YS |	year start frequency|
|BAS, BYS |	business year start frequency|
|BH | business hour frequency|
|H | hourly frequency|
|T, min |	minutely frequency|
|S | secondly frequency|
|L, ms |	milliseconds|
|U, us |	microseconds|
|N | nanoseconds|
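
As a rough sketch of how the aliases behave, the same toy daily series can be resampled with a few codes from the table. The series is made up (28 days of one event each); the point is that the number of bins changes while the total count is preserved.

```python
import pandas as pd

# 28 days of toy data, one event per day
idx = pd.date_range('2020-01-01', periods=28, freq='D')
daily = pd.Series(1, index=idx)

# Resample with several frequency aliases and compare the bin counts
resampled = {code: daily.resample(code).sum() for code in ['D', '3D', 'W']}
for code, series in resampled.items():
    print(code, 'bins:', len(series), 'total:', series.sum())
```

Weekly bins are anchored on Sundays by default, so the first and last weeks here are partial.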

#### Compare Resampled ts

In [None]:
## Plot the same ts as different frequencies
## Specify freq codes
freq_codes = []

## For each freq code, resample and plot
    
    ## Create a figure
    
    ## Resample ts and plot


In [None]:
## Repeat the above loop, but plot it all on one figure



## Move the legend to OUTSIDE the plot


## Visualize all CRIMES as "D" Freq

> **Loop through CRIMES and resample and plot all crimes**

In [None]:
## Plot all crimes the same way


> What do we notice about the different time series? 

## Slicing With Time Series

- Make sure your index is sorted first.
- Feed in 2 dates as strings for slicing.
- Always use .loc when slicing dates

In [None]:
## Slice 2014:
ts = CRIMES['SHOOTING'].loc['2014':].resample('D').sum()
ts
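
A short sketch of date-string slicing on a toy series (dates and values are made up). With a sorted `DatetimeIndex`, `.loc` accepts partial strings such as a year, and both endpoints of a slice are included.

```python
import pandas as pd

# Five consecutive days spanning the 2013/2014 boundary
idx = pd.date_range('2013-12-30', '2014-01-03', freq='D')
s = pd.Series(range(len(idx)), index=idx).sort_index()

# Everything from the start of 2014 onward
from_2014 = s.loc['2014':]

# A closed range: both end dates are included
window = s.loc['2013-12-31':'2014-01-01']

print(from_2014)
print(window)
```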


## Using Dictionaries for Time Series preprocessing

In [None]:
## Save all crimes from 2014 on with freq=D in new TS dict
TS ={}
## For each crime
    
    ## Resample and slice and save ts

## Display the key
TS.keys()

In [None]:
## Check shooting


### Visualize all ts with the different frequency codes

In [None]:
## Plot the same ts as different frequencies
freq_codes = ['D','3D','W','M','Q']

for freq in freq_codes:
    
    fig, ax = plt.subplots()

    for crime,ts in TS.items():
        ts.loc['2015':'2019'].resample(freq).sum().plot(title=f"Freq Code = {freq}",ax=ax)
        
    ax.legend(bbox_to_anchor=(1.05,1),loc='upper left')

### Save Final TS and `ts_df`

In [None]:
## Concatenate the TS dict to a single DataFrame and display .head()
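
One way to do the concatenation, shown on a hypothetical two-series dict (`ts_toy` and its values are made up): passing a dict to `pd.concat` with `axis=1` uses the keys as column names and aligns on the date index, leaving NaN where a series has no value.

```python
import pandas as pd

# Two toy daily series that start on different days
ts_toy = {
    'A': pd.Series([1, 2, 3], index=pd.date_range('2014-01-01', periods=3, freq='D')),
    'B': pd.Series([5, 6], index=pd.date_range('2014-01-02', periods=2, freq='D')),
}

# Dict keys become column names; rows are aligned on the dates
ts_df_toy = pd.concat(ts_toy, axis=1)
print(ts_df_toy.head())
```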


In [None]:
## Check For Null Values


## Save List of columns with null values


In [None]:
## Show rows with null values


In [None]:
## save index of null rows to compare before and after


In [None]:
## Slice out the rows that contained null values in the null columns


> More columns would benefit from ffill than bfill so we are going to ffill and then dropna to remove the few days at the beginning of 2014 without data

In [None]:
## FFill null values

## check null rows/cols


In [None]:
## Bfill remaining nulls

## check null rows/cols


In [None]:
## Final null check
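
The null-handling steps in the cells above can be sketched on a toy frame (values are made up). Forward fill carries the last known value forward, so a gap at the very start survives it; the back fill then covers that leading gap.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2014-01-01', periods=4, freq='D')
toy_df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, 3.0],
                       'B': [2.0, np.nan, np.nan, 5.0]}, index=idx)

# Columns and rows that contain at least one null
null_cols = toy_df.columns[toy_df.isna().any()].tolist()
null_rows = toy_df.index[toy_df.isna().any(axis=1)]

# Forward fill first, then back fill whatever is left at the start
after_ffill = toy_df.ffill()
after_both = after_ffill.bfill()

print(after_ffill.isna().sum())
print(after_both.loc[null_rows, null_cols])
```

Keeping `null_rows` around makes it easy to compare the same rows before and after filling.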


In [None]:
## Save df to csv for time series modeling next class
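
A minimal round-trip sketch, using an in-memory buffer in place of a real file path (the frame is made up). A CSV does not store the index frequency, so after reloading with `index_col=0, parse_dates=True` the frequency may need to be set again, for example with `asfreq`.

```python
import io
import pandas as pd

idx = pd.date_range('2014-01-01', periods=3, freq='D')
toy_df = pd.DataFrame({'A': [1, 2, 3]}, index=idx)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# Reload, parsing the first column as a datetime index
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(reloaded.index.freq)   # the frequency is not stored in the CSV

# Restore the daily frequency explicitly
restored = reloaded.asfreq('D')
print(restored.index.freq)
```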


# Time Series Trends

## Types of Trends

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-removing-trends-online-ds-ft-100719/master/images/new_trendseasonal.png" width=80%>

## Stationarity

<div style="text-align:center;font-size:2em">Mean</div>
    
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_mean_nonstationary.png" width=70%>
<br><br>
<div style="text-align:center;font-size:3em">Variance</div>
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_cov_nonstationary.png" width=70%>

In [None]:
## Lab Function
from statsmodels.tsa.stattools import adfuller

def stationarity_check(TS,plot=True,col=None):
    """From: https://learn.co/tracks/data-science-career-v2/module-4-a-complete-data-science-project-using-multiple-regression/working-with-time-series-data/time-series-decomposition
    """
    
    # When a column name is given, test and plot only that column
    if col is not None:
        TS = TS[col]

    # Perform the Dickey-Fuller test
    dftest = adfuller(TS)
 
    if plot:
        # Calculate rolling statistics
        rolmean = TS.rolling(window = 8, center = False).mean()
        rolstd = TS.rolling(window = 8, center = False).std()

        #Plot rolling statistics:
        fig = plt.figure(figsize=(12,6))
        orig = plt.plot(TS, color='blue',label='Original')
        mean = plt.plot(rolmean, color='red', label='Rolling Mean')
        std = plt.plot(rolstd, color='black', label = 'Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
#     plt.show(block=False)
    
    # Print Dickey-Fuller test results
    print ('Results of Dickey-Fuller Test:')

    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
        
    dfoutput['sig'] = dfoutput['p-value']<.05
    display (dfoutput.round(3))
    
    return dfoutput

In [None]:
## Save LARCENY for 2017-End as our ts

## Run a stationarity check


> Since our AD Fuller test p-value is > .05, we fail to reject the null hypothesis of a unit root, so we treat the data as non-stationary. So now what?

# Removing Trends

#### Trend Removal Methods
- Log-Transformation (`np.log`)
- Differencing (`.diff()`)
- Subtract Rolling Mean (`ts-ts.rolling().mean()`)
- Subtract Exponentially-Weighted Mean (`ts-ts.ewm().mean()`)
- Seasonal Decomposition (`from statsmodels.tsa.seasonal import seasonal_decompose`)
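
The pandas/NumPy methods in the list above can be sketched on a toy series with a purely linear trend. The 7-day window and 7-day half-life are arbitrary assumptions; they would need tuning for real data.

```python
import numpy as np
import pandas as pd

# Toy series: a straight upward line from 10 to 40 over 60 days
idx = pd.date_range('2020-01-01', periods=60, freq='D')
ts = pd.Series(np.linspace(10, 40, 60), index=idx)

# Log transform (values must be positive)
ts_log = np.log(ts)

# First-order differencing; the first value is NaN, so drop it
ts_diff = ts.diff().dropna()

# Subtract a 7-day rolling mean; the first 6 values are NaN
ts_minus_roll = (ts - ts.rolling(window=7).mean()).dropna()

# Subtract an exponentially weighted mean with a 7-day half-life
ts_minus_ewm = ts - ts.ewm(halflife=7).mean()

print(ts_diff.head(3))
print(ts_minus_roll.head(3))
```

For a linear trend, both differencing and subtracting the rolling mean leave a constant, which is the flat line we want to see.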

In [None]:
## Simpler version of the adfuller function
def adfuller_test_df(ts):
    """Returns the AD Fuller Test Results and p-values for the null hypothesis
    that the data is non-stationary (that there is a unit root in the data)"""
    df_res = adfuller(ts)
    names = ['Test Statistic','p-value','#Lags Used','# of Observations Used']
    res  = dict(zip(names,df_res[:4]))
    res['p<.05'] = res['p-value']<.05
    res['Stationary?'] = res['p<.05']
    
    return pd.DataFrame(res,index=['AD Fuller Results'])

In [None]:
## Plot ts and show adfuller results


In [None]:
## Log Transform


In [None]:
## Differencing 


In [None]:
## Subtract Rolling mean


In [None]:
## Subtract Exponentially Weighted Mean



## Seasonal Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
## Save decomposed time series and plot


In [None]:
## Get ADFuller Results for seasonal component


In [None]:
## Get ADFuller Results for trend component


In [None]:
## Get ADFuller Results for resid component


# APPENDIX

### Date Str Formatting




Formatting follows the Python datetime <strong><a href='http://strftime.org/'>strftime</a></strong> codes.<br>
The following examples are based on <tt>datetime.datetime(2001, 2, 3, 16, 5, 6)</tt>:
<br><br>

<table style="display: inline-block">  
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%Y</td><td>Year with century as a decimal number.</td><td>2001</td></tr>
<tr><td>%y</td><td>Year without century as a zero-padded decimal number.</td><td>01</td></tr>
<tr><td>%m</td><td>Month as a zero-padded decimal number.</td><td>02</td></tr>
<tr><td>%B</td><td>Month as locale’s full name.</td><td>February</td></tr>
<tr><td>%b</td><td>Month as locale’s abbreviated name.</td><td>Feb</td></tr>
<tr><td>%d</td><td>Day of the month as a zero-padded decimal number.</td><td>03</td></tr>  
<tr><td>%A</td><td>Weekday as locale’s full name.</td><td>Saturday</td></tr>
<tr><td>%a</td><td>Weekday as locale’s abbreviated name.</td><td>Sat</td></tr>
<tr><td>%H</td><td>Hour (24-hour clock) as a zero-padded decimal number.</td><td>16</td></tr>
<tr><td>%I</td><td>Hour (12-hour clock) as a zero-padded decimal number.</td><td>04</td></tr>
<tr><td>%p</td><td>Locale’s equivalent of either AM or PM.</td><td>PM</td></tr>
<tr><td>%M</td><td>Minute as a zero-padded decimal number.</td><td>05</td></tr>
<tr><td>%S</td><td>Second as a zero-padded decimal number.</td><td>06</td></tr>
</table>
<table style="display: inline-block">
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%#m</td><td>Month as a decimal number. (Windows)</td><td>2</td></tr>
<tr><td>%-m</td><td>Month as a decimal number. (Mac/Linux)</td><td>2</td></tr>
<tr><td>%#x</td><td>Long date</td><td>Saturday, February 03, 2001</td></tr>
<tr><td>%#c</td><td>Long date and time</td><td>Saturday, February 03, 2001 16:05:06</td></tr>
</table>  
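
A quick sketch of the codes in use, with the same example datetime as the tables. Only locale-independent numeric codes are shown; the format strings are illustrative choices.

```python
from datetime import datetime

dt = datetime(2001, 2, 3, 16, 5, 6)

# Format a datetime as a string
iso_like = dt.strftime('%Y-%m-%d %H:%M:%S')   # '2001-02-03 16:05:06'
short = dt.strftime('%m/%d/%y %I:%M')         # '02/03/01 04:05'

# The same codes parse a string back into a datetime
parsed = datetime.strptime('02/03/2001 16:05:06', '%m/%d/%Y %H:%M:%S')

print(iso_like)
print(short)
print(parsed == dt)
```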
    