# Sect 31-Pt2 & 32: Intro to Time Series

- From 12/19/19  study group

## Learning Objectives:
- Learn to group time series by frequency with `pd.Grouper`

- Learn about types of time series trends and how to remove them.
- Learn about seasonal decomposition with `statsmodels.tsa.seasonal.seasonal_decompose`

- Learn about PACF, ACF
- Introduce ARIMA and SARIMA models.



 ## Questions to Revisit
 - Can you interpolate between missing datapoints?
     - `pd.Series.interpolate` 
     - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
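
Short answer: yes. A minimal sketch, assuming a small made-up daily series (the dates and values are illustrative, not taken from the crime data). `method='time'` fills each gap in proportion to the elapsed time, so it needs a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing observations
idx = pd.date_range("2020-01-01", periods=5, freq="D")
ts_demo = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

# Fill the gaps linearly in time between the known neighbors
filled = ts_demo.interpolate(method="time")
print(filled)
```

`interpolate` returns a new Series by default, so assign the result back if you want to keep it.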
 

## References

- [Pandas Timeseries Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- ['Timeseries Offset Aliases'](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)
- [Anchored Offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets)


- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html

**REFERENCE CONTENTS:**
- Date StrFormatting
    - Used for:
        - Recognizing Date Formats (`pd.to_datetime`)
        - `dt_obj.strftime()`
        
- Pandas Frequency Aliases
    - Used for:
        - `df.resample()`
        - `df.asfreq()`
        - ...
        

### Date Str Formatting




Formatting follows the Python datetime <strong><a href='http://strftime.org/'>strftime</a></strong> codes.<br>
The following examples are based on <tt>datetime.datetime(2001, 2, 3, 16, 5, 6)</tt>:
<br><br>

<table style="display: inline-block">  
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%Y</td><td>Year with century as a decimal number.</td><td>2001</td></tr>
<tr><td>%y</td><td>Year without century as a zero-padded decimal number.</td><td>01</td></tr>
<tr><td>%m</td><td>Month as a zero-padded decimal number.</td><td>02</td></tr>
<tr><td>%B</td><td>Month as locale’s full name.</td><td>February</td></tr>
<tr><td>%b</td><td>Month as locale’s abbreviated name.</td><td>Feb</td></tr>
<tr><td>%d</td><td>Day of the month as a zero-padded decimal number.</td><td>03</td></tr>  
<tr><td>%A</td><td>Weekday as locale’s full name.</td><td>Saturday</td></tr>
<tr><td>%a</td><td>Weekday as locale’s abbreviated name.</td><td>Sat</td></tr>
<tr><td>%H</td><td>Hour (24-hour clock) as a zero-padded decimal number.</td><td>16</td></tr>
<tr><td>%I</td><td>Hour (12-hour clock) as a zero-padded decimal number.</td><td>04</td></tr>
<tr><td>%p</td><td>Locale’s equivalent of either AM or PM.</td><td>PM</td></tr>
<tr><td>%M</td><td>Minute as a zero-padded decimal number.</td><td>05</td></tr>
<tr><td>%S</td><td>Second as a zero-padded decimal number.</td><td>06</td></tr>
</table>
<table style="display: inline-block">
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%#m</td><td>Month as a decimal number. (Windows)</td><td>2</td></tr>
<tr><td>%-m</td><td>Month as a decimal number. (Mac/Linux)</td><td>2</td></tr>
<tr><td>%#x</td><td>Long date</td><td>Saturday, February 03, 2001</td></tr>
<tr><td>%#c</td><td>Long date and time</td><td>Saturday, February 03, 2001 16:05:06</td></tr>
</table>  
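
The codes above work in both directions: formatting a datetime as text and telling `pd.to_datetime` how to read text. A small sketch using the same example date as the table; only numeric codes are used here so the result does not depend on the locale.

```python
import datetime as dt
import pandas as pd

d = dt.datetime(2001, 2, 3, 16, 5, 6)

# Datetime -> string
as_text = d.strftime("%m/%d/%Y %H:%M")

# String -> Timestamp, with the layout spelled out explicitly
parsed = pd.to_datetime("02/03/2001 16:05:06", format="%m/%d/%Y %H:%M:%S")

print(as_text)
print(parsed)
```

Passing `format=` is optional, but it removes ambiguity (for example month-first versus day-first).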
    

### Pandas Frequency Aliases


https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases

|Alias	| Description|
| --- | --- |
|B |	business day frequency|
|C |	custom business day frequency|
|D |	calendar day frequency|
|W |	weekly frequency|
|M |	month end frequency|
|SM |	semi-month end frequency (15th and end of month)|
|BM |	business month end frequency|
|CBM |	custom business month end frequency|
|MS |	month start frequency|
|SMS |	semi-month start frequency (1st and 15th)|
|BMS |	business month start frequency|
|CBMS |	custom business month start frequency|
|Q |	quarter end frequency|
|BQ |	business quarter end frequency|
|QS |	quarter start frequency|
|BQS |	business quarter start frequency|
|A, Y |	year end frequency|
|BA, BY |	business year end frequency|
|AS, YS |	year start frequency|
|BAS, BYS |	business year start frequency|
|BH | business hour frequency|
|H | hourly frequency|
|T, min | minutely frequency|
|S | secondly frequency|
|L, ms | milliseconds|
|U, us | microseconds|
|N | nanoseconds|
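
A short sketch of how these aliases are used, assuming a made-up daily series (not the notebook's data). Note that pandas 2.2 and later deprecate several aliases in the table above (for example `M` in favor of `ME`, and `H` in favor of `h`), so this example sticks to `D`, `W`, and `MS`, which work in both old and new releases.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 60 days starting 2020-01-01 (Jan + Feb of a leap year)
idx = pd.date_range("2020-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60), index=idx)

weekly_sum = daily.resample("W").sum()       # aggregate into weeks ending on Sunday
monthly_mean = daily.resample("MS").mean()   # aggregate by month, labeled at month start

# asfreq only picks values at the new frequency; it does not aggregate
every_third_day = daily.asfreq("3D")
```

`resample` needs an aggregation (`.sum()`, `.mean()`, ...) while `asfreq` simply re-indexes, which is the main practical difference between the two.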

# Intro to Time Series

In [None]:
!pip install -U fsds_100719
# !pip install -U scikit-learn

from fsds_100719.imports import *

# Baltimore Crime

## 2020 Data

In [1]:
from fsds.imports import *




fsds v0.2.8 loaded.  Read the docs: https://fs-ds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


In [2]:
import os
file = '../../datasets/baltimore_crime_2020.csv'
df = pd.read_csv(file,low_memory=False)



In [None]:
df = pd.read_csv('/Users/jamesirving/Downloads/BPD_Part_1_Victim_Based_Crime_Data.csv')
df['datetime']  = pd.to_datetime(df['CrimeDate']+ ' - ' + df['CrimeTime'])
df

In [None]:
def stationarity_check(TS,plot=True,col=None):
    """From: https://learn.co/tracks/data-science-career-v2/module-4-a-complete-data-science-project-using-multiple-regression/working-with-time-series-data/time-series-decomposition
    """
    
    # Import adfuller
    from statsmodels.tsa.stattools import adfuller

    # When a DataFrame is passed, test (and plot) only the requested column
    if col is not None:
        TS = TS[col]

    # Perform the augmented Dickey-Fuller test
    dftest = adfuller(TS)
 
    if plot:
        # Calculate rolling statistics
        rolmean = TS.rolling(window = 8, center = False).mean()
        rolstd = TS.rolling(window = 8, center = False).std()

        #Plot rolling statistics:
        fig = plt.figure(figsize=(12,6))
        orig = plt.plot(TS, color='blue',label='Original')
        mean = plt.plot(rolmean, color='red', label='Rolling Mean')
        std = plt.plot(rolstd, color='black', label = 'Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
#     plt.show(block=False)
    
    # Print Dickey-Fuller test results
    print ('Results of Dickey-Fuller Test:')

    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
        
    dfoutput['sig'] = dfoutput['p-value']<.05
    print (dfoutput)
    
    return dfoutput

In [None]:
import os
os.listdir('../../datasets/')

In [None]:
df.to_csv('../../datasets/baltimore_crime_2020.csv',index=False)

In [None]:
df.index = pd.to_datetime(df['CrimeDate']+ ' - ' + df['CrimeTime'])
df

In [None]:
# df.set_index('datetime',inplace=True)
# df.index

In [None]:
df['Description'].value_counts()

In [None]:
df.head()

In [None]:
## Identify columns to drop/keep
drop_cols = ['CrimeDate','CrimeTime','CrimeCode','Location', 
             'Premise','Post','Neighborhood','Location 1',
             'vri_name1','Total Incidents','Weapon','Inside/Outside']

id_cols = ['Description','Weapon','Longitude','Latitude']

In [None]:
## Drop the columns identified above
df.drop(columns=drop_cols,inplace=True)
df

In [None]:
## make a dict of all crime types 
CRIMES = {}
for crime, group in df.groupby('Description'):
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning
    group[crime] = 1      # every row in this group is this crime type
    CRIMES[crime] = group

In [None]:
CRIMES.keys()

In [None]:
CRIMES['SHOOTING']

In [None]:
df['SHOOTING'] = (df['Description'] == 'SHOOTING').astype(int)
df


## Baltimore Crime - 2019 Data

In [None]:
baltimore_crime ="https://raw.githubusercontent.com/jirvingphd/fsds_100719/master/fsds_100719/data/BPD_Part_1_Victim_Based_Crime_Data.csv"
df = pd.read_csv(baltimore_crime,low_memory=False,
                 parse_dates=["CrimeDate","CrimeTime"])#,
#                 index_col='CrimeDate')
display(df.head())
mpl.rcParams['figure.figsize']= (12,8)
df.index

In [None]:
df['datetime'] =df['CrimeDate'] #'CrimeTime'].copy()
df.set_index('datetime',inplace=True)
df.sort_index(inplace=True)
df

In [None]:
df.sort_index(inplace=True)
df = df.loc['2014':].copy()
df.index

### Which crimes were the most common?

In [None]:
ax=df["Description"].value_counts(ascending=True).plot(kind='barh')
ax.set_xlabel('Number of Crimes')

In [None]:
ax=df["Description"].value_counts(ascending=True, normalize=True).plot(kind='barh')
# ax.xaxis.set_major_formatter(mpl.ticker.FormatStrFormatter("%d.2\%"))
ax.set_xlabel('Proportion of Crimes')

In [None]:
df.head()

### Making df_crimes

In [None]:
pd.set_option('display.max_columns',0)

In [None]:
df_crimes = pd.get_dummies(df,columns=['Description'])
df_crimes

___

In [None]:
crime_cols = [col for col in df_crimes.columns if 'Description_' in col]
crime_cols

In [None]:
new_names = [x.replace('Description_','') for x in crime_cols]
new_names

In [None]:
df_crimes.columns

In [None]:
rename_dict = dict(zip(crime_cols,new_names))
rename_dict

In [None]:
df_crimes.rename(rename_dict,axis=1,inplace=True)
df_crimes

In [None]:
df_crimes

In [None]:
# df_crimes['datetime'] = df_crimes['CrimeDate'].copy()
# df_crimes.set_index('datetime',inplace=True)
# df_crimes

### Visualize then Get Counts

In [None]:
keep_cols = ['CrimeDate','CrimeTime']
keep_cols.extend(new_names)
keep_cols

In [None]:
df_crimes = df_crimes[keep_cols].copy()
df_crimes

In [None]:
df_crimes.groupby('CrimeDate')[new_names].sum().plot()

### Using `pd.Grouper`

In [None]:
df_crimes.groupby(pd.Grouper(freq='D')).sum().plot()

In [None]:
df_counts = df_crimes.groupby(pd.Grouper(freq='D')).sum()
df_counts.plot()

In [None]:
df_counts['SHOOTING'].plot()

In [None]:
for freq_code in ['D','W','M']:
    ax= df_crimes.groupby(pd.Grouper(freq=freq_code)).sum().plot()
    ax.set_title(f"Freq Code={freq_code}")
#     ax.legend(None)
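
For reference, a self-contained sketch of what `pd.Grouper(freq=...)` does, assuming a tiny made-up one-hot incident log (the column names and timestamps are illustrative). It groups rows by time bins on the `DatetimeIndex`, so summing one-hot columns gives counts per bin.

```python
import pandas as pd

# Hypothetical one-hot event log: one row per incident
idx = pd.to_datetime(["2019-01-01 08:00", "2019-01-01 21:30",
                      "2019-01-02 10:15", "2019-01-09 12:00"])
events = pd.DataFrame({"SHOOTING": [1, 0, 0, 1],
                       "ROBBERY": [0, 1, 1, 0]}, index=idx)

# Counts per calendar day; days with no incidents appear as 0
daily = events.groupby(pd.Grouper(freq="D")).sum()

# Counts per week (weeks end on Sunday by default)
weekly = events.groupby(pd.Grouper(freq="W")).sum()
```

Unlike grouping on a date column, the result has one row for every bin between the first and last timestamp, which gives a regular series suitable for modeling.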
    

### Working with df_counts


In [None]:
df_counts.to_csv('datasets/baltimore_crime_counts_2014-2019.csv')
df_counts = pd.read_csv('datasets/baltimore_crime_counts_2014-2019.csv',parse_dates=True, index_col='datetime')
df_counts

# Removing Trends 
- .diff()
- np.log
- subtract rolling mean
- seasonal decomposition
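
A minimal sketch of the first three techniques, assuming a made-up series with an exponential trend (seasonal decomposition is covered in its own section). The window size of 7 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series that grows exponentially
idx = pd.date_range("2014-01-01", periods=100, freq="D")
y = pd.Series(np.exp(0.02 * np.arange(100)), index=idx)

# 1. Differencing: change from one observation to the next
differenced = y.diff().dropna()

# 2. Log transform: compresses exponential growth (values must be positive)
logged = np.log(y)

# 3. Subtract the rolling mean: what is left is the deviation from the local level
roll_detrended = (y - y.rolling(window=7).mean()).dropna()

# The techniques can be combined: log, then difference
log_diff = np.log(y).diff().dropna()
```

For this toy series the log-difference is constant, which is the kind of flat result a successful detrending step aims for.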

In [None]:
col = 'COMMON ASSAULT'
ts = df_counts[col].copy()
ts.loc['2014':'2016'].plot(figsize=(12,4))#style='.b')

## Seasonal Decomposition

In [None]:
ts=ts.loc['2014':'2016']

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(ts)#,model='mul')
decomp.plot();

In [None]:
trend = decomp.trend
residuals = decomp.resid
seasonal = decomp.seasonal

In [None]:
ts.plot()
residuals.plot(label='Residuals')
plt.legend()

In [None]:
fs.ihelp(stationarity_check,False)

In [None]:
from statsmodels.tsa.stattools import adfuller
help(adfuller)  # show the signature and docstring (calling adfuller() with no data raises a TypeError)

In [None]:
stationarity_check(ts);

In [None]:
ts = ts.interpolate(method='time')  # fill missing days; reassign instead of inplace on a slice
decomp = seasonal_decompose(ts)


In [None]:
stationarity_check(ts)

In [None]:
stationarity_check(decomp.resid.dropna());


In [None]:
ts.plot()

In [None]:

# plt.plot(np.log(ts))

# ACF & PACF

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
mpl.rcParams['figure.figsize'] = (12,4)
plot_acf(ts);
plt.xlabel("Lag")
# fig=ax.get_figure()

# fig.set_size_inches(12,4)
# fig


In [None]:

ts.plot()

#  ARIMA/SARIMA
- SEE `sect_32_time_series_models.ipynb`

# HOUSING DATA FROM LAST CLASS

In [None]:
df = fs.datasets.load_mod1_proj()
df.head()

In [None]:
df.info()

In [None]:
date = pd.to_datetime(df['date'])
date

In [None]:
df['d_date'] = pd.to_datetime(df['date'])
display(df.head())
df.dtypes

In [None]:
df.set_index('d_date',inplace=True)
df.index

In [None]:
mpl.rcParams['figure.figsize'] = (12,6)

In [None]:
df['price'].plot()

### Slicing With Time Series

- Make sure your index is sorted first.
- Feed in 2 dates as strings for slicing.
- Always use .loc when slicing dates
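
The three rules above, sketched on a tiny made-up frame (dates and prices are illustrative only):

```python
import pandas as pd

# Hypothetical frame with an unsorted DatetimeIndex
dates = pd.to_datetime(["2015-03-01", "2014-05-02", "2014-12-25", "2015-06-10"])
demo = pd.DataFrame({"price": [300, 100, 200, 400]}, index=dates)

demo = demo.sort_index()                        # 1. sort the index first
window = demo.loc["2014-05-01":"2015-05-01"]    # 2. + 3. two date strings with .loc
year_2014 = demo.loc["2014"]                    # a partial string selects the whole year
```

Unlike integer slicing, date-string slices include both endpoints.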

In [None]:
df.sort_index(inplace=True)

In [None]:
df.loc['2014-05-01':'2015-05-01','price'].plot().autoscale(axis='x',tight=True)

## Time series Frequencies


In [None]:
df.index

In [None]:
df.index

In [None]:
ts = df['price']

In [None]:
freq_codes = ['D','3D','W','M', 'Q']
for freq in freq_codes:
    plt.figure()
    title=f"Freq Code: {freq}"
    ts.resample(freq).mean().plot(title=title, label=freq)
    plt.legend()  # call after plotting so there is something to label
    
ax = ts.resample('M').mean().plot(kind='bar')

In [None]:

ts.resample('D').mean().plot()

## Using Datetime objects and apply statements

In [None]:
display(df.head())
df.dtypes

In [None]:
t = df.index[0]  # first timestamp in the index
display(t)
print(t)

In [None]:
# help(t.strptime)

In [None]:
print(t.strftime("%m-%d-%Y"))
print(t.strftime("%H:%M:%S"))  # portable equivalent of "%T", which is not supported on Windows

In [None]:
df.index

In [None]:
df['month'] = df.index.to_series().apply(lambda x: x.month)
df.head()


In [None]:
## Let's make a month column to groupby
df['month_int'] = df.index.to_series().apply(lambda x: x.month) #x
df['month_name'] =df.index.to_series().apply(lambda x:x.strftime("%B"))

for col in ['month_int','month_name']:
    display(df[col].value_counts(normalize=True))
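
The `apply` calls above work, but a `DatetimeIndex` also exposes these fields directly. A sketch of the vectorized alternative on a small made-up frame (dates and prices are illustrative):

```python
import pandas as pd

idx = pd.to_datetime(["2014-05-02", "2014-12-25", "2015-03-01"])
demo = pd.DataFrame({"price": [100, 200, 300]}, index=idx)

# Same values as .apply(lambda x: x.month), without a Python-level loop
demo["month_int"] = demo.index.month

# Same as strftime("%B") in an English locale
demo["month_name"] = demo.index.month_name()

print(demo)
```

For a datetime column rather than an index, the same attributes are available through the `.dt` accessor (for example `df['d_date'].dt.month`).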

In [None]:
df.index.to_series().apply(lambda x: x.month)

In [None]:
help(ax.xaxis.set_ticklabels)

In [None]:
fig = ax.get_figure()

In [None]:
ax.xaxis.set_ticklabels(ax.xaxis.get_ticklabels(),**{'rotation':45,
                                                    'ha':'right'}) 
# ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
fig