In [1]:
# notebook dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Time Series Module
    date: Tuesday, August 9th 2022

----

<u>``General Topics``</u>

* temporal 
* periodic 
* resampling
* stationary
* non-stationary

<u>``Terms``</u>

* auto-correlation  
* correlation

<u>``Research``</u>

* Scaling "X" variable, which can be a representation of time

<u>``Types of Models``</u>

* Persistence Model ("Naive")
* Simple Average (pretty "clunky")
* Moving Average (and "Exponential Moving Average")

<u>``Technique for Exploration``</u>

* Decomposition
* Holt-Winters Method (named after Charles C. Holt and Peter R. Winters)

<u>``Additional Predictive Model Techniques``</u>

* ARIMA
* "Prophet" (created by META inc.)

<u>``Useful Resources``</u>


* freesourcecamp.com


### Data Acquisition

* HTTP (*HyperText Transfer Protocol*): Plain text transportation
* Request
* Response
* HTML (*HyperText Markup Language*): Document structure for a webpage
* JSON (*JavaScript Object Notation*): Data interchange format based on JavaScript (structure is very similar to Python dictionaries)
* API (*Application Programming Interface*): How software is interacted with programmatically
* REST (*Representational State Transfer*): A set of conventions for structuring application URLs

| HTTP Method | Endpoint         | Description                |
| ---         | ---              | ---                        |
| GET         | /{resource}/{id} | Read details of a resource |
| GET         | /{resource}      | A listing of resources     |
| POST        | /{resource}      | Create a new resource      |
| PATCH       | /{resource}/{id} | Update a resource          |
| DELETE      | /{resource}/{id} | Delete a resource          |




**``HTTP status codes:``**

* 200s: everything's good
* 300s: redirecting
* 400s: you did something wrong
* 500s: something is wrong with the server
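
As a small illustration of the ranges above, the sketch below maps a status code to its category. The helper name `status_class` is hypothetical (it is not part of any library); `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Map an HTTP status code to the informal categories listed above."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# a few well-known codes, one per category
for status in (
    HTTPStatus.OK,
    HTTPStatus.MOVED_PERMANENTLY,
    HTTPStatus.NOT_FOUND,
    HTTPStatus.INTERNAL_SERVER_ERROR,
):
    print(status.value, status.phrase, "->", status_class(status.value))
```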

#### ``Quote Generator``

Refers to...

----
#### **``date: Wednesday, August 10th 2022``**

<u>``useful Pandas functions & methods:``</u>

- to_datetime()

    <u>to extract parts of the date:</u>
    - use the `.dt` accessor: `.dt.day`, `.dt.month`, `.dt.year`, `.dt.day_name()`
    - `.dt` must come after a datetime series/column (i.e., `df["date"].dt.year`)
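
A minimal sketch of the `.dt` accessor on a toy column (the dates and column names here are made up for illustration):

```python
import pandas as pd

# toy data: dates stored as plain strings
df = pd.DataFrame({"date": ["2020-03-13", "2020-03-14", "2020-03-15"]})

# convert the column to a datetime type first; .dt only works on datetime columns
df["date"] = pd.to_datetime(df["date"])

# pull individual parts out of each date
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.day_name()

print(df)
```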

``let's look at an example:``

example: 2020-03-13 12-PM is what our date looks like

<u>It is made up of several parts:</u>

* a 4 digit year: `%Y`
* followed by a hyphen `-`
* a two digit month: `%m`
* followed by a hyphen `-`
* a two digit day: `%d`
* a space ` `
* a 12-hour clock number: `%I`
* a hyphen `-`
* an AM/PM `%p`

With this info we can now build our format string for **2020-03-13 12-PM**:

**`%Y-%m-%d %I-%p`**
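
Putting the parts listed above together gives the format string `%Y-%m-%d %I-%p`. A short sketch of passing it to `pd.to_datetime` (the sample strings are made up for illustration):

```python
import pandas as pd

# sample strings in the same layout as the example date
raw = pd.Series(["2020-03-13 12-PM", "2020-03-13 01-PM", "2020-03-14 12-AM"])

# year-month-day, a space, 12-hour clock number, a hyphen, AM/PM
fmt = "%Y-%m-%d %I-%p"

parsed = pd.to_datetime(raw, format=fmt)
print(parsed)
```

Giving an explicit format avoids guessing and makes pandas raise an error if a value does not match the expected layout.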


**``Steps for adding a column of weekday names:``**

* `df.index.day_name()` returns the weekday name for each date in a DatetimeIndex
* `df['weekday_name'] = df.index.day_name()` stores those names as a new column

**``Can then GroupBy:``**

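* `df.groupby('weekday_name').mean(numeric_only=True)` (`numeric_only=True` skips text columns, which recent pandas versions no longer drop silently)
* `df.groupby('weekday_name').mean(numeric_only=True).close.plot()` (consider a plot?)
* `df.groupby('weekday_name').mean(numeric_only=True).volume.plot()` (consider a plot?)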
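
A minimal sketch of the weekday grouping on toy data. The `close` and `volume` values are made up, and the reindexing step is an addition so the weekdays appear in calendar order rather than alphabetical order:

```python
import numpy as np
import pandas as pd

# two weeks of toy daily data on a DatetimeIndex
idx = pd.date_range("2019-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {"close": np.arange(14.0), "volume": np.arange(14.0) * 10},
    index=idx,
)

# add the weekday name for every date in the index
df["weekday_name"] = df.index.day_name()

# average the numeric columns per weekday
by_weekday = df.groupby("weekday_name")[["close", "volume"]].mean()

# groupby sorts alphabetically; put the days back in calendar order
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_weekday = by_weekday.reindex(order)

print(by_weekday)
```

From here, `by_weekday["close"].plot()` would draw the per-weekday averages.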

**``Getting a subset of the time data:``**

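* `df.loc['2019-11-19 12:00:00']` (a date string can be passed to the pandas `.loc` indexer)
* `df['2018':'2019']` (all rows from 2018 through 2019, endpoints included)
* `df.loc['2018-11']` (a specific month; newer pandas versions no longer accept plain `df['2018-11']` for selecting rows)
* `df['2018-01-01':'2018-06-30']` (a specific date range)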
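
A sketch of these selections on toy daily data (values are made up). It uses `.loc` throughout, which is an assumption about the safest spelling across pandas versions rather than the only one that works:

```python
import numpy as np
import pandas as pd

# two full years of toy daily data
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")
df = pd.DataFrame({"close": np.arange(len(idx))}, index=idx)

# a single timestamp
one_value = df.loc["2019-11-19 00:00:00", "close"]

# everything from the start of 2018 through the end of 2019
both_years = df.loc["2018":"2019"]

# one month, selected with a partial date string
november = df.loc["2018-11"]

# an explicit date range (both endpoints are included)
first_half = df.loc["2018-01-01":"2018-06-30"]

print(len(both_years), len(november), len(first_half))
```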


**``Changing the period of the dataset/df:``**

* `by_month = df.asfreq('M')` (using the pandas `asfreq()` method; `'M'` is the month-end alias, spelled `'ME'` in recent pandas releases)
* `by_month.close.plot()` (consider plotting?)


**``Getting first day of the month:``**

* `by_month_first_day = df.asfreq('MS')` (`'MS'` is the month-start alias)

*(consider a plot?)*
* `by_month.close.plot(label='Last Day of Month Frequency')`
* `by_month_first_day.close.plot(label='First Day of Month Frequency')`
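
A sketch comparing month-end and month-start frequencies on toy data. Because the month-end string alias differs between pandas versions, this sketch passes the `pd.offsets.MonthEnd()` offset object instead; that substitution is a choice made here, not something from the notes above.

```python
import numpy as np
import pandas as pd

# three months of toy daily data
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

# keep only the observation on the last day of each month
by_month = df.asfreq(pd.offsets.MonthEnd())

# keep only the observation on the first day of each month
by_month_first_day = df.asfreq("MS")

print(by_month)
print(by_month_first_day)
```

Note that `asfreq` only picks existing rows at the new frequency; it does not aggregate the values in between.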

**<u>useful links:</u>**

[Pandas asfreq Offset Aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)

[Pandas asfreq Anchored Offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets)


**<u>"Upsampling": Increasing Frequency</u>**

* `by_half_hour = df.asfreq('30T')`: half-hour intervals (recent pandas releases prefer the alias `'30min'`); the newly created rows are NaN unless a fill method is given
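
A sketch of upsampling hourly toy data to half-hour intervals. It uses the `'30min'` alias and shows forward-filling as one possible way to handle the new empty rows; the values are made up.

```python
import pandas as pd

# three hourly observations
idx = pd.date_range("2019-01-01 00:00", periods=3, freq="60min")
df = pd.DataFrame({"close": [10.0, 12.0, 11.0]}, index=idx)

# upsample: the new half-hour rows have no data, so they are NaN
by_half_hour = df.asfreq("30min")

# same upsampling, but carry the last known value forward
filled = df.asfreq("30min", method="ffill")

print(by_half_hour)
print(filled)
```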

----
#### ``How about Lagging or Leading the data?``

* `.shift`: move the data backwards and forwards by a given amount
* `.diff`: find the difference with the previous observation (or a specified further back observation)

<u>Examples:</u>

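* `daily_df['shift_by_one'] = daily_df.close.shift(1)` (lag: each row gets the previous observation)
* `daily_df.close.shift(-1)` (lead: each row gets the next observation)
* `daily_df['diff(1)'] = daily_df.close.diff(1)`
* `daily_df['other_diff'] = daily_df.close - daily_df['shift_by_one']` (same result as `diff(1)`)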
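
A runnable version of the examples above on a four-day toy series (the prices are made up), showing that `diff(1)` equals the value minus its one-step lag:

```python
import pandas as pd

daily_df = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

# lag: yesterday's close on today's row (first row has no predecessor -> NaN)
daily_df["shift_by_one"] = daily_df["close"].shift(1)

# lead: tomorrow's close on today's row (last row has no successor -> NaN)
daily_df["lead_by_one"] = daily_df["close"].shift(-1)

# difference with the previous observation, computed two ways
daily_df["diff(1)"] = daily_df["close"].diff(1)
daily_df["other_diff"] = daily_df["close"] - daily_df["shift_by_one"]

print(daily_df)
```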

----
#### Example Questions using .shift()

**Examples:**

* What is the yearly rate of return or return on investment (ROI)?
* What is the year-over-year percentage difference?
* Maybe month-over-month?
* What about day-over-day?

``Incoming math warning!``

![Math gif](https://media4.giphy.com/media/DHqth0hVQoIzS/giphy.gif?cid=ecf05e47ciwxjpj7j3mkqv6dxplynda0k44lru3atznbajs1&rid=giphy.gif&ct=g)


**``How do we calculate ROI?``**

`yearly_rate_return` = (`where_we_are_today` - `where_we_were_a_year_ago`  ) / `where_we_were_a_year_ago`
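
A sketch of the formula above using `.shift()` on toy year-end prices (the numbers are made up). The comparison with `pct_change()` is an added cross-check, since that method computes the same ratio:

```python
import numpy as np
import pandas as pd

# toy year-end closing prices
yearly = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2017-12-31", "2018-12-31", "2019-12-31"]),
    name="close",
)

# where we were a year ago: the previous row
a_year_ago = yearly.shift(1)

# (today - a year ago) / a year ago
yearly_rate_return = (yearly - a_year_ago) / a_year_ago

print(yearly_rate_return)

# pandas has a built-in for the same calculation
same_thing = yearly.pct_change()
print(np.allclose(yearly_rate_return, same_thing, equal_nan=True))
```

For month-over-month or day-over-day figures, the same pattern applies to data at that frequency.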




----
#### **``date: Thursday, August 11th 2022``**
    focus: data preparation

<u>**Module Takeaways**</u>

1. Convert date-data to proper DateTime type
2. Move dates to index in Pandas DataFrame

``Let's work with some data``

In [4]:
# let's import a dataframe

df = pd.read_csv("/Users/mijailmariano/codeup-data-science/time-series-exercises/merged_sales.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,item,sale_amount,sale_date,store,item_brand,item_name,item_price,item_upc12,item_upc14,store_address,store_city,store_state,store_zipcode
0,0,1,13.0,"Tue, 01 Jan 2013 00:00:00 GMT",1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
1,1,1,11.0,"Wed, 02 Jan 2013 00:00:00 GMT",1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
2,2,1,14.0,"Thu, 03 Jan 2013 00:00:00 GMT",1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
3,3,1,13.0,"Fri, 04 Jan 2013 00:00:00 GMT",1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
4,4,1,10.0,"Sat, 05 Jan 2013 00:00:00 GMT",1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253


In [5]:
# first, inspect the data:

df.shape

(913000, 14)

In [6]:
# where "sale_date" is an object type

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Unnamed: 0     913000 non-null  int64  
 1   item           913000 non-null  int64  
 2   sale_amount    913000 non-null  float64
 3   sale_date      913000 non-null  object 
 4   store          913000 non-null  int64  
 5   item_brand     913000 non-null  object 
 6   item_name      913000 non-null  object 
 7   item_price     913000 non-null  float64
 8   item_upc12     913000 non-null  int64  
 9   item_upc14     913000 non-null  int64  
 10  store_address  913000 non-null  object 
 11  store_city     913000 non-null  object 
 12  store_state    913000 non-null  object 
 13  store_zipcode  913000 non-null  int64  
dtypes: float64(2), int64(6), object(6)
memory usage: 97.5+ MB


In [8]:
# let's inspect the date

df["sale_date"].head()

0    Tue, 01 Jan 2013 00:00:00 GMT
1    Wed, 02 Jan 2013 00:00:00 GMT
2    Thu, 03 Jan 2013 00:00:00 GMT
3    Fri, 04 Jan 2013 00:00:00 GMT
4    Sat, 05 Jan 2013 00:00:00 GMT
Name: sale_date, dtype: object

In [10]:
# let's use pandas' to_datetime function to convert the column type
# (infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default)

pd.to_datetime(df["sale_date"], infer_datetime_format=True)

0        2013-01-01
1        2013-01-02
2        2013-01-03
3        2013-01-04
4        2013-01-05
            ...    
912995   2017-12-27
912996   2017-12-28
912997   2017-12-29
912998   2017-12-30
912999   2017-12-31
Name: sale_date, Length: 913000, dtype: datetime64[ns]

In [11]:
# let's convert the date column

df["sale_date"] = pd.to_datetime(df["sale_date"], infer_datetime_format = True)
df.info() # checks out!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Unnamed: 0     913000 non-null  int64         
 1   item           913000 non-null  int64         
 2   sale_amount    913000 non-null  float64       
 3   sale_date      913000 non-null  datetime64[ns]
 4   store          913000 non-null  int64         
 5   item_brand     913000 non-null  object        
 6   item_name      913000 non-null  object        
 7   item_price     913000 non-null  float64       
 8   item_upc12     913000 non-null  int64         
 9   item_upc14     913000 non-null  int64         
 10  store_address  913000 non-null  object        
 11  store_city     913000 non-null  object        
 12  store_state    913000 non-null  object        
 13  store_zipcode  913000 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(6), object(5)

In [17]:
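# set_index() places the date column as the index;
# the trailing reset_index() moves it back to a regular column (see the output below),
# so drop it and assign the result to keep the dates as the index

df.set_index("sale_date").reset_index()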

Unnamed: 0.1,sale_date,Unnamed: 0,item,sale_amount,store,item_brand,item_name,item_price,item_upc12,item_upc14,store_address,store_city,store_state,store_zipcode
0,2013-01-01,0,1,13.0,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
1,2013-01-02,1,1,11.0,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
2,2013-01-03,2,1,14.0,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
3,2013-01-04,3,1,13.0,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
4,2013-01-05,4,1,10.0,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
912995,2017-12-27,912995,50,63.0,10,Choice,Choice Organic Teas Black Tea Classic Black - ...,5.20,47445919221,47445919221,8503 NW Military Hwy,San Antonio,TX,78231
912996,2017-12-28,912996,50,59.0,10,Choice,Choice Organic Teas Black Tea Classic Black - ...,5.20,47445919221,47445919221,8503 NW Military Hwy,San Antonio,TX,78231
912997,2017-12-29,912997,50,74.0,10,Choice,Choice Organic Teas Black Tea Classic Black - ...,5.20,47445919221,47445919221,8503 NW Military Hwy,San Antonio,TX,78231
912998,2017-12-30,912998,50,62.0,10,Choice,Choice Organic Teas Black Tea Classic Black - ...,5.20,47445919221,47445919221,8503 NW Military Hwy,San Antonio,TX,78231
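
A minimal sketch of both module takeaways on a two-row stand-in for the sales data (only `sale_date` and `sale_amount` are reproduced). The explicit format string is an assumption: it treats `GMT` as literal text, which leaves timezone-naive timestamps like those in the output above. Sorting the index is an added step that makes date slicing reliable.

```python
import pandas as pd

# two rows in the same date layout as the sales data, deliberately out of order
df = pd.DataFrame(
    {
        "sale_date": ["Wed, 02 Jan 2013 00:00:00 GMT", "Tue, 01 Jan 2013 00:00:00 GMT"],
        "sale_amount": [11.0, 13.0],
    }
)

# 1. convert the date strings to a proper datetime type
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%a, %d %b %Y %H:%M:%S GMT")

# 2. move the dates to the index (assign the result!) and sort chronologically
df = df.set_index("sale_date").sort_index()

print(df)
print(type(df.index))
```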


----
#### **``Time Series Split & SKLearn Module``**

``Seasonality``
A repeated cycle in the data. Occurs at fixed frequency. In our weather data there is yearly and daily seasonality.

``Trend``
Long-term upward or downward movement in the data.

``Cycle``
An arbitrary chunk of time, usually longer than a season or made up of multiple seasons.

``Data Splitting``
- Ideally all splits contain a season
- Human-based: Using domain knowledge, a cutoff is selected. (ex: use the last year as Test)
- Percentage-based: A cutoff is selected arbitrarily (ex: use the last 20% of observations as Test)
- Cross-validation-based: Break data into slices and use successive slices as train and test repeatedly

    **`sklearn.model_selection.TimeSeriesSplit`**
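
A sketch of the cross-validation-based approach with `TimeSeriesSplit` on twelve toy observations. Each fold trains on everything before its test slice, so the model never sees the future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# twelve toy observations, already in time order
y = np.arange(12)

# three successive train/test folds
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(y))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```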

``Percentage Based``
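
A sketch of a percentage-based split on toy data, holding out the last 20% of observations as test. The column name `sales_total` and the 20% figure are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# 100 days of toy data, in time order
idx = pd.date_range("2019-01-01", periods=100, freq="D")
df = pd.DataFrame({"sales_total": np.arange(100.0)}, index=idx)

# hold out the final 20% of rows; never shuffle time series data
test_fraction = 0.2
cutoff = len(df) - round(len(df) * test_fraction)

train = df.iloc[:cutoff]
test = df.iloc[cutoff:]

print(train.shape, test.shape)
```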

-----
**``Topic: Time Series Modeling``**

date: Monday, August 15th 2022


### ``Forecast`` 

Forecasting is another word for predicting time series data. 

1. Last Observed Value: The future will look like the present.
2. Simple Average: The future will look, on average, like history. 
3. Moving Average: The future will look, on average, like recent history. 
4. Holt's Linear Trend
5. Previous Cycle


#### ``Last observed value (baseline)``

The simplest method for forecasting is to predict all future values to be the last observed value.  

<u>**``Make Predictions``**</u>

**<u>Sales Total Example</u>**

*take the last item of sales total and assign to variable*

last_sales = train['sales_total'].iloc[-1]

**<u>Quantity Example</u>**

*take the last item of quantity and assign to variable*

last_quantity = train['quantity'].iloc[-1]
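
A sketch of the full last-observed-value baseline on toy data: take the last training value, predict it for every date in the validation set, and score with RMSE. The tiny `train`/`validate` frames are made up for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    {"sales_total": [10.0, 12.0, 11.0, 13.0]},
    index=pd.date_range("2016-01-01", periods=4, freq="D"),
)
validate = pd.DataFrame(
    {"sales_total": [14.0, 12.0]},
    index=pd.date_range("2016-01-05", periods=2, freq="D"),
)

# the last observed value in train
last_sales = train["sales_total"].iloc[-1]

# predict that same value for every date we want to forecast
yhat_df = pd.DataFrame({"sales_total": last_sales}, index=validate.index)

# root mean squared error of the baseline
rmse = np.sqrt(((validate["sales_total"] - yhat_df["sales_total"]) ** 2).mean())
print(last_sales, rmse)
```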

----

#### ``Simple Average (baseline)``

- OK to use if there is no trend in your data


Take the simple average of historical values and use that value to predict future values.   

This is a good option for an initial baseline. Every future datapoint (those in 'test') will be assigned the same value, and that value will be the overall mean of the values in train. 
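
A sketch of the simple average baseline on toy data (the `train` values and `validate` dates are made up): every future date gets the overall mean of train.

```python
import pandas as pd

train = pd.DataFrame(
    {"sales_total": [10.0, 12.0, 11.0, 13.0]},
    index=pd.date_range("2016-01-01", periods=4, freq="D"),
)
validate_index = pd.date_range("2016-01-05", periods=3, freq="D")

# overall mean of the historical values
avg_sales = train["sales_total"].mean()

# assign that single value to every future date
yhat_df = pd.DataFrame({"sales_total": avg_sales}, index=validate_index)
print(yhat_df)
```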


----

#### ``Moving Average (baseline)``

In this example, we will use a 30-day moving average to forecast. In other words, the average over the last 30-days will be used as the forecasted value.

``example code:``

- demonstrate that the mean of the first 30 days is equal to `rolling(30)` on day 30

print(train['sales_total'].rolling(30).mean())
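
A sketch on 60 days of toy data showing both points: the rolling mean on day 30 matches the plain mean of the first 30 days, and the last rolling value is what would be used as the forecast.

```python
import numpy as np
import pandas as pd

# 60 days of toy data
idx = pd.date_range("2016-01-01", periods=60, freq="D")
train = pd.DataFrame({"sales_total": np.arange(60.0)}, index=idx)

# 30-day rolling mean; the first 29 rows are NaN (window not yet full)
rolling_mean = train["sales_total"].rolling(30).mean()

# mean of the first 30 days, computed directly
first_30_mean = train["sales_total"].iloc[:30].mean()
print(rolling_mean.iloc[29], first_30_mean)

# the forecast value: the average over the most recent 30 days
forecast_value = rolling_mean.iloc[-1]
print(forecast_value)
```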

----
#### ``Holt's Linear Trend``


Exponential smoothing applied to both the average and the trend (slope).  

- $\alpha$ / smoothing_level: smoothing parameter for the mean (level). Values closer to 1 smooth less and give greater weight to recent values.

- $\beta$ / smoothing_slope (named `smoothing_trend` in newer statsmodels releases): smoothing parameter for the slope (trend). Values closer to 1 give greater weight to the most recent changes in slope. 
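
To make the roles of $\alpha$ and $\beta$ concrete, here is a hand-rolled sketch of the standard Holt recurrences in NumPy. The function `holt_linear` is hypothetical (it is not the statsmodels implementation), and the initialization (first value as level, first difference as trend) is one common choice assumed here.

```python
import numpy as np


def holt_linear(y, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear trend method."""
    y = np.asarray(y, dtype=float)

    # assumed initialization: first observation and first difference
    level = y[0]
    trend = y[1] - y[0]

    for t in range(1, len(y)):
        prev_level = level
        # level: blend the new observation with the previous one-step forecast
        level = alpha * y[t] + (1 - alpha) * (prev_level + trend)
        # trend: blend the latest change in level with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend

    # h-step-ahead forecast: last level plus h times the last trend
    steps = np.arange(1, horizon + 1)
    return level + trend * steps


# a perfectly linear toy series: y = 2t + 5 for t = 0..9
series = 2 * np.arange(10) + 5
forecast = holt_linear(series, alpha=0.5, beta=0.3, horizon=3)
print(forecast)
```

On noisy data, raising `alpha` or `beta` makes the level or slope react faster to recent observations, at the cost of less smoothing.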

**``Suggestion: loop through candidate parameters/periods and keep the combination with the lowest RMSE``**


**Seasonal Decomposition**


First, let's take a look at the seasonal decomposition for each target. 


``medium link on Holt Winters``
https://medium.com/analytics-vidhya/holt-winters-forecasting-13c2e60d983f

#### ``Basic Holt's Linear Trend``

**Make Predictions**

Now, like we would when using sklearn, we will create the Holt object, fit the model, and make predictions. 

Holt: 

- exponential = True/False (True: exponential, multiplicative trend; False: linear, additive trend)
- damped $\phi$ = True/False: with Holt, forecasts will increase or decrease indefinitely into the future. To avoid this, use the damped trend method, which has a damping parameter $0 < \phi < 1$. 


fit: 

- smoothing_level ($\alpha$): value between 0 and 1
- smoothing_slope ($\beta$): value between 0 and 1
- optimized: set to True to let statsmodels automatically find optimized parameter values for us. 

----
#### ``Predict Based on Previous Cycle``

Take all the 2016 data points, compute the daily year-over-year delta, and average that delta over all the days; adding that average to the previous year's value on a given day gives the forecast for that day. 

If a primary cycle is weekly, then you may want to do this on a week-over-week cadence. 

<u>**In the below example:** </u>

1. Compute the average 365-day (year-over-year) difference from 2013 through 2015.
2. Add that average delta to the values during 2015. 
3. Set the index in your yhat dataframe to represent the dates those predictions are made for. 

Let's get started....

**Re-split data**

train = df_resampled[:'2015']
validate = df_resampled.loc['2016']
test = df_resampled.loc['2017']

print(train.shape)
print(validate.shape)
print(test.shape)

**make predictions:**

- find the year-over-year difference for each day from 2013 to 2015 (that is, the 2013-2014 and 2014-2015 differences)
- take the mean, and then add that value to each daily value in 2015

yhat_df = train.loc['2015'] + train.diff(365).mean()
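
A sketch of all three steps on toy data (a steadily increasing series, made up for illustration). Shifting the index forward by one year with `pd.DateOffset(years=1)` is one way to do step 3 and is an assumption here, not necessarily how the original exercise did it.

```python
import numpy as np
import pandas as pd

# toy daily training data for 2013 through 2015
idx = pd.date_range("2013-01-01", "2015-12-31", freq="D")
train = pd.DataFrame({"sales_total": np.arange(len(idx), dtype=float)}, index=idx)

# 1. average year-over-year (365-day) difference across train
avg_delta = train.diff(365).mean()

# 2. add that average delta to each daily value in 2015
yhat_df = train.loc["2015"] + avg_delta

# 3. re-label the predictions with the dates they are made for (one year later)
yhat_df.index = yhat_df.index + pd.DateOffset(years=1)

print(yhat_df.head())
```

Because 2016 is a leap year, the shifted index has no prediction for February 29; that date would need separate handling before comparing against a full 2016 validation set.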