# Problem Session 6
## Forecasting The Bachelorette and Pumpkin Spice I

In the first of two problem sessions on time series, you will focus on the basics of time series forecasting. In particular, you will do some exploratory data analysis, test your understanding of how data splits are adjusted for time series, and build baseline models for two time series.

The problems in this notebook cover content from our `Time Series Forecasting` lectures, including:
- `What are Time Series and Forecasting`,
- `Adjustments for Time Series Data`,
- `Time and Dates in Python` and
- `Baseline Forecasts`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

sns.set_style("whitegrid")

#### 1. <a href="https://www.imdb.com/title/tt0348894/">The Bachelorette</a>

The first data set you will work with contains the IMDB ratings of every episode of The Bachelorette (as of May 2, 2023). These data were pulled with the `Cinemagoer` Python IMDB API wrapper.

##### a.

Load `the_bachelorette.csv` from the `data` folder, then look at the first five observations.

In [None]:
## code here



In [None]:
## code here



Here are descriptions for the columns of the data:
- `episode_number` is the number of the episode with respect to the entire series run,
- `title` is the title of the episode,
- `season` is the number of the season in which the episode aired,
- `season_episode_number` is the number of the episode with respect to the season in which it aired,
- `imdb_rating` is the average rating of the episode among IMDB's users.

##### b. 

Our goal will be to predict how good the next episode of The Bachelorette will be, which means we want a forecast horizon of $1$ episode. Make a train test split that sets aside the last three episodes as a test set.
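
As a reminder of how splits work for time series, here is a minimal sketch on a made-up table (the name `toy` and its ratings are invented for illustration and are not the Bachelorette data). The point is that the split is chronological: nothing is shuffled, and the test set is simply the tail of the series.

```python
import pandas as pd

# Made-up stand-in for an episode ratings table (values are invented)
toy = pd.DataFrame({
    "episode_number": range(1, 11),
    "imdb_rating": [7.1, 6.8, 7.0, 6.5, 6.9, 7.2, 6.4, 6.6, 6.7, 6.3],
})

# Chronological split: everything but the last three rows is training data
toy_train = toy.iloc[:-3].copy()

# The last three rows, in time order, form the test set
toy_test = toy.iloc[-3:].copy()

print(len(toy_train), len(toy_test))
```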

In [None]:
## code here



In [None]:
## code here



##### c. 

Plot the IMDB rating for each episode using the training data.

Does this time series seem to exhibit a trend? Does this time series seem to exhibit seasonality? If it exhibits either, do your best to describe what you see.

In [None]:
plt.figure(figsize=(16,5))

## Write code to plot the time series here



plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.xlabel("Episode Number", fontsize=12)
plt.ylabel("IMDB Rating", fontsize=12)

    
plt.show()

##### Write here



##### d.

Choose a baseline model that you could build on these data. Plot the forecast from this baseline along with the training data; do not plot the test data.

Recall that we learned about the following baseline models for non-seasonal data:
- The average forecast
- The naive forecast
- The trend forecast and
- The random walk with drift.
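
If you want a refresher on how these four baselines are computed, the sketch below applies each one to a short made-up series. The names `y_train`, `h` and the `*_fc` arrays are hypothetical, and the trend forecast is assumed here to be a straight line fit by least squares; check this against the definitions from lecture before relying on it.

```python
import numpy as np

# Made-up training series (not the Bachelorette data)
y_train = np.array([7.0, 6.8, 6.9, 6.5, 6.6, 6.2])
h = 3                           # number of future steps to forecast
steps = np.arange(1, h + 1)     # 1, 2, ..., h

# Average forecast: every future value is the training mean
average_fc = np.full(h, y_train.mean())

# Naive forecast: every future value is the last observed value
naive_fc = np.full(h, y_train[-1])

# Trend forecast (assumed linear): extend a least-squares line in time
t = np.arange(1, len(y_train) + 1)
slope, intercept = np.polyfit(t, y_train, 1)
trend_fc = intercept + slope * (len(y_train) + steps)

# Random walk with drift: last value plus the mean first difference per step
drift = np.diff(y_train).mean()
drift_fc = y_train[-1] + drift * steps

print(average_fc, naive_fc, trend_fc, drift_fc, sep="\n")
```

Note that the mean first difference telescopes to (last value - first value) / (n - 1), so the drift forecast is the line through the first and last training points extended forward.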

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### e.

Calculate the average cross-validation root mean squared error for your baseline model. 

Set up this cross-validation so that there are ten splits and each holdout set only has three observations in it.
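
One possible way to set up such a cross-validation is sketched below on a synthetic series, using scikit-learn's `TimeSeriesSplit` (its `test_size` argument requires scikit-learn 0.24 or later). The series `y`, the choice of the naive forecast, and the variable names are assumptions made for illustration; swap in your own data and baseline.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Synthetic series standing in for the training data
rng = np.random.default_rng(0)
y = 7 + rng.normal(scale=0.3, size=60)

# Ten splits, each holding out exactly three consecutive observations
cv = TimeSeriesSplit(n_splits=10, test_size=3)

rmses = np.zeros(10)
holdout_sizes = []

for i, (train_index, test_index) in enumerate(cv.split(y)):
    y_tt = y[train_index]   # training portion of this split
    y_ho = y[test_index]    # holdout portion, always later in time

    # Naive baseline: predict the last training value for every holdout step
    pred = np.full(len(y_ho), y_tt[-1])

    # RMSE is the square root of the mean squared error
    rmses[i] = np.sqrt(mean_squared_error(y_ho, pred))
    holdout_sizes.append(len(y_ho))

print(rmses.mean())
```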

In [None]:
## Import what you will need
from sklearn.model_selection import 
from sklearn.metrics import 

In [None]:
## Make the time series cv here
cv = 

In [None]:
## Make an array to hold the cv rmses here


## loop through the cv splits here
i = 0
for train_index, test_index in :
    ## Get the training and holdout sets
    tv_tt = 
    tv_ho = 
    
    ## Fit your model/Make your prediction on the holdout set here
    
    
    ## Record the rmse for the split here

    
    i = i + 1

In [None]:
## code here


We will return to these baseline performances in `Problem Session 7`.

#### 2. Pumpkin spice interest

The second data set you will work with in this problem session is a time series collected using <a href="https://trends.google.com/trends/?geo=US">Google Trends</a>. This data set contains the Google Trends interest level in the United States for the search term "pumpkin spice" since 2004.

##### a.

Load the data stored in `pumpkin_spice.csv` in the `Data` folder, then look at the first five rows.

You may want to turn the `Month` column into a `datetime` using the `parse_dates` argument of `read_csv`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html</a>.
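
The sketch below shows what `parse_dates` does, using a small in-memory CSV rather than the real file. The date format and values in `csv_text` are assumptions; the actual file may store `Month` differently.

```python
import io
import pandas as pd

# Tiny in-memory CSV; the date format here is an assumption
csv_text = (
    "Month,interest_level\n"
    "2004-01-01,1\n"
    "2004-02-01,2\n"
    "2004-03-01,1\n"
)

# parse_dates takes a list of column names to convert to datetimes on load
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Month"])

# Month should now have a datetime64 dtype rather than object
print(toy.dtypes)
```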

In [None]:
## code here



In [None]:
## code here



- The `Month` column of this data set gives the month and year that the interest was measured. 
- The `interest_level` column of this data set gives the level of interest for "pumpkin spice" in the United States. From Google Trends: "Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term."

##### b.

One thing you may need to get more practice with is identifying the <i>stakeholders</i> for a particular problem. The stakeholders are the people who are most interested in your problem and the outcome of your solution.

Thinking about this can help you frame your project goals and focus your thinking to provide a solution that most suits the stakeholders' wants/needs.

For this question, take some time to think about what kinds of people may be most interested in forecasting Google search interest in "pumpkin spice". Why might they be interested? How could this forecast best help them?

##### Write here




##### c.

Make a train test split in the data. Set aside all observations on or after January 1, 2022 as the test set.

<i>Hint: the `datetime` module could be useful.</i>
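
A date-based split can be sketched as follows on a made-up monthly table (the name `toy`, its dates and its values are invented). A `datetime` object can be compared directly against a pandas datetime column to build a boolean mask.

```python
from datetime import datetime
import pandas as pd

# Made-up monthly data from September 2021 through April 2022
toy = pd.DataFrame({
    "Month": pd.date_range("2021-09-01", periods=8, freq="MS"),
    "interest_level": [40, 70, 55, 20, 5, 4, 3, 3],
})

cutoff = datetime(2022, 1, 1)

# Strictly before the cutoff goes to training
toy_train = toy.loc[toy.Month < cutoff].copy()

# On or after the cutoff goes to the test set
toy_test = toy.loc[toy.Month >= cutoff].copy()

print(toy_train.Month.max(), toy_test.Month.min())
```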

In [None]:
## Get the training set here
p_train = 

## Get the test set here
p_test = 

##### d.

Plot the training data.

Does this time series appear to exhibit a trend or seasonality?

In [None]:
plt.figure(figsize=(16,5))

## Fill in the code to plot here
plt.plot()

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.xlabel("Date", fontsize=12)
plt.ylabel("Google Interest Level", fontsize=12)

plt.show()

##### Write here



##### e.

One way to explore the number of time steps in a given season is to plot scatter plots of the time series against itself at given <i>lags</i>. Such plots place the time series on the horizontal axis and the time series at $\ell$ steps into the future on the vertical axis. Seasonal data should exhibit a high correlation between itself and lags at multiples of the season length.

Make such scatter plots for lag values from $\ell=1$ to $\ell=25$. Also calculate the correlation between the time series and its lagged series for each value of $\ell$ (this is known as the <i>autocorrelation</i>). Using this information, how long would you say a season is?
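
To see what a seasonal autocorrelation pattern looks like, here is a sketch on a synthetic series with a known period of 12 steps (an assumption built into the toy data, not a claim about the pumpkin spice series). It uses pandas' `Series.autocorr`, which computes the correlation between a series and a lagged copy of itself.

```python
import numpy as np
import pandas as pd

# Synthetic series that repeats exactly every 12 steps
t = np.arange(120)
toy = pd.Series(50 + 40 * np.sin(2 * np.pi * t / 12))

# Autocorrelation at lags 1 through 25
lags = list(range(1, 26))
autocorrs = [toy.autocorr(lag=lag) for lag in lags]

# Lags where the series lines up almost perfectly with itself
high_lags = [lag for lag, r in zip(lags, autocorrs) if r > 0.99]

print(high_lags)
```

For this toy series the autocorrelation is near $1$ at multiples of the period and near $-1$ halfway between them, which is the kind of pattern to look for in your own plot.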

In [None]:
## This function takes in a lag and plots the 
## lagged scatter plots
## It returns the correlation coefficient for that lagged
## time series with itself
def make_lag_plot(lag):
    x = p_train.interest_level.values[:-lag]
    y = p_train.interest_level.values[lag:]
    
    plt.figure(figsize=(5,5))
    
    plt.scatter(x, y, alpha=.7)
    plt.plot([0,100], [0,100], 'k--')
    
    plt.title("Lag = " + str(lag), fontsize=16)
    
    plt.show()
    
    return np.corrcoef(x,y)[0,1]

In [None]:
## plot the lags and record the autocorrelations in a list or array here




In [None]:
## Plot the autocorrelation against the lag here
plt.figure(figsize=(7,5))



plt.xlabel("Lag", fontsize=12)
plt.ylabel("Auto-correlation", fontsize=12)

plt.xticks(range(0,25,3), fontsize=10)
plt.yticks(fontsize=10)

plt.ylim([-1.1,1.1])

plt.show()

##### Write here

##### f.

Select a baseline forecast for these data.

Plot the baseline forecast along with the training data.

Recall that for seasonal data we considered the following baselines in lecture:
- The seasonal average and
- The seasonal naive.
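
Both seasonal baselines can be sketched on a short made-up series as shown below. The season length `m`, the horizon `h` and the data are assumptions chosen for illustration, and the `reshape` step assumes the training length is an exact multiple of `m`.

```python
import numpy as np

# Two made-up "years" with a season length of 4 steps
y_train = np.array([1.0, 2.0, 10.0, 3.0,
                    2.0, 4.0, 12.0, 5.0])
m = 4   # assumed season length
h = 6   # forecast horizon
n = len(y_train)

# Seasonal naive: repeat the most recent full season as often as needed
seasonal_naive_fc = np.array([y_train[n - m + (k % m)] for k in range(h)])

# Seasonal average: average each within-season position over all seasons
# (reshape needs n to be a multiple of m)
position_means = y_train.reshape(-1, m).mean(axis=0)
seasonal_avg_fc = np.array([position_means[(n + k) % m] for k in range(h)])

print(seasonal_naive_fc)
print(seasonal_avg_fc)
```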

In [None]:
## Fill in the missing code to plot your training data and forecast
plt.figure(figsize=(16,5))

## Plot the training data
plt.plot(,
            label='Training Data')

## Plot the forecast
plt.scatter(,
               color='red',
               marker='x',
               s=150,
               label='Forecast')

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.xlabel("Date", fontsize=12)
plt.ylabel("Google Interest Level", fontsize=12)

plt.show()

##### g.

Get the average cross-validation RMSE for your baseline model. Do 5-fold cross-validation with a test set size of 12.

In [None]:
## Define the cross-validation here
cv = 

In [None]:
## Make an array of zeros to hold the cv rmses



## Loop through the cv splits
i = 0
for train_index, test_index in :
    p_tt = p_train.iloc[train_index]
    p_ho = p_train.iloc[test_index]
    
    
    ## Fit your model/get your predictions on the holdout set here
        
    
    ## record the rmse on the holdout set

    
    i = i + 1

In [None]:
## Find the average cv rmse


##### h.

Doesn't it seem like pumpkin spice shows up earlier each year? Use the training set to investigate this question. For each year in the training set, find the month where the peak search interest occurs. Does what you find support the implicit hypothesis of the question?

<i>Hint: the functions `get_year` and `get_month` could be useful here.</i>
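
One possible approach is sketched below on made-up data, using the pandas `.dt` accessor in place of the helper functions; the table `toy` and its values are invented. The idea is to find, within each year, the row with the largest interest level and then read off its month.

```python
import pandas as pd

# Two made-up years of monthly data
toy = pd.DataFrame({"Month": pd.date_range("2019-01-01", periods=24, freq="MS")})
toy["interest_level"] = [1, 1, 1, 1, 1, 1, 2, 10, 30, 50, 20, 5,
                         1, 1, 1, 1, 1, 1, 3, 25, 60, 40, 15, 4]

# Pull the year and month out of each date
toy["year"] = toy.Month.dt.year
toy["month"] = toy.Month.dt.month

# idxmax gives the row label of the maximum interest level within each year
peak_rows = toy.loc[toy.groupby("year").interest_level.idxmax()]

# Peak month indexed by year
peak_months = peak_rows.set_index("year").month

print(peak_months)
```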

In [None]:
def get_year(date):
    return date.year

def get_month(date):
    return date.month

In [None]:
## code here



In [None]:
## code here



##### Write here


--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).