Import libraries

In [None]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

warnings.simplefilter('ignore')

Read in `Miles_Traveled` dataset and print `dtypes`.

In [None]:
miles_traveled = pd.read_csv('data/Miles_Traveled.csv')
miles_traveled.dtypes

Rename columns to simpler names and convert `date` column (renamed from `DATE`) to `datetime`.

In [None]:
miles_traveled.columns = ['date', 'miles']
miles_traveled['date'] = pd.to_datetime(miles_traveled['date'])

The following code block generates a simple line plot showing the value of the `miles` column as the `date` increases. Note that there is an increasing trend and there seems to be some strong seasonality during each year.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='miles',
    ax=ax,
)
ax.spines[['right', 'top']].set_visible(False)

plt.show()

We will use linear regression to generate a predictive model for the data. From https://en.wikipedia.org/wiki/Linear_regression:

<div class="alert alert-block alert-info">
<b>Definition:</b> In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
</div>

<div class="alert alert-block alert-info">
<b>Formulation:</b> Given a data set $\{y_{i},\,x_{i1},\ldots ,x_{ip}\}_{i=1}^{n}$ of $n$ statistical units, a linear regression model assumes that the relationship between the dependent variable $y$ and the vector of regressors $x$ is linear. This relationship is modeled through a disturbance term or error variable $\varepsilon$ — an unobserved random variable that adds *noise* to the linear relationship between the dependent variable and regressors. Thus the model takes the form ${\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{i1}+\cdots +\beta _{p}x_{ip}+\varepsilon_{i}, \forall i\in \{1,\ldots ,n\}}$
</div>

**Note: In the above equation, $y$ is referred to as the dependent variable and the $x$'s as the independent variables**
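
To make the formulation concrete, here is a minimal sketch of a simple linear regression (one explanatory variable) solved by least squares with NumPy. The data are synthetic and the names `x`, `y`, `X`, and `beta` are illustrative assumptions, not part of the `Miles_Traveled` analysis.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x plus a small noise term.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Design matrix: a column of ones for the intercept beta_0, then x.
X = np.column_stack([np.ones_like(x), x])

# Least squares: choose beta to minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1 = beta
print(beta_0, beta_1)
```

The estimates should land close to the values used to generate the data, with the gap driven by the noise term $\varepsilon$.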

We will start simple and only add a `period` value to represent time.

In [None]:
miles_traveled = pd.read_csv('data/Miles_Traveled.csv')
miles_traveled.columns = ['date', 'miles']
miles_traveled['date'] = pd.to_datetime(miles_traveled['date'])

miles_traveled = miles_traveled.reset_index()
miles_traveled = miles_traveled.rename(columns={'index': 'period'})
miles_traveled.head()

The following code block imports the formula API from `statsmodels`. This is the simplest library I have found for conducting a linear regression in Python.

In [None]:
import statsmodels.formula.api as smf

The following code block uses the `ols` function to fit the regression model. `ols` stands for **O**rdinary **L**east **S**quares.

In [None]:
formula = 'miles ~ period'

reg_period = smf.ols(formula, data=miles_traveled).fit()
reg_period.summary()

The following code block uses the model to generate a prediction for the miles traveled during each period.

In [None]:
miles_traveled['ols_prediction_period'] = reg_period.predict(miles_traveled)
miles_traveled.head()

The following code block plots the prediction and the original data. **Based on the visualization, are there any issues with the model?**

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='miles',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='ols_prediction_period',
    ax=ax,
    label='Prediction (Period)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

Let's see what happens if we refit the model using the year and month of travel, with the month represented as an integer. The following code block prepares the data.

In [None]:
miles_traveled['year'] = miles_traveled['date'].dt.year
miles_traveled['month_int'] = miles_traveled['date'].dt.month
miles_traveled.head()

The following code block fits the regression model. **Note that the `R-squared` value increases. What does that mean?**

In [None]:
formula = 'miles ~ year + month_int'

reg_year_month_int = smf.ols(formula, data=miles_traveled).fit()
reg_year_month_int.summary()
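
As a reference point for the question above, `R-squared` can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper named `r_squared` and a tiny made-up array; it is an illustration rather than the way `statsmodels` computes the value internally.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that y_pred accounts for."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                             # predictions match exactly
mean_only = r_squared(y, np.full_like(y, y.mean()))   # always predict the mean
print(perfect, mean_only)
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, so a higher value means the model accounts for more of the variation in the data. For a model with an intercept, applying the same calculation to the `miles` column and a prediction column should agree with the value shown in the summary.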

The following code block generates the prediction.

In [None]:
miles_traveled['ols_prediction_year_month_int'] = reg_year_month_int.predict(miles_traveled)
miles_traveled.head()

The following code block plots the prediction and the original data. **Based on the visualization, does the new model fit better?**

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='miles',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='ols_prediction_year_month_int',
    ax=ax,
    label='Prediction (Year/Month Int)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

Let's take a closer look at the data before 1/1/1975. **What is the model doing as the month changes? Why do you think it is doing this?**

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled[miles_traveled['date'] < '1/1/1975'],
    x='date',
    y='miles',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=miles_traveled[miles_traveled['date'] < '1/1/1975'],
    x='date',
    y='ols_prediction_year_month_int',
    ax=ax,
    label='Prediction (Year/Month Int)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()
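
One way to investigate the question above is to reproduce the structure of `miles ~ year + month_int` on synthetic data. The sketch below is an assumption-laden illustration: the series is invented, and the fit uses NumPy least squares instead of `statsmodels`.

```python
import numpy as np

# Hypothetical monthly series: upward trend plus a mid-year peak.
years = np.repeat(np.arange(1970, 1975), 12)
months = np.tile(np.arange(1, 13), 5)
seasonal = 10.0 * np.sin((months - 4) * np.pi / 6)  # largest in July
y = 100.0 + 5.0 * (years - 1970) + seasonal

# Same structure as 'miles ~ year + month_int': intercept, year, month.
X = np.column_stack([np.ones(years.size), years, months])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta

# Month-to-month change of the prediction within the first year.
steps = np.diff(pred[:12])
print(steps)
```

Every step equals the single `month_int` coefficient, because a numeric month can only add the same amount for each additional month. The prediction therefore ramps linearly from January to December and cannot rise and fall within a year.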

Let's now see what happens if we use the name of the month instead of an integer representation.

In [None]:
miles_traveled['month_name'] = miles_traveled['date'].dt.month_name()
miles_traveled.head()

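The following code block fits the model and prints the summary. Because `month_name` is a string column, the formula API treats it as categorical: we get a separate coefficient for each month except a reference month (the first alphabetically, `April`), whose effect is absorbed into the intercept.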

In [None]:
formula = 'miles ~ year + month_name'

reg_year_month_name = smf.ols(formula, data=miles_traveled).fit()
reg_year_month_name.summary()
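
The month coefficients in the summary come from indicator (dummy) columns built from the month names. The sketch below imitates that encoding with `pd.get_dummies` on a made-up four-row sample; it is only an illustration, since `statsmodels` builds its own design matrix from the formula.

```python
import pandas as pd

# Hypothetical mini-sample of month names.
months = pd.Series(['January', 'April', 'July', 'April'], name='month_name')

# One 0/1 column per month. drop_first=True drops the first level
# alphabetically ('April' here), which then acts as the reference month.
dummies = pd.get_dummies(months, drop_first=True, dtype=int)
print(dummies)
```

Rows for the reference month are all zeros, so each remaining coefficient measures a month's shift relative to that reference.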

The following code block generates predictions using the revised model.

In [None]:
miles_traveled['ols_prediction_year_month_name'] = reg_year_month_name.predict(miles_traveled)
miles_traveled.head()

The following code block plots the predictions.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='miles',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=miles_traveled,
    x='date',
    y='ols_prediction_year_month_name',
    ax=ax,
    label='Prediction (Year/Month Name)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

The zoomed-in version is plotted below.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=miles_traveled[miles_traveled['date'] < '1/1/1975'],
    x='date',
    y='miles',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=miles_traveled[miles_traveled['date'] < '1/1/1975'],
    x='date',
    y='ols_prediction_year_month_name',
    ax=ax,
    label='Prediction (Year/Month Name)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

## Homework 5

Conduct a linear regression for the `BeerWineLiqour.csv` data and:
1. Generate a plot that shows the fitted model in comparison to the original data,
2. Describe how well the model explains the data, and 
3. State the months with the smallest and largest impacts on sales.

The data captured in the `Miles_Traveled.csv` data file seemed to exhibit *additive* seasonality. Let's now look at a case where the seasonality is *multiplicative* and see if we can update our regression model to account for this change. The following code block reads the data and prints the data types.

In [None]:
alcohol_sales = pd.read_csv('data/Alcohol_Sales.csv')
alcohol_sales.dtypes

As was done for the `Miles_Traveled` data, we will rename the columns and change the (renamed) `date` column to be a `datetime`.

In [None]:
alcohol_sales.columns = ['date', 'sales']
alcohol_sales['date'] = pd.to_datetime(alcohol_sales['date'])

The following code block plots the sales data. As you can see, the range of the seasonal variation seems to amplify as the sales increase.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='sales',
    ax=ax,
)
ax.spines[['right', 'top']].set_visible(False)

plt.show()
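
To make the distinction between additive and multiplicative seasonality concrete, the sketch below builds two synthetic series on the same trend and compares the size of the seasonal swing in the first and last year. All names and numbers are hypothetical and unrelated to the `Alcohol_Sales` data.

```python
import numpy as np

# Hypothetical series: 10 years of monthly data with a linear trend.
t = np.arange(120)
trend = 100.0 + 2.0 * t
season = np.sin(2 * np.pi * t / 12)

additive = trend + 20.0 * season               # swings of fixed size
multiplicative = trend * (1 + 0.2 * season)    # swings scale with the trend

def detrended_range(series, start):
    """Range of (series - trend) over the 12 months starting at `start`."""
    window = slice(start, start + 12)
    resid = series[window] - trend[window]
    return resid.max() - resid.min()

add_first = detrended_range(additive, 0)
add_last = detrended_range(additive, 108)
mul_first = detrended_range(multiplicative, 0)
mul_last = detrended_range(multiplicative, 108)
print(add_first, add_last, mul_first, mul_last)
```

The additive series keeps the same swing in both years, while the multiplicative series swings more widely as the trend grows, which is the pattern visible in the sales plot.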

We will first replicate the analysis we performed on the `Miles_Traveled` data. Before fitting the regression, we create two new columns, one capturing the year and one capturing the name of the month.

In [None]:
alcohol_sales['year'] = alcohol_sales['date'].dt.year
alcohol_sales['month_name'] = alcohol_sales['date'].dt.month_name()
alcohol_sales.head()

The following code block fits the regression model for additive seasonality and summarizes the model's fit.

In [None]:
formula = 'sales ~ year + month_name'

reg_additive = smf.ols(formula, data=alcohol_sales).fit()
reg_additive.summary()

The following code block generates predictions for the additive model, which are stored in a new column named `ols_prediction_additive`.

In [None]:
alcohol_sales['ols_prediction_additive'] = reg_additive.predict(alcohol_sales)

The following code block plots the predictions along with the original data.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='sales',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='ols_prediction_additive',
    ax=ax,
    label='Prediction (Additive)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

The following code block specifies a regression model with an *interaction term* between the values in the `year` and `month_name` columns.

In [None]:
formula = 'sales ~ year*month_name'

reg_interaction = smf.ols(formula, data=alcohol_sales).fit()
reg_interaction.summary()
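
In the formula syntax, `year*month_name` is shorthand for `year + month_name + year:month_name`, so each month receives its own slope with respect to `year` in addition to its own offset. The sketch below shows why that matters, using NumPy least squares on a made-up series with only two seasons per year; the variable names and the 50% seasonal factor are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: two seasons per year, with the 'high' season
# always 50% above the 'low' season (multiplicative seasonality).
year = np.repeat(np.arange(10.0), 2)    # 0, 0, 1, 1, ...
is_high = np.tile([0.0, 1.0], 10)       # low, high, low, high, ...
sales = (100.0 + 10.0 * year) * np.where(is_high == 1.0, 1.5, 1.0)

ones = np.ones_like(year)

# Additive design, analogous to 'sales ~ year + season'.
X_add = np.column_stack([ones, year, is_high])
# Interaction design, analogous to 'sales ~ year * season':
# the extra column lets the slope on year differ by season.
X_int = np.column_stack([ones, year, is_high, year * is_high])

def sse(X, y):
    """Sum of squared residuals of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

sse_additive = sse(X_add, sales)
sse_interaction = sse(X_int, sales)
print(sse_additive, sse_interaction)
```

In this toy case the interaction design reproduces the series essentially exactly, while the additive design leaves a residual that grows toward both ends of the sample, because a single shared slope cannot match seasons that grow at different rates.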

The following code block uses the model with the interaction term to generate predictions.

In [None]:
alcohol_sales['ols_prediction_interaction'] = reg_interaction.predict(alcohol_sales)
alcohol_sales.head()

The following code block plots the original data along with the predictions from both the additive and interaction models.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='sales',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='ols_prediction_additive',
    ax=ax,
    label='Prediction (Additive)'
)

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='ols_prediction_interaction',
    ax=ax,
    label='Prediction (Interaction)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

It is difficult to see the differences between the predictions in the previous graph. The following graph zooms in on dates from 2016 onward to better show the differences.

In [None]:
cutoff_date = '1/1/2016'
plot_df = alcohol_sales[alcohol_sales['date'] >= cutoff_date]

fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=plot_df,
    x='date',
    y='sales',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=plot_df,
    x='date',
    y='ols_prediction_additive',
    ax=ax,
    label='Prediction (Additive)'
)

sns.lineplot(
    data=plot_df,
    x='date',
    y='ols_prediction_interaction',
    ax=ax,
    label='Prediction (Interaction)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

Our `alcohol_sales` data ends 1/1/2019. How can we use the prediction model to generate forecasts for future periods? Essentially, we need to create a dataset for prediction that mimics the columns used to fit the regression model. The following code block uses the `pd.date_range` function to generate a sequence of dates starting 2/1/2019 and ending 12/1/2030. See https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases for a list of aliases for the `freq` parameter.

In [None]:
date_range = pd.date_range(
    start='2/1/2019',
    end='12/1/2030',
    freq='MS',
)
date_range

The following code block uses the generated date range to construct a `DataFrame` that mimics the format we used to fit the model.

In [None]:
future_df = pd.DataFrame(
    date_range,
    columns=['date'],
)
future_df['year'] = future_df['date'].dt.year
future_df['month_name'] = future_df['date'].dt.month_name()
future_df.head()
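
The two steps above can be wrapped into a small reusable helper. The sketch below is one possible way to do it, assuming the last observed date falls on the first day of a month; the function name `make_future_frame` and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

def make_future_frame(last_date, periods):
    """Build date/year/month_name columns for the months after last_date.

    Assumes last_date is the first day of a month.
    """
    start = pd.Timestamp(last_date) + pd.DateOffset(months=1)
    dates = pd.date_range(start=start, periods=periods, freq='MS')
    frame = pd.DataFrame({'date': dates})
    frame['year'] = frame['date'].dt.year
    frame['month_name'] = frame['date'].dt.month_name()
    return frame

# 143 monthly periods after January 2019 run through December 2030.
future = make_future_frame('2019-01-01', periods=143)
print(future.iloc[[0, -1]])
```

Specifying `periods` instead of `end` can be convenient when the forecast horizon is stated as a number of months.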

The following code block generates the predictions.

In [None]:
future_df['ols_prediction_interaction'] = reg_interaction.predict(future_df)

The following code block plots the original data along with the future predictions.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

sns.lineplot(
    data=alcohol_sales,
    x='date',
    y='sales',
    ax=ax,
    label='Data',
)

sns.lineplot(
    data=future_df,
    x='date',
    y='ols_prediction_interaction',
    ax=ax,
    label='Prediction (Interaction)'
)

ax.spines[['right', 'top']].set_visible(False)

plt.show()

## Homework 6

Repeat the regression analysis for the `BeerWineLiqour.csv` data using a model with interactions to capture multiplicative seasonality:
1. Generate a plot that shows the fitted model in comparison to the original data,
2. Describe how well the model explains the data, and 
3. Generate predictions for the period of 1/1/1997 - 1/1/2010.