# Google Stock Price Prediction

[ML Cookbook](https://www.ml-book.com) | [SLACK Channel](https://join.slack.com/t/mlckbk/shared_invite/zt-9qsjm911-6nSHAcCSjKfuHi972iEfEg)


## About
In this project you will build a **time series forecasting model that predicts the price of Google stock**.

## Structure
The project is split into 7 sections, each with step-by-step instructions:

- Import the Libraries
- Import the Datasets
- Data Preprocessing
- Data Overview
- Model Building
- Model Evaluation & Hyperparameter Tuning
- Conclusion

## Data
Three datasets are provided for this project. Together they contain the closing price of Google stock over a 15-year period, from 2006 to 2020:

- *GOOGLE_stocks_2006_2010.csv*
- *GOOGLE_stocks_2011_2015.csv*
- *GOOGLE_stocks_2016_2020.csv*

# 1. Import the Libraries
Import the libraries you need. Come back to this cell and add further imports as you progress through the project.

In [None]:
import pandas as pd



# 2. Import the Datasets

Do the following:

- **Step 1:** Import three dataframes as df1, df2 and df3 **(we did that for you)**
- **Step 2:** See what the dataframes look like
- **Step 3:** For each dataframe print its shape

---

## Step 1
Import three dataframes as df1, df2 and df3 **(we did that for you)**

In [None]:
df1 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project4/GOOGLE_stocks_2006_2010.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project4/GOOGLE_stocks_2011_2015.csv')
df3 = pd.read_csv('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project4/GOOGLE_stocks_2016_2020.csv')


## Step 2
See what the dataframes look like

In [None]:
#example
df1.head(5)

## Step 3
For each dataframe print its shape

# 3. Data Preprocessing

**Step 1:** Combine three datasets into one

**Step 2:** Create a class called "Prep". Inside that class, write methods that:

- Print, for a given year and month, the:
    - minimum value
    - mean value
    - median value
    - maximum value
- Print the data types
- Print the number of null values in each column

**Step 3:** Explore the data:
- Check the dtypes using the Prep class, and change any data types that do not look right to you
- Look at the records from 2011 to 2015. Are they in the right order? If not, add a method to the Prep class that fixes it
- Is any data missing from this dataset?

**Step 4:** Impute the missing records. There are several ways to do this in a time series context. Feel free to pick any of the methods suggested below, or to use another one. Add a method to the Prep class that imputes the missing values in the dataset.
- Imputation by Linear Interpolation
- Imputation by Last Observation Carried Forward
- Imputation by Next Observation Carried Backward
- Imputation by Simple Moving Average

---

## Step 1
Combine three datasets into one
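One possible way to combine the frames is `pd.concat`. The sketch below uses three tiny made-up frames in place of `df1`, `df2` and `df3`; the column names `Date` and `Close` and all values are assumptions for illustration and may not match the real files.

```python
import pandas as pd

# Made-up stand-ins for df1, df2 and df3 (hypothetical columns and values).
part1 = pd.DataFrame({"Date": ["2010-12-30", "2010-12-31"], "Close": [298.0, 297.0]})
part2 = pd.DataFrame({"Date": ["2011-01-04", "2011-01-03"], "Close": [301.0, 302.0]})
part3 = pd.DataFrame({"Date": ["2016-01-04"], "Close": [741.0]})

# Stack the frames vertically and rebuild a clean 0..n-1 index.
combined = pd.concat([part1, part2, part3], ignore_index=True)
print(combined.shape)
```

Without `ignore_index=True` each frame would keep its own row labels, so the combined index would contain repeated values.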

## Step 2
Create a class called "Prep". Inside that class, write methods that:

- Print, for a given year and month, the:
    - minimum value
    - mean value
    - median value
    - maximum value
- Print the data types
- Print the number of null values in each column
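A minimal sketch of how such a class might be laid out is shown here. The method names, the constructor arguments and the `Date` / `Close` column names are all hypothetical, and the toy frame contains made-up values.

```python
import pandas as pd


class Prep:
    """Small helper around a price DataFrame (column names are assumptions)."""

    def __init__(self, df, date_col="Date", value_col="Close"):
        self.df = df
        self.date_col = date_col
        self.value_col = value_col

    def monthly_stats(self, year, month):
        # Select the rows that fall in the requested year and month.
        dates = pd.to_datetime(self.df[self.date_col])
        mask = (dates.dt.year == year) & (dates.dt.month == month)
        values = self.df.loc[mask, self.value_col]
        stats = values.agg(["min", "mean", "median", "max"])
        print(stats)
        return stats

    def print_dtypes(self):
        print(self.df.dtypes)

    def print_nulls(self):
        print(self.df.isnull().sum())


# Toy data with one missing price (made-up values).
toy = pd.DataFrame(
    {"Date": ["2011-01-03", "2011-01-04", "2011-02-01"], "Close": [10.0, 14.0, None]}
)
prep = Prep(toy)
jan = prep.monthly_stats(2011, 1)
prep.print_dtypes()
prep.print_nulls()
```

Returning the statistics as well as printing them makes the method easier to check and reuse later.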


## Step 3
Explore the data:
- Check the dtypes using the Prep class, and change any data types that do not look right to you
- Look at the records from 2011 to 2015. Are they in the right order? If not, add a method to the Prep class that fixes it
- Is any data missing from this dataset?
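As a rough illustration of these checks, the sketch below converts text columns to proper types and sorts the rows chronologically. The `Date` and `Close` column names and the values are assumptions; the real data may need different handling.

```python
import pandas as pd

# Toy frame with text columns and out-of-order dates (made-up values).
toy = pd.DataFrame(
    {"Date": ["2011-01-05", "2011-01-03", "2011-01-04"], "Close": ["3.5", "1.5", "2.5"]}
)

# Convert the text columns to datetime and numeric dtypes.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Close"] = pd.to_numeric(toy["Close"])

# Sort chronologically and rebuild the index.
ordered = toy.sort_values("Date").reset_index(drop=True)

print(ordered.dtypes)
print("chronological:", ordered["Date"].is_monotonic_increasing)
print(ordered.isnull().sum())
```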

## Step 4
Impute the missing records. There are several ways to do this in a time series context. Feel free to pick any of the methods suggested below, or to use another one. Add a method to the Prep class that imputes the missing values in the dataset.
- Imputation by Linear Interpolation
- Imputation by Last Observation Carried Forward
- Imputation by Next Observation Carried Backward
- Imputation by Simple Moving Average
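The four methods listed above can be sketched with plain pandas calls on a toy series. The moving-average variant shown is only one possible definition (mean of the observed values among the three preceding points); the window size is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy series with gaps (made-up values).
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Linear interpolation between the neighbouring observed values.
linear = s.interpolate(method="linear")

# Last observation carried forward.
locf = s.ffill()

# Next observation carried backward.
nocb = s.bfill()

# Simple moving average of the observed values in the 3 preceding positions.
previous_mean = s.rolling(window=3, min_periods=1).mean().shift(1)
sma = s.fillna(previous_mean)

print(pd.DataFrame({"raw": s, "linear": linear, "locf": locf, "nocb": nocb, "sma": sma}))
```

Forward fill leaves leading gaps untouched and backward fill leaves trailing gaps untouched, so it is worth re-checking the null counts after imputing.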

# 4. Data Overview

Observe the data:

**Step 1:** Plot the graph of a stock price vs time.

**Step 2:** Decompose time series data into [Trend, Cycle and Seasonality](https://en.wikipedia.org/wiki/Decomposition_of_time_series) and plot those graphs. Can you gain any insights or identify any patterns from there?

---

## Step 1
Plot the graph of a stock price vs time.

## Step 2
Decompose time series data into [Trend, Cycle and Seasonality](https://en.wikipedia.org/wiki/Decomposition_of_time_series) and plot those graphs. Can you gain any insights or identify any patterns from there?

# 5. Model Building

Do the following:

**Step 1:** Split the data into train and test sets. Keep in mind that the traditional splitting strategies you might have used in classic regression / classification problems are not applicable to time series forecasting (try to think why). We therefore recommend the strategy outlined [here (clickable)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).

**Step 2:** We will use the [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html) model, as it is a simple yet powerful model that can give good predictive power if treated correctly. This model has [3 main parameters](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) (*p*, *d* and *q*) and works well when the time series is stationary. You can test for stationarity with the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary (has a unit root), so a small p-value is evidence of stationarity. Calculate the p-value of the ADF test on the original data.

**Step 3:** If the time series is non-stationary, experiment with the following transformations to make it more stationary:
- [Box-Cox Transformation](https://stats.stackexchange.com/questions/253917/why-use-differencing-and-box-cox-in-time-series#:~:text=1%20Answer&text=The%20Box-Cox%20transformation%20is,have%20a%20non-constant%20variance.)
- [Differencing](https://people.duke.edu/~rnau/411diff.htm)

The value of parameter *d* is the minimum number of differencing steps needed to make the series stationary.

**Step 4:** Plot the [ACF (AutoCorrelation Function) graph](https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html) and select a starting value for the *q* parameter. A reasonable starting point is the first lag whose ACF value falls within the significance band, minus 1.

**Step 5:** Plot the [PACF (Partial AutoCorrelation Function) graph](https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_pacf.html) and select a starting value for the *p* parameter. A reasonable starting point is the first lag whose PACF value falls within the significance band, minus 1.

**Step 6:** Fit the [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html) model using the parameters found above.

---

## Step 1
Split the data into train and test sets. Keep in mind that the traditional splitting strategies you might have used in classic regression / classification problems are not applicable to time series forecasting (try to think why). We therefore recommend the strategy outlined [here (clickable)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).
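To see what the recommended strategy does, the sketch below applies `TimeSeriesSplit` to a short stand-in array. The array and the choice of `n_splits=4` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for the ordered price series (20 consecutive observations).
prices = np.arange(20, dtype=float)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(prices))

for train_idx, test_idx in folds:
    # Every training window ends before its test window starts.
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains only on observations that precede the test window, so the model is never evaluated on data older than what it has already seen.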

## Step 2
We will use the [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html) model, as it is a simple yet powerful model that can give good predictive power if treated correctly. This model has [3 main parameters](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) (*p*, *d* and *q*) and works well when the time series is stationary. You can test for stationarity with the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary (has a unit root), so a small p-value is evidence of stationarity. Calculate the p-value of the ADF test on the original data.

## Step 3
If the time series is non-stationary, experiment with the following transformations to make it more stationary:
- [Box-Cox Transformation](https://stats.stackexchange.com/questions/253917/why-use-differencing-and-box-cox-in-time-series#:~:text=1%20Answer&text=The%20Box-Cox%20transformation%20is,have%20a%20non-constant%20variance.)
- [Differencing](https://people.duke.edu/~rnau/411diff.htm)

The value of parameter *d* is the minimum number of differencing steps needed to make the series stationary.

## Step 4
Plot the [ACF (AutoCorrelation Function) graph](https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html) and select a starting value for the *q* parameter. A reasonable starting point is the first lag whose ACF value falls within the significance band, minus 1.

## Step 5
Plot the [PACF (Partial AutoCorrelation Function) graph](https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_pacf.html) and select a starting value for the *p* parameter. A reasonable starting point is the first lag whose PACF value falls within the significance band, minus 1.

## Step 6
Fit the [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html) model using the parameters found above.

# 6. Model Evaluation & Hyperparameter Tuning

**Step 1:** Call the `summary()` method on the fitted ARIMA model. 

**Step 2:** Take a look at the p-values of the model coefficients shown in the `summary()` output. If any p-value is larger than 0.05, try to remove the corresponding term from the model by adjusting the parameters p and q (if possible).

**Step 3:** If you changed the values of p and q in Step 2, re-train the model with the new parameters, call the `summary()` method again and see how the [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) and [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion) values changed. We would typically expect a better model to have lower values of both criteria.

**Step 4:** Plot the model's residuals. If the model is good, the residuals should have zero mean and constant variance.

**Step 5:** Plot an Actuals vs. Fitted graph to see how well the ARIMA model fits the actual data. You may find [this function](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.plot_predict.html#statsmodels.tsa.arima_model.ARIMAResults.plot_predict) helpful.

---

## Step 1
Call the `summary()` method on the fitted ARIMA model. 

## Step 2
Take a look at the p-values of the model coefficients shown in the `summary()` output. If any p-value is larger than 0.05, try to remove the corresponding term from the model by adjusting the parameters p and q (if possible).

## Step 3
If you changed the values of p and q in Step 2, re-train the model with the new parameters, call the `summary()` method again and see how the [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) and [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion) values changed. We would typically expect a better model to have lower values of both criteria.

## Step 4
Plot the model's residuals. If the model is good, the residuals should have zero mean and constant variance.

## Step 5
Plot an Actuals vs. Fitted graph to see how well the ARIMA model fits the actual data. You may find [this function](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.plot_predict.html#statsmodels.tsa.arima_model.ARIMAResults.plot_predict) helpful.

# 7. Conclusion

Summarize your **findings**. Did you manage to make the given time series (Google stock prices) stationary? Which **data preprocessing** strategies did you use to get the best model? Which model performed best?

Feel free to share/discuss your findings in our [Slack Channel](https://join.slack.com/t/mlcookbook/shared_invite/zt-eyz4czw4-l95j_2iuETCbVRPpgA3kWA)!

In [None]:
# Answer:

'''

I used X model and achieved Y accuracy...
I believe the model is reliable as I performed X feature selection technique...

'''

# 8. Extra Food for Thought

You might want to spend a few minutes exploring the following topics for future use:

- SARIMA + [Seasonal Differencing](https://people.duke.edu/~rnau/411sdif.htm) (in case there is a seasonality in data)
- SARIMAX
- Prophet