# Project: Financial time series forecasting

This project can be done individually or in pairs, but you are not allowed to share your solution with anyone else. Read the instructions below carefully!

- The aim of the project is for you to learn how to set up an analytics project end-to-end. A secondary aim is that you understand how to work with a time series dataset and forecast based on such data. A third aim is that you gain insight into how to interpret data and results.

- **The solution must address each grade in the written order.** Therefore, to complete grade two you must have completed grade one without errors, and so on.

- Your code will be tested on [https://notebooks.csc.fi](https://notebooks.csc.fi). If it doesn't run there it will not be accepted.

- Clean up and format your code nicely before submission. Spaghetti-style draft code with repetitions will lower your grade.

- You are advised to reserve about 60 hours of self-studies for this project.

- Information may unintentionally be missing from the description. Please go through it well in advance so that you have time to ask for assistance.


<a id='toc'></a>
# TOC

[Grade 1](#g1)

[Grade 2](#g2)

[Grade 3](#g3)

[Grade 4](#g4)

[Grade 5](#g5)


In [1]:
# First upgrade the environment.
# https://pypi.org/project/yfinance
import pip
from subprocess import run
# add what you will need
modules =[
#     'pandas_datareader',
#     'yfinance',
    'pandas_market_calendars',
    'plotly', 
    'numpy',
    'scikit-learn',  # PyPI package name; the 'sklearn' alias is deprecated
    'pandas'
]
proc = run(f'pip install {" ".join(modules)} --upgrade --no-input', 
       shell=True, 
       text=True, 
       capture_output=True, 
       timeout=40)
print(proc.stderr)




In [2]:
## Run this if you need to check your modules
# import pip
# from pip._internal.utils.misc import get_installed_distributions
# pkgs = ''.join(str(get_installed_distributions(local_only=True)))

# with open("modules.txt", "a") as file_object:
#     for p in (get_installed_distributions(local_only=True)):
#         file_object.write(str(p)+'\n')
#         print(str(p))
# file_object.close()

In [3]:
import pandas as pd
from pathlib import Path
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, FuncFormatter, StrMethodFormatter
%matplotlib inline

import plotly as ply
import plotly.graph_objects as go

import sklearn
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

from functools import reduce
from operator import mul
from pprint import PrettyPrinter
pprint = PrettyPrinter().pprint

<a id='g1'></a>
# Grade 1
## Implement a complete process for forecasting a single stock.

You should do the following steps:
- Use the [EURUSD data set](https://people.arcada.fi/~parland/hjd5_8amp_Gt3/EURUSD1m.zip) (52Mb)

- Subsample the data to one-day timesteps; be sure to also include data from weekends.

- Create a Label column for your forecast, by shifting the Close value 1 step. You will predict one day ahead.

- Split data into 80/20 (train/test). Be careful: you are splitting a time series.


- [Normalize or standardize](https://scikit-learn.org/stable/modules/preprocessing.html) wisely so you don't allow information leakage to the test subset. Note that the utility class [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) scales **individual samples to have unit norm**, so it is not useful for certain tasks. Write your own function or check [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) or [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Please check in which order you perform the split and the scaling.

    - Data → X, y → split → scale
    - Data → split → scale → X_test, y_test ; X_train, y_train
    
  If you take the second path, you should figure out how to do the inverse transform for your predictions.
  
  
  

- Calculate feature [Larry William’s %R](https://www.investopedia.com/terms/w/williamsr.asp) from the paper [Predicting the Direction of Stock Market Index Movement Using an Optimized Artificial Neural Network Model](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4873195) (implement in code, insert values in a complementary column). 

**Note that you need to implement your own calculation of each feature and be able to explain the code.**

- Drop all data other than Close and the features before inference. You don't want to feed the time column into the model; it is not a feature to base your prediction on.

- Fit a [linear model](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) to the training data

- Forecast one day ahead based on the test data

- Calculate the [R² error](https://en.wikipedia.org/wiki/Coefficient_of_determination) on both the training data set and the test data set. Please format numbers to four [significant digits](https://en.wikipedia.org/wiki/Significant_figures). As a check, note that the values are supposed to be in the range 0.9755 … 0.9922 for test and 0.9955 … 0.9988 for train. If you get an R² error outside these ranges, it **indicates** that you very likely have an error in your logic.

**NB! You may also get an expected result by performing several cumulative mistakes.**  

$$ (2+1)\times 3 = ?$$

$$ Attempt $$
$$ 2+1=4 $$
$$ 4\times 3 = 9 $$
$$ Answer: 9 $$

**The objective of this assignment is not to achieve a correct result but to achieve the result correctly.**

- Compare the R² errors for test and train and explain the outcome. 

- Extra: Test your model (get R² errors for test and train without LW%R, just Close column) on [this dataset](https://people.arcada.fi/~parland/hjd5_8amp_Gt3/strangeClose.csv). Comment and explain the result. A reasonable explanation will compensate for one error in the following grades.
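The data preparation steps listed above (daily subsampling, label creation, chronological split, scaling) can be sketched roughly as follows. This is only an illustration under stated assumptions: the minute bars are synthetic stand-ins for the EURUSD file, the lowercase column names and the helper `unscale_label` are hypothetical, and standardization is just one of the scaling options mentioned. How to treat days without any ticks is deliberately left open.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute bars standing in for the EURUSD file (hypothetical data).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=60 * 24 * 30, freq="min")
close = 1.10 + rng.normal(0, 1e-4, len(idx)).cumsum()
minute = pd.DataFrame({"open": close, "high": close + 1e-4,
                       "low": close - 1e-4, "close": close}, index=idx)

# Aggregate to calendar days: first open, max high, min low, last close.
daily = minute.resample("1D").agg({"open": "first", "high": "max",
                                   "low": "min", "close": "last"})

# Label = next day's close; the last row has no label and is dropped.
daily["label"] = daily["close"].shift(-1)
daily = daily.dropna()

# Chronological 80/20 split: no shuffling, the test part follows the train part.
cut = int(len(daily) * 0.8)
train, test = daily.iloc[:cut], daily.iloc[cut:]

# Standardize with statistics taken from the training part only.
mu, sigma = train.mean(), train.std()
train_s = (train - mu) / sigma
test_s = (test - mu) / sigma

def unscale_label(y_scaled):
    """Map scaled label values (or predictions) back to the price range."""
    return y_scaled * sigma["label"] + mu["label"]
```

Because `mu` and `sigma` never see the test rows, the test subset cannot leak into the scaling, and `unscale_label` shows one way to do the inverse transform for predictions.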


### **Larry William’s %R**

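$ (H_n - C_t)/(H_n - L_n)\times100 $

where $H_n$ and $L_n$ are the highest high and the lowest low over the last $n$ periods, and $C_t$ is the closing price at time $t$.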

### Coefficient of determination ($R^2$)

$$R^2 = 1 - \frac {SSResid}{SSTot}$$

#### Residual Sum of Squares: $SSResid = \sum_{i} (y_i - \hat{y_i})^2$

#### Total Sum of Squares: $SSTot = \sum_{i} (y_i - \bar{y})^2$

#### A baseline model, which always predicts $\bar {y}$, will have $R^2 = 0$
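The definitions above translate almost directly into NumPy. The following is a minimal sketch; the function name `r_squared` and the toy values are made up for illustration. It also shows the `.4g` format specifier, which is one way to print four significant digits (note that it drops trailing zeros).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSResid / SSTot, following the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_resid = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_resid / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.full_like(y, y.mean())    # always predicts the mean

print(f"{r_squared(y, y):.4g}")         # perfect forecast
print(f"{r_squared(y, baseline):.4g}")  # baseline model
print(f"{r_squared(y, y[::-1]):.4g}")   # worse than the baseline: negative
```

A forecast that is worse than the baseline gives a negative value, so $R^2$ is not bounded below by zero.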

In [None]:
# here is an example of Disparity in 5 days = Ct/MA5 × 100
def Disparity_5(df):
    return 100 * df['close'] / df['close'].rolling(window = 5).mean()

df['Disparity_5'] = Disparity_5(df)

<a id='g2'></a>
# Grade 2
## Illustrate data using plotly (or other) library

- Calculate additional feature [Stochastic slow %D](https://www.investopedia.com/ask/answers/05/062405.asp)
- Create a figure based on OHLC candles **covering the test period** (the 20% of data)
- Add a line chart(s) that illustrates the *label* (actual data) and the *forecast* to the same figure over OHLC. The lines should have different colors and include names of the series.
- Add subplot(s) with the LW%R and Stochastic slow %D features to the figure with OHLC, label and forecast so we can see everything time-aligned. Don't forget to scale the predicted values back to the original range.
- What patterns can you observe from the line figure?

### Stochastic %K	
<br>
<span style='font-size:20px'>
$\frac{(C_t − L_n)}{(H_n − L_n)}\times100$
</span>
    
### Stochastic %D
<br>
<span style='font-size:20px'>
$\sum\nolimits_{i=0}^{n-1}\frac{\%K_{t-i}}n$
</span>
    
### Stochastic slow %D
<br>
<span style='font-size:25px'>
$\frac{\sum_{i=0}^{n-1}\%D_{t-i}}n$
</span>
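
As with the Disparity example in Grade 1, the three formulas above can be chained with rolling windows. The sketch below is only one possible reading of them: the function name `stochastic`, the lowercase column names and the default `n=14` are assumptions, and the same `n` is used for all three steps as in the formulas.

```python
import pandas as pd

def stochastic(df, n=14):
    """Return %K, %D and slow %D as columns of a new DataFrame."""
    low_n = df["low"].rolling(window=n).min()     # L_n: lowest low in the window
    high_n = df["high"].rolling(window=n).max()   # H_n: highest high in the window
    k = 100 * (df["close"] - low_n) / (high_n - low_n)
    d = k.rolling(window=n).mean()                # %D: mean of the last n %K values
    slow_d = d.rolling(window=n).mean()           # slow %D: mean of the last n %D values
    return pd.DataFrame({"K": k, "D": d, "slow_D": slow_d})

demo = pd.DataFrame({"high": [2, 3, 4, 5, 6],
                     "low": [0, 1, 2, 3, 4],
                     "close": [1, 2, 3, 4, 5]}, dtype=float)
out = stochastic(demo, n=2)
print(out)
```

Each rolling step consumes `n - 1` rows, so the first `3 * (n - 1)` values of slow %D are NaN and need to be handled before fitting a model.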

<a id='g3'></a>
# Grade 3

- Calculate additional feature [RSI (relative strength index)](https://www.investopedia.com/terms/r/rsi.asp)
- Add the feature as a subplot to the figure in the previous step
- Set up an [ElasticNet](https://scikit-learn.org/stable/modules/linear_model.html#elastic-net) (**not** an [ElasticNetCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html)) model
- Fit/train the ElasticNet to the training data
- Forecast and calculate the R² error on both the training data set and the test
- Add your new *forecast* line to the previous figure, so the predictions from both models are placed over each other and can be compared
- Compare the errors and explain the outcome

### RSI
<br>
<span style='font-size:25px'>
$100-\frac{100}{\left(1+\frac{\frac{\sum_{i=0}^{n-1}Up_{t-i}}{\text{n}}}{\frac{\sum_{i=0}^{n-1}Dw_{t-i}}{\text{n}}}\right)} $
    </span>

<a id='g4'></a>
# Grade 4

- Calculate additional feature [On Balance Volume](https://www.investopedia.com/terms/o/onbalancevolume.asp)
- Create sliding windows for the input data, e.g. with window lengths of 10, 5, and 2 samples. You will extract the data for a window of length n (rows) and turn it from matrix (2D) form into a vector of size $R\times C$ (i.e. number of rows * number of columns in the window); see [NumPy Array Reshaping](https://www.w3schools.com/python/numpy_array_reshape.asp). You will probably need to create a function that returns a vector (array, tuple, list, Series). Other solutions are also possible. Here is an example of a two-day window for two features attached to the original data:

|     | a | b | a-1 | b-1 | a-2 | b-2 |
|:---:|:-:|:-:|:---:|:---:|:---:|:---:|
| t₁  | 0 | 1 | nan | nan | nan | nan |
| t₂  | 2 | 3 | 0   | 1   | nan | nan |
| t₃  | 4 | 5 | 2   | 3   | 0   | 1   |
| t₄  | 6 | 7 | 4   | 5   | 2   | 3   |
| t₅  | 8 | 9 | 6   | 7   | 4   | 5   |
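
One way to produce a table like the one above is to concatenate shifted copies of the features. This is a sketch, not the required approach; the helper name `add_lags` and the column naming scheme are made up for illustration.

```python
import pandas as pd

def add_lags(df, n_lags):
    """Append columns '<name>-k' holding each feature shifted k rows back."""
    parts = [df]
    for k in range(1, n_lags + 1):
        lagged = df.shift(k)
        lagged.columns = [f"{c}-{k}" for c in df.columns]
        parts.append(lagged)
    return pd.concat(parts, axis=1)

demo = pd.DataFrame({"a": [0, 2, 4, 6, 8], "b": [1, 3, 5, 7, 9]},
                    index=["t1", "t2", "t3", "t4", "t5"])
windowed = add_lags(demo, 2)
print(windowed)
```

The first `n_lags` rows contain NaN and have to be dropped (for example with `windowed.dropna()`) before the data is passed to a model.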

- Set up a [Polynomial regression](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions)
- Fit and run all three different models (on all features) for all three different window lengths
- Summarize and compare their R² error measures. Is any of them better than the [LinearRegression model](https://en.wikipedia.org/wiki/Linear_regression) without window information attached?

| M\W  | no |  2 |  5 | 10 |
|:----:|:--:|:--:|:--:|:--:|
|  M₁  | R² | R² | R² | R² |
|  M₂  | R² | R² | R² | R² |
|  M₃  | R² | R² | R² | R² |


### **On Balance Volume**

$ OBV_t = OBV_{t-1} + \theta \times V_t $

where $V_t$ is the volume of trade at time $t$, and 

(*classic definition*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t > C_{t-1} \\
  0, & \textit{if} \ C_t = C_{t-1} \\
  -1, & \textit{otherwise}
\end{cases}
$  <br /><br /> or  (*definition from the paper*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t \geq C_{t-1} \\
  -1, & \textit{otherwise}
\end{cases}
$
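
The recursion can be written without an explicit loop, because $OBV_t$ is a cumulative sum of $\theta \times V_t$. The sketch below uses the classic definition; the function name `on_balance_volume` is hypothetical, and starting with $\theta = 0$ on the first row (where no previous close exists) is an assumption.

```python
import numpy as np
import pandas as pd

def on_balance_volume(close, volume):
    """Classic OBV: running sum of volume signed by the close-to-close change."""
    theta = np.sign(close.diff()).fillna(0)  # +1 up, 0 unchanged, -1 down; first row 0
    return (theta * volume).cumsum()

close = pd.Series([10.0, 11.0, 11.0, 9.0, 12.0])
volume = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])
obv = on_balance_volume(close, volume)
print(obv.tolist())
```

For the definition from the paper, the zero case would be mapped to +1 instead.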


<a id='g5'></a>
# Grade 5
Implement an investment decision to either buy or sell based on some signals which you choose to detect. 
See the paper for how this can be done. The easiest solution is to hand-craft the decision to either buy or sell.

- Compare the regression forecast with the known Close price.
- Once the forecast goes above the Close price, you can define a buy opportunity.
- You can decide to hold if the forecasted difference is small, or based on other signals.
- Calculate the hit ratio of your investment decision for each of the windows. The HR is the number of times a correct prediction has been made in relation to the total number of predictions. For example, if 10 predictions were correct out of 50, the HR would be 1:5 or 0.2
- Forecast one week ahead and compare the hit ratio with that of the one-day-ahead forecast
- Which setup was the best, and why was that?
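
The hit ratio described above can be sketched as follows. This assumes one particular hand-crafted rule, which is only an example: buy when the forecast is above today's Close, sell when it is below, hold when they are equal, and count a hit when the next Close actually moved in the signalled direction. The function name `hit_ratio` and the toy numbers are hypothetical.

```python
import numpy as np

def hit_ratio(close, forecast, actual_next):
    """Share of buy/sell signals whose direction matched the realized move."""
    close = np.asarray(close, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    actual_next = np.asarray(actual_next, dtype=float)
    signal = np.sign(forecast - close)      # +1 buy, -1 sell, 0 hold
    outcome = np.sign(actual_next - close)  # realized direction
    traded = signal != 0                    # holds are not counted as predictions
    if not traded.any():
        return float("nan")
    return float(np.mean(signal[traded] == outcome[traded]))

hr = hit_ratio(close=[1.0, 1.0, 1.0, 1.0],
               forecast=[1.1, 0.9, 1.2, 1.0],
               actual_next=[1.2, 1.1, 0.9, 1.3])
print(f"{hr:.4g}")
```

Here one of the three buy/sell signals is correct and the hold is ignored, so the hit ratio is 1:3.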


# Happy coding! 👩‍💻

