# Project: Financial time series forecasting

The following project is could be done individually or in pairs but you not allowed to share your solution with anyone else. Read below carefully!

- The aim of the project is that you learn how to set up an analytics project end-to-end. A secondary aim is that you understand how to work with a time series data set and forecast based on such data. Third aim is that you gain an insight into how to interpret data and results.

- The solution must address each grade in the written order. Therefore, to complete grade five you must have completed the other grades first.

- Unintentionally, there may be information missing in the description, please go through the description early in advance so that you have time to ask for assistance.


<a id='toc'></a>
# TOC

[Grade 1](#g1)

[Grade 2](#g2)

[Grade 3](#g3)

[Grade 4](#g4)

[Grade 5](#g5)


In [2]:
# First upgrade the environment.
# https://pypi.org/project/yfinance
import pip
from subprocess import run
# add what you will need
modules =[
#     'pandas_datareader',
#     'yfinance',
    'pandas_market_calendars',
    'plotly', 
    'numpy',
    'sklearn',
    'pickle5',
    'pandas'
]
proc = run(f'pip install {" ".join(modules)} --upgrade --no-input', 
       shell=True, 
       text=True, 
       capture_output=True, 
       timeout=40)
print(proc.stderr)




In [3]:
import pickle
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
import os.path
from datetime import datetime, timedelta
from os import path
import math

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, FuncFormatter, StrMethodFormatter
%matplotlib inline

import plotly as ply
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import sklearn
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

from functools import reduce
from operator import mul
from pprint import PrettyPrinter
pprint = PrettyPrinter().pprint

<a id='g1'></a>
# Grade 1
## Implement a complete process for forecasting a single stock.

You should do the following steps:
- Use the [EURUSD data set](https://people.arcada.fi/~parland/hjd5_8amp_Gt3/EURUSD1m.zip) (52Mb)

- Subsampe data to one day timesteps, be shure to get data also from weekends.

- Create a Label column for your forecast, by shifting the Close value 1 step. You will predict one day ahead.

- Split data into 80/20 (train/test). Be carefull: you are splitting a time serie.


- [Normalize or standardize](https://scikit-learn.org/stable/modules/preprocessing.html) wisely so you don't allow information leakage to the test subset. Note, that utility class [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) performs scaling **individual samples to have unit norm**, so it is not usefull for sertain tasks. Write your own function or check [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) or [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Please check in which order you perform split and scale.

    - Data → X, y → split → scale
    - Data → split → scale → X_test, y_test ; X_train, y_train
    
  If you take the second path, you should figure out how to do the inverse transform for your predictions.
  
  
  

- Calculate feature [Larry William’s %R](https://www.investopedia.com/terms/w/williamsr.asp) from the paper [Predicting the Direction of Stock Market Index Movement Using an Optimized Artificial Neural Network Model](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4873195) (implement in code, insert values in a complementary column). 

**Note that you need to implement your own calculation of each feature and be able to explain the code.**

- Drop other data than the Close and the features for inference. You don't want to feed time-column into the model, it's not a feature to base your prediction on.

- Fit a [linear model](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) to the training data

- Forecast one day ahead based on the test data

- Calculate the [R² error](https://en.wikipedia.org/wiki/Coefficient_of_determination) on both the training data set and the test. Please format numbers to four [significant digits](https://en.wikipedia.org/wiki/Significant_figures). As a check you may note that values suppose to be in the range 0.9755 … 0.9922 for test and 0.9955 … 0.9988 for train. If you get R² error outside these ranges it **indicates** that highly likely you have an error in your logic.

**NB! You may also get an expected result by performing several cumulative mistakes.**  

$$ (2+1)\times 3 = ?$$

$$ Attempt $$
$$ 2+1=4 $$
$$ 4\times 3 = 9 $$
$$ Answer: 9 $$

**The objective of this assignment is not to achieve a correct result but to achieve the result correctly.**

- Compare the R² errors for test and train and explain the outcome. 

- Extra: Test your model (get R² errors for test and train without LW%R, just Close column) on [this dataset](https://people.arcada.fi/~parland/hjd5_8amp_Gt3/strangeClose.csv). Comment and explain the result. A reasonable explanation will compensate for one error in the following grades.



### **Larry William’s %R**

$ (H_n − C_t)/(H_n − L_n)\times100 $

### Coefficient of determination ($R^2$)

$$R^2 = 1 - \frac {SSResid}{SSTot}$$

#### Residual Sum of Squares: $SSResid = \sum_{i} (y_i - \hat{y_i})^2$

#### Total Sum of Squares: $SSTot = \sum_{i} (y_i - \bar{y})^2$

#### A baseline model, which always predicts $\bar {y}$, will have $R^2 = 0$

In [4]:
import pickle5 as pickle
pickle.HIGHEST_PROTOCOL = 4

dName = "EurUsd1m.pickle"
dUrl = "https://people.arcada.fi/~parland/hjd5_8amp_Gt3/EURUSD1m.zip"
if path.isfile(dName):
    data = pickle.load(open(dName, "rb"))
    print("File exists in directory (%s)" % dName)
else:
    print("File not found, fetching from %s" % dUrl)
    
     # Notice how we specify parsing the dates together as one DateTime object as we read in the data
    data = pd.read_csv(dUrl, compression="zip", parse_dates=[["Date", "Timestamp"]])
    data.to_pickle(dName)
    print("Done")
data.set_index(data.Date_Timestamp, inplace=True)
data.drop("Date_Timestamp", axis=1, inplace=True)

File not found, fetching from https://people.arcada.fi/~parland/hjd5_8amp_Gt3/EURUSD1m.zip
Done


### Fetch data to perform on

In [5]:
# Let's first subsample our data to daily 
def subsampleData(data, symbol=str):

    subsample_aggDict = {"Open":"first",
                        "High":"max",
                        "Low":"min",
                        "Close":"last",
                        "Volume":"sum"}

    data = data.resample(symbol).agg(subsample_aggDict)
    
    data.dropna(inplace=True)
    return data
df = subsampleData(data, symbol="1D")
df

Unnamed: 0_level_0,Open,High,Low,Close,Volume
Date_Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010-01-01,1.43327,1.43356,1.43207,1.43335,39761.000053
2010-01-03,1.43024,1.43359,1.42951,1.43141,3001.600003
2010-01-04,1.43143,1.44556,1.42559,1.44244,80019.400094
2010-01-05,1.44238,1.44834,1.43445,1.43634,79887.100067
2010-01-06,1.43638,1.44342,1.42807,1.44005,80971.800085
...,...,...,...,...,...
2019-12-26,1.10944,1.11088,1.10821,1.11012,52487.634115
2019-12-27,1.11013,1.11883,1.10987,1.11713,124575.902401
2019-12-29,1.11736,1.11839,1.11718,1.11813,3216.499400
2019-12-30,1.11812,1.12207,1.11806,1.12013,130773.875147


### Calculate Larry William’s %R

In [6]:
# Williams %R
def WilliamsR(data, period=14, close="Close", high="High", low="Low"):
    closePeriod = data[close]
    highPeriod = data[high].rolling(14).max()
    lowPeriod = data[low].rolling(14).min()
    
    williamsR = (highPeriod - closePeriod) / (highPeriod - lowPeriod) * -100
    
    return williamsR

featureDf = pd.DataFrame(df)    
featureDf["%R"] = WilliamsR(df)

### Set the label and drop unnecessary data

In [7]:
def setLabel(dframe, column="Close"):
    dframe["label"] = dframe[column].shift(-1)
    dframe.dropna(inplace=True)
    
    try:
        dframe.drop(columns=["Open", "High", "Low", "Volume"], inplace=True)
    except:
        return dframe
        
    return dframe

featureDf = setLabel(featureDf)

### Split into respective train and test data

### Define functions for data splitting and normalization/scaling for more practical later use

In [8]:
# To inverse_transform our scaled data back to readable
def inverseTransform(data, scaler):  
    inverseScaler = MinMaxScaler()

    # The scaler we used before to normalize our data is fit with the train data
    # So we use that scalers scale_ and min_ values to inverse
    inverseScaler.scale_ = np.append(scaler.scale_, scaler.scale_[0])
    inverseScaler.min_ = np.append(scaler.min_, scaler.min_[0])

    data = pd.DataFrame(inverseScaler.inverse_transform(data), columns=data.columns.values)
    return data

def dataSplitScale(data, column=str, no_window=bool):
    split_ratio = 0.8
    limit = math.floor(len(data) * split_ratio)
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    
    train = data[:limit]
    test = data[limit:]
    
    scaler = MinMaxScaler().fit(train)
    
    train = pd.DataFrame(scaler.transform(train), columns=train.columns.values)
    test = pd.DataFrame(scaler.transform(test), columns=test.columns.values)
    
    if no_window == True:
        X_train = train[:-1]
        y_train = train[column].shift(-1).dropna()

        X_test = test[:-1]
        y_test = test[column].shift(-1).dropna()
        
        #print("1", no_window)
    
    elif no_window == False:
        X_train = train[:-1]
        y_train = train[column].shift(-1).dropna()

        X_test = test[:-1]
        y_test = test[column].shift(-1).dropna()
        
        #print("2", no_window)
        
    else:
        X_train = train.drop(column, axis=1)
        y_train = train[column]

        X_test = test.drop(column, axis=1)
        y_test = test[column]
        
        #print("3", no_window)
    
    return X_train, y_train, X_test, y_test, scaler, train, test

### Setup a Linear Model

In [9]:
def linRegFit(data, column="label", no_window=bool):
    train_x, train_y, test_x, test_y, scaler, train, test = dataSplitScale(data, column, no_window)
    
    test_split = len(test_y)
    train_split = len(train_y)
    
    reg = LinearRegression()
    reg.fit(train_x, train_y)
    
    results = test[-test_split:]
    results["pred"] = reg.predict(test_x)
    score = reg.score(test_x, test_y)
    
    trainResults = train[:train_split]
    trainResults["pred"] = reg.predict(train_x)
    train_score = reg.score(train_x, train_y)
    
    results = inverseTransform(results, scaler)
    trainResults = inverseTransform(trainResults, scaler)
    
    
    return reg, results, trainResults, score, train_score

reg, results, trainResults, lin_score, train_score = linRegFit(featureDf, column="label", no_window=None)

print("Test score: %.4f" % lin_score)
print("Train score: %.4f" % train_score)

Test score: 0.9898
Train score: 0.9964


### Ignore the following below - not in use in practice
### Calculate Residual and Total sum of squares for R-squared value

In [10]:
# Residual sum of squares
def rSquared(data, label="label", pred="pred"):
    mean = data[label].mean()
    SSres = (data[label] - data[pred]) ** 2
    SStot = (data[label] - mean) ** 2

    r2 = 1 - (SSres / SStot)
    return r2

<a id='g2'></a>
# Grade 2
## Illustrate data using plotly (or other) library

- Calculate additional feature [Stochastic slow %D](https://tradingsim.com/blog/slow-stochastics)
- Create a figure based on OHLC candles covering the test period
- Second add a line chart(s) that illustrates the *label* (actual data) and the *forecast* in the same figure over OHLC. The lines should have different colors and include names of series.
- Add subplot(s) with features so we can se them time-aligned
- What patterns can you observe from the line figure?

### Stochastic %K	
<br>
<span style='font-size:20px'>
$\frac{(C_t − L_n)}{(H_n − L_n)}\times100$
</span>
    
### Stochastic %D
<br>
<span style='font-size:20px'>
$\sum\nolimits_{i=0}^{n-1}\frac{\%K_{t-i}}n$
</span>
    
### Stochastic slow %D
<br>
<span style='font-size:25px'>
$\frac{\sum_{i=0}^{n-1}\%D_{t-i}}n$
</span>


### We apply the dates again as indices and the Low, High data

In [11]:
split_ratio = 0.8
limit = math.floor(len(featureDf) * split_ratio)

df1 = df.copy() # for later use

df.dropna(inplace=True)
trainResults.index = df.index[:limit]
results.index = df.index[limit:]

trainResults["Open"] = df[:limit]["Open"]
trainResults["High"] = df[:limit]["High"]
trainResults["Low"] = df[:limit]["Low"]
trainResults["Close"] = df[:limit]["Close"]
results["Open"] = df[limit:]["Open"]
results["High"] = df[limit:]["High"]
results["Low"] = df[limit:]["Low"]
results["Close"] = df[limit:]["Close"]
results["Volume"] = df[limit:]["Volume"]

In [12]:
featureDf

Unnamed: 0_level_0,Close,%R,label
Date_Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2010-01-17,1.43665,-65.843113,1.44039
2010-01-18,1.44039,-54.292773,1.42788
2010-01-19,1.42788,-91.209457,1.41119
2010-01-20,1.41119,-93.242974,1.40979
2010-01-21,1.40979,-86.810811,1.41361
...,...,...,...
2019-12-25,1.10944,-76.883687,1.11012
2019-12-26,1.11012,-73.743436,1.11713
2019-12-27,1.11713,-21.155289,1.11813
2019-12-29,1.11813,-5.733006,1.12013


### Calculate Stochastic slow %D

In [13]:
def stochasticSlowD(data1, data2, period=14, slowPeriod=3, close="Close", low="Low", high="High"):
    close = data1[close]
    lowest_low = data2[low].rolling(window=period).min().values
    highest_high = data2[high].rolling(window=period).max().values
    
    stochastic_k = 100 * ((close - lowest_low) / (highest_high - lowest_low))
    data1["Stochastic %K"] = stochastic_k
    
    stochastic_d = data1["Stochastic %K"].rolling(window=slowPeriod).mean()
    data1["Stochastic %D"] = stochastic_d
    
    stochastic_slow_d = data1["Stochastic %D"].rolling(window=slowPeriod).mean()
    data1["Stochastic slow %D"] = stochastic_slow_d
    
    return data1

results = stochasticSlowD(results, results)

### Plot the data

In [29]:
rows = 3
cols = 1
graph = make_subplots(rows=rows, cols=cols, subplot_titles=("EUR/USD", "Stochastic slow %D and %R", " "), vertical_spacing=0.3)
graph.add_trace(go.Candlestick(x=results.index,
                            open=results.Open,
                            high=results.High,
                            low=results.Low,
                            close=results.Close,
                            name="EUR/USD"), row=1, col=1)

label_trace = go.Scatter(x=results.index, y=results.label, name="Label", line=dict(color = "orange", width=1))
pred_trace = go.Scatter(x=results.index, y=results.pred, name="Prediction", line=dict(color = "blue", width=1))

candlestick = go.Candlestick(x=results.index,
                            open=results.Open,
                            high=results.High,
                            low=results.Low,
                            close=results.Close,
                            name="EUR/USD")

Rpercent_trace = go.Scatter(x=results.index, y=results["%R"], name=results["%R"].name)
slowD_trace = go.Scatter(x=results.index, y=results["Stochastic slow %D"], name=results["Stochastic slow %D"].name)

data = [candlestick, label_trace, pred_trace, Rpercent_trace, slowD_trace]

graph.append_trace(go.Scatter(x=results.index, y=results.label, name="Label", line=dict(color = "orange", width=1)), row=1, col=1)
graph.append_trace(go.Scatter(x=results.index, y=results.pred, name="Prediction", line=dict(color = "blue", width=1)), row=1, col=1)

graph.append_trace(go.Scatter(x=results.index, y=results["%R"], name=results["%R"].name), row=2, col=1)
graph.append_trace(go.Scatter(x=results.index, y=results["Stochastic slow %D"], name=results["Stochastic slow %D"].name), row=2, col=1)

graph.layout.xaxis.range = [results.index[0], results.index[-1]]
graph.update_layout(height=1200, width=1000)

layout = graph.layout

fig = go.FigureWidget(data=graph.data, layout=graph.layout)
fig.layout.yaxis.range = [results.label.min() - 0.01, results.label.max() + 0.01]
fig.update_yaxes(fixedrange=False, nticks = 5, tick0=1.1)

def zoom(layout, xrange):
    in_view = results.loc[fig.layout.xaxis.range[0]:fig.layout.xaxis.range[1]]
    fig.layout.yaxis.range = [in_view.Low.min() - 0.001, in_view.High.max() + 0.001]
    

try:
    fig.layout.on_change(zoom, "xaxis.range")
except: 
    pass
fig.show()

<a id='g3'></a>
# Grade 3

- Calculate additional feature [RSI (relative strength index)](https://www.investopedia.com/terms/r/rsi.asp)
- Add the feature as a subplot to the illustration from in the previos step
- Set up an [ElasticNet](https://scikit-learn.org/stable/modules/linear_model.html#elastic-net) model
- Fit/train the ElasticNet to the training data
- Forecast and calculate the R² error on both the training data set and the test
- Combine line chart(s) that illustrates the *label* (actual data) and the *forecast* from both models in the previos figure.
- Compare the errors and explain the outcome


### RSI
<br>
<span style='font-size:25px'>
$100-\frac{100}{\left(1+\frac{\frac{\sum_{i=0}^{n-1}Up_{t-i}}{\text{n}}}{\frac{\sum_{i=0}^{n-1}Dw_{t-i}}{\text{n}}}\right)} $
    </span>

### Calculate the RSI and plot the data

In [30]:
def RSI(data, period=14, close="Close"):
    # Compare close prices, determine if up or down
    diff = data[close].diff()
    diff = diff[1:]
    up, down = diff.copy(), diff.copy()
    
    # Neutralize values that don't correspond to up or down
    up[diff < 0] = 0
    down[diff > 0] = 0
    
    # Average gain and loss of period
    periodUp = up.rolling(window=period).mean()
    periodDown = down.abs().rolling(window = period).mean()
    
    RS = periodUp / periodDown
    RSI = 100 - (100 / (1 + RS))
    
    data["RSI"] = RSI
    return data
results = RSI(results)

RSI_trace = go.Scatter(x = results.index, y = results["RSI"], name=results["RSI"].name)
graph.append_trace((RSI_trace), row=3, col=1)
graph.layout.annotations[-1].update(text="RSI")
RSI_plot = go.Figure(data=RSI_trace)
display(fig.show(), RSI_plot)

None

### Set up ElasticNet model

In [16]:
def elasticNetFit(data, column="label", no_window=bool):
    lr = ElasticNet()
    alphas = []
    l1_ratios = [.1, .2, .3, .4, .7, .8, .85, .9, .95, .99]
    alphaFrom = 1e-10
    alphaTo = 1e+0
    num = 1e-10

    best_a = 0
    best_r = 0
    best_d = 0
    
    train_x, train_y, test_x, test_y, scaler, train, test = dataSplitScale(data, column, no_window)

    for x in np.logspace(alphaFrom, alphaTo, 10):
        alpha = num * 1e+1
        alphas.append(alpha)
        num = alpha
    
    # Determine best alpha and l1 ratio, doesn't work as intended though
    # Mostly here for demonstration. Hard coded alpha and l1 value is used in the end
    for alpha in alphas:
        for ratio in l1_ratios:
            lr = ElasticNet(alpha=alpha, l1_ratio=ratio)
            lr.fit(train_x, train_y)
            best_score = lr.score(test_x, test_y)
            diff = lr.score(test_x, test_y) - best_score
            if diff > best_d:
                best_d = diff
                best_a = alpha
                best_r = ratio

            else:
                best_a = 0.0001
                best_r = 0.5
                
    test_split = len(test_y)
    train_split = len(train_y)

    lr = ElasticNet(alpha=best_a, l1_ratio=best_r)
    lr.fit(train_x, train_y)
    results = test[-test_split:]
    results["pred"] = lr.predict(test_x)
    score = lr.score(test_x, test_y)
    
    trainResults = train[:train_split]
    trainResults["pred"] = lr.predict(train_x)
    train_score = lr.score(train_x, train_y)
    
    results = inverseTransform(results, scaler)
    trainResults = inverseTransform(trainResults, scaler)
    
    
    return results, trainResults, score, train_score

elastic_results, elastic_trainResults, elastic_score, elastic_train_score = elasticNetFit(featureDf, no_window=None)

elastic_trainResults.index = featureDf.index[:limit]
elastic_results.index = featureDf.index[limit:]

print("Test score: %.6f" % elastic_score)
print("Train score: %.4f" % elastic_train_score)

Test score: 0.989795
Train score: 0.9964


### Plot label and prediction from both models (Linear Regression, ElasticNet)

In [17]:
modelGraph = go.Figure()
modelGraph.add_trace(go.Scatter(x=results.index, y=results.label, name="Label"))
modelGraph.add_trace(go.Scatter(x=results.index, y=results.pred, name="LinReg Prediction"))
modelGraph.add_trace(go.Scatter(x=elastic_results.index, y=elastic_results.pred, name="Elastic Prediction"))

<a id='g4'></a>
# Grade 4

- Calculate additional feature [On Balance Volume](https://www.investopedia.com/terms/o/onbalancevolume.asp)
- Create sliding windows for the input data, e.g. the window length of 10, 5, and 2 samples. You will extract data for the window length n (rows), and turn the data from a matrix (2D) form into a vector form of the size $R\times C$ (i.e. number of rows * number of columns in the window) [NumPy Array Reshaping](https://www.w3schools.com/python/numpy_array_reshape.asp). You will probably need to create a function that returns a vector (array, tuple, list, Series). Other solutions are also possible. Here is an example of two days window for two features attached to the original data:

|     | a | b   | a-1 | b-1 | a-2 | b-2 |
|:---:|:-:|:---:|:---:|:---:|:---:|:---:| 
| t₁  | 0 |	1   | nan | nan | nan | nan |	
| t₂  | 2 |	3   | 0   | 1   | nan | nan |
| t₃  | 4 |	5   | 2   | 3   | 0	  | 1   | 
| t₄  | 6 |	7   | 4   | 5   | 2	  | 3   | 
| t₅  | 8 |	9   | 6   | 7   | 4	  | 5   |

- Set up a [Polynomial regression](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions)
- Fit and run all three different models (on all features) for all three different window lengths
- Summarize and compare their R² error measures. Is anyone better than the [LinearRegression model](https://en.wikipedia.org/wiki/Linear_regression) without window information attached?

| M\W  | no |  2 |  5 | 10 |
|:----:|:--:|:--:|:--:|:--:|
|  M₁  | R² | R² | R² | R² |
|  M₂  | R² | R² | R² | R² |
|  M₃  | R² | R² | R² | R² |


### **On Balance Volume**

$ OBV_t = OBV_{t − 1} + \theta \times V_t $

where $V_t$ is the volume of trade at time $t$, and 

(*classic definition*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t > C_{t−1} \\
  0, & \textit{if} \ C_t = C_{t−1} \\
  –1, & \textit{otherwise}
\end{cases}
$  <br /><br /> or  (*definition from the paper*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t \geq C_{t−1} \\
  –1, & \textit{otherwise}
\end{cases}
$


### Calculate On-Balance Volume

In [18]:
def onBalanceVolume(data, close="Close", volume="Volume"):
    OBV = []
    OBV.append(0)
    close = data[close]
    volume = data[volume]
    
    for i in range(1, len(close)):
        if close[i] > close[i-1]:
            OBV.append(0 + volume[i])
        elif close[i] < close[i-1]:
            OBV.append(0 - volume[i])
        else:
            OBV.append(OBV[-1])
    data["OBV"] = OBV
    return data

df1 = stochasticSlowD(df1, df1)
df1 = onBalanceVolume(df1)
df1 = RSI(df1)
df1.drop(columns=["Open", "High", "Low", "Volume", "Stochastic %K", "Stochastic %D", "label"], inplace=True)
df1.reset_index(drop=True, inplace=True)
df1

Unnamed: 0,Close,%R,Stochastic slow %D,OBV,RSI
0,1.43335,,,0.000000,
1,1.43141,,,-3001.600003,
2,1.44244,,,80019.400094,
3,1.43634,,,-79887.100067,
4,1.44005,,,80971.800085,
...,...,...,...,...,...
3123,1.11012,-73.743436,24.453872,52487.634115,51.417649
3124,1.11713,-21.155289,30.387089,124575.902401,55.642816
3125,1.11813,-5.733006,44.468283,3216.499400,49.669967
3126,1.12013,-12.556634,65.348992,130773.875147,66.316608


### Create sliding windows

In [19]:
def slidingWindow(data, n=1):
    n = n + 1
    sliding = pd.DataFrame(data)
    for x in range(1, n, 1):
        sliding = sliding.join(data.shift(x), rsuffix="-" + str(x))
    sliding.dropna(inplace=True)
    return sliding

sliding_two = slidingWindow(df1)
sliding_five = slidingWindow(df1, n=4)
sliding_ten = slidingWindow(df1, n=9)
sliding_no = slidingWindow(df1, n=0)

### Turn data from a 2D Matrix to a Vector

In [20]:
def matrixToVector(data):
    vector = data.to_numpy()
    vector = vector.flatten()
    return vector

sliding_two_vector = matrixToVector(sliding_two)
sliding_five_vector = matrixToVector(sliding_five)
sliding_ten_vector = matrixToVector(sliding_ten)

### Perform, fit and run with Polynomial Regression

In [21]:
def polyRegFit(data, degrees=2, column="Close", no_window=bool):
    X_train, y_train, X_test, y_test, scaler, train, test = dataSplitScale(data, column, no_window)
    
    poly_features = PolynomialFeatures(degree=degrees, include_bias=False)
    lin_reg = LinearRegression()

    x_poly = poly_features.fit_transform(X_train)
    poly_features.fit(X_train, y_train)
    
    test_split = len(y_test)
    train_split = len(y_train)
    
    lin_reg.fit(x_poly, y_train)
    results = test[-test_split:]
    results["pred"] = lin_reg.predict(poly_features.fit_transform(X_test))
    score = lin_reg.score(poly_features.fit_transform(X_test), y_test)
    
    results = inverseTransform(results, scaler)
    
    return results, score


In [22]:
# Create sliding-2, 5 and 10 windows

ten_result, ten_score = polyRegFit(sliding_ten, no_window=None)
five_result, five_score = polyRegFit(sliding_five, no_window=None)
two_result, two_score = polyRegFit(sliding_two, no_window=None)

In [23]:
# Perform predictions on respective windows with x-regression

column = "Close"
reg_two, lin_reg_two, lin_two_train_results, lin_two_score, lin_train_two_score = linRegFit(sliding_two, column, no_window=False)
reg_five, lin_reg_five, lin_five_train_results, lin_five_score, lin_train_five_score = linRegFit(sliding_five, column, no_window=False)
reg_ten, lin_reg_ten, lin_ten_train_results, lin_ten_score, lin_train_ten_score = linRegFit(sliding_ten, column, no_window=False)

e_two, e_two_train_results, e_two_score, e_two_train_score = elasticNetFit(sliding_two, column=column, no_window=False)
e_five, e_five_train_results, e_five_score, e_five_train_score  = elasticNetFit(sliding_five, column=column, no_window=False)
e_ten, e_ten_train_results, e_ten_score, e_ten_train_score = elasticNetFit(sliding_ten, column=column, no_window=False)


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2989314755008768, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2988565203993942, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2987815479646894, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2987065581924782, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.29848148480805603, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2984064256430615, tolerance: 0.01684215398459639


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.29836888955125956, t

In [24]:
# Predictions with no window data
column = "Close"
no_poly, no_poly_score = polyRegFit(df1.dropna(), column=column, no_window=True)
e_test, e_train_results, no_e_score, e_train_score = elasticNetFit(df1.dropna(), column=column, no_window=True)
reg, no_reg, no_reg_train, no_reg_score, no_reg_train_score = linRegFit(df1.dropna(), column=column, no_window=True)

In [25]:
comparison_df = pd.DataFrame(columns=list(["no", "2", "5", "10"]))
comparison_df.index.name = "M/W"
comparison_df.loc["M₁"] = [no_reg_score, lin_two_score, lin_five_score, lin_ten_score]
comparison_df.loc["M₂"] = [no_e_score, e_two_score, e_five_score, e_ten_score]
comparison_df.loc["M₃"] = [no_poly_score, two_score, five_score, ten_score]

comparison_df

Unnamed: 0_level_0,no,2,5,10
M/W,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M₁,0.989831,0.98981,0.989538,0.989239
M₂,0.989794,0.989664,0.989245,0.989106
M₃,0.989369,0.996134,0.994842,0.990603


<a id='g5'></a>
# Grade 5
Implement an investment decision to either buy or sell based on some signals which you choose to detect. 
See the paper for how this can be done, the easiest solution is to hand-craft this decision to either buy or sell.

- Compare the regression forecast with the known Close price.
- Once the the forecast go above Close price you can define a buy opportunity
- You can decide to keep and hold if forecasted difference small or based on other signals.
- Calculate the hit ratio of your investment decision for each of the windows
- Forecast one week ahead and compare the hit ratio with one day ahead forecast
- Which setup was the best, and why was that?


# Happy coding! 👩‍💻

