# II. Analysis

## Data Exploration


### Description of Primary Dataset
The primary dataset used is daily stock data for stocks on the London Stock Exchange (LSE). The date range for each stock varies depending on when the stock went public. The earliest date in the dataset is in the year 1954. The most recent date in the dataset is 9 September 2016. The data was taken from Quandl's free access database.

All the data is in one comma-separated values (CSV) file, with each row being one datapoint. There are over 14 million datapoints in the dataset.

Each row has 14 columns. That means we have 14 fields (a symbol, a date and 12 numeric features) for each stock on every trading day since the stock began trading (from 1954 onwards). Unless otherwise indicated, the column values are all floats.

<table>
<tr><th>Column</th><th>Format or accuracy if float</th><th>Meaning</th></tr>
<tr><td>Stock symbol</td><td>string</td><td>How the stock is represented on the London Stock Exchange. E.g. GOOGLE's stock symbol is GOOGL.</td></tr>
<tr><td>Date</td><td>YYYY-MM-DD</td><td></td></tr>
<tr><td>Open</td><td>given to 2 decimal places (2 d.p.)</td><td>Price of stock when the market opened on that day in GBP £.</td></tr>
<tr><td>High</td><td>2 d.p.</td><td>Maximum price of the stock during the trading day in GBP £.</td></tr>
<tr><td>Low</td><td>2 d.p.</td><td>Minimum price of the stock during the trading day in GBP £.</td></tr>
<tr><td>Close</td><td>2 d.p.</td><td>Price of stock when the market closed on that day in GBP £.</td></tr>
<tr><td>Volume</td><td>1 d.p.</td><td>The number of shares of that stock traded on that day.</td></tr>
<tr><td>Ex-Dividend</td><td>1 d.p.</td><td>The value of the declared or upcoming dividend that will belong to the seller of the stock share rather than the buyer. Dividend is profits distributed to shareholders. If the upcoming dividend will be given to the buyer, Ex-Dividend = 0.</td></tr>
<tr><td>Split Ratio</td><td>1 d.p.</td><td>A company may choose to split their stock. E.g. a 2.0 (2:1) split ratio means shareholders get two new shares for every share they hold. This halves the price to preserve the market capitalisation (total value) of the company.</td></tr>
<tr><td>Adjusted Open</td><td>6 d.p.</td><td>Adjusted opening price (price of stock when the market opened on that day). Adjusted prices are prices amended to include any distributions and corporate actions such as stock splits (splitting one stock into two which would halve the price), dividends (giving stockholders cash as a fraction of profits) that occurred at any time before the next day's open.</td></tr>
<tr><td>Adjusted High</td><td>6 d.p.</td><td>See Adjusted Open and High.</td></tr>
<tr><td>Adjusted Low</td><td>6 d.p.</td><td>See Adjusted Open and Low.</td></tr>
<tr><td>Adjusted Close</td><td>6 d.p.</td><td>See Adjusted Open and Close.</td></tr>
<tr><td>Adjusted Volume</td><td>1 d.p.</td><td>See Adjusted Open and Volume.</td></tr>
</table>

Reference: [Definition of Ex-Dividend (Investopedia)](http://www.investopedia.com/terms/e/ex-dividend.asp)

#### Data sample

<table>
<tr><th></th><th>Symbol</th><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Ex-Dividend</th><th>Split Ratio</th><th>Adj. Open</th><th>Adj. High</th><th>Adj. Low</th><th>Adj. Close</th><th>Adj. Volume</th></tr>
<tr><td>0</td><td>A</td><td>1999-11-18</td><td>45.50</td><td>50.00</td><td>40.00</td><td>44.00</td><td>44739900.0</td><td>0.0</td><td>1.0</td><td>43.471810</td><td>47.771219</td><td>38.216975</td><td>42.038673</td><td>44739900.0</td></tr>
<tr><td>1</td><td>A</td><td>1999-11-19</td><td>42.94</td><td>43.00</td><td>39.81</td><td>40.38</td><td>10897100.0</td><td>0.0</td><td>1.0</td><td>41.025923</td><td>41.083249</td><td>38.035445</td><td>38.580037</td><td>10897100.0</td></tr>
<tr><td>2</td><td>A</td><td>1999-11-22</td><td>41.31</td><td>44.00</td><td>40.06</td><td>44.00</td><td>4705200.0</td><td>0.0</td><td>1.0</td><td>39.468581</td><td>42.038673</td><td>38.274301</td><td>42.038673</td><td>4705200.0</td></tr>
<tr><td>3</td><td>A</td><td>1999-11-23</td><td>42.50</td><td>43.63</td><td>40.25</td><td>40.25</td><td>4274400.0</td><td>0.0</td><td>1.0</td><td>40.605536</td><td>41.685166</td><td>38.455832</td><td>38.455832</td><td>4274400.0</td></tr>
<tr><td>4</td><td>A</td><td>1999-11-24</td><td>40.13</td><td>41.94</td><td>40.00</td><td>41.06</td><td>3464400.0</td><td>0.0</td><td>1.0</td><td>38.341181</td><td>40.070499</td><td>38.216975</td><td>39.229725</td><td>3464400.0</td></tr>
</table>
*Obtained using `df.head()`*

### Description of supplementary dataset (FTSE100)

I wanted to add features that corresponded to the general market trend and thought the FTSE100 would be a good representation. The FTSE100 as a single index was not included in my primary dataset, so I obtained the data by scraping Google Finance with a Python script (see `google-finance-scraper.py`).

The supplementary dataset has Open, High, Low, Close data in the date range April 1, 1984 - September 9, 2016.

### Defining Characteristics about Stock Data
1. Limit Down Circuit Breakers

### Dataset Statistics 

The summary statistics for the whole dataset are not too meaningful, but they give us an idea of the **variance within the dataset**. The standard deviation of the adjusted close price is of magnitude 10^3 (£1,000), and the standard deviation of adjusted volume is of magnitude 10^6 (1,000,000 shares).

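The summary statistics suggest that the data is **positively skewed**: the minimum price is 0 and the mean price is in the 70s, yet the maximum is over 200,000 and the standard deviation is roughly 30 times the mean.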


<table>
<tr><th></th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Ex-Dividend</th><th>Split Ratio</th><th>Adj. Open</th><th>Adj. High</th><th>Adj. Low</th><th>Adj. Close</th><th>Adj. Volume</th></tr>
<tr><td>mean</td><td>7.092291e+01</td><td>7.188109e+01</td><td>7.047024e+01</td><td>7.120251e+01</td><td>1.182026e+06</td><td>1.982789e-03</td><td>1.000210e+00</td><td>7.518079e+01</td><td>7.633755e+01</td><td>7.451613e+01</td><td>7.544570e+01</td><td>1.402925e+06</td></tr>
<tr><td>std</td><td>2.193723e+03</td><td>2.220224e+03</td><td>2.191789e+03</td><td>2.206792e+03</td><td>8.868551e+06</td><td>3.370723e-01</td><td>2.165061e-02</td><td>2.266636e+03</td><td>2.295340e+03</td><td>2.261718e+03</td><td>2.279264e+03</td><td>6.620816e+06</td></tr>
<tr><td>min</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>1.000000e-02</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td><td>0.000000e+00</td></tr>
<tr><td>max</td><td>2.281800e+05</td><td>2.293740e+05</td><td>2.275300e+05</td><td>2.293000e+05</td><td>6.674913e+09</td><td>9.625000e+02</td><td>5.000000e+01</td><td>2.281800e+05</td><td>2.293740e+05</td><td>2.275300e+05</td><td>2.293000e+05</td><td>2.304019e+09</td></tr>
</table>

I have checked the count is constant across all columns, i.e. that there are no missing values.

### Interesting observations: Abnormalities in dataset
The minimum Open, High, Low and Close are all zero. A stock trading at a price of zero is effectively worthless, so these rows look suspicious. I will examine this in the Data Preprocessing section.

### BP Statistics

More meaningful than the summary statistics for all 3,000+ stocks are the summary statistics for one stock. Since one of the stocks we are hoping to predict is that of BP (British Petroleum), let's examine the corresponding summary statistics.

<table>
<tr><th></th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Ex-Dividend</th><th>Split Ratio</th><th>Adj. Open</th><th>Adj. High</th><th>Adj. Low</th><th>Adj. Close</th><th>Adj. Volume</th><th>Daily Variation</th></tr>
<tr><td>mean</td><td>59.428433</td><td>59.908222</td><td>58.943809</td><td>59.446137</td><td>2.816082e+06</td><td>0.004626</td><td>1.000400</td><td>18.705367</td><td>18.855246</td><td>18.547576</td><td>18.707358</td><td>3.408274e+06</td><td>0.0</td></tr>
<tr><td>std</td><td>20.589378</td><td>20.676885</td><td>20.513272</td><td>20.598500</td><td>7.217241e+06</td><td>0.048270</td><td>0.019987</td><td>14.127674</td><td>14.228791</td><td>14.011973</td><td>14.122609</td><td>7.532096e+06</td><td>0.0</td></tr>
<tr><td>min</td><td>27.250000</td><td>27.850000</td><td>26.500000</td><td>27.020000</td><td>0.000000e+00</td><td>0.000000</td><td>1.000000</td><td>1.522366</td><td>1.528872</td><td>1.503109</td><td>1.522366</td><td>0.000000e+00</td><td>0.0</td></tr>
<tr><td>25%</td><td>44.750000</td><td>45.162500</td><td>44.250000</td><td>44.770000</td><td>1.831500e+05</td><td>0.000000</td><td>1.000000</td><td>5.426399</td><td>5.493816</td><td>5.373302</td><td>5.442764</td><td>7.536000e+05</td><td>0.0</td></tr>
<tr><td>50%</td><td>53.940000</td><td>54.360000</td><td>53.500000</td><td>53.940000</td><td>6.371500e+05</td><td>0.000000</td><td>1.000000</td><td>15.077767</td><td>15.165769</td><td>15.033179</td><td>15.099474</td><td>1.904100e+06</td><td>0.0</td></tr>
<tr><td>75%</td><td>69.750000</td><td>70.230000</td><td>69.327500</td><td>69.795000</td><td>3.784475e+06</td><td>0.000000</td><td>1.000000</td><td>31.849522</td><td>32.207689</td><td>31.524772</td><td>31.889513</td><td>4.051675e+06</td><td>0.0</td></tr>
<tr><td>max</td><td>147.120000</td><td>147.380000</td><td>146.380000</td><td>146.500000</td><td>2.408085e+08</td><td>0.840000</td><td>2.000000</td><td>50.669004</td><td>50.988683</td><td>50.039144</td><td>50.533702</td><td>2.408085e+08</td><td>0.0</td></tr>
</table>

I have checked the count is 10010 across all columns, i.e. that there are no missing values.

This is much better understood with a visualisation of the BP data.

## Exploratory Visualisations

### Open and Adjusted Open Prices
Let's first get an idea of the open and adjusted open prices. This is roughly equivalent to visualising the close and adjusted close prices (the variable we want to predict) shifted by one day.

<img src="images/bp-open-prices.png" />
<img src="images/bp-adj-open-prices.png" />

*Prices are in GBP £.*

#### Observations
1. **Adjusted vs non-adjusted figures** The adjusted open and the open are radically different for BP, whereas for stock 'A' in the first few rows of the dataframe, Adj. Open and Open had similar values. This makes sense: stocks with few corporate actions (e.g. no stock splits or dividends) require little value adjustment.
    - Since we are predicting the Adjusted Close, my guess is that the adjusted figures (Open, High, Low, Volume) will be more useful in predicting the adjusted price. The non-adjusted figures (specifically Volume) may still be useful in predicting momentum.

2. **Trend** The non-adjusted prices do not show an upward trend. The adjusted open prices show somewhat of an upward trend but it has been too volatile in recent years to draw any conclusions.

3. **Volatility** The stock price looks volatile, which is expected for an oil stock. From the descriptive statistics, the mean daily percentage variation is 1.72% and the maximum daily percentage variation is 16.0%.

### Volatility: Percentage Variation

To examine the volatility of BP stock, I constructed the features Percentage Variation and Adj. Percentage Variation, where

`Percentage Variation = (High - Low)/Open * 100`.

<img src="images/bp-percentage-variation.png" />
<img src="images/bp-adj-percentage-variation.png" />
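
As a minimal sketch of the Percentage Variation formula above, the function below computes it for a single row. The function name `percentage_variation` and the choice to return `None` for a zero open are illustrative assumptions, not the project's actual code.

```python
def percentage_variation(high, low, open_price):
    """Return the intraday range (High - Low) as a percentage of the Open."""
    if open_price == 0:
        # Rows with a zero open have no meaningful percentage;
        # signal that instead of dividing by zero (assumed behaviour).
        return None
    return (high - low) / open_price * 100


# First row of the data sample: High 50.00, Low 40.00, Open 45.50.
variation = percentage_variation(50.00, 40.00, 45.50)
print(round(variation, 2))
```

The adjusted version is the same calculation applied to the Adj. High, Adj. Low and Adj. Open columns.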

#### Observations
The Adjusted Percentage Variation and Percentage Variation look similar. There do not seem to be marked trends. Notably, the stock is consistently volatile, with a typical percentage variation of 0-4% in recent years, punctuated by spikes of up to 16% in extremely volatile periods.

## Algorithms and techniques


### Algorithm

I intend to use **linear regression**. 

#### Algorithm Description

Linear Regression is a way of modelling data by observing data and constructing an equation that minimises error. This regression is linear because the equation takes the form
$$\hat y = \sum \beta_i x_i$$

where $y$ is what we want to predict (stock prices) and $x_i$s are features such as the date. The hat on top of $y$ indicates it is an estimate.

That is, this regression is linear because the $x_i$s all have degree 1.
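
To make "constructing an equation that minimises error" concrete, here is a minimal sketch of the one-feature case, which has a closed-form least-squares solution. The function name `fit_simple_ols` and the toy data are assumptions for illustration; the project itself relies on sklearn's `LinearRegression` rather than this code. The intercept plays the role of a coefficient on a constant feature $x_0 = 1$.

```python
def fit_simple_ols(xs, ys):
    """Fit y ~ intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted linear equation at x."""
    return intercept + slope * x


# Toy data lying exactly on the line y = 1 + 2x (not project data).
intercept, slope = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope, predict(intercept, slope, 5))
```

With more than one feature the same idea applies, but the coefficients are found by solving a linear system instead of these two formulas.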


#### Algorithm Justification
1. I am using a **linear algorithm** because the **signal-to-noise ratio in trading is low**, so more complicated models would be likely to overfit.
2. A linear regression is appropriate because this is a **regression problem** - that is, the output is continuous.
    - Note that *regression* in linear regression does not mean the same thing as *regression* in a regression problem.

#### Algorithm Parameters
At the time of writing, there are only four parameters for `LinearRegression`:
- `fit_intercept` is set to True by default; setting it to false assumes the data is centered and will not produce better results.
- `normalize` normalizes the regressors X before regression. It is set to `False` by default.
- `copy_X` alters whether or not X may be overwritten, which does not affect the result.
- `n_jobs` can provide a speedup if the problem is large and you ask the algorithm to use more CPUs, but it will not change error measures.

Within these, there is only one parameter that it may be useful to adjust (`normalize`) to improve the error of the result.

### Techniques

1. Time-series train-test split
2. Time-series cross-validation
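
One common way to do time-series cross-validation is with expanding-window folds, where each fold trains on everything before its test block. The sketch below illustrates that idea only; the helper `expanding_window_splits` is hypothetical and is not necessarily the scheme used in this project.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Every test block comes strictly after its training block, so the
    model is never evaluated on data older than what it was trained on.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold
        test_end = train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


for train_idx, test_idx in expanding_window_splits(10, 4):
    print(train_idx, test_idx)
```

Each successive fold sees more history, which mirrors how a forecasting model would be retrained over time.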

#### TODO: Add detail.

## Benchmark

The benchmark given in the project outline was +/- 5% of the stock price 7 days out. That seems reasonable to start.


# III. Methodology

## Data Preprocessing

### Minor edits
1. On opening the CSV and sampling it with `df.head()`, I realised the CSV had no header. I added a header to the CSV:
```python
df = pd.read_csv('~/lse-data/lse/WIKI_20160909.csv', header=None, names=header_names)
```
where `header_names` was a slightly edited header I'd obtained from downloading the data for an individual stock from Quandl.

### Examining Abnormalities

I noted above that there were datapoints with opening price, high, low and closing price of 0.0. Were these mistakes? On investigating the data, it is plausible these were not mistakes.

<table>
<tr><th></th><th>Symbol</th><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Ex-Dividend</th><th>Split Ratio</th><th>Adj. Open</th><th>Adj. High</th><th>Adj. Low</th><th>Adj. Close</th><th>Adj. Volume</th></tr>
<tr><td>1047193</td><td>ARWR</td><td>2002-10-11</td><td>0.0</td><td>0.00</td><td>0.0</td><td>0.00</td><td>65000.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.00</td><td>0.0</td><td>0.000000</td><td>100.000000</td></tr>
<tr><td>1047194</td><td>ARWR</td><td>2002-10-14</td><td>0.0</td><td>0.00</td><td>0.0</td><td>0.00</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.00</td><td>0.0</td><td>0.000000</td><td>0.000000</td></tr>
<tr><td>7608936</td><td>LFVN</td><td>2003-02-21</td><td>0.0</td><td>0.01</td><td>0.0</td><td>0.01</td><td>27200.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>4.76</td><td>0.0</td><td>4.760000</td><td>57.142857</td></tr>
</table>

I've included three examples in the table above. The third example shows that the figures may not actually be zero but may be zero to one or two decimal places: the open and low prices were 0.0, but the high and close prices were 0.01.

I assembled a list of stocks where the open or close was equal to 0 and will examine individual stocks on the list if they end up as features I'd like to use in my model.
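
A minimal sketch of assembling such a list is shown below. It works on plain dictionaries rather than the project's dataframe, and the function name `symbols_with_zero_price` is an assumption for illustration. The sample rows reuse values from the tables in this report.

```python
def symbols_with_zero_price(rows):
    """Return the sorted symbols with any row whose Open or Close is 0."""
    flagged = {
        row["Symbol"]
        for row in rows
        if row["Open"] == 0 or row["Close"] == 0
    }
    return sorted(flagged)


rows = [
    {"Symbol": "ARWR", "Date": "2002-10-11", "Open": 0.0, "Close": 0.00},
    {"Symbol": "LFVN", "Date": "2003-02-21", "Open": 0.0, "Close": 0.01},
    {"Symbol": "A", "Date": "1999-11-18", "Open": 45.50, "Close": 44.00},
]
print(symbols_with_zero_price(rows))
```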

### Feature Engineering

### 1. Daily and Percentage Variation

Reasoning: This is an indicator of how volatile prices have been. If the daily variation has been higher recently, that may mean there is a lot of uncertainty and that we can expect more fluctuations or that we shouldn't take big one-day changes too seriously when considering long-term predictions. 

I calculated the daily absolute and percentage variation (adjusted and unadjusted) for the entire data frame.

### 2. Prices of related stocks (Oil stocks)

Reasoning: BP's stock price is affected by how people feel about oil in general. Thus prices of oil stocks may correlate positively or negatively (if they are direct competitors) with BP's prices.

I obtained a list of oil companies listed on the LSE by searching for stocks with the same group code (537 for oil) in `list-of-all-securities-ex-debt.csv`.

Unfortunately there was only one other oil stock on my list that I found in this database (`GAIA`), so instead of creating an aggregated dataframe, I only included `GAIA`'s data in my additional set of features.

Improvement for future studies: Collect data from another data source to come up with a more informative feature.

#### Adding GAIA Features
The GAIA trading dates started on 1999-10-29 whereas the BP trading dates started much earlier, so that cut out a large portion of the dataset. Data had to be taken out because it did not make sense to create proxy values for 20+ years of volatile price data.

**Complications** There was also a discrepancy in the trading dates. We have data for BP and GAIA on every trading day from 1999-10-29 to 2014-10-02, but beyond that the data for GAIA is incomplete. There was no information on GAIA trading on the second, fourth or fifth of October 2014 (whereas there was for BP). Thus our dataset is pared down even further to a size of 3754 as opposed to 10010 for BP. This is a huge cut.
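
The paring-down described above amounts to keeping only the dates present in both series. A minimal sketch, assuming each series is a dictionary keyed by `YYYY-MM-DD` date strings (the function name and the prices below are made up for illustration, not taken from the dataset):

```python
def align_on_common_dates(bp_by_date, gaia_by_date):
    """Return (date, bp_value, gaia_value) rows for dates in both series."""
    # ISO-formatted date strings sort chronologically as plain strings.
    common = sorted(set(bp_by_date) & set(gaia_by_date))
    return [(d, bp_by_date[d], gaia_by_date[d]) for d in common]


# Illustrative values only.
bp = {"2014-10-01": 1.0, "2014-10-02": 2.0, "2014-10-03": 3.0}
gaia = {"2014-10-01": 10.0, "2014-10-03": 30.0}
for row in align_on_common_dates(bp, gaia):
    print(row)
```

Any date missing from either series is dropped, which is why a shorter or patchier series shrinks the combined dataset.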

### 3. Prices of FTSE100

Reasoning: Stock prices are also affected by how people feel about the market in general. The FTSE100 is fairly representative of the performance of the market in general, so including it as a feature can help us account for that aspect.

#### TODO: Add some more detail

### TODO: Are you cutting X-day running averages?
### 4. X-day running averages (Cut down the number of features but try to provide the same information)

Reasoning: This allows us to leverage information from 21 days of data without including all 21 adjusted close prices. Having many features increases the risk of overfitting.
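
As a rough sketch of the idea, a running average collapses a window of prices into a single number. The function name `running_average` is an assumption; the example values are the first five Adj. Close prices of stock 'A' from the data sample, with a short 3-day window so the output is easy to follow.

```python
def running_average(prices, window):
    """Return the mean of each consecutive `window`-length slice of prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for end in range(window, len(prices) + 1):
        chunk = prices[end - window:end]
        averages.append(sum(chunk) / window)
    return averages


adj_close = [42.038673, 38.580037, 42.038673, 38.455832, 39.229725]
print(running_average(adj_close, 3))
```

A 21-day window would work the same way, producing one feature per day instead of 21.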



## Initial implementation
I initially implemented the Linear Regression algorithm with the following basic features:
* Adjusted Close prices on each of the 7 days prior to the first prediction date
* Max Adjusted High and Min Adjusted Low for that 7-day period prior to the first prediction date.

### Process:
1. Construct dataframe `X` containing initial features and dataframe `y` with 'Adjusted Close' prices.
    - This required some setting up to extract the relevant features from the dataset and put them in an appropriately formatted dataframe. This is in the first half of the `prepare_train_test()` function in part 2.1 of `III. Methodology - Code.ipynb`.
    - The `y` target `nday_prices` had prices for the next `n` days.
2. Split `X` and `y` into training and test datasets.
    - I wrote my own function to do this (initially in `train_test_split_noshuffle` before I absorbed it into `prepare_train_test()`) instead of using sklearn's `train_test_split`. This was because sklearn's function automatically shuffles the data. Shuffling the data is okay and desired for situations in which data is not ordered, but is not okay for time-series data. 
    - If the data were shuffled, the adjusted close price for 1 Sept 2016 might, for example, end up in the training set. We might then be asked to predict the adjusted close prices for the 7 days after 31 Aug 2016, which would include the 1 Sept 2016 price that the model had already seen. That's cheating.
3. Train model on training data.
    - Because there were multiple outputs to predict in `nday_prices` (the model had to forecast prices for each of the 7 trading days after the last date it was given), I wrapped `MultiOutputRegressor` from sklearn's `multioutput` module around my regressor.
    - This is in the first half of the function `classify_and_metrics` in `2.2 Classifier` in `III. Methodology - Code.ipynb`.
4. Ask model to predict prices on test features.
5. Print metrics
    - I included this in `classify_and_metrics()` using my helper functions `rmsp()` (root mean squared percentage error) and `print_metrics()`. See Section `2.2 Classifier` in `III. Methodology - Code.ipynb`.
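
Steps 1 and 2 of the process above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the contents of `prepare_train_test()`: the helper names `build_windows` and `split_no_shuffle`, the list-based inputs and the 80/20 split are all hypothetical.

```python
def build_windows(adj_close, adj_high, adj_low, days=7, horizon=7):
    """Build feature rows and multi-day targets from chronological price lists.

    Each feature row holds `days` adjusted closes plus the window's maximum
    adjusted high and minimum adjusted low; each target row holds the next
    `horizon` adjusted closes.
    """
    X, y = [], []
    last_start = len(adj_close) - days - horizon
    for start in range(last_start + 1):
        end = start + days
        features = list(adj_close[start:end])
        features.append(max(adj_high[start:end]))
        features.append(min(adj_low[start:end]))
        X.append(features)
        y.append(list(adj_close[end:end + horizon]))
    return X, y


def split_no_shuffle(X, y, test_fraction=0.2):
    """Split chronologically: the earliest rows train, the latest rows test."""
    cut = int(len(X) * (1 - test_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]


# Synthetic, steadily rising prices (not project data).
closes = [float(i) for i in range(20)]
highs = [c + 0.5 for c in closes]
lows = [c - 0.5 for c in closes]
X, y = build_windows(closes, highs, lows)
X_train, X_test, y_train, y_test = split_no_shuffle(X, y)
print(len(X), len(X[0]), len(X_train), len(X_test))
```

With 7 days of history each row has 9 features (7 closes plus the max high and min low). Because no shuffling happens, every test row is later in time than every training row.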

#### Refactoring
I refactored the code so that I could run a full cycle of (1) train-test split, (2) train classifier, (3) test classifier and print metrics using only one line. To do this, I wrapped the functions for those processes in the `execute()` function.

### Initial Results
The results are shown below. I also tried using an SVM regression for comparison. 

#### Linear Regression
<table>
<tr><th>Days after last training date</th><th>Mean root mean squared daily percentage error (across 8 distinct train-test sets)</th></tr>
<tr><td>1</td><td>1.669</td></tr>
<tr><td>2</td><td>2.422</td></tr>
<tr><td>3</td><td>2.968</td></tr>
<tr><td>4</td><td>3.407</td></tr>
<tr><td>5</td><td>3.834</td></tr>
<tr><td>6</td><td>4.230</td></tr>
<tr><td>7</td><td>4.590</td></tr>
</table>

Mean R2 score: 0.807. Ranged from 0.606 to 0.936.

#### SVM.SVR
<table>
<tr><th>Days after last training date</th><th>Mean root mean squared daily percentage error (across 8 distinct train-test sets)</th></tr>
<tr><td>1</td><td>11.230</td></tr>
<tr><td>2</td><td>11.460</td></tr>
<tr><td>3</td><td>11.761</td></tr>
<tr><td>4</td><td>12.022</td></tr>
<tr><td>5</td><td>12.323</td></tr>
<tr><td>6</td><td>12.667</td></tr>
<tr><td>7</td><td>13.060</td></tr>
</table>
Mean R2 score: -2.044. Ranged from -9.156 to 0.822.

The Linear Regression did surprisingly well, with a mean R2 score of 0.807 overall for 7-day predictions and a mean RMS percentage error of under 5% for forecasts 7 days away.

The SVM regression did horribly: it had a negative mean R2 score (-2.044) and a negative median R2 score, which means it did worse than a model that always predicts the mean of the test prices. It had a mean RMS percentage error of over 11% for every number of days ahead in the table above.
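
For reference, the two metrics quoted in this section can be sketched as below. The body of the project's `rmsp()` helper is not shown in this report, so the definition here (errors taken relative to the actual price) is an assumption; `r2` follows the standard coefficient-of-determination formula. The example values are made up.

```python
import math


def rmsp(actual, predicted):
    """Root mean squared percentage error, in percent (assumed definition)."""
    squared = [((p - a) / a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared)) * 100


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Always predicting the mean of the actual values gives an R2 of 0;
# predictions that are worse than that give a negative R2.
print(r2([1, 2, 3], [2, 2, 2]))
print(r2([1, 2, 3], [3, 2, 1]))
print(rmsp([100, 200], [110, 180]))
```

This is why a negative R2 is a warning sign: the model's squared error exceeds that of a constant prediction at the mean.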

It is impressive that the Linear Regression model did so well with such basic features.

# TODO: Insert plot of predictions vs actual prices

## Refinement

### 1. Adjusting parameters

As discussed in Analysis: Algorithm Parameters, there is only one parameter that it may be useful to adjust (`normalize`).

I ran the algorithm with `normalize=True` to see if it produced better results. The metrics returned were exactly the same as when, by default, `normalize=False`.

### 2. Add features (Feature Selection)

I then experimented with adding the features I'd engineered earlier. (See *Data Preprocessing: Feature Engineering* for more details on how these features came about.)

#### 2.1 Adding more of the same type of features:

In the first implementation, I only used prices from the 7 days running up to the first prediction day. I then tried using prices from 10, 14, 21 and 30 days running up to the first prediction day. 

Reasoning: If we have more data, it makes sense to use it if we are confident it will give us better results.

To do this, I changed the value of the parameter `days` in the function `execute`, which trains and tests the classifier and prints metrics. 


#### Mean Daily Error across 15 trials
<table>
<tr><th>Day to predict</th><th>7d (used)</th><th>10d</th><th>14d</th><th>21d</th><th>30d</th><th>100d</th></tr>
<tr><td>1</td><td>1.669</td><td>1.732</td><td>1.729</td><td>1.746</td><td>1.784</td><td>1.924</td></tr>
<tr><td>2</td><td>2.422</td><td>2.543</td><td>2.526</td><td>2.555</td><td>2.593</td><td>2.768</td></tr>
<tr><td>3</td><td>2.968</td><td>3.138</td><td>3.103</td><td>3.113</td><td>3.152</td><td>3.370</td></tr>
<tr><td>4</td><td>3.407</td><td>3.579</td><td>3.586</td><td>3.586</td><td>3.633</td><td>3.890</td></tr>
<tr><td>5</td><td>3.834</td><td>3.939</td><td>4.002</td><td>3.991</td><td>4.048</td><td>4.355</td></tr>
<tr><td>6</td><td>4.230</td><td>4.269</td><td>4.372</td><td>4.342</td><td>4.392</td><td>4.769</td></tr>
<tr><td>7</td><td>4.590</td><td>4.543</td><td>4.702</td><td>4.658</td><td>4.705</td><td>5.163</td></tr>
</table>

We can see that the mean RMS percentage error is slightly smaller in one instance (using 10d instead of 7d to predict precisely 7 days ahead), but otherwise the error generally grows as the number of days of data given increases.

This is because more days of data in this case means more features (e.g. for 100 days of data we have 102 features). This increases the risk of overfitting.

#### 2.2 Adding GAIA (Oil Stock) Prices


There were far fewer datapoints to work with because of date inconsistencies (3753 datapoints vs 10010 for the BP-only model), so I decreased the step length (the difference between start dates) between consecutive trials to 200 from 500. This does not affect individual trial performance, but reduces the variety of data used for trials. We should bear this in mind when comparing the performance with and without GAIA prices as features.

<table>
<tr><th>Day to predict</th><th>7d (no GAIA)</th><th>7d (GAIA)</th><th>10d (no GAIA)</th><th>10d (GAIA)</th></tr>
<tr><td>1</td><td>1.669</td><td>1.744</td><td>1.732</td><td>1.751</td></tr>
<tr><td>2</td><td>2.422</td><td>2.444</td><td>2.543</td><td>2.467</td></tr>
<tr><td>3</td><td>2.968</td><td>2.938</td><td>3.138</td><td>2.978</td></tr>
<tr><td>4</td><td>3.407</td><td>3.424</td><td>3.579</td><td>3.479</td></tr>
<tr><td>5</td><td>3.834</td><td>3.881</td><td>3.939</td><td>3.946</td></tr>
<tr><td>6</td><td>4.230</td><td>4.294</td><td>4.269</td><td>4.368</td></tr>
<tr><td>7</td><td>4.590</td><td>4.702</td><td>4.543</td><td>4.816</td></tr>
</table>

*Trial information: (1) No GAIA: Mean over 15 trials, buffer step = 500.
(2) GAIA: Mean over 13 trials, buffer step = 200.*

When considering 7 days' worth of data, adding GAIA features produces predictions with a similar mean RMS percentage error. The mean error is higher for 6 out of 7 days-ahead (the exception being 3 days ahead).

When considering 10 days' worth of data, adding GAIA features performs slightly better for 2-4 days-ahead (by 0.08, 0.16 and 0.10 percentage points) and slightly worse for all other days-ahead (by 0.02, 0.01, 0.10 and 0.27 percentage points). But these mean RMS percentage errors are all larger than the 7-day no-GAIA mean RMS percentage errors.

**Action**: I conclude that adding GAIA features in this way does not reliably produce better results, likely because additional features increase the risk of overfitting.

**Interpretation**: It makes sense because BP prices would not correlate perfectly in one direction or the other with GAIA prices: oil companies' stock prices incorporate sentiment about oil but companies are also often in different regions and compete against each other, muddying correlations.

#### 2.3 Adding related features: FTSE100

#### TODO: Do it and report results.

Proxied


### Model Choice

Perhaps counterintuitively, the final model is the initial model. All the improvements I tried to make only made the model worse. This goes to show that added complexity doesn't necessarily make a model better, especially when that complexity contains much noise.

# IV. Results

## Model Evaluation and Validation

### Generalisability
When we evaluated the model in the previous section, each iteration of the model was run on 15 training and test sets. We then looked at the mean daily root mean squared percentage error. This **variation of input data** is to ensure that the model can generalise well and does not only perform well on one set of data.

There are two types of metrics we need to look at: mean performance and variance of performance. 
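
As a minimal sketch of how those two quantities could be computed from per-trial errors, Python's `statistics` module is enough. The per-trial values below are placeholders chosen for easy arithmetic, not results from this project.

```python
import statistics

# Placeholder per-trial errors (made-up numbers, NOT project results).
trial_errors = [4.0, 5.0, 6.0]

mean_error = statistics.mean(trial_errors)
# Sample standard deviation: how much performance varies between trials.
spread = statistics.stdev(trial_errors)
print(f"mean = {mean_error:.3f}, std = {spread:.3f}")
```

A low mean with a large spread would indicate a model that performs well on average but unreliably from one period to the next.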

#### TODO: Insert mean performance and variance of perf metrics results

## Justification

Overall, this model aligns with solution expectations and on average performs slightly better than the benchmark of predicting within +/- 5% of the stock's adjusted closing price 7 days after the last training date. The model has mean performance of **TODO: ADD and use some type of statistical analysis to compare?**.

The solution gives reasonably accurate predictions but it **is not accurate enough** to give advice on trades. **TODO: INSERT VARIANCE FIGURES.** A 5% error is significant in trading. There are also transaction costs with every trade, which would cut into profits.
