### Index {#index}

5. [Model Development](#models)  
    5.1 [LSTM](#lstm)  
    5.2 [Random Forest Regressor](#rf)  

6. [Portfolio Optimization](#optimization)  
    6.1 [Monte Carlo Simulation](#montecarlo)  
    6.2 [Genetic Algorithm](#genetic)  

7. [Business Applicability](#business)  
8. [Ethics](#ethics)

9. [Conclusion](#conclusion)  
10. [References](#ref)

## Model Development {#models}

In this chapter, we create models based on the project statement requirements and on recommendations from existing papers.
All sources are cited throughout this report and listed in the final chapter.

To feed useful pattern-recognition output to the portfolio optimization algorithms, we chose to implement an **LSTM**.

As a comparison point, we also worked on a Random Forest Regressor. Unlike a Random Forest Classifier, which is an ensemble learning method for classification tasks that outputs the majority vote or class probabilities, a Random Forest Regressor is used for **regression tasks**: multiple decision trees are built, and their predictions are averaged to **provide a final continuous output**.
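The averaging behaviour described above can be checked directly. The following is a minimal sketch using scikit-learn on synthetic data (the data and variable names are illustrative assumptions, not taken from our notebook): it averages the predictions of the individual trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only, not the project's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Average the individual trees' predictions by hand.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction is a continuous value per sample.
forest_prediction = forest.predict(X)
print(np.allclose(manual_average, forest_prediction))
```
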


### LSTM {#lstm}

### Random Forest Regressor {#rf}

The **Random Forest Regressor model** was meant to be an extra: a comparison point for the neural network. Unfortunately, this proved to be extremely difficult. For this reason, the `random_forest_regressor.ipynb` notebook does not output buying signals, even though it contains all of the scripts and data needed to do so. It also documents important insights about the **principle of causality** which, judging from conversations with colleagues, seemed to be the predominant issue.

Even so, if the model is capable of detecting time-based patterns, it can potentially output very relevant results.

The notebook contains specialized feature extraction and selection, as well as feature prediction and data splitting.

#### Initial Model

***Disclaimer***: to avoid visual clutter and allow a fair comparison between images, we show every model's performance using the `AAPL.csv` file.

Our initial model was a very simple RF Regressor implementation that used the same extracted financial metrics as the final model.
The results were, initially, quite surprising: a fast, light model with extremely accurate predictions:

![](./plots/violates_causality.png)

Of course, we quickly realized that some kind of overfitting was happening.
We discovered that our lagged financial metric features were not implemented correctly. As the project statement requires, we must predict an entire month after training on 13 years of data. The prediction for the first day of January is legitimate, because it only uses information from the previous 14 days (all in 2023). Every day from there on, however, uses **information from January 2024 as context**, which inherently **violates the principle of causality**.

In other words, we were indirectly using **January 2024 information to predict January 2024 data**, which produces heavily inflated results.
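The leak can be made concrete with a small sketch. It assumes a hypothetical rolling feature computed over the 14 days preceding each row (the window length follows the text, but the feature itself, the synthetic prices, and all names are illustrative assumptions rather than the notebook's code). For each January 2024 row, it checks whether the most recent observation feeding the feature falls inside the forecast month.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices (placeholder values, for illustration only).
dates = pd.date_range("2023-12-01", "2024-01-31", freq="D")
prices = pd.Series(np.linspace(100.0, 120.0, len(dates)), index=dates, name="close")

WINDOW_DAYS = 14  # assumed feature window, matching the 14 days mentioned above
forecast_start = pd.Timestamp("2024-01-01")

features = pd.DataFrame({"close": prices})
# Rolling mean over the 14 days that end the day before each row.
features["rolling_mean"] = prices.shift(1).rolling(WINDOW_DAYS).mean()
# Most recent observation date that feeds each row's feature.
features["window_end"] = features.index - pd.Timedelta(days=1)

january = features.loc[features.index >= forecast_start]
# A row leaks if its feature window reaches into the month being predicted.
leaky = january[january["window_end"] >= forecast_start]
safe = january[january["window_end"] < forecast_start]

print(f"{len(safe)} January row(s) use only 2023 data; {len(leaky)} use January 2024 data")
```

Under these assumptions, only the first day of the forecast month is free of future information, which is consistent with the behaviour described above.
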

To fix this, we needed to get creative. Initially, we researched the statistical distributions and mathematical patterns our metrics were likely to follow, but the results obtained were not accurate at all and greatly increased the RMSE.

So, in order to use only data from 2023 and predict metric behaviour for 2024, we then tried **LightGBM**. The problem was that, because dates were the only inputs available for predicting the metrics, LightGBM concentrated its feature importance on the day of the year: it overfit to that feature and completely disregarded month and year.

Finally, to keep execution time and complexity from growing out of proportion, we decided to use **simple RF Regressors**, which were able to consistently predict metrics for 2024 using only the dates.
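As a rough illustration of predicting a metric from dates alone, the sketch below trains an RF Regressor on calendar features and predicts January 2024 without using any 2024 observation. The metric series is synthetic, and the feature set (`year`, `month`, `day_of_year`) and function name `date_features` are assumptions made for this example; the notebook's actual features may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def date_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Calendar-only inputs, all known in advance for any future date."""
    return pd.DataFrame(
        {
            "year": np.asarray(index.year),
            "month": np.asarray(index.month),
            "day_of_year": np.asarray(index.dayofyear),
        },
        index=index,
    )


# Synthetic stand-in for one extracted financial metric (illustrative only).
history = pd.date_range("2021-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
seasonal = np.sin(2 * np.pi * np.asarray(history.dayofyear) / 365.0)
metric = pd.Series(seasonal + rng.normal(scale=0.1, size=len(history)), index=history)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(date_features(history), metric)

# Predict the metric for January 2024 using dates only.
january_2024 = pd.date_range("2024-01-01", "2024-01-31", freq="D")
predicted_metric = pd.Series(
    model.predict(date_features(january_2024)), index=january_2024
)
print(predicted_metric.head())
```

Because every input is derived from the calendar, the predicted metrics can be fed to the price model for the whole month without touching future observations.
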


After feeding our predicted 2024 features to the model, we obtained the following results:

![](./plots/rf_standard.png)

Root Mean Squared Error (RMSE): 49.67390248373879

We faced yet another issue: the pattern of the price changes is predicted correctly, but the RMSE is quite high because **the predictions do not match the absolute price level**.

In an attempt to mitigate this issue, we implemented **MinMax** scaling and **binning**. Even though the RMSE decreased **drastically**, the predicted and actual values still differed in scale.

![](./plots/rf_minmax_bin.png)

Root Mean Squared Error (RMSE): 0.6897974447004057
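Since RMSE is scale-dependent, a prediction with the right shape but the wrong level can score poorly on raw prices and well after scaling. The sketch below illustrates this with placeholder numbers (not our results). It assumes each series is MinMax-scaled independently and then split into equal-width bins; the notebook's exact scaling and binning choices may differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Placeholder prices: the prediction has the right shape but the wrong level.
actual = np.array([180.0, 182.0, 185.0, 183.0, 188.0])
predicted = actual - 50.0

raw_error = rmse(actual, predicted)

# Scale each series independently to [0, 1] so that only the shape is compared.
scaled_actual = MinMaxScaler().fit_transform(actual.reshape(-1, 1)).ravel()
scaled_predicted = MinMaxScaler().fit_transform(predicted.reshape(-1, 1)).ravel()
scaled_error = rmse(scaled_actual, scaled_predicted)

# Equal-width binning of the scaled values (4 bins, an arbitrary choice).
inner_edges = np.linspace(0.0, 1.0, 5)[1:-1]
binned_actual = np.digitize(scaled_actual, inner_edges)
binned_predicted = np.digitize(scaled_predicted, inner_edges)

print(f"raw RMSE: {raw_error:.2f}, scaled RMSE: {scaled_error:.2f}")
```

An RMSE computed on scaled or binned values is therefore useful for checking whether the pattern is captured, but it should not be compared directly with an RMSE computed on raw prices.
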

Additionally, we used Random Search to find the best parameters for a subset of the 500 files, assuming that this sample would roughly represent the general behaviour of the S&P 500 index. Under that assumption, the most common best settings for the subset would also be the best settings for all the data:

![](./plots/rf_optimized.png)
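The procedure described above (a random search per file, followed by a vote on the most common best settings) could look roughly like the following sketch. It uses scikit-learn's `RandomizedSearchCV` with a `TimeSeriesSplit` so that validation folds always come after their training folds. The search space, the synthetic data, and the helper names `synthetic_stock` and `best_params_for` are assumptions for illustration, not the values used in the notebook.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Hypothetical search space; the notebook's actual grid may differ.
PARAM_DISTRIBUTIONS = {
    "n_estimators": [20, 40],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 3, 5],
}


def synthetic_stock(seed):
    """Stand-in for the features and target of one stock file."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 4))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
    return X, y


def best_params_for(X, y):
    """Run a small random search with chronological cross-validation."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=4,
        scoring="neg_root_mean_squared_error",
        cv=TimeSeriesSplit(n_splits=3),
        random_state=0,
    )
    search.fit(X, y)
    return search.best_params_


# Tune a sample of "files" and count how often each best setting appears.
votes = Counter()
for seed in range(3):
    X, y = synthetic_stock(seed)
    best = best_params_for(X, y)
    votes[tuple(sorted(best.items()))] += 1

common_params = dict(votes.most_common(1)[0][0])
print(common_params)
```

The most common combination is then reused for every file, which trades some per-stock accuracy for a much shorter tuning time.
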

Although not present in this notebook, we also implemented some ideas suggested by the professor:

 - Concatenating every stock data file, using new columns to represent 'Tickers' and 'Sectors', as we were told that large amounts of data allow for extrapolation
 - Using LightGBM to improve training speed (later removed due to the feature importance issues)
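The concatenation idea from the list above can be sketched with pandas as follows. The frames, the ticker `XYZ`, the sector labels, and the `_code` columns are hypothetical placeholders; in practice each frame would be read from one stock file and the sectors would come from the sector retrieval step.

```python
import pandas as pd

# Hypothetical per-ticker frames with placeholder prices.
frames = {
    "AAPL": pd.DataFrame(
        {"Date": pd.to_datetime(["2023-12-28", "2023-12-29"]), "Close": [100.0, 101.0]}
    ),
    "XYZ": pd.DataFrame(
        {"Date": pd.to_datetime(["2023-12-28", "2023-12-29"]), "Close": [50.0, 49.5]}
    ),
}
# Hypothetical ticker-to-sector mapping.
sectors = {"AAPL": "SectorA", "XYZ": "SectorB"}

# Tag every row with its ticker and sector, then stack all files together.
combined = pd.concat(
    [df.assign(Ticker=ticker, Sector=sectors[ticker]) for ticker, df in frames.items()],
    ignore_index=True,
)

# Tree models need numeric inputs, so encode the new columns as integer codes.
for column in ("Ticker", "Sector"):
    combined[f"{column}_code"] = combined[column].astype("category").cat.codes

# Keep rows in chronological order so that later splits respect time.
combined = combined.sort_values(["Date", "Ticker"]).reset_index(drop=True)
print(combined)
```
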

Due to external pressures, such as the limited working time left by the large number of delivery checkpoints, we could not run all of the final code in time. This is explained in detail in the final parts of our Random Forest Regressor notebook.

We strongly suggest reviewing the sector retrieval and the loop over all companies inside the `random_forest_regressor.ipynb` notebook.

#### RF Regressor Conclusion

The Random Forest Regressor was meant to produce buy signals, in order to provide a point of comparison with the LSTM.

Essentially, this would be an extra for the project. Unfortunately, we were not able to obtain results we were satisfied with, so we decided not to spend more time than we already had on such a complex solution to the problem.

Although the professor suggested our model would be capable of extrapolating, external pressures such as the overwhelming amount of projects and tests scheduled for this delivery week hindered our ability to approach this problem calmly and in a timely manner.

Despite these factors, we implemented many different ways of mitigating the complications we found along the way. When comparing our work with other groups' projects, whose RMSEs were far lower, we noticed and informed them that they were **violating the causality principle**, which explains why our results looked much worse. Our mitigations include:

 - Using MinMax scaling to verify whether patterns were correctly detected
 - Using binning to facilitate model training and testing, due to the large number of distinct values in each column
 - Extracting financial metrics and predicting them, in order to completely avoid violating the causality principle
 - Optimizing the RF parameters
 - Concatenating every stock data file, using new columns to represent 'Tickers' and 'Sectors', as we were told that large amounts of data allow for extrapolation
 - Using LightGBM to improve training speed (later removed due to the feature importance issues)
 - Determining feature importance and explaining the RF's decisions and why they produced such results

Despite these setbacks, our model correctly identifies relative changes in price while **respecting the relevant prediction principles**.