# Machine Learning and the Efficient Market Hypothesis
### Hayden Hogenson
### Econ 411 - Computational Economics

Over the last few decades, artificial intelligence (AI) and machine learning (ML) algorithms have been increasingly used to help predict stock prices and returns in academia, professional practice, and day trading. Particularly well suited to financial time series is a special type of recurrent neural network (RNN) called the long short-term memory (LSTM) network. This model is both the most recent to be applied and the one able to garner the highest returns (Fischer and Krauss, 2017). As the prevalence of these algorithms increases, however, the efficient market hypothesis implies that their ability to accurately predict future stock prices should decrease, and that financial returns should diminish with it. Indeed, this seems to have been documented in the literature.
Fischer and Krauss (2017) found that several models were unable to garner positive returns (after transaction costs) past the year 2010. Most research in this area, however, has used algorithms trained on many stocks at once (Fischer and Krauss, for example, use every stock that was in the S&P 500 for at least one month between the years 1980 and 2015). The reason for this is relatively straightforward: neural networks and deep learning algorithms are especially sensitive to misspecification and are expensive to train in both computation and time.

Other, memory-free algorithms, however, are easier to specify and have shorter training times. One such algorithm is the random forest (RF). RF models are particularly robust to overfitting and, in Fischer and Krauss (2017), generated higher returns during the financial crisis. This is likely due to the structure of a random forest model, which uses the majority decision of an ensemble of decision trees to make its final prediction. Each of these decision trees is trained on a random subset of the data, which reduces the model’s propensity to overfit. This allows the model to filter out noise and outliers, two phenomena found in abundance in stock returns.
This paper investigates the decline in the ability of AI and ML models to predict stock returns in two primary ways. First, I narrow the scope of the output to a single stock. Thirty RF models are trained, each on a different stock chosen at random from the S&P 500. The accuracy of each model’s predictions is calculated for every available year to determine an average trendline of predictability. Second, I add variables to the input data, such as other, related stocks and macroeconomic indicators, and repeat the same process for the new models. It is my hypothesis that narrowing the scope and adding predictors will flatten the decline in predictive power demonstrated by Fischer and Krauss (2017). This paper also investigates whether the RF model’s ability to remain accurate during volatile times extends to the COVID-19 pandemic. It is my hypothesis that the random forest will not have increased predictive power during the pandemic, because the widespread availability of machine learning algorithms has reduced their ability to accurately predict stock movements as the market has begun to account for their use.

[SUMMARIZE RESULTS AND DISCUSSION]

[GIVE OVERVIEW OF PAPER’S STRUCTURE]


## Literature Review
Fischer and Krauss (2017) tested the ability of a long short-term memory (LSTM) network to predict price changes for the S&P 500’s constituent stocks against that of a random forest, a deep neural net, and a logistic regression. Overall, they found that the LSTM outperformed the comparison models except during one key time period: the Great Recession. During the financial crisis, the random forest model not only outperformed the LSTM network but also outperformed its own earlier returns from that decade. An additional finding of Fischer and Krauss (2017) is that the accuracy of their algorithms decreased over time, coinciding with the broad-scale availability of similar algorithms to financial investors. In the years prior to 2001, the models displayed a much greater ability to generate profit and accurately determine price changes than in the 2001-2009 period, which in turn was more predictable than the 2010-2015 period. Indeed, the average daily returns for both the LSTM and random forest models were negative for more than half of the final period. Fischer and Krauss’s data cover the period from 1992 until 2015, and another dire financial period has occurred since then: the COVID-19 pandemic.

## Data and Methods
[DATA SOURCES]

[RANDOMLY CHOSEN STOCKS]

[ADDITIONAL VARIABLES]

[VISUALIZATIONS]

As touched on previously, a random forest model works by developing several decision trees trained on random subsets of the data. Each tree has access not only to a random subset of the observations but also to a limited number of variables. The final decision of the model is the decision held by the majority of the decision trees. To predict the outcome for a new observation, the observation is run through each decision tree and the final “vote” is taken. This kind of ensemble method results in a model that is robust to noise and outliers in the data and is less likely to overfit. Overfitting occurs when a model is overly specialized to the training data and, as a result, performs poorly on new data. The benefits of this kind of robustness are especially apparent during events such as a recession or pandemic.
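The sampling and voting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the function names `bootstrap_sample` and `majority_vote` are hypothetical, the votes are made up, and drawing observations with replacement is an assumption (the usual bootstrap) rather than something specified above.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement (assumed sampling scheme)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from five trees for one trading day: 1 = up, 0 = down.
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)  # 1, since three of five trees vote "up"

# Each tree would be trained on its own random draw of the observations.
rng = random.Random(0)
rows = list(range(10))  # stand-in for ten daily observations
sample = bootstrap_sample(rows, rng)
```

Because each tree sees a different draw of the data, an unusual observation influences only some of the trees, which is one way to think about why the vote is less sensitive to outliers.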

To construct a random forest, the dataset is first divided into training and testing sets. The training set is used to construct the decision trees, and the testing set is used to evaluate the performance of the resulting model. When building a tree, the algorithm considers only a random portion of the independent variables provided in the input (in the standard formulation of the random forest, this portion is redrawn at every split). This reduces the correlation between the decision trees, helping to avoid overfitting. The number of features made available is up to the programmer but is usually set to the square root of the number of independent variables. Using these randomly selected features, the algorithm builds each decision tree on a random portion of the training set through a process called recursive partitioning.
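As a rough sketch of these two preparatory steps, the snippet below splits a dataset and draws a feature subset whose size is the square root of the number of variables. Several details are assumptions made for illustration only: the split is chronological (common for time series, but not stated above), the 75 percent training fraction is arbitrary, and the helper names `chronological_split` and `sample_features` are hypothetical.

```python
import math
import random

def chronological_split(rows, train_fraction=0.75):
    """Split time-ordered rows into an earlier training set and a later testing set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def sample_features(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices at random."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))

rows = list(range(8))                    # stand-in for eight time-ordered observations
train, test = chronological_split(rows)  # first six rows train, last two test

rng = random.Random(0)
features = sample_features(16, rng)      # four of the sixteen feature indices
```

In a full forest, `sample_features` would be called repeatedly so that different trees (or different splits) see different variables.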

Recursive partitioning divides the dataset into increasingly smaller subsets based on the values of the independent variables. The algorithm uses Gini impurity to choose the split point that best separates the data into subsets. Gini impurity measures the probability that a randomly chosen observation from the current sample would be misclassified if it were labeled at random according to the distribution of the dependent variable in that sample. The split point is the feature and value combination that minimizes the Gini impurity of the data after the split, weighted by the size of each resulting subset.
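To make the split criterion concrete, the following sketch computes Gini impurity as one minus the sum of squared class shares and searches for the split with the lowest size-weighted impurity. It is illustrative only: the names `gini` and `best_split` are hypothetical, the toy data are made up, and trying every observed value as a candidate threshold is a simplifying assumption.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, feature_indices):
    """Return (feature, threshold, weighted_gini) with the lowest post-split impurity."""
    best = None
    n = len(rows)
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put observations on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Toy data: one feature, labels 0 = down, 1 = up.
rows = [[1.0], [2.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]
mixed = gini(labels)                   # 0.5 for an evenly mixed two-class sample
split = best_split(rows, labels, [0])  # (0, 2.0, 0.0): splitting at 2.0 separates the classes
```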

Each observation in the dataset is sorted into the resulting subsets based on its value for that feature (observations with a value higher than the split point are placed in one subset, and those with lower values in the other). This process is repeated until a stopping criterion is met. In the trees created herein, the stopping criterion is a maximum tree depth of 20. Once the stopping criterion is met, the algorithm assigns each terminal node a label based on the observations remaining in its subset. One can then obtain the decision tree’s prediction for an observation by starting at the root node and traveling down the tree, following the branch that matches the observation at each split. The label attached to the final node gives the prediction. This is done for each tree in the forest, and the resulting predictions are combined by majority vote to obtain the final prediction.
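The recursion and the walk from root to leaf can be sketched as follows. This is a simplified, hypothetical version and not the code used for the models in this paper: the dictionary-based tree, the function names, and the toy data are all assumptions, every feature is considered at every split (no random feature subset), and stopping early at a pure node is an added assumption alongside the depth limit of 20.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (score, feature, threshold) with the lowest weighted Gini impurity."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=20):
    """Recursively partition the data until the depth limit or a pure node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return {"label": majority}
    split = best_split(rows, labels)
    if split is None:  # no feature separates the remaining observations
        return {"label": majority}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a terminal node and return its label."""
    while "label" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]

# Toy data: two features per observation, labels 0 = down, 1 = up.
rows = [[0.1, 5.0], [0.2, 3.0], [0.8, 4.0], [0.9, 1.0]]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels)
prediction = predict(tree, [0.7, 2.0])  # 1: the first feature exceeds the 0.2 split point
```

A forest would repeat `build_tree` on many random samples and take the majority of the resulting `predict` outputs.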

[EXPLAIN TRAINING]

[EXPLAIN TESTING]

[DESCRIBE EVALUATION METRICS]


## Results
[FOCUSING ON SINGLE STOCKS]

[ADDITION OF RELATED STOCKS]

[ADDITION OF MACRO VARIABLES]

[SPECIFICALLY DISCUSS GREAT RECESSION AND COVID-19]


## Robustness Checks

## Conclusion

## References