This project applies classification models, in R, to the Hedge Fund X dataset to analyze model performance, runtimes, and variable importance. The dataset, available on Kaggle, summarizes a heterogeneous set of financial products and was originally released through a machine-learning competition hosted by Signate of Japan, circa 2017/2018.
https://www.kaggle.com/datasicencelab/hedge-fund-x-financial-modeling-challenge/version/1#
The dataset contains 10,000 observations with 91 attributes (all numerical): 88 predictive, 2 non-predictive, and 1 target variable (target). For this project, a 10% random sample (RS) of 1,000 observations, with the 88 predictive attributes, was taken. Two distinct runs were performed: one with a fixed (seeded) sample of 1,000 observations for reproducible results (R1), and another with 1,000 randomly chosen observations for varying results (R2); comparisons between the two runs are made throughout the PowerPoint slide deck.
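A minimal sketch of how this sampling could be set up in R follows; the file name, the non-predictive column names (id, timestamp), and the seed value are illustrative assumptions, not details confirmed by the project.

```r
# Load the full dataset (file name is assumed; adjust to the actual export)
df <- read.csv("hedge_fund_x.csv")

# Drop the 2 non-predictive columns (names assumed), keeping 88 predictors + target
df <- df[, !(names(df) %in% c("id", "timestamp"))]

# Run 1 (R1): seeded, so the 10% sample is reproducible
set.seed(1)
r1 <- df[sample(nrow(df), 1000), ]

# Run 2 (R2): unseeded, so the 10% sample varies between executions
r2 <- df[sample(nrow(df), 1000), ]
```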
Fortunately, since all of the attributes were numerical and the dataset was relatively balanced, little (if any) data pre-processing was necessary. The imbalance ratio (IR), taken as the minority class count over the majority class count:
- For the original dataset: n0 = 4,994, n1 = 5,006, so n0 / n1 = 99.76%
- For Run 1: n0 = 512, n1 = 488, so n1 / n0 = 95.31%
- For Run 2: n0 = 473, n1 = 527, so n0 / n1 = 89.75%
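The IR can be checked directly from the class counts; a quick sketch, assuming the target column is named target:

```r
# Class counts of the binary target
counts <- table(r1$target)

# IR = minority count / majority count, e.g. 488 / 512 = 95.31% for Run 1
ir <- min(counts) / max(counts)
```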
Additionally, some preliminary exploratory data analysis (EDA) boxplots were created to better visualize attribute importance and weight. The distribution of the target variable was also plotted; it appears slightly skewed, but otherwise normal.
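This EDA could be reproduced along the following lines; the three predictors shown are the RF influencers identified later (c69, c27, c80), used here only as examples.

```r
# Boxplots of selected predictors split by target class
par(mfrow = c(1, 3))
for (v in c("c27", "c69", "c80")) {
  boxplot(r1[[v]] ~ r1$target, main = v, xlab = "target", ylab = v)
}

# Class distribution of the target variable
barplot(table(r1$target), main = "Target distribution")
```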
A 90/10 train/test split was performed to fit 6 classification models (Logistic, LASSO, Elastic-net (EN), Ridge, Random Forest (RF), and radial Support Vector Machines (SVM)), repeated 50 times across 50 samples. The AUCs and runtimes were recorded, with boxplots to visualize these results (a code sketch of one repetition follows the list below):
- 0.9n AUC training:
  - SVM and RF medians at, or near, 1.
- 0.9n AUC testing (models trained on 0.9n, scored on the 0.1n hold-out):
  - Larger variances overall compared to the AUC training boxplots.
  - Logistic, EN, LASSO, and Ridge medians are close.
  - SVM and RF medians are higher.
- 0.9n training errors:
  - Near mirror image of the AUC training boxplots, with EN slightly smaller.
  - SVM and RF medians at, or near, 0.
- 0.9n testing errors:
  - Larger variances overall; medians for SVM and RF are still the lowest.
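The sketch below shows one such repetition; glmnet, randomForest, and pROC are assumed as the underlying packages (the radial SVM would come from e1071 and be timed the same way), and lambda.min is one reasonable tuning choice, not necessarily the project's exact setup.

```r
library(glmnet)        # LASSO / EN / Ridge (penalized logistic regression)
library(randomForest)  # random forest
library(pROC)          # AUC computation

x <- as.matrix(r1[, setdiff(names(r1), "target")])
y <- factor(r1$target)

# One 90/10 train/test split; the study repeats this 50 times per model
idx  <- sample(nrow(x), floor(0.9 * nrow(x)))
x_tr <- x[idx, ];  y_tr <- y[idx]
x_te <- x[-idx, ]; y_te <- y[-idx]

# LASSO (alpha = 1); EN would use alpha = 0.5 and Ridge alpha = 0
t0        <- Sys.time()
lasso     <- cv.glmnet(x_tr, y_tr, family = "binomial", alpha = 1)
p_lasso   <- predict(lasso, x_te, s = "lambda.min", type = "response")
rt_lasso  <- as.numeric(Sys.time() - t0, units = "secs")
auc_lasso <- auc(y_te, as.numeric(p_lasso))

# Random forest, timed the same way
t0     <- Sys.time()
rf     <- randomForest(x_tr, y_tr)
p_rf   <- predict(rf, x_te, type = "prob")[, 2]
rt_rf  <- as.numeric(Sys.time() - t0, units = "secs")
auc_rf <- auc(y_te, p_rf)

# AUCs and runtimes collected over the 50 repetitions feed boxplot()
```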
For one of the 50 samples, 10-fold cross-validation (CV) curves of misclassification error were produced for EN, LASSO, and Ridge, with LASSO performing best in R1 but Ridge outperforming it in R2. Overall, the log(λ) ranges and runtimes were generally the same, with EN, LASSO, and Ridge producing similarly shaped, if somewhat ragged, upward-sloping curves.
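Assuming the penalized models come from glmnet, these curves fall out of cv.glmnet directly by selecting misclassification error as the loss:

```r
# 10-fold CV with misclassification error as the loss
cv_lasso <- cv.glmnet(x_tr, y_tr, family = "binomial", alpha = 1,
                      type.measure = "class", nfolds = 10)
cv_en    <- cv.glmnet(x_tr, y_tr, family = "binomial", alpha = 0.5,
                      type.measure = "class", nfolds = 10)
cv_ridge <- cv.glmnet(x_tr, y_tr, family = "binomial", alpha = 0,
                      type.measure = "class", nfolds = 10)

# Each panel plots CV misclassification error against log(lambda)
par(mfrow = c(1, 3))
plot(cv_lasso); plot(cv_en); plot(cv_ridge)
```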
Upon observation, the average performance (training error rate) was about 0.422, with fast runtimes throughout. However, the AUCs were no better than 50/50, meaning there was a slight trade-off between performance and runtime. RF consistently performed the best, albeit with runtimes above 3.50 seconds, while SVM took the longest with minimal performance improvement.
Standardizing the estimated coefficients allows variable-importance bar plots to be generated for LASSO, EN, and Ridge, alongside the RF importance visualization (see the sketch after this list). Variables c69, c27, and c80 were the top 3 influencers for RF.
- The top 3 positive influencers were c85, c17, and c45.
- The top 3 negative influencers were c70, c81, and c80.
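A sketch of how both sets of importances could be extracted, assuming the rf and cv_lasso fits from the earlier snippets; mean decrease in Gini and SD-scaling are standard choices, not necessarily the project's exact ones:

```r
# RF: mean decrease in Gini as the importance measure
imp <- importance(rf)[, "MeanDecreaseGini"]
barplot(sort(imp, decreasing = TRUE)[1:10],
        las = 2, main = "RF: top 10 variables")

# LASSO: coefficients at lambda.min, standardized by predictor SD
beta     <- coef(cv_lasso, s = "lambda.min")[-1, 1]  # drop the intercept
beta_std <- beta * apply(x_tr, 2, sd)
top      <- order(abs(beta_std), decreasing = TRUE)[1:10]
barplot(beta_std[top], las = 2,
        main = "LASSO: top 10 standardized coefficients")
```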
- There are too many unknowns: it cannot be ascertained which specific funds and/or stocks affected performance.
- Overall, the results are no better than a 50/50 coin toss, except for the RF and SVM AUCs.
- Beyond RF and SVM (the best performers here), other classifiers such as AdaBoost, SGD, kNN, and NB could be explored.
- Together, these factors make financial modeling decidedly more difficult to perform.
While financial behavior can be predicted to a certain extent, and some insights can be gleaned from the Hedge Fund X dataset, performance may increase under different classification models, though probably not by much. As such, although improvements can surely be made, the data may be insufficient to adequately predict the future movement of financial products unless certain unknowns, such as what the products are and over what timeline, are made available.