Hedge Fund X: Financial Modeling Challenge

Michael S. Bonetti

Zicklin School of Business

CUNY Bernard M. Baruch College

Brief Description

This project applies classification models, written in R, to the Hedge Fund X dataset to analyze model performance, runtimes, and variable importance. The dataset, available on Kaggle, summarizes a heterogeneous set of financial products and was originally released for an ML competition hosted by Signate of Japan, circa 2017/2018.
https://www.kaggle.com/datasicencelab/hedge-fund-x-financial-modeling-challenge/version/1#

The dataset contains 10,000 observations and 91 attributes (all numerical): 88 predictive, 2 non-predictive, and 1 target variable (named target). For this project, a 10% random sample (RS) of 1,000 observations was taken, keeping all 88 predictive attributes. Two distinct runs were performed: one with 1,000 observations drawn under a fixed seed for reproducible results (R1), and another with 1,000 randomly chosen observations for varying results (R2); comparisons between the two runs appear throughout the PowerPoint slide deck.
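A minimal sketch of how the two samples might be drawn (the file name train.csv and the non-predictive column names are assumptions, not confirmed by the repo):

```r
# Load the full 10,000 x 91 dataset (file name assumed from the Kaggle page)
df <- read.csv("train.csv")

# Drop the 2 non-predictive columns (names here are hypothetical),
# leaving the 88 predictive attributes plus the target
df <- df[, !(names(df) %in% c("id", "era"))]

# Run 1 (R1): fixed seed, so the same 1,000 rows are drawn every time
set.seed(1)
r1 <- df[sample(nrow(df), 1000), ]

# Run 2 (R2): no fixed seed, so the 1,000 rows vary between executions
r2 <- df[sample(nrow(df), 1000), ]
```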

Data Pre-Processing & Imbalance Ratio (IR)

Fortunately, as all of the attributes were numerical and the dataset was relatively balanced, little (if any) data pre-processing was necessary. The imbalance ratio (IR), taken as the minority class count over the majority class count:

  • Original dataset: n0 = 4,994, n1 = 5,006, so n0 / n1 = 99.76%
  • Run 1 (R1): n0 = 512, n1 = 488, so n1 / n0 = 95.31%
  • Run 2 (R2): n0 = 473, n1 = 527, so n0 / n1 = 89.75%
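A one-liner along these lines reproduces the IR figures above (assuming the sampled data frame r1 from the earlier sketch):

```r
# Imbalance ratio: minority class count over majority class count
tab <- table(r1$target)
min(tab) / max(tab)   # e.g. 488 / 512 = 0.9531 for R1
```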

Additionally, some preliminary exploratory data analysis (EDA) boxplots were created to better visualize attribute importance and weight. The distribution of the target variable was also plotted; it appears slightly skewed, but otherwise normal.
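A minimal sketch of such EDA plots (the c-prefixed column names follow the variables cited later, e.g. c69 and c27; restricting to the first 10 columns is an arbitrary choice for illustration):

```r
library(ggplot2)
library(reshape2)

# Boxplots of a handful of attributes, split by target class
long <- melt(r1[, c("target", paste0("c", 1:10))], id.vars = "target")
ggplot(long, aes(x = variable, y = value, fill = factor(target))) +
  geom_boxplot() +
  labs(fill = "target")

# Distribution of the target variable
barplot(table(r1$target), main = "Target distribution")
```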

Creating Machine Learning (ML) Models

AUCs

A 90/10 training/testing split was used to fit 6 classification models (Logistic, LASSO, Elastic-net (EN), Ridge, Random Forest (RF), and radial Support Vector Machine (SVM)) 50 times across 50 samples. The AUCs and runtimes were recorded, with boxplots to visualize these results (a minimal sketch of the resampling loop follows the list below):

  • 0.9n AUC training
    • SVM and RF medians at, or near, 1
  • 0.9n AUC testing
    • Larger variances overall compared to the AUC training boxplots
    • Logistic, EN, LASSO, and Ridge medians are close to one another
    • SVM and RF medians are higher
  • 0.9n training errors
    • Near mirror-image of the AUC training boxplots, with EN slightly smaller
    • SVM and RF medians at, or near, 0
  • 0.9n testing errors
    • Larger variances overall; medians for SVM and RF are still lower
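Here is a minimal sketch of that resampling loop for two of the six models, LASSO (via glmnet) and RF; the remaining models follow the same pattern. The specific toolchain (cv.glmnet, randomForest, pROC::auc) is an assumption, not confirmed by the repo:

```r
library(glmnet)
library(randomForest)
library(pROC)

x <- as.matrix(r1[, setdiff(names(r1), "target")])
y <- factor(r1$target)
n <- nrow(x)

auc_lasso <- auc_rf <- numeric(50)

for (i in 1:50) {
  train <- sample(n, floor(0.9 * n))   # 90/10 split
  test  <- setdiff(seq_len(n), train)

  # LASSO (alpha = 1); Ridge uses alpha = 0, EN a value in between
  fit_l <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 1)
  p_l   <- predict(fit_l, x[test, ], s = "lambda.min", type = "response")
  auc_lasso[i] <- as.numeric(auc(y[test], as.numeric(p_l)))

  # Random Forest
  fit_rf <- randomForest(x[train, ], y[train])
  p_rf   <- predict(fit_rf, x[test, ], type = "prob")[, 2]
  auc_rf[i] <- as.numeric(auc(y[test], p_rf))
}

boxplot(data.frame(LASSO = auc_lasso, RF = auc_rf),
        main = "0.9n AUC testing")
```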

Cross-validation Curves

For one of the 50 samples, 10-fold cross-validation (CV) curves of misclassification error were generated for EN, LASSO, and Ridge; LASSO performed best in R1, but Ridge outperformed it in R2. Overall, the log(λ) ranges and runtimes were broadly similar, with EN, LASSO, and Ridge producing similarly shaped, upward-sloping (if somewhat jagged) curves.
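With glmnet, curves of this kind come straight from plotting a cv.glmnet fit; a sketch, reusing x, y, and train from the loop above (alpha = 0.5 for EN is an assumption):

```r
# 10-fold CV misclassification-error curves vs log(lambda)
for (a in c(0, 0.5, 1)) {   # Ridge, EN, LASSO
  cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                     type.measure = "class", nfolds = 10, alpha = a)
  plot(cvfit)
  title(paste("alpha =", a), line = 2.5)
}
```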

Performance and Runtimes

Upon observation, the average performance (training error rate) was about 0.422, with fast runtimes across the board. However, the AUCs of the linear models were no better than 50/50, so there was a slight trade-off between performance and runtime. RF consistently performed best, albeit with a runtime of over 3.50 seconds per fit, while SVM took the longest with minimal performance improvement.
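Runtimes of this kind can be captured with base R's system.time(); a sketch for a single RF fit:

```r
# Elapsed seconds for one RF fit (the text reports over 3.5 s)
rt <- system.time(randomForest(x[train, ], y[train]))["elapsed"]
rt
```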

Variable Importance

Standardizing the estimated coefficients allows variable-importance bar plots to be generated for LASSO, EN, and Ridge, alongside RF's built-in importance measure. Variables c69, c27, and c80 were the top 3 influencers for RF.
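A sketch of both importance views, reusing x and y from above; scaling each coefficient by its attribute's standard deviation is one common way to standardize, and is an assumption about the method used here:

```r
# RF importance (mean decrease in Gini / accuracy)
fit_rf <- randomForest(x, y, importance = TRUE)
varImpPlot(fit_rf)   # c69, c27, c80 topped this plot in the project

# Standardized glmnet coefficients as importance scores
fit_l <- cv.glmnet(x, y, family = "binomial", alpha = 1)
beta  <- as.matrix(coef(fit_l, s = "lambda.min"))[-1, 1]  # drop intercept
imp   <- beta * apply(x, 2, sd)
barplot(sort(imp, decreasing = TRUE)[1:10], las = 2,
        main = "Top standardized coefficients (LASSO)")
```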

Results

Variable Importance (Top 3)

  • The top 3 positive influencers are c85, c17, and c45
  • The top 3 negative influencers are c70, c81, and c80

However...

  • There are too many unknowns: it cannot be ascertained which specific funds and/or stocks affected performance,
  • Overall, the results are no better than a 50/50 coin toss, except for the RF and SVM AUCs.

Improvements can be made…

  • Focusing on RF & SVM (the best performers), or trying AdaBoost, stochastic gradient descent (SGD), k-nearest neighbors (kNN), or naive Bayes (NB),

… but the financial market is unpredictable!

  • This makes financial modeling decidedly more difficult to perform.

Closing Thoughts

While financial behavior can be predicted to a certain extent, and some insights can be gleaned from the Hedge Fund X dataset, performance may increase under different classification models, but probably not by much. As such, although improvements can surely be made, the data may be insufficient to adequately predict the future movement of financial products unless certain unknowns, such as what the products are and their timeline, are made available.
