Hedge Fund X: Financial Modeling Challenge

Michael S. Bonetti

Zicklin School of Business

CUNY Bernard M. Baruch College

Brief Description

This project applies classification models, written in R, to the Hedge Fund X dataset to analyze model performance, runtimes, and variable importance. The dataset, available on Kaggle, summarizes a heterogeneous set of financial products and was originally released for an ML competition hosted by Signate of Japan, circa 2017/2018.
https://www.kaggle.com/datasicencelab/hedge-fund-x-financial-modeling-challenge/version/1#

The dataset contains 10,000 observations and 91 attributes (all numerical): 88 predictive, 2 non-predictive, and 1 target variable (named target). For this project, a 10% random sample (RS) of 1,000 observations was taken, keeping all 88 predictive attributes. Two distinct runs were performed: one with 1,000 observations drawn under a fixed seed for reproducible results (R1), and another with 1,000 randomly chosen observations for varying results (R2); comparisons between the two runs appear throughout the PowerPoint slide deck.
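A minimal sketch of how the two samples might be drawn (the file name train.csv and the non-predictive column names are assumptions, not confirmed by the repo):

```r
# Load the full 10,000 x 91 dataset (file name assumed from the Kaggle page)
df <- read.csv("train.csv")

# Drop the 2 non-predictive columns (names here are hypothetical),
# leaving the 88 predictive attributes plus the target
df <- df[, !(names(df) %in% c("id", "era"))]

# Run 1 (R1): fixed seed, so the same 1,000 rows are drawn every time
set.seed(1)
r1 <- df[sample(nrow(df), 1000), ]

# Run 2 (R2): no fixed seed, so the 1,000 rows vary between executions
r2 <- df[sample(nrow(df), 1000), ]
```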

Data Pre-Processing & Imbalance Ratio (IR)

Fortunately, as all of the attributes were numerical and the dataset was relatively balanced, little (if any) data pre-processing was necessary. The imbalance ratio (IR), taken as the minority class count over the majority class count:

  • Original dataset: n0 = 4,994, n1 = 5,006, so n0 / n1 = 99.76%
  • Run 1 (R1): n0 = 512, n1 = 488, so n1 / n0 = 95.31%
  • Run 2 (R2): n0 = 473, n1 = 527, so n0 / n1 = 89.75%
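A one-liner along these lines reproduces the IR figures above (assuming the sampled data frame r1 from the earlier sketch):

```r
# Imbalance ratio: minority class count over majority class count
tab <- table(r1$target)
min(tab) / max(tab)   # e.g. 488 / 512 = 0.9531 for R1
```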

Additionally, some preliminary exploratory data analysis (EDA) boxplots were created to better visualize attribute importance and weight. The distribution of the target variable was also plotted; it appears slightly skewed, but otherwise normal.
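A minimal sketch of such EDA plots (the c-prefixed column names follow the variables cited later, e.g. c69 and c27; restricting to the first 10 columns is an arbitrary choice for illustration):

```r
library(ggplot2)
library(reshape2)

# Boxplots of a handful of attributes, split by target class
long <- melt(r1[, c("target", paste0("c", 1:10))], id.vars = "target")
ggplot(long, aes(x = variable, y = value, fill = factor(target))) +
  geom_boxplot() +
  labs(fill = "target")

# Distribution of the target variable
barplot(table(r1$target), main = "Target distribution")
```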

Creating Machine Learning (ML) Models

AUCs

A 90/10 training/testing split was used to fit 6 classification models (Logistic, LASSO, Elastic-net (EN), Ridge, Random Forest (RF), and radial Support Vector Machine (SVM)) 50 times across 50 samples. The AUCs and runtimes were recorded, with boxplots to visualize these results (a minimal sketch of the resampling loop follows the list below):

  • 0.9n AUC training
    • SVM and RF medians at, or near, 1
  • 0.9n AUC testing
    • Larger variances overall compared to the AUC training boxplots
    • Logistic, EN, LASSO, and Ridge medians are close to one another
    • SVM and RF medians are higher
  • 0.9n training errors
    • Near mirror-image of the AUC training boxplots, with EN slightly smaller
    • SVM and RF medians at, or near, 0
  • 0.9n testing errors
    • Larger variances overall; medians for SVM and RF are still lower
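Here is a minimal sketch of that resampling loop for two of the six models, LASSO (via glmnet) and RF; the remaining models follow the same pattern. The specific toolchain (cv.glmnet, randomForest, pROC::auc) is an assumption, not confirmed by the repo:

```r
library(glmnet)
library(randomForest)
library(pROC)

x <- as.matrix(r1[, setdiff(names(r1), "target")])
y <- factor(r1$target)
n <- nrow(x)

auc_lasso <- auc_rf <- numeric(50)

for (i in 1:50) {
  train <- sample(n, floor(0.9 * n))   # 90/10 split
  test  <- setdiff(seq_len(n), train)

  # LASSO (alpha = 1); Ridge uses alpha = 0, EN a value in between
  fit_l <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 1)
  p_l   <- predict(fit_l, x[test, ], s = "lambda.min", type = "response")
  auc_lasso[i] <- as.numeric(auc(y[test], as.numeric(p_l)))

  # Random Forest
  fit_rf <- randomForest(x[train, ], y[train])
  p_rf   <- predict(fit_rf, x[test, ], type = "prob")[, 2]
  auc_rf[i] <- as.numeric(auc(y[test], p_rf))
}

boxplot(data.frame(LASSO = auc_lasso, RF = auc_rf),
        main = "0.9n AUC testing")
```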

Cross-validation Curves

For one of the 50 samples, 10-fold cross-validation (CV) curves of misclassification error were generated for EN, LASSO, and Ridge; LASSO performed best in R1, but Ridge outperformed it in R2. Overall, the log(λ) ranges and runtimes were broadly similar, with EN, LASSO, and Ridge producing similarly shaped, upward-sloping (if somewhat jagged) curves.
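With glmnet, curves of this kind come straight from plotting a cv.glmnet fit; a sketch, reusing x, y, and train from the loop above (alpha = 0.5 for EN is an assumption):

```r
# 10-fold CV misclassification-error curves vs log(lambda)
for (a in c(0, 0.5, 1)) {   # Ridge, EN, LASSO
  cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                     type.measure = "class", nfolds = 10, alpha = a)
  plot(cvfit)
  title(paste("alpha =", a), line = 2.5)
}
```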

Performance and Runtimes

Upon observation, the average performance (training error rate) was about 0.422, with fast runtimes across the board. However, the AUCs of the linear models were no better than 50/50, so there was a slight trade-off between performance and runtime. RF consistently performed best, albeit with a runtime of over 3.50 seconds per fit, while SVM took the longest with minimal performance improvement.
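Runtimes of this kind can be captured with base R's system.time(); a sketch for a single RF fit:

```r
# Elapsed seconds for one RF fit (the text reports over 3.5 s)
rt <- system.time(randomForest(x[train, ], y[train]))["elapsed"]
rt
```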

Variable Importance

Standardizing the estimated coefficients allows variable-importance bar plots to be generated for LASSO, EN, and Ridge, alongside RF's built-in importance measure. Variables c69, c27, and c80 were the top 3 influencers for RF.
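A sketch of both importance views, reusing x and y from above; scaling each coefficient by its attribute's standard deviation is one common way to standardize, and is an assumption about the method used here:

```r
# RF importance (mean decrease in Gini / accuracy)
fit_rf <- randomForest(x, y, importance = TRUE)
varImpPlot(fit_rf)   # c69, c27, c80 topped this plot in the project

# Standardized glmnet coefficients as importance scores
fit_l <- cv.glmnet(x, y, family = "binomial", alpha = 1)
beta  <- as.matrix(coef(fit_l, s = "lambda.min"))[-1, 1]  # drop intercept
imp   <- beta * apply(x, 2, sd)
barplot(sort(imp, decreasing = TRUE)[1:10], las = 2,
        main = "Top standardized coefficients (LASSO)")
```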

Results

Variable Importance (Top 3)

  • The top 3 positive influencers are c85, c17, and c45
  • The top 3 negative influencers are c70, c81, and c80

However...

  • There are too many unknowns: it cannot be ascertained which specific funds and/or stocks affected performance,
  • Overall, the results are no better than a 50/50 coin toss, except for the RF and SVM AUCs.

Improvements can be made…

  • Focusing on RF & SVM (the best performers), or trying AdaBoost, stochastic gradient descent (SGD), k-nearest neighbors (kNN), or naive Bayes (NB),

… but the financial market is unpredictable!

  • This makes financial modeling decidedly more difficult to perform.

Closing Thoughts

While financial behavior can be predicted to a certain extent, and some insights can be gleaned from the Hedge Fund X dataset, performance may increase under different classification models, but probably not by much. As such, although improvements can surely be made, the data may be insufficient to adequately predict the future movement of financial products unless certain unknowns, such as what the products are and their timeline, are made available.
