-----------------------------------------------------------
## Dissertation Project: 
An Empirical Study on the Classification Performance of Deep Learning vs. Gradient Boosting 
on Heterogeneous Tabular Data

Author: Adam Mabrouk

Supervisor: Ben Ralph

Institution: University of Bath

Created on: 01/01/2024

-----------------------------------------------------------

## Due to size limitations, the datasets are currently missing from the following folders:

1. Data pipeline folder.
- The Tabular Data loader will not run without the engineered csv files.

2. Visualisation
- The visualisations that show class imbalance cannot be viewed without the raw datasets.

#### The folder list in this README follows the original structure of the complete project, which is available at the link below.
#### Please use this link to access the complete code: https://drive.google.com/drive/folders/1BBM3cF6YhyN1BKxOPfYVsnbIkohL5GRL?usp=drive_link

## Instructions

- **To begin using the notebooks** 
  - **Please run the 3 notebooks to assess the raw results that form the basis of the project's main hypothesis. The results are from the classifiers NODE, TabNet and XGBoost. NOTE: FFNN is not part of the study's main hypothesis and is only used for exploratory work. 'credit_default_model_data' is not used in this study; it is included for testing purposes only.**
  
  - `1. Run_code_classification_ablation_results`: Will provide the classification results.
  - `2. Run_code_model_log_loss`: Will provide the log loss results from the overfitting experiment.
  - `3. Run_code_time_results`: Will provide the time results. 
  
  - **Please note that the preliminary ablation work is exploratory and intended for further investigation; it is not part of the study's main hypothesis.**
  
- **To run the models, please access**
  - `Models_and_data`: Here you can run all the models, including FFNN. Instructions are provided in the code. The user can either use Optuna or carry out manual tuning. The user selects the dataset and runs the model. When running the scripts, a 'time.csv' file will appear, because the original files were manually moved during testing. An inventory of each folder is provided below.
  
  - `Data_pipeline`: If the user wants to make any alterations to the datasets, they need to access this folder. The user can access the ipynb notebook files labelled 01-06 should they wish to make any changes to the feature engineering. An additional dataset, 'credit_default_model_data', is also included here but is not used in this study. The `07_Tabular_data_hetero_preprocessor` notebook was made to promote transparency within this field by providing a standardised approach to comparative analysis. The user has the option to:
  
  - `1. Select categorical columns for either one-hot or label encoding.`
  - `2. Select the split ratio.`
  - `3. Select over- or undersampling.`
  - `4. Select embeddings.`
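
The first three options above can be illustrated with a minimal, standard-library-only sketch. This is not the code in `Tabular_loader_class`; the function names, the toy column values and the choice of random oversampling are all assumptions made purely for illustration.

```python
import random
from collections import Counter

def label_encode(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return [[int(v == cat) for cat in categories] for v in values], categories

def train_test_split_indices(n_rows, test_ratio, seed=0):
    """Shuffle the row indices and split them by the chosen ratio."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_ratio))
    return indices[n_test:], indices[:n_test]

def random_oversample(labels, seed=0):
    """Return row indices in which every class is resampled up to the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw extra rows with replacement until the class reaches the target size.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Toy data (hypothetical values, not taken from the study's datasets).
home_ownership = ["RENT", "OWN", "MORTGAGE", "RENT", "RENT", "OWN"]
labels = [0, 0, 0, 0, 1, 1]

encoded, mapping = label_encode(home_ownership)
one_hot, categories = one_hot_encode(home_ownership)
train_idx, test_idx = train_test_split_indices(len(labels), test_ratio=0.33)
balanced_idx = random_oversample(labels)

print(mapping)
print(categories, one_hot[0])
print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in balanced_idx))
```

In a real pipeline, resampling would normally be applied to the training split only, so that the validation and test sets keep their original class balance.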

## Contents Folder 

Each ablation folder contains 10 csv files. Preliminary ablation folders contain only 3 csv files.

- **Ablation**: Carried out to understand how the models function
  - **NODE Model Ablation Studies** (See Chapters 5 and 6 for details)
      - `ablation_node`: Contains the results of NODE model ablation experiments.
      - `entmax`: Ablation results for the entmax activation function.
      - `gumbel_softmax`: Ablation results for the Gumbel Softmax technique.
      - `low_depth`: Ablation results for experiments with reduced depth.
      - `softmax`: Ablation results for the softmax activation function.
      - `sparsemax`: Ablation results for the sparsemax activation function.
      - `tree_increase`: Ablation results for experiments with increased tree complexity.
  - **TabNet Model Ablation Studies** (Refer to Chapters 5 and 6)
      - `ablation_tabnet`: Contains the results of TabNet model ablation experiments.
      - `glu`: Ablation results for the Gated Linear Unit (GLU) activation.
      - `mish`: Ablation results for the Mish activation function.
      - `relaxation_factor`: Ablation results for experiments with varied relaxation factors.
      - `relu`: Ablation results for the ReLU activation function.
      - `sparse_loss_strength`: Ablation results for experiments with different strengths of sparse loss.
  - **Preliminary Ablation Studies for TabNet** (Discussed in Chapter 6 and Appendix C)
      - `preliminary_ablation_tabnet`: Contains preliminary ablation study results for the TabNet model.
      - `Batch_size`: Ablation results for experiments with different batch sizes.
      - `feature_dimension`: Ablation results for experiments with varied feature dimensions.
      - `high_Lambda`: Ablation results for experiments with high Lambda regularization strength.


- **Data Pipeline**
  - **Cleaned Data**
    - `cleaned data`: Cleaned (basic pre-processing) Heloc and Lending Club csv files.
    - `credit_default_model_data`: This data is not used in this study but is available for further research. This
    folder contains the test, train and validation datasets for X and y.
    - `feature_engineered_model_data`: This folder contains the Heloc, Credit Default, Lending Club and Adult
    Income (also known as Income Evaluation) datasets after feature engineering.
    - `heloc_model_data`: The Heloc dataset, ready to be fed into the model (hence the name "model"). It
    contains the test, train and validation datasets for X and y.
    - `income_evaluation_model_data`: The Income Evaluation dataset (also known as Adult Income (AI)), ready
    to be fed into the model (hence the name "model"). It contains the test, train and validation datasets
    for X and y.
    - `lending_club_model_data`: The Lending Club (LC) dataset, ready to be fed into the model (hence the
    name "model"). It contains the test, train and validation datasets for X and y.
    - `raw_datasets`: This folder contains the raw datasets for Heloc, Credit Default, Lending Club and Adult
    Income before any pre-processing.
  - **Data Pipeline Notebooks** 
    - `01_lending_club_cleaner`: This notebook cleans the LC dataset.
    - `02_lending_club_feature_engineering`: This notebook applies feature engineering methods to the LC 
    dataset.
    - `03_heloc_data_cleaner`: This notebook cleans the Heloc dataset.
    - `04_heloc_feature_engineering`: This notebook applies feature engineering to the Heloc dataset.
    - `05_default_of_credit_cards_feature_engineered`: This notebook applies feature engineering to the credit
    card default dataset, which is NOT used in this study and is included only for additional research.
    - `06_income_evaluation_feature_engineering`: This notebook applies feature engineering to the income 
    evaluation (Adult Income) dataset. 
    - `07_Tabular_data_hetero_preprocessor`: This notebook is part of a novel design strategy and operates
    with `Tabular_loader_class`. Datasets can be added for further research.
    - `Tabular_loader_class`: This Python script contains the pre-processing steps applied before the data is
    fed into the models. The pre-processing steps are optional.

- **Log_Loss_and_Validation_AUPRC_Results**

Each folder contains the results from the overfitting experiment, measured using log loss and AUPRC, and contains 15 csv files.

  - **node**: 
    - `node_adult_income`
    - `node_heloc`
    - `node_lending club`
  - **tabnet**: 
    - `tabnet_adult_income`
    - `tabnet_heloc`
    - `tabnet_lending_club`
  - **xgboost**: 
    - `xgboost_adult_income`
    - `xgboost_heloc`
    - `xgboost_lending_club`
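
For readers unfamiliar with the two metrics recorded in these folders, the sketch below computes binary log loss and AUPRC (as average precision) in plain Python. It is an illustration only and is not the project's `Results.py`; the toy labels and probabilities are invented, and the average-precision function assumes there are no tied scores. Libraries such as scikit-learn provide equivalents (`log_loss`, `average_precision_score`).

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at the rank of each recalled positive
    return ap / n_pos

# Toy labels and predicted probabilities (illustrative values only).
y_train, p_train = [1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]
y_val, p_val = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]

train_loss = log_loss(y_train, p_train)
val_loss = log_loss(y_val, p_val)
val_auprc = average_precision(y_val, p_val)

# A validation loss well above the training loss is one common sign of overfitting.
print(f"train log loss: {train_loss:.4f}")
print(f"validation log loss: {val_loss:.4f}")
print(f"validation AUPRC: {val_auprc:.4f}")
```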

- **Model_results**

Each folder contains the results from the classification experiment; 15 csv files are found in each folder.

  - **node_results**: 
    - `node_adult_income`
    - `node_heloc`
    - `node_lending club`
  - **tabnet_results**: 
    - `tabnet_adult_income`
    - `tabnet_heloc`
    - `tabnet_lending_club`
  - **xgboost_results**: 
    - `xgboost_adult_income`
    - `xgboost_heloc`
    - `xgboost_lending_club`

- **Model_results_for_further_testing**

These results are not presented in the study; they are provided for researchers who want to continue testing the classifiers and/or expand on the experiments presented.

  - **node**: csv files from the classification results
    - `node_adult_income`
    - `node_heloc`
    - `node_lending club`
  - **tabnet**: csv files from the classification results
    - `tabnet_adult_income`
    - `tabnet_heloc`
    - `tabnet_lending_club`
  - **xgboost**: csv files from the classification results
    - `xgboost_adult_income`
    - `xgboost_heloc`
    - `xgboost_lending_club`

- **Models_and_data**

These folders contain the models used in all the experiments, including the FFNN model. Use one of the four notebooks provided (FFNN, NODE, TabNet, XGBoost) to run a model. A time csv file will appear after each run, because the original files were manually moved during testing.
  
  - *`datasets`*: Used to run the models
  - *`tensor_logs`*: These tensor logs are from NODE and TabNet
  - *`data_loader.py`*: Data loader uses a random shuffle, extracting 5k to load into the model
  - *`feed_forward_network_model.py`*: Baseline classifier for the exploratory phase
  - *`feed_forward_network_model.ipynb`*: Baseline classifier for the exploratory phase
  - *`model_training.py`*: This python script is used to train all the models
  - *`neural_oblivious_decision_ensembles.ipynb`*: NODE notebook
  - *`node_entmax_implementation.py`*: Entmax activation function used in ablation testing for NODE
  - *`node_model`*: NODE Python script, which provides comments and details of the NODE algorithm.
  - *`requirements.txt`*: Applied to all models
  - *`Results.py`*: Results script used for all models
  - *`tabnet_model.py`*: TabNet model python script
  - *`TabNet.ipynb`*: TabNet notebook
  - *`xgboost_model.py`*: Benchmark model XGBoost python script
  - *`xgboost.ipynb`*: Benchmark model XGBoost notebook
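
The shuffle-and-subsample behaviour described for `data_loader.py` can be sketched as follows. This is a hypothetical illustration rather than the script itself: the function name, the `seed` parameter and the interpretation of "5k" as 5,000 rows are assumptions.

```python
import random

def sample_rows(rows, n_samples=5000, seed=None):
    """Shuffle the rows and keep at most n_samples of them."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:n_samples]

# Toy example: 20,000 row indices standing in for a dataset.
subset = sample_rows(range(20000), n_samples=5000, seed=42)
print(len(subset))
```

Passing a fixed `seed` makes the subset reproducible between runs; leaving it as `None` gives a different subset each time.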

- **Shap_pictures**

Each folder contains 15 pictures of SHAP features from each model's output in the classification experiment explained in Chapters 4 and 5. The 5 most important SHAP features were counted manually, after which a mean was calculated for each model. See Chapter 5 for further details.

  - **node_shap**: 
    - `node_shap_adult_income`
    - `node_shap_lending_club`
  - **tabnet_shap**: 
    - `tabnet_adult_income_shap`
    - `Tabnet_shap_lending_club`
  - **xgboost_shap**: 
    - `xgboost_shap_adult_income`
    - `xgboost_shap_lending_club`
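
The counting described above was done by hand from the pictures. As a hedged sketch of how a similar tally could be automated, the code below ranks features by mean absolute SHAP value in each run and counts how often each feature reaches the top k. The ranking criterion, the function names and the toy values are assumptions and do not reproduce the study's procedure.

```python
from collections import Counter

def top_k_features(shap_values, feature_names, k=5):
    """Rank features by mean absolute SHAP value and return the k highest."""
    n_rows = len(shap_values)
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(importance, feature_names), reverse=True)
    return [name for _, name in ranked[:k]]

def count_top_features(runs, feature_names, k=5):
    """Count how many runs place each feature among the top k."""
    counts = Counter()
    for shap_values in runs:
        counts.update(top_k_features(shap_values, feature_names, k))
    return counts

# Toy SHAP matrices (rows = samples, columns = features); values are invented.
feature_names = ["a", "b", "c"]
run_1 = [[0.5, -0.1, 0.2], [0.3, 0.0, -0.4]]
run_2 = [[0.1, 0.6, 0.0], [-0.3, 0.2, 0.1]]

counts = count_top_features([run_1, run_2], feature_names, k=2)
print(counts)
```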

- **Time_results**

Each folder contains 10 csv files of the timed runs carried out on each model. 

  - *`time_node`*: 
  - *`time_tabnet`*: 
  - *`time_xgboost`*: 

- **Visualisation**

The csv files are used to output the visualisation graphs for shap and the imbalanced classes in the datasets. 

  - **`csv files`**: Used to produce the visualisations
  - *`shap_visual_lending_club_and_adult_income`*: notebook
  - *`Visualisation_class_balance_heloc`*: notebook
  - *`Visualisation_class_balance_income_eval`*: notebook
  - *`Visualisation_class_balance_lend_club`*: notebook

- **Notebook documents**

These notebooks are to be run by the user to assess and cross-reference the results of all experiments. Please open each notebook and run it in Jupyter.

  - **`README`**: Instructions 
  - *`Run_code_classification_ablation_results`*: Run the ipynb notebook to assess results 
  - *`Run_code_model_log_loss`*: Run the ipynb notebook to assess results 
  - *`Run_code_time_results`*: Run the ipynb notebook to assess results