In [9]:
from analysis_article.true_positive_ratio import true_positive_ratio
from analysis_article.paths_importance import paths_importance_analysis
from analysis_article.cross_validation_overfitting_iteration import cross_validation
from analysis_article.cross_validation_overfitting_iteration import patience_cross_validation
from analysis_article.signal_to_noise import signal_to_noise


[0]	validation_0-rmse:0.00847
Step number  123
[0]	validation_0-rmse:0.01752
Step number  103
i
0
Creating a new labels for 5k dataset
[0]	validation_0-rmse:0.01359
[0]	validation_0-rmse:2.55568


KeyboardInterrupt: 

# Introduction

Welcome to the accompanying Jupyter Notebook for our published paper. This interactive notebook is designed to allow you to reproduce the results presented in our study with ease. By following the instructions and running the provided code cells, you can verify the findings and explore the data and methodologies used in our research.

Please be aware that due to the computational intensity of some of the analyses, the Jupyter Notebook environment may occasionally encounter difficulties in executing the entire workflow. If you encounter such issues, we have provided a robust alternative to ensure you can still replicate the results.

Within the directory of this notebook, locate the function files that the Jupyter Notebook is intended to run. At the end of each of these files, there is a section of commented-out code. This code is identical to what is executed within the Jupyter Notebook. To proceed, simply uncomment this code block and run the file as a standalone script in your preferred Python environment. By doing so, you should be able to achieve the same outputs as those intended within the Jupyter Notebook setup without any compromise in the results.

If `save_fig` is `True`, the plots are saved in the `analysis_article` folder.

We hope this notebook enhances your understanding of our work, and we encourage you to reach out with any questions or feedback you may have.

# Introduction to True Positive Ratio Plotting with Synthetic Dataset Analysis

In this snippet of code we plot the true positive ratio using a synthetic dataset. Our objective is to gauge the efficacy of an algorithm known for its path selection capabilities.

## Parameters Description:

- `number_of_simulations` (Default: 200): This parameter controls the number of simulation iterations for the algorithm. To secure a robust measure of algorithm performance, we average the results over these simulations, defaulting to 200 unless adjusted according to user requirements.

- `synthetic_dataset_scenario`: Selects the scenario for synthetic dataset generation, each designed to challenge the algorithm in different ways, reflecting various possible real-life conditions.

## True Positive Ratio (TPR):
The True Positive Ratio is indicative of the algorithm's accuracy. It is computed at each iteration as the number of correctly selected paths divided by the total number of selected paths.
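
As a rough illustration of the ratio described above, the following sketch computes it for two toy simulations and averages the results. The function name `true_positive_ratio_sketch`, the representation of a path as a tuple of node labels, and the choice to return 0 when nothing is selected are all assumptions made for this example; this is not the code in `analysis_article.true_positive_ratio`.

```python
from statistics import mean


def true_positive_ratio_sketch(selected_paths, true_paths):
    """Fraction of the selected paths that belong to the set of true paths."""
    selected = set(selected_paths)
    if not selected:
        return 0.0  # assumption: an empty selection is scored as 0
    return len(selected & set(true_paths)) / len(selected)


# Hypothetical ground truth: paths are tuples of node labels.
true_paths = [(1,), (1, 2), (1, 2, 3)]

# Hypothetical selections made in two independent simulations.
simulations = [
    [(1,), (1, 2), (4,)],  # two of the three selected paths are correct
    [(1,), (1, 2, 3)],     # both selected paths are correct
]

ratios = [true_positive_ratio_sketch(s, true_paths) for s in simulations]
average_ratio = mean(ratios)  # averaged over simulations, as with number_of_simulations
```

Averaging over many simulations, as `number_of_simulations` does, reduces the dependence of the reported ratio on a single noisy draw of the synthetic labels.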

## Synthetic Dataset:
Through the `synthetic_dataset_scenario` parameter, the user can decide which setting to replicate.


## Note on Path Boosting Methodologies:
It is important to note the distinction in path boosting methods applied in different scenarios. For scenarios 1 and 2, where only a single metal center is considered, path boosting is employed in its standard form. However, scenario 3 is unique in that it applies cyclic path boosting, taking into account the multiple metal centers that influence the path selection process in more complex ways. This specificity in methodology is crucial to accurately modeling and analyzing each scenario's corresponding dataset.




In [10]:
true_positive_ratio(number_of_simulations=200, synthetic_dataset_scenario=1, noise_variance=0.2,
                    maximum_number_of_steps=None, save_fig=False, show_settings=True)

NameError: name 'true_positive_ratio' is not defined

# Paths Importance
Computes the importance of the paths. The settings are the same as before; in addition, the parameter `update_features_importance_by_comparison` selects the importance measure.

## Importance measure:
If `update_features_importance_by_comparison` is `True`, the importance of each selected column is given by the error improvement of the selected column compared with the second-best choice.
If `update_features_importance_by_comparison` is `False`, the importance of the selected column is given by the overall error improvement.
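
The difference between the two measures can be sketched as follows. The function `importance_gain`, its arguments, and the fallback used when only one candidate exists are hypothetical and only illustrate the description above; the actual computation in `paths_importance_analysis` may differ.

```python
def importance_gain(error_before, candidate_errors, by_comparison):
    """Return the selected column and its importance for one boosting step.

    error_before: error of the model before the step.
    candidate_errors: dict mapping each candidate column to the error
        obtained if that column were selected (hypothetical input).
    by_comparison: mirrors `update_features_importance_by_comparison`.
    """
    ranked = sorted(candidate_errors.items(), key=lambda item: item[1])
    best_column, best_error = ranked[0]
    if by_comparison and len(ranked) > 1:
        # Improvement of the selected column over the second-best choice.
        second_best_error = ranked[1][1]
        gain = second_best_error - best_error
    else:
        # Overall improvement with respect to the previous error.
        # Assumption: also used as a fallback when there is a single candidate.
        gain = error_before - best_error
    return best_column, gain


candidates = {"path_a": 0.6, "path_b": 0.7, "path_c": 0.9}
column_cmp, gain_cmp = importance_gain(1.0, candidates, by_comparison=True)
column_all, gain_all = importance_gain(1.0, candidates, by_comparison=False)
```

With the comparison-based measure, a column scores highly only when it clearly beats the runner-up, so two nearly interchangeable columns both receive a small importance.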



In [None]:
paths_importance_analysis("5k_synthetic_dataset", number_of_simulations=200, synthetic_dataset_scenario=2,
                          noise_variance=0.2, maximum_number_of_steps=None,
                          update_features_importance_by_comparison=True, show_settings=True)

## Cross Validation
Performs cross validation to find the optimal `maximum_number_of_steps`.
`patience` is the number of consecutive steps with an increasing cross validation test error after which we consider the algorithm to be overfitting.
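
A minimal sketch of how such a patience rule could be applied to a list of test errors is given below. The function `overfitting_step` is hypothetical, and the choice to report the last step before the run of increases (rather than the step at which the run is detected) is an assumption made for this example.

```python
def overfitting_step(test_errors, patience):
    """Return the index of the step after which overfitting is assumed to start.

    The rule looks for `patience` consecutive increases of the test error.
    If no such run exists, the last step is returned.
    """
    consecutive_increases = 0
    for step in range(1, len(test_errors)):
        if test_errors[step] > test_errors[step - 1]:
            consecutive_increases += 1
            if consecutive_increases >= patience:
                # Assumption: report the last step before the run of increases.
                return step - patience
        else:
            consecutive_increases = 0
    return len(test_errors) - 1


# Toy test error curve: decreases, then rises three times, then drops again.
errors = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0]
step_with_patience_3 = overfitting_step(errors, patience=3)
step_with_patience_4 = overfitting_step(errors, patience=4)
```

In the toy curve, a patience of 3 stops at the local minimum, while a patience of 4 tolerates the temporary rise and continues to the end; this is the trade-off the `patience` parameter controls.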

In [None]:
cross_validation(number_of_simulations=15, k_folds=5, scenario=1, patience=3, dataset_name="5k_synthetic_dataset",
                 noise_variance=0.2, maximum_number_of_steps=None, save_fig=False, use_wrapper_boosting=None,
                 show_settings=True)


### Cross Validation patience analysis
After running cross validation, a file containing the test errors of the k-fold cross validation is created. Here it is possible to study how the selected overfitting point moves with different values of `patience`.

In [None]:
patience_cross_validation(
    file_path="/Users/popcorn/PycharmProjects/pattern_boosting/results/cross_validation/Xgb_step_1800_max_path_length_5_60k_dataset_gbtree_999999/wrapped_boosting/test_errors_cross_validation_list.pkl",
    patience_range=range(5, 100, 5))

## Signal to noise ratio
`noise_variance_list` contains the variances of the normal distributions used to generate the noise for the synthetic dataset.
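
To make the role of the variance concrete, the sketch below adds zero-mean Gaussian noise to a toy list of labels and computes a signal-to-noise ratio for each variance. The helper names are hypothetical, and defining the ratio as the variance of the noise-free labels divided by the noise variance is an assumption; the `signal_to_noise` function may use a different definition.

```python
import random
import statistics


def add_gaussian_noise(labels, noise_variance, seed=0):
    """Return the labels perturbed by zero-mean normal noise of the given variance."""
    rng = random.Random(seed)
    sigma = noise_variance ** 0.5  # random.gauss expects a standard deviation
    return [y + rng.gauss(0.0, sigma) for y in labels]


def signal_to_noise_ratio(labels, noise_variance):
    """Assumed definition: variance of the clean labels over the noise variance."""
    return statistics.pvariance(labels) / noise_variance


clean_labels = [0.0, 1.0, 2.0, 3.0, 4.0]  # toy noise-free labels
noise_variance_list = [0.2, 0.5, 0.8]

noisy_labels = {v: add_gaussian_noise(clean_labels, v) for v in noise_variance_list}
ratios = {v: signal_to_noise_ratio(clean_labels, v) for v in noise_variance_list}
```

Under this definition, larger entries in `noise_variance_list` give a lower ratio, which corresponds to a harder selection problem for the algorithm.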

In [None]:
signal_to_noise(number_of_simulations=200,
                noise_variance_list=[0.2, 0.5, 0.8, 1.1, 1.4, 1.7],
                # [0.2, 0.325, 0.5, 0.625, 0.75, 0.875, 1, 1.125, 1.25, 1.375, 1.5, 1.625]
                synthetic_dataset_scenario=1,
                dataset_name="5k_synthetic_dataset", maximum_number_of_steps=None,
                save_fig=True, use_wrapper_boosting=None, show_settings=True)

## Path Boosting
Here you can run path boosting and obtain the main analysis of the results.
Since parallelization is needed here, the algorithm cannot be run directly in the Jupyter Notebook; the following cell just launches the file [path_boosting.py](path_boosting.py). To modify the settings, edit [settings.py](../settings.py) in the main folder. Because of the Jupyter Notebook running environment, setting up parallel processing to launch the file can be quite complicated, in particular for multiprocessing; in that case we suggest running [path_boosting.py](path_boosting.py) directly from the console.
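
One possible way to run the file from the console route without leaving Python is to start it in a separate interpreter process. The helper `run_script` below is a hypothetical sketch built on the standard `subprocess` module; it is not part of the repository, and whether it avoids every multiprocessing issue in a given environment is not guaranteed.

```python
import subprocess
import sys


def run_script(script_path, *args):
    """Run a Python script in a fresh interpreter and return the completed process.

    The script is executed as a standalone program, outside the notebook
    kernel, using the same Python executable as the current session.
    """
    command = [sys.executable, str(script_path), *map(str, args)]
    return subprocess.run(command, capture_output=True, text=True, check=False)


# Example usage (uncomment to launch the script):
# result = run_script("path_boosting.py")
# print(result.stdout)
# print(result.stderr)
```

Checking `result.returncode` and `result.stderr` after the call shows whether the script finished successfully, since `check=False` means a failure does not raise an exception.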

In [13]:

%run 'path_boosting.py'

Number of CPU's:  10
Dataset name:  5k_synthetic_dataset
Creating a new labels for 5k dataset
Splitting the dataset
Splitting the dataset
[0]	validation_0-rmse:3.32264
[0]	validation_0-rmse:1.49982[0]	validation_0-rmse:6.46860

[0]	validation_0-rmse:1.62534
[0]	validation_0-rmse:2.31699
[0]	validation_0-rmse:3.97699
[0]	validation_0-rmse:2.06944
[0]	validation_0-rmse:1.47794
[0]	validation_0-rmse:1.80273[0]	validation_0-rmse:1.91690

computing test error
computing test error
computing test error
computing test error
[0]	validation_0-rmse:2.34583
computing test error
[0]	validation_0-rmse:2.98663
[0]	validation_0-rmse:2.68685
computing test error
computing test error
computing test error
computing test error
computing test error
[0]	validation_0-rmse:2.16601
computing test error
computing test error
computing test error
computing test error
len final test error 350
final test error:
 0.71949810277156
Saving location:
/Users/popcorn/PycharmProjects/pattern_boosting/results/Xgb_step_350_m

AttributeError: module '__main__' has no attribute '__spec__'