<center> <h2>Projet Finance Quantitative</h2> </center> <br>
<center> <h3>Master 2 MoSEF Data Science - Université Paris 1 Panthéon-Sorbonne</h3> </center> <br>
<center> <h3><b>Genetic Algorithm-optimized Triple Barrier Labeling for Predictive Stock Trading Using GBM Stacking</b></h3> </center> <br>
<center> <h3>Louis LEBRETON</h3> </center> <br>

# Prédiction des labels *buy*, *hold* et *sell*

## Optimisation des modèles GBMs

Dans un premier temps, j'optimise les modèles **XGBoost**, **LGBM**, et **CatBoost** sur un échantillon d'entraînement en utilisant une approche d'optimisation bayésienne. Cette méthode permet d'ajuster efficacement les hyperparamètres pour améliorer les performances des modèles.

## Optimisation du classificateur Softmax

Dans un second temps, j'optimise le métaclassificateur **Softmax** sur un échantillon de validation distinct afin d'éviter le surapprentissage (*overfitting*). Cette étape est également réalisée via une optimisation bayésienne.

## Stratégies de trading

Les données sont divisées en deux datasets distincts, chacun labellisé pour répondre à une stratégie de trading spécifique :

1. **Stratégie High Risk, High Profit**  
   - Objectif : Maximisation de 0.7 * profit - 0.3 * maximum drawdown.

2. **Stratégie Low Risk, Low Profit**  
   - Objectif : Maximisation de 0.3 * profit - 0.7 * maximum drawdown.

## Évaluation des performances

Une fois les modèles optimisés, j'évalue et compare les performances des prédictions en mesurant plusieurs métriques clés :  
- **Profit**  
- **Maximum drawdown**  
- **Autres indicateurs pertinents**  


## Période d'analyse

Les données utilisées couvrent une période de cinq années, tout en excluant les impacts liés à la pandémie de COVID-19. Les années analysées sont les suivantes : **2018, 2019, 2022, 2023 et 2024**.


In [11]:
import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, ParameterGrid
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from bayes_opt import BayesianOptimization

from prediction.GBM_stacking import GBMStacking

from df_building.get_labels.equity_strategy import EquityStrategy

In [12]:
risk_profile = 'HRHP' # or LRLP

In [13]:
if risk_profile == 'HRHP':
    risk_profile_type_dict = {'weight_p':0.7, 'weight_mdd':0.3} # High risk High profit
else:
    risk_profile_type_dict = {'weight_p':0.3, 'weight_mdd':0.7} # Low risk Low profit

In [14]:
# importation des données
df_train = pd.read_csv(f"../data/train_{risk_profile}.csv", index_col=0)
df_test = pd.read_csv(f"../data/test_{risk_profile}.csv", index_col=0)
df_train, df_valid = train_test_split(df_train, test_size=0.2, shuffle=True, random_state=1111)

# renommage du label bail de -1 à 2 pour être compatible avec le modèle
df_train[df_train['tbm label'] == -1]  = 2
df_test[df_test['tbm label'] == -1]  = 2

X_train, X_valid, X_test = df_train.drop(columns=['tbm label']), df_valid.drop(columns=['tbm label']), df_test.drop(columns=['tbm label'])
y_train, y_valid, y_test = df_train['tbm label'].copy(), df_valid['tbm label'].copy(), df_test['tbm label'].copy()

# Prédiction des labels

### Optimisation des GBMs

#### Optimisation du XGBoost

In [None]:
def xgb_evaluate(max_depth, learning_rate, n_estimators, gamma, min_child_weight, subsample, colsample_bytree):
    """
    evaluation d'un xgboost
    """
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)
    num_classes = len(np.unique(y_train))

    # modèle avec les hyperparamètres
    model = XGBClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        gamma=gamma,
        min_child_weight=min_child_weight,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        objective='multi:softprob',
        num_class = num_classes,
        random_state=111
    )
    
    # cross val -> score
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
    return np.mean(scores)

In [28]:
param_bounds = {
    'max_depth': (3, 10),
    'learning_rate': (0.01, 0.3),
    'n_estimators': (50, 300),
    'gamma': (0, 5),
    'min_child_weight': (1, 10),
    'subsample': (0.5, 1),
    'colsample_bytree': (0.5, 1),
}


xgb_optimizer = BayesianOptimization(
    f=xgb_evaluate,
    pbounds=param_bounds,
    random_state=111,
    verbose=2,
)


xgb_optimizer.maximize(init_points=5, n_iter=25)

print("meilleurs hyperparamètres trouvés :")
print(xgb_optimizer.max)

|   iter    |  target   | colsam... |   gamma   | learni... | max_depth | min_ch... | n_esti... | subsample |
-------------------------------------------------------------------------------------------------------------
| [39m1        [39m | [39mnan      [39m | [39m0.8061   [39m | [39m0.8453   [39m | [39m0.1365   [39m | [39m8.385    [39m | [39m3.658    [39m | [39m87.29    [39m | [39m0.5112   [39m |


Traceback (most recent call last):
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 139, in __call__
    score = scorer._score(
            ^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 376, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\utils\_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_ranking.py", line 640, in roc_auc_score
    return _average_binary_score(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\leb

| [39m2        [39m | [39mnan      [39m | [39m0.7101   [39m | [39m1.193    [39m | [39m0.1079   [39m | [39m9.935    [39m | [39m3.14     [39m | [39m70.3     [39m | [39m0.8348   [39m |


Traceback (most recent call last):
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 139, in __call__
    score = scorer._score(
            ^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 376, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\utils\_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_ranking.py", line 640, in roc_auc_score
    return _average_binary_score(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\leb

| [39m3        [39m | [39mnan      [39m | [39m0.8106   [39m | [39m1.371    [39m | [39m0.1452   [39m | [39m3.829    [39m | [39m1.666    [39m | [39m275.2    [39m | [39m0.897    [39m |
| [39m4        [39m | [39mnan      [39m | [39m0.9203   [39m | [39m4.076    [39m | [39m0.2974   [39m | [39m7.041    [39m | [39m8.324    [39m | [39m155.3    [39m | [39m0.5137   [39m |
| [39m5        [39m | [39mnan      [39m | [39m0.7271   [39m | [39m0.5266   [39m | [39m0.247    [39m | [39m7.884    [39m | [39m6.088    [39m | [39m118.6    [39m | [39m0.9992   [39m |


Traceback (most recent call last):
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 139, in __call__
    score = scorer._score(
            ^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 376, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\utils\_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lebre\OneDrive\Bureau\Finance Quant_S9\Projet_QF\Scripts\.venv\Lib\site-packages\sklearn\metrics\_ranking.py", line 640, in roc_auc_score
    return _average_binary_score(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\leb

ValueError: Input y contains NaN.

#### Optimisation du LightGBM

In [None]:
def lgbm_evaluate(num_leaves, max_depth, learning_rate, n_estimators, min_child_weight, subsample, colsample_bytree):
    """
    evaluation d'un lgbm
    """
    num_leaves = int(num_leaves)
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)
    
    # modèle avec les hyperparamètres
    model = LGBMClassifier(
        num_leaves=num_leaves,
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        min_child_weight=min_child_weight,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        random_state=111
    )
    
    # cross val -> score
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')

    return np.mean(scores)

In [None]:
lgbm_param_bounds = {
    'num_leaves': (20, 100),
    'max_depth': (3, 15),
    'learning_rate': (0.01, 0.3),
    'n_estimators': (50, 300),
    'min_child_weight': (1, 10),
    'subsample': (0.5, 1),
    'colsample_bytree': (0.5, 1),
}


lgbm_optimizer = BayesianOptimization(
    f=lgbm_evaluate,
    pbounds=lgbm_param_bounds,
    random_state=111,
    verbose=2
)


lgbm_optimizer.maximize(init_points=5, n_iter=25)


print("meilleurs hyperparamètres lgbm trouvés :")
print(lgbm_optimizer.max)

#### Optimisation du Catboost

In [None]:
def catboost_evaluate(depth, learning_rate, iterations, l2_leaf_reg, subsample):
    """
    evaluation d'un catboost
    """
  
    depth = int(depth)
    iterations = int(iterations)
    
    # modèle avec les hyperparamètres
    model = CatBoostClassifier(
        depth=depth,
        learning_rate=learning_rate,
        iterations=iterations,
        l2_leaf_reg=l2_leaf_reg,
        subsample=subsample,
        loss_function='multi:softprob',
        verbose=0,
        random_state=111
    )
    
    # cross val -> score
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')

    return np.mean(scores)

In [None]:
catboost_param_bounds = {
    'depth': (3, 10),
    'learning_rate': (0.01, 0.3),
    'iterations': (50, 300),
    'l2_leaf_reg': (1, 10),
    'subsample': (0.5, 1),
}

catboost_optimizer = BayesianOptimization(
    f=catboost_evaluate,
    pbounds=catboost_param_bounds,
    random_state=111,
    verbose=2
)

catboost_optimizer.maximize(init_points=5, n_iter=25)


print("meilleurs hyperparamètres catboost trouvés :")
print(catboost_optimizer.max)

### Optimisation du Metaclassifier

In [None]:
def metaclassifier_evaluate(C, penalty, multi_class, solver):
    """
    evaluation d'un catboost
    """
  
    depth = int(depth)
    iterations = int(iterations)
    
    gbm_stacking_model = GBMStacking(models_to_use=('catboost', 'lightgbm', 'xgboost'),
                                catboost_parameters=catboost_optimizer.max,
                                lightgbm_parameters=lgbm_optimizer.max,
                                xgboost_parameters=xgb_optimizer.max,
                            logistic_regression_parameters={'C': C, 
                                                            'penalty': penalty, 
                                                            'multi_class': multi_class, 
                                                            'solver': solver})
    gbm_stacking_model.fit(X_valid, y_valid)
    predictions = gbm_stacking_model.predict(X_valid)

    # cross val -> score
    scores = cross_val_score(gbm_stacking_model, X_train, y_train, cv=3, scoring='roc_auc')

    return np.mean(scores)

In [None]:
metaclassifier_param_bounds = {
    'C': (0.01, 10),
}

metaclassifier_optimizer = BayesianOptimization(
    f=metaclassifier_evaluate,
    pbounds=metaclassifier_param_bounds,
    random_state=111,
    verbose=2
)

metaclassifier_optimizer.maximize(init_points=5, n_iter=25)


print("meilleurs hyperparamètres metaclassifier trouvés :")
print(metaclassifier_optimizer.max)

### Evaluation des modèles

In [None]:
gbm_stacking_model = GBMStacking(models_to_use=('catboost', 'lightgbm', 'xgboost'),
                                catboost_parameters=catboost_optimizer.max,
                                lightgbm_parameters=lgbm_optimizer.max,
                                xgboost_parameters=xgb_optimizer.max,
                            logistic_regression_parameters=metaclassifier_optimizer.max)
gbm_stacking_model.fit(X_test, y_test)
y_test_pred = gbm_stacking_model.predict(X_test)
y_test_proba = gbm_stacking_model.predict_proba(X_test)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report


accuracy = accuracy_score(y_test, y_test_pred)
print(f"Accuracy : {accuracy:.2f}")

f1 = f1_score(y_test, y_test_pred, average='weighted')
print(f"F1 Score : {f1:.2f}")

conf_matrix = confusion_matrix(y_test, y_test_pred)
print("matrice de confusion :")
print(conf_matrix)

report = classification_report(y_test, y_test_pred)
print("rapport de classification :")
print(report)

# auc
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Binariser les labels pour une classification multi-classes
classes = list(set(y_test))
y_test_bin = label_binarize(y_test, classes=classes)

# Calculer l'AUC en mode one-vs-rest
auc_score = roc_auc_score(y_test_bin, y_test_proba, average='weighted', multi_class='ovr')
print(f"AUC Score (multi-classes) : {auc_score:.2f}")

Sauvegarde du modèle

In [None]:
joblib.dump(gbm_stacking_model, f"../../data/models/gbm_stacking_model_{risk_profile}.pkl")
print(f"modèle enregistré dans data/gbm_stacking_model_{risk_profile}.pkl")

Mise en place de la stratégie à partir des prédictions

In [None]:
file_path = f"../../data/tbm_parameters_{risk_profile}.json"

with open(file_path, "r") as json_file:
    tbm_parameters = json.load(json_file)
equity_strategy = EquityStrategy(df=df, buy_number=tbm_parameters['buy_number'], sell_number=tbm_parameters['sell_number'])
profit = equity_strategy.calculate_profit()
mdd = equity_strategy.calculate_maximum_drawdown()
fitness = equity_strategy.fitness_function(weight_p=risk_profile_type_dict['weight_p'], weight_mdd=risk_profile_type_dict['weight_mdd'])

Equity curve

In [None]:
equity_curve = equity_strategy.equity_curve

Simplement hold

In [None]:
# nb_btc = df_price[0] / 100000
# hold_equity_curve = nb_btc * df_price

Comparaison

### Valeurs SHAP

In [2]:
import pandas as pd
import shap


# modèle
final_model = gbm_stacking_model.named_steps['votingclassifier'].estimators_[0]

# SHAP explainer
explainer = shap.LinearExplainer(gbm_stacking_model, X_test)
shap_values = explainer.shap_values(X_test)

# 10 prédicteurs les plus importants
shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=10)
shap.summary_plot(shap_values, X_test)

ModuleNotFoundError: No module named 'shap'