<center> <h2>Quantitative Finance Project</h2> </center> <br>
<center> <h3>Master 2 MoSEF Data Science - Université Paris 1 Panthéon-Sorbonne</h3> </center> <br>
<center> <h3><b>Genetic Algorithm-optimized Triple Barrier Labeling for Predictive Stock Trading Using GBM Stacking</b></h3> </center> <br>
<center> <h3>Louis LEBRETON</h3> </center> <br>

## Building the Training and Test DataFrames

The training and test DataFrames are built from the following data:  

- **Macroeconomic data**: retrieved from the **FRED** API (St. Louis Fed)
- **Tweet scores**: tweets are scraped, then converted into numerical scores with the **BERTweet** model, which quantifies their sentiment
- **Bitcoin (BTC) data**: collected via the **CoinGecko** API. These data are labeled with the **Triple Barrier** method, whose parameters were optimized by a **genetic algorithm** to maximize the performance of a trading strategy

The goal is to obtain these data over the full period from **01/01/2018 to 01/01/2025**. The cells below run on the final year only (`start_date = '2024-01-01'`).
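To make the Triple Barrier idea concrete, here is a minimal sketch in plain Python. It assumes fixed percentage barriers and a fixed holding window. The function name `triple_barrier_labels` and its parameters `upper_pct`, `lower_pct` and `max_holding` are hypothetical and chosen for illustration. They are not the project's actual implementation, and the genetic optimization of these parameters is not shown.

```python
def triple_barrier_labels(prices, upper_pct, lower_pct, max_holding):
    """Return one label per price: +1 if the upper barrier is hit first,
    -1 if the lower barrier is hit first, 0 if neither is hit within
    max_holding steps (vertical barrier).

    Hypothetical helper for illustration only.
    """
    labels = []
    n = len(prices)
    for t in range(n):
        entry = prices[t]
        upper = entry * (1 + upper_pct)   # take-profit barrier
        lower = entry * (1 - lower_pct)   # stop-loss barrier
        label = 0                          # default: vertical barrier reached
        last = min(t + max_holding, n - 1)
        for k in range(t + 1, last + 1):
            if prices[k] >= upper:
                label = 1
                break
            if prices[k] <= lower:
                label = -1
                break
        labels.append(label)
    return labels


# toy prices, not real BTC data
prices = [100, 102, 106, 103, 97, 99, 101]
labels = triple_barrier_labels(prices, upper_pct=0.05, lower_pct=0.05, max_holding=3)
print(labels)  # [1, 0, -1, -1, 0, 0, 0]
```

The last observations have a truncated look-ahead window, so their `0` labels are less reliable. In practice such rows are often dropped before training.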

In [3]:
start_date = '2024-01-01'
end_date = '2025-01-01'

In [4]:
import os

from datetime import datetime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from concurrent.futures import ThreadPoolExecutor
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


from get_data_API import get_economic_data, get_BTC_data
from get_sentiment_score import tweets_to_sentiment_scores
from get_data_scraping import scrape_tweets_one_account

### Macroeconomic data

In [5]:
# FRED API key obtained from https://fred.stlouisfed.org/docs/api/api_key.html
FRED_API_KEY = os.getenv('FRED_API_KEY')

# series of interest
series_list = [
    'DFF',       # Federal Funds Rate
    'NFINCP',    # Nonfinancial commercial paper outstanding
    'FINCP',     # Financial commercial paper outstanding
    'DPRIME',    # Bank prime loan rate
    'DPCREDIT',  # Discount window primary credit
    'DTWEXBGS',  # Nominal Broad U.S. Dollar Index
    'CPIAUCSL',  # Consumer Price Index
    'DGS3MO',    # Market Yield on U.S. Treasury Securities (3 months)
    'DGS1',      # Market Yield on U.S. Treasury Securities (1 year)
    'DGS30'      # Market Yield on U.S. Treasury Securities (30 years)
]

economic_df = get_economic_data(series_id_list=series_list, api_key=FRED_API_KEY, start_date=start_date, end_date=end_date)
economic_df.head()

Unnamed: 0,date,DFF,NFINCP,FINCP,DPRIME,DPCREDIT,DTWEXBGS,CPIAUCSL,DGS3MO,DGS1,DGS30
0,2024-01-01,5.33,,,8.5,5.5,,309.685,,,
1,2024-01-02,5.33,,,8.5,5.5,119.6165,,5.46,4.8,4.08
2,2024-01-03,5.33,273.0175,636.064172,8.5,5.5,120.0407,,5.48,4.81,4.05
3,2024-01-04,5.33,,,8.5,5.5,119.9775,,5.48,4.85,4.13
4,2024-01-05,5.33,,,8.5,5.5,119.7474,,5.47,4.84,4.21
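The `NaN` values in the output above come from mixing series published at different frequencies: for example, `CPIAUCSL` is monthly and `NFINCP`/`FINCP` are weekly, while `DFF` is daily. One common way to align them on a daily grid is to carry the last known value forward. The sketch below illustrates this on a toy frame (`toy_df` is a hypothetical name, with values copied from the output above). It is an assumption about a possible treatment, not necessarily the one used in this project.

```python
import pandas as pd

nan = float("nan")

# toy frame reproducing a few columns of the output above
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-04", "2024-01-05"]),
    "DFF": [5.33, 5.33, 5.33, 5.33, 5.33],
    "NFINCP": [nan, nan, 273.0175, nan, nan],      # weekly series
    "CPIAUCSL": [309.685, nan, nan, nan, nan],     # monthly series
})

# forward-fill: each day takes the most recent past observation
filled_df = toy_df.sort_values("date").set_index("date").ffill()
print(filled_df)
```

Forward-filling uses only past observations, but FRED dates refer to the observation period rather than the release date, so a lag may still be needed to avoid look-ahead bias. Rows before a series' first observation remain `NaN`.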


### X/Twitter data

Once scraped, the tweets are exported to the `data/tweets` folder.

In [6]:
# # my X account
# LOGIN = os.getenv('LOGIN')
# PASSWORD = os.getenv('PASSWORD')

# # time interval to scrape
# since_date = datetime.strptime(start_date, "%Y-%m-%d")
# until_date = datetime.strptime(end_date, "%Y-%m-%d")

# # X accounts of interest
# accounts_list = ["documentingbtc", "100trillionUSD", "CoinDesk", "saylor", "scottmelker",
#                 "woonomic", "LynAldenContact", "PrestonPysh", "PeterLBrandt", "rektcapital"]


# options = Options()
# options.add_argument('--ignore-certificate-errors')
# options.add_argument('--ignore-ssl-errors')
# options.add_argument("--disable-blink-features=AutomationControlled")

# # run with or without a visible browser window
# # options.add_argument("--headless")

# # thread pool (max_workers=1 processes the accounts one at a time)
# with ThreadPoolExecutor(max_workers=1) as executor:
#     for account in accounts_list:
#         driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
#         # scrape one X account's tweets between the two dates
#         executor.submit(scrape_tweets_one_account, LOGIN, PASSWORD, account, since_date, until_date, driver)

### Concatenating all scraped tweets

In [7]:
folder_path = '../../data/tweets/'

csv_list = []

for root, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith('.csv'):
            file_path = os.path.join(root, file)
            df = pd.read_csv(file_path)
            csv_list.append(df)

df_tweets_combined = pd.concat(csv_list, ignore_index=True).drop_duplicates().dropna()
df_tweets_combined.tail()

Unnamed: 0,author,timestamp,tweet_text
31,Michael Saylor@saylor·5 déc. 2024,2024-12-05T15:36:24.000Z,You can still buy #bitcoin for less than $1 mi...
32,Michael Saylor@saylor·5 déc. 2024,2024-12-05T14:16:41.000Z,"Year to date, $MSTR treasury operations delive..."
33,Michael Saylor@saylor·5 déc. 2024,2024-12-05T13:45:36.000Z,We have a #Bitcoin President.
35,Michael Saylor@saylor·5 déc. 2024,2024-12-05T13:33:29.000Z,$SMLR is a company on the #Bitcoin Standard.
36,Michael Saylor@saylor·5 déc. 2024,2024-12-05T11:32:52.000Z,Keep Your Promises #Bitcoin


#### Converting tweets into sentiment scores with BERTweet

In [8]:
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")

tweets_text = df_tweets_combined['tweet_text'].tolist()

# predict the sentiment scores
scores = tweets_to_sentiment_scores(tweets_text, tokenizer, model)

# add the scores to the DataFrame
df_tweets_combined['negative'] = scores[:, 0].tolist()
df_tweets_combined['neutral'] = scores[:, 1].tolist()
df_tweets_combined['positive'] = scores[:, 2].tolist()

df_tweets_combined.head()

Unnamed: 0,author,timestamp,tweet_text,negative,neutral,positive
0,Michael Saylor@saylor·4 déc. 2024,2024-12-04T20:39:49.000Z,I journeyed into the @FoxBusiness studios to d...,0.001839,0.458866,0.539295
1,Michael Saylor@saylor·4 déc. 2024,2024-12-04T13:02:48.000Z,"In November, $MSTR raised $13.5 billion and ac...",0.002764,0.86933,0.127906
2,Michael Saylor@saylor·3 déc. 2024,2024-12-03T21:35:51.000Z,The Deal of the Century is Trump Max #Bitcoin,0.01184,0.458442,0.529718
3,Michael Saylor@saylor·3 déc. 2024,2024-12-03T14:13:32.000Z,My discussion of the economic benefits of Bitc...,0.00401,0.375384,0.620606
4,Michael Saylor@saylor·3 déc. 2024,2024-12-03T12:46:43.000Z,"Last week, $MSTR's treasury operations deliver...",0.002188,0.517071,0.480741
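Since the macroeconomic and BTC data are daily, the per-tweet scores have to be brought to one row per day before they can be joined. The sketch below shows one possible aggregation: a daily mean of each score plus a tweet count. It uses a toy frame (`toy_tweets`, a hypothetical name) built from three rows of the output above. The choice of the mean and the `n_tweets` column are assumptions, not the project's confirmed method.

```python
import pandas as pd

# three rows copied from the scored output above
toy_tweets = pd.DataFrame({
    "timestamp": ["2024-12-04T20:39:49.000Z",
                  "2024-12-04T13:02:48.000Z",
                  "2024-12-03T21:35:51.000Z"],
    "negative": [0.001839, 0.002764, 0.011840],
    "neutral":  [0.458866, 0.869330, 0.458442],
    "positive": [0.539295, 0.127906, 0.529718],
})

# parse the ISO timestamps (UTC), drop the timezone, keep only the calendar day
toy_tweets["date"] = (
    pd.to_datetime(toy_tweets["timestamp"], utc=True)
    .dt.tz_localize(None)
    .dt.normalize()
)

# one row per day: mean score per class and number of tweets that day
score_cols = ["negative", "neutral", "positive"]
daily_sentiment = toy_tweets.groupby("date")[score_cols].mean()
daily_sentiment["n_tweets"] = toy_tweets.groupby("date").size()
print(daily_sentiment)
```

Keeping the tweet count alongside the means lets a downstream model distinguish a day with one tweet from a day with fifty.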


### Bitcoin data + TBM labels via genetic optimization

Building additional features

In [None]:
btc_df = get_BTC_data(days = 365, interval = 'daily')

# day-over-day relative change
btc_df["increase_volume"] = (btc_df["volume"] - btc_df["volume"].shift(1)) / btc_df["volume"].shift(1)
btc_df["increase_market_cap"] = (btc_df["market_cap"] - btc_df["market_cap"].shift(1)) / btc_df["market_cap"].shift(1)
btc_df["increase_price"] = (btc_df["price"] - btc_df["price"].shift(1)) / btc_df["price"].shift(1)

# moving average
btc_df["MA30_volume"] = btc_df["volume"].rolling(window=30).mean()
btc_df["MA30_market_cap"] = btc_df["market_cap"].rolling(window=30).mean()
btc_df["MA30_price"] = btc_df["price"].rolling(window=30).mean()

# lags
btc_df["volume_lag_7"] = btc_df["volume"].shift(7)
btc_df["market_cap_lag_7"] = btc_df["market_cap"].shift(7)
btc_df["price_lag_7"] = btc_df["price"].shift(7)

btc_df.tail()

Unnamed: 0,date,price,market_cap,volume,increase_volume,increase_market_cap,increase_price,MA30_volume,MA30_market_cap,MA30_price,volume_lag_7,market_cap_lag_7,price_lag_7
361,2025-01-02 00:00:00,94384.176115,1869193000000.0,23275010000.0,-0.486514,0.009371,0.009372,78481460000.0,1945124000000.0,98268.073011,33963750000.0,1966481000000.0,99344.954174
362,2025-01-03 00:00:00,96852.146812,1917905000000.0,45157340000.0,0.940164,0.02606,0.026148,77055510000.0,1945691000000.0,98295.423539,45049340000.0,1894744000000.0,95678.312446
363,2025-01-04 00:00:00,98084.342793,1942835000000.0,35721650000.0,-0.208951,0.012998,0.012722,74971880000.0,1945209000000.0,98268.85265,41498540000.0,1867709000000.0,94331.947271
364,2025-01-05 00:00:00,98256.738768,1946611000000.0,20979040000.0,-0.412708,0.001944,0.001758,69322500000.0,1946071000000.0,98304.027264,22429850000.0,1885557000000.0,95184.619453
365,2025-01-05 16:39:17,98010.720833,1939574000000.0,19153980000.0,-0.086994,-0.003615,-0.002504,66100550000.0,1944778000000.0,98238.589575,24065310000.0,1854873000000.0,93663.44752
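The features above follow three standard patterns: relative change, rolling mean and lag. The short sketch below, on toy prices (`toy` is a hypothetical frame, not the BTC data), checks that the manual relative-change formula matches pandas' built-in `pct_change` up to floating-point rounding, and shows that a rolling window of size `w` leaves the first `w - 1` rows as `NaN`.

```python
import pandas as pd

# toy prices for illustration
toy = pd.DataFrame({"price": [100.0, 110.0, 99.0, 108.9]})

# manual formula, as in the cell above
manual = (toy["price"] - toy["price"].shift(1)) / toy["price"].shift(1)

# built-in equivalent
builtin = toy["price"].pct_change()

# rolling mean with a window of 3: the first 2 rows have no full window
ma3 = toy["price"].rolling(window=3).mean()

print(pd.DataFrame({"manual": manual, "builtin": builtin, "ma3": ma3}))
```

With `window=30` and a 7-day lag as above, the first 29 rows of the moving averages and the first 7 rows of the lagged columns are `NaN`, so they need to be dropped or imputed before training.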


#### Labeling the Bitcoin price