
###Automated stock trading using FinRL

The algorithm is trained using Deep Reinforcement Learning (DRL) algorithms and the components of the reinforcement learning environment are:

Action: The action space describes the allowed actions that the agent interacts with the environment. Normally, a ∈ A includes three actions: a ∈ {−1, 0, 1}, where −1, 0, 1 represent selling, holding, and buying one stock. Also, an action can be carried upon multiple shares. We use an action space {−k, ..., −1, 0, 1, ..., k}, where k denotes the number of shares. For example, "Buy 10 shares of AAPL" or "Sell 10 shares of AAPL" are 10 or −10, respectively

Reward function: r(s, a, s′) is the incentive mechanism for an agent to learn a better action. The change of the portfolio value when action a is taken at state s and arriving at new state s', i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at state s′ and s, respectively

State: The state space describes the observations that the agent receives from the environment. Just as a human trader needs to analyze various information before executing a trade, so our trading agent observes many different features to better learn in an interactive environment.

Environment: Dow 30 consituents

Install all the packages through FinRL library

In [1]:
!python --version



Python 3.10.16


In [17]:
!python -m pip install git+https://github.com/AI4Finance-Foundation/FinRL.git

Collecting git+https://github.com/AI4Finance-Foundation/FinRL.git
  Cloning https://github.com/AI4Finance-Foundation/FinRL.git to c:\users\natna\appdata\local\temp\pip-req-build-l77kt8or
  Resolved https://github.com/AI4Finance-Foundation/FinRL.git to commit bc12fe7b57c483e8fac666f4cf05cbf62077958a
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting elegantrl@ git+https://github.com/AI4Finance-Foundation/ElegantRL.git (from finrl==0.3.6)
  Cloning https://github.com/AI4Finance-Foundation/ElegantRL.git to c:\users\natna\appdata\local\temp\pip-install-cwe6w9_h\elegantrl_f378379c6c3f4bdeb5934f405e115e04
  Resolved https://github.com/AI4Finance-Foundation/ElegantRL.git to commit 2fa34dd9236498beada8d8443d9

  Running command git clone --filter=blob:none --quiet https://github.com/AI4Finance-Foundation/FinRL.git 'C:\Users\natna\AppData\Local\Temp\pip-req-build-l77kt8or'
  Running command git clone --filter=blob:none --quiet https://github.com/AI4Finance-Foundation/ElegantRL.git 'C:\Users\natna\AppData\Local\Temp\pip-install-cwe6w9_h\elegantrl_f378379c6c3f4bdeb5934f405e115e04'
  You can safely remove it manually.
  You can safely remove it manually.


In [6]:
!pip3 install pandas
!pip install numpy



In [25]:
!set PATH=%PATH%;C:\Users\natna\miniforge3\envs\finRl\Scripts


In [3]:
!python -m pip install numpy==1.26.4 scipy==1.12.0 scikit-learn==1.6.1



# !python --version 1.23

Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp310-cp310-win_amd64.whl.metadata (61 kB)
Collecting scipy==1.12.0
  Using cached scipy-1.12.0-cp310-cp310-win_amd64.whl.metadata (60 kB)
Collecting scikit-learn==1.6.1
  Using cached scikit_learn-1.6.1-cp310-cp310-win_amd64.whl.metadata (15 kB)
Using cached numpy-1.26.4-cp310-cp310-win_amd64.whl (15.8 MB)
Using cached scipy-1.12.0-cp310-cp310-win_amd64.whl (46.2 MB)
Using cached scikit_learn-1.6.1-cp310-cp310-win_amd64.whl (11.1 MB)
Installing collected packages: numpy, scipy, scikit-learn
Successfully installed numpy-1.26.4 scikit-learn-1.6.1 scipy-1.12.0




In [15]:
import numpy
import scipy
import sklearn
print("Numpy version:", numpy.__version__)
print("Scipy version:", scipy.__version__)
print("Scikit-learn version:", sklearn.__version__)


Numpy version: 1.26.4
Scipy version: 1.12.0
Scikit-learn version: 1.6.1


Import Packages

In [201]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# matplotlib.use('Agg')
import datetime

%matplotlib inline
from finrl import config
from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline

from pprint import pprint

import sys
sys.path.append("../FinRL-Library")

import itertools

In [202]:
# pip freeze > requirements.txt

Create Folders

In [203]:
import os
if not os.path.exists("./" + config.DATA_SAVE_DIR):
    os.makedirs("./" + config.DATA_SAVE_DIR)
if not os.path.exists("./" + config.TRAINED_MODEL_DIR):
    os.makedirs("./" + config.TRAINED_MODEL_DIR)
if not os.path.exists("./" + config.TENSORBOARD_LOG_DIR):
    os.makedirs("./" + config.TENSORBOARD_LOG_DIR)
if not os.path.exists("./" + config.RESULTS_DIR):
    os.makedirs("./" + config.RESULTS_DIR)

Download Data

In [204]:
from finrl import config_tickers
df = YahooDownloader(start_date = '2009-01-01',
                           end_date = '2020-09-30',
                           ticker_list = config_tickers.DOW_30_TICKER).fetch_data()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%********

Shape of DataFrame:  (86111, 8)


Preprocess Data

In [205]:
tech_indicators = config.INDICATORS

In [206]:
print(tech_indicators)

['macd', 'boll_ub', 'boll_lb', 'rsi_30', 'cci_30', 'dx_30', 'close_30_sma', 'close_60_sma']


In [207]:
df = FeatureEngineer(use_technical_indicator=True,
                      tech_indicator_list = tech_indicators,
                      use_turbulence=True,
                      user_defined_feature = False).preprocess_data(df.copy()).fillna(0)


Successfully added technical indicators
Successfully added turbulence index


In [208]:
df=df.sort_values(['date','tic'],ignore_index=True)
df.index = df.date.factorize()[0]
df = df.reset_index(drop=True)
# cov_list = []
# # look back is one year
# lookback=252
# for i in range(lookback,len(df.index.unique())):
#   data_lookback = df.loc[i-lookback:i,:]
#   price_lookback=data_lookback.pivot_table(index = 'date',columns = 'tic', values = 'close')
#   return_lookback = price_lookback.pct_change().dropna()
#   covs = return_lookback.cov().values
#   cov_list.append(covs)

# df_cov = pd.DataFrame({'date':df.date.unique()[lookback:],'cov_list':cov_list})
# df = df.merge(df_cov, on='date')
# df = df.sort_values(['date','tic']).reset_index(drop=True)

In [209]:
df.sample(5)

Unnamed: 0,date,close,high,low,open,volume,tic,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,turbulence
79114,2019-11-04,111.409859,120.5,119.010002,119.620003,1956400,AXP,0,0.381791,112.537719,105.779776,52.113993,109.945882,7.92159,108.788552,110.114745,80.34655
60472,2017-04-17,76.372101,106.480003,105.650002,106.169998,5278800,CVX,0,-0.441153,79.092288,76.249487,42.090719,-141.825542,23.687403,78.069278,79.391192,8.534036
46309,2015-05-07,62.192539,66.779999,65.540001,65.580002,6660300,V,3,-0.01952,63.621352,60.000547,51.700715,21.976977,7.534645,61.651997,62.338989,13.012892
29661,2013-01-25,59.666771,78.550003,77.82,78.25,1524200,TRV,4,1.127458,60.224847,53.626378,68.814248,177.829517,55.297327,56.560709,54.969991,52.056346
31949,2013-05-20,28.895786,35.099998,34.68,34.73,54020800,MSFT,0,1.017984,28.937825,25.229476,72.791278,127.062228,62.233441,26.058291,24.549373,32.667513


In real life trading, the model needs to be updated periodically using rolling windows. But here I'm just cutting the data into train and trade set.

In [210]:
train = data_split(df, '2009-01-01','2019-12-31')
trade = data_split(df, '2020-01-01','2020-09-30')

State Space and Action Space Calculation

In [211]:
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(config.INDICATORS)*stock_dimension #stock, shares, technical indicators, 

In [212]:
print(stock_dimension)
print(state_space)

29
291


## Environment Details

In [213]:
# Define transaction cost lists for buying and selling stocks
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
# Explanation: 
# - `buy_cost_list` and `sell_cost_list` represent the transaction costs as a percentage for buying and selling stocks.
# - `[0.001] * stock_dimension` creates a list where each element is 0.001 (0.1% transaction fee), repeated for each stock.
# - The use of `=` assigns the same list to both `buy_cost_list` and `sell_cost_list`.

# Initialize the list to track the number of shares owned for each stock
num_stock_shares = [0] * stock_dimension
# Explanation:
# - `num_stock_shares` is a list where each element is initialized to 0, representing that no shares are owned initially.
# - `[0] * stock_dimension` ensures the list length matches the number of stocks (`stock_dimension`).

# Create a dictionary to store environment configuration parameters
env_kwargs = {
    "hmax": 100,  # Maximum number of shares that can be bought or sold in a single transaction.
    "initial_amount": 1_000_000,  # Initial cash available for the agent to trade with (e.g., $1,000,000).
    "num_stock_shares": num_stock_shares,  # Initial portfolio: number of shares owned for each stock.
    "buy_cost_pct": buy_cost_list,  # Transaction cost percentage for buying stocks.
    "sell_cost_pct": sell_cost_list,  # Transaction cost percentage for selling stocks.
    "state_space": state_space,  # Dimension of the state space (e.g., features describing the environment).
    "stock_dim": stock_dimension,  # Number of stocks being traded (dimension of the stock universe).
    "tech_indicator_list": config.INDICATORS,  # List of technical indicators used as features for the state space.
    "action_space": stock_dimension,  # Dimension of the action space (one action per stock).
    "reward_scaling": 1e-4,  # Scaling factor for rewards to normalize them and improve learning stability.
    "print_verbosity":5
}
# Explanation:
# - This dictionary (`env_kwargs`) encapsulates all the necessary parameters required to initialize the stock trading environment.
# - It includes configuration for portfolio management (e.g., `hmax`, `initial_amount`, `num_stock_shares`) and the structure of the RL problem (e.g., `state_space`, `action_space`).

# Initialize the stock trading environment with the training data and configuration parameters
e_train_gym = StockTradingEnv(df=train, **env_kwargs)
# Explanation:
# - `StockTradingEnv` is a custom environment class for stock trading, compliant with OpenAI Gym standards.
# - `df=train` specifies the training data (a DataFrame containing historical stock prices and other features).
# - `**env_kwargs` unpacks the `env_kwargs` dictionary, passing each key-value pair as an argument to the environment initializer.
# - The environment simulates the stock trading process, enabling the RL agent to interact with it by observing states, taking actions, and receiving rewards.


Environment for training

In [214]:
env_train, _ = e_train_gym.get_sb_env() #get stable baseline environment for training
print(type(env_train))

<class 'stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv'>


In [215]:
agent = DRLAgent(env = env_train)
# Set the corresponding values to 'True' for the algorithms that you want to use
if_using_a2c = True
if_using_ddpg = True
if_using_ppo = True
if_using_td3 = False
if_using_sac = False

In [15]:
from stable_baselines3.common.logger import configure

 Implement DRL Algorithms

In [16]:
import torch

print('Current version of PyTorch: ', torch.__version__)

if torch.cuda.is_available:
  print('PyTorch can use GPUs!')
else:
  print('PyTorch cannot use GPUs.')

Current version of PyTorch:  2.6.0+cu118
PyTorch can use GPUs!


DDPG

Training

In [17]:
agent = DRLAgent(env = env_train)
model_ddpg = agent.get_model("ddpg")

if if_using_ddpg:
  # set up logger
  tmp_path = config.RESULTS_DIR + '/ddpg'
  new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_ddpg.set_logger(new_logger_ddpg)

{'batch_size': 128, 'buffer_size': 50000, 'learning_rate': 0.001}
Using cuda device
Logging to results/ddpg


In [84]:
trained_ddpg = agent.train_model(model=model_ddpg,
                             tb_log_name='ddpg',
                             total_timesteps=10000) if if_using_ddpg else None

day: 2514, episode: 5
begin_total_asset: 1000000.00
end_total_asset: 4835037.87
total_reward: 3835037.87
total_cost: 1560.41
total_trades: 27700
Sharpe: 1.086


In [20]:
trained_ddpg.save(config.TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None


Trading

In [85]:
e_trade_gym = StockTradingEnv(df = trade, **env_kwargs)

In [86]:
df_account_value, df_actions = DRLAgent.DRL_prediction(
    model=trained_ddpg,
    environment = e_trade_gym)

hit end!


In [94]:
df_actions.head(5)

Unnamed: 0_level_0,AAPL,AMGN,AXP,BA,CAT,CRM,CSCO,CVX,DIS,GS,...,MRK,MSFT,NKE,PG,TRV,UNH,V,VZ,WBA,WMT
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-02,0,0,0,100,0,100,0,100,100,100,...,100,0,0,100,0,0,0,0,0,100
2020-01-03,0,0,0,100,0,100,0,100,100,100,...,100,0,0,100,0,0,0,0,0,100
2020-01-06,0,0,0,100,0,100,0,100,100,100,...,100,0,0,100,0,0,0,0,0,100
2020-01-07,0,0,0,100,0,100,0,100,100,100,...,100,0,0,100,0,0,0,0,0,100
2020-01-08,0,0,0,100,0,100,0,100,100,100,...,100,0,0,100,0,0,0,0,0,100


In [88]:
df_account_value.tail()

Unnamed: 0,date,account_value
183,2020-09-23,889699.361718
184,2020-09-24,886929.528149
185,2020-09-25,902766.579949
186,2020-09-28,920169.575047
187,2020-09-29,915718.3475


Backtesting Performance

In [39]:
df_dji = YahooDownloader(
    start_date='2020-01-01', end_date='2020-09-30', ticker_list=["dji"]
).fetch_data()
df_dji = df_dji[["date", "close"]]
fst_day = df_dji["close"][0]
dji = pd.merge(
    df_dji["date"],
    df_dji["close"].div(fst_day).mul(1000000),
    how="outer",
    left_index=True,
    right_index=True,
).set_index("date")


[*********************100%***********************]  1 of 1 completed

Shape of DataFrame:  (183, 8)





In [40]:
print("==============Get Backtest Results===========")
now = datetime.datetime.now().strftime('%Y%m%d-%Hh%M')

perf_stats_all = backtest_stats(account_value=df_account_value)
perf_stats_all = pd.DataFrame(perf_stats_all)
perf_stats_all.to_csv("./"+config.RESULTS_DIR+"/perf_stats_all_"+now+'.csv')

Annual return         -0.050662
Cumulative returns    -0.038044
Annual volatility      0.399601
Sharpe ratio           0.069960
Calmar ratio          -0.141434
Stability              0.000611
Max drawdown          -0.358206
Omega ratio            1.014689
Sortino ratio          0.095153
Skew                        NaN
Kurtosis                    NaN
Tail ratio             0.858610
Daily value at risk   -0.050234
dtype: float64


In [41]:
#baseline stats
print("==============Get Baseline Stats===========")
baseline_df = get_baseline(
        ticker="^DJI",
        start = '2020-01-01',
        end = '2020-09-30')

stats = backtest_stats(baseline_df, value_col_name = 'close')

[*********************100%***********************]  1 of 1 completed

Shape of DataFrame:  (188, 8)
Annual return         -0.065199
Cumulative returns    -0.049054
Annual volatility      0.416030
Sharpe ratio           0.046016
Calmar ratio          -0.175803
Stability              0.012240
Max drawdown          -0.370862
Omega ratio            1.009343
Sortino ratio          0.062829
Skew                        NaN
Kurtosis                    NaN
Tail ratio             0.860019
Daily value at risk   -0.052339
dtype: float64





Back Test Plot

In [45]:
df_result_ddpg = df_account_value.set_index(df_account_value.columns[0])
result = pd.DataFrame(
    {
        "ddpg": df_result_ddpg["account_value"],
        "dji": dji["close"],
    }
)
result

Unnamed: 0_level_0,ddpg,dji
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-02,1.000000e+06,
2020-01-03,9.978712e+05,1.000000e+06
2020-01-06,9.981349e+05,1.002392e+06
2020-01-07,9.939272e+05,9.982119e+05
2020-01-08,1.000095e+06,1.003848e+06
...,...,...
2020-09-23,9.234568e+05,9.346322e+05
2020-09-24,9.281680e+05,9.364587e+05
2020-09-25,9.375273e+05,9.489818e+05
2020-09-28,9.505631e+05,


In [47]:
plt.rcParams["figure.figsize"] = (15,5)
plt.figure()
result.plot()

<Axes: xlabel='date'>

PPO

In [24]:
agent = DRLAgent(env = env_train)
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo",model_kwargs = PPO_PARAMS)

if if_using_ppo:
  # set up logger
  tmp_path = config.RESULTS_DIR + '/ppo'
  new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_ppo.set_logger(new_logger_ppo)

{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 128}
Using cuda device
Logging to results/ppo




In [25]:
trained_ppo = agent.train_model(model=model_ppo,
                             tb_log_name='ppo',
                             total_timesteps=200000) if if_using_ppo else None

---------------------------------
| time/              |          |
|    fps             | 147      |
|    iterations      | 1        |
|    time_elapsed    | 13       |
|    total_timesteps | 2048     |
| train/             |          |
|    reward          | 0.337964 |
---------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 145         |
|    iterations           | 2           |
|    time_elapsed         | 28          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014884099 |
|    clip_fraction        | 0.207       |
|    clip_range           | 0.2         |
|    entropy_loss         | -41.2       |
|    explained_variance   | 0.00748     |
|    learning_rate        | 0.00025     |
|    loss                 | 5.02        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0258     |
|    reward           

In [26]:
trained_ppo.save(config.TRAINED_MODEL_DIR + "/agent_ppo") if if_using_ppo else None

Trading

In [48]:
e_trade_gym = StockTradingEnv(df = trade, **env_kwargs)

In [49]:
df_account_value, df_actions = DRLAgent.DRL_prediction(
    model=trained_ppo,
    environment = e_trade_gym)

hit end!


In [50]:
df_account_value.tail()

Unnamed: 0,date,account_value
183,2020-09-23,909194.253185
184,2020-09-24,911009.383914
185,2020-09-25,916962.469833
186,2020-09-28,928577.920684
187,2020-09-29,919340.924575


Backtesting Performance

In [51]:
print("==============Get Backtest Results===========")
now = datetime.datetime.now().strftime('%Y%m%d-%Hh%M')

perf_stats_all = backtest_stats(account_value=df_account_value)
perf_stats_all = pd.DataFrame(perf_stats_all)
perf_stats_all.to_csv("./"+config.RESULTS_DIR+"/perf_stats_all_"+now+'.csv')

Annual return         -0.106606
Cumulative returns    -0.080659
Annual volatility      0.422709
Sharpe ratio          -0.056058
Calmar ratio          -0.294170
Stability              0.000500
Max drawdown          -0.362396
Omega ratio            0.989203
Sortino ratio         -0.076674
Skew                        NaN
Kurtosis                    NaN
Tail ratio             1.107088
Daily value at risk   -0.053350
dtype: float64


In [52]:
#baseline stats
print("==============Get Baseline Stats===========")
baseline_df = get_baseline(
        ticker="^DJI",
        start = '2020-01-01',
        end = '2020-09-30')

stats = backtest_stats(baseline_df, value_col_name = 'close')

[*********************100%***********************]  1 of 1 completed

Shape of DataFrame:  (188, 8)
Annual return         -0.065199
Cumulative returns    -0.049054
Annual volatility      0.416030
Sharpe ratio           0.046016
Calmar ratio          -0.175803
Stability              0.012240
Max drawdown          -0.370862
Omega ratio            1.009343
Sortino ratio          0.062829
Skew                        NaN
Kurtosis                    NaN
Tail ratio             0.860019
Daily value at risk   -0.052339
dtype: float64





In [53]:
df_result_ppo = df_account_value.set_index(df_account_value.columns[0])
result = pd.DataFrame(
    {
        "ppo": df_result_ppo["account_value"],
        "dji": dji["close"],
    }
)
result

Unnamed: 0_level_0,ppo,dji
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-02,1.000000e+06,
2020-01-03,9.994978e+05,1.000000e+06
2020-01-06,9.999417e+05,1.002392e+06
2020-01-07,9.997660e+05,9.982119e+05
2020-01-08,1.000481e+06,1.003848e+06
...,...,...
2020-09-23,9.091943e+05,9.346322e+05
2020-09-24,9.110094e+05,9.364587e+05
2020-09-25,9.169625e+05,9.489818e+05
2020-09-28,9.285779e+05,


In [54]:
plt.rcParams["figure.figsize"] = (15,5)
plt.figure()
result.plot()

<Axes: xlabel='date'>

A2C

In [30]:
agent = DRLAgent(env = env_train)
model_a2c = agent.get_model("a2c")

if if_using_a2c:
  # set up logger
  tmp_path = config.RESULTS_DIR + '/a2c'
  new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_a2c.set_logger(new_logger_a2c)

{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cuda device
Logging to results/a2c




In [31]:
trained_a2c = agent.train_model(model=model_a2c,
                             tb_log_name='a2c',
                             total_timesteps=50000) if if_using_a2c else None

----------------------------------------
| time/                 |              |
|    fps                | 126          |
|    iterations         | 100          |
|    time_elapsed       | 3            |
|    total_timesteps    | 500          |
| train/                |              |
|    entropy_loss       | -41.3        |
|    explained_variance | 0.0855       |
|    learning_rate      | 0.0007       |
|    n_updates          | 99           |
|    policy_loss        | -33.9        |
|    reward             | -0.018644849 |
|    std                | 1.01         |
|    value_loss         | 0.967        |
----------------------------------------
---------------------------------------
| time/                 |             |
|    fps                | 126         |
|    iterations         | 200         |
|    time_elapsed       | 7           |
|    total_timesteps    | 1000        |
| train/                |             |
|    entropy_loss       | -41.3       |
|    explained_variance 

In [32]:
trained_a2c.save(config.TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None

Trading

In [55]:
e_trade_gym = StockTradingEnv(df = trade, **env_kwargs)

In [56]:
df_account_value, df_actions = DRLAgent.DRL_prediction(
    model=trained_a2c,
    environment = e_trade_gym)

hit end!


In [57]:
df_account_value.tail()

Unnamed: 0,date,account_value
183,2020-09-23,932457.358241
184,2020-09-24,935975.452659
185,2020-09-25,949245.242111
186,2020-09-28,965283.748893
187,2020-09-29,961955.772083


Backtesting Performance

In [58]:
print("==============Get Backtest Results===========")
now = datetime.datetime.now().strftime('%Y%m%d-%Hh%M')

perf_stats_all = backtest_stats(account_value=df_account_value)
perf_stats_all = pd.DataFrame(perf_stats_all)
perf_stats_all.to_csv("./"+config.RESULTS_DIR+"/perf_stats_all_"+now+'.csv')

Annual return         -0.050662
Cumulative returns    -0.038044
Annual volatility      0.399601
Sharpe ratio           0.069960
Calmar ratio          -0.141434
Stability              0.000611
Max drawdown          -0.358206
Omega ratio            1.014689
Sortino ratio          0.095153
Skew                        NaN
Kurtosis                    NaN
Tail ratio             0.858610
Daily value at risk   -0.050234
dtype: float64


In [59]:
#baseline stats
print("==============Get Baseline Stats===========")
baseline_df = get_baseline(
        ticker="^DJI",
        start = '2020-01-01',
        end = '2020-09-30')

stats = backtest_stats(baseline_df, value_col_name = 'close')

[*********************100%***********************]  1 of 1 completed

Shape of DataFrame:  (188, 8)
Annual return         -0.065199
Cumulative returns    -0.049054
Annual volatility      0.416030
Sharpe ratio           0.046016
Calmar ratio          -0.175803
Stability              0.012240
Max drawdown          -0.370862
Omega ratio            1.009343
Sortino ratio          0.062829
Skew                        NaN
Kurtosis                    NaN
Tail ratio             0.860019
Daily value at risk   -0.052339
dtype: float64





In [60]:
df_result_a2c = df_account_value.set_index(df_account_value.columns[0])
result = pd.DataFrame(
    {
        "a2c": df_result_a2c["account_value"],
        "dji": dji["close"],
    }
)
result

Unnamed: 0_level_0,a2c,dji
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-02,1.000000e+06,
2020-01-03,9.985980e+05,1.000000e+06
2020-01-06,9.999666e+05,1.002392e+06
2020-01-07,9.986441e+05,9.982119e+05
2020-01-08,1.004843e+06,1.003848e+06
...,...,...
2020-09-23,9.324574e+05,9.346322e+05
2020-09-24,9.359755e+05,9.364587e+05
2020-09-25,9.492452e+05,9.489818e+05
2020-09-28,9.652837e+05,


In [61]:
plt.rcParams["figure.figsize"] = (15,5)
plt.figure()
result.plot()

<Axes: xlabel='date'>

In [65]:
result.plot()
plt.savefig('results.png')

Meta-Policy

In [216]:
agent = DRLAgent(env = env_train)
model_ddpg = agent.get_model("ddpg")
trained_ddpg = model_ddpg.load(config.TRAINED_MODEL_DIR + "/agent_ddpg")
model_ppo = agent.get_model("ppo")
trained_ppo = model_ppo.load(config.TRAINED_MODEL_DIR + "/agent_ppo", device='cpu')
model_a2c = agent.get_model("a2c")
trained_a2c = model_a2c.load(config.TRAINED_MODEL_DIR + "/agent_a2c", device='cpu')

{'batch_size': 128, 'buffer_size': 50000, 'learning_rate': 0.001}
Using cuda device
{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 64}
Using cuda device
{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cuda device




In [217]:
def DRL_prediction(model, environment, deterministic=True):
        """make a prediction and get results"""
        test_env, test_obs = environment.get_sb_env()
        # print(test_obs)
        account_memory = None  # This help avoid unnecessary list creation
        actions_memory = None  # optimize memory consumption
        # state_memory=[] #add memory pool to store states
        state_memory = None
        rewards_memory = []

        test_env.reset()
        max_steps = len(environment.df.index.unique()) - 1

        for i in range(len(environment.df.index.unique())):
            action, _states = model.predict(test_obs, deterministic=deterministic)
            # account_memory = test_env.env_method(method_name="save_asset_memory")
            # actions_memory = test_env.env_method(method_name="save_action_memory")
            next_obs, rewards, dones, info = test_env.step(action)
            rewards_memory.append(rewards)
            # state_memory.append(test_obs)
            test_obs = next_obs
            if (
                i == max_steps - 1
            ):  # more descriptive condition for early termination to clarify the logic
                account_memory = test_env.env_method(method_name="save_asset_memory")
                actions_memory = test_env.env_method(method_name="save_action_memory")
            # add current state to state memory
                state_memory=test_env.env_method(method_name="save_state_memory")

            if dones[0]:
                print("hit end!")
                break
        # state_columns = [f"feature_{i}" for i in range(len(state_memory[0][0]))]
        # state_memory = pd.DataFrame([i[0] for i in state_memory], columns=state_columns)
        return account_memory[0], actions_memory[0], state_memory[0]

In [218]:
def get_agent_predictions(model, environment):
    return DRL_prediction(
        model=model,
        environment = environment)


In [219]:
import gym
from gymnasium import spaces

class CustomStockTradingEnv(StockTradingEnv):
    def __init__(self, df, **kwargs):
            # Call the parent constructor
            super().__init__(df=df, **kwargs)
            self.action_space = spaces.Box(
            low=-self.hmax, 
            high=self.hmax, 
            shape=(self.stock_dim,), 
            dtype=np.float32
        )
    def save_state_memory(self):
        """Save state memory dynamically without hard-coded column names."""
        if len(self.df.tic.unique()) > 1:
            # Get the date memory (excluding the last one)
            date_list = self.date_memory[:-1]

            # Ensure state_list is not empty
            state_list = self.state_memory
            # print(len(state_list[0]))
            if not state_list:
                print("Warning: state_memory is empty!")
                return pd.DataFrame()  # Return an empty DataFrame if no data exists

            # Ensure each state has the same structure
            try:
                num_features = len(state_list[0])  # Get number of features from the first state
                state_columns = [f"feature_{i}" for i in range(num_features)]  # Generate feature names
            except IndexError:
                print("Error: state_memory contains empty entries.")
                return pd.DataFrame()  # Return an empty DataFrame if there's an issue

            # Ensure the length of date_list and state_list match
            # print(len(date_list), len(state_list))
            # min_length = min(len(date_list), len(state_list))
            # date_list = date_list[:min_length]
            # state_list = state_list[:min_length]

            df_date = pd.DataFrame(date_list, columns=["date"])
            df_states = pd.DataFrame(state_list, columns=state_columns)
            df_states.index = df_date["date"]  # Set index to match dates
        else:
            date_list = self.date_memory[:-1]
            state_list = self.state_memory
            df_states = pd.DataFrame({"date": date_list, "states": state_list})

        return df_states
    def step(self, action):
        # Clip actions to feasible shares based on current holdings
        for i in range(self.stock_dim):
            current_shares = self.state[i * 2 + 1]
            action[i] = np.clip(action[i], -current_shares, self.hmax)
        
        return super().step(action)
    



In [135]:
# Get predictions and save account values, actions, rewards, and states
ddpg_account_value, ddpg_actions, ddpg_states = get_agent_predictions(
    trained_ddpg, CustomStockTradingEnv(df=trade, **env_kwargs)
)



ppo_account_value, ppo_actions, ppo_rewards, ppo_states = get_agent_predictions(
    trained_ppo, StockTradingEnv(df=trade, **env_kwargs)
)

a2c_account_value, a2c_actions, a2c_rewards, a2c_states = get_agent_predictions(
    trained_a2c, StockTradingEnv(df=trade, **env_kwargs)
)


hit end!


In [139]:
ddpg_account_value.head()

Unnamed: 0,date,account_value
0,2020-01-02,1000000.0
1,2020-01-03,997871.2
2,2020-01-06,998134.9
3,2020-01-07,993927.3
4,2020-01-08,1000095.0


In [140]:
ddpg_actions.head()

Unnamed: 0_level_0,AAPL,AMGN,AXP,BA,CAT,CRM,CSCO,CVX,DIS,GS,...,MRK,MSFT,NKE,PG,TRV,UNH,V,VZ,WBA,WMT
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-02,0,0,100,0,0,0,100,100,100,100,...,100,0,0,100,100,100,100,100,100,100
2020-01-03,0,0,100,0,0,0,100,100,100,100,...,100,0,0,100,100,100,100,100,100,100
2020-01-06,0,0,100,0,0,0,100,100,100,100,...,100,0,0,100,100,100,100,100,100,100
2020-01-07,0,0,100,0,0,0,100,100,100,100,...,100,0,0,100,100,100,100,100,100,100
2020-01-08,0,0,0,0,0,0,100,100,100,100,...,0,0,0,0,0,0,0,0,100,100


In [141]:
ddpg_states.head()

Unnamed: 0_level_0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_281,feature_282,feature_283,feature_284,feature_285,feature_286,feature_287,feature_288,feature_289,feature_290
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-01-02,560817.287831,72.08831,204.705475,116.450272,330.791901,132.522903,165.247025,40.779331,96.888985,144.71228,...,70.479639,141.683122,89.297288,107.437986,120.683839,247.389214,174.77627,45.044702,44.064002,36.784958
2020-01-03,341939.827378,72.662704,206.276382,115.945587,331.766083,132.43364,172.486588,40.924873,96.560699,143.872635,...,70.58926,142.022018,89.447571,107.460597,120.634986,248.478179,175.011198,45.057752,44.164049,36.781098
2020-01-06,124393.355856,72.320992,204.336334,115.33812,335.285156,130.683777,175.022415,40.659473,95.327667,143.922028,...,70.671277,142.324136,89.589535,107.472662,120.55231,249.514437,175.238082,45.054893,44.249602,36.76813
2020-01-07,8.678264,73.484337,204.490891,117.328796,329.410095,131.844421,176.345047,40.685143,94.238762,143.625687,...,70.737247,142.657051,89.714067,107.504708,120.475839,250.670922,175.481917,45.052159,44.277977,36.749868
2020-01-08,7.056161,75.045212,205.100342,119.450317,334.350708,131.514084,178.602432,40.51392,94.086655,143.062653,...,70.809116,143.023915,89.821805,107.571505,120.40687,251.824222,175.742331,45.050449,44.295805,36.743282


In [220]:
import torch
import torch.nn as nn
import torch.optim as optim

In [221]:
class MetaPolicy(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, hmax=100):
        """
        :param hmax: Maximum shares allowed (used for action scaling)
        """
        super().__init__()
        self.hmax = hmax
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Tanh()  # Outputs in [-1, 1] for easy scaling
        )
    
    def forward(self, x):
        # Outputs in [-1, 1], scaled to [-hmax, hmax]
        return self.net(x) * self.hmax

In [222]:
meta_env = CustomStockTradingEnv(df=trade, **env_kwargs)
num_stocks  = meta_env.action_space.shape[0]
state_features_dim = meta_env.state_space
num_base_models = 3
input_dim = num_base_models * num_stocks + state_features_dim
output_dim = num_stocks
hidden_dim = 64 

In [223]:
meta_policy_model = MetaPolicy(input_dim, hidden_dim, output_dim)

In [224]:
criterion = nn.MSELoss()
optimizer = optim.Adam(meta_policy_model.parameters(), lr=1e-3)

In [225]:
def get_base_actions(current_state, models):
    """Get actions from base models for the CURRENT state."""
    base_actions = []
    for model in models:
        # Assuming models use Stable Baselines3-style prediction
        action, _ = model.predict(current_state, deterministic=True)
        base_actions.append(action)
    return np.concatenate(base_actions)  # Shape: (num_models * num_stocks,)

In [226]:
from collections import deque
import random

# Initialize replay buffer with a maximum capacity (e.g., 10,000 transitions)
replay_buffer = deque(maxlen=10000)
batch_size = 64  # Number of transitions to sample per training step

In [None]:
for episode in range(100):
    state = meta_env.reset()[0]
    done = False
    
    while not done:
        # 1. Get base model predictions for CURRENT state
        base_actions = []
        for model in [trained_ddpg, trained_ppo, trained_a2c]:
            action, _ = model.predict(state)  # Use current state
            base_actions.append(action)
        base_actions = np.concatenate(base_actions)
        
        # 2. Prepare meta-policy input
        input_vector = np.concatenate([base_actions, state])
        input_tensor = torch.FloatTensor(input_vector).unsqueeze(0)
        
        # 3. Predict action (e.g., [-50, 0, 100])
        with torch.no_grad():
            predicted_action = meta_policy_model(input_tensor)
        action = predicted_action.squeeze().numpy()
        # print(action.shape)
        # 4. Take action in the environment
        next_state, reward, done, info, _ = meta_env.step(action)
        # print(len(next_state))
        # 5. Store transition in replay buffer
        replay_buffer.append((input_vector, action, reward, next_state, done))
        
        # 6. Batch training (only if buffer has enough samples)
        if len(replay_buffer) >= batch_size:
            # Sample a batch of transitions
            batch = random.sample(replay_buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            # print(len(states[0]))
            # Convert to tensors
            states_tensor = torch.FloatTensor(np.array(states))
            actions_tensor = torch.FloatTensor(np.array(actions))
            rewards_tensor = torch.FloatTensor(np.array(rewards))
            
            # 7. Calculate loss (policy gradient)
            # print(states_tensor.shape)
            predicted_actions = meta_policy_model(states_tensor)
            # print(predicted_action.shape)
            loss = -torch.mean(predicted_actions * rewards_tensor.unsqueeze(-1))  # Maximize reward
            
            # print(f"Episode {episode} | Loss: {loss.item()}")
            
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        state = next_state if not done else None

(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,)
(29,

KeyboardInterrupt: 

In [236]:
state = meta_env.reset()[0]
state = np.array(state)

In [237]:
state.shape

(291,)

In [229]:
base_actions = []
for model in [trained_ddpg, trained_ppo, trained_a2c]:
    action, _ = model.predict(state)  # Use current state
    base_actions.append(action)
base_actions = np.concatenate(base_actions)

In [231]:
base_actions.shape

(87,)

In [232]:
input_vector = np.concatenate([base_actions, state])
input_tensor.shape

torch.Size([1, 378])

In [235]:
input_tensor = torch.FloatTensor(input_vector).unsqueeze(0)
input_tensor.shape

torch.Size([1, 378])

In [172]:
state = meta_env.reset()
action, _ = model.predict(state[0])
meta_env.step(action)

([868295.4193400651,
  72.08831024169922,
  204.70547485351562,
  116.45027160644531,
  330.7919006347656,
  132.5229034423828,
  165.2470245361328,
  40.77933120727539,
  96.88898468017578,
  144.7122802734375,
  204.90480041503906,
  194.07896423339844,
  161.28224182128906,
  101.77963256835938,
  53.01372528076172,
  125.5931167602539,
  119.87416076660156,
  46.837806701660156,
  178.208740234375,
  121.90216064453125,
  74.7109146118164,
  151.71771240234375,
  96.2235336303711,
  107.690673828125,
  122.58610534667969,
  269.0248718261719,
  183.0200653076172,
  45.04258728027344,
  44.919612884521484,
  36.457855224609375,
  0,
  0,
  11,
  0,
  0,
  93,
  0,
  0,
  0,
  100,
  0,
  0,
  0,
  0,
  100,
  100,
  100,
  82,
  100,
  85,
  0,
  0,
  0,
  0,
  100,
  0,
  32,
  0,
  69,
  2.1689951430842456,
  3.1252949474499587,
  1.319575344388781,
  -6.685749959383145,
  1.5064351479882134,
  1.5934533068988515,
  0.5808899371309337,
  0.6562985473616578,
  0.6242537354758042,
 

In [194]:
def calculate_sharpe(returns, risk_free_rate=0.0):
    """Annualized Sharpe ratio."""
    excess_returns = returns - risk_free_rate / 252
    return np.sqrt(252) * excess_returns.mean() / excess_returns.std()

def calculate_max_drawdown(account_value):
    """Maximum drawdown (peak-to-trough decline)."""
    peak = np.maximum.accumulate(account_value)
    trough = np.minimum.accumulate(account_value)
    return np.min((trough - peak) / peak)

def plot_performance(account_values, labels):
    """Plot cumulative returns of multiple strategies."""
    plt.figure(figsize=(12, 6))
    for values, label in zip(account_values, labels):
        plt.plot(values, label=label)
    plt.title("Portfolio Value Over Time")
    plt.xlabel("Time Step")
    plt.ylabel("Account Value ($)")
    plt.legend()
    plt.show()

In [199]:
def backtest_ensemble(
    env, 
    base_models,  # List: [ddpg_model, ppo_model, a2c_model]
    meta_model, 
    initial_amount=1_000_000,
    verbose=True
):
    """
    Simulate trading with the trained ensemble model on historical data.
    Returns portfolio values, actions, and performance metrics.
    """
    # Initialize environment
    state = env.reset()[0]
    account_value = [initial_amount]
    actions = []
    done = False

    while not done:
        # Get base model predictions for current state
        base_actions = []
        for model in base_models:
            action, _ = model.predict(state, deterministic=True)
            base_actions.append(action)
        base_actions = np.concatenate(base_actions)

        # Combine with state features
        input_vector = np.concatenate([base_actions, state])
        input_tensor = torch.FloatTensor(input_vector).unsqueeze(0)

        # Meta-policy prediction
        with torch.no_grad():
            meta_action = meta_model(input_tensor).squeeze().numpy()

        # Execute action in environment
        next_state, reward, done, info, _ = env.step(meta_action)

        # Record results
        portfolio_value = env.get_portfolio_value()
        account_value.append(portfolio_value)
        actions.append(meta_action.copy())
        state = next_state

    # Calculate metrics
    returns = pd.Series(account_value).pct_change().fillna(0)
    sharpe_ratio = calculate_sharpe(returns)
    max_drawdown = calculate_max_drawdown(account_value)

    return {
        "account_value": account_value,
        "actions": np.array(actions),
        "sharpe_ratio": sharpe_ratio,
        "max_drawdown": max_drawdown,
        "returns": returns
    }

In [200]:
# Load trained models
base_models = [trained_ddpg, trained_ppo, trained_a2c]
meta_model = meta_policy_model  # Your trained DNN meta-policy

# Initialize test environment with historical data
test_env = StockTradingEnv(df=trade, **env_kwargs)

# Back-test ensemble
ensemble_results = backtest_ensemble(test_env, base_models, meta_model)

# Back-test base models for comparison
base_results = {}
for name, model in zip(["DDPG", "PPO", "A2C"], base_models):
    test_env.reset()
    account_value, _ = DRLAgent.DRL_prediction(model, test_env)
    base_results[name] = {
        "account_value": account_value,
        "sharpe_ratio": calculate_sharpe(pd.Series(account_value).pct_change()),
        "max_drawdown": calculate_max_drawdown(account_value)
    }

# Compare with S&P 500 (example benchmark)
benchmark_returns = trade.groupby('date')['close'].first().pct_change().fillna(0)
benchmark_value = 1_000_000 * (1 + benchmark_returns).cumprod()

AttributeError: 'StockTradingEnv' object has no attribute 'get_portfolio_value'

In [None]:
# Plot performance
plot_performance(
    [ensemble_results["account_value"], 
     base_results["DDPG"]["account_value"],
     base_results["PPO"]["account_value"],
     benchmark_value],
    labels=["Meta-Policy", "DDPG", "PPO", "S&P 500"]
)

# Print metrics
print(f"Meta-Policy Sharpe: {ensemble_results['sharpe_ratio']:.2f}")
print(f"DDPG Sharpe: {base_results['DDPG']['sharpe_ratio']:.2f}")
print(f"PPO Sharpe: {base_results['PPO']['sharpe_ratio']:.2f}")
print(f"S&P 500 Sharpe: {calculate_sharpe(benchmark_returns):.2f}")