In [1]:
from lib.Environment import DeepTradingEnvironment, LinearAgent
import numpy as np
import datetime
# parameters related to the transformation of the data; these govern a preprocessing step that runs before the algorithm

out_reward_window=datetime.timedelta(days=7)
meta_parameters = {"in_bars_count": 30,
                   "out_reward_window": datetime.timedelta(days=7),
                   "state_type":"in_window_out_window"}

# parameters that are related to the objective/reward function construction
objective_parameters = {"percent_commission": .001}
print("===Meta Parameters===")
print(meta_parameters)
print("===Objective Parameters===")
print(objective_parameters)


assets_simulation_details={"asset_1":{"method":"GBM","sigma":.1,"mean":.1},
                    "asset_2":{"method":"GBM","sigma":.2,"mean":.2}}

env=DeepTradingEnvironment.build_environment_from_simulated_assets(assets_simulation_details=assets_simulation_details,
                                                                     data_hash="simulation_gbm",
                                                                     meta_parameters=meta_parameters,
                                                                     objective_parameters=objective_parameters)



===Meta Parameters===
{'in_bars_count': 30, 'out_reward_window': datetime.timedelta(days=7), 'state_type': 'in_window_out_window'}
===Objective Parameters===
{'percent_commission': 0.001}


In [2]:
#example get state data

date=env.features.index[100]
state_features, weights_on_date =env.state.get_state_on_date(target_date=date)
print("on date {} we observe {} input features".format(date,state_features.shape))
state_features

on date 2021-01-05 13:10:00 we observe (2,) input features


asset_1_log_returns    0.007406
asset_2_log_returns    0.016915
Name: 2021-01-05 13:10:00, dtype: float64

In [7]:
print("and the weights that the portfolio has up to that time")
weights_on_date

and the weights that the portfolio has up to that time


asset_1_forward_return_7_days,_0:00:00    0.926181
asset_2_forward_return_7_days,_0:00:00    0.107904
Name: 2021-01-05 13:10:00, dtype: float64

In [4]:
action_portfolio_weights = np.random.rand(2)
obs, reward, done, info = env.step(action_portfolio_weights=action_portfolio_weights,
                                                action_date=date)
reward

2021-01-05 13:10:00   -0.015374
dtype: float64

# Linear Policy Parametrization

## Continuous Policy Parametrization

### Background

Gaussian probability distribution:

$$
p(x):=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

### Gaussian Policy

$$
\pi(a|s,\boldsymbol\theta):=\frac{1}{\sigma(s,\boldsymbol\theta)\sqrt{2\pi}}\exp\left(-\frac{(a-\mu(s,\boldsymbol\theta))^2}{2\sigma(s,\boldsymbol\theta)^2}\right)
$$

#### Linear Parametrization

$$
\theta:=[\boldsymbol{\theta_{\mu}},\boldsymbol{\theta_{\sigma}}]
$$

$$
\mu(s,\boldsymbol\theta) :=\boldsymbol{\theta_{\mu}}^T \pmb{x}(s)
$$

$$
\sigma(s,\boldsymbol\theta) :=\exp(\boldsymbol{\theta_{\sigma}}^T \pmb{x}(s))
$$
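
As a minimal sketch of the linear parametrization above, the snippet below computes $\mu$ and $\sigma$ from a feature vector and draws one action. The function name `linear_gaussian_policy` and the parameter values are illustrative assumptions and are not part of `lib.Environment`; the feature vector reuses the two log returns printed earlier.

```python
import numpy as np

def linear_gaussian_policy(x, theta_mu, theta_sigma):
    """Return (mu, sigma) of the linear Gaussian policy for features x."""
    mu = float(theta_mu @ x)                 # mu = theta_mu^T x(s)
    sigma = float(np.exp(theta_sigma @ x))   # sigma = exp(theta_sigma^T x(s)) > 0
    return mu, sigma

rng = np.random.default_rng(seed=0)

# Illustrative (hypothetical) values: two features, as in the state shown earlier.
x = np.array([0.007406, 0.016915])
theta_mu = np.array([1.0, -0.5])
theta_sigma = np.array([0.0, 0.0])   # zero parameters give sigma = exp(0) = 1

mu, sigma = linear_gaussian_policy(x, theta_mu, theta_sigma)
action = rng.normal(loc=mu, scale=sigma)  # one sample a ~ N(mu, sigma^2)
```

The exponential keeps $\sigma$ strictly positive for any value of $\boldsymbol{\theta_{\sigma}}$, which is why the standard deviation is not parametrized linearly.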

#### Gradients

$$
\nabla \ln\pi(a|s,\boldsymbol\theta_\mu)=\frac{1}{\sigma(s,\boldsymbol\theta)^2}(a-\mu(s,\boldsymbol\theta))\pmb{x}(s)
$$

$$
\nabla \ln\pi(a|s,\boldsymbol\theta_\sigma)=\left(\frac{(a-\mu(s,\boldsymbol\theta))^2}{\sigma(s,\boldsymbol\theta)^2} -1\right)\pmb{x}(s)
$$
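
The two gradient expressions can be checked numerically. The sketch below implements them with NumPy and compares them against central finite differences of the log-density; all function names and numeric values are illustrative assumptions.

```python
import numpy as np

def log_prob(a, x, theta_mu, theta_sigma):
    """Log-density of action a under the linear Gaussian policy."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    return -np.log(sigma * np.sqrt(2 * np.pi)) - (a - mu) ** 2 / (2 * sigma ** 2)

def log_prob_grads(a, x, theta_mu, theta_sigma):
    """Closed-form gradients of the log-density w.r.t. theta_mu and theta_sigma."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    grad_mu = (a - mu) / sigma ** 2 * x
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x
    return grad_mu, grad_sigma

def numerical_grad(func, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function of theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (func(theta + step) - func(theta - step)) / (2 * eps)
    return grad

# Hypothetical test point.
x = np.array([0.5, -1.0, 2.0])
theta_mu = np.array([0.1, 0.2, -0.3])
theta_sigma = np.array([0.05, -0.1, 0.02])
a = 0.4

grad_mu, grad_sigma = log_prob_grads(a, x, theta_mu, theta_sigma)
num_mu = numerical_grad(lambda t: log_prob(a, x, t, theta_sigma), theta_mu)
num_sigma = numerical_grad(lambda t: log_prob(a, x, theta_mu, t), theta_sigma)

print(np.allclose(grad_mu, num_mu, atol=1e-5))
print(np.allclose(grad_sigma, num_sigma, atol=1e-5))
```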

##  Multivariate Case

In the multivariate case the policy is a $k$-dimensional Gaussian, parametrized in the following way:

$$
\pi(a|s,\theta)=\frac{\exp\left(-\frac{1}{2}(a-\mu(s,\Theta_\mu))^T\Sigma(s,\theta_\Sigma)^{-1}(a-\mu(s,\Theta_\mu))\right)}{\sqrt{(2\pi)^k|\Sigma(s,\theta_\Sigma)|}}
$$

Where

$$
\mu(s,\Theta_\mu)=\Theta f
$$

The parametrization of $\Sigma$ is defined entry by entry as

$$
\Sigma_{i,j}=\sum_{k=1}^m \theta^{\sigma}_{i,j,k}f_k
$$

Here $f=[f_1,...,f_m]$ is the vector of $m$ features and $\Theta$ is the matrix whose $i$-th row is $\theta_i^T=[\theta_{i,1}...\theta_{i,m}]$. Hence each of the individual means is a linear combination of the features.


To calculate the gradient of the log-policy we therefore need matrix calculus. First, we define

$$
u=\Theta f
$$

Then we can rewrite

$$
\ln(\pi(a|s,\theta))=-\frac{1}{2}(a-u)^T\Sigma(s,\theta_\Sigma)^{-1}(a-u) -\ln\left(\sqrt{(2\pi)^k|\Sigma(s,\theta_\Sigma)|}\right)
$$

By equation [137] from http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf, writing $g(u)$ for $\ln(\pi(a|s,\theta))$ viewed as a function of $u$:

$$
\nabla \ln(\pi(a|s,\theta_{i,k}^\mu))=Tr\left[\left(\frac{\partial g(u)}{\partial u}\right)^T \frac{\partial u}{\partial \theta_{i,k}^\mu }\right]
$$

And then by equation [86] from http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf

$$
\nabla \ln(\pi(a|s,\theta_{i,k}^\mu))=Tr\left[(\Sigma^{-1}(a-\Theta f))^T I[i]*f_k\right]
$$

Where $I[i]$ is a column vector with zeros in all rows and 1 in row $i$. This can be expressed as

$$
\nabla ln(\pi(a|s,\theta_{i,k}^\mu))=\text{row}_i[\Sigma^{-1}(a-\Theta f)] * f_k
$$
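
Collecting the entries $(i,k)$ into a matrix, the gradient with respect to $\Theta$ is the outer product of $\Sigma^{-1}(a-\Theta f)$ and $f$. The sketch below, with hypothetical function names and randomly drawn test values, checks this against central finite differences of the standard multivariate normal log-density.

```python
import numpy as np

def mvn_log_prob(a, f, Theta, Sigma):
    """Log-density of a under N(Theta f, Sigma)."""
    k = a.size
    diff = a - Theta @ f
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * quad - 0.5 * (k * np.log(2 * np.pi) + logdet)

def grad_wrt_Theta(a, f, Theta, Sigma):
    """Entry (i, k) equals row_i[Sigma^{-1}(a - Theta f)] * f_k."""
    return np.outer(np.linalg.solve(Sigma, a - Theta @ f), f)

# Hypothetical test point: k = 2 actions, m = 3 features.
rng = np.random.default_rng(0)
k, m = 2, 3
f = rng.normal(size=m)
Theta = rng.normal(size=(k, m))
a = rng.normal(size=k)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])  # fixed symmetric positive-definite covariance

analytic = grad_wrt_Theta(a, f, Theta, Sigma)

# Central finite differences, one parameter theta_{i,j} at a time.
eps = 1e-6
numeric = np.zeros_like(Theta)
for i in range(k):
    for j in range(m):
        step = np.zeros_like(Theta)
        step[i, j] = eps
        numeric[i, j] = (mvn_log_prob(a, f, Theta + step, Sigma)
                         - mvn_log_prob(a, f, Theta - step, Sigma)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```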

For the case of $\Sigma$ we can also follow https://stats.stackexchange.com/questions/27436/how-to-take-derivative-of-multivariate-normal-density and apply the chain rule to obtain

$$
\nabla \ln(\pi(a|s,\theta_{i,j,k}^\sigma))=Tr\left[\left[-\frac{1}{2}\left(\Sigma^{-1}-\Sigma^{-1}(a-\mu(s,\Theta_\mu))(a-\mu(s,\Theta_\mu))^T \Sigma^{-1}\right)\right]^T I[i=i,j=j]*f_k\right]
$$

Where $I[i=i,j=j]$ is the matrix with a 1 in position $(i,j)$ and zeros elsewhere.
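
Because the trace with a single-entry matrix picks out one element, the expression reduces to entry $(i,j)$ of $-\frac{1}{2}(\Sigma^{-1}-\Sigma^{-1}(a-\mu)(a-\mu)^T\Sigma^{-1})$ multiplied by $f_k$. The sketch below checks this numerically. It treats every $\theta^{\sigma}_{i,j,k}$ as a free parameter (symmetry of $\Sigma$ is not enforced), and the parameter values are assumptions chosen so that $\Sigma$ is symmetric positive definite at the test point, which the linear parametrization does not guarantee in general.

```python
import numpy as np

def build_sigma(theta_sigma, f):
    """Sigma_{i,j} = sum_k theta_sigma[i, j, k] * f[k]."""
    return np.einsum("ijk,k->ij", theta_sigma, f)

def mvn_log_prob(a, mu, Sigma):
    """Log-density of a under N(mu, Sigma)."""
    diff = a - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * quad - 0.5 * (a.size * np.log(2 * np.pi) + logdet)

def grad_wrt_theta_sigma(a, mu, theta_sigma, f):
    """Gradient array of shape (k, k, m): G[i, j] * f[k]."""
    Sigma_inv = np.linalg.inv(build_sigma(theta_sigma, f))
    diff = (a - mu).reshape(-1, 1)
    G = -0.5 * (Sigma_inv - Sigma_inv @ diff @ diff.T @ Sigma_inv)
    return G[:, :, None] * f[None, None, :]

# Hypothetical test point: positive features and positive-definite slices.
f = np.array([0.5, 1.0, 1.5])
base = np.array([[1.0, 0.2],
                 [0.2, 1.5]])
theta_sigma = np.stack([base * (j + 1) for j in range(f.size)], axis=-1)
a = np.array([0.3, -0.7])
mu = np.array([0.1, 0.2])

analytic = grad_wrt_theta_sigma(a, mu, theta_sigma, f)

# Central finite differences over every theta_sigma[i, j, k].
eps = 1e-6
numeric = np.zeros_like(theta_sigma)
for idx in np.ndindex(theta_sigma.shape):
    step = np.zeros_like(theta_sigma)
    step[idx] = eps
    plus = mvn_log_prob(a, mu, build_sigma(theta_sigma + step, f))
    minus = mvn_log_prob(a, mu, build_sigma(theta_sigma - step, f))
    numeric[idx] = (plus - minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```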



In [5]:
# example: sample the environment with a linear agent policy

In [6]:
linear_agent=LinearAgent(environment=env,out_reward_window_td=out_reward_window)
linear_agent.sample_env(observations=32)

TypeError: 'int' object is not subscriptable

In [None]:
# weights remain in the buffer; after a few iterations of training we can plot performance

env.state.weight_buffer.plot()
