# Functional Prototype Demonstration 2 

Team Epsilon-Greedy Quants <br/>
Michael Lee, Nikat Patel, Jose Antonio Alatorre Sanchez

## What did we do in this milestone?

We implemented two policy-gradient algorithms: REINFORCE with Baseline and Actor-Critic. We also refactored our model code to work in PyTorch.

## Presentation Overview
- REINFORCE with Baseline Summary
- Actor-Critic Summary
- Discussion of Problems Encountered
- Code Documentation/Organization
- Next Steps


## REINFORCE with Baseline Summary

#### Policy Gradient Method 
- Estimates the policy directly, rather than deriving it from an action-value function
- Works with continuous action spaces


#### REINFORCE 
- Performance gradient under the Policy-Gradient Theorem: $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_{\pi}(s,a) \nabla \pi(a|s, \theta)$
- Relies on returns estimated by the Monte Carlo method
- Uses sampled episodes to update the policy parameter $\theta$
- **High variance results in slow learning**

#### REINFORCE with Baseline

- Compares the action-value to an arbitrary baseline $b(s)$
    - Performance gradient under the Policy-Gradient Theorem: $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a (q_{\pi}(s,a) - b(s)) \nabla \pi(a|s, \theta)$
    - The baseline can be any function or random variable, as long as it does not vary with the action $a$
    - Commonly used baseline: the state-value function $\hat{v}(S_t,w)$
- A baseline does not change the expected value of the update but **can significantly reduce its variance** (which speeds up learning)
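
The claim that a baseline leaves the expected update unchanged can be checked in one line. Because the policy probabilities sum to one in every state, the subtracted term vanishes:

$$\sum_a b(s) \nabla \pi(a|s,\theta) = b(s) \nabla \sum_a \pi(a|s,\theta) = b(s) \nabla 1 = 0$$

The step of pulling $b(s)$ out of the sum is exactly where the requirement that the baseline not depend on the action is used.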

#### REINFORCE with Baseline Algorithm Steps
From Sutton and Barto, Section 13.4

**Steps:**
- Initialize the policy parameter **$\theta$** and state-value weights **w** at random.
- Loop forever (for each episode):
    - Generate one episode using policy $\pi_{\theta}: S_0,A_0,R_1,...,S_{T-1},A_{T-1},R_T$.
    - Loop for each step of the episode t=0,1,...,T-1:
        - Estimate the return: $G \leftarrow \sum_{k=t+1}^T \gamma^{k-t-1} R_k$
        - **Calculate the delta between $G$ and the baseline function: $\delta \leftarrow G - \underbrace{\hat{v}(S_t,w)}_{baseline}$**
        - **Update the state-value weights: $w \leftarrow w + \alpha^w \delta \nabla \hat{v}(S_t, w)$**
        - Update the policy parameters: $\theta \leftarrow \theta + \alpha^{\theta} \gamma^t \delta \nabla \ln \pi(A_t|S_t,\theta)$
            - $\alpha^{\theta}$: step size
            - $\gamma$: discount factor
            - $\nabla \ln \pi(A_t|S_t,\theta)$: eligibility vector, the gradient of the log-probability of taking action $A_t$ in state $S_t$ under policy $\pi_{\theta}$
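
The steps above can be sketched in plain Python. This is a minimal illustration and not our PyTorch implementation: it assumes a tabular softmax policy (`theta[s][a]` holds action preferences) and a tabular baseline (`w[s]`), and the names `discounted_returns`, `reinforce_baseline_update`, and `run_toy_bandit` are hypothetical helpers introduced only for this example.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """G_t = sum_{k=t+1}^{T} gamma^(k-t-1) R_k, accumulated backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def reinforce_baseline_update(theta, w, episode, gamma, alpha_theta, alpha_w):
    """Apply one episode of REINFORCE-with-baseline updates in place.

    episode is a list of (state, action, reward) tuples, where the reward
    is R_{t+1}, the reward received after taking the action.
    """
    rewards = [r for _, _, r in episode]
    returns = discounted_returns(rewards, gamma)
    for t, ((s, a, _), g) in enumerate(zip(episode, returns)):
        delta = g - w[s]              # return minus baseline
        w[s] += alpha_w * delta       # tabular v_hat: gradient w.r.t. w[s] is 1
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # Gradient of ln pi(a|s) w.r.t. preference b for a softmax policy.
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha_theta * (gamma ** t) * delta * grad_log


def run_toy_bandit(episodes=2000, seed=0):
    """Toy check: one state, two actions, only action 1 pays a reward of 1."""
    rng = random.Random(seed)
    theta, w = [[0.0, 0.0]], [0.0]
    for _ in range(episodes):
        probs = softmax(theta[0])
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == 1 else 0.0
        reinforce_baseline_update(
            theta, w, [(0, action, reward)],
            gamma=1.0, alpha_theta=0.1, alpha_w=0.1,
        )
    return softmax(theta[0])[1]  # learned probability of the rewarding action
```

Calling `run_toy_bandit()` returns the learned probability of the rewarding action, which should move toward 1 as training proceeds. Note that `delta` is computed before `w` is updated, matching the order of the steps listed above.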


## Performance

REINFORCE with Baseline converges 2x faster than REINFORCE given the same constraints.

## Performance Metrics

* In our Benchmarks and Benchmark Utility scripts, we have established several functions that we will use to gauge our models' performance.

```
def compute_daily_returns(asset_prices):
	...
def compute_covariance_matrix(returns):
	...
def compute_expected_portfolio_variance(cov_matrix, weights):
	...
def compute_expected_portfolio_volatility(portfolio_variance):
	...
def compute_annual_return(returns, weights):
	...
def get_normalized_returns(asset_prices):
	...
def compute_sharpe_ratio(asset_allocation):
	...
def compute_annualized_sharpe_ratio(asset_allocation):
	...
def compute_rolling_sharpe_ratio(asset_allocation, window=30):
	...
def compute_rolling_returns(asset_allocation, returns_type, window=30):
	...
def compute_rolling_volatility(asset_allocation, window=30):
	...
```
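
To illustrate the kind of calculation behind these metrics, here is a minimal sketch of a Sharpe ratio computation using only the standard library. It is not the code in `Benchmarks.py`: the function names, the `risk_free_rate` parameter, and the `periods_per_year=252` default are assumptions made for this example.

```python
import math
import statistics


def daily_returns(prices):
    """Simple period-over-period returns from a list of prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free_rate for r in returns]
    vol = statistics.stdev(excess)
    if vol == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / vol


def annualized_sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Scale the per-period Sharpe ratio by sqrt(periods per year)."""
    return math.sqrt(periods_per_year) * sharpe_ratio(returns, risk_free_rate)
```

The annualization factor depends on the sampling frequency of the returns, so `periods_per_year` would need to match the data actually used.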


## Version Control Repository

* We have created a private GitHub repository that contains all our data, documentation, code, and notebooks.
* We use version control to develop, update, and collaborate on our work.


```
├── data
│   ├── ACWF_ETF__AMEX_1m
│   ├── EEMV_ETF__AMEX_1m
│   ├── EFAV_ETF__AMEX_1m
│   ├── IAU_ETF__AMEX_1m
│   ├── IEF_ETF__NASDAQ_1m
│   ├── IMTM_ETF__AMEX_1m
│   ├── IQLT_ETF__AMEX_1m
│   ├── IVLU_ETF__AMEX_1m
│   ├── LQD_ETF__AMEX_1m
│   ├── MTUM_ETF__AMEX_1m
│   ├── QUAL_ETF__AMEX_1m
│   ├── USMV_ETF__AMEX_1m
│   ├── UUP_ETF__AMEX_1m
│   └── VLUE_ETF__AMEX_1m
├── data_env
│   ├── ief.parquet
│   └── spy.parquet
├── lib
│   ├── Benchmarks.py
│   ├── DataHandling.py
│   ├── Environment.py
│   └── __init__.py
├── notebooks
│   ├── Benchmark_EDA.ipynb
│   ├── Environment.ipynb
│   ├── Environment_Reinforce.ipynb
│   ├── REINFORCE.ipynb
│   ├── data_handler_test.ipynb
│   ├── environment_data_presentation.ipynb
│   ├── environment_gymai.ipynb
│   └── sharpe_sample.ipynb
├── sprints_capstone.docx
├── temp_persisted_data
│   ├── forward_return_dates_simulation_gbm
│   ├── only_features_simulation_gbm
│   └── only_forward_returns_simulation_gbm
└── tests
    ├── TestLinearAgent.py
    └── __init__.py
```

## Next Steps

* Implement the Sortino ratio as the reward to control drawdown.
* Increase the weight of negative rewards if drawdown reaches a certain level.
* Test our models on real-world data.
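
As a rough sketch of what the first two items could look like, the snippet below computes a Sortino ratio, a maximum drawdown, and a drawdown-penalized reward. None of this is implemented yet: the function names, the `target` return, the `threshold` of 0.1, and the `penalty_weight` of 2.0 are placeholder assumptions, not chosen values.

```python
import math


def sortino_ratio(returns, target=0.0):
    """Mean excess return over the downside deviation below the target."""
    excess = [r - target for r in returns]
    downside_sq = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0:
        return float("nan")  # undefined when no return falls below target
    return (sum(excess) / len(excess)) / downside_dev


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def drawdown_penalized_reward(step_return, drawdown,
                              threshold=0.1, penalty_weight=2.0):
    """Amplify negative rewards once drawdown reaches the threshold."""
    if step_return < 0 and drawdown >= threshold:
        return penalty_weight * step_return
    return step_return
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only returns below the target, which is why it is a natural candidate for a reward that discourages drawdown.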
