# INTRODUCTION

## Libraries

In [1]:
# Basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
from collections import defaultdict

# OpenAI gym
import gym

## References

\[1\] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba, *OpenAI Gym*, 2016, [arXiv:1606.01540](https://arxiv.org/abs/1606.01540), [GitHub:openai/gym](https://github.com/openai/gym).
  


# DEFINITION OF ENVIRONMENTS

## Introduction

As we said, an environment comprises everything outside the agent, so its implementation must contain all the functionality the agent needs to interact with it and to learn from it. We are going to implement this environment according to the guidelines of OpenAI gym \[1\], one of the most widely used libraries for developing reinforcement learning applications.

The environments are defined as a class with the following methods:
* __\_\_init\_\_(self)__: define the initial information of the environment, such as the observation space and the action space.
* __reset(self)__: reset the environment's state.
* __step(self, action)__: step the environment by one timestep. Returns:
    * _observation_ (object): an environment-specific object representing our observation of the environment. 
    * _reward_ (float): amount of reward achieved by the previous action.
    * _done_ (boolean): whether it’s time to reset the environment again.
    * _info_ (dict): diagnostic information useful for debugging. 
* __render(self, mode='human')__: render one frame of the environment.
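As a minimal sketch of this four-method layout, the toy class below counts time steps and hands out a placeholder reward. The class name `MinimalEnvironment`, the `n_steps` parameter, and the reward rule are all hypothetical and exist only to show how `reset` and `step` fit together in an interaction loop; it does not depend on gym itself.

```python
import random


class MinimalEnvironment:
    """Hypothetical toy environment following the four-method layout."""

    def __init__(self, n_steps=5):
        # Describe what the agent can do and how long an episode lasts.
        self.n_steps = n_steps
        self.actions = (0, 1, 2)
        self.reset()

    def reset(self):
        # Return the environment to its initial state and
        # hand back the first observation.
        self.t = 0
        return self.t

    def step(self, action):
        if action not in self.actions:
            raise ValueError(f"unknown action: {action}")
        self.t += 1
        observation = self.t
        reward = 1.0 if action == 0 else 0.0  # placeholder reward rule
        done = self.t >= self.n_steps
        info = {"t": self.t}
        return observation, reward, done, info

    def render(self, mode="human"):
        print(f"t={self.t}")


# Typical interaction loop: reset once, then step until done.
env = MinimalEnvironment(n_steps=3)
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice(env.actions)
    observation, reward, done, info = env.step(action)
    total_reward += reward
```

The loop at the end is the pattern an agent would follow with any environment written this way: call `reset` once, then call `step` repeatedly until `done` is true.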

## init

The first thing we need is to define how an agent should perceive its environment; to do so, we consider how a human would perform the task. In this sense, we can think about which kinds of observations a human uses when trading:
* Stock prices: the historical price data of a stock (open, high, low, close, volume) are the principal source of information for making decisions.
* Portfolio status: account balance, stock positions, and profit.
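One possible way to combine these two sources into a single observation is sketched below. The function name `build_observation`, the `window` parameter, the padding rule for the first steps, and the toy price rows are assumptions made for illustration; they are not taken from the environment implemented later.

```python
def build_observation(price_rows, t, window, balance, shares_held, profit):
    """Flatten the last `window` price rows up to step t, then append
    the portfolio status (balance, shares held, profit)."""
    start = max(0, t - window + 1)
    recent = price_rows[start:t + 1]
    # Left-pad with the earliest available row so the length is constant.
    padding = [recent[0]] * (window - len(recent))
    flat_prices = [value for row in padding + recent for value in row]
    return flat_prices + [balance, shares_held, profit]


# Made-up rows of (open, high, low, close, volume), for illustration only.
rows = [
    (10.0, 11.0, 9.0, 10.5, 1000.0),
    (10.5, 12.0, 10.0, 11.5, 1200.0),
    (11.5, 11.8, 10.8, 11.0, 900.0),
]
obs = build_observation(rows, t=2, window=2,
                        balance=10000.0, shares_held=0, profit=0.0)
```

With five values per row, a window of two rows, and three portfolio values, the observation always has 13 entries, which keeps the input size fixed for whatever model consumes it.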



The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action.

Once a trader has perceived their environment, they need to take an action. In our agent’s case, its action_space will consist of three possibilities: buy a stock, sell a stock, or do nothing.

But this isn’t enough; we need to know the amount of a given stock to buy or sell each time. Using gym’s Box space, we can create an action space that has a discrete number of action types (buy, sell, and hold), as well as a continuous spectrum of amounts to buy/sell (0-100% of the account balance/position size respectively).

You’ll notice the amount is not necessary for the hold action, but will be provided anyway. Our agent does not initially know this, but over time should learn that the amount is extraneous for this action.

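To make the two-part action concrete, here is a minimal sketch of how an action of the form `(action_type, amount)` could be decoded and applied. The encoding (0 = buy, 1 = sell, 2 = hold), the function name `apply_action`, and the whole-share rounding are assumptions; the text above fixes only that there are three action types and an amount between 0% and 100%. With gym, such an action could come from a two-element `Box` whose first entry selects the type and whose second entry is the amount.

```python
import math

BUY, SELL, HOLD = 0, 1, 2  # assumed encoding, not fixed by the text


def apply_action(action, price, balance, shares_held):
    """Apply (action_type, amount) and return the new (balance, shares_held)."""
    # The first entry is continuous, so floor it to pick the action type.
    action_type = max(0, min(int(math.floor(action[0])), HOLD))
    # The second entry is the fraction to trade, clipped to 0-100%.
    amount = min(max(action[1], 0.0), 1.0)

    if action_type == BUY:
        # Spend `amount` of the cash balance on whole shares.
        shares_bought = int(balance * amount / price)
        balance -= shares_bought * price
        shares_held += shares_bought
    elif action_type == SELL:
        # Sell `amount` of the current position, in whole shares.
        shares_sold = int(shares_held * amount)
        balance += shares_sold * price
        shares_held -= shares_sold
    # HOLD: the amount is supplied but ignored.
    return balance, shares_held


after_buy = apply_action((0.4, 0.5), price=10.0, balance=1000.0, shares_held=0)
after_hold = apply_action((2.2, 0.9), price=10.0, balance=500.0, shares_held=50)
after_sell = apply_action((1.7, 1.0), price=10.0, balance=500.0, shares_held=50)
```

Note that the hold branch never reads `amount`, which is the redundancy the agent is expected to discover on its own.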
The last thing to consider before implementing our environment is the reward. We want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.

The purpose of this is to avoid rewarding the agent too quickly in the early stages and to allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those that rapidly gain money using unsustainable strategies.
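A minimal sketch of such a reward is shown below, assuming the "fraction" is the share of the episode that has elapsed. The names `delayed_reward` and `max_steps` are hypothetical, and this is one reading of the rule described above rather than a definitive formula.

```python
def delayed_reward(balance, current_step, max_steps):
    """Scale the account balance by the fraction of the episode elapsed."""
    delay_modifier = current_step / max_steps  # grows from 0 towards 1
    return balance * delay_modifier


# The same balance earns more reward later in the episode.
early = delayed_reward(balance=10000.0, current_step=10, max_steps=1000)
late = delayed_reward(balance=10000.0, current_step=900, max_steps=1000)
```

Because the modifier is close to zero at the start, early gains contribute little, while a balance that is still high near the end of the episode is rewarded almost in full.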


In [70]:
stocks = pd.read_csv('data/AAPL_2000_2019.csv', index_col = 0, parse_dates = True)
stocks.head()

| | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 2000-01-03 | 3.470226 | 3.997768 | 4.017857 | 3.631696 | 3.745536 | 133949200.0 |
| 2000-01-04 | 3.17765 | 3.660714 | 3.950893 | 3.613839 | 3.866071 | 128094400.0 |
| 2000-01-05 | 3.224152 | 3.714286 | 3.948661 | 3.678571 | 3.705357 | 194580400.0 |
| 2000-01-06 | 2.945139 | 3.392857 | 3.821429 | 3.392857 | 3.790179 | 191993200.0 |
| 2000-01-07 | 3.084645 | 3.553571 | 3.607143 | 3.410714 | 3.446429 | 115183600.0 |


In [53]:
class SingleStockEnvironment:
    '''
    Reinforcement Learning environment representing a Stock Market with a single stock.

    Attributes:
        data: array of closing prices
        history: array recording the action taken at each time step
    '''
    def __init__(self, data):
        self.data = np.array(data['Close'])
        self.reset()

    def reset(self):
        self.t = 0

        # Create an empty history
        self.history = np.zeros_like(self.data)
        self.positions = []

        # Create current values for these histories
        self.history_t = 0
        self.position = 0
        return self.history

    def step(self, action):
        self.history_t = action
        # Execute the action: the three action codes are currently
        # handled identically and are simply recorded in the history
        if action in (0, 1, 2):
            self.history[self.t] = self.history_t
        
        # Prepare the next step
        self.t += 1
