# Reinforcement Learning visualized on Simple Stock Data

### What is reinforcement learning?

Reinforcement learning (RL) is a machine learning paradigm in which an "agent" learns to make decisions by interacting with an environment. The agent takes actions in the environment and receives feedback in the form of rewards or penalties, indicating the success or failure of its actions. The goal of the agent is to maximize the total cumulative reward it receives over time.

The RL process is usually formalized as a Markov decision process (MDP). The main components are:

- agent: The learner and decision maker that interacts with the environment
- environment: The external system which the agent interacts with
- state: A representation of a particular moment in time, capturing all relevant information for the agent
- action: One of the set of decisions available to the agent in a given state
- reward: A scalar value that the environment provides as feedback after each action. It represents the immediate positive or negative consequences of the agent's action.
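
The interaction between these components can be sketched as a simple loop. The example below is a toy illustration only: `CoinFlipEnv`, `run_episode`, and the constant policy are hypothetical names invented here, not part of the dataset or of any RL library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip each step."""

    def __init__(self, n_steps=10, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # state: the current time step

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else -1.0  # scalar feedback
        self.t += 1
        done = self.t >= self.n_steps
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


# A fixed policy that always guesses 0
total = run_episode(CoinFlipEnv(), policy=lambda state: 0)
print(total)
```

A learning agent would replace the fixed policy with one that is updated from the observed rewards.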

### About the dataset that we will be working with

We will use a finance dataset that describes how a single stock (GME) behaved over roughly two decades, from 2002 to 2021. This data is not only insightful but also simple to work with. We will focus our attention on the opening and closing prices (the `open_price` and `close_price` columns), since they summarize how the stock moved on each trading day.

#### Let's read the dataset into our notebook

1. Import the pandas package

In [1]:
import pandas as pd

2. Read the .csv file into a DataFrame using pandas

In [3]:
data = pd.read_csv("GME_stock.csv")
print(data)

            date  open_price  high_price   low_price  close_price  \
0     2021-01-28  265.000000  483.000000  112.250000   193.600006   
1     2021-01-27  354.829987  380.000000  249.000000   347.510010   
2     2021-01-26   88.559998  150.000000   80.199997   147.979996   
3     2021-01-25   96.730003  159.179993   61.130001    76.790001   
4     2021-01-22   42.590000   76.760002   42.320000    65.010002   
...          ...         ...         ...         ...          ...   
4768  2002-02-20    9.600000    9.875000    9.525000     9.875000   
4769  2002-02-19    9.900000    9.900000    9.375000     9.550000   
4770  2002-02-15   10.000000   10.025000    9.850000     9.950000   
4771  2002-02-14   10.175000   10.195000    9.925000    10.000000   
4772  2002-02-13    9.625000   10.060000    9.525000    10.050000   

           volume  adjclose_price  
0      58815800.0      193.600006  
1      93396700.0      347.510010  
2     178588000.0      147.979996  
3     177874000.0       76.

### How do we apply RL to this dataset?

To apply RL, we first need to formulate a problem on this dataset. That involves the following steps:

1. Define the environment
2. Implement the agent
3. Training
4. Evaluation
5. Tuning
6. Considerations

#### Defining the Environment:

- Define the state space: This could include indicators such as moving averages and other informative signals
- Define the action space: The actions the agent can "take", in this context "buy", "sell", and "hold"
- Define the reward function: The reward function should be designed to reward profitable decisions and penalize unprofitable ones. The reward could be based on the change in portfolio value or other performance metrics.
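
As a minimal sketch of these three definitions, the snippet below builds moving-average features with pandas, names the actions, and computes a reward from the change in portfolio value. The names `ACTIONS`, `build_state`, and `portfolio_reward` are hypothetical, and the prices are made up for illustration.

```python
import pandas as pd

# Action space: integer codes mapped to their meaning (an assumed encoding)
ACTIONS = {0: "buy", 1: "sell", 2: "hold"}


def build_state(prices: pd.Series, short_window: int, long_window: int) -> pd.DataFrame:
    """State features: the price plus short and long moving averages."""
    return pd.DataFrame({
        "price": prices,
        "short_ma": prices.rolling(short_window).mean(),
        "long_ma": prices.rolling(long_window).mean(),
    })


def portfolio_reward(prev_value: float, new_value: float) -> float:
    """Reward: change in portfolio value over one step."""
    return new_value - prev_value


prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])  # made-up prices
features = build_state(prices, short_window=2, long_window=3)
print(features)
```

The first rows of each moving average are NaN until the window is full, so an environment would need to skip or pad those steps.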

#### Implement the agent:

- Choose a suitable Reinforcement Learning algorithm such as Q-learning, Deep Q Networks (DQNs), Proximal Policy Optimization (PPO), etc.
- Define the neural network architecture for the agent if using DQNs or other deep RL methods.
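
To make the simplest of these options concrete, here is a sketch of the tabular Q-learning update rule. It assumes states and actions are small integers indexing a table; `q_learning_update`, `alpha` (learning rate), and `gamma` (discount factor) are illustrative names and values, not settings taken from this notebook.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the table."""
    # Target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a fraction alpha toward the target
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table


q = np.zeros((3, 2))  # 3 states, 2 actions
q = q_learning_update(q, state=0, action=1, reward=1.0, next_state=2)
print(q)
```

DQN follows the same idea but replaces the table with a neural network that estimates the Q-values.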

#### Training: 

- Train the agent on historical financial data. The agent interacts with the environment, observes states, takes actions, and receives rewards during training.
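
During training, the agent has to balance trying new actions against using what it has learned so far. One common way to do this is epsilon-greedy action selection, sketched below. The function name and the example Q-values are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit


rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.8])  # made-up value estimates for two actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
print(greedy_action)
```

Training code typically starts with a high `epsilon` and lowers it as the value estimates become more reliable.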


#### Evaluation:

- Evaluate the trained agent on a separate test dataset to assess its performance in a more realistic scenario.
- Consider using metrics like the final portfolio value, annualized return, Sharpe ratio, or other performance measures to evaluate the agent's effectiveness.
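
Two of these metrics can be computed from a series of portfolio values, as in the sketch below. It assumes daily values and 252 trading days per year; the function names and the sample values are illustrative only.

```python
import numpy as np

TRADING_DAYS = 252  # assumption: daily data, 252 trading days per year


def annualized_return(values, periods_per_year=TRADING_DAYS):
    """Compound growth rate per year implied by the first and last value."""
    values = np.asarray(values, dtype=float)
    n_periods = len(values) - 1
    return (values[-1] / values[0]) ** (periods_per_year / n_periods) - 1.0


def sharpe_ratio(values, risk_free_rate=0.0, periods_per_year=TRADING_DAYS):
    """Annualized mean excess return divided by its standard deviation."""
    values = np.asarray(values, dtype=float)
    returns = values[1:] / values[:-1] - 1.0
    excess = returns - risk_free_rate / periods_per_year
    std = excess.std(ddof=1)
    if std == 0:
        return float("nan")  # undefined when returns never vary
    return float(np.sqrt(periods_per_year) * excess.mean() / std)


portfolio = [10000.0, 10100.0, 10300.0, 10400.0]  # made-up portfolio values
print(annualized_return(portfolio), sharpe_ratio(portfolio))
```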

#### Fine-Tuning and Iteration

- Depending on the results, you may need to fine-tune hyperparameters, adjust the reward function, or modify the architecture to improve the agent's performance.
- Iteratively refine the RL model, re-evaluating the agent after each change.

#### Implementation Considerations

- Take into account transaction costs and slippage in the environment simulation to make the RL model more realistic.
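
A minimal sketch of how costs could enter the simulation is shown below. The proportional fee and slippage model, the default rates, and the function names are all assumptions made for illustration. Real costs depend on the broker and the market.

```python
def buy_with_costs(cash, price, fee_rate=0.001, slippage=0.0005):
    """Shares obtained when spending all cash, after slippage and fees."""
    fill_price = price * (1.0 + slippage)  # buys fill slightly above the quote
    return cash * (1.0 - fee_rate) / fill_price


def sell_with_costs(shares, price, fee_rate=0.001, slippage=0.0005):
    """Cash received when selling all shares, after slippage and fees."""
    fill_price = price * (1.0 - slippage)  # sells fill slightly below the quote
    return shares * fill_price * (1.0 - fee_rate)


# A round trip at an unchanged price loses money once costs are included
shares = buy_with_costs(10000.0, 50.0)
cash_after = sell_with_costs(shares, 50.0)
print(cash_after)
```

With costs included, an agent that trades on every step is penalized, which discourages needless buying and selling.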

### Let's try to implement the Environment

In [3]:
import pandas as pd
import numpy as np

Let's define a class that sets up our environment for us

In [4]:
class Enviornment:
    #Initialize the trading environment
    #Parameters: dataframe, window size for short term, window size for long term
    #(short_window is stored but not yet used by the observation)
    def __init__(self, data, short_window, long_window):
        self.data = data
        self.short_window = short_window
        self.long_window = long_window
        self.current_step = 0
        self.current_position = 0
        self.initial_balance = 10000.0
        self.balance = self.initial_balance
        self.stock_owned = 0
    
    #Reset the trading environment to its initial state
    #Returns: the initial observation
    def reset(self):
        self.current_step = 0
        self.current_position = 0
        self.balance = self.initial_balance
        self.stock_owned = 0
        return self._get_observation()
    
    #Get the current observation representing the state
    def _get_observation(self):
        start = max(0, self.current_step - self.long_window + 1)
        end = self.current_step + 1
        prices = self.data.iloc[start:end]['adjclose_price'].values
        return np.concatenate([prices])
    
    #Portfolio value: cash balance plus the market value of any stock held
    #(step() subtracts the initial balance from this to get the reward)
    def _calculate_reward(self):
        return self.balance + self.stock_owned * self.data.iloc[self.current_step]['adjclose_price']
    
    #Performs a single step in the environment
    #Actions: 0 = buy with the entire balance, 1 = sell all stock owned
    def step(self, action):
        self.current_step += 1

        if action == 0 and self.current_position != 1:  # Buy
            self.stock_owned = self.balance / self.data.iloc[self.current_step]['adjclose_price']
            self.balance = 0
            self.current_position = 1
        elif action == 1 and self.current_position != -1:  # Sell
            #Caution: if no stock is owned (e.g. selling on the first step),
            #this sets the balance to 0 and the initial cash is lost
            self.balance = self.stock_owned * self.data.iloc[self.current_step]['adjclose_price']
            self.stock_owned = 0
            self.current_position = -1

        done = self.current_step == len(self.data) - 1
        #Reward is the cumulative profit or loss relative to the initial balance
        reward = self._calculate_reward() - self.initial_balance
        return self._get_observation(), reward, done, {}

Now that we have this environment, let's create some simple test data and run a few of its functions. What we want to see is the outcome of performing the steps.

In [5]:
if __name__ == "__main__":
   
    data = pd.DataFrame({
        'date': ['2021-01-28', '2021-01-27', '2021-01-26', '2021-01-25', '2021-01-22', '2021-01-21', '2021-01-20', '2021-01-19', '2021-01-15', '2021-01-14'],
        'adjclose_price': [193.60000610351562, 347.510009765625, 147.97999572753906, 76.79000091552734, 65.01000213623047, 43.029998779296875, 39.119998931884766, 39.36000061035156, 35.5, 39.90999984741211]
    })

    # Initialize the environment
    env = Enviornment(data=data, short_window=2, long_window=5)

    # Reset the environment and get the initial observation
    observation = env.reset()

    # Run some steps in the environment (for illustration purposes)
    for _ in range(5):
        action = np.random.choice([0, 1])  # Random action (buy or sell)
        next_observation, reward, done, _ = env.step(action)
        print(f"Action: {action}, Reward: {reward}, Done: {done}, Current Balance: {env.balance}, Stock Owned: {env.stock_owned}")

Action: 1, Reward: -10000.0, Done: False, Current Balance: 0.0, Stock Owned: 0
Action: 1, Reward: -10000.0, Done: False, Current Balance: 0.0, Stock Owned: 0
Action: 0, Reward: -10000.0, Done: False, Current Balance: 0, Stock Owned: 0.0
Action: 0, Reward: -10000.0, Done: False, Current Balance: 0, Stock Owned: 0.0
Action: 1, Reward: -10000.0, Done: False, Current Balance: 0.0, Stock Owned: 0


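Let's try to understand what each line here represents

1. Action: 1, Reward: -10000.0, Done: False, Current Balance: 0.0, Stock Owned: 0
- the action is 1, so the agent chose to sell
- the reward is the portfolio value minus the initial balance of 10000. Because the agent sold while owning no stock, `step` set the balance to 0 × price = 0, so the reward is 0 - 10000 = -10000
- the episode is not finished
- after selling, the agent owns no stock

The remaining lines report the same fields; only the action differs. Since the balance is already 0 after the first step, the later buys and sells trade nothing and the reward stays at -10000.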
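
The constant -10000 reward in this run comes from the sell branch executing while no stock is owned, which overwrites the cash balance with zero. One possible way to avoid that is to guard each trade, as in the standalone sketch below. `apply_action` is a hypothetical helper, not a method of the `Enviornment` class above.

```python
def apply_action(action, balance, stock_owned, price):
    """Return (balance, stock_owned) after a guarded buy (0) or sell (1)."""
    if action == 0 and balance > 0:
        # Buy only when there is cash to spend
        stock_owned += balance / price
        balance = 0.0
    elif action == 1 and stock_owned > 0:
        # Sell only when there is stock to sell; existing cash is kept
        balance += stock_owned * price
        stock_owned = 0.0
    return balance, stock_owned


# Selling with no stock now leaves the initial balance untouched
balance, stock = apply_action(1, 10000.0, 0.0, 347.51)
print(balance, stock)
```

With this guard, an invalid trade behaves like a "hold" instead of destroying the portfolio.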

### Let's now set up the agent for the RL Environment

We need to define an agent using an algorithm, typically one provided by the RL library we choose to use. There are several options to choose from. Two of the most widely used algorithms for defining RL agents are:

1. DQN (Deep Q Network) 
DQN is a popular reinforcement learning algorithm that combines Q-learning with deep neural networks. The key idea is to use a deep neural network to approximate the Q-function, which represents the expected cumulative future reward for taking a specific action in a given state and following a certain policy thereafter. Note that the "gym" package in Python supplies the environment interface, not the algorithm itself.

2. PPO (Proximal Policy Optimization)
Another popular reinforcement learning algorithm, belonging to the family of policy optimization methods. PPO iteratively updates the policy to improve its performance while keeping each update within a limited region, which prevents overly large policy changes.
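
The idea of keeping policy updates within a region can be illustrated with PPO's clipped surrogate objective. The sketch below computes it with NumPy for given probability ratios (new policy over old policy) and advantage estimates. The function name, the `clip_eps` value of 0.2, and the inputs are illustrative assumptions, and a full PPO implementation involves much more than this one formula.

```python
import numpy as np


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the minimum removes the incentive to push the ratio
    # outside the interval [1 - eps, 1 + eps]
    return float(np.mean(np.minimum(unclipped, clipped)))


# A ratio of 1.5 with a positive advantage is capped at 1.2
print(ppo_clipped_objective([1.5], [1.0]))
```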

For this dataset we will use a very simple random agent, which picks its actions uniformly at random and serves as a baseline. The same environment could also be used with PPO.

In [12]:
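import gym
import pandas as pd
import numpy as np

class TradingGymEnv(gym.Env):
    # Thin wrapper exposing Enviornment through the classic Gym interface
    # (reset() returns an observation; step() returns a 4-tuple)
    def __init__(self, data, short_window, long_window):
        super().__init__()
        self.env = Enviornment(data, short_window, long_window)
        # Observations hold up to long_window prices; the first few are shorter
        # because _get_observation() does not pad them
        self.observation_space = gym.spaces.Box(low=0, high=np.inf, shape=(long_window,))
        self.action_space = gym.spaces.Discrete(2)  # 0 = buy, 1 = sell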


    def step(self, action):
        return self.env.step(action)

    def reset(self):
        return self.env.reset()

data = pd.read_csv("GME_stock.csv")

# Initialize the Gym-compatible environment
env = TradingGymEnv(data=data, short_window=2, long_window=5)



In [13]:
# Define and set up the RandomAgent
class RandomAgent:
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, _):
        return self.action_space.sample()

# Initialize the RandomAgent with the environment's action space
agent = RandomAgent(env.action_space)

# Test the agent
obs = env.reset()
for _ in range(1000):
    action = agent.act(obs)
    obs, reward, done, _ = env.step(action)
    print(f"Action: {action}, Reward: {reward}, Done: {done}")







Action: 0, Reward: 0.0, Done: False
Action: 1, Reward: -5741.705517278687, Done: False
Action: 0, Reward: -5741.705517278687, Done: False
Action: 1, Reward: -6394.950773305251, Done: False
Action: 0, Reward: -6394.950773305251, Done: False
Action: 0, Reward: -6722.53019059939, Done: False
Action: 0, Reward: -6702.422872683814, Done: False
Action: 0, Reward: -7025.813358627411, Done: False
Action: 1, Reward: -6656.343988637882, Done: False
Action: 1, Reward: -6656.343988637882, Done: False
Action: 0, Reward: -6656.343988637882, Done: False
Action: 0, Reward: -6658.020044985602, Done: False
Action: 1, Reward: -7035.124091995988, Done: False
Action: 1, Reward: -7035.124091995988, Done: False
Action: 1, Reward: -7035.124091995988, Done: False
Action: 0, Reward: -7035.124091995988, Done: False
Action: 0, Reward: -7055.606969370396, Done: False
Action: 0, Reward: -6784.210716154175, Done: False
Action: 0, Reward: -6712.521133690187, Done: False
Action: 1, Reward: -6692.038581880071, Done: Fa

At each step, our agent decides to either buy or sell the stock. As a result of these decisions, its portfolio value rises and falls from one step to the next.

It's normal to get negative rewards in a trading environment, especially if the agent's actions result in losses. In a trading scenario, the goal of the agent is to maximize its cumulative reward over time. Since trading involves risk and uncertainty, there will be times when the agent's actions result in losses, leading to negative rewards.

Negative rewards can indicate that the agent's current strategy is not performing well or that it's making suboptimal decisions. The agent's learning process involves exploring different actions and learning from the outcomes, both positive and negative. Over time, through learning, the agent should aim to improve its strategy to achieve better outcomes and maximize its cumulative reward.

In reinforcement learning, it's common to have a mix of positive and negative rewards as the agent explores and learns to navigate the environment effectively. The ultimate goal is for the agent to learn a policy that results in a positive cumulative reward over the long term.

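As for our agent on this dataset: it is a random agent, so it does not learn from its rewards. Its actions are sampled at random regardless of the observation, and the changes in reward reflect price movements while it holds stock, not an improving strategy. In the portion of the output shown above, the reward stays at or below zero, meaning the portfolio is worth no more than the initial balance of 10000. Replacing the random agent with a learning algorithm such as DQN or PPO is what would allow the strategy to improve over time.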
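
One detail worth checking before training a learning agent: the printed dataset lists the most recent row first (2021-01-28 at index 0), so stepping through it by index moves backward in time. A minimal sketch of putting the rows in chronological order is shown below. `to_chronological` is a hypothetical helper, and the sample rows are copied from the printed output above.

```python
import pandas as pd


def to_chronological(df, date_column="date"):
    """Return a copy sorted from oldest to newest with a fresh index."""
    out = df.copy()
    out[date_column] = pd.to_datetime(out[date_column])
    return out.sort_values(date_column).reset_index(drop=True)


sample = pd.DataFrame({
    "date": ["2021-01-28", "2021-01-27", "2021-01-26"],
    "adjclose_price": [193.600006, 347.510010, 147.979996],
})
ordered = to_chronological(sample)
print(ordered)
```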