# Q-Learning

Teach an agent **how good an action is in a state** by updating a table of values.


1. Agent states and actions
    - List of agent actions
    - List of agent states
2. Initialise Q Table and learning parameters
    - learning_rate : How fast values change
    - discount_factor : How much future rewards matter
    - exploration_probability : How often the agent tries a random action instead of the best-known one
3. Work
    1. get current agent state from input data
    2. decide action to take
        - a random action with probability exploration_probability, else the action that has the highest value in the Q-table
    3. get the reward for the action taken
    4. update the q-table

>`q_predict = q_table[state, action]`  
>`q_target = reward + discount_factor * np.max(q_table[next_state])`  
>`q_table[state, action] += learning_rate * (q_target - q_predict)`
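The steps above can be sketched on a toy table. This is a minimal illustration, not the trading agent itself: the table size, the seed, and the hand-picked transition are assumptions made only for the example, and `choose_action` and `update` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sizes, chosen only for this example.
n_states, n_actions = 4, 3
q_table = np.zeros((n_states, n_actions))

learning_rate = 0.1
discount_factor = 0.9
exploration_probability = 0.3


def choose_action(state):
    """Epsilon-greedy: explore with a random action, otherwise exploit."""
    if rng.random() < exploration_probability:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_table[state]))


def update(state, action, reward, next_state):
    """Apply one Q-learning update to q_table in place."""
    q_predict = q_table[state, action]
    q_target = reward + discount_factor * np.max(q_table[next_state])
    q_table[state, action] += learning_rate * (q_target - q_predict)


# One hand-picked transition: state 0, action 1, reward +1, next state 2.
update(state=0, action=1, reward=1.0, next_state=2)
print(q_table[0])  # only the entry for action 1 has moved away from zero
```

Because every value starts at zero, the target is just the reward, and the entry moves one tenth of the way towards it.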

        

# Gymnasium 

Gymnasium provides a standard API for single-agent reinforcement learning environments, together with implementations of common environments.

https://gymnasium.farama.org/

One of its reference environments is Taxi:

https://gymnasium.farama.org/environments/toy_text/taxi/
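As a rough sketch of what the Gymnasium API looks like in use, the loop below plays one episode with random actions. It assumes the `gymnasium` package is installed. `make_taxi_env` and `run_random_episode` are hypothetical helper names, and the import is done lazily so the loop can be read on its own.

```python
def make_taxi_env():
    """Create the Taxi environment (requires the gymnasium package)."""
    import gymnasium as gym

    return gym.make("Taxi-v3")


def run_random_episode(env, max_steps=200, seed=0):
    """Play one episode with random actions and return the total reward."""
    observation, info = env.reset(seed=seed)
    total_reward = 0.0
    for _ in range(max_steps):
        # A learning agent would pick the action from its Q-table here.
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

Calling `run_random_episode(make_taxi_env())` plays a single random Taxi episode. The same `reset` / `step` loop is where a Q-table update would go.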


# Reward table

Each state needs a reward for every action the agent could take in it.

| State                       | BUY | SELL | HOLD |
| --------------------------- | --- | ---- | ---- |
| **GOLD_WITHIN_FAIT**        | +1  | −3   | 0    |
| **GOLD_WITHOUT_FAIT**       | +1  | −3   | 0    |
| **HIT_GOLD_FAIT**           | +3  | −3   | 0    |
| **DEATH_WITHIN_FAIT**       | −3  | −3   | 0    |
| **DEATH_WITHOUT_FAIT**      | −3  | −3   | 0    |
| **GOLD_WITHIN_ASSET**       | −3  | −0.5 | +0.1 |
| **DEATH_WITHIN_ASSET**      | −3  | −0.5 | +0.1 |
| **DEATH_WITHOUT_ASSET**     | −3  | −0.5 | +0.1 |
| **ABOVE_TAKE_PROFIT_ASSET** | −3  | +3   | +0.1 |
| **BELOW_STOPLOSS_ASSET**    | −3  | +2   | −1   |
| **HIT_DEATH_WITH_ASSET**    | −3  | +3   | +0.1 |
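One way to hold this table in code is a nested dictionary keyed by state name and then action name. The values below are copied from the table. The lookup helpers are hypothetical, and this is not necessarily how `mMACD.getReward` stores or computes rewards, since that code is not shown here.

```python
# Rewards copied from the table above: state name -> action name -> reward.
REWARDS = {
    "GOLD_WITHIN_FAIT":        {"BUY": 1,  "SELL": -3,   "HOLD": 0},
    "GOLD_WITHOUT_FAIT":       {"BUY": 1,  "SELL": -3,   "HOLD": 0},
    "HIT_GOLD_FAIT":           {"BUY": 3,  "SELL": -3,   "HOLD": 0},
    "DEATH_WITHIN_FAIT":       {"BUY": -3, "SELL": -3,   "HOLD": 0},
    "DEATH_WITHOUT_FAIT":      {"BUY": -3, "SELL": -3,   "HOLD": 0},
    "GOLD_WITHIN_ASSET":       {"BUY": -3, "SELL": -0.5, "HOLD": 0.1},
    "DEATH_WITHIN_ASSET":      {"BUY": -3, "SELL": -0.5, "HOLD": 0.1},
    "DEATH_WITHOUT_ASSET":     {"BUY": -3, "SELL": -0.5, "HOLD": 0.1},
    "ABOVE_TAKE_PROFIT_ASSET": {"BUY": -3, "SELL": 3,    "HOLD": 0.1},
    "BELOW_STOPLOSS_ASSET":    {"BUY": -3, "SELL": 2,    "HOLD": -1},
    "HIT_DEATH_WITH_ASSET":    {"BUY": -3, "SELL": 3,    "HOLD": 0.1},
}


def lookup_reward(state, action):
    """Immediate reward for taking an action in a state."""
    return REWARDS[state][action]


def best_immediate_action(state):
    """Action with the highest immediate reward in a state."""
    row = REWARDS[state]
    return max(row, key=row.get)


print(lookup_reward("HIT_GOLD_FAIT", "BUY"))
print(best_immediate_action("BELOW_STOPLOSS_ASSET"))
```

The table only gives immediate rewards. The Q-table learned from it also accounts for discounted future rewards, so the two need not agree on the best action.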


## Learning formula:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big]
$$
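
A worked instance of the update, using $\alpha = 0.1$ and $\gamma = 0.9$ (the `learning_rate` and `discount_factor` set in this notebook). The starting value $Q(s,a) = 0$, the reward $r = 3$, and the best next-state value $\max_{a'} Q(s',a') = 1$ are hypothetical numbers chosen only to keep the arithmetic easy to follow:

$$
\begin{aligned}
Q(s,a) &\leftarrow 0 + 0.1 \Big[3 + 0.9 \cdot 1 - 0\Big] \\
       &= 0.1 \cdot 3.9 \\
       &= 0.39
\end{aligned}
$$

The value moves one tenth of the way from its old estimate (0) towards the target (3.9).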



# MACD Trading Strategy

- Compare a slow and a fast moving average to see whether momentum is upward or downward.
- Golden Cross: When the fast moving average crosses above the slow moving average.
- Death Cross: When the fast moving average crosses below the slow moving average.
- Stop-Loss: A price level where you automatically exit a trade to limit losses.
- Risk-Reward: The ratio between how much you’re willing to lose and how much you aim to gain.
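
The terms above can be sketched with pandas. This is an illustration under stated assumptions, not the `libs.indicators.MACD` implementation, which is not shown here. The EMA spans of 12 and 26 are the conventional MACD defaults. The stop-loss is read as a signed percentage of the entry price, and `"1:3"` as risking 1 unit to aim for 3. `find_crosses` and `exit_levels` are hypothetical names.

```python
import pandas as pd


def find_crosses(close, fast_span=12, slow_span=26):
    """Flag the bars where the fast EMA crosses the slow EMA."""
    fast = close.ewm(span=fast_span, adjust=False).mean()
    slow = close.ewm(span=slow_span, adjust=False).mean()
    above = fast > slow
    previous = above.shift(1, fill_value=False).astype(bool)
    return pd.DataFrame({
        "golden": above & ~previous,   # fast moves from below to above slow
        "death": ~above & previous,    # fast moves from above to below slow
    })


def exit_levels(entry_price, stop_loss_pct=-1.0, risk_reward="1:3"):
    """Stop-loss and take-profit prices for a long position."""
    risk, reward = (float(part) for part in risk_reward.split(":"))
    take_profit_pct = abs(stop_loss_pct) * reward / risk
    stop_price = entry_price * (1 + stop_loss_pct / 100)
    take_profit_price = entry_price * (1 + take_profit_pct / 100)
    return stop_price, take_profit_price


# Synthetic prices: fall, then rise, then fall again.
prices = list(range(100, 50, -1)) + list(range(50, 150)) + list(range(150, 50, -1))
crosses = find_crosses(pd.Series(prices, dtype=float))
print(crosses[crosses.any(axis=1)])
print(exit_levels(100.0))
```

On the synthetic series the golden cross appears shortly after the price turns upward and the death cross shortly after it turns down again. Both lag the turning point because moving averages react with a delay.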

## 1. List all of the possible states and actions of the agent

In [9]:
from libs.indicators import Indicator
from libs.indicators import MACD

ACTION_HOLD = 0
ACTION_BUY = 1
ACTION_SELL = 2
USD = "USDT"
ASSET = "BTC"

actions = [ ACTION_HOLD, ACTION_BUY, ACTION_SELL ];

mMACD = MACD( pair=[USD,ASSET], stopLoss=-1.0, riskReward="1:3");

print("Agent actions: {}, Agent states: {}".format(actions,mMACD.getStates()));

Agent actions: [0, 1, 2], Agent states: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


## 2. Initialise Q Table, and variables

In [11]:
import traceback  # used by the except blocks in _enact and _work

import pandas as pd
import numpy as np

q_table = np.random.uniform( low=-0.01, high=0.01, size=( len( mMACD.getStates() ) , len( actions ) ) );

learning_rate = 0.1

discount_factor = 0.9

exploration_probability = 0.3

# Load Data
df = pd.read_csv( "./assets/BTC_USDT-4h.csv" ,index_col=0,parse_dates=True)
df = df.copy()
df.index.name = 'Date'

mState = {};
work_index = 0;

def initState():
    global mState, work_index, q_table, mMACD
    mMACD = MACD( pair=[USD,ASSET], stopLoss=-1.0, riskReward="1:3");
    mMACD.prep( df );
    q_table = np.random.uniform( low=-0.01, high=0.01, size=( len( mMACD.getStates() ) , len( actions ) ) );
    work_index = 0;
    mState["last"] = df.index[0];
    mState["start"] = df.index[0];
    mState["closeLast"] = df.iloc[0]["close"]
    mState["openLast"] = df.iloc[0]["open"]
    mState["volumeLast"] = df.iloc[0]["volume"]  
    mState["trades"] = 0
    mState["holding"] = USD
    mState["balancesBefore"] = { USD: 0, ASSET: 0 }
    mState["start"] = 100  # starting USD balance; replaces the timestamp stored under "start" above
    mState["balances"] = { USD: 100, ASSET: 0 }
    
initState();

In [12]:
def _agent_work( i ):
    global mState
    mMACD.work( df , i );
    if i > 0:
        mState["closeLast"] = df.iloc[i-1]["close"]
        mState["openLast"] = df.iloc[i-1]["open"]
        mState["volumeLast"] = df.iloc[i-1]["volume"]

def _usdToAsset( spotUSD, holdingUSD  ):
    return holdingUSD / spotUSD;
    
def _assetToUSD( spotUSD, holdingAsset  ):
    return holdingAsset * spotUSD;

def _enact( action, i):
    global mState
    try:
        price = (df.iloc[i]["close"] + df.iloc[i]["open"]) / 2
        holding = mState["holding"]

        if holding == USD:
            if action != ACTION_BUY:
                return False
            prev_usd = mState["balances"][USD]
            asset = _usdToAsset(price, prev_usd)
            mMACD.buy(price)
            mState["holding"] = ASSET
            mState["balances"][USD] = 0
            mState["balances"][ASSET] = asset
            mState["balancesBefore"][USD] = prev_usd
            mState["trades"] += 1

            return True
        
        if holding == ASSET:
            if action != ACTION_SELL:
                return False
            prev_asset = mState["balances"][ ASSET ]
            usd = _assetToUSD(price, prev_asset)
            mMACD.sell(price)
            mState["holding"] = USD
            mState["balances"][ ASSET ] = 0
            mState["balances"][ USD ] = usd
            mState["balancesBefore"][ ASSET ] = prev_asset
            mState["trades"] += 1
            return True
        return False
    except Exception:
        traceback.print_exc()
        return False


def _calculate_reward(action, i ):
    # len(df) counts rows; df.size would count rows * columns
    if len(df) == 0:
        raise ValueError("No Data Frame");
    if i >= len(df):
        return 0;
    return mMACD.getReward( action, df, i );

def _q_learning_update( state, action, reward, next_state):
    global q_table
    if q_table.size == 0:
        raise ValueError("No Q table");
    q_predict = q_table[state, action]
    q_target = reward + discount_factor * np.max(q_table[next_state])
    q_table[state, action] += learning_rate * (q_target - q_predict)

def _work( enact=True):
    global work_index
    try:
        if len(df) == 0:
            raise ValueError("No Data Frame");
        
        # Stop one row early so that work_index + 1 is always a valid row
        while work_index < len(df) - 1:
            _agent_work( work_index );
            state = mMACD.getState( df , work_index );
            
            # getState returned no state for this row: skip it
            if state is None:
                work_index += 1;
                continue;
                
            # Epsilon-greedy: explore with a random action, otherwise exploit
            if np.random.rand() < exploration_probability:
                action = np.random.randint( len( actions ) )
            else:
                action = np.argmax( q_table[state])

            # Update the Q-table
            next_time_step = work_index + 1
            reward = _calculate_reward( action, next_time_step )
            next_state = mMACD.getState( df , next_time_step );
            # q_table[None] would select the whole table, so skip the update
            if next_state is not None:
                _q_learning_update( state, action, reward, next_state)
            # Enact the action
            if reward > 0 and enact:
                _enact( action , work_index );
            work_index = next_time_step
    except Exception:
        traceback.print_exc();

## 3. Work

In [13]:
def _train( episodes = 1 ):
    global work_index, mState
    print("Training");
    statePreUSD = mState["balances"][ USD ];
    statePreAsset = mState["balances"][ ASSET ];
    spot = ( mState["closeLast"] + mState["openLast"] ) / 2
    if statePreUSD == 0:
        statePreUSD = _assetToUSD( spot, statePreAsset );
    if statePreAsset == 0:
        statePreAsset = _usdToAsset( spot, statePreUSD );

    for episode in range( episodes ):
        is_last = (episode == episodes - 1)
        mState["trades"] = 0;
        work_index = 0;
        _work( enact=is_last );

    state_post_usd = mState["balances"][USD]
    state_post_asset = mState["balances"][ASSET]
    spot = (mState["closeLast"] + mState["openLast"]) / 2
    if state_post_usd == 0:
        state_post_usd = _assetToUSD(spot, state_post_asset)
    if state_post_asset == 0:
        state_post_asset = _usdToAsset(spot, state_post_usd)
    def percent_change(old, new):
        return 0 if old == 0 else ((new - old) / old) * 100
    usd_pct = percent_change(statePreUSD, state_post_usd)
    asset_pct = percent_change(statePreAsset, state_post_asset)
    print("\n" + "=" * 45)
    print(f"{'':7} | {'USD':>12} | {'ASSET':>12}")
    print(f"{'START':7} | {statePreUSD:12.5f} | {statePreAsset:12.5f}")
    print(f"{'END':7} | {state_post_usd:12.5f} | {state_post_asset:12.5f}")
    print(f"{'% Δ':7} | {usd_pct:11.2f}% | {asset_pct:11.2f}%")
    print("=" * 45)


In [14]:
initState()
_train( episodes=1 )

Training

        |          USD |        ASSET
START   |    100.00000 |      0.00215
END     |   1383.37115 |      0.01429
% Δ     |     1283.37% |      564.52%


In [16]:
initState()
_train( episodes=2 )

Training

        |          USD |        ASSET
START   |    100.00000 |      0.00215
END     |    593.58289 |      0.00613
% Δ     |      493.58% |      185.14%


In [17]:
initState()
_train( episodes=5 )

Training

        |          USD |        ASSET
START   |    100.00000 |      0.00215
END     |    942.30296 |      0.00973
% Δ     |      842.30% |      352.65%
