# Exploration-Exploitation Dilemma

This is all about taking a decision on whether to choose Exploration or Exploitation as a method  to (to select an action) train a model.

To deal with this dilemma, RL practitioners created multiple method:

1. UCB (Upper Confidence Bound)
2. Thompson Sampling
3. Epsilon Greedy Algorithm




## Epsilon Greedy Algorithm

In [2]:
import numpy as np
import random

#Policy Function
def epsilonGreedy(currentState,epsilon):
    listActions=[0,1]
    poleAngle=currentState

    p = np.random.randn()

    if p < epsilon:
        #Exploration Strategy
        action = random.choice(listActions)

    else:
        #Exploitation Strategy
        if poleAngle < 0:
            action = 0
        else:
            action = 1

    return action


In [3]:
import time
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="human")

In [4]:
# Environment: CartPole
# Documentation Link: https://gymnasium.farama.org/environments/classic_control/cart_pole/
#
# Goal: Is to balance the pole on the cart by moving the cart left or right for a given episode
#
# Agent: Cart
#
# Actions: 0 ---- Left
#          1 ---- Right
#
# State: ["Cart Position","Cart Velocity","Pole Angle","poleVelocity"]
#
# Reward: 1 for every step taken such that the pole is balanced successfully.
#
# Termination Condition:
# 1. Pole Angle is greater than +-12 DEGREE
# 2. Cart Position is greater than +-2.4
# 3. Episode length greater than 500

In [None]:
EPSILON = 0.2

for episodeCount in range(1,11):
    #initialize the state
    env = gym.make("CartPole-v1", render_mode="human")
    
    observation,info = env.reset()    
    # observation (State): ["Cart Position","Cart Velocity","Pole Angle","poleVelocity"]

    

    for episodeStep in range(400):
        
        #Choose a random action
        state = observation[2] #Pole Angle
        action = epsilonGreedy(state, EPSILON)

        #Supply action to the env
        newState,reward,isTerminated,isTruncated,info = env.step(action)

        #Add small delay and call render to see game in execution
        time.sleep(0.02) #20ms delay
        env.render()

        #Print info
        print(f"Episode Step {episodeStep} Given Action {action} I got reward {reward} and next state {newState}")

        #Check for Termination
        if isTerminated:
            print("GAME OVER --- Terminated!!!")
            env.close()
            break

    #Check for Truncation(Episode ended)
    if isTruncated:
        ("Episode Over. Total Allowed Steps Done. Agent was able to balance pole successfully :)")

    env.close()

  from pkg_resources import resource_stream, resource_exists


Episode Step 0 Given Action 1 I got reward 1.0 and next state [ 0.03553887  0.16738862  0.01286034 -0.24731459]
Episode Step 1 Given Action 0 I got reward 1.0 and next state [ 0.03888664 -0.02791462  0.00791404  0.04939688]
Episode Step 2 Given Action 1 I got reward 1.0 and next state [ 0.03832835  0.16709296  0.00890198 -0.24077863]
Episode Step 3 Given Action 1 I got reward 1.0 and next state [ 0.04167021  0.36208662  0.00408641 -0.53064036]
Episode Step 4 Given Action 1 I got reward 1.0 and next state [ 0.04891194  0.55715084 -0.0065264  -0.82203287]
Episode Step 5 Given Action 1 I got reward 1.0 and next state [ 0.06005496  0.7523615  -0.02296706 -1.1167613 ]
Episode Step 6 Given Action 1 I got reward 1.0 and next state [ 0.07510219  0.9477772  -0.04530228 -1.4165593 ]
Episode Step 7 Given Action 0 I got reward 1.0 and next state [ 0.09405773  0.75324464 -0.07363347 -1.1383742 ]
Episode Step 8 Given Action 1 I got reward 1.0 and next state [ 0.10912263  0.94924814 -0.09640095 -1.45

: 