## CartPole Skating

> **Problem**: If Peter wants to escape from the wolf, he needs to move faster than the wolf does. We will see how Peter can learn to skate, and in particular to keep his balance, using Q-Learning.

First, let's install `gym` and import the required libraries:


In [None]:
import sys
%pip install gym 

import gym
import matplotlib.pyplot as plt
import numpy as np
import random

## Create a CartPole Environment


In [None]:
env = gym.make("CartPole-v1")
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())

To see how the environment works, let's run a short simulation for 100 steps.


In [None]:
env.reset()

for i in range(100):
   env.render()
   env.step(env.action_space.sample())
env.close()

During simulation, we need observations in order to decide how to act. The `step` function returns the current observation, the reward for the step, and flags that indicate whether it makes sense to continue the simulation:


In [None]:
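obs, info = env.reset()

done = False
while not done:
   env.render()
   # Assumes gym >= 0.26, where step returns five values
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   # Stop once the pole falls (terminated) or the time limit is hit (truncated)
   done = terminated or truncated
   print(f"{obs} -> {rew}")
env.close()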

We can also get the minimum and maximum value of each component of the observation:


In [None]:
print(env.observation_space.low)
print(env.observation_space.high)

## State Discretization


In [None]:
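def discretize(x):
    # Scale each component, then truncate to an integer
    # (np.int was removed in NumPy 1.24, so the built-in int is used)
    return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(int))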
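
To get a feel for what this function does, here is a small sketch applied to two made-up observations. The sample values are hypothetical, and the function is repeated so the snippet runs on its own. Note that `astype(int)` truncates toward zero, so values just below and just above zero land in the same cell.

```python
import numpy as np

def discretize(x):
    # Same scaling as in the cell above, with the built-in int as dtype
    return tuple((x / np.array([0.25, 0.25, 0.01, 0.1])).astype(int))

# Hypothetical observations: position, velocity, pole angle, angular velocity
obs_a = np.array([0.30, -0.60, 0.027, 0.17])
obs_b = np.array([0.35, -0.70, 0.028, 0.18])  # close to obs_a

state_a = discretize(obs_a)
state_b = discretize(obs_b)

# Nearby observations collapse into the same discrete state
print(state_a == state_b)

# Truncation toward zero: -0.2 and +0.2 both map to cell 0 for position
print(discretize(np.array([-0.2, 0.0, 0.0, 0.0]))[0],
      discretize(np.array([0.2, 0.0, 0.0, 0.0]))[0])
```

One consequence of the truncation is that the cell around zero is twice as wide as the others.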

Let's also explore another discretization method that uses bins:


In [None]:
def create_bins(i,num):
    return np.arange(num+1)*(i[1]-i[0])/num+i[0]

# print("Sample bins for interval (-5,5) with 20 bins\n",create_bins((-5,5),20))

ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]
print(bins)

def discretize_bins(x):
    return tuple(np.digitize(x[i],bins[i]) for i in range(4))
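
The following sketch shows how `np.digitize` assigns bin indices, using a deliberately small interval so the result is easy to check by hand. The interval `(-2, 2)` with 4 bins and the sample values are chosen only for illustration; `create_bins` is repeated so the snippet is self-contained.

```python
import numpy as np

def create_bins(interval, num):
    # num equal-width bins over the interval give num + 1 edges
    return np.arange(num + 1) * (interval[1] - interval[0]) / num + interval[0]

edges = create_bins((-2, 2), 4)  # edges at -2, -1, 0, 1, 2

# Index i means edges[i-1] <= value < edges[i]
idx = [int(np.digitize(v, edges)) for v in (-3.0, -1.5, 0.5, 5.0)]
print(edges)
print(idx)
```

Values outside the interval are not rejected: they fall into index `0` (below the first edge) or `num + 1` (at or above the last edge), so each dimension can take `num + 2` distinct indices.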

Let's now run a short simulation and observe those discrete environment values.


In [None]:
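obs, info = env.reset()

done = False
steps = 0
while not done:
   # Assumes gym >= 0.26, where step returns five values
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   done = terminated or truncated
   print(obs)
   if steps % 10 == 0:
       # Show the discrete state every 10th step
       print(discretize(obs))
   steps += 1

env.close()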


## Q-Table Structure


In [None]:
Q = {}
actions = (0,1)

def qvalues(state):
    return [Q.get((state,a),0) for a in actions]
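
As a quick illustration of this structure, the sketch below looks up a state that has never been visited and then stores one value. The state tuple is a hypothetical discretized state, and the definitions are repeated so the snippet runs on its own.

```python
Q = {}
actions = (0, 1)

def qvalues(state):
    # Missing (state, action) pairs default to 0
    return [Q.get((state, a), 0) for a in actions]

state = (1, -2, 2, 1)        # hypothetical discretized state

before = qvalues(state)      # nothing stored yet
Q[(state, 1)] = 0.5          # store a value for action 1 only
after = qvalues(state)

print(before, after)
```

Because the table is a dictionary, it only grows for the state-action pairs the agent actually visits, which avoids allocating an array for every possible state.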

## Let's Start Q-Learning!


In [None]:
# hyperparameters
alpha = 0.3
gamma = 0.9
epsilon = 0.90

The next cell implements the Q-learning algorithm with an epsilon-greedy style exploration strategy. Q-learning is a popular reinforcement learning technique for learning an optimal policy for an agent in a Markov Decision Process (MDP) with discrete action and state spaces.

In this code, the agent interacts with the environment for 100000 epochs (episodes). In each epoch it resets the environment and then, until the episode ends, repeatedly chooses an action, takes a step, and updates its Q-table from the observed reward and next state. To balance exploration and exploitation, the agent exploits the Q-table with probability `epsilon` (0.9 here) and picks a uniformly random action otherwise.

The Q-values are stored in a dictionary called `Q`, where each key is a `(state, action)` tuple and the corresponding value is the estimated Q-value for that pair. The `qvalues(s)` function returns the Q-values of all actions for a given state `s`.

The `discretize` function converts the continuous observation into a discrete state for Q-learning (`discretize_bins` could be used in its place). The `probs` function turns the Q-values of a state into a probability distribution: it shifts them so that the smallest value becomes a small positive constant `eps`, then divides by their sum. This is a simple linear normalization, not a softmax.
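
To see what `probs` produces, here is a small sketch with two hand-picked Q-value vectors. The inputs are made up for illustration; the function body is the same as in the training cell.

```python
import numpy as np

def probs(v, eps=1e-4):
    # Shift so the smallest value equals eps, then normalize to sum to 1
    v = v - v.min() + eps
    v = v / v.sum()
    return v

# Unvisited state: all Q-values equal, so both actions are equally likely
p_equal = probs(np.array([0.0, 0.0]))

# Different Q-values: the worse action keeps only a tiny weight
p_skew = probs(np.array([1.0, 3.0]))

print(p_equal, p_skew)
```

Because the minimum is shifted to `eps`, the lowest-valued action is almost never sampled during exploitation, no matter how small the gap between the Q-values is.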

The `cum_rewards` list collects the total reward of each epoch since the last progress report, and the `rewards` list is intended to hold the reward history for the final plot. The `alpha` and `gamma` variables are the learning rate and discount factor, respectively, used for updating the Q-values.
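
A single update can be worked through by hand. The sketch below uses the hyperparameters from this notebook and the reward of 1 that CartPole gives per step; the value assumed for the best next-state Q-value is hypothetical.

```python
alpha, gamma = 0.3, 0.9

q_old = 0.0        # current estimate Q(s, a), not yet visited
reward = 1.0       # CartPole returns +1 for every step the pole stays up
max_q_next = 2.0   # hypothetical max over actions of Q(s', a')

# Blend the old estimate with the bootstrapped target
target = reward + gamma * max_q_next
q_new = (1 - alpha) * q_old + alpha * target

print(target, q_new)
```

With these numbers the target is about 2.8, and the estimate moves 30% of the way from 0 toward it, giving roughly 0.84.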

Every 5000 epochs, the code prints the average cumulative reward since the previous report and updates `Qmax` and `Qbest` if that average is higher than the best one seen so far.

In [None]:
# Q, actions and qvalues() are defined in the Q-Table Structure cell above

def probs(v, eps=1e-4):
    # Shift so the smallest value equals eps, then normalize to sum to 1
    v = v - v.min() + eps
    v = v / v.sum()
    return v

Qmax = 0
cum_rewards = []
rewards = []

for epoch in range(100000):
    # Assumes gym >= 0.26: reset returns (obs, info), step returns five values
    obs, _ = env.reset()
    done = False
    cum_reward = 0

    while not done:
        s = discretize(obs)

        if random.random() < epsilon:
            # Exploitation - choose the action according to Q-Table probabilities
            v = probs(np.array(qvalues(s)))
            a = random.choices(actions, weights=v)[0]
        else:
            # Exploration - choose the action at random
            a = np.random.randint(env.action_space.n)

        # Step the environment exactly once per chosen action
        obs, rew, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        cum_reward += rew
        ns = discretize(obs)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0) + alpha * (rew + gamma * max(qvalues(ns)))

    cum_rewards.append(cum_reward)
    rewards.append(cum_reward)  # full history, used for plotting

    # Periodically print the average reward and remember the best Q-Table
    if epoch % 5000 == 0:
        avg = np.average(cum_rewards)
        print(f"{epoch}: {avg}, alpha={alpha}, epsilon={epsilon}")
        if avg > Qmax:
            Qmax = avg
            Qbest = Q.copy()  # copy, so later updates do not change the snapshot
        cum_rewards = []




## Plotting Training Progress


In [None]:
plt.plot(rewards)
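
Per-episode rewards tend to be noisy, so a trend can be hard to read from the raw curve. One common option, sketched here as a suggestion rather than as part of the original notebook, is to plot a running average instead. The helper name `running_average` and the window size are assumptions made for this example.

```python
import numpy as np

def running_average(x, window):
    # 'valid' keeps only full windows: len(x) - window + 1 output points
    return np.convolve(x, np.ones(window) / window, mode="valid")

# Tiny hand-checkable example
demo = running_average([1, 2, 3, 4, 5], 2)
print(demo)
```

Assuming `rewards` holds one total reward per episode, something like `plt.plot(running_average(rewards, 100))` would then show the smoothed training progress.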