# CART-POLE GYM ENVIRONMENT

Our aim is to devise RL algorithms that solve the "Cart-Pole" environment provided in OpenAI Gym (used here through the `gymnasium` package). In this environment, a pole is attached to a cart that can move left or right. Our goal is to apply forces to the cart so that the pole stays upright. An episode runs for a maximum of 500 time-steps, so if the pole stays upright for all 500 time-steps, the episode is considered solved. The reward for each time-step is 1.

The action space for this environment is an ndarray with shape (1,) which can take values {0, 1}, indicating the direction of the fixed force the cart is pushed with. The observation space is an ndarray with shape (4,) which contains the following values:

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0  | Cart Position  | -4.8 | 4.8 |
| 1  | Cart Velocity  | -Inf | Inf |
| 2  | Pole Angle     | -24 deg | 24 deg |
| 3  | Pole Angular Velocity  | -Inf | Inf |

The episode terminates if the cart position leaves the range [-2.4, 2.4] or the pole angle leaves the range [-12 deg, 12 deg]. Note that the observation returns the pole angle in radians, so 12 deg corresponds to roughly 0.2095 rad.
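To make the termination rule concrete, here is a minimal sketch of the check in plain Python. The names `X_THRESHOLD`, `THETA_THRESHOLD` and `is_terminal` are our own illustrative choices, not part of the environment's API; the environment applies this rule internally.

```python
import math

# Termination limits quoted above (constant names are hypothetical).
X_THRESHOLD = 2.4                    # cart position limit
THETA_THRESHOLD = math.radians(12)   # pole angle limit, about 0.2095 rad

def is_terminal(obs):
    """Return True if an observation lies outside the non-terminal range."""
    x, _, theta, _ = obs  # position, velocity, angle (rad), angular velocity
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
```

Since the observation bounds in the table are twice these limits, roughly half of the position range and half of the angle range can only appear in the final observation of an episode.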

The state space is continuous, so we cannot simply use a lookup table to store our Q-values. There are two ways around this. One is to use a function approximator with some parameters to approximate the Q-values. The other is to discretise the state space, breaking the continuous space into a finite number of discrete states.

To discretize the state space, we split each observation's range into some n equal parts and assign each interval to its lower bound. For observations 0 and 2, we could put every value outside the non-terminal range into a single interval rather than splitting that region further, since in our current implementation many of those states are never reached (the episode terminates first). We aren't doing that for now; we may try it later to see how much it shrinks the state space.

Observations 1 and 3 range from -Inf to +Inf, so we can't simply split them into n intervals and get a finite state space. Instead, we ran the environment many times by hand to get an idea of the range the velocity and angular velocity stay in before the episode terminates, that is, before they become too large to correct.

Using this code, we did some experimentation.

In [None]:
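import gymnasium as gym

# Step through CartPole by hand to see how large the velocities get.
env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, info = env.reset()
while True:
    action = int(input("Action (0 = left, 1 = right): "))
    if action not in (0, 1):
        continue
    # step() returns (observation, reward, terminated, truncated, info)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"v:{obs[1]} , w:{obs[3]}")
    frame = env.render()  # RGB frame as an array; nothing is displayed here
    if terminated or truncated:
        print("Episode ended, resetting.")
        obs, info = env.reset()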

From this we concluded that the velocity stays between -3.0 and 3.0 and the angular velocity stays between -4 and 4 at most. If a value falls outside its interval, we clip it to the nearest bound.

With some basic mathematics, we see that we can implement this by mapping each observation $x$ to

$$\mathrm{round}\left(\frac{x-a}{b-a} \times n\right)$$

where $n$ is the "granularity constant" and $a$ and $b$ are the lower and upper bounds of $x$ respectively. For $x$ in $[a, b]$ the result is an integer between 0 and $n$.
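The mapping can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `BOUNDS`, `discretize`, `discretize_obs` and the default `n=10` are hypothetical names and values, with the position and angle bounds taken from the observation table and the velocity bounds from the hand-run experiments.

```python
import math

# (a, b) bounds per observation; the angle bound is converted to radians.
# BOUNDS and the default n below are illustrative assumptions.
BOUNDS = [
    (-4.8, 4.8),                            # cart position
    (-3.0, 3.0),                            # cart velocity (empirical)
    (-math.radians(24), math.radians(24)),  # pole angle
    (-4.0, 4.0),                            # pole angular velocity (empirical)
]

def discretize(x, a, b, n):
    """Clip x to [a, b], then map it to an integer index in 0..n."""
    x = min(max(x, a), b)
    return round((x - a) / (b - a) * n)

def discretize_obs(obs, n=10):
    """Map a 4-value observation to a tuple of integer indices."""
    return tuple(discretize(x, a, b, n) for x, (a, b) in zip(obs, BOUNDS))

state = discretize_obs((0.0, 0.0, 0.0, 0.0))  # centre of every range
```

The resulting tuple is hashable, so it can serve directly as a key into a Q-table. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which only matters for values that fall precisely on a bin boundary.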

Now coming to the RL algorithm: since every episode is guaranteed to terminate, we can use either a forward-view or a backward-view algorithm. Here we'll implement backward-view SARSA($\lambda$) with eligibility traces.
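As a reference point, a single backward-view update can be sketched as follows. This is a minimal illustration assuming accumulating traces and dictionary-based tables; the function name, the table layout and the hyperparameter values are hypothetical and not necessarily those used in the implementation.

```python
from collections import defaultdict

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next, done,
                        alpha=0.1, gamma=1.0, lam=0.9):
    """Apply one backward-view SARSA(lambda) update in place."""
    # TD error: bootstrap from Q(s', a') unless the episode has ended.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    # Accumulating trace: bump the pair that was just visited.
    E[(s, a)] += 1.0
    # Every traced pair receives a share of the TD error, then its trace decays.
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam
    return delta

Q = defaultdict(float)  # hypothetical Q-table keyed by (state, action)
E = defaultdict(float)  # eligibility traces, cleared at the start of each episode
delta = sarsa_lambda_update(Q, E, s=(5, 5, 5, 5), a=1, r=1.0,
                            s_next=(5, 6, 5, 4), a_next=0, done=False)
```

The traces let one TD error update every recently visited state-action pair at once, which is what allows the backward view to learn online instead of waiting for the end of the episode.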

First we import the required modules we'll be using.

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
