# The CartPole Problem

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kashifliaqat/Data_Science_and_Machine-Learning/blob/main/Reinforcement_Learning/CartPole_Problem.ipynb)

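The classic CartPole problem is a control problem in which the goal is to keep a pole balanced upright on a cart that moves along a horizontal track. At each step the agent pushes the cart left or right and earns a reward of 1 for every step the pole stays up. An episode ends when the pole falls, the cart leaves the track, or the step limit is reached (200 steps for `CartPole-v0`, the version used here), and the problem is considered solved when the agent consistently keeps the pole balanced for close to that limit.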
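As a rough illustration of what "solved" can mean in practice, the sketch below checks a list of episode scores against a threshold. The threshold of 195 averaged over 100 consecutive episodes is the figure conventionally quoted for `CartPole-v0`; it is an assumption here, not something this notebook computes, and `is_solved` is a hypothetical helper name.

```python
from collections import deque

# Assumed values (conventional for CartPole-v0, not taken from this notebook):
# an average score of at least 195 over 100 consecutive episodes.
REWARD_THRESHOLD = 195.0
WINDOW = 100


def is_solved(episode_returns, threshold=REWARD_THRESHOLD, window=WINDOW):
    """Return True if any `window` consecutive scores average >= threshold."""
    recent = deque(maxlen=window)  # sliding window over the latest scores
    for ret in episode_returns:
        recent.append(ret)
        if len(recent) == window and sum(recent) / window >= threshold:
            return True
    return False


print(is_solved([200.0] * 100))  # True
print(is_solved([20.0] * 100))   # False
```

Ten episodes, as run later in this notebook, are too few to apply this check directly; it is shown only to make the criterion concrete.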

- The code trains an RL agent to learn a policy that balances the pole on the cart. It uses Proximal Policy Optimization (PPO), a model-free policy-gradient algorithm that limits how far each update can move the policy.

- The code first creates an instance of the CartPole environment using the OpenAI Gym library. It then runs 10 episodes in which actions are chosen at random until the pole falls or the maximum number of steps is reached. These episodes give a baseline score to compare the trained policy against.

- Next, a PPO model is created with the `stable_baselines3` library and trained on the CartPole environment for a specified number of timesteps. The trained model is then evaluated on the environment for 10 episodes, with rendering turned on to visualize the policy's performance.

- Finally, the trained model is used to run 10 more episodes, selecting each action from the trained policy until the pole falls or the maximum number of steps is reached. The score for each episode is printed to the console.

Overall, the code demonstrates the basic steps of training and evaluating an RL agent: creating an environment, creating a model, training the model, and evaluating the learned policy on the environment.
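To give a feel for how PPO constrains its policy updates, here is a minimal single-sample sketch of the clipped surrogate objective. It is an illustration only: `clipped_surrogate` is a hypothetical helper, the real `stable_baselines3` implementation works on batches of tensors and adds value-function and entropy terms, and the default `clip_range=0.2` is chosen to match the `clip_range` value shown in the training log further down.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_range=0.2):
    """Clipped PPO objective for one (state, action) sample.

    new_logp / old_logp: log-probability of the action under the updated
    and the data-collecting policy; advantage: estimated advantage.
    """
    ratio = math.exp(new_logp - old_logp)  # probability ratio r = pi_new / pi_old
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval [1 - clip_range, 1 + clip_range].
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability grew by 50%:
# the objective is capped as if it had grown by only 20%.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))  # 1.2
```

The optimizer maximizes this quantity averaged over a batch, so once the ratio leaves the clipping interval in the favorable direction the gradient for that sample vanishes.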

In [15]:
# import required libraries
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

### Set environment name and create environment

In [16]:
env_name = 'CartPole-v0'
env = gym.make(env_name)

#### Loop to run episodes of the environment
Ten episodes of the environment are run. At each time step an action is sampled at random from the action space, and this continues until the episode ends. The score achieved in the episode is then printed.

In [17]:
for episode in range(1, 11):
    score = 0
    state = env.reset()
    done = False
    
    # loop until episode is completed
    while not done:
        env.render() # render the environment
        action = env.action_space.sample() # randomly select an action
        n_state, reward, done, info = env.step(action) # take the action and observe the next state, reward and done flag
        score += reward # add the reward to the score
        
    # print episode number and score
    print('Episode:', episode, 'Score:', score)

Episode: 1 Score: 22.0
Episode: 2 Score: 14.0
Episode: 3 Score: 31.0
Episode: 4 Score: 28.0
Episode: 5 Score: 16.0
Episode: 6 Score: 12.0
Episode: 7 Score: 38.0
Episode: 8 Score: 15.0
Episode: 9 Score: 16.0
Episode: 10 Score: 13.0
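
To summarize this random-action baseline in a single number, the ten scores printed above can be averaged. The short sketch below copies those scores by hand into a list (the variable names are illustrative); it does not rerun the environment.

```python
from statistics import fmean

# Scores copied from the random-action run printed above.
random_scores = [22.0, 14.0, 31.0, 28.0, 16.0, 12.0, 38.0, 15.0, 16.0, 13.0]

mean_score = fmean(random_scores)
print(mean_score, min(random_scores), max(random_scores))  # 20.5 12.0 38.0
```

Because actions are sampled at random, a fresh run will give different scores, but they should stay far below the 200-step limit.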


In [18]:
env.close() # close the environment

### Training the model
A new CartPole environment is created and wrapped in a vectorized environment. A PPO model is then instantiated with the `'MlpPolicy'` policy and trained for 20000 timesteps. Finally, the trained model is saved.

In [19]:
env = gym.make(env_name) # create a new environment
env = DummyVecEnv([lambda: env]) # create a vectorized environment from the environment
model = PPO('MlpPolicy', env, verbose=1) # create a PPO model with MLP policy and vectorized environment
model.learn(total_timesteps=20000) # train the model for 20000 timesteps

# save the model
model.save('ppo model')

Using cpu device
-----------------------------
| time/              |      |
|    fps             | 772  |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 589         |
|    iterations           | 2           |
|    time_elapsed         | 6           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008211978 |
|    clip_fraction        | 0.0882      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.00479     |
|    learning_rate        | 0.0003      |
|    loss                 | 8.09        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0128     |
|    value_loss           | 52.3        |
-----------------------------------------
-----------------
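
The log above advances in blocks of 2048 timesteps, which is the rollout size PPO collects before each update (assumed here to be the `n_steps` default, since the code does not set it). Under that assumption, the number of iterations needed to cover 20000 timesteps can be worked out as follows; `rollout_schedule` is a hypothetical helper, not part of `stable_baselines3`.

```python
import math


def rollout_schedule(total_timesteps, n_steps=2048, n_envs=1):
    """Return (iterations, timesteps actually collected).

    Assumes training proceeds in whole rollouts of n_steps * n_envs
    timesteps and stops once total_timesteps has been reached.
    """
    steps_per_iteration = n_steps * n_envs
    iterations = math.ceil(total_timesteps / steps_per_iteration)
    return iterations, iterations * steps_per_iteration


print(rollout_schedule(20000))  # (10, 20480)
```

So a request for 20000 timesteps would be expected to run for 10 iterations and slightly overshoot the requested total.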

#### Model Evaluation
The trained model is evaluated on the environment for 10 episodes and the result is displayed, with `render=True` to visualize the performance.

In [20]:
# evaluate the trained model
evaluate_policy(model, env, n_eval_episodes=10, render=True)

(200.0, 0.0)
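
The pair returned by `evaluate_policy` is the mean and standard deviation of the per-episode reward. The sketch below reproduces that summary with the standard library for a hypothetical list of ten episode rewards of 200 each; `summarize_returns` is an illustrative helper, and the use of the population standard deviation is an assumption about how the library computes it.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of episode rewards."""
    return fmean(episode_returns), pstdev(episode_returns)


# Ten episodes that all reach the 200-step limit.
print(summarize_returns([200.0] * 10))  # (200.0, 0.0)
```

A mean of 200.0 with zero spread therefore means every evaluation episode ran to the step limit.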

#### Loop to run episodes of the environment using the trained model
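Ten episodes of the environment are run, using the trained model to select an action at each time step, and the score achieved in each episode is printed. Because the environment is vectorized, each reward is a one-element array, so the scores print as `[200.]` rather than `200.0`.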

In [21]:
for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False
    
    # loop until episode is completed
    while not done:
        env.render() # render the environment
        action, _ = model.predict(obs) # use the trained model to select an action
        obs, reward, done, info = env.step(action) # take the action and observe the next state, reward and done flag
        score += reward # add the reward to the score
        
    # print episode number and score
    print('Episode:', episode, 'Score:', score)

Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]


In [22]:
env.close() # close the environment

#### Comparison

##### Before Training
<p align="center"><img src="https://github.com/kashifliaqat/Data_Science_and_Machine-Learning/raw/main/Images/cart_1.gif" alt="CartPole Before Training" width="500" height="300"></p>

##### After Training
<p align="center"><img src="https://github.com/kashifliaqat/Data_Science_and_Machine-Learning/raw/main/Images/cart_2.gif" alt="CartPole After Training" width="500" height="300"></p>
