# AI Capstone Project 15B - CartPole Data Preparation
## 1. Dependencies
If you are running this notebook for the first time, uncomment the lines below and install the packages.  
You can use either Google Colab or your own Conda environment on a personal computer.

Hardware Requirements:
- No GPU is needed; PPO is trained on the CPU here.

In [1]:
# !conda install pytorch torchvision torchaudio cpuonly -c pytorch
# !pip install gym[classic_control]==0.21.0
# !pip install stable-baselines3[extra]
# !pip install pyglet==1.4.10
# !pip install moviepy
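Before moving on, it can help to confirm which of the packages above are actually installed. The following is a minimal sketch using only the standard library; the distribution names in `REQUIRED` are assumptions derived from the install commands (for example, the Conda `pytorch` package is assumed to register as `torch`), and `installed_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata

# Distribution names assumed from the install commands above.
REQUIRED = ["torch", "gym", "stable-baselines3", "pyglet", "moviepy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` will raise `ModuleNotFoundError` when it is imported in the cells that follow.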

In [2]:
import gym
from tqdm import tqdm
import numpy as np

In [3]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy


## 2. Train an AI Expert Agent
We will use a dataset generated by an AI expert that is trained through reinforcement learning.  
Rather than writing the learning algorithm from scratch, we train the agent with the PPO (Proximal Policy Optimization) implementation provided by the Stable Baselines3 package.
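To give a feel for what PPO optimizes, here is a minimal pure-Python sketch of its clipped surrogate objective for a single batch. It is an illustration under simplifying assumptions, not the Stable Baselines3 implementation: `clipped_surrogate` is a hypothetical helper, and the sample log-probabilities and advantages are made up. The `clip_range` argument and the returned clip fraction correspond in spirit to the `clip_range` and `clip_fraction` entries that Stable Baselines3 prints during training.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Return (mean clipped objective, fraction of samples whose ratio was clipped)."""
    terms = []
    clipped = 0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the data-collecting policy
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms
        terms.append(min(ratio * adv, clipped_ratio * adv))
        if abs(ratio - 1.0) > clip_range:
            clipped += 1
    n = len(terms)
    return sum(terms) / n, clipped / n


# Made-up batch: the action probabilities move from 0.5 to 0.7, 0.5, and 0.3.
old = [math.log(0.5)] * 3
new = [math.log(0.7), math.log(0.5), math.log(0.3)]
objective, clip_fraction = clipped_surrogate(new, old, advantages=[1.0, 1.0, -1.0])
print(objective, clip_fraction)
```

The objective is maximized during training; the policy loss reported by a library is typically its negative. Clipping removes the incentive to move the ratio further than `clip_range` away from 1 in a single update.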

Our goal is to train the AI expert to achieve the maximum reward in the 'CartPole-v1' simulation from OpenAI Gym (now maintained as Gymnasium). According to the official documentation, the agent receives a reward of +1 for every step taken and the episode is truncated after 500 steps, so the maximum score is 500 points (https://gymnasium.farama.org/environments/classic_control/cart_pole/).
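The link between episode length and score can be illustrated with a simplified, pure-Python re-implementation of the cart-pole dynamics. This is a sketch, not the Gym source: the physical constants and termination limits are written from memory of the CartPole documentation and should be treated as assumptions, and `step` and `episode_return` are hypothetical helpers. It shows that the return is simply the number of steps survived, capped at 500.

```python
import math
import random

# Constants assumed from the CartPole documentation; treat them as approximate.
GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
POLE_HALF_LENGTH, FORCE_MAG, TAU = 0.5, 10.0, 0.02
X_LIMIT = 2.4                   # cart position limit
THETA_LIMIT = math.radians(12)  # pole angle limit
MAX_STEPS = 500                 # CartPole-v1 time limit


def step(state, action):
    """Advance the simplified cart-pole dynamics by one Euler step."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass
    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


def episode_return(policy, rng):
    """Run one episode and return the sum of the +1 per-step rewards."""
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total = 0.0
    for _ in range(MAX_STEPS):
        state = step(state, policy(state))
        total += 1.0  # +1 for every step taken, including the final one
        if abs(state[0]) > X_LIMIT or abs(state[2]) > THETA_LIMIT:
            break  # the pole fell over or the cart left the track
    return total


rng = random.Random(0)
random_return = episode_return(lambda s: rng.randrange(2), rng)
always_right_return = episode_return(lambda s: 1, rng)
print(random_return, always_right_return)
```

A policy that ignores the state fails within a few steps, whereas a well-trained expert is expected to survive all 500 steps.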


In [None]:
env = make_vec_env("CartPole-v1", n_envs=1)

In [None]:
ai_expert = PPO('MlpPolicy', env, verbose=1)
ai_expert.learn(total_timesteps=25000)
ai_expert.save('ai-expert')

Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22       |
|    ep_rew_mean     | 22       |
| time/              |          |
|    fps             | 1580     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 24.9        |
|    ep_rew_mean          | 24.9        |
| time/                   |             |
|    fps                  | 1178        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008221751 |
|    clip_fraction        | 0.12        |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.685      |
|    explained_variance   | 0.00413     |
|    learning

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 164          |
|    ep_rew_mean          | 164          |
| time/                   |              |
|    fps                  | 967          |
|    iterations           | 11           |
|    time_elapsed         | 23           |
|    total_timesteps      | 22528        |
| train/                  |              |
|    approx_kl            | 0.0068453318 |
|    clip_fraction        | 0.0556       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.544       |
|    explained_variance   | 0.842        |
|    learning_rate        | 0.0003       |
|    loss                 | 4.15         |
|    n_updates            | 100          |
|    policy_gradient_loss | -0.00435     |
|    value_loss           | 25.5         |
------------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len

In [None]:
# If you have already trained and saved the model, you can load it instead:
# comment out the training block above and run this block.
ai_expert = PPO.load('ai-expert')

In [None]:
# Evaluate the trained AI expert
mean_reward, std_reward = evaluate_policy(ai_expert, env, n_eval_episodes=10)
print(f'Mean reward = {mean_reward} +/- {std_reward}')

Mean reward = 500.0 +/- 0.0
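The two numbers reported above are summary statistics over the per-episode returns. As a rough sketch of that calculation (assuming, as I understand it, that the standard deviation is the population form, as with NumPy's default), the hypothetical helper `summarize_returns` below reproduces the printed line for ten episodes that all reach the 500-step cap.

```python
import statistics


def summarize_returns(episode_returns):
    """Return the mean and population standard deviation of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Ten evaluation episodes that all hit the 500-point cap.
mean_reward, std_reward = summarize_returns([500.0] * 10)
print(f"Mean reward = {mean_reward} +/- {std_reward}")
```

A standard deviation of 0.0 together with a mean of 500.0 means every evaluation episode ran for the full 500 steps.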


## 3. Recording Demonstrations
In this part, we record video demonstrations of the trained AI expert acting in the 'CartPole-v1' environment.  


In [None]:
# gym 0.21 predates the render_mode argument of gym.make(); the Monitor wrapper renders the frames itself.
recording_env = gym.wrappers.Monitor(gym.make('CartPole-v1'), 'sample-video', video_callable=lambda episode_id: True, force=True)

In [None]:
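VIDEO_RECORD_TRY = 5  # number of episodes to record

for _ in tqdm(range(VIDEO_RECORD_TRY)):
    obs = recording_env.reset()
    done = False
    while not done:
        # Ask the expert for an action, then advance the simulation by one step
        action, _states = ai_expert.predict(obs)
        obs, reward, done, info = recording_env.step(action)

recording_env.close()  # finalize the recorded video files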

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.65s/it]
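Since the goal of this notebook is data preparation, the same rollout loop can also store the observation and action at every step, which is the form a behavioral-cloning dataset usually takes. The sketch below is illustrative only: `StubEnv` is a hypothetical stand-in that mimics the 4-tuple `step` API of gym 0.21 so the example runs without Gym installed, and `collect_demonstrations` is a hypothetical helper, not part of Gym or Stable Baselines3.

```python
class StubEnv:
    """Hypothetical stand-in for a gym 0.21-style environment (4-tuple step API)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        obs = [float(self.t), 0.0, 0.0, 0.0]
        done = self.t >= self.episode_length
        return obs, 1.0, done, {}


def collect_demonstrations(env, predict, n_episodes):
    """Roll out `predict` in `env` and record every (observation, action) pair."""
    observations, actions = [], []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = predict(obs)
            # Store the observation together with the action chosen for it
            observations.append(obs)
            actions.append(action)
            obs, _reward, done, _info = env.step(action)
    return observations, actions


observations, actions = collect_demonstrations(StubEnv(), lambda obs: 1, n_episodes=3)
print(len(observations), len(actions))
```

With the real environment and expert, `predict` could be something like `lambda obs: ai_expert.predict(obs)[0]`, and the two lists could be converted to NumPy arrays before saving.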


## References
- https://medium.com/@sthanikamsanthosh1994/imitation-learning-behavioral-cloning-using-pytorch-d5013404a9e5
- https://gymnasium.farama.org/environments/classic_control/cart_pole/
- https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html