# Lab-01: Setup and Introduction to Reinforcement Learning

## Setup

### Python3 and Libraries Installation  

- If your laptop has a built-in Nvidia RTX GPU, you can use your **own laptop** throughout the course.  
- If it does not, we recommend using **Google Colab** (https://colab.research.google.com/) instead, since most deep learning training requires GPU processing.
- We will use **Python 3** throughout the lab. If you only have an older version such as Python 2, you must install Python 3, because much of the syntax differs.  
- To install Python 3 on your machine, follow the instructions below for your OS.  

**For Windows Users**
- Check your processor architecture using `systeminfo | find "System Type"`
- Go to https://www.python.org/downloads/windows/ and download the installation package (.exe file) that matches your architecture.  
- Before installing, you can check whether Python 3 is already available by typing `python` or `python3` in your terminal or command line. If it is installed, this opens the Python 3 interpreter.  
- To exit the Python 3 interpreter, type `exit()`, or press `Ctrl+Z` followed by `Enter`.  

<img src="img/windows.png" alt="Windows Users" width="300px" style="float: center" />
<br clear="left" />

- While installing Python from the `.exe` file, make sure you tick the "Add Python to PATH" box; otherwise you will not be able to run Python from the Windows terminal.  

<img src="img/add_path.png" alt="Add Paths" width="150px" style="float: center" />
<br clear="left" />

- Next, you need pip, Python's package installer, to install Python packages easily. To set it up, follow the instructions at https://www.geeksforgeeks.org/how-to-install-pip-on-windows/.

**For Linux Users**
- On Ubuntu (Linux), Python 3 is usually preinstalled as an OS package, so you normally do not need to install it.  
- To check whether Python 3 is installed, run `python` or `python3` in your terminal; it should open the Python 3 interpreter as shown in the image below. Otherwise, you need to install Python 3.
- To exit the Python 3 interpreter, type `exit()` or press `Ctrl+D`. (`Ctrl+Z` only suspends the interpreter on Linux.)

<img src="img/python3.png" alt="Linux Users" width="500px" style="float: center" />
<br clear="left" />

- To install it, open a terminal and run:
    - `sudo apt-get update && sudo apt-get upgrade -y`
    - `sudo apt-get install python3`, or, to install a specific minor version, `sudo apt-get install python3.x` (for example `python3.10`, if your release provides that package)
- To install Python Build Dependencies, run the command:
    - `sudo apt-get build-dep python3`
- You can also install some useful modules by running the command:
    - `sudo apt-get install build-essential gdb lcov libbz2-dev libffi-dev libgdbm-dev liblzma-dev libncurses5-dev libreadline6-dev libsqlite3-dev libssl-dev lzma lzma-dev tk-dev uuid-dev zlib1g-dev`
- Next, you need pip, Python's package installer, to install Python packages easily. To install it, run `sudo apt-get install python3-pip`. (The `python-pip` package is the Python 2 version.)

**For MacOS Users**
- Please follow the instructions at https://www.datacamp.com/blog/how-to-install-python.  

**For All Users**
- To check Python version in terminal: `python3 --version` or `python3 -V`
- To check PIP version in terminal: `pip3 --version` or `pip3 -V`
- To install a Python library or package with pip: `pip3 install package_name`, e.g. `pip3 install numpy`
- You can always check package versions and install commands on https://pypi.org, e.g. https://pypi.org/project/numpy/
- To install Jupyter Notebook: `pip3 install jupyterlab` and `pip3 install notebook`
- To run Jupyter Notebook, open a terminal, go to the directory you want to work in, and run `jupyter notebook`
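
The version checks above can also be done from inside Python. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is just an example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names as published on PyPI; adjust the list to your needs.
    for name, version in installed_versions(["numpy", "torch", "gymnasium"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" can then be added with `pip3 install package_name`.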

**Install Gymnasium Library**
- Please refer to https://github.com/Farama-Foundation/Gymnasium for detailed installation instructions and the API.
- The environments are grouped into the following categories:
    1. Classic Control
    2. Box2D
    3. Toy Text
    4. MuJoCo
    5. Atari
- You can install the environments of one category with `pip3 install gymnasium[env]`, replacing `env` with the name of that category's extra. However, we recommend `pip3 install gymnasium[all]`, since we will work with many environments.  
- The documentation for Gymnasium can be found at https://gymnasium.farama.org/index.html.

**Checking the GPU on Linux and Windows (only for GTX or RTX GPUs)**
- Run `nvidia-smi` in a terminal to check your GPU status and CUDA version.
- The output should look like this:

<img src="img/nvidia-smi.png" alt="CUDA Status" width="500px" style="float: center" />
<br clear="left" />

**Google Colab Users**
- Every time you start a Google Colab notebook, you need to run the following steps.
- First, mount your Google Drive using `from google.colab import drive`, `drive.mount('/content/drive')`.
- To install Gymnasium,
    1. `!apt-get -y install swig`
    2. `!pip3 install box2d-py`
    3. `!pip3 install gymnasium[all]`

**Install PyTorch**
- Please refer to https://pytorch.org/get-started/locally/ for detailed installation instructions.
- For macOS users, select Mac as your OS; PyTorch provides MPS acceleration for M1 and M2 processors. **Note that the code for training on the GPU differs from CUDA.** Please follow https://developer.apple.com/metal/pytorch/.
- After installation, you can test CUDA availability using `import torch`, `print(torch.cuda.is_available())`. It prints `True` when a CUDA-capable GPU and driver are detected, and `False` otherwise.
- PyTorch is already installed in Google Colab, so you can simply switch the runtime to GPU and run the code above.
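
Since the available accelerator differs between CUDA machines, Apple silicon, and CPU-only setups, a small check can choose the device name for you. This is a sketch under assumptions: `pick_device` is a hypothetical helper (not part of PyTorch), and it returns `None` if PyTorch is not installed.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use.

    Returns None when PyTorch itself is not installed.
    """
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print("Selected device:", pick_device())
```

The returned name can be passed to `torch.device(...)` and then to `.to(device)` on tensors and models.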

**Other Libraries**
- Some common libraries you might want to install include:
    1. numpy
    2. pandas
    3. matplotlib
    4. seaborn
    5. opencv
    6. setuptools
- I will let you know in good time when other libraries or packages are needed during the lab.


### Testing the Gymnasium Environment

In [None]:
import gymnasium as gym
from gymnasium.utils.save_video import save_video  # utility function that saves rendered frames as a video


# You can make any Gymnasium environment here.
# On Gymnasium 1.0 or later, use "LunarLander-v3" instead.
env = gym.make("LunarLander-v2", render_mode="rgb_array_list")
observation, info = env.reset(seed=42)
step_starting_index = 0
episode_index = 0

for step_index in range(1000):
    action = env.action_space.sample()  # this is where you would insert your policy
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        # env.render() returns the frames collected during this episode
        save_video(
            env.render(),
            "sample_data",
            fps=env.metadata['render_fps'],
            step_starting_index=step_starting_index,
            episode_index=episode_index,
            name_prefix='testing'
        )
        step_starting_index = step_index + 1  # the next episode starts at the next step
        episode_index += 1
        observation, info = env.reset()
env.close()

If you are running on your own laptop, you can render the agent directly in a window without saving a video, as shown below.

In [None]:
import gymnasium as gym
import time

# render_mode="human" opens a window; frames are drawn on reset() and step()
env = gym.make("LunarLander-v2", render_mode="human")
env.reset()
time.sleep(5)
env.close()

## Reinforcement Learning

### What is Reinforcement Learning
Reinforcement learning is a mathematical formalism for learning-based decision making: an approach for learning decision making and control from experience. (Robotic AI & Learning Lab at UC Berkeley)  

<img src="img/RL.png" alt="RL Flowchart" width="500px" style="float: center" />
<br clear="left" />  

**Notations**

1. $\mathit{S}_{t}$ : State at time t
2. $\mathit{A}_{t}$ : Action taken at time t
3. $\mathit{R}_{t}$ : Reward at time t
4. $t$ : Discrete time $\in$ {0, 1, 2, $\ldots$, $\mathit{T}$}

Basically, at each time step $t$,
1. The **agent** at **state $\mathit{S}_{t}$** takes an **action $\mathit{A}_{t}$** as the input to the environment.
2. Then, the environment evolves to a **new state $\mathit{S}_{t+1}$** according to its transition probabilities (the dynamics).
3. The **agent** then observes the **new state** in the environment and (optionally) a **reward $\mathit{R}_{t+1}$**.  
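
The three steps above can be sketched as a short loop. This is an illustration only: `CorridorEnv` is a hypothetical toy environment written for this example (it is not a Gymnasium environment), and the agent simply picks random actions.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state S_0

    def step(self, action):
        # Action 1 moves right, any other action moves left (clamped at cell 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done  # S_{t+1}, R_{t+1}, end-of-episode flag


def run_random_agent(env, max_steps=100, seed=0):
    """Run one episode with random actions; return (total reward, steps taken)."""
    rng = random.Random(seed)
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = rng.choice([0, 1])             # the agent takes action A_t
        state, reward, done = env.step(action)  # the environment returns S_{t+1} and R_{t+1}
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total, steps = run_random_agent(CorridorEnv())
print("total reward:", total, "steps:", steps)
```

Gymnasium environments follow the same pattern, with `reset()` and `step()` returning a few extra values.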

To have a better understanding, let's take a look at the following [example video](https://www.instagram.com/maythesciencebewithyou/reel/C2QSNIutBj6/) and a few examples.  
- Archer Robot  
- Robot Maze

### Supervised Learning vs Reinforcement Learning
<img src="img/supervised_vs_RL.png" alt="Supervised Learning vs RL" width="800px" style="float: center" />
<br clear="left" />

## Implementing a random search policy

Now, let's implement a random search policy in the CartPole environment.  
You can use a `.py` or `.ipynb` file.  
1. First, import the Gymnasium and PyTorch packages and create the "CartPole" environment.

In [None]:
import gymnasium as gym
from gymnasium.utils.save_video import save_video ### This is a utility function to save the video frames
import torch


env = gym.make("CartPole-v1", render_mode="rgb_array_list")

2. Check the shape of the state and the number of actions

In [None]:
n_state = env.observation_space.shape
print('State shape:', n_state, 'number of state variables:', n_state[0])

n_action = env.action_space.n
print('Number of actions:', n_action)

3. Create a `run_episode` function that runs the agent for one episode, using a weight matrix to choose actions, and returns the total reward for that episode.

In [None]:
def run_episode(env, episode_index, weight, show=False):
    # Reset to the initial state (the fixed seed makes every episode start from the same state)
    state, info = env.reset(seed=27)
    total_reward = 0
    terminated = False
    truncated = False
    while not (terminated or truncated):  # loop until the episode ends
        # Convert the observation from the environment to a float tensor
        state = torch.from_numpy(state).float()
        # Score each action and pick the one with the highest score
        action = torch.argmax(torch.matmul(state, weight))
        # Send the action to the environment to get the next state and reward
        state, reward, terminated, truncated, _ = env.step(action.item())
        # Sum all rewards
        total_reward += reward

    if show:
        # env.render() returns the frames collected during this episode
        save_video(
            env.render(),
            "sample_data",
            fps=env.metadata['render_fps'],
            step_starting_index=0,
            episode_index=episode_index,
            name_prefix='testing'
        )
    # The environment is not closed here because it is reused for later episodes

    return total_reward

The <code>weight</code> matrix $W$ holds one random weight for each pair of state variable and action. To calculate a score for each action $A$ given the state $S$, we multiply the matrices:

$$pA=SW$$
where:  
    $pA$: an array of weighted values, one for each action

In reinforcement learning, an action can be sampled at random (according to probabilities) or chosen as the one with the maximum value. In this implementation, we select the action $a$ with the maximum value. To get the index of the maximum value, use the <code>torch.argmax()</code> function. It returns a tensor holding a single index, so use <code>.item()</code> to convert it to a plain Python integer.
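
To make the product $pA=SW$ concrete, here is a small worked example in plain Python. The state and weight values are made up for illustration, and `action_scores` and `greedy_action` are hypothetical helper names; in the lab code the same computation is done with `torch.matmul` and `torch.argmax`.

```python
def action_scores(state, weight):
    """Compute pA = S W for a state vector S and a weight matrix W (a list of rows)."""
    n_actions = len(weight[0])
    return [
        sum(s * row[a] for s, row in zip(state, weight))
        for a in range(n_actions)
    ]


def greedy_action(state, weight):
    """Return the index of the action with the highest score."""
    scores = action_scores(state, weight)
    return max(range(len(scores)), key=scores.__getitem__)


state = [0.5, -1.0, 0.25, 2.0]  # made-up state with 4 variables
weight = [                      # made-up 4x2 weight matrix (4 state variables, 2 actions)
    [0.2, 0.8],
    [0.5, 0.1],
    [0.4, 0.4],
    [0.1, 0.3],
]

print("scores:", action_scores(state, weight))
print("chosen action:", greedy_action(state, weight))  # action 1 has the larger score
```

Here action 0 scores about -0.1 and action 1 scores about 1.0, so the greedy choice is action 1.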

4. Run one episode with a random weight.

In [None]:
# Create random weight
weight = torch.rand(n_state[0], n_action)
# Run one episode to get total_reward (save video)
total_reward = run_episode(env, 1, weight, True)
print('Episode {}: {}'.format(1, total_reward))

5. Find the best weight by searching for the maximum total reward over 500 episodes

In [None]:
# Initialize
best_total_reward = 0
best_weight = None
total_rewards = []
# Set number of episode
n_episode = 500
for episode in range(n_episode):
    weight = torch.rand(n_state[0], n_action)
    # Run 1 episode to get total_reward (not show simulator)
    total_reward = run_episode(env, episode, weight, False)
    if episode % 11 == 0:
        print('Episode {}: {}'.format(episode+1, total_reward))
    # find the best weight from best reward
    if total_reward > best_total_reward:
        best_weight = weight
        best_total_reward =  total_reward
    # keep all total_rewards
    total_rewards.append(total_reward)

In [None]:
print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))

As you can see, the rewards do not improve from episode to episode, because each weight is sampled independently of the previous results.  

6. Simulate the result with the best weight

In [None]:
# Run 1 episode to get total_reward (Save Video)
total_reward = run_episode(env, 1, best_weight, True)

7. Plot the total reward

In [None]:
# This library is used for plot
import matplotlib.pyplot as plt


plt.plot(total_rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.show()
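
The raw reward curve of a random search is noisy, so a moving average can make it easier to see whether there is any trend. A minimal sketch follows; `moving_average` is a hypothetical helper and the window size is an arbitrary choice.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (no padding at the edges)."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], 2))  # [1.5, 2.5, 3.5, 4.5]
```

For example, `plt.plot(moving_average(total_rewards, 20))` would draw the smoothed curve.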

8. Compute the average reward of the best weight over 100 new episodes

In [None]:
# Initialize
best_total_reward = 0
total_rewards_eval = []
# Set number of episode
n_episode = 100
for episode in range(n_episode):
    # Run 1 episode to get total_reward (not show simulator)
    total_reward_eval = run_episode(env, episode, best_weight, False)
    print('Episode {}: {}'.format(episode+1, total_reward_eval))
    # keep all total_rewards
    total_rewards_eval.append(total_reward_eval)
    
print('Average total reward over {} episode: {}'.format(
           n_episode, sum(total_rewards_eval) / n_episode))

## Lab Assignment

1. Set up and install Python, PyTorch, and the Gymnasium environment on **any OS** (Windows, Linux, or macOS) with **any IDE or notebook** (Colab, PyCharm, VS Code, Jupyter, or other).
    - Show results demonstrating that you can use Gymnasium and PyTorch.
    - Save a video of at least 5 seconds from a 3D environment.
2. (Optional) If you lack experience with Python and PyTorch, please study the following:
    - [Python tutorial](https://www.w3schools.com/python/)
    - [Numpy tutorial](https://www.w3schools.com/python/numpy/default.asp)
    - [MatPlotLib](https://matplotlib.org/stable/tutorials/index)
    - [PyTorch tutorial](https://pytorch.org/tutorials/)
3. Try to implement the [**Hill-climbing**](https://en.wikipedia.org/wiki/Hill_climbing) algorithm in *CartPole*. The weight for each episode can be calculated by:
    $$W_n=W_b+\alpha W_r$$
    
    where $W_n$ is the new weight used as input for each episode, $W_b$ is the best weight so far, $\alpha$ is the learning rate scale, and $W_r$ is a new random weight. By default, let $\alpha=0.01$.

    - Plot the graph while training and compare random search with hill-climbing.
    - Change $\alpha$ to 0.5, 0.1, and 0.001, and observe the difference.
    - Write a short report (1-2 pages).
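
To illustrate the update rule $W_n=W_b+\alpha W_r$ without giving away the CartPole solution, here is a sketch of hill-climbing on a toy objective. Everything in it is an assumption made for illustration: `toy_score` stands in for an episode's total reward, $W_r$ is drawn uniformly from $[-1, 1]$, and a new weight is kept only if its score is at least as good as the best so far.

```python
import random


def toy_score(weight, target):
    """Hypothetical stand-in for an episode's total reward: higher is better."""
    return -sum((w - t) ** 2 for w, t in zip(weight, target))


def hill_climb(target, n_iterations=500, alpha=0.01, seed=0):
    """Return the best weight, its score, and the best score after each iteration."""
    rng = random.Random(seed)
    best_weight = [rng.random() for _ in target]
    best_score = toy_score(best_weight, target)
    history = [best_score]
    for _ in range(n_iterations):
        # W_n = W_b + alpha * W_r, with W_r drawn uniformly from [-1, 1] (an assumption)
        random_weight = [rng.uniform(-1.0, 1.0) for _ in target]
        new_weight = [b + alpha * r for b, r in zip(best_weight, random_weight)]
        new_score = toy_score(new_weight, target)
        # Keep the new weight only if it does not score worse than the best so far
        if new_score >= best_score:
            best_weight, best_score = new_weight, new_score
        history.append(best_score)
    return best_weight, best_score, history


best_weight, best_score, history = hill_climb(target=[0.5, -0.2, 0.1])
print("initial score:", history[0], "final score:", best_score)
```

Unlike random search, the best score recorded in `history` never decreases, because each candidate is a small perturbation of the current best weight. In the assignment, the toy score would be replaced by the total reward returned by `run_episode`.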