<a href="https://colab.research.google.com/github/memorysaver/reinforcement_learning_practice/blob/main/Lession_1_Reinforcement_Q_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 1: Reinforcement Q-Learning Practice

This unit covers the basics of reinforcement learning and the `OpenAI Gym` training environment framework. By practicing the Q-learning algorithm, you will get a better understanding of how a value-based method solves an RL problem, and of its main limit: it cannot handle problems with a `large observation space`.

## Reinforcement Learning Concept
- [the tutorial by thomassimonini](https://thomassimonini.medium.com/an-introduction-to-deep-reinforcement-learning-17a565999c0c) (👉  Must Read!!)
- [thomassimonini rl course github](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) 
- [Hugging Face RL program](https://github.com/huggingface/deep-rl-class) (👉  the same author's latest tutorial with Hugging Face)

## Gym Tutorial
- [Gym Tutorial](https://www.gymlibrary.ml/content/tutorials/) (👉  Great Tutorials!!)

## Q-learning
- https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-1-2-3e8f5e764358
- https://www.gocoder.one/blog/rl-tutorial-with-openai-gym
- https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

## Reference Libraries
- 🎮 Environment: [OpenAI Gym](https://www.gymlibrary.ml/content/api/)

- 📚 RL-Library: [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)



In [None]:
# Installation
!pip install cmake 'gym[atari]' scipy

In [None]:
import numpy as np
import gym
import random

In [None]:
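# `.env` drops the outermost wrapper (normally the built-in step limit),
# so episode length is controlled by `max_steps` further down
env = gym.make("Taxi-v3").env
env.render()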


# Why is the state space 500?

You'll also notice there are four (4) locations that we can pick up and drop off a passenger: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is in location Y and they wish to go to location R.

When we also account for one (1) additional passenger state of being inside the taxi, we can take all combinations of taxi positions, passenger locations, and destination locations to come to a total number of states for our taxi environment: there are twenty-five (5 × 5) taxi positions on the grid, five (4 + 1) passenger locations, and four (4) destinations.

So, our taxi environment has $5 \times 5 \times 5 \times 4 = 500$ total possible states.

[reference](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/)
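
The counting argument above can be checked with a few lines of plain Python. This is a minimal sketch that does not need Gym: the `encode` helper and the constant names are written for illustration, and the order in which the four components are packed into one integer is an assumption about how the environment numbers its states.

```python
from itertools import product

GRID_ROWS, GRID_COLS = 5, 5      # taxi positions on the 5x5 grid
PASSENGER_LOCATIONS = 5          # R, G, Y, B, or inside the taxi
DESTINATIONS = 4                 # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer (mixed-radix encoding).

    The packing order is an assumption made for this illustration.
    """
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * PASSENGER_LOCATIONS + passenger_loc
    index = index * DESTINATIONS + destination
    return index


# Enumerate every combination and collect the distinct state indices
all_states = {
    encode(r, c, p, d)
    for r, c, p, d in product(
        range(GRID_ROWS),
        range(GRID_COLS),
        range(PASSENGER_LOCATIONS),
        range(DESTINATIONS),
    )
}

print(len(all_states))                    # 500
print(min(all_states), max(all_states))   # 0 499
```

Every combination maps to a different integer, so each state gets its own row in the Q-table.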

In [None]:
state_space = env.observation_space.n
print(f"There are {state_space} possible states")
action_space = env.action_space.n
print(f"There are {action_space} possible actions")

In [None]:
Q = np.zeros((state_space, action_space))
print(Q)
print(Q.shape)
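
The Q-table holds one value per (state, action) pair, which is where the `large observation space` limit mentioned at the start of this unit comes from. Below is a rough sketch of how the number of entries grows. The 8x8 black-and-white image environment is a hypothetical illustration that is not used in this notebook, and `q_table_entries` is a helper name made up here.

```python
def q_table_entries(n_states, n_actions):
    """Number of values a tabular Q-function has to store."""
    return n_states * n_actions


# Taxi-v3: 500 states and 6 actions
taxi_entries = q_table_entries(500, 6)

# Hypothetical environment: observations are 8x8 black-and-white images.
# Each of the 64 pixels is 0 or 1, so there are 2**64 distinct states.
image_entries = q_table_entries(2 ** 64, 6)

print(taxi_entries)     # 3000
print(image_entries)
```

A table with 3000 entries is trivial to store and update, while the image case is already far beyond what fits in memory, even though the images are tiny.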

In [None]:
total_episodes = 25000        # Total number of training episodes
total_test_episodes = 100     # Total number of test episodes
max_steps = 200               # Max steps per episode

learning_rate = 0.01          # Learning rate
gamma = 0.99                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.001            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob
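
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay formula that the training loop applies at the start of each episode. `epsilon_at` is a helper name introduced only for this illustration; the parameter values are copied from the cell above.

```python
import numpy as np

max_epsilon = 1.0     # exploration probability at start
min_epsilon = 0.001   # minimum exploration probability
decay_rate = 0.01     # exponential decay rate


def epsilon_at(episode):
    """Exploration rate used during the given episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


# Print the exploration rate at a few points during training
for episode in (0, 100, 500, 1000):
    print(episode, round(float(epsilon_at(episode)), 4))
```

With these values, epsilon starts at 1.0 (pure exploration) and is already close to `min_epsilon` after about a thousand episodes, so most of the 25000 training episodes are nearly greedy.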

In [None]:
def epsilon_greedy_policy(Q, state):
    # Relies on the global `epsilon`, which the training loop updates every episode
    # If the random number is greater than epsilon --> exploitation
    if random.uniform(0, 1) > epsilon:
        action = np.argmax(Q[state])
    # Otherwise --> exploration
    else:
        action = env.action_space.sample()

    return action

In [None]:
for episode in range(total_episodes):
    # Reset the environment
    # (classic Gym API, gym < 0.26: reset() returns the state, step() returns a 4-tuple)
    state = env.reset()
    done = False

    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

    for step in range(max_steps):
        # Choose an action (a) in the current state (s) with the epsilon-greedy policy
        action = epsilon_greedy_policy(Q, state)

        # Take the action (a) and observe the outcome state (s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[state, action] += learning_rate * (
            reward + gamma * np.max(Q[new_state]) - Q[state, action]
        )

        # If done: finish the episode
        if done:
            break

        # The new state becomes the current state
        state = new_state
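
To make the update rule in the loop concrete, here is a single Q-learning update worked through by hand. It is a minimal sketch on a made-up problem with 2 states and 2 actions; the transition and the pre-filled Q-values are hypothetical, and only `learning_rate` and `gamma` are taken from this notebook.

```python
import numpy as np

learning_rate = 0.01
gamma = 0.99

# Tiny hypothetical problem: 2 states, 2 actions, estimates start at 0
Q = np.zeros((2, 2))
Q[1] = [0.0, 10.0]      # pretend state 1 already has one good action

# One observed transition (s, a, r, s')
state, action, reward, new_state = 0, 1, -1, 1

td_target = reward + gamma * np.max(Q[new_state])   # -1 + 0.99 * 10 = 8.9
td_error = td_target - Q[state, action]             # 8.9 - 0 = 8.9
Q[state, action] += learning_rate * td_error        # 0 + 0.01 * 8.9 = 0.089

print(Q[state, action])
```

The estimate moves only 1% of the way toward the target on each visit, which is why a small learning rate needs many episodes before the values settle.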

In [None]:
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    done = False
    total_rewards = 0
    print("****************************************************")
    print("EPISODE ", episode)
    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(Q[state])
        new_state, reward, done, info = env.step(action)
        total_rewards += reward

        if done:
            break
        state = new_state
    # Record the score even if the episode hit max_steps without finishing,
    # so the average below really covers every test episode
    rewards.append(total_rewards)
env.close()
print("Score over time: " + str(sum(rewards) / total_test_episodes))