# Q-Learning

"Q-learning is a **model-free reinforcement learning algorithm** to learn quality of actions telling an agent what action to take under what circumstances. It does not require a model (hence the connotation "model-free") of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations." [Wikipedia](https://en.wikipedia.org/wiki/Q-learning)

<img src="../images/q-learning.png" width="250" height="300">

**Getting started**

In [1]:
# checking python version
!python -V

Python 3.7.3


In [2]:
# installing OpenAI's gym
# ! pip install gym

**The Environment**

In [3]:
# importing libraries & packages
import gym
import numpy as np

In [4]:
# checking how many actions the gym env offers
env = gym.make("MountainCar-v0")
print(env.action_space.n)

3


The environment is made of **States** and **Actions**. **States** are the observations the agent samples from the environment. **Actions** are the choices the agent makes based on those observations.

As we see above, there are 3 **actions** in this environment:

- **0** = push left
- **1** = no push (the car coasts)
- **2** = push right

In [5]:
# NOTE: uncomment the cell to run the video
# testing the environment
# env = gym.make("MountainCar-v0")
# env.reset()

# complete = False
# while not complete:
#     action = 2  # always push right
#     _, _, complete, _ = env.step(action)  # update the flag so the loop can end
#     env.render()
# env.close()

The car does not have enough power to climb the hill on the right just by pushing right.


<img src="../images/car-image-right.png" width="300" height="380">

In [6]:
# checking the starting observation state
env = gym.make("MountainCar-v0")
print(env.reset())

[-0.40816819  0.        ]


In [7]:
# checking reward and observations while always pushing right
# (classic gym API, gym < 0.26: env.step() returns four values)
env = gym.make("MountainCar-v0")
state = env.reset()

complete = False
while not complete:
    action = 2
    new_state, reward, complete, _ = env.step(action)
    print(reward, new_state)

-1.0 [-0.56318086  0.00130589]
-1.0 [-0.5605788   0.00260206]
-1.0 [-0.55669996  0.00387884]
-1.0 [-0.55157327  0.00512669]
-1.0 [-0.54523701  0.00633625]
-1.0 [-0.53773858  0.00749843]
-1.0 [-0.52913414  0.00860444]
-1.0 [-0.51948818  0.00964596]
-1.0 [-0.50887305  0.01061513]
-1.0 [-0.49736833  0.01150472]
-1.0 [-0.48506013  0.0123082 ]
-1.0 [-0.47204033  0.0130198 ]
-1.0 [-0.45840568  0.01363465]
-1.0 [-0.44425687  0.01414882]
-1.0 [-0.42969751  0.01455935]
-1.0 [-0.41483314  0.01486437]
-1.0 [-0.39977011  0.01506303]
-1.0 [-0.38461458  0.01515553]
-1.0 [-0.3694715   0.01514309]
-1.0 [-0.35444361  0.01502788]
-1.0 [-0.33963064  0.01481298]
-1.0 [-0.32512844  0.0145022 ]
-1.0 [-0.31102836  0.01410008]
-1.0 [-0.29741667  0.01361168]
-1.0 [-0.28437415  0.01304253]
-1.0 [-0.2719757   0.01239844]
-1.0 [-0.26029025  0.01168546]
-1.0 [-0.24938054  0.01090971]
-1.0 [-0.23930322  0.01007732]
-1.0 [-0.23010885  0.00919436]
-1.0 [-0.22184208  0.00826677]
-1.0 [-0.21454179  0.00730029]
-1.0 [-0

In [8]:
# printing the observation bounds (high, then low) for position and velocity
print(env.observation_space.high)
print(env.observation_space.low)

[0.6  0.07]
[-1.2  -0.07]


In [9]:
# splitting each observation dimension into 20 buckets and computing the bucket width
SIZE = [20, 20]
discrete_size = (env.observation_space.high - env.observation_space.low)/SIZE
print(discrete_size)

[0.09  0.007]
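To see what these bucket widths mean in practice, here is a minimal sketch in plain Python that maps one continuous observation to a pair of bucket indices. The bounds and bucket count are copied from the outputs above. The names `LOW`, `HIGH`, `BINS` and `to_bucket` are made up for this illustration, and the clamping step is an extra assumption that the notebook itself does not apply.

```python
# Bounds and bucket count copied from the cell outputs above
LOW = (-1.2, -0.07)    # lowest position, lowest velocity
HIGH = (0.6, 0.07)     # highest position, highest velocity
BINS = (20, 20)        # buckets per dimension

def to_bucket(state):
    """Map a continuous (position, velocity) pair to integer bucket indices."""
    indices = []
    for value, low, high, bins in zip(state, LOW, HIGH, BINS):
        width = (high - low) / bins          # 0.09 for position, 0.007 for velocity
        index = int((value - low) / width)   # how many bucket widths above the lower bound
        # Assumption for this sketch: clamp so a value at the upper bound stays in the last bucket
        indices.append(min(max(index, 0), bins - 1))
    return tuple(indices)

# First observation printed by the reward-checking cell above
print(to_bucket((-0.56318086, 0.00130589)))  # (7, 10)
```

A position of about -0.56 sits 0.64 above the lower bound of -1.2, which is a little over seven bucket widths of 0.09, so it lands in bucket 7.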


In [10]:
# building a q-table
q_table = np.random.uniform(low=-2, 
                            high=0, 
                            size=(SIZE + [env.action_space.n]))

In [11]:
# checking the first position bucket: 20 velocity buckets x 3 actions
q_table[:1]

array([[[-1.72337399, -0.50899528, -1.72069667],
        [-0.1120344 , -0.77771319, -0.90070945],
        [-1.69922035, -0.18103912, -0.81326792],
        [-0.13401867, -1.06787684, -0.09520315],
        [-0.83596025, -0.81115752, -1.88672218],
        [-0.98021965, -0.94440693, -1.91847951],
        [-0.34060962, -1.52456608, -1.2638455 ],
        [-1.94535984, -0.07379877, -1.40099371],
        [-1.15004796, -0.07345389, -1.55533996],
        [-0.7876061 , -1.47409822, -0.68716592],
        [-0.17304745, -1.33138205, -1.70853718],
        [-1.28376421, -0.24485188, -1.53111832],
        [-1.85772   , -1.05173408, -0.92245159],
        [-0.61578232, -0.41096098, -0.83505421],
        [-0.58237963, -0.5341208 , -1.92148989],
        [-1.16763046, -0.30087374, -1.00310454],
        [-1.09472774, -1.42432298, -1.81899296],
        [-1.08570785, -1.60949437, -0.30148313],
        [-0.70716094, -1.64700403, -1.86202427],
        [-1.96238603, -0.34880577, -0.49629219]]])
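Each innermost row of the table holds one Q-value per action for a single discretized state. As a rough illustration of how a greedy choice is read from such a row, the sketch below uses the first row printed above. The helper `greedy_action` and the `ACTION_NAMES` labels are hypothetical. The helper mirrors what `np.argmax` does on one row, written in plain Python so it runs on its own.

```python
# Q-values for one discretized state, copied from the first row printed above
q_values = [-1.72337399, -0.50899528, -1.72069667]

# Illustrative labels for the three MountainCar actions
ACTION_NAMES = {0: "push left", 1: "no push", 2: "push right"}

def greedy_action(values):
    """Return the index of the largest Q-value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

best = greedy_action(q_values)
print(best, ACTION_NAMES[best])  # 1 no push
```

Because the table starts with random values, this initial "best" action carries no meaning yet. It only becomes useful once training has updated the entries.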

In [12]:
# getting the car to the flag
env = gym.make("MountainCar-v0")
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 2500
SHOW_EVERY = 300
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE

# exploration settings
epsilon = 1  # probability of taking a random action; decays over the episodes
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)

# building a q-table
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

def get_discrete_state(state):
    # mapping a continuous observation to its bucket indices in the q-table
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))  # np.int was removed in NumPy 1.24

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    render = episode % SHOW_EVERY == 0
    if render:
        print(episode)

    while not done:
        if np.random.random() > epsilon:
            # getting action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # getting random action
            action = np.random.randint(0, env.action_space.n)
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        if render:
            env.render()
        if not done:
            # Q-learning update: blend the old value with reward + discounted best future value
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q
        elif new_state[0] >= env.goal_position:
            # the flag was reached: 0 is the best value possible, since every step costs -1
            q_table[discrete_state + (action,)] = 0
        discrete_state = new_discrete_state

    # decaying epsilon once per episode (not once per step)
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value


env.close()

0
300
600
900
1200
1500
1800
2100
2400


Cool! The car moves randomly at first and, as epsilon decays, relies more and more on the Q-table until it reaches the flag.
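How random the car's moves are depends on epsilon. The sketch below replays only the epsilon bookkeeping from the training cell, assuming epsilon is reduced once per episode. The function name `epsilon_schedule` and the `history` list are illustrative and not part of the notebook.

```python
EPISODES = 2500
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
# Starting epsilon is 1, so the per-episode decrement is 1 / (1250 - 1)
epsilon_decay_value = 1 / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

def epsilon_schedule(episodes=EPISODES):
    """Return the epsilon in effect at the start of each episode.

    Assumes one decay step per episode, applied after the episode ends.
    """
    epsilon = 1.0
    history = []
    for episode in range(episodes):
        history.append(epsilon)
        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value
    return history

history = epsilon_schedule()
# Epsilon at the first episode, a quarter of the way through, and at the last episode
print(round(history[0], 3), round(history[625], 3), round(history[-1], 3))
```

Under this assumption epsilon falls linearly from 1 to about 0.5 by episode 625. It ends a hair below zero after episode 1250, because the decay window includes both of its end points. From then on the check `np.random.random() > epsilon` is always true, so every action comes from the Q-table.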

<img src="../images/car-image-right-full.png" width="300" height="380">
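The update applied inside the training loop can also be tried on a single step in isolation. This is a minimal sketch using the same `LEARNING_RATE` and `DISCOUNT` as the training cell. The function name `q_update` and the sample Q-values are made up for illustration.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95

def q_update(current_q, reward, max_future_q,
             learning_rate=LEARNING_RATE, discount=DISCOUNT):
    """Return the new Q-value for one (state, action) pair after one step."""
    # Target: immediate reward plus the discounted best value of the next state
    target = reward + discount * max_future_q
    # Move the old estimate a fraction `learning_rate` of the way toward the target
    return (1 - learning_rate) * current_q + learning_rate * target

# Hypothetical values: old estimate -1.0, step reward -1.0, best next-state value -0.5
new_q = q_update(current_q=-1.0, reward=-1.0, max_future_q=-0.5)
print(round(new_q, 4))  # -1.0475
```

With a learning rate of 0.1, the estimate moves only a tenth of the way from -1.0 toward the target of -1.475, so a single step changes the table entry slowly.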

Learned from [Python Programming](https://pythonprogramming.net/q-learning-algorithm-reinforcement-learning-python-tutorial/?completed=/q-learning-reinforcement-learning-python-tutorial/).