# Q-learning with FrozenLake

Based on https://github.com/ioarun/openai-gym/blob/master/frozenlake/frozenlake-qlearning.py 

Environment: https://gym.openai.com/ 

Details: https://www.kaggle.com/sandovaledwin/q-learning-algorithm-for-solving-frozenlake-game/code

In [1]:
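# Note: you need to install gym. This can be difficult on Windows; search online for advice.
# pip install gym-toytext should work
# The code below uses the gym >= 0.26 API: reset() returns (state, info) and step() returns five values.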
import gym
import numpy as np
import random
import math

## Problem description

In [2]:
'''
The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

FrozenLake (registered as FrozenLake-v0 in older gym releases, FrozenLake-v1 in this notebook) is a 4x4 grid world that looks as follows:
SFFF       
FHFH       
FFFH       
HFFG       

Meaning of the letters:
S: starting point, safe
F: frozen surface, safe
H: hole, fall to your doom
G: goal, where the frisbee is located

The 16 states (position of the agent): 
State 0: upper left corner (Start)
...
State 15: Lower right corner (Goal)

The 4 actions (moves of the agent):
LEFT = 0,
DOWN = 1,
RIGHT = 2,
UP = 3.

Reward:
The episode ends when you reach the goal or fall into the water. 
You receive a reward of 1 if you reach the goal, and 0 otherwise.

Effect of actions:
        def inc(row, col, a):
            if a == LEFT:
                col = max(col-1,0)
            elif a == DOWN:
                row = min(row+1,nrow-1)
            elif a == RIGHT:
                col = min(col+1,ncol-1)
            elif a == UP:
                row = max(row-1,0)
            return (row, col)
'''

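The `inc` rule above works on (row, col) coordinates, while the Q-table is indexed by a single state number. The following is a minimal sketch of how the two relate, assuming row-major numbering (state = row * ncol + col), which is consistent with state 0 being the upper left corner and state 15 the lower right corner. The helpers `to_state` and `step_deterministic` are hypothetical names introduced for illustration; they are not part of gym, and they ignore the slippery (random) dynamics.

```python
# Hypothetical helpers illustrating the deterministic (non-slippery) dynamics.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
nrow, ncol = 4, 4

def inc(row, col, a):
    """Move one tile in direction a, staying inside the grid."""
    if a == LEFT:
        col = max(col - 1, 0)
    elif a == DOWN:
        row = min(row + 1, nrow - 1)
    elif a == RIGHT:
        col = min(col + 1, ncol - 1)
    elif a == UP:
        row = max(row - 1, 0)
    return (row, col)

def to_state(row, col):
    """Assumed row-major numbering: state = row * ncol + col."""
    return row * ncol + col

def step_deterministic(state, a):
    """Next state when the chosen direction is always followed."""
    row, col = divmod(state, ncol)
    return to_state(*inc(row, col, a))

print(step_deterministic(0, DOWN))    # 4
print(step_deterministic(0, LEFT))    # 0 (blocked by the grid edge)
print(step_deterministic(14, RIGHT))  # 15 (the goal)
```

Moving into a grid edge leaves the state unchanged, which is why some Q-values for "wasted" moves end up equal to the discounted value of staying put.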

## Define environment

In [16]:
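# is_slippery=False disables the movement uncertainty described above, so every move is deterministic.
# render_mode="ansi" makes env.render() return the grid as text.
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")

In [17]:
env.reset()
print(env.render())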


## Actions

In [18]:
#Sample actions for exploration:
env.action_space.sample()

2

## Initialization

In [19]:
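num_episodes = 15000   # number of training episodes (alternatives: 20000, 60000)
gamma = 0.95           # discount factor (alternative: 0.99)
learning_rate = 0.7    # step size of the Q-update (alternatives: 0.95, 0.85)
epsilon = 0.5          # exploration probability (alternatives: 1, 0.15, 0.1)

# Initialize the Q-table: one row per state (16), one column per action (4)
Q = np.zeros([16, 4])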
Q

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## Training the Q-table

In [21]:
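# `state` here appears to hold the raw return value of env.reset(): in gym >= 0.26 that is an (observation, info) tuple, so it must be unpacked before indexing Q
state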

(0, {'prob': 1})

In [25]:
for _ in range(num_episodes):
    state, info = env.reset()
    done = False
    while not done:
        # First we select an action (epsilon-greedy):
        if random.uniform(0, 1) < epsilon:  # Flip a biased coin
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(Q[state, :])  # Exploit learned values
        # Then we perform the action and receive feedback from the environment.
        # gym >= 0.26 returns five values; older versions returned (new_state, reward, done, info).
        new_state, reward, terminated, truncated, info = env.step(action)
        # The episode ends on reaching the goal or a hole (terminated) or on hitting the step limit (truncated)
        done = terminated or truncated
        # Finally we learn from the experience by updating the Q-value of the selected action
        prediction_error = reward + gamma * np.max(Q[new_state, :]) - Q[state, action]
        Q[state, action] += learning_rate * prediction_error
        state = new_state
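
To make the two ingredients of the loop concrete, here is a small sketch that isolates the epsilon-greedy choice and a single Q-update, without needing gym. The function names `epsilon_greedy` and `q_update` are hypothetical, and the Q-table is a plain nested list (the indexing `Q[state][action]` would also work on a NumPy array). The worked example assumes the agent is in state 14, moves RIGHT into the goal (state 15), and receives reward 1.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the first best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(Q, state, action, reward, new_state, learning_rate, gamma):
    """Apply one Q-learning update in place and return the prediction error."""
    prediction_error = reward + gamma * max(Q[new_state]) - Q[state][action]
    Q[state][action] += learning_rate * prediction_error
    return prediction_error

# Worked example: two consecutive updates for (state 14, action RIGHT).
Q = [[0.0] * 4 for _ in range(16)]
q_update(Q, 14, 2, 1.0, 15, learning_rate=0.7, gamma=0.95)
first = Q[14][2]   # 0 + 0.7 * (1 + 0.95 * 0 - 0) = 0.7
q_update(Q, 14, 2, 1.0, 15, learning_rate=0.7, gamma=0.95)
second = Q[14][2]  # 0.7 + 0.7 * (1 - 0.7), approximately 0.91
print(first, second)
```

Each update moves the estimate a fraction `learning_rate` of the way towards the target, so repeated visits bring `Q[14][2]` closer and closer to 1.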

In [26]:
Q

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
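
The numbers in this table follow a simple pattern. Because the environment was created with `is_slippery=False` and the only reward is 1 on reaching the goal, a fully converged Q-value should equal gamma raised to the number of moves after the first one, that is, gamma^(d-1), where d is the length of the shortest path to the goal that starts with the given action. The sketch below spot-checks a few entries copied from the table; the helper `expected_q` is a hypothetical name, and the step counts are read off the grid by hand.

```python
import math

gamma = 0.95

def expected_q(steps_to_goal):
    """Converged Q-value when the goal is reached after `steps_to_goal` moves."""
    return gamma ** (steps_to_goal - 1)

# (state, action): (Q-value copied from the table above, moves needed to reach the goal)
checks = {
    (14, 2): (1.0, 1),        # RIGHT from state 14 enters the goal
    (10, 1): (0.95, 2),       # DOWN from state 10, then RIGHT
    (9, 1): (0.9025, 3),      # DOWN from state 9, then RIGHT, RIGHT
    (0, 1): (0.77378094, 6),  # the shortest path from the start has 6 moves
}
for (state, action), (q_value, steps) in checks.items():
    assert math.isclose(q_value, expected_q(steps), abs_tol=1e-8)
```

Rows that are entirely zero belong to the holes and the goal: episodes end there, so no action is ever taken from those states and their entries are never updated.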

## Sanity check

In [27]:
'''
Let us sanity check some of the Q-values. 
First we recall what the environment looks like:
SFFF       
FHFH       
FFFH       
HFFG       

And what the 4 actions are:
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
'''


In [28]:
np.argmax(Q[0])
#Should be 1 or 2

1

In [29]:
np.argmax(Q[3])
#Should be 0

0

In [30]:
np.argmax(Q[10])
#Should be 1

1

In [31]:
np.argmax(Q[14])
#Should be 2

2
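
Instead of checking states one at a time, the greedy action of every state can be drawn on the map. The sketch below is one way to do that, using the Q-values printed above (copied into a plain list so the block runs on its own). The function `policy_grid` and the arrow encoding are hypothetical choices for illustration; ties are broken towards the lowest action index, as `np.argmax` does.

```python
ARROWS = "<v>^"  # index = action: LEFT, DOWN, RIGHT, UP
LAYOUT = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Q-values copied from the trained table shown earlier.
Q = [
    [0.73509189, 0.77378094, 0.77378094, 0.73509189],
    [0.73509189, 0.0, 0.81450625, 0.77378094],
    [0.77378094, 0.857375, 0.77378094, 0.81450625],
    [0.81450625, 0.0, 0.77378094, 0.77378094],
    [0.77378094, 0.81450625, 0.0, 0.73509189],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.0, 0.81450625],
    [0.0, 0.0, 0.0, 0.0],
    [0.81450625, 0.0, 0.857375, 0.77378094],
    [0.81450625, 0.9025, 0.9025, 0.0],
    [0.857375, 0.95, 0.0, 0.857375],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.95, 0.857375],
    [0.9025, 0.95, 1.0, 0.9025],
    [0.0, 0.0, 0.0, 0.0],
]

def policy_grid(Q, layout):
    """Return the map with each safe tile replaced by its greedy action arrow."""
    ncol = len(layout[0])
    lines = []
    for r, row in enumerate(layout):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":  # terminal tiles: no action is ever taken there
                cells.append(tile)
            else:
                q_row = Q[r * ncol + c]
                best = max(range(len(q_row)), key=lambda a: q_row[a])
                cells.append(ARROWS[best])
        lines.append("".join(cells))
    return lines

print("\n".join(policy_grid(Q, LAYOUT)))
# v>v<
# vHvH
# >vvH
# H>>G
```

Following the arrows from the upper left corner leads to the goal in six moves without entering a hole.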

## Using the Q-table

In [33]:
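# Is our Q good enough to guide us from start to goal without falling into the water?
state, info = env.reset()

for step in range(10):
    frame = env.render()
    if frame is not None:  # render() returns None when no frame is available, e.g. without a render_mode
        print(frame)
    # Take the action (index) with the maximum expected discounted future reward given that state
    action = np.argmax(Q[state, :])
    # gym >= 0.26 returns five values; older versions returned (state, reward, done, info)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # stop once the episode is over (goal, hole, or step limit)
        break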

