# DIT821 Software Engineering for AI systems

#### Exam 2020-08-17

Reinforcement Learning - Q-Learning

Enter your first and last name and your e-mail address, as registered in Canvas.

* Name, e-mail:

In this assignment, you will implement the Q-learning algorithm on a given gridworld environment (MDP).

To run this lab, you first have to install the [gym environment](https://gym.openai.com/docs/).
To install it, run the following in a terminal:
```
    pip3 install gym
```
or
```
    pip install gym
```
numpy is assumed to be installed already; if it is not, install it the same way as gym.

Please make sure that the lib folder is in the same directory as this file.

Write the code and comments according to the requirements, and run it.

Download the file as a notebook file (.ipynb) and submit it to Canvas.

In [1]:
# Make sure that everything here is imported correctly
import numpy as np
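import sys
# Machine-specific path from the original notebook; adjust or remove it to match your own installation
sys.path.append('/Users/macmini/anaconda3/lib/python3.7/site-packages')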
import gym
import random


#### Generating the environment

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

The surface is described using a grid like the following:


```
SFFF     
FHFH   
FFFH     
HFFG      
```
where:
```
S: starting point, safe
F: frozen surface, safe
H: hole, fall to your doom
G: goal, where the frisbee is located
```

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.


In the following cell we create the environment, reset the agent to the starting position, and render the grid. The red square indicates the current position of the player.

In [2]:
env = gym.make("FrozenLake-v0")
env.reset()                    
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


We can inspect the possible actions to perform in the environment, as well as the possible states of the game:

In [3]:
print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)
print(env.action_space.n)

Action space:  Discrete(4)
Observation space:  Discrete(16)
4


The returned objects are of the type Discrete, which describes a discrete space of size n. For example, the action_space for the Frozen Lake environment is a discrete space of 4 values, which means that the possible values for this space are 0 (zero), 1, 2 and 3. The observation_space is a discrete space of 16 values, which goes from 0 to 15.
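
To connect the 16 observation values with the grid shown earlier, the sketch below maps a state index to its row, column and tile. It assumes the usual row-major numbering of the Frozen Lake grid (state = row * 4 + column); the helper name `state_to_cell` is introduced here only for illustration and is not part of gym.

```python
# The 4x4 map from the description above, one string per row
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4


def state_to_cell(state):
    """Return (row, column, tile) for a state index in 0..15.

    Assumption: states are numbered row by row, left to right.
    """
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]


# List the tile behind every observation value
for s in range(16):
    print(s, state_to_cell(s))
```

Under this assumption, state 0 is the start tile in the top-left corner and state 15 is the goal in the bottom-right corner.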

The `sample()` method returns a random value from the space. With this method, we can easily create a dummy agent that plays the game randomly.


To take a step in the environment we use the function `env.step(action)`

The environment’s `step` function returns four values:

`observation (object)`: an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.


`reward (float)`: amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.


`done (boolean)`: whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)


`info (dict)`: diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.


In [5]:
MAX_ITERATIONS = 1000
 
env = gym.make("FrozenLake-v0")
env.reset()
env.render()
for i in range(MAX_ITERATIONS):
    random_action = env.action_space.sample()
    new_state, reward, done, info = env.step(random_action)
    env.render()
    if done:
        break


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


The code above runs the game for at most `MAX_ITERATIONS` steps (1000 here), using the `sample()` method of the `action_space` object to select a random action. The `env.step()` method then takes the action as input, executes it in the environment, and returns a tuple of four values:

```
new_state: the new state of the environment
reward: the reward
done: a boolean flag indicating if the returned state is a terminal state
info: an object with additional information for debugging purposes
```

Note in the previous output the cases in which the player moves in a different direction than the one chosen by the agent. This behavior is completely normal in the Frozen Lake environment because it simulates a slippery surface. 

**The transitions from one state to another, for a given action, are probabilistic.**
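
The slippery behavior can be sketched without gym. The code below is a minimal stand-in, assuming the default Frozen Lake dynamics in which the chosen direction and its two perpendicular directions are each taken with probability 1/3, and assuming row-major state numbering. The names `move`, `possible_next_states` and `slippery_step` are hypothetical helpers, not gym functions.

```python
import random

N_ROWS, N_COLS = 4, 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row change, column change) for each action
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, action):
    """Deterministic move on the grid; walking into a wall keeps the position."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return row * N_COLS + col


def possible_next_states(state, action):
    """Next states for the intended action and its two perpendicular actions."""
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [move(state, a) for a in candidates]


def slippery_step(state, action):
    """Sample one next state, each candidate direction being equally likely."""
    return random.choice(possible_next_states(state, action))


# Choosing Left in the start state can leave the agent in place or move it down
print(possible_next_states(0, LEFT))
print(slippery_step(0, LEFT))
```

With these assumptions, choosing Left in the start state either keeps the agent at state 0 or slides it down to state 4, which is the kind of mismatch between chosen and actual direction visible in the output above.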



### Q-Learning 
Based on what we have seen in class and what you have done in the labs, complete the following code where necessary.

You have to implement a Q-learning agent.


In order to get the maximum Q-value with respect to all the actions allowed in `state`, use:
`np.max(Q[state,:])`
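
As a small worked example of how this maximum enters the Q-learning update, the sketch below applies two updates to an all-zero table. It uses the learning rate and discount factor from the cell below; the function name `q_update`, the table `Q_demo` and the two hand-picked transitions are assumptions made for illustration only.

```python
import numpy as np

alpha = 0.81  # learning rate
gamma = 0.96  # discount factor


def q_update(Q, state, action, reward, next_state):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * np.max(Q[next_state, :])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return Q[state, action]


Q_demo = np.zeros((16, 4))

# Hypothetical transition: from state 14, action 2 (right) reaches the goal with reward 1
first = q_update(Q_demo, state=14, action=2, reward=1.0, next_state=15)

# Hypothetical transition: from state 13, action 2 (right) reaches state 14 with reward 0
second = q_update(Q_demo, state=13, action=2, reward=0.0, next_state=14)

print(first, second)
```

The first update gives 0.81 * 1 = 0.81. The second receives no reward, but the value learned for state 14 propagates back: 0.81 * (0.96 * 0.81) is roughly 0.63.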

In [7]:
# Set learning parameter

# Maximum number of episodes
number_episodes = 1

# Maximum number of steps in each episode
number_iterations = 100

# Epsilon, for epsilon-greedy
epsilon = 0.9

# Learning Rate
alpha = 0.81

# Discount Factor
gamma = 0.96


# Initialization of the Q table
Q = np.zeros((env.observation_space.n, env.action_space.n))



# Evaluation parameters
# Steps for each episode
# That is: List where each element is the number of steps took by the agent in order to reach a terminal state
steps_total = []

# Reward got at each episode
rewards_total = []



# Utility function that returns True with probability epsilon
def flipCoin():
    return np.random.uniform(0, 1) < epsilon



# Get optimal action according the Q-table
def get_optimal_action(state):
    # Optimal action from the Q-values
    action = np.argmax(Q[state,:])
    return action

    
# Choose an action according to an 'epsilon-greedy' strategy:
# explore (random action) with probability epsilon, otherwise exploit the Q-table
def choose_action(state):
    if flipCoin():
        action = env.action_space.sample()
    else:
        action = get_optimal_action(state)
    return action


# Q-learning algorithm
for episode in range(number_episodes):
    
    # Reset environment and get first new observation
    state = env.reset()
    
    # Reset the steps at each episode
    step = 0
    
    # Reset the reward for each episode
    episode_reward = 0
    
    for iteration in range(number_iterations):
        
        # *** INSERT CODE HERE ****   
        env.render()

        action = choose_action(state)  

        state2, reward, done, info = env.step(action)  

#         learn(state, state2, reward, action)
        
        # Q-learning update: move Q[state, action] towards the TD target
        predict = Q[state, action]
        target = reward + gamma * np.max(Q[state2, :])
        Q[state, action] = Q[state, action] + alpha * (target - predict)
        
        state = state2
        
        step += 1
        
        # Accumulate the reward collected in this episode
        episode_reward += reward
        
        # If a terminal state was reached, i.e. an episode finished, update parameters for evaluation and break the loop
        if done:
            steps_total.append(step)
            rewards_total.append(episode_reward)
            if episode % 100 == 0:
                print('Episode: {} Reward: {} Steps Taken: {}'.format(episode, episode_reward, step))
            break
    

    
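# EVALUATION
# Note: the success values below are fractions in [0, 1], and the
# "last 100 episodes" figures divide by a fixed 100, so they are only
# meaningful once number_episodes >= 100.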
        
print("Percent of episodes finished successfully: {0}".format(sum(rewards_total)/number_episodes))
print("Percent of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/number_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))



print ("Final Q-Table Values")
print ("          left          down          right          up")
print (Q)


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
Episode: 0 Reward: 0 Steps Taken: 7
Percent of episodes finished successfully: 0.0
Percent of episodes finished successfully (last 100 episodes): 0.0
Average number of steps: 7.00
Average number of steps (last 100 episodes): 0.07
Final Q-Table Values
          left          down          right          up
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
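
The "last 100 episodes" figures above divide by a fixed 100, which is why a single 7-step episode is reported as 0.07. A minimal sketch of a windowed average that divides by the number of episodes actually recorded could look as follows; `mean_of_last` is a hypothetical helper, not part of the exam code.

```python
def mean_of_last(values, window=100):
    """Average of the last `window` entries, or of all entries if there are fewer."""
    recent = values[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent)


# One recorded episode of 7 steps, as in the run above
print(mean_of_last([7]))
```

Used with `steps_total` or `rewards_total`, this gives the same result as the original formula once at least 100 episodes have been recorded.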


Now you can run the game according to the Q-table.

In [8]:
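MAX_ITERATIONS = 10

env = gym.make("FrozenLake-v0")
# Reset once, before the loop, and keep track of the current state
state = env.reset()
env.render()
for i in range(MAX_ITERATIONS):
    # Greedy action for the current state according to the Q-table
    optimal_action = get_optimal_action(state)
    print(optimal_action)
    new_state, reward, done, info = env.step(optimal_action)
    env.render()
    # Continue from the state the environment actually moved to
    state = new_state
    if done:
        break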


[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
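
Besides stepping through the environment, the greedy policy stored in a Q-table can be read off directly. The sketch below prints one arrow per state, assuming the action order left, down, right, up used in the Q-table header above and row-major state numbering; `greedy_policy_grid` and `ACTION_SYMBOLS` are illustrative names.

```python
import numpy as np

# Assumed action order: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_SYMBOLS = ["<", "v", ">", "^"]


def greedy_policy_grid(Q, n_rows=4, n_cols=4):
    """Return the greedy action of every state as a list of row strings."""
    best_actions = np.argmax(Q, axis=1)
    symbols = [ACTION_SYMBOLS[a] for a in best_actions]
    return ["".join(symbols[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# An untrained, all-zero table: argmax breaks ties by taking the first index
for row in greedy_policy_grid(np.zeros((16, 4))):
    print(row)
```

For an all-zero table every entry is a tie, so `np.argmax` returns action 0 in every state. This matches the run above, where the agent chooses Left at every step.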


## Submit the solution

When you have completed the exercise, download this file (from the File menu) as a Jupyter Notebook file (.ipynb) and upload it to Canvas.

By writing down my name I declare that I have done the assignment myself:

* First Name  Last Name:
