# **Practical 2**

In this practical, we will examine classical conditioning and use exploration/exploitation to solve a maze.



# **Conditioning vs Supervised and Online Learning**

We start by looking at the following setting. Suppose a dog is in a room with two lights (one red and one green), and whether each light is on or off indicates whether food is available on a plate.

We start by generating the data for training and testing. Note that this is effectively an XOR gate, which we use below to obtain the targets: the target is 1 only when exactly one of the two lights is on.
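As a quick check of the XOR claim, the sketch below enumerates the four possible light combinations in plain Python. The helper name `xor_target` is hypothetical and introduced only for illustration; it is not part of the practical.

```python
# Hypothetical helper (not part of the practical): XOR target for two binary inputs.
def xor_target(a, b):
    # The target is 1 only when exactly one of the two inputs is 1.
    return int(bool(a) != bool(b))

# Enumerate the four possible input combinations and their targets.
truth_table = {(a, b): xor_target(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, target in truth_table.items():
    print(inputs, "->", target)
```

These four pairs are the same ones used as x_test and y_test in the cell that follows.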

In [1]:

# Import libraries and namespaces
import time
from random import seed
from random import randint
import numpy as np

x_train = np.zeros( (100, 2) )
y_train = np.zeros( 100 )
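# Seed the random number generator once with the system clock
# (time.clock() was removed in Python 3.8, so we use time.time())
seed(time.time())
for i in range(100):
    # Generate random binary inputs (0 or 1) for the two lights
    x_train[i,0] = randint(0,1)
    x_train[i,1] = randint(0,1)
    # The target is the XOR of the two inputs
    y_train[i] = np.logical_xor(x_train[i,0],x_train[i,1])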
    
x_test = [[1,1],[1,0],[0,1],[0,0]]
y_test = [0,1,1,0]

# Converting to float 32bit
x_train = np.array(x_train).astype(np.float32)
x_test  = np.array(x_test).astype(np.float32)
y_train = np.array(y_train).astype(np.float32)
y_test  = np.array(y_test).astype(np.float32)

# Print data split for validation
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(100, 2) (100,)
(4, 2) (4,)


We now try an online learning approach with an observation window, which acts as the memory of the system. In this case, the network trains on only a few samples at a time and draws a new observation at every training step.
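The observation window behaves like a fixed-length queue: each new sample is pushed in at the front and the oldest one falls off the end. The following is a minimal sketch of that idea, assuming Python's `collections.deque` in place of the array shift used in the cell below. The function name `observe`, the window size of 3 and the toy stream are all illustrative assumptions.

```python
from collections import deque

window = 3  # illustrative memory size; the practical uses 10

# A deque with maxlen drops the oldest entry automatically once it is full.
memory = deque(maxlen=window)

def observe(memory, sample):
    # The newest sample goes to the front, mirroring x[0,:] = x_train[i,:].
    memory.appendleft(sample)
    return list(memory)

# Toy stream of (red, green) observations, invented for this sketch.
stream = [(0, 1), (1, 1), (1, 0), (0, 0)]
for sample in stream:
    snapshot = observe(memory, sample)
    print(snapshot)
```

After the fourth observation the first sample has been forgotten, which is the behaviour the window is meant to model.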

In [2]:
# Importing libraries and namespaces
import scipy as sci
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Setup the variables
y=np.array([0]).astype(np.float32)
x=np.reshape(np.array([0,0]),(1,2)).astype(np.float32)

# Define a shift for the online learning window
def shift(A, N):
    B = np.empty_like(A)
    if N >= 0:
        B[:N] = np.nan
        B[N:] = A[:-N]
    else:
        B[N:] = np.nan
        B[:N] = A[-N:]
    return B

# Set a learning window size. This is the memory of the system
window = 10

# Model initialization (try changing hidden_layer_sizes)
Model = MLPClassifier(hidden_layer_sizes=(3,3), max_iter=1, alpha=0.01,
                     solver='sgd', verbose=1,  random_state=100) 

# Train our model
cl = np.array(np.unique([0,1]))
y = y_train[range(window)]
x = x_train[range(window),:]

for epochs in range(10000):
    # Sample randomly
    i = randint(0,99)
    # Train the model
    Model.partial_fit(x, y, cl)
    # Do the shift and acquire the next instance
    y = shift(y,1)
    x = shift(x,1)
    y[0] = y_train[i]
    x[0,:] = x_train[i,:]
       

# Show the accuracy
y_pred = Model.predict(x_test)
print('accuracy is ',accuracy_score(y_test,y_pred)) # Print accuracy score

Streaming output truncated to the last 5000 lines.
Iteration 7496, loss = 0.52985598
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7497, loss = 0.52991909
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7498, loss = 0.60985949
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7499, loss = 0.52240839
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7500, loss = 0.52213098
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7501, loss = 0.60849674
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7502, loss = 0.57551964
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Iteration 7503, loss = 0.57525557
Training loss did not improve mor

We compare our result with that obtained using supervised learning. In the code below, we use a very similar setting, with the same data and the same type of classifier (here with slightly larger hidden layers), but with a batch rather than an online sampling strategy.

In [3]:
# Importing libraries and namespaces
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# Model initialization
Model = MLPClassifier(hidden_layer_sizes=(4,4), max_iter=10000, alpha=0.01,
                     solver='sgd', verbose=1,  random_state=100) 

# Train our model
h=Model.fit(x_train,y_train)

# Use our model to predict
y_pred=Model.predict(x_test)

# Scikit-learn reporting
print('Classification report')
print(classification_report(y_test,y_pred)) # Print summary report
print('accuracy is ',accuracy_score(y_test,y_pred)) # Print accuracy score


Iteration 1, loss = 0.76276983
Iteration 2, loss = 0.76244617
Iteration 3, loss = 0.76198615
Iteration 4, loss = 0.76141866
Iteration 5, loss = 0.76078488
Iteration 6, loss = 0.76006615
Iteration 7, loss = 0.75927285
Iteration 8, loss = 0.75841442
Iteration 9, loss = 0.75749939
Iteration 10, loss = 0.75653548
Iteration 11, loss = 0.75552961
Iteration 12, loss = 0.75448803
Iteration 13, loss = 0.75341633
Iteration 14, loss = 0.75231951
Iteration 15, loss = 0.75120202
Iteration 16, loss = 0.75006783
Iteration 17, loss = 0.74892045
Iteration 18, loss = 0.74776300
Iteration 19, loss = 0.74659820
Iteration 20, loss = 0.74542849
Iteration 21, loss = 0.74425596
Iteration 22, loss = 0.74308248
Iteration 23, loss = 0.74190963
Iteration 24, loss = 0.74073881
Iteration 25, loss = 0.73957121
Iteration 26, loss = 0.73840785
Iteration 27, loss = 0.73724961
Iteration 28, loss = 0.73601996
Iteration 29, loss = 0.73433441
Iteration 30, loss = 0.73256197
Iteration 31, loss = 0.73071563
Iteration 32, los

# **Action Selection**

We now turn our attention to action selection. To this end, we use a maze as the environment and start by installing the required system dependencies.


In [4]:
# install required system dependencies
!apt-get install -y xvfb x11-utils  
!apt-get install x11-utils > /dev/null 2>&1
!pip install PyOpenGL==3.1.* \
            PyOpenGL-accelerate==3.1.* \
            gym[box2d]==0.17.* 
!pip install pyglet
!pip install ffmpeg
! pip install pyvirtualdisplay
!pip install Image
!pip install gym-maze-trustycoder83

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libxxf86dga1
Suggested packages:
  mesa-utils
The following NEW packages will be installed:
  libxxf86dga1 x11-utils xvfb
0 upgraded, 3 newly installed, 0 to remove and 29 not upgraded.
Need to get 993 kB of archives.
After this operation, 2,981 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxxf86dga1 amd64 2:1.1.4-1 [13.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 x11-utils amd64 7.7+3build1 [196 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 xvfb amd64 2:1.19.6-1ubuntu4.8 [784 kB]
Fetched 993 kB in 1s (1,277 kB/s)
Selecting previously unselected package libxxf86dga1:amd64.
(Reading database ... 160975 files and directories currently installed.)
Preparing to unpack .../libxxf86dga1_2%3a1.1.4-1_amd64.deb ...
Unpacking libxxf86dga1:amd64 (2:

If the directory vid already exists and contains videos left over from previous tries, it is better to clean it up before continuing.

In [1]:
!mkdir -p ./vid
!rm -f ./vid/*.*


We now set up the Gym video recorder so that we can visualise the maze and the agent in a video.

In [2]:
import sys
# import pygame
import numpy as np
# import math
# import base64
# import io
# import IPython
import gym
import gym_maze

# from gym.wrappers import Monitor
# from IPython import display
from pyvirtualdisplay import Display
from gym.wrappers.monitoring import video_recorder

d = Display()
d.start()

# Recording filename
video_name = "./vid/Practical_2.mp4"

# Setup the environment for the maze
env = gym.make("maze-sample-10x10-v0")

# Setup the video
vid = None
vid = video_recorder.VideoRecorder(env,video_name)

# env = gym.wrappers.Monitor(env,'./vid',force=True)
current_state = env.reset()



pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html


We now perform a simple Q-tracking algorithm. The method is quite sub-optimal, since it uses a random draw with a fixed probability to choose between exploration and exploitation. It does, however, illustrate how Q-value tracking can be done in the maze.
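The loop in the next cell combines two pieces: a random choice between exploring and exploiting, and an update of the tracked Q-value. The sketch below isolates both on a toy table in plain Python. The names `epsilon_greedy` and `q_update`, the two-state table and the reward value are assumptions made for illustration; like the cell below, the update applies no discount factor.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    # Explore: with probability epsilon pick any action uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Exploit: otherwise pick the action with the largest current Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q_table, state, action, reward, next_state, lr=0.1):
    # Blend the old estimate with the new target (reward plus best next value).
    target = reward + max(q_table[next_state])
    q_table[state][action] = (1 - lr) * q_table[state][action] + lr * target
    return q_table[state][action]

# Toy table (illustrative): 2 states, 2 actions, zeros except one known value.
q_table = [[0.0, 0.0], [0.0, 1.0]]

# With epsilon = 0 the agent never explores and always takes the best action.
greedy_action = epsilon_greedy(q_table[1], epsilon=0.0)

# One update for taking action 1 in state 0 and landing in state 1.
new_value = q_update(q_table, state=0, action=1, reward=-0.001, next_state=1)
print(greedy_action, new_value)
```

With a learning rate of 0.1, the updated entry moves one tenth of the way from its old value of 0 towards the target of 0.999.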

In [3]:
states_dic = {} #dictionary to keep the states/coordinates of the Q table
count = 0
for i in range(10):
    for j in range(10):
        states_dic[i, j] = count
        count+=1
        
n_actions = env.action_space.n

# Initialize the Q-table to 0
Q_table = np.zeros((len(states_dic),n_actions))


# Number of episodes we will run
n_episodes = 10

# Maximum number of iterations per episode
max_iter_episode = 500

# Initialize the exploration probability to 0.5
exploration_proba = 0.5

# Exploration decay rate for exponential decrease (defined but not used in the loop below)
exploration_decreasing_decay = 0.001

# Minimum exploration probability (defined but not used in the loop below)
min_exploration_proba = 0.01

# Learning rate
lr = 0.1

rewards_per_episode = list()


# Iterate over episodes
for e in range(n_episodes):
    
    # We are not done yet
    done = False
    
    # Sum the rewards that the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_iter_episode): 
        env.unwrapped.render()
        vid.capture_frame()
        current_coordinate_x = int(current_state[0])
        current_coordinate_y = int(current_state[1])
        current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]

        if np.random.uniform(0,1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q_table[current_Q_table_coordinates]))

        next_state, reward, done, _ = env.step(action)

        next_coordinate_x = int(next_state[0]) #get coordinates to be used in dictionary
        next_coordinate_y = int(next_state[1]) #get coordinates to be used in dictionary


        # Update our Q-table using the Q-learning iteration (no discount factor is applied)
        next_Q_table_coordinates = states_dic[next_coordinate_x, next_coordinate_y]
        Q_table[current_Q_table_coordinates, action] = (
            (1-lr) * Q_table[current_Q_table_coordinates, action]
            + lr * (reward + max(Q_table[next_Q_table_coordinates,:]))
        )
    
        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state

    # Show the total episode reward
    print("Total episode reward:", total_episode_reward)

    # Reset the environment for the next episode
    current_state = env.reset()

    rewards_per_episode.append(total_episode_reward)

# Close the recorder to save the video
vid.close()
vid.enabled = False
print("Video successfully saved.")

Total episode reward: -0.5000000000000003
Total episode reward: -0.5000000000000003
Total episode reward: 0.6029999999999998
Total episode reward: 0.6389999999999998
Total episode reward: -0.5000000000000003
Total episode reward: 0.7619999999999998
Total episode reward: 0.6869999999999998
Total episode reward: 0.5709999999999997
Total episode reward: 0.7599999999999998
Total episode reward: 0.6589999999999998
Video successfully saved.
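The cell above defines exploration_decreasing_decay and min_exploration_proba but keeps exploration_proba fixed at 0.5. One common way to use them, offered here as an assumption rather than as the schedule the practical intends, is to shrink the exploration probability exponentially with the episode index and clamp it at the minimum. The function name `decayed_exploration` is hypothetical.

```python
import math

def decayed_exploration(initial_proba, decay, min_proba, episode):
    # Exponential decay in the episode index, never dropping below min_proba.
    return max(min_proba, initial_proba * math.exp(-decay * episode))

# Parameter values are taken from the cell above; the schedule is an assumption.
episodes = list(range(0, 5001, 1000))
schedule = [decayed_exploration(0.5, 0.001, 0.01, e) for e in episodes]
for episode, proba in zip(episodes, schedule):
    print(episode, round(proba, 4))
```

With a decay rate of 0.001 the probability changes very little over the 10 episodes run above, so a schedule like this only matters for much longer runs or a larger decay rate.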


We can now play the video using the following code

In [4]:
import base64
import io
from IPython import display

video_name = "./vid/Practical_2.mp4"

# Read the recorded video as bytes (read-only, closing the file afterwards)
with io.open(video_name, 'rb') as f:
    video = f.read()
encoded = base64.b64encode(video)

display.display(display.HTML(data="""
  <video alt="test" controls>
  <source src="data:video/mp4;base64,{0}" type="video/mp4" />
  </video>
  """.format(encoded.decode('ascii'))))