<img src="Report_Resources/1200px-Universität_Zürich_logo.svg.png" style="width:400px;height:180px;">

# RL Project - Learny McLearnface

* Albert Anguera Sempere (albert.anguerasempere@uzh.ch)
* Dominik Bucher (dominik.bucher@uzh.ch)
* Michael Ziörjen (michael.zioerjen@uzh.ch)


## Deep Q Network Approach

The following is the preliminary code of our DQN approach:

### Preparations

This section contains functions for compressing the state space, discretizing the action space, and plotting.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import csv

def compress_statespace(raw_statespace):
    '''
    :param raw_statespace:          The statespace as generated by gym in RGB.
                                    The statespace is an np.array with dimensions 96x96x3,
                                    which is the RGB of every pixel in the output
    :return: compressed_statespace: a compressed statespace in grayscale with
                                    dimension 7056(x1)
    this function
    - cuts away unused pixels
    - converts the state_space to grayscale
    - and normalizes the values to 0 to 1
    Function by https://github.com/elsheikh21/car-racing-ppo
    '''

    statespace_84x84 = raw_statespace[:-12, 6:-6]
    # this cuts away the lowest 12 pixels as well as 6 on the left and right. returns 84x84
    compressed_statespace_84x84 = np.dot(statespace_84x84[...,0:3], [0.299, 0.587, 0.114])
    # scalar multiplication (dot-product) of every pixel with these values. these values are given by
    # international standards https://www.itu.int/rec/R-REC-BT.601-7-201103-I/en
    compressed_statespace_84x84_normalized = compressed_statespace_84x84/255.0
    # normalize the gray values to values between 0 and 1 (don't know if necessary)
    #return compressed_statespace_84x84_normalized
    compressed_statespace = compressed_statespace_84x84_normalized.flatten()
    # flatten the matrix to a one-dimensional vector for the NN to read
    # don't know why elsheik is doing frame*2-1 tbh.... maybe to amplify 'color' intensity?
    return compressed_statespace

def compress_statespace_light(raw_statespace):
    '''

    :param raw_statespace:          The statespace as generated by gym in RGB.
                                    The statespace is an np.array with dimensions 96x96x3,
                                    which is the RGB of every pixel in the output
    :return: compressed_statespace: a compressed statespace in grayscale with
                                    dimension 84x84(x1)

    this function
    - cuts away unused pixels
    - converts the state_space to grayscale
    - and normalizes the values to 0 to 1
    - NOTE: LIGHT because it does not flatten -> used for Conv2D
    Function by https://github.com/elsheikh21/car-racing-ppo
    '''

    statespace_84x84 = raw_statespace[:-12, 6:-6]
    # this cuts away the lowest 12 pixels as well as 6 on the left and right. returns 84x84
    compressed_statespace_84x84 = np.dot(statespace_84x84[...,0:3], [0.299, 0.587, 0.114])
    # scalar multiplication (dot-product) of every pixel with these values. these values are given by international
    # standards https://www.itu.int/rec/R-REC-BT.601-7-201103-I/en
    compressed_statespace_84x84_normalized = compressed_statespace_84x84/255.0
    # normalize the gray values to values between 0 and 1 (don't know if necessary)
    return compressed_statespace_84x84_normalized

def transform_action(action):
    '''
    :param action:                      a discretized action_space as a single integer
                                        0 = nothing
                                        1 = hard left
                                        2 = hard right
                                        3 = full accelerating
                                        4 = (mild?) braking
    :return: quasi_continuous_action:   The action_space as generated by gym [n, n, n]
                                        for [steering, accelerating, braking]
                                        can be -1 to 1 for steering and 0 to 1 for accelerating and braking
                                        these are continuous values

    This function is used to transform the actions generated by the NN to a format that the environment can use
    Function by https://github.com/NotAnyMike/gym/blob/master/gym/envs/box2d/car_racing.py
    '''
    if action == 0: quasi_continuous_action = [0, 0, 0.0]  # Nothing
    elif action == 1: quasi_continuous_action = [-1, 0, 0.0]  # Left
    elif action == 2: quasi_continuous_action = [+1, 0, 0.0]  # Right
    elif action == 3: quasi_continuous_action = [0, +1, 0.0]  # Accelerate
    elif action == 4: quasi_continuous_action = [0, 0, +1]  # Brake
    else: raise ValueError("action faulty for action transform: {}".format(action))  # unknown action index

    return quasi_continuous_action

def plot_learning_curve(x, scores, epsilons, filename, reload=None):
    '''
    Args:
        x: counter of episodes
        scores: score per episode
        epsilons: epsilon per episode
        filename: where to store plot
        reload: used to continue plotting a resumed training
    Returns: a plot
    Function by 'ML with Phil' https://github.com/philtabor/Deep-Q-Learning-Paper-To-Code/blob/master/utils.py
    '''

    if reload is not None:
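        # csv.reader expects an open file (or another iterable of lines), not a path string
        with open("plot_data.csv", newline="") as f:
            plot_data = list(csv.reader(f, delimiter=";"))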
        # TODO finish function to plot graphs when resuming training

    fig=plt.figure()
    ax=fig.add_subplot(111, label="1")
    ax2=fig.add_subplot(111, label="2", frame_on=False)

    ax.plot(x, epsilons, color="C0")
    ax.set_xlabel("Episode", color="C0")
    ax.set_ylabel("Epsilon", color="C0")
    ax.tick_params(axis='x', colors="C0")
    ax.tick_params(axis='y', colors="C0")

    N = len(scores)
    running_avg = np.empty(N)
    for t in range(N):
        running_avg[t] = np.mean(scores[max(0, t-20):(t+1)])

    ax2.scatter(x, scores, color="C1")
    ax2.plot(x, running_avg, color="C1")
    ax2.axes.get_xaxis().set_visible(False)
    ax2.yaxis.tick_right()
    ax2.set_ylabel('Score and MA20', color="C1")
    ax2.yaxis.set_label_position('right')
    ax2.tick_params(axis='y', colors="C1")

    plt.savefig(filename)
    plt.close('all')
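
As a quick sanity check of the preprocessing above, the following sketch repeats the crop, grayscale and normalisation steps on a random stand-in frame and prints the resulting shapes. The helper name `preprocess_frame` and the random frame are assumptions made for illustration; they are not part of the project code.

```python
import numpy as np

# BT.601 luma weights, the same values used in compress_statespace
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def preprocess_frame(frame, flatten=True):
    """Hypothetical helper mirroring compress_statespace / compress_statespace_light."""
    cropped = frame[:-12, 6:-6]              # drop the bottom 12 rows and 6 columns per side
    gray = cropped[..., :3] @ LUMA_WEIGHTS   # weighted sum over the RGB channels
    gray = gray / 255.0                      # scale the gray values to [0, 1]
    return gray.flatten() if flatten else gray

# Random stand-in for one 96x96 RGB observation (assumption: uint8 pixel values)
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

flat = preprocess_frame(frame)                   # input for the Dense model
image = preprocess_frame(frame, flatten=False)   # input for a Conv2D model
print(flat.shape, image.shape)                   # (7056,) (84, 84)
```

The flattened length of 7056 is simply 84 * 84, which is why the Dense model in the next section uses `input_shape = (7056,)`.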


### Model and Agent

This section contains the code for the model as well as the agent.
I also tried an implementation with Conv2D layers, which differs only minimally.


In [2]:
import time
import gym
import numpy as np
import random
from collections import deque
from keras.models import Sequential, load_model
from keras.layers import Input, Dense
from keras.optimizers import Adam

class DeepQNetwork:
    '''
    LECTURE 4   page 4: Neural Networks in general and why they are good for RL
                page 26: how many layers
                page 34: issues with NN
                page 35: features should be normalized (done in function compress() )
                page 39: learning rate
                page 53: overfitting (obviously not a problem rn)
                page 59: early stop (not yet)

                https://keras.io/api/layers/convolution_layers/convolution2d/
                https://towardsdatascience.com/reinforcement-learning-w-keras-openai-dqns-1eed3a5338c
                https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
    '''

    def __init__(self, gamma, epsilon, lr, epsilon_min, epsilon_decay, tau, batch_size, mem_size,
                 reload=False, reload_path=None):

        self.lr = lr
        self.reload = reload
        self.reload_path = reload_path
        if self.reload:
            self.model = self.load_model(self.reload_path)
            self.target_model = self.load_model(self.reload_path)
        else:
            self.model = self.create_model()
            self.target_model = self.create_model()

        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.tau = tau

        self.mem_cntr = 0
        self.mem_size = mem_size
        self.batch_size = batch_size
        self.memory = deque(maxlen=self.mem_size)

    def load_model(self, path):

        model = load_model(path)
        return model

    def create_model(self):

        model = Sequential()
        input_shape = (7056,)
        model.add(Input(shape=input_shape))
        model.add(Dense(units=100, activation="relu"))
        model.add(Dense(units=5))
        model.compile(loss="mse", optimizer=Adam(learning_rate=self.lr))  # `lr` is the deprecated name of this argument
        return model

    def act(self, state):
        '''
        LECTURE: 2 page 41: Explore or exploit
        '''

        # function to actually perform actions. input state, returns action
        # does NOT compress, as the function is only fed with compressed states
        self.epsilon *= self.epsilon_decay                  # let epsilon decay
        self.epsilon = max(self.epsilon_min, self.epsilon)  # make sure it's not lower than minimum

        if np.random.random() < self.epsilon:               # randomly decide if exploit or explore
            #print("DEBUG explore")
            random_action = random.randint(0,4)
            #print(random_action)
            return random_action                               # explore

        else:
            #print("DEBUG exploit")
            # exploit: the model predicts the resulting Q value for each of the five actions.
            # we choose the action with the highest Q value (argmax)
            # should return an integer, e.g. 1 for "left"; it will be transformed in main()
            state = np.reshape(state,(1,7056))    # solves channel and batch issue i guess TODO verify
            # no longer used after change to torch
            pred = self.model.predict(state)
            #print(pred)
            action = np.argmax(pred)
            #print(action)
            return action

    def remember(self, state, num_action, reward, new_state, done):
        # remember previous state, action, reward
        self.memory.append([state, num_action, reward, new_state, done])

    def replay(self):
        '''
        LECTURE:    2 page 39 batch vs online
                    2 page 40 MeanSquaredError for Q approximation
                    2 page 42 Cross-entropy error?
                    2 page 44 experience replay
        '''

        if len(self.memory) < self.batch_size:
            #print("not yet")
            return

        samples = random.sample(self.memory, self.batch_size)

        for sample in samples:
            state, action_num, reward, new_state, done = sample  # get a random state from the samples
            state = np.reshape(state, (1, 7056))  # solves channel and batch issue i guess TODO verify
            new_state = np.reshape(new_state, (1, 7056))  # solves channel and batch issue i guess TODO verify
            Q_pred = self.model.predict(state)  # current Q estimates of the base model for the sampled state
            Q_true = Q_pred.copy()  # training target; only the entry of the taken action is changed
            if done:
                Q_true[0][action_num] = reward  # terminal transition: the target is just the final reward
            else:
                Q_future = max(self.target_model.predict(new_state)[0])  # what is the future value of that state after the action that was taken (given one takes the best action next)
                Q_true[0][action_num] = reward + Q_future * self.gamma  # TD target: immediate reward plus discounted future Q-value for the action that was taken
                #print("Debug replay: Q_true:", Q_true)

            self.model.fit(state, Q_true, epochs=1, verbose=0)

    def target_train(self):
        # reorient goals, i.e. blend the weights of the main model into the target model (soft update with factor tau)

        weights = self.model.get_weights()
        target_weights = self.target_model.get_weights()
        for i in range(len(target_weights)):
            target_weights[i] = weights[i] * self.tau + target_weights[i] * (1 - self.tau)
        self.target_model.set_weights(target_weights)


    def save_model(self, name):

        self.target_model.save(name)

    def print_summary(self):

        print(self.model.summary())
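
For reference, the standard one-step DQN target for a transition (s, a, r, s') is r + gamma * max_a' Q_target(s', a'), or just r if the transition ends the episode. The sketch below shows one way such targets could be computed for a whole batch at once. It assumes the Q-values are already available as plain NumPy arrays; the function name `td_targets` and the toy numbers are illustrative and not part of the project code.

```python
import numpy as np

def td_targets(q_pred, q_next, actions, rewards, dones, gamma=0.99):
    """Return a copy of q_pred whose taken-action entries hold the TD targets.

    q_pred : (batch, n_actions) online-network predictions for the states
    q_next : (batch, n_actions) target-network predictions for the next states
    """
    targets = q_pred.copy()                          # leave the predictions untouched
    best_next = q_next.max(axis=1)                   # max over actions in the next state
    bootstrap = gamma * best_next * (1.0 - dones)    # no bootstrapping after terminal steps
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets

# Toy batch of two transitions with five actions each (made-up numbers)
q_pred = np.zeros((2, 5))
q_next = np.array([[1.0, 2.0, 0.5, 0.0, 0.0],
                   [3.0, 1.0, 0.0, 0.0, 0.0]])
actions = np.array([3, 1])
rewards = np.array([1.0, -0.1])
dones = np.array([0.0, 1.0])                         # the second transition is terminal

targets = td_targets(q_pred, q_next, actions, rewards, dones)
print(targets)
```

Because the entries of the actions that were not taken keep their predicted values, an MSE loss on `targets` only penalises the action that was actually taken. A single `fit` call on the whole batch might also be cheaper than one `predict`/`fit` pair per sample, although this was not measured here.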




### Main

The function that actually runs the code.

In [None]:

def main():
    env = gym.make("CarRacing-v0")
    env = gym.wrappers.Monitor(env, "DQN/Models/{}/recordings".format(TRIAL_ID), force=True, video_callable=lambda episode_id:True)

    agent = DeepQNetwork(tau=0.25,
                         lr=0.01,                   # 0.01 by Aldape and Sowell
                         gamma=0.99,
                         epsilon=1,
                         epsilon_decay=0.9995,
                         epsilon_min=0.1,          # 0.1 by Aldape and Sowell
                         batch_size=32,
                         mem_size= 10000,
                         reload=False,
                         reload_path="DQN/Models/20201129/DQNmodel")

    trials = 22         # aka episodes (Aldape and Sowell >>1000)
    trial_len = 1200     # how long one episode is. must be greater, but not needed yet

    step = []
    score_hist = []
    eps_hist = []
    trial_array = []

    agent.print_summary()

    for trial in range(trials):

        start_trial = time.time()

        cur_state = compress_statespace(env.reset())         # COMPRESS current state
        score = 0
        tiles = 0

        for step in range(trial_len):

            if step % 100 == 0: #print every n-th step
                print("\tTrial:", trial, "of", trials-1, "| Step:",step, "of",trial_len-100)

            num_action = agent.act(cur_state)               # act given current state, either explore or exploit

            #print("\tact: ", num_action)
            #print("DEBUG main: action by dqn:", num_action)
            action = transform_action(num_action)               # TRANSFORM ACTION
            #print("DEBUG main: action to step:", action)
            new_state, reward, done, _ = env.step(action)   # actual result of act chosen by dqn_agent.act()
            new_state = compress_statespace(new_state)      # COMPRESS new state
            score += reward
            if reward >= 0:
                tiles += 1
            #print("\tremember")
            agent.remember(cur_state, num_action, reward, new_state, done)
            #print("\treplay")
            # internally iterates default (prediction) model
            agent.replay()
            #print("\ttrain")
            agent.target_train()
            cur_state = new_state

            if done:
                env.stats_recorder.save_complete()
                break

        score_hist.append(score)
        trial_array.append(trial)
        eps_hist.append(agent.epsilon)
        plot_learning_curve(x=trial_array, scores=score_hist, epsilons=eps_hist, filename="DQN/Models/{}/{}".format(TRIAL_ID, "Performance"))
        plot_data = np.array([trial_array, score_hist, eps_hist])
        np.savetxt("DQN/Models/{}/plot_data.csv".format(TRIAL_ID), plot_data, delimiter=";")

        end_trial = time.time()
        time_trial = round((end_trial - start_trial)/60,1)

        if score < 900:                                                 # after 'for loop' finishes or done, check if score is <900 then print fail         # TODO score >900
            print("Finished trial {} in {} Minutes, but only reached {} points ({} tiles)".format(trial, time_trial, round(score,0), tiles))
            env.stats_recorder.save_complete()
            env.stats_recorder.done = True

        else:                                                           # after 'for loop' finishes or done, check if score >=900 then print success
            print("COMPLETED!!! reached {} points at step {} in trial {} after {} Minutes".format(score, step, trial, time_trial))
            agent.save_model("DQN/Models/{}/DQNmodel_SUCCESSFUL".format(TRIAL_ID))
            env.stats_recorder.save_complete()
            env.stats_recorder.done = True
            break

        agent.save_model("DQN/Models/{}/DQNmodel".format(TRIAL_ID))



start = time.time()
TRIAL_ID = "20201201"

if __name__ == "__main__":
    main()

end = time.time()
print("Elapsed time:", round((end-start)/60,1)," Minutes")




Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 100)               705700    
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505       
Total params: 706,205
Trainable params: 706,205
Non-trainable params: 0
_________________________________________________________________
None
Track generation: 1117..1404 -> 287-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1056..1324 -> 268-tiles track
	Trial: 0 of 21 | Step: 0 of 1100
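
To get a feel for two of the hyperparameters passed to `DeepQNetwork` in `main`, the sketch below works out in plain Python how many decay steps it takes for epsilon to fall from 1 to its floor of 0.1 with a decay factor of 0.9995, and how quickly a soft update with `tau=0.25` pulls a target weight towards a fixed online weight. The function names are hypothetical, and the scalar weights are a simplification of the per-layer weight arrays.

```python
def steps_until_floor(epsilon=1.0, decay=0.9995, floor=0.1):
    """Count multiplicative decay steps until epsilon is at or below the floor."""
    steps = 0
    while epsilon > floor:
        epsilon *= decay
        steps += 1
    return steps

def soft_update_gap(tau=0.25, updates=10):
    """Gap between online and target weight that remains after `updates` soft updates."""
    online, target = 1.0, 0.0          # scalar stand-ins for network weights
    for _ in range(updates):
        target = tau * online + (1.0 - tau) * target
    return online - target

n_steps = steps_until_floor()
print("steps until epsilon reaches the floor:", n_steps)
print("remaining gap after 10 soft updates:", round(soft_update_gap(), 4))
```

Epsilon reaches its floor after roughly 4,600 steps. Since `act` decays epsilon on every step rather than once per episode, and assuming episodes last on the order of a thousand steps, this happens within the first handful of episodes. Likewise, with `tau=0.25` applied after every step, the target network trails the main model by only a few steps.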


### Results

Some results from a training session of the fully connected model. This training took approximately 24 hours on a basic machine.

![Performance](Report_Resources/DQN_Full_Performance.png "Performance")

In [None]:
from IPython.display import Video
Video("Report_Resources/openaigym.video.0.1031261.video000000.mp4")

In [None]:
Video("Report_Resources/openaigym.video.0.1031261.video000018.mp4")

There is clearly some improvement from the first episode to the 18th.
However, Aldape and Sowell (2018) trained for more than 1000 episodes to see improvements.
In my opinion, our computational power is clearly insufficient to reach good results.

### References

* Aldape, Pablo, and Samuell Sowell, 2018, Reinforcement Learning for a Simple Racing Game, https://web.stanford.edu/class/aa228/reports/2018/final150.pdf.