# OpenAI CarRacing with Behavioral Cloning

In this homework, you will train an agent to drive on a race track in a video-game style simulator. The agent has a neural network controller that you will train using example data of a car racing around the track. At each timestep, the neural network takes in the *state* of the car as an image and outputs which *action* to take. 

This system can be modeled as a *Markov Decision Process (MDP)* because at each discrete timestep, the agent makes a decision using only the current state, with no memory of previous states (this is called the Markov property). In the context of reinforcement learning, this training strategy is known as *behavioral cloning* because the agent learns by copying the actions of another agent.
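As a sketch of what this means for training (the notation here is an assumption, not part of the assignment): write the example data as $\{(s_i, a_i)\}_{i=1}^{N}$ and the network's output distribution over actions as $\pi_\theta(a \mid s)$, where $\theta$ denotes the network weights. When the actions are discrete classes, behavioral cloning can be framed as minimizing the average cross-entropy between the predicted distribution and the action the expert actually took:

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)
$$

Each term depends only on the current state $s_i$, which is where the Markov assumption enters.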

The simulator is the CarRacing-v0 environment from OpenAI Gym. In this environment, a *state* is a (96,96,3) color image that shows the position of the car, along with the current speed, steering position, and braking status at the bottom of the image. The *actions* available to the agent are steer (between -1 and 1), accelerate (0 to 1), and brake (0 to 1). To simplify this assignment, I have converted this into a classification problem with only seven discrete actions:

0. Do nothing
1. Left
2. Left+Brake
3. Right
4. Right+Brake
5. Accelerate!
6. Brake
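
The list above can be read as a lookup table from a class index to a `[steer, accelerate, brake]` array. A minimal sketch follows; the table mirrors the `ACTION_SPACE` list defined in the setup cell of this notebook, while `class_to_action` and `action_to_class` are hypothetical helper names introduced only for illustration.

```python
# Discrete action table: each row is [steer, accelerate, brake].
# Mirrors the ACTION_SPACE list defined in the setup cell of this notebook.
ACTION_SPACE = [
    [0, 0, 0],   # 0: do nothing
    [-1, 0, 0],  # 1: left
    [-1, 0, 1],  # 2: left + brake
    [1, 0, 0],   # 3: right
    [1, 0, 1],   # 4: right + brake
    [0, 1, 0],   # 5: accelerate
    [0, 0, 1],   # 6: brake
]


def class_to_action(class_index):
    """Hypothetical helper: map a class index (0-6) to [steer, accelerate, brake]."""
    return ACTION_SPACE[class_index]


def action_to_class(action):
    """Hypothetical helper: inverse lookup. Raises ValueError for unknown actions."""
    return ACTION_SPACE.index(list(action))


print(class_to_action(2))          # the action array for "left + brake"
print(action_to_class([0, 1, 0]))  # the class index for "accelerate"
```

The simulator consumes the array form, so the classifier's predicted index has to be converted back before it is passed to the environment.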

Provided below is a dataset of 11,132 example (state, action) pairs that you can use for training. These were sampled from simulations of a highly skilled AI agent. The first cell downloads the data and installs the dependencies needed to run the simulations and generate videos in Google Colab. You should be able to train your agent and view videos of it within Colab.

## Tasks:
1.   Create a class called `Agent` with methods `train` and `act`.
2.   Train the agent to drive. Optimize hyperparameters such as the learning rate, network architecture, etc. You can do this by hand (you don't need to do anything fancy).
3. Create a video of your agent driving.

## To turn in:
1. Your code as a Jupyter notebook.
2. A description of your agent model and its performance. Include this description after your code in the Jupyter notebook, following the [Guide to Describing ML Methods](https://laulima.hawaii.edu/access/content/group/MAN.XLSIDIN35ps.202230/Guide_to_Describing_ML_Methods.pdf). I don't expect you to do extensive hyperparameter tuning, but you **must** describe the performance of your model on a validation set using the appropriate metrics so that you know when you are overfitting.
3. Upload a video of your best agent to [this Google Drive folder](https://drive.google.com/drive/folders/1Hk4PTqfr5A3BeW2m3mgAuQmbxo_Z-8AK?usp=sharing). (Feel free to also upload any funny or interesting behavior.)
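
For the validation metrics mentioned in item 2, overall accuracy can hide poor performance on rarely seen actions, so one option is to also report a confusion matrix and per-class recall. The following is a minimal NumPy sketch, not a required method; the function names `confusion_matrix` and `per_class_recall` and the small label arrays are made up for illustration.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Return a (num_classes, num_classes) count matrix: rows = true, cols = predicted."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_recall(cm):
    """Fraction of each true class predicted correctly; NaN for classes with no examples."""
    support = cm.sum(axis=1)
    out = np.full(len(cm), np.nan)
    return np.divide(np.diag(cm), support, out=out, where=support > 0)


# Made-up validation labels and predictions for the seven action classes.
y_true = [0, 0, 1, 1, 5]
y_pred = [0, 1, 1, 1, 5]
cm = confusion_matrix(y_true, y_pred, num_classes=7)
print(cm)
print(per_class_recall(cm))
```

With a trained Keras model, `y_pred` would typically come from taking the `argmax` of the predicted class probabilities on the validation states.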


# Create, Train, and Simulate Agent

Create your agent class below. The code provided should help get you started. Then test your agent in the racing environment.






In [2]:
# NO NEED TO MODIFY THIS CELL
# Dependencies for rendering openai gym in colab and enable video recording.
# Remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym[box2d] pyvirtualdisplay pyglet > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
import gym
from gym import logger as gymlogger
gymlogger.set_level(40) #error only
from gym.wrappers import Monitor
import tensorflow as tf
import numpy as np
import random, math, glob, io, base64
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    
def wrap_env(env):
  """
  Utility function to enable video recording of a gym environment and display it.
  To enable video, just do "env = wrap_env(env)".
  """
  return Monitor(env, './video', force=True)

# Download example data for training.
import gzip, os, pickle, random
import matplotlib.pyplot as plt
!gdown --id 1AQnMFSRU3qQcHA-ruS8Ahcz-00FmYoi0 # File shared on Peter's gdrive 6MB.
with gzip.open('carracing_behavior.gzip', 'rb') as f:
    states, action_classes = pickle.load(f)

print('\nState data shape (examples, x, y, color):', states.shape)
print('Action data shape (examples, action idx):', action_classes.shape)

# Plot an example state. This is the model input.
print('\nExample state (this is the input to your neural network):')
plt.imshow(states[0, :, :, :])

# The simulator expects a length-3 array corresponding to steer,
# accelerate, and brake. But I converted the training data actions into a
# discrete set to frame the problem as classification. This is the set of 
# possible actions. The indices in training data targets (action_classes) 
# correspond to this set of actions. Your agent's act method should
# return one of these, not an integer index.
ACTION_SPACE = [[0, 0, 0],  # no action
                [-1, 0, 0],  # left
                [-1, 0, 1],  # left+brake
                [1, 0, 0],  # right
                [1, 0, 1],  # right+brake
                [0, 1, 0],  # acceleration
                [0, 0, 1], ]  # brake


In [3]:
# Resources: [1] Professor Peter Sadowski's Deep Learning Quick Start: MNIST in Keras
#                  https://github.com/peterjsadowski/keras_tutorial/blob/master/1_keras_mnist.ipynb
#            [2] Sequential Decision Making in CarRacing Game using Proximal Policy Optimization and Dataset Aggregation
#                   https://ling-k.github.io/uploads/Kong_Xu_CS5180_Project_Report_edited.pdf
#            [3] Behavioral Cloning OpenAI GitHub
#                   https://github.com/zalkikar/behaviorCloning_CarRacingv0
#            [4] TensorFlow & OpenAI Gym Tutorial: Behavioral Cloning!
#                   https://www.youtube.com/watch?v=0rsrDOXsSeM&t=1973s
#            [5] Average Machine Learning Discord Server Viewer 
#            [6] Jake, Micah (ICS 635 Spring 2022): Just asked them how they were formatting their Agent class 
#            [7] Michael Rogers (ICS 435 Fall 2021): Helped with plot logic

class Agent:

    # [5] The discord server was recommending using more convolutional and max pooling layers. 
    def __init__(self):
        """
        Initialize the agent.
        """
        # Construct model
        model = tf.keras.models.Sequential()
        model.add(tf.keras.layers.Conv2D(32, (3, 3),
                                         activation='relu', 
                                         input_shape=(96, 96, 3)))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu')) 
        model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(tf.keras.layers.Dropout(0.4))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(128, activation='relu'))
        model.add(tf.keras.layers.Dense(len(ACTION_SPACE), activation='softmax'))
        self.model = model

    # [1] Used the Quick Start framework 
    def train(self, X_train, y_train, X_test, y_test, epochs=10, batch_size=128,
                lr=0.001, verbose=True):
        """
        Train the network to predict the expert's action class for each state.
        """
        # Compile the model.
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
        # Train the model.
        y_train = tf.keras.utils.to_categorical(y_train, len(ACTION_SPACE))
        y_test = tf.keras.utils.to_categorical(y_test, len(ACTION_SPACE))
        return self.model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                validation_data=(X_test, y_test), verbose=verbose)

    def act(self, state):
        """
        Use the trained model to choose an action for a single state.
        """
        y_hat = np.argmax(self.model.predict(state.reshape(1, 96, 96, 3)))
        return ACTION_SPACE[y_hat]


In [4]:
# Resource: [5] I forgot to shuffle the data 
#         https://discord.com/channels/931331667040272444/931332854355476521/962078259233837136

# Shuffle data.
indices = np.arange(len(states))
np.random.shuffle(indices)
states = states[indices]
action_classes = action_classes[indices]

# Split data into training and validation sets.
split = 0.8
train_size = int(split * len(states))
X_train = states[:train_size]
y_train = action_classes[:train_size]
X_test = states[train_size:]
y_test = action_classes[train_size:]

# Initialize the agent.
agent = Agent()

# agent.train(X_train, y_train, X_test, y_test, epochs=5, batch_size=64, verbose=True)
history = agent.train(X_train, y_train, X_test, y_test, epochs=10, batch_size=256, lr=0.0001, verbose=True)

# Resources: [7], Helped with plot logic. 
# Plot loss trajectory throughout training.
plt.figure(1, figsize=(14,5))
plt.subplot(1,2,1)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='valid')
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='valid')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()


# Simulate Agent

In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
actions = []
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1179..1478 -> 299-tiles track
Episode finished after 1000 timesteps
Reward: 658


In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1160..1454 -> 294-tiles track
Episode finished after 1000 timesteps
Reward: 890


The key to a successful run is to avoid overfitting. Training with a high learning rate, or for too many epochs in an attempt to squeeze out a higher validation accuracy, usually results in a car that does not move. Moreover, the training data is not balanced, so a model that is even slightly overfit cannot recover, particularly on left turns. By adding regularization measures such as max pooling and dropout with a rate of 0.4, along with a lower learning rate and a small number of training epochs, we are able to train a model that better handles the randomness of the test environment. The car has some difficulty once it leaves the track, similar to early versions of Tesla Autopilot, which tried to stay in a lane but failed when the lane marking was not visible or split.
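
The class imbalance mentioned above can be checked by counting how often each action appears in the labels. The sketch below is illustrative only: the helper name `inverse_frequency_weights` is hypothetical, a small made-up label array stands in for `action_classes`, and inverse-frequency weighting is one common mitigation, not something the runs in this notebook used.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Count labels per class and derive weights proportional to 1 / frequency."""
    counts = np.bincount(np.asarray(labels).ravel(), minlength=num_classes)
    total = counts.sum()
    # weight = total / (num_classes * count); classes with no examples get 0.
    weights = {
        int(c): (float(total) / (num_classes * int(n)) if n > 0 else 0.0)
        for c, n in enumerate(counts)
    }
    return counts, weights


# Made-up labels standing in for action_classes (mostly "accelerate").
labels = [5, 5, 5, 5, 1, 3]
counts, weights = inverse_frequency_weights(labels, num_classes=7)
print("examples per class:", counts)
print("class weights:", weights)
```

A dictionary of this form can be passed to Keras `Model.fit` through its `class_weight` argument, which scales each example's contribution to the loss by the weight of its class.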

The car's movements are rigid because the training data is shuffled and it is very difficult to predict more than a few frames ahead. Each prediction yields only one action, so it is easy to make a mistake and difficult to recover from it. As a rough score, we added a running tally of the reward, summed over time. The best runs came close to reaching 900, while the worst runs received a negative score (for example, if the car did not move or spun out of control).
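
Because each run uses a randomly generated track, a single reward is a noisy measure of the agent. One simple way to summarize several runs is sketched below, using only the six rewards printed by the simulation cells in this notebook.

```python
from statistics import mean, stdev

# Rewards printed by the six simulation cells in this notebook.
rewards = [658, 890, 921, 863, 876, 887]

print(f"episodes:    {len(rewards)}")
print(f"mean reward: {mean(rewards):.1f}")
print(f"std dev:     {stdev(rewards):.1f}")
print(f"min / max:   {min(rewards)} / {max(rewards)}")
```

Reporting the spread alongside the mean makes it easier to tell whether a change to the model actually helped or whether the car simply drew an easier track.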

In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1044..1309 -> 265-tiles track
Episode finished after 787 timesteps
Reward: 921


In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1080..1354 -> 274-tiles track
Episode finished after 1000 timesteps
Reward: 863


In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1000..1254 -> 254-tiles track
Episode finished after 1000 timesteps
Reward: 876


In [None]:
# NO NEED TO MODIFY THIS CELL
# Run simulation for t timesteps.
NUM_TIMESTEPS = 2000  # Increase this to run simulation longer.
total_reward = 0  
actions = []
with wrap_env(gym.make("CarRacing-v0")) as env: # Exits env when done.
  observation = env.reset()  # Restarts car at the starting line.
  for t in range(NUM_TIMESTEPS):
    env.render() 
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(f"Reward: {round(total_reward)}")
      break
show_video()  # Video can be downloaded by clicking option in bottom right.

Track generation: 1040..1307 -> 267-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1207..1513 -> 306-tiles track
Episode finished after 1000 timesteps
Reward: 887
