#  TASK 3
## Reinforcement Learning Notebook
### Bhavesh Kumar [16203173], Jeet Banerjeee [17200844]

### Import packages etc

In [5]:
from __future__ import division
import argparse

from PIL import Image
import numpy as np
import gym
import random
import io
import sys
import csv

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, Convolution2D, Permute, Conv2D
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

# Note: this rebinds the name Image and shadows PIL.Image imported above
from IPython.display import Image
from IPython.core.display import HTML, display

Using TensorFlow backend.


### Load LunarLander-v2 environment from gym and define the network architecture

Define the model architecture

In [2]:
# define name of agent environment to use
env_name = 'LunarLander-v2'
# Load the environment from OpenAI Gym that contains the state-vector representation of the Lunar Lander game
env = gym.make(env_name)

sd = 555
np.random.seed(sd)
random.seed(sd)
env.seed(sd)

# get the number of actions available in the environment
nb_actions = env.action_space.n

# neural network model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape)) #Input unit layer
model.add(Dense(512)) # Layer 1 with 512 units
model.add(Activation('relu'))
model.add(Dense(256))  # Layer 2 with 256 units
model.add(Activation('relu'))
model.add(Dense(128))  # Layer 3 with 128 units
model.add(Activation('relu'))
model.add(Dense(nb_actions)) #Output unit layer
model.add(Activation('linear'))
print(model.summary())

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               4608      
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               131328    
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               32896     
__________________________________
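
The parameter counts in the summary can be checked by hand: a fully connected layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below recomputes them. The helper `dense_params` is a hypothetical name, and the output width of 4 is an assumption based on the four discrete actions of LunarLander-v2 (the summary above is cut off before the output layer).

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# 8 is the flattened observation size shown in the summary; 512, 256 and 128
# are the hidden layers; 4 (assumed) is the number of actions.
layer_sizes = [8, 512, 256, 128, 4]

param_counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(param_counts)

print(param_counts)
print(total_params)
```

The first three values match the `Param #` column for `dense_1`, `dense_2` and `dense_3`.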

### Define policy and logger callbacks and compile the model

In [12]:
# Use an epsilon-greedy policy, Adam as the optimizer and mean absolute error (MAE) as the reported metric
# The replay memory holds up to 300,000 transitions
memory = SequentialMemory(limit=300000, window_length=1)
# Define the policy
policy = EpsGreedyQPolicy()

# Define the DQN Agent and load the model in it
# Set target_model_update to 1e-2 (values below 1 make keras-rl soft-update the target network) since the learning rate is kept small
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory, 
               nb_steps_warmup=50000, target_model_update=1e-2)


# compile the model
# Learning rate is kept small: 1e-5
dqn.compile(Adam(lr=1e-5), metrics=['mae'])

# define file names for weights and log during training
weights_filename = 'dqn_' + env_name + '_weights.h5f'
checkpoint_weights_filename = 'dqn_' + env_name + '_weights_{step}.h5f'
log_filename = 'dqn_{}_log.json'.format(env_name)
callbacks = [ModelIntervalCheckpoint(checkpoint_weights_filename, interval=250000)]
callbacks += [FileLogger(log_filename, interval=100)]
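
To make the policy choice concrete: an epsilon-greedy policy picks a random action with probability `eps` and the highest-valued action otherwise. The following is a minimal NumPy sketch of that rule, not the keras-rl implementation. The function name `eps_greedy_action` and the example Q-values are hypothetical, and `eps=0.1` reflects what we understand to be the default of `EpsGreedyQPolicy()`.

```python
import numpy as np


def eps_greedy_action(q_values, eps, rng):
    """Return a random action with probability eps, else the greedy action."""
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < eps:
        # Explore: uniform choice over all actions
        return int(rng.integers(len(q_values)))
    # Exploit: action with the largest estimated Q-value
    return int(np.argmax(q_values))


rng = np.random.default_rng(555)
example_q = [0.1, 0.9, 0.3, 0.2]  # made-up Q-values for 4 actions

actions = [eps_greedy_action(example_q, eps=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = actions.count(1) / len(actions)
print(greedy_fraction)
```

With `eps=0.1` and four actions, the greedy action should be chosen in roughly 92.5% of the steps (90% exploitation plus a quarter of the random picks).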

### Train the model and save weights in .h5f file

In [2]:
# start training the model for ~1.5 million steps, and add callbacks to generate logs
dqn.fit(env, nb_steps=1550000, log_interval=10000, visualize=False, callbacks=callbacks)

# After training is done, we save the final weights.
dqn.save_weights(weights_filename, overwrite=True)

# We trained the model on a Google Cloud instance and directly use the saved trained weights here.
# Hence the training logs are not shown.
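
During training, a `target_model_update` value below 1 (here `1e-2`) is treated by keras-rl as a soft-update coefficient: after each training step the target network's weights move a small fraction of the way towards the online network's weights. The sketch below illustrates that rule on plain NumPy arrays. `soft_update` is a hypothetical helper for illustration, not a keras-rl function, and the weights are toy values.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau):
    """Blend each target array towards the matching online array by a factor tau."""
    return [tau * online + (1.0 - tau) * target
            for target, online in zip(target_weights, online_weights)]


# Toy weights: the target network starts at 0, the online network sits at 1
target = [np.zeros(3)]
online = [np.ones(3)]

for _ in range(100):
    target = soft_update(target, online, tau=1e-2)

print(target[0])
```

After `n` updates towards fixed online weights, the remaining gap shrinks by a factor of `(1 - tau) ** n`, so a small `tau` keeps the learning targets slow-moving.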

### Training steps versus Reward plot

After running different reinforcement learning model architectures on a Google Cloud instance, we observe the graphs below.

Hidden layers = (512,256,128), learning rate=1e-5 | Hidden layers = (128,64,64), learning rate=5e-4
:---:|:---:
![Figure 1](https://image.ibb.co/jaWcmx/512_training_line.png) | ![Figure 2](https://image.ibb.co/bFfvOc/128_training_line.png)
Model architecture 1|Model architecture 2


Figure 1 and Figure 2 above show how the reinforcement learning model learns over the training steps and how the learning rate affects it.

We tried a couple of model architectures.

For model 1, rewards were collected during the training steps of our final model, which has 3 hidden layers and a learning rate of 1e-5. Model 1 started with a very low reward (>-2500) compared with model 2 (~-600). Model 1 started receiving positive rewards after 1.1 million steps, whereas model 2 did so after just 0.85 million steps.

The learning rate is reflected directly here: the smaller the learning rate, the more steps the model takes to learn. In later steps, model 1 is more stable than model 2 in obtaining positive rewards.
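
The link between learning rate and number of steps can be illustrated on a toy problem. The sketch below runs plain gradient descent on `f(x) = x**2` with the two learning rates used above and counts the steps needed to get close to the minimum. It is only an analogy under simplified assumptions (a one-dimensional quadratic, no replay memory, no neural network); `steps_to_converge` is a hypothetical helper.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=1000000):
    """Count gradient-descent steps on f(x) = x**2 until |x| < tolerance."""
    x = start
    for step in range(1, max_steps + 1):
        x -= learning_rate * 2.0 * x  # the gradient of x**2 is 2x
        if abs(x) < tolerance:
            return step
    return max_steps  # did not converge within the step budget


slow_steps = steps_to_converge(1e-5)  # learning rate of model 1
fast_steps = steps_to_converge(5e-4)  # learning rate of model 2

print(slow_steps, fast_steps)
```

The smaller learning rate needs far more steps on this toy objective, which is consistent in direction with the later onset of positive rewards for model 1.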

<br><br>
Hence, we select the 1st model (512, 256 and 128 units in the hidden layers, learning rate 1e-5) for our further analysis and evaluation.

### :: NOTE :: A detailed comparison of three reinforcement learning model architectures is provided in the notebook named "*Analysis and Experimentation*"


### Test the model using trained weights for 200 episodes

In [15]:
# Redirect stdout to capture test results
old_stdout = sys.stdout
sys.stdout = mystdout = io.StringIO()

# use trained weights to test the agent
dqn.load_weights(weights_filename)
dqn.test(env, nb_episodes=200, visualize=False)

<keras.callbacks.History at 0x7f6488d115f8>
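
Reassigning `sys.stdout` by hand leaves the output redirected if the cell raises an error before the reset runs. One alternative, sketched here on the assumption that Python 3.4 or later is in use, is `contextlib.redirect_stdout`, which restores the stream automatically. `capture_stdout` is a hypothetical helper name.

```python
import contextlib
import io


def capture_stdout(func, *args, **kwargs):
    """Call func and return (its result, everything it printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


# Small demonstration with print instead of the agent
demo_result, demo_text = capture_stdout(print, "Episode 1: reward: 199.971, steps: 326")
print(repr(demo_text))
```

A call such as `capture_stdout(dqn.test, env, nb_episodes=200, visualize=False)` should then return the history object together with the test log text.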

### Print the result from the stdout and save results in a csv file

In [17]:
# Reset stdout
sys.stdout = old_stdout

results_text = mystdout.getvalue()

# Print results text
print("results")
print(results_text)

# Extract a rewards list from the results
total_rewards = list()
for idx, line in enumerate(results_text.split('\n')):
    if idx > 0 and len(line) > 1:
        reward = float(line.split(':')[2].split(',')[0].strip())
        total_rewards.append(reward)

# Print rewards and average reward
print("Total rewards = ", total_rewards)
print("\nAverage total reward = ", np.mean(total_rewards))
        
# Write total rewards to file, one reward per row
with open("lunarlander_rl_rewards.csv", 'w', newline='') as f:
    wr = csv.writer(f)
    for r in total_rewards:
        wr.writerow([r])

results
Testing for 200 episodes ...
Episode 1: reward: 199.971, steps: 326
Episode 2: reward: 222.437, steps: 262
Episode 3: reward: 222.270, steps: 320
Episode 4: reward: 245.405, steps: 299
Episode 5: reward: 33.904, steps: 1000
Episode 6: reward: 127.211, steps: 418
Episode 7: reward: 190.297, steps: 320
Episode 8: reward: 241.606, steps: 280
Episode 9: reward: 194.045, steps: 331
Episode 10: reward: 204.806, steps: 280
Episode 11: reward: 237.855, steps: 321
Episode 12: reward: 181.652, steps: 375
Episode 13: reward: 180.868, steps: 352
Episode 14: reward: 226.004, steps: 326
Episode 15: reward: 203.962, steps: 309
Episode 16: reward: 193.632, steps: 341
Episode 17: reward: 219.200, steps: 324
Episode 18: reward: 170.339, steps: 296
Episode 19: reward: 138.305, steps: 464
Episode 20: reward: 236.445, steps: 281
Episode 21: reward: 238.679, steps: 333
Episode 22: reward: 146.331, steps: 383
Episode 23: reward: 233.027, steps: 318
Episode 24: reward: 234.126, steps: 306
Episode 25: 
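
The parsing step above relies on the position of the colons in each line, so an incomplete line would raise an error. A slightly more defensive variant, sketched below, uses a regular expression and silently skips any line that does not match the `Episode N: reward: X, steps: Y` pattern. The names `EPISODE_RE` and `parse_episode_lines` are hypothetical, and the sample text is copied from the log above.

```python
import re

# Matches lines such as "Episode 5: reward: 33.904, steps: 1000"
EPISODE_RE = re.compile(r"^Episode (\d+): reward: (-?\d+(?:\.\d+)?), steps: (\d+)$")


def parse_episode_lines(text):
    """Return a list of (episode, reward, steps) tuples from the test log text."""
    rows = []
    for line in text.splitlines():
        match = EPISODE_RE.match(line.strip())
        if match:
            rows.append((int(match.group(1)),
                         float(match.group(2)),
                         int(match.group(3))))
    return rows


sample_log = """Testing for 200 episodes ...
Episode 1: reward: 199.971, steps: 326
Episode 2: reward: 222.437, steps: 262
Episode 25: """

parsed = parse_episode_lines(sample_log)
print(parsed)
```

The header line and the incomplete `Episode 25:` line are ignored rather than causing an exception.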

Thus, our RL model architecture achieves high rewards overall: the average reward over the test episodes is 196.53 and the maximum is 251.4.

Only one test episode resulted in a negative reward, so the trained model performs well.
Training the model took around 4-5 hours, which is considerably longer than for the ANN and CNN models.
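
Summary figures of this kind can be recomputed from the reward list. The sketch below does so for the 24 complete episodes visible in the log above only, so its numbers will differ from the 200-episode figures quoted in the text. `summarize_rewards` is a hypothetical helper, and the threshold of 200 is an assumption based on the score commonly cited as "solved" for LunarLander-v2.

```python
import numpy as np

# Rewards of the 24 complete episodes shown in the test log above
sample_rewards = [
    199.971, 222.437, 222.270, 245.405, 33.904, 127.211, 190.297, 241.606,
    194.045, 204.806, 237.855, 181.652, 180.868, 226.004, 203.962, 193.632,
    219.200, 170.339, 138.305, 236.445, 238.679, 146.331, 233.027, 234.126,
]


def summarize_rewards(rewards, threshold=200.0):
    """Return basic statistics for a list of per-episode rewards."""
    rewards = np.asarray(rewards, dtype=float)
    return {
        "mean": float(rewards.mean()),
        "max": float(rewards.max()),
        "min": float(rewards.min()),
        "n_negative": int((rewards < 0).sum()),
        "n_at_or_above_threshold": int((rewards >= threshold).sum()),
    }


summary = summarize_rewards(sample_rewards)
for key, value in summary.items():
    print(key, value)
```

Applying the same function to `total_rewards` from the cell above would reproduce the statistics for all 200 test episodes.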