# Week 2 - AI Lab

Author: Khushee Kapoor

Registration Number: 200968052

To start, we install TF-Agents, which provides the Gym suite loader, in the Jupyter environment. We also import the libraries required for creating an environment. In TF-Agents, environments can be implemented either in Python or in TensorFlow. Python environments are usually easier to implement, understand, and debug, but TensorFlow environments are more efficient and allow natural parallelization. The most common workflow is to implement an environment in Python and use one of the wrappers to automatically convert it into a TensorFlow environment.

In [None]:
# installing TF-Agents, which provides the Gym suite loader
!pip install tf-agents[reverb]

In [2]:
# importing the required libraries
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import abc
import tensorflow as tf
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

## Cartpole Game

The CartPole problem is the task of balancing a pole on a cart. The pole is attached to the cart, and the cart is free to slide along a frictionless surface. The pole is kept balanced by sliding the cart left or right.

### Objective

The objective is to keep the pole from falling and the cart from moving out of range. Therefore, the failure conditions are:
- The magnitude of the pole's angle with respect to the vertical exceeds some threshold.
- The distance of the cart from the center exceeds some threshold.
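
The two failure conditions can be sketched as a single check. This is only an illustration: the function name `has_failed` is hypothetical, and the threshold values (12 degrees for the pole angle, 2.4 units for the cart position) are what we believe Gym's CartPole uses, so they should be treated as assumptions and checked against the installed version.

```python
import math

# Assumed thresholds (believed to match Gym's CartPole; verify before relying on them).
ANGLE_LIMIT_RAD = 12 * math.pi / 180  # 12 degrees expressed in radians
POSITION_LIMIT = 2.4                  # maximum distance of the cart from the center


def has_failed(cart_position, pole_angle):
    """Return True if either failure condition is met (hypothetical helper)."""
    out_of_range = abs(cart_position) > POSITION_LIMIT
    pole_fallen = abs(pole_angle) > ANGLE_LIMIT_RAD
    return out_of_range or pole_fallen


print(has_failed(0.0, 0.05))   # small angle, cart near the center
print(has_failed(2.5, 0.0))    # cart beyond the position limit
print(has_failed(0.0, -0.3))   # pole tilted beyond the angle limit
```

Taking the absolute value covers both directions, so the same check handles the pole tilting left or right and the cart drifting to either side.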

### Steps vs Episodes

- An episode is one complete run of the game (or one life in the game). If the game ends or a life is lost, the episode ends.
- A step is a discrete time index that increases monotonically within an episode. With each change in the state of the game, the step count increases until the game ends.

For every step in which the pole has not toppled and the cart has not gone out of range, we receive a reward of +1.0.
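
The relationship between steps, episodes and the +1.0 reward can be sketched without any environment at all. The helper `run_episode` and the episode lengths below are hypothetical and chosen only for illustration; a real episode ends when a failure condition is met, not after a preset number of steps.

```python
def run_episode(episode_length):
    """Simulate one episode that survives for `episode_length` steps.

    Hypothetical stand-in for a real environment: every surviving step
    earns +1.0, and the episode ends once the preset length is reached.
    """
    step = 0
    total_reward = 0.0
    done = False
    while not done:
        step += 1             # the step counter increases monotonically
        total_reward += 1.0   # +1.0 for each step survived
        done = step >= episode_length
    return step, total_reward


# Two illustrative episodes with arbitrary lengths.
for length in [5, 12]:
    steps, episode_reward = run_episode(length)
    print("steps:", steps, "reward:", episode_reward)
```

Because every step earns the same reward, the total reward of an episode in this sketch equals its length in steps.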

#### Implementing the Cartpole environment for a certain number of steps.

To run the CartPole environment for a fixed number of steps, we iterate over the steps and, at every step, choose and perform an action. Here the action simply alternates between 0 and 1. For each action, the environment returns a reward.

In [22]:
# loading the cartpole environment
env = suite_gym.load('CartPole-v0')

# wrapping the python environment into a tensorflow environment
tf_env = tf_py_environment.TFPyEnvironment(env)

time_step = tf_env.reset()
num_steps = 26
transitions = []
reward = []

# iterating over the steps
for i in range(num_steps):
  action = tf.constant([i % 2])
  # applies the action and returns the new TimeStep
  next_time_step = tf_env.step(action)
  transitions.append([time_step, action, next_time_step])
  reward.append(next_time_step.reward)
  time_step = next_time_step

np_transitions = tf.nest.map_structure(lambda x: x.numpy(), transitions)
print('\n'.join(map(str, np_transitions)))
print('Total reward:', np.sum(reward))

[TimeStep(
{'discount': array([1.], dtype=float32),
 'observation': array([[-0.01512762,  0.01521305,  0.01495875, -0.00404121]],
      dtype=float32),
 'reward': array([0.], dtype=float32),
 'step_type': array([0], dtype=int32)}), array([0], dtype=int32), TimeStep(
{'discount': array([1.], dtype=float32),
 'observation': array([[-0.01482336, -0.1801202 ,  0.01487792,  0.2933236 ]],
      dtype=float32),
 'reward': array([1.], dtype=float32),
 'step_type': array([1], dtype=int32)})]
[TimeStep(
{'discount': array([1.], dtype=float32),
 'observation': array([[-0.01482336, -0.1801202 ,  0.01487792,  0.2933236 ]],
      dtype=float32),
 'reward': array([1.], dtype=float32),
 'step_type': array([1], dtype=int32)}), array([1], dtype=int32), TimeStep(
{'discount': array([1.], dtype=float32),
 'observation': array([[-0.01842576,  0.0147865 ,  0.0207444 ,  0.00536984]],
      dtype=float32),
 'reward': array([1.], dtype=float32),
 'step_type': array([1], dtype=int32)})]
[TimeStep(
{'discount': 

As we can see, the total reward achieved at the end of 26 steps is 26.

For every step, we also see the discount, the observation, the reward and the step_type. The observation holds the four variables that represent the state of the environment: cart position, cart velocity, pole angle and pole angular velocity. The step_type marks where the time step falls within the episode (0 for the first step, 1 for an intermediate step, 2 for the last step); it is not the action. The action is the activity the agent performs. Here, the agent can make the cart go either left or right, so the actions are represented by 0 (left) and 1 (right).
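
To make the printed values easier to read, the fields can be given names. The sketch below is not part of TF-Agents: `CartPoleState`, `STEP_TYPE_NAMES` and `ACTION_NAMES` are hypothetical helpers. It assumes the observation order used by Gym's CartPole (cart position, cart velocity, pole angle, pole angular velocity) and the TF-Agents step type codes 0, 1 and 2 for first, mid and last.

```python
from collections import namedtuple

# Hypothetical container naming the four observation values (assumed Gym order).
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Assumed meaning of the integer codes seen in the output.
STEP_TYPE_NAMES = {0: "FIRST", 1: "MID", 2: "LAST"}
ACTION_NAMES = {0: "left", 1: "right"}

# First observation, step_type and action copied from the printed output.
observation = [-0.01512762, 0.01521305, 0.01495875, -0.00404121]
step_type = 0
action = 0

state = CartPoleState(*observation)
print(state)
print("step type:", STEP_TYPE_NAMES[step_type])
print("action:", ACTION_NAMES[action])
```

The printed transitions are consistent with this ordering: after action 0 (left), the second value in the observation drops from about 0.015 to about -0.18, which is what we would expect for the cart velocity after a push to the left.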

#### Implementing the Cartpole environment for a certain number of episodes.

To run the CartPole environment for a fixed number of episodes, we iterate over the episodes and, within each episode, choose and perform an action at every step until the last one. Here the action is sampled uniformly at random from 0 and 1. For each action, the environment returns a reward.

In [14]:
# loading the cartpole environment
env = suite_gym.load('CartPole-v0')

# wrapping the python environment into a tensorflow environment
tf_env = tf_py_environment.TFPyEnvironment(env)

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 10

# iterating over the episodes
for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  while not time_step.is_last():
    action = tf.random.uniform([1], 0, 2, dtype=tf.int32)
    time_step = tf_env.step(action)
    episode_steps += 1
    episode_reward += time_step.reward.numpy()
  rewards.append(episode_reward)
  steps.append(episode_steps)
  time_step = tf_env.reset()

# calculating the statistics
num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)
tot_reward = np.sum(rewards)
print('Total Episodes: ', num_episodes)
print('Total Steps: ', num_steps)
print('Total Rewards: ', tot_reward)
print('Average Length: ', avg_length)
print('Average Reward: ', avg_reward)

Total Episodes:  10
Total Steps:  261
Total Rewards:  261.0
Average Length:  26.1
Average Reward:  26.1


As we can see, over 10 episodes with 261 steps in total, we earned a total reward of 261.

### Comparing rewards earned for both approaches.

In [21]:
# step-wise rewards
reward

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 <tf.Tensor: shape=(1,), 

In [16]:
# episode-wise reward
rewards

[array([23.], dtype=float32),
 array([35.], dtype=float32),
 array([19.], dtype=float32),
 array([43.], dtype=float32),
 array([12.], dtype=float32),
 array([25.], dtype=float32),
 array([10.], dtype=float32),
 array([46.], dtype=float32),
 array([27.], dtype=float32),
 array([21.], dtype=float32)]

Comparing the rewards earned for both approaches, we can see that the step-wise list holds one reward per step, and each entry shown is 1.0, while the episode-wise list holds the sum of the step rewards over a whole episode. Because every surviving step earns +1.0, the reward of an episode equals its length in steps, which is why the episode rewards are larger than any single step reward. Additionally, the episode rewards vary significantly (from 10 to 46 here), because the random policy survives for a different number of steps in each episode, while the reward for each step is constant.
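
The comparison can be checked numerically with a small sketch. The episode rewards are copied from the output above. The step-wise list is an assumption: its printed output is truncated, so we assume all 26 entries are 1.0, which is what the total of 26 implies. The helper `summarize` is hypothetical.

```python
import statistics

# Episode-wise rewards copied from the notebook output.
episode_rewards = [23.0, 35.0, 19.0, 43.0, 12.0, 25.0, 10.0, 46.0, 27.0, 21.0]

# Assumption: all 26 step-wise rewards are 1.0 (the printed list is truncated).
step_rewards = [1.0] * 26


def summarize(values):
    """Return the count, total, mean and population standard deviation."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "spread": statistics.pstdev(values),
    }


step_summary = summarize(step_rewards)
episode_summary = summarize(episode_rewards)
print("step-wise:   ", step_summary)
print("episode-wise:", episode_summary)
```

The step-wise rewards have no spread because each one is the same constant, while the episode-wise rewards sum to 261 and spread widely around their mean of 26.1.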