# Homework: Reinforcement Learning

Choose an environment from https://gym.openai.com/envs and train a policy that solves that environment.
You may use one of the suggested Deep RL libraries to implement your algorithm:
1. [stable-baselines](https://github.com/hill-a/stable-baselines) (note that this only supports up to `Tensorflow 1.15`)
2. [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) (similar to the above but only supports `pytorch`)
3. [RLlib](https://docs.ray.io/en/master/rllib/)

Alternatively, you can implement your own algorithm by modifying our deep Q-learning notebook.
You may play around with the neural network architecture, as well as hyperparameters.
Check out how well your agent is doing against other RL scientists by looking at the OpenAI-gym [leaderboards](https://github.com/openai/gym/wiki/Leaderboard)!


<div class="alert alert-info">
For your submission, please highlight the following:

1. **[1 pt]** Describe the task to be accomplished in the chosen environment.
    > The chosen environment is `CartPole-v1`. The task is to keep the pole balanced on the cart for as long as possible.
2. **[1 pt]** List the state observations, actions, and corresponding rewards for your chosen environment.
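    | Num | Observation           | Min                   | Max                 |
    |-----|-----------------------|-----------------------|---------------------|
    | 0   | Cart Position         | -4.8*                 | 4.8*                |
    | 1   | Cart Velocity         | -Inf                  | Inf                 |
    | 2   | Pole Angle            | ~ -0.418 rad (-24°)** | ~ 0.418 rad (24°)** |
    | 3   | Pole Angular Velocity | -Inf                  | Inf                 |
    
    > \* The episode terminates once the cart position leaves ±2.4. \*\* The episode terminates once the pole angle leaves ±12° (about ±0.2095 rad).<br>
    
    > There are two discrete actions: 0 pushes the cart to the left and 1 pushes it to the right.<br>
    
    > The reward is +1 for every step taken, including the termination step. The reward threshold for `CartPole-v1` is 475.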
    
3. **[2 pts]** How many episodes did it take for the agent to first solve the environment? What was the final reward obtained after training the agent?
    > The agent first solved the environment at episode 180 of 500, and the reward obtained after training was 499.
4. **[2 pts]** How does your solution compare to those on the leaderboards? How can you improve your agent?
    > Our solution is a straightforward deep Q-network, so most of our effort went into tuning hyperparameters. The solutions on the leaderboard, however, used more complex methods such as particle swarm optimization and uniform crossover.<br>
    
    > One way to improve the agent is to explore other neural network architectures and to tune the hyperparameters further.<br>
5. **[2 pts]** What are the challenges you encountered in training a policy?
    > The major challenge was understanding the algorithm itself.<br>
    
    > Training is impractical for a large number of steps because it takes too long.<br>
    
    > The running reward dropped after a number of frames, and we could not save the best-performing models.
6. **[2 pts]** What are some key takeaways you got from this exercise? 
    > There are many ways to solve the problem; for example, other people used particle swarm optimization for CartPole. For the MountainCar environment, one approach is to shape the reward so that the model is incentivized with incremental rewards for moving left or right and for reaching a certain position.<br>
    
    > RL can probably solve most problems if you are creative enough. However, you cannot be creative in constructing an RL solution without mastery of RL itself.<br>
    
    > Changing the loss function can further increase the rewards gained.<br>
</div>
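
The last takeaway mentions changing the loss function without naming the replacement. A common substitute for MSE in DQN training is the Huber loss, which is quadratic for small errors and linear for large ones; whether this is the loss that was actually tried here is an assumption. A minimal NumPy sketch comparing the two (the function names, the `delta` value, and the sample numbers are illustrative only):

```python
import numpy as np


def mse_loss(target, prediction):
    """Mean squared error, the loss named in model.compile(loss='mse', ...)."""
    error = np.asarray(target, dtype=float) - np.asarray(prediction, dtype=float)
    return float(np.mean(error ** 2))


def huber_loss(target, prediction, delta=1.0):
    """Huber loss: 0.5*e^2 where |e| <= delta, delta*(|e| - 0.5*delta) otherwise."""
    error = np.asarray(target, dtype=float) - np.asarray(prediction, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))


# One large error (for example, from a -10 terminal penalty) dominates MSE
# far more than it dominates the Huber loss.
targets = np.array([1.0, 1.0, -10.0])
predictions = np.array([1.5, 0.5, 2.0])
print(mse_loss(targets, predictions), huber_loss(targets, predictions))
```

In Keras, the equivalent change would be passing `tf.keras.losses.Huber()` as the `loss` argument of `compile`.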



## Challenge
Use only images from the rendered environment (you may use `atari`, `box-2d`, or `classic-control` environments for this) as the state observations of the model.
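
For the image-only variant, a typical preprocessing pipeline converts each rendered RGB frame to grayscale, downsamples it, and stacks the last few frames so that the network can infer velocities. The sketch below uses NumPy only. The frame size, the stride-based downsampling, and the names `preprocess_frame` and `FrameStack` are assumptions for illustration, and a synthetic array stands in for a frame returned by `env.render(mode='rgb_array')`.

```python
from collections import deque

import numpy as np


def preprocess_frame(frame, stride=4):
    """Convert an (H, W, 3) uint8 RGB frame to a downsampled grayscale image in [0, 1]."""
    luma_weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ luma_weights  # shape (H, W)
    return gray[::stride, ::stride] / 255.0


class FrameStack:
    """Keep the most recent preprocessed frames as one (H, W, num_frames) observation."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # Fill the whole stack with the first frame of the episode
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.observation()

    def step(self, frame):
        # Appending drops the oldest frame automatically
        self.frames.append(preprocess_frame(frame))
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=-1)


# Synthetic stand-in for a rendered frame (assumed 400x600 RGB)
fake_frame = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8)
stack = FrameStack(num_frames=4)
obs = stack.reset(fake_frame)
print(obs.shape)  # (100, 150, 4)
```

The stacked observation would then feed a convolutional network in place of the 4-value state vector.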

In [1]:
import os

import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:

env = gym.make('CartPole-v1')
env.seed(42)
state = env.reset()

# env.render()

In [3]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(2)
State Space Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
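
The printed bounds describe what the observation space can represent, not when an episode ends. In CartPole, an episode terminates once the cart position leaves ±2.4 or the pole angle leaves ±12° (about ±0.2095 rad), which is what the asterisks in the observation table refer to. A small sketch of that check, with thresholds taken from the Gym CartPole documentation (the function name `is_terminal` is hypothetical):

```python
import math

import numpy as np

X_THRESHOLD = 2.4                          # cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360   # 12 degrees in radians, about 0.2095


def is_terminal(observation):
    """Return True if the observation lies outside the CartPole termination limits."""
    cart_position, _, pole_angle, _ = observation
    return bool(abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD)


print(is_terminal(np.array([0.0, 0.0, 0.05, 0.0])))  # False: well inside both limits
print(is_terminal(np.array([0.0, 0.0, 0.25, 0.0])))  # True: pole angle beyond 12 degrees
print(is_terminal(np.array([2.5, 0.0, 0.0, 0.0])))   # True: cart beyond 2.4
```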


In [4]:
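def Random_games():
    """Run 100 episodes with uniformly random actions and print every transition."""
    for episode in range(100):
        env.reset()
        # CartPole-v1 caps each episode at 500 steps
        for t in range(500):
            env.render()
            # Sample a random action (0 or 1) from the action space
            action = env.action_space.sample()
            
            next_state, reward, done, info = env.step(action)
            print(f'time:{t}', next_state, reward, done, info, action)
            # Stop when the pole falls, the cart leaves the track, or the step limit is hit
            if done:
                break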

In [5]:
Random_games()

time:0 [ 0.00597787 -0.17616643 -0.03614886  0.26907313] 1.0 False {} 0
time:1 [ 0.00245454 -0.37075436 -0.0307674   0.5501389 ] 1.0 False {} 0
time:2 [-0.00496055 -0.17521407 -0.01976462  0.24792308] 1.0 False {} 1
time:3 [-0.00846483  0.02018449 -0.01480616 -0.05092794] 1.0 False {} 1
time:4 [-0.00806114 -0.17472206 -0.01582472  0.23704699] 1.0 False {} 0
time:5 [-0.01155558 -0.3696144  -0.01108378  0.5246966 ] 1.0 False {} 0
time:6 [-1.8947868e-02 -5.6457865e-01 -5.8984658e-04  8.1386644e-01] 1.0 False {} 0
time:7 [-0.03023944 -0.3694486   0.01568748  0.52099806] 1.0 False {} 1
time:8 [-0.03762841 -0.17455095  0.02610744  0.23329946] 1.0 False {} 1
time:9 [-0.04111943  0.02018843  0.03077343 -0.05103534] 1.0 False {} 1
time:10 [-0.04071566 -0.17536095  0.02975273  0.25119582] 1.0 False {} 0
time:11 [-0.04422288  0.01932378  0.03477664 -0.03195602] 1.0 False {} 1
time:12 [-0.04383641  0.2139302   0.03413752 -0.31346688] 1.0 False {} 1
time:13 [-0.0395578   0.40854964  0.02786818 -0.5

In [6]:
env.close()

In [33]:
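from collections import deque
import random

import gym
import numpy as np
import tensorflow as tf
# Import everything from tensorflow.keras so models, layers, and optimizers come from the same Keras build
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam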

In [40]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        
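        # Replay buffer; memory_length is a global that must be defined before the agent is created
        self.memory = deque(maxlen=memory_length)
        self.gamma = 0.95  # discount factor
        self.epsilon = 1.0  # initial exploration rate
        self.epsilon_decay = 0.99  # multiplicative decay applied after each replay call
        self.epsilon_min = 0.01  # decay stops once epsilon is at or below this value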
        
        self.learning_rate = 0.001
        
        self.model = self._build_model()
        
    def _build_model(self):
        model = Sequential()
        model.add(Dense(64, input_dim=self.state_size, activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        
    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise take the greedy action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])
    
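    def replay(self, batch_size):
        # Sample a random minibatch of stored transitions
        minibatch = random.sample(self.memory, batch_size)
        
        for state, action, reward, next_state, done in minibatch:
            # Q-learning target: r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise
            target = reward
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            # Overwrite only the taken action's Q-value so the other outputs contribute zero error
            target_f = self.model.predict(state)
            target_f[0][action] = target
            
            self.model.fit(state, target_f, epochs=1, verbose=0)
        
        # Decay the exploration rate once per replay call
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay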
        
    
    def load(self, name):
        self.model.load_weights(name)
    
    def save(self, name):
        self.model.save_weights(name)
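
The `replay` method above calls `predict` and `fit` once per sampled transition, which is a likely contributor to the long training times. The same Q-learning targets can be computed for a whole minibatch at once. The sketch below shows only the target computation in NumPy, with precomputed Q-value arrays standing in for `self.model.predict(states)` and `self.model.predict(next_states)`; the function name and the array layout are assumptions.

```python
import numpy as np


def batched_q_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Build the regression targets for one minibatch.

    q_current: (batch, n_actions) Q-values predicted for the sampled states
    q_next:    (batch, n_actions) Q-values predicted for the next states
    actions:   (batch,) index of the action taken in each transition
    rewards:   (batch,) reward received in each transition
    dones:     (batch,) True where the transition ended the episode
    """
    actions = np.asarray(actions, dtype=int)
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=bool)

    # Bootstrap from the best next-state Q-value unless the episode ended
    bootstrapped = rewards + gamma * np.max(q_next, axis=1)
    td_targets = np.where(dones, rewards, bootstrapped)

    # Copy the current predictions and overwrite only the taken actions
    targets = np.array(q_current, dtype=np.float32, copy=True)
    targets[np.arange(len(actions)), actions] = td_targets
    return targets


q_current = np.array([[1.0, 2.0], [0.5, 0.1]])
q_next = np.array([[3.0, 4.0], [9.0, 9.0]])
targets = batched_q_targets(
    q_current, q_next, actions=[0, 1], rewards=[1.0, -10.0], dones=[False, True]
)
print(targets)  # first row: [4.8, 2.0]; second row: [0.5, -10.0]
```

With a Keras model, `targets` could then be passed to a single `model.fit(states, targets, ...)` call per replay step instead of 32 separate ones.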

In [41]:
memory_length = 2000
batch_size = 32
n_episodes = 500
max_timesteps = 1000

In [42]:
env = gym.make('CartPole-v1')

In [43]:
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [44]:

agent = DQNAgent(state_size, action_size)

In [45]:
# Interact with the environment and train the agent

done = False
best_time = 0
for e in range(n_episodes):
    
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    
    for time in range(max_timesteps):
        # env.render()
        action = agent.act(state)
        
        next_state, reward, done, _ = env.step(action)
        
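        # Penalize the step that ends the episode (this also applies when the episode ends at the step limit)
        reward = reward if not done else -10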
        
        next_state = np.reshape(next_state, [1, state_size])
        
        agent.remember(state, action, reward, next_state, done)
    
        state = next_state
        
        if done:
            print(f'{e}/{n_episodes}, score:{time}, epsilon:{agent.epsilon}')
            break
            
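    # Train on one minibatch per episode once enough transitions are stored
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
    
    # Save the weights when the episode matches or beats the best score so far and reaches the 475 threshold
    if (time >= best_time) and time >= 475:
        agent.save(f'model_ep{e}.hdf5')
        best_time = time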

0/500, score:13, epsilon:1.0
1/500, score:18, epsilon:1.0
2/500, score:31, epsilon:0.99
3/500, score:20, epsilon:0.9801
4/500, score:36, epsilon:0.9702989999999999
5/500, score:20, epsilon:0.96059601
6/500, score:10, epsilon:0.9509900498999999
7/500, score:14, epsilon:0.9414801494009999
8/500, score:30, epsilon:0.9320653479069899
9/500, score:33, epsilon:0.92274469442792
10/500, score:18, epsilon:0.9135172474836407
11/500, score:22, epsilon:0.9043820750088043
12/500, score:28, epsilon:0.8953382542587163
13/500, score:11, epsilon:0.8863848717161291
14/500, score:13, epsilon:0.8775210229989678
15/500, score:27, epsilon:0.8687458127689781
16/500, score:14, epsilon:0.8600583546412883
17/500, score:18, epsilon:0.8514577710948754
18/500, score:12, epsilon:0.8429431933839266
19/500, score:9, epsilon:0.8345137614500874
20/500, score:19, epsilon:0.8261686238355865
21/500, score:22, epsilon:0.8179069375972307
22/500, score:20, epsilon:0.8097278682212583
23/500, score:23, epsilon:0.80163058953904
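
The log shows epsilon shrinking by a factor of 0.99 per replay call, starting once the memory holds more than `batch_size` transitions. Because `replay` runs at most once per episode, the number of episodes needed to reach `epsilon_min` can be worked out directly. A small sketch that mirrors the decay rule in `DQNAgent.replay` (the helper name is hypothetical):

```python
def replays_until_min_epsilon(epsilon=1.0, decay=0.99, epsilon_min=0.01):
    """Count how many replay calls it takes before epsilon stops decaying."""
    count = 0
    # Same condition as in DQNAgent.replay: decay only while epsilon > epsilon_min
    while epsilon > epsilon_min:
        epsilon *= decay
        count += 1
    return count, epsilon


count, final_epsilon = replays_until_min_epsilon()
print(count, final_epsilon)
```

With these settings the decay takes 459 replay calls, so the agent is still exploring for most of the 500-episode run. Since no clamp is applied, epsilon ends slightly below `epsilon_min`.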

In [46]:
print('test')

test


In [71]:
ep = 280  # episode whose saved weights are loaded
# Rebuild the same architecture as DQNAgent._build_model before loading the weights
best_model = Sequential()
best_model.add(Dense(64, input_dim=4, activation='relu'))
best_model.add(Dense(64, activation='relu'))
best_model.add(Dense(2, activation='linear'))

best_model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
best_model.load_weights(f'model_ep{ep}.hdf5')
# best_model.summary()

state = env.reset()
# env.seed(42)
done = False
for i in range(1000):
    env.render()
    
    # The network outputs one Q-value per action (not probabilities); act greedily
    q_values = best_model.predict(state[np.newaxis])
    action = np.argmax(q_values[0])
    
    state, reward, done, _ = env.step(action)
    
    if done:
        print(f'steps: {i}')
        break
env.close()
        

steps: 499
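
A single 499-step rollout shows that the loaded model can balance the pole, but the Gym leaderboard convention judges an environment solved by the average return over 100 consecutive episodes, with 475 as the threshold for `CartPole-v1`. A minimal sketch of that check over a list of episode returns (the function name and the synthetic returns are made up for illustration and are not results from this notebook):

```python
from collections import deque


def first_solved_episode(episode_returns, threshold=475.0, window=100):
    """Return the index of the first episode at which the rolling mean over
    the last `window` episodes reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, episode_return in enumerate(episode_returns):
        recent.append(episode_return)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Synthetic example: 50 poor episodes followed by 150 perfect ones
synthetic_returns = [20.0] * 50 + [500.0] * 150
print(first_solved_episode(synthetic_returns))  # 144
```

In the training loop, the per-episode score could be appended to such a list to report when the rolling average, rather than a single episode, crosses the threshold.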
