<a href="https://gym.openai.com/">
  <img src="https://gym.openai.com/assets/dist/footer/openai-logo-ce082f60cc.svg" 
  alt="Images" width="200">
</a>
<a href="https://gym.openai.com/">
  <img src="https://gym.openai.com/assets/dist/home/header/home-icon-54c30e2345.svg" 
  alt="Images" width="42">
</a>

> Gym is a toolkit for developing and comparing reinforcement learning algorithms.

#  CartPole-v1

> * [Environment](https://gym.openai.com/envs/CartPole-v1/)
> * [GitHub](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

<p align="center">
  <img src="CartPole.jpg" alt="drawing" width="420" align="center"/>
</p>




> * A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
> * The system is controlled by applying a force of +1 or -1 to the cart. 
> * The pendulum starts upright, and the goal is to prevent it from falling over.
> * A reward of +1 is provided for every timestep that the pole remains upright.
> * The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

<a href="https://gym.openai.com/docs/">
  <img src="https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg" 
   alt="Images" width="500">
</a>

## Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s `step` function returns exactly what we need. In fact, `step` returns four values. These are:

* `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
* `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
* `done` (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
* `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
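
The loop described above can be sketched without Gym installed. `ToyEnv` and `run_episode` are hypothetical names made up for this illustration; only the shape of `reset()` and the four-value `step()` return mirrors the interface described in the text.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not part of Gym).

    The observation is a step counter; the episode ends after `horizon` steps.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.t = 0
        return self.t

    def step(self, action):
        # Advance one timestep and return the four values described above.
        self.t += 1
        observation = self.t
        reward = 1.0                     # +1 per step, as in CartPole
        done = self.t >= self.horizon    # True once the episode is over
        info = {"last_action": action}   # diagnostics only
        return observation, reward, done, info


def run_episode(env, policy):
    """Run one episode with the given policy and return the total reward."""
    observation = env.reset()
    done = False
    total = 0.0
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total += reward
    return total


total_reward = run_episode(ToyEnv(horizon=5), lambda obs: random.choice([0, 1]))
print(total_reward)
```

Because the toy episode always lasts `horizon` steps with a reward of 1 per step, the total here depends only on the horizon, not on the random actions.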





## Spaces

The random-action loop in this README samples its actions from the environment’s action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations.
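
As a rough illustration of what a space describes, the sketch below defines two simplified stand-in classes. They are not Gym's implementation (Gym provides `gym.spaces.Discrete` and `gym.spaces.Box`); the class bodies are assumptions written for this example, and the bounds are the ones listed in this README's observation table.

```python
import random


class Discrete:
    """Simplified stand-in for a discrete space: the integers 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Draw one valid action uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class Box:
    """Simplified stand-in for a bounded space of real-valued vectors."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)
        self.shape = (len(self.low),)

    def contains(self, x):
        # Valid if the length matches and every entry lies within its bounds.
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


inf = float("inf")
action_space = Discrete(2)  # 0 = push left, 1 = push right
observation_space = Box(
    low=[-2.4, -inf, -0.209, -inf],
    high=[2.4, inf, 0.209, inf],
)

print(action_space.n, observation_space.shape)
```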

## Original:
### [nicknochnack](https://github.com/nicknochnack/TensorflowKeras-ReinforcementLearning/blob/master/Deep%20Reinforcement%20Learning.ipynb)

# Dependencies

In [1]:
#!pip install tensorflow==2.3.0
#!pip install gym
#!pip install keras
#!pip install keras-rl2

# Environment Setup

        Starting state: all observations are assigned a uniform random value in [-0.05..0.05]

In [2]:
import gym 
env = gym.make('CartPole-v1')

## Observation

        Num     Observation               Min                     Max
        0       Cart Position             -2.4                    2.4
        1       Cart Velocity             -Inf                    Inf
        2       Pole Angle                -0.209 rad (-12 deg)    0.209 rad (12 deg)
        3       Pole Angular Velocity     -Inf                    Inf

In [3]:
states = env.observation_space.shape[0]
print(states)

4


## Actions

        Num   Action
        0     Push cart to the left
        1     Push cart to the right

In [4]:
actions = env.action_space.n
print(actions)

2


The **agent** chooses an action according to its *internal policy*.

The **action** is passed to the **environment** via `env.step(action)`,

which returns `n_state, reward, done, info`.

#### reward
        Reward is 1 for every step taken, including the termination step
        
#### done

    Episode Termination:
        Pole Angle is more than 12 degrees.
        Cart Position is more than 2.4 (center of the cart reaches the edge of the display).
        Episode length is greater than 500 (200 for CartPole-v0).
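
The termination conditions listed above can be written as a single check. This is a minimal sketch, not the environment's own code: `is_done` is a hypothetical helper, and the `max_steps` default of 500 is an assumption based on the step limit registered for CartPole-v1.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit
THETA_THRESHOLD = math.radians(12)   # pole angle limit, about 0.209 rad


def is_done(cart_position, pole_angle, steps, max_steps=500):
    """Return True when any of the listed termination conditions holds.

    max_steps is an assumption: 500 is the limit registered for CartPole-v1.
    """
    return (
        abs(pole_angle) > THETA_THRESHOLD
        or abs(cart_position) > X_THRESHOLD
        or steps >= max_steps
    )


print(is_done(0.0, 0.0, 10))                # False: still balanced
print(is_done(2.5, 0.0, 10))                # True: cart left the track
print(is_done(0.0, math.radians(13), 10))   # True: pole tipped too far
```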

### Random Actions
`action = random.choice([0,1])`

`n_state, reward, done, info = env.step(action)`

In [25]:
import random

episodes = 50
for episode in range(1, episodes+1):
    
    # Reset the environment to a fresh starting state
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()                      # draw the current frame
        action = random.choice([0,1])     # 0 = push left, 1 = push right
        n_state, reward, done, info = env.step(action)
        score+=reward                     # +1 for every step taken
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:28.0
Episode:2 Score:17.0
Episode:3 Score:13.0
Episode:4 Score:34.0
Episode:5 Score:20.0
Episode:6 Score:15.0
Episode:7 Score:16.0
Episode:8 Score:18.0
Episode:9 Score:17.0
Episode:10 Score:13.0
Episode:11 Score:11.0
Episode:12 Score:18.0
Episode:13 Score:11.0
Episode:14 Score:11.0
Episode:15 Score:49.0
Episode:16 Score:31.0
Episode:17 Score:13.0
Episode:18 Score:11.0
Episode:19 Score:18.0
Episode:20 Score:10.0
Episode:21 Score:39.0
Episode:22 Score:29.0
Episode:23 Score:58.0
Episode:24 Score:12.0
Episode:25 Score:20.0
Episode:26 Score:11.0
Episode:27 Score:23.0
Episode:28 Score:10.0
Episode:29 Score:13.0
Episode:30 Score:24.0
Episode:31 Score:26.0
Episode:32 Score:22.0
Episode:33 Score:41.0
Episode:34 Score:27.0
Episode:35 Score:14.0
Episode:36 Score:39.0
Episode:37 Score:18.0
Episode:38 Score:20.0
Episode:39 Score:10.0
Episode:40 Score:16.0
Episode:41 Score:34.0
Episode:42 Score:19.0
Episode:43 Score:12.0
Episode:44 Score:13.0
Episode:45 Score:49.0
Episode:46 Score:17

# Create a Deep Learning Model with Keras

In [6]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [7]:
def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1,states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [8]:
model = build_model(states, actions)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 4)                 0         
_________________________________________________________________
dense (Dense)                (None, 24)                120       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
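
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` below is made up for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# 4 observations -> 24 units -> 24 units -> 2 actions
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [120, 600, 50] 770
```

The `Flatten` layer only reshapes its input, so it contributes no parameters.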


In [9]:
# https://github.com/keras-rl/keras-rl/blob/master/rl/agents/dqn.py#L89

# https://github.com/PacktPublishing/Hands-On-ROS-for-Robotics-Programming/blob/master/Chapter11_OpenAI_Gym/taxi/Taxi-v3.ipynb

In [10]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [11]:
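def build_agent(model, actions):
    # Exploration: sample actions from a softmax over the predicted Q-values
    policy = BoltzmannQPolicy()
    # Replay buffer holding up to 50000 transitions, one observation per state
    memory = SequentialMemory(limit=50000, window_length=1)
    # 10 warm-up steps before learning; a value below 1 means soft target updates
    dqn = DQNAgent(model=model, memory=memory, policy=policy, 
                  nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn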
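
To give an intuition for Boltzmann exploration, the sketch below turns Q-values into action probabilities with a softmax and samples from them. It is not keras-rl's implementation of `BoltzmannQPolicy`; the function names, the `tau` default, and the max-subtraction trick are assumptions chosen for this illustration.

```python
import math
import random


def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q-values with temperature tau."""
    scaled = [q / tau for q in q_values]
    m = max(scaled)                           # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = boltzmann_probs([1.0, 2.0])
print(probs)                           # the higher Q-value gets the larger share
print(boltzmann_action([1.0, 2.0]))    # 0 or 1, with 1 more likely
```

A higher `tau` flattens the distribution (more exploration), while a lower `tau` concentrates probability on the action with the highest Q-value.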

[rjoseph24](https://github.com/nicknochnack/TensorflowKeras-ReinforcementLearning/issues/1)

In [12]:
#del model
model = build_model(states, actions)

In [13]:
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
history =  dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
    1/10000 [..............................] - ETA: 5:15 - reward: 1.0000



96 episodes - episode_reward: 102.615 [8.000, 325.000] - loss: 1.683 - mae: 19.212 - mean_q: 38.896

Interval 2 (10000 steps performed)
37 episodes - episode_reward: 268.432 [191.000, 500.000] - loss: 2.414 - mae: 41.068 - mean_q: 82.941

Interval 3 (20000 steps performed)
42 episodes - episode_reward: 243.262 [194.000, 371.000] - loss: 1.989 - mae: 46.039 - mean_q: 92.869

Interval 4 (30000 steps performed)
41 episodes - episode_reward: 240.098 [191.000, 375.000] - loss: 1.483 - mae: 44.694 - mean_q: 90.021

Interval 5 (40000 steps performed)
done, took 155.342 seconds


In [14]:
print(history.params)
print(history.history.keys())
rewards = history.history['episode_reward']
print(len(rewards))

{'nb_steps': 50000}
dict_keys(['episode_reward', 'nb_episode_steps', 'nb_steps'])
252
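
Per-episode rewards such as `history.history['episode_reward']` are noisy, so a moving average is a common way to look at the trend. The helper below is a hypothetical sketch, and `sample_rewards` holds made-up values for illustration, not results from this training run.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


sample_rewards = [10.0, 20.0, 30.0, 40.0]   # made-up values for illustration
print(moving_average(sample_rewards, 2))    # [15.0, 25.0, 35.0]
```

With the real data, a call such as `moving_average(rewards, 20)` would smooth the 252 episode rewards collected above; the window size of 20 is an arbitrary choice.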


In [15]:
scores = dqn.test(env, nb_episodes=100, visualize=False)
print(np.mean(scores.history['episode_reward']))

Testing for 100 episodes ...
Episode 1: reward: 500.000, steps: 500
Episode 2: reward: 500.000, steps: 500
Episode 3: reward: 500.000, steps: 500
Episode 4: reward: 500.000, steps: 500
Episode 5: reward: 500.000, steps: 500
Episode 6: reward: 500.000, steps: 500
Episode 7: reward: 500.000, steps: 500
Episode 8: reward: 500.000, steps: 500
Episode 9: reward: 500.000, steps: 500
Episode 10: reward: 500.000, steps: 500
Episode 11: reward: 500.000, steps: 500
Episode 12: reward: 500.000, steps: 500
Episode 13: reward: 500.000, steps: 500
Episode 14: reward: 500.000, steps: 500
Episode 15: reward: 500.000, steps: 500
Episode 16: reward: 500.000, steps: 500
Episode 17: reward: 500.000, steps: 500
Episode 18: reward: 500.000, steps: 500
Episode 19: reward: 500.000, steps: 500
Episode 20: reward: 500.000, steps: 500
Episode 21: reward: 500.000, steps: 500
Episode 22: reward: 500.000, steps: 500
Episode 23: reward: 500.000, steps: 500
Episode 24: reward: 282.000, steps: 282
Episode 25: reward: 

In [29]:
_ = dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 500.000, steps: 500
Episode 2: reward: 500.000, steps: 500
Episode 3: reward: 267.000, steps: 267
Episode 4: reward: 500.000, steps: 500
Episode 5: reward: 500.000, steps: 500


In [17]:
dqn.save_weights('dqn_weights.h5f', overwrite=True)

In [18]:
del model
del dqn
del env

In [26]:
env = gym.make('CartPole-v1')
actions = env.action_space.n
states = env.observation_space.shape[0]
model = build_model(states, actions)
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

In [27]:
dqn.load_weights('dqn_weights.h5f')

In [28]:
_ = dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 500.000, steps: 500
Episode 2: reward: 500.000, steps: 500
Episode 3: reward: 500.000, steps: 500
Episode 4: reward: 500.000, steps: 500
Episode 5: reward: 500.000, steps: 500


        Solved Requirements:
        Considered solved when the average return is greater than or equal to
        195.0 over 100 consecutive trials (CartPole-v0; the threshold
        registered for CartPole-v1 is 475.0).
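
The solved criterion above can be checked against a list of episode returns. This is a minimal sketch with a hypothetical helper name, `is_solved`; the threshold is passed in explicitly rather than hard-coded.

```python
def is_solved(episode_returns, threshold, window=100):
    """True if any `window` consecutive returns average at least `threshold`."""
    if len(episode_returns) < window:
        return False
    for start in range(len(episode_returns) - window + 1):
        chunk = episode_returns[start:start + window]
        if sum(chunk) / window >= threshold:
            return True
    return False


print(is_solved([200.0] * 100, threshold=195.0))   # True
print(is_solved([200.0] * 99, threshold=195.0))    # False: fewer than 100 trials
```

Applied to the test run above, this would be called as `is_solved(scores.history['episode_reward'], threshold=195.0)`.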