<a href="https://gym.openai.com/">
  <img src="https://gym.openai.com/assets/dist/footer/openai-logo-ce082f60cc.svg" 
  alt="Images" width="200">
</a>
<a href="https://gym.openai.com/">
  <img src="https://gym.openai.com/assets/dist/home/header/home-icon-54c30e2345.svg" 
  alt="Images" width="42">
</a>

> Gym is a toolkit for developing and comparing reinforcement learning algorithms.

#  CartPole-v1

> * [Environment](https://gym.openai.com/envs/CartPole-v1/)
> * [GitHub](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

<p align="center">
  <img src="CartPole.jpg" alt="drawing" width="420" align="center"/>
</p>




> * A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
> * The system is controlled by applying a force of +1 or -1 to the cart. 
> * The pendulum starts upright, and the goal is to prevent it from falling over.
> * A reward of +1 is provided for every timestep that the pole remains upright.
> * The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

<a href="https://gym.openai.com/docs/">
  <img src="https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg" 
   alt="Images" width="500">
</a>

## Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s `step` function returns exactly what we need. In fact, `step` returns four values:

> * `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
> * `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
> * `done` (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
> * `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
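
The loop described above can be sketched without Gym installed. The `ToyEnv` class below is a hypothetical stand-in (it is not part of Gym) that only mimics the `reset()`/`step()` interface and the four return values; its observations are random numbers with no physical meaning.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment, not a Gym class.

    It only imitates the reset()/step() interface described above.
    """

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0  # initial observation

    def step(self, action):
        self.steps += 1
        observation = self.rng.uniform(-1.0, 1.0)  # meaningless toy observation
        reward = 1.0                               # +1 per step, as in CartPole
        done = self.steps >= self.max_steps        # episode ends after max_steps
        info = {"steps": self.steps}               # diagnostic data only
        return observation, reward, done, info


def run_episode(env, policy):
    """Run one episode of the agent-environment loop and return the total reward."""
    observation = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = policy(observation)                        # agent chooses an action
        observation, reward, done, info = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


total_reward = run_episode(ToyEnv(max_steps=10), policy=lambda obs: random.choice([0, 1]))
print(total_reward)  # 10.0
```

Swapping `ToyEnv` for a real environment leaves `run_episode` unchanged, as long as the environment returns the same four values from `step`.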





## Spaces

Random agents simply sample actions from the environment’s action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations.
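
To make the role of the two spaces concrete, here is a minimal sketch using stand-in classes. `DiscreteSketch` and `BoxSketch` are hypothetical names, not Gym classes; they only imitate the `sample()` and `contains()` methods that Gym spaces provide. The bounds are the ones listed in the observation table of this document.

```python
import random


class DiscreteSketch:
    """Stand-in for a discrete space with n valid actions: 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class BoxSketch:
    """Stand-in for a space of fixed-length vectors with per-component bounds."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low = list(low)
        self.high = list(high)
        self.shape = (len(self.low),)

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


INF = float("inf")
action_space = DiscreteSketch(2)  # 0 = push left, 1 = push right
observation_space = BoxSketch(
    low=[-2.4, -INF, -0.209, -INF],  # position, velocity, angle, angular velocity
    high=[2.4, INF, 0.209, INF],
)

print(action_space.n)           # 2
print(observation_space.shape)  # (4,)
```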

## Original:
### [nicknochnack](https://github.com/nicknochnack/TensorflowKeras-ReinforcementLearning/blob/master/Deep%20Reinforcement%20Learning.ipynb)

# Dependencies

    # Uncomment to install the packages used below
    #!pip install tensorflow==2.3.0
    #!pip install gym
    #!pip install keras
    #!pip install keras-rl2

# Environment Setup

        Starting State:
        All observations are assigned a uniform random value in [-0.05..0.05]

Create the environment:

    import gym

    env = gym.make('CartPole-v1')

## Observation

        Num     Observation               Min                     Max
        0       Cart Position             -2.4                    2.4
        1       Cart Velocity             -Inf                    Inf
        2       Pole Angle                -0.209 rad (-12 deg)    0.209 rad (12 deg)
        3       Pole Angular Velocity     -Inf                    Inf

The observation vector has four components, so `states` is the size of the network input:

    states = env.observation_space.shape[0]
    print(states)

## Actions

        Num   Action
        0     Push cart to the left
        1     Push cart to the right

`actions` is the number of discrete actions (two here), which becomes the size of the network output:

    actions = env.action_space.n
    print(actions)

The **agent** chooses an action according to its internal *policy*.

The **action** is passed to the **environment** via `env.step(action)`, which returns `n_state, reward, done, info`.

#### reward
        Reward is 1 for every step taken, including the termination step
        
#### done

    Episode Termination:
        Pole Angle is more than 12 degrees.
        Cart Position is more than 2.4 (center of the cart reaches the edge of the display).
        Episode length is greater than 200 (500 for CartPole-v1).
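
The three termination conditions listed above can be written as a single predicate. This is a literal transcription of the list, not the Gym source; the function name and arguments are illustrative, and the step limit is a parameter so it can be set to whatever the registered environment uses.

```python
import math

X_LIMIT = 2.4                        # cart position limit from the list above
THETA_LIMIT_RAD = math.radians(12)   # 12 degrees, about 0.209 rad


def episode_done(cart_position, pole_angle, steps_taken, max_steps=200):
    """Return True if any of the three termination conditions holds.

    pole_angle is in radians; max_steps defaults to the limit quoted above.
    """
    pole_fell = abs(pole_angle) > THETA_LIMIT_RAD
    cart_left_track = abs(cart_position) > X_LIMIT
    out_of_time = steps_taken > max_steps
    return pole_fell or cart_left_track or out_of_time


print(round(THETA_LIMIT_RAD, 3))    # 0.209
print(episode_done(0.0, 0.0, 10))   # False
print(episode_done(2.5, 0.0, 10))   # True
```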

### Random Actions
`action = random.choice([0,1])`

`n_state, reward, done, info = env.step(action)`

    import random

    episodes = 50
    for episode in range(1, episodes + 1):
        # Reset the environment at the start of each episode
        state = env.reset()
        done = False
        score = 0

        while not done:
            env.render()
            # Pick left (0) or right (1) uniformly at random
            action = random.choice([0, 1])
            n_state, reward, done, info = env.step(action)
            score += reward
        print('Episode:{} Score:{}'.format(episode, score))
    env.close()

# Create a Deep Learning Model with Keras

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Flatten
    from tensorflow.keras.optimizers import Adam

    def build_model(states, actions):
        model = Sequential()
        # The leading 1 matches window_length=1 in the agent's replay memory
        model.add(Flatten(input_shape=(1, states)))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(24, activation='relu'))
        # One linear output per action: the estimated Q-value
        model.add(Dense(actions, activation='linear'))
        return model

    model = build_model(states, actions)
    model.summary()
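
As a sanity check on `model.summary()`, the number of trainable parameters can be worked out by hand, assuming `states = 4` and `actions = 2` as in the tables above. A dense layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases; `Flatten` adds none. The helper names below are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of a stack of dense layers with the given widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# 4 inputs -> 24 -> 24 -> 2 outputs
param_count = mlp_params([4, 24, 24, 2])
print(param_count)  # 770  (120 + 600 + 50)
```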

> * [keras-rl DQN agent source (`dqn.py`)](https://github.com/keras-rl/keras-rl/blob/master/rl/agents/dqn.py#L89)
> * [Taxi-v3 notebook, Hands-On ROS for Robotics Programming](https://github.com/PacktPublishing/Hands-On-ROS-for-Robotics-Programming/blob/master/Chapter11_OpenAI_Gym/taxi/Taxi-v3.ipynb)

    from rl.agents import DQNAgent
    from rl.policy import BoltzmannQPolicy
    from rl.memory import SequentialMemory

    def build_agent(model, actions):
        policy = BoltzmannQPolicy()
        # Replay buffer holding up to 50,000 steps; each sample is a single observation
        memory = SequentialMemory(limit=50000, window_length=1)
        dqn = DQNAgent(model=model, memory=memory, policy=policy,
                       nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
        return dqn
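
`BoltzmannQPolicy` selects actions by sampling from a softmax over the Q-values rather than always taking the maximum. The sketch below shows that idea in plain Python. The temperature `tau=1.0` and the clipping range are assumptions chosen for illustration; check the keras-rl source for the values it actually uses.

```python
import math
import random


def boltzmann_probs(q_values, tau=1.0, clip=(-500.0, 500.0)):
    """Softmax over q_values / tau, with the exponent clipped for numerical safety."""
    scaled = [min(max(q / tau, clip[0]), clip[1]) for q in q_values]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index with probability proportional to exp(Q / tau)."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


print(boltzmann_probs([1.0, 1.0]))  # [0.5, 0.5]
print(boltzmann_action([2.0, 0.0]))
```

Equal Q-values give a uniform choice; the larger the gap between Q-values (or the smaller `tau`), the closer the policy gets to greedy selection.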

[rjoseph24](https://github.com/nicknochnack/TensorflowKeras-ReinforcementLearning/issues/1)

    # Build a fresh model for the agent (see the issue linked above)
    #del model
    model = build_model(states, actions)

    dqn = build_agent(model, actions)
    dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
    history = dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)
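
Assuming keras-rl interprets a `target_model_update` below 1 as a soft-update coefficient (which is how its DQN source reads), the value `1e-2` means the target network moves 1% of the way toward the online network after each training step. A minimal sketch of that rule on plain lists of weights, with illustrative function and variable names:

```python
def soft_update(target_weights, online_weights, tau=1e-2):
    """Return target weights moved a fraction tau toward the online weights."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(100):
    target = soft_update(target, online)

# After 100 updates the remaining gap is 0.99 ** 100 (about 0.366) of the original
print([round(t, 3) for t in target])  # [0.634, -0.634]
```

The slow-moving target is what keeps the bootstrapped Q-value targets from chasing the network's own latest changes.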

    print(history.params)
    print(history.history.keys())
    # Total reward collected in each training episode
    rewards = history.history['episode_reward']
    print(len(rewards))

    # Evaluate over 100 episodes and report the mean episode reward
    scores = dqn.test(env, nb_episodes=100, visualize=False)
    print(np.mean(scores.history['episode_reward']))

    _ = dqn.test(env, nb_episodes=5, visualize=True)
    env.close()

    dqn.save_weights('dqn_weights.h5f', overwrite=True)

    # Drop the trained objects so the weights can be reloaded from disk below
    del model
    del dqn
    del env

    # Recreate the environment, model, and agent, then restore the saved weights.
    # The environment ID matches the one used for training.
    env = gym.make('CartPole-v1')
    actions = env.action_space.n
    states = env.observation_space.shape[0]
    model = build_model(states, actions)
    dqn = build_agent(model, actions)
    dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

    dqn.load_weights('dqn_weights.h5f')

    _ = dqn.test(env, nb_episodes=5, visualize=True)
    env.close()

Solved Requirements:

        Considered solved when the average return is greater than or equal to
        195.0 over 100 consecutive trials (475.0 for CartPole-v1).
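
The solved criterion above can be checked against a list of per-episode rewards, such as `history.history['episode_reward']`. The helper below is a sketch with an illustrative name; the threshold and window default to the values quoted above.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if any run of `window` consecutive episodes averages >= threshold."""
    if len(episode_rewards) < window:
        return False
    running = sum(episode_rewards[:window])
    if running / window >= threshold:
        return True
    # Slide the window one episode at a time, updating the running sum
    for i in range(window, len(episode_rewards)):
        running += episode_rewards[i] - episode_rewards[i - window]
        if running / window >= threshold:
            return True
    return False


print(is_solved([200.0] * 100))  # True
print(is_solved([200.0] * 99))   # False
```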