# Introductions

## Author

John Cinquegrana   
<I need a new professional email that isn't my school one>

## Preamble

This is my first attempt at actually writing a reinforcement learning application. I've studied reinforcement learning in a college setting before, but I haven't written an application myself. This notebook exists only to get familiar with the gym library, so don't expect any good models or results out of it.

# Cart Pole Problem

The cart pole problem is a very simple one, perfect for getting used to everything. You have to balance a pole on top of a cart that is restricted to a single dimension of movement. It will make much more sense when you see it live.

Here is an example of using keras with an OpenAI Gym environment: https://keras.io/examples/rl/actor_critic_cartpole/  
Here is an example of using the Cart Pole problem in Gym: https://gym.openai.com/docs/#environments

# Setup
## Environment Setup
Before we set up the agent, we should know the details of our environment and the input the agent will be receiving.
### Initialization
Import and load in the CartPole environment.

In [42]:
import gym
env = gym.make('CartPole-v0')

### Outputs

See what the values we'll be sampling from the environment actually look like.

In [43]:
print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


In [44]:
input_shape = 4
num_actions = 2

The cell above records the values from the printouts that we will use later: each observation has four components, and there are two discrete actions.
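For reference, the Gym documentation for CartPole lists the four observation entries as cart position, cart velocity, pole angle, and pole angular velocity. The sketch below only labels the bounds from the printout. The names `OBSERVATION_LABELS`, `HIGH`, and `describe_bounds` are my own for illustration, not part of Gym.

```python
import math

# Order of the observation vector according to the Gym CartPole docs.
OBSERVATION_LABELS = (
    "cart position",
    "cart velocity",
    "pole angle (radians)",
    "pole angular velocity",
)

# Upper bounds copied from the printout; the lower bounds are their negatives.
FLOAT32_MAX = 3.4028235e+38  # the float32 maximum stands in for "unbounded"
HIGH = (4.8000002e+00, FLOAT32_MAX, 4.1887903e-01, FLOAT32_MAX)


def describe_bounds(labels, high):
    """Pair each label with its (low, high) range, using infinity for unbounded entries."""
    described = {}
    for label, upper in zip(labels, high):
        if upper >= FLOAT32_MAX:
            upper = math.inf
        described[label] = (-upper, upper)
    return described


bounds = describe_bounds(OBSERVATION_LABELS, HIGH)
for label, (low, high) in bounds.items():
    print(f"{label}: {low} to {high}")

# The pole angle bound is easier to read in degrees.
angle_limit_degrees = math.degrees(HIGH[2])
print(f"pole angle limit: about {angle_limit_degrees:.0f} degrees")
```

As far as I know, an episode ends well inside these bounds (around 2.4 units of cart position and 12 degrees of pole angle), so the observation space is wider than what the agent will see in practice.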

## Agent Setup
### Imports

Verify that every single tf import works correctly.

In [45]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense  # Use the Keras bundled with TensorFlow consistently

### Model

Create a simple model that we will use for Q-learning later on.

In [46]:
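hidden_size = 64  # Number of units in each hidden layer

# Network input: one value per observation component
input_layer = Input( shape=(input_shape,) )
h1 =  Dense( hidden_size, activation="relu" )(input_layer)
h2 =  Dense( hidden_size, activation="relu" )(h1)
# Single output value; the ReLU activation keeps it non-negative
final = Dense( 1, activation="relu" )(h2)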

model = keras.Model(inputs=input_layer, outputs=final)

model.summary()





Model: "model_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Total params: 4,545
Trainable params: 4,545
Non-trainable params: 0
_________________________________________________________________
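The total of 4,545 parameters can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes that arithmetic for the 4 → 64 → 64 → 1 network; `dense_params` is a helper name I made up for this check.

```python
def dense_params(inputs, units):
    """A Dense layer has inputs * units weights plus one bias per unit."""
    return inputs * units + units


# Input size, two hidden layers, and the single output unit
layer_sizes = [4, 64, 64, 1]
per_layer = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # parameters in each Dense layer
print(total_params)  # should match "Total params" in the summary
```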


# Training

## General Idea
We are applying a Q-network to an RL problem that plays out frame by frame over a series of episodes. Compared to ordinary machine learning, an episode is roughly comparable to an epoch, and a frame to a single observation.

To explore more options for our model, I will be using the epsilon-greedy action selection algorithm. This means that in the beginning, our model will tend to select random actions. This gives it a *broader idea of the possible state space*. As training continues, the model will rely more and more on its own predictions, instead of luck, to pick an action.
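A minimal sketch of epsilon-greedy selection, in plain Python so it runs without gym or TensorFlow. It assumes we already have one value estimate per action, which is not necessarily how the final notebook will produce them. The decay schedule (`decay_rate`, `minimum`) is a hypothetical choice of mine, not something tuned for CartPole.

```python
import random


def select_action(action_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(action_values))
    # Exploit: pick the index of the highest estimated value
    return max(range(len(action_values)), key=lambda a: action_values[a])


def decay_epsilon(epsilon, decay_rate=0.995, minimum=0.05):
    """Hypothetical schedule: shrink epsilon a little after each episode."""
    return max(minimum, epsilon * decay_rate)


epsilon = 0.8
rng = random.Random(0)          # seeded so the example is repeatable
estimated_values = [0.2, 0.7]   # made-up estimates for "push left" and "push right"

action = select_action(estimated_values, epsilon, rng)
epsilon = decay_epsilon(epsilon)
print(action, epsilon)
```

With `epsilon` at 0.8 the agent explores on roughly four out of five frames. Once `epsilon` has decayed toward the minimum, it almost always takes the greedy action.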

When the model does pick an action, it will pick the action it believes leads to the best state. It does not perform any search, minimax, or the like, so it acts in a very short-sighted way. This kind of reactive behavior is fine, if not preferable, in the CartPole problem.

The algorithm is well described in [this article](https://www.baeldung.com/cs/epsilon-greedy-q-learning) on the Baeldung CS website, though I am using a network rather than a table because I think that's more fun.
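For comparison, here is a rough sketch of the table-based update that a network replaces. It follows the standard Q-learning rule: move the current estimate toward `reward + discount * best next value`. The function name, the `learning_rate` value, and the toy states `"s0"` and `"s1"` are assumptions for illustration only.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, num_actions,
                      learning_rate=0.1, discount=0.1, done=False):
    """Apply one tabular Q-learning update and return the new estimate."""
    # Best value reachable from the next state (zero if the episode has ended)
    if done:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(num_actions))
    td_target = reward + discount * best_next        # what the estimate should move toward
    td_error = td_target - q_table[(state, action)]  # how far off the current estimate is
    q_table[(state, action)] += learning_rate * td_error
    return q_table[(state, action)]


q_table = defaultdict(float)   # unseen (state, action) pairs start at 0.0
q_table[("s1", 0)] = 2.0       # pretend we already learned something about state "s1"

new_value = q_learning_update(
    q_table, state="s0", action=1, reward=1.0, next_state="s1", num_actions=2
)
print(new_value)
```

CartPole observations are continuous, so a table like this would need the state discretized first. A network sidesteps that by taking the raw observation as input.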

## Setting up the problem

We will need several variables before we jump into the event loop. All of those are defined below.

In [48]:
#   A basic optimizer and loss function
optimizer = keras.optimizers.SGD()
loss_function = keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)

# These variables will be used for Temporal Difference training
previous_reward = 0.0
current_reward = 0.0
discount = 0.1


# These variables simply keep track of our progress through training
episode_count = 0
frame_count = 0  # CartPole-v1 caps an episode at 500 frames (CartPole-v0 caps it at 200)
last_episode = 1

# These variables will be used to determine how often to give a random action
epsilon = 0.8

# These are purely informational variables that won't be used for actual training
random_frame_count = 0
calculated_frame_count = 0
frames_per_episode = []

# Print out basic information
print( "We will be running {} episodes. Each can have up to 500 frames. The initial chance to take a random action is {}.".format(
    last_episode, epsilon
))

We will be running 1 episodes. Each can have up to 500 frames. The initial chance to take a random action is 0.8.


## Main Training Loop

We create the environment and step into the main loop. Right now it simply samples random actions from the environment over and over again.

In [59]:
env = gym.make('CartPole-v1') # Most recent CartPole environment as of June 2021

try:
    for episode_count in range( last_episode ):
        observation = env.reset() # Reset our environment for a new episode and get the initial state
        done = False # The new episode has not finished yet
        frame_count = 0 # Restart the number of frames
        while not done: # Continue until the environment reports that the episode is done
            env.render()
            frame_count += 1
            # Draw a number in [0, 1) that will later be compared against epsilon to decide
            # between a random action and a calculated one (not used yet)
            chance = np.random.random()

            # Run a random action
            action = env.action_space.sample()
            observation, reward, done, _ = env.step( action )
        # Actions to take at the end of each episode
        # Print out statistics about the episode
        print( "Episode was finished after {} frames.".format(
            frame_count
        ))
except Exception as inst:
    print("Error occurred, closing environment manually.")
    print( "Type of exception {}".format(type(inst)) )
    print( "Arguments of exception {}".format(inst.args) )
    print( inst )
finally:
    # Close the environment whether or not every episode ran cleanly
    env.close()


# Evaluation