# Project 3: Cartpole with Gym

In this project we'll be learning about reinforcement learning.

Reinforcement learning falls into the category of 'Machine Learning' but it may or may not use other methods we've discussed so far, such as deep neural networks. Categorically, it falls on the same level as supervised and unsupervised learning. We haven't really talked about this yet, so let's dig in a litle more:

The three types of machine learning:

- **Supervised learning:** This is all we've done so far. Supervised learning involves feeding labeled data into a model. It's 'supervised' in that you're feeding it the data **and** the answers. In our MNIST dataset these were the picutres of the numbers, and the lables.

- **Unsupervised learning:** Imagine supervised learning but without the labels. Unsupervised learning algorithms will develop their own categorizions of data. 
    - Unsupervised didn't seems useful to me for anything other than data exploration at first glance...so here's a little more explanation: An insteresting exapmle of where you might use this is in a recommendation system. Imagine you're Amazon and you're wanting to recommend similar items to similar customers. Rather than you having to say something like "There are two kinds of customer in this world! BLANK and BLANK," you can just let your firendly unsupervised algorithm explore your customer's habits and do the classification for you. 
    
- **Reinforcement learning:** Reinforcement is unlike supervised or unsupervised learning in that it doesn't use a dataset at all. The data that you feed into a reinforcement learning algorithm is the state of the environment.
    - The clearest example of what I mean by feeding "the state of the environment" is a game. Imagine you're trying to an AI how to play tic-tac-toe. You have to:
        1. Set up the environment for the AI to play in. It's not like you can just hand your computer a pencil and paper (unless that's one BA AI you're working with).
        2. Feed the current state of the environment into the AI. This 'state' could be a blank board, just an X in the top left corner, aleternating X's and O's along the right side, etc.
        3. Let the AI make a desicion and execute on it (Draw an X or O somewhere on the board).
        4. Give feedback to the AI about the success or failure of its move. Update the AIs decision making process accordingly. 
        5. Repeat steps 2-4 until the game is over.
        6. Repeat steps 1-5 until your AI is actually good at tic-tac-toe.
        
**Agent:** I'm assuming anyone familiar with reinforcement learning probably winced at my usage of the term 'AI' above. The term used in reinforcement learning for the entity that actually makes the desicion is an agent. This could be anything from a deep neural network to an algorithm that just multiplies everything by 42 and outputs the results.
        
[OpenAI's gym framework](http://gym.openai.com/) allows us to set up an environment, get the state from that environment, and feed our actions back into the environemnt. It basically handles everything outside of the agent and training the agent. Gym comes pre-loaded with sets of environemnts that include toy-like games, 2D and 3D robot simulations, and Atari games. 

We're going to be diving into gym using the CartPole environment. This environment fits into the 'toy-like game' category. It's a 2D game where we're trying to balance a pole on top of a cart. We can only move the cart left or right. If the pole leans too far in either direction or if we go off screen, we lose. This isn't the most impressive environment, but it's the standard beginner's environment and is a well-trod path for us to start our reinforcement learning journey.

In [1]:
# Install gym and tensorflow, if needed
# ! pip install gym
# ! pip install tensorflow



You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


## Using Gym

Once we've imported gym, here are the main functions we'll be using:

- **gym.make(ENVIRONMENT)**: This function actually creates our environemnt. You pass the name of the gym envronment that you want to use. In our case that's 'CartPole-v1'.
- **.reset()**: This function is called on the environment. This initializes the environment and returns the first observation. An observation is the current state of the game. This is an array of 4 floats representing:
    1. The cart's horizontal position (0 is the middle)
    2. The cart's velocity
    3. The angle of the pole (0 is centered)
    4. The angular velocity of the pole at its tip.
    
    -**Note:** Observations are dependant on the environment. Obviously, 'angular velocity of pole' would be useless in pac-man.
- **.step(ACTION)**: This function feeds your action into the environment. In CartPole our only actions are 0 (move left) and 1 (move right). 
    - step returns 4 values in this order:
        1. observation: An observation with 4 values (for CartPole), just like in reset()
        2. reward: A float with the change in reward. You use reward to enforce positive behavior.
        3. done: A boolean that lets you know if the game is over. You either beat the game or lost.
        4. info: A dict with debugging information. You can use this info in your training, if you're a cheater!
        
Ok, we're not going to use this one, but I want to let you know it's there:
- **.render()**: Displays a picture of our current environment.
    1. This function doesn't work well in Jupyter Notebooks. You have to do slightly hacky things to make it work.
    2. CartPole has an issue that makes it especially difficult (it doesn't respect the mode='rgb_array' argument).
    3. ....so we aren't going to play with this :(

### Figuring out gym

I initially learned everything I know about gym from a [book](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291) (which is somewhat hard to get into but has some serious payoff) and some articles on Medium. The offical docs are **seriously** lacking. 

[This is an ok place to just learn the basics of how gym works](https://gym.openai.com/docs/)

But how, for example, do you figure out that CartPole's observation is a 4 float array and what the numbers in that array mean?

After way too much time poking around the official docs site, I learned that you [have to go to github if you want the real docs](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py). Once you know the basics of gym, it appears that docstring on the env's class is your key to success.

Note 1 issue with github: The docstring say that the max game is 200. This isn't true. It's 500.

In [18]:
import gym

# Make our cartpole environment
env = gym.make('CartPole-v1')
# Initialize the environment
obs = env.reset()
# Let's see that inital observation array
print(f'Initial observation {obs}')
print(f'Cart Horizontal: {obs[0]}')
print(f'Cart Velocity: {obs[1]}')
print(f'Pole Angle: {obs[2]}')
print(f'Pole Velocity: {obs[3]}')

# Dramatic flair
print('=========================================================')
print('Moving left!')
print('=========================================================')

# Set our action to move left (0) then execute (step)
action = 0
obs, reward, done, info = env.step(action)

# Breaking down the return values for our action
print(f'2nd observation: {obs}')
print(f'Reward: {reward}')
print(f'Done: {done}')
print(f'Info:{info}')

# Clear our work
env.reset()

Initial observation [ 0.01832985 -0.03517071 -0.01118265 -0.04422664]
Cart Horizontal: 0.01832984660410783
Cart Velocity: -0.035170709847082995
Pole Angle: -0.01118265173168511
Pole Velocity: -0.04422664243138519
Moving left!
2nd observation: [ 0.01762643 -0.23013054 -0.01206718  0.24490718]
Reward: 1.0
Done: False
Info:{}


array([-0.01742081,  0.01284559,  0.02774481,  0.00797109])

### That's really all there is...

That's really all you need to know to actually use gym. We have an environment that we can get observations from and feed actions into. Now our task is to make an agent capable of playing the game. 

There's really two main steps left: designing our agent and training our agent. I'm going to start with extremely simple agents so that we can first fous in on the steps in the traning process.

### Dumb Agent 1

For our first AI agent we're just going to statically code a policy that moves the cart left when the pole leans left and right when the pole is leaning right. When you break down balancing, this left=left, right=right policy is all that you're really doing anyway. 

More importantly: This will give us the framework that we can build on to create a real, learning agent later.

Also: Time for a new reinforcement learning term!

**Episode**: A single run through the environment. In our case it's a single game of CartPole.

In [13]:
import gym

env = gym.make('CartPole-v1')

# Behold, our AI agent
def dumb_policy(obs):
    pole_angle = obs[2]
    # An angle of less than zero is left leaning
    if pole_angle < 0:
        return 0
    else:
        return 1

# We'll use this to track our top score
totals = []

# Let's play 10,000 games
for episode in range(10000):
    # Initialize our rewards and environment
    episode_reward_total = 0
    obs = env.reset()
    # 500 is the max we can play anyway
    for step in range(500):
        # Get our next action then execute on it and get our reward
        action = dumb_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_reward_total += reward
        # Stop this episode if we lost
        if done:
            break
    totals.append(episode_reward_total)
    
# Top score?
print(f'Max: {max(totals)}')
# Worst Score?
print(f'Min: {min(totals)}')
# Average Score?
print(f'Average: {round(sum(totals) / len(totals), 0)}')

Max: 72.0
Min: 24.0
Average: 42.0


#### That Went Suprisingly Well

I fully expected a score close to 10. Our minimum socre doubled that! Maybe it wasn't so dumb after all...

Regardless of my overly agressive naming: Reading through the code above gives you everything you need to know to start using gym. You can replace the dumb_policy function with anything that returns a 0 or 1 and it would work. 

### Dumb Agent 2

This next agent still isn't going to be anything super fancy, but I found it in [a guide](https://towardsdatascience.com/from-scratch-ai-balancing-act-in-50-lines-of-python-7ea67ef717) while researching for this notebook and I want to play with some concepts. We're going to generate policies at random, figure out one that works best, and use that to output our 0 or 1.

Our policy is going to be an array of 4 randomly generated numbers. We're going to take that array and multiply it by the 4 digit array that makes up our observation. 

Did you say 'new term'?!

**Dot product**: Ok, this is a math term and not a machine learning specific term, but it's important to machine learning. This multiplication of 2 arrays is called a dot product. You don't really need to know more detail than that for now. Just letting you know that you may see this again in your ML journey.

What's interesting here is that we're 'randomly exploring our environment' and when we stumble across a policy that works we're sticking with it. It's 'machine learning' in its most rudimentary form.

In [5]:
import gym
# Using numpy for all the fancy maths we're about to commit
import numpy as np

env = gym.make('CartPole-v1')

# Notice that we're feeding in our random array
def slightly_smarter_policy(random_array, obs):
    dot_product = np.dot(random_array, obs)
    if dot_product < 0:
        return 0
    else:
        return 1

totals = []
best_array = []
tries = 0

# Let's play 10,000 games, with 10,000 different arrays
for episode in range(10000):
    episode_reward_total = 0
    obs = env.reset()
    # Generate random 1X4 array. Value will be between 0 and 1.
    # Our obs can include negative numbers, so we want the array to include negatives
    # Subtract 0.5 to make our values -0.5 to 0.5
    random_array = np.random.rand(1,4) - 0.5
    for step in range(500):
        action = slightly_smarter_policy(random_array, obs)
        obs, reward, done, info = env.step(action)
        episode_reward_total += reward
        if done:
            break
    if totals:
        if episode_reward_total > max(totals):
            best_array = random_array
    totals.append(episode_reward_total)
    # Stop if we beat the game
    if episode_reward_total == 500:
        tries = episode + 1
        break
    
    

print(f'Max: {max(totals)}')
print(f'Min: {min(totals)}')
print(f'Average: {round(sum(totals) / len(totals), 0)}')
# Number of tries
print(f'Number of tries: {tries}')
# Best Array
print(f'Best Array: {best_array}')

Max: 500.0
Min: 8.0
Average: 50.0
Number of tries: 28
Best Array: [[-0.04915913 -0.18100897  0.23591353  0.39705752]]


#### We win!

We just hit the max score of 500. Interestingly:

- We hit a lower low than our first round. It seems that our designed policy in the first iteration performed better than the worst totally random policy.
- We didn't need very many tries to beat the game. I ran it a few times and the numbers ranged from 17-66. Never came close to needing 10,000. 

This completely random guessing worked well for this simple task. But what if we wanted an agent that **really** learned. Time for a neural net!

### Neural Network Agent 1

For this task we'll be combining what we learned in the previous project about neural netowrks and what we've learned so far about reinforcement learning.

Recall that the NN we trained last time needed input data and labels. We don't have that for CartPole, so let's just make our own!

Here's a rough breakdown of what we're about to do:
1. Run our slightly_smarter_policy again, but this time we're not going to stop the first time it wins.
2. Every time we find a policy that beats the game we're going to record all of the observations and their associated actions.
3. We're going to write the observations and actions to CSVs. These will be used as data and labels.

Note that this is going to generate a variable amount of data. I'm expecting a large amount, but it's technically possible to not get any useable data.

As usual, I'm not going to upload the data we're about to generate. You'll need to run the following cell yourself if you want data.

In [2]:
# Used to write our CSVs
import csv
import gym
import numpy as np
# Used to created the data directory, if needed.
import os

# The same policy as last time.
def slightly_smarter_policy(random_array, obs):
    dot_product = np.dot(random_array, obs)
    if dot_product < 0:
        return 0
    else:
        return 1
    

# Writes the current data to CSV
def record_data(observations, actions):
    with open('./cartpole_data/observations.csv', mode='a') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for row in observations:
            writer.writerow(row)

    with open('./cartpole_data/actions.csv', mode='a') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for row in actions:
            writer.writerow([row])
                

def generate_data(env):
    # Lists to store all good observations and actions
    training_observations = []
    training_actions = []
    
    # Create data dir
    if not os.path.exists('./cartpole_data'):
        os.makedirs('./cartpole_data')
        
    # Run 1 million different policies
    for episode in range(1000000):
        # Lists to store episode's observations and actions
        espisode_observations = []
        episode_actions = []
        episode_reward_total = 0
        obs = env.reset()
        # Record inital observation
        espisode_observations.append(obs)
        random_array = np.random.rand(1,4) - 0.5
        for step in range(500):
            action = slightly_smarter_policy(random_array, obs)
            # Record action
            episode_actions.append(action)
            obs, reward, done, info = env.step(action)
            episode_reward_total += reward
            if done:
                break
            # Note that we are NOT recording the final observation (if done; break),
            # as it doesn't result in an action.
            espisode_observations.append(obs)

        # Add sucessful policy actions to data recording lists
        if episode_reward_total == 500:
            training_observations += espisode_observations
            training_actions += episode_actions
                
        # Dump every 5000 observation/action sets to CSV and reset lists
        if len(training_observations) >= 5000:
            record_data(training_observations, training_actions)
            del training_observations[:]
            del training_actions[:]

    # Dump remaining data
    if training_observations:
        record_data(training_observations, training_actions)
            

env = gym.make('CartPole-v1')
generate_data(env)

#### That's alot of data!

Well, 1 million policy tries may have been excessive:

```
[mac@localhost cartpole_data]$ ls -lh
total 1.3G
-rw-rw-r--. 1 mac mac  45M Oct 19 13:50 actions.csv
-rw-rw-r--. 1 mac mac 1.3G Oct 19 13:50 observations.csv
[mac@localhost cartpole_data]$ wc -l actions.csv
15612500 actions.csv
```

That's a 45MB file with just a single number per line!

So we have 15,612,500 differnt episodes to feed into our neural netowrk. Remember that these are only the obervations from games that hit 500 points. This means that we had 31,225 successful randomly generated policies.

#### Time to train a neural net

Around a month ago tensorflow relased their verions 2.0. This version included major quality of life improvements. One of the most significant gains (at least for you and I) is the inclusion of keras within tensorflow. You no longer have to install both and run Keras as a separate library.

I'm going to switch over to the tensorflow implementation becasue I anticipate that this will quickly become the norm (as if my anticipation has any authority...). You'll find that this looks exactly like the keras implementation from our last notebook, so I'm not going to annotate much. I reccomend going back to P2 if you want my play by play of keras and neural netowrks.

[Also, here are the tensorflow.keras docs, if you want to read more]('https://www.tensorflow.org/guide/keras')

In [46]:
import pandas as pd
import tensorflow as tf
from tensorflow import keras

features = pd.read_csv('./cartpole_data/observations.csv')
labels = pd.read_csv('./cartpole_data/actions.csv')

'''
Ok this looks a little nasty...
Given the variablity of the size of the input data, I just desicded to split based on percentages.
The bottom 80% of data goes to traning and the top 20% goes to testing.
'''
training_labels = labels[:int(labels.shape[0] * 0.8)]
training_features = features[:int(features.shape[0] * 0.8)]
test_labels = labels[int(labels.shape[0] * 0.8):]
test_features = features[int(features.shape[0] * 0.8):]

# Good ol' One Hot!
training_labels = keras.utils.to_categorical(training_labels, num_classes=2)
test_labels = keras.utils.to_categorical(test_labels, num_classes=2)

print(f'training_labels: {training_labels.shape}')
print(f'training_features: {training_features.shape}')
print(f'test_labels: {test_labels.shape}')
print(f'test_features: {test_features.shape}')

model = keras.Sequential()
model.add(keras.layers.Dense(32, activation='relu', input_dim=4))
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(8, activation='relu'))
model.add(keras.layers.Dense(4, activation='relu'))
model.add(keras.layers.Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(training_features, training_labels,
          epochs=10, batch_size=128,
          validation_data=(test_features, test_labels))

model.save('cartPole_model.h5')


training_labels: (12489999, 2)
training_features: (12489999, 4)
test_labels: (3122500, 2)
test_features: (3122500, 4)
Train on 12489999 samples, validate on 3122500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Interpreting our training

I'm not sure what to think about our training. I started with a NN with 2, 4 node hidden layers. I found that increasing the layers helped up to a point. All but the smallest models seemed to settle around 91%.

Is 91% good? It's high, but not amazing. Also, consider the fact that this data isn't black and white. It's data generated from random polices that just happen to perform well. 

What's 91% translate to in a game? If I get 90% of my answers correct, can I balance the pole forever? Maybe...idk!

Ok, let's just test it.

In [1]:
import gym
import numpy as np
from tensorflow import keras

env = gym.make('CartPole-v1')
policy = keras.models.load_model('cartPole_model.h5')

totals = []

for episode in range(10000):
    episode_reward_total = 0
    obs = env.reset()
    for step in range(500):
        # keras' .predict expects a batch and input tensor
        # We're feeding in a single, 4 item array, hence (1, 4)
        # Remember, argmax allows us to decode the one hot encoded output
        action = np.argmax(policy.predict(obs.reshape(1, 4)))
        obs, reward, done, info = env.step(action)
        episode_reward_total += reward
        if done:
            break

    totals.append(episode_reward_total)

print(f'Max: {max(totals)}')
print(f'Min: {min(totals)}')
print(f'Average: {round(sum(totals) / len(totals), 0)}')

Max: 500.0
Min: 222.0
Average: 490.0


### And the verdict is...

|Policy |Min  |Max  |Average |
|-------|-----|-----|--------|
|Dumb 1 |24   |72   |42      |
|Dumb 2*|8    |500  |50      |
|Neural |222  |500  |490     |

\* Dumb 2 only had 28 runs, compared to the 10,000 runs of the other poicies.

Our NN clearly blew our other techniques out of the water.

### We did it!

Ok, no we didn't...

Keep in mind that isn't exactly your standard reinforcement learning technique.  Ideally, we want our agent to explore and learn on its own. The only reason this worked is because we randomly generated simple policies that just happen to win. We're mostly just fourtunate that cartpole is an easy game. We were able to generate enough good, random data to have a good dataset.

But what if we wanted our agent to learn while it randomly explored, not just after? Think of all the wasted CPU cycles that went into our last method! What if the task was complicated enough to need knowledge of previous states? What happens when we want to play chess, or go, or DOTA, or create skynet...how do we do that?

Stay tuned for CartPole Redeux: Deep Q Learning.