# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
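import os
# Set the variable in this process (os.system would only set it in a separate shell)
# and before the numerical libraries load, to avoid duplicate OpenMP runtime errors.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam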
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  # the logged exploration rates match 0.995 per replay step (0.995 ** 13 ≈ 0.9369 after run 2)
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(learning_rate=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state, verbose=0)  # verbose=0 suppresses per-call progress output
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
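                # Bellman target: reward plus the discounted best predicted value of the next state
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next, verbose=0)[0]))
            q_values = self.model.predict(state, verbose=0)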
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state[0], (1, observation_space)) 
        step = 0
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            # gym >= 0.26 returns (observation, reward, terminated, truncated, info)
            state_next, reward, terminal, truncated, info = env.step(action)
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, (1, observation_space))  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:   
                score_logger.add_record(
                    step, run, dqn_solver.exploration_rate,
                    GAMMA, LEARNING_RATE, MEMORY_SIZE, BATCH_SIZE,
                    EXPLORATION_MAX, EXPLORATION_MIN, EXPLORATION_DECAY,
                )
                break  
            dqn_solver.experience_replay()  
    


In [None]:
cartpole()

Run: 1, exploration: 1.0, score: 11
Scores: (min: 11, avg: 11, max: 11)

Run: 2, exploration: 0.9369146928798039, score: 22
Scores: (min: 11, avg: 16.5, max: 22)

Run: 3, exploration: 0.7439808620067382, score: 47
Scores: (min: 11, avg: 26.666666666666668, max: 47)

Run: 4, exploration: 0.6935613678313175, score: 15
Scores: (min: 11, avg: 23.75, max: 47)

Run: 5, exploration: 0.6401093727576664, score: 17
Scores: (min: 11, avg: 22.4, max: 47)

Run: 6, exploration: 0.5907768628656763, score: 17
Scores: (min: 11, avg: 21.5, max: 47)

Run: 7, exploration: 0.5507399854171277, score: 15
Scores: (min: 11, avg: 20.571428571428573, max: 47)

Run: 8, exploration: 0.5238143793828016, score: 11
Scores: (min: 11, avg: 19.375, max: 47)

Run: 9, exploration: 0.4982051627146237, score: 11
Scores: (min: 11, avg: 18.444444444444443, max: 47)

Run: 10, exploration: 0.446186062443672, score: 23
Scores: (min: 11, avg: 18.9, max: 47)

Run: 11, exploration: 0.42650460709830135, score: 10
Scores: (min: 10, avg: 18.09090909090909, max: 47)

Run: 12, exploration: 0.3706551064126331, score: 29
Scores: (min: 10, avg: 19, max: 47)

In [None]:
Explain how reinforcement learning concepts apply to the cartpole problem.
What is the goal of the agent in this case?
The goal of the agent is to balance a pole vertically on a rotating pivot point by moving the pole base horizontally.
The agent must determine what the appropriate movements are for the cart (the base), while not exceeding a prespecified distance
from the initial position.

What are the various state values?
The state values depend on how the cartpole problem is represented. For example, the problem could be
represented digitally or as a physical model. In "Reinforcement Learning Concept on Cart-Pole with DQN",
Phy expounds on this idea, noting that "A state can be the current frame (in pixels) or it can be some 
other information that can represent the cart-pole, for instance, the velocity and position of the cart, 
the angle of the pole and pole velocity at the tip" (Phy, 2019). In this notebook, the state is the observation
vector returned by the CartPole-v1 environment, which the code reshapes to (1, observation_space) before
passing it to the network.
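
As a rough illustration of the non-pixel representation described above, the sketch below names the four values that Gym's CartPole-v1 documentation lists for an observation and checks the documented failure limits. The `CartPoleState` class and `has_failed` helper are hypothetical names introduced here; they are not part of Gym or of this notebook.

```python
import math
from typing import NamedTuple


class CartPoleState(NamedTuple):
    """Hypothetical wrapper naming the four CartPole-v1 observation values."""
    cart_position: float          # distance of the cart from the track centre
    cart_velocity: float
    pole_angle: float             # radians from vertical
    pole_angular_velocity: float


# Failure limits as given in the Gym CartPole documentation (assumed here).
POSITION_LIMIT = 2.4
ANGLE_LIMIT = math.radians(12)


def has_failed(state: CartPoleState) -> bool:
    """Return True when the cart leaves the track or the pole tips too far."""
    return (abs(state.cart_position) > POSITION_LIMIT
            or abs(state.pole_angle) > ANGLE_LIMIT)


upright = CartPoleState(0.0, 0.1, 0.05, -0.2)
tipped = CartPoleState(0.0, 0.0, 0.30, 0.0)
print(has_failed(upright), has_failed(tipped))
```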

What are the possible actions that can be performed?
The possible actions are for the cart to move left or right, and these actions can occur in 
rapid succession, separated by a small time interval.
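
The two actions can be sketched together with the exploration rule that the notebook's `act` method follows: a random action with probability equal to the exploration rate, otherwise the action with the highest predicted Q-value. This is a simplified stand-alone version; `choose_action` and `ACTION_NAMES` are hypothetical names, and the mapping of 0 and 1 to left and right follows the Gym CartPole documentation.

```python
import random

# Action encoding assumed from the Gym CartPole documentation.
ACTION_NAMES = {0: "push cart left", 1: "push cart right"}


def choose_action(q_values, exploration_rate, rng=random):
    """Epsilon-greedy choice over a list of per-action Q-values."""
    if rng.random() < exploration_rate:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest predicted Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy = choose_action([0.2, 0.9], exploration_rate=0.0)
print(greedy, ACTION_NAMES[greedy])
```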

What reinforcement algorithm is used for this problem?
Deep Q-Learning is used for this problem: a neural network approximates the Q-function, allowing rapid 
estimation of Q-values with low computational overhead. According to Balawejder, the algorithm relies on 
"experience replay that randomizes over the data, thereby removing 
correlations in the observation sequence and smoothing over changes in the data distribution. To perform 
experience replay we store the agent’s experiences $e_t = (s_t, a_t, r_t, s_{t+1})$" (Balawejder, 2022). In this
expression for an experience, the terms refer to the four components recorded at time $t$: the state, the action,
the reward, and the next state.
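
A minimal sketch of such an experience store, assuming a bounded `deque` like the one in the notebook's `DQNSolver` (the tiny capacity and the string placeholders for states are illustrative only):

```python
import random
from collections import deque

# Bounded memory: once full, the oldest experience is dropped automatically.
memory = deque(maxlen=5)

for t in range(8):
    # Each entry mirrors the notebook's tuple: (state, action, reward, next_state, done).
    memory.append((f"s{t}", t % 2, 1.0, f"s{t + 1}", False))

# Sampling uniformly at random breaks up the time-ordering of consecutive steps.
batch = random.sample(list(memory), 3)
print(len(memory), batch)
```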

Analyze how experience replay is applied to the cartpole problem.
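Experience replay "looks back" on past transitions rather than learning only from the most recent one. In this notebook,
every step's (state, action, reward, next state, done) tuple is appended to a memory deque, and after each non-terminal
step a random batch of BATCH_SIZE (20) stored transitions is sampled for training. For each sampled transition, the model
incorporates its best guess about the value of the next state into the target for the action taken, in an
attempt to maximize total reward. This information is used to increase the likelihood of successful state-action pairings.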

How does experience replay work in this algorithm?
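Actions are executed in the environment (randomly with probability equal to the exploration rate, otherwise 
greedily from the network's Q-values), and a reward value is associated with each transition from 
one state to another. For each replayed transition, the target for the chosen action is the reward plus the 
discounted maximum predicted Q-value of the next state (or the reward alone if the episode ended). The weights 
in the neural net model are then updated by gradient descent (Adam) to reduce the squared error between the 
predicted Q-value and this target, and the exploration rate is decayed toward EXPLORATION_MIN.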
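
The target that each replayed transition is trained toward can be sketched without a neural network. The function below mirrors the `q_update` logic in the notebook's `experience_replay` method; `q_target` is a hypothetical name and the Q-values are made-up numbers standing in for model predictions.

```python
GAMMA = 0.95  # discount factor, as in the notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Training target for the action taken in one stored transition."""
    if terminal:
        # No future reward is available after the episode ends.
        return reward
    # Reward now plus the discounted best predicted value of the next state.
    return reward + gamma * max(next_q_values)


# Only the entry for the action actually taken is overwritten;
# the other action keeps the network's current prediction.
current_q = [1.5, 2.0]
action_taken = 1
current_q[action_taken] = q_target(1.0, next_q_values=[2.0, 4.0], terminal=False)
print(current_q)
```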

What is the effect of introducing a discount factor for calculating the future rewards?
The discount factor determines how much value is placed on the results of future actions. A factor of zero results 
in a myopic (greedy) algorithm that considers only the immediate reward, while a factor of one causes the model to 
evaluate each of its actions based on the undiscounted sum of all of its future rewards. Intermediate values, such as 
the GAMMA of 0.95 used here, discount rewards that occur further in the future, allowing rewards closer to the 
present to weigh more heavily in the selection of an action.
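
As a small worked example of this effect, the sketch below computes the discounted sum of a short, made-up sequence of rewards for several discount factors. The helper name `discounted_return` and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma ** k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted remainder.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]  # one point per step survived
for gamma in (0.0, 0.5, 0.95, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With a factor of zero only the first reward counts, with a factor of one all four count equally, and values in between shrink the contribution of later rewards.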

Analyze how neural networks are used in deep Q-learning.
Q-learning creates a table of states and actions and updates the table to represent approaches that are more successful. For 
complicated tasks with a large number of table entries, storing and updating such a table can become
computationally expensive. In deep Q-learning, a neural network is used to approximate the Q-table, with the trained
network obtaining high performance with a low amount of run-time computation.
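
For contrast with the network-based approach, a tabular Q-learning update can be sketched as follows. The step size `ALPHA`, the string state labels, and the `q_update` helper are assumptions made for illustration; the notebook itself never builds such a table.

```python
from collections import defaultdict

ALPHA = 0.1    # hypothetical tabular step size
GAMMA = 0.95   # discount factor, as in the notebook
N_ACTIONS = 2

# One row of Q-values per state, created on first access.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(state, action, reward, next_state, terminal):
    """Move the table entry a fraction ALPHA of the way toward the target."""
    if terminal:
        target = reward
    else:
        target = reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (target - q_table[state][action])


q_update("a", 1, 1.0, "b", terminal=False)
q_update("b", 0, 1.0, "c", terminal=True)
print(dict(q_table))
```

Deep Q-learning replaces the `q_table` lookup with a network prediction and the in-place update with a gradient step on the network weights.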

Explain the neural network architecture that is used in the cartpole problem.
A deep Q-learning model should use at least one hidden layer; different numbers of hidden layers can be used for different problems.
In this notebook, the input layer receives the state observation (observation_space values), followed by two fully connected hidden 
layers of 24 ReLU units each. The output layer has one linear unit per possible action (action_space units). Each output 
neuron predicts the Q-value for that action at each step. The output with the maximum Q-value is the chosen action for the current 
step, unless a random exploratory action is taken instead.
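
To make the data flow concrete, here is a dependency-free sketch of a forward pass through a network with the same layer sizes as the notebook's Keras model (4 inputs, two hidden layers of 24 ReLU units, 2 linear outputs). The weights are random and untrained, and `init_layer`, `dense`, and `predict_q` are hypothetical helpers, not Keras functions.

```python
import random

rng = random.Random(0)


def init_layer(n_in, n_out):
    """Random weights (one row per output unit) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, relu):
    """Fully connected layer: weighted sum plus bias, optionally passed through ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, row)) + bias
        outputs.append(max(0.0, z) if relu else z)
    return outputs


# Layer sizes taken from the notebook: 4 -> 24 -> 24 -> 2.
layers = [init_layer(4, 24), init_layer(24, 24), init_layer(24, 2)]


def predict_q(state):
    """Return one Q-value per action for a 4-value state."""
    hidden = state
    for index, (weights, biases) in enumerate(layers):
        is_output_layer = index == len(layers) - 1
        hidden = dense(hidden, weights, biases, relu=not is_output_layer)
    return hidden


q_values = predict_q([0.01, -0.02, 0.03, 0.04])
action = q_values.index(max(q_values))  # greedy action
print(q_values, action)
```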

How does the neural network make the Q-learning algorithm more efficient?
In normal Q-learning, a table is maintained with a value for every pairing of a state with an action. Each time the 
table is updated or queried, the computer system that the model runs on uses up compute cycles and memory. 
For complex activities, this table can grow to a size that makes the time and memory cost of the Q-learning model 
very high. Deep Q-learning reduces this cost significantly by storing the information in the table in the
neural network weights (representing a large amount of pre-computation). The model is trained to approximate the
solutions represented in the untenable Q-learning table; once trained, the deep learning version maps the current state 
to an action with very little further computation.
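
A back-of-the-envelope comparison illustrates the scaling argument. It assumes each of the four continuous state values would be discretized into a fixed number of bins to build a table (the bin counts are arbitrary choices for illustration) and compares that with the number of trainable weights in a fully connected 4-24-24-2 network.

```python
def q_table_entries(bins_per_dim, n_state_dims=4, n_actions=2):
    """Entries in a table with one value per (discretized state, action) pair."""
    return bins_per_dim ** n_state_dims * n_actions


def mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


for bins in (10, 100):
    print(f"{bins} bins per dimension: {q_table_entries(bins):,} table entries")
print(f"4-24-24-2 network: {mlp_parameters([4, 24, 24, 2]):,} parameters")
```

The table grows with the fourth power of the resolution, while the network size is fixed by its architecture.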

What difference do you see in the algorithm performance when you increase or decrease the learning rate?
The initial learning rate of 0.001 resulted in a model solution in 72 runs (172 total runs). When the learning rate was doubled to
0.002, the solution was found in 565 runs (665 total runs). Order-of-magnitude deviations from the initial learning rate of 0.001,
in either direction, resulted in performance loss. Thus, while just a small increase in learning rate resulted in a large degradation
in performance, performance degradation is also an issue for very small learning rate values.
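
The general trade-off can be seen in a toy setting that is much simpler than the DQN: plain gradient descent on f(w) = w². This sketch does not reproduce the run counts reported above and uses made-up learning rates; it only shows that a step size that is too small converges slowly, while one that is too large diverges.

```python
def distance_from_minimum(learning_rate, steps=100, w=1.0):
    """Run gradient descent on f(w) = w**2 and return |w| after the given steps."""
    for _ in range(steps):
        gradient = 2.0 * w
        w -= learning_rate * gradient
    return abs(w)


too_small = distance_from_minimum(0.001)   # barely moves toward the minimum at 0
moderate = distance_from_minimum(0.1)      # converges quickly
too_large = distance_from_minimum(1.1)     # overshoots further on every step
print(too_small, moderate, too_large)
```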


References

Balawejder, M. (2022, February 20). Solving open ai's cartpole using reinforcement learning part-2. Medium. Retrieved June 4, 2022, 
    from https://medium.com/analytics-vidhya/solving-open-ais-cartpole-using-reinforcement-learning-part-2-73848cbda4f1 
    
Phy, V. (2019, November 4). Reinforcement learning concept on CART-pole with DQN. Medium. Retrieved June 4, 2022, 
    from https://towardsdatascience.com/reinforcement-learning-concept-on-cart-pole-with-dqn-799105ca670 

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.