# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 10
Scores: (min: 10, avg: 10, max: 10)

Run: 2, exploration: 0.9801495006250001, score: 14
Scores: (min: 10, avg: 12, max: 14)

Run: 3, exploration: 0.9000874278732445, score: 18
Scores: (min: 10, avg: 14, max: 18)

Run: 4, exploration: 0.8142285204175609, score: 21
Scores: (min: 10, avg: 15.75, max: 21)

Run: 5, exploration: 0.7328768546436799, score: 22
Scores: (min: 10, avg: 17, max: 22)

Run: 6, exploration: 0.6242658676435396, score: 33
Scores: (min: 10, avg: 19.666666666666668, max: 33)

Run: 7, exploration: 0.5590843898207511, score: 23
Scores: (min: 10, avg: 20.142857142857142, max: 33)

Run: 8, exploration: 0.4484282034609769, score: 45
Scores: (min: 10, avg: 23.25, max: 45)

Run: 9, exploration: 0.43080185560799106, score: 9
Scores: (min: 9, avg: 21.666666666666668, max: 45)

Run: 10, exploration: 0.40769130904675194, score: 12
Scores: (min: 9, avg: 20.7, max: 45)

Run: 11, exploration: 0.3858205374665315, score: 12
Scores: (min: 9, avg: 19.90

Run: 91, exploration: 0.01, score: 163
Scores: (min: 9, avg: 132.01098901098902, max: 471)

Run: 92, exploration: 0.01, score: 178
Scores: (min: 9, avg: 132.5108695652174, max: 471)

Run: 93, exploration: 0.01, score: 132
Scores: (min: 9, avg: 132.50537634408602, max: 471)

Run: 94, exploration: 0.01, score: 47
Scores: (min: 9, avg: 131.59574468085106, max: 471)

Run: 95, exploration: 0.01, score: 187
Scores: (min: 9, avg: 132.17894736842106, max: 471)

Run: 96, exploration: 0.01, score: 142
Scores: (min: 9, avg: 132.28125, max: 471)

Run: 97, exploration: 0.01, score: 159
Scores: (min: 9, avg: 132.55670103092783, max: 471)

Run: 98, exploration: 0.01, score: 277
Scores: (min: 9, avg: 134.03061224489795, max: 471)

Run: 99, exploration: 0.01, score: 175
Scores: (min: 9, avg: 134.44444444444446, max: 471)

Run: 100, exploration: 0.01, score: 156
Scores: (min: 9, avg: 134.66, max: 471)

Run: 101, exploration: 0.01, score: 173
Scores: (min: 9, avg: 136.29, max: 471)

Run: 102, exploration

Run: 191, exploration: 0.01, score: 166
Scores: (min: 10, avg: 178.07, max: 478)

Run: 192, exploration: 0.01, score: 207
Scores: (min: 10, avg: 178.36, max: 478)

Run: 193, exploration: 0.01, score: 193
Scores: (min: 10, avg: 178.97, max: 478)

Run: 194, exploration: 0.01, score: 171
Scores: (min: 10, avg: 180.21, max: 478)

Run: 195, exploration: 0.01, score: 279
Scores: (min: 10, avg: 181.13, max: 478)

Run: 196, exploration: 0.01, score: 346
Scores: (min: 10, avg: 183.17, max: 478)

Run: 197, exploration: 0.01, score: 301
Scores: (min: 10, avg: 184.59, max: 478)

Run: 198, exploration: 0.01, score: 321
Scores: (min: 10, avg: 185.03, max: 478)

Run: 199, exploration: 0.01, score: 202
Scores: (min: 10, avg: 185.3, max: 478)

Run: 200, exploration: 0.01, score: 145
Scores: (min: 10, avg: 185.19, max: 478)

Run: 201, exploration: 0.01, score: 161
Scores: (min: 10, avg: 185.07, max: 478)

Run: 202, exploration: 0.01, score: 156
Scores: (min: 10, avg: 185.18, max: 478)

Run: 203, explora

NameError: name 'exit' is not defined

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
# Changes from the base run: LEARNING_RATE 0.001 -> 0.005, GAMMA 0.95 -> 0.99,
# EXPLORATION_MIN 0.01 -> 0.05, EXPLORATION_DECAY 0.995 -> 0.95
  
GAMMA = 0.99  
LEARNING_RATE = 0.005  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.05  
EXPLORATION_DECAY = 0.95  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [3]:
cartpole()

Run: 1, exploration: 0.8061065909263957, score: 63
Scores: (min: 63, avg: 63, max: 63)

Run: 2, exploration: 0.7147372386831305, score: 25
Scores: (min: 25, avg: 44, max: 63)

Run: 3, exploration: 0.653073201944699, score: 19
Scores: (min: 19, avg: 35.666666666666664, max: 63)

Run: 4, exploration: 0.5997278763867329, score: 18
Scores: (min: 18, avg: 31.25, max: 63)

Run: 5, exploration: 0.5562889678716474, score: 16
Scores: (min: 16, avg: 28.2, max: 63)

Run: 6, exploration: 0.531750826943791, score: 10
Scores: (min: 10, avg: 25.166666666666668, max: 63)

Run: 7, exploration: 0.40974000909221303, score: 53
Scores: (min: 10, avg: 29.142857142857142, max: 63)

Run: 8, exploration: 0.3897078735047413, score: 11
Scores: (min: 10, avg: 26.875, max: 63)

Run: 9, exploration: 0.36880183088056995, score: 12
Scores: (min: 10, avg: 25.22222222222222, max: 63)

Run: 10, exploration: 0.3438081748424137, score: 15
Scores: (min: 10, avg: 24.2, max: 63)

Run: 11, exploration: 0.3141460853680822, sco

Run: 87, exploration: 0.01, score: 279
Scores: (min: 9, avg: 97.57471264367815, max: 296)

Run: 88, exploration: 0.01, score: 182
Scores: (min: 9, avg: 98.5340909090909, max: 296)

Run: 89, exploration: 0.01, score: 210
Scores: (min: 9, avg: 99.78651685393258, max: 296)

Run: 90, exploration: 0.01, score: 339
Scores: (min: 9, avg: 102.44444444444444, max: 339)

Run: 91, exploration: 0.01, score: 28
Scores: (min: 9, avg: 101.62637362637362, max: 339)

Run: 92, exploration: 0.01, score: 138
Scores: (min: 9, avg: 102.02173913043478, max: 339)

Run: 93, exploration: 0.01, score: 211
Scores: (min: 9, avg: 103.19354838709677, max: 339)

Run: 94, exploration: 0.01, score: 198
Scores: (min: 9, avg: 104.20212765957447, max: 339)

Run: 95, exploration: 0.01, score: 199
Scores: (min: 9, avg: 105.2, max: 339)

Run: 96, exploration: 0.01, score: 216
Scores: (min: 9, avg: 106.35416666666667, max: 339)

Run: 97, exploration: 0.01, score: 247
Scores: (min: 9, avg: 107.80412371134021, max: 339)

Run: 9

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

Explain how reinforcement learning concepts apply to the cartpole problem.

- What is the goal of the agent in this case?

The goal of the agent in this case is to balance the pole vertically on a moving cart for as long as possible, gaining a reward for every timestep. 

- What are the various state values?

The state has four values: cart position, cart velocity, pole angle, and pole angular velocity.

- What are the possible actions that can be performed?

There are two possible actions: push the cart left or push the cart right. The agent picks one at every timestep to keep the pole upright for as long as possible.

- What reinforcement algorithm is used for this problem?

The reinforcement learning algorithm used for this problem is deep Q-learning, implemented as a deep Q-network (DQN).
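
The `act` method in the notebook pairs the DQN with an epsilon-greedy rule for picking actions. The sketch below shows that rule on its own, assuming a plain list of made-up Q-values in place of `model.predict`; `choose_action` is an illustrative name, not something defined in the notebook.

```python
import random


def choose_action(q_values, exploration_rate, rng=random):
    """Epsilon-greedy: explore with probability exploration_rate, else exploit."""
    if rng.random() < exploration_rate:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for [push left, push right].
q_values = [0.2, 0.7]
greedy = choose_action(q_values, exploration_rate=0.0)
print(greedy)  # always 1 when exploration_rate is 0.0
```

With `exploration_rate` at 1.0 every action is random, which matches the start of training in the notebook; as the rate decays, the agent relies more on its learned Q-values.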

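Analyze how experience replay is applied to the cartpole problem.
- How does experience replay work in this algorithm?

Experience replay works by storing each of the agent’s experiences (state, action, reward, next state, and whether the episode ended) in memory and then training on a random minibatch of 20 of them after each step. For each sampled experience, the target Q-value is the reward plus GAMMA times the highest predicted Q-value of the next state. If the experience ended the game, the target is just the reward, which the code makes negative, so state/action pairs that continue the game are pushed toward positive Q-values and pairs that end the game are pushed toward negative Q-values.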
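
The replay step can be sketched without Keras. This is a minimal illustration, assuming a hypothetical `fake_q` function in place of `model.predict` and a batch size of 4 instead of the notebook's 20; `replay_targets` is an illustrative name and only computes the training targets rather than fitting a model.

```python
import random
from collections import deque

GAMMA = 0.95    # same value as the base notebook
BATCH_SIZE = 4  # smaller than the notebook's 20, to keep the demo short

memory = deque(maxlen=1000)


def fake_q(state):
    """Hypothetical stand-in for model.predict: fixed Q-values for two actions."""
    return [1.0, 2.0]


def replay_targets(memory, batch_size, q_fn, gamma):
    """Sample stored transitions at random and compute a Q-learning target for each."""
    if len(memory) < batch_size:
        return []  # not enough experience yet, as in the notebook
    targets = []
    batch = random.sample(list(memory), batch_size)
    for state, action, reward, state_next, terminal in batch:
        target = reward
        if not terminal:
            # Bootstrapped target: reward plus discounted best next-state value.
            target = reward + gamma * max(q_fn(state_next))
        targets.append((state, action, target))
    return targets


# Store five made-up transitions; the last one ends the episode with reward -1.
for step in range(5):
    terminal = step == 4
    reward = -1.0 if terminal else 1.0
    memory.append((step, 0, reward, step + 1, terminal))

targets = replay_targets(memory, BATCH_SIZE, fake_q, GAMMA)
for state, action, target in targets:
    print(state, action, target)
```

Sampling at random, rather than training on the most recent steps in order, breaks up the correlation between consecutive timesteps.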

- What is the effect of introducing a discount factor for calculating the future rewards?

The discount factor, GAMMA in the code, is a value between 0 and 1 that weights future rewards less than immediate ones. As GAMMA approaches 1, the agent gives more weight to long-term rewards, which favors strategies that keep the pole balanced over many timesteps. A smaller GAMMA makes the agent more short-sighted.
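
A short worked example can make the effect of GAMMA concrete. The sketch below assumes a made-up episode of 200 steps with a reward of 1 per step and compares the two GAMMA values used in this notebook; `discounted_return` is an illustrative helper, not part of the notebook code.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [1.0] * 200  # hypothetical episode: +1 for each of 200 steps

for gamma in (0.95, 0.99):
    horizon = 1.0 / (1.0 - gamma)  # rough number of future steps that matter
    print(gamma, round(discounted_return(rewards, gamma), 2), round(horizon))
```

The quantity 1 / (1 - GAMMA) is a common rule of thumb for how far ahead the agent effectively looks: about 20 steps at 0.95 and about 100 steps at 0.99.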

Analyze how neural networks are used in deep Q-learning.

- Explain the neural network architecture that is used in the cartpole problem.

The neural network used in the cartpole problem has an input layer that takes the four state values (cart position, cart velocity, pole angle, and pole angular velocity). This feeds two fully connected hidden layers of 24 units each with ReLU activation, which extract features and learn a representation of the input state. These layers connect to a linear output layer with 2 units, one estimated Q-value for each action (Balawejder, 2022).
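
The layer sizes can be checked with a small pure-Python forward pass. This is only a sketch with random weights and a made-up state, assuming the 4-24-24-2 layout from the `DQNSolver` constructor; `init_layers` and `forward` are illustrative names, and the notebook itself uses Keras for this.

```python
import random

LAYER_SIZES = [4, 24, 24, 2]  # 4 state inputs, two hidden layers of 24, 2 action outputs


def init_layers(sizes, seed=0):
    """Create random weights and zero biases for each fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, state):
    """ReLU on hidden layers, linear output: one Q-value estimate per action."""
    x = state
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU
    return x


layers = init_layers(LAYER_SIZES)
param_count = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
q_values = forward(layers, [0.0, 0.1, 0.02, -0.1])  # made-up state
print(param_count)    # 770 trainable parameters
print(len(q_values))  # 2
```

The count breaks down as 120 + 600 + 50 parameters across the three dense layers, so the network is small enough to train on a CPU.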

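- How does the neural network make the Q-learning algorithm more efficient?

The neural network makes the Q-learning algorithm more efficient by estimating Q-values with a function instead of storing them in a table. The cartpole state values are continuous, so a table would need an entry for every possible state/action pair, while the network can generalize from the states it has seen to similar ones. The estimated Q-values represent the expected discounted reward for each state/action pair, which the agent uses to choose its actions.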

- What difference do you see in the algorithm performance when you increase or decrease the learning rate?

I adjusted four values: GAMMA from 0.95 to 0.99, the learning rate from 0.001 to 0.005, EXPLORATION_MIN from 0.01 to 0.05, and EXPLORATION_DECAY from 0.995 to 0.95. The base code solved the problem in 129 runs (229 total runs). After adjusting the values listed above, the problem was solved in 45 runs (145 total runs). This shows that the agent was able to solve the problem faster, with the averages very close in both examples, although the max value was higher in the base code. Because all four values changed at once, the improvement cannot be attributed to the learning rate alone.
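
One part of this change that can be checked directly is how quickly exploration falls to its minimum. The sketch below reuses the decay rule from `experience_replay` with the constants from the two code cells; `steps_to_floor` is an illustrative helper, and it counts replay updates, not runs.

```python
def steps_to_floor(decay, floor, start=1.0):
    """Count replay updates until the exploration rate hits its minimum."""
    rate, steps = start, 0
    while rate > floor:
        rate = max(floor, rate * decay)
        steps += 1
    return steps


base = steps_to_floor(decay=0.995, floor=0.01)
tuned = steps_to_floor(decay=0.95, floor=0.05)
print(base, tuned)  # 919 59
```

Under these settings the adjusted configuration stops decaying after far fewer updates, so it begins exploiting its learned Q-values much earlier. That may contribute to the faster solve, though this sketch does not isolate it from the other three changes.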

References:
Hands-on practical: Implementing DQN for Cartpole. (n.d.). https://apxml.com/courses/intermediate-reinforcement-learning/chapter-2-deep-q-networks-dqn/dqn-cartpole-practical 

PyLessons. (n.d.). Default site title. Python Lessons. https://pylessons.com/CartPole-DDQN 

EITCA Academy. (2024a, June 11). What is the significance of the discount factor ( gamma ) in the context of reinforcement learning, and how does it influence the training and performance of a DRL agent? https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/deep-reinforcement-learning/deep-reinforcement-learning-agents/examination-review-deep-reinforcement-learning-agents/what-is-the-significance-of-the-discount-factor-gamma-in-the-context-of-reinforcement-learning-and-how-does-it-influence-the-training-and-performance-of-a-drl-agent/#:~:text=The%20discount%20factor%20is%20a%20scalar%20value%20between,rewards%2C%20thus%20shaping%20the%20agent%27s%20behavior%20and%20strategy. 

Playing Cartpole with the actor-critic method  :  Tensorflow Core. TensorFlow. (n.d.). https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic

Balawejder, M. (2022, February 20). Solving open ai’s cartpole using reinforcement learning part-2. Medium. https://medium.com/analytics-vidhya/solving-open-ais-cartpole-using-reinforcement-learning-part-2-73848cbda4f1




