# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.9046104802746175, score: 40
Scores: (min: 40, avg: 40, max: 40)

Run: 2, exploration: 0.8390886103705794, score: 16
Scores: (min: 16, avg: 28, max: 40)

Run: 3, exploration: 0.7744209942832988, score: 17
Scores: (min: 16, avg: 24.333333333333332, max: 40)

Run: 4, exploration: 0.7328768546436799, score: 12
Scores: (min: 12, avg: 21.25, max: 40)

Run: 5, exploration: 0.6662995813682115, score: 20
Scores: (min: 12, avg: 21, max: 40)

Run: 6, exploration: 0.5937455908197752, score: 24
Scores: (min: 12, avg: 21.5, max: 40)

Run: 7, exploration: 0.5452463540625918, score: 18
Scores: (min: 12, avg: 21, max: 40)

Run: 8, exploration: 0.47622912292284103, score: 28
Scores: (min: 12, avg: 21.875, max: 40)

Run: 9, exploration: 0.3995984329713264, score: 36
Scores: (min: 12, avg: 23.444444444444443, max: 40)

Run: 10, exploration: 0.3455358541129786, score: 30
Scores: (min: 12, avg: 24.1, max: 40)

Run: 11, exploration: 0.18099664897669618, score: 130
Scores: (min: 12, avg

Run: 90, exploration: 0.01, score: 160
Scores: (min: 12, avg: 149.76666666666668, max: 464)

Run: 91, exploration: 0.01, score: 216
Scores: (min: 12, avg: 150.4945054945055, max: 464)

Run: 92, exploration: 0.01, score: 231
Scores: (min: 12, avg: 151.3695652173913, max: 464)

Run: 93, exploration: 0.01, score: 424
Scores: (min: 12, avg: 154.30107526881721, max: 464)

Run: 94, exploration: 0.01, score: 153
Scores: (min: 12, avg: 154.2872340425532, max: 464)

Run: 95, exploration: 0.01, score: 218
Scores: (min: 12, avg: 154.9578947368421, max: 464)

Run: 96, exploration: 0.01, score: 173
Scores: (min: 12, avg: 155.14583333333334, max: 464)

Run: 97, exploration: 0.01, score: 210
Scores: (min: 12, avg: 155.71134020618555, max: 464)

Run: 98, exploration: 0.01, score: 143
Scores: (min: 12, avg: 155.58163265306123, max: 464)

Run: 99, exploration: 0.01, score: 248
Scores: (min: 12, avg: 156.5151515151515, max: 464)

Run: 100, exploration: 0.01, score: 204
Scores: (min: 12, avg: 156.99, max:

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 0.5  # changed: initial exploration rate lowered from 1.0
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.8224322824348486, score: 42
Scores: (min: 17, avg: 29.5, max: 42)

Run: 3, exploration: 0.7861544476842928, score: 10
Scores: (min: 10, avg: 23, max: 42)

Run: 4, exploration: 0.736559652908221, score: 14
Scores: (min: 10, avg: 20.75, max: 42)

Run: 5, exploration: 0.7005493475733617, score: 11
Scores: (min: 10, avg: 18.8, max: 42)

Run: 6, exploration: 0.6563549768288433, score: 14
Scores: (min: 10, avg: 18, max: 42)

Run: 7, exploration: 0.6118738784280476, score: 15
Scores: (min: 10, avg: 17.571428571428573, max: 42)

Run: 8, exploration: 0.5848838636585911, score: 10
Scores: (min: 10, avg: 16.625, max: 42)

Run: 9, exploration: 0.5562889678716474, score: 11
Scores: (min: 10, avg: 16, max: 42)

Run: 10, exploration: 0.5264466124450268, score: 12
Scores: (min: 10, avg: 15.6, max: 42)

Run: 11, exploration: 0.500708706245853, score: 11
Scores: (min: 10, avg: 15.181818181818182, max: 42)

Ru

Run: 88, exploration: 0.01, score: 500
Scores: (min: 9, avg: 114.13636363636364, max: 500)

Run: 89, exploration: 0.01, score: 209
Scores: (min: 9, avg: 115.20224719101124, max: 500)

Run: 90, exploration: 0.01, score: 425
Scores: (min: 9, avg: 118.64444444444445, max: 500)

Run: 91, exploration: 0.01, score: 437
Scores: (min: 9, avg: 122.14285714285714, max: 500)

Run: 92, exploration: 0.01, score: 424
Scores: (min: 9, avg: 125.42391304347827, max: 500)

Run: 93, exploration: 0.01, score: 334
Scores: (min: 9, avg: 127.66666666666667, max: 500)

Run: 94, exploration: 0.01, score: 246
Scores: (min: 9, avg: 128.9255319148936, max: 500)

Run: 95, exploration: 0.01, score: 231
Scores: (min: 9, avg: 130, max: 500)

Run: 96, exploration: 0.01, score: 315
Scores: (min: 9, avg: 131.92708333333334, max: 500)

Run: 97, exploration: 0.01, score: 391
Scores: (min: 9, avg: 134.5979381443299, max: 500)

Run: 98, exploration: 0.01, score: 315
Scores: (min: 9, avg: 136.4387755102041, max: 500)

Run: 9

NameError: name 'exit' is not defined

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.5  # changing gamma this time (was 0.95)
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0 
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [4]:
cartpole()

Run: 1, exploration: 0.9229311239742362, score: 36
Scores: (min: 36, avg: 36, max: 36)

Run: 2, exploration: 0.8348931673187264, score: 21
Scores: (min: 21, avg: 28.5, max: 36)

Run: 3, exploration: 0.7628626641409962, score: 19
Scores: (min: 19, avg: 25.333333333333332, max: 36)

Run: 4, exploration: 0.7005493475733617, score: 18
Scores: (min: 18, avg: 23.5, max: 36)

Run: 5, exploration: 0.6596532430440636, score: 13
Scores: (min: 13, avg: 21.4, max: 36)

Run: 6, exploration: 0.6305556603555866, score: 10
Scores: (min: 10, avg: 19.5, max: 36)

Run: 7, exploration: 0.5907768628656763, score: 14
Scores: (min: 10, avg: 18.714285714285715, max: 36)

Run: 8, exploration: 0.5618938591163328, score: 11
Scores: (min: 10, avg: 17.75, max: 36)

Run: 9, exploration: 0.5290920728090721, score: 13
Scores: (min: 10, avg: 17.22222222222222, max: 36)

Run: 10, exploration: 0.4858739637363176, score: 18
Scores: (min: 10, avg: 17.3, max: 36)

Run: 11, exploration: 0.45522245551230495, score: 14
Scores

Run: 90, exploration: 0.01, score: 176
Scores: (min: 8, avg: 204.96666666666667, max: 500)

Run: 91, exploration: 0.01, score: 326
Scores: (min: 8, avg: 206.2967032967033, max: 500)

Run: 92, exploration: 0.01, score: 500
Scores: (min: 8, avg: 209.4891304347826, max: 500)

Run: 93, exploration: 0.01, score: 55
Scores: (min: 8, avg: 207.8279569892473, max: 500)

Run: 94, exploration: 0.01, score: 212
Scores: (min: 8, avg: 207.87234042553192, max: 500)

Run: 95, exploration: 0.01, score: 355
Scores: (min: 8, avg: 209.42105263157896, max: 500)

Run: 96, exploration: 0.01, score: 151
Scores: (min: 8, avg: 208.8125, max: 500)

Run: 97, exploration: 0.01, score: 275
Scores: (min: 8, avg: 209.49484536082474, max: 500)

Run: 98, exploration: 0.01, score: 207
Scores: (min: 8, avg: 209.46938775510205, max: 500)

Run: 99, exploration: 0.01, score: 269
Scores: (min: 8, avg: 210.07070707070707, max: 500)

Run: 100, exploration: 0.01, score: 304
Scores: (min: 8, avg: 211.01, max: 500)

Solved in 0 r

NameError: name 'exit' is not defined

In [2]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95   
LEARNING_RATE = 0.0001  # changing the learning rate this time (was 0.001)
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0 
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.8475428503023453, score: 36
Scores: (min: 17, avg: 26.5, max: 36)

Run: 3, exploration: 0.7705488893118823, score: 20
Scores: (min: 17, avg: 24.333333333333332, max: 36)

Run: 4, exploration: 0.7183288830986236, score: 15
Scores: (min: 15, avg: 22, max: 36)

Run: 5, exploration: 0.6596532430440636, score: 18
Scores: (min: 15, avg: 21.2, max: 36)

Run: 6, exploration: 0.6180388156137953, score: 14
Scores: (min: 14, avg: 20, max: 36)

Run: 7, exploration: 0.567555222460375, score: 18
Scores: (min: 14, avg: 19.714285714285715, max: 36)

Run: 8, exploration: 0.5425201222922789, score: 10
Scores: (min: 10, avg: 18.5, max: 36)

Run: 9, exploration: 0.510849320360386, score: 13
Scores: (min: 10, avg: 17.88888888888889, max: 36)

Run: 10, exploration: 0.4883155414435353, score: 10
Scores: (min: 10, avg: 17.1, max: 36)

Run: 11, exploration: 0.457510005540005, score: 14
Scores: (min: 10, avg: 16.8181

Run: 85, exploration: 0.01, score: 27
Scores: (min: 8, avg: 19.83529411764706, max: 145)

Run: 86, exploration: 0.01, score: 39
Scores: (min: 8, avg: 20.058139534883722, max: 145)

Run: 87, exploration: 0.01, score: 51
Scores: (min: 8, avg: 20.413793103448278, max: 145)

Run: 88, exploration: 0.01, score: 70
Scores: (min: 8, avg: 20.977272727272727, max: 145)

Run: 89, exploration: 0.01, score: 38
Scores: (min: 8, avg: 21.168539325842698, max: 145)

Run: 90, exploration: 0.01, score: 80
Scores: (min: 8, avg: 21.822222222222223, max: 145)

Run: 91, exploration: 0.01, score: 99
Scores: (min: 8, avg: 22.67032967032967, max: 145)

Run: 92, exploration: 0.01, score: 57
Scores: (min: 8, avg: 23.043478260869566, max: 145)

Run: 93, exploration: 0.01, score: 63
Scores: (min: 8, avg: 23.473118279569892, max: 145)

Run: 94, exploration: 0.01, score: 105
Scores: (min: 8, avg: 24.340425531914892, max: 145)

Run: 95, exploration: 0.01, score: 20
Scores: (min: 8, avg: 24.294736842105262, max: 145)



NameError: name 'exit' is not defined

What is the goal of the agent in this case?
        The goal of the agent is to keep a pole balanced upright on a cart by pushing the cart left or right. It receives a reward for every time step the pole stays up, so over many episodes it learns which actions keep the pole balanced for as long as possible.
        
What are the various state values?
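    The state is a vector of four continuous values: cart position, cart velocity, pole angle, and pole angular velocity (this is why `observation_space` is 4 in the code). The numbers printed during training are episode scores, not state values. Across the runs above, the minimum score fell to about 8 while the maximum reached 500.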
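
As a small illustration of what the four state values are, the sketch below labels the entries of a CartPole observation. The `CartPoleState` class and `describe` helper are hypothetical names introduced here, not part of the notebook or of Gym, and the sample numbers are made up.

```python
from typing import NamedTuple


class CartPoleState(NamedTuple):
    """Hypothetical wrapper naming the four entries of a CartPole observation."""
    cart_position: float
    cart_velocity: float
    pole_angle: float              # radians
    pole_angular_velocity: float


def describe(observation):
    """Label a raw 4-element observation such as the one env.reset() returns."""
    return CartPoleState(*observation)


# Made-up observation, for illustration only.
state = describe([0.02, -0.01, 0.03, 0.04])
print(state.pole_angle)  # 0.03
```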
    
What are the possible actions that can be performed?
    The possible actions are to push the cart left (0) or push the cart right (1).
    
What reinforcement algorithm is used for this problem?
    As stated in Surma's CartPole article (Surma, G. (2019) Cartpole - introduction to reinforcement learning (DQN - deep Q-learning), Medium. Available at: https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288 (Accessed: March 31, 2023)), this is a deep Q-learning algorithm, implemented as a Deep Q-Network (DQN).
    
    
How does experience replay work in this algorithm?
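    Experience replay lets the algorithm learn from its past attempts. After every step, the agent stores the transition (state, action, reward, next state, done) in a memory buffer (`remember`). Once the buffer holds at least `BATCH_SIZE` (20) transitions, `experience_replay` draws a random batch from it and, for each sampled transition, fits the network toward the target `reward + GAMMA * max Q(next state)`, or just `reward` if the episode ended. Sampling at random from past experience breaks up the correlation between consecutive steps and lets each transition be reused many times.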
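
The storing and sampling part of experience replay can be sketched without Keras or Gym. This is a minimal illustration that mirrors the `deque` and `random.sample` calls in the notebook; the buffer and batch sizes are shrunk on purpose, the transitions are made up, and `sample_batch` is a hypothetical helper name.

```python
import random
from collections import deque

MEMORY_SIZE = 5   # tiny on purpose; the notebook uses 1000000
BATCH_SIZE = 3    # the notebook uses 20

# A deque with maxlen drops the oldest transition once it is full.
memory = deque(maxlen=MEMORY_SIZE)


def remember(state, action, reward, next_state, done):
    """Store one transition, as DQNSolver.remember does."""
    memory.append((state, action, reward, next_state, done))


def sample_batch():
    """Return a random batch, or an empty list if there is too little data."""
    if len(memory) < BATCH_SIZE:
        return []
    return random.sample(list(memory), BATCH_SIZE)


# Made-up transitions: integers stand in for real state vectors.
for step in range(8):
    remember(step, step % 2, 1.0, step + 1, False)

batch = sample_batch()
print(len(memory), len(batch))  # 5 3
```

Only the five most recent transitions survive here because of `maxlen`; with the notebook's much larger buffer, old experience stays available for a long time.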

What is the effect of introducing a discount factor for calculating the future rewards?
    The discount factor determines how much the algorithm cares about rewards in the future compared with immediate rewards (Karnivaurus, Understanding the role of the discount factor in reinforcement learning, Cross Validated. Available at: https://stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning#:~:text=The%20discount%20factor%20essentially%20determines,that%20produce%20an%20immediate%20reward. (Accessed: March 31, 2023)). If the discount factor is 0, the algorithm only cares about the immediate reward. As it approaches 1, rewards far in the future count almost as much as immediate ones. In this notebook the discount factor is `GAMMA`, which multiplies the best predicted Q-value of the next state in the update `reward + GAMMA * max Q(next state)`; the default is 0.95, and one of the runs above lowers it to 0.5.
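
To make the effect of the discount factor concrete, here is a minimal sketch that compares the two values of `GAMMA` used in this notebook. The helper names `discounted_return` and `q_target` are hypothetical, and the reward sequence and Q-values are made up; `q_target` follows the same rule as the `q_update` lines in `experience_replay`.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def q_target(reward, next_q_values, gamma, terminal):
    """Target for one transition: reward plus discounted best next Q-value."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


rewards = [1.0] * 50                     # made-up episode: +1 per step
high = discounted_return(rewards, 0.95)  # future steps still matter
low = discounted_return(rewards, 0.5)    # only the first few steps matter
print(high > low)  # True

print(q_target(1.0, [2.0, 4.0], 0.5, False))  # 3.0
print(q_target(-1.0, [2.0, 4.0], 0.5, True))  # -1.0
```

With a discount factor of 0.5 the weights shrink so fast that rewards more than a handful of steps ahead contribute almost nothing, while 0.95 keeps a much longer horizon in view.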
    
    
Explain the neural network architecture that is used in the cartpole problem.
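    The neural network is a Keras `Sequential` model, meaning the layers are stacked so that each layer's output feeds the next. It takes the 4 state values as input, passes them through two hidden `Dense` layers of 24 units each with ReLU activation, and ends in a `Dense` output layer with a linear activation and one unit per action (2), so the output is the estimated Q-value of each action. It is compiled with mean squared error loss and the Adam optimizer.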
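
As a quick check on the size of this network, the sketch below counts its trainable parameters from the layer sizes in the code (4 inputs, two hidden layers of 24 units, 2 outputs). The helper name `dense_params` is hypothetical; the formula is the usual one for a fully connected layer, one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


# Input size followed by the size of each Dense layer in the notebook's model.
layer_sizes = [4, 24, 24, 2]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [120, 600, 50] 770
```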
    
How does the neural network make the Q-learning algorithm more efficient?
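    The state values are continuous, so a Q-table with one entry per state-action pair is not practical. The neural network replaces the table: it takes the state as input and estimates the Q-value of every action in a single forward pass. Because similar states produce similar outputs, what it learns from stored experience generalizes to states it has not seen before.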
    
What difference do you see in the algorithm performance when you increase or decrease the learning rate?
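    In the run above with the learning rate lowered from 0.001 to 0.0001, learning was much slower: after 95 runs the average score was about 24 (max 145), compared with an average of about 157 after 100 runs at 0.001. When I increased the learning rate, it took nearly 30,000 attempts for the algorithm to solve the problem.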
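
The role of the learning rate as a step size can be seen in a toy example. This sketch uses plain gradient descent on f(w) = w^2, which is an assumption made for simplicity; the notebook uses the Adam optimizer on a neural network, so the numbers here do not carry over, only the general idea that a smaller step moves the weights less per update.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w ** 2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w ** 2
        w -= learning_rate * gradient
    return w


larger_step = gradient_descent(0.1)     # ends close to the minimum at 0
smaller_step = gradient_descent(0.01)   # has barely moved after 20 steps
print(larger_step < smaller_step)  # True
```

A step size that is too large can also hurt, since updates may overshoot, which is why the learning rate usually has to be tuned rather than simply raised or lowered.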