
# CAPSTONE 1: CLIFF WALKING

In this environment from the Gymnasium library (the maintained successor to OpenAI's Gym), the character (agent) tries to reach the goal (cookie) without falling off the cliff (burgundy squares).

### ACTION SPACE: (all actions the agent can take)
* 0: move up
* 1: move right
* 2: move down
* 3: move left

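### OBSERVATION SPACE: (what the agent can see about the environment)
* 37 reachable states (3 × 12 + 1): every cell in the top three rows plus the start cell in the bottom-left corner. The player cannot occupy a cliff cell, and reaching the goal ends the episode.
* Each state is reported as a single integer, `row * 12 + col`, with rows and columns counted from 0.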
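To make the state numbering concrete, here is a minimal sketch in plain Python. It assumes the 4 × 12 grid and the `row * 12 + col` encoding described in Gymnasium's documentation for this environment; the helper names `state_to_cell` and `cell_to_state` are made up for illustration and are not part of the library.

```python
# Assumed layout (from the Gymnasium docs, not from this notebook): 4 rows x 12 columns.
N_ROWS, N_COLS = 4, 12

def cell_to_state(row, col):
    """Flatten a zero-based (row, col) position into a single state index."""
    return row * N_COLS + col

def state_to_cell(state):
    """Recover the zero-based (row, col) position from a state index."""
    return divmod(state, N_COLS)

START = cell_to_state(3, 0)    # bottom-left corner
GOAL = cell_to_state(3, 11)    # bottom-right corner
CLIFF = {cell_to_state(3, col) for col in range(1, 11)}  # bottom row between start and goal

# States the agent can actually occupy: everything except cliff cells and the goal.
reachable = [s for s in range(N_ROWS * N_COLS) if s not in CLIFF and s != GOAL]

print(START, GOAL, len(reachable))
print(state_to_cell(START), state_to_cell(GOAL))
```

The Q-table built later has one row per grid cell, so the rows for cliff and goal cells simply never get updated.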

### REWARDS
* Each step incurs a reward of -1
* If the player steps into the cliff, they receive a reward of -100 and are sent back to the start (the episode continues)

As a result, the maximum possible total reward is -13, earned by taking the shortest path of 13 steps from the starting state to the goal.
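The arithmetic behind the -13 figure can be checked with a small stand-alone model of the grid. This is only a sketch: it re-implements the movement and reward rules in plain Python (clipping at the walls, -100 and a return to the start on a cliff cell, as described in Gymnasium's documentation) rather than calling the real environment, and `episode_return` is a hypothetical helper name.

```python
N_ROWS, N_COLS = 4, 12
# Row 0 is the top of the grid, so "up" decreases the row index.
DELTAS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
START, GOAL = (3, 0), (3, 11)

def episode_return(actions):
    """Sum the rewards collected by following `actions` from the start cell."""
    row, col = START
    total = 0
    for action in actions:
        d_row, d_col = DELTAS[action]
        # Moves that would leave the grid keep the agent on the edge.
        row = min(max(row + d_row, 0), N_ROWS - 1)
        col = min(max(col + d_col, 0), N_COLS - 1)
        if row == 3 and 1 <= col <= 10:   # stepped into the cliff
            total += -100
            row, col = START              # sent back to the start
        else:
            total += -1
        if (row, col) == GOAL:
            break
    return total

shortest_path = [0] + [1] * 11 + [2]          # up, 11 x right, down
print(episode_return(shortest_path))          # -13
print(episode_return([1] + shortest_path))    # -113: one fall, then the shortest path
```

A single fall costs as much as 100 ordinary steps, which is why early training episodes score far below -13.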

In [None]:
# obtain necessary libraries for the program
!pip install gymnasium
import gymnasium as gym
import numpy as np
import time

# create the environment
env = gym.make('CliffWalking-v0', render_mode='rgb_array')
env.reset()
env.render()

Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 953.9/953.9 kB 7.7 MB/s eta 0:00:00
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.29.1


In [None]:
# Q-learning function
def qlearn(num_episodes, gamma=0.7):
    # statistics tracked across all episodes
    rewards = []
    best_reward = float('-inf')
    best_episode = []

    ### algorithm
    n_states, n_actions = env.unwrapped.nS, env.unwrapped.nA
    Q = [[0 for col in range(n_actions)] for row in range(n_states)]  # create Q-table (states x actions)

    for i in range(num_episodes):
        # statistics for the current episode
        episode = []  # actions taken in this episode
        episode_reward = 0

        state, _ = env.reset()  # start a new episode
        done = False
        while not done:  # until the episode ends
            # Add random noise to the Q-values so the agent explores instead of
            # always taking the currently best-looking action. The noise scale
            # 1/(i+1) shrinks every episode, so exploration fades over time.
            value_noise = np.random.randn(n_actions) * (1. / (i + 1))
            action_idx = int(np.argmax(np.array(Q[state]) + value_noise))
            next_state, reward, terminated, truncated, _ = env.step(action_idx)
            done = terminated or truncated
            # Q-learning update with a learning rate of 1 (the old estimate is overwritten)
            Q[state][action_idx] = reward + gamma * np.max(Q[next_state])
            state = next_state

            ### Statistics & visualization ###
            episode.append(action_idx)  # record the action taken
            episode_reward += reward  # accumulating total reward received in current episode

        if (i + 1) % 50 == 0:
            print(f'Episode {i + 1} Reward: {episode_reward}')

        # remember the highest-reward episode seen so far
        if episode_reward > best_reward:
            best_reward = episode_reward
            best_episode = episode

        rewards.append(episode_reward)

    return Q, rewards, best_episode

Q, rewards, best_episode = qlearn(1000)
print('Average reward for first 100 episodes:', np.mean(rewards[:100]))
print('Average reward for last 100 episodes:', np.mean(rewards[-100:]))

Episode 50 Reward: -16
Episode 100 Reward: -13
Episode 150 Reward: -13
Episode 200 Reward: -13
Episode 250 Reward: -13
Episode 300 Reward: -13
Episode 350 Reward: -13
Episode 400 Reward: -13
Episode 450 Reward: -13
Episode 500 Reward: -13
Episode 550 Reward: -13
Episode 600 Reward: -13
Episode 650 Reward: -13
Episode 700 Reward: -13
Episode 750 Reward: -13
Episode 800 Reward: -13
Episode 850 Reward: -13
Episode 900 Reward: -13
Episode 950 Reward: -13
Episode 1000 Reward: -13
Average reward for first 100 episodes: -40.28
Average reward for last 100 episodes: -13.0
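The assignment to `Q[state][action_idx]` inside `qlearn` replaces the old estimate outright, which corresponds to the usual Q-learning update with a learning rate of 1. The sketch below writes the update with an explicit learning rate to show that relationship; `q_update` and its `alpha` parameter are illustrative names and are not used anywhere in the notebook.

```python
def q_update(q_sa, reward, max_q_next, alpha=1.0, gamma=0.7):
    """Return the new estimate for Q(s, a) after one observed transition.

    alpha is a hypothetical learning-rate parameter; alpha=1.0 reproduces the
    notebook's behavior of overwriting the old value with the target.
    """
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)

print(q_update(0.0, -1, 0.0))             # -1.0, same as reward + gamma * max_q_next
print(q_update(0.0, -1, 0.0, alpha=0.5))  # -0.5, moves only halfway toward the target
```

A learning rate of 1 is workable here because every move in this environment has a deterministic outcome; with random transitions a smaller value would typically be needed to average over them.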


In [None]:
# TESTING AGENT

state, _ = env.reset()  # start a new episode
done = False
actions = {0: "up", 1: "right", 2: "down", 3: "left"}
solution = []
while not done:  # until the episode ends
  action_idx = int(np.argmax(Q[state]))  # always take the highest-valued action (no exploration)
  solution.append(action_idx)
  print(f"action: {actions[action_idx]}")
  state, reward, terminated, truncated, _ = env.step(action_idx)
  done = terminated or truncated


action: up
action: right
action: right
action: right
action: right
action: right
action: right
action: right
action: right
action: right
action: right
action: right
action: down
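The printed actions show one greedy path from the start. To inspect what the agent would do from every cell, the Q-table can be turned into a grid of arrows. This is a rough sketch assuming the 4 × 12 layout and a Q-table shaped like the nested list returned by `qlearn`; `greedy_policy_grid` is a made-up helper and the demo table is filled with placeholder values, not trained ones.

```python
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}

def greedy_policy_grid(q_table, n_rows=4, n_cols=12):
    """Render the highest-valued action of each cell as an arrow, one text row per grid row."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            q_values = q_table[row * n_cols + col]
            # On ties the first action wins, which matches np.argmax.
            best = max(range(len(q_values)), key=lambda a: q_values[a])
            cells.append(ARROWS[best])
        lines.append(" ".join(cells))
    return "\n".join(lines)

# Placeholder table for demonstration only; in the notebook, pass the trained Q instead.
demo_q = [[0.0] * 4 for _ in range(48)]
demo_q[24][1] = 1.0   # pretend "right" looks best in row 2, column 0
print(greedy_policy_grid(demo_q))
```

Cells the agent never acted from (cliff cells, the goal, and anything unexplored) keep all-zero values and therefore show `^` by the tie-breaking rule, so those arrows carry no information.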


In [None]:
# rudimentary solution to visualization
env.reset()
env.render()

In [None]:
env.step(solution[0])
env.render()

In [None]:
env.step(solution[1])
env.render()

In [None]:
env.step(solution[2])
env.render()

In [None]:
env.step(solution[3])
env.render()

In [None]:
env.step(solution[4])
env.render()

In [None]:
env.step(solution[5])
env.render()

In [None]:
env.step(solution[6])
env.render()

In [None]:
env.step(solution[7])
env.render()

In [None]:
env.step(solution[8])
env.render()

In [None]:
env.step(solution[9])
env.render()

In [None]:
env.step(solution[10])
env.render()

In [None]:
env.step(solution[11])
env.render()

In [None]:
env.step(solution[12])
env.render()
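The thirteen cells above can be condensed into one loop that replays the solution and collects a frame after each step. The sketch below assumes an environment created with `render_mode='rgb_array'`, as in this notebook, so that `env.render()` returns an image array; `collect_frames` is a hypothetical helper name.

```python
def collect_frames(env, actions):
    """Replay `actions` from a fresh reset and return the rendered frame of every visited state."""
    env.reset()
    frames = [env.render()]          # frame of the starting state
    for action in actions:
        _, _, terminated, truncated, _ = env.step(action)
        frames.append(env.render())
        if terminated or truncated:  # stop once the episode is over
            break
    return frames

# Possible usage in the notebook (assumes matplotlib is installed, as it is on Colab):
# import matplotlib.pyplot as plt
# for frame in collect_frames(env, solution):
#     plt.imshow(frame)
#     plt.axis('off')
#     plt.show()
```

Returning the frames as a list keeps the replay separate from the display code, so the same frames could also be saved or animated.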