# HW Assignment 2

Instructions: Implement both a policy gradient (PG) method and an evolutionary algorithm to solve the OpenAI Gym Lunar Lander problem, and then apply them to my area of choice, which is chess.

First, we need to do some setup.

In [1]:
import torch
import numpy as np
import gym

# Set the device
if torch.cuda.is_available():
    device = "cuda" # 🧮
# elif torch.backends.mps.is_available():
#     device = "mps" # 🧠
else:
    device = "cpu" # 🥺
    
print(f"Using device: {device}")

Using device: cpu


Next, we need to write the code for REINFORCE with a baseline, our policy gradient method. I'm going to use PyTorch as my neural network library.

I'm going to start with a basic feedforward net for both the network that chooses the policy and the network that learns states' values.

First, the policy network for choosing actions:

In [2]:
from torch import nn

class PolicyChoice(nn.Module):
    def __init__(self):
        super(PolicyChoice, self).__init__()
        self.flatten = nn.Flatten()
        self.policy = nn.Sequential(
            nn.Linear(8, 8),
            nn.ReLU(),
            nn.Linear(8, 4), # TODO: try reducing to one hidden layer if learning proves initially difficult
            nn.ReLU(),
            nn.Linear(4, 4),
            nn.Softmax(dim=-1) # softmax (not log softmax), so the outputs can be read as probabilities of choosing actions
        )

    def forward(self, x):
        probs = self.policy(x)
        return probs

policy_model = PolicyChoice().to(device)
policy_adam = torch.optim.Adam(policy_model.parameters(), 1e-3)

For the policy network's loss, we want to adjust the parameters mainly to change the probability of the action we actually took on that time step. If the return from that time step is better than the baseline predicted, we increase that probability in proportion to the difference. If it is worse, we decrease it in proportion. Thus, we multiply the gradient of the taken action's probability with respect to the parameters by the difference between the return and the baseline estimate for that state.

Importantly, there is one more factor to consider. Actions are sampled from the policy itself, so likely actions get updated more often than unlikely ones simply because they are chosen more often. To correct for this, we divide the gradient of the action's probability by that probability, where the probability is conditioned on the state and parameters.

Thus, the update direction for the policy parameters in REINFORCE with a baseline is:


$\Large (G_t - \hat{\upsilon}) \frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)}$

Since $\nabla \ln{x} = \frac{\nabla x}{x}$, this can be expressed as:

$\Large (G_t - \hat{\upsilon}) \nabla \ln{\pi(A_t|S_t, \theta)}$
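
As a quick sanity check on the identity above, here is a minimal sketch in plain Python (no PyTorch) for a softmax policy over four hypothetical action preferences `theta`. It compares a finite-difference estimate of $\nabla\pi/\pi$ with the closed-form gradient of $\ln\pi$ for a softmax, which is the one-hot vector of the action minus the probabilities. The names `softmax`, `grad_log_pi`, and `ratio_finite_difference` are illustrative and not part of the assignment code.

```python
import math

def softmax(theta):
    # subtract the max for numerical stability
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, action):
    # closed form for a softmax policy: d ln pi(a) / d theta_k = 1[k == a] - pi(k)
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def ratio_finite_difference(theta, action, eps=1e-6):
    # central-difference estimate of (d pi(a) / d theta_k) / pi(a)
    pi_a = softmax(theta)[action]
    grads = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        d_pi = (softmax(up)[action] - softmax(down)[action]) / (2 * eps)
        grads.append(d_pi / pi_a)
    return grads

theta = [0.5, -1.0, 2.0, 0.0]  # hypothetical action preferences
action = 2
analytic = grad_log_pi(theta, action)
numeric = ratio_finite_difference(theta, action)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # small, limited only by finite-difference error
```

The two gradients agree to within numerical error, which is why the update can be written with either $\nabla\pi/\pi$ or $\nabla\ln\pi$.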


The code below computes only the loss, not the gradient, since PyTorch's autograd handles the differentiation. Because optimizers minimize, the loss is the negative of the log-probability term above.

In [3]:
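def policy_loss(probs, action, state_util_difference):
    # NLLLoss expects log-probabilities with shape (batch, classes) and the integer index of the action taken
    nll_loss = nn.NLLLoss()
    target = torch.tensor([action], device=probs.device)
    # minimizing -ln(pi(A_t|S_t)) * (G_t - v) raises the probability of better-than-expected actions
    return nll_loss(torch.log(probs).unsqueeze(0), target) * state_util_difference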

Now, the network for approximating state utilities.

In [4]:
class StateUtility(nn.Module):
    def __init__(self):
        super(StateUtility, self).__init__()
        self.flatten = nn.Flatten()
        self.state_utility = nn.Sequential(
            nn.Linear(8, 8),
            nn.ReLU(),
            nn.Linear(8, 4), # TODO: try reducing to one hidden layer if learning proves initially difficult
            nn.ReLU(),
            nn.Linear(4, 1), # output a tensor of a scalar value
        )

    def forward(self, x):
        state_utility = self.state_utility(x)
        return state_utility

state_util_model = StateUtility().to(device)
state_util_adam = torch.optim.Adam(state_util_model.parameters(), 1e-3)

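For the state-utility network, the code below uses L1 loss. Note that the usual update for the weights $W$, shown here, is the descent direction for half the squared error $\frac{1}{2}(G_t - \hat{\upsilon}(S_t, W))^2$; with L1 loss, the error factor is replaced by its sign.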

$\Large (G_t - \hat{\upsilon}(S_t, W)) \nabla \hat{\upsilon}(S_t, W)$

Like above, the code below computes only the loss, not the gradient, since PyTorch's autograd handles the differentiation.

In [5]:
def state_util_loss(calculated_state_value, episode_state_value):
    # the overall state value is the input, and the individual state value is our target
    l1_loss = nn.L1Loss()
    return l1_loss(calculated_state_value, episode_state_value)
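
To make the difference between a squared-error update and an L1 loss concrete, here is a small sketch in plain Python. It treats the predicted utility as a single scalar `v` and compares the derivative of each loss with respect to `v`. The function names and the example numbers are made up for illustration.

```python
def half_squared_error_grad(g, v):
    # d/dv of 0.5 * (g - v)**2 is -(g - v): proportional to the error
    return -(g - v)

def l1_grad(g, v):
    # d/dv of |g - v| is -sign(g - v): the magnitude of the error is lost
    if g == v:
        return 0.0  # one valid subgradient choice at the kink
    return -1.0 if g > v else 1.0

g = -120.0  # hypothetical episode return from some state
for v in (-119.0, -20.0):
    print(v, half_squared_error_grad(g, v), l1_grad(g, v))
```

With L1 loss the gradient has the same size whether the prediction is off by 1 or by 100, which tends to be more robust to outlier returns but slower to correct large errors.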


We also need a function to calculate a state instance's utility in a given episode.

In [6]:
def calc_ep_state_util():
    """Given an observation and reward from an"""
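
The function above is left as a stub. As one possible way to fill it in, the sketch below computes the discounted return for every time step by walking the episode's rewards backwards, so that entry `t` equals `rewards[t] + gamma * returns[t + 1]`. The name `discounted_returns` and its signature are assumptions for illustration, not the author's implementation.

```python
def discounted_returns(rewards, gamma=0.99):
    """Entry t is rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # iterate backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Computing all returns in one backward pass keeps the cost linear in the episode length, since each return is built from the one after it.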

Let's define our hyperparameters.

In [7]:
gamma = .99

Let's load the Lunar Lander environment now and run a training loop for the state-utility network.

In [8]:
# TODO: use a custom dataloader class and see if speed up

env = gym.make(
    "LunarLander-v2",
    #render_mode="human"
)

action_space_seed = 13
np.random.seed(action_space_seed) # np.random.seed returns None, so keep the seed value separately

observation, info = env.reset(seed=13)

# index i in the lists below corresponds to the timestep i of the current episode
observations = []
rewards = []
episode_total_rewards = []

for timestep in range(10000):
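    # sample an action index from the policy's output probabilities
    state = torch.as_tensor(observation, dtype=torch.float32, device=device)
    action_weights = np.array(policy_model(state).tolist())
    action_array = np.random.multinomial(n=1, pvals=action_weights)
    action = np.argmax(action_array)
    
    # store the state the action was taken in, so observations[i] lines up with rewards[i]
    observations.append(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)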
    
    # end of episode
    if terminated or truncated:
        ep_length = len(observations)
        ep_total_reward = np.sum(np.array(rewards))
        episode_total_rewards.append(ep_total_reward)
        # allocate once per episode so the return from step t+1 is still there when computing step t
        state_utilities = np.zeros(ep_length)
        for t in reversed(range(ep_length)):
            terminal = t == ep_length - 1
            state_utilities[t] = rewards[t] + (gamma * state_utilities[t+1]) if not terminal else rewards[t]
            
            pred_state_util = state_util_model(torch.as_tensor(observations[t], dtype=torch.float32, device=device))
            actual_state_util = torch.tensor([state_utilities[t]], dtype=torch.float32, device=device)
            
            loss_state_utility = state_util_loss(pred_state_util, actual_state_util)
            
            state_util_adam.zero_grad()
            loss_state_utility.backward()
            state_util_adam.step()

        observation, info = env.reset()
        observations, rewards = [], []

print(episode_total_rewards)
env.close()

[-350.86130554663714, -189.8865101089612, -103.03847412831941, -581.2043445816703, -163.93252346597382, -49.51899536481143, -260.75239283003975, -233.18472395477846, -466.09337063747716, -274.9756752804017, -348.73756389432504, -595.4321930174558, -387.43727949568694, -477.39161292267676, -507.30093546356005, -162.46922987107797, -480.7745698076816, -409.8684494245649, -7.226011881097378, -336.07919890640403, -564.960104218054, -181.96143509468413, -473.21099556072414, -479.99705740041173, -96.61613820825796, -491.8673565150349, -289.54653009657926, -430.16777627322404, -89.96845138200774, -178.35957658831273, -251.79772720771726, -491.3091930132326, 11.087863045058057, -64.56443510680057, -175.7370180208552, -101.25521031004382, -288.41384107254646, -318.2153006563705, -334.0488297946413, -101.04652557200455, -79.15067410829134, -373.81468585665124, -86.61116853649762, -357.5281151310988, -234.93475040137642, -294.79145035570787, -230.65786723327614, -520.8971483584046, -75.8121089788