# Final Project

* [Overview](#overview)
* [Mountain Car Reinforcement Learning](#mountain-car-reinforcement-learning)
    * [Mountain Car Environment](#mountain-car-environment)
    * [Mountain Car Implementation](#implementation)
        * [Q-Learning](#q-learning)
            * [Q-Learning Results](#q-learning-results)
        * [DQN](#dqn)
            * [DQN Results](#dqn-results)
        * [Q-Learning VS Double DQN](#q-learning-vs-double-dqn)
* [Lunar Lander Reinforcement Learning](#lunar-lander-reinforcement-learning)
    * [Lunar Lander Environment](#lunar-lander-environment)
    * [Double DQN](#double-dqn)
        * [Double DQN Implementation](#double-dqn-implementation)
        * [Double DQN Results](#double-dqn-results)
    * [Prioritized Experience Replay Double DQN](#prioritized-experience-replay-double-dqn)
        * [PER Implementation](#per-implementation)
        * [PER Results](#per-results)
    * [PPO](#ppo)
        * [PPO Implementation](#ppo-implementation)
        * [PPO Results](#ppo-results)
* [Conclusion](#conclusion)
* [References](#reference)

## Overview

For this project I decided to implement what we have learned about reinforcement learning in both the Mountain Car and Lunar Lander environments from the OpenAI Gym/Gymnasium tool (https://gymnasium.farama.org/). 

First I developed a Q-Learning algorithm and Deep Q-Network (DQN) for the Mountain Car environment and then I ported the Double DQN over and updated it to see how I could implement the same for the Lunar Lander. Finally I implemented a Proximal Policy Optimization algorithm in the Lunar Lander environment to test against the success rate of the Double DQN with a Replay Buffer. The breakdown of the environments, the steps I took, and the results can be found below.

The majority of the work is in individual .py files, compiled in the GitHub repository where this notebook is also located (https://github.com/grsi7978/Q_Learning/tree/main).

For the development of this project I worked locally on my personal machine. I specifically had to set up a virtual machine to get the Lunar Lander working due to some of its dependencies.

## Mountain Car Reinforcement Learning

### Mountain Car Environment

The Mountain Car MDP is a deterministic MDP where the car is placed at the bottom of a valley. The goal of the MDP is to accelerate the car to reach the goal state (the flag) at the top of the right hill. There are two versions of the mountain car, one with discrete actions and one with continuous. For the purposes of this project I will be working with the one with discrete actions.

The Mountain Car environment is part of the Classic Control environments. Its Action Space = Discrete(3) and its Observation Space = Box([-1.2 -0.07], [0.6 0.07], (2,), float32).

The Action space includes three actions:
- 0: Accelerate car to the left
- 1: Do nothing
- 2: Accelerate car to the right

The Observation Space contains positional and velocity elements:
- The first element is the position which can vary between -1.2 and 0.6, representing the car's horizontal position on the track
- The second element is the velocity which can vary between -0.07 and 0.07, representing the car's actual velocity

This all means that at each step the car must choose one of the three discrete actions and is in a current state given by the continuous vector (Observation Space).

The base / initial Reward setup for the Mountain Car is a reward of -1 for each timestep that the car is not at the flag (goal).

The Mountain Car episode ends in a termination state if the position of the car is greater than or equal to 0.5 (the goal position) OR in a truncation state if the length of the episode is 200.

The above and more information about the environment can be found here: https://gymnasium.farama.org/environments/classic_control/mountain_car/ 

### Implementation

#### Q-Learning

Basic Q-Learning implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_q_learning.py

Basic Q-Learning with adaptive decay implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_q_learning_adaptive_decay.py

Initially I built a basic Random Agent that made random action decisions just to test out the environment. I left this agent in the code for posterity.

I then built a `QAgent` that leveraged Q-Learning to make better decisions and increase the rewards. This QAgent used a learning rate (`alpha`), discount factor (`gamma`), epsilon for exploration (`ep`), and some factors influencing epsilon decay (`ep_decay` and `ep_min`).

To complete the `QAgent` I also implemented a few key items:
- Q-table
    - Indexed by discretized states (via buckets)
- Update the Q-values with the Bellman Equation

I created an evaluation function `evaluate` to help in development by collecting the total rewards per episode, returning the mean and standard deviation of said rewards, after running the agent for a specified number of episodes (generally had this number set to 20000).

I made a function that discretizes the continuous position and velocity states into bucketed indices called `mapPosVel` to ensure each state maps to a finite grid location in the built Q-table.
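As a rough illustration of this kind of bucketing, here is a minimal sketch. The function name `discretize`, the module-level constants, and the clamping behavior are assumptions made for the example and are not the repository's `mapPosVel`; only the observation bounds and the (20, 15) bucket counts come from this write-up.

```python
# Hypothetical sketch of state bucketing; not the repository's mapPosVel.
LOW = (-1.2, -0.07)    # observation space lower bounds (position, velocity)
HIGH = (0.6, 0.07)     # observation space upper bounds (position, velocity)
BUCKETS = (20, 15)     # bucket counts reported in the text


def discretize(state, low=LOW, high=HIGH, buckets=BUCKETS):
    """Map a continuous (position, velocity) pair to a tuple of bucket indices."""
    indices = []
    for value, lo, hi, n in zip(state, low, high, buckets):
        ratio = (value - lo) / (hi - lo)            # scale the feature to [0, 1]
        index = int(ratio * n)                      # scale to [0, n]
        indices.append(min(n - 1, max(0, index)))   # clamp so the upper bound stays in range
    return tuple(indices)
```

With these assumed settings, `discretize((0.0, 0.0))` falls in bucket `(13, 7)`, and any out-of-range value is clamped to the nearest edge bucket rather than raising an index error.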

Then I developed an epsilon-greedy policy function `getAction` that chooses a random exploration action with probability `ep` and otherwise chooses the best exploitation action from the Q-table. In this function I implemented a toggle parameter that forces the function to always exploit when set to True, which was only used when called from the `evaluate` function. The intent here is to test how good the agent is while using a learned policy, not while exploring / taking random actions.
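A minimal sketch of an epsilon-greedy choice with an exploit-only toggle might look like the following. The name `choose_action` and its parameters are illustrative assumptions and do not reproduce the repository's `getAction` signature.

```python
import random


# Hypothetical sketch of an epsilon-greedy choice; names are illustrative.
def choose_action(q_values, ep, exploit_only=False, rng=random):
    """Return an action index given one state's list of Q-values."""
    if not exploit_only and rng.random() < ep:
        return rng.randrange(len(q_values))   # explore: uniform random action
    # exploit: index of the largest Q-value (the first one wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Passing `exploit_only=True` mirrors the evaluation case described above, where the agent should be judged on its learned policy alone.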

The `update` function I created applies the Bellman Equation: it takes the current state and next state, computes the target as the reward plus the discounted best future Q-value, and applies the temporal difference between that target and the current estimate as an update to the appropriate Q-table entry.
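The update can be sketched for a tabular agent as below. The dictionary-based Q-table, the function name `q_update`, and the handling of terminal states are assumptions for illustration; the repository's `update` may be organized differently.

```python
# Hypothetical tabular Q-learning update. q_table maps a discretized
# state (a tuple) to a list of Q-values, one per action.
def q_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """Apply one temporal-difference update in place and return the TD error."""
    best_next = 0.0 if done else max(q_table[next_state])  # no bootstrap past a terminal state
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


q = {(0, 0): [0.0, 0.0, 0.0], (0, 1): [1.0, 2.0, 0.5]}
err = q_update(q, (0, 0), 2, -1.0, (0, 1), False, alpha=0.5, gamma=0.9)
```

In this toy call the target is `-1.0 + 0.9 * 2.0`, so the entry for action 2 moves halfway from 0 toward 0.8.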

The algorithm also decays the exploration over time by simply taking the max between the epsilon minimum and the current epsilon multiplied by the epsilon decay factor.
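The decay rule is small enough to show directly. The helper `steps_until_floor` and the example values are assumptions added to show how the decay factor and minimum interact; they are not the repository's settings.

```python
# Hypothetical values; the repository's ep, ep_decay and ep_min may differ.
def decay_epsilon(ep, ep_decay, ep_min):
    """Multiplicative decay with a floor, as described above."""
    return max(ep_min, ep * ep_decay)


def steps_until_floor(ep, ep_decay, ep_min):
    """Count decay steps until ep reaches ep_min (requires 0 < ep_decay < 1)."""
    steps = 0
    while ep > ep_min:
        ep = decay_epsilon(ep, ep_decay, ep_min)
        steps += 1
    return steps
```

A helper like this makes it easier to judge whether a given `ep_decay` leaves enough exploration for the planned number of episodes.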

The `qLearning` function applies the agent actions until termination for each episode. It uses some custom reward shaping that rewards the cart for moving faster, being closer to the flag (goal), accelerating in the direction it is currently moving, as well as a very large bonus for reaching the goal state. These rewards were implemented over time and multiple iterations while trying to help the algorithm learn what best actions to take. It is in this function that I also track any successes that occur.

This implementation required a lot of parameter tuning. You can see just some of the tuning I did in the code comments as I began recording already tested values after it became too difficult to remember all the iterations that I attempted. For instance, for the buckets I was using for the discretized states I iterated through many different sizes ((40,30), (24,18), (18,14), (20,15)) and some others that were not recorded before finding the best combination to be (20,15). Other basic parameters that were tweaked many times included the epsilon decay (`ep_decay`) and epsilon minimum (`ep_min`).

In an attempt to achieve better Reward scores I also implemented a nearly identical version of the Q-Learning implementation, with the addition of an adaptive decay. In this version the epsilon decay rate is increased if the reward is improving beyond a set threshold (close to the goal threshold) and decreased (slower decay) if the reward is not moving beyond said threshold. The intent here was to help prevent early convergence.
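One way to express that rule is sketched below. The threshold and both decay factors are placeholders chosen for the example, and the function name `adaptive_decay` is hypothetical; the repository's rule may compare rewards differently.

```python
# Hypothetical sketch of adaptive decay. The threshold and both decay
# factors are illustrative placeholders, not the repository's values.
def adaptive_decay(ep, episode_reward, threshold, fast_decay, slow_decay, ep_min):
    """Decay epsilon faster when the episode reward beats the threshold."""
    factor = fast_decay if episode_reward > threshold else slow_decay
    return max(ep_min, ep * factor)
```

Here a smaller factor means faster decay, so `fast_decay` should be the smaller of the two numbers.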

I also added a function `plotRewards` that simply plotted the learning progress and success rate of the algorithm. See below for an example output of the graph. This function helped me when tuning the parameters mentioned above as I could compare graphs and see how fast or consistently the learning process was actualizing. 

![Figure_1](img/Q_Learning_Figure_1.png)

I also had the learning function print the episode number, the total reward for the episode, and the current epsilon value. To make the output manageable I restricted it to every 100 episodes, as I was generally running for > 15000 episodes. The average reward and standard deviation from the final evaluation over 10 episodes, as well as the total number of successful runs, were also output. This was another way I could monitor the trends and see if there was any improvement while I was tweaking the parameters and the reward scaling.

![Figure_2](img/Output_Figure_1.png)

Seeing a positive trend in the success rate was the best true measure that the algorithm was working as intended. It appears that a Reward score of < ~-130 would generally lead to a successful run, but manually tracking the actual successes was the only real way I was able to guarantee that a run completed with the cart reaching the flag. 

##### Q-Learning Results

Overall the adaptive decay Q-Learning method had ~100% success rate with a somewhat low Reward score (~140-170) while the non-adaptive Q-Learning method generally ranged from 60%-90% success rates but with better average scores. This indicates that when the non-adaptive method was successful it was successful noticeably faster than the adaptive decay method, but the adaptive decay method was consistently successful while the non-adaptive was not.

Here is a gif of ten runs made by the adaptive decay Q-Learning method:

![SegmentLocal](img/adaptive_q_learning_looped.gif "segment")

#### DQN

DQN implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_dqn.py 

For the DQN I leveraged a few key concepts:
- Q-Network (policy net) and Target Network: These are used to stabilize the learning
    - I restricted updates to the Q-Network until I had enough samples to make it worthwhile (explained again in more detail below)
    - I periodically updated the Target Network to match the Q-Network: The idea here is to make the targets used in learning more stable
    - Here is where I am setting this up, which I mainly got from this <a href="https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html">tutorial</a>:
    ```
        self.q_net = DQN(input_dim, self.n_actions).to(self.device)
        self.target_net = DQN(input_dim, self.n_actions).to(self.device)
        self.target_net.load_state_dict(self.q_net.state_dict())
    ```
    - Here is where I periodically update the target network, which I mainly got from this <a href="https://lightning.ai/docs/pytorch/2.0.9/notebooks/lightning_examples/reinforce-learning-DQN.html">tutorial</a>:
    ```
        if self.step_count % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())
    ```
- Double DQN: Used to reduce overestimation
    - Uses `q_net` to choose best action in the next state
    - Uses `target_net` to evaluate the value of that action
    - Here is where I am setting this up:
    ```
        next_actions = self.q_net(next_states).argmax(1, keepdim=True)  # action selection
        max_next_q = self.target_net(next_states).gather(1, next_actions)  # action evaluation
    ```
- Replay Buffer: Stores transitions in memory and then samples them randomly for training purposes
    - Here are the non-sequential calls I make to implement this, which I mainly got from this <a href="https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html">tutorial</a>:
        - `self.memory = deque(maxlen=mem_size)`
        - `self.memory.append((state, action, reward, next_state, done))`
        - `batch = random.sample(self.memory, self.batch_size)`
- Epsilon Greedy Policy: Used in each training episode to collect transitions
    - Gradually will shift from exploration to exploitation
    - This is basically the same as the basic Q-Learning implementation
- Normalization of Inputs: Used to standardize state values into similar scales as neural networks learn faster and more effectively with feature inputs on similar scales
    - Here is one place I am using this:
    ```
        state_mean = np.array([-0.3, 0.0], dtype=np.float32) # -0.3 is center of -1.2,0.6 (position) and 0.0 is center of -0.07,0.07 (velocity)
        state_std = np.array([0.9,0.07], dtype=np.float32) # 0.9 = (0.6 - -1.2) / 2
    ```
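To make the Double DQN selection/evaluation split from the list above concrete, here is a pure-Python sketch that works on plain lists instead of tensors. The function name, the argument layout, and the zeroing of the bootstrap term on terminal transitions are assumptions for illustration rather than a copy of the repository's code.

```python
# Pure-Python sketch of the Double DQN target for a batch. The nested
# lists stand in for the outputs of q_net and target_net on next_states.
def double_dqn_targets(rewards, dones, online_next_q, target_next_q, gamma):
    """Return r + gamma * Q_target(s', argmax_a Q_online(s', a)) per transition."""
    targets = []
    for r, done, q_online, q_target in zip(rewards, dones, online_next_q, target_next_q):
        best_action = max(range(len(q_online)), key=lambda a: q_online[a])  # action selection
        bootstrap = 0.0 if done else q_target[best_action]                  # action evaluation
        targets.append(r + gamma * bootstrap)
    return targets


targets = double_dqn_targets(
    rewards=[-1.0, -1.0],
    dones=[False, True],
    online_next_q=[[1.0, 3.0, 2.0], [0.0, 1.0, 0.0]],
    target_next_q=[[5.0, 0.5, 9.0], [4.0, 4.0, 4.0]],
    gamma=0.5,
)
```

In the first transition the online network picks action 1, so the target uses the target network's value of 0.5 for that action rather than its own maximum of 9.0. That is the mechanism that reduces overestimation.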

![Figure_3](img/Output_Figure_2.png)

Initially I did not have reward bonuses for certain state / action combinations like I eventually did in the Q-Learning models, meaning I was not rewarding for being closer to the goal state, for moving in the right direction, etc. I was not sure that these would matter as much in a DQN. However, after many failed attempts at getting satisfying results I did end up adding some of these, namely a `velocity_bonus`, `position_reward`, and large early termination reward (successful termination):

```
            velocity_bonus = 0.1 * abs(next_state[1])
            position_min = env.observation_space.low[0]
            position_max = env.observation_space.high[0] 
            position_reward = 0.3 * ((next_state[0] - position_min) / (position_max - position_min))

            new_reward = reward + velocity_bonus + position_reward
```
I played around with some different values for these, such as adding 10.0 for early successful termination initially, but eventually bumped that up to 100.0. 
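The shaping terms can be wrapped into a standalone function to see their relative sizes. The function name `shaped_reward`, the `reached_goal` flag, and the way the goal bonus is added are assumptions for illustration; the coefficients, the 100.0 bonus, and the position bounds come from this write-up.

```python
# Sketch of the shaping terms as a standalone function. The bounds default
# to the Mountain Car position limits; the function name and the goal_bonus
# handling are assumptions for illustration.
def shaped_reward(reward, next_state, reached_goal=False,
                  position_min=-1.2, position_max=0.6, goal_bonus=100.0):
    position, velocity = next_state
    velocity_bonus = 0.1 * abs(velocity)
    position_reward = 0.3 * ((position - position_min) / (position_max - position_min))
    bonus = goal_bonus if reached_goal else 0.0
    return reward + velocity_bonus + position_reward + bonus
```

With these coefficients the velocity term tops out at 0.1 * 0.07 = 0.007 and the position term at 0.3, so the per-step penalty of -1 still dominates until the goal bonus is reached.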

There were many other values that I toyed with, trying to massage the DQN so that it would perform better. Some such values included the `episodes` count used for training (tried 1000, 1500, landed on 2000), the learning rate (`lr`) of the model (tried 1e-3, 1e-4, 3e-4), and the target update frequency (`target_update_freq`) of the model (started at 5000, landed on 1000). There were many other values that I spent time tweaking up or down depending on what the variable was being used for, but eventually I seemed to have found a manageable balance for the model where I was getting 10/10 successes on every run.

One of the most important changes to a value set that I made was an increase to the number of neurons in the network model. I initially had the DQN model with `56` units per layer but bumped it up to `128` per. The intent here was to help the model approximate better.

One of the other many changes I made during implementation was to normalize the velocity and position features to help the neural net better use the state values and use both features more equally. Velocity spans a much smaller range / is on a smaller scale (-0.07 to 0.07 with a range of 0.14) than Position (-1.2 to 0.6 with a range of 1.8) meaning that both were likely not influencing the learning nearly as evenly as I would like.

Initially I had code that looked like this:
```
next_states = torch.FloatTensor(next_states).to(self.device)
```
But I normalized it so it looked like this:
```
next_states = (torch.FloatTensor(next_states).to(self.device) - state_mean) / state_std
```
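The effect of this rescaling is easy to check with plain floats. In the sketch below the constant and function names are hypothetical; the mean and scale values are the ones shown earlier in this section.

```python
# Sketch of the normalization with plain floats instead of tensors.
STATE_MEAN = (-0.3, 0.0)   # midpoints of the position and velocity ranges
STATE_STD = (0.9, 0.07)    # half-widths of those ranges (not true standard deviations)


def normalize(state, mean=STATE_MEAN, std=STATE_STD):
    """Rescale each feature so its valid range maps to roughly [-1, 1]."""
    return tuple((x - m) / s for x, m, s in zip(state, mean, std))
```

After this step the extreme position and the extreme velocity both land near -1 or 1, so neither feature dominates purely because of its units.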

Another change I made during implementation was to add a warm up period to the agent by delaying its updates for a certain length of time (1000 steps). I accomplished this simply by adding an additional `warmup_steps` variable to the agent and comparing it to the number of experiences in the replay buffer `self.memory`. If the number of experiences was less than the `warmup_steps` variable value, or the `batch_size` variable value, then no update would occur. I already had the `batch_size` limitation in place to ensure there was enough data for one mini-batch, but the `warmup_steps` limitation ensures that there is a diverse set of transitions in the buffer. Through some research early on when the DQN was performing terribly I saw that early episodes are often considered poor to use as they may have sparse rewards and everything is still random. This implementation allowed for more valuable transitions to be included in the batches.
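The gate itself amounts to a single comparison. The sketch below uses a hypothetical `ready_to_update` helper and integer stand-ins for transitions; only `warmup_steps`, `batch_size`, and the `deque` buffer are taken from this write-up.

```python
from collections import deque


# Hypothetical sketch of the update gate; names other than warmup_steps
# and batch_size are illustrative.
def ready_to_update(memory, batch_size, warmup_steps):
    """Allow a gradient step only once the buffer passes both thresholds."""
    return len(memory) >= max(batch_size, warmup_steps)


memory = deque(maxlen=10_000)
for step in range(1_200):
    memory.append(step)   # stand-in for a (state, action, reward, next_state, done) tuple
    if step == 63:
        # 64 transitions stored: enough for a batch, but still in warm-up
        early = ready_to_update(memory, batch_size=64, warmup_steps=1_000)
late = ready_to_update(memory, batch_size=64, warmup_steps=1_000)
```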

One final change I will list here, though there were many, many more that occurred during development, is the move from the `MSELoss` function to the `SmoothL1Loss` loss function. Looking online, RL targets are often noisy and rarely clean. The `SmoothL1Loss` combines MSE and MAE behavior: it is quadratic for small errors and linear for large ones, so outliers pull on the network less than they do with `MSELoss`, which squares every error and is therefore sensitive to outliers. This update made a noticeable change to the output of the DQN model.
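The difference in growth between the two losses can be seen by writing the per-element formulas out by hand. The `smooth_l1` function below follows the PyTorch `SmoothL1Loss` definition with its default `beta` of 1.0; the function names themselves are only for this example.

```python
# Per-element losses written out by hand to compare how they grow.
# smooth_l1 follows the PyTorch SmoothL1Loss definition with beta = 1.0.
def mse(error):
    return error ** 2


def smooth_l1(error, beta=1.0):
    abs_error = abs(error)
    if abs_error < beta:
        return 0.5 * abs_error ** 2 / beta   # quadratic near zero
    return abs_error - 0.5 * beta            # linear for large errors


for td_error in (0.5, 1.0, 10.0, 100.0):
    print(td_error, mse(td_error), smooth_l1(td_error))
```

A TD error of 10 costs 100 under MSE but only 9.5 under Smooth L1, which is why a few noisy targets have less influence on the gradient.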

##### DQN Results

Below you can see the improvement that the DQN experienced during training over time.

![Figure_4](img/DQN_Learning_Figure_1.png)

Overall the DQN ended up having a nearly 100% success rate in all the times that I ran it in its final state. This is as complete and effective as the adaptive decay Q-Learning.

![Figure_5](img/DQN_Success_Rate_Figure_1.png)

Finally here is a gif of 10 runs of the DQN implementation:

![SegmentLocal](img/dqn_learning_looped.gif "segment")

#### Q-Learning VS Double DQN

Though both the adaptive decay Q-Learning and DQN generally experienced ~100% success rates over their 10 evaluation episodes in each of the runs I did, they did not **always** achieve this. The DQN did seem to have slightly better performance, rarely falling below the 100% success rate.

## Lunar Lander Reinforcement Learning

For the Lunar Lander Reinforcement Learning I opted to first extend the Double DQN I developed for the Mountain Car environment. I then moved on to extending the DQN with a Prioritized Experience Replay. Finally, to test an entirely different model I built a Proximal Policy Optimization algorithm (PPO) in the same environment.

### Lunar Lander Environment

The Lunar Lander environment is a rocket trajectory optimization problem. There are two versions: discrete and continuous. The goal is to have the ship land (not crash land) on the landing pad, which is always at the coordinates (0,0). It is important to note that fuel is infinite for the ship. This allows for learning before landing without fear of a time limit causing a crash landing.

The Action Space includes 4 Discrete Actions:
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine (center bottom)
- 3: Fire right orientation engine

The Observation Space is an 8-dimensional vector:
- 0: x-position of lander
- 1: y-position of lander
- 2: x-velocity of lander
- 3: y-velocity of lander
- 4: angle of the lander in radians
- 5: angular velocity of the lander (speed at which the angle is changing)
- 6: left leg contact boolean (represents whether or not the leg is in contact with the ground)
- 7: right leg contact boolean (represents whether or not the leg is in contact with the ground)

The base / initial Reward setup for the Lunar Lander has the total reward of an episode being the sum of the rewards for all the steps within the episode. Elements that impact the overall reward are:
- increased the closer, and decreased the further, the lander is to the landing pad
- increased the slower, and decreased the faster, the lander is moving
- decreased the more the lander is tilted (angle not horizontal)
- increased by 10 points for each leg that is in contact with the ground
- decreased by 0.03 points each frame a side engine is firing
- decreased by 0.3 points each frame the main engine is firing
- decreased by 100 for crashing
- increased by 100 for landing safely

It is important to note that an episode is considered a solution if it scores at least 200 points. **Note**: This is a noticeably different set up than the Mountain Car, where I was manually tweaking the rewards in the algorithms to influence behavior and reaching the goal location (flag) was considered a success. Due to this set up for the Lunar Lander I will not be shaping rewards as the algorithm progresses, to ensure readability and accuracy in the results.
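Under that criterion, a success rate over a set of evaluation episodes can be computed as sketched below. The helper name `success_rate` and the example scores are hypothetical, and the repository may track successes differently.

```python
# Hypothetical helper for scoring evaluation runs against the 200-point bar.
SOLVED_SCORE = 200.0


def success_rate(episode_scores, solved_score=SOLVED_SCORE):
    """Fraction of episodes whose total reward is at least solved_score."""
    if not episode_scores:
        return 0.0
    wins = sum(1 for score in episode_scores if score >= solved_score)
    return wins / len(episode_scores)
```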

The above and more information about the Lunar Lander environment can be found here: https://gymnasium.farama.org/environments/box2d/lunar_lander/ 

### Double DQN

#### Double DQN Implementation

Lunar Lander DQN: https://github.com/grsi7978/Q_Learning/blob/main/final_project_lunar_landing_DQN.py

After some research I opted to use the Lunar Lander in the discrete form. This allowed for use of the DQN model that I had been building and learning about in class. If I shifted to the continuous form then the ideal implementations would have been something along the lines of a DDPG or SAC instead of a Double DQN. Even if I included policy gradient extensions it seems as though a DQN is not ideal for the Lunar Lander continuous model. This is because the DQN still uses the highest Q-values when determining each action to take, and in a continuous space, to my understanding it is very difficult to determine the Q-value for every possible action since there are so many possibilities. 

Simply with the removal of the reward shaping from the Mountain Car implementation of the Double DQN I was able to easily transpose the Double DQN to the Lunar Lander. This means that I am still using the same layers and shape for the neural network as I did in the Mountain Car environment. 

Some other minor changes I made involved tweaking the parameters and episode count, changing how successes were tracked, as well as removing any normalization I was doing. One reason I dropped the normalization was due to the different types of discrete variables in the Observation Space (specifically things like the contact booleans for the legs). I also dropped the episode count to 1000. 

Through many iterations I made slight adjustments to my parameters. This included increasing the warm up steps from 1000 to 5000 to increase the number of samples the agent was receiving before learning. I also increased the epsilon decay rate to decrease the amount of time that the agent spent taking random actions. I cut the number of steps required before updating the target network in half to keep the Q-targets more fresh.

#### Double DQN Results

Below you can see a graph representing one set of training episodes.

![Figure_6](img/dqn_lunar_lander_reward_time_01.png)

The Lunar Lander DQN had varying success rates. I found that running a specific seed yielded higher success rates (nearly 100%), so during development I worked primarily with this seed. You can see the implementation and use of the seeds array in the code, and below is an example output of some of the training episodes and an evaluation run.

![Figure_7](img/Lunar_Lander_Output_Figure_1.png)

Below is a gif of ten evaluation runs of the Lunar Lander game:

![SegmentLocal](img/dqn_learning_lunar_lander_looped.gif "segment")

Overall the Lunar Lander Double DQN had strong but not perfect performance. To see overall performance I ran many runs with 10 evaluation episodes each over many seeds and took an average score, which came out to be a ~75% success rate. These seeds were chosen at random, before I selected seeds for strong performance.

### Prioritized Experience Replay Double DQN

Lunar Lander PER Double DQN: https://github.com/grsi7978/Q_Learning/blob/main/lunar_lander_per.py

In an attempt to further improve the DQN's success rate I decided to add in a Prioritized Experience Replay (PER). In the DQN I was storing all the experiences into the replay buffer and learning by randomly sampling a batch of these experiences to train. One of the main benefits of this is to help prevent the DQN from overfitting. However, some of the experiences that are stored are hypothetically more important to learn from than others (e.g. **very** successful runs, crashes, etc). With the addition of a PER the more important experiences have a higher chance of being selected for training. I used Temporal Difference Error to determine how much the agent could learn from the experience. For instance, a high TD error means that what occurred in the experience was more likely something that the agent had not already experienced before.

To accomplish this I determined priority based on the formula `priority = |TD Error| + epsilon`, with the sampling probability being proportional to `priority^alpha`. This way the higher the magnitude of the TD error (positive or negative), the higher its priority for selection. I mainly got this idea from this <a href="https://davidrpugh.github.io/stochastic-expatriate-descent/pytorch/deep-reinforcement-learning/deep-q-networks/2020/04/14/prioritized-experience-replay.html">walkthrough</a>. Then when sampling the experiences I used the priority weights to determine which experiences to grab. One important thing to note though is that there is bias introduced to the process because I am no longer sampling entirely randomly. This is corrected for in my code by `beta`, which controls how strongly that bias is compensated. Essentially by the end of training the goal is to have nearly unbiased training occurring.
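A small sketch of proportional prioritization with plain lists is shown here. It follows the common formulation from the PER literature (probabilities proportional to `priority^alpha`, importance-sampling weights of the form `(N * P(i))^(-beta)` normalized by the largest weight) and may differ from the repository's buffer; the function names and the default `alpha`, `beta`, and `eps` values are placeholders.

```python
# Sketch of proportional prioritization with plain lists. This follows the
# common PER formulation and may differ from the repository's buffer;
# the alpha, beta and eps defaults here are placeholders.
def per_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |td_error_i| + eps."""
    scaled = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized by the largest weight."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    largest = max(raw)
    return [w / largest for w in raw]


probs = per_probabilities([0.1, 1.0, 5.0])
weights = importance_weights(probs, beta=1.0)
```

Transitions with larger TD errors are sampled more often but receive smaller weights in the loss. At `beta = 1.0` the product of weight and probability is the same for every transition, which is the sense in which the sampling bias is fully compensated.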

#### PER Implementation

You can see here where I use the TD Error to update the priorities of the replay buffer:
```
        td_errors = (target_q - q_values).detach().cpu().numpy().squeeze()
        new_priorities = np.abs(td_errors) + 1e-5
        self.memory.update_priorities(indicies, new_priorities)
```

I toyed with different beta and beta incrementing values to better correct the sampling bias of the PER during training. The goal was to try to make beta reach around 1.0 by the end of the training. 
I also slowed the epsilon decay in this model as I was not seeing great results and wanted to increase the amount of random exploration time. During many iterations trying to find the ideal combination of parameters I tweaked the number of layers in the model, the number of steps taken during warm up, the learning rates, the epsilon and its decay, as well as the target update frequency.
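A linear annealing schedule for `beta` could be sketched as follows. The helper names, the starting value of 0.5, and the update count are placeholders for illustration, not the values used in the repository.

```python
# Hypothetical linear annealing of beta toward 1.0; the starting value
# and the number of updates are placeholders.
def beta_increment_for(beta_start, total_updates):
    """Per-update increment that brings beta to 1.0 after total_updates."""
    return (1.0 - beta_start) / total_updates


def anneal_beta(beta, beta_increment):
    """Raise beta by one increment, capped at 1.0."""
    return min(1.0, beta + beta_increment)


beta = 0.5
increment = beta_increment_for(beta, total_updates=4)
history = []
for _ in range(4):
    beta = anneal_beta(beta, increment)
    history.append(beta)
```

Deriving the increment from the planned number of updates avoids reaching 1.0 too early or never reaching it at all.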

#### PER Results

Below you can see a graph representing one set of training episodes.

![Figure_9](img/per_dqn_lunar_lander_reward_time_01.png)

Also an average breakdown of the success and failure rates of 10 episodes of evaluations for the PER:

![Figure_10](img/per_dqn_lunar_lander_success_rate_01.png)

Here is a gif of ten runs completed after the model has been trained using the PER Double DQN:

![SegmentLocal](img/per_dqn_learning_lunar_lander_looped.gif "segment")

Overall the addition of the PER seems to have minimal impact on the performance of the Double DQN. After running 20 sessions of 2000 episodes it had an accuracy of ~80% which was a slight increase in performance.

### PPO

Lunar Lander PPO: https://github.com/grsi7978/Q_Learning/blob/main/lunar_lander_ppo.py

I opted to build a Proximal Policy Optimization (PPO) for the Lunar Lander environment using the Stable-Baselines3 library because PPOs are good at learning sequential control tasks due to smooth policy updates, avoiding the wildly changing behaviors that can be seen in Q-Learning based approaches. PPOs are Reinforcement Learning algorithms that teach an agent how to act by directly learning a policy. A PPO completes its learning process a little at a time, in incremental steps. The term "Proximal" comes from the fact that the PPO clips updates to the new policy, keeping it proximal to the old one. From what I found online PPOs are generally considered stable and fast to train, which is somewhat different than DQNs.
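The clipping idea can be illustrated with the per-sample clipped surrogate objective written with plain floats. This is a sketch of the standard PPO objective rather than the Stable-Baselines3 internals; the function name is hypothetical, and the `clip_range` default of 0.2 matches the library's default setting.

```python
# Per-sample PPO clipped surrogate objective, written with plain floats.
# ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio past 1.2 earns no extra objective, so there is no incentive to move the policy far from the old one in a single update.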

#### PPO Implementation

For this implementation of the PPO, after a couple of more simple implementations, I used the gymnasium environment as the base environment, then wrapped it in a `DummyVecEnv`, followed by a `VecNormalize` to vectorize and normalize the inputs. I got this idea and implementation from <a href="https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html">here</a>. 
```
        gym_env = gym.make(env)
        self.env = DummyVecEnv([lambda: gym_env])
        self.env = VecNormalize(self.env)
        self.model = PPO(
```
After implementing the above, through iteration I found that it was better to leverage `SubprocVecEnv` (from the same link as above) to train the RL Agent on `n` environments per step instead of 1 environment per step:
```
        env_vec = SubprocVecEnv([lambda: gym.make("LunarLander-v3") for _ in range(4)])
        self.env = VecNormalize(env_vec)
        self.model = PPO(
```
I initially started training the PPO with just 500,000 timesteps but continually bumped this number up until hitting 5,000,000 where I saw better results. After making additional changes I ended up dropping this value back down to 2,000,000 for similar results. The evaluation function remained much the same as I had implemented in the Double DQN.

#### PPO Results

Overall I saw about an 85% success rate using the PPO and training with 2,000,000 steps, with 10 evaluation episodes over 15 complete runs. This is a slight improvement over the DQN, though not as much as I expected. Below is the output of one of the perfect runs that occurred during the many iterations I ran.

![Figure_11](img/PPO_Output_Figure_1.png)

You can see that the average score and standard deviation were very high and small respectively.

Below is an example set of evaluation runs from the PPO.

![SegmentLocal](img/ppo_learning_lunar_lander.gif "segment")

## Conclusion

The fact that I was able to port the DQN over from the Mountain Car to the Lunar Lander leads me to believe that it is flexible and scalable for most discrete environments (further testing needed). The Mountain Car is a 2D state space while the Lunar Lander is an 8D one. I also learned through experimentation that Normalization is not always beneficial as I saw worse results in the Lunar Lander when normalizing the states and data than I did simply leaving them alone. Normalization has its place and depends on feature distribution of the environments.

Through many, many iterations and parameter changes I have concluded that the DQN is a solid model and with the addition of the Experience Replay it remained a consistently successful model. The addition of the Prioritized Experience Replay to the Double DQN only had minimal impact, increasing accuracy by about 5%. This was less than I expected and may be due to my specific implementation or the parameters that I have chosen.

The PPO was faster to implement and worked just as well or better than the Double DQN with Experience Replay for the Lunar Lander. This makes sense to me as the PPO directly learns a policy or what action is best to take, while the DQN learns Q-Values to derive a policy. Because of the way the DQN learns from Q-Values there is potential for overfitting. The PPO on the other hand uses clipping to maintain small learning segments, preventing volatile updates during training. The PPO always trains on newer data, which keeps the learning fresh and prevents outdated information from tainting the learning, unlike a DQN. With all of that I honestly expected the PPO to outshine the DQN spectacularly. However, as mentioned, the improvement was slight. This is likely due to the hyperparameters that I have selected. I tested many different hyperparameter values in both the DQN and PPO, and many different combinations of said parameters. However, it seems like the PPO was even more sensitive to the tuning than the DQN was, as it was easy to quickly cause the success rate of the PPO to nearly zero out.

If I had more time I would have liked to spend more of it optimizing the parameters in the PPO to get better results. I would look into writing code that allowed for tuning on the fly as well, for instance updating the learning rate as time progressed to yield better results. Additionally I would have liked to work on the continuous form of the Lunar Lander Environment and develop a DDPG or SAC to test out those implementations.

## Reference

- DQN
    - https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
    - https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial 
    - https://www.slingacademy.com/article/implementing-deep-q-networks-dqn-in-pytorch-for-complex-environments/
    - https://medium.com/@hkabhi916/mastering-deep-q-learning-with-pytorch-a-comprehensive-guide-a7e690d644fc
    - https://davidrpugh.github.io/stochastic-expatriate-descent/pytorch/deep-reinforcement-learning/deep-q-networks/2020/04/14/prioritized-experience-replay.html
- PPO
    - https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
    - https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
    - https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html
    - https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py
    - https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb