## Mountain Car Reinforcement Learning

For this project I decided to implement what we have learned about reinforcement learning on the Mountain Car environment from the OpenAI Gym/Gymnasium tool (https://gymnasium.farama.org/). 

The intent is to develop multiple Q-Learning algorithms and determine if I can get better Rewards from one over the others. I will develop a basic Q-Learning algorithm, as well as a Deep Q-Network (DQN).

The majority of the work will be completed in individual .py files and compiled in the GitHub repository where this notebook is also located (https://github.com/grsi7978/Q_Learning/tree/main).

For the development of this project I worked locally on my personal machine. 

### Mountain Car Environment

The Mountain Car MDP is a deterministic MDP where the car is placed at the bottom of a valley. The goal of the MDP is to accelerate the car to reach the goal state (the flag) at the top of the right hill. There are two versions of the mountain car, one with discrete actions and one with continuous. For the purposes of this project I will be working with the one with discrete actions.

The Mountain Car environment is part of the Classic Control environments. Its Action Space = Discrete(3) and its Observation Space = Box([-1.2 -0.07], [0.6 0.07], (2,), float32).

The Action space includes three actions:
- 0: Accelerate car to the left
- 1: Do nothing
- 2: Accelerate car to the right

The Observation Space contains positional and velocity elements:
- The first element is the position which can vary between -1.2 and 0.6, representing the car's horizontal position on the track
- The second element is the velocity which can vary between -0.07 and 0.07, representing the car's actual velocity

This all means that at each step the car must choose one of the three discrete actions and is in a current state given by the continuous vector (Observation Space).

The base / initial Reward setup for the Mountain Car is a reward of -1 for each timestep that the car is not at the flag (goal).

The Mountain Car episode ends in a termination state if the position of the car is greater than or equal to 0.5 (the goal position) OR in a truncation state if the length of the episode is 200.
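
To make the reward, termination, and truncation rules concrete, here is a minimal pure-Python sketch of one environment step. The constants (force 0.001, gravity 0.0025) and the update rule follow the Gymnasium documentation, but the function names `step` and `run_episode` and the two toy policies are my own for illustration and are not part of Gymnasium's API. The real environment also draws the start position randomly from [-0.6, -0.4]; the sketch fixes it at -0.5 as a simplifying assumption.

```python
import math

FORCE, GRAVITY = 0.001, 0.0025        # constants from the Gymnasium docs
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED, GOAL_POS, MAX_STEPS = 0.07, 0.5, 200


def step(position, velocity, action):
    """Advance one timestep; action is 0 (left), 1 (nothing) or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0                # inelastic collision with the left wall
    terminated = position >= GOAL_POS
    return position, velocity, -1.0, terminated


def run_episode(policy, start_position=-0.5):
    """Run until termination or the 200-step truncation limit."""
    position, velocity, total_reward = start_position, 0.0, 0.0
    for _ in range(MAX_STEPS):
        action = policy(position, velocity)
        position, velocity, reward, terminated = step(position, velocity, action)
        total_reward += reward
        if terminated:
            return total_reward, True
    return total_reward, False        # truncated without reaching the flag


def always_right(position, velocity):
    return 2                          # full throttle toward the flag


def follow_velocity(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction of travel


print(run_episode(always_right))      # truncated: the engine alone is too weak
print(run_episode(follow_velocity))   # rocking back and forth reaches the flag
```

The contrast between the two policies is the reason the problem is interesting: the car has to move away from the goal first to build momentum, which the flat -1 reward per step does nothing to encourage.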

The above and more information about the environment can be found here: https://gymnasium.farama.org/environments/classic_control/mountain_car/ 

### Implementation

#### Q-Learning

Basic Q-Learning implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_q_learning.py

Basic Q-Learning with adaptive decay implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_q_learning_adaptive_decay.py

Initially I built a basic Random Agent that made random action decisions just to test out the environment. I left this agent in the code for posterity.

I then built a `QAgent` that leveraged Q-Learning to make better decisions and increase the rewards. This QAgent used a learning rate (`alpha`), discount factor (`gamma`), epsilon for exploration (`ep`), and some factors influencing epsilon decay (`ep_decay` and `ep_min`).

To complete the `QAgent` I also implemented a few key items:
- Q-table
    - Indexed by discretized states (via buckets)
- Update the Q-values with the Bellman Equation

I created an evaluation function `evaluate` to help in development. It runs the agent for a specified number of episodes (I generally had this number set to 20000), collects the total reward per episode, and returns the mean and standard deviation of said rewards.
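
The shape of such a function can be sketched as follows. This is not the code from the repository: the `run_episode` callable is a hypothetical stand-in for one greedy rollout of the agent, the reward totals fed to it are placeholder numbers rather than real results, and I am assuming a population standard deviation.

```python
import statistics


def evaluate(run_episode, episodes=10):
    """Run `episodes` rollouts and summarise their total rewards."""
    rewards = [run_episode() for _ in range(episodes)]
    return statistics.mean(rewards), statistics.pstdev(rewards)


# Placeholder totals so the sketch runs without an environment.
placeholder_totals = iter([-150.0, -160.0, -170.0])
mean_reward, std_reward = evaluate(lambda: next(placeholder_totals), episodes=3)
print(f"mean={mean_reward:.1f}, std={std_reward:.2f}")
```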

I made a function called `mapPosVel` that discretizes the continuous position and velocity states into bucketed indices, to ensure each state maps to a finite grid location in the built Q-table.
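
A minimal sketch of this kind of bucketing is shown below. It assumes uniform-width buckets over the observation bounds and clamps the upper edge into the last bucket; the name `map_pos_vel` and those details are illustrative and may differ from the actual `mapPosVel` in the repository.

```python
LOW = (-1.2, -0.07)      # observation-space lower bounds (position, velocity)
HIGH = (0.6, 0.07)       # observation-space upper bounds
BUCKETS = (20, 15)       # bucket counts per dimension


def map_pos_vel(state, low=LOW, high=HIGH, buckets=BUCKETS):
    """Map a continuous (position, velocity) pair to integer grid indices."""
    indices = []
    for value, lo, hi, n in zip(state, low, high, buckets):
        ratio = (value - lo) / (hi - lo)             # scale to [0, 1]
        index = int(ratio * n)                       # pick the bucket
        indices.append(min(n - 1, max(0, index)))    # clamp to a valid index
    return tuple(indices)


print(map_pos_vel((-1.2, -0.07)))  # lowest corner of the grid
print(map_pos_vel((0.6, 0.07)))    # highest corner, clamped into the last bucket
print(map_pos_vel((-0.5, 0.0)))    # a state near the bottom of the valley
```

With (20, 15) buckets and three actions the Q-table has only 20 x 15 x 3 = 900 entries, which is part of why the bucket sizes matter so much: too few and distinct situations collapse together, too many and each cell is visited too rarely to learn from.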

Then I developed an epsilon-greedy policy function `getAction` that chooses a random exploration action with probability `ep` and otherwise chooses the best exploitation action from the Q-table. In this function I implemented a toggle parameter that makes the function always exploit when set to True, which was only used when called from the `evaluate` function. The intent here is to test how good the agent is while using a learned policy, not while exploring / taking random actions.
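
As a rough illustration of that logic, the sketch below operates on the Q-values of a single state. The signature, the `greedy` toggle name, and the example Q-values are assumptions made for the sketch rather than the repository's `getAction`.

```python
import random


def get_action(q_values, ep, greedy=False, rng=random):
    """Epsilon-greedy choice over one state's Q-values (one entry per action)."""
    if not greedy and rng.random() < ep:
        return rng.randrange(len(q_values))      # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)


q_row = [-12.0, -9.5, -11.0]                     # made-up values for one state
print(get_action(q_row, ep=1.0, greedy=True))    # toggle on: epsilon is ignored
print(get_action(q_row, ep=0.0))                 # ep = 0 also always exploits
```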

The `update` function I created applies the Bellman Equation: it takes the current state and next state, computes a target from the reward plus the discounted best future Q-value, and applies the temporal difference between that target and the current estimate, scaled by `alpha`, as an update to the appropriate Q-table entry.
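
In code, the standard tabular Q-learning update looks roughly like this. The dictionary-based table and the `terminated` flag (which drops the bootstrap term at the goal) are choices made for this sketch and are not necessarily how the repository's `update` is written.

```python
def update(q_table, state, action, reward, next_state, alpha, gamma, terminated=False):
    """One tabular Q-learning step; returns the temporal-difference error."""
    best_next = 0.0 if terminated else max(q_table[next_state])
    td_target = reward + gamma * best_next           # reward + discounted best future value
    td_error = td_target - q_table[state][action]    # how far off the current estimate is
    q_table[state][action] += alpha * td_error       # move the estimate toward the target
    return td_error


# Tiny hand-made table: two discretized states, three actions each.
q_table = {(7, 7): [0.0, 0.0, 0.0], (8, 7): [1.0, 2.0, 0.5]}
update(q_table, (7, 7), 2, reward=-1.0, next_state=(8, 7), alpha=0.1, gamma=0.99)
print(q_table[(7, 7)])
```

Here the target is -1 + 0.99 * 2.0 = 0.98, so with `alpha` = 0.1 the entry for action 2 moves from 0 to roughly 0.098.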

The algorithm also decays the exploration over time by simply taking the max between the epsilon minimum and the current epsilon multiplied by the epsilon decay factor.
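
That rule is a one-liner. In the sketch below the function name and the numeric values (0.995 and 0.05) are example values only, not the tuned settings from the project.

```python
def decay_epsilon(ep, ep_decay, ep_min):
    """Multiplicative decay with a floor at ep_min."""
    return max(ep_min, ep * ep_decay)


ep = 1.0
for _ in range(1000):
    ep = decay_epsilon(ep, ep_decay=0.995, ep_min=0.05)   # example values only
print(ep)   # 0.995 ** 1000 is far below 0.05, so the floor is what remains
```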

The `qLearning` function applies the agent actions until termination for each episode. It uses some custom reward shaping that rewards the car for moving faster, being closer to the flag (goal), and accelerating in the direction it is currently moving, as well as giving a very large bonus for reaching the goal state. These rewards were added over multiple iterations while trying to help the algorithm learn which actions are best. It is in this function that I also track any successes that occur.
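
The four shaping terms could be combined along the following lines. Every weight here (`w_speed`, `w_pos`, `w_dir`, `goal_bonus`) is a hypothetical placeholder chosen for the sketch, as is the function name; the values actually used are in the repository code.

```python
GOAL_POS = 0.5
MIN_POS, MAX_POS = -1.2, 0.6


def shaped_reward(reward, next_state, action, terminated,
                  w_speed=1.0, w_pos=1.0, w_dir=0.1, goal_bonus=100.0):
    """Add four shaping terms to the base reward (all weights are placeholders)."""
    position, velocity = next_state
    shaped = reward
    shaped += w_speed * abs(velocity)                               # moving faster
    shaped += w_pos * (position - MIN_POS) / (MAX_POS - MIN_POS)    # nearer the flag
    if (action == 2 and velocity > 0) or (action == 0 and velocity < 0):
        shaped += w_dir                                             # pushing with the motion
    if terminated and position >= GOAL_POS:
        shaped += goal_bonus                                        # reached the goal
    return shaped


print(shaped_reward(-1.0, (-1.2, 0.0), 1, False))   # no bonus applies at the left wall
print(shaped_reward(-1.0, (0.5, 0.03), 2, True))    # all four bonuses apply at the flag
```

One consequence of shaping like this is that the episode totals are no longer directly comparable to the environment's raw -1 per step score, which is worth keeping in mind when reading reward numbers.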

This implementation required a lot of parameter tuning. You can see just some of the tuning I did in the code comments as I began recording already tested values after it became too difficult to remember all the iterations that I attempted. For instance, for the buckets I was using for the discretized states I iterated through many different sizes ((40,30), (24,18), (18,14), (20,15)) and some others that were not recorded before finding the best combination to be (20,15). Other basic parameters that were tweaked many times included the epsilon decay (`ep_decay`) and epsilon minimum (`ep_min`).

In an attempt to achieve better Reward scores I also implemented a nearly identical version of the Q-Learning implementation, with the addition of an adaptive decay. In this version epsilon decays faster if the reward improves beyond a set threshold (close to the goal threshold) and decays more slowly if the reward does not move beyond said threshold. The intent here was to help prevent early convergence.
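
One way to express that idea is sketched below. The two decay rates, the floor, and the threshold in the example calls are hypothetical values, and comparing a single reward figure against the threshold is my simplification; the repository version may track improvement differently.

```python
def adaptive_decay(ep, reward, threshold,
                   fast_decay=0.99, slow_decay=0.999, ep_min=0.01):
    """Decay epsilon faster once the reward clears the threshold, slower otherwise."""
    rate = fast_decay if reward > threshold else slow_decay
    return max(ep_min, ep * rate)


print(adaptive_decay(0.5, reward=-120.0, threshold=-130.0))   # above threshold: fast decay
print(adaptive_decay(0.5, reward=-190.0, threshold=-130.0))   # below threshold: slow decay
```

Keeping epsilon high while rewards are poor means the agent keeps exploring until it has actually found a route to the flag, instead of committing to a policy that never gets there.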

I also added a function `plotRewards` that simply plotted the learning progress and success rate of the algorithm. See below for an example output of the graph. This function helped me when tuning the parameters mentioned above as I could compare graphs and see how fast or consistently the learning was progressing.

![Figure_1](img/Q_Learning_Figure_1.png)

I also had the learning function outputting print statements that included the episode number, the total reward for the episode, and the current epsilon value. To make the output manageable I restricted the output to every 100 episodes as I was generally running for > 15000 episodes. The Average Reward and standard deviation from the Final Evaluation over 10 episodes, as well as the total number of successful runs, were also output. This was another way I could monitor the trends and see if there was any improvement while I was tweaking the parameters and the reward scaling.

![Figure_2](img/Output_Figure_1.png)

Seeing a positive trend in the success rate was the best true measure that the algorithm was working as intended. It appears that a Reward score of < ~-130 would generally lead to a successful run, but manually tracking the actual successes was the only real way I was able to guarantee that a run completed with the car reaching the flag.

##### Results

Overall the adaptive decay Q-Learning method had ~100% success rate with a somewhat low Reward score (~140-170) while the non-adaptive Q-Learning method ranged from 60%-90% success rates generally but with better average scores. This indicates that when the non-adaptive method was successful it was successful noticeably faster than the adaptive decay method, but the adaptive decay method was consistently successful while the non-adaptive was not.

#### DQN

DQN implementation: https://github.com/grsi7978/Q_Learning/blob/main/final_project_dqn.py 






![Figure_3](img/Output_Figure_2.png)


##### Changes Made During Development

Initially I did not have reward bonuses for certain state / action combinations like I eventually did in the Q-Learning models, meaning I was not rewarding for being closer to the goal state, for moving in the right direction, etc. I was not sure that these would matter as much in a DQN. However, after many failed attempts at getting satisfying results I did end up adding some of these, namely a `velocity_bonus`, a `position_reward`, and a large early termination reward (successful termination):

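```
# Excerpt from the training loop; `env`, `next_state`, and `reward` come from the surrounding code.
# Small bonus for speed in either direction (index 1 is velocity).
velocity_bonus = 0.1 * abs(next_state[1])
# Reward progress toward the right-hand side, scaled to [0, 0.3] (index 0 is position).
position_min = env.observation_space.low[0]
position_max = env.observation_space.high[0]
position_reward = 0.3 * ((next_state[0] - position_min) / (position_max - position_min))

new_reward = reward + velocity_bonus + position_reward
```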
I played around with some different values for these, such as adding 10.0 for early successful termination initially, but eventually bumped that up to 100.0. 

There were many other values that I toyed with, trying to massage the DQN so that it would perform better. Some such values included the `episodes` count used for training (tried 1000, 1500, landed on 2000), the learning rate (`lr`) of the model (tried 1e-3, 1e-4, 3e-4), and the target update frequency (`target_update_freq`) of the model (started at 5000, landed on 1000). There were many other values that I spent time tweaking up or down depending on what the variable was being used for, but eventually I found a manageable balance for the model where I was getting 10/10 successes on every run.

One of the most important changes to a value that I made was an increase to the number of neurons in the network model. I initially had the DQN model with `56` units per layer but bumped it up to `128` per layer. The intent here was to help the model approximate the Q-values better.

One of the other many changes I made during implementation was to normalize the velocity and position features to help the neural net better use the state values and use both features more equally. Velocity spans a much smaller range / is on a smaller scale (-0.07 to 0.07 with a range of 0.14) than Position (-1.2 to 0.6 with a range of 1.8) meaning that both were likely not influencing the learning nearly as evenly as I would like.

Initially I had code that looked like this:
```
next_states = torch.FloatTensor(next_states).to(self.device)
```
But I normalized it so it looked like this:
```
next_states = (torch.FloatTensor(next_states).to(self.device) - state_mean) / state_std
```
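
The snippets above do not show where `state_mean` and `state_std` come from. One plausible choice, sketched here in plain Python, is to derive them from the observation bounds so that both features land in [-1, 1]. This is an assumption on my part, and with this choice `state_std` is really a half-range rather than a statistical standard deviation; the same arithmetic applies unchanged to tensors.

```python
LOW = (-1.2, -0.07)     # observation-space lower bounds (position, velocity)
HIGH = (0.6, 0.07)      # observation-space upper bounds

# Assumed choice: centre on the midpoint and scale by the half-range.
state_mean = tuple((lo + hi) / 2 for lo, hi in zip(LOW, HIGH))
state_std = tuple((hi - lo) / 2 for lo, hi in zip(LOW, HIGH))


def normalize(state):
    """Rescale (position, velocity) so each feature lies in roughly [-1, 1]."""
    return tuple((x - m) / s for x, m, s in zip(state, state_mean, state_std))


print(normalize((0.6, 0.07)))     # upper bounds map to about (1, 1)
print(normalize((-1.2, -0.07)))   # lower bounds map to about (-1, -1)
```

Whatever statistics are used, the same normalization has to be applied to `states`, to `next_states`, and to the single state passed in when selecting an action, otherwise the network sees inputs on different scales during training and acting.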

Another change I made during implementation was to add a warm-up period to the agent by delaying its updates for a certain length of time (1000 steps). I accomplished this simply by adding a `warmup_steps` variable to the agent and comparing it to the number of experiences in the replay buffer `self.memory`. If the number of experiences was less than `warmup_steps` or `batch_size`, then no update would occur. I already had the `batch_size` limitation in place to ensure there was enough data for one mini-batch, but the `warmup_steps` limitation ensures that there is a diverse set of transitions in the buffer. Through some research early on, when the DQN was performing terribly, I saw that early episodes are often considered poor to use as they may have sparse rewards and everything is still random. This implementation allowed for more valuable transitions to be included in the batches.
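
The gating logic amounts to a single length check on the buffer. The class below is a stripped-down illustration only: the class name, `batch_size=64`, and the buffer capacity are hypothetical, while the 1000-step warm-up matches the value mentioned above.

```python
from collections import deque


class WarmupGate:
    """Holds the replay buffer and decides when learning may start."""

    def __init__(self, batch_size=64, warmup_steps=1000, capacity=50_000):
        self.memory = deque(maxlen=capacity)
        self.batch_size = batch_size
        self.warmup_steps = warmup_steps

    def remember(self, transition):
        self.memory.append(transition)

    def ready_to_update(self):
        # Skip the update until the buffer holds both a full mini-batch
        # and at least `warmup_steps` transitions.
        return len(self.memory) >= max(self.batch_size, self.warmup_steps)


gate = WarmupGate()
for step_index in range(999):
    gate.remember(("state", 0, -1.0, "next_state", False))   # dummy transition
print(gate.ready_to_update())    # still warming up
gate.remember(("state", 0, -1.0, "next_state", False))
print(gate.ready_to_update())    # 1000 transitions stored, updates may begin
```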

One final change I will list here, though there were many more that occurred during development, is the move from the `MSELoss` function to the `SmoothL1Loss` loss function. Looking online, RL targets are often noisy and rarely clean. `SmoothL1Loss` combines the behavior of MSE and MAE: it is quadratic for small errors and linear for large ones, so it handles both, whereas `MSELoss` handles small errors well but is sensitive to outliers because it squares them. This update made a noticeable change to the output of the DQN model.
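
The difference is easiest to see on a single error value. The sketch below re-implements both losses for one scalar in plain Python, using the piecewise definition from the PyTorch documentation for `SmoothL1Loss` with its default `beta` of 1.0; it is an illustration of the formulas, not a replacement for the library functions.

```python
def mse(error):
    """Squared error for a single TD error."""
    return error ** 2


def smooth_l1(error, beta=1.0):
    """Quadratic below `beta`, linear above it."""
    abs_error = abs(error)
    if abs_error < beta:
        return 0.5 * abs_error ** 2 / beta
    return abs_error - 0.5 * beta


for error in (0.5, 10.0):
    print(error, mse(error), smooth_l1(error))
```

For an error of 0.5 the two losses are close (0.25 versus 0.125), but for an error of 10 the squared loss is 100 while the smooth L1 loss is only 9.5, so a single bad target pulls on the network far less.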

##### Results

#### Q-Learning VS DQN

### Conclusion

### Reference

- DQN
    - https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
    - https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial 
    - https://www.slingacademy.com/article/implementing-deep-q-networks-dqn-in-pytorch-for-complex-environments/
    - https://medium.com/@hkabhi916/mastering-deep-q-learning-with-pytorch-a-comprehensive-guide-a7e690d644fc