## Week 1: Customizing the environment

### Environment understanding
First things first, we took the time to understand the underlying forces at play in this environment. The following are the collected insights:
- **Action Space - Discrete vs Continuous**: By default, this environment has a continuous action space (e.g. steering $s \in [-1, 1]$). However, it can be converted to a discrete action space (e.g. by only allowing full left or full right steering).

- **Friction**: As per the environment implementation, friction is a vector applied in the opposite direction to the car's motion and proportional to its current speed.

- **Grip**: Describes the adherence of the car to the track. If the angle between the rear wheels and the car's current direction of motion is too great, especially at high speeds, the car loses grip and enters a **drifting motion**.
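To make the discrete conversion concrete, here is a minimal sketch that maps a handful of discrete action ids to continuous `(steering, gas, brake)` triples. The ids and magnitudes below are hypothetical and chosen only for illustration; they are not necessarily the ones the environment uses in its own discrete mode.

```python
# Hypothetical discrete action table: id -> (steering, gas, brake).
# Steering lies in [-1, 1] (-1 is full left); gas and brake lie in [0, 1].
DISCRETE_ACTIONS = {
    0: (0.0, 0.0, 0.0),   # do nothing
    1: (-1.0, 0.0, 0.0),  # full left
    2: (1.0, 0.0, 0.0),   # full right
    3: (0.0, 1.0, 0.0),   # full gas
    4: (0.0, 0.0, 1.0),   # full brake
}


def to_continuous(action_id):
    """Translate a discrete action id into a continuous action triple."""
    return DISCRETE_ACTIONS[action_id]


print(to_continuous(1))  # (-1.0, 0.0, 0.0)
```

A table like this keeps the agent's output small while still driving the continuous environment underneath.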

Moreover, by reading the environment source code (available under [car_racing.py](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/car_racing.py) and [car_dynamics.py](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/car_dynamics.py)), we found that:

- The track itself is made out of tiles, which are squares with coordinates and rotation variables. They are held in a list in the `CarRacing` class, under `self.track`. Knowing this may make it possible to calculate the line with minimum curvature (ideal trajectory).

- Each wheel in `car.wheels` *knows* if it is over one or more road tiles. This is done by checking `len(wheel.tiles)`.
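The `len(wheel.tiles)` check can be wrapped in small helpers, sketched below. The function names are ours, and the stand-in objects only mimic the `wheels` and `tiles` attributes mentioned above; a real `Car` instance would be passed in their place.

```python
from types import SimpleNamespace


def wheels_on_track(car):
    """Count how many wheels currently touch at least one road tile."""
    return sum(1 for wheel in car.wheels if len(wheel.tiles) > 0)


def is_off_track(car):
    """Treat the car as off-track when no wheel touches a road tile."""
    return wheels_on_track(car) == 0


# Stand-in car with one wheel on a tile and three wheels on grass.
fake_car = SimpleNamespace(wheels=[
    SimpleNamespace(tiles={"tile_a"}),
    SimpleNamespace(tiles=set()),
    SimpleNamespace(tiles=set()),
    SimpleNamespace(tiles=set()),
])
print(wheels_on_track(fake_car), is_off_track(fake_car))  # 1 False
```

Requiring all wheels to be off the road is one possible definition of "off-track"; a stricter rule (for example, any wheel off the road) would only change the comparison.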

### Reward modifications
We noticed that the reward function (as described [here](https://gymnasium.farama.org/environments/box2d/car_racing/#rewards)) is very shallow, in the sense that it encourages progress (by rewarding each tile completed) but does not encourage any particular behaviour to achieve that progress. As such, we plan to implement the following modifications:

- **Gas Bias**: Pushing forward must be encouraged. As such, and to counteract the early training association<br>
`moving = crashing`, a small reward will be provided for pushing Gas.

- **Wiggle Protection**: Models trained on this sort of environment often perform what is called **intentional wiggling**: in a curve, for example, the car switches sharply and repeatedly between right-steering and left-steering. Models learn this technique to preserve grip. It is, however, also a strategy that exploits simple environments, including `CarRacing-v3`, in which steering costs no speed. To prevent this, sharply changing steering direction incurs a moderate penalty.

- **Off-road Penalty**: As described below under *Additional modifications*, the simulation is truncated shortly after the car leaves the track if it does not return quickly. However, applying a flat penalty at truncation does not seem appropriate, since it gives the agent no context or warning.
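As a rough illustration of the first two modifications, the sketch below adds a gas bonus and a steering-reversal penalty on top of the environment's base reward. The function name, the coefficients and the reversal threshold are all placeholders, not the values used in this project, and the off-road penalty is deliberately left out.

```python
def shaped_reward(base_reward, action, prev_action,
                  gas_bonus=0.05, wiggle_threshold=1.0, wiggle_penalty=0.1):
    """Add a gas bonus and a steering-reversal penalty to the base reward.

    `action` and `prev_action` are (steering, gas, brake) triples.
    All coefficients are illustrative placeholders, not tuned values.
    """
    steering, gas, _brake = action
    prev_steering = prev_action[0]

    reward = base_reward
    # Gas bias: small reward proportional to how hard the gas is pressed.
    reward += gas_bonus * gas
    # Wiggle protection: penalise a sign flip combined with a large jump.
    flipped = steering * prev_steering < 0
    if flipped and abs(steering - prev_steering) >= wiggle_threshold:
        reward -= wiggle_penalty
    return reward


# Sharp left-to-right reversal while on full gas.
sharp = shaped_reward(0.0, (0.8, 1.0, 0.0), (-0.8, 0.0, 0.0))
# Gentle correction in the same direction, also on full gas.
gentle = shaped_reward(0.0, (0.3, 1.0, 0.0), (0.2, 1.0, 0.0))
print(sharp < gentle)  # True
```

Keeping the shaping terms small relative to the per-tile reward is one way to avoid them dominating the original objective.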

### PID Controller-like reward system
In the autonomous racing industry, dynamic error correction is a common theme. The industry standard for such systems is the **PID Controller**, which corrects an error using three terms: the error itself (Proportional factor), its integral (Integral factor) and its derivative (Derivative factor). With this in mind, the following reward factors were implemented in addition to the previous ones:

- **Optimal Line Closeness (P factor)**: In a realistic environment, turning, even while accelerating, causes *some* speed loss. Because of this, the optimal line is the one with the least curvature, that is, the one where the driver minimizes steering along the whole track. With that in mind, we applied **Laplacian Smoothing** to all tile coordinates in order to generate that optimal line, and we reward the agent for **traversing** the line (**not** for standing still close to it).

- **Line Angle Minimization Reward (D factor)**: Velocity is the derivative of position. So, by looking at a car's linear velocity, one can deduce where (and how fast) it will go in the next few moments. Therefore, if the car's current linear velocity vector makes a large angle with the track's optimal line, the car will deviate very quickly and should correct itself as soon as possible. Because of this, a reward is given to the driver to encourage minimizing that angle difference. This prevents the driver from **"Snaking"** along the track.
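The two factors above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the code in `customization.py`: the track is treated as a closed loop of `(x, y)` points, and `alpha`, `iterations` and both function names are hypothetical.

```python
import math


def laplacian_smooth(points, alpha=0.5, iterations=10):
    """Smooth a closed loop of (x, y) points.

    Each pass moves every point a fraction `alpha` of the way towards the
    midpoint of its two neighbours, which reduces local curvature.
    """
    pts = [tuple(p) for p in points]
    n = len(pts)
    for _ in range(iterations):
        new_pts = []
        for i in range(n):
            (px, py), (x, y), (nx, ny) = pts[i - 1], pts[i], pts[(i + 1) % n]
            mx, my = (px + nx) / 2.0, (py + ny) / 2.0
            new_pts.append((x + alpha * (mx - x), y + alpha * (my - y)))
        pts = new_pts
    return pts


def heading_error(velocity, tangent):
    """Unsigned angle (radians) between the velocity and the line tangent."""
    vx, vy = velocity
    tx, ty = tangent
    cross = vx * ty - vy * tx
    dot = vx * tx + vy * ty
    return abs(math.atan2(cross, dot))


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(laplacian_smooth(square, iterations=1)[0])  # (0.25, 0.25)
print(heading_error((1.0, 0.0), (0.0, 1.0)))      # pi / 2
```

Note that repeated Laplacian passes also shrink a closed loop towards its centroid, so the number of iterations has to stay low enough for the smoothed line to remain on the track.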

### Additional modifications

- **Early stopping**: Stops the simulation once the car has strayed too far from the track, instead of letting it wander around aimlessly. This speeds up training.

- **HUD removal**: The original box shape (96x96) includes the HUD, which displays bars corresponding to the current action (steering, gas and brake). Passing this cluster of pixels to the model adds either redundancy or noise, both of which may impact performance. Consequently, the bottom 12px are cut, leaving the observation space an 84x96 box.

- **Frame stacking**: In this environment, there is no visual indication of how fast the car is currently going. As such, for each model prediction (and also in training), a stack of the last 4 frames is passed.
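The three modifications above can be sketched without any environment dependency. The class and function names are hypothetical, `patience` is a placeholder value, and frames are represented as nested lists here; the same row slicing also works on array-like frames.

```python
from collections import deque

HUD_ROWS = 12  # bottom rows occupied by the HUD, per the text above


def crop_hud(frame):
    """Drop the bottom HUD rows; works on any row-sliceable frame."""
    return frame[:len(frame) - HUD_ROWS]


class FrameStack:
    """Keep the most recent `size` frames, oldest first."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # Repeat the first frame so the stack is full from the start.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque silently discards the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


class OffTrackTimer:
    """Signal truncation after `patience` consecutive off-track steps."""

    def __init__(self, patience=50):
        self.patience = patience
        self.steps_off = 0

    def update(self, off_track):
        self.steps_off = self.steps_off + 1 if off_track else 0
        return self.steps_off >= self.patience


frame = [[0] * 96 for _ in range(96)]
cropped = crop_hud(frame)
print(len(cropped), len(cropped[0]))  # 84 96
```

Resetting the counter whenever the car is back on the road matches the idea of truncating only if it "does not return quickly".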

---

All these implementations can be viewed in [customization.py](customization.py).

## Week 2: Reinforcement Learning Agent
Having customized the environment to our needs (namely the reward function), we moved on to choosing the reinforcement learning algorithm. We aim to implement many different algorithms and compare them, if time allows it.

As both our input and output are **continuous**, and since we chose to implement an on-policy model, we are left with 3 options: PPO, A3C and TRPO. We will implement and train the models in this order.

### PPO (Proximal Policy Optimization)
> **TODO -- Small text explaining overall PPO**

#### Starting hyperparameters

- `policy = "CnnPolicy"`: Since the Car Racing environment observation space is, essentially, a stack of images, a CNN is the most appropriate choice.

- `use_sde = True`: gSDE, or generalized State-Dependent Exploration, is an advanced exploration strategy for deep reinforcement learning (especially algorithms like PPO and A2C) designed to make the agent's actions smoother and more consistent over time. It replaces the standard "random jitter" noise with a **"structured" noise** that depends on the state of the environment. This helps in environments that mimic physics-related problems by "directing" the agent's exploration consistently throughout an episode, avoiding rapidly changing inputs in situations where only longer, consistent ones have measurable outcomes.
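The two settings above match keyword arguments of the `PPO` class in Stable-Baselines3. Assuming that is the library in use (the text does not name it), they could be gathered as in the sketch below, where `env` stands for the customized environment and is a placeholder.

```python
# Keyword arguments described above, gathered in one place.
ppo_kwargs = {
    "policy": "CnnPolicy",  # CNN feature extractor for image observations
    "use_sde": True,        # generalized State-Dependent Exploration
}

# Assumed usage with Stable-Baselines3 (not executed here):
#   from stable_baselines3 import PPO
#   model = PPO(env=env, **ppo_kwargs)
print(ppo_kwargs)
```

Keeping the hyperparameters in a single dictionary makes it easier to log them alongside each training run.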

> **TODO -- complete hyperparameter description and explanation**