# Train a Smartcab to Drive
## Reinforcement learning

### Implement a Basic Driving Agent
To begin, your only task is to get the smartcab to move around in the environment. 
At this point, you will not be concerned with any sort of optimal driving policy. 
Note that the driving agent is given the following information at each intersection:

- The next waypoint location relative to its current location and heading.
- The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
- The current time left from the allotted deadline.

To complete this task, simply have your driving agent choose a random action 
from the set of possible actions (None, 'forward', 'left', 'right') at each intersection, disregarding the input information above. 
Set the simulation deadline enforcement, enforce_deadline to False and observe how it performs.

**QUESTION:** 

Observe what you see with the agent's behavior as it takes random actions. 
Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?

**Answer:**

The agent's behaviour is, unsurprisingly, random. At times it moves in a circle (around the block) without making net progress in any direction. There is no logic behind its movement, because rewards for previous states and actions are ignored.
The agent can also drive off the right edge of the world and reappear on the left edge.
Even so, the smartcab does occasionally make it to its destination.
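
The random behaviour and the wrap-around at the edge of the world can be illustrated with a minimal sketch. The action list comes from the task description; the 8 x 6 grid size, the zero-based coordinates and the helper names `choose_random_action` and `wrap` are assumptions made for illustration, not the simulator's actual code.

```python
import random

# The four actions listed in the task description.
ACTIONS = [None, 'forward', 'left', 'right']

# Hypothetical grid size, using zero-based coordinates for simplicity.
GRID_WIDTH, GRID_HEIGHT = 8, 6


def choose_random_action(rng=random):
    """Pick an action uniformly at random, ignoring all sensor inputs."""
    return rng.choice(ACTIONS)


def wrap(position):
    """Map a position that ran off one edge back onto the opposite edge."""
    x, y = position
    return (x % GRID_WIDTH, y % GRID_HEIGHT)


print(choose_random_action())
print(wrap((8, 2)))  # one step past the right edge -> (0, 2)
```

Because no input is consulted, nothing steers the agent towards the waypoint, which is consistent with the circling described above.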

### Inform the Driving Agent
Now that your driving agent is capable of moving around in the environment, 
your next task is to identify a set of states that are appropriate for modeling the smartcab and environment. 
The main source of state variables are the current inputs at the intersection, but not all may require representation. 
You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, 
process the inputs and update the agent's current state using the self.state variable. 

Continue with the simulation deadline enforcement enforce_deadline being set to False, 
and observe how your driving agent now reports the change in state as the simulation progresses.

**QUESTION:** 

What states have you identified that are appropriate for modeling the smartcab and environment? 
Why do you believe each of these states to be appropriate for this problem?

**Answer:**

At each intersection the agent perceives six variables: left, right, oncoming, light, next_waypoint and deadline.
Traffic coming from the right does not affect the smartcab's decision, so this variable was removed. For example, if the light is green, traffic from the right has stopped and can be ignored. Likewise, if the light is red, the smartcab can only turn right, and traffic coming from the right does not affect that option. The deadline input was not used because at this point it does not provide any significant information to the smartcab.

Each state identifies a situation the smartcab can be in, and each action taken in that situation results in a particular reward. Once Q-learning is implemented (below), the smartcab will be able to use these states to learn and determine the best path to its destination.

Final state variables used: light, oncoming, left, next_waypoint
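
One possible way to turn the selected inputs into a state is a plain tuple, which is hashable and can later serve as a dictionary key. This is only a sketch: the `inputs` dictionary layout and the helper name `build_state` are assumptions for illustration, and in the agent the result would be assigned to `self.state`.

```python
def build_state(inputs, next_waypoint):
    """Combine the selected inputs into a hashable state tuple.

    The 'right' input and the deadline are deliberately left out.
    """
    return (inputs['light'], inputs['oncoming'], inputs['left'], next_waypoint)


# Hypothetical sensor reading at one intersection.
sensed = {'light': 'red', 'oncoming': None, 'left': 'forward', 'right': None}
state = build_state(sensed, 'right')
print(state)  # ('red', None, 'forward', 'right')
```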

**OPTIONAL:** 

How many states in total exist for the smartcab in this environment? 
Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?

**Answer:**

- Light: Red or green
- Oncoming: None, left, right, forward
- Left: None, left, right, forward
- Next_waypoint: None, left, right, forward

With 4 inputs per state, each taking between 2 and 4 values, there are 2 x 4 x 4 x 4 = 128 states in total for the smartcab in this environment.
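
The count can be checked by enumerating every combination of the values listed above. The variable names below are chosen for illustration only.

```python
from itertools import product

LIGHTS = ['red', 'green']
DIRECTIONS = [None, 'left', 'right', 'forward']

# One entry per (light, oncoming, left, next_waypoint) combination.
all_states = list(product(LIGHTS, DIRECTIONS, DIRECTIONS, DIRECTIONS))
print(len(all_states))  # 128
```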

### Implement a Q-Learning Driving Agent
With your driving agent being capable of interpreting the input information and having a mapping of environmental states, 
your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, 
based on the Q-values for the current state and action. 

Each action taken by the smartcab will produce a reward which depends on the state of the environment. 
The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, 
set the simulation deadline enforcement enforce_deadline to True. Run the simulation and observe how the smartcab moves about the environment in each trial.

**QUESTION:**

What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? 
Why is this behavior occurring?

**Answer:**

At the beginning of the trials the agent's behavior remains the same as before. With a larger "n_trials" (more trials), however, the agent no longer seems to act randomly, and its movement tends towards the destination. It appears more "structured" or intentional instead of moving randomly or sometimes in circles.

The agent also reaches its destination more often and seemingly more quickly, even though the deadline was disabled during the random trials. The metrics used to track the agent's performance are:

- Reward: Total reward, sum of positive and negative rewards received at each state
- Win: The number of times the agent reaches its destination (deadline is enforced)
- Lose: The number of times the agent runs out of time without reaching its destination (deadline is enforced)
- Penalties: Total negative reward, sum of negative rewards received

This behavior occurs because the agent is learning from its past experience. The Q-table starts off empty, and as the agent explores its world the Q-table is updated based on the outcome of each move (the state, action and reward), helping the agent make "better" decisions that maximize reward.

It also appears that, with no reward for making it to the destination sooner rather than later, the agent attempts to maximize reward by following waypoints that may not result in the shortest path to the destination.
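
The learning loop described here can be sketched with the standard tabular Q-learning update. This is a minimal illustration rather than the project's actual agent code: the class name `QTable`, the default parameter values and the example reward of 2.0 are all assumptions.

```python
import random
from collections import defaultdict

ACTIONS = [None, 'forward', 'left', 'right']


class QTable:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, alpha=0.5, gamma=0.5, epsilon=0.1):
        # Hypothetical defaults; the tuned values are not given here.
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

    def best_action(self, state):
        """Return an action with the highest Q-value, breaking ties randomly."""
        best = max(self.q[(state, a)] for a in ACTIONS)
        return random.choice([a for a in ACTIONS if self.q[(state, a)] == best])

    def choose_action(self, state):
        """Explore with probability epsilon, otherwise exploit."""
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return self.best_action(state)

    def update(self, state, action, reward, next_state):
        """Move Q(state, action) towards reward + gamma * max Q(next_state, .)."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


table = QTable(alpha=0.5, gamma=0.5, epsilon=0.0)
s = ('green', None, None, 'forward')
table.update(s, 'forward', 2.0, s)
print(table.q[(s, 'forward')])  # 1.0
print(table.best_action(s))     # forward
```

After a single rewarded move, 'forward' already has the highest value for that state, so the greedy choice stops being random there. Repeating this over many states and trials is what makes the movement look intentional.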


The tables below show the results of 10 runs of 50 trials each (n_trials = 50): 5 runs with random actions and 5 runs with the implemented Q-learning algorithm.

Random actions: Average **Rewards: 7.6, Win: 11, Lose: 39, Penalties: -568**

Rewards | Win | Lose | Penalties
--- | --- | --- | ---
49.0 | 11 | 39 | -589.5
-34.5 | 13 | 37 | -593.0
-31.0 | 7 | 43 | -558.0
49.5 | 14 | 36 | -512.5
5.0 | 10 | 40 | -587.0

Q-Learning: Average **Rewards: 1059, Win: 40, Lose: 10, Penalties: -164**

Rewards | Win | Lose | Penalties
--- | --- | --- | ---
887.5 | 36 | 14 | -155.0
1016.0 | 41 | 9 | -195.5
1154.5 | 47 | 3 | -137.5
1097.0 | 36 | 14 | -145.0
1140.5 | 42 | 8 | -185.5
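
The reported averages can be reproduced from the rows of the two tables. The sketch below copies the table values verbatim; the helper name `column_means` is chosen for illustration.

```python
# Columns: Rewards, Win, Lose, Penalties (one tuple per run).
random_runs = [
    (49.0, 11, 39, -589.5),
    (-34.5, 13, 37, -593.0),
    (-31.0, 7, 43, -558.0),
    (49.5, 14, 36, -512.5),
    (5.0, 10, 40, -587.0),
]
q_learning_runs = [
    (887.5, 36, 14, -155.0),
    (1016.0, 41, 9, -195.5),
    (1154.5, 47, 3, -137.5),
    (1097.0, 36, 14, -145.0),
    (1140.5, 42, 8, -185.5),
]


def column_means(rows):
    """Average each column across all runs."""
    return [sum(col) / len(col) for col in zip(*rows)]


print(column_means(random_runs))
print(column_means(q_learning_runs))
```

With 50 trials per run, the average win count rises from 11 (22% of trials) with random actions to about 40 (roughly 81%) with Q-learning.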


### Improve the Q-Learning Driving Agent
Your final task for this project is to enhance your driving agent so that, after sufficient training, 
the smartcab is able to reach the destination within the allotted time safely and efficiently. 
Parameters in the Q-Learning algorithm, such as the learning rate (alpha), 
the discount factor (gamma) and the exploration rate (epsilon) all contribute to the driving agent’s ability to learn the best action for each state. 
To improve on the success of your smartcab:

- Set the number of trials, n_trials, in the simulation to 100.
- Run the simulation with the deadline enforcement enforce_deadline set to True (you will need to reduce the update delay update_delay and set the display to False).
- Observe the driving agent’s learning and smartcab’s success rate, particularly during the later trials.
- Adjust one or several of the above parameters and iterate this process.

This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.

**QUESTION:** 

Report the different values for the parameters tuned in your basic implementation of Q-Learning. 
For which set of parameters does the agent perform best? How well does the final driving agent perform?

**Answer:**

Debug print statements were removed from the other project files.

- Alpha (learning rate) and gamma (discount factor) were used.
- If alpha is too small, learning takes too long relative to the deadline; if it is too large, learning puts too much importance on the most recent action.
- If gamma is too small, too much importance is placed on the current reward; if it is too large, too much emphasis is placed on future reward.
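
The effect of each parameter can be seen in a single application of the standard Q-learning update. The numbers below (a reward of 2.0, a best next-state value of 4.0 and the listed alpha and gamma values) are hypothetical and are not results from this project.

```python
def q_update(old_q, reward, best_next_q, alpha, gamma):
    """One Q-learning step: blend the old estimate with the new target."""
    target = reward + gamma * best_next_q
    return old_q + alpha * (target - old_q)


# Larger alpha moves the estimate further towards the newest target.
for alpha in (0.1, 0.5, 0.9):
    print('alpha', alpha, q_update(0.0, 2.0, 4.0, alpha=alpha, gamma=0.5))

# Larger gamma gives the estimated future value more weight in the target.
for gamma in (0.0, 0.5, 0.9):
    print('gamma', gamma, q_update(0.0, 2.0, 4.0, alpha=1.0, gamma=gamma))
```

With alpha fixed at 1.0, a gamma of 0.0 makes the new value equal to the immediate reward alone, while larger gamma values add more of the estimated future value.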

TODO:

Below is a table of performance - change each one at a time

- alpha, gamma, epsilon
- table 1 with n trials
- table 2 with n trials

**QUESTION:**

Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, 
and not incur any penalties? How would you describe an optimal policy for this problem?

**Answer:**

Optimal policy: always reach the destination in the shortest amount of time, without incurring penalties (illegal actions, e.g. driving through a red light), while also maximizing reward.

The learned policy for this problem is relatively balanced between gaining maximum reward and always arriving at its destination, but it does incur penalties in doing so. However, the learned policy incurs penalties at a lower rate as the number of trials increases.

Based on the above, it can be said that the learned policy does get close to an optimal policy, particularly as the number of trials continues to increase and a balance is kept between learning and exploring.

TODO: need to clean this up