<h1>Self-Driving Agent Report</h1>

<h2>1. Implementation of a Basic Driving Agent</h2>

As a starting task, we move the smartcab around the environment using a random approach. The set of possible actions is: None, forward, left, right. Deadline enforcement is set to false. This does not give the smartcab an infinite number of moves, as the code in *environment.py* shows, but it greatly increases the number of moves available.

The code for this agent is in the class *RandomAgent* in the file *smartcab/agents.py*.
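
The class itself is not reproduced in this report. As a minimal sketch of the random approach, assuming the four actions are represented as `None` and lowercase strings (the names `VALID_ACTIONS` and `choose_random_action` are hypothetical, not taken from *smartcab/agents.py*):

```python
import random

# The four actions listed above; None means "do nothing this step".
# The lowercase string spelling is an assumption for illustration.
VALID_ACTIONS = [None, 'forward', 'left', 'right']


def choose_random_action(rng=random):
    """Pick one action uniformly at random, ignoring every sensor input."""
    return rng.choice(VALID_ACTIONS)


if __name__ == "__main__":
    # Draw a few actions to show that the choice does not depend on state.
    print([choose_random_action() for _ in range(5)])
```

Because the choice ignores the traffic light, other cars and the planner, any progress toward the destination is accidental, which matches the observations that follow.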

Observations from simulation:

1. The smartcab's actions are usually not optimal, but it usually reaches the destination because it has many moves available.
2. The environment doesn't allow any agent to execute an action that violates traffic rules, and it applies a strong negative reward when one is attempted.

<h2>2. Inform the Driving Agent</h2>

The next task is to identify a set of states that are appropriate for modeling the smartcab and the environment.

All the information we receive comes from the environment and the planner.

Sensing the environment provides us with these inputs:

- **light**
    - Possible values: Red / Green
- **oncoming**:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is an oncoming car and the action it wants to execute.
- **right**:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the right and
    the action it wants to execute.
- **left**:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the left and
    the action it wants to execute.

The environment also provides the **deadline**, which is the number of moves remaining to reach the destination.

The planner provides **next_waypoint** with these possible values: Forward, Right and Left.

For representing the state we will use: **next_waypoint**, **light**, **oncoming**, **right** and **left**.

Given that we use **next_waypoint**, it is not very useful to also use **deadline**. Including **deadline** would also considerably increase the number of possible states, which would penalize the Q-Learning implementation.

The information from **light**, **oncoming**, **right** and **left** can help Q-Learning to avoid traffic violations. The information from **next_waypoint** can help Q-Learning to reach the destination as soon as possible.
 
Given the properties used for the state and the possible values of each, the total number of distinct states is 3 x 2 x 4 x 4 x 4 = 384.
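
The count can be checked by enumerating every combination. This is a sketch under stated assumptions: the lowercase value spellings and the names `ALL_STATES` and `build_state` are illustrative, not taken from the project code.

```python
from itertools import product

# Possible values per state component (spelling is an assumption).
WAYPOINTS = ['forward', 'left', 'right']          # next_waypoint
LIGHTS = ['red', 'green']                         # light
TRAFFIC = [None, 'forward', 'left', 'right']      # oncoming, right, left

# Every (next_waypoint, light, oncoming, right, left) combination.
ALL_STATES = list(product(WAYPOINTS, LIGHTS, TRAFFIC, TRAFFIC, TRAFFIC))


def build_state(next_waypoint, inputs):
    """Pack the planner hint and the sensed inputs into a hashable tuple."""
    return (next_waypoint, inputs['light'], inputs['oncoming'],
            inputs['right'], inputs['left'])


print(len(ALL_STATES))  # 384
```

Using a tuple keeps the state hashable, so it can serve directly as a dictionary key in a Q table.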

<h2>3. Implement a Q-Learning Driving Agent</h2>

The third task is to implement the Q-Learning algorithm for the driving agent. The code for this agent is in the class *QLearningAgent* in the file *smartcab/agents.py*.

Before proceeding to the simulation, values must be assigned to three important constants:

- *alpha_rate (α)* or *learning rate*: Determines to what extent newly acquired information overrides old information.
- *epsilon_rate (ε)* or *exploration rate*: Determines when to explore and when to exploit the information already learned.
- *gamma_rate (γ)* or *discount factor*: Determines the importance of future rewards.
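
To make the role of each constant concrete, here is a minimal sketch of the standard tabular Q-learning update with epsilon-greedy action selection. It is not the code of *QLearningAgent*; the function names, the dictionary-based Q table and the toy states `'s0'` and `'s1'` are assumptions for illustration.

```python
import random
from collections import defaultdict

ACTIONS = [None, 'forward', 'left', 'right']


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    # Break ties between equally good actions at random.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def update_q(q, state, action, reward, next_state, alpha, gamma):
    """Blend the old estimate with reward plus discounted best future value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target


# Toy example: unseen entries default to 0.0.
q = defaultdict(float)
update_q(q, 's0', 'forward', 2.0, 's1', alpha=0.5, gamma=0.25)
print(q[('s0', 'forward')])  # 1.0
```

With `alpha` close to 1 the new target almost replaces the old estimate, with `gamma` close to 0 only the immediate reward matters, and with `epsilon` equal to 0 the agent never picks a random action.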

We will run 100 simulations with enforce_deadline set to True. For our first attempt we use these values for the constants:

- *alpha_rate (α)* = 0.9
- *epsilon_rate (ε)* = 0.1
- *gamma_rate (γ)* = 0.5

Let's analyze a scatter plot that relates the number of simulations executed (iterations) to the size of the Q matrix:

![](plots/qlearner_1_scatter_iterations_q-size.png)

As the plot shows, the number of values in the Q matrix increases as simulations are executed. It increases quickly at the beginning and more slowly later. This is expected, because the number of unvisited scenarios decreases as simulations accumulate.

Let's analyze a scatter plot that relates the number of iterations to the accumulated reward of each iteration:

![](plots/qlearner_1_scatter_iterations_cum-reward.png)

The agent does not appear to get better accumulated rewards as the number of iterations increases. Maybe the values assigned to the constants were poorly chosen.

Now let's see the number of times the Q-learning agent reached the destination:

- Success: 21 times
- Fail: 79 times

In the first section of this report we saw that the random agent was normally successful, but the deadline was not enforced there. To make a fairer comparison, let's compare against a random agent with the deadline enforced.

Let's see the number of times the random agent reached the destination:

- Success: 16 times
- Fail: 84 times

There is not much difference between the success ratios of the *RandomAgent* (16%) and the *QLearningAgent* (21%). As said previously, maybe the values for the constants were poorly chosen. Other possibilities are that 100 simulations are not enough for the *QLearningAgent*, or that the Q-learning algorithm is incorrectly implemented.

<h2>4. Improve the Q-Learning Driving Agent</h2>

Now let's tune the values for the *learning rate (alpha)*, the *discount factor (gamma)* and the *exploration rate (epsilon)*.

We will run many simulations with many combinations of these parameters, and then we will report the results to see which combination is best.

The results of the simulations will be stored in a CSV file, which we will then analyze.
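
The tuning script is not shown in this report. A minimal sketch of such a grid search could look like the following. The five-value grid is an assumption inferred from the 125 rows and the quartiles (0, 0.25, 0.5, 0.75, 1) in the results summary, and `run_trials` is a hypothetical placeholder that returns zeros instead of running the simulator.

```python
import csv
import io
from itertools import product

# Assumed grid: five values per parameter gives 5 x 5 x 5 = 125 combinations.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]


def run_trials(alpha, epsilon, gamma):
    """Hypothetical placeholder: a real version would run the simulations
    for one parameter combination and return the measured averages."""
    return {'successPerc': 0.0, 'actionsAvg': 0.0, 'cumRewardAvg': 0.0}


def tune(run=run_trials, out=None):
    """Evaluate every (alpha, epsilon, gamma) combination; optionally write CSV."""
    rows = []
    for alpha, epsilon, gamma in product(GRID, repeat=3):
        row = {'alpha_rate': alpha, 'epsilon_rate': epsilon, 'gamma_rate': gamma}
        row.update(run(alpha, epsilon, gamma))
        rows.append(row)
    if out is not None:
        writer = csv.DictWriter(out, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
rows = tune(out=buffer)
print(len(rows))  # 125
```

Passing the simulation function as the `run` argument keeps the grid enumeration separate from the simulator, so the loop can be checked without running any trials.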

    import pandas as pd

    # Load the tuning results and summarize every numeric column.
    tuning_data = pd.read_csv("smartcab/qlearn_agent_tuning_results.csv")
    tuning_data.describe()

| | Unnamed: 0 | alpha_rate | epsilon_rate | gamma_rate | successPerc | actionsAvg | cumRewardAvg |
|---|---|---|---|---|---|---|---|
| count | 125.0 | 125.0 | 125.0 | 125.0 | 125.0 | 125.0 | 125.0 |
| mean | 62.0 | 0.5 | 0.5 | 0.5 | 23.048 | 26.496 | -0.95944 |
| std | 36.228442 | 0.354976 | 0.354976 | 0.354976 | 15.292764 | 2.752196 | 4.390346 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | -12.74 |
| 25% | 31.0 | 0.25 | 0.25 | 0.25 | 16.0 | 26.0 | -2.875 |
| 50% | 62.0 | 0.5 | 0.5 | 0.5 | 20.0 | 27.0 | -2.005 |
| 75% | 93.0 | 0.75 | 0.75 | 0.75 | 24.0 | 28.0 | -0.2 |
| max | 124.0 | 1.0 | 1.0 | 1.0 | 90.0 | 30.0 | 16.125 |


The best success percentage was 90%. This is a really good percentage, so we can rule out a bad implementation of the Q-learning algorithm. It seems we selected bad values for the parameters.

Let's see the parameter values that correspond to the 90% success rate.

    tuning_data.loc[tuning_data['successPerc'].idxmax()]

    Unnamed: 0      51.000
    alpha_rate       0.500
    epsilon_rate     0.000
    gamma_rate       0.250
    successPerc     90.000
    actionsAvg      15.000
    cumRewardAvg     9.925
    Name: 51, dtype: float64

The values are:

- *alpha_rate (α)* = 0.500
- *epsilon_rate (ε)* = 0.000
- *gamma_rate (γ)* = 0.250
