<h1>Self-Driving Agent Report</h1>

<h2>1. Implementation of a Basic Driving Agent</h2>

As a starting task, we will move the smartcab around the environment using a random approach. The set of possible actions is: None, forward, left, right. Deadline enforcement will be set to false. This doesn't mean that the smartcab has an infinite number of moves, as can be seen in the code of the file **smartcab/environment.py**, but it greatly increases the number of moves available.

The code corresponding to this agent can be found in the class **RandomAgent** in the file **smartcab/agents.py**.
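The random approach can be sketched as follows. This is a minimal illustration, not the code from **smartcab/agents.py**: the helper name `choose_random_action` is hypothetical, and only the action set comes from this report.

```python
import random

# The four actions listed in this report.
VALID_ACTIONS = [None, 'forward', 'left', 'right']

def choose_random_action(rng=random):
    """Pick one of the valid actions uniformly at random (hypothetical helper)."""
    return rng.choice(VALID_ACTIONS)

# Draw a few actions with a seeded generator so the run is repeatable.
rng = random.Random(0)
sample = [choose_random_action(rng) for _ in range(5)]
```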

Observations from simulation:

1. The smartcab's actions are usually not optimal, but it usually reaches the destination because it has a large number of moves available.
2. The environment doesn't allow any agent to execute an action that violates traffic rules, but a strong negative reward is applied when one is attempted.

<h2>2. Inform the Driving Agent</h2>

The next task is to identify a set of states that are appropriate for modeling the smartcab and environment.

All the information we receive comes from the environment and the planner.

Sensing the environment provides us with these inputs:

- light:
    - Possible values: Red / Green
- oncoming:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is an oncoming car and, if so, the action it wants to execute.
- right:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the right and, if so,
    the action it wants to execute.
- left:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the left and, if so,
    the action it wants to execute.

We can also obtain the deadline from the environment, that is, the number of moves remaining to reach the destination.

The planner provides next_waypoint, with these possible values: Forward, Right and Left.

To represent the state we will use: **next_waypoint**, **light**, **oncoming**, **right** and **left**.

Since we already use *next_waypoint*, it is not very useful to also use *deadline*. Including *deadline* would also considerably increase the number of possible states, which would slow down learning in the Q-Learning implementation.

The information from *light*, *oncoming*, *right* and *left* can help Q-Learning to avoid traffic violations. The information from *next_waypoint* can help Q-Learning to reach the destination as soon as possible.
 
Given the properties used for the state and the possible values of each, the total number of distinct states is 3 x 2 x 4 x 4 x 4 = 384.
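The count can be checked by enumerating the state space. The sketch below assumes a state is stored as a tuple in the order (next_waypoint, light, oncoming, right, left). The helper `build_state` and the lowercase value spellings are illustrative assumptions, not taken from the project code.

```python
from itertools import product

# Possible values per state component, as listed in this report.
NEXT_WAYPOINTS = ['forward', 'right', 'left']
LIGHTS = ['red', 'green']
TRAFFIC = [None, 'forward', 'right', 'left']  # used for oncoming, right and left

# Each state is a tuple: (next_waypoint, light, oncoming, right, left).
ALL_STATES = list(product(NEXT_WAYPOINTS, LIGHTS, TRAFFIC, TRAFFIC, TRAFFIC))

def build_state(next_waypoint, inputs):
    """Build a hashable state tuple from the planner hint and the sensed inputs."""
    return (next_waypoint, inputs['light'], inputs['oncoming'],
            inputs['right'], inputs['left'])

num_states = len(ALL_STATES)  # 3 * 2 * 4 * 4 * 4
example_state = build_state(
    'forward',
    {'light': 'green', 'oncoming': None, 'right': None, 'left': 'forward'})
```

Because the tuple is hashable, it can be used directly as part of a dictionary key for the Q values.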

<h2>3. Implement a Q-Learning Driving Agent</h2>

The third task is to implement the Q-Learning algorithm for the driving agent. The code corresponding to this agent can be found in the class **QLearningAgent** in the file **smartcab/agents.py**.

The core of the algorithm is a simple value iteration update. It takes the old value and corrects it based on the new information (Source: [Wikipedia](https://en.wikipedia.org/wiki/Q-learning)):

![](images/qlearn.png)

Before proceeding to the simulation, some parameter values should be set.

In the formula shown above, two constants can be seen:
- **alpha_rate (α)** or **learning rate**: Determines to what extent the newly acquired information will override the old information.
- **gamma_rate (γ)** or **discount factor**: Determines the importance of future rewards.
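To make the role of both constants concrete, here is a minimal sketch of the standard Q-Learning update, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', a') − Q(s, a)). The function `q_update` and its signature are hypothetical and written for this illustration. The dictionary keyed by `(state, action)` is assumed from the `get_q_value` method shown in this report.

```python
VALID_ACTIONS = [None, 'forward', 'left', 'right']

def q_update(q_matrix, state, action, reward, next_state,
             alpha_rate, gamma_rate, q_init_value=0.0):
    """Apply one Q-Learning update in place and return the new value."""
    old_value = q_matrix.get((state, action), q_init_value)
    # Best value reachable from the next state under the current estimates.
    best_next = max(q_matrix.get((next_state, a), q_init_value)
                    for a in VALID_ACTIONS)
    # Learned target: immediate reward plus discounted future value.
    target = reward + gamma_rate * best_next
    # Move the old estimate toward the target by a fraction alpha_rate.
    new_value = old_value + alpha_rate * (target - old_value)
    q_matrix[(state, action)] = new_value
    return new_value

# Illustrative calls with made-up states and rewards.
q = {}
v1 = q_update(q, 's0', 'forward', 2.0, 's1', alpha_rate=0.9, gamma_rate=0.5)
v2 = q_update(q, 's1', 'left', -1.0, 's0', alpha_rate=0.9, gamma_rate=0.5)
```

With `alpha_rate = 0` the old value is never changed, and with `alpha_rate = 1` it is fully replaced by the target. With `gamma_rate = 0` only the immediate reward counts.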

I will start with **alpha_rate = 0.9** and **gamma_rate = 0.5**.

Another important value is the **epsilon_rate (ε)** or **exploration rate**, which determines when to explore and when to exploit the information already learned. I will start with **epsilon_rate = 0.1**.
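A common way to use the exploration rate is epsilon-greedy selection, sketched below. It is an assumption that **QLearningAgent** uses exactly this scheme. The helper `choose_action` and its random tie-breaking are illustrative choices.

```python
import random

VALID_ACTIONS = [None, 'forward', 'left', 'right']

def choose_action(q_matrix, state, epsilon_rate, q_init_value=0.0, rng=random):
    """Epsilon-greedy selection over the valid actions (hypothetical helper)."""
    if rng.random() < epsilon_rate:
        # Explore: ignore what has been learned so far.
        return rng.choice(VALID_ACTIONS)
    # Exploit: pick an action with the highest known Q value,
    # breaking ties at random.
    values = [q_matrix.get((state, a), q_init_value) for a in VALID_ACTIONS]
    best = max(values)
    best_actions = [a for a, v in zip(VALID_ACTIONS, values) if v == best]
    return rng.choice(best_actions)

# Illustrative Q values for a made-up state.
q = {('s0', 'forward'): 1.5, ('s0', 'left'): -0.5}
greedy = choose_action(q, 's0', epsilon_rate=0.0)
explored = choose_action(q, 's0', epsilon_rate=1.0, rng=random.Random(1))
```

With `epsilon_rate = 0.0` the agent always exploits, and with `epsilon_rate = 1.0` it behaves like the random agent.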

Finally, Q values need an initial value. I will use **0.0 as the initial value**. Note that the corresponding code doesn't initialize the table statically. Instead, the method **get_q_value** falls back to **self.q_init_value** (0.0) for any pair that has not been seen yet, which is equivalent:

~~~~
def get_q_value(self, state, action):
    key = (state, action)
    return self.q_matrix.get(key, self.q_init_value)
~~~~

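The reason is that I've added code to build stats, and one of the values I want to track is the number of explored states (more precisely, the number of state-action pairs stored so far, since the keys are *(state, action)* tuples). I get this number by simply using this code: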

~~~~
len(self.q_matrix)
~~~~

The simulation will be executed 100 times, with enforce_deadline set to True.

The stats data has been stored in a file that I will analyze below:

In [23]:
import pandas as pd

data = pd.read_csv("smartcab/stats_first_qlearn_agent.csv", index_col=0)

The **data** dataframe is a table containing 100 rows (one per simulation iteration) and 6 columns:
- **simulation_round**: The round number of the simulation.
- **success**: True if the agent reached the destination.
- **cum_reward**: The accumulated reward in that simulation.
- **explored_states_cum**: The accumulated number of states explored.
- **traffic_violations_count**: The traffic violations that occurred in that simulation.
- **actions_count**: The actions taken in that simulation.

Let's explore the first 10 iterations:

In [24]:
data.head(10)

Unnamed: 0,simulation_round,success,cum_reward,explored_states_cum,traffic_violations_count,actions_count
0,1,True,-3.5,8,10,20
1,2,False,-7.0,20,10,51
2,3,False,7.5,21,3,26
3,4,False,6.0,22,0,31
4,5,True,-7.5,24,7,20
5,6,True,19.0,26,2,27
6,7,True,0.5,27,1,3
7,8,True,11.5,28,2,10
8,9,True,4.0,29,0,10
9,10,True,10.5,31,7,31


In the first 10 simulations the agent reaches the destination 7 times out of 10.

We can see that the number of explored states increases as iterations are done, and that the agent commits many traffic violations (up to 10 in a single simulation).

Now let's see the last 10 iterations:

In [25]:
data.tail(10)

Unnamed: 0,simulation_round,success,cum_reward,explored_states_cum,traffic_violations_count,actions_count
90,91,False,-4.5,83,8,26
91,92,False,0.0,83,0,26
92,93,False,-2.0,83,5,41
93,94,False,-3.5,83,9,21
94,95,True,-1.0,83,9,25
95,96,False,3.5,83,8,21
96,97,True,3.0,83,2,8
97,98,True,9.0,83,3,16
98,99,False,15.0,83,1,26
99,100,False,17.0,83,0,31


We see no improvement with respect to the first iterations: the agent succeeds in only 3 of the last 10 simulations and still commits many traffic violations, so we can conclude that *QLearningAgent* is not learning well.
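The comparison between both windows can be quantified with a few lines of code. The success flags below are copied by hand from the two table outputs above, and `success_rate` is a hypothetical helper written for this illustration.

```python
# Success flags copied from the head(10) and tail(10) outputs above.
first_10 = [True, False, False, False, True, True, True, True, True, True]
last_10 = [False, False, False, False, True, False, True, True, False, False]

def success_rate(flags):
    """Return the percentage of successful simulations in a list of booleans."""
    return 100.0 * sum(flags) / len(flags)

first_rate = success_rate(first_10)
last_rate = success_rate(last_10)
```

This gives 70% for the first window and 30% for the last one. An agent that was learning would be expected to improve over time, not to get worse, although windows of 10 simulations are small samples.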



Let's see the number of times that the *QLearningAgent* has been successful:

In [12]:
print len(data[(data.success)])

38


The *QLearningAgent* has a 38% success rate.

For comparison, I ran 100 simulations with the *RandomAgent* and *enforce_deadline=True*, and it has a 16% success rate.

<h2>4. Improve the Q-Learning Driving Agent</h2>

Now let's tune the values for the **Q init value**, the **learning rate (alpha)**, **the discount factor (gamma)** and the **exploration rate (epsilon)**.

We will use the *Grid Search* technique to tune these parameters. First I will do a **grosso modo** (coarse) Grid Search, and then I will do a second **fine tuned** Grid Search.

<h3>4.1. First (grosso modo) Grid Search</h3>

I will do a first grid search with these ranges of values:
- *q_init_values*: 0.0, 5.0, 10.0 (3 values)
- *alpha_rate*: 0.00, 0.25, 0.50, 0.75, 1.00 (5 values)
- *epsilon_rate*: 0.00, 0.25, 0.50, 0.75, 1.00 (5 values)
- *gamma_rate*: 0.00, 0.25, 0.50, 0.75, 1.00 (5 values)

The total number of combinations will be 375 (3x5x5x5). For each of the combinations, we will perform 100 simulations. This means 37,500 simulations will be done.
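The grid can be enumerated with `itertools.product`, as in the sketch below. This is not the code from the tuning script: the variable names are illustrative, and the nesting order (q_init_value outermost, gamma_rate innermost) is an assumption.

```python
from itertools import product

# Candidate values for each parameter, as listed above.
q_init_values = [0.0, 5.0, 10.0]
alpha_rates = [0.00, 0.25, 0.50, 0.75, 1.00]
epsilon_rates = [0.00, 0.25, 0.50, 0.75, 1.00]
gamma_rates = [0.00, 0.25, 0.50, 0.75, 1.00]

# Every combination as a tuple: (q_init_value, alpha_rate, epsilon_rate, gamma_rate).
grid = list(product(q_init_values, alpha_rates, epsilon_rates, gamma_rates))

simulations_per_combination = 100
total_simulations = len(grid) * simulations_per_combination
```

With this ordering, `grid[29]` is `(0.0, 0.25, 0.0, 1.0)` and `grid[50]` is `(0.0, 0.5, 0.0, 0.0)`, which agrees with the parameter values of rows 29 and 50 in the results analyzed in this report.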

The results of the simulations will be stored in a CSV file, which we will analyze.

The corresponding code used to generate this data can be found in **smartcab/main_qlearn_agent_tuning.py**.

In [59]:
import pandas as pd
from altair import Chart

tuning_data = pd.read_csv("smartcab/qlearn_agent_tuning_results.csv", index_col=0)

Let's see some sample rows to understand the data contained:

In [42]:
tuning_data.head(5)

Unnamed: 0,q_init_value,alpha_rate,epsilon_rate,gamma_rate,success_perc,traffic_violations_avg,explored_states_avg,reward_cum_avg,actions_avg
0,0.0,0.0,0.0,0.0,20.0,7.1,28.0,-3.6,26.9
1,0.0,0.0,0.0,0.25,10.0,7.1,29.8,-2.3,29.5
2,0.0,0.0,0.0,0.5,0.0,8.6,29.9,-3.15,34.5
3,0.0,0.0,0.0,0.75,10.0,7.3,25.2,-0.1,29.2
4,0.0,0.0,0.0,1.0,10.0,6.2,30.9,-0.85,28.0


Each row corresponds to one combination of parameters, with results aggregated over the simulations run for that combination. The columns are:
- **q_init_value**: The initial Q value used for that combination.
- **alpha_rate**: The alpha rate value used for that combination.
- **epsilon_rate**: The epsilon rate value used for that combination.
- **gamma_rate**: The gamma rate value used for that combination.
- **success_perc**: The percentage of simulations that reached the destination.
- **traffic_violations_avg**: The average number of traffic violations per simulation.
- **explored_states_avg**: The average number of explored states.
- **reward_cum_avg**: The average accumulated reward per simulation.
- **actions_avg**: The average number of actions taken per simulation.

After doing this first **grosso modo** grid search, we will do a second **fine tuned** grid search. But first we need to determine what an optimal policy for our problem can be.



<h3>4.2. An optimal policy</h3>

In my opinion an optimal policy for the smartcab is one that (in order of importance):
1. Minimizes the number of traffic violations.
2. Maximizes the success.
3. Minimizes the number of actions taken.
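This ordering of priorities can be expressed as a lexicographic sort key. The sketch below applies it to three rows of the first grid search (rows 0, 29 and 50 of the results file, with values as they appear in this report). The function `policy_key` is a hypothetical helper.

```python
# Three rows from the first grid-search results, as they appear in this report.
rows = [
    {'row': 0, 'success_perc': 20.0, 'traffic_violations_avg': 7.1, 'actions_avg': 26.9},
    {'row': 29, 'success_perc': 30.0, 'traffic_violations_avg': 1.0, 'actions_avg': 29.3},
    {'row': 50, 'success_perc': 80.0, 'traffic_violations_avg': 1.5, 'actions_avg': 22.3},
]

def policy_key(r):
    """Fewer violations first, then higher success, then fewer actions."""
    return (r['traffic_violations_avg'], -r['success_perc'], r['actions_avg'])

ranked = sorted(rows, key=policy_key)
ranking = [r['row'] for r in ranked]
```

A strict lexicographic order ranks row 29 ahead of row 50 even though its success rate is much lower, because any difference in violations dominates. This is one reason to combine the ordering with thresholds on each metric.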

So let's start describing the data:

In [61]:
tuning_data.describe()

Unnamed: 0,q_init_value,alpha_rate,epsilon_rate,gamma_rate,success_perc,traffic_violations_avg,explored_states_avg,reward_cum_avg,actions_avg
count,375.0,375.0,375.0,375.0,375.0,375.0,375.0,375.0,375.0
mean,5.0,0.5,0.5,0.5,21.12,6.990667,28.6384,-1.677067,27.04
std,4.087937,0.354026,0.354026,0.354026,14.764606,1.746239,2.818525,2.809958,3.565626
min,0.0,0.0,0.0,0.0,0.0,1.0,18.2,-7.85,16.8
25%,0.0,0.25,0.25,0.25,10.0,6.0,26.8,-3.4,24.85
50%,5.0,0.5,0.5,0.5,20.0,7.2,28.4,-2.0,27.0
75%,10.0,0.75,0.75,0.75,30.0,8.1,30.3,-0.5,29.6
max,10.0,1.0,1.0,1.0,80.0,11.4,37.4,12.65,35.8


The minimum value for *traffic_violations_avg* is 1.00, which is a really good average.

Let's look for rows where *traffic_violations_avg <= 5* and *success_perc >= 80*:

In [68]:
tuning_data_rows = tuning_data[(tuning_data['traffic_violations_avg'] <= 5) & (tuning_data['success_perc'] >= 80)]
tuning_data_rows.head(20)

Unnamed: 0,q_init_value,alpha_rate,epsilon_rate,gamma_rate,success_perc,traffic_violations_avg,explored_states_avg,reward_cum_avg,actions_avg
50,0.0,0.5,0.0,0.0,80.0,1.5,25.6,7.55,22.3


So let's start looking for the row with the minimum value for the column *traffic_violations_avg*.

In [43]:
tuning_data.loc[tuning_data['traffic_violations_avg'].idxmin()]

q_init_value               0.00
alpha_rate                 0.25
epsilon_rate               0.00
gamma_rate                 1.00
success_perc              30.00
traffic_violations_avg     1.00
explored_states_avg       20.60
reward_cum_avg             5.95
actions_avg               29.30
Name: 29, dtype: float64

The success percentage is not good, but let's do some more fine tuning to see if we can find a combination of parameters that minimizes number of traffic violations and maximizes the success.

<h3>4.3. Second (fine tuned) Grid Search</h3>

I will do a second grid search with these ranges of values:
- *q_init_values*: 0.0, 1.0, 2.0 (3 values)
- *alpha_rate*: 0.15, 0.20, 0.25, 0.30, 0.35 (5 values)
- *epsilon_rate*: 0.00, 0.05, 0.10, 0.15, 0.20 (5 values)
- *gamma_rate*: 0.80, 0.85, 0.90, 0.95, 1.00 (5 values)

Let's analyze the data:

In [57]:
tuning_data_2 = pd.read_csv("smartcab/qlearn_agent_tuning_results2.csv", index_col=0)
tuning_data_2.describe()

Unnamed: 0,q_init_value,alpha_rate,epsilon_rate,gamma_rate,success_perc,traffic_violations_avg,explored_states_avg,reward_cum_avg,actions_avg
count,375.0,375.0,375.0,375.0,375.0,375.0,375.0,375.0,375.0
mean,1.0,0.25,0.1,0.9,28.666667,4.536,27.204533,2.255867,26.415467
std,0.817587,0.070805,0.070805,0.070805,22.981819,2.219941,3.479275,4.843244,4.18347
min,0.0,0.15,0.0,0.8,0.0,0.5,13.7,-6.55,12.9
25%,0.0,0.2,0.05,0.85,10.0,2.9,25.3,-1.2,23.85
50%,1.0,0.25,0.1,0.9,20.0,4.5,27.5,1.35,26.9
75%,2.0,0.3,0.15,0.95,40.0,5.95,29.5,4.975,29.1
max,2.0,0.35,0.2,1.0,100.0,11.3,39.0,25.15,36.0


In [55]:
tuning_data_2.head()

Unnamed: 0,q_init_value,alpha_rate,epsilon_rate,gamma_rate,success_perc,traffic_violations_avg,explored_states_avg,reward_cum_avg,actions_avg
0,0.0,0.15,0.0,0.8,50.0,9.7,28.2,4.15,23.1
1,0.0,0.15,0.0,0.85,50.0,3.5,35.4,9.25,24.9
2,0.0,0.15,0.0,0.9,20.0,1.0,20.0,12.45,26.5
3,0.0,0.15,0.0,0.95,10.0,1.1,25.7,7.3,32.4
4,0.0,0.15,0.0,1.0,70.0,7.3,22.8,0.7,17.9


In [56]:
tuning_data.loc[tuning_data_2['success_perc'].idxmax()]

q_init_value               0.00
alpha_rate                 0.75
epsilon_rate               0.75
gamma_rate                 1.00
success_perc              20.00
traffic_violations_avg     4.00
explored_states_avg       31.80
reward_cum_avg            -0.85
actions_avg               24.70
Name: 94, dtype: float64

Note that the cell above looks up the row in *tuning_data* (the first grid search) using an index computed from *tuning_data_2*, so the row shown (index 94) belongs to the first grid and is not the best combination of the second one. The intended lookup is `tuning_data_2.loc[tuning_data_2['success_perc'].idxmax()]`. According to the summary above, the maximum *success_perc* in the second grid is 100.0.

<h4>And the winner is...</h4>

The **RewardAgent** has a poor success percentage, so let's discard it.

The difference between **SuccessAgent** and **ActionsAgent** is minimal, but the **SuccessAgent** has a slightly better accumulated reward average. So I will choose **SuccessAgent** as the winner.

The parameter values for **SuccessAgent** are: **alpha_rate = 0.555556**, **epsilon_rate = 0.000000**, **gamma_rate = 0.666667**.