<h1>Self-Driving Agent Report</h1>

<h2>1. Implementation of a Basic Driving Agent</h2>

As a first task, we will move the smartcab around the environment by choosing actions at random. The set of possible actions is: None, forward, left, right. Deadline enforcement will be set to false, but this doesn't mean that the smartcab has an unlimited number of moves, as can be seen in the code of the file **smartcab/environment.py**; it only greatly increases the number of moves available.

The code corresponding to this agent can be found in the class **RandomAgent** in the file **smartcab/agents.py**.
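As a rough illustration of the random approach described above (not the actual **RandomAgent** code), the choice of action could look like the sketch below. The names `VALID_ACTIONS` and `choose_random_action` are hypothetical and only used here.

```python
import random

# Action set taken from the text above; None means "do nothing this step".
VALID_ACTIONS = [None, 'forward', 'left', 'right']


def choose_random_action(rng=random):
    """Pick one action uniformly at random, ignoring every sensor input."""
    return rng.choice(VALID_ACTIONS)


# Seeded generator, used only to make this small demo repeatable.
demo_rng = random.Random(42)
print([choose_random_action(demo_rng) for _ in range(5)])
```

Because the choice ignores the traffic light, other cars and the planner, any progress towards the destination is purely accidental.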

Observations from simulation:

1. The smartcab's actions are usually not optimal, but it usually reaches the destination because it has so many moves available.
2. The environment doesn't allow any agent to execute an action that violates traffic rules, but a strong negative reward is applied when one is attempted.

<h2>2. Inform the Driving Agent</h2>

The next task is to identify a set of states that are appropriate for modeling the smartcab and the environment.

All the information we receive comes from the environment and the planner.

Sensing the environment provides us with these inputs:

- light:
    - Possible values: Red / Green
- oncoming:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is an oncoming car and which action it intends to execute.
- right:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the right and which action it intends to execute.
- left:
    - Possible values: None / Forward / Right / Left
    - Indicates whether there is a car approaching from the left and which action it intends to execute.

From the environment we can also obtain the deadline, which is the number of moves remaining to reach the destination.

The planner provides next_waypoint, with these possible values: Forward, Right and Left.

For representing the state we will use: **next_waypoint**, **light**, **oncoming**, **right** and **left**.

Given that we use *next_waypoint*, it is not very useful to also use *deadline*. Including *deadline* would also considerably increase the number of possible states, which would penalize the Q-Learning implementation because each individual state would be visited far less often.

The information from *light*, *oncoming*, *right* and *left* can help Q-Learning to avoid traffic violations. The information from *next_waypoint* can help Q-Learning to reach the destination as soon as possible.
 
Given the properties used for the state and the possible values of each, the total number of distinct states is 3 x 2 x 4 x 4 x 4 = 384.
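The state count can be checked with a short sketch. The lowercase value spellings and the helper `build_state` are assumptions made for illustration; the real agent may build its state differently.

```python
from itertools import product

# Possible values of each state feature (spellings assumed for this sketch).
WAYPOINTS = ['forward', 'left', 'right']          # next_waypoint
LIGHTS = ['red', 'green']                         # light
TRAFFIC = [None, 'forward', 'left', 'right']      # oncoming, right, left


def build_state(next_waypoint, inputs):
    """Pack the chosen features into a hashable tuple usable as a dict key."""
    return (next_waypoint, inputs['light'], inputs['oncoming'],
            inputs['right'], inputs['left'])


# Enumerate every combination of feature values.
all_states = list(product(WAYPOINTS, LIGHTS, TRAFFIC, TRAFFIC, TRAFFIC))
print(len(all_states))  # 3 * 2 * 4 * 4 * 4 = 384

example = build_state('forward', {'light': 'green', 'oncoming': None,
                                  'right': None, 'left': 'left'})
print(example in set(all_states))  # True
```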

<h2>3. Implement a Q-Learning Driving Agent</h2>

The third task is to implement the Q-Learning algorithm for the driving agent. The code corresponding to this agent can be found in the class **QLearningAgent** in the file **smartcab/agents.py**.

The core of the algorithm is a simple value iteration update. It takes the old value and corrects it based on the new information (Source: [Wikipedia](https://en.wikipedia.org/wiki/Q-learning)):

![](images/qlearn.png)

Before proceeding to the simulation, some parameter values should be set.

In the formula shown above, two constants can be seen:
- **alpha_rate (α)** or **learning rate**: Determines to what extent the newly acquired information will override the old information.
- **gamma rate (γ)** or **discount factor**: Determines the importance of future rewards.

I will start with **alpha_rate = 0.9** and **gamma rate = 0.5**.
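To make the update rule concrete, here is a minimal sketch of one Q-learning step using the starting values above. It assumes the Q table is a plain dict keyed by `(state, action)`; the function name `q_update` and its signature are hypothetical and not taken from **smartcab/agents.py**.

```python
VALID_ACTIONS = [None, 'forward', 'left', 'right']


def q_update(q_matrix, state, action, reward, next_state,
             alpha=0.9, gamma=0.5, q_init_value=0.0):
    """Apply one Q-learning update in place and return the new Q value."""
    old_value = q_matrix.get((state, action), q_init_value)
    # Best value currently estimated for the next state, over all actions.
    best_next = max(q_matrix.get((next_state, a), q_init_value)
                    for a in VALID_ACTIONS)
    # Move the old estimate towards the observed target by a fraction alpha.
    new_value = old_value + alpha * (reward + gamma * best_next - old_value)
    q_matrix[(state, action)] = new_value
    return new_value


q = {}
# With an empty table: 0.0 + 0.9 * (2.0 + 0.5 * 0.0 - 0.0) = 1.8
print(q_update(q, 's0', 'forward', 2.0, 's1'))
```

With alpha close to 1 the new information almost completely replaces the old estimate, which makes learning fast but sensitive to noisy rewards.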

Another important value is the **epsilon_rate (ε)** or **exploration rate**, which determines when to explore and when to exploit the information already learned. I will start with **epsilon_rate = 0.1**.
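The exploration rate is typically used in an epsilon-greedy action choice. The sketch below shows one common way to do it, assuming a dict-based Q table keyed by `(state, action)`; `choose_action` and its random tie-breaking are illustrative assumptions, not necessarily what **QLearningAgent** does.

```python
import random

VALID_ACTIONS = [None, 'forward', 'left', 'right']


def choose_action(q_matrix, state, epsilon=0.1, q_init_value=0.0, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        # Exploration: ignore the Q table and try any action.
        return rng.choice(VALID_ACTIONS)
    q_values = [q_matrix.get((state, a), q_init_value) for a in VALID_ACTIONS]
    best = max(q_values)
    # Exploitation: break ties at random so unseen states are not biased
    # towards whichever action happens to be listed first.
    best_actions = [a for a, value in zip(VALID_ACTIONS, q_values)
                    if value == best]
    return rng.choice(best_actions)


q = {('s0', 'left'): 1.0}
print(choose_action(q, 's0', epsilon=0.0))  # always 'left' when epsilon is 0
```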

Finally, the Q values need an initial value. I will use **0.0 as the initial value**. Please note that the corresponding code does not initialize the table statically. Instead, the method **get_q_value** returns **self.q_init_value** (0.0) for any entry that has not been stored yet, which is equivalent:

~~~~
def get_q_value(self, state, action):
    key = (state, action)
    return self.q_matrix.get(key, self.q_init_value)
~~~~

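The reason is that I've added code to build stats, and one of the values I want to track is the number of explored states (strictly speaking, the number of stored `(state, action)` entries, since those are the keys of the table). Because entries are only stored once they have been updated, I get this number by simply using this code: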

~~~~
len(self.q_matrix)
~~~~
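The following sketch illustrates why the lazy initialization makes this count meaningful, under the assumption that the table is a plain dict: reading through `.get()` never stores anything, so only entries that were actually written show up in the length. The example states are made up.

```python
q_matrix = {}
q_init_value = 0.0

# Reading a missing key through .get() returns the default without storing it.
value = q_matrix.get((('forward', 'green'), 'forward'), q_init_value)
print(value, len(q_matrix))  # 0.0 0

# Only explicit writes, such as those made by a learning update, grow the table.
q_matrix[(('forward', 'green'), 'forward')] = 1.8
q_matrix[(('forward', 'green'), None)] = 0.0
q_matrix[(('left', 'red'), None)] = 0.5

explored_pairs = len(q_matrix)                               # (state, action) entries
explored_states = len({state for state, _action in q_matrix})  # distinct states
print(explored_pairs, explored_states)  # 3 2
```

Note that the table length counts `(state, action)` entries; counting distinct states requires collapsing the keys as in the last step.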

The simulation will be executed 100 times, with enforce_deadline set to True.

The stats data has been stored in a file, which I analyze below:

In [7]:
import pandas as pd

data = pd.read_csv("smartcab/stats_first_qlearn_agent.csv", index_col=0)

The **data** dataframe is a table containing 100 rows (one per simulation iteration) and 5 columns:
- **iteration**: The iteration number.
- **success**: True if the agent reached the destination.
- **cum_reward**: The accumulated reward in that iteration.
- **explored_states_cum**: The accumulated number of states explored.
- **traffic_violations_count**: The traffic violations that occurred in that iteration.

Let's explore the first 10 iterations:

In [8]:
data.head(10)

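| | iteration | success | cum_reward | explored_states_cum | traffic_violations_count |
|---|---|---|---|---|---|
| 0 | 1 | False | 5.0 | 14 | 7 |
| 1 | 2 | False | 9.5 | 22 | 2 |
| 2 | 3 | True | 0.0 | 22 | 0 |
| 3 | 4 | False | 0.5 | 24 | 7 |
| 4 | 5 | False | -7.5 | 27 | 7 |
| 5 | 6 | True | 4.0 | 27 | 3 |
| 6 | 7 | False | 7.0 | 27 | 4 |
| 7 | 8 | True | -0.5 | 27 | 2 |
| 8 | 9 | False | 12.0 | 29 | 2 |
| 9 | 10 | False | -7.0 | 29 | 11 |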


We can see that the agent usually doesn't reach the destination, that the number of explored states increases as the iterations go by, and that the agent usually commits many traffic violations.

Now let's look at the last 10 iterations:

In [9]:
data.tail(10)

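| | iteration | success | cum_reward | explored_states_cum | traffic_violations_count |
|---|---|---|---|---|---|
| 90 | 91 | False | -0.5 | 69 | 2 |
| 91 | 92 | False | -3.5 | 69 | 1 |
| 92 | 93 | False | 1.0 | 69 | 0 |
| 93 | 94 | False | -6.0 | 71 | 1 |
| 94 | 95 | True | -1.0 | 71 | 2 |
| 95 | 96 | False | -7.0 | 71 | 6 |
| 96 | 97 | False | -10.5 | 73 | 10 |
| 97 | 98 | False | 3.0 | 74 | 11 |
| 98 | 99 | True | -3.0 | 75 | 9 |
| 99 | 100 | False | -6.5 | 77 | 21 |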


We see more or less the same behaviour as in the first iterations, so we can conclude that *QLearningAgent* is not learning well.



Let's see the number of times that the *QLearningAgent* has been successful:

In [12]:
print(len(data[data.success]))

38


The *QLearningAgent* has a 38% success rate.

I've also run 100 simulations with the *RandomAgent* and *enforce_deadline=True*, and it has a 16% success rate.

<h2>4. Improve the Q-Learning Driving Agent</h2>

Now let's tune the values for the **learning rate (alpha)**, **the discount factor (gamma)** and the **exploration rate (epsilon)**. 

We will perform many simulations with many combinations of these parameters, and then we will report the results to see which combination is best. The corresponding code used to generate this data can be found in **smartcab/main_qlearn_agent_tuning.py**.

The results of the simulations will be stored in a CSV file, which we will analyze:
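The parameter sweep can be pictured as a plain grid search. The sketch below assumes ten evenly spaced values between 0 and 1 for each parameter, which is what the 1000 result rows and values such as 0.111111 and 0.222222 in the results suggest. The functions `tune` and `dummy_evaluate` are hypothetical stand-ins and are not taken from the tuning script.

```python
from itertools import product


def linspace(start, stop, num):
    """Return num evenly spaced floats from start to stop, both included."""
    step = (stop - start) / float(num - 1)
    return [start + i * step for i in range(num)]


RATES = linspace(0.0, 1.0, 10)  # 0.0, 0.111..., 0.222..., ..., 1.0


def tune(evaluate, rates=RATES):
    """Call evaluate(alpha, epsilon, gamma) for every combination of rates."""
    results = []
    for alpha, epsilon, gamma in product(rates, repeat=3):
        row = {'alpha_rate': alpha, 'epsilon_rate': epsilon, 'gamma_rate': gamma}
        row.update(evaluate(alpha, epsilon, gamma))
        results.append(row)
    return results


def dummy_evaluate(alpha, epsilon, gamma):
    """Stand-in for running the simulation; returns no real metrics."""
    return {'successPerc': None, 'actionsAvg': None, 'cumRewardAvg': None}


results = tune(dummy_evaluate)
print(len(results))  # 10 * 10 * 10 = 1000 combinations
```

In a real run, `dummy_evaluate` would be replaced by a function that runs the simulation with the given parameters and returns the measured metrics.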

In [8]:
import pandas as pd

tuning_data = pd.read_csv("smartcab/qlearn_agent_tuning_results.csv")
tuning_data.describe()

| | Unnamed: 0 | alpha_rate | epsilon_rate | gamma_rate | successPerc | actionsAvg | cumRewardAvg |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 499.5 | 0.5 | 0.5 | 0.5 | 21.585 | 26.762 | -1.20177 |
| std | 288.819436 | 0.319302 | 0.319302 | 0.319302 | 11.008268 | 1.983252 | 3.540282 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | -15.305 |
| 25% | 249.75 | 0.222222 | 0.222222 | 0.222222 | 16.0 | 26.0 | -2.835 |
| 50% | 499.5 | 0.5 | 0.5 | 0.5 | 20.0 | 27.0 | -2.1425 |
| 75% | 749.25 | 0.777778 | 0.777778 | 0.777778 | 24.0 | 28.0 | -0.78875 |
| max | 999.0 | 1.0 | 1.0 | 1.0 | 97.0 | 31.0 | 22.35 |


To determine a suitable set of parameters, let's find the q-agent with the maximum success percentage, the q-agent with the minimum average number of actions taken, and the q-agent with the maximum average accumulated reward.

<h4>Q-Agent with maximum success percentage</h4>

In [9]:
tuning_data.loc[tuning_data['successPerc'].idxmax()]

Unnamed: 0      506.000000
alpha_rate        0.555556
epsilon_rate      0.000000
gamma_rate        0.666667
successPerc      97.000000
actionsAvg       14.000000
cumRewardAvg      9.820000
Name: 506, dtype: float64

This agent achieved a 97% success percentage. This is a really good percentage, so we can rule out a faulty implementation of the Q-Learning algorithm. It seems I selected poor values for the parameters in the previous section.

Let's name this agent **SuccessAgent**.

<h4>Q-Agent with minimum actions taken average</h4>

In [10]:
tuning_data.loc[tuning_data['actionsAvg'].idxmin()]

Unnamed: 0      703.000000
alpha_rate        0.777778
epsilon_rate      0.000000
gamma_rate        0.333333
successPerc      95.000000
actionsAvg       13.000000
cumRewardAvg      8.870000
Name: 703, dtype: float64

This agent takes 13.000 actions on average.

Let's name this agent **ActionsAgent**.

<h4>Q-Agent with maximum accumulated reward average</h4>

In [11]:
tuning_data.loc[tuning_data['cumRewardAvg'].idxmax()]

Unnamed: 0      103.000000
alpha_rate        0.111111
epsilon_rate      0.000000
gamma_rate        0.333333
successPerc      31.000000
actionsAvg       25.000000
cumRewardAvg     22.350000
Name: 103, dtype: float64

This agent has an average accumulated reward of 22.350.

Let's name this agent **RewardAgent**.

<h4>And the winner is...</h4>

The **RewardAgent** has a poor success percentage (31%), so let's discard it.

The difference between **SuccessAgent** and **ActionsAgent** is minimal, but the **SuccessAgent** has a slightly better success percentage (97% vs. 95%) and accumulated reward average (9.82 vs. 8.87). So I will choose **SuccessAgent** as the winner.

The parameter values for **SuccessAgent** are: **alpha_rate = 0.555556**, **epsilon_rate = 0.000000**, **gamma rate = 0.666667**.
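The selection reasoning can be written down as a small sketch using the three result rows reported above. The cut-off `MIN_SUCCESS` is a hypothetical threshold introduced only to express "discard agents with a poor success percentage"; it is not a value used in the original analysis.

```python
# Metrics copied from the three result rows reported in this section.
candidates = {
    'SuccessAgent': {'successPerc': 97.0, 'actionsAvg': 14.0, 'cumRewardAvg': 9.82},
    'ActionsAgent': {'successPerc': 95.0, 'actionsAvg': 13.0, 'cumRewardAvg': 8.87},
    'RewardAgent': {'successPerc': 31.0, 'actionsAvg': 25.0, 'cumRewardAvg': 22.35},
}

MIN_SUCCESS = 90.0  # hypothetical cut-off, chosen only for this illustration

# Step 1: drop agents that rarely reach the destination.
reliable = {name: metrics for name, metrics in candidates.items()
            if metrics['successPerc'] >= MIN_SUCCESS}

# Step 2: among the remaining agents, prefer the higher average reward.
winner = max(reliable, key=lambda name: reliable[name]['cumRewardAvg'])
print(winner)  # SuccessAgent
```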