# P4: Smartcab

## Implement a Basic Driving Agent

To begin, your only task is to get the **smartcab** to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:
- The next waypoint location relative to its current location and heading.
- The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
- The current time left from the allotted deadline.

To complete this task, simply have your driving agent choose a random action from the set of possible actions `(None, 'forward', 'left', 'right')` at each intersection, disregarding the input information above. Set the simulation deadline enforcement, `enforce_deadline`, to `False` and observe how it performs.

***QUESTION:***

Observe what you see with the agent's behavior as it takes random actions. Does the **smartcab** eventually make it to the destination? Are there any other interesting observations to note?

***ANSWER:***

The instructions for this question explicitly state that these random actions should disregard _the state of the traffic light_. Since the already-coded `Environment.act()` sets `move_okay` based on the action the agent wants to take and the state of the traffic light, I would have to change this method or write a new one (which does not make sense) to follow these instructions fully.

However, we can more easily disregard the other inputs assessed outside of `Environment.act()` and assign random actions to our agent. By doing this we can see that our agent does make it to the destination in most cases. In the other cases it runs up against the `hard_time_limit` set to `-100`.
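A minimal sketch of the random policy described above. Only the action set comes from the instructions; the function name `choose_random_action` and its optional `rng` parameter are hypothetical.

```python
import random

# Action set taken from the project instructions.
VALID_ACTIONS = (None, 'forward', 'left', 'right')


def choose_random_action(rng=random):
    """Pick an action uniformly at random, ignoring all sensed inputs."""
    return rng.choice(VALID_ACTIONS)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing this baseline against the learning agent.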

## Inform the Driving Agent

***Instructions:***

Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the **smartcab** and environment. The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, process the inputs and update the agent's current state using the `self.state` variable. Continue with the simulation deadline enforcement `enforce_deadline` being set to `False`, and observe how your driving agent now reports the change in state as the simulation progresses.

***Notes:***

Rewards:
- `+=10.0` for each successfully completed trip, determined by:
    - ~~L1 distance to destination: int~~
    - rather, `Environment.agent_state[LearningAgent]['location']`
-  `+2.0` for each action it executes successfully that obeys traffic rules (a legal action), determined by:
    - inputs from `Environment.sense()`
-  `+0.0` for any null action, determined by:
    - inputs from `Environment.sense()`
-  `-0.5` for action not prescribed but legal, determined by:
    - inputs from `Environment.sense()`
    - `LearningAgent.next_waypoint`
-  `-1.0` for an action that violates traffic rules (illegal action), determined by:
    - inputs from `Environment.sense()`
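
The reward rules listed above can be summarized in one function. This is only a sketch of these notes, not the code in `Environment.act()`. The name `sketch_reward` and the boolean parameters `is_legal` and `reached_destination` are hypothetical, and it assumes that `+2.0` applies only when a legal action also matches `next_waypoint`, since legal actions that are not prescribed get `-0.5`.

```python
def sketch_reward(action, next_waypoint, is_legal, reached_destination=False):
    """Return the reward for one step, following the rules in the notes."""
    if action is None:
        reward = 0.0       # null action
    elif not is_legal:
        reward = -1.0      # violates traffic rules
    elif action == next_waypoint:
        reward = 2.0       # legal and prescribed by the planner
    else:
        reward = -0.5      # legal but not prescribed
    if reached_destination:
        reward += 10.0     # bonus for completing the trip
    return reward
```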

State description options:
- L1 distance to destination: int
- light: 'green', 'red'
- oncoming: 'forward', 'left', 'right'
- left: 'forward', 'left', 'right'
- right: 'forward', 'left', 'right'
- next_waypoint:  'forward', 'left', 'right'


What info needs to be in a state?
- Characteristics that would determine an action's reward taken from that state.
- i.e. enough info in `s` to give correct output for $R(s)$, reward function.
- __BUT__ agent still must _learn_ traffic rules on its own, without that being hard-coded into state (like first attempts with concept of legality).

State definition:
```python
self.State = namedtuple('State', ['next_waypoint', 'light',
                                  'oncoming', 'left', 'right'])
```
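
As a rough illustration of how such a state could be built at each time step, assuming the sensed inputs arrive as a dict with the keys `light`, `oncoming`, `left` and `right`. The helper `make_state` and the sample values are hypothetical.

```python
from collections import namedtuple

State = namedtuple('State', ['next_waypoint', 'light',
                             'oncoming', 'left', 'right'])


def make_state(next_waypoint, inputs):
    """Combine the planner's waypoint with the sensed inputs into one State."""
    return State(next_waypoint=next_waypoint,
                 light=inputs['light'],
                 oncoming=inputs['oncoming'],
                 left=inputs['left'],
                 right=inputs['right'])


# Hypothetical sample inputs for one intersection.
state = make_state('forward', {'light': 'green', 'oncoming': 'left',
                               'left': 'forward', 'right': 'right'})

# Namedtuples are hashable, so a (state, action) pair can key a Q-table.
q_table = {(state, 'forward'): 0.0}
```

Using a namedtuple rather than a plain dict keeps the state immutable and hashable while still allowing access by field name.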



---
___First attempt(s):___

```python
self.ActionsLegality = \
    namedtuple('ActionsLegality',
               ['no_action', 'forward',
                'left', 'right'])
self.State = namedtuple('State', 
                ['next_waypoint', 'actions_legality'])
```
---
***More efficient state description is:***

```python
self.State = namedtuple('State', ['next_waypoint', 'okay_moves'])
self.OKAY_MOVES = ('all', 'all but left', 'right on red', 'none')
```
States in total: 12

---

i.e.

1. Green light - no oncoming traffic (all actions)
2. Green light - oncoming traffic (no left turn)
3. Red light - no oncoming traffic (right turn only)
4. Red light - oncoming traffic from left (no actions)

```
State(next_waypoint='left', actions_legality=ActionsLegality(no_action=True, forward=True, left=True, right=True))
State(next_waypoint='left', actions_legality=ActionsLegality(no_action=True, forward=True, left=False, right=True))
State(next_waypoint='left', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=True))
State(next_waypoint='left', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=False))

State(next_waypoint='right', actions_legality=ActionsLegality(no_action=True, forward=True, left=True, right=True))
State(next_waypoint='right', actions_legality=ActionsLegality(no_action=True, forward=True, left=False, right=True))
State(next_waypoint='right', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=True))
State(next_waypoint='right', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=False))

State(next_waypoint='forward', actions_legality=ActionsLegality(no_action=True, forward=True, left=True, right=True))
State(next_waypoint='forward', actions_legality=ActionsLegality(no_action=True, forward=True, left=False, right=True))
State(next_waypoint='forward', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=True))
State(next_waypoint='forward', actions_legality=ActionsLegality(no_action=True, forward=False, left=False, right=False))
```
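
A sketch of how the raw inputs could be collapsed into the four `okay_moves` categories used in the legality-based descriptions above. The right-of-way conditions are assumptions (a left turn on green is blocked by oncoming traffic going forward or turning right; a right turn on red is blocked by traffic from the left going forward), as is the function name `classify_okay_moves`.

```python
OKAY_MOVES = ('all', 'all but left', 'right on red', 'none')


def classify_okay_moves(light, oncoming=None, left=None):
    """Map raw intersection inputs to one of the four OKAY_MOVES categories."""
    if light == 'green':
        # Assumed rule: yield to oncoming traffic going forward or turning right.
        if oncoming in ('forward', 'right'):
            return 'all but left'
        return 'all'
    # Red light. Assumed rule: a right turn is allowed unless traffic
    # from the left is going forward through the intersection.
    if left == 'forward':
        return 'none'
    return 'right on red'
```

Hard-coding legality this way shrinks the state space to 12 states, but it also hands the agent the traffic rules instead of letting it learn them, which is why the notes above move away from it.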

***QUESTION:*** 

What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?


***ANSWER:***

The state description is determined by which aspects of a state affect the reward for an action taken from that state.

Since rewards rely on whether the learning agent has followed (or not followed) traffic and safety laws and whether it has moved toward the destination as per the planner's `next_waypoint` returned value, we can combine inputs into the following description:
```python
State = namedtuple('State', ['next_waypoint', 'light',
                             'oncoming', 'left', 'right'])
```

OPTIONAL:

How many states in total exist for the smartcab in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?


ANSWER: 

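The total number of states is the number of possible combinations of values for the state variables listed below:

$2 \times 3^4 = 162$

This count covers only the values listed. If `None` (no vehicle present) is counted as a fourth value for `oncoming`, `left` and `right`, the total becomes $2 \times 4^3 \times 3 = 384$.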

```python
light: 'green', 'red'
oncoming: 'forward', 'left', 'right'
left: 'forward', 'left', 'right'
right: 'forward', 'left', 'right'
next_waypoint: 'forward', 'left', 'right'
```
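
The count can be checked by enumerating the combinations directly. This sketch uses only the values listed above; the names `STATE_VALUES` and `all_states` are hypothetical.

```python
from itertools import product

# Possible values per state variable, as listed above.
STATE_VALUES = {
    'light': ('green', 'red'),
    'oncoming': ('forward', 'left', 'right'),
    'left': ('forward', 'left', 'right'),
    'right': ('forward', 'left', 'right'),
    'next_waypoint': ('forward', 'left', 'right'),
}

# Cartesian product of all value sets: one tuple per distinct state.
all_states = list(product(*STATE_VALUES.values()))
print(len(all_states))  # 162
```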


```python
import sys
sys.path.append('/smartcab/')
import agent as ag
from IPython.display import display
import matplotlib.pyplot as plt
```

```python
reload(ag)
result = ag.grid_search(trials=100,
                        epsilons=(.5,),
                        decay_divisors=(3000,),
                        learning_rates=(.2, .5, .8),
                        discounts=(.2, .5, .8),
                        only_Q4=False)
```

```python
# display(result.groupby(list(result.columns[:4])).sum())
# result.describe()
display(result.sort_values(['avg_missed_Q4', 'avg_violations_Q4', 'avg_moves_Q4']).head())
# display(result.sort_values(['avg_violations_Q4', 'avg_missed_Q4', 'avg_moves_Q4']).head())
# display(result.sort_values(['avg_moves_Q4', 'avg_missed_Q4', 'avg_violations_Q4']).head())
```

| | epsilon | decay_divisor | learning_rate | discount | avg_missed__Q1 | avg_missed_Q2 | avg_missed_Q3 | avg_missed_Q4 | avg_violations_Q1 | avg_violations_Q2 | avg_violations_Q3 | avg_violations_Q4 | avg_moves_Q1 | avg_moves_Q2 | avg_moves_Q3 | avg_moves_Q4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.5 | 3000 | 0.5 | 0.2 | 0.16 | 0.0 | 0.0 | 0.0 | 1.72 | 0.28 | 0.04 | 0.04 | 21.12 | 14.6 | 15.16 | 13.6 |
| 0 | 0.5 | 3000 | 0.2 | 0.2 | 0.28 | 0.0 | 0.0 | 0.0 | 2.2 | 0.28 | 0.04 | 0.08 | 18.08 | 13.24 | 12.36 | 14.56 |
| 7 | 0.5 | 3000 | 0.8 | 0.5 | 0.24 | 0.0 | 0.0 | 0.0 | 1.76 | 0.52 | 0.0 | 0.12 | 17.76 | 13.68 | 15.52 | 12.96 |
| 4 | 0.5 | 3000 | 0.5 | 0.5 | 0.12 | 0.08 | 0.08 | 0.0 | 2.08 | 0.52 | 0.12 | 0.12 | 15.92 | 15.56 | 15.52 | 15.08 |
| 2 | 0.5 | 3000 | 0.2 | 0.8 | 0.32 | 0.0 | 0.08 | 0.0 | 2.56 | 0.4 | 0.08 | 0.2 | 18.8 | 15.84 | 16.12 | 19.0 |


## Implement a Q-Learning Driving Agent

***Instructions:***

With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. Each action taken by the **smartcab** will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement `enforce_deadline` to `True`. Run the simulation and observe how the **smartcab** moves about the environment in each trial.

The formulas for updating Q-values can be found in [this] video.

[this]:https://classroom.udacity.com/nanodegrees/nd009/parts/0091345409/modules/e64f9a65-fdb5-4e60-81a9-72813beebb7e/lessons/5446820041/concepts/6348990570923

***Notes:***

Pseudocode of algorithm:

1. Determine the current state $s$.
2. With probability $\epsilon$, take a random action; otherwise take the best action according to the Q-values.
3. Decay $\epsilon$.
4. Take the action $a$ and get the reward $r$.
5. Determine the new state $s'$.
6. Use $\langle s, a, r, s' \rangle$ to update the Q-value (of $s$ with $a$).

NOTE: The implementation uses a modified version of the above so that the next state is determined accurately from the next time step. Essentially, once $s'$ has been determined, it looks back to the previous state to apply the update.
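
The pseudocode steps above can be sketched as a small tabular learner. This is an illustrative sketch, not the project's `LearningAgent`: the class name `QLearnerSketch`, the linear decay floored at zero, and the zero default for unseen Q-values are all assumptions. The update rule is the standard one, $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a'))$.

```python
import random
from collections import defaultdict

VALID_ACTIONS = (None, 'forward', 'left', 'right')


class QLearnerSketch:
    """Tabular Q-learning with an epsilon-greedy policy (illustrative only)."""

    def __init__(self, epsilon=0.5, decay_rate=0.5 / 400,
                 learning_rate=0.2, discount=0.2, rng=None):
        self.epsilon = epsilon
        self.decay_rate = decay_rate
        self.learning_rate = learning_rate  # alpha
        self.discount = discount            # gamma
        self.q = defaultdict(float)         # (state, action) -> Q-value, 0.0 if unseen
        self.rng = rng or random.Random()

    def choose_action(self, state):
        """Explore with probability epsilon; otherwise exploit the best Q-value."""
        if self.rng.random() < self.epsilon:
            action = self.rng.choice(VALID_ACTIONS)
        else:
            action = max(VALID_ACTIONS, key=lambda a: self.q[(state, a)])
        # Assumed schedule: linear decay, floored at zero.
        self.epsilon = max(0.0, self.epsilon - self.decay_rate)
        return action

    def update(self, state, action, reward, next_state):
        """Move Q(s, a) toward r + gamma * max_a' Q(s', a') by a step of alpha."""
        best_next = max(self.q[(next_state, a)] for a in VALID_ACTIONS)
        target = reward + self.discount * best_next
        old = self.q[(state, action)]
        self.q[(state, action)] = old + self.learning_rate * (target - old)
```

In a driving loop, `choose_action(s)` would be called first, the action executed to obtain `r` and `s'`, and then `update(s, a, r, s')` applied, which matches the "look back" ordering described in the note.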

Good discussion:
https://discussions.udacity.com/t/qtable-content-example/178397/20

***QUESTION:***

What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?


***ANSWER:***

As the learning agent progresses through the trials, it begins to take actions toward the destination more frequently. These actions also begin to follow traffic rules more often.

These behavior changes occur because:

1. Moves toward the destination earn positive rewards, so the agent makes them more often.
2. Traffic violations incur negative rewards, so the agent learns to avoid them and drives more safely.


## Improve the Q-Learning Driving Agent

***Instructions:***

Your final task for this project is to enhance your driving agent so that, after sufficient training, the **smartcab** is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (`alpha`), the discount factor (`gamma`) and the exploration rate (`epsilon`) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your **smartcab**:

- Set the number of trials, `n_trials`, in the simulation to 100.
- Run the simulation with the deadline enforcement `enforce_deadline` set to `True` (you will need to reduce the update delay `update_delay` and set the `display` to `False`).
- Observe the driving agent’s learning and **smartcab’s** success rate, particularly during the later trials.
- Adjust one or several of the above parameters and iterate this process.

This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.

***QUESTION:***

Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?


***ANSWER:***

_Tuned parameters:_
```python
self.epsilon = 0.5  # tested .1, .3, .5 and .8
self.decay_rate = self.epsilon / 400  # adjusted with eps

self.learning_rate = 0.2  # alpha, tested .1 and .3
self.discount = 0.2  # gamma
```
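
To make the decay setting concrete: assuming `decay_rate` is subtracted from `epsilon` once per time step and floored at zero, exploration would stop after roughly 400 steps. The schedule itself is an assumption and the helper `epsilon_after` is hypothetical.

```python
def epsilon_after(steps, epsilon=0.5, divisor=400):
    """Exploration rate after `steps` linear decay steps, floored at zero."""
    decay_rate = epsilon / divisor
    return max(0.0, epsilon - steps * decay_rate)


# Print the exploration rate at a few points in training.
for steps in (0, 100, 200, 400):
    print(steps, epsilon_after(steps))
```

Because the decay rate is tied to the starting value, changing `epsilon` changes how much the agent explores early on without changing the step at which exploration ends.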

_Details of tuning for `self.epsilon`_:

- `self.epsilon = 0.1`
    - 64.7% of trials where destination is reached
- `self.epsilon = 0.3`
    - 92.8% of trials where destination is reached
- `self.epsilon = 0.5`
    - 96.5% of trials where destination is reached
- `self.epsilon = 0.8`
    - 95% of trials where destination is reached
    

With the parameters assigned above, our **smartcab** reaches its destination about 95 times out of 100 trials with the deadline enforced. (The average was taken over a sample of 10 batches of 100 trials each.)

Negative rewards in the latter 3/4 of the 100 trials were rare, in the range of 0 to 3 instances.

***QUESTION:***

Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?


***ANSWER:***

An optimal policy would always take the action that moves toward the destination unless that action would violate traffic laws. In that case, the optimal policy would choose the next best move (with regard to the direction of the destination) that does _not_ violate traffic laws. That is, the first priority is safety and the second is reaching the destination.

I would say my agent is getting close to the optimal policy. As mentioned in the previous question, with appropriate parameters, the learning agent almost always reaches the destination by the deadline after the first few trials. In addition, when measuring the frequency of negative rewards (i.e. traffic violations), I found that very few occurred after the agent had learned for 25 trials. However, this range of 0 to 3 violations over the latter 75 trials is still too high according to our definition of an optimal policy, which prioritizes safety over reaching the destination.

Also, while the learning agent consistently reached the destination by the deadline, it often made non-optimal moves with respect to the destination. This caused it to travel longer than necessary, even while driving safely. Since there is no default penalty for every move (or non-move), the agent is not incentivized to minimize travel time. Perhaps, as well, the penalty for an ordinary traffic violation should be smaller than a separate, larger penalty for a violation that causes a collision. These penalties could also be increased to ensure that no violations occur in later trials.

