## Implement a Basic Driving Agent
To begin, your only task is to get the **smartcab** to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:

- The next waypoint location relative to its current location and heading.
- The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
- The current time left from the allotted deadline.

To complete this task, simply have your driving agent choose a random action from the set of possible actions (`None, 'forward', 'left', 'right'`) at each intersection, disregarding the input information above. Set the simulation deadline enforcement, `enforce_deadline` to `False` and observe how it performs.
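
As a minimal sketch of the random policy described above (the names `VALID_ACTIONS` and `choose_random_action` are illustrative assumptions, not part of the project code), the agent could pick an action like this:

```python
import random

# The four actions named in the task description.
VALID_ACTIONS = [None, 'forward', 'left', 'right']


def choose_random_action(rng=random):
    """Pick one action uniformly at random, ignoring all sensor inputs."""
    return rng.choice(VALID_ACTIONS)


# Example: draw a handful of actions from a seeded generator.
rng = random.Random(0)
sample = [choose_random_action(rng) for _ in range(5)]
print(sample)
```

Passing a seeded `random.Random` instance is optional; it only makes a run reproducible when comparing results.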

---

### QUESTION
*Observe what you see with the agent's behavior as it takes random actions. Does the **smartcab** eventually make it to the destination? Are there any other interesting observations to note?*

### ANSWER
As we would expect, the agent showed completely erratic behavior. Still, even with random actions represented by $\alpha = 1$, $\gamma = 0$ and $\epsilon = 1$, averaged over 100 runs each with 100 training trials (`n_times=100`), the agent reaches its destination quite often. In its best run, it fails in only 33 of the 100 trials, each time by hitting the hard time limit (-100):

    TODO: REPEAT
    RESULTS FOR 100 RUNS WITH 100 TRIALS EACH
    Highest success rate is 0.67 for alpha=1, gamma=0, and epsilon=1.
    with average traffic violations per cab ride: 21.59
    with total net reward over all 100 trials: -287.21
    with average moves above optimum per cab ride: 75.01

We should also note the high average number of traffic violations per cab ride: 21.59. Good for a Hollywood movie, but probably not for a smart cab. And the violations didn't even pay off, because our total reward after all the trials is negative.

Anyway, given $8 \times 6 = 48$ intersections and the option to do nothing, I expected the cab to fail more often.

Naturally, the results deteriorate if we set `enforce_deadline` to `True`. The cab would reach the destination in about 20% of the cases at best:

    TODO: REPEAT
    RESULTS FOR 100 RUNS WITH 100 TRIALS EACH
    Highest success rate is 0.20 for alpha=1, gamma=0, and epsilon=1.
    with average traffic violations per cab ride: 7.52
    with total net reward over all 100 trials: 7.29
    with average moves above optimum per cab ride: 21.78
  
Judging by the other metrics, we can see that we're doing our cab a favor by pulling it off the street.

## Inform the Driving Agent
Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the **smartcab** and environment. The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, process the inputs and update the agent's current state using the `self.state` variable. Continue with the simulation deadline enforcement `enforce_deadline` being set to `False`, and observe how your driving agent now reports the change in state as the simulation progresses.

---

### QUESTION
*What states have you identified that are appropriate for modeling the **smartcab** and environment? Why do you believe each of these states to be appropriate for this problem?*

### ANSWER
Naively, we could use all the `inputs` (the light, the presence of other cars and their directions), the next `waypoint`, and also the `deadline` to constitute states. Since `deadline` can take on many different values, it would dramatically increase the number of states and thus the training time. If we wanted to increase the size of the grid later, we'd be in an even worse situation.

Of course, we could also think of something more complicated like $\left\lfloor \log_{3}(deadline) \right\rfloor$. This approach would not only reduce the number of states resulting from changes in `deadline`, but it would also reflect the importance of the remaining time: the state would change more often the smaller the remaining time becomes. Anyway, this should also be possible by adjusting $\alpha$, $\gamma$, and $\epsilon$ according to the `deadline`.
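
To illustrate the logarithmic bucketing idea, here is a small sketch. The function name `deadline_bucket` and the convention of mapping non-positive deadlines to `-1` are assumptions made for this example. It uses integer division rather than `math.log`, because floating-point logarithms can land just below an exact power of 3 and then floor to the wrong bucket.

```python
def deadline_bucket(deadline, base=3):
    """Return floor(log_base(deadline)) using integer arithmetic only.

    Assumption for this sketch: deadlines below 1 map to bucket -1.
    """
    if deadline < 1:
        return -1
    bucket = 0
    # Each integer division by `base` removes one power of the base.
    while deadline >= base:
        deadline //= base
        bucket += 1
    return bucket


# Buckets get narrower as the deadline shrinks:
# 1-2 -> 0, 3-8 -> 1, 9-26 -> 2, 27-80 -> 3
for d in (1, 2, 3, 8, 9, 26, 27):
    print(d, deadline_bucket(d))
```

With a starting deadline of a few dozen steps, this would add only a handful of values to the state instead of one per remaining time step.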

The next best thing would be 5 variables with 2 or 4 possible values:

- $light \in \{red, green\}$
- $left \in \{None, forward, left, right\}$
- $oncoming \in \{None, forward, left, right\}$
- $right \in \{None, forward, left, right\}$
- $waypoint \in \{None, forward, left, right\}$

Having a closer look at the traffic laws given for this project, we can see that `right` doesn't give us important information:
- If there's a green light (given that other cars obey the traffic rules), we'd only have to check for oncoming traffic that might cross our path if we're turning left.
- If there's a red light and we go left or forward, then we will definitely be punished regardless of other cars.
- If there's a red light and we turn right, then only cars from our left would be relevant.

In conclusion, `right` is redundant and we can stick with `light`, `left`, `oncoming` and `waypoint`.
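
The resulting state could be assembled as a plain tuple, as in the sketch below. It assumes that the sensed inputs arrive as a dictionary with the keys `'light'`, `'left'`, `'oncoming'` and `'right'`; the helper name `build_state` is hypothetical.

```python
def build_state(inputs, waypoint):
    """Combine the relevant inputs and the waypoint into a hashable state.

    Assumption: `inputs` is a dict with the keys 'light', 'left',
    'oncoming' and 'right'. The 'right' entry is deliberately ignored.
    """
    return (inputs['light'], inputs['left'], inputs['oncoming'], waypoint)


# Two situations that differ only in traffic from the right
# collapse into the same state.
a = build_state({'light': 'red', 'left': None,
                 'oncoming': 'forward', 'right': None}, 'right')
b = build_state({'light': 'red', 'left': None,
                 'oncoming': 'forward', 'right': 'left'}, 'right')
print(a == b)
```

A tuple is used because it is hashable and can therefore serve directly as a dictionary key in a Q-table.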

### OPTIONAL
*How many states in total exist for the **smartcab** in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?*

### ANSWER
The approach using `light`, `left`, `oncoming` and `waypoint` yields $2 \times 4 \times 4 \times 4 = 128$ possible states. This will allow the agent to make good decisions, but it will also learn quite slowly: in Q-Learning, there are many states that each have to be visited multiple times in order to learn the value of every possible action. Still, given the 100 trials set for this assignment, it should work quite well. The variable `light` has only two possible values but tells us a lot about driving well. Combining it with `waypoint`, whose 4 values tell us how to move quickly, will also give us good directions. They're our "principal components" that will do most of the work within the algorithm. For our environment with only three other cars, we could probably even get away with just these two variables, using 8 states and thus learning quickly. With more cars, however, the probability of crashes would increase, and we'd probably want to avoid these at all costs (at least in Germany, which seems to be becoming technophobic, judging by the reactions after the recent accident involving a Tesla car).

The most extreme alternative (besides driving randomly) would probably be to ignore everything but `waypoint`. We'd only have 4 states, and given the small amount of traffic this might even work - but I'd probably not use such a smartcab in real life, and we'd better not come across a cop...
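
The state counts discussed above can be checked by enumerating the combinations. This is only a sketch; the list names are chosen for illustration.

```python
from itertools import product

LIGHTS = ['red', 'green']
DIRECTIONS = [None, 'forward', 'left', 'right']

# light x left x oncoming x waypoint
full_states = list(product(LIGHTS, DIRECTIONS, DIRECTIONS, DIRECTIONS))

# light x waypoint only
reduced_states = list(product(LIGHTS, DIRECTIONS))

# waypoint only
minimal_states = list(DIRECTIONS)

print(len(full_states), len(reduced_states), len(minimal_states))
```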

## Implement a Q-Learning Driving Agent

With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. Each action taken by the **smartcab** will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement `enforce_deadline` to `True`. Run the simulation and observe how the **smartcab** moves about the environment in each trial.

The formulas for updating Q-values can be [found in this video](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345409/modules/e64f9a65-fdb5-4e60-81a9-72813beebb7e/lessons/5446820041/concepts/6348990570923).

---

### QUESTION
*What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?*

### ANSWER
It was not completely clear to me how $\alpha$, $\gamma$ and $\epsilon$ should be set for this first run, so I simply used `random.random()` to generate random values for each of them. I also had a look at the results before setting `enforce_deadline` to `True`. In most cases, even with these random values, the performance increased dramatically. In some cases, the agent wouldn't fail anymore. That was to be expected, because now the agent learns over time and uses the rewards as guidance for its actions instead of driving randomly. Furthermore, the agent violates traffic rules less frequently and reaches the destination more quickly. Learning accounts for this as well, because the agent receives large negative feedback for violating rules and also negative feedback for deviating from the next waypoint suggested by the planner. I also noticed that the agent tended to fail only in early trials, if at all. That's because it takes some time to fill in all the "Q values" for a state, but after that's done: boom. We could achieve the same by increasing the number of trials, of course, which would give the algorithm more time to explore all possible states and compute the utility of each action in those states.
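
The learning step itself is not spelled out in this report. The sketch below assumes the standard tabular Q-learning update, $Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\,(r + \gamma \max_{a'} Q(s', a'))$, with all Q-values initialized to zero. The names `make_q_table` and `update_q` are hypothetical.

```python
from collections import defaultdict

VALID_ACTIONS = [None, 'forward', 'left', 'right']


def make_q_table():
    """Create a Q-table whose unseen states start with all-zero values."""
    return defaultdict(lambda: {a: 0.0 for a in VALID_ACTIONS})


def update_q(q, state, action, reward, next_state, alpha, gamma):
    """Apply one standard Q-learning update and return the new value."""
    best_next = max(q[next_state].values())
    old_value = q[state][action]
    q[state][action] = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)
    return q[state][action]


q = make_q_table()
s1 = ('green', None, None, 'forward')
s2 = ('red', None, None, 'forward')
first = update_q(q, s1, 'forward', 2.0, s2, alpha=0.5, gamma=0.5)
second = update_q(q, s2, None, 1.0, s1, alpha=0.5, gamma=0.5)
print(first, second)
```

Starting from zeros means early trials carry little information, which fits the observation that failures cluster in the early trials.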

As an example, here's just one result for different random combinations of $\alpha$, $\gamma$ and $\epsilon$ also averaged over 100 runs:

    TODO: REPEAT
    RESULTS FOR 100 RUNS WITH 100 TRIALS EACH
    Highest success rate is 0.98 for alpha=0.689289044309, gamma=0.499773105599, and epsilon=0.0620423596347.
    with average traffic violations per cab ride: 0.38
    with total net reward over all 100 trials: 2220.99
    with average moves above optimum per cab ride: 13.40

After these trials, I set `enforce_deadline` to `True` and simply set $\alpha$ and $\gamma$ to 0.5 and $\epsilon$ to 0.25, so the agent would still randomly cruise the streets in about 25 % of all moves, moderately consider previous visits and moderately take into account the utility of the next state. In 100 runs with 100 iterations each, the average success rate was about 80 % - still better than the random approach even though now we have a deadline:

    RESULTS FOR 100 RUNS WITH 100 TRIALS EACH
    Highest success rate is 0.7988 for alpha=0.5, gamma=0.5, and epsilon=0.25.
    with average traffic violations per cab ride: 1.29
    with total net reward over all 100 trials: 2198.30
    with average moves above optimum per cab ride: 12.71
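
The "random cruising in about 25 % of all moves" corresponds to an epsilon-greedy action choice. A minimal sketch, assuming the Q-values of the current state are available as a dictionary from action to value (the function name `choose_action` is hypothetical):

```python
import random

VALID_ACTIONS = [None, 'forward', 'left', 'right']


def choose_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take a best-valued action.

    Assumption: `q_values` maps every action in VALID_ACTIONS to a number.
    Ties between equally good actions are broken at random.
    """
    if rng.random() < epsilon:
        return rng.choice(VALID_ACTIONS)
    best = max(q_values.values())
    best_actions = [a for a in VALID_ACTIONS if q_values[a] == best]
    return rng.choice(best_actions)


q_values = {None: 0.0, 'forward': 2.0, 'left': -1.0, 'right': 0.5}
greedy = choose_action(q_values, epsilon=0.0, rng=random.Random(1))
print(greedy)
```

Breaking ties at random matters mostly in early trials, when many Q-values are still identical.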


## Improve the Q-Learning Driving Agent
Your final task for this project is to enhance your driving agent so that, after sufficient training, the **smartcab** is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (`alpha`), the discount factor (`gamma`) and the exploration rate (`epsilon`) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your **smartcab**:

- Set the number of trials, `n_trials`, in the simulation to 100.
- Run the simulation with the deadline enforcement `enforce_deadline` set to `True` (you will need to reduce the update delay `update_delay` and set the `display` to `False`).
- Observe the driving agent’s learning and **smartcab’s** success rate, particularly during the later trials.
- Adjust one or several of the above parameters and iterate this process.
- This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.

---

### QUESTION
*Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?*

### ANSWER
In order to find the best combination of $\alpha$, $\gamma$ and $\epsilon$, I quickly implemented a grid search. I lazily started it with a set of $\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$ for each parameter and averaged the success rate of 100 runs for each combination, each with 100 trials for training (`n_times=100`). A feasible problem like ours can still be tackled fairly well this way ;-) After quite some time of waiting, the result was:

    TODO:
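
The grid search described above could be structured as in the following sketch. The `evaluate` callable is a placeholder assumption: it stands for whatever routine trains the agent with the given parameters and returns the averaged success rate. The dummy scoring function at the end exists only to make the example runnable and says nothing about the real results.

```python
from itertools import product


def grid_search(evaluate, alphas, gammas, epsilons):
    """Try every parameter combination and return the best one with its score.

    Assumption: `evaluate(alpha, gamma, epsilon)` returns a number where
    higher is better, e.g. an averaged success rate.
    """
    best_params, best_score = None, float('-inf')
    for alpha, gamma, epsilon in product(alphas, gammas, epsilons):
        score = evaluate(alpha, gamma, epsilon)
        if score > best_score:
            best_params, best_score = (alpha, gamma, epsilon), score
    return best_params, best_score


# The coarse grid from the text: 0.1, 0.2, ..., 0.9 for each parameter.
COARSE = [round(0.1 * i, 1) for i in range(1, 10)]


def dummy_evaluate(alpha, gamma, epsilon):
    """Stand-in scoring function, NOT a real simulation result."""
    return -((alpha - 0.3) ** 2 + (gamma - 0.5) ** 2 + (epsilon - 0.1) ** 2)


best_params, best_score = grid_search(dummy_evaluate, COARSE, COARSE, COARSE)
print(best_params)
```

With 9 values per parameter this evaluates $9^3 = 729$ combinations, which explains the long waiting time when each evaluation consists of 100 runs.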

Afterwards, I explored the space around this solution a little more diligently by setting $\alpha \in \{0.03, 0.07, 0.1, 0.13, 0.17\}, \gamma \in \{0.13, 0.17, 0.2, 0.23, 0.27\}$ and $\epsilon \in \{0.13, 0.17, 0.2, 0.23, 0.27\}$. The final result was:

    TODO: 

Since 

### QUESTION
*Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?*

### ANSWER
An optimal policy could be "safety first", and I think the agent is pretty close to that. It might need more training than "only" 100 trials, but the chosen states ensure that a "safety first" policy can be learned. To make sure of that, $\epsilon$ should be set to 0 either after a certain number of trials or by using a function that converges to 0 instead of a fixed value. This way, there really wouldn't be any random driving after a certain amount of time, and the agent would rely solely on the Q-table, i.e. on what it has learned from the rewards it received.

One could argue that the average penalty per cab ride is really small but not zero. However, a closer look at the data tells us that those penalties primarily occur in early trials when the agent is still learning, and sometimes later on when it chooses to explore the state space with a random move. This could also be prevented by letting $\epsilon$ converge to zero after a certain amount of time.
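
One possible way to let $\epsilon$ reach zero is a linear decay with a cutoff, sketched below. The function name `decayed_epsilon` and the cutoff of 80 trials are assumptions for illustration, not values taken from the experiments; the starting value 0.25 is the fixed $\epsilon$ used earlier.

```python
def decayed_epsilon(trial, epsilon_start=0.25, cutoff=80):
    """Linearly reduce epsilon to 0 at `cutoff`, then keep it at 0.

    Assumption: `trial` counts from 0 and `cutoff` is a hypothetical choice.
    """
    if trial >= cutoff:
        return 0.0
    return epsilon_start * (1.0 - trial / float(cutoff))


for trial in (0, 40, 80, 99):
    print(trial, decayed_epsilon(trial))
```

After the cutoff the agent acts purely greedily, so any remaining penalties would come from the learned Q-values rather than from exploration.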

Also, one might argue that the policy is not yet optimal because the number of moves the agent needs still exceeds the distance between the start and destination intersections, but we must consider red lights that force us to wait or take a detour. In consequence, reaching the target in minimum time is really only possible if the agent never meets a red light, which will rarely be the case.