# Project 4: Reinforcement Learning

## Train a Smartcab to Drive

### The setting

In the project “Train a Smartcab to Drive”, a smartcab operates in an idealized grid-like city. There are traffic lights at each intersection, and other cars are present. The smartcab gets a reward for obeying traffic rules and a penalty for breaking them or causing an accident. The goal of this project is to implement a learning agent for the smartcab that uses these rewards and penalties to learn an optimal policy for driving on city roads: obeying traffic rules and reaching the destination within a goal time.

### Step 1 - Implement a basic driving agent

In the first step, I let the smartcab take a random action from the following options:

- doing nothing (action ‘None’)
- driving forward (action ‘forward’)
- turning left (action ‘left’)
- turning right (action ‘right’)

In this step, the smartcab does not learn from the results of its actions, and the deadline is not enforced: a trial is only aborted once the smartcab is 100 steps past the given time.
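The random baseline described above can be sketched as follows. This is a minimal illustration rather than the project's agent code; the names `VALID_ACTIONS` and `choose_random_action` are hypothetical.

```python
import random

# The four actions listed above; None stands for "do nothing".
VALID_ACTIONS = [None, 'forward', 'left', 'right']

def choose_random_action(rng=random):
    """Pick one of the four actions uniformly at random, ignoring all inputs."""
    return rng.choice(VALID_ACTIONS)
```

Because the choice ignores the traffic light, the other cars, and the planned route, any trial that succeeds does so by chance.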

I put the printed output of my test run into `'output_first_text.txt'`.

In [43]:
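def read_in_my_text(file_name):
    # Read the simulator's printed output for one test run.
    with open(file_name, "r") as myfile:
        output = myfile.read()
    # Count the trials that ended with a success message or a hard-limit abort.
    count_reached = output.count('Primary agent has reached destination!')
    count_aborted = output.count('Primary agent hit hard time limit (-100)! Trial aborted.')
    # Success rate = reached / (reached + aborted); float() avoids integer division.
    rate = float(count_reached) / (count_reached + count_aborted)
    print("Rate of reaching the destination " + str(rate))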

In [44]:
read_in_my_text("output_first_text.txt")

Rate of reaching the destination 0.613861386139


The counts of reached and aborted trials add up to 101, which matches the output file: the trials are numbered from 0 to 100.

This shows that with random actions the smartcab reaches the destination in about 60% of the trials within the given time plus 100 extra steps. In the remaining 40% or so, even the 100 extra steps are not enough for the smartcab to reach its destination.

### Step 2 - Identify and update state

I modeled the states explicitly for our given traffic rules. These states could easily be changed to work in an environment where we don't know which traffic light color means go, or in one where traffic drives on the other side of the road. In that case, the number of states would increase accordingly (for example, I would look at more inputs when the traffic light is red).

In reducing the states, my aim was to keep them as simple and understandable as possible. Each of my chosen states uses only the inputs needed to tell whether our smartcab should stop or go. For a smartcab heading forward, a red traffic light is enough to know it should stop; we don't need to know whether there is another car oncoming, on the left, or on the right.

I chose to look at the following states:

- forward, red light
- forward, green light
- turn right, green light
- turn right, red light, someone left
- turn right, red light, no one left
- turn left, red light
- turn left, green light, someone oncoming
- turn left, green light, no one oncoming
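The eight states listed above could be derived from the sensed inputs roughly like this. It is a sketch under assumptions, not the author's implementation: `build_state` is a hypothetical helper, the first part of each state is assumed to be the planner's next waypoint, and `inputs` is assumed to be a dict with the keys `'light'`, `'oncoming'` and `'left'`, where a value of `None` means no car is present.

```python
def build_state(next_waypoint, inputs):
    """Map the next waypoint and the sensed inputs to one of the eight states.

    Assumption: inputs['light'] is 'red' or 'green'; inputs['oncoming'] and
    inputs['left'] are None when no car is there.
    """
    light = inputs['light']
    if next_waypoint == 'forward':
        # Heading forward, only the light matters.
        return ('forward', light)
    if next_waypoint == 'right':
        if light == 'green':
            return ('right', 'green')
        # On red, only a car on the left matters for a right turn.
        return ('right', 'red', inputs['left'] is not None)
    if next_waypoint == 'left':
        if light == 'red':
            return ('left', 'red')
        # On green, only an oncoming car matters for a left turn.
        return ('left', 'green', inputs['oncoming'] is not None)
    # Anything else (e.g. no waypoint at the destination) is outside this sketch.
    raise ValueError("unexpected waypoint: %r" % (next_waypoint,))
```

Returning tuples keeps the states hashable, so they can be used directly as dictionary keys in a Q-table.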

### Step 3 - Implement Q-Learning

In this step I implemented the Q-learning algorithm with a learning rate of 0.5 and a discount factor of 0.5. The next action is always the one with the highest estimated value for the current state.
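For reference, the standard tabular Q-learning update with these two parameters can be sketched as below. This shows the textbook rule, not necessarily the exact code in `agent.py`; the names `q_update` and `greedy_action`, the zero default for unseen state-action pairs, and the tie-breaking order are assumptions of this sketch.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate used in this step
GAMMA = 0.5  # discount factor used in this step
ACTIONS = [None, 'forward', 'left', 'right']

def q_update(q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

def greedy_action(q, state):
    """Return the action with the highest estimate; ties go to the first in ACTIONS."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

# Q-table in which unseen (state, action) pairs start at 0.0 (assumption).
q_table = defaultdict(float)
```

A purely greedy choice like `greedy_action` never explores: once one action has a positive estimate in a state, the alternatives in that state are not tried again unless that estimate drops.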

I put the printed output of my test run into `'output_second_text.txt'`.

In [45]:
read_in_my_text("output_second_text.txt")

Rate of reaching the destination 0.0


Interestingly, the smartcab does not reach its destination in any trial, even though it has the 100 extra steps. The visualization shows that the smartcab sticks to actions it took before and rarely tries a new one, so it gets stuck most of the time. This is probably because the penalty for not moving at a red traffic light is lower than the penalty for breaking other traffic rules, and the smartcab keeps collecting points for not moving at a traffic light as soon as it turns red.

Let's try to fix this by tweaking the learning rate, the discount factor, and the way the next action is chosen.

### Step 4 - Enhance the driving agent

In this step I experiment with the learning rate and the discount factor. But first, I change how the next action is chosen when the highest estimated reward of any action is 0.

In this case, the smartcab has probably not faced the situation yet, so I want it to make a random choice of what to do. I implemented a random choice of actions whenever the smartcab has no prior knowledge about a situation like the one it is facing. To keep the smartcab from getting stuck, I also let it choose a random action when the highest estimated reward belongs to doing nothing and is below 0.4.
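The two rules above can be sketched as a single selection function. This is an illustration under assumptions, not the code from `agent.py`: `choose_action` and `IDLE_THRESHOLD` are hypothetical names, and the Q-table is assumed to be a dict keyed by `(state, action)` in which missing entries count as 0.

```python
import random

ACTIONS = [None, 'forward', 'left', 'right']
IDLE_THRESHOLD = 0.4  # threshold for "doing nothing" taken from the text

def choose_action(q, state, rng=random):
    """Greedy choice with two random fallbacks.

    q maps (state, action) -> estimated value; missing entries count as 0.0.
    """
    best_action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
    best_value = q.get((state, best_action), 0.0)
    # Rule 1: a best estimate of 0 suggests the situation is new, so act randomly.
    if best_value == 0:
        return rng.choice(ACTIONS)
    # Rule 2: if idling is best but only weakly so, act randomly to avoid getting stuck.
    if best_action is None and best_value < IDLE_THRESHOLD:
        return rng.choice(ACTIONS)
    return best_action
```

Note that the random fallback may itself pick the idle action, so it lowers the chance of getting stuck rather than ruling it out.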

In [46]:
read_in_my_text("output_third_text.txt")

Rate of reaching the destination 0.49


With these changes, the share of trials in which the smartcab reaches the destination goes up to 49%. That is a nice start compared to the result above, with a lot of room for improvement.

In the next step I lowered the learning rate to 0.1 instead of 0.5.

In [89]:
read_in_my_text("output_fourth_text.txt")

Rate of reaching the destination 0.594059405941


This raises the share of trials in which the smartcab reaches the destination to about 59%. Maybe we can top that by lowering the learning rate even further.

In [88]:
%run smartcab/agent.py

Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (1, 4), destination = (4, 5), deadline = 20
RoutePlanner.route_to(): destination = (4, 5)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (2, 5), destination = (7, 6), deadline = 30
RoutePlanner.route_to(): destination = (7, 6)
Environment.step(): Primary agent hit hard time limit (-100)! Trial aborted.
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (8, 5), destination = (3, 6), deadline = 30
RoutePlanner.route_to(): destination = (3, 6)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (3, 1), destination = (5, 6), deadline = 35
RoutePlanner.route_to(): destination = (5, 6)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 4
Environment.reset(): Trial set up with start = (8, 1), destination = (3, 2),