# Reinforcement Learning Python Project
### Building a Machine Learning Model

**Q Learning** Algorithm is used for the  model


#### Algorithm
- Initialise the **Q-table** to all zeros
- Iterate
    - Agent is in state **state**.
    - With probability **epsilon** choose to **explore**, else **exploit**.
        - If **explore**, then choose a *random* **action**.
        - If **exploit**, then choose the *best* **action** based on the current **Q-table**.
    - Update the **Q-table** from the new **reward** to the previous state.
    - Q[**state, action**] = (1 – **alpha**) * Q[**state, action**] + **alpha** * (**reward + gamma** * max(Q[**new_state**]) — Q[**state, action**])
    
#### Variables
As you can se, we have introduced the following variables.

- **epsilon**: the probability to take a random action, which is done to explore new territory.
- **alpha**: is the learning rate that the algorithm should make in each iteration and should be in the interval from 0 to 1.
- **gamma**: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99
- **reward**: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.

### Description 
- To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.
- At each position there are 6 different actions that can be taken.
    - **Action 0**: Go South if on field.
    - **Action 1**: Go North if on field.
    - **Action 2**: Go East if on field (Please notice, I mixed up East and West (East is Left here)).
    - **Action 3**: Go West if on field (Please notice, I mixed up East and West (West is right here)).

    - **Action 4**: Pickup item (it can try even if it is not there)
    - **Action 5**: Drop-off item (it can try even if it does not have it)
- Based on these actions we will make a reward system.
    - If the **agent** tries to go off the **field**, punish with **-10** in **reward**.
    - If the **agent** makes a (legal) move, punish with **-1** in **reward**, as we do not want to encourage endless walking around.
    - If the **agent** tries to pick up item, but it is not there or it has it already, punish with **-10** in **reward**.
    - If the **agent** picks up the item correct place, **reward** with **20**.
    - If **agent** tries to drop-off item in wrong place or does not have the item, punish with **-10** in **reward**.
    - If the **agent** drops-off item in correct place, **reward** with **20**.


In [1]:
class Field:
    def __init__(self, size, item_pickup, item_drop_off, start_pos):
        self.size = size
        self.item_pickup = item_pickup
        self.item_drop_off = item_drop_off
        self.position = start_pos
        self.item_in_car = False
        
    def get_number_of_states(self):
        return self.size*self.size*self.size*self.size*2
    
    def get_state(self):
        state = self.position[0]*self.size*self.size*self.size*2
        state = state + self.position[1]*self.size*self.size*2
        state = state + self.item_pickup[0]*self.size*2
        state = state + self.item_pickup[1]*2
        if self.item_in_car:
            state = state +1
        return state    
        
    def make_action(self, action): 
        (x, y) = self.position
        if action == 0: #Go south
            if y == self.size-1:
                return -10, False
            else:
                self.position = (x, y+1)
                return -1, False
        
        elif action == 1: #Go north
            if y == 0:
                return -10, False
            else:
                self.position = (x,y-1)
                return -1, False
            
        elif action == 2: #Go east
            if x == 0:
                return -10, False
            else:
                self.position = (x-1,y)
                return -1, False
            
        elif action == 3: #Go west
            if x == self.size -1:
                return -10, False
            else:
                self.position = (x+1,y)
                return -1, False
            
        elif action == 4: #pcikup item
            if self.item_in_car:
                return -10, False
            elif self.item_pickup != (x,y):
                return -10, False
            else:
                self.item_in_car = True
                return 20, False
            
        elif action == 5: #Drop off
            if not self.item_in_car:
                return -10, False
            elif self.item_drop_off != (x,y):
                self.item_pickup = (x, y)
                self.item_in_car = False
                return -10, False
            else:
                return 20, True
        

In [4]:
import random
import numpy as np

In [5]:
size = 10
item_start = (0,0)
item_drop_off = (9,9)
start_pos = (0,9)
    
field = Field(size, item_start, item_drop_off, start_pos)

number_of_states = field.get_number_of_states()
number_of_actions = 6

q_table = np.zeros((number_of_states, number_of_actions))

epsilon = 0.1
alpha = 0.1
gamma = 0.6

for _ in range(1000):
    field = Field(size, item_start, item_drop_off, start_pos)
    done = False
    
    while not done:
        state = field.get_state()
        if random.uniform(0,1) < epsilon:
            action = random.randint(0,5)
        else:
            action = np.argmax(q_table[state])
            
        reward, done = field.make_action(action) 
        #Q[state, action] = (1 – alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]) — Q[state, action])
        
        new_state = field.get_state()
        new_state_max = np.max(q_table[new_state])
        q_table[state, action] = (1-alpha)*q_table[state, action] + alpha*(reward+gamma*new_state_max-q_table[state, action])

In [6]:
def reinforcement_learning():
    epsilon = 0.1
    alpha = 0.1
    gamma = 0.6
    
    field = Field(size, item_start, item_drop_off, start_pos)
    done = False
    steps = 0
    
    while not done:
        state = field.get_state()
        if random.uniform(0,1) < epsilon:
            action = random.randint(0,5)
        else:
            action = np.argmax(q_table[state])
            
        reward, done = field.make_action(action) 
        #Q[state, action] = (1 – alpha) * Q[state, action] + alpha * (reward + gamma * max(Q[new_state]) — Q[state, action])
        
        new_state = field.get_state()
        new_state_max = np.max(q_table[new_state])
        q_table[state, action] = (1-alpha)*q_table[state, action] + alpha*(reward+gamma*new_state_max-q_table[state, action])
        
        steps = steps + 1
        
    return steps 

Running program to test how many steps to solve problem

In [7]:
reinforcement_learning()

31

Taking the average numbers of the step of 1000 tryings 

In [8]:
runs_rl = [ reinforcement_learning() for _ in range(1000)]

sum(runs_rl)/len(runs_rl)

69.306