# Landfill mining with reinforcement learning and mobile robotics

#### Individual Study, Joseph Maliszewski 

### Introduction

A robot will need to navigate its way through environments at a landfill. To do this it will need to avoid obstacles. There are three ways to produce this behaviour:
<br>
<br>
1) A human adds rules to the system in order to generate the desired behaviour
<br>
<br>
2) The system uses self-learning, such as reinforcement learning (RL), to learn this behaviour by itself
<br>
<br>
3) A hybrid of the two
<br>
<br>
Because the system will need to scale in this environment, it may be better to use an RL approach (or a hybrid of the two). This removes the need to account for a large number of edge cases by hand, which lowers the need for human maintenance or intervention during development. It also gives a high likelihood of reaching optimal behaviour, which increases the efficiency of the system.
<br>
<br>
This notebook takes a very simple game from the OpenAI Gym library to present a simple example of how RL could be implemented for a single agent mining useful resources from landfill, using a Markov decision process. In future, as described in the report, nested multi-agent reinforcement learning would be the next step towards implementing a multi-robot collaborative system on a landfill.
<br>
<br>
Reinforcement learning is a machine learning technique in which an agent learns to act in a particular way to maximize a given reward. A Markov decision process (MDP) is a discrete-time stochastic control process that formalizes sequential decision making such as this. In this example, the MDP describes the problem and RL is used to find a solution to it. In other words, we use RL to discover a policy that solves the problem the MDP describes. There are five components to an MDP:
environment, agent, states, actions and rewards. These work sequentially: the agent takes an action in a given state, receives a reward and moves to a new state. The resulting sequence of states, actions and rewards is called a trajectory.
 <br>   
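
As a small illustration of a trajectory and the reward it accumulates, the sketch below stores each step as a `(state, action, reward)` tuple and sums the rewards with a discount factor. The state and action numbers, the reward values and the discount of 0.5 are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from typing import List, Tuple

# One step of a trajectory: (state, action, reward).
Step = Tuple[int, int, float]


def discounted_return(trajectory: List[Step], discount: float) -> float:
    """Sum the rewards of a trajectory, weighting step t by discount**t."""
    return sum((discount ** t) * reward
               for t, (_state, _action, reward) in enumerate(trajectory))


# Hypothetical three-step trajectory: only the final step earns a reward.
trajectory = [(0, 2, 0.0), (1, 1, 0.0), (5, 1, 1.0)]

# The reward arrives at t = 2, so it is weighted by 0.5**2.
print(discounted_return(trajectory, discount=0.5))  # 0.25
```

A lower discount makes rewards that arrive later in the trajectory count for less, so the agent is pushed towards shorter routes to the goal.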


### RL Landfill Game 

The robot starts at a 2 m x 2 m container on the edge of the landfill in simulation.
<br>
<br>
The task for the robot is to fill the container with clear plastic bottles collected from the landfill.
<br>
<br>
The robot has been notified of a GPS location within a newly discovered area of the landfill where clear plastic bottles are in abundance. This was communicated by another robot that emitted an "artificial pheromone" indicating the presence of this resource to other robots in the area.
<br>
<br>
The robot must learn to avoid all obstacles on its way to this location in the landfill, with no prior knowledge of its environment.
<br>
<br>
The landfill site can be represented as a grid:
<br>
<br>
SFFH<br>
FHFF<br>
HFFH<br>
FHFG<br>
<br>
S = Robot (agent) starting position<br>
F = Free space in landfill/viable path<br>
H = Obstacle in landfill<br>
G = The goal destination with abundant clear plastic bottles<br>
<br>
The robot can move up, down, left or right. The episode terminates when the robot reaches the goal or hits an obstacle.
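
The grid and movement rules above can be sketched in plain Python without Gym. This is only an illustrative model: the action numbering (0 = left, 1 = down, 2 = right, 3 = up), the reward of 1 at the goal and the fully deterministic movement are assumptions made for this sketch, and the Gym environment used in the code cells may behave differently (for example, its moves can be stochastic).

```python
# Grid copied from the description above; states are numbered row by row from 0.
GRID = ["SFFH", "FHFF", "HFFH", "FHFG"]
N_ROWS, N_COLS = len(GRID), len(GRID[0])

# Assumed action numbering for this sketch: (row change, column change).
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Apply one deterministic move and return (new_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the robot on the edge cell.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    cell = GRID[row][col]
    new_state = row * N_COLS + col
    reward = 1.0 if cell == "G" else 0.0   # reward only at the goal
    done = cell in ("H", "G")              # obstacle or goal ends the episode
    return new_state, reward, done


print(step(0, 2))   # (1, 0.0, False): moves right into free space
print(step(1, 1))   # (5, 0.0, True): moves down into an obstacle
print(step(14, 2))  # (15, 1.0, True): moves right onto the goal
```

Numbering the cells row by row gives 16 states, which is why the Q-table needs one row per cell and one column per action.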


In [2]:
from IPython.display import clear_output
import random
import gym
import numpy as np
import matplotlib.pyplot as plt


In [3]:
def build_empty_qtable(total_num_states, num_poss_actions):
    # The Q-table holds a value for every pair of state and action that can be taken in that state.
    # Actions: left, right, up, down. States: each unit of area in the environment.
    # The table has one row per state and one column per action. It is continuously updated and
    # referenced, and is the basis for how the agent learns from its environment.

    return np.zeros((total_num_states, num_poss_actions))

def update_qtable(qtable, learn_r, reward, discount_r, current_state,new_state, action):
    
    # The new Q-value is a weighted sum of the old value and the newly learnt value.
    
    old_value = qtable[current_state, action]
    new_value = reward + discount_r*(np.max(qtable[new_state, :]))
    qtable[current_state, action] =  ((1-learn_r) * old_value) + (learn_r*new_value)
    return qtable


def decay_exploration_rate(explore_r, min_explore_r, max_explore_r, explore_decay_r, episode):
    explore_r = min_explore_r + (max_explore_r - min_explore_r)*np.exp(-explore_decay_r*episode)
    return explore_r
    

def main():
    
    # Get the landfill environment (modelled on the FrozenLake environment provided by OpenAI Gym)
    environment = gym.make("FrozenLake-v0")
    
    # To build the Q-table we need the number of possible actions in a state and the total number of states.
    # The Q-table holds every possible state and action pair.
    num_poss_actions = environment.action_space.n
    total_num_states = environment.observation_space.n
    qtable = build_empty_qtable(total_num_states, num_poss_actions)
    
    #Set params for how long we want it to learn for 
    num_episodes = 10000
    num_steps = 100
    
    #set exploration/exploitation trade off params
    explore_r = 1
    explore_decay_r = 0.001
    max_explore_r = 1
    min_explore_r = 0.01
    
    discount_r = 0.1
    learn_r = 0.99
    
    # obtain all the rewards from all episodes to see learning progress 
    all_rewards_all_episodes = []
  
    # Start the learning process
    for episode in range(num_episodes):
      
        # At the start of each episode, the environment must be reset.
        current_state = environment.reset()
        #print(current_state)
        
        # Tells us when the episode is finished, so it needs to be reset at the beginning of each episode
        episode_complete = False
        
        #total rewards for current episode reset
        rewards_current_episode = 0
        
        #for all time steps in each episode (t in T)
        for steps in range(num_steps):
                                 
    
            # Exploration/exploitation trade-off. If the random draw is over the exploration rate,
            # exploitation will occur; if it is under, exploration will occur.
        
            explore_limit = random.uniform(0,1)
            
            
           # print("explore_limit ", explore_limit, "explore_r ", explore_r)
            
            
            if explore_limit > explore_r:

                # Exploit: take the action with the highest Q-value found to date in the current state.
                # Because the exploration rate starts at 1, the agent always explores to begin with. As the
                # rate decays, it becomes less likely to explore and more likely to exploit.
                action = np.argmax(qtable[current_state, : ])
        
            else:
                #explore - takes a random action in that current state.
                action = environment.action_space.sample() 
              
            
            
            # Now that the action has been decided, it must be executed.
            # The environment will tell us:
                # - the new state the action has taken the agent to
                # - the reward received from the action taken
                # - whether or not the agent has reached a terminal state (e.g. collision or goal)
                # - diagnostic info regarding the environment
            new_state, reward, episode_complete, info = environment.step(action)
            
            
            rewards_current_episode = rewards_current_episode + reward                    
            #print("rewards_current_episode  ",rewards_current_episode)
            # We can now update the Q-table with the information from this action and state. The update does
            # not replace the old value outright but blends it with the newly learnt value, weighted by the
            # learning rate.
            qtable = update_qtable(qtable, learn_r, reward, discount_r, current_state, new_state, action)
                                     
           # print(qtable)
            
            current_state = new_state
                                
            #print(current_state)
            # Check whether the agent has reached a terminal state; if so, break out of the time steps
            if episode_complete:
                break
                                 
        explore_r = decay_exploration_rate(explore_r, min_explore_r, max_explore_r, explore_decay_r, episode)                       
        
        all_rewards_all_episodes.append(rewards_current_episode)
    
    # Integer division: np.split needs a whole number of sections
    rewards_per_1000_episodes = np.split(np.array(all_rewards_all_episodes), num_episodes // 1000)
    #print("rewards_per_1000_episodes ", len(rewards_per_1000_episodes))
    count = 1000
    
    rewards_1000_eps = []
    for r in rewards_per_1000_episodes:
        rewards_1000_eps.append(str(sum(r/1000)))
        print(count, " : ", str(sum(r/1000)))
        count += 1000
    print(qtable)
        
main()

(1000, ' : ', '0.01900000000000001')
(2000, ' : ', '0.025000000000000015')
(3000, ' : ', '0.06300000000000004')
(4000, ' : ', '0.057000000000000044')
(5000, ' : ', '0.11400000000000009')
(6000, ' : ', '0.09500000000000007')
(7000, ' : ', '0.10700000000000008')
(8000, ' : ', '0.11900000000000009')
(9000, ' : ', '0.10400000000000008')
(10000, ' : ', '0.09400000000000007')
[[2.03807538e-17 4.06286863e-18 2.74376063e-27 8.95868595e-28]
 [1.16374460e-41 4.42116666e-29 3.21175758e-37 3.72905871e-12]
 [7.53344398e-13 6.33982839e-26 2.41193366e-35 3.74954344e-26]
 [1.34440906e-22 3.02702778e-37 1.31847935e-35 2.70073483e-36]
 [2.20741735e-17 3.18003922e-35 2.81602050e-24 5.98833426e-31]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.30563943e-25 3.80476567e-14 1.09902438e-26 1.04004085e-34]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.15220215e-31 9.54536995e-13 1.90461159e-20 2.52003891e-31]
 [9.60608665e-13 1.93118980e-08 2.07655573e-13 6.91162946e-2
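
The two rules that drive learning in the cell above, the Q-table update and the exploration-rate decay, can be checked in isolation. The sketch below re-implements both outside Gym. The learning rate of 0.5 and discount of 0.9 are illustrative values chosen to make the arithmetic easy to follow and are not the notebook's settings, while the exploration defaults mirror the values set in `main()`. The state and action numbers are hypothetical.

```python
import numpy as np


def q_update(qtable, learn_r, reward, discount_r, state, new_state, action):
    """One tabular Q-learning update, applied in place."""
    target = reward + discount_r * np.max(qtable[new_state, :])
    qtable[state, action] = (1 - learn_r) * qtable[state, action] + learn_r * target
    return qtable


def exploration_rate(episode, min_r=0.01, max_r=1.0, decay_r=0.001):
    """Exponential decay from max_r towards min_r as episodes pass."""
    return min_r + (max_r - min_r) * np.exp(-decay_r * episode)


qtable = np.zeros((16, 4))  # 16 states, 4 actions

# Stepping from state 14 onto the goal (state 15) with action 2 pays reward 1.
q_update(qtable, learn_r=0.5, reward=1.0, discount_r=0.9,
         state=14, new_state=15, action=2)
print(qtable[14, 2])  # 0.5 = 0.5 * 0 + 0.5 * (1 + 0.9 * 0)

# A later move from state 10 into state 14 earns no reward itself, but it
# bootstraps from the value just learned for state 14.
q_update(qtable, learn_r=0.5, reward=0.0, discount_r=0.9,
         state=10, new_state=14, action=1)
print(round(qtable[10, 1], 3))  # 0.225 = 0.5 * (0 + 0.9 * 0.5)

# The exploration rate starts at 1 and decays towards the minimum of 0.01.
for episode in (0, 1000, 5000, 9999):
    print(episode, round(float(exploration_rate(episode)), 4))
```

The second update shows how value spreads backwards from the goal: each step further away is scaled by the discount again, so a small discount leaves only very small values in the states far from the goal.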

### References

1. https://towardsdatascience.com/introduction-to-reinforcement-learning-markov-decision-process-44c533ebf8da (accessed 19 March 2020)
2. https://link.springer.com/chapter/10.1007/978-3-642-27645-3_1 (accessed 19 March 2020)
3. http://karpathy.github.io/2016/05/31/rl/ (accessed 19 March 2020)
4. https://towardsdatascience.com/reinforcement-learning-rl-101-with-python-e1aa0d37d43b (accessed 20 March 2020)
5. https://deeplizard.com/learn/ (accessed 20 March 2020)