## OpenAI Gym Distribution Center Environment for Multi-Agent Coordination to Deliver Packages Using Optimal Routes

#### Environment
- This OpenAI Gym environment splits the city into 8 regions by compass direction.
- Each region has one agent, which either delivers packages locally or hands them off to other regions.
- Each region is assigned packages to deliver locally or to hand off to the distribution centers of the other 7 regions.
- On every reset, the environment randomly generates a payload for each agent: for every region, 0 means no load and 1 means load.
- Each agent can move left, stay in the same location, or move right.
- On every action, a package is handed off to a remote region or delivered locally, depending on the action taken.
- Rewards are assigned per trip:
    - if the agent makes a successful hand-off or local delivery, a reward of 1 is assigned;
    - if the agent makes a trip but has no package to hand off or deliver locally, a reward of -1 is assigned.
- The task is considered done once all packages are delivered.
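
The rules above can be sketched in plain Python. This is a simplified illustration, not the environment's own code: the names `random_payloads`, `arrive`, and `all_delivered` are hypothetical, regions are indexed 0 to 7, and agent `i` is assumed to have region `i` as its home region.

```python
import random

N_REGIONS = 8  # one agent per region; agent i is assumed to live in region i


def random_payloads(rng):
    """One row per agent; entry r is 1 if the agent holds a package for region r."""
    return [[rng.randint(0, 1) for _ in range(N_REGIONS)] for _ in range(N_REGIONS)]


def arrive(payloads, agent, region):
    """Apply the delivery rule when `agent` arrives at `region`; return the reward."""
    if payloads[agent][region] == 0:
        return -1  # trip made with nothing to hand off or deliver
    payloads[agent][region] = 0
    if region != agent:
        # Hand-off: the receiving region's agent now holds the package
        # for local delivery.
        payloads[region][region] = 1
    return 1


def all_delivered(payloads):
    """The task is done once no agent holds any package."""
    return all(load == 0 for row in payloads for load in row)


# Usage: agent 2 hands one package to region 0, then agent 0 delivers it.
payloads = [[0] * N_REGIONS for _ in range(N_REGIONS)]
payloads[2][0] = 1
print(arrive(payloads, 2, 0))   # 1 (hand-off)
print(arrive(payloads, 0, 0))   # 1 (local delivery)
print(all_delivered(payloads))  # True
```

Keeping a hand-off as a new local package for the receiving agent means a remote package can earn two rewards in total, one for the hand-off and one for the final delivery.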

#### Regions and transitions (refer to action_state_transition.csv)
- The 8 regions and their indices are North-0, North East-1, East-2, South East-3, South-4, South West-5, West-6, North West-7.
- Action 0 moves the agent to the left (counterclockwise), e.g. an agent in North-0 that takes action 0 moves to North West-7.
- Action 1 keeps the agent in the same location, e.g. an agent in North-0 that takes action 1 stays in North-0.
- Action 2 moves the agent to the right (clockwise), e.g. an agent in North-0 that takes action 2 moves to North East-1.
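
The transition rules above amount to modular arithmetic on the region index. The sketch below is an assumption about what `action_state_transition.csv` encodes, since the file itself is not shown here; the column names `state`, `action`, and `newstate` are the ones `getNewLocation` reads, while `next_region` and `transition_rows` are hypothetical names.

```python
N_REGIONS = 8
REGION_NAMES = ["North", "North East", "East", "South East",
                "South", "South West", "West", "North West"]


def next_region(region, action):
    """Action 0 steps counterclockwise, 1 stays, 2 steps clockwise."""
    return (region + action - 1) % N_REGIONS


# Rows shaped like the state/action/newstate columns of the CSV
# (assumed layout: one row per region and action, 24 rows in total).
transition_rows = [
    {"state": s, "action": a, "newstate": next_region(s, a)}
    for s in range(N_REGIONS)
    for a in range(3)
]

for action in range(3):
    print(action, REGION_NAMES[next_region(0, action)])
# 0 North West
# 1 North
# 2 North East
```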

In [83]:
from IPython.display import Image

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [84]:
import time

import gym
from gym import spaces
import numpy as np
import pandas as pd

class MultiAgentEnv(gym.Env):

    def __init__(self):
        self.episode = 0
        
        ### State transition 
        self.state_transition = pd.read_csv('action_state_transition.csv')
        self.agents_n = 8
        self.start_time =  time.time()
        self.log_time = time.time()
        self.duration = 0
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Discrete(3*2*2*2*2*2*2*2*2)
        # reset() already assigns self.state, so a single call is enough.
        self.state = self.reset()
        
    # Actions: 0 = move left (counterclockwise), 1 = stay, 2 = move right (clockwise)
    def step(self, action):
        print_interval = 1*60

        print_log = (time.time() - self.log_time) >= print_interval
        if(print_log):
            self.log_time = time.time()
        self.episode = self.episode+1
        # update inventory 
        new_loc = []
        inventory = []
        for i in range(len(action)):
            new_loc.append(self.getNewLocation(self.state[i][0],action[i]))
            inventory.append(list(self.state[i])[1:].copy())

        #i = 0
        rewards = []
        for i in range(len(inventory)):
            #inventory[i][new_loc[i]-1] = 0 # current agents inventory unload on remote/home location. 
            # If the new location is the agent's home region, this is the final delivery, so no load is transferred.
            # If it is a remote region, the load is transferred to that region's agent (agent id equals location id).
            reward = 0
            if 0 <= new_loc[i]-1 < 8:
                if (new_loc[i]-1 != i and inventory[i][new_loc[i]-1] > 0): # remote delivery
                    #print('inventory before remote inventory update: {}'.format(inventory))
                    inventory[i][new_loc[i]-1] = 0
                    #print('remote delivery for agent: {} by agent {} : inventory: {}'.format(new_loc[i]-1,i,inventory))      
                    
                    #remote delivery for agent: 6 by agent 0            
                    inventory[new_loc[i]-1][new_loc[i]-1] = 1
                    #print('inventory after remote inventory update: {}'.format(inventory))
                    reward = 1
                elif inventory[i][new_loc[i]-1] > 0:
                    #print('local delivery by agent: {}'.format(i))
                    inventory[i][new_loc[i]-1] = 0
                    reward = 1
                else:
                    #print('No Delivery')
                    reward = -1
            elif inventory[i][new_loc[i]-1] > 0:
                #print('local delivery by agent: {}'.format(i))
                inventory[i][new_loc[i]-1] = 0
                reward = 1
            else:
                #print('No Delivery')
                reward = -1
            if(new_loc[i]-1 == i and sum(inventory[i])==0):
                reward = 0
            rewards.append(reward)

        new_inventory = inventory
        #print('new inventory after update: {}'.format(new_inventory))
        #print(rewards)
        print_log = False
        if(print_log):
            for k in range(len(rewards)):
                print("Tries : {}, Current Location :{} , Current Inventory : {} ,action: {}, New Location : {}, Inventory drop for new Location: {}, Reward: {}"
                      .format(self.episode,self.state[k][0],self.state[k][1:],action[k],new_loc[k],self.state[k][1:][new_loc[k]-1],rewards[k]))
        
        new_state = list(self.state)
        idx_of_new_loc_inventories = []
        # update state with new position and inventory
        for k in range(len(action)):
            new_state_list = list(new_state[k])
            new_state_list[0] = new_loc[k]
            new_state_list[1:] = new_inventory[k]
            new_state_tuple = tuple(new_state_list)
            new_state[k] = new_state_tuple
        
        self.state  = [s for s in new_state]
        done = False
        remaining_inv = 0
        for k in range(len(action)):
            remaining_inv = remaining_inv + sum(self.state[k][1:])
            
        if(remaining_inv == 0):
            done = True
  
        return self.state, rewards, done, {}
    # North = 0, North East = 1, East = 2, South East = 3, South = 4, South West = 5, West = 6 , North West = 7
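    def getNewLocation(self,currentLocation,action):
        # Select the row matching both the current location and the action with one combined mask.
        match = (self.state_transition['state']==currentLocation) & (self.state_transition['action']==action)
        return self.state_transition.loc[match,'newstate'].iat[0]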
       
    
    # reward -1 if there is no delivery
    # reward 1 if there is a delivery 
    # reward 0 if the agent stays in place with an empty inventory
    
    def get_rewards(self,newLocations,new_inventories,states):
        rewards = []
        for i in range(len(states)):
            reward = 0
            new_inventory=new_inventories[i]
            old_inventory=states[i][1:]
            old_location = states[i][0]
            #print('agent : {} old and new inventory : {}, {}'.format(i,new_inventory,old_inventory))
            #print('agent : {} sum of old and new inventory : {}, {}'.format(i,sum(new_inventory),sum(old_inventory)))
            if(sum(new_inventory)<sum(old_inventory)):
                reward = 1
            else:
                reward = -1
            if(newLocations[i]==old_location and sum(old_inventory) == 0):
                reward = 0
            rewards.append(reward)
        return rewards
    def get_random_action(self):
        # np.random.randint excludes the upper bound, so use 3 to sample actions 0, 1 and 2.
        return [np.random.randint(0,3) for i in range(self.agents_n)]

    def reset(self):
    
        self.state = self.get_random_agent_states()
        return self.state
    #returns random states for each agent
    def get_random_agent_states(self):
        states = []
        for i in range(self.agents_n):
            state = (i,
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2),
                     np.random.randint(0,2))
            states.append(state)
        return states
        
    
    def render(self, mode="human"):
        print("state :{} ".format(self.state))
        
    def close(self):
        print("close")

In [86]:
## Test scenario: a sample state and the sequence of steps required to deliver all the packages.
# The state contains 8 rows, one per agent.
# Each agent's state has 9 values:
# 
#[current location, 
#load for region0, load for region1, 
#load for region2, load for region3, 
#load for region4, load for region5, 
#load for region6, load for region7]


state = [(0, 0, 0, 0, 1, 0, 0, 0, 1), 
         (1, 0, 1, 1, 1, 0, 0, 0, 1),
         (2, 1, 0, 0, 1, 1, 0, 0, 1),
         (3, 1, 0, 1, 0, 0, 0, 1, 0),
         (4, 0, 0, 0, 0, 0, 1, 0, 0), 
         (5, 1, 0, 0, 0, 0, 0, 0, 0), 
         (6, 1, 0, 0, 1, 0, 0, 1, 0), 
         (7, 0, 1, 1, 1, 1, 0, 1, 1)]

# Multiple steps are needed to complete the task.
# Each step contains one action per agent, 8 in total.
# Action 0 means move left, 1 means stay, 2 means move right.
steps = [[0, 2, 0, 0, 2, 0, 2, 0], [2, 2, 0, 2, 2, 0, 2, 0],
         [2, 2, 2, 0, 0, 0, 2, 0], [1, 0, 2, 0, 2, 0, 2, 0], 
         [1, 0, 2, 0, 2, 0, 2, 0], [2, 0, 2, 0, 0, 0, 2, 0], 
         [2, 2, 2, 0, 2, 0, 2, 0], [2, 0, 2, 0, 0, 2, 2, 0],
         [0, 0, 2, 0, 0, 0, 2, 0]]

expected_reward = -1

env = MultiAgentEnv()
env.state = state
total_rewards = 0

for idx,step in enumerate(steps):
    print('Step {}'.format(idx))
    print('===========================================')
    print('Old State: {}'.format(env.state))
    print('Actions : {}'.format(step))
    state, rewards, done, _ = env.step(step)
    print('Average rewards : {} Individual Rewards: {}'.format(sum(rewards)/len(rewards),rewards))
    print('New State {} '.format(state))
    print('Task Complete: {} \n'.format(done))
    total_rewards = total_rewards + sum(rewards)/len(rewards)
print('Total Rewards : {}'.format(total_rewards))

Step 0
Old State: [(0, 0, 0, 0, 1, 0, 0, 0, 1), (1, 0, 1, 1, 1, 0, 0, 0, 1), (2, 1, 0, 0, 1, 1, 0, 0, 1), (3, 1, 0, 1, 0, 0, 0, 1, 0), (4, 0, 0, 0, 0, 0, 1, 0, 0), (5, 1, 0, 0, 0, 0, 0, 0, 0), (6, 1, 0, 0, 1, 0, 0, 1, 0), (7, 0, 1, 1, 1, 1, 0, 1, 1)]
Actions : [0, 2, 0, 0, 2, 0, 2, 0]
Average rewards : -0.25 Individual Rewards: [-1, 1, 1, -1, -1, -1, 1, -1]
New State [(7, 1, 0, 0, 1, 0, 0, 0, 1), (2, 0, 0, 1, 1, 0, 0, 0, 1), (1, 0, 0, 0, 1, 1, 0, 0, 1), (2, 1, 0, 1, 0, 0, 0, 1, 0), (5, 0, 0, 0, 0, 0, 1, 0, 0), (4, 1, 0, 0, 0, 0, 0, 0, 0), (7, 1, 0, 0, 1, 0, 0, 0, 0), (6, 0, 1, 1, 1, 1, 0, 1, 1)] 
Task Complete: False 

Step 1
Old State: [(7, 1, 0, 0, 1, 0, 0, 0, 1), (2, 0, 0, 1, 1, 0, 0, 0, 1), (1, 0, 0, 0, 1, 1, 0, 0, 1), (2, 1, 0, 1, 0, 0, 0, 1, 0), (5, 0, 0, 0, 0, 0, 1, 0, 0), (4, 1, 0, 0, 0, 0, 0, 0, 0), (7, 1, 0, 0, 1, 0, 0, 0, 0), (6, 0, 1, 1, 1, 1, 0, 1, 1)]
Actions : [2, 2, 0, 2, 2, 0, 2, 0]
Average rewards : 0.5 Individual Rewards: [1, 1, 1, 1, 1, -1, -1, 1]
New State [(0, 1, 