In [8]:
import import_ipynb
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset
import pickle

### Base classes for an **AI Agent** having the following architecture:
1.1 An **AI Agent** operates in a '**World**' that calls it via '**action<-act(state)**' requests and '**reward(reward)**' notifications. (Worlds are typically wrappers around traditional RL environments; Worlds can also wrap supervised learning tasks.) The goal of the AI Agent is to maximise *long-term steady-state average reward*. Periodically the World may also notify the Agent of the completion of an *episode* (e.g. an RL episode or the completion of an epoch).

1.2 An Agent has **Controller**, **Perception**, **Memory** and **Actor** components (in line with LeCun's "Architecture of an Autonomous AI Agent"; a World Model is to be added later). The Memory contains a Perceptual Memory as well as a State-Action-Reward memory. The Actor includes a **Model**. A Model includes a **Network** and a **Trainer** (a class that publishes the training procedure(s) for the Network). Overall orchestration of all components, including the Agent's public interface, is handled by the Controller. Further, the learning schedule used to train the Network, be it only initially, periodically, or continually online, is decided by the Controller.

1.3 Each component of an Agent may be customised for a specific World by inheriting from the default base class for that component. The Agent's *reset* method resets the required components and the time counter; the *clear* method clears the applicable components (e.g. removes all stored entries).
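
The calling convention in 1.1 can be sketched as a small driver loop. Both `ToyWorld` and `ConstantAgent` below are hypothetical stand-ins, not part of this notebook; the only thing taken from the text is the *act* / *reward* / *episode* interface. The average computed at the end is a plain mean over the collected rewards, used here as a rough proxy for the long-term average reward.

```python
class ConstantAgent:
    """Hypothetical stand-in exposing the same public interface as an Agent."""
    def __init__(self, action="noop"):
        self.action = action
        self.rewards = []
        self.episodes = 0
    def act(self, world_state):
        return self.action
    def reward(self, world_reward):
        self.rewards.append(world_reward)
    def episode(self):
        self.episodes += 1


class ToyWorld:
    """Hypothetical World: emits integer states and rewards the action 'noop'."""
    def __init__(self, steps_per_episode=3):
        self.steps_per_episode = steps_per_episode
    def run(self, agent, n_episodes=2):
        for _ in range(n_episodes):
            for step in range(self.steps_per_episode):
                action = agent.act(step)                        # action <- act(state)
                agent.reward(1.0 if action == "noop" else 0.0)  # reward(reward)
            agent.episode()                                     # end-of-episode notification


agent_stub = ConstantAgent()
ToyWorld().run(agent_stub)
average_reward = sum(agent_stub.rewards) / len(agent_stub.rewards)
```

Any object with these three methods can be driven the same way, so the `AIAgent` defined in the next cell could be passed to `run` in place of the stub.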

In [82]:
class AIAgent():
    def __init__(self,controller=None,perception=None,memory=None,actor=None):
        if controller is not None: self.controller=controller
        else: self.controller=Controller()
        if perception is not None: self.perception=perception
        else: self.perception=Perception()
        if memory is not None: self.memory=memory
        else: self.memory=Memory()
        if actor is not None: self.actor=actor
        else: self.actor=Actor()
        self.reset()
        self.controller.parent=self
        self.perception.parent=self
        self.memory.parent=self
        self.actor.parent=self
    def act(self,world_state):
        world_action = self.controller.act(world_state)
        # check to see if network needs training - TBDesigned
        self.time+=1
        return world_action
    def reward(self,world_reward):
        return self.controller.reward(world_reward)
    def episode(self):
        return self.controller.episode()
    def reset(self):
        ## TBD may need to do more
        self.time=0
    def clear(self):
        self.memory.clear()
        self.reset()

2.1 Control flow goes as follows: the Agent tracks a *time* counter. The World calls *act*(world-state) on the Agent, which is routed to the Controller's *act*. The incoming *world-state* is mapped to a percept using the *perceive_state* function published by the Perception module, and stored in the perceptual memory (via Memory's *add_percept*, against the current *time* and *key*, the latter extracted from *world-state*, e.g. a ticker, ticker-date or task-id; the default *key* is 'default'). ~~Note: the incoming world-state is stored as the current time's current-percept as well as the next-percept for the previous time step, for (each) incoming key.~~ 

2.2 The Actor reads from the perceptual memory and creates an actor-state by processing the current percept (and possibly also using prior rewards and actions, e.g. for meta-RL). The Actor also updates the state-action-reward memory with the previous time's percept after mapping it to an actor-state.

2.2.1 Note: if multiple actions are required at a given time, the Agent is subclassed to override *act* so that it calls the Controller's *act* multiple times and returns the resulting set of actions together. (This will be needed in the case of tradeserver.)

2.3 The Actor calls its Network to decide the action to return. Before returning the action, it is stored in the perceptual memory; a new entry is also created in the state-action-reward memory with the current actor-state and action. Also, the action is mapped to a world_action using the Perception component's *action_to_world* function.

2.4 Before completing the *act* (or *reward*) flow, the Agent checks whether any periodic or online training of the Network is needed. It also updates the *time* counter.

2.5 On receiving a *world_reward* from the World, the Agent passes it to the Controller, which extracts a *key* and *reward* using Perception's *perceive_reward* function. These are stored in the perceptual memory (for the prior time step, since by now the Agent's time counter has already been advanced, as soon as its action was completed) as well as appended to the latest entry (previous time step) of the state-action-reward memory.
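
The time bookkeeping described in 2.1 to 2.5 can be traced with a stripped-down sketch. `TimingSketch` is a hypothetical class that keeps only the counter and a state-action-reward dictionary; it omits Perception and keys entirely and, unlike the Controller below, does not create an entry for time -1 on the first call.

```python
class TimingSketch:
    """Hypothetical, stripped-down agent that keeps only the time bookkeeping."""
    def __init__(self):
        self.time = 0
        self.sar = {}   # time -> {'state', 'action', 'reward', 'next_state'}

    def act(self, state):
        if self.time - 1 in self.sar:                 # close the previous transition
            self.sar[self.time - 1]["next_state"] = state
        self.sar[self.time] = {"state": state, "action": "default_action"}
        self.time += 1                                # counter advances after acting
        return "default_action"

    def reward(self, reward):
        # The counter has already advanced, so the reward belongs to time - 1.
        self.sar[self.time - 1]["reward"] = reward


sketch = TimingSketch()
sketch.act("s0")
sketch.reward(100)
sketch.act("s1")
sketch.reward(200)
```

After these four calls the entry for time 0 holds a complete transition, while the entry for time 1 has a reward but is still waiting for its `next_state`.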

In [113]:
class Controller():
    def __init__(self):
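        self.parent=None
    def episode(self):
        # Placeholder: AIAgent.episode() delegates here; the default Controller does nothing.
        return None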
    def act(self,world_state):
        percept,key=self.parent.perception.perceive_state(world_state)
        print(percept)
        self.parent.memory.add_percept(percept,key,self.parent.time)
        print(self.parent.memory.perceptual_memory)
        actor_state=self.parent.actor.create_actor_state(self.parent.time,key)
        print(actor_state)
        self.parent.memory.update_next_state(actor_state,self.parent.time-1)
        print(self.parent.memory.sar_memory)
        action=self.parent.actor.call_model(actor_state)
        print(action)
        self.parent.memory.add_state_action(actor_state,action,self.parent.time)
        print(self.parent.memory.sar_memory)
        world_action=self.parent.perception.action_to_world(action)
        return world_action
    def reward(self,world_reward):
        reward,key=self.parent.perception.perceive_reward(world_reward)
        self.parent.memory.update_reward_perceptual(reward,self.parent.time-1)
        self.parent.memory.update_reward_sar(reward,self.parent.time-1)
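
Section 1.2 leaves the learning schedule (initial, periodic, or continually online) to the Controller, and the check itself is still marked as to be designed. One possible shape for that decision is sketched below. The function `training_due`, its mode names and its `period` / `warmup` parameters are all assumptions made for illustration, not part of the notebook.

```python
def training_due(time, mode="periodic", period=100, warmup=100):
    """Hypothetical helper: decide whether the Network should be trained now.

    mode='initial'  -> train once, when `warmup` steps have been collected
    mode='periodic' -> train every `period` steps
    mode='online'   -> train after every step
    """
    if mode == "initial":
        return time == warmup
    if mode == "periodic":
        return time > 0 and time % period == 0
    if mode == "online":
        return time > 0
    raise ValueError(f"unknown training mode: {mode!r}")


# Steps 1..10 with training every 5 steps.
due_steps = [t for t in range(1, 11) if training_due(t, mode="periodic", period=5)]
```

A Controller subclass could call such a helper with the Agent's *time* counter at the end of *act* or *reward* and invoke the Trainer when it returns `True`.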

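3.1 ***Memory*** stores are nested dictionaries indexed by *time* and *key*. Each entry is a dictionary with keys *'percept','action','reward'* (perceptual memory) or *'state','action','reward','next_state'* (state-action-reward memory). In the code below, the state-action-reward memory is indexed by *time* only, and rewards in the perceptual memory are stored at the *time* level rather than under a *key*.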

In [114]:
class Memory():
    def __init__(self):
        self.parent=None
        self.clear()
    def clear(self):
        self.perceptual_memory={}
        self.sar_memory={}
    def add_percept(self,percept,key,time):
        if time in self.perceptual_memory: self.perceptual_memory[time][key]={'percept':percept}
        else: self.perceptual_memory[time]={key:{'percept':percept}}
    def update_next_state(self,actor_state,time):
        if time in self.sar_memory: self.sar_memory[time]['next_state']=actor_state
        else: self.sar_memory[time]={'next_state':actor_state}
    def add_state_action(self,actor_state,action,time):
        if time in self.sar_memory: 
            self.sar_memory[time]['state']=actor_state
            self.sar_memory[time]['action']=action
        else: self.sar_memory[time]={'state':actor_state,'action':action}
    def update_reward_perceptual(self,reward,time):
        self.perceptual_memory[time]['reward']=reward
    def update_reward_sar(self,reward,time):
        self.sar_memory[time]['reward']=reward
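
As an illustration of how the state-action-reward store might be consumed, the sketch below pulls out only the entries that already have all four fields. `complete_transitions` is a hypothetical helper, not a method of Memory; the sample dictionary mirrors the shape that the methods above produce after two *act* calls and one *reward* call.

```python
def complete_transitions(sar_memory):
    """Hypothetical helper: return (state, action, reward, next_state) tuples
    for every time step whose entry has all four fields, in time order."""
    fields = ("state", "action", "reward", "next_state")
    return [
        tuple(entry[f] for f in fields)
        for t, entry in sorted(sar_memory.items())
        if all(f in entry for f in fields)
    ]


sar_memory = {
    -1: {"next_state": "world_state_1"},                         # side effect of the first act call
    0: {"state": "world_state_1", "action": "default_action",
        "reward": 100, "next_state": "world_state_2"},
    1: {"state": "world_state_2", "action": "default_action"},   # still open
}
transitions = complete_transitions(sar_memory)
```

Only the entry for time 0 qualifies here; the entries for -1 and 1 are skipped because they are incomplete.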

3.2 The default **Perception** class just copies world states/actions/rewards to actor states/actions/rewards. Should be subclassed for a given World.

In [115]:
class Perception():
    def __init__(self):
        self.parent=None
    def perceive_state(self,world_state):
        return world_state,'default'
    def action_to_world(self,action):
        return action
    def perceive_reward(self,reward):
        return reward,'default'
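
A World-specific subclass might look like the sketch below. It assumes, purely for illustration, a World whose states and rewards are dictionaries carrying a `'ticker'` field; `TickerPerception` and the field names `'price'`, `'pnl'` and `'order'` are hypothetical. The base class is repeated in minimal form only so that the snippet runs on its own.

```python
class Perception:
    """Minimal copy of the default Perception, repeated so the snippet runs alone."""
    def __init__(self):
        self.parent = None
    def perceive_state(self, world_state):
        return world_state, "default"
    def action_to_world(self, action):
        return action
    def perceive_reward(self, reward):
        return reward, "default"


class TickerPerception(Perception):
    """Hypothetical subclass for a World whose messages are dicts carrying a 'ticker'."""
    def perceive_state(self, world_state):
        return world_state["price"], world_state["ticker"]   # (percept, key)
    def action_to_world(self, action):
        return {"order": action}
    def perceive_reward(self, reward):
        return reward["pnl"], reward["ticker"]               # (reward, key)


perception = TickerPerception()
percept, key = perception.perceive_state({"ticker": "ABC", "price": 101.5})
world_action = perception.action_to_world("buy")
reward, reward_key = perception.perceive_reward({"ticker": "ABC", "pnl": -0.25})
```

Such a subclass would be handed to the Agent through the constructor, e.g. `AIAgent(perception=TickerPerception())`.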

3.3 The default **Actor** has no Model and returns a fixed action (which can be set). It copies the percept from perceptual memory directly into the actor_state. For a given World, the class should be subclassed and the methods *percept_to_state*, *create_actor_state* and/or *call_model* overridden.

In [116]:
class Actor():
    def __init__(self,model=None):
        self.parent=None
        self.model=model  # None means this Actor has no Model
        self.default_action='default_action'
    def create_actor_state(self,time,key):
        # Return None unless a percept is stored for this time and key.
        memory=self.parent.memory.perceptual_memory
        if time not in memory or key not in memory[time]: return None
        if 'percept' not in memory[time][key]: return None
        return self.percept_to_state(memory[time][key]['percept'])
    def percept_to_state(self,percept):
        return percept
    def call_model(self,actor_state):
        return self.default_action
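
The two hooks most likely to be overridden are *percept_to_state* and *call_model*. The sketch below shows a rule-based subclass with no Network at all. `ThresholdActor`, its `threshold` parameter and the action names `'up'` / `'down'` are hypothetical, and the base class is a minimal copy of the hooks above so that the snippet runs on its own.

```python
class Actor:
    """Minimal copy of the default Actor's overridable hooks, so the snippet runs alone."""
    def __init__(self):
        self.parent = None
        self.default_action = "default_action"
    def percept_to_state(self, percept):
        return percept
    def call_model(self, actor_state):
        return self.default_action


class ThresholdActor(Actor):
    """Hypothetical Actor: shifts a numeric percept by a threshold and acts on its sign."""
    def __init__(self, threshold=0.0):
        super().__init__()
        self.threshold = threshold
    def percept_to_state(self, percept):
        return float(percept) - self.threshold   # actor-state: distance from the threshold
    def call_model(self, actor_state):
        if actor_state is None:                  # no percept stored for this time/key
            return self.default_action
        return "up" if actor_state > 0 else "down"


actor = ThresholdActor(threshold=10.0)
actions = [actor.call_model(actor.percept_to_state(p)) for p in (12, 7.5)]
```

An Actor that wraps a trained Network would follow the same pattern, with *call_model* forwarding the actor-state to the Network instead of applying a fixed rule.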

In [117]:
agent=AIAgent()

In [124]:
agent.act('world_state_2')
agent.reward(200)

world_state_2
{0: {'default': {'percept': 'world_state_1'}, 'reward': 100}, 1: {'default': {'percept': 'world_state_2'}}}
world_state_2
{-1: {'next_state': 'world_state_1'}, 0: {'state': 'world_state_1', 'action': 'default_action', 'reward': 100, 'next_state': 'world_state_2'}}
default_action
{-1: {'next_state': 'world_state_1'}, 0: {'state': 'world_state_1', 'action': 'default_action', 'reward': 100, 'next_state': 'world_state_2'}, 1: {'state': 'world_state_2', 'action': 'default_action'}}


In [129]:
agent.memory.perceptual_memory

{}

In [130]:
agent.memory.sar_memory

{}

In [121]:
agent.reward(100)

In [128]:
agent.clear()

In [131]:
agent.time

2