Reward Function and Environment Dynamic Class(es) for Single Agent Gym

A single agent gym (SingleAgent()), named mujoco_gym in the following, enables the training of a single agent with reinforcement learning. To instantiate a mujoco_gym, (1) the reward function and (2) the environment dynamic class(es) need to be specified, among other parameters. The following sections explain how these may be implemented.

1. Reward Function 

A reward function determines the reward an agent receives after an action in a gym. In our test_reward function, an agent receives a reward both for moving towards a target (1.1) and for moving at all in the mujoco_gym (1.2). The final reward returned by the function is the sum of these two.


1.1 Reward for moving towards a target

The variable distance (line 1) stores the distance between the "torso" and the "current_target". It is computed with the mujoco_gym function calculate_distance, which accepts either an array with the coordinates of an object or a string with the name of an object, in which case it reads out that object's coordinates.
Next, the if-else statement (lines 2-7) checks whether the dictionary data_store from the mujoco_gym already has the key "distance".
If not (lines 2-4), the entry "distance" is created with the value of the variable distance. The (new_)reward is zero in this case, as the agent has not yet attempted to move towards the "current_target".
Otherwise (lines 5-7), if "distance" already exists in the data_store, it represents the distance to the "current_target" before the last step. The difference between this former distance and the current distance is the (new_)reward. If the "torso" moved closer to the "current_target", the reward is positive. If the distance did not change, it is zero, and if the agent moved further away, it is negative. Then the data_store entry "distance" is updated with the current distance.
Lastly, the variable reward (line 8) stores the new_reward so that the reward from 1.2 can easily be added later on.

In [None]:
distance = mujoco_gym.calculate_distance("torso", mujoco_gym.data_store["current_target"])  # current distance to the target
if "distance" not in mujoco_gym.data_store:  # first call: no previous distance stored
    mujoco_gym.data_store["distance"] = distance
    new_reward = 0
else:
    new_reward = mujoco_gym.data_store["distance"] - distance  # positive if the torso got closer
    mujoco_gym.data_store["distance"] = distance  # remember for the next step
reward = new_reward
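
The sign behaviour described above can be checked without MuJoCo. The following is a minimal sketch, not the actual SingleAgent implementation: StubGym, its positions dictionary, and progress_reward are hypothetical names introduced only for illustration, and the Euclidean metric in calculate_distance is an assumption about how the real function measures distance.

```python
import numpy as np


class StubGym:
    """Hypothetical stand-in for mujoco_gym (not the real SingleAgent class)."""

    def __init__(self, positions):
        self.positions = {k: np.asarray(v, dtype=float) for k, v in positions.items()}
        self.data_store = {}

    def _resolve(self, obj):
        # Accept either an object name (str) or raw coordinates (array-like).
        if isinstance(obj, str):
            return self.positions[obj]
        return np.asarray(obj, dtype=float)

    def calculate_distance(self, a, b):
        # Assumption: Euclidean distance; the real metric may differ.
        return float(np.linalg.norm(self._resolve(a) - self._resolve(b)))


def progress_reward(gym):
    """Reward for moving towards the target, following the logic of 1.1."""
    distance = gym.calculate_distance("torso", gym.data_store["current_target"])
    if "distance" not in gym.data_store:
        new_reward = 0.0  # no previous distance to compare against
    else:
        new_reward = gym.data_store["distance"] - distance
    gym.data_store["distance"] = distance
    return new_reward


gym = StubGym({"torso": [0.0, 0.0, 0.0]})
gym.data_store["current_target"] = np.array([3.0, 4.0, 0.0])

first = progress_reward(gym)                         # first call, no history
gym.positions["torso"] = np.array([0.6, 0.8, 0.0])   # one unit closer to the target
closer = progress_reward(gym)
gym.positions["torso"] = np.array([0.0, 0.0, 0.0])   # back to the start
farther = progress_reward(gym)

print(first, closer, farther)
```

In this sketch the first call yields zero, the step towards the target yields a positive value, and the step away yields a negative value of the same magnitude.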

1.2 Reward for moving at all

Similarly to 1.1, the if-else statement (lines 1-9) checks whether the dictionary data_store from the mujoco_gym already has the key "last_position".
If not (lines 1-3), the entry "last_position" is created with a deepcopy of the coordinates (np.ndarray) of the "torso". The (new_)reward is zero in this case, as the agent has not yet attempted to move.
Otherwise (lines 4-9), if "last_position" already exists in the data_store, it holds the agent's position before the last step. The (new_)reward is the distance between the agent's last and its current position (line 5). It is calculated with the calculate_distance function from mujoco_gym, like the distance between the "torso" and the "current_target" in 1.1. Next, the data_store entry "last_position" is updated to the current position (line 6) so that it can be used in the next step. If the movement of the agent was too small, namely less than 0.08 (line 7), the sign of the (new_)reward is flipped so that the agent receives a negative reward (line 8). Finally, the (new_)reward is multiplied by a factor of 6 (line 9), a value that was found through reward shaping.

In [None]:
if "last_position" not in mujoco_gym.data_store:  # first call: no previous position stored
    mujoco_gym.data_store["last_position"] = copy.deepcopy(mujoco_gym.data.body("torso").xipos)  # needs "import copy"
    new_reward = 0
else:
    new_reward = mujoco_gym.calculate_distance("torso", mujoco_gym.data_store["last_position"])  # distance moved in the last step
    mujoco_gym.data_store["last_position"] = copy.deepcopy(mujoco_gym.data.body("torso").xipos)
    if new_reward < 0.08:  # movement too small
        new_reward = new_reward * -1  # turn it into a penalty
    new_reward = new_reward * 6  # scaling factor from reward shaping
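
To see how the threshold and the scaling factor interact, the shaping step can be isolated as a pure function of the distance moved. This is only a sketch: movement_reward and its parameter names are hypothetical, while the values 0.08 and 6 are taken from the code above.

```python
def movement_reward(step_length, threshold=0.08, scale=6.0):
    """Shaping from 1.2: flip the sign of small steps, then scale."""
    reward = step_length
    if reward < threshold:
        reward = -reward  # too little movement becomes a penalty
    return reward * scale


# Distances moved in one step: none, below, at, and above the threshold.
samples = [0.0, 0.05, 0.08, 0.2]
rewards = [movement_reward(s) for s in samples]

for s, r in zip(samples, rewards):
    print(f"moved {s:.2f} -> reward {r:+.2f}")
```

One consequence worth keeping in mind when tuning: with this formula an agent that does not move at all receives a reward of zero, whereas an agent that moves just under the threshold receives the largest penalty.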

Lastly, the return variable reward is the sum of the reward for moving towards the "current_target" and the new_reward for moving at all.

In [None]:
reward = reward + new_reward
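
As a worked example of the sum, the two terms can be combined for a single step given the previous distance, the current distance, and the length of the step. The function total_reward and its arguments are hypothetical and only restate the arithmetic of 1.1 and 1.2; it is not the signature of test_reward.

```python
def total_reward(previous_distance, current_distance, step_length):
    """Sum of the progress term (1.1) and the shaped movement term (1.2)."""
    progress = previous_distance - current_distance
    movement = -step_length if step_length < 0.08 else step_length
    return progress + 6 * movement


# Agent moves 0.1 straight towards the target: 0.1 progress + 6 * 0.1 movement.
towards = total_reward(2.0, 1.9, 0.1)

# Agent moves 0.05 towards the target: 0.05 progress - 6 * 0.05 penalty.
too_slow = total_reward(2.0, 1.95, 0.05)

print(towards, too_slow)
```

Because of the factor of 6, the movement term dominates the sum in both cases, so a small step towards the target still results in a negative total.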

2. Environment Dynamic Class(es)