# Analysis of Actor Behaviour

Eysenbach et al. (2023) argue that their contrastive representation learning yields representations that encode actionable distances. More specifically, they claim that the distance between a state-action representation and a goal representation corresponds to the probability of reaching that goal when taking that action in that state. If this holds, the trained encoders could be used for successful control without an actor network: the representations of all state-action pairs available in the current state could be evaluated, and the action whose representation lies closest to the goal representation could be chosen greedily.

In the first part of this experiment, I want to investigate how the actor network behaves. Does it act in accordance with this greedy selection, or does it employ a more complex strategy?
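
To make the greedy rule concrete, here is a minimal sketch. It assumes the representations are plain vectors and that "closest" means smallest Euclidean distance; both are assumptions for illustration, and `greedy_action` is a hypothetical helper, not part of `experiment_utils`. If the encoders were trained with an inner-product similarity, the selection would instead take the largest inner product.

```python
import math
from typing import Sequence


def greedy_action(sa_reprs: Sequence[Sequence[float]],
                  goal_repr: Sequence[float]) -> int:
    """Return the index of the action whose state-action representation
    has the smallest Euclidean distance to the goal representation."""
    distances = [math.dist(r, goal_repr) for r in sa_reprs]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy example with made-up numbers: 4 actions, 2-dimensional representations.
sa_reprs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.5, 0.0]]
goal_repr = [1.0, 0.9]
best = greedy_action(sa_reprs, goal_repr)
print(best)
```

In the experiment, one row of `sa_reprs` would come from the state-action encoder for each available action, and `goal_repr` from the goal encoder.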

## Loading Encoders & Actor
Again, we first need to load the encoders and the actor that were trained with contrastive RL.

In [None]:
import experiment_utils as utils

env = "Spiral11x11"
sa_encoder, g_encoder, actor = utils.load_trained_networks(env)

## Loading Environment and States
Then, we collect all the states of the environment.

In [None]:
states = utils.get_all_env_states(env)

## Sampling Tasks (start states and goals)
To evaluate how the greedy selection strategy compares to the actor, we need to sample a few tasks.

Each task consists of a start state and a goal state.

In [None]:
tasks = utils.sample_tasks(states, n=1000)
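
What `sample_tasks` does internally is not shown here. A minimal sketch of one plausible scheme, assuming start and goal are drawn uniformly at random and must differ, could look as follows. The function name `sample_tasks_sketch` and the fixed seed are assumptions made for illustration.

```python
import random


def sample_tasks_sketch(states, n, seed=0):
    """Draw n (start, goal) pairs uniformly at random, with start != goal.

    `states` must be a sequence containing at least two distinct states.
    """
    rng = random.Random(seed)  # fixed seed so the task set is reproducible
    tasks = []
    for _ in range(n):
        start, goal = rng.sample(states, 2)  # two different positions
        tasks.append((start, goal))
    return tasks


# Toy state space: cells of a 3x3 grid (not the actual Spiral11x11 layout).
toy_states = [(row, col) for row in range(3) for col in range(3)]
toy_tasks = sample_tasks_sketch(toy_states, n=5)
```

Fixing the seed keeps the task set identical for every selection strategy, so any difference in the metrics comes from the strategies and not from the tasks.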

## Recording Actor Behaviour on Sampled Tasks
Now we record the actor's performance on the sampled tasks. We record the following metrics as averages across all tasks:
- success rate
- number of steps needed
- time needed

In [None]:
actor_trajectories, actor_metrics = utils.evaluate_actor(actor, tasks)
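
The averaging over tasks can be sketched as follows. The `EpisodeResult` record and the `aggregate` function are hypothetical and only illustrate how the three metrics could be computed; the actual return format of `evaluate_actor` may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    success: bool   # was the goal reached?
    steps: int      # number of environment steps taken
    seconds: float  # wall-clock time spent on the task


def aggregate(results):
    """Average success, step count and time across all tasks."""
    return {
        "success_rate": mean(1.0 if r.success else 0.0 for r in results),
        "mean_steps": mean(r.steps for r in results),
        "mean_seconds": mean(r.seconds for r in results),
    }


# Made-up results for four tasks.
toy_results = [
    EpisodeResult(True, 10, 0.5),
    EpisodeResult(False, 50, 2.0),
    EpisodeResult(True, 12, 0.5),
    EpisodeResult(True, 8, 1.0),
]
toy_metrics = aggregate(toy_results)
```

Note that in this sketch failed tasks enter the step and time averages with whatever values they accumulated before being cut off, so the step limit influences these averages.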

## Computing Greedy Behaviour on Sampled Tasks
Instead of using the actor, we now greedily select the action whose state-action representation is closest to the goal representation.

The recorded metrics are the same.

In [None]:
greedy_trajectories, greedy_metrics = utils.evaluate_encoder_selection("greedy", tasks)

## Comparing Actor & Greedy Behaviour
We now compare the obtained metrics and also compute the degree of agreement between actor selection and greedy selection, i.e. the percentage of identical decisions in identical situations (same state in same task).
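
One way to compute such an agreement score is sketched below. It assumes each trajectory is a list of `(state, action)` pairs with hashable states, that both lists of trajectories are ordered by task, and that both policies are deterministic. These are assumptions for illustration; `agreement_sketch` is a hypothetical stand-in for `compute_agreement`.

```python
def agreement_sketch(trajectories_a, trajectories_b):
    """Fraction of identical decisions over all (task, state) situations
    that occur in both sets of trajectories."""
    matches = 0
    shared = 0
    for steps_a, steps_b in zip(trajectories_a, trajectories_b):
        actions_a = dict(steps_a)  # state -> action within this task
        actions_b = dict(steps_b)
        for state in actions_a.keys() & actions_b.keys():
            shared += 1
            matches += actions_a[state] == actions_b[state]
    # Without any shared situation the agreement is undefined.
    return matches / shared if shared else float("nan")


# Made-up trajectories for two tasks.
toy_actor = [[("s0", "right"), ("s1", "up")], [("s3", "left")]]
toy_greedy = [[("s0", "right"), ("s1", "right"), ("s2", "up")], [("s3", "left")]]
toy_agreement = agreement_sketch(toy_actor, toy_greedy)
```

A limitation of this measure is that two trajectories visit different states once they diverge, so later decisions are not compared. Querying the greedy rule at every state on the actor's trajectories would avoid this.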

In [None]:
utils.compare_metrics(("actor", "greedy"), (actor_metrics, greedy_metrics))
agreement = utils.compute_agreement(actor_trajectories, greedy_trajectories)


# Going Beyond Greedy Selection

The previous experiment can have two possible outcomes:
1. The actor behaves greedily, always choosing the action for which the state-action representation is closest to the goal representation.
2. The actor does not behave greedily.

The first result would support the claim that the distance between representations does correspond to the probability of reaching the goal in the future. In this case, it could be interesting to see whether selection strategies with an explicitly longer decision horizon can improve on greedy selection (which would challenge the probability assumption).

The second result would challenge the probability assumption and show that the actor learns a more complex strategy than greedy selection. In this case, it would be interesting to see which other strategy models the actor's behaviour better.

In both cases, we need a number of more complex selection strategies that take future effects into account better. Possible candidates are:
- Monte Carlo Tree Search
- Limited Breadth First Search
- Limited Depth First Search
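
As a rough illustration of the last candidate, the following sketch performs an exhaustive depth-limited (depth-first) lookahead. It assumes a known, deterministic transition function `step(state, action)` and a function `sa_distance(state, action)` that returns the distance between the state-action representation and the goal representation. Both are hypothetical placeholders, and the toy tables contain made-up values.

```python
def lookahead_action(state, actions, step, sa_distance, depth):
    """Return the first action of the action sequence of length `depth`
    whose final state-action pair is closest to the goal representation.
    With depth=1 this reduces to greedy selection."""

    def best_distance(s, a, remaining):
        if remaining == 1:
            return sa_distance(s, a)
        next_state = step(s, a)
        return min(best_distance(next_state, b, remaining - 1) for b in actions)

    return min(actions, key=lambda a: best_distance(state, a, depth))


# Toy problem in which greedy selection is misled: going left from A looks
# better at first, but only the path through C gets close to the goal G.
transitions = {
    ("A", "l"): "B", ("A", "r"): "C",
    ("B", "l"): "B", ("B", "r"): "B",
    ("C", "l"): "C", ("C", "r"): "G",
    ("G", "l"): "G", ("G", "r"): "G",
}
distances = {
    ("A", "l"): 1.0, ("A", "r"): 2.0,
    ("B", "l"): 1.0, ("B", "r"): 1.0,
    ("C", "l"): 2.0, ("C", "r"): 0.0,
    ("G", "l"): 0.0, ("G", "r"): 0.0,
}


def toy_step(s, a):
    return transitions[(s, a)]


def toy_distance(s, a):
    return distances[(s, a)]


greedy_choice = lookahead_action("A", ["l", "r"], toy_step, toy_distance, depth=1)
deeper_choice = lookahead_action("A", ["l", "r"], toy_step, toy_distance, depth=2)
```

The cost grows exponentially with `depth`, which is why the depth has to be limited. If the representation distances really encode reachability, deeper lookahead should rarely change the greedy choice.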




## Loading Encoders & Actor

## Defining Algorithms

## Loading Environment & States

## Sampling Tasks (start states and goals)

## Recording Actor & Algorithm Behaviour on Sampled Tasks

## Comparing Performance of Actor & Algorithms