In [1]:
from openai import OpenAI
import yaml
import os
import numpy as np
import pandas as pd

In [2]:
def file_to_string(filename):
    with open(filename, 'r') as file:
        return file.read()

In [3]:
with open('./gpt/key.yaml', 'r') as stream:
    config = yaml.safe_load(stream)

client = OpenAI(api_key=config['OPENAI_API_KEY'])

In [4]:
def gpt_interaction(system_string, user_string):
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview", # gpt-4-1106-preview, gpt-4-0613, gpt-4-32k, gpt-3.5-turbo-1106
        messages=[
        {"role": "system", "content": system_string},
        {"role": "user", "content": user_string}
        ]
    )

    answer = completion.choices[0].message.content
    print(answer)

    return answer

In [5]:
def save_string_to_file(save_path, string_file):
    with open(save_path, 'w') as file:
        file.write(string_file)

# Playing with curriculum and reflection

In [6]:
curriculum_answer = gpt_interaction(file_to_string('./gpt/prompt/test/curriculum_system.txt'), file_to_string('./gpt/prompt/test/curriculum_user.txt'))

save_string_to_file(save_path="./gpt/prompt/test/curriculum.md", string_file=curriculum_answer)

Task 1 Name
Learn to Stand

Task 1 Description
Before hopping, the agent must learn to balance and stand upright. Starting from a variety of positions within the initial state space, the goal is to achieve and maintain an upright torso with minimal movement. The agent should apply torques to all joints to prevent the hopper from falling over, prioritizing minimal energy expenditure. Success is measured by maintaining a z height of the torso above a threshold (such as 1.2 times the initial z height) and keeping the torso angle within a limited range for a duration without falling or exceeding energy thresholds.

Task 2 Name
Static Balance on One Foot

Task 2 Description
The agent needs to learn how to balance statically on its foot. This task involves controlling the torques on the three hinges without any hopping, such that the hopper maintains an upright position with zero velocity while withstanding perturbations (small external forces applied to the body parts). Success is measured 

In [7]:
reward_system = file_to_string('./gpt/prompt/test/reward_system.txt')
reward_user = file_to_string('./gpt/prompt/test/curriculum_user.txt') + file_to_string('./gpt/prompt/test/curriculum.md')

reward_functions = gpt_interaction(reward_system, reward_user)

Learn to Stand
Before hopping, the agent must learn to balance and stand upright. Starting from a variety of positions within the initial state space, the goal is to achieve and maintain an upright torso with minimal movement. The agent should apply torques to all joints to prevent the hopper from falling over, prioritizing minimal energy expenditure. Success is measured by maintaining a z height of the torso above a threshold (such as 1.2 times the initial z height) and keeping the torso angle within a limited range for a duration without falling or exceeding energy thresholds.

```python
def compute_reward(observation, action, next_observation) -> Tuple[np.float64, Dict[str, np.float64]]:
    height_threshold = 1.2 * 1.25  # 1.25 is the initial z height
    angle_tolerance = 0.05  # small angle tolerance around upright

    z_height = next_observation[1]
    torso_angle = next_observation[2]
    control_cost = self.control_cost(action)

    # Reward for maintaining height above thres
```

In [8]:
reflection_system = file_to_string('./gpt/prompt/test/reflection_system.txt')
reflection_env_code = file_to_string('./gpt/prompt/test/reflection_user.txt')
reflection_task_description = file_to_string('./gpt/prompt/test/reflection_task.txt')
reflection_learning_curve = file_to_string('./gpt/prompt/test/reflection_learning_curve.txt')
reflection_reason = file_to_string('./gpt/prompt/test/reflection_reason.txt')

reflection_user = reflection_env_code + reflection_task_description + reflection_learning_curve + reflection_reason

In [9]:
reflection_answer = gpt_interaction(reflection_system, reflection_user)

save_string_to_file(save_path="./gpt/prompt/test/reflection.md", string_file=reflection_answer)

Reason for fixing reward function: The learning curve demonstrates volatile progression with a steep drop in reward value, indicating potential instability in training or exploitation of the current reward function. The drop signifies that the reward function might not be robust against behaviors that are detrimental to the task goal. Additionally, the `hop_reward` and `healthy_reward` components max out easily, leaving no gradient for the agent to improve upon. The `forward_reward` plateaus quickly, while the `control_cost` has a relatively minor impact on behavior adjustment. To address these issues, we need a reward function that consistently incentivizes forward motion and hopping without saturation and provides a more balanced consideration of control costs.

```python
def compute_reward(observation, action, next_observation) -> Tuple[np.float64, Dict[str, np.float64]]:
    x_velocity = (next_observation[0] - observation[0]) / 0.05
    z_velocity = (next_observation[1] - observati
```

# Playing with Trajectory Feedback

In [10]:
trajectory_feedback_system = file_to_string('./gpt/prompt/test/trajectory_system.txt')
trajectory_feedback_env_code = file_to_string('./gpt/prompt/test/trajectory_user.txt')
trajectory_feedback_task = file_to_string('./gpt/prompt/test/trajectory_task.txt')
trajectory_1 = file_to_string('./gpt/prompt/test/simple_hopping_observation.txt')
trajectory_2 = file_to_string('./gpt/prompt/test/move_forward_observation.txt')

trajectory_feedback_user = trajectory_feedback_env_code + trajectory_feedback_task + "Trajectory of the agent 1: \n" + trajectory_1 + "\nTrajectory of the agent 2: \n" + trajectory_2

In [11]:
decision = gpt_interaction(trajectory_feedback_system, trajectory_feedback_user)

Decision: Agent 2
Reason: The task's goal is to maintain vertical jumping and landing without moving in the x-direction. Agent 1's trajectory shows substantial movement along the x-axis, which is evidenced by the consistently increasing x coordinate values in the trajectory data, indicating forward movement contrary to the task description. Conversely, Agent 2's trajectory data shows the x coordinate remaining around zero, which implies better adherence to remaining in the same horizontal place while hopping.
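The reply above starts with a `Decision: Agent N` line, so the choice can be extracted programmatically rather than read by eye. The following is a minimal sketch under that assumption; `parse_decision` is a hypothetical helper, not part of the original notebook, and it returns `None` when the model does not follow the expected format.

```python
import re
from typing import Optional


def parse_decision(response: str) -> Optional[int]:
    """Return the agent number from a 'Decision: Agent N' line, or None if absent."""
    # MULTILINE lets ^ match at the start of any line in the reply
    match = re.search(r"^Decision:\s*Agent\s+(\d+)", response, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Hypothetical reply text, shaped like the output printed above
example = "Decision: Agent 2\nReason: The x coordinate stays near zero."
print(parse_decision(example))  # prints 2
```

Calling `parse_decision(decision)` on the string returned by `gpt_interaction` would then give an integer that downstream code can branch on.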


# Environment-specific feedback

In [13]:
obs = np.loadtxt('./gpt/prompt/test/stand_still_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

stand_still_info = "\nAverage x position: " + str(x_pos_avg) + "\nStandard deviation of x position: " + str(x_pos_std) + "\nAverage z position: " + str(z_pos_avg) + "\nStandard deviation of z position: " + str(z_pos_std) + "\nAverage x velocity: " + str(x_vel_avg) + "\nStandard deviation of x velocity: " + str(x_vel_std) + "\nAverage z velocity: " + str(z_vel_avg) + "\nStandard deviation of z velocity: " + str(z_vel_std)

In [14]:
obs = np.loadtxt('./gpt/prompt/test/simple_hopping_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

simple_hopping_info = "\nAverage x position: " + str(x_pos_avg) + "\nStandard deviation of x position: " + str(x_pos_std) + "\nAverage z position: " + str(z_pos_avg) + "\nStandard deviation of z position: " + str(z_pos_std) + "\nAverage x velocity: " + str(x_vel_avg) + "\nStandard deviation of x velocity: " + str(x_vel_std) + "\nAverage z velocity: " + str(z_vel_avg) + "\nStandard deviation of z velocity: " + str(z_vel_std)

In [15]:
obs = np.loadtxt('./gpt/prompt/test/move_forward_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

move_forward_info = "\nAverage x position: " + str(x_pos_avg) + "\nStandard deviation of x position: " + str(x_pos_std) + "\nAverage z position: " + str(z_pos_avg) + "\nStandard deviation of z position: " + str(z_pos_std) + "\nAverage x velocity: " + str(x_vel_avg) + "\nStandard deviation of x velocity: " + str(x_vel_std) + "\nAverage z velocity: " + str(z_vel_avg) + "\nStandard deviation of z velocity: " + str(z_vel_std)
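The three cells above repeat the same computation for different observation files. One way to factor that out is sketched below; `summarize_trajectory` and `DT` are hypothetical names, and the sketch assumes, as in the cells above, that column 0 is the x position, column 1 is the z position, and the time step is 0.008.

```python
import numpy as np

DT = 0.008  # time step between consecutive observations, as used in the cells above


def summarize_trajectory(obs: np.ndarray, dt: float = DT) -> str:
    """Build the position/velocity summary string for one trajectory."""
    x_pos, z_pos = obs[:, 0], obs[:, 1]
    # Finite-difference velocities: (p[t+1] - p[t]) / dt
    x_vel = np.diff(x_pos) / dt
    z_vel = np.diff(z_pos) / dt

    stats = [
        ("x position", x_pos),
        ("z position", z_pos),
        ("x velocity", x_vel),
        ("z velocity", z_vel),
    ]
    lines = []
    for name, values in stats:
        lines.append(f"Average {name}: {np.mean(values)}")
        lines.append(f"Standard deviation of {name}: {np.std(values)}")
    # Leading newline matches the format of the hand-built strings
    return "\n" + "\n".join(lines)
```

With this helper, each cell could reduce to something like `stand_still_info = summarize_trajectory(np.loadtxt(path, delimiter=','))`, where `path` is the corresponding observation file.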

In [16]:
statistics_feedback_system = file_to_string('./gpt/prompt/test/trajectory_system.txt')
statistics_feedback_env_code = file_to_string('./gpt/prompt/test/trajectory_user.txt')
statistics_feedback_task = file_to_string('./gpt/prompt/test/trajectory_task.txt')

statistics_feedback_user = statistics_feedback_env_code + statistics_feedback_task + "Trajectory information of the agent 1: \n" + stand_still_info + "\nTrajectory information of the agent 2: \n" + simple_hopping_info

In [17]:
feedback_txt = gpt_interaction(statistics_feedback_system, statistics_feedback_user)

Decision: Agent 1
Reason: The task description specifies that the goal is to maintain vertical jumping and landing without progressing in the x-direction. Agent 1 has an average x position and x velocity of 0.0, with correspondingly zero standard deviations for x position and x velocity. This indicates that Agent 1 is successfully maintaining its position without moving horizontally. Additionally, Agent 1's average z position and small standard deviation for z position show consistent vertical movement within a limited range, which aligns with the goal of vertical jumping and landing in place.

In contrast, Agent 2 has a non-zero average x position and a significantly larger standard deviation for the x position, which indicates horizontal movement away from the starting x position. The standard deviation of x velocity is also substantial, illustrating variability in horizontal movement, further indicating that Agent 2 is not stationary in the x-direction.

Furthermore, the much larger

In [18]:
print(stand_still_info)
print(simple_hopping_info)
print(move_forward_info)


Average x position: 0.0
Standard deviation of x position: 0.0
Average z position: 1.2101798201798204
Standard deviation of z position: 0.0033100869613864286
Average x velocity: 0.0
Standard deviation of x velocity: 0.0
Average z velocity: 0.0
Standard deviation of z velocity: 0.18540496217739175

Average x position: 0.21748251748251748
Standard deviation of x position: 0.05654607400669276
Average z position: 1.4222977022977024
Standard deviation of z position: 0.16082117967756113
Average x velocity: -7.105427357601002e-18
Standard deviation of x velocity: 0.9826622003516774
Average z velocity: 0.0
Standard deviation of z velocity: 1.4437364717980912

Average x position: 1.0750218340611355
Standard deviation of x position: 0.9117063154529592
Average z position: 1.3292576419213973
Standard deviation of z position: 0.16504033910927382
Average x velocity: 1.6995614035087723
Standard deviation of x velocity: 1.124478351067322
Average z velocity: -0.2905701754385964
Standard deviation of z 

# Environment-Specific Curriculum

In [23]:
curriculum_system = file_to_string('./gpt/prompt/AntMaze_UMaze/env_specific_curriculum_system.txt')
curriculum_user = file_to_string('./gpt/prompt/AntMaze_UMaze/env_specific_curriculum_user.txt')

curriculum_answer = gpt_interaction(curriculum_system, curriculum_user)

Task 1 
Basic Locomotion
Task 1 
Maximize torso_velocity

Task 2 
Orientation Control
Task 2 
Maintain torso_orientation as a value of [1.0, 0.0, 0.0, 0.0]

Task 3 
Goal Orientation
Task 3 
Minimize goal_distance

Task 4 
Navigation and Turning
Task 4 
Maximize torso_coordinate while changing torso_orientation to face multiple random goal_pos without actual reaching

Task 5
Original Task
Task 5
Maintain goal_distance as a value of [0.45]


In [24]:
save_string_to_file(save_path="./gpt/prompt/AntMaze_UMaze/env_specific_curriculum.md", string_file=curriculum_answer)

In [25]:
import re

# Adjusting the regular expression pattern to match the new format
pattern = r"(Task \d+[\s\S]+?)(?=Task \d+|$)"

# Using regex to find all matches and split the string
tasks = re.findall(pattern, curriculum_answer)

tasks

['Task 1 \nBasic Locomotion\n',
 'Task 1 \nMaximize torso_velocity\n\n',
 'Task 2 \nOrientation Control\n',
 'Task 2 \nMaintain torso_orientation as a value of [1.0, 0.0, 0.0, 0.0]\n\n',
 'Task 3 \nGoal Orientation\n',
 'Task 3 \nMinimize goal_distance\n\n',
 'Task 4 \nNavigation and Turning\n',
 'Task 4 \nMaximize torso_coordinate while changing torso_orientation to face multiple random goal_pos without actual reaching\n\n',
 'Task 5\nOriginal Task\n',
 'Task 5\nMaintain goal_distance as a value of [0.45]']

In [26]:
task_length = len(tasks) // 2  # each task yields two chunks: a name and a description
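Because the regex returns the name and the description of each task as separate, alternating chunks, it can be convenient to pair them up explicitly. The sketch below assumes the alternating layout shown in the output above; `pair_tasks` is a hypothetical helper and would need adjusting if the model changes its output format.

```python
from typing import Dict, List


def pair_tasks(chunks: List[str]) -> List[Dict[str, str]]:
    """Pair alternating (name, description) chunks into one record per task."""
    paired = []
    # Even indices hold names, odd indices hold descriptions
    for name_chunk, desc_chunk in zip(chunks[0::2], chunks[1::2]):
        header, _, name = name_chunk.partition("\n")
        _, _, description = desc_chunk.partition("\n")
        paired.append({
            "task": header.strip(),
            "name": name.strip(),
            "description": description.strip(),
        })
    return paired


# Two chunks copied from the regex output above
chunks = ['Task 1 \nBasic Locomotion\n', 'Task 1 \nMaximize torso_velocity\n\n']
print(pair_tasks(chunks))
# prints [{'task': 'Task 1', 'name': 'Basic Locomotion', 'description': 'Maximize torso_velocity'}]
```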

In [27]:
reward_system = file_to_string('./gpt/prompt/AntMaze_UMaze/env_specific_reward_system.txt')
reward_user = file_to_string('./gpt/prompt/AntMaze_UMaze/env_specific_reward_user.txt')

for task_idx in range(int(task_length)):
    # Each task spans two consecutive chunks: its name, then its description
    task_text = tasks[2*task_idx] + tasks[2*task_idx+1]
    reward_user_itr = reward_user + "\n" + task_text
    # Sample five candidate reward functions per task
    for i in range(5):
        reward_functions = gpt_interaction(reward_system, reward_user_itr)

        save_string_to_file(save_path=f"./gpt/prompt/AntMaze_UMaze/env_specific_reward_task_{task_idx}_sample_{i}.md", string_file=reward_functions)

Task 1 Name
Basic Locomotion

Task 1 Description
Encourage the ant to maximize its forward velocity to achieve locomotion. We aim to increase the speed of the ant's torso in the direction of the maze's target goal. 

```python
def basic_locomotion_reward(self, ant_obs) -> Tuple[np.float64, Dict[str, np.float64]]:
    # Get the torso velocity in the direction of movement (x-axis)
    torso_vel = self.torso_velocity(ant_obs)
    forward_velocity = torso_vel[0]  # Assuming that x-axis is the forward direction

    # Reward component based on the forward velocity of the torso (to encourage forward movement)
    velocity_reward_weight = 1.0
    velocity_reward = np.tanh(forward_velocity)
    
    reward = velocity_reward_weight * velocity_reward
    
    # Dictionary of individual components with their corresponding rewards
    reward_components = {
        'velocity_reward': velocity_reward
    }
    
    return reward, reward_components
```
Task 1 Name:
Basic Locomotion

Task 1 Descriptio

# Finding useful statistics

In [35]:
system = file_to_string('./gpt/prompt/test/statistics_system.txt')
user = file_to_string('./gpt/prompt/test/statistics_user.txt')

statistics_answer = gpt_interaction(system, user)

Required variables: [x_velocity_of_the_torso, healthy_reward, control_cost, z_coordinate_of_the_torso, angle_of_the_torso]
[x_velocity_of_the_torso]: [Maximized]
[healthy_reward]: [Maximized]
[control_cost]: [Minimized]
[z_coordinate_of_the_torso]: [Close to [0.7]]
[angle_of_the_torso]: [Close to [0.0]]

Reason:
The x velocity of the torso should be maximized in order to achieve the highest forward speed.

The healthy reward should be maximized because it indicates that the hopper is maintaining its height above 0.7 and keeping the torso angle within the desired range, which corresponds to being upright.

Control cost should be minimized to ensure that the actions taken by the hopper, in terms of torques applied to hinges, are not excessively large, leading to more efficient and controlled hops.

The z coordinate of the torso should be close to 0.7, as it directly correlates with the constraint mentioned in the healthy reward description, implying the hopper is airborne or maintaining 

In [19]:
def is_healthy(observation):
    """Return 1 if the observation meets the Hopper health conditions, else 0."""
    z, angle = observation[1:3]
    state = observation[2:]

    min_state, max_state = (-100, 100)
    min_z, max_z = (0.7, np.inf)
    min_angle, max_angle = (-0.2, 0.2)

    healthy_state = np.all(np.logical_and(min_state < state, state < max_state))
    healthy_z = min_z < z < max_z
    healthy_angle = min_angle < angle < max_angle

    # Use a distinct local name so the function name is not shadowed
    healthy = all((healthy_state, healthy_z, healthy_angle))

    return int(healthy)
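The same check can be applied to a whole trajectory at once instead of row by row. The sketch below is a vectorized variant under the same thresholds and the same column layout (z at index 1, torso angle at index 2); `healthy_fraction` and its parameters are hypothetical names introduced for illustration.

```python
import numpy as np


def healthy_fraction(obs, min_z=0.7, angle_range=(-0.2, 0.2), state_range=(-100.0, 100.0)):
    """Return the fraction of rows in obs that satisfy all three health conditions."""
    z, angle = obs[:, 1], obs[:, 2]
    state = obs[:, 2:]

    # Every remaining state entry must lie strictly inside state_range
    healthy_state = np.all((state_range[0] < state) & (state < state_range[1]), axis=1)
    # Torso height must exceed min_z (and be finite)
    healthy_z = (min_z < z) & (z < np.inf)
    # Torso angle must lie strictly inside angle_range
    healthy_angle = (angle_range[0] < angle) & (angle < angle_range[1])

    return float(np.mean(healthy_state & healthy_z & healthy_angle))


# Synthetic rows: one healthy, one too low, one tilted too far
demo = np.array([
    [0.0, 1.25, 0.0, 0.1],
    [0.0, 0.50, 0.0, 0.1],
    [0.0, 1.25, 0.5, 0.1],
])
print(healthy_fraction(demo))
```

On real data this should agree with averaging `is_healthy` over the rows, which is how the following cells compute `healthy_avg`.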

In [20]:
obs = np.loadtxt('./gpt/prompt/test/stand_still_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

# Get average of the is_healthy
healthy = np.array([is_healthy(observation) for observation in obs])
healthy_avg = np.mean(healthy)

stand_still_info = "\nAverage x position: " + str(x_pos_avg) + "\nStandard deviation of x position: " + str(x_pos_std) + "\nAverage z position: " + str(z_pos_avg) + "\nStandard deviation of z position: " + str(z_pos_std) + "\nAverage x velocity: " + str(x_vel_avg) + "\nStandard deviation of x velocity: " + str(x_vel_std) + "\nAverage z velocity: " + str(z_vel_avg) + "\nStandard deviation of z velocity: " + str(z_vel_std) + "\nAverage is healthy: " + str(healthy_avg)

In [21]:
obs = np.loadtxt('./gpt/prompt/test/simple_hopping_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

# Get average of the is_healthy
healthy = np.array([is_healthy(observation) for observation in obs])
healthy_avg = np.mean(healthy)

simple_hopping_info = "\nAverage x position: " + str(x_pos_avg) + "\nStandard deviation of x position: " + str(x_pos_std) + "\nAverage z position: " + str(z_pos_avg) + "\nStandard deviation of z position: " + str(z_pos_std) + "\nAverage x velocity: " + str(x_vel_avg) + "\nStandard deviation of x velocity: " + str(x_vel_std) + "\nAverage z velocity: " + str(z_vel_avg) + "\nStandard deviation of z velocity: " + str(z_vel_std) + "\nAverage is healthy: " + str(healthy_avg)

In [22]:
assistant = statistics_answer

user_2 = file_to_string('./gpt/prompt/test/statistics_user_2.txt')
user_2 = user_2 + "\nAgent 1" + stand_still_info + "\nAgent 2" + simple_hopping_info

completion = client.chat.completions.create(
    model="gpt-4-1106-preview", # gpt-4-1106-preview, gpt-4-0613, gpt-4-32k, gpt-3.5-turbo-1106
    messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": user},
    {"role": "assistant", "content": assistant},
    {"role": "user", "content": user_2},
    ]
)

print(completion.choices[0].message.content)

Decision: Agent 1
Reason: 

Agent 1 exhibits no horizontal movement as the average and standard deviation of x position and velocity are both zero, which perfectly adheres to the task requirement of not moving forward or backward. Furthermore, Agent 1 has a high average z position, more consistent hopping height (lower standard deviation of z position), and maintains the healthy status throughout the trajectory. While the standard deviation of z velocity is non-zero, which indicates variability in the hopping speed, there's no displacement in the x-direction, so the trajectory meets the task requirements effectively. In contrast, Agent 2 has shown drift in the x-direction as indicated by the non-zero average x position and significant standard deviation of x position, suggesting it's not strictly hopping in place. The high standard deviation in x velocity for Agent 2 also demonstrates inconsistency in maintaining a stationary hop. Even though Agent 2 has a higher average z position, th
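The last cell builds a four-message conversation by hand (system, user, assistant, user). A small helper can assemble such a history for any number of turns. This is a sketch only: `build_messages` is a hypothetical name, and it assumes the turns strictly alternate starting with a user message, as they do in the cell above.

```python
from typing import Dict, List


def build_messages(system: str, turns: List[str]) -> List[Dict[str, str]]:
    """Build a chat history from a system prompt and alternating user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for i, content in enumerate(turns):
        # Even positions are user turns, odd positions are assistant replies
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": content})
    return messages


# Placeholder strings stand in for the prompt files used in the notebook
history = build_messages("system prompt", ["first question", "first answer", "follow-up"])
print([m["role"] for m in history])
# prints ['system', 'user', 'assistant', 'user']
```

The resulting list has the same shape as the `messages` argument passed to `client.chat.completions.create` in the cell above, for example `build_messages(system, [user, assistant, user_2])`.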