In [1]:
from openai import OpenAI
import yaml
import os
import numpy as np
import pandas as pd

In [2]:
def file_to_string(filename):
    """Return the full contents of a text file as a single string."""
    with open(filename, 'r') as file:
        return file.read()

In [3]:
with open('./gpt/key.yaml', 'r') as stream:
    config = yaml.safe_load(stream)

client = OpenAI(api_key=config['OPENAI_API_KEY'])
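
If `key.yaml` lacks the entry, falling back to the `OPENAI_API_KEY` environment variable is one option. The sketch below is illustrative only: `resolve_api_key` is a hypothetical helper, not part of the notebook or of the `openai` package, and it takes the already-parsed config so it runs without file or network access.

```python
import os


def resolve_api_key(config, env=None):
    """Hypothetical helper: pick the API key from the parsed YAML config,
    falling back to the OPENAI_API_KEY environment variable."""
    env = os.environ if env is None else env
    # yaml.safe_load returns None for an empty file, so guard against that.
    key = (config or {}).get("OPENAI_API_KEY") or env.get("OPENAI_API_KEY")
    if not key:
        raise KeyError("OPENAI_API_KEY not found in config or environment")
    return key


# Example with placeholder values (not real keys).
print(resolve_api_key({"OPENAI_API_KEY": "from-yaml"}, env={}))
print(resolve_api_key(None, env={"OPENAI_API_KEY": "from-env"}))
```

The result could then be passed on as `OpenAI(api_key=resolve_api_key(config))`.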

In [4]:
def gpt_interaction(system_string, user_string):
    """Send one system/user message pair to the chat model; print and return the reply text."""
    completion = client.chat.completions.create(
        # Alternative model names: gpt-4-0613, gpt-4-32k, gpt-3.5-turbo-1106
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": system_string},
            {"role": "user", "content": user_string},
        ],
    )

    reply = completion.choices[0].message.content
    print(reply)

    return reply

In [5]:
def save_string_to_file(save_path, string_file):
    """Write the string `string_file` to `save_path`, overwriting any existing file."""
    with open(save_path, 'w') as file:
        file.write(string_file)

# Playing with curriculum and reflection

In [6]:
curriculum_answer = gpt_interaction(
    file_to_string('./gpt/prompt/test/curriculum_system.txt'),
    file_to_string('./gpt/prompt/test/curriculum_user.txt'),
)

save_string_to_file(save_path="./gpt/prompt/test/curriculum.md", string_file=curriculum_answer)

Task 1 Name
Learn to Stand

Task 1 Description
Before hopping, the agent must learn to balance and stand upright. Starting from a variety of positions within the initial state space, the goal is to achieve and maintain an upright torso with minimal movement. The agent should apply torques to all joints to prevent the hopper from falling over, prioritizing minimal energy expenditure. Success is measured by maintaining a z height of the torso above a threshold (such as 1.2 times the initial z height) and keeping the torso angle within a limited range for a duration without falling or exceeding energy thresholds.

Task 2 Name
Static Balance on One Foot

Task 2 Description
The agent needs to learn how to balance statically on its foot. This task involves controlling the torques on the three hinges without any hopping, such that the hopper maintains an upright position with zero velocity while withstanding perturbations (small external forces applied to the body parts). Success is measured 

In [7]:
reward_system = file_to_string('./gpt/prompt/test/reward_system.txt')
reward_user = file_to_string('./gpt/prompt/test/curriculum_user.txt') + file_to_string('./gpt/prompt/test/curriculum.md')

reward_functions = gpt_interaction(reward_system, reward_user)

Learn to Stand
Before hopping, the agent must learn to balance and stand upright. Starting from a variety of positions within the initial state space, the goal is to achieve and maintain an upright torso with minimal movement. The agent should apply torques to all joints to prevent the hopper from falling over, prioritizing minimal energy expenditure. Success is measured by maintaining a z height of the torso above a threshold (such as 1.2 times the initial z height) and keeping the torso angle within a limited range for a duration without falling or exceeding energy thresholds.

```python
def compute_reward(observation, action, next_observation) -> Tuple[np.float64, Dict[str, np.float64]]:
    height_threshold = 1.2 * 1.25  # 1.25 is the initial z height
    angle_tolerance = 0.05  # small angle tolerance around upright

    z_height = next_observation[1]
    torso_angle = next_observation[2]
    control_cost = self.control_cost(action)

    # Reward for maintaining height above thres

In [8]:
reflection_system = file_to_string('./gpt/prompt/test/reflection_system.txt')
reflection_env_code = file_to_string('./gpt/prompt/test/reflection_user.txt')
reflection_task_description = file_to_string('./gpt/prompt/test/reflection_task.txt')
reflection_learning_curve = file_to_string('./gpt/prompt/test/reflection_learning_curve.txt')
reflection_reason = file_to_string('./gpt/prompt/test/reflection_reason.txt')

reflection_user = reflection_env_code + reflection_task_description + reflection_learning_curve + reflection_reason

In [9]:
reflection_answer = gpt_interaction(reflection_system, reflection_user)

save_string_to_file(save_path="./gpt/prompt/test/reflection.md", string_file=reflection_answer)

Reason for fixing reward function: The learning curve demonstrates volatile progression with a steep drop in reward value, indicating potential instability in training or exploitation of the current reward function. The drop signifies that the reward function might not be robust against behaviors that are detrimental to the task goal. Additionally, the `hop_reward` and `healthy_reward` components max out easily, leaving no gradient for the agent to improve upon. The `forward_reward` plateaus quickly, while the `control_cost` has a relatively minor impact on behavior adjustment. To address these issues, we need a reward function that consistently incentivizes forward motion and hopping without saturation and provides a more balanced consideration of control costs.

```python
def compute_reward(observation, action, next_observation) -> Tuple[np.float64, Dict[str, np.float64]]:
    x_velocity = (next_observation[0] - observation[0]) / 0.05
    z_velocity = (next_observation[1] - observati

# Playing with trajectory feedback

In [10]:
trajectory_feedback_system = file_to_string('./gpt/prompt/test/trajectory_system.txt')
trajectory_feedback_env_code = file_to_string('./gpt/prompt/test/trajectory_user.txt')
trajectory_feedback_task = file_to_string('./gpt/prompt/test/trajectory_task.txt')
trajectory_1 = file_to_string('./gpt/prompt/test/simple_hopping_observation.txt')
trajectory_2 = file_to_string('./gpt/prompt/test/move_forward_observation.txt')

trajectory_feedback_user = (
    trajectory_feedback_env_code
    + trajectory_feedback_task
    + "Trajectory of the agent 1: \n" + trajectory_1
    + "\nTrajectory of the agent 2: \n" + trajectory_2
)
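
The prompt above is built by plain string concatenation, so two sections run together whenever a file does not end with a newline. A minimal sketch of a joining helper follows; `assemble_prompt` and the sample strings are hypothetical, and the blank-line separator is an assumption about what the prompt files expect.

```python
def assemble_prompt(*sections):
    """Hypothetical helper: join prompt sections with one blank line between them."""
    # Strip surrounding whitespace so the separator is the only gap,
    # and skip sections that are empty after stripping.
    cleaned = [s.strip() for s in sections if s and s.strip()]
    return "\n\n".join(cleaned)


# Placeholder strings standing in for the contents of the prompt files.
prompt = assemble_prompt(
    "ENV CODE",
    "TASK DESCRIPTION\n",
    "Trajectory of the agent 1:\n0.0, 1.25",
    "",
    "Trajectory of the agent 2:\n0.1, 1.30",
)
print(prompt)
```

Changing the separator changes the text the model sees, so replies may differ from those recorded in this notebook.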

In [11]:
decision = gpt_interaction(trajectory_feedback_system, trajectory_feedback_user)

Decision: Agent 2
Reason: The task's goal is to maintain vertical jumping and landing without moving in the x-direction. Agent 1's trajectory shows substantial movement along the x-axis, which is evidenced by the consistently increasing x coordinate values in the trajectory data, indicating forward movement contrary to the task description. Conversely, Agent 2's trajectory data shows the x coordinate remaining around zero, which implies better adherence to remaining in the same horizontal place while hopping.
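
The reply follows a `Decision: Agent N` / `Reason: ...` layout, so the verdict could be extracted programmatically. The sketch assumes that layout holds for every reply (the prompt files that would enforce it are not shown here); `parse_decision` is a hypothetical name and the sample reply is shortened for illustration.

```python
import re


def parse_decision(reply):
    """Hypothetical helper: return the chosen agent number, or None if absent."""
    # Matches a line such as "Decision: Agent 2" (case-insensitive).
    match = re.search(
        r"^\s*Decision:\s*Agent\s*(\d+)", reply, re.IGNORECASE | re.MULTILINE
    )
    return int(match.group(1)) if match else None


reply = "Decision: Agent 2\nReason: The x coordinate stays around zero."
print(parse_decision(reply))          # 2
print(parse_decision("No verdict."))  # None
```

Returning `None` instead of raising lets the caller decide whether to re-query the model when the reply does not match.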


# Environment-specific feedback

In [13]:
obs = np.loadtxt('./gpt/prompt/test/stand_still_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

stand_still_info = (
    "\nAverage x position: " + str(x_pos_avg)
    + "\nStandard deviation of x position: " + str(x_pos_std)
    + "\nAverage z position: " + str(z_pos_avg)
    + "\nStandard deviation of z position: " + str(z_pos_std)
    + "\nAverage x velocity: " + str(x_vel_avg)
    + "\nStandard deviation of x velocity: " + str(x_vel_std)
    + "\nAverage z velocity: " + str(z_vel_avg)
    + "\nStandard deviation of z velocity: " + str(z_vel_std)
)
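
Because `x_vel` and `z_vel` are first differences of position divided by a fixed step, their mean telescopes to the net displacement over the elapsed time: `(x[-1] - x[0]) / ((n - 1) * dt)`. An average velocity of zero therefore only says that the trajectory ends where it started, not that the agent stayed still; the standard deviation carries that information. The sketch below checks the identity on a made-up trajectory (the numbers are illustrative, not taken from the observation files).

```python
import numpy as np

dt = 0.008  # same step as in the cell above

# Made-up x positions: the agent drifts forward and returns to its start.
x_pos = np.array([0.0, 0.1, 0.3, 0.2, 0.0])

x_vel = np.diff(x_pos) / dt  # identical to (x_pos[1:] - x_pos[:-1]) / dt
mean_vel = np.mean(x_vel)
net_vel = (x_pos[-1] - x_pos[0]) / ((len(x_pos) - 1) * dt)

print(np.isclose(mean_vel, net_vel))  # True
print(np.std(x_vel) > 0)              # True: the agent did move
```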

In [14]:
obs = np.loadtxt('./gpt/prompt/test/simple_hopping_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

simple_hopping_info = (
    "\nAverage x position: " + str(x_pos_avg)
    + "\nStandard deviation of x position: " + str(x_pos_std)
    + "\nAverage z position: " + str(z_pos_avg)
    + "\nStandard deviation of z position: " + str(z_pos_std)
    + "\nAverage x velocity: " + str(x_vel_avg)
    + "\nStandard deviation of x velocity: " + str(x_vel_std)
    + "\nAverage z velocity: " + str(z_vel_avg)
    + "\nStandard deviation of z velocity: " + str(z_vel_std)
)

In [15]:
obs = np.loadtxt('./gpt/prompt/test/move_forward_observation.txt', delimiter=',')

# Get average and std of the x position
x_pos = obs[:, 0]
x_pos_avg = np.mean(x_pos)
x_pos_std = np.std(x_pos)

# Get average and std of the z position
z_pos = obs[:, 1]
z_pos_avg = np.mean(z_pos)
z_pos_std = np.std(z_pos)

# Get average and std of x velocity
x_pos_next = obs[1:, 0]
x_vel = (x_pos_next - x_pos[:-1]) / 0.008
x_vel_avg = np.mean(x_vel)
x_vel_std = np.std(x_vel)

# Get average and std of z velocity
z_pos_next = obs[1:, 1]
z_vel = (z_pos_next - z_pos[:-1]) / 0.008
z_vel_avg = np.mean(z_vel)
z_vel_std = np.std(z_vel)

move_forward_info = (
    "\nAverage x position: " + str(x_pos_avg)
    + "\nStandard deviation of x position: " + str(x_pos_std)
    + "\nAverage z position: " + str(z_pos_avg)
    + "\nStandard deviation of z position: " + str(z_pos_std)
    + "\nAverage x velocity: " + str(x_vel_avg)
    + "\nStandard deviation of x velocity: " + str(x_vel_std)
    + "\nAverage z velocity: " + str(z_vel_avg)
    + "\nStandard deviation of z velocity: " + str(z_vel_std)
)
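
The three cells above repeat the same computation on different files. One way to consolidate them is sketched below; `summarize_trajectory` is a hypothetical helper, and it assumes, as the cells do, that column 0 holds the x position, column 1 the z position, and that consecutive observations are 0.008 apart.

```python
import numpy as np


def summarize_trajectory(obs, dt=0.008):
    """Hypothetical helper: build the statistics string used in the cells above.

    Assumes column 0 of `obs` is the x position and column 1 the z position,
    sampled every `dt`.
    """
    x_pos, z_pos = obs[:, 0], obs[:, 1]
    # Finite-difference velocities between consecutive observations.
    x_vel = np.diff(x_pos) / dt
    z_vel = np.diff(z_pos) / dt

    lines = []
    for label, values in [
        ("x position", x_pos),
        ("z position", z_pos),
        ("x velocity", x_vel),
        ("z velocity", z_vel),
    ]:
        lines.append("Average " + label + ": " + str(np.mean(values)))
        lines.append("Standard deviation of " + label + ": " + str(np.std(values)))
    # Leading newline matches the format of the original strings.
    return "\n" + "\n".join(lines)


# Tiny made-up trajectory: x fixed at 0, z constant at 1.25.
demo = np.array([[0.0, 1.25], [0.0, 1.25], [0.0, 1.25]])
print(summarize_trajectory(demo))
```

With such a helper, each cell would reduce to a single call on the array returned by `np.loadtxt(..., delimiter=',')`.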

In [16]:
statistics_feedback_system = file_to_string('./gpt/prompt/test/trajectory_system.txt')
statistics_feedback_env_code = file_to_string('./gpt/prompt/test/trajectory_user.txt')
statistics_feedback_task = file_to_string('./gpt/prompt/test/trajectory_task.txt')

statistics_feedback_user = (
    statistics_feedback_env_code
    + statistics_feedback_task
    + "Trajectory information of the agent 1: \n" + stand_still_info
    + "\nTrajectory information of the agent 2: \n" + simple_hopping_info
)

In [17]:
feedback_txt = gpt_interaction(statistics_feedback_system, statistics_feedback_user)

Decision: Agent 1
Reason: The task description specifies that the goal is to maintain vertical jumping and landing without progressing in the x-direction. Agent 1 has an average x position and x velocity of 0.0, with correspondingly zero standard deviations for x position and x velocity. This indicates that Agent 1 is successfully maintaining its position without moving horizontally. Additionally, Agent 1's average z position and small standard deviation for z position show consistent vertical movement within a limited range, which aligns with the goal of vertical jumping and landing in place.

In contrast, Agent 2 has a non-zero average x position and a significantly larger standard deviation for the x position, which indicates horizontal movement away from the starting x position. The standard deviation of x velocity is also substantial, illustrating variability in horizontal movement, further indicating that Agent 2 is not stationary in the x-direction.

Furthermore, the much larger

In [18]:
print(stand_still_info)
print(simple_hopping_info)
print(move_forward_info)


Average x position: 0.0
Standard deviation of x position: 0.0
Average z position: 1.2101798201798204
Standard deviation of z position: 0.0033100869613864286
Average x velocity: 0.0
Standard deviation of x velocity: 0.0
Average z velocity: 0.0
Standard deviation of z velocity: 0.18540496217739175

Average x position: 0.21748251748251748
Standard deviation of x position: 0.05654607400669276
Average z position: 1.4222977022977024
Standard deviation of z position: 0.16082117967756113
Average x velocity: -7.105427357601002e-18
Standard deviation of x velocity: 0.9826622003516774
Average z velocity: 0.0
Standard deviation of z velocity: 1.4437364717980912

Average x position: 1.0750218340611355
Standard deviation of x position: 0.9117063154529592
Average z position: 1.3292576419213973
Standard deviation of z position: 0.16504033910927382
Average x velocity: 1.6995614035087723
Standard deviation of x velocity: 1.124478351067322
Average z velocity: -0.2905701754385964
Standard deviation of z 