<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_092.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Perform imitation learning with behavior cloning. Use the CartPole environment from `gymnasium` as a base. The process should involve:
1.  Collecting expert trajectories.
2.  Training a policy via supervised learning using these trajectories.
3.  Evaluating the trained policy.

## Install necessary libraries

### Subtask:
Install necessary libraries. We will need to install `gymnasium` for the CartPole environment and `tensorflow` for the neural network.


**Reasoning**:
The first step is to install the necessary libraries using pip.



In [2]:
%pip install gymnasium tensorflow



## Collect expert trajectories

### Subtask:
Collect expert trajectories. We'll create an "expert" policy and use it to collect trajectories from the CartPole environment.


**Reasoning**:
The first step is to import the necessary libraries and create the environment.



In [1]:
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

**Reasoning**:
Now I will define the expert policy, collect 100 trajectories with it, and convert them into NumPy arrays suitable for training, all in a single code block.



In [3]:
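def expert_policy(obs):
    """Heuristic expert for CartPole: push the cart toward the side the pole leans."""
    pole_angle = obs[2]  # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    return 0 if pole_angle < 0 else 1  # 0 = push left, 1 = push right

expert_trajectories = []
for _ in range(100):
    trajectory = []
    obs, info = env.reset()
    done = False
    while not done:
        action = expert_policy(obs)
        trajectory.append((obs, action))
        obs, reward, terminated, truncated, info = env.step(action)
        # An episode ends on failure (terminated) or on the time limit (truncated)
        done = terminated or truncated
    expert_trajectories.append(trajectory)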

env.close()

observations = np.concatenate([np.array([item[0] for item in traj]) for traj in expert_trajectories])
actions = np.concatenate([np.array([item[1] for item in traj]) for traj in expert_trajectories])

print("Shape of observations:", observations.shape)
print("Shape of actions:", actions.shape)

Shape of observations: (4256, 4)
Shape of actions: (4256,)
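
To make the flattening step concrete, here is a minimal sketch on two hand-made episodes. The observation values are invented for illustration and do not come from the environment; `toy_observations` and the other `toy_*` names are hypothetical. It shows how a list of `(obs, action)` trajectories becomes one observation array and one action array, and how per-episode lengths can still be recovered from the trajectory list.

```python
import numpy as np

def expert_policy(obs):
    """Same rule as the notebook's expert: act on the sign of the pole angle."""
    return 0 if obs[2] < 0 else 1

# Hypothetical observations for two short episodes (values are made up).
toy_observations = [
    [np.array([0.0, 0.1, 0.02, 0.0]), np.array([0.0, 0.3, 0.01, -0.2])],
    [np.array([0.0, -0.1, -0.03, 0.1])],
]

# Pair every observation with the expert's action, one list per episode.
toy_trajectories = [
    [(obs, expert_policy(obs)) for obs in episode] for episode in toy_observations
]

# Flatten across episodes into training arrays.
toy_obs = np.array([obs for traj in toy_trajectories for obs, _ in traj])
toy_actions = np.array([act for traj in toy_trajectories for _, act in traj])

# Episode lengths are lost in the flat arrays but available from the list.
episode_lengths = [len(traj) for traj in toy_trajectories]

print(toy_obs.shape, toy_actions.shape, episode_lengths)
```

Keeping `episode_lengths` around is useful later, because in CartPole the length of an episode equals its total reward.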


## Train a behavior cloning agent

### Subtask:
Train a neural network to mimic the expert's behavior using the collected trajectories.


**Reasoning**:
I will now define, compile, and train the neural network. The model is a sequential network with two hidden layers of 32 ReLU units and a single sigmoid output that gives the probability of action 1. It is compiled with the Adam optimizer and binary cross-entropy loss, then trained on the collected `observations` and `actions` for 15 epochs with 20% of the data held out for validation. Finally, the model summary is printed.
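
As a rough illustration of the loss used here, binary cross-entropy can be computed by hand with NumPy. The probabilities below are made up, not values from the training run, and `binary_cross_entropy` is a hypothetical helper written for this sketch rather than the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(per_sample))

# Hypothetical expert actions and predicted probabilities of action 1.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4])

loss = binary_cross_entropy(y_true, p_pred)
# Accuracy thresholds the probability at 0.5, as the agent does at evaluation time.
accuracy = float(np.mean((p_pred > 0.5) == (y_true == 1.0)))

print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Note that accuracy can be perfect while the loss is still well above zero: the loss also penalizes correct but unconfident predictions such as 0.6.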



In [4]:
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Declare the input shape with an explicit Input layer instead of passing
    # input_shape to the first Dense layer, which triggers a Keras warning.
    tf.keras.Input(shape=(observations.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(observations, actions, epochs=15, validation_split=0.2)

model.summary()

Epoch 1/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.6500 - loss: 0.6598 - val_accuracy: 0.9331 - val_loss: 0.5306
Epoch 2/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.9103 - loss: 0.4496 - val_accuracy: 0.9566 - val_loss: 0.2355
Epoch 3/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9420 - loss: 0.2206 - val_accuracy: 0.9577 - val_loss: 0.1495
Epoch 4/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9575 - loss: 0.1463 - val_accuracy: 0.9577 - val_loss: 0.1181
Epoch 5/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9534 - loss: 0.1256 - val_accuracy: 0.9660 - val_loss: 0.0968
Epoch 6/15
107/107 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9646 - loss: 0.1056 - val_accuracy: 0.9707 - val_loss: 0.0822
Epoch 7/15
107/107

## Evaluate the trained agent

### Subtask:
Evaluate the performance of our trained agent in the CartPole environment.


**Reasoning**:
Create a new CartPole environment for evaluation and define a function to get the action from the trained model.



In [5]:
eval_env = gym.make("CartPole-v1")

def agent_policy(obs):
    """Returns an action based on the trained model's prediction."""
    # The model expects a batch of observations, so we add a dimension
    obs_batch = np.expand_dims(obs, axis=0)
    # Get the probability of action 1
    prob_action_1 = model.predict(obs_batch, verbose=0)[0][0]
    # Convert probability to binary action (0 or 1)
    return 1 if prob_action_1 > 0.5 else 0
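
The shape handling in `agent_policy` can be checked without the trained network. The sketch below swaps in a hypothetical `stub_predict` function (a logistic function of the pole angle, chosen only for illustration) in place of `model.predict`. Like the real model, it takes a batch of shape `(N, 4)` and returns probabilities of shape `(N, 1)`.

```python
import numpy as np

def stub_predict(obs_batch):
    """Hypothetical stand-in for model.predict: returns P(action = 1), shape (N, 1)."""
    pole_angle = obs_batch[:, 2]
    prob = 1.0 / (1.0 + np.exp(-50.0 * pole_angle))
    return prob.reshape(-1, 1)

obs = np.array([0.01, -0.2, 0.03, 0.1])    # one made-up observation, shape (4,)
obs_batch = np.expand_dims(obs, axis=0)    # batch of one, shape (1, 4)

# [0][0] selects the first (only) sample and its single output unit.
prob_action_1 = stub_predict(obs_batch)[0][0]
action = 1 if prob_action_1 > 0.5 else 0

print(obs_batch.shape, action)
```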


**Reasoning**:
Run 100 episodes using the trained agent, record the total reward for each episode, and then calculate and print the average reward.



In [6]:
total_rewards = []
for _ in range(100):
    obs, info = eval_env.reset()
    done = False
    episode_reward = 0
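    while not done:
        action = agent_policy(obs)
        obs, reward, terminated, truncated, info = eval_env.step(action)
        episode_reward += reward
        # Stop on failure (terminated) or on the 500-step time limit (truncated)
        done = terminated or truncated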
    total_rewards.append(episode_reward)

average_reward = np.mean(total_rewards)
print(f"Average reward over 100 episodes: {average_reward}")

eval_env.close()

Average reward over 100 episodes: 42.08
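
Gymnasium's `step` returns separate `terminated` and `truncated` flags, and an episode is over when either one is set. The sketch below isolates that rollout loop using a hypothetical `StubEnv` class that is not part of Gymnasium. It only mimics the return values of `reset` and `step`, so the example runs without the real environment. `run_episode` is likewise a helper name introduced here.

```python
import numpy as np

class StubEnv:
    """Hypothetical stand-in mimicking Gymnasium's reset/step return values."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4), {}

    def step(self, action):
        self.steps += 1
        terminated = False                          # this stub never fails
        truncated = self.steps >= self.max_steps    # time limit reached
        return np.zeros(4), 1.0, terminated, truncated, {}

def run_episode(env, policy):
    """Roll out one episode and return its total reward."""
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

returns = [run_episode(StubEnv(), lambda obs: 0) for _ in range(3)]
print(f"mean={np.mean(returns):.2f}, std={np.std(returns):.2f}")
```

Reporting the standard deviation alongside the mean gives a sense of how much episode returns vary, which a single average hides.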


## Summary:

### Q&A
**What is the average reward of the agent trained via imitation learning?**
The trained agent achieved an average reward of 42.08 over 100 evaluation episodes in the CartPole environment.

### Data Analysis Key Findings
*   An expert policy was used to generate 100 trajectories, resulting in 4,256 observation-action pairs for training.
*   A neural network was trained for 15 epochs on this data, achieving a final training accuracy of 98.63% and a validation accuracy of 99.18%.
*   The trained model, when evaluated in the CartPole environment for 100 episodes, obtained an average reward of 42.08.
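
A useful reference point for these numbers is the expert's own average return. CartPole gives a reward of +1 per step, so the expert's average return equals its average episode length, which follows from the figures already reported above. The variable names below are introduced only for this calculation.

```python
# Numbers reported in this notebook.
num_pairs = 4256           # expert observation-action pairs (one per step)
num_episodes = 100         # expert trajectories
agent_avg_reward = 42.08   # behavior-cloned agent, 100 evaluation episodes

# +1 reward per step, so average steps per episode is the expert's average return.
expert_avg_reward = num_pairs / num_episodes
ratio = agent_avg_reward / expert_avg_reward

print(f"expert={expert_avg_reward:.2f}, agent={agent_avg_reward:.2f}, ratio={ratio:.3f}")
```

By this measure the cloned agent performs about as well as the heuristic expert it was trained on.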

### Insights or Next Steps
*   Behavior cloning aims to reproduce the expert's behavior, so the agent's performance depends heavily on the quality of the expert policy as well as on the quantity of expert data. The expert here is a simple pole-angle heuristic; a stronger expert, together with more diverse and extensive trajectories, would give the agent more to learn from.
*   Experimenting with different model architectures, such as adding more layers or neurons, could help the agent fit the expert's policy more closely, although the reported validation accuracy is already above 99%.
