# Basic Drone Training with PPO

This notebook demonstrates how to train a simulated Crazyflie drone using the PPO (Proximal Policy Optimization) algorithm.

## Quick Start
1. Import training functions
2. Configure training parameters
3. Train the model
4. Test the trained model

## 1. Setup and Imports

In [1]:
import sys
import os

# Add parent directory to path if needed
parent_dir = os.path.abspath('..')
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)
    
from crazy_flie_env import CrazyFlieEnv
import numpy as np

env = CrazyFlieEnv()

In [2]:
# Import training utilities
from train.simple_train import train_ppo, train_sac, load_model
from train.config import TrainingConfig
from train.test_utils import test_model, quick_test, visualize_episode

print("✅ Imports successful!")

✅ Imports successful!


## 2. Quick Test (Optional)

Test the environment before training to make sure everything works.

In [3]:
# Quick environment test
from crazy_flie_env import CrazyFlieEnv

env = CrazyFlieEnv()
obs, info = env.reset()

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
print(f"State shape: {obs['state'].shape}")
print(f"Image shape: {obs['image'].shape}")

# Test one step
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
print(f"Step completed! Reward: {reward:.3f}")

env.close()
print("\n✅ Environment test passed!")

Observation space: Dict('image': Box(0, 255, (720, 1280, 3), uint8), 'state': Box([-10.        -10.          0.         -5.         -5.         -5.
  -3.1415927  -3.1415927  -3.1415927 -10.        -10.        -10.       ], [10.        10.        10.         5.         5.         5.
  3.1415927  3.1415927  3.1415927 10.        10.        10.       ], (12,), float32))
Action space: Box([-1. -1. -1.  0.], 1.0, (4,), float32)
State shape: (12,)
Image shape: (720, 1280, 3)
Step completed! Reward: 0.121

✅ Environment test passed!
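
The printed action space bounds the first three components to [-1, 1] and the fourth to [0, 1]. If you build actions by hand instead of calling `env.action_space.sample()`, it can help to clamp them first. The following is a minimal sketch using only the bounds shown in the output above; `clip_action` is a hypothetical helper, not part of `crazy_flie_env`.

```python
# Bounds copied from the printed action space:
# Box([-1. -1. -1.  0.], 1.0, (4,), float32)
ACTION_LOW = (-1.0, -1.0, -1.0, 0.0)
ACTION_HIGH = (1.0, 1.0, 1.0, 1.0)


def clip_action(action):
    """Clamp each component of a 4-element action into its printed bounds."""
    if len(action) != len(ACTION_LOW):
        raise ValueError(f"expected {len(ACTION_LOW)} components, got {len(action)}")
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, ACTION_LOW, ACTION_HIGH)]


# An out-of-range last component is pulled back to its upper bound.
print(clip_action([0.0, 0.0, 0.0, 1.4]))
```

Whether the environment clips out-of-range actions internally is not shown in this notebook, so clamping on the caller's side is the conservative choice.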


## 3. Configure Training

Create a training configuration. You can modify any parameters here.

### Live Visualization Options:
- Set `render_during_training=True` to see the MuJoCo viewer during training
- Set `render_freq` to the number of steps between rendered frames (lower values render more often but slow training)
- Set `show_live_metrics=True` to see training metrics in real-time
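
To get a feel for how `render_freq` trades visibility against speed, the sketch below counts how many frames would be drawn over a 100,000-step run. It assumes rendering is gated by a simple `step % render_freq == 0` check; `should_render` and `rendered_frames` are hypothetical helpers, and the actual logic inside `train_ppo` may differ.

```python
def should_render(step, render_during_training, render_freq):
    """Return True when a frame would be drawn at this step (assumed gating rule)."""
    return render_during_training and render_freq > 0 and step % render_freq == 0


def rendered_frames(total_timesteps, render_freq):
    """Count how many steps in 1..total_timesteps would trigger a render."""
    return sum(
        should_render(step, True, render_freq)
        for step in range(1, total_timesteps + 1)
    )


for freq in (10, 500):
    print(f"render_freq={freq}: {rendered_frames(100_000, freq)} rendered frames")
```

Under this assumption, `render_freq=10` draws 50 times as many frames as `render_freq=500`, which is why lower values slow training noticeably.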

In [None]:
# Quick training (for testing) - WITHOUT live visualization
# config = TrainingConfig(
#     algorithm="PPO",
#     total_timesteps=100_000,  
#     num_envs=1,
#     learning_rate=3e-4,
#     n_steps=2048,
#     batch_size=64,
#     eval_freq=10_000,
#     save_freq=20_000,
#     render_during_training=False,  # Set to True to see the live MuJoCo viewer
#     show_live_metrics=False
# )

# Alternative: Training WITH live visualization (slower but you can watch!)
config = TrainingConfig(
    algorithm="PPO",
    total_timesteps=100_000,  
    num_envs=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    eval_freq=10_000,
    save_freq=20_000,
    render_during_training=True,   # Enable live viewer
    render_freq=500,               # Render every 500 steps
    show_live_metrics=True,         # Show metrics during training
    metrics_freq=2000               # Print metrics every 2000 steps
)

print("Training Configuration:")
print(f"  Algorithm: {config.algorithm}")
print(f"  Total timesteps: {config.total_timesteps:,}")
print(f"  Parallel envs: {config.num_envs}")
print(f"  Learning rate: {config.learning_rate}")
print(f"  Device: {config.device}")
print(f"  Live rendering: {config.render_during_training}")
if config.render_during_training:
    print(f"  Render frequency: every {config.render_freq} steps")

Training Configuration:
  Algorithm: PPO
  Total timesteps: 100,000
  Parallel envs: 1
  Learning rate: 0.0003
  Device: cpu
  Live rendering: True
  Render frequency: every 500 steps
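
The interaction between `total_timesteps`, `n_steps`, `num_envs`, and `batch_size` is easier to see with the numbers worked out. The sketch below assumes the common PPO convention (used, for example, by Stable-Baselines3) in which each rollout collects `n_steps * num_envs` transitions and splits them into minibatches of `batch_size`. Whether `train_ppo` follows exactly this convention is an assumption, and `ppo_update_counts` is a hypothetical helper.

```python
import math


def ppo_update_counts(total_timesteps, n_steps, num_envs, batch_size):
    """Return (rollout size, number of rollouts, minibatches per epoch)."""
    rollout_size = n_steps * num_envs                          # transitions per rollout
    num_rollouts = math.ceil(total_timesteps / rollout_size)   # policy update rounds
    minibatches_per_epoch = rollout_size // batch_size         # gradient steps per epoch
    return rollout_size, num_rollouts, minibatches_per_epoch


# Values taken from the configuration above.
rollout_size, num_rollouts, minibatches = ppo_update_counts(100_000, 2048, 1, 64)
print(f"Rollout size: {rollout_size} transitions")
print(f"Rollouts needed: {num_rollouts}")
print(f"Minibatches per epoch: {minibatches}")
```

With the configuration above this gives 2048 transitions per rollout, 49 rollouts, and 32 minibatches per epoch. Choosing a `batch_size` that divides `n_steps * num_envs` evenly avoids a truncated final minibatch.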


## 4. Train PPO Agent

This cell trains the agent. A progress bar shows training progress.

### Live Visualization
If you set `render_during_training=True`, the MuJoCo viewer window opens during training so you can watch the drone learn in real time.

⚠️ **Note**:
- Training can take a while depending on `total_timesteps`
- Live rendering slows down training but lets you watch the learning process
- Approximate training times:
  - 100K steps: ~10-30 minutes (without rendering), ~20-45 minutes (with rendering)
  - 500K steps: ~1-2 hours (without rendering)
  - 1M steps: ~2-4 hours (without rendering)
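
The estimates above imply a throughput range that you can compare against your own machine. The sketch below converts between run time and steps per second; the 100 steps/s figure in the last line is a made-up example value, not a measurement, and both functions are hypothetical helpers.

```python
def implied_steps_per_second(timesteps, minutes):
    """Throughput implied by finishing `timesteps` steps in `minutes` minutes."""
    return timesteps / (minutes * 60)


def estimated_minutes(timesteps, steps_per_second):
    """Wall-clock minutes needed for `timesteps` steps at a given throughput."""
    return timesteps / steps_per_second / 60


# 100K steps in 10-30 minutes (the estimate without rendering).
slow = implied_steps_per_second(100_000, 30)
fast = implied_steps_per_second(100_000, 10)
print(f"Implied throughput: {slow:.0f}-{fast:.0f} steps/s")

# Extrapolate from a throughput you measured yourself (100 steps/s is illustrative).
print(f"500K steps at 100 steps/s: {estimated_minutes(500_000, 100):.0f} minutes")
```

Timing the first few thousand steps of a run and plugging the measured rate into `estimated_minutes` gives a better estimate for your hardware than the ranges listed here.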

In [5]:
# Train the model
model, results = train_ppo(config, verbose=True)

print("\n✅ Training completed!")
print(f"Model saved to: {results['final_model_path']}")
print(f"Logs available at: {results['log_dir']}")

🚀 Starting PPO Training
   Timesteps: 100,000
   Environments: 1
   Device: cpu
📁 Model directory: models/ppo_drone_20251001_151324
📁 Log directory: logs/ppo_drone_20251001_151324
✅ Model loaded: C:\Users\Ratan.Bunkar\Learning\general\rl-agent\Drone-UAV\bitcraze_crazyflie_2\scene.xml
📊 Model info: 2 bodies, 7 DOF
✅ Found drone body ID: 1
✅ Drone controller initialized
🎛️ Controller gains:
   Altitude: Kp=0.275, Ki=0.022, Kd=0.108
   Roll: Kp=0.325367
   Yaw: Kp=0.219712
✅ Observation space defined:
   State vector: 12 dimensions
   Camera image: (720, 1280, 3)
✅ Action space defined:
   Roll command: [-1.0, 1.0]
   Pitch command: [-1.0, 1.0]
   Yaw rate command: [-1.0, 1.0]
   Thrust command: [0.0, 1.0]
✅ Found drone FPV camera
📷 Available cameras:
   Camera 0: track
✅ Created 3 renderers
✅ Camera system initialized
📷 Camera system status:
   drone_fpv: ID=-1, pos=[-1.00, 0.00, 0.50]
✅ Rendering system initialized (OpenCV: ✅)
✅ Reward calculator initialized
🎯 Reward components:
   heig


## 5. Test the Trained Model

Test the trained model on several episodes.

In [None]:
# Test the model (will render first 3 episodes)
avg_reward, metrics = test_model(
    model_path=results['final_model_path'],
    algorithm="PPO",
    num_episodes=10,
    render=True
)

print("\n📊 Test Results:")
print(f"  Average Reward: {metrics['avg_reward']:.2f}")
print(f"  Std Dev: {metrics['std_reward']:.2f}")
print(f"  Success Rate: {metrics['success_rate']:.1%}")
print(f"  Average Episode Length: {metrics['avg_length']:.1f} steps")
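
For reference, metrics like the ones printed above can be derived from per-episode results as follows. This is a sketch only: the key names come from the cell above, but how `test_model` actually computes them (for example, population versus sample standard deviation, or what counts as a success) is not shown in this notebook. `summarize_episodes` is a hypothetical helper and the input values are illustrative, not real results.

```python
import statistics


def summarize_episodes(rewards, lengths, successes):
    """Aggregate per-episode results into the metric keys used above."""
    return {
        "avg_reward": statistics.fmean(rewards),
        "std_reward": statistics.pstdev(rewards),   # population std dev (assumed)
        "success_rate": sum(successes) / len(successes),
        "avg_length": statistics.fmean(lengths),
    }


# Illustrative values for four episodes, not output from a trained model.
summary = summarize_episodes(
    rewards=[10.0, 12.0, 8.0, 10.0],
    lengths=[200, 250, 150, 200],
    successes=[True, False, True, True],
)
print(f"  Average Reward: {summary['avg_reward']:.2f}")
print(f"  Std Dev: {summary['std_reward']:.2f}")
print(f"  Success Rate: {summary['success_rate']:.1%}")
print(f"  Average Episode Length: {summary['avg_length']:.1f} steps")
```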

## 6. View Training Logs with TensorBoard

Run this cell to view training metrics in TensorBoard.

In [None]:
# Load TensorBoard extension
%load_ext tensorboard

# Launch TensorBoard
%tensorboard --logdir logs/

# Alternatively, run this in terminal:
# tensorboard --logdir logs/

## 7. Visualize Episode Trajectory (Optional)

Visualize the drone's trajectory during an episode.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on Matplotlib < 3.2)

# Get trajectory data
trajectory = visualize_episode(
    model_path=results['final_model_path'],
    algorithm="PPO",
    max_steps=500
)

# Extract positions
positions = [t['position'] for t in trajectory]
positions = np.array(positions)

# Plot 3D trajectory
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

ax.plot(positions[:, 0], positions[:, 1], positions[:, 2], 'b-', linewidth=2, alpha=0.7)
ax.scatter(positions[0, 0], positions[0, 1], positions[0, 2], c='g', s=100, label='Start')
ax.scatter(positions[-1, 0], positions[-1, 1], positions[-1, 2], c='r', s=100, label='End')

ax.set_xlabel('X Position (m)')
ax.set_ylabel('Y Position (m)')
ax.set_zlabel('Z Position (m)')
ax.set_title('Drone Flight Trajectory')
ax.legend()

plt.tight_layout()
plt.show()

print(f"Episode completed in {len(trajectory)} steps")
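
Alongside the plot, two numbers summarize a trajectory well: the total distance flown and the straight-line distance between start and end. The sketch below computes both from a list of (x, y, z) points. `path_length` is a hypothetical helper and the sample points are illustrative; for the `positions` array above, `positions.tolist()` yields the same input format.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive (x, y, z) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Illustrative trajectory: climb 1 m, then move 5 m horizontally.
demo_points = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (3.0, 4.0, 1.0)]

total = path_length(demo_points)
net = math.dist(demo_points[0], demo_points[-1])
print(f"Path length: {total:.2f} m")
print(f"Net displacement: {net:.2f} m")
```

A path length much larger than the net displacement suggests the drone is oscillating or wandering rather than flying directly.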

## 8. Train SAC Agent (Alternative)

If you want to try SAC instead of PPO:

In [None]:
# Configure for SAC
sac_config = TrainingConfig(
    algorithm="SAC",
    total_timesteps=100_000,
    num_envs=4,
    learning_rate=3e-4,
    batch_size=256,
    buffer_size=100_000
)

# Train SAC
sac_model, sac_results = train_sac(sac_config, verbose=True)

# Test SAC
test_model(
    model_path=sac_results['final_model_path'],
    algorithm="SAC",
    num_episodes=10
)

## Next Steps

- **Longer training**: Increase `total_timesteps` to 500K or 1M for better performance
- **Hyperparameter tuning**: Experiment with learning rate, batch size, etc.
- **Custom rewards**: Modify the environment reward function
- **Different algorithms**: Try SAC, A2C, or custom algorithms
- **Advanced features**: Add curriculum learning, domain randomization
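
As a starting point for hyperparameter tuning, the sketch below enumerates a small grid over `learning_rate` and `batch_size`, the two parameters named above. The candidate values are illustrative assumptions, not recommendations from this notebook.

```python
from itertools import product

# Illustrative candidate values; adjust to your compute budget.
learning_rates = [1e-4, 3e-4, 1e-3]
batch_sizes = [64, 128]

# One dict of keyword overrides per combination.
sweep = [
    {"learning_rate": lr, "batch_size": bs}
    for lr, bs in product(learning_rates, batch_sizes)
]

for overrides in sweep:
    print(overrides)
```

Each dict could then be passed along the lines of `TrainingConfig(algorithm="PPO", total_timesteps=100_000, **overrides)`, assuming the remaining fields have defaults, as the SAC configuration above suggests.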