# Guide to Loading a Dataset for Inference

## Prerequisites and Key Concepts

This notebook shows how to load and inspect robot data. Robot data consists of recordings of what a robot sees (camera images) and does (arm movements, hand motions) while completing tasks.

**Key Terms:**
- **Dataset**: A collection of robot demonstrations (like video recordings + robot movements)
- **Modality**: Different types of data (video, robot joint positions, actions, language instructions)
- **Embodiment**: The physical form/type of robot (humanoid, arm, mobile robot, etc.)
- **Inference**: Using a trained AI model to predict what the robot should do next

## Understanding Robot Data

Robot datasets contain multiple types of information:
1. **Video data**: What the robot "sees" through its cameras
2. **State data**: Current positions of robot joints (arms, hands, etc.)
3. **Action data**: What the robot should do next (move arm, close gripper, etc.)
4. **Language data**: Human instructions ("pick up the apple")
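The keys used later in this notebook follow a dotted naming pattern in which the text before the first dot indicates the kind of data (language keys start with `annotation`). As a minimal sketch of that convention, the snippet below groups a handful of those keys by prefix. The helper `group_by_prefix` is hypothetical and written only for illustration; it is not part of GR00T.

```python
from collections import defaultdict

# Keys copied from the modality configuration used later in this notebook.
keys = [
    "video.ego_view",
    "state.left_arm",
    "state.right_arm",
    "action.left_hand",
    "action.right_hand",
    "annotation.human.action.task_description",
]


def group_by_prefix(keys):
    """Group dotted keys by the text before the first dot (hypothetical helper)."""
    groups = defaultdict(list)
    for key in keys:
        prefix, _, _ = key.partition(".")
        groups[prefix].append(key)
    return dict(groups)


groups = group_by_prefix(keys)
for prefix, members in groups.items():
    print(f"{prefix}: {members}")
```

Grouping by prefix like this is a quick way to check which kinds of data a configuration touches before loading anything.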

## LeRobot Format

* This tutorial shows how to load data in LeRobot format using our dataloader.
* We use the `robot_sim.PickNPlace` dataset as an example; it has already been converted to LeRobot format.

In [None]:
# any_describe: A utility function that prints detailed information about data structures
# This helps you understand what's inside your data
from gr00t.utils.misc import any_describe

# LeRobotSingleDataset: The main class for loading robot datasets
# Think of this as your "data loader" for robot demonstrations
from gr00t.data.dataset import LeRobotSingleDataset

# ModalityConfig: Tells the system which types of data to use (video, robot states, etc.)
# This is like a "menu" where you select what data you want
from gr00t.data.dataset import ModalityConfig

# EmbodimentTag: Specifies what type of robot this data is from
# Different robots need different handling (humanoid vs arm vs mobile robot)
from gr00t.data.schema import EmbodimentTag

## Loading Your First Dataset

Loading robot data is like opening a folder of robot demonstrations. We need to tell the system:
1. **Where** to find the data (file path)
2. **What** types of data to use (video, robot positions, etc.)
3. **What** type of robot this data is from


**Delta Indices Explained:**
- `[0]` = Use only the current frame/timestep
- `[-1, 0]` = Use the previous frame AND current frame
- `[0, 1, 2]` = Use current frame and next 2 frames
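The offsets above can be sketched as simple index arithmetic. The function `resolve_delta_indices` below is a hypothetical illustration, not the loader's own code. In particular, clamping out-of-range indices to the first or last frame is an assumption made for this sketch; the actual dataloader may treat trajectory boundaries differently.

```python
def resolve_delta_indices(base_index, delta_indices, trajectory_length):
    """Turn relative offsets into absolute frame indices.

    Assumption for this sketch: indices that fall outside the trajectory
    are clamped to the first or last valid frame.
    """
    last = trajectory_length - 1
    return [min(max(base_index + delta, 0), last) for delta in delta_indices]


print(resolve_delta_indices(7, [0], 100))        # [7]
print(resolve_delta_indices(7, [-1, 0], 100))    # [6, 7]
print(resolve_delta_indices(7, [0, 1, 2], 100))  # [7, 8, 9]
print(resolve_delta_indices(0, [-1, 0], 100))    # [0, 0] (clamped at the start)
```

Each modality can use its own list of offsets, which is why `delta_indices` is set separately for video, state, action, and language in the configuration.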

### Understanding Embodiment Tags

**What is an Embodiment?** The physical form of the robot:
- **Humanoid robot**: Has arms, legs, torso (like GR1)
- **Robot arm**: Just an arm with a gripper
- **Mobile robot**: Can move around

**Why does this matter?** Different robots need different AI "brains". GR00T has specialized components for each robot type to get the best performance.


In [None]:
import os
import gr00t


REPO_PATH = os.path.dirname(os.path.dirname(gr00t.__file__))
DATA_PATH = os.path.join(REPO_PATH, "demo_data/robot_sim.PickNPlace")

# Show where we're loading data from
print("Loading dataset... from", DATA_PATH)

In [None]:
## Configuring What Data to Use (Modality Configs)

# WHAT: Tell the system exactly which types of robot data we want to use
# WHY: Not all datasets have the same information, so we need to specify what's available
# HOW: We create a "configuration menu" for each data type

modality_configs = {
    # === VIDEO CONFIGURATION ===
    # WHAT: Configure camera/video data
    # This tells the system to use the robot's main camera view
    "video": ModalityConfig(
        delta_indices=[0],  # [0] = only current frame (not previous/future frames)
        modality_keys=["video.ego_view"],  # "ego_view" = robot's main camera perspective
    ),
    
    # === ROBOT STATE CONFIGURATION ===
    # WHAT: Configure robot body position data (where all robot parts are positioned)
    # This includes all the robot's body parts - arms, hands, legs, etc.
    "state": ModalityConfig(
        delta_indices=[0],  # [0] = current positions only
        modality_keys=[
            "state.left_arm",      # Left arm joint positions (7 joints: shoulder, elbow, wrist)
            "state.left_hand",     # Left hand/gripper positions (6 finger joints)
            "state.left_leg",      # Left leg joint positions
            "state.neck",          # Neck/head joint positions
            "state.right_arm",     # Right arm joint positions
            "state.right_hand",    # Right hand/gripper positions
            "state.right_leg",     # Right leg joint positions
            "state.waist",         # Waist/torso joint positions
        ],
    ),
    
    # === ACTION CONFIGURATION ===
    # WHAT: Configure what actions the robot should perform
    # Here we load only the hand actions (grasping, releasing)
    "action": ModalityConfig(
        delta_indices=[0],  # [0] = immediate next action
        modality_keys=[
            "action.left_hand",   # What the left hand should do (grip strength, finger positions)
            "action.right_hand",  # What the right hand should do
        ],
    ),
    
    # === LANGUAGE CONFIGURATION ===
    # WHAT: Configure human language instructions and validation
    # This includes task descriptions ("pick up the apple") and quality labels
    "language": ModalityConfig(
        delta_indices=[0],  # [0] = current instruction
        modality_keys=[
            "annotation.human.action.task_description",  # Human task instruction
            "annotation.human.validity"  # Whether this demonstration is good/bad
        ],
    ),
}

print("✅ Modality configuration created!")
print(f"📊 We're using {len(modality_configs)} types of data: {list(modality_configs.keys())}")

In [None]:
## Loading the Dataset - Putting It All Together

# STEP 1: Choose the robot embodiment (what type of robot this data is from)
# WHAT: EmbodimentTag.GR1 means this data is from a "GR1" humanoid robot
# WHY: GR00T has different "brains" optimized for different robot types
# Think of this as choosing the right "driver" for your robot hardware
embodiment_tag = EmbodimentTag.GR1
print(f"🤖 Using embodiment: {embodiment_tag}")
print("   This tells GR00T to use its humanoid robot AI components")

# STEP 2: Load the dataset with all our configurations
# WHAT: Create the dataset object that will handle loading robot data
# HOW: We pass in the path, modality configs, and embodiment tag
print("\n🔄 Loading dataset...")
dataset = LeRobotSingleDataset(
    DATA_PATH,           # Where to find the data
    modality_configs,    # What types of data to use
    embodiment_tag=embodiment_tag  # What type of robot this is
)

print('\n'*2)
print("="*100)
print(f"{' ✅ HUMANOID DATASET LOADED SUCCESSFULLY! ':=^100}")
print("="*100)
print(f"📊 Dataset contains {len(dataset)} data points (timesteps)")
print(f"🎯 Each data point includes: video + robot states + actions + language")

# STEP 3: Examine a single data point to understand the structure
# WHAT: Look at the data point at index 7 (indexing starts at 0)
# WHY: This helps us understand what information is available
print("\n🔍 Let's examine the data point at index 7:")
resp = dataset[7]  # Get the data point at index 7
any_describe(resp)  # Print detailed information about this data point
print("\n📋 Available data keys:", list(resp.keys()))
print("\n💡 Key explanations:")
print("   - Each 'key' represents a different type of information")
print("   - 'video.ego_view' = camera image from robot's perspective")
print("   - 'state.right_arm' = current positions of right arm joints")
print("   - 'action.right_hand' = what the right hand should do next")

## Visualizing Robot Camera Data

Let's look at what the robot actually "sees" through its camera during the demonstrations. This helps us understand the visual context of the robot's actions.


In [None]:
# WHAT: Extract and display robot camera images from different time points
# WHY: Visual inspection helps us understand what the robot sees during tasks
# HOW: We'll sample 10 images from the first 100 data points and display them in a grid

import matplotlib.pyplot as plt  # For displaying images

print("🎥 Extracting robot camera images...")
images_list = []  # Store the collected images

# STEP 1: Collect an image from every 10th data point (to see progression over time)
# We're sampling images to see how the scene changes as the robot performs its task
for i in range(0, 100, 10):  # Data points 0, 10, 20, ..., 90
    resp = dataset[i]  # Get data point i
    img = resp["video.ego_view"][0]  # Extract the camera image
    # Note: [0] gets the current frame (remember delta_indices=[0])
    images_list.append(img)
        
print(f"✅ Collected {len(images_list)} images from the robot's camera")

# STEP 2: Display the images in a nice grid layout
# Create a 2x5 grid to show all 10 images at once
print("🖼️ Creating image grid...")
fig, axs = plt.subplots(2, 5, figsize=(20, 10))  # 2 rows, 5 columns

# STEP 3: Place each image in the grid
for i, ax in enumerate(axs.flat):  # axs.flat makes it easy to iterate through all subplots
    ax.imshow(images_list[i])  # Display the image
    ax.axis("off")  # Hide axis numbers and ticks for cleaner look
    ax.set_title(f"Timestep {i*10}", fontsize=12)  # Label with actual timestep
    
plt.tight_layout()  # Adjust spacing between images automatically
plt.suptitle("🤖 Robot's Camera View During Pick & Place Task", fontsize=16, y=1.02)
plt.show()

print("\n💡 What you're seeing:")
print("   - These are frames from the robot's camera, sampled every 10 data points")
print("   - You can see how the scene changes as the robot performs its task")
print("   - Notice objects being picked up, moved, or manipulated")


## Data Transformations - Preparing Data for AI Models

Raw robot data needs to be "processed" before AI models can use it effectively. Think of this like preparing ingredients before cooking - we need to:
- Resize images to standard sizes
- Normalize numbers to consistent ranges  
- Convert data to formats the AI model expects

**Why do we need transformations?**
- **AI models are picky**: They expect data in specific formats and ranges
- **Consistency**: All images should be the same size, all numbers in similar ranges
- **Performance**: Properly processed data helps models learn better and faster
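One common way to bring numbers into a consistent range is min-max scaling to [-1, 1]. The sketch below shows only the arithmetic. The function `min_max_normalize` and the hand-written limits are assumptions for illustration; in practice the minimum and maximum per dimension would come from statistics of the dataset rather than being typed in.

```python
def min_max_normalize(values, lows, highs):
    """Scale each value from its [low, high] range to [-1, 1]."""
    normalized = []
    for value, low, high in zip(values, lows, highs):
        span = high - low
        if span == 0:
            # Constant dimension: there is nothing to scale, so map it to 0.
            normalized.append(0.0)
        else:
            normalized.append(2.0 * (value - low) / span - 1.0)
    return normalized


# Illustrative joint values and limits (made up for this sketch).
joint_values = [0.5, 3.0, -0.5]
lows = [-1.0, 0.0, -0.5]
highs = [1.0, 3.0, 0.5]

print(min_max_normalize(joint_values, lows, highs))  # [0.5, 1.0, -1.0]
```

A value at its lower limit maps to -1, a value at its upper limit maps to 1, and everything in between is scaled linearly.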

In [None]:
# WHAT: Import all the transformation functions we'll need
# WHY: Each transform does a specific job to prepare the data for AI models
from gr00t.data.transform.base import ComposedModalityTransform  # Combines multiple transforms
from gr00t.data.transform import VideoToTensor, VideoCrop, VideoResize, VideoColorJitter, VideoToNumpy
from gr00t.data.transform.state_action import StateActionToTensor, StateActionTransform
from gr00t.data.transform.concat import ConcatTransform

# STEP 1: Get our modality configurations (we defined these earlier)
# We need to know which data streams to apply transforms to
video_modality = modality_configs["video"]    # Camera/video configuration
state_modality = modality_configs["state"]    # Robot body positions configuration  
action_modality = modality_configs["action"]  # Robot actions configuration

print("🔧 Setting up data transformations...")
print("📝 Each transform will be applied in sequence:")

# STEP 2: Create a pipeline of transformations that will be applied to our data
# Think of this as a factory assembly line - data goes through each step
to_apply_transforms = ComposedModalityTransform(
    transforms=[
        # === VIDEO TRANSFORMATIONS ===
        # Process camera images to make them suitable for AI models
        
        # 1. Convert images to tensors (AI-friendly format)
        # Like converting a photo into a grid of numbers
        VideoToTensor(apply_to=video_modality.modality_keys),
        
        # 2. Crop images slightly (remove 5% from edges)
        # WHY: Focuses on center content, removes potential edge artifacts
        VideoCrop(apply_to=video_modality.modality_keys, scale=0.95),
        
        # 3. Resize all images to 224x224 pixels (standard AI input size)
        # WHY: AI models expect consistent input sizes
        VideoResize(apply_to=video_modality.modality_keys, height=224, width=224, interpolation="linear"),
        
        # 4. Add random color variations (data augmentation)
        # WHY: Helps AI model handle different lighting conditions
        # Randomly adjusts brightness, contrast, saturation, hue
        VideoColorJitter(apply_to=video_modality.modality_keys, 
                        brightness=0.3,   # ±30% brightness change
                        contrast=0.4,     # ±40% contrast change  
                        saturation=0.5,   # ±50% saturation change
                        hue=0.08),        # ±8% hue change
        
        # 5. Convert back to numpy arrays (for final processing)
        VideoToNumpy(apply_to=video_modality.modality_keys),

        # === ROBOT STATE TRANSFORMATIONS ===
        # Process robot joint positions and body state information
        
        # 1. Convert state data to tensors
        StateActionToTensor(apply_to=state_modality.modality_keys),
        
        # 2. Normalize state values to range [-1, 1]
        # WHY: AI models work better when all numbers are in similar ranges
        # "min_max" normalization scales values proportionally
        StateActionTransform(apply_to=state_modality.modality_keys, normalization_modes={
            key: "min_max" for key in state_modality.modality_keys  # Apply to all state keys
        }),

        # === ACTION TRANSFORMATIONS ===
        # Process robot action commands (what robot should do next)
        
        # 1. Convert action data to tensors
        StateActionToTensor(apply_to=action_modality.modality_keys),
        
        # 2. Normalize action values to range [-1, 1]
        # WHY: Consistent scaling helps AI model learn action patterns
        StateActionTransform(apply_to=action_modality.modality_keys, normalization_modes={
            key: "min_max" for key in action_modality.modality_keys  # Apply to all action keys
        }),

        # === FINAL STEP: CONCATENATION ===
        # Combine all the processed data into unified arrays
        # Like putting all ingredients into organized containers
        ConcatTransform(
            video_concat_order=video_modality.modality_keys,    # How to order video data
            state_concat_order=state_modality.modality_keys,    # How to order state data
            action_concat_order=action_modality.modality_keys,  # How to order action data
        ),
    ]
)

print("✅ Transform pipeline created!")
print(f"🔄 Will apply {len(to_apply_transforms.transforms)} transformations to the data")
print("\n💡 What these transforms do:")
print("   📹 Video: ToTensor→Crop→Resize→ColorJitter→ToNumpy")
print("   🤖 State: Convert→Normalize (robot joint positions)")
print("   🎯 Action: Convert→Normalize (robot commands)")
print("   📦 Final: Combine everything into organized arrays")


Now let's see how the data changes after the transformations are applied.

For example, states and actions are normalized and concatenated, and video images are cropped, resized, and color-jittered.
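To illustrate what concatenation in a fixed key order means, here is a minimal sketch using plain Python lists. The helper `concat_in_order` and the short vectors are hypothetical; the real transform operates on arrays that also carry a time dimension, and its output layout is not shown here.

```python
def concat_in_order(parts, order):
    """Join per-key vectors into one flat vector, following the given key order."""
    flat = []
    for key in order:
        flat.extend(parts[key])
    return flat


# Made-up values; real state vectors have more dimensions per body part.
state_parts = {
    "state.left_hand": [0.1, 0.2],
    "state.left_arm": [0.3, 0.4, 0.5],
}
order = ["state.left_arm", "state.left_hand"]

state_vector = concat_in_order(state_parts, order)
print(state_vector)  # [0.3, 0.4, 0.5, 0.1, 0.2]
```

The order matters: a model that consumes the combined vector relies on each position always holding the same joint, which is why the concatenation order is passed explicitly.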

In [None]:
dataset = LeRobotSingleDataset(
    DATA_PATH,
    modality_configs,
    transforms=to_apply_transforms,
    embodiment_tag=embodiment_tag
)

# Print the data point at index 7 again, this time with transforms applied
resp = dataset[7]
any_describe(resp)
print(resp.keys())
