# Dataset Exploration
This notebook can be used to explore a subset of the RGB images and LiDAR data used for these experiments.

It builds a grouped FiftyOne dataset that pairs each RGB image with its LiDAR points and displays it in the FiftyOne (FO) app in the last cell. `convert_lidar_to_pcd()` converts the LiDAR points to PCD files so that they can be displayed as a 3D point cloud. Additionally, we compute some basic dataset statistics (class distribution, split sizes) and print and plot them.
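The actual `convert_lidar_to_pcd()` lives in `../src/visualization.py` and is not shown in this notebook. As a rough illustration of what such a conversion typically involves, here is a minimal sketch that turns a depth grid plus per-row azimuth and per-column zenith angles into Cartesian points. The function name `spherical_to_xyz`, the angle convention, and the array shapes are all assumptions made for this example, not the project's implementation.

```python
import numpy as np

def spherical_to_xyz(depth, azimuth, zenith):
    """Convert an (n_az, n_ze) depth grid to an (n_az * n_ze, 3) point array.

    Assumed convention (hypothetical, may differ from the project's helper):
    azimuth is the angle in the x-y plane measured from the x-axis, zenith is
    the elevation above that plane, both in radians.
    """
    az = azimuth[:, None]  # shape (n_az, 1), broadcasts across columns
    ze = zenith[None, :]   # shape (1, n_ze), broadcasts across rows
    x = depth * np.cos(az) * np.cos(ze)
    y = depth * np.sin(az) * np.cos(ze)
    z = depth * np.sin(ze)
    # Stack to (n_az, n_ze, 3), then flatten to one point per row
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Made-up inputs: every ray has depth 1, so all points lie on the unit sphere
depth = np.ones((4, 3))
azimuth = np.linspace(0.0, np.pi / 2, 4)
zenith = np.linspace(-0.1, 0.1, 3)
points = spherical_to_xyz(depth, azimuth, zenith)
print(points.shape)
```

Writing the resulting points to a `.pcd` file is a separate step that this sketch leaves out.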

To keep computation time manageable, I limited the number of samples per class that are converted to PCD and displayed in FiftyOne. The limit can be changed with the `MAX_SAMPLES_PER_CLASS` config variable.

The split created here and used in the following notebooks is unconventional. It is not a classical 80/20 or 70/15/15 split; instead it follows the NVIDIA Lab: a train/val-only split in which the last 10 batches of each class (`VALID_BATCHES * BATCH_SIZE` samples) form the validation set. I decided to keep it so that the results stay aligned with the target numbers set for the evaluation of Task 5.
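The split arithmetic can be sketched as follows. This is a minimal illustration, assuming the defaults of 32 samples per batch and 10 validation batches from the config cell; the helper name `split_sizes` and the file count of 1000 are made up for the example.

```python
def split_sizes(num_files, batch_size=32, valid_batches=10):
    """Return (num_train, num_val) for one class.

    The last `valid_batches * batch_size` files are held out for validation;
    everything before them is training data.
    """
    num_val = batch_size * valid_batches
    if num_files < num_val:
        raise ValueError(
            f"Need at least {num_val} files for validation, got {num_files}"
        )
    return num_files - num_val, num_val

# Hypothetical class with 1000 files
num_train, num_val = split_sizes(1000)
print(num_train, num_val)  # 680 320
```

With these defaults the validation set has a fixed size of 320 samples per class, so the train/val ratio depends on how many files each class contains.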

In [None]:
%%capture
!uv pip install fiftyone==1.10.0

In [None]:
import os
import sys
from pathlib import Path

import cv2
import fiftyone as fo
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm

# Make the project's helper modules in ../src importable
sys.path.append(os.path.abspath('../src'))
from visualization import convert_lidar_to_pcd

In [None]:
try:
    from google.colab import drive
    drive.mount('/gdrive')
    print("Mounted Google Drive")
    DATA_ROOT = Path('/gdrive/MyDrive/extended_assessments/Multimodal_Learning/data')
except ImportError:  # google.colab is only available on Colab
    print("Running locally")
    DATA_ROOT = Path('../data')

In [None]:
CLASSES = ["cubes", "spheres"]
PCD_CACHE = DATA_ROOT / "pcd_cache"
MAX_SAMPLES_PER_CLASS = 200
BATCH_SIZE = 32
VALID_BATCHES = 10

PCD_CACHE.mkdir(exist_ok=True)

dataset_name = "cilp_assessment"
if dataset_name in fo.list_datasets():
    fo.delete_dataset(dataset_name)

dataset = fo.Dataset(dataset_name)
dataset.add_group_field("group", default="rgb")

az = np.load(DATA_ROOT / "cubes/azimuth.npy")
ze = np.load(DATA_ROOT / "cubes/zenith.npy")

samples = []

stats = {
    "total_samples": 0,
    "train_samples": 0,
    "val_samples": 0,
    "class_counts": {}
}

for class_name in CLASSES:
    rgb_dir = DATA_ROOT / class_name / "rgb"
    lidar_dir = DATA_ROOT / class_name / "lidar"
    
    # Get all files
    rgb_files = sorted(list(rgb_dir.glob("*.png")))
    num_files = len(rgb_files)
    stats["class_counts"][class_name] = num_files
    stats["total_samples"] += num_files
    
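    # Calculate split based on the NVIDIA Lab logic:
    # the last VALID_BATCHES * BATCH_SIZE files of each class are validation.
    # Clamp so a class with fewer files cannot yield a negative train count.
    num_val = min(VALID_BATCHES * BATCH_SIZE, num_files)
    split_idx = num_files - num_val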
    
    stats["train_samples"] += split_idx
    stats["val_samples"] += num_val

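    # Limit samples for FiftyOne visualization.
    # The slice keeps the first files in sorted order, so val-tagged samples
    # only appear if the limit reaches past split_idx.
    files_to_process = rgb_files
    if MAX_SAMPLES_PER_CLASS:
        files_to_process = rgb_files[:MAX_SAMPLES_PER_CLASS]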
    
    print(f"Processing {class_name} (Visualizing {len(files_to_process)}/{num_files} samples)...")
    for rgb_file in tqdm(files_to_process):
        stem = rgb_file.stem
        lidar_file = lidar_dir / f"{stem}.npy"
        
        if not lidar_file.exists():
            continue
            
        file_idx = int(stem)
        tag = "train" if file_idx < split_idx else "val"
            
        # Convert Lidar
        pcd_file = PCD_CACHE / f"{class_name}_{stem}.pcd"
        # Always convert to ensure latest logic is used
        convert_lidar_to_pcd(lidar_file, az, ze, pcd_file)

        # Create Group
        group = fo.Group()
        
        # RGB Sample
        rgb_sample = fo.Sample(
            filepath=str(rgb_file.absolute()),
            group=group.element("rgb"),
            ground_truth=fo.Classification(label=class_name),
            tags=[tag]
        )
        
        # Lidar Sample
        lidar_sample = fo.Sample(
            filepath=str(pcd_file.absolute()),
            group=group.element("lidar"),
            ground_truth=fo.Classification(label=class_name),
            tags=[tag]
        )
        
        samples.extend([rgb_sample, lidar_sample])

dataset.add_samples(samples)
dataset.persistent = True
print(f"Created dataset '{dataset.name}' with {len(dataset)} samples.")

## Dataset Statistics
These statistics are computed on the whole dataset, NOT on the subset loaded into FiftyOne. As mentioned above, the split follows the NVIDIA Lab so that the results stay aligned with the expected control numbers. The cells below cover:

1. Samples per class
2. Split sizes and class distribution
3. Image sizes and data types
4. Visualizations
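Besides raw counts, the class distribution is often easier to read as relative shares. The following is a minimal sketch of that calculation; the helper name `class_shares` and the counts are made up for illustration and are not taken from the dataset.

```python
def class_shares(class_counts):
    """Return each class's fraction of the total sample count."""
    total = sum(class_counts.values())
    if total == 0:
        # Avoid division by zero for an empty dataset
        return {name: 0.0 for name in class_counts}
    return {name: count / total for name, count in class_counts.items()}

# Hypothetical counts, same structure as stats["class_counts"]
example_counts = {"cubes": 300, "spheres": 100}
shares = class_shares(example_counts)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Applied to the real `stats["class_counts"]`, this would show at a glance whether the two classes are balanced.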


In [None]:
# 1. Samples per class (Full Dataset)
print("Full Dataset Statistics:")
print("-" * 30)
print("Samples per class:", stats["class_counts"])

# 2. Train/validation split sizes (Full Dataset)
print(f"Total Samples: {stats['total_samples']}")
print(f"Train Samples: {stats['train_samples']}")
print(f"Val Samples:   {stats['val_samples']}")
print("-" * 30)

# 3. Image and LiDAR dimensions and data types
view = dataset.select_group_slices("rgb")
if len(view) > 0:
    first_rgb = view.first()
    img = cv2.imread(first_rgb.filepath)  # returns None if the file cannot be read
    if img is None:
        print(f"Could not read image: {first_rgb.filepath}")
    else:
        print(f"Image Dimensions: {img.shape}")  # (height, width, channels), BGR order
        print(f"Image Data Type: {img.dtype}")
else:
    print("No samples in dataset to check dimensions.")
# Inspect one raw LiDAR array (separate name to avoid shadowing the FiftyOne sample)
lidar_array = np.load(DATA_ROOT / "cubes/lidar/0000.npy")
print(f"LiDAR Sample Dimensions: {lidar_array.shape}")
print(f"LiDAR Data Type: {lidar_array.dtype}")
print(f"Min LiDAR Value: {lidar_array.min()}, Max LiDAR Value: {lidar_array.max()}")


# 4. Class distribution visualization
plt.figure(figsize=(8, 6))
plt.bar(stats["class_counts"].keys(), stats["class_counts"].values())
plt.title("Class Distribution (Full Dataset)")
plt.xlabel("Class")
plt.ylabel("Number of Samples")
plt.show()

In [None]:
session = fo.launch_app(dataset)