# NTU-RGB+D 60 Data Preparation for Google Colab

This notebook prepares the NTU-RGB+D 60 dataset for training in Google Colab.

## Your Setup
- **Zip file location**: `drive/MyDrive/Colab Notebooks/thesis_outline_colabs/CTR-GCN/data`
- **Expected zip file**: `nturgbd_skeletons_s001_to_s017.zip` (NTU60)

## What This Notebook Does

1. Mounts Google Drive
2. Locates and extracts the NTU60 zip file
3. Runs three processing scripts:
   - `get_raw_skes_data.py` - Extract skeleton data from .skeleton files
   - `get_raw_denoised_data.py` - Remove bad/noisy skeletons
   - `seq_transformation.py` - Transform and create train/test splits
4. Creates final `.npz` files: `NTU60_CS.npz` and `NTU60_CV.npz`

## Estimated Time
- Total processing: ~30-60 minutes
- Each step: ~10-20 minutes

## Step 1: Mount Google Drive and Set Up Environment

In [None]:
from google.colab import drive
import os
import zipfile
import glob
import numpy as np

# Mount Google Drive
drive.mount('/content/drive')

# Set up your specific paths
DRIVE_ROOT = '/content/drive/MyDrive'
NTU_RAW_DIR = os.path.join(DRIVE_ROOT, 'Colab Notebooks', 'thesis_outline_colabs', 'CTR-GCN', 'data')
PROJECT_ROOT = os.path.join(DRIVE_ROOT, 'Colab Notebooks', 'thesis_outline_colabs', 'CTR-GCN')

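# Create the raw-data directory if needed.
# Note: NTU_RAW_DIR lives inside PROJECT_ROOT, so this also creates PROJECT_ROOT
# and the existence check below will pass even if the repository is absent.
os.makedirs(NTU_RAW_DIR, exist_ok=True)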

print(f"✅ Google Drive mounted")
print(f"📁 NTU Raw Data Directory: {NTU_RAW_DIR}")
print(f"📁 Project Root: {PROJECT_ROOT}")

# Change to project root
if os.path.exists(PROJECT_ROOT):
    os.chdir(PROJECT_ROOT)
    print(f"✅ Changed to project root: {os.getcwd()}")
else:
    print(f"⚠️  Project root not found: {PROJECT_ROOT}")
    print("   Please ensure CTR-GCN repository is in Google Drive")

## Step 2: Install Dependencies

In [None]:
# Install required packages
!pip install numpy scipy scikit-learn pyyaml tqdm

# Verify installation
import numpy as np
import scipy
import sklearn
print(f"✅ NumPy: {np.__version__}")
print(f"✅ SciPy: {scipy.__version__}")
print(f"✅ scikit-learn: {sklearn.__version__}")
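
Colab images usually ship with most of these packages, so it can be useful to see what is already present before reinstalling. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` and the `REQUIRED` list are illustrative and simply mirror the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used by pip (note: scikit-learn, not sklearn).
REQUIRED = ["numpy", "scipy", "scikit-learn", "pyyaml", "tqdm"]

missing = [name for name in REQUIRED if installed_version(name) is None]
print("Missing packages:", missing or "none")
```

Only the names reported as missing would then need a `pip install`.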

## Step 3: Locate and Extract Zip File

In [None]:
# Find zip files in your data directory
zip_files = glob.glob(os.path.join(NTU_RAW_DIR, '*.zip'))
print(f"Found {len(zip_files)} zip file(s) in {NTU_RAW_DIR}:")
for zf in zip_files:
    print(f"  - {os.path.basename(zf)}")

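# Find NTU60 zip file (skip the NTU120 add-on archive, s018_to_s032,
# which would otherwise match the looser 'nturgbd' + 'skeleton' test)
zip_file = None
for zf in sorted(zip_files):
    filename = os.path.basename(zf).lower()
    if 's018_to_s032' in filename:
        continue
    if 's001_to_s017' in filename or ('nturgbd' in filename and 'skeleton' in filename):
        zip_file = zf
        break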

if zip_file:
    print(f"\n✅ Found NTU60 zip file: {os.path.basename(zip_file)}")
    
    # Extract to nturgbd_raw directory
    extract_dir = os.path.join(NTU_RAW_DIR, 'nturgbd_raw')
    os.makedirs(extract_dir, exist_ok=True)
    
    print(f"Extracting to: {extract_dir}...")
    print("This may take a few minutes...")
    
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    
    print("✅ Extraction complete!")
    
    # Verify extraction
    skeleton_dir = os.path.join(extract_dir, 'nturgb+d_skeletons')
    if os.path.exists(skeleton_dir):
        skeleton_files = glob.glob(os.path.join(skeleton_dir, '*.skeleton'))
        print(f"✅ Found {len(skeleton_files)} skeleton files")
    else:
        print("\n⚠️  Expected directory 'nturgb+d_skeletons' not found")
        print("   Listing extracted directories:")
        for item in os.listdir(extract_dir):
            item_path = os.path.join(extract_dir, item)
            if os.path.isdir(item_path):
                print(f"     📁 {item}/")
            else:
                print(f"     📄 {item}")
else:
    print("\n❌ NTU60 zip file not found!")
    print(f"   Please ensure the zip file is in: {NTU_RAW_DIR}")
    print("   Expected filename contains: 'nturgbd_skeletons_s001_to_s017' or 's001_to_s017'")
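
If the archive unpacks under a different top-level folder, the check above only lists what was extracted. As a rough sketch, the directory that actually holds the `.skeleton` files could be located by walking the extracted tree; `find_skeleton_dir` is a hypothetical helper, and the demo runs on a temporary directory rather than on the real data.

```python
import os
import tempfile


def find_skeleton_dir(root):
    """Return the directory under `root` holding the most .skeleton files, or None."""
    best_dir, best_count = None, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        count = sum(1 for name in filenames if name.endswith(".skeleton"))
        if count > best_count:
            best_dir, best_count = dirpath, count
    return best_dir


# Small self-contained demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "some_folder", "nturgb+d_skeletons")
    os.makedirs(nested)
    open(os.path.join(nested, "S001C001P001R001A001.skeleton"), "w").close()
    found = find_skeleton_dir(tmp)
    print(os.path.relpath(found, tmp))
```

In the notebook, calling `find_skeleton_dir(extract_dir)` would give a candidate path to use in Step 5 when the default location is missing.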

## Step 4: Verify CTR-GCN Repository Structure

In [None]:
# Check if CTR-GCN repository structure exists
required_dirs = [
    'data/ntu',
    'data/ntu/statistics',
]

required_files = [
    'data/ntu/get_raw_skes_data.py',
    'data/ntu/get_raw_denoised_data.py',
    'data/ntu/seq_transformation.py',
]

print("Checking repository structure...")
all_good = True

for dir_path in required_dirs:
    full_path = os.path.join(PROJECT_ROOT, dir_path)
    if os.path.exists(full_path):
        print(f"✅ {dir_path}")
    else:
        print(f"❌ Missing: {dir_path}")
        os.makedirs(full_path, exist_ok=True)
        print(f"   Created: {dir_path}")
        all_good = False

for file_path in required_files:
    full_path = os.path.join(PROJECT_ROOT, file_path)
    if os.path.exists(full_path):
        print(f"✅ {file_path}")
    else:
        print(f"❌ Missing: {file_path}")
        all_good = False

if not all_good:
    print("\n⚠️  Some files/directories are missing.")
    print("   Please ensure CTR-GCN repository is complete in Google Drive.")
    print("   You may need to clone it (the path is quoted because it contains a space):")
    print(f"   !git clone https://github.com/Uason-Chen/CTR-GCN.git \"{PROJECT_ROOT}\"")
else:
    print("\n✅ All required files and directories found!")

## Step 5: Update Paths in Processing Scripts

In [None]:
# Navigate to processing directory
processing_dir = os.path.join(PROJECT_ROOT, 'data', 'ntu')
os.chdir(processing_dir)
print(f"Current directory: {os.getcwd()}")

# Update paths in get_raw_skes_data.py
script_path = 'get_raw_skes_data.py'
if os.path.exists(script_path):
    with open(script_path, 'r') as f:
        content = f.read()
    
    # Calculate relative path from data/ntu to raw skeleton directory
    raw_skeleton_dir = os.path.join(NTU_RAW_DIR, 'nturgbd_raw', 'nturgb+d_skeletons')
    
    # Check if directory exists
    if os.path.exists(raw_skeleton_dir):
        relative_path = os.path.relpath(raw_skeleton_dir, processing_dir) + '/'
        
        # Update the path in the script
        import re
        # Find and replace the skes_path line
        pattern = r"skes_path\s*=\s*['\"]([^'\"]+)['\"]"
        
        if re.search(pattern, content):
            content = re.sub(
                pattern,
                f"skes_path = '{relative_path}'",
                content
            )
            
            with open(script_path, 'w') as f:
                f.write(content)
            
            print(f"✅ Updated skes_path in get_raw_skes_data.py")
            print(f"   New path: {relative_path}")
            print(f"   Absolute path: {raw_skeleton_dir}")
        else:
            print("⚠️  Could not find skes_path definition in script")
    else:
        print(f"⚠️  Raw skeleton directory not found: {raw_skeleton_dir}")
        print("   Please check Step 3 (extraction)")
else:
    print(f"❌ Script not found: {script_path}")
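
The path rewrite above can be tried on a plain string before touching the real script. This is a minimal sketch of the same regex idea; `rewrite_skes_path` and the sample source text are made up for illustration. It uses a function as the replacement so that any backslashes in the new path are taken literally, and `re.subn` so the number of replacements can be checked.

```python
import re

# Matches an assignment such as: skes_path = '../nturgbd_raw/nturgb+d_skeletons/'
SKES_PATH_PATTERN = re.compile(r"skes_path\s*=\s*['\"][^'\"]+['\"]")


def rewrite_skes_path(source, new_path):
    """Return (updated_source, number_of_replacements)."""
    # A function replacement keeps backslashes in new_path literal;
    # repr() emits a correctly quoted Python string literal.
    return SKES_PATH_PATTERN.subn(lambda _m: f"skes_path = {new_path!r}", source)


sample = "save_path = './'\nskes_path = '../old/location/'\n"
updated, count = rewrite_skes_path(sample, "../nturgbd_raw/nturgb+d_skeletons/")
print(count)
print(updated)
```

A count other than 1 would suggest the script does not define `skes_path` the way the pattern expects, which is worth checking before writing the file back.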

## Step 6: Run Processing Scripts

**Note**: Each step may take 10-20 minutes. Be patient!

### Step 6.1: Get Raw Skeleton Data

In [None]:
print("=" * 70)
print("STEP 1/3: Getting raw skeleton data from .skeleton files")
print("=" * 70)
print("This step reads all .skeleton files and extracts joint positions.")
print("Estimated time: 10-20 minutes\n")

!python get_raw_skes_data.py

print("\n✅ Step 1 complete!")
print("   Output: raw_data/raw_skes_data.pkl")

### Step 6.2: Remove Bad Skeletons (Denoising)

In [None]:
print("=" * 70)
print("STEP 2/3: Removing bad skeletons (denoising)")
print("=" * 70)
print("This step filters out noisy or invalid skeleton sequences.")
print("Estimated time: 10-15 minutes\n")

!python get_raw_denoised_data.py

print("\n✅ Step 2 complete!")
print("   Output: denoised_data/raw_denoised_joints.pkl")

### Step 6.3: Transform Sequences and Create Splits

In [None]:
print("=" * 70)
print("STEP 3/3: Transforming sequences")
print("=" * 70)
print("This step:")
print("  - Centers skeletons to first frame")
print("  - Aligns all sequences to same length")
print("  - Splits into train/test sets (CS and CV)")
print("  - Creates final .npz files")
print("Estimated time: 10-15 minutes\n")

!python seq_transformation.py

print("\n✅ Step 3 complete!")
print("   Output: NTU60_CS.npz and NTU60_CV.npz")
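
The cells in Step 6 print a completion message whether or not the script succeeded, because `!python ...` does not stop the cell on a non-zero exit code. One possible alternative, sketched here with the standard `subprocess` module, is to report success only when the exit code is 0. `run_step` is a hypothetical helper, and the demo uses inline commands in place of the real scripts.

```python
import subprocess
import sys


def run_step(script_args, label):
    """Run a Python script and return True only if it exits with code 0."""
    result = subprocess.run([sys.executable, *script_args])
    if result.returncode == 0:
        print(f"{label}: complete")
        return True
    print(f"{label}: FAILED with exit code {result.returncode}")
    return False


# Demo with inline commands instead of the real processing scripts.
ok = run_step(["-c", "print('hello from a child process')"], "demo step")
failed = run_step(["-c", "import sys; sys.exit(3)"], "failing step")
```

In the notebook this might be called as `run_step(["get_raw_skes_data.py"], "Step 1/3")` from the `data/ntu` directory, stopping before the next step if it returns `False`.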

## Step 7: Verify Processed Files

In [None]:
# Check for processed .npz files
npz_files = glob.glob('*.npz')
print(f"Found {len(npz_files)} .npz file(s):\n")

for npz_file in sorted(npz_files):
    size_mb = os.path.getsize(npz_file) / (1024 * 1024)
    print(f"📁 {npz_file}")
    print(f"   Size: {size_mb:.2f} MB")
    
    # Load and inspect
    try:
        data = np.load(npz_file, allow_pickle=True)
        print(f"   Contents:")
        for key in sorted(data.keys()):
            arr = data[key]
            if isinstance(arr, np.ndarray):
                print(f"     - {key}: shape={arr.shape}, dtype={arr.dtype}")
        print()
    except Exception as e:
        print(f"   ⚠️  Error loading file: {e}\n")

# Expected files
expected = ['NTU60_CS.npz', 'NTU60_CV.npz']
missing = [f for f in expected if f not in npz_files]

if missing:
    print(f"⚠️  Missing expected files: {missing}")
else:
    print("✅ All expected files created successfully!")
    print(f"\n📂 Files are saved at: {processing_dir}")
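
Inspecting every array loads it fully into memory, which can be slow for files of this size. Since an `.npz` file is a zip archive with one `.npy` member per array, the key names can be read from the zip directory alone. The sketch below assumes the keys are `x_train`, `y_train`, `x_test` and `y_test`; that is an assumption about what `seq_transformation.py` saves and should be adjusted if the file differs. The demo archive contains placeholder members, not real arrays.

```python
import os
import tempfile
import zipfile

# Assumed key names; adjust if seq_transformation.py saves different ones.
EXPECTED_KEYS = {"x_train", "y_train", "x_test", "y_test"}


def npz_keys(path):
    """List array names in an .npz file by reading only the zip directory."""
    with zipfile.ZipFile(path) as archive:
        return {name[:-4] for name in archive.namelist() if name.endswith(".npy")}


def missing_keys(path, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from the archive, sorted."""
    return sorted(set(expected) - npz_keys(path))


# Demo on a stand-in archive with placeholder members (not real arrays).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npz")
    with zipfile.ZipFile(demo_path, "w") as archive:
        for key in ("x_train", "y_train", "x_test"):
            archive.writestr(key + ".npy", b"placeholder")
    demo_missing = missing_keys(demo_path)
    print(demo_missing)
```

An empty list from `missing_keys('NTU60_CS.npz')` would indicate that all assumed keys are present.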

## Step 8: Test Data Loading

In [None]:
# Add project root to path
import sys
sys.path.insert(0, PROJECT_ROOT)

from feeders.feeder_ntu import Feeder

# Test train feeder
print("Testing train feeder...")
train_data_path = os.path.join(processing_dir, 'NTU60_CS.npz')

if os.path.exists(train_data_path):
    train_feeder = Feeder(
        data_path=train_data_path,
        split='train',
        window_size=64,
        p_interval=[0.95],
        random_rot=False,
        bone=False,
        vel=False
    )
    
    print(f"✅ Train feeder created!")
    print(f"   Dataset size: {len(train_feeder)} samples")
    
    # Test loading a sample
    data, label, index = train_feeder[0]
    print(f"\nSample 0:")
    print(f"   Data shape: {data.shape}")
    print(f"   Label: {label}")
    print(f"   Data dtype: {data.dtype}")
    print(f"   Data range: [{data.min():.2f}, {data.max():.2f}]")
    
    # Test the test split (the same .npz file holds both splits)
    print("\nTesting test feeder...")
    test_feeder = Feeder(
        data_path=train_data_path,
        split='test',
        window_size=64,
        p_interval=[0.95],
        random_rot=False,
        bone=False,
        vel=False
    )
    
    print(f"✅ Test feeder created!")
    print(f"   Dataset size: {len(test_feeder)} samples")
    
    print("\n✅ Data loading test successful!")
    print("\n🎉 Data preparation complete! You can now start training.")
else:
    print(f"⚠️  Processed data file not found: {train_data_path}")
    print("   Please check Step 6 (processing scripts)")
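
To help interpret the printed data shape, here is a small sketch of the axis layout on synthetic data. It assumes that the stored arrays are flat `(N, T, 150)` tensors and that the feeder rearranges them to `(N, C, T, V, M)` with 3 coordinates, 25 joints and 2 bodies; both points are assumptions about the CTR-GCN feeder, not something this notebook verifies. `to_ctvm` is an illustrative name.

```python
import numpy as np

# Assumed layout constants: 2 bodies, 25 joints, 3 coordinates per joint.
NUM_BODIES, NUM_JOINTS, NUM_COORDS = 2, 25, 3


def to_ctvm(flat):
    """Reshape (N, T, 150) arrays to (N, C, T, V, M)."""
    n, t, _ = flat.shape
    five_d = flat.reshape(n, t, NUM_BODIES, NUM_JOINTS, NUM_COORDS)
    # Reorder axes: (N, T, M, V, C) -> (N, C, T, V, M)
    return five_d.transpose(0, 4, 1, 3, 2)


dummy = np.zeros((4, 300, 150), dtype=np.float32)  # synthetic stand-in data
batch = to_ctvm(dummy)
print(batch.shape)
```

Under these assumptions, a single sample cropped to `window_size=64` would be expected to have shape `(3, 64, 25, 2)`; a different shape in the output above would suggest the layout differs from this sketch.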

## Step 9: Summary

### ✅ Setup Complete!

Your NTU60 dataset is now prepared in Google Drive:

**Processed Data Location**: `{PROJECT_ROOT}/data/ntu/`

### Files Created:
- `NTU60_CS.npz` - Cross-Subject split (~500 MB - 2 GB)
- `NTU60_CV.npz` - Cross-View split (~500 MB - 2 GB)

### Next Steps:

1. **Start Training**:
   ```bash
   # Run from the project root (prefix the command with ! in a Colab cell)
   python main.py --config config/nturgbd-cross-subject/default.yaml --device 0
   ```

2. **Verify Config**: Make sure `data_path` in config points to:
   - `data/ntu/NTU60_CS.npz` (relative path from project root)

3. **Optional**: Delete intermediate files to save space:
   - `raw_data/` directory
   - `denoised_data/` directory
   - Keep only the `.npz` files for training
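
The optional cleanup could be scripted along these lines. This is only a sketch: `remove_intermediate` is a hypothetical helper, it defaults to a dry run that reports what would be deleted, and the demo operates on a throwaway directory so nothing real is touched.

```python
import os
import shutil
import tempfile


def remove_intermediate(base_dir, names=("raw_data", "denoised_data"), dry_run=True):
    """Delete the named subdirectories of base_dir; return the paths affected."""
    affected = []
    for name in names:
        path = os.path.join(base_dir, name)
        if os.path.isdir(path):
            affected.append(path)
            if not dry_run:
                shutil.rmtree(path)
    return affected


# Demo on a temporary directory standing in for data/ntu.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "raw_data"))
    open(os.path.join(tmp, "NTU60_CS.npz"), "w").close()
    planned = remove_intermediate(tmp)                 # dry run: only reports
    removed = remove_intermediate(tmp, dry_run=False)  # actually deletes
    remaining = sorted(os.listdir(tmp))
    print(remaining)
```

It is worth confirming in Step 7 and Step 8 that both `.npz` files load correctly before deleting anything, since regenerating the intermediate files means rerunning Step 6.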

### File Locations:
- **Processed data**: `{PROJECT_ROOT}/data/ntu/NTU60_CS.npz`
- **Config file**: `{PROJECT_ROOT}/config/nturgbd-cross-subject/default.yaml`

### Troubleshooting:

If you encounter issues:

1. **"Skeleton file not found"**: Check that zip file was extracted correctly
2. **"statistics directory not found"**: Ensure CTR-GCN repository is complete
3. **Memory errors**: Use Colab Pro or process locally
4. **Path errors**: Verify all paths are correct in the cells above

For detailed troubleshooting, see: `Docs/exploration/feeders/ntu_data_preparation_colab.md`