## 📓 Notebook Manager

This cell initializes the widgets required for managing your research notebook. Please run the cell below to enable functionality for:
- Exporting cells tagged with `export` into a `clean` notebook
- Generating a dynamic Table of Contents (TOC)
- Exporting the notebook to GitHub-compatible Markdown

➡️ **Be sure to execute the next cell before continuing with any editing or exporting.**

In [1]:
# Cell 1 - Workflow Tools
import sys
sys.path.insert(0, '../../lib')

from notebook_tools import TOCWidget, ExportWidget
import ipywidgets as widgets


# Create widget instances
toc = TOCWidget()
export = ExportWidget()

# Create horizontal layout
left_side = widgets.VBox([toc.button, export.button, toc.status])
right_side = widgets.VBox([toc.output, export.output])

# Display side by side
display(widgets.HBox([left_side, right_side]))


# 🚦 Traffic Video Preprocessing - Methodology [VERSION]

## 🎯 Purpose
This notebook implements the data preprocessing workflow for GDOT traffic camera videos. We process raw video files into frame sequences suitable for computer vision tasks.

## 📋 Context
- **Data Source**: 30 GDOT traffic camera feeds (recorded locally)
- **Video Specs**: 480 resolution, 15 fps
- **Methodology Goal**: Establish reproducible preprocessing workflow with clear documentation

## 🔄 Workflow Overview
1. Video ingestion and cataloging
2. Frame extraction
3. Quality control
4. Spatial transformations
5. Color normalization
6. Temporal downsampling
7. Data organization
8. Export and storage

## ⚡ Key Improvements (Methodology [VERSION])
- Added reproducibility checkpoints
- Streamlined workflow touchpoints
- Enhanced error handling and logging

## 📚 Notebook Structure
- **Setup**: Environment and dependencies
- **Processing**: Step-by-step video preprocessing
- **Validation**: Quality checks and verification
- **Summary**: Results and analysis (see end of notebook)

*Processing completed: [DATE] | Methodology version: [VERSION]*

**Last Updated**: [DATE]  
**Author**: [NAME]  
**Version**: [VERSION]

## 📑 Table of Contents (Auto-Generated)

This section holds the notebook's table of contents, which is generated when you click the **Generate TOC** button created in the Notebook Manager cell. Use it to navigate the preprocessing steps, validation checks, and results.

➡️ **Do not edit this cell manually. It will be overwritten automatically.**


<!-- TOC -->
# Table of Contents

- [📓 Notebook Manager](#📓-notebook-manager)
- [🎯 Purpose](#🎯-purpose)
- [📋 Context](#📋-context)
- [🔄 Workflow Overview](#🔄-workflow-overview)
- [⚡ Key Improvements (Methodology [VERSION])](#⚡-key-improvements-(methodology-[version]))
- [📚 Notebook Structure](#📚-notebook-structure)
- [📑 Table of Contents (Auto-Generated)](#📑-table-of-contents-(auto-generated))
- [🔧 Environment Setup](#🔧-environment-setup)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [🔄 Progress Tracking & Checkpoint System](#🔄-progress-tracking-&-checkpoint-system)
- [💾 Initialize Checkpoint and Progress Tracking Functions](#💾-initialize-checkpoint-and-progress-tracking-functions)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [📹 Video Ingestion & Cataloging](#📹-video-ingestion-&-cataloging)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [🎞️ Frame Extraction](#🎞️-frame-extraction)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [🔍 Image Quality Control](#🔍-image-quality-control)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [📐 Spatial Transformations](#📐-spatial-transformations)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [🎨 Color Space Normalization](#🎨-color-space-normalization)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [⏱️ Temporal Downsampling](#⏱️-temporal-downsampling)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [📁 Data Organization](#📁-data-organization)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)
- [💾 Export & Storage](#💾-export-&-storage)
  - [📊 Analysis & Observations](#📊-analysis-&-observations)
    - [Results](#results)
    - [Observations](#observations)
    - [Notes](#notes)

<!-- /TOC -->



## 🔧 Environment Setup

This cell establishes the preprocessing environment by:

1. **Import Required Libraries**
   - OpenCV (cv2) for video processing and frame extraction
   - NumPy for array operations and numerical computations
   - Pandas for data organization and metadata management
   - Logging for process tracking and error reporting
   - System utilities for path handling and file operations

2. **Library Verification**
   - Check OpenCV installation and version
   - Verify NumPy and Pandas availability
   - Display version information for debugging

3. **Initialize Helper Functions**
   - **calculate_brightness()**: Compute average pixel intensity (0-255)
   - **calculate_blur_score()**: Measure sharpness using Laplacian variance
   - **get_video_metadata()**: Extract video properties (fps, resolution, duration)

4. **Directory Setup**
   - Create output directory structure
   - Ensure path exists before processing begins

5. **Codec Validation**
   - Generate the fourcc code for the preferred video codec
   - Note that a valid fourcc code does not by itself confirm the codec is available for writing

**Note**: This cell must run successfully before proceeding with video processing. Any import errors or missing dependencies will be reported here.

### 📐 Preprocessing Configuration Parameters

This cell defines all parameters for individual video preprocessing. Parameters are organized into categories with emoji indicators:

#### Target Parameters
- 🎯 **VIDEO_ID**: Specific camera to process (e.g., ATL-1005)
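- 🎯 **BATCH_DATE**: Date from batch analysis (YYYYMMDD format)
- 🎯 **TARGET_HOUR**: Hour of day on a 24-hour clock (12 = noon); the recording that starts closest to this hour is selected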

#### Path Configuration  
- 📁 **INPUT_BASE**: Root directory for video recordings
- 📁 **OUTPUT_BASE**: Root directory for processed output
- 📁 **VIDEO_DIR**: Derived path to specific camera/date videos
- 📁 **OUTPUT_DIR**: Derived path for this preprocessing run

#### Processing Settings
- 📊 **FRAMES_TO_EXTRACT**: Total number of frames to extract
- 📊 **SAMPLE_RATE**: Extract every Nth frame from video

#### Quality Thresholds
Values from batch analysis quality metrics:
- 🔍 **brightness_min**: Minimum acceptable brightness (0-255)
- 🔍 **brightness_max**: Maximum acceptable brightness (0-255)  
- 🔍 **blur_min**: Minimum blur score (Laplacian variance)

#### Video Settings
- 🎥 **PREFERRED_CODEC**: Primary video codec for processing
- 🎥 **FALLBACK_CODECS**: Alternative codecs if primary fails
- 🎥 **MAX_FRAME_WIDTH**: Maximum frame width for resizing
- 🎥 **MAX_FRAME_HEIGHT**: Maximum frame height for resizing
- 🎥 **JPEG_QUALITY**: Output quality for saved frames (0-100)

**Note**: These values are hardcoded for initial testing. Future versions will read from `preprocessing_config.json`.
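
As a rough sketch of how a future version might read overrides from `preprocessing_config.json`, the example below merges a JSON document into a dictionary of defaults. The file's schema is not defined anywhere in this notebook, so the key layout, the `PATH_KEYS` set, and the `merge_config` helper are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Keys whose JSON string values should be turned into Path objects (assumed list).
PATH_KEYS = {"INPUT_BASE", "OUTPUT_BASE"}


def merge_config(defaults: dict, json_text: str) -> dict:
    """Return a copy of defaults updated with overrides parsed from a JSON string."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")

    merged = dict(defaults)
    for key, value in overrides.items():
        if key in PATH_KEYS:
            value = Path(value)
        elif isinstance(defaults[key], dict):
            # shallow-merge nested sections such as QUALITY_THRESHOLD
            value = {**defaults[key], **value}
        merged[key] = value
    return merged


# a subset of the hardcoded values, used as defaults
defaults = {
    "VIDEO_ID": "ATL-1005",
    "SAMPLE_RATE": 15,
    "OUTPUT_BASE": Path("../../data/preprocessing/individual_analysis"),
    "QUALITY_THRESHOLD": {"brightness_min": 104.47, "blur_min": 3494.35},
}

# hypothetical file content; in practice this would come from Path(...).read_text()
overrides = '{"SAMPLE_RATE": 30, "QUALITY_THRESHOLD": {"blur_min": 1000.0}}'
config = merge_config(defaults, overrides)
print(config["SAMPLE_RATE"], config["QUALITY_THRESHOLD"])
```

Rejecting unknown keys is one possible design choice: it catches typos in the JSON file early instead of silently ignoring them.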

In [7]:
# preprocessing configuration parameters
from pathlib import Path

CONFIG = {
    # target parameters
    'VIDEO_ID': 'ATL-1005',  # 🎯 camera to process
    'BATCH_DATE': '20250620',  # 🎯 date from batch analysis
    'TARGET_HOUR': 12,  # 🎯 target hour (noon)
    
    # path configuration  
    'INPUT_BASE': Path.home() / 'traffic-recordings',  # 📁 video source
    'OUTPUT_BASE': Path('../../data/preprocessing/individual_analysis'),  # 📁 output base
    
    # processing settings
    'FRAMES_TO_EXTRACT': 300,  # 📊 total frames to extract
    'SAMPLE_RATE': 15,  # 📊 extract every Nth frame
    
    # quality thresholds (from batch analysis)
    'QUALITY_THRESHOLD': {
        'brightness_min': 104.47,  # 🔍 minimum brightness
        'brightness_max': 114.75,  # 🔍 maximum brightness  
        'blur_min': 3494.35  # 🔍 minimum blur score
    },
    
    # video settings
    'PREFERRED_CODEC': 'mp4v',  # 🎥 primary codec
    'FALLBACK_CODECS': ['h264', 'xvid'],  # 🎥 alternatives
    'MAX_FRAME_WIDTH': 1920,  # 🎥 max width
    'MAX_FRAME_HEIGHT': 1080,  # 🎥 max height
    'JPEG_QUALITY': 95  # 🎥 output quality (0-100)
}

# derived paths
date_formatted = f"{CONFIG['BATCH_DATE'][:4]}-{CONFIG['BATCH_DATE'][4:6]}-{CONFIG['BATCH_DATE'][6:8]}"
CONFIG['OUTPUT_DIR'] = CONFIG['OUTPUT_BASE'] / date_formatted / CONFIG['VIDEO_ID']
CONFIG['VIDEO_DIR'] = CONFIG['INPUT_BASE'] / CONFIG['VIDEO_ID'] / date_formatted

print("Configuration loaded:")
print(f"  Processing: {CONFIG['VIDEO_ID']} from {date_formatted}")
print(f"  Output to: {CONFIG['OUTPUT_DIR']}")

Configuration loaded:
  Processing: ATL-1005 from 2025-06-20
  Output to: ../../data/preprocessing/individual_analysis/2025-06-20/ATL-1005
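
The string slicing used to build `date_formatted` accepts any text of sufficient length, so a mistyped `BATCH_DATE` would silently produce a wrong directory name. A minimal sketch of a stricter conversion is shown below; `format_batch_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime


def format_batch_date(batch_date: str) -> str:
    """Convert 'YYYYMMDD' to 'YYYY-MM-DD'; raise ValueError for anything else."""
    if len(batch_date) != 8 or not batch_date.isdigit():
        raise ValueError(f"BATCH_DATE must be 8 digits (YYYYMMDD), got {batch_date!r}")
    # strptime rejects impossible dates such as month 13 or day 32
    return datetime.strptime(batch_date, "%Y%m%d").strftime("%Y-%m-%d")


print(format_batch_date("20250620"))  # 2025-06-20
```

The explicit length check is there because `strptime` on its own tolerates some shortened inputs.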


### Environment Initialization

The preprocessing configuration parameters defined above will now be used to initialize the environment, import required libraries, and set up helper functions for video processing.

In [8]:
# environment setup
import os
import sys
import json
import logging
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# verify opencv: a missing package raises ImportError at import time,
# so the import itself must sit inside the try block
try:
    import cv2
except ImportError as exc:
    raise ImportError("OpenCV not installed. Install with: pip install opencv-python") from exc

# setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

print(f"✓ OpenCV version: {cv2.__version__}")
print(f"✓ Python version: {sys.version.split()[0]}")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Pandas version: {pd.__version__}")

# helper functions
def calculate_brightness(frame):
    """Calculate average brightness of frame"""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return np.mean(gray)

def calculate_blur_score(frame):
    """Calculate Laplacian variance (higher = sharper)"""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def get_video_metadata(video_path):
    """Extract video metadata; returns an empty dict if the video cannot be opened"""
    metadata = {}
    cap = cv2.VideoCapture(str(video_path))
    try:
        if cap.isOpened():
            metadata['fps'] = cap.get(cv2.CAP_PROP_FPS)
            metadata['frame_count'] = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            metadata['width'] = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
            metadata['height'] = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
            metadata['duration_seconds'] = metadata['frame_count'] / metadata['fps'] if metadata['fps'] > 0 else 0
            metadata['codec'] = int(cap.get(cv2.CAP_PROP_FOURCC))  # fourcc as a packed integer
        else:
            logging.warning(f"Cannot open video: {video_path}")
    finally:
        # release the capture handle whether or not the video opened
        cap.release()
    return metadata

# create output directory
CONFIG['OUTPUT_DIR'].mkdir(parents=True, exist_ok=True)

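# build the fourcc code for the preferred codec
# (this only packs the four characters; it does not test that the codec can encode)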
fourcc_test = cv2.VideoWriter_fourcc(*CONFIG['PREFERRED_CODEC'])
print(f"✓ Preferred codec '{CONFIG['PREFERRED_CODEC']}' fourcc: {fourcc_test}")

print(f"\n✓ Environment setup complete")
print(f"  Output directory created: {CONFIG['OUTPUT_DIR']}")

✓ OpenCV version: 4.11.0
✓ Python version: 3.12.9
✓ NumPy version: 2.2.4
✓ Pandas version: 2.2.3
✓ Preferred codec 'mp4v' fourcc: 1983148141

✓ Environment setup complete
  Output directory created: ../../data/preprocessing/individual_analysis/2025-06-20/ATL-1005
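
The integer printed for the preferred codec is simply the four ASCII characters of the code packed into one number, with the first character in the lowest byte. The sketch below reproduces that packing in plain Python and reverses it, which can make the integer stored under `metadata['codec']` readable. The names `encode_fourcc` and `decode_fourcc` are illustrative helpers, not OpenCV functions.

```python
def encode_fourcc(code: str) -> int:
    """Pack four ASCII characters into one integer, first character in the low byte."""
    if len(code) != 4:
        raise ValueError("a FOURCC code has exactly four characters")
    return sum(ord(ch) << (8 * i) for i, ch in enumerate(code))


def decode_fourcc(value: int) -> str:
    """Unpack an integer FOURCC back into its four-character string."""
    return "".join(chr((value >> (8 * i)) & 0xFF) for i in range(4))


print(encode_fourcc("mp4v"))      # 1983148141, the value printed above
print(decode_fourcc(1983148141))  # mp4v
```

Because the packing is pure arithmetic, producing a fourcc value says nothing about whether the installed OpenCV build can actually encode with that codec.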


### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Environment Setup*

---

## 🔄 Progress Tracking & Checkpoint System

The following cells implement simple progress tracking and checkpoint functionality to:

1. **Track Processing Progress**
   - Monitor which video is currently being processed
   - Count successful vs failed videos
   - Display elapsed time

2. **Enable Restart Capability**
   - Save progress after each video completes
   - Automatically skip already-processed videos on rerun
   - Maintain list of failed videos for retry

This ensures we don't lose work if the kernel crashes and provides visibility into long-running processes.
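
The "skip already-processed videos" behaviour can be sketched as a small filter over the list of videos. This is a minimal illustration that assumes the checkpoint is a dictionary with `processed` and `failed` lists; `pending_videos`, its `retry_failed` flag, and the file names are hypothetical.

```python
def pending_videos(all_videos, checkpoint, retry_failed=False):
    """Return the videos that still need processing, in their original order."""
    done = set(checkpoint["processed"])
    if not retry_failed:
        # treat earlier failures as handled unless a retry is requested
        done |= set(checkpoint["failed"])
    return [name for name in all_videos if name not in done]


# hypothetical state: one video finished, one failed, one untouched
example_checkpoint = {"processed": ["a.mp4"], "failed": ["b.mp4"]}
videos = ["a.mp4", "b.mp4", "c.mp4"]

print(pending_videos(videos, example_checkpoint))                     # ['c.mp4']
print(pending_videos(videos, example_checkpoint, retry_failed=True))  # ['b.mp4', 'c.mp4']
```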

## 💾 Initialize Checkpoint and Progress Tracking Functions

This module establishes checkpoint and progress tracking capabilities for the preprocessing workflow. The system creates functions for saving and loading processing state, initializes timing and counting variables, recovers from any existing checkpoints, and provides real-time progress monitoring with completion status.

**Implemented below.**

In [9]:
import json
import os
import time
from datetime import datetime

# Initialize tracking variables
CHECKPOINT_FILE = "preprocessing_checkpoint.json"
start_time = time.time()

def load_checkpoint():
    """Load previous progress if it exists"""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, 'r') as f:
            checkpoint = json.load(f)
            print(f"✓ Loaded checkpoint: {len(checkpoint['processed'])} videos already processed")
            return checkpoint
    return {
        "processed": [], 
        "failed": [], 
        "last_completed": None, 
        "start_time": datetime.now().isoformat()
    }

def save_checkpoint(checkpoint):
    """Save current progress"""
    checkpoint['last_updated'] = datetime.now().isoformat()
    with open(CHECKPOINT_FILE, 'w') as f:
        json.dump(checkpoint, f, indent=2)

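def log_progress(video_name, status, checkpoint, total_videos):
    """Log progress and update checkpoint"""
    if status == "success":
        # avoid duplicate entries when a cell is rerun
        if video_name not in checkpoint['processed']:
            checkpoint['processed'].append(video_name)
        # a retried video that now succeeds should no longer count as failed
        if video_name in checkpoint['failed']:
            checkpoint['failed'].remove(video_name)
    elif video_name not in checkpoint['failed']:
        checkpoint['failed'].append(video_name)
    
    checkpoint['last_completed'] = video_name
    save_checkpoint(checkpoint)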
    
    # Display progress
    elapsed = time.time() - start_time
    processed_count = len(checkpoint['processed'])
    failed_count = len(checkpoint['failed'])
    
    print(f"\n[{datetime.now().strftime('%H:%M:%S')}] {video_name}: {status}")
    print(f"Progress: {processed_count}/{total_videos} | Failed: {failed_count} | Elapsed: {elapsed/60:.1f}min")

# Load any existing checkpoint
checkpoint = load_checkpoint()
print(f"Ready to process videos. Checkpoint system initialized.")

Ready to process videos. Checkpoint system initialized.
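
`save_checkpoint` above writes directly to the checkpoint file, so if the kernel dies partway through the write the file can be left incomplete. One common way to reduce that risk is to write to a temporary file and then swap it into place. The sketch below shows the idea; `save_checkpoint_atomic` is a hypothetical variant, not the notebook's function.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(checkpoint: dict, path: str) -> None:
    """Write JSON to a temporary file, then swap it into place with os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # remove the temporary file if it was not swapped into place
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# usage: write into a throwaway directory and read the result back
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "preprocessing_checkpoint.json")
    save_checkpoint_atomic({"processed": ["a.mp4"], "failed": []}, target)
    with open(target) as f:
        restored = json.load(f)

print(restored["processed"])
```

The temporary file is created in the same directory as the target so that the final rename stays on one filesystem.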


### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Initialize Checkpoint and Progress Tracking Functions*

---

## 📹 Video Ingestion & Cataloging

This module loads video files from the source directory and extracts technical metadata including resolution, frame rate, duration, and codec specifications. The cataloging process builds a comprehensive data inventory and identifies format variations that may impact downstream processing stages.

*The following code cell locates the camera's recordings for the batch date and selects the one whose filename timestamp is closest to `TARGET_HOUR`. It does not yet call `get_video_metadata()` or use FFmpeg.*

In [11]:
# video ingestion and cataloging
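def parse_timestamp(filename):
    """extract start time (minutes from midnight) from a name like ATL-1005_20250620_120641.mp4"""
    parts = filename.stem.split('_')
    # require an HHMM prefix made of digits so int() cannot fail
    if len(parts) >= 3 and len(parts[2]) >= 4 and parts[2][:4].isdigit():
        time_str = parts[2]
        hours = int(time_str[:2])
        minutes = int(time_str[2:4])
        return hours * 60 + minutes  # minutes from midnight
    return None  # filename does not follow the expected pattern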

# find videos
video_files = list(CONFIG['VIDEO_DIR'].glob(f"{CONFIG['VIDEO_ID']}_*.mp4"))

if not video_files:
    raise FileNotFoundError(f"No videos found for {CONFIG['VIDEO_ID']} on {CONFIG['BATCH_DATE']}")

# find closest to noon
target_minutes = CONFIG['TARGET_HOUR'] * 60  # 720 minutes
closest_video = None
min_diff = float('inf')

for video in video_files:
    minutes = parse_timestamp(video)
    if minutes is not None:
        diff = abs(minutes - target_minutes)
        if diff < min_diff:
            min_diff = diff
            closest_video = video

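# stop early if no filename carried a usable timestamp
if closest_video is None:
    raise ValueError(f"No video filename with a parseable timestamp for {CONFIG['VIDEO_ID']}")

CONFIG['selected_video'] = closest_video
time_str = closest_video.stem.split('_')[2]
print(f"Selected: {closest_video.name}")
print(f"  Starts at: {time_str[:2]}:{time_str[2:4]}:{time_str[4:6]}")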

Selected: ATL-1005_20250620_120641.mp4
  Starts at: 12:06:41
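
The cataloging step described at the top of this section could be sketched as follows: one record per video, combining the fields encoded in the filename with whatever a metadata function returns, plus a count of distinct format combinations. `build_catalog` and `format_variations` are hypothetical helpers. The metadata function is passed in as a parameter so that `get_video_metadata` can be used in the notebook while a stand-in is used here; the stand-in values and the second filename are placeholders, not real measurements.

```python
from collections import Counter
from pathlib import Path


def build_catalog(video_paths, metadata_fn):
    """Return one record per video, combining filename fields with metadata."""
    catalog = []
    for path in sorted(video_paths):
        parts = path.stem.split("_")
        record = {
            "file": path.name,
            "camera": parts[0],
            "date": parts[1] if len(parts) > 1 else None,
            "time": parts[2] if len(parts) > 2 else None,
        }
        record.update(metadata_fn(path))
        catalog.append(record)
    return catalog


def format_variations(catalog):
    """Count how many videos share each (width, height, fps) combination."""
    return Counter((r.get("width"), r.get("height"), r.get("fps")) for r in catalog)


def fake_metadata(path):
    """Stand-in for get_video_metadata; returns placeholder values."""
    return {"width": 640, "height": 480, "fps": 15.0}


paths = [Path("ATL-1005_20250620_120641.mp4"), Path("ATL-1005_20250620_115632.mp4")]
catalog = build_catalog(paths, fake_metadata)
print(format_variations(catalog))
```

More than one key in the resulting counter would indicate format variations that later stages need to handle.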


### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Video Ingestion & Cataloging*

---

## 🎞️ Frame Extraction

This module samples frames from video sequences at specified temporal intervals. The extraction process converts temporal video data into spatial image representations suitable for computer vision processing and analysis.

*The following code cell implements frame extraction using OpenCV with configurable sampling rates and output formats.*



In [12]:
# frame extraction
import cv2

print(f"Frame Extraction")
print(f"Extracting {CONFIG['FRAMES_TO_EXTRACT']} frames (every {CONFIG['SAMPLE_RATE']} frames)")

video_path = CONFIG['selected_video']
cap = cv2.VideoCapture(str(video_path))

if not cap.isOpened():
    raise ValueError(f"Cannot open video: {video_path}")

# create frames directory
frames_dir = CONFIG['OUTPUT_DIR'] / 'frames'
frames_dir.mkdir(exist_ok=True)

# extract frames
frames_extracted = 0
frame_index = 0

while frames_extracted < CONFIG['FRAMES_TO_EXTRACT'] and cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    
    # extract every Nth frame
    if frame_index % CONFIG['SAMPLE_RATE'] == 0:
        frame_filename = f"frame_{frames_extracted:04d}.jpg"
        frame_path = frames_dir / frame_filename
        
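        # save frame; cv2.imwrite returns False when the write fails
        saved = cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, CONFIG['JPEG_QUALITY']])
        if not saved:
            raise IOError(f"Failed to write frame: {frame_path}")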
        
        frames_extracted += 1
        if frames_extracted % 50 == 0:
            print(f"  Extracted {frames_extracted}/{CONFIG['FRAMES_TO_EXTRACT']} frames")
    
    frame_index += 1

cap.release()

CONFIG['frames_dir'] = frames_dir
CONFIG['frames_extracted'] = frames_extracted

print(f"\n✓ Extracted {frames_extracted} frames to {frames_dir}")

Frame Extraction
Extracting 300 frames (every 15 frames)
  Extracted 50/300 frames
  Extracted 100/300 frames
  Extracted 150/300 frames
  Extracted 200/300 frames
  Extracted 250/300 frames
  Extracted 300/300 frames

✓ Extracted 300 frames to ../../data/preprocessing/individual_analysis/2025-06-20/ATL-1005/frames
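
It can help to check how much of the source video a run like this covers. The arithmetic below follows the loop above: frames are saved at indices 0, `SAMPLE_RATE`, 2 × `SAMPLE_RATE`, and so on. The 15 fps figure is the nominal rate from the Context section; the rate reported by an individual file may differ. `extraction_coverage` is an illustrative helper.

```python
def extraction_coverage(frames_to_extract: int, sample_rate: int, fps: float):
    """Return (source frames read, seconds of video spanned) for an extraction run."""
    last_index = (frames_to_extract - 1) * sample_rate  # index of the last saved frame
    frames_read = last_index + 1
    seconds_spanned = last_index / fps
    return frames_read, seconds_spanned


frames_read, seconds = extraction_coverage(300, 15, 15.0)
print(frames_read, seconds)  # 4486 299.0
```

With these settings the run samples about one frame per second over roughly five minutes of video.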


### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Frame Extraction*

---

## 🔍 Image Quality Control

This module filters out blurry, dark, or corrupted frames using automated quality metrics. The quality control process ensures only processable frames continue through the workflow, optimizing compute resources and improving downstream analysis reliability.


**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements quality filtering using Laplacian variance for blur detection, histogram analysis for exposure assessment, and file integrity checks.*
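
No code cell appears in this section yet. As a starting point, the threshold logic might look like the sketch below, which assumes brightness and blur scores have already been computed per frame (for example with `calculate_brightness` and `calculate_blur_score`). `passes_quality` and `split_by_quality` are hypothetical helpers, and the example scores are made up for illustration.

```python
def passes_quality(brightness, blur_score, thresholds):
    """Return True when a frame's scores fall inside the configured limits."""
    bright_ok = thresholds["brightness_min"] <= brightness <= thresholds["brightness_max"]
    sharp_ok = blur_score >= thresholds["blur_min"]
    return bright_ok and sharp_ok


def split_by_quality(scores, thresholds):
    """Split {frame_name: (brightness, blur_score)} into kept and rejected name lists."""
    kept, rejected = [], []
    for name, (brightness, blur_score) in scores.items():
        target = kept if passes_quality(brightness, blur_score, thresholds) else rejected
        target.append(name)
    return kept, rejected


# thresholds copied from CONFIG['QUALITY_THRESHOLD']
thresholds = {"brightness_min": 104.47, "brightness_max": 114.75, "blur_min": 3494.35}

# made-up scores for illustration only, not measurements from the real frames
scores = {
    "frame_0000.jpg": (110.0, 4000.0),  # inside both limits
    "frame_0001.jpg": (90.0, 4000.0),   # too dark
    "frame_0002.jpg": (110.0, 1200.0),  # too blurry
}
kept, rejected = split_by_quality(scores, thresholds)
print(kept, rejected)
```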

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Image Quality Control*

---

## 📐 Spatial Transformations

This module applies geometric transformations including resize, crop, and padding operations to achieve consistent frame dimensions. The standardization process ensures uniform input sizes for batch processing and meets model requirements for downstream analysis.

**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements spatial transformations using OpenCV and PIL with configurable target dimensions and padding strategies.*
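
Until the implementation is written, the dimension arithmetic behind a resize-then-pad step can be sketched in plain Python. The example assumes the goal is to fit frames inside `MAX_FRAME_WIDTH` × `MAX_FRAME_HEIGHT` without changing aspect ratio or upscaling, then pad to a fixed target size. `fit_within` and `letterbox_padding` are hypothetical helpers; their results would typically be passed to `cv2.resize` and `cv2.copyMakeBorder`.

```python
def fit_within(width, height, max_width, max_height):
    """Return (new_width, new_height) that fits the limits and keeps the aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # 1.0 cap: never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def letterbox_padding(width, height, target_width, target_height):
    """Return (top, bottom, left, right) padding that centers the frame in the target."""
    pad_w = target_width - width
    pad_h = target_height - height
    if pad_w < 0 or pad_h < 0:
        raise ValueError("frame is larger than the target; resize it first")
    top, left = pad_h // 2, pad_w // 2
    return top, pad_h - top, left, pad_w - left


new_size = fit_within(2000, 1000, 1920, 1080)
print(new_size)                                  # (1920, 960)
print(letterbox_padding(*new_size, 1920, 1080))  # (60, 60, 0, 0)
```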

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Spatial Transformations*

---

## 🎨 Color Space Normalization

This module converts frames to a consistent color space (RGB/BGR) and normalizes pixel values to standardized ranges. The normalization process ensures uniform data representation across different camera sensors and lighting conditions.

**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements color space conversion and pixel normalization using OpenCV with configurable target color spaces and normalization ranges.*
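
To make the intended conversion concrete, the sketch below reorders channels from OpenCV's BGR order to RGB and scales 0-255 values to the 0-1 range, using plain Python on a tiny frame. The target range of 0-1 and the helper names are assumptions. On real frames the same operation is usually done on whole arrays, for example `frame[..., ::-1].astype(np.float32) / 255.0` with NumPy.

```python
def bgr_to_rgb_unit(pixel):
    """Convert one (B, G, R) pixel of 0-255 ints to (R, G, B) floats in [0, 1]."""
    b, g, r = pixel
    return (r / 255.0, g / 255.0, b / 255.0)


def normalize_frame(frame):
    """Apply the conversion to a frame given as rows of (B, G, R) tuples."""
    return [[bgr_to_rgb_unit(px) for px in row] for row in frame]


frame = [[(255, 0, 0), (0, 0, 255)]]  # one blue pixel, one red pixel, in BGR order
print(normalize_frame(frame))  # [[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]]
```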

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Color Space Normalization*

---

## ⏱️ Temporal Downsampling

This module selects keyframes or applies temporal windowing techniques to reduce data redundancy. The downsampling process manages data volume while preserving important temporal events and motion patterns for analysis.

**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements temporal downsampling using keyframe detection algorithms and configurable windowing strategies with OpenCV and custom temporal analysis functions.*
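
One simple form of keyframe selection keeps a frame only when it differs enough from the last frame that was kept. The sketch below illustrates that idea on frames represented as flat lists of grayscale values. The mean-absolute-difference measure, the threshold value, and the function names are assumptions, not the notebook's chosen algorithm.

```python
def mean_abs_diff(a, b):
    """Average absolute difference between two equally sized pixel sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, threshold):
    """Return indices of frames that differ from the last kept frame by more than threshold."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# toy grayscale frames: small flicker, then a scene change, then a change back
frames = [
    [10, 10, 10, 10],
    [11, 10, 10, 10],
    [50, 50, 50, 50],
    [52, 50, 50, 50],
    [10, 10, 10, 10],
]
print(select_keyframes(frames, threshold=5.0))  # [0, 2, 4]
```

Comparing against the last kept frame, rather than the immediately preceding one, keeps slow gradual changes from being dropped entirely.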

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Temporal Downsampling*

---

## 📁 Data Organization

This module structures processed frames with comprehensive metadata linking back to source videos. The organization system maintains full traceability throughout the processing workflow and enables efficient data loading for downstream analysis.

**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements data structuring using JSON metadata files and hierarchical directory organization with pandas for efficient data indexing and retrieval.*
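
A minimal sketch of the traceability metadata is shown below: one record per extracted frame that points back to the source video and the frame index it came from. The record fields and `build_manifest` are assumptions; the frame naming and index arithmetic follow the Frame Extraction cell.

```python
import json


def build_manifest(source_video, frames_extracted, sample_rate, fps):
    """Return one metadata record per extracted frame, linked to its source video."""
    records = []
    for i in range(frames_extracted):
        source_index = i * sample_rate  # frames were saved at every Nth source index
        records.append({
            "frame_file": f"frame_{i:04d}.jpg",
            "source_video": source_video,
            "source_frame_index": source_index,
            "offset_seconds": source_index / fps,
        })
    return records


manifest = build_manifest("ATL-1005_20250620_120641.mp4", 3, 15, 15.0)
print(json.dumps(manifest[1], indent=2))
```

A list of flat dictionaries like this can be written with `json.dump` or loaded into a pandas DataFrame for indexing.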

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Data Organization*

---

## 💾 Export & Storage

This module saves processed frames in optimized formats for efficient storage and retrieval. The export process optimizes I/O performance for training workflows and ensures data accessibility for downstream analysis.

**🚧 IMPLEMENTATION REQUIRED 🚧**

*The following code cell implements data export with compression and batch writing optimizations.*
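
As one possible shape for batched export, the sketch below groups frames into fixed-size batches and writes each batch into a zip archive. The archive format, batch size, and helper names are assumptions. Frames are stored without recompression because JPEG data is already compressed; in-memory buffers and placeholder bytes stand in for real files.

```python
import io
import zipfile


def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def write_archive(frames, fileobj):
    """Write (name, bytes) pairs into a zip archive without recompressing them."""
    with zipfile.ZipFile(fileobj, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in frames:
            archive.writestr(name, data)


# placeholder frame data; real use would read the bytes of each saved JPEG
frames = [(f"frame_{i:04d}.jpg", b"placeholder") for i in range(5)]

archives = []
for batch in batched(frames, 2):
    buffer = io.BytesIO()
    write_archive(batch, buffer)
    archives.append(buffer)

print(len(archives))  # 3
```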

### 📊 Analysis & Observations

**Record your findings from the code execution above:**

#### Results
*What outputs or data were generated?*

#### Observations
*What patterns or behaviors did you notice?*

#### Notes
*Any issues, performance observations, or follow-up needed?*

---

*End of Export & Storage*

---