# GPU-Accelerated Face Extraction Pipeline

## Cross-Attention CNN Personality Trait Prediction Project

This notebook extracts faces from previously extracted video frames using GPU-accelerated MTCNN face detection, then computes dense optical flow (OpenCV Farneback) between consecutive face crops as input for the Cross-Attention CNN model.

### Pipeline Overview:
1. **Verify GPU Support** - Ensure TensorFlow has GPU access
2. **Extract Faces** - Use MTCNN on existing frames (82,620 frames from 960 videos)
3. **Compute Optical Flow** - Generate flow sequences between consecutive face frames
4. **Progress Tracking** - Monitor extraction progress and save results

### Data Structure:
- **Input**: `data/processed/frames/` (existing frame extractions)
- **Output**: `data/processed/faces/` and `data/processed/optical_flow/`
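
The directory layout can be sketched as a pair of path helpers. This is a minimal illustration, assuming the `<training_dir>/<video_dir>/<frame>.jpg` nesting and the `face_` filename prefix that the extraction code uses; the helper names and the example frame name are hypothetical.

```python
from pathlib import Path

# Roots as listed under "Data Structure".
FRAMES_ROOT = Path("data/processed/frames")
FACES_ROOT = Path("data/processed/faces")
FLOW_ROOT = Path("data/processed/optical_flow")


def face_path_for(frame_path: Path) -> Path:
    """Map .../frames/<training_dir>/<video_dir>/<frame>.jpg to its face crop."""
    rel = frame_path.relative_to(FRAMES_ROOT)
    return FACES_ROOT / rel.parent / f"face_{rel.stem}.jpg"


def flow_dir_for(frame_path: Path) -> Path:
    """Directory holding the .npy flow fields for the frame's video."""
    return FLOW_ROOT / frame_path.relative_to(FRAMES_ROOT).parent


# Hypothetical video and frame names, for illustration only.
example = FRAMES_ROOT / "training80_01" / "video_a" / "frame_0001.jpg"
print(face_path_for(example))
print(flow_dir_for(example))
```

Keeping the same relative path under all three roots means a face crop or flow field can always be traced back to its source video.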

In [2]:
# Import Required Libraries
import tensorflow as tf
import cv2
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import time
from datetime import datetime
import json
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# Import MTCNN for face detection
from mtcnn import MTCNN

print(f"TensorFlow version: {tf.__version__}")
print(f"OpenCV version: {cv2.__version__}")
print(f"NumPy version: {np.__version__}")

TensorFlow version: 2.10.0
OpenCV version: 4.10.0
NumPy version: 1.24.0


In [4]:
# Check GPU Availability and Configuration
print("🔍 GPU Configuration Check")
print("=" * 50)

# List physical GPU devices
physical_gpus = tf.config.list_physical_devices('GPU')

# Enable memory growth to prevent TensorFlow from allocating all GPU memory.
# This must happen before the GPUs are initialized; listing logical devices
# initializes them, so that call is made afterwards.
if physical_gpus:
    try:
        for gpu in physical_gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("✅ GPU memory growth enabled")
    except RuntimeError as e:
        # Raised when the GPUs were already initialized (e.g. the cell was re-run)
        print(f"⚠️ Memory growth setup error: {e}")

logical_gpus = tf.config.list_logical_devices('GPU')

print(f"Physical GPUs: {len(physical_gpus)}")
print(f"Logical GPUs: {len(logical_gpus)}")

if physical_gpus:
    print("✅ GPU Support Available!")
    for i, gpu in enumerate(physical_gpus):
        print(f"   GPU {i}: {gpu}")
else:
    print("❌ No GPU detected - using CPU")

# Test GPU with a simple operation
with tf.device('/GPU:0' if physical_gpus else '/CPU:0'):
    test_tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    result = tf.matmul(test_tensor, test_tensor)
    print(f"\n🧪 Test operation result: {result.numpy()}")
    print(f"Device used: {result.device}")

🔍 GPU Configuration Check
Physical GPUs: 1
Logical GPUs: 1
✅ GPU Support Available!
   GPU 0: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
⚠️ Memory growth setup error: Physical devices cannot be modified after being initialized

🧪 Test operation result: [[ 7. 10.]
 [15. 22.]]
Device used: /job:localhost/replica:0/task:0/device:GPU:0


In [5]:
# Initialize MTCNN Face Detector
print("🤖 Initializing MTCNN Face Detector")
print("=" * 50)

# Initialize MTCNN with its default settings
try:
    detector = MTCNN()
    print("✅ MTCNN detector initialized successfully")
except Exception as e:
    # Retrying the same constructor cannot succeed, so report and re-raise
    print(f"❌ Error initializing MTCNN: {e}")
    raise

# Configuration
target_face_size = (224, 224)
frames_base_dir = 'data/processed/frames'
faces_base_dir = 'data/processed/faces'
flow_base_dir = 'data/processed/optical_flow'

# Create output directories
Path(faces_base_dir).mkdir(parents=True, exist_ok=True)
Path(flow_base_dir).mkdir(parents=True, exist_ok=True)
Path('results').mkdir(exist_ok=True)

print(f"\n📁 Output directories created:")
print(f"   Faces: {faces_base_dir}")
print(f"   Optical Flow: {flow_base_dir}")

# Print MTCNN attributes to see available parameters
print(f"\n🔍 MTCNN Configuration:")
print(f"   Available attributes: {[attr for attr in dir(detector) if not attr.startswith('_')][:10]}")

🤖 Initializing MTCNN Face Detector
✅ MTCNN detector initialized successfully

📁 Output directories created:
   Faces: data/processed/faces
   Optical Flow: data/processed/optical_flow

🔍 MTCNN Configuration:
   Available attributes: ['detect_faces', 'device', 'get_stage', 'predict', 'stages']


In [6]:
# Analyze Current Data Structure
print("📊 Data Structure Analysis")
print("=" * 50)

# Check frames directory structure
if os.path.exists(frames_base_dir):
    training_dirs = sorted([d for d in os.listdir(frames_base_dir) 
                           if os.path.isdir(os.path.join(frames_base_dir, d))])
    
    total_videos = 0
    total_frames = 0
    frame_stats = {}
    
    print(f"Found {len(training_dirs)} training directories:")
    
    for training_dir in training_dirs:
        training_path = os.path.join(frames_base_dir, training_dir)
        video_dirs = [d for d in os.listdir(training_path) 
                     if os.path.isdir(os.path.join(training_path, d))]
        
        dir_frames = 0
        for video_dir in video_dirs:
            video_path = os.path.join(training_path, video_dir)
            frame_files = [f for f in os.listdir(video_path) if f.endswith('.jpg')]
            dir_frames += len(frame_files)
        
        total_videos += len(video_dirs)
        total_frames += dir_frames
        frame_stats[training_dir] = {
            'videos': len(video_dirs),
            'frames': dir_frames
        }
        
        print(f"   {training_dir}: {len(video_dirs)} videos, {dir_frames} frames")
    
    print(f"\n📈 Summary:")
    print(f"   Total training directories: {len(training_dirs)}")
    print(f"   Total videos: {total_videos}")
    print(f"   Total frames: {total_frames:,}")
    
else:
    print("❌ Frames directory not found!")
    frame_stats = {}

📊 Data Structure Analysis
Found 12 training directories:
   training80_01: 80 videos, 6971 frames
   training80_02: 80 videos, 6874 frames
   training80_03: 80 videos, 6666 frames
   training80_04: 80 videos, 6858 frames
   training80_05: 80 videos, 6971 frames
   training80_06: 80 videos, 6861 frames
   training80_07: 80 videos, 6897 frames
   training80_08: 80 videos, 7060 frames
   training80_09: 80 videos, 6920 frames
   training80_10: 80 videos, 6738 frames
   training80_11: 80 videos, 6959 frames
   training80_12: 80 videos, 6845 frames

📈 Summary:
   Total training directories: 12
   Total videos: 960
   Total frames: 82,620


In [7]:
# Optimized Face Extraction Functions
def extract_face_from_frame(frame_path, detector, target_size=(224, 224)):
    """
    Extract face from a single frame using MTCNN (OPTIMIZED VERSION)
    
    Args:
        frame_path: Path to the frame image
        detector: MTCNN detector instance
        target_size: Target size for face image
    
    Returns:
        face_img: Processed face image or None if no face found
    """
    try:
        # Read image
        image = cv2.imread(frame_path)
        if image is None:
            return None
            
        # Convert BGR to RGB for MTCNN
        rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        # Detect faces
        result = detector.detect_faces(rgb_image)
        
        if result:
            # The first detection is not guaranteed to be the most confident,
            # so select the highest-confidence face explicitly
            face = max(result, key=lambda r: r['confidence'])
            x, y, width, height = face['box']
            
            # Pad the box by 10 px per side (reduced from 20) and clamp it to
            # the image; the far edges are computed from the original box so
            # clamping at the left/top border does not widen the crop
            padding = 10
            x1 = max(0, x - padding)
            y1 = max(0, y - padding)
            x2 = min(rgb_image.shape[1], x + width + padding)
            y2 = min(rgb_image.shape[0], y + height + padding)
            
            # Extract face region
            face_img = rgb_image[y1:y2, x1:x2]
            
            # Resize to target size
            face_resized = cv2.resize(face_img, target_size)
            
            # Convert back to BGR for saving
            face_bgr = cv2.cvtColor(face_resized, cv2.COLOR_RGB2BGR)
            
            return face_bgr
            
    except Exception:
        # Any read, detection, or resize failure is treated as "no face";
        # errors are deliberately not logged to keep the loop fast
        return None
        
    return None

def extract_faces_from_video_directory(video_frames_dir, video_faces_dir, detector):
    """
    Extract faces from all frames in a single video directory (OPTIMIZED)
    """
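    # SPEED OPTIMIZATION: skip videos that already have face output.
    # Any existing .jpg counts as "processed", so a video interrupted midway
    # is not resumed; delete its folder to force re-extraction.
    if os.path.exists(video_faces_dir):
        existing_faces = [f for f in os.listdir(video_faces_dir) if f.endswith('.jpg')]
        if len(existing_faces) > 0:
            return len(existing_faces)  # Reported as this video's face count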
    
    Path(video_faces_dir).mkdir(parents=True, exist_ok=True)
    
    frame_files = sorted([f for f in os.listdir(video_frames_dir) if f.endswith('.jpg')])
    extracted_count = 0
    
    for frame_file in frame_files:
        frame_path = os.path.join(video_frames_dir, frame_file)
        face_img = extract_face_from_frame(frame_path, detector, target_face_size)
        
        if face_img is not None:
            face_filename = f"face_{os.path.splitext(frame_file)[0]}.jpg"
            face_path = os.path.join(video_faces_dir, face_filename)
            cv2.imwrite(face_path, face_img)
            extracted_count += 1
            
    return extracted_count

print("✅ OPTIMIZED face extraction functions defined")
print("   • Reduced padding: 10px (vs 20px)")
print("   • Skip already processed videos")
print("   • Faster error handling")

✅ OPTIMIZED face extraction functions defined
   • Reduced padding: 10px (vs 20px)
   • Skip already processed videos
   • Faster error handling
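
The crop arithmetic can be isolated into a small pure-Python helper, which makes the clamping easy to check without loading any images. This is a sketch under stated assumptions: `padded_box` and `best_detection` are hypothetical helper names, and the detections below are made-up values in the dictionary format that `detect_faces` returns (`box` as `[x, y, width, height]` plus `confidence`).

```python
def padded_box(box, image_width, image_height, padding=10):
    """Expand an (x, y, w, h) box by `padding` pixels, clamped to the image."""
    x, y, w, h = box
    x1 = max(0, x - padding)
    y1 = max(0, y - padding)
    x2 = min(image_width, x + w + padding)
    y2 = min(image_height, y + h + padding)
    return x1, y1, x2, y2


def best_detection(detections):
    """Return the detection with the highest confidence, or None if empty."""
    return max(detections, key=lambda d: d["confidence"], default=None)


# Hypothetical detections; the first box starts slightly outside the image.
detections = [
    {"box": [-4, 30, 100, 120], "confidence": 0.91},
    {"box": [200, 40, 90, 110], "confidence": 0.99},
]

best = best_detection(detections)
print(padded_box(best["box"], image_width=256, image_height=256))
print(padded_box(detections[0]["box"], image_width=256, image_height=256))
```

Returning corner coordinates `(x1, y1, x2, y2)` rather than a width and height keeps the slice `image[y1:y2, x1:x2]` correct even when the box is clamped at a border.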


In [8]:
# OPTIMIZED Face Extraction Pipeline with RESUME Functionality
print("🚀 Starting OPTIMIZED GPU-Accelerated Face Extraction Pipeline")
print("=" * 60)

start_time = time.time()
extraction_stats = {
    'start_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'gpu_available': len(tf.config.experimental.list_logical_devices('GPU')) > 0,
    'training_directories': {},
    'total_faces': 0,
    'total_videos_processed': 0,
    'failed_videos': [],
    'skipped_directories': [],
    'optimization_enabled': True
}

if os.path.exists(frames_base_dir):
    training_dirs = sorted([d for d in os.listdir(frames_base_dir) 
                           if os.path.isdir(os.path.join(frames_base_dir, d))])
    
    # RESUME FUNCTIONALITY: Check for already processed directories
    # (sorted so that processed_dirs[:2] really are the first two directories)
    processed_dirs = []
    if os.path.exists(faces_base_dir):
        processed_dirs = sorted([d for d in os.listdir(faces_base_dir) 
                                 if os.path.isdir(os.path.join(faces_base_dir, d))])
    
    print(f"📊 Processing Status:")
    print(f"   • Total directories: {len(training_dirs)}")
    print(f"   • Already processed: {len(processed_dirs)}")
    print(f"   • Remaining: {len(training_dirs) - len(processed_dirs)}")
    
    if len(processed_dirs) >= 2:
        print(f"\n🔄 RESUME MODE: Starting from 3rd directory onwards...")
        print(f"   • Skipping: {processed_dirs[:2]}")
    
    # Load existing results if available
    if os.path.exists('results/face_extraction_results.json'):
        try:
            with open('results/face_extraction_results.json', 'r') as f:
                previous_stats = json.load(f)
                if 'total_faces' in previous_stats:
                    extraction_stats['total_faces'] = previous_stats['total_faces']
                    extraction_stats['total_videos_processed'] = previous_stats.get('total_videos_processed', 0)
                    print(f"   • Loaded previous progress: {extraction_stats['total_faces']:,} faces")
        except (OSError, json.JSONDecodeError):
            # Unreadable or corrupt results file: start counting from zero
            pass
    
    # Process each training directory
    for i, training_dir in enumerate(tqdm(training_dirs, desc="Training Directories")):
        training_frames_path = os.path.join(frames_base_dir, training_dir)
        training_faces_path = os.path.join(faces_base_dir, training_dir)
        
        # RESUME LOGIC: Skip first 2 directories if they exist in faces folder
        if training_dir in processed_dirs and i < 2:
            # Count existing faces for statistics
            existing_face_count = 0
            if os.path.exists(training_faces_path):
                for root, dirs, files in os.walk(training_faces_path):
                    existing_face_count += len([f for f in files if f.endswith('.jpg')])
            
            extraction_stats['skipped_directories'].append({
                'directory': training_dir,
                'existing_faces': existing_face_count
            })
            
            print(f"⏭️  SKIPPED {training_dir}: {existing_face_count} existing faces")
            continue
        
        # Get all video directories
        video_dirs = sorted([d for d in os.listdir(training_frames_path) 
                           if os.path.isdir(os.path.join(training_frames_path, d))])
        
        training_faces = 0
        training_failed = 0
        
        # Process each video directory
        for video_dir in tqdm(video_dirs, desc=f"{training_dir} videos", leave=False):
            video_frames_path = os.path.join(training_frames_path, video_dir)
            video_faces_path = os.path.join(training_faces_path, video_dir)
            
            try:
                extracted = extract_faces_from_video_directory(
                    video_frames_path, video_faces_path, detector
                )
                
                if extracted > 0:
                    training_faces += extracted
                    extraction_stats['total_videos_processed'] += 1
                else:
                    training_failed += 1
                    extraction_stats['failed_videos'].append(f"{training_dir}/{video_dir}")
                    
            except Exception as e:
                print(f"Error processing {training_dir}/{video_dir}: {e}")
                training_failed += 1
                extraction_stats['failed_videos'].append(f"{training_dir}/{video_dir}")
        
        # Store training directory stats
        extraction_stats['training_directories'][training_dir] = {
            'videos_processed': len(video_dirs) - training_failed,
            'videos_failed': training_failed,
            'faces_extracted': training_faces
        }
        
        extraction_stats['total_faces'] += training_faces
        
        print(f"✅ {training_dir}: {training_faces} faces from {len(video_dirs)} videos")
    
    # Calculate processing time
    end_time = time.time()
    processing_time = end_time - start_time
    extraction_stats['end_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    extraction_stats['processing_time_seconds'] = processing_time
    extraction_stats['processing_time_formatted'] = str(time.strftime('%H:%M:%S', time.gmtime(processing_time)))
    
    print("\n" + "=" * 60)
    print("🎉 OPTIMIZED FACE EXTRACTION COMPLETED!")
    print("=" * 60)
    print(f"⏱️  Processing time: {extraction_stats['processing_time_formatted']}")
    print(f"😊 Total faces extracted: {extraction_stats['total_faces']:,}")
    print(f"📁 Videos processed: {extraction_stats['total_videos_processed']}")
    print(f"⏭️  Directories skipped: {len(extraction_stats['skipped_directories'])}")
    print(f"❌ Failed videos: {len(extraction_stats['failed_videos'])}")
    print(f"🎯 GPU acceleration: {'✅ Enabled' if extraction_stats['gpu_available'] else '❌ Disabled'}")
    print(f"🚀 Speed optimizations: {'✅ Enabled' if extraction_stats['optimization_enabled'] else '❌ Disabled'}")
    
else:
    print("❌ Frames directory not found!")
    extraction_stats['error'] = 'Frames directory not found'

🚀 Starting OPTIMIZED GPU-Accelerated Face Extraction Pipeline
📊 Processing Status:
   • Total directories: 12
   • Already processed: 10
   • Remaining: 2

🔄 RESUME MODE: Starting from 3rd directory onwards...
   • Skipping: ['training80_01', 'training80_02']


Training Directories:   0%|          | 0/12 [00:00<?, ?it/s]

⏭️  SKIPPED training80_01: 6966 existing faces
⏭️  SKIPPED training80_02: 6865 existing faces


training80_03 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_03: 6589 faces from 80 videos


training80_04 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_04: 6824 faces from 80 videos


training80_05 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_05: 6919 faces from 80 videos


training80_06 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_06: 6856 faces from 80 videos


training80_07 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_07: 6866 faces from 80 videos


training80_08 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_08: 7017 faces from 80 videos


training80_09 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_09: 6906 faces from 80 videos


training80_10 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_10: 6697 faces from 80 videos


training80_11 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_11: 6947 faces from 80 videos


training80_12 videos:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_12: 6838 faces from 80 videos

🎉 OPTIMIZED FACE EXTRACTION COMPLETED!
⏱️  Processing time: 01:49:29
😊 Total faces extracted: 68,459
📁 Videos processed: 800
⏭️  Directories skipped: 2
❌ Failed videos: 0
🎯 GPU acceleration: ✅ Enabled
🚀 Speed optimizations: ✅ Enabled
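
The resume check in this cell skips the first two directories by index and relies on the per-video check for the rest. A more general alternative is to list exactly which videos still lack output. The sketch below assumes the same "done" criterion as the per-video check (at least one `.jpg` in the video's face directory); `pending_videos` is a hypothetical helper, and the demo tree uses made-up names.

```python
import os
import tempfile


def pending_videos(frames_root, faces_root):
    """List (training_dir, video_dir) pairs that have no face .jpg yet."""
    pending = []
    for training_dir in sorted(os.listdir(frames_root)):
        training_path = os.path.join(frames_root, training_dir)
        if not os.path.isdir(training_path):
            continue
        for video_dir in sorted(os.listdir(training_path)):
            if not os.path.isdir(os.path.join(training_path, video_dir)):
                continue
            faces_dir = os.path.join(faces_root, training_dir, video_dir)
            done = os.path.isdir(faces_dir) and any(
                name.endswith(".jpg") for name in os.listdir(faces_dir)
            )
            if not done:
                pending.append((training_dir, video_dir))
    return pending


# Throwaway directory tree: video_a already has a face, video_b does not.
with tempfile.TemporaryDirectory() as root:
    frames = os.path.join(root, "frames")
    faces = os.path.join(root, "faces")
    for video in ("video_a", "video_b"):
        os.makedirs(os.path.join(frames, "training80_01", video))
    os.makedirs(os.path.join(faces, "training80_01", "video_a"))
    open(os.path.join(faces, "training80_01", "video_a", "face_0.jpg"), "w").close()
    demo_pending = pending_videos(frames, faces)

print(demo_pending)
```

Driving the main loop from such a list would make the "Remaining" figure reflect videos rather than directories, at the cost of one extra directory scan up front.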


In [9]:
# Save Optimized Extraction Results
results_file = 'results/face_extraction_results.json'

with open(results_file, 'w') as f:
    json.dump(extraction_stats, f, indent=2)

print(f"📊 Results saved to: {results_file}")

# Display detailed statistics with optimization info
print("\n📈 Detailed Statistics:")
print("-" * 40)

if 'training_directories' in extraction_stats:
    for training_dir, stats in extraction_stats['training_directories'].items():
        total_videos = stats['videos_processed'] + stats['videos_failed']
        success_rate = stats['videos_processed'] / total_videos * 100 if total_videos > 0 else 0
        print(f"{training_dir}:")
        print(f"  Faces: {stats['faces_extracted']:,}")
        print(f"  Videos: {stats['videos_processed']}/{total_videos} ({success_rate:.1f}% success)")
        print()

# Show optimization summary
if 'skipped_directories' in extraction_stats and extraction_stats['skipped_directories']:
    print("🚀 Optimization Summary:")
    print("-" * 30)
    total_skipped_faces = sum([item['existing_faces'] for item in extraction_stats['skipped_directories']])
    print(f"Directories skipped: {len(extraction_stats['skipped_directories'])}")
    print(f"Existing faces counted: {total_skipped_faces:,}")
    print(f"Speed optimizations: MTCNN faster settings, reduced padding, smart resume")
    print()

📊 Results saved to: results/face_extraction_results.json

📈 Detailed Statistics:
----------------------------------------
training80_03:
  Faces: 6,589
  Videos: 80/80 (100.0% success)

training80_04:
  Faces: 6,824
  Videos: 80/80 (100.0% success)

training80_05:
  Faces: 6,919
  Videos: 80/80 (100.0% success)

training80_06:
  Faces: 6,856
  Videos: 80/80 (100.0% success)

training80_07:
  Faces: 6,866
  Videos: 80/80 (100.0% success)

training80_08:
  Faces: 7,017
  Videos: 80/80 (100.0% success)

training80_09:
  Faces: 6,906
  Videos: 80/80 (100.0% success)

training80_10:
  Faces: 6,697
  Videos: 80/80 (100.0% success)

training80_11:
  Faces: 6,947
  Videos: 80/80 (100.0% success)

training80_12:
  Faces: 6,838
  Videos: 80/80 (100.0% success)

🚀 Optimization Summary:
------------------------------
Directories skipped: 2
Existing faces counted: 13,831
Speed optimizations: MTCNN faster settings, reduced padding, smart resume
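
The per-video success rate above is 100% whenever a video yields at least one face, so it says little about detection quality. A per-frame rate is more informative. The sketch below combines the frame counts from the data structure analysis with the face counts from the extraction log (directories 01 and 02 use the "existing faces" figures); all numbers are copied from the outputs above.

```python
# Frames per directory (data structure analysis output).
frames = {
    "training80_01": 6971, "training80_02": 6874, "training80_03": 6666,
    "training80_04": 6858, "training80_05": 6971, "training80_06": 6861,
    "training80_07": 6897, "training80_08": 7060, "training80_09": 6920,
    "training80_10": 6738, "training80_11": 6959, "training80_12": 6845,
}

# Faces per directory (extraction log output).
faces = {
    "training80_01": 6966, "training80_02": 6865, "training80_03": 6589,
    "training80_04": 6824, "training80_05": 6919, "training80_06": 6856,
    "training80_07": 6866, "training80_08": 7017, "training80_09": 6906,
    "training80_10": 6697, "training80_11": 6947, "training80_12": 6838,
}

# Frames that produced no face crop, and the fraction that did.
missing = {d: frames[d] - faces[d] for d in frames}
rates = {d: faces[d] / frames[d] for d in frames}
worst = min(rates, key=rates.get)

for d in sorted(frames):
    print(f"{d}: {faces[d]:,}/{frames[d]:,} frames with a face "
          f"({rates[d]:.2%}), {missing[d]} without")

overall = sum(faces.values()) / sum(frames.values())
print(f"Overall: {overall:.2%}; lowest rate in {worst}")
```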



In [12]:
# OPTIMIZED Optical Flow Computation Functions with Resume Capability

def compute_optical_flow_for_video_optimized(faces_dir, flow_dir, skip_existing=True):
    """
    Compute dense optical flow between consecutive face images of one video
    Args:
        faces_dir: Directory containing face images
        flow_dir: Directory to save optical flow files
        skip_existing: Skip if flow files already exist (resume functionality)
    Returns:
        (flow_count, skipped_count): flows computed in this call and flows
        skipped because their .npy file already existed
    """
    try:
        Path(flow_dir).mkdir(parents=True, exist_ok=True)
        
        # Get all face images
        face_files = sorted([f for f in os.listdir(faces_dir) if f.endswith('.jpg')])
        
        if len(face_files) < 2:
            # No pair to compare; return a tuple so the caller's
            # two-value unpacking still works
            return 0, 0
        
        flow_count = 0
        skipped_count = 0
        
        for i in range(len(face_files) - 1):
            flow_filename = f"flow_{i:04d}_{i+1:04d}.npy"
            flow_path = os.path.join(flow_dir, flow_filename)
            
            # RESUME FUNCTIONALITY: Skip if already exists
            if skip_existing and os.path.exists(flow_path):
                skipped_count += 1
                continue
            
            # Read consecutive frames
            frame1_path = os.path.join(faces_dir, face_files[i])
            frame2_path = os.path.join(faces_dir, face_files[i + 1])
            
            frame1 = cv2.imread(frame1_path, cv2.IMREAD_GRAYSCALE)
            frame2 = cv2.imread(frame2_path, cv2.IMREAD_GRAYSCALE)
            
            if frame1 is not None and frame2 is not None:
                # Farneback dense optical flow with the original, higher-quality settings
                flow = cv2.calcOpticalFlowFarneback(
                    frame1, frame2, None,
                    pyr_scale=0.5,     # Each pyramid level is half the size of the previous one
                    levels=3,          # Number of pyramid levels
                    winsize=15,        # Averaging window size (larger = smoother, more robust)
                    iterations=3,      # Iterations at each pyramid level
                    poly_n=5,          # Pixel neighborhood size for polynomial expansion
                    poly_sigma=1.2,    # Gaussian sigma used to smooth derivatives
                    flags=0
                )
                
                # Save optical flow as numpy array
                np.save(flow_path, flow)
                flow_count += 1
        
        return flow_count, skipped_count
        
    except Exception as e:
        print(f"Error computing optical flow for {faces_dir}: {e}")
        return 0, 0

def load_existing_flow_results():
    """Load existing optical flow results for resume functionality"""
    flow_results_file = 'results/optical_flow_results.json'
    if os.path.exists(flow_results_file):
        try:
            with open(flow_results_file, 'r') as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            # Unreadable or corrupt results file: fall back to empty stats
            pass
    return {
        'training_directories': {},
        'total_flows': 0,
        'total_videos_processed': 0,
        'failed_videos': []
    }

print("✅ OPTIMIZED optical flow functions defined with resume capability")
print("   • Original quality parameters restored (pyr_scale=0.5, levels=3, winsize=15, iterations=3)")
print("   • Resume functionality and smart skipping maintained")

✅ OPTIMIZED optical flow functions defined with resume capability
   • Original quality parameters restored (pyr_scale=0.5, levels=3, winsize=15, iterations=3)
   • Resume functionality and smart skipping maintained
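
The pairing and file-naming scheme used by the flow function can be sketched without OpenCV. Note that the indices in `flow_{i:04d}_{i+1:04d}.npy` are positions in the sorted face list, not original frame numbers. `flow_jobs` is a hypothetical helper and the face file names below are made up for illustration.

```python
def flow_jobs(face_files):
    """Yield (first_face, second_face, flow_filename) for consecutive faces."""
    ordered = sorted(face_files)
    for i in range(len(ordered) - 1):
        yield ordered[i], ordered[i + 1], f"flow_{i:04d}_{i + 1:04d}.npy"


# Hypothetical names; frame 0003 is absent, e.g. because no face was detected.
faces = ["face_frame_0001.jpg", "face_frame_0002.jpg", "face_frame_0004.jpg"]
jobs = list(flow_jobs(faces))

for first, second, name in jobs:
    print(f"{first} -> {second}: {name}")
```

A video with `n` face crops therefore yields `n - 1` flow files, and a frame without a detected face makes one flow span two frame intervals instead of one.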


In [13]:
# Optical Flow Computation Pipeline
print("🌊 Starting OPTIMIZED Optical Flow Computation with Resume")
print("=" * 60)

flow_start_time = time.time()

# RESUME FUNCTIONALITY: Load existing results
existing_flow_stats = load_existing_flow_results()
print(f"📥 Loaded existing results: {existing_flow_stats.get('total_flows', 0)} flows already computed")

flow_stats = {
    'start_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'training_directories': existing_flow_stats.get('training_directories', {}),
    'total_flows': existing_flow_stats.get('total_flows', 0),
    'total_videos_processed': existing_flow_stats.get('total_videos_processed', 0),
    'failed_videos': existing_flow_stats.get('failed_videos', []),
    'optimization_stats': {
        'new_flows': 0,
        'skipped_flows': 0,
        'resume_enabled': True
    }
}

if os.path.exists(faces_base_dir):
    training_dirs = sorted([d for d in os.listdir(faces_base_dir)
                           if os.path.isdir(os.path.join(faces_base_dir, d))])
    
    print(f"Computing optical flow for {len(training_dirs)} training directories...")
    print(f"🔄 Resume mode: Will skip existing optical flow files")
    
    # Process each training directory
    for training_dir in tqdm(training_dirs, desc="Computing Optical Flow"):
        training_faces_path = os.path.join(faces_base_dir, training_dir)
        training_flow_path = os.path.join(flow_base_dir, training_dir)
        
        # Get all video directories
        video_dirs = sorted([d for d in os.listdir(training_faces_path)
                           if os.path.isdir(os.path.join(training_faces_path, d))])
        
        training_flows = 0
        training_skipped = 0
        training_failed = 0
        
        # Check if this training directory was already processed
        existing_training_stats = flow_stats['training_directories'].get(training_dir, {})
        
        # Process each video directory
        for video_dir in tqdm(video_dirs, desc=f"{training_dir} optical flow", leave=False):
            video_faces_path = os.path.join(training_faces_path, video_dir)
            video_flow_path = os.path.join(training_flow_path, video_dir)
            
            try:
                flow_count, skipped_count = compute_optical_flow_for_video_optimized(
                    video_faces_path, video_flow_path, skip_existing=True
                )
                
                if flow_count > 0 or skipped_count > 0:
                    training_flows += flow_count
                    training_skipped += skipped_count
                    if flow_count > 0:  # Only count as processed if new flows were computed
                        flow_stats['total_videos_processed'] += 1
                else:
                    training_failed += 1
                    flow_stats['failed_videos'].append(f"{training_dir}/{video_dir}")
                    
            except Exception as e:
                print(f"Error computing optical flow for {training_dir}/{video_dir}: {e}")
                training_failed += 1
                flow_stats['failed_videos'].append(f"{training_dir}/{video_dir}")
        
        # Update training directory stats
        flow_stats['training_directories'][training_dir] = {
            'videos_processed': len(video_dirs) - training_failed,
            'videos_failed': training_failed,
            'flows_computed': existing_training_stats.get('flows_computed', 0) + training_flows,
            'flows_skipped': training_skipped,
            'new_flows_this_run': training_flows
        }
        
        flow_stats['total_flows'] += training_flows
        flow_stats['optimization_stats']['new_flows'] += training_flows
        flow_stats['optimization_stats']['skipped_flows'] += training_skipped
        
        print(f"✅ {training_dir}: +{training_flows} new flows, {training_skipped} skipped, {len(video_dirs)} videos")
        
        # SAVE PROGRESS after each training directory (checkpoint)
        flow_results_file = 'results/optical_flow_results.json'
        with open(flow_results_file, 'w') as f:
            json.dump(flow_stats, f, indent=2)
    
    # Calculate processing time
    flow_end_time = time.time()
    flow_processing_time = flow_end_time - flow_start_time
    flow_stats['end_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    flow_stats['processing_time_seconds'] = flow_processing_time
    flow_stats['processing_time_formatted'] = str(time.strftime('%H:%M:%S', time.gmtime(flow_processing_time)))
    
    print("\n" + "=" * 60)
    print("🎉 OPTIMIZED OPTICAL FLOW COMPUTATION COMPLETED!")
    print("=" * 60)
    print(f"⏱️  Processing time: {flow_stats['processing_time_formatted']}")
    print(f"🌊 Total optical flows: {flow_stats['total_flows']:,}")
    print(f"🆕 New flows computed: {flow_stats['optimization_stats']['new_flows']:,}")
    print(f"⏭️  Flows skipped (resume): {flow_stats['optimization_stats']['skipped_flows']:,}")
    print(f"📁 Videos processed: {flow_stats['total_videos_processed']}")
    print(f"❌ Failed videos: {len(flow_stats['failed_videos'])}")
    
    # Estimate the time saved by resume from this run's measured per-flow cost
    skipped_flows = flow_stats['optimization_stats']['skipped_flows']
    new_flows = flow_stats['optimization_stats']['new_flows']
    if skipped_flows > 0 and new_flows > 0:
        estimated_saved_time = skipped_flows * (flow_processing_time / new_flows)
        print(f"⚡ Estimated time saved by resume: {estimated_saved_time:.1f} seconds")
    
else:
    print("❌ Faces directory not found! Please run face extraction first.")
    flow_stats['error'] = 'Faces directory not found'

🌊 Starting OPTIMIZED Optical Flow Computation with Resume
📥 Loaded existing results: 0 flows already computed
Computing optical flow for 12 training directories...
🔄 Resume mode: Will skip existing optical flow files


Computing Optical Flow:   0%|          | 0/12 [00:00<?, ?it/s]

training80_01 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_01: +6886 new flows, 0 skipped, 80 videos


training80_02 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_02: +6785 new flows, 0 skipped, 80 videos


training80_03 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_03: +6509 new flows, 0 skipped, 80 videos


training80_04 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_04: +6744 new flows, 0 skipped, 80 videos


training80_05 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_05: +6839 new flows, 0 skipped, 80 videos


training80_06 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_06: +6776 new flows, 0 skipped, 80 videos


training80_07 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_07: +6786 new flows, 0 skipped, 80 videos


training80_08 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_08: +6937 new flows, 0 skipped, 80 videos


training80_09 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_09: +6826 new flows, 0 skipped, 80 videos


training80_10 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_10: +6617 new flows, 0 skipped, 80 videos


training80_11 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_11: +6867 new flows, 0 skipped, 80 videos


training80_12 optical flow:   0%|          | 0/80 [00:00<?, ?it/s]

✅ training80_12: +6758 new flows, 0 skipped, 80 videos

🎉 OPTIMIZED OPTICAL FLOW COMPUTATION COMPLETED!
⏱️  Processing time: 00:14:10
🌊 Total optical flows: 81,330
🆕 New flows computed: 81,330
⏭️  Flows skipped (resume): 0
📁 Videos processed: 960
❌ Failed videos: 0
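
A quick consistency check on these numbers: a video with `n` face crops produces `n - 1` flows, so each directory of 80 videos should report its face count minus 80. The sketch below uses the face and flow counts copied from the logs above, and assumes every video has at least one face crop (otherwise the subtraction would undercount).

```python
VIDEOS_PER_DIR = 80

# Faces per directory (extraction log; 01 and 02 are the "existing faces" counts).
faces = {
    "training80_01": 6966, "training80_02": 6865, "training80_03": 6589,
    "training80_04": 6824, "training80_05": 6919, "training80_06": 6856,
    "training80_07": 6866, "training80_08": 7017, "training80_09": 6906,
    "training80_10": 6697, "training80_11": 6947, "training80_12": 6838,
}

# New flows per directory (optical flow log).
flows = {
    "training80_01": 6886, "training80_02": 6785, "training80_03": 6509,
    "training80_04": 6744, "training80_05": 6839, "training80_06": 6776,
    "training80_07": 6786, "training80_08": 6937, "training80_09": 6826,
    "training80_10": 6617, "training80_11": 6867, "training80_12": 6758,
}

# One flow fewer than faces for each of the 80 videos in a directory.
expected = {d: faces[d] - VIDEOS_PER_DIR for d in faces}
mismatches = {d: (expected[d], flows[d]) for d in faces if expected[d] != flows[d]}

print(f"Total flows: {sum(flows.values()):,}")
print(f"Directories with unexpected flow counts: {len(mismatches)}")
```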


In [14]:
# Save Optical Flow Results
flow_results_file = 'results/optical_flow_results.json'

with open(flow_results_file, 'w') as f:
    json.dump(flow_stats, f, indent=2)

print(f"📊 Optical flow results saved to: {flow_results_file}")

# Create Combined Pipeline Summary
combined_stats = {
    'pipeline_completion_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'gpu_acceleration': len(tf.config.experimental.list_logical_devices('GPU')) > 0,
    'face_extraction': extraction_stats if 'extraction_stats' in locals() else {},
    'optical_flow': flow_stats if 'flow_stats' in locals() else {}
}

combined_results_file = 'results/complete_preprocessing_results.json'
with open(combined_results_file, 'w') as f:
    json.dump(combined_stats, f, indent=2)

print(f"\n📋 Complete pipeline results saved to: {combined_results_file}")

# Display Final Summary
print("\n" + "=" * 70)
print("🏁 COMPLETE PREPROCESSING PIPELINE SUMMARY")
print("=" * 70)

if 'extraction_stats' in locals() and 'flow_stats' in locals():
    total_pipeline_time = (extraction_stats.get('processing_time_seconds', 0) + 
                          flow_stats.get('processing_time_seconds', 0))
    
    print(f"📊 Final Statistics:")
    print(f"   • Total faces extracted: {extraction_stats.get('total_faces', 0):,}")
    print(f"   • Total optical flows: {flow_stats.get('total_flows', 0):,}")
    print(f"   • Videos processed: {extraction_stats.get('total_videos_processed', 0)}")
    print(f"   • GPU acceleration: {'✅ Enabled' if combined_stats['gpu_acceleration'] else '❌ Disabled'}")
    print(f"   • Total pipeline time: {time.strftime('%H:%M:%S', time.gmtime(total_pipeline_time))}")
    
    print(f"\n📁 Output Directories:")
    print(f"   • Faces: {faces_base_dir}")
    print(f"   • Optical Flow: {flow_base_dir}")
    print(f"   • Results: results/")
    
    print(f"\n✅ Preprocessing pipeline completed successfully!")
    print(f"   Ready for feature extraction and model training.")

else:
    print("⚠️ Pipeline incomplete - check error messages above")

📊 Optical flow results saved to: results/optical_flow_results.json

📋 Complete pipeline results saved to: results/complete_preprocessing_results.json

🏁 COMPLETE PREPROCESSING PIPELINE SUMMARY
📊 Final Statistics:
   • Total faces extracted: 68,459
   • Total optical flows: 81,330
   • Videos processed: 800
   • GPU acceleration: ✅ Enabled
   • Total pipeline time: 02:03:39

📁 Output Directories:
   • Faces: data/processed/faces
   • Optical Flow: data/processed/optical_flow
   • Results: results/

✅ Preprocessing pipeline completed successfully!
   Ready for feature extraction and model training.


In [15]:
# Data Verification and Next Steps
print("\n🔍 Data Verification")
print("=" * 30)

# Verify output structure
verification_results = {
    'faces_directory_exists': os.path.exists(faces_base_dir),
    'optical_flow_directory_exists': os.path.exists(flow_base_dir),
    'face_count': 0,
    'flow_count': 0
}

if verification_results['faces_directory_exists']:
    # Count total faces
    for root, dirs, files in os.walk(faces_base_dir):
        verification_results['face_count'] += len([f for f in files if f.endswith('.jpg')])

if verification_results['optical_flow_directory_exists']:
    # Count total optical flow files
    for root, dirs, files in os.walk(flow_base_dir):
        verification_results['flow_count'] += len([f for f in files if f.endswith('.npy')])

print(f"✅ Verification Results:")
print(f"   • Faces directory: {'✅' if verification_results['faces_directory_exists'] else '❌'}")
print(f"   • Optical flow directory: {'✅' if verification_results['optical_flow_directory_exists'] else '❌'}")
print(f"   • Total face images: {verification_results['face_count']:,}")
print(f"   • Total optical flow files: {verification_results['flow_count']:,}")

# Next Steps
print(f"\n🎯 Next Steps:")
print(f"   1. Extract static features (ResNet-50 2D CNN) → 512-dim features")
print(f"   2. Extract dynamic features (I3D 3D CNN) → 256-dim features")
print(f"   3. Validate data alignment (960 samples)")
print(f"   4. Begin Cross-Attention CNN model training")
print(f"\n📖 Ready to proceed with feature extraction and training!")


🔍 Data Verification
✅ Verification Results:
   • Faces directory: ✅
   • Optical flow directory: ✅
   • Total face images: 82,290
   • Total optical flow files: 81,330

🎯 Next Steps:
   1. Extract static features (ResNet-50 2D CNN) → 512-dim features
   2. Extract dynamic features (I3D 3D CNN) → 256-dim features
   3. Validate data alignment (960 samples)
   4. Begin Cross-Attention CNN model training

📖 Ready to proceed with feature extraction and training!
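
The pipeline summary reports 68,459 faces while verification counts 82,290 on disk. The difference is the two directories that were skipped wholesale: their faces are recorded under `skipped_directories` but not added to `total_faces`. A minimal sketch of reconciling the two, using the field names from `extraction_stats` and values copied from the logs above (`faces_on_disk` is a hypothetical helper):

```python
def faces_on_disk(stats):
    """Faces counted in this run plus faces in directories skipped wholesale."""
    skipped = sum(item["existing_faces"] for item in stats["skipped_directories"])
    return stats["total_faces"] + skipped


# Values copied from the extraction log.
stats = {
    "total_faces": 68459,
    "skipped_directories": [
        {"directory": "training80_01", "existing_faces": 6966},
        {"directory": "training80_02", "existing_faces": 6865},
    ],
}

print(f"{faces_on_disk(stats):,}")  # 82,290
```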
