# Data Formatting – Rwanda (Manual Clipping with moviepy)

This notebook loads the **TEACH Final Scores** CSV for *Rwanda*, discovers classroom videos stored on SharePoint, manually splits each full-length video into first and last 15-minute clips using `moviepy`, and saves only these clips to Google Drive. It also attaches clip links to the dataset and exports a final formatted CSV.

**Enhanced Features:**
- ✅ Progress bar with tqdm
- ✅ Progressive saving every 5 videos processed
- ✅ Resume capability from last saved progress
- ✅ Robust error handling and retry logic


### Workflow
1. Detect the runtime and install missing packages.
2. Mount or locate the Google Drive folder.
3. Authenticate to SharePoint using browser cookies.
4. Discover the videos in the SharePoint `Rwanda 2020` folder.
5. Load the TEACH CSV dataset and prepare the output columns.
6. **NEW:** Check for existing progress and resume from the last checkpoint.
7. **NEW:** Process videos with a progress bar, saving progress every 5 videos.
8. Cut the first and last 15 minutes of each video, processing videos in parallel with `ThreadPoolExecutor`.
9. Retry failed downloads and processing, log errors, and skip videos shorter than 30 minutes.
10. Attach clip links to the dataset and save the final CSV.


## Dependencies & Environment Setup

Install required Python packages and configure the environment.

In [None]:
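# moviepy is pinned below 2.0: this notebook uses `moviepy.editor` and `subclip`, which 2.x removed
!pip install -q python-dotenv requests pandas "moviepy<2" tqdm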
!pip install -q google-auth google-auth-oauthlib google-auth-httplib2

In [None]:
# Environment detection & dependency install
import importlib.util
import subprocess, sys, os
from pathlib import Path

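def _is_importable(module):
    # find_spec raises ModuleNotFoundError when a parent package (e.g. `google`) is absent
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:
        return False

IN_COLAB = _is_importable("google.colab")

# pip name -> import name, for packages where the two differ
PIP_TO_IMPORT = {"python-dotenv": "dotenv", "moviepy<2": "moviepy", "google-auth": "google.auth",
                 "google-auth-oauthlib": "google_auth_oauthlib", "google-auth-httplib2": "google_auth_httplib2"}

def _ensure(pkgs):
    missing = [p for p in pkgs if not _is_importable(PIP_TO_IMPORT.get(p, p))]
    if missing:
        print("Installing:", missing)
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])

_ensure(["python-dotenv", "requests", "pandas", "moviepy<2", "tqdm"])
if IN_COLAB:
    _ensure(["google-auth", "google-auth-oauthlib", "google-auth-httplib2"])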

Installing: ['python-dotenv']
Installing: ['google-auth']
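
A pip distribution name does not always match the name used in `import` (`python-dotenv` is imported as `dotenv`), so a check based on the import system can report an installed package as missing. A minimal sketch of an alternative that asks for the installed distribution by its pip name; `is_installed` and `missing_packages` are hypothetical helper names, not part of the notebook.

```python
from importlib import metadata


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this pip name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_packages(dist_names):
    """Keep only the pip names that are not installed in this environment."""
    return [name for name in dist_names if not is_installed(name)]


# Example: list whichever of the notebook's dependencies still need installing.
print(missing_packages(["python-dotenv", "requests", "pandas", "moviepy", "tqdm"]))
```

The resulting list could be passed straight to `pip install`, which avoids keeping a separate table of import names.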


## Paths & Google Drive Mount

Configure the paths for raw data and outputs, and mount Google Drive if running in Colab.

In [None]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    RAW_DIR = Path('/content/drive/My Drive/world bank/data/Rwanda')
else:
    RAW_DIR = Path.cwd()

RAW_CSV = RAW_DIR / 'evals/Teach_Final_Scores_v1(ALL_Scores).csv'
OUTPUT_DIR = RAW_DIR / 'evals/formattedData'
VIDEO_OUTPUT_DIR = RAW_DIR / 'videos'
PROGRESS_FILE = OUTPUT_DIR / 'progress.json'
CHECKPOINT_CSV = OUTPUT_DIR / 'rwanda_manual_clips_checkpoint.csv'

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
VIDEO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"RAW_CSV: {RAW_CSV}\nOUTPUT_DIR: {OUTPUT_DIR}\nVIDEO_OUTPUT_DIR: {VIDEO_OUTPUT_DIR}")
print(f"PROGRESS_FILE: {PROGRESS_FILE}\nCHECKPOINT_CSV: {CHECKPOINT_CSV}")

Mounted at /content/drive
RAW_CSV: /content/drive/My Drive/world bank/data/Rwanda/evals/Teach_Final_Scores_v1(ALL_Scores).csv
OUTPUT_DIR: /content/drive/My Drive/world bank/data/Rwanda/evals/formattedData
VIDEO_OUTPUT_DIR: /content/drive/My Drive/world bank/data/Rwanda/videos
PROGRESS_FILE: /content/drive/My Drive/world bank/data/Rwanda/evals/formattedData/progress.json
CHECKPOINT_CSV: /content/drive/My Drive/world bank/data/Rwanda/evals/formattedData/rwanda_manual_clips_checkpoint.csv


## SharePoint Authentication

Authenticate to SharePoint with the browser session cookies `rtFa`, `SIMI`, and `FedAuth`, supplied as secrets rather than hard-coded in the notebook.

In [None]:
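from dotenv import load_dotenv
import requests, os

load_dotenv()

# Read cookie values from Colab secrets when available, otherwise from the environment / .env
try:
    from google.colab import userdata
    _get_secret = userdata.get
except ImportError:
    _get_secret = os.getenv

# Method 1: Use the exact cookies from your browser
SHAREPOINT_COOKIES = {name: _get_secret(name) for name in ('rtFa', 'SIMI', 'FedAuth')}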

SP_BASE_URL = 'https://worldbankgroup.sharepoint.com/teams/TeachDashboardVideoLibrary-WBGroup'
SP_FOLDER_PATH = '/teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Rwanda 2020'

# Complete headers matching your successful browser request
SP_HEADERS = {
    'Accept': 'application/json;odata=verbose',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    'Referer': 'https://worldbankgroup.sharepoint.com/',
    'X-RequestDigest': '0xF565A91A7BD37466546E94C89866BE8BC7ED8EE11DECF380BB7DB547ED52D9F749B0E629C90EC7955287641B5169B358AF12D615AF1FE996DCCDD0DD65F31311,16 Jun 2025 20:44:18 -0000',
    'Content-Type': 'application/json;odata=verbose'
}

# Test the connection
r = requests.get(f"{SP_BASE_URL}/_api/web", cookies=SHAREPOINT_COOKIES, headers=SP_HEADERS)
print(f"SharePoint connection status: {r.status_code}")

if r.status_code == 200:
    print("✅ Authentication successful!")
    print(f"Response: {r.text[:200]}...")
else:
    print(f"❌ Authentication failed: {r.status_code}")
    print(f"Response: {r.text}")

SharePoint connection status: 200
✅ Authentication successful!
Response: {"d":{"__metadata":{"id":"https://worldbankgroup.sharepoint.com/teams/TeachDashboardVideoLibrary-WBGroup/_api/Web","uri":"https://worldbankgroup.sharepoint.com/teams/TeachDashboardVideoLibrary-WBGroup...
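
A missing cookie otherwise surfaces only as a failed HTTP request, so it can help to validate the three values before the connection test. The sketch below is one way to do that; `load_cookies` and its `lookup` parameter are hypothetical, and the default lookup reads environment variables.

```python
import os

REQUIRED_COOKIES = ("rtFa", "SIMI", "FedAuth")


def load_cookies(lookup=os.environ.get, names=REQUIRED_COOKIES):
    """Collect cookie values via `lookup` and fail fast if any are missing or empty."""
    cookies = {name: lookup(name) for name in names}
    missing = [name for name, value in cookies.items() if not value]
    if missing:
        raise RuntimeError(f"Missing SharePoint cookies: {', '.join(missing)}")
    return cookies


# Example with an in-memory lookup standing in for a secret store.
demo_store = {"rtFa": "aaa", "SIMI": "bbb", "FedAuth": "ccc"}
print(sorted(load_cookies(lookup=demo_store.get)))
```

Any callable that maps a name to a value can serve as `lookup`, so the same check works for a secret store or a `.env` file.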


## SharePoint Video Discovery

Discover the video files under the SharePoint folder. The search starts at `SP_FOLDER_PATH` and descends only into subfolders named `Videos`.

In [None]:
# -----------------------------------------------------------
# 3  ENHANCED SharePoint Functions with Timestamp Metadata
# -----------------------------------------------------------

VIDEO_EXTS = {'.mp4', '.MP4', '.mov', '.MOV', '.mts', '.MTS', '.avi', '.AVI'}

def get_file_metadata_with_timestamps(server_relative_url):
    """Get detailed file metadata including timestamps from SharePoint."""
    try:
        metadata_url = f"{SP_BASE_URL}/_api/web/GetFileByServerRelativeUrl('{server_relative_url}')"
        response = requests.get(metadata_url, cookies=SHAREPOINT_COOKIES, headers=SP_HEADERS)

        if response.status_code == 200:
            data = response.json()['d']
            return {
                'TimeCreated': data.get('TimeCreated'),
                'TimeLastModified': data.get('TimeLastModified'),
                'Length': data.get('Length', 0),
                'UIVersionString': data.get('UIVersionString', '1.0'),
                'success': True
            }
        else:
            return {'success': False, 'error': f"HTTP {response.status_code}"}
    except Exception as e:
        return {'success': False, 'error': str(e)}

def get_folder_contents(folder_path):
    """Get files and subfolders in a SharePoint folder."""
    print(f"📂 Scanning folder: {folder_path}")

    files = []
    folders = []

    try:
        # Get files in folder
        files_url = f"{SP_BASE_URL}/_api/web/GetFolderByServerRelativeUrl('{folder_path}')/Files"
        response = requests.get(files_url, cookies=SHAREPOINT_COOKIES, headers=SP_HEADERS)

        if response.status_code == 200:
            data = response.json()
            files = data['d']['results'] if 'results' in data['d'] else []
            print(f"   📄 Found {len(files)} files")
        else:
            print(f"   ❌ Failed to get files: {response.status_code}")

        # Get subfolders
        folders_url = f"{SP_BASE_URL}/_api/web/GetFolderByServerRelativeUrl('{folder_path}')/Folders"
        response = requests.get(folders_url, cookies=SHAREPOINT_COOKIES, headers=SP_HEADERS)

        if response.status_code == 200:
            data = response.json()
            all_folders = data['d']['results'] if 'results' in data['d'] else []
            # Filter out system folders
            folders = [f for f in all_folders if not f['Name'].startswith('_') and f['Name'] not in ['Forms']]
            print(f"   📁 Found {len(folders)} subfolders")
        else:
            print(f"   ❌ Failed to get folders: {response.status_code}")

    except Exception as e:
        print(f"   💥 Error scanning {folder_path}: {e}")

    return files, folders

def create_sharepoint_url(server_relative_url):
    """Create a standardized SharePoint URL from server relative URL."""
    return f"https://worldbankgroup.sharepoint.com{server_relative_url}"

def discover_videos_recursive(folder_path, discovered_videos, progress_info):
    """Recursively discover all videos in SharePoint folder with enhanced metadata."""
    files, folders = get_folder_contents(folder_path)

    # Process files in current folder
    for file_info in files:
        file_name = file_info['Name']
        file_ext = Path(file_name).suffix

        # Only catalog video files
        if file_ext not in VIDEO_EXTS:
            continue

        progress_info['total_found'] += 1

        # Create standardized SharePoint URL
        sharepoint_url = create_sharepoint_url(file_info['ServerRelativeUrl'])

        # Get enhanced metadata with timestamps
        metadata = get_file_metadata_with_timestamps(file_info['ServerRelativeUrl'])

        # Store video info with metadata
        video_info = {
            'name': file_name,
            'url': sharepoint_url,
            'server_path': file_info['ServerRelativeUrl'],
            'folder_path': folder_path,
            'metadata': metadata
        }

        discovered_videos.append(video_info)

        if progress_info['total_found'] % 10 == 0:
            print(f"   📊 Discovered {progress_info['total_found']} videos so far...")

    # Process subfolders recursively
    # Descend only into subfolders named 'Videos'; other subfolders are ignored
    for folder_info in folders:
        if folder_info['Name'] == 'Videos':
            discover_videos_recursive(folder_info['ServerRelativeUrl'], discovered_videos, progress_info)

print("✅ ENHANCED SharePoint video discovery functions ready")

✅ ENHANCED SharePoint video discovery functions ready
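
Two details of the discovery functions are easy to get wrong: matching extensions regardless of case, and placing a path inside the single-quoted literal of `GetFileByServerRelativeUrl('...')`, where an apostrophe in a file name has to be doubled. The sketch below illustrates both under those assumptions; `is_video`, `odata_quote`, and `file_endpoint` are hypothetical helpers.

```python
from pathlib import PurePosixPath

VIDEO_SUFFIXES = {".mp4", ".mov", ".mts", ".avi"}


def is_video(file_name: str) -> bool:
    """Case-insensitive extension check, so '.Mp4' and '.MP4' both match."""
    return PurePosixPath(file_name).suffix.lower() in VIDEO_SUFFIXES


def odata_quote(value: str) -> str:
    """Escape a string for an OData '...' literal by doubling single quotes."""
    return value.replace("'", "''")


def file_endpoint(base_url: str, server_relative_url: str) -> str:
    """Build the file-metadata endpoint for a server-relative path."""
    return f"{base_url}/_api/web/GetFileByServerRelativeUrl('{odata_quote(server_relative_url)}')"


print(is_video("Lesson 12.Mp4"))
print(file_endpoint("https://example.sharepoint.com/teams/demo", "/teams/demo/Teacher's clip.mp4"))
```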


In [None]:
# -----------------------------------------------------------
# 4  Execute SharePoint Video Discovery with Enhanced Metadata
# -----------------------------------------------------------
import time

def discover_rwanda_videos():
    """Main function to discover all videos in Rwanda 2020 folder with timestamps."""
    print("="*80)
    print("🎬 STARTING ENHANCED SHAREPOINT VIDEO DISCOVERY")
    print("="*80)
    print(f"📍 Source: {SP_FOLDER_PATH}")
    print(f"🎯 Target Extensions: {', '.join(VIDEO_EXTS)}")
    print(f"🕒 Enhanced with timestamp metadata for ordering verification")
    print()

    # Initialize progress tracking
    progress_info = {
        'total_found': 0
    }

    discovered_videos = []
    start_time = time.time()

    try:
        # Start recursive discovery from Rwanda 2020 folder
        discover_videos_recursive(SP_FOLDER_PATH, discovered_videos, progress_info)

        # Calculate final stats
        elapsed = time.time() - start_time

        # Count metadata success rate
        successful_metadata = sum(1 for v in discovered_videos if v['metadata']['success'])

        print()
        print("="*80)
        print("📊 ENHANCED DISCOVERY COMPLETE - SUMMARY")
        print("="*80)
        print(f"🔍 Videos discovered: {len(discovered_videos)}")
        # Guard against division by zero when nothing was discovered
        total = len(discovered_videos)
        rate = successful_metadata / total * 100 if total else 0.0
        print(f"🕒 Metadata retrieved: {successful_metadata}/{total} ({rate:.1f}%)")
        print(f"⏱️  Time elapsed: {elapsed:.1f} seconds")

        if len(discovered_videos) > 0:
            print("\n📋 Sample discovered videos with metadata:")
            for i, video in enumerate(discovered_videos[:3]):
                print(f"   {i+1}. {video['name']}")
                print(f"      URL: {video['url']}")
                if video['metadata']['success']:
                    print(f"      Created: {video['metadata'].get('TimeCreated', 'N/A')}")
                    print(f"      Size: {video['metadata'].get('Length', 'N/A')} bytes")
                else:
                    print(f"      Metadata: {video['metadata']['error']}")
            if len(discovered_videos) > 3:
                print(f"   ... and {len(discovered_videos) - 3} more")

        print("="*80)
        return discovered_videos

    except Exception as e:
        print(f"💥 Discovery failed: {e}")
        return []

# Execute the discovery
discovered_videos = discover_rwanda_videos()

if not discovered_videos:
    print("❌ No videos were discovered. Check your SharePoint access and folder path.")
else:
    print(f"✅ Ready to process {len(discovered_videos)} discovered videos with enhanced metadata!")

🎬 STARTING ENHANCED SHAREPOINT VIDEO DISCOVERY
📍 Source: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Rwanda 2020
🎯 Target Extensions: .avi, .MTS, .mp4, .MOV, .mts, .AVI, .mov, .MP4
🕒 Enhanced with timestamp metadata for ordering verification

📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Rwanda 2020
   📄 Found 9 files
   📁 Found 1 subfolders
📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Rwanda 2020/Videos
   📄 Found 199 files
   📁 Found 2 subfolders
   📊 Discovered 10 videos so far...
   📊 Discovered 20 videos so far...
   📊 Discovered 30 videos so far...
   📊 Discovered 40 videos so far...
   📊 Discovered 50 videos so far...
   📊 Discovered 60 videos so far...
   📊 Discovered 70 videos so far...
   📊 Discovered 80 videos so far...
   📊 Discovered 90 videos so far...
   📊 Discovered 100 videos so far...
   📊 Discovered 110 videos so far...
   📊 Discovered 120 videos so far...
   [output truncated]
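
The discovery step collects `TimeCreated` for ordering verification. A minimal sketch of how the discovered entries could be sorted by that field, assuming the timestamps are ISO 8601 strings such as `2020-03-05T10:21:33Z` (the exact format is an assumption); `parse_sp_time` and `sort_by_created` are hypothetical helpers.

```python
from datetime import datetime


def parse_sp_time(value):
    """Parse an ISO 8601 timestamp; return None when the value is absent or malformed."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def sort_by_created(videos):
    """Order videos by creation time; entries without a usable timestamp go last."""
    def key(video):
        created = parse_sp_time(video.get("metadata", {}).get("TimeCreated"))
        return (created is None, created.timestamp() if created else 0.0)
    return sorted(videos, key=key)


demo = [
    {"name": "b.mp4", "metadata": {"TimeCreated": "2020-03-05T10:21:33Z"}},
    {"name": "c.mp4", "metadata": {"success": False}},
    {"name": "a.mp4", "metadata": {"TimeCreated": "2020-01-01T00:00:00Z"}},
]
print([v["name"] for v in sort_by_created(demo)])
```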

## Load TEACH Dataset & Prepare Clip Columns

Load the CSV and ensure `First Video Clip` and `Last Video Clip` columns exist.

In [None]:
import pandas as pd

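def load_dataset(path):
    """Load the TEACH CSV, building column names from its first two header rows."""
    lines = path.read_text(encoding='latin-1').splitlines()
    h1 = [h.strip() for h in lines[0].split(',')]
    h2 = [h.strip() for h in lines[1].split(',')]
    # The first three names come from header row 1, the rest from header row 2
    base = h1[:3] + h2[3:]
    cols, seen = [], {}
    for c in base:
        if not c:
            c = 'Unnamed'
        # Suffix repeated names (_1, _2, ...) so every column name is unique
        seen[c] = seen.get(c, 0)
        cols.append(c if seen[c] == 0 else f"{c}_{seen[c]}")
        seen[c] += 1
    # File rows 0 and 2 are skipped; `cols` supplies the column names
    return pd.read_csv(path, header=None, skiprows=[0, 2], names=cols, encoding='latin-1')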

print(f"Loading dataset from {RAW_CSV}")
df=load_dataset(RAW_CSV)
for col in ['First Video Clip','Last Video Clip']:
    if col not in df.columns: df[col]=''
print('Dataset prepared.')

Loading dataset from /content/drive/My Drive/world bank/data/Rwanda/evals/Teach_Final_Scores_v1(ALL_Scores).csv
Dataset prepared.
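
The loader merges two header rows and de-duplicates the resulting names. The sketch below isolates that step and reads the header rows with the `csv` module, so a quoted header containing a comma stays in one piece. `merge_headers` and `read_header_rows` are hypothetical names, and the sample text is made up for illustration.

```python
import csv
import io


def read_header_rows(text, n=2):
    """Return the first `n` CSV rows as lists, honouring quoted fields."""
    reader = csv.reader(io.StringIO(text))
    return [next(reader) for _ in range(n)]


def merge_headers(row1, row2, keep_from_first=3):
    """Take the leading names from row1, the rest from row2, and make them unique."""
    base = row1[:keep_from_first] + row2[keep_from_first:]
    names, seen = [], {}
    for raw in base:
        name = raw.strip() or "Unnamed"
        count = seen.get(name, 0)
        names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return names


sample = 'ID,School,"Clip, name",,\n,,,Score,Score\n1,2,3,4,5\n'
print(merge_headers(*read_header_rows(sample)))
```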


## Progress Management & Resume Capability

Define functions to save and load progress, allowing the notebook to resume from where it left off.

In [None]:
import json
import re
import pandas as pd
from datetime import datetime

_ID_RE = re.compile(r"(\d{6,7})")

def save_progress(processed_ids, results, checkpoint_number):
    """Save current progress to JSON file and checkpoint CSV."""
    # Snapshot first: worker threads may still be adding entries while we save
    processed_snapshot = list(processed_ids)
    results_snapshot = dict(results)

    progress_data = {
        'timestamp': datetime.now().isoformat(),
        'checkpoint_number': checkpoint_number,
        'processed_ids': processed_snapshot,
        'results': results_snapshot,
        'total_processed': len(processed_snapshot)
    }

    # Save progress JSON
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(progress_data, f, indent=2)

    # Save checkpoint CSV: copy the global `df` and fill in the clip paths found so far
    df_copy = df.copy()
    for idx, row in df_copy.iterrows():
        m = _ID_RE.search(str(row.get('School_Clip', '')))
        r = results_snapshot.get(m.group(1)) if m else None
        if r:
            if r['first']: df_copy.at[idx, 'First Video Clip'] = r['first']
            if r['last']: df_copy.at[idx, 'Last Video Clip'] = r['last']

    df_copy.to_csv(CHECKPOINT_CSV, index=False)
    print(f"💾 Progress saved: {len(processed_snapshot)} videos processed (Checkpoint #{checkpoint_number})")

def load_progress():
    """Load existing progress if available."""
    if PROGRESS_FILE.exists():
        try:
            with open(PROGRESS_FILE, 'r') as f:
                progress_data = json.load(f)
            
            processed_ids = set(progress_data.get('processed_ids', []))
            results = progress_data.get('results', {})
            checkpoint_number = progress_data.get('checkpoint_number', 0)
            
            print(f"📂 Resuming from checkpoint #{checkpoint_number}")
            print(f"🔄 Already processed: {len(processed_ids)} videos")
            
            return processed_ids, results, checkpoint_number
            
        except Exception as e:
            print(f"⚠️  Could not load progress file: {e}")
            print("🆕 Starting fresh...")
    else:
        print("🆕 No previous progress found. Starting fresh...")
    
    return set(), {}, 0

def get_pending_videos(discovered_videos, processed_ids):
    """Get list of videos that haven't been processed yet."""
    import re
    id_re = re.compile(r"(\d{6,7})")
    
    pending_videos = []
    for video in discovered_videos:
        # Extract ID from video name or URL
        match = id_re.search(video.get('name', '') or video.get('url', ''))
        if match:
            ident = match.group(1)
            if ident not in processed_ids:
                pending_videos.append((ident, video))
    
    return pending_videos

print("✅ Progress management functions ready")
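
Resuming depends on the progress file being readable, and a run interrupted in the middle of a write can leave truncated JSON behind. One common mitigation, sketched below, is to write to a temporary file in the same folder and then rename it over the target. `write_json_atomic` is a hypothetical helper, and whether the rename is atomic on a mounted Drive folder is not guaranteed.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload):
    """Write JSON to a temp file next to `path`, then rename it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp_name, path)  # replaces any existing file in a single step
    except BaseException:
        # Leave the previous progress file untouched and clean up the temp file
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "progress.json"
    write_json_atomic(target, {"checkpoint_number": 1, "processed_ids": ["123456"]})
    print(json.loads(target.read_text(encoding="utf-8")))
```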

## Enhanced Video Processing with Progress Tracking

Process videos with a progress bar, incremental saving every 5 videos, and resume capability.
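
The save-every-5 pattern can be shown in isolation from the download and clipping work. The sketch below runs placeholder tasks in a thread pool, counts successes in the main thread as results arrive, and calls a save callback after every fifth success plus once at the end. `run_with_checkpoints`, `work`, and `save` are hypothetical stand-ins for the notebook's own functions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_checkpoints(items, work, save, every=5, max_workers=4):
    """Run `work(item)` in threads; call `save(done)` after every `every` successes."""
    done = []
    since_save = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                ok = future.result()
            except Exception:
                ok = False  # a crashed task counts as a failure, not as progress
            if ok:
                done.append(item)
                since_save += 1
                if since_save >= every:
                    save(list(done))
                    since_save = 0
    if since_save:
        save(list(done))  # final save for the remainder
    return done


checkpoints = []
run_with_checkpoints(range(12), lambda item: True, checkpoints.append)
print([len(c) for c in checkpoints])
```

Because the counter and the `done` list are only touched in the main thread, this part needs no lock; shared state that the workers write to still does.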

In [None]:
import re
import tempfile
import os
import logging
from moviepy.editor import VideoFileClip
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from urllib.parse import urljoin
import time
import threading

# ─── Logging Setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

# ─── Compile ID Regex ──────────────────────────────────────────────────────────
# Match the first run of six or seven digits (this also matches inside longer digit runs)
id_re = re.compile(r"(\d{6,7})")

# ─── Progress tracking variables ────────────────────────────────────────────────
progress_lock = threading.Lock()

def download_video(srv_rel_url, target):
    """
    Download a video from SharePoint (or any HTTP URL) into `target`.
    Retries up to 3 times on failure, logging each attempt.
    """
    # 1) Build the correct URL
    if isinstance(srv_rel_url, str) and srv_rel_url.lower().startswith("http"):
        url = srv_rel_url
    else:
        url = urljoin("https://worldbankgroup.sharepoint.com", srv_rel_url)

    logger.debug("Downloading from URL: %s → %s", srv_rel_url, url)

    # 2) Attempt download up to 3 times
    for attempt in range(1, 4):
        try:
            with requests.get(url,
                              cookies=SHAREPOINT_COOKIES,
                              headers=SP_HEADERS,
                              stream=True) as response:
                response.raise_for_status()
                with open(target, "wb") as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
            logger.info("✅ Download succeeded (%s) → %s", attempt, target)
            return True

        except Exception as e:
            logger.error("❌ Download error on attempt %d for URL %s: %s", attempt, url, str(e))
            if attempt < 3:
                time.sleep(2)
            else:
                return False

def process_single_video(video_id, video_metadata, pbar, errors, skipped, results, processed_ids, checkpoint_number):
    """
    Process a single video: download, split into clips, and update progress.
    Returns True if successful, False otherwise.
    """
    srv_rel_url = video_metadata.get('url')
    name = video_metadata.get('name', '<unknown>')

    logger.info("⏳ Starting processing of %s (ID %s)", name, video_id)

    # Prepare temp file and output paths; clips are written to the videos folder
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4").name
    out1 = VIDEO_OUTPUT_DIR / f"{video_id}_first.mp4"
    out2 = VIDEO_OUTPUT_DIR / f"{video_id}_last.mp4"
    clip_seconds = 15 * 60  # length of each clip

    # Process with up to 3 retries
    for attempt in range(1, 4):
        clip = None
        try:
            # Download video
            if not download_video(srv_rel_url, tmp):
                raise Exception("Download failed")

            logger.debug("Downloaded %s to %s", name, tmp)

            # Process video
            clip = VideoFileClip(tmp)

            # Skip videos shorter than 30 minutes (the two clips would overlap)
            if clip.duration < 2 * clip_seconds:
                logger.warning("⏭️  Skipping %s - too short (%.1f minutes)", name, clip.duration/60)
                with progress_lock:
                    skipped.append((video_id, f"Too short: {clip.duration/60:.1f} min"))
                return False

            # Cut the first and last 15 minutes
            c1 = clip.subclip(0, clip_seconds)
            c2 = clip.subclip(clip.duration - clip_seconds, clip.duration)

            # Write clips
            c1.write_videofile(str(out1), audio_codec="aac", verbose=False, logger=None)
            c2.write_videofile(str(out2), audio_codec="aac", verbose=False, logger=None)

            logger.info("✅ Finished clipping %s into %s and %s", name, out1, out2)

            # Update results
            with progress_lock:
                results[video_id] = {"first": str(out1), "last": str(out2)}
                processed_ids.add(video_id)

            return True

        except Exception as e:
            logger.error(
                "❌ Error on attempt %d processing %s (ID %s): %s",
                attempt, name, video_id, str(e)
            )
            if attempt == 3:
                with progress_lock:
                    errors.append((video_id, f"Failed after 3 attempts: {str(e)}"))
                return False
            time.sleep(2)  # Wait before retry
        finally:
            # Release the source video and remove the temp file after every attempt
            if clip is not None:
                clip.close()
            if os.path.exists(tmp):
                try:
                    os.remove(tmp)
                except OSError:
                    pass

    return False

# ─── Main Processing Function with Progress Bar ──────────────────────────────────

def process_videos_with_progress():
    """
    Main function to process videos with progress bar and incremental saving.
    """
    print("\n" + "="*80)
    print("🎬 STARTING VIDEO PROCESSING WITH PROGRESS TRACKING")
    print("="*80)
    
    # Load existing progress
    processed_ids, results, checkpoint_number = load_progress()
    
    # Get pending videos
    pending_videos = get_pending_videos(discovered_videos, processed_ids)
    
    if not pending_videos:
        print("✅ All videos have already been processed!")
        return results, [], []
    
    print(f"📊 Total videos discovered: {len(discovered_videos)}")
    print(f"✅ Already processed: {len(processed_ids)}")
    print(f"⏳ Remaining to process: {len(pending_videos)}")
    print(f"💾 Save frequency: Every 5 videos")
    print()
    
    # Initialize tracking
    errors = []
    skipped = []
    videos_since_last_save = 0
    
    # Create progress bar
    pbar = tqdm(
        total=len(pending_videos),
        desc="Processing Videos",
        unit="video",
        position=0,
        leave=True
    )
    
    # Process videos with ThreadPoolExecutor
    max_workers = min(4, os.cpu_count() or 1)  # Limit concurrent downloads; cpu_count() can return None
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all jobs
        future_to_video = {
            executor.submit(
                process_single_video,
                video_id, video_metadata, pbar, errors, skipped, 
                results, processed_ids, checkpoint_number
            ): (video_id, video_metadata)
            for video_id, video_metadata in pending_videos
        }
        
        # Process completed jobs
        for future in as_completed(future_to_video):
            video_id, video_metadata = future_to_video[future]
            
            try:
                success = future.result()
                
                # Update progress bar
                pbar.update(1)
                
                if success:
                    videos_since_last_save += 1
                    pbar.set_postfix_str(f"✅ Processed: {len(results)}, Errors: {len(errors)}, Skipped: {len(skipped)}")
                    
                    # Save progress every 5 videos
                    if videos_since_last_save >= 5:
                        checkpoint_number += 1
                        save_progress(processed_ids, results, checkpoint_number)
                        videos_since_last_save = 0
                else:
                    pbar.set_postfix_str(f"⚠️  Processed: {len(results)}, Errors: {len(errors)}, Skipped: {len(skipped)}")
                
            except Exception as e:
                logger.error(f"💥 Unexpected error processing {video_id}: {e}")
                with progress_lock:
                    errors.append((video_id, f"Unexpected error: {str(e)}"))
                pbar.update(1)
    
    pbar.close()
    
    # Final save
    if videos_since_last_save > 0:
        checkpoint_number += 1
        save_progress(processed_ids, results, checkpoint_number)
    
    # Print summary
    print("\n" + "="*80)
    print("📊 PROCESSING COMPLETE - FINAL SUMMARY")
    print("="*80)
    print(f"✅ Successfully processed: {len(results)} videos")
    print(f"⏭️  Skipped (too short): {len(skipped)} videos")
    print(f"❌ Errors: {len(errors)} videos")
    print(f"📁 Total files created: {len(results) * 2} clips")
    
    if errors:
        print("\n❌ Error Summary:")
        for video_id, error_msg in errors[:5]:  # Show first 5 errors
            print(f"   - {video_id}: {error_msg}")
        if len(errors) > 5:
            print(f"   ... and {len(errors) - 5} more errors")
    
    if skipped:
        print("\n⏭️  Skipped Summary:")
        for video_id, skip_reason in skipped[:5]:  # Show first 5 skipped
            print(f"   - {video_id}: {skip_reason}")
        if len(skipped) > 5:
            print(f"   ... and {len(skipped) - 5} more skipped")
    
    print("="*80)
    
    return results, errors, skipped

# ─── Execute Processing ─────────────────────────────────────────────────────────

if discovered_videos:
    results, errors, skipped = process_videos_with_progress()
else:
    print("❌ No videos discovered. Cannot proceed with processing.")
    results, errors, skipped = {}, [], []
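
The notebook's description calls for the first and last 15 minutes of each video, with anything shorter than 30 minutes skipped. The window arithmetic is small enough to check on its own; `clip_windows` below is a hypothetical helper that returns the two `(start, end)` pairs in seconds, or `None` when the video is too short.

```python
def clip_windows(duration, clip_seconds=15 * 60):
    """Return ((start, end), (start, end)) for the first and last clips, or None."""
    if duration < 2 * clip_seconds:
        return None  # the two windows would overlap
    first = (0.0, float(clip_seconds))
    last = (float(duration - clip_seconds), float(duration))
    return first, last


print(clip_windows(3600))  # a 60-minute video
print(clip_windows(1500))  # a 25-minute video is skipped
```

For a video of exactly 30 minutes the two windows meet at the midpoint, which is the only case where this matches a split in half.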





## Attach Clip Links & Save Final CSV

Populate the DataFrame with clip paths and export the final CSV.

In [None]:
print("\n" + "="*60)
print("📝 FINALIZING DATASET AND SAVING RESULTS")
print("="*60)

# Update DataFrame with clip links
updated_rows = 0
for idx, row in df.iterrows():
    m = id_re.search(str(row.get('School_Clip', '')))
    if m:
        ident = m.group(1)
        r = results.get(ident)
        if r:
            if r['first']: 
                df.at[idx, 'First Video Clip'] = r['first']
            if r['last']: 
                df.at[idx, 'Last Video Clip'] = r['last']
            updated_rows += 1

print(f"📊 Updated {updated_rows} rows with clip links")

# Save final CSV
out_csv = OUTPUT_DIR / 'rwanda_manual_clips_final.csv'
df.to_csv(out_csv, index=False)
print(f"💾 Saved final CSV: {out_csv}")

# Save summary report
summary_file = OUTPUT_DIR / 'processing_summary.txt'
with open(summary_file, 'w') as f:
    f.write("Rwanda Video Processing Summary\n")
    f.write("=" * 40 + "\n\n")
    f.write(f"Total videos discovered: {len(discovered_videos)}\n")
    f.write(f"Successfully processed: {len(results)}\n")
    f.write(f"Skipped (too short): {len(skipped)}\n")
    f.write(f"Errors: {len(errors)}\n")
    f.write(f"Dataset rows updated: {updated_rows}\n")
    f.write(f"Total clips created: {len(results) * 2}\n\n")
    
    if errors:
        f.write("Errors:\n")
        for video_id, error_msg in errors:
            f.write(f"  - {video_id}: {error_msg}\n")
        f.write("\n")
    
    if skipped:
        f.write("Skipped videos:\n")
        for video_id, skip_reason in skipped:
            f.write(f"  - {video_id}: {skip_reason}\n")

print(f"📋 Saved processing summary: {summary_file}")

print("\n✅ All processing complete!")
print(f"📁 Final CSV: {out_csv}")
print(f"📁 Checkpoint CSV: {CHECKPOINT_CSV}")
print(f"📁 Progress file: {PROGRESS_FILE}")
print(f"📁 Summary: {summary_file}")
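
Matching dataset rows to clips depends on pulling a 6 or 7 digit ID out of the `School_Clip` value. The pattern `(\d{6,7})` also matches the first digits of a longer number, so the sketch below adds lookarounds that reject longer digit runs. `extract_id` and `attach_clips` are hypothetical helpers operating on plain dictionaries rather than the DataFrame, and the sample values are made up.

```python
import re

# Six or seven digits that are not part of a longer run of digits
ID_RE = re.compile(r"(?<!\d)(\d{6,7})(?!\d)")


def extract_id(text):
    """Return the first standalone 6-7 digit ID in `text`, or None."""
    match = ID_RE.search(str(text))
    return match.group(1) if match else None


def attach_clips(rows, results, key="School_Clip"):
    """Fill the clip columns of each row whose ID has a result; return the count."""
    updated = 0
    for row in rows:
        ident = extract_id(row.get(key, ""))
        clip = results.get(ident) if ident else None
        if clip:
            row["First Video Clip"] = clip["first"]
            row["Last Video Clip"] = clip["last"]
            updated += 1
    return updated


rows = [{"School_Clip": "123456 Clip 1"}, {"School_Clip": "no id here"}]
results = {"123456": {"first": "123456_first.mp4", "last": "123456_last.mp4"}}
print(attach_clips(rows, results), rows[0]["First Video Clip"])
```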

## Cleanup & Verification

Optional cleanup and verification of the processed clips.

In [None]:
# Optional: Verify created clips
def verify_clips():
    """Verify that all expected clip files were created."""
    print("\n🔍 Verifying created clips...")
    
    missing_clips = []
    total_expected = len(results) * 2
    found_clips = 0
    
    for video_id, clip_paths in results.items():
        for clip_type, clip_path in clip_paths.items():
            if os.path.exists(clip_path):
                found_clips += 1
            else:
                missing_clips.append(f"{video_id}_{clip_type}: {clip_path}")
    
    print(f"📊 Clip verification results:")
    print(f"   Expected clips: {total_expected}")
    print(f"   Found clips: {found_clips}")
    print(f"   Missing clips: {len(missing_clips)}")
    
    if missing_clips:
        print("\n❌ Missing clips:")
        for missing in missing_clips[:10]:  # Show first 10
            print(f"   - {missing}")
        if len(missing_clips) > 10:
            print(f"   ... and {len(missing_clips) - 10} more")
    else:
        print("✅ All expected clips found!")

# Run verification
if results:
    verify_clips()

print("\n🎉 Rwanda video processing notebook completed successfully!")
print("\n📖 To resume processing later:")
print("   1. Run all cells up to 'Enhanced Video Processing'")
print("   2. The notebook will automatically resume from the last checkpoint")
print("   3. Progress is saved every 5 videos and can be found in:")
print(f"      - {PROGRESS_FILE}")
print(f"      - {CHECKPOINT_CSV}")
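
An existence check alone does not catch a clip that was created but never written to, for example after an interrupted encode. A minimal sketch of a stricter check that also flags empty files; `find_bad_clips` is a hypothetical helper that takes a mapping shaped like `results`.

```python
from pathlib import Path


def find_bad_clips(results, min_bytes=1):
    """Return (video_id, clip_type, reason) for clips that are missing or too small."""
    bad = []
    for video_id, paths in results.items():
        for clip_type, clip_path in paths.items():
            path = Path(clip_path)
            if not path.is_file():
                bad.append((video_id, clip_type, "missing"))
            elif path.stat().st_size < min_bytes:
                bad.append((video_id, clip_type, "empty"))
    return bad


# Example with a path that does not exist
print(find_bad_clips({"123456": {"first": "does_not_exist_first.mp4"}}))
```

Raising `min_bytes` gives a rough guard against clips that were truncated early, though it cannot confirm that a file is playable.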