# Data Formatting – Peru (FIXED WITH TIMESTAMP ORDERING)

This notebook loads the **TEACH Final Scores** CSV for *Peru*, crawls every classroom **video** stored under **SharePoint ▸ General ▸ Peru 2019** (World Bank tenant) using cookie-based authentication, and matches videos to evaluations by adding standardized SharePoint URLs to the CSV.

**KEY IMPROVEMENTS:**
- Uses SharePoint file timestamps to verify clip ordering
- Only saves FIRST and LAST clips (no intermediate clips)
- Resolves ordering conflicts using timestamp priority
- Provides detailed ordering confidence tracking
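
The timestamp-priority rule above can be illustrated with a minimal sketch. The `order_clips` helper and the sample records are hypothetical and are not the notebook's own cells; the sketch assumes every clip carries both a filename-derived clip number and a creation time. Keep in mind that a SharePoint creation time typically reflects when a file was uploaded, which may differ from the order in which clips were recorded.

```python
from datetime import datetime, timezone

def order_clips(clips):
    """Return (ordered_clips, method); timestamps win when the two orderings disagree."""
    by_filename = sorted(clips, key=lambda c: c["clip_num"])
    by_timestamp = sorted(clips, key=lambda c: c["created"])
    aligned = [c["name"] for c in by_filename] == [c["name"] for c in by_timestamp]
    return by_timestamp, ("aligned" if aligned else "timestamp_priority")

# Hypothetical records: the filename numbering disagrees with the chronology.
clips = [
    {"name": "123456_clip1.mp4", "clip_num": 1,
     "created": datetime(2019, 10, 3, 15, 0, tzinfo=timezone.utc)},
    {"name": "123456_clip2.mp4", "clip_num": 2,
     "created": datetime(2019, 10, 3, 14, 0, tzinfo=timezone.utc)},
]

ordered, method = order_clips(clips)
first_clip, last_clip = ordered[0]["name"], ordered[-1]["name"]
print(method, first_clip, last_clip)
```

Only the first and last entries of the ordered list are kept, so intermediate clips never reach the CSV.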

It auto-detects whether you're running in **Google Colab** or **locally (VS Code/Jupyter)**:

* **Colab**  → CSVs and outputs live on *Google Drive* (mounted).
* **Local**  → CSV is beside this notebook (`new/rawData/Peru`) and outputs go to `new/formattedData`.
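
The runtime switch can be sketched as a small, testable helper. This is an illustration under assumptions rather than the notebook's actual cell: `in_colab` and `resolve_paths` are hypothetical names, and the paths simply mirror the ones described above.

```python
import importlib.util
from pathlib import Path

CSV_NAME = "TEACH_Final_Scores_4 - Peru.csv"

def in_colab():
    """True when the google.colab module can be located."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # find_spec imports the parent package first; `google` may not exist locally
        return False

def resolve_paths(colab, cwd):
    """Return (raw_csv, output_dir) for the given runtime."""
    if colab:
        evals = Path("/content/drive/My Drive/world bank/data/Peru") / "evals"
        return evals / CSV_NAME, evals / "formattedData"
    # Local: the CSV sits beside the notebook, outputs go two levels up
    return cwd / CSV_NAME, cwd.parent.parent / "formattedData"

raw_csv, output_dir = resolve_paths(in_colab(), Path.cwd())
print(raw_csv)
print(output_dir)
```

Keeping the path logic in a pure function makes it possible to check both branches without actually running in Colab.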


### Workflow
1. Detect runtime, install missing packages.
2. Mount / locate Google Drive **or** local paths.
3. Authenticate to SharePoint using browser cookies (env var `cookie`).
4. Recursively discover every video from SharePoint Peru 2019 folder with **enhanced timestamp metadata**.
5. **NEW**: Verify clip ordering using timestamp chronology vs filename patterns.
6. **NEW**: Select only FIRST and LAST clips for each evaluation.
7. Write final CSV with only first/last SharePoint video URLs to the appropriate output folder.

In [1]:
!pip install -q python-dotenv requests pandas
!pip install -q google-auth google-auth-oauthlib google-auth-httplib2

In [3]:
# -----------------------------------------------------------
# Environment detection & dependency install
# -----------------------------------------------------------
import importlib.util
import subprocess
import sys
import os
import re
import json
import textwrap
import time
import requests
from pathlib import Path
from urllib.parse import unquote
from datetime import datetime
import statistics

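try:
    # find_spec imports the parent package first and raises if `google` is absent
    IN_COLAB = importlib.util.find_spec("google.colab") is not None
except ModuleNotFoundError:
    IN_COLAB = False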

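# pip distribution name -> importable module name, where the two differ
_IMPORT_NAMES = {"python-dotenv": "dotenv", "google-auth": "google.auth"}

def _is_installed(pkg):
    """Check whether the module provided by a pip package can be located."""
    module = _IMPORT_NAMES.get(pkg, pkg.replace('-', '_'))
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:  # parent package of a dotted name is missing
        return False

def _ensure(pkgs):
    """Install any packages whose import module cannot be found."""
    missing = [p for p in pkgs if not _is_installed(p)]
    if missing:
        print("Installing:", ", ".join(missing))
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])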

# Install required packages
_ensure(["python-dotenv", "requests", "pandas"])

if IN_COLAB:
    # Colab sometimes needs these after a fresh kernel
    _ensure(["google-auth", "google-auth-oauthlib", "google-auth-httplib2"])

Installing: python-dotenv


In [3]:
# -----------------------------------------------------------
# 1  Paths: Google Drive (Colab) ▶ or local folders
# -----------------------------------------------------------
if IN_COLAB:
    from google.colab import drive as _gdrive
    _gdrive.mount('/content/drive')
    RAW_DIR      = Path('/content/drive/My Drive/world bank/data/Peru')
    RAW_CSV      = RAW_DIR / 'evals/TEACH_Final_Scores_4 - Peru.csv'
    OUTPUT_DIR   = RAW_DIR / 'evals/formattedData'
else:
    NB_DIR       = Path.cwd()                         # new/rawData/Peru
    RAW_DIR      = NB_DIR                             # csv sits here
    RAW_CSV      = RAW_DIR / 'TEACH_Final_Scores_4 - Peru.csv'
    OUTPUT_DIR   = NB_DIR.parent.parent / 'formattedData'

OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

print("Running in:", "Colab" if IN_COLAB else "Local Jupyter/VS Code")
print("RAW_CSV  →", RAW_CSV)
print("OUTPUT   →", OUTPUT_DIR)

Running in: Local Jupyter/VS Code
RAW_CSV  → /Users/mkrasnow/Desktop/montesa/new/rawData/Peru/TEACH_Final_Scores_4 - Peru.csv
OUTPUT   → /Users/mkrasnow/Desktop/montesa/new/formattedData


In [4]:
# -----------------------------------------------------------
# 2  SharePoint Authentication Setup using Browser Cookies
# -----------------------------------------------------------
from dotenv import load_dotenv
import os

load_dotenv()

# Get cookies from environment variable
cookie_string = os.getenv('cookie')

if not cookie_string:
    raise RuntimeError("""
    ❗ Set 'cookie' environment variable with your browser cookies.
    
    To get cookies:
    1. Go to SharePoint site in browser
    2. Press F12 → Network tab → Clear → Refresh page
    3. Click any request to worldbankgroup.sharepoint.com
    4. Copy the complete 'Cookie:' line from Request Headers
    5. Set as environment variable: export cookie="your_cookie_string"
    """)

# Parse cookies into dictionary
cookies = {}
for item in cookie_string.split(';'):
    if '=' in item:
        key, value = item.strip().split('=', 1)
        cookies[key] = value

print(f"✅ Loaded {len(cookies)} cookies for SharePoint authentication")

# SharePoint configuration
SP_BASE_URL = 'https://worldbankgroup.sharepoint.com/teams/TeachDashboardVideoLibrary-WBGroup'
SP_FOLDER_PATH = '/teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019'

# Standard headers for SharePoint requests
SP_HEADERS = {
    'Accept': 'application/json;odata=verbose',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    'Referer': 'https://worldbankgroup.sharepoint.com/'
}

# Test connection
test_url = f"{SP_BASE_URL}/_api/web"
response = requests.get(test_url, cookies=cookies, headers=SP_HEADERS)
print(f"🔗 SharePoint connection test: {response.status_code}")

if response.status_code == 200:
    site_data = response.json()
    print(f"📍 Connected to: {site_data['d']['Title']}")
else:
    raise RuntimeError(f"Failed to connect to SharePoint: {response.status_code} - {response.text[:200]}")

✅ Loaded 16 cookies for SharePoint authentication
🔗 SharePoint connection test: 200
📍 Connected to: Teach Dashboard Video Library - WB Group
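
The cookie handling in the cell above can be checked offline with a small sketch. `parse_cookie_header` is a hypothetical helper, and the cookie values are made up; it additionally assumes the pasted string may still carry the leading `Cookie:` label, since the instructions ask for the complete header line.

```python
def parse_cookie_header(header):
    """Turn a raw Cookie header string into a {name: value} dict."""
    header = header.strip()
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]  # drop the header label if it was pasted too
    cookies = {}
    for item in header.split(";"):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = item.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies

# Made-up values; real SharePoint cookies are long opaque tokens.
parsed = parse_cookie_header("Cookie: FedAuth=abc123==; rtFa=xyz; malformed")
print(parsed)
```

Fragments without an `=` are skipped rather than raising, which matches the tolerant behaviour of the loop in the cell.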


In [5]:
# -----------------------------------------------------------
# 3  ENHANCED SharePoint Functions with Timestamp Metadata
# -----------------------------------------------------------

VIDEO_EXTS = {'.mp4', '.MP4', '.mov', '.MOV', '.mts', '.MTS', '.avi', '.AVI'}

def get_file_metadata_with_timestamps(server_relative_url):
    """Get detailed file metadata including timestamps from SharePoint."""
    try:
        metadata_url = f"{SP_BASE_URL}/_api/web/GetFileByServerRelativeUrl('{server_relative_url}')"
        response = requests.get(metadata_url, cookies=cookies, headers=SP_HEADERS)
        
        if response.status_code == 200:
            data = response.json()['d']
            return {
                'TimeCreated': data.get('TimeCreated'),
                'TimeLastModified': data.get('TimeLastModified'),
                'Length': data.get('Length', 0),
                'UIVersionString': data.get('UIVersionString', '1.0'),
                'success': True
            }
        else:
            return {'success': False, 'error': f"HTTP {response.status_code}"}
    except Exception as e:
        return {'success': False, 'error': str(e)}

def get_folder_contents(folder_path):
    """Get files and subfolders in a SharePoint folder."""
    print(f"📂 Scanning folder: {folder_path}")
    
    files = []
    folders = []
    
    try:
        # Get files in folder
        files_url = f"{SP_BASE_URL}/_api/web/GetFolderByServerRelativeUrl('{folder_path}')/Files"
        response = requests.get(files_url, cookies=cookies, headers=SP_HEADERS)
        
        if response.status_code == 200:
            data = response.json()
            files = data['d']['results'] if 'results' in data['d'] else []
            print(f"   📄 Found {len(files)} files")
        else:
            print(f"   ❌ Failed to get files: {response.status_code}")
        
        # Get subfolders
        folders_url = f"{SP_BASE_URL}/_api/web/GetFolderByServerRelativeUrl('{folder_path}')/Folders"
        response = requests.get(folders_url, cookies=cookies, headers=SP_HEADERS)
        
        if response.status_code == 200:
            data = response.json()
            all_folders = data['d']['results'] if 'results' in data['d'] else []
            # Filter out system folders
            folders = [f for f in all_folders if not f['Name'].startswith('_') and f['Name'] not in ['Forms']]
            print(f"   📁 Found {len(folders)} subfolders")
        else:
            print(f"   ❌ Failed to get folders: {response.status_code}")
            
    except Exception as e:
        print(f"   💥 Error scanning {folder_path}: {e}")
    
    return files, folders

def create_sharepoint_url(server_relative_url):
    """Create a standardized SharePoint URL from server relative URL."""
    return f"https://worldbankgroup.sharepoint.com{server_relative_url}"

def discover_videos_recursive(folder_path, discovered_videos, progress_info):
    """Recursively discover all videos in SharePoint folder with enhanced metadata."""
    files, folders = get_folder_contents(folder_path)
    
    # Process files in current folder
    for file_info in files:
        file_name = file_info['Name']
        file_ext = Path(file_name).suffix
        
        # Only catalog video files
        if file_ext not in VIDEO_EXTS:
            continue
        
        progress_info['total_found'] += 1
        
        # Create standardized SharePoint URL
        sharepoint_url = create_sharepoint_url(file_info['ServerRelativeUrl'])
        
        # Get enhanced metadata with timestamps
        metadata = get_file_metadata_with_timestamps(file_info['ServerRelativeUrl'])
        
        # Store video info with metadata
        video_info = {
            'name': file_name,
            'url': sharepoint_url,
            'server_path': file_info['ServerRelativeUrl'],
            'folder_path': folder_path,
            'metadata': metadata
        }
        
        discovered_videos.append(video_info)
        
        if progress_info['total_found'] % 10 == 0:
            print(f"   📊 Discovered {progress_info['total_found']} videos so far...")
    
    # Process subfolders recursively
    for folder_info in folders:
        folder_name = folder_info['Name']
        folder_server_path = folder_info['ServerRelativeUrl']
        
        # Recurse into subfolder
        discover_videos_recursive(folder_server_path, discovered_videos, progress_info)

print("✅ ENHANCED SharePoint video discovery functions ready")

✅ ENHANCED SharePoint video discovery functions ready
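
The recursive traversal can be exercised without a SharePoint connection. The sketch below is an assumption-laden stand-in: `FAKE_TREE` and `walk_videos` are hypothetical, the folder listing replaces the REST responses, and extensions are compared in lower case so that mixed-case suffixes such as `.Mp4` are also caught.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".mts", ".avi"}  # compared in lower case

# Hypothetical listing: folder path -> (file names, subfolder names)
FAKE_TREE = {
    "/Peru 2019": (["readme.txt"], ["Ruta 10"]),
    "/Peru 2019/Ruta 10": (["123456 Clip 1.MP4", "notes.docx"], ["Video"]),
    "/Peru 2019/Ruta 10/Video": (["123456 Clip 2.mov"], []),
}

def walk_videos(folder_path, listing=FAKE_TREE):
    """Yield the full path of every video below folder_path, depth first."""
    files, subfolders = listing.get(folder_path, ([], []))
    for name in files:
        if PurePosixPath(name).suffix.lower() in VIDEO_EXTS:
            yield f"{folder_path}/{name}"
    for sub in subfolders:
        yield from walk_videos(f"{folder_path}/{sub}", listing)

videos = list(walk_videos("/Peru 2019"))
for path in videos:
    print(path)
```

A generator keeps the traversal free of shared mutable state; the notebook's version instead appends to a list passed down through the recursion.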


In [6]:
# -----------------------------------------------------------
# 4  Execute SharePoint Video Discovery with Enhanced Metadata
# -----------------------------------------------------------

def discover_peru_videos():
    """Main function to discover all videos in Peru 2019 folder with timestamps."""
    print("="*80)
    print("🎬 STARTING ENHANCED SHAREPOINT VIDEO DISCOVERY")
    print("="*80)
    print(f"📍 Source: {SP_FOLDER_PATH}")
    print(f"🎯 Target Extensions: {', '.join(VIDEO_EXTS)}")
    print(f"🕒 Enhanced with timestamp metadata for ordering verification")
    print()
    
    # Initialize progress tracking
    progress_info = {
        'total_found': 0
    }
    
    discovered_videos = []
    start_time = time.time()
    
    try:
        # Start recursive discovery from Peru 2019 folder
        discover_videos_recursive(SP_FOLDER_PATH, discovered_videos, progress_info)
        
        # Calculate final stats
        elapsed = time.time() - start_time
        
        # Count metadata success rate
        successful_metadata = sum(1 for v in discovered_videos if v['metadata']['success'])
        
        print()
        print("="*80)
        print("📊 ENHANCED DISCOVERY COMPLETE - SUMMARY")
        print("="*80)
        print(f"🔍 Videos discovered: {len(discovered_videos)}")
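        # Guard against an empty result so the summary does not raise ZeroDivisionError
        metadata_pct = successful_metadata / len(discovered_videos) * 100 if discovered_videos else 0.0
        print(f"🕒 Metadata retrieved: {successful_metadata}/{len(discovered_videos)} ({metadata_pct:.1f}%)")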
        print(f"⏱️  Time elapsed: {elapsed:.1f} seconds")
        
        if len(discovered_videos) > 0:
            print("\n📋 Sample discovered videos with metadata:")
            for i, video in enumerate(discovered_videos[:3]):
                print(f"   {i+1}. {video['name']}")
                print(f"      URL: {video['url']}")
                if video['metadata']['success']:
                    print(f"      Created: {video['metadata'].get('TimeCreated', 'N/A')}")
                    print(f"      Size: {video['metadata'].get('Length', 'N/A')} bytes")
                else:
                    print(f"      Metadata: {video['metadata']['error']}")
            if len(discovered_videos) > 3:
                print(f"   ... and {len(discovered_videos) - 3} more")

        print("="*80)
        return discovered_videos
        
    except Exception as e:
        print(f"💥 Discovery failed: {e}")
        return []

# Execute the discovery
discovered_videos = discover_peru_videos()

if not discovered_videos:
    print("❌ No videos were discovered. Check your SharePoint access and folder path.")
else:
    print(f"✅ Ready to process {len(discovered_videos)} discovered videos with enhanced metadata!")

🎬 STARTING ENHANCED SHAREPOINT VIDEO DISCOVERY
📍 Source: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019
🎯 Target Extensions: .mts, .MTS, .MOV, .avi, .MP4, .mp4, .AVI, .mov
🕒 Enhanced with timestamp metadata for ordering verification

📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019
   📄 Found 5 files
   📁 Found 20 subfolders
📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019/Ruta 10
   📄 Found 0 files
   📁 Found 2 subfolders
📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019/Ruta 10/Video
   📄 Found 9 files
   📁 Found 1 subfolders
📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019/Ruta 10/Video/Clip 1
   📄 Found 9 files
   📁 Found 0 subfolders
   📊 Discovered 10 videos so far...
📂 Scanning folder: /teams/TeachDashboardVideoLibrary-WBGroup/Shared Documents/General/Peru 2019/Ruta

In [7]:
# -----------------------------------------------------------
# 5  Helper: load_dataset (unchanged)
# -----------------------------------------------------------
import pandas as pd

def load_dataset(path: Path) -> pd.DataFrame:
    """Load TEACH csv exported with two header rows and latin-1 encoding."""
    with path.open('r', encoding='latin-1') as f:
        lines = f.readlines()
    
    h1 = [h.strip() for h in lines[0].split(',')]
    h2 = [h.strip() for h in lines[1].split(',')]
    base = h1[:3] + h2[3:]
    
    cols, seen = [], {}
    for c in base:
        c = c or 'Unnamed'
        if c not in seen:
            cols.append(c)
            seen[c] = 0
        else:
            seen[c] += 1
            cols.append(f"{c}_{seen[c]}")
    
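    # NOTE: skiprows=[0, 2] drops lines 0 and 2 but keeps line 1 (the second header
    # row) as the first data row. If the file has exactly two header lines,
    # skiprows=[0, 1] is probably what is intended; verify against the raw CSV.
    return pd.read_csv(path, header=None, skiprows=[0,2], names=cols, encoding='latin-1')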

print("✅ Dataset loading function ready")

✅ Dataset loading function ready
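
The header merging and de-duplication inside `load_dataset` can be tried in isolation. The sketch below assumes made-up header rows and uses a hypothetical helper name, `dedupe_columns`; the first three names come from the first header row and the rest from the second, as in the function above.

```python
def dedupe_columns(names):
    """Replace blank names with 'Unnamed' and suffix repeats with _1, _2, ..."""
    cols, seen = [], {}
    for name in names:
        name = name or "Unnamed"
        if name not in seen:
            seen[name] = 0
            cols.append(name)
        else:
            seen[name] += 1
            cols.append(f"{name}_{seen[name]}")
    return cols

# Made-up header rows for illustration only.
header_1 = ["Route", "School_Clip", "Person to Score", "", ""]
header_2 = ["", "", "", "Score", "Score"]

columns = dedupe_columns(header_1[:3] + header_2[3:])
print(columns)
```

Note that splitting header lines on bare commas, as the function does, would mis-split any header cell that itself contains a quoted comma.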


In [8]:
# -----------------------------------------------------------
# 6  Load CSV & ensure FIRST/LAST clip columns only
# -----------------------------------------------------------
print(f"📊 Loading CSV from: {RAW_CSV}")

if not RAW_CSV.exists():
    raise FileNotFoundError(f"CSV file not found: {RAW_CSV}")

df = load_dataset(RAW_CSV)
df['Source Table'] = 'Peru'

print(f"✅ Loaded {len(df)} rows from CSV")
print(f"📋 Columns: {list(df.columns)[:10]}{'...' if len(df.columns) > 10 else ''}")

# Ensure required columns exist
required_cols = ['Identifier','Audio File 1','Audio File 2','Transcription 1','Transcription 2','Language','Context']
for c in required_cols:
    if c not in df.columns:
        df[c] = ''
        
# Add ONLY first and last video clip columns (NEW APPROACH)
clip_columns = ['First Video Clip', 'Last Video Clip']
for col in clip_columns:
    if col not in df.columns:
        df[col] = ''

print(f"✅ Configured for FIRST/LAST clips only: {clip_columns}")

📊 Loading CSV from: /Users/mkrasnow/Desktop/montesa/new/rawData/Peru/TEACH_Final_Scores_4 - Peru.csv
✅ Loaded 363 rows from CSV
📋 Columns: ['Route', 'School_Clip', 'Person to Score', 'Teachear provides learning activity to most students - 1st Snapshot', 'Students are on task - 1st snapshot', 'Teachear provides learning activity to most students - 2nd Snapshot', 'Students are on task - 2nd snapshot', 'Teachear provides learning activity to most students - 3rd Snapshot', 'Students are on task - 3rd snapshot', '1. Supportive Learning Environment']...
✅ Configured for FIRST/LAST clips only: ['First Video Clip', 'Last Video Clip']


In [9]:
# -----------------------------------------------------------
# 7  TIMESTAMP-BASED ORDERING: Core Logic for First/Last Selection
# -----------------------------------------------------------
print("🕒 Building timestamp-based video ordering system...")

# Enhanced regex patterns for clip number detection
clip_patterns = [
    (re.compile(r'clip\s*(\d+)', re.I), 'clip_keyword', 10),
    (re.compile(r'_(\d+)'), 'underscore', 8),
    (re.compile(r'\((\d+)\)'), 'parentheses', 6),
    (re.compile(r'test.*clip\s*(\d+)', re.I), 'test_clip', 4),
    (re.compile(r'(\d+)\.mp4$', re.I), 'trailing_number', 2),
]

id_re = re.compile(r'^(\d{6,7})\D')

def extract_clip_info(filename):
    """Extract clip number and confidence from filename."""
    best_match = None
    best_score = 0
    best_pattern = None
    
    for pattern, pattern_name, score in clip_patterns:
        match = pattern.search(filename)
        if match and score > best_score:
            try:
                clip_num = int(match.group(1))
                if 1 <= clip_num <= 4:
                    best_match = clip_num
                    best_score = score
                    best_pattern = pattern_name
            except ValueError:
                continue
    
    return best_match, best_score, best_pattern

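from datetime import timezone

def parse_sharepoint_timestamp(timestamp_str):
    """Parse a SharePoint timestamp string into a timezone-aware UTC datetime."""
    try:
        if not timestamp_str:
            return None
        # Legacy OData format: /Date(1234567890000)/ (milliseconds since the epoch)
        if timestamp_str.startswith('/Date(') and timestamp_str.endswith(')/'):
            timestamp_ms = int(timestamp_str[6:-2])
            return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        # ISO 8601; older Python versions reject a trailing 'Z', so normalize it
        parsed = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
        # Treat naive values as UTC so naive and aware datetimes are never compared
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    except (ValueError, TypeError, AttributeError):
        return None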

def verify_clip_ordering(videos_for_identifier):
    """CORE LOGIC: Verify clip ordering using timestamps vs filename patterns."""
    if len(videos_for_identifier) < 2:
        return {
            'verified_order': videos_for_identifier,
            'method': 'single_video',
            'confidence': 'high',
            'notes': f'Only {len(videos_for_identifier)} video(s) available'
        }
    
    # Extract clip numbers and timestamps for each video
    for video in videos_for_identifier:
        clip_num, confidence, pattern = extract_clip_info(video['name'])
        video['clip_num'] = clip_num
        video['filename_confidence'] = confidence
        video['pattern_type'] = pattern
        
        # Parse timestamp
        if video['metadata']['success']:
            video['timestamp'] = parse_sharepoint_timestamp(video['metadata'].get('TimeCreated'))
        else:
            video['timestamp'] = None
    
    # Check data availability
    has_clip_numbers = any(v['clip_num'] is not None for v in videos_for_identifier)
    has_timestamps = any(v['timestamp'] is not None for v in videos_for_identifier)
    
    if not has_clip_numbers and not has_timestamps:
        # Both missing - use filename alphabetical order
        verified_order = sorted(videos_for_identifier, key=lambda x: x['name'])
        return {
            'verified_order': verified_order,
            'method': 'filename_alphabetical',
            'confidence': 'low',
            'notes': 'No clip numbers or timestamps - used alphabetical filename order'
        }
    
    if not has_timestamps:
        # Only filename patterns available
        verified_order = sorted(videos_for_identifier, key=lambda x: (x['clip_num'] or 999, x['name']))
        return {
            'verified_order': verified_order,
            'method': 'filename_only',
            'confidence': 'medium',
            'notes': 'No timestamps - used filename patterns only'
        }
    
    if not has_clip_numbers:
        # Only timestamps available
        verified_order = sorted([v for v in videos_for_identifier if v['timestamp']], key=lambda x: x['timestamp'])
        return {
            'verified_order': verified_order,
            'method': 'timestamp_only',
            'confidence': 'high',
            'notes': 'No clip numbers - used timestamp chronology'
        }
    
    # Both available - check for alignment
    filename_order = sorted([v for v in videos_for_identifier if v['clip_num']], key=lambda x: x['clip_num'])
    timestamp_order = sorted([v for v in videos_for_identifier if v['timestamp']], key=lambda x: x['timestamp'])
    
    # Compare ordering alignment
    filename_names = [v['name'] for v in filename_order]
    timestamp_names = [v['name'] for v in timestamp_order]
    
    if filename_names == timestamp_names:
        # Perfect alignment
        return {
            'verified_order': timestamp_order,
            'method': 'aligned_timestamp_filename',
            'confidence': 'high',
            'notes': 'Timestamp and filename ordering perfectly aligned'
        }
    else:
        # Conflict - prioritize timestamps
        return {
            'verified_order': timestamp_order,
            'method': 'timestamp_priority_conflict',
            'confidence': 'medium',
            'notes': f'Timestamp/filename conflict - used timestamp priority. Filename order: {filename_names[:3]}, Timestamp order: {timestamp_names[:3]}'
        }

def select_first_last_clips(identifier_videos):
    """Select ONLY first and last clips from verified ordering."""
    if len(identifier_videos) == 0:
        return None
    
    if len(identifier_videos) == 1:
        return {
            'first_clip': identifier_videos[0]['url'],
            'last_clip': None,
            'method': 'single_video',
            'confidence': 'high',
            'total_available': 1,
            'notes': 'Only one video available - assigned as first clip'
        }
    
    # Get verified ordering
    ordering_result = verify_clip_ordering(identifier_videos)
    verified_videos = ordering_result['verified_order']
    
    if len(verified_videos) < 2:
        return {
            'first_clip': verified_videos[0]['url'] if verified_videos else None,
            'last_clip': None,
            'method': ordering_result['method'],
            'confidence': ordering_result['confidence'],
            'total_available': len(verified_videos),
            'notes': f"Insufficient videos after ordering: {ordering_result['notes']}"
        }
    
    # Select first and last
    first_clip = verified_videos[0]
    last_clip = verified_videos[-1]
    
    return {
        'first_clip': first_clip['url'],
        'last_clip': last_clip['url'],
        'method': ordering_result['method'],
        'confidence': ordering_result['confidence'],
        'total_available': len(verified_videos),
        'notes': f"Selected first/last from {len(verified_videos)} videos. {ordering_result['notes']}"
    }

print("✅ Timestamp-based ordering system ready")

🕒 Building timestamp-based video ordering system...
✅ Timestamp-based ordering system ready
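
Sorting by timestamp only works if every parsed value is comparable. The following sketch, which uses a hypothetical `parse_timestamp` helper and made-up inputs, shows one way to normalize both the `/Date(ms)/` form and ISO 8601 strings to timezone-aware UTC datetimes; it is an illustration, not a statement of which format the SharePoint API returns here.

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Parse '/Date(ms)/' or ISO 8601 strings into timezone-aware UTC datetimes."""
    if not value:
        return None
    try:
        if value.startswith("/Date(") and value.endswith(")/"):
            return datetime.fromtimestamp(int(value[6:-2]) / 1000, tz=timezone.utc)
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (ValueError, TypeError):
        return None
    # Treat naive values as UTC so every result can be compared with the others
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)

epoch_form = parse_timestamp("/Date(1570114800000)/")
iso_form = parse_timestamp("2019-10-03T15:00:00Z")
later = parse_timestamp("2019-10-03T16:00:00")
print(epoch_form, iso_form, later)
```

Mixing naive and aware datetimes in a single `sorted` call raises `TypeError`, which is why the sketch normalizes everything to UTC before returning.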


In [10]:
# -----------------------------------------------------------
# 8  MAIN PROCESSING: Build First/Last Clip Mapping with Timestamp Verification
# -----------------------------------------------------------
print("🔍 Building timestamp-verified first/last clip mapping...")

if len(discovered_videos) == 0:
    print("⚠️  No videos were discovered. Make sure the discovery completed successfully.")
    video_map = {}
else:
    print(f"📹 Processing {len(discovered_videos)} discovered videos")
    
    # Group videos by identifier
    videos_by_identifier = {}
    unmatched_videos = []
    
    for video_info in discovered_videos:
        name = video_info['name']
        m_id = id_re.search(name)
        
        if not m_id:
            unmatched_videos.append(video_info)
            continue
        
        ident = m_id.group(1)
        if ident not in videos_by_identifier:
            videos_by_identifier[ident] = []
        
        videos_by_identifier[ident].append(video_info)
    
    print(f"✅ Grouped videos: {len(videos_by_identifier)} identifiers, {len(unmatched_videos)} unmatched")
    
    # Build first/last clip mapping using timestamp verification
    video_map = {}
    stats = {
        'total_identifiers': len(videos_by_identifier),
        'successful_mappings': 0,
        'single_video_cases': 0,
        'multiple_video_cases': 0,
        'method_counts': {},
        'confidence_counts': {'high': 0, 'medium': 0, 'low': 0}
    }
    
    for ident, videos in videos_by_identifier.items():
        result = select_first_last_clips(videos)
        
        if result:
            video_map[ident] = result
            stats['successful_mappings'] += 1
            
            # Track statistics
            if result['total_available'] == 1:
                stats['single_video_cases'] += 1
            else:
                stats['multiple_video_cases'] += 1
            
            method = result['method']
            stats['method_counts'][method] = stats['method_counts'].get(method, 0) + 1
            
            confidence = result['confidence']
            stats['confidence_counts'][confidence] += 1
    
    print(f"\n📊 Timestamp-verified mapping complete:")
    print(f"   Successful mappings: {stats['successful_mappings']}/{stats['total_identifiers']}")
    print(f"   Single video cases: {stats['single_video_cases']}")
    print(f"   Multiple video cases: {stats['multiple_video_cases']}")
    
    print(f"\n🔧 Ordering methods used:")
    for method, count in stats['method_counts'].items():
        print(f"   {method}: {count} cases")
    
    print(f"\n🎯 Confidence distribution:")
    total_mapped = stats['successful_mappings'] or 1  # avoid dividing by zero when nothing mapped
    for conf, count in stats['confidence_counts'].items():
        print(f"   {conf}: {count} cases ({count/total_mapped*100:.1f}%)")
    
    # Show sample mappings
    if video_map:
        print("\n📋 Sample first/last clip mappings:")
        for i, (ident, result) in enumerate(list(video_map.items())[:3]):
            has_last = result['last_clip'] is not None
            print(f"   {ident}: First + {'Last' if has_last else 'No Last'} ({result['method']}, {result['confidence']} confidence)")
            print(f"      Available videos: {result['total_available']}")
            print(f"      Notes: {result['notes'][:80]}{'...' if len(result['notes']) > 80 else ''}")

🔍 Building timestamp-verified first/last clip mapping...
📹 Processing 402 discovered videos
✅ Grouped videos: 192 identifiers, 6 unmatched

📊 Timestamp-verified mapping complete:
   Successful mappings: 192/192
   Single video cases: 60
   Multiple video cases: 132

🔧 Ordering methods used:
   timestamp_priority_conflict: 32 cases
   aligned_timestamp_filename: 94 cases
   single_video: 60 cases
   timestamp_only: 6 cases

🎯 Confidence distribution:
   high: 160 cases (83.3%)
   medium: 32 cases (16.7%)
   low: 0 cases (0.0%)

📋 Sample first/last clip mappings:
   379859: First + Last (timestamp_priority_conflict, medium confidence)
      Available videos: 2
      Notes: Selected first/last from 2 videos. Timestamp/filename conflict - used timestamp ...
   382242: First + Last (aligned_timestamp_filename, high confidence)
      Available videos: 2
      Notes: Selected first/last from 2 videos. Timestamp and filename ordering perfectly ali...
   420703: First + Last (aligned_timestamp_
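
The grouping step relies on the identifier pattern `^(\d{6,7})\D`, which only accepts names that start with six or seven digits followed by a non-digit. A minimal sketch, with a hypothetical `group_by_identifier` helper and made-up filenames, shows which names are grouped and which fall through as unmatched:

```python
import re
from collections import defaultdict

ID_RE = re.compile(r"^(\d{6,7})\D")

def group_by_identifier(names):
    """Group filenames by their leading 6-7 digit identifier."""
    grouped, unmatched = defaultdict(list), []
    for name in names:
        match = ID_RE.search(name)
        if match:
            grouped[match.group(1)].append(name)
        else:
            unmatched.append(name)
    return dict(grouped), unmatched

# Made-up filenames for illustration only.
names = [
    "379859 Clip 1.MP4",
    "379859 Clip 2.MP4",
    "0382242_2.mp4",
    "intro.mp4",         # no leading digits
    "12345 Clip 1.mp4",  # only five digits
    "12345678.mp4",      # eight digits: no non-digit after six or seven
]

grouped, unmatched = group_by_identifier(names)
print(grouped)
print(unmatched)
```

Identifiers are kept as strings, so a leading zero is preserved and must also be present in the CSV value for the later lookup to succeed.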

In [11]:
# -----------------------------------------------------------
# 9  FINAL: Attach First/Last Clips to DataFrame and Save
# -----------------------------------------------------------
print("🔗 Attaching first/last video clips to dataframe...")

attachment_stats = {
    'rows_with_first_clip': 0,
    'rows_with_both_clips': 0,
    'rows_with_first_only': 0,
    'no_match': 0,
    'skipped_existing': 0
}

for idx, row in df.iterrows():
    # Look for identifier in the School_Clip column
    school_clip = str(row.get('School_Clip', ''))
    m = re.search(r'(\d{6,7})', school_clip)
    
    if not m:
        attachment_stats['no_match'] += 1
        continue
    
    ident = m.group(1)
    
    if ident not in video_map:
        attachment_stats['no_match'] += 1
        continue
    
    # Check if this row already has video clips assigned
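    # NaN is truthy, so test explicitly for a non-missing, non-empty value
    has_existing_videos = any(
        pd.notna(row.get(col)) and str(row.get(col)).strip() != ''
        for col in ['First Video Clip', 'Last Video Clip']
    )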
    
    if has_existing_videos:
        attachment_stats['skipped_existing'] += 1
        continue
    
    # Attach first/last clips
    result = video_map[ident]
    
    if result['first_clip']:
        df.at[idx, 'First Video Clip'] = result['first_clip']
        attachment_stats['rows_with_first_clip'] += 1
        
        if result['last_clip']:
            df.at[idx, 'Last Video Clip'] = result['last_clip']
            attachment_stats['rows_with_both_clips'] += 1
        else:
            attachment_stats['rows_with_first_only'] += 1
    
    # Add detailed context information
    context_info = f"CLIP_ORDERING: {result['method']} ({result['confidence']} confidence) | {result['notes']}"
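    # A missing Context value is NaN; str(NaN) would otherwise yield the text 'nan'
    existing_context = row.get('Context', '')
    existing_context = '' if pd.isna(existing_context) else str(existing_context)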
    df.at[idx, 'Context'] = (existing_context + ' | ' if existing_context else '') + context_info

print(f"\n📊 First/Last clip attachment summary:")
print(f"✅ Rows with first clip: {attachment_stats['rows_with_first_clip']}")
print(f"🎬 Rows with both clips: {attachment_stats['rows_with_both_clips']}")
print(f"🎥 Rows with first only: {attachment_stats['rows_with_first_only']}")
print(f"⏭️  Skipped (existing): {attachment_stats['skipped_existing']}")
print(f"⚠️  No match: {attachment_stats['no_match']}")

# Calculate final coverage
total_rows = len(df)
coverage_rate = attachment_stats['rows_with_first_clip'] / total_rows * 100
both_clips_rate = attachment_stats['rows_with_both_clips'] / total_rows * 100

print(f"\n📈 Coverage rates:")
print(f"   First clip coverage: {coverage_rate:.1f}%")
print(f"   Both clips coverage: {both_clips_rate:.1f}%")

# Save final CSV with ONLY first/last clips
final_csv = OUTPUT_DIR / 'peru_formatted_first_last_clips_only.csv'
df.to_csv(final_csv, index=False)
print(f"\n💾 Saved FIRST/LAST clips CSV → {final_csv}")

# Show sample of final assignments
video_rows = df[(df['First Video Clip'] != '') | (df['Last Video Clip'] != '')]
if len(video_rows) > 0:
    print(f"\n📋 Sample rows with first/last clips ({len(video_rows)} total):")
    for i, (idx, row) in enumerate(video_rows.head(3).iterrows()):
        first_clip = row['First Video Clip']
        last_clip = row['Last Video Clip']
        context = row.get('Context', '')
        
        print(f"   Row {idx}:")
        print(f"      First clip: {'✅' if first_clip else '❌'} {first_clip[:60] + '...' if first_clip else 'None'}")
        print(f"      Last clip:  {'✅' if last_clip else '❌'} {last_clip[:60] + '...' if last_clip else 'None'}")
        
        # Extract ordering method from context
        method_match = re.search(r'CLIP_ORDERING: ([^\(]+)', context)
        if method_match:
            print(f"      Method: {method_match.group(1).strip()}")

print("\n" + "="*80)
print("🎉 TIMESTAMP-VERIFIED FIRST/LAST PROCESSING COMPLETE!")
print("="*80)
print(f"📊 Final dataset: {len(df)} rows")
print(f"🎬 Videos discovered: {len(discovered_videos)}")
print(f"🔗 Rows with video clips: {len(video_rows)}")
print(f"⚡ Key improvements:")
print(f"   - Timestamp-based ordering verification")
print(f"   - Only first and last clips preserved")
print(f"   - Intelligent conflict resolution")
print(f"   - Detailed method tracking")
print(f"📁 Output: {final_csv.name}")
print("="*80)

🔗 Attaching first/last video clips to dataframe...

📊 First/Last clip attachment summary:
✅ Rows with first clip: 346
🎬 Rows with both clips: 246
🎥 Rows with first only: 100
⏭️  Skipped (existing): 0
⚠️  No match: 17

📈 Coverage rates:
   First clip coverage: 95.3%
   Both clips coverage: 67.8%

💾 Saved FIRST/LAST clips CSV → /Users/mkrasnow/Desktop/montesa/new/formattedData/peru_formatted_first_last_clips_only.csv

📋 Sample rows with first/last clips (346 total):
   Row 1:
      First clip: ✅ https://worldbankgroup.sharepoint.com/teams/TeachDashboardVi...
      Last clip:  ✅ https://worldbankgroup.sharepoint.com/teams/TeachDashboardVi...
      Method: timestamp_priority_conflict
   Row 2:
      First clip: ✅ https://worldbankgroup.sharepoint.com/teams/TeachDashboardVi...
      Last clip:  ✅ https://worldbankgroup.sharepoint.com/teams/TeachDashboardVi...
      Method: timestamp_priority_conflict
   Row 3:
      First clip: ✅ https://worldbankgroup.sharepoint.com/teams/TeachDashboardVi.