# Multi-Institution MARC Data Collection Setup

## Overview
This notebook collects MARC data from all Ivy Plus libraries. The Ivy Plus consortium includes:

- Brown University
- University of Chicago  
- Columbia University
- Cornell University
- Dartmouth College
- Duke University
- Harvard University
- Johns Hopkins University
- Massachusetts Institute of Technology
- University of Pennsylvania
- Princeton University
- Stanford University
- Yale University

## Data Source
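MARC files are pulled from POD (Platform for Open Data), hosted at Stanford. Each institution publishes a resource list at a URL of the following form, where `{stream_date}` is the name of the institution's stream (often a date, but not always):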
```
https://pod.stanford.edu/organizations/{org_code}/streams/{stream_date}/normalized_resourcelist/marc21
```
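
As a rough illustration of how the placeholders are filled, the sketch below builds the resource list URL for one institution. The helper name `build_resourcelist_url` is hypothetical and not part of the notebook; the org code and stream value are the ones configured for Penn in `ivy_plus_config`.

```python
# Template copied from the Data Source section above.
RESOURCELIST_TEMPLATE = (
    "https://pod.stanford.edu/organizations/{org_code}"
    "/streams/{stream_date}/normalized_resourcelist/marc21"
)

def build_resourcelist_url(org_code, stream_date):
    """Fill the template for one organization and stream (hypothetical helper)."""
    return RESOURCELIST_TEMPLATE.format(org_code=org_code, stream_date=stream_date)

# Penn's org code and stream name as configured in ivy_plus_config
print(build_resourcelist_url("penn", "2022-07-18"))
```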

## Usage Examples

```python
# Sync data for a single institution
sync_institution_data('harvard')  # Uses stream date from config

# Sync data for all active institutions (incremental sync)
sync_all_active_institutions()

# Force re-download all files for an institution
sync_institution_data('yale', force_refresh=True)

# Use a different stream date
sync_institution_data('yale', stream_date='2023-01-01')

# Check current data status
for institution_key in ivy_plus_config:
    if not ivy_plus_config[institution_key]['active']:
        continue
    pod_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
    if os.path.exists(pod_dir):
        marc_files = glob.glob(os.path.join(pod_dir, "**/*.mrc.gz"), recursive=True)
        print(f"{institution_key}: {len(marc_files)} MARC files")
```

## Data Organization
- **Raw Downloads**: Each institution's data is stored in `pod_{institution}` directories
- **Work Directory**: Files are copied to `pod-processing-outputs/work/marc/` with institution prefix
- **Processed Files**: Institution-specific MARC files in `pod-processing-outputs/processed/`
- **Export Directory**: Final filtered files in timestamped subdirectories
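
The institution prefix can be sketched as follows. Both helper names are hypothetical, and the sample file name is made up. The lookup tries longer keys first, on the assumption that some keys (such as `johns_hopkins`) contain an underscore and would be cut short by a plain split on `_`.

```python
import os

def prefixed_work_path(work_dir, institution_key, source_path):
    """Return the work-directory path for a raw download (hypothetical helper)."""
    return os.path.join(work_dir, f"{institution_key}_{os.path.basename(source_path)}")

def institution_from_prefixed(filename, known_keys):
    """Recover the institution key, trying longer keys first so that keys
    containing underscores (e.g. 'johns_hopkins') are matched whole."""
    for key in sorted(known_keys, key=len, reverse=True):
        if filename.startswith(f"{key}_"):
            return key
    return "unknown"

# The source file name below is invented for illustration.
dest = prefixed_work_path(
    "pod-processing-outputs/work/marc",
    "johns_hopkins",
    "pod_johns_hopkins/files/batch-001.mrc.gz",
)
print(dest)
print(institution_from_prefixed(os.path.basename(dest), ["yale", "johns_hopkins"]))
```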

## Processing Workflow
1. **Sync Data**: Download or update MARC files (`.mrc.gz`) and delete lists (`.del.txt`) from POD
2. **Organize Files**: Copy MARC files to work directory with institution prefixes
3. **Process Records**: Extract and consolidate MARC records by institution
4. **Export with Filtering**: Apply delete filtering and create final export package

## Key Features
- **Incremental Sync**: Only downloads new/updated files (unless `force_refresh=True`)
- **Delete Filtering**: Automatically processes `.del.txt` files to filter out deleted records
- **Progress Tracking**: Saves progress to allow resuming interrupted processing
- **Institution Prefixing**: Prevents filename conflicts between institutions
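
A minimal sketch of the delete-filtering idea, using only the standard library. It assumes each `.del.txt` file holds one record ID per line, which matches how the notebook reads these files; the function names, the `(record_id, payload)` tuples, and the sample IDs are illustrative stand-ins rather than the notebook's actual export code.

```python
import os
import tempfile

def load_delete_ids(del_paths):
    """Collect non-empty, stripped lines from each delete file into one set."""
    ids = set()
    for path in del_paths:
        with open(path, "r", encoding="utf-8") as f:
            ids.update(line.strip() for line in f if line.strip())
    return ids

def filter_deleted(records, delete_ids):
    """Yield (record_id, payload) pairs whose ID is not marked as deleted."""
    for record_id, payload in records:
        if record_id not in delete_ids:
            yield record_id, payload

# Demo with a throwaway delete file and made-up record IDs
with tempfile.TemporaryDirectory() as tmp:
    del_path = os.path.join(tmp, "example.del.txt")
    with open(del_path, "w", encoding="utf-8") as f:
        f.write("a2\n\na4\n")
    delete_ids = load_delete_ids([del_path])

records = [("a1", b"..."), ("a2", b"..."), ("a3", b"...")]
kept = list(filter_deleted(records, delete_ids))
print([rid for rid, _ in kept])
```

Keeping the IDs in a set makes each membership check constant time, which matters when the delete lists and record counts are both large.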

## Notes
- Harvard's collection is particularly large (~43GB, 962 files)
- Processing uses PySpark for efficient handling of large datasets
- A single POD access token provides authentication for all institutions
- Stream dates vary by institution (configured in `ivy_plus_config`)
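
The way a stream is chosen for a sync can be sketched as below. `resolve_stream` is a hypothetical helper that mirrors the fallback order used in `sync_institution_data`: an explicit argument wins, then the institution's configured value, then the notebook-wide default.

```python
DEFAULT_STREAM = "2022-07-18"  # same fallback value the notebook defines

def resolve_stream(config, stream_date=None, default=DEFAULT_STREAM):
    """Pick the stream name: explicit argument, then config, then default."""
    return stream_date or config.get("stream_date") or default

harvard = {"org_code": "harvard", "stream_date": "20220204-with-items"}
print(resolve_stream(harvard))                            # configured stream
print(resolve_stream(harvard, stream_date="2023-01-01"))  # explicit override
print(resolve_stream({"org_code": "example"}))            # falls back to default
```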

In [None]:
!pip install pyspark

In [None]:
!pip install resync pymarc

In [None]:

import os

# Configuration for paths - adjust this when moving between systems
VOLUME_BASE = '/Volumes/Samsung_T5/pod-pyspark-notbook'
os.chdir(VOLUME_BASE)  # Set working directory to the new volume

# Define all paths relative to the new volume
base_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs')
work_dir = os.path.join(base_dir, 'work', 'marc')
processed_dir = os.path.join(base_dir, 'processed')
final_dir = os.path.join(base_dir, 'final')
export_dir = os.path.join(base_dir, 'export')
archive_dir = os.path.join(base_dir, 'archive')

# Create directories if they don't exist
for directory in [base_dir, work_dir, processed_dir, final_dir, export_dir, archive_dir]:
    os.makedirs(directory, exist_ok=True)

print(f"Volume base: {VOLUME_BASE}")
print(f"Working directory: {os.getcwd()}")
print("All directories created/verified")

In [None]:
# Verification cell: run after the path setup cell
# Checks whether we are starting from a clean slate

print("=== VERIFYING CLEAN STATE ===")

# Check processed directory is empty
processed_files = os.listdir(processed_dir) if os.path.exists(processed_dir) else []
if processed_files:
    print(f"⚠️ WARNING: Processed directory is not empty! Contains {len(processed_files)} files")
    print("Files found:", processed_files[:5], "..." if len(processed_files) > 5 else "")
    print("\nTo ensure clean processing, run:")
    print(f"  shutil.rmtree('{processed_dir}')")
    print(f"  os.makedirs('{processed_dir}')")
else:
    print("✓ Processed directory is clean")

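# Check for any old export directories
export_subdirs = []
if os.path.exists(export_dir):
    export_subdirs = [d for d in os.listdir(export_dir) if os.path.isdir(os.path.join(export_dir, d))]
if export_subdirs:
    print(f"\n✓ Found {len(export_subdirs)} previous export(s)")
    print("  These won't interfere with new processing")
else:
    print("\n✓ No previous exports found")

# Only report a clean start when the processed directory really is empty
if processed_files:
    print("\n⚠️ Processed directory has existing files - clear it for a fresh run")
else:
    print("\n✓ Ready for fresh processing run!")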

In [None]:

import json
import glob

print("=== VERIFYING SETUP ===")
print(f"Volume base: {VOLUME_BASE}")
print(f"Current directory: {os.getcwd()}")

# Check directories
for name, path in [
    ('Work dir', work_dir),
    ('Processed dir', processed_dir),
    ('Final dir', final_dir)
]:
    exists = "✓" if os.path.exists(path) else "✗"
    print(f"{exists} {name}: {path}")

# Check for progress file
progress_file = os.path.join(processed_dir, 'processing_progress.json')
if os.path.exists(progress_file):
    with open(progress_file, 'r') as f:
        progress = json.load(f)
    print(f"\n✓ Progress file found:")
    print(f"  - Files processed: {len(progress.get('processed_files', []))}")
    print(f"  - Total records: {progress.get('total_records', 0):,}")
else:
    print(f"\n✗ No progress file - will start fresh")

# Check source files
marc_files = glob.glob(os.path.join(work_dir, '*.mrc.gz'))
print(f"\nSource files: {len(marc_files)} .mrc.gz files in work directory")

In [None]:
# Configuration for Ivy Plus Libraries
# This dictionary contains the organization codes and basic info for each institution
# The access token appears to be system-wide for POD, not institution-specific

# Common POD access token (appears to work for all institutions)
POD_ACCESS_TOKEN = 'YOUR_POD_ACCESS_TOKEN_HERE'  # Replace with your actual token

ivy_plus_config = {
    'brown': {
        'name': 'Brown University',
        'org_code': 'brown',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'brown_2022-05-05',
    },
    'chicago': {
        'name': 'University of Chicago',
        'org_code': 'chicago',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2022-06-18',
    },
    'columbia': {
        'name': 'Columbia University',
        'org_code': 'columbia',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'prod',
    },
    'cornell': {
        'name': 'Cornell University',
        'org_code': 'cornell',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2023-04',
    },
    'dartmouth': {
        'name': 'Dartmouth College',
        'org_code': 'dartmouth',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '202208',
    },
    'duke': {
        'name': 'Duke University',
        'org_code': 'duke',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2025-01-15',
    },
    'harvard': {
        'name': 'Harvard University',
        'org_code': 'harvard',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '20220204-with-items',
    },
    'johns_hopkins': {
        'name': 'Johns Hopkins University',
        'org_code': 'jhu',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'jhu',
    },
    'mit': {
        'name': 'Massachusetts Institute of Technology',
        'org_code': 'mit',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'prod-2022-06-08',
    },
    'penn': {
        'name': 'University of Pennsylvania',
        'org_code': 'penn',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,  # Currently working and verified
        'stream_date': '2022-07-18',
    },
    'princeton': {
        'name': 'Princeton University',
        'org_code': 'princeton',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'princeton-prod-0223',
    },
    'stanford': {
        'name': 'Stanford University',
        'org_code': 'stanford',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2024-08-27',
    },
    'yale': {
        'name': 'Yale University',
        'org_code': 'yale',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2025-07-alma',
    }
}

# Base URLs for POD system
POD_BASE_URL = 'https://pod.stanford.edu/'
SITEMAP_BASE_URL = 'https://pod.stanford.edu/organizations/{org}/streams/{stream}/normalized_resourcelist/marc21'

# Default stream (you may need to adjust this for each institution)
DEFAULT_STREAM = '2022-07-18'

print("Ivy Plus Libraries Configuration (with shared POD token):")
for key, config in ivy_plus_config.items():
    status = "✓ Active" if config['active'] else "✗ Inactive (need to verify org_code)"
    print(f"  {config['name']}: {status}")

print(f"\nNOTE: All institutions now use the same POD access token.")
print(f"To activate an institution, verify its org_code and set active=True")
print(f"Use test_all_organization_codes() to verify org_codes")

In [None]:
# Sync helpers: download or update each institution's MARC and delete files from POD

import shutil      # used later by organize_institution_files()
import subprocess  # used to invoke the resync-sync command-line client

def sync_institution_data(institution_key, stream_date=None, output_base_dir=None, force_refresh=False):
    """
    Sync MARC data for a specific institution using resync-sync.
    By default, only downloads missing or updated files.
    
    Args:
        institution_key: Institution identifier from ivy_plus_config
        stream_date: Override stream date (optional)
        output_base_dir: Base directory for output (optional)
        force_refresh: If True, forces re-download of all files
    """
    if institution_key not in ivy_plus_config:
        print(f"Error: Institution '{institution_key}' not found in configuration")
        return False

    config = ivy_plus_config[institution_key]

    if not config['active']:
        print(f"Warning: Institution '{config['name']}' is not active")
        return False

    # Precedence: explicit stream_date argument, then the institution's configured stream, then DEFAULT_STREAM
    effective_stream_date = stream_date or config.get('stream_date') or DEFAULT_STREAM

    # Set up output directory
    if output_base_dir is None:
        output_base_dir = VOLUME_BASE

    output_dir = os.path.join(output_base_dir, f"pod_{institution_key}")

    # Build sitemap URL
    sitemap_url = SITEMAP_BASE_URL.format(
        org=config['org_code'],
        stream=effective_stream_date
    )

    # Check existing data before sync
    existing_marc = 0
    existing_del = 0
    if os.path.exists(output_dir):
        existing_marc = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
        existing_del = len(glob.glob(os.path.join(output_dir, "**/*.del.txt"), recursive=True))
        if existing_marc > 0 or existing_del > 0:
            print(f"Found existing data: {existing_marc} MARC files, {existing_del} delete files")

    # Build command - resync-sync will handle incremental updates
    command = [
        'resync-sync',
        '-v',
        '--sitemap', sitemap_url,
        '--access-token', config['access_token'],
        '-b', POD_BASE_URL,
    ]
    
    # Add force refresh flag if requested
    if force_refresh:
        command.append('--force')
        print("Force refresh enabled - will re-download all files")
    
    command.append(output_dir)

    print(f"Syncing data for {config['name']}...")
    print(f"Stream: {effective_stream_date}")
    print(f"Output: {output_dir}")
    if not force_refresh:
        print("Mode: Incremental (only missing/updated files)")

    try:
        # Run resync-sync
        result = subprocess.run(command)
        
        if result.returncode == 0:
            # Check what was downloaded
            new_marc = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
            new_del = len(glob.glob(os.path.join(output_dir, "**/*.del.txt"), recursive=True))
            
            print(f"✓ Successfully synced {config['name']}")
            print(f"  Total files: {new_marc} MARC, {new_del} delete files")
            
            if existing_marc > 0 or existing_del > 0:
                downloaded_marc = new_marc - existing_marc
                downloaded_del = new_del - existing_del
                if downloaded_marc > 0 or downloaded_del > 0:
                    print(f"  New files: {downloaded_marc} MARC, {downloaded_del} delete files")
                else:
                    print(f"  No new files needed - already up to date")
            
            return True
        else:
            print(f"✗ Error syncing data for {config['name']}")
            return False

    except Exception as e:
        print(f"✗ Exception occurred while syncing {config['name']}: {e}")
        return False

def sync_all_active_institutions(force_refresh=False):
    """
    Sync data for all active institutions, relying on resync-sync's incremental behavior
    """
    print("=== SYNCING ALL ACTIVE INSTITUTIONS ===")
    print("Using resync-sync incremental mode - only missing/updated files will be downloaded\n")
    
    successful = []
    failed = []
    already_complete = []
    
    for institution_key, config in ivy_plus_config.items():
        if config['active']:
            print(f"\n--- {config['name']} ---")
            
            # Quick check if data already exists
            output_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
            if os.path.exists(output_dir) and not force_refresh:
                marc_count = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
                if marc_count > 0:
                    print(f"Checking for updates ({marc_count} existing files)...")
            
            if sync_institution_data(institution_key, force_refresh=force_refresh):
                successful.append(institution_key)
            else:
                failed.append(institution_key)
        else:
            print(f"\nSkipping {config['name']} (inactive)")
    
    print("\n=== SYNC SUMMARY ===")
    print(f"Successfully synced: {len(successful)} institutions")
    
    if failed:
        print(f"\nFailed: {len(failed)} institutions")
        for inst in failed:
            print(f"  ✗ {ivy_plus_config[inst]['name']}")
    
    return successful

# === MAIN EXECUTION ===

# First, show current data status
print("=== CURRENT DATA STATUS ===")
total_marc_files = 0
total_del_files = 0
institutions_with_data = []

for institution_key in ivy_plus_config:
    if not ivy_plus_config[institution_key]['active']:
        continue
        
    pod_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
    if os.path.exists(pod_dir):
        marc_files = glob.glob(os.path.join(pod_dir, "**/*.mrc.gz"), recursive=True)
        del_files = glob.glob(os.path.join(pod_dir, "**/*.del.txt"), recursive=True)
        if marc_files or del_files:
            institutions_with_data.append(institution_key)
            total_marc_files += len(marc_files)
            total_del_files += len(del_files)
            print(f"✓ {ivy_plus_config[institution_key]['name']}: {len(marc_files)} MARC, {len(del_files)} delete files")

if institutions_with_data:
    print(f"\nTotal: {len(institutions_with_data)} institutions with data")
    print(f"Files: {total_marc_files} MARC, {total_del_files} delete files")
    print("\n✓ Ready to run incremental sync to check for updates")
else:
    print("\nNo existing data found - will download all files on first sync")


sync_all_active_institutions()

In [None]:
def organize_institution_files():
    """
    Move/copy relevant files from institution directories to work directory
    """
    print("=== ORGANIZING INSTITUTION FILES ===")
    total_copied = 0
    
    for institution_key, config in ivy_plus_config.items():
        if not config['active']:
            continue
            
        # Use absolute path with new volume
        source_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
        if not os.path.exists(source_dir):
            print(f"Directory not found: {source_dir}")
            continue
            
        # Find all MARC files (adjust pattern as needed)
        marc_files = glob.glob(f"{source_dir}/**/*.mrc.gz", recursive=True)
        
        # Copy to work directory with institution prefix
        copied_count = 0
        for file in marc_files:
            filename = os.path.basename(file)
            dest_file = os.path.join(work_dir, f"{institution_key}_{filename}")
            shutil.copy2(file, dest_file)
            copied_count += 1
            
        print(f"  {institution_key}: Copied {copied_count} files")
        total_copied += copied_count
    
    print(f"\nTotal: Copied {total_copied} files to {work_dir}")
    return total_copied

In [None]:
# Initialize Spark for the processing steps that follow.
# This cell runs inside the notebook, so no separate spark-submit invocation is needed.

# Complete Spark setup for MARC processing
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Initialize Spark if not already done
try:
    sc
    print("Using existing SparkContext")
    # If SparkContext exists, get SparkSession from it
    spark = SparkSession.builder.getOrCreate()
except NameError:
    # Create SparkSession with configuration
    spark = SparkSession.builder \
        .appName("marcUpdateDeleteApp") \
        .config("spark.executor.memory", "10g") \
        .config("spark.driver.memory", "6g") \
        .config("spark.memory.fraction", "0.7") \
        .config("spark.memory.storageFraction", "0.3") \
        .config("spark.sql.shuffle.partitions", "10") \
        .config("spark.default.parallelism", "2") \
        .getOrCreate()
    
    # Get SparkContext from SparkSession
    sc = spark.sparkContext
    print("Created new SparkSession and SparkContext with memory-optimized settings")

# Now these will work
print(f"Spark version: {spark.version}")
print(f"Spark context web UI: {sc.uiWebUrl}")
print("Spark setup complete")

# MARC record consolidation and organization

In [None]:
import os
import glob
import gzip
from pymarc import MARCReader, Record, MARCWriter
import json

# Check if Spark is already initialized
try:
    sc
    print("Using existing SparkContext")
    print(f"SparkContext ID: {sc.applicationId}")
    print(f"Spark UI: {sc.uiWebUrl}")
except NameError:
    print("ERROR: SparkContext not found!")
    print("Please run the Spark initialization cell first")
    raise

# Define directories using the volume base
source_dir = os.path.join(base_dir, 'work', 'marc')
output_dir = os.path.join(base_dir, 'processed')
os.makedirs(output_dir, exist_ok=True)

# Progress tracking file
progress_file = os.path.join(output_dir, 'processing_progress.json')

# Load previous progress if exists
if os.path.exists(progress_file):
    with open(progress_file, 'r') as f:
        progress = json.load(f)
    print(f"Resuming from previous run. Already processed {len(progress['processed_files'])} files.")
    processed_files = set(progress['processed_files'])
    total_records = progress['total_records']
    total_updates = progress['total_updates']
    errors = progress['errors']
else:
    print("Starting fresh processing run.")
    processed_files = set()
    total_records = 0
    total_updates = 0
    errors = []

def save_progress():
    """Save current progress to file"""
    progress = {
        'processed_files': list(processed_files),
        'total_records': total_records,
        'total_updates': total_updates,
        'errors': errors
    }
    with open(progress_file, 'w') as f:
        json.dump(progress, f)

def get_all_marc_gz_files():
    """Get all .mrc.gz files from the source directory"""
    pattern = os.path.join(source_dir, '*.mrc.gz')
    files = glob.glob(pattern)
    print(f"Found {len(files)} .mrc.gz files in {source_dir}")
    return files

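def extract_institution_from_filename(filename):
    """Extract institution key from a prefixed filename"""
    # Match known keys first (longest first) so keys that contain '_',
    # such as 'johns_hopkins', are not truncated by a plain split
    for key in sorted(ivy_plus_config, key=len, reverse=True):
        if filename.startswith(f"{key}_"):
            return key
    # Fall back to the first token for files without a known prefix
    if '_' in filename:
        return filename.split('_')[0]
    elif '-' in filename:
        return filename.split('-')[0]
    return 'unknown'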

def process_pod_delete_files():
    """
    Process all POD delete files (.del.txt) separately
    """
    print("\n=== PROCESSING POD DELETE FILES ===")
    
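    # Anchor the raw-download pattern at VOLUME_BASE so it does not depend on the current directory
    delete_patterns = [
        os.path.join(VOLUME_BASE, "pod_*", "**", "*.del.txt"),
        os.path.join(work_dir, "*.del.txt")
    ]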
    
    all_delete_ids = set()
    delete_file_count = 0
    
    for pattern in delete_patterns:
        delete_files = glob.glob(pattern, recursive=True)
        for del_file in delete_files:
            delete_file_count += 1
            try:
                with open(del_file, 'r') as f:
                    file_deletes = [line.strip() for line in f if line.strip()]
                    all_delete_ids.update(file_deletes)
                    print(f"  {os.path.basename(del_file)}: {len(file_deletes)} IDs")
            except Exception as e:
                print(f"  Error reading {del_file}: {e}")
    
    # Save delete IDs to file
    if all_delete_ids:
        delete_ids_file = os.path.join(output_dir, 'delete_ids.txt')
        with open(delete_ids_file, 'w') as f:
            for delete_id in sorted(all_delete_ids):
                f.write(f"{delete_id}\n")
        print(f"\nProcessed {delete_file_count} delete files")
        print(f"Total unique delete IDs: {len(all_delete_ids)}")
        print(f"Delete IDs saved to: {delete_ids_file}")
    else:
        print("No POD delete files found")
    
    return all_delete_ids

# First, process delete files separately
delete_ids = process_pod_delete_files()

# Get all .mrc.gz files
marc_gz_files = get_all_marc_gz_files()

# Filter out already processed files
files_to_process = [f for f in marc_gz_files if os.path.basename(f) not in processed_files]
print(f"\nFiles to process: {len(files_to_process)} (skipping {len(processed_files)} already processed)")

# Check if processing is complete
if len(files_to_process) == 0 and len(processed_files) > 0:
    print("\n✓ All MARC files have been processed!")
    print(f"Total processed: {len(processed_files)} files")
    print("Use the next cells to split and export the data.")
else:
    # Open existing institution writers in append mode
    institution_writers = {}
    
    # Process files
    print("\n=== PROCESSING MARC UPDATE FILES ===")
    print("Note: POD provides all records in MARC files as updates")
    print("Delete records are provided separately in .del.txt files\n")
    
    for i, file_path in enumerate(files_to_process):
        filename = os.path.basename(file_path)
        print(f"Processing {i+1}/{len(files_to_process)}: {filename}")
        
        try:
            # Read the gzipped MARC file
            with gzip.open(file_path, 'rb') as gz_file:
                marc_data = gz_file.read()
            
            # Parse MARC records
            reader = MARCReader(marc_data, to_unicode=True, force_utf8=True, utf8_handling='replace')
            
            # Extract institution from filename
            institution = extract_institution_from_filename(filename)
            
            # Create/open writer for this institution
            if institution not in institution_writers:
                output_file = os.path.join(output_dir, f"{institution}_updates.mrc")
                # Open in append mode if file exists
                mode = 'ab' if os.path.exists(output_file) else 'wb'
                output_file_handle = open(output_file, mode)
                institution_writers[institution] = {
                    'file': output_file_handle,
                    'writer': MARCWriter(output_file_handle),
                    'count': 0,
                    'filename': output_file
                }
            
            # Process records - ALL records in POD MARC files are updates
            file_records = 0
            file_updates = 0
            
            for record in reader:
                if record is None:
                    continue
                
                file_records += 1
                file_updates += 1
                
                # Extract record ID from 001 field for logging
                record_id = record['001'].data if '001' in record else None
                
                try:
                    institution_writers[institution]['writer'].write(record)
                    institution_writers[institution]['count'] += 1
                except Exception as e:
                    print(f"  Warning: Could not write record {record_id}: {e}")
            
            # Update totals
            total_records += file_records
            total_updates += file_updates
            
            print(f"  Processed: {file_records} update records")
            
            # Mark file as processed and save progress
            processed_files.add(filename)
            save_progress()
            
        except Exception as e:
            print(f"  Error processing {filename}: {e}")
            errors.append({'file': filename, 'error': str(e)})
            save_progress()
    
    # Close all writers
    print("\n=== CLOSING FILES ===")
    for institution, writer_info in institution_writers.items():
        writer_info['writer'].close()
        writer_info['file'].close()
        print(f"  {institution}: {writer_info['count']} records written in this session")
    
    # Final summary
    print(f"\n=== PROCESSING SUMMARY ===")
    print(f"Total MARC files processed: {len(processed_files)}")
    print(f"Total update records found: {total_records:,}")
    print(f"Files with errors: {len(errors)}")
    if delete_ids:
        print(f"Delete IDs from .del.txt files: {len(delete_ids):,}")
    
    if len(files_to_process) == 0:
        print("\nNo more files to process in this run.")
        print(f"Total processed so far: {len(processed_files)}/{len(marc_gz_files)}")
        
        if len(processed_files) == len(marc_gz_files):
            print("\n=== ALL MARC FILES COMPLETELY PROCESSED! ===")
            print("Progress file has been kept for safety.")
            print("To manually remove progress file and allow fresh processing, run:")
            print(f"  os.remove('{progress_file}')")
        else:
            print("\nProcessing appears incomplete. Check for errors above.")
    else:
        print(f"\nProcessed {len(processed_files)} files so far.")
        print(f"Progress saved. You can safely restart if needed.")
    
    print(f"\nInstitution-specific update records saved to {output_dir}")

# MARC Processing Validation

## Purpose
This cell performs a comprehensive validation check of the MARC processing pipeline output. It ensures that the institution consolidation step completed successfully and provides detailed statistics about the processed data.

## What It Does

### 1. **Directory Verification**
- Checks if the processed directory (`pod-processing-outputs/processed`) exists
- Verifies that MARC consolidation has been run

### 2. **Institution File Analysis**
- Locates all `*_updates.mrc` files (one per institution)
- Counts MARC records in each file using pymarc
- Calculates file sizes
- Aggregates statistics by institution
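
The validation cell counts records by parsing each file with pymarc. As a lightweight cross-check that avoids parsing, one could count record terminator bytes instead: in ISO 2709 / MARC 21, every record ends with the byte `0x1D`. The sketch below is an assumption-laden alternative, not what the notebook does; `count_marc_records` is a hypothetical name and the sample bytes are synthetic placeholders rather than real MARC.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 / MARC 21 end-of-record byte

def count_marc_records(stream, chunk_size=1024 * 1024):
    """Count record terminators in a binary stream, reading in chunks."""
    count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        count += chunk.count(RECORD_TERMINATOR)
    return count

# Synthetic stand-in for a MARC file: three placeholder "records"
fake_marc = io.BytesIO(b"rec-one\x1drec-two\x1drec-three\x1d")
print(count_marc_records(fake_marc))  # 3
```

For a real file, pass `open(path, "rb")` instead of the `BytesIO` object. A count that disagrees with the pymarc total may point to truncated or malformed records, though it does not say which ones.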

### 3. **Auxiliary File Inspection**
- Identifies non-MARC files in the processed directory:
  - `delete_ids.txt` - Lists record IDs to be deleted
  - `processing_progress.json` - Tracks processing state
  - Any other generated files
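
To inspect the progress file on its own, something like the following could be used. `summarize_progress` is a hypothetical helper; the keys it reads (`processed_files`, `total_records`, `errors`) are the ones the processing cell writes, and the demo values are made up.

```python
import json
import os
import tempfile

def summarize_progress(progress_path):
    """Return a small summary dict, or None if no progress file exists."""
    if not os.path.exists(progress_path):
        return None
    with open(progress_path, "r", encoding="utf-8") as f:
        progress = json.load(f)
    return {
        "files_processed": len(progress.get("processed_files", [])),
        "total_records": progress.get("total_records", 0),
        "files_with_errors": len(progress.get("errors", [])),
    }

# Demo against a throwaway file with made-up values
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processing_progress.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"processed_files": ["yale_a.mrc.gz"], "total_records": 10,
                   "total_updates": 10, "errors": []}, f)
    summary = summarize_progress(path)
print(summary)
```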

### 4. **Source Directory Check**
- Verifies the work directory still contains source `.mrc.gz` files
- Confirms data integrity throughout the pipeline

### 5. **Spark Status Verification**
- Checks if SparkContext is still active
- Reports Spark application details
- Ensures resources are available for next steps

## Output Information

### Success Indicators (✓)
- Processed directory exists with institution files
- Source files are still available
- Spark is active and ready

### Warning Indicators (✗)
- Missing directories or files
- Processing errors
- Spark unavailable

## Key Metrics Reported
- **Per Institution**: Record count and file size
- **Total**: Aggregate records across all institutions
- **Delete IDs**: Number of records marked for deletion
- **Processing State**: Progress tracking information

## Next Steps
Based on validation results:
- ✅ If successful → Proceed to export with delete filtering
- ❌ If failed → Return to processing cell and investigate errors

## Example Output

    ✓ Processed directory contains 13 institution update files
      harvard_updates.mrc: 1,234,567 records (543.21 MB)
      yale_updates.mrc: 987,654 records (432.10 MB)
      ...
    Total records across all files: 15,678,901

This validation ensures data integrity before the final export step where delete filtering is applied.

In [None]:
# Validate processed output files 
import os
import glob
from pymarc import MARCReader

# Check processed directory (absolute path, so it does not depend on the current directory)
processed_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'processed')

print("=== VALIDATION SUMMARY ===")
print("This validation checks the processed institution files")
print("(No splitting step - working with full institution files)\n")

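# Default to an empty list so the NEXT STEPS check still works if the directory is missing
update_files = []

# Check if directories exist
if not os.path.exists(processed_dir):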
    print(f"✗ Processed directory does not exist: {processed_dir}")
    print("  Run the processing cell first!")
else:
    # Look for institution update files
    update_files = glob.glob(os.path.join(processed_dir, '*_updates.mrc'))
    print(f"✓ Processed directory contains {len(update_files)} institution update files")
    
    # Group by institution for better overview
    institution_counts = {}
    total_records = 0
    
    for file_path in update_files:
        file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
        institution = os.path.basename(file_path).replace('_updates.mrc', '')
        
        # Count records in file
        try:
            with open(file_path, 'rb') as marc_file:
                reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                record_count = sum(1 for record in reader if record is not None)
                total_records += record_count
                
                institution_counts[institution] = {
                    'files': 1, 
                    'records': record_count,
                    'size_mb': file_size
                }
                
                print(f"  {os.path.basename(file_path)}: {record_count:,} records ({file_size:.2f} MB)")
        except Exception as e:
            print(f"  {os.path.basename(file_path)}: Error reading file - {e}")
    
    print(f"\nTotal records across all files: {total_records:,}")
    print("\nBy institution:")
    for inst, counts in sorted(institution_counts.items()):
        print(f"  {inst}: {counts['records']:,} records ({counts['size_mb']:.2f} MB)")

# Check for other important files
print(f"\n=== OTHER FILES IN PROCESSED DIRECTORY ===")
if os.path.exists(processed_dir):
    all_files = os.listdir(processed_dir)
    other_files = [f for f in all_files if not f.endswith('_updates.mrc')]
    
    for file in sorted(other_files):
        file_path = os.path.join(processed_dir, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
            if file.endswith('.txt'):
                # Count lines in text files (like delete_ids.txt)
                with open(file_path, 'r') as f:
                    line_count = sum(1 for line in f if line.strip())
                print(f"  {file} ({file_size:.2f} MB, {line_count:,} lines)")
            elif file.endswith('.json'):
                print(f"  {file} ({file_size:.2f} MB) - Progress tracking")
            else:
                print(f"  {file} ({file_size:.2f} MB)")

# Check source directory
print(f"\n=== SOURCE DIRECTORY ===")
source_dir = 'pod-processing-outputs/work/marc'
if os.path.exists(source_dir):
    source_files = glob.glob(os.path.join(source_dir, '*.mrc.gz'))
    print(f"✓ Source directory contains {len(source_files)} .mrc.gz files")
else:
    print(f"✗ Source directory does not exist: {source_dir}")

print(f"\n=== SPARK STATUS ===")
try:
    executor_info = sc.statusTracker().getExecutorInfos()
    print(f"✓ SparkContext is active")
    print(f"  Application ID: {sc.applicationId}")
    print(f"  Executors: {len(executor_info)}")
except Exception:
    print("✗ SparkContext not available")

print(f"\n=== NEXT STEPS ===")
if update_files:
    print("✓ Data is ready for export with delete filtering")
    print("  Run the export cell to create filtered output")
    print("  Delete filtering will be applied during export")
else:
    print("✗ No processed data ready for export")
    print("  Make sure to run the processing cell first")

print(f"\nValidation complete.")

# Export with delete filtering 
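
The export cell below unions every delete ID it can find into one set and drops any record whose `001` value is an exact match. The following is a minimal, stdlib-only sketch of that idea, not the notebook's implementation: `load_delete_ids` and `partition_by_deletes` are hypothetical helper names, the IDs are made up, and plain strings stand in for MARC control numbers.

```python
import os
import tempfile


def load_delete_ids(paths):
    """Union the non-blank, stripped lines of several delete files into one set."""
    ids = set()
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            ids.update(line.strip() for line in f if line.strip())
    return ids


def partition_by_deletes(control_numbers, delete_ids):
    """Split control numbers into (kept, dropped) by exact match against delete_ids."""
    kept = [cn for cn in control_numbers if cn not in delete_ids]
    dropped = [cn for cn in control_numbers if cn in delete_ids]
    return kept, dropped


# Demo with two throwaway delete files; the IDs are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.del.txt")
    second = os.path.join(tmp, "b.del.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("a100\n\na200\n")          # blank line is ignored
    with open(second, "w", encoding="utf-8") as f:
        f.write("a200\n  a300  \n")        # duplicate ID and padded ID
    delete_ids = load_delete_ids([first, second])

kept, dropped = partition_by_deletes(["a100", "a150", "a300", "a400"], delete_ids)
print(sorted(delete_ids))
print(kept, dropped)
```

Because the set is shared across all institutions, an ID listed in one institution's delete file also removes a record with the same `001` value in any other institution's file.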


In [None]:
# Export with delete filtering 
import os
import glob
import shutil
from pymarc import MARCReader, MARCWriter
from datetime import datetime
import json

# Define base directory and subdirectories
base_dir = 'pod-processing-outputs'
processed_dir = os.path.join(base_dir, 'processed')
export_dir = os.path.join(base_dir, 'export')
archive_dir = os.path.join(base_dir, 'archive')
work_dir = os.path.join(base_dir, 'work', 'marc')

# Create directories if they don't exist
for directory in [export_dir, processed_dir, archive_dir]:
    os.makedirs(directory, exist_ok=True)

def find_and_process_pod_delete_files():
    """
    Find and process all POD delete files (.del.txt) to ensure we have all deletes
    """
    print("=== SEARCHING FOR POD DELETE FILES (.del.txt) ===")
    
    delete_patterns = [
        "pod_*/**/*.del.txt",
        "pod-processing-outputs/work/**/*.del.txt"
    ]
    
    all_pod_delete_ids = set()
    delete_file_count = 0
    
    for pattern in delete_patterns:
        delete_files = glob.glob(pattern, recursive=True)
        for del_file in delete_files:
            delete_file_count += 1
            try:
                with open(del_file, 'r') as f:
                    file_deletes = [line.strip() for line in f if line.strip()]
                    all_pod_delete_ids.update(file_deletes)
                    print(f"  {os.path.basename(del_file)}: {len(file_deletes)} IDs")
            except Exception as e:
                print(f"  Error reading {del_file}: {e}")
    
    print(f"\nFound {delete_file_count} POD delete files")
    print(f"Total unique delete IDs from POD files: {len(all_pod_delete_ids)}")
    
    return all_pod_delete_ids

def reconcile_delete_ids():
    """
    Reconcile delete IDs from different sources
    """
    # Get POD delete IDs
    pod_delete_ids = find_and_process_pod_delete_files()
    
    # Check existing delete_ids.txt
    existing_delete_ids = set()
    delete_ids_file = os.path.join(processed_dir, 'delete_ids.txt')
    
    if os.path.exists(delete_ids_file):
        with open(delete_ids_file, 'r') as f:
            existing_delete_ids = {line.strip() for line in f if line.strip()}
        print(f"\nExisting delete_ids.txt contains: {len(existing_delete_ids)} IDs")
    
    # Combine all delete IDs
    all_delete_ids = pod_delete_ids.union(existing_delete_ids)
    
    print(f"\n=== DELETE ID RECONCILIATION ===")
    print(f"From POD .del.txt files: {len(pod_delete_ids)}")
    print(f"From existing delete_ids.txt: {len(existing_delete_ids)}")
    print(f"Total unique delete IDs: {len(all_delete_ids)}")
    
    # Update delete_ids.txt with all IDs
    if len(all_delete_ids) > len(existing_delete_ids):
        print(f"\nUpdating delete_ids.txt with {len(all_delete_ids) - len(existing_delete_ids)} additional IDs")
        with open(delete_ids_file, 'w') as f:
            for delete_id in sorted(all_delete_ids):
                f.write(f"{delete_id}\n")
    
    return all_delete_ids

def create_export_package(filter_deletes=True):
    """
    Create a final export package with all processed files (NO SPLITTING VERSION)
    
    Args:
        filter_deletes: If True, filter out records matching delete IDs
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    export_subdir = os.path.join(export_dir, f"marc_export_{timestamp}")
    os.makedirs(export_subdir, exist_ok=True)
    
    # Reconcile and load all delete IDs
    delete_ids = set()
    if filter_deletes:
        delete_ids = reconcile_delete_ids()
        if delete_ids:
            print(f"\nUsing {len(delete_ids)} delete IDs for filtering")
        else:
            print("No delete IDs found - proceeding without filtering")
            filter_deletes = False
    
    # Process institution update files from processed directory
    total_filtered = 0
    update_files = glob.glob(os.path.join(processed_dir, '*_updates.mrc'))
    
    if update_files:
        print(f"\n=== PROCESSING {len(update_files)} INSTITUTION FILES ===")
        
        for input_file in update_files:
            institution = os.path.basename(input_file).replace('_updates.mrc', '')
            output_file = os.path.join(export_subdir, f"{institution}_filtered.mrc")
            
            if filter_deletes and delete_ids:
                # Filter while copying
                print(f"Processing {institution}...")
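                # Open input and output together so both handles are closed even if a record fails mid-file
                with open(input_file, 'rb') as marc_file, open(output_file, 'wb') as out_file:
                    reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                    writer = MARCWriter(out_file)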
                    
                    file_records = 0
                    file_filtered = 0
                    
                    for record in reader:
                        if record is None:
                            continue
                        
                        file_records += 1
                        
                        # Check if record should be filtered
                        if '001' in record and record['001'].data in delete_ids:
                            file_filtered += 1
                            total_filtered += 1
                        else:
                            writer.write(record)
                    
                    writer.close()
                    
                    print(f"  {institution}: {file_records} records, filtered {file_filtered}")
            else:
                # Just copy without filtering
                shutil.copy2(input_file, export_subdir)
                # Rename to indicate it's unfiltered
                os.rename(
                    os.path.join(export_subdir, os.path.basename(input_file)),
                    os.path.join(export_subdir, f"{institution}_unfiltered.mrc")
                )
        
        if filter_deletes and total_filtered > 0:
            print(f"\nTotal records filtered across all files: {total_filtered}")
        print(f"Processed {len(update_files)} institution files to export package")
    else:
        print("\n✗ No institution update files found in processed directory!")
        return None
    
    # Copy delete IDs file for reference
    delete_ids_file = os.path.join(processed_dir, 'delete_ids.txt')
    if os.path.exists(delete_ids_file):
        shutil.copy2(delete_ids_file, export_subdir)
        print("Copied delete IDs file to export package for reference")
    
    # Create detailed summary file
    summary_file = os.path.join(export_subdir, 'processing_summary.txt')
    with open(summary_file, 'w') as f:
        f.write(f"MARC Processing Summary (No Splitting Version)\n")
        f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"=" * 50 + "\n\n")
        
        f.write(f"Delete ID Sources:\n")
        f.write(f"  - POD .del.txt files processed\n")
        f.write(f"  - Total delete IDs: {len(delete_ids)}\n\n")
        
        if filter_deletes and delete_ids:
            f.write(f"Delete filtering: ENABLED\n")
            f.write(f"Delete IDs used: {len(delete_ids)}\n")
            f.write(f"Records filtered: {total_filtered}\n\n")
        else:
            f.write(f"Delete filtering: DISABLED\n\n")
        
        # Count records and files by institution
        f.write(f"Institution Breakdown:\n")
        f.write(f"-" * 30 + "\n")
        
        institution_stats = {}
        total_records = 0
        total_files = 0
        
        for file in glob.glob(os.path.join(export_subdir, '*.mrc')):
            total_files += 1
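            # Strip only the trailing '_filtered'/'_unfiltered' suffix so keys containing underscores stay intact
            institution = os.path.basename(file).rsplit('_', 1)[0]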
            
            if institution not in institution_stats:
                institution_stats[institution] = {'files': 0, 'records': 0}
            
            try:
                with open(file, 'rb') as marc_file:
                    reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                    record_count = sum(1 for record in reader if record is not None)
                    total_records += record_count
                    institution_stats[institution]['files'] += 1
                    institution_stats[institution]['records'] += record_count
                    f.write(f"{os.path.basename(file)}: {record_count:,} records\n")
            except Exception as e:
                f.write(f"{os.path.basename(file)}: Error reading file - {e}\n")
        
        f.write(f"\n{'-' * 30}\n")
        f.write(f"Total: {total_records:,} records in {total_files} files\n\n")
        
        # Institution summary
        f.write(f"By Institution:\n")
        for inst, stats in sorted(institution_stats.items()):
            f.write(f"  {inst.upper()}: {stats['records']:,} records in {stats['files']} file(s)\n")
        
        # Delete IDs info
        if os.path.exists(os.path.join(export_subdir, 'delete_ids.txt')):
            with open(os.path.join(export_subdir, 'delete_ids.txt'), 'r') as del_file:
                delete_count = sum(1 for line in del_file if line.strip())
            f.write(f"\nDelete IDs in reference file: {delete_count:,}\n")
    
    print(f"\nExport package created: {export_subdir}")
    return export_subdir

# Main execution
print("=== EXPORT WITH DELETE FILTERING (NO SPLITTING) ===")

# Create export package with delete filtering enabled
export_path = create_export_package(filter_deletes=True)

if export_path:
    # List export contents
    export_files = os.listdir(export_path)
    print(f"\n=== EXPORT PACKAGE CONTENTS ===")
    for file in sorted(export_files):
        file_path = os.path.join(export_path, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
            print(f"  {file} ({file_size:.2f} MB)")

    print(f"\n=== PROCESSING COMPLETE ===")
    print(f"Final export package: {export_path}")
    print(f"Ready for conversion to Parquet format")

    # Display final statistics
    summary_file = os.path.join(export_path, 'processing_summary.txt')
    if os.path.exists(summary_file):
        print(f"\nKey Statistics from Processing Summary:")
        with open(summary_file, 'r') as f:
            for line in f:
                if line.strip() and any(x in line for x in ["Total:", "Delete IDs used:", "Records filtered:", "By Institution:"]):
                    print(f"  {line.strip()}")
                    if "By Institution:" in line:
                        # Print institution lines until the blank line (or EOF) that ends the list
                        for inst_line in f:
                            if not inst_line.strip():
                                break
                            print(f"  {inst_line.strip()}")

# Keep SparkContext active for any additional processing
print(f"\nSparkContext remains active for additional processing if needed")
print(f"To stop Spark when completely done, run: sc.stop()")

# Final Export Step: Copy Filtered Files to Final Directory

## Purpose
This cell copies all filtered MARC files from the timestamped export directory to a standardized `final` directory for streamlined processing in the next pipeline stage.

## What It Does
- Copies every per-institution `.mrc` file from the timestamped export package
- Places them in the `pod-processing-outputs/final/` directory
- Skips non-MARC files (like `.txt` summaries)
- Provides a clean, consistent location for `pod-processing.ipynb` to read from

## Why This Step
- **Consistency**: The `final` directory always contains the latest processed files
- **Simplicity**: Next notebook doesn't need to navigate timestamped directories
- **Clean Interface**: Only MARC files are copied, excluding metadata files

## Output
Files are copied as:
- `harvard_filtered.mrc`
- `yale_filtered.mrc`
- `princeton_filtered.mrc`
- etc.

The copied files are then ready for conversion to Parquet format in `pod-processing.ipynb`.
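
As a rough illustration of the fixed location this step provides, a downstream notebook could discover its inputs with a single glob. This is only a sketch: `list_final_marc_files` is a hypothetical helper, and it assumes the `{institution}_filtered.mrc` naming shown in the list above; it is not taken from `pod-processing.ipynb`.

```python
import glob
import os

FILTERED_SUFFIX = "_filtered.mrc"


def list_final_marc_files(final_dir="pod-processing-outputs/final"):
    """Map institution key -> path for every *_filtered.mrc file in final_dir."""
    files = {}
    for path in sorted(glob.glob(os.path.join(final_dir, "*" + FILTERED_SUFFIX))):
        name = os.path.basename(path)
        # Drop the fixed suffix rather than splitting on '_' so that
        # institution keys containing underscores are preserved.
        institution = name[: -len(FILTERED_SUFFIX)]
        files[institution] = path
    return files


# Returns an empty dict if the directory does not exist yet.
for institution, path in list_final_marc_files().items():
    print(f"{institution}: {path}")
```

Keying on the fixed suffix means metadata files and any `_unfiltered.mrc` copies in the same directory are ignored.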

In [None]:
# Copy filtered files to final directory for streamlined processing
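import os
import shutil
import glob

# final_dir is assumed to be defined in an earlier cell; if it is not,
# fall back to the location documented above (pod-processing-outputs/final)
if 'final_dir' not in globals():
    final_dir = os.path.join('pod-processing-outputs', 'final')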

if export_path:
    print("\n=== COPYING TO FINAL DIRECTORY ===")
    
    # Ensure final directory exists
    os.makedirs(final_dir, exist_ok=True)
    
    # Copy all filtered MARC files to final directory
    marc_files = glob.glob(os.path.join(export_path, '*.mrc'))
    
    # The '*.mrc' glob above already excludes .txt summaries, so no extension check is needed
    for marc_file in marc_files:
        dest = os.path.join(final_dir, os.path.basename(marc_file))
        shutil.copy2(marc_file, dest)
        print(f"  Copied: {os.path.basename(marc_file)}")
    
    print(f"\n✓ {len(marc_files)} files copied to {final_dir}")
    print("Files are now ready for pod-processing.ipynb")