# Multi-Institution MARC Data Collection Setup

## Overview
This notebook collects MARC data from all Ivy Plus libraries. The Ivy Plus consortium includes:

- Brown University
- University of Chicago  
- Columbia University
- Cornell University
- Dartmouth College
- Duke University
- Harvard University
- Johns Hopkins University
- Massachusetts Institute of Technology
- University of Pennsylvania
- Princeton University
- Stanford University
- Yale University

## Data Source
POD
```
https://pod.stanford.edu/organizations/{org_code}/streams/{stream_date}/normalized_resourcelist/marc21
```

## Usage Examples

```python
# Sync data for a single institution
sync_institution_data('harvard')  # Uses stream date from config

# Sync data for all active institutions (incremental sync)
sync_all_active_institutions()

# Force re-download all files for an institution
sync_institution_data('yale', force_refresh=True)

# Use a different stream date
sync_institution_data('yale', stream_date='2023-01-01')

# Check current data status
for institution_key in ivy_plus_config:
    if not ivy_plus_config[institution_key]['active']:
        continue
    pod_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
    if os.path.exists(pod_dir):
        marc_files = glob.glob(os.path.join(pod_dir, "**/*.mrc.gz"), recursive=True)
        print(f"{institution_key}: {len(marc_files)} MARC files")
```

## Data Organization
- **Raw Downloads**: Each institution's data is stored in `pod_{institution}` directories
- **Work Directory**: Files are copied to `pod-processing-outputs/work/marc/` with institution prefix
- **Processed Files**: Institution-specific MARC files in `pod-processing-outputs/processed/`
- **Export Directory**: Final filtered files in timestamped subdirectories

## Processing Workflow
1. **Sync Data**: Download/update MARC and delete files from POD
2. **Organize Files**: Copy MARC files to work directory with institution prefixes
3. **Process Records**: Extract and consolidate MARC records by institution
4. **Export with Filtering**: Apply delete filtering and create final export package

## Key Features
- **Incremental Sync**: Only downloads new/updated files (unless `force_refresh=True`)
- **Delete Filtering**: Automatically processes `.del.txt` files to filter out deleted records
- **Progress Tracking**: Saves progress to allow resuming interrupted processing
- **Institution Prefixing**: Prevents filename conflicts between institutions

## Notes
- Harvard's collection is particularly large (~43GB, 962 files)
- Processing uses PySpark for efficient handling of large datasets
- A single POD access token provides authentication for all institutions
- Stream dates vary by institution (configured in `ivy_plus_config`)

In [1]:
!pip install setuptools



In [2]:
!pip install pyspark



In [3]:
!pip install resync



In [4]:

import os

# Configuration for paths - adjust this when moving between systems
VOLUME_BASE = '/Volumes/Samsung_T5/pod-pyspark-notbook'
os.chdir(VOLUME_BASE)  # Set working directory to the new volume

# Define all paths relative to the new volume
base_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs')
work_dir = os.path.join(base_dir, 'work', 'marc')
processed_dir = os.path.join(base_dir, 'processed')
final_dir = os.path.join(base_dir, 'final')
export_dir = os.path.join(base_dir, 'export')
archive_dir = os.path.join(base_dir, 'archive')

# Create directories if they don't exist
for directory in [base_dir, work_dir, processed_dir, final_dir, export_dir, archive_dir]:
    os.makedirs(directory, exist_ok=True)

print(f"Volume base: {VOLUME_BASE}")
print(f"Working directory: {os.getcwd()}")
print("All directories created/verified")

Volume base: /Volumes/Samsung_T5/pod-pyspark-notbook
Working directory: /Volumes/Samsung_T5/pod-pyspark-notbook
All directories created/verified


In [5]:
# Add this verification cell after your path setup cell (after cell id="d2a40ed6")
# This ensures we're starting with a clean slate

print("=== VERIFYING CLEAN STATE ===")

# Check processed directory is empty
processed_files = os.listdir(processed_dir) if os.path.exists(processed_dir) else []
if processed_files:
    print(f"⚠️ WARNING: Processed directory is not empty! Contains {len(processed_files)} files")
    print("Files found:", processed_files[:5], "..." if len(processed_files) > 5 else "")
    print("\nTo ensure clean processing, run:")
    print(f"  shutil.rmtree('{processed_dir}')")
    print(f"  os.makedirs('{processed_dir}')")
else:
    print("✓ Processed directory is clean")

# Check for any old export directories
if os.path.exists(export_dir):
    export_subdirs = [d for d in os.listdir(export_dir) if os.path.isdir(os.path.join(export_dir, d))]
    if export_subdirs:
        print(f"\n✓ Found {len(export_subdirs)} previous export(s)")
        print("  These won't interfere with new processing")
else:
    print("\n✓ No previous exports found")

print("\n✓ Ready for fresh processing run!")

=== VERIFYING CLEAN STATE ===
Files found: ['columbia_updates.mrc', 'duke_updates.mrc', 'delete_ids.txt', 'dartmouth_updates.mrc', 'yale_updates.mrc'] ...

To ensure clean processing, run:
  shutil.rmtree('/Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/processed')
  os.makedirs('/Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/processed')

✓ Found 2 previous export(s)
  These won't interfere with new processing

✓ Ready for fresh processing run!


In [6]:

import json
import glob

print("=== VERIFYING SETUP ===")
print(f"Volume base: {VOLUME_BASE}")
print(f"Current directory: {os.getcwd()}")

# Check directories
for name, path in [
    ('Work dir', work_dir),
    ('Processed dir', processed_dir),
    ('Final dir', final_dir)
]:
    exists = "✓" if os.path.exists(path) else "✗"
    print(f"{exists} {name}: {path}")

# Check for progress file
progress_file = os.path.join(processed_dir, 'processing_progress.json')
if os.path.exists(progress_file):
    with open(progress_file, 'r') as f:
        progress = json.load(f)
    print(f"\n✓ Progress file found:")
    print(f"  - Files processed: {len(progress.get('processed_files', []))}")
    print(f"  - Total records: {progress.get('total_records', 0):,}")
else:
    print(f"\n✗ No progress file - will start fresh")

# Check source files
marc_files = glob.glob(os.path.join(work_dir, '*.mrc.gz'))
print(f"\nSource files: {len(marc_files)} .mrc.gz files in work directory")

=== VERIFYING SETUP ===
Volume base: /Volumes/Samsung_T5/pod-pyspark-notbook
Current directory: /Volumes/Samsung_T5/pod-pyspark-notbook
✓ Work dir: /Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/work/marc
✓ Processed dir: /Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/processed
✓ Final dir: /Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/final

✓ Progress file found:
  - Files processed: 8120
  - Total records: 160,234,737

Source files: 8128 .mrc.gz files in work directory


In [7]:
# Configuration for Ivy Plus Libraries
# This dictionary contains the organization codes and basic info for each institution
# The access token appears to be system-wide for POD, not institution-specific

# Common POD access token (appears to work for all institutions)
POD_ACCESS_TOKEN = "eyJhbGciOiJIUzI1NiJ9.eyJqdGkiOiIzMTBlMDg1OTg1NWJmNjFjY2RhZTA1NzhjNTM3M2Q4NyIsInNjb3BlIjoiZG93bmxvYWQiLCJpc3MiOiJQT0QiLCJuYW1lIjoiZG93bmxvYWRpbmcifQ.iHKGTrdKjs4mtBBQwcl6lvU7psyciFLLT5b2nTM-lh0"

ivy_plus_config = {
    'brown': {
        'name': 'Brown University',
        'org_code': 'brown',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'brown_2022-05-05',
    },
    'chicago': {
        'name': 'University of Chicago',
        'org_code': 'chicago',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2022-06-18',
    },
    'columbia': {
        'name': 'Columbia University',
        'org_code': 'columbia',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'prod',
    },
    'cornell': {
        'name': 'Cornell University',
        'org_code': 'cornell',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2023-04',
    },
    'dartmouth': {
        'name': 'Dartmouth College',
        'org_code': 'dartmouth',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '202208',
    },
    'duke': {
        'name': 'Duke University',
        'org_code': 'duke',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2025-01-15',
    },
    'harvard': {
        'name': 'Harvard University',
        'org_code': 'harvard',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '20220204-with-items',
    },
    'johns_hopkins': {
        'name': 'Johns Hopkins University',
        'org_code': 'jhu',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'jhu',
    },
    'mit': {
        'name': 'Massachusetts Institute of Technology',
        'org_code': 'mit',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'prod-2022-06-08',
    },
    'penn': {
        'name': 'University of Pennsylvania',
        'org_code': 'penn',
        'access_token': POD_ACCESS_TOKEN,
        'active': True,  # Currently working and verified
        'stream_date': '2022-07-18',
    },
    'princeton': {
        'name': 'Princeton University',
        'org_code': 'princeton',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': 'princeton-prod-0223',
    },
    'stanford': {
        'name': 'Stanford University',
        'org_code': 'stanford',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2024-08-27',
    },
    'yale': {
        'name': 'Yale University',
        'org_code': 'yale',  
        'access_token': POD_ACCESS_TOKEN,
        'active': True,
        'stream_date': '2025-07-alma',
    }
}

# Base URLs for POD system
POD_BASE_URL = 'https://pod.stanford.edu/'
SITEMAP_BASE_URL = 'https://pod.stanford.edu/organizations/{org}/streams/{stream}/normalized_resourcelist/marc21'

# Default stream (you may need to adjust this for each institution)
DEFAULT_STREAM = '2022-07-18'

print("Ivy Plus Libraries Configuration (with shared POD token):")
for key, config in ivy_plus_config.items():
    status = "✓ Active" if config['active'] else "✗ Inactive (need to verify org_code)"
    print(f"  {config['name']}: {status}")

print(f"\nNOTE: All institutions now use the same POD access token.")
print(f"To activate an institution, verify its org_code and set active=True")
print(f"Use test_all_organization_codes() to verify org_codes")

Ivy Plus Libraries Configuration (with shared POD token):
  Brown University: ✓ Active
  University of Chicago: ✓ Active
  Columbia University: ✓ Active
  Cornell University: ✓ Active
  Dartmouth College: ✓ Active
  Duke University: ✓ Active
  Harvard University: ✓ Active
  Johns Hopkins University: ✓ Active
  Massachusetts Institute of Technology: ✓ Active
  University of Pennsylvania: ✓ Active
  Princeton University: ✓ Active
  Stanford University: ✓ Active
  Yale University: ✓ Active

NOTE: All institutions now use the same POD access token.
To activate an institution, verify its org_code and set active=True
Use test_all_organization_codes() to verify org_codes


In [8]:
# Import shutil for the organize function later
import shutil
import subprocess

def sync_institution_data(institution_key, stream_date=None, output_base_dir=None, force_refresh=False):
    """
    Sync MARC data for a specific institution using resync-sync.
    By default, only downloads missing or updated files.
    
    Args:
        institution_key: Institution identifier from ivy_plus_config
        stream_date: Override stream date (optional)
        output_base_dir: Base directory for output (optional)
        force_refresh: If True, forces re-download of all files
    """
    if institution_key not in ivy_plus_config:
        print(f"Error: Institution '{institution_key}' not found in configuration")
        return False

    config = ivy_plus_config[institution_key]

    if not config['active']:
        print(f"Warning: Institution '{config['name']}' is not active")
        return False

    # Use institution-specific stream_date if present, else argument, else default
    effective_stream_date = stream_date or config.get('stream_date') or DEFAULT_STREAM

    # Set up output directory
    if output_base_dir is None:
        output_base_dir = VOLUME_BASE

    output_dir = os.path.join(output_base_dir, f"pod_{institution_key}")

    # Build sitemap URL
    sitemap_url = SITEMAP_BASE_URL.format(
        org=config['org_code'],
        stream=effective_stream_date
    )

    # Check existing data before sync
    existing_marc = 0
    existing_del = 0
    if os.path.exists(output_dir):
        existing_marc = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
        existing_del = len(glob.glob(os.path.join(output_dir, "**/*.del.txt"), recursive=True))
        if existing_marc > 0 or existing_del > 0:
            print(f"Found existing data: {existing_marc} MARC files, {existing_del} delete files")

    # Build command - resync-sync will handle incremental updates
    command = [
        'resync-sync',
        '-v',
        '--sitemap', sitemap_url,
        '--access-token', config['access_token'],
        '-b', POD_BASE_URL,
    ]
    
    # Add force refresh flag if requested
    if force_refresh:
        command.append('--force')
        print("Force refresh enabled - will re-download all files")
    
    command.append(output_dir)

    print(f"Syncing data for {config['name']}...")
    print(f"Stream: {effective_stream_date}")
    print(f"Output: {output_dir}")
    if not force_refresh:
        print("Mode: Incremental (only missing/updated files)")

    try:
        # Run resync-sync
        result = subprocess.run(command)
        
        if result.returncode == 0:
            # Check what was downloaded
            new_marc = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
            new_del = len(glob.glob(os.path.join(output_dir, "**/*.del.txt"), recursive=True))
            
            print(f"✓ Successfully synced {config['name']}")
            print(f"  Total files: {new_marc} MARC, {new_del} delete files")
            
            if existing_marc > 0 or existing_del > 0:
                downloaded_marc = new_marc - existing_marc
                downloaded_del = new_del - existing_del
                if downloaded_marc > 0 or downloaded_del > 0:
                    print(f"  New files: {downloaded_marc} MARC, {downloaded_del} delete files")
                else:
                    print(f"  No new files needed - already up to date")
            
            return True
        else:
            print(f"✗ Error syncing data for {config['name']}")
            return False

    except Exception as e:
        print(f"✗ Exception occurred while syncing {config['name']}: {e}")
        return False

def sync_all_active_institutions(force_refresh=False):
    """
    Sync data for all active institutions, leveraging rsync's incremental capabilities
    """
    print("=== SYNCING ALL ACTIVE INSTITUTIONS ===")
    print("Using rsync incremental mode - only missing/updated files will be downloaded\n")
    
    successful = []
    failed = []
    already_complete = []
    
    for institution_key, config in ivy_plus_config.items():
        if config['active']:
            print(f"\n--- {config['name']} ---")
            
            # Quick check if data already exists
            output_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
            if os.path.exists(output_dir) and not force_refresh:
                marc_count = len(glob.glob(os.path.join(output_dir, "**/*.mrc.gz"), recursive=True))
                if marc_count > 0:
                    print(f"Checking for updates ({marc_count} existing files)...")
            
            if sync_institution_data(institution_key, force_refresh=force_refresh):
                successful.append(institution_key)
            else:
                failed.append(institution_key)
        else:
            print(f"\nSkipping {config['name']} (inactive)")
    
    print("\n=== SYNC SUMMARY ===")
    print(f"Successfully synced: {len(successful)} institutions")
    
    if failed:
        print(f"\nFailed: {len(failed)} institutions")
        for inst in failed:
            print(f"  ✗ {ivy_plus_config[inst]['name']}")
    
    return successful

# === MAIN EXECUTION ===

# First, show current data status
print("=== CURRENT DATA STATUS ===")
total_marc_files = 0
total_del_files = 0
institutions_with_data = []

for institution_key in ivy_plus_config:
    if not ivy_plus_config[institution_key]['active']:
        continue
        
    pod_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
    if os.path.exists(pod_dir):
        marc_files = glob.glob(os.path.join(pod_dir, "**/*.mrc.gz"), recursive=True)
        del_files = glob.glob(os.path.join(pod_dir, "**/*.del.txt"), recursive=True)
        if marc_files or del_files:
            institutions_with_data.append(institution_key)
            total_marc_files += len(marc_files)
            total_del_files += len(del_files)
            print(f"✓ {ivy_plus_config[institution_key]['name']}: {len(marc_files)} MARC, {len(del_files)} delete files")

if institutions_with_data:
    print(f"\nTotal: {len(institutions_with_data)} institutions with data")
    print(f"Files: {total_marc_files} MARC, {total_del_files} delete files")
    print("\n✓ Ready to run incremental sync to check for updates")
else:
    print("\nNo existing data found - will download all files on first sync")


sync_all_active_institutions()

=== CURRENT DATA STATUS ===
✓ Brown University: 1067 MARC, 553 delete files
✓ University of Chicago: 940 MARC, 889 delete files
✓ Columbia University: 1052 MARC, 1040 delete files
✓ Cornell University: 696 MARC, 437 delete files
✓ Dartmouth College: 600 MARC, 372 delete files
✓ Duke University: 117 MARC, 134 delete files
✓ Harvard University: 963 MARC, 706 delete files
✓ Johns Hopkins University: 643 MARC, 505 delete files
✓ Massachusetts Institute of Technology: 825 MARC, 385 delete files
✓ University of Pennsylvania: 147 MARC, 146 delete files
✓ Princeton University: 794 MARC, 609 delete files
✓ Stanford University: 297 MARC, 62 delete files
✓ Yale University: 2 MARC, 1 delete files

Total: 13 institutions with data
Files: 8143 MARC, 5839 delete files

✓ Ready to run incremental sync to check for updates
=== SYNCING ALL ACTIVE INSTITUTIONS ===
Using rsync incremental mode - only missing/updated files will be downloaded


--- Brown University ---
Checking for updates (1067 existing fi

Reading resource list https://pod.stanford.edu/organizations/brown/streams/brown_2022-05-05/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/brown/streams/brown_2022-05-05/normalized_resourcelist/marc21
Parsed as sitemap, 1660 resources
Read source resource list, 1660 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_brown
Status:     NOT IN SYNC (same=1620, to create=40, to update=0, to delete=3)
Will GET 40 resources
created: https://pod.stanford.edu/file/489139/brown-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_brown/file/489139/brown-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489356/brown-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_brown/file/489356/brown-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489358/brown-2025-07-15-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbo

✓ Successfully synced Brown University
  Total files: 1091 MARC, 569 delete files
  New files: 24 MARC, 16 delete files

--- University of Chicago ---
Checking for updates (940 existing files)...
Found existing data: 940 MARC files, 889 delete files
Syncing data for University of Chicago...
Stream: 2022-06-18
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_chicago
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/chicago/streams/2022-06-18/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/chicago/streams/2022-06-18/normalized_resourcelist/marc21
Parsed as sitemap, 1887 resources
Read source resource list, 1887 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_chicago
Status:     NOT IN SYNC (same=1829, to create=58, to update=0, to delete=2)
Will GET 58 resources
created: https://pod.stanford.edu/file/489165/chicago-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_chicago/file/489165/chicago-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489167/chicago-2025-07-14-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_chicago/file/489167/chicago-2025-07-14-delta-deletes.del.txt
created: https://pod.stanford.edu/file/489360/chicago-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pys

✓ Successfully synced University of Chicago
  Total files: 969 MARC, 918 delete files
  New files: 29 MARC, 29 delete files

--- Columbia University ---
Checking for updates (1052 existing files)...
Found existing data: 1052 MARC files, 1040 delete files
Syncing data for Columbia University...
Stream: prod
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_columbia
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/columbia/streams/prod/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/columbia/streams/prod/normalized_resourcelist/marc21
Parsed as sitemap, 2092 resources
Read source resource list, 2092 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_columbia
Status:     NOT IN SYNC (same=2092, to create=0, to update=0, to delete=2)
Will GET 0 resources
nodelete: would delete https://pod.stanford.edu/.DS_Store (--delete to enable)
nodelete: would delete https://pod.stanford.edu/file/.DS_Store (--delete to enable)
Written last timestamp 2025-07-12T17:42:22.124667Z for incremental sync
Status:     PART SYNCED (same=2092, created=0, updated=0, to delete (--delete)=2)


✓ Successfully synced Columbia University
  Total files: 1052 MARC, 1040 delete files
  No new files needed - already up to date

--- Cornell University ---
Checking for updates (696 existing files)...
Found existing data: 696 MARC files, 437 delete files
Syncing data for Cornell University...
Stream: 2023-04
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_cornell
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/cornell/streams/2023-04/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/cornell/streams/2023-04/normalized_resourcelist/marc21
Parsed as sitemap, 1204 resources
Read source resource list, 1204 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_cornell
Status:     NOT IN SYNC (same=1133, to create=71, to update=0, to delete=2)
Will GET 71 resources
created: https://pod.stanford.edu/file/489145/cornell-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_cornell/file/489145/cornell-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489353/cornell-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_cornell/file/489353/cornell-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489628/cornell-2025-07-16-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbo

✓ Successfully synced Cornell University
  Total files: 742 MARC, 462 delete files
  New files: 46 MARC, 25 delete files

--- Dartmouth College ---
Checking for updates (600 existing files)...
Found existing data: 600 MARC files, 372 delete files
Syncing data for Dartmouth College...
Stream: 202208
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_dartmouth
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/dartmouth/streams/202208/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/dartmouth/streams/202208/normalized_resourcelist/marc21
Parsed as sitemap, 972 resources
Read source resource list, 972 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_dartmouth
Status:     NOT IN SYNC (same=972, to create=0, to update=0, to delete=2)
Will GET 0 resources
nodelete: would delete https://pod.stanford.edu/.DS_Store (--delete to enable)
nodelete: would delete https://pod.stanford.edu/file/.DS_Store (--delete to enable)
Written last timestamp 2025-07-12T17:44:52.649427Z for incremental sync
Status:     PART SYNCED (same=972, created=0, updated=0, to delete (--delete)=2)


✓ Successfully synced Dartmouth College
  Total files: 600 MARC, 372 delete files
  No new files needed - already up to date

--- Duke University ---
Checking for updates (117 existing files)...
Found existing data: 117 MARC files, 134 delete files
Syncing data for Duke University...
Stream: 2025-01-15
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_duke
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/duke/streams/2025-01-15/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/duke/streams/2025-01-15/normalized_resourcelist/marc21
Parsed as sitemap, 317 resources
Read source resource list, 317 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_duke
Status:     NOT IN SYNC (same=251, to create=66, to update=0, to delete=2)
Will GET 66 resources
created: https://pod.stanford.edu/file/489614/duke-2025-07-16-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_duke/file/489614/duke-2025-07-16-delta-deletes.del.txt
created: https://pod.stanford.edu/file/489875/duke-2025-07-17-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_duke/file/489875/duke-2025-07-17-delta-deletes.del.txt
created: https://pod.stanford.edu/file/489876/duke-2025-07-17-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_duke/file/48

✓ Successfully synced Duke University
  Total files: 150 MARC, 167 delete files
  New files: 33 MARC, 33 delete files

--- Harvard University ---
Checking for updates (963 existing files)...
Found existing data: 963 MARC files, 706 delete files
Syncing data for Harvard University...
Stream: 20220204-with-items
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_harvard
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/harvard/streams/20220204-with-items/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/harvard/streams/20220204-with-items/normalized_resourcelist/marc21
Parsed as sitemap, 1754 resources
Read source resource list, 1754 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_harvard
Status:     NOT IN SYNC (same=1669, to create=85, to update=0, to delete=2)
Will GET 85 resources
created: https://pod.stanford.edu/file/489173/harvard-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_harvard/file/489173/harvard-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489380/harvard-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_harvard/file/489380/harvard-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489648/harvard-2025-07-16-delta-marc21.mrc.gz -> /Volumes/Sams

✓ Successfully synced Harvard University
  Total files: 1013 MARC, 741 delete files
  New files: 50 MARC, 35 delete files

--- Johns Hopkins University ---
Checking for updates (643 existing files)...
Found existing data: 643 MARC files, 505 delete files
Syncing data for Johns Hopkins University...
Stream: jhu
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_johns_hopkins
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/jhu/streams/jhu/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/jhu/streams/jhu/normalized_resourcelist/marc21
Parsed as sitemap, 1231 resources
Read source resource list, 1231 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_johns_hopkins
Status:     NOT IN SYNC (same=1148, to create=83, to update=0, to delete=3)
Will GET 83 resources
created: https://pod.stanford.edu/file/489157/jhu-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_johns_hopkins/file/489157/jhu-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489364/jhu-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_johns_hopkins/file/489364/jhu-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489608/jhu-2025-07-16-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_johns_hop

✓ Successfully synced Johns Hopkins University
  Total files: 693 MARC, 538 delete files
  New files: 50 MARC, 33 delete files

--- Massachusetts Institute of Technology ---
Checking for updates (825 existing files)...
Found existing data: 825 MARC files, 385 delete files
Syncing data for Massachusetts Institute of Technology...
Stream: prod-2022-06-08
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_mit
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/mit/streams/prod-2022-06-08/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/mit/streams/prod-2022-06-08/normalized_resourcelist/marc21
Parsed as sitemap, 1268 resources
Read source resource list, 1268 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_mit
Status:     NOT IN SYNC (same=1210, to create=58, to update=0, to delete=2)
Will GET 58 resources
created: https://pod.stanford.edu/file/489152/mit-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_mit/file/489152/mit-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489605/mit-2025-07-16-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_mit/file/489605/mit-2025-07-16-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489854/mit-2025-07-17-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_mit/file/489854

✓ Successfully synced Massachusetts Institute of Technology
  Total files: 867 MARC, 401 delete files
  New files: 42 MARC, 16 delete files

--- University of Pennsylvania ---
Checking for updates (147 existing files)...
Found existing data: 147 MARC files, 146 delete files
Syncing data for University of Pennsylvania...
Stream: 2022-07-18
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_penn
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/penn/streams/2022-07-18/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/penn/streams/2022-07-18/normalized_resourcelist/marc21
Parsed as sitemap, 307 resources
Read source resource list, 307 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_penn
Status:     NOT IN SYNC (same=293, to create=14, to update=0, to delete=3)
Will GET 14 resources
created: https://pod.stanford.edu/file/489652/penn-2025-07-16-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_penn/file/489652/penn-2025-07-16-delta-deletes.del.txt
created: https://pod.stanford.edu/file/489653/penn-2025-07-16-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_penn/file/489653/penn-2025-07-16-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/491297/penn-2025-07-23-delta-deletes.del.txt -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_penn/file/4912

✓ Successfully synced University of Pennsylvania
  Total files: 154 MARC, 153 delete files
  New files: 7 MARC, 7 delete files

--- Princeton University ---
Checking for updates (794 existing files)...
Found existing data: 794 MARC files, 609 delete files
Syncing data for Princeton University...
Stream: princeton-prod-0223
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_princeton
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/princeton/streams/princeton-prod-0223/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/princeton/streams/princeton-prod-0223/normalized_resourcelist/marc21
Parsed as sitemap, 1489 resources
Read source resource list, 1489 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_princeton
Status:     NOT IN SYNC (same=1403, to create=86, to update=0, to delete=2)
Will GET 86 resources
created: https://pod.stanford.edu/file/489169/princeton-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_princeton/file/489169/princeton-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489376/princeton-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_princeton/file/489376/princeton-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489633/princeton-2025-07-16-delta-marc21.mrc

✓ Successfully synced Princeton University
  Total files: 844 MARC, 645 delete files
  New files: 50 MARC, 36 delete files

--- Stanford University ---
Checking for updates (297 existing files)...
Found existing data: 297 MARC files, 62 delete files
Syncing data for Stanford University...
Stream: 2024-08-27
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_stanford
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/stanford/streams/2024-08-27/normalized_resourcelist/marc21
Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/stanford/streams/2024-08-27/normalized_resourcelist/marc21
Parsed as sitemap, 419 resources
Read source resource list, 419 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_stanford
Status:     NOT IN SYNC (same=359, to create=60, to update=0, to delete=2)
Will GET 60 resources
created: https://pod.stanford.edu/file/489142/stanford-2025-07-14-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_stanford/file/489142/stanford-2025-07-14-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489347/stanford-2025-07-15-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-pyspark-notbook/pod_stanford/file/489347/stanford-2025-07-15-delta-marc21.mrc.gz
created: https://pod.stanford.edu/file/489638/stanford-2025-07-16-delta-marc21.mrc.gz -> /Volumes/Samsung_T5/pod-

✓ Successfully synced Stanford University
  Total files: 343 MARC, 76 delete files
  New files: 46 MARC, 14 delete files

--- Yale University ---
Checking for updates (2 existing files)...
Found existing data: 2 MARC files, 1 delete files
Syncing data for Yale University...
Stream: 2025-07-alma
Output: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_yale
Mode: Incremental (only missing/updated files)


Reading resource list https://pod.stanford.edu/organizations/yale/streams/2025-07-alma/normalized_resourcelist/marc21


✓ Successfully synced Yale University
  Total files: 2 MARC, 1 delete files
  No new files needed - already up to date

=== SYNC SUMMARY ===
Successfully synced: 13 institutions


Read sitemap/sitemapindex from https://pod.stanford.edu/organizations/yale/streams/2025-07-alma/normalized_resourcelist/marc21
Parsed as sitemap, 3 resources
Read source resource list, 3 resources listed
Scanning disk from /Volumes/Samsung_T5/pod-pyspark-notbook/pod_yale
Status:     NOT IN SYNC (same=3, to create=0, to update=0, to delete=2)
Will GET 0 resources
nodelete: would delete https://pod.stanford.edu/.DS_Store (--delete to enable)
nodelete: would delete https://pod.stanford.edu/file/.DS_Store (--delete to enable)
Written last timestamp 2025-07-14T15:18:57.990209Z for incremental sync
Status:     PART SYNCED (same=3, created=0, updated=0, to delete (--delete)=2)


['brown',
 'chicago',
 'columbia',
 'cornell',
 'dartmouth',
 'duke',
 'harvard',
 'johns_hopkins',
 'mit',
 'penn',
 'princeton',
 'stanford',
 'yale']

In [9]:
def organize_institution_files():
    """
    Move/copy relevant files from institution directories to work directory
    """
    print("=== ORGANIZING INSTITUTION FILES ===")
    total_copied = 0
    
    for institution_key, config in ivy_plus_config.items():
        if not config['active']:
            continue
            
        # Use absolute path with new volume
        source_dir = os.path.join(VOLUME_BASE, f"pod_{institution_key}")
        if not os.path.exists(source_dir):
            print(f"Directory not found: {source_dir}")
            continue
            
        # Find all MARC files (adjust pattern as needed)
        marc_files = glob.glob(f"{source_dir}/**/*.mrc.gz", recursive=True)
        
        # Copy to work directory with institution prefix
        copied_count = 0
        for file in marc_files:
            filename = os.path.basename(file)
            dest_file = os.path.join(work_dir, f"{institution_key}_{filename}")
            shutil.copy2(file, dest_file)
            copied_count += 1
            
        print(f"  {institution_key}: Copied {copied_count} files")
        total_copied += copied_count
    
    print(f"\nTotal: Copied {total_copied} files to {work_dir}")
    return total_copied

# Execute the function to copy files from pod_* directories to work directory
organize_institution_files()

=== ORGANIZING INSTITUTION FILES ===
  brown: Copied 1091 files
  chicago: Copied 969 files
  columbia: Copied 1052 files
  cornell: Copied 742 files
  dartmouth: Copied 600 files
  duke: Copied 150 files
  harvard: Copied 1013 files
  johns_hopkins: Copied 693 files
  mit: Copied 867 files
  penn: Copied 154 files
  princeton: Copied 844 files
  stanford: Copied 343 files
  yale: Copied 2 files

Total: Copied 8520 files to /Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/work/marc


8520

In [10]:
# use pyspark to import files in multiple folders
# and save them in a single folder
# usage: spark-submit --master local[*] import_files.py

# Complete Spark setup for MARC processing
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Initialize Spark if not already done
try:
    sc
    print("Using existing SparkContext")
    # If SparkContext exists, get SparkSession from it
    spark = SparkSession.builder.getOrCreate()
except NameError:
    # Create SparkSession with configuration
    spark = SparkSession.builder \
        .appName("marcUpdateDeleteApp") \
        .config("spark.executor.memory", "10g") \
        .config("spark.driver.memory", "6g") \
        .config("spark.memory.fraction", "0.7") \
        .config("spark.memory.storageFraction", "0.3") \
        .config("spark.sql.shuffle.partitions", "10") \
        .config("spark.default.parallelism", "2") \
        .getOrCreate()
    
    # Get SparkContext from SparkSession
    sc = spark.sparkContext
    print("Created new SparkSession and SparkContext with memory-optimized settings")

# Now these will work
print(f"Spark version: {spark.version}")
print(f"Spark context web UI: {sc.uiWebUrl}")
print("Spark setup complete")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/02 16:50:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Created new SparkSession and SparkContext with memory-optimized settings
Spark version: 3.5.0
Spark context web UI: http://192.168.1.37:4040
Spark setup complete


# MARC record consolidation and organization

In [11]:
import os
import glob
import gzip
from pymarc import MARCReader, Record, MARCWriter
import json
import subprocess

# Check if Spark is already initialized
try:
    sc
    print("Using existing SparkContext")
    print(f"SparkContext ID: {sc.applicationId}")
    print(f"Spark UI: {sc.uiWebUrl}")
except NameError:
    print("ERROR: SparkContext not found!")
    print("Please run the Spark initialization cell first")
    raise

# Define directories using the volume base (VOLUME_BASE is defined in earlier cell)
source_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'work', 'marc')
output_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'processed')
os.makedirs(output_dir, exist_ok=True)

# Progress tracking file
progress_file = os.path.join(output_dir, 'processing_progress.json')

# IMPORTANT: Remove progress file to force reprocessing all files
# This ensures we don't miss any files due to incomplete previous runs
if os.path.exists(progress_file):
    print("Found existing progress file. Removing to force complete reprocessing...")
    os.remove(progress_file)
    print(f"✓ Removed: {progress_file}")
    processed_files = set()
    total_records = 0
    total_updates = 0
    errors = []
    print("Starting fresh processing run.")
else:
    print("Starting fresh processing run.")
    processed_files = set()
    total_records = 0
    total_updates = 0
    errors = []

def save_progress():
    """Save current progress to file"""
    progress = {
        'processed_files': list(processed_files),
        'total_records': total_records,
        'total_updates': total_updates,
        'errors': errors
    }
    with open(progress_file, 'w') as f:
        json.dump(progress, f)

def get_all_marc_gz_files():
    """Get all .mrc.gz files from the source directory"""
    # FIXED: Use recursive=True to find files in subdirectories
    pattern = os.path.join(source_dir, '**', '*.mrc.gz')
    files = glob.glob(pattern, recursive=True)
    print(f"Found {len(files)} .mrc.gz files in {source_dir} (including subdirectories)")
    
    # Check specifically for Penn files
    penn_files = [f for f in files if 'penn_' in os.path.basename(f)]
    print(f"Penn files found: {len(penn_files)}")
    
    return files

def extract_institution_from_filename(filename):
    """Extract institution code from filename"""
    if '_' in filename:
        return filename.split('_')[0]
    elif '-' in filename:
        return filename.split('-')[0]
    return 'unknown'

def process_pod_delete_files():
    """
    Process all POD delete files (.del.txt) separately
    """
    print("\n=== PROCESSING POD DELETE FILES ===")
    
    # FIXED: Use absolute paths with VOLUME_BASE and recursive=True
    delete_patterns = [
        os.path.join(VOLUME_BASE, "pod_*/**/*.del.txt"),
        os.path.join(VOLUME_BASE, "pod_*/**/*.del"),  # Some institutions might use .del
        os.path.join(source_dir, "**/*.del.txt")
    ]
    
    all_delete_ids = set()
    delete_file_count = 0
    
    for pattern in delete_patterns:
        print(f"Searching pattern: {pattern}")
        delete_files = glob.glob(pattern, recursive=True)
        print(f"  Found {len(delete_files)} files")
        
        for del_file in delete_files:
            delete_file_count += 1
            try:
                with open(del_file, 'r') as f:
                    file_deletes = [line.strip() for line in f if line.strip()]
                    all_delete_ids.update(file_deletes)
                    print(f"  {os.path.basename(del_file)}: {len(file_deletes)} IDs")
            except Exception as e:
                print(f"  Error reading {del_file}: {e}")
    
    # Save delete IDs to file
    if all_delete_ids:
        delete_ids_file = os.path.join(output_dir, 'delete_ids.txt')
        with open(delete_ids_file, 'w') as f:
            for delete_id in sorted(all_delete_ids):
                f.write(f"{delete_id}\n")
        print(f"\nProcessed {delete_file_count} delete files")
        print(f"Total unique delete IDs: {len(all_delete_ids)}")
        print(f"Delete IDs saved to: {delete_ids_file}")
    else:
        print("No POD delete files found")
    
    return all_delete_ids

# First, process delete files separately
delete_ids = process_pod_delete_files()

# Get all .mrc.gz files
marc_gz_files = get_all_marc_gz_files()

# Double-check Penn source files specifically
print("\n=== CHECKING PENN SOURCE FILES ===")
penn_pod_dir = os.path.join(VOLUME_BASE, "pod_penn")
if os.path.exists(penn_pod_dir):
    penn_source_files = glob.glob(os.path.join(penn_pod_dir, "**/*.mrc.gz"), recursive=True)
    print(f"Penn files in original pod_penn directory: {len(penn_source_files)}")
    
    # Check for penn files in work directory
    penn_work_files = [f for f in marc_gz_files if 'penn_' in os.path.basename(f)]
    print(f"Penn files in work directory: {len(penn_work_files)}")
    
    if len(penn_source_files) > len(penn_work_files):
        print(f"⚠️ WARNING: Only {len(penn_work_files)} of {len(penn_source_files)} Penn files were copied to work directory")
        print("This may explain missing records - some source files were not processed")
else:
    print("⚠️ Penn source directory not found")

# Filter out already processed files
files_to_process = marc_gz_files  # Process ALL files, ignoring previous progress
print(f"\nFiles to process: {len(files_to_process)} (processing all files)")

# Open existing institution writers in append mode
institution_writers = {}

# Process files
print("\n=== PROCESSING MARC UPDATE FILES ===")
print("Note: POD provides all records in MARC files as updates")
print("Delete records are provided separately in .del.txt files\n")

for i, file_path in enumerate(files_to_process):
    filename = os.path.basename(file_path)
    print(f"Processing {i+1}/{len(files_to_process)}: {filename}")
    
    try:
        # Read the gzipped MARC file
        with gzip.open(file_path, 'rb') as gz_file:
            marc_data = gz_file.read()
        
        # Parse MARC records
        reader = MARCReader(marc_data, to_unicode=True, force_utf8=True, utf8_handling='replace')
        
        # Extract institution from filename
        institution = extract_institution_from_filename(filename)
        
        # Create/open writer for this institution
        if institution not in institution_writers:
            output_file = os.path.join(output_dir, f"{institution}_updates.mrc")
            # Always start with a fresh file
            output_file_handle = open(output_file, 'wb')
            institution_writers[institution] = {
                'file': output_file_handle,
                'writer': MARCWriter(output_file_handle),
                'count': 0,
                'filename': output_file
            }
        
        # Process records - ALL records in POD MARC files are updates
        file_records = 0
        file_updates = 0
        
        for record in reader:
            if record is None:
                continue
            
            file_records += 1
            file_updates += 1
            
            # Extract record ID from 001 field for logging
            record_id = record['001'].data if '001' in record else None
            
            try:
                institution_writers[institution]['writer'].write(record)
                institution_writers[institution]['count'] += 1
            except Exception as e:
                print(f"  Warning: Could not write record {record_id}: {e}")
        
        # Update totals
        total_records += file_records
        total_updates += file_updates
        
        print(f"  Processed: {file_records} update records")
        
        # Mark file as processed and save progress
        processed_files.add(filename)
        save_progress()
        
    except Exception as e:
        print(f"  Error processing {filename}: {e}")
        errors.append({'file': filename, 'error': str(e)})
        save_progress()

# Close all writers
print("\n=== CLOSING FILES ===")
for institution, writer_info in institution_writers.items():
    writer_info['writer'].close()
    writer_info['file'].close()
    print(f"  {institution}: {writer_info['count']} records written in this session")

# Final summary
print(f"\n=== PROCESSING SUMMARY ===")
print(f"Total MARC files processed: {len(processed_files)}")
print(f"Total update records found: {total_records:,}")
print(f"Files with errors: {len(errors)}")
if delete_ids:
    print(f"Delete IDs from .del.txt files: {len(delete_ids):,}")

print("\n=== ALL MARC FILES COMPLETELY PROCESSED! ===")
print(f"Institution-specific update records saved to {output_dir}")

Using existing SparkContext
SparkContext ID: local-1756849812374
Spark UI: http://192.168.1.37:4040
Found existing progress file. Removing to force complete reprocessing...
✓ Removed: /Volumes/Samsung_T5/pod-pyspark-notbook/pod-processing-outputs/processed/processing_progress.json
Starting fresh processing run.

=== PROCESSING POD DELETE FILES ===
Searching pattern: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_*/**/*.del.txt
  Found 6083 files
  harvard-2024-10-03-delta-deletes.del.txt: 9 IDs
  harvard-2024-01-13-delta-deletes.del.txt: 7 IDs
  harvard-2024-04-22-delta-deletes.del.txt: 1 IDs
  harvard-2024-01-18-delta-deletes.del.txt: 16 IDs
  harvard-2025-07-17-delta-deletes.del.txt: 39 IDs
  harvard-2025-07-25-delta-deletes.del.txt: 21 IDs
  harvard-2025-04-23-delta-deletes.del.txt: 8 IDs
  harvard-2024-05-16-delta-deletes.del.txt: 2 IDs
  harvard-2025-08-24-delta-deletes.del.txt: 17 IDs
  harvard-2023-01-09-delta-deletes.del.txt: 3 IDs
  harvard-2025-06-13-delta-deletes.del.txt: 58 ID



  Processed: 560144 update records
Processing 86/8519: chicago_chicago-2024-06-20-delta-marc21.mrc.gz
  Processed: 7476 update records
Processing 87/8519: brown_brown-2023-04-23-delta-marc21.mrc.gz
  Processed: 211 update records
Processing 88/8519: mit_mit-2024-06-09-delta-marc21.mrc.gz
  Processed: 1006 update records
Processing 89/8519: columbia_columbia-2025-01-08-delta-marc21.mrc.gz
  Processed: 1220 update records
Processing 90/8519: princeton_princeton-2024-12-20-delta-marc21.mrc.gz
  Processed: 7372 update records
Processing 91/8519: chicago_chicago-2025-03-18-delta-marc21.mrc.gz
  Processed: 746 update records
Processing 92/8519: dartmouth_dartmouth-2024-02-03-delta-marc21.mrc.gz
  Processed: 83 update records
Processing 93/8519: chicago_chicago-2025-06-10-delta-marc21.mrc.gz
  Processed: 1204 update records
Processing 94/8519: chicago_chicago-2024-03-28-delta-marc21.mrc.gz
  Processed: 23420 update records
Processing 95/8519: johns_hopkins_jhu-2023-12-30-delta-marc21.mrc.gz
 

# MARC Processing Validation

## Purpose
This cell performs a comprehensive validation check of the MARC processing pipeline output. It ensures that the institution consolidation step completed successfully and provides detailed statistics about the processed data.

## What It Does

### 1. **Directory Verification**
- Checks if the processed directory (`pod-processing-outputs/processed`) exists
- Verifies that MARC consolidation has been run

### 2. **Institution File Analysis**
- Locates all `*_updates.mrc` files (one per institution)
- Counts MARC records in each file using pymarc
- Calculates file sizes
- Aggregates statistics by institution

### 3. **Auxiliary File Inspection**
- Identifies non-MARC files in the processed directory:
  - `delete_ids.txt` - Lists record IDs to be deleted
  - `processing_progress.json` - Tracks processing state
  - Any other generated files

### 4. **Source Directory Check**
- Verifies the work directory still contains source `.mrc.gz` files
- Confirms data integrity throughout the pipeline

### 5. **Spark Status Verification**
- Checks if SparkContext is still active
- Reports Spark application details
- Ensures resources are available for next steps

## Output Information

### Success Indicators (✓)
- Processed directory exists with institution files
- Source files are still available
- Spark is active and ready

### Warning Indicators (✗)
- Missing directories or files
- Processing errors
- Spark unavailable

## Key Metrics Reported
- **Per Institution**: Record count and file size
- **Total**: Aggregate records across all institutions
- **Delete IDs**: Number of records marked for deletion
- **Processing State**: Progress tracking information

## Next Steps
Based on validation results:
- ✅ If successful → Proceed to export with delete filtering
- ❌ If failed → Return to processing cell and investigate errors

## Example Output
✓ Processed directory contains 13 institution update files harvard_updates.mrc: 1,234,567 records (543.21 MB) yale_updates.mrc: 987,654 records (432.10 MB) ... Total records across all files: 15,678,901

This validation ensures data integrity before the final export step where delete filtering is applied.

In [12]:
# Validate processed output files 
import os
import glob
from pymarc import MARCReader

# Check processed directory
processed_dir = 'pod-processing-outputs/processed'

print("=== VALIDATION SUMMARY ===")
print("This validation checks the processed institution files")
print("(No splitting step - working with full institution files)\n")

# Check if directories exist
if not os.path.exists(processed_dir):
    print(f"✗ Processed directory does not exist: {processed_dir}")
    print("  Run the processing cell first!")
else:
    # Look for institution update files
    update_files = glob.glob(os.path.join(processed_dir, '*_updates.mrc'))
    print(f"✓ Processed directory contains {len(update_files)} institution update files")
    
    # Group by institution for better overview
    institution_counts = {}
    total_records = 0
    
    for file_path in update_files:
        file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
        institution = os.path.basename(file_path).replace('_updates.mrc', '')
        
        # Count records in file
        try:
            with open(file_path, 'rb') as marc_file:
                reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                record_count = sum(1 for record in reader if record is not None)
                total_records += record_count
                
                institution_counts[institution] = {
                    'files': 1, 
                    'records': record_count,
                    'size_mb': file_size
                }
                
                print(f"  {os.path.basename(file_path)}: {record_count:,} records ({file_size:.2f} MB)")
        except Exception as e:
            print(f"  {os.path.basename(file_path)}: Error reading file - {e}")
    
    print(f"\nTotal records across all files: {total_records:,}")
    print("\nBy institution:")
    for inst, counts in sorted(institution_counts.items()):
        print(f"  {inst}: {counts['records']:,} records ({counts['size_mb']:.2f} MB)")

# Check for other important files
print(f"\n=== OTHER FILES IN PROCESSED DIRECTORY ===")
if os.path.exists(processed_dir):
    all_files = os.listdir(processed_dir)
    other_files = [f for f in all_files if not f.endswith('_updates.mrc')]
    
    for file in sorted(other_files):
        file_path = os.path.join(processed_dir, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
            if file.endswith('.txt'):
                # Count lines in text files (like delete_ids.txt)
                with open(file_path, 'r') as f:
                    line_count = sum(1 for line in f if line.strip())
                print(f"  {file} ({file_size:.2f} MB, {line_count:,} lines)")
            elif file.endswith('.json'):
                print(f"  {file} ({file_size:.2f} MB) - Progress tracking")
            else:
                print(f"  {file} ({file_size:.2f} MB)")

# Check source directory
print(f"\n=== SOURCE DIRECTORY ===")
source_dir = 'pod-processing-outputs/work/marc'
if os.path.exists(source_dir):
    source_files = glob.glob(os.path.join(source_dir, '*.mrc.gz'))
    print(f"✓ Source directory contains {len(source_files)} .mrc.gz files")
else:
    print(f"✗ Source directory does not exist: {source_dir}")

print(f"\n=== SPARK STATUS ===")
try:
    executor_info = sc.statusTracker().getExecutorInfos()
    print(f"✓ SparkContext is active")
    print(f"  Application ID: {sc.applicationId}")
    print(f"  Executors: {len(executor_info)}")
except:
    print("✗ SparkContext not available")

print(f"\n=== NEXT STEPS ===")
if update_files:
    print("✓ Data is ready for export with delete filtering")
    print("  Run the export cell to create filtered output")
    print("  Delete filtering will be applied during export")
else:
    print("✗ No processed data ready for export")
    print("  Make sure to run the processing cell first")

print(f"\nValidation complete.")

=== VALIDATION SUMMARY ===
This validation checks the processed institution files
(No splitting step - working with full institution files)

✓ Processed directory contains 13 institution update files
  columbia_updates.mrc: 17,696,269 records (38838.52 MB)
  duke_updates.mrc: 10,986,245 records (24143.63 MB)
  dartmouth_updates.mrc: 3,862,868 records (7770.00 MB)
  yale_updates.mrc: 7,431,259 records (12078.07 MB)
  penn_updates.mrc: 11,982,338 records (29329.07 MB)
  mit_updates.mrc: 6,779,976 records (13110.32 MB)
  harvard_updates.mrc: 68,269,575 records (142589.34 MB)
  chicago_updates.mrc: 13,123,280 records (32217.56 MB)
  princeton_updates.mrc: 31,489,341 records (60088.73 MB)
  johns_updates.mrc: 5,254,862 records (9607.98 MB)
  stanford_updates.mrc: 9,814,866 records (20672.88 MB)
  brown_updates.mrc: 3,440,806 records (6843.53 MB)
  cornell_updates.mrc: 7,306,191 records (14674.76 MB)

Total records across all files: 197,437,876

By institution:
  brown: 3,440,806 records (68

# Export with delete filtering 


In [19]:
# Export with delete filtering 
import os
import glob
import shutil
from pymarc import MARCReader, MARCWriter
from datetime import datetime
import json

# Define base directory and subdirectories
base_dir = 'pod-processing-outputs'
processed_dir = os.path.join(base_dir, 'processed')
export_dir = os.path.join(base_dir, 'export')
archive_dir = os.path.join(base_dir, 'archive')
work_dir = os.path.join(base_dir, 'work', 'marc')

# Create directories if they don't exist
for directory in [export_dir, processed_dir, archive_dir]:
    os.makedirs(directory, exist_ok=True)

def find_and_process_pod_delete_files():
    """
    Find and process all POD delete files (.del.txt) to ensure we have all deletes
    """
    print("=== SEARCHING FOR POD DELETE FILES (.del.txt) ===")
    
    # Use absolute paths with VOLUME_BASE
    delete_patterns = [
        os.path.join(VOLUME_BASE, "pod_*/**/*.del.txt"),
        os.path.join(VOLUME_BASE, "pod_*/**/*.del"),  # Some institutions might use .del
        os.path.join(VOLUME_BASE, "pod-processing-outputs/work/**/*.del.txt")
    ]
    
    all_pod_delete_ids = set()
    delete_file_count = 0
    files_by_institution = {}
    
    for pattern in delete_patterns:
        print(f"Searching pattern: {pattern}")
        delete_files = glob.glob(pattern, recursive=True)
        print(f"  Found {len(delete_files)} files")
        
        for del_file in delete_files:
            delete_file_count += 1
            try:
                # Extract institution from path
                if "pod_" in del_file:
                    parts = del_file.split("pod_")[1].split(os.sep)
                    institution = parts[0]
                else:
                    institution = "unknown"
                
                with open(del_file, 'r') as f:
                    file_deletes = [line.strip() for line in f if line.strip()]
                    all_pod_delete_ids.update(file_deletes)
                    
                    if institution not in files_by_institution:
                        files_by_institution[institution] = 0
                    files_by_institution[institution] += len(file_deletes)
                    
                    print(f"  {os.path.basename(del_file)}: {len(file_deletes)} IDs (from {institution})")
            except Exception as e:
                print(f"  Error reading {del_file}: {e}")
    
    print(f"\nFound {delete_file_count} POD delete files")
    print(f"Total unique delete IDs from POD files: {len(all_pod_delete_ids)}")
    
    if files_by_institution:
        print("\nDelete IDs by institution:")
        for inst, count in sorted(files_by_institution.items()):
            print(f"  {inst}: {count} IDs")
    
    return all_pod_delete_ids

def reconcile_delete_ids():
    """
    Reconcile delete IDs from different sources
    """
    # Get POD delete IDs
    pod_delete_ids = find_and_process_pod_delete_files()
    
    # Check existing delete_ids.txt
    existing_delete_ids = set()
    delete_ids_file = os.path.join(processed_dir, 'delete_ids.txt')
    
    if os.path.exists(delete_ids_file):
        with open(delete_ids_file, 'r') as f:
            existing_delete_ids = {line.strip() for line in f if line.strip()}
        print(f"\nExisting delete_ids.txt contains: {len(existing_delete_ids)} IDs")
    
    # Combine all delete IDs
    all_delete_ids = pod_delete_ids.union(existing_delete_ids)
    
    print(f"\n=== DELETE ID RECONCILIATION ===")
    print(f"From POD .del.txt files: {len(pod_delete_ids)}")
    print(f"From existing delete_ids.txt: {len(existing_delete_ids)}")
    print(f"Total unique delete IDs: {len(all_delete_ids)}")
    
    # Update delete_ids.txt with all IDs
    if len(all_delete_ids) > len(existing_delete_ids):
        print(f"\nUpdating delete_ids.txt with {len(all_delete_ids) - len(existing_delete_ids)} additional IDs")
        with open(delete_ids_file, 'w') as f:
            for delete_id in sorted(all_delete_ids):
                f.write(f"{delete_id}\n")
    
    return all_delete_ids

def create_export_package(filter_deletes=True, skip_delete_institutions=None):
    """
    Create a final export package with all processed files (NO SPLITTING VERSION)
    
    Args:
        filter_deletes: If True, filter out records matching delete IDs
        skip_delete_institutions: List of institution codes to skip delete filtering for
    """
    if skip_delete_institutions is None:
        skip_delete_institutions = []
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    export_subdir = os.path.join(export_dir, f"marc_export_{timestamp}")
    os.makedirs(export_subdir, exist_ok=True)
    
    # Reconcile and load all delete IDs
    delete_ids = set()
    if filter_deletes:
        delete_ids = reconcile_delete_ids()
        if delete_ids:
            print(f"\nUsing {len(delete_ids)} delete IDs for filtering")
            if skip_delete_institutions:
                print(f"⚠️ Skipping delete filtering for: {', '.join(skip_delete_institutions)}")
        else:
            print("No delete IDs found - proceeding without filtering")
            filter_deletes = False
    
    # Process institution update files from processed directory
    total_filtered = 0
    update_files = glob.glob(os.path.join(processed_dir, '*_updates.mrc'))
    
    if update_files:
        print(f"\n=== PROCESSING {len(update_files)} INSTITUTION FILES ===")
        
        for input_file in update_files:
            institution = os.path.basename(input_file).replace('_updates.mrc', '')
            output_file = os.path.join(export_subdir, f"{institution}_filtered.mrc")
            
            # Check if we should skip delete filtering for this institution
            skip_deletes_for_this = institution in skip_delete_institutions
            
            if filter_deletes and delete_ids and not skip_deletes_for_this:
                # Filter while copying
                print(f"Processing {institution}...")
                with open(input_file, 'rb') as marc_file:
                    reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                    writer = MARCWriter(open(output_file, 'wb'))
                    
                    file_records = 0
                    file_filtered = 0
                    
                    for record in reader:
                        if record is None:
                            continue
                        
                        file_records += 1
                        
                        # Check if record should be filtered
                        if '001' in record and record['001'].data in delete_ids:
                            file_filtered += 1
                            total_filtered += 1
                        else:
                            writer.write(record)
                    
                    writer.close()
                    
                    print(f"  {institution}: {file_records} records, filtered {file_filtered}")
            else:
                # Just copy without filtering
                shutil.copy2(input_file, export_subdir)
                if skip_deletes_for_this:
                    print(f"  {institution}: Copied WITHOUT delete filtering (in skip list)")
                    # Keep the _filtered.mrc naming for consistency
                    os.rename(
                        os.path.join(export_subdir, os.path.basename(input_file)),
                        output_file
                    )
                else:
                    # Rename to indicate it's unfiltered
                    os.rename(
                        os.path.join(export_subdir, os.path.basename(input_file)),
                        os.path.join(export_subdir, f"{institution}_unfiltered.mrc")
                    )
        
        if filter_deletes and total_filtered > 0:
            print(f"\nTotal records filtered across all files: {total_filtered}")
        print(f"Processed {len(update_files)} institution files to export package")
    else:
        print("\n✗ No institution update files found in processed directory!")
        return None
    
    # Copy delete IDs file for reference
    delete_ids_file = os.path.join(processed_dir, 'delete_ids.txt')
    if os.path.exists(delete_ids_file):
        shutil.copy2(delete_ids_file, export_subdir)
        print("Copied delete IDs file to export package for reference")
    
    # Create detailed summary file
    summary_file = os.path.join(export_subdir, 'processing_summary.txt')
    with open(summary_file, 'w') as f:
        f.write(f"MARC Processing Summary (No Splitting Version)\n")
        f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"=" * 50 + "\n\n")
        
        f.write(f"Delete ID Sources:\n")
        f.write(f"  - POD .del.txt files processed\n")
        f.write(f"  - Total delete IDs: {len(delete_ids)}\n\n")
        
        if filter_deletes and delete_ids:
            f.write(f"Delete filtering: ENABLED\n")
            f.write(f"Delete IDs used: {len(delete_ids)}\n")
            f.write(f"Records filtered: {total_filtered}\n")
            if skip_delete_institutions:
                f.write(f"Institutions skipped: {', '.join(skip_delete_institutions)}\n")
            f.write("\n")
        else:
            f.write(f"Delete filtering: DISABLED\n\n")
        
        # Count records and files by institution
        f.write(f"Institution Breakdown:\n")
        f.write(f"-" * 30 + "\n")
        
        institution_stats = {}
        total_records = 0
        total_files = 0
        
        for file in glob.glob(os.path.join(export_subdir, '*.mrc')):
            total_files += 1
            institution = os.path.basename(file).split('_')[0]
            
            if institution not in institution_stats:
                institution_stats[institution] = {'files': 0, 'records': 0, 'filtered': institution not in skip_delete_institutions}
            
            try:
                with open(file, 'rb') as marc_file:
                    reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
                    record_count = sum(1 for record in reader if record is not None)
                    total_records += record_count
                    institution_stats[institution]['files'] += 1
                    institution_stats[institution]['records'] += record_count
                    f.write(f"{os.path.basename(file)}: {record_count:,} records\n")
            except Exception as e:
                f.write(f"{os.path.basename(file)}: Error reading file - {e}\n")
        
        f.write(f"\n{'-' * 30}\n")
        f.write(f"Total: {total_records:,} records in {total_files} files\n\n")
        
        # Institution summary
        f.write(f"By Institution:\n")
        for inst, stats in sorted(institution_stats.items()):
            filter_status = "filtered" if stats.get('filtered', True) else "NOT filtered"
            f.write(f"  {inst.upper()}: {stats['records']:,} records in {stats['files']} file(s) ({filter_status})\n")
        
        # Delete IDs info
        if os.path.exists(os.path.join(export_subdir, 'delete_ids.txt')):
            with open(os.path.join(export_subdir, 'delete_ids.txt'), 'r') as del_file:
                delete_count = sum(1 for line in del_file if line.strip())
            f.write(f"\nDelete IDs in reference file: {delete_count:,}\n")
    
    print(f"\nExport package created: {export_subdir}")
    return export_subdir

# Main execution
print("=== EXPORT WITH DELETE FILTERING (NO SPLITTING) ===")

# Create export package with delete filtering enabled, but skip Penn
# due to potential timing issues between MARC data and delete files
export_path = create_export_package(filter_deletes=True, skip_delete_institutions=['penn'])

if export_path:
    # List export contents
    export_files = os.listdir(export_path)
    print(f"\n=== EXPORT PACKAGE CONTENTS ===")
    for file in sorted(export_files):
        file_path = os.path.join(export_path, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / 1024 / 1024  # MB
            print(f"  {file} ({file_size:.2f} MB)")

    print(f"\n=== PROCESSING COMPLETE ===")
    print(f"Final export package: {export_path}")
    print(f"Ready for conversion to Parquet format")

    # Display final statistics
    summary_file = os.path.join(export_path, 'processing_summary.txt')
    if os.path.exists(summary_file):
        print(f"\nKey Statistics from Processing Summary:")
        with open(summary_file, 'r') as f:
            for line in f:
                if line.strip() and any(x in line for x in ["Total:", "Delete IDs used:", "Records filtered:", "By Institution:", "Institutions skipped:"]):
                    print(f"  {line.strip()}")
                    if "By Institution:" in line:
                        # Print the next few institution lines
                        for _ in range(13):  # For all Ivy Plus institutions
                            inst_line = next(f, None)
                            if inst_line and inst_line.strip():
                                print(f"  {inst_line.strip()}")

# Keep SparkContext active for any additional processing
print(f"\nSparkContext remains active for additional processing if needed")
print(f"To stop Spark when completely done, run: sc.stop()")

=== EXPORT WITH DELETE FILTERING (NO SPLITTING) ===
=== SEARCHING FOR POD DELETE FILES (.del.txt) ===
Searching pattern: /Volumes/Samsung_T5/pod-pyspark-notbook/pod_*/**/*.del.txt
  Found 6083 files
  harvard-2024-10-03-delta-deletes.del.txt: 9 IDs (from harvard)
  harvard-2024-01-13-delta-deletes.del.txt: 7 IDs (from harvard)
  harvard-2024-04-22-delta-deletes.del.txt: 1 IDs (from harvard)
  harvard-2024-01-18-delta-deletes.del.txt: 16 IDs (from harvard)
  harvard-2025-07-17-delta-deletes.del.txt: 39 IDs (from harvard)
  harvard-2025-07-25-delta-deletes.del.txt: 21 IDs (from harvard)
  harvard-2025-04-23-delta-deletes.del.txt: 8 IDs (from harvard)
  harvard-2024-05-16-delta-deletes.del.txt: 2 IDs (from harvard)
  harvard-2025-08-24-delta-deletes.del.txt: 17 IDs (from harvard)
  harvard-2023-01-09-delta-deletes.del.txt: 3 IDs (from harvard)
  harvard-2025-06-13-delta-deletes.del.txt: 58 IDs (from harvard)
  harvard-2022-11-26-delta-deletes.del.txt: 13 IDs (from harvard)
  harvard-2025-

# debugging

In [21]:
# Complete debugging cell with all necessary imports and variables
import os
import glob
import gzip
from pymarc import MARCReader

# Set the volume base (same as in your notebook)
VOLUME_BASE = '/Volumes/Samsung_T5/pod-pyspark-notbook'
mmsid_to_check = "9943951413503681"

# First, verify if export_path is defined and current
print("=== VERIFYING EXPORT PATH ===")
if 'export_path' in globals():
    print(f"Current export_path: {export_path}")
    
    # Check if this export has Penn skip configured
    summary_file = os.path.join(export_path, 'processing_summary.txt')
    if os.path.exists(summary_file):
        with open(summary_file, 'r') as f:
            content = f.read()
            if "Institutions skipped: penn" in content:
                print("✅ This export was configured to skip Penn delete filtering")
            else:
                print("❌ This export did NOT skip Penn delete filtering")
else:
    print("❌ No export_path variable found - you need to run the export cell first!")

# Check the consolidated delete_ids.txt file
print("\n=== CHECKING CONSOLIDATED DELETE IDS FILE ===")
delete_ids_file = os.path.join(VOLUME_BASE, "pod-processing-outputs/processed/delete_ids.txt")

if os.path.exists(delete_ids_file):
    print(f"Loading consolidated delete_ids.txt file...")
    with open(delete_ids_file, 'r') as f:
        all_delete_ids = [line.strip() for line in f if line.strip()]
    
    print(f"Total delete IDs in consolidated file: {len(all_delete_ids)}")
    
    # Check exact match
    if mmsid_to_check in all_delete_ids:
        print(f"✅ EXACT MATCH FOUND - MMSID {mmsid_to_check} is in delete_ids.txt")
        print("This would normally cause filtering, BUT Penn is in skip list")
        
        # Try to find which original delete file it came from
        print("\nSearching for source delete file...")
        penn_delete_patterns = [
            os.path.join(VOLUME_BASE, "pod_penn/**/*.del.txt"),
            os.path.join(VOLUME_BASE, "pod_penn/**/*.del"),
        ]
        
        for pattern in penn_delete_patterns:
            del_files = glob.glob(pattern, recursive=True)
            for del_file in del_files:
                with open(del_file, 'r') as f:
                    content = f.read()
                    if mmsid_to_check in content:
                        print(f"  Found in original file: {del_file}")
                        # Show the line containing the MMSID
                        lines = content.split('\n')
                        for line in lines:
                            if mmsid_to_check in line:
                                print(f"  Line content: '{line.strip()}'")
                        break
    else:
        print(f"❌ NOT FOUND in consolidated delete_ids.txt")
else:
    print("❌ Consolidated delete_ids.txt file not found")

# Now check if the record is in the current export
if 'export_path' in globals() and export_path:
    print("\n=== CHECKING CURRENT EXPORT ===")
    penn_export_file = os.path.join(export_path, "penn_filtered.mrc")
    
    if os.path.exists(penn_export_file):
        print(f"Checking Penn export file: {penn_export_file}")
        with open(penn_export_file, 'rb') as marc_file:
            reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
            found = False
            record_count = 0
            for record in reader:
                record_count += 1
                if record and '001' in record and record['001'].data == mmsid_to_check:
                    print(f"✅ SUCCESS: Record {mmsid_to_check} is present in Penn export!")
                    print(f"   Found at position: approximately {record_count}th record")
                    found = True
                    break
            
            if not found:
                print(f"❌ Record NOT FOUND in export (checked {record_count} records)")
                print("   This suggests the export was not re-run with skip_delete_institutions=['penn']")
            else:
                print("\n✅ Penn delete filtering was successfully skipped!")
                print("   The record is now included in the export as expected")
    else:
        print(f"❌ Penn export file not found: {penn_export_file}")
        print("   Make sure you've run the export cell")

print("\n=== FINAL STATUS ===")
if 'export_path' in globals():
    print(f"1. Record {mmsid_to_check} is in Penn's delete list")
    print("2. Penn is configured to skip delete filtering")
    print("3. Check above to see if record is in current export")
    print("\nIf the record is still missing, re-run the export cell with:")
    print("   export_path = create_export_package(filter_deletes=True, skip_delete_institutions=['penn'])")
else:
    print("You need to run the export cell first!")

=== VERIFYING EXPORT PATH ===
Current export_path: pod-processing-outputs/export/marc_export_20250905_070958
✅ This export was configured to skip Penn delete filtering

=== CHECKING CONSOLIDATED DELETE IDS FILE ===
Loading consolidated delete_ids.txt file...
Total delete IDs in consolidated file: 3012814
❌ NOT FOUND in consolidated delete_ids.txt

=== CHECKING CURRENT EXPORT ===
Checking Penn export file: pod-processing-outputs/export/marc_export_20250905_070958/penn_filtered.mrc
✅ SUCCESS: Record 9943951413503681 is present in Penn export!
   Found at position: approximately 1938377th record

✅ Penn delete filtering was successfully skipped!
   The record is now included in the export as expected

=== FINAL STATUS ===
1. Record 9943951413503681 is in Penn's delete list
2. Penn is configured to skip delete filtering
3. Check above to see if record is in current export

If the record is still missing, re-run the export cell with:
   export_path = create_export_package(filter_deletes=Tru

In [24]:
# Comprehensive cleanup to free space
import shutil
import os
import glob

def cleanup_for_space(keep_processed=True, keep_latest_export=True):
    """
    Clean up temporary files to free space
    
    Args:
        keep_processed: Keep the processed directory (contains consolidated MARC files)
        keep_latest_export: Keep the most recent export directory
    """
    freed_space = 0
    
    # 1. Clean work directory (usually the largest)
    work_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'work', 'marc')
    if os.path.exists(work_dir):
        size = sum(os.path.getsize(f) for f in glob.glob(os.path.join(work_dir, '**/*'), recursive=True))
        print(f"Work directory: {size / 1024**3:.2f} GB")
        shutil.rmtree(work_dir)
        os.makedirs(work_dir, exist_ok=True)
        freed_space += size
        print("  ✓ Cleaned")
    
    # 2. Clean old exports
    export_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'export')
    if os.path.exists(export_dir):
        export_subdirs = sorted([d for d in os.listdir(export_dir) 
                                if os.path.isdir(os.path.join(export_dir, d)) and d.startswith('marc_export_')])
        
        if keep_latest_export and export_subdirs:
            # Keep only the latest
            for subdir in export_subdirs[:-1]:
                path = os.path.join(export_dir, subdir)
                size = sum(os.path.getsize(os.path.join(path, f)) 
                          for f in os.listdir(path) if os.path.isfile(os.path.join(path, f)))
                shutil.rmtree(path)
                freed_space += size
                print(f"  ✓ Deleted old export: {subdir} ({size / 1024**3:.2f} GB)")
        elif not keep_latest_export:
            # Delete all exports
            for subdir in export_subdirs:
                path = os.path.join(export_dir, subdir)
                size = sum(os.path.getsize(os.path.join(path, f)) 
                          for f in os.listdir(path) if os.path.isfile(os.path.join(path, f)))
                shutil.rmtree(path)
                freed_space += size
                print(f"  ✓ Deleted export: {subdir} ({size / 1024**3:.2f} GB)")
    
    # 3. Clean archive directory
    archive_dir = os.path.join(VOLUME_BASE, 'pod-processing-outputs', 'archive')
    if os.path.exists(archive_dir) and os.listdir(archive_dir):
        size = sum(os.path.getsize(os.path.join(archive_dir, f)) 
                  for f in os.listdir(archive_dir) if os.path.isfile(os.path.join(archive_dir, f)))
        shutil.rmtree(archive_dir)
        os.makedirs(archive_dir, exist_ok=True)
        freed_space += size
        print(f"  ✓ Cleaned archive: ({size / 1024**3:.2f} GB)")
    
    print(f"\nTotal space freed: {freed_space / 1024**3:.2f} GB")
    
    # Check new disk status
    stats = shutil.disk_usage(VOLUME_BASE)
    print(f"\nDisk space after cleanup:")
    print(f"  Free: {stats.free / 1024**3:.2f} GB ({100 - (stats.used / stats.total * 100):.1f}%)")
    
    return freed_space

# Run the cleanup
print("=== CLEANUP FOR DISK SPACE ===")
print("This will delete temporary work files and old exports")
response = input("Proceed with cleanup? (yes/no): ")
if response.lower() == 'yes':
    cleanup_for_space(keep_processed=True, keep_latest_export=True)
else:
    print("Cleanup cancelled")

=== CLEANUP FOR DISK SPACE ===
This will delete temporary work files and old exports
Work directory: 122.46 GB
  ✓ Cleaned
  ✓ Deleted old export: marc_export_20250905_070532 (0.00 GB)

Total space freed: 122.46 GB

Disk space after cleanup:
  Free: 447.36 GB (24.0%)


# Final Export Step: Copy Filtered Files to Final Directory

## Purpose
This cell copies all filtered MARC files from the timestamped export directory to a standardized `final` directory for streamlined processing in the next pipeline stage.

## What It Does
- Copies all `.mrc` files (filtered by institution) from the export package
- Places them in `pod-processing-outputs/final/` directory
- Skips non-MARC files (like `.txt` summaries)
- Provides a clean, consistent location for `pod-processing.ipynb` to read from

## Why This Step
- **Consistency**: The `final` directory always contains the latest processed files
- **Simplicity**: Next notebook doesn't need to navigate timestamped directories
- **Clean Interface**: Only MARC files are copied, excluding metadata files

## Output
Files are copied as:
- `harvard_filtered.mrc`
- `yale_filtered.mrc`
- `princeton_filtered.mrc`
- etc.

Ready for conversion to Parquet format in `pod-processing.ipynb`

In [25]:
# Copy filtered files to final directory for streamlined processing with debugging
import shutil
import glob
from pymarc import MARCReader

# MMSID to check for (the one that's disappearing)
mmsid_to_check = "9943951413503681"
print(f"\n=== DEBUGGING RECORD TRACKING FOR MMSID {mmsid_to_check} ===")

if export_path:
    print("\n=== COPYING TO FINAL DIRECTORY ===")
    
    # First, check if the record exists in the original processed files
    print("\nCHECKING ORIGINAL PROCESSED FILES:")
    record_found_in_processed = False
    penn_processed_file = os.path.join(processed_dir, "penn_updates.mrc")
    
    if os.path.exists(penn_processed_file):
        with open(penn_processed_file, 'rb') as marc_file:
            reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
            record_count = 0
            for record in reader:
                record_count += 1
                if record and '001' in record and record['001'].data == mmsid_to_check:
                    print(f"✅ FOUND TARGET RECORD in processed file: {penn_processed_file}")
                    print(f"  Record position: approximately {record_count}th record")
                    print(f"  Record fields: {[f.tag for f in record.get_fields()]}")
                    record_found_in_processed = True
                    break
        
        if not record_found_in_processed:
            print(f"❌ TARGET RECORD NOT FOUND in processed file (checked {record_count} records)")
    else:
        print(f"❌ Penn processed file not found: {penn_processed_file}")
    
    # Then check if it exists in the export filtered files
    print("\nCHECKING EXPORT FILTERED FILES:")
    record_found_in_export = False
    penn_export_file = os.path.join(export_path, "penn_filtered.mrc")
    
    if os.path.exists(penn_export_file):
        with open(penn_export_file, 'rb') as marc_file:
            reader = MARCReader(marc_file, to_unicode=True, force_utf8=True, utf8_handling='replace')
            record_count = 0
            for record in reader:
                record_count += 1
                if record and '001' in record and record['001'].data == mmsid_to_check:
                    print(f"✅ FOUND TARGET RECORD in export file: {penn_export_file}")
                    print(f"  Record position: approximately {record_count}th record")
                    record_found_in_export = True
                    break
        
        if not record_found_in_export:
            print(f"❌ TARGET RECORD NOT FOUND in export file (checked {record_count} records)")
    else:
        print(f"❌ Penn export file not found: {penn_export_file}")
    
    # Continue with normal copying operations
    print("\nCOPYING FILES TO FINAL DIRECTORY:")
    # Ensure final directory exists
    os.makedirs(final_dir, exist_ok=True)
    
    # Copy all filtered MARC files to final directory
    marc_files = glob.glob(os.path.join(export_path, '*.mrc'))
    
    for marc_file in marc_files:
        if not marc_file.endswith('.txt'):  # Skip text files
            dest = os.path.join(final_dir, os.path.basename(marc_file))
            shutil.copy2(marc_file, dest)
            print(f"  Copied: {os.path.basename(marc_file)}")
    
    print(f"\n✓ {len(marc_files)} files copied to {final_dir}")
    print("Files are now ready for pod-processing.ipynb")
    
    # Summary of findings
    print("\n=== RECORD TRACKING SUMMARY ===")
    if record_found_in_processed and not record_found_in_export:
        print(f"⚠️ The record {mmsid_to_check} was found in the processed file but NOT in the export file")
        print("   This indicates it was filtered out during the export filtering step")
        print("   Check the filtering logic in the 'Export with delete filtering' cell")
    elif not record_found_in_processed and not record_found_in_export:
        print(f"⚠️ The record {mmsid_to_check} was NOT found in either the processed or export files")
        print("   This indicates it was lost before or during the initial processing")
        print("   Check the processing logic in the MARC consolidation cell")
    elif record_found_in_processed and record_found_in_export:
        print(f"✅ The record {mmsid_to_check} was found in both processed and export files")
        print("   It should be available in the final output")
    else:
        print(f"❓ Unusual situation: record not in processed file but is in export file")
        print("   This shouldn't happen in normal operation")
else:
    print("❌ No export path defined - run the export cell first")


=== DEBUGGING RECORD TRACKING FOR MMSID 9943951413503681 ===

=== COPYING TO FINAL DIRECTORY ===

CHECKING ORIGINAL PROCESSED FILES:
✅ FOUND TARGET RECORD in processed file: pod-processing-outputs/processed/penn_updates.mrc
  Record position: approximately 1938377th record
  Record fields: ['005', '008', '001', '026', '035', '035', '035', '035', '040', '041', '100', '240', '245', '250', '264', '300', '336', '337', '338', '500', '500', '650', '650', '648', '650', '650', '648', '650', '650', '650', '700', '710', '752', '902', '900']

CHECKING EXPORT FILTERED FILES:
✅ FOUND TARGET RECORD in export file: pod-processing-outputs/export/marc_export_20250905_070958/penn_filtered.mrc
  Record position: approximately 1938377th record

COPYING FILES TO FINAL DIRECTORY:
  Copied: cornell_filtered.mrc
  Copied: penn_filtered.mrc
  Copied: harvard_filtered.mrc
  Copied: mit_filtered.mrc
  Copied: brown_filtered.mrc
  Copied: duke_filtered.mrc
  Copied: yale_filtered.mrc
  Copied: chicago_filtered.m