## 1. Setup and Imports

In [None]:
# Install required packages if not already installed
!pip install requests beautifulsoup4 lxml pandas -q

In [1]:
import os
import sys
import pandas as pd
from pathlib import Path

from fibis_scraper import FIBISScraper, Dataset

print("Imports successful!")
print(f"Working directory: {os.getcwd()}")

Imports successful!
Working directory: D:\mridul\Scraping Assessment\Task 1


## 2. Initialize the Scraper

The scraper is configured with:
- **Rate limiting**: 1.5 seconds between requests to avoid overwhelming the server
- **Retry logic**: 3 retries with exponential backoff for failed requests
- **Timeout**: 30 seconds per request
- **User-Agent**: Browser-like headers for compatibility

In [2]:
# Initialize scraper with output directory
OUTPUT_DIR = "datasets"
scraper = FIBISScraper(output_dir=OUTPUT_DIR)

print("Scraper initialized.")
print(f"Output directory: {scraper.output_dir.absolute()}")
print(f"Rate limit: {scraper.REQUEST_DELAY}s between requests")
print(f"Max retries: {scraper.MAX_RETRIES}")

Scraper initialized.
Output directory: D:\mridul\Scraping Assessment\Task 1\datasets
Rate limit: 1.5s between requests
Max retries: 3


## 3. Discover Available Datasets

Scrape the main records list page to discover all publicly available datasets.

In [3]:
# Optional: remove output from a previous run so stale files do not mix with new results
import shutil

datasets_path = Path(OUTPUT_DIR)
if datasets_path.exists():
    print("Removing old datasets folder...")
    shutil.rmtree(datasets_path)
    print("Old datasets folder removed.")

# Re-create the scraper after cleanup, pointing at the same output directory
scraper = FIBISScraper(output_dir=OUTPUT_DIR)

# Fetch all available datasets from FIBIS
all_datasets = scraper.get_available_datasets()

print(f"\nFound {len(all_datasets)} publicly available datasets")
print("\nSample datasets with categories:")
for i, ds in enumerate(all_datasets[:10], 1):
    print(f"  {i}. {ds.name}")
    print(f"     Category: {ds.category if ds.category else 'N/A'}")
    print(f"     Records: {ds.record_count:,} | ID: {ds.dataset_id}")

2025-12-30 14:56:23,168 - INFO - Fetching available datasets from FIBIS...


Removing old datasets folder...
Old datasets folder removed.


2025-12-30 14:56:31,356 - INFO - Mapped 2053 datasets to categories
2025-12-30 14:56:31,435 - INFO - Found 2053 dataset links to process
2025-12-30 14:56:31,654 - INFO - Found 2053 unique publicly available datasets



Found 2053 publicly available datasets

Sample datasets with categories:
  1. Jhansi Lychgate Burial Register
     Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
     Records: 1,698 | ID: 1004
  2. Saharanpur Burials.
     Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
     Records: 368 | ID: 2390
  3. St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register
     Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
     Records: 426 | ID: 2524
  4. Register of Burials at Cinnamara, Assam and Outstations 1939-1959
     Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
     Records: 22 | ID: 1731
  5. Transcription of assorted entries from the Burial Indexes 1800-1947
     Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
     Records: 1,312 | ID: 104
  6. Transcription of assorted entries from the Burial Indexes 1800-1947
     Category: Birth Marriage & Deaths > Deaths

## 4. Select 10 Datasets for Scraping

We select the first 10 datasets returned by discovery. Their sizes range from 22 to 1,698 records, so the run covers both single-page and multi-page datasets.

### Selected Datasets:
The cell below lists the selected datasets with their category and record count:

In [4]:
# Take the first 10 Dataset objects directly; they are already unique by ID,
# so there is no need to match them by name.
# Each dataset carries its category path, which is used to organise the output folders
# and ensures that datasets sharing a name do not overwrite each other.

selected_datasets = all_datasets[:10]

print(f"{'='*80}")
print("SELECTED DATASETS FOR SCRAPING")
print(f"{'='*80}")
total_records = 0
for i, ds in enumerate(selected_datasets, 1):
    print(f"\n{i:2}. {ds.name}")
    print(f"    Category: {ds.category if ds.category else 'N/A'}")
    print(f"    Records: {ds.record_count:,} | ID: {ds.dataset_id}")
    total_records += ds.record_count

print(f"\n{'='*80}")
print(f"Total expected records: {total_records:,}")
print(f"{'='*80}")

SELECTED DATASETS FOR SCRAPING

 1. Jhansi Lychgate Burial Register
    Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
    Records: 1,698 | ID: 1004

 2. Saharanpur Burials.
    Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
    Records: 368 | ID: 2390

 3. St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register
    Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
    Records: 426 | ID: 2524

 4. Register of Burials at Cinnamara, Assam and Outstations 1939-1959
    Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
    Records: 22 | ID: 1731

 5. Transcription of assorted entries from the Burial Indexes 1800-1947
    Category: Birth Marriage & Deaths > Deaths & Burials > Bengal Burials
    Records: 1,312 | ID: 104

 6. Transcription of assorted entries from the Burial Indexes 1800-1947
    Category: Birth Marriage & Deaths > Deaths & Burials > Bombay Burials
    Records: 569 | ID: 105

## 5. Run the Scraping Pipeline

This will:
1. Visit each dataset's page
2. Extract all records with pagination handling
3. Save to CSV files in properly named folders
4. Create metadata.json files with scraping details

**Note:** This may take several minutes due to rate limiting.
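
As a rough check on that note, the minimum run time can be estimated from the record counts. The sketch below assumes 30 records per page (inferred from the log output, not from the scraper's source) and counts only the 1.5-second delay between requests, so it gives a lower bound rather than a prediction. The helper name `estimate_min_seconds` is hypothetical.

```python
import math

PAGE_SIZE = 30       # assumption: inferred from "30 records" per page in the logs
REQUEST_DELAY = 1.5  # seconds between requests, as reported by the scraper


def estimate_min_seconds(record_counts, page_size=PAGE_SIZE, delay=REQUEST_DELAY):
    """Return (pages, seconds): one delayed request per results page."""
    pages = sum(math.ceil(n / page_size) for n in record_counts)
    return pages, pages * delay


# The six record counts visible in the selection listing above
counts = [1698, 368, 426, 22, 1312, 569]
pages, seconds = estimate_min_seconds(counts)
print(f"{pages} pages, at least {seconds / 60:.1f} minutes")
```

Network latency and parsing time come on top of this figure, so the actual run takes longer.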

In [5]:
# Run the scraping pipeline
print("Starting scraping pipeline...")
print(f"Estimated time: {len(selected_datasets) * 2}-{len(selected_datasets) * 5} minutes\n")

results = scraper.run_pipeline(datasets_to_scrape=selected_datasets)

print("\n" + "="*60)
print("SCRAPING COMPLETE!")
print("="*60)

2025-12-30 14:56:38,263 - INFO - Starting pipeline with 10 datasets
2025-12-30 14:56:38,265 - INFO - [1/10] Processing: Jhansi Lychgate Burial Register
2025-12-30 14:56:38,266 - INFO - Scraping dataset: Jhansi Lychgate Burial Register (1698 records)


Starting scraping pipeline...
Estimated time: 20-50 minutes



2025-12-30 14:56:39,609 - INFO -   Page 1: 30 records (total: 30)
2025-12-30 14:56:40,604 - INFO -   Page 2: 30 records (total: 60)
2025-12-30 14:56:41,762 - INFO -   Page 3: 30 records (total: 90)
2025-12-30 14:56:43,147 - INFO -   Page 4: 30 records (total: 120)
2025-12-30 14:56:44,706 - INFO -   Page 5: 30 records (total: 150)
2025-12-30 14:56:46,300 - INFO -   Page 6: 30 records (total: 180)
2025-12-30 14:56:47,779 - INFO -   Page 7: 30 records (total: 210)
2025-12-30 14:56:49,235 - INFO -   Page 8: 30 records (total: 240)
2025-12-30 14:56:50,748 - INFO -   Page 9: 30 records (total: 270)
2025-12-30 14:56:52,217 - INFO -   Page 10: 30 records (total: 300)
2025-12-30 14:56:53,676 - INFO -   Page 11: 30 records (total: 330)
2025-12-30 14:56:55,217 - INFO -   Page 12: 30 records (total: 360)
2025-12-30 14:56:56,862 - INFO -   Page 13: 30 records (total: 390)
2025-12-30 14:56:58,385 - INFO -   Page 14: 30 records (total: 420)
2025-12-30 14:56:59,918 - INFO -   Page 15: 30 records (tota


SCRAPING COMPLETE!


## 6. Verify Output Structure

Check that all datasets were saved correctly with proper folder structure.
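
One additional check that may be useful here is confirming that every saved CSV has a `metadata.json` next to it. The following is a minimal sketch that relies only on the folder layout shown in this notebook; the function name `find_incomplete_datasets` is hypothetical and not part of `fibis_scraper`.

```python
from pathlib import Path


def find_incomplete_datasets(output_dir):
    """Return dataset folders that contain a CSV but no metadata.json."""
    root = Path(output_dir)
    return sorted(
        csv_file.parent
        for csv_file in root.rglob("*.csv")
        if not (csv_file.parent / "metadata.json").is_file()
    )
```

Calling `find_incomplete_datasets("datasets")` after a run should return an empty list when every dataset was written completely.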

In [6]:
# Display results summary
print("\nScraping Results:")
print("-" * 60)

success_count = 0
for name, path in results.items():
    if path.startswith("ERROR"):
        print(f"❌ {name}: {path}")
    else:
        success_count += 1
        print(f"✅ {name}")
        print(f"   └── {path}")

print(f"\nSuccessfully scraped: {success_count}/{len(results)} datasets")


Scraping Results:
------------------------------------------------------------
✅ Jhansi Lychgate Burial Register
   └── datasets\Birth Marriage & Deaths\Deaths & Burials\Bengal Burials\Jhansi Lychgate Burial Register\Jhansi Lychgate Burial Register.csv
✅ Saharanpur Burials.
   └── datasets\Birth Marriage & Deaths\Deaths & Burials\Bengal Burials\Saharanpur Burials\Saharanpur Burials.csv
✅ St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register
   └── datasets\Birth Marriage & Deaths\Deaths & Burials\Bengal Burials\St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register\St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register.csv
✅ Register of Burials at Cinnamara, Assam and Outstations 1939-1959
   └── datasets\Birth Marriage & Deaths\Deaths & Burials\Bengal Burials\Register of Burials at Cinnamara, Assam and Outstations 1939-1959\Register of Burials at Cinnamara, Assam and Outstations 1939-1959.csv
✅ T

In [7]:
# Verify folder structure
print("\nOutput Folder Structure:")
print("=" * 60)

output_path = Path(OUTPUT_DIR)

def print_tree(path, prefix=""):
    """Recursively print the folder tree, listing files before subfolders."""
    items = sorted(path.iterdir())
    dirs = [item for item in items if item.is_dir()]
    files = [item for item in items if item.is_file()]

    for file in files:
        size = file.stat().st_size
        if size > 1024:
            size_str = f"{size/1024:.1f} KB"
        else:
            size_str = f"{size} bytes"
        print(f"{prefix}📄 {file.name} ({size_str})")

    for folder in dirs:
        print(f"{prefix}📁 {folder.name}/")
        # Indent the children of this folder by one level
        print_tree(folder, prefix + "    ")

if output_path.exists():
    print_tree(output_path)
else:
    print(f"Output directory not found: {output_path}")

print("\n" + "=" * 60)
csv_files = list(output_path.rglob("*.csv")) if output_path.exists() else []
print(f"Total datasets saved: {len(csv_files)}")


Output Folder Structure:
📁 Birth Marriage & Deaths/
    📁 Deaths & Burials/
        📁 Bengal Burials/
            📁 Jhansi Lychgate Burial Register/
                📄 Jhansi Lychgate Burial Register.csv (62.5 KB)
                📄 metadata.json (430 bytes)
            📁 Register of Burials at Cinnamara, Assam and Outstations 1939-1959/
                📄 metadata.json (460 bytes)
                📄 Register of Burials at Cinnamara, Assam and Outstations 1939-1959.csv (876 bytes)
            📁 Saharanpur Burials/
                📄 metadata.json (416 bytes)
                📄 Saharanpur Burials.csv (11.5 KB)
            📁 St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register/
                📄 metadata.json (469 bytes)
                📄 St John the Baptist Armenian Apostolic Church, Rangoon - Burial Register.csv (11.1 KB)
            📁 Transcription of assorted entries from the Burial Indexes 1800-1947/
                📄

In [8]:
print("\nSample Data Preview:")
print("=" * 60)

output_path = Path(OUTPUT_DIR)
sample_shown = False

csv_files = list(output_path.rglob("*.csv"))

if csv_files:
    csv_file = csv_files[0]
    try:
        df = pd.read_csv(csv_file)
        if len(df) > 0:
            relative_path = csv_file.relative_to(output_path)
            print(f"\nDataset: {csv_file.stem}")
            print(f"Path: {relative_path}")
            print(f"Records: {len(df)}, Columns: {len(df.columns)}")
            print(f"Columns: {list(df.columns)}")
            print("\nFirst 5 rows:")
            display(df.head())
            sample_shown = True
    except Exception as e:
        print(f"Error reading {csv_file}: {e}")

if not sample_shown:
    print("No data to preview.")


Sample Data Preview:

Dataset: All Souls Church Coimbatore (1872-2015) Burial Register - Index of names
Path: Birth Marriage & Deaths\Deaths & Burials\Madras Burials\All Souls Church Coimbatore (1872-2015) Burial Register - Index of names\All Souls Church Coimbatore (1872-2015) Burial Register - Index of names.csv
Records: 207, Columns: 4
Columns: ['Death year', 'Surname', 'Christian name', 'View']

First 5 rows:


| | Death year | Surname | Christian name | View |
|---|---|---|---|---|
| 0 | 1872.0 | Pettigrew | Edward | |
| 1 | 1872.0 | Scott | Augustus Marley | |
| 2 | 1872.0 | Sargeaunt | Charles Folliott | |
| 3 | 1872.0 | Martin | John Francis | |
| 4 | 1872.0 | Wharton | William Barton | |


## 7. Summary Statistics

In [9]:
print("\nFinal Summary:")
print("=" * 60)

total_files = 0
total_records_scraped = 0
dataset_stats = []

output_path = Path(OUTPUT_DIR)

csv_files = list(output_path.rglob("*.csv"))

for csv_file in csv_files:
    try:
        df = pd.read_csv(csv_file)
        total_files += 1
        total_records_scraped += len(df)
        
        relative_path = csv_file.relative_to(output_path)
        category_parts = list(relative_path.parts[:-2])
        category = " > ".join(category_parts) if category_parts else "N/A"
        
        dataset_stats.append({
            'Dataset': csv_file.stem,
            'Category': category,
            'Records': len(df),
            'Columns': len(df.columns)
        })
    except Exception as e:
        # Report unreadable files instead of skipping them silently
        print(f"Skipping {csv_file}: {e}")

print(f"Total datasets scraped: {total_files}")
print(f"Total records extracted: {total_records_scraped:,}")

if dataset_stats:
    stats_df = pd.DataFrame(dataset_stats)
    print("\nPer-Dataset Statistics:")
    display(stats_df)


Final Summary:
Total datasets scraped: 10
Total records extracted: 6,062

Per-Dataset Statistics:


| | Dataset | Category | Records | Columns |
|---|---|---|---|---|
| 0 | All Souls Church Coimbatore (1872-2015) Burial... | Birth Marriage & Deaths > Deaths & Burials > M... | 207 | 4 |
| 1 | Chandernagore Civil Death Indexes (1831-1864) | Birth Marriage & Deaths > Deaths & Burials > C... | 1008 | 5 |
| 2 | Civil Registration of Deaths in Chandernagore ... | Birth Marriage & Deaths > Deaths & Burials > C... | 403 | 5 |
| 3 | Cape Town, St. George's Cathedral (1796 - 1830) | Birth Marriage & Deaths > Deaths & Burials > B... | 49 | 4 |
| 4 | Transcription of assorted entries from the Bur... | Birth Marriage & Deaths > Deaths & Burials > B... | 569 | 5 |
| 5 | Jhansi Lychgate Burial Register | Birth Marriage & Deaths > Deaths & Burials > B... | 1698 | 6 |
| 6 | Register of Burials at Cinnamara, Assam and Ou... | Birth Marriage & Deaths > Deaths & Burials > B... | 22 | 4 |
| 7 | Saharanpur Burials | Birth Marriage & Deaths > Deaths & Burials > B... | 368 | 4 |
| 8 | St John the Baptist Armenian Apostolic Church,... | Birth Marriage & Deaths > Deaths & Burials > B... | 426 | 5 |
| 9 | Transcription of assorted entries from the Bur... | Birth Marriage & Deaths > Deaths & Burials > B... | 1312 | 5 |


## 8. How It Works

### Scraping Pipeline Architecture:

**FIBIS Scraper Pipeline**

#### 1. Dataset Discovery
- Parse `recordslist.php` → extract dataset names and URLs
- Extract category hierarchy (`browse_classes` links)

#### 2. Dataset Selection
- Filter by record count
- Select **10 unique datasets**

#### 3. Record Extraction (for each dataset)
- Request page 1
- Parse HTML tables/divs
- Extract headers and data rows
- Check for pagination → request next page
- Repeat until no more pages

#### 4. Output Organization (FIBIS Category Hierarchy)
```
datasets/
└── [Category Level 1]/
    └── [Category Level 2]/
        └── [Category Level 3]/
            └── [Dataset Name]/
                ├── [Dataset Name].csv
                └── metadata.json

```

### Key Features:

- **Category-Based Organization**: Folder structure mirrors FIBIS website hierarchy
- **No Overwrites**: Datasets with same name but different categories are saved separately
- **Unique Dataset IDs**: Each dataset tracked by ID to prevent duplicates
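
The category-based layout can be sketched as a small path-building helper. This is an illustration under stated assumptions rather than the code in `fibis_scraper`: the names `safe_name` and `dataset_dir` are hypothetical, and the sanitising rules (replacing characters that Windows rejects, trimming trailing dots and spaces) are a guess suggested by "Saharanpur Burials." being saved in a folder named "Saharanpur Burials".

```python
import re
from pathlib import Path


def safe_name(text):
    """Make a string usable as a folder name (assumed rules, see lead-in)."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", text)
    return cleaned.strip().rstrip(". ")


def dataset_dir(output_dir, category, name):
    """Build <output_dir>/<category levels...>/<dataset name>."""
    levels = [safe_name(part) for part in category.split(" > ")] if category else []
    return Path(output_dir, *levels, safe_name(name))


# Two datasets share a name but differ in category, so their folders differ
name = "Transcription of assorted entries from the Burial Indexes 1800-1947"
bengal = dataset_dir("datasets", "Birth Marriage & Deaths > Deaths & Burials > Bengal Burials", name)
bombay = dataset_dir("datasets", "Birth Marriage & Deaths > Deaths & Burials > Bombay Burials", name)
print(bengal != bombay)
```

Because the category levels are part of the path, the two "Transcription of assorted entries" datasets land in separate folders.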

### Error Handling:

- **Retry Logic**: 3 retries with increasing backoff delays (5s, 10s, 15s)
- **Timeout Handling**: 30-second timeout per request
- **Rate Limiting**: 1.5 seconds between requests
- **Graceful Degradation**: Continues with next dataset if one fails
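
A minimal sketch of how retries and graceful degradation could fit together is shown below. It is not taken from `fibis_scraper`: the function names are hypothetical, `fetch` stands in for whatever performs the HTTP request, and the `sleep` parameter exists only so the waits can be replaced in tests. The `"ERROR"` prefix mirrors the check used in the results cell of this notebook.

```python
import time

RETRY_DELAYS = (5, 10, 15)  # seconds to wait before each retry, as listed above


def fetch_with_retry(fetch, url, delays=RETRY_DELAYS, sleep=time.sleep):
    """Call fetch(url); on failure wait and retry once per entry in delays."""
    last_error = None
    for attempt in range(len(delays) + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch a narrower exception type
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    raise last_error


def run_all(fetch, urls, **retry_options):
    """Process every URL, recording an error string instead of stopping on failure."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch_with_retry(fetch, url, **retry_options)
        except Exception as exc:
            results[url] = f"ERROR: {exc}"
    return results
```

Passing the fetch function in as an argument keeps the retry logic independent of the HTTP library in use.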

### Pagination Handling:

- Detects "Next" links and page number links
- Uses `start` parameter for page offsets
- Stops when no more pages are detected
- Safety limit of 1000 pages maximum
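
The pagination loop described above can be sketched as follows. The callable `fetch_page(start)` is a hypothetical stand-in that returns the rows of one page plus a flag saying whether a next page was detected; the page size of 30 is an assumption taken from the log output.

```python
MAX_PAGES = 1000  # safety limit, as stated above


def scrape_all_pages(fetch_page, page_size=30, max_pages=MAX_PAGES):
    """Collect rows by advancing a `start` offset until no next page is detected."""
    rows = []
    start = 0
    for _ in range(max_pages):
        page_rows, has_next = fetch_page(start)
        rows.extend(page_rows)
        if not has_next or not page_rows:
            break
        start += page_size
    return rows


# Usage with an in-memory stand-in for a 65-record dataset
records = list(range(65))

def fake_fetch_page(start):
    chunk = records[start:start + 30]
    return chunk, start + 30 < len(records)

print(len(scrape_all_pages(fake_fetch_page)))
```

The `max_pages` bound guarantees termination even if the site keeps reporting a next page.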

---

## End of Notebook

All 10 datasets have been scraped and saved to the `datasets/` folder.

**Output Structure (example):**
```
datasets/
└── Birth Marriage & Deaths/
    └── Deaths & Burials/
        ├── Bengal Burials/
        │   └── Jhansi Lychgate Burial Register/
        │       ├── Jhansi Lychgate Burial Register.csv
        │       └── metadata.json
        └── Bombay Burials/
            └── Transcription of assorted entries.../
                ├── Transcription of assorted entries....csv
                └── metadata.json

```