# Reproducible Research Workflow

**Purpose**: Best practices for publication and sharing  
**Target Audience**: Researchers preparing manuscripts

## What You'll Learn

- Document data sources and versions
- Use checksum tracking for integrity
- Create reproducible analysis pipelines
- Export for publication
- Share databases with collaborators

## Why Reproducibility Matters

Reproducible research allows:
- Peer verification
- Replication studies
- Long-term validity
- Collaborative science

In [1]:
import sys
from pathlib import Path
sys.path.insert(0, str(Path().resolve().parent / 'src'))

from maude_db import MaudeDatabase
import pandas as pd
from datetime import datetime

## 1. Document Your Setup

Always record:
- Date of analysis
- maude_db version
- Years of data
- Tables used
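
The metadata cell in this notebook does not capture library versions, so one way to record them is sketched here. It uses only the standard library. The distribution name `maude_db` is an assumption: if the library is imported from a local `src` directory (as in the first cell) rather than installed, the lookup falls back to `unknown` and the version has to be noted by hand. The helper `package_version` is a hypothetical name, not part of maude_db.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or 'unknown' if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"


environment = {
    "python": platform.python_version(),
    "pandas": package_version("pandas"),
    # Assumption: the library is installed under the distribution name "maude_db".
    "maude_db": package_version("maude_db"),
}

print("Environment:")
for name, version in environment.items():
    print(f"  {name}: {version}")
```

These values can be merged into `analysis_metadata` so they travel with the exported files.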

In [2]:
# Document analysis metadata
analysis_metadata = {
    'analysis_date': datetime.now().strftime('%Y-%m-%d'),
    'analyst': 'Your Name',
    'project': 'Device Safety Study',
    'data_years': '2020-2023',
    'tables': ['device', 'master'],
    'research_question': 'Analyze pacemaker adverse events'
}

print("Analysis Metadata:")
for key, value in analysis_metadata.items():
    print(f"  {key}: {value}")

Analysis Metadata:
  analysis_date: 2026-01-05
  analyst: Your Name
  project: Device Safety Study
  data_years: 2020-2023
  tables: ['device', 'master']
  research_question: Analyze pacemaker adverse events


## 2. Create Database with Verification

The library automatically tracks checksums to verify data integrity.

In [3]:
db = MaudeDatabase('reproducible_study.db', verbose=True)

# Download data (checksums tracked automatically)
db.add_years(
    years='2020-2023',
    tables=['device', 'master'],
    download=True,
    data_dir='./maude_data'
)

print("\n✓ Data loaded with checksum tracking")


Grouping years by file for optimization...

Downloading files...
  Downloading device2020.zip...
  Downloading device2021.zip...
  Using cached device2022.zip
  Using cached device2023.zip
  Using mdrfoithru2024.zip instead (latest available cumulative file).
  Downloading mdrfoithru2024.zip...

Processing data files...

Loading device for year 2020...
    Identified date columns: DATE_REMOVED_FLAG, IMPLANT_DATE_YEAR, DATE_REMOVED_YEAR, DATE_RECEIVED, EXPIRATION_DATE_OF_DEVICE, DATE_RETURNED_TO_MANUFACTURER
    Processed 1,100,000 rows...
    Total: 1,567,925 rows

Loading device for year 2021...
    Identified date columns: DATE_REMOVED_FLAG, IMPLANT_DATE_YEAR, DATE_REMOVED_YEAR, DATE_RECEIVED, EXPIRATION_DATE_OF_DEVICE, DATE_RETURNED_TO_MANUFACTURER
    Processed 1,100,000 rows...
    Processed 2,032,838 rows...
    Total: 2,032,838 rows

Loading device for year 2022...
    Identified date columns: DATE_REMOVED_FLAG, IMPLANT_DATE_YEAR, DATE_REMOVED_YEAR, DATE_RECEIVED, EXPIRATION_DA


    Scanned 1,099,999 rows, kept 1...
    Scanned 2,099,999 rows, kept 1...
    Scanned 3,099,999 rows, kept 471,816...
    Scanned 4,099,999 rows, kept 1,471,392...
    Scanned 5,099,998 rows, kept 2,471,026...
    Scanned 6,099,998 rows, kept 3,470,651...
    Scanned 7,099,997 rows, kept 4,470,398...
    Scanned 8,099,997 rows, kept 5,470,343...
    Scanned 9,099,997 rows, kept 6,470,338...
    Scanned 10,099,997 rows, kept 7,470,336...
    Scanned 11,099,997 rows, kept 8,470,335...
    Scanned 12,099,997 rows, kept 8,878,787...
    Scanned 13,099,997 rows, kept 8,878,788...
    Scanned 14,099,997 rows, kept 8,878,788...
    Scanned 15,099,997 rows, kept 8,878,789...
    Scanned 16,099,997 rows, kept 8,878,790...
    Scanned 17,099,997 rows, kept 8,878,790...
    Scanned 18,099,997 rows, kept 8,878,790...
    Scanned 19,099,997 rows, kept 8,878,790...
    Scanned 20,099,997 rows, kept 8,878,790...
    Total: Scanned 20,747,247 rows, loaded 8,878,790 rows for 4 years
    Per-year breakdown:
      2020: 1,564,999 rows
      2021: 2,028,313 rows
      2022: 2,945,665 rows
      2023: 2,339,813 rows

Creating indexes...

Database update complete

✓ Data loaded with checksum tracking


## 3. Parameterized Analysis Pipeline

Keep every analysis parameter in a single configuration dictionary so that a replication only needs to change values in one place.

In [4]:
# Analysis configuration
CONFIG = {
    'device_name': 'pacemaker',
    'start_date': '2020-01-01',
    'end_date': '2023-12-31',
    'min_events': 10  # Exclude low-count manufacturers
}

print("Analysis Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Analysis Configuration:
  device_name: pacemaker
  start_date: 2020-01-01
  end_date: 2023-12-31
  min_events: 10


In [5]:
# Run analysis with config
results = db.query_device(
    device_name=CONFIG['device_name'],
    start_date=CONFIG['start_date'],
    end_date=CONFIG['end_date']
)

print(f"\nQuery results: {len(results):,} events")


Query results: 234,325 events
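
`CONFIG['min_events']` is defined above but not applied by the query itself. A minimal sketch of how such a threshold could be applied with pandas follows. The function `filter_by_min_events` and the column name `MANUFACTURER` are hypothetical; substitute the manufacturer column actually present in your `results` DataFrame.

```python
import pandas as pd


def filter_by_min_events(events, group_col, min_events):
    """Keep rows whose group (e.g. manufacturer) has at least `min_events` rows.

    Rows with a missing value in `group_col` are dropped, because they
    cannot be assigned to any group.
    """
    counts = events[group_col].map(events[group_col].value_counts())
    return events[counts >= min_events]


# Toy data standing in for `results`; the column name is hypothetical.
toy = pd.DataFrame({
    "MANUFACTURER": ["A", "A", "A", "B"],
    "MDR_REPORT_KEY": [1, 2, 3, 4],
})
filtered = filter_by_min_events(toy, "MANUFACTURER", min_events=2)
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

In the notebook this would be called as `filter_by_min_events(results, <manufacturer column>, CONFIG['min_events'])`, which keeps the threshold in the configuration rather than hard-coded in the analysis.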


## 4. Export for Publication

Export data and metadata for supplementary materials.

In [6]:
# Export results
output_prefix = f"{CONFIG['device_name']}_{analysis_metadata['analysis_date']}"

# Export full results
results.to_csv(f'{output_prefix}_full_data.csv', index=False)
print(f"✓ Exported: {output_prefix}_full_data.csv")

# Export summary statistics
summary = {
    'total_events': len(results),
    'date_range': f"{CONFIG['start_date']} to {CONFIG['end_date']}",
    'unique_reports': results['MDR_REPORT_KEY'].nunique()
}

summary_df = pd.DataFrame([summary])
summary_df.to_csv(f'{output_prefix}_summary.csv', index=False)
print(f"✓ Exported: {output_prefix}_summary.csv")

✓ Exported: pacemaker_2026-01-05_full_data.csv
✓ Exported: pacemaker_2026-01-05_summary.csv
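
The cell above writes the data and a summary. The metadata and configuration dictionaries can be saved next to them so the supplementary files are self-describing. The sketch below writes both to one JSON file. `write_metadata` is a hypothetical helper, and the example dictionaries are stand-ins for `analysis_metadata` and `CONFIG`.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(path, metadata, config):
    """Write analysis metadata and configuration to one JSON file."""
    payload = {"metadata": metadata, "config": config}
    path = Path(path)
    # sort_keys gives a stable file, which makes diffs between runs readable.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Stand-in values mirroring the dictionaries defined earlier in the notebook.
example_metadata = {"analysis_date": "2026-01-05", "tables": ["device", "master"]}
example_config = {"device_name": "pacemaker", "min_events": 10}

# Written to a temporary directory here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_metadata(
        Path(tmp) / "pacemaker_2026-01-05_metadata.json",
        example_metadata,
        example_config,
    )
    restored = json.loads(saved.read_text(encoding="utf-8"))

print(restored["config"]["device_name"])
```

In the notebook the call would be `write_metadata(f'{output_prefix}_metadata.json', analysis_metadata, CONFIG)`.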


## 5. Methods Section Template

Use this template for your manuscript methods section:

In [7]:
methods_text = f"""
DATA SOURCE
-----------
We analyzed data from the FDA MAUDE (Manufacturer and User Facility Device 
Experience) database for the period {CONFIG['start_date']} to {CONFIG['end_date']}.
Data were downloaded on {analysis_metadata['analysis_date']} using the maude_db 
Python library (version X.X.X).

INCLUSION CRITERIA
------------------
We included adverse event reports for devices with generic names containing 
'{CONFIG['device_name']}'. Reports were filtered to the date range specified above.

DATA INTEGRITY
--------------
Data integrity was verified using SHA-256 checksums of source files. The analysis 
database and code are available upon request for reproducibility.

ANALYSIS
--------
We analyzed {len(results):,} adverse event reports. [Describe your specific analyses here]
"""

print(methods_text)

# Save methods text (explicit encoding keeps the file identical across platforms)
with open(f'{output_prefix}_methods.txt', 'w', encoding='utf-8') as f:
    f.write(methods_text)
print(f"\n✓ Saved methods template to: {output_prefix}_methods.txt")


DATA SOURCE
-----------
We analyzed data from the FDA MAUDE (Manufacturer and User Facility Device 
Experience) database for the period 2020-01-01 to 2023-12-31.
Data were downloaded on 2026-01-05 using the maude_db 
Python library (version X.X.X).

INCLUSION CRITERIA
------------------
We included adverse event reports for devices with generic names containing 
'pacemaker'. Reports were filtered to the date range specified above.

DATA INTEGRITY
--------------
Data integrity was verified using SHA-256 checksums of source files. The analysis 
database and code are available upon request for reproducibility.

ANALYSIS
--------
We analyzed 234,325 adverse event reports. [Describe your specific analyses here]


✓ Saved methods template to: pacemaker_2026-01-05_methods.txt


## 6. Database Archiving

For long-term storage, consider:
- Uploading the database to Zenodo to obtain a DOI
- Including file checksums in the README
- Documenting the exact FDA file versions used
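
A checksum list for the README can be produced with the standard library alone. The sketch below computes SHA-256 digests independently of whatever maude_db tracks internally. The helper names are hypothetical, and the `*.zip` pattern assumes the downloaded source files keep the names shown in the download log.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(directory, pattern="*.zip"):
    """Build 'digest  filename' lines for every matching file in a directory."""
    files = sorted(Path(directory).glob(pattern))
    return "\n".join(f"{sha256_of_file(p)}  {p.name}" for p in files)


# Demo on a throwaway directory; point this at your own data directory instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.zip").write_bytes(b"abc")
    manifest = checksum_manifest(tmp)

print(manifest)
```

For the archive itself, `checksum_manifest('./maude_data')` would cover the downloaded files, and `sha256_of_file('reproducible_study.db')` the database. The two-space `digest  filename` layout follows the format used by the `sha256sum` tool, so collaborators can verify the files without Python.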

In [8]:
db.close()
print("\n✓ Reproducible workflow complete!")
print("\nNext steps:")
print("  1. Upload database + code to Zenodo")
print("  2. Include data availability statement in paper")
print("  3. Share analysis notebooks as supplementary material")


✓ Reproducible workflow complete!

Next steps:
  1. Upload database + code to Zenodo
  2. Include data availability statement in paper
  3. Share analysis notebooks as supplementary material
