# Safety Signal Detection

**Purpose**: Identify unusual patterns in adverse events  
**Data**: 2020-2023 (4 years)  
**Target Audience**: Advanced researchers, safety surveillance

## What is Signal Detection?

Signal detection identifies potential safety issues by looking for:
- Sudden spikes in event counts
- Changes in event type proportions
- Emerging failure modes in narratives

## Important Note

This notebook demonstrates **exploratory techniques**. A statistical signal is a hypothesis, not a finding: it requires clinical validation and a properly designed epidemiological study before any conclusion is drawn.

In [1]:
import sys
from pathlib import Path
sys.path.insert(0, str(Path().resolve().parent / 'src'))

from maude_db import MaudeDatabase
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_columns', None)

In [2]:
db = MaudeDatabase('signal_detection.db', verbose=True)

db.add_years(
    years='2020-2023',
    tables=['device', 'master'],
    download=True,
    data_dir='./maude_data'
)


Grouping years by file for optimization...

Downloading files...
  Downloading device2020.zip...
  Downloading device2021.zip...
  Using cached device2022.zip
  Using cached device2023.zip
  Using mdrfoithru2024.zip instead (latest available cumulative file).
  Downloading mdrfoithru2024.zip...
  Error downloading 2020: 
  Skipping master - download failed

Processing data files...

Loading device for year 2020...
    Identified date columns: DATE_REMOVED_FLAG, IMPLANT_DATE_YEAR, DATE_REMOVED_YEAR, DATE_RECEIVED, EXPIRATION_DATE_OF_DEVICE, DATE_RETURNED_TO_MANUFACTURER
    Processed 1,100,000 rows...
    Total: 1,567,925 rows

Loading device for year 2021...
    Identified date columns: DATE_REMOVED_FLAG, IMPLANT_DATE_YEAR, DATE_REMOVED_YEAR, DATE_RECEIVED, EXPIRATION_DATE_OF_DEVICE, DATE_RETURNED_TO_MANUFACTURER
    Processed 1,100,000 rows...
    Processed 2,032,838 rows...
    Total: 2,032,838 rows

Loading device for year 2022...
    Identified date columns: DATE_REMOVED_FLAG, IMP


    Scanned 1,099,999 rows, kept 1...
    Scanned 2,099,999 rows, kept 1...
    Scanned 3,099,999 rows, kept 471,816...
    Scanned 4,099,999 rows, kept 1,471,392...



    Scanned 5,099,998 rows, kept 2,471,026...
    Scanned 6,099,998 rows, kept 3,470,651...



    Scanned 7,099,997 rows, kept 4,470,398...
    Scanned 8,099,997 rows, kept 5,470,343...
    Scanned 9,099,997 rows, kept 6,470,338...
    Scanned 10,099,997 rows, kept 7,470,336...
    Scanned 11,099,997 rows, kept 8,470,335...
    Scanned 12,099,997 rows, kept 8,878,787...
    Scanned 13,099,997 rows, kept 8,878,788...
    Scanned 14,099,997 rows, kept 8,878,788...
    Scanned 15,099,997 rows, kept 8,878,789...
    Scanned 16,099,997 rows, kept 8,878,790...
    Scanned 17,099,997 rows, kept 8,878,790...
    Scanned 18,099,997 rows, kept 8,878,790...
    Scanned 19,099,997 rows, kept 8,878,790...
    Scanned 20,099,997 rows, kept 8,878,790...
    Total: Scanned 20,747,247 rows, loaded 8,878,790 rows for 4 years
    Per-year breakdown:
      2020: 1,564,999 rows
      2021: 2,028,313 rows
      2022: 2,945,665 rows
      2023: 2,339,813 rows

Creating indexes...

Database update complete


## 1. Temporal Analysis: Detect Spikes

Look for months with unusually high event counts.

In [3]:
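# Query device of interest
device_name = 'insulin pump'
results = db.query_device(device_name=device_name)
print(f"Total events: {len(results):,}")

# The original run stopped here with
# "ValueError: cannot assemble with duplicate keys". pandas raises this when
# pd.to_datetime receives a DataFrame with repeated column names, which
# suggests DATE_RECEIVED appears more than once in the query result.
# Keep only the first occurrence of each column name before converting.
results = results.loc[:, ~results.columns.duplicated()].copy()

# Convert dates and group by month
results['date'] = pd.to_datetime(results['DATE_RECEIVED'], errors='coerce')
results['year_month'] = results['date'].dt.to_period('M')

monthly = results.groupby('year_month').size()
print("\nMonthly event counts:")
print(monthly.tail(12))

Total events: 827,623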

In [None]:
# Detect spikes using statistical threshold
mean = monthly.mean()
std = monthly.std()
threshold = mean + 2*std  # 2 standard deviations

spikes = monthly[monthly > threshold]
print(f"\nMonths exceeding threshold ({threshold:.0f} events):")
print(spikes)

# Visualize
plt.figure(figsize=(12, 5))
plt.plot(monthly.index.astype(str), monthly.values, marker='o', linewidth=2)
plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold ({threshold:.0f})')
plt.xlabel('Month')
plt.ylabel('Event Count')
plt.title(f'{device_name.title()} Events Over Time')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
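
A mean plus two standard deviations threshold is itself pulled upward by the spikes it is meant to find. One common alternative, shown here as a minimal sketch and not as the method this notebook uses, replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The function name `robust_spikes`, the cutoff `k=3.0`, and the monthly counts are all hypothetical and chosen for illustration only.

```python
from statistics import median


def robust_spikes(counts, k=3.0):
    """Return the positions of values more than k scaled MADs above the median."""
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normal.
    scale = 1.4826 * mad
    if scale == 0:
        # All values (or more than half of them) are identical: no usable spread.
        return []
    return [i for i, c in enumerate(counts) if (c - med) / scale > k]


# Hypothetical monthly counts, not MAUDE data
counts = [100, 104, 98, 101, 97, 103, 99, 250, 102, 100, 96, 105]
spike_idx = robust_spikes(counts)
print(spike_idx)
```

With real data, the same function could be applied to `monthly.tolist()`, and the returned positions mapped back to `monthly.index`.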

## 2. Event Type Proportion Changes

Check whether the proportions of deaths, injuries, and malfunctions change over time.

In [None]:
# Calculate event type proportions by year
results['year'] = results['date'].dt.year

yearly_breakdown = []
for year in sorted(results['year'].dropna().unique()):
    year_data = results[results['year'] == year]
    breakdown = db.event_type_breakdown_for(year_data)
    yearly_breakdown.append({
        'Year': int(year),
        'Total': breakdown['total'],
        'Deaths': breakdown['deaths'],
        'Injuries': breakdown['injuries'],
        'Malfunctions': breakdown['malfunctions'],
        'Death_Rate': breakdown['deaths'] / breakdown['total'] if breakdown['total'] > 0 else 0
    })

breakdown_df = pd.DataFrame(yearly_breakdown)
print(breakdown_df)
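
Whether a change in the death proportion between two years exceeds what chance alone would produce can be checked with a two-proportion z-test. The sketch below is a standard-library illustration under the assumption that reports are independent, which MAUDE reports often are not. The function name `two_proportion_z` and the counts are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: event counts (e.g. deaths) in each period
    n1, n2: total reports in each period
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both proportions are exactly 0 or exactly 1: nothing to test.
        return 0.0, 1.0
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not MAUDE data
z, p = two_proportion_z(x1=40, n1=20000, x2=75, n2=25000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only says the proportions differ; it does not separate a real safety change from a change in reporting practice.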

## 3. Limitations and Next Steps

**Limitations of this analysis:**
- No adjustment for market growth
- Reporting bias (changes in reporting practices)
- Seasonal effects not considered
- Multiple testing (many comparisons)

**For rigorous signal detection:**
1. Use proper statistical methods (Poisson regression, disproportionality analysis)
2. Adjust for confounders (market size, reporting trends)
3. Validate signals clinically
4. Consider recalls and label changes as covariates
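
As one example of the disproportionality analysis named in step 1, the proportional reporting ratio (PRR) compares how often an event of interest is reported for one device against how often it is reported for comparator devices. The sketch below uses the usual 2x2 table and a log-scale approximate 95% confidence interval. The function name `prr_with_ci` and the counts are hypothetical, and the formula requires `a` and `c` to be greater than zero.

```python
import math


def prr_with_ci(a, b, c, d):
    """Proportional reporting ratio with an approximate 95% CI.

    a: reports for the device of interest with the event of interest
    b: reports for the device of interest with any other event
    c: reports for comparator devices with the event of interest
    d: reports for comparator devices with any other event
    """
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(PRR)
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se_log)
    upper = math.exp(math.log(prr) + 1.96 * se_log)
    return prr, lower, upper


# Hypothetical counts, not MAUDE data
prr, lower, upper = prr_with_ci(a=30, b=970, c=100, d=9900)
print(f"PRR = {prr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A lower confidence bound above 1 is often treated as a screening flag for follow-up, not as evidence of causation.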

In [None]:
db.close()
print("\n✓ Signal detection complete!")
print("See docs/research_guide.md for best practices.")