# Boolean Device Search Examples

This notebook demonstrates the boolean search functionality in PyMAUDE.

The `search_by_device_names()` method provides flexible AND/OR boolean logic for finding devices, with support for grouped searches for comparative analysis.

**Author**: Jacob Schwartz <jaschwa@umich.edu>  
**Copyright**: 2026, GNU GPL v3

## Setup

First, import PyMAUDE and create/connect to a database.

In [1]:
from pymaude import MaudeDatabase
import pandas as pd

# Connect to database
db = MaudeDatabase('analysis/venous_thrombectomy/maude_2008_2025.db')

# Load data if needed (skip if already loaded)
# db.add_years('2020-2024', tables=['master', 'device'], download=True)

# Check database info
db.info()


Database: analysis/venous_thrombectomy/maude_2008_2025.db
_maude_load_metadata 72 records
device          23,286,013 records
master          22,649,853 records
patient         23,619,555 records
text            53,246,898 records

Date range: 2008-01-01 00:00:00 to 2025-12-31 00:00:00
Database size: 47.81 GB


## 1. Create Search Index (One-Time Setup)

Before using `search_by_device_names()`, create a search index for optimal performance. This is a one-time operation that adds a concatenated column and index to the device table.

**Note**: This step is optional - searches will work without it, but will be 10-30x slower.

In [2]:
# Create search index (run once per database)
result = db.create_search_index()

print(f"Index created: {result['created']}")
print(f"Rows indexed: {result['rows_updated']:,}")
print(f"Time taken: {result['time_seconds']:.1f} seconds")

Creating search index...
  - Adding DEVICE_NAME_CONCAT column
  - Populating with concatenated values
  - Creating index on 23,286,013 rows
Search index created in 150.9s
Index created: True
Rows indexed: 23,286,013
Time taken: 150.9 seconds
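
To make the idea of a "concatenated column" concrete, here is a minimal `sqlite3` sketch on an in-memory table. It is not PyMAUDE's actual implementation: the separator, the upper-casing, the index name `idx_device_name_concat`, and the two sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device (BRAND_NAME TEXT, GENERIC_NAME TEXT, MANUFACTURER_D_NAME TEXT)"
)
# Illustrative sample rows (not real query output)
conn.executemany(
    "INSERT INTO device VALUES (?, ?, ?)",
    [
        ("CLEANER ROTATIONAL THROMBECTOMY SYSTEM", "MECHANICAL THROMBECTOMY DEVICE", "REX MEDICAL LP"),
        ("ANGIOJET", None, "BOSTON SCIENTIFIC"),
    ],
)

# Add one column that holds all three name fields, upper-cased.
# COALESCE keeps a NULL field from turning the whole string into NULL.
conn.execute("ALTER TABLE device ADD COLUMN DEVICE_NAME_CONCAT TEXT")
conn.execute(
    """
    UPDATE device
    SET DEVICE_NAME_CONCAT = UPPER(
        COALESCE(BRAND_NAME, '') || ' | ' ||
        COALESCE(GENERIC_NAME, '') || ' | ' ||
        COALESCE(MANUFACTURER_D_NAME, '')
    )
    """
)
conn.execute("CREATE INDEX idx_device_name_concat ON device (DEVICE_NAME_CONCAT)")
conn.commit()

# A substring search now reads a single column instead of three.
hits = conn.execute(
    "SELECT BRAND_NAME FROM device WHERE DEVICE_NAME_CONCAT LIKE ?", ("%BOSTON%",)
).fetchall()
print(hits)
```

The manufacturer-only match on the second row shows why the three fields are combined: a term can hit any of them with one comparison.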


## 2. Simple Searches

### Single Term Search

Search for devices containing a single term across BRAND_NAME, GENERIC_NAME, and MANUFACTURER_D_NAME.

In [None]:
# Find all devices containing "argon"
results = db.search_by_device_names('argon')

print(f"Found {len(results)} events")
print(f"Unique brands: {results['BRAND_NAME'].nunique()}")
print("\nTop 5 brand names:")
print(results['BRAND_NAME'].value_counts().head())
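
The matching behaviour described above (case-insensitive substring match across three name columns) can be mimicked on a small DataFrame. This is a sketch in plain pandas, not the library's code; the helper `contains_term` and the sample rows are hypothetical.

```python
import pandas as pd

# Illustrative rows; None stands in for missing values
devices = pd.DataFrame({
    "BRAND_NAME": ["CLEANER ROTATIONAL THROMBECTOMY SYSTEM", "ANGIOJET", None],
    "GENERIC_NAME": ["MECHANICAL THROMBECTOMY DEVICE", None, "CATHETER"],
    "MANUFACTURER_D_NAME": ["REX MEDICAL LP", "BOSTON SCIENTIFIC", "ARGON MEDICAL"],
})

NAME_COLUMNS = ["BRAND_NAME", "GENERIC_NAME", "MANUFACTURER_D_NAME"]

def contains_term(frame, term):
    """Return a boolean mask: True where any name column contains `term`."""
    mask = pd.Series(False, index=frame.index)
    for column in NAME_COLUMNS:
        # regex=False treats the term literally; na=False treats missing names as non-matches
        mask |= frame[column].str.contains(term, case=False, na=False, regex=False)
    return mask

print(devices[contains_term(devices, "argon")])
```

Here the only match comes from the manufacturer column, even though that row has no brand name.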

### OR Search (Multiple Terms)

Search for devices matching **ANY** of the provided terms.

In [None]:
# Find devices containing "argon" OR "penumbra" OR "angiojet"
results = db.search_by_device_names(['argon', 'penumbra', 'angiojet'])

print(f"Found {len(results)} events")
print(f"Unique manufacturers: {results['MANUFACTURER_D_NAME'].nunique()}")
print("\nTop manufacturers:")
print(results['MANUFACTURER_D_NAME'].value_counts())

## 3. Boolean Logic (AND/OR)

### AND Search

Search for devices where **ALL** terms match (nested list = AND within group).

In [None]:
# Find devices with BOTH "argon" AND "cleaner"
# Note the nested list: [['argon', 'cleaner']]
results = db.search_by_device_names([['argon', 'cleaner']])

print(f"Found {len(results)} events")
print("\nBrand names (should all contain both 'argon' and 'cleaner'):")
print(results['BRAND_NAME'].unique()[:10])

In [5]:
results.shape

(45, 9)

### Complex Boolean: (A AND B) OR C

The most common pattern: pass a list of groups. Terms inside a group are combined with AND, and the groups themselves are combined with OR.

In [None]:
# Find: ("argon" AND "cleaner") OR "angiojet"
# This matches:
#   - Argon Cleaner devices (both terms present)
#   - AngioJet devices (any brand)

results = db.search_by_device_names([
    ['argon', 'cleaner'],  # First group: both terms required
    ['angiojet']           # OR second group: single term
])

print(f"Found {len(results)} events")
print(f"Unique EVENT_KEYs: {results['EVENT_KEY'].nunique()}")
print("\nEvent type breakdown:")
print(results['EVENT_TYPE'].value_counts())

### Three-Way Boolean Logic

Combine multiple AND groups with OR.

In [None]:
# Find: ("argon" AND "cleaner") OR ("argon" AND "thrombectomy") OR ("cleaner" AND "thrombectomy")

results = db.search_by_device_names([
    ['argon', 'cleaner'],         # Argon Cleaner devices
    ['argon', 'thrombectomy'],    # OR Argon Thrombectomy devices  
    ['cleaner', 'thrombectomy']   # OR Cleaner Thrombectomy devices
])

print(f"Found {len(results)} events")
print("\nGeneric name distribution:")
print(results['GENERIC_NAME'].value_counts().head(10))

In [7]:
results.shape

(59, 9)

In [9]:
results

| | MDR_REPORT_KEY | EVENT_KEY | DATE_RECEIVED | EVENT_TYPE | BRAND_NAME | GENERIC_NAME | MANUFACTURER_D_NAME | DEVICE_REPORT_PRODUCT_CODE | device_rowid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2190922 | | 2011-07-19 00:00:00 | M | CLEANER ROTATIONAL THROMBECTOMY SYSTEM | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL LP | MCW | 9467322 |
| 1 | 2594095 | | 2012-05-29 00:00:00 | M | CLEANER ROTATIONAL THROMBECTOMY SYSTEM | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL LP | MCW | 9864309 |
| 2 | 2885392 | | 2012-11-09 00:00:00 | D | CLEANER ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL LP | MCW | 10148580 |
| 3 | 3464139 | | 2013-10-07 00:00:00 | M | CLEANER ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL L.P. | MCW | 10724259 |
| 4 | 3565921 | | 2013-11-08 00:00:00 | M | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL L.P. | MCW | 10814092 |
| 5 | 3599609 | | 2013-12-05 00:00:00 | M | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL LP | MCW | 10821997 |
| 6 | 3869501 | | 2014-04-30 00:00:00 | M | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL, L.P. | MCW | 11130042 |
| 7 | 3899964 | | 2014-05-01 00:00:00 | M | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL LP | MCW | 11160296 |
| 8 | 3937959 | | 2014-02-27 00:00:00 | M | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THOMBECTOMY DEVICE | REX MEDICAL LP | MCW | 11198093 |
| 9 | 4217417 | | 2014-10-14 00:00:00 | D | CLEANER15 ROTATIONAL THROMBECTOMY DEVICE | MECHANICAL THROMBECTOMY DEVICE | REX MEDICAL, L.P. | MCW | 11475071 |
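
The AND/OR rules in this section can be summarized in a few lines of plain Python. The function `matches` below is a hypothetical stand-in that mirrors the semantics described here (a string is one term, a flat list is OR, a nested list is AND within each group); it says nothing about how the library builds its SQL.

```python
def matches(criteria, text):
    """Evaluate str / list / nested-list criteria against one device string."""
    haystack = text.lower()
    groups = [criteria] if isinstance(criteria, str) else criteria
    for group in groups:
        # A bare string is a one-term group; a list is a group of ANDed terms
        terms = [group] if isinstance(group, str) else group
        if all(term.lower() in haystack for term in terms):
            return True  # groups are ORed, so one satisfied group is enough
    return False

name = "CLEANER ROTATIONAL THROMBECTOMY SYSTEM | REX MEDICAL LP"

print(matches("cleaner", name))                     # single term
print(matches(["argon", "cleaner"], name))          # argon OR cleaner
print(matches([["argon", "cleaner"]], name))        # argon AND cleaner
print(matches([["argon", "cleaner"],
               ["cleaner", "thrombectomy"]], name))  # (A AND B) OR (B AND C)
```

The third call is the only one that fails for this string, because "argon" does not appear in it. The fourth succeeds through its second group.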


## 4. Date Filtering

Combine boolean search with date filters.

In [None]:
# Search for thrombectomy devices from 2023 only
results = db.search_by_device_names(
    [['argon', 'thrombectomy'], 'angiojet', 'penumbra'],
    start_date='2023-01-01',
    end_date='2023-12-31'
)

print(f"Found {len(results)} events in 2023")
print("\nMonthly distribution:")
results['month'] = pd.to_datetime(results['DATE_RECEIVED']).dt.to_period('M')
print(results['month'].value_counts().sort_index())
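
If you already hold a results DataFrame, the same date window can be applied after the fact in pandas. The sketch below uses made-up rows and assumes inclusive bounds; whether `start_date` and `end_date` are inclusive inside the library is not stated in this notebook.

```python
import pandas as pd

# Illustrative rows in the same timestamp format as DATE_RECEIVED
events = pd.DataFrame({
    "MDR_REPORT_KEY": [1, 2, 3, 4],
    "DATE_RECEIVED": [
        "2022-12-31 00:00:00",
        "2023-01-01 00:00:00",
        "2023-06-15 00:00:00",
        "2024-01-01 00:00:00",
    ],
})
events["DATE_RECEIVED"] = pd.to_datetime(events["DATE_RECEIVED"])

# between() is inclusive on both ends by default.
# The end bound is midnight, which suits date-only timestamps like these.
start, end = pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31")
in_range = events[events["DATE_RECEIVED"].between(start, end)]

monthly = in_range["DATE_RECEIVED"].dt.to_period("M").value_counts().sort_index()
print(monthly)
```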

## 5. Real-World Example: Venous Thrombectomy Devices

A realistic research scenario: identify all venous thrombectomy devices.

In [None]:
# Define inclusion criteria for venous thrombectomy devices
# Include:
#   - Argon Cleaner rotational devices
#   - Boston Scientific AngioJet rheolytic devices
#   - Penumbra aspiration devices (Indigo, Lightning)
#   - Inari ClotTriever devices

thrombectomy_devices = db.search_by_device_names([
    ['argon', 'cleaner'],
    ['boston', 'angiojet'],
    ['penumbra', 'indigo'],
    ['penumbra', 'lightning'],
    ['inari', 'clottriever']
], start_date='2019-01-01')  # Problem codes available 2019+

print(f"Found {len(thrombectomy_devices)} events")
print(f"Unique devices: {thrombectomy_devices['BRAND_NAME'].nunique()}")
print(f"Date range: {thrombectomy_devices['DATE_RECEIVED'].min()} to {thrombectomy_devices['DATE_RECEIVED'].max()}")
print("\nTop 10 devices:")
print(thrombectomy_devices['BRAND_NAME'].value_counts().head(10))

### Enrich with Additional Data

Once you have your device selection, enrich with narratives, patient data, etc.

In [None]:
# Get patient outcome data
enriched = db.enrich_with_patient_data(thrombectomy_devices)

# Check for the column before using it, so a missing OUTCOME_CODE does not raise KeyError
if 'OUTCOME_CODE' in enriched.columns:
    print(f"Events with patient data: {enriched['OUTCOME_CODE'].notna().sum()}")
    print("\nOutcome distribution:")
    print(enriched['OUTCOME_CODE'].value_counts())

# Get yearly trends (DataFrame-only)
trends = db.get_trends_by_year(thrombectomy_devices)
print("\nYearly trends:")
print(trends)

## 6. Examining Results

Inspect what was found to validate your search criteria.

In [None]:
# Get unique brand names to verify search captured what you wanted
results = db.search_by_device_names([['argon', 'cleaner'], 'angiojet'])

print("Unique BRAND_NAMEs found:")
for brand in sorted(results['BRAND_NAME'].unique()):
    count = (results['BRAND_NAME'] == brand).sum()
    print(f"  {brand}: {count} events")

### Filter Out Unwanted Results

If your search captured devices you don't want, filter them afterward.

In [None]:
# Search for Argon devices
results = db.search_by_device_names('argon')

print(f"Initial results: {len(results)} events")

# Exclude balloon catheters (not thrombectomy devices)
filtered = results[~results['BRAND_NAME'].str.contains('BALLOON', case=False, na=False)]

print(f"After excluding balloons: {len(filtered)} events")
print("\nRemaining brand names:")
print(filtered['BRAND_NAME'].unique()[:10])
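
When there are several unwanted terms, one combined pattern avoids chaining filters. This is a plain pandas sketch with invented brand names; `exclude_terms` is a hypothetical list you would adapt to your own results.

```python
import re
import pandas as pd

# Invented brand names for illustration only
sample = pd.DataFrame({
    "BRAND_NAME": ["CLEANER XT", "PTA BALLOON CATHETER", None, "BIOPSY NEEDLE"],
})

exclude_terms = ["balloon", "needle"]
# re.escape guards against terms that contain regex metacharacters
pattern = "|".join(re.escape(term) for term in exclude_terms)

excluded = sample["BRAND_NAME"].str.contains(pattern, case=False, na=False)
filtered_sample = sample[~excluded]

print(f"Kept {len(filtered_sample)} of {len(sample)} rows")
```

Note that `na=False` keeps rows whose brand name is missing. Drop them separately if they should not count.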

## 7. Performance Comparison

Compare search speed with and without the search index.

In [None]:
import time

# With search index (default)
start = time.time()
results_fast = db.search_by_device_names([['argon', 'cleaner'], 'angiojet'])
time_with_index = time.time() - start

# Without search index (searches individual columns)
start = time.time()
results_slow = db.search_by_device_names(
    [['argon', 'cleaner'], 'angiojet'],
    use_concat_column=False
)
time_without_index = time.time() - start

print(f"With search index: {time_with_index:.3f} seconds")
print(f"Without search index: {time_without_index:.3f} seconds")
print(f"Speedup: {time_without_index / time_with_index:.1f}x faster")
print(f"\nSame number of rows returned: {len(results_fast) == len(results_slow)}")
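
A single timed run can be skewed by disk caching, so it may help to repeat each measurement and keep the best time. The helper `best_of` below is a hypothetical utility, shown with a stand-in workload so it runs without a database; in practice you would pass a lambda that calls the search.

```python
import time

def best_of(func, repeats=3):
    """Run func several times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; replace with e.g. lambda: db.search_by_device_names(...)
elapsed, total = best_of(lambda: sum(range(100_000)))
print(f"Best of 3: {elapsed:.6f} seconds (result={total})")
```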

## 8. Common Patterns Cheatsheet

Quick reference for common search patterns.

In [None]:
# Pattern 1: Simple OR (any term matches)
results = db.search_by_device_names(['term1', 'term2', 'term3'])

# Pattern 2: Simple AND (all terms must match)
results = db.search_by_device_names([['term1', 'term2', 'term3']])

# Pattern 3: (A AND B) OR (C AND D)
results = db.search_by_device_names([
    ['term1', 'term2'],
    ['term3', 'term4']
])

# Pattern 4: Manufacturer + device type
results = db.search_by_device_names([['manufacturer', 'device_type']])

# Pattern 5: Multiple manufacturers, same device type
results = db.search_by_device_names([
    ['manufacturer1', 'device_type'],
    ['manufacturer2', 'device_type']
])

# Pattern 6: Recent events only
results = db.search_by_device_names(
    ['device_term'],
    start_date='2023-01-01'
)

# Pattern 7: Grouped search for comparative analysis
results = db.search_by_device_names({
    'mechanical': [['argon', 'cleaner'], 'angiojet'],
    'aspiration': 'penumbra'
})
# Results include search_group column for analysis
print(results['search_group'].unique())

# Pattern 8: Count matches (note: this still loads the full result set before counting)
count = len(db.search_by_device_names(['device_term']))
print(f"Found {count} matching events")

## 9. Tips and Best Practices

### Performance Tips

1. **Always create the search index first**: `db.create_search_index()`
2. **Use date filters** to reduce result set size
3. **More specific terms** = faster searches
4. **Test your search criteria** on a small date range first

### Search Strategy

1. **Start broad, then refine**: Begin with manufacturer or device type, then add specificity
2. **Validate your results**: Check `BRAND_NAME.unique()` to see what you captured
3. **Iterate**: If you missed devices, add more terms to your criteria
4. **Document your criteria**: Save search criteria in your analysis code for reproducibility
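
One way to document search criteria is to store them as JSON next to the analysis. The sketch below only uses the standard library; the record layout (`criteria`, `start_date`) is an assumption, not a format that PyMAUDE defines.

```python
import json

search_criteria = {
    "mechanical": [["argon", "cleaner"], "angiojet"],
    "aspiration": [["penumbra", "indigo"], ["penumbra", "lightning"]],
}

# Hypothetical record layout: the criteria plus any filters used with them
record = {"criteria": search_criteria, "start_date": "2019-01-01"}

serialized = json.dumps(record, indent=2)
print(serialized)

# Nested lists and dicts survive the round trip unchanged
restored = json.loads(serialized)
```

The restored `criteria` dict has the same shape as the original, so it can be passed to the search call as-is.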

### Common Pitfalls

- **Typos**: "penubra" won't match "penumbra"
- **Case is ignored**: "ARGON" and "argon" are equivalent
- **Partial matches**: "throm" matches "thrombectomy", "thrombosis", etc., so short terms can pull in unintended devices
- **Multiple devices per event**: Some events involve multiple devices
- **Duplicate EVENT_KEYs**: Use `deduplicate_events=True` (default) for event counts
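
The difference between device rows and reports is easy to see on a toy DataFrame. This sketch assumes that `MDR_REPORT_KEY` identifies one report; it illustrates the counting issue only and does not reproduce what `deduplicate_events` does internally.

```python
import pandas as pd

# Invented rows: report 101 lists two devices
rows = pd.DataFrame({
    "MDR_REPORT_KEY": [101, 101, 102],
    "BRAND_NAME": ["ANGIOJET", "CLEANER XT", "ANGIOJET"],
})

device_rows = len(rows)
unique_reports = rows["MDR_REPORT_KEY"].nunique()

# Keep the first device row seen for each report
one_row_per_report = rows.drop_duplicates(subset="MDR_REPORT_KEY", keep="first")

print(f"{device_rows} device rows, {unique_reports} unique reports")
```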

## 10. Grouped Search for Comparative Analysis

The most powerful feature: dict-based grouped search for comparing multiple device categories.

### Basic Grouped Search

```python
# Compare mechanical vs aspiration thrombectomy devices
results = db.search_by_device_names({
    'mechanical': [['argon', 'cleaner'], 'angiojet'],
    'aspiration': [['penumbra', 'indigo'], ['penumbra', 'lightning']]
})

# Results include search_group column
print(results['search_group'].unique())
# Output: ['mechanical', 'aspiration']

# Use with helper functions for automatic grouping
summary = db.summarize_by_brand(results)  # Groups by search_group automatically
trends = db.get_trends_by_year(results)   # Includes search_group breakdown
comparison = db.event_type_comparison(results)  # Compares groups

# Filter to single group for focused analysis
mechanical_only = results[results['search_group'] == 'mechanical']
trends_mech = db.get_trends_by_year(mechanical_only)
```

### Overlap Handling

An event that matches several groups appears only in the **first matching group** (in dict order):

```python
results = db.search_by_device_names({
    'all_argon': 'argon',
    'cleaner_xt': [['argon', 'cleaner', 'xt']]  # Subset of all_argon
})
# Warning: "X events previously matched to 'all_argon' were skipped from 'cleaner_xt'"
```
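
The first-match rule means that group order changes the outcome. The sketch below models that rule in plain Python; `matches` and `assign_group` are hypothetical helpers that follow the semantics described in this notebook, and the device string is invented.

```python
def matches(criteria, text):
    """Evaluate str / list / nested-list criteria against one device string."""
    haystack = text.lower()
    groups = [criteria] if isinstance(criteria, str) else criteria
    for group in groups:
        terms = [group] if isinstance(group, str) else group
        if all(term.lower() in haystack for term in terms):
            return True
    return False

def assign_group(groups, text):
    """Return the name of the first group (in dict order) that matches."""
    for group_name, criteria in groups.items():
        if matches(criteria, text):
            return group_name
    return None

broad_first = {"all_argon": "argon", "cleaner_xt": [["argon", "cleaner", "xt"]]}
narrow_first = {"cleaner_xt": [["argon", "cleaner", "xt"]], "all_argon": "argon"}

device = "ARGON MEDICAL CLEANER XT"
print(assign_group(broad_first, device))   # all_argon
print(assign_group(narrow_first, device))  # cleaner_xt
```

Listing the narrower group first keeps it from being swallowed by the broader one.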

### Real-World Example

```python
import matplotlib.pyplot as plt

# Compare venous stent brands
results = db.search_by_device_names({
    'venovo': 'venovo',
    'zilver_vena': 'zilver vena',
    'vici': 'vici venous',
    'abre': 'abre'
}, start_date='2019-01-01')

# Analyze by group
summary = db.summarize_by_brand(results)
print(summary['counts'])

# Visualize trends
trends = db.get_trends_by_year(results)
for group in results['search_group'].unique():
    group_trends = trends[trends['search_group'] == group]
    plt.plot(group_trends['year'], group_trends['event_count'], label=group, marker='o')
plt.legend()
plt.show()
```

## Summary

The `search_by_device_names()` method provides:

✅ **Fast boolean searches** with AND/OR logic  
✅ **Grouped search** for device comparison with dict input  
✅ **Simple API**: strings, lists, nested lists, or dicts  
✅ **Date filtering** integration  
✅ **Case-insensitive** partial matching  
✅ **10-30x faster** with search index  
✅ **Reproducible** - search criteria in code  
✅ **Helper integration** - works seamlessly with analysis functions  

For more details, see the [API Reference](docs/api_reference.md).
