63 changes: 63 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,63 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

DocumentMetadataAPI is a Flask REST API that returns bibliographic metadata (title, journal, abstract, publication date, etc.) for biomedical publications, looked up by PMID, PMC ID, or DOI. Its latency target is a p90 SLO of 150 ms.

The backend is MongoDB (DocumentDB in production), accessed via the `pymongo` driver. Tracing is done via OpenTelemetry → Jaeger.

## Running locally

```bash
pip install -r requirements.txt
python main.py # Flask dev server on port 5000
```

For production-style serving (matches Docker):
```bash
gunicorn -b 0.0.0.0:8000 --workers 1 --threads 8 --timeout 0 main:app
```

The app connects to MongoDB. Without a `connection_string` env var it defaults to a local MongoDB instance (`client['local']`). In production, set:
```bash
export connection_string="mongodb://..."
```

## API endpoints

- `GET /` — health check; returns sample IDs from each ID type
- `GET /version` — returns version string
- `GET /publications?pubids=PMID:123,PMC456&request_id=<uuid>` — main endpoint; returns metadata + `_meta` wrapper
- `GET /identifiers?pubids=PMID:123,DOI:10.1/x` — cross-reference lookup; returns synonyms across PMID/PMC/DOI
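
As a rough illustration of the query-string shape listed above, the sketch below builds a `/publications` URL. The helper name `build_publications_url` and the base URL are hypothetical; only the `pubids` and `request_id` parameters come from the endpoint list.

```python
import uuid
from urllib.parse import urlencode


def build_publications_url(base_url, pubids, request_id=None):
    """Build a /publications URL for a list of prefixed IDs (hypothetical helper)."""
    params = {
        "pubids": ",".join(pubids),
        # Generate a request ID when the caller does not supply one.
        "request_id": request_id or str(uuid.uuid4()),
    }
    # Leave ':' and ',' unescaped so the URL matches the form shown above.
    query = urlencode(params, safe=":,")
    return f"{base_url.rstrip('/')}/publications?{query}"


if __name__ == "__main__":
    # Assumes the Flask dev server from "Running locally" is listening on port 5000.
    print(build_publications_url("http://localhost:5000", ["PMID:123", "PMC456"]))
```

The resulting URL can be passed to any HTTP client.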

## Architecture

**`main.py`** — Flask app with two main routes:
- `/publications`: queries the `documentMetadata` MongoDB collection by `document_id`. For missing IDs, falls back to the NCBI PMC ID converter API (`www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/`).
- `/identifiers`: queries the `documentIds` reference collection for cross-ID lookups, then falls back to NCBI PMC ID converter for misses. Processes PMID, PMC, and DOI in separate batches.

**`data_loader.py`** — AWS Lambda handler for bulk-loading data. Reads gzipped TSV files from GCS (via boto3 S3-compatible client), parses them, fetches PMID→PMC/DOI synonyms, and upserts into MongoDB. The same document is stored three times: once per identifier (PMID, PMC, DOI), all pointing to identical metadata.

**`data_checker.py`** — AWS Lambda handler for auditing. Verifies whether documents from a given TSV file are present (or absent for delete files) in MongoDB.

**`query_tester.py`** — Local load-testing script. Hits the production endpoint with batches of 10/50/100 IDs and reports `processing_time_ms` statistics.
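
For reference, a p90 check against the 150 ms SLO could look like the sketch below. This is not the code in `query_tester.py`; the function names are hypothetical and the samples are assumed to be a list of `processing_time_ms` values.

```python
import statistics

SLO_P90_MS = 150  # target stated in the Overview


def p90(samples):
    """Return the 90th percentile of a list of latencies in milliseconds."""
    # quantiles(n=10) returns the nine decile cut points; the last one is p90.
    return statistics.quantiles(samples, n=10)[-1]


def meets_slo(samples, slo_ms=SLO_P90_MS):
    """True when the p90 latency is at or below the SLO."""
    return p90(samples) <= slo_ms


if __name__ == "__main__":
    timings = [42.0, 55.5, 61.2, 70.1, 88.9, 93.4, 101.7, 120.3, 139.8, 160.2]
    print("p90:", p90(timings), "meets SLO:", meets_slo(timings))
```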

## MongoDB collections

- `documentMetadata` — main collection; keyed on `document_id` (e.g., `PMID:30690000`, `PMC1234`, `10.1000/xyz`)
- `documentIds` — reference/synonym collection; fields `PM`, `PMC`, `DOI`
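
A batched lookup keyed on `document_id` might be expressed as in the sketch below. It assumes a MongoDB `$in` filter and hypothetical helper names; the actual query in `main.py` may differ. The IDs left over after the lookup are the ones that would go to the NCBI fallback.

```python
def build_lookup_filter(document_ids):
    """Build a MongoDB filter matching any of the given document_id values."""
    # Assumption: a single $in query per batch; main.py may query differently.
    return {"document_id": {"$in": list(document_ids)}}


def find_missing(document_ids, found_docs):
    """Return the requested IDs that no returned document covers, in request order."""
    found = {doc["document_id"] for doc in found_docs}
    return [doc_id for doc_id in document_ids if doc_id not in found]


if __name__ == "__main__":
    requested = ["PMID:30690000", "PMC1234", "10.1000/xyz"]
    # With pymongo this filter would be passed as collection.find(build_lookup_filter(requested)).
    returned = [{"document_id": "PMID:30690000"}, {"document_id": "10.1000/xyz"}]
    print(build_lookup_filter(requested))
    print(find_missing(requested, returned))
```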

## ID format conventions

The API normalizes input IDs before lookup:
- `PMC:` prefix → `PMC` (no colon)
- `DOI:` prefix → stripped entirely
- Lookups are case-insensitive for prefixes
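
The rules above can be sketched as a small function. This is an illustration, not the code in `main.py`: the name `normalize_pubid` is hypothetical, and it assumes that prefixes are canonicalized to upper case and that PMIDs keep their `PMID:` prefix, as in the `documentMetadata` key examples.

```python
def normalize_pubid(raw):
    """Normalize a user-supplied ID to the stored key format (illustrative sketch)."""
    value = raw.strip()
    upper = value.upper()  # prefixes are matched case-insensitively
    if upper.startswith("DOI:"):
        return value[4:]  # DOI: prefix is stripped entirely
    if upper.startswith("PMC:"):
        return "PMC" + value[4:]  # PMC: becomes PMC (no colon)
    if upper.startswith("PMC"):
        return "PMC" + value[3:]  # already colon-free; fix prefix case only
    if upper.startswith("PMID:"):
        return "PMID:" + value[5:]  # assumption: PMIDs keep the colon form
    return value  # unknown form: pass through unchanged


if __name__ == "__main__":
    for example in ["pmc:456", "DOI:10.1/x", "pmid:123", "PMC456"]:
        print(example, "->", normalize_pubid(example))
```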

## Deployment

Jenkins CI (`Jenkinsfile`) builds a Docker image, pushes to AWS ECR (`853771734544.dkr.ecr.us-east-1.amazonaws.com/translator-docmetadataapi`), and deploys to AWS EKS. The pipeline polls SCM every 5 minutes and targets the `translator-eks-ci-blue-cluster`.

`rds-combined-ca-bundle.pem` is a TLS CA bundle for AWS DocumentDB (MongoDB-compatible). The Dockerfile no longer relies on it; it now fetches the updated global bundle from AWS at build time.
60 changes: 59 additions & 1 deletion README.md
@@ -66,4 +66,62 @@ GET /publications?pubids=PMID:30690000,PMID:82374,PMID:28736,PMID:8000234&reques
]
}

```

## Updating the database

The database is populated from tab-separated (TSV) files in which each row is one publication. Each row must contain at least the following 10 columns, in this order:

```
document_id pub_year pub_month pub_day journal_name journal_abbrev volume issue article_title abstract
```

- `document_id` should be prefixed (`PMID:30690000`). Use `-` for an absent `pub_day`.
- Each PMID record is automatically duplicated under its PMC and DOI synonyms (fetched from the NCBI ID converter API), so you only need to supply PMID-keyed rows.
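
To make the column layout concrete, here is a minimal sketch of parsing one row. It is not the parser in `data_loader.py`: the names `COLUMNS` and `parse_row` are hypothetical, and mapping `-` to `None` is an assumption about how an absent `pub_day` might be represented.

```python
COLUMNS = [
    "document_id", "pub_year", "pub_month", "pub_day", "journal_name",
    "journal_abbrev", "volume", "issue", "article_title", "abstract",
]


def parse_row(line):
    """Split one TSV line into a dict keyed by the documented column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(COLUMNS):
        raise ValueError(f"expected at least {len(COLUMNS)} columns, got {len(fields)}")
    # Extra trailing columns are ignored; only the first 10 are named.
    record = dict(zip(COLUMNS, fields))
    if record["pub_day"] == "-":
        record["pub_day"] = None  # assumption: '-' marks an absent day
    return record


if __name__ == "__main__":
    sample = "\t".join([
        "PMID:30690000", "2019", "1", "-", "Example Journal",
        "Ex J", "12", "3", "An example title", "An example abstract.",
    ])
    print(parse_row(sample))
```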

### Loading locally

`data_loader.py` has no CLI entry point, so loading is done via a short Python script:

```python
from pymongo import MongoClient
import data_loader

client = MongoClient("mongodb://...") # or omit arg for localhost
db = client['test']
data_loader.collection = db['documentMetadata']
data_loader.reference = db['documentIds']

data_loader.process_file('/path/to/data.tsv')
```

To delete records instead of upserting them, pass a plain text file with one `document_id` per line and call `process_file('/path/to/deletes.txt', is_delete=True)`.

### Via AWS Lambda (`data_loader.lambda_handler`)

The Lambda variant downloads a gzipped TSV from a GCS bucket and then upserts (or, if "deleted" appears in the filename, deletes). It expects this event payload:

```json
{
"source": {
"bucket": "my-gcs-bucket",
"filepath": "path/to/data.tsv.gz",
"hmac_key_id": "...",
"hmac_secret": "..."
}
}
```

It connects to MongoDB via the `connection_string` environment variable (required).
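
The event handling described above can be sketched as follows. The helper names are hypothetical and this is not the code in `data_loader.py`; in particular, checking only the base filename (rather than the full path) for "deleted" is an assumption.

```python
import os

REQUIRED_SOURCE_KEYS = ("bucket", "filepath", "hmac_key_id", "hmac_secret")


def validate_event(event):
    """Return the 'source' dict, raising if any documented key is missing."""
    source = event.get("source")
    if not isinstance(source, dict):
        raise ValueError("event must contain a 'source' object")
    missing = [key for key in REQUIRED_SOURCE_KEYS if key not in source]
    if missing:
        raise ValueError(f"missing source keys: {', '.join(missing)}")
    return source


def is_delete_file(filepath):
    """True when the filename marks the file as a delete list."""
    # Assumption: only the base filename is inspected, not the directory part.
    return "deleted" in os.path.basename(filepath)


if __name__ == "__main__":
    event = {
        "source": {
            "bucket": "my-gcs-bucket",
            "filepath": "path/to/data.tsv.gz",
            "hmac_key_id": "...",
            "hmac_secret": "...",
        }
    }
    source = validate_event(event)
    print("delete" if is_delete_file(source["filepath"]) else "upsert")
```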

> **Note:** There is a known bug in the current source: `lambda_handler` passes `source_info['bucket']` and `source_info['filepath']` to `process_file` instead of the local path returned by `get_file`. The Lambda may not work until this is fixed.

### Comparison

| | Local script | Lambda |
|---|---|---|
| Data source | Local TSV file | Gzipped TSV from GCS |
| MongoDB auth | Any connection string you supply | `connection_string` env var on the Lambda |
| Synonyms | Fetched from NCBI API | Fetched from NCBI API |
| Delete support | `is_delete=True` | Filename must contain `"deleted"` |
| Known issues | None | Bug passing wrong path to `process_file` |