diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..f5a9b03
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,63 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+DocumentMetadataAPI is a Flask REST API that returns bibliographic metadata (title, journal, abstract, pub date, etc.) for biomedical publications, looked up by PMID, PMC ID, or DOI. It targets a p90 SLO of 150ms.
+
+The backend is MongoDB (DocumentDB in production), accessed via the `pymongo` driver. Tracing is done via OpenTelemetry → Jaeger.
+
+## Running locally
+
+```bash
+pip install -r requirements.txt
+python main.py # Flask dev server on port 5000
+```
+
+For production-style serving (matches Docker):
+```bash
+gunicorn -b 0.0.0.0:8000 --workers 1 --threads 8 --timeout 0 main:app
+```
+
+The app connects to MongoDB. Without a `connection_string` env var it defaults to a local MongoDB instance (`client['local']`). In production, set:
+```bash
+export connection_string="mongodb://..."
+```
+
+## API endpoints
+
+- `GET /` — health check; returns sample IDs from each ID type
+- `GET /version` — returns version string
+- `GET /publications?pubids=PMID:123,PMC456&request_id=` — main endpoint; returns metadata + `_meta` wrapper
+- `GET /identifiers?pubids=PMID:123,DOI:10.1/x` — cross-reference lookup; returns synonyms across PMID/PMC/DOI
+
+## Architecture
+
+**`main.py`** — Flask app with two main routes:
+- `/publications`: queries the `documentMetadata` MongoDB collection by `document_id`. For missing IDs, falls back to the NCBI PMC ID converter API (`www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/`).
+- `/identifiers`: queries the `documentIds` reference collection for cross-ID lookups, then falls back to NCBI PMC ID converter for misses. Processes PMID, PMC, and DOI in separate batches.
+
+**`data_loader.py`** — AWS Lambda handler for bulk-loading data. Reads gzipped TSV files from GCS (via boto3 S3-compatible client), parses them, fetches PMID→PMC/DOI synonyms, and upserts into MongoDB. The same document is stored three times: once per identifier (PMID, PMC, DOI), all pointing to identical metadata.
+
+**`data_checker.py`** — AWS Lambda handler for auditing. Verifies whether documents from a given TSV file are present (or absent for delete files) in MongoDB.
+
+**`query_tester.py`** — Local load-testing script. Hits the production endpoint with batches of 10/50/100 IDs and reports `processing_time_ms` statistics.
+
+## MongoDB collections
+
+- `documentMetadata` — main collection; keyed on `document_id` (e.g., `PMID:30690000`, `PMC1234`, `10.1000/xyz`)
+- `documentIds` — reference/synonym collection; fields `PM`, `PMC`, `DOI`
+
+## ID format conventions
+
+The API normalizes input IDs before lookup:
+- `PMC:` prefix → `PMC` (no colon)
+- `DOI:` prefix → stripped entirely
+- Lookups are case-insensitive for prefixes
+
+## Deployment
+
+Jenkins CI (`Jenkinsfile`) builds a Docker image, pushes to AWS ECR (`853771734544.dkr.ecr.us-east-1.amazonaws.com/translator-docmetadataapi`), and deploys to AWS EKS. The pipeline polls SCM every 5 minutes and targets the `translator-eks-ci-blue-cluster`.
+
+`rds-combined-ca-bundle.pem` is the TLS CA bundle for AWS DocumentDB (MongoDB-compatible); the Dockerfile now fetches the updated global bundle from AWS at build time instead.
diff --git a/README.md b/README.md
index 0bcc843..55abce3 100644
--- a/README.md
+++ b/README.md
@@ -66,4 +66,62 @@ GET /publications?pubids=PMID:30690000,PMID:82374,PMID:28736,PMID:8000234&reques
 ]
 }
-```
\ No newline at end of file
+```
+
+## Updating the database
+
+The database is populated from tab-separated (TSV) files where each row is one publication. The columns must appear in this order (minimum 10):
+
+```
+document_id pub_year pub_month pub_day journal_name journal_abbrev volume issue article_title abstract
+```
+
+- `document_id` should be prefixed (`PMID:30690000`). Use `-` for an absent `pub_day`.
+- Each PMID record is automatically duplicated under its PMC and DOI synonyms (fetched from the NCBI ID converter API), so you only need to supply PMID-keyed rows.
+
+### Loading locally
+
+`data_loader.py` has no CLI entry point, so loading is done via a short Python script:
+
+```python
+from pymongo import MongoClient
+import data_loader
+
+client = MongoClient("mongodb://...") # or omit arg for localhost
+db = client['test']
+data_loader.collection = db['documentMetadata']
+data_loader.reference = db['documentIds']
+
+data_loader.process_file('/path/to/data.tsv')
+```
+
+To delete records instead of upsert, pass a plain text file with one `document_id` per line and call `process_file('/path/to/deletes.txt', is_delete=True)`.
+
+### Via AWS Lambda (`data_loader.lambda_handler`)
+
+The Lambda variant downloads a gzipped TSV from a GCS bucket and then upserts (or, if "deleted" appears in the filename, deletes). It expects this event payload:
+
+```json
+{
+  "source": {
+    "bucket": "my-gcs-bucket",
+    "filepath": "path/to/data.tsv.gz",
+    "hmac_key_id": "...",
+    "hmac_secret": "..."
+  }
+}
+```
+
+It connects to MongoDB via the `connection_string` environment variable (required).
+
+> **Note:** There is a known bug in the current source — `lambda_handler` passes `source_info['bucket']` and `source_info['filepath']` to `process_file` instead of the local path returned by `get_file`. The Lambda may not be functional without fixing this first.
+
+### Comparison
+
+| | Local script | Lambda |
+|---|---|---|
+| Data source | Local TSV file | Gzipped TSV from GCS |
+| MongoDB auth | Any connection string you supply | `connection_string` env var on the Lambda |
+| Synonyms | Fetched from NCBI API | Fetched from NCBI API |
+| Delete support | `is_delete=True` | Filename must contain `"deleted"` |
+| Known issues | None | Bug passing wrong path to `process_file` |
\ No newline at end of file