ncats · gaurav · May 25, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,63 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+DocumentMetadataAPI is a Flask REST API that returns bibliographic metadata (title, journal, abstract, publication date, etc.) for biomedical publications, looked up by PMID, PMC ID, or DOI. Its latency SLO is a p90 of 150 ms.
+
+The backend is MongoDB (DocumentDB in production), accessed via the `pymongo` driver. Tracing is done via OpenTelemetry → Jaeger.
+
+## Running locally
+
+```bash
+pip install -r requirements.txt
+python main.py          # Flask dev server on port 5000
+```
+
+For production-style serving (matches Docker):
+```bash
+gunicorn -b 0.0.0.0:8000 --workers 1 --threads 8 --timeout 0 main:app
+```
+
+The app connects to MongoDB. If the `connection_string` environment variable is not set, it defaults to a local MongoDB instance and uses the `local` database (`client['local']`). In production, set:
+```bash
+export connection_string="mongodb://..."
+```
+
+## API endpoints
+
+- `GET /` — health check; returns sample IDs from each ID type
+- `GET /version` — returns version string
+- `GET /publications?pubids=PMID:123,PMC456&request_id=<uuid>` — main endpoint; returns metadata + `_meta` wrapper
+- `GET /identifiers?pubids=PMID:123,DOI:10.1/x` — cross-reference lookup; returns synonyms across PMID/PMC/DOI
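+
+As a rough illustration of how a client might assemble a `/publications` request, here is a minimal sketch. The helper name `build_publications_url` and the default base URL (the Flask dev server port from "Running locally") are assumptions for illustration, not code from this repository.
+
+```python
+import uuid
+from urllib.parse import urlencode
+
+
+def build_publications_url(pubids, request_id=None, base_url="http://localhost:5000"):
+    """Build a GET /publications URL for a list of IDs (hypothetical helper)."""
+    params = {
+        "pubids": ",".join(pubids),
+        # The endpoint accepts a request_id; generate a UUID if none is given.
+        "request_id": request_id or str(uuid.uuid4()),
+    }
+    # Keep ':' and ',' unescaped so the URL matches the documented form.
+    return f"{base_url}/publications?{urlencode(params, safe=':,')}"
+
+
+if __name__ == "__main__":
+    print(build_publications_url(["PMID:123", "PMC456"]))
+```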
+
+## Architecture
+
+**`main.py`** — Flask app with two main routes:
+- `/publications`: queries the `documentMetadata` MongoDB collection by `document_id`. For missing IDs, falls back to the NCBI PMC ID converter API (`www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/`).
+- `/identifiers`: queries the `documentIds` reference collection for cross-ID lookups, then falls back to NCBI PMC ID converter for misses. Processes PMID, PMC, and DOI in separate batches.
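+
+The lookup-then-fallback pattern both routes share can be sketched as follows. This is not the code in `main.py`: a plain dict stands in for the MongoDB collection, a callable stands in for the NCBI ID converter request, and the name `lookup_with_fallback` is hypothetical.
+
+```python
+from typing import Callable, Dict, Iterable, List
+
+
+def lookup_with_fallback(
+    ids: Iterable[str],
+    primary: Dict[str, dict],
+    fallback: Callable[[List[str]], Dict[str, dict]],
+) -> Dict[str, dict]:
+    """Return records for ids, asking the fallback only about the misses."""
+    ids = list(ids)
+    found = {i: primary[i] for i in ids if i in primary}
+    misses = [i for i in ids if i not in found]
+    if misses:
+        # One batched fallback call for everything the primary store lacked.
+        found.update(fallback(misses))
+    return found
+
+
+if __name__ == "__main__":
+    store = {"PMID:1": {"title": "A"}}            # stand-in for the collection
+    converter = lambda misses: {m: {"title": None} for m in misses}  # stand-in for NCBI
+    print(lookup_with_fallback(["PMID:1", "PMC2"], store, converter))
+```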
+
+**`data_loader.py`** — AWS Lambda handler for bulk-loading data. Reads gzipped TSV files from GCS (via boto3 S3-compatible client), parses them, fetches PMID→PMC/DOI synonyms, and upserts into MongoDB. The same document is stored three times: once per identifier (PMID, PMC, DOI), all pointing to identical metadata.
+
+**`data_checker.py`** — AWS Lambda handler for auditing. Verifies whether documents from a given TSV file are present (or absent for delete files) in MongoDB.
+
+**`query_tester.py`** — Local load-testing script. Hits the production endpoint with batches of 10/50/100 IDs and reports `processing_time_ms` statistics.
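+
+One way to check collected `processing_time_ms` samples against the p90 target is sketched below. It is an illustration only, not what `query_tester.py` does; the names `p90` and `meets_slo` are hypothetical, and the inclusive quantile method is one of several reasonable choices.
+
+```python
+import statistics
+
+SLO_P90_MS = 150  # p90 target stated in the overview
+
+
+def p90(samples_ms):
+    """90th percentile of the samples (needs at least two data points)."""
+    # n=10 yields nine cut points; the last one is the 90th percentile.
+    return statistics.quantiles(samples_ms, n=10, method="inclusive")[-1]
+
+
+def meets_slo(samples_ms, target_ms=SLO_P90_MS):
+    """True if the p90 of the samples is at or below the target."""
+    return p90(samples_ms) <= target_ms
+
+
+if __name__ == "__main__":
+    samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
+    print(p90(samples), meets_slo(samples))
+```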
+
+## MongoDB collections
+
+- `documentMetadata` — main collection; keyed on `document_id` (e.g., `PMID:30690000`, `PMC1234`, `10.1000/xyz`)
+- `documentIds` — reference/synonym collection; fields `PM`, `PMC`, `DOI`
+
+## ID format conventions
+
+The API normalizes input IDs before lookup:
+- `PMC:` prefix → `PMC` (no colon)
+- `DOI:` prefix → stripped entirely
+- Lookups are case-insensitive for prefixes
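+
+A minimal sketch of these normalization rules, assuming the canonical stored forms are the ones shown under "MongoDB collections" (`PMID:30690000`, `PMC1234`, bare DOI). The function name `normalize_pubid` and the upper-casing of matched prefixes are assumptions; the actual logic in `main.py` may differ.
+
+```python
+def normalize_pubid(raw: str) -> str:
+    """Normalize one input ID to its assumed stored form (hypothetical helper)."""
+    pubid = raw.strip()
+    upper = pubid.upper()
+    if upper.startswith("PMC:"):
+        return "PMC" + pubid[4:]      # "PMC:" prefix -> "PMC" (no colon)
+    if upper.startswith("DOI:"):
+        return pubid[4:]              # "DOI:" prefix -> stripped entirely
+    if upper.startswith("PMID:"):
+        return "PMID:" + pubid[5:]    # keep the prefix, canonical case
+    if upper.startswith("PMC"):
+        return "PMC" + pubid[3:]      # already colon-free; fix case only
+    return pubid                      # anything else is passed through
+
+
+if __name__ == "__main__":
+    for example in ["pmc:456", "DOI:10.1/x", "pmid:123", "PMC456"]:
+        print(example, "->", normalize_pubid(example))
+```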
+
+## Deployment
+
+Jenkins CI (`Jenkinsfile`) builds a Docker image, pushes to AWS ECR (`853771734544.dkr.ecr.us-east-1.amazonaws.com/translator-docmetadataapi`), and deploys to AWS EKS. The pipeline polls SCM every 5 minutes and targets the `translator-eks-ci-blue-cluster`.
+
+`rds-combined-ca-bundle.pem` is the TLS CA bundle for AWS DocumentDB (MongoDB-compatible). The Dockerfile now fetches the updated global bundle from AWS at build time instead of using this file.
diff --git a/README.md b/README.md
@@ -66,4 +66,62 @@ GET /publications?pubids=PMID:30690000,PMID:82374,PMID:28736,PMID:8000234&reques
     ]
 }
 
-```
+```
+
+## Updating the database
+
+The database is populated from tab-separated (TSV) files in which each row is one publication. Each row must contain at least the following 10 columns, in this order:
+
+```
+document_id  pub_year  pub_month  pub_day  journal_name  journal_abbrev  volume  issue  article_title  abstract
+```
+
+- `document_id` should be prefixed (`PMID:30690000`). Use `-` for an absent `pub_day`.
+- Each PMID record is automatically duplicated under its PMC and DOI synonyms (fetched from the NCBI ID converter API), so you only need to supply PMID-keyed rows.
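+
+To sanity-check a file before loading it, something like the sketch below could be used. It is not taken from `data_loader.py`: the names `COLUMNS` and `parse_rows` are hypothetical, and mapping a `-` in `pub_day` to `None` is an assumption about how an absent day might be represented.
+
+```python
+import csv
+import io
+
+COLUMNS = [
+    "document_id", "pub_year", "pub_month", "pub_day", "journal_name",
+    "journal_abbrev", "volume", "issue", "article_title", "abstract",
+]
+
+
+def parse_rows(tsv_text):
+    """Yield one dict per TSV row, keyed by the ten documented columns."""
+    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
+    for line_no, fields in enumerate(reader, start=1):
+        if not fields:
+            continue  # skip blank lines
+        if len(fields) < len(COLUMNS):
+            raise ValueError(
+                f"line {line_no}: expected at least {len(COLUMNS)} columns, got {len(fields)}"
+            )
+        row = dict(zip(COLUMNS, fields))  # columns beyond the tenth are ignored
+        if row["pub_day"] == "-":
+            row["pub_day"] = None         # assumed representation of an absent day
+        yield row
+
+
+if __name__ == "__main__":
+    sample = "\t".join(
+        ["PMID:30690000", "2019", "1", "-", "Journal", "J", "1", "2", "Title", "Abstract"]
+    )
+    print(next(parse_rows(sample + "\n")))
+```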
+
+### Loading locally
+
+`data_loader.py` has no CLI entry point, so loading is done via a short Python script:
+
+```python
+from pymongo import MongoClient
+import data_loader
+
+client = MongoClient("mongodb://...")   # or omit arg for localhost
+db = client['test']
+data_loader.collection = db['documentMetadata']
+data_loader.reference  = db['documentIds']
+
+data_loader.process_file('/path/to/data.tsv')
+```
+
+To delete records instead of upserting them, pass a plain text file with one `document_id` per line and call `process_file('/path/to/deletes.txt', is_delete=True)`.
+
+### Via AWS Lambda (`data_loader.lambda_handler`)
+
+The Lambda variant downloads a gzipped TSV from a GCS bucket and then upserts its records, or deletes them if the filename contains "deleted". It expects this event payload:
+
+```json
+{
+  "source": {
+    "bucket": "my-gcs-bucket",
+    "filepath": "path/to/data.tsv.gz",
+    "hmac_key_id": "...",
+    "hmac_secret": "..."
+  }
+}
+```
+
+It connects to MongoDB via the `connection_string` environment variable (required).
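+
+A hedged sketch of how such an event could be validated and the delete mode derived from the filename. This is not the handler's actual code: `parse_event` and `REQUIRED_KEYS` are hypothetical names, treating all four `source` fields as required is an assumption, and the substring check is case-sensitive by choice.
+
+```python
+import posixpath
+
+REQUIRED_KEYS = ("bucket", "filepath", "hmac_key_id", "hmac_secret")
+
+
+def parse_event(event):
+    """Validate the event payload and decide between upsert and delete."""
+    source = event.get("source")
+    if not isinstance(source, dict):
+        raise ValueError("event must contain a 'source' object")
+    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
+    if missing:
+        raise ValueError(f"missing source fields: {', '.join(missing)}")
+    # Only the final path component counts as the filename here.
+    filename = posixpath.basename(source["filepath"])
+    return {
+        "bucket": source["bucket"],
+        "filepath": source["filepath"],
+        "is_delete": "deleted" in filename,
+    }
+
+
+if __name__ == "__main__":
+    event = {"source": {"bucket": "my-gcs-bucket", "filepath": "path/to/data.tsv.gz",
+                        "hmac_key_id": "id", "hmac_secret": "secret"}}
+    print(parse_event(event))
+```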
+
+> **Note:** There is a known bug in the current source — `lambda_handler` passes `source_info['bucket']` and `source_info['filepath']` to `process_file` instead of the local path returned by `get_file`. The Lambda may not be functional without fixing this first.
+
+### Comparison
+
+| | Local script | Lambda |
+|---|---|---|
+| Data source | Local TSV file | Gzipped TSV from GCS |
+| MongoDB auth | Any connection string you supply | `connection_string` env var on the Lambda |
+| Synonyms | Fetched from NCBI API | Fetched from NCBI API |
+| Delete support | `is_delete=True` | Filename must contain `"deleted"` |
+| Known issues | None known | `lambda_handler` passes the wrong path to `process_file` |