# Data Acquisition - Summary

## Objective
Download and organize gnomAD v4 reference data and access Tapestry cohort files.

## Data Sources
- **gnomAD v4 PCA loadings**: Hail table from GCP public bucket (gs://gcp-public-data--gnomad)
- **gnomAD v4 RF model**: ONNX format ancestry classifier
- **Tapestry cohort**: Mayo retrospective CDS variant data (n=97,422 samples)

## Files Organized
- `gnomad_v4_pca_loadings.ht/` - Hail table with PC loadings for 168,373 variants
- `gnomad_v4_rf_model.onnx` - Random Forest classifier for ancestry prediction
- `tapestry_cds.{bed,bim,fam}` - PLINK format genotype files
- `tapestry_metadata.csv` - Sample metadata including assay versions
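
A quick presence check can confirm that the files listed above are in place before moving on. The sketch below is illustrative only: the file names come from the list above, while the `missing_files` helper and the idea of a single data directory are assumptions, not part of the original workflow.

```python
from pathlib import Path

# Names taken from the "Files Organized" list. The `.ht` entry is a
# directory (Hail table); the others are regular files.
EXPECTED = [
    "gnomad_v4_pca_loadings.ht",
    "gnomad_v4_rf_model.onnx",
    "tapestry_cds.bed",
    "tapestry_cds.bim",
    "tapestry_cds.fam",
    "tapestry_metadata.csv",
]


def missing_files(data_dir, expected=EXPECTED):
    """Return the expected names that do not exist under data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_files(some_data_dir)` returns an empty list when everything is present; the directory to pass depends on where the files were actually organized.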

## Next Step
Merge metadata with PLINK sample files (Notebook 00_merge).

In [None]:
# Imports for downloading from Google Cloud Storage
from google.cloud import storage
from pathlib import Path

# Tapestry VCFs

In [None]:
client = storage.Client()
bucket_name = "ra-model-artifacts"
prefix = "datasets/tapestry/vcf-files/04-unified-header-tars"
blobs = client.list_blobs(bucket_name, prefix=prefix)

destination_path = Path("/home/ext_meehl_joshua_mayo_edu/gfm-discovery/02_genomics_domain/data/tapestry/vcfs/04-unified-header-tars")
destination_path.mkdir(parents=True, exist_ok=True)

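for blob in blobs:
    # Skip "directory" placeholder objects, which would otherwise be saved as empty files.
    if blob.name.endswith("/"):
        continue
    destination_file_name = destination_path / Path(blob.name).name
    # list_blobs already yields Blob objects, so download each one directly.
    blob.download_to_filename(str(destination_file_name))
    print(f"Downloaded {blob.name} to {destination_file_name}")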

In [None]:
prefix = "datasets/tapestry/vcf-files/05-merged-cohort"
blobs = client.list_blobs(bucket_name, prefix=prefix)

destination_path = Path("/home/ext_meehl_joshua_mayo_edu/gfm-discovery/02_genomics_domain/data/tapestry/vcfs/05-merged-cohort")
destination_path.mkdir(parents=True, exist_ok=True)

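for blob in blobs:
    # Skip "directory" placeholder objects, which would otherwise be saved as empty files.
    if blob.name.endswith("/"):
        continue
    destination_file_name = destination_path / Path(blob.name).name
    # list_blobs already yields Blob objects, so download each one directly.
    blob.download_to_filename(str(destination_file_name))
    print(f"Downloaded {blob.name} to {destination_file_name}")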
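
The two download cells above differ only in their prefix and destination, so the pattern could be factored into one function. This is a minimal sketch, not the notebook's actual code: the name `download_prefix` is hypothetical, and `client` is assumed to behave like `google.cloud.storage.Client` (only `list_blobs` and `Blob.download_to_filename` are used).

```python
from pathlib import Path


def download_prefix(client, bucket_name, prefix, destination):
    """Download every object under `prefix` into `destination` (flat layout).

    `client` is assumed to behave like google.cloud.storage.Client.
    Returns the list of local paths that were written.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    written = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "directory" placeholder objects
        target = destination / Path(blob.name).name
        blob.download_to_filename(str(target))
        written.append(target)
    return written
```

With the variables defined in the cells above, a call would look like `download_prefix(client, bucket_name, prefix, destination_path)`. Because only the final path component is kept, objects in different sub-folders that share a file name would overwrite each other.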

# Get gnomAD v2.1.1 LD data (NOT WORKING)

In [None]:
client = storage.Client()
bucket_name = "gcp-public-data--gnomad"
ancestries = [
    "afr",
    "amr",
    "asj",
    "eas",
    "fin",
    "nfe",
    "est",
    "nwe",
    "seu",
]
write_path = Path('/home/ext_meehl_joshua_mayo_edu/gfm-discovery/02_genomics_domain/data/ld/gnomAD')

In [None]:
# Object prefixes for the per-ancestry LD block matrices (.bm) and their variant index tables (.ht)
blobs_bm = [f"release/2.1.1/ld/gnomad.genomes.r2.1.1.{ancestry}.common.adj.ld.bm" for ancestry in ancestries]
blobs_idx = [f"release/2.1.1/ld/gnomad.genomes.r2.1.1.{ancestry}.common.adj.ld.variant_indices.ht" for ancestry in ancestries]

In [None]:
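# Likely cause of the failure: each .bm / .ht path is a Hail directory of many
# objects, not a single blob, so list and download everything under the prefix.
for prefix in blobs_bm + blobs_idx:
    for blob in client.list_blobs(bucket_name, prefix=prefix + "/"):
        if blob.name.endswith("/"):
            continue
        # Keep the directory layout below the .bm / .ht directory name.
        destination_file_name = write_path / Path(blob.name).relative_to(Path(prefix).parent)
        destination_file_name.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(destination_file_name))
        print(f"Downloaded {blob.name} to {destination_file_name}")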
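
Hail block matrices (`.bm`) and Hail tables (`.ht`) are stored as directories containing many objects, so a local copy has to preserve the layout beneath each directory name rather than flatten it. The sketch below shows one way to map an object name to a local path. The function name `local_path_for` and the example object name are illustrative assumptions, not taken from the bucket listing.

```python
from pathlib import Path, PurePosixPath


def local_path_for(blob_name, prefix, root):
    """Map an object name under `prefix` to a path under `root`.

    The last component of `prefix` (for example a `*.ld.bm` directory name)
    and everything below it are kept; the leading bucket folders are dropped.
    Raises ValueError if `blob_name` is not under the parent of `prefix`.
    """
    relative = PurePosixPath(blob_name).relative_to(PurePosixPath(prefix).parent)
    return Path(root) / Path(*relative.parts)
```

For instance, an object named `<prefix>/parts/part-0` (a made-up name) would map to `<root>/<last component of prefix>/parts/part-0`, which keeps each ancestry's block matrix in its own directory.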