## **WGS Variant QC: Generate JSON Input Configuration**

Generates `bgens_qc_input.json` for the `bgens_qc.wdl` workflow. The notebook discovers all WGS BGEN and SAMPLE files (chr1–22, X, Y) and links the sample-QCed phenotype file.

**Environment:** UK Biobank RAP, single node (mem1_hdd1_v2_x16)

**Adapted from:** [DNAnexus UKB_RAP](https://github.com/dnanexus/UKB_RAP/blob/main/end_to_end_gwas_phewas/bgens_qc/generate_inputs.ipynb)

In [None]:
import glob
import json
import subprocess
import os

### **Step 1: Configure Analysis Parameters**

In [1]:
# Output file naming prefix for QC results
output_file_prefix = "final_WGS_snps_GRCh38_qc_pass"

#### **PLINK2 Quality Control Filters**

The following thresholds are applied to filter low-quality variants and samples:

| Filter | Threshold | Purpose |
|--------|-----------|----------|
| `--mac 10` | Minor allele count ≥ 10 | Removes extremely rare variants that may be sequencing errors |
| `--maf 0.01` | Minor allele frequency ≥ 1% | Excludes rare variants (MAF < 1%) with insufficient power for single-variant association testing |
| `--hwe 5e-8` | Hardy-Weinberg Equilibrium p-value ≥ 5e-8 | Filters variants with extreme deviation from HWE, indicating potential genotyping errors |
| `--mind 0.1` | Sample missingness ≤ 10% | Removes samples with excessive missing genotype calls |
| `--geno 0.05` | Variant missingness ≤ 5% | Excludes variants that failed to genotype in >5% of samples |

In [None]:
plink_options = "--mac 10 --maf 0.01 --hwe 5e-8 --mind 0.1 --geno 0.05"
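
The option string above has to be kept in sync with the filter table by hand. One way to reduce drift, sketched here under the assumption that every filter takes exactly one value, is to hold the thresholds in a dictionary and derive the string from it. The names `qc_filters` and `build_plink_options` are illustrative and not part of the original notebook.

```python
# Hypothetical helper: keep the thresholds in one place and derive the option string.
qc_filters = {
    "mac": "10",
    "maf": "0.01",
    "hwe": "5e-8",
    "mind": "0.1",
    "geno": "0.05",
}


def build_plink_options(filters):
    """Join {flag: value} pairs into a PLINK2-style option string."""
    return " ".join(f"--{flag} {value}" for flag, value in filters.items())


plink_options_check = build_plink_options(qc_filters)
print(plink_options_check)
```

Comparing `plink_options_check` with `plink_options` is a quick way to confirm that both describe the same filters.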

In [None]:
# Path to UK Biobank WGS data (BGEN format, GRCh38)
path_to_data = '/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, BGEN format [500k release]/'

In [None]:
# Path to sample-QCed phenotype file from Step 01
phenotype_folder = '/02.Phenotype_SampleQC/'
phenotype_file = 'sle_pqc.phe'

### **Step 2: Define WDL Workflow Input Schema**

The `bgens_qc.wdl` workflow accepts the following inputs (some are optional, as the schema below shows):
- **geno_bgen_files**: Array of BGEN files (one per chromosome)
- **geno_sample_files**: Array of SAMPLE files (one per chromosome)
- **keep_file**: Phenotype file containing sample IDs to retain
- **output_prefix**: Naming prefix for output files
- **plink2_options**: QC filter parameters
- **ref_first**: Whether to treat the first allele as reference (default: true)

In [None]:
inputs = {
    "bgens_qc.extract_files": "Array[File]",
    "bgens_qc.ref_first": "Boolean (optional, default = true)",
    "bgens_qc.keep_file": "File? (optional)",
    "bgens_qc.output_prefix": "String",
    "bgens_qc.plink2_options": "String (optional, default = \"\")",
    "bgens_qc.geno_sample_files": "Array[File]+",
    "bgens_qc.geno_bgen_files": "Array[File]+"
}

### **Step 3: Discover BGEN Files**

Use the DNAnexus CLI (`dx find data`) to list the IDs of all `.bgen` files in the WGS data directory. Expected: 24 files (chr1–22, chrX, chrY).

In [None]:
# Find all .bgen files under path_to_data; --brief returns file IDs only
cmd = ['dx', 'find', 'data', '--name', '*.bgen', '--path', path_to_data, '--brief']
# Run the command, decode each ID from bytes, and prefix it with dx://
bgens = [f'dx://{item.decode("utf-8")}' for item in subprocess.check_output(cmd).splitlines()]
bgens  # Display the list of BGEN file links

In [None]:
print(f"Found {len(bgens)} BGEN files")
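
A count of 24 does not by itself show that every chromosome is present. The sketch below checks coverage from file names. It assumes the names encode the chromosome as `_c<label>_`, which is a guess about the naming scheme and should be adjusted to the real files. Because `--brief` returns only IDs, the names would have to be obtained separately (for example with `dx describe`). `missing_chromosomes` and the example names are hypothetical.

```python
import re

# Assumption: file names carry the chromosome as "_c<label>_", e.g. "example_c1_b0_v1.bgen".
CHROM_PATTERN = re.compile(r"_c(\d+|X|Y)_")
EXPECTED_CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def missing_chromosomes(file_names):
    """Return the expected chromosome labels that no file name matches."""
    found = set()
    for name in file_names:
        match = CHROM_PATTERN.search(name)
        if match:
            found.add(match.group(1))
    return [c for c in EXPECTED_CHROMS if c not in found]


# Made-up names for illustration only; chr22 and chrY are deliberately left out.
example_names = [
    f"example_c{c}_b0_v1.bgen" for c in EXPECTED_CHROMS if c not in ("22", "Y")
]
print(missing_chromosomes(example_names))  # ['22', 'Y']
```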

### **Step 4: Discover SAMPLE Files**

SAMPLE files contain sample IDs and must correspond 1:1 with BGEN files.

In [None]:
cmd = ['dx', 'find', 'data', '--name', '*.sample', '--path', path_to_data, '--brief']
samples = [f'dx://{item.decode("utf-8")}' for item in subprocess.check_output(cmd).splitlines()]
samples

In [None]:
print(f"Found {len(samples)} SAMPLE files")
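
The BGEN and SAMPLE lists come from two separate `dx find data` calls, and nothing in this notebook guarantees that they come back in the same order. If the workflow pairs the two arrays by position (an assumption worth checking against `bgens_qc.wdl`), sorting both lists by a shared file-name stem is a cheap safeguard. The sketch below works on `(name, file_id)` records. How those records are obtained is left open, and `pair_by_stem` and all example values are hypothetical.

```python
import os


def pair_by_stem(bgen_records, sample_records):
    """Pair (name, file_id) records that share a file-name stem.

    Returns two lists of dx:// links in matching order, sorted by stem.
    Raises ValueError if any BGEN or SAMPLE file lacks a partner.
    """
    bgen_by_stem = {os.path.splitext(name)[0]: fid for name, fid in bgen_records}
    sample_by_stem = {os.path.splitext(name)[0]: fid for name, fid in sample_records}
    unmatched = set(bgen_by_stem) ^ set(sample_by_stem)
    if unmatched:
        raise ValueError(f"Unpaired files: {sorted(unmatched)}")
    stems = sorted(bgen_by_stem)
    return (
        [f"dx://{bgen_by_stem[s]}" for s in stems],
        [f"dx://{sample_by_stem[s]}" for s in stems],
    )


# Made-up names and IDs, listed in different orders on purpose.
example_bgens = [("example_c2.bgen", "file-B2"), ("example_c1.bgen", "file-B1")]
example_samples = [("example_c1.sample", "file-S1"), ("example_c2.sample", "file-S2")]
paired_bgens, paired_samples = pair_by_stem(example_bgens, example_samples)
print(paired_bgens)    # ['dx://file-B1', 'dx://file-B2']
print(paired_samples)  # ['dx://file-S1', 'dx://file-S2']
```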

### **Step 5: Locate Phenotype File**

The phenotype file is passed as `keep_file`, which restricts variant QC to the samples that passed sample QC in Step 01.

In [None]:
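cmd = ['dx', 'find', 'data', '--name', phenotype_file, '--path', phenotype_folder, '--brief']
matches = subprocess.check_output(cmd).decode("utf-8").split()
if not matches:  # Fail with a clear message instead of an IndexError
    raise FileNotFoundError(f"{phenotype_file} not found in {phenotype_folder}")
pheno_file = f'dx://{matches[0]}'
pheno_file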

### **Step 6: Assemble Inputs**
Populate the input dictionary with the discovered file IDs and analysis parameters.

In [None]:
# Remove extract_files parameter (not needed for this analysis);
# pop() with a default keeps this cell safe to re-run
inputs.pop("bgens_qc.extract_files", None)

# Populate with discovered files and parameters
inputs["bgens_qc.ref_first"] = False  # WGS data is ref-last
inputs["bgens_qc.keep_file"] = pheno_file
inputs["bgens_qc.output_prefix"] = output_file_prefix
inputs["bgens_qc.plink2_options"] = plink_options
inputs["bgens_qc.geno_sample_files"] = samples
inputs["bgens_qc.geno_bgen_files"] = bgens

In [None]:
inputs
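
Because the schema dictionary from Step 2 is overwritten in place, a key that was never assigned would still hold its type placeholder (such as `"Array[File]+"`) and be written to the JSON as if it were a value. The sketch below flags such keys. It assumes a copy of the Step 2 dictionary was kept before it was filled in. `unfilled_inputs` and the example dictionaries are hypothetical.

```python
def unfilled_inputs(inputs, template):
    """Return the keys whose value is still the template's type placeholder."""
    return [key for key, value in inputs.items() if template.get(key) == value]


# Made-up example: output_prefix has been set, geno_bgen_files has not.
example_template = {
    "bgens_qc.output_prefix": "String",
    "bgens_qc.geno_bgen_files": "Array[File]+",
}
example_inputs = dict(example_template)
example_inputs["bgens_qc.output_prefix"] = "demo_prefix"
print(unfilled_inputs(example_inputs, example_template))  # ['bgens_qc.geno_bgen_files']
```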

### **Step 7: Save JSON and Upload to RAP**

In [None]:
with open('bgens_qc_input.json', 'w') as f:
    json.dump(inputs, f, indent=2)
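
Before uploading, it can be worth reading the file back to confirm that it parses and matches what was written. The following is a minimal sketch that uses a temporary directory and made-up inputs so that nothing real is overwritten. `roundtrip_json` is a hypothetical helper, not part of the original notebook.

```python
import json
import os
import tempfile


def roundtrip_json(obj, path):
    """Write obj as JSON to path, read it back, and return the loaded copy."""
    with open(path, "w") as f:
        json.dump(obj, f, indent=2)
    with open(path) as f:
        return json.load(f)


# Made-up inputs for illustration only.
example_inputs = {
    "bgens_qc.ref_first": False,
    "bgens_qc.geno_bgen_files": ["dx://file-B1"],
}
with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip_json(example_inputs, os.path.join(tmp, "example_input.json"))
print(loaded == example_inputs)  # True
```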

In [None]:
%%bash
dx upload bgens_qc_input.json --path /03.Variant_QC/

### **Next Steps**

1. Confirm that `bgens_qc_input.json` is present in `/03.Variant_QC/` on RAP (uploaded in Step 7)
2. Run WGS variant QC workflow: `bash run_wgs_qc.sh`
3. Count variants passing QC: `bash count_variants.sh results/final_WGS_snps_GRCh38_qc_pass.snplist`

**Expected output:** `results/final_WGS_snps_GRCh38_qc_pass.snplist` 