## **SLE GWAS: Generate WES Variant QC JSON Input Configuration**
This notebook generates a JSON configuration file for whole exome sequencing (WES) variant quality control using the `bgens_qc.wdl` workflow.

**Purpose:**
- Discover all WES BGEN/SAMPLE files for chromosomes 1-22, X, and Y (24 file pairs expected)
- Link to the sample-QCed phenotype file from the previous step
- Generate a structured JSON input file for the WDL workflow

**Analysis Environment:**
- Platform: UK Biobank Research Analysis Platform (RAP)
- Instance: Single Node, mem1_hdd1_v2_x16

**Adapted from:**
DNAnexus UKB_RAP repository ([GitHub](https://github.com/dnanexus/UKB_RAP/blob/main/end_to_end_gwas_phewas/bgens_qc/generate_inputs.ipynb))

In [None]:
import glob
import json
import subprocess
import os

### **Step 1: Configure Analysis Parameters**

In [None]:
# Output file naming prefix for QC results
output_file_prefix = "final_WES_snps_GRCh38_qc_pass"

#### **PLINK2 Quality Control Filters**

The following thresholds are applied to filter low-quality variants and samples:

| Filter | Threshold | Purpose |
|--------|-----------|----------|
| `--mac 10` | Minimum Allele Count ≥ 10 | Removes extremely rare variants that may be sequencing errors |
| `--maf 0.0001` | Minor Allele Frequency ≥ 0.01% | Excludes ultra-rare variants with insufficient power for association testing |
| `--hwe 1e-15` | Hardy-Weinberg Equilibrium p-value ≥ 1e-15 | Filters variants with extreme deviation from HWE, indicating potential genotyping errors |
| `--mind 0.1` | Sample missingness ≤ 10% | Removes samples with excessive missing genotype calls |
| `--geno 0.1` | Variant missingness ≤ 10% | Excludes variants that failed to genotype in >10% of samples |

In [None]:
plink_options = "--mac 10 --maf 0.0001 --hwe 1e-15 --mind 0.1 --geno 0.1"
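
Because `--mac` and `--maf` both set a floor on how rare a variant may be, it can help to see which of the two is binding for a given cohort size. The following is a minimal sketch, assuming diploid genotypes with no missing calls; `effective_min_mac` and the sample sizes are hypothetical and not part of the workflow.

```python
from fractions import Fraction
from math import ceil


def effective_min_mac(n_samples: int, mac: int, maf: str) -> int:
    """Return the smallest minor allele count that passes both --mac and --maf.

    Assumes diploid genotypes and no missing calls, so the cohort carries
    2 * n_samples alleles. `maf` is passed as a string so the threshold is
    represented exactly (avoids floating-point rounding at the boundary).
    """
    maf_floor = ceil(Fraction(maf) * 2 * n_samples)
    return max(mac, maf_floor)


# Hypothetical cohort sizes, for illustration only
for n in (10_000, 50_000, 200_000):
    print(n, effective_min_mac(n, mac=10, maf="0.0001"))
```

Under these assumptions the two filters coincide at 50,000 samples; for larger cohorts `--maf 0.0001` is the stricter of the two.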

In [None]:
# Path to UK Biobank WES data (BGEN format, GRCh38)
path_to_data = '/Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/'

In [None]:
# Path to sample-QCed phenotype file from previous step
phenotype_folder = '/02.Phenotype_SampleQC/'
phenotype_file = 'sle_pqc_gwas.phe'

### **Step 2: Define WDL Workflow Input Schema**

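The `bgens_qc.wdl` workflow accepts the following inputs (the schema below also lists `extract_files`, which is not used in this analysis and is removed in Step 6):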
- **geno_bgen_files**: Array of BGEN files (one per chromosome)
- **geno_sample_files**: Array of SAMPLE files (one per chromosome)
- **keep_file**: Phenotype file containing sample IDs to retain
- **output_prefix**: Naming prefix for output files
- **plink2_options**: QC filter parameters
- **ref_first**: Whether to treat the first allele as reference (default: true)

In [None]:
inputs = {
    "bgens_qc.extract_files": "Array[File]",
    "bgens_qc.ref_first": "Boolean (optional, default = true)",
    "bgens_qc.keep_file": "File? (optional)",
    "bgens_qc.output_prefix": "String",
    "bgens_qc.plink2_options": "String (optional, default = \"\")",
    "bgens_qc.geno_sample_files": "Array[File]+",
    "bgens_qc.geno_bgen_files": "Array[File]+"
}

### **Step 3: Discover BGEN Files**

Use the DNAnexus CLI (`dx find data`) to find the file IDs of all `.bgen` files in the WES data directory. Expected: 24 files (chr1-22, chrX, and chrY).

In [None]:
# Find all .bgen files under the WES data path; --brief returns file IDs only
cmd = ['dx', 'find', 'data', '--name', '*.bgen', '--path', path_to_data, '--brief']
# Run the command, decode each output line, and prefix each file ID with dx://
bgens = [f'dx://{item.decode("utf-8")}' for item in subprocess.check_output(cmd).splitlines()]
bgens  # Display the list of BGEN file URIs

In [None]:
print(f"Found {len(bgens)} BGEN files")
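
The same `dx find data` pattern recurs in the later discovery steps, so it could be wrapped in a single helper. The following is a minimal sketch and not part of the original notebook: `find_dx_uris` and its `runner` parameter are hypothetical names. `runner` defaults to `subprocess.check_output` and can be swapped for a stand-in, so the function can be exercised without the `dx` CLI.

```python
import subprocess
from typing import Callable, List, Sequence


def find_dx_uris(
    name_pattern: str,
    folder: str,
    runner: Callable[[Sequence[str]], bytes] = subprocess.check_output,
) -> List[str]:
    """Return dx:// URIs for data objects matching name_pattern under folder."""
    cmd = ["dx", "find", "data", "--name", name_pattern, "--path", folder, "--brief"]
    raw = runner(cmd)
    # One ID per output line; blank lines are skipped
    return [
        f"dx://{line.decode('utf-8').strip()}"
        for line in raw.splitlines()
        if line.strip()
    ]


# Offline demonstration with a stand-in runner and made-up IDs
def fake_runner(cmd):
    return b"project-AAAA:file-0001\nproject-AAAA:file-0002\n"


print(find_dx_uris("*.bgen", "/some/folder/", runner=fake_runner))
```

On the platform, the call would reduce to something like `find_dx_uris('*.bgen', path_to_data)`.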

### **Step 4: Discover SAMPLE Files**

SAMPLE files contain sample IDs and must correspond 1:1 with BGEN files.

In [None]:
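# Find all .sample files under the same WES data path (file IDs only)
cmd = ['dx', 'find', 'data', '--name', '*.sample', '--path', path_to_data, '--brief']
samples = [f'dx://{item.decode("utf-8")}' for item in subprocess.check_output(cmd).splitlines()]
samples  # Display the list of SAMPLE file URIs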

In [None]:
print(f"Found {len(samples)} SAMPLE files")
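
The cells above count the two lists but do not check that they are in the same chromosome order, which matters if the workflow pairs BGEN and SAMPLE files by position. The following is a minimal sketch of a name-based pairing check. It assumes each record carries a file name as well as an ID (the `--brief` output used above has IDs only, so names would have to be fetched separately) and that names contain a `_c<chrom>_` token. The record layout, the example names, and both function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]
CHROM_TOKEN = re.compile(r"_c(\d{1,2}|X|Y)_")  # assumed naming convention


def index_by_chromosome(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Map chromosome label -> file ID, rejecting unparsable or duplicate names."""
    by_chrom: Dict[str, str] = {}
    for rec in records:
        match = CHROM_TOKEN.search(rec["name"])
        if match is None or match.group(1) not in CHROM_ORDER:
            raise ValueError(f"No valid chromosome token in {rec['name']!r}")
        chrom = match.group(1)
        if chrom in by_chrom:
            raise ValueError(f"Duplicate file for chromosome {chrom}")
        by_chrom[chrom] = rec["id"]
    return by_chrom


def pair_by_chromosome(bgen_records, sample_records) -> Tuple[List[str], List[str]]:
    """Return BGEN and SAMPLE URIs aligned in chromosome order."""
    bgen_map = index_by_chromosome(bgen_records)
    sample_map = index_by_chromosome(sample_records)
    if bgen_map.keys() != sample_map.keys():
        raise ValueError("BGEN and SAMPLE files cover different chromosomes")
    chroms = [c for c in CHROM_ORDER if c in bgen_map]
    return (
        [f"dx://{bgen_map[c]}" for c in chroms],
        [f"dx://{sample_map[c]}" for c in chroms],
    )


# Made-up records, deliberately listed in different orders
bgen_demo = [
    {"name": "demo_cX_b0_v1.bgen", "id": "file-X"},
    {"name": "demo_c2_b0_v1.bgen", "id": "file-2"},
]
sample_demo = [
    {"name": "demo_c2_b0_v1.sample", "id": "file-2s"},
    {"name": "demo_cX_b0_v1.sample", "id": "file-Xs"},
]
print(pair_by_chromosome(bgen_demo, sample_demo))
```

Sorting on the parsed chromosome label rather than on the raw name avoids the lexical ordering in which `c10` would precede `c2`.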

### **Step 5: Locate Phenotype File**

The phenotype file from the previous pipeline step (`/02.Phenotype_SampleQC/`) contains the samples that passed sample QC. This file is used to:
1. Restrict the WES genotype data to QC-passed samples
2. Ensure consistency between phenotype and genotype data

In [None]:
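# Look up the phenotype file by exact name in the phenotype folder
cmd = ['dx', 'find', 'data', '--name', phenotype_file, '--path', phenotype_folder, '--brief']
# Take the first match; this raises IndexError if no file is found
pheno_file = [f'dx://{item.decode("utf-8")}' for item in subprocess.check_output(cmd).splitlines()][0]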
pheno_file
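
Indexing with `[0]` raises a bare `IndexError` when the search finds nothing, and silently takes the first hit when several copies of the file exist. The following is a minimal sketch of a stricter guard; `require_single` is a hypothetical helper and not part of the original notebook.

```python
from typing import List


def require_single(matches: List[str], description: str) -> str:
    """Return the only element of matches, or raise with a readable message."""
    if not matches:
        raise FileNotFoundError(f"No match found for {description}")
    if len(matches) > 1:
        raise ValueError(
            f"Expected one match for {description}, found {len(matches)}: {matches}"
        )
    return matches[0]


# Illustration with a made-up ID
print(require_single(["dx://project-AAAA:file-0001"], "phenotype file"))
```

Failing loudly on duplicates is a design choice here: which copy of a phenotype file is the intended one is better decided by the analyst than by listing order.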

### **Step 6: Assemble Inputs**
Populate the `inputs` dictionary with the discovered file IDs and analysis parameters.

In [None]:
# Remove extract_files parameter (not needed for this analysis)
del inputs["bgens_qc.extract_files"]

# Populate with discovered files and parameters
inputs["bgens_qc.ref_first"] = True
inputs["bgens_qc.keep_file"] = pheno_file
inputs["bgens_qc.output_prefix"] = output_file_prefix
inputs["bgens_qc.plink2_options"] = plink_options
inputs["bgens_qc.geno_sample_files"] = samples
inputs["bgens_qc.geno_bgen_files"] = bgens
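
Because the `inputs` dictionary starts out holding type descriptions as placeholder strings, a forgotten assignment would end up in the JSON as literal text such as `"String"`. The following is a minimal sketch of a check for leftover placeholders; `unfilled_keys` and the demo values are hypothetical.

```python
from typing import Any, Dict, List


def unfilled_keys(schema: Dict[str, str], filled: Dict[str, Any]) -> List[str]:
    """Return keys whose value in `filled` is still the schema's placeholder string."""
    return [
        key
        for key, value in filled.items()
        if key in schema and value == schema[key]
    ]


schema_demo = {
    "bgens_qc.output_prefix": "String",
    "bgens_qc.geno_bgen_files": "Array[File]+",
}
filled_demo = dict(schema_demo)
filled_demo["bgens_qc.output_prefix"] = "demo_prefix"  # only one key populated

print(unfilled_keys(schema_demo, filled_demo))
```

Using this in the notebook would require keeping a copy of the original schema (for example `schema = dict(inputs)`) before the assignments in this step.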

In [None]:
inputs

### **Step 7: Save Configuration as JSON**
The JSON file is first saved in the current JupyterLab workspace and then uploaded to the project directory `/03.Variant_QC/`, where it must be present before the WDL workflow is run.

In [None]:
with open('bgens_qc_input.json', 'w') as f:
    json.dump(inputs, f, indent=2)
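
Before uploading, the file can be read back to confirm that it parses and that the two file arrays have equal length. The following is a minimal sketch, assuming the key names used above; `check_config` is hypothetical, and the demo writes to a temporary directory rather than touching `bgens_qc_input.json`.

```python
import json
import os
import tempfile


def check_config(path: str) -> int:
    """Reload a JSON config and return the number of BGEN/SAMPLE pairs."""
    with open(path) as handle:
        config = json.load(handle)
    bgen_files = config["bgens_qc.geno_bgen_files"]
    sample_files = config["bgens_qc.geno_sample_files"]
    if len(bgen_files) != len(sample_files):
        raise ValueError(
            f"{len(bgen_files)} BGEN files but {len(sample_files)} SAMPLE files"
        )
    if not bgen_files:
        raise ValueError("No BGEN files listed")
    return len(bgen_files)


# Demo config with made-up IDs, written to a throwaway location
demo = {
    "bgens_qc.geno_bgen_files": ["dx://file-a"],
    "bgens_qc.geno_sample_files": ["dx://file-b"],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_input.json")
    with open(demo_path, "w") as handle:
        json.dump(demo, handle, indent=2)
    n_pairs = check_config(demo_path)
print(n_pairs)
```

In the notebook, `check_config('bgens_qc_input.json')` would be expected to return 24 if all chromosomes were discovered.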

In [None]:
%%bash
dx upload bgens_qc_input.json --path /03.Variant_QC/

### **Next Steps**

After generating the JSON configuration:
1. Confirm the upload: `bgens_qc_input.json` should now be in `/03.Variant_QC/` (uploaded by the `dx upload` cell above)
2. Run the WES variant QC workflow: `bash run_wes_qc.sh`
3. Count variants passing QC: `bash count_variants.sh final_WES_snps_GRCh38_qc_pass.snplist`

**Expected outputs:**
- Step 1: `bgens_qc_input.json` (WDL workflow configuration file)
- Step 2: `final_WES_snps_GRCh38_qc_pass.snplist` containing ~800K-900K WES variants passing QC filters