# PLINK Preprocessing Pipeline

This notebook demonstrates how to preprocess PLINK binary filesets (`.bed`/`.bim`/`.fam`, "bfiles") for downstream analysis.
The pipeline includes the following steps:

1. **Quality Control (QC)**: Filters samples and variants based on criteria such as:
   - Minor Allele Frequency (MAF)
   - Hardy-Weinberg Equilibrium (HWE)
   - Missingness thresholds for individuals and variants
2. **Alignment**: Sets the major allele of each variant as the reference allele (PLINK 2 `--maj-ref force`).
3. **Allele Frequency Calculation**: Computes allele frequencies for the aligned dataset (PLINK 2 `--freq`).

The output files from this pipeline will be used as input for GRM computation and REML analysis in `basic_usage.ipynb`.
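
To make the variant-level QC criteria concrete, here is a minimal sketch, independent of PLINK, that applies a MAF and a missingness threshold to a toy genotype table. The encoding (0/1/2 alternate-allele counts, `None` for a missing call), the variant names, and the function names are assumptions for illustration only; PLINK additionally applies the HWE test and the per-sample missingness filter.

```python
def variant_missingness(genotypes):
    """Fraction of missing calls (None) for one variant."""
    return sum(g is None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passes_variant_qc(genotypes, maf=0.01, geno=0.1):
    """Keep a variant if its missingness is <= geno and its MAF is >= maf."""
    return (variant_missingness(genotypes) <= geno
            and minor_allele_frequency(genotypes) >= maf)

# Hypothetical toy data: one row of calls per variant, ten samples each
toy = {
    "rs_common": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],
    "rs_monomorphic": [0] * 10,
    "rs_sparse": [0, 1, None, None, None, 1, 0, 2, 1, 0],
}
kept = [name for name, calls in toy.items() if passes_variant_qc(calls)]
print(kept)
```

Here only `rs_common` survives: `rs_monomorphic` has a MAF of 0 and `rs_sparse` is missing in 3 of 10 samples.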

---
# Step 1: Set up the environment
---
Import the necessary modules and define the project root path.

In [6]:
import ipynbname
import os, sys

# Automatically detect the notebook's directory
notebook_path = ipynbname.path()
notebook_dir = os.path.dirname(notebook_path)
ROOT_PATH = os.path.abspath(os.path.join(notebook_dir, '..'))

print(f"Project root path: {ROOT_PATH}")

Project root path: /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA
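
`ipynbname` only works inside a running notebook kernel. If this code is ever executed as a plain script, or the package is not installed, a fallback is useful. The sketch below is one possible approach, assuming that the current working directory is an acceptable substitute for the notebook directory; `find_notebook_dir` is a hypothetical helper name.

```python
import os

def find_notebook_dir():
    """Return the notebook's directory, or the working directory as a fallback."""
    try:
        import ipynbname  # optional dependency, only usable inside a notebook
        return os.path.dirname(str(ipynbname.path()))
    except Exception:
        # ipynbname is missing, or we are not running inside a notebook kernel
        return os.getcwd()

ROOT_PATH = os.path.abspath(os.path.join(find_notebook_dir(), ".."))
print(f"Project root path: {ROOT_PATH}")
```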


---
# Step 2: Define helper functions for PLINK commands
---

In [7]:
import os

def build_plink_options(params):
    """Convert parameter dictionary to PLINK command options."""
    opts = []
    for k, v in params.items():
        if v is None:
            continue
        if isinstance(v, bool):
            if v:
                opts.append(f"--{k}")
        else:
            opts.append(f"--{k} {v}")
    return " \\\n    ".join(opts)  # pretty multi-line formatting

def do_qc(plink_path, input_prefix, output_prefix, **kwargs):
    """Perform Quality Control (QC) using PLINK."""
    defaults = {
        "keep": None,
        "maf": 0.01,
        "hwe": 1e-6,
        "mind": 0.1,
        "geno": 0.1,
        "make-bed": True,
        "rm-dup": "exclude-all"
    }
    defaults.update(kwargs)

    options_str = build_plink_options(defaults)
    qc_command = f"""{plink_path} \\
    --bfile {input_prefix} \\
    {options_str} \\
    --out {output_prefix}"""

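    print("Running QC command:\n", qc_command)
    # os.system returns a non-zero status when the command fails
    if os.system(qc_command) != 0:
        raise RuntimeError(f"PLINK QC step failed; check {output_prefix}.log")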

def do_afreq(plink_path, input_prefix, output_prefix):
    """Calculate allele frequencies using PLINK."""
    plink_command = f"""{plink_path} \\
    --bfile {input_prefix} \\
    --freq \\
    --out {output_prefix}"""
    
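    print("Running allele frequency command:\n", plink_command)
    # os.system returns a non-zero status when the command fails
    if os.system(plink_command) != 0:
        raise RuntimeError(f"PLINK allele frequency step failed; check {output_prefix}.log")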

def do_align(plink_path, input_prefix, output_prefix):
    """Align dataset to the major reference allele using PLINK."""
    plink_command = f"""{plink_path} \\
    --bfile {input_prefix} \\
    --maj-ref force \\
    --make-bed \\
    --out {output_prefix}"""
    
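    print("Running alignment command:\n", plink_command)
    # os.system returns a non-zero status when the command fails
    if os.system(plink_command) != 0:
        raise RuntimeError(f"PLINK alignment step failed; check {output_prefix}.log")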
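
The helpers above build one shell string and pass it to `os.system`. An alternative, sketched here under the assumption that an argument list is acceptable in place of the pretty-printed command, is `subprocess.run` with `check=True`. This avoids shell quoting problems with paths that contain spaces and raises `CalledProcessError` if PLINK exits with an error. The names `build_plink_args` and `run_plink`, and the `dry_run` switch, are hypothetical and not part of the notebook.

```python
import subprocess

def build_plink_args(params):
    """Convert a parameter dict to a flat argument list."""
    args = []
    for key, value in params.items():
        if value is None or value is False:
            continue  # option disabled
        args.append(f"--{key}")
        if value is True:
            continue  # bare flag such as --make-bed
        if isinstance(value, (list, tuple)):
            args.extend(str(v) for v in value)  # multi-token option value
        else:
            args.append(str(value))
    return args

def run_plink(plink_path, input_prefix, output_prefix, params, dry_run=False):
    """Assemble the PLINK command and run it, unless dry_run is set."""
    cmd = [plink_path, "--bfile", input_prefix,
           *build_plink_args(params), "--out", output_prefix]
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd

# Dry run with placeholder paths: only shows the command that would be executed
cmd = run_plink(
    "plink2", "in/test", "out/tmp.test",
    {"keep": None, "maf": 0.01, "make-bed": True, "rm-dup": "exclude-all"},
    dry_run=True,
)
print(" ".join(cmd))
```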

---
# Step 3: Define the preprocessing pipeline
---

In [8]:
def qc_pipeline(plink_path, input_path, output_path, fn_prefix, **kwargs):
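    """Run the full preprocessing pipeline: QC, major-allele alignment, allele frequencies."""
    # Make sure the output directory exists before PLINK writes to it
    os.makedirs(output_path, exist_ok=True)

    # Step 1: QC
    input_file = f"{input_path}/{fn_prefix}"
    output_file = f"{output_path}/tmp.{fn_prefix}"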
    do_qc(
        plink_path=plink_path,
        input_prefix=input_file,
        output_prefix=output_file,
        **kwargs
    )

    # Step 2: Align to major reference
    input_file = output_file
    output_file = f"{output_path}/{fn_prefix}.majref"
    do_align(
        plink_path=plink_path,
        input_prefix=input_file,
        output_prefix=output_file
    )

    # Step 3: Allele frequency file
    do_afreq(
        plink_path=plink_path,
        input_prefix=output_file,
        output_prefix=output_file
    )

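    # Step 4: Clean temporary files (only those created by this run's QC step)
    print("Cleaning up temporary files...")
    for name in os.listdir(output_path):
        if name.startswith(f"tmp.{fn_prefix}."):
            os.remove(os.path.join(output_path, name))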
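
After the pipeline runs, it can be worth confirming that the expected files were written for the `.majref` prefix. The sketch below assumes the usual PLINK 2 output extensions (`.bed`, `.bim`, `.fam` from `--make-bed` and `.afreq` from `--freq`); `missing_outputs` is a hypothetical helper, and the demonstration uses a throwaway directory rather than real pipeline output.

```python
import os
import tempfile

def missing_outputs(prefix, extensions=(".bed", ".bim", ".fam", ".afreq")):
    """Return the expected output files that do not exist for a given prefix."""
    return [prefix + ext for ext in extensions if not os.path.isfile(prefix + ext)]

# Demonstration: a temporary directory where only two of the four files exist
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_prefix = os.path.join(tmp_dir, "test.majref")
    for ext in (".bed", ".bim"):
        open(demo_prefix + ext, "w").close()
    missing = [os.path.basename(p) for p in missing_outputs(demo_prefix)]

print(missing)
```

In the notebook, the same check could be called as `missing_outputs(f"{OUTPUT_PATH}/{FN_PREFIX}.majref")` once the pipeline has finished; an empty list means all four files are present.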

---
# Step 4: Run the pipeline
---

In [9]:
PLINK_PATH = f"{ROOT_PATH}/plink2"
INPUT_PATH = f"{ROOT_PATH}/gcta64"
OUTPUT_PATH = f"{ROOT_PATH}/test/bfile"
FN_PREFIX = "test"

print("Starting the preprocessing pipeline...")
qc_pipeline(
    plink_path=PLINK_PATH,
    input_path=INPUT_PATH,
    output_path=OUTPUT_PATH,
    fn_prefix=FN_PREFIX
)
print("Preprocessing pipeline completed.")

Starting the preprocessing pipeline...
Running QC command:
 /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/plink2 \
    --bfile /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/gcta64/test \
    --maf 0.01 \
    --hwe 1e-06 \
    --mind 0.1 \
    --geno 0.1 \
    --make-bed \
    --rm-dup exclude-all \
    --out /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/test/bfile/tmp.test
PLINK v2.0.0-a.7 M1 (1 Sep 2025)                   cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/test/bfile/tmp.test.log.
Options in effect:
  --bfile /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/gcta64/test
  --geno 0.1
  --hwe 1e-06
  --maf 0.01
  --make-bed
  --mind 0.1
  --out /Users/jerry/Documents/JerryProject/1.Project/Factor/Analysis/src/pyGCTA/test/bfile/tmp.
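
Once the pipeline has finished, one way to sanity-check the `--maj-ref force` step is to confirm that no variant has an ALT allele frequency above 0.5 in the resulting frequency file. The sketch below assumes a tab-delimited table with `ID` and `ALT_FREQS` columns, as in PLINK 2's default `.afreq` layout; the exact column set can vary by PLINK version and `--freq` modifiers. The sample rows are made up for illustration and are not output from this notebook.

```python
import csv
import io

def read_alt_freqs(handle):
    """Map variant ID -> ALT allele frequency from a tab-delimited .afreq-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ID"]: float(row["ALT_FREQS"]) for row in reader}

# Hypothetical sample content; a real file would be opened with open(path, newline="")
sample = io.StringIO(
    "#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "1\trs1\tA\tG\t0.12\t200\n"
    "1\trs2\tC\tT\t0.50\t198\n"
)
freqs = read_alt_freqs(sample)

# After major-allele alignment, ALT should be the minor allele everywhere
not_major_ref = [vid for vid, freq in freqs.items() if freq > 0.5]
print(not_major_ref)
```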

# Summary

In this notebook, we:
1. Performed Quality Control (QC) on PLINK bfiles.
2. Set the major allele of each variant as the reference allele.
3. Calculated allele frequencies for the dataset.

The preprocessed files are now ready for GRM computation and REML analysis in `basic_usage.ipynb`.