# Bioinformatics workflow exercise: SoS and linear mixed model

Author: Haoyue Shuai, Nov 17, 2020

This tutorial introduces a workflow language, Script of Scripts (SoS), for bioinformatics analysis pipelines, with an example implementation of various linear mixed model methods for genetic association studies.

This is an SoS Notebook: SoS kernel cells contain workflow steps written in SoS, and bash kernel cells run those steps. Run the bash commands directly in this notebook so that their output is saved with it.

## Jupyter Lab setup

Download this notebook and launch it with [JupyterLab](https://jupyter.org/). You can follow [these suggested setup instructions](http://statgen.us/lab-wiki/orientation/jupyter-setup.html).

First make sure that all the required kernels are available. They should be present once the software has been installed as instructed:

In [1]:
jupyter kernelspec list

Available kernels:
  ir              /Users/hyeonjukim/Library/Jupyter/kernels/ir
  julia-1.0       /Users/hyeonjukim/Library/Jupyter/kernels/julia-1.0
  julia-1.5       /Users/hyeonjukim/Library/Jupyter/kernels/julia-1.5
  calysto_bash    /Users/hyeonjukim/opt/miniconda3/share/jupyter/kernels/calysto_bash
  markdown        /Users/hyeonjukim/opt/miniconda3/share/jupyter/kernels/markdown
  python3         /Users/hyeonjukim/opt/miniconda3/share/jupyter/kernels/python3
  sos             /Users/hyeonjukim/opt/miniconda3/share/jupyter/kernels/sos



In [2]:
[global]
# parameter 1
parameter: n = 1.0
# parameter 2
parameter: beta = [1.0,2.0,3.0]

In [3]:
# Print the value of n with bash
[print_n]
bash: expand = '${ }'
    echo ${n}

In [4]:
sos run orientation-hkim.ipynb print_n

INFO: Running print_n: Print the value of n with bash
1.0
INFO: print_n is completed.
INFO: Workflow print_n (ID=w4bcbb8958466f710) is executed successfully with 1 completed step.



In [5]:
sos run orientation-hkim.ipynb print_n --n 666

INFO: Running print_n: Print the value of n with bash
666.0
INFO: print_n is completed.
INFO: Workflow print_n (ID=we094e7d433abb2ad) is executed successfully with 1 completed step.
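
The run above prints `666.0` rather than `666`: the value given on the command line is converted to the type of the default declared in `[global]` (`n = 1.0`, a float). The sketch below is only a rough Python analogy built on `argparse`; it is not how SoS implements `parameter:`, and `parse_workflow_options` is a hypothetical helper name.

```python
import argparse


def parse_workflow_options(argv):
    """Hypothetical helper: typed options whose defaults mirror the [global] cell."""
    parser = argparse.ArgumentParser()
    # The float default means "--n 666" is stored as 666.0.
    parser.add_argument("--n", type=float, default=1.0)
    # A list default accepts one or more space-separated values.
    parser.add_argument("--beta", type=float, nargs="+", default=[1.0, 2.0, 3.0])
    return parser.parse_args(argv)


opts = parse_workflow_options(["--n", "666"])
print(opts.n)     # 666.0
print(opts.beta)  # [1.0, 2.0, 3.0]
```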



In [6]:
[print_beta]
bash: expand = '${ }'
    echo ${beta}

In [7]:
sos run orientation-hkim.ipynb print_beta

INFO: Running print_beta: 
[1.0, 2.0, 3.0]
INFO: print_beta is completed.
INFO: Workflow print_beta (ID=w78fa93e094c77376) is executed successfully with 1 completed step.
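
The `expand = '${ }'` option tells SoS to replace every `${...}` in the script with the value of the enclosed expression before the script is handed to bash, Python or R. The following is a simplified illustration of that substitution, assuming plain `str()` formatting; the function `expand` and the dictionary `params` are names made up for this sketch, and SoS's own interpolation is more capable.

```python
import re


def expand(script, variables):
    """Replace each ${expr} with str() of expr evaluated against `variables`."""
    def substitute(match):
        # Evaluate the expression between the delimiters as Python code.
        return str(eval(match.group(1), {}, variables))
    return re.sub(r"\$\{\s*(.*?)\s*\}", substitute, script)


params = {"n": 1.0, "beta": [1.0, 2.0, 3.0]}
print(expand("echo ${beta}", params))      # echo [1.0, 2.0, 3.0]
print(expand("print(exp(${n}))", params))  # print(exp(1.0))
```

This is why the bash step prints the Python list representation `[1.0, 2.0, 3.0]`: bash only ever sees the already-substituted text.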



In [8]:
# Print log(beta) with Python
[log_beta]
python: expand = '${ }'
    import numpy as np
    print(np.log(${beta}))

In [9]:
sos run orientation-hkim.ipynb log_beta

INFO: Running log_beta: Print log(beta) with Python
[0.         0.69314718 1.09861229]
INFO: log_beta is completed.
INFO: Workflow log_beta (ID=w077f5b194f70ad1b) is executed successfully with 1 completed step.
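
Once `${beta}` has been substituted, the `log_beta` step runs ordinary Python. As a quick cross-check of the values above, here is a standalone version that uses the standard `math` module instead of NumPy (a substitution made only to keep the sketch dependency-free) and rounds to eight decimals to match the printed output.

```python
import math

beta = [1.0, 2.0, 3.0]  # default value of the `beta` parameter

# Natural logarithm of each element, rounded as in the notebook output.
log_beta = [round(math.log(b), 8) for b in beta]
print(log_beta)  # [0.0, 0.69314718, 1.09861229]
```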



In [10]:
# Print exp(n) with R
[exp_n]
R: expand = '${ }'
    print(exp(${n}))

In [11]:
sos run orientation-hkim.ipynb exp_n

INFO: Running exp_n: Print exp(n) with R
[1] 2.718282
INFO: exp_n is completed.
INFO: Workflow exp_n (ID=w68dec66676a24f9f) is executed successfully with 1 completed step.



In [12]:
sos run orientation-hkim.ipynb -h

usage: sos run orientation-hkim.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  print_n
  print_beta
  log_beta
  exp_n

Global Workflow Options:
  --n 1.0 (as float)
                        parameter 1
  --beta 1.0 2.0 3.0 (as list)
                        parameter 2

Sections
  print_n:              Print the value of n with bash
  print_beta:
  log_beta:             Print log(beta) with Python
  exp_n:                Print exp(n) with R



In [13]:
sos run ../workflow/LMM.ipynb -h

usage: sos run ../workflow/LMM.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  boltlmm
  gcta
  fastGWA
  regenie
  SAIGE

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --sampleFile VAL (as path, required)
                        Path to sample file
  --bfile VAL (as path, required)
                        Genotype files in plink binary this is used for
                        computing the GRM
  --genoFile  paths

                        Path to bgen or bed files
  --phenoFile VAL (as path, required)
                        Phenotype file for quantitative trait

In [14]:
cd ../data/LMM_MWE
ls

LDSCORE.1000G_EUR.tab.gz		imputed_genotypes.sample
boltlmm_template.yml			imputed_genotypes_chr21.bgen
fastGWA_template.yml			imputed_genotypes_chr21.bgen.bgi
genetic_map_hg19_withX.txt.gz		imputed_genotypes_chr22.bgen
genotype_inventory.txt			imputed_genotypes_chr22.bgen.bgi
genotypes.bed				output
genotypes.bim				phenotypes.txt
genotypes.fam				regenie_template.yml
genotypes21_22.bed			regions.txt
genotypes21_22.bim			unrelated_samples.txt
genotypes21_22.fam



In [2]:
cd ~/GIT/orientation/notebook/




In [4]:
sos run ../workflow/LMM.ipynb fastGWA \
    --cwd ../data/output-fastGWA \
    --bfile ../data/LMM_MWE/genotypes.bed \
    --sampleFile ../data/LMM_MWE/imputed_genotypes.sample \
    --genoFile ../data/LMM_MWE/imputed_genotypes_chr*.bgen \
    --phenoFile ../data/LMM_MWE/phenotypes.txt \
    --formatFile ../data/LMM_MWE/fastGWA_template.yml \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 1 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --parts 2 \
    --p-filter 1

INFO: Running fastGWA_1: fastGWA mixed model (based on the sparse GRM generated above)
INFO: fastGWA_1 (index=0) is ignored due to saved signature
INFO: fastGWA_1 (index=1) is ignored due to saved signature
INFO: fastGWA_1 output:   ../data/output-fastGWA/cache/imputed_genotypes_chr21.phenotypes.fastGWA.gz ../data/output-fastGWA/cache/imputed_genotypes_chr22.phenotypes.fastGWA.gz in 2 groups
INFO: Running fastGWA_2: Merge results and log files
INFO: fastGWA_2 (index=0) is ignored due to saved signature
INFO: fastGWA_2 output:   ../data/output-fastGWA/phenotypes_BMI.fastGWA.snp_stats.gz ../data/output-fastGWA/phenotypes_BMI.fastGWA.snp_counts.txt
INFO: Running fastGWA_3: Manhattan and QQ plots using `qqman`
INFO: fastGWA_3 (index=0) is ignored due to saved signature
INFO: fastGWA_3 output:   ../data/output-fastGWA/phenotypes_BMI.fa
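
Every step in the log above is reported as "ignored due to saved signature": the workflow had been run before with the same inputs, so the existing outputs were reused instead of being recomputed. The sketch below illustrates the general idea of signature-based skipping under simple assumptions (a SHA-256 digest per input file, stored in a JSON file). The names `run_step`, `file_digest` and `copy_upper` are hypothetical, and SoS's real signature mechanism tracks more than this.

```python
import hashlib
import json
import os
import tempfile


def file_digest(path):
    """Content hash of one file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def run_step(name, inputs, output, action, sig_file):
    """Run `action` unless the inputs are unchanged and the output exists."""
    signature = {p: file_digest(p) for p in inputs}
    saved = {}
    if os.path.exists(sig_file):
        with open(sig_file) as fh:
            saved = json.load(fh)
    if saved.get(name) == signature and os.path.exists(output):
        return "ignored"
    action(inputs, output)
    saved[name] = signature
    with open(sig_file, "w") as fh:
        json.dump(saved, fh)
    return "completed"


def copy_upper(inputs, output):
    # Stand-in for an expensive analysis step.
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.write(src.read().upper())


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    out = os.path.join(tmp, "output.txt")
    sig = os.path.join(tmp, "signatures.json")
    with open(src, "w") as fh:
        fh.write("bmi\n")
    print(run_step("demo_1", [src], out, copy_upper, sig))  # completed
    print(run_step("demo_1", [src], out, copy_upper, sig))  # ignored
```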

In [32]:
pwd

/Users/hyeonjukim/GIT/orientation/notebook



In [6]:
%preview ../data/output-fastGWA/phenotypes_BMI.fastGWA.manhattan.png

In [48]:
sos run ~/GIT/orientation/workflow/LMM.ipynb regenie \
    --cwd ~/GIT/orientation/data/output_regenie \
    --bfile ~/GIT/orientation/data/LMM_MWE/genotypes21_22.bed \
    --maf-filter 0.001 \
    --sampleFile ~/GIT/orientation/data/LMM_MWE/imputed_genotypes.sample \
    --genoFile ~/GIT/orientation/data/LMM_MWE/imputed_genotypes_chr*.bgen \
    --phenoFile ~/GIT/orientation/data/LMM_MWE/phenotypes.txt \
    --formatFile ~/GIT/orientation/data/LMM_MWE/regenie_template.yml \
    --phenoCol ASTHMA T2D \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 4 \
    --bsize 1000 \
    --lowmem_prefix ~/GIT/orientation/data/output_regenie \
    --trait bt \
    --minMAC 4 \
    --bgenMinMAF 0.05 \
    --bgenMinINFO 0.8 \
    --reverse_log_p \
    --p-filter 1 
   

INFO: Running regenie_0: Select the SNPs and samples to be used based on maf, geno, hwe and mind options
INFO: regenie_0 (index=0) is ignored due to saved signature
INFO: regenie_0 output:   /Users/hyeonjukim/GIT/orientation/data/output_regenie/cache/genotypes21_22.qc_pass.id /Users/hyeonjukim/GIT/orientation/data/output_regenie/cache/genotypes21_22.qc_pass.snplist
INFO: Running regenie_1: Run REGENIE step 1: fitting the null
HINT: Pulling docker image statisticalgenetics/lmm:1.6
HINT: Docker image statisticalgenetics/lmm:1.6 is now up to date
ERROR: regenie_1 (id=f6b83d92f6ba226b) returns an error.
ERROR: [regenie_1]: [0]: Executing script in docker returns an error (exitcode=1, stdout=/Users/hyeonjukim/GIT/orientation/data/output_regenie/phenotypes_ASTHMA_T2D.regenie_pred.stdout).
The script has been saved to /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_33471.sh. To reproduce the 

In [46]:
mypath=["~/GIT/orientation/data/output_regenie",
    "~/GIT/orientation/data/LMM_MWE/genotypes21_22.bed", 
    "~/GIT/orientation/data/LMM_MWE/imputed_genotypes.sample", 
    "~/GIT/orientation/data/LMM_MWE/imputed_genotypes_chr*.bgen", 
     "~/GIT/orientation/data/LMM_MWE/phenotypes.txt",
    "~/GIT/orientation/data/LMM_MWE/regenie_template.yml",
    "~/GIT/orientation/data/output_regenie","/~/GIT/orientation/notebook"]

realpath.(mypath[end])

    

LoadError: IOError: realpath: no such file or directory (ENOENT)
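
The `realpath` call fails because the last entry, `"/~/GIT/orientation/notebook"`, starts with `/~` and because `realpath` does not expand `~` on its own; the tilde is normally expanded by the shell, or explicitly by a function such as `expanduser` in Julia. The same behaviour can be seen in Python. In this sketch `resolve` is a hypothetical helper that expands the tilde and any wildcard (such as `imputed_genotypes_chr*.bgen`) before resolving the path.

```python
import glob
import os


def resolve(pattern):
    """Expand a leading ~ and any wildcard; return the real paths that exist."""
    expanded = os.path.expanduser(pattern)
    return [os.path.realpath(p) for p in glob.glob(expanded)]


# A leading "~" is replaced by the home directory ...
print(os.path.expanduser("~/GIT/orientation/notebook"))
# ... but "/~/..." does not start with "~", so it is returned unchanged.
print(os.path.expanduser("/~/GIT/orientation/notebook"))
# Paths that do not exist resolve to an empty list instead of raising.
print(resolve("/~/GIT/orientation/notebook"))
```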

In [30]:
docker run --rm  -v /Users/hyeonjukim/GIT/orientation/data/output:/Users/hyeonjukim/GIT/orientation/data/output -v /Users/hyeonjukim/GIT/orientation/notebook:/Users/hyeonjukim/GIT/orientation/notebook -v /Users/hyeonjukim/GIT/orientation/data/LMM_MWE:/Users/hyeonjukim/GIT/orientation/data/LMM_MWE -v /Users/hyeonjukim/GIT/orientation/data/output/cache:/Users/hyeonjukim/GIT/orientation/data/output/cache -v /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_31904.sh:/var/lib/sos/docker_run_31904.sh    -t  -w=/Users/hyeonjukim/GIT/orientation/notebook -u 502:20    statisticalgenetics/lmm:1.6 /bin/bash /var/lib/sos/docker_run_31904.sh

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_31904.sh" to rootfs at "/var/lib/sos/docker_run_31904.sh" caused: stat /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_31904.sh: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.



In [28]:
sos run ~/GIT/orientation/workflow/LMM.ipynb boltlmm \
    --cwd ~/GIT/orientation/data/output_boltlmm \
    --bfile ~/GIT/orientation/data/LMM_MWE/genotypes.bed \
    --sampleFile ~/GIT/orientation/data/LMM_MWE/imputed_genotypes.sample \
    --genoFile ~/GIT/orientation/data/LMM_MWE/imputed_genotypes_chr*.bgen \
    --phenoFile ~/GIT/orientation/data/LMM_MWE/phenotypes.txt \
    --formatFile ~/GIT/orientation/data/LMM_MWE/boltlmm_template.yml \
    --LDscoresFile ~/GIT/orientation/data/LMM_MWE/LDSCORE.1000G_EUR.tab.gz \
    --geneticMapFile ~/GIT/orientation/data/LMM_MWE/genetic_map_hg19_withX.txt.gz \
    --phenoCol BMI \
    --covarCol SEX \
    --covarMaxLevels 10 \
    --qCovarCol AGE \
    --numThreads 4 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --lmm-option none \
    --p-filter 1 
 

INFO: Running boltlmm_1: Run BOLT analysis
HINT: Pulling docker image statisticalgenetics/lmm:1.6
HINT: Docker image statisticalgenetics/lmm:1.6 is now up to date
ERROR: [boltlmm_1]: [(id=5b8869203119a2dd, index=0)]: Executing script in docker returns an error (exitcode=1, stdout=/Users/hyeonjukim/GIT/orientation/data/output_boltlmm/cache/imputed_genotypes_chr21.phenotypes_BMI.boltlmm.snp_stats.stdout).
The script has been saved to /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_31524.sh. To reproduce the error please run:
docker run --rm  -v /Users/hyeonjukim/GIT/orientation/data/LMM_MWE:/Users/hyeonjukim/GIT/orientation/data/LMM_MWE -v /Users/hyeonjukim/GIT/orientation/data/output_boltlmm/cache:/Users/hyeonjukim/GIT/orientation/data/output_boltlmm/cache -v /Users/hyeonjukim/GIT/orientation/notebook:/Users/hyeonjukim/GIT/orientation/notebook -v /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_31524.sh:/var/lib/sos/docker_run_31524.sh

In [26]:
docker run --rm  -v /Users/hyeonjukim/GIT/orientation/notebook/output_boltlmm/cache:/Users/hyeonjukim/GIT/orientation/notebook/output_boltlmm/cache -v /Users/hyeonjukim/GIT/orientation/data/LMM_MWE:/Users/hyeonjukim/GIT/orientation/data/LMM_MWE -v /Users/hyeonjukim/GIT/orientation/notebook:/Users/hyeonjukim/GIT/orientation/notebook -v /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21283.sh:/var/lib/sos/docker_run_21283.sh    -t  -w=/Users/hyeonjukim/GIT/orientation/notebook -u 502:20    statisticalgenetics/lmm:1.6 /bin/bash /var/lib/sos/docker_run_21283.sh

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21283.sh" to rootfs at "/var/lib/sos/docker_run_21283.sh" caused: stat /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21283.sh: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.



In [27]:
docker run --rm  -v /Users/hyeonjukim/GIT/orientation/notebook:/Users/hyeonjukim/GIT/orientation/notebook -v /Users/hyeonjukim/GIT/orientation/data/LMM_MWE:/Users/hyeonjukim/GIT/orientation/data/LMM_MWE -v /Users/hyeonjukim/GIT/orientation/notebook/output_boltlmm/cache:/Users/hyeonjukim/GIT/orientation/notebook/output_boltlmm/cache -v /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21282.sh:/var/lib/sos/docker_run_21282.sh    -t  -w=/Users/hyeonjukim/GIT/orientation/notebook -u 502:20    statisticalgenetics/lmm:1.6 /bin/bash /var/lib/sos/docker_run_21282.sh

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21282.sh" to rootfs at "/var/lib/sos/docker_run_21282.sh" caused: stat /Users/hyeonjukim/.sos/ed96d0c7509a3905/.sos/docker_run_21282.sh: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.
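
The Docker error messages above all end with the same advice: check that each host path given to `-v` exists and is of the expected type (a file for the `docker_run_*.sh` script, a directory for the data folders). A minimal sketch of that check is shown below; `host_mounts` and `describe` are hypothetical helper names, and the command string is a shortened stand-in rather than one of the commands above.

```python
import os
import shlex


def host_mounts(command):
    """Return the host-side path of every `-v host:container` option."""
    tokens = shlex.split(command)
    return [tokens[i + 1].split(":")[0]
            for i, tok in enumerate(tokens[:-1]) if tok == "-v"]


def describe(path):
    """Classify a host path as 'file', 'directory' or 'missing'."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "missing"


# Shortened, made-up command with the same structure as the ones above.
command = ("docker run --rm -v /tmp:/tmp "
           "-v /no/such/docker_run_0.sh:/var/lib/sos/docker_run_0.sh "
           "-t some/image /bin/bash /var/lib/sos/docker_run_0.sh")

for path in host_mounts(command):
    print(path, "->", describe(path))
```

Pasting one of the failing `docker run` commands into `command` lists which mounted host paths are missing or have an unexpected type before Docker is invoked again.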

