# Phasing Full AOU Participant Cohort

Purpose: Phase the entire AOU cohort for downstream local ancestry and admixture inference.

In [1]:
# Import the packages used in this notebook
from datetime import datetime
import os
import subprocess

In [2]:
# Store the workspace bucket address in a variable
my_bucket = os.getenv('WORKSPACE_BUCKET')
my_bucket

'gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b'

In [3]:
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable so that it's easier to use within %%bash cells.
%env USER_NAME={USER_NAME}

env: USER_NAME=nmshahir


In [4]:
!gsutil ls -lh gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test

 58.97 GiB  2024-04-04T17:34:24Z  gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_QC_only_sort.vcf.gz
  2.39 MiB  2024-04-04T21:26:01Z  gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_QC_only_sort.vcf.gz.tbi
  1.92 GiB  2024-04-15T00:59:09Z  gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr13.b38.sorted.phased.bcf
788.08 MiB  2024-04-09T20:34:40Z  gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr22.b38.sorted.phased.bcf
       0 B  2024-04-03T17:43:13Z  gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/test.txt
TOTAL: 5 objects, 66209472624 bytes (61.66 GiB)


In [2]:
%%bash

dsub --help

usage: /opt/conda/bin/dsub [-h] [--provider PROVIDER] [--version VERSION]
                           [--unique-job-id] [--name NAME]
                           [--tasks [FILE M-N ...]] [--image IMAGE]
                           [--dry-run] [--command COMMAND] [--script SCRIPT]
                           [--env [KEY=VALUE ...]] [--label [KEY=VALUE ...]]
                           [--input [KEY=REMOTE_PATH ...]]
                           [--input-recursive [KEY=REMOTE_PATH ...]]
                           [--output [KEY=REMOTE_PATH ...]]
                           [--output-recursive [KEY=REMOTE_PATH ...]]
                           [--user USER] [--user-project USER_PROJECT]
                           [--mount [KEY=PATH_SPEC ...]] [--wait]
                           [--retries RETRIES] [--poll-interval POLL_INTERVAL]
                           [--after AFTER [AFTER ...]] [--skip] [--summary]
                           [--min-cores MIN_CORES] [--min-ram MIN_RAM]
                        

# Files Needed

| File | File type | Formatting notes |
|------|-----------|------------------|
| Query file | VCF file of participants to be phased | 1) SHAPEIT needs the AC field to run. If the VCF lacks it, run `bcftools +fill-AN-AC <query_file.vcf> -Oz -o <query_file_w_AC.vcf.gz> -- -t all`. 2) The file must also be sorted and indexed. |
| Genetic map files | gmap(.gz) | The SHAPEIT group provides these. Use `wget https://github.com/odelaneau/shapeit5/raw/main/resources/maps/b38/` to acquire them. |
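
The URL in the table points at a directory rather than a single file, so one option is to build one download URL per chromosome. The sketch below assumes the files are named `chr<N>.b38.gmap.gz`, which matches the `--map` argument used by the phasing script in this notebook. `gmap_urls` is a hypothetical helper, and the names should be checked against the repository before downloading.

```python
# Base URL taken from the "Files Needed" table.
GMAP_BASE_URL = "https://github.com/odelaneau/shapeit5/raw/main/resources/maps/b38/"


def gmap_urls(chromosomes=range(1, 23), base_url=GMAP_BASE_URL):
    """Return one genetic-map URL per chromosome.

    Assumption: files follow the chr<N>.b38.gmap.gz naming pattern.
    """
    return [f"{base_url}chr{chrom}.b38.gmap.gz" for chrom in chromosomes]


urls = gmap_urls()
print(len(urls))   # number of autosomes covered
print(urls[0])     # URL for chromosome 1
```

Each URL could then be passed to `wget` individually.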


# Final Version

First, we'll create the script that runs SHAPEIT5.

In [1]:
%%writefile phase_aou.sh
#!/bin/bash
set -o errexit
set -o nounset
# Phasing
phase_common_static --input "${input_vcf}/AOU_QC_only_sort.vcf.gz" --region ${chrom} --map "${gene_map}/chr${chrom}.b38.gmap.gz" --output "${out_path}/AOU_chr${chrom}.b38.sorted.phased.bcf" --thread 8

Writing phase_aou.sh


Explanation of parts:

`${input_vcf}` - path to the directory holding the query VCF, the file we're going to phase.

`${chrom}` - the chromosome number.

`${gene_map}` - path to the directory holding the genetic map files.

`${out_path}` - path where we want the output files to reside.

All of these variables are defined in the dsub call at submission.
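
To see how these variables combine into the final command, here is a minimal sketch that renders the same `phase_common_static` call in Python. `render_phase_command` and the `/mnt/...` paths are hypothetical and only for illustration; the explicit check for missing variables mimics what `set -o nounset` does in the shell script.

```python
import shlex

# Variables the shell script expects dsub to provide.
REQUIRED_VARS = ("input_vcf", "chrom", "gene_map", "out_path")


def render_phase_command(env, threads=8):
    """Render the phasing command for one chromosome as a shell string."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        # Comparable to the script aborting under `set -o nounset`.
        raise KeyError(f"unset variables: {', '.join(missing)}")
    chrom = env["chrom"]
    args = [
        "phase_common_static",
        "--input", f"{env['input_vcf']}/AOU_QC_only_sort.vcf.gz",
        "--region", str(chrom),
        "--map", f"{env['gene_map']}/chr{chrom}.b38.gmap.gz",
        "--output", f"{env['out_path']}/AOU_chr{chrom}.b38.sorted.phased.bcf",
        "--thread", str(threads),
    ]
    return shlex.join(args)


# Placeholder paths; on the VM, dsub chooses the real mount points.
example_env = {
    "input_vcf": "/mnt/in",
    "chrom": 22,
    "gene_map": "/mnt/maps",
    "out_path": "/mnt/out",
}
command = render_phase_command(example_env)
print(command)
```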

Next, we'll copy our script to our workspace bucket. In this instance, I've sent it to the "scripts" directory in my workspace bucket.

In [2]:
!gsutil cp phase_aou.sh ${WORKSPACE_BUCKET}/scripts/phase_aou.sh

Copying file://phase_aou.sh [Content-Type=text/x-sh]...
/ [1 files][  252.0 B/  252.0 B]                                                
Operation completed over 1 objects/252.0 B.                                      


Below is a dsub job that we'll use to run SHAPEIT on a cloud virtual machine (VM). For a more thorough overview of dsub, I strongly recommend reviewing the notebooks on how to use dsub in the AOU Researcher Workbench: https://workbench.researchallofus.org/workspaces/aou-rw-6221d5ec/howtousedsubintheresearcherworkbenchv7/data

For the dsub job outlined here, the key point is that, rather than submitting a job for each chromosome manually, we loop from 1 to 22 and submit one phasing job per chromosome. Because SHAPEIT5 is not part of the default terra-jupyter image, we use a SHAPEIT5 container, "gcr.io/boxwood-sandbox-353200/shapeit5_2023-05-05", that I've uploaded to the Google Container Registry (GCR). Note: GCR is being deprecated in favor of Artifact Registry. This will be updated when the transition finishes.
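
As a way to review the per-chromosome submissions before launching anything, the sketch below builds the same argument list in Python for each chromosome. It does not call dsub. `dsub_args_for` and the `gs://example-bucket` paths are hypothetical placeholders, and the flag values mirror the bash cell in this notebook. Note that the upper bound of 23 is exclusive, so the loop covers chromosomes 1 through 22.

```python
def dsub_args_for(chrom, bucket, script):
    """Return the dsub-style arguments for phasing one chromosome."""
    return [
        "--image", "gcr.io/boxwood-sandbox-353200/shapeit5_2023-05-05",
        "--disk-size", "512",
        "--boot-disk-size", "100",
        "--min-ram", "512",
        "--logging", f"{bucket}/data/logging",
        "--input-recursive", f"input_vcf={bucket}/data/phase_test",
        "--input-recursive", f"gene_map={bucket}/gnomad_ref",
        "--output-recursive", f"out_path={bucket}/data/phase_test",
        "--env", f"chrom={chrom}",
        "--script", script,
    ]


LOWER, UPPER = 1, 23  # UPPER is exclusive, as in the bash loop
BUCKET = "gs://example-bucket"  # placeholder for the workspace bucket

jobs = {
    chrom: dsub_args_for(chrom, BUCKET, f"{BUCKET}/scripts/phase_aou.sh")
    for chrom in range(LOWER, UPPER)
}
print(sorted(jobs))
```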

In [10]:
%%bash --out aou_phase
source ~/aou_dsub.bash # This file was created via notebook 01_dsub_setup.ipynb.

BASH_SCRIPT="gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/scripts/phase_aou.sh"


# Submit one phasing job per chromosome (1-22); UPPER is exclusive.
LOWER=1
UPPER=23
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))
do
  aou_dsub \
    --image gcr.io/boxwood-sandbox-353200/shapeit5_2023-05-05 \
    --disk-size 512 \
    --boot-disk-size 100 \
    --min-ram 512 \
    --logging "${WORKSPACE_BUCKET}/data/logging" \
    --input-recursive input_vcf="${WORKSPACE_BUCKET}/data/phase_test" \
    --input-recursive gene_map="${WORKSPACE_BUCKET}/gnomad_ref" \
    --output-recursive out_path="${WORKSPACE_BUCKET}/data/phase_test" \
    --env chrom=${chromo} \
    --script "${BASH_SCRIPT}"
done


Job properties:
  job-id: phase-aou--nmshahir--240416-010609-90
  job-name: phase-aou
  user-id: nmshahir
Provider internal-id (operation): projects/284833138454/locations/us-central1/operations/14666657719332510908
Launched job-id: phase-aou--nmshahir--240416-010609-90
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-f195c3ba --location us-central1 --jobs 'phase-aou--nmshahir--240416-010609-90' --users 'nmshahir' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-f195c3ba --location us-central1 --jobs 'phase-aou--nmshahir--240416-010609-90' --users 'nmshahir'
Job properties:
  job-id: phase-aou--nmshahir--240416-010612-06
  job-name: phase-aou
  user-id: nmshahir
Provider internal-id (operation): projects/284833138454/locations/us-central1/operations/695032650448920075
Launched job-id: phase-aou--nmshahir--240416-010612-06
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-f195c3ba

In [20]:
!dstat --provider google-cls-v2 --project terra-vpc-sc-f195c3ba --location us-central1 --jobs 'phase-aou--nmshahir--240416-010623-66' --users 'nmshahir' --status '*' --full

- create-time: '2024-04-16 01:06:23.751820'
  dsub-version: v0-4-10
  end-time: '2024-04-16 05:57:55.399194'
  envs:
    chrom: '21'
  events:
  - name: start
    start-time: 2024-04-16 01:06:29.805475+00:00
  - name: pulling-image
    start-time: 2024-04-16 01:07:20.042320+00:00
  - name: localizing-files
    start-time: 2024-04-16 01:07:53.537266+00:00
  - name: running-docker
    start-time: 2024-04-16 01:29:59.713134+00:00
  - name: delocalizing-files
    start-time: 2024-04-16 05:57:45.230557+00:00
  - name: ok
    start-time: 2024-04-16 05:57:55.399194+00:00
  input-recursives:
    input_vcf: gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test
    ref_map: gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/gnomad_ref
  inputs: {}
  internal-id: projects/284833138454/locations/us-central1/operations/6597179824237372606
  job-id: phase-aou--nmshahir--240416-010623-66
  job-name: phase-aou
  labels: {}
  last-update: '2024-04-16 05:57:55.39
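
The event timestamps in the `dstat --full` output show where a job spends its time. The sketch below copies the events from the chromosome 21 job above and computes how long each stage lasted. `stage_durations` is a hypothetical helper.

```python
from datetime import datetime

# Event names and start times copied from the dstat output above.
events = [
    ("start", "2024-04-16 01:06:29.805475+00:00"),
    ("pulling-image", "2024-04-16 01:07:20.042320+00:00"),
    ("localizing-files", "2024-04-16 01:07:53.537266+00:00"),
    ("running-docker", "2024-04-16 01:29:59.713134+00:00"),
    ("delocalizing-files", "2024-04-16 05:57:45.230557+00:00"),
    ("ok", "2024-04-16 05:57:55.399194+00:00"),
]


def stage_durations(events):
    """Map each stage to the time elapsed until the next stage began."""
    times = [(name, datetime.fromisoformat(stamp)) for name, stamp in events]
    return {
        name: next_start - start
        for (name, start), (_, next_start) in zip(times, times[1:])
    }


durations = stage_durations(events)
for name, elapsed in durations.items():
    print(f"{name:20s} {elapsed}")
```

For this job, the `running-docker` stage (the SHAPEIT run itself) lasts roughly 4 hours 28 minutes, and copying the inputs onto the VM (`localizing-files`) takes about 22 minutes.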

In [5]:
!gsutil ls ${WORKSPACE_BUCKET}/data/phase_test

gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_QC_only_sort.vcf.gz
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_QC_only_sort.vcf.gz.tbi
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr1.b38.sorted.phased.bcf
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr1.b38.sorted.phased.bcf.csi
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr10.b38.sorted.phased.bcf
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr10.b38.sorted.phased.bcf.csi
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr11.b38.sorted.phased.bcf
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr11.b38.sorted.phased.bcf.csi
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr12.b38.sorted.phased.bcf
gs://fc-secure-9bc86de4-cb77-407c-82c4-de33c6265a3b/data/phase_test/AOU_chr12.b38.sorted.phased.bcf.cs
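
With 22 jobs writing to the same directory, it can help to check the listing for missing outputs programmatically. The sketch below assumes each chromosome should produce a `.bcf` file and a `.bcf.csi` index named as in the listing above. `missing_outputs` is a hypothetical helper and the `gs://example-bucket` paths are placeholders; in the notebook, `listing` would hold the lines returned by `gsutil ls`.

```python
def missing_outputs(object_names, chromosomes=range(1, 23)):
    """Return expected phased-output file names absent from a bucket listing."""
    present = {name.rsplit("/", 1)[-1] for name in object_names}
    missing = []
    for chrom in chromosomes:
        for suffix in ("bcf", "bcf.csi"):
            expected = f"AOU_chr{chrom}.b38.sorted.phased.{suffix}"
            if expected not in present:
                missing.append(expected)
    return missing


# Placeholder listing in which the chromosome 2 index is absent.
listing = [
    "gs://example-bucket/data/phase_test/AOU_chr1.b38.sorted.phased.bcf",
    "gs://example-bucket/data/phase_test/AOU_chr1.b38.sorted.phased.bcf.csi",
    "gs://example-bucket/data/phase_test/AOU_chr2.b38.sorted.phased.bcf",
]
missing = missing_outputs(listing, chromosomes=[1, 2])
print(missing)
```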

In [4]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --users "${USER_NAME}" \
    --jobs '*' \
    --status '*' 

Job Name         Status                                      Last Update
---------------  ------------------------------------------  --------------
phase-aou        Success                                     04-17 17:38:02
phase-aou        Success                                     04-17 18:06:52
phase-aou        Success                                     04-17 17:34:11
phase-aou        Success                                     04-17 16:50:45
phase-aou        Success                                     04-17 18:54:24
phase-aou        Success                                     04-17 19:44:34
phase-aou        Success                                     04-18 15:20:53
phase-aou        Success                                     04-17 21:12:31
phase-aou        Success                                     04-18 00:17:45
phase-aou        Success                                     04-18 01:39:05
phase-aou        Success                                     04-18 06:03:30
phase-aou      