Now I'm re-running XHMM, this time removing targets with extreme GC content (below 10% or above 90%) and targets with low sequence complexity. Like in the previous analysis, I'll only run the simplex families with at least 2 kids in them, so we can contrast affected and unaffected. I also have to remember to remove the other 2 samples we always exclude.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [None]:
# in terminal
exome_targets='/data/NCR_SBRB/simplex/SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_capture_targets.bed'
gatk_memory="50g"
ref_fa='/fdb/GATK_resource_bundle/hg19-2.8/ucsc.hg19.fasta'
out_dir='/data/sudregp/cnv/xhmm_clean'

cd $out_dir
module load GATK
module load XHMM

GATK -m ${gatk_memory} GCContentByInterval -L ${exome_targets} -R ${ref_fa} -o ./DATA.locus_GC.txt
cat ./DATA.locus_GC.txt | awk '{if ($2 < 0.1 || $2 > 0.9) print $1}' > ./extreme_gc_targets.txt

# merging all subjects in the directory
ls -1 *.sample_interval_summary > depth_list.txt;
grep -v -f ../exclude.txt depth_list.txt > depth_list2.txt
grep -v -f ../exclude_multikids.txt depth_list2.txt > depth_list3.txt

xhmm --mergeGATKdepths --GATKdepthsList=depth_list3.txt -o ./DATA.RD.txt;

# this does the same thing as the XHMM script, but it correctly parses
# the base pair start and end positions
cat ${exome_targets} | awk 'BEGIN{OFS="\t"; print "#CHR\tBP1\tBP2\tID"}{print $1, $2, $3, NR}' > ./EXOME.targets.reg

module load plinkseq
pseq . loc-load --locdb ./EXOME.targets.LOCDB --file ./EXOME.targets.reg --group targets \
    --out ./EXOME.targets.LOCDB.loc-load --noweb

# this has the same effect as the suggested command, but it actually works
pseq . loc-stats --locdb ./EXOME.targets.LOCDB --group targets --seqdb ./seqdb.hg19 --noweb | \
    awk '{if (NR > 1) { print  $4, $10 }}' | sed 's/\.\./-/' - > ./DATA.locus_complexity.txt

cat ./DATA.locus_complexity.txt | awk '{if ($2 > 0.25) print $1}' > ./low_complexity_targets.txt

xhmm --matrix -r ./DATA.RD.txt --centerData --centerType target \
-o ./DATA.filtered_centered.RD.txt \
--outputExcludedTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--excludeTargets ./extreme_gc_targets.txt --excludeTargets ./low_complexity_targets.txt \
--minTargetSize 10 --maxTargetSize 10000 \
--minMeanTargetRD 10 --maxMeanTargetRD 500 \
--minMeanSampleRD 25 --maxMeanSampleRD 200 \
--maxSdSampleRD 150

xhmm --PCA -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA

xhmm --normalize -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA \
--normalizeOutput ./DATA.PCA_normalized.txt \
--PCnormalizeMethod PVE_mean --PVE_mean_factor 0.7

xhmm --matrix -r ./DATA.PCA_normalized.txt --centerData --centerType sample --zScoreData \
-o ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
--outputExcludedTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
--maxSdTargetRD 30

xhmm --matrix -r ./DATA.RD.txt \
--excludeTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--excludeTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--excludeSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--excludeSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
-o ./DATA.same_filtered.RD.txt

xhmm --discover -p params.txt -r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
    -R ./DATA.same_filtered.RD.txt -c ./DATA.xcnv -a ./DATA.aux_xcnv -s ./DATA

xhmm --genotype -p params.txt -r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
    -R ./DATA.same_filtered.RD.txt -g ./DATA.xcnv -F $ref_fa -v ./DATA.vcf



Let's make sure no sample was removed by XHMM this time (an empty, 0-byte exclusion file means nothing was dropped):

In [4]:
%%bash
cd ~/data/cnv/xhmm_clean/
ls -ltr DATA.filtered_centered.RD.txt.filtered_samples.txt

-rw-rw---- 1 sudregp sudregp 0 Jan 25 16:03 DATA.filtered_centered.RD.txt.filtered_samples.txt
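
The listing above only covers the sample exclusion file from the first `--matrix` pass; the second pass writes its own. A small sketch of checking both from Python follows. The file names are taken from the commands above, while the helper names (`count_entries`, `report_excluded`) and the sample ID in the demo are hypothetical. The demo runs on a throwaway directory so it works anywhere.

```python
import os
import tempfile

# Sample exclusion files written by the two `xhmm --matrix` filtering passes
SAMPLE_FILES = [
    "DATA.filtered_centered.RD.txt.filtered_samples.txt",
    "DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt",
]


def count_entries(path):
    """Number of non-blank lines in `path`, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def report_excluded(directory, file_names):
    """Map each exclusion file name to the number of entries it lists."""
    return {name: count_entries(os.path.join(directory, name)) for name in file_names}


# Demo with fake files: the first is empty, the second lists one made-up sample
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, SAMPLE_FILES[0]), "w").close()
    with open(os.path.join(demo_dir, SAMPLE_FILES[1]), "w") as handle:
        handle.write("SAMPLE_A\n")
    demo_counts = report_excluded(demo_dir, SAMPLE_FILES)

for name, count in demo_counts.items():
    print(name, count)
```

For the real run, something like `report_excluded(os.path.expanduser('~/data/cnv/xhmm_clean'), SAMPLE_FILES)` should return 0 for both files if no sample was removed.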
