The idea here is to use the multiplex cohort as a replication (or negation) sample for the CNV findings in the simplex cohort. In other words, we want to check that whatever we found in the simplex cohort does not hold for the multiplex cohort. So, no sample from the simplex cohort should be used in the multiplex findings. For that, I'll use samples_multiplex_only.txt, which lists the samples that are in multiplex but not in simplex (151 samples raw).
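As a minimal sketch of the disjointness requirement described above, the check below intersects two ID lists. It assumes each sample list is a plain-text file with one ID per line; the helper names `read_ids` and `overlapping_samples` and the toy IDs are hypothetical and not part of the original analysis.

```python
def read_ids(path):
    """Read one sample ID per line, skipping blank lines."""
    with open(path) as fid:
        return [line.strip() for line in fid if line.strip()]


def overlapping_samples(multiplex_ids, simplex_ids):
    """Return the sorted IDs present in both cohorts (ideally empty)."""
    return sorted(set(multiplex_ids) & set(simplex_ids))


# toy IDs, for illustration only
multiplex = ['CLIA_0001', 'CLIA_0002', 'CCGO_0003']
simplex = ['CLIA_0002', 'WPS_0009']
shared = overlapping_samples(multiplex, simplex)
```

With real data, the two lists would come from `read_ids` on the multiplex and simplex sample files, and a non-empty `shared` would flag IDs that need to be dropped before going on.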

Let's start the preprocessing then. First, make sure that no weird samples need to be thrown out:

In [1]:
import glob
ped_files = ['/data/sudregp/multiplex_simplex/multiplex.ped']
wes_prefix = ['CLIA', 'CCGO', 'WPS']
# IDs to exclude, one per line
with open('/data/sudregp/cnv/xhmm_clean2/samples.txt', 'r') as fid:
    exclude_list = [line.rstrip() for line in fid]

trios = []
affected = []
controls = []
samples = []
famids = []
for ped_file in ped_files:
    fid = open(ped_file, 'r')
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # if the current ID and its parents have WES data, and the sample has
        # not been added yet
        if (fa.split('_')[0] in wes_prefix and
            mo.split('_')[0] in wes_prefix and
            sid.split('_')[0] in wes_prefix and
            sid not in samples and
            # as written, a trio is dropped only when all three of its
            # members are in the exclude list
            (sid not in exclude_list or fa not in exclude_list or mo not in exclude_list)):
            fam = {}
            fam['child'] = sid
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            fam['father'] = fa
            fam['mother'] = mo
            fam['famid'] = famid
            trios.append(fam)
            samples += [sid, fa, mo]
            famids.append(famid)
    fid.close()
samples = set(samples)
famids = set(famids)
kids = set(affected + controls)

print 'Unique samples:', len(samples)
print 'Unique families:', len(famids)
print 'Unique children:', len(kids)

Unique samples: 153
Unique families: 20
Unique children: 80


So, basically we'll be looking at everyone in the multiplex pedigree, provided they're not in the simplex analysis and have parents.
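The exclusion rule can be read in two ways, so here is a small sketch that contrasts them. The function `keep_trio`, its `strict` flag and the toy IDs are hypothetical. The lenient branch mirrors the `or` condition in the cell above, while the strict branch is only one possible alternative reading, not something the original analysis ran.

```python
def keep_trio(child, father, mother, exclude, strict=False):
    """Decide whether a trio survives the exclusion list.

    strict=False mirrors the condition used above: the trio is kept as
    long as at least one member is absent from `exclude`.
    strict=True (hypothetical alternative) keeps the trio only when no
    member at all is in `exclude`.
    """
    members = (child, father, mother)
    if strict:
        return all(m not in exclude for m in members)
    return any(m not in exclude for m in members)


# toy IDs, for illustration only: the father is in the exclude list
exclude = set(['CLIA_0002'])
trio = ('CLIA_0001', 'CLIA_0002', 'CLIA_0003')
lenient_result = keep_trio(*trio, exclude=exclude)
strict_result = keep_trio(*trio, exclude=exclude, strict=True)
```

In this toy case the lenient rule keeps the trio and the strict rule drops it, which is the situation to look for if the sample counts ever look larger than expected.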

In [3]:
fid = open('/data/sudregp/multiplex_simplex/xhmm_replication/samples.txt', 'w')
for s in samples:
    fid.write(s + '\n')
fid.close()

fid = open('/data/sudregp/multiplex_simplex/xhmm_replication/kid_samples.txt', 'w')
for s in kids:
    fid.write(s + '\n')
fid.close()

Starting an analysis very similar to the one we did for xhmm_ximplex_clean as of today's date:

In [None]:
# in terminal
exome_targets='/data/NCR_SBRB/simplex/SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_capture_targets.bed'
gatk_memory="50g"
ref_fa='/fdb/GATK_resource_bundle/hg19-2.8/ucsc.hg19.fasta'
out_dir='/data/sudregp/multiplex_simplex/xhmm_replication/'

cd $out_dir
module load GATK
module load XHMM

GATK -m ${gatk_memory} GCContentByInterval -L ${exome_targets} -R ${ref_fa} -o ./DATA.locus_GC.txt
cat ./DATA.locus_GC.txt | awk '{if ($2 < 0.1 || $2 > 0.9) print $1}' > ./extreme_gc_targets.txt

# merging all subjects in the directory
while read s; do
    cp ../xhmm/${s}* .
done < samples.txt
ls -1 *.sample_interval_summary > depth_list.txt;
cp ../xhmm/params.txt .

xhmm --mergeGATKdepths --GATKdepthsList=depth_list.txt -o ./DATA.RD.txt;

# this does the same thing as the XHMM script, but it actually works in parsing 
# the base pair start and ends
cat ${exome_targets} | awk 'BEGIN{OFS="\t"; print "#CHR\tBP1\tBP2\tID"}{print $1, $2, $3, NR}' > ./EXOME.targets.reg

module load plinkseq
pseq . loc-load --locdb ./EXOME.targets.LOCDB --file ./EXOME.targets.reg --group targets \
    --out ./EXOME.targets.LOCDB.loc-load --noweb

# this has the same effect as the suggested command, but it actually works
pseq . loc-stats --locdb ./EXOME.targets.LOCDB --group targets --seqdb ./seqdb.hg19 --noweb | \
    awk '{if (NR > 1) { print  $4, $10 }}' | sed 's/\.\./-/' - > ./DATA.locus_complexity.txt

cat ./DATA.locus_complexity.txt | awk '{if ($2 > 0.25) print $1}' > ./low_complexity_targets.txt

xhmm --matrix -r ./DATA.RD.txt --centerData --centerType target \
-o ./DATA.filtered_centered.RD.txt \
--outputExcludedTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--excludeTargets ./extreme_gc_targets.txt --excludeTargets ./low_complexity_targets.txt \
--minTargetSize 10 --maxTargetSize 10000 \
--minMeanTargetRD 10 --maxMeanTargetRD 500 \
--minMeanSampleRD 25 --maxMeanSampleRD 200 \
--maxSdSampleRD 150

xhmm --PCA -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA

xhmm --normalize -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA \
--normalizeOutput ./DATA.PCA_normalized.txt \
--PCnormalizeMethod PVE_mean --PVE_mean_factor 0.7

xhmm --matrix -r ./DATA.PCA_normalized.txt --centerData --centerType sample --zScoreData \
-o ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
--outputExcludedTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
--maxSdTargetRD 30

xhmm --matrix -r ./DATA.RD.txt \
--excludeTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--excludeTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--excludeSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--excludeSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
-o ./DATA.same_filtered.RD.txt

xhmm --discover -p params.txt -r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
    -R ./DATA.same_filtered.RD.txt -c ./DATA.xcnv -a ./DATA.aux_xcnv -s ./DATA

xhmm --genotype -p params.txt -r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
    -R ./DATA.same_filtered.RD.txt -g ./DATA.xcnv -F $ref_fa -v ./DATA.vcf
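To make the extreme-GC filter near the top of this cell easier to reason about, here is a minimal Python sketch using the same thresholds (below 0.1 or above 0.9). It assumes each line holds an interval name and a GC fraction separated by whitespace. The function name `extreme_gc_targets` and the toy lines are hypothetical, and the handling of malformed lines is my own choice rather than a reproduction of what awk does.

```python
def extreme_gc_targets(lines, low=0.1, high=0.9):
    """Return interval names whose GC fraction is < low or > high.

    Each line is expected to look like 'chr1:100-200<TAB>0.45'.
    """
    flagged = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            gc = float(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        if gc < low or gc > high:
            flagged.append(fields[0])
    return flagged


# toy lines, for illustration only
gc_lines = ['chr1:100-200\t0.45', 'chr1:300-400\t0.05', 'chr2:10-90\t0.95']
flagged = extreme_gc_targets(gc_lines)
```

The low-complexity filter has the same shape with a single upper threshold (0.25), so the same pattern would apply there.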

In [1]:
%%bash
cd ~/data/multiplex_simplex/xhmm_replication/
ls -ltr DATA.filtered_centered.RD.txt.filtered_samples.txt

-rw-rw---- 1 sudregp sudregp 0 Feb  6 17:38 DATA.filtered_centered.RD.txt.filtered_samples.txt


In [None]:
#in terminal
module load plinkseq
module load plink/1.07

cd ~/data/multiplex_simplex/xhmm_replication/

pseq DATA new-project
# adding a first column with subject ID for PSEQ
cut -f 2 ../multiplex.ped > junk.txt
paste junk.txt ../multiplex.ped > multiplex.ped.info
pseq DATA load-pedigree --file multiplex.ped.info
pseq DATA load-vcf --vcf DATA.vcf

for q in 50 60 70 80 90; do
    pseq DATA cnv-denovo --noweb --minSQ $q --minNQ $q --out DATA_q${q}
    grep DENOVO DATA_q${q}.denovo.cnv > pseq_DENOVO.txt
    # borrow the header row
    head -1 DATA.xcnv > denovo.xcnv;

    # keep only the de novo CNVs called in each kid
    while read sample; do
        grep $sample DATA.xcnv > sample.xcnv;
        for cnv in `grep $sample pseq_DENOVO.txt | cut -f 3 -`; do
            # replacing .. by -
            cnv=`echo $cnv | sed -e 's/\.\./\-/'`;
            grep $cnv sample.xcnv >> denovo.xcnv; 
        done;
    done < kid_samples.txt;
    /usr/local/apps/XHMM/2016-01-04/sources/scripts/xcnv_to_cnv denovo.xcnv > tmp.cnv
    # switch around FAMID and IID columns, and remove header
    awk '{OFS="\t"; if ( $3 != "CHR" ) {print $2, $1, $3, $4, $5, $6, $7, $8 }}' tmp.cnv > denovo_q${q}.cnv
    rm sample.xcnv pseq_DENOVO.txt tmp.cnv denovo.xcnv
    
    # keep only the inherited (transmitted) CNVs called in each kid
    grep MATERNAL_TRANSMITTED DATA_q${q}.denovo.cnv > pseq_TRANSMITTED.txt
    grep PATERNAL_TRANSMITTED DATA_q${q}.denovo.cnv >> pseq_TRANSMITTED.txt
    # borrow the header row
    head -1 DATA.xcnv > inherited.xcnv;

    while read sample; do
        grep $sample DATA.xcnv > sample.xcnv;
        for cnv in `grep $sample pseq_TRANSMITTED.txt | cut -f 3 -`; do
            # replacing .. by -
            cnv=`echo $cnv | sed -e 's/\.\./\-/'`;
            grep $cnv sample.xcnv >> inherited.xcnv; 
        done;
    done < kid_samples.txt;
    /usr/local/apps/XHMM/2016-01-04/sources/scripts/xcnv_to_cnv inherited.xcnv > tmp.cnv
    # switch around FAMID and IID columns, and remove header
    awk '{OFS="\t"; if ( $3 != "CHR" ) {print $2, $1, $3, $4, $5, $6, $7, $8 }}' tmp.cnv > inherited_q${q}.cnv
    rm sample.xcnv pseq_TRANSMITTED.txt tmp.cnv inherited.xcnv
    
    # compile all CNVs for kids
    # borrow the header row
    head -1 DATA.xcnv > all.xcnv;

    # effectively just filtering DATA.xcnv to keep only kids
    while read sample; do
        grep $sample DATA.xcnv >> all.xcnv;
    done < kid_samples.txt;
    /usr/local/apps/XHMM/2016-01-04/sources/scripts/xcnv_to_cnv all.xcnv > tmp.cnv
    # switch around FAMID and IID columns, and remove header
    awk '{OFS="\t"; if ( $3 != "CHR" ) {print $2, $1, $3, $4, $5, $6, $7, $8 }}' tmp.cnv > all_q${q}.cnv
    rm tmp.cnv all.xcnv
done
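One thing to keep in mind with the loops above is that `grep $sample` matches substrings, so an ID that is a prefix of another ID would pull in both. The sketch below shows the same two steps (converting the PSEQ `..` location into the `-` form, then selecting rows) with exact matching on columns. It assumes tab-delimited .xcnv rows with the sample ID in column 1 and the interval in column 3; the function names and the toy rows are hypothetical.

```python
def pseq_to_xcnv_interval(loc):
    """Turn 'chr1:1000..2000' into 'chr1:1000-2000' (first '..' only)."""
    return loc.replace('..', '-', 1)


def cnvs_for_sample(xcnv_lines, sample, intervals):
    """Keep rows whose first column equals `sample` exactly and whose
    third column is one of `intervals`.

    Assumes tab-delimited rows laid out as SAMPLE, CNV, INTERVAL, ...
    """
    wanted = set(intervals)
    kept = []
    for line in xcnv_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 3 and fields[0] == sample and fields[2] in wanted:
            kept.append(line)
    return kept


# toy rows, for illustration only
rows = ['CLIA_400\tDEL\tchr1:1000-2000\t1.00',
        'CLIA_4001\tDEL\tchr1:1000-2000\t1.00',
        'CLIA_400\tDUP\tchr2:500-900\t0.40']
interval = pseq_to_xcnv_interval('chr1:1000..2000')
hits = cnvs_for_sample(rows, 'CLIA_400', [interval])
```

Here only the first toy row is selected, whereas a substring match on `CLIA_400` would also have picked up the `CLIA_4001` row.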

In [None]:
#terminal

cp ~/data/cnv/penncnv/wellknown_region_hg19 bad_regions.list
cp ~/data/cnv/penncnv/glist-hg19 .
cp ~/data/cnv/penncnv/genes.txt .
cp ~/data/cnv/penncnv/hg19_allenBrainGene_trimmed.txt .

for q in 50 60 70 80 90; do
    for cnvtype in all denovo inherited; do
        cnvname=${cnvtype}_q${q}.cnv
        plink --cnv-list $cnvname --cnv-make-map --noweb --out ${cnvtype}_q${q};
        
        # remove bad regions
        plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
            --noweb --1 --cnv-exclude bad_regions.list --cnv-overlap .5 \
            --cnv-write --out ${cnvtype}_q${q}_clean
        plink --cnv-list ${cnvtype}_q${q}_clean.cnv --cnv-make-map --noweb --1 \
            --out ${cnvtype}_q${q}_clean
        
        for qc in '' '_clean'; do
            cnvname=${cnvtype}_q${q}${qc}.cnv
            # whole burden
            plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                --noweb --1 --cnv-check-no-overlap --out ${cnvtype}_q${q}${qc}_burden;
            # gene sets
            plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                --noweb --1 --cnv-intersect glist-hg19 --cnv-verbose-report-regions \
                --cnv-subset genes.txt --out ${cnvtype}_q${q}${qc}_genes;
            plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                --noweb --1 --cnv-intersect glist-hg19 --cnv-verbose-report-regions \
                --cnv-subset hg19_allenBrainGene_trimmed.txt \
                --out ${cnvtype}_q${q}${qc}_brainGenes;
            # subtypes only
            for sub in del dup; do
                plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                --noweb --1 --cnv-${sub} --out ${cnvtype}_q${q}${qc}_${sub}Burden;
                # gene sets
                plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                    --noweb --1 --cnv-intersect glist-hg19 --cnv-verbose-report-regions \
                    --cnv-subset genes.txt --cnv-${sub} \
                    --out ${cnvtype}_q${q}${qc}_${sub}Genes;
                plink --map ${cnvname}.map --fam ../multiplex_nofamid.ped --cnv-list $cnvname \
                    --noweb --1 --cnv-intersect glist-hg19 --cnv-verbose-report-regions \
                    --cnv-subset hg19_allenBrainGene_trimmed.txt --cnv-${sub} \
                    --out ${cnvtype}_q${q}${qc}_${sub}BrainGenes;
            done;
        done;
    done;
done
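Since the nested loops above generate a lot of output files, a quick way to sanity-check them is to list the `--out` prefixes that the innermost PLINK calls should produce. This is a sketch that follows the naming pattern in the cell above; the function name `plink_output_prefixes` is hypothetical, and it does not cover the intermediate `--cnv-make-map` and `--cnv-write` outputs.

```python
from itertools import product


def plink_output_prefixes(qualities, cnv_types, qc_suffixes=('', '_clean')):
    """List the --out prefixes used by the burden and gene-set calls."""
    prefixes = []
    for q, cnvtype, qc in product(qualities, cnv_types, qc_suffixes):
        base = '%s_q%d%s' % (cnvtype, q, qc)
        # calls on all CNVs regardless of type
        for suffix in ('burden', 'genes', 'brainGenes'):
            prefixes.append('%s_%s' % (base, suffix))
        # calls restricted to deletions or duplications
        for sub in ('del', 'dup'):
            for suffix in ('Burden', 'Genes', 'BrainGenes'):
                prefixes.append('%s_%s%s' % (base, sub, suffix))
    return prefixes


prefixes = plink_output_prefixes([50, 60, 70, 80, 90],
                                 ['all', 'denovo', 'inherited'])
```

Comparing this list against what is actually in the output directory is an easy way to spot a PLINK call that failed silently.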

# junk

Checking if anyone got removed: