Here we assume that the XHMM data has already been QCed (see the note from the same day in the Python kernel). Let's use some PLINK/SEQ trickery to get the rest of the analysis working:

In [25]:
module load plinkseq
module load XHMM
module load plink/1.07

cd /data/NCR_SBRB/simplex/xhmm

[-] Unloading GSL 2.2.1 ...
[-] Unloading Graphviz v2.38.0 ...
[-] Unloading gdal 2.0 ...
[-] Unloading proj 4.9.2 ...
[-] Unloading gcc 4.9.1 ...
[-] Unloading openmpi 1.10.0 for GCC 4.9.1
[-] Unloading tcl_tk 8.6.3
[-] Unloading Zlib 1.2.8 ...
[-] Unloading Bzip2 1.0.6 ...
[-] Unloading pcre 8.38 ...
[-] Unloading liblzma 5.2.2 ...
[-] Unloading libjpeg-turbo 1.5.1 ...
[-] Unloading tiff 4.0.7 ...
[-] Unloading curl 7.46.0 ...
[-] Unloading boost libraries v1.65 ...
[-] Unloading R 3.4.0 on cn3682
[+] Loading GSL 2.2.1 ...
[+] Loading Graphviz v2.38.0 ...
[+] Loading gdal 2.0 ...
[+] Loading proj 4.9.2 ...
[+] Loading gcc 4.9.1 ...
[+] Loading openmpi 1.10.0 for GCC 4.9.1
[+] Loading tcl_tk 8.6.3
[+] Loading Zlib 1.2.8 ...
[+] Loading Bzip2 1.0.6 ...
[+] Loading pcre 8.38 ...
[+] Loading liblzma 5.2.2 ...
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[-] Unloading liblzma 5.2.2 ...
[+] Loading liblzma 5.2.2 ...
[+] Loading libjpeg-turbo 1.5.1 ...
[+] Loading tiff 4.0.7 ...


Again, following the directions in the XHMM protocol paper to create PLINK-friendly files:

In [27]:
grep "#CHROM" DATA.vcf | awk '{for (i = 10; i <= NF; i++) print $i,1,0,0,1,1}' > DATA.fam;
/usr/local/apps/XHMM/2016-01-04/sources/scripts/xcnv_to_cnv DATA.xcnv > DATA.cnv;
plink --cfile DATA --cnv-make-map --out DATA --noweb


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ DATA.log ]
Analysis started: Thu Nov 30 15:03:23 2017

Options in effect:
	--cfile DATA
	--cnv-make-map
	--out DATA
	--noweb


Reading segment list (CNVs) from [ DATA.cnv ]
Writing new MAP file to [ DATA.cnv.map ]
Wrote 14564 unique positions to file

Analysis finished: Thu Nov 30 15:03:24 2017
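
To sanity-check what the `awk` call in this cell writes to DATA.fam, here is a minimal sketch on a made-up `#CHROM` header. The function name `vcf_header_to_fam` and the sample IDs S1 to S3 are hypothetical; only the `grep`/`awk` pipeline is taken from the cell.

```shell
# Sketch: turn a VCF "#CHROM" header line into PLINK .fam rows.
# Function name and sample IDs are hypothetical, for illustration only.
vcf_header_to_fam() {
    # Columns 1-9 of the header are fixed VCF fields; samples start at column 10.
    # Each output row is: FID IID father mother sex phenotype.
    grep "#CHROM" | awk '{for (i = 10; i <= NF; i++) print $i,1,0,0,1,1}'
}

printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' \
    | vcf_header_to_fam
```

Every sample becomes its own family (FID = sample ID, IID = 1) with unknown parents, sex coded 1 (male) and phenotype coded 1 (unaffected), so the sex and phenotype columns are placeholders rather than real values.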



In [69]:
plink --cfile DATA --noweb --cnv-check-no-overlap


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ plink.log ]
Analysis started: Thu Nov 30 16:02:59 2017

Options in effect:
	--cfile DATA
	--noweb
	--cnv-check-no-overlap

Reading marker information from [ DATA.cnv.map ]
14564 (of 14564) markers to be included from [ DATA.cnv.map ]
Reading individual information from [ DATA.fam ]
Reading pedigree information from [ DATA.fam ] 
98 individuals read from [ DATA.fam ] 
98 individuals with nonmissing phenotypes
Assuming a disease phenotype (1

Looks good. Now let's keep only the CNVs that intersect transcript regions. First, the annotated targets need some filtering and reformatting:

In [84]:
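# Keep only rows whose second column is positive; column 3 becomes the
# fourth (misc) column of the region list, column 1 is the locus.
awk '$2 > 0 {print $3}' annotated_targets.refseq.loci > tmp_misc.txt;
awk '$2 > 0 {print $1}' annotated_targets.refseq.loci > tmp_pos.txt;
# Split each locus into end, chromosome (without the "chr" prefix), and start.
cut -d "." -f 3 tmp_pos.txt > tmp_bp2.txt;
cut -d ":" -f 1 tmp_pos.txt | cut -d "r" -f 2 > tmp_chr.txt;
cut -d ":" -f 2 tmp_pos.txt | cut -d "." -f 1 > tmp_bp1.txt;
# Assemble the CHR BP1 BP2 MISC file for --cnv-intersect.
paste tmp_chr.txt tmp_bp1.txt tmp_bp2.txt tmp_misc.txt > gene_locations.txt;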
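
To make the `cut` chain concrete, here is a small sketch that runs the same calls on a single invented locus. It assumes loci are written as `chrN:start..end`, which is what the delimiters in the cell imply; the function name `parse_locus` and the coordinates are hypothetical.

```shell
# Sketch: split a locus of the (assumed) form chrN:start..end into
# chromosome, start and end, using the same cut calls as the cell above.
# parse_locus is a hypothetical helper name.
parse_locus() {
    locus=$1
    chr=$(printf '%s\n' "$locus" | cut -d ":" -f 1 | cut -d "r" -f 2)  # text after "chr"
    bp1=$(printf '%s\n' "$locus" | cut -d ":" -f 2 | cut -d "." -f 1)  # start, before ".."
    bp2=$(printf '%s\n' "$locus" | cut -d "." -f 3)                    # end; ".." leaves field 2 empty
    printf '%s\t%s\t%s\n' "$chr" "$bp1" "$bp2"
}

parse_locus "chr7:1000..2000"
```

One caveat of splitting on "r": a contig name that contains a second "r" (for example one ending in `_random`) would be cut short, so this is only safe for plain `chr1` to `chr22`, `chrX` and `chrY` style names.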

In [88]:
plink --cfile DATA --noweb --cnv-disrupt --cnv-intersect gene_locations.txt \
    --cnv-write --out DATA.gene_disrupt


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ DATA.gene_disrupt.log ]
Analysis started: Thu Nov 30 16:31:51 2017

Options in effect:
	--cfile DATA
	--noweb
	--cnv-disrupt
	--cnv-intersect gene_locations.txt
	--cnv-write
	--out DATA.gene_disrupt

** For gPLINK compatibility, do not use '.' in --out **
Reading marker information from [ DATA.cnv.map ]
14564 (of 14564) markers to be included from [ DATA.cnv.map ]
Reading individual information from [ DATA.fam ]
Reading pedigree informatio

I don't want to filter on frequency yet, so let's jump straight to the de novo filtering. Keep in mind that I might need to adjust the sex of the trio members later! First, I need to create a VCF containing only the filtered CNVs from above:

In [103]:
cat DATA.gene_disrupt.cnv | awk '{if (NR>1) print $3":"$4"-"$5}' | sort | uniq > DATA.gene_disrupt.CNV_regions.txt

In [107]:
grep "^#" DATA.vcf > DATA.gene_disrupt.vcf;
grep -f DATA.gene_disrupt.CNV_regions.txt DATA.vcf | sort | uniq >> DATA.gene_disrupt.vcf
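
One thing worth checking here: `grep -f` treats every region as an unanchored regular expression, so a region string can also match longer IDs that contain it. The sketch below shows the effect on invented records and compares it with an exact match on the ID column. It assumes the VCF ID column (column 3) holds `chrN:start-end` strings, which I have not verified for DATA.vcf; the function names and file contents are hypothetical.

```shell
# Sketch on invented records; assumes the VCF ID column (column 3) holds
# chrN:start-end strings. Function names and file contents are hypothetical.

# Unanchored regex match, as in the cell above.
loose_match() { grep -f "$1" "$2"; }

# Exact match of "chr" + region against the ID column only.
exact_match() { awk 'NR == FNR {keep["chr" $1]; next} ($3 in keep)' "$1" "$2"; }

tmpdir=$(mktemp -d)
printf '1:100-200\n' > "$tmpdir/regions.txt"
printf 'chr1\t100\tchr1:100-200\nchr11\t100\tchr11:100-200\nchr1\t100\tchr1:100-2000\n' \
    > "$tmpdir/toy.vcf"

loose_match "$tmpdir/regions.txt" "$tmpdir/toy.vcf"   # keeps all three records
exact_match "$tmpdir/regions.txt" "$tmpdir/toy.vcf"   # keeps only chr1:100-200

rm -r "$tmpdir"
```

Whether this matters for the real data depends on the actual ID format and on how similar the region coordinates are, so it is a check to run rather than a known problem.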

In [115]:
pseq DATA_gene_disrupt new-project

Creating new project specification file [ DATA_gene_disrupt.pseq ]


In [116]:
pseq DATA_gene_disrupt load-pedigree --file /data/NCR_SBRB/simplex/simplex.ped
pseq DATA_gene_disrupt load-vcf --vcf DATA.gene_disrupt.vcf

Inserted 24 new individuals, updated 75 existing individuals
loading : /gpfs/gsfs8/users/NCR_SBRB/simplex/xhmm/DATA.gene_disrupt.vcf ( 98 individuals )
parsed 5000 rows        
/gpfs/gsfs8/users/NCR_SBRB/simplex/xhmm/DATA.gene_disrupt.vcf : inserted 5050 variants


In [119]:
pseq DATA_gene_disrupt cnv-denovo --mask --noweb reg.ex=chrX,chrY --minSQ 60 --minNQ 60 --out DATA_gene_disrupt

-------------------------------------------------------------------------------
||||||||||||||||||||||||||| PSEQ (v0.10; 14-Jul-14) |||||||||||||||||||||||||||
-------------------------------------------------------------------------------

Copying this log to file [ DATA_gene_disrupt.log ]
Analysis started Thu Nov 30 16:56:59 2017

-------------------------------------------------------------------------------

Project : DATA_gene_disrupt
Command : cnv-denovo
Options : --mask
          --noweb reg.ex=chrX,chrY
          --minSQ 60
          --minNQ 60
          --out DATA_gene_disrupt

-------------------------------------------------------------------------------

Writing to file [ DATA_gene_disrupt.denovo.cnv ]  : per-site output from cnv-denovo
Writing to file [ DATA_gene_disrupt.denovo.cnv.indiv ]  : per-trio output from cnv-denovo
Starting CNV de novo scan...
Included 5050 of 5050 variants considered

Analysis finished Thu Nov 30 16:57:03 2017

# TODO

* analyze resulting files