# Table of contents

## [1. Import hail, other libraries and data](#1)

[1.1 import phenotype and pedigree data](#1.1)

[1.2 Annotate mt with phenotype and pedigree info](#1.2)

## [2. Explore MatrixTable, collect field descriptions](#2)

   [2.1 Removing the star alleles](#2.1)
   
   [2.1.5 Filter out outliers based on PCA](#2.1.5)
   
   [2.2 Creating a mt_p with patients only and mt_c with non-patients](#2.2)


## [3. Explore ClinVar pathogenic variants](#3)

[3.1 Filter out pathogenic variants with gnomAD AF > 0.001 and those that occur in controls](#3.1)

## [4. Filter for variants in genes associated with GTS ](#4)

[4.1 List of 260 genes enriched in basal ganglia](#4.1)

[4.2 Lists of genes associated with HPO phenotypes](#4.2)


## [5. Find compound / double hets](#5)

[5.1 collect the compound hets into a dataframe](#5.1)

## [6. Family investigations](#6)

## [7. NUP214 investigation](#7)

<a id='1'></a> 
## 1. Import hail, other libraries and data

Always run this cell to widen the notebook:

In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
# di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
import hail as hl
hl.init() 

Running on Apache Spark version 2.4.1
SparkUI available at http://8c03623a3bed:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.28-61941242c15d
LOGGING: writing to /hail/hail-20191212-1458-0.2.28-61941242c15d.log


In [3]:
from hail.plot import show
from pprint import pprint
from bokeh.layouts import gridplot
hl.plot.output_notebook()

import numpy as np
import pandas as pd
from functools import reduce
from itertools import chain

In [17]:
hl.import_vcf('data/vcf/sample.annotated-with-functional-annotations-gene-level.vcf', reference_genome='GRCh38').write('data/sample.mt', overwrite=True)

2019-12-11 15:04:17 Hail: INFO: Ordering unsorted dataset with network shuffle
2019-12-11 15:04:47 Hail: INFO: wrote matrix table with 726380 rows and 151 columns in 116 partitions to data/sample.mt


In [4]:
mt = hl.read_matrix_table('data/sample.mt') # mt stands for MatrixTable

<a id='1.1'></a>
## 1.1 Import phenotype and pedigree data

In [5]:
pheno = hl.import_table('GTS-coded.csv', delimiter = ',', impute = True, key = 'ID')

2019-12-12 14:58:47 Hail: INFO: Reading table to impute column types
2019-12-12 14:58:47 Hail: INFO: Finished type imputation
  Loading column 'ID' as type 'str' (imputed)
  Loading column 'family' as type 'str' (imputed)
  Loading column 'sex' as type 'str' (imputed)
  Loading column 'kinship' as type 'str' (imputed)
  Loading column 'disease' as type 'str' (imputed)
  Loading column 'phenotype' as type 'str' (imputed)
  Loading column 'add_pheno' as type 'str' (imputed)
  Loading column 'heavy_tics' as type 'str' (imputed)


<a id='1.2'></a>

## 1.2 Annotate mt with phenotype and pedigree info

In [6]:
mt = mt.annotate_cols(phenotypes = pheno[mt.s])

<a id='2'></a> 
## 2. Explore MatrixTable, collect field descriptions

<a id='2.1'></a>

## 2.1 Removing the star alleles

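Star (`*`) alleles mark a position that is spanned by an upstream deletion. The ones in this dataset are orphaned and should not be here, so they are removed.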



In [7]:
mt = mt.filter_rows(mt.alleles.contains('*'), keep = False)
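
As a plain-Python illustration of what the row filter above does (no Hail involved; the `rows` list and its values are invented toy data, and `drop_star_alleles` is a hypothetical helper), a variant is dropped whenever any of its alleles is the `*` placeholder:

```python
# Toy rows standing in for MatrixTable rows; all values are invented.
rows = [
    {"locus": "chr1:1000", "alleles": ["A", "G"]},
    {"locus": "chr1:2000", "alleles": ["T", "*"]},
    {"locus": "chr1:3000", "alleles": ["C", "CT", "*"]},
]


def drop_star_alleles(rows):
    """Keep only rows whose allele list does not contain the '*' allele."""
    return [row for row in rows if "*" not in row["alleles"]]


kept = drop_star_alleles(rows)
print([row["locus"] for row in kept])
```

This mirrors `filter_rows(..., keep = False)`: rows matching the condition are removed, all others are kept.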

<a id='2.1.5'></a>

### 2.1.5 Filter out outliers based on PCA

In [8]:
mt = mt.filter_cols(mt.s != 'WGS_139', keep = True)
mt = mt.filter_cols(mt.s != 'WGS_D6816', keep = True)

In [23]:
eigenvalues, pcs, _ = hl.hwe_normalized_pca(mt.GT)

2019-12-11 15:04:49 Hail: INFO: hwe_normalized_pca: running PCA using 42468 variants.
2019-12-11 15:04:52 Hail: INFO: pca: running PCA with 10 components...
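
For orientation, the normalization that gives `hwe_normalized_pca` its name can be sketched in plain Python. This is the textbook Hardy-Weinberg scaling for a single variant, applied to invented genotype counts. Hail's implementation may apply additional scaling and has its own handling of missing calls, so treat this as an illustration under those assumptions, not as Hail's code. `hwe_normalize` is a hypothetical helper name.

```python
import math


def hwe_normalize(alt_counts):
    """Center and scale one variant's alt-allele counts (0, 1 or 2 per sample)."""
    n = len(alt_counts)
    p = sum(alt_counts) / (2 * n)      # estimated alt-allele frequency
    if p == 0.0 or p == 1.0:
        raise ValueError("monomorphic variant cannot be normalized")
    sd = math.sqrt(2 * p * (1 - p))    # expected standard deviation under HWE
    return [(g - 2 * p) / sd for g in alt_counts]


toy_counts = [0, 1, 2, 1]              # invented genotypes for four samples
normalized = hwe_normalize(toy_counts)
print(normalized)
```

After this step every variant has mean zero, so common and rare variants contribute on a comparable scale to the principal components.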


In [24]:
mt = mt.annotate_cols(scores = pcs[mt.s].scores)

In [25]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=mt.phenotypes.family,
                    title='PCA', xlabel='PC1', ylabel='PC2')
show(p)

In [27]:
mt.count()

(47373, 149)

In [28]:
mt.write('data/clean.mt', overwrite = True)  # save the filtered MatrixTable, replacing any existing copy

2019-12-11 15:05:34 Hail: INFO: wrote matrix table with 47373 rows and 149 columns in 116 partitions to data/clean.mt


<a id='2.2'></a>

## 2.2 Creating a mt_p with patients only and mt_c with non-patients

In [9]:
mt_p = mt.filter_cols(mt.phenotypes.disease == 'YES', keep = True)
mt_p = mt_p.filter_rows(hl.agg.any(mt_p.GT.is_non_ref())) #filtering out variants that do not occur in any patients
mt_c = mt.filter_cols(mt.phenotypes.disease != 'YES', keep = True)
mt_c = mt_c.filter_rows(hl.agg.any(mt_c.GT.is_non_ref())) #filtering out variants that do not occur in any controls

<a id='3'></a> 

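## 3. Explore ClinVar pathogenic variants

Select variants whose aggregated ClinVar significance is pathogenic or likely pathogenic, separately for patient samples (`patho_p`) and control samples (`patho_c`).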

In [19]:
p = (mt_p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic'))
c = (mt_p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic/likely_pathogenic'))
l = (mt_p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('likely_pathogenic'))


patho_p = mt_p.filter_rows(p | c | l)

p = (mt_c.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic'))
c = (mt_c.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic/likely_pathogenic'))
l = (mt_c.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('likely_pathogenic'))


patho_c = mt_c.filter_rows(p | c | l)

<a id='3.1'></a> 


### 3.1 Filter out pathogenic variants with gnomAD AF > 0.001 and those that occur in controls


In [20]:
AF_nfe = hl.float64(hl.delimit(patho_p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))

patho_p_rare = patho_p.filter_rows(AF_nfe < 0.001)
patho_p_rare = patho_p_rare.anti_join_rows(patho_c.rows()) #remove variants that occur in controls

patho_p_rare = hl.variant_qc(patho_p_rare)

p = patho_p_rare.filter_cols(hl.agg.any(patho_p_rare.GT.is_non_ref()))


summary = dict()
fields = [p.s, p.locus, p.GT, p.rsid, p.alleles, p.phenotypes.family, p.phenotypes.sex, p.phenotypes.kinship,
          p.phenotypes.add_pheno, p.phenotypes.heavy_tics , p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe, 
          p.info.ISEQ_GENES_NAMES, p.info.ISEQ_CLINVAR_ALLELE_ID, p.info.ISEQ_CLINVAR_DISEASES, 
          p.info.ISEQ_HPO_INHERITANCE, p.info.ISEQ_HPO_PHENOTYPES, p.info.ISEQ_HPO_DISEASES,
          p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, p.info.ANN ]
field_names = ['sample', 'locus', 'genotype', 
               'rsid', 'alleles', 'family', 'sex', 'kinship', 'additional_pheno', 'heavy_tics', 
               'GNOMAD_V3_AF_non_finn_eur', 'Gene', 'CLINVAR_ALLELE_ID', 'CLINVAR_DISEASES', 'HPO_INHERITANCE', 
               'HPO_PHENOTYPES', 'HPO_DISEASES',
               'AGGREGATED_CLINVAR_SIGNIFICANCE', 'SnpEff']


for each, each_name in zip(fields, field_names):
    key = each_name
    summary[key] = p.aggregate_entries(hl.agg.filter(p.GT.is_non_ref(), hl.agg.collect(each)))

patho_vars_df = pd.DataFrame(summary)
patho_vars_df.to_csv('pathogenic_variants_patients_summary.csv')
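
The two filtering steps in the cell above (an allele-frequency cut-off followed by an anti-join against the control rows) can be sketched without Hail. The dictionaries below are invented toy data keyed by `(locus, alleles)`, and `rare_patient_only` is a hypothetical helper, not part of the notebook:

```python
# Toy variant records keyed by (locus, alleles); all values are invented.
patient_variants = {
    ("chr1:100", ("A", "G")): 0.0002,
    ("chr2:200", ("C", "T")): 0.05,      # too common
    ("chr3:300", ("G", "A")): 0.0001,    # rare, but also seen in controls
}
control_keys = {("chr3:300", ("G", "A"))}


def rare_patient_only(variants, control_keys, max_af=0.001):
    """Keep variants with AF below max_af whose key is absent from controls."""
    return {
        key: af
        for key, af in variants.items()
        if af < max_af and key not in control_keys
    }


result = rare_patient_only(patient_variants, control_keys)
print(result)
```

The anti-join only compares row keys, so a variant is excluded as soon as it appears among the control rows, regardless of how many controls carry it.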

<a id='4'></a> 

## 4. Filter for variants in genes associated with GTS

In [11]:
#hand-made gene list for Tourette associated genes
gene_list = ['PANK2', 'COL27A1', 'PDGFB', 'CELSR3', 'OPA1', 'FBN2', 'WWC1', 'NIPBL', 
             'FN1', 'FBN2', 'SLITRK1', 'SLITRK2', 'SLITRK3', 'SLITRK4', 'SLITRK5', 'SLITRK6', 
             'HDC', 'OPRK1', 'PCDH10', 'NTSR2', 'OPRK1', 'CHD8', 'SCUBE1', 'PNKD', 'CNTNAP2', 'MOG', 
             'DRD2', 'DRD3', 'DRD4', 'DRD5', 'DAT1', 'DBH', 'HTR2A', 'TPH2', 'EAAT1', 'SAPAP3',
            'CTNNA3', 'NLGN4', 'FSCB', 'IMMP2L', 'NRXN1', 'AADAC', 'DBH', 'MAOA', 'HTR1A', 'HTR2C', 'SLC6A4',
             'TPH2', 'COL27A1', '5-HTTLPR', 'EAAT1', 'COL8A1', 'KCNE1', 'KCNE2'
            ]        

In [1767]:
mt_f = mt_p.filter_rows(hl.any(lambda x: hl.literal(gene_list).contains(x), mt_p.info.ISEQ_GENES_NAMES))

filter out common variants:

In [1768]:
AF_nfe = hl.float64(hl.delimit(mt_f.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
mt_f = mt_f.filter_rows(AF_nfe < 0.001)

filter out variants that occur in controls:

In [1769]:
mt_f_contr = mt_c.filter_rows(hl.any(lambda x: hl.literal(gene_list).contains(x), mt_c.info.ISEQ_GENES_NAMES))

AF_nfe = hl.float64(hl.delimit(mt_f_contr.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
mt_f_contr = mt_f_contr.filter_rows(AF_nfe < 0.001)

mt_f = mt_f.anti_join_rows(mt_f_contr.rows())

In [1770]:
mt_f.aggregate_rows(hl.agg.explode(lambda element: hl.agg.counter(element), mt_f.info.ISEQ_GENES_NAMES))

{'SLITRK6': 1,
 'FBN2': 2,
 'SLITRK3': 2,
 'CHD8': 3,
 'NTSR2': 1,
 'MOG': 1,
 'SCUBE1': 3,
 'NIPBL': 3,
 'PCDH10': 1,
 'DRD3': 1,
 'COL27A1': 2,
 'FN1': 6,
 'DRD4': 1,
 'OPA1': 1}
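
The per-gene tally above combines `hl.agg.explode` (flatten each variant's list of gene names) with `hl.agg.counter` (count each name). A minimal plain-Python equivalent, using invented annotations and a hypothetical helper name, could look like this:

```python
from collections import Counter

# Toy per-variant gene annotations; a variant may list several genes.
genes_per_variant = [["FN1"], ["FN1", "GENE_X"], ["CHD8"], ["FN1"]]


def count_variants_per_gene(genes_per_variant):
    """Flatten the per-variant gene lists and count occurrences of each name."""
    return Counter(gene for genes in genes_per_variant for gene in genes)


counts = count_variants_per_gene(genes_per_variant)
print(dict(counts))
```

A variant annotated with two genes contributes to both counts, so the totals can exceed the number of rows.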

In [1771]:
p = mt_f.filter_cols(hl.agg.any(mt_f.GT.is_non_ref()))

summary = dict()
fields = [p.s, p.locus, p.GT, p.rsid, p.alleles, p.phenotypes.family, p.phenotypes.sex, p.phenotypes.kinship,
          p.phenotypes.add_pheno, p.phenotypes.heavy_tics , p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe, 
          p.info.ISEQ_GENES_NAMES, p.info.ISEQ_CLINVAR_ALLELE_ID, p.info.ISEQ_CLINVAR_DISEASES, 
          p.info.ISEQ_HPO_INHERITANCE, p.info.ISEQ_HPO_PHENOTYPES, p.info.ISEQ_HPO_DISEASES,
          p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, p.info.ANN ]
field_names = ['sample', 'locus', 'genotype', 
               'rsid', 'alleles', 'family', 'sex', 'kinship', 'additional_pheno', 'heavy_tics', 
               'GNOMAD_V3_AF_non_finn_eur', 'Gene', 'CLINVAR_ALLELE_ID', 'CLINVAR_DISEASES', 'HPO_INHERITANCE', 
               'HPO_PHENOTYPES', 'HPO_DISEASES',
               'AGGREGATED_CLINVAR_SIGNIFICANCE', 'SnpEff']

for each, each_name in zip(fields, field_names):
    key = each_name
    summary[key] = p.aggregate_entries(hl.agg.filter(p.GT.is_non_ref(), hl.agg.collect(each)))

vars_df = pd.DataFrame(summary)

vars_df.to_csv('tourette_gene_list_variants_patients_summary.csv')

<a id='4.1'></a> 

### 4.1 List of 260 genes enriched in basal ganglia

https://www.proteinatlas.org/search/brain_category_rna%3Abasal+ganglia%3BRegion+enriched%2CGroup+enriched%2CRegion+enhanced+AND+sort_by%3Atissue+specific+score

In [13]:
with open('./gts_gene_lists/brain_category_rna_basal.tsv') as handle:
    bg_genes = [line.rstrip() for line in handle]

mt_p.filter_rows(hl.any(lambda x: hl.literal(bg_genes).contains(x), mt_p.info.ISEQ_GENES_NAMES)).count()
mt_f = mt_p.filter_rows(hl.any(lambda x: hl.literal(bg_genes).contains(x), mt_p.info.ISEQ_GENES_NAMES))

filter out common variants:

In [1774]:
AF_nfe = hl.float64(hl.delimit(mt_f.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
mt_f = mt_f.filter_rows(AF_nfe < 0.0001)

filter out variants that occur in controls:

In [1775]:
mt_f_contr = mt_c.filter_rows(hl.any(lambda x: hl.literal(bg_genes).contains(x), mt_c.info.ISEQ_GENES_NAMES))

AF_nfe = hl.float64(hl.delimit(mt_f_contr.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
mt_f_contr = mt_f_contr.filter_rows(AF_nfe < 0.0001)

mt_f = mt_f.anti_join_rows(mt_f_contr.rows())

In [1776]:
p = mt_f.filter_cols(hl.agg.any(mt_f.GT.is_non_ref()))

summary = dict()
fields = [p.s, p.locus, p.GT, p.rsid, p.alleles, p.phenotypes.family, p.phenotypes.sex, p.phenotypes.kinship,
          p.phenotypes.add_pheno, p.phenotypes.heavy_tics , p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe, 
          p.info.ISEQ_GENES_NAMES, p.info.ISEQ_CLINVAR_ALLELE_ID, p.info.ISEQ_CLINVAR_DISEASES, 
          p.info.ISEQ_HPO_INHERITANCE, p.info.ISEQ_HPO_PHENOTYPES, p.info.ISEQ_HPO_DISEASES,
          p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, p.info.ANN ]
field_names = ['sample', 'locus', 'genotype', 
               'rsid', 'alleles', 'family', 'sex', 'kinship', 'additional_pheno', 'heavy_tics', 
               'GNOMAD_V3_AF_non_finn_eur', 'Gene', 'CLINVAR_ALLELE_ID', 'CLINVAR_DISEASES', 'HPO_INHERITANCE', 
               'HPO_PHENOTYPES', 'HPO_DISEASES',
               'AGGREGATED_CLINVAR_SIGNIFICANCE', 'SnpEff']

for each, each_name in zip(fields, field_names):
    key = each_name
    summary[key] = p.aggregate_entries(hl.agg.filter(p.GT.is_non_ref(), hl.agg.collect(each)))

vars_df = pd.DataFrame(summary)
vars_df.describe()
vars_df.to_csv('basal_ganglia_genelist.csv')

<a id='4.2'></a> 

### 4.2 Lists of genes associated with HPO phenotypes

In [14]:
file_list = !ls ./gts_gene_lists/*csv

In [15]:
file_list[0:2]

['./gts_gene_lists/adhd.csv', './gts_gene_lists/agg_beh.csv']

In [16]:
file_list[2:]

['./gts_gene_lists/echolalia.csv',
 './gts_gene_lists/inv_mov.csv',
 './gts_gene_lists/motor_tics.csv',
 './gts_gene_lists/neuro_dev.csv',
 './gts_gene_lists/ocd_beh.csv',
 './gts_gene_lists/phonic_tics.csv',
 './gts_gene_lists/self_mut.csv',
 './gts_gene_lists/tics.csv']

In [1762]:
for a_file in file_list[2:]:
    
    mt_p_only = mt_p.anti_join_rows(mt_c.rows())
    AF_nfe = hl.float64(hl.delimit(mt_p_only.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
    
    genes = []
    df = pd.read_csv(a_file)
    genes = list(df['GENE_SYMBOL'])
  
    
    if len(genes) > 150:
        mt_p_only = mt_p_only.filter_rows(AF_nfe < 0.0001)   
    else:
        mt_p_only = mt_p_only.filter_rows(AF_nfe < 0.001)
        
    mtx = mt_p_only.filter_rows(hl.any(lambda x: hl.literal(genes).contains(x), mt_p_only.info.ISEQ_GENES_NAMES)) 
    mtx = mtx.filter_cols(hl.agg.any(mtx.GT.is_non_ref()))
    
    summary = dict()
    fields = [mtx.s, mtx.locus, mtx.GT, mtx.GQ, mtx.rsid, mtx.alleles, mtx.phenotypes.family, mtx.phenotypes.sex, mtx.phenotypes.kinship,
              mtx.phenotypes.add_pheno, mtx.phenotypes.heavy_tics , mtx.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe, 
              mtx.info.ISEQ_GENES_NAMES, mtx.info.ISEQ_CLINVAR_ALLELE_ID, mtx.info.ISEQ_CLINVAR_DISEASES, 
              mtx.info.ISEQ_HPO_INHERITANCE, mtx.info.ISEQ_HPO_PHENOTYPES, mtx.info.ISEQ_HPO_DISEASES,
              mtx.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, mtx.info.ANN ]
    field_names = ['sample', 'locus', 'genotype', 'GQ',
                   'rsid', 'alleles', 'family', 'sex', 'kinship', 'additional_pheno', 'heavy_tics', 
                   'GNOMAD_V3_AF_non_finn_eur', 'Gene', 'CLINVAR_ALLELE_ID', 'CLINVAR_DISEASES', 'HPO_INHERITANCE', 
                   'HPO_PHENOTYPES', 'HPO_DISEASES',
                   'AGGREGATED_CLINVAR_SIGNIFICANCE', 'SnpEff']

    for each, each_name in zip(fields, field_names):
        key = each_name
        summary[key] = mtx.aggregate_entries(hl.agg.filter(mtx.GT.is_non_ref(), hl.agg.collect(each)))

    vars_df = pd.DataFrame(summary)
    vars_df.to_csv('sum_'+a_file.split('/')[2])
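
Two small decisions are embedded in the loop above: the allele-frequency cut-off depends on the length of the gene list, and the output file name is derived from the input path. A hedged sketch of both, with hypothetical helper names (`af_threshold`, `summary_name`) and `os.path.basename` standing in for the `split('/')[2]` used in the cell:

```python
import os


def af_threshold(n_genes, large_list=150):
    """Stricter allele-frequency cut-off for gene lists longer than large_list."""
    return 0.0001 if n_genes > large_list else 0.001


def summary_name(path):
    """Output file name: 'sum_' plus the input file's base name."""
    return "sum_" + os.path.basename(path)


print(af_threshold(151), af_threshold(150))
print(summary_name("./gts_gene_lists/tics.csv"))
```

Using the base name keeps the naming independent of how deeply the gene lists are nested in the directory tree.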

<a id='5'></a>

## 5. Find compound heterozygotes / double hets


In [17]:
!ls ./gts_gene_lists/*csv

./gts_gene_lists/adhd.csv	 ./gts_gene_lists/neuro_dev.csv
./gts_gene_lists/agg_beh.csv	 ./gts_gene_lists/ocd_beh.csv
./gts_gene_lists/echolalia.csv	 ./gts_gene_lists/phonic_tics.csv
./gts_gene_lists/inv_mov.csv	 ./gts_gene_lists/self_mut.csv
./gts_gene_lists/motor_tics.csv  ./gts_gene_lists/tics.csv


In [21]:
#get a long list of genes:

with open('./gts_gene_lists/brain_category_rna_basal.tsv') as handle:  # note: this file has changed location
    bg_genes = [line.rstrip() for line in handle]

patho_list = list(patho_vars_df['Gene'])
patho_list = list(chain.from_iterable(patho_list))
patho_list = [l.split(',') for l in ','.join(patho_list).split(':')]
patho_list = list(chain.from_iterable(patho_list))
patho_list = [l.split(',') for l in ','.join(patho_list).split('-')]
patho_list = list(chain.from_iterable(patho_list))

genes = []
file_list = !ls ./gts_gene_lists/*csv

for a_file in file_list:
    df = pd.read_csv(a_file)
    genes = genes + list(df['GENE_SYMBOL'])

many_genes = genes + gene_list + bg_genes
many_genes = set(many_genes)

to_remove = ['RAI1','TP53','GP1BA','TRPV3','ENTPD1','PCK1','TNK1','PRKRA','SCN4A', 'IDH2', 'HSPA9','PAH', 'PRPF3']

for x in to_remove:
    many_genes.remove(x)
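
The chain of `join`/`split` calls that builds `patho_list` in the cell above amounts to flattening the per-variant gene lists and splitting every name on `,`, `:` and `-`. A compact sketch of the same idea, using invented names and a hypothetical helper, is shown below. Note that in the cell above `patho_list` is built but not added to `many_genes`.

```python
import re
from itertools import chain


def split_gene_names(nested_names):
    """Flatten lists of names, then split every name on ',', ':' and '-'."""
    flat = chain.from_iterable(nested_names)
    return re.split(r"[,:\-]", ",".join(flat))


toy = [["GENE_A:GENE_B"], ["GENE_C-GENE_D", "GENE_E"]]   # invented annotations
names = split_gene_names(toy)
print(names)
```

One caveat of splitting on `-`: symbols that legitimately contain a hyphen, such as `HLA-A`, are split as well.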

In [22]:
len(many_genes)

2566

In [23]:
import pickle

with open('./py_objects/many_genes', 'wb') as l:
    pickle.dump(many_genes, l)
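
To reuse the pickled gene set in a later session, it can be read back with `pickle.load`. The round trip below is a self-contained sketch that writes to a temporary directory; `toy_genes` and the file location are stand-ins, not the notebook's actual `many_genes` or `./py_objects` path.

```python
import os
import pickle
import tempfile

toy_genes = {"GENE_A", "GENE_B"}   # stand-in for many_genes

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "many_genes")   # hypothetical location
    with open(path, "wb") as handle:
        pickle.dump(toy_genes, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_genes)
```

Pickle files should only be loaded from trusted sources, since loading can execute arbitrary code.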

<a id='5.1'></a>

### 5.1 Filter the large matrix for the list of genes and collect into a dataframe

In [1852]:
mt_g = mt.filter_rows(hl.any(lambda x: hl.literal(many_genes).contains(x), mt.info.ISEQ_GENES_NAMES))
mt_g = mt_g.filter_rows(mt_g.info.ISEQ_HIGHEST_IMPACT.contains('HIGH'))

samples = mt_g.s.collect()

In [1910]:
dfs = []

for sample in samples:
    mt_s = mt_g.filter_cols(mt_g.s == sample)
    mt_s = mt_s.filter_rows(hl.agg.any(mt_s.GT.is_non_ref()))

    genes = mt_s.aggregate_rows(hl.agg.explode(lambda element: hl.agg.counter(element), mt_s.info.ISEQ_GENES_NAMES))
    genes = dict(filter(lambda elem: elem[1] > 1,genes.items()))
    genes = list(genes.keys())

    if not genes:
        print('sample '+sample+' has no genes on this list.')
        continue
    
    mtx = mt_s.filter_rows(hl.any(lambda x: hl.literal(genes).contains(x), mt_s.info.ISEQ_GENES_NAMES))

    summary = dict()
    fields = [mtx.s, mtx.locus, mtx.GT, mtx.GQ, mtx.rsid, mtx.alleles, mtx.phenotypes.family, mtx.phenotypes.disease, mtx.phenotypes.sex, mtx.phenotypes.kinship,
              mtx.phenotypes.add_pheno, mtx.phenotypes.heavy_tics , mtx.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe,
              mtx.info.ISEQ_GENES_NAMES, mtx.info.ISEQ_CLINVAR_ALLELE_ID, mtx.info.ISEQ_CLINVAR_DISEASES,
              mtx.info.ISEQ_HPO_INHERITANCE, mtx.info.ISEQ_HPO_PHENOTYPES, mtx.info.ISEQ_HPO_DISEASES,
              mtx.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, mtx.info.ANN ]
    field_names = ['sample', 'locus', 'genotype', 'GQ', 'rsid', 'alleles', 'family', 'disease', 'sex', 'kinship', 'additional_pheno', 'heavy_tics',
               'GNOMAD_V3_AF_non_finn_eur', 'Gene', 'CLINVAR_ALLELE_ID', 'CLINVAR_DISEASES', 'HPO_INHERITANCE', 'HPO_PHENOTYPES', 'HPO_DISEASES',
               'AGGREGATED_CLINVAR_SIGNIFICANCE', 'SnpEff']

    for each, each_name in zip(fields, field_names):
        key = each_name
        summary[key] = mtx.aggregate_entries(hl.agg.filter(mtx.GT.is_non_ref(), hl.agg.collect(each)))
    dfs.append(pd.DataFrame(summary))

dfs = pd.concat(dfs)

dfs.to_csv('high_double_all_samples.csv')

sample S_7227 has no genes on this list.
sample S_7261 has no genes on this list.
sample WGS_122 has no genes on this list.
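
The per-sample logic in the loop above keeps genes that are annotated on more than one of the sample's non-reference variants. A plain-Python sketch with invented calls and a hypothetical helper name:

```python
from collections import Counter

# Toy non-reference calls for one sample: (variant id, genes annotated on it).
sample_calls = [
    ("var1", ["GENE_A"]),
    ("var2", ["GENE_A"]),
    ("var3", ["GENE_B"]),
]


def genes_with_multiple_hits(calls, min_hits=2):
    """Return genes annotated on at least min_hits of the sample's variants."""
    counts = Counter(gene for _, genes in calls for gene in genes)
    return sorted(gene for gene, n in counts.items() if n >= min_hits)


candidates = genes_with_multiple_hits(sample_calls)
print(candidates)
```

Two hits in one gene are only candidates for a compound heterozygote: without phasing or parental genotypes it is not known whether the two variants lie on different copies of the gene.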


<a id='6'></a>


## 6. Family investigations

In [2303]:
mt_g = mt.filter_rows(hl.any(lambda x: hl.literal(many_genes).contains(x), mt.info.ISEQ_GENES_NAMES))
mt_g_h = mt_g.filter_rows(mt_g.info.ISEQ_HIGHEST_IMPACT.contains('HIGH'))

mt_f = mt_g.filter_cols(mt_g.phenotypes.family != '.')
mt_f = mt_f.filter_rows(hl.agg.any(mt_f.GT.is_non_ref()))

AF_nfe = hl.float64(hl.delimit(mt_f.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe))
mt_f = mt_f.filter_rows(AF_nfe < 0.1)

In [2304]:
mt_f.count()

(3065, 109)

In [2305]:
# for people that have families, collect those that share variants with some double/compound hets

families = list(set(mt_f.phenotypes.family.collect()))

In [2306]:
families.sort()

In [2308]:
mtxs = []

for fam in families:
    # filter the large matrix with medium and high variants for each family:
    mt_fs = mt_f.filter_cols(mt_f.phenotypes.family == fam)
    mt_fs = mt_fs.filter_rows(hl.agg.any(mt_fs.GT.is_non_ref()))
    
    # get the genes for which proband is high:
    mt_prob = mt_g_h.filter_cols(mt_g_h.phenotypes.family == fam)
    mt_prob = mt_prob.filter_rows(hl.agg.any(mt_prob.GT.is_non_ref()))
    mt_prob = mt_prob.filter_cols(mt_prob.phenotypes.kinship == 'P')
    mt_prob = mt_prob.filter_rows(hl.agg.any(mt_prob.GT.is_non_ref()))
    
    genes = mt_prob.aggregate_rows(hl.agg.explode(lambda element: hl.agg.counter(element), mt_prob.info.ISEQ_GENES_NAMES))
    genes = list(genes.keys())
    
    if not genes:
        print('samples in fam'+fam+' have no genes on this list.')
        continue
    
    mtx = mt_fs.filter_rows(hl.any(lambda x: hl.literal(genes).contains(x), mt_fs.info.ISEQ_GENES_NAMES))
    mtxs.append(mtx)
    
    

samples in famP have no genes on this list.
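
The family loop above works in two steps: collect the genes in which the proband (kinship `'P'`) carries a HIGH-impact variant, then keep every family variant annotated with one of those genes. The sketch below reproduces that idea on invented data. The record layout, the `'M'`/`'F'` kinship codes and the helper names are assumptions for illustration only.

```python
# Toy family: sample -> kinship code ('P' marks the proband, as in the loop above).
kinship = {"s1": "P", "s2": "M", "s3": "F"}

# Toy variants with annotated genes, impact and carrier samples (all invented).
variants = [
    {"id": "var1", "genes": ["GENE_A"], "impact": "HIGH", "carriers": {"s1", "s2"}},
    {"id": "var2", "genes": ["GENE_A"], "impact": "MODERATE", "carriers": {"s3"}},
    {"id": "var3", "genes": ["GENE_B"], "impact": "HIGH", "carriers": {"s2"}},
]


def proband_high_genes(variants, kinship):
    """Genes in which a proband carries at least one HIGH-impact variant."""
    probands = {sample for sample, code in kinship.items() if code == "P"}
    return {
        gene
        for v in variants
        if v["impact"] == "HIGH" and v["carriers"] & probands
        for gene in v["genes"]
    }


def family_variants_in_genes(variants, genes):
    """Ids of all family variants annotated with any of the given genes."""
    return [v["id"] for v in variants if genes.intersection(v["genes"])]


genes = proband_high_genes(variants, kinship)
selected = family_variants_in_genes(variants, genes)
print(genes, selected)
```

The second step deliberately ignores impact, so lower-impact variants carried by relatives in the same gene are retained for comparison with the proband.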


Notes: the gene TBCD appears in three families (dfs[6], dfs[0]); NF1 appears in family I, along with SYNGAP1; in families J and K: COG1 (chr 17; this has already come up somewhere before) and MTMR2; in family T and others: DMPK.

In [2309]:
genes_fam = []
genes_patients = []
genes_controls = []

for idx, mtx in enumerate(mtxs):
    
    genes_fam.append(mtx.aggregate_entries(hl.agg.filter(mtx.GT.is_non_ref(), 
                                                         hl.agg.explode(lambda element: hl.agg.counter(element), mtx.info.ISEQ_GENES_NAMES))))
    
    mtx_p = mtx.filter_cols(mtx.phenotypes.disease == 'YES')
    genes_patients.append(mtx_p.aggregate_entries(hl.agg.filter(mtx_p.GT.is_non_ref(), 
                                                                hl.agg.explode(lambda element: hl.agg.counter(element), mtx_p.info.ISEQ_GENES_NAMES))))
                          
    mtx_c = mtx.filter_cols(mtx.phenotypes.disease == 'NO')                   
    genes_controls.append(mtx_c.aggregate_entries(hl.agg.filter(mtx_c.GT.is_non_ref(), 
                                                                hl.agg.explode(lambda element: hl.agg.counter(element), mtx_c.info.ISEQ_GENES_NAMES))))
    # get rid of alleles that occur only in one person!
    
    

In [2310]:
for idx in range(len(genes_fam)):
    
    genes_fam[idx] = dict(filter(lambda elem: elem[1] > 1,genes_fam[idx].items()))
    genes_fam[idx] = list(genes_fam[idx].keys())

    genes_controls[idx] = dict(filter(lambda elem: elem[1] > 1,genes_controls[idx].items()))
    genes_controls[idx] = list(genes_controls[idx].keys())
        
    genes_patients[idx] = dict(filter(lambda elem: elem[1] > 1,genes_patients[idx].items()))
    genes_patients[idx] = list(genes_patients[idx].keys())
        
    mtxs[idx] = mtxs[idx].filter_rows(hl.any(lambda x: hl.literal(genes_fam[idx]).contains(x), mtxs[idx].info.ISEQ_GENES_NAMES)) # keep only rows annotated with a gene from this family's list

In [2311]:
for mtx in mtxs:
    print(mtx.count())  # count() already returns a Python tuple (rows, columns)
    print('in family')
    print(mtx.phenotypes.family.take(1))

(4, 7)
in family
['A']
(10, 8)
in family
['B']
(13, 9)
in family
['C']
(4, 7)
in family
['D']
(30, 7)
in family
['E']
(6, 11)
in family
['F']
(14, 4)
in family
['G']
(2, 6)
in family
['H']
(8, 6)
in family
['I']
(13, 5)
in family
['J']
(3, 3)
in family
['K']
(4, 3)
in family
['L']
(10, 3)
in family
['M']
(5, 3)
in family
['N']
(3, 3)
in family
['O']
(13, 6)
in family
['R']
(8, 6)
in family
['S']
(11, 8)
in family
['T']


In [2330]:
tbs = []

colss = []

for mtx in mtxs:
    tb = mtx.select_cols(mtx.phenotypes.family, mtx.phenotypes.disease, mtx.phenotypes.sex, mtx.phenotypes.kinship, mtx.phenotypes.add_pheno,
                        mtx.phenotypes.heavy_tics)
    tb = tb.select_entries(tb.GT, tb.GQ)
    tb = tb.select_rows(tb.rsid, tb.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe, tb.info.ISEQ_GENES_NAMES,
                        tb.info.ISEQ_CLINVAR_ALLELE_ID, tb.info.ISEQ_CLINVAR_DISEASES,
                        tb.info.ISEQ_HPO_INHERITANCE, tb.info.ISEQ_HPO_PHENOTYPES, tb.info.ISEQ_HPO_DISEASES,
                        tb.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE, tb.info.ANN, tb.info.PhastCons100way)
    cols = tb.cols()
    colss.append(cols.to_pandas())
    
    tb = tb.make_table()
    tbs.append(tb.to_pandas())


2019-12-10 10:44:31 Hail: INFO: Coerced sorted dataset
2019-12-10 10:44:37 Hail: INFO: Coerced sorted dataset
2019-12-10 10:44:42 Hail: INFO: Coerced sorted dataset
2019-12-10 10:44:48 Hail: INFO: Coerced sorted dataset
2019-12-10 10:44:54 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:01 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:09 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:17 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:23 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:28 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:34 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:41 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:48 Hail: INFO: Coerced sorted dataset
2019-12-10 10:45:54 Hail: INFO: Coerced sorted dataset
2019-12-10 10:46:04 Hail: INFO: Coerced sorted dataset
2019-12-10 10:46:09 Hail: INFO: Coerced sorted dataset
2019-12-10 10:46:14 Hail: INFO: Coerced sorted dataset
2019-12-10 10:46:22 Hail: INFO: Coerced sorted dataset


In [2342]:
for col, tb in zip(colss, tbs):
    name = col.iloc[0]['family']

    col_name = 'cols_'+str(name)   
    row_name = 'rows_'+str(name)

    col.to_pickle('./py_objects/'+col_name)
    tb.to_pickle('./py_objects/'+row_name)

<a id='7'></a>


## 7. NUP214 investigation as an example

In [2346]:
mt.count()

(47373, 149)

In [2356]:
mt_nup = mt.filter_rows(mt.info.ISEQ_GENES_NAMES.contains('NUP214'))  # keep rows annotated with exactly the NUP214 symbol

In [2357]:
mt_nup.count()

(2, 149)
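
When selecting a single gene, it matters whether the test is a substring test or a list-membership test: in Hail, `contains` on a string expression checks for a substring, while `contains` on an array checks for an element. The difference is easy to see in plain Python. The helper names and the shortened symbol below are invented for illustration.

```python
def substring_match(target, gene_names):
    """True if any annotated name is a substring of target (loose match)."""
    return any(name in target for name in gene_names)


def exact_match(target, gene_names):
    """True only if target itself is one of the annotated names."""
    return target in gene_names


toy_names = ["NUP21"]   # invented symbol, shorter than the target
loose = substring_match("NUP214", toy_names)
strict = exact_match("NUP214", toy_names)
print(loose, strict)
```

A substring test would also accept an empty gene name, so membership in the annotated list is the safer check when one specific symbol is wanted.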

In [2370]:
mt_nup.GQ.show(n_cols = 10, n_rows = 20)

| locus | alleles | S_136.GQ | S_170c.GQ | S_170d.GQ | S_6981.GQ | S_6982.GQ | S_7146.GQ | S_7156.GQ | S_7212.GQ | S_7213.GQ | S_7214.GQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `locus<GRCh38>` | `array<str>` | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 |
| chr9:131139255 | ["C","CT"] | 23 | 0 | 14 | 13 | 6 | 5 | 0 | 0 | 10 | 32 |
| chr9:131197554 | ["T","A"] | 72 | 90 | 99 | 84 | 69 | 99 | 81 | 90 | 99 | 70 |


In [1]:
mt_nup.rsid.show(n_cols = 10, n_rows = 20)

In [2371]:
mt_nup.GT.show(n_cols = 10, n_rows = 20)

| locus | alleles | S_136.GT | S_170c.GT | S_170d.GT | S_6981.GT | S_6982.GT | S_7146.GT | S_7156.GT | S_7212.GT | S_7213.GT | S_7214.GT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `locus<GRCh38>` | `array<str>` | call | call | call | call | call | call | call | call | call | call |
| chr9:131139255 | ["C","CT"] | 0/1 | 0/0 | 0/1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| chr9:131197554 | ["T","A"] | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |


In [2372]:
mt_nup.DP.show(n_cols = 10, n_rows = 20)

| locus | alleles | S_136.DP | S_170c.DP | S_170d.DP | S_6981.DP | S_6982.DP | S_7146.DP | S_7156.DP | S_7212.DP | S_7213.DP | S_7214.DP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `locus<GRCh38>` | `array<str>` | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 |
| chr9:131139255 | ["C","CT"] | 9 | 18 | 10 | 5 | 7 | 15 | 27 | 10 | 3 | 8 |
| chr9:131197554 | ["T","A"] | 25 | 33 | 33 | 31 | 28 | 41 | 28 | 33 | 34 | 26 |


In [2393]:
mt_nup_p = mt_nup.filter_cols(mt_nup.phenotypes.disease == 'YES')
mt_nup_p = mt_nup_p.filter_cols(mt_nup_p.phenotypes.family != '.')

mt_nup_c = mt_nup.filter_cols(mt_nup.phenotypes.disease == 'NO')

In [2394]:
mt_nup_p = mt_nup_p.annotate_rows(non_ref_ind = hl.agg.counter(mt_nup_p.GT.is_non_ref()),
                                 is_hom_non_ref = hl.agg.counter(mt_nup_p.GT.is_hom_var()),
                                 )

mt_nup_c = mt_nup_c.annotate_rows(non_ref_ind = hl.agg.counter(mt_nup_c.GT.is_non_ref()),
                                 is_hom_non_ref = hl.agg.counter(mt_nup_c.GT.is_hom_var()),
                                 )
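
`hl.agg.counter` applied to a boolean expression yields, per row, a dictionary from `True`/`False` to the number of samples with that value. The plain-Python sketch below shows the same tallying on invented genotype calls. The tuple encoding of calls and the two helper functions are assumptions made for the example.

```python
from collections import Counter

# Toy genotype calls for one variant, written as allele-index pairs (invented).
calls = [(0, 1), (0, 0), (1, 1), (0, 0)]


def is_non_ref(call):
    """True if at least one allele differs from the reference (index 0)."""
    return any(allele != 0 for allele in call)


def is_hom_var(call):
    """True if both alleles are the same non-reference allele."""
    return call[0] == call[1] and call[0] != 0


non_ref_counts = Counter(is_non_ref(c) for c in calls)
hom_var_counts = Counter(is_hom_var(c) for c in calls)
print(dict(non_ref_counts), dict(hom_var_counts))
```

A key that never occurs is simply absent from the result, which is why a row without any carriers shows only a `false` entry.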

In [2395]:
mt_nup_c.non_ref_ind.show()

| locus | alleles | non_ref_ind |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | `dict<bool, int64>` |
| chr9:131139255 | ["C","CT"] | {false:31,true:12} |
| chr9:131197554 | ["T","A"] | {false:42,true:1} |


In [2396]:
mt_nup_p.non_ref_ind.show()

| locus | alleles | non_ref_ind |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | `dict<bool, int64>` |
| chr9:131139255 | ["C","CT"] | {false:46,true:17} |
| chr9:131197554 | ["T","A"] | {false:63} |


In [2397]:
mt_nup_p.is_hom_non_ref.show()

| locus | alleles | is_hom_non_ref |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | `dict<bool, int64>` |
| chr9:131139255 | ["C","CT"] | {false:62,true:1} |
| chr9:131197554 | ["T","A"] | {false:63} |


In [2398]:
mt_nup_p.filter_cols(hl.agg.any(mt_nup_p.GT.is_hom_var())).s.show()

| s |
|---|
| `str` |
| "S_7253" |


In [2399]:
mt_nup_p.filter_cols(mt_nup_p.s == 'S_7146').GT.show()

| locus | alleles | S_7146.GT |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | call |
| chr9:131139255 | ["C","CT"] | 0/0 |
| chr9:131197554 | ["T","A"] | 0/0 |


In [2400]:
mt_nup_p.filter_cols(mt_nup_p.s == 'S_7146').DP.show()

| locus | alleles | S_7146.DP |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | int32 |
| chr9:131139255 | ["C","CT"] | 15 |
| chr9:131197554 | ["T","A"] | 41 |


In [2401]:
mt_nup_p.filter_cols(mt_nup_p.s == 'S_7146').GQ.show()

| locus | alleles | S_7146.GQ |
|---|---|---|
| `locus<GRCh38>` | `array<str>` | int32 |
| chr9:131139255 | ["C","CT"] | 5 |
| chr9:131197554 | ["T","A"] | 99 |
