# table of contents:

## [1. Import hail, other libraries and data](#1)

[1.1 import phenotype and pedigree data](#1.1)

[1.2 Annotate mt with phenotype and pedigree info](#1.2)

## [2. Explore MatrixTable, collect field descriptions](#2)

   [2.1 Removing the star alleles](#2.1)
   
   [2.2 Creating a mt_p with patients only and mt_c with non-patients](#2.2)
    
## [3. Do simple aggregations, filtering and plots](#3)

## [4. Work with single sample and single variant](#4)

   [4.1 Single sample](#4.1)

   [4.2 Single variant](#4.2)

## [5. Explore Clinvar significance of detected variants](#5)

[5.1 Are there known pathogenic variants in > 1 sample?](#5.1)

[6. Annotate with CADD (in progress)](#6)

<a id='1'></a> 
## 1. Import hail, other libraries and data

Always run this cell first: it widens the notebook and hides code cells when the notebook is exported as HTML.

In [719]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

display(HTML("<style>.container { width:100% !important; }</style>"))

In [1]:
import hail as hl
hl.init() 

Running on Apache Spark version 2.4.1
SparkUI available at http://349d1de1bab4:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.26-2dcc3d963867
LOGGING: writing to /hail/hail-20191114-0831-0.2.26-2dcc3d963867.log


In [728]:
from hail.plot import show
from pprint import pprint
from bokeh.layouts import gridplot
hl.plot.output_notebook()

In [521]:
import numpy as np
import pandas as pd
from functools import reduce

In [10]:
hl.import_vcf('data/annotated.test.vcf', reference_genome='GRCh38').write('data/sample.mt', overwrite=True)

2019-11-12 15:44:48 Hail: INFO: Ordering unsorted dataset with network shuffle
2019-11-12 15:45:18 Hail: INFO: wrote matrix table with 726380 rows and 151 columns in 116 partitions to data/sample.mt


In [356]:
mt = hl.read_matrix_table('data/sample.mt') # mt stands for MatrixTable

<a id='1.1'></a>
## 1.1 Import phenotype and pedigree data

In [535]:
pheno = hl.import_table('GTS-coded.csv', delimiter = ',', impute = True, key = 'ID')

2019-11-26 11:56:50 Hail: INFO: Reading table to impute column types
2019-11-26 11:56:50 Hail: INFO: Finished type imputation
  Loading column 'ID' as type 'str' (imputed)
  Loading column 'family' as type 'str' (imputed)
  Loading column 'sex' as type 'str' (imputed)
  Loading column 'kinship' as type 'str' (imputed)
  Loading column 'disease' as type 'str' (imputed)
  Loading column 'phenotype' as type 'str' (imputed)
  Loading column 'add_pheno' as type 'str' (imputed)
  Loading column 'heavy_tics' as type 'str' (imputed)


Number of samples:

In [539]:
pheno.count()

151

Number of patients among samples:

In [540]:
pheno.aggregate(hl.agg.filter(pheno.disease == 'YES', hl.agg.count()))

104

In [544]:
pheno.show()

ID,family,sex,kinship,disease,phenotype,add_pheno,heavy_tics
str,str,str,str,str,str,str,str
"""S_136""","""D""","""M""","""P""","""YES""","""GTS""",""".""","""n/a"""
"""S_170c""","""B""","""M""","""father""","""YES""","""GTS""",""".""","""n/a"""
"""S_170d""","""B""","""M""","""father_brother_son""","""YES""","""GTS""",""".""","""n/a"""
"""S_6981""","""A""","""M""","""relative""","""n/a""","""n/a""",""".""","""n/a"""
"""S_6982""","""A""","""F""","""relative""","""n/a""","""n/a""",""".""","""n/a"""
"""S_7146""","""E""","""M""","""P""","""YES""","""GTS""",""".""","""n/a"""
"""S_7156""",""".""","""F""",""".""","""n/a""","""n/a""",""".""",
"""S_7212""","""A""","""M""","""father""","""NO""",""".""",""".""","""n/a"""
"""S_7213""","""A""","""M""","""mother_brother""","""NO""",""".""",""".""","""n/a"""
"""S_7214""","""A""","""F""","""mother_brother_daughter""","""NO""",""".""",""".""","""n/a"""


<a id='1.2'></a>

## 1.2 Annotate mt with phenotype and pedigree info

In [609]:
mt = mt.annotate_cols(phenotypes = pheno[mt.s])
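
The annotation above is a keyed join: `pheno` is keyed by `ID`, so `pheno[mt.s]` looks up each sample ID and attaches the matching phenotype row, and a sample with no match gets a missing value. The following is a plain-Python sketch of that lookup, not Hail code. It uses three IDs from the table shown earlier plus one hypothetical unmatched ID (`S_0000`).

```python
# Plain-Python illustration of a keyed lookup; this is not Hail code.
pheno_rows = {
    "S_136": {"family": "D", "sex": "M", "disease": "YES"},
    "S_7212": {"family": "A", "sex": "M", "disease": "NO"},
    "S_6981": {"family": "A", "sex": "M", "disease": "n/a"},
}
sample_ids = ["S_136", "S_6981", "S_7212", "S_0000"]  # S_0000 is hypothetical

# dict.get returns None for an unknown key, mirroring a missing annotation.
annotated = {s: pheno_rows.get(s) for s in sample_ids}
unmatched = [s for s, row in annotated.items() if row is None]
print(unmatched)
```

Checking for unmatched samples right after the join is a cheap way to catch ID mismatches between the VCF and the phenotype file.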

<a id='2'></a> 
## 2. Explore MatrixTable, collect field descriptions

In [192]:
header = hl.get_vcf_metadata('data/annotated.test.vcf')

In [618]:
mt.describe() # see what's inside the mt

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'phenotypes': struct {
        family: str, 
        sex: str, 
        kinship: str, 
        disease: str, 
        phenotype: str, 
        add_pheno: str, 
        heavy_tics: str
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        ExcessHet: float64, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        MULTIALLELIC: boo

In [4]:
mt.rows().select().show(5)

locus,alleles
locus<GRCh38>,array<str>
chr1:10146,"[""AC"",""*""]"
chr1:10177,"[""A"",""*""]"
chr1:10622,"[""T"",""*""]"
chr1:10623,"[""T"",""*""]"
chr1:30923,"[""G"",""*""]"


<a id='2.1'></a>

## 2.1 Removing the star alleles

Rows with a `*` allele are orphaned star alleles and shouldn't be here, so they are filtered out.



In [363]:
mt = mt.filter_rows(mt.alleles.contains('*'), keep = False)

In [620]:
mt.count_cols()

151

In [621]:
mt.count_rows()

47373

<a id='2.2'></a>

## 2.2 Creating a mt_p with patients only and mt_c with non-patients

In [625]:
mt_p = mt.filter_cols(mt.phenotypes.disease == 'YES', keep = True)

In [649]:
mt_p = mt_p.filter_rows(hl.agg.any(mt_p.GT.is_non_ref())) #filtering out variants that do not occur in any patients

In [650]:
mt_p.count()

(42135, 104)
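
The row filter above keeps a variant only if at least one of the remaining samples carries a non-reference call. A minimal plain-Python sketch of that rule follows. The three variants and their genotypes are hypothetical, calls are written as tuples of allele indices, and `None` stands for a missing call, which this sketch assumes does not count as non-reference.

```python
# Plain-Python illustration; variant names and genotypes are hypothetical.
variants = {
    "var_a": [(0, 0), (0, 1), None],    # one heterozygous carrier -> kept
    "var_b": [(0, 0), (0, 0), (0, 0)],  # all homozygous reference -> dropped
    "var_c": [None, (0, 0), None],      # no observed alt allele -> dropped
}

def is_non_ref(gt):
    """True if the call is present and has at least one alternate allele."""
    return gt is not None and any(allele > 0 for allele in gt)

# Keep a variant when any sample in the group is non-reference.
kept = [name for name, gts in variants.items() if any(is_non_ref(g) for g in gts)]
print(kept)
```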

In [628]:
mt_c = mt.filter_cols(mt.phenotypes.disease != 'YES', keep = True)

In [661]:
mt_c = mt_c.filter_rows(hl.agg.any(mt_c.GT.is_non_ref())) #filtering out variants that do not occur in any controls

In [662]:
mt_c.count()

(24019, 47)

In [652]:
mt_p.count()[1] / mt_c.count()[1]

2.2127659574468086

In [665]:
mt_p.count()[0] / mt_c.count()[0]

1.754236229651526
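
The two ratios above can be reproduced from the `count()` outputs. The sketch below also derives distinct variants per sample, a figure the notebook does not compute, to show that the patient group has about 2.2 times the samples but only about 1.75 times the distinct variants.

```python
# Counts copied from the mt_p.count() and mt_c.count() outputs above.
patient_variants, patient_samples = 42135, 104
control_variants, control_samples = 24019, 47

sample_ratio = patient_samples / control_samples
variant_ratio = patient_variants / control_variants

# Distinct non-reference variants per sample in each group.
patient_per_sample = patient_variants / patient_samples
control_per_sample = control_variants / control_samples
print(sample_ratio, variant_ratio)
print(round(patient_per_sample, 1), round(control_per_sample, 1))
```

Because the groups differ in size, raw variant counts are not directly comparable between `mt_p` and `mt_c`; any comparison should account for the number of samples.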

We can use `rows()` along with `select()` to pull out 5 variants, as in `mt.rows().select().show(5)` above. The `select` method takes either a string referring to a field name in the table or a Hail expression. Called with no arguments, it keeps only the row key fields, `locus` and `alleles`. The cells below look at entries, samples and genotypes.

In [366]:
mt.entry.show(5)

locus,alleles
locus<GRCh38>,array<str>
chr1:69968,"[""A"",""G""]"
chr1:183189,"[""G"",""C""]"
chr1:183238,"[""G"",""C""]"
chr1:183937,"[""G"",""A""]"
chr1:184994,"[""G"",""C""]"


In [367]:
mt.entry.take(5) #create a list

[Struct(AD=[7, 0], DP=7, GQ=21, GT=Call(alleles=[0, 0], phased=False), PGT=None, PID=None, PL=[0, 21, 239]),
 Struct(AD=[0, 0], DP=0, GQ=None, GT=None, PGT=None, PID=None, PL=None),
 Struct(AD=[0, 0], DP=0, GQ=None, GT=None, PGT=None, PID=None, PL=None),
 Struct(AD=[0, 0], DP=0, GQ=None, GT=None, PGT=None, PID=None, PL=None),
 Struct(AD=[6, 0], DP=6, GQ=18, GT=Call(alleles=[0, 0], phased=False), PGT=None, PID=None, PL=[0, 18, 173])]

In [368]:
mt.entry.take(200)[130] # works, but slowly: 200 entries are collected to read one

Struct(AD=[10, 0], DP=10, GQ=30, GT=Call(alleles=[0, 0], phased=False), PGT=None, PID=None, PL=[0, 30, 327])

In [40]:
mt.s.show(5) #samples


s
str
"""S_136"""
"""S_170c"""
"""S_170d"""
"""S_6981"""
"""S_6982"""


In [191]:
mt.s.describe()

--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f93c162d410>
Index:
    ['column']
--------------------------------------------------------


In [442]:
mt.s.take(5)

['S_136', 'S_170c', 'S_170d', 'S_6981', 'S_6982']

In [369]:
mt.GT.show()

locus,alleles,S_136.GT,S_170c.GT,S_170d.GT,S_6981.GT
locus<GRCh38>,array<str>,call,call,call,call
chr1:69968,"[""A"",""G""]",0/0,,,
chr1:183189,"[""G"",""C""]",0/0,0/0,0/0,0/0
chr1:183238,"[""G"",""C""]",0/0,0/0,0/0,0/0
chr1:183937,"[""G"",""A""]",0/0,0/1,0/1,0/0
chr1:184994,"[""G"",""C""]",0/0,0/0,,0/0
chr1:185336,"[""C"",""T""]",0/0,0/1,0/0,0/0
chr1:185497,"[""G"",""A""]",0/0,0/0,0/1,0/0
chr1:185550,"[""G"",""A""]",0/0,0/0,0/0,0/0
chr1:186338,"[""T"",""G""]",0/1,0/0,0/1,0/1
chr1:186341,"[""T"",""G""]",0/0,0/0,0/0,0/0


In [371]:
mt.info.ISEQ_CLINVAR_REVIEW_STATUS.describe() #describe a field

--------------------------------------------------------
Type:
        array<str>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f939ceb6490>
Index:
    ['row']
--------------------------------------------------------


In [372]:
hl.summarize_variants(mt)

Number of alleles,Count
2,47373

Allele type,Count
SNP,41030
Deletion,3351
Insertion,2992

Contig,Count
chr1,4499
chr2,2769
chr3,2436
chr4,1549
chr5,2578
chr6,1900
chr7,2886
chr8,1458
chr9,1812
chr10,1567


<a id='3'></a> 
## 3. Do simple aggregations, filtering and plots

In [622]:
mt.aggregate_entries(hl.agg.counter(mt.info.MULTIALLELIC)) # count occurrences of each value; a row field is counted once per sample here, and array fields need converting first (e.g. with hl.delimit)

{False: 6678428, True: 474895}
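
Because `info.MULTIALLELIC` is a row field, `aggregate_entries` counts it once per sample, so the totals above are entries rather than variants. The check below, using only numbers printed in this notebook, shows that the counts sum to rows × columns and divides by the number of samples to recover per-variant counts. Aggregating with `mt.aggregate_rows` instead should give those per-variant counts directly.

```python
n_rows, n_cols = 47373, 151  # from mt.count_rows() and mt.count_cols()
entry_counts = {False: 6678428, True: 474895}  # output of the cell above

# Every variant contributes one entry per sample.
assert sum(entry_counts.values()) == n_rows * n_cols

# Divide by the number of samples to get variants per flag value.
variant_counts = {flag: n // n_cols for flag, n in entry_counts.items()}
print(variant_counts)  # {False: 44228, True: 3145}
```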

Exercise: try to show some variants where ISEQ_GENES_NAMES is defined:

In [448]:
mt.filter_rows(hl.is_defined(mt.info.ISEQ_GENES_NAMES)).info.ISEQ_GENES_NAMES.show(5) #is_defined() filters out empty/nondefined

locus,alleles,Unnamed: 2_level_0
locus<GRCh38>,array<str>,array<str>
chr1:69968,"[""A"",""G""]","[""OR4F5""]"
chr1:183189,"[""G"",""C""]","[""FO538757.2""]"
chr1:183238,"[""G"",""C""]","[""FO538757.2""]"
chr1:183937,"[""G"",""A""]","[""FO538757.2""]"
chr1:184994,"[""G"",""C""]","[""FO538757.1""]"


In [449]:
mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles())) #distribution of genotype calls

{0: 6651990, 1: 344890, 2: 84194, None: 72249}

In [450]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT))) #calculate the call rate directly

0.9898999388116544
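
The call rate is the fraction of entries with a defined genotype, so it can also be derived from the genotype counter printed above. A short arithmetic check:

```python
gt_counts = {0: 6651990, 1: 344890, 2: 84194, None: 72249}  # counter output above

total = sum(gt_counts.values())   # all entries (variants x samples)
called = total - gt_counts[None]  # entries with a defined GT
call_rate = called / total
print(total, call_rate)
```

The total equals 47373 × 151, and the ratio agrees with the value returned by `hl.agg.fraction`.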

Annotate the mt with the call rate per variant:

In [451]:
mt = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))

In [452]:
mt.call_rate.describe()

--------------------------------------------------------
Type:
        float64
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f935fd37ad0>
Index:
    ['row']
--------------------------------------------------------


In [453]:
mt.call_rate.summarize()

0,1
Non-missing,47373 (100.00%)
Missing,0
Minimum,0.01
Maximum,1.00
Mean,0.99
Std Dev,0.06


In [460]:
p = hl.plot.histogram(mt.call_rate, bins=25, title='Variant Call Rate Histogram', range=(0,1.0), legend='Call Rate')
show(p) # distribution of the per-variant call rate

In [490]:
p = hl.plot.histogram(mt.DP, range=(0,100), bins=30, title='DP Histogram', legend='DP')
show(p)

<a id='4'></a> 
## 4. Work with single sample and single variant

<a id='4.1'></a> 
### 4.1 Single sample

In [461]:
S_136 = mt.filter_cols(mt.s == "S_136")

In [462]:
S_136.GT.describe()

--------------------------------------------------------
Type:
        call
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f937795a110>
Index:
    ['column', 'row']
--------------------------------------------------------


In [463]:
S_136.count_cols()

1

In [464]:
S_136.count_rows()

47373

In [465]:
S_136.aggregate_entries(hl.agg.counter(S_136.GT.n_alt_alleles()))

{0: 44094, 1: 2291, 2: 532, None: 456}
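
The counter keys above are the number of alternate alleles per call (0 for `0/0`, 1 for `0/1`, 2 for `1/1`), with `None` for missing calls. The helper below is a plain-Python stand-in for that count, not Hail's implementation. The sketch also uses the printed counts to derive the call rate of this one sample.

```python
def n_alt_alleles(gt):
    """Number of non-reference alleles in a call such as (0, 1); None stays None."""
    if gt is None:
        return None
    return sum(1 for allele in gt if allele != 0)

sample_counts = {0: 44094, 1: 2291, 2: 532, None: 456}  # counter output above

n_variants = sum(sample_counts.values())
sample_call_rate = (n_variants - sample_counts[None]) / n_variants
non_ref_calls = sample_counts[1] + sample_counts[2]
print(n_variants, round(sample_call_rate, 4), non_ref_calls)
```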

In [413]:
S_136.GT.show()

locus,alleles,S_136.GT
locus<GRCh38>,array<str>,call
chr1:10146,"[""AC"",""*""]",0/0
chr1:10177,"[""A"",""*""]",
chr1:10622,"[""T"",""*""]",
chr1:10623,"[""T"",""*""]",
chr1:30923,"[""G"",""*""]",
chr1:54714,"[""TTCTTTCTTTCTTTC"",""*""]",0/0
chr1:54715,"[""TC"",""*""]",0/0
chr1:54718,"[""TTCTTTC"",""*""]",0/0
chr1:54720,"[""CTTTCTTTCTTTCT"",""*""]",0/0
chr1:54724,"[""CT"",""*""]",0/1


In [209]:
S_136.aggregate_entries(hl.agg.counter(S_136.info.MULTIALLELIC))

{False: 44228, True: 682152}

In [212]:
S_136.alleles.is_star()

AttributeError: 'ArrayExpression' object has no attribute 'is_star'

In [210]:
S_136.alleles.describe

<bound method Expression.describe of <ArrayExpression of type array<str>>>

<a id='4.2'></a> 
### 4.2 Single variant

In [181]:
mt.locus.describe()

--------------------------------------------------------
Type:
        locus<GRCh38>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f93c162d410>
Index:
    ['row']
--------------------------------------------------------


In [186]:
hl.Locus.parse('chr1:183189', reference_genome='GRCh38') #get hail locus-type field from chrZ:12345 string

Locus(contig=chr1, position=183189, reference_genome=GRCh38)

In [187]:
chr1_183189 = mt.filter_rows(mt.locus == hl.Locus.parse('chr1:183189', reference_genome='GRCh38'))

In [188]:
chr1_183189.aggregate_entries(hl.agg.counter(chr1_183189.GT.n_alt_alleles()))

2019-11-14 16:46:20 Hail: INFO: reading 1 of 116 data partitions


{0: 132, 1: 19}
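
From these genotype counts the allele frequency of the variant in this cohort follows directly. The sketch assumes diploid calls, which is reasonable for an autosomal locus on chr1.

```python
counts = {0: 132, 1: 19}  # n_alt_alleles -> number of samples, from the output above

n_samples = sum(counts.values())
alt_alleles = sum(n_alt * n for n_alt, n in counts.items())
allele_number = 2 * n_samples  # assumption: two alleles per sample
af = alt_alleles / allele_number
carrier_fraction = counts[1] / n_samples
print(n_samples, alt_alleles, round(af, 4), round(carrier_fraction, 4))
```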

In [375]:
chr1_183189.alleles.show()

2019-11-19 08:29:49 Hail: INFO: reading 1 of 116 data partitions


locus,alleles
locus<GRCh38>,array<str>
chr1:183189,"[""G"",""C""]"


In [374]:
chr1_183189.info.ANN.collect()

2019-11-19 08:28:42 Hail: INFO: reading 1 of 116 data partitions


[['C|missense_variant|MODERATE|FO538757.2|ENSG00000279928|transcript|ENST00000624431.1|protein_coding|2/3|c.114G>C|p.Lys38Asn|430/718|114/402|38/133||',
  'C|downstream_gene_variant|MODIFIER|FO538757.1|ENSG00000279457|transcript|ENST00000623083.3|protein_coding||c.*2028C>G|||||1736|',
  'C|downstream_gene_variant|MODIFIER|FO538757.1|ENSG00000279457|transcript|ENST00000623834.3|protein_coding||c.*2028C>G|||||1734|',
  'C|downstream_gene_variant|MODIFIER|FO538757.1|ENSG00000279457|transcript|ENST00000624735.1|protein_coding||c.*1738C>G|||||1738|',
  'C|downstream_gene_variant|MODIFIER|MIR6859-2|ENSG00000273874|transcript|ENST00000612080.1|miRNA||n.*4702C>G|||||4702|']]

In [385]:
chr1_183189.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.collect()

2019-11-19 08:49:37 Hail: INFO: reading 1 of 116 data partitions


[None]

<a id='5'></a> 

## 5. Explore Clinvar significance of detected variants

First for all samples (`mt`), then for patients (`mt_p`) and controls (`mt_c`):

In [623]:
mt.aggregate_rows(hl.agg.counter(hl.delimit(mt.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE))) # hl.delimit function converts array into string

{'likely_benign': 487,
 None: 44158,
 'risk_factor': 8,
 'benign': 363,
 'benign:other': 1,
 'pathogenic/likely_pathogenic': 32,
 'uncertain_significance:other': 1,
 'likely_pathogenic': 28,
 'not_provided': 60,
 'protective': 2,
 'pathogenic:risk_factor': 2,
 'benign/likely_benign': 330,
 'drug_response': 5,
 'conflicting_interpretations_of_pathogenicity': 697,
 'benign/likely_benign:other': 3,
 'uncertain_significance': 1074,
 'other': 5,
 'pathogenic': 114,
 'conflicting_interpretations_of_pathogenicity:risk_factor': 1,
 'affects': 2}
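
The counter above works on strings because `hl.delimit` joins each array into a single comma-separated string, while a missing array stays missing and is counted under `None`. A plain-Python sketch of the same idea, with hypothetical input values and a hypothetical `delimit` helper:

```python
from collections import Counter

# Hypothetical per-variant significance arrays; None marks a missing annotation.
significance = [
    ["benign"],
    None,
    ["pathogenic"],
    ["benign"],
    ["pathogenic:risk_factor"],
    None,
]

def delimit(values, sep=","):
    """Join an array into one string; keep missing values as None."""
    return None if values is None else sep.join(values)

counts = Counter(delimit(v) for v in significance)
print(counts)
```

Python lists cannot be used as `Counter` keys either, so joining them into strings plays the same role here.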

In [655]:
mt_p.aggregate_rows(hl.agg.counter(hl.delimit(mt_p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE)))

{'likely_benign': 424,
 None: 39352,
 'risk_factor': 8,
 'benign': 337,
 'benign:other': 1,
 'pathogenic/likely_pathogenic': 29,
 'uncertain_significance:other': 1,
 'likely_pathogenic': 25,
 'not_provided': 54,
 'protective': 2,
 'pathogenic:risk_factor': 2,
 'benign/likely_benign': 296,
 'drug_response': 4,
 'conflicting_interpretations_of_pathogenicity': 594,
 'benign/likely_benign:other': 3,
 'uncertain_significance': 902,
 'other': 5,
 'pathogenic': 94,
 'conflicting_interpretations_of_pathogenicity:risk_factor': 1,
 'affects': 1}

In [666]:
mt_c.aggregate_rows(hl.agg.counter(hl.delimit(mt_c.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE)))

{'likely_benign': 262,
 None: 22268,
 'risk_factor': 6,
 'benign': 271,
 'benign:other': 1,
 'pathogenic/likely_pathogenic': 11,
 'uncertain_significance:other': 1,
 'likely_pathogenic': 12,
 'not_provided': 30,
 'protective': 1,
 'benign/likely_benign': 208,
 'drug_response': 4,
 'conflicting_interpretations_of_pathogenicity': 368,
 'benign/likely_benign:other': 3,
 'uncertain_significance': 518,
 'other': 2,
 'pathogenic': 52,
 'affects': 1}

In [667]:
mt_c.filter_rows(mt_c.info.ISEQ_CLINVAR_DISEASES.contains('Schizophrenia'), keep = True).count()

(6, 47)

In [658]:
mt_p.filter_rows(mt_p.info.ISEQ_CLINVAR_DISEASES.contains('Schizophrenia'), keep = True).count()

(10, 104)

In [668]:
mt_c.filter_rows(mt_c.info.ISEQ_CLINVAR_DISEASES.contains('Autism_17^Autism_spectrum_disorder'), keep = True).count()

(4, 47)

In [660]:
mt_p.filter_rows(mt_p.info.ISEQ_CLINVAR_DISEASES.contains('Autism_17^Autism_spectrum_disorder'), keep = True).count()

(6, 104)

In [445]:
mt.filter_rows(mt.info.ISEQ_CLINVAR_DISEASES.contains('Autism_17^Autism_spectrum_disorder'), keep = True).rsid.show()

locus,alleles,rsid
locus<GRCh38>,array<str>,str
chr11:70473268,"[""C"",""T""]","""rs140134890"""
chr11:70490374,"[""C"",""T""]","""rs117843717"""
chr11:70698758,"[""G"",""A""]",
chr11:70820627,"[""C"",""T""]",
chr11:70820628,"[""G"",""A""]",
chr11:70952780,"[""A"",""AACTCGCC""]",


In [669]:
patho_p = mt_p.filter_rows(mt_p.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic'), keep = True)
patho_c = mt_c.filter_rows(mt_c.info.ISEQ_AGGREGATED_CLINVAR_SIGNIFICANCE.contains('pathogenic'), keep = True)

In [701]:
patho_p.aggregate_entries(hl.agg.counter(patho_p.GT.n_alt_alleles()))

{0: 9593, 1: 151, 2: 7, None: 25}

In [699]:
patho_c.aggregate_entries(hl.agg.counter(patho_c.GT.n_alt_alleles()))

{0: 2353, 1: 86, None: 5}
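
Dividing these entry totals by the number of samples gives the number of variants kept by the `contains('pathogenic')` filter. The result equals the `'pathogenic'` counts in the counters above (94 for patients, 52 for controls), which indicates that `contains` on an `array<str>` matches whole elements only. Variants labelled `'likely_pathogenic'` or `'pathogenic/likely_pathogenic'` are therefore not included in `patho_p` and `patho_c`.

```python
# Counts copied from the outputs above.
patho_p_entries = {0: 9593, 1: 151, 2: 7, None: 25}
patho_c_entries = {0: 2353, 1: 86, None: 5}

patho_p_rows = sum(patho_p_entries.values()) // 104  # 104 patients
patho_c_rows = sum(patho_c_entries.values()) // 47   # 47 controls
print(patho_p_rows, patho_c_rows)  # 94 52

# Python list membership behaves the same way: whole elements only.
assert "pathogenic" in ["pathogenic"]
assert "pathogenic" not in ["pathogenic/likely_pathogenic"]
```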

<a id='5.05'></a> 

### 5.05 Get AF from patho_p and patho_c, compare with gnomAD v3

In [712]:
patho_p = hl.variant_qc(patho_p)
patho_c = hl.variant_qc(patho_c)

In [714]:
patho_p.variant_qc.AF.show() #this is just for patients

locus,alleles,Unnamed: 2_level_0
locus<GRCh38>,array<str>,array<float64>
chr1:7809893,"[""C"",""G""]","[9.95e-01,4.81e-03]"
chr1:7809900,"[""A"",""G""]","[9.95e-01,4.81e-03]"
chr1:15445687,"[""AACACCCGCAAGAAGCCGGTAGTCT"",""A""]","[9.95e-01,4.81e-03]"
chr1:17270928,"[""C"",""T""]","[9.95e-01,4.81e-03]"
chr1:22906853,"[""G"",""A""]","[9.90e-01,9.62e-03]"
chr1:43338634,"[""G"",""C""]","[9.95e-01,4.81e-03]"
chr1:94031110,"[""G"",""A""]","[9.95e-01,4.81e-03]"
chr1:119140608,"[""A"",""C""]","[9.95e-01,4.81e-03]"
chr1:161169143,"[""C"",""G""]","[9.76e-01,2.40e-02]"
chr1:171107811,"[""C"",""T""]","[9.90e-01,9.71e-03]"
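
The alternate-allele frequencies above are small multiples of 1/208, since 104 patients carry 208 alleles when every patient has a call. The sketch below reproduces four of the displayed values. The allele counts are back-calculated from the frequencies, not read from the data, so they are an assumption.

```python
n_patients = 104
full_an = 2 * n_patients  # 208 alleles when every patient has a call

candidates = {
    "1 copy": 1 / full_an,
    "2 copies": 2 / full_an,
    "5 copies": 5 / full_an,
    # Assumption: one patient without a call removes two alleles from the denominator.
    "2 copies, 1 uncalled patient": 2 / (full_an - 2),
}
formatted = {label: f"{af:.2e}" for label, af in candidates.items()}
for label, text in formatted.items():
    print(label, text)
```

The last case is consistent with the 9.71e-03 shown for chr1:171107811, although other combinations of missing calls could give the same value.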


In [732]:
patho_p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe.show()

locus,alleles,Unnamed: 2_level_0
locus<GRCh38>,array<str>,array<float64>
chr1:7809893,"[""C"",""G""]",[5.74e-03]
chr1:7809900,"[""A"",""G""]",[5.74e-03]
chr1:15445687,"[""AACACCCGCAAGAAGCCGGTAGTCT"",""A""]",[1.86e-04]
chr1:17270928,"[""C"",""T""]",[9.09e-03]
chr1:22906853,"[""G"",""A""]",[5.33e-03]
chr1:43338634,"[""G"",""C""]",[5.58e-04]
chr1:94031110,"[""G"",""A""]",[4.03e-04]
chr1:119140608,"[""A"",""C""]",[5.12e-03]
chr1:161169143,"[""C"",""G""]",[9.88e-03]
chr1:171107811,"[""C"",""T""]",[2.03e-03]


In [738]:
# AF is a row field: aggregate over rows so each variant is counted once, not once per sample
hist = patho_p.aggregate_rows(hl.agg.hist(patho_p.variant_qc.AF[1], 0, 0.05, 10))
p = hl.plot.histogram(hist, legend='AF patho_p', title='patho allele frequency in patients')

# the gnomAD AF is stored as an array; hl.delimit + hl.float64 turns its single value into a float
hist2 = patho_p.aggregate_rows(hl.agg.hist(hl.float64(hl.delimit(patho_p.info.ISEQ_GNOMAD_GENOMES_V3_AF_nfe)), 0, 0.05, 10))
p2 = hl.plot.histogram(hist2, legend='AF GnomadV3', title='patho allele frequency in Gnomad')

show(gridplot([p, p2], ncols=2, plot_width=400, plot_height=400))


In [717]:
patho_c.variant_qc.AF.show() #this is just for controls (may be different variants)

locus,alleles,Unnamed: 2_level_0
locus<GRCh38>,array<str>,array<float64>
chr1:7809893,"[""C"",""G""]","[9.89e-01,1.06e-02]"
chr1:7809900,"[""A"",""G""]","[9.89e-01,1.06e-02]"
chr1:25303329,"[""T"",""G""]","[9.88e-01,1.19e-02]"
chr1:40258319,"[""GA"",""G""]","[9.79e-01,2.13e-02]"
chr1:56940965,"[""G"",""A""]","[9.68e-01,3.19e-02]"
chr1:94031110,"[""G"",""A""]","[9.89e-01,1.06e-02]"
chr1:152307547,"[""G"",""A""]","[9.89e-01,1.06e-02]"
chr1:161169143,"[""C"",""G""]","[9.68e-01,3.19e-02]"
chr1:171107811,"[""C"",""T""]","[9.79e-01,2.13e-02]"
chr1:197434706,"[""G"",""A""]","[9.89e-01,1.06e-02]"


In [716]:
patho_p.info.AF.show() #this is from whole 151 samples, only for alternative allele

locus,alleles,Unnamed: 2_level_0
locus<GRCh38>,array<str>,array<float64>
chr1:7809893,"[""C"",""G""]",[7.00e-03]
chr1:7809900,"[""A"",""G""]",[7.00e-03]
chr1:15445687,"[""AACACCCGCAAGAAGCCGGTAGTCT"",""A""]",[3.00e-03]
chr1:17270928,"[""C"",""T""]",[3.00e-03]
chr1:22906853,"[""G"",""A""]",[7.00e-03]
chr1:43338634,"[""G"",""C""]",[3.00e-03]
chr1:94031110,"[""G"",""A""]",[7.00e-03]
chr1:119140608,"[""A"",""C""]",[3.00e-03]
chr1:161169143,"[""C"",""G""]",[2.60e-02]
chr1:171107811,"[""C"",""T""]",[1.30e-02]
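
The cohort-wide `info.AF` values can be cross-checked against the patient and control frequencies from `variant_qc`. The sketch below pools allele counts for three variants. The counts are back-calculated from the displayed frequencies and the three-decimal rounding of `info.AF` is inferred from the output, so both are assumptions.

```python
# (alt_count, allele_number) per group; back-calculated, not read from the data.
groups = {
    "chr1:7809893": {"patients": (1, 208), "controls": (1, 94)},
    "chr1:161169143": {"patients": (5, 208), "controls": (3, 94)},
    "chr1:171107811": {"patients": (2, 206), "controls": (2, 94)},
}

cohort_af = {}
for variant, g in groups.items():
    ac = g["patients"][0] + g["controls"][0]  # pooled alternate allele count
    an = g["patients"][1] + g["controls"][1]  # pooled allele number
    cohort_af[variant] = round(ac / an, 3)
print(cohort_af)
```

The pooled values agree with the `info.AF` column above (7.00e-03, 2.60e-02 and 1.30e-02).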


<a id='5.1'></a> 


### 5.1 Are there known pathogenic variants in > 1 sample?

Which variants? Which samples?

In [704]:
patho_p.aggregate_rows(hl.agg.counter(hl.delimit(patho_p.info.AC))) # note: info.AC comes from the VCF, so it counts alleles across all 151 samples, not only patients

{'12': 1,
 '8': 1,
 '4': 6,
 '9': 1,
 '5': 4,
 '6': 2,
 '1': 48,
 '2': 15,
 '7': 3,
 '3': 13}
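
The counter keys are strings because of `hl.delimit`, so they need converting before any numeric comparison. The sketch below summarizes the output above. An allele count above 1 does not by itself mean more than one sample, since a single homozygous carrier also gives AC = 2. Counting carriers per variant, for example with `hl.agg.count_where(mt.GT.is_non_ref())`, would answer the question more directly.

```python
ac_counts = {'12': 1, '8': 1, '4': 6, '9': 1, '5': 4,
             '6': 2, '1': 48, '2': 15, '7': 3, '3': 13}  # counter output above

by_ac = {int(ac): n for ac, n in ac_counts.items()}  # string keys -> integers
total_variants = sum(by_ac.values())
singletons = by_ac[1]
multi_copy = sum(n for ac, n in by_ac.items() if ac > 1)
print(total_variants, singletons, multi_copy)  # 94 48 46
```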

## 6. GWAS between patients and controls for all the variants

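First, `disease` has to be converted from a string to a boolean, because the response `y` must be boolean or 0/1. An untested sketch, which treats every sample that is not 'YES' (including 'n/a') as a control:

#mt = mt.annotate_cols(is_case = mt.phenotypes.disease == 'YES')
#gwas = hl.logistic_regression_rows(test='wald',
#                                 y=mt.is_case,
#                                 x=mt.GT.n_alt_alleles(),
#                                 covariates=[1.0])
#gwas.row.describe()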
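
One decision to make before running the regression is how to code samples whose `disease` value is neither 'YES' nor 'NO'. The plain-Python sketch below shows one possible mapping that leaves such samples missing instead of treating them as controls. The helper name `disease_to_case` and the handling of 'n/a' are assumptions, not something the notebook has settled.

```python
def disease_to_case(disease):
    """Map the 'disease' column to True/False, leaving other codes missing."""
    if disease == "YES":
        return True
    if disease == "NO":
        return False
    return None  # assumption: 'n/a' and other codes are excluded, not controls

labels = ["YES", "NO", "n/a", "YES"]  # example values as they appear in the phenotype table
is_case = [disease_to_case(d) for d in labels]
print(is_case)
```

Note that `mt_c` above was built with `disease != 'YES'`, which keeps the 'n/a' samples among the non-patients, so the two codings give different control groups.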