# Relatedness-Test

In this notebook, we will generate the genomes of many families and measure their relatedness using the hl.pc_relate and hl.king methods.

First, we import Hail and a few other libraries. We also set the default reference genome to GRCh38.

In [1]:
import hail as hl
hl.init(default_reference = 'GRCh38')
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
from hail.ggplot import *
import plotly
import plotly.io as pio
pio.renderers.default = 'iframe'

2022-08-17 14:50:07 WARN  Utils:69 - Your hostname, wm550-d22 resolves to a loopback address: 127.0.0.1; using 10.10.204.150 instead (on interface en0)
2022-08-17 14:50:07 WARN  Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2022-08-17 14:50:07 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
2022-08-17 14:50:08 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4045. Attempting port 4046.


Running on Apache Spark version 3.1.3
SparkUI available at http://10.10.204.150:4046
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.97-f9f63e8c0856
LOGGING: writing to /Users/aleisha/hail/hail/hail-20220817-1450-0.2.97-f9f63e8c0856.log


Then, using the make_parents function, we generate a table of potential parent genomes.
In this example, we generate 100 potential parents from 3 populations, showing their genotypes at 100,000 positions.

In [2]:
def rand_int(lower, upper):
    return hl.rand_cat(hl.range(upper - lower).map(lambda x: 1)) + lower
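
`hl.rand_cat` draws an index with probability proportional to the weights it is given, so with `upper - lower` equal weights this helper returns an integer in the half-open range `[lower, upper)`. The plain-Python analogue below is only an illustration (`rand_int_py` is a hypothetical name, not part of the notebook or of Hail). Under this reading, `rand_int(1, 22)` yields values from 1 through 21.

```python
import random

def rand_int_py(lower, upper, rng=random):
    """Plain-Python analogue of the notebook's rand_int (hypothetical helper)."""
    weights = [1] * (upper - lower)  # one equal weight per candidate value
    index = rng.choices(range(len(weights)), weights=weights)[0]
    return index + lower  # shift [0, upper - lower) to [lower, upper)

draws = [rand_int_py(1, 22) for _ in range(10_000)]
print(min(draws), max(draws))  # the upper bound itself is never drawn
```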

In [3]:
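def make_parents(pops, n_parentoptions, n_variants):
    # Pick a chromosome number with the rand_int helper defined above.
    chrom = (rand_int(1,22))
    # Simulate genotypes for the parent options across the requested number of populations.
    mt = hl.balding_nichols_model(pops, n_parentoptions, n_variants, af_dist = hl.rand_beta(0.25,0.25))
    # Randomly choose which allele of each call is labeled as the grandmother's.
    grandmother_smaller = hl.rand_bool(0.5)
    mt = mt.annotate_entries(grandmother = mt.GT[hl.if_else(grandmother_smaller, 0, 1)],
                             grandfather = mt.GT[hl.if_else(grandmother_smaller, 1, 0)])
    # Rebuild each call as phased: grandmother's allele first, grandfather's second.
    mt = mt.annotate_entries(GT = hl.call(mt.grandmother, mt.grandfather, phased = True))
    # Re-key the rows on a GRCh38 locus ('chr' + chromosome number).
    mt = mt.key_rows_by(
        locus = hl.locus('chr' + hl.str(chrom), mt.locus.position),
        alleles = mt.alleles)

    # Collapse the matrix table into a table with one array of entries ('option') per row.
    ht = mt.localize_entries('option', 'columns')
    return ht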
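
For intuition about what a Balding-Nichols simulation produces, here is a standard-library sketch of the model. It assumes the usual formulation: each variant gets an ancestral allele frequency `p`, each population gets its own frequency drawn from `Beta(p(1-F)/F, (1-p)(1-F)/F)`, and each genotype is the sum of two Bernoulli draws. The function name, the `fst=0.1` value, and the uniform population assignment are assumptions made for illustration, not a description of Hail's internals.

```python
import random

def balding_nichols_py(n_pops, n_samples, n_variants, fst=0.1, seed=0):
    """Toy Balding-Nichols simulation (hypothetical helper, not Hail's implementation)."""
    rng = random.Random(seed)
    # Assign each sample to a population uniformly at random.
    pops = [rng.randrange(n_pops) for _ in range(n_samples)]
    genotypes = []  # genotypes[v][s] = alternate-allele count (0, 1, or 2)
    for _ in range(n_variants):
        # Ancestral frequency from Beta(0.25, 0.25), clamped away from exactly 0 or 1.
        p = min(max(rng.betavariate(0.25, 0.25), 1e-6), 1 - 1e-6)
        a = p * (1 - fst) / fst
        b = (1 - p) * (1 - fst) / fst
        pop_af = [rng.betavariate(a, b) for _ in range(n_pops)]
        # Each genotype is the sum of two independent allele draws.
        genotypes.append([(rng.random() < pop_af[k]) + (rng.random() < pop_af[k])
                          for k in pops])
    return pops, genotypes

pops, genotypes = balding_nichols_py(n_pops=3, n_samples=10, n_variants=50)
```

A U-shaped ancestral distribution such as Beta(0.25, 0.25) puts many variants near frequency 0 or 1, which is why a large share of simulated sites can be dropped later as monomorphic or rare.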

In [4]:
parent = make_parents(3,100,100000)

2022-08-17 14:50:09 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 100 samples, and 100000 variants...


Here's what one of the parent options looks like! There are 100 indexed columns in this table, each with unique data for each parent.

In [5]:
parent.option[34].show()

2022-08-17 14:50:14 Hail: INFO: Ordering unsorted dataset with network shuffle


locus,alleles,GT,grandmother,grandfather
locus<GRCh38>,array<str>,call,int32,int32
chr1:23,"[""A"",""C""]",1|0,1,0
chr1:76,"[""A"",""C""]",0|0,0,0
chr1:80,"[""A"",""C""]",0|1,0,1
chr1:90,"[""A"",""C""]",1|1,1,1
chr1:102,"[""A"",""C""]",0|0,0,0
chr1:120,"[""A"",""C""]",1|1,1,1
chr1:131,"[""A"",""C""]",0|0,0,0
chr1:140,"[""A"",""C""]",1|1,1,1
chr1:148,"[""A"",""C""]",1|1,1,1
chr1:149,"[""A"",""C""]",1|0,1,0


We can now generate any number of offspring genomes from any number of parent pairings using the make_children function. make_children returns a matrix table with entries for the original parent options and for the children produced. Here, we will use a list of 50 unique parent pairings to generate 50 unique children, and a list of 76 parent pairings with repetition to create 76 children, including some siblings and half siblings. This second dataset should reveal how our relatedness tests hold up against samples with both population structure and familial relationships.

First, we download the genetic map from LocusZoom at the University of Michigan.

In [3]:
![ -e recomb-hg38.tar.gz ] || curl -fSLO http://csg.sph.umich.edu/locuszoom/download/recomb-hg38.tar.gz
![ -e recomb-hg38/genetic_map_GRCh38_merged.tab ] || tar -xvzf recomb-hg38.tar.gz
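
The genetic map lists, for each physical position, a cumulative genetic position in centimorgans (`pos_cm`). The sketch below shows one way such a map can drive crossover placement, in the spirit of make_children: draw a Poisson number of candidate crossovers with mean `max(pos_cm) / 25`, pick map intervals in proportion to their genetic length, place each candidate uniformly inside its interval, and keep every fourth one. The toy map values and the `sample_crossovers` name are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genetic map (made-up values): physical position in bp, cumulative position in cM.
pos = np.array([1, 1_000, 5_000, 20_000, 50_000])
pos_cm = np.array([0.0, 10.0, 60.0, 90.0, 200.0])

# Share of the chromosome's genetic length covered by the interval ending at each position.
interval_prob = np.diff(pos_cm, prepend=0.0) / pos_cm.max()
# Physical length of the interval ending at each position (0 for the first row).
interval_len = np.diff(pos, prepend=pos[0])

def sample_crossovers(rng):
    """Return sorted crossover positions for one gamete on the toy chromosome."""
    n = rng.poisson(pos_cm.max() / 25)
    if n == 0:
        return np.array([], dtype=int)
    # Choose the interval (by its upper end) for each candidate crossover.
    ends = rng.choice(len(pos), size=n, p=interval_prob)
    # Place each candidate uniformly at random inside its interval.
    points = pos[ends] - rng.integers(0, interval_len[ends])
    # Keep every fourth candidate, then sort along the chromosome.
    return np.sort(points[::4])

print(sample_crossovers(rng))
```

Keeping one candidate in four from a Poisson draw with mean `cM / 25` leaves roughly one crossover per 100 cM on average.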

In [6]:
def make_children(parentops, pairs):
    allchrm = hl.import_table('recomb-hg38/genetic_map_GRCh38_merged.tab', impute = True, min_partitions = 16)
    allchrm = allchrm.annotate(chrom = allchrm.chrom[3:])
    allchrm = allchrm.annotate(chrom = hl.if_else(allchrm.chrom == 'X', 23, hl.int(allchrm.chrom)))
    allchrm = allchrm.filter(allchrm.chrom != 23)
    allchrms = allchrm.to_pandas()
    allchrms = allchrms.sort_values(by = ['chrom', 'pos'])

    chrmlist = pd.unique(allchrms['chrom'])
    chrm = []
    for i in chrmlist:
        i = allchrms.loc[allchrms['chrom'] == i]
        chrm.append(i)

    frames = range(22)

    for frame in frames:
        position1 = pd.DataFrame({'chrom': frame + 1, 'pos': 1, 'recomb_rate': 0, "pos_cm": 0}, index = [0])
        chrm[frame] = pd.concat([position1, chrm[frame][:]]).reset_index(drop = True)
        chrm[frame] = chrm[frame].sort_values(by = 'pos')

    poissonmom = []
    poissondad = []
    for frame in frames:
        chrm[frame]['cM_from_last'] = chrm[frame]['pos_cm'] - chrm[frame]['pos_cm'].shift(periods = +1)
        chrm[frame]['prob for interval'] = chrm[frame]['cM_from_last'] / chrm[frame]['pos_cm'].max()
        chrm[frame]['dist_from_last'] = chrm[frame]['pos'] - chrm[frame]['pos'].shift(periods = +1)
        chrm[frame]['for_each_pos'] = chrm[frame]['prob for interval'] / chrm[frame]['dist_from_last']
        chrm[frame] = chrm[frame].fillna(0)
    paireggcrosses = []
    pairspermcrosses = []
    for i in pairs:
        eggcrosses = []
        spermcrosses = []
        for frame in frames:
        
            num_crossesmom = np.random.poisson(lam = chrm[frame]['pos_cm'].max() / 25)
            num_crossesdad = np.random.poisson(lam = chrm[frame]['pos_cm'].max() / 25)
            poissonmom.append(num_crossesmom)
            poissondad.append(num_crossesdad)

            # Draw one map interval per crossover, weighted by each interval's share of genetic length.
            # Use this pair's own Poisson draws for the egg and the sperm.
            eggchrmcrossint = np.random.choice(chrm[frame]['pos'],
                                               size = num_crossesmom, p = chrm[frame]['prob for interval'])
            spermchrmcrossint = np.random.choice(chrm[frame]['pos'],
                                                 size = num_crossesdad, p = chrm[frame]['prob for interval'])
            eggchrmcross=[]
            spermchrmcross=[]
            # Place each crossover uniformly at random within its chosen interval.
            for pos in eggchrmcrossint:
                eggchrmcrosses = pos-np.random.randint(chrm[frame].loc[chrm[frame]['pos']==pos]['dist_from_last'])
                eggchrmcross.append(eggchrmcrosses[0])
            for pos in spermchrmcrossint:
                spermchrmcrosses = pos-np.random.randint(chrm[frame].loc[chrm[frame]['pos']==pos]['dist_from_last'])
                spermchrmcross.append(spermchrmcrosses[0])

            # Keep every fourth candidate crossover.
            supperlimit = [spermchrmcross[i * 4] for i in range(math.ceil(len(spermchrmcross) /4))]
            eupperlimit = [eggchrmcross[i * 4] for i in range(math.ceil(len(eggchrmcross) / 4))]

            spermcross = pd.DataFrame({'upper limit':supperlimit})
            eggcross = pd.DataFrame({'upper limit':eupperlimit})

            eggcross = eggcross.sort_values(by = 'upper limit')
            spermcross = spermcross.sort_values(by = 'upper limit')

            eggcross['segment'] = range(1, math.ceil(len(eggchrmcross) / 4 )+ 1, 1)
            spermcross['segment'] = range(1, math.ceil(len(spermchrmcross) / 4 )+ 1, 1)

            eggcross['lower limit'] = eggcross['upper limit'].shift(periods = +1)
            spermcross['lower limit'] = spermcross['upper limit'].shift(periods = +1)

            eggcross = eggcross.fillna(1)
            spermcross = spermcross.fillna(1)

            eggcross['lower limit'] = eggcross['lower limit'].astype(int)
            spermcross['lower limit'] = spermcross['lower limit'].astype(int)

            egglastchunk = pd.DataFrame({'segment': math.ceil(poissonmom[frame]/4) + 1, 
                                     "upper limit": chrm[frame]['pos'].max(),
                                 'lower limit': eggcross['upper limit'].max()}, index = [0])
            spermlastchunk = pd.DataFrame({'segment': math.ceil(poissondad[frame]/4) + 1, "upper limit": chrm[frame]['pos'].max(),
                                   'lower limit': spermcross['upper limit'].max(),
                                       }, index = [0])
        
            eggcross = pd.concat([eggcross, egglastchunk], ignore_index = True, axis = 0)
            spermcross = pd.concat([spermcross, spermlastchunk], ignore_index=True, axis = 0)

            eggcross = eggcross[['segment', 'lower limit', 'upper limit']]
            spermcross = spermcross[['segment', 'lower limit', 'upper limit']]

            eggcrosses.append(list(eggcross['lower limit']))
            spermcrosses.append(list(spermcross['lower limit']))
            
        paireggcrosses.append(eggcrosses)
        pairspermcrosses.append(spermcrosses)

    parentops=parentops.annotate_globals(pairs = pairs, paireggcrosses = paireggcrosses,
                                         pairspermcrosses=pairspermcrosses) 
    
    parentops=parentops.annotate_globals(eggbase = parentops.paireggcrosses.map(lambda x:x.map(lambda y: hl.int(hl.rand_bool(0.5)))),
                                         spermbase = parentops.pairspermcrosses.map(lambda x:x.map(lambda y: hl.int(hl.rand_bool(0.5)))))
    
                                        
    chromosome_index_dict = hl.literal({contig: index for index, contig in enumerate(parentops.locus.dtype.reference_genome.contigs)})
   
    parentops = parentops.annotate(frame_index = chromosome_index_dict[parentops.locus.contig])

    def do_for_each_pair(index_pairs):                                
        index = index_pairs[0]
        pairs = index_pairs[1]
        momindex = pairs[0]
        dadindex = pairs[1]
        frame_index = parentops.frame_index
        egg_segments = parentops.paireggcrosses[index][frame_index]
        sperm_segments = parentops.pairspermcrosses[index][frame_index]
        position = parentops.locus.position
        egg_segment_index = hl.binary_search(egg_segments, position)
        sperm_segment_index = hl.binary_search(sperm_segments, position)

        egg_is_base = egg_segment_index % 2 == 1
        egg_base = parentops.eggbase[index][frame_index]
        sperm_is_base = sperm_segment_index % 2 == 1
        sperm_base = parentops.spermbase[index][frame_index]
        mombaseoralt = hl.if_else(egg_is_base, egg_base, 1-egg_base)
        dadbaseoralt = hl.if_else(sperm_is_base, sperm_base, 1-sperm_base)

        allelefrommom = parentops.option[momindex].GT[mombaseoralt]
        allelefromdad = parentops.option[dadindex].GT[dadbaseoralt]                                
        child = hl.call(allelefrommom, allelefromdad, phased = True)                                 

        return hl.struct(GT = child)
    
    parentops = parentops.annotate(new_children = hl.enumerate(parentops.pairs).map(do_for_each_pair), 
                                   parent_GT = parentops.option.map(lambda entry: entry.select('GT'))) 
    parentops = parentops.annotate_globals(new_children_ids = hl.enumerate(parentops.pairs).map(lambda idx_and_pair:
                                                                                              hl.struct(s = hl.str('child_') + hl.str(idx_and_pair[0]),
                                                                                                        mother = idx_and_pair[1][0],
                                                                                                        father = idx_and_pair[1][1],
                                                                                                        pop = hl.set([
                                                                                                            parentops.columns[idx_and_pair[1][0]].pop,
                                                                                                            parentops.columns[idx_and_pair[1][1]].pop
                                                                                                        ]))),
                                           parent_ids=parentops.columns.map(lambda column: hl.struct(s = hl.str('parent_')+hl.str(column.sample_idx),
                                                                                                     mother = hl.missing('int32'),
                                                                                                     father = hl.missing('int32'),
                                                                                                     pop = hl.set([column.pop]))))
    parentops = parentops.annotate(allGT = parentops.parent_GT.extend(parentops.new_children))
    parentops = parentops.annotate_globals(allIDs = parentops.parent_ids.extend(parentops.new_children_ids))
    parentops = parentops.drop('new_children', 'parent_ids', 'parent_GT', 'new_children_ids','option')
    mt = parentops._unlocalize_entries('allGT', 'allIDs', ['s'])
  
    return mt
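
The inner do_for_each_pair function decides which parental haplotype a child inherits at each locus by locating the locus among the sorted segment start positions and alternating haplotypes from one segment to the next. A plain-Python sketch of that idea follows. It assumes `hl.binary_search` behaves like a left insertion point (as `bisect_left` does); the function names are hypothetical.

```python
from bisect import bisect_left

def transmitted_haplotype(position, segment_starts, base):
    """Return 0 or 1: which parental haplotype is copied at this position."""
    segment_index = bisect_left(segment_starts, position)
    on_base = segment_index % 2 == 1  # odd segments copy the starting haplotype
    return base if on_base else 1 - base

def gamete(parent_gt, positions, segment_starts, base):
    """Build the alleles one parent passes on, given phased (allele0, allele1) calls."""
    return [gt[transmitted_haplotype(p, segment_starts, base)]
            for gt, p in zip(parent_gt, positions)]

# One crossover at position 500: haplotype 0 before it, haplotype 1 after it.
positions = [100, 400, 600, 900]
parent_gt = [(0, 1), (1, 0), (0, 1), (1, 1)]
print(gamete(parent_gt, positions, segment_starts=[1, 500], base=0))  # [0, 1, 1, 1]
```

A child's phased call is then the pair (allele from the mother's gamete, allele from the father's gamete) at each locus.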

In [7]:
kids = [(i*2, i*2 + 1) for i in range(50)]

In [8]:
pairs = [(0,1),(0,2),(45, 34), (45,78),(25, 92), (34, 29), (45, 34), (45, 34), (84,32), (84,32), (84,32), (84,32), (84,78), (84,56), (34,39),
          (0,1),(89,34),(89,34),(89,34),(89,45),(89,35),(89,35),(89,78),(89,36),(42,57),(42,57),(34,56),(25,92),(92,34),(56,89),
         (76,19),(76,19),(76,19),(76,19),(76,19),(76,19),(76,19), (34, 29), (34, 29), (34, 29), (34, 29), (34, 29), (34, 29),
         (34, 29),(45, 34),(45, 34),(45, 34),(45, 34),(45, 34),(45, 34),(45, 34),(90,34),(90,34),(90,34),(90,34),(90,34),(90,34),(90,34),
         (90,34),(90,34),(90,34), (23,45),(19,57),(12,0),(98,14),(73,29),(38,74),(84,29),(10,39),(71,30),(48,94),(75,92),(48,17),(58,28),(48,20),(10,29)]

In [9]:
children = make_children(parent, kids)

2022-08-17 14:50:16 Hail: INFO: Reading table to impute column types
2022-08-17 14:50:19 Hail: INFO: Finished type imputation
  Loading field 'chrom' as type str (imputed)
  Loading field 'pos' as type int32 (imputed)
  Loading field 'recomb_rate' as type float64 (imputed)
  Loading field 'pos_cm' as type float64 (imputed)

In [10]:
children2= make_children(parent, pairs)

2022-08-17 14:50:48 Hail: INFO: Reading table to impute column types
2022-08-17 14:50:50 Hail: INFO: Finished type imputation
  Loading field 'chrom' as type str (imputed)
  Loading field 'pos' as type int32 (imputed)
  Loading field 'recomb_rate' as type float64 (imputed)
  Loading field 'pos_cm' as type float64 (imputed)

We now have three datasets: parent, a table containing the parent genomes; children, a matrix table containing the parent and child genomes; and children2, a matrix table with parent, child, sibling, and half-sibling genomes. We will run a principal component analysis on all three datasets to reveal our sample populations. Note that we must first convert the parent table into a matrix table and do some quality control.

In [11]:
parentmt = parent._unlocalize_entries('option', 'columns', ['sample_idx'])

In [12]:
qcparentmt = hl.variant_qc(parentmt)
qcparentmt = qcparentmt.filter_rows((qcparentmt.variant_qc.AF[1] > 0.05) & (qcparentmt.variant_qc.AF[1] < 0.95))
qcparentmt = qcparentmt.filter_rows(qcparentmt.variant_qc.p_value_hwe > 1e-4)
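
To make the two filters concrete, here is a small standard-library sketch of the same idea applied to one variant at a time: keep the variant only if its alternate-allele frequency lies strictly between 0.05 and 0.95 and its Hardy-Weinberg p-value exceeds 1e-4. This sketch uses a one-degree-of-freedom chi-square approximation for the Hardy-Weinberg test, which is a simplification chosen for illustration and may differ from the test behind `variant_qc.p_value_hwe`. All function names are hypothetical.

```python
import math

def alt_allele_frequency(genotypes):
    """genotypes: list of alternate-allele counts (0, 1, or 2), no missing values."""
    return sum(genotypes) / (2 * len(genotypes))

def hwe_chi2_p_value(genotypes):
    """Chi-square (1 dof) approximation to a Hardy-Weinberg equilibrium test."""
    n = len(genotypes)
    p = alt_allele_frequency(genotypes)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, af_min=0.05, af_max=0.95, hwe_min=1e-4):
    af = alt_allele_frequency(genotypes)
    return af_min < af < af_max and hwe_chi2_p_value(genotypes) > hwe_min

in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25  # matches HWE proportions at p = 0.5
all_heterozygous = [1] * 100                     # far from HWE
rare_variant = [0] * 99 + [1]                    # allele frequency 0.005
print(passes_qc(in_equilibrium), passes_qc(all_heterozygous), passes_qc(rare_variant))
```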

In [13]:
parents_evals, parents_scores, parents_loadings = hl.hwe_normalized_pca(qcparentmt.GT, k=2, compute_loadings=True)
qcparentmt = qcparentmt.annotate_cols(scores = parents_scores[qcparentmt.sample_idx].scores)
parentpca = hl.plot.scatter(qcparentmt.scores[0],
                    qcparentmt.scores[1],
                    title='Parent PCA', xlabel='PC1', ylabel='PC2',
                   label=qcparentmt.pop)
show(parentpca)

num partitions: 8
8


2022-08-17 14:51:25 Hail: INFO: hwe_normalize: found 46596 variants after filtering out monomorphic sites.
2022-08-17 14:51:27 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:51:28 Hail: INFO: pca: running PCA with 2 components...
2022-08-17 14:51:30 Hail: INFO: Coerced sorted dataset
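
As a rough picture of what an HWE-normalized PCA computes, the NumPy sketch below centers each variant at twice its allele frequency, scales it by the standard deviation expected under Hardy-Weinberg equilibrium (and by the square root of the variant count), and takes a singular value decomposition. The exact normalization and return conventions are assumptions based on the standard formulation; `hwe_normalized_pca_np` is a hypothetical name, and the two toy populations are simulated here purely for illustration.

```python
import numpy as np

def hwe_normalized_pca_np(genotypes, k=2):
    """genotypes: (n_variants, n_samples) array of alt-allele counts, no missing values."""
    g = np.asarray(genotypes, dtype=float)
    af = g.mean(axis=1) / 2
    polymorphic = (af > 0) & (af < 1)  # monomorphic sites carry no information
    g, af = g[polymorphic], af[polymorphic][:, None]
    m = g.shape[0]
    x = (g - 2 * af) / np.sqrt(m * 2 * af * (1 - af))
    # Rows of x.T are samples, so U * S gives sample scores and V gives variant loadings.
    u, s, vt = np.linalg.svd(x.T, full_matrices=False)
    return s[:k] ** 2, u[:, :k] * s[:k], vt[:k].T

rng = np.random.default_rng(0)
af_a = rng.uniform(0.1, 0.9, size=500)
af_b = 1 - af_a  # a second, strongly differentiated toy population
pop_a = rng.binomial(2, af_a[:, None], size=(500, 20))
pop_b = rng.binomial(2, af_b[:, None], size=(500, 20))
eigenvalues, scores, loadings = hwe_normalized_pca_np(np.hstack([pop_a, pop_b]))
print(scores[:3, 0], scores[-3:, 0])  # the two populations fall on opposite sides of PC1
```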


In [14]:
qcchildren = hl.variant_qc(children)
qcchildren = qcchildren.filter_rows((qcchildren.variant_qc.AF[1] > 0.05) & (qcchildren.variant_qc.AF[1] < 0.95))
qcchildren = qcchildren.filter_rows(qcchildren.variant_qc.p_value_hwe > 1e-4)

normalized_children=hl.hwe_normalized_pca(qcchildren.GT)
qcchildren = qcchildren.annotate_cols(scores = normalized_children[1][qcchildren.s].scores)
childrenpca = hl.plot.scatter(qcchildren.scores[0],
                    qcchildren.scores[1],
                    title='Parent/Children PCA', xlabel='PC1', ylabel='PC2',
                   label=qcchildren.pop)
show(childrenpca)

num partitions: 8
8


2022-08-17 14:51:35 Hail: INFO: hwe_normalize: found 46998 variants after filtering out monomorphic sites.
2022-08-17 14:51:37 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:51:38 Hail: INFO: pca: running PCA with 10 components...

In [15]:
qcchildren2 = hl.variant_qc(children2)
qcchildren2 = qcchildren2.filter_rows((qcchildren2.variant_qc.AF[1] > 0.05) & (qcchildren2.variant_qc.AF[1] < 0.95))
qcchildren2 = qcchildren2.filter_rows(qcchildren2.variant_qc.p_value_hwe > 1e-4)

normalized_children2=hl.hwe_normalized_pca(qcchildren2.GT)
qcchildren2 = qcchildren2.annotate_cols(scores = normalized_children2[1][qcchildren2.s].scores)
childrenpca2 = hl.plot.scatter(qcchildren2.scores[0],
                    qcchildren2.scores[1],
                    title='Parent/Children PCA', xlabel='PC1', ylabel='PC2',
                   label=qcchildren2.pop)
show(childrenpca2)

num partitions: 8
8


2022-08-17 14:51:55 Hail: INFO: hwe_normalize: found 46839 variants after filtering out monomorphic sites.
2022-08-17 14:51:56 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:51:57 Hail: INFO: pca: running PCA with 10 components...


Yikes! On the dataset with more complex family structures, we can only vaguely see the three original populations, and we can only roughly identify that children land between their parent populations. The populations no longer form tight clusters as one would expect, and the children do not fall directly between their parent populations. This is because a PCA can interpret the children's relatedness as new populations being formed, which distorts the PCs. When a dataset contains complex family structures, we must take the loadings computed in the parent PCA and project the children onto those PCs instead of computing new ones. This gives a much cleaner PCA result.

In [16]:
qt= children
qt=hl.variant_qc(qt)
scores = hl.experimental.pc_project(qt.GT, parents_loadings.loadings, qt.variant_qc.AF[1])
qt=qt.annotate_cols(scores = scores[qt.s].scores)
p = hl.plot.scatter(qt.scores[0],
                    qt.scores[1],
                    title='PCA', xlabel='PC1', ylabel='PC2',
                   label=qt.pop)
show(p)

2022-08-17 14:52:01 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2022-08-17 14:52:03 Hail: INFO: Ordering unsorted dataset with network shuffle
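
Projecting onto existing PCs amounts to normalizing the new genotypes with the reference allele frequencies and multiplying by the reference loadings. The NumPy sketch below illustrates this under the assumption that projection uses the same HWE normalization as the PCA itself; the data are simulated and the helper names are hypothetical. As a sanity check, projecting the reference samples themselves reproduces their original scores.

```python
import numpy as np

def normalize(g, af):
    """HWE-normalize an (n_variants, n_samples) genotype array with given frequencies."""
    m = g.shape[0]
    af = af[:, None]
    return (g - 2 * af) / np.sqrt(m * 2 * af * (1 - af))

def pc_project_np(g, loadings, af):
    """Project samples onto fixed loadings: one row of scores per sample."""
    return normalize(g, af).T @ loadings

rng = np.random.default_rng(1)
true_af = rng.uniform(0.2, 0.8, size=300)
ref = rng.binomial(2, true_af[:, None], size=(300, 30)).astype(float)  # reference samples
new = rng.binomial(2, true_af[:, None], size=(300, 5)).astype(float)   # samples to project

# Keep variants that are polymorphic in the reference panel.
af = ref.mean(axis=1) / 2
keep = (af > 0) & (af < 1)
ref, new, af = ref[keep], new[keep], af[keep]

# PCA on the reference panel only.
u, s, vt = np.linalg.svd(normalize(ref, af).T, full_matrices=False)
loadings = vt[:2].T
ref_scores = u[:, :2] * s[:2]

projected_ref = pc_project_np(ref, loadings, af)
new_scores = pc_project_np(new, loadings, af)
print(np.allclose(projected_ref, ref_scores))
```

Because the loadings are fixed, adding large families to the projected set cannot reshape the axes, which is the property the parent-based projection relies on.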

Now, we will run pc_relate on the parent genomes to look for relatedness among the parent options.

In [17]:
parentrelations = hl.pc_relate(parentmt.GT, 0.01, k=10) 
k= hl.plot.scatter(parentrelations.ibd0, parentrelations.kin, xlabel='ibd0', ylabel='kin',
                  hover_fields={'sample1':parentrelations.i, 'sample2':parentrelations.j})
show(k)

num partitions: 8
8


2022-08-17 14:52:06 Hail: INFO: hwe_normalize: found 64977 variants after filtering out monomorphic sites.
2022-08-17 14:52:07 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:08 Hail: INFO: pca: running PCA with 10 components...
2022-08-17 14:52:14 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:16 Hail: INFO: Wrote all 25 blocks of 100000 x 100 matrix with block size 4096.
2022-08-17 14:52:17 Hail: INFO: wrote matrix with 11 rows and 100000 columns as 25 blocks of size 4096 to /tmp/pcrelate-write-read-mZiSVLipkfgSgXRBLqYwD5.bm
2022-08-17 14:52:17 Hail: INFO: wrote matrix with 100000 rows and 100 columns as 25 blocks of size 4096 to /tmp/pcrelate-write-read-N76tkkv21nT67m4tZd24S6.bm
2022-08-17 14:52:18 Hail: INFO: wrote matrix with 100000 rows and 100 columns as 25 blocks of size 4096 to /tmp/pcrelate-write-read-JaEzrVDuAYtmKZmAxPRjdK.bm
2022-08-17 14:52:18 Hail: INFO: wrote matrix with 100 rows and 100 columns as 1 block of size 
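
When reading an ibd0 versus kin scatter plot, it helps to know where each relationship is expected to land. For outbred individuals, the kinship coefficient follows from the IBD-sharing probabilities as `kin = ibd1 / 4 + ibd2 / 2`. The short sketch below tabulates the textbook expectations; the dictionary and function names are only illustrative.

```python
# Expected (ibd0, ibd1, ibd2) sharing probabilities for outbred relatives.
EXPECTED_IBD = {
    'parent-child': (0.0, 1.0, 0.0),
    'sibling': (0.25, 0.5, 0.25),
    'half-sibling': (0.5, 0.5, 0.0),
    'unrelated': (1.0, 0.0, 0.0),
}

def kinship(ibd0, ibd1, ibd2):
    """Kinship coefficient implied by IBD-sharing probabilities."""
    return ibd1 / 4 + ibd2 / 2

for relationship, ibd in EXPECTED_IBD.items():
    print(f'{relationship}: ibd0={ibd[0]}, kin={kinship(*ibd)}')
```

Parent-child pairs and full siblings share the same expected kinship (0.25) and are separated only by ibd0, which is why both axes are plotted.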

We can do the same for the parent/children and parent/children/siblings datasets, but we must first define a function that characterizes the relationship between two samples. This lets us color-code pairs by their known relationship and observe where they fall on the graph. We also do some light quality control, filtering out variants with extreme allele frequencies or exceptionally low Hardy-Weinberg p-values.

In [18]:
def relationship_type(sample1id, sample2id, mt):
    mt2 = mt.add_col_index()
    sample_id_to_idx = hl.literal(mt2.aggregate_cols(hl.dict(hl.agg.collect((mt2.s, mt2.col_idx)))))
    children_to_parents = hl.literal(mt2.aggregate_cols(
        hl.dict(hl.agg.collect((mt2.col_idx, hl.set(hl.if_else(hl.is_defined(mt2.mother), [mt2.mother, mt2.father], hl.empty_array('int32')))))),
        _localize=False))
    sample1 = sample_id_to_idx[sample1id]
    sample2 = sample_id_to_idx[sample2id]
    sample1_parents = children_to_parents[sample1]
    sample2_parents = children_to_parents[sample2]
    return (hl.case()
        .when(sample1id == sample2id, 'self')
        .when(sample1_parents.contains(sample2) | sample2_parents.contains(sample1), 'parent-child')
        .when((sample1_parents.size()>0)&(sample1_parents == sample2_parents)& (sample1id != sample2id), 'sibling')
        .when(sample1_parents.intersection(sample2_parents).size() == 1, 'half-sibling')
        .default('unrelated'))
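
The decision order in relationship_type is easier to follow in plain Python. The sketch below mirrors the same sequence of checks using ordinary sets; the sample names and the `parents` dictionary are made up for illustration, and samples without recorded parents map to an empty set.

```python
def relationship_type_py(sample1, sample2, parents):
    """Classify a pair of samples given a dict: sample -> frozenset of its parents."""
    p1 = parents.get(sample1, frozenset())
    p2 = parents.get(sample2, frozenset())
    if sample1 == sample2:
        return 'self'
    if sample2 in p1 or sample1 in p2:
        return 'parent-child'
    if p1 and p1 == p2:
        return 'sibling'
    if len(p1 & p2) == 1:
        return 'half-sibling'
    return 'unrelated'

# Hypothetical pedigree: child_0 and child_1 are full siblings, child_2 shares only one parent.
parents = {
    'child_0': frozenset({'parent_0', 'parent_1'}),
    'child_1': frozenset({'parent_0', 'parent_1'}),
    'child_2': frozenset({'parent_0', 'parent_2'}),
}
print(relationship_type_py('child_0', 'child_1', parents))  # sibling
print(relationship_type_py('child_0', 'child_2', parents))  # half-sibling
```

Note that this scheme labels more distant relatives, such as cousins, as unrelated; neither simulated dataset contains any, since every child comes directly from the original parent options.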

In [19]:
qt = children
qt = hl.variant_qc(qt)
scores = hl.experimental.pc_project(qt.GT, parents_loadings.loadings, qt.variant_qc.AF[1])
qt = qt.filter_rows((qt.variant_qc.AF[1] > 0.01) & (qt.variant_qc.AF[1] < 0.99))
qt = qt.filter_rows(qt.variant_qc.p_value_hwe > 1e-4)
qcrelatednesstest = hl.pc_relate(qt.GT, 0.01, scores_expr=scores[qt.s].scores)
qcrelatednesstest= qcrelatednesstest.annotate(relationship=relationship_type(qcrelatednesstest.i.s, qcrelatednesstest.j.s,qt))
q = hl.plot.scatter(qcrelatednesstest.ibd0, qcrelatednesstest.kin, label=qcrelatednesstest.relationship,
                   xlabel='ibd0', ylabel='kin',
                   hover_fields={'sample1':qcrelatednesstest.i.s, 'sample2':qcrelatednesstest.j.s})
show(q)

2022-08-17 14:52:26 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:29 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:31 Hail: INFO: Wrote all 15 blocks of 59095 x 150 matrix with block size 4096.
2022-08-17 14:52:33 Hail: INFO: Coerced sorted dataset
2022-08-17 14:52:33 Hail: INFO: Coerced sorted dataset
2022-08-17 14:52:34 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:36 Hail: INFO: wrote matrix with 3 rows and 59095 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-YHoROddol7v7mYyezfRSS4.bm
2022-08-17 14:52:36 Hail: INFO: wrote matrix with 59095 rows and 150 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-Fq073z0Kucm1y80WzIaQTE.bm
2022-08-17 14:52:36 Hail: INFO: wrote matrix with 59095 rows and 150 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-ZnsoxyBxC2K3wgg8EeWFOJ.bm
2022-08-17 14:52:37 Hail: INFO: wrote matrix with 150 rows and 150 columns as 1 block 

pc_relate works very well with the small, simple family structures. However, take a look at the results of running pc_relate on our data set which includes siblings and half siblings.

In [20]:
qt = children2
qt = hl.variant_qc(qt)
scores = hl.experimental.pc_project(qt.GT, parents_loadings.loadings, qt.variant_qc.AF[1])
qt = qt.filter_rows((qt.variant_qc.AF[1] > 0.01) & (qt.variant_qc.AF[1] < 0.99))
qt = qt.filter_rows(qt.variant_qc.p_value_hwe > 1e-4)
qcrelatednesstest = hl.pc_relate(qt.GT, 0.01, scores_expr=scores[qt.s].scores)
qcrelatednesstest= qcrelatednesstest.annotate(relationship=relationship_type(qcrelatednesstest.i.s, qcrelatednesstest.j.s,qt))
q = hl.plot.scatter(qcrelatednesstest.ibd0, qcrelatednesstest.kin, label=qcrelatednesstest.relationship,
                   xlabel='ibd0', ylabel='kin',
                   hover_fields={'sample1':qcrelatednesstest.i.s, 'sample2':qcrelatednesstest.j.s})
show(q)

2022-08-17 14:52:45 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:47 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:49 Hail: INFO: Wrote all 15 blocks of 58841 x 176 matrix with block size 4096.
2022-08-17 14:52:50 Hail: INFO: Coerced sorted dataset
2022-08-17 14:52:50 Hail: INFO: Coerced sorted dataset
2022-08-17 14:52:51 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:52:52 Hail: INFO: wrote matrix with 3 rows and 58841 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-9TXB550xPeI0k1GO68RA0t.bm
2022-08-17 14:52:52 Hail: INFO: wrote matrix with 58841 rows and 176 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-rJruecivPmQxFNuV7GPbC4.bm
2022-08-17 14:52:52 Hail: INFO: wrote matrix with 58841 rows and 176 columns as 15 blocks of size 4096 to /tmp/pcrelate-write-read-4CWrUENeNpxqTvx3RFdtSF.bm
2022-08-17 14:52:53 Hail: INFO: wrote matrix with 176 rows and 176 columns as 1 block of

pc_relate seems to break down in the presence of very large families.

An alternative to pc_relate is the hl.king function, which tests for relatedness in a different way and shows more realistic relatedness measures. Here we use it on both the parent and the complex parent/child datasets.

In [21]:
parentking = hl.king(parentmt.GT)
parentking = parentking.entries()
kinghist = hl.plot.histogram(parentking.phi)
show(kinghist)

2022-08-17 14:53:01 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:03 Hail: INFO: Wrote all 25 blocks of 100000 x 100 matrix with block size 4096.
2022-08-17 14:53:03 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:04 Hail: INFO: Wrote all 25 blocks of 100000 x 100 matrix with block size 4096.
2022-08-17 14:53:04 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:05 Hail: INFO: Wrote all 25 blocks of 100000 x 100 matrix with block size 4096.
2022-08-17 14:53:06 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:07 Hail: INFO: Wrote all 25 blocks of 100000 x 100 matrix with block size 4096.
2022-08-17 14:53:07 Hail: INFO: wrote matrix with 100 rows and 100 columns as 1 block of size 4096 to /tmp/JIpog1iJY5wRtYFAvYXePr
2022-08-17 14:53:08 Hail: INFO: wrote matrix with 100 rows and 100 columns as 1 block of size 4096 to /tmp/DfvhnqtAcfWmKDNAR2mzwT
2022-08-17 14:53:09 Hail: INFO: wrot

In [22]:
childking = hl.king(children2.GT)
childking= childking.annotate_entries(relationship=relationship_type(childking.s_1, childking.s,children2))
childking = childking.entries()
kinghist = ggplot(childking, aes(x=childking.phi, fill=childking.relationship))+ geom_histogram()
kinghist.show()

2022-08-17 14:53:15 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:16 Hail: INFO: Wrote all 25 blocks of 100000 x 176 matrix with block size 4096.
2022-08-17 14:53:17 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:18 Hail: INFO: Wrote all 25 blocks of 100000 x 176 matrix with block size 4096.
2022-08-17 14:53:19 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:20 Hail: INFO: Wrote all 25 blocks of 100000 x 176 matrix with block size 4096.
2022-08-17 14:53:21 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-08-17 14:53:23 Hail: INFO: Wrote all 25 blocks of 100000 x 176 matrix with block size 4096.
2022-08-17 14:53:23 Hail: INFO: wrote matrix with 176 rows and 176 columns as 1 block of size 4096 to /tmp/EoC51KvoMWsArh6jfzJAz7
2022-08-17 14:53:24 Hail: INFO: wrote matrix with 176 rows and 176 columns as 1 block of size 4096 to /tmp/MbuiLJugdMIui5E6aig0lM
2022-08-17 14:53:26 Hail: INFO: wrot
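
To give a feel for the kind of statistic behind the phi values above, here is a standard-library sketch of the KING-robust between-family kinship estimator, `phi = 1/2 - sum((g_i - g_j)^2) / (4 * min(het_i, het_j))`, where `het` counts heterozygous sites. It assumes hl.king follows this formulation and ignores missing data. The toy trio is simulated without linkage, unlike the children built by make_children, and all names are hypothetical.

```python
import random

def king_robust_kinship(g_i, g_j):
    """KING-robust kinship estimate from two lists of alt-allele counts."""
    squared_diff = sum((a - b) ** 2 for a, b in zip(g_i, g_j))
    het_i = sum(1 for a in g_i if a == 1)
    het_j = sum(1 for b in g_j if b == 1)
    return 0.5 - squared_diff / (4 * min(het_i, het_j))

rng = random.Random(0)
n_variants = 20_000
afs = [rng.uniform(0.1, 0.9) for _ in range(n_variants)]

def random_haplotype():
    return [int(rng.random() < p) for p in afs]

def dosage(hap1, hap2):
    return [a + b for a, b in zip(hap1, hap2)]

mom = (random_haplotype(), random_haplotype())
dad = (random_haplotype(), random_haplotype())
# The child receives one allele from each parent at every site, chosen independently.
from_mom = [mom[rng.randrange(2)][v] for v in range(n_variants)]
from_dad = [dad[rng.randrange(2)][v] for v in range(n_variants)]

g_mom, g_dad, g_child = dosage(*mom), dosage(*dad), dosage(from_mom, from_dad)
print(round(king_robust_kinship(g_mom, g_child), 3))  # close to 0.25
print(round(king_robust_kinship(g_mom, g_dad), 3))    # close to 0
```

The estimator uses only the two samples being compared, so it does not rely on allele frequencies or PCs estimated from the whole dataset.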