# Sex check

We will impute the sex of each sample, and compare this with the self-reported gender.

Note the distinction between 'sex' and 'gender' used here: 'sex' refers to the genetic sex, whereas 'gender' is self-reported. We impute the sex from genotype calls on the X chromosome, as described in the section "Impute sex" below. The gender is obtained from a questionnaire given to participants before a blood sample is drawn.

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210607-1040-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

In [4]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
RESOURCES_DIR = '/non-fargen/resources'

## Load FarGen exome data

Load the filtered, high-quality variants.

In [7]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt/')

In [8]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Impute sex

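We impute the sex of the samples by computing the inbreeding coefficient ($F$) on the X chromosome. This inbreeding coefficient is calculated as $F = \frac{O - E}{N - E}$, where $O$ is the observed number of homozygotes, $E$ is the expected number of homozygotes, and $N$ is the number of non-missing genotype calls. The expected number of homozygotes is calculated as $E = \sum_i \left(1 - 2 f_i (1 - f_i)\right)$, where $f_i$ is the minor-allele frequency at site $i$ and the sum runs over the sites with a non-missing call. Males carry a single X chromosome and therefore appear homozygous at almost every site ($F$ close to 1), whereas females are expected to have $F$ close to 0.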
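As a minimal sketch of the formula for $F$ (not Hail's implementation), the function below computes the coefficient for a single sample in plain Python. The name `inbreeding_f` and the toy inputs are hypothetical. Genotypes are assumed to be coded as the number of alternate alleles (0, 1 or 2), with `None` for a missing call, and the per-site expectation $1 - 2f(1-f)$ is assumed to be summed over the called sites.

```python
def inbreeding_f(genotypes, allele_freqs):
    """Return F = (O - E) / (N - E) for one sample.

    genotypes:    alternate-allele count per site (0, 1, 2), or None if missing.
    allele_freqs: allele frequency per site, in the same order.
    """
    observed_homs = 0    # O: called sites where the sample is homozygous
    expected_homs = 0.0  # E: sum of per-site expected homozygosity
    n_called = 0         # N: sites with a non-missing call
    for gt, f in zip(genotypes, allele_freqs):
        if gt is None:
            continue  # missing calls contribute to neither O, E nor N
        n_called += 1
        if gt != 1:
            observed_homs += 1
        expected_homs += 1 - 2 * f * (1 - f)
    return (observed_homs - expected_homs) / (n_called - expected_homs)


# Toy data: four sites, each with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
male_like = inbreeding_f([0, 2, 2, 0], freqs)    # homozygous everywhere
female_like = inbreeding_f([0, 1, 2, 1], freqs)  # heterozygous at half the sites
print(male_like, female_like)
```

With these toy inputs the all-homozygous sample gets $F = 1$ and the sample that is heterozygous at half of the sites gets $F = 0$. The sketch does not guard against $N = E$, which would happen if every called site were monomorphic.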

NOTE: the sex imputation method requires diallelic sites.

In [10]:
mt = mt.filter_rows(hl.len(mt.alleles) == 2)

In [11]:
imputed_sex_ht = hl.impute_sex(mt.GT)

2021-06-07 10:41:53 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


The histogram below shows the inbreeding coefficient; the individuals cluster quite clearly.

In [12]:
p = hl.plot.histogram(imputed_sex_ht.f_stat, title='Inbreeding coefficient (F) computed on the X chromosome')
p.plot_width = 800
p.plot_height = 500
show(p)

Based on the plot above, we define new $F$ thresholds for male and female, and do the imputation again: samples with $F < 0.3$ are called female and samples with $F > 0.4$ are called male.
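The effect of the two thresholds used in the next cell can be sketched in plain Python. This is an illustration of the documented behaviour of `female_threshold` and `male_threshold`, not Hail's code; the helper `call_sex` is hypothetical, and the three $F$ values are copied from the output table in this notebook.

```python
def call_sex(f_stat, female_threshold=0.3, male_threshold=0.4):
    """Return True (female), False (male) or None (undetermined)."""
    if f_stat < female_threshold:
        return True
    if f_stat > male_threshold:
        return False
    # F lies between the two thresholds: no call is made.
    return None


examples = [("FN000001", 0.774), ("FN000016", 0.226), ("FN000018", 0.0397)]
calls = {sample: call_sex(f) for sample, f in examples}
print(calls)
```

A sample whose $F$ falls between the two thresholds is left without a call, so placing the thresholds close together on either side of the gap in the histogram keeps the number of undetermined samples small.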

In [13]:
imputed_sex_ht = hl.impute_sex(mt.GT, female_threshold=0.3, male_threshold=0.4)
imputed_sex_ht.show(10)

s,is_female,f_stat,n_called,expected_homs,observed_homs
str,bool,float64,int64,float64,int64
"""FN000001""",False,0.774,31677,30500.0,31410
"""FN000002""",False,0.737,31677,30500.0,31367
"""FN000009""",False,0.7,31677,30500.0,31323
"""FN000011""",True,-0.0836,31672,30500.0,30394
"""FN000012""",True,-0.0256,31672,30500.0,30462
"""FN000014""",True,-0.00412,31670,30500.0,30487
"""FN000015""",False,0.687,31677,30500.0,31308
"""FN000016""",True,0.226,31673,30500.0,30760
"""FN000017""",True,-0.00278,31674,30500.0,30491
"""FN000018""",True,0.0397,31671,30500.0,30539


## Load self-reported gender data

In [31]:
# Read the CSV.
gender_ht = hl.import_table(BASE_DIR + '/data/metadata/fargen_indi-gen.csv', delimiter=',')

# Key the table by individual name.
gender_ht = gender_ht.key_by(gender_ht.IndividualName)

# Use a boolean 'is_female' variable, like in the imputed data.
gender_ht = gender_ht.annotate(is_female = gender_ht.Gender == '0')

2021-06-07 10:49:25 Hail: INFO: Reading table without type imputation
  Loading field 'IndividualName' as type str (not specified)
  Loading field 'Gender' as type str (not specified)


## Compare imputed sex with self-reported gender

In [24]:
imputed_sex_ht = imputed_sex_ht.annotate(is_female_gender=gender_ht[imputed_sex_ht.s].is_female)

In [25]:
imputed_sex_ht.show(10)

2021-06-07 10:45:34 Hail: INFO: Coerced sorted dataset
2021-06-07 10:45:34 Hail: INFO: Coerced sorted dataset
2021-06-07 10:45:34 Hail: INFO: Coerced sorted dataset


s,is_female,f_stat,n_called,expected_homs,observed_homs,is_female_actual,is_female_gender
str,bool,float64,int64,float64,int64,bool,bool
"""FN000001""",False,0.774,31677,30500.0,31410,False,False
"""FN000002""",False,0.737,31677,30500.0,31367,False,False
"""FN000009""",False,0.7,31677,30500.0,31323,False,False
"""FN000011""",True,-0.0836,31672,30500.0,30394,True,True
"""FN000012""",True,-0.0256,31672,30500.0,30462,True,True
"""FN000014""",True,-0.00412,31670,30500.0,30487,True,True
"""FN000015""",False,0.687,31677,30500.0,31308,False,False
"""FN000016""",True,0.226,31673,30500.0,30760,True,True
"""FN000017""",True,-0.00278,31674,30500.0,30491,True,True
"""FN000018""",True,0.0397,31671,30500.0,30539,True,True


In [26]:
sex = pd.Series(imputed_sex_ht.is_female.collect(), name='Imputed')
gender = pd.Series(imputed_sex_ht.is_female_gender.collect(), name='Actual')

# Calculate confusion matrix.
confusion_table = pd.crosstab(sex, gender, margins=True, margins_name='Sum')

2021-06-07 10:45:50 Hail: INFO: Coerced sorted dataset
2021-06-07 10:45:50 Hail: INFO: Coerced sorted dataset


In [27]:
confusion_table

Imputed \ Actual,False,True,Sum
False,187,4,191
True,3,274,277
Sum,190,278,468
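The confusion table can be summarised as a concordance rate. The sketch below uses only the four cell counts shown in the table; the variable names are illustrative.

```python
# Counts copied from the confusion table, keyed by
# (imputed is_female, self-reported is_female).
counts = {
    (False, False): 187,
    (False, True): 4,
    (True, False): 3,
    (True, True): 274,
}

total = sum(counts.values())
concordant = counts[(False, False)] + counts[(True, True)]
discordant = total - concordant
concordance_rate = concordant / total

print(f"{discordant} of {total} samples are discordant; "
      f"concordance = {concordance_rate:.3f}")
```

This gives 7 discordant samples out of 468, a concordance of about 0.985.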


## Inspect discordant samples

We will inspect the samples where the self-reported gender is different from the imputed sex.

As we have seen, the males and females cluster very nicely with respect to the inbreeding coefficient, so we can be confident that we are correctly imputing the sex. Below we see the imputation data for the samples where the imputed sex and the self-reported gender do not match.

In [33]:
imputed_sex_ht.filter(imputed_sex_ht.is_female != imputed_sex_ht.is_female_gender).show()

2021-06-07 10:53:32 Hail: INFO: Coerced sorted dataset
2021-06-07 10:53:32 Hail: INFO: Coerced sorted dataset
2021-06-07 10:53:32 Hail: INFO: Coerced sorted dataset


s,is_female,f_stat,n_called,expected_homs,observed_homs,is_female_actual,is_female_gender,sex_check
str,bool,float64,int64,float64,int64,bool,bool,bool
"""FN000187""",True,0.138,31674,30500.0,30657,False,False,False
"""FN000861""",True,0.0367,31670,30500.0,30534,False,False,False
"""FN000871""",True,0.0566,31672,30500.0,30559,False,False,False
"""FN000884""",False,0.8,31674,30500.0,31438,True,True,False
"""FN000902""",False,0.717,31677,30500.0,31343,True,True,False
"""FN000957""",False,0.638,31676,30500.0,31249,True,True,False
"""FN001127""",False,0.801,31677,30500.0,31442,True,True,False


**FIXME:** can the discordance be explained by poor sample quality?
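As a first, partial look at this question, the sketch below compares the number of called X-chromosome genotypes (`n_called`) of the seven mismatching samples with that of the ten samples shown earlier in this notebook. All values are copied from the tables above. Using `n_called` as a proxy for sample quality is an assumption, and a proper check would need genome-wide per-sample QC metrics for all 468 samples.

```python
# n_called for the samples where imputed sex and self-reported gender differ.
mismatch_n_called = {
    "FN000187": 31674, "FN000861": 31670, "FN000871": 31672,
    "FN000884": 31674, "FN000902": 31677, "FN000957": 31676,
    "FN001127": 31677,
}

# n_called for the first ten samples shown earlier (all of them match).
shown_n_called = {
    "FN000001": 31677, "FN000002": 31677, "FN000009": 31677,
    "FN000011": 31672, "FN000012": 31672, "FN000014": 31670,
    "FN000015": 31677, "FN000016": 31673, "FN000017": 31674,
    "FN000018": 31671,
}


def value_range(n_called_by_sample):
    """Return the (min, max) of the n_called values."""
    values = n_called_by_sample.values()
    return min(values), max(values)


print("mismatching:", value_range(mismatch_n_called))
print("shown matching:", value_range(shown_n_called))
```

On these few rows the two ranges coincide, so the X-chromosome call count alone does not single out the mismatching samples; this does not rule out quality problems visible in other metrics.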