# Sex check

We will impute the sex of each sample, and compare this with the self-reported gender.

Note the distinction between 'sex' and 'gender' used here: 'sex' refers to the genetic sex, whereas 'gender' is self-reported. We impute the sex from the inbreeding coefficient on the X chromosome, as described below. The gender is obtained from a questionnaire given to participants before a blood sample is drawn.

In [2]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210607-1119-0.2.61-3c86d3ba497a.log


In [3]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [4]:
import pandas as pd

In [5]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
RESOURCES_DIR = '/non-fargen/resources'

## Load FarGen exome data

Load the filtered, high-quality variants.

In [6]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt/')

In [7]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Impute sex

We impute the sex of the samples by computing the inbreeding coefficient ($F$) on the X chromosome. This inbreeding coefficient is calculated as $F = \frac{O - E}{N - E}$, where $O$ is the observed number of homozygotes, $E$ is the expected number of homozygotes, and $N$ is the number of non-missing genotype calls. The expected number of homozygotes is the sum of the expected homozygosity over the called variants, $E = \sum_i \left(1 - 2 f_i (1 - f_i)\right)$, where $f_i$ is the minor-allele frequency of variant $i$.
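As a rough illustration of the formula, the sketch below computes $F$ for a single sample in plain Python. The function name `inbreeding_coefficient`, the genotype encoding (number of alternate alleles per site, `None` for a missing call), and the toy frequencies are assumptions made for this example; it is not Hail's implementation.

```python
def inbreeding_coefficient(genotypes, allele_freqs):
    """Compute F = (O - E) / (N - E) for one sample.

    genotypes: number of alternate alleles per site (0, 1 or 2), None if missing.
    allele_freqs: allele frequency f at each site (hypothetical input).
    """
    observed_homs = 0    # O: homozygous calls actually observed
    expected_homs = 0.0  # E: sum of per-site expected homozygosity
    n_called = 0         # N: number of non-missing calls
    for gt, f in zip(genotypes, allele_freqs):
        if gt is None:
            continue  # missing calls contribute to neither O, E nor N
        n_called += 1
        if gt != 1:
            observed_homs += 1
        expected_homs += 1.0 - 2.0 * f * (1.0 - f)
    # Note: undefined (division by zero) if every site is monomorphic.
    return (observed_homs - expected_homs) / (n_called - expected_homs)


# Toy data: four sites, each with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
male_like = [0, 2, 2, 0]    # no heterozygous calls
female_like = [0, 1, 1, 2]  # half of the calls are heterozygous

f_male = inbreeding_coefficient(male_like, freqs)
f_female = inbreeding_coefficient(female_like, freqs)
```

With these toy inputs the all-homozygous sample gets $F = 1$ and the sample with the expected share of heterozygotes gets $F = 0$. This is why $F$ separates the sexes: males carry a single X chromosome, so their X genotypes appear homozygous outside the pseudoautosomal regions.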

NOTE: the sex imputation method requires diallelic sites.

In [8]:
mt = mt.filter_rows(hl.len(mt.alleles) == 2)

In [9]:
imputed_sex_ht = hl.impute_sex(mt.GT)

2021-06-07 11:30:20 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


The histogram below shows the inbreeding coefficient for all samples; the individuals cluster quite clearly.

In [10]:
p = hl.plot.histogram(imputed_sex_ht.f_stat, title='Inbreeding coefficient (F) computed on the X chromosome')
p.plot_width = 800
p.plot_height = 500
show(p)

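Based on the plot above, we define new $F$ thresholds for male and female, and do the imputation again. Samples with $F$ below `female_threshold` are called female, samples with $F$ above `male_threshold` are called male, and samples in between are left uncalled.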

In [19]:
imputed_sex_ht = hl.impute_sex(mt.GT, female_threshold=0.3, male_threshold=0.4)

# Make a new variable 'sex' that is either 'f' or 'm'.
imputed_sex_ht = imputed_sex_ht.annotate(sex = hl.if_else(imputed_sex_ht.is_female, 'f', 'm'))
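The thresholding rule can be sketched in plain Python as follows. The helper `call_sex` is hypothetical and only mirrors the rule described for the two thresholds; four of the example $F$ values are taken from the table of mismatched samples later in this notebook, and 0.35 is an invented value that falls between the thresholds.

```python
def call_sex(f_stat, female_threshold=0.3, male_threshold=0.4):
    """Return 'f', 'm', or None (uncalled) for an inbreeding coefficient."""
    if f_stat is None:
        return None
    if f_stat < female_threshold:
        return 'f'
    if f_stat > male_threshold:
        return 'm'
    return None  # between the thresholds: the sex is left missing


# 0.35 is an invented value; the others appear in the notebook output.
example_f_stats = [0.0367, 0.138, 0.35, 0.638, 0.801]
calls = [call_sex(f) for f in example_f_stats]
```

Keeping a gap between the two thresholds means that ambiguous samples end up with a missing call instead of being forced into one of the two groups.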

## Load self-reported gender data

In [28]:
# Read the CSV.
gender_ht = hl.import_table(BASE_DIR + '/data/metadata/fargen_indi-gen.csv', delimiter=',')

# Key the table by individual name.
gender_ht = gender_ht.key_by(gender_ht.IndividualName)

# Recode the 'Gender' variable into a new 'gender' variable that is either 'm', 'f', or missing.
# The table is read without type imputation, so the codes are compared as the strings '0' and '1'.
gender_ht = gender_ht.transmute(gender = hl.case()
                                        .when(gender_ht.Gender == '0', 'f')
                                        .when(gender_ht.Gender == '1', 'm')
                                        .or_missing())

2021-06-07 11:38:30 Hail: INFO: Reading table without type imputation
  Loading field 'IndividualName' as type str (not specified)
  Loading field 'Gender' as type str (not specified)
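The recoding logic, including the fall-through to a missing value, can be sketched without Hail. The function `recode_gender` is a hypothetical stand-in for the `hl.case()` expression; the codes are strings because the log above reports that `Gender` was loaded as type `str`.

```python
def recode_gender(code):
    """Map the questionnaire code to 'f'/'m'; anything else becomes None."""
    return {'0': 'f', '1': 'm'}.get(code)


# '' and 'NA' are invented examples of unexpected values.
recoded = [recode_gender(c) for c in ['0', '1', '', 'NA']]
```

Mapping unknown codes to a missing value, instead of defaulting them to one of the two labels, keeps data-entry problems visible in the later comparison.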


## Compare imputed sex with self-reported gender

Below we compute a confusion matrix between self-reported gender and imputed sex.

We see that 7 samples have discordant sex and gender. Of these, 4 are reported as female and imputed as male, and 3 are reported as male and imputed as female.

In [30]:
# Annotate the imputed sex table with the self-reported gender.
imputed_sex_ht = imputed_sex_ht.annotate(gender=gender_ht[imputed_sex_ht.s].gender)

# Make Pandas series with the sex and gender.
sex = pd.Series(imputed_sex_ht.sex.collect(), name='Sex')
gender = pd.Series(imputed_sex_ht.gender.collect(), name='Gender')

# Calculate confusion matrix.
confusion_table = pd.crosstab(sex, gender, margins=True, margins_name='Sum')

confusion_table

2021-06-07 11:38:35 Hail: INFO: Coerced sorted dataset


| Sex (rows) / Gender (columns) | f | m | Sum |
|---|---|---|---|
| f | 274 | 3 | 277 |
| m | 4 | 187 | 191 |
| Sum | 278 | 190 | 468 |
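As a small cross-check of the counts in the confusion matrix above, the sketch below rebuilds per-sample labels from the four cell counts, tallies them with `collections.Counter`, and derives the discordance rate. The helper name `confusion_counts` and the reconstruction of labels from counts are choices made for this illustration only.

```python
from collections import Counter


def confusion_counts(sex, gender):
    """Count (imputed sex, reported gender) pairs, like a crosstab."""
    return Counter(zip(sex, gender))


# Cell counts copied from the confusion matrix: (sex, gender) -> n.
cells = {('f', 'f'): 274, ('f', 'm'): 3, ('m', 'f'): 4, ('m', 'm'): 187}

# Expand the counts into one label per sample.
sex, gender = [], []
for (s, g), n in cells.items():
    sex.extend([s] * n)
    gender.extend([g] * n)

counts = confusion_counts(sex, gender)
n_total = sum(counts.values())
n_discordant = sum(n for (s, g), n in counts.items() if s != g)
discordance_rate = n_discordant / n_total
```

The 7 off-diagonal samples out of 468 correspond to a discordance rate of roughly 1.5%.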


## Inspect discordant samples

We will inspect the samples where the self-reported gender is different from the imputed sex.

As we have seen, the males and females cluster very nicely with respect to the inbreeding coefficient, so we can be confident that we are correctly imputing the sex. Below we see the imputation data for the samples where the imputed sex and self-reported gender mismatch.

In [45]:
imputed_sex_ht.filter(imputed_sex_ht.sex != imputed_sex_ht.gender).show()

2021-06-07 11:19:23 Hail: INFO: Coerced sorted dataset


| s | is_female | f_stat | n_called | expected_homs | observed_homs | is_female_actual | is_female_gender | sex_check | sex | gender |
|---|---|---|---|---|---|---|---|---|---|---|
| str | bool | float64 | int64 | float64 | int64 | bool | bool | bool | str | str |
| "FN000187" | True | 0.138 | 31674 | 30500.0 | 30657 | False | False | False | "f" | "m" |
| "FN000861" | True | 0.0367 | 31670 | 30500.0 | 30534 | False | False | False | "f" | "m" |
| "FN000871" | True | 0.0566 | 31672 | 30500.0 | 30559 | False | False | False | "f" | "m" |
| "FN000884" | False | 0.8 | 31674 | 30500.0 | 31438 | True | True | False | "m" | "f" |
| "FN000902" | False | 0.717 | 31677 | 30500.0 | 31343 | True | True | False | "m" | "f" |
| "FN000957" | False | 0.638 | 31676 | 30500.0 | 31249 | True | True | False | "m" | "f" |
| "FN001127" | False | 0.801 | 31677 | 30500.0 | 31442 | True | True | False | "m" | "f" |


**FIXME:** can the discordance be explained by poor sample quality?

In [32]:
# Annotate the table with a boolean sex mismatch variable.
imputed_sex_ht = imputed_sex_ht.annotate(sex_mismatch = imputed_sex_ht.sex != imputed_sex_ht.gender)

# Annotate the matrix table with the same information.
mt = mt.annotate_cols(sex_mismatch = imputed_sex_ht[mt.s].sex_mismatch)

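The plot below shows the mean genotype quality (GQ) against the number of heterozygous calls for each sample, with points labelled by whether the imputed sex and the self-reported gender mismatch.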

In [42]:
p = hl.plot.scatter(mt.sample_qc.gq_stats.mean, mt.sample_qc.n_het, label=mt.sex_mismatch, title='', xlabel='GQ mean', ylabel='# heterozygotes')
p.plot_width = 800
p.plot_height = 500
show(p)

2021-06-07 14:20:38 Hail: INFO: Coerced sorted dataset
2021-06-07 14:20:38 Hail: INFO: Coerced sorted dataset
2021-06-07 14:20:41 Hail: INFO: Coerced sorted dataset
2021-06-07 14:20:41 Hail: INFO: Coerced sorted dataset
