# Sex check

We will impute the sex of each sample, and compare this with the self-reported gender.

Note the distinction between 'sex' and 'gender' used here: 'sex' refers to the genetic sex, whereas 'gender' is self-reported. We impute the sex from the genotype calls on the X chromosome, as described below. The gender is obtained from a questionnaire given to participants before a blood sample is drawn.

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

2021-10-11 10:32:29 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2021-10-11 10:32:30 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
  Compatibility is not guaranteed.


Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-6676655f87-9xllv:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20211011-1032-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

In [4]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

## Load FarGen exome data

Load the filtered, high-quality variants.

In [5]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt/')

In [6]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 148305
Number of samples: 469


## Impute sex

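We impute the sex of the samples by computing the inbreeding coefficient (F) on the X chromosome. This inbreeding coefficient is calculated as $F = \frac{O - E}{N - E}$, where $O$ is the observed number of homozygous calls, $E$ is the expected number of homozygous calls, and $N$ is the number of non-missing genotype calls. Each site with a non-missing call contributes $1 - 2 f (1 - f)$ to the expected count, where $f$ is the minor-allele frequency at that site, so $E = \sum_i \left(1 - 2 f_i (1 - f_i)\right)$ summed over the called sites. Males carry a single X chromosome and therefore have almost no heterozygous calls, which gives $F$ close to 1, whereas females have $F$ close to 0.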
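
As a minimal sketch of the formula above (not Hail's implementation), the function below computes $F$ for one sample from per-site calls and allele frequencies. The function name, the encoding of each call as the number of alternate alleles (with `None` for a missing call), and the toy data are assumptions made for illustration.

```python
def inbreeding_coefficient(calls, freqs):
    """Compute F = (O - E) / (N - E) for a single sample.

    calls: per-site number of alternate alleles (0, 1 or 2), or None if missing.
    freqs: per-site allele frequency f, in the same order as calls.
    """
    observed = 0    # O: observed homozygous calls
    expected = 0.0  # E: expected homozygous calls
    n_called = 0    # N: non-missing calls
    for call, f in zip(calls, freqs):
        if call is None:
            continue  # missing calls contribute to neither O, E nor N
        n_called += 1
        expected += 1.0 - 2.0 * f * (1.0 - f)
        if call != 1:  # 0 and 2 are homozygous, 1 is heterozygous
            observed += 1
    return (observed - expected) / (n_called - expected)


# Toy data: four sites, each with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
male_like = inbreeding_coefficient([0, 2, 2, 0], freqs)    # no heterozygous calls
female_like = inbreeding_coefficient([0, 1, 2, 1], freqs)  # half the calls heterozygous
print(male_like, female_like)  # 1.0 0.0
```

In this toy example the sample without heterozygous calls gets $F = 1$ and the sample with the expected share of heterozygous calls gets $F = 0$, which is the separation we rely on for sex imputation.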

NOTE: the sex imputation method requires diallelic sites.

In [7]:
mt = mt.filter_rows(hl.len(mt.alleles) == 2)

In [8]:
imputed_sex_ht = hl.impute_sex(mt.GT)

2021-10-11 10:32:35 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


Below we plot the inbreeding coefficient. The samples fall into clearly separated clusters.

In [9]:
p = hl.plot.histogram(imputed_sex_ht.f_stat, title='Inbreeding coefficient (F) computed on the X chromosome')
p.plot_width = 800
p.plot_height = 500
show(p)



Based on the plot above, we define new $F$ thresholds for male and female, and do the imputation again.

In [10]:
imputed_sex_ht = hl.impute_sex(mt.GT, female_threshold=0.4, male_threshold=0.4)

# Make a new variable 'sex' that is either 'f' or 'm'.
imputed_sex_ht = imputed_sex_ht.annotate(sex = hl.if_else(imputed_sex_ht.is_female, 'f', 'm'))
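
The thresholding step can be sketched in plain Python. This assumes, following the Hail documentation for `hl.impute_sex`, that a sample is called female when $F$ is below `female_threshold`, male when $F$ is above `male_threshold`, and left missing otherwise. `classify_sex` is a hypothetical helper, not part of Hail.

```python
def classify_sex(f_stat, female_threshold=0.4, male_threshold=0.4):
    """Return 'f', 'm', or None (undetermined) for an inbreeding coefficient."""
    if f_stat is None:
        return None
    if f_stat < female_threshold:
        return 'f'
    if f_stat > male_threshold:
        return 'm'
    return None  # F falls between (or exactly on) the thresholds


# With both thresholds at 0.4 there is no gap between the two classes.
print(classify_sex(-0.08), classify_sex(0.87))  # f m

# With a gap between the thresholds, intermediate values stay undetermined.
print(classify_sex(0.5, female_threshold=0.2, male_threshold=0.8))  # None
```

Setting both thresholds to the same value means that every sample with a defined $F$ receives a call, apart from the edge case of $F$ exactly equal to the threshold.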

## Load self-reported gender data

In [11]:
# Read the CSV.
gender_ht = hl.import_table(BASE_DIR + '/data/metadata/fargen_indi-gen.csv', delimiter=',')

# Key the table by individual name.
gender_ht = gender_ht.key_by(gender_ht.IndividualName)

# Recode the 'Gender' variable into a new 'gender' variable that is 'f' (code '0'),
# 'm' (code '1'), or missing for any other value. transmute() also drops 'Gender'.
gender_ht = gender_ht.transmute(gender = hl.case()
                                        .when(gender_ht.Gender == '0', 'f')
                                        .when(gender_ht.Gender == '1', 'm')
                                        .or_missing())

2021-10-11 10:32:41 Hail: INFO: Reading table without type imputation
  Loading field 'IndividualName' as type str (not specified)
  Loading field 'Gender' as type str (not specified)
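
The recoding logic can be illustrated without Hail. The sketch below assumes, as the code above does, that `'0'` encodes female and `'1'` encodes male. `recode_gender` and the example codes are hypothetical.

```python
def recode_gender(code):
    """Map the raw 'Gender' code to 'f' or 'm'; anything else becomes None (missing)."""
    mapping = {'0': 'f', '1': 'm'}
    return mapping.get(code)


# Hypothetical raw values, including an empty field and an unexpected code.
codes = ['0', '1', '', 'NA']
recoded = [recode_gender(c) for c in codes]
print(recoded)  # ['f', 'm', None, None]
```

Mapping unexpected codes to missing, rather than to a default class, keeps data-entry problems from showing up later as spurious sex mismatches.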


## Compare imputed sex with self-reported gender

Below we compute a confusion matrix between self-reported gender and imputed sex.

We see that 7 samples have discordant sex and gender. Of these, 4 are reported as female and imputed as male, and 3 are reported as male and imputed as female.

In [12]:
# Annotate the imputed sex table with the self-reported gender.
imputed_sex_ht = imputed_sex_ht.annotate(gender=gender_ht[imputed_sex_ht.s].gender)

# Make Pandas series with the sex and gender.
sex = pd.Series(imputed_sex_ht.sex.collect(), name='Sex')
gender = pd.Series(imputed_sex_ht.gender.collect(), name='Gender')

# Calculate confusion matrix.
confusion_table = pd.crosstab(sex, gender, margins=True, margins_name='Sum')

confusion_table

2021-10-11 10:32:42 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:42 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:42 Hail: INFO: Coerced sorted dataset


| Sex (rows) / Gender (columns) | f | m | Sum |
|---|---|---|---|
| f | 275 | 3 | 278 |
| m | 4 | 187 | 191 |
| Sum | 279 | 190 | 469 |
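
The discordance count is the sum of the off-diagonal cells of the confusion matrix. The sketch below rebuilds the (sex, gender) pairs from the counts in the table above and tallies them with the standard library. The variable names are chosen for illustration.

```python
from collections import Counter

# (imputed sex, self-reported gender) -> number of samples, copied from the table above.
cell_counts = {('f', 'f'): 275, ('f', 'm'): 3, ('m', 'f'): 4, ('m', 'm'): 187}

# Expand the counts into one pair per sample, then tally them again.
pairs = [pair for pair, n in cell_counts.items() for _ in range(n)]
confusion = Counter(pairs)

n_total = sum(confusion.values())
n_discordant = sum(n for (sex, gender), n in confusion.items() if sex != gender)
percent_discordant = round(100 * n_discordant / n_total, 2)
print(n_total, n_discordant, percent_discordant)  # 469 7 1.49
```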


## Inspect discordant samples

We will inspect the samples where the self-reported gender is different from the imputed sex.

As we have seen, the males and females cluster very nicely with respect to the inbreeding coefficient, so we can be confident that we are imputing the sex correctly. We classified samples with $F > 0.4$ as male and samples with $F < 0.4$ as female.

Below we see the imputation data for the samples where the imputed sex and self-reported gender mismatch. Note that these samples fall clearly on either side of the cut-off: the three imputed females have $F$ between $-0.18$ and $-0.08$, and the four imputed males have $F$ between $0.87$ and $0.92$.

In [13]:
imputed_sex_ht.filter(imputed_sex_ht.sex != imputed_sex_ht.gender).show()

2021-10-11 10:32:44 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:44 Hail: INFO: Coerced sorted dataset


| s (str) | is_female (bool) | f_stat (float64) | n_called (int64) | expected_homs (float64) | observed_homs (int64) | sex (str) | gender (str) |
|---|---|---|---|---|---|---|---|
| FN000187 | True | -0.0852 | 1185 | 1010.0 | 992 | f | m |
| FN000861 | True | -0.18 | 1284 | 1080.0 | 1046 | f | m |
| FN000871 | True | -0.0836 | 1282 | 1080.0 | 1062 | f | m |
| FN000884 | False | 0.87 | 982 | 859.0 | 966 | m | f |
| FN000902 | False | 0.882 | 716 | 648.0 | 708 | m | f |
| FN000957 | False | 0.875 | 920 | 808.0 | 906 | m | f |
| FN001127 | False | 0.922 | 997 | 868.0 | 987 | m | f |


We will investigate whether poor data quality can explain the discordance. If anything seems abnormal with these samples, we may distrust the data.
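
Before looking at sample quality, the $F$ values in the table above can be sanity-checked by recomputing $F = (O - E)/(N - E)$ from the displayed columns. This is only approximate, because `expected_homs` appears to be displayed rounded to about three significant digits. The tuple layout and names are chosen for illustration.

```python
THRESHOLD = 0.4  # cut-off used for both the male and the female threshold

# (sample, f_stat, n_called, expected_homs, observed_homs), copied from the table above.
rows = [
    ('FN000187', -0.0852, 1185, 1010.0, 992),
    ('FN000861', -0.18, 1284, 1080.0, 1046),
    ('FN000871', -0.0836, 1282, 1080.0, 1062),
    ('FN000884', 0.87, 982, 859.0, 966),
    ('FN000902', 0.882, 716, 648.0, 708),
    ('FN000957', 0.875, 920, 808.0, 906),
    ('FN001127', 0.922, 997, 868.0, 987),
]

# Recompute F per sample from N, E and O.
recomputed = {s: (o - e) / (n - e) for s, _, n, e, o in rows}

for s, f_stat, *_ in rows:
    call = 'm' if recomputed[s] > THRESHOLD else 'f'
    print(s, f_stat, round(recomputed[s], 3), call)
```

The recomputed values agree with `f_stat` to within about 0.02, and none of the seven samples lies anywhere near the 0.4 cut-off.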

In [14]:
# Annotate the table with a boolean sex mismatch variable.
imputed_sex_ht = imputed_sex_ht.annotate(sex_mismatch = imputed_sex_ht.sex != imputed_sex_ht.gender)

# Annotate the matrix table with the same information.
mt = mt.annotate_cols(sex_mismatch = imputed_sex_ht[mt.s].sex_mismatch)

The plots below show the per-sample mean DP against several other QC variables. The samples with discordant sex and gender are highlighted in blue.

Note that GQ seems to be correlated with the number of heterozygotes: samples with low genotype quality seem to have fewer heterozygotes. This makes some intuitive sense. A sample with low depth will have fewer called variants, and therefore more homozygous-reference calls and fewer heterozygotes. The same pattern is reflected in the call rate and the number of singletons.

In [15]:
# QC metrics to plot against the per-sample mean depth (DP).
exprs_list = [
    ('# heterozygotes', mt.sample_qc.n_het),
    ('Ti/Tv rate', mt.sample_qc.r_ti_tv),
    ('Call rate', mt.sample_qc.call_rate),
    ('# singletons', mt.sample_qc.n_singleton),
]
plot_list = []
for name, expr in exprs_list:
    # One scatter plot per metric, with points labelled by sex mismatch status.
    p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, expr, label=mt.sex_mismatch, title=name, xlabel='DP mean', ylabel=name)
    p.plot_width = 800
    p.plot_height = 500
    plot_list.append(p)

2021-10-11 10:32:45 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:45 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:47 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:47 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:48 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:48 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:49 Hail: INFO: Coerced sorted dataset
2021-10-11 10:32:49 Hail: INFO: Coerced sorted dataset


In [16]:
show(gridplot(plot_list, ncols=2, plot_width=600, plot_height=400))

There seems to be nothing abnormal about the samples with discordant imputed sex and self-reported gender. We will therefore trust that our imputation is correct, and keep the samples.