# Get the exome from the gnomAD genome data

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '100g', 'spark.local.dir': '/home/olavur/tmp'})

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-7889d4ff4c-6wxtc:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/gnomad_genome_genotypes/hail-20210323-1048-0.2.61-3c86d3ba497a.log


In [2]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
RESOURCES_DIR = '/non-fargen/resources'

## Load gnomAD data

In [6]:
gnomad_mt = hl.read_matrix_table(RESOURCES_DIR + '/gnomAD/gnomad.genomes.v3.1.hgdp_1kg_subset_dense.mt')

In [4]:
gnomad_mt.count()

(175312130, 3942)

## Load exome target BED file

Load the padded SureSelect Human All Exon V6 UTR target BED file (GRCh38), the target set used in the FarGen Phase I exome sequencing.

In [7]:
interval_ht = hl.import_bed(RESOURCES_DIR + '/sureselect_human_all_exon_v6_utr_grch38/S07604624_Padded.bed', reference_genome='GRCh38')

2021-03-23 10:50:07 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)
  Loading field 'f3' as type str (user-supplied)


## Filter data

Keep only the sites (rows) whose locus falls within one of the target intervals.

In [8]:
gnomad_exome_mt = gnomad_mt.filter_rows(hl.is_defined(interval_ht[gnomad_mt.locus]))
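
The index expression `interval_ht[gnomad_mt.locus]` is defined only for loci that lie inside one of the BED intervals, so `hl.is_defined(...)` acts as a membership test. As a rough, Hail-free illustration of that logic (a sketch, not how Hail implements the join), the snippet below checks 1-based positions against BED intervals, which are 0-based and half-open. The helper names `parse_bed` and `in_targets`, the toy intervals, and the site list are all hypothetical and chosen only for this example.

```python
def parse_bed(lines):
    """Parse BED lines into {contig: [(start, end), ...]}.

    Coordinates are kept as in the BED format: 0-based, half-open.
    """
    intervals = {}
    for line in lines:
        # Skip blank lines and BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        intervals.setdefault(contig, []).append((int(start), int(end)))
    return intervals


def in_targets(intervals, contig, pos):
    """Return True if the 1-based position `pos` lies in any target interval."""
    zero_based = pos - 1  # convert the 1-based locus to BED's 0-based system
    return any(start <= zero_based < end
               for start, end in intervals.get(contig, []))


# Hypothetical toy data, not taken from the SureSelect BED file.
bed_lines = ["chr1\t100\t105\ttarget_a", "chr2\t50\t60\ttarget_b"]
targets = parse_bed(bed_lines)

sites = [("chr1", 100), ("chr1", 101), ("chr1", 105),
         ("chr1", 106), ("chr2", 55), ("chr3", 10)]
kept = [site for site in sites if in_targets(targets, *site)]
print(kept)
```

The BED line `chr1 100 105` covers 1-based positions 101 through 105, which is why position 100 is dropped and position 105 is kept. A linear scan is fine for a toy example; an interval join over millions of rows needs sorted or indexed intervals.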

In [9]:
gnomad_exome_mt.count()

2021-03-23 10:50:50 Hail: INFO: Coerced sorted dataset


(7094228, 3942)

The gnomAD genome data is 2.4 TB, so, assuming size scales with the number of rows, we would expect the filtered exome data to be roughly (in GB):

In [12]:
2400 * 7094228 / 175312130

97.11904817995195
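
The same estimate can be written as a small helper, which makes the assumption explicit: on-disk size is taken to scale linearly with the number of rows, ignoring differences in compression or variant density between exonic and non-exonic regions. The function name `estimate_filtered_size` and the constant names are illustrative; the numbers are the row counts and the 2.4 TB figure quoted above.

```python
def estimate_filtered_size(total_size, total_rows, kept_rows):
    """Scale a dataset size by the fraction of rows kept.

    Assumes every row contributes equally to the size on disk.
    """
    return total_size * kept_rows / total_rows


GENOME_SIZE_GB = 2400        # 2.4 TB, expressed in GB
GENOME_ROWS = 175312130      # rows in the full gnomAD matrix table
EXOME_ROWS = 7094228         # rows left after filtering to the targets

fraction_kept = EXOME_ROWS / GENOME_ROWS
estimate_gb = estimate_filtered_size(GENOME_SIZE_GB, GENOME_ROWS, EXOME_ROWS)
print(f"{fraction_kept:.2%} of rows kept, roughly {estimate_gb:.1f} GB expected")
```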

## Write data to disk

In [13]:
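# Guard against accidentally re-running the write, which took over an hour (see the log below).
# Change the condition to True to write the table.
if False:
    gnomad_exome_mt.write(RESOURCES_DIR + '/gnomAD/gnomad.genomes.v3.1.hgdp_1kg_subset_dense_EXOME.mt', overwrite=True)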

2021-03-23 11:03:59 Hail: INFO: Coerced sorted dataset
2021-03-23 12:11:18 Hail: INFO: wrote matrix table with 7094228 rows and 3942 columns in 115375 partitions to /non-fargen/resources/gnomAD/gnomad.genomes.v3.1.hgdp_1kg_subset_dense_EXOME.mt
    Total size: 99.22 GiB
    * Rows/entries: 99.22 GiB
    * Columns: 1.08 MiB
    * Globals: 7.12 KiB
    * Smallest partition: 0 rows (20.00 B)
    * Largest partition:  14831 rows (135.42 MiB)
