# Convert MatrixTable to VCF

In [1]:
import hail as hl
hl.init(default_reference='GRCh38', spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

2021-12-17 09:26:37 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2021-12-17 09:26:38 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
  Compatibility is not guaranteed.


Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-68c965f6f5-qw44l:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/main/hail-20211217-0926-0.2.61-3c86d3ba497a.log


In [2]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

In [3]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants_pao_removed.mt')

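Recalculate variant statistics with `hl.variant_qc`, and copy the ones we want to keep into the `info` struct so that they end up in the INFO column of the VCF.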

In [4]:
mt = hl.variant_qc(mt)

mt = mt.annotate_rows(info = mt.info.annotate(
    AC = mt.variant_qc.AC, AF = mt.variant_qc.AF, AN = mt.variant_qc.AN,
    dp_mean = mt.variant_qc.dp_stats.mean, gq_mean = mt.variant_qc.gq_stats.mean,
    n_het = mt.variant_qc.n_het))
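As a rough illustration of what these statistics mean, the plain-Python sketch below computes `AC`, `AF` and `AN` for a single site from a list of diploid calls. It is not Hail code: the function name `allele_stats` and the tuple encoding of genotypes are made up for this example. It follows the `hl.variant_qc` convention, where `AC` and `AF` have one element per allele including the reference, whereas the VCF convention reports them per alternate allele only, so a file meant to follow the VCF convention would probably need the reference element removed first.

```python
from typing import List, Optional, Tuple

def allele_stats(calls: List[Optional[Tuple[int, int]]], n_alleles: int):
    """Return (AC, AF, AN) for one site; None marks a missing call (hypothetical helper)."""
    ac = [0] * n_alleles
    for call in calls:
        if call is None:
            continue  # missing genotypes do not contribute to AN
        for allele in call:
            ac[allele] += 1
    an = sum(ac)  # total number of called alleles
    # Assumption of this sketch: report NaN when nothing was called.
    af = [count / an for count in ac] if an > 0 else [float('nan')] * n_alleles
    return ac, af, an

# Three called genotypes (0/0, 0/1, 1/1) and one missing call.
ac, af, an = allele_stats([(0, 0), (0, 1), (1, 1), None], n_alleles=2)
print(ac, af, an)  # [3, 3] [0.5, 0.5] 6
```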

Drop `info` fields and other annotations that are not needed in the VCF.

In [5]:
field_list = ['InbreedingCoeff', 'BaseQRankSum', 'DB', 'DS', 'END', 'ExcessHet', 'FS', 'MLEAC', 'MLEAF', 'MQ', 'MQRankSum', 'NEGATIVE_TRAIN_SITE', 'PG', 'POSITIVE_TRAIN_SITE', 'QD', 'RAW_MQandDP', 'ReadPosRankSum', 'SOR', 'VQSLOD', 'DP', 'culprit']

mt = mt.annotate_rows(info = mt.info.drop(*field_list))

In [6]:
mt = mt.drop('variant_qc', 'sample_qc', 'high_hom_het', 'pao_list', 'a_index', 'was_split')

For VCF export, the `filters` field must be a set of strings (`set<str>`), not a plain string.

In [7]:
mt = mt.annotate_rows(filters = hl.set([mt.filters]))
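For orientation, the sketch below mimics in plain Python how a `set<str>` value is expected to appear in the FILTER column of the VCF. It assumes the convention that a missing value is written as `.`, an empty set as `PASS`, and anything else as a semicolon-separated list; this is an assumption about the export behaviour, and `format_filter` is a hypothetical helper, not part of Hail. Note that wrapping the original string as above gives a one-element set, so a row whose string was `PASS` holds the set `{'PASS'}` rather than an empty set.

```python
from typing import Optional, Set

def format_filter(filters: Optional[Set[str]]) -> str:
    """Render a set of filter names as a VCF FILTER value (hypothetical helper)."""
    if filters is None:
        return '.'      # assumed: missing filters are written as '.'
    if not filters:
        return 'PASS'   # assumed: an empty set means the site passed all filters
    return ';'.join(sorted(filters))

print(format_filter({'PASS'}))
print(format_filter({'LowQual', 'VQSRTrancheSNP99.00to99.90'}))
```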

Inspect the schema of the `MatrixTable` before exporting it.

In [8]:
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        ANN: array<str>, 
        LOF: array<str>, 
        NMD: array<str>, 
        dp_mean: float64, 
        gq_mean: float64, 
        n_het: int64
    }
----------------------------------------
Entry fields:
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'GT': call
    'MIN_DP': int32
    'PGT': call
    'PID': str
    'PL': array<int32>
    'PP': array<int32>
    'PS': int32
    'RGQ': int32
    'SB': array<int32>
    'AB': float32
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------


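Convert the `MatrixTable` to VCF, and write it to a block-gzipped file (`.bgz`). The `metadata` dictionary supplies header descriptions for the custom `info` fields.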

In [9]:
# Metadata for the VCF header.
metadata = {'info':
            {
                'dp_mean': {'Description': 'Mean depth for variant.'},
                'gq_mean': {'Description': 'Mean genotype quality for variant.'},
                'n_het': {'Description': 'Number of heterozygote genotypes at site.'}
            }
           }

In [10]:
hl.export_vcf(mt, BASE_DIR + '/data/mt/fargen_phase1_exome_genotypes.vcf.bgz', metadata=metadata)

2021-12-17 09:27:27 Hail: INFO: while writing:
    /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/fargen_phase1_exome_genotypes.vcf.bgz
  merge time: 1.093s
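To check that the descriptions actually reached the header, one option is to read the header lines back. The sketch below uses only the standard library and relies on block-gzipped files being readable by `gzip`; `info_descriptions` is a hypothetical helper, and the regular expression assumes header lines of the form `##INFO=<ID=...,...,Description="...">`.

```python
import gzip
import re
from typing import Dict

def info_descriptions(vcf_path: str) -> Dict[str, str]:
    """Map INFO IDs to Description strings from a (block-)gzipped VCF header."""
    pattern = re.compile(r'##INFO=<ID=([^,>]+).*Description="(.*)"')
    descriptions = {}
    with gzip.open(vcf_path, 'rt') as f:
        for line in f:
            if not line.startswith('##'):
                break  # meta-information lines are over
            match = pattern.match(line)
            if match:
                descriptions[match.group(1)] = match.group(2)
    return descriptions
```

Calling `info_descriptions(BASE_DIR + '/data/mt/fargen_phase1_exome_genotypes.vcf.bgz')` should then return a dictionary that includes `dp_mean`, `gq_mean` and `n_het` with the descriptions given above.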


Write a sites-only VCF, containing the row fields (variants and their annotations) but no genotypes.

In [11]:
rows_ht = mt.rows()

In [12]:
hl.export_vcf(rows_ht, BASE_DIR + '/data/mt/fargen_phase1_exome_sites.vcf.bgz', metadata=metadata)

2021-12-17 09:29:05 Hail: INFO: while writing:
    /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/fargen_phase1_exome_sites.vcf.bgz
  merge time: 72.743ms
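A quick way to confirm that the file is sites-only is to look at the `#CHROM` header line: a sites-only VCF stops after the eight fixed columns, while a genotype VCF continues with `FORMAT` and one column per sample. The sketch below uses only the standard library; `vcf_sample_names` is a hypothetical helper.

```python
import gzip
from typing import List

def vcf_sample_names(vcf_path: str) -> List[str]:
    """Return the sample names from the #CHROM line of a (block-)gzipped VCF."""
    with gzip.open(vcf_path, 'rt') as f:
        for line in f:
            if line.startswith('#CHROM'):
                fields = line.rstrip('\n').split('\t')
                # Columns 1-8 are fixed, column 9 is FORMAT, the rest are samples.
                return fields[9:]
    raise ValueError('no #CHROM header line found in ' + vcf_path)
```

For the sites-only file this should return an empty list, and for the genotype file the list of sample IDs from the column field `s`.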


Index both VCF files with `tabix`.

In [13]:
%%bash

tabix /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/fargen_phase1_exome_genotypes.vcf.bgz

In [None]:
%%bash

tabix /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/fargen_phase1_exome_sites.vcf.bgz