# A deep dive into the TOB-WGS pipeline

Chapter 1: what does `ReblockGVCF` actually do? Let's examine sample TOB1520, which is an outlier for some QC metrics.

**Version 2.2 of the pipeline: `ReblockGVCF` is now supposed NOT to remove low-quality variants.**

In [None]:
import hail as hl

# All datasets in TOB-WGS are using GRCh38
hl.init(default_reference='GRCh38');


Import the TOB1520 GVCF and explore its structure and content.


In [10]:
gvcf = hl.import_vcf('gs://cpg-tob-wgs-test/gvcf/batch1/TOB1520.g.vcf.gz', min_partitions=12, force_bgz=True)
gvcf.describe()
gvcf.info.END.show()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AS_InbreedingCoeff: array<float64>, 
        AS_QD: array<float64>, 
        AS_RAW_BaseQRankSum: str, 
        AS_RAW_MQ: str, 
        AS_RAW_MQRankSum: str, 
        AS_RAW_ReadPosRankSum: str, 
        AS_SB_TABLE: str, 
        BaseQRankSum: float64, 
        DP: int32, 
        DS: bool, 
        END: int32, 
        ExcessHet: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQRankSum: float64, 
        RAW_MQandDP: array<int32>, 
        ReadPosRankSum: float64
    }
----------------------------------------
Entry fields:
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'GT': call


2021-06-19 06:56:17 Hail: INFO: Coerced sorted dataset


locus,alleles,info.END
locus<GRCh38>,array<str>,int32
chr1:10001,"[""T"",""<NON_REF>""]",10016
chr1:10017,"[""C"",""<NON_REF>""]",10026
chr1:10027,"[""A"",""<NON_REF>""]",10027
chr1:10028,"[""C"",""<NON_REF>""]",10038
chr1:10039,"[""A"",""<NON_REF>""]",10041
chr1:10042,"[""C"",""<NON_REF>""]",10044
chr1:10045,"[""A"",""<NON_REF>""]",10045
chr1:10046,"[""C"",""<NON_REF>""]",10046
chr1:10047,"[""C"",""<NON_REF>""]",10047
chr1:10048,"[""C"",""<NON_REF>""]",10050
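
The `END` column gives the last position covered by each reference block, so a block's length is `END - POS + 1`. As a minimal sketch, using only the ten rows displayed above (the variable names are illustrative), the block lengths and their contiguity can be checked like this:

```python
# (start, END) pairs copied from the ten rows shown above, all on chr1.
blocks = [
    (10001, 10016), (10017, 10026), (10027, 10027), (10028, 10038),
    (10039, 10041), (10042, 10044), (10045, 10045), (10046, 10046),
    (10047, 10047), (10048, 10050),
]

# A block covering start..END inclusive spans END - start + 1 bases.
lengths = [end - start + 1 for start, end in blocks]

# Each block should begin right after the previous one ends.
contiguous = all(nxt[0] == cur[1] + 1 for cur, nxt in zip(blocks, blocks[1:]))

print(lengths, sum(lengths), contiguous)
```

In these rows the blocks tile positions 10001 to 10050 without gaps, with several single-base blocks.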


In [11]:
hl.summarize_variants(gvcf)

2021-06-19 06:57:15 Hail: INFO: Coerced sorted dataset


Number of alleles,Count
2,30075685
3,5202620
4,235909
5,26301
6,6800
7,1784
8,1258

Allele type,Count
Symbolic,35550357
SNP,4464040
Insertion,637685
Deletion,591733
Star,103550
Complex,1

Metric,Value
Transitions,2747207.0
Transversions,1716833.0
Ratio,1.6

Contig,Count
chr1,2946558
chr2,2798614
chr3,2239708
chr4,2158804
chr5,1998760
chr6,1963461
chr7,2016563
chr8,1651842
chr9,1553560
chr10,1675602


Worth noting: of the 35,550,357 records, 30,075,685 have only two alleles (the reference plus `<NON_REF>`), i.e. they are homozygous reference blocks.
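
As a quick sanity check on these numbers, the sketch below recomputes the total from the "Number of alleles" table and derives the share of homref blocks. It assumes, as the text does, that every two-allele record is a `[REF, <NON_REF>]` block; the dictionary is simply a hand copy of the output above.

```python
# Row counts by number of alleles, copied from the summarize_variants output.
rows_by_n_alleles = {
    2: 30_075_685, 3: 5_202_620, 4: 235_909,
    5: 26_301, 6: 6_800, 7: 1_784, 8: 1_258,
}

total_rows = sum(rows_by_n_alleles.values())

# Assumption: a two-allele record is a homozygous reference block.
homref_blocks = rows_by_n_alleles[2]
homref_fraction = homref_blocks / total_rows

print(f"{total_rows:,} rows, {homref_fraction:.1%} homref blocks")
```

The total matches the "Symbolic" count, which is expected since every GVCF record carries one `<NON_REF>` allele.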

Let's load the GVCF produced by `ReblockGVCF`.

In [13]:
rb_gvcf = hl.import_vcf('gs://cpg-tob-wgs-test-tmp/joint-calling/v2.2/hail/batch/7c4ba7/1/output_gvcf.g.vcf.gz', force_bgz=True, min_partitions=12)
rb_gvcf.describe()
hl.summarize_variants(rb_gvcf)

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        AS_BaseQRankSum: array<float64>, 
        AS_FS: array<float64>, 
        AS_InbreedingCoeff: array<float64>, 
        AS_MQ: array<float64>, 
        AS_MQRankSum: array<float64>, 
        AS_QD: array<float64>, 
        AS_QUALapprox: str, 
        AS_RAW_BaseQRankSum: str, 
        AS_RAW_MQ: str, 
        AS_RAW_MQRankSum: str, 
        AS_RAW_ReadPosRankSum: str, 
        AS_ReadPosRankSum: array<float64>, 
        AS_SB_TABLE: str, 
        AS_SOR: array<float64>, 
        AS_VarDP: str, 
        BaseQRankSum: float64, 
        DP: int32, 
        DS: bool, 
        END: int32, 
   

2021-06-19 07:09:59 Hail: INFO: Coerced sorted dataset


Number of alleles,Count
2,29886034
3,4824254
4,211029
5,23444
6,6243
7,1647
8,1134

Allele type,Count
Symbolic,34953785
SNP,4135327
Insertion,588737
Deletion,543947
Star,88643
Complex,1

Metric,Value
Transitions,2701454.0
Transversions,1433873.0
Ratio,1.88

Contig,Count
chr1,2893335
chr2,2753753
chr3,2201870
chr4,2125750
chr5,1967506
chr6,1933167
chr7,1983526
chr8,1627502
chr9,1527980
chr10,1646718
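
To see what changed between the two summaries, the per-allele-type counts can be subtracted directly. This is a minimal sketch over numbers copied by hand from the two outputs above; `before` and `after` are illustrative names.

```python
# Allele-type counts before and after ReblockGVCF, copied from the outputs above.
before = {"Symbolic": 35_550_357, "SNP": 4_464_040, "Insertion": 637_685,
          "Deletion": 591_733, "Star": 103_550, "Complex": 1}
after = {"Symbolic": 34_953_785, "SNP": 4_135_327, "Insertion": 588_737,
         "Deletion": 543_947, "Star": 88_643, "Complex": 1}

# Positive values mean the reblocked file has fewer alleles of that type.
delta = {kind: before[kind] - after[kind] for kind in before}

for kind, diff in delta.items():
    print(f"{kind:10s} {diff:>9,}")
```

Since each record carries exactly one symbolic allele, the "Symbolic" difference is also the net change in record count.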


These are the annotations added by `ReblockGVCF`:

        AC, AF, AN: why would we need these on a single sample?
        AS_BaseQRankSum: array<float64>, 
        AS_FS: array<float64>,
        AS_MQ: array<float64>, 
        AS_MQRankSum: array<float64>,
        AS_QUALapprox: str,
        AS_ReadPosRankSum: array<float64>,
        AS_SOR: array<float64>, 
        AS_VarDP: str,
        FS: float64, 
        MQ: float64,  
        MQ_DP: int32, 
        QD: float64, 
        QUALapprox: int32, 
        RAW_GT_COUNT: array<int32>, 
        SOR: float64, 
        VarDP: int32

(`AS_QD` is not in this list because it was already present in the input GVCF.)

And these are the annotations removed by `ReblockGVCF`:

        MLEAC: array<int32>, 
        MLEAF: array<float64>, 

Additionally, we lost about 6e5 records (from 35,550,357 down to 34,953,785): not only homref blocks but also SNPs, etc.
Let's have a look at what we lost:

In [14]:
lost = gvcf.anti_join_rows(rb_gvcf.rows())
hl.summarize_variants(lost)

2021-06-19 07:15:38 Hail: INFO: Coerced sorted dataset
2021-06-19 07:16:36 Hail: INFO: Coerced sorted dataset


Number of alleles,Count
2,280423
3,378366
4,24880
5,2857
6,557
7,137
8,124

Allele type,Count
Symbolic,687344
SNP,328713
Insertion,48948
Deletion,47786
Star,14907

Metric,Value
Transitions,45753.0
Transversions,282960.0
Ratio,0.16

Contig,Count
chr1,61859
chr2,51571
chr3,43550
chr4,37815
chr5,35760
chr6,34553
chr7,38132
chr8,27709
chr9,29387
chr10,33361
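
Note that `anti_join_rows` compares row keys, here `(locus, alleles)`, so a record whose allele list changed counts as lost even if its locus is still present. That would explain why the anti-join returns more rows (687,344) than the net drop in record count. The sketch below works through the set arithmetic; it assumes row keys are unique within each file, and the toy keys are hypothetical illustrations, not checked against the reblocked data.

```python
rows_before = 35_550_357   # records in the original GVCF
rows_after = 34_953_785    # records after ReblockGVCF
rows_lost = 687_344        # keys in gvcf that are absent from rb_gvcf

# |A| - |B| = |A \ B| - |B \ A|, so the number of keys that exist only
# in the reblocked file follows from the three counts above.
net_drop = rows_before - rows_after
new_keys_after = rows_lost - net_drop

# Toy illustration with hypothetical keys: one block disappears (merged into
# its neighbour) and one record loses its alt allele.
toy_before = {
    ("chr1:10001", ("T", "<NON_REF>")),
    ("chr1:10017", ("C", "<NON_REF>")),
    ("chr1:16571", ("G", "A", "<NON_REF>")),
}
toy_after = {
    ("chr1:10001", ("T", "<NON_REF>")),
    ("chr1:16571", ("G", "<NON_REF>")),
}
toy_lost = toy_before - toy_after   # what an anti-join on the key would return
toy_new = toy_after - toy_before    # keys that only exist after reblocking

print(net_drop, new_keys_after, len(toy_lost), len(toy_new))
```

Under the uniqueness assumption, roughly 90 thousand keys in the reblocked file have no exact counterpart in the original.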


We lost homref blocks, as expected (280,423 two-allele records), but also other kinds of alleles, such as 328,713 SNPs.
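
The transition/transversion figures tell the same story. As a small sketch using only the counts printed above, the lost transitions and transversions can be recovered by subtraction and compared with the anti-join summary:

```python
# Transition / transversion counts copied from the summaries above.
ti_before, tv_before = 2_747_207, 1_716_833
ti_after, tv_after = 2_701_454, 1_433_873

# SNP alleles that disappeared during reblocking.
ti_lost = ti_before - ti_after
tv_lost = tv_before - tv_after

ratio_before = ti_before / tv_before
ratio_after = ti_after / tv_after
ratio_lost = ti_lost / tv_lost

print(f"before {ratio_before:.2f}, after {ratio_after:.2f}, lost {ratio_lost:.2f}")
print(f"lost SNP alleles: {ti_lost + tv_lost:,}")
```

The differences reproduce the Ti/Tv counts of the lost set exactly. The removed SNP alleles are overwhelmingly transversions, which may suggest they were mostly low-evidence alleles rather than real variation, although nothing in this notebook tests that directly.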

Let's inspect some of these lost variants. A record with more than two alleles is not a plain homref block; here we look at the three-allele records.

In [15]:
lost.filter_rows(hl.len(lost.alleles) == 3).show(50,1,truncate=30)

2021-06-19 07:19:09 Hail: INFO: Coerced sorted dataset
2021-06-19 07:20:05 Hail: INFO: Coerced sorted dataset
2021-06-19 07:21:03 Hail: INFO: Coerced sorted dataset


,,'TOB1520','TOB1520','TOB1520','TOB1520','TOB1520','TOB1520','TOB1520','TOB1520','TOB1520','TOB1520'
locus,alleles,AD,DP,GQ,GT,MIN_DP,PGT,PID,PL,PS,SB
locus<GRCh38>,array<str>,array<int32>,int32,int32,call,int32,call,str,array<int32>,int32,array<int32>
chr1:10109,"[""AACCCT"",""A"",""<NON_REF>""]","[11,1,0]",12,31,0/0,,,,"[0,31,318,33,319,321]",,"[5,6,1,0]"
chr1:10231,"[""CCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAA"",""C"",""<NON_REF>""]","[8,0,0]",8,23,0/0,,,,"[0,23,233,23,233,233]",,"[3,5,0,0]"
chr1:10441,"[""CCCTA"",""C"",""<NON_REF>""]","[11,0,0]",11,33,0|0,,1|0,"""10439_AC_A""","[0,33,473,33,473,473]",10439.0,"[5,6,0,0]"
chr1:16571,"[""G"",""A"",""<NON_REF>""]","[33,2,0]",35,54,0/0,,,,"[0,54,892,99,898,944]",,"[14,19,0,2]"
chr1:16688,"[""G"",""A"",""<NON_REF>""]","[20,2,0]",22,13,0/0,,,,"[0,13,497,60,503,551]",,"[13,7,0,2]"
chr1:19226,"[""T"",""A"",""<NON_REF>""]","[27,2,0]",29,0,0|0,,0|1,"""19226_T_A""","[0,0,1196,84,1202,1286]",19226.0,"[13,14,2,0]"
chr1:49291,"[""C"",""T"",""<NON_REF>""]","[11,1,0]",12,4,0/0,,,,"[0,4,342,33,345,373]",,"[6,5,0,1]"
chr1:58188,"[""G"",""T"",""<NON_REF>""]","[8,1,0]",9,17,0|0,,0|1,"""58188_G_T""","[0,17,252,24,255,262]",58188.0,"[8,0,1,0]"
chr1:83965,"[""AAAG"",""A"",""<NON_REF>""]","[4,0,0]",4,12,0/0,,,,"[0,12,174,12,174,174]",,"[4,0,0,0]"
chr1:99066,"[""TTTC"",""T"",""<NON_REF>""]","[10,2,0]",12,3,0/0,,,,"[0,3,416,30,421,448]",,"[3,7,0,2]"


In [16]:
lost = lost.annotate_rows(nonhomref=hl.agg.count_where(lost.GT.is_non_ref())>0)
lost.aggregate_rows(hl.agg.count_where(lost.nonhomref))

2021-06-19 07:22:01 Hail: INFO: Coerced sorted dataset
2021-06-19 07:22:59 Hail: INFO: Coerced sorted dataset


0

**GOOD: no non-HomRef genotypes were removed.**
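
For readers unfamiliar with `is_non_ref()`, the check amounts to asking whether any allele in the genotype call is something other than the reference (index 0). The following plain-Python sketch mimics that logic on the ten genotypes displayed earlier; the helper is a hypothetical stand-in written for illustration, not Hail's implementation, and it treats missing alleles (`.`) as not non-ref.

```python
import re

def is_non_ref(gt: str) -> bool:
    """Return True if at least one called allele is not the reference (0)."""
    alleles = re.split(r"[/|]", gt)  # handles unphased (/) and phased (|) calls
    return any(a not in ("0", ".") for a in alleles)

# GT values copied from the ten lost records shown above.
shown_gts = ["0/0", "0/0", "0|0", "0/0", "0/0", "0|0", "0/0", "0|0", "0/0", "0/0"]

n_non_ref = sum(is_non_ref(gt) for gt in shown_gts)
print(n_non_ref)
```

All ten displayed records are called homozygous reference despite listing an alt allele, consistent with the aggregate count of 0.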