Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

iris-garden · 2024-05-06T16:48:27Z

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Jan 08, 2024 at 12:59) nistha said:

I’m pretty new to working with Hail, so I’m a little confused about how to work with the MatrixTables. I would appreciate any advice!

I have a MatrixTable (mt) that looks like this:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'ancestry_pred': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'filters': set<str>
    'a_index': int32
    'was_split': bool
    'variant_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_het: int64, 
        n_non_ref: int64, 
        het_freq_hwe: float64, 
        p_value_hwe: float64, 
        p_value_excess_het: float64
    }
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        homozygote_count: array<int32>
    }
    'gvs_afr_af': float64
    'gvs_amr_af': float64
    'gvs_eas_af': float64
    'gvs_eur_af': float64
    'gvs_mid_af': float64
    'gvs_oth_af': float64
    'gvs_sas_af': float64
    'gene_id': str
    'gene_symbol': str
    'consequence': str
    'omim_phenotypes_name': str
    'clinvar_classification': str
    'clinvar_phenotype': str
    'vid': str
----------------------------------------
Entry fields:
    'GT': call
    'GQ': int32
    'RGQ': int32
    'FT': str
    'AD': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

I want to get a table that looks like this, where we can get a list of non-reference heterozygous and homozygous variants (vid) by gene (gene_symbol) for each sample (s):

s	gene_symbol	vid	consequence
1	TESK2	[‘1-3414321-G-A’, ‘1-3414321-T-A’]	[‘missense_variant’, ‘missense_variant’]
1	PEX26	[‘22-18561317-T-A’, ‘22-18561317-TG-A’]	[‘missense_variant’, ‘frameshift_variant’]
2	TESK2	[‘1-3414321-G-A’]	[‘missense_variant’]
3	PEX26	[‘22-18561317-C-A’]	[‘missense_variant’]
…	…	…	…

where vid is the locus.contig - locus.position - ref_allele - alt_allele.

I tried to get this table by doing the following, but when I show the table, it is empty:

# filter for het_non_ref and hom_var
mt = mt.filter_entries(mt.GT.is_het_non_ref() | mt.GT.is_hom_var())

# get entries
pv_table = mt.entries()

# select columns
candidate_genes = pv_table.select(pv_table.s,
                                  pv_table.ancestry_pred,
                                  pv_table.locus,
                                  pv_table.alleles,
                                  pv_table.vid,
                                  pv_table.gene_symbol,
                                  pv_table.gene_id,
                                  pv_table.consequence,
                                 )

# group by gene symbol and sample id 
candidate_genes = candidate_genes.group_by("s", "gene_symbol").aggregate(
    vid = hl.agg.collect(
        candidate_genes.vid
    ),
    consequence = hl.agg.collect(
        candidate_genes.consequence
    )
)

candidate_genes.show()

I thought about using make_table but I have over 1000 samples.
What is the best strategy to get a table like the one above?

Thanks!

The text was updated successfully, but these errors were encountered:

iris-garden added the discourse migrated from discuss.hail.is label May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

iris-garden commented May 6, 2024

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

Comments

iris-garden commented May 6, 2024

(Jan 08, 2024 at 12:59) nistha said: