Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable? #3582

Open
iris-garden opened this issue May 6, 2024 · 0 comments
Labels
discourse migrated from discuss.hail.is

Comments

@iris-garden
Copy link
Owner

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Jan 08, 2024 at 12:59) nistha said:

I’m pretty new to working with Hail, so I’m a little confused about how to work with the MatrixTables. I would appreciate any advice!

I have a MatrixTable (mt) that looks like this:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'ancestry_pred': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'filters': set<str>
    'a_index': int32
    'was_split': bool
    'variant_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_het: int64, 
        n_non_ref: int64, 
        het_freq_hwe: float64, 
        p_value_hwe: float64, 
        p_value_excess_het: float64
    }
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        homozygote_count: array<int32>
    }
    'gvs_afr_af': float64
    'gvs_amr_af': float64
    'gvs_eas_af': float64
    'gvs_eur_af': float64
    'gvs_mid_af': float64
    'gvs_oth_af': float64
    'gvs_sas_af': float64
    'gene_id': str
    'gene_symbol': str
    'consequence': str
    'omim_phenotypes_name': str
    'clinvar_classification': str
    'clinvar_phenotype': str
    'vid': str
----------------------------------------
Entry fields:
    'GT': call
    'GQ': int32
    'RGQ': int32
    'FT': str
    'AD': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

I want to get a table that looks like this, where we can get a list of non-reference heterozygous and homozygous variants (vid) by gene (gene_symbol) for each sample (s):

s gene_symbol vid consequence
1 TESK2 [‘1-3414321-G-A’, ‘1-3414321-T-A’] [‘missense_variant’, ‘missense_variant’]
1 PEX26 [‘22-18561317-T-A’, ‘22-18561317-TG-A’] [‘missense_variant’, ‘frameshift_variant’]
2 TESK2 [‘1-3414321-G-A’] [‘missense_variant’]
3 PEX26 [‘22-18561317-C-A’] [‘missense_variant’]

where vid is the locus.contig - locus.position - ref_allele - alt_allele.

I tried to get this table by doing the following, but when I show the table, it is empty:

# filter for het_non_ref and hom_var
mt = mt.filter_entries(mt.GT.is_het_non_ref() | mt.GT.is_hom_var())

# get entries
pv_table = mt.entries()

# select columns
candidate_genes = pv_table.select(pv_table.s,
                                  pv_table.ancestry_pred,
                                  pv_table.locus,
                                  pv_table.alleles,
                                  pv_table.vid,
                                  pv_table.gene_symbol,
                                  pv_table.gene_id,
                                  pv_table.consequence,
                                 )

# group by gene symbol and sample id 
candidate_genes = candidate_genes.group_by("s", "gene_symbol").aggregate(
    vid = hl.agg.collect(
        candidate_genes.vid
    ),
    consequence = hl.agg.collect(
        candidate_genes.consequence
    )
)

candidate_genes.show()

I thought about using make_table but I have over 1000 samples.
What is the best strategy to get a table like the one above?

Thanks!

@iris-garden iris-garden added the discourse migrated from discuss.hail.is label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discourse migrated from discuss.hail.is
Projects
None yet
Development

No branches or pull requests

1 participant