Identical variant patterns are inconsistently flagged and filtered out #286

@ktmeaton

Hi pyseer team!

I've come across an issue in which variants with identical patterns are all flagged as lrt-filtering-failed, except for the very last variant in the input file. I've encountered this in pyseer v1.3.11 and v1.3.12 so far (both installed with conda).

Test Data

Here is the test data for the minimal working example: data.zip

Phenotypes (phenotypes.tsv):
sample resistant
sample1 0
sample2 0
sample3 1
sample4 1

Variants (variants.Rtab):
Variant sample1 sample2 sample3 sample4
variant1 1 1 0 0
variant2 0 0 1 1
variant3 0 0 1 1
variant4 1 1 0 0
variant5 0 0 1 1

Kinship (kinship.tsv):
sample1 sample2 sample3 sample4
sample1 0.0038336778 0.0 0.0 0.0020687925
sample2 0.0 0.0038336778 0.0016861048 0.0
sample3 0.0 0.0016861048 0.0023424293 0.0
sample4 0.0020687925 0.0 0.0 0.0032878782

Reproducing the Issue

Here is my command that triggers it:

pyseer \
  --lmm \
  --cpu 1 \
  --similarity kinship.tsv \
  --pres variants.Rtab \
  --phenotypes phenotypes.tsv \
  --phenotype-column resistant \
  --print-filtered > locus_effects.tsv

Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants

In the output file, all variants are flagged as lrt-filtering-failed except for the very last variant (variant5).

variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes
variant1 5.00E-01 4.55E-02 1.00E+00 lrt-filtering-failed,bad-chisq
variant2 5.00E-01 4.55E-02 1.00E+00 lrt-filtering-failed,bad-chisq
variant3 5.00E-01 4.55E-02 1.00E+00 lrt-filtering-failed,bad-chisq
variant4 5.00E-01 4.55E-02 1.00E+00 lrt-filtering-failed,bad-chisq
variant5 5.00E-01 4.55E-02 0.00E+00 1.00E+00 0.00E+00 1.00E+00 bad-chisq
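
To spot this kind of inconsistency in a larger run, something like the following could help. It is only a minimal sketch: the function name `inconsistent_pattern_groups` is hypothetical and not part of pyseer, and the patterns and notes are typed in by hand from the example above rather than parsed from the real files.

```python
from collections import defaultdict


def inconsistent_pattern_groups(patterns, notes):
    """Return pattern groups whose member variants carry different notes.

    patterns: dict mapping variant name -> presence/absence tuple
    notes:    dict mapping variant name -> notes string from the output
    (Hypothetical helper for illustration; not a pyseer function.)
    """
    groups = defaultdict(list)
    for variant, pattern in patterns.items():
        groups[tuple(pattern)].append(variant)

    report = {}
    for pattern, members in groups.items():
        distinct_notes = {notes[v] for v in members}
        if len(distinct_notes) > 1:
            report[pattern] = {v: notes[v] for v in members}
    return report


# Values copied by hand from the minimal example in this issue.
patterns = {
    "variant1": (1, 1, 0, 0),
    "variant2": (0, 0, 1, 1),
    "variant3": (0, 0, 1, 1),
    "variant4": (1, 1, 0, 0),
    "variant5": (0, 0, 1, 1),
}
notes = {
    "variant1": "lrt-filtering-failed,bad-chisq",
    "variant2": "lrt-filtering-failed,bad-chisq",
    "variant3": "lrt-filtering-failed,bad-chisq",
    "variant4": "lrt-filtering-failed,bad-chisq",
    "variant5": "bad-chisq",
}

report = inconsistent_pattern_groups(patterns, notes)
for pattern, members in report.items():
    print(pattern, members)
```

Note that this groups only exactly identical patterns, so variant1 and variant4 (the complement pattern) form their own group, which is internally consistent here.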

Possible Solutions

This may be related to the numeric precision of the similarity matrix. If I round the kinship matrix from 10 to 8 decimal places, I get the expected, consistent output for all variants.
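
For reference, the rounding step can be sketched roughly as below. This keeps 8 digits after the decimal point for each value, which matches the reduced matrix in this post. It assumes a tab-separated file with a header row and row labels in the first column (the leading empty header cell is an assumption about the file layout), and `round_similarity` is a hypothetical helper, not something pyseer provides.

```python
import csv
import io


def round_similarity(text, decimals=8):
    """Round every numeric cell of a tab-separated labelled matrix.

    Assumes the first row is a header and the first column holds row
    labels; both are passed through unchanged.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for label, *values in body:
        writer.writerow([label] + [f"{float(v):.{decimals}f}" for v in values])
    return out.getvalue()


# Kinship values copied from the test data above.
kinship = (
    "\tsample1\tsample2\tsample3\tsample4\n"
    "sample1\t0.0038336778\t0.0\t0.0\t0.0020687925\n"
    "sample2\t0.0\t0.0038336778\t0.0016861048\t0.0\n"
    "sample3\t0.0\t0.0016861048\t0.0023424293\t0.0\n"
    "sample4\t0.0020687925\t0.0\t0.0\t0.0032878782\n"
)

rounded = round_similarity(kinship)
print(rounded)
```

Because every cell goes through the same formatting, zeros come out as 0.00000000 rather than 0.0, and symmetric entries stay symmetric after rounding.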

Reduced-precision kinship (kinship_e8.tsv):
sample1 sample2 sample3 sample4
sample1 0.00383368 0.0 0.0 0.00206879
sample2 0.0 0.00383368 0.00168610 0.0
sample3 0.0 0.00168611 0.00234243 0.0
sample4 0.00206879 0.0 0.0 0.00328788

Rerunning the same command with the reduced-precision matrix:

pyseer \
  --lmm \
  --cpu 1 \
  --similarity kinship_e8.tsv \
  --pres variants.Rtab \
  --phenotypes phenotypes.tsv \
  --phenotype-column resistant \
  --print-filtered > locus_effects.tsv

Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants
variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes
variant1 5.00E-01 4.55E-02 0.00E+00 -1.00E+00 0.00E+00 1.00E+00 bad-chisq
variant2 5.00E-01 4.55E-02 0.00E+00 1.00E+00 0.00E+00 1.00E+00 bad-chisq
variant3 5.00E-01 4.55E-02 0.00E+00 1.00E+00 0.00E+00 1.00E+00 bad-chisq
variant4 5.00E-01 4.55E-02 0.00E+00 -1.00E+00 0.00E+00 1.00E+00 bad-chisq
variant5 5.00E-01 4.55E-02 0.00E+00 1.00E+00 0.00E+00 1.00E+00 bad-chisq

As an aside, I do know that I should be filtering out bad-chisq variants. This is just my attempt at a minimal working example of an issue I encountered while working with a larger population size.

I'm curious if this issue is reproducible for you, and if so, if you think I'm on the right track investigating the precision in the similarity matrix? Thank you for your time!
