Description
Hi pyseer team!
I've come across an issue in which variants with identical patterns are all flagged as lrt-filtering-failed, except for the very last variant in the input file. I've encountered this in pyseer v1.3.11 and v1.3.12 so far (installed with conda).
Test Data
Here is the test data for the minimal working example: data.zip
Expand to see phenotypes.
| sample | resistant |
|---|---|
| sample1 | 0 |
| sample2 | 0 |
| sample3 | 1 |
| sample4 | 1 |
Expand to see variants.
| Variant | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|
| variant1 | 1 | 1 | 0 | 0 |
| variant2 | 0 | 0 | 1 | 1 |
| variant3 | 0 | 0 | 1 | 1 |
| variant4 | 1 | 1 | 0 | 0 |
| variant5 | 0 | 0 | 1 | 1 |
Expand to see kinship.
| | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|
| sample1 | 0.0038336778 | 0.0 | 0.0 | 0.0020687925 |
| sample2 | 0.0 | 0.0038336778 | 0.0016861048 | 0.0 |
| sample3 | 0.0 | 0.0016861048 | 0.0023424293 | 0.0 |
| sample4 | 0.0020687925 | 0.0 | 0.0 | 0.0032878782 |
Reproducing the Issue
Here is my command that triggers it:
pyseer \
--lmm \
--cpu 1 \
--similarity kinship.tsv \
--pres variants.Rtab \
--phenotypes phenotypes.tsv \
--phenotype-column resistant \
--print-filtered > locus_effects.tsv
Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants

In the output file, all variants are flagged as lrt-filtering-failed except for the very last variant (variant5).
| variant | af | filter-pvalue | lrt-pvalue | beta | beta-std-err | variant_h2 | notes |
|---|---|---|---|---|---|---|---|
| variant1 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant2 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant3 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant4 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant5 | 5.00E-01 | 4.55E-02 | 0.00E+00 | 1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
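For larger inputs, a check along the following lines might help spot the same symptom. It is only a sketch: it groups variants by their presence/absence pattern and reports any group whose members received different notes. The helper names (`group_by_pattern`, `inconsistent_groups`) are hypothetical, and the data are hard-coded from the example above rather than read from pyseer's actual files.

```python
import csv
import io
from collections import defaultdict

# Presence/absence table copied from the example above (tab-separated).
RTAB = (
    "Variant\tsample1\tsample2\tsample3\tsample4\n"
    "variant1\t1\t1\t0\t0\n"
    "variant2\t0\t0\t1\t1\n"
    "variant3\t0\t0\t1\t1\n"
    "variant4\t1\t1\t0\t0\n"
    "variant5\t0\t0\t1\t1\n"
)

# "notes" column of the output shown above, keyed by variant name.
NOTES = {
    "variant1": "lrt-filtering-failed,bad-chisq",
    "variant2": "lrt-filtering-failed,bad-chisq",
    "variant3": "lrt-filtering-failed,bad-chisq",
    "variant4": "lrt-filtering-failed,bad-chisq",
    "variant5": "bad-chisq",
}


def group_by_pattern(rtab_text):
    """Map each presence/absence pattern to the variants that share it."""
    reader = csv.reader(io.StringIO(rtab_text), delimiter="\t")
    next(reader)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if row:
            groups[tuple(row[1:])].append(row[0])
    return dict(groups)


def inconsistent_groups(groups, notes):
    """Return the pattern groups whose variants did not all get the same notes."""
    return {
        pattern: {name: notes[name] for name in names}
        for pattern, names in groups.items()
        if len({notes[name] for name in names}) > 1
    }


groups = group_by_pattern(RTAB)
bad = inconsistent_groups(groups, NOTES)
for pattern, per_variant in bad.items():
    print("pattern", "".join(pattern), per_variant)
```

On the example data this reports only the `0011` group, where variant2 and variant3 are flagged but variant5 is not.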
Possible Solutions
I've found this could be related to the numerical precision of the similarity matrix. If I round the kinship matrix to 8 decimal places, I get the expected, consistent output for all variants.
Expand to see kinship with 8 decimal places.
| | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|
| sample1 | 0.00383368 | 0.0 | 0.0 | 0.00206879 |
| sample2 | 0.0 | 0.00383368 | 0.00168610 | 0.0 |
| sample3 | 0.0 | 0.00168611 | 0.00234243 | 0.0 |
| sample4 | 0.00206879 | 0.0 | 0.0 | 0.00328788 |
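A rounded kinship file like this could be produced with a short script. The following is a minimal sketch using only the standard library; the function names (`round_kinship`, `is_symmetric`) are hypothetical, and the tab-separated layout with an empty first header cell is an assumption about the file format. It also checks that the matrix is still symmetric after rounding.

```python
import csv
import io

# Original kinship values from above; the layout (tab-separated,
# empty first header cell) is assumed for illustration.
KINSHIP = (
    "\tsample1\tsample2\tsample3\tsample4\n"
    "sample1\t0.0038336778\t0.0\t0.0\t0.0020687925\n"
    "sample2\t0.0\t0.0038336778\t0.0016861048\t0.0\n"
    "sample3\t0.0\t0.0016861048\t0.0023424293\t0.0\n"
    "sample4\t0.0020687925\t0.0\t0.0\t0.0032878782\n"
)


def round_kinship(text, places=8):
    """Rewrite every matrix entry with a fixed number of decimal places."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    out = ["\t".join(rows[0])]  # header row is kept unchanged
    for row in rows[1:]:
        # Zeros come out as 0.00000000, which is numerically the same as 0.0.
        values = [f"{float(v):.{places}f}" for v in row[1:]]
        out.append("\t".join([row[0]] + values))
    return "\n".join(out) + "\n"


def is_symmetric(text):
    """Check that the matrix equals its transpose, entry by entry."""
    body = list(csv.reader(io.StringIO(text), delimiter="\t"))[1:]
    m = [[float(v) for v in row[1:]] for row in body]
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))


rounded = round_kinship(KINSHIP)
print(rounded, end="")
print("symmetric:", is_symmetric(rounded))
```

With this approach, 0.0016861048 becomes 0.00168610 in both the sample2/sample3 and sample3/sample2 cells, so the rounded matrix stays symmetric.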
pyseer \
--lmm \
--cpu 1 \
--similarity kinship_e8.tsv \
--pres variants.Rtab \
--phenotypes phenotypes.tsv \
--phenotype-column resistant \
--print-filtered > locus_effects.tsv
Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants

| variant | af | filter-pvalue | lrt-pvalue | beta | beta-std-err | variant_h2 | notes |
|---|---|---|---|---|---|---|---|
| variant1 | 5.00E-01 | 4.55E-02 | 0.00E+00 | -1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
| variant2 | 5.00E-01 | 4.55E-02 | 0.00E+00 | 1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
| variant3 | 5.00E-01 | 4.55E-02 | 0.00E+00 | 1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
| variant4 | 5.00E-01 | 4.55E-02 | 0.00E+00 | -1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
| variant5 | 5.00E-01 | 4.55E-02 | 0.00E+00 | 1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
As an aside, I do know that I should be filtering out bad-chisq variants. This is just my attempt at a minimal working example of an issue I encountered while working with a larger population size.
I'm curious whether this issue is reproducible for you and, if so, whether you think I'm on the right track in investigating the precision of the similarity matrix. Thank you for your time!