Identical variant patterns are inconsistently flagged and filtered out

Hi pyseer team!

I've come across an issue in which variants with identical patterns are all flagged as `lrt-filtering-failed` _except_ for the very last variant in the input file. I've seen this in `pyseer` `v1.3.11` and `v1.3.12` so far (both installed with `conda`).

## Test Data

Here is the test data for the minimal working example: [data.zip](https://github.com/user-attachments/files/19238609/data.zip)

<details>
<summary>Expand to see phenotypes.</summary>

| sample | resistant |
|:------- |:--------- |
| sample1 | 0 |
| sample2 | 0 |
| sample3 | 1 |
| sample4 | 1 |

</details>

<details>
 <summary>Expand to see variants.</summary>
 
|Variant |sample1|sample2|sample3|sample4|
|:-------|:------|:------|:------|:------|
|variant1|1 |1 |0 |0 |
|variant2|0 |0 |1 |1 |
|variant3|0 |0 |1 |1 |
|variant4|1 |1 |0 |0 |
|variant5|0 |0 |1 |1 |

</details>

<details>
 <summary>Expand to see kinship.</summary>

| |sample1 |sample2 |sample3 |sample4 |
|:------|:-----------|:-----------|:-----------|:-----------|
|sample1|0.0038336778|0.0 |0.0 |0.0020687925|
|sample2|0.0 |0.0038336778|0.0016861048|0.0 |
|sample3|0.0 |0.0016861048|0.0023424293|0.0 |
|sample4|0.0020687925|0.0 |0.0 |0.0032878782|

</details>

## Reproducing the Issue

Here is the command that triggers it, with the log output from `pyseer` shown beneath it:

```bash
pyseer \
 --lmm \
 --cpu 1 \
 --similarity kinship.tsv \
 --pres variants.Rtab \
 --phenotypes phenotypes.tsv \
 --phenotype-column resistant \
 --print-filtered > locus_effects.tsv

Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants
```

In the output file, all variants are flagged as `lrt-filtering-failed` except for the very last variant (`variant5`).

| variant | af | filter-pvalue | lrt-pvalue | beta | beta-std-err | variant_h2 | notes |
|:-------- |:-------- |:------------- |:---------- |:-------- |:------------ |:---------- |:------------------------------ |
| variant1 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant2 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant3 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant4 | 5.00E-01 | 4.55E-02 | 1.00E+00 | | | | lrt-filtering-failed,bad-chisq |
| variant5 | 5.00E-01 | 4.55E-02 | 0.00E+00 | 1.00E+00 | 0.00E+00 | 1.00E+00 | bad-chisq |
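
To spot the same inconsistency in a larger run, a small check along these lines might help. It is only a sketch: it assumes `pandas` is installed, it inlines the example data from this issue instead of reading `variants.Rtab` and `locus_effects.tsv`, and the helper name `inconsistent_groups` is made up for illustration. It groups variants by presence/absence pattern and reports any group whose `notes` differ.

```python
import io

import pandas as pd

# Inline copies of the example files from this issue; in practice these
# would be read from variants.Rtab and locus_effects.tsv.
RTAB = (
    "Variant\tsample1\tsample2\tsample3\tsample4\n"
    "variant1\t1\t1\t0\t0\n"
    "variant2\t0\t0\t1\t1\n"
    "variant3\t0\t0\t1\t1\n"
    "variant4\t1\t1\t0\t0\n"
    "variant5\t0\t0\t1\t1\n"
)
RESULTS = (
    "variant\taf\tfilter-pvalue\tlrt-pvalue\tbeta\tbeta-std-err\tvariant_h2\tnotes\n"
    "variant1\t5.00E-01\t4.55E-02\t1.00E+00\t\t\t\tlrt-filtering-failed,bad-chisq\n"
    "variant2\t5.00E-01\t4.55E-02\t1.00E+00\t\t\t\tlrt-filtering-failed,bad-chisq\n"
    "variant3\t5.00E-01\t4.55E-02\t1.00E+00\t\t\t\tlrt-filtering-failed,bad-chisq\n"
    "variant4\t5.00E-01\t4.55E-02\t1.00E+00\t\t\t\tlrt-filtering-failed,bad-chisq\n"
    "variant5\t5.00E-01\t4.55E-02\t0.00E+00\t1.00E+00\t0.00E+00\t1.00E+00\tbad-chisq\n"
)


def inconsistent_groups(rtab, results):
    """Return {pattern: {variant: notes}} for identical patterns whose notes differ.

    Hypothetical helper, not part of pyseer.
    """
    # Encode each variant's presence/absence row as a string such as "0011".
    patterns = rtab.astype(int).astype(str).apply("".join, axis=1)
    notes = results["notes"].fillna("")
    report = {}
    for pattern, names in patterns.groupby(patterns).groups.items():
        group_notes = notes.reindex(names)
        # Identical patterns should receive identical notes.
        if group_notes.nunique() > 1:
            report[pattern] = group_notes.to_dict()
    return report


rtab = pd.read_csv(io.StringIO(RTAB), sep="\t", index_col=0)
results = pd.read_csv(io.StringIO(RESULTS), sep="\t", index_col=0)
report = inconsistent_groups(rtab, results)
for pattern, notes_by_variant in report.items():
    print(pattern, notes_by_variant)
```

On the example above, only the `0011` group (`variant2`, `variant3`, `variant5`) should be reported, since `variant1` and `variant4` share the same notes.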

## Possible Solutions

I've found this could be related to the numeric precision of the similarity matrix. If I round the kinship matrix to 8 decimal places (down from 10), I get the expected, consistent output for all variants.

<details>
 <summary>Expand to see kinship rounded to 8 decimal places.</summary>

| |sample1 |sample2 |sample3 |sample4 |
|:------|:---------|:---------|:---------|:---------|
|sample1|0.00383368|0.0 |0.0 |0.00206879|
|sample2|0.0 |0.00383368|0.00168610|0.0 |
|sample3|0.0 |0.00168611|0.00234243|0.0 |
|sample4|0.00206879|0.0 |0.0 |0.00328788|

</details>
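
For anyone trying to reproduce this, a reduced-precision matrix could be generated roughly as follows. This is a sketch under assumptions, not necessarily how `kinship_e8.tsv` was produced: it assumes `pandas` and `numpy` are available, inlines the original kinship values, and uses a hypothetical helper `round_kinship`. The symmetry check is an extra safeguard, added because the reduced matrix above has `0.00168610` in one off-diagonal cell and `0.00168611` in its mirror.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the original kinship matrix from this issue; in practice
# this would be read from kinship.tsv.
KINSHIP = (
    "\tsample1\tsample2\tsample3\tsample4\n"
    "sample1\t0.0038336778\t0.0\t0.0\t0.0020687925\n"
    "sample2\t0.0\t0.0038336778\t0.0016861048\t0.0\n"
    "sample3\t0.0\t0.0016861048\t0.0023424293\t0.0\n"
    "sample4\t0.0020687925\t0.0\t0.0\t0.0032878782\n"
)


def round_kinship(kinship, decimals=8):
    """Round a square similarity matrix and verify it is still symmetric.

    Hypothetical helper, not part of pyseer.
    """
    rounded = kinship.round(decimals)
    # Element-wise rounding of a symmetric matrix should stay symmetric.
    if not np.array_equal(rounded.values, rounded.values.T):
        raise ValueError("rounded similarity matrix is not symmetric")
    return rounded


kinship = pd.read_csv(io.StringIO(KINSHIP), sep="\t", index_col=0)
rounded = round_kinship(kinship)
# Print as TSV; pass a file path to to_csv instead to write it to disk.
print(rounded.to_csv(sep="\t"))
```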

```bash
pyseer \
 --lmm \
 --cpu 1 \
 --similarity kinship_e8.tsv \
 --pres variants.Rtab \
 --phenotypes phenotypes.tsv \
 --phenotype-column resistant \
 --print-filtered > locus_effects.tsv

Read 4 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (4, 4)
Analysing 4 samples found in both phenotype and similarity matrix
h^2 = 0.00
5 loaded variants
0 pre-filtered variants
5 tested variants
5 printed variants
```

|variant |af |filter-pvalue|lrt-pvalue|beta |beta-std-err|variant_h2|notes |
|:-------|:-------|:------------|:---------|:--------|:-----------|:---------|:--------|
|variant1|5.00E-01|4.55E-02 |0.00E+00 |-1.00E+00|0.00E+00 |1.00E+00 |bad-chisq|
|variant2|5.00E-01|4.55E-02 |0.00E+00 |1.00E+00 |0.00E+00 |1.00E+00 |bad-chisq|
|variant3|5.00E-01|4.55E-02 |0.00E+00 |1.00E+00 |0.00E+00 |1.00E+00 |bad-chisq|
|variant4|5.00E-01|4.55E-02 |0.00E+00 |-1.00E+00|0.00E+00 |1.00E+00 |bad-chisq|
|variant5|5.00E-01|4.55E-02 |0.00E+00 |1.00E+00 |0.00E+00 |1.00E+00 |bad-chisq|


As an aside, I do know that I should be filtering out `bad-chisq` variants. This is just my attempt at a minimal working example of an issue I encountered while working with a larger population size.

I'm curious whether this issue is reproducible for you and, if so, whether you think I'm on the right track investigating the precision of the similarity matrix. Thank you for your time!
