I am running Panaroo on 100 E. coli isolates with different parameters, and I am validating the gene presence/absence matrix using a subset of genes (those starting with tra). The same gene is often split across several rows. Ideally, I would expect only 15 rows for all the genes starting with tra: traA, traC, traD, traG and so on. This has important implications, since I want to use this data for a GWAS. Do you have any recommendations on how to minimize this issue? I attach part of the gene presence/absence table for one of the runs, filtered for the tra genes. I have experimented with the -c, -f, -mode and merge_paralogs parameters without success.
Thank you.
Nicola
This is likely to be mainly due to the diversity and mobility of the gene.
If you have diverse versions of a gene occurring in different regions of the genome, it can be very challenging (and sometimes not desirable) to cluster them together. In my experience, annotations of gene families like this one are often inconsistent, so I would also expect instances of different tra* genes being clustered together in some situations.
By design, Panaroo is cautious about clustering diverse copies of a gene that occur in different locations, as these can often have different functions or phenotypes. I would recommend performing two separate GWAS analyses: one using the gene clusters from Panaroo, which also encode information about location and diversity, and one using unitigs, as described in the pyseer documentation.
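To get a quick picture of how many clusters each tra gene ends up in before running either GWAS, something like the following rough sketch may help. It assumes the table has a `Gene` column holding the cluster label, that merged labels are joined with `~~~`, and that annotation tools append numeric suffixes such as `_1`; those conventions should be checked against your own output. The helper names (`base_names`, `clusters_per_gene`) and the tiny inline table are made up for illustration only.

```python
import csv
import io
import re
from collections import defaultdict


def base_names(cluster_label):
    """Reduce a label such as 'traC~~~traC_2' to its base gene names, here {'traC'}.

    Assumes '~~~' joins merged names and '_<number>' is an annotation suffix.
    """
    parts = cluster_label.split("~~~")
    return {re.sub(r"_\d+$", "", part) for part in parts if part}


def clusters_per_gene(handle, prefix="tra"):
    """Map each base gene name starting with `prefix` to the cluster rows carrying it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        for name in base_names(row["Gene"]):
            if name.startswith(prefix):
                groups[name].append(row["Gene"])
    return dict(groups)


# Made-up example table; in practice, open your own presence/absence CSV instead.
example = """Gene,Non-unique Gene name,Annotation,isolate1,isolate2
traA,,pilin,iso1_0001,
traA_1,,pilin,,iso2_0042
traC~~~traC_2,,ATPase,iso1_0002,iso2_0043
group_17,,hypothetical protein,iso1_0003,
"""

split = clusters_per_gene(io.StringIO(example))
for gene, rows in sorted(split.items()):
    print(gene, len(rows), rows)
```

Genes that map to more than one row are the ones being split. Whether those rows should be merged or kept apart for the association test is a judgment call that depends on how divergent the copies are and where they sit in the genome.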