Uncollapsed gene families #223

Closed
nicola-palmieri opened this issue Mar 9, 2023 · 1 comment

@nicola-palmieri

I am running Panaroo on 100 E. coli isolates with different parameters, and I am validating the gene presence/absence matrix using a subset of genes (those starting with tra). The same gene often seems to be split across different rows. Ideally, I would expect only 15 rows for all the genes starting with tra: traA, traC, traD, traG and so on. This has important implications, since I want to use this data to perform a GWAS analysis. Do you have any recommendations on how to minimize this issue? I attach part of the gene presence/absence table for one of the runs, filtered for the tra genes. I have played with the -c, -f, -mode and merge_paralogs parameters without success.

Thank you.
Nicola
panaroo-output-100_Ecoli_isolates-tra-genes
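
A check like the one described above could be sketched roughly as follows. This is only an illustration, not necessarily how the attached table was produced. It assumes the presence/absence table is a CSV with a `Gene` column holding the cluster label, that merged labels are joined with `~~~`, and that annotation suffixes look like `_2`. The helper names `base_names` and `rows_per_gene` and the example rows are hypothetical.

```python
import csv
import io
import re
from collections import Counter


def base_names(cluster_label, prefix="tra"):
    """Return the distinct base gene names in a cluster label that start with prefix.

    Assumption: merged labels are joined with '~~~' and numeric suffixes
    such as '_2' only distinguish copies of the same annotated gene.
    """
    names = set()
    for part in cluster_label.split("~~~"):
        base = re.sub(r"_\d+$", "", part)  # traD_2 -> traD
        if base.startswith(prefix):
            names.add(base)
    return names


def rows_per_gene(csv_text, prefix="tra"):
    """Count how many rows of the table mention each base gene name."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in base_names(row["Gene"], prefix):
            counts[name] += 1
    return counts


# Made-up rows for illustration only (not taken from the attached table).
example = """Gene,Non-unique Gene name,Annotation,isolate1,isolate2
traD,,conjugal transfer protein,i1_001,
traD_2,,conjugal transfer protein,,i2_042
traA~~~traA_1,,pilin,i1_002,i2_043
group_17,,hypothetical protein,i1_003,
"""

counts = rows_per_gene(example)
# Genes that appear in more than one row, i.e. split across clusters.
split = {gene: n for gene, n in counts.items() if n > 1}
print(split)
```

Comparing the `split` dictionary between runs with different parameters gives a quick measure of whether a given setting reduces the number of split families.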

@gtonkinhill
Owner

Hi Nicola,

This is likely to be mainly due to the diversity and mobility of the gene.

If you have diverse versions of a gene occurring in different regions of the genome, it can be very challenging (and sometimes not desirable) to cluster them together. In my experience, annotations of gene families like this one are often inconsistent, so I would also expect instances of annotations of different tra* genes to be clustered together in some situations.

By design, Panaroo is cautious about clustering diverse copies of a gene that occur in different locations, as these can often have different functions or phenotypes. I would recommend performing two separate GWAS analyses: one using the gene clusters from Panaroo, which will also encode information about location and diversity, and one using unitigs, as described in pyseer.
