Gene family count matrix related with representative gene family . #217

Closed
aababc1 opened this issue May 5, 2024 · 4 comments

@aababc1

aababc1 commented May 5, 2024

Thank you for creating such a great tool.
I have one question: your tool doesn't seem to generate a gene count matrix directly. If I want a count table in addition to the gene presence/absence matrix, do I need to build it manually from the gene_families.tsv information?
Or is there another ppanggolin command that supports this?
Thank you in advance for your answer.

@axbazin
Member

axbazin commented May 6, 2024

Thank you for your kind words!

Indeed I do not think we have such a file among the possible outputs.
The closest would probably be the "matrix.csv" file (see this), which has all the information you want, but lists the genes rather than the raw count itself. Transforming this file may be your easiest way out.

Otherwise, using the gene_families.tsv file along with one of the files that link genes to genomes (e.g., the gff or the gene annotation table files) is a good solution too.
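
The second approach can be sketched roughly as follows. This is only an illustration, not part of ppanggolin: it assumes gene_families.tsv is tab-separated with no header and with columns (family ID, gene ID, optional fragment flag), and that a gene-to-genome mapping has already been built from the gff or annotation tables. The names `parse_family_lines`, `build_count_matrix` and `gene_to_genome` are hypothetical.

```python
from collections import Counter, defaultdict


def parse_family_lines(lines):
    """Yield (family_id, gene_id, flag) tuples from tab-separated lines.

    Assumed layout (not checked against any ppanggolin version):
    family ID, gene ID, optional fragment flag, no header line.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        flag = fields[2] if len(fields) > 2 else ""
        yield fields[0], fields[1], flag


def build_count_matrix(family_rows, gene_to_genome):
    """Count genes per (family, genome).

    gene_to_genome is a hypothetical dict {gene_id: genome_name}.
    Returns the sorted genome names and {family_id: [count per genome]}.
    """
    counts = defaultdict(Counter)
    for family_id, gene_id, _flag in family_rows:
        counts[family_id][gene_to_genome[gene_id]] += 1
    genomes = sorted(set(gene_to_genome.values()))
    # Counter returns 0 for genomes in which a family has no gene.
    matrix = {
        family_id: [per_genome[genome] for genome in genomes]
        for family_id, per_genome in counts.items()
    }
    return genomes, matrix
```

In practice the lines would come from `open("gene_families.tsv")`, and the resulting dictionary can be written out with the `csv` module or loaded into a data frame.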

Adelme

@aababc1
Author

aababc1 commented May 6, 2024

Thank you for your prompt response and suggestion!

I have two more questions.

Following your suggestion, I wrote a custom script for that. The total number of genes in the result file matches the number of lines in gene_families.tsv. I would be grateful if you could check whether the attached file can be used routinely with ppanggolin in my analysis pipeline.
make_count_table.py.txt

First, ppanggolin marks fragmented genes with an F in the third column of the gene_families.tsv file. I included them in the analysis, because I thought fragmented gene sequences could still be functionally annotated and used in downstream analysis. What is your opinion on including fragmented genes in downstream analysis?

Second, about the gene family thresholds. In the paper, 80% coverage and 80% identity were used for gene family construction. These values are frequently used for gene family clustering, but I wonder about adjusting coverage and identity to the species level in microbial comparative genomics. If the species are different, should users choose different clustering criteria, or can the default values simply be used? And if someone lowers the identity to 50%, could that introduce severe bias in downstream analyses that rely on functional annotation of the pangenome reference sequences?

Thank you very much Adelme

@axbazin axbazin self-assigned this May 6, 2024
@axbazin
Member

axbazin commented May 6, 2024

Your script looks fine to me; it does seem to be doing what you want.

About the fragmented genes, it depends on the "downstream analysis" and the biological question. From the technical point of view, if you are annotating genes independently from their gene families, then I think it is fine. If you are doing functional annotations at the scope of gene families, I'd remove them as they may not be able to realize the "function" that they would be annotated with.
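
If fragments are to be left out for a family-level analysis, a filter along these lines could be applied before counting or annotating. It is a minimal sketch that assumes each row is a (family ID, gene ID, flag) tuple and that fragmented genes carry the flag "F" in the third field, as described earlier in this thread; `drop_fragmented` is a hypothetical helper, not a ppanggolin function.

```python
def drop_fragmented(family_rows, fragment_flag="F"):
    """Return only the rows whose third field is not the fragment flag.

    family_rows: iterable of (family_id, gene_id, flag) tuples
    (assumed layout, taken from the description in this thread).
    """
    return [row for row in family_rows if row[2].strip() != fragment_flag]
```

Keeping the filter separate from the counting step makes it easy to produce both versions of the table (with and without fragments) and compare them.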

For the question of the gene family threshold, I'd indeed recommend lowering the identity threshold for clustering when working with different species. If they are "close" sister species (e.g. Neisseria meningitidis and Neisseria gonorrhoeae) 80% is fine, but in general lowering it is better. However, you are correct: it may generate a strong bias if you are annotating your gene families, as some paralogs with different functions may then be annotated exactly the same way. That will only be true for some families, though. It's a balance between wrongly clustered paralogs and wrongly split orthologs, and you can adjust the threshold depending on what's important to your own analysis/biological question.

In my opinion, while annotating gene families is "practical" and much faster, annotating genes directly is still best if you want to avoid mistakes as much as possible.

Have a nice day!
Adelme

@aababc1
Author

aababc1 commented May 6, 2024

Thank you so much for your very detailed explanation.

I asked about the gene family clustering threshold and fragmented genes because I am handling fragmented genomes such as MAGs. As you commented, annotating genomes individually will give the best accuracy, I think. I will test some things based on your advice. My questions are all resolved. Thank you once again.

Have a nice day!
