phylogenetic tree build fails #931

Closed
phiweger opened this issue Aug 10, 2018 · 15 comments

Comments

@phiweger

Hi,

I am having problems similar to issue #690 related to building a phylogenetic tree.

Housekeeping first:

Anvi'o version ...............................: margaret (v5.1)
Profile DB version ...........................: 29
Contigs DB version ...........................: 12
Pan DB version ...............................: 12
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

I installed anvi'o via brew on macOS High Sierra 10.13.5:

brew tap merenlab/anvio
brew install merenlab/anvio/anvio
anvi-self-test --suite mini
# all fine

I am following Murat's tutorial on the infant gut dataset:

anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa -o phylogenomic-tree.txt

Input aligment file path .....................: .../INFANT-GUT-TUTORIAL/seqs-for-phylogenomics.fa
Output file path .............................: .../INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: Streptococcus, P_rhinitidis, L_citreum, C_albicans, S_epidermidis, F_magna, P_avidum, E_facealis, S_hominis, Aneorococcus_sp, S_aureus
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories


File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/

Thank you for looking into this.

@meren
Member

meren commented Aug 10, 2018

I am having a hard time reproducing this :/ Can you please send the FASTA file you used to get this error? :)

@phiweger
Author

phiweger commented Aug 10, 2018

I just sent it.

@meren
Member

meren commented Aug 10, 2018

Still unable to reproduce:

$ anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics-Viehwege.fa -o phylogenomic-tree.txt
Input aligment file path .....................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics-Viehwege.fa
Output file path .............................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: Streptococcus, P_rhinitidis, L_citreum, C_albicans, S_epidermidis, F_magna, P_avidum, E_facealis, S_hominis, Aneorococcus_sp, S_aureus
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 No SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories

FastTree output newick file ..................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt

$ cat phylogenomic-tree.txt
((Streptococcus:0.27715,E_facealis:0.10530)0.791:0.03751,(P_avidum:0.60808,L_citreum:0.16850)1.000:0.08870,((S_aureus:0.03558,(S_epidermidis:0.04069,S_hominis:0.04839)0.808:0.01802)1.000:0.08470,(C_albicans:0.43611,(Aneorococcus_sp:0.41781,(F_magna:0.18320,P_rhinitidis:0.17801)0.993:0.04472)0.999:0.05406)1.000:0.08265)1.000:0.06239);
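A quick sanity check on a tree like the one above is to confirm that every alignment name shows up as a leaf. The following is a minimal sketch in plain Python, not part of anvi'o or ETE; the helper names `leaf_names` and `missing_leaves` are hypothetical, and the regular expression assumes simple, unquoted Newick such as FastTree's output.

```python
import re
from typing import Iterable, List


def leaf_names(newick: str) -> List[str]:
    """Extract leaf names from a simple, unquoted Newick string.

    A leaf name is taken to be any label that directly follows '(' or ','.
    Labels that follow ')' (support values) are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick.strip())


def missing_leaves(newick: str, expected: Iterable[str]) -> List[str]:
    """Return the expected names that do not occur as leaves, sorted."""
    return sorted(set(expected) - set(leaf_names(newick)))


if __name__ == "__main__":
    # Toy tree for illustration only (not taken from the thread).
    tree = "((A:0.1,B:0.2)0.900:0.05,C:0.3);"
    print(leaf_names(tree))
    print(missing_leaves(tree, ["A", "B", "C", "D"]))
```

An empty result from `missing_leaves` means every expected name is present; it says nothing about whether the topology is sensible.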

@phiweger
Author

Odd. When I run MUSCLE followed by FastTree manually and proceed with the following step in the tutorial

anvi-interactive --tree phylogenomic-tree.txt \
                 -p temp-profile.db \
                 --title "Pylogenomics of IGD Bins" \
                 --manual

then all's well. I reinstalled ete3 (conda install ete3==3.1.1), the version listed in anvi'o's requirements.txt, but the error persists.

One observation: the ETE error is thrown very shortly after calling anvi-gen-phylogenomic-tree, too soon for MUSCLE to have finished. So I guess there might not be an MSA yet. Could the call to MUSCLE be the problem?

@meren
Member

meren commented Aug 11, 2018

Can you please run the same command with the --debug flag so we can see the traceback?

@phiweger
Author

Sure:

anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa -o phylogenomic-tree.txt --debug

Input aligment file path .....................: .../gone-fishing/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics.fa
Output file path .............................: .../gone-fishing/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: P_avidum, F_magna, L_citreum, S_aureus, Aneorococcus_sp, Streptococcus, S_epidermidis, C_albicans, P_rhinitidis, S_hominis, E_facealis
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories

Traceback for debugging
================================================================================
  File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 70, in <module>
    main(args)
  File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 52, in main
    program().run_command(input_file_path, output_file_path)
  File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/drivers/fasttree.py", line 63, in run_command
    if filesnpaths.is_proper_newick(output_stdout):
  File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/filesnpaths.py", line 57, in is_proper_newick
    to say about this: '%s'. Pity :/" % e)
================================================================================


File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/

@meren
Member

meren commented Apr 17, 2019

I was just going through old issues that were not fully addressed and saw this one. I hope it sorted itself out :( Thanks for taking the time to report this and for your follow-up to help identify the problem, and apologies for not getting back to this earlier.

@mschecht
Contributor

Hi @meren, I am getting the same error as described in this issue.

os: MacOS Catalina 10.15.4

anvio version

Anvi'o version ...............................: esther (v6.2-master)
Profile DB version ...........................: 32
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

This was my original command. The file concatenated-proteins.fa contained all the SCGs I found when visualizing the pangenome.

$ anvi-gen-phylogenomic-tree -f concatenated-proteins.fa -o phylogenomic-tree.txt --debug

Input aligment file path .....................: /Users/mschechter/Downloads/concatenated-proteins.fa
Output file path .............................: /Users/mschechter/Downloads/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 179,052
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for genome_1 ......: expected 182504 but have 182502 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.

Traceback for debugging
================================================================================
  File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 73, in <module>
    main(args)
  File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 55, in main
    program().run_command(input_file_path, output_file_path)
  File "/Users/mschechter/github/anvio/anvio/drivers/fasttree.py", line 63, in run_command
    if filesnpaths.is_proper_newick(output_stdout):
  File "/Users/mschechter/github/anvio/anvio/filesnpaths.py", line 57, in is_proper_newick
    "to say about this: '%s'. Pity :/" % e)
================================================================================


File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/

I then went back into the interactive interface and made a new, significantly smaller selection of SCGs (n = 5) and anvi-gen-phylogenomic-tree worked.

$ anvi-gen-phylogenomic-tree -f concatenated-proteins_small.fa -o phylogenomic-tree.txt

Input aligment file path .....................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/concatenated-proteins_small.fa
Output file path .............................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 975
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Info .........................................: Ignored unknown character X (seen 576 times)
Refining topology ............................: 22 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Info .........................................: Total branch-length 0.078 after 0.18 sec
ML-NNI round 1 ...............................: LogLk = -3545.418 NNIs 15 max delta 5.85 Time 1.76
Info .........................................: Switched to using 20 rate categories (CAT approximation)
Info .........................................: Rate categories were divided by 0.645 so that average rate = 1.0
Info .........................................: CAT-based log-likelihoods may not be comparable across runs
Info .........................................: Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2 ...............................: LogLk = -3512.777 NNIs 7 max delta 0.00 Time 3.66
Info .........................................: Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3 ...............................: LogLk = -3512.777 NNIs 5 max delta 0.00 Time 4.85 (final)
Optimize all lengths .........................: LogLk = -3512.777 Time 5.20

FastTree output newick file ..................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt

Here are the differences in alignment lengths between the input files:
concatenated-proteins.fa: 179,052
concatenated-proteins_small.fa: 975

I also attempted to run MUSCLE and FastTree individually on my original concatenated-proteins.fa, but could not get past the alignment step. I am not sure whether this is informative, but I wanted to include it just in case.

$ muscle -in ../concatenated-proteins.fa -out concatenated-proteins.msa

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

concatenated-proteins 328 seqs, lengths min 178383, max 178587, avg 178512
00:01:04   326 MB(17%)  Iter   1  100.00%  K-mer dist pass 1
00:01:04   326 MB(17%)  Iter   1  100.00%  K-mer dist pass 2
Killed04   436 MB(22%)  Iter   1    0.31%  Align node

Thank you for taking a look and please let me know if you want me to send you any of my files for reproducibility.
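The "Wrong number of characters" message in the failing run, together with MUSCLE reporting different minimum and maximum sequence lengths for the same file, may indicate that the sequences are not all the same length, which FastTree expects of an alignment. Below is a minimal Python sketch for checking this before building a tree. It uses only the standard library rather than any anvi'o API; the function names and toy records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def length_outliers(records: Dict[str, str]) -> Dict[str, int]:
    """Return {name: length} for sequences that differ from the most common length."""
    if not records:
        return {}
    lengths = {name: len(seq) for name, seq in records.items()}
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}


if __name__ == "__main__":
    # Toy records for illustration only; genome_3 is two characters short.
    toy = [">genome_1", "MKT-AIL", ">genome_2", "MKTLAIL", ">genome_3", "MKTLA"]
    print(length_outliers(read_fasta(toy)))
```

To check a real file, the same functions can be applied to an open file handle, for example `length_outliers(read_fasta(open(path)))`, where `path` is whatever FASTA file is about to be passed to the tree builder.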

@mschecht mschecht reopened this Apr 23, 2020
@meren
Member

meren commented Apr 23, 2020

Alignment sequence length ....................: 179,052

This is simply too many residues to consider. That's why Mahmoud has implemented functional homogeneity estimates per gene cluster, so you can choose only those gene clusters with meaningful variation (most of them will have functional homogeneity of 1.0, meaning that there is no variation across genes within them) and no alignment issues (i.e., geometric homogeneity > 0.95).
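The selection logic described here can be sketched in plain Python: keep gene clusters that show some variation (functional homogeneity below 1.0) and have no alignment issues (geometric homogeneity above 0.95). This is only an illustration under stated assumptions. The cluster names, index values, and the `select_clusters` helper are hypothetical, and in practice anvi'o computes these indices and applies the filters itself.

```python
from typing import Dict, List, NamedTuple


class Homogeneity(NamedTuple):
    functional: float  # 1.0 means no variation across genes in the cluster
    geometric: float   # lower values suggest alignment issues


def select_clusters(
    clusters: Dict[str, Homogeneity],
    max_functional: float = 1.0,   # exclusive upper bound (assumed threshold)
    min_geometric: float = 0.95,   # exclusive lower bound, as in the comment
) -> List[str]:
    """Return names of clusters with variation and a clean alignment, sorted."""
    return sorted(
        name
        for name, h in clusters.items()
        if h.functional < max_functional and h.geometric > min_geometric
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    toy = {
        "GC_1": Homogeneity(functional=1.00, geometric=1.00),  # no variation
        "GC_2": Homogeneity(functional=0.92, geometric=0.99),  # kept
        "GC_3": Homogeneity(functional=0.80, geometric=0.70),  # alignment issues
    }
    print(select_clusters(toy))
```

Tightening `max_functional` shrinks the selection further, which is the lever for getting the concatenated alignment down to a manageable length.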

@mschecht
Contributor

Thanks for the suggestions, @meren. I went back and filtered for a group of 70 SCGs using the combined homogeneity index, and was then able to run anvi-gen-phylogenomic-tree successfully.

@Sirbius

Sirbius commented Nov 15, 2023

Hi guys,
I'm using anvi'o v7 within the Docker container. I'm trying to build a phylogenetic tree from the concatenated single-copy core gene (SCG) sequences extracted with:

anvi-get-sequences-for-gene-clusters -g Xac-GENOMES.db -p XacPangenome/XacAnalysis-PAN.db --min-functional-homogeneity-index 1 --min-geometric-homogeneity-index 0.95 --min-num-genomes-gene-cluster-occurs 13 --max-num-genes-from-each-genome 1 --concatenate-gene-clusters -o SCG-H-filtered.fasta

But when I run

anvi-gen-phylogenomic-tree -f SCG-H-filtered.fasta -o SCG-H-tree

I still get the famous error from ETE:

Input aligment file path .....................: /home/silviat/Andrea/Xac_pangenome/SCG-H-filtered.fasta
Output file path .............................: /home/silviat/Andrea/Xac_pangenome/SCG-H-tree
Alignment names ..............................: Xac_301, Xac_A7, Xac_CFBP1159_INRA, Xac_CFBP1159_ZHAW, Xac_CFBP1846, Xac_CFBP2565, Xac_CFBP6600,
Xac_IVIA3978, Xac_NCCB100457, Xac_XH2, Xac_XH3, Xac_XH7, Xac_XH8
Alignment sequence length ....................: 509,496
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for Xac_301 .......: expected 526407 but have 526397 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.

File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
about this: 'Unexisting tree file or Malformed newick tree structure. You may
want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
Pity :/

Of course, what it says about Xac_301 (expected 526407 but have 526397 instead) is not really true.
Is the problem the large number of SCGs?
I cannot reduce the number of gene clusters any further, since we are dealing with isolates of the same pathovar.

Any suggestions?
Is there an alternative way to build a phylogenetic tree from the SCGs?
Thanks
Silvia

@meren
Member

meren commented Nov 16, 2023

Hey @Sirbius, would you please consider using the Docker container for v8?

Plus,

Alignment sequence length ....................: 509,496

0.5 million residues is a little too much for any meaningful analysis, I think :) You should consider using these flags instead:

--min-geometric-homogeneity-index 1.0 --min-functional-homogeneity-index 0.95

Yours seem to be the opposite of the best practice.

@MrCorylus

Hello @meren,
I’m working with @Sirbius. Could you please explain why you think a phylogenomic tree built from such a long alignment is not meaningful? We thought that the more genes we compare among strains, the more solid the analysis is. Are we wrong?
We tried the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too many for anvi’o to build a phylogenomic tree.
Since genes with 100% sequence identity across all the strains analyzed carry no information in a phylogenomic study, would it make sense to set --max-geometric-homogeneity-index 0.99 to exclude all the clusters whose genes have the same sequence in every strain? Is that what the geometric homogeneity index is meant for? With this parameter we obtain only 265 clusters and are able to build a phylogenetic tree.
Thank you in advance,

Andrea

@meren
Member

meren commented Nov 17, 2023

Hi @MrCorylus, would you mind sharing with me a private download link for the PAN.db and genomes storage (GENOMES.db) via email so I can take a look at the data before making a suggestion?

@meren
Member

meren commented Nov 17, 2023

Hi again Andrea,

Thanks for the email.

We tried to use the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too much for Anvi’o to build a phylogenomic tree.

I'm sorry, I can see that I've made a mistake in my suggestion. It should've been --max-functional-homogeneity-index 0.95 and not --min-functional-homogeneity-index 0.95. But when you correct for that you only get 2 gene clusters, which is not very useful. But using these parameters instead,

[image: screenshot of the parameters used]

I was able to get 30 gene clusters, and was able to generate a tree:

[image: screenshot of the resulting tree]

I used the interactive interface for convenience, but you should be able to translate my parameters to the command line easily.

I hope this helps.

Best wishes,
Meren

@meren meren closed this as completed Nov 17, 2023
@merenlab merenlab deleted a comment from MrCorylus Nov 17, 2023