Unifrac warning message #936

Closed
cwatt opened this issue May 15, 2018 · 30 comments

Comments

@cwatt

cwatt commented May 15, 2018

I'm running into trouble calculating unifrac distances in phyloseq. I keep getting the following message whenever I run the distance function with weighted or unweighted unifrac:

Warning message: In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, : data length [86515] is not a sub-multiple or multiple of the number of rows [43258]

I've checked to make sure that my tree is rooted, that there are no zero-count OTUs in my count table, and that there are no NA values in the OTU count table or tree. After seeing this thread I updated phyloseq (v1.23.1), but the problem persists. The distance function still returns a result, but I'm not sure whether it can be trusted. Does anyone have any ideas about what this warning means, how much of a problem it is, and how it might be solved? I'd really appreciate the help.

Thanks in advance!

Update: I've updated R to version 3.5 and all packages to the latest versions (vegan v2.5-2, ape v5.1, and phyloseq v1.24.0) but no luck getting rid of the warning.

Edit: Removed data and code 7/3/18

@mstagliamonte

mstagliamonte commented May 23, 2018

This happens to me when the data frame containing the sample names is incomplete. I am talking about the metadata to be merged with the OTU table to create your phyloseq object.

Hope it helps
Max

@cwatt
Author

cwatt commented May 30, 2018

Thanks for the suggestion, Max. I double checked and unfortunately my sample names are complete, so that's not what is causing this issue.

@MSMortensen

Hi,
I get the same error message every time I try to calculate UniFrac distances for a data set that has had OTUs removed after the tree was created. If this is the case, you might want to recalculate your phylogenetic tree.

@cwatt
Author

cwatt commented Jul 20, 2018

Hi MSMortensen, I went back and pruned samples after incorporating them into the phyloseq object and that fixed the issue! Thank you for your insight!

@cwatt cwatt closed this as completed Jul 20, 2018
@gkphylo

gkphylo commented Oct 17, 2018

I am experiencing the exact same issue but can't recalculate the tree as suggested.
I am using the new decontam package for filtering contaminants using a phyloseq object (i.e. "set").
After filtering out the contaminants using subset commands, e.g.
```
ex5 <- subset_taxa(set, !Genus=="g__Pseudomonas")
```
I am pruning to remove taxa_sums<1 and then trying to plot an ordination and keep getting this tree edge problem.
How can you recalculate the phylogenetic tree as suggested above to resolve this problem?

```
uuf = UniFrac(ex89, weighted = FALSE)
Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
data length [3635] is not a sub-multiple or multiple of the number of rows [1818]
```

@gkphylo

gkphylo commented Oct 17, 2018

> Hi MSMortensen, I went back and pruned samples after incorporating them into the phyloseq object and that fixed the issue! Thank you for your insight!

Can you please clarify how you pruned samples to solve this? Thanks.

@MSMortensen

> I am experiencing the exact same issue but can't recalculate the tree as suggested.
> I am using the new decontam package for filtering contaminants using a phyloseq object (i.e. "set").
> After filtering out the contaminants using subset commands, e.g.
> ```
> ex5 <- subset_taxa(set, !Genus=="g__Pseudomonas")
> ```
> I am pruning to remove taxa_sums<1 and then trying to plot an ordination and keep getting this tree edge problem.
> How can you recalculate the phylogenetic tree as suggested above to resolve this problem?
>
> ```
> uuf = UniFrac(ex89, weighted = FALSE)
> Warning message:
> In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
> data length [3635] is not a sub-multiple or multiple of the number of rows [1818]
> ```

Hi,
The problem seems to be that one of the taxa you removed was the root of your phylogenetic tree. Rerooting the tree should solve the problem.
For me the solution was to export the repseq from the pruned phyloseq object, import it into Qiime2, build and root a new tree with just those OTUs, and then use that new tree with the pruned phyloseq object.

@cwatt
Author

cwatt commented Oct 17, 2018

@gkphylo My specific problem was caused because I removed some unneeded samples from my OTU table before putting everything together in a phyloseq object, which meant that my tree was inconsistent with my table. When I pruned those samples after making the phyloseq object instead, the problem was solved because phyloseq removed the offending OTUs from the tree at the same time.

If you've already put everything into a phyloseq object before removing things, it sounds like you have a slightly different problem than I did.

@gkphylo

gkphylo commented Oct 17, 2018

It is interesting that my tree seems to be rooted after pruning and I still get the problem.

```
is.rooted(phy_tree(set))
[1] TRUE
```

I will try @MSMortensen's approach and report back to all. I bet this question will become very important for users of the decontam package.

@Nat211

Nat211 commented Oct 17, 2018

Hi all,

I have the same problem! I created my phyloseq object with the tree, used the decontam package as well and then pruned unwanted samples and taxa. I checked and my tree is still "rooted = TRUE". But I get the same error message when trying to use PCoA and unifrac either weighted or unweighted in the ordinate function.

ps.ord<- ordinate(ps, method="PCoA", distance="unifrac", weighted=TRUE)

Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
data length [2237] is not a sub-multiple or multiple of the number of rows [1119]

Looking forward to someone figuring this out!

@gkphylo

gkphylo commented Oct 18, 2018

> Hi all,
>
> I have the same problem! I created my phyloseq object with the tree, used the decontam package as well and then pruned unwanted samples and taxa. I checked and my tree is still "rooted = TRUE". But I get the same error message when trying to use PCoA and unifrac either weighted or unweighted in the ordinate function.
>
> ps.ord<- ordinate(ps, method="PCoA", distance="unifrac", weighted=TRUE)
>
> Warning message:
> In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
> data length [2237] is not a sub-multiple or multiple of the number of rows [1119]
>
> Looking forward to someone figuring this out!

I don't have a good solution yet, but I have a patch for this. I traced the problem back to removing "family" level taxa using the pruning function and fixed it by removing "genus" level contaminants. Curious to know how you are removing the contaminants identified by decontam

@Nat211

Nat211 commented Oct 18, 2018

@gkphylo

First I identified contaminants based on prevalence:

sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control"
contamdf.prev05 <- isContaminant(ps, method="prevalence", neg="is.neg", threshold=0.5)

Then I removed them from my ps object:

ps.decontam_Skin <- prune_taxa(!contamdf.prev05$contaminant, ps)

I am not sure I understand what you are saying about removing "family" or "genus" level?

@gkphylo

gkphylo commented Oct 19, 2018

> @gkphylo
>
> First I identified contaminants based on prevalence:
>
> sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control"
> contamdf.prev05 <- isContaminant(ps, method="prevalence", neg="is.neg", threshold=0.5)
>
> Then I removed them from my ps object:
>
> ps.decontam_Skin <- prune_taxa(!contamdf.prev05$contaminant, ps)
>
> I am not sure I understand what you are saying about removing "family" or "genus" level?

I traced my problem to the genus Klebsiella by removing all contaminants manually, e.g.
```
ex6 = subset_taxa(ex5, !Genus %in% c("g__Delftia","g__Yersinia","g__Ralstonia","g__Microbacterium","g__Schlegelella",
"g__Escherichia","g__Klebsiella"))
```

So when I sequentially add taxa to find what is causing the glitch, Klebsiella is the culprit.
I have checked whether this is a problem with the tree not being rooted, and that does not seem to be the case.

I get these responses:
```
any(is.na(phy_tree(ex6)))
[1] FALSE
is.rooted(phy_tree(ex6))
[1] TRUE
```

@Nat211

Nat211 commented Oct 24, 2018

@joey711 or anyone else. I still haven't been able to figure this out. Do you guys have any input? Highly appreciate it, Nat

@MSMortensen

@Nat211 this problem is not due to how you remove contaminants, but solely to the fact that you removed the OTU that had been set as the root of your tree.
This means that phyloseq ends up with a tree that is acknowledged as a rooted tree, but which does not include the root itself. There are two solutions to this:

  1. Reroot your tree in R, or export and reroot elsewhere.
  2. Ensure that the root of your tree is not pruned out of the phyloseq object, but instead "just" has 0 reads in all samples.

@Nat211

Nat211 commented Nov 5, 2018

Thank you @MSMortensen - I managed to get it to work!

@andrebolerbarros

Hello everyone,

I'm having the same issue; however, I am not performing any kind of filtering whatsoever. Nevertheless, I get the same error:

data length [3983] is not a sub-multiple or multiple of the number of rows [1992]

This happens with rooted & unrooted.

@jiazhou0116

> > I am experiencing the exact same issue but can't recalculate the tree as suggested.
> > I am using the new decontam package for filtering contaminants using a phyloseq object (i.e. "set").
> > After filtering out the contaminants using subset commands, e.g.
> > ```
> > ex5 <- subset_taxa(set, !Genus=="g__Pseudomonas")
> > ```
> > I am pruning to remove taxa_sums<1 and then trying to plot an ordination and keep getting this tree edge problem.
> > How can you recalculate the phylogenetic tree as suggested above to resolve this problem?
> > ```
> > uuf = UniFrac(ex89, weighted = FALSE)
> > Warning message:
> > In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
> > data length [3635] is not a sub-multiple or multiple of the number of rows [1818]
> > ```
>
> Hi,
> The problem seems to be that one of the taxa you have removed were the root of your phylogenetic tree. If you reroot your tree it should solve the problem.
> For me the solution was to export the repseq from the pruned phyloseq object, import that into Qiime2, calculate and root a new tree with just those OTUs and then use that new tree with the pruned phyloseq object.

Hi,
Just wondering how you exported the repseq from the pruned phyloseq object. I exported otu.biom and regenerated the tree in qiime2, but still got warning messages.

library(biomformat);packageVersion("biomformat")
otu<-as(otu_table(Toad5),"matrix")
otu_biom<-make_biom(data=otu)
write_biom(otu_biom,"otu_biom_Toad5.biom")
##RUN CODE IN QIIME2: desktop_qiime2_2.txt
Toad6<-qza_to_phyloseq(features="feature-table_Toad5.qza", tree="rooted-tree-Toad5.qza", taxonomy="taxonomy_gg.qza", metadata="MetaData_CaneToadData1_2.txt")
Toad6

@MSMortensen

MSMortensen commented Apr 1, 2019

Hi,
For me it was easier just to export the representative sequences from R, import them into a new qiime2 object, build tree, export tree file, and import the new tree file into R (you might have to move some files, but the names are consistent):

Export repseq and redo tree with seq in clean.2k
Biostrings::writeXStringSet(refseq(ps.clean),"data/ps.clean.fasta")

Import into Qiime2
qiime tools import --input-path ps.clean.fasta --output-path sequences.qza --type 'FeatureData[Sequence]'

Build midpoint rooted tree in Qiime2
qiime alignment mafft --i-sequences sequences.qza --o-alignment aligned-sequences.qza
qiime alignment mask --i-alignment aligned-sequences.qza --o-masked-alignment masked-aligned-sequences.qza
qiime phylogeny fasttree --i-alignment masked-aligned-sequences.qza --o-tree unrooted-tree.qza
qiime phylogeny midpoint-root --i-tree unrooted-tree.qza --o-rooted-tree rooted-tree.qza

Export tree
qiime tools export rooted-tree.qza --output-dir exported-feature-table/

Create the tree in qiime2 and import it here
tree.new <- read_tree("data/tree.nwk")
phy_tree(ps.clean) <- tree.new

@ghost

ghost commented Apr 30, 2019

Hello,

I'm having the same issue. I am not performing any filtering and directly using an OTU table with taxonomy and rooted tree from QIIME2.

Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
data length [7655] is not a sub-multiple or multiple of the number of rows [3828]

Any help is appreciated!

@cadl0590

cadl0590 commented May 7, 2019

Hi,

I am also having the same issue with DADA2 classified OTUs from qiime2. I have tried the tree re-rooting, which doesn't seem to be helping.

This seems to be a problem several people are having. Has anybody found a solution that doesn't involve rooting the tree?

Thanks!

Christina

@wallacelab

Resurrecting this because I ran smack into it and think I found the root (no pun intended) of the problem.

This error seems to occur when a phylogenetic tree ends up with an edge matrix with an odd number of rows, so that this line (distance-methods.r, line 604)

node.desc <- matrix(tree$edge[order(tree$edge[,1]),][,2],byrow=TRUE,ncol=2)

fails because an odd number of elements can't go into a 2-column matrix evenly.

The odd number of rows, in turn, seems to happen when the phylogenetic tree is built so that some nodes have more than two children. This is output from my own tree (freshly imported from QIIME2):

edges=phy_tree(mytree)$edge
mycounts = table(edges[,1]) # Source nodes; 1st column of edge matrix
length(mycounts[mycounts ==2]) # Number of nodes with exactly 2 children
[1] 27366
length(mycounts[mycounts !=2]) # Number of nodes with more or fewer children
[1] 47
mycounts[mycounts !=2] # How many nodes each of the above has

28065 29006 29113 29372 29728 30091 30849 31021 31469 32032 33032 33069 33306 33400 33665 33948 34071 
    5     3     4     7     6     3     6     3     5     3     3     3     3     5     5     5     4 
34338 34507 34614 35250 35491 35932 36706 37447 43084 43309 44003 44120 44390 45018 45597 45791 45871 
    4     5     3     3     4     3     3     3     3     8     3     5     3     5    11     5     4 
46802 47317 48616 49981 51266 51740 52245 52277 53381 53626 53855 54298 54949 
    3     3     3     6     3     4     3     3    10     4     6    13     9

Most nodes have only 2 children, but some have more (up to 13!). The thing is, I don't know enough about phylogenetics to know if this is an error in phyloseq for making a faulty assumption, or an error in QIIME (or its wrapped software) for making a faulty tree. Can someone tell me where the error lies?

I've made a trimmed version of my tree where I trimmed it to 1000 taxa but still have some nodes with 4-5 children and an odd number of rows in the edge matrix. I'm new to Github; is there a way to upload/attach that file for reproducibility?
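
The mechanism described in this comment can be reproduced in base R without phyloseq or a real tree. This is only a minimal sketch: the `edge` matrix is a made-up toy (not taken from any dataset in this thread) in which node 6 has three children, so the number of edges is odd.

```r
# Toy edge matrix in ape's layout: column 1 = parent node, column 2 = child.
# Node 5 is the root with two children (6 and 4); node 6 has three children.
edge <- matrix(c(5, 6,
                 6, 1,
                 6, 2,
                 6, 3,
                 5, 4),
               ncol = 2, byrow = TRUE)

# Children per parent node, same idea as the table() diagnostic above.
children <- table(edge[, 1])
print(children)

# Same expression as the line quoted from distance-methods.r; the warning is
# captured as a string here instead of being printed.
warn <- tryCatch(
  matrix(edge[order(edge[, 1]), ][, 2], byrow = TRUE, ncol = 2),
  warning = function(w) conditionMessage(w)
)
print(warn)
```

With five child entries and two columns, `matrix()` cannot fill the last row and recycles a value, which is why the resulting distances deserve suspicion.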

@mstagliamonte

Hi, @wallacelab ,

Thank you for your input. If you want to attach a file, just drag and drop it into the message box.

Best,
Max

@tiffmnelson

tiffmnelson commented Aug 22, 2019 via email

@wallacelab

wallacelab commented Aug 29, 2019

Here's the example (trimmed-down version of my whole dataset):
example.newick.txt

Code:

> library(phyloseq)
> tree = read_tree('example.newick.txt')
> str(tree)
List of 5
 $ edge       : int [1:1795, 1:2] 902 903 904 905 906 907 908 908 907 909 ...
 $ edge.length: num [1:1795] 3.46e-02 1.17e-02 1.94e-02 1.89e-02 1.20e-08 ...
 $ Nnode      : int 895
 $ node.label : chr [1:895] "0.963" "0.806" "0.870" "0.714" ...
 $ tip.label  : chr [1:901] "EU202849.1.1443" "KP398529.1.1417" "LN572572.1.1372" "KR559961.1.1464" ...
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"
> node.desc <- matrix(tree$edge[order(tree$edge[,1]),][,2],byrow=TRUE,ncol=2) # copied from distance-methods.r
Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE,  :
  data length [1795] is not a sub-multiple or multiple of the number of rows [898]

Again, I'm not a phylogenetics expert, so I don't know if it's internal nodes or tips or whatever that is the issue, but it seems likely to mess up the resulting calculation.

@PandengWang

PandengWang commented Dec 26, 2019

Hi all,
I have encountered the same issue.
@wallacelab is right!
When your tree is not a dichotomous (strictly bifurcating) tree, you will have this issue.
A dichotomous tree is one in which every internal node has exactly two children.
One way to solve this problem is to transform all multichotomies into a series of dichotomies with one (or several) branch(es) of length zero:
new_tre <- ape::multi2di(tre)
After I did this, the issue was gone and I got the same UniFrac values as those calculated by the function picante::unifrac.
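
To see what that call changes, here is a small sketch on a made-up four-tip tree (the Newick string and object names are assumptions for illustration, not data from this thread). It only shows that resolving the polytomy makes the edge count even; it says nothing about whether zero-length branches suit a particular analysis.

```r
library(ape)

# Hypothetical toy tree: the inner node has three children (A, B, C).
tre <- read.tree(text = "((A:1,B:1,C:1):1,D:1);")

is.binary(tre)    # FALSE: one node is a polytomy
nrow(tre$edge)    # 5 edges, an odd number

# Resolve polytomies by inserting zero-length branches.
# random = FALSE resolves them in order of appearance, so the result is deterministic.
new_tre <- multi2di(tre, random = FALSE)

is.binary(new_tre)         # TRUE
nrow(new_tre$edge) %% 2    # 0: the edge matrix now fills two columns evenly
```

For a phyloseq object the same idea would presumably be applied to the tree slot, for example `phy_tree(ps) <- ape::multi2di(phy_tree(ps))`, where `ps` is a hypothetical object name.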

@andrewjmc

Wow, am I glad someone had already gotten to the bottom of this. Note that for me though, the problem came down to having singleton lineages, and I had to add collapse.singles=TRUE to my drop.tip command as multi2di can't fix non-branching lineages.
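
The singleton case can be sketched the same way. The toy tree and object names below are assumptions for illustration; the sketch relies on `drop.tip()` accepting `collapse.singles`, as mentioned above, and on ape's `has.singles()` and `collapse.singles()`.

```r
library(ape)

# Hypothetical toy tree: fully bifurcating, four tips.
tre <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")

# Dropping D without collapsing leaves the parent of C with a single child.
pruned_raw <- drop.tip(tre, "D", collapse.singles = FALSE)
has.singles(pruned_raw)    # tests for non-branching internal nodes
nrow(pruned_raw$edge)      # edge count of the uncollapsed tree

# Either collapse afterwards or let drop.tip() do it directly.
pruned_fixed  <- collapse.singles(pruned_raw)
pruned_direct <- drop.tip(tre, "D", collapse.singles = TRUE)
nrow(pruned_fixed$edge)
```

Checking `has.singles()` and `is.binary()` on a tree before computing UniFrac is a cheap way to tell which of the two fixes is needed.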

@jonathanth

For me, the solution from @PandengWang with ape::multi2di(tree) seems to work! My tree was already rooted, but re-rooting with ape::root(tree, newroot), as @MSMortensen suggested, also worked. Maybe they consolidate the tree structure in the same way?

@wasade

wasade commented May 22, 2024

I just became aware of this issue.

If encountering problems with UniFrac, I advise checking for expected results with the reference implementation. It has been validated against the original test suite in PyCogent developed by Cathy Lozupone and Rob Knight who originally defined UniFrac, as well as validated against the test suite in scikit-bio. The reference implementation is BSD-licensed, developed and maintained by Rob Knight's lab, has been extensively optimized (ref.1; ref.2), and is actively maintained.
