Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import MG-RAST biom-file using biom and phyloseq #272

Closed
elinekl opened this issue Nov 29, 2013 · 5 comments
Closed

Import MG-RAST biom-file using biom and phyloseq #272

elinekl opened this issue Nov 29, 2013 · 5 comments
Assignees

Comments

@elinekl
Copy link

elinekl commented Nov 29, 2013

With MG-RAST I generate a biom file (QIIME data file and QIIME metadata file). I tried to load that into R to use with phyloseq but I got an error. Is there a command line for this?

@joey711
Copy link
Owner

joey711 commented Dec 5, 2013

Can you provide example dadta and code so I can reproduce this?

@elinekl elinekl closed this as completed Dec 6, 2013
@elinekl elinekl reopened this Dec 6, 2013
@elinekl
Copy link
Author

elinekl commented Dec 6, 2013

I send you an email with the file.
I used this command line
file = import_biom("KO.biom", parseFunction = parse_taxonomy_greengenes)
and this
file = import_biom("KO.biom")

@joey711
Copy link
Owner

joey711 commented Dec 13, 2013

I will take a look and let you know. Thanks for sending the file...

@ghost ghost assigned joey711 Dec 13, 2013
@joey711
Copy link
Owner

joey711 commented Dec 13, 2013

@elinekl

MG-RAST did not include a taxonomy tag with their observation metadata, and this is what the import_biom import function expects in order to import data into R. Most of the use of import_biom was expected to be on biom-files generated by QIIME. The form of the taxonomy entries can be arbitrarily complex, and users can provide custom parsing functions to process it as an argument to parseFunction. However, in this case there is no taxonomy tag at all, and so this causes an error.

For your specific case I was able to make a workaround (shown at the bottom) in a few lines of code using both the biom and phyloseq packages. The details of diagnosing the error are shown immediately below, as well as a separate double-entry error in your data that you should make sure to address as well (but has nothing to do with phyloseq or biom).

Diagnosing problem

library("phyloseq")
data = import_biom("~/Downloads/KO.biom", parseFunction = parse_taxonomy_greengenes)
Error in `colnames<-`(`*tmp*`, value = c("ta1", "ta0")) : 
  length of 'dimnames' [2] not equal to array extent
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Check what the warnings are

warnings()
Warning messages:
1: In .TM.repl.i.mat(as(x, "TsparseMatrix"), i = i, value = value) :
  duplicate ij-entries in 'Matrix[ ij ] <- value'; using last
2: In parse_taxonomy_default(char.vec) : Empty taxonomy vector encountered.
3: In parseFunction(i$metadata$taxonomy) : No greengenes prefixes were found. 
Consider using parse_taxonomy_default() instead if true for all OTUs. 
...
library("biom")
read_biom("~/Downloads/KO.biom")
biom object. 
type: Function table 
matrix_type: sparse 
5333 rows and 24 columns 

Let's explore further, attempt to extract the abundance table as phyloseq::import_biom would do.

x = read_biom("~/Downloads/KO.biom")
head(biom_data(x))
[1] 3 3 2 2 1 1
Warning message:
In .TM.repl.i.mat(as(x, "TsparseMatrix"), i = i, value = value) :
  duplicate ij-entries in 'Matrix[ ij ] <- value'; using last

It looks as though there is something wrong with the sparse indexing of the file. The file has duplicate entries for the same pair of indices. This was one of the >50 original warnings. This doesn't cause an error by itself, but it is troubling because some aspect of your data is being lost/ignored.

The source of the error is that this .biom object doesn't have the expected taxonomy tag in its observation_metadata, even though the data itself looks reasonably well organized and can easily be made into a table with 5333 rows and 4 columns. Here is the part of import_biom that is handling the taxonomy, and expecting a list element with a taxonomy name.

parseFunction = parse_taxonomy_greengenes
mgrast_error = lapply(x$rows, function(i){parseFunction(i$metadata$taxonomy)})
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered.
2: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered.
...
head(mgrast_error, 3)
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)
...

As you can see, they are all empty, and that is because there is no taxonomy tag. When phyloseq attempts to put this list of empty/missing taxonomy entries into a table, it causes an error and fails.

It's not absolutely essential that phyloseq see a taxonomy tag, but that is how it is setup right now, and it follows the specialized instance of the biom-format written by QIIME. The biom-format itself is very flexible and supports more complicated lists and entries that are not easily represented in a character matrix of taxonomic classifications, as phyloseq requires for its taxonomy_table class.

Solution: Manually coerce the two tables into phyloseq objects

You can use tools in the biom-package to convert the observation/feature metadata into a separate data.frame, then into a character matrix, and then into a taxonomyTable class. You can also import the abundance matrix as a special type of Matrix class, then coerce this into a numeric matrix, then an otu_table class. The two can then be combined into a phyloseq object. Note that the row indices need to matchup, which is why the rownames.force option had to be set to TRUE. The complete code looks like the following (with an ordination graphic thrown in at the end for fun).

library("phyloseq")
library("biom")
x = read_biom("~/Downloads/KO.biom")
otumat = as(biom_data(x), "matrix")
OTU = otu_table(otumat, taxa_are_rows=TRUE)
taxmat = as.matrix(observation_metadata(x), rownames.force=TRUE)
TAX = tax_table(taxmat)
physeq = phyloseq(OTU, TAX)
ggsave("issue-272.png", plot_ordination(physeq, ordinate(physeq)) + geom_point(size=6))

issue-272

@joey711 joey711 closed this as completed Dec 13, 2013
@ghanesh
Copy link

ghanesh commented Oct 10, 2014

Hi Joey!
Thanks for the above details on the error.
It would be great if you could help me with following problem to import my files from qiime in phyloseq!
Thank you in advance.
I have a similar problem, but it is only partly solved by this thread.
The files result from qiime pipeline, where we used the rdp classifier.
It was not possible to import the biom file directly. I believe it because of the way how the rdp classifier is used in qiime. Well, what worked so far was:

otufile = "otu_table_uclust.biom"
mapfile = "map_bac_all_corrected.txt"
trefile = "rep_set.tre"
read_biom(otufile)
biom object.
type: OTU table
matrix_type: sparse
1149 rows and 1504 columns
x = read_biom(otufile)
x = read_biom(otufile)
otumat = as(biom_data(x), "matrix")
OTU = otu_table(otumat, taxa_are_rows=TRUE)
taxmat = as.matrix(observation_metadata(x), rownames.force=TRUE, byrows=FALSE)

BUT here the problems start This command produces a List with one column only:
row.names V1
1 denovo0 Bacteria
2 denovo1 Bacteria
3 denovo2 c("Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rickettsiales", "SAR11", "Pelagibacter")
4 denovo3 c("Bacteria", "Bacteroidetes", "Flavobacteria", "Flavobacteriales", "Flavobacteriaceae")
5 denovo4 c("Bacteria", "Proteobacteria")

so therefore the command

TAX = tax_table(taxmat)
produces following:

Error in validObject(.Object) : invalid class “taxonomyTable” object:
Non-character matrix provided as Taxonomy Table.

Taxonomy is expected to be characters.

I believe this might be the reason, why I got stuck with using:
MyExp <- import_qiime(otufile, mapfile, trefile)
in the first place:

Error in fread(input = paste0(x, collapse = "\n"), sep = "\t", header = TRUE, :
'skip' must be a length 1 vector of type numeric or integer >=-1, or single character search string
In addition: Warning messages:
1: In readLines(file) :
incomplete final line found on 'otu_table_uclust.biom'
2: In max(which(substr(x[1:25L], 1, 1) == "#")) :
no non-missing arguments to max; returning -Inf

3: running command 'C:\Windows\system32\cmd.exe /c ({"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","type": "OTU table","generated_by": "QIIME 1.7.0-dev","date": "2014-10-09T16:21:41.267897","matrix_type": "sparse","matrix_element_type": "int","shape": [1149, 1504],"data":

Any ideas?
Cheers!
Alexander

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants