Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importing BIOM file #443

Closed
alejorojas2 opened this issue Mar 12, 2015 · 40 comments
Closed

importing BIOM file #443

alejorojas2 opened this issue Mar 12, 2015 · 40 comments
Labels

Comments

@alejorojas2
Copy link

Hi Joey711,

I have been trying to import a biom file that I generated with qiime, the file is binary file, it seems that is new format for this kind of files, then when trying to work with it in R I get this error:

This is my code so far:

#Importing with BIOM function
biom_file <- "./COI_seed_otu_table.biom"
tree_file <- "./COI_fasttree_usearch.tre"

COI_data <- import_biom(biom_file)

COI_data <- import_biom(biom_file)
Error in nchar(content) : invalid multibyte string 1

#importing with qiime function
COI_data <- import_qiime(biom_file)

COI_data <- import_qiime(BIOMfilename = biom_file)
Error in arglist[sapply(arglist, is.component.class)] :
invalid subscript type 'list'

Maybe I am missing something or should I use an older version of BIOM file?

@joey711
Copy link
Owner

joey711 commented Mar 12, 2015

Binary file? That sounds like the hdf5-based "version 2" biom format. The original biom format was not binary, a JSON based format. The original biom format is what is currently supported in phyloseq.

The biom-format team told me about the update probably 1 year ago. You are the first to post to phyloseq issue tracker about it. I've been discussing with @nosson (Joe Paulson) about adding biom-v2 support in the biom package on CRAN.

The more this becomes an issue, the faster one of us will support it. I personally have not encountered a need for HDF5-based storage of microbiome count data.

If you have some way of storing your OTU table in a biom v1 format, legacy QIIME format, or even just a tab-delimited table or any format that can be read by R, then you will be able to use phyloseq.

Hope that helps. I'll leave this issue open for now until there are further (and more helpful) comments about workarounds and/or actual biom-v2 support.

Cheers

joey

@pljames
Copy link

pljames commented Mar 12, 2015

I had to convert my biom format using "convert biom" which is distributed with qiime, perhaps this is the issue? Maybe try:

biom convert -i otu_table.biom -o otu_table_json.biom --table-type="OTU table" --to-json

and see if that works?

All the best,

Phill

@anmwinter
Copy link

In QIIME 1.9 the biom files are all in the new format and are not read in correctly into phyloseq. The only work around I have going at the moment is an install of QIIME 1.8 to make the biom files.

@jnpaulson
Copy link
Contributor

Hi Ara,

If you'd like, can you go ahead and test this function:
https://gist.github.com/jnpaulson/c9cdcb8cf95e7daae468

This will import biom vs. 2.0 files. If you find it works well I'm sure
Joey will be able to roll it into the Biom package after a few unit tests
are added (which shouldn't take us too long).

  • Joseph Paulson

On Sun, Mar 15, 2015 at 9:55 PM, Ara notifications@github.com wrote:

In QIIME 1.9 the biom files are all in the new format and are not read in
correctly into phyloseq. The only work around I have going at the moment is
an install of QIIME 1.8 to make the biom files.


Reply to this email directly or view it on GitHub
#443 (comment).

@alejorojas2
Copy link
Author

Joey and Phil,

I did try the command that Phil suggested converting the BIOM file to json format, it worked, now I am able to import my data to phyloseq without any issues.
But I looked forward to the function to read hdf5 files, this can be helpful since BIOM files are moving into that direction.

Thanks!

Alejandro

@anmwinter
Copy link

@nosson Thanks! I will give that a try.

@jnpaulson
Copy link
Contributor

@alejorojas2 @bioinfonm If you guys are willing to be testers for the gist I posted above I know Joey will import it into biom package. I just need to write the documentation. Is there a desire/need to "write" hdf5 biom2 files?

@anmwinter
Copy link

@nosson Sounds good. Phyloseq is our main tech for processing and understanding our microbial data. So anything I can do to help out I am game for.

I'll give it a try here around lunch!

@anmwinter
Copy link

@nosson Does the biom file need to be a rich format? I dumped out a biom file with the default settings in QIIME 1.9

Aras-MacBook-Air-2:qiime_1_9_test arakooser$ make_otu_table.py -i seqs_otus.txt -t seqs_rep_set_tax_assignments.txt -o de_novo_sumaclust.biom

and the got this error in phyloseq using the biom 2.x function

> biom_file = "de_novo_sumaclust.biom"    ###This is your .biom file
> map_file = "LavaBedsCombinedMapping.txt"      ###This is your mapping file with all the metadata
> y = read_hdf5_biom(biom_file)
HDF5-DIAG: Error detected in HDF5 (1.8.7) thread 0:
  #000: H5F.c line 1522 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: H5F.c line 1313 in H5F_open(): unable to read superblock
    major: File accessability
    minor: Read failed
  #002: H5Fsuper.c line 334 in H5F_super_read(): unable to find file signature
    major: File accessability
    minor: Not an HDF5 file
  #003: H5Fsuper.c line 155 in H5F_locate_signature(): unable to find a valid file signature
    major: Low-level I/O
    minor: Unable to initialize object
HDF5: unable to open file
Error in h5checktypeOrOpenLoc(file, readonly = TRUE) : 
  Error in h5checktypeOrOpenLoc(). File 'de_novo_sumaclust.biom' is not a valid HDF5 file.
Called from: h5read(file_input, "/", read.attributes = TRUE)
Browse[1]> Q

@jnpaulson
Copy link
Contributor

@bioinfonm Mhmm, just for clarification - this is not a phyloseq error. We're using the R packages biom rhdf5 to import this file. Converting it into a biom-class object (as Joey defined in the biom package) then lets us easily convert that to an object that can be parsed by phyloseq or metagenomeSeq.

Do you think you could send me a MRE (minimally reproducible example?). The ones I have from biom-format.org worked, but having a simple real example like yours would help with debugging. I don't personally have Qiime wrappers installed so I can't create my own without lots of effort. If you can get something to me my email is jpaulson@umiacs.umd.edu.

@anmwinter
Copy link

@nosson Will do! I'll send the whole file over.

Ara

@alejorojas2
Copy link
Author

@nosson I am traveling right know but as soon as I can test it, I will let you know. About the need for hdf5, for me it was just by chance since it is the first time that used qiime for my analysis and I think it is the default behavior now.

@anmwinter
Copy link

So a similar problem cropped up on the qiime forum here:
biocore/qiime#1928

Looks like something goofy with the way QIIME is handling the hdf5 format.

The solution at the moment that works is:
biom convert -i de_novo_sumaclust.biom -o fixed.biom --table-type="OTU table" --to-hdf5

@alejorojas2
Copy link
Author

@nosson I did try your function and it worked for me, I had to install the library for BIOM and rhdf5 to manage to run your function. The biom 0.3.12 and rhdf5 2.10.0 were the versions that installed on my R.

The I ran the the function and run your command:

y = read_hdf5_biom("table_even7620.biom")
biom(y)

then I get this output, that actually makes sense based on my data:

biom object. 
type: OTU table 
matrix_type: dense 
714 rows and 36 columns 

So, If this is a biom object, the next step will be importing the file using import_biom(BIOMfilename = y).
update: I assign y1 <- biom(y) to save it as biom object, but I still get the same message:

> import_biom(BIOMfilename = y1)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘fromJSON’ for signature ‘"biom", "missing"’

So it seems that your functions creates a list instead of a biom file, or maybe I am missing something?

Thanks!

Alejandro

@jnpaulson
Copy link
Contributor

@alejorojas2 Awesome! Glad to hear it works for you - it seems like some of the default outputs by Qiime are not 'true' HDF5 files (@bioinfonm's was not).

Glad to hear it worked for you! This is not an issue as Phyloseq uses biom class objects natively in R. Essentially instead of calling import_biom you would run a truncated version of that function:
https://github.com/joey711/phyloseq/blob/master/R/IO-methods.R#L1777-L1849

  1. As an example, you can use this function to load the file into phyloseq (import_biom2(y1))
    https://gist.github.com/jnpaulson/324ac1fa3eab1bc7f845
  2. @joey711 can easily rewrite the import_biom function to
import_biom <- function(BIOMfilename, 
    treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL,
    parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...){

    # initialize the argument-list for phyloseq. Start empty.
    argumentlist <- list()

    # Read the data
    if(class(x)=="biom"){ # or check if string first
        x = BIOMfilename
    } else {
        x = read_biom(biom_file=BIOMfilename)
    }
  1. A longer, worse way, would be to write the old biom file format write_biom(y1,"file/path/filename.biom") then import it using import_biom("/file/path/filename.biom").
  2. Use metagenomeSeq's biom2MRexperiment then use @joey711's MRexperiment2phyloseq function :)

@LBragg
Copy link

LBragg commented May 22, 2015

Hi there,

I am following this thread with interest as I am also using the latest QIIME. The read_hdf5_biom function provided worked, but I was getting the same error as CarlyRae did (see comments here https://gist.github.com/nosson/324ac1fa3eab1bc7f845) for the import_biom2 function. The error suggests a list is not the expected input, as it is expecting the biom-class (subclass of List) as input, I modified the code to convert the list to a biom object... and it seems to work. But I just wanted to check with you guys if this was the Right Thing to Do :)

import_biom2 <- function(x, 
                     treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL,
                     parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...){

  # initialize the argument-list for phyloseq. Start empty.
  argumentlist <- list()

 x = biom(x) # added this line

EDIT: I just realised I haven't got the latest bioconductor installed. I'll update everything and get back to you if there is still a problem.

@LBragg
Copy link

LBragg commented May 23, 2015

I upgraded to R 3.2 (Bioconductor version 3.1 (BiocInstaller 1.18.2)), installed the phyloseq library (‘1.13.2’), and still needed to add that line (x = biom(x)) to the import_biom2 function to get it to work. But I think there are still issues, as the taxonomy parsing isn't working correctly (I've tried default and the parse_taxonomy_qiime options too). I thought this all might work if I could install the dev version of phyloseq, but I can't seem to get this to work.

> library("devtools")
> install_github("phyloseq", "joey711")
Downloading github repo joey711/phyloseq@master
Installing phyloseq
Skipping 4 packages not available: BiocGenerics, Biostrings, DESeq2, multtest
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  INSTALL '/tmp/Rtmpm7oa1a/devtools2137cf4daa7/joey711-phyloseq-dd34ad0'  \
  --library='/home/bender/R/x86_64-pc-linux-gnu-library/3.2' --install-tests 
* installing *source* package ‘phyloseq’ ...
** R
** data
** inst
** tests
** preparing package for lazy loading
Note: the specification for S3 class “AsIs” in package ‘RJSONIO’ seems equivalent to one from package ‘BiocGenerics’: not turning on duplicate class definitions for this class.
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Note: the specification for S3 class “AsIs” in package ‘RJSONIO’ seems equivalent to one from package ‘BiocGenerics’: not turning on duplicate class definitions for this class.
* DONE (phyloseq)
Reloading installed phyloseq
Warning message:
Username parameter is deprecated. Please use joey711/phyloseq 
> library(phyloseq)
> packageVersion("phyloseq")
[1] ‘1.13.2’

@jnpaulson
Copy link
Contributor

Hi @LBragg,

Thanks for the research so far into the problem and apologies for the delay!

If you could email me a MRE (minimally reproducible example) that you're working with, that'll really help me debug and fix up the code so that we can get biom-class 2.1 objects imported into phyloseq and metagenomeSeq and functions available in @joey711's biom package.

@jnpaulson
Copy link
Contributor

@LBragg Thanks for the MRE - here's a link to the now updated import_biom2.R script https://gist.github.com/nosson/324ac1fa3eab1bc7f845 that should fix everything (sorry for the delay, been in Paris for the last week).

@jnpaulson
Copy link
Contributor

@joey711 If you're happy with this, let me know and I can make import_biom flexible and modify import_biom + read_biom and do a pull request when I get stateside in the next two weeks.

@joey711
Copy link
Owner

joey711 commented Jun 3, 2015

Hey guys, all LGTM!
Make sure to think of some unit tests for some of the special cases.
I'm happy to push out any/all improvements to get this functionality out there.
Feel free to ping me repeatedly if there's a pending PR I haven't noticed.

@LBragg
Copy link

LBragg commented Jun 4, 2015

The fixes work for me, the biom produced by qiime 1.9 can be imported. As an aside, I noticed when I passed a parameter (I was trying to pass the mapping file) to import_biom2() that wasn't expected, it gets passed to the read_tree function, causing the tree to not be read... It's a trivial thing, but thought I should mention it while you're updating the function =)

Warning message:
In import_biom2(hd5_biom, mapfilename = mapping_file, treefilename = treefilename) :
  treefilename failed import. It not included.

@joey711
Copy link
Owner

joey711 commented Jun 4, 2015

@LBragg

The warning you showed seems like expected/desired behavior if the treefilename argument is a missing/incorrect file. Was that the case? or were you expecting that tree-file to work?

Btw everyone, @nosson et al, I was imagining we add a new argument to import_biom regarding the biom format version (it might already have one, just currently ignored), and/or we autodetect it. Seems straightforward enough we shouldn't need a completely different function, and might help avoid mistakes (especially if we can do the autodetect).

Thoughts?

joey

@LBragg
Copy link

LBragg commented Jun 4, 2015

Sounds good to me. As for that issue I was describing, the treefilename argument is the correct tree file, however the tree fails to load if I supply additional, unexpected arguments to the import_biom2() function. I guess they are passed using the ... argument list. Maybe it's more that the read_tree function silently returns NULL if any errors occur, such as the passing of incorrect arguments. No biggie.

@jnpaulson
Copy link
Contributor

Boom! I just made a large number of changes to the BIOM package that should be ready for pulling in and will seamlessly fix the import_biom issue.

Check out: joey711/biom#11

@dadahan
Copy link

dadahan commented Jul 16, 2015

Hi all,

Sorry to be picking up on an earlier part of this thread so late in the game but I am just now running into my issue.

I am trying to import my biom file, converted from hdf5 to json using Phil's earlier suggestion with the biom convert in QIIME, and receiving the following error:

jsonbiom = system.file("extdata", "otu_table_json.biom", package="phyloseq")
treefile = system.file("extdata","rep_set.tre", package="phyloseq")
refseq = system.file("extdata","rep_set.fna",package="phyloseq")
import_biom(jsonbiom,treefile,refseq, parseFunction = parse_taxonomy_qiime)
Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input

I've tried using a txt file as well, and it just wont seem to work.

Any help is appreciated,
Dylan

@joey711
Copy link
Owner

joey711 commented Jul 16, 2015

that's not a valid file path. the system.file commands are shown in tutorials because they are OS-independent, and the data is within the package itself.

Your file path for your data should just be a standard relative or absolute Unix path string.

jsonbiomfile = "path/to/my/data/file.biom"
import_biom(jsonbiomfile)

@joey711
Copy link
Owner

joey711 commented Jul 16, 2015

By the way, the early version of the replacement that will solve this HDF5 / Biom-V2 issue is now a separate repository. Plan is to post it on Bioconductor:

https://github.com/joey711/biomformat

The release version of phyloseq package cannot officially support it as an internal dependency until it is released on BioC. The devel version can do so, as soon as the first version of "biomformat" is on BioC.

I will close this issue once the solutions are implemented enough in phyloseq and its doc.

@dadahan
Copy link

dadahan commented Jul 18, 2015

Hi Joey,

My apologies for not catching this was a tutorial specific command. Your suggestion worked wonderfully. Thanks for the HDF5 info as well.

D

@mafed
Copy link

mafed commented Nov 6, 2015

Hi all,
about importing a biome file in R: when read_biom reads a biom/HDF5, abundance data are stored in R as numerics instead of integers, that seems a huge waste of memory IMHO.
With American Gut Project's I went from 9.2GB to 480MB just by casting to integers.

@jnpaulson
Copy link
Contributor

@mafed interesting - can you perhaps post this in the biomformat package: https://github.com/joey711/biomformat/issues and i'll go ahead and fix this

@mafed
Copy link

mafed commented Nov 6, 2015

@jnpaulson sure, I'll do it immediately, thanks

@gputzel
Copy link

gputzel commented Oct 12, 2016

The documentation for the import_biom function has the following description:

New versions of QIIME produce a more-comprehensive and formally-defined
JSON file format, called biom file format

This is not correct, is it? Shouldn't it say "Old versions" instead of "New versions"?

@jmicrobe
Copy link

Has the hdf5 biom issue been fixed? I still get the error "Error in nchar(content) : invalid multibyte string 1" when trying to import a biom generated with QIIME 1.9.1. The documentation here seems to imply that after April this problem was fixed. I've checked that my version of phyloseq is up to date.

@joey711
Copy link
Owner

joey711 commented Nov 29, 2016

This issue should be fixed for some time now. The dependency in phyloseq was migrated in previous release version to point to "biomformat" on Bioconductor, which supports both JSON and HDF5 versions of the format, rather than "biom" on CRAN, which should be considered deprecated. All functionality from "biom" package on CRAN is now subsumed within "biomformat" on Bioconductor.

I don't know what QIIME did or did not do to fix the issue, but AFAIK biomformat is reading these formats just fine.

I will close for now, but feel free @jmicrobe to post more details of your problem if it turns out it is not solved by biomformat/phyloseq.

@jnpaulson feel free to re-open if you think there's something still TBD.

@joey711 joey711 closed this as completed Nov 29, 2016
@CarlyMuletzWolz
Copy link

CarlyMuletzWolz commented Jan 26, 2018

Ok, I solved my own issue. But putting comments here in case someone does this too!

Reproducible result:

data("GlobalPatterns")

MGS <- phyloseq_to_metagenomeSeq(GlobalPatterns)

MGS

p <- cumNormStatFast(MGS)
MGS <- cumNorm(MGS, p =p)

b <- MRexperiment2biom(MGS, norm = T)

write_biom(b, biom_file = "test.biom")

Error message

biom1 <- import_biom2("test.biom")

You have to read in the biom file first for import biom to work, needs an object not a file

biom2 <-read_biom("test.biom")

biom3 <- import_biom2(biom2)

@vanesotjim
Copy link

Hi everyone!

I have my otu_table.biom constructed by Qiime 1.9 and my metadata file and I tried to import in Phyloseq and until now I have not couldn't; I changed my otu_table.biom to json with this command

biom convert -i otu_table.biom -o otu_table_json.biom --table-type="OTU table" --to-json

However, with this file, I have not couldn't, please if you have any advice, I'm new in bioinformatics :)

Thanks in advance

@vanesotjim
Copy link

Sorry I forgot to wrote script that I used, it was:

jsonbiomfile = "C/Users/metagenomic/Desktop/otu_table_json.biom"
jsonbiomfile = "C/Users/metagenomic/Desktop/metadata_ITS.txt"

biom_file <- paste(jsonbiomfile, "otu_table_json.biom", sep = "")
map_file <- paste(jsonbiomfile, "metadata_ITS.txt", sep = "")

read_biom(biom_file)

Is there something wrong??

@CarlyMuletzWolz
Copy link

CarlyMuletzWolz commented Jun 28, 2018

This is what works for me from QIIME 1.9 .json files

BIOM FILE

biom <- import_biom("milk_final.biom", parseFunction = parse_taxonomy_greengenes)
warnings()

rank_names(biom)

get error message, but looks fine, adds a rank1 at the end for whatever reason. You can turn off the parse_taxonomy_greengenes maybe...

MAPPING FILE

map <- import_qiime_sample_data("map_milk_final_alpha.txt")

TREE FILE

the tree must be rooted. You can mid-point root it if you want see: http://www.phytools.org/static.help/midpoint.root.html

My tree was rooted with an outgroup in other R code

tree = read.tree("milk_final_rooted.tre")

GM <- merge_phyloseq(biom, tree, map)

When you merge all these files, the OTUs that are in the biom file, but not in the tree file are removed, but these OTUs are still 'linked' to the phyloseq object. It is best to filter them in case they cause some issue. What happens is when you do the alignment in QIIME some of the OTUs don't align (they are often classified as bacteria) and thus don't show up in the tree. It is a good filtering strategy.

GM_data = filter_taxa(GM, function(x) sum(x) != 0, TRUE)

sum(taxa_sums(GM_data) == 0)

ntaxa(GM_data)
nsamples(GM_data)

sort(sample_sums(GM_data))

difference in sequencing depth

max(sample_sums(GM_data))/min(sample_sums(GM_data))

Hope that works for you

@vanesotjim
Copy link

Thank you Carly, for your answer!!!

I tested this

####biom.json
jsonbiomfile = "C:/Users/metagenomic/Desktop/otu_table_json.biom"
read_biom(jsonbiomfile)
x = read_biom(jsonbiomfile)
x2 <- import_biom(x)
head(otu_table(x2))
head(tax_table(x2))

####Metadata import
metadata = "C:/Users/metagenomic/Desktop/metadata_ITS.txt"
read.csv(metadata)
t <- import_qiime_sample_data(metadata)

#####merge two phyloseq objects
x2_merged <- merge_phyloseq(x2, t)
colnames(tax_table(x2_merged)) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
sample_data(x2_merged)

And it worked!!!

Thnaks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests