cutadapt installed via conda igzip error for some fastq files #513

Closed
zxl124 opened this issue Feb 17, 2021 · 18 comments

Comments

@zxl124

zxl124 commented Feb 17, 2021

Only very recently (~2 weeks ago), cutadapt installed via conda has the following error:

This is cutadapt 3.2 with Python 3.8.6
Command line parameters: -j 4 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz
Processing reads on 4 cores in single-end mode ...
[--------->8 ] 00:00:26     5,536,084 reads  @      4.9 µs/read;  12.33 M reads/minuteERROR: Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 560, in run
    self.send_to_worker(chunk_index, chunk)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 141, in __exit__
    self.close()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 327, in close
    self._raise_if_error(allow_sigterm=allow_sigterm)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 350, in _raise_if_error
    raise OSError("{} (exit code {})".format(message, retcode))
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 626, in run
    raise e
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

Traceback (most recent call last):
  File "/opt/conda/envs/test/bin/cutadapt", line 10, in <module>
    sys.exit(main_cli())
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/__main__.py", line 845, in main_cli
    main(sys.argv[1:])
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/__main__.py", line 912, in main
    stats = r.run()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 824, in run
    raise e
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

This only happens with some fastq files, only in multi-thread mode (with -j specified), and only with conda-installed cutadapt. I've tried versions 3.2 and 3.1. I've also tried rolling back some dependencies, including pigz=2.3.4, xopen=1.0.1, and dnaio=0.4.4; none of these helps.
I understand this is probably a bioconda problem rather than a cutadapt problem. It has also been reported in the bioconda recipes repo, so far without a solution. I am hoping the cutadapt developers can point me to a few places to look, since you understand the error message better. Thank you.

@marcelm
Owner

marcelm commented Feb 18, 2021

This should actually be an issue on the https://github.com/pycompression/xopen repository, but I seem to be unable to transfer the issue there, so let’s keep the discussion here.

@rhpvorderman You may also want to have a look.

@zxl124, @andreott Since I haven’t been able to reproduce this so far, would one of you be able to supply me with a fastq.gz file that causes the error?

The error comes from the igzip program, which apparently isn’t able to properly decompress the input gzip file. The message indicates that the input files are multi-block gzips.

A multi-block/concatenated gzip is created by concatenating multiple .gz files. Decompressing it should give you the same result as if you had concatenated the uncompressed contents of these files. So if you want to concatenate two files file1 and file2, but they are actually compressed as file1.gz and file2.gz, you could do this:

gunzip file1.gz
gunzip file2.gz
cat file1 file2 > both
gzip both

which would give you both.gz. However, the gzip file format allows this shortcut:

cat file1.gz file2.gz > both.gz

This avoids having to recompress everything, but some programs reading .gz files cannot correctly deal with these files.
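The equivalence described above can be checked with a small Python sketch. It uses only the standard-library gzip module; the FASTQ-like payloads and the helper name concat_members are made up for illustration and are not taken from xopen or Cutadapt.

```python
import gzip


def concat_members(payloads):
    """Compress each payload separately and join the results.

    This mirrors `cat file1.gz file2.gz > both.gz`: the output is a
    sequence of independent gzip members, one per payload.
    """
    return b"".join(gzip.compress(p) for p in payloads)


# Two made-up FASTQ-like records standing in for file1 and file2.
file1 = b"@read1\nACGT\n+\nIIII\n"
file2 = b"@read2\nTTGA\n+\nIIII\n"

both_gz = concat_members([file1, file2])

# A reader that supports multi-member gzip data returns the
# concatenation of the uncompressed contents.
restored = gzip.decompress(both_gz)
```

Python's gzip.decompress is documented to handle multi-member data, so restored equals file1 + file2 here. A reader without that support would typically stop after the first member or report an error.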

igzip is a very fast gzip compressor and decompressor, so to get some speedups, we started to use it in xopen 1.0, but only when it happened to be installed on the system. In Conda, we then added a dependency on isa-l at some point, which contains the igzip binary. So from then on, igzip would always be used if you install from Conda.
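The "use it only if it happens to be installed" behaviour can be pictured with the following sketch. This is not xopen's actual code: the function name pick_decompressor and the preference order are assumptions chosen for illustration.

```python
import shutil


def pick_decompressor(candidates=("igzip", "pigz", "gzip")):
    """Return the first candidate program found on PATH, or None.

    Hypothetical helper: the candidate list and its order are
    assumptions, not the selection logic xopen really uses.
    """
    for program in candidates:
        if shutil.which(program) is not None:
            return program
    return None


chosen = pick_decompressor()
```

With a selection like this, adding isa-l to an environment is enough to change which program reads the input, even though no Python dependency changed. That would explain why rolling back xopen or pigz alone made no difference.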

Support for reading concatenated .gz files was added to igzip only recently, and we actually waited for this to be added before having it as a dependency. But perhaps it does not work in all circumstances.

@marcelm
Owner

marcelm commented Feb 18, 2021

As a workaround, I believe you can either

  • downgrade xopen to 1.0, but also remove the isa-l package from the environment to get rid of igzip. The problem might still occur if you happen to have igzip installed through some other means.
  • downgrade xopen to 0.9

@zxl124
Author

zxl124 commented Feb 18, 2021

Thanks so much for pointing out the package culpable for this problem. I compared my environments with working and problematic cutadapt, and found slight differences in the build of isa-l.
Working cutadapt: isa-l=2.30.0=h36c2ea0_0
Problem cutadapt: isa-l=2.30.0=h7f98852_1
Once I rolled back isa-l to the previous build, cutadapt processed my fastqs correctly.
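For anyone doing the same comparison between two environments, here is a rough Python sketch that finds packages whose version is identical but whose build string differs. It assumes each package is given in name=version=build form, as in the two isa-l lines above; the helper names are made up.

```python
from typing import NamedTuple


class CondaSpec(NamedTuple):
    name: str
    version: str
    build: str


def parse_spec(spec):
    """Split a 'name=version=build' string into its three parts."""
    name, version, build = spec.strip().split("=")
    return CondaSpec(name, version, build)


def differing_builds(env_a, env_b):
    """Return {name: (build_a, build_b)} for same-version packages
    whose build strings differ between the two environments."""
    a = {s.name: s for s in map(parse_spec, env_a)}
    b = {s.name: s for s in map(parse_spec, env_b)}
    return {
        name: (a[name].build, b[name].build)
        for name in a.keys() & b.keys()
        if a[name].version == b[name].version
        and a[name].build != b[name].build
    }


working = ["isa-l=2.30.0=h36c2ea0_0"]
problem = ["isa-l=2.30.0=h7f98852_1"]
diff = differing_builds(working, problem)
```

Here diff maps isa-l to the pair of build strings, which is the difference that a version-only comparison would miss.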

@marcelm
Owner

marcelm commented Feb 18, 2021

Great, that is very helpful. Is it possible for you to share one of the files that failed with me? It is not a problem if not. I understand this might not be possible for sensitive or secret data.

@zxl124
Author

zxl124 commented Feb 18, 2021

I've sent you an email with a link to the fastq file. Thank you.

@rhpvorderman
Collaborator

rhpvorderman commented Feb 19, 2021

Interesting. Thanks, @zxl124. Do you mind if @marcelm passes the fastq link on to me as well? EDIT: Or if you send me the link, that would be great too.
Is there private/patient information in this fastq? Do you mind if I share it with the isa-l developers once I track down the issue?

Version 2.29 of isa-l can't handle concatenated gzips properly. This bug was found thanks to the xopen test suite. It was fixed in 2.30.0. The isa-l test suite now also has a test for it. Strange that the new build causes errors. I will take a look at it.

@rhpvorderman
Collaborator

Okay some extra information as I am also the isa-l-feedstock maintainer on conda:

  • The new build was triggered when I allowed isa-l to be built for macOS.
  • This build required a new version of nasm. I updated the nasm-feedstock to accomplish this. The version of nasm has gone from 2.13 -> 2.14.
  • Conda used a newer version of GCC for this build: 7.5 -> 9.3.

Some stuff I did to try to reproduce the issue:

  • Create a random gzip file: cat /dev/urandom | head -n 10000 | base64 | gzip -1 > random.gz. Concatenate it 10 times: for i in {1..10}; do cat random.gz >> random10.gz; done
  • igzip -cd random10.gz did not give any issues. This is with the latest conda build.
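The same kind of synthetic test file can be built in Python, as sketched below. The function names and sizes are arbitrary choices; only the `igzip -cd` invocation comes from this thread. The igzip step is skipped when the program is not on PATH.

```python
import base64
import gzip
import os
import shutil
import subprocess
import tempfile


def make_concatenated_gzip(copies=10, raw_bytes=100_000):
    """Return (expected_text, gz_data) for one random member repeated.

    Mirrors the shell recipe: base64-encode random bytes, gzip at
    level 1, then concatenate the same member `copies` times.
    """
    text = base64.encodebytes(os.urandom(raw_bytes))
    member = gzip.compress(text, compresslevel=1)
    return text * copies, member * copies


def igzip_decompress(gz_data):
    """Decompress with `igzip -cd`; return None if igzip is missing."""
    if shutil.which("igzip") is None:
        return None
    with tempfile.NamedTemporaryFile(suffix=".gz") as tmp:
        tmp.write(gz_data)
        tmp.flush()
        result = subprocess.run(
            ["igzip", "-cd", tmp.name], capture_output=True, check=True
        )
    return result.stdout


expected, random10 = make_concatenated_gzip()
# Reference result from Python's own multi-member-aware reader.
reference = gzip.decompress(random10)
```

Comparing igzip_decompress(random10) against reference shows whether a given igzip build handles this particular file. As noted above, a file like this did not trigger the error, so a passing result does not rule out the bug.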

It must be something special about the gz file, but I cannot debug further without it. I wonder if it has something to do with NULL bytes, which are allowed between two concatenated gz files. It could also have something to do with the header. It would be nice to be able to reproduce the issue.

@marcelm
Owner

marcelm commented Feb 19, 2021

I have received the file that @zxl124 sent me and was strangely enough not able to reproduce the problem. I used the exact same build of the Conda package (isa-l=2.30.0=h7f98852_1). I tried to use igzip directly, but also with the given Cutadapt command.

I have also tried to reproduce the issue by decompressing all .gz files I could find on my system, but none of them failed with the "extra concatenatedgzip" message. I don’t have time for more today.

@zxl124 Can you try to decompress the file with igzip -cd filename.gz > /dev/null and see whether you get the error message? It would be helpful to know that this is sufficient to trigger the problem.

@zxl124
Author

zxl124 commented Feb 19, 2021

This is interesting. I was indeed unable to reproduce this on linux (Ubuntu 18.04.3). I forgot to mention when I ran into this problem I was doing everything with docker, using nfcore/base as the base image. I didn't think it was important, but I was wrong. You learn new things every day. Here is how to reproduce it.

ubuntu@ip-10-1-11-63:~/temp$ docker run --rm -it -v /home/ubuntu/temp:/data nfcore/base:1.9
(base) root@3b29b01e195b:/# cd data
(base) root@3b29b01e195b:/data# conda config --add channels conda-forge
(base) root@3b29b01e195b:/data# conda config --add channels bioconda
(base) root@3b29b01e195b:/data# conda create -n test cutadapt isa-l=2.30.0=h7f98852_1
(base) root@3b29b01e195b:/data# conda activate test
(test) root@3b29b01e195b:/data# cutadapt -j 4 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz > temp.fq

Error message:

This is cutadapt 3.2 with Python 3.8.6
Command line parameters: -j 4 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz
Processing reads on 4 cores in single-end mode ...
[--------->8 ] 00:00:26     5,590,344 reads  @      4.8 µs/read;  12.63 M reads/minuteERROR: Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 560, in run
    self.send_to_worker(chunk_index, chunk)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 141, in __exit__
    self.close()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 327, in close
    self._raise_if_error(allow_sigterm=allow_sigterm)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/xopen/__init__.py", line 350, in _raise_if_error
    raise OSError("{} (exit code {})".format(message, retcode))
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 626, in run
    raise e
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

Traceback (most recent call last):
  File "/opt/conda/envs/test/bin/cutadapt", line 10, in <module>
    sys.exit(main_cli())
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/__main__.py", line 845, in main_cli
    main(sys.argv[1:])
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/__main__.py", line 912, in main
    stats = r.run()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/cutadapt/pipeline.py", line 824, in run
    raise e
OSError: igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz (exit code 1)

The result of using the igzip command directly:

igzip: Error while decompressing extra concatenatedgzip files on in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz

I also tried a newer version, nfcore/base:1.12, and the results are the same. Maybe something is missing in the Linux version in the nfcore/base image? Should I mention this to the nfcore folks?

@rhpvorderman
Collaborator

rhpvorderman commented Feb 19, 2021

@zxl124 Thanks for mentioning this.

I bet the container is using Alpine as a base container. (Very common in bioinformatics workflow stuff). But Alpine uses musl libc (https://musl.libc.org/) instead of proper GNU libc.
This is probably what causes the error. If so you should be able to reproduce it in any alpine linux container.

So basically it is just an almost pure debian buster container. Weird. I also run debian buster as my OS. Maybe it is debian related? Can you send the file to me?

@zxl124
Author

zxl124 commented Feb 19, 2021

@rhpvorderman Just sent you an email with the link to the fastq

@rhpvorderman
Collaborator

rhpvorderman commented Feb 19, 2021

I can reproduce the issue on Debian buster. Not container related.

@rhpvorderman
Collaborator

This probably has to do with the compiler move (as announced at https://conda-forge.org/docs/user/announcements.html#announcements). It is a bit strange that it does not work on Debian, though. I will check this out.

@marcelm
Owner

marcelm commented Feb 19, 2021

I could also reproduce on Debian Buster.

I noticed that the file is actually bgzipped:

file in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz
in2438_3_CKDL210000739-2a-AK5142-AK6697_HVHF2DSXY_L2_1.fq.gz: Blocked GNU Zip Format (BGZF; gzip compatible), block length 11364
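A BGZF file is a series of small gzip members, each carrying a "BC" extra subfield in its header, so reading it always goes through the concatenated-gzip code path. The sketch below checks the first member's header for that subfield, following the BGZF layout in the SAM/BAM specification. The function name is made up, and this is only a rough stand-in for what `file` reports.

```python
import struct

# The standard 28-byte empty BGZF block that marks end-of-file.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def bgzf_block_size(data):
    """Return the size of the first BGZF block, or None if `data`
    does not start with a BGZF-style gzip header."""
    # Need magic bytes, deflate method, and the FEXTRA flag (0x04).
    if len(data) < 12 or data[:3] != b"\x1f\x8b\x08":
        return None
    if not data[3] & 0x04:
        return None
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: SI1, SI2, SLEN, then SLEN bytes.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2) and pos + 6 <= len(extra):
            (bsize,) = struct.unpack_from("<H", extra, pos + 4)
            return bsize + 1  # BSIZE stores total block size minus 1
        pos += 4 + slen
    return None


eof_size = bgzf_block_size(BGZF_EOF)
```

Passing the first few dozen bytes of a .fq.gz file to bgzf_block_size gives a quick way to tell BGZF input from an ordinary single-member gzip, for which it returns None.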

In case it helps, I was able to create a much smaller reproducer (33 MiB) that I can share publicly (some SRA data I had floating around):
https://stockholmuniversity.box.com/s/2d6237g1lzzuw34i2bza3y0pirtrhs2h

@rhpvorderman
Collaborator

rhpvorderman commented Feb 21, 2021

@marcelm Thanks. That's great. I only saw this just now. I have been working on a fix. There is now a new conda build of isa-l, made with the old GCC 7.5 compiler, that should fix the issue.
@zxl124 could you check?
The build number for linux is ha770c72_2

@zxl124
Author

zxl124 commented Feb 22, 2021

I can confirm that build ha770c72_2 works.

@rhpvorderman
Collaborator

@zxl124 Thanks for reporting back. Can this issue be closed now? Or do you have any other unexpected errors related to this issue?

@zxl124
Author

zxl124 commented Feb 23, 2021

Yes. Thank you so much for fixing this.


3 participants