getting UnicodeDecodeError when running faidx with a bed file input #217

Closed
yonniejon opened this issue Feb 15, 2024 · 8 comments


@yonniejon

Hi!

I am running faidx version 0.7.2.1

I am running it with a bed file input like so:
faidx hg19/genome.fa.gz -b tmp.bed.gz

where tmp.bed.gz looks like:
chr6 132891948 132892108
chr10 127585142 127585221

I get the following error:
Traceback (most recent call last):
  File "/cs/usr/jrosensk/.local/bin/faidx", line 8, in <module>
    sys.exit(main())
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 26, in write_sequence
    for region in regions_to_fetch:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I assume the problem is that my bed file has the "chr" prefix? It is a problem because my genome file has the chr prefix as well. Is there a way around this, or do I need to change the reference .fa file?

@yonniejon
Author

So the problem was not the chr prefix. I changed my bed file so it no longer contains the "chr" prefixes, removed the "chr" prefixes from my fasta reference file, and the problem persists.

@mdshw5
Owner

mdshw5 commented Feb 15, 2024

This means that you have a non-UTF-8 byte at the beginning of your file. Did you by chance export this from MS Excel as UTF-16? If so, you need to convert your file to UTF-8 encoding. You can also export from Excel directly in UTF-8 encoding.
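
For readers who do hit the UTF-16 case described above, here is a minimal Python sketch of the conversion. The function name `convert_utf16_to_utf8` and the file paths are hypothetical and not part of pyfaidx; the sketch assumes the file starts with a UTF-16 byte order mark, which is what spreadsheet exports typically write.

```python
import codecs


def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file (with BOM) as UTF-8.

    Hypothetical helper, not part of pyfaidx. Returns False and writes
    nothing if the file does not start with a UTF-16 byte order mark.
    """
    with open(src_path, "rb") as handle:
        start = handle.read(2)
    if start not in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return False
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(src.read())
    return True
```

Calling `convert_utf16_to_utf8("tmp.bed", "tmp.utf8.bed")` and then passing the new file to `-b` would be one way to apply it.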

@yonniejon
Author

yonniejon commented Feb 15, 2024

I did not. I ran nano tmp.bed and pasted the following contents exactly:

chr6 132891948 132892108
chr10 127585142 127585221

@mdshw5
Owner

mdshw5 commented Feb 15, 2024

Just to confirm - you have said:

where tmp.bed.gz looks like:
chr6 132891948 132892108
chr10 127585142 127585221

Do you mean that the tmp.bed file contains this, and you have also gzipped it? If so I think I understand the issue. The --bed option does not handle gzipped input. If you want to pass a gzipped file you could do:

$ faidx hg19/genome.fa.gz -b - < <(gzip -dc tmp.bed.gz)

The above uses process substitution to decompress your bed file and redirects the result to stdin, which the --bed argument reads when given the "-" symbol. You could alternatively pass an uncompressed bed file.
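
The byte in the original error is a useful clue: every gzip file starts with the two bytes 0x1f 0x8b, and since 0x1f is valid ASCII, a UTF-8 decoder first fails on 0x8b at position 1. A minimal Python sketch of reading a bed file whether or not it is gzipped follows. The helper names `open_text_maybe_gzip` and `read_bed_regions` are hypothetical and are not how pyfaidx itself reads the `--bed` argument.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip (or BGZF) file


def open_text_maybe_gzip(path):
    """Open a text file, decompressing on the fly if it looks gzipped.

    Hypothetical helper, not part of pyfaidx.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_bed_regions(path):
    """Return (name, start, end) tuples from the first three bed columns."""
    regions = []
    with open_text_maybe_gzip(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and track/browser header lines.
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions
```

With this, `read_bed_regions("tmp.bed")` and `read_bed_regions("tmp.bed.gz")` would return the same list for the two-line file shown earlier in the thread.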

@yonniejon
Author

yonniejon commented Feb 16, 2024

"Do you mean that the tmp.bed file contains this, and you have also gzipped it?"

Yes you are correct. But I only gzipped it because when I ran it without gzip/bgzip I got the following error:

faidx genome.fa.gz -b tmp.bed

Traceback (most recent call last):
  File "/cs/usr/jjj/.local/bin/faidx", line 8, in <module>
    sys.exit(main())
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 53, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 70, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 920, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 1149, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 727, in fetch
    seq = self.from_file(name, start, end)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 769, in from_file
    self.file.seek(i.offset)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 650, in seek
    self._load_block(start_offset)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 611, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
    raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xea^\x8b\xb0'; handle.tell() now says 16541

@mdshw5
Owner

mdshw5 commented Feb 16, 2024

Ah I see. That error message is telling you that the FASTA file cannot be compressed with plain gzip. You can, however, use block-gzip (BGZF) compression, as produced by bgzip, to compress the FASTA file. See https://www.htslib.org/doc/bgzip.html
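
To check ahead of time which kind of compression a FASTA file uses, one can inspect its header. The Python sketch below relies on the four starting bytes quoted in the error message plus the "BC" extra subfield that BGZF blocks carry. The function name `classify_compression` and its return labels are hypothetical and not part of pyfaidx or Biopython, and only the first block of the file is examined.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"
BGZF_START = b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set


def classify_compression(path):
    """Return "bgzf", "gzip", or "not-gzip" based on the first block header.

    Hypothetical helper, not part of pyfaidx or Biopython.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if header[:2] != GZIP_MAGIC:
            return "not-gzip"
        if len(header) < 12 or header[:4] != BGZF_START:
            return "gzip"
        # Bytes 10-11 hold XLEN, the little-endian length of the extra field.
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for the BGZF one: 'B', 'C', length 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzf"
        pos += 4 + slen
    return "gzip"
```

A result of "gzip" for `genome.fa.gz` would indicate the situation in this thread: the file needs to be decompressed and recompressed with bgzip before it can be indexed.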

@yonniejon
Author

Got it! Thanks. Sorry about the confusion!

@mdshw5
Owner

mdshw5 commented Feb 16, 2024

No worries - glad to help!
