getting UnicodeDecodeError when running faidx with a bed file input #217

yonniejon · 2024-02-15T14:06:26Z

Hi!

I am running faidx version 0.7.2.1

I am running it with a bed file input like so:
faidx hg19/genome.fa.gz -b tmp.bed.gz

where tmp.bed.gz looks like:
chr6 132891948 132892108
chr10 127585142 127585221

I get the following error:
Traceback (most recent call last):
File "/cs/usr/jrosensk/.local/bin/faidx", line 8, in <module>
sys.exit(main())
File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
write_sequence(args)
File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 26, in write_sequence
for region in regions_to_fetch:
File "/usr/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I assume the problem is that my bed file has the "chr" prefix? That would be a problem, because my genome file has the chr prefix as well. Is there a way around this, or do I need to change the reference .fa file?


yonniejon · 2024-02-15T14:18:19Z

So the problem was not the chr prefix. I replaced my bed file with one that does not contain the "chr" prefixes, removed the "chr" prefixes from my fasta reference file, and the problem persists.

mdshw5 · 2024-02-15T16:54:44Z

This means that you have a non-UTF-8 character at the beginning of your file. Did you by chance export this from MS Excel as UTF-16? If so, you need to convert your file to UTF-8 encoding. You can also export from Excel directly in UTF-8 encoding.

yonniejon · 2024-02-15T17:12:28Z

I did not. I ran nano tmp.bed and pasted the following contents exactly:

chr6 132891948 132892108
chr10 127585142 127585221

mdshw5 · 2024-02-15T20:38:38Z

Just to confirm - you have said:

where tmp.bed.gz looks like:
chr6 132891948 132892108
chr10 127585142 127585221

Do you mean that the tmp.bed file contains this, and you have also gzipped it? If so I think I understand the issue. The --bed option does not handle gzipped input. If you want to pass a gzipped file you could do:

$ faidx hg19/genome.fa.gz -b - < <(gzip -dc tmp.bed.gz)

The above uses a sub shell (process substitution) to decompress your bed file and redirect it to stdin, which can be read by the --bed argument using the "-" symbol. A plain pipe, gzip -dc tmp.bed.gz | faidx hg19/genome.fa.gz -b -, does the same thing. You could alternatively pass an uncompressed bed file.
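If you are working from Python rather than the shell, the same idea can be sketched as below. This is only an illustration, not how faidx itself reads the file: read_bed_regions is a hypothetical helper name, and it assumes a BED file with at least three whitespace-separated columns. It checks the first two bytes for the gzip signature (0x1f 0x8b, the second of which is the byte 0x8b at position 1 reported in the UnicodeDecodeError) and opens the file accordingly.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_bed_regions(path):
    """Yield (chrom, start, end) from a BED file, plain or gzip-compressed.

    Hypothetical helper for illustration; not part of pyfaidx.
    """
    # Sniff the signature instead of trusting the file extension.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and track/browser header lines.
            if len(fields) < 3 or line.startswith(("#", "track", "browser")):
                continue
            yield fields[0], int(fields[1]), int(fields[2])


# Usage: write the two regions from this issue to a gzipped file and read them back.
bed_text = "chr6\t132891948\t132892108\nchr10\t127585142\t127585221\n"
with tempfile.TemporaryDirectory() as tmpdir:
    bed_path = os.path.join(tmpdir, "tmp.bed.gz")
    with gzip.open(bed_path, "wt", encoding="utf-8") as out:
        out.write(bed_text)
    regions = list(read_bed_regions(bed_path))
print(regions)
```

Each (chrom, start, end) tuple could then be used to slice a pyfaidx Fasta object in the same way the traceback in this thread shows the CLI doing (fasta[name][start:end]).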

yonniejon · 2024-02-16T11:59:28Z

"Do you mean that the tmp.bed file contains this, and you have also gzipped it?"

Yes you are correct. But I only gzipped it because when I ran it without gzip/bgzip I got the following error:

faidx genome.fa.gz -b tmp.bed

Traceback (most recent call last):
File "/cs/usr/jjj/.local/bin/faidx", line 8, in <module>
sys.exit(main())
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
write_sequence(args)
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 53, in write_sequence
for line in fetch_sequence(args, fasta, name, start, end):
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 70, in fetch_sequence
sequence = fasta[name][start:end]
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 920, in __getitem__
return self._fa.get_seq(self.name, start + 1, stop)[::step]
File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 1149, in get_seq
seq = self.faidx.fetch(name, start, end)
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 727, in fetch
seq = self.from_file(name, start, end)
File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 769, in from_file
self.file.seek(i.offset)
File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 650, in seek
self._load_block(start_offset)
File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 611, in _load_block
block_size, self._buffer = _load_bgzf_block(handle, self._text)
File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xea^\x8b\xb0'; handle.tell() now says 16541

mdshw5 · 2024-02-16T12:29:55Z

Ah I see. That error message is telling you that the FASTA file cannot be gzip compressed. You can however use block-gzip compression to compress the FASTA file. See https://www.htslib.org/doc/bgzip.html
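To tell in advance which kind of compression a .gz file uses, a rough check on its first four bytes is usually enough. The sketch below is a heuristic, not what pyfaidx or Biopython do internally: compression_kind is a hypothetical name, and the check relies only on the block header quoted in the ValueError above (b'\x1f\x8b\x08\x04'). A plain gzip file that happens to set only the "extra field" flag would be misreported as BGZF.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"          # signature shared by gzip and BGZF
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # block header quoted in the ValueError above


def compression_kind(path):
    """Return 'bgzf', 'gzip', or 'uncompressed' based on the file's first bytes.

    Hypothetical helper for illustration; a heuristic, not a full validator.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == BGZF_MAGIC:
        return "bgzf"
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    return "uncompressed"


# Usage: a FASTA compressed with Python's gzip module is plain gzip, not BGZF.
with tempfile.TemporaryDirectory() as tmpdir:
    plain_gz = os.path.join(tmpdir, "genome.fa.gz")
    with gzip.open(plain_gz, "wb") as out:
        out.write(b">chr6\nACGT\n")
    kind = compression_kind(plain_gz)
print(kind)
```

If this reports gzip for your reference, decompressing it and recompressing with bgzip (see the link above) should give a file that can be indexed and read with random access.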

yonniejon · 2024-02-16T12:41:11Z

Got it! Thanks. Sorry about the confusion!

mdshw5 · 2024-02-16T13:42:54Z

No worries - glad to help!

yonniejon closed this as completed Feb 16, 2024
