getting UnicodeDecodeError when running faidx with a bed file input #217
Comments
So the problem was not the "chr" prefix. I changed my bed file so it no longer contains the "chr" prefixes, removed the "chr" prefixes from my fasta reference file, and the problem persists.
This means that you have a byte that is not valid UTF-8 at the beginning of your file. Did you by chance export this from MS Excel as UTF-16? If so, you need to convert your file to UTF-8 encoding. You can also export from Excel directly in UTF-8.
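One way to check what is actually at the start of the file is to look at its first few bytes. The following is a minimal sketch, not part of pyfaidx; the function name `sniff_leading_bytes` and the labels it returns are hypothetical and chosen for illustration only. It relies on the gzip magic number (`1f 8b`) and on the standard byte-order marks.

```python
import codecs


def sniff_leading_bytes(path):
    """Return a rough label describing the first bytes of a file.

    Hypothetical helper for illustration; the labels are arbitrary.
    A UTF-32 LE byte-order mark would also be reported as "utf-16",
    so treat the result as a hint rather than a definitive answer.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head[:2] == b"\x1f\x8b":  # gzip (and bgzip) magic number
        return "gzip"
    if head[:3] == codecs.BOM_UTF8:
        return "utf-8-bom"
    if head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return "no-known-signature"
```

A byte `0x8b` at position 1, as in the reported error, matches the second byte of the gzip magic number, so a "gzip" result points to a compressed file rather than a text encoding problem.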
I did not. I ran nano tmp.bed and pasted the following contents exactly: chr6 132891948 132892108
Just to confirm - you have said:
Do you mean that the tmp.bed file contains this, and you have also gzipped it? If so, I think I understand the issue: the --bed option does not handle gzipped input. If you want to pass a gzipped file you could do:
$ gzip -dc tmp.bed.gz | faidx hg19/genome.fa.gz -b -
The above decompresses your bed file and pipes it to stdin, which the --bed argument reads when given the "-" symbol. You could alternatively pass an uncompressed bed file.
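If the regions are being prepared from Python rather than the shell, the same idea can be sketched as a small reader that accepts either a plain or a gzipped BED file. This is an illustrative sketch only and is not how pyfaidx itself reads BED input; `read_bed_regions` is a hypothetical name, and the sketch assumes whitespace-separated columns with the first three being chromosome, start and end.

```python
import gzip


def read_bed_regions(path):
    """Yield (chrom, start, end) tuples from a plain or gzipped BED file.

    Hypothetical helper: the file is treated as gzip-compressed when it
    starts with the gzip magic number, otherwise as plain UTF-8 text.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Skip comments, header lines, and lines with too few columns.
            if line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            if len(fields) < 3:
                continue
            yield fields[0], int(fields[1]), int(fields[2])
```

The tuples could then be written back out as an uncompressed BED file and passed to the --bed option.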
"Do you mean that the tmp.bed file contains this, and you have also gzipped it?" Yes, you are correct. But I only gzipped it because when I ran it without gzip/bgzip I got the following error:
Traceback (most recent call last):
Ah, I see. That error message is telling you that the FASTA file cannot be compressed with plain gzip. You can, however, use block-gzip (bgzip) compression to compress the FASTA file. See https://www.htslib.org/doc/bgzip.html
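To tell the two formats apart before running faidx, one option is to inspect the gzip header: a block-gzip (BGZF) file is a gzip file whose header carries an extra field containing a "BC" subfield, while plain gzip output normally has no such field. The following is a rough sketch under that assumption; `is_bgzf` is a hypothetical name, it only examines the first block, and it is not part of pyfaidx or htslib.

```python
import struct


def is_bgzf(path):
    """Return True if the first gzip block of the file looks like BGZF.

    Hypothetical helper: checks for the gzip magic number, the FEXTRA
    flag, and a 'BC' extra subfield with a 2-byte payload.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Bytes 0-2: gzip magic and deflate method; byte 3: flags (0x04 = FEXTRA).
        if header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields: SI1, SI2, SLEN, then SLEN bytes of payload.
    offset = 0
    while offset + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[offset:offset + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        offset += 4 + slen
    return False
```

If this returns False for a .gz FASTA file, decompressing it and recompressing with bgzip is the route described in the linked documentation.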
Got it! Thanks. Sorry about the confusion!
No worries - glad to help!
Hi!
I am running faidx version 0.7.2.1
I am running it with a bed file input like so:
faidx hg19/genome.fa.gz -b tmp.bed.gz
where tmp.bed.gz looks like:
chr6 132891948 132892108
chr10 127585142 127585221
I get the following error:
Traceback (most recent call last):
  File "/cs/usr/jrosensk/.local/bin/faidx", line 8, in <module>
    sys.exit(main())
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 26, in write_sequence
    for region in regions_to_fetch:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I assume the problem is that my bed file has the "chr" prefix? It is a problem because my genome file has the "chr" prefix as well. Is there a way around this, or do I need to change the reference .fa file?