sra_ids_to_runinfo.py UnicodeEncodeError #525

Closed
carrere opened this issue Dec 4, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@carrere

carrere commented Dec 4, 2020

Dear nf-core team, first of all, many thanks for your amazing work, which makes our analyses easier and more straightforward!

I am using the nf-core/rnaseq pipeline (release 2.0) with the experimental --public_data_ids feature to retrieve SRA datasets, and I run into problems with SRA projects whose metadata contains non-ASCII characters.

Here is an example: for SRP290966, the experiment_title field contains the degree character "°", encoded in Unicode (ENA API result: https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRP290966&result=read_run&fields=experiment_title).

The workflow ends with this error:

Traceback (most recent call last):
  File "/home/carrere/.nextflow/assets/nf-core/rnaseq/bin/sra_ids_to_runinfo.py", line 178, in <module>
    sys.exit(main())
  File "/home/carrere/.nextflow/assets/nf-core/rnaseq/bin/sra_ids_to_runinfo.py", line 174, in main
    fetch_sra_runinfo(args.FILE_IN,args.FILE_OUT,platform_list,library_layout_list)
  File "/home/carrere/.nextflow/assets/nf-core/rnaseq/bin/sra_ids_to_runinfo.py", line 131, in fetch_sra_runinfo
    for row in csv_dict:
  File "/opt/conda/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 260: ordinal not in range(128)
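The last line of the traceback can be reproduced in isolation. The snippet below is only a minimal sketch: the sample string and the helper name `try_ascii` are made up for illustration and are not taken from sra_ids_to_runinfo.py. It shows that encoding the degree character with the ASCII codec fails, while encoding it as UTF-8 succeeds.

```python
# Hypothetical sample value; the real experiment_title is much longer.
DEGREE = "\u00b0"
SAMPLE_TITLE = "25" + DEGREE + "C"


def try_ascii(text):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)


# The ASCII codec cannot represent U+00B0, so this yields an error message.
message = try_ascii(SAMPLE_TITLE)

# UTF-8 represents the same character as the two bytes 0xC2 0xB0.
utf8_bytes = SAMPLE_TITLE.encode("utf-8")

print(message)
print(utf8_bytes)
```

Under Python 2.7 the csv module appears to trigger this kind of implicit ASCII encoding when it is handed unicode text, which would match the traceback above.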

Thanks for your help,

Sébastien

@drpatelh
Member

drpatelh commented Dec 4, 2020

Thanks for reporting this @carrere 👍 I can indeed reproduce this locally by manually running the sra_ids_to_runinfo.py script with an id file containing just SRR12971731. @JoseEspinosa would be great if you can take a look at this please? 🙂

drpatelh added the bug label Dec 4, 2020
@carrere
Author

carrere commented Dec 4, 2020

You're welcome. I think the main problem comes from the EBI API, which does not declare the document encoding:

14:52 $ curl -I "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRP290966&result=read_run&fields=experiment_title"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Type: text/plain
Strict-Transport-Security: max-age=0
Date: Fri, 04 Dec 2020 13:52:07 GMT
Expires: 0
X-XSS-Protection: 1; mode=block
Pragma: no-cache
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Content-Length: 6226

Or in the firefox console:

"The character encoding of the plain text document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the file needs to be declared in the transfer protocol or file needs to use a byte order mark as an encoding signature. filereport"

I do not know if you can fix this automatically on the client side ... but if you know someone @ EBI, you could ask them to fix it on the API side.
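One possible client-side workaround, sketched here under assumptions: read the charset from the Content-Type header when the server declares one, and otherwise fall back to a default. The function names are hypothetical, and the UTF-8 default is an assumption about the ENA response rather than something the API states.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset declared in a Content-Type header, or `default`.

    `default="utf-8"` is an assumption: the header shown above declares
    no charset at all.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw_bytes, header_value):
    """Decode a response body using the declared or assumed charset."""
    return raw_bytes.decode(charset_from_content_type(header_value))


# The header from the curl output above carries no charset parameter.
print(charset_from_content_type("text/plain"))
print(decode_body(b"25\xc2\xb0C", "text/plain"))
```

If the API later starts declaring a charset, the same code would pick it up without changes.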

Sebastien

@JoseEspinosa
Member

I have been trying to solve the issue. The script works as is when run with Python 3, but not with Python 2. The reason is that UTF-8 is the default encoding in Python 3 but not in Python 2. I tried to find a solution that works with both Python 2 and Python 3, but didn't find one. That is why I just check the Python version and include this code. If you think this solution is suitable, I can make a PR with this patch.
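The linked patch is not reproduced in this thread, so the following is only a hedged sketch of what a version check of this kind could look like, not the actual patch. It assumes the ENA report is tab-separated and UTF-8 encoded, and the name `parse_report` is hypothetical.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def parse_report(raw_bytes, encoding="utf-8"):
    """Parse a tab-separated report into a list of dicts of text values.

    Assumes `raw_bytes` is encoded as `encoding` (UTF-8 is an assumption).
    """
    if PY2:
        # The Python 2 csv module works on byte strings, so parse the
        # bytes first and decode each field afterwards.
        reader = csv.DictReader(raw_bytes.splitlines(), delimiter="\t")
        return [
            dict((k.decode(encoding), v.decode(encoding)) for k, v in row.items())
            for row in reader
        ]
    # The Python 3 csv module works on text, so decode before parsing.
    text = raw_bytes.decode(encoding)
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


# Made-up two-line report containing the degree character.
sample = "run_accession\texperiment_title\nSRR0000001\t25\u00b0C sample\n".encode("utf-8")
rows = parse_report(sample)
print(rows)
```

Running the process with Python 3 only would make the Python 2 branch unnecessary.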

@drpatelh
Member

Thanks! But we should be using Python 3 for that process?

@JoseEspinosa
Member

I think we are using Python 2.7.13
Should we change the image instead?

@drpatelh
Member

drpatelh commented Dec 13, 2020

Indeed we are!! Yes, that would be great. We just need to replace those lines with the snippet below, and that would be the best fix. Sorry, I was looking at the wrong process 🤦🏽

    conda (params.enable_conda ? "conda-forge::python=3.8.3" : null)
    if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
        container "https://depot.galaxyproject.org/singularity/python:3.8.3"
    } else {
        container "quay.io/biocontainers/python:3.8.3"
    }

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 13, 2020
@drpatelh
Member

Thanks @JoseEspinosa. This will be fixed in the next release via f33eb6d

@JoseEspinosa
Member

Perfect @drpatelh ! was about to implement it now and saw that you closed the issue 😎

@drpatelh
Member

No worries! I had to use another container in the end that specifically contained requests. Thinking about it, I don't know if that is Python 3 or not 🤦🏽 It was late. Will check in the morning.

@drpatelh
Member

It's Python 3 in the requests container 💥 Also tested with SRR12971731 and it's working.

$ singularity shell depot.galaxyproject.org-singularity-requests-2.24.0.img

Singularity depot.galaxyproject.org-singularity-requests-2.24.0.img:> python
Python 3.8.3 | packaged by conda-forge | (default, Jun  1 2020, 17:43:00)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

@carrere
Author

carrere commented Dec 14, 2020

👍
Thank you !
