generalize refactor_reportingtools_table.rb #13

Closed
hoelzer opened this issue Nov 7, 2019 · 6 comments · Fixed by #166
Labels: enhancement, medium priority

Comments

@hoelzer
Contributor

hoelzer commented Nov 7, 2019

I included this script, which extends the ReportingTools output table with start/stop positions and gene names. However, parts of the script depend on the input species.

    # Species-specific: gene ID pattern for E. coli
    gene_id = row_splitted[0].scan(/ER[0-9]+_[0-9]+/)[0]
    # Species-specific: hard-coded Ensembl Bacteria URL for E. coli K-12
    new_row = [row_splitted[0].sub('<td class="">',"<td class=\"\"><a target=\"_blank\" href=\"https://bacteria.ensembl.org/Escherichia_coli_k_12/Gene/Summary?g=#{gene_id};\">") + '</a>', "<td class=\"\">#{gene_name}", "<td class=\"\">#{gene_biotype}", pos_part, row_splitted[1].sub('href=','target="_blank" href=').sub('<td class="">','<td class=""><div style="width: 200px">') + '</div>', row_splitted[2], row_splitted[3], row_splitted[4]].join('</td>') << '</td>'

In this example the script works for the input

--species eco

I think we have two options:

A

Remove the URL link to Ensembl that changes with species name and try to generalize the gene_id = part.

B

Define these values depending on the --species input parameter.

hoelzer added the enhancement label Nov 7, 2019
@hoelzer
Contributor Author

hoelzer commented Nov 7, 2019

I decided to implement it like this for now:

    # Species-specific gene ID pattern and Ensembl base URL
    case species
    when 'eco'
      scan_gene_id_pattern = 'ER[0-9]+_[0-9]+'
      ensembl_url = 'https://bacteria.ensembl.org/Escherichia_coli_k_12/Gene/Summary?g='
    when 'hsa'
      scan_gene_id_pattern = 'ENSG[0-9]+'
      ensembl_url = 'https://ensembl.org/Homo_sapiens/Gene/Summary?g='
    else
      # Unsupported species: no gene ID pattern and no Ensembl link
      scan_gene_id_pattern = false
      ensembl_url = false
    end

To add a new species, we simply have to extend this case statement. Not the best solution, but it works for now.
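As a possible alternative to the case statement, the same two species could live in a lookup table, so that adding a species means adding one entry. This is only a sketch: `SPECIES_CONFIG` and `ensembl_link` are hypothetical names that do not exist in the script, and the two entries simply mirror the values quoted above.

```ruby
# Hypothetical lookup table; the two entries mirror the case statement above.
SPECIES_CONFIG = {
  'eco' => {
    gene_id_pattern: /ER[0-9]+_[0-9]+/,
    ensembl_url: 'https://bacteria.ensembl.org/Escherichia_coli_k_12/Gene/Summary?g='
  },
  'hsa' => {
    gene_id_pattern: /ENSG[0-9]+/,
    ensembl_url: 'https://ensembl.org/Homo_sapiens/Gene/Summary?g='
  }
}.freeze

# Returns the Ensembl gene URL for the first gene ID found in the cell,
# or nil when the species is not configured or no gene ID matches.
def ensembl_link(species, cell)
  config = SPECIES_CONFIG[species]
  return nil unless config

  gene_id = cell[config[:gene_id_pattern]]
  gene_id && "#{config[:ensembl_url]}#{gene_id}"
end
```

Returning nil for an unknown species plays the same role as the `false` values in the `else` branch: the caller can skip the link instead of emitting a broken one.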

@hoelzer
Contributor Author

hoelzer commented Nov 15, 2021

  • check Ensembl for a mapping between the three-letter species code and the URL
  • then use this list to automate the process

@fischer-hub
Collaborator

I just found this prefix list, so mapping from the three-letter code to the species name should be possible, but the different base URLs (bacteria.ensembl.org, plants.ensembl.org, etc.) will probably be difficult. Maybe there's a direct mapping to the URLs somewhere, too.

@hoelzer
Contributor Author

hoelzer commented Dec 12, 2021

Yes, nice @fischer-hub, this list is already a very good starting point. But I agree, the correct base URL remains difficult.

Maybe implement a ping to the URL and brute-force test which one works (plants, bacteria, ...)? :D
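A rough sketch of what such a brute-force test might look like, using only Ruby's standard library. The host list is an assumption: ensembl.org, bacteria.ensembl.org and plants.ensembl.org appear in this thread, while the fungi, metazoa and protists hosts are added here as further candidates. `find_ensembl_host` and `HTTP_PROBE` are hypothetical names, and the probe is injectable so the search logic can be exercised without network access.

```ruby
require 'net/http'
require 'uri'

# Candidate hosts (partly assumed, see the note above).
ENSEMBL_HOSTS = %w[
  ensembl.org
  bacteria.ensembl.org
  plants.ensembl.org
  fungi.ensembl.org
  metazoa.ensembl.org
  protists.ensembl.org
].freeze

# Default probe: true when a HEAD request to the URL answers with a 2xx status.
# Redirects, timeouts and connection errors all count as "does not work".
HTTP_PROBE = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                  open_timeout: 5, read_timeout: 5) do |http|
    http.request_head(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

# Returns the first host whose gene page passes the probe, or nil if none does.
def find_ensembl_host(species_path, gene_id, probe: HTTP_PROBE)
  ENSEMBL_HOSTS.find do |host|
    probe.call("https://#{host}/#{species_path}/Gene/Summary?g=#{gene_id}")
  end
end
```

Because every miss costs a full HTTP round trip, it would probably make sense to probe once per species and cache the result rather than probing for each gene ID.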

@fischer-hub
Collaborator

Good idea, I'll try that!

@fischer-hub
Collaborator

fischer-hub commented Jan 14, 2022

So I looked a little deeper into this, and the prefix list actually only contains a subset of all species listed on Ensembl. However, I think we can just use the Ensembl REST API to 'map' the Ensembl IDs from the annotation file to their species names. We can then also make a single call with all the IDs at once.
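For illustration, a batched lookup could be sketched roughly as follows. It assumes the `POST lookup/id` endpoint of the Ensembl REST API at rest.ensembl.org, which takes a JSON body with an `ids` array and returns an object keyed by ID whose entries carry a `species` field. The helper names are hypothetical, and any limit on the number of IDs per request should be checked against the API documentation.

```ruby
require 'json'
require 'net/http'
require 'uri'

LOOKUP_URI = URI('https://rest.ensembl.org/lookup/id')

# Builds one POST request that looks up all given IDs at once.
def build_lookup_request(ids)
  request = Net::HTTP::Post.new(LOOKUP_URI)
  request['Content-Type'] = 'application/json'
  request['Accept'] = 'application/json'
  request.body = JSON.generate('ids' => ids)
  request
end

# Maps each ID in the JSON response to its species name.
# Entries that come back as null are kept as nil.
def species_by_id(response_body)
  JSON.parse(response_body).transform_values { |entry| entry && entry['species'] }
end

# Sends the request and returns a hash of ID => species name.
def lookup_species(ids)
  response = Net::HTTP.start(LOOKUP_URI.host, LOOKUP_URI.port, use_ssl: true) do |http|
    http.request(build_lookup_request(ids))
  end
  species_by_id(response.body)
end
```

Since all IDs in one annotation file belong to the same species, looking up a single ID may already be enough to determine the species name.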

About the base URL: from what I tested, pinging the URL up to 5 times (worst case) per ID is really slow, and we have several thousand IDs per file, so I think I'll have to find another solution.

Okay, I scraped the species lists from every prefix.ensembl.org site, so now we can look up the species name with the REST API once and then map the species name to the base URL prefix and species URL suffix. From here it should be done pretty soon.
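The final mapping step might then look something like this sketch. `SPECIES_URLS` stands in for the scraped lists and is purely hypothetical: the two base URLs and species suffixes are the ones quoted earlier in this thread, while the lowercase species-name keys are assumptions about what the lookup returns.

```ruby
# Hypothetical excerpt of the scraped mapping:
# species name => [base URL prefix, species URL suffix].
# The keys are assumed; the URLs are taken from earlier in this thread.
SPECIES_URLS = {
  'homo_sapiens' => ['https://ensembl.org', 'Homo_sapiens'],
  'escherichia_coli_k_12' => ['https://bacteria.ensembl.org', 'Escherichia_coli_k_12']
}.freeze

# Builds the gene summary URL, or returns nil for a species missing from the map.
def gene_url(species_name, gene_id)
  base, suffix = SPECIES_URLS[species_name]
  return nil unless base

  "#{base}/#{suffix}/Gene/Summary?g=#{gene_id}"
end
```

With the table loaded once, building a link per table row is a plain hash lookup and needs no further network calls.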

fischer-hub added a commit that referenced this issue Jan 31, 2022
Generalize feature url retrieval in refactor_reportingtools_table.rb, closes #13
3 participants