UCSCSession objects breaking possibly because of some change on the UCSC side #113

Closed
hpages opened this issue Mar 1, 2024 · 21 comments · Fixed by #114

Comments

@hpages
Contributor

hpages commented Mar 1, 2024

It looks like something changed on the UCSC side a few days ago that breaks UCSCSession objects:

library(rtracklayer)
session <- browserSession()

session
# Error in names(trackIds) <- sub("^ ", "", nms) : 
#   attempt to set an attribute on NULL

trackNames(session)
# Error in names(trackIds) <- sub("^ ", "", nms) : 
#   attempt to set an attribute on NULL

This is in release (rtracklayer 1.62.0, BioC 3.18) and devel (rtracklayer 1.63.1, BioC 3.19).

This breaks packages customProDB, GenomicFeatures, and goseq on all platforms in release and devel.

H.

@sanchit-saini
Contributor

sanchit-saini commented Mar 4, 2024

Thanks, @hpages, for reporting.

library(rtracklayer)
# get cookie from https://genome-euro.ucsc.edu/cgi-bin/hgGateway
session <- browserSession()
# make a request to https://genome-euro.ucsc.edu/cgi-bin/hgTracks with previously obtained cookie
# however now the site is not functional without JS support
tracks <- rtracklayer:::ucscGet(session, "tracks", list())

The tracks response is given below. It seems the UCSC browser webpage is no longer functional without JS support.

  • Is there any other source/way from which we can retrieve tracks without parsing JS?
    • Possible solution: the Table Browser could be used to retrieve the track information; however, I have to test whether it can be swapped in without affecting any related functions of UCSCSession.
  • If not, we have to include a headless browser to parse and retrieve the tracks, which seems like overkill for this task.

Caveat: track names could be extracted with the Table Browser, though track mode information is only present in the Genome Browser.

@lawremi, what are your thoughts on this?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Security-Policy" content="default-src *; script-src 'self' blob: 'unsafe-inline' 'nonce-13incR3wfN2P97dHQvJzksSV2il0' code.jquery.com/jquery-1.9.1.min.js code.jquery.com/jquery-1.12.3.min.js code.jquery.com/ui/1.10.3/jquery-ui.min.js code.jquery.com/ui/1.11.0/jquery-ui.min.js code.jquery.com/ui/1.12.1/jquery-ui.js www.google-analytics.com/analytics.js www.googletagmanager.com/gtag/js www.samsarin.com/project/dagre-d3/latest/dagre-d3.js cdnjs.cloudflare.com/ajax/libs/bowser/1.6.1/bowser.min.js
cdnjs.cloudflare.com/ajax/libs/d3/3.4.4/d3.min.js cdnjs.cloudflare.com/ajax/libs/jquery/1.12.1/jquery.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.2.1/jstree.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.4/jstree.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.7/jstree.min.js cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js login.persona.org/include.js ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.js maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js d3js.org/d3.v3.min.js cdn.datatables.net/1.10.12/js/jquery.dataTables.min.js cdn.jsdelivr.net/npm/shepherd.js@11.0.1/dist/js/shepherd.min.js www.google.com/recaptcha/api.js; style-src * 'unsafe-inline'; font-src * data:; img-src * data:;">
<title>Human hg38 chr7:155,799,529-155,812,871 UCSC Genome Browser v461</title>
<meta http-equiv="Content-Script-Type" content="text/javascript">
<link rel="stylesheet" href="../style/HGStyle.css?v=1708368144" type="text/css">
<script async src="https://www.googletagmanager.com/gtag/js?id=G-G5K9F3K9H2"></script>
</head>
<body class="hgTracks cgi">
<center><div id="warnBox" style="display:none;">
<center><b id="warnHead"></b></center>
<ul id="warnList"></ul>
<center><button id="warnOK"></button></center>
</div></center>
<noscript><div class="noscript"><div class="noscript-inner">
<p><b>JavaScript is disabled in your web browser</b></p>
<p>You must have JavaScript enabled in your web browser to use the Genome Browser</p>
</div></div></noscript>
<script type="text/javascript" src="../js/jquery.js?v=1708368145"></script><script type="text/javascript" src="../js/utils.js?v=1708368145"></script><script type="text/javascript" nonce="13incR3wfN2P97dHQvJzksSV2il0">
function showWarnBox() {document.getElementById('warnOK').innerHTML='&nbsp;OK&nbsp;';var warnBox=document.getElementById('warnBox');warnBox.style.display='';document.getElementById('warnHead').innerHTML='Warning/Error(s):';window.scrollTo(0, 0);}
function hideWarnBox() {var warnBox=document.getElementById('warnBox');warnBox.style.display='none';var warnList=document.getElementById('warnList');warnList.innerHTML='';var endOfPage = document.body.innerHTML.substr(document.body.innerHTML.length-20);if(endOfPage.lastIndexOf('-- ERROR --') > 0) { history.back(); }}
document.getElementById('warnOK').onclick = function() {hideWarnBox();return false;};
window.onunload = function(){}; // Trick to avoid FF back button issue.
addPixAndReloadPage();// Google tag load (gtag.js)
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date()); gtag('config', 'G-G5K9F3K9H2');
// Google tag load end
  $(document).ready(function() {
          if (gtag) {
              /* send db to ga4 as an event on page load */
              gtag('event', 'hgTracksLoad', {'db': getDb()})
          };
  });</script>
</body>
</html>
<html><link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css"></html>

@sanchit-saini
Contributor

Hi @maximilianh,

Is it possible to work directly with the HTML content and avoid parsing JS (for https://genome-euro.ucsc.edu/cgi-bin/hgTracks)?

Can you please look into this? I hope we can find a fix for it, if one is possible.

@maximilianh

Can you give me a little more context? Our site has been requiring JS for at least 15 years, that hasn't changed.

What I changed is that I added code that detects if the "pix" session or URL variable (screen size) is not set; if it's not set, the code determines the screen size, then reloads the page. I have no idea why this would interfere with rtracklayer, but it's something that has changed recently. Maybe some other change broke the rtracklayer parser, idk, I don't have enough information yet to make an educated guess.

To get the list of tracks in a way that doesn't require parsing HTML, we have the "tracks" API endpoint, e.g. http://api.genome.ucsc.edu/list/tracks?genome=hg38 see api.genome.ucsc.edu for more documentation or feel free to ask me.
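As a rough illustration of the API route, here is a minimal shell sketch in the same style as the curl requests elsewhere in this thread. The helper names `tracks_api_url` and `list_tracks` are hypothetical; only the endpoint and the `genome` parameter come from the comment above, and nothing is assumed here about the layout of the JSON that comes back.

```shell
#!/bin/sh
# Hypothetical helper: build the URL of the "list/tracks" endpoint for one
# assembly (e.g. hg38). Endpoint and parameter are taken from the comment above.
tracks_api_url() {
    printf 'http://api.genome.ucsc.edu/list/tracks?genome=%s\n' "$1"
}

# Hypothetical helper: download the track list for one assembly.
# -s silences progress output, -f makes curl fail on HTTP errors.
list_tracks() {
    curl -sf "$(tracks_api_url "$1")"
}
```

Calling `list_tracks hg38 > hg38_tracks.json` would then save the response for inspection, with no HTML parsing involved.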

@maximilianh

If you can tell me what exactly broke rtracklayer, I can try to do something to make it work again in the short run, but in the long run it would probably reduce the number of fire drills to start parsing JSON rather than HTML :-)

@maximilianh

@sanchit-saini You wrote "track mode information"; what do you mean by "track mode"? Do you mean the visibilities?

If this is something that the API doesn't return, we will add the information ASAP to the API. I wonder if there is a reason why you are not using the API, if that's the case, we will absolutely have to fix that.

@maximilianh

@NullModel

@sanchit-saini
Contributor

Sorry for the late reply.

Why does rtracklayer need to scrape/parse HTML?

rtracklayer provides a command-line interface to interact with the Genome Browser, which cannot be emulated with the REST API.

What are track modes?

If we open https://genome-euro.ucsc.edu/cgi-bin/hgTracks we can see track names (e.g. Assembly), each with a drop-down offering the options hide, dense, squish, pack, and full. These options are referred to as track modes in rtracklayer.

What is the problem rtracklayer is facing?

Recently this https://genome-euro.ucsc.edu/cgi-bin/hgTracks stopped giving HTML response and needs JS to function. 

Request 1

 $ curl -I -H "User-Agent: rtracklayer" 'https://genome-euro.ucsc.edu/cgi-bin/hgGateway'
Response 1
HTTP/1.1 200 OK
Date: Mon, 11 Mar 2024 14:31:40 GMT
Server: Apache/2.4.53 (Rocky Linux) OpenSSL/3.0.1
Set-Cookie: hguid.genome-euro=467621116_o5SInCvSI4NA3g2kOiGXGgtvMfad; path=/; domain=.ucsc.edu; expires=Thu, 31-Dec-2037 23:59:59 GMT
Vary: Accept-Encoding
Origin-Trial: Ats6dcpzFne+6Djws3arcMPv1F64iEOPnBrs3VjBzvGcrG+EAc1D0+uMm00BglPAQqBh5ZHPZPXHyFU+rHjxOwUAAABweyJvcmlnaW4iOiJodHRwczovL3Vjc2MuZWR1OjQ0MyIsImZlYXR1cmUiOiJBbGxvd1N5bmNYSFJJblBhZ2VEaXNtaXNzYWwiLCJleHBpcnkiOjE1OTc5NzA5MjUsImlzU3ViZG9tYWluIjp0cnVlfQ==
Content-Type: text/html; charset=UTF-8

Request 2

$ curl -H "User-Agent: rtracklayer" -b "hguid.genome-euro=467621116_o5SInCvSI4NA3g2kOiGXGgtvMfad" 'https://genome-euro.ucsc.edu/cgi-bin/hgTracks'
Response 2
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<meta http-equiv='Content-Security-Policy' content="default-src *; script-src 'self' blob: 'unsafe-inline' 'nonce-eaR99FCvJjT3Qxw3ya71duJlfY2j' code.jquery.com/jquery-1.9.1.min.js code.jquery.com/jquery-1.12.3.min.js code.jquery.com/ui/1.10.3/jquery-ui.min.js code.jquery.com/ui/1.11.0/jquery-ui.min.js code.jquery.com/ui/1.12.1/jquery-ui.js www.google-analytics.com/analytics.js www.googletagmanager.com/gtag/js www.samsarin.com/project/dagre-d3/latest/dagre-d3.js cdnjs.cloudflare.com/ajax/libs/bowser/1.6.1/bowser.min.js cdnjs.cloudflare.com/ajax/libs/d3/3.4.4/d3.min.js cdnjs.cloudflare.com/ajax/libs/jquery/1.12.1/jquery.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.2.1/jstree.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.4/jstree.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.7/jstree.min.js cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js login.persona.org/include.js ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.js maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js d3js.org/d3.v3.min.js cdn.datatables.net/1.10.12/js/jquery.dataTables.min.js cdn.jsdelivr.net/npm/shepherd.js@11.0.1/dist/js/shepherd.min.js www.google.com/recaptcha/api.js; style-src * 'unsafe-inline'; font-src * data:; img-src * data:;">
<TITLE>Human hg38 chr7:155,799,529-155,812,871 UCSC Genome Browser v461</TITLE>
        <META http-equiv="Content-Script-Type" content="text/javascript">
<link rel='stylesheet' href='../style/HGStyle.css?v=1708368144' type='text/css'>
<script async src="https://www.googletagmanager.com/gtag/js?id=G-G5K9F3K9H2"></script>
</HEAD>

<BODY CLASS="hgTracks cgi">
<center><div id='warnBox' style='display:none;'><CENTER><B id='warnHead'></B></CENTER><UL id='warnList'></UL><CENTER><button id='warnOK'></button></CENTER></div></center>
<noscript><div class='noscript'><div class='noscript-inner'><p><b>JavaScript is disabled in your web browser</b></p><p>You must have JavaScript enabled in your web browser to use the Genome Browser</p></div></div></noscript>
<script type='text/javascript' SRC='../js/jquery.js?v=1708368145'></script>
<script type='text/javascript' SRC='../js/utils.js?v=1708368145'></script>
<script type='text/javascript' nonce='eaR99FCvJjT3Qxw3ya71duJlfY2j'>
function showWarnBox() {document.getElementById('warnOK').innerHTML='&nbsp;OK&nbsp;';var warnBox=document.getElementById('warnBox');warnBox.style.display='';document.getElementById('warnHead').innerHTML='Warning/Error(s):';window.scrollTo(0, 0);}
function hideWarnBox() {var warnBox=document.getElementById('warnBox');warnBox.style.display='none';var warnList=document.getElementById('warnList');warnList.innerHTML='';var endOfPage = document.body.innerHTML.substr(document.body.innerHTML.length-20);if(endOfPage.lastIndexOf('-- ERROR --') > 0) { history.back(); }}
document.getElementById('warnOK').onclick = function() {hideWarnBox();return false;};
window.onunload = function(){}; // Trick to avoid FF back button issue.
addPixAndReloadPage();// Google tag load (gtag.js)
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date()); gtag('config', 'G-G5K9F3K9H2');
// Google tag load end
  $(document).ready(function() {
          if (gtag) {
              /* send db to ga4 as an event on page load */
              gtag('event', 'hgTracksLoad', {'db': getDb()})
          };
  });</script>

</BODY>
</HTML>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">

@maximilianh

> rtracklayer provides a command-line interface to interact with the Genome Browser, which cannot be emulated with the REST API.

Sorry, I don't understand, can you give me more context? rtracklayer needs the list of tracks in this example, for one assembly. We have an API call for that. Why is rtracklayer parsing HTML to get the list of tracks?

track modes: ok great, then, we call them "visibilities" but the word doesn't matter. Thanks for explaining it.

> rtracklayer retrieve cookies from https://genome-euro.ucsc.edu/cgi-bin/hgGateway from the HTTP headers.

You don't need these cookies to get the information, I think, but let's not focus too much on this.

> Recently this https://genome-euro.ucsc.edu/cgi-bin/hgTracks stopped giving HTML response and needs JS to function.

It doesn't really need JS. The new thing is only that it's calling a single JavaScript function that determines the screen size and then stops the output. All that this JavaScript function does is add "pix" to the URL as a variable. If you add the argument pix=800 to the hgTracks call, e.g.

hgTracks?hgsid=xxxx&pix=800 your code should work as before, even today (I hope)
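A minimal sketch of that workaround from the shell, assuming the genome-euro mirror and the `rtracklayer` User-Agent used in the curl requests earlier in this thread. The helper names `hgtracks_url` and `fetch_hgtracks` are hypothetical, and the default width of 800 simply mirrors the example value above.

```shell
#!/bin/sh
# Hypothetical helper: build an hgTracks URL that already carries "pix",
# so the page has no reason to run the screen-size reload described above.
# Arguments: $1 = session id (hgsid), $2 = width in pixels (default 800).
hgtracks_url() {
    printf 'https://genome-euro.ucsc.edu/cgi-bin/hgTracks?hgsid=%s&pix=%s\n' "$1" "${2:-800}"
}

# Hypothetical helper: fetch the page with the same User-Agent as the
# curl requests shown earlier in this thread.
fetch_hgtracks() {
    curl -sf -H "User-Agent: rtracklayer" "$(hgtracks_url "$1" "$2")"
}
```

Here `fetch_hgtracks xxxx` would request the page at the default width, with `xxxx` standing in for a real session id.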

What I can do on our end with the next code release in two weeks is to suppress the javascript entirely if the user agent is "rtracklayer". I did this on my internal testing website, can you try if your code works here:

https://hgwdev-max.gi.ucsc.edu/cgi-bin/hgTracks

@maximilianh

@sanchit-saini This problem will come up again whenever we change our HTML. In our group, we don't understand why, instead of parsing the HTML, you cannot use an API call to get the list of track names...

Your PR fixes it, but older rtracklayer versions will be broken. Should I commit the fix from https://hgwdev-max.gi.ucsc.edu/cgi-bin/hgTracks and get it released in two weeks?

@maximilianh

Maybe @hpages has some idea on why UCSC doesn't understand @sanchit-saini 's reply?

@sanchit-saini
Contributor

Thanks @maximilianh, adding pix to the request fixes the issue.

> Your PR fixes it, but older rtracklayer versions will be broken. Should I commit the fix from https://hgwdev-max.gi.ucsc.edu/cgi-bin/hgTracks and get it released in two weeks?

Yes, I tested it, and it seems to work without pix. It would be great to include it in the release.

> @sanchit-saini This problem will come up again whenever we change our HTML. In our group, we don't understand why, instead of parsing the HTML, you cannot use an API call to get the list of track names...

browseGenome() is part of rtracklayer. It opens a web browser and loads the Genome Browser with the specified genome, range, browserView() (also part of rtracklayer), etc.

Essentially, through these and other functions, we can interact with the Genome Browser from the command line. rtracklayer internally achieves this by mimicking requests to the Genome Browser and parsing the response HTML.

I hope it is clear now why we cannot use the UCSC REST API: this feature depends on interacting with the Genome Browser, which cannot be achieved with the REST API.

@hpages
Contributor Author

hpages commented Mar 13, 2024

> Maybe @hpages has some idea on why UCSC doesn't understand @sanchit-saini 's reply?

I'm not really sure I understand it either.

Anyways, I've started to use UCSC REST API instead of rtracklayer::trackNames() and it works great for my use case. See https://github.com/Bioconductor/txdbmaker/blob/devel/R/UCSC-utils.R.
Thanks @maximilianh!

@maximilianh

maximilianh commented Mar 13, 2024 via email

@maximilianh

maximilianh commented Mar 13, 2024 via email

@sanchit-saini
Contributor

@maximilianh Yes, this feature is not used widely.
At this moment, we don't have any tests to check it, though I can write tests. I will try to make these tests portable so you folks can test them on your end too.

@maximilianh

maximilianh commented Mar 14, 2024 via email

@sanchit-saini
Contributor

It has tests for non-network-related features. Now, we will also add tests for the missing features and try to set up a GitHub Action or some other automation to run those tests periodically.

Also, you can expect the tests to be completed around the end of this month.

@maximilianh

maximilianh commented Mar 18, 2024 via email

@jayoung

jayoung commented Apr 8, 2024

hi there,

I don't really understand the thread details (sorry!), but I am interested in the bottom line.

From an ordinary user perspective, is browseGenome() etc etc likely to be fixable in the near-term - should I be keeping my eye out for rtracklayer package updates? Or should I use some other approach to make browser-style plots? I'd been hoping to use rtracklayer for that, but can switch to Gviz or something else if this issue seems intractable. (also would love suggestions of additional packages that make nice plots of genes + genomic data)

I've been running into similar errors as the ones that Herve reported at the top of the thread

thanks!

Janet

@sanchit-saini
Contributor

Hi @maximilianh, it took a bit longer than I expected. I have created PR #120, which covers most of the commonly used functions. A few functions are missing, and I will add tests for them soon.

@sanchit-saini
Contributor

Hi @jayoung

We are constantly trying to maintain stability and add features to the rtracklayer package.

If there's a change on the UCSC side, the browseGenome() function may break, because it's implemented by mimicking requests and parsing responses (aka web scraping). That was also the case for this issue. However, to avoid these kinds of problems in the future, we have put some test cases in place, which will tell us when something goes wrong so we can fix it quickly.
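To give a feel for what such a check can look like, here is a minimal shell sketch. It is not the actual rtracklayer test suite. The function name `is_pix_reload_stub` is hypothetical, and using the `addPixAndReloadPage` call (visible in the responses pasted earlier in this thread) as the marker is an assumption about how to recognise the JavaScript-only page, not a documented UCSC contract.

```shell
#!/bin/sh
# Hypothetical check: exit 0 when the HTML on stdin looks like the
# JavaScript-only stub page, i.e. it contains the addPixAndReloadPage call
# seen in the responses pasted earlier in this thread; exit non-zero otherwise.
is_pix_reload_stub() {
    grep -q 'addPixAndReloadPage'
}
```

A periodic job could pipe a saved hgTracks response through it, for example `is_pix_reload_stub < response.html && echo "hgTracks returned the JS-only page"`, where `response.html` is a placeholder file name.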

For plots, you have to observe what your use case is and which package comes close to solving it.
Based on it, you can weigh your options. For package recommendations, I am not sure if I can provide any help with it, though I think you can ask (or search) about it at https://support.bioconductor.org/. Others may be able to provide you with some suggestions.
