
progress bar for get_hmm and get_big_pi #3

Closed · TS404 opened this issue May 18, 2018 · 5 comments

TS404 commented May 18, 2018

get_hmm and get_big_pi can take a while. When submitting >1000 sequences, I've tended to submit them individually in order to salvage the results if the server becomes unresponsive after an hour. I don't know if the server prefers batch queries to repeated individual queries, but even so, sequences could be submitted in batches of 10 or 50 to give an idea of the estimated time (a batched variant is sketched after the loop below).

annotx <- NULL
pbt    <- txtProgressBar(min = 0, max = length(sequences), style = 3)
pbw    <- winProgressBar(min = 0, max = length(sequences),
                         title = "HMM progress")  # Windows-only

for (i in seq_along(sequences)) {
  seqsubset <- sequences[i]
  # submit one sequence at a time so partial results survive a hang
  annotx <- rbind(annotx, ragp::get_hmm(sequence = seqsubset,
                                        id = names(seqsubset),
                                        verbose = FALSE,
                                        sleep = 0))
  setTxtProgressBar(pbt, i)
  setWinProgressBar(pbw, i, title = paste("HMM progress:",
                                          round(i / length(sequences) * 100, 0),
                                          "% (",
                                          names(sequences)[i],
                                          ")"))
}
close(pbt)
close(pbw)
annotx
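
A batched variant could look like the sketch below (assuming get_hmm accepts vectors of sequences and ids; if it does not, each batch would need an inner loop):

# split the indices into batches of 50
chunks <- split(seq_along(sequences),
                ceiling(seq_along(sequences) / 50))

annotx <- NULL
pbt    <- txtProgressBar(min = 0, max = length(chunks), style = 3)
for (j in seq_along(chunks)) {
  idx    <- chunks[[j]]
  annotx <- rbind(annotx,
                  ragp::get_hmm(sequence = sequences[idx],
                                id = names(sequences)[idx],
                                verbose = FALSE,
                                sleep = 0))
  setTxtProgressBar(pbt, j)  # one tick per completed batch
}
close(pbt)
annotx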
missuse commented May 18, 2018

I have had a similar experience with get_hmm. Some days the hmmscan server is very unresponsive, and batch upload to hmmscan is even worse: uploads are put in a queue, and sometimes hours pass before the job even starts.

Currently I would like to change get_hmm to resubmit a sequence after some time if the result is not provided. If the second submission also hangs, the function ends, returning results for the sequences processed up to that point along with an error message. I trust this is the best solution in this case.
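
Roughly, the retry logic could look like the sketch below (submit_hmmscan() is a hypothetical helper standing in for the actual request to the hmmscan server):

fetch_with_retry <- function(sequence, id, attempts = 2, timeout = 10) {
  for (k in seq_len(attempts)) {
    res <- tryCatch(
      submit_hmmscan(sequence, id, timeout = timeout),  # hypothetical helper
      error = function(e) NULL                          # treat timeout as failure
    )
    if (!is.null(res)) return(res)  # got a result, no retry needed
  }
  NULL  # all attempts exhausted; caller returns partial results and warns
}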

For 10k+ sequences I recommend using hmmer locally. Perhaps a function that takes the output from hmmer and imports it in the same format as the output of the hmmscan function?
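
A rough sketch of such an importer, assuming hmmscan was run locally with --domtblout (read_domtblout is a hypothetical name, and the column positions follow the HMMER user guide):

read_domtblout <- function(path) {
  lines  <- readLines(path)
  lines  <- lines[!startsWith(lines, "#")]  # drop comment/header lines
  fields <- strsplit(trimws(lines), "\\s+")
  data.frame(
    target = vapply(fields, `[`, character(1), 1),  # matched HMM name
    acc    = vapply(fields, `[`, character(1), 2),  # HMM accession
    query  = vapply(fields, `[`, character(1), 4),  # query sequence id
    evalue = as.numeric(vapply(fields, `[`, character(1), 7)),
    stringsAsFactors = FALSE
  )
}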

I haven't had this problem with get_big_pi: usually only sequences containing an N-sp are sent to big pi, and in my experience it works reliably for up to 5k sequences. Batch queries would be a good addition (perhaps faster), but this would require a complete rewrite of the function. I will do some testing and then decide if we should go in this direction.

A progress bar would be a nice addition, perhaps displayed when verbose = FALSE, for both functions and even for the *_file functions.

TS404 commented May 18, 2018

Good idea for get_hmm! I think that's a very sensible way to do it.

I've not managed to reproduce the get_big_pi issue, so perhaps there was something odd about the time it happened to me.

missuse commented May 30, 2018

I have managed to speed up get_big_pi significantly, and I have implemented the progress bar as suggested. I just need to perform some checks before deployment; I trust it will be available during the weekend. I think I will update all the scraping functions with progress bars.

missuse commented May 31, 2018

get_big_pi has been updated. The update should provide a significant speed-up, and it should now be more in line with the speed of get_phobius.

For instance:

system.time(
  test_big_pi <- ragp::get_big_pi(at_nsp[1:1000, ],
                                  sequence,
                                  Transcript.id,
                                  simplify = FALSE)
)
# output:
#  user  system elapsed
#  3.53    0.13   57.88

Bugs are possible; if you stumble upon any, please report them.

missuse commented Jun 2, 2018

get_hmm has been updated.

New arguments are:

  1. timeout - time in seconds to wait for the server response (default = 10 s).
  2. attempts - number of attempts if the server is unresponsive (default = 2).

If the number of attempts is exhausted, the function will issue a warning and return the queries finished so far.

Additionally, a progress bar has been added.
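
For example, a call using the new arguments might look like this (a minimal sketch; at_nsp and its sequence and Transcript.id columns are reused from the get_big_pi example above):

test_hmm <- ragp::get_hmm(sequence = at_nsp$sequence[1:100],
                          id = at_nsp$Transcript.id[1:100],
                          timeout = 10,  # seconds to wait for each response
                          attempts = 2)  # resubmissions before warning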

Bugs are possible; if you stumble upon any, please report them.

Thank you for the suggestions.

Closing this issue.

missuse closed this as completed Jun 2, 2018