This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

Is speed an issue? #18

Open
ax42 opened this issue Oct 15, 2015 · 9 comments

Comments

@ax42
Contributor

ax42 commented Oct 15, 2015

Hi

RForcecom feels slow -- it's taking about 4 minutes to execute a query that returns about 18'500 records with 9 variables.

    SELECT Id, Field, CreatedById, CreatedDate, IsDeleted, NewValue, OldValue, OpportunityId, Opportunity.Name
    FROM OpportunityFieldHistory
    WHERE Opportunity.CreatedDate >= ", startDate, "T00:00:00.000Z")

I've not started tracing through this to figure out where the issue could be (SF? Network? XML vs JSON?). What's the best way to start digging into this issue?
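One way to start digging is to time the parse step in isolation on a synthetic result set, so parsing cost can be separated from network transfer. A rough sketch (the record fields here are made up, not real org data):

```r
library(XML)

# Build a synthetic queryResponse with 1000 records so we can time
# the XML-to-list conversion alone, independent of the network fetch.
rec     <- "<records><Id>001</Id><Name>Acme</Name></records>"
xml_txt <- paste0("<queryResponse>",
                  paste(rep(rec, 1000), collapse = ""),
                  "</queryResponse>")

doc <- xmlParse(xml_txt)
xns <- getNodeSet(doc, "//records")

# Time just the conversion step that rforcecom.query performs per record
t_parse <- system.time(xls <- lapply(xns, xmlToList))
print(t_parse["elapsed"])
```

Scaling the record count up and down shows whether the conversion grows linearly or worse with result size.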

@ax42
Contributor Author

ax42 commented Oct 15, 2015

Ok, I've done some ghetto profiling, and I think the line

xls <- lapply(lapply(xns, xmlToList), enlist)

in rforcecom.query() is the issue. For a query that takes 4.2 seconds overall (427 records, so no refetch needed), around 1 second goes to the data transfer (the same as with another REST tool) and 2.2 seconds to the line above -- 2.2x the transfer time.

Creating a query which returns more records (~1'400) makes the discrepancy worse:

  • 2.3s for the curl
  • 7.299s for the xmlToList (3.2x the time of the curl)

I'd be interested in trying to improve this. Is there a faster XML library? Does one really need to convert the whole XML to a list? Would JSON be quicker?

@ax42
Contributor Author

ax42 commented Oct 15, 2015

More updates:

  • Asking for the results in JSON is not quicker (in fact it's slower to fetch JSON than XML in my tests, by a factor of 2)
  • I think the xml2 library will be faster, busy checking this out

@ax42
Contributor Author

ax42 commented Oct 16, 2015

OK, I've done some further benchmarking -- it seems we can get good speed increases by using curl instead of RCurl (~50% improvement on a small test) and xml2 instead of XML.

@hiratake55 -- I think refactoring the whole library in one shot will be a challenge; what is your appetite for doing this one function at a time? It would mean requiring both XML libraries and both curl libraries (as some functions are likely to use one, and others the other).
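For reference, a minimal sketch of what parsing records with xml2 might look like (this assumes a SOAP-style queryResponse payload and is not the package's actual internals):

```r
library(xml2)

# Hypothetical query response payload (field names are illustrative)
xml_txt <- '<queryResponse>
  <records><Id>001A</Id><Name>Acme</Name></records>
  <records><Id>001B</Id><Name>Globex</Name></records>
</queryResponse>'

doc  <- read_xml(xml_txt)
recs <- xml_find_all(doc, ".//records")

# Extract fields node-by-node instead of converting the whole tree to a list
df <- do.call(rbind, lapply(recs, function(r) {
  kids <- xml_children(r)
  vals <- as.list(xml_text(kids))
  names(vals) <- xml_name(kids)
  as.data.frame(vals, stringsAsFactors = FALSE)
}))
```

xml2 wraps libxml2 directly and avoids the recursive R-level list conversion that XML::xmlToList does, which is a plausible source of the speedup.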

@hiratake55
Owner

Hi @ax42 ,

Thank you for contacting me. I'll check whether the xml2 package fits RForcecom or not.

@StevenMMortimer
Contributor

@ax42 Have you considered using the Bulk API features of the package? Below are some timings with roughly 650K records; I've pulled 50K in a second or two. Salesforce caches the queries, so they get faster when repeated.

Note: rforcecom.bulkQuery is a convenience wrapper I've written around rforcecom.submitBulkQuery and friends. I'll submit a PR today and hopefully have it included in the package as a standard function. I'll post its code here as well so you have it.

# before salesforce caching
# Salesforce runtime = 69 seconds, so 11 second overhead from R
system.time(d <- rforcecom.bulkQuery(session, 
                     soqlQuery = "Select Id from Account", 
                     object = 'Account', 
                     interval_seconds=1, 
                     max_attempts=100, 
                     verbose=FALSE))
   user  system elapsed 
   0.24    0.05    80.52 

dim(d)
[1] 644239      1

# after salesforce caching
# Salesforce runtime = 16 seconds, so 10 second overhead from R
system.time(d <- rforcecom.bulkQuery(session, 
                     soqlQuery = "Select Id from Account", 
                     object = 'Account', 
                     interval_seconds=1, 
                     max_attempts=100, 
                     verbose=FALSE))
   user  system elapsed 
   0.24    0.05    26.17 

dim(d)
[1] 644239      1

rforcecom.bulkQuery

rforcecom.bulkQuery <- function(session,
                               soqlQuery,
                               object,
                               interval_seconds=5,
                               max_attempts=100, 
                               verbose=FALSE){

  job_info <- rforcecom.createBulkJob(session, operation='query', object=object)
  batch_query_info <- rforcecom.submitBulkQuery(session,
                                                jobId=job_info$id,
                                                query=soqlQuery)
  status_complete <- FALSE
  z <- 1
  Sys.sleep(interval_seconds)
  while (z < max_attempts && !status_complete){
    if (verbose){
      message(paste0("Pass #", z))
    }
    Sys.sleep(interval_seconds)
    batch_query_status <- rforcecom.checkBatchStatus(session,
                                                     jobId=batch_query_info$jobId,
                                                     batchId=batch_query_info$id)
    status_complete <- (batch_query_status$state=='Completed')
    z <- z + 1
  }
  if (!status_complete) {
    message(paste('Batch did not complete; last state:', batch_query_status$state))
    batch_query_details <- NULL
    tryCatch({
      batch_query_details <- rforcecom.getBatchDetails(session,
                                                       jobId=batch_query_info$jobId,
                                                       batchId=batch_query_info$id)
    }, error=function(e){
    })
    # close the job
    close_job_info <- rforcecom.closeBulkJob(session, jobId=job_info$id)
    return(batch_query_details)
  }
  batch_query_details <- rforcecom.getBatchDetails(session,
                                                   jobId=batch_query_info$jobId,
                                                   batchId=batch_query_info$id)

  batch_query_recordset <- rforcecom.getBulkQueryResult(session,
                                                        jobId=batch_query_info$jobId,
                                                        batchId=batch_query_info$id,
                                                        resultId=batch_query_details$result)
  close_job_info <- rforcecom.closeBulkJob(session, jobId=job_info$id)

  return(batch_query_recordset)
}

@ax42
Contributor Author

ax42 commented Oct 18, 2015

That's cool, thanks. I'd not seen the bulk API pieces as I seem to be using v0.7 (off CRAN).

One thing the bulk query does not seem able to do is deal with foreign keys (e.g. fetching the details of an Account owner). Salesforce returns:
Foreign Key Relationships not supported in Bulk Query

So it's probably really useful in some situations (straight dumps) but not in others (complex queries), although it may be faster to pull straight dumps off SF and combine them in R than to run complex queries.

@StevenMMortimer
Contributor

Yes, for large joins I would recommend pulling straight dumps of each object and joining in R (I use the dplyr package for joins). For small sets I would recommend using the SOAP-based function rforcecom.query just to cut down on R code. I haven't run performance tests to see who does the join faster on large datasets (R vs. Salesforce), but it would be interesting to know.
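The straight-dump-then-join approach can be sketched like this (the object and field names below are illustrative stand-ins for two Bulk API dumps, not real org data):

```r
library(dplyr)

# Hypothetical stand-ins for two straight dumps pulled via the Bulk API
accounts <- data.frame(Id      = c("001A", "001B"),
                       OwnerId = c("005X", "005Y"),
                       stringsAsFactors = FALSE)
users    <- data.frame(Id   = c("005X", "005Y"),
                       Name = c("Ada", "Grace"),
                       stringsAsFactors = FALSE)

# Replicates what a SOQL lookup like Owner.Name would return,
# but done locally in R after two flat extracts
accounts_with_owner <- left_join(accounts, users, by = c("OwnerId" = "Id"))
```

A left join keeps every account even when the owner lookup is empty, which matches how SOQL handles a null lookup field.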

If you want to experiment with some of the Bulk functions (since they're not on CRAN yet), you can install from the maintainer's GitHub or mine.

Github Install

library(devtools)
install_github('ReportMort/RForcecom')

@ax42
Contributor Author

ax42 commented Oct 18, 2015

I've installed the GitHub version from the maintainer, and just copy/pasted the code you kindly provided. I'll try to get some benchmarks done over the next few days (although it seems our Salesforce instance is a lot smaller than yours).

My workflow so far has been to use http://dataloader.io to help me formulate my queries, and then I run them with rforcecom.query, which is why I have been focusing on benchmarking and/or improving that function specifically. Moving to xml2 should already make a huge difference.

The advantage of letting Salesforce do the joins is that you don't have to worry about consistency, and you always get your dataset back exactly like you want it (especially if you are calling a bunch of lookup fields in a query). Each approach has its applications in the right place, and having both available in the library is great!

@robertocross

Any update on this? My queries from SFDC are extremely slow thus far.
