
critical refactoring: attempt to solve graal-js callonce / callSingle issues and excessive memory use for call #1685

Closed

ptrthomas opened this issue Jul 16, 2021 · 21 comments

@ptrthomas (Member)

cc @workwithprashant who raised this: https://stackoverflow.com/q/68411732/143475

not really looking forward to this fix. but maybe we shouldn't "return" any variables that were in the calling context

and I'd also like to attempt a refactor to work around the horrible graal-js limitation of not being able to use JS objects created in one context on another thread, e.g. #1633 and #1558

the rough idea is to move callonce and callSingle onto a separate "special" context and make any use of that context thread-safe. wish me luck :|
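
roughly the shape I have in mind for the compute-once part - a minimal sketch only, all class and method names here are hypothetical, not Karate's actual internals:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // hypothetical sketch: callonce / callSingle results live in one shared cache,
    // and ConcurrentHashMap guarantees the mapping function runs at most once per
    // key, so only the "winning" thread ever enters the underlying JS context
    public class SharedCallCache {

        private static final ConcurrentHashMap<String, Object> CACHE = new ConcurrentHashMap<>();

        public static Object callOnce(String key, Supplier<Object> call) {
            return CACHE.computeIfAbsent(key, k -> call.get());
        }
    }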

@ptrthomas ptrthomas self-assigned this Jul 16, 2021
ptrthomas added a commit that referenced this issue Jul 17, 2021
glad could get rid of scenario-listener - that was really clunky
and a new approach seems promising - clone the js engine if needed for another thread etc
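to illustrate what "clone the js engine if needed for another thread" could mean in raw GraalVM polyglot terms - a sketch under assumptions, not the actual Karate code:

    import java.util.Map;
    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.HostAccess;
    import org.graalvm.polyglot.Value;

    // a graal-js Context is not thread-safe, so instead of sharing one across
    // threads, build a fresh Context and copy variable values across
    public class JsEngineCloner {

        public static Context cloneForThread(Map<String, Object> plainJavaVars) {
            Context copy = Context.newBuilder("js")
                    .allowHostAccess(HostAccess.ALL)
                    .build();
            Value bindings = copy.getBindings("js");
            // values must already be detached from the source context,
            // i.e. converted to plain Java maps / lists / primitives beforehand
            plainJavaVars.forEach(bindings::putMember);
            return copy;
        }
    }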
@ptrthomas (Member Author)

so @workwithprashant I think you already figured out that this solves your problem: set variables to null to avoid them being cloned. unless you have better ideas

    * def out = null
    * def response = null
    * def response = call read('classpath:examples/library.feature@getLibraryData')
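
the point of the nulls is that a null variable carries nothing for the call machinery to clone, so the large responses from earlier calls never get copied into the new context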

ptrthomas added a commit that referenced this issue Jul 17, 2021
now we have learnt that java native host-object like functions can be passed around freely in graal
exploring if we convert to runnable / consumer / function as in java lambdas
seems to work well - this commit tries this with the active-mq example
one possible limitation is only single argument functions are supported, sure this can be overcome
but it is a reasonable limit - e.g. use maps to pass data in / out
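by way of illustration, the kind of single-argument, map-in / map-out host function the commit message describes could look like this (names hypothetical):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // a Java lambda is a plain host object, so graal-js can pass it across threads
    // freely, unlike a JS function that stays pinned to the context that created it
    public class MessageHandlers {

        // single argument only, as the commit notes - so data goes in and out via Maps
        public static Function<Map<String, Object>, Map<String, Object>> echo() {
            return input -> {
                Map<String, Object> out = new HashMap<>();
                out.put("received", input.get("message"));
                return out;
            };
        }
    }

in a feature this would surface as something like * def handler = Java.type('MessageHandlers').echo() followed by * def result = handler({ message: 'hello' })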
@workwithprashant commented Jul 17, 2021

Well, it's going to be painful to set null throughout all the scripts. I was looking at the code in ScenarioRuntime.java - we could save the keys from vars, remove those keys from the nested vars in ScenarioEngine.java, and follow up with runtime.engine.setVariables(allVars).
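
Something like this, going by the class and method names mentioned above - a rough sketch, with all signatures assumed:

    import java.util.HashMap;
    import java.util.Map;

    // rough sketch of the idea (names and signatures assumed, not actual Karate
    // internals): strip the keys that already existed in the calling context from
    // the call result, so only genuinely new variables flow back
    public class VarFilter {

        public static Map<String, Object> newVarsOnly(Map<String, Object> callingVars,
                                                      Map<String, Object> calledVars) {
            Map<String, Object> result = new HashMap<>(calledVars);
            result.keySet().removeAll(callingVars.keySet());
            return result; // the caller would then do runtime.engine.setVariables(result)
        }
    }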

Not sure about the approach, but I'm exploring different options.

@ptrthomas (Member Author)

@workwithprashant well, please suggest a different design then. let me first say that I personally discourage over-use of call - you should never have more than one level of call for test automation, at least that is what I think. I have written a bit about this here: https://stackoverflow.com/a/54126724/143475

so in this case, if you have some complicated re-use, please use Java code and not feature files or JavaScript

a suggestion given in #1675 was to use a Java singleton and write to a temp file if needed. if your tests are complicated, there will be pain.
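
for example, the singleton from that suggestion could be as simple as this - a sketch, class name hypothetical:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.function.Supplier;

    // hypothetical sketch of the #1675 suggestion: hold one-time setup data in a
    // plain Java singleton, persisted to a file so that even parallel runs share it
    public class OnceCache {

        private static final Path FILE = Paths.get("target/once-cache.txt");

        public static synchronized String get(Supplier<String> firstTime) throws Exception {
            if (Files.exists(FILE)) {
                return Files.readString(FILE); // already computed by an earlier run
            }
            String value = firstTime.get(); // runs only on the very first call
            Files.writeString(FILE, value);
            return value;
        }
    }

a feature or karate-config.js could then reach it via Java.type('OnceCache'), keeping the data entirely out of the JS context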

a core principle is that any variable defined using def has to be visible to called features. but I'm eager to hear any proposals

@ptrthomas (Member Author) commented Jul 17, 2021

@workwithprashant I found a reasonable solution: use a "wrapper" variable, because "inner" variables will not be cloned. try this change:

@unexpected
Scenario: Not over-writing nested variable
    * def data = {}
    * data.response = karate.call('classpath:examples/library.feature@getLibraryData')
    * string out = data.response
    * karate.log('FIRST RESPONSE = ', data.response)
    * karate.log('FIRST RESPONSE SIZE = ', out.length)
    * def out = null

    * data.response = karate.call('classpath:examples/library.feature@getLibraryData')
    * string out = data.response
    * karate.log('SECOND RESPONSE = ', data.response)
    * karate.log('SECOND RESPONSE SIZE = ', out.length)
    * def out = null
    
    * data.response = karate.call('classpath:examples/library.feature@getLibraryData')
    * string out = data.response
    * karate.log('THIRD RESPONSE SIZE = ', out.length)
    * karate.log('UNEXPECTED RESPONSE = ', data.response)

i'm setting out to null just to focus on the response

@ptrthomas (Member Author)

@workwithprashant actually, you were right - we were doing an unnecessary variable definition. you can ignore my comment above. made a fix, so it will be great if you can test with the latest version in develop!

@workwithprashant

@ptrthomas Your fix worked beautifully on the sample project.

PREVIOUS EXAMPLE: the called feature returning variables of the calling context...

10:54:28.417 [main] INFO  com.intuit.karate - FIRST RESPONSE SIZE =  331
10:54:28.992 [main] INFO  com.intuit.karate - SECOND RESPONSE SIZE =  2125 
10:54:29.008 [main] INFO  com.intuit.karate - THIRD RESPONSE SIZE =  11919

WITH THE FIX NOW:

11:01:16.150 [main] INFO  com.intuit.karate - FIRST RESPONSE SIZE =  155
11:01:16.165 [main] INFO  com.intuit.karate - SECOND RESPONSE SIZE =  155
11:01:16.178 [main] INFO  com.intuit.karate - THIRD RESPONSE SIZE =  155

I am going to try the fix with performance testing on the actual project and compare JVM heap usage. I will update this thread with my findings (probably within 24 hours).

Thank you @ptrthomas for reconsidering the approach :)

@ptrthomas (Member Author)

@workwithprashant great to hear. apologies for all the grumbling - in the end I'm very glad we caught this. thanks for raising it; this possibly improves the other callSingle() / callonce situation too, because earlier we were passing everything around and now we don't - so cc @aleruz

@ptrthomas ptrthomas added this to the 1.1.0 milestone Jul 18, 2021
@workwithprashant commented Jul 19, 2021

@ptrthomas I have some performance-testing results to share from before and after the fix.

Environment

  • Gatling : 3.1.2
  • JDK : 14
  • Maven : 3.6.3

Setup

  • 100 users ramped up over 100 seconds and sustained for 3600 seconds

Before the FIX

System stats (system where the tests are executed):

  • CPU usage average = 90%
  • JVM usage at the end = 37 GB (growing consistently)
    [screenshot]

Gatling results:

  • total transactions = 48933
    [screenshot]

After the FIX

System stats (system where the tests are executed):

  • CPU usage average = 16% (decreased significantly) 👍
  • JVM usage at the end = 50 GB (growing consistently)
    [screenshot]

Gatling results:

  • total transactions = 256395 (a ~424% increase) 👍
    [screenshot]

Even though the results are much better, I believe we may be able to decrease the JVM usage during performance testing further. When I ran a memory profiler at the start of the test, it looked like the memory used by ScenarioEngine.AttachResult() keeps growing. The following stats are from the first 15-20 minutes of the test. Even though GC runs automatically, the JVM usage trend keeps climbing. (I need to investigate the optimal GC settings to pass to the gatling plugin.)

[screenshot]

Since we don't need a lot of result data during performance testing, I was thinking of detecting when it's a perf test and then skipping some of the result calls that are not needed for Gatling (not sure about the impact on Gatling result collection).
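
Something along these lines - just sketching the idea, names and signatures are hypothetical, not actual Karate internals:

    // when running under Gatling, skip attaching the full result payload
    public class ResultGate {

        public static void maybeAttach(boolean perfMode, Runnable attachResult) {
            if (perfMode) {
                return; // Gatling only needs response times, not the full result
            }
            attachResult.run();
        }
    }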

Do you have any ideas?

@ptrthomas (Member Author)

@workwithprashant this is fantastic work. I will try to avoid the AttachResult part, but it is taking more time than I expected. will keep you posted.

@ptrthomas (Member Author)

@workwithprashant okay, made the change, please try now!

@workwithprashant

After the 6d8aa7c FIX

[screenshot]

It has helped a little, and I will keep investigating. If I find anything more I will raise a new request.

@ptrthomas (Member Author)

@workwithprashant thanks. if you use callonce or callSingle, a similar cloning of data happens.

if you can remove callonce and callSingle for a test run, please try to do that and compare. and then we will know where to focus

@ptrthomas (Member Author)

@workwithprashant an update - I tried to improve the callSingle / callonce code to avoid cloning too many objects. so ignore my message above, just run again with the latest from develop and let me know how it goes

@workwithprashant

@ptrthomas I did not see any changes, so I will investigate by migrating my karate scripts (performance-testing scenarios) to the profiling-test project, which will be easier to debug. I may have to look into the GC config passed to the gatling maven plugin.

[screenshot]

Thank you for all the improvements!!

@workwithprashant

@ptrthomas I had log levels set to ERROR, so I didn't see the following warning:
[WARN ] 2021-07-20 22:10:19.692 [GatlingSystem-akka.actor.default-dispatcher-10] | - attach - immutable map: __gatling

I will try to reproduce this in the sample project.

@ptrthomas (Member Author)

@workwithprashant thanks, that helps. this is the fix for the other issue you reported with gatling + call, but I will think of a better fix - maybe it is this hashmap coming from gatling that is the problem.
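
if it is the immutable map, one possible shape of a fix - a guess, not the actual code:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // a guess at the shape of a fix (not the actual Karate code): if a variable
    // such as __gatling arrives as an immutable map, take a mutable copy before
    // attaching it to the context
    public class AttachHelper {

        @SuppressWarnings("unchecked")
        public static Object mutableCopy(Object value) {
            if (value instanceof Map) {
                return new LinkedHashMap<>((Map<String, Object>) value);
            }
            return value;
        }
    }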

@workwithprashant commented Jul 21, 2021

@ptrthomas I migrated my Karate scripts to the profiling-test project and executed the same performance test as before, without passing any GC-related override settings to the gatling plugin. The JVM heap size keeps increasing, which is especially prominent for longer tests.

[screenshot]

Following is the profiling after the first 5 minutes; these top 3 classes keep growing.

[screenshot]

I was passing the following GC settings to the gatling plugin in my framework:

    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+UseG1GC
    -XX:+UseStringDeduplication
    -XX:+ParallelRefProcEnabled

It looks like JDK 16 has significant changes in GC strategy. I will try running the same tests with JDK 16 and Maven 3.8.1 and will update this thread.

@ptrthomas (Member Author)

@workwithprashant I have been able to get VisualVM running, and with the profiling-test project (I made a few small changes) I see absolutely no memory leaks. but I am on JDK 11!

the next step here, in my opinion, is for you to simulate the problem you see within the profiling-test project - add more call-s, or more JSON / variables etc. meanwhile I think I shall release 1.1.0.RC5

@workwithprashant

@ptrthomas RC5 sounds good!! This effort is going to be an iterative process, and I will try to simulate a similar trend using the profiling-test project, which I can share with you.

@ptrthomas (Member Author)

1.1.0 released

@ptrthomas (Member Author)

note to self to run this benchmark as part of #2546
