Performance Test - scheduled for 2024-07-17 #226

Closed
29 tasks done
jffcamp opened this issue Jul 15, 2024 · 11 comments

@jffcamp
Contributor

jffcamp commented Jul 15, 2024

This was the first performance test executed by NeoLoad.

Primary Objective

We will be following scenario AM, which is similar to scenario J but with Advanced Search Configuration and Data Constants being moved from Group-1 to Group-2. This was done to minimize errors during the performance test.

The purpose of this test is to validate that ML 11.3 performs well enough to be moved to production with the 7/29 Blue/Green switch.

Changes Being Tested

We are primarily testing an upgrade to ML 11.3.0 GA, with the 2024-05-29 dataset.

Other changes compared to both previous performance tests:

Context

Environment and Versions

  • Environment: TST, which is configured to use the BLUE backend.
  • MarkLogic 11.3.0 GA. As part of the upgrade from ML 11.0.3 GA, the /var/opt/MarkLogic data directory was deleted to ensure there were no remnants of an ML 11.2.0 nightly build.
  • Backend v1.21.0 WIP. At the time, the only runtime change since v1.20.0 was the split of an error check within _getSearchTermConfig.
  • Middle Tier 77-78-test-2
  • Frontend v1.30.0
  • Dataset produced on 5/29/24

Backend Application Server Configuration

  • lux-request-group-1 on port 8003: The middle tier is expected to send document, facets, searchEstimate, searchInfo, searchWillMatch, stats, translate, and autoComplete requests here. Maximum of 6 concurrent requests.
  • lux-request-group-2 on port 8004: The middle tier is expected to send relatedList, search, advancedSearchConfig, and dataConstants requests here. Maximum of 12 concurrent requests.
  • Maximum of 18 concurrent requests per node. (A hedged query for checking the per-port split is sketched below.)
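
For reference, a query along the following lines could be run from Athena to check the group-1/group-2 split at the MarkLogic load balancer. It is only a sketch: it assumes the lux_alb_marklogic_blue table exposes a target_port column (as in the standard ALB access-log table definition), which may not match the actual schema.

-- Hedged sketch: requests per MarkLogic app server port (8003 vs. 8004) during the test window.
select target_port, count(*) as count
from lux_alb_marklogic_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by target_port
order by target_port;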

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

  • Confirm the most recent blue-green switch is 100% complete (i.e., no part of TST is using PROD).
  • Deploy code and/or configuration changes that are being tested.
  • Disable middle-tier caching in TST (instructions).
  • Verify LUX trace events are enabled, as well as the v8 delay timeout.
  • Smoke test the front end.
  • Start collecting OS-level metrics (instructions).
  • Start collecting middle-tier metrics (script).
  • QA: Verify/set the ramp-up schedule to 2 simple search VUs, 1 filtered VU, and 1 entity page VU every three minutes until there are 148 users, then hold for 15 minutes. <-- Initially true, but changed after at least one failed test; see this comment for the final ramp-up schedule.
  • QA: Verify the test is configured with three-second wait times.
  • QA: Verify scripts point to TST, https://lux-front-tst.collections.yale.edu/
  • Team: Sign off on the above before proceeding.
  • QA: Start performance test (~12:45).
  • Team: Begin monitoring for v8 engine crashes.
  • Team: Check total request count at 10 minutes.
  • Team: Check total request count at 15 minutes.
  • Team: Check total request count at 20 minutes.
  • QA: Finish performance test.

Collect data

  • Stop collecting OS-level metrics and attach to the ticket.
  • Stop collecting middle-tier metrics and attach to the ticket.
  • [ ] Collect data from AWS and attach to ticket.
  • Collect ML monitoring history (instructions).
  • Collect (script), trim (script), and attach backend logs to the ticket.
  • Pull app server queue metrics (script), attach to the ticket, and record in Perf: Key Metrics. Starting in ML 11.2.0, this information may be included in the ML monitoring history screenshots and exports.
  • Update online spreadsheet tabs with what is known at this point.

Restore and Verify Environment

  • [ ] Revert this test's code and configuration changes
  • Enable middle-tier caching (instructions).
  • Smoke test the front end.

Analyze

  • Upon receipt, review report from QA and update related portions of the online spreadsheet tabs.
  • Mine the backend logs.
  • Determine if the test is valid: Considered valid despite the NeoLoad implementation of the test being in flux and not directly comparable to the previous tests.
  • Determine if the performance is acceptable. Team agrees we can go forward with ML 11.3.0 GA.
@brent-hartwig
Contributor

brent-hartwig commented Jul 18, 2024

ML Monitoring History

Time period: 19:50 - 20:20 UTC (last test of the day with aggressive ramp up)

CPU:

01-cpu

File IO Detail:

02-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

intra-1-of-2

Intra-cluster activity, 2 of 2:

intra-2-of-2

Data node characteristics for the lux-content database alone:

07-database

Exports:

memory-detail-20240718-175928.xls
network-detail-20240718-180345.xls
servers-detail-20240718-180623.xls
xdqp-server requests detail-20240718-180259.xls
cpu-detail-20240718-175455.xls
databases-detail-20240718-180643.xls
file-i_o detail-20240718-175825.xls

@xinjianguo

xinjianguo commented Jul 18, 2024

Status code counts
2024-07-17 19:50:00 - 20:20:00 UTC or 15:50:00 - 16:20:00 EDT

CloudFront (non-frontend routes):
Run from the AWS console using Athena:

select sc_status,count(*) as count
from lux_cloudfront_tst 
where date=date('2024-07-17') and time between '19:50:00' and '20:20:00' 
group by sc_status 
order by sc_status;
Screen Shot 2024-07-18 at 4 12 40 PM

WebCache ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_webcache_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 4 01 56 PM

Middle tier ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_middle_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 3 59 03 PM

MarkLogic ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_marklogic_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 4 19 09 PM

@brent-hartwig
Contributor

Thanks for the request counts and queries, @xinjianguo!

@brent-hartwig
Contributor

brent-hartwig commented Jul 19, 2024

Based on the following chart from QA's report, the ramp-up schedule was five VUs every minute for 24 minutes until reaching a peak of 120, held for just over a minute before a steep drop-off (presumed NeoLoad crash). The five VUs per minute consisted of [TODO: breakdown by flow / transaction].

We switched to a more aggressive ramp-up schedule because NeoLoad was having trouble dealing with LUX errors that we believe LoadRunner handled better; QA is in contact with the vendor on the matter.

image

@brent-hartwig brent-hartwig self-assigned this Jul 19, 2024
@brent-hartwig
Contributor

Trimmed backend logs: 20240717-blue-as-test-backend-logs-trimmed.zip

@brent-hartwig
Contributor

Backend log mining output: 20240717-1950-2020-mined-log-output.zip

@brent-hartwig
Contributor

During #181's performance test, node 217 had higher CPU utilization than the other two nodes (#181's CPU utilization comment). This time, that was the case for node 22: node 22 accounted for 17 of the 18 data points where utilization was over 95%. As part of the upgrade, all of Blue's nodes received new EC2 instances.

image

@brent-hartwig
Contributor

The ratio of requests by type appears to have changed between LoadRunner and NeoLoad. The team concluded the NeoLoad version is still in flux and thus any comparisons could be questionable. Nonetheless, the work is captured here since it was done. (A hedged Athena sketch for approximating such a breakdown follows the supporting tables below.)

image

Tables supporting the above table:

image

image

image
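
For reference, a breakdown like this could be approximated from the middle-tier ALB logs with an Athena query along the following lines. It is only a sketch: it assumes the lux_alb_middle_blue table exposes request_url and that each data-service endpoint name appears in the URL path, neither of which is confirmed here.

-- Hedged sketch: request counts by endpoint type during the test window.
-- Longer endpoint names are listed first so that, e.g., searchEstimate is not captured as search.
select regexp_extract(request_url,
    '(advancedSearchConfig|dataConstants|searchEstimate|searchInfo|searchWillMatch|relatedList|autoComplete|document|facets|stats|translate|search)'
  ) as endpoint,
  count(*) as count
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by 1
order by count desc;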

@brent-hartwig
Contributor

With that caveat, we experienced a decrease in 504s received by the MarkLogic load balancer.

During the 18 Jun performance test, there were 3x as many 504s at the ML load balancer as ML processed. During this test, it was only 0.22x. This reduced pressure on the data service retry mechanism. Rerouting advancedSearchConfig and dataConstants requests may have helped somewhat; however, due to differences between the LoadRunner and NeoLoad implementations of the test, the ratio of requests by type was significantly different, including thousands fewer advancedSearchConfig and dataConstants requests than anticipated. For more on the change in request type composition, see this comment.

image

@brent-hartwig
Contributor

Other observations copied from Teams...

Per CPU utilization, we know the NeoLoad test pushed MarkLogic hard, and that neither the v8 engine crashed nor the MarkLogic process restarted; all good.

But I otherwise find it hard to compare yesterday's test to the previous tests.

A metric I like checking is the number of facet requests per search request. During the 18 Jun test, there were 11.86 facet requests per search request. During yesterday's test, there were only 6.10 facet requests per search request. That's somewhat concerning given all/most search results tabs have more than six facets.
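
For what it is worth, this ratio could be approximated from the middle-tier ALB logs with an Athena query along these lines. It is only a sketch: it assumes the lux_alb_middle_blue table exposes request_url and that the facets and search endpoints are distinguishable by path; the actual routes may differ.

-- Hedged sketch: facet requests per search request during the test window.
-- The LIKE patterns are guesses at the real endpoint paths.
select
  sum(case when request_url like '%/facets%' then 1 else 0 end) as facet_requests,
  sum(case when request_url like '%/search?%' then 1 else 0 end) as search_requests,
  cast(sum(case when request_url like '%/facets%' then 1 else 0 end) as double)
    / nullif(sum(case when request_url like '%/search?%' then 1 else 0 end), 0) as facets_per_search
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';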

I find it odd that there were over 18K fewer advanced search configuration and data constants requests (combined) than expected, but I doubt that had a material effect given they are the equivalent of document requests, as all are very lightweight.

I am surprised that there were any failed advanced search configuration or data constants requests, given that the successful requests were served by the second app server and that app server never registered any queued requests.

@brent-hartwig
Contributor

Executive Summary

  • MarkLogic 11.3.0 is approved for production.
  • We are to rethink our approach to performance testing. We need to avoid making accommodations for the performance test and should have the web cache enabled, as it always is in production.
  • We are to update the performance test to align with present-day production usage.
  • No need to spend additional cycles trying to reconcile differences between LoadRunner and NeoLoad.
