Performance Test - scheduled for 2024-07-17 #226

Closed
29 tasks done
jffcamp opened this issue Jul 15, 2024 · 11 comments

@jffcamp
Contributor

jffcamp commented Jul 15, 2024

This was the first performance test executed by NeoLoad.

Primary Objective

We will be following scenario AM, which is similar to scenario J but with Advanced Search Configuration and Data Constants being moved from Group-1 to Group-2. This was done to minimize errors during the performance test.

The purpose of this test is to validate that ML 11.3 performs well enough to be moved to production with the 7/29 Blue/Green switch.

Changes Being Tested

We are primarily testing an upgrade to ML 11.3.0 GA, with the 2024-05-29 dataset.

Other changes compared to both previous performance tests:

Context

Environment and Versions

  • Environment: TST, which is configured to use the BLUE backend.
  • MarkLogic 11.3.0 GA. As part of the upgrade from ML 11.0.3 GA, the /var/opt/MarkLogic data directory was deleted to ensure there were no remnants of an ML 11.2.0 nightly build.
  • Backend v1.21.0 WIP. At the time, the only runtime change since v1.20.0 was the split of an error check within _getSearchTermConfig.
  • Middle Tier 77-78-test-2
  • Frontend v1.30.0
  • Dataset produced on 5/29/24

Backend Application Server Configuration

  • lux-request-group-1 on port 8003: The middle tier is expected to send document, facets, searchEstimate, searchInfo, searchWillMatch, stats, translate, and autoComplete requests here. Maximum of 6 concurrent requests.
  • lux-request-group-2 on port 8004: The middle tier is expected to send relatedList, search, advancedSearchConfig, and dataConstants requests here. Maximum of 12 concurrent requests.
  • Maximum of 18 concurrent requests per node. (A hedged query for checking the per-port split is sketched below.)
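
For reference, a query along the following lines could be run from Athena to check the group-1/group-2 split at the MarkLogic load balancer. It is only a sketch: it assumes the lux_alb_marklogic_blue table exposes a target_port column (as in the standard ALB access-log table definition), which may not match the actual schema.

-- Hedged sketch: requests per MarkLogic app server port (8003 vs. 8004) during the test window.
select target_port, count(*) as count
from lux_alb_marklogic_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by target_port
order by target_port;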

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

  • Confirm the most recent blue-green switch is 100% complete (i.e., no part of TST is using PROD).
  • Deploy code and/or configuration changes that are being tested.
  • Disable middle-tier caching in TST (instructions).
  • Verify LUX trace events are enabled, as well as the v8 delay timeout.
  • Smoke test the front end.
  • Start collecting OS-level metrics (instructions).
  • Start collecting middle-tier metrics (script).
  • QA: Verify/set the ramp-up schedule to 2 simple search VUs, 1 filtered VU, and 1 entity page VU every three minutes until there are 148 users, then hold for 15 minutes. <-- Initially true, but changed after at least one failed test; see this comment for the final ramp-up schedule.
  • QA: Verify the test is configured with three-second wait times.
  • QA: Verify scripts point to TST, https://lux-front-tst.collections.yale.edu/
  • Team: Sign off on the above before proceeding.
  • QA: Start performance test (~12:45).
  • Team: Begin monitoring for v8 engine crashes.
  • Team: Check total request count at 10 minutes.
  • Team: Check total request count at 15 minutes.
  • Team: Check total request count at 20 minutes.
  • QA: Finish performance test.

Collect data

  • Stop collecting OS-level metrics and attach to the ticket.
  • Stop collecting middle-tier metrics and attach to the ticket.
  • [ ] Collect data from AWS and attach to ticket.
  • Collect ML monitoring history (instructions).
  • Collect (script), trim (script), and attach backend logs to the ticket.
  • Pull app server queue metrics (script), attach to the ticket, and record in Perf: Key Metrics. Starting in ML 11.2.0, this information may be included in the ML monitoring history screenshots and exports.
  • Update online spreadsheet tabs with what is known at this point.

Restore and Verify Environment

  • [ ] Revert this test's code and configuration changes
  • Enable middle-tier caching (instructions).
  • Smoke test the front end.

Analyze

  • Upon receipt, review report from QA and update related portions of the online spreadsheet tabs.
  • Mine the backend logs.
  • Determine if the test is valid: Considered valid despite the NeoLoad implementation of the test being in flux and not directly comparable to the previous tests.
  • Determine if the performance is acceptable. Team agrees we can go forward with ML 11.3.0 GA.
@brent-hartwig
Contributor

brent-hartwig commented Jul 18, 2024

ML Monitoring History

Time period: 19:50 - 20:20 UTC (last test of the day with aggressive ramp up)

CPU:

01-cpu

File IO Detail:

02-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

intra-1-of-2

Intra-cluster activity, 2 of 2:

intra-2-of-2

Data node characteristics for the lux-content database alone:

07-database

Exports:

memory-detail-20240718-175928.xls
network-detail-20240718-180345.xls
servers-detail-20240718-180623.xls
xdqp-server requests detail-20240718-180259.xls
cpu-detail-20240718-175455.xls
databases-detail-20240718-180643.xls
file-i_o detail-20240718-175825.xls

@xinjianguo

xinjianguo commented Jul 18, 2024

Status code counts
2024-07-17 19:50:00 - 20:20:00 UTC or 15:50:00 - 16:20:00 EDT

CloudFront (non-frontend routes):
Run from the AWS console using Athena:

select sc_status,count(*) as count
from lux_cloudfront_tst 
where date=date('2024-07-17') and time between '19:50:00' and '20:20:00' 
group by sc_status 
order by sc_status;
Screen Shot 2024-07-18 at 4 12 40 PM

WebCache ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_webcache_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 4 01 56 PM

Middle tier ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_middle_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 3 59 03 PM

MarkLogic ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_marklogic_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
Screen Shot 2024-07-18 at 4 19 09 PM

@brent-hartwig
Contributor

Thanks for the request counts and queries, @xinjianguo!

@brent-hartwig
Contributor

brent-hartwig commented Jul 19, 2024

Based on the following chart from QA's report, the ramp-up schedule was five VUs every minute for 24 minutes until reaching a peak of 120, held for just over a minute before a steep drop-off (presumed NeoLoad crash). The five VUs per minute consisted of [TODO: breakdown by flow / transaction].

We switched to a more aggressive ramp-up schedule because NeoLoad was having trouble dealing with LUX errors that we believe LoadRunner handled better; QA is in contact with the vendor on the matter.

image

@brent-hartwig brent-hartwig self-assigned this Jul 19, 2024
@brent-hartwig
Contributor

Trimmed backend logs: 20240717-blue-as-test-backend-logs-trimmed.zip

@brent-hartwig
Contributor

Backend log mining output: 20240717-1950-2020-mined-log-output.zip

@brent-hartwig
Contributor

During #181's performance test, node 217 had higher CPU utilization than the other two nodes (#181's CPU utilization comment). This time, that was the case for node 22: node 22 accounted for 17 of the 18 data points where utilization was over 95%. As part of the upgrade, all of Blue's nodes received new EC2 instances.

image

@brent-hartwig
Contributor

The ratio of requests by type appears to have changed between LoadRunner and NeoLoad. The team concluded the NeoLoad version is still in flux and thus any comparisons could be questionable. Nonetheless, the work is captured here since it was done. (A hedged Athena sketch for approximating such a breakdown follows the supporting tables below.)

image

Tables supporting the above table:

image

image

image
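
For reference, a breakdown like this could be approximated from the middle-tier ALB logs with an Athena query along the following lines. It is only a sketch: it assumes the lux_alb_middle_blue table exposes request_url and that each data-service endpoint name appears in the URL path, neither of which is confirmed here.

-- Hedged sketch: request counts by endpoint type during the test window.
-- Longer endpoint names are listed first so that, e.g., searchEstimate is not captured as search.
select regexp_extract(request_url,
    '(advancedSearchConfig|dataConstants|searchEstimate|searchInfo|searchWillMatch|relatedList|autoComplete|document|facets|stats|translate|search)'
  ) as endpoint,
  count(*) as count
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by 1
order by count desc;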

@brent-hartwig
Contributor

With that caveat, we experienced a decrease in 504s received by the MarkLogic load balancer.

During the 18 Jun performance test, there were 3x as many 504s at the ML load balancer as ML processed. During this test, it was only 0.22x. This reduced pressure on the data service retry mechanism. Rerouting advancedSearchConfig and dataConstants requests may have helped somewhat; however, due to differences between the LoadRunner and NeoLoad implementations of the test, the ratio of requests by type was significantly different, including thousands fewer advancedSearchConfig and dataConstants requests than anticipated. For more on the change in request type composition, see this comment.

image

@brent-hartwig
Contributor

Other observations copied from Teams...

Per CPU utilization, we know the NeoLoad test pushed MarkLogic hard, and that neither the v8 engine crashed nor the MarkLogic process restarted; all good.

But I otherwise find it hard to compare yesterday's test to the previous tests.

A metric I like checking is the number of facet requests per search request. During the 18 Jun test, there were 11.86 facet requests per search request. During yesterday's test, there were only 6.10 facet requests per search request. That's somewhat concerning given all/most search results tabs have more than six facets.
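
For what it is worth, this ratio could be approximated from the middle-tier ALB logs with an Athena query along these lines. It is only a sketch: it assumes the lux_alb_middle_blue table exposes request_url and that the facets and search endpoints are distinguishable by path; the actual routes may differ.

-- Hedged sketch: facet requests per search request during the test window.
-- The LIKE patterns are guesses at the real endpoint paths.
select
  sum(case when request_url like '%/facets%' then 1 else 0 end) as facet_requests,
  sum(case when request_url like '%/search?%' then 1 else 0 end) as search_requests,
  cast(sum(case when request_url like '%/facets%' then 1 else 0 end) as double)
    / nullif(sum(case when request_url like '%/search?%' then 1 else 0 end), 0) as facets_per_search
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';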

I find it odd that there were over 18K fewer advanced search configuration and data constants requests (combined) than expected, but I doubt that had a material effect given they are the equivalent of document requests, as all are very lightweight.

I am surprised that there were any failed advanced search configuration or data constants requests, given that the successful requests were served by the second app server and that app server never registered any queued requests.

@brent-hartwig
Contributor

Executive Summary

  • MarkLogic 11.3.0 is approved for production.
  • We are to rethink our approach to performance testing. We need to avoid making accommodations for the performance test and should have the web cache enabled, as it always is in production.
  • We are to update the performance test to align with present-day production usage.
  • No need to spend additional cycles trying to reconcile differences between LoadRunner and NeoLoad.
