
Try out a large data test #49

Closed
3 tasks done
gordonwatts opened this issue Apr 18, 2024 · 19 comments
Labels: perf test · performance · servicex

@gordonwatts commented Apr 18, 2024

Use one of Alex's very large datasets and run the simple single dataset test.

  • Save the query with a `servicex_query_cache.json` file so we don't have to re-run the transform when we come back to this later (see the sketch below this list).
  • Find a large dataset from Alex's file.
  • Get @ivukotic to describe what he changed in SX for posterity.
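
A minimal sketch (assumption: `servicex_query_cache.json` is a plain JSON file in the working directory; its internal layout is treated as opaque here) for checking whether a cached query already exists before kicking off a run:

    import json
    from pathlib import Path

    # The local ServiceX query cache mentioned in the task list above.
    # We only peek at the top-level keys; the file's structure is an assumption.
    cache_file = Path("servicex_query_cache.json")
    if cache_file.exists():
        cache = json.loads(cache_file.read_text())
        print(f"Found a local query cache with entries: {list(cache)[:10]}")
    else:
        print("No local query cache - the transform will be (re)submitted.")
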
@gordonwatts added the performance and servicex labels on Apr 18, 2024
@gordonwatts added this to the Week 3 milestone on Apr 18, 2024
@gordonwatts self-assigned this on Apr 18, 2024
@gordonwatts

(screenshot)

@gordonwatts

Let's use:

    "data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026": {
        "nevts": 6367686831,
        "nfiles": 64803,
        "size_TB": 49.632
    },
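
For scale, those numbers work out to roughly 98k events and 0.77 GB per file; a quick back-of-the-envelope check:

    # Numbers quoted for the data18 PHYSLITE container above.
    nevts = 6_367_686_831
    nfiles = 64_803
    size_tb = 49.632

    print(f"events per file: {nevts / nfiles:,.0f}")                # ~98,262
    print(f"avg file size:   {size_tb * 1e6 / nfiles:,.0f} MB")     # ~766 MB
    print(f"avg event size:  {size_tb * 1e12 / nevts:,.0f} bytes")  # ~7,800 bytes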

@ivukotic commented Apr 18, 2024

I set the threshold for the transformer HPA to 5% (normally it is 30%). This will make for a much faster ramp-up.

Dask worker settings:

    args: ["dask", "worker", "--nworkers", "8", "--nthreads", "1", "--memory-limit", "8GB", "--death-timeout", "60"]
    resources:
      limits:
        cpu: "10"
        memory: 20G
      requests:
        cpu: "8"
        memory: 10G
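
For reference, a 5% CPU target on a Kubernetes autoscaling/v2 HPA looks roughly like the sketch below; the object names and replica bounds are illustrative, not the actual ServiceX chart values:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: transformer-hpa            # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: transformer              # hypothetical deployment name
      minReplicas: 1
      maxReplicas: 2000                # illustrative upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 5    # the 5% threshold mentioned above (normally 30%)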

@gordonwatts

Took almost 6 minutes for the DID finder to find the files - so nothing happened for 6 minutes.

@ivukotic

it is a lot of files... I still don't see transformers starting?

@gordonwatts commented Apr 18, 2024

They took a good chunk of time before they were inserted into the DB... It was inserting them at a rate of about 300 files every three seconds.

I guess SX is not starting transformers until after all the files have been put in the DB?
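
A quick sanity check on that rate (assuming it held steady, which it did not, as discussed further down):

    # 300 files every 3 seconds is a nominal insertion rate of ~100 files/s.
    nfiles = 64_803
    rate = 300 / 3  # files per second
    print(f"{nfiles / rate / 60:.0f} minutes to insert all files at that rate")  # ~11 minutes

So at the nominal rate the full file list would load in about 11 minutes; the roughly 39 minutes actually spent after the rucio lookup (see the timing comment below) implies the insertion rate dropped as the run went on.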

@gordonwatts

It is currently sitting at 26K files...

@gordonwatts

Now 34K files...

@gordonwatts

Ok - the transform was submitted at 12:29 pm and it finished loading files at 1:14 pm. That is 45 minutes total: about 6 minutes for the rucio lookup, plus whatever it does after the rucio lookup.

@gordonwatts

Still don't see transformers...

@gordonwatts

But wait - oddly, it is adding more files than @alexander-held specified...

@gordonwatts

With @ivukotic's help:

    [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --num-files 0 --dataset mc_1TB
    0000.8002 - INFO - Using release 22.2.107
    0000.8003 - INFO - Building ServiceX query
    0000.8004 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
    0000.8522 - INFO - Starting ServiceX query
    0470.7920 - INFO - Running servicex query for d683de10-8015-4ecf-9c4e-04f8987aa381 took 0:07:46.980131 (no files downloaded)
    0470.8000 - INFO - Finished ServiceX query
    0470.8039 - INFO - Using `uproot.dask` to open files
    0471.4034 - INFO - Generating the dask compute graph for 27 fields
    0471.4931 - INFO - Computing the total count

That was a very long tail for the last 30 or so files, for whatever reason.

(screenshot: S3 monitoring)

@gordonwatts

We saturated an internal 80 Gbps network switch. We think this is what caused some S3 copy errors.

(screenshot)

Controlling the number of transformer pods brought us back down below the 80 Gbps limit.
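
For scale, a rough lower bound on what the 80 Gbps limit means for this dataset (ignoring protocol overhead and the fact that only a subset of each file's branches is written back out):

    # 80 Gbps switch limit vs. the 49.632 TB input container.
    switch_gbps = 80
    dataset_tb = 49.632
    seconds = dataset_tb * 8e12 / (switch_gbps * 1e9)
    print(f"{seconds / 3600:.1f} hours to move the full dataset at line rate")  # ~1.4 hours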

@gordonwatts

Saw a big difference in the efficiency of the pods on River as compared to AF: AF was running at about 8% CPU, while River was seeing 45% CPU. Could that be networking? Not known. Tracked here.

@gordonwatts

According to @ivukotic, SX pod scaling was modified so that it started with 10 nodes, and 1% busy spawned new ones. This made the scaling much faster than what we've previously used.

It is clear we need better scaling - we'd end up with more pods than files at various points.
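
One knob for capping how aggressively pods are added is the `behavior` section of an autoscaling/v2 HPA; a sketch that would slot into the HPA object sketched earlier in the thread (the policy numbers are illustrative, not tuned values):

    spec:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 100         # add at most 100 pods...
              periodSeconds: 60  # ...per minute
        scaleDown:
          stabilizationWindowSeconds: 300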

@gordonwatts commented Apr 19, 2024

Even with lots of transformer pods running (about 1500), S3 wasn't looking stressed (this was write only!).

(screenshot)

@gordonwatts

We saw a real slowdown of the number of files being added to the dataset as a function of time:

(screenshot)

We think that was due to the network switch saturation - we really reduced the number of pods (I think to 100?), and the insertion speed picked up again.

(screenshot)

That plot shows, as a function of time, how often a list of 30 files was inserted into the SX processing queue. You can see a large number at first, and then it rapidly declines.

@gordonwatts commented Apr 19, 2024

The problem of extra files:

(screenshot)

This dataset has 64,803 files. And, indeed, it gets up to that number and then stabilizes for 5 minutes. Then new files start coming in again.

(screenshot)

We believe this is because it takes more than 30 minutes for all the files to be inserted: RabbitMQ thinks the DID finder has died, takes the message back, and then re-sends it.

Tracked in #53.
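
If the root cause is RabbitMQ's delivery-acknowledgement timeout (its default `consumer_timeout` is 30 minutes, which matches the behavior described above), one mitigation would be raising it in `rabbitmq.conf`; shown here only as an illustration, not as what was actually deployed:

    # rabbitmq.conf: give consumers up to 2 hours to ack a delivery
    # (value in milliseconds; the default of 1800000 ms is 30 minutes)
    consumer_timeout = 7200000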

@gordonwatts

We'll need to perform some debugging before we can re-run this. So this is done!

@gordonwatts added the perf test label on May 13, 2024