
Try out a large data test #49

Closed
3 tasks done
gordonwatts opened this issue Apr 18, 2024 · 19 comments
Labels: perf test · performance · servicex

@gordonwatts commented Apr 18, 2024

Use one of Alex's very large datasets and run the simple single dataset test.

  • Save the query with a `servicex_query_cache.json` file so we don't have to re-run the transform when we come back to this later (see the sketch below this list).
  • Find a large dataset from Alex's file.
  • Get @ivukotic to describe what he changed in SX for posterity.
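
A minimal sketch (assumption: `servicex_query_cache.json` is a plain JSON file in the working directory; its internal layout is treated as opaque here) for checking whether a cached query already exists before kicking off a run:

    import json
    from pathlib import Path

    # The local ServiceX query cache mentioned in the task list above.
    # We only peek at the top-level keys; the file's structure is an assumption.
    cache_file = Path("servicex_query_cache.json")
    if cache_file.exists():
        cache = json.loads(cache_file.read_text())
        print(f"Found a local query cache with entries: {list(cache)[:10]}")
    else:
        print("No local query cache - the transform will be (re)submitted.")
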
@gordonwatts added the performance and servicex labels on Apr 18, 2024
@gordonwatts added this to the Week 3 milestone on Apr 18, 2024
@gordonwatts self-assigned this on Apr 18, 2024
@gordonwatts

(screenshot)

@gordonwatts

Let's use:

    "data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026": {
        "nevts": 6367686831,
        "nfiles": 64803,
        "size_TB": 49.632
    },
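
For scale, those numbers work out to roughly 98k events and 0.77 GB per file; a quick back-of-the-envelope check:

    # Numbers quoted for the data18 PHYSLITE container above.
    nevts = 6_367_686_831
    nfiles = 64_803
    size_tb = 49.632

    print(f"events per file: {nevts / nfiles:,.0f}")                # ~98,262
    print(f"avg file size:   {size_tb * 1e6 / nfiles:,.0f} MB")     # ~766 MB
    print(f"avg event size:  {size_tb * 1e12 / nevts:,.0f} bytes")  # ~7,800 bytes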

@ivukotic commented Apr 18, 2024

I set the threshold for the transformer HPA to 5% (normally it is 30%). This will make for a much faster ramp-up.

Dask worker settings:

    args: ["dask", "worker", "--nworkers", "8", "--nthreads", "1", "--memory-limit", "8GB", "--death-timeout", "60"]
    resources:
      limits:
        cpu: "10"
        memory: 20G
      requests:
        cpu: "8"
        memory: 10G
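
For reference, a 5% CPU target on a Kubernetes autoscaling/v2 HPA looks roughly like the sketch below; the object names and replica bounds are illustrative, not the actual ServiceX chart values:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: transformer-hpa            # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: transformer              # hypothetical deployment name
      minReplicas: 1
      maxReplicas: 2000                # illustrative upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 5    # the 5% threshold mentioned above (normally 30%)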

@gordonwatts

Took almost 6 minutes for the DID finder to find the files - so nothing happened for 6 minutes.

@ivukotic

it is a lot of files... I still don't see transformers starting?

@gordonwatts commented Apr 18, 2024

They took a good chunk of time before they were inserted into the DB... It was inserting them at a rate of about 300 files every three seconds.

I guess SX is not starting transformers until after all the files have been put in the DB?
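
A quick sanity check on that rate (assuming it held steady, which it did not, as discussed further down):

    # 300 files every 3 seconds is a nominal insertion rate of ~100 files/s.
    nfiles = 64_803
    rate = 300 / 3  # files per second
    print(f"{nfiles / rate / 60:.0f} minutes to insert all files at that rate")  # ~11 minutes

So at the nominal rate the full file list would load in about 11 minutes; the roughly 39 minutes actually spent after the rucio lookup (see the timing comment below) implies the insertion rate dropped as the run went on.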

@gordonwatts

It is currently sitting at 26K files...

@gordonwatts

Now 34K files...

@gordonwatts

Ok - the transform was submitted at 12:29 pm and it finished loading files at 1:14 pm. That is 45 minutes total: about 6 minutes for the rucio lookup, plus whatever it does after the rucio lookup.

@gordonwatts

Still don't see transformers...

@gordonwatts

But wait - oddly, it is adding more files than @alexander-held specified...

@gordonwatts

With @ivukotic's help:

    [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --num-files 0 --dataset mc_1TB
    0000.8002 - INFO - Using release 22.2.107
    0000.8003 - INFO - Building ServiceX query
    0000.8004 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
    0000.8522 - INFO - Starting ServiceX query
    0470.7920 - INFO - Running servicex query for d683de10-8015-4ecf-9c4e-04f8987aa381 took 0:07:46.980131 (no files downloaded)
    0470.8000 - INFO - Finished ServiceX query
    0470.8039 - INFO - Using `uproot.dask` to open files
    0471.4034 - INFO - Generating the dask compute graph for 27 fields
    0471.4931 - INFO - Computing the total count

That was a very long tail for the last 30 or so files, for whatever reason.

(screenshot: S3 monitoring)

@gordonwatts

We saturated an internal 80 Gbps network switch. We think this is what caused some S3 copy errors.

(screenshot)

Controlling the number of transformer pods brought us back down below the 80 Gbps limit.
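
For scale, a rough lower bound on what the 80 Gbps limit means for this dataset (ignoring protocol overhead and the fact that only a subset of each file's branches is written back out):

    # 80 Gbps switch limit vs. the 49.632 TB input container.
    switch_gbps = 80
    dataset_tb = 49.632
    seconds = dataset_tb * 8e12 / (switch_gbps * 1e9)
    print(f"{seconds / 3600:.1f} hours to move the full dataset at line rate")  # ~1.4 hours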

@gordonwatts

Saw a big difference in the efficiency of the pods on River as compared to AF: AF was running at about 8% CPU, while River was seeing 45% CPU. Could that be networking? Not known. Tracked here.

@gordonwatts

According to @ivukotic, SX pod scaling was modified so that it started with 10 nodes, and 1% busy spawned new ones. This made the scaling much faster than what we've previously used.

It is clear we need better scaling - we'd end up with more pods than files at various points.
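
One knob for capping how aggressively pods are added is the `behavior` section of an autoscaling/v2 HPA; a sketch that would slot into the HPA object sketched earlier in the thread (the policy numbers are illustrative, not tuned values):

    spec:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 100         # add at most 100 pods...
              periodSeconds: 60  # ...per minute
        scaleDown:
          stabilizationWindowSeconds: 300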

@gordonwatts commented Apr 19, 2024

Even with lots of transformer pods running (about 1500), S3 wasn't looking stressed (this was write only!).

(screenshot)

@gordonwatts

We saw a real slowdown of the number of files being added to the dataset as a function of time:

(screenshot)

We think that was due to the network switch saturation - we really reduced the number of pods (I think to 100?), and the insertion speed picked up again.

(screenshot)

That plot shows, as a function of time, how often a list of 30 files was inserted into the SX processing queue. You can see a large number at first, and then it rapidly declines.

@gordonwatts commented Apr 19, 2024

The problem of extra files:

(screenshot)

This dataset has 64,803 files. And, indeed, it gets up to that number and then stabilizes for 5 minutes. Then new files start coming in again.

(screenshot)

We believe this is because it takes more than 30 minutes for all the files to be inserted: RabbitMQ thinks the DID finder has died, takes the message back, and then re-sends it.

Tracked in #53.
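
If the root cause is RabbitMQ's delivery-acknowledgement timeout (its default `consumer_timeout` is 30 minutes, which matches the behavior described above), one mitigation would be raising it in `rabbitmq.conf`; shown here only as an illustration, not as what was actually deployed:

    # rabbitmq.conf: give consumers up to 2 hours to ack a delivery
    # (value in milliseconds; the default of 1800000 ms is 30 minutes)
    consumer_timeout = 7200000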

@gordonwatts

We'll need to perform some debugging before we can re-run this. So this is done!

@gordonwatts added the perf test label on May 13, 2024