Try out a large data test #49
Let's use:

"data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026": {
    "nevts": 6367686831,
    "nfiles": 64803,
    "size_TB": 49.632
},
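For a sense of scale, quick arithmetic from the spec above (values copied verbatim):

```python
# Values copied from the dataset spec above.
nevts = 6_367_686_831
nfiles = 64_803
size_tb = 49.632

print(f"avg file size  : {size_tb * 1e12 / nfiles / 1e9:.2f} GB")  # ~0.77 GB per file
print(f"avg events/file: {nevts / nfiles:,.0f}")                   # ~98,000 events per file
```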
I set the threshold for the transformer HPA to 5% (normally it is 30%). This will make for a much faster ramp-up.
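For context, the stock Kubernetes HPA asks for `ceil(currentReplicas * measuredUtilization / targetUtilization)` replicas, so lowering the target from 30% to 5% roughly multiplies each scale-up step by six. A minimal sketch of that arithmetic, with purely illustrative numbers:

```python
import math

def desired_replicas(current: int, measured_cpu_pct: float, target_cpu_pct: float) -> int:
    """Standard Kubernetes HPA scaling rule (ignoring stabilization windows and replica caps)."""
    return math.ceil(current * measured_cpu_pct / target_cpu_pct)

# Illustrative numbers only: 10 transformer pods sitting at 60% CPU.
print(desired_replicas(10, 60, 30))  # 30% target -> 20 pods
print(desired_replicas(10, 60, 5))   # 5% target  -> 120 pods
```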
Took almost 6 minutes for the DID finder to find the files - so nothing happened for 6 minutes.
It is a lot of files... I still don't see transformers starting.
The files took a good chunk of time before they were inserted into the DB... It was going at a rate of about 300 files every three seconds. I guess SX is not starting transformers until after all the files have been put in the DB?
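That is roughly 100 files/s, so a naive projection for the whole dataset, assuming the initial rate held:

```python
# Naive projection from the initial insertion rate quoted above.
rate = 300 / 3            # ~100 files/s
total_files = 64_803
print(f"~{total_files / rate / 60:.0f} min to insert everything at that rate")  # ~11 min
```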
It is currently sitting at 26K files...
Now 34K files...
OK - the transform was submitted at 12:29 pm and finished loading files at 1:14 pm. That means 45 minutes total: the Rucio lookup (about 6 minutes) plus whatever it does after the lookup.
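Working backwards from those timestamps (rough arithmetic), the average insertion rate was well below the ~100 files/s seen at the start:

```python
# 12:29 pm submit -> 1:14 pm files loaded = 45 min total; ~6 min of that was the Rucio lookup.
insert_seconds = (45 - 6) * 60
print(f"~{64_803 / insert_seconds:.0f} files/s average during insertion")  # ~28 files/s
```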
Still don't see transformers...
But wait - oddly, it is adding more files than @alexander-held specified.
With @ivukotic's help:
There was a very long tail for the last 30 or so files, for whatever reason (seen in the S3 monitoring).
We saw a big difference in the efficiency of the pods on River compared to the AF. The AF pods were running at about 8% CPU, while River was seeing 45% CPU. Could that be networking? Not known. Tracked here.
According to @ivukotic, SX pod scaling was modified so that it started with 10 nodes, and 1% busy spawned new ones. This made the scaling much faster than what we've previously used. It is clear we need better scaling - we'd end up with more pods than files at various points.
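The "more pods than files" behavior follows from the same standard HPA rule as above: with a 1% target, even modest utilization requests a huge replica count in a single step, which can easily exceed the number of files still waiting near the tail of the run (illustrative numbers only):

```python
import math

# Standard HPA rule: desired = ceil(current * measured_cpu / target_cpu).
# 10 pods at even 20% CPU against a 1% target request 200 replicas in one step.
print(math.ceil(10 * 20 / 1))  # -> 200
```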
We saw a real slowdown in the number of files being added to the dataset as a function of time. We think that was due to network switch saturation - we reduced the number of pods significantly (to 100, I think?), and the insertion speed picked up again. The plot shows, as a function of time, how frequently a list of 30 files was inserted into the SX processing queue: you can see a large number at first, and then it rapidly declines.
The problem of extra files: this dataset has 64,803 files, and indeed the count gets up to that number and then stabilizes for 5 minutes. Then new files start coming in again. We believe this is because it takes more than 30 minutes for the files to be inserted; RabbitMQ thinks the DID finder has died, takes the message back, and re-sends it. Tracked in #53.
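For reference, this is the standard behavior when a consumer holds an unacknowledged delivery past the broker's consumer timeout (30 minutes by default in recent RabbitMQ releases): the broker assumes the consumer is dead and requeues the message. A minimal sketch of the consume/ack pattern involved - pika is used here only for illustration, and the queue name and handler are hypothetical, not the actual DID-finder code:

```python
import pika

QUEUE = "did_requests"  # hypothetical queue name; the real DID-finder wiring differs

def process_did_request(body: bytes) -> None:
    """Stand-in for the real work: the Rucio lookup plus inserting ~65K files into the DB."""
    ...

def on_request(ch, method, properties, body):
    # If this takes longer than the broker's consumer timeout (default 30 minutes in
    # recent RabbitMQ releases), RabbitMQ assumes the consumer died, requeues the
    # unacknowledged delivery, and the whole request gets processed a second time.
    process_did_request(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.basic_consume(queue=QUEUE, on_message_callback=on_request)
channel.start_consuming()
```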
We'll need to perform some debugging before we can re-run this. So this is done!
Use one of Alex's very large datasets and run the simple single dataset test. Keep the servicex_query_cache.json file so we don't have to re-run when running later.
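For reference, the simple single-dataset test against the DID above would look roughly like the sketch below. This is a sketch only: the client API names (ServiceXDataset, backend_name, get_data_rootfiles) are assumed from the 2.x-era ServiceX Python frontend and may differ in the version actually used, and the selection query is elided. The point of keeping servicex_query_cache.json, per the issue text, is that a later run can reuse the cached result instead of resubmitting.

```python
from servicex import ServiceXDataset  # assumed 2.x-era frontend; names may differ by version

# DID from the comment at the top of this issue.
DID = ("data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont."
       "DAOD_PHYSLITE.grp18_v01_p6026")

ds = ServiceXDataset(DID, backend_name="uproot")

# The actual selection is elided here; whatever simple single-dataset query
# Alex used goes in its place.
selection = "..."  # placeholder, not a real query
files = ds.get_data_rootfiles(selection)
print(f"{len(files)} files delivered")
```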