Hi @vfrank66, thanks very much for the detailed write-up. I guess the first thing to say is that, from the looks of things, you've done at least as much testing as us, so this is a very useful example and may imply the current recommendation to use DuckDB isn't quite right. I should also say up front there is no plan to remove Spark support in Splink 4 (it's already there, and working in the prereleases). A few thoughts:
Hello, I know this has partially been discussed several times over, but I think I must be misunderstanding something. I am questioning the recommendation to use DuckDB, primarily due to performance. I see that for Splink 4 the recommendation is to use DuckDB for all cases if you can get a large enough machine, and I personally do not understand this recommendation. There is too much code to show, so I will try to describe the past few weeks of runs with as little code as possible.
I have a large dataset of records to dedupe. This all runs in AWS.
Spark - AWS EMR
11 million record dataset - training and prediction complete in 3 hours.
Master: c6.8xlarge; Core: 3x c6gd.8xlarge, 5x c6gd.12xlarge
DuckDB - AWS Batch (EC2 with ECS)
500,000 record dataset (started at 16 million records, but that was not going to run successfully) - fails after 2 hours.
1x x2idn.32xlarge with a 500 GB SSD EBS volume (I just can't believe I need this)
DuckDB runs out of memory after 2 hours, during training. This occurs while training on ssn, which is 70% null (I know, I know, I am running too many comparisons and this is a bad column, but it does help inform the m values for the other comparisons, and it runs in Spark).
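Roughly, the step that dies is the EM pass blocked on SSN. A simplified sketch (Splink 3.x DuckDB backend; comparisons, argument values and the blocking rule are placeholders, not my exact code):

```python
from splink.duckdb.linker import DuckDBLinker

# Simplified sketch, not my actual model.
# df and settings are as in the setup further down.
linker = DuckDBLinker(df, settings)

# u values from random sampling, then m values via EM.
linker.estimate_u_using_random_sampling(max_pairs=1e7)

# The EM pass blocked on SSN is where DuckDB runs out of memory;
# ssn is ~70% null, so the candidate pair count is enormous.
linker.estimate_parameters_using_expectation_maximisation("l.ssn = r.ssn")
```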
Here is what I am seeing in DuckDB. I am running in AWS Batch on EC2 instances via an ECS task definition, so my system memory will be lower than my container memory. I can increase my container, but I am already at 2 TB of RAM:
Initial data loaded
After about 1.5 hours
Then a release of memory. (Based on swapping and memory usage, I am guessing DuckDB is performing all calculations in-memory up until this point; it reaches OOM and attempts to spill to disk, but my EBS volume is not large enough to handle ~1 terabyte of data, and as it is spilling it then runs OOM.) Shortly after, it dies.

I do realize there is a lot of tweaking I can do here, including changing the model. But I do not see why I should have to change the model: I can run this code in Spark, which is known to be better for larger datasets, yet Spark will not be the recommended approach in the future. I have been running the model for 3 days with tweaks including changing the available DuckDB memory, changing the thread count, changing the partitioning on blocking rules, changing the instance type, and reducing the dataset size (the problem there is that I am not the dataset expert, so I do not have a curated list of valid comparison scenarios).
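For reference, these are the kinds of DuckDB-level knobs I have been adjusting. The values and paths below are illustrative, not my exact configuration, and as I understand it the resulting connection can be handed straight to the linker (see the setup that follows):

```python
import duckdb

con = duckdb.connect()

# Keep DuckDB's cap comfortably below the ECS container limit.
con.execute("SET memory_limit = '1800GB'")
con.execute("SET threads = 32")

# Point spill files at the EBS volume; this is what fills up once the
# pairwise comparisons no longer fit in RAM.
con.execute("SET temp_directory = '/mnt/ebs/duckdb_spill'")

# Often suggested for reducing memory pressure on large out-of-core workloads.
con.execute("SET preserve_insertion_order = false")
```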
Setup for DuckDB:
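Roughly along these lines (placeholder columns, comparisons and thresholds, not my actual comparison spec):

```python
import duckdb
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

df = pd.read_parquet("records.parquet")  # placeholder input
con = duckdb.connect()                   # configured as in the previous sketch

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("ssn", term_frequency_adjustments=True),
        cl.jaro_winkler_at_thresholds("full_name", [0.9, 0.7]),
        cl.levenshtein_at_thresholds("dob", 1),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.ssn = r.ssn",
        "l.full_name = r.full_name",
    ],
}

# Reuse the pre-configured connection so memory_limit and temp_directory
# apply to the linker's queries.
linker = DuckDBLinker(df, settings, connection=con)
```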
Once I even get past training, I will still need to run .predict(). What I would like is a trained model in DuckDB that I can then use to process incoming transactional data by comparing it against the deduped dataset. This would not be cost-optimized in Spark, but could be on EC2 or even Lambda if the blocking rules become pushdown predicates against an RDBMS.
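The workflow I am hoping for is something like this (assuming save_model_to_json and find_matches_to_new_records behave the way I understand them to; record fields and thresholds are placeholders):

```python
import pandas as pd

# Train once (wherever is cheapest), then persist the model.
linker.save_model_to_json("model.json", overwrite=True)

# Later, in a small EC2/Lambda-sized process, rebuild a DuckDB linker from
# model.json over the deduped dataset and score incoming records against it
# without retraining.
incoming = pd.DataFrame([
    {"unique_id": "txn-1", "ssn": "123-45-6789",
     "full_name": "Jane Doe", "dob": "1990-01-01"},
])
matches = linker.find_matches_to_new_records(incoming, match_weight_threshold=5)
print(matches.as_pandas_dataframe().head())
```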
I am also aware I could increase my EBS volume size to handle the large number of comparisons without changing my model. I could also change my full_name comparison levels, but I do not want to: my model output from Spark is really good with full_name included, and because I have lots of families and bad data, first/last name alone create too many similarities, especially between adults and minors. I could remove SSN from EM training, but without it I struggle to even populate m/u values for all comparisons.
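To make concrete why I want to keep full_name, a sketch of the distinction (term-frequency-adjusted exact matches shown for brevity; my real levels also include fuzzy matching):

```python
import splink.duckdb.comparison_library as cl

# What I want to keep: full_name with term frequency adjustments, so common
# family surnames do not dominate the match weight.
full_name = cl.exact_match("full_name", term_frequency_adjustments=True)

# What over-matches on its own in my data (families, adults vs minors):
first_name = cl.exact_match("first_name", term_frequency_adjustments=True)
last_name = cl.exact_match("last_name", term_frequency_adjustments=True)
```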