Performance between Spark and Duckdb #2209

vfrank66 · 2024-06-13T12:47:39Z

vfrank66
Jun 13, 2024

Hello I know this has partially been discussed several times over but I think I must be misunderstanding something. I am questioning the recommendation to use ductdb primarily due to performance. I see that in splink4 the recommendation is to use duckdb for all cases if you can get a large enough machine. I personally do not understand this recommendation. There is too much code to show so I will try to describe past few weeks of runs with as little code as possible.

I have a large dataset of records to dedupe. This all runs in AWS

Spark - AWS EMR
11 million record dataset -Training and prediction occurs in 3 hours.
Master - c6.8xlarge - Core - 3 c6gd.8xlarge, 5 c6gd.12xlarge

Duckdb - AWS Batch (EC2 with ECS)
500_000 record dataset (started at 16 mil records but that was not going to run successfully) - fails after 2 hours
1 - x2idn.32xlarge with a 500 GB SSD EBS volume (I just cant believe I need this)

INFO:main:Set DuckDB memory limit to: 1200.07 GB
INFO:main:Set DuckDB temp directory to: /tmp/duckdb_temp
INFO:main:Set DuckDB threads to: 128 (or 48)

duckdb Runs out of memory after 2 hours in training. This occurs while running this code where ssn is 70% null (I know I know I am running too many comparisons this is a bad column but it does help inform the other m values for comparisons, and this runs in Spark)

linker.estimate_parameters_using_expectation_maximisation(
            "l.ssn = r.ssn", 
            estimate_without_term_frequencies=True
        ) # Number of comparisons generated by 'l."ssn" = r."ssn"' : 3,325,326,114

Here is what I am seeing in duckdb, I am running in AWS Batch on EC2 instances in a ECS definition so my system memory will be lower than my container memory. I can increase my container but I am already at 2T of RAM:

Initial data loaded

INFOProcess memory usage: 16471.24 MB, System memory - Total: 2000.11 GB, Used: 17.63 GB, Available: 1973.13 GB, Swap - Total: 0.00 GB, Used: 0.00 GB, Free: 0.00 GB, Container memory limit: 2000.98 GB, Container memory usage: 16.45 GB, Container memory available: 1984.52 GB

After about 1.5 hours

INFOProcess memory usage: 19971.24 MB, System memory - Total: 2000.11 GB, Used: 17.63 GB, Available: 1973.13 GB, Swap - Total: 0.00 GB, Used: 0.00 GB, Free: 0.00 GB, Container memory limit: 2000.98 GB, Container memory usage: 1980.45 GB, Container memory available: 20 GB

Then a release of memory. (Based on swapping and memory usage I am guessing duckdb is performing all calc's :in-memory up until now and it has reached OOM and is attempting to spill to disk which my EBS volume is not large enough. To handle 1 terabyte of data. And as it is spilling then it runs OOM.)

INFO:Process memory usage: 16471.24 MB, System memory - Total: 2000.11 GB, Used: 17.63 GB, Available: 1973.13 GB, Swap - Total: 0.00 GB, Used: 0.00 GB, Free: 0.00 GB, Container memory limit: 2000.98 GB, Container memory usage: 1600 GB, Container memory available: 484 GB

Shortly after it dies. I do realize there are a lot of tweaking here I can do including changing the model. But I do not see why I should change the model I can run this code in Spark, of which is known to be better for larger datasets, yet will not in the future be the recommended approach. I have been running the model for 3 days with tweaks of which include changing the available duckdb memory, change the thread count, changing partitioning on blocking rules, change instance, reducing dataset size (problem here is that I am not the dataset expert so I do not have a curated list of valid comparison scenerios).

prediction_blocking_rules = [
        block_on(["last_name_dm", "first_name_dm"]), # Number of comparisons generated by '(l."last_name_dm" = r."last_name_dm") AND (l."first_name_dm" = r."first_name_dm")' : 7,927,626
        block_on(["last_name", "birth_date"]), # Number of comparisons generated by '(l."last_name" = r."last_name") AND (l."birth_date" = r."birth_date")' : 256,817
        block_on(["last_name", "first_name", "birth_date", "gender", "postal_code"]),  # Number of comparisons generated by '(l."last_name" = r."last_name") AND (l."first_name" = r."first_name") AND (l."birth_date" = r."birth_date") AND (l."gender" = r."gender") AND (l."postal_code" = r."postal_code")' : 174,036
        block_on(["last_name", "first_name", "birth_date", "gender", "phone_number"]),
    ]

model_settings = {
    "unique_id_column_name": "unique_id",
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": prediction_blocking_rules,
    "comparisons": [
        {
            "output_column_name": "Full name",
            "comparison_levels": [
                {
                    "sql_condition": "(first_name_l IS NULL OR first_name_r IS NULL) and (last_name_l IS NULL OR last_name_r IS NULL)",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                # full name match
                cll.exact_match_level("full_name", term_frequency_adjustments=True),
                # typos - keep levels across full name rather than scoring separately
                cll.jaro_winkler_level("full_name", 0.9),
                cll.jaro_winkler_level("full_name", 0.7),
                # name switched
                cll.columns_reversed_level("first_name", "last_name"),
                # name switched + typo
                {
                    "sql_condition": "jaro_winkler_similarity(first_name_l, last_name_r) + jaro_winkler_similarity(last_name_l, first_name_r) >= 1.8",
                    "label_for_charts": "switched + jaro_winkler >= 1.8",
                },
                {
                    "sql_condition": "jaro_winkler_similarity(first_name_l, last_name_r) + jaro_winkler_similarity(last_name_l, first_name_r) >= 1.4",
                    "label_for_charts": "switched + jaro_winkler >= 1.4",
                },
                # single name match
                cll.exact_match_level("first_name", term_frequency_adjustments=True),
                cll.exact_match_level("last_name", term_frequency_adjustments=True),
                # single name cross-match
                {
                    "sql_condition": "first_name_l = last_name_r OR last_name_l = first_name_r",
                    "label_for_charts": "single name cross-matches",
                },  # single name typos
                cll.jaro_winkler_level("first_name", 0.9),
                cll.jaro_winkler_level("last_name", 0.9),
                # the rest
                cll.else_level(),
            ],
        },
        ctl.date_comparison(
            "birth_date",
            damerau_levenshtein_thresholds=[],
            cast_strings_to_date=True,
            invalid_dates_as_null=True,
            date_format=(
                "yyyy-MM-dd" if "spark_session" in locals() else "%Y-%m-%d"
            ),  # date_format="%Y-%m-%d", Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB
        ),
        ctl.postcode_comparison(
            "postal_code",
            invalid_postcodes_as_null=False,  # invalid_postcodes_as_null=True,
            set_to_lowercase=False,
            valid_postcode_regex="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$",
            term_frequency_adjustments_full=True,
            include_full_match_level=True,
            include_sector_match_level=False,
            include_district_match_level=False,
            include_area_match_level=False,
        ),
        cl.exact_match("ssn"),
        cl.exact_match("gender", term_frequency_adjustments=True, m_probability_exact_match=0.9849399252128236),
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}
con = duckdb.connect(database=':memory:', read_only=False)
configure_duckdb(con)
...
linker.estimate_u_using_random_sampling(max_pairs=1e10)
...
linker.estimate_parameters_using_expectation_maximisation(
    "l.ssn = r.ssn", 
    estimate_without_term_frequencies=True
)

Setup for duckdb

def set_duckdb_memory_limit(con):
    total_memory = psutil.virtual_memory().total  # in bytes

    # Set the memory limit to current usage + 40% buffer
    memory_limit = total_memory * 0.6

    # Convert to gigabytes for DuckDB PRAGMA
    memory_limit_gb = memory_limit / (1024 ** 3)

    # Ensure the limit is not higher than the total system memory
    if memory_limit_gb > total_memory / (1024 ** 3):
        memory_limit_gb = total_memory / (1024 ** 3)

    con.execute(f"PRAGMA memory_limit='{memory_limit_gb}GB';")
    logger.info(f"Set DuckDB memory limit to: {memory_limit_gb:.2f} GB")

def set_duckdb_threads(con, num_threads):
    con.execute(f"PRAGMA threads={num_threads};")
    logger.info(f"Set DuckDB threads to: {num_threads}")

def get_available_cpu_count():
    return psutil.cpu_count(logical=True)

def get_ec2_instance_vcpus():
    try:
        response = requests.get('http://169.254.169.254/latest/meta-data/instance-type')
        instance_type = response.text

        response = requests.get('http://169.254.169.254/latest/meta-data/cpu-options/core-count')
        core_count = int(response.text)

        response = requests.get('http://169.254.169.254/latest/meta-data/cpu-options/threads-per-core')
        threads_per_core = int(response.text)

        vcpus = core_count * threads_per_core
        logger.info(f"Instance type: {instance_type}, vCPUs: {vcpus}")
        return vcpus
    except Exception as e:
        logger.error(f"Error retrieving instance metadata: {e}")
        return get_available_cpu_count()

def configure_duckdb(con):
    # Set memory limit
    set_duckdb_memory_limit(con)
    
    # Set temporary directory
    temp_dir = '/tmp/duckdb_temp'
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    con.execute(f"PRAGMA temp_directory='{temp_dir}';")
    logger.info(f"Set DuckDB temp directory to: {temp_dir}")
    
    # Set number of threads
    available_cpus = get_ec2_instance_vcpus()
    if available_cpus > 48:
        # no idea what the limit is but do not allow duckdb to have too many threads
        available_cpus = 48
    set_duckdb_threads(con, available_cpus)

Once I even get past training this I will need to .predict(). What I would like to do is have a trained model in duckdb which I can then use to process transactional incoming data by comparing against the deduped dataset. This would not be cost optimized in Spark but could be in ec2 or even lambda if the blocking rules become pushdown predicates against a RDMS.

I am also aware I could increase my EBS volume size to handle the large number of comparisons without changing my model. I could also change my full_name comparison levels, but I do not want to do that my model output from Spark is really good with the inclusion of full_name, I have lots of families and bad data so first/last name create too many similarities especially amongst adults and minors. I could remove SSN since from EM training but without it I struggle to even populate m/u for all comparisons.

RobinL · 2024-06-18T13:30:38Z

RobinL
Jun 18, 2024
Maintainer

Hi @vfrank66 , thanks very much for the detailed write up.

I guess the first thing to say is that from the looks of things, you've done at least as much testing as us, so this is a very useful example and may imply the current recommendation to use DuckDB isn't quite right. I should also say up front there is no plan to remove Spark support in Splink 4 (it's already there, and working in the prereleases)

A few thoughts:

An important reason for the recommendation to use DuckDB is the benchmarking in this blog post. On the big machines, memory usage stayed pretty low. I had probably over-generalised from this that on large machines, memory shouldn't be too much of an issue. I've also heard various things to suggest DuckDB is pretty good with memory e.g. here and here. However, it's totally plausible to me that on a different workload you may get out of memory errors. I've also experienced them myself in DuckDB on non-record-linkage related workflows.
I should also note that the recommendation is (of should be!) to use DuckDB for most use cases, or maybe 'all but the biggest linkage projects'. I think for most users, good advice is 'use DuckDB on a large machine until you start to hit problems, then switch to Spark'. The current workding is here but I agree it could probably be clarified a bit. We're certainly going to continue to support Spark and other backends.

1 reply

vfrank66 Jun 22, 2024
Author

I appreciate the response thank you. I just did not fully understand the benchmark tests since I am using a much large machine in my testing, but the data I have is just not correct for what I desire to model.

I just removed the ssn comparison which creates 3 bil comparisons and all is well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance between Spark and Duckdb #2209

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Performance between Spark and Duckdb #2209

vfrank66 Jun 13, 2024

Replies: 1 comment · 1 reply

RobinL Jun 18, 2024 Maintainer

vfrank66 Jun 22, 2024 Author

vfrank66
Jun 13, 2024

Replies: 1 comment 1 reply

RobinL
Jun 18, 2024
Maintainer

vfrank66 Jun 22, 2024
Author