The cluster_pairwise_predictions_at_threshold is failing #2636
Unanswered
andrzejurbanowicz asked this question in Q&A
Replies: 1 comment 1 reply
If you're hitting scaling constraints with clustering, I would suggest minimising the size of the data by mapping all your ids to int32, clustering, then joining back on your business ids. Here is some code that does something similar in DuckDB and works well for us; clustering sped up by about 10x when we did this. At your scale of data, at a guess, clustering should work fine in DuckDB on a fairly moderately sized machine - we have found clustering in DuckDB to be dramatically faster than Spark, going from hours down to minutes. The relevant function is cluster_pairwise_predictions_at_threshold, which is here.
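A minimal, illustrative sketch of that id-mapping approach (not the code linked above), assuming the pairwise predictions sit in a DuckDB table called `predictions` with string ids in `unique_id_l` / `unique_id_r`; all table and column names here are placeholders:

```python
import duckdb

con = duckdb.connect()

# Placeholder edge list: in practice this is your Splink predictions table
# (unique_id_l, unique_id_r, match_probability) with large string/business ids.
con.execute("""
    CREATE TABLE predictions AS
    SELECT * FROM (VALUES
        ('cust-000001', 'cust-000002', 0.99),
        ('cust-000002', 'cust-000003', 0.97)
    ) AS t(unique_id_l, unique_id_r, match_probability)
""")

# 1. Build a dense int32 lookup for every business id that appears on either
#    side of the edge list.
con.execute("""
    CREATE TABLE id_map AS
    SELECT business_id,
           CAST(row_number() OVER () AS INTEGER) AS int_id
    FROM (
        SELECT unique_id_l AS business_id FROM predictions
        UNION
        SELECT unique_id_r FROM predictions
    ) AS all_ids
""")

# 2. Swap the wide string ids for the compact int32 ids before clustering.
con.execute("""
    CREATE TABLE predictions_int AS
    SELECT ml.int_id AS unique_id_l,
           mr.int_id AS unique_id_r,
           p.match_probability
    FROM predictions p
    JOIN id_map ml ON p.unique_id_l = ml.business_id
    JOIN id_map mr ON p.unique_id_r = mr.business_id
""")

# 3. Run cluster_pairwise_predictions_at_threshold on predictions_int with the
#    DuckDB backend; assume it yields a `clusters` table of
#    (unique_id, cluster_id) keyed on the int32 id.

# 4. Join the cluster labels back onto the original business ids, e.g.:
#    SELECT m.business_id, c.cluster_id
#    FROM clusters c
#    JOIN id_map m ON c.unique_id = m.int_id
```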
Hello!
I am trying to link 4 data sets (25M, 26M, 26M and 56M records) with Splink 4.0.6. For this I am using the Spark backend with the following parameters:

```
--num-executors 100
--executor-cores 4
--executor-memory 64G
--driver-memory 40G
--conf spark.driver.maxResultSize=5G
--conf spark.sql.shuffle.partitions=400
--conf spark.default.parallelism=1000
--conf spark.yarn.maxAppAttempts=4
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.sql.execution.arrow.enabled=true
--conf spark.sql.files.maxPartitionBytes=512MB
```

Training and prediction took around 1h and generated 54M pairwise predictions, but when I tried to run:

```python
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.95
)
```

the code failed with:

```
File "/home/hadoop/splink_model.py", line 185, in run_prediction
    clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(df_predictions,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/linker_components/clustering.py", line 137, in cluster_pairwise_predictions_at_threshold
    cc = solve_connected_components(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/connected_components.py", line 442, in solve_connected_components
    prev_representatives_thinned = db_api.sql_pipeline_to_splink_dataframe(pipeline)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/database_api.py", line 200, in sql_pipeline_to_splink_dataframe
    splink_dataframe = self.sql_to_splink_dataframe_checking_cache(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/database_api.py", line 171, in sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._sql_to_splink_dataframe(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/database_api.py", line 94, in _sql_to_splink_dataframe
    output_df = self._cleanup_for_execute_sql(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/spark/database_api.py", line 112, in _cleanup_for_execute_sql
    spark_df = self._break_lineage_and_repartition(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/internals/spark/database_api.py", line 309, in _break_lineage_and_repartition
    spark_df.write.mode("overwrite").parquet(write_path)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1721, in parquet
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1258.parquet.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 21.0 GiB.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableBytesError(QueryExecutionErrors.scala:2201)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.org$apache$spark$sql$execution$exchange$BroadcastExchangeExec$$doComputeRelation(BroadcastExchangeExec.scala:224)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:191)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:184)
	at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$compute$1(AsyncDriverOperation.scala:75)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:384)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:376)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:359)
```

I have already spent a few days on this and am thinking of using MLlib for clustering. Any recommendations or tips?
My target is to link 4 data sets with around 300M records each.
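For reference, the pipeline looks roughly like the sketch below; the settings, comparisons, blocking rule and DataFrame names are placeholders rather than my exact code:

```python
from splink import Linker, SettingsCreator, SparkAPI
import splink.comparison_library as cl

# Placeholder model: the real job has its own comparisons and blocking rules.
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[cl.ExactMatch("first_name"), cl.ExactMatch("dob")],
    blocking_rules_to_generate_predictions=["l.dob = r.dob"],
)

# spark is the existing SparkSession created by spark-submit with the
# parameters listed above.
db_api = SparkAPI(spark_session=spark)

# dfs is the list of the four input Spark DataFrames (25M, 26M, 26M and 56M rows).
linker = Linker(dfs, settings, db_api)

# ... model training steps omitted ...

df_predictions = linker.inference.predict()

# This is the step that fails with the broadcast error shown above.
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.95
)
```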
Thanks
Andrzej