#### Fuzzy Matching & Deduplication Across Multiple Columns: A Patient Deduplication Scenario for the HealthcareData Lakehouse

Goal: Identify potentially duplicate patient records originating from different sources or containing data entry variations (typos, abbreviations, missing info) using fuzzy matching techniques in a Fabric Spark Notebook.

Techniques to Showcase:

1. Data loading and preprocessing/standardization (lowercase, trimming, removing punctuation).
2. Blocking strategy (Soundex on last name + State) to reduce the comparison space.
3. Generating candidate pairs within blocks via self-join.
4. Calculating similarity scores using built-in Spark functions (levenshtein, datediff).
5. Applying rule-based logic across multiple field similarities to flag potential duplicates.
6. (Optional) Introducing Pandas UDFs for more advanced fuzzy algorithms (e.g., from the thefuzz library).
7. Saving potential duplicate pairs for review.


Sample Data:

 A single dataset representing raw patient entries from potentially different systems, containing variations that make exact matching difficult.

- Files/landing/healthcare/patient_raw.csv: contains raw patient records with variations

```mermaid
graph TD
    A["Raw Patient Data (CSV)"] --> B[Data Cleaning & Standardization]
    B --> C[Save Cleaned Data as Delta Table]
    C --> D["Blocking (SOUNDEX + State)"]
    D --> E["Candidate Pair Generation (Self-Join on Block Key)"]
    E --> F["Similarity Scoring (Levenshtein, DOB, Phone)"]
    F --> G[Apply Matching Rules]
    G --> H[Potential Duplicates Table]
    H --> I[Query & Review Results]
```



###### Reset Demo

In [1]:
%%sql

DROP TABLE IF EXISTS patient_cleaned;
DROP TABLE IF EXISTS patient_potential_duplicates;


<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

#### Cell 1: Setup & Imports

In [2]:
# Import PySpark functions
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DateType, IntegerType

# Optional: For Pandas UDFs later
# from pyspark.sql.functions import pandas_udf, PandasUDFType
# from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
# import pandas as pd
# from thefuzz import fuzz # Requires 'thefuzz' library installed in environment

print("Setup complete. PySpark functions imported.")

# --- Configuration ---
landing_zone_path = "Files/landing/healthcare"
raw_patient_csv_path = f"{landing_zone_path}/patient_raw.csv"

# Define Delta table names within the Lakehouse ('Tables' folder)
cleaned_patient_table = "patient_cleaned"
potential_duplicates_table = "patient_potential_duplicates"

print(f"Raw Patient CSV Path: {raw_patient_csv_path}")
print(f"Cleaned Table: {cleaned_patient_table}")
print(f"Output Duplicates Table: {potential_duplicates_table}")


Setup complete. PySpark functions imported.
Raw Patient CSV Path: Files/landing/healthcare/patient_raw.csv
Cleaned Table: patient_cleaned
Output Duplicates Table: patient_potential_duplicates


#### Cell 2: Load and Preprocess Patient Data

In [3]:
# Load raw data
raw_df = spark.read.csv(raw_patient_csv_path, header=True, inferSchema=True)

# --- Preprocessing and Standardization ---
patients_df = raw_df \
    .withColumn("dob", F.to_date(F.col("dob"), "yyyy-MM-dd")) \
    .withColumn("first_name_clean", F.lower(F.trim(F.col("first_name")))) \
    .withColumn("last_name_clean", F.lower(F.trim(F.col("last_name")))) \
    .withColumn("street_address_clean", F.lower(F.trim(F.regexp_replace("street_address", r'[^\w\s]', '')))) \
    .withColumn("city_clean", F.lower(F.trim(F.col("city")))) \
    .withColumn("state_clean", F.upper(F.trim(F.col("state")))) \
    .withColumn("zip_code_clean", F.trim(F.col("zip_code"))) \
    .withColumn("phone_clean", F.regexp_replace(F.col("phone_number"), r'[^0-9]', '')) \
    .withColumn("full_name_clean", F.trim(F.concat_ws(" ", F.col("first_name_clean"), F.col("last_name_clean")))) \
    .fillna("", subset=["middle_name"]) # Replace null middle names with empty string for consistency if needed

# Filter out records with essential missing info if necessary (e.g., name or dob)
# Each comparison needs its own parentheses because & binds tighter than != in Python
# patients_df = patients_df.filter((F.col("last_name_clean") != "") & (F.col("first_name_clean") != "") & F.col("dob").isNotNull())

print("Cleaned Patient Data Schema & Sample:")
patients_df.printSchema()
patients_df.select(
    "record_id", "first_name_clean", "last_name_clean", "dob",
    "street_address_clean", "city_clean", "state_clean", "phone_clean"
).show(truncate=False)

# Save cleaned data as a Delta table (useful checkpoint)
patients_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(cleaned_patient_table)
print(f"Cleaned patient data saved to Delta table: {cleaned_patient_table}")


Cleaned Patient Data Schema & Sample:
root
 |-- record_id: string (nullable = true)
 |-- source_system: string (nullable = true)
 |-- mrn: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- middle_name: string (nullable = false)
 |-- dob: date (nullable = true)
 |-- street_address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- first_name_clean: string (nullable = true)
 |-- last_name_clean: string (nullable = true)
 |-- street_address_clean: string (nullable = true)
 |-- city_clean: string (nullable = true)
 |-- state_clean: string (nullable = true)
 |-- zip_code_clean: string (nullable = true)
 |-- phone_clean: string (nullable = true)
 |-- full_name_clean: string (nullable = false)

+---------+----------------+---------------+----------+----------------------+-------------+-----------

Demo Points: 

1. Explain the importance of cleaning and standardizing data before matching (lowercase, trim, remove punctuation). 
2. This is a common task for Bronze to Silver in the Medallion architecture.
3. Show the standardized columns. Saving to Delta provides a checkpoint.
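
To make the standardization step concrete outside Spark, here is a minimal plain-Python sketch of the same three transformations (lowercase and trim, strip punctuation, keep digits only). The helper names are hypothetical and not part of the notebook, and Python's regex engine is not identical to the Java regex used by Spark, so treat this as an approximation of the cell above rather than an exact equivalent.

```python
import re


def clean_name(value: str) -> str:
    """Trim and lowercase, mirroring F.lower(F.trim(col))."""
    return value.strip().lower()


def clean_address(value: str) -> str:
    """Remove punctuation, then trim and lowercase."""
    return re.sub(r"[^\w\s]", "", value).strip().lower()


def clean_phone(value: str) -> str:
    """Keep digits only, mirroring regexp_replace(col, '[^0-9]', '')."""
    return re.sub(r"[^0-9]", "", value)


print(clean_name("  Smith "))           # smith
print(clean_address("123 Main St., "))  # 123 main st
print(clean_phone("(555) 123-4567"))    # 5551234567
```

After this step, values that differ only in case, whitespace, or punctuation become identical strings, so the later fuzzy comparison only has to deal with genuine spelling differences.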

#### Cell 3: Define Blocking Strategy

In [4]:
# Read cleaned data
patients_df = spark.table(cleaned_patient_table)

# --- Add Blocking Key ---
# Strategy: SOUNDEX of last name + State Abbreviation
patients_df = patients_df.withColumn(
    "block_key", F.concat_ws("_", F.soundex(F.col("last_name_clean")), F.col("state_clean"))
)

print("Data with Blocking Key:")
patients_df.select("record_id", "last_name_clean", "state_clean", "block_key").show()

# --- Analyze Block Sizes (Optional but Recommended) ---
print("\nBlock Key Distribution:")
block_counts = patients_df.groupBy("block_key").count().orderBy(F.desc("count"))
block_counts.show(50)
# Check for excessively large blocks which might need a refined blocking key
large_blocks = block_counts.filter("count > 100").count() # Adjust threshold as needed
print(f"Number of blocks with size > 100: {large_blocks}")


Data with Blocking Key:
+---------+---------------+-----------+---------+
|record_id|last_name_clean|state_clean|block_key|
+---------+---------------+-----------+---------+
|   REC001|          smith|         CA|  S530_CA|
|   REC002|          smith|         CA|  S530_CA|
|   REC003|          smyth|         CA|  S530_CA|
|   REC004|          smith|         CA|  S530_CA|
|   REC005|         wonder|         NY|  W536_NY|
|   REC006|         wonder|         NY|  W536_NY|
|   REC007|         wonder|         NY|  W536_NY|
|   REC008|          jones|         CA|  J520_CA|
|   REC009|          jones|         CA|  J520_CA|
|   REC010|          jones|         CA|  J520_CA|
|   REC011|          patel|         UK|  P340_UK|
|   REC012|          young|         CA|  Y520_CA|
|   REC013|          smith|         CA|  S530_CA|
+---------+---------------+-----------+---------+


Block Key Distribution:
+---------+-----+
|block_key|count|
+---------+-----+
|  S530_CA|    5|
|  J520_CA|    3|
|  W536_NY

Demo Point: Explain the purpose of blocking: reducing the comparison space from roughly N*(N-1)/2 record pairs to only the pairs that share a block key. Discuss the chosen strategy (Soundex + State) and why it helps group potentially similar records. Show the block key distribution; ideally, most blocks should be relatively small.

Talking points: 

- "Imagine finding duplicate patients in millions of healthcare records. Comparing every single record to every other record looking for typos or variations is computationally impossible – it would take forever."
- "Blocking is the smart shortcut. For our patient data, we first quickly group records using a specific strategy: grouping by how the last name sounds (using an algorithm called Soundex) combined with the patient's State."
- "Why this combo? Soundex catches common phonetic typos (like 'Smith' vs. 'Smyth'), while adding the State drastically narrows down the possibilities – we only compare similar-sounding names within the same state."
- "Then, we only run the detailed, slower fuzzy comparison on these much smaller, targeted groups. It dramatically cuts down the workload, making large-scale patient deduplication fast and efficient."
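
For readers who want to see why "smith" and "smyth" share a block, the following is a minimal sketch of the classic American Soundex rules in plain Python. It is an illustration only: the function names are hypothetical, and edge cases may be handled differently by Spark's built-in soundex function. For the last names in the sample output above it produces the same codes.

```python
# Letter groups that share a Soundex digit
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex code (first letter + 3 digits)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset the "previous digit" state
    return (code + "000")[:4]


def block_key(last_name: str, state: str) -> str:
    """Combine the phonetic code with the state, as in the notebook."""
    return f"{soundex(last_name)}_{state}"


print(block_key("smith", "CA"))   # S530_CA
print(block_key("smyth", "CA"))   # S530_CA
print(block_key("wonder", "NY"))  # W536_NY
```

Because vowels carry no digit, "smith" and "smyth" collapse to the same code, which is what lets the blocking step place both spellings in one group.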


#### Cell 4: Generate Candidate Pairs within Blocks

In [5]:
# --- Generate Candidate Pairs by Self-Joining on Block Key ---
df1 = patients_df.alias("df1")
df2 = patients_df.alias("df2")

# Join on block_key and ensure we compare different records only once
candidate_pairs_df = df1.join(
    df2,
    on=(F.col("df1.block_key") == F.col("df2.block_key")) & (F.col("df1.record_id") < F.col("df2.record_id")),
    how="inner"
)

# Select relevant columns for comparison
candidate_pairs_df = candidate_pairs_df.select(
    F.col("df1.record_id").alias("id1"),
    F.col("df1.full_name_clean").alias("name1"),
    F.col("df1.dob").alias("dob1"),
    F.col("df1.street_address_clean").alias("address1"),
    F.col("df1.city_clean").alias("city1"),
    F.col("df1.state_clean").alias("state1"),
    F.col("df1.zip_code_clean").alias("zip1"),
    F.col("df1.phone_clean").alias("phone1"),
    F.col("df2.record_id").alias("id2"),
    F.col("df2.full_name_clean").alias("name2"),
    F.col("df2.dob").alias("dob2"),
    F.col("df2.street_address_clean").alias("address2"),
    F.col("df2.city_clean").alias("city2"),
    F.col("df2.state_clean").alias("state2"),
    F.col("df2.zip_code_clean").alias("zip2"),
    F.col("df2.phone_clean").alias("phone2")
)

print(f"\nGenerated {candidate_pairs_df.count()} candidate pairs for comparison.")
print("Sample Candidate Pairs:")
candidate_pairs_df.show(5, truncate=False)



Generated 16 candidate pairs for comparison.
Sample Candidate Pairs:
+------+----------+----------+-----------+-------+------+-----+----------+------+--------------+----------+-----------------+-------+------+-----+-----------+
|id1   |name1     |dob1      |address1   |city1  |state1|zip1 |phone1    |id2   |name2         |dob2      |address2         |city2  |state2|zip2 |phone2     |
+------+----------+----------+-----------+-------+------+-----+----------+------+--------------+----------+-----------------+-------+------+-----+-----------+
|REC001|john smith|1980-05-20|123 main st|anytown|CA    |90210|5551234567|REC013|jonn smith    |1980-05-20|123 main st      |anytown|CA    |90210|5551234567 |
|REC001|john smith|1980-05-20|123 main st|anytown|CA    |90210|5551234567|REC004|jonathan smith|1980-05-20|123 main st apt 1|anytown|CA    |90210|5551234567 |
|REC001|john smith|1980-05-20|123 main st|anytown|CA    |90210|5551234567|REC003|john smyth    |1980-05-20|123 main street  |anytown|CA

Demo Point: Explain the self-join on the block_key. Emphasize the df1.record_id < df2.record_id condition to prevent self-comparisons and duplicate pairs. Show the resulting candidate pairs ready for scoring.

Talking points:

- "Alright, we've now grouped potentially similar patients into 'blocks' using Soundex and State."
- "The next step is crucial: creating the actual pairs of patients within each block that we need to compare closely. We do this efficiently using a 'self-join' – essentially, matching our patient list back to itself only when the block keys match."
- "The really clever part is adding one simple condition: Record ID 1 < Record ID 2. This instantly prevents two things: comparing a patient record to itself (which is pointless) and creating duplicate pairs (like comparing Alice to Bob and Bob to Alice – we only need it once!)."
- "So, this step quickly and efficiently gives us just the unique 'candidate pairs' we actually need to analyze, ready for the detailed fuzzy scoring in the next stage."
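
The effect of the block-key join combined with the record_id ordering condition can be sketched in plain Python. The record IDs and block keys below are copied from the "Data with Blocking Key" output earlier in this notebook; the variable names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

# (record_id, block_key) as shown in the blocking cell output
records = [
    ("REC001", "S530_CA"), ("REC002", "S530_CA"), ("REC003", "S530_CA"),
    ("REC004", "S530_CA"), ("REC005", "W536_NY"), ("REC006", "W536_NY"),
    ("REC007", "W536_NY"), ("REC008", "J520_CA"), ("REC009", "J520_CA"),
    ("REC010", "J520_CA"), ("REC011", "P340_UK"), ("REC012", "Y520_CA"),
    ("REC013", "S530_CA"),
]

# Group record IDs by block key
blocks = defaultdict(list)
for record_id, key in records:
    blocks[key].append(record_id)

# Within each block, emit each unordered pair once with id1 < id2
pairs = []
for ids in blocks.values():
    pairs.extend(combinations(sorted(ids), 2))

n = len(records)
print(n * (n - 1) // 2)  # 78 pairs without blocking
print(len(pairs))        # 16 pairs with blocking
```

The 16 pairs match the count reported by the cell above: 10 from the five-record S530_CA block and 3 each from the W536_NY and J520_CA blocks, while the single-record blocks contribute none.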


#### Cell 5: Calculate Similarities & Score Pairs (Rule-Based)

In [6]:
# --- Calculate Similarity Scores ---
scored_pairs_df = candidate_pairs_df.withColumn(
    # Lower score = more similar for Levenshtein distance
    "name_lev", F.levenshtein(F.col("name1"), F.col("name2"))
).withColumn(
    # Difference in days for DOB. Lower = more similar.
    "dob_diff", F.abs(F.datediff(F.col("dob1"), F.col("dob2")))
).withColumn(
    # Levenshtein on street address
    "address_lev", F.levenshtein(F.col("address1"), F.col("address2"))
).withColumn(
    # Levenshtein on cleaned phone number
    "phone_lev", F.levenshtein(F.col("phone1"), F.col("phone2"))
)

print("\nPairs with Similarity Scores:")
scored_pairs_df.select("id1", "name1", "id2", "name2", "name_lev", "dob_diff", "address_lev", "phone_lev").show(truncate=False)

# --- Apply Matching Rules ---
# Example rules (adjust thresholds based on data and requirements):
# Rule 1: Very similar name, exact DOB -> Potential Match
# Rule 2: Exact name, similar address, close DOB -> Potential Match
# Rule 3: Similar name, exact DOB, similar phone -> Potential Match
#         (as written below, any pair that satisfies Rule 3 already satisfies Rule 1)

match_threshold_name_close = 3
match_threshold_name_exact = 0
match_threshold_dob_exact = 0
match_threshold_dob_close = 7 # Allow up to a 7-day difference; tighten if needed
match_threshold_address_close = 5
match_threshold_phone_close = 2

scored_pairs_df = scored_pairs_df.withColumn(
    "is_potential_duplicate",
    F.when(
        (F.col("name_lev") <= match_threshold_name_close) &
        (F.col("dob_diff") <= match_threshold_dob_exact),
        True
    ).when(
        (F.col("name_lev") <= match_threshold_name_exact) &
        (F.col("address_lev") <= match_threshold_address_close) &
        (F.col("dob_diff") <= match_threshold_dob_close),
        True
    ).when(
        (F.col("name_lev") <= match_threshold_name_close) &
        (F.col("dob_diff") <= match_threshold_dob_exact) &
        (F.col("phone_lev") <= match_threshold_phone_close),
        True
    ).otherwise(False)
)

print("\nPairs with Potential Duplicate Flag:")
scored_pairs_df.select(
    "id1", "name1", "id2", "name2", "name_lev", "dob_diff",
    "address_lev", "phone_lev", "is_potential_duplicate"
).show(truncate=False)



Pairs with Similarity Scores:
+------+--------------+------+--------------+--------+--------+-----------+---------+
|id1   |name1         |id2   |name2         |name_lev|dob_diff|address_lev|phone_lev|
+------+--------------+------+--------------+--------+--------+-----------+---------+
|REC001|john smith    |REC013|jonn smith    |1       |0       |0          |0        |
|REC001|john smith    |REC004|jonathan smith|4       |0       |6          |0        |
|REC001|john smith    |REC003|john smyth    |1       |0       |4          |1        |
|REC001|john smith    |REC002|jon smith     |1       |0       |0          |0        |
|REC002|jon smith     |REC013|jonn smith    |1       |0       |0          |0        |
|REC002|jon smith     |REC004|jonathan smith|5       |0       |6          |0        |
|REC002|jon smith     |REC003|john smyth    |2       |0       |4          |1        |
|REC003|john smyth    |REC013|jonn smith    |2       |0       |4          |1        |
|REC003|john smyth    |

Demo Point: Explain the use of levenshtein for string similarity (lower is better) and datediff for dates. Show the calculated scores. Explain the simple rule-based logic combining thresholds across multiple fields using when/otherwise to flag potential duplicates. Discuss how these thresholds would be tuned.

Talking points:

- "So, we have our 'candidate pairs' – records that might be duplicates because they were in the same block. Now, we need to score how similar they truly are."
- "For text like names and addresses, we're using a function called 'Levenshtein distance'. It basically counts the minimum 'typos' or edits needed to make one string match the other – the key is, a lower Levenshtein score means a better, more similar match."
- "For dates of birth, we simply calculate the difference in days – again, closer to zero is better."
- "But just one similar field isn't usually enough proof. So, we combine these scores using simple 'WHEN...THEN' rules: for example, 'WHEN the name similarity score is very low AND the date difference is zero or tiny, THEN flag this pair as a potential duplicate'. We can add other rules combining different fields."
- "Crucially, those 'very low' thresholds are adjustable – we'd tune them based on analyzing the data here in St. Louis and deciding how strict we need to be. This step flags the most promising pairs for review."
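
To show what the Levenshtein score actually counts, here is a small dynamic-programming sketch in plain Python. It is a textbook implementation written for illustration, not the code Spark runs internally; the example names come from the scored pairs shown above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("john smith", "jonn smith"))      # 1
print(levenshtein("john smith", "jonathan smith"))  # 4
print(levenshtein("jon smith", "john smyth"))       # 2
```

These values agree with the name_lev column in the cell output: a single typo scores 1, while "john" versus "jonathan" scores 4 and therefore falls outside the name threshold of 3.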


#### Cell 6 (Optional): Advanced Scoring with Pandas UDF

In [7]:
# --- Optional: Advanced scoring using 'thefuzz' library via Pandas UDF ---
# Ensure 'thefuzz' is installed in your Fabric Spark environment.

# # Example Schema for UDF output
# score_schema = StructType([
#     StructField("name_ratio", IntegerType()),
#     StructField("name_partial_ratio", IntegerType()),
#     StructField("name_token_sort_ratio", IntegerType()),
#     StructField("address_ratio", IntegerType())
# ])

# # Define the Pandas UDF
# # With Python type hints (Spark 3.0+) the UDF type is inferred; PandasUDFType is deprecated
# @pandas_udf(score_schema)
# def calculate_fuzzy_scores_udf(name1_series: pd.Series, name2_series: pd.Series, address1_series: pd.Series, address2_series: pd.Series) -> pd.DataFrame:
#     """Calculates various fuzzy scores for pairs."""
#     results = []
#     for name1, name2, address1, address2 in zip(name1_series, name2_series, address1_series, address2_series):
#         name_r = fuzz.ratio(name1, name2)
#         name_pr = fuzz.partial_ratio(name1, name2)
#         name_tsr = fuzz.token_sort_ratio(name1, name2)
#         address_r = fuzz.ratio(address1, address2)
#         results.append((name_r, name_pr, name_tsr, address_r))
#     return pd.DataFrame(results, columns=["name_ratio", "name_partial_ratio", "name_token_sort_ratio", "address_ratio"])

# # Apply the UDF
# # Note: Ensure candidate_pairs_df is available if running this cell
# scored_pairs_adv_df = candidate_pairs_df.withColumn(
#     "fuzzy_scores",
#     calculate_fuzzy_scores_udf(
#         F.col("name1"), F.col("name2"), F.col("address1"), F.col("address2")
#     )
# ).select("*", "fuzzy_scores.*").drop("fuzzy_scores") # Expand the struct

# print("\nPairs with Advanced Fuzzy Scores (using thefuzz):")
# scored_pairs_adv_df.select(
#     "id1", "name1", "id2", "name2",
#     "name_ratio", "name_partial_ratio", "name_token_sort_ratio", "address_ratio"
# ).show(truncate=False)

# # You could then define 'is_potential_duplicate' based on these richer scores, e.g.,
# # high token_sort_ratio for name AND high address_ratio AND close DOB


Demo Point: Briefly explain that for more complex matching, Pandas UDFs allow using powerful Python libraries like thefuzz. Show the structure (define schema, define UDF, apply UDF). Mention higher scores (like ratio) mean more similar, unlike Levenshtein distance. Keep this optional or brief unless the audience is very technical.    
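
The direction of the scale is easy to mix up, so here is a small sketch of a ratio-style score where higher means more similar. It uses difflib from the Python standard library as a stand-in so that it runs without extra packages; it is not thefuzz, and the numbers thefuzz returns for the same strings may differ. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(a: str, b: str) -> int:
    """Return a 0-100 similarity score; 100 means identical strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Higher is better here, the opposite of Levenshtein distance
print(similarity_score("john smith", "john smith"))
print(similarity_score("john smith", "jonn smith"))
print(similarity_score("john smith", "alice wonder"))
```

A rule built on this kind of score uses a lower bound (for example, "score at least 85", an illustrative value that would need tuning) instead of the upper bounds used with the Levenshtein thresholds.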

#### Cell 7: Filter, Show, and Save Potential Duplicates

In [8]:
# --- Filter for potential duplicates based on the rules ---
potential_duplicates_df = scored_pairs_df.filter(F.col("is_potential_duplicate"))

print(f"\nFound {potential_duplicates_df.count()} potential duplicate pairs.")
print("Potential Duplicate Pairs Details:")

# Select a clean set of columns to display/save
display_cols = [
    "id1", "name1", "dob1", "address1", "city1", "state1", "zip1", "phone1",
    "id2", "name2", "dob2", "address2", "city2", "state2", "zip2", "phone2",
    "name_lev", "dob_diff", "address_lev", "phone_lev" # Include scores for context
]
potential_duplicates_to_show = potential_duplicates_df.select(display_cols)
potential_duplicates_to_show.show(truncate=False)

# --- Save the potential duplicate pairs to a Delta table ---
try:
    potential_duplicates_to_show.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(potential_duplicates_table)
    print(f"Potential duplicate pairs saved to Delta table: {potential_duplicates_table}")
except Exception as e:
    print(f"Error saving potential duplicates: {e}")



Found 10 potential duplicate pairs.
Potential Duplicate Pairs Details:
+------+------------+----------+---------------+----------+------+-----+-----------+------+------------+----------+---------------+----------+------+-----+-----------+--------+--------+-----------+---------+
|id1   |name1       |dob1      |address1       |city1     |state1|zip1 |phone1     |id2   |name2       |dob2      |address2       |city2     |state2|zip2 |phone2     |name_lev|dob_diff|address_lev|phone_lev|
+------+------------+----------+---------------+----------+------+-----+-----------+------+------------+----------+---------------+----------+------+-----+-----------+--------+--------+-----------+---------+
|REC001|john smith  |1980-05-20|123 main st    |anytown   |CA    |90210|5551234567 |REC013|jonn smith  |1980-05-20|123 main st    |anytown   |CA    |90210|5551234567 |1       |0       |0          |0        |
|REC001|john smith  |1980-05-20|123 main st    |anytown   |CA    |90210|5551234567 |REC003|john 

Demo Point: Show the final filtered list of pairs flagged as potential duplicates. Explain that this output would typically feed into a review process or a more advanced clustering step. Save the results.

#### Cell 8: Query Results & Discussion

In [9]:
%%sql
-- Query the potential duplicates table
SELECT
    id1, name1, dob1, address1,
    id2, name2, dob2, address2,
    name_lev, dob_diff -- Show similarity scores
FROM
    patient_potential_duplicates -- Use table name directly
ORDER BY name1, id1, id2;


<Spark SQL result set with 10 rows and 10 fields>

Demo Point: Show querying the results using SQL. Discuss next steps: Manual review, using graph algorithms (like Connected Components on the pairs) to group records belonging to the same entity, assigning a master ID (Golden Record). Emphasize that this demo focused on identifying pairs, and resolving them is often the subsequent step.

Talking Points:

- "So, right here in the notebook, using standard SQL, we can immediately query the potential duplicate patient pairs our fuzzy logic just flagged – you can see the different record IDs side-by-side, along with the similarity scores that triggered the match."
- "This output table, stored right here in our Fabric Lakehouse, is the crucial input for the next phase. Typically, these potential duplicates would either go to data stewards for review, or more powerfully, you'd feed these pairs into graph algorithms, like Connected Components."
- "Think of it like connecting the dots – if the algorithm sees 'Record A matches B' and 'Record B matches C', it automatically groups A, B, and C together as likely the same person."
- "The ultimate goal is usually assigning one unique Master Patient ID or creating a single 'Golden Record' for each individual, essential for accurate care and reporting right here for our patients in St. Louis."
- "What we focused on in this demo was that vital, complex first step: using Spark's power to accurately identify the potential duplicate pairs from messy data, paving the way for creating that clean, trusted patient view."
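
The "connecting the dots" idea can be sketched with a small union-find (disjoint-set) structure in plain Python. This step is not part of the notebook; the pairs below are illustrative input, and using the smallest record ID in each group as the master ID is an assumption made only for this example.

```python
def group_pairs(pairs):
    """Group record IDs into connected components.

    Returns a dict mapping a master ID (the smallest ID in the
    group) to the sorted list of member IDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller so the master ID is the minimum
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return {root: sorted(members) for root, members in clusters.items()}


# Illustrative flagged pairs: A matches B, B matches C, D matches E
flagged = [("REC001", "REC002"), ("REC002", "REC013"), ("REC008", "REC009")]
print(group_pairs(flagged))
# {'REC001': ['REC001', 'REC002', 'REC013'], 'REC008': ['REC008', 'REC009']}
```

REC001 and REC013 end up in the same group even though that pair is not in the input list, because both are linked through REC002. At Lakehouse scale the same grouping would typically be done with a distributed graph library instead of an in-memory dictionary.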

Technical points:

name_lev: This stands for "Name Levenshtein distance".

- It's calculated using the levenshtein() Spark function between the cleaned full names of the two patient records being compared (name1 and name2).
- What it means: Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one name into the other.
- Reference: Levenshtein distance - Wikipedia, https://en.wikipedia.org/wiki/Levenshtein_distance
- Interpretation: A lower name_lev score means the names are more similar. A score of 0 means the names are identical. A small score (like 1 or 2) indicates a minor typo or difference (e.g., "Jon" vs "John", "Smith" vs "Smyth").

dob_diff: This stands for "Date of Birth Difference".

- It's calculated using the abs(datediff(dob1, dob2)) Spark functions between the dates of birth of the two patient records (dob1 and dob2).
- What it means: It measures the absolute difference between the two dates of birth, expressed in days.
- Interpretation: A lower dob_diff score means the dates of birth are closer together. A score of 0 means the DOBs are identical. A small score might indicate a minor data entry error, while a large score indicates the birth dates are almost certainly different.

In the demo, these scores are used together in rules (like WHEN name_lev <= 3 AND dob_diff <= 0 THEN...) to decide if a pair of records is similar enough across multiple fields to be flagged as a potential duplicate.
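
As a compact illustration of how the two scores feed the rules, the sketch below restates Rule 1 and Rule 2 from Cell 5 as a plain Python function, using the same threshold values. The function and constant names are hypothetical. Rule 3 is left out because, with the thresholds used in the notebook, every pair it would flag already satisfies Rule 1.

```python
from datetime import date

NAME_CLOSE = 3     # match_threshold_name_close
NAME_EXACT = 0     # match_threshold_name_exact
DOB_EXACT = 0      # match_threshold_dob_exact
DOB_CLOSE = 7      # match_threshold_dob_close
ADDRESS_CLOSE = 5  # match_threshold_address_close


def dob_diff(dob1: date, dob2: date) -> int:
    """Absolute difference between two dates in days."""
    return abs((dob1 - dob2).days)


def is_potential_duplicate(name_lev: int, dob_days: int, address_lev: int) -> bool:
    """Apply Rule 1 (similar name, exact DOB) or
    Rule 2 (exact name, similar address, close DOB)."""
    rule1 = name_lev <= NAME_CLOSE and dob_days <= DOB_EXACT
    rule2 = (name_lev <= NAME_EXACT
             and address_lev <= ADDRESS_CLOSE
             and dob_days <= DOB_CLOSE)
    return rule1 or rule2


print(dob_diff(date(1980, 5, 20), date(1980, 5, 27)))  # 7
# Scores taken from the Cell 5 output
print(is_potential_duplicate(1, 0, 0))  # REC001 vs REC013 -> True
print(is_potential_duplicate(4, 0, 6))  # REC001 vs REC004 -> False
```

The second pair is rejected even though the dates of birth match exactly, because "john smith" and "jonathan smith" are 4 edits apart and Rule 2 requires identical names.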
