What do you get if you run |
-
Hi Robin,
Silly question for you: have you seen something like the screenshot below before? None of the fields looks particularly useful. Am I missing something? Is there anything one can do? Thank you!!
Sincerely,
Tom
My code is below the picture
from splink.duckdb.blocking_rule_library import block_on

blocking_rules = [
    block_on("lastname_dm"),
    block_on("PER_LastName"),
    block_on("new_address"),
    # block_on("full_name"),
    block_on("firstname_dm"),
    block_on("PER_FirstName"),
    block_on("PER_DOB"),
    block_on("zip_valid"),
    block_on("PER_HomePhone"),
    block_on("PER_Email"),
    block_on("PER_CellularPhoneOrPager"),
    block_on("middlename_dm"),
]
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "unique_id_column_name": "id",
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": [
        ctl.name_comparison("PER_FirstName"),
        ctl.name_comparison("PER_LastName"),
        ctl.name_comparison("full_name"),
        ctl.email_comparison("PER_Email", include_username_fuzzy_level=False),
        ctl.forename_surname_comparison("PER_FirstName", "PER_LastName"),
        cl.levenshtein_at_thresholds("PER_SSN", [2]),
        cl.exact_match("new_address", term_frequency_adjustments=True),
        ctl.date_comparison("PER_DOB"),
        cl.exact_match("city_valid", term_frequency_adjustments=True),
        cl.exact_match("zip_valid", term_frequency_adjustments=True),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(cleaned_df9_subset, settings, set_up_basic_logging=False)

deterministic_rules = [
    "l.PER_SSN = r.PER_SSN and l.PER_DOB = r.PER_DOB",
    "l.PER_FirstName = r.PER_FirstName and l.PER_DOB = r.PER_DOB",
    "l.PER_LastName = r.PER_LastName and l.PER_DOB = r.PER_DOB",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.9)
linker.estimate_u_using_random_sampling(max_pairs=1e8)
linker.estimate_m_from_label_column("PER_SSN")
That gives me a full model, and I then use:
linker.match_weights_chart()
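
For anyone reading a chart like this, it may help to recall how a match weight relates to the m and u probabilities in a Fellegi-Sunter style model: the weight for a comparison level is the base-2 log of m/u, so a bar near zero means m and u are almost equal and the level carries little evidence either way. Below is a minimal sketch of that arithmetic. The `match_weight` helper and the m/u numbers are made up for illustration; they are not taken from the model in this thread.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Return log2(m / u), the match weight for one comparison level.

    m: probability of observing this level among true matches
    u: probability of observing this level among true non-matches
    """
    return math.log2(m / u)


# Hypothetical m/u values for one column, purely for illustration.
levels = {
    "exact match": (0.90, 0.01),
    "fuzzy match": (0.08, 0.05),
    "all other": (0.02, 0.94),
}

for label, (m, u) in levels.items():
    # Positive weights favour a match, negative weights count against one.
    print(f"{label:12s} m={m:.2f} u={u:.2f} weight={match_weight(m, u):+.2f}")
```

If most bars in the chart sit close to zero, it could be worth checking whether the estimated m values ended up close to the u values for those columns, for example by inspecting the m and u parameters of the trained model.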
From: Robin Linacre ***@***.***>
Sent: Friday, February 2, 2024 11:31 AM
To: moj-analytical-services/splink ***@***.***>
Cc: Thomas Heiman ***@***.***>; Author ***@***.***>
Subject: Re: [moj-analytical-services/splink] No module named 'splink.duckdb.linker (Discussion #1920)
Closed #1920 as resolved.
-
Hi,
Cool project!! When I try to run the demo code below, I get No module named 'splink.duckdb.linker'. Any ideas on what is going wrong? Thank you!
tom
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets
import logging, sys
logging.disable(sys.maxsize)
df = splink_datasets.fake_1000
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
        block_on("email"),
    ],
}
linker = DuckDBLinker(df, settings)
linker.cumulative_num_comparisons_from_blocking_rules_chart()
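
A "No module named ..." error like this one often means the installed package version lays out its modules differently from the version the demo was written for, or that a local file is shadowing the package. That is an assumption here, since the thread does not state the cause. The sketch below uses only the Python standard library to report which version is installed, where the top-level package was loaded from, and whether the submodule can be resolved. The `diagnose` helper is a hypothetical name, not part of Splink.

```python
from importlib import metadata, util


def diagnose(dist_name: str, module_name: str) -> dict:
    """Report the installed version, package location, and module availability."""
    report = {
        "installed_version": None,
        "package_location": None,
        "module_found": False,
    }

    # Version of the installed distribution, if it is installed at all.
    try:
        report["installed_version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass

    # Where the top-level package resolves from; a path inside your own
    # project directory would suggest a local file is shadowing the package.
    top_level = module_name.split(".")[0]
    try:
        top_spec = util.find_spec(top_level)
    except (ImportError, ValueError):
        top_spec = None
    if top_spec is not None:
        report["package_location"] = top_spec.origin

    # find_spec imports parent packages, so a missing parent raises ImportError.
    try:
        report["module_found"] = util.find_spec(module_name) is not None
    except ImportError:
        report["module_found"] = False

    return report


print(diagnose("splink", "splink.duckdb.linker"))
```

If `module_found` comes back `False` while a version is reported, comparing that version against the one the demo documentation targets would be a reasonable next step.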