
Levenshtein help #699

Answered by RobinL
msiemionCalistapw asked this question in Q&A

That happens in DuckDB if you call the levenshtein function where one of the inputs is a zero-length string, i.e. "". You need to turn zero-length strings into true nulls before passing your data into Splink.

Sample code for cleaning up the input dataframe:

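import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# deal with col a: empty or whitespace-only strings become None
df = df.replace(r'^\s*$', None, regex=True)

# deal with col b and c: pd.NA and NaN become None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])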
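
If the records are still plain Python dicts before they go into a dataframe, the same idea can be sketched without pandas. This is only an illustration: `blank_to_none` and `clean_records` are hypothetical helper names, not part of Splink or DuckDB, and the sketch does not handle pd.NA (the dataframe code above covers that case).

```python
import math


def blank_to_none(value):
    """Return None for missing or blank values, otherwise the value unchanged."""
    if value is None:
        return None
    # float('nan') counts as missing (numpy.nan is also a Python float)
    if isinstance(value, float) and math.isnan(value):
        return None
    # empty or whitespace-only strings count as missing
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


def clean_records(records):
    """Apply blank_to_none to every value in a list of dict records."""
    return [{key: blank_to_none(val) for key, val in rec.items()} for rec in records]


# Hypothetical example records
records = [
    {"first_name": "", "surname": "   ", "age": float("nan")},
    {"first_name": "Ann", "surname": "Lee", "age": 30.0},
]
cleaned = clean_records(records)
```

Whichever route you take, the point is that no column passed to levenshtein should still contain "" once the data reaches Splink.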

Answer selected by RobinL