π§ under construction π§
GroupByRule is a Python package for data cleaning and deduplication. It integrates with pandas' groupby
function to not only group dataframe rows by a given identifier, but also groups rows based on logical rules and partial matching. In other words, it provides tools for deterministic record linkage and entity resolution in structured databases. It can also be used for blocking, a form of filtering used to speed-up more complex entity resolution algorithms. See the references below to learn more about these topics.
One of the main goal of GroupByRule is to be user-friendly. Matching rules and clustering algorithms are composable and the performance of algorithms can be readily evaluated given training data. The package is built on top of pandas for data manipulation and on igraph for graph clustering and related computations.
Additionally, GroupByRule provides highly efficient C++ implementations of common string distance functions through its comparator
submodule. This can be used independently of record linkage algorithms.
Install from github using the following command:
pip install git+https://github.com/OlivierBinette/groupbyrule.git
Consider the RLdata500
dataset from the RecordLinkage R package.
from groupbyrule.data import load_RLdata500
df = load_RLdata500()
df.head()
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | identity | |
---|---|---|---|---|---|---|---|---|
1 | CARSTEN | NaN | MEIER | NaN | 1949 | 7 | 22 | 34 |
2 | GERD | NaN | BAUER | NaN | 1968 | 7 | 27 | 51 |
3 | ROBERT | NaN | HARTMANN | NaN | 1930 | 4 | 30 | 115 |
4 | STEFAN | NaN | WOLFF | NaN | 1957 | 9 | 2 | 189 |
5 | RALF | NaN | KRUEGER | NaN | 1966 | 1 | 13 | 72 |
We deduplicate this dataset by linking records which match either on both first name (fname_c1
) and last name (lname_c1
), on both first name and birth day (bd
), or on both last name and birth day. Linkage transitivity is resolved, by default, by considering connected components of the resulting graph. Precision and recall are computed from the ground truth membership vector identity_RLdata500
.
from groupbyrule import Any, Match, precision_recall
# Specify linkage rule
rule = Any(Match("fname_c1", "lname_c1"),
Match("fname_c1", "bd"),
Match("lname_c1", "bd"))
# Apply the rule to a dataset
rule.fit(df)
# Evaluate performance by computing precision and recall
precision_recall(rule.groups, df.identity)
(0.11538461538461539, 0.96)
This is not the best way to deduplicate this dataset, but the above showcases the composability of matching rules. The specific rules themselves (exact matching, similarity-based string matching, and different clustering algorithms) can be customized as needed. A more complete overview is available here π§.
A better way to deduplicate this data is to link all pairs of records which agree on all but at most one attribute. This is done below.
from groupbyrule import AllButK
# Link records agreeing on all but at most k=1 of the specified attributes
rule = AllButK("fname_c1", "lname_c1", "bd", "bm", "by", k=1)
# Apply the rule to a dataset
rule.fit(df)
# Evaluate performance by computing precision and recall
precision_recall(rule.groups, df.identity)
(1.0, 0.92)
Following record linkage, records can be processed using pandas's groupby and aggregation functions. Below, we only keep the first non-NA attribute value for each record cluster. This is a simple way to obtain a deduplicated dataset.
deduplicated = df.groupby(rule.groups).first()
GroupByRule provides a suite of string and numerical similarity functions as part of its comparator
submodule. String similarity functions include the Levenshtein distance, the Jaro-Winkler distance, and π§. These similarity functions can be used on their own as shown below, or for the definition of linkage rules as explained in the following section. This is heavily inspired by Neil Marchant's excellent Comparator R package, but not quite equivalent in its scope and implementation.
String distance functions are implemented through subclasses of the Comparator
abstract base case. Comparator
objects are used to instanciate comparison functions while allowing data in memory to be recycled across function calls. The compare()
method can then be used to compare elements, the pairwise()
method can be used to compare all pairs of elements between two lists, and the elementwise()
method can be used to compare corresponding elements.
Below are examples of the comparison functions currently provided. These are implemented in C++ for efficiency.
from groupbyrule.comparator import Levenshtein
cmp = Levenshtein(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.25
from groupbyrule.comparator import LCSDistance
cmp = LCSDistance(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.25
from groupbyrule.comparator import DamerauLevenshtein
cmp = DamerauLevenshtein(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.13333333333333333
from groupbyrule.comparator import Jaro
cmp = Jaro(similarity=False)
cmp.compare("Olivier", "Oilvier")
0.04761904761904756
from groupbyrule.comparator import JaroWinkler
cmp = JaroWinkler(similarity=False)
cmp.compare("Olivier", "Oilvier")
0.042857142857142816
π§
π§
π§
π§
π§
π§
π§
π§