Skip to content
This repository has been archived by the owner on Mar 19, 2024. It is now read-only.
/ pydedupe Public archive

(Archived) A Python library for record linkage and deduplication.

License

Notifications You must be signed in to change notification settings

gpoulter/pydedupe

Repository files navigation

PyDedupe

PyDedupe Logo

Archival

PyDedupe is archived (read-only) because it has not been updated since 2008 and was never converted to Python 3.

Introduction

PyDedupe is a Python library for performing record linkage, which identifies similar groups of records.

Background

I wrote and used PyDedupe at Naspers in 2007 and 2008, and received permission to publish it under the GNU General Public License.

PyDedupe supports a general model of tabular data and decouples the input formats from the algorithm.

The only other open-source Python record-linkage library at the time was FEBRL which supported only scalar-valued fields. PyDedupe supports row transformations for generated fields, multi-valued fields derived from delimited values in a column or combined from several columns, and compound values such as geographic coordinates. The API operates on iterations of tuples so that it is decoupled from input which could come from databases or files. A convenience module loads records from CSV files and re-writes them with similar records grouped together.

How record linkage works

The general strategy for record linkage is:

  1. Index records into blocks
  2. Compare all pairs of records in each block with a similarity function
  3. Cluster record pairs into "matches" and "non-matches" from the vector of similarity values.

What PyDedupe offers

The PyDedupe API offers multiple levels of abstraction:

  • Low-level functions to
    • Normalise values
    • Generate indexed values
    • Compare values for similarity
    • Do binary classification of floating-point vectors
  • Higher level classes to
    • Index records into blocks
    • Compare pairs of records for similarity vectors
    • Classify pairs of records as matches/non-matches
    • Group records together
  • Highest level API to
    • Use a record linkage strategy
    • Accept records from CSV input and write groups to CSV output

How to use PyDedupe

Using the library on CSV files requires writing a small script that defines the strategies for indexing, comparison and classification, then calls a high-level function with the name of the CSV input file and a folder in which to write the output. Records may be linked either within a single file, or between two files.

Record linkage on a database requires writing additional code to retrieves tuples, using the PyDedupe API to index, compare and classify the tuples, thentag the pairs of linked records in the database - or present a user interface for manually merging them.

About

(Archived) A Python library for record linkage and deduplication.

Topics

Resources

License

Stars

Watchers

Forks

Languages