PyDedupe

Archival

PyDedupe is archived (read-only) because it has not been updated since 2008 and was never converted to Python 3.

Introduction

PyDedupe is a Python library for record linkage, which identifies groups of similar records.

Background

I wrote and used PyDedupe at Naspers in 2007 and 2008, and received permission to publish it under the GNU General Public License.

PyDedupe supports a general model of tabular data and decouples the input formats from the algorithm.

The only other open-source Python record-linkage library at the time was FEBRL, which supported only scalar-valued fields. PyDedupe supports row transformations for generated fields, multi-valued fields derived from delimited values in a column or combined from several columns, and compound values such as geographic coordinates. The API operates on iterables of tuples, so it is decoupled from the input, which could come from databases or files. A convenience module loads records from CSV files and rewrites them with similar records grouped together.
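To make the field types above concrete, here is a minimal sketch in plain Python. It does not use the PyDedupe API: `split_multivalue`, `transform`, and the column layout are hypothetical names and assumptions chosen for illustration only.

```python
def split_multivalue(value, delimiter=";"):
    """Split a delimited column into a tuple of stripped, non-empty values."""
    return tuple(part.strip() for part in value.split(delimiter) if part.strip())


def transform(rows):
    """Lazily turn raw tuples into records with richer field types.

    Assumed input layout (hypothetical):
    (name, phones, home_email, work_email, lat, lon)
    """
    for name, phones, home_email, work_email, lat, lon in rows:
        # Multi-valued field combined from several columns, skipping blanks.
        emails = tuple(
            e.strip().lower() for e in (home_email, work_email) if e.strip()
        )
        yield (
            name,
            name.lower().split()[-1],   # generated field: lower-cased surname
            split_multivalue(phones),   # multi-valued field from one column
            emails,
            (float(lat), float(lon)),   # compound value: geographic coordinates
        )


# Any iterable of tuples works as input: a list, a CSV reader, a DB cursor.
raw_rows = [
    ("John Smith", "021 555 0101; 082 555 0199",
     "john@example.com", "", "-33.92", "18.42"),
]
records = list(transform(raw_rows))
```

Because `transform` is a generator over an arbitrary iterable, the same code can be fed from a file or a database cursor without change, which is the decoupling described above.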

How record linkage works

The general strategy for record linkage is:

1. Index records into blocks.
2. Compare all pairs of records in each block with a similarity function, giving a vector of similarity values for each pair.
3. Cluster record pairs into "matches" and "non-matches" based on each pair's vector of similarity values.
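The three steps can be sketched with the standard library alone. This is an illustration of the general strategy, not PyDedupe's implementation: the record layout, the blocking key, the use of `difflib.SequenceMatcher` as the similarity function, and the fixed threshold rule (standing in for a real clustering step) are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: (id, name, city)
records = [
    (1, "John Smith", "Cape Town"),
    (2, "Jon Smith", "Cape Town"),
    (3, "Jane Doe", "Durban"),
    (4, "John Smyth", "Cape Town"),
    (5, "Sam Brown", "Cape Town"),
]


def block_key(record):
    """Step 1 helper: block on the first letter of the last word of the name."""
    return record[1].split()[-1][0].lower()


def index_records(rows):
    """Step 1: index records into blocks that share a key."""
    blocks = defaultdict(list)
    for record in rows:
        blocks[block_key(record)].append(record)
    return blocks


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b):
    """Step 2: build a similarity vector (name, city) for a pair of records."""
    return (similarity(rec_a[1], rec_b[1]), similarity(rec_a[2], rec_b[2]))


def is_match(vector, threshold=0.8):
    """Step 3 (simplified): a pair matches if every component is high enough."""
    return min(vector) >= threshold


def link(rows):
    """Return the id pairs judged to be matches, comparing only within blocks."""
    matches = []
    for block in index_records(rows).values():
        for rec_a, rec_b in combinations(block, 2):
            if is_match(compare(rec_a, rec_b)):
                matches.append((rec_a[0], rec_b[0]))
    return matches


matches = link(records)
```

Blocking is what keeps the comparison step tractable: only records that share a block key are compared, so records 3 and 5 are never paired with the "Smith" records here.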

What PyDedupe offers

The PyDedupe API offers multiple levels of abstraction:

Low-level functions to
- Normalise values
- Generate indexed values
- Compare values for similarity
- Do binary classification of floating-point vectors
Higher-level classes to
- Index records into blocks
- Compare pairs of records for similarity vectors
- Classify pairs of records as matches/non-matches
- Group records together
Highest-level API to
- Use a record linkage strategy
- Accept records from CSV input and write groups to CSV output

How to use PyDedupe

Using the library on CSV files requires writing a small script that defines the strategies for indexing, comparison and classification, then calls a high-level function with the name of the CSV input file and a folder in which to write the output. Records may be linked either within a single file, or between two files.
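The shape of such a script, for the single-file case, might look like the sketch below. It uses only the standard `csv` module and is not the PyDedupe high-level function: `link_csv` and `group_key` are hypothetical names, the sketch takes open file objects rather than a file name and output folder, and exact matching on a normalised name stands in for a real indexing, comparison and classification strategy.

```python
import csv
import io
from collections import defaultdict


def group_key(row):
    """Stand-in strategy: rows with the same normalised name form one group."""
    return "".join(row["name"].lower().split())


def link_csv(infile, outfile, key=group_key):
    """Read records from a CSV file object and rewrite them grouped together.

    Each output row gains a leading "group" column, and rows in the same
    group are written on adjacent lines.
    """
    reader = csv.DictReader(infile)
    groups = defaultdict(list)
    for row in reader:
        groups[key(row)].append(row)

    writer = csv.DictWriter(outfile, fieldnames=["group"] + list(reader.fieldnames))
    writer.writeheader()
    for group_id, rows in enumerate(groups.values(), start=1):
        for row in rows:
            writer.writerow({"group": group_id, **row})


# In-memory demonstration; real use would pass files opened with newline="".
source = io.StringIO("id,name\n1,John Smith\n2,Jane Doe\n3,john  smith\n")
output = io.StringIO()
link_csv(source, output)
```

Passing the strategy in as a function (`key` here) mirrors the idea that the script defines the strategy and a generic routine handles reading and writing.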

Record linkage on a database requires writing additional code to retrieve tuples, use the PyDedupe API to index, compare and classify them, and then tag the pairs of linked records in the database or present a user interface for manually merging them.
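As a rough sketch of that extra code, the example below uses `sqlite3` from the standard library. The `person` and `linked_pair` tables, the function names, and the similarity threshold are all assumptions made for illustration, and the comparison step is a plain `difflib` ratio rather than a call into the PyDedupe API.

```python
import sqlite3
from difflib import SequenceMatcher
from itertools import combinations


def fetch_tuples(conn):
    """Retrieve records from the (hypothetical) person table as tuples."""
    return conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()


def linked_pairs(rows, threshold=0.85):
    """Yield id pairs whose names are similar enough to be considered linked.

    This compares every pair; a real run would index into blocks first.
    """
    for (id_a, name_a), (id_b, name_b) in combinations(rows, 2):
        ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if ratio >= threshold:
            yield id_a, id_b


def tag_pairs(conn, pairs):
    """Record the linked pairs in the database for later review or merging."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS linked_pair (id_a INTEGER, id_b INTEGER)"
    )
    conn.executemany("INSERT INTO linked_pair VALUES (?, ?)", pairs)
    conn.commit()


# In-memory demonstration database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "John Smith"), (2, "Jon Smith"), (3, "Jane Doe")],
)
tag_pairs(conn, linked_pairs(fetch_tuples(conn)))
```

Storing the pairs in a separate table, instead of merging rows directly, leaves room for the manual review step mentioned above.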
