dry4python finds candidate duplicate Python code across files and directories. It reports fuzzy structural matches by filename and line range so another mechanism can evaluate and reduce duplication.
dry4python is a Python-focused port of dry4go, adapted to Python syntax, AST nodes, and project conventions.
dry4python compares Python functions and methods by converting each function body and signature shape into normalized syntax nodes. The normalized tree is walked to collect a set of structural fingerprints, one for the whole function and one for each nested syntax node.
Similarity is Jaccard similarity over those fingerprint sets:
score = shared fingerprints / all fingerprints seen in either function
A score of 1.0 means the normalized structures have the same fingerprint set. Lower scores mean the functions still share structure, but each function also has structure the other does not. The default --threshold 0.82 reports candidates whose normalized structures are close enough to be worth review.
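The score formula above can be sketched in a few lines of Python. This is a minimal illustration, not dry4python's own code: the function name `jaccard` and the fingerprint strings are hypothetical, and treating two empty sets as identical is an assumption.

```python
def jaccard(left, right):
    """Jaccard similarity: shared fingerprints / all fingerprints in either set."""
    left, right = set(left), set(right)
    union = left | right
    if not union:
        # Assumption: two empty fingerprint sets count as identical.
        return 1.0
    return len(left & right) / len(union)


# Hypothetical fingerprint sets for two functions.
a = {"f1", "f2", "f3", "f4"}
b = {"f2", "f3", "f4", "f5"}

print(jaccard(a, b))  # 3 shared / 5 total = 0.6
print(jaccard(a, a))  # 1.0
```

With the default threshold of 0.82, the pair above (0.6) would not be reported.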
Python differs from Go in important ways, so dry4python treats functions, class methods, nested functions, and nested class methods as the comparison units and parses source with Python's ast module rather than comparing source text. Identifiers, local names, attribute names, keyword names, literal values, and docstrings are normalized away. Structural Python syntax is preserved, including:
- function, async function, method, and async method shape
- positional, keyword-only, variadic, and annotated parameters
- return annotations and decorators
- blocks and statement order
- if, for, while, with, try, and match
- assignments, returns, calls, attributes, subscripts, and slices
- lists, tuples, sets, dictionaries, comprehensions, and lambdas
- operators such as +, ==, and, and or
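The normalization and fingerprinting described above might look roughly like the sketch below. It is an assumption-laden illustration rather than dry4python's implementation: the helper names `shape` and `fingerprints` are hypothetical, SHA-1 is an arbitrary choice of hash, and docstring stripping is left out for brevity. The idea is to keep only node types and child structure, dropping every string, number, and other non-node field, which is where identifiers and literal values live in Python's ast.

```python
import ast
import hashlib


def shape(node):
    """Describe a node by type and child structure only (hypothetical helper)."""
    if isinstance(node, list):
        return tuple(shape(item) for item in node)
    if isinstance(node, ast.AST):
        # Keep fields that hold nodes or lists; names and literal values are
        # plain str/int/etc. fields and are dropped here.
        children = tuple(
            (field, shape(value))
            for field, value in ast.iter_fields(node)
            if isinstance(value, (ast.AST, list))
        )
        return (type(node).__name__, children)
    return None


def fingerprints(func):
    """One fingerprint for the function and one for each nested node."""
    return {
        hashlib.sha1(repr(shape(node)).encode()).hexdigest()
        for node in ast.walk(func)
    }


SOURCE = '''
def alpha(xs):
    ys = []
    for x in xs:
        if x % 2 == 1:
            ys.append(x + 1)
    return ys

def beta(items):
    kept = []
    for item in items:
        if item % 2 == 0:
            kept.append(item + 1)
    return kept

def gamma(items):
    return [item for item in items if item]
'''

alpha, beta, gamma = ast.parse(SOURCE).body
print(fingerprints(alpha) == fingerprints(beta))   # True: same structure
print(fingerprints(alpha) == fingerprints(gamma))  # False: different structure
```

Operators such as % and == survive because the ast module represents them as nodes (Mod, Eq), while the names and the constants 0 and 1 are plain field values and disappear.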
For example, these functions can match strongly even though their names, local variables, predicates, and attribute names differ:
def alpha(xs):
    ys = []
    for x in xs:
        if x % 2 == 1:
            ys.append(x + 1)
    return ys

def beta(items):
    kept = []
    for item in items:
        if item % 2 == 0:
            kept.append(item + 1)
    return kept

dry4python [options] [file-or-directory ...]

Options:
--threshold N    Minimum structural similarity score, default 0.82
--min-lines N    Minimum source lines in a candidate function, default 4
--min-nodes N    Minimum normalized syntax nodes, default 20
--format F       text or json, default text
--exclude GLOB   Exclude paths matching GLOB; can be repeated
--json           Same as --format json
--text           Same as --format text
Examples:
dry4python .
dry4python package/foo.py package/bar.py
dry4python --json --threshold 0.9 ./src ./tests
python -m dry4python --threshold 0.75 .
dry4python --exclude '*/migrations/*' --exclude '*_pb2.py' .

Every file named on the command line participates in the same duplication search. When an argument is a directory, dry4python recursively includes every .py file under that directory in the same search set, skipping common generated and environment directories such as .git, .venv, venv, __pycache__, __pypackages__, .eggs, migrations, build, and dist. It also skips common generated files such as *_pb2.py, *_pb2_grpc.py, and *_generated.py by default.
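The directory walk described above could be approximated as follows. This is a sketch under assumptions, not the tool's actual code: the function name `collect_python_files` is hypothetical, and matching --exclude globs against the full path with fnmatch is a guess at the matching semantics.

```python
import fnmatch
import os

# Directory and file patterns listed in the paragraph above.
SKIP_DIRS = {".git", ".venv", "venv", "__pycache__", "__pypackages__",
             ".eggs", "migrations", "build", "dist"}
SKIP_FILES = ("*_pb2.py", "*_pb2_grpc.py", "*_generated.py")


def collect_python_files(root, exclude=()):
    """Return .py files under root, skipping default and user-excluded paths."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not name.endswith(".py"):
                continue
            if any(fnmatch.fnmatch(name, pattern) for pattern in SKIP_FILES):
                continue
            # Assumption: user globs are matched against the whole path.
            if any(fnmatch.fnmatch(path, pattern) for pattern in exclude):
                continue
            found.append(path)
    return found


if __name__ == "__main__":
    for path in collect_python_files(".", exclude=["*/vendor/*"]):
        print(path)
```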
Default text output is intended for quick reading. Pairwise matches that belong to the same connected duplicate cluster are collapsed into a group to avoid pairwise spam:
DUPLICATE GROUP score=0.89 pairs=3
package/billing/invoice.py:12-25 Invoice.summary
package/billing/receipt.py:30-44 Receipt.report
package/billing/quote.py:42-56 Quote.summary
Two-location groups are printed as a simple pair:
DUPLICATE score=0.89
package/billing/invoice.py:12-25 Invoice.summary
package/billing/receipt.py:30-44 Receipt.report
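Collapsing pairwise matches into connected clusters is a standard connected-components problem. The sketch below uses a small union-find to illustrate the idea; it is not taken from dry4python, the name `group_pairs` is hypothetical, and it deliberately leaves out how a group's score is derived because the text above does not specify that.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Merge (left, right) location pairs into connected duplicate groups."""
    parent = {}

    def find(item):
        # Follow parent links to the group representative, halving paths as we go.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in pairs:
        parent[find(left)] = find(right)

    members = defaultdict(set)
    pair_counts = defaultdict(int)
    for left, right in pairs:
        root = find(left)
        members[root].update((left, right))
        pair_counts[root] += 1

    return [
        {"locations": sorted(members[root]), "pairs": pair_counts[root]}
        for root in members
    ]


pairs = [
    ("invoice.py:12-25", "receipt.py:30-44"),
    ("receipt.py:30-44", "quote.py:42-56"),
    ("invoice.py:12-25", "quote.py:42-56"),
    ("a.py:1-9", "b.py:1-9"),
]
for group in group_pairs(pairs):
    print(group["pairs"], group["locations"])
```

Here the first three pairs collapse into one group of three locations, while the last pair stays a two-location group.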
JSON output is intended for tools. It includes the raw pairwise candidates and the grouped clusters:
{
  "candidates": [
    {
      "score": 0.8909090909090909,
      "left": {"file": "package/billing/invoice.py", "start_line": 12, "end_line": 25, "qualname": "Invoice.summary"},
      "right": {"file": "package/billing/receipt.py", "start_line": 30, "end_line": 44, "qualname": "Receipt.report"},
      "left_nodes": 88,
      "right_nodes": 91
    }
  ],
  "groups": [
    {
      "score": 0.8909090909090909,
      "locations": [
        {"file": "package/billing/invoice.py", "start_line": 12, "end_line": 25, "qualname": "Invoice.summary"},
        {"file": "package/billing/receipt.py", "start_line": 30, "end_line": 44, "qualname": "Receipt.report"}
      ],
      "pairs": 1
    }
  ]
}

Project defaults can be stored in pyproject.toml:
[tool.dry4python]
threshold = 0.85
min-lines = 5
min-nodes = 30
format = "text"
paths = ["src", "tests"]
exclude = ["*/migrations/*", "*_pb2.py"]

dry4python reads the nearest parent pyproject.toml. Configured relative paths are resolved from that file's directory. Command-line options override configured scalar values, and command-line paths override configured paths. --exclude patterns are added to configured excludes.
Use comments to suppress known false positives:
# dry4python: ignore

At the top of a file, this comment skips the whole file. Placed immediately before a function, method, or decorator, or on the def line itself, it skips that function. # dry4python: ignore-file skips the whole file no matter which line it appears on.
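One way such suppression comments could be detected is sketched below. This is illustrative only and rests on assumptions: the helper names are hypothetical, "top of a file" is taken to mean the first non-blank line, and the regular expressions would also match marker text inside a string literal, which a real implementation would likely avoid by tokenizing.

```python
import ast
import re

# "ignore" must not be followed by "-file" or other word characters.
IGNORE = re.compile(r"#\s*dry4python:\s*ignore(?![-\w])")
IGNORE_FILE = re.compile(r"#\s*dry4python:\s*ignore-file\b")


def file_is_ignored(source):
    """True if the file opts out via ignore-file anywhere or ignore at the top."""
    lines = source.splitlines()
    if any(IGNORE_FILE.search(line) for line in lines):
        return True
    # Assumption: "top of a file" means the first non-blank line.
    first = next((line for line in lines if line.strip()), "")
    return bool(IGNORE.search(first))


def ignored_functions(source):
    """Names of functions marked on the def line or on the line just above."""
    lines = source.splitlines()
    names = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # The function starts at its first decorator, if it has any.
        first = min([node.lineno, *(d.lineno for d in node.decorator_list)])
        before = lines[first - 2] if first >= 2 else ""
        if IGNORE.search(lines[node.lineno - 1]) or IGNORE.search(before):
            names.add(node.name)
    return names


SAMPLE = '''import functools

# dry4python: ignore
@functools.cache
def a():
    return 1

def b():  # dry4python: ignore
    return 2

def c():
    return 3
'''

print(sorted(ignored_functions(SAMPLE)))  # ['a', 'b']
print(file_is_ignored(SAMPLE))            # False
```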
python -m unittest discover -s tests
python -m dry4python --help
python -m dry4python --threshold 0.75 .

MIT. See LICENSE.