Compares TSV files cell by cell, with a focus on similarity.
$ difftsv foo.tsv bar.tsv
[Similarity distribution] (9 rows)
100%: [||||||||||||||||||||||| ] 7/9 ( 77%)
95%: [||| ] 1/9 ( 11%)
90%: [ ] 0/9 ( 0%)
80%: [ ] 0/9 ( 0%)
---: [||| ] 1/9 ( 11%)
Similarity: 88.9% (0.0 sec) MEM:5.2MB
- Supports similarity (reports the percentage of similarity)
- Supports multiple columns as the primary key
- Supports floats (compares float values within a delta)
- Fast (100% written in Crystal)
- Uses a large amount of memory (e.g. 1GB is required for two 35MB files)
A static binary is available for x86_64 Linux.
Pass two TSV files and it reports their similarity.
$ difftsv foo.tsv bar.tsv
Treats the first row as the column names.
date value
01/29 2
01/30 5
If the first line starts with #, it is automatically recognized as a header, regardless of this option.
#date value
...
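The header rule above can be sketched in Python as follows. This is an illustration only, not difftsv's actual code: the function name `split_header` and its `header` parameter are hypothetical, and stripping the leading `#` from the first column name is an assumption.

```python
def split_header(lines, header=False):
    """Return (column_names, data_lines) for a list of TSV lines.

    `header` mimics the option described above (hypothetical name).
    A first line starting with '#' is treated as a header either way.
    """
    if not lines:
        return None, []
    first = lines[0]
    if first.startswith("#"):
        # Assumption: the leading '#' is not part of the first column name.
        return first[1:].split("\t"), lines[1:]
    if header:
        return first.split("\t"), lines[1:]
    # No header: every line is data.
    return None, lines
```

With `header=False`, a file whose first line is `#date<TAB>value` still yields the column names `date` and `value` in this sketch.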
Specifies the primary key columns by 1-based indexes. Accepts the same list format as cut(1). The default is 1.
$ difftsv -f 1-3,5 ...
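To make the field list format concrete, here is a minimal Python sketch that expands a list such as `1-3,5` into 1-based column indexes and builds a key from a row. The function names `parse_field_list` and `primary_key` are hypothetical, and only single fields and closed ranges are handled; open-ended ranges such as `3-`, which cut(1) also accepts, are left out.

```python
def parse_field_list(spec: str) -> list[int]:
    """Expand a cut(1)-style list like '1-3,5' into 1-based indexes."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            # Closed range N-M, both ends inclusive.
            start, end = part.split("-", 1)
            fields.extend(range(int(start), int(end) + 1))
        else:
            fields.append(int(part))
    return fields


def primary_key(row: list[str], fields: list[int]) -> tuple[str, ...]:
    """Build a composite key from the selected 1-based columns."""
    return tuple(row[i - 1] for i in fields)
```

Under these assumptions, `1-3,5` selects columns 1, 2, 3 and 5, and rows from the two files would be matched on that composite key.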
Compares float values with this delta. The default is 0.001.
$ difftsv ...
Similarity: 99.993 (0.0 sec) MEM:5.1MB
$ difftsv --delta 0.1 ...
Similarity: 100 (0.0 sec) MEM:5.2MB
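The idea of a delta comparison can be sketched in Python as below. This is not difftsv's implementation: the function name `cells_equal` is hypothetical, and both the inclusive bound (`<=`) and the fallback to exact string comparison for non-numeric cells are assumptions.

```python
def cells_equal(left: str, right: str, delta: float = 0.001) -> bool:
    """Compare two cells, tolerating small differences between floats."""
    try:
        a, b = float(left), float(right)
    except ValueError:
        # Assumption: non-numeric cells must match exactly.
        return left == right
    # Assumption: a difference of up to `delta` still counts as equal.
    return abs(a - b) <= delta
```

In this sketch, `1.05` and `1.0` differ with the default delta of 0.001 but match once the delta is raised to 0.1, which is the kind of change the two runs above show.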
Outputs only the similarity, as a float value, to STDOUT. If an error occurs, the error message is written to STDERR and nothing is written to STDOUT.
$ difftsv -s ...
100.0
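Because this mode prints a bare number, it is easy to consume from a script. The following Python sketch assumes a `difftsv` binary is on the PATH and uses only the `-s` flag shown above; the helper names `parse_similarity` and `similarity` are hypothetical.

```python
import subprocess


def parse_similarity(stdout: str) -> float:
    """Turn the STDOUT of `difftsv -s` into a float."""
    text = stdout.strip()
    if not text:
        # Per the description above, empty STDOUT means an error occurred.
        raise ValueError("no similarity on STDOUT; see STDERR")
    return float(text)


def similarity(left: str, right: str) -> float:
    """Run `difftsv -s` on two files (assumes difftsv is installed)."""
    result = subprocess.run(
        ["difftsv", "-s", left, right], capture_output=True, text=True
    )
    try:
        return parse_similarity(result.stdout)
    except ValueError:
        # Surface the tool's own error message from STDERR.
        raise RuntimeError(result.stderr.strip()) from None
```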
Shows nothing except error messages. This is useful when you only need the exit status.
$ difftsv -q ...
By default, a CSV parser is used, so data files that contain, for example, strings with double quotes may fail with an error such as:
Expecting comma, newline or end, not '-' at 27029:97
In that case, use "donkey mode", which may be slower but parses the data in a simpler way.
$ difftsv -L donkey ...
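The difference between the two approaches can be illustrated in Python. Here Python's `csv` module in strict mode stands in for the CSV parser, and plain tab splitting stands in for donkey mode. Both are assumptions for illustration, since how donkey mode works internally is not described here. The function names are hypothetical.

```python
import csv

# A cell that starts with a quoted string followed by more text.
line = 'id1\t"foo"-bar'


def parse_strict(line: str) -> list[str]:
    """Parse one line with a quote-aware CSV parser (tab-delimited)."""
    return next(csv.reader([line], delimiter="\t", strict=True))


def parse_simple(line: str) -> list[str]:
    """Split on tabs only; double quotes are ordinary characters."""
    return line.rstrip("\n").split("\t")


try:
    parse_strict(line)
except csv.Error as error:
    # The quote-aware parser rejects the text after the closing quote.
    print("strict parser failed:", error)

print(parse_simple(line))
```

The quote-aware parser rejects the `-` that follows the closing quote, much like the error shown above, while the simple splitter keeps the cell as is.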
- cli: Specify the value keys
- lib: handle duplicated keys (skip or raise)
- lib: write different rows and cols into file
As the repository name suggests, it is also available as a Crystal library. See README.cr.md for details.
- maiha - creator and maintainer