Skip to content

lifeunleaded/textcompare

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

textcompare

License

Experimental PoC to piggyback on the NCD text comparison used in this paper. This, however, tries to use it to just measure the distance between strings to see if they can serve as suggestions for each other, and omits labelling and categorization using k-nearest-neighbor.

Prerequisites: numpy, saxonche, python-Levenshtein (pip installable)

textdiffs.py takes one more more input file names as argument. If there is only one and it ends in .xml it is assumed to be a docbook document where the sections are to be compared. If there are multiple files, the files are compared as text.

To run, simply do ./textdiffs.py inputfile1...

The result is a directory report/ with an index.html contains the similarity report, first for NCD, followed by Levenshtein.