jschaul/duplicates

duplicates

Find duplicate sentences or sentence fragments in a large (e.g. book-length) text file.

Behaviour is primitive: text is split only on newlines and punctuation, and any fragment shorter than 20 characters is ignored. Nothing fancy is done (simple lists are used rather than better-performing suffix trees), yet a 400-page test document is processed in under a second.
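The approach described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual source: the names `minLength`, `isDelimiter`, `fragments` and `duplicates` are hypothetical, and the exact set of punctuation characters is an assumption.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, group, sort)

-- Threshold from the description above: shorter fragments are ignored.
minLength :: Int
minLength = 20

-- Assumed delimiter set (newline plus common punctuation);
-- the real project may split on a different set.
isDelimiter :: Char -> Bool
isDelimiter c = c `elem` "\n.,;:!?"

-- Strip leading and trailing whitespace from a fragment.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Split the text on delimiters, trim each piece,
-- and drop pieces shorter than minLength.
fragments :: String -> [String]
fragments = filter ((>= minLength) . length) . map trim . splitOnDelims
  where
    splitOnDelims s = case break isDelimiter s of
      (chunk, [])       -> [chunk]
      (chunk, _ : rest) -> chunk : splitOnDelims rest

-- Fragments occurring more than once, with their counts.
-- Uses plain lists only: sort, then group equal neighbours.
duplicates :: String -> [(String, Int)]
duplicates text =
  [ (f, length g)
  | g@(f : _) <- group (sort (fragments text))
  , length g > 1
  ]
```

Loaded in GHCi, `duplicates <$> readFile "input.txt"` would list each repeated fragment with its count. Sorting and grouping keeps the sketch on plain lists, in the spirit of the description above; a suffix tree would also catch repeats that do not align with punctuation boundaries.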

Status: experimental. Things are hardcoded.

Requirements

  • Haskell Stack, to compile this project.
  • Since few people write books in plain text or Markdown, you will probably also want pandoc to convert your document.

How to use

Create an input.txt file in the current directory, using pandoc to convert your document if necessary:

pandoc <input file> -o input.txt --wrap=none

Compile with make build, then analyse for duplicates with make run.
