jschaul/duplicates

duplicates

Find duplicate sentences or sentence fragments in a large (e.g. book-length) text file.

Behaviour is primitive: text is split only on newlines and punctuation, and any fragments shorter than 20 characters are ignored. Nothing fancy is done (I'm using simple lists rather than better-performing suffix trees), but a 400-page test document is still processed in under a second.
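
The approach described above can be sketched in a few lines of Haskell. This is a minimal illustration, not the project's actual source: the names isBoundary, fragments, minLength and duplicates are hypothetical, and the exact set of punctuation characters is an assumption. Only the 20-character threshold and the list-based approach come from the description.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, group, sort)

-- Characters treated as fragment boundaries. This set is an assumption;
-- the real project may split on a different set of punctuation.
isBoundary :: Char -> Bool
isBoundary c = c `elem` "\n.,;:!?"

-- Split the text at every boundary character and trim whitespace
-- from both ends of each resulting fragment.
fragments :: String -> [String]
fragments s = case break isBoundary s of
  (chunk, [])       -> [trim chunk]
  (chunk, _ : rest) -> trim chunk : fragments rest
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace

-- Fragments shorter than this are ignored (20 is the figure from the text).
minLength :: Int
minLength = 20

-- Sort the long-enough fragments, group equal neighbours, and report
-- every fragment that occurs more than once together with its count.
duplicates :: String -> [(String, Int)]
duplicates txt =
  [ (f, length g)
  | g@(f : _) <- group (sort candidates)
  , length g > 1
  ]
  where
    candidates = filter ((>= minLength) . length) (fragments txt)
```

Sorting and grouping plain lists costs O(n log n) comparisons, which is enough for a book-length input; a suffix tree would only be needed to catch repeats that do not line up with punctuation boundaries.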

Status: experimental. Things are hardcoded.

Requirements

  • Haskell Stack, to compile this project.
  • pandoc, which you will probably want for converting your document to plain text, since few people write books in text or markdown.

How to use

Create an input.txt file in the current directory, e.g. by using pandoc to convert your document if necessary:

pandoc <input file> -o input.txt --wrap=none

Compile with make build, then analyse input.txt for duplicates with make run.
