Text Similarity in Java

This project measures the similarity between text documents.

You can use it to build a Related Posts widget for a blog that lists other posts similar to the one the user is currently viewing, like the example below:

Related Posts:

What are RESTful Web Services?

The SQL Developer's Guide to REST Services

How to Model Workflows in REST APIs

Avoid Data Corruption in Your REST API with ETags

The underlying implementation converts the documents to term frequency-inverse document frequency (tf-idf) vectors and measures the cosine distance between them. Documents whose vectors point in nearly the same direction are considered similar.
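To make the idea concrete, here is a minimal, self-contained sketch of tf-idf weighting and cosine similarity (cosine distance is simply 1 minus this value). It is not the project's code: the class name TfIdfSketch, its methods, the naive tokenizer, and the choice of tf = count / document length with idf = ln(N / document frequency) are all assumptions made for illustration, and the project's actual weighting may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not the project's API.
final class TfIdfSketch {

    // Naive tokenizer (an assumption): lowercase, split on non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Builds one sparse tf-idf vector (term -> weight) per document.
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            tokenized.add(tokens);
            // Count each term once per document.
            for (String term : new HashSet<>(tokens)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> tokens : tokenized) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : tokens) {
                vec.merge(term, 1.0, Double::sum); // raw term count
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double tf = e.getValue() / tokens.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                e.setValue(tf * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<Map<String, Double>> v = tfIdf(Arrays.asList(
                "My cat is yellow", "My cat is nice.", "My dog is called Charlie"));
        System.out.println("doc1 vs doc2: " + cosine(v.get(0), v.get(1)));
        System.out.println("doc1 vs doc3: " + cosine(v.get(0), v.get(2)));
    }
}
```

With these three snippets, words that occur in every document ("my", "is") get an idf of zero, so only the shared word "cat" contributes, and the first snippet scores higher against the second than against the third.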

Prerequisites

Maven 3
Java 8+
IntelliJ IDEA (optional)

Code examples

Example 1: The minimal example

The code below shows a minimal example with three short text snippets (in this context, a document can be as small as a single sentence):

TextSimilarity ts = new TextSimilarity();
ts.addDocument("doc1", "My cat is yellow");
ts.addDocument("doc2", "My cat is nice.");
ts.addDocument("doc3", "My dog is called Charlie");
ts.calculate();

List<String> similarDocs = ts.getSimilarDocuments("doc1");
for(String s : similarDocs) {
    System.out.println(s);
}

The code's output is shown below:

doc2
doc3

In the output, doc2 is shown before doc3, because doc2 is more similar to doc1.

Note that the addDocument method has two parameters. The first is a unique identifier of the document, such as a primary key from a database or a URL; the only requirement is that it is unique across all the documents. The second is the text of the document.

Example 2: Related posts on a blog

You can combine TextSimilarity with Jsoup to find related blog posts.

For example, if I want to see which posts on my blog are similar to 7 Tips for Designing a Better REST API, I can use the code snippet below:

TextSimilarity ts = new TextSimilarity();

Document feed = Jsoup.connect("http://www.kennethlange.com/feed/").get();
for(Element linkElement : feed.select("link")) {
    Document blogPost = Jsoup.connect(linkElement.text()).get();
    ts.addDocument(linkElement.text(), blogPost.select("body").text());
}
ts.calculate();

List<String> similarDocs = ts.getSimilarDocuments("https://www.kennethlange.com/7-tips-for-designing-a-better-rest-api/");
for(String s : similarDocs) {
    System.out.println(s);
}

The output is listed below (notice that all the posts are about REST services):

https://www.kennethlange.com/what-are-restful-web-services/
https://www.kennethlange.com/the-sql-developers-guide-to-rest-services/
https://www.kennethlange.com/how-to-model-workflows-in-rest-apis/
https://www.kennethlange.com/avoid-data-corruption-in-your-rest-api-with-etags/
https://www.kennethlange.com/dont-limit-your-rest-api-to-crud-operations/

Advanced usage

Implement your own tokenizer

The project includes a tokenizer (TokenizerImpl) that is used for splitting text documents into tokens.

If you want to implement your own tokenizer (for example, to add stemming or support for non-English languages), you can provide a custom implementation of the Tokenizer interface, as shown below. In practice you would probably use a named class rather than an anonymous one, but you get the idea:

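TextSimilarity ts = new TextSimilarity(new Tokenizer() {
    @Override
    public List<String> tokenize(String document) {
        // Placeholder: lowercase the text and split it on whitespace.
        // Replace this with your own logic (stemming, language rules, etc.).
        return Arrays.asList(document.toLowerCase().split("\\s+"));
    }
});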
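As a slightly fuller illustration, the sketch below shows what a named tokenizer class for non-English text might look like. The Tokenizer interface declared here is only a stand-in so that the snippet compiles on its own (its single method mirrors the signature used above); UnicodeTokenizer and its minLength parameter are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for the project's Tokenizer interface so this sketch compiles
// on its own; drop it when compiling against the real project.
interface Tokenizer {
    List<String> tokenize(String document);
}

// Hypothetical named implementation; not part of the project.
final class UnicodeTokenizer implements Tokenizer {
    private final int minLength;

    // Tokens shorter than minLength are dropped (at least 1, so no empty tokens).
    UnicodeTokenizer(int minLength) {
        this.minLength = Math.max(1, minLength);
    }

    @Override
    public List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter or digit in any script,
        // so accented and non-Latin letters stay inside their tokens.
        for (String raw : document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.length() >= minLength) {
                tokens.add(raw);
            }
        }
        return tokens;
    }
}
```

An instance such as new UnicodeTokenizer(2) could then be passed to the TextSimilarity constructor in place of the anonymous class. Stemming is left out here because it would need a language-specific algorithm or an extra library.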

Use your own stop words

You can provide your own set of stop words, such as a set of non-English words, to the existing tokenizer if necessary:

TextSimilarity ts = new TextSimilarity(
    new TokenizerImpl(new HashSet<>(Arrays.asList("dem", "de", "deres")))
);
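For longer stop word lists, it may be more convenient to keep the words in a text file than to hard-code them. The helper below is a minimal sketch of that idea; StopWordLoader, its methods, and the file layout (one word per line, '#' for comments) are assumptions made for illustration and are not part of the project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the project.
final class StopWordLoader {

    // Turns raw lines into a stop word set, skipping blank lines and
    // lines that start with '#' (treated here as comments).
    static Set<String> parse(List<String> lines) {
        Set<String> stopWords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Reads a UTF-8 file with one stop word per line.
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

The resulting set could be passed to TokenizerImpl in place of the inline HashSet, for example new TokenizerImpl(StopWordLoader.load(Paths.get("stopwords-da.txt"))), where the file name is made up.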

License

This project is licensed under the MIT License — see the LICENSE file for details.
