Next Text

Experimental library for a simple text similarity measure.

CRI Distance

The CRI distance is a measure of superficial text similarity. "CRI" stands for character repetition intervals. The idea behind it is to characterize a text by the number of symbols it takes for each character to repeat.

The normalized CRI distance of two texts is a value between 0 and 1, where 0 means the texts are identical and 1 means they have nothing in common at all.

Theory

By Example

Take an alphabet consisting of the letters "a" and "b", and index them with 0 and 1. Let's define an "interval" simply as the difference between the positions of two consecutive occurrences of the same character in the string. The length of a character's first interval is the position of its first occurrence, because it takes that many symbols until the character "repeats" for the first time.

So much for the interval.

Now, let's look at the string "abba". The first "a" sits at position 0, so its first interval has length 0. The first "b" sits at position 1, so its first interval has length 1. From the first "b" to the second "b", the position increments by 1, so that interval has length 1. From the first "a" to the next, the position increments by 3, so that interval has length 3.
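The bookkeeping described above can be sketched in a few lines of Kotlin. This is only an illustration, not the library's implementation: `criIntervals` is a hypothetical helper name, and it works on Kotlin `Char` values rather than on the code points the library's builder is configured with.

```kotlin
// Hypothetical helper for illustration; not part of the NextText API.
// Returns, for each character, the lengths of its repetition intervals in order of appearance.
fun criIntervals(text: String): Map<Char, List<Int>> {
    val lastSeen = mutableMapOf<Char, Int>()               // position of the previous occurrence
    val intervals = mutableMapOf<Char, MutableList<Int>>() // interval lengths per character
    text.forEachIndexed { position, ch ->
        // First occurrence: measured from position 0. Later ones: from the previous occurrence.
        val start = lastSeen[ch] ?: 0
        intervals.getOrPut(ch) { mutableListOf() }.add(position - start)
        lastSeen[ch] = position
    }
    return intervals
}

fun main() {
    println(criIntervals("abba")) // {a=[0, 3], b=[1, 1]}
    println(criIntervals("baba")) // {b=[0, 2], a=[1, 2]}
}
```

Counting how often each length occurs per character turns these lists into the count matrix.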

Now, let's put this in a matrix where the rows are the alphabet indexes and the columns hold the frequencies (counts) of the respective repetition interval (CRI) lengths:

String "abba":

Position	0	1	2	3
Character	a	b	b	a

CRI counts of "abba":

⬐ Symbol / CRI length →	0	1	2	3
0 ("a")	1	0	0	1
1 ("b")	0	2	0	0

Let's compare this with the matrix we get for the string "baba":

String "baba":

Position	0	1	2	3
Character	b	a	b	a

CRI counts of "baba":

⬐ Symbol / CRI length →	0	1	2	3
0 ("a")	0	1	1	0
1 ("b")	1	0	1	0

Now, let's calculate a naïve distance between those two matrixes simply by adding up the absolute differences between the individual cell values.

The distance between "abba" and "baba" is 8.
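To make the arithmetic explicit, here is a minimal Kotlin sketch that sums the absolute cell differences of the two matrices from the tables above. The function name `rawDistance` and the plain `Array<IntArray>` representation are assumptions made for this illustration only.

```kotlin
import kotlin.math.abs

// Illustrative only: sums the absolute cell-wise differences of two equally shaped matrices.
fun rawDistance(m1: Array<IntArray>, m2: Array<IntArray>): Int {
    require(m1.size == m2.size) { "matrices must have the same number of rows" }
    var sum = 0
    for (i in m1.indices) {
        require(m1[i].size == m2[i].size) { "rows must have the same length" }
        for (l in m1[i].indices) sum += abs(m1[i][l] - m2[i][l])
    }
    return sum
}

fun main() {
    // Rows: "a", "b". Columns: CRI lengths 0..3, copied from the tables above.
    val abba = arrayOf(intArrayOf(1, 0, 0, 1), intArrayOf(0, 2, 0, 0))
    val baba = arrayOf(intArrayOf(0, 1, 1, 0), intArrayOf(1, 0, 1, 0))
    println(rawDistance(abba, baba)) // 8
}
```

Each of the two rows contributes 4 to the sum.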

Let's compare the strings "abbababa" and "babaabba" in the same way. The first is "abba" followed by "baba"; the second is the same two strings concatenated in reverse order. The distance is still 8.

Although the texts have doubled in length, the raw distance has not grown. In this sense the algorithm honors identical subsequences, even when they appear in a different order and at different locations.

Finally, it would be nice to have a normalized value which is easier to interpret than the raw distance. We divide the distance by the sum of the two texts' lengths. Each character of a text contributes exactly one interval, so the raw distance can never exceed that sum, and we get a result between 0 and 1, 0 meaning identical and 1 meaning nothing in common at all.
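Putting the pieces together, the whole calculation from two strings to a normalized value might look like the following sketch. It is written from the description in this README, not taken from the library: the names `criCounts`, `rawCriDistance` and `normalizedCriDistance` are hypothetical, the counts are kept in a sparse map instead of a matrix, characters are Kotlin `Char` values, and treating two empty texts as identical is an assumption.

```kotlin
import kotlin.math.abs

// Illustrative only. Counts repetition intervals, keyed by (character, interval length).
fun criCounts(text: String): Map<Pair<Char, Int>, Int> {
    val lastSeen = mutableMapOf<Char, Int>()
    val counts = mutableMapOf<Pair<Char, Int>, Int>()
    text.forEachIndexed { position, ch ->
        // The first interval of a character is measured from position 0.
        val key = ch to (position - (lastSeen[ch] ?: 0))
        counts[key] = (counts[key] ?: 0) + 1
        lastSeen[ch] = position
    }
    return counts
}

// Sum of absolute count differences over every cell that occurs in either text.
fun rawCriDistance(a: String, b: String): Int {
    val countsA = criCounts(a)
    val countsB = criCounts(b)
    return (countsA.keys + countsB.keys).sumOf { key ->
        abs((countsA[key] ?: 0) - (countsB[key] ?: 0))
    }
}

// Raw distance divided by the sum of both text lengths.
fun normalizedCriDistance(a: String, b: String): Double {
    val totalLength = a.length + b.length
    if (totalLength == 0) return 0.0 // assumption: two empty texts count as identical
    return rawCriDistance(a, b).toDouble() / totalLength
}

fun main() {
    println(normalizedCriDistance("abba", "baba"))         // 1.0
    println(normalizedCriDistance("abbababa", "babaabba")) // 0.5
    println(normalizedCriDistance("abba", "abba"))         // 0.0
}
```

Note that "abba" and "baba" share no cell of their count matrixes, so their normalized distance is 1, while the longer pair with the same raw distance of 8 comes out at 0.5.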

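⚠️ The metric can be tricky if the compared strings don't contain any repeating characters. It returns 1 when differences occur at the beginning. For example, "abc" and "xabc" differ only by one leading character, yet every first-occurrence interval shifts by one, so the raw distance is 7 and the normalized distance is 7/(3+4) = 1.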

More Formally

Let s_i be a unique symbol where 0 ≤ i < n.

Let A be an alphabet of n symbols:
A = {s₀, ..., s_(n-1)}

Let S be a sequence of length L of symbols from A.

Let j, k (0 ≤ j < k < L) be the positions of two occurrences of s_i in S where no other occurrence of s_i intervenes.

Then a CRI interval is just the range of positions from j to k, not including k:
CRI_i,j,k = [j..k)

For the first occurrence of s_i there is no previous occurrence, so we define j = 0 and set k to the position of that first occurrence. Here j ≤ k, so the interval may be empty.

The length l_k,j of the CRI is k - j.

The CRI count c_i,l is the count of all repetition intervals of s_i of length l.

Let M₁ be the matrix of CRI counts c_i,l of S₁, with rows indexed by symbol index i and columns by CRI length l.

Let M₂ be the corresponding matrix of CRI counts of S₂, padded with zeros where necessary so that both matrixes have the same dimensions.

Let M_diff be the difference of the two matrixes: M_diff = M₁ - M₂

Then the absolute CRI distance D is just the sum of the absolute values of the entries d_i,l of M_diff: D = ∑|d_i,l|

The normalized CRI distance is D/(L₁+L₂), where L₁ and L₂ are the lengths of S₁ and S₂.

Disclaimer

I have no idea whether this algorithm has been discussed academically or anywhere else. I didn't find it among the top 10 results on Google (or I googled the wrong buzzwords), and I didn't bother to do any research on the field. (Actually, I was just thinking about a pragmatic solution for an imminent task.) Given that the algorithm is so trivial, I'm sure it already exists. Or otherwise it's just crap. I'm implementing it anyway as a nice playground to learn Kotlin.

So, forgive me and let me know if I (unintentionally) plagiarized.

🙈 And condone the amateurish Kotlin...

Usage

I have a weakness for the builder pattern, so the API looks like this:

        val nextText = NextText.Builder()
                .withMinCodePoint(0)
                .withMaxCodePoint(127)
                .build()

        val criDistance = nextText.criDistance("text 1", "text 2")

        println("The normalized CRI distance is $criDistance.")
