52 changes: 31 additions & 21 deletions README.md
@@ -4,27 +4,31 @@

A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...

* [Download](#download)
* [Overview](#overview)
* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
* [Levenshtein](#levenshtein)
* [Normalized Levenshtein](#normalized-levenshtein)
* [Weighted Levenshtein](#weighted-levenshtein)
* [Damerau-Levenshtein](#damerau-levenshtein)
* [Optimal String Alignment](#optimal-string-alignment)
* [Jaro-Winkler](#jaro-winkler)
* [Longest Common Subsequence](#longest-common-subsequence)
* [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
* [N-Gram](#n-gram)
* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
* [Q-Gram](#shingle-n-gram-based-algorithms)
* [Cosine similarity](#shingle-n-gram-based-algorithms)
* [Jaccard index](#shingle-n-gram-based-algorithms)
* [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
* [Experimental](#experimental)
* [SIFT4](#sift4)
* [Users](#users)
- [python-string-similarity](#python-string-similarity)
- [Download](#download)
- [Overview](#overview)
- [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
- [(Normalized) similarity and distance](#normalized-similarity-and-distance)
- [Metric distances](#metric-distances)
- [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
- [Levenshtein](#levenshtein)
- [Normalized Levenshtein](#normalized-levenshtein)
- [Weighted Levenshtein](#weighted-levenshtein)
- [Damerau-Levenshtein](#damerau-levenshtein)
- [Optimal String Alignment](#optimal-string-alignment)
- [Jaro-Winkler](#jaro-winkler)
- [Longest Common Subsequence](#longest-common-subsequence)
- [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
- [N-Gram](#n-gram)
- [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
- [Q-Gram](#q-gram)
- [Cosine similarity](#cosine-similarity)
- [Jaccard index](#jaccard-index)
- [Sorensen-Dice coefficient](#sorensen-dice-coefficient)
- [Overlap coefficient (i.e., Szymkiewicz-Simpson)](#overlap-coefficient-ie-szymkiewicz-simpson)
- [Experimental](#experimental)
- [SIFT4](#sift4)
- [Users](#users)


## Download
@@ -55,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
| [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) | |
| [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) | |
| [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) | |
| [Overlap coefficient](#overlap-coefficient-ie-szymkiewicz-simpson) |similarity<br>distance | Yes | No | Set | O(m+n) | |

[1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost of O(m.n). For Levenshtein distance, the algorithm is sometimes called the **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
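
As an illustration of the dynamic programming method described in this note, here is a minimal sketch of the full-matrix computation. It is not the library's implementation: the function name `levenshtein_matrix` is hypothetical, and the table has one extra row and column so that the empty prefix is represented.

```python
def levenshtein_matrix(s0: str, s1: str) -> int:
    """Illustrative full-matrix Levenshtein distance (hypothetical helper)."""
    m, n = len(s0), len(s1)
    # d[i][j] holds the edit distance between the prefixes s0[:i] and s1[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s0[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of s1[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s0[i - 1] == s1[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]
```

Each cell is filled once from three neighbours, which is where the O(m.n) cost comes from.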

@@ -360,6 +365,11 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in

Distance is computed as 1 - similarity.

### Overlap coefficient (i.e., Szymkiewicz-Simpson)
Very similar to the Jaccard and Sorensen-Dice measures, but this time the similarity is computed as |V1 inter V2| / Min(|V1|,|V2|). It tends to yield higher similarity scores than the other overlap-based coefficients, and it always returns the highest similarity score (1) when the set of shingles of one string is a subset of the other's.

Distance is computed as 1 - similarity.
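
To make the comparison with Jaccard and Sorensen-Dice concrete, here is a small standalone sketch that computes the three coefficients on plain sets of k-shingles. It does not use the library: the helper names (`shingles`, `overlap_similarity`, `jaccard_similarity`, `dice_similarity`) are hypothetical, the shingling is simplified (no whitespace normalization), and raising `ValueError` for strings shorter than k is an assumption made for this sketch only.

```python
def shingles(s: str, k: int = 3) -> set:
    """Set of all substrings of length k (simplified, hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def overlap_similarity(s0: str, s1: str, k: int = 3) -> float:
    v0, v1 = shingles(s0, k), shingles(s1, k)
    if not v0 or not v1:
        # Assumption of this sketch: the coefficient is undefined for empty sets.
        raise ValueError("both strings must contain at least k characters")
    return len(v0 & v1) / min(len(v0), len(v1))


def jaccard_similarity(s0: str, s1: str, k: int = 3) -> float:
    v0, v1 = shingles(s0, k), shingles(s1, k)
    return len(v0 & v1) / len(v0 | v1)


def dice_similarity(s0: str, s1: str, k: int = 3) -> float:
    v0, v1 = shingles(s0, k), shingles(s1, k)
    return 2 * len(v0 & v1) / (len(v0) + len(v1))


# "eat" has the single 3-shingle {"eat"}, which is also a shingle of "eating".
print(overlap_similarity("eat", "eating"))  # 1.0
print(jaccard_similarity("eat", "eating"))  # 0.25
print(dice_similarity("eat", "eating"))     # 0.4
```

Because the denominator is the size of the smaller set, the overlap coefficient reaches 1 as soon as the smaller set is fully contained in the larger one, while the other two measures still penalize the extra shingles.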

## Experimental

### SIFT4
28 changes: 28 additions & 0 deletions strsimpy/overlap_coefficient.py
@@ -0,0 +1,28 @@
from .shingle_based import ShingleBased
from .string_distance import NormalizedStringDistance
from .string_similarity import NormalizedStringSimilarity


class OverlapCoefficient(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):

    def __init__(self, k=3):
        super().__init__(k)

    def distance(self, s0, s1):
        return 1.0 - self.similarity(s0, s1)

    def similarity(self, s0, s1):
        if s0 is None:
            raise TypeError("Argument s0 is NoneType.")
        if s1 is None:
            raise TypeError("Argument s1 is NoneType.")
        if s0 == s1:
            return 1.0
        union = set()
        profile0, profile1 = self.get_profile(s0), self.get_profile(s1)
        for k in profile0.keys():
            union.add(k)
        for k in profile1.keys():
            union.add(k)
        # |V1 inter V2| = |V1| + |V2| - |V1 union V2|
        inter = int(len(profile0.keys()) + len(profile1.keys()) - len(union))
        # Note: the denominator is 0 (ZeroDivisionError) when either profile is empty.
        return inter / min(len(profile0), len(profile1))
35 changes: 35 additions & 0 deletions strsimpy/overlap_coefficient_test.py
@@ -0,0 +1,35 @@
import unittest

from strsimpy.overlap_coefficient import OverlapCoefficient

class TestOverlapCoefficient(unittest.TestCase):

    def test_overlap_coefficient_onestringissubsetofother_return0(self):
        sim = OverlapCoefficient(3)
        s1, s2 = "eat", "eating"
        actual = sim.distance(s1, s2)
        print("distance: {:.4f}\t between '{}' and '{}'".format(actual, s1, s2))
        self.assertEqual(0, actual)

    def test_overlap_coefficient_onestringissubset_return1(self):
        sim = OverlapCoefficient(3)
        s1, s2 = "eat", "eating"
        actual = sim.similarity(s1, s2)
        print("strsim: {:.4f}\t between '{}' and '{}'".format(actual, s1, s2))
        self.assertEqual(1, actual)

    def test_overlap_coefficient_onestringissubsetofother_return1(self):
        sim = OverlapCoefficient(3)
        s1, s2 = "eat", "eating"
        actual = sim.similarity(s1, s2)
        print("strsim: {:.4f}\t between '{}' and '{}'".format(actual, s1, s2))
        self.assertEqual(1, actual)

    def test_overlap_coefficient_halfsimilar_returnhalf(self):
        # With k=2, "car" -> {"ca", "ar"} and "bar" -> {"ba", "ar"}: one shared shingle out of two.
        sim = OverlapCoefficient(2)
        s1, s2 = "car", "bar"
        self.assertEqual(1/2, sim.similarity(s1, s2))
        self.assertEqual(1/2, sim.distance(s1, s2))

if __name__ == "__main__":
    unittest.main()