Reduce memory usage for edit distances? #7

OlivierBinette · 2022-01-07T12:29:41Z

@ngmarchant the Levenshtein distance can be implemented using only two rows for dmat, instead of using a square matrix. That could significantly reduce memory usage when comparing long sequences (400 Mb to 80 Kb when comparing strings of length 10,000).

Would it be worth it to implement this? I could propose the changes.

Example Python implementation:

import numpy as np
dmat = np.zeros((100,2))

def levenshtein(s, t, dmat):
    m = len(s)
    n = len(t)
    dmat[:, 0] = np.arange(dmat.shape[0])

    for j in range(1, n+1):
        dmat[0, (j-1) % 2] = j-1
        dmat[0, j % 2] = j
        for i in range(1, m+1):
            cost = 0
            if s[i-1] != t[j-1]:
                cost = 1
            dmat[i, j % 2] = min(dmat[i-1, j % 2] + 1, dmat[i, (j-1) % 2] +
                                 1, dmat[i-1, (j-1) % 2] + cost)
    return dmat[m, n % 2]

levenshtein("test", "testt", dmat)

The text was updated successfully, but these errors were encountered:

ngmarchant · 2022-01-09T07:35:20Z

Hi Olivier,

I'd be in favor of this. Keeping the distance matrix in memory is only useful if one is interested in recording the optimal sequence of edit operations. But here, we're only interested in the distance (the last entry in the matrix).

One thing to be mindful of, is that the Levenshtein distance is related to other edit-based distances in the package through class inheritance. The inheritance diagram is as follows: DamerauLevenshtein -> OSA -> Levenshtein -> LCS.

Since all of these classes currently rely on the full distance matrix, they would also be impacted by your proposal. I believe it's possible to implement LCS and Levenshtein by keeping 2 rows in memory, and OSA by keeping 3 rows in memory. For DamerauLevenshtein, it may be necessary to keep all rows in memory (need to check this).

Would you be happy to submit a pull request?

ngmarchant · 2022-01-09T07:39:28Z

You might be interested in this: https://web.archive.org/web/20180612143641/https://bitbucket.org/clearer/iosifovich/

OlivierBinette · 2022-01-23T22:22:10Z

@ngmarchant thanks for the link! I'll look into it for further optimization, and then see if I can make a pull request.

OlivierBinette · 2022-01-24T00:43:29Z

For reference, here's the optimization I could get by reducing the memory usage a bit more:

def levenshtein(s, t):
    m = len(s)
    n = len(t)
    buffer = np.arange(max(n, m) + 1)

    p = m
    for j in range(1, n+1):
        temp = j-1
        p = j
        for i in range(1, m+1):
            p = min(p + 1, buffer[i] + 1, temp + (s[i-1] != t[j-1]))
            temp = buffer[i]
            buffer[i] = p
    return p

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce memory usage for edit distances? #7

Reduce memory usage for edit distances? #7

OlivierBinette commented Jan 7, 2022

ngmarchant commented Jan 9, 2022

ngmarchant commented Jan 9, 2022

OlivierBinette commented Jan 23, 2022 •

edited

Loading

OlivierBinette commented Jan 24, 2022

Reduce memory usage for edit distances? #7

Reduce memory usage for edit distances? #7

Comments

OlivierBinette commented Jan 7, 2022

ngmarchant commented Jan 9, 2022

ngmarchant commented Jan 9, 2022

OlivierBinette commented Jan 23, 2022 • edited Loading

OlivierBinette commented Jan 24, 2022

OlivierBinette commented Jan 23, 2022 •

edited

Loading