Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory usage for edit distances? #7

Open
OlivierBinette opened this issue Jan 7, 2022 · 4 comments
Open

Reduce memory usage for edit distances? #7

OlivierBinette opened this issue Jan 7, 2022 · 4 comments

Comments

@OlivierBinette
Copy link

@ngmarchant the Levenshtein distance can be implemented using only two rows for dmat, instead of using a square matrix. That could significantly reduce memory usage when comparing long sequences (400 Mb to 80 Kb when comparing strings of length 10,000).

Would it be worth it to implement this? I could propose the changes.

Example Python implementation:

import numpy as np
dmat = np.zeros((100,2))

def levenshtein(s, t, dmat):
    m = len(s)
    n = len(t)
    dmat[:, 0] = np.arange(dmat.shape[0])

    for j in range(1, n+1):
        dmat[0, (j-1) % 2] = j-1
        dmat[0, j % 2] = j
        for i in range(1, m+1):
            cost = 0
            if s[i-1] != t[j-1]:
                cost = 1
            dmat[i, j % 2] = min(dmat[i-1, j % 2] + 1, dmat[i, (j-1) % 2] +
                                 1, dmat[i-1, (j-1) % 2] + cost)
    return dmat[m, n % 2]

levenshtein("test", "testt", dmat)
@ngmarchant
Copy link
Owner

Hi Olivier,

I'd be in favor of this. Keeping the distance matrix in memory is only useful if one is interested in recording the optimal sequence of edit operations. But here, we're only interested in the distance (the last entry in the matrix).

One thing to be mindful of, is that the Levenshtein distance is related to other edit-based distances in the package through class inheritance. The inheritance diagram is as follows: DamerauLevenshtein -> OSA -> Levenshtein -> LCS.

Since all of these classes currently rely on the full distance matrix, they would also be impacted by your proposal. I believe it's possible to implement LCS and Levenshtein by keeping 2 rows in memory, and OSA by keeping 3 rows in memory. For DamerauLevenshtein, it may be necessary to keep all rows in memory (need to check this).

Would you be happy to submit a pull request?

@ngmarchant
Copy link
Owner

You might be interested in this: https://web.archive.org/web/20180612143641/https://bitbucket.org/clearer/iosifovich/

@OlivierBinette
Copy link
Author

OlivierBinette commented Jan 23, 2022

@ngmarchant thanks for the link! I'll look into it for further optimization, and then see if I can make a pull request.

@OlivierBinette
Copy link
Author

For reference, here's the optimization I could get by reducing the memory usage a bit more:

def levenshtein(s, t):
    m = len(s)
    n = len(t)
    buffer = np.arange(max(n, m) + 1)

    p = m
    for j in range(1, n+1):
        temp = j-1
        p = j
        for i in range(1, m+1):
            p = min(p + 1, buffer[i] + 1, temp + (s[i-1] != t[j-1]))
            temp = buffer[i]
            buffer[i] = p
    return p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants