The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns in strings. A string is an abstract data type that consists of a sequence of characters. Letters, words, sentences, and more can be represented as strings.

A rolling hash lets an algorithm update a hash value without rehashing the entire window. For example, when searching for a word in a text, the algorithm shifts its window one letter to the right at each step (as in the animation for brute force). Instead of computing the hash of text[1:3] from scratch after shifting from text[0:2], it applies a constant-time operation to the old hash to obtain the new one.

For example, suppose "abc" has hash $prime_a \times prime_b \times prime_c$. For "bcd" we don't have to calculate $prime_b \times prime_c \times prime_d$ from scratch; we can calculate $\frac{prime_a \times prime_b \times prime_c}{prime_a} \times prime_d$ instead.
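The toy example above can be sketched in a few lines. This is only an illustration: the table `PRIMES` and the helper names `prime_of`, `product_hash`, and `roll` are hypothetical, and the sketch assumes lowercase letters only.

```python
# Assumption: one prime per lowercase letter, 'a' -> 2, 'b' -> 3, ..., 'z' -> 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_of(ch):
    return PRIMES[ord(ch) - ord('a')]

def product_hash(s):
    # hash = product of the primes assigned to each character
    h = 1
    for ch in s:
        h *= prime_of(ch)
    return h

def roll(h, outgoing, incoming):
    # divide out the character leaving the window, multiply in the new one
    return h // prime_of(outgoing) * prime_of(incoming)

h_abc = product_hash("abc")          # 2 * 3 * 5 = 30
h_bcd = roll(h_abc, "a", "d")        # 30 // 2 * 7 = 105
assert h_bcd == product_hash("bcd")  # 3 * 5 * 7 = 105
```

Note that a product ignores character order, so "abc" and "cba" get the same hash under this toy scheme. The polynomial hash used by Rabin-Karp avoids that by weighting each position differently.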

The Rabin-Karp Rolling Hash

$H = c_1a^{k-1} + c_2a^{k-2} + c_3a^{k-3} + \cdots + c_ka^{0} \;\;\; \mbox{mod} \;\;m$

$H = \sum^{k}_{i=1} c_i a^{k-i} \;\;\; \mbox{mod} \;\;m$

Here $a$ is a constant, $c_1 \ldots c_k$ are the input characters, and $k$ is the number of characters in the string being hashed (the length of the word). $m$ is some large prime number.

Usually, for lowercase English letters only, $a = 31$; when uppercase letters are included as well, a common choice is $a = 53$. In both cases $m = 10^9 + 7$ is a typical modulus.
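As a minimal sketch of the formula, the hash can be evaluated with Horner's rule and then rolled one position to the right. The names `A`, `M`, `char_value`, `poly_hash`, and `roll` are chosen here for illustration, and mapping 'a'..'z' to 1..26 is an assumption that only makes sense for lowercase input.

```python
A = 31            # base for lowercase letters
M = 10**9 + 7     # large prime modulus

def char_value(ch):
    # assumption: map 'a'..'z' to 1..26 so that no letter has value 0
    return ord(ch) - ord('a') + 1

def poly_hash(s):
    # Horner's rule: (...((c_1 * a + c_2) * a + c_3)...) mod m
    h = 0
    for ch in s:
        h = (h * A + char_value(ch)) % M
    return h

def roll(h, outgoing, incoming, k):
    # remove c_1 * a^(k-1), shift by one power of a, add the new last character
    h = (h - char_value(outgoing) * pow(A, k - 1, M)) % M
    return (h * A + char_value(incoming)) % M

h_abc = poly_hash("abc")             # 1*31^2 + 2*31 + 3 = 1026
h_bcd = roll(h_abc, "a", "d", k=3)   # (1026 - 961) * 31 + 4 = 2019
assert h_bcd == poly_hash("bcd")     # 2*31^2 + 3*31 + 4 = 2019
```

The rolling step costs a constant number of operations per shift, whereas recomputing the hash from scratch costs $k$ operations.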


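PatternHash = Hash(Pattern)
PrefixHash[i] = Hash(Text[0:i]), the hash of the first i characters of the text

k = len(Pattern)

The prefix hashes use increasing powers of the base (written $p$ here; it plays the role of $a$ above). Starting from $hash(T[0:0]) = 0$, each prefix hash follows from the previous one:

$hash(T[0:i+1]) = (hash(T[0:i]) + p^{i}T_{i}) \;\;\; \mbox{mod} \;\;m$

When we iterate over `Text`, for each start index $i$ we calculate

$(m + hash(T[0:i+k]) - hash(T[0:i]))\;\;\; \mbox{mod} \;\;m$

which is actually $hash(T[i:i+k]) \cdot p^i \;\;\; \mbox{mod} \;\;m$, where $hash(T[i:i+k]) = \sum_{j=0}^{k-1} T_{i+j}p^{j}$. To compare it with the pattern, we multiply the pattern hash by $p^i$ (mod $m$) instead of dividing the window hash.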
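The prefix-hash comparison described above could look roughly like this. The function name `prefix_hash_search` is hypothetical, and using `ord(ch)` directly as the character value is an assumption made so that any character can be hashed. Hash matches are confirmed with a direct string comparison to rule out collisions.

```python
def prefix_hash_search(text, pattern, p=31, m=10**9 + 7):
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []

    # powers[i] = p^i mod m
    powers = [1] * (n + 1)
    for i in range(1, n + 1):
        powers[i] = powers[i - 1] * p % m

    # prefix[i] = hash(T[0:i]) = sum of T_j * p^j for j < i, mod m
    prefix = [0] * (n + 1)
    for i, ch in enumerate(text):
        prefix[i + 1] = (prefix[i] + ord(ch) * powers[i]) % m

    # pattern hash with the same increasing-power convention
    pattern_hash = sum(ord(ch) * powers[j] for j, ch in enumerate(pattern)) % m

    matches = []
    for i in range(n - k + 1):
        # equals hash(T[i:i+k]) * p^i mod m
        window = (m + prefix[i + k] - prefix[i]) % m
        # scale the pattern hash by p^i, then verify to exclude collisions
        if window == pattern_hash * powers[i] % m and text[i:i + k] == pattern:
            matches.append(i)
    return matches

print(prefix_hash_search("abracadabra", "abra"))  # [0, 7]
```

Precomputing the prefix hashes takes linear time and memory, after which the hash of any substring is available in constant time. That is useful when many different windows of the same text need to be compared.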

In [16]:
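def rabin_karp(text, pattern):
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []  # nothing to search for, or pattern longer than text
    p, q = 31, 10**9+7  # base and prime modulus

    # p_pow = p^(m-1) mod q, the weight of the window's leading character
    p_pow = pow(p, m - 1, q)

    def code(ch):
        # maps 'a'..'z' to 1..26; other characters map to other integers
        # (possibly negative), which still work because % q is non-negative
        return ord(ch) - ord('a') + 1

    # hash the pattern and the first window of the text
    pattern_hash = 0
    window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * p + code(pattern[i])) % q
        window_hash = (window_hash * p + code(text[i])) % q

    indices = []

    for i in range(n - m + 1):
        # equal hashes may be a collision, so confirm with a direct comparison
        if pattern_hash == window_hash and text[i:i+m] == pattern:
            indices.append(i)

        if i < n - m:
            # drop text[i], shift the window by one, append text[i+m]
            window_hash = (window_hash - code(text[i]) * p_pow) % q
            window_hash = (window_hash * p + code(text[i+m])) % q

    return indices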

In [17]:
text = "a;lsdjfladj kashd fhalsdhf alsd fabc as;ldfja ljsdfa lksdfabc lsjdfl;aksjdfj ahsdkj sf"
pattern = "abc"

rabin_karp(text, pattern)

[33, 58]
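One way to sanity-check this output is to compare it against a brute-force scan of the same text. The helper `naive_search` below is a hypothetical name, not part of the notebook above. It compares the pattern against every window directly, which costs up to $O(nk)$ character comparisons.

```python
def naive_search(text, pattern):
    # brute force: compare the pattern against every window of the text
    k = len(pattern)
    return [i for i in range(len(text) - k + 1) if text[i:i + k] == pattern]

text = "a;lsdjfladj kashd fhalsdhf alsd fabc as;ldfja ljsdfa lksdfabc lsjdfl;aksjdfj ahsdkj sf"
print(naive_search(text, "abc"))  # [33, 58]
```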