Consider the worst case of naive substring search.

Imagine that the string S[] consists of 1 million characters that are all A, and that the word W[] is 999 A characters terminating in a final B character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 million positions for 1 billion character comparisons. If the length of W[] is k, then the worst-case performance is O(k⋅n).
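The count above can be reproduced at a smaller scale. The following is a minimal sketch of the naive search with a comparison counter; the function name `naive_search` and the reduced sizes (1000 characters of text, a 100-character word) are illustrative assumptions, not part of the original example.

```python
def naive_search(S, W):
    """Return (index of the first match or -1, number of character comparisons)."""
    comparisons = 0
    n, k = len(S), len(W)
    for m in range(n - k + 1):          # every trial position
        i = 0
        while i < k:
            comparisons += 1
            if S[m + i] != W[i]:
                break                   # mismatch: reject this trial position
            i += 1
        else:
            return m, comparisons       # loop finished without a mismatch
    return -1, comparisons


# Scaled-down version of the example: 1000 A's, word = 99 A's followed by B.
print(naive_search("A" * 1000, "A" * 99 + "B"))  # (-1, 90100)
```

With 901 trial positions and 100 comparisons at each, the counter reaches 901 × 100 = 90100, which is the O(k⋅n) behaviour described above.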


The KMP algorithm has a better worst-case performance than the straightforward algorithm. KMP spends a little time precomputing a table (on the order of the size of W[], O(k)), and then it uses that table to do an efficient search of the string in O(n).


The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In the example above, when KMP sees a trial match fail on the 1000th character (i = 999) because S[m+999] ≠ W[999], it will increment m by 1, but it will know that the first 998 characters at the new position already match. KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that match W[] and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase (variable m) and where it will resume testing (variable i).

To illustrate the algorithm, consider searching for `W` = "ABCDABD" in `S` = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:

- `m`, denoting the position within `S` where the prospective match for `W` begins,
- `i`, denoting the index of the currently considered character in `W`.

In each step the algorithm compares `S[m+i]` with `W[i]` and increments `i` if they are equal. At the start of the run, this looks like:
```
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
```
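The algorithm compares successive characters of `W` with the "parallel" characters of `S`, incrementing `i` while they match. In the fourth step, however, `S[3] = ' '` does not match `W[3] = 'D'`. Rather than beginning the search again at `S[1]`, we note that no 'A' occurs between positions 1 and 2 in `S`; hence, having checked all those characters previously (and knowing they matched the corresponding characters in `W`), there is no chance of finding the beginning of a match. Therefore, the algorithm sets `m = 3` and `i = 0`.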

---

This match fails again at the initial character, so the algorithm sets `m = 4` and `i = 0`.

--- 
```
             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
```
Here, `i` increments through a nearly complete match "ABCDAB" until `i = 6`, giving a mismatch at `W[6]` and `S[10]`.  
However, just before the end of the current partial match there was the substring "AB" at `m = 8`, which could be the beginning of a new match.  
Since these two characters are already known to match, the algorithm sets `m = 8` (the start of the initial prefix) and `i = 2` (signaling that the first two characters match) and continues matching.  
_Thus the algorithm not only omits previously matched characters of S (the "AB"), but also previously matched characters of W (the prefix "AB")._

---

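The search at `m = 8` fails immediately, because `W[2] = 'C'` does not match `S[10] = ' '`, and the following attempts fail as well until `m = 11` and `i = 0`. From there `i` again advances through "ABCDAB" until `W[6]` mismatches `S[17]`. As before, the algorithm reuses the trailing "AB" and sets `m = 15` and `i = 2`, and this time the match is complete.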

Pseudocode

For the moment, we assume the existence of a "partial match" table `T`, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch.  

When the comparison of `S[m + i]` with `W[i]` fails, the next possible match starts at index `m + i - T[i]` in `S` (that is, `T[i]` is the amount of "backtracking" we need to do after a mismatch).

This has two implications. First, `T[0] = -1`, which indicates that if `W[0]` is a mismatch, we cannot backtrack and must simply check the next character. Second, although the next possible match begins at index `m + i - T[i]`, we need not actually check any of the `T[i]` characters after that, so we continue searching from `W[T[i]]`.
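The mismatch rule can be checked against the walkthrough above. The sketch below records the state `(m, i)` after every mismatch; the function name `kmp_trace` is hypothetical, and the table values are worked out by hand for this one pattern (for `i > 0`, the length of the longest proper prefix of `W[:i]` that is also its suffix), so they are an assumption rather than something given in the text.

```python
def kmp_trace(S, W, T):
    """Return (match position or -1, list of (m, i) states set after each mismatch)."""
    m, i = 0, 0
    states = []
    while m + i < len(S):
        if S[m + i] == W[i]:
            i += 1
            if i == len(W):
                return m, states            # complete match starting at m
        else:
            if T[i] >= 0:
                # next candidate starts at m + i - T[i]; resume at W[T[i]]
                m, i = m + i - T[i], T[i]
            else:
                # T[0] = -1: no backtracking possible, move to the next character
                m, i = m + 1, 0
            states.append((m, i))
    return -1, states


S = "ABC ABCDAB ABCDABCDABDE"
W = "ABCDABD"
T = [-1, 0, 0, 0, 0, 1, 2]   # hand-computed for "ABCDABD" (assumption)

position, states = kmp_trace(S, W, T)
print(position)  # 15
print(states)    # [(3, 0), (4, 0), (8, 2), (10, 0), (11, 0), (15, 2)]
```

The recorded states are the same `m` and `i` values that appear in the step-by-step example.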

```
algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an array of integers, P (positions in S at which W is found)
        an integer, nP (number of positions)

    define variables:
        an integer, j ← 0 (the position of the current character in S)
        an integer, k ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    let nP ← 0

    while j < length(S) do
        if W[k] = S[j] then
            let j ← j + 1
            let k ← k + 1
            if k = length(W) then
                (occurrence found, if only first occurrence is needed, m ← j - k  may be returned here)
                let P[nP] ← j - k, nP ← nP + 1
                let k ← T[k] (T[length(W)] can't be -1)
        else
            let k ← T[k]
            if k < 0 then
                let j ← j + 1
                let k ← k + 1
```
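A direct Python translation of the pseudocode might look as follows. The pseudocode says the table is "computed elsewhere", so `kmp_table` here is an assumed simple variant (`T[0] = -1`, and otherwise the length of the longest proper prefix of `W[:i]` that is also its suffix, with one extra entry for `i = length(W)`); it is not necessarily the table the pseudocode was written for. A non-empty `W` is also assumed.

```python
def kmp_table(W):
    """Assumed table: T[0] = -1, T[i] = longest border length of W[:i], i = 1..len(W)."""
    T = [-1] * (len(W) + 1)
    for i in range(1, len(W) + 1):
        k = T[i - 1]
        # shrink the candidate border until it can be extended by W[i - 1]
        while k >= 0 and W[k] != W[i - 1]:
            k = T[k]
        T[i] = k + 1
    return T


def kmp_search(S, W, T):
    """Return the list of positions in S at which W is found."""
    P = []
    j = 0  # position of the current character in S
    k = 0  # position of the current character in W
    while j < len(S):
        if W[k] == S[j]:
            j += 1
            k += 1
            if k == len(W):
                P.append(j - k)  # occurrence found
                k = T[k]         # T[len(W)] is never -1
        else:
            k = T[k]
            if k < 0:
                j += 1
                k += 1
    return P


print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", kmp_table("ABCDABD")))  # [15]
```

Because `k` falls back to `T[len(W)]` after a match instead of to 0, overlapping occurrences are reported as well.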

The implementations below are based on the prefix function (denoted $\pi$) of the pattern.

Given a pattern $p$ of length $m$, the function $\pi$ maps $\left \{ 1, 2, \ldots, m \right \} \to \left \{ 0, 1, \ldots, m-1 \right \}$ such that $\pi[q]$ is the length of the longest prefix of $p$ that is a proper suffix of $p_q$, where $p_q$ denotes the prefix of $p$ of length $q$:

$\pi[q]=\max\{k: k<q \;\text{and}\; p_k\sqsupset p_q\}$

Here $p_k \sqsupset p_q$ means that $p_k$ is a suffix of $p_q$.

|    i   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|
|  P[i]  | a | b | a | b | a | c | a |
|$\pi[i]$| 0 | 0 | 1 | 2 | 3 | 0 | 1 |
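The table can be checked directly against the definition. The following brute-force sketch (the name `prefix_function_bruteforce` is hypothetical) tries every `k < q` and keeps the largest one for which the prefix of length `k` is a suffix of the prefix of length `q`. It is far slower than the implementations below and is meant only as a reference.

```python
def prefix_function_bruteforce(p):
    """pi[q-1] = max k < q such that p[:k] is a suffix of p[:q], for q = 1..len(p)."""
    pi = []
    for q in range(1, len(p) + 1):
        # k = 0 always qualifies (the empty string is a suffix of everything)
        pi.append(max(k for k in range(q) if p[:q].endswith(p[:k])))
    return pi


print(prefix_function_bruteforce("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

Note that the Python list is 0-indexed, so entry `q - 1` of the result corresponds to $\pi[q]$ in the table.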

In [3]:
def prefix(p):
    m=len(p)
    pi=[0]*m
    j=0 
    for i in range(1,m):
        while j>=0 and p[j]!=p[i]:
            if j-1>=0:
                j=pi[j-1]
            else:
                j=-1 
        j+=1
        pi[i]=j
    return pi

In [4]:
def find_occurrences(S,p):
    matches = []
    f=prefix(p)
    n,m=len(S),len(p)
    j=0
    for i in range(n):
        while j>=0 and S[i]!=p[j]:
            if j>0: 
                j=f[j-1]
            else: 
                j=-1
        j+=1  
        if j==m:
            j=f[m-1]
            matches.append(i-m+1)
    return matches


W = "acabacacd"
T = "acfacabacabacacdk"

find_occurrences(T, W)

[7]

In [6]:
def createAux(W):
    # initializing the array aux with 0's
    aux = [0] * len(W)

    # for index 0, it will always be 0
    # so starting from index 1
    i = 1
    # m can also be viewed as index of first mismatch
    m = 0
    while i < len(W):
        # prefix = suffix till m-1
        if W[i] == W[m]:
            m += 1
            aux[i] = m
            i += 1
        # this one is a little tricky,
        # when there is a mismatch,
        # we will check the index of previous
        # possible prefix.
        elif W[i] != W[m] and m != 0:
            # Note that we do not increment i here.
            m = aux[m-1]
        else:
            # m = 0, we move to the next letter,
            # there was no any prefix found which 
            # is equal to the suffix for index i
            aux[i] = 0
            i += 1

    return aux

W = "acabacacd"
print(createAux(W))

[0, 0, 1, 0, 1, 2, 3, 2, 0]


In [None]:
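def createAux_v2(text):
    # NOTE: this cell was left unfinished; the loop below is one possible
    # completion that mirrors createAux above.
    n = len(text)
    prefix_func = [0 for i in range(n)]
    # we start from index 1, because prefix_func[0] is always 0
    go_through_index = 1
    # prefix_count: length of the longest prefix matched so far
    prefix_count = 0

    while go_through_index < n:
        if text[go_through_index] == text[prefix_count]:
            prefix_count += 1
            prefix_func[go_through_index] = prefix_count
            go_through_index += 1
        elif prefix_count > 0:
            # fall back to the next shorter candidate prefix
            prefix_count = prefix_func[prefix_count - 1]
        else:
            go_through_index += 1

    return prefix_func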

In [6]:
W = "acabacacd"
T = "acfacabacabacacdk"

# this method is from above code snippet.
aux = createAux(W)

# counter for word W
i = 0
# counter for text T
j = 0
while j < len(T):
    # We need to handle 2 conditions when there is a mismatch
    if W[i] != T[j]:
        # 1st condition
        if i == 0:
            # starting again from the next character in the text T
            j += 1
        else:
            # aux[i-1] will tell from where to compare next
            # and no need to match for W[0..aux[i-1] - 1],
            # they will match anyway, that’s what kmp is about.
            i = aux[i-1]
    else:
        i += 1
        j += 1
        # we found the pattern
        if i == len(W):
            # printing the index
            print("found pattern at " + str(j - i))
            # if we want to find more patterns, we can 
            # continue as if no match was found at this point.
            i = aux[i-1]

found pattern at 7


In [11]:
def prefixFunction(text):
    n = len(text)
    prefix_func = [0 for i in range(n)]

    # YOUR CODE GOES HERE
    for i in range(1, n):
        x = prefix_func[i-1]
        while x > 0 and text[i] != text[x]:
            x = prefix_func[x-1]

        if text[i] == text[x]:
            prefix_func[i] = x + 1

    return prefix_func

In [16]:
def kmp(text, pattern):
    s = pattern+'#'+text
    n = len(pattern)
    
    prefix = prefixFunction(s)  # prefixFunction is defined in the cell above
    for i in range(n+1, len(prefix)):
        if prefix[i] == n:
            return i - (n+1) - n + 1 # i - 2*n
        
def KMP(text, pattern):
    n, m = len(text), len(pattern)
    special_symbol = "#"
    indices = []

    # YOUR CODE GOES HERE
    s = pattern+special_symbol+text
    lens = len(s)
    prefix = [0 for i in range(lens)]
    
    for i in range(1, lens):
        x = prefix[i-1]
        while x > 0 and s[i] != s[x]:
            x = prefix[x-1]

        if s[i] == s[x]:
            prefix[i] = x + 1

    for i in range(m+1, len(prefix)):
        if prefix[i] == m:
            indices.append(i - (m+1) - m + 1) # i - 2*m

    return indices        

W = "acabacacd"
T = "acfacabacabacacdk"

KMP(T, W)

[7]

__Application__

You are given two strings, where the second string is the first string that has been cyclically shifted (or has it?). Example of a cyclic shift: abcde -> deabc. Output the minimum possible cyclic shift needed to obtain the second string from the first, or -1 if it is not possible.

In [None]:
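def minCyclicShift(original_string, shifted_string):
    # YOUR CODE GOES HERE
    # a cyclic shift preserves the length
    if len(original_string) != len(shifted_string):
        return -1
    if original_string == shifted_string:
        return 0

    # After a shift by k (as in abcde -> deabc, k = 2), original_string
    # starts at index k of shifted_string + shifted_string, so the first
    # occurrence reported by KMP (defined above) is the minimum shift.
    occurrences = KMP(shifted_string + shifted_string, original_string)
    return occurrences[0] if occurrences else -1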

original_string = 'abcde'
shifted_string = 'deabc'
# check that your code works correctly on provided example
assert minCyclicShift(original_string, shifted_string) == 2, 'Wrong answer'
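For cross-checking a KMP-based answer on small inputs, a quadratic brute-force version can be useful. This is only a sketch: the name `min_cyclic_shift_bruteforce` is hypothetical, and it assumes the shift direction of the example above (a shift by `k` moves the last `k` characters to the front).

```python
def min_cyclic_shift_bruteforce(original, shifted):
    """Try every shift k in increasing order; O(n^2), for testing only."""
    n = len(original)
    if n != len(shifted):
        return -1
    for k in range(max(n, 1)):
        # move the last k characters of original to the front
        if original[n - k:] + original[:n - k] == shifted:
            return k
    return -1


print(min_cyclic_shift_bruteforce("abcde", "deabc"))  # 2
print(min_cyclic_shift_bruteforce("abc", "abd"))      # -1
```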