
#13.1 Abundance of Digitized Text

##13.1.1 Notations for Strings and the Python `str` Class
Character strings are used as a model for text when discussing algorithms for text processing. To allow fairly general notions of a string in the algorithm descriptions, assume that the characters of a string come from a known **alphabet**, denoted as $\Sigma$. For example, in the context of DNA, there are four symbols in the standard alphabet, $\Sigma=\{A,C,G,T\}$. The alphabet $\Sigma$ can also be a subset of the ASCII or Unicode character sets.

To describe the pieces that result from string-processing operations, Python's *indexing* and *slicing* notations are used here. For a string `S` of length $n$, any substring of the form `S[0:k]` with $0\leq k\leq n$ is a **prefix** of `S`; it can also be written as `S[:k]`. Similarly, any substring of the form `S[j:n]` with $0\leq j\leq n$ is a **suffix** of `S`, also written as `S[j:]`.
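As a small illustration of the prefix and suffix notation, the sketch below enumerates every prefix and suffix of a short string. The sample string and the names `prefixes` and `suffixes` are chosen here for illustration only.

```python
S = 'CGTAAACTG'
n = len(S)

# Every prefix S[:k] for 0 <= k <= n (k = 0 gives the empty string).
prefixes = [S[:k] for k in range(n + 1)]

# Every suffix S[j:] for 0 <= j <= n (j = n gives the empty string).
suffixes = [S[j:] for j in range(n + 1)]

print(prefixes[:4])   # ['', 'C', 'CG', 'CGT']
print(suffixes[-4:])  # ['CTG', 'TG', 'G', '']
```

Note that both the empty string and `S` itself count as a prefix and as a suffix under this definition.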

#13.2 Pattern-Matching Algorithms
In the classic **pattern-matching** problem, a **text** string `T` of length $n$ and a **pattern** string `P` of length $m$ are given, and the task is to determine whether `P` is a substring of `T`. If so, the goal is to find the lowest index $j$ within `T` at which `P` begins, such that `T[j:j+m]` equals `P`, or perhaps to find all indices of `T` at which pattern `P` begins.

This problem can be solved directly with Python's `str` class, using `P in T`, `T.find(P)`, `T.index(P)`, or `T.count(P)`. It also arises as a subtask of more complex operations such as `T.partition(P)`, `T.split(P)`, and `T.replace(P,Q)`.

For example:


In [0]:
T = 'CGTAAACTGCTTTAATCAAACGC'
#                 ^index 13
P = 'AATCA'

In [0]:
T.find(P)

13

In [0]:
T.index(P)

13

In [0]:
T.count(P)

1

In [0]:
T.count('AAA')

2

In [0]:
T.partition(P)

('CGTAAACTGCTTT', 'AATCA', 'AACGC')

In [0]:
T.split(P)

['CGTAAACTGCTTT', 'AACGC']

For simplicity, the outward semantics of the pattern-matching functions implemented in the rest of this section are modeled upon the `find` method of the string class: each returns the lowest index at which the pattern begins, or `-1` if the pattern is not found.
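The `-1` convention can be seen directly with the built-in methods. The short sketch below reuses the text string from the cells above and contrasts `str.find` with `str.index`, which performs the same search but raises `ValueError` when the pattern is absent.

```python
T = 'CGTAAACTGCTTTAATCAAACGC'

print(T.find('AATCA'))   # 13, the lowest index at which the pattern begins
print(T.find('GGG'))     # -1, the pattern does not occur

# str.index has the same search semantics but raises instead of returning -1.
try:
    T.index('GGG')
except ValueError:
    print('not found')
```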

##13.2.1 Brute Force
The **brute-force** algorithm design pattern is a general technique for searching for something or optimizing some function. Applied in a general situation, it enumerates all possible configurations of the inputs involved and picks the best of these enumerated configurations. For pattern matching, this means testing every possible placement of `P` relative to `T`.

In [0]:
def find_brute(T, P):
  '''Return the lowest index of T at which substring P begins (or else -1)'''
  n, m = len(T), len(P) # introduce convenient notations
  for i in range(n-m+1): # try every potential starting index within T
    k = 0 # an index into pattern P
    while k < m and T[i+k] == P[k]: # kth character of P matches
      k += 1
    if k == m: # if we reached the end of pattern
      return i # substring T[i:i+m] matches P
  return -1 # failed to find a match starting with any i

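The worst-case running time of the brute-force method is $O(nm)$: the outer loop runs at most $n-m+1$ times, and the inner loop performs at most $m$ comparisons for each starting index.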
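The same enumeration of starting indices also answers the "find all indices" variant of the problem. The following is a minimal sketch, not part of the original code; the name `find_all_brute` is hypothetical, and a slice comparison stands in for the explicit inner loop.

```python
def find_all_brute(T, P):
    """Return a list of every index of T at which P begins (overlaps included)."""
    n, m = len(T), len(P)
    # The slice comparison T[i:i+m] == P plays the role of the inner loop.
    return [i for i in range(n - m + 1) if T[i:i + m] == P]

T = 'CGTAAACTGCTTTAATCAAACGC'
print(find_all_brute(T, 'AAA'))      # [3, 17]
print(find_all_brute('aaaa', 'aa'))  # [0, 1, 2]
```

Unlike `str.count`, which counts non-overlapping occurrences, this sketch reports overlapping matches as well.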

##13.2.2 The Boyer-Moore Algorithm
The **Boyer-Moore** pattern-matching algorithm can sometimes avoid comparisons between `P` and a sizable fraction of the characters in `T`, as it may not be necessary to examine every character in `T` in order to locate a pattern `P` as a substring or to rule out its existence.

The main idea here is to improve the running time of the brute-force algorithm by adding two potentially time-saving heuristics:
1. **Looking-Glass Heuristic**: When testing a possible placement of `P` against `T`, begin the comparisons from the end of `P` and move backward to the front of `P`.
2. **Character-Jump Heuristic**: During the testing of a possible placement of `P` within `T`, a mismatch of text character `T[i]=c` with the corresponding pattern character `P[k]` is handled as follows. If $c$ is not contained anywhere in `P`, then shift `P` completely past `T[i]` (for it cannot match any character in `P`). Otherwise, shift `P` until an occurrence of character $c$ in `P` gets aligned with `T[i]`.

More generally, when a match is found for
the last character of the pattern, the algorithm continues by trying to extend the match with the
second-to-last character of the pattern in its current alignment. That process continues
until either matching the entire pattern, or finding a mismatch at some interior
position of the pattern.

If a mismatch is found, and the mismatched character of the text does not occur in the pattern, the entire pattern is shifted beyond that location.

The efficiency of the Boyer-Moore algorithm relies on creating a lookup table that quickly determines where a mismatched character occurs elsewhere in the pattern. Thus define a function `last(c)` as
* If $c$ is in `P`, `last(c)` is the index of the last (rightmost) occurrence of $c$ in `P`. Otherwise, `last(c)=-1`.

If the alphabet is of fixed, finite size and characters can be converted to indices of an array, the `last` function can be implemented as a lookup table with worst-case $O(1)$ access time, but the table would have length equal to the size of the alphabet. A hash table is therefore preferred to represent the `last` function, with only those characters from the pattern occurring in the structure. The space usage is then proportional to the number of distinct characters in the pattern, and thus $O(m)$, while the expected lookup time remains $O(1)$.
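To make the `last` table and the resulting jumps concrete, here is a small sketch. The helper names `build_last` and `jump` and the sample pattern are assumptions made for illustration; the jump formula is the one used in the implementation of this section, `m - min(k, j+1)`, where `k` is the pattern index of the mismatch and `j = last(c)`.

```python
def build_last(P):
    """Map each character of P to the index of its last occurrence."""
    return {c: k for k, c in enumerate(P)}  # later occurrences overwrite

def jump(P, last, k, c):
    """Amount added to the text index i after text character c mismatches P[k]."""
    j = last.get(c, -1)  # -1 when c does not occur in P
    return len(P) - min(k, j + 1)

P = 'abacab'
last = build_last(P)
print(last)                   # {'a': 4, 'b': 5, 'c': 3}
print(jump(P, last, 5, 'd'))  # 6: 'd' is not in P, so P moves completely past it
print(jump(P, last, 5, 'c'))  # 2: aligns P[3] == 'c' with the mismatched character
print(jump(P, last, 2, 'b'))  # 4: last('b') lies right of k, so P shifts by one
```

In the last case the rightmost `'b'` is to the right of the mismatch position, so aligning it would move the pattern backward; the `min` guards against that by shifting the pattern a single position instead.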

In [0]:
def find_boyer_moore(T, P):
  '''Return the lowest index of T at which substring P begins (or else -1)'''
  n, m = len(T), len(P) # introduce convenient notations
  if m == 0: return 0 # trivial search for empty string
  last = {} # build 'last' dictionary
  for k in range(m):
    last[ P[k] ] = k # later occurrence overwrites
  # align end of pattern at index m-1 of text
  i = m-1 # an index into T
  k = m-1 # an index into P
  while i < n:
    if T[i] == P[k]: # a matching character
      if k == 0:
        return i # pattern begins at index i of text
      else:
        i -= 1 # examine previous character
        k -= 1 # of both T and P
    else:
      j = last.get(T[i], -1) # last(T[i]) is -1 if not found
      i += m - min(k, j+1) # case analysis for jump step
      k = m-1 # restart at end of pattern
  return -1

The correctness of the Boyer-Moore pattern-matching algorithm follows from
the fact that each time the method makes a shift, it is guaranteed not to “skip” over
any possible matches, because `last(c)` is the location of the last occurrence of $c$ in `P`.

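The simplified version implemented above has a worst-case running time of $O(nm+|\Sigma|)$; for example, the text `aaa...a` with the pattern `baa...a` forces nearly $m$ comparisons for every shift of one position. The original Boyer-Moore algorithm achieves a running time of $O(n+m+|\Sigma|)$ by using an additional shift heuristic for the partially matched suffix of the pattern.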

##13.2.3 The Knuth-Morris-Pratt Algorithm
The main idea of the KMP algorithm is to precompute self-overlaps between portions of the pattern so that when a mismatch occurs at one location, the maximum amount by which the pattern can be shifted before continuing the search is known immediately.

###The Failure Function
A **failure function**, $f$, indicates the proper shift of `P` upon a failed comparison. The value $f(k)$ is defined as the length of the longest prefix of `P` that is a suffix of `P[1:k+1]` (note that `P[0]` is excluded here, since the pattern must shift by at least one unit). If a mismatch is found upon character `P[k+1]`, the value $f(k)$ tells how many of the immediately preceding characters can be reused to restart the pattern.
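The definition can be checked directly on a small pattern. The sketch below computes the failure function naively, straight from the definition; it is deliberately slow and is meant only as an illustration (the name `naive_fail` and the sample pattern are assumptions, not part of the original code).

```python
def naive_fail(P):
    """Compute the failure function directly from its definition (slow; for illustration)."""
    m = len(P)
    fail = [0] * m
    for k in range(m):
        # Try candidate overlap lengths from longest to shortest. The length is
        # capped at k so that the suffix is taken from P[1:k+1], never from P[0].
        for length in range(k, 0, -1):
            if P[:length] == P[k + 1 - length:k + 1]:
                fail[k] = length
                break
    return fail

print(naive_fail('amalgamation'))  # [0, 0, 1, 0, 0, 1, 2, 3, 0, 0, 0, 0]
```

For instance, `fail[7]` is 3 because the prefix `'ama'` is also a suffix of `P[1:8] = 'malgama'`.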

###Implementation

In [0]:
def find_kmp(T, P):
  '''Return the lowest index of T at which substring P begins (or else -1)'''
  n, m = len(T), len(P)
  if m == 0: return 0
  fail = compute_kmp_fail(P) # rely on utility to precompute
  j = 0 # index into text
  k = 0 # index into pattern
  while j < n:
    if T[j] == P[k]: # P[0:1+k] matched thus far
      if k == m-1: # match is complete
        return j - m + 1
      j += 1 # try to extend match
      k += 1
    elif k > 0:
      k = fail[k-1] # reuse suffix of P[0:k]
    else:
      j += 1
  return -1 # reached end without match

The implementation of the KMP pattern-matching algorithm relies on a utility function, `compute_kmp_fail`, to compute the failure function efficiently.

The main part of the KMP algorithm is its **while** loop, each iteration of which
performs a comparison between the character at index `j` in `T` and the character at
index `k` in `P`. If the outcome of this comparison is a match, the algorithm moves on
to the next characters in both `T` and `P` (or reports a match if reaching the end of the
pattern). If the comparison failed, the algorithm consults the failure function for a
new candidate character in `P`, or starts over with the next index in `T` if failing on
the first character of the pattern (since nothing can be reused).

###KMP Failure Function

In [0]:
def compute_kmp_fail(P):
  '''Utility that computes and returns KMP fail list'''
  m = len(P)
  fail = [0] * m # by default, presume overlap of 0 everywhere
  j = 1
  k = 0
  while j < m: # compute f(j) during this pass, if nonzero
    if P[j] == P[k]: # k+1 characters match thus far
      fail[j] = k+1
      j += 1
      k += 1
    elif k > 0: # k follows a matching prefix
      k = fail[k-1]
    else: # no match found starting at j
      j += 1
  return fail

The failure function applies a "bootstrapping" process that compares the pattern to itself as in the KMP algorithm. Each time two characters match, set $f(j)=k+1$.

*The Knuth-Morris-Pratt algorithm performs pattern matching
on a text string of length $n$ and a pattern string of length $m$ in $O(n+m)$ time.*

#13.3 Dynamic Programming
The **dynamic programming** algorithm-design technique can often take problems that seem to require exponential time and produce polynomial-time algorithms to solve them.

##13.3.1 Matrix Chain-Product
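Suppose a collection of $n$ matrices $A_0, A_1, \ldots, A_{n-1}$ is given, where $A_i$ is a $d_i\times d_{i+1}$ matrix, and the product $A=A_0\cdot A_1\cdots A_{n-1}$ is to be computed. Because matrix multiplication is associative, the expression can be parenthesized in any way without changing the result.
The matrix chain-product problem is to determine the parenthesization of the
expression defining the product $A$ that minimizes the total number of scalar multiplications
performed.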

The straightforward ("brute force") way to solve this problem is to enumerate all possible ways of parenthesizing the expression for $A$ and determine the number of multiplications performed by each one. However, the number of different parenthesizations of the expression for $A$ equals the number of different binary trees that have $n$ leaves, which is exponential in $n$.
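To get a feel for this growth, the count can be computed in closed form: binary trees with $n$ leaves are counted by the Catalan number $C_{n-1}$. The sketch below assumes Python 3.8 or later for `math.comb`; the function name `num_parenthesizations` is hypothetical.

```python
from math import comb

def num_parenthesizations(n):
    """Number of ways to fully parenthesize a chain of n matrices (n >= 1)."""
    # Binary trees with n leaves are counted by the Catalan number C(n-1),
    # where C(k) = comb(2k, k) // (k + 1).
    k = n - 1
    return comb(2 * k, k) // (k + 1)

for n in (2, 3, 4, 5, 10, 20):
    print(n, num_parenthesizations(n))
```

Three matrices already allow 2 parenthesizations, five allow 14, and the count grows exponentially from there.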

###Characterizing Optimal Solutions
The first step is to characterize an optimal solution to a particular subproblem in terms of optimal solutions to its subproblems.
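One way to write this characterization is sketched here, under two assumptions: $N_{i,j}$ is notation introduced to match the table entry `N[i][j]` in the code of this section (the minimum number of multiplications needed to compute $A_i\cdots A_j$), and multiplying a $p\times q$ matrix by a $q\times r$ matrix costs $pqr$ scalar multiplications. If the final multiplication of the subchain splits it between $A_k$ and $A_{k+1}$, then both halves must themselves be computed optimally, which gives

$$N_{i,i}=0,\qquad N_{i,j}=\min_{i\leq k<j}\left\{N_{i,k}+N_{k+1,j}+d_i\,d_{k+1}\,d_{j+1}\right\}\quad\text{for } i<j.$$

The last term is the cost of multiplying the $d_i\times d_{k+1}$ result of the left half by the $d_{k+1}\times d_{j+1}$ result of the right half.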

###Designing a Dynamic Programming Algorithm
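The characterization above involves a **sharing of subproblems**, which prevents the problem from being divided into completely independent subproblems, as divide-and-conquer would require. Instead, the subproblem solutions are computed bottom-up, from shorter subchains to longer ones, and stored in a table so that each one is computed only once.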

In [0]:
def matrix_chain(d):
  '''d is a list of n+1 numbers such that size of kth matrix is d[k]-by-d[k+1]
  
  Return an n-by-n table such that N[i][j] represents the minimum number of
  multiplications needed to compute the product of Ai through Aj inclusive
  '''
  n = len(d) - 1 # number of matrices
  N = [[0] * n for i in range(n)] # initialize n-by-n result to zero
  for b in range(1, n): # number of products in subchain
    for i in range(n-b): # start of subchain
      j = i + b # end of subchain
      N[i][j] = min(N[i][k]+N[k+1][j]+d[i]*d[k+1]*d[j+1] for k in range(i,j))
  return N

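The total running time of this algorithm is $O(n^3)$, since the table has $O(n^2)$ entries and each one is computed as a minimum over $O(n)$ choices of $k$.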
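As a cross-check on a small instance, the same recurrence can be evaluated top-down with memoization instead of bottom-up with a table. This is an alternative formulation sketched for illustration, not the approach used above; the name `matrix_chain_cost` is hypothetical, and the sketch assumes the chain contains at least one matrix.

```python
from functools import lru_cache

def matrix_chain_cost(d):
    """Minimum scalar multiplications for a chain whose kth matrix is d[k]-by-d[k+1]."""
    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == j:
            return 0  # a single matrix needs no multiplication
        # Try every position k for the final (outermost) multiplication.
        return min(cost(i, k) + cost(k + 1, j) + d[i] * d[k + 1] * d[j + 1]
                   for k in range(i, j))
    return cost(0, len(d) - 2)

# Three matrices: 10x30, 30x5, 5x60.
print(matrix_chain_cost([10, 30, 5, 60]))  # 4500, from (A0*A1)*A2
```

Here $(A_0 A_1)A_2$ costs $10\cdot30\cdot5+10\cdot5\cdot60=4500$ multiplications, while $A_0(A_1 A_2)$ costs $30\cdot5\cdot60+10\cdot30\cdot60=27000$, so the order of evaluation matters a great deal.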

##13.3.2 DNA and Text Sequence Alignment
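A common text-processing problem, arising in genetics and software engineering, is to test the similarity between two text strings. In a genetics application, the two strings could correspond to two strands of DNA whose similarity is to be measured.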

Given a string $X=x_0x_1x_2...x_{n-1}$, a subsequence of $X$ is any string that is of the form $x_{i_1}x_{i_2}...x_{i_k}$, where $i_j<i_{j+1}$; that is, it is a sequence of characters that are not necessarily contiguous but are nevertheless taken in order from $X$. For example, the string AAAG is a subsequence of the string CG**A**T**AA**TT**G**AGA.
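The definition translates into a short membership test. The following sketch is illustrative only (the function name `is_subsequence` is an assumption); it scans $X$ once from left to right, consuming characters until each character of the candidate string has been found in order.

```python
def is_subsequence(S, X):
    """Return True if S is a subsequence of X."""
    it = iter(X)
    # `c in it` advances the iterator until it finds c, so each character
    # of S must be found after the previous one.
    return all(c in it for c in S)

print(is_subsequence('AAAG', 'CGATAATTGAGA'))  # True
print(is_subsequence('GAAA', 'CGATTG'))        # False
```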

The specific text-similarity problem addressed here is the **longest common subsequence** (LCS) problem. In this problem, two character strings over some alphabet are given, $X=x_0x_1x_2...x_{n-1}$ and $Y=y_0y_1y_2...y_{m-1}$, and the task is to find a longest string $S$ that is a subsequence of both $X$ and $Y$.

###Components of a Dynamic Programming Solution
The dynamic programming technique is used primarily for **optimization** problems, where the objective is to find the "best" way of doing something.

Dynamic programming is applied if the problem has certain properties:
* **Simple Subproblems**: There has to be some way of repeatedly breaking the global
optimization problem into subproblems. Moreover, there should be a way to
parameterize subproblems with just a few indices, like i, j, k, and so on.
* **Subproblem Optimization**: An optimal solution to the global problem must be a
composition of optimal subproblem solutions.
* **Subproblem Overlap**: Optimal solutions to unrelated subproblems can contain
subproblems in common.

###The LCS Algorithm

In [0]:
def LCS(X, Y):
  '''Return table such that L[j][k] is length of LCS for X[0:j] and Y[0:k]'''
  n, m = len(X), len(Y)
  L = [[0]*(m+1) for k in range(n+1)] # (n+1) x (m+1) table
  for j in range(n):
    for k in range(m):
      if X[j] == Y[k]: # align this match
        L[j+1][k+1] = L[j][k] + 1
      else: # choose to ignore one character
        L[j+1][k+1] = max(L[j][k+1], L[j+1][k])
  return L

This algorithm runs in $O(nm)$ time, since the two nested loops perform $O(1)$ work for each of the $nm$ pairs of indices $(j,k)$.

In [0]:
def LCS_solution(X, Y, L):
  '''Return a longest common subsequence of X and Y, given LCS table L'''
  solution = []
  j,k = len(X), len(Y)
  while L[j][k] > 0: # common characters remain
    if X[j-1] == Y[k-1]:
      solution.append(X[j-1])
      j -= 1
      k -= 1
    elif L[j-1][k] >= L[j][k-1]:
      j -= 1
    else:
      k -= 1
  return ''.join(reversed(solution)) # return left-to-right version

The `LCS` function computes the lengths of longest common subsequences for all pairs of prefixes, but not a subsequence itself. The `LCS_solution` function reverse engineers that calculation, working from back to front through the table to reconstruct a longest common subsequence in $O(n+m)$ additional time.
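When only the length is needed and not the subsequence itself, the full table is not required: each row of `L` depends only on the row before it. The sketch below is a space-saving variant written for illustration (the name `lcs_length` is hypothetical); it keeps two rows at a time, using $O(m)$ space instead of $O(nm)$.

```python
def lcs_length(X, Y):
    """Return the length of a longest common subsequence of X and Y."""
    prev = [0] * (len(Y) + 1)  # row j of the full table
    for x in X:
        curr = [0]  # row j+1, built left to right
        for k, y in enumerate(Y):
            if x == y:
                curr.append(prev[k] + 1)  # align this match
            else:
                curr.append(max(prev[k + 1], curr[k]))  # ignore one character
        prev = curr
    return prev[-1]

print(lcs_length('ABCBDAB', 'BDCABA'))  # 4, e.g. 'BCBA'
```

The trade-off is that the traceback performed by `LCS_solution` is no longer possible, because it needs the entire table.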