# Longest Common Subsequence

In [1]:
project_name = 'longest-common-subsequence'

In [2]:
!pip install jovian --upgrade --quiet

In [3]:
import jovian

In [4]:
jovian.commit(project=project_name)

[jovian] Attempting to save notebook..
[jovian] Updating notebook "poduguvenu/longest-common-subsequence" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/poduguvenu/longest-common-subsequence


'https://jovian.ai/poduguvenu/longest-common-subsequence'

## Problem Statement


> **QUESTION 1**: Write a function to find the length of the longest common subsequence between two sequences. E.g. Given the strings "serendipitous" and "precipitation", the longest common subsequence is "reipito" and its length is 7.
>
> A "sequence" is a group of items with a deterministic ordering. Lists, tuples and ranges are some common sequence types in Python.
>
> A "subsequence" is a sequence obtained by deleting zero or more elements from another sequence. For example, "edpt" is a subsequence of "serendipitous".
<img src="https://i.imgur.com/ry4Y0wS.png" width="420">
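
The definition of a subsequence above can be checked mechanically. The helper below is a minimal sketch; the name `is_subsequence` is an assumption introduced here for illustration and is not part of the assignment.

```python
def is_subsequence(candidate, seq):
    """Return True if `candidate` can be obtained from `seq` by deleting zero or more elements."""
    it = iter(seq)
    # `item in it` advances the iterator until it finds `item`,
    # so elements must appear in `seq` in the same relative order.
    return all(item in it for item in candidate)

print(is_subsequence('edpt', 'serendipitous'))     # True
print(is_subsequence('reipito', 'precipitation'))  # True
print(is_subsequence('tpde', 'serendipitous'))     # False
```

A common subsequence of two sequences is simply one for which this check succeeds against both.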


## The Method

Here's the systematic strategy we'll apply for solving problems:

1. State the problem clearly. Identify the input & output formats.
2. Come up with some example inputs & outputs. Try to cover all edge cases.
3. Come up with a correct solution for the problem. State it in plain English.
4. Implement the solution and test it using example inputs. Fix bugs, if any.
5. Analyze the algorithm's complexity and identify inefficiencies, if any.
6. Apply the right technique to overcome the inefficiency. Repeat steps 3 to 6.

This approach is explained in detail in [Lesson 1](https://jovian.ai/learn/data-structures-and-algorithms-in-python/lesson/lesson-1-binary-search-linked-lists-and-complexity) of the course. Let's apply this approach step-by-step.

## Solution


### 1. State the problem clearly. Identify the input & output formats.

While this problem is stated clearly enough, it's always useful to express it in your own words, in the way that makes it clearest to you.


**Problem**

> We are given two sequences and we need to find the length of the longest common subsequence between them.

<br/>


**Input**

1. **seq1**: A sequence e.g. `'serendipitous'`
2. **seq2**: Another sequence e.g. `'precipitation'`


**Output**

1. **len_lcs**: Length of the longest common subsequence e.g. 7

<br/>

Based on the above, we can now create a signature of our function:

In [5]:
def len_lcs(seq1, seq2):
    pass

### 2. Come up with some example inputs & outputs. Try to cover all edge cases.

Our function should be able to handle any set of valid inputs we pass into it. Here's a list of some possible variations we might encounter:

**Test cases**

1. General case (string)
2. General case (list)
3. No common subsequence
4. One is a subsequence of the other
5. One sequence is empty
6. Both sequences are empty
7. Multiple subsequences with same length
    1. “abcdef” and “badcfe”


We'll express our test cases as dictionaries, to test them easily. Each dictionary will contain 2 keys: `input` (a dictionary itself, containing one key for each argument to the function) and `output` (the expected result from the function).

In [6]:
# General case(string)
T0 = {
    'input': {
        'seq1': 'serendipitous',
        'seq2': 'precipitation'
    },
    'output': 7
}
# General case(list)
T1 = {
    'input': {
        'seq1': [1, 3, 5, 6, 7, 2, 5, 2, 3],
        'seq2': [6, 2, 4, 7, 1, 5, 6, 2, 3]
    },
    'output': 5
}
# Another general case(string)
T2 = {
    'input': {
        'seq1': 'longest',
        'seq2': 'stone'
    },
    'output': 3
}
# No common subsequence
T3 = {
    'input': {
        'seq1': 'asfasgasdgsad',
        'seq2': 'kj;ljjlj;ljjl'
    },
    'output': 0
}
# One is a subsequence of the other
T4 = {
    'input': {
        'seq1': 'dense',
        'seq2': 'condensed'
    },
    'output': 5
}
# One sequence is empty
T5 = {
    'input': {
        'seq1': '',
        'seq2': 'opkpoiklklj'
    },
    'output': 0
}
# Both sequences are empty
T6 = {
    'input': {
        'seq1': '',
        'seq2': ''
    },
    'output': 0
}
# Multiple subsequences with same length
T7 = {
    'input': {
        'seq1': 'abcdef',
        'seq2': 'badcfe'
    },
    'output': 3
}

In [7]:
lcq_tests = [T0, T1, T2, T3, T4, T5, T6, T7]
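
To make the role of the `input` and `output` keys concrete, here is a minimal sketch of a test runner. It is not the `jovian` implementation; `run_test`, `count_shared_items` and `sample_test` are hypothetical names used only for this illustration.

```python
import time

def run_test(function, test):
    """Call `function` with the test's inputs and compare against the expected output."""
    start = time.perf_counter()
    actual = function(**test['input'])  # keys of 'input' become keyword arguments
    elapsed_ms = (time.perf_counter() - start) * 1000
    return actual, actual == test['output'], round(elapsed_ms, 3)

# Hypothetical stand-in function, used only to exercise the runner.
def count_shared_items(seq1, seq2):
    return len(set(seq1) & set(seq2))

sample_test = {'input': {'seq1': 'abc', 'seq2': 'bcd'}, 'output': 2}
actual, passed, _ = run_test(count_shared_items, sample_test)
print(actual, passed)  # 2 True
```

Because the keys of `input` match the parameter names `seq1` and `seq2`, the same dictionaries can be passed to any function with that signature.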

### 3. Come up with a correct solution for the problem. State it in plain English.

Our first goal should always be to come up with a _correct_ solution to the problem, which may not necessarily be the most _efficient_ solution. Come up with a correct solution and explain it in simple words below:

## Recursive Solution


1. Create two counters `idx1` and `idx2` starting at 0. Our recursive function will compute the LCS of `seq1[idx1:]` and `seq2[idx2:]`


2. If `seq1[idx1]` and `seq2[idx2]` are equal, then this character belongs to the LCS of `seq1[idx1:]` and `seq2[idx2:]` (why?). Further, the length of this LCS is one more than the length of the LCS of `seq1[idx1+1:]` and `seq2[idx2+1:]`.

<img src="https://i.imgur.com/um7LDiX.png" width="400">

3. If not, then the LCS of `seq1[idx1:]` and `seq2[idx2:]` is the longer one among the LCS of `seq1[idx1+1:]`, `seq2[idx2:]` and the LCS of `seq1[idx1:]`, `seq2[idx2+1:]`.

<img src="https://i.imgur.com/DRanmOy.png" width="360">

4. If either `seq1[idx1:]` or `seq2[idx2:]` is empty, then their LCS is empty.



Here's what the tree of recursive calls looks like:


![](https://i.imgur.com/JJrq3KH.png)

###  4. Implement the solution and test it using example inputs. Fix bugs, if any.

In [8]:
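def lcs_recursive(seq1, seq2, idx1=0, idx2=0):
    # Base case: one of the remaining suffixes is empty
    if idx1 == len(seq1) or idx2 == len(seq2):
        return 0
    # Matching elements extend the LCS by one
    if seq1[idx1] == seq2[idx2]:
        return 1 + lcs_recursive(seq1, seq2, idx1+1, idx2+1)
    else:
        # Otherwise skip one element from either sequence and keep the better result
        option1 = lcs_recursive(seq1, seq2, idx1+1, idx2)
        option2 = lcs_recursive(seq1, seq2, idx1, idx2+1)
        return max(option1, option2)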

In [9]:
T0

{'input': {'seq1': 'serendipitous', 'seq2': 'precipitation'}, 'output': 7}

In [10]:
%%time
# lcs_recursive(T0['input']['seq1'], T0['input']['seq2']) == T0['output']
lcs_recursive(**T0['input']) == T0['output']

CPU times: user 338 ms, sys: 0 ns, total: 338 ms
Wall time: 337 ms


True

We can test the function by passing the input to it directly or by using the `evaluate_test_case` function from `jovian`.

In [11]:
from jovian.pythondsa import evaluate_test_case

In [12]:
evaluate_test_case(lcs_recursive, T0)


Input:
{'seq1': 'serendipitous', 'seq2': 'precipitation'}

Expected Output:
7


Actual Output:
7

Execution Time:
370.582 ms

Test Result:
PASSED



(7, True, 370.582)

Evaluate your function against all the test cases together using the `evaluate_test_cases` (plural) function from `jovian`.

In [13]:
from jovian.pythondsa import evaluate_test_cases

In [14]:
evaluate_test_cases(lcs_recursive, lcq_tests)


TEST CASE #0

Input:
{'seq1': 'serendipitous', 'seq2': 'precipitation'}

Expected Output:
7


Actual Output:
7

Execution Time:
322.676 ms

Test Result:
PASSED


TEST CASE #1

Input:
{'seq1': [1, 3, 5, 6, 7, 2, 5, 2, 3], 'seq2': [6, 2, 4, 7, 1, 5, 6, 2, 3]}

Expected Output:
5


Actual Output:
5

Execution Time:
5.018 ms

Test Result:
PASSED


TEST CASE #2

Input:
{'seq1': 'longest', 'seq2': 'stone'}

Expected Output:
3


Actual Output:
3

Execution Time:
0.204 ms

Test Result:
PASSED


TEST CASE #3

Input:
{'seq1': 'asfasgasdgsad', 'seq2': 'kj;ljjlj;ljjl'}

Expected Output:
0


Actual Output:
0

Execution Time:
6606.877 ms

Test Result:
PASSED


TEST CASE #4

Input:
{'seq1': 'dense', 'seq2': 'condensed'}

Expected Output:
5


Actual Output:
5

Execution Time:
0.23 ms

Test Result:
PASSED


TEST CASE #5

Input:
{'seq1': '', 'seq2': 'opkpoiklklj'}

Expected Output:
0


Actual Output:
0

Executi

[(7, True, 322.676),
 (5, True, 5.018),
 (3, True, 0.204),
 (0, True, 6606.877),
 (5, True, 0.23),
 (0, True, 0.002),
 (0, True, 0.002),
 (3, True, 0.062)]

Verify that all the test cases were evaluated. All eight pass, but note that test case #3 (no common elements) takes over 6 seconds, which points to an inefficiency in the recursive solution.

### 5. Analyze the algorithm's complexity and identify inefficiencies, if any.
#### Complexity Analysis

The worst case occurs when we have to try 2 subproblems at every step, i.e. when the sequences have no common elements.

<img src="https://i.imgur.com/z5m36m8.png" width="360">

Here's what the tree looks like in such a case (source - Techie Delight):

<img src="https://i.imgur.com/n8ZgBYj.png" width="500">

All the leaf nodes are `(0, 0)`. Can you count the number of leaf nodes?

*HINT*: Count the number of unique paths from root to leaf. The length of each path is `m+n` and at each level there are 2 choices. 

Based on the above, we can infer that the time complexity is $O(2^{m+n})$, where $m$ and $n$ are the lengths of the two sequences.
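
The growth in the number of recursive calls can be observed directly. The sketch below re-implements the same recursion with a call counter; `count_calls` is a hypothetical helper introduced here, and the inputs are chosen to share no elements so that every call branches twice.

```python
def count_calls(seq1, seq2):
    """Count how many times the plain recursion is invoked for the given inputs."""
    calls = 0

    def recurse(idx1, idx2):
        nonlocal calls
        calls += 1
        if idx1 == len(seq1) or idx2 == len(seq2):
            return 0
        if seq1[idx1] == seq2[idx2]:
            return 1 + recurse(idx1 + 1, idx2 + 1)
        return max(recurse(idx1 + 1, idx2), recurse(idx1, idx2 + 1))

    recurse(0, 0)
    return calls

# Sequences with no common elements force two recursive calls at every step.
for n in range(1, 8):
    print(n, count_calls('a' * n, 'b' * n))
```

The counts stay below the $2^{m+n}$ upper bound but still grow exponentially with the input size, whereas two identical sequences of length `n` need only `n + 1` calls.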


### 6. Apply the right technique to overcome the inefficiency. Repeat steps 3 to 6.

## Memoization

In [15]:
def lcs_memo(seq1, seq2):
    # Maps (idx1, idx2) to the LCS length of seq1[idx1:] and seq2[idx2:]
    memo = {}
    def recurse(idx1=0, idx2=0):
        key = (idx1, idx2)
        # Reuse the stored result if this subproblem was already solved
        if key in memo:
            return memo[key]
        elif idx1 == len(seq1) or idx2 == len(seq2):
            memo[key] = 0
        elif seq1[idx1] == seq2[idx2]:
            memo[key] = 1 + recurse(idx1+1, idx2+1)
        else:
            memo[key] = max(recurse(idx1+1, idx2), recurse(idx1, idx2+1))
        return memo[key]
    return recurse(0, 0)

In [16]:
evaluate_test_cases(lcs_memo, lcq_tests)


TEST CASE #0

Input:
{'seq1': 'serendipitous', 'seq2': 'precipitation'}

Expected Output:
7


Actual Output:
7

Execution Time:
0.176 ms

Test Result:
PASSED


TEST CASE #1

Input:
{'seq1': [1, 3, 5, 6, 7, 2, 5, 2, 3], 'seq2': [6, 2, 4, 7, 1, 5, 6, 2, 3]}

Expected Output:
5


Actual Output:
5

Execution Time:
0.065 ms

Test Result:
PASSED


TEST CASE #2

Input:
{'seq1': 'longest', 'seq2': 'stone'}

Expected Output:
3


Actual Output:
3

Execution Time:
0.052 ms

Test Result:
PASSED


TEST CASE #3

Input:
{'seq1': 'asfasgasdgsad', 'seq2': 'kj;ljjlj;ljjl'}

Expected Output:
0


Actual Output:
0

Execution Time:
0.163 ms

Test Result:
PASSED


TEST CASE #4

Input:
{'seq1': 'dense', 'seq2': 'condensed'}

Expected Output:
5


Actual Output:
5

Execution Time:
0.031 ms

Test Result:
PASSED


TEST CASE #5

Input:
{'seq1': '', 'seq2': 'opkpoiklklj'}

Expected Output:
0


Actual Output:
0

Execution T

[(7, True, 0.176),
 (5, True, 0.065),
 (3, True, 0.052),
 (0, True, 0.163),
 (5, True, 0.031),
 (0, True, 0.002),
 (0, True, 0.002),
 (3, True, 0.028)]

#### Complexity Analysis
From the above, we can see that if a key already exists in the `memo` dictionary, the computation for that key is skipped. The number of computations we need to perform is therefore at most the maximum number of entries that can end up in `memo`, which is $(m+1) \times (n+1)$ for sequences of lengths $m$ and $n$. This reduces the complexity from $O(2^{m+n})$ to $O(m \times n)$.
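
The bound on the number of stored entries can be checked empirically. The sketch below is a variant of the memoized function that also returns the size of `memo`; `lcs_memo_stats` is a hypothetical name introduced only for this check.

```python
def lcs_memo_stats(seq1, seq2):
    """Memoized LCS length that also reports how many subproblems were stored."""
    memo = {}

    def recurse(idx1, idx2):
        key = (idx1, idx2)
        if key not in memo:
            if idx1 == len(seq1) or idx2 == len(seq2):
                memo[key] = 0
            elif seq1[idx1] == seq2[idx2]:
                memo[key] = 1 + recurse(idx1 + 1, idx2 + 1)
            else:
                memo[key] = max(recurse(idx1 + 1, idx2), recurse(idx1, idx2 + 1))
        return memo[key]

    return recurse(0, 0), len(memo)

seq1, seq2 = 'asfasgasdgsad', 'kj;ljjlj;ljjl'
length, stored = lcs_memo_stats(seq1, seq2)
# The number of stored entries never exceeds (m + 1) * (n + 1).
print(length, stored, (len(seq1) + 1) * (len(seq2) + 1))
```

Even for this worst-case input with no common elements, the number of stored subproblems stays within the size of the `(m+1) x (n+1)` grid of index pairs.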

## Dynamic Programming

1. Create a table of size `(n1+1) * (n2+1)` initialized with 0s, where `n1` and `n2` are the lengths of the sequences. `table[i][j]` represents the length of the longest common subsequence of `seq1[:i]` and `seq2[:j]`. Here's what the table looks like (source: Kevin Mavani, Medium).


<img src="https://i.imgur.com/SAsEol6.png">



2. If `seq1[i]` and `seq2[j]` are equal, then `table[i+1][j+1] = 1 + table[i][j]`

3. If `seq1[i]` and `seq2[j]` are not equal, then `table[i+1][j+1] = max(table[i][j+1], table[i+1][j])`


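The complexity of the dynamic programming approach is proportional to the size of the table, since each cell is filled once in constant time.

Verify that the complexity of the dynamic programming approach is $O(n_1 \times n_2)$, which is the same as memoization. The iterative version also avoids function-call overhead and Python's recursion depth limit, although on inputs this small the measured timings of the two approaches are comparable.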

In [17]:
n1, n2 = 5, 7
[[0 for x in range(n2)] for x in range(n1)]

[[0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0]]
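
The nested list comprehension matters here. The short sketch below (with small, arbitrary dimensions chosen for illustration) contrasts it with list multiplication, which repeats a reference to a single row.

```python
n1, n2 = 2, 3

# Each row is built by its own inner comprehension, so rows are independent lists.
independent = [[0 for _ in range(n2)] for _ in range(n1)]
independent[0][0] = 1
print(independent)  # [[1, 0, 0], [0, 0, 0]]

# Multiplying the outer list repeats a reference to the same row object.
aliased = [[0] * n2] * n1
aliased[0][0] = 1
print(aliased)  # [[1, 0, 0], [1, 0, 0]]
```

With aliased rows, writing to one cell of the table would silently change every row, so the comprehension form is the safer choice.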

In [18]:
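def lcs_dyn_prog(seq1, seq2):
    n1, n2 = len(seq1), len(seq2)
    # table[i][j] holds the LCS length of seq1[:i] and seq2[:j]; row 0 and column 0 stay 0
    table = [[0 for x in range(n2+1)] for x in range(n1+1)]
    for i in range(n1):
        for j in range(n2):
            if seq1[i] == seq2[j]:
                # Matching elements extend the LCS of the two shorter prefixes
                table[i+1][j+1] = 1 + table[i][j]
            else:
                # Otherwise carry over the better of the two neighbouring results
                table[i+1][j+1] = max(table[i][j+1], table[i+1][j])
    # The bottom-right cell covers both full sequences
    return table[-1][-1]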

In [19]:
evaluate_test_cases(lcs_dyn_prog, lcq_tests)


TEST CASE #0

Input:
{'seq1': 'serendipitous', 'seq2': 'precipitation'}

Expected Output:
7


Actual Output:
7

Execution Time:
0.156 ms

Test Result:
PASSED


TEST CASE #1

Input:
{'seq1': [1, 3, 5, 6, 7, 2, 5, 2, 3], 'seq2': [6, 2, 4, 7, 1, 5, 6, 2, 3]}

Expected Output:
5


Actual Output:
5

Execution Time:
0.09 ms

Test Result:
PASSED


TEST CASE #2

Input:
{'seq1': 'longest', 'seq2': 'stone'}

Expected Output:
3


Actual Output:
3

Execution Time:
0.044 ms

Test Result:
PASSED


TEST CASE #3

Input:
{'seq1': 'asfasgasdgsad', 'seq2': 'kj;ljjlj;ljjl'}

Expected Output:
0


Actual Output:
0

Execution Time:
0.125 ms

Test Result:
PASSED


TEST CASE #4

Input:
{'seq1': 'dense', 'seq2': 'condensed'}

Expected Output:
5


Actual Output:
5

Execution Time:
0.054 ms

Test Result:
PASSED


TEST CASE #5

Input:
{'seq1': '', 'seq2': 'opkpoiklklj'}

Expected Output:
0


Actual Output:
0

Execution Ti

[(7, True, 0.156),
 (5, True, 0.09),
 (3, True, 0.044),
 (0, True, 0.125),
 (5, True, 0.054),
 (0, True, 0.006),
 (0, True, 0.009),
 (3, True, 0.043)]
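
The problem only asks for the length, but the same table can also yield an actual longest common subsequence by walking back from the bottom-right cell. The function below is a sketch of that standard backtracking idea, not part of the original assignment; `lcs_sequence` is a hypothetical name, and when several subsequences share the maximum length it returns just one of them.

```python
def lcs_sequence(seq1, seq2):
    """Return one longest common subsequence as a list of elements."""
    n1, n2 = len(seq1), len(seq2)
    table = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1):
        for j in range(n2):
            if seq1[i] == seq2[j]:
                table[i + 1][j + 1] = 1 + table[i][j]
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner, collecting matched elements.
    result = []
    i, j = n1, n2
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            result.append(seq1[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1  # the value came from the row above
        else:
            j -= 1  # the value came from the column to the left
    result.reverse()
    return result

lcs = lcs_sequence('serendipitous', 'precipitation')
print(''.join(lcs), len(lcs))
```

The result is returned as a list so that the function works for both strings and lists; use `''.join(...)` to turn it back into a string.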

In [20]:
jovian.commit()
