# <font color="blue">Submitted by: Kaspar Kadalipp </font>
# HW11. Text algorithms, NFA and projects

### <font color='orange'> Less important code is placed here</font>
### <font color='orange'> Report is below </font>

In [1]:
# Exercise 1

from termcolor import colored
import sys

# print matched pattern in red
def print_matches(matches, word):
    result = ''
    prev = 0
    for match in matches:
        result += word[prev:match['start']] + colored(word[match['start']:match['end']], 'red')
        prev = match['end']
    result += word[prev:]  # text after the last match
    print(result, end='')

def naive_search(pattern, word):
    result = []
    for i in range(len(word) - len(pattern) + 1):
        for j in range(len(pattern)):
            if word[i + j] != pattern[j]: break
        else:
            start = i
            # don't print overlapping
            if not result or start >= result[-1]['end']:
                result.append({'start': start, 'end': start + len(pattern)})
    return result

def kmp_preprocess(pattern):
    fail_links = [0] * len(pattern)
    i = 1
    j = 0  # prefix index
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            j += 1
            fail_links[i] = j
            i += 1
        elif j == 0:
            fail_links[i] = 0
            i += 1
        else:
            j = fail_links[j - 1]
    return fail_links


# similar to www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching
def kmp_search(pattern, word):
    result = []
    fail_links = kmp_preprocess(pattern)
    j = 0  # pattern index
    i = 0  # word index
    while (len(word) - i) >= (len(pattern) - j):
        if pattern[j] == word[i]:
            i += 1
            j += 1
        if j == len(pattern):
            start = i - j
            # don't print overlapping matches
            if not result or start >= result[-1]['end']:
                result.append({'start': start, 'end': start + len(pattern)})
            j = fail_links[j - 1]
        elif i < len(word) and pattern[j] != word[i]:
            if j != 0:
                j = fail_links[j - 1]
            else:
                i += 1
    return result

# for command line
def grep_naive():
    pattern = sys.argv[1]
    for line in sys.stdin:
        matches = naive_search(pattern, line)
        if matches:
            print_matches(matches, line)


def grep_naive_(pattern, file):
    with open(file, 'r') as f:
        for line in f:
            matches = naive_search(pattern, line)
            if matches:
                print_matches(matches, line)

# for command line
def grep_kmp():
    pattern = sys.argv[1]
    for line in sys.stdin:
        matches = kmp_search(pattern, line)
        if matches:
            print_matches(matches, line)

def grep_kmp_(pattern, file):
    with open(file, 'r') as f:
        for line in f:
            matches = kmp_search(pattern, line)
            if matches:
                print_matches(matches, line)

# EX1

#####  Write a very simple "grep" all by yourself - for comparing naive search and one other method of your choice from KMP, BMH, BMHHS algorithms. Use for example the same dictionary of English words as was used for hash function collision counting (<a href="https://raw.githubusercontent.com/dwyl/english-words/master/words.txt">https://raw.githubusercontent.com/dwyl/english-words/master/words.txt</a>).

I implemented grep with both the naive search and the KMP algorithm. Each version reads lines from words.txt and prints those that contain the pattern, with the matched part in red, as grep does. I also demonstrated that a line can have multiple matches; overlapping matches are skipped.
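To make the KMP preprocessing step concrete, here is a minimal standalone sketch of the failure-link table. The name `failure_links` is chosen for this illustration; it is a compact variant of the table construction, not the exact code of `kmp_preprocess` above.

```python
def failure_links(pattern):
    """For each i, the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    links = [0] * len(pattern)
    j = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # fall back through shorter prefixes until one can be extended
        while j > 0 and pattern[i] != pattern[j]:
            j = links[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        links[i] = j
    return links


print(failure_links("kitty"))  # [0, 0, 0, 0, 0]
print(failure_links("abab"))   # [0, 0, 1, 2]
```

For `kitty` every entry is 0, so the table offers no partial-match shortcuts for this pattern. A pattern with repeated prefixes, such as `abab`, is where the failure links let the search resume mid-pattern.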

In [2]:
grep_naive_(pattern='kitty', file='words.txt')

bitch-[31mkitty[0m
[31mkitty[0m-cat
[31mkitty[0mcorner
[31mkitty[0m-corner
[31mkitty[0mcornered
[31mkitty[0m-cornered
[31mkitty[0msol
s[31mkitty[0m
s[31mkitty[0mboot


In [3]:
grep_kmp_(pattern= 'kitty', file='words.txt')

bitch-[31mkitty[0m
[31mkitty[0m-cat
[31mkitty[0mcorner
[31mkitty[0m-corner
[31mkitty[0mcornered
[31mkitty[0m-cornered
[31mkitty[0msol
s[31mkitty[0m
s[31mkitty[0mboot


![grep](https://i.imgur.com/7MDPgqw.png)

# EX2

##### Now, in comparison, use the grep (<a href="https://en.wikipedia.org/wiki/Grep">https://en.wikipedia.org/wiki/Grep</a>) and fgrep (or grep -F) for multiple patterns. State the speed - how many MB/s and how many times faster are these than your code? What is the speed difference matching one (long, short) pattern vs many patterns?

I created a 46.3 MB file by copying the contents of words.txt into a new file 10 times. I used 'random' as the long pattern, because it has few matches, and 'ab' as the short pattern.

The naive algorithm ran at ~15 MB/s for the long pattern and ~8.8 MB/s for the short pattern.

The KMP algorithm ran at ~8.8 MB/s for the long pattern and ~7.1 MB/s for the short pattern.

Both grep and fgrep ran at ~1500 MB/s for the long pattern and ~1000 MB/s for the short pattern.

So grep is at least 100 times faster than my Python implementations of the naive and KMP algorithms when they are run from the command line.
In every measurement the long pattern was found faster than the short one. That might be because lines shorter than the long pattern need no search at all, and because the short pattern is more likely to match, often several times per line.
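The "at least 100 times" figure can be checked with a little arithmetic on the measured speeds. The sketch below only restates the numbers from this report; the dictionary layout and the helper name `speedup` are illustrative.

```python
# Measured throughput in MB/s, copied from the report above.
speeds = {
    "naive": {"long": 15.0, "short": 8.8},
    "kmp": {"long": 8.8, "short": 7.1},
    "grep": {"long": 1500.0, "short": 1000.0},
}


def speedup(algorithm, pattern_kind):
    """How many times faster grep is than the given Python implementation."""
    return speeds["grep"][pattern_kind] / speeds[algorithm][pattern_kind]


for algorithm in ("naive", "kmp"):
    for pattern_kind in ("long", "short"):
        ratio = speedup(algorithm, pattern_kind)
        print(f"grep vs {algorithm} ({pattern_kind} pattern): {ratio:.0f}x")
```

The ratios range from about 100 (naive, long pattern) to about 170 (KMP, long pattern).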

It's interesting to note that CPU usage was 99% for the KMP and naive algorithms, but at most 96% for grep and fgrep.


![comparison](https://i.imgur.com/enVfxdB.png)

# EX3

#####  Given the following

```
(ab|(aa|b)(ba)*(bb|a))*
```

##### Construct an NFA (Nondeterministic-Finite Automaton) using Thompson construction algorithm.

##### Minimize the constructed NFA.

##### Provide example patterns which are accepted by this automaton and some that are not.

Examples of accepted patterns:
- aaa
- aabb
- ba
- ab
- baba
- (empty string)

Examples of patterns that are not accepted:
- a
- b
- aab
- baa
- aba
- bab
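As a quick sanity check, the two lists above can be tested against the regular expression with Python's `re` module. This only checks the language of the expression; it does not build the Thompson NFA. The variable names are illustrative.

```python
import re

# The regular expression from the task; fullmatch requires the
# whole string to match, which corresponds to acceptance.
PATTERN = re.compile(r"(ab|(aa|b)(ba)*(bb|a))*")

accepted = ["aaa", "aabb", "ba", "ab", "baba", ""]
rejected = ["a", "b", "aab", "baa", "aba", "bab"]

for s in accepted + rejected:
    verdict = "accepted" if PATTERN.fullmatch(s) else "rejected"
    print(f"{s!r}: {verdict}")
```

Each string in `accepted` should print `accepted` and each string in `rejected` should print `rejected`.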

![ex3](https://i.imgur.com/ZbxveUU.png)
<font color="gray" size="-2">regex representation from: https://regexper.com/#%28ab%7C%28aa%7Cb%29%28ba%29*%28bb%7Ca%29%29*/</font>

# EX4

#####  A finite state transducer (FST) is a type of deterministic finite automaton whose output is a string and not just accept or reject. The following are state diagrams of finite state transducers T1 and T2.

![ex4](https://courses.cs.ut.ee/MTAT.03.238/2022_fall/uploads/Main/automata.png)

##### Each transition of an FST is labeled with two symbols, one designating the input symbol for that transition and the other designating the output symbol. The two symbols are written with a slash, /, separating them. In $T_1$, the transition from $q_1$ to $q_2$ has input symbol $2$ and output symbol $1$. Some transitions may have multiple input–output pairs, such as the transition in $T_1$ from $q_1$ to itself. When an FST computes on an input string $w$, it takes the input symbols $w_1 \cdots w_n$ one by one and, starting at the start state, follows the transitions by matching the input labels with the sequence of symbols $w_1 \cdots w_n = w$. Every time it goes along a transition, it outputs the corresponding output symbol. For example, on input 2212011, machine T1 enters the sequence of states $q_1$, $q_2$, $q_2$, $q_2$, $q_2$, $q_1$, $q_1$, $q_1$ and produces output 1111000. On input abbb, $T_2$ outputs 1011. Give the sequence of states entered and the output produced in each of the following parts:

- $T_1$ on input 011
- $T_1$ on input 211
- $T_2$ on input b
- $T_2$ on input bbab

I visualised the sequence of states for each input.

![T1](https://i.imgur.com/nlIF0dF.png)
![T2](https://i.imgur.com/oCaPlF6.png)
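The way an FST processes its input can also be sketched in code. The table below contains only those transitions of $T_1$ that can be read off the worked example in the task text (input 2212011), so it is deliberately incomplete; the names `T1_PARTIAL` and `run_fst` are made up for this illustration.

```python
# (state, input symbol) -> (next state, output symbol)
# Partial table: only the edges of T1 exercised by the worked
# example 2212011 in the task text. The full diagram has more edges.
T1_PARTIAL = {
    ("q1", "2"): ("q2", "1"),
    ("q2", "2"): ("q2", "1"),
    ("q2", "1"): ("q2", "1"),
    ("q2", "0"): ("q1", "0"),
    ("q1", "1"): ("q1", "0"),
}


def run_fst(transitions, start, word):
    """Return the list of states entered and the output string."""
    state = start
    states = [start]
    output = []
    for symbol in word:
        state, out = transitions[(state, symbol)]
        states.append(state)
        output.append(out)
    return states, "".join(output)


states, output = run_fst(T1_PARTIAL, "q1", "2212011")
print(states)  # ['q1', 'q2', 'q2', 'q2', 'q2', 'q1', 'q1', 'q1']
print(output)  # 1111000
```

With these edges alone, input 211 gives the states q1, q2, q2, q2 and the output 111. Input 011 needs the edge for symbol 0 in state q1, which the worked example does not exercise, so the partial table raises a `KeyError` there; the figure has the full set of transitions.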

# EX5

##### Use the finite state transducer (FST) from EX4. Give the state diagram of an FST with the following behavior. Its input and output alphabets are {0,1}. Its output string is identical to the input string on the even positions but inverted on the odd positions. For example, on input 0000111 it should output 1010010.

A transition from q1 to q2 means the position was odd, so the input symbol is inverted.
A transition from q2 to q1 means the position was even, so the input symbol remains the same.

I visualised the example input.

![ex5](https://i.imgur.com/Fh0QprU.png)
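The two-state transducer described above can be sketched as a transition table. The state names follow the description; the names `TRANSITIONS` and `transduce` are illustrative.

```python
# (state, input symbol) -> (next state, output symbol)
# q1: the next symbol is at an odd position, so it is inverted.
# q2: the next symbol is at an even position, so it is copied.
TRANSITIONS = {
    ("q1", "0"): ("q2", "1"),
    ("q1", "1"): ("q2", "0"),
    ("q2", "0"): ("q1", "0"),
    ("q2", "1"): ("q1", "1"),
}


def transduce(word, start="q1"):
    """Run the transducer on word and return the output string."""
    state = start
    output = []
    for symbol in word:
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output)


print(transduce("0000111"))  # 1010010
```

The printed result matches the example from the task statement.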

# EX6 & EX7

##### The course ends with the project to be completed usually by a small team of 2-3 people. Your task is to come up with one project proposal all by yourself (this is an individual task, you will have time to share) - taking ideas from project ideas file, from previous homeworks, or your imagination + google. E.g. you may attempt to extend some of the past exercises. Note - this is a proposal that you may use to attract other students to join; and this is at the same time a proposal that you do not need to start executing necessarily. So. it's more a planning exercise, not execution exercise.

##### Your project proposal should have:
- Title
- 2-5 sentences of a short description
- Briefly described the motivation and the main challenge of this project
- Division of tasks, the estimated number of work hours per task, and deadline (aim at our poster session!)
- Allocation of the 2-3 people to tasks and hours.
- Recommended: create a GANTT chart to cover the previous two points.
- Description of the envisioned end results that would go to the project report/poster.

##### We will discuss these in the practice session.


###### Visualizing linear programming problems

Linear programming problems in the Design and Analysis of Algorithms course are quite difficult to comprehend. Step-by-step visualizations of how the algorithms work can make the difficult concepts easier to grasp. For example, it would be interesting to see how each step of the simplex method narrows down the possible solutions in the feasible region, and how the derived dual problem visually resembles the primal problem. Visualizations could also be made for linear programming solutions of graph problems, such as maximum flow, matching and vertex cover.

The end result would be a tool that takes a linear programming problem as input, solves it, and provides step-by-step visualizations of the solution.
<br>
<br>
The main difficulty lies in thoroughly understanding the topics covered in Design and Analysis of Algorithms.

Tasks could easily be divided between 2-3 people if everyone was assigned a different type of visualization.

This project can certainly fill the required number of work hours; the total depends on how many and which types of linear programming problems are implemented.
