# Homework 1

The maximum score for this homework is 100 points, plus 20 bonus points. Grades are assigned according to this table:

| Grade | Score range |
| --- | --- |
| 5 | 85+ |
| 4 | 70-84 |
| 3 | 55-69 |
| 2 | 40-54 |
| 1 | 0-39 |

Most exercises include tests that should pass if your solution is correct.
However, passing tests do not guarantee that your solution is correct.
You are free to add more tests.


# Exercise 1, small exercises (30 points)

## 1.1 Groupby function (10 points)

Write a function that takes a sequence and a callable as parameters. The function should call the callable on every element of the sequence and group the elements by return value. It should return a dictionary whose keys are the return values of the callable and whose values are lists of the sequence elements for which the callable returned that value.

In [15]:
def group_by_retval(sequence, grouper_func):
    ret = {}
    for elem in sequence:
        r = grouper_func(elem)
        try:
            ret[r].append(elem)
        except KeyError:
            ret[r] = [elem]
    return ret
        

l = ["ab", 12, "cd", "d", 3]

assert(group_by_retval(l, lambda x: isinstance(x, str)) == {True: ["ab", "cd", "d"], False: [12, 3]})
assert(group_by_retval([1, 1, 2, 3, 4], lambda x: x % 3) == {0: [3], 1: [1, 1, 4], 2: [2]})
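
The same grouping can also be sketched with `collections.defaultdict`, which removes the need for the `try`/`except KeyError` branch. The name `group_by_retval_sketch` is made up for this illustration; it is one possible variant, not the required solution.

```python
from collections import defaultdict


def group_by_retval_sketch(sequence, grouper_func):
    """Group the elements of sequence by the value grouper_func returns for them."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for elem in sequence:
        groups[grouper_func(elem)].append(elem)
    return dict(groups)  # convert back to a plain dict for the caller


grouped = group_by_retval_sketch([1, 1, 2, 3, 4], lambda x: x % 3)
```

Elements keep their original order inside each group because the sequence is traversed only once, from left to right.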

## 1.2 Replace rare words (10 points)

Write a function that takes a text and a number $N$ as parameters and replaces every word that is not among the $N$ most common words in the text with a common symbol. The symbol is `__RARE__` by default, but it can be overridden.

In [4]:
def wordify(txt):
    return txt.split(' ')  # TODO punctuation not handled


def most_common_words(words, N):
    order = {}
    for word in words:
        occ = order.get(word, 0)
        order[word] = occ + 1
    decr_comm = [o[0] for o in sorted(list(order.items()), key=lambda t: t[1], reverse=True)]
    return decr_comm[:N]


assert(most_common_words(wordify('a aa aa aa b b'), 2) == ['aa', 'b'])

In [5]:
def replace_rare_words(txt, N, rare_symbol="__RARE__"):
    words = wordify(txt)
    commons = most_common_words(words, N)
    return ' '.join([w if w in commons else rare_symbol for w in words])
    

assert(replace_rare_words("a b a b b c", 2) == "a b a b b __RARE__")
assert(replace_rare_words("a b a b b c", 2, rare_symbol="rare") == "a b a b b rare")
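
As an alternative, the counting step can be sketched with `collections.Counter`, whose `most_common(n)` method returns the `n` most frequent items. The function name `replace_rare_words_sketch` is hypothetical. Note that it splits on any whitespace with `str.split()`, which is a slightly different tokenization from `wordify` above.

```python
from collections import Counter


def replace_rare_words_sketch(txt, n, rare_symbol="__RARE__"):
    """Replace every word outside the n most common ones with rare_symbol."""
    words = txt.split()  # assumption: words are separated by whitespace only
    # a set gives constant-time membership tests in the loop below
    keep = {word for word, _ in Counter(words).most_common(n)}
    return " ".join(w if w in keep else rare_symbol for w in words)


replaced = replace_rare_words_sketch("a b a b b c", 2)
```

When several words are tied at the cut-off, which of them counts as common depends on the order in which they first appear in the text.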

## 1.3 Levenshtein distance (10 points)

Write a function that returns the Levenshtein distance between two strings.

https://en.wikipedia.org/wiki/Levenshtein_distance
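
For reference, the usual dynamic-programming recurrence is sketched below. The notation is an assumption made for this note: $\text{lev}(i, j)$ is the distance between the first $i$ characters of $s$ and the first $j$ characters of $t$, and $[s_i \neq t_j]$ is 1 if the two characters differ and 0 otherwise.

$$
\text{lev}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min \left\{ \text{lev}(i-1, j) + 1,\ \text{lev}(i, j-1) + 1,\ \text{lev}(i-1, j-1) + [s_i \neq t_j] \right\} & \text{otherwise.}
\end{cases}
$$

The three terms of the minimum correspond to a deletion, an insertion, and a substitution (or a match). The distance between the full strings is $\text{lev}(|s|, |t|)$.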

In [65]:
def levenshtein(s, t):
    """
        implementation based on:
        https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm
    """
    m, n = len(s), len(t)
    if m == 0:
        return n
    if n == 0:
        return m
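    # init levenshtein matrix: first column is 0..m, first row is 0..n
    lev = [[0] * (n+1) for _ in range(m+1)]
    for i in range(m+1):
        lev[i][0] = i  # deleting the first i characters of s
    for j in range(n+1):
        lev[0][j] = j  # inserting the first j characters of t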
    # fill matrix
    for i in range(1, m+1):
        for j in range(1, n+1):
            cost = int((s[i-1] != t[j-1]))
            lev[i][j] = min(
                lev[i-1][j] + 1,
                lev[i][j-1] + 1,
                lev[i-1][j-1] + cost,
            )
    return lev[m][n]


assert(levenshtein("abc", "ab") == 1)
assert(levenshtein("abc", "abc") == 0)
assert(levenshtein("abc", "ab c") == 1)
assert(levenshtein("", "abc") == 3)

In [66]:
import numpy as np


def levenshtein_np(s, t):
    """
        implementation based on:
        https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm
    """
    m, n = len(s), len(t)
    if m == 0:
        return n
    if n == 0:
        return m
    # init levenshtein matrix
    lev = np.empty([m+1, n+1], dtype=np.int32)
    lev[0, :] = np.arange(n+1)
    lev[:,0] = np.arange(m+1)
    # fill matrix
    for i in range(1, m+1):
        for j in range(1, n+1):
            cost = int((s[i-1] != t[j-1]))
            lev[i, j] = min(
                lev[i-1, j] + 1,
                lev[i, j-1] + 1,
                lev[i-1, j-1] + cost,
            )
    return lev[m, n]


assert(levenshtein_np("abc", "ab") == 1)
assert(levenshtein_np("abc", "abc") == 0)
assert(levenshtein_np("abc", "ab c") == 1)
assert(levenshtein_np("", "abc") == 3)

# Exercise 2, Mutable string (40 points)

Python strings are immutable. Create a mutable string class.
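
A small illustration of what immutability means in practice: item assignment on a built-in `str` raises a `TypeError`, so the usual workaround is to go through a list of characters. The variable names below are chosen only for this example.

```python
s = "abc"

try:
    s[1] = "d"  # built-in strings do not support item assignment
except TypeError as err:
    error_message = str(err)

# Workaround: edit a list of characters, then join it back into a new string
chars = list(s)
chars[1] = "d"
modified = "".join(chars)
```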

Implement the following features (see the tests below).

- initialization from `str`,
- assignment (i.e. modifying a character),
  - if the index is out of range, it should fill the blanks with spaces (see the tests below),
- conversion to built-in `str` and `list` (the latter is a list of the characters),
- addition with other `MutableString` instances and built-in strings,
- multiplication with integers (multiplying a string by 3 means repeating the string 3 times),
- built-in `len` function,
- comparison with strings,
- iteration.

In [105]:
class MutableString(object):
    def __init__(self, s):
        if not isinstance(s, str):
            raise TypeError("str is expected")  # TODO human err msg
        self._s = s
    
    def __repr__(self):
        return '"' + self._s + '"'
    
    def __getitem__(self, i):
        return self._s[i]
    
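    def __setitem__(self, i, c):
        l = len(self._s)
        if i < 0:
            i += l  # negative indices count from the end, as in str
            if i < 0:
                raise IndexError("MutableString index out of range")
        if i > l:
            # pad with spaces up to the target index
            self._s = self._s + " " * (i-l)
        self._s = self._s[:i] + c + self._s[i+1:]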
    
    def __str__(self):
        return self._s
    
    def __iter__(self):
        for c in self._s:
            yield c
            
    def __add__(self, other):
        if isinstance(other, str):
            return MutableString(self._s + other)
        if isinstance(other, MutableString):
            return MutableString(self._s + other._s)
        raise TypeError("addition supported only for str and MutableString types")
    
    def __radd__(self, other):
        if isinstance(other, str):
            return MutableString(other + self._s)
        raise TypeError("addition supported only for str and MutableString types")
    
    def __mul__(self, other):
        if isinstance(other, int):
            return MutableString(self._s * other)
        raise TypeError("multiplication supported only for integers")
    
    def __len__(self):
        return len(self._s)
    
    def __eq__(self, other):
        if isinstance(other, str):
            return self._s == other
        if isinstance(other, MutableString):
            return self._s == other._s
        # unsupported types: let Python fall back to its default comparison
        return NotImplemented

In [107]:
m1 = MutableString("abc")
m1[1] = "d"
assert(m1[1] == "d")
m1[1] = "b"
m1[4] = "d"
assert(m1[3] == " " and m1[4] == "d" and len(m1) == 5)

assert(list(m1) == list("abc d"))
assert(str(m1) == "abc d")

m1 = MutableString("abc")
m2 = m1 + "def"
assert(isinstance(m2, MutableString))
assert(m2 == "abcdef")

m3 = m1 * 3
assert(isinstance(m3, MutableString) and m3 == "abcabcabc")

m2[0] = "A"  # modifying m2 should not change m1
assert(m1 == "abc")

# right addition with strings
m1 = MutableString("abc")
m2 = "def" + m1
assert(m2 == "defabc")

# Exercise 3 - Text generation (30+20 points)

## 3.1 (Same as a laboratory exercise) Write a function that computes N-gram frequencies in a string. (0 points)

In [None]:
# TODO

assert(count_ngram_freqs("abcc", 1) == {"a": 1, "b": 1, "c": 2})
assert(count_ngram_freqs("abccab", 2) == {"ab": 2, "bc": 1, "cc": 1, "ca": 1})
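
One possible way to count N-grams is sketched below with `collections.Counter` and a sliding window over the string. The name `ngram_freqs_sketch` is hypothetical and the laboratory solution may look different.

```python
from collections import Counter


def ngram_freqs_sketch(text, n):
    """Count every contiguous substring of length n in text."""
    # there are len(text) - n + 1 windows; the range is empty if text is too short
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    return dict(Counter(windows))


bigram_freqs = ngram_freqs_sketch("abccab", 2)
```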

## 3.2 Define a text generator function. (25 points)

The function takes 4 arguments:

1. starting text (at least $N-1$ characters long),
2. target length: length of the output string,
3. n-gram frequency dictionary,
4. N, length of the n-grams.

The function generates one character at a time given the last $N-1$ characters.
The probability of `c` being generated after `ab` is defined as:

$$
P(c | a b ) = \frac{\text{freq}(a b c)}{\text{freq}(a b)},
$$

where $\text{freq}(a b c)$ is obtained by counting how many times `abc` occurs in the training corpus (`count_ngram_freqs` function).

If the generated text ends with an $(N-1)$-gram that does not occur in the training data, generate the next character from the full distribution.
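
The conditional probability above can be illustrated on the toy corpus `abcabcda`. This is a rough sketch, not the full generator: the helper `next_char_distribution` is hypothetical, the trigram counts are written out by hand, and $\text{freq}(a b)$ is assumed to be the sum of the counts of all trigrams that start with `ab`.

```python
import random

# Trigram counts of the toy corpus "abcabcda", written out by hand
toy_trigram_freqs = {"abc": 2, "bca": 1, "cab": 1, "bcd": 1, "cda": 1}


def next_char_distribution(context, ngram_freqs):
    """Return P(c | context) for every character c seen after context."""
    # keep only the n-grams whose first N-1 characters equal the context
    followers = {gram[-1]: freq for gram, freq in ngram_freqs.items()
                 if gram[:-1] == context}
    total = sum(followers.values())  # plays the role of freq(context)
    return {c: freq / total for c, freq in followers.items()}


dist_after_bc = next_char_distribution("bc", toy_trigram_freqs)
dist_after_ab = next_char_distribution("ab", toy_trigram_freqs)

# Draw one character according to the distribution
rng = random.Random(0)
sampled = rng.choices(list(dist_after_bc), weights=list(dist_after_bc.values()), k=1)[0]
```

Here `bc` is followed once by `a` and once by `d`, so both get probability 0.5, while `ab` is always followed by `c`. An unseen context yields an empty dictionary, which is the case where the generator has to fall back to the full distribution.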

In [None]:
# TODO

toy_freqs = count_ngram_freqs("abcabcda", 3)
gen = generate_text("abc", 5, toy_freqs, 3)

assert(len(gen) == 5)
assert(set(gen) <= set("abcd"))

## 3.3 Test your solution on a small Wikipedia corpus. (5 points)

Collect a sample of at least 1 million characters from Wikipedia using the `wikipedia` module.
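
A rough sketch of how such a sample could be collected is shown below. The helper names `collect_corpus` and `fetch_random_wikipedia_text` are made up for this illustration. The sketch assumes that the third-party `wikipedia` package is installed and that network access is available; the package is imported lazily so that the collection loop can be tried with any other text source.

```python
def collect_corpus(fetch_text, min_chars=1_000_000, max_attempts=100_000):
    """Call fetch_text() repeatedly until at least min_chars characters are collected."""
    parts, total = [], 0
    for _ in range(max_attempts):  # upper bound so the loop always terminates
        if total >= min_chars:
            break
        text = fetch_text()
        if text:
            parts.append(text)
            total += len(text)
    return "\n".join(parts)


def fetch_random_wikipedia_text():
    """Return the plain text of one random article, or "" if it cannot be loaded."""
    import wikipedia  # third-party package, imported only when this is called
    try:
        return wikipedia.page(wikipedia.random()).content
    except wikipedia.exceptions.WikipediaException:
        return ""  # e.g. disambiguation or missing pages are skipped


# Intended usage (needs network access):
# corpus = collect_corpus(fetch_random_wikipedia_text)
```

Network failures raised by the underlying HTTP library are not handled here, and the same article may be fetched more than once.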

## \*3.4 Smoothing (20 points)

Implement one or more smoothing methods such as Jelinek-Mercer smoothing or Katz's backoff.

https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
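
To give a feel for the idea, here is a minimal sketch of Jelinek-Mercer interpolation for character bigrams: the bigram estimate is mixed with the unigram estimate using a fixed weight. The function name `make_jm_bigram_model` and the default `lam=0.7` are assumptions made for this example; in practice the weight would be tuned on held-out data, and the exercise asks for general $N$.

```python
from collections import Counter


def make_jm_bigram_model(text, lam=0.7):
    """Return prob(c, prev): interpolated estimate of P(c | prev)."""
    if not text:
        raise ValueError("training text must be non-empty")
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    contexts = Counter(text[:-1])  # how often each character has a successor
    total = len(text)

    def prob(c, prev):
        p_uni = unigrams[c] / total
        if contexts[prev] == 0:
            return p_uni  # unseen context: rely on the unigram estimate alone
        p_bi = bigrams[prev + c] / contexts[prev]
        return lam * p_bi + (1 - lam) * p_uni

    return prob


prob = make_jm_bigram_model("abab")
p_b_after_a = prob("b", "a")
```

With the training text `abab`, `b` always follows `a`, so the bigram estimate is 1 and the unigram estimate is 0.5. The interpolated value is therefore about 0.85, and a bigram that never occurred still receives a non-zero probability.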