# Exercise 17: Bag of words

## Beginner

Consider the table above that contains the bag-of-words representation of the This Little Piggy nursery rhyme. Compare the lines word by word and add up the absolute differences in the counts of each word. For example, the difference (distance) between the first two lines is

|1 - 1| + |0 - 1| + |0 - 0| + |1 - 0| + ... + |1 - 1| = 0 + 1 + 0 + 1 + ... + 0 = 5

where |·| denotes the absolute value, so for example |0 - 1| = 1.
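The sum above can be checked with a few lines of Python. This is only a sketch: the two lists are copied from the first two rows of the data list used in the Intermediate part of this exercise, on the assumption that they match the first two lines of the table. The names line1, line2, differences and total are made up for illustration.

```python
# First two rows of the bag-of-words data (assumed to match the table)
line1 = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
line2 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Absolute difference for each word (column), then the total
differences = [abs(x - y) for x, y in zip(line1, line2)]
total = sum(differences)

print(differences)
print(total)  # 5
```

Five words occur in exactly one of the two lines, so five of the terms are 1 and the rest are 0.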

Which two lines are the most similar to each other?

Answer: the 3rd and the 4th lines.

## Intermediate

Your task is to write a function that calculates the distance (or difference) between a pair of lines in the This Little Piggy rhyme.

Every row in the list data represents one line in the rhyme.

When you run the code, you see that the output of the whole program is a list of lists. When your function works correctly, each list will contain the distances between a single row and all the other rows in data.

Note that the program will also compare every row with itself. In that case the two compared rows are identical, so their distance is zero.

You can use the function abs(x-y) to calculate the distance between numbers x and y, where x comes from list row1 and y comes from row2.

Your program must work with any text, not only with the rhyme This Little Piggy.
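To illustrate why the function should not depend on the rhyme, here is a rough sketch of how count rows could be produced for an arbitrary text. The helper names bag_of_words and manhattan, the example sentences, and the simple tokenization (lowercasing and splitting on whitespace) are assumptions made for this illustration and are not part of the exercise.

```python
from collections import Counter


def bag_of_words(lines):
    """Hypothetical helper: turn text lines into rows of word counts."""
    tokenized = [line.lower().split() for line in lines]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


def manhattan(row1, row2):
    """Sum of absolute differences, using abs(x - y) as in the hint."""
    return sum(abs(x - y) for x, y in zip(row1, row2))


lines = ["the cat sat", "the cat ran", "a dog ran and ran"]
vocabulary, rows = bag_of_words(lines)
print(vocabulary)
print(rows)
print(manhattan(rows[0], rows[1]))  # 2
```

Rows built this way have one column per word in the vocabulary, so a distance function that only assumes two equal-length lists of numbers works on them unchanged.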

import numpy as np

# this data here is the bag of words representation of This Little Piggy
data = [[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
        [1, 1, 1, 0, 1, 3, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]]

def distance(row1, row2):
    # returns the sum of the absolute differences between the occurrences
    # of each word in row1 and row2 (the Manhattan distance).
    # row1 and row2 are assumed to be lists of equal length, containing numeric values.

    row1 = np.asarray(row1)
    row2 = np.asarray(row2)

    # .item() converts the NumPy scalar to a plain Python number,
    # so the printed lists look the same across NumPy versions
    return np.abs(row1 - row2).sum().item()


def all_pairs(data):
    # this calls the distance function for all the two-row combinations in the data
    # you do not need to change this
    dist = [[distance(sent1, sent2) for sent1 in data] for sent2 in data]
    print(dist)

all_pairs(data)


[[0, 5, 6, 5, 12], [5, 0, 5, 4, 9], [6, 5, 0, 3, 12], [5, 4, 3, 0, 11], [12, 9, 12, 11, 0]]


## Advanced

Your task is to write a program that calculates the distances (or differences) between every pair of lines in the This Little Piggy rhyme and finds the most similar pair. Use the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (also called Taxicab distance) as your distance metric.

You can start by building a numpy array to store all the distances. Notice that the diagonal elements of the array (the elements at positions [i, j] with i = j) will be equal to zero, because the program also compares every row with itself. To avoid selecting those elements as the minimum, you can assign them the value np.inf (positive infinity, which is larger than any finite floating point value). For this to work, the array must have a floating point type, since an integer array cannot store infinity. Do not use np.float to set the type of the array: it was only an alias for Python's built-in float, was deprecated in NumPy 1.20, and was removed in NumPy 1.24. Use the built-in type float instead.
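As a small sketch of the masking step, the following uses a made-up 3×3 distance matrix (not the rhyme data) and np.fill_diagonal, which is one possible way to set the diagonal; assigning the elements in a loop works just as well.

```python
import numpy as np

# Made-up symmetric distance matrix, stored as float so it can hold infinity
dist = np.array([[0, 2, 7],
                 [2, 0, 4],
                 [7, 4, 0]], dtype=float)

# Overwrite the zeros on the diagonal in place
np.fill_diagonal(dist, np.inf)

print(dist)
print(dist.min())  # 2.0, the smallest distance between two different rows
```

Without the masking, the minimum would be one of the zeros on the diagonal, which only says that a row is identical to itself.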

A quick way to get the index of the element with the lowest value in a 2D array (or in fact, an array of any dimension) is the expression

np.unravel_index(np.argmin(dist), dist.shape)

where dist is the 2D array. This will return the index as a tuple of length two (row, column). If you're curious, here's an [intuitive explanation](https://stackoverflow.com/q/48135736) of the function, and here's its [documentation](https://numpy.org/doc/stable/reference/generated/numpy.unravel_index.html).
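The two steps can be looked at separately. In this sketch the matrix values are invented for illustration: np.argmin returns a position in the flattened array, and np.unravel_index converts that position back into a row and a column.

```python
import numpy as np

# Invented 3x3 matrix with the diagonal already masked
dist = np.array([[np.inf, 2.0, 7.0],
                 [2.0, np.inf, 1.5],
                 [7.0, 1.5, np.inf]])

# Position of the minimum in the flattened array
# [inf, 2.0, 7.0, 2.0, inf, 1.5, 7.0, 1.5, inf]
flat_index = np.argmin(dist)

# Convert the flat position back to (row, column)
row, col = np.unravel_index(flat_index, dist.shape)

print(flat_index)            # 5
print((int(row), int(col)))  # (1, 2)
```

The value 1.5 also appears at position (2, 1) because the matrix is symmetric; np.argmin reports the first occurrence.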

import numpy as np

data = [[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
        [1, 1, 1, 0, 1, 3, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]]


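def distance(row1, row2):
    # Manhattan distance: sum of the absolute differences between the two rows

    row1 = np.asarray(row1)
    row2 = np.asarray(row2)

    # .item() converts the NumPy scalar to a plain Python number
    return np.abs(row1 - row2).sum().item()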

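def find_nearest_pair(data):

    dist = [[distance(sent1, sent2) for sent1 in data] for sent2 in data]

    # print(dist)

    # a row always has distance zero to itself, so mask only the diagonal;
    # testing for dist[i][j] == 0 would also hide identical lines
    # that appear at different positions in the text
    for i in range(len(dist)):
        dist[i][i] = float("inf")

    # print(dist)

    dist = np.asarray(dist, dtype=float)

    # print(dist)

    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    print((int(i), int(j)))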

find_nearest_pair(data)


With the three commented-out print(dist) calls enabled, the program prints:

[[0, 5, 6, 5, 12], [5, 0, 5, 4, 9], [6, 5, 0, 3, 12], [5, 4, 3, 0, 11], [12, 9, 12, 11, 0]]
[[inf, 5, 6, 5, 12], [5, inf, 5, 4, 9], [6, 5, inf, 3, 12], [5, 4, 3, inf, 11], [12, 9, 12, 11, inf]]
[[inf  5.  6.  5. 12.]
 [ 5. inf  5.  4.  9.]
 [ 6.  5. inf  3. 12.]
 [ 5.  4.  3. inf 11.]
 [12.  9. 12. 11. inf]]
(2, 3)
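As an alternative to building the matrix with nested list comprehensions, all pairwise distances can be computed at once with NumPy broadcasting. The following is a sketch under that assumption, not the reference solution; nearest_pair_vectorized is a hypothetical name, and the function returns the pair instead of printing it.

```python
import numpy as np


def nearest_pair_vectorized(data):
    """Hypothetical variant: indices of the two closest distinct rows."""
    rows = np.asarray(data, dtype=float)
    # Shape (n, n, words): absolute difference per word for every pair of rows
    diffs = np.abs(rows[:, None, :] - rows[None, :, :])
    # Sum over the word axis to get the n x n Manhattan distance matrix
    dist = diffs.sum(axis=2)
    # A row compared with itself must not be chosen as the nearest pair
    np.fill_diagonal(dist, np.inf)
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return int(i), int(j)


data = [[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
        [1, 1, 1, 0, 1, 3, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]]

print(nearest_pair_vectorized(data))  # (2, 3)
```

The intermediate array has one entry per pair of rows and per word, so this trades memory for speed; for very large texts the loop-based version may be the safer choice.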
