### Parts-of-Speech Tagging - Working with tags and NumPy

In [59]:
import math
import numpy as np
import pandas as pd

#### Some information on tags

In [47]:
# Define tags for adverb, noun and to (the preposition), respectively
tags = ['RB', 'NN', 'TO']

One of the dictionaries used in part-of-speech tagging is `transition_counts`, which counts how many times one tag immediately follows another. Its keys have the form `(previous_tag, tag)` and its values are the number of occurrences.

Another is `emission_counts`, which counts the number of times a particular `(tag, word)` pair appears in the training dataset.

In general, think of `transition` when working with tags only, and of `emission` when working with tags and words.

In this notebook you will look at the first one, `transition_counts`:

In [48]:
# Define 'transition_counts' dictionary
transition_counts = {
    ('NN', 'NN'): 16241,
    ('RB', 'RB'): 2263,
    ('TO', 'TO'): 2,
    ('NN', 'TO'): 5256,
    ('RB', 'TO'): 855,
    ('TO', 'NN'): 734,
    ('NN', 'RB'): 2431,
    ('RB', 'NN'): 358,
    ('TO', 'RB'): 200
}
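
Counts like these are typically gathered from a tagged training corpus. The following is a minimal sketch of that counting step, assuming a tiny hypothetical corpus (`tagged_corpus`) that is not the dataset behind the numbers above. The names `toy_transition_counts` and `toy_emission_counts` are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs, for illustration only
tagged_corpus = [
    ("dog", "NN"), ("quickly", "RB"), ("to", "TO"),
    ("park", "NN"), ("to", "TO"), ("dog", "NN"),
]

# Extract the sequence of tags in corpus order
tag_sequence = [tag for _, tag in tagged_corpus]

# Count each (previous_tag, tag) pair of adjacent tags
toy_transition_counts = Counter(zip(tag_sequence, tag_sequence[1:]))

# Count each (tag, word) pair
toy_emission_counts = Counter((tag, word) for word, tag in tagged_corpus)

print(dict(toy_transition_counts))
print(dict(toy_emission_counts))
```

Transition keys involve tags only, while emission keys pair a tag with a word, which mirrors the distinction described above.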

#### Using NumPy for matrix creation

In [49]:
# Store the number of tags in the 'num_tags' variable
num_tags = len(tags)

# Initialize a 3x3 numpy array with zeros
transition_matrix = np.zeros((num_tags, num_tags))

transition_matrix

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [50]:
transition_matrix.shape

(3, 3)

In [51]:
# Create a sorted version of the tags list
sorted_tags = sorted(tags)
sorted_tags

['NN', 'RB', 'TO']

In [52]:
# Loop rows
for i in range(num_tags):
    # Loop columns
    for j in range(num_tags):
        # Define tag pair
        tag_tuple = (sorted_tags[i], sorted_tags[j])
        # Get frequency from transition_counts dict (0 if the pair is missing) and assign to (i, j) position in the matrix
        transition_matrix[i, j] = transition_counts.get(tag_tuple, 0)

# Print matrix
transition_matrix

array([[1.6241e+04, 2.4310e+03, 5.2560e+03],
       [3.5800e+02, 2.2630e+03, 8.5500e+02],
       [7.3400e+02, 2.0000e+02, 2.0000e+00]])
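
The double loop above visits every possible tag pair. When the counts dictionary does not contain every pair, another option is to loop over the dictionary itself and leave the remaining cells at zero. This is only a sketch: `partial_counts` is a hypothetical subset of the counts above, and `tag_to_index` is an illustrative helper mapping, not something defined elsewhere in this notebook.

```python
import numpy as np

tags = ['RB', 'NN', 'TO']
sorted_tags = sorted(tags)

# Hypothetical subset of the counts; ('TO', 'TO') and others are left out on purpose
partial_counts = {('NN', 'NN'): 16241, ('NN', 'TO'): 5256, ('TO', 'NN'): 734}

# Illustrative helper: map each tag to its row/column index
tag_to_index = {tag: idx for idx, tag in enumerate(sorted_tags)}

matrix = np.zeros((len(sorted_tags), len(sorted_tags)))

# Fill only the cells whose pair is present in the dictionary
for (prev_tag, tag), count in partial_counts.items():
    matrix[tag_to_index[prev_tag], tag_to_index[tag]] = count

# Pairs that never occurred keep their initial value of zero
print(matrix[tag_to_index['TO'], tag_to_index['TO']])
print(matrix[tag_to_index['NN'], tag_to_index['TO']])
```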

In [53]:
def print_matrix(matrix):
    print(pd.DataFrame(matrix, index=sorted_tags, columns=sorted_tags))

In [55]:
print_matrix(transition_matrix)

         NN      RB      TO
NN  16241.0  2431.0  5256.0
RB    358.0  2263.0   855.0
TO    734.0   200.0     2.0


In [56]:
# Compute sum of row for each row
rows_sum = transition_matrix.sum(axis=1, keepdims=True)
rows_sum

array([[23928.],
       [ 3476.],
       [  936.]])

In [57]:
# Normalize transition matrix
transition_matrix = transition_matrix / rows_sum

# Print normalized matrix
print_matrix(transition_matrix)

          NN        RB        TO
NN  0.678745  0.101596  0.219659
RB  0.102992  0.651036  0.245972
TO  0.784188  0.213675  0.002137


In [58]:
transition_matrix.sum(axis=1, keepdims=True)

array([[1.],
       [1.],
       [1.]])
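
The `keepdims=True` argument is what makes the row normalization work. A short sketch of the difference, reusing the counts from this notebook; the names `by_row` and `by_col` are illustrative:

```python
import numpy as np

counts = np.array([[16241., 2431., 5256.],
                   [  358., 2263.,  855.],
                   [  734.,  200.,    2.]])

# With keepdims=True the sums have shape (3, 1),
# so each row is divided by its own total
by_row = counts / counts.sum(axis=1, keepdims=True)

# Without keepdims the sums have shape (3,), which broadcasts along columns:
# column j ends up divided by the total of row j
by_col = counts / counts.sum(axis=1)

print(by_row.sum(axis=1))  # every row sums to 1 (up to rounding)
print(by_col.sum(axis=1))  # rows no longer sum to 1
```

Both divisions run without error because the shapes are broadcast-compatible, so a check like the row sum above is a useful safeguard.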

In [60]:
# Copy transition matrix for for-loop example
t_matrix_for = np.copy(transition_matrix)

# Copy transition matrix for numpy functions example
t_matrix_np = np.copy(transition_matrix)

##### Using a for-loop

In [61]:
# Loop values in the diagonal
for i in range(num_tags):
    # Index the (3, 1) array down to a scalar before passing it to math.log
    t_matrix_for[i, i] = t_matrix_for[i, i] + math.log(rows_sum[i, 0])

print_matrix(t_matrix_for)

           NN        RB        TO
NN  10.761549  0.101596  0.219659
RB   0.102992  8.804673  0.245972
TO   0.784188  0.213675  6.843752


##### Using vectorization

In [62]:
# Save diagonal in a numpy array
d = np.diag(t_matrix_np)
d.shape

(3,)

In [64]:
# Reshape diagonal numpy array
d = np.reshape(d, (num_tags, 1))
d.shape

(3, 1)

Now that the diagonal has the same `(3, 1)` shape as `rows_sum`, you can do the vectorized operation by applying the `math.log()` function to the `rows_sum` array and adding the result to the diagonal.

To apply a function to each element of a NumPy array, use NumPy's `vectorize()` function, passing the desired function as an argument. It returns a vectorized version of that function, which accepts a NumPy array as its argument.

To update the original matrix you can use NumPy's `fill_diagonal()` function.

In [67]:
# Perform the vectorized operation
d = d + np.vectorize(math.log)(rows_sum)

# Use numpy's 'fill_diagonal' function to update the diagonal
np.fill_diagonal(t_matrix_np, d)

print_matrix(t_matrix_np)

           NN        RB        TO
NN  10.761549  0.101596  0.219659
RB   0.102992  8.804673  0.245972
TO   0.784188  0.213675  6.843752
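
NumPy also ships its own element-wise logarithm, `np.log`, which accepts arrays directly. The NumPy documentation describes `np.vectorize` as a convenience that is essentially a for-loop, so `np.log` is a natural alternative here. A minimal sketch that rebuilds the normalized matrix from the counts above; `via_ufunc` and `via_vectorize` are illustrative names:

```python
import math
import numpy as np

counts = np.array([[16241., 2431., 5256.],
                   [  358., 2263.,  855.],
                   [  734.,  200.,    2.]])
rows_sum = counts.sum(axis=1, keepdims=True)
matrix = counts / rows_sum

# np.log works element-wise on the whole array
via_ufunc = matrix.copy()
np.fill_diagonal(via_ufunc, np.diag(matrix) + np.log(rows_sum.ravel()))

# Same update using math.log wrapped with np.vectorize
via_vectorize = matrix.copy()
np.fill_diagonal(via_vectorize,
                 np.diag(matrix) + np.vectorize(math.log)(rows_sum.ravel()))

print(np.allclose(via_ufunc, via_vectorize))
```

Here `rows_sum.ravel()` flattens the `(3, 1)` sums to shape `(3,)` so they line up with the output of `np.diag`, which avoids the separate reshape step.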


In [68]:
# Check for equality
t_matrix_for == t_matrix_np

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])
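
The `==` operator compares element by element and requires exact equality. Two computations that are mathematically the same can differ by floating-point rounding, so a tolerance-based check is often more robust. A small sketch with made-up arrays `a` and `b` (not taken from this notebook):

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point
a = np.array([[0.1 + 0.2, 1.0], [2.0, 3.0]])
b = np.array([[0.3, 1.0], [2.0, 3.0]])

# Exact comparison, collapsed to a single boolean
print((a == b).all())
print(np.array_equal(a, b))

# Comparison within a small tolerance
print(np.allclose(a, b))
```

`np.array_equal` and `np.allclose` each return a single boolean, which is easier to use in a condition or an assertion than a matrix of booleans.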