# Section 12. Part-of-Speech tagging (Part 1 - LogReg)

# Overview
- A POS tag is a category assigned to a word according to its syntactic function
- Examples: noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, interjection

Let's first explore the data:
- https://www.clips.uantwerpen.be/conll2000/chunking/
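- Save the 'train.txt' and 'test.txt' files (the code below reads them as 'NLP_12_chunking_train.txt' and 'NLP_12_chunking_test.txt')
- Each non-empty line holds a single word, followed by its POS tag and its chunk tag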

In [6]:
with open("NLP_12_chunking_train.txt") as f:
    train = f.readlines()
with open("NLP_12_chunking_test.txt") as f:
    test = f.readlines()
train[0:10]

['Confidence NN B-NP\n',
 'in IN B-PP\n',
 'the DT B-NP\n',
 'pound NN I-NP\n',
 'is VBZ B-VP\n',
 'widely RB I-VP\n',
 'expected VBN I-VP\n',
 'to TO I-VP\n',
 'take VB I-VP\n',
 'another DT B-NP\n']
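As a minimal sketch of how one of these lines breaks down, the snippet below splits a line into its three fields. The helper name `parse_line` is hypothetical (it is not part of the notebook), and the blank-line-between-sentences convention is the one `get_data` relies on further down.

```python
# A few lines in the "word POS-tag chunk-tag" format shown above.
sample_lines = [
    "Confidence NN B-NP\n",
    "in IN B-PP\n",
    "the DT B-NP\n",
    "\n",  # a blank line marks a sentence boundary
]

def parse_line(line):
    """Return (word, pos_tag, chunk_tag), or None for a blank line."""
    parts = line.split()
    if not parts:
        return None
    word, pos_tag, chunk_tag = parts
    return word, pos_tag, chunk_tag

parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

For POS tagging we only need the first two fields; the chunk tag is ignored.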

### Baseline model
- Logistic regression is a good baseline model
- Just one-hot encode the words and tags, and apply softmax (multiclass)
- We should be able to get over 90% accuracy with just this model, **because it is rare for a word to be used with two different meanings (and therefore two different tags)**

### F1 score
- We will also calculate the F1 score (a better measure than straight accuracy for this model)
- F1 score is the harmonic mean of precision and recall
    - F1 = 2 * (precision * recall) / (precision + recall)
    - Precision = TP / (TP + FP)
    - Recall = TP / (TP + FN)
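The formulas above can be checked by hand for a single tag. This is only a sketch: the function name `precision_recall_f1` and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one tag: 8 true positives,
# 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

With many tags, one F1 value is computed per tag and the values are then averaged, which is what the `f1_score(..., average=None).mean()` calls later in this notebook do.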

### Hidden Markov Models
- Hidden states = POS tags, observed = words
- In general, an HMM can be trained without labels using expectation maximization
- Here we don't need expectation maximization, because we actually know the hidden states (the tags) of the training data
- HMMs are defined by 3 distributions (pi, A, B):
    - pi = frequency of start tags (initial state distribution)
    - A = p(tag(t) | tag(t-1)) (state transition matrix)
    - B = p(word(t) | tag(t)) (observation probability matrix)
    - All can be calculated by just counting!!!
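To make "just counting" concrete, here is a minimal sketch on a made-up toy corpus. The data and variable names are illustrative assumptions, not the Part 2 implementation, and no smoothing is applied (a real corpus would need it to avoid zero probabilities and empty rows).

```python
import numpy as np

# Toy tagged sentences: each is a list of (word_index, tag_index) pairs.
sequences = [
    [(0, 0), (1, 1), (2, 0)],
    [(1, 1), (2, 0)],
]
V, K = 3, 2  # vocabulary size, number of tags

pi = np.zeros(K)      # counts of tags that start a sentence
A = np.zeros((K, K))  # A[i, j]: count of tag i followed by tag j
B = np.zeros((K, V))  # B[k, w]: count of word w emitted by tag k

for seq in sequences:
    prev_tag = None
    for word, tag in seq:
        if prev_tag is None:
            pi[tag] += 1
        else:
            A[prev_tag, tag] += 1
        B[tag, word] += 1
        prev_tag = tag

# Normalize the counts into probability distributions.
pi /= pi.sum()
A /= A.sum(axis=1, keepdims=True)
B /= B.sum(axis=1, keepdims=True)

print(pi)
print(A)
print(B)
```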
    
### Path forward
We will try three models to solve POS tagging:
- Baseline: Logistic Regression
- RNNs
- HMMs (Covered in Part 2)

# Model 1: Logistic Regression

- Main problem is that it does not capture sequence information
    - p(tag | word) = softmax(W[word_index] + b)
    - We have an input and a target; we map from input to target with a weight matrix, and run the output through softmax to get a posterior probability distribution over tags
    - So the input is just a single word, and the output is just a part-of-speech tag
    - As before, since the input is just a word index, we don't actually have to do a matrix multiplication: we index the weight matrix with the word index and take the softmax over that row
- Limitations
    - It only maps word to tag, ignoring context
    - "I just drank milk" (noun) is not the same as "I'm going to milk them for all they got!" (verb)
    - Context matters for tags, and logreg doesn't capture context
    - Nonetheless, logreg achieves decent accuracy since most words don't often have double meanings/usages
![](pictures/NLP_12_logreg.png)
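The claim that indexing replaces the matrix multiplication can be verified numerically. The sketch below uses NumPy rather than Theano, with small made-up sizes and random weights; the `softmax` helper is written here for illustration.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
V, K = 5, 3                       # toy vocabulary size and number of tags
W = rng.standard_normal((V, K))   # weight matrix
b = np.zeros(K)                   # bias
word_idx = np.array([2, 0, 4])    # a batch of three word indices

# Explicit one-hot encoding followed by a matrix multiplication.
one_hot = np.zeros((len(word_idx), V))
one_hot[np.arange(len(word_idx)), word_idx] = 1
p_matmul = softmax(one_hot @ W + b)

# Equivalent shortcut: select the rows of W directly.
p_lookup = softmax(W[word_idx] + b)

same = np.allclose(p_matmul, p_lookup)
print("same result:", same)
```

The shortcut matters at this vocabulary size: a one-hot matrix with one row per training word and one column per vocabulary entry would be very large, while the lookup only touches one row of `W` per word.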

# Let's write some code!

# 1. Import packages

In [18]:
import os, sys
import numpy as np
import theano
import theano.tensor as T
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# 2. Create get_data function

In [17]:
def get_data(split_sequences=False, train_data = 'NLP_12_chunking_train.txt', test_data='NLP_12_chunking_test.txt'):
    # fail with an exception instead of exit(), which would stop the notebook kernel
    if not os.path.exists(train_data):
        raise FileNotFoundError(
            "Training data not found: %s. See the download link in the Overview." % train_data)
    if not os.path.exists(test_data):
        raise FileNotFoundError(
            "Test data not found: %s. See the download link in the Overview." % test_data)

    word2idx = {}
    tag2idx = {}
    word_idx = 0
    tag_idx = 0
    Xtrain = []
    Ytrain = []
    currentX = []
    currentY = []
    for line in open(train_data):
        line = line.rstrip() #right strip empty chars
        if line:
            r = line.split()
            word, tag, _ = r
            if word not in word2idx:
                word2idx[word] = word_idx
                word_idx += 1
            currentX.append(word2idx[word])
            
            if tag not in tag2idx:
                tag2idx[tag] = tag_idx
                tag_idx += 1
            currentY.append(tag2idx[tag])
        elif split_sequences:
            Xtrain.append(currentX)
            Ytrain.append(currentY)
            currentX = []
            currentY = []

    if not split_sequences:
        Xtrain = currentX
        Ytrain = currentY

    # load test data, reusing the word and tag indices built from the training data
    Xtest = []
    Ytest = []
    currentX = []
    currentY = []
    for line in open(test_data):
        line = line.rstrip()
        if line:
            r = line.split()
            word, tag, _ = r
            if word in word2idx:
                currentX.append(word2idx[word])
            else:
                currentX.append(word_idx) # use this as unknown
            currentY.append(tag2idx[tag])
        elif split_sequences:
            Xtest.append(currentX)
            Ytest.append(currentY)
            currentX = []
            currentY = []
    if not split_sequences:
        Xtest = currentX
        Ytest = currentY

    return Xtrain, Ytrain, Xtest, Ytest, word2idx
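One detail of `get_data` worth isolating is how test words that never appeared in training are handled: they all share a single extra index, one past the last known word. A minimal sketch with a made-up three-word vocabulary (the `encode` helper is hypothetical):

```python
# Hypothetical tiny vocabulary built from "training" data.
word2idx = {"the": 0, "pound": 1, "is": 2}

# get_data uses the next free index for every unseen word.
unknown_idx = len(word2idx)

def encode(words, word2idx, unknown_idx):
    """Map words to indices, sending unseen words to unknown_idx."""
    return [word2idx.get(w, unknown_idx) for w in words]

encoded = encode(["the", "dollar", "is"], word2idx, unknown_idx)
V = len(word2idx) + 1  # +1 leaves room for the unknown-word index
print(encoded, V)
```

This is why the vocabulary size is later computed as `len(word2idx) + 1`: the weight matrix needs a row for the unknown-word index too.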

In [25]:
X_train, Y_train, _, _, word2idx = get_data()
np.unique(Y_train)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43])

# 3. Create LogisticRegression Class

In [41]:
class LogisticRegression:
    def __init__(self):
        pass

    def fit(self, X, Y, V=None, K=None, D=50, lr=1e-1, mu=0.99, batch_sz=100, epochs=6):
        if V is None:
            V = len(set(X))
        if K is None:
            K = len(set(Y))
        N = len(X)

        W = np.random.randn(V, K) / np.sqrt(V + K)
        b = np.zeros(K)
        self.W = theano.shared(W)
        self.b = theano.shared(b)
        self.params = [self.W, self.b]

        thX = T.ivector('X')
        thY = T.ivector('Y')

        py_x = T.nnet.softmax(self.W[thX] + self.b)
        prediction = T.argmax(py_x, axis=1)

        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value()*0) for p in self.params]
        self.cost_predict_op = theano.function(
            inputs=[thX, thY],
            outputs=[cost, prediction],
            allow_input_downcast=True,
        )

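        # momentum update: each velocity dp becomes mu*dp - lr*g,
        # and each parameter p moves by that new velocity
        updates = [
            (p, p + mu*dp - lr*g) for p, dp, g in zip(self.params, dparams, grads)
        ] + [
            (dp, mu*dp - lr*g) for dp, g in zip(dparams, grads)
        ]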
        
        train_op = theano.function(inputs=[thX, thY], outputs=[cost, prediction], updates=updates, allow_input_downcast=True)

        costs = []
        n_batches = N // batch_sz
        for i in range(epochs):
            X, Y = shuffle(X, Y)
            print("epoch:", i)
            for j in range(n_batches):
                Xbatch = X[j*batch_sz:(j*batch_sz + batch_sz)]
                Ybatch = Y[j*batch_sz:(j*batch_sz + batch_sz)]

                c, p = train_op(Xbatch, Ybatch)
                costs.append(c)
                if j % 200 == 0:
                    print(
                        "i:", i, "j:", j,
                        "n_batches:", n_batches,
                        "cost:", c,
                        "error:", np.mean(p != Ybatch))
        plt.plot(costs)
        plt.show()

    def score(self, X, Y):
        _, p = self.cost_predict_op(X, Y)
        return np.mean(p == Y)

    def f1_score(self, X, Y):
        _, p = self.cost_predict_op(X, Y)
        return f1_score(Y, p, average=None).mean()
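The `updates` list in `fit` implements gradient descent with momentum. As a rough illustration outside Theano, the same rule can be written in NumPy; the function name `momentum_step` and the toy objective `sum(w**2)` are assumptions made for this sketch.

```python
import numpy as np

def momentum_step(p, dp, g, lr=0.1, mu=0.99):
    """One momentum update, mirroring the Theano update pairs in fit().

    p: parameter, dp: velocity, g: gradient of the cost w.r.t. p.
    """
    dp_new = mu * dp - lr * g  # new velocity
    p_new = p + dp_new         # same as p + mu*dp - lr*g
    return p_new, dp_new

# Toy problem: minimize sum(w**2), whose gradient is 2*w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for step in range(50):
    grad = 2 * w
    w, v = momentum_step(w, v, grad, lr=0.1, mu=0.9)

print("w after 50 steps:", w)
```

Both Theano update pairs are evaluated from the old values of `p` and `dp`, which is why the parameter update spells out `mu*dp - lr*g` rather than reading the already-updated velocity.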

# 4. Get data

In [42]:
# Get data
Xtrain, Ytrain, Xtest, Ytest, word2idx = get_data()

# convert to numpy arrays
Xtrain = np.array(Xtrain)
Ytrain = np.array(Ytrain)

# convert Xtrain to indicator matrix
N = len(Xtrain)
V = len(word2idx) + 1
print("vocabulary size:", V)
# Xtrain_indicator = np.zeros((N, V))
# Xtrain_indicator[np.arange(N), Xtrain] = 1

vocabulary size: 19123


# 5. DecisionTree model

In [43]:
# decision tree
dt = DecisionTreeClassifier()

# without indicator
dt.fit(Xtrain.reshape(N, 1), Ytrain)
print("dt train score:", dt.score(Xtrain.reshape(N, 1), Ytrain))
p = dt.predict(Xtrain.reshape(N, 1))
print("dt train f1:", f1_score(Ytrain, p, average=None).mean())

# with indicator -- too slow!!
# dt.fit(Xtrain_indicator, Ytrain)
# print("dt score:", dt.score(Xtrain_indicator, Ytrain))

dt train score: 0.964959594194
dt train f1: 0.907858696936




# 6. LogisticRegression model

In [45]:
# train and score
model = LogisticRegression()
model.fit(Xtrain, Ytrain, V=V)
print("training complete")
print("lr train score:", model.score(Xtrain, Ytrain))
print("lr train f1:", model.f1_score(Xtrain, Ytrain))

epoch: 0
i: 0 j: 0 n_batches: 2117 cost: 3.784513628317909 error: 0.98
i: 0 j: 200 n_batches: 2117 cost: 1.563287085371617 error: 0.39
i: 0 j: 400 n_batches: 2117 cost: 1.2355213409448025 error: 0.34
i: 0 j: 600 n_batches: 2117 cost: 0.9103403825422514 error: 0.27
i: 0 j: 800 n_batches: 2117 cost: 0.8226196776620424 error: 0.23
i: 0 j: 1000 n_batches: 2117 cost: 1.017887443965278 error: 0.25
i: 0 j: 1200 n_batches: 2117 cost: 0.7095214129260953 error: 0.18
i: 0 j: 1400 n_batches: 2117 cost: 0.9485272741272235 error: 0.29
i: 0 j: 1600 n_batches: 2117 cost: 0.6508969645406806 error: 0.17
i: 0 j: 1800 n_batches: 2117 cost: 0.5818016395096105 error: 0.12
i: 0 j: 2000 n_batches: 2117 cost: 0.766307142459805 error: 0.2
epoch: 1
i: 1 j: 0 n_batches: 2117 cost: 0.6006885255174224 error: 0.1
i: 1 j: 200 n_batches: 2117 cost: 0.6764466050867491 error: 0.16
i: 1 j: 400 n_batches: 2117 cost: 0.8058362654594714 error: 0.18
i: 1 j: 600 n_batches: 2117 cost: 0.5783888856058421 error: 0.12
i: 1 j: 800

SystemError: <class 'range'> returned a result with an error set
Apply node that caused the error: Elemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](<TensorType(float64, matrix)>, Elemwise{Mul}[(0, 1)].0, Elemwise{Mul}[(0, 1)].0)
Toposort index: 22
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(19123, 44), (19123, 44), (19123, 44)]
Inputs strides: [(352, 8), (352, 8), (352, 8)]
Inputs values: ['not shown', 'not shown', 'not shown']
Outputs clients: [['output']]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

# 7. Compare baseline models

In [None]:
Ntest = len(Xtest)
Xtest = np.array(Xtest)
Ytest = np.array(Ytest)
# convert Xtest to indicator
# Xtest_indicator = np.zeros((Ntest, V))
# Xtest_indicator[np.arange(Ntest), Xtest] = 1

# decision tree test score
print("dt test score:", dt.score(Xtest.reshape(Ntest, 1), Ytest))
p = dt.predict(Xtest.reshape(Ntest, 1))
print("dt test f1:", f1_score(Ytest, p, average=None).mean())
# print("dt test score:", dt.score(Xtest_indicator, Ytest)) # too slow!

# logistic regression test score
print("lr test score:", model.score(Xtest, Ytest))
print("lr test f1:", model.f1_score(Xtest, Ytest))