Today we're going to use positional vectors to predict the part of speech of a word. A *positional vector* encodes a word's context within a sentence, while the other vectors we've used have encoded specific features regardless of their position. We start, as always, by loading our dependencies.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


This time we're going to work with corpora from the Universal Dependencies project (universaldependencies.org). We've modified the original corpora, but the format is quite similar: each word is a row, with the word form in one column and the part-of-speech tag in another. We also have sentence information, which lets us make sure that we don't encode sequences that cross a sentence boundary.

In [2]:
file = os.path.join(ai.data_dir, "syntax.pos_english.gz")
df = pd.read_csv(file)
print(df)

        Source  Sentence_ID Word_ID    Word    POS
0       en_ewt            1       1    from    ADP
1       en_ewt            1       2     the    DET
2       en_ewt            1       3      ap  PROPN
3       en_ewt            1       4   comes   VERB
4       en_ewt            1       5    this    DET
...        ...          ...     ...     ...    ...
559457  en_pud        32356      21       a    DET
559458  en_pud        32356      22  friend   NOUN
559459  en_pud        32356      23      of    ADP
559460  en_pud        32356      24   peace   NOUN
559461  en_pud        32356      25       .  PUNCT

[559462 rows x 5 columns]


Let's create a dataframe for just one sentence to experiment with.

In [3]:
test_df = df.loc[df.loc[:,"Sentence_ID"]==2]
print(test_df)

    Source  Sentence_ID Word_ID         Word    POS
7   en_ewt            2       1    president  PROPN
8   en_ewt            2       2         bush  PROPN
9   en_ewt            2       3           on    ADP
10  en_ewt            2       4      tuesday  PROPN
11  en_ewt            2       5    nominated   VERB
12  en_ewt            2       6          two    NUM
13  en_ewt            2       7  individuals   NOUN
14  en_ewt            2       8           to   PART
15  en_ewt            2       9      replace   VERB
16  en_ewt            2      10     retiring   VERB
17  en_ewt            2      11      jurists   NOUN
18  en_ewt            2      12           on    ADP
19  en_ewt            2      13      federal    ADJ
20  en_ewt            2      14       courts   NOUN
21  en_ewt            2      15           in    ADP
22  en_ewt            2      16          the    DET
23  en_ewt            2      17   washington  PROPN
24  en_ewt            2      18         area   NOUN
25  en_ewt  

The code below iterates over each word in the sentence and creates a positional vector for it. Each vector contains the word itself, the two words before it, and the two words after it; slots that fall outside the sentence are filled with "#". We then print each vector and save it.

In [4]:
x_vectors = []
y_vector = []

#A list of words and a list of ground-truth labels
words = test_df.loc[:,"Word"].values
tags = test_df.loc[:,"POS"].values
            
#Create a positional vector for each word in the sentence
for i in range(len(words)):
    y_vector.append(tags[i])
    vector = []
                
    #Find the correct context window, filling in slots at the edges
    for j in [-2, -1, 0, 1, 2]:
        if i+j < 0 or i+j > len(words)-1:
            vector.append("#")
        else:
            vector.append(words[i+j])
                        
    #Save the positional vector for this word
    print(vector)
    x_vectors.append(vector)

['#', '#', 'president', 'bush', 'on']
['#', 'president', 'bush', 'on', 'tuesday']
['president', 'bush', 'on', 'tuesday', 'nominated']
['bush', 'on', 'tuesday', 'nominated', 'two']
['on', 'tuesday', 'nominated', 'two', 'individuals']
['tuesday', 'nominated', 'two', 'individuals', 'to']
['nominated', 'two', 'individuals', 'to', 'replace']
['two', 'individuals', 'to', 'replace', 'retiring']
['individuals', 'to', 'replace', 'retiring', 'jurists']
['to', 'replace', 'retiring', 'jurists', 'on']
['replace', 'retiring', 'jurists', 'on', 'federal']
['retiring', 'jurists', 'on', 'federal', 'courts']
['jurists', 'on', 'federal', 'courts', 'in']
['on', 'federal', 'courts', 'in', 'the']
['federal', 'courts', 'in', 'the', 'washington']
['courts', 'in', 'the', 'washington', 'area']
['in', 'the', 'washington', 'area', '.']
['the', 'washington', 'area', '.', '#']
['washington', 'area', '.', '#', '#']
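
The same windowing can be applied to many sentences at once while still respecting sentence boundaries. The sketch below is one way to do it, assuming a dataframe with the `Sentence_ID`, `Word`, and `POS` columns shown above; the function name `context_windows` and the toy data are hypothetical and are not part of the *text_analytics* package.

```python
import pandas as pd

def context_windows(df, size=2, pad="#"):
    """Build one context window per word, never crossing a sentence boundary."""
    x_vectors, y_vector = [], []
    # sort=False keeps the sentences in their original order
    for _, sentence in df.groupby("Sentence_ID", sort=False):
        words = sentence["Word"].tolist()
        tags = sentence["POS"].tolist()
        # Pad both ends once instead of checking the bounds for every slot
        padded = [pad] * size + words + [pad] * size
        for i, tag in enumerate(tags):
            x_vectors.append(padded[i : i + 2 * size + 1])
            y_vector.append(tag)
    return x_vectors, y_vector

# Hypothetical toy data in the same column layout as the corpus
toy_df = pd.DataFrame({
    "Sentence_ID": [1, 1, 1, 2, 2],
    "Word": ["the", "dog", "barks", "birds", "sing"],
    "POS": ["DET", "NOUN", "VERB", "NOUN", "VERB"],
})
toy_x, toy_y = context_windows(toy_df)
for window in toy_x:
    print(window)
```

Because each sentence is padded separately, the window for "birds" starts with two "#" fillers rather than with words from the previous sentence.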


Of course, these sequences are just arrays of strings. We need to convert them into a one-hot encoding in which each combination of position and word has its own column in the vector. The same word therefore gets a different column depending on whether it appears before, at, or after the target position. Here we convert these strings into one-hot encodings and then print out the complete arrays to show how they look. Each row (each new array) represents just one word and its surrounding context.

In [5]:
import numpy as np
np.set_printoptions(threshold=np.inf)
from sklearn.preprocessing import OneHotEncoder

#With all words finished, convert into numpy arrays
x_vectors = np.array(x_vectors)
y_vector = np.array(y_vector)

#Convert into a one-hot encoding
encoder = OneHotEncoder(categories='auto', handle_unknown='ignore')
encoder.fit(x_vectors)
        
#Transform the string windows into a sparse one-hot matrix
x_vectors = encoder.transform(x_vectors)
print(x_vectors.todense())

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0
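
To see where the columns come from, it can help to inspect the fitted encoder. The sketch below uses hypothetical three-slot toy windows (the names `toy_windows` and `toy_encoder` are illustrative only) and relies on scikit-learn's `categories_` attribute, which holds one vocabulary per slot. The width of the matrix is the sum of those vocabulary sizes; for the sentence above, the five slots contribute 17, 18, 18, 18, and 17 columns, which gives the 88 columns in each printed row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy windows: [previous word, current word, next word]
toy_windows = np.array([
    ["#", "the", "dog"],
    ["the", "dog", "barks"],
    ["dog", "barks", "#"],
])

toy_encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
toy_encoder.fit(toy_windows)

# One vocabulary per slot; the total width is the sum of their sizes
for slot, vocabulary in enumerate(toy_encoder.categories_):
    print(slot, list(vocabulary))
n_columns = sum(len(vocabulary) for vocabulary in toy_encoder.categories_)
print(n_columns)

# "cat" was never seen in the middle slot, so that slot encodes as all zeros
unseen = toy_encoder.transform(np.array([["the", "cat", "barks"]])).toarray()
print(unseen)
```

Setting `handle_unknown='ignore'` is what allows the last step to succeed: a word that was absent at fitting time simply contributes no active column for its slot instead of raising an error.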

How well does this work for classifying parts-of-speech? Let's find out!

The line below will use the *text_analytics* package to create a positional one-hot encoding for about half a million words and predict part-of-speech tags using logistic regression.

In [6]:
report = ai.pos_tagger(df, classifier="lm")
print(report)

              precision    recall  f1-score   support

         ADJ       0.89      0.84      0.87      3693
         ADP       0.93      0.97      0.95      5358
         ADV       0.90      0.81      0.85      2627
         AUX       0.95      0.98      0.97      3071
       CCONJ       0.99      0.99      0.99      1753
         DET       0.98      0.97      0.97      4672
        INTJ       0.96      0.72      0.82       174
        NOUN       0.86      0.95      0.90      9923
         NUM       0.95      0.84      0.89       965
        PART       0.95      0.98      0.96      1450
        PRON       0.96      0.96      0.96      4612
       PROPN       0.84      0.75      0.79      3167
       PUNCT       0.99      1.00      1.00      6659
       SCONJ       0.86      0.76      0.80      1050
         SYM       0.88      0.68      0.77       118
        VERB       0.92      0.91      0.92      5988
           X       0.91      0.28      0.42       156
           _       0.97    
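
The internals of `ai.pos_tagger` are not shown in this lab, so the following is only a sketch of the general approach under stated assumptions, not the package's actual implementation. It trains scikit-learn's `LogisticRegression` on one-hot encoded windows and prints the same kind of report as above; all of the toy data and variable names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy windows (one word of context on each side) and their tags
toy_train_x = [
    ["#", "the", "dog"], ["the", "dog", "barks"], ["dog", "barks", "#"],
    ["#", "the", "cat"], ["the", "cat", "sleeps"], ["cat", "sleeps", "#"],
]
toy_train_y = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"]
toy_test_x = [["#", "the", "bird"], ["the", "bird", "barks"], ["bird", "barks", "#"]]
toy_test_y = ["DET", "NOUN", "VERB"]

# Fit the encoder on the training windows only; unseen test words become zeros
toy_onehot = OneHotEncoder(handle_unknown="ignore")
toy_classifier = LogisticRegression(max_iter=1000)
toy_classifier.fit(toy_onehot.fit_transform(toy_train_x), toy_train_y)

# Predict tags for held-out windows and summarize per-tag performance
toy_predictions = toy_classifier.predict(toy_onehot.transform(toy_test_x))
toy_report = classification_report(toy_test_y, toy_predictions, zero_division=0)
print(toy_report)
```

The word "bird" never appears in the training windows, so any correct prediction for it has to come from the surrounding context alone, which is the point of a positional encoding.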

There are always methods we can use to improve our tagging performance. But this lab shows how a simple context window of two words on either side can be used to classify the syntactic properties of words.