<a href="https://colab.research.google.com/github/neochoon/2024_UTS/blob/main/Week3_POSTagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In this exercise, we will
- try an existing part-of-speech (POS) tagger from the Python library `nltk`, and then
- train our own POS tagger.

In [3]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

| Abbreviation | Meaning                                            | Abbreviation | Meaning                                            | Abbreviation | Meaning                                            |
|--------------|----------------------------------------------------|--------------|----------------------------------------------------|--------------|----------------------------------------------------|
| CC           | coordinating conjunction                           | CD           | cardinal number                                    | DT           | determiner                                         |
| EX           | existential there                                  | FW           | foreign word                                       | IN           | preposition/subordinating conjunction              |
| JJ           | adjective (large)                                  | JJR          | adjective, comparative (larger)                    | JJS          | adjective, superlative (largest)                   |
| LS           | list item marker                                   | MD           | modal (could, will)                                | NN           | noun, singular (cat, tree)                         |
| NNS          | noun, plural (desks)                               | NNP          | proper noun, singular (Sarah)                      | NNPS         | proper noun, plural (Indians, Americans)           |
| PDT          | predeterminer (all, both, half)                    | POS          | possessive ending (parent's)                       | PRP          | personal pronoun (hers, herself, him, himself)     |
| PRP$         | possessive pronoun (her, his, mine, my, our)       | RB           | adverb (occasionally, swiftly)                     | RBR          | adverb, comparative (harder)                       |
| RBS          | adverb, superlative (best)                         | RP           | particle (about)                                   | TO           | infinitive marker (to)                             |
| UH           | interjection (goodbye)                             | VB           | verb, base form (ask)                              | VBG          | verb, gerund (judging)                             |
| VBD          | verb, past tense (pleaded)                         | VBN          | verb, past participle (reunified)                  | VBP          | verb, present tense, not 3rd person singular (wrap) |
| VBZ          | verb, present tense, 3rd person singular (bases)   | WDT          | wh-determiner (that, what)                         | WP           | wh-pronoun (who)                                   |
| WRB          | wh-adverb (how)                                    |              |                                                    |              |                                                    |
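
Many of the tags in the table share a prefix (for example `NN`, `NNS`, `NNP`, `NNPS`), which makes it easy to collapse them into coarser families when the fine distinctions are not needed. The sketch below illustrates that idea only; the family names and the `coarse_tag` helper are made up for this example and are not part of NLTK.

```python
# Map a fine-grained tag to a coarse family by looking at its prefix.
# The family labels are illustrative choices, not an NLTK feature.
COARSE_PREFIXES = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("WP", "pronoun"),
]

def coarse_tag(tag):
    """Return the coarse family for a tag, or 'other' if no prefix matches."""
    for prefix, family in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return family
    return "other"

tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
print([(word, coarse_tag(tag)) for word, tag in tagged])
```

Note that `RP` (particle) does not start with `RB`, so it falls through to `'other'` rather than being mistaken for an adverb.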


In [4]:
text = word_tokenize("And now for something completely different")
result1 = nltk.pos_tag(text)

# Same word with different POS tags
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
result2 = nltk.pos_tag(text)

display(result1)
print('')
display(result2)
# Your turn: can you think of other words that take different POS tags within the same sentence?

############################

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]




[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
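
To check a sentence of your own, it can help to list automatically which words received more than one tag. The following is a minimal sketch that works on any list of `(word, tag)` pairs; `ambiguous_words` is a hypothetical helper name, and the input is copied from the second output above so the block runs without NLTK.

```python
from collections import defaultdict

def ambiguous_words(tagged_sentence):
    """Return {word: sorted tags} for words that occur with more than one tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_sentence:
        # Lower-case so that 'Refuse' and 'refuse' count as the same word.
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}

tagged_example = [
    ('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
    ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
    ('refuse', 'NN'), ('permit', 'NN'),
]
print(ambiguous_words(tagged_example))
# {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```

The word `to` appears twice but with the same tag both times, so it is not reported.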

In [6]:
# Now let's train our own POS tag classifier
# Download tagged text data
nltk.download('treebank')
tagged_sentences = nltk.corpus.treebank.tagged_sents()

# Check a sentence
display(tagged_sentences[0])

############################


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [9]:
# Now let's create features for each word.

def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }

import pprint

# Index 3 is the last word, 'sentence'.
pprint.pprint(features(['This', 'is', 'a', 'sentence'], 3))

# For comparison, index 2 (the word 'a') gives:
# {'capitals_inside': False,
#  'has_hyphen': False,
#  'is_all_caps': False,
#  'is_all_lower': True,
#  'is_capitalized': False,
#  'is_first': False,
#  'is_last': False,
#  'is_numeric': False,
#  'next_word': 'sentence',
#  'prefix-1': 'a',
#  'prefix-2': 'a',
#  'prefix-3': 'a',
#  'prev_word': 'is',
#  'suffix-1': 'a',
#  'suffix-2': 'a',
#  'suffix-3': 'a',
#  'word': 'a'}
print('------')
display(features(['This', 'is', 'a', 'sentence'], 3))
# Try different indices

############################

# Small helper function to strip the tags from our tagged corpus and feed it to our classifier:
def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]


# Split the dataset for training and testing
cutoff = int(.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]

print(len(training_sentences))   # 2935
print(len(test_sentences))      # 979

# Transform the list of sentences to a list of features
def transform_to_dataset(tagged_sentences):
    X, y = [], []

    for tagged in tagged_sentences:
        for index in range(len(tagged)):
            X.append(features(untag(tagged), index))
            y.append(tagged[index][1])

    return X, y

X, y = transform_to_dataset(training_sentences)

print(len(X)) # 75784
print(len(y)) # 75784

{'capitals_inside': False,
 'has_hyphen': False,
 'is_all_caps': False,
 'is_all_lower': True,
 'is_capitalized': False,
 'is_first': False,
 'is_last': True,
 'is_numeric': False,
 'next_word': '',
 'prefix-1': 's',
 'prefix-2': 'se',
 'prefix-3': 'sen',
 'prev_word': 'a',
 'suffix-1': 'e',
 'suffix-2': 'ce',
 'suffix-3': 'nce',
 'word': 'sentence'}
------


{'word': 'sentence',
 'is_first': False,
 'is_last': True,
 'is_capitalized': False,
 'is_all_caps': False,
 'is_all_lower': True,
 'prefix-1': 's',
 'prefix-2': 'se',
 'prefix-3': 'sen',
 'suffix-1': 'e',
 'suffix-2': 'ce',
 'suffix-3': 'nce',
 'prev_word': 'a',
 'next_word': '',
 'has_hyphen': False,
 'is_numeric': False,
 'capitals_inside': False}

2935
979
75784
75784
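
One detail of the feature function is worth a closer look: the `is_capitalized` test compares the first character with its upper-cased form, so it is also true for tokens that begin with a digit or punctuation. The sketch below reproduces that test next to a stricter variant; `is_capitalized_strict` is a hypothetical alternative, and whether it changes the tagger's accuracy has not been measured here.

```python
def is_capitalized_loose(word):
    # Same test as the 'is_capitalized' feature in the features() function.
    return word[0].upper() == word[0]

def is_capitalized_strict(word):
    # Hypothetical alternative: true only when the first character is an uppercase letter.
    return word[0].isupper()

for word in ['Pierre', 'board', '61', ',']:
    print(repr(word), is_capitalized_loose(word), is_capitalized_strict(word))
```

For `'61'` and `','` the loose test returns `True` while the strict one returns `False`, because upper-casing a digit or a comma leaves it unchanged.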


In [10]:
# We are now ready to train a classifier.

###### Now click the "Run" button above ######
###### Do NOT copy & paste the code below ######

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])

clf.fit(X[:100], y[:100])   # Only the first 100 samples are used so that repeated runs stay fast; training on more takes a fair bit longer :)

print("Training completed")

X_test, y_test = transform_to_dataset(test_sentences)

print("Accuracy:", clf.score(X_test, y_test))


Training completed
Accuracy: 0.5193636509721999
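
An accuracy figure is easier to interpret next to a simple reference point. A common one is a most-frequent-tag baseline: tag each word with the tag it had most often in training, and fall back to the overall most frequent tag for unseen words. The sketch below is a toy illustration of that idea, not part of the exercise. It trains on the single treebank sentence shown earlier and is scored on a hand-tagged test sentence invented for this example, so its number is not comparable to the accuracy above.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus the overall most frequent tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return word_to_tag, default_tag

def baseline_tag(words, word_to_tag, default_tag):
    # Unseen words fall back to the overall most frequent tag.
    return [word_to_tag.get(w, default_tag) for w in words]

# Training data: the first treebank sentence displayed earlier.
train = [[
    ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
    ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
    ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
    ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.'),
]]
# Hypothetical hand-tagged test sentence, made up for this sketch.
test = [('Pierre', 'NNP'), ('will', 'MD'), ('leave', 'VB'),
        ('the', 'DT'), ('board', 'NN'), ('.', '.')]

word_to_tag, default_tag = train_baseline(train)
predicted = baseline_tag([w for w, _ in test], word_to_tag, default_tag)
correct = sum(p == t for p, (_, t) in zip(predicted, test))
print(predicted)
print("Baseline accuracy:", correct / len(test))
```

The only error is the unseen word `leave`, which receives the fallback tag `NNP`. Running the same functions on `training_sentences` and `test_sentences` would give a baseline for the real data.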


In [14]:
print(X[0])
print(y[0])

{'word': 'Pierre', 'is_first': True, 'is_last': False, 'is_capitalized': True, 'is_all_caps': False, 'is_all_lower': False, 'prefix-1': 'P', 'prefix-2': 'Pi', 'prefix-3': 'Pie', 'suffix-1': 'e', 'suffix-2': 're', 'suffix-3': 'rre', 'prev_word': '', 'next_word': 'Vinken', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}
NNP
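
A feature dictionary like the one printed above cannot be fed to a decision tree directly; the `DictVectorizer` step in the pipeline first turns it into a numeric vector. The sketch below is a simplified pure-Python imitation of that encoding, written only to show the idea: each string feature becomes a `key=value` column holding 1.0 or 0.0, while boolean features keep a single column. The helper names `one_hot_columns` and `encode` are made up for this example, and the real class has more options than shown here.

```python
def column_name(key, value):
    # Strings get one column per distinct value; booleans and numbers share one column.
    return f"{key}={value}" if isinstance(value, str) else key

def one_hot_columns(samples):
    """Collect the sorted list of column names seen across all samples."""
    return sorted({column_name(k, v) for s in samples for k, v in s.items()})

def encode(sample, columns):
    """Turn one feature dict into a list of floats aligned with `columns`."""
    row = dict.fromkeys(columns, 0.0)
    for key, value in sample.items():
        name = column_name(key, value)
        if name in row:  # values never seen when building the columns are dropped
            row[name] = 1.0 if isinstance(value, str) else float(value)
    return [row[c] for c in columns]

samples = [
    {'word': 'Pierre', 'is_first': True},
    {'word': 'Vinken', 'is_first': False},
]
columns = one_hot_columns(samples)
print(columns)                      # ['is_first', 'word=Pierre', 'word=Vinken']
print(encode(samples[0], columns))  # [1.0, 1.0, 0.0]
```

Because every distinct word, prefix, and suffix gets its own column, the vector grows quickly with the amount of training data, which is one reason training on the full set is slow with `sparse=False`.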


In [11]:
# Now you can use your classifier to tag any sentence, such as "I am studying NLP at UTS."
def my_pos_tag(sentence):
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return (sentence, tags)

my_text = word_tokenize("I am studying NLP at UTS.")
print(list(my_pos_tag(my_text)))

[['I', 'am', 'studying', 'NLP', 'at', 'UTS', '.'], array(['DT', 'IN', 'NN', 'NNP', 'IN', 'NNP', '.'], dtype='<U6')]
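
The output above keeps the words and the predicted tags in two separate sequences. If you prefer the list-of-pairs shape that `nltk.pos_tag` returns, the two can be zipped together. This is a small sketch with a hypothetical helper name, `pair_words_with_tags`; the tags are copied from the output above so the block runs on its own.

```python
def pair_words_with_tags(words, tags):
    """Return [(word, tag), ...]; str() turns NumPy string scalars into plain str."""
    return [(word, str(tag)) for word, tag in zip(words, tags)]

words = ['I', 'am', 'studying', 'NLP', 'at', 'UTS', '.']
tags = ['DT', 'IN', 'NN', 'NNP', 'IN', 'NNP', '.']  # copied from the output above
print(pair_words_with_tags(words, tags))
```

In the notebook, the same helper could be applied to the result of `my_pos_tag(my_text)` by unpacking it: `pair_words_with_tags(*my_pos_tag(my_text))`.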


In [12]:
nltk.pos_tag(my_text)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('studying', 'VBG'),
 ('NLP', 'NNP'),
 ('at', 'IN'),
 ('UTS', 'NNP'),
 ('.', '.')]
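
Comparing the two taggings token by token shows where our classifier, trained on only 100 samples, departs from NLTK's tagger. The sketch below computes the fraction of tokens on which they agree and lists the disagreements. Both inputs are copied from the outputs above, `tag_agreement` is a hypothetical helper name, and a single sentence is of course only an anecdote, not an evaluation.

```python
def tag_agreement(tagged_a, tagged_b):
    """Fraction of positions where two taggings of the same tokens agree."""
    assert len(tagged_a) == len(tagged_b), "taggings must cover the same tokens"
    matches = sum(ta == tb for (_, ta), (_, tb) in zip(tagged_a, tagged_b))
    return matches / len(tagged_a)

# Our classifier's output, copied from above.
ours = [('I', 'DT'), ('am', 'IN'), ('studying', 'NN'), ('NLP', 'NNP'),
        ('at', 'IN'), ('UTS', 'NNP'), ('.', '.')]
# nltk.pos_tag output, copied from above.
nltk_tags = [('I', 'PRP'), ('am', 'VBP'), ('studying', 'VBG'), ('NLP', 'NNP'),
             ('at', 'IN'), ('UTS', 'NNP'), ('.', '.')]

print(round(tag_agreement(ours, nltk_tags), 3))  # 0.571
disagreements = [(w, a, b) for (w, a), (_, b) in zip(ours, nltk_tags) if a != b]
print(disagreements)
# [('I', 'DT', 'PRP'), ('am', 'IN', 'VBP'), ('studying', 'NN', 'VBG')]
```

The two taggers agree on 4 of the 7 tokens; the disagreements are all on the first three words.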