#### PART OF SPEECH TAGGING ####
Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, adverb, and so on) to each word in a sentence. It exposes the grammatical structure of a sentence and feeds many NLP tasks, such as named entity recognition, sentiment analysis, and machine translation.

NLTK's default English tagger uses the Penn Treebank tagset. A few of its tags:

- CC - Coordinating conjunction
- CD - Cardinal number
- DT - Determiner
- EX - Existential there
- FW - Foreign word
- IN - Preposition or subordinating conjunction
- ...
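
As a small illustration, the tags listed above can be kept in a lookup table and used to annotate tagger output. This is only a sketch: `TAG_DESCRIPTIONS` and `describe` are hypothetical names, the table covers just the six tags shown here rather than the full tagset, and the sample tokens are hard-coded in the shape that `nltk.pos_tag` returns.

```python
# Hypothetical lookup table: only the six tags listed above, not the full tagset.
TAG_DESCRIPTIONS = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
}


def describe(tagged_tokens):
    """Pair each (word, tag) tuple with a readable description of its tag."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "not in this subset"))
        for word, tag in tagged_tokens
    ]


# Tagged tokens in the shape nltk.pos_tag returns (hard-coded for this sketch).
sample = [("I", "PRP"), ("am", "VBP"), ("a", "DT"), ("student", "NN")]
for row in describe(sample):
    print(row)
```

Tags missing from the table fall back to a placeholder string, so the helper keeps working as the table is extended.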

In [1]:
paragraph = """It started as just another uneventful afternoon, the kind where the sky hangs dull and gray like a curtain that forgot how to be blue. 
Out of nowhere, this beat-up ice cream truck rolled into the cul-de-sac, its jingle playing a warped, off-key melody that sounded more like a haunted lullaby than a summer tune. Kids peeked through their curtains but didn’t rush out like usual—something about the driver’s too-wide grin and mirrored sunglasses made everyone hesitate. Still, curiosity’s a powerful thing. One by one, they trickled outside, drawn like moths to a weird, sticky flame. 
But instead of ice cream, the truck was handing out little jars filled with glowing jellybeans—each one pulsing like it had a heartbeat. No one knew what would happen if you ate one, but the neighborhood would never be the same again after that day."""

In [7]:
import nltk
from nltk.corpus import stopwords

# Download the pretrained English model used by nltk.pos_tag
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\hoang\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [12]:
nltk.download('tagsets_json')

[nltk_data] Downloading package tagsets_json to
[nltk_data]     C:\Users\hoang\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets_json.zip.


True

In [13]:
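# Show all POS tags in the Penn Treebank tagset.
# upenn_tagset() prints its output directly and returns None, so no print() is needed.
nltk.help.upenn_tagset()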

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [8]:
sentences = nltk.sent_tokenize(paragraph)
sentences

['It started as just another uneventful afternoon, the kind where the sky hangs dull and gray like a curtain that forgot how to be blue.',
 'Out of nowhere, this beat-up ice cream truck rolled into the cul-de-sac, its jingle playing a warped, off-key melody that sounded more like a haunted lullaby than a summer tune.',
 'Kids peeked through their curtains but didn’t rush out like usual—something about the driver’s too-wide grin and mirrored sunglasses made everyone hesitate.',
 'Still, curiosity’s a powerful thing.',
 'One by one, they trickled outside, drawn like moths to a weird, sticky flame.',
 'But instead of ice cream, the truck was handing out little jars filled with glowing jellybeans—each one pulsing like it had a heartbeat.',
 'No one knew what would happen if you ate one, but the neighborhood would never be the same again after that day.']
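
The sentences above contain typographic apostrophes and em dashes. Tokenizers tuned for ASCII punctuation may not split contractions or dash-joined words as intended, so one option is to normalize these characters before calling `nltk.sent_tokenize` or `nltk.word_tokenize`. A minimal sketch, where `PUNCT_MAP` and `normalize_punctuation` are hypothetical names and the replacement choices are assumptions rather than anything NLTK prescribes:

```python
# Map typographic punctuation to ASCII equivalents before tokenizing.
# The replacement choices below are assumptions made for this sketch.
PUNCT_MAP = str.maketrans({
    "\u2019": "'",    # right single quotation mark -> ASCII apostrophe
    "\u2018": "'",    # left single quotation mark -> ASCII apostrophe
    "\u201c": '"',    # left double quotation mark -> ASCII double quote
    "\u201d": '"',    # right double quotation mark -> ASCII double quote
    "\u2014": " - ",  # em dash -> spaced hyphen so the joined words separate
})


def normalize_punctuation(text):
    """Return text with typographic quotes and em dashes replaced by ASCII."""
    return text.translate(PUNCT_MAP)


print(normalize_punctuation("Still, curiosity\u2019s a powerful thing."))
```

In the notebook, this would be applied to `paragraph` before sentence splitting.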

In [9]:
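# POS-tag each sentence after removing stop words.
# Note: the tagger relies on surrounding words as context, so removing stop
# words before tagging can degrade the tags it assigns.
stop_words = set(stopwords.words('english'))  # build the set once, not per word

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    words = [word for word in words if word not in stop_words]
    print(nltk.pos_tag(words))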

[('It', 'PRP'), ('started', 'VBD'), ('another', 'DT'), ('uneventful', 'JJ'), ('afternoon', 'NN'), (',', ','), ('kind', 'NN'), ('sky', 'JJ'), ('hangs', 'NNS'), ('dull', 'JJ'), ('gray', 'NN'), ('like', 'IN'), ('curtain', 'NN'), ('forgot', 'VBD'), ('blue', 'JJ'), ('.', '.')]
[('Out', 'IN'), ('nowhere', 'RB'), (',', ','), ('beat-up', 'JJ'), ('ice', 'NN'), ('cream', 'NN'), ('truck', 'NN'), ('rolled', 'VBN'), ('cul-de-sac', 'NN'), (',', ','), ('jingle', 'NN'), ('playing', 'NN'), ('warped', 'VBD'), (',', ','), ('off-key', 'JJ'), ('melody', 'NN'), ('sounded', 'VBD'), ('like', 'IN'), ('haunted', 'VBN'), ('lullaby', 'NN'), ('summer', 'NN'), ('tune', 'NN'), ('.', '.')]
[('Kids', 'NNS'), ('peeked', 'VBD'), ('curtains', 'NNS'), ('’', 'JJ'), ('rush', 'NN'), ('like', 'IN'), ('usual—something', 'VBG'), ('driver', 'NN'), ('’', 'VBD'), ('too-wide', 'JJ'), ('grin', 'NN'), ('mirrored', 'VBD'), ('sunglasses', 'NNS'), ('made', 'VBD'), ('everyone', 'NN'), ('hesitate', 'NN'), ('.', '.')]
[('Still', 'RB'), (',
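
The output above contains some questionable tags, such as `('sky', 'JJ')` and `('hangs', 'NNS')`. A plausible cause is that the stop words were removed before tagging, which takes away context the tagger relies on. One alternative is to tag the full sentence first and drop stop words afterwards. A minimal sketch, where `drop_stop_words` is a hypothetical helper, and the hard-coded tagged sentence and tiny stop-word set stand in for the result of `nltk.pos_tag` and for `set(stopwords.words('english'))`:

```python
def drop_stop_words(tagged_tokens, stop_words):
    """Remove stop words from already-tagged tokens, keeping each tag intact."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if word.lower() not in stop_words
    ]


# Stand-ins for this sketch: in the notebook these would come from
# nltk.pos_tag(nltk.word_tokenize(sentence)) and stopwords.words('english').
tagged = [("I", "PRP"), ("am", "VBP"), ("a", "DT"), ("student", "NN")]
stop_words = {"i", "am", "a"}

print(drop_stop_words(tagged, stop_words))
```

Lowercasing each word before the lookup is a design choice of this sketch, so that capitalized stop words at the start of a sentence are also removed.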

In [None]:
sentence = "I am a student"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag_sents([tokens])  # pos_tag_sents expects a list of token lists, so tokens must be wrapped in a list

print(tagged)

[[('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('student', 'NN')]]


In [17]:
nltk.pos_tag("I am a student".split())  # pos_tag takes a single token list, so no wrapping is needed

[('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('student', 'NN')]
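
The two calls differ only in the shape of their result: `nltk.pos_tag_sents` returns one list of (word, tag) pairs per sentence, while `nltk.pos_tag` returns a single flat list. The sketch below uses the output above as hard-coded data to flatten the nested result and count tags; `flatten_tagged` is a hypothetical helper name.

```python
from collections import Counter
from itertools import chain


def flatten_tagged(tagged_sents):
    """Turn a list of tagged sentences into one flat list of (word, tag) pairs."""
    return list(chain.from_iterable(tagged_sents))


# Same shape as the pos_tag_sents output shown above (hard-coded for this sketch).
tagged_sents = [[("I", "PRP"), ("am", "VBP"), ("a", "DT"), ("student", "NN")]]

flat = flatten_tagged(tagged_sents)
tag_counts = Counter(tag for _, tag in flat)

print(flat)
print(tag_counts)
```

Flattening is convenient for corpus-level statistics; keep the nested form when sentence boundaries matter.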