# Kriss Sitapara NLTK

N-grams and part-of-speech tagging using NLTK in Python. I generate bi-grams and tri-grams from the text, count their frequencies, and identify the most common patterns. I also use NLTK's `pos_tag()` to label each token with its part of speech, which shows the grammatical structure of the text and helps analyze patterns. There is also a small demonstration of how context plays an important role in tokenization and n-grams.



# Setup & Data
Importing libraries and checking that the NLTK resources work. Also importing the short description of Whistler, a resort town north of Vancouver, from **intext.txt**.


In [26]:
#importing libraries
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk import pos_tag
from nltk.util import ngrams
from nltk.probability import FreqDist
import re
from collections import Counter
nltk.download("tagsets")



[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\kriss\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [27]:
with open("intext.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(text[:20])


Whistler is a mounta


# Tokenization & Normalization
Tokenizing the text into sentences and words, normalizing the tokens, and stemming them using NLTK's PorterStemmer.


In [28]:
#tokenizing sentences and words
sentences = sent_tokenize(text)
tokens = word_tokenize(text)
print(f"Sentences: {len(sentences)}")
print(f"Tokens: {len(tokens)}\n")


#checking for correct output
for s in sentences[:5]:
    print("-", s)
print("\nTokens:", tokens[:20])

Sentences: 30
Tokens: 726

- Whistler is a mountain town and resort in British Columbia, Canada, about two hours north of Vancouver.
- It sits in the Coast Mountains, in a valley surrounded by tall peaks, rivers, and lakes.
- The land has been home to the Squamish and Lil’wat First Nations for thousands of years, and both communities still have a presence in the area today.
- They used the valley and nearby mountains for hunting, fishing, and trading long before the first settlers arrived.
- The European settlers came much later, in the late 1800s, and the place began to develop as a small stop for trappers, prospectors, and travelers.

Tokens: ['Whistler', 'is', 'a', 'mountain', 'town', 'and', 'resort', 'in', 'British', 'Columbia', ',', 'Canada', ',', 'about', 'two', 'hours', 'north', 'of', 'Vancouver', '.']


In [29]:
#normalizing and stemming
#the loop variable is named t so it is not confused with the full document string `text`
lowercasetokens = [t.lower() for t in tokens]
alphabetictokens = [t for t in lowercasetokens if t.isalpha()]
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in alphabetictokens]

#counting to make sure that there are fewer alphabetic tokens, since they are a filtered version of the lowercase tokens
print(f"Lowercase Tokens: {len(lowercasetokens)}")
print(f"Alphabetic Tokens: {len(alphabetictokens)}")
print(f"Stemmed Tokens: {len(stems)}")


Lowercase Tokens: 726
Alphabetic Tokens: 618
Stemmed Tokens: 618


I was curious why the lowercased tokens still included characters like "." or ",", since grammatically these are punctuation marks rather than words. The reason is that `.lower()` only changes letters and leaves every other character as it is, so it does not filter anything out. To keep only the actual words, I used the string method `.isalpha()`. After applying it, however, every hyphenated word was dropped as well, such as "world-class" and "year-round". In real-world scenarios the relevance of these words depends on the context, so for this task I decided to stay with just the `.isalpha()` check.

I wanted to test this, so the following code is one possible way to compare the two filters, if you are interested!

In [30]:
#Custom versions of token filter to show the difference and possibly determine which one to use based on the words excluded.
def testtoken(tokens):
    
    #lowercase version
    ltoken = [t.lower() for t in tokens]
    #.isalpha() version
    atoken = [t for t in ltoken if t.isalpha()]
    #custom regex version to include other characters like hyphen etc
    rtoken = [t for t in ltoken if re.match(r"^[a-zA-Z\-']+$", t)]
    
    
    print("total tokens:", len(tokens))
    print("lowercase:", len(ltoken))
    print("alphabetic tokens '.isalpha()':", len(atoken))
    print("regex tokens:", len(rtoken))
    
    #tokens that the regex keeps but .isalpha() drops
    print("\ntokens dropped by .isalpha() but kept by regex:")
    dropped = set(rtoken) - set(atoken)
    print(list(dropped))
    
    return ltoken, atoken, rtoken

ltoken, atoken, rtoken = testtoken(tokens)

total tokens: 726
lowercase: 726
alphabetic tokens '.isalpha()': 618
regex tokens: 623

tokens dropped by .isalpha() but kept by regex:
['pedestrian-friendly', 'warm-weather', 'world-class', 'well-known', 'year-round']
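
To see why the two filters disagree, here is a minimal sketch on a small hand-made token list (the tokens in `sample` are made up for illustration and are not read from intext.txt). It shows that `.lower()` leaves punctuation untouched, `.isalpha()` rejects anything containing a non-letter, and a pattern like the one in the cell above also accepts hyphens and apostrophes.

```python
import re

# Hypothetical sample tokens, chosen only to illustrate the three filters
sample = ["Whistler", ",", "world-class", "1800s", "year-round", ".", "town"]

# .lower() changes letters only, so punctuation passes through unchanged
lowered = [t.lower() for t in sample]

# .isalpha() is True only when every character is a letter
alpha_only = [t for t in lowered if t.isalpha()]

# letters, hyphen, and apostrophe; fullmatch requires the whole token to fit
pattern = re.compile(r"[a-z'-]+")
regex_kept = [t for t in lowered if pattern.fullmatch(t)]

print(lowered)
print(alpha_only)
print(regex_kept)

# tokens the regex keeps but .isalpha() drops
print([t for t in regex_kept if t not in alpha_only])
```

One caveat of this pattern is that it would also accept a token made only of hyphens or apostrophes, so a stricter pattern may be needed for messier text.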


In [31]:
print("some lowercased tokens: ", lowercasetokens[12:30])
print("\nsome alpha-only tokens: ", alphabetictokens[12:30])
print("\nsome stemmed tokens: ", stems[12:30])

#I wondered why something like "," or "." counts as a lowercase token. It is not really "lowercase", but .lower() leaves non-letter characters unchanged, so it stays in the list.
#Another thing I noticed was that the stemming behaved a bit oddly. Words like tired became tire, and quads became quad. Stemming strips endings such as tense and plural suffixes so that different forms of a word share one stem.
#Some words like activities, activity, active all became activ, so that they can be grouped in the same category of "active".

some lowercased tokens:  [',', 'about', 'two', 'hours', 'north', 'of', 'vancouver', '.', 'it', 'sits', 'in', 'the', 'coast', 'mountains', ',', 'in', 'a', 'valley']

some alpha-only tokens:  ['two', 'hours', 'north', 'of', 'vancouver', 'it', 'sits', 'in', 'the', 'coast', 'mountains', 'in', 'a', 'valley', 'surrounded', 'by', 'tall', 'peaks']

some stemmed tokens:  ['two', 'hour', 'north', 'of', 'vancouv', 'it', 'sit', 'in', 'the', 'coast', 'mountain', 'in', 'a', 'valley', 'surround', 'by', 'tall', 'peak']
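
The stemming behaviour noted in the comments can be checked directly on a few words. This is a small sketch that uses only `PorterStemmer`, which should not need any extra NLTK downloads; the word list is taken from the observations above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words mentioned in the notes and output above
words = ["activities", "activity", "active", "tired", "quads", "mountains"]
stem_of = {w: stemmer.stem(w) for w in words}

for word, stem in stem_of.items():
    print(f"{word} -> {stem}")

# Different forms of one word should collapse into a single shared stem
shared = {stem_of[w] for w in ("activities", "activity", "active")}
print("distinct stems for the 'activ' family:", len(shared))
```

A stem does not have to be a real word. It only needs to be the same string for related forms, which is why "activ" and "vancouv" are acceptable outputs.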


# Part of Speech Tagging
Using NLTK's `pos_tag()` to label each token with its part of speech. This shows the grammatical structure of the text and helps analyze patterns.


In [32]:
postags = pos_tag(alphabetictokens)
print("Some POS-tagged tokens:")
print(postags[:5])

Some POS-tagged tokens:
[('whistler', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('mountain', 'NN'), ('town', 'NN')]


In [33]:
tcount = Counter(tag for word, tag in postags)

print("POS tags:\n")
for tag, count in tcount.most_common():
    print(f"{tag} - {count}")

POS tags:

NN - 129
IN - 78
DT - 77
NNS - 67
JJ - 55
CC - 40
VBD - 26
RB - 19
VBN - 17
TO - 16
VBZ - 14
VBG - 14
VB - 13
VBP - 11
PRP - 10
JJS - 5
CD - 4
WDT - 4
PRP$ - 4
RBR - 3
WRB - 3
JJR - 2
WP - 2
RP - 2
MD - 1
RBS - 1
EX - 1


In [34]:
group = {
    "NN": "Nouns", "NNS": "Nouns", "NNP": "Nouns", "NNPS": "Nouns",
    "VB": "Verbs", "VBD": "Verbs", "VBG": "Verbs", "VBN": "Verbs", "VBP": "Verbs", "VBZ": "Verbs",
    "JJ": "Adjectives", "JJR": "Adjectives", "JJS": "Adjectives",
    "RB": "Adverbs", "RBR": "Adverbs", "RBS": "Adverbs",
    "DT": "Determiners",
    "IN": "Prepositions",
    "CC": "Conjunctions",
    "PRP": "Pronouns", "PRP$": "Pronouns", "WP": "Pronouns", "WP$": "Pronouns",
    "MD": "Modals",
}


gcount = Counter()
for tag, count in tcount.items():
    category = group.get(tag, "Other")
    gcount[category] += count


print("General POS categories:\n")
for category, count in gcount.most_common():
    print(f"{category} - {count}")


General POS categories:

Nouns - 196
Verbs - 95
Prepositions - 78
Determiners - 77
Adjectives - 62
Conjunctions - 40
Other - 30
Adverbs - 23
Pronouns - 16
Modals - 1
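
As a quick sanity check on the grouping, the sketch below re-enters the tag counts printed earlier by hand (so it runs without the tagger) and lists which tags end up in the "Other" bucket. The `grouped_tags` set is an assumption that simply mirrors the keys of the `group` dictionary above.

```python
from collections import Counter

# Tag counts copied from the POS tag output above
tag_counts = Counter({
    "NN": 129, "IN": 78, "DT": 77, "NNS": 67, "JJ": 55, "CC": 40, "VBD": 26,
    "RB": 19, "VBN": 17, "TO": 16, "VBZ": 14, "VBG": 14, "VB": 13, "VBP": 11,
    "PRP": 10, "JJS": 5, "CD": 4, "WDT": 4, "PRP$": 4, "RBR": 3, "WRB": 3,
    "JJR": 2, "WP": 2, "RP": 2, "MD": 1, "RBS": 1, "EX": 1,
})

# Tags that the group dictionary maps to a named category
grouped_tags = {
    "NN", "NNS", "NNP", "NNPS",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "JJ", "JJR", "JJS",
    "RB", "RBR", "RBS",
    "DT", "IN", "CC",
    "PRP", "PRP$", "WP", "WP$",
    "MD",
}

# Everything without a named category falls through to "Other"
other = {tag: n for tag, n in tag_counts.items() if tag not in grouped_tags}

print("tags counted as Other:", other)
print("Other total:", sum(other.values()))
print("all tags total:", sum(tag_counts.values()))
```

The totals line up with the 30 "Other" tokens and the 618 alphabetic tokens reported above. More than half of "Other" comes from the TO tag, so it might deserve its own category in a more detailed breakdown.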


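By the looks of this data, the tagging seems to be fairly accurate. Since my original text is about Whistler and the area around Vancouver, it makes sense that nouns dominate, because many places and attractions are mentioned in it. There are a lot of adjectives as well, which makes sense since they are used to describe nouns. Another thing I noticed is that something like "Whistler" or "Blackcomb" is tagged as NN (common noun) instead of NNP (proper noun). I suspect the main reason is that the tokens were lowercased before tagging, so the POS tagger no longer sees the capitalization that signals a proper noun. This would also explain why no NNP or NNPS tags appear in the counts at all. Tagging the original-case tokens and lowercasing afterwards would be one way to check this. Also, some words (for example "early") can be tagged as either RB (adverb) or JJ (adjective) depending on their position in the sentence. `pos_tag()` decides this from context: NLTK's default tagger is an averaged perceptron model that looks at the surrounding words and the tags it has already assigned, not just at the word itself.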

# N-Grams
Generating bi-grams and tri-grams from the text to count their frequencies, and identify the most common patterns.
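
For intuition, an n-gram is just a sliding window of n consecutive tokens. The sketch below builds the same kind of tuples with plain Python. `sliding_ngrams` is a hypothetical helper written for illustration, not an NLTK function, and the sample tokens are the opening words of the text.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

sample = ["whistler", "is", "a", "mountain", "town", "and", "resort"]

bigrams_demo = sliding_ngrams(sample, 2)
trigrams_demo = sliding_ngrams(sample, 3)

print(bigrams_demo[:3])
print(trigrams_demo[:3])

# A list of k tokens yields k - n + 1 n-grams
print(len(sample), len(bigrams_demo), len(trigrams_demo))

# Counting n-grams is a frequency count over these tuples
print(Counter(bigrams_demo).most_common(1))
```

Because the window slides over one flat token list, n-grams can span sentence boundaries once punctuation has been filtered out.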


In [35]:
#grams using the alphabetic tokens
bigrams = list(ngrams(alphabetictokens, 2))
trigrams = list(ngrams(alphabetictokens, 3))
bigramf = FreqDist(bigrams)
trigramf = FreqDist(trigrams)

#results
print("top bigrams:")
for pair, count in bigramf.most_common(10):
    print(f"{pair} - {count}")
print("\ntop trigrams:")
for triplet, count in trigramf.most_common(10):
    print(f"{triplet} - {count}")


top bigrams:
('in', 'the') - 8
('of', 'the') - 6
('the', 'area') - 3
('and', 'the') - 3
('at', 'the') - 3
('it', 'was') - 3
('a', 'mountain') - 2
('mountain', 'town') - 2
('of', 'vancouver') - 2
('in', 'a') - 2

top trigrams:
('a', 'mountain', 'town') - 2
('for', 'the', 'winter') - 2
('the', 'winter', 'olympics') - 2
('is', 'one', 'of') - 2
('one', 'of', 'the') - 2
('whistler', 'and', 'blackcomb') - 2
('whistler', 'is', 'a') - 1
('is', 'a', 'mountain') - 1
('mountain', 'town', 'and') - 1
('town', 'and', 'resort') - 1


In [36]:
#grams using the stemmed tokens
bigrams = list(ngrams(stems, 2))
trigrams = list(ngrams(stems, 3))
bigramf = FreqDist(bigrams)
trigramf = FreqDist(trigrams)

#results
print("top bigrams:")
for pair, count in bigramf.most_common(10):
    print(f"{pair} - {count}")
print("\ntop trigrams:")
for triplet, count in trigramf.most_common(10):
    print(f"{triplet} - {count}")


top bigrams:
('in', 'the') - 8
('of', 'the') - 6
('the', 'area') - 3
('and', 'the') - 3
('at', 'the') - 3
('the', 'mountain') - 3
('it', 'wa') - 3
('the', 'ski') - 3
('a', 'mountain') - 2
('mountain', 'town') - 2

top trigrams:
('a', 'mountain', 'town') - 2
('for', 'the', 'winter') - 2
('the', 'winter', 'olymp') - 2
('is', 'one', 'of') - 2
('one', 'of', 'the') - 2
('whistler', 'and', 'blackcomb') - 2
('whistler', 'is', 'a') - 1
('is', 'a', 'mountain') - 1
('mountain', 'town', 'and') - 1
('town', 'and', 'resort') - 1


This was a pretty interesting find. The most common bigrams were function-word pairs like "in the", "of the", and "the area". I created n-grams for both the alphabetic tokens and the stemmed tokens because I had a feeling there would be some kind of difference, but I did not know what exactly. The top trigrams were the same for both token sets (apart from "olympics" being stemmed to "olymp"), but there was a clear difference in the bigrams. "of vancouver" appears in the top 10 bigrams for the alphabetic tokens but not for the stemmed tokens. Vancouver is a proper noun and carries real meaning in the text, yet it gets stemmed to "vancouv", which is no longer a recognizable word. The bigram itself did not become less frequent. Instead, stemming merged different word forms, so bigrams like "the mountain" and "the ski" rose to a count of 3 and pushed the count-2 bigrams, including "of vancouv", out of the top 10.

Stemming can also merge words that are not actually related. Imagine you had the two phrases "the car" and "the carousel". Even though these are different words, an aggressive stemmer could possibly reduce "carousel" to "car", so both would be counted as "the car", which can cause confusion and misinterpretation. This means that you sometimes need to be careful about how you tokenize and normalize, depending on the context of the text.
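
The effect of stemming on bigram rankings can be reproduced on a toy example. This is a sketch under stated assumptions: the token list is made up to mimic the pattern in the text, and it uses only `PorterStemmer` and `nltk.util.ngrams`, as in the cells above.

```python
from collections import Counter
from nltk.stem import PorterStemmer
from nltk.util import ngrams

# Made-up tokens: "mountain" appears in two forms, "vancouver" in one
toy = ["the", "mountain", "the", "mountains", "the", "mountain",
       "of", "vancouver", "of", "vancouver"]

stemmer = PorterStemmer()
toy_stems = [stemmer.stem(t) for t in toy]

raw_counts = Counter(ngrams(toy, 2))
stem_counts = Counter(ngrams(toy_stems, 2))

# Before stemming, the two forms of "mountain" are counted separately
print(raw_counts[("the", "mountain")], raw_counts[("the", "mountains")])

# After stemming they merge, while the "of vancouver" count stays the same
print(stem_counts[("the", "mountain")], stem_counts[("of", "vancouv")])
```

The count for "of vancouv" does not change, but a merged bigram overtakes it, which is the same kind of shift seen between the two top-10 lists.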