CSCI 393: NLP
Assignment 1: Tokenisation, lemmatisation & stemming, POS-tagging & NER
Alibek Abilmazhit

Question 1: Tokenisation

Data:
“ATG-CGA-TTT-AGC”
“The quick brown fox jumps over the lazy dog.”
“Just landed in NYC!!! 😎✈️ #travel #blessed”

(a) For each of the examples, decide what kind of tokenisation strategy would be most appropriate:
1) “ATG-CGA-TTT-AGC”: Rule-based splitting on the hyphen delimiter.
2) “The quick brown fox jumps over the lazy dog.”: Whitespace + punctuation-based tokenisation.
3) “Just landed in NYC!!! 😎✈️ #travel #blessed”: A specialised social-media tokeniser (e.g. NLTK’s TweetTokenizer).
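
The rule-based option for the first example, and the idea behind a hashtag-aware tokeniser for the third, can be sketched with the standard library alone. This is a minimal illustration, not how NLTK or spaCy work internally: the helper names `split_codons` and `tokenize_social` and the regular expression are assumptions made up for this sketch.

```python
import re

def split_codons(sequence: str, sep: str = "-") -> list[str]:
    """Split a delimiter-separated DNA string into codon tokens."""
    return [part for part in sequence.split(sep) if part]

# Hypothetical pattern: try hashtags/mentions first, then plain words,
# then fall back to any single non-space, non-word character.
SOCIAL_PATTERN = re.compile(r"[#@]\w+|\w+|[^\w\s]")

def tokenize_social(text: str) -> list[str]:
    """Tokenise short social-media text while keeping #tags and @mentions whole."""
    return SOCIAL_PATTERN.findall(text)

print(split_codons("ATG-CGA-TTT-AGC"))
# → ['ATG', 'CGA', 'TTT', 'AGC']
print(tokenize_social("Just landed in NYC!!! #travel #blessed"))
# → ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '#travel', '#blessed']
```

The order of the alternatives matters: because `[#@]\w+` is tried first, a hashtag is kept as one token rather than split into `#` and the word.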

(b) Implement tokenisation in Python using at least two methods or libraries (e.g. NLTK’s word_tokenize, spaCy’s built-in tokeniser) and compare results.

In [9]:
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
import spacy

nltk.download('punkt_tab')

data = [
    "ATG-CGA-TTT-AGC",
    "The quick brown fox jumps over the lazy dog.",
    "Just landed in NYC!!! 😎✈️ #travel #blessed"
]

print("Method 1: NLTK word_tokenize\n")
for sentence in data:
    print(sentence)
    print("NLTK:", word_tokenize(sentence))

print("\nMethod 2: TweetTokenizer\n")
tweet_tokenizer = TweetTokenizer()
for sentence in data:
    print("TweetTokenizer:", tweet_tokenizer.tokenize(sentence))

print("\nMethod 3: spaCy\n")
nlp = spacy.load("en_core_web_sm")
for text in data:
    doc = nlp(text)
    print("spaCy:", [token.text for token in doc])

Method 1: NLTK word_tokenize

ATG-CGA-TTT-AGC
NLTK: ['ATG-CGA-TTT-AGC']
The quick brown fox jumps over the lazy dog.
NLTK: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Just landed in NYC!!! 😎✈️ #travel #blessed
NLTK: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎✈️', '#', 'travel', '#', 'blessed']

Method 2: TweetTokenizer

TweetTokenizer: ['ATG-CGA-TTT-AGC']
TweetTokenizer: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
TweetTokenizer: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#travel', '#blessed']

Method 3: spaCy

spaCy: ['ATG', '-', 'CGA', '-', 'TTT', '-', 'AGC']
spaCy: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
spaCy: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#', 'travel', '#', 'blessed']



(c) Discuss the pros and cons of each implemented tokeniser for each example.

Question 2: POS tagging
Data:
"The quick brown fox jumps over the lazy dog."
"omg 😂 can't believe @john_doe said that #shocked"

(a) Use NLTK’s pos_tag on the tokenised text.

In [18]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

data = [
    "The quick brown fox jumps over the lazy dog.",
    "omg 😂 can't believe @john_doe said that #shocked"
]

for sentence in data:
    print(sentence)
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    print("POS Tags:", pos_tags)


The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
omg 😂 can't believe @john_doe said that #shocked
POS Tags: [('omg', 'NN'), ('😂', 'NN'), ('ca', 'MD'), ("n't", 'RB'), ('believe', 'VB'), ('@', 'NNP'), ('john_doe', 'NN'), ('said', 'VBD'), ('that', 'IN'), ('#', '#'), ('shocked', 'VBD')]



(b) Use spaCy’s doc[i].pos_ to get POS tags.

In [19]:
doc1 = nlp("The quick brown fox jumps over the lazy dog.")
doc2 = nlp("omg 😂 can't believe @john_doe said that #shocked")

print("spaCy POS 1:", [(token.text, token.pos_) for token in doc1])
print("spaCy POS 2:", [(token.text, token.pos_) for token in doc2])

spaCy POS 1: [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', 'PUNCT')]
spaCy POS 2: [('omg', 'X'), ('😂', 'PROPN'), ('ca', 'AUX'), ("n't", 'PART'), ('believe', 'VERB'), ('@john_doe', 'NUM'), ('said', 'VERB'), ('that', 'SCONJ'), ('#', 'PUNCT'), ('shocked', 'VERB')]


(c) Compare outputs — are they identical? Where do they differ? Which one is better for the
tweet?
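
One way to make the comparison concrete is to project NLTK's Penn Treebank tags onto spaCy's Universal POS tags and list the tokens where the two taggers disagree. The sketch below hard-codes the outputs printed above for the first sentence. The mapping `PTB_TO_UPOS` is a small hand-written assumption that covers only the tags seen in that sentence, not a complete conversion table, and `tag_disagreements` is a hypothetical helper.

```python
# Hand-written mapping for the Penn Treebank tags that occur in sentence 1 only
# (an assumption for this sketch, not a complete PTB-to-UPOS table).
PTB_TO_UPOS = {"DT": "DET", "JJ": "ADJ", "NN": "NOUN",
               "VBZ": "VERB", "IN": "ADP", ".": "PUNCT"}

# Tagger outputs copied from the cells above.
nltk_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
             ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
             ('dog', 'NN'), ('.', '.')]
spacy_tags = [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'),
              ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'),
              ('dog', 'NOUN'), ('.', 'PUNCT')]

def tag_disagreements(ptb_tagged, upos_tagged, mapping):
    """Return (token, PTB tag, UPOS tag) for tokens where the taggers disagree."""
    diffs = []
    for (word_a, ptb), (word_b, upos) in zip(ptb_tagged, upos_tagged):
        if word_a != word_b:
            raise ValueError(f"token mismatch: {word_a!r} vs {word_b!r}")
        # Unmapped PTB tags fall back to 'X' so they always show up as a difference.
        if mapping.get(ptb, "X") != upos:
            diffs.append((word_a, ptb, upos))
    return diffs

print(tag_disagreements(nltk_tags, spacy_tags, PTB_TO_UPOS))
# → [('brown', 'NN', 'ADJ')]
```

The same helper cannot be applied directly to the tweet, because the two tokenisations differ: NLTK splits `@` and `john_doe` into two tokens while spaCy keeps `@john_doe` as one, so the token lists would need to be aligned first.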

Question 3: Stemming vs. Lemmatisation

Example sentence:
“The children are running and ate their meals quickly.”

(a) Use PorterStemmer or SnowballStemmer from NLTK to reduce words to their stems.

In [23]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = "The children are running and ate their meals quickly."
stemmer = PorterStemmer()

# Stem each token; PorterStemmer.stem lowercases its input by default.
stems = [stemmer.stem(word) for word in word_tokenize(sentence)]
print("Stems:", stems)

Stems: ['the', 'children', 'are', 'run', 'and', 'ate', 'their', 'meal', 'quickli', '.']


(b) Use token.lemma_ to get the lemma for each token

In [25]:
# Reuses the spaCy pipeline `nlp` loaded in Question 1 (en_core_web_sm).
sentence = "The children are running and ate their meals quickly."

lemmas = [token.lemma_ for token in nlp(sentence)]
print("Lemmas:", lemmas)

Lemmas: ['the', 'child', 'be', 'run', 'and', 'eat', 'their', 'meal', 'quickly', '.']


(c) Compare the outputs. Why does stemming sometimes cut words awkwardly (e.g.,
"quickly" → "quick") while lemmatisation returns "quickly"? When is stemming still
useful and when is lemmatisation preferable?
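
A quick way to see where the two approaches diverge is to line up the outputs from (a) and (b) and keep only the tokens whose stem and lemma differ. The sketch below hard-codes the two lists printed above; `differing_rows` is a hypothetical helper name. Note that the Porter stemmer used in (a) produced "quickli" rather than "quick".

```python
# Tokens of the example sentence and the outputs copied from parts (a) and (b).
tokens = ['The', 'children', 'are', 'running', 'and', 'ate', 'their', 'meals', 'quickly', '.']
stems = ['the', 'children', 'are', 'run', 'and', 'ate', 'their', 'meal', 'quickli', '.']
lemmas = ['the', 'child', 'be', 'run', 'and', 'eat', 'their', 'meal', 'quickly', '.']

def differing_rows(tokens, stems, lemmas):
    """Return (token, stem, lemma) triples where stem and lemma disagree."""
    return [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]

for token, stem, lemma in differing_rows(tokens, stems, lemmas):
    print(f"{token:10} stem={stem:10} lemma={lemma}")
```

The four rows that remain (children, are, ate, quickly) are the irregular forms and the suffix case: a suffix-stripping stemmer has no dictionary to map "ate" to "eat", while a lemmatiser that uses part-of-speech information can.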

Question 4: Named Entity Recognition (NER)

Data:
“Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission
that cost $500 million.”

(a) Use spaCy’s doc.ents to extract named entities.

In [27]:
doc = nlp("Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission that cost $500 million.")

print(doc.ents)

(Elon Musk, 2002, California, 2023, $500 million)


(b) Evaluate performance:
● Did spaCy miss any entity?
● Did it misclassify anything?
● Suggest one case where rule-based approaches (like regex) could work better, and
one case where machine learning (NER) is superior
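
As an illustration of the rule-based side of this comparison, entities with a rigid surface form, such as four-digit years and dollar amounts, can be picked out with regular expressions. This is a minimal sketch under stated assumptions: the two patterns and the helper `rule_based_entities` are invented for this example, and the labels DATE and MONEY are simply chosen here to resemble common NER label names.

```python
import re

TEXT = ("Elon Musk founded SpaceX in 2002 in California. "
        "In 2023, the company launched a mission that cost $500 million.")

# Hypothetical patterns: years from 1900 to 2099, and dollar amounts with an
# optional magnitude word.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
MONEY = re.compile(r"\$\d+(?:[.,]\d+)*(?:\s(?:thousand|million|billion))?")

def rule_based_entities(text):
    """Return (span text, label) pairs in order of appearance."""
    found = [(m.start(), m.group(), "DATE") for m in YEAR.finditer(text)]
    found += [(m.start(), m.group(), "MONEY") for m in MONEY.finditer(text)]
    return [(span, label) for _, span, label in sorted(found)]

print(rule_based_entities(TEXT))
# → [('2002', 'DATE'), ('2023', 'DATE'), ('$500 million', 'MONEY')]
```

Patterns like these are predictable and easy to audit, but they cannot recognise open-ended names such as "Elon Musk" or "SpaceX", which is where a trained NER model has the advantage.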

Question 5: Mini Project

Take a piece of text of your choice in any language you like apart from English (but
preferably one you understand). Run the full pipeline (make sure you choose appropriate
tokenisers, POS taggers, lemmatisers and NERs for the source language):
● Choose an appropriate tokeniser
● POS tag
● Lemmatise
● Extract named entities

Show all outputs (if using spaCy, use displacy for visualisation). Write a short report (1–2
paragraphs) about:
● What kinds of words/entities were recognised well
● What errors or limitations you observed while tokenising, POS tagging and lemmatising