CSCI 393: NLP
Assignment 1: Tokenisation, lemmatisation & stemming, POS-tagging & NER
Alibek Abilmazhit

Question 1: Tokenisation

Data:
“ATG-CGA-TTT-AGC”
“The quick brown fox jumps over the lazy dog.”
“Just landed in NYC!!! 😎✈️ #travel #blessed”

(a) For each of the examples, decide what kind of tokenisation strategy would be most appropriate:
1) “ATG-CGA-TTT-AGC”: Rule-based splitting. The example looks like a DNA sequence, or some other domain-specific string with a fixed structure, so rules written for the task are the most appropriate choice.
2) “The quick brown fox jumps over the lazy dog.”: Whitespace + punctuation-based tokenisation. The example is a regular sentence with nothing unusual in it, so whitespace + punctuation-based tokenisation handles it well.
3) “Just landed in NYC!!! 😎✈️ #travel #blessed”: A specialised tokenizer, such as NLTK's TweetTokenizer. The text contains emojis and hashtags, which TweetTokenizer is designed to handle.

(b) Implement tokenisation in Python using at least two methods or libraries (e.g. NLTK’s word_tokenize, spaCy’s built-in tokeniser) and compare results.

In [19]:
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
import spacy
import re

nltk.download('punkt_tab')
nlp = spacy.load("en_core_web_sm")
tweet_tokenizer = TweetTokenizer()

data = [
    "ATG-CGA-TTT-AGC",
    "The quick brown fox jumps over the lazy dog.",
    "Just landed in NYC!!! 😎✈️ #travel #blessed"
]

#ATG-CGA-TTT-AGC
print("1st example: ATG-CGA-TTT-AGC\n")
print(f"1) RegEx: {re.findall(r'[ACGT]{3}', data[0])}")
print(f"2) spaCy tokenizer: {[token.text for token in nlp(data[0])]}")

#The quick brown fox jumps over the lazy dog.
print("\n2nd example: The quick brown fox jumps over the lazy dog.\n")
print(f"1) NLTK word_tokenize: {word_tokenize(data[1])}")
print(f"2) spaCy tokenizer: {[token.text for token in nlp(data[1])]}")

#Just landed in NYC!!! 😎✈️ #travel #blessed
print("\n3rd example: Just landed in NYC!!! 😎✈️ #travel #blessed\n")
print("1) spaCy tokenizer:", [token.text for token in nlp(data[2])])
print("2) NLTK TweetTokenizer:", tweet_tokenizer.tokenize(data[2]))

1st example: ATG-CGA-TTT-AGC

1) RegEx: ['ATG', 'CGA', 'TTT', 'AGC']
2) spaCy tokenizer: ['ATG', '-', 'CGA', '-', 'TTT', '-', 'AGC']

2nd example: The quick brown fox jumps over the lazy dog.

1) NLTK word_tokenize: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
2) spaCy tokenizer: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

3rd example: Just landed in NYC!!! 😎✈️ #travel #blessed

1) spaCy tokenizer: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#', 'travel', '#', 'blessed']
2) NLTK TweetTokenizer: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#travel', '#blessed']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/alibekabilmazhit/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


For the 1st example (a DNA sequence), the spaCy tokenizer did reasonably well, but the RegEx did better because it does not emit "-" as a token, and the hyphen probably carries no significance for the task. Also, if the task needed single characters ('A', 'T', etc.), spaCy would not produce the right tokens, while the RegEx can easily be adapted to the task's requirements.
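To illustrate how easily the RegEx adapts, here is a minimal sketch using only the standard re module. The codon pattern is the one from the cell above; the single-nucleotide pattern and the variable names are illustrative assumptions about what a changed task might need.

```python
import re

sequence = "ATG-CGA-TTT-AGC"

# Same pattern as in the cell above: runs of exactly three nucleotides.
codons = re.findall(r"[ACGT]{3}", sequence)

# Hypothetical changed requirement: one token per nucleotide.
nucleotides = re.findall(r"[ACGT]", sequence)

print(codons)
print(nucleotides)
```

Only the pattern changes between the two calls; the hyphens are skipped in both cases because they never match the character class.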

For the 2nd example (a regular sentence), NLTK word_tokenize and the spaCy tokenizer produced identical output, splitting the sentence into word tokens plus the final full stop. Since there is nothing special about this sentence, most tokenizers would probably do similarly well.

For the 3rd example (a social media post), the spaCy tokenizer correctly tokenized the regular words and the 😎 emoji, but split each hashtag into "#" and the word. NLTK's TweetTokenizer, on the other hand, kept "#travel" and "#blessed" as single tokens. Both tokenizers split ✈️ into the plane symbol and a separate trailing character (presumably the emoji variation selector), so neither handled that emoji perfectly.
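The hashtag behaviour can also be reproduced with a few rules. The following is a rough sketch of a rule-based tweet tokenizer using only the standard re module. It is not how TweetTokenizer is implemented; the pattern, the function name simple_tweet_tokenize and the handling of the variation selector U+FE0F are my own assumptions for illustration.

```python
import re

# Alternatives are tried left to right:
#   1. a hashtag or mention kept whole (#travel, @john_doe)
#   2. a run of word characters
#   3. any other single non-space symbol, optionally followed by the
#      emoji variation selector U+FE0F so that it stays attached
TWEET_TOKEN = re.compile(r"[#@]\w+|\w+|[^\w\s]\ufe0f?")

def simple_tweet_tokenize(text: str) -> list:
    """Return the tokens of `text` according to the three rules above."""
    return TWEET_TOKEN.findall(text)

# The 3rd example, with the emojis written as escapes to make the
# code points explicit.
tweet = "Just landed in NYC!!! \U0001F60E\u2708\ufe0f #travel #blessed"
tokens = simple_tweet_tokenize(tweet)
print(tokens)
```

This sketch keeps hashtags, mentions and the plane emoji intact, but it is deliberately naive: contractions such as "can't" are split at the apostrophe, and multi-code-point emojis other than the symbol-plus-selector case are not handled.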

(c)  Discuss the pros and cons of each implemented tokeniser for each example.

RegEx: 

Good for specific tasks: it tokenizes exactly as the task requires (e.g., splitting into DNA triplets).
However, the pattern must be defined by hand, and if the task changes, it has to be rewritten by hand as well.

SpaCy:

Fast, modern tokenizer that works well on general-purpose sentences (example 2) and may occasionally suit domain-specific ones. With its default rules it is less suitable for social media text, as it splits hashtags into "#" and the word.



NLTK word_tokenize:

Works fine for general-purpose text, is easy to use, and supports several languages. However, it is likely to fail on specialised input such as tweets or domain-specific strings.

NLTK TweetTokenizer:

Well suited to social media text: it keeps hashtags as single tokens and handles most emojis. However, it may fall behind as social media posts evolve, e.g., with new emojis or unseen patterns.

Question 2:  POS tagging
Data:
"The quick brown fox jumps over the lazy dog."
"omg 😂 can't believe @john_doe said that #shocked"

(a) Use NLTK’s pos_tag on the tokenised text.

In [20]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

data = [
    "The quick brown fox jumps over the lazy dog.",
    "omg 😂 can't believe @john_doe said that #shocked"
]

for sentence in data:
    print(f"\n{sentence}")
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    print("POS Tags:", pos_tags)


The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]

omg 😂 can't believe @john_doe said that #shocked
POS Tags: [('omg', 'NN'), ('😂', 'NN'), ('ca', 'MD'), ("n't", 'RB'), ('believe', 'VB'), ('@', 'NNP'), ('john_doe', 'NN'), ('said', 'VBD'), ('that', 'IN'), ('#', '#'), ('shocked', 'VBD')]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alibekabilmazhit/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/alibekabilmazhit/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


(b) Use spaCy’s doc[i].pos_ to get POS tags.

In [21]:
doc1 = nlp("The quick brown fox jumps over the lazy dog.")
doc2 = nlp("omg 😂 can't believe @john_doe said that #shocked")

print("spaCy POS Tags:\n")

print("1st sentence:")  
print([(token.text, token.pos_) for token in doc1])
print("\n2nd sentence:")
print([(token.text, token.pos_) for token in doc2])

spaCy POS Tags:

1st sentence:
[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', 'PUNCT')]

2nd sentence:
[('omg', 'X'), ('😂', 'PROPN'), ('ca', 'AUX'), ("n't", 'PART'), ('believe', 'VERB'), ('@john_doe', 'NUM'), ('said', 'VERB'), ('that', 'SCONJ'), ('#', 'PUNCT'), ('shocked', 'VERB')]


(c) Compare outputs — are they identical? Where do they differ? Which one is better for the
tweet?

The two taggers use different tag names (NLTK uses Penn Treebank tags, spaCy's pos_ uses Universal POS tags), but that difference is not significant.
On the simple general sentence they perform almost the same; however, NLTK's pos_tag incorrectly tagged 'brown' as a noun, while the spaCy tagger correctly marked it as an adjective. On the tweet, neither did well. On the NLTK side, word_tokenize split @john_doe into '@' and 'john_doe', and the tagger labelled john_doe as a common noun (NN) rather than a proper noun. It tagged 'omg' as a noun instead of an interjection, although I think that is still better than spaCy's X (other). The hashtag was also split into two tokens. spaCy does a better job of tagging the n't of can't as PART rather than NLTK's RB (adverb). However, it also does not handle the hashtag well, splitting it and tagging '#' as punctuation and 'shocked' as a verb. It also incorrectly marks @john_doe as a number (NUM), which may significantly affect downstream tasks.

Neither tagger is well suited to tweets, but if I had to choose, I think NLTK would do a better job, as it at least identified john_doe as a noun, which is important for understanding the meaning of the tweet. The other differences in performance are not that significant, in my opinion, so I choose NLTK.
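To compare the two tagsets directly, the Penn Treebank tags can be mapped onto Universal POS tags. The sketch below does this for the 1st sentence only, using the outputs printed in the two cells above. The mapping dictionary is hand-written and partial (an assumption of mine covering just the tags that occur in this sentence), not an official conversion table.

```python
# Outputs copied from the two cells above (1st sentence only).
nltk_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
             ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
             ('dog', 'NN'), ('.', '.')]
spacy_tags = [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'),
              ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'),
              ('dog', 'NOUN'), ('.', 'PUNCT')]

# Hand-written, partial Penn Treebank -> Universal POS mapping
# (assumption: only the tags that occur in this sentence are covered).
PTB_TO_UPOS = {'DT': 'DET', 'JJ': 'ADJ', 'NN': 'NOUN', 'VBZ': 'VERB',
               'IN': 'ADP', '.': 'PUNCT'}

# Collect (word, mapped NLTK tag, spaCy tag) wherever the taggers disagree.
disagreements = [
    (word, PTB_TO_UPOS[ptb], upos)
    for (word, ptb), (_, upos) in zip(nltk_tags, spacy_tags)
    if PTB_TO_UPOS[ptb] != upos
]
print(disagreements)
```

The token-by-token comparison works here only because both tools produced the same ten tokens. For the tweet the tokenisations differ (NLTK splits @john_doe, spaCy does not), so the two lists would have to be aligned before a comparison like this makes sense.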

Question 3: Stemming vs. Lemmatisation

Example sentence:
“The children are running and ate their meals quickly."

(a) Use PorterStemmer or SnowballStemmer from NLTK to reduce words to their stems.

In [23]:
from nltk.stem import PorterStemmer  # explicit import instead of a wildcard
from nltk.tokenize import word_tokenize

sentence = "The children are running and ate their meals quickly."
stemmer = PorterStemmer()

stems = [stemmer.stem(word) for word in word_tokenize(sentence)]
print("Stems:", stems)

Stems: ['the', 'children', 'are', 'run', 'and', 'ate', 'their', 'meal', 'quickli', '.']


(b) Use token.lemma_ to get the lemma for each token

In [23]:
sentence = "The children are running and ate their meals quickly."

lemmas = [word.lemma_ for word in nlp(sentence)]
print("Lemmas:", lemmas)

Lemmas: ['the', 'child', 'be', 'run', 'and', 'eat', 'their', 'meal', 'quickly', '.']


(c) Compare the outputs - Why does stemming sometimes cut words awkwardly (e.g.,
"quickly" → "quick") while lemmatisation returns "quickly"? When is stemming still
useful and when is lemmatisation preferable?


Lemmatisation reduces each word to its dictionary base form (lemma), while stemming strips it down to a stem that need not be a real word. The reason is that stemming applies heuristic suffix rules without consulting a dictionary and without looking at context, so it does not care whether the resulting string is a valid word. Lemmatisation, on the other hand, uses morphological analysis and POS information, and therefore returns the actual base form. That is why stemming turned "quickly" into "quickli" (a rule rewrote the final "y" as "i"), while lemmatisation kept "quickly" (it is already in its base form).
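The contrast can be made concrete with a toy example. The sketch below is not the Porter algorithm and not spaCy's lemmatiser: toy_stem applies three context-free suffix rules that I picked to mimic the behaviour seen above, and toy_lemma is a plain lookup in a tiny hand-made table (LEMMA_TABLE is a hypothetical stand-in for a real morphological dictionary).

```python
VOWELS = "aeiou"

def toy_stem(word: str) -> str:
    """Apply a few context-free suffix rules; the result need not be a real word."""
    w = word.lower()
    # Plural rule: drop a final "s" (but not "ss") from longer words.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    # "-ing" rule: drop the suffix if a vowel remains, then undouble
    # a doubled final consonant ("runn" -> "run").
    if w.endswith("ing") and any(c in VOWELS for c in w[:-3]):
        w = w[:-3]
        if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in VOWELS:
            w = w[:-1]
    # Final "y" -> "i" if the rest of the word contains a vowel.
    if w.endswith("y") and any(c in VOWELS for c in w[:-1]):
        w = w[:-1] + "i"
    return w

# Hypothetical mini-dictionary standing in for real morphological knowledge.
LEMMA_TABLE = {"children": "child", "are": "be", "running": "run",
               "ate": "eat", "meals": "meal"}

def toy_lemma(word: str) -> str:
    """Look the word up; unknown words are assumed to be base forms already."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

words = "The children are running and ate their meals quickly".split()
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemma(w) for w in words]
print(stems)
print(lemmas)
```

The rules know nothing about irregular forms, so "children" and "ate" pass through unchanged, and the "y" rule produces the non-word "quickli". The lookup gets the irregular forms right, but only because someone put them in the table, which is the cost of lemmatisation.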

Stemming is useful in tasks that do not require exact word forms, since it is fast: for example, search engines (IR) and quick preprocessing of large text corpora. Lemmatisation is preferred when the actual word form matters, for instance in translation or topic modelling.

Question 4: Named Entity Recognition (NER)

Data:
“Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission
that cost $500 million."

(a) Use spaCy’s doc.ents to extract named entities.

In [24]:
doc = nlp("Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission that cost $500 million.")

for ent in doc.ents:
    print(f"{ent.text:25} {ent.label_}")

Elon Musk                 PERSON
2002                      DATE
California                GPE
2023                      DATE
$500 million              MONEY


(b) Evaluate performance:
● Did spaCy miss any entity?
● Did it misclassify anything?
● Suggest one case where rule-based approaches (like regex) could work better, and
one case where machine learning (NER) is superior

In the given text, spaCy missed the organisation entity "SpaceX". Other than that, it found and classified each entity correctly. If the text contained no person or organisation names, RegEx could be used instead, since dates and money values follow fairly straightforward patterns. That does not hold for persons and organisations: there is no single easily identified pattern for names, so a properly trained NER model is better suited to identifying them.
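As a small illustration of the rule-based case, the sketch below extracts the years and the money amount from the same text with the standard re module. The two patterns (YEAR and MONEY) are my own assumptions about what "straightforward" looks like: they cover four-digit years starting with 19 or 20 and dollar amounts with an optional magnitude word, and nothing else.

```python
import re

text = ("Elon Musk founded SpaceX in 2002 in California. In 2023, the company "
        "launched a mission that cost $500 million.")

# Four-digit years in the 1900s or 2000s.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
# Dollar amounts, optionally with decimals and a magnitude word.
MONEY = re.compile(r"\$\d+(?:\.\d+)?(?: (?:thousand|million|billion))?")

years = YEAR.findall(text)
money = MONEY.findall(text)
print(years)
print(money)
```

No comparable pattern exists for "Elon Musk" or "SpaceX": capitalisation alone would also match "California" and "In", which is why names are better left to a trained model.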

Question 5: Mini Project

Take a piece of text of your choice in any language you like apart from English (but
preferably one you understand). Run the full pipeline (make sure you choose appropriate
tokenisers, POS taggers, lemmatisers and NERs for the source language):
● Choose an appropriate tokeniser
● POS tag
● Lemmatise
● Extract named entities

Show all outputs (if using spaCy, use displacy for visualisation). Write a short report (1–2
paragraphs) about:
● What kinds of words/entities were recognised well
● What errors or limitations you observed while tokenising, POS tagging and lemmatising

In [16]:
import IPython.display
import sys

sys.modules['IPython.core.display'] = IPython.display # IPython.core.display was deprecated recently, but spaCy still utilizes it
                                                      # in order to resolve this issue, I added this line

import spacy
from spacy import displacy
from tabulate import tabulate


nlp = spacy.load("ru_core_news_sm")  # load the Russian pipeline once


text = ("Настоящий Закон регулирует общественные отношения в сфере прохождения воинской службы гражданами Республики Казахстан и определяет основы государственной политики по социальному обеспечению военнослужащих. "
        "Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ Статья 1. Основные понятия, используемые в настоящем Законе. В настоящем Законе используются следующие основные понятия: 1) адъюнкт – военнослужащий офицерского и сержантского составов,"
        " обучающийся в иностранном военном учебном заведении, реализующем образовательные программы послевузовского образования; 1-1) лица гражданского персонала (работники) – граждане Республики Казахстан, находящиеся "
        "на государственной службе или состоящие в трудовых отношениях в Вооруженных Силах Республики Казахстан, других войсках и воинских формированиях (далее – Вооруженные Силы, другие войска и воинские формирования); "
        "2) переменный состав – категория военнослужащих Вооруженных Сил, других войск и воинских формирований, обучающихся в военных, специальных учебных заведениях, не входящих в штатную численность Вооруженных Сил, других войск и воинских формирований;"
        " 3) военнослужащие, проходящие воинскую службу по призыву, – граждане Республики Казахстан, призванные на воинскую службу в Вооруженные Силы, другие войска и воинские формирования на срок, предусмотренный настоящим Законом;"
        " 4) отсрочка – перенос срока призыва граждан на воинскую службу по основаниям, предусмотренным настоящим Законом;"
        " Примечание ИЗПИ! В подпункт 5) предусматривается изменение Законом РК от 16.07.2025 № 211-VIII (вводится в действие по истечении шестидесяти календарных дней после дня его первого официального опубликования)."
        " 5) допризывники – граждане Республики Казахстан мужского пола, проходящие подготовку к воинской службе до принятия на воинский учет; Примечание ИЗПИ!"
        " В подпункт 6) предусматривается изменение Законом РК от 16.07.2025 № 211-VIII (вводится в действие по истечении шестидесяти календарных дней после дня его первого официального опубликования)."
        " 6) призывники – граждане Республики Казахстан мужского пола, приписанные к призывным участкам местных органов военного управления и подлежащие призыву на срочную воинскую службу;"
        " 7) воинское звание – знак воинского различия, присваиваемый военнослужащему и военнообязанному;"
        " 8) военный билет – единый бессрочный личный учетно-воинский документ гражданина, определяющий его принадлежность к воинской службе и отношение к воинской обязанности;"
        " 9) штат воинской части (учреждения) – документ, определяющий состав, организационно-штатную структуру, численность личного состава и количество закрепленного основного вооружения и военной техники в соответствии с кадастром вооружения и военной техники Вооруженных Сил, других войск и воинских формирований;"
        " 10) воинский учет – система учета и анализа количественных и качественных данных о призывниках, военнослужащих и мобилизационных ресурсах;"
        " 11) воинские сборы – мероприятия, проводимые органами военного управления, уполномоченными государственными органами по военной подготовке в целях приобретения и совершенствования военных знаний военнообязанными и гражданами, а также в иных случаях, предусмотренных законами Республики Казахстан.")

doc = nlp(text)

rows = [(token.text, token.pos_, token.lemma_) for token in doc]
print(tabulate(rows, headers=["Token", "POS", "Lemma"], tablefmt="simple"))

displacy.render(doc, style="dep")
displacy.render(doc, style="ent")

Token              POS    Lemma
-----------------  -----  -----------------
Настоящий          ADJ    настоящий
Закон              NOUN   закон
регулирует         VERB   регулировать
общественные       ADJ    общественный
отношения          NOUN   отношение
в                  ADP    в
сфере              NOUN   сфера
прохождения        NOUN   прохождение
воинской           ADJ    воинский
службы             NOUN   служба
гражданами         NOUN   гражданин
Республики         PROPN  республика
Казахстан          PROPN  казахстан
и                  CCONJ  и
определяет         VERB   определять
основы             NOUN   основа
государственной    ADJ    государственный
политики           NOUN   политика
по                 ADP    по
социальному        ADJ    социальный
обеспечению        NOUN   обеспечение
военнослужащих     NOUN   военнослужащий
.                  PUNCT  .
Глава              NOUN   глава
1                  NUM    1
.                  PUNCT  .
ОБЩИЕ              NOUN   общие

The selected text is an excerpt from the Military Law of Kazakhstan, so it has a fairly specialised legal vocabulary. I ran the full pipeline using the spaCy Russian model.



The model tokenized most of the text correctly into separate words, numbers and punctuation. The POS tagger tagged most words correctly, with occasional mistakes (discussed in the next paragraph). The lemmatiser identified the base forms of the words correctly. Proper names were generally identified correctly, e.g., "Республика Казахстан" was tagged PROPN.
Overall, general non-domain-specific words were recognised well.

However, there are some limitations. Most of them stem from domain-specific words and entities. Firstly, the tokenizer split references to parts of the legal document into multiple tokens, although it would be preferable to keep each as a single token. For instance, "Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ" was split into 5 tokens instead of 1 "legal" token. The POS tagger made occasional mistakes, such as tagging "Вооруженные" in "Вооруженные Силы" as VERB instead of PROPN. NER in the Russian spaCy model is significantly limited, with only the LOC, ORG and PER labels available. Since it has no LAW label like the English spaCy model, it identified the "Закон" entity as a person. Occasionally, the organisation "Вооруженные Силы" was incorrectly identified as a location, and "Общие" as an organisation. All of this suggests that the model lacks training on legal texts. So, while it is suitable for general-purpose texts and tasks, legal texts may require some fine-tuning or other modifications.
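One possible modification is a rule-based layer for the legal references. The sketch below uses only the standard re module to find chapter/article headings, dates and law numbers in a short sample built from phrases in the text above. The label names (SECTION, DATE, LAW_NUMBER) and the patterns are hypothetical, chosen only for this excerpt; a real system would need many more rules.

```python
import re

# Short sample assembled from phrases that occur in the analysed text.
sample = ("Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ Статья 1. Основные понятия. "
          "Законом РК от 16.07.2025 № 211-VIII")

# Hypothetical labels and patterns for this excerpt only.
LEGAL_PATTERNS = {
    # "Глава 1" (chapter) or "Статья 1" (article) as one unit.
    "SECTION": re.compile(r"(?:Глава|Статья)\s+\d+"),
    # Dates written as DD.MM.YYYY.
    "DATE": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
    # Law numbers such as "№ 211-VIII" (number plus Roman numeral).
    "LAW_NUMBER": re.compile(r"№\s*\d+-[IVXLC]+"),
}

matches = {label: pattern.findall(sample)
           for label, pattern in LEGAL_PATTERNS.items()}

for label, found in matches.items():
    print(label, found)
```

Spans found this way could then be merged into single tokens or added as extra entities on top of the statistical model, so that the trained components only have to deal with the general-purpose part of the text.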