CSCI 393: NLP
Assignment 1: Tokenisation, lemmatisation & stemming, POS-tagging & NER
Alibek Abilmazhit

Question 1: Tokenisation

Data:
“ATG-CGA-TTT-AGC”
“The quick brown fox jumps over the lazy dog.”
“Just landed in NYC!!! 😎✈️ #travel #blessed”

(a) For each of the examples, decide what kind of tokenisation strategy would be most appropriate:
1) “ATG-CGA-TTT-AGC”: Rule-based splitting. The example looks like a DNA sequence, or some other domain-specific string with a fixed structure, so rules written for the task are the most appropriate choice.
2) “The quick brown fox jumps over the lazy dog.”: Whitespace + punctuation-based tokenisation. This is a regular sentence with nothing domain-specific, so splitting on whitespace and punctuation handles it well.
3) “Just landed in NYC!!! 😎✈️ #travel #blessed”: A specialised tokenizer, such as NLTK's TweetTokenizer. The text contains emojis and hashtags, which TweetTokenizer is designed to handle.
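
As a rough illustration of the three strategies, the sketch below uses only Python's re module. The function names and patterns are assumptions made for this example; they are not the tokenisers used in part (b), and the tweet pattern is far simpler than NLTK's TweetTokenizer.

```python
import re

def tokenize_codons(sequence):
    """Rule-based: return every run of exactly three nucleotide letters."""
    return re.findall(r"[ACGT]{3}", sequence)

def tokenize_plain(text):
    """Whitespace + punctuation: words, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_tweet(text):
    """Keep #hashtags and @mentions whole; otherwise behave like tokenize_plain."""
    return re.findall(r"[#@]\w+|\w+|[^\w\s]", text)

print(tokenize_codons("ATG-CGA-TTT-AGC"))
print(tokenize_plain("The quick brown fox jumps over the lazy dog."))
print(tokenize_tweet("Just landed in NYC!!! #travel #blessed"))
```

The only difference between the last two functions is the extra `[#@]\w+` alternative, which is tried first so that a hashtag or mention is consumed as one token before the generic rules can split it.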

(b) Implement tokenisation in Python using at least two methods or libraries (e.g. NLTK’s word_tokenize, spaCy’s built-in tokeniser) and compare results.

In [19]:
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
import spacy
import re

nltk.download('punkt_tab')
nlp = spacy.load("en_core_web_sm")
tweet_tokenizer = TweetTokenizer()

data = [
    "ATG-CGA-TTT-AGC",
    "The quick brown fox jumps over the lazy dog.",
    "Just landed in NYC!!! 😎✈️ #travel #blessed"
]

#ATG-CGA-TTT-AGC
print("1st example: ATG-CGA-TTT-AGC\n")
print(f"1) RegEx: {re.findall(r'[ACGT]{3}', data[0])}")
print(f"2) spaCy tokenizer: {[token.text for token in nlp(data[0])]}")

#The quick brown fox jumps over the lazy dog.
print("\n2nd example: The quick brown fox jumps over the lazy dog.\n")
print(f"1) NLTK word_tokenize: {word_tokenize(data[1])}")
print(f"2) spaCy tokenizer: {[token.text for token in nlp(data[1])]}")

#Just landed in NYC!!! 😎✈️ #travel #blessed
print("\n3rd example: Just landed in NYC!!! 😎✈️ #travel #blessed\n")
print("1) spaCy tokenizer:", [token.text for token in nlp(data[2])])
print("2) NLTK TweetTokenizer:", tweet_tokenizer.tokenize(data[2]))

1st example: ATG-CGA-TTT-AGC

1) RegEx: ['ATG', 'CGA', 'TTT', 'AGC']
2) spaCy tokenizer: ['ATG', '-', 'CGA', '-', 'TTT', '-', 'AGC']

2nd example: The quick brown fox jumps over the lazy dog.

1) NLTK word_tokenize: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
2) spaCy tokenizer: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

3rd example: Just landed in NYC!!! 😎✈️ #travel #blessed

1) spaCy tokenizer: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#', 'travel', '#', 'blessed']
2) NLTK TweetTokenizer: ['Just', 'landed', 'in', 'NYC', '!', '!', '!', '😎', '✈', '️', '#travel', '#blessed']


For the 1st example (a DNA sequence), the spaCy tokenizer did reasonably well, but the RegEx did a better job because it does not emit "-" as a token; the hyphen probably holds no significance for the task. Also, if the task needed single characters ('A', 'T', etc.) as tokens, spaCy would not produce them, while the RegEx can easily be changed to fit the task's requirements.

For the 2nd example (a regular sentence), NLTK's word_tokenize and the spaCy tokenizer produced identical output, splitting the sentence into word tokens plus the final full stop. Since there is nothing special about this sentence, most general-purpose tokenizers would probably do similarly well.

For the 3rd example (a social media post), the spaCy tokenizer correctly tokenized the regular words and the emojis, but split each hashtag into "#" and the word. NLTK's TweetTokenizer, on the other hand, kept "#travel" and "#blessed" as single tokens. Both tokenizers split the aeroplane emoji into its base character and a separate variation selector.

(c)  Discuss the pros and cons of each implemented tokeniser for each example.

RegEx: 

Good for narrow, well-defined tasks: it tokenizes exactly as the task requires (for example, splitting a DNA sequence into triplets).
However, the pattern has to be written by hand, and if the task changes, the RegEx has to be changed by hand as well.
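
To make that maintenance cost concrete, the sketch below shows how the pattern has to be edited when the required unit changes from triplets to single bases. The helper name split_dna and its unit parameter are hypothetical, introduced only for this illustration.

```python
import re

def split_dna(sequence, unit=3):
    """Split a hyphen-separated DNA string into chunks of `unit` bases."""
    bases = sequence.replace("-", "")
    # Reject anything that is not a nucleotide letter instead of silently skipping it.
    if not re.fullmatch(r"[ACGT]*", bases):
        raise ValueError(f"unexpected characters in {sequence!r}")
    # A trailing chunk shorter than `unit` is dropped by findall.
    return re.findall(rf"[ACGT]{{{unit}}}", bases)

print(split_dna("ATG-CGA-TTT-AGC"))    # triplets
print(split_dna("ATG-CGA", unit=1))    # single bases
```

Even in this small case the rule encodes several task decisions (alphabet, separator, chunk size), and each one would need revisiting if the input format changed.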

spaCy:

A fast, modern tokenizer that works well on general-purpose sentences (example 2) and may occasionally suit domain-specific ones. With its default rules it is less suitable for social media text, because it splits hashtags into "#" and the word (example 3), and on the DNA string it emits the hyphens as tokens (example 1).

NLTK word_tokenize:

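Works fine for general-purpose text (example 2), is easy to use, and supports several languages. However, it is likely to fail on domain-specific input; in Question 2, for instance, it splits "@john_doe" and "#shocked" into separate symbol and word tokens.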

NLTK TweetTokenizer:

Well suited to social media text: it keeps hashtags whole and handles emojis (example 3). However, it may fail as social media posts evolve over time, e.g. with new emojis or unseen patterns.

Question 2:  POS tagging
Data:
"The quick brown fox jumps over the lazy dog."
"omg 😂 can't believe @john_doe said that #shocked"

(a) Use NLTK’s pos_tag on the tokenised text.

In [20]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

data = [
    "The quick brown fox jumps over the lazy dog.",
    "omg 😂 can't believe @john_doe said that #shocked"
]

for sentence in data:
    print(f"\n{sentence}")
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    print("POS Tags:", pos_tags)


The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]

omg 😂 can't believe @john_doe said that #shocked
POS Tags: [('omg', 'NN'), ('😂', 'NN'), ('ca', 'MD'), ("n't", 'RB'), ('believe', 'VB'), ('@', 'NNP'), ('john_doe', 'NN'), ('said', 'VBD'), ('that', 'IN'), ('#', '#'), ('shocked', 'VBD')]



(b) Use spaCy’s doc[i].pos_ to get POS tags.

In [21]:
doc1 = nlp("The quick brown fox jumps over the lazy dog.")
doc2 = nlp("omg 😂 can't believe @john_doe said that #shocked")

print("spaCy POS Tags:\n")

print("1st sentence:")  
print([(token.text, token.pos_) for token in doc1])
print("\n2nd sentence:")
print([(token.text, token.pos_) for token in doc2])

spaCy POS Tags:

1st sentence:
[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', 'PUNCT')]

2nd sentence:
[('omg', 'X'), ('😂', 'PROPN'), ('ca', 'AUX'), ("n't", 'PART'), ('believe', 'VERB'), ('@john_doe', 'NUM'), ('said', 'VERB'), ('that', 'SCONJ'), ('#', 'PUNCT'), ('shocked', 'VERB')]


(c) Compare outputs - are they identical? Where do they differ? Which one is better for the
tweet?

The two taggers use different tagsets (NLTK's pos_tag returns Penn Treebank tags, spaCy's pos_ returns Universal POS tags), so the tag names differ, but this is not significant.
On the simple general sentence they agree almost everywhere; the exception is 'brown', which NLTK's pos_tag incorrectly tagged as a noun while spaCy correctly tagged it as an adjective. On the tweet, neither did well. In the NLTK pipeline, word_tokenize split @john_doe into '@' and 'john_doe', and the tagger labelled 'john_doe' as a common noun (NN) rather than a proper noun. It tagged 'omg' as a noun instead of an interjection, which I think is still better than spaCy's X (the "other" category). The hashtag was also split into two tokens. For n't in can't, spaCy gives PART and NLTK gives RB (adverb); each is the usual convention of its own tagset, so this is a difference in naming rather than an error. spaCy does not handle the hashtag well either: it tags '#' as punctuation and 'shocked' as a verb. It also tags '@john_doe' as a number (NUM), which may significantly affect downstream tasks.

Neither tagger is well suited to tweets, but if I had to choose, I would pick NLTK, because it at least tagged john_doe as a noun, which matters for understanding the meaning of the tweet. In my opinion, the other differences in performance are not that significant.
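
Because the two taggers use different tagsets, one way to compare them mechanically is to map one tagset onto the other first. The sketch below does this for the first sentence only, using the outputs recorded above. The mapping table is partial and covers just the tags that occur in that sentence, and the function name disagreements is an assumption made for this example.

```python
# Partial Penn Treebank -> Universal POS mapping, covering only the tags seen above.
PTB_TO_UPOS = {
    "DT": "DET", "JJ": "ADJ", "NN": "NOUN", "VBZ": "VERB",
    "IN": "ADP", ".": "PUNCT",
}

nltk_tags = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
             ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
             ("dog", "NN"), (".", ".")]
spacy_tags = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
              ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
              ("dog", "NOUN"), (".", "PUNCT")]

def disagreements(ptb_tagged, upos_tagged):
    """Return (token, mapped NLTK tag, spaCy tag) wherever the taggers differ."""
    diffs = []
    for (tok_a, ptb), (tok_b, upos) in zip(ptb_tagged, upos_tagged):
        if tok_a != tok_b:
            raise ValueError("token sequences must be aligned")
        mapped = PTB_TO_UPOS.get(ptb, "X")  # unknown tags fall back to X
        if mapped != upos:
            diffs.append((tok_a, mapped, upos))
    return diffs

print(disagreements(nltk_tags, spacy_tags))
```

The tweet cannot be compared this way without extra work, because the two pipelines tokenise it differently (NLTK yields '@' and 'john_doe', spaCy yields '@john_doe'), so the token sequences do not align.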

Question 3: Stemming vs. Lemmatisation

Example sentence:
“The children are running and ate their meals quickly.”

(a) Use PorterStemmer or SnowballStemmer from NLTK to reduce words to their stems.

In [23]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = "The children are running and ate their meals quickly."
stemmer = PorterStemmer()

stems = [stemmer.stem(word) for word in word_tokenize(sentence)]
print("Stems:", stems)

Stems: ['the', 'children', 'are', 'run', 'and', 'ate', 'their', 'meal', 'quickli', '.']


(b) Use token.lemma_ to get the lemma for each token

In [23]:
sentence = "The children are running and ate their meals quickly."

lemmas = [word.lemma_ for word in nlp(sentence)]
print("Lemmas:", lemmas)

Lemmas: ['the', 'child', 'be', 'run', 'and', 'eat', 'their', 'meal', 'quickly', '.']


(c) Compare the outputs - Why does stemming sometimes cut words awkwardly (e.g.,
"quickly" → "quick") while lemmatisation returns "quickly"? When is stemming still
useful and when is lemmatisation preferable?


Stemming strips suffixes to get a stem, while lemmatisation returns the dictionary form (lemma) of the word. Stemming applies heuristic rules without looking the word up in a dictionary and without considering context, so it does not check whether the result is a real word. Lemmatisation, on the other hand, uses morphological analysis and POS information, and therefore returns an actual base form. That is why the stemmer produced "quickli" (the Porter rule that rewrites a final "y" as "i" was applied), while the lemmatiser kept "quickly" (as an adverb, it is already in its base form). For the same reason, the stemmer left the irregular forms "children" and "ate" unchanged, whereas the lemmatiser mapped them to "child" and "eat".

Stemming is useful in tasks that do not require exact word forms, since it is fast: search engines (IR) and quick preprocessing of large text corpora. Lemmatisation is preferred in tasks where the actual form matters, for instance translation or topic modelling.
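
The contrast can be sketched with a toy example. The code below is not the Porter algorithm and not spaCy's lemmatiser: toy_stem applies three hand-picked suffix rules, and toy_lemma is a tiny lookup table standing in for real lexical knowledge. Both names and the rule set are assumptions chosen so that the behaviour on this one sentence is easy to follow.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Apply a few ordered suffix rules; no dictionary, no context."""
    word = word.lower()
    if word.endswith("ing") and VOWELS & set(word[:-3]):
        word = word[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Rewrite a final "y" as "i" when the rest of the word contains a vowel.
    if word.endswith("y") and VOWELS & set(word[:-1]):
        word = word[:-1] + "i"
    return word

# A lemmatiser needs lexical knowledge; this lookup table stands in for it.
TOY_LEMMAS = {"children": "child", "are": "be", "running": "run",
              "ate": "eat", "meals": "meal"}

def toy_lemma(word):
    """Return the listed lemma, or the lower-cased word if it is not listed."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for w in ["children", "running", "ate", "meals", "quickly"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemma(w)}")
```

The suffix rules never see that "children" and "ate" are irregular forms, which is exactly the kind of case that only a dictionary or a morphological analyser can resolve.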

Question 4: Named Entity Recognition (NER)

Data:
“Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission
that cost $500 million.”

(a) Use spaCy’s doc.ents to extract named entities.

In [24]:
doc = nlp("Elon Musk founded SpaceX in 2002 in California. In 2023, the company launched a mission that cost $500 million.")

for ent in doc.ents:
    print(f"{ent.text:25} {ent.label_}")

Elon Musk                 PERSON
2002                      DATE
California                GPE
2023                      DATE
$500 million              MONEY


(b) Evaluate performance:
● Did spaCy miss any entity?
● Did it misclassify anything?
● Suggest one case where rule-based approaches (like regex) could work better, and
one case where machine learning (NER) is superior

In the given sentence, spaCy missed "SpaceX", which should have been labelled as an organisation (ORG). Other than that, it classified every entity it found correctly. A rule-based approach such as RegEx could work better for dates and money values, because they follow fairly regular patterns. This does not hold for persons and organisations: their names follow no single, easily identified pattern, so a properly trained NER model is better suited to identifying them.
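
As a minimal sketch of the rule-based side, the patterns below pick out the dates and the money value from the example sentence using only the re module. The patterns, the label names and the function rule_based_entities are assumptions written for this one sentence; they are not a general date or currency recogniser.

```python
import re

TEXT = ("Elon Musk founded SpaceX in 2002 in California. "
        "In 2023, the company launched a mission that cost $500 million.")

# Four-digit years from 1900 to 2099; an assumption that suits this sentence only.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
# A dollar amount with an optional magnitude word.
MONEY = re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:thousand|million|billion))?")

def rule_based_entities(text):
    """Return (text, label) pairs found by the two patterns, in reading order."""
    found = [(m.start(), m.group(), "DATE") for m in YEAR.finditer(text)]
    found += [(m.start(), m.group(), "MONEY") for m in MONEY.finditer(text)]
    return [(span, label) for _, span, label in sorted(found)]

print(rule_based_entities(TEXT))
```

No comparable pattern exists for "Elon Musk" or "SpaceX": capitalisation alone would also match "In" and "California", which is why a trained model is the better tool for names.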

Question 5: Mini Project

Take a piece of text of your choice in any language you like apart from English (but
preferably one you understand). Run the full pipeline (make sure you choose appropriate
tokenisers, POS taggers, lemmatisers and NERs for the source language):
● Choose an appropriate tokeniser
● POS tag
● Lemmatise
● Extract named entities

Show all outputs (if using spaCy, use displacy for visualisation). Write a short report (1–2
paragraphs) about:
● What kinds of words/entities were recognised well
● What errors or limitations you observed while tokenising, POS tagging and lemmatising

In [16]:
import IPython.display
import sys

# IPython.core.display was deprecated recently, but spaCy still uses it;
# aliasing the module name to IPython.display resolves the issue.
sys.modules['IPython.core.display'] = IPython.display

import spacy
from spacy import displacy
from tabulate import tabulate


# Load the small Russian pipeline once; a second load via the package import is redundant.
nlp = spacy.load("ru_core_news_sm")


text = ("Настоящий Закон регулирует общественные отношения в сфере прохождения воинской службы гражданами Республики Казахстан и определяет основы государственной политики по социальному обеспечению военнослужащих. "
        "Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ Статья 1. Основные понятия, используемые в настоящем Законе. В настоящем Законе используются следующие основные понятия: 1) адъюнкт – военнослужащий офицерского и сержантского составов,"
        " обучающийся в иностранном военном учебном заведении, реализующем образовательные программы послевузовского образования; 1-1) лица гражданского персонала (работники) – граждане Республики Казахстан, находящиеся "
        "на государственной службе или состоящие в трудовых отношениях в Вооруженных Силах Республики Казахстан, других войсках и воинских формированиях (далее – Вооруженные Силы, другие войска и воинские формирования); "
        "2) переменный состав – категория военнослужащих Вооруженных Сил, других войск и воинских формирований, обучающихся в военных, специальных учебных заведениях, не входящих в штатную численность Вооруженных Сил, других войск и воинских формирований;"
        " 3) военнослужащие, проходящие воинскую службу по призыву, – граждане Республики Казахстан, призванные на воинскую службу в Вооруженные Силы, другие войска и воинские формирования на срок, предусмотренный настоящим Законом;"
        " 4) отсрочка – перенос срока призыва граждан на воинскую службу по основаниям, предусмотренным настоящим Законом;"
        " Примечание ИЗПИ! В подпункт 5) предусматривается изменение Законом РК от 16.07.2025 № 211-VIII (вводится в действие по истечении шестидесяти календарных дней после дня его первого официального опубликования)."
        " 5) допризывники – граждане Республики Казахстан мужского пола, проходящие подготовку к воинской службе до принятия на воинский учет; Примечание ИЗПИ!"
        " В подпункт 6) предусматривается изменение Законом РК от 16.07.2025 № 211-VIII (вводится в действие по истечении шестидесяти календарных дней после дня его первого официального опубликования)."
        " 6) призывники – граждане Республики Казахстан мужского пола, приписанные к призывным участкам местных органов военного управления и подлежащие призыву на срочную воинскую службу;"
        " 7) воинское звание – знак воинского различия, присваиваемый военнослужащему и военнообязанному;"
        " 8) военный билет – единый бессрочный личный учетно-воинский документ гражданина, определяющий его принадлежность к воинской службе и отношение к воинской обязанности;"
        " 9) штат воинской части (учреждения) – документ, определяющий состав, организационно-штатную структуру, численность личного состава и количество закрепленного основного вооружения и военной техники в соответствии с кадастром вооружения и военной техники Вооруженных Сил, других войск и воинских формирований;"
        " 10) воинский учет – система учета и анализа количественных и качественных данных о призывниках, военнослужащих и мобилизационных ресурсах;"
        " 11) воинские сборы – мероприятия, проводимые органами военного управления, уполномоченными государственными органами по военной подготовке в целях приобретения и совершенствования военных знаний военнообязанными и гражданами, а также в иных случаях, предусмотренных законами Республики Казахстан.")

doc = nlp(text)

rows = [(token.text, token.pos_, token.lemma_) for token in doc]
print(tabulate(rows, headers=["Token", "POS", "Lemma"], tablefmt="simple"))

displacy.render(doc, style="dep")
displacy.render(doc, style="ent")

Token              POS    Lemma
-----------------  -----  -----------------
Настоящий          ADJ    настоящий
Закон              NOUN   закон
регулирует         VERB   регулировать
общественные       ADJ    общественный
отношения          NOUN   отношение
в                  ADP    в
сфере              NOUN   сфера
прохождения        NOUN   прохождение
воинской           ADJ    воинский
службы             NOUN   служба
гражданами         NOUN   гражданин
Республики         PROPN  республика
Казахстан          PROPN  казахстан
и                  CCONJ  и
определяет         VERB   определять
основы             NOUN   основа
государственной    ADJ    государственный
политики           NOUN   политика
...

The selected text is an excerpt from the Military Law of Kazakhstan, so it has a fairly specialised legal vocabulary. I ran the full pipeline using the spaCy Russian model (ru_core_news_sm).

The model tokenised most of the text correctly into separate words, numbers and punctuation. The POS tagger tagged most words correctly, with occasional mistakes (discussed in the next paragraph). The lemmatiser identified the base forms correctly. Proper names were generally identified correctly, e.g. "Республика Казахстан" (Republic of Kazakhstan) was tagged PROPN.
Overall, general, non-domain-specific words were recognised well.

However, there are some limitations, most of them due to domain-specific words and entities. Firstly, the tokenizer split some legal references into multiple tokens, although it would be preferable to keep each as a single token. For instance, "Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ" (Chapter 1. GENERAL PROVISIONS) was split into 5 tokens instead of one "legal" token. The POS tagger made occasional mistakes, such as tagging "Вооруженные" in "Вооруженные Силы" (Armed Forces) as a VERB instead of PROPN. NER in the Russian spaCy model is significantly limited, with only the LOC, ORG and PER labels available. Having no LAW label like the English spaCy model, it labelled the entity "Закон" (Law) as a person. Occasionally, the organisation "Вооруженные Силы" was incorrectly labelled as a location, and "Общие" (General) as an organisation. All of this suggests that the model lacks training on legal texts. So, while it is suitable for general-purpose texts and tasks, legal texts may require some fine-tuning or other modifications.
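
One possible modification is a rule-based pre-pass that finds legal references before (or alongside) the statistical pipeline. The sketch below is only an illustration under stated assumptions: the pattern LEGAL_REF and the function find_legal_refs are hypothetical, and they cover just three reference shapes that occur in the sample text (chapter/article headings, dotted dates and law numbers).

```python
import re

# Hypothetical patterns for references seen in the sample text.
LEGAL_REF = re.compile(
    r"(?:Глава|Статья)\s+\d+"      # e.g. "Глава 1", "Статья 1"
    r"|\d{2}\.\d{2}\.\d{4}"        # e.g. "16.07.2025"
    r"|№\s*\d+-[IVXLCDM]+"         # e.g. "№ 211-VIII"
)

def find_legal_refs(text):
    """Return the legal references matched by LEGAL_REF, in order of appearance."""
    return LEGAL_REF.findall(text)

sample = "Глава 1. ОБЩИЕ ПОЛОЖЕНИЯ Статья 1. Законом РК от 16.07.2025 № 211-VIII"
print(find_legal_refs(sample))
```

Spans found this way could then be merged into single tokens or added as custom entities in spaCy; that step is left out here because it depends on how the rest of the pipeline is configured.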