# Natural Language Processing. Lesson 2. Text processing basics

In this lab, we will cover several core text processing concepts:
- Sentence Segmentation,
- Lowercasing,
- Stop Words Removal,
- Lemmatization,
- Stemming,
- Byte-Pair Encoding (BPE),
- Edit Distance.

These methods are the basic tools that let computers work with human language. In other words, they are essential for unlocking the `meaning` hidden within text data.

## Sentence Segmentation

Sentence segmentation is a fundamental step that divides a block of text into individual sentences, typically at punctuation marks. It was covered in the previous lesson, so you should already be familiar with it. Many tasks rely on it, for example:
- Part-of-Speech (POS) tagging: accurate sentence boundaries are required for assigning grammatical labels such as noun, verb, and adjective.
- Sentiment analysis: determining the sentiment (positive, negative, neutral) of a sentence also relies on exact boundaries.

Many other tasks also need text split into sentences. This can be done with libraries you already know: nltk or spaCy. Let's use nltk here.


In [1]:
# install the required library
!pip install nltk



In [2]:
import nltk

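# tokenizer models used by sent_tokenize
# (recent NLTK releases may ask for the "punkt_tab" resource instead)
nltk.download("punkt")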

text = "This is a sample text. It contains multiple sentences. Can we segment it?"

# tokenize into sentences using nltk.sent_tokenize()
sentences = nltk.sent_tokenize(text)

print(sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This is a sample text.', 'It contains multiple sentences.', 'Can we segment it?']
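
To see why a trained tokenizer is worth using, compare it with a naive rule. The sketch below is not part of nltk; `naive_sent_split` is a hypothetical helper that splits after `.`, `!` or `?` followed by whitespace. It handles the sample text, but breaks on abbreviations:

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# works on simple text
print(naive_sent_split("This is a sample text. It contains multiple sentences. Can we segment it?"))
# ['This is a sample text.', 'It contains multiple sentences.', 'Can we segment it?']

# abbreviations are wrongly treated as sentence ends
print(naive_sent_split("Dr. Smith arrived at 5 p.m. sharp. He was late."))
# ['Dr.', 'Smith arrived at 5 p.m.', 'sharp.', 'He was late.']
```

Every period followed by a space is treated as a boundary, which is the kind of punctuation ambiguity that context-aware segmenters try to resolve.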


#### Task 1

Complete the following code. Split the text into sentences and store them in the `sentences` variable.

In [3]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
text = "There exist some challenges for this technique. The main problem is \
language specificity because sentence segmentation rules can differ across \
languages. For example, Japanese omits spaces between words, making \
segmentation more complex. Punctuation ambiguity also may be the problem because \
of their complexity. Certain punctuation marks like ellipses (...) or colons (:) \
might not always indicate sentence boundaries, requiring context-aware approaches."

sentences = ...
print(sentences[:3])

In [None]:
assert sentences[:2] == [
    "There exist some challenges for this technique.",
    "The main problem is language specificity because \
sentence segmentation rules can differ across \
languages.",
]

## Lowercasing

Lowercasing, as the name suggests, converts all characters in a string to lowercase. This seemingly simple step matters in NLP for several reasons:
- Consistency and focus on word meaning: 'Apple' and 'apple' are treated as the same word
- Improved performance: merging case variants reduces the number of unique word forms. Since many NLP algorithms rely on word statistics, this helps avoid overfitting to specific capitalization patterns
- Compatibility with NLP tools: many NLP libraries and resources (for example, lowercase stop word lists) expect lowercase text, so lowercasing avoids mismatches and inconsistencies

To lowercase text, we can use Python's built-in string method `str.lower()`:

In [5]:
text = "ThIs Is AN ExaMple Text."

# apply the .lower() function
lowercased_text = text.lower()

print(lowercased_text)

this is an example text.
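
The vocabulary reduction mentioned above can be checked on a toy sentence. This is a minimal sketch; the sentence and the variable names are made up for illustration:

```python
tokens = "The cat saw the Cat and THE dog".split()

# unique word forms before and after lowercasing
vocab_cased = set(tokens)
vocab_lower = {token.lower() for token in tokens}

print(len(vocab_cased), len(vocab_lower))  # 8 5
```

The three spellings of "the" and the two spellings of "cat" collapse into one entry each, so the vocabulary shrinks from 8 to 5 word forms.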


Besides `.lower()`, Python strings also have an `.upper()` method. Converting all letters to their capital form is called `Uppercasing`. It is rarely applied in NLP, but letter case can still matter:
- Emphasis detection: in some cases uppercase letters indicate emphasis, as in headlines, slogans, or acronyms, so capitalization can serve as an emphasis marker
- Specific NLP libraries: certain tools might work better with uppercase text, though this is uncommon (always refer to the documentation of the specific tool)
- Named entity recognition (NER): proper nouns (names of people, places, organizations) are often capitalized, so case is a useful signal for potential named entities, but additional checks are needed for accuracy

#### Task 2.
Fill the gaps in the following code. Convert the letters in the two strings as each sentence says:

In [None]:
# no additional libraries are required
upper = "aPplY thE uPpErCASiNg"
lower = "appLy ThE LowERcAsINg"

# apply the needed functions:
upper_result = ...
lower_result = ...

print(upper_result)
print(lower_result)

In [None]:
assert upper_result == "APPLY THE UPPERCASING"
assert lower_result == "apply the lowercasing"

## Stop Words Removal

Stop words are frequently occurring words that are often removed during text processing to focus on the more meaningful words. Examples include articles ("the", "a", "an"), prepositions ("of", "to", "in"), conjunctions ("and", "but", "or"), and pronouns ("I", "you", "he"). While these words are essential for building sentences, they often add little value for NLP tasks, so there are several reasons to remove them:
- Focus on content and improved efficiency: removing stop words keeps most of the meaning while reducing the number of words the algorithms have to process
- Statistical analysis: stop words can skew the results of NLP methods that rely on word frequency. Removing them reduces this bias and gives a more accurate picture of the important words

More stop-words examples:

![Stop words](https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab02/stop-words.png)

In [6]:
# import the stop word lists shipped with nltk
from nltk.corpus import stopwords

# download the stop words list
# quiet=True hides messages that .download() might display
nltk.download("stopwords", quiet=True)

# retrieve the list with stop words in english
stop_words = set(stopwords.words("english"))

text = "This is an example sentence with some stop words."

# remove all stop words with a list comprehension
filtered_words = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_words)

['example', 'sentence', 'stop', 'words.']


#### Task 3.
Fill the gaps in the following cells. Get rid of stop words.

In [None]:
text = "The quick brown fox jumps over the lazy dog."

# get the stop words
stop_words = ...

# remove the meaningless words
filtered_words = ...

print(filtered_words)

In [None]:
assert filtered_words == ["quick", "brown", "fox", "jumps", "lazy", "dog."]

Be careful with this method and consider whether stop word removal is beneficial for your specific NLP task. Removing stop words can lose context: "I don't like it" and "I like it" become indistinguishable once negations such as "don't" are removed, so the sentiment is lost. Where possible, adapt the stop word list to your domain and task.
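
The loss of negation can be reproduced with a few lines. This is a sketch under an assumption: `TOY_STOP_WORDS` is a tiny hand-written list for illustration, not the nltk list, and `remove_stop_words` is a hypothetical helper.

```python
# a tiny hand-written stop word list, for illustration only
TOY_STOP_WORDS = {"i", "it", "don't", "do", "not"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Keep only the whitespace-separated words that are not stop words."""
    return [word for word in text.split() if word.lower() not in stop_words]


print(remove_stop_words("I don't like it"))  # ['like']
print(remove_stop_words("I like it"))        # ['like']

# keeping negations in the text preserves the difference
keep_negation = TOY_STOP_WORDS - {"don't", "not"}
print(remove_stop_words("I don't like it", keep_negation))  # ["don't", 'like']
```

Both sentences reduce to the same token list with the full list, while a list without negations keeps them apart. This is one way to adapt a stop word list to a sentiment task.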

## Lemmatization

Lemmatization reduces words to their base or dictionary form, known as the lemma. This helps group related word forms together and can improve the accuracy of NLP models.

`Lemma` is the canonical form of a word, also referred to as its base or dictionary form (runs - run, keeps - keep, apples - apple).

Lemmatization algorithms combine a dictionary with morphological analysis and rules to identify the base form of a word.

Why use it?
- Improved accuracy: mapping different forms of a word to one base form helps to handle variations of the same concept
- Reduced vocabulary size and memory usage: lemmatization reduces the number of unique words an NLP model needs to process

We will use nltk's `WordNetLemmatizer` for this.

In [8]:
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

words = ["rocks", "corpora", "cries"]

# apply the lemmatizer to all words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['rock', 'corpus', 'cry']
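
To make the "dictionary plus rules" idea concrete, here is a toy sketch. It is not how `WordNetLemmatizer` is implemented; `IRREGULAR`, `SUFFIX_RULES` and `toy_lemmatize` are hypothetical names, and the tiny tables are made up for illustration.

```python
# hypothetical toy data: irregular forms that no suffix rule can recover
IRREGULAR = {"corpora": "corpus", "mice": "mouse", "went": "go"}
# (suffix, replacement) rules, tried in order
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def toy_lemmatize(word):
    """Look the word up in the dictionary first, then fall back to suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # words ending in "ss" (glass, boss) are left alone
        if word.endswith(suffix) and not word.endswith("ss"):
            return word[: -len(suffix)] + replacement
    return word


print([toy_lemmatize(w) for w in ["rocks", "corpora", "cries", "glass"]])
# ['rock', 'corpus', 'cry', 'glass']
```

A real lemmatizer also needs the part of speech. In nltk, `lemmatize` takes a `pos` argument that defaults to `"n"` (noun), so verbs should be passed with `pos="v"`.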


#### Task 4.
Fill the gaps. Convert the given words to their base form (lemma). Use spaCy. Hint: use the `lemma_` attribute of each token. \
[Useful link](https://www.geeksforgeeks.org/python-pos-tagging-and-lemmatization-using-spacy/)

In [None]:
import spacy

# the model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize_sentence(sentence):
    ...
    return lemmatized_words

sentence1 = "The quick brown foxes are jumping over the lazy dogs."
sentence2 = "He loves playing and running in the park."

expected_output1 = ['the', 'quick', 'brown', 'fox', 'be', 'jump', 'over', 'the', 'lazy', 'dog', '.']
expected_output2 = ['he', 'love', 'play', 'and', 'run', 'in', 'the', 'park', '.']

assert lemmatize_sentence(sentence1) == expected_output1, "Test case 1 failed"
assert lemmatize_sentence(sentence2) == expected_output2, "Test case 2 failed"


## Stemming

Stemming reduces words to their stem or root form, usually by stripping suffixes with heuristic rules (running - run, jumped - jump, books - book). It is similar to lemmatization, so what is the difference?
 - Stemming relies on suffix-removal rules, which may produce a string that is not a real word (beautifully - beauti), while lemmatization uses a dictionary and morphological analysis
 - Stemming is faster and simpler
 - Application: stemming is preferable when computational efficiency is a priority and a rough grasp of the core meaning is sufficient.

Stemming and lemmatization often produce similar results, but there are differences:

![Stemming and lemmatization](https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab02/stemming.png)

In [9]:
# import a simple module for stemming
from nltk.stem import PorterStemmer

# and create its instance
stemmer = PorterStemmer()

words = ["running", "rocks", "beautifully"]

# apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['run', 'rock', 'beauti']
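
The Porter stemmer applies many carefully ordered rules. The sketch below is a much cruder, hypothetical `naive_stem` that strips the first matching suffix from a short made-up list. It shows how pure suffix removal can yield non-words:

```python
# suffixes tried in order; a made-up list for illustration only
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["running", "jumped", "books", "beautifully"]])
# ['runn', 'jump', 'book', 'beautiful']
```

Here "running" becomes "runn", which is not a word. The Porter stemmer has an extra rule for doubled consonants and returns "run", as in the output above.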


#### Task 5.
Complete the following code. Hint: use nltk's `word_tokenize`.


In [None]:
def stem_sentence(sentence):
    # YOUR CODE
    return stemmed_words


sentence1 = "The quick brown foxes are jumping over the lazy dogs."
sentence2 = "He loves playing and running in the park."

expected_output1 = [
    "the",
    "quick",
    "brown",
    "fox",
    "are",
    "jump",
    "over",
    "the",
    "lazi",
    "dog",
    ".",
]
expected_output2 = ["he", "love", "play", "and", "run", "in", "the", "park", "."]

assert stem_sentence(sentence1) == expected_output1, "Test case 1 failed"
assert stem_sentence(sentence2) == expected_output2, "Test case 2 failed"

## Byte-Pair Encoding (BPE)

BPE is a technique used in Natural Language Processing (NLP) for subword tokenization. Unlike traditional tokenization, which splits text into individual words, BPE breaks text into smaller units learned from the training data, up to a vocabulary size chosen in advance. This approach is particularly beneficial when dealing with large vocabularies or rare words. The algorithm:
1. Initial vocabulary: BPE starts with the individual characters in the text as the initial vocabulary.
2. Merging frequent pairs: it iteratively analyzes the training text and identifies the most frequent pair of characters or subwords (considering existing merged units).
3. Replacing pairs: this most frequent pair is replaced with a new symbol not present in the vocabulary. The new symbol represents the merged subword.
4. Vocabulary update: the vocabulary is updated to include the newly created symbol.
5. Repeat: steps 2-4 are repeated for a predefined number of iterations or until a desired vocabulary size is reached.
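
The steps above can be sketched on a toy word-frequency table. This is an illustrative assumption, not the implementation used by the `tokenizers` library: `learn_bpe` is a hypothetical helper that works on characters rather than bytes and merges symbols by concatenating them.

```python
def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} table."""
    # step 1: every word starts as a tuple of single characters
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # step 2: count adjacent symbol pairs, weighted by word frequency
        pairs = {}
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        # steps 3-4: replace the pair with one merged symbol everywhere
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
        merges.append(best)  # step 5: repeat until num_merges is reached
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After two merges the frequent ending "est" is already a single symbol, so "newest" is stored as `('n', 'e', 'w', 'est')`.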

Applications:
- Machine translation: handles vocabulary differences between languages effectively
- Text classification and summarization: BPE provides a richer representation of words and captures some morphological information
- Large Language Models (LLMs): BPE allows LLMs to handle the vast vocabulary encountered in real-world text data

The visual explanation:

![Byte pair encoding](https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab02/bytepair.png)

In [10]:
!pip install tokenizers



In [11]:
# import TemplateProcessing for templates
from tokenizers.processors import TemplateProcessing

# these tokens have specific meanings within the tokenizer's
# vocabulary and are not part of the regular text
# UNK - unknown words not encountered during training
# CLS - indicate the beginning of a sentence
# SEP - separates sentences
# PAD - for padding sequences to a fixed length
# MASK - employed in tasks like masked language modeling,
# where certain words are masked and the model predicts them
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# TemplateProcessing's instances define a template for how the tokenizer should
# handle text during the encoding and decoding process
temp_proc = TemplateProcessing(
    # single - the format for encoding a single sentence
    single="[CLS] $A [SEP]",
    # pair - the format for a sentence pair; ":1" assigns type id 1 to the second segment
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", special_tokens.index("[CLS]")),
        ("[SEP]", special_tokens.index("[SEP]")),
    ],
)

In [12]:
from tokenizers import Tokenizer
from tokenizers.normalizers import Sequence, Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.decoders import BPEDecoder

# create the instance of the Tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = BPEDecoder()
tokenizer.post_processor = temp_proc

In [13]:
from tokenizers.trainers import BpeTrainer

In [14]:
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg", quiet=True)
nltk.download("punkt", quiet=True)

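# train a BPE vocabulary of at most 5000 tokens on the sentences of Macbeth
trainer = BpeTrainer(vocab_size=5000, special_tokens=special_tokens)
# gutenberg.sents() yields lists of words, so join each one back into a string
shakespeare = [" ".join(s) for s in gutenberg.sents("shakespeare-macbeth.txt")]
tokenizer.train_from_iterator(shakespeare, trainer=trainer)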

In [15]:
print(
    tokenizer.encode(
        "BPE is a data compression technique used in NLP for tokenization."
    ).tokens
)
print(
    tokenizer.encode(
        "Is this a danger which I see before me, the handle toward my hand?"
    ).tokens
)

['[CLS]', 'b', 'pe', 'is', 'a', 'd', 'at', 'a', 'com', 'pre', 'ss', 'ion', 'te', 'ch', 'ni', 'que', 'use', 'd', 'in', 'n', 'lp', 'for', 'to', 'ken', 'iz', 'ation', '.', '[SEP]']
['[CLS]', 'is', 'this', 'a', 'danger', 'which', 'i', 'see', 'before', 'me', ',', 'the', 'handle', 'toward', 'my', 'hand', '?', '[SEP]']


#### Task 6.
Complete the following code to implement the two helper functions that BPE needs.

In [16]:
def get_stats(ids):
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # YOUR CODE
    # YOUR CODE
    return counts

In [17]:
assert get_stats([1, 2, 3, 1, 2]) == {(1, 2): 2, (2, 3): 1, (3, 1): 1}
assert get_stats([]) == {}
assert get_stats([1]) == {}
assert get_stats([1, 1, 1, 1]) == {(1, 1): 3}
assert get_stats([1, 2, 1, 2, 1, 2]) == {(1, 2): 3, (2, 1): 2}


In [18]:
def merge(ids, pair, idx):
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new integer token idx
    Example: ids=[1, 2, 3, 1, 2], pair=(1, 2), idx=4 -> [4, 3, 4]
    """
    newids = []
    i = 0
    # YOUR CODE
    # YOUR CODE
    return newids

In [None]:
assert merge([1, 2, 3, 1, 2], (1, 2), 4) == [4, 3, 4]
assert merge([1, 3, 4, 5], (1, 2), 4) == [1, 3, 4, 5]
assert merge([1, 2, 3, 4, 5], (1, 2), 6) == [6, 3, 4, 5]
assert merge([3, 4, 1, 2], (1, 2), 6) == [3, 4, 6]
assert merge([1, 2, 1, 2, 1, 2], (1, 2), 6) == [6, 6, 6]

## Levenshtein Edit Distance

Edit distance measures how different two strings are by counting the minimum number of operations needed to transform one string into the other. The smaller the distance, the more similar the strings.

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance#Example)

![Levenshtein](https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab02/leven.svg)

In [26]:
import nltk

s1 = "abc"
s2 = "aec"
nltk.edit_distance(s1, s2)

1
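
The distance can be written as a recurrence over prefixes. The notation here is an assumption: $\operatorname{lev}(i, j)$ is the distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$, and $[a_i \neq b_j]$ is 1 if the characters differ and 0 otherwise.

$$
\operatorname{lev}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min \begin{cases}
\operatorname{lev}(i-1, j) + 1 & \text{(deletion)} \\
\operatorname{lev}(i, j-1) + 1 & \text{(insertion)} \\
\operatorname{lev}(i-1, j-1) + [a_i \neq b_j] & \text{(substitution or match)}
\end{cases} & \text{otherwise.}
\end{cases}
$$

For "abc" and "aec", the first and last characters match at no cost and the middle pair costs one substitution, so $\operatorname{lev}(3, 3) = 1$. This agrees with the result of `nltk.edit_distance` above.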

#### Task 7.
Write a function `levenshteinDistance` that takes two strings as input and returns their Levenshtein distance. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.\
You can use the following [link](https://www.baeldung.com/cs/levenshtein-distance-computation#:~:text=Levenshtein%20distance%20is%20the%20smallest,insertions%2C%20deletions%2C%20and%20substitutions)

In [29]:
def levenshteinDistance(A, B):
    N, M = len(A), len(B)

    dp = [[0 for i in range(M + 1)] for j in range(N + 1)]

    # Base Case: When N = 0
    for j in range(M + 1):
        dp[0][j] = j
    # YOUR CODE

    return dp[N][M]


In [30]:
s1 = "kitty"
s2 = "kotyy"

In [31]:
assert levenshteinDistance(s1, s2) == nltk.edit_distance(s1, s2)

# Useful links:
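- https://github.com/karpathy/minbpe.git minimal BPE implementation by Andrej Karpathy
- https://arxiv.org/pdf/1508.07909.pdf "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al.), the paper that applied BPE to subword tokenization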

# Task


[Competition](https://www.kaggle.com/t/6dcb6f9def724f9f82050e9092952dd6)

The aim of the competition is to find the 10 most frequent words, with their counts, in each of the plays listed in the `data.txt` file.

In order to count the frequent words correctly, you must perform lemmatization and remove stop words.


In [None]:
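with open("data.txt") as f:
    data = f.read()
# one play file name per line; skip blank lines such as a trailing newline
plays = [line.strip() for line in data.splitlines() if line.strip()]
plays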

In [None]:
plays_dict = {}

for play in plays:
    plays_dict[play] = gutenberg.raw(play)
    print(play, len(plays_dict[play]))

In [None]:
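def top_frequent_words(text, topk=10):
    # expected to return a list of (word, count) pairs:
    # the submission cell below reads count[1] from each item
    # your implementation
    pass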

In [None]:
top_words = {}
for play, text in plays_dict.items():
    top_words[play] = top_frequent_words(text)

In [None]:
with open("submission.csv", "w") as f:
    f.write("id,count\n")
    for play, counts in top_words.items():
        for i, count in enumerate(counts):
            f.write(f"{play}_{i},{count[1]}\n")

# Conclusion

In this lesson we covered basic text preprocessing techniques that serve as building blocks for Natural Language Processing (NLP) tasks:

- **Sentence Segmentation**: dividing text into sentences, which lets us analyze the structure and meaning of each sentence separately.
- **Lowercasing**: converting all text to lowercase for consistency, so that case variants of a word are compared as equal.
- **Stop Word Removal**: filtering out high-frequency, uninformative words to focus on the core content of the text. This can improve the performance of NLP models, but it may also remove context such as negation.
- **Lemmatization**: reducing words to their base or dictionary forms (lemmas), which captures the core meaning regardless of inflection.
- **Stemming**: a faster, rule-based alternative to lemmatization that strips suffixes and might not always produce actual words. It trades some accuracy for efficiency.
- **Byte-Pair Encoding (BPE)**: splitting words into subword units learned from data, which is particularly valuable for handling rare words and large vocabularies.
- **Edit Distance**: a metric of how different two strings are, with applications in spell checking, machine translation evaluation, and identifying textual variations.

With these methods you can clean, normalize, and analyze textual data, which is the foundation for more advanced NLP tasks. Remember that the choice of technique depends on your specific task, and often a combination of these methods gives the best results.


