# VL02 - Text Processing

We'll use two libraries: one well suited to teaching and a more powerful one typically used in production. We'll contrast them throughout this lecture:

1. **nltk**: Primarily used for teaching, research, and exploring algorithms in NLP, offering a huge collection of corpora and modular functions. Requires more effort and individual function calls (like sent_tokenize and separate downloads) as it gives you a wide range of options for each task.

2. **spaCy**: Optimized for speed, efficiency, and production use, processing text much faster for real-world applications like chatbots or large-scale data analysis. Uses advanced pre-trained statistical models to process text in one go, automatically providing accurate sentence segmentation, Part-of-Speech tags, and Named Entity Recognition (NER).

## 1. Loading the libraries
Make sure to run the `install_env.sh` script to download additional dependencies. 

In [None]:
import re
try:
    import nltk
    from nltk.tokenize import sent_tokenize
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)  # resource name used by newer NLTK releases
except Exception:
    nltk = None
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
except Exception:
    nlp = None

## 2. Encoding and Unicode Normalisation

### 2.1 Different encodings
Unicode characters that look identical can be encoded in different ways. For example, the character ü may appear as:
- a single precomposed code point ü (U+00FC), or
- a decomposed sequence u (U+0075) + COMBINING DIAERESIS (U+0308).


In [None]:
"Mädchen"== "Mädchen"

In [None]:
s1 = 'Mädchen'       # 'Mädchen' NFC
s2 = 'Ma\u0308dchen' # 'Mädchen' in NFD: 'a' followed by a combining diaeresis

print(f's1 = {s1}, s2 = {s2}')

In [None]:
print(f'{s1} == {s2} ?', s1 == s2)

In [None]:
import unicodedata
s2 = unicodedata.normalize('NFC', s2)

print(f'{s1} == {s2} ?', s1 == s2)
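
To make the difference between the two forms visible, the sketch below lists the code points of each string and checks whether it is already in NFC. It uses only the standard library; `unicodedata.is_normalized` assumes Python 3.8 or later, and the helper name `code_points` is purely illustrative.

```python
import unicodedata

def code_points(s):
    """Return the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

composed = "M\u00e4dchen"      # precomposed ä (NFC)
decomposed = "Ma\u0308dchen"   # 'a' + COMBINING DIAERESIS (NFD)

for label, s in [("composed", composed), ("decomposed", decomposed)]:
    print(label, "| length:", len(s),
          "| is NFC:", unicodedata.is_normalized("NFC", s))
    print("   ", code_points(s))

# Normalising both sides to the same form makes them compare equal
print(unicodedata.normalize("NFD", composed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two strings render identically but differ in length, which is why the plain `==` comparison fails until both sides share one normalisation form.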

### 2.2 Encoding as an exploit

#### 2.2.1 Full-Width characters
Unicode includes "Full-Width" characters (used in East Asian typography), which look like standard ASCII characters but are treated as completely different characters by a computer.

| Letter | ASCII character | Full-width character | Full-width code point |
|---:|:---:|:---:|:---|
| F | `F` (U+0046) | `Ｆ` | U+FF26 |
| R | `R` (U+0052) | `Ｒ` | U+FF32 |
| E | `E` (U+0045) | `Ｅ` | U+FF25 |


In [None]:
message_standard = "Claim your FREE prize now!"
message_exploit = "Claim your ＦＲＥＥ prize now!"

def check_if_spam (message):
    if( "FREE" in message): 
        print("SPAM:\t", message)
    else: 
        print("HAM:\t", message)
        
check_if_spam(message_standard)
check_if_spam(message_exploit)

# Can we solve it with our normalisation?
message_normalised = unicodedata.normalize('NFC', message_exploit)
check_if_spam(message_normalised)

##### What do we do?

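NFC leaves full-width letters unchanged because they are compatibility variants of the ASCII letters, not canonical equivalents. In this situation, you might want to normalise to **NFKC** instead: it additionally replaces compatibility characters (typographic or legacy variants of other characters) with their plain equivalents. Examples of what NFKC changes: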
- full-width latin letters Ｆ → F
- ligatures ﬁ → fi
- compatibility symbols ㎏ → kg
- circled numbers ① → 1
- superscripts ² → 2

In [None]:
# Some examples to run
samples = {
    "fullwidth": "Ｈｅｌｌｏ １２３",       # fullwidth Latin + fullwidth digits
    "ligature": "office ﬁle",             # 'ﬁ' ligature inside a word
    "compat_kg": "重量: ㎏",               # U+338F SQUARE KG -> 'kg'
    "superscript": "x² + y³",             # superscripts -> digits
    "circled": "① ② ③",                  # circled numbers -> digits
    "angstrom_sign": "\u212B",            # ANGSTROM SIGN (compat) -> 'Å'
    "umlaut" : 'Ma\u0308dchen',
}

def show(s):
    print("ORIG     :", s, " ->", [f"U+{ord(ch):04X}" for ch in s])
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print("NFC      :", nfc, " ->", [f"U+{ord(ch):04X}" for ch in nfc])
    print("NFKC     :", nfkc, " ->", [f"U+{ord(ch):04X}" for ch in nfkc])
    print("-" * 40)

for name, s in samples.items():
    print("SAMPLE:", name)
    show(s)

##### Defending against encoding exploits
When encoding attacks are a concern, or when we want to normalise text for more robust search and matching, we can use NFKC.
Does the code below defend against the attack?


In [None]:
message_normalised = unicodedata.normalize('NFKC', message_exploit)
check_if_spam(message_normalised)

#### 2.2.2 Zero-width exploits
Compatibility normalization (`NFKC`) is very useful (full-width → ASCII, ligatures → constituent letters, etc.), but it does **not** remove invisible / zero-width characters such as ZWSP, ZWNJ, ZWJ or the BOM.  
These characters can appear accidentally (copy/paste, editors) and they will break literal substring/regex rules unless you remove or canonicalize them.

In [None]:
message = "Claim your F‌REE prize now!"
check_if_spam(message)

message_normalised = unicodedata.normalize('NFKC', message)
check_if_spam(message_normalised)

In [None]:
message

The output above shows that an invisible character was sneaked into the message. We can deal with such characters by simply removing them. The characters and their Unicode code points:

```
  ZWSP    ZWNJ     ZWJ    BOM   
 \u200B  \u200C  \u200D  \uFEFF   
```

In [None]:
_zero_re = re.compile("[\u200B\u200C\u200D\uFEFF]")

message_clean = _zero_re.sub('', message)
check_if_spam(message_clean)
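
The two defences from this section can be combined into a single normalisation step that runs before any keyword matching. The following is a minimal sketch under some assumptions: the names `normalize_for_matching` and `looks_like_spam` are hypothetical, and the added `casefold()` call (so that `free` and `FREE` match alike) goes beyond what the cells above do.

```python
import re
import unicodedata

# Zero-width characters covered in this section: ZWSP, ZWNJ, ZWJ, BOM
ZERO_WIDTH_RE = re.compile("[\u200B\u200C\u200D\uFEFF]")

def normalize_for_matching(text):
    """NFKC-normalise, drop zero-width characters, then casefold."""
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH_RE.sub("", text)
    return text.casefold()

def looks_like_spam(message, keyword="free"):
    """Return True if the keyword is found after normalisation."""
    return keyword in normalize_for_matching(message)

examples = [
    "Claim your FREE prize now!",
    "Claim your \uFF26\uFF32\uFF25\uFF25 prize now!",  # full-width FREE
    "Claim your F\u200CREE prize now!",                # ZWNJ hidden inside FREE
    "See you at lunch.",
]

for m in examples:
    print("SPAM" if looks_like_spam(m) else "HAM ", repr(m))
```

Matching on a single keyword is still a naive rule; the point here is only that the normalisation happens once, in one place, before the rule is applied.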

## 3. Sentence segmentation

### 3.1. Using regular expressions
We can use regular expressions to split a text into sentences.


In [None]:
text_short = "You have won a prize! Claim your gift now."

pattern = r'(?<=[.!?])\s+(?=[A-Z0-9"“\'\(\[])'   # split after .!? when next char looks like sentence start

sentences_short = re.split(pattern, text_short)

# Print sentences
def print_sentences (sentences):
    for i,s in enumerate(sentences,1):
        print(i, repr(s))

print_sentences(sentences_short)

In [None]:
text_long = (
    "Congratulations! You have won $1,000.00. Contact Dr. O'Neil at 9:00 a.m. to claim your prize. "
    "Offer valid for U.S. residents only. See Sec. 3.2 for terms... Don't miss out! Visit www.example.com/free-offer "
    "or call 1-800-555-0199. Mr. Smith, CEO of Acme Ltd., says, \"Act now!\""
)

sentences = re.split(pattern, text_long)
print_sentences(sentences)
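
The regex splits after every `.`, `!` or `?` that is followed by something that looks like a sentence start, so abbreviations such as `Dr.` or `Sec.` produce false boundaries. One common workaround is to glue a fragment back onto its predecessor when the predecessor ends in a known abbreviation. The sketch below illustrates the idea; the abbreviation list and the function name `split_sentences` are illustrative assumptions, not a complete solution.

```python
import re

# Same idea as the pattern above: split after .!? when a sentence start follows
SPLIT_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'(\[])')

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "sec.", "ltd."}

def split_sentences(text):
    """Regex split, then merge pieces that follow a known abbreviation."""
    sentences = []
    for piece in SPLIT_RE.split(text):
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            # The previous piece ended in an abbreviation: not a real boundary
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

sample = "Contact Dr. O'Neil today. See Sec. 3.2 for terms. Act now!"
for i, s in enumerate(split_sentences(sample), 1):
    print(i, repr(s))
```

This heuristic fails whenever an abbreviation really does end a sentence, which is one reason to prefer a trained segmenter over hand-written rules.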

### 3.2 Using `nltk` sentence segmentation
NLTK's `sent_tokenize` uses the Punkt sentence tokenizer, an unsupervised, data-driven model that detects sentence boundaries.


In [None]:
sentences_long = sent_tokenize(text_long, language='english')
print_sentences(sentences_long)

### 3.3 Using `spacy` pipeline
spaCy performs sentence segmentation inside the pipeline, using either the dependency parser or a rule-based `sentencizer` component. You access the results via `doc.sents`.

In [None]:
doc = nlp(text_long)

# Get sentences from the Doc object
sentences_long_spacy = [sent.text.strip() for sent in doc.sents]
print_sentences(sentences_long_spacy)

## 4. Tokenization

### 4.1 Using regular expressions

In [None]:
sent1 = 'You have won a prize!'
sent2 = "Don't miss this opportunity"

print ( re.findall(r"\b\w[\w'\-]*\b", sent1) )
print ( re.findall(r"\b\w[\w'\-]*\b", sent2) )
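
The pattern above silently drops punctuation, which is not always what you want (the `!` in a spam message can be informative). As a hedged variation, the sketch below keeps each punctuation mark as its own token; the pattern and the name `regex_tokenize` are illustrative assumptions, not a standard tokenizer.

```python
import re

# A word (optionally with internal apostrophes/hyphens) OR one punctuation mark
TOKEN_RE = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)

print(regex_tokenize("You have won a prize!"))
print(regex_tokenize("Don't miss this opportunity"))
print(regex_tokenize("A state-of-the-art offer!!"))
```

Contractions and hyphenated words stay in one piece, while every punctuation character becomes a separate token that can be filtered later.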

### 4.2 Using `nltk`

In [None]:
from nltk.tokenize import word_tokenize
print ( word_tokenize(sent1) )
print ( word_tokenize(sent2) )

### 4.3 Using `spacy`
The `Doc` acts as a sequence of `Token` objects; iterating over it gives access to every token in the document. To work at sentence level, use `doc.sents`, which yields one `Span` (a slice of the doc's tokens) per sentence.

In [None]:
doc1 = nlp(sent1)
doc2 = nlp(sent2)
print ( [t.text for t in doc1] ) ## all tokens
print ( [t.text for t in doc2] )

In [None]:
doc = nlp("You have won a prize! Claim your gift now.") 
for i, sent in enumerate(doc.sents, 1):
    print(f"Sentence {i}: {sent.text}")
    # tokens inside the sentence:
    for token in sent:
        print("   ", "is_start:", token.is_sent_start, "\t", token.text)

### 4.4 Challenges with German

In [None]:
# The challenge: a standard tokenizer (a simple regex or a basic whitespace splitter)
# treats the entire compound word as one token.

sent_german = "Der Krankenhaushaftpflichtversicherungsvertrag ist unklar."

# Simple regex tokenization
re.findall(r"\b\w[\w'\-]*\b", sent_german)

# Output (the compound stays in one piece):
# ['Der', 'Krankenhaushaftpflichtversicherungsvertrag', 'ist', 'unklar']

# Splitting the compound into its parts (decompounding) requires a specialised tool;
# a tokenizer alone will not do it.

# Desired parts of the compound word:
# ["Krankenhaus", "Haftpflicht", "Versicherung", "Vertrag"]

Run `python -m spacy download de_core_news_sm` before running the next cell.

In [None]:
# 1. Load the German model
nlp_de = spacy.load("de_core_news_sm")

# 2. Process the text
doc_de = nlp_de(sent_german)

# 3. Extract tokens
[token.text for token in doc_de]

Tokenization is separate from morphological analysis. spaCy tokenizes long German compounds (e.g. Krankenhaushaftpflichtversicherungsvertrag) as a single token; splitting them into meaningful parts (decompounding) requires extra processing.
For a fast, ready-to-run approach you can use CharSplit (installed as `compound-split`):

``pip install compound-split``

`compound-split` returns ranked binary split candidates for a compound.
- The top candidate is the model’s preferred split (left + right).
- The output includes a numeric score; higher = more confident.
- Scores can be negative for rare/awkward splits — treat them as less confident.

You can recursively apply the splitter to split multi-part compounds into smaller parts.

In [None]:
from compound_split import char_split   # module exported by package

words = [
    "Krankenhaushaftpflichtversicherungsvertrag",
    "Krankenversicherung",
    "Autobahnraststätte",
    "Kindergartenfreundschaft",
]

for w in words:
    splits = char_split.split_compound(w)   # returns ranked candidate (binary) splits
    print("WORD:", w)
    for score, left, right in splits[:3]:   # show top 3 candidates
        print(f"  score={score:.3f} -> {left} + {right}")
    print()

## 5. Case folding, punctuation, emoji handling

### 5.1 Case folding

In [None]:
tokens = ['WIN', 'Win', 'win']
[t.lower() for t in tokens]

In [None]:
tokens_de = ['Straße', 'STRASSE', "Osnabrück", "OSNABRUECK"]
t_lowered = [t.lower() for t in tokens_de]
t_folded = [t.casefold() for t in tokens_de]

print(t_lowered)
print(t_folded)
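
The difference between `.lower()` and `.casefold()` matters most when comparing strings. The sketch below wraps the comparison in a small helper; the name `caseless_key` and the extra NFC step are assumptions made for illustration.

```python
import unicodedata

def caseless_key(s):
    """Key for case-insensitive comparison: NFC-normalise, then casefold."""
    return unicodedata.normalize("NFC", s).casefold()

pairs = [("Straße", "STRASSE"), ("WIN", "win"), ("Osnabrück", "OSNABRUECK")]
for a, b in pairs:
    print(f"{a!r} vs {b!r}:",
          "lower ->", a.lower() == b.lower(),
          "| casefold ->", caseless_key(a) == caseless_key(b))
```

Case folding reconciles `ß` with `SS`, but it does nothing about `ü` versus `UE`; that needs a language-specific mapping.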

In [None]:
def remove_diacritics(s):
    # Decompose each character (NFKD), then drop the combining marks
    return ''.join(ch for ch in unicodedata.normalize('NFKD', s)
                   if not unicodedata.combining(ch))
remove_diacritics("osnabrück")

### 5.2 Dealing with "special" characters
German (like many other European languages) uses characters that are not ASCII, e.g. umlauts (ä/ö/ü), the sharp S (ß) and other diacritics (é, ñ, …). How you handle them depends on the task. Below are the usual options and a short guideline.

- **Keep them (do nothing)**. Keep the original characters. This preserves all linguistic information and is recommended for most modern ML models and linguistic analyses.
- **Casefold only (caseless matching)**. Use Unicode case-folding (.casefold()) to compare case-insensitively. Important: .casefold() also maps ß → ss, unlike .lower().
- **Strip diacritics / ASCII transliteration**. Convert accented letters to closest ASCII equivalents (e.g., ü → u) using Unicode decomposition or a library (Unidecode, ICU). This loses diacritic information but can increase recall in legacy systems or ASCII-only contexts.
- **Orthographic mapping (German-style)**. Apply a language-aware mapping such as ä → ae, ö → oe, ü → ue, ß → ss. This preserves more of the original spelling conventions (useful for legacy matching, usernames, domain names, or keyboard variants).

Which option we choose depends on the task. For retrieval tasks (matching, search) we probably want to strip diacritics; for ML we probably want to keep them, as models benefit from the nuance.
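
The options above can be compared side by side on a single word. The following is a minimal sketch using only the standard library; the mapping table `GERMAN_MAP` and the function names are illustrative assumptions, and the table covers only the German characters mentioned in the list.

```python
import unicodedata

# German-style orthographic mapping (ä -> ae, ö -> oe, ü -> ue, ß -> ss)
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def strip_diacritics(s):
    """Decompose (NFKD) and drop combining marks: ü -> u."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_transliterate(s):
    """Apply the German mapping; NFC first so umlauts are single code points."""
    return unicodedata.normalize("NFC", s).translate(GERMAN_MAP)

word = "Osnabrück"
print("keep          :", word)
print("casefold      :", word.casefold())
print("strip         :", strip_diacritics(word))
print("German mapping:", german_transliterate(word))

# Only the German mapping reconciles the two common spellings
print(german_transliterate(word).casefold() == "OSNABRUECK".casefold())
```

Note that stripping diacritics leaves `ß` untouched, because it has no Unicode decomposition; only case folding or the explicit mapping turns it into `ss`.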


In [None]:
# Removing
text = "🎉 WIN a FREE vacation NOW!! 🏖️"

# we could remove ! from the list if we want to retain it
text_clean = re.sub(r"[\.,;:\"\(\)\[\]\\/\?@#\!\$%\^&\*_+=<>~`|]+", ' ', text) 
print(text, " -> ", text_clean)

# Note: you would typically filter at the token level instead (e.g., token.pos_ == "PUNCT")
text = "🎉 WIN a FREE vacation... NOW!! 🏖️"
doc = nlp(text)
for token in doc:
    print("   ", token.text, " -> ", token.pos_)

doc

In [None]:
import emoji
text = emoji.replace_emoji(text, replace=' <EMOJI> ')
print(text)

### 5.3 Stopwords

In [None]:
try:
    nltk.data.find('corpora/stopwords')
except Exception:
    nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords.words('english')

In [None]:
STOPWORDS = set(stopwords.words('english'))

# nltk
tokens = word_tokenize(text)
filtered_tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # the stoplist is lowercase

print(filtered_tokens)


In [None]:
doc = nlp(text)
    
filtered_tokens_spacy = [token.text for token in doc 
                         if not token.is_stop and token.is_alpha]

print(filtered_tokens_spacy)

## 6. Stemming and Lemmatization

### 6.1 Stemming

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball_en = SnowballStemmer("english")
snowball_de = SnowballStemmer("german")

print (porter.stem('study') )
print (snowball_en.stem('study') )
print (snowball_de.stem('Krankenhäuser') ) 

Comparing the stemmers on English and German words

In [None]:
words_en = ['study', 'studies', 'studying', 'studied', 'run', 'running', 'ran', 'better', 'was', 'educate']
words_de = ['gehen', 'ging', 'gegangen', 'Krankenhaus', 'Krankenhäuser', 'arbeiten', 'arbeitete']

def compare_stems(words, lang='en'):
    print ("Language : ", lang)
    for w in words:
        p = porter.stem(w) 
        s_en = snowball_en.stem(w) 
        s_de = snowball_de.stem(w) 
        
        print(f" [{lang}] {w:15} | porter: {p:10} | snow_en: {s_en:10} | snow_de: {s_de:10}")
        
compare_stems(words_en, "en")  
compare_stems(words_de, "de")  

### 6.2 Lemmatization

#### Using nltk
Lemmatization with nltk takes some extra work: the WordNet lemmatizer needs the part of speech of each token, as the call
`wnl.lemmatize(token, pos)` shows.

In [None]:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def treebank_to_wordnet_pos(treebank_tag):
    """Map NLTK/Treebank POS tags to WordNet POS tags for lemmatizer."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # sensible default

def lemmatize_sentence(sent: str):
    tokens = word_tokenize(sent)
    tags = pos_tag(tokens)
    lemmas = []
    for token, tag in tags:
        wn_pos = treebank_to_wordnet_pos(tag)
        lemma = wnl.lemmatize(token, pos=wn_pos)
        lemmas.append(lemma)
    return lemmas

# Example
print(lemmatize_sentence("You deserve better rewards for your loyalty."))

#### Using spacy
The lemmatizer is already in spacy's pipeline, along with the POS. We can simply access the lemma with `token.lemma_`.

In [None]:
# To facilitate comparison, we reuse the sentence from the nltk example
# (spaCy takes the whole sentence as input)
doc1 = nlp ("You deserve better rewards for your loyalty")

print ([token.lemma_ for token in doc1])

doc2 = nlp ("Get good benefits now - exclusively for loyal members!")

print ([token.lemma_ for token in doc2])

#### Inspect the POS and lemma
You can inspect the POS and the lemma below:

In [None]:
for token in doc1:
    print("{:<10} {:<10} {:<10}".format(
        token.text,             
        token.pos_,      # Coarse-grained POS tag (e.g., NOUN, VERB)
        token.lemma_,   
    ))

In [None]:
from spacy import displacy
from IPython.display import HTML, display

def display_parse_tree(doc):
    # Get the raw HTML string instead of letting displacy render it directly
    html_code = displacy.render(doc, style='dep', jupyter=False)
    
    # Use the standard IPython display to show the HTML
    display(HTML(html_code))
    
display_parse_tree(doc1)

## 7. A pre-processing pipeline

### 7.1 Removing punctuation before segmenting sentences and tokens
We typically segment and tokenize first and remove punctuation afterwards. Removing punctuation first deletes the cues that segmentation and tokenization rely on.

In [None]:
def remove_punctuation (text):
    return re.sub(r"[\.,;:\"\\'(\)\[\]\\/\?@#\!\$%\^&\*_+=<>~`|]+", '', text) 

def tokenize (text):
    return word_tokenize(text)

text = "Don't click here!"

text_clean = remove_punctuation(text)
tokenize(text_clean)
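
The effect of the ordering is easiest to see on a text with two sentences. The sketch below uses only regular expressions so that it runs without extra downloads; both patterns and both function names are simplified assumptions, and unlike `remove_punctuation` above it keeps apostrophes.

```python
import re

SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")   # naive sentence boundary
PUNCT_RE = re.compile(r"[^\w\s']")             # punctuation except apostrophes

def segment_then_clean(text):
    """Recommended order: split into sentences, then strip punctuation."""
    sentences = SENT_SPLIT_RE.split(text.strip())
    return [PUNCT_RE.sub("", s).split() for s in sentences]

def clean_then_segment(text):
    """Wrong order: the sentence-final punctuation is gone before splitting."""
    cleaned = PUNCT_RE.sub("", text)
    sentences = SENT_SPLIT_RE.split(cleaned.strip())
    return [s.split() for s in sentences]

demo_text = "Don't click here! Claim your gift now."
print("segment, then clean:", segment_then_clean(demo_text))
print("clean, then segment:", clean_then_segment(demo_text))
```

With the wrong order the two sentences collapse into one, because the boundary markers were removed before the segmenter could use them.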

### 7.2 Issues with case folding before lemmatisation
If you lowercase first, the POS tagging can change, for example when a capitalised proper noun looks like a common word:

- "Reading is a lovely town."
- "They are reading the book."

In [None]:
text1 = "May is a great month."
text2 = "I may go later."

#doc1 = nlp(text1.casefold())    
doc1 = nlp(text1)
doc2 = nlp(text2)

display_parse_tree(doc1)
display_parse_tree(doc2)

print("doc1", [t.lemma_ for t in doc1])
print("doc2", [t.lemma_ for t in doc2])


### 7.3 Issues with stopword removal before lemmatisation
If we remove stopwords before lemmatization, this can change the structure of the sentence (leading to bad POS). Stopwords are typi

In [None]:
# Small demonstration of stopword-matching pitfalls
text = "The man wasn't running to the store"
tokens = word_tokenize(text)

# typical lowercased stoplist
stoplist = stopwords.words('english')

print (tokens)
# (A) Removing stopwords before lemmatization/case-normalization
#     We manually force to do it in the incorrect order
removed_before = [t for t in tokens if t.lower() not in stoplist]
print("Tokens A - remaining after stopword removal - (not recommended):", removed_before)   # 'The' is removed only because we lowercase before matching

# (B) Lemmatize first (using spaCy), then remove with lowercased stoplist
doc = nlp(text)
lemmas = [tok.lemma_ for tok in doc]
filtered_after = [tok for tok in lemmas if tok not in stoplist]
print("Tokens B - remaining after stopword removal - (recommended):", filtered_after)
print("->lemmas:", lemmas)


## 8. Practical notes about the `spacy` pipeline

In [None]:
print(nlp.pipeline)
print(nlp.pipe_names)


In [None]:
# NER
text = "Maria was walking in Paris. That was far from the United States of America"
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

# Running just the tokenizer: no other pipeline component runs,
# so attributes such as lemma_ stay empty
doc_tokens_only = nlp.make_doc(text)
print("Tokens (make_doc):", [t.text for t in doc_tokens_only])
print("Lemma  (make_doc):", [t.lemma_ for t in doc_tokens_only])

# Exercise

## References

https://spacy.io/usage/processing-pipelines