# Module 02: Data Preprocessing Walkthrough

This notebook walks through the preprocessing steps for Romanian text: cleaning, date normalization, stemming versus lemmatization, stopword removal, and a comparison of the Stanza, spaCy, and NLTK pipelines.

In [20]:
# Set up the environment before importing the project modules
import sys
from pathlib import Path

# Add the parent directory to path for imports
sys.path.insert(0, str(Path.cwd().parent))

In [21]:
from preprocessing.cleaner import clean_text, parse_romanian_date
from preprocessing.nlp_pipeline import RomanianNLP, get_romanian_stopwords

## 1. Text Cleaning and Date Normalization
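`clean_text` applies custom cleaning logic: in the example below it trims the surrounding whitespace and replaces the legacy cedilla character `ţ` with the standard comma-below `ț`. `parse_romanian_date` relies on `dateparser` to turn a Romanian date string into an ISO 8601 date.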

In [22]:
sample_text = "  În perioada 22 decembrie 2025, comisarii ANPC au aplicat sancțiuni în toată ţara.  "
cleaned = clean_text(sample_text)
date_iso = parse_romanian_date("22 decembrie 2025")

print(f"Original: '{sample_text}'")
print(f"Cleaned:  '{cleaned}'")
print(f"Date ISO: {date_iso}")

Original: '  În perioada 22 decembrie 2025, comisarii ANPC au aplicat sancțiuni în toată ţara.  '
Cleaned:  'În perioada 22 decembrie 2025, comisarii ANPC au aplicat sancțiuni în toată țara.'
Date ISO: 2025-12-22
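
The behaviour shown in the output above can be approximated without the project modules. The following is a minimal sketch, not the actual `preprocessing.cleaner` code: `clean_text_sketch` and `parse_romanian_date_sketch` are hypothetical names, and the date parser uses a hand-written month table instead of `dateparser` so that it runs with the standard library only.

```python
import re
import unicodedata
from datetime import date
from typing import Optional

# Legacy cedilla characters mapped to the standard Romanian comma-below forms.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})

# Romanian month names (assumption: lowercase, written with diacritics-free spelling).
_RO_MONTHS = {
    "ianuarie": 1, "februarie": 2, "martie": 3, "aprilie": 4,
    "mai": 5, "iunie": 6, "iulie": 7, "august": 8,
    "septembrie": 9, "octombrie": 10, "noiembrie": 11, "decembrie": 12,
}


def clean_text_sketch(text: str) -> str:
    """Normalize diacritics and whitespace (hypothetical stand-in for clean_text)."""
    text = unicodedata.normalize("NFC", text)   # compose any decomposed characters
    text = text.translate(_CEDILLA_TO_COMMA)    # cedilla -> comma-below
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


def parse_romanian_date_sketch(text: str) -> Optional[str]:
    """Parse 'DD <month name> YYYY' into an ISO date string, or return None."""
    match = re.search(r"(\d{1,2})\s+([a-z]+)\s+(\d{4})", text.lower())
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = _RO_MONTHS.get(month_name)
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:  # e.g. day 31 in a 30-day month
        return None
```

Unlike `dateparser`, this sketch handles only one date layout; it is meant to show the idea of mapping month names to numbers, not to replace the library.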


## 2. Stemming vs Lemmatization

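**Stemming** is a rule-based process that strips suffixes to approximate a word's root. The result need not be a real word: in the output below, `oameni` becomes `oamen` and `munte` becomes `munt`.

**Lemmatization** relies on vocabulary and morphological analysis (here, Stanza's Romanian models) to return the canonical dictionary form, or lemma: `oameni` becomes `om` and `sunt` becomes `fi`.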

In [23]:
nlp = RomanianNLP()
demo_sentence = "Românii sunt oameni ospitalieri și merg la munte."
nlp.compare_stemming_vs_lemmatization(demo_sentence)

2025-12-22 20:43:12 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-12-22 20:43:12 INFO: Downloaded file to /home/marius/stanza_resources/resources.json
2025-12-22 20:43:12 INFO: Loading these models for language: ro (Romanian):
| Processor | Package      |
----------------------------
| tokenize  | rrt          |
| pos       | rrt_nocharlm |
| lemma     | rrt_nocharlm |

2025-12-22 20:43:12 INFO: Using device: cpu
2025-12-22 20:43:12 INFO: Loading: tokenize
2025-12-22 20:43:12 INFO: Loading: pos
2025-12-22 20:43:14 INFO: Loading: lemma
2025-12-22 20:43:14 INFO: Done loading processors!


[{'text': 'Românii', 'stem': 'român', 'lemma': 'român'},
 {'text': 'sunt', 'stem': 'sunt', 'lemma': 'fi'},
 {'text': 'oameni', 'stem': 'oamen', 'lemma': 'om'},
 {'text': 'ospitalieri', 'stem': 'ospitalier', 'lemma': 'ospitaliar'},
 {'text': 'și', 'stem': 'și', 'lemma': 'și'},
 {'text': 'merg', 'stem': 'merg', 'lemma': 'merge'},
 {'text': 'la', 'stem': 'la', 'lemma': 'la'},
 {'text': 'munte', 'stem': 'munt', 'lemma': 'munte'},
 {'text': '.', 'stem': '.', 'lemma': '.'}]
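
The difference between the two approaches can be illustrated with a toy example. This is only a sketch under strong assumptions: the suffix list and the lemma table below are invented for illustration, and they are neither the stemmer nor the Stanza lemmatizer used by `RomanianNLP`.

```python
# Toy suffix list, longest first (assumption: illustrative, far from complete).
TOY_SUFFIXES = ("ilor", "ului", "ii", "le", "i", "e")

# Toy lemma dictionary (assumption: a real lemmatizer covers the whole vocabulary).
TOY_LEMMAS = {"românii": "român", "sunt": "fi", "oameni": "om", "merg": "merge"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in TOY_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the lowercased word."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


for token in ["Românii", "sunt", "oameni", "munte"]:
    print({"text": token, "stem": toy_stem(token), "lemma": toy_lemmatize(token)})
```

The stemmer only sees the surface string, so it cannot relate `sunt` to `fi` or `oameni` to `om`; the dictionary lookup can, but only for words it knows.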

## 3. Romanian Stopwords
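Stopwords are very frequent words (articles, prepositions, pronouns, and similar) that carry little semantic meaning on their own, so they are often removed before further analysis.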

In [24]:
stopwords = get_romanian_stopwords()
print(f"Found {len(stopwords)} Romanian stopwords.")
print(f"Top 20: {stopwords[:20]}")

Found 356 Romanian stopwords.
Top 20: ['a', 'abia', 'acea', 'aceasta', 'această', 'aceea', 'aceeasi', 'acei', 'aceia', 'acel', 'acela', 'acelasi', 'acele', 'acelea', 'acest', 'acesta', 'aceste', 'acestea', 'acestei', 'acestia']
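
Once the list is available, filtering tokens is a membership test. The sketch below assumes a small hand-picked subset of the list printed above and a hypothetical helper name, `remove_stopwords`; converting the list to a `set` keeps each lookup constant-time.

```python
from typing import Iterable, List

# Assumption: a tiny subset of the stopword list, for illustration only.
SAMPLE_STOPWORDS = {"a", "acea", "această", "acest", "aceste"}


def remove_stopwords(tokens: Iterable[str], stopwords: Iterable[str]) -> List[str]:
    """Return the tokens whose lowercased form is not a stopword."""
    stopword_set = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stopword_set]


tokens = ["Această", "amendă", "a", "fost", "aplicată"]
print(remove_stopwords(tokens, SAMPLE_STOPWORDS))
```

Lowercasing both sides matters here because sentence-initial words such as `Această` would otherwise slip through.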


## 4. Comparative NLP Pipelines
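The same sentence is processed with Stanza, spaCy, and NLTK. Stanza and spaCy return a lemma and a part-of-speech tag for each token, while the NLTK path returns stems only.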

In [25]:
print("\nStanza Processing:")
stanza_results = nlp.process_with_stanza("Consumatorii au drepturi.")
for res in stanza_results:
    print(res)

print("\nSpaCy Processing:")
spacy_results = nlp.process_with_spacy("Consumatorii au drepturi.")
for res in spacy_results:
    print(res)

print("\nNLTK Processing (Stemming):")
nltk_results = nlp.process_with_nltk("Consumatorii au drepturi.")
for res in nltk_results:
    print(res)


Stanza Processing:
{'text': 'Consumatorii', 'lemma': 'consumator', 'pos': 'NOUN'}
{'text': 'au', 'lemma': 'avea', 'pos': 'VERB'}
{'text': 'drepturi', 'lemma': 'drept', 'pos': 'NOUN'}
{'text': '.', 'lemma': '.', 'pos': 'PUNCT'}

SpaCy Processing:
{'text': 'Consumatorii', 'lemma': 'consumator', 'pos': 'NOUN'}
{'text': 'au', 'lemma': 'avea', 'pos': 'AUX'}
{'text': 'drepturi', 'lemma': 'drept', 'pos': 'NOUN'}
{'text': '.', 'lemma': '.', 'pos': 'PUNCT'}

NLTK Processing (Stemming):
{'text': 'Consumatorii', 'stem': 'consum'}
{'text': 'au', 'stem': 'au'}
{'text': 'drepturi', 'stem': 'dreptur'}
{'text': '.', 'stem': '.'}
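
The Stanza and spaCy outputs above agree on every lemma but differ on one part-of-speech tag. A small helper can surface such disagreements automatically. This is a sketch: `diff_pos_tags` is a hypothetical function, the two lists are copied from the output above, and it assumes both pipelines tokenize the sentence identically.

```python
from typing import Dict, List, Tuple

stanza_tokens = [
    {"text": "Consumatorii", "lemma": "consumator", "pos": "NOUN"},
    {"text": "au", "lemma": "avea", "pos": "VERB"},
    {"text": "drepturi", "lemma": "drept", "pos": "NOUN"},
    {"text": ".", "lemma": ".", "pos": "PUNCT"},
]
spacy_tokens = [
    {"text": "Consumatorii", "lemma": "consumator", "pos": "NOUN"},
    {"text": "au", "lemma": "avea", "pos": "AUX"},
    {"text": "drepturi", "lemma": "drept", "pos": "NOUN"},
    {"text": ".", "lemma": ".", "pos": "PUNCT"},
]


def diff_pos_tags(
    left: List[Dict[str, str]], right: List[Dict[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (text, left_pos, right_pos) for tokens whose POS tags differ."""
    if len(left) != len(right):
        raise ValueError("Token counts differ; the tokenizations are not aligned.")
    differences = []
    for a, b in zip(left, right):
        if a["text"] != b["text"]:
            raise ValueError(f"Misaligned tokens: {a['text']!r} vs {b['text']!r}")
        if a["pos"] != b["pos"]:
            differences.append((a["text"], a["pos"], b["pos"]))
    return differences


print(diff_pos_tags(stanza_tokens, spacy_tokens))
```

For this sentence the only disagreement is `au`, tagged `VERB` by Stanza and `AUX` by spaCy.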


## 5. Dataset Processing
Finally, we apply our pipeline to the entire dataset.
The `process_dataset` function orchestrates cleaning, date parsing, and lemmatization for all articles.
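
The orchestration can be pictured roughly as follows. This is a sketch, not the code in `preprocessing.process_dataset`: the function name `process_articles_sketch`, the injected callables, and the raw `date` field are assumptions, while the output field names are taken from the processed example shown at the end of this notebook. Dropping punctuation and lowercasing the lemmas is likewise inferred from that example.

```python
from typing import Callable, Dict, List, Optional

Token = Dict[str, str]  # expected keys: 'text', 'lemma', 'pos'


def process_articles_sketch(
    articles: List[dict],
    clean: Callable[[str], str],
    parse_date: Callable[[str], Optional[str]],
    analyze: Callable[[str], List[Token]],
) -> List[dict]:
    """Clean, date-normalize, and lemmatize each article; return enriched copies."""
    processed = []
    for article in articles:
        enriched = dict(article)  # keep the raw fields untouched
        enriched["title_cleaned"] = clean(article.get("title", ""))

        raw_date = article.get("date")  # assumption: hypothetical raw date field
        enriched["date_iso"] = parse_date(raw_date) if raw_date else None

        tokens = analyze(clean(article.get("content", "")))
        enriched["content_tokens"] = tokens
        # Join the lemmas of all non-punctuation tokens into one string.
        enriched["lemmatized_content"] = " ".join(
            token["lemma"].lower() for token in tokens if token["pos"] != "PUNCT"
        )
        processed.append(enriched)
    return processed
```

Passing the cleaning, date-parsing, and analysis steps in as callables keeps the loop independent of any particular NLP library and makes it easy to test with stubs.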

In [26]:
import json
from preprocessing.process_dataset import INPUT_FILE, process_dataset

# Let's look at one raw article before processing
if INPUT_FILE.exists():
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        raw_data = json.load(f)
        if raw_data:
            print("Raw Article Example:")
            print(json.dumps(raw_data[0], indent=2, ensure_ascii=False)[:500] + "...")

Raw Article Example:
{
  "url": "https://anpc.ro/comandament-anpc-in-zonele-turistice/",
  "title": "Comandament ANPC în zonele turistice",
  "content": "În perioada 15.12.2025–21.12.2025, Autoritatea Națională pentru Protecția Consumatorilor, prin Comandamentul de Iarnă 2025, a verificat activitatea a peste 470 de operatori economici din stațiunile și zonele turistice. ANPC este prezent în toate zonele turistice pe întreaga durată a sezonului sărbătorilor de iarnă, pentru a garanta protecția și drepturile consumato...


In [27]:
# Run the full pipeline
process_dataset()

2025-12-22 20:43:15 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Loading dataset from /home/marius/ore/inlp/gh/inlp/01_data_collection/data/processed/articles_anpc.json...
Loaded 237 articles.
Initialising Stanza NLP pipeline for Romanian...


2025-12-22 20:43:15 INFO: Downloaded file to /home/marius/stanza_resources/resources.json
2025-12-22 20:43:15 INFO: Loading these models for language: ro (Romanian):
| Processor | Package      |
----------------------------
| tokenize  | rrt          |
| pos       | rrt_nocharlm |
| lemma     | rrt_nocharlm |

2025-12-22 20:43:15 INFO: Using device: cpu
2025-12-22 20:43:15 INFO: Loading: tokenize
2025-12-22 20:43:15 INFO: Loading: pos
2025-12-22 20:43:16 INFO: Loading: lemma
2025-12-22 20:43:17 INFO: Done loading processors!


Processing articles...


100%|██████████| 237/237 [01:41<00:00,  2.33it/s]


Success! Processed 237 articles.
Saved to /home/marius/ore/inlp/gh/inlp/02_data_preprocessing/data/processed/articles_anpc_preprocessed.json


In [28]:
# Look at the processed result
OUTPUT_FILE = Path.cwd().parent / "data" / "processed" / "articles_anpc_preprocessed.json"
if OUTPUT_FILE.exists():
    with open(OUTPUT_FILE, "r", encoding="utf-8") as f:
        processed_data = json.load(f)
        if processed_data:
            print("\nProcessed Article Example:")
            # Show the new fields added during preprocessing
            example = processed_data[0]
            print(f"Title Cleaned: {example.get('title_cleaned')}")
            print(f"Date ISO: {example.get('date_iso')}")
            # Fall back to empty values so a missing field does not raise a TypeError
            print(f"Tokens (first 5): {(example.get('content_tokens') or [])[:5]}")
            print(f"Lemmatized Content (snippet): {(example.get('lemmatized_content') or '')[:200]}...")


Processed Article Example:
Title Cleaned: Comandament ANPC în zonele turistice
Date ISO: 2025-12-22
Tokens (first 5): [{'text': 'În', 'lemma': 'în', 'pos': 'ADP'}, {'text': 'perioada', 'lemma': 'perioadă', 'pos': 'NOUN'}, {'text': '15.12.2025–21.12.2025', 'lemma': '15.12.2025–21.12.2025', 'pos': 'NUM'}, {'text': ',', 'lemma': ',', 'pos': 'PUNCT'}, {'text': 'Autoritatea', 'lemma': 'autoritate', 'pos': 'NOUN'}]
Lemmatized Content (snippet): în perioadă 15.12.2025–21.12.2025 autoritate național pentru protecție consumator prin comandament de iarnă 2025 avea verifica activitate al peste 470 de operator economic din stațiune și zonă turisti...
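
Inspecting a single record is a useful spot check, but it says nothing about the other articles. A simple completeness check over the whole file might look like the sketch below; `find_incomplete_records` is a hypothetical helper, and the required field names are the ones printed above.

```python
from typing import Iterable, List, Sequence, Tuple

REQUIRED_FIELDS = ("title_cleaned", "date_iso", "content_tokens", "lemmatized_content")


def find_incomplete_records(
    records: Iterable[dict], required: Sequence[str] = REQUIRED_FIELDS
) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_fields) for every record with empty or absent fields."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing when it is absent, None, or empty.
        missing = [name for name in required if not record.get(name)]
        if missing:
            problems.append((index, missing))
    return problems


sample = [
    {"title_cleaned": "A", "date_iso": "2025-12-22",
     "content_tokens": [{"text": "În"}], "lemmatized_content": "în"},
    {"title_cleaned": "B", "date_iso": None,
     "content_tokens": [], "lemmatized_content": ""},
]
print(find_incomplete_records(sample))
```

Running it on `processed_data` would list the articles, if any, for which date parsing or lemmatization produced no result.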
