# Spanish NLP: Spell Checking Notebook

This notebook demonstrates how to use the `SpanishSpellChecker` class from the `spanish_nlp` library.

It supports multiple spell checking methods:
*   `dictionary`: Uses `pyspellchecker`, which relies on dictionary lookups and edit distance.
*   `contextual_lm`: Uses a transformer-based masked language model (like BETO) for context-aware corrections.

For more information, visit the [spanish_nlp](https://github.com/jorgeortizfuentes/spanish_nlp) repository on GitHub.

## Setup

Import the necessary class and configure logging to see informational messages.

In [1]:
import logging
from spanish_nlp import SpanishSpellChecker

# Configure logging to see messages from the library
logging.basicConfig(level=logging.INFO)
# You might want to set a higher level (e.g., logging.WARNING) for less verbose output
# logging.basicConfig(level=logging.WARNING)
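
As an optional alternative, the sketch below quiets only the library's loggers while leaving the rest of the notebook at `INFO`. The logger name `spanish_nlp` is inferred from the log prefixes printed in this notebook (for example `spanish_nlp.spellchecker`), so treat it as an assumption. It also illustrates that a second `logging.basicConfig` call has no effect once the root logger has handlers, unless `force=True` is passed (Python 3.8+).

```python
import logging

# Root configuration: INFO and above for everything.
# force=True replaces handlers installed by an earlier basicConfig call;
# without it, a repeated basicConfig call is silently ignored.
logging.basicConfig(level=logging.INFO, force=True)

# Raise the threshold only for the library's logger hierarchy.
# "spanish_nlp" is assumed from the log prefixes shown in this notebook.
logging.getLogger("spanish_nlp").setLevel(logging.WARNING)

# Child loggers inherit the effective level from their parent.
child = logging.getLogger("spanish_nlp.spellchecker")
print(child.isEnabledFor(logging.INFO))     # False: INFO records are dropped
print(child.isEnabledFor(logging.WARNING))  # True: warnings still get through
```

With this configuration the `INFO` lines shown in the outputs below would no longer appear.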

## Method 1: Dictionary-Based Spell Checker (`method='dictionary'`)

In [2]:
try:
    # Initialize with default settings (language='es', distance=2)
    dict_checker = SpanishSpellChecker(method="dictionary")
    print(f"Initialized: {dict_checker.get_implementation_details()}")
except Exception as e:
    print(f"Error initializing dictionary checker: {e}")

INFO:spanish_nlp.spellchecker:Initializing SpanishSpellChecker with method: 'dictionary'
INFO:spanish_nlp.spellchecker.dictionary_impl:DictionarySpellChecker initialized for language 'es' with distance 2.


Initialized: Using implementation: DictionarySpellChecker


In [14]:
text_simple = "hola komo stas?"

if "dict_checker" in locals():
    print(f"Original Text: {text_simple}")

    # Find potential errors
    errors = dict_checker.find_errors(text_simple)
    print(f"Potential Errors: {errors}")

    # Check a specific word
    print(f"Is 'stás' correct? {dict_checker.is_correct('stás')}")
    print(f"Is 'hola' correct? {dict_checker.is_correct('hola')}")

    # Get suggestions for a word
    suggestions = dict_checker.suggest("pruevs")
    print(f"Suggestions for 'pruevs': {suggestions}")

    # Get the single best correction for a word
    correction = dict_checker.correct_word("testo")
    print(f"Correction for 'testo': {correction}")

    # Correct the entire text (use with caution)
    corrected_text = dict_checker.correct_text(text_simple)
    print(f"Corrected Text: {corrected_text}")

Original Text: hola komo stas?
Potential Errors: ['komo', 'stas']
Is 'stás' correct? False
Is 'hola' correct? True
Suggestions for 'pruevs': ['pues', 'prueba']
Correction for 'testo': esto
Corrected Text: hola como estas?
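
To make "dictionary lookups and edit distance" concrete, here is a minimal sketch of the general technique, in the style popularized by Peter Norvig's spelling corrector. It is not the library's or `pyspellchecker`'s actual code: the vocabulary, the frequencies, and the function names `edits1`, `candidates`, and `correct_word` are all hypothetical and chosen only for illustration.

```python
import string

# Hypothetical mini-vocabulary with made-up frequencies, for illustration only.
VOCAB = {"hola": 50, "como": 90, "estas": 30, "esto": 80, "prueba": 20, "pues": 60}
LETTERS = string.ascii_lowercase + "áéíóúüñ"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, distance=2):
    """Known words within `distance` edits, preferring the smallest distance."""
    if word in VOCAB:
        return {word}
    frontier = {word}
    for _ in range(distance):
        # Expand every string in the frontier by one more edit.
        frontier = {e for w in frontier for e in edits1(w)}
        known = {w for w in frontier if w in VOCAB}
        if known:
            return known
    return set()


def correct_word(word, distance=2):
    """Most frequent candidate, or the word itself when nothing is close enough."""
    found = candidates(word, distance)
    return max(found, key=VOCAB.get) if found else word


print(correct_word("komo"))          # como
print(correct_word("testo"))         # esto
print(sorted(candidates("pruevs")))  # ['prueba', 'pues']
```

Because the choice among candidates depends only on word frequency, this kind of checker cannot use the surrounding sentence, which is why `correct_text` should be used with caution.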


### Using Custom Distance

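The `distance` parameter sets the maximum Levenshtein (edit) distance allowed between a misspelled word and the suggestions returned for it. A lower value is stricter: with `distance=1`, only words a single edit away are suggested.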

In [None]:
try:
    # Initialize with a stricter distance (distance=1)
    dict_checker_strict = SpanishSpellChecker(method="dictionary", distance=1)
    
    word_to_check = "pruevs"
    suggestions_default = dict_checker.suggest(word_to_check) # Using default checker (distance=2)
    suggestions_strict = dict_checker_strict.suggest(word_to_check)
    
    print(f"Suggestions for '{word_to_check}' (distance=2): {suggestions_default}")
    print(f"Suggestions for '{word_to_check}' (distance=1): {suggestions_strict}")
    
except Exception as e:
    print(f"Error initializing strict distance checker: {e}")
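
To show what the distance threshold measures, the following is a small standalone Levenshtein implementation using the standard dynamic-programming recurrence. It is independent of the library (the function name `levenshtein` is my own) and is applied to the word pairs that appear in this notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


pairs = [("komo", "como"), ("testo", "esto"), ("pruevs", "prueba"), ("pruevs", "pues")]
for typo, target in pairs:
    print(f"{typo} -> {target}: {levenshtein(typo, target)}")
# komo -> como: 1
# testo -> esto: 1
# pruevs -> prueba: 2
# pruevs -> pues: 2
```

Both candidates for `pruevs` sit at distance 2, so a checker limited to `distance=1` would be expected to return neither of them.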

### Using Custom Dictionary

Supplying a custom word list extends the checker's dictionary, so correctly spelled domain-specific terms and names are not flagged as errors.

In [18]:
try:
    # Example adding custom words (using default distance=2)
    custom_words = ["levenshtein", "nlp", "pythonista"]
    dict_checker_custom = SpanishSpellChecker(
        method="dictionary", custom_dictionary=custom_words
    )

    text_custom = "hola levenshtein, me gusta el nlp y soy un buen pythonista."
    print(f"\nOriginal Text: {text_custom}")
    errors_custom = dict_checker_custom.find_errors(text_custom)
    print(f"Potential Errors (custom dict): {errors_custom}") # Should be empty

except Exception as e:
    print(f"Error initializing custom dictionary checker: {e}")

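Note: the output below was captured from an earlier run of this cell (one custom word, `distance=1`, and a different input text), so it does not match the code above.

INFO:spanish_nlp.spellchecker:Initializing SpanishSpellChecker with method: 'dictionary'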
INFO:spanish_nlp.spellchecker.dictionary_impl:Loading 1 words from custom list.
INFO:spanish_nlp.spellchecker.dictionary_impl:DictionarySpellChecker initialized for language 'es' with distance 1.



Original Text: hola levenshtein komo stas?
Potential Errors (custom dict): ['komo', 'stas']
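
The effect of a custom word list can be sketched without the library as a plain set-membership test. Everything here is hypothetical: `BASE_WORDS` stands in for the built-in Spanish dictionary, and `find_errors` is a simplified stand-in that only mirrors the name of the library method.

```python
import re

# Hypothetical stand-in for the built-in dictionary.
BASE_WORDS = {"hola", "me", "gusta", "el", "y", "soy", "un", "buen"}


def find_errors(text, known_words):
    """Return lowercased alphabetic tokens that are missing from known_words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    return [t for t in tokens if t not in known_words]


text = "hola levenshtein, me gusta el nlp y soy un buen pythonista."

# Without the custom words, the domain terms are flagged.
print(find_errors(text, BASE_WORDS))
# ['levenshtein', 'nlp', 'pythonista']

# After merging the custom list into the known words, nothing is flagged.
custom_words = ["levenshtein", "nlp", "pythonista"]
print(find_errors(text, BASE_WORDS | set(custom_words)))
# []
```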


## Method 2: Contextual Language Model (`method='contextual_lm'`)

This method uses a transformer model (like BETO) to take the surrounding context into account. It is generally better at catching real-word errors, that is, correctly spelled words used in the wrong place (e.g., homophones), but it is computationally more expensive.

In [11]:
try:
    # Initialize using BETO model, auto-detect device (GPU if available)
    # You can specify device='cpu' or device=0 (for first GPU)
    lm_checker = SpanishSpellChecker(
        method="contextual_lm",
        model_name="dccuchile/bert-base-spanish-wwm-uncased",  # BETO
        top_k=5,  # Number of candidates model considers internally
        suggestion_distance_threshold=2,  # Filter suggestions by Levenshtein distance
    )
    print(f"Initialized: {lm_checker.get_implementation_details()}")

except Exception as e:
    print(f"Error initializing contextual checker: {e}")
    print(
        "Make sure 'transformers', 'torch' (or 'tensorflow'), 'accelerate', and 'python-Levenshtein' are installed."
    )

INFO:spanish_nlp.spellchecker:Initializing SpanishSpellChecker with method: 'contextual_lm'
INFO:spanish_nlp.spellchecker.contextual_lm_impl:Initializing ContextualLMSpellChecker with model 'dccuchile/bert-base-spanish-wwm-uncased' on device 'cpu'.
INFO:spanish_nlp.spellchecker.contextual_lm_impl:ContextualLMSpellChecker initialized successfully.


Initialized: Using implementation: ContextualLMSpellChecker
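
The role of `top_k` and `suggestion_distance_threshold` can be illustrated with a rough sketch of mask-and-filter correction: mask one word, ask a masked language model for its top candidates, then keep only candidates close to the original spelling. This is an assumption about the general technique, not the library's implementation. The names `suggest_in_context`, `make_hf_predictor`, and `stub_predict` are hypothetical, the `[MASK]` token is the BERT convention, and the stub's candidate list is made up so the example runs without downloading a model.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def suggest_in_context(words, index, predict, top_k=5, max_distance=2):
    """Mask words[index], get top_k candidates, keep those near the original word."""
    masked = words[:index] + ["[MASK]"] + words[index + 1:]
    found = predict(" ".join(masked), top_k)
    return [c for c in found if levenshtein(c, words[index]) <= max_distance]


def make_hf_predictor(model_name):
    """Optional: wrap a Hugging Face fill-mask pipeline (needs transformers + torch)."""
    from transformers import pipeline  # imported lazily so the sketch runs without it

    fill = pipeline("fill-mask", model=model_name)

    def predict(masked_text, top_k):
        # Each result is a dict; "token_str" holds the predicted token text.
        return [r["token_str"].strip() for r in fill(masked_text, top_k=top_k)]

    return predict


def stub_predict(masked_text, top_k):
    """Stand-in for a real model: returns a fixed, made-up candidate list."""
    return ["sé", "se", "puedo", "quiero", "sabe"][:top_k]


words = "No ce si ir al cine .".split()
print(suggest_in_context(words, 1, stub_predict))
# ['sé', 'se']
```

To try it with a real model, `make_hf_predictor("dccuchile/bert-base-spanish-wwm-uncased")` could be passed in place of `stub_predict`; the candidates it returns will depend on the model.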


In [24]:
text_context_1 = "Ola komo estas?"
text_context_2 = "El tubo se rompio."
text_context_3 = "No ce si ir al cine."

if "lm_checker" in locals():
    print(f"Original 1: {text_context_1}")
    # is_correct only checks membership in the model vocabulary (no context), so 'ha' may pass
    print(f"Is 'ha' in vocab? {lm_checker.is_correct('ha')}")
    # correct_text uses context
    corrected_1 = lm_checker.correct_text(text_context_1)
    print(f"Corrected 1: {corrected_1}")

    print(f"\nOriginal 2: {text_context_2}")
    corrected_2 = lm_checker.correct_text(text_context_2)
    print(f"Corrected 2: {corrected_2}")  # Model might choose 'tubo' or 'tuvo'

    print(f"\nOriginal 3: {text_context_3}")
    corrected_3 = lm_checker.correct_text(text_context_3)
    print(f"Corrected 3: {corrected_3}")

Original 1: Ola komo estas?
Is 'ha' in vocab? True
Corrected 1: Ola komo estas?

Original 2: El tubo se rompio.
Corrected 2: El tubo se rompio.

Original 3: No ce si ir al cine.
Corrected 3: Ce se si ir al cine.
