<a href="https://colab.research.google.com/github/is5558/colab_samples/blob/main/tutorials/streamlit_notebooks/SPELL_CHECKER_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SPELL_CHECKER_EN.ipynb)




# **Spell check your text documents**

## 1. Colab Setup

Install dependencies

In [23]:
# Install PySpark and Spark NLP
! pip install -q pyspark spark-nlp

In [24]:
# Show the installed Spark NLP version

!pip show spark-nlp

Name: spark-nlp
Version: 6.0.5
Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
Home-page: https://github.com/JohnSnowLabs/spark-nlp
Author: John Snow Labs
Author-email: 
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: 
Required-by: 
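
The same version information can be read from Python, which helps when a later cell needs a jar that matches the installed package. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Spark NLP.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report both packages installed by the pip cell above.
print("spark-nlp:", installed_version("spark-nlp"))
print("pyspark:", installed_version("pyspark"))
```

A `None` result means the package is not installed in the current runtime, so the install cell should be run again.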


In [None]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

def initialize_spark_nlp():
    try:
        spark = sparknlp.start()
        print("Spark NLP version:", sparknlp.version())
        return spark
    except Exception as e:
        print("Error initializing Spark NLP session:", str(e))
        raise

def load_pipeline(pipeline_name='check_spelling', lang='en'):

    try:
        return PretrainedPipeline(pipeline_name, lang=lang)
    except Exception as e:
        print(f"Error loading pipeline '{pipeline_name}':", str(e))
        raise

def get_corrected_text(annotations):
    try:
        corrected_tokens = [token.result for token in annotations['checked']]
        return " ".join(corrected_tokens).replace(" ,", ",").replace(" .", ".")
    except KeyError:
        print("Error: 'checked' key not found in annotations.")
        return ""

def main():
    text = (
        "Yesturday, I went to the libary to borow a book about anciant civilizations. "
        "The wether was pleasent, so I decidid to walk insted of taking the buss. On the way, "
        "I saw a restuarent that lookt intresting, and I plan to viset it soon."
    )

    try:
        # Initialize Spark NLP and load the pipeline
        spark = initialize_spark_nlp()
        pipeline = load_pipeline()

        # Annotate text
        annotations = pipeline.fullAnnotate(text)[0]

        # Get and print corrected text
        corrected_text = get_corrected_text(annotations)
        print("*"*77)
        print("Original Text:\n", text)
        print("Corrected Text:\n", corrected_text)
        print("*"*77)

    except Exception as e:
        print("An unexpected error occurred:", str(e))

main()

Spark NLP version: 6.0.5
check_spelling download started this may take some time.
Approx size to download 884.9 KB
[OK!]
*****************************************************************************
Original Text:
 Yesturday, I went to the libary to borow a book about anciant civilizations. The wether was pleasent, so I decidid to walk insted of taking the buss. On the way, I saw a restuarent that lookt intresting, and I plan to viset it soon.
Corrected Text:
 Yesterday, I went to the library to borrow a book about ancient civilizations. The whether was pleasant, so I decided to walk instead of taking the bus. On the way, I saw a restuarent that looks interesting, and I plan to visit it soon.
*****************************************************************************
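
The corrected text above still turns `wether` into `whether` and leaves `restuarent` unchanged, so it is worth listing exactly which words the pipeline altered. The sketch below assumes the two strings split into the same number of whitespace-separated words, which holds for this sample but not in general; `changed_words` is a hypothetical helper, not a Spark NLP function.

```python
import string


def changed_words(original, corrected):
    """Return (before, after) pairs for words that differ between two texts.

    Assumes both texts contain the same number of whitespace-separated words.
    """
    orig_words = original.split()
    corr_words = corrected.split()
    if len(orig_words) != len(corr_words):
        raise ValueError("texts do not align word-for-word")
    punct = string.punctuation
    return [
        (before.strip(punct), after.strip(punct))
        for before, after in zip(orig_words, corr_words)
        if before != after
    ]


# One sentence taken from the original and corrected texts shown above.
original = "The wether was pleasent, so I decidid to walk insted of taking the buss."
corrected = "The whether was pleasant, so I decided to walk instead of taking the bus."

for before, after in changed_words(original, corrected):
    print(f"{before} -> {after}")
```

Reviewing the pairs makes context errors such as `wether -> whether` easy to spot by eye.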


In [25]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.5.jar -O spark-nlp-6.0.5.jar


--2025-07-15 13:20:29--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.5.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 3.5.23.15, 54.231.224.192, 16.15.176.114, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|3.5.23.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 656279608 (626M) [application/java-archive]
Saving to: ‘spark-nlp-6.0.5.jar’


2025-07-15 13:20:40 (53.6 MB/s) - ‘spark-nlp-6.0.5.jar’ saved [656279608/656279608]



In [3]:
import os
os.path.exists("/content/spark-nlp-6.0.5.jar")


False
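
The `False` result means no file existed at `/content/spark-nlp-6.0.5.jar` when this cell ran. The `wget -O spark-nlp-6.0.5.jar` command writes to the current working directory, so the jar may be in a different folder, or the download cell may not have run yet in that session. A minimal sketch for locating the file before passing it to `spark.jars`; the helper name `find_jar` and the default search directories are assumptions.

```python
from pathlib import Path


def find_jar(filename, search_dirs=None):
    """Return the absolute path of the first existing match, or None."""
    if search_dirs is None:
        # Assumed defaults: the working directory, then Colab's /content.
        search_dirs = [Path.cwd(), Path("/content")]
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return str(candidate.resolve())
    return None


jar_path = find_jar("spark-nlp-6.0.5.jar")
print(jar_path or "jar not found; run the wget cell first")
```

Passing the returned path to `.config("spark.jars", jar_path)` avoids hard-coding a location that may not exist.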

In [9]:
import sparknlp
from pyspark.sql import SparkSession
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline  # used by load_pipeline() below
from pyspark.ml import Pipeline

def initialize_spark_nlp():
    spark = SparkSession.builder \
        .appName("check_spelling") \
        .config("spark.jars", "/content/spark-nlp-6.0.5.jar") \
        .getOrCreate()
    return spark

def load_pipeline(pipeline_name='check_spelling', lang='en'):
    try:
        return PretrainedPipeline(pipeline_name, lang=lang)
    except Exception as e:
        print(f"Error loading pipeline '{pipeline_name}':", str(e))
        raise

def get_corrected_text(annotations):
    try:
        corrected_tokens = [token.result for token in annotations['checked']]
        return " ".join(corrected_tokens).replace(" ,", ",").replace(" .", ".")
    except KeyError:
        print("Error: 'checked' key not found in annotations.")
        return ""

def main():
    text = (
        "Yesturday, I went to the libary to borow a book about anciant civilizations. "
        "The wether was pleasent, so I decidid to walk insted of taking the buss. On the way, "
        "I saw a restuarent that lookt intresting, and I plan to viset it soon."
    )

    try:
        # Initialize Spark NLP and load the pipeline
        spark = initialize_spark_nlp()
        pipeline = load_pipeline()

        # Annotate text
        annotations = pipeline.fullAnnotate(text)[0]

        # Get and print corrected text
        corrected_text = get_corrected_text(annotations)
        print("*"*77)
        print("Original Text:\n", text)
        print("Corrected Text:\n", corrected_text)
        print("*"*77)

    except Exception as e:
        print("An unexpected error occurred:", str(e))

if __name__ == "__main__":
    main()

check_spelling download started this may take some time.
Approx size to download 884.9 KB
[OK!]
*****************************************************************************
Original Text:
 Yesturday, I went to the libary to borow a book about anciant civilizations. The wether was pleasent, so I decidid to walk insted of taking the buss. On the way, I saw a restuarent that lookt intresting, and I plan to viset it soon.
Corrected Text:
 Yesterday, I went to the library to borrow a book about ancient civilizations. The whether was pleasant, so I decided to walk instead of taking the bus. On the way, I saw a restuarent that looks interesting, and I plan to visit it soon.
*****************************************************************************


In [8]:
import sys
import sparknlp
from pyspark.sql import SparkSession
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import *
from pyspark.ml import Pipeline


# Initialize Spark NLP

def initialize_spark_nlp():
    spark = SparkSession.builder \
        .appName("spellcheck_models") \
        .config("spark.jars", "/content/spark-nlp-6.0.5.jar") \
        .config("spark.driver.memory", "8g") \
        .config("spark.executor.memory", "8g") \
        .getOrCreate()
    return spark

# Global Spark session and DocumentAssembler
spark = initialize_spark_nlp()
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
finisher = Finisher().setInputCols(["spell"])

# Load and define each spell check model pipeline

def load_spellcheck_dl():
    spell_model = ContextSpellCheckerModel.pretrained("spellcheck_dl", lang="en") \
        .setInputCols(["token"]).setOutputCol("spell")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, spell_model, finisher])
    return pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

def load_spellcheck_norvig():
    spell_model = NorvigSweetingModel.pretrained("spellcheck_norvig", lang="en") \
        .setInputCols(["token"]).setOutputCol("spell")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, spell_model, finisher])
    return pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

def load_spellcheck_sd():
    spell_model = SymmetricDeleteModel.pretrained("spellcheck_sd", lang="en") \
        .setInputCols(["token"]).setOutputCol("spell")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, spell_model, finisher])
    return pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

# Helper function to correct and return text

def correct_text(pipeline_model, input_text):
    try:
        df = spark.createDataFrame([[input_text]]).toDF("text")
        result = pipeline_model.transform(df)
        corrected = result.select("finished_spell").first()[0]
        return " ".join(corrected)
    except Exception as e:
        print("Error during correction:", str(e))
        return ""

# Sample usage

def demo_model(model_name):
    sample_text = '''Yesturday, I went to the libary to borow a book about anciant civilizations.
    The wether was pleasent, so I decidid to walk insted of taking the buss. On the way,
    I saw a restuarent that lookt intresting, and I plan to viset it soon. I lke aple. we needto separate the words whereit is needed.'''

    print("\n" + "="*70)
    print(f"Running spell check using: {model_name}")
    print("Original:", sample_text)

    if model_name == "spellcheck_dl":
        model = load_spellcheck_dl()
    elif model_name == "spellcheck_norvig":
        model = load_spellcheck_norvig()
    elif model_name == "spellcheck_sd":
        model = load_spellcheck_sd()
    else:
        print("Invalid model name")
        return

    corrected = correct_text(model, sample_text)
    print("Corrected:", corrected)
    print("="*70)

if __name__ == "__main__":
    demo_model("spellcheck_dl")
    demo_model("spellcheck_norvig")
    # demo_model("spellcheck_sd")


Running spell check using: spellcheck_dl
Original: Yesturday, I went to the libary to borow a book about anciant civilizations.
    The wether was pleasent, so I decidid to walk insted of taking the buss. On the way,
    I saw a restuarent that lookt intresting, and I plan to viset it soon. I lke aple. we needto separate the words whereit is needed.
spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[OK!]
Corrected: Yesterday , I went to the library to borrow a book about ancient civilizations . The weather was pleasant , so I decided to walk instead of taking the busy . On the way , I saw a restaurant that look interesting , and I plan to visit it soon . I like ample . we need separate the words wherein is needed .

Running spell check using: spellcheck_norvig
Original: Yesturday, I went to the libary to borow a book about anciant civilizations.
    The wether was pleasent, so I decidid to walk insted of taking the buss. On the way,
    I saw

In [7]:
import sparknlp
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

def initialize_spark_nlp():
    return SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "4G") \
        .config("spark.jars", "/content/spark-nlp-6.0.5.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1000M") \
        .getOrCreate()

spark = initialize_spark_nlp()

# Spell check pipeline runner
def try_pretrained_spellcheck_pipeline(pipeline_name, text):
    try:
        print("\n" + "=" * 90)
        print(f"Running pipeline: {pipeline_name}")
        print("Original Text:\n", text)

        pipeline = PretrainedPipeline(pipeline_name, lang="en")
        result = pipeline.annotate(text)

        # Dynamic output key resolution
        if "checked" in result:
            corrected_text = result["checked"]
        elif "spell" in result:
            corrected_text = " ".join(result["spell"])
        elif "finished_spell" in result:
            corrected_text = " ".join(result["finished_spell"])
        else:
            corrected_text = "[No corrected output found]"

        print("\nCorrected Output:\n", corrected_text)
        print("=" * 90)

    except Exception as e:
        print(f"[ERROR] Pipeline {pipeline_name} failed. Reason: {e}")

# Sample text with spelling errors
sample_text = '''Yesturday, I went to the libary to borow a book about anciant civilizations.
The wether was pleasent, so I decidid to walk insted of taking the buss. On the way,
I saw a restuarent that lookt intresting, and I plan to viset it soon. I lke aple. we needto separate the words whereit is needed.'''

pipelines = [
    "check_spelling",
    # "check_spelling_dl",
    "spellcheck_dl_pipeline",
]

for pipeline_name in pipelines:
    try_pretrained_spellcheck_pipeline(pipeline_name, sample_text)


Running pipeline: check_spelling
Original Text:
 Yesturday, I went to the libary to borow a book about anciant civilizations.
The wether was pleasent, so I decidid to walk insted of taking the buss. On the way,
I saw a restuarent that lookt intresting, and I plan to viset it soon. I lke aple. we needto separate the words whereit is needed.
check_spelling download started this may take some time.
Approx size to download 884.9 KB
[OK!]

Corrected Output:
 ['Yesterday', ',', 'I', 'went', 'to', 'the', 'library', 'to', 'borrow', 'a', 'book', 'about', 'ancient', 'civilizations', '.', 'The', 'whether', 'was', 'pleasant', ',', 'so', 'I', 'decided', 'to', 'walk', 'instead', 'of', 'taking', 'the', 'bus', '.', 'On', 'the', 'way', ',', 'I', 'saw', 'a', 'restuarent', 'that', 'looks', 'interesting', ',', 'and', 'I', 'plan', 'to', 'visit', 'it', 'soon', '.', 'I', 'lke', 'able', '.', 'we', 'needto', 'separate', 'the', 'words', 'wherein', 'is', 'needed', '.']
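
The output above is a Python list because the `checked` branch of the runner prints the value returned by `annotate()` as-is, while the other branches join their tokens first. A minimal sketch for turning such a token list back into readable text; `detokenize` is a hypothetical helper, and its punctuation set is an assumption that covers only common English marks.

```python
def detokenize(tokens):
    """Join tokens with spaces, attaching common punctuation to the previous word."""
    no_space_before = {",", ".", "!", "?", ";", ":"}
    text = ""
    for token in tokens:
        # Add a separating space unless this token is trailing punctuation.
        if text and token not in no_space_before:
            text += " "
        text += token
    return text


# The first tokens of the list printed above, shortened for illustration.
tokens = ["Yesterday", ",", "I", "went", "to", "the", "library", "."]
print(detokenize(tokens))
```

Applying this to `result["checked"]` would print a sentence rather than a list of tokens.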

Running pipeline: spellcheck_dl_pipeline
