
In [4]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4
! pip install --ignore-installed -q spark-nlp==2.7.1

import sparknlp

# for GPU training: sparknlp.start(gpu=True)
# for Spark 2.3: sparknlp.start(spark23=True)
spark = sparknlp.start()

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)

spark

openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
Spark NLP version 2.7.1
Apache Spark version: 2.4.4


In [5]:
from sparknlp.common import *
from IPython.utils.text import columnize

In [6]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")\
  .setPrefixes(["\"", "(", "[", "\n"])\
  .setSuffixes([".", ",", "?", ")","!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")\
    .setErrorThreshold(4.0)\
    .setTradeoff(6.0)

spellcheck_dl download started this may take some time.
Approximate size to download 112 MB
[OK!]


In [7]:
finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

In [8]:
lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")

{'checked': ['Please',
  'allow',
  'me',
  'to',
  'introduce',
  'myself',
  ',',
  'I',
  'am',
  'a',
  'man',
  'of',
  'wealth',
  'and',
  'taste']}

### Word Level Corrections

In [9]:
# First let's start with a loaded model, and check which classes it has been trained with
spellModel.getWordClasses()

['(_AGE_,RegexParser)',
 '(_NUM_,RegexParser)',
 '(_LOC_,VocabParser)',
 '(_DATE_,RegexParser)',
 '(_NAME_,VocabParser)']

In [10]:
beautify = lambda annotations: [columnize(sent['checked']) for sent in annotations]

In [11]:
# Foreign name without errors
sample = 'We are going to meet Jowita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Moita  in  the  city  hall  .\n']

The result is not very good because the Spell Checker was trained mainly on American English text, so it does not recognize the name Jowita. At least the surrounding words help it pick a correction that is itself a name. We can do better; let's see how.

### Updating a predefined word class
### Vocabulary Classes
To let the Spell Checker preserve words such as a foreign name, we can update an existing class so that it covers the new words.

In [12]:
# add some more, in case we need them
spellModel.updateVocabClass('_NAME_', ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania'], True)

# Let's see what we get now
sample = 'We are going to meet Jowita at the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Jowita  at  the  city  hall  .\n']

Much better, right? Now suppose that we want not only to preserve the word, but also to propose meaningful corrections when our foreign friend's name is misspelled.

In [13]:
# Foreign name with an error
sample = 'We are going to meet Jovita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Jowita  in  the  city  hall  .\n']

Here we were able to add the new word to the class and have corrections proposed for it. The new word has also been treated as a name, meaning that the model used information about the typical context for names to produce the best correction.
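To make the idea of "proposing corrections from a class" more concrete, here is a minimal sketch in plain Python. It is not how the Spell Checker is implemented; it only illustrates candidate lookup by edit distance within a vocabulary, and it ignores the context that the real model also uses. The helper names `levenshtein` and `closest_in_class`, and the `max_dist` cutoff, are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def closest_in_class(word, vocab, max_dist=2):
    """Return the vocabulary entry nearest to `word`, or None if all are too far."""
    best = min(vocab, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else None


# the same names that were added to _NAME_ above
names = ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania']
print(closest_in_class('Jovita', names))  # Jowita
```

"Jovita" is one substitution away from "Jowita" and at least three edits away from every other name in the list, which is why it is the only reasonable candidate from this class.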

### Regex Classes
We can do something similar for classes defined by a regex. For example, we can add a regex for a special date format, so that the model not only preserves dates in that format but can also correct them.

In [14]:
# Date with custom format
sample = 'We are going to meet her in the city hall on february-3.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  all  on  February  .\n']

In [15]:
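# this is a sample regex, for simplicity not covering all months
# note: [0-31] is a character class, so it matches a single digit from 0 to 3, not the days 0 to 31
spellModel.updateRegexClass('_DATE_', '(january|february|march)-[0-31]')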
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  all  on  february-3  .\n']
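To see which strings the sample pattern accepts, we can try it with Python's own `re` module. This is only a sketch: it assumes that the character-class syntax means the same thing in the Spell Checker's regex parser as in Python, which is not verified here. The `stricter` pattern is a hypothetical alternative for days 1 to 31 and is not part of the original notebook.

```python
import re

# the pattern passed to updateRegexClass above
original = re.compile(r'(january|february|march)-[0-31]')

# [0-31] is the range 0-3 plus the literal 1, so exactly one digit is accepted
print(bool(original.fullmatch('february-3')))    # True
print(bool(original.fullmatch('february-14')))   # False
print(bool(original.fullmatch('february-7')))    # False

# hypothetical stricter pattern: days 1 to 31, still a finite set of strings
stricter = re.compile(r'(january|february|march)-([1-9]|[12][0-9]|3[01])')
print(bool(stricter.fullmatch('february-14')))   # True
print(bool(stricter.fullmatch('march-32')))      # False
```

Both patterns avoid unbounded repetition such as `*` or `+`, so each one describes a finite list of strings.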

In [16]:
# now check that it produces good corrections to the date
sample = 'We are going to meet her in the city hall on febbruary-3.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  her  in  the  city  all  on  february-3  .\n']

The model produces good corrections for the special regex class as well (the change from "hall" to "all" in these outputs is a separate miscorrection, unrelated to the date). Remember that each regex you add to the model must be finite, that is, it must match only a finite set of strings. In all these examples, the new class definitions didn't prevent the model from continuing to use the context to produce corrections. Let's see why being able to use the context is important.

### Sentence Level Corrections
The Spell Checker can leverage the context of words to rank different correction sequences. Let's take a look at some examples.



In [17]:
# check for the different occurrences of the word "siter"
example1 = ["I will call my siter.",
    "Due to bad weather, we had to move to a different siter.",
    "We travelled to three siter in the summer."]
beautify(lp.annotate(example1))

['I  will  call  my  sister  .\n',
 'Due  to  bad  weather  ,  we  had  to  move  to  a  different  site  .\n',
 'We  travelled  to  three  sites  in  the  summer  .\n']

In [18]:
# check for the different occurrences of the word "ueather"
example2 = ["During the summer we have the best ueather.",
    "I have a black ueather jacket, so nice.",
    "I introduce you to my sister, she is called ueather."]
beautify(lp.annotate(example2))

['During  the  summer  we  have  the  best  Heather  .\n',
 'I  have  a  black  leather  jacket  ,  so  nice  .\n',
 'I  introduce  you  to  my  sister  ,  she  is  called  Heather  .\n']
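The outputs above show the same misspelling resolved differently depending on its neighbors. As a rough illustration of how context can break ties between candidates, here is a toy sketch that ranks candidates by a score for the (previous word, candidate) pair. The score table is invented for this example and the real model uses a neural language model over the whole sentence, so treat this only as intuition; `BIGRAM_SCORE` and `pick_by_context` are hypothetical names.

```python
# hypothetical bigram scores (higher = more plausible); not taken from the model
BIGRAM_SCORE = {
    ('my', 'sister'): 0.9, ('my', 'site'): 0.3, ('my', 'sites'): 0.2,
    ('different', 'sister'): 0.2, ('different', 'site'): 0.8, ('different', 'sites'): 0.5,
    ('three', 'sister'): 0.05, ('three', 'site'): 0.1, ('three', 'sites'): 0.9,
}


def pick_by_context(previous_word, candidates):
    """Choose the candidate that scores best right after `previous_word`."""
    return max(candidates, key=lambda c: BIGRAM_SCORE.get((previous_word, c), 0.0))


candidates = ['sister', 'site', 'sites']  # plausible fixes for "siter"
for prev in ['my', 'different', 'three']:
    print(prev, '->', pick_by_context(prev, candidates))
```

With these made-up scores the three contexts select "sister", "site" and "sites" respectively, mirroring the corrections in example1.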

### Subword level corrections

In [19]:
# sending or lending ?
sample = 'I will be 1ending him my car'
lp.annotate(sample)

{'checked': ['I', 'will', 'be', 'sending', 'him', 'my', 'car']}

In [20]:
# let's make replacing a '1' with an 'l' cheaper
weights = {'1': {'l': .1}}
spellModel.setWeights(weights)
lp.annotate(sample)

{'checked': ['I', 'will', 'be', 'lending', 'him', 'my', 'car']}
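Why does a cheaper substitution change the result? The sketch below is a plain-Python weighted edit distance, assuming that the weights act as per-character substitution costs with a default cost of 1.0. That interpretation is an assumption made for illustration, and `weighted_distance` is a hypothetical helper, not a Spark NLP function.

```python
def weighted_distance(a, b, weights=None):
    """Edit distance where substituting a source char by a target char may have a custom cost.

    `weights` maps source char -> {target char: cost}. Substitutions that are
    not listed cost 1.0, and so do insertions and deletions.
    """
    weights = weights or {}
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else weights.get(ca, {}).get(cb, 1.0)
            curr.append(min(prev[j] + 1.0, curr[j - 1] + 1.0, prev[j - 1] + sub))
        prev = curr
    return prev[-1]


weights = {'1': {'l': .1}}  # same shape as the dict passed to setWeights above
for candidate in ['sending', 'lending']:
    print(candidate,
          weighted_distance('1ending', candidate),
          weighted_distance('1ending', candidate, weights))
```

Without weights both candidates are one substitution away from "1ending", so only the context can decide between them. With the weight in place, "lending" costs 0.1 while "sending" still costs 1.0.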

### Advanced - the mysterious tradeoff parameter

In [21]:
sample = 'have you been two the falls?'
beautify([lp.annotate(sample)])

['have  you  been  two  the  falls  ?\n']

In [22]:
spellModel.getTradeoff()

6.0

In [23]:
# let's decrease the influence of word-level errors
# TODO a nicer way of doing this other than re-creating the pipeline?
spellModel.setTradeoff(5.0)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

beautify([lp.annotate(sample)])

['have  you  been  to  the  falls  ?\n']
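One way to build intuition for the tradeoff is a toy scoring rule in which each candidate pays a language-model cost for how badly it fits the context, plus the tradeoff times the cost of editing the typed word. Both the additive formula and the numbers below are assumptions chosen so that the behavior matches the two runs above; they are not taken from the Spell Checker.

```python
# hypothetical costs for the token "two" in "have you been two the falls?"
# (lower is better); neither the numbers nor the formula come from the model
CANDIDATES = {
    # candidate: (edit cost from the typed word, language-model cost in context)
    'two': (0.0, 8.5),  # unchanged, but unlikely between "been" and "the"
    'to': (1.0, 3.0),   # one edit away, fits the context well
}


def best_candidate(tradeoff):
    """Pick the candidate minimizing lm_cost + tradeoff * edit_cost."""
    return min(CANDIDATES,
               key=lambda w: CANDIDATES[w][1] + tradeoff * CANDIDATES[w][0])


for tradeoff in (6.0, 5.0):
    print(tradeoff, '->', best_candidate(tradeoff))
```

With these numbers, a tradeoff of 6.0 keeps "two" (8.5 against 9.0), while 5.0 makes the edit to "to" worthwhile (8.0 against 8.5). Lowering the tradeoff reduces the penalty for changing a word that is already valid.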

### Advanced - performance

In [24]:
def sparknlp_spell_check(text):
  """Run the spell checker on `text` and return the corrected tokens as one line."""
  return beautify([lp.annotate(text)])[0].rstrip()

In [25]:
sparknlp_spell_check('I will go to Philadelhia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [26]:
sparknlp_spell_check('I will go to Philadhelpia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [27]:
sparknlp_spell_check('I will go to Piladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [28]:
sparknlp_spell_check('I will go to Philadedlphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'

In [29]:
sparknlp_spell_check('I will go to Phieladelphia tomorrow')

'I  will  go  to  Philadelphia  tomorrow'
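The five checks above can also be collected into a small harness that reports the fraction of inputs corrected as expected. This is a sketch under assumptions: `correction_accuracy` is a hypothetical helper, and the stand-in checker at the end only keeps the example runnable without a Spark session.

```python
def correction_accuracy(check, cases):
    """Fraction of (misspelled, expected) pairs that `check` gets right.

    Runs of whitespace are collapsed before comparing, since the outputs
    above separate tokens with two spaces.
    """
    def normalize(s):
        return ' '.join(s.split())

    hits = sum(normalize(check(bad)) == normalize(good) for bad, good in cases)
    return hits / len(cases)


expected = 'I will go to Philadelphia tomorrow'
cases = [(f'I will go to {typo} tomorrow', expected)
         for typo in ['Philadelhia', 'Philadhelpia', 'Piladelphia',
                      'Philadedlphia', 'Phieladelphia']]

# with the notebook's pipeline loaded, call:
#   correction_accuracy(sparknlp_spell_check, cases)
# here, a stand-in checker that never changes its input
print(correction_accuracy(lambda text: text, cases))  # 0.0
```

Passing `sparknlp_spell_check` instead of the stand-in would score the pipeline on the same five misspellings used in the cells above.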