<a href="https://colab.research.google.com/github/hrishikeshmalkar/Spark-nlp-projects/blob/main/2_Creating_Spell_Checker_Pretrained_Pipeline_From_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spell-Checker DL

#### Creating a Pretrained Pipeline From Scratch
Let's start by building a pipeline: a *spell correction pipeline*. We will use a pretrained model from the Spark NLP library.

# Setting Up the Spark Environment

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
!bash colab_setup.sh

--2021-04-12 14:35:46--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1593 (1.6K) [text/plain]
Saving to: ‘colab_setup.sh.1’


2021-04-12 14:35:46 (14.9 MB/s) - ‘colab_setup.sh.1’ saved [1593/1593]

setup Colab for PySpark 3.1.1 and Spark NLP 3.0.1


In [None]:
# Import Spark NLP and the Spark ML Pipeline class used below
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [None]:
from IPython.utils.text import columnize

#### Starting Spark Session

In [None]:
spark = sparknlp.start()

# Optional parameters of start(): gpu=False, spark23=False (set spark23=True to start with Spark 2.3)

In [None]:
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 3.0.1
Apache Spark version: 3.1.1


#### Defining Stages with the Pretrained Spell Checker Model

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")
    
spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")\
    .setErrorThreshold(4.0)\
    .setTradeoff(6.0)

finisher = Finisher()\
    .setInputCols("checked")

spellcheck_dl download started this may take some time.
Approximate size to download 111.4 MB
[OK!]


#### Creating a Normal Pipeline

In [None]:
pipeline = Pipeline(
    stages=[
        document,
        tokenizer,
        spellModel,
        finisher
    ])

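#### Creating an Empty DataFrame and Fitting the Pipeline
No stage in this pipeline learns anything from data, so fitting on an empty DataFrame is enough to turn the `Pipeline` into a `PipelineModel` that can be saved.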

In [None]:
empty_df = spark.createDataFrame([['']]).toDF("text")

In [None]:
spell_pipeline = pipeline.fit(empty_df)

In [None]:
spell_pipeline.stages

[DocumentAssembler_8cedde72ea90,
 REGEX_TOKENIZER_c4c9f6809e47,
 SPELL_b22681bc00ec,
 Finisher_ef646736b663]

#### Saving our Model

In [None]:
spell_pipeline.save('SavedSpellChecker')

#### Cross-checking whether our model was saved

In [None]:
!ls -lt

total 438252
drwxr-xr-x  4 root root      4096 Apr 12 14:40 SavedSpellChecker
-rw-r--r--  1 root root      1593 Apr 12 14:35 colab_setup.sh.1
-rw-r--r--  1 root root      1593 Apr 12 14:30 colab_setup.sh
drwxr-xr-x  1 root root      4096 Apr  7 13:36 sample_data
-rw-r--r--  1 root root 224374704 Feb 22 02:45 spark-3.1.1-bin-hadoop2.7.tgz
-rw-r--r--  1 root root 224374704 Feb 22 02:45 spark-3.1.1-bin-hadoop2.7.tgz.1
drwxr-xr-x 13 1000 1000      4096 Feb 22 02:44 spark-3.1.1-bin-hadoop2.7
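
The same check can be done from Python instead of reading the `ls` listing. The sketch below is a minimal illustration, not part of the original notebook: `looks_like_saved_pipeline` is a hypothetical helper, and it assumes that a saved Spark ML pipeline directory contains `metadata` and `stages` sub-directories.

```python
from pathlib import Path


def looks_like_saved_pipeline(path):
    """Return True if `path` contains the sub-directories expected in a saved pipeline."""
    root = Path(path)
    # Assumption: a saved PipelineModel directory holds `metadata/` and `stages/`.
    return all((root / name).is_dir() for name in ("metadata", "stages"))


# Prints True only if the directory exists and has the expected layout.
print(looks_like_saved_pipeline("SavedSpellChecker"))
```

A check like this only confirms the directory layout; it does not verify that the stages inside can be loaded.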


#### Loading our saved spell-pipeline model

In [None]:
from sparknlp.pretrained import PretrainedPipeline
pipeline_local = PretrainedPipeline.from_disk('SavedSpellChecker')

#### Model testing

In [None]:
testDoc = '''
During the rainy seacon we have th best ueather and "I have a black ueather jacket, so nice."
'''

In [None]:
result = pipeline_local.annotate(testDoc)

In [None]:
result.keys()

dict_keys(['checked'])

In [None]:
result['checked']

['During',
 'the',
 'rainy',
 'season',
 'we',
 'have',
 'the',
 'best',
 'weather',
 'and',
 '"',
 'I',
 'have',
 'a',
 'black',
 'leather',
 'jacket',
 ',',
 'so',
 'nice',
 '."']
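
To see which tokens the model changed, the corrected list can be compared position by position with the input tokens. The sketch below is an illustration under assumptions: `list_corrections` is a hypothetical helper, and the `original` list is written by hand on the assumption that the tokenizer splits the test sentence into the same 21 positions as the output above.

```python
# Assumption: input tokens aligned one-to-one with the corrected output above.
original = ['During', 'the', 'rainy', 'seacon', 'we', 'have', 'th', 'best',
            'ueather', 'and', '"', 'I', 'have', 'a', 'black', 'ueather',
            'jacket', ',', 'so', 'nice', '."']
checked = ['During', 'the', 'rainy', 'season', 'we', 'have', 'the', 'best',
           'weather', 'and', '"', 'I', 'have', 'a', 'black', 'leather',
           'jacket', ',', 'so', 'nice', '."']


def list_corrections(before, after):
    """Return (index, before, after) for every position where the tokens differ."""
    if len(before) != len(after):
        raise ValueError("token lists must have the same length")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


for i, b, a in list_corrections(original, checked):
    print(f"{i:2d}: {b} -> {a}")
```

Note that the same misspelling, `ueather`, is corrected to `weather` in one position and to `leather` in the other, which shows the checker using the surrounding words.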

In [None]:
# Join the corrected tokens back into a single space-separated string
corrected_text = ' '.join(result['checked'])

In [None]:
corrected_text

'During the rainy season we have the best weather and " I have a black leather jacket , so nice ."'
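
Joining tokens with single spaces leaves a gap before punctuation such as `,` and `."`. The sketch below shows one simple way to tighten that; `join_tokens` is a hypothetical helper, and it only handles tokens that start with a closing punctuation mark. Opening quotes are deliberately left alone, because a bare `"` token does not say which side it belongs to.

```python
# Tokens starting with one of these characters are attached to the previous token.
CLOSING = {",", ".", ";", ":", "!", "?"}


def join_tokens(tokens):
    """Join tokens with spaces, but without a space before closing punctuation."""
    text = ""
    for tok in tokens:
        if not tok:
            continue  # skip empty tokens
        if text and tok[0] not in CLOSING:
            text += " "
        text += tok
    return text


checked = ['During', 'the', 'rainy', 'season', 'we', 'have', 'the', 'best',
           'weather', 'and', '"', 'I', 'have', 'a', 'black', 'leather',
           'jacket', ',', 'so', 'nice', '."']
print(join_tokens(checked))
```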