## Explain Document Deep Learning

This notebook shows some of the annotators available in Spark NLP. We start by importing the required modules.

In [2]:
import sparknlp
spark = sparknlp.start_with_ocr()

In [3]:
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *

Now we load a pretrained pipeline model that contains the following annotators:
Tokenizer, Deep Sentence Detector, Lemmatizer, Stemmer, Part of Speech (POS) tagger, and Context Spell Checker.

In [4]:
%%time
pipeline = PretrainedPipeline('explain_document_dl')

CPU times: user 11.5 ms, sys: 4.77 ms, total: 16.2 ms
Wall time: 34.5 s


We simply pass in the text we want to transform and the pipeline does the work. Here the text is extracted from a PDF file with `OcrHelper`.

In [5]:
%%time
from sparknlp.ocr import OcrHelper
data = OcrHelper().createDataset(spark, './immortal_text.pdf')
data.show()

+--------------------+--------------------+-------+------+
|                text|            filename|pagenum|method|
+--------------------+--------------------+-------+------+
|would have been a...|file:/home/saif/I...|      1|  text|
+--------------------+--------------------+-------+------+

CPU times: user 353 µs, sys: 2.28 ms, total: 2.63 ms
Wall time: 2.08 s


Below we can see the output of two of the annotator columns, `ner` and `checked`.

In [6]:
pipeline.transform(data).select("ner", "checked").show()

+--------------------+--------------------+
|                 ner|             checked|
+--------------------+--------------------+
|[[named_entity, 0...|[[token, 0, 4, wo...|
+--------------------+--------------------+
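
The truncated rows above hint at the shape of each annotation: an annotator type, a begin and an end character offset, and then the annotated text (for example `token, 0, 4, wo...` for the first token, `would`). The following is a minimal sketch of how such offsets appear to map back to the source text, assuming the end offset is inclusive. The `Annotation` namedtuple and the `covered_text` helper are hypothetical stand-ins for illustration, not Spark NLP's own classes.

```python
from collections import namedtuple

# Hypothetical stand-in for one annotation row; the field names are
# assumptions for illustration, not read from the notebook output.
Annotation = namedtuple("Annotation", ["annotator_type", "begin", "end", "result"])


def covered_text(text, annotation):
    """Return the slice of `text` covered by an annotation, treating `end` as inclusive."""
    return text[annotation.begin:annotation.end + 1]


# The first token in the output above spans offsets 0 to 4.
text = "would have been a liberation"
first_token = Annotation("token", 0, 4, "would")

# Compare the text recovered from the offsets with the stored result.
print(covered_text(text, first_token) == first_token.result)
```

Because the end offset is treated as inclusive, the Python slice needs `end + 1`; using `end` directly would drop the last character of each token.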



In [11]:
local_data = data.select("text").first()['text']
local_data

'would have been a liberation, a joy, and a fiesta. \nHe sensed that had he been able to choose or \ndream his death that night, this is the death he \nwould have dreamed or chosen.  \nDahlmann firmly grips the knife, which he \nmay have no idea how to manage, and steps out \ninto the plains.  \n \n \n \nThe Aleph  \n(1949) \n \n \nThe Immortal \n \nSolomon saith: There is no new thing upon \nthe earth.  So that as Plato had an imagination, \nthat all knowledge was but remembrance;  so \nSolomon giveth his sentence, that all novelty is \nbut oblivion.  \nFrancis Bacon: Essays,  LVIII \n \nIn London, in early June of the year 1929,'

In [13]:
result = pipeline.annotate(local_data)
list(zip(result['token'], result['ner']))

[('would', 'B-sent'),
 ('have', 'O'),
 ('been', 'O'),
 ('a', 'O'),
 ('liberation', 'O'),
 (',', 'O'),
 ('a', 'B-sent'),
 ('joy', 'O'),
 (',', 'O'),
 ('and', 'B-sent'),
 ('a', 'O'),
 ('fiesta', 'O'),
 ('.', 'O'),
 ('He', 'B-sent'),
 ('sensed', 'O'),
 ('that', 'O'),
 ('had', 'O'),
 ('he', 'O'),
 ('been', 'O'),
 ('able', 'O'),
 ('to', 'O'),
 ('choose', 'O'),
 ('or', 'O'),
 ('dream', 'O'),
 ('his', 'O'),
 ('death', 'O'),
 ('that', 'O'),
 ('night', 'B-sent'),
 (',', 'O'),
 ('this', 'O'),
 ('is', 'O'),
 ('the', 'B-sent'),
 ('death', 'O'),
 ('he', 'O'),
 ('would', 'O'),
 ('have', 'O'),
 ('dreamed', 'O'),
 ('or', 'O'),
 ('chosen', 'O'),
 ('.', 'O'),
 ('Dahlmann', 'B-sent'),
 ('firmly', 'O'),
 ('grips', 'O'),
 ('the', 'O'),
 ('knife', 'O'),
 (',', 'O'),
 ('which', 'O'),
 ('he', 'B-sent'),
 ('may', 'O'),
 ('have', 'B-sent'),
 ('no', 'O'),
 ('idea', 'O'),
 ('how', 'O'),
 ('to', 'O'),
 ('manage', 'O'),
 (',', 'O'),
 ('and', 'B-sent'),
 ('steps', 'O'),
 ('out', 'O'),
 ('into', 'O'),
 ('the', 'O'),
