### Spark Pipeline Advanced

In this notebook, we will discuss how to search documents based on the Disease described in the CVD tree.

In [1]:
import pandas as pd
import json
from neo4j import GraphDatabase
import csv

#### Authentication to access covidgraph.org graph

In [2]:
covid_browser = "https://db.covidgraph.org/browser/"
covid_url = "bolt://db.covidgraph.org:7687"
user = "public"
password = "corona"

# Connect to the public CovidGraph instance over Bolt.
driver = GraphDatabase.driver(uri=covid_url,
                              auth=(user, password))

##### Example of a paper node in the covid graph

In [9]:
paper_query = "MATCH (n:Paper) RETURN n LIMIT 1"
with driver.session() as session:
    info = session.run(paper_query)
    for item in info:
        print(item)

<Record n=<Node id=2385529 labels={'Paper'} properties={'cord_uid': 'ocp6yodg', 'cord19-fulltext_hash': 'b8957d48b6bcf17b7b51e004d19314ce77f653a1', 'journal': 'BMC Infect Dis', 'publish_time': '2011-12-28', 'source': 'PMC', 'title': 'Timeliness of contact tracing among flight passengers for influenza A/H1N1 2009', '_hash_id': '84b069ab23fb0ecebe6925af9c2b18ae', 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265549/'}>>


#### Extract Fragments

In [80]:
TEXT = []
query = "MATCH (p:Paper)-[:PAPER_HAS_BODYTEXTCOLLECTION]-(:BodyTextCollection)\
        -[:BODYTEXTCOLLECTION_HAS_BODYTEXT]-(a:BodyText)-[:HAS_FRAGMENT]\
        -(f:Fragment)-[:MENTIONS]->(g:GeneSymbol) RETURN p.cord_uid, g.taxid, f.text, a.text limit 100"

with driver.session() as session:
    info = session.run(query)
    for item in info:
        # Values follow the RETURN order: cord_uid, taxid, fragment text, body text.
        TEXT.append({"chord_id": item.values()[0],
                     "gene_id": item.values()[1],
                     "fragment": item.values()[2],
                     "text": item.values()[3]})

In [81]:
TEXT[0:2]

[{'chord_id': 'ocp6yodg',
  'fragment': 'SARS and viral hemorrhagic fevers)',
  'gene_id': '9606',
  'text': 'In hindsight, the limited burden of disease of influenza A/H1N1 2009 did not justify contact tracing efforts. The main reason for flight contact tracing is raising alertness for possible exposure to uncommon infectious diseases, enabling early recognition and treatment of the disease and timely installation of control measures (e.g. SARS and viral hemorrhagic fevers). For some diseases, PEP is indicated as well. The risk assessment upon which the decision to install contact tracing is based should incorporate -apart from an evaluation of the severity and rarity of disease -an assessment of the required timeliness of effective control measures [23] . The expected time for laboratory confirmation of index cases and identification and tracing of contacts should be related to the maximum period during which quarantine, PEP or other control measures are effective in order to decide 
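The extraction loop reads each record by position (`item.values()[0]`, `item.values()[1]`, and so on), which silently breaks if the `RETURN` clause is reordered. A minimal sketch of a key-based alternative follows. It assumes each record can be indexed by the returned column name, as the Neo4j Python driver's `Record` allows. `COLUMN_MAP` and `record_to_row` are hypothetical names, and a plain dict stands in for a record so the sketch runs without a database connection.

```python
# Hypothetical mapping from Cypher return columns to the row keys used in this notebook.
COLUMN_MAP = {
    "p.cord_uid": "chord_id",
    "g.taxid": "gene_id",
    "f.text": "fragment",
    "a.text": "text",
}


def record_to_row(record):
    """Build one row dict from a mapping-like record, looked up by column name."""
    return {row_key: record[column] for column, row_key in COLUMN_MAP.items()}


# A plain dict stands in for a driver record (assumption: same keys as the RETURN clause).
fake_record = {
    "p.cord_uid": "ocp6yodg",
    "g.taxid": "9606",
    "f.text": "SARS and viral hemorrhagic fevers)",
    "a.text": "In hindsight, the limited burden of disease of influenza "
              "A/H1N1 2009 did not justify contact tracing efforts.",
}

row = record_to_row(fake_record)
print(row["chord_id"], row["gene_id"], row["fragment"])
```

Inside the session loop, the same helper could be called as `TEXT.append(record_to_row(item))`, assuming the query's column names stay as written.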

### Spark Pipeline

In [82]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun
from pyspark.sql.types import *

In [83]:
from sparknlp.base import DocumentAssembler, Finisher

#### Create Spark Session

In [84]:
packages = ','.join([
    "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1",
])

spark_conf = SparkConf()
spark_conf = spark_conf.setAppName('spark1')
spark_conf = spark_conf.setMaster('local[*]')
spark_conf = spark_conf.set("spark.jars.packages", packages)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()

In [87]:
schema = StructType([
    StructField('chord_id', StringType()),
    StructField('gene_id', StringType()),
    StructField('fragment', StringType()),
    StructField('text', StringType()),
    
])
texts_df = spark.createDataFrame(TEXT, schema)

In [88]:
texts_df.show()

+--------+-------+--------------------+--------------------+
|chord_id|gene_id|            fragment|                text|
+--------+-------+--------------------+--------------------+
|ocp6yodg|   9606|SARS and viral he...|In hindsight, the...|
|ocp6yodg|   9606|Other nation's he...|The procedure for...|
|ocp6yodg|   9606|MHS Kennemerland ...|The procedure for...|
|ocp6yodg|   9606|Requests for cont...|The procedure for...|
|ocp6yodg|   9606|Requests for cont...|The procedure for...|
|ocp6yodg|   9606|In case of Schiph...|The procedure for...|
|ocp6yodg|   9606|The CIb verifies ...|The procedure for...|
|ocp6yodg|   9606|The MHS of the ai...|The procedure for...|
|ocp6yodg|   9606|For tracing forei...|The procedure for...|
|ocp6yodg|   9606|The other interva...|For the 17 comple...|
|ocp6yodg|   9606|For the 17 comple...|For the 17 comple...|
|ocp6yodg|   9606|After acceptance ...|For the 17 comple...|
|ocp6yodg|   9606|Interval III of t...|For the 17 comple...|
|ocp6yodg|   9606|Interv

#### Document Assembler

In [89]:
document_assembler = DocumentAssembler()\
    .setInputCol('fragment')\
    .setOutputCol('document')\
    .setIdCol('chord_id')

In [90]:
docs = document_assembler.transform(texts_df)

In [91]:
docs.limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 33, SARS and viral hemorrhagic ..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 165, Other nation's health auth..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 100, MHS Kennemerland then comp..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci..."


In [92]:
docs.first()['document'][0].asDict()

{'annotatorType': 'document',
 'begin': 0,
 'embeddings': [],
 'end': 33,
 'metadata': {'id': 'ocp6yodg', 'sentence': '0'},
 'result': 'SARS and viral hemorrhagic fevers)'}
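The `begin` and `end` fields in the annotation are character offsets into the input text, and `end` is inclusive: the fragment has 34 characters and `end` is 33. A minimal sketch in plain Python follows. `make_document_annotation` is a hypothetical helper that only mimics the dictionary shape shown in the output; it is not part of Spark NLP.

```python
def make_document_annotation(text, doc_id):
    """Mimic the dict shape of a document annotation (illustration only)."""
    return {
        "annotatorType": "document",
        "begin": 0,
        "end": len(text) - 1,  # inclusive offset of the last character
        "metadata": {"id": doc_id, "sentence": "0"},
        "result": text,
        "embeddings": [],
    }


annotation = make_document_annotation("SARS and viral hemorrhagic fevers)", "ocp6yodg")
print(annotation["begin"], annotation["end"])

# Because `end` is inclusive, slicing needs end + 1 to recover the annotated text.
source_text = annotation["result"]
recovered = source_text[annotation["begin"]: annotation["end"] + 1]
print(recovered == source_text)
```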

#### Sentence Detector

In [93]:
from sparknlp.annotator import SentenceDetector

sent_detector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

In [94]:
sentences = sent_detector.transform(docs)

In [95]:
sentences.limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document,sentences
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(document, 0, 33, SARS and viral hemorrhagic ..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 165, Other nation's health auth...","[(document, 0, 165, Other nation's health auth..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 100, MHS Kennemerland then comp...","[(document, 0, 100, MHS Kennemerland then comp..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci..."


#### Tokenizer

In [96]:
from sparknlp.annotator import Tokenizer

tokenizer = Tokenizer()\
    .setInputCols(['sentences'])\
    .setOutputCol('tokens')\
    .fit(sentences)

In [97]:
tokens = tokenizer.transform(sentences)
tokens.limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document,sentences,tokens
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(token, 0, 3, SARS, {'sentence': '0'}, []), (..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 165, Other nation's health auth...","[(document, 0, 165, Other nation's health auth...","[(token, 0, 4, Other, {'sentence': '0'}, []), ..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 100, MHS Kennemerland then comp...","[(document, 0, 100, MHS Kennemerland then comp...","[(token, 0, 2, MHS, {'sentence': '0'}, []), (t..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []..."
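Each token annotation carries the token text together with its inclusive character offsets within the fragment. The sketch below reproduces that shape with a regular expression. It is a rough approximation for illustration, not the rule set Spark NLP's `Tokenizer` applies; `TOKEN_PATTERN` and `tokenize_with_offsets` are hypothetical names.

```python
import re

# Rough approximation: a word (optionally with an apostrophe suffix such as "'s")
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize_with_offsets(sentence):
    """Return (begin, end, token) triples, with `end` inclusive."""
    return [
        (match.start(), match.end() - 1, match.group())
        for match in TOKEN_PATTERN.finditer(sentence)
    ]


tokens = tokenize_with_offsets("SARS and viral hemorrhagic fevers)")
for begin, end, token in tokens:
    print(begin, end, token)
```

For this fragment the sketch yields the same six tokens as the `tokens` column; on other inputs (for example bracketed citations) it may split differently from Spark NLP.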


#### Lemmatizer

In [98]:
! touch en_lemmas.txt

In [99]:
from sparknlp.annotator import Lemmatizer

lemmatizer = Lemmatizer() \
  .setInputCols(["tokens"]) \
  .setOutputCol("lemma") \
  .setDictionary('en_lemmas.txt', '\t', ',')\
  .fit(tokens)

In [100]:
lemmas = lemmatizer.transform(tokens)
lemmas.limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document,sentences,tokens,lemma
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(token, 0, 3, SARS, {'sentence': '0'}, []), (...","[(token, 0, 3, SARS, {'sentence': '0'}, []), (..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 165, Other nation's health auth...","[(document, 0, 165, Other nation's health auth...","[(token, 0, 4, Other, {'sentence': '0'}, []), ...","[(token, 0, 4, Other, {'sentence': '0'}, []), ..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 100, MHS Kennemerland then comp...","[(document, 0, 100, MHS Kennemerland then comp...","[(token, 0, 2, MHS, {'sentence': '0'}, []), (t...","[(token, 0, 2, MHS, {'sentence': '0'}, []), (t..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(token, 0, 7, Requests, {'sentence': '0'}, []..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(token, 0, 7, Requests, {'sentence': '0'}, []..."
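The `Lemmatizer` is dictionary-driven: it looks tokens up in the file passed to `setDictionary`. The sketch below illustrates that lookup in plain Python. It assumes each dictionary line has the form `lemma<TAB>form1,form2`, matching the `'\t'` and `','` delimiters used in this notebook, and it passes unknown tokens through unchanged. The function names and the sample entries are hypothetical.

```python
def parse_lemma_dictionary(lines, key_delimiter="\t", value_delimiter=","):
    """Build a word-form -> lemma lookup from 'lemma<key>form1<value>form2' lines."""
    form_to_lemma = {}
    for line in lines:
        line = line.strip()
        if not line or key_delimiter not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split(key_delimiter, 1)
        for form in forms.split(value_delimiter):
            form_to_lemma[form.strip()] = lemma.strip()
    return form_to_lemma


def lemmatize(tokens, form_to_lemma):
    """Replace each token with its lemma; tokens not in the lookup stay as they are."""
    return [form_to_lemma.get(token, token) for token in tokens]


# Hypothetical dictionary entries, for illustration only.
sample_lines = [
    "fever\tfevers",
    "authority\tauthorities",
    "trace\ttracing,traced,traces",
]
lookup = parse_lemma_dictionary(sample_lines)
lemmas = lemmatize(["SARS", "and", "viral", "hemorrhagic", "fevers", ")"], lookup)
print(lemmas)
```

With an empty dictionary the lookup has no entries, so every token comes back unchanged in this sketch.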


#### POS Tagger

In [101]:
from sparknlp.annotator import PerceptronModel

In [102]:
pos_tagger = PerceptronModel.pretrained() \
  .setInputCols(["tokens", "sentences"]) \
  .setOutputCol("pos")

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]


In [103]:
postags = pos_tagger.transform(lemmas)
postags.limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document,sentences,tokens,lemma,pos
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(document, 0, 33, SARS and viral hemorrhagic ...","[(token, 0, 3, SARS, {'sentence': '0'}, []), (...","[(token, 0, 3, SARS, {'sentence': '0'}, []), (...","[(pos, 0, 3, NNP, {'word': 'SARS'}, []), (pos,..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 165, Other nation's health auth...","[(document, 0, 165, Other nation's health auth...","[(token, 0, 4, Other, {'sentence': '0'}, []), ...","[(token, 0, 4, Other, {'sentence': '0'}, []), ...","[(pos, 0, 4, JJ, {'word': 'Other'}, []), (pos,..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 100, MHS Kennemerland then comp...","[(document, 0, 100, MHS Kennemerland then comp...","[(token, 0, 2, MHS, {'sentence': '0'}, []), (t...","[(token, 0, 2, MHS, {'sentence': '0'}, []), (t...","[(pos, 0, 2, NNP, {'word': 'MHS'}, []), (pos, ..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(pos, 0, 7, NNP, {'word': 'Requests'}, []), (..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 221, Requests for contact traci...","[(document, 0, 221, Requests for contact traci...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(token, 0, 7, Requests, {'sentence': '0'}, []...","[(pos, 0, 7, NNP, {'word': 'Requests'}, []), (..."


#### Pretrained Pipeline

In [104]:
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


In [105]:
pipeline.transform(texts_df).limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,document,sentence,token,spell,lemmas,stems,pos
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[(document, 0, 1156, In hindsight, the limited...","[(document, 0, 108, In hindsight, the limited ...","[(token, 0, 1, In, {'sentence': '0'}, []), (to...","[(token, 0, 1, In, {'sentence': '0', 'confiden...","[(token, 0, 1, In, {'sentence': '0', 'confiden...","[(token, 0, 1, in, {'sentence': '0', 'confiden...","[(pos, 0, 1, IN, {'word': 'In'}, []), (pos, 3,..."
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[(document, 0, 1396, The procedure for contact...","[(document, 0, 59, The procedure for contact t...","[(token, 0, 2, The, {'sentence': '0'}, []), (t...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, the, {'sentence': '0', 'confide...","[(pos, 0, 2, DT, {'word': 'The'}, []), (pos, 4..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[(document, 0, 1396, The procedure for contact...","[(document, 0, 59, The procedure for contact t...","[(token, 0, 2, The, {'sentence': '0'}, []), (t...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, the, {'sentence': '0', 'confide...","[(pos, 0, 2, DT, {'word': 'The'}, []), (pos, 4..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 1396, The procedure for contact...","[(document, 0, 59, The procedure for contact t...","[(token, 0, 2, The, {'sentence': '0'}, []), (t...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, the, {'sentence': '0', 'confide...","[(pos, 0, 2, DT, {'word': 'The'}, []), (pos, 4..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[(document, 0, 1396, The procedure for contact...","[(document, 0, 59, The procedure for contact t...","[(token, 0, 2, The, {'sentence': '0'}, []), (t...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, The, {'sentence': '0', 'confide...","[(token, 0, 2, the, {'sentence': '0', 'confide...","[(pos, 0, 2, DT, {'word': 'The'}, []), (pos, 4..."


In [107]:
text = texts_df.first()['text']

In [108]:
annotations = pipeline.annotate(text)
list(zip(
    annotations['token'], 
    annotations['stems'], 
    annotations['lemmas']
))[100:120]

[('of', 'of', 'of'),
 ('the', 'the', 'the'),
 ('required', 'requir', 'require'),
 ('timeliness', 'timeli', 'timeliness'),
 ('of', 'of', 'of'),
 ('effective', 'effect', 'effective'),
 ('control', 'control', 'control'),
 ('measures', 'measur', 'measure'),
 ('[23]', '[23]', '[23]'),
 ('.', '.', '.'),
 ('The', 'the', 'The'),
 ('expected', 'expect', 'expect'),
 ('time', 'time', 'time'),
 ('for', 'for', 'for'),
 ('laboratory', 'laboratori', 'laboratory'),
 ('confirmation', 'confirm', 'confirmation'),
 ('of', 'of', 'of'),
 ('index', 'index', 'index'),
 ('cases', 'case', 'case'),
 ('and', 'and', 'and')]

#### Finisher

In [111]:
from pyspark.ml import Pipeline

In [112]:
finisher = Finisher()\
    .setInputCols(['tokens', 'lemma'])\
    .setOutputCols(['tokens', 'lemmata'])\
    .setCleanAnnotations(True)\
    .setOutputAsArray(True)

In [113]:
custom_pipeline = Pipeline(stages=[
    document_assembler,
    sent_detector,
    tokenizer,
    lemmatizer,
    finisher
]).fit(texts_df)

In [115]:
custom_pipeline.transform(texts_df).limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,tokens,lemmata
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[SARS, and, viral, hemorrhagic, fevers, )]","[SARS, and, viral, hemorrhagic, fever, )]"
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[Other, nation's, health, authorities, will, m...","[Other, nation's, health, authority, will, mak..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[MHS, Kennemerland, then, completes, contact, ...","[MHS, Kennemerland, then, complete, contact, d..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[Requests, for, contact, tracing, to, the, CIb...","[Requests, for, contact, trace, to, the, CIb, ..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[Requests, for, contact, tracing, to, the, CIb...","[Requests, for, contact, trace, to, the, CIb, ..."
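The `Finisher` replaces the annotation structs with plain values, which is why the `tokens` and `lemmata` columns above hold simple lists of strings. A minimal plain-Python sketch of that conversion follows. The annotation dicts mimic the structure shown earlier in this notebook, and `finish` is a hypothetical helper rather than the Spark NLP implementation.

```python
def finish(annotations):
    """Keep only the `result` string of each annotation."""
    return [annotation["result"] for annotation in annotations]


# Token annotations in the shape printed earlier (offsets are inclusive).
token_annotations = [
    {"annotatorType": "token", "begin": 0, "end": 3, "result": "SARS",
     "metadata": {"sentence": "0"}, "embeddings": []},
    {"annotatorType": "token", "begin": 5, "end": 7, "result": "and",
     "metadata": {"sentence": "0"}, "embeddings": []},
    {"annotatorType": "token", "begin": 9, "end": 13, "result": "viral",
     "metadata": {"sentence": "0"}, "embeddings": []},
]

finished = finish(token_annotations)
print(finished)
```

Offsets and metadata are dropped in this step, so any logic that needs them has to run before the annotations are finished.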


#### Stop Words

In [117]:
from pyspark.ml.feature import StopWordsRemover
stopwords = StopWordsRemover.loadDefaultStopWords('english')

In [120]:
larger_pipeline = Pipeline(stages=[
    custom_pipeline,
    StopWordsRemover(
        inputCol='lemmata', 
        outputCol='terms', 
        stopWords=stopwords)
]).fit(texts_df)

In [122]:
larger_pipeline.transform(texts_df).limit(5).toPandas()

Unnamed: 0,chord_id,gene_id,fragment,text,tokens,lemmata,terms
0,ocp6yodg,9606,SARS and viral hemorrhagic fevers),"In hindsight, the limited burden of disease of...","[SARS, and, viral, hemorrhagic, fevers, )]","[SARS, and, viral, hemorrhagic, fever, )]","[SARS, viral, hemorrhagic, fever, )]"
1,ocp6yodg,9606,Other nation's health authorities will make a ...,"The procedure for contact tracing is complex, ...","[Other, nation's, health, authorities, will, m...","[Other, nation's, health, authority, will, mak...","[nation's, health, authority, make, request, C..."
2,ocp6yodg,9606,MHS Kennemerland then completes contact detail...,"The procedure for contact tracing is complex, ...","[MHS, Kennemerland, then, completes, contact, ...","[MHS, Kennemerland, then, complete, contact, d...","[MHS, Kennemerland, complete, contact, detail,..."
3,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[Requests, for, contact, tracing, to, the, CIb...","[Requests, for, contact, trace, to, the, CIb, ...","[Requests, contact, trace, CIb, Dutch, index, ..."
4,ocp6yodg,9606,Requests for contact tracing to the CIb for Du...,"The procedure for contact tracing is complex, ...","[Requests, for, contact, tracing, to, the, CIb...","[Requests, for, contact, trace, to, the, CIb, ...","[Requests, contact, trace, CIb, Dutch, index, ..."
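`StopWordsRemover` filters the `lemmata` array against a word list and, by default, compares case-insensitively, which is why `Other` disappears from the second row. The sketch below shows the same filtering in plain Python. `SAMPLE_STOPWORDS` is a tiny hypothetical list for illustration; the notebook itself uses Spark's default English list.

```python
# Hypothetical, tiny stop list; not Spark's default English list.
SAMPLE_STOPWORDS = frozenset({"and", "other", "will", "a", "to", "the", "for", "then"})


def remove_stopwords(terms, stopwords=SAMPLE_STOPWORDS, case_sensitive=False):
    """Drop every term found in `stopwords`, ignoring case unless told otherwise."""
    if case_sensitive:
        return [term for term in terms if term not in stopwords]
    lowered = {word.lower() for word in stopwords}
    return [term for term in terms if term.lower() not in lowered]


lemmata = ["SARS", "and", "viral", "hemorrhagic", "fever", ")"]
terms = remove_stopwords(lemmata)
print(terms)
```

Punctuation such as `)` survives because it is not in the stop list; removing it would need a separate filtering step.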


### Supplementary

In [100]:
texts_df.show(n=5, truncate=100, vertical=True)

-RECORD 0----------------------------------------------------------------------------------------------------
 text | SARS and viral hemorrhagic fevers)                                                                   
-RECORD 1----------------------------------------------------------------------------------------------------
 text | Other nation's health authorities will make a request to the CIb in case they diagnosed a patient... 
-RECORD 2----------------------------------------------------------------------------------------------------
 text | MHS Kennemerland then completes contact details through booking offices or using other search met... 
-RECORD 3----------------------------------------------------------------------------------------------------
 text | Requests for contact tracing to the CIb for Dutch index patients originate from any Dutch MHS whi... 
-RECORD 4----------------------------------------------------------------------------------------------------
 text | Re

In [101]:
texts_df.limit(5).toPandas()

Unnamed: 0,text
0,SARS and viral hemorrhagic fevers)
1,Other nation's health authorities will make a ...
2,MHS Kennemerland then completes contact detail...
3,Requests for contact tracing to the CIb for Du...
4,Requests for contact tracing to the CIb for Du...


In [102]:
from sparknlp.pretrained import PretrainedPipeline

In [103]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


In [104]:
pipeline.annotate('Hellu wrold!')

{'document': ['Hellu wrold!'],
 'lemmas': ['Hilo', 'world', '!'],
 'pos': ['NNP', 'NN', '.'],
 'sentence': ['Hellu wrold!'],
 'spell': ['Hilo', 'world', '!'],
 'stems': ['hilo', 'world', '!'],
 'token': ['Hellu', 'wrold', '!']}
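`annotate` returns a plain dictionary of parallel lists, so the per-token outputs can be lined up with `zip`. The sketch below reuses the dictionary printed above; `per_token_keys` is a hypothetical name for the keys that hold one entry per token.

```python
# The dictionary returned by pipeline.annotate('Hellu wrold!'), copied from the output.
annotations = {
    "document": ["Hellu wrold!"],
    "lemmas": ["Hilo", "world", "!"],
    "pos": ["NNP", "NN", "."],
    "sentence": ["Hellu wrold!"],
    "spell": ["Hilo", "world", "!"],
    "stems": ["hilo", "world", "!"],
    "token": ["Hellu", "wrold", "!"],
}

# Keys whose lists have one entry per token (document and sentence do not).
per_token_keys = ["token", "spell", "lemmas", "stems", "pos"]

rows = list(zip(*(annotations[key] for key in per_token_keys)))
for row in rows:
    print(row)
```

Reading across one row shows how each stage transformed a token: `Hellu` is spell-corrected to `Hilo`, which then feeds the lemma and stem columns.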

In [105]:
texts_df.printSchema()

root
 |-- text: string (nullable = true)



In [106]:
procd_texts_df = pipeline.annotate(texts_df, 'text')

In [107]:
procd_texts_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

In [108]:
procd_texts_df.show(n=2)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               spell|              lemmas|               stems|                 pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|SARS and viral he...|[[document, 0, 33...|[[document, 0, 33...|[[token, 0, 3, SA...|[[token, 0, 3, SA...|[[token, 0, 3, SA...|[[token, 0, 3, sa...|[[pos, 0, 3, NNP,...|
|Other nation's he...|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 4, Ot...|[[token, 0, 4, Ot...|[[token, 0, 4, Ot...|[[token, 0, 4, ot...|[[pos, 0, 4, JJ, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--

In [109]:
procd_texts_df.show(n=2, truncate=100, vertical=True)

-RECORD 0--------------------------------------------------------------------------------------------------------
 text     | SARS and viral hemorrhagic fevers)                                                                   
 document | [[document, 0, 33, SARS and viral hemorrhagic fevers), [sentence -> 0], []]]                         
 sentence | [[document, 0, 33, SARS and viral hemorrhagic fevers), [sentence -> 0], []]]                         
 token    | [[token, 0, 3, SARS, [sentence -> 0], []], [token, 5, 7, and, [sentence -> 0], []], [token, 9, 13... 
 spell    | [[token, 0, 3, SARS, [confidence -> 1.0, sentence -> 0], []], [token, 5, 7, and, [confidence -> 1... 
 lemmas   | [[token, 0, 3, SARS, [confidence -> 1.0, sentence -> 0], []], [token, 5, 7, and, [confidence -> 1... 
 stems    | [[token, 0, 3, sar, [confidence -> 1.0, sentence -> 0], []], [token, 5, 7, and, [confidence -> 1.... 
 pos      | [[pos, 0, 3, NNP, [word -> SARS], []], [pos, 5, 7, CC, [word -> and], []], [

In [110]:
from sparknlp.base import Finisher

finisher = Finisher()
# take the lemma column
finisher = finisher.setInputCols(['lemmas'])
# separate lemmas by a single space
finisher = finisher.setAnnotationSplitSymbol(' ')
finished_texts_df = finisher.transform(procd_texts_df)
finished_texts_df.show(n=1, truncate=100, vertical=True)

-RECORD 0----------------------------------------------------
 text            | SARS and viral hemorrhagic fevers)        
 finished_lemmas | [SARS, and, viral, hemorrhagic, fever, )] 
only showing top 1 row



In [112]:
finished_texts_df.select('finished_lemmas').take(10)

[Row(finished_lemmas=['SARS', 'and', 'viral', 'hemorrhagic', 'fever', ')']),
 Row(finished_lemmas=['Other', "nation's", 'health', 'authority', 'will', 'make', 'a', 'request', 'to', 'the', 'CIb', 'in', 'case', 'they', 'diagnose', 'a', 'patient', 'which', 'arrive', 'at', 'Schiphol', 'airport', 'for', 'transit', 'while', 'be', 'infectious']),
 Row(finished_lemmas=['MHS', 'Kennemerland', 'then', 'complete', 'contact', 'detail', 'through', 'book', 'office', 'or', 'use', 'other', 'search', 'method']),
 Row(finished_lemmas=['request', 'for', 'contact', 'trace', 'to', 'the', 'CIb', 'for', 'Dutch', 'index', 'patient', 'originate', 'from', 'any', 'Dutch', 'MHS', 'which', 'identify', 'a', 'patient', 'who', 'travel', 'by', 'plane', 'while', 'be', 'contagious', 'for', 'an', 'infectious', 'disease', 'which', 'require', 'contact', 'trace']),
 Row(finished_lemmas=['request', 'for', 'contact', 'trace', 'to', 'the', 'CIb', 'for', 'Dutch', 'index', 'patient', 'originate', 'from', 'any', 'Dutch', 'MHS', '