<a href="https://colab.research.google.com/github/hrishikeshmalkar/Spark-nlp-projects/blob/main/1_Basic_NER_SPARK_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Setting Spark Environment

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
!bash colab_setup.sh

--2021-04-12 13:41:23--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1593 (1.6K) [text/plain]
Saving to: ‘colab_setup.sh.1’


2021-04-12 13:41:23 (28.7 MB/s) - ‘colab_setup.sh.1’ saved [1593/1593]

setup Colab for PySpark 3.1.1 and Spark NLP 3.0.1


In [None]:
# To pin specific versions of PySpark and Spark NLP, use the command below:
#!bash colab_setup.sh -p 2.4.4 -s 2.7.5

# -p sets the PySpark version and -s sets the Spark NLP version;
# by default both are set to the latest release.

#### Importing required libraries

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

#### Starting Spark Session

In [None]:
spark = sparknlp.start()

# Optional parameters: gpu=False, spark23=False (set spark23=True to start with Spark 2.3)

In [None]:
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 3.0.1
Apache Spark version: 3.1.1


#### Using a pretrained pipeline (provided by John Snow Labs)

In [None]:
from sparknlp.pretrained import PretrainedPipeline

In [None]:
# Use the English 'explain_document_dl' pipeline
pipeline_dl = PretrainedPipeline('explain_document_dl', lang='en')

explain_document_dl download started this may take some time.
Approx size to download 169.3 MB
[OK!]


# **Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)


In [None]:
testMsg1 = '''
Peter Parker is a very good persn.
My life in Russia is very intersting.
John and Peter are brothrs. However they don't support each other that much.
Mercedes Benz is also working on a driverless car.
Europe is very culture rich. There are huge churches! and big houses!
'''

In [None]:
result = pipeline_dl.annotate(testMsg1)

##### Output keys produced by the pipeline stages

In [None]:
result.keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

##### Entities in the paragraph

In [None]:
result['entities']

['Peter Parker', 'Russia', 'John', 'Peter', 'Mercedes Benz', 'Europe']

In [None]:
testMsg2 = '''
Billionaire Gautam Adani announced that Flipkart and Adani Group have inked a deal where AdaniConneX will build Flipkart's new data centre in Chennai. 
As part of the deal, Adani Logistics will ulso build Flipkart's 5,34,000 sq ft fulfilment centre in Mumbai. 
The facility will enhance locl dmployment and create 2,500 direct jobs and thousands of indirect jobs, Flipkart said.
'''

In [None]:
result1 = pipeline_dl.annotate(testMsg2)

##### Entities in the paragraph

In [None]:
result1['entities']

['Gautam Adani',
 'Flipkart',
 'Adani Group',
 'AdaniConneX',
 "Flipkart's",
 'Chennai',
 'Adani Logistics',
 'Mumbai',
 'Flipkart']

In [None]:
result1['sentence']

["Billionaire Gautam Adani announced that Flipkart and Adani Group have inked a deal where AdaniConneX will build Flipkart's new data centre in Chennai.",
 "As part of the deal, Adani Logistics will ulso build Flipkart's 5,34,000 sq ft fulfilment centre in Mumbai.",
 'The facility will enhance locl dmployment and create 2,500 direct jobs and thousands of indirect jobs, Flipkart said.']

##### Comparing tokens, lemmas, stems, and spell-checked forms (the `checked` key) in the paragraph

In [None]:
list(zip(result1['token'], result1['lemma'], result1['stem'], result1['checked']))

[('Billionaire', 'Billionaire', 'billionair', 'Billionaire'),
 ('Gautam', 'Gautama', 'gautama', 'Gautama'),
 ('Adani', 'Adani', 'adani', 'Adani'),
 ('announced', 'announce', 'announc', 'announced'),
 ('that', 'that', 'that', 'that'),
 ('Flipkart', 'Flipkart', 'flipkart', 'Flipkart'),
 ('and', 'and', 'and', 'and'),
 ('Adani', 'Adani', 'adani', 'Adani'),
 ('Group', 'Group', 'group', 'Group'),
 ('have', 'have', 'have', 'have'),
 ('inked', 'ink', 'ink', 'inked'),
 ('a', 'a', 'a', 'a'),
 ('deal', 'deal', 'deal', 'deal'),
 ('where', 'where', 'where', 'where'),
 ('AdaniConneX', 'AdaniConneX', 'adaniconnex', 'AdaniConneX'),
 ('will', 'will', 'will', 'will'),
 ('build', 'build', 'build', 'build'),
 ("Flipkart's", "Flipkart's", "flipkart'", "Flipkart's"),
 ('new', 'new', 'new', 'new'),
 ('data', 'data', 'data', 'data'),
 ('centre', 'centre', 'centr', 'centre'),
 ('in', 'in', 'in', 'in'),
 ('Chennai', 'Chenoa', 'chenoa', 'Chenoa'),
 ('.', '.', '.', '.'),
 ('As', 'As', 'a', 'As'),
 ('part', 'part'

#### Creating a DataFrame

In [None]:
import pandas as pd

df = pd.DataFrame({'Token':result1['token'], 'Ner_Label':result1['ner'],
                      'Corrected_Spell':result1['checked'], 'POS':result1['pos'],
                      'Lemmas':result1['lemma'], 'Stems':result1['stem']})

In [None]:
df.head()

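|    | Token       | Ner_Label | Corrected_Spell | POS | Lemmas      | Stems      |
|----|-------------|-----------|-----------------|-----|-------------|------------|
| 0  | Billionaire | O         | Billionaire     | NNP | Billionaire | billionair |
| 1  | Gautam      | B-PER     | Gautama         | NNP | Gautama     | gautama    |
| 2  | Adani       | I-PER     | Adani           | NNP | Adani       | adani      |
| 3  | announced   | O         | announced       | VBD | announce    | announc    |
| 4  | that        | O         | that            | IN  | that        | that       |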


In [None]:
df[33:50]

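|    | Token      | Ner_Label | Corrected_Spell | POS | Lemmas     | Stems     |
|----|------------|-----------|-----------------|-----|------------|-----------|
| 33 | ulso       | O         | also            | VB  | also       | also      |
| 34 | build      | O         | build           | VB  | build      | build     |
| 35 | Flipkart's | O         | Flipkart's      | NNP | Flipkart's | flipkart' |
| 36 | 534000     | O         | 534000          | CD  | 534000     | 534000    |
| 37 | sq         | O         | sq              | NN  | sq         | sq        |
| 38 | ft         | O         | ft              | NN  | ft         | ft        |
| 39 | fulfilment | O         | fulfilment      | NN  | fulfilment | fulfil    |
| 40 | centre     | O         | centre          | NN  | centre     | centr     |
| 41 | in         | O         | in              | IN  | in         | in        |
| 42 | Mumbai     | B-LOC     | Mumbai          | NNP | Mumbai     | mumbai    |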


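In the DataFrame above, the tokens at index positions 33 and 49 (`ulso` and `dmployment`) are misspelled in the input paragraph. The spell checker in the pipeline detects such mistakes and writes its correction to the `Corrected_Spell` column; for example, `ulso` is corrected to `also`. It is not always right, though: in the token listing above, proper nouns such as `Gautam` and `Chennai` are changed to `Gautama` and `Chenoa`.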
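To see at a glance which tokens the spell checker changed, the `token` and `checked` lists can be compared pairwise. The sketch below is a minimal illustration in plain Python. The `sample` dictionary and the `changed_tokens` helper are hypothetical names introduced here, and the values are a handful of pairs copied from the outputs above rather than the full `result1`.

```python
# Sample values copied from the outputs above; in the notebook these lists
# would come from result1['token'] and result1['checked'].
sample = {
    "token":   ["Gautam", "Adani", "Chennai", "ulso", "build"],
    "checked": ["Gautama", "Adani", "Chenoa", "also", "build"],
}

def changed_tokens(annotations):
    """Return (original, corrected) pairs where the spell checker changed the token."""
    return [
        (tok, fixed)
        for tok, fixed in zip(annotations["token"], annotations["checked"])
        if tok != fixed
    ]

corrections = changed_tokens(sample)
print(corrections)
```

Passing `result1` instead of `sample` should list every changed token in the paragraph, which makes both the genuine fixes and the altered proper nouns easy to review.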

#### Using fullAnnotate to get more details.

In [None]:
detailed_result = pipeline_dl.fullAnnotate(testMsg1)

detailed_result[0]['entities']

[Annotation(chunk, 1, 12, Peter Parker, {'entity': 'PER', 'sentence': '0', 'chunk': '0'}),
 Annotation(chunk, 47, 52, Russia, {'entity': 'LOC', 'sentence': '1', 'chunk': '1'}),
 Annotation(chunk, 74, 77, John, {'entity': 'PER', 'sentence': '2', 'chunk': '2'}),
 Annotation(chunk, 83, 87, Peter, {'entity': 'PER', 'sentence': '2', 'chunk': '3'}),
 Annotation(chunk, 151, 163, Mercedes Benz, {'entity': 'ORG', 'sentence': '4', 'chunk': '4'}),
 Annotation(chunk, 202, 207, Europe, {'entity': 'LOC', 'sentence': '5', 'chunk': '5'})]
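The two numbers in each annotation appear to be character offsets into the input string, with the end offset inclusive. The sketch below checks this against the first two sentences of `testMsg1`, assuming the text starts with the newline that follows the opening triple quote. `slice_span` is a hypothetical helper introduced for illustration, not part of Spark NLP.

```python
# First two sentences of testMsg1, including the leading newline of the
# triple-quoted string (it shifts every offset by one).
text = "\nPeter Parker is a very good persn.\nMy life in Russia is very intersting.\n"

# (begin, end, expected chunk) copied from the fullAnnotate output above.
spans = [(1, 12, "Peter Parker"), (47, 52, "Russia")]

def slice_span(source, begin, end):
    """Recover a chunk from inclusive begin/end character offsets."""
    return source[begin:end + 1]

recovered = [slice_span(text, begin, end) for begin, end, _ in spans]
print(recovered)
```

Because the end offset is inclusive, the slice needs `end + 1`; slicing with `source[begin:end]` would drop the last character of each chunk.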

#### Creating a DataFrame with the identified chunks and their entities

In [None]:
chunks = []
entities = []
# Collect the text and the entity label of every recognized chunk
for n in detailed_result[0]['entities']:
  chunks.append(n.result)
  entities.append(n.metadata['entity'])

In [None]:
df = pd.DataFrame({'chunks':chunks, 'entities':entities})
df

|   | chunks        | entities |
|---|---------------|----------|
| 0 | Peter Parker  | PER      |
| 1 | Russia        | LOC      |
| 2 | John          | PER      |
| 3 | Peter         | PER      |
| 4 | Mercedes Benz | ORG      |
| 5 | Europe        | LOC      |
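Once the chunks and labels are in two parallel lists, a quick summary per entity type needs only the standard library. The sketch below reuses the values shown in the table above. The names `label_counts` and `chunks_by_label` are introduced here for illustration.

```python
from collections import Counter, defaultdict

# Values copied from the table above.
chunks = ["Peter Parker", "Russia", "John", "Peter", "Mercedes Benz", "Europe"]
entities = ["PER", "LOC", "PER", "PER", "ORG", "LOC"]

# How many chunks were found for each entity label.
label_counts = Counter(entities)

# Which chunks belong to each entity label, in order of appearance.
chunks_by_label = defaultdict(list)
for chunk, label in zip(chunks, entities):
    chunks_by_label[label].append(chunk)

print(label_counts)
print(dict(chunks_by_label))
```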


#### Creating a DataFrame in a standard format for later use

In [None]:
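tuples = []

# Walk the token, POS, and NER annotations in parallel (one of each per token)
for token, pos, ner in zip(detailed_result[0]["token"], detailed_result[0]["pos"], detailed_result[0]["ner"]):
  tuples.append((int(token.metadata['sentence']), token.result, token.begin, token.end, pos.result, ner.result))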

In [None]:
df = pd.DataFrame(tuples, columns=['sent_id','token','start','end','pos', 'ner'])
df.head()

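|   | sent_id | token  | start | end | pos | ner   |
|---|---------|--------|-------|-----|-----|-------|
| 0 | 0       | Peter  | 1     | 5   | NNP | B-PER |
| 1 | 0       | Parker | 7     | 12  | NNP | I-PER |
| 2 | 0       | is     | 14    | 15  | VBZ | O     |
| 3 | 0       | a      | 17    | 17  | DT  | O     |
| 4 | 0       | very   | 19    | 22  | RB  | O     |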
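The `ner` column uses B-/I-/O tags: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows one simple way to merge such tags back into chunks like `Peter Parker`. It is a simplified illustration under those assumptions, not the logic Spark NLP itself uses, and `bio_to_chunks` is a hypothetical helper name.

```python
# (sent_id, token, ner) triples copied from the table above.
rows = [
    (0, "Peter", "B-PER"),
    (0, "Parker", "I-PER"),
    (0, "is", "O"),
    (0, "a", "O"),
    (0, "very", "O"),
]

def bio_to_chunks(tagged_tokens):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Store the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for _, token, tag in tagged_tokens:
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            current_tokens.append(token)
        else:
            # "O" tags (and stray "I-" tags with no open chunk) close the chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks

merged = bio_to_chunks(rows)
print(merged)
```

Tokens are joined with a single space, which is fine for names like these but would not reproduce the original text where tokens are not space-separated; the `start` and `end` columns are the safer route when exact text is needed.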
