<a href="https://colab.research.google.com/github/mohammad0alfares/MachineLearningNotebooks/blob/master/Spark_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark NLP



The goal is to use Spark NLP ( https://github.com/JohnSnowLabs/spark-nlp ) to extract named entities and part-of-speech (POS) tags from text, and then to examine how the two relate.

1. Follow the instructions to install Spark NLP, using either Python 3 (Jupyter Notebook) or Scala 2.11.
2. Read the given dataset using Spark.
3. Create a Spark ML Pipeline using the following annotators (use the English pretrained models):
   - DocumentAssembler
   - Tokenizer
   - WordEmbeddingsModel (word embeddings, GloVe)
   - PerceptronModel (part of speech)
   - NerCrfModel (named entity recognition)
4. Print the transformed DataFrame, showing only the POS column and the NER column. BONUS: show only the `result` attribute of these annotations.
5. Collect the `result` attribute of the NER and POS annotations, and find a way to explain any relationship (if one exists) between the entities found and their part-of-speech tags.

Note: An annotation column is an array of `Annotation` objects. Each `Annotation` has the following schema:

`Annotation(annotatorType, begin, end, result, metadata, embeddings, sentenceEmbeddings)`

More documentation: https://nlp.johnsnowlabs.com/docs/en/annotators 

Examples: https://github.com/JohnSnowLabs/spark-nlp-workshop 
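To make the schema concrete, here is a minimal pure-Python sketch of how the `result` attribute can be pulled out of a list of annotations. The `Annotation` namedtuple below is only a stand-in that mirrors the fields listed above, not the Spark NLP class, and the sample tags are hand-written for illustration rather than produced by a model.

```python
from collections import namedtuple

# Stand-in mirroring the documented fields; NOT the real Spark NLP class.
Annotation = namedtuple(
    "Annotation",
    ["annotatorType", "begin", "end", "result",
     "metadata", "embeddings", "sentenceEmbeddings"],
)


def results(annotations):
    """Return only the `result` string of every annotation in a column value."""
    return [a.result for a in annotations]


# Hand-written sample for the text "LONDON 1996-08-30" (illustrative tags only).
pos_column = [
    Annotation("pos", 0, 5, "NNP", {"word": "LONDON"}, [], []),
    Annotation("pos", 7, 16, "CD", {"word": "1996-08-30"}, [], []),
]

print(results(pos_column))  # ['NNP', 'CD']
```

`begin` and `end` are inclusive character offsets into the original text, which is why the second token spans 7 to 16.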

Basic Imports:

from sparknlp.base import *

from sparknlp.annotator import *

from sparknlp.embeddings import *



In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

# Install the johnsnowlabs package
! pip install --ignore-installed johnsnowlabs


openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
     |████████████████████████████████| 215.7MB 57kB/s 
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 42.8MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130388 sha256=9aa341e03256b1f7d7fb29a4883e061ef3e9a3e2cd0105108bd04a32c16eaf9e
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d18423005

In [14]:
import sparknlp
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()  # not displayed: a notebook echoes only a cell's last expression; wrap it in print() to show it
print("Apache Spark version")
spark.version

Spark NLP version
Apache Spark version


'2.4.4'

In [3]:
!rm -rf ./spark-demos
!git clone https://github.com/hamed-abdelhaq/spark-demos.git

Cloning into 'spark-demos'...
remote: Enumerating objects: 106, done.

In [4]:
!ls ./spark-demos/data/spark_nlp_dataset.parquet

part-00000-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00001-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00002-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00003-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00004-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00005-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00006-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00007-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00008-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00009-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00010-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00011-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
_SUCCESS


In [5]:
import pandas as pd
import pyarrow.parquet as pq

# Load the Parquet directory into an Arrow table, then convert it to pandas
path = './spark-demos/data/spark_nlp_dataset.parquet'
table = pq.read_table(path)
df = table.to_pandas()

In [6]:
# Convert the pandas DataFrame into a Spark DataFrame
spark_df = spark.createDataFrame(df)
spark_df.show()

+--------------------+
|                text|
+--------------------+
|CRICKET - LEICEST...|
|   LONDON 1996-08-30|
|West Indian all-r...|
|By the close York...|
|Australian Tom Mo...|
|After the frustra...|
|CRICKET - ENGLISH...|
|   LONDON 1996-08-30|
|Result and close ...|
|Somerset 83 and 1...|
|Leicestershire 22...|
|Chester-le-Street...|
|London ( The Oval...|
|Portsmouth : Midd...|
|Bristol : Glouces...|
|CRICKET - 1997 AS...|
|a six-test series...|
|Australia will al...|
|The tourists will...|
|as well as one-da...|
+--------------------+
only showing top 20 rows
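Since the task asks for the dataset to be read with Spark, the detour through pyarrow and pandas can probably be skipped. The following is a minimal sketch, assuming `spark` is the session returned by `sparknlp.start()` and that the Parquet files contain a `text` column as shown above; `load_text_dataset` is a hypothetical helper name, not part of any library.

```python
def load_text_dataset(spark, path):
    """Read a Parquet directory with Spark and keep only the `text` column.

    `spark` is expected to be an active SparkSession; `path` may point at a
    directory of part files, which Spark reads as one DataFrame.
    """
    return spark.read.parquet(path).select("text")
```

In the notebook this would be called as `spark_df = load_text_dataset(spark, './spark-demos/data/spark_nlp_dataset.parquet')`. Reading directly with Spark keeps the data distributed instead of first collecting it into driver memory.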



# Try Pre-trained Model


In [19]:
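# Import Spark NLP
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Turn the raw "text" column into document annotations
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences
sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

# Split each sentence into tokens
token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# bert_base_cased is a cased model, so case sensitivity stays on
bert = BertEmbeddings.pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", 'token'])\
    .setOutputCol("bert")\
    .setCaseSensitive(True)

# The NER stage must be told which columns to read and where to write.
# Caveat: 'ner_dl' is documented as trained on GloVe embeddings; if it rejects
# the BERT vectors, use a NER model trained on BERT (for example 'ner_dl_bert').
ner_dl = NerDLModel.pretrained('ner_dl')\
    .setInputCols(["sentence", "token", "bert"])\
    .setOutputCol("ner")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert,
        ner_dl])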

bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [39]:
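# Note: fit() returns a PipelineModel, not annotated data;
# call .transform(spark_df) on it to obtain the annotation columns
result = ner_prediction_pipeline.fit(spark_df)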
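The task description asks for a different set of annotators (DocumentAssembler, Tokenizer, WordEmbeddingsModel, PerceptronModel, NerCrfModel) than the BERT pipeline above. A sketch of that pipeline might look as follows. The function names and output column names are assumptions chosen for illustration, and calling `pretrained()` without arguments relies on each annotator's default English model being available for the installed Spark NLP version. Imports are placed inside the function so that defining it triggers neither an import of Spark NLP nor a model download.

```python
def build_task_pipeline():
    """Assemble the five-annotator pipeline named in the task description."""
    # Deferred imports: nothing is loaded or downloaded until this is called.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer, WordEmbeddingsModel, PerceptronModel, NerCrfModel,
    )

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # pretrained() with no arguments loads the annotator's default model.
    embeddings = (WordEmbeddingsModel.pretrained()
                  .setInputCols(["document", "token"])
                  .setOutputCol("embeddings"))
    pos = (PerceptronModel.pretrained()
           .setInputCols(["document", "token"])
           .setOutputCol("pos"))
    # The CRF tagger consumes the document, tokens, POS tags and embeddings.
    ner = (NerCrfModel.pretrained()
           .setInputCols(["document", "token", "pos", "embeddings"])
           .setOutputCol("ner"))
    return Pipeline(stages=[document, token, embeddings, pos, ner])


def select_pos_and_ner(annotated_df):
    """Keep only the `result` strings of the POS and NER annotation columns."""
    return annotated_df.selectExpr("pos.result AS pos", "ner.result AS ner")
```

A possible use is `model = build_task_pipeline().fit(spark_df)` followed by `select_pos_and_ner(model.transform(spark_df)).show(truncate=False)`, which covers step 4 including the bonus.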

# Try Pre-trained Pipeline




In [26]:
# Import Spark NLP            
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp


pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo. 
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)


explain_document_dl download started this may take some time.
Approx size to download 168.4 MB
[OK!]


In [27]:
# What's in the pipeline
list(result.keys())

['entities',
 'stem',
 'checked',
 'lemma',
 'document',
 'pos',
 'token',
 'ner',
 'embeddings',
 'sentence']
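One way to approach step 5 is to count how often each entity type co-occurs with each POS tag. The sketch below assumes that the `pos` and `ner` lists returned by `annotate` are aligned token by token and that NER tags use the IOB form (`B-PER`, `I-LOC`, `O`). The helper `entity_pos_counts` is hypothetical, and the sample tags are hand-written for illustration, not actual model output.

```python
from collections import Counter


def entity_pos_counts(pos_tags, ner_tags):
    """Count (entity type, POS tag) pairs for tokens that belong to an entity."""
    if len(pos_tags) != len(ner_tags):
        raise ValueError("pos and ner lists must be aligned token by token")
    counts = Counter()
    for pos, ner in zip(pos_tags, ner_tags):
        if ner == "O":  # token is outside any entity
            continue
        entity_type = ner.split("-", 1)[-1]  # "B-PER" -> "PER"
        counts[(entity_type, pos)] += 1
    return counts


# Hand-written sample shaped like the dict returned by pipeline.annotate(...)
sample = {
    "token": ["Leonardo", "painted", "in", "Paris", "."],
    "pos": ["NNP", "VBD", "IN", "NNP", "."],
    "ner": ["B-PER", "O", "O", "B-LOC", "O"],
}

counts = entity_pos_counts(sample["pos"], sample["ner"])
for (entity_type, pos), n in counts.most_common():
    print(entity_type, pos, n)
```

On real output, a table like this would show whether entities are dominated by proper-noun tags such as `NNP`, which is the kind of relationship the task asks about.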

In [35]:
result = pipeline.transform(spark_df)

In [37]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|             checked|               lemma|                stem|                 pos|          embeddings|                 ner|            entities|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[[document, 0, 64...|[[document, 0, 64...|[[token, 0, 6, CR...|[[token, 0, 6, CR...|[[token, 0, 6, CR...|[[token, 0, 6, cr...|[[pos, 0, 6, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 10, 23, ...|
|   LONDON 1996-08-30|[[document, 0, 16...|[[document, 0, 16...|[[to