<a href="https://colab.research.google.com/github/mohammad0alfares/MachineLearningNotebooks/blob/master/Spark_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark NLP Quick Start
### How to use Spark NLP pretrained pipelines

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/quick_start_google_colab.ipynb)

We first set up the runtime environment, then load a pretrained named entity recognition pipeline and a pretrained sentiment analysis pipeline and give each a quick test. Feel free to try the models on your own sentences or datasets.

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

In [2]:
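import sparknlp

# Start (or reuse) a Spark session with Spark NLP on the classpath.
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()  # not echoed: a cell displays only its last expression; wrap in print() to show it
print("Apache Spark version")
spark.version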

Spark NLP version
Apache Spark version


'2.4.4'

In [3]:
from sparknlp.pretrained import PretrainedPipeline 

Let's use a Spark NLP pretrained pipeline for `named entity recognition`.

In [4]:
pipeline = PretrainedPipeline('recognize_entities_dl', 'en')

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]


In [5]:
result = pipeline.annotate('Harry Potter is a great movie') 

In [6]:
print(result['ner'])

['B-PER', 'I-PER', 'O', 'O', 'O', 'O']
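
The tags above follow the BIO scheme: `B-PER` opens a person span, `I-PER` continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such tags can be merged back into entity strings, the helper below (`group_entities` is a hypothetical name, not part of Spark NLP) pairs each tag with its token. It assumes the tokens are the whitespace-split words of the input sentence, which matches the six tags printed above.

```python
def group_entities(tokens, tags):
    """Merge BIO tags into (entity_text, label) pairs.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Close the span that is currently open, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the span and is itself ignored.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


# Assumption: tokens are the whitespace-split words of the sentence.
tokens = "Harry Potter is a great movie".split()
tags = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O']
entities = group_entities(tokens, tags)
print(entities)
```

With the notebook's own objects, the same helper could be called as `group_entities(result['token'], result['ner'])`, assuming the pipeline output also exposes a `token` key.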


Let's use a Spark NLP pretrained pipeline for `sentiment` analysis.

In [7]:
pipeline = PretrainedPipeline('analyze_sentiment', 'en')

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


In [19]:
result = pipeline.annotate('film was bad')

In [20]:
print(result['sentiment'])

['positive']


In [39]:
!rm -rf ./spark-demos
!git clone https://github.com/hamed-abdelhaq/spark-demos.git

Cloning into 'spark-demos'...
remote: Enumerating objects: 54, done.

In [None]:
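import pandas as pd

# The cloned repo stores the dataset as a directory of Parquet part files
# under ./spark-demos/data/, so read it with read_parquet rather than read_csv.
df = pd.read_parquet("./spark-demos/data/spark_nlp_dataset.parquet")
df.head()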

In [42]:
!ls ./spark-demos/data/spark_nlp_dataset.parquet

part-00000-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00001-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00002-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00003-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00004-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00005-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00006-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00007-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00008-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00009-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00010-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
part-00011-b415d83b-aa0f-4f60-b33c-9c36d8cc6ac0-c000.snappy.parquet
_SUCCESS


In [43]:
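import pyarrow.parquet as pq

# Read every part file in the dataset directory into a single Arrow table.
dataset = pq.ParquetDataset('./spark-demos/data/spark_nlp_dataset.parquet')
table = dataset.read()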

In [47]:
table.column_names

['text']

In [48]:
table.num_rows

1634
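
One possible next step is to run a pretrained pipeline over the `text` column and count the predicted labels. The sketch below is hedged: `tally_labels` and `fake_annotate` are hypothetical names introduced here, and `fake_annotate` is only a stand-in so the example runs without Spark. It assumes the annotate function returns a dict whose values are lists of labels, as in the `result['sentiment']` output earlier.

```python
from collections import Counter


def tally_labels(texts, annotate_fn, key="sentiment"):
    """Count the labels an annotate function returns for each text.

    Hypothetical helper: annotate_fn is any callable mapping a string to a
    dict of lists, e.g. a pretrained pipeline's annotate method.
    """
    counts = Counter()
    for text in texts:
        result = annotate_fn(text)
        # A text may yield several labels (one per sentence) or none.
        counts.update(result.get(key, []))
    return counts


def fake_annotate(text):
    # Stand-in for a real pipeline so the sketch runs without Spark;
    # its keyword rule is made up for illustration only.
    label = "negative" if "bad" in text else "positive"
    return {"sentiment": [label]}


counts = tally_labels(["film was bad", "film was great"], fake_annotate)
print(counts)
```

In the notebook itself, the call would look something like `tally_labels(table.column('text').to_pylist(), pipeline.annotate)`, with `pipeline` being the sentiment pipeline loaded above.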