<a href="https://colab.research.google.com/github/keshav-b/Spark-NLP/blob/master/SparkNLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a DataFrame and visualize it

In [0]:
! pip install pyspark spark-nlp

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/28/dc/a3d36b2dc6c8df87098e39234656356e9035bd493945fcbbdca279294cac/spark_nlp-2.3.4-py2.py3-none-any.whl (62kB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=6a4f9fff90bbce9cf75a3e678454ccbd6da2f0aef7

In [0]:
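import sparknlp
# DocumentAssembler and Tokenizer are used in later cells, so import them here
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer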

In [0]:
# Start a Spark session configured for Spark NLP
spark = sparknlp.start()

In [0]:
document = DocumentAssembler().setInputCol('text').setOutputCol('document')

In [0]:
data = spark.createDataFrame([['this is the first sentence']]).toDF('text')

In [0]:
data.show()

+--------------------+
|                text|
+--------------------+
|this is the first...|
+--------------------+
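
The `text` cell above is cut off because `show()` shortens long strings by default. Below is a minimal pure-Python sketch of that behaviour, assuming the default limit of 20 characters; `truncate_cell` is a hypothetical helper written for illustration, not a PySpark function.

```python
def truncate_cell(value: str, truncate: int = 20) -> str:
    """Mimic how DataFrame.show() shortens a long string cell (illustrative only)."""
    if truncate > 0 and len(value) > truncate:
        if truncate < 4:
            # Too short to fit an ellipsis, so just cut the string
            return value[:truncate]
        # Keep (truncate - 3) characters and append "..."
        return value[: truncate - 3] + "..."
    return value


cell = truncate_cell("this is the first sentence")
print(cell)  # this is the first...
```

Passing `truncate=False` to `show()`, as the later cells do, prints the full cell instead.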



In [0]:
# Transform the raw text into a document annotation
doc_data = document.transform(data)
doc_data.show(truncate=False)

+--------------------------+--------------------------------------------------------------------+
|text                      |document                                                            |
+--------------------------+--------------------------------------------------------------------+
|this is the first sentence|[[document, 0, 25, this is the first sentence, [sentence -> 0], []]]|
+--------------------------+--------------------------------------------------------------------+
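
To make the `document` column easier to read, here is a small pure-Python sketch of what a single annotation appears to contain, based only on the output above. The `Annotation` tuple, its field names, and `assemble_document` are illustrative assumptions, not the Spark NLP classes themselves.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    """Illustrative stand-in for one entry of the document column."""
    annotator_type: str        # e.g. "document" or "token"
    begin: int                 # index of the first character
    end: int                   # index of the last character (inclusive)
    result: str                # the annotated text
    metadata: Dict[str, str]   # e.g. {"sentence": "0"}
    embeddings: List[float]    # empty in this example


def assemble_document(text: str) -> List[Annotation]:
    # One annotation spanning the whole text; `end` is inclusive, hence len - 1
    return [Annotation("document", 0, len(text) - 1, text, {"sentence": "0"}, [])]


doc = assemble_document("this is the first sentence")
print(doc[0].begin, doc[0].end)  # 0 25
```

The sentence has 26 characters, so the inclusive end offset is 25, matching the printed output.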



# Annotators

**Tokenizer**



In [0]:
tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('tokens')

In [0]:
tokenized_data = tokenizer.fit(doc_data).transform(doc_data)

In [0]:
tokenized_data.show(truncate=False)

+--------------------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                      |document                                                            |tokens                                                                                                                                                                                                                   |
+--------------------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
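|this is the first sentence|[[document, 0, 25, this is the first sentence, [sentence -> 0], []]]|[[token, 0, 3, this, [sentence -> 0], []], [token, 5, 6, is, [sentence -> 0], []], [token, 8, 10, the, [sentence -> 0], []], [token, 12, 16, first, [sentence -> 0], []], [token, 18, 25, sentence, [sentence -> 0], []]]|
+--------------------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+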

In [0]:
# Show only the tokens column
tokenized_data.select('tokens').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|tokens                                                                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[token, 0, 3, this, [sentence -> 0], []], [token, 5, 6, is, [sentence -> 0], []], [token, 8, 10, the, [sentence -> 0], []], [token, 12, 16, first, [sentence -> 0], []], [token, 18, 25, sentence, [sentence -> 0], []]]|
+-----------------------------------------------------------------------------------------------------------------------

In [0]:
# Show only the token strings, without offsets and metadata
tokenized_data.select('tokens.result').show(truncate=False)

+--------------------------------+
|result                          |
+--------------------------------+
|[this, is, the, first, sentence]|
+--------------------------------+
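
The begin and end offsets in the `tokens` column can be reproduced with a simplified whitespace tokenizer. This is only a sketch for intuition: Spark NLP's `Tokenizer` applies more rules than a whitespace split, and `whitespace_tokens` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def whitespace_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Return (begin, end, token) triples with inclusive end offsets."""
    # m.end() is exclusive in Python, so subtract 1 to get an inclusive offset
    return [(m.start(), m.end() - 1, m.group()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokens("this is the first sentence")
for begin, end, token in tokens:
    print(begin, end, token)

# Keeping only the strings mirrors selecting 'tokens.result'
results = [token for _, _, token in tokens]
print(results)
```

For this sentence the triples line up with the offsets printed in the `tokens` column above.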



# Pipeline

In [0]:
from pyspark.ml import Pipeline

Here, our pipeline consists of two stages: the DocumentAssembler, which transforms the raw text into a document, and the Tokenizer.

In [0]:
pipeline = Pipeline().setStages([document, tokenizer])

Even though there is nothing to train here, we can still call `fit` to convert the Pipeline into a PipelineModel.

In [0]:
model = pipeline.fit(data)

In [0]:
result = model.transform(data)

In [0]:
result.show()

+--------------------+--------------------+--------------------+
|                text|            document|              tokens|
+--------------------+--------------------+--------------------+
|this is the first...|[[document, 0, 25...|[[token, 0, 3, th...|
+--------------------+--------------------+--------------------+
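
The fit-then-transform pattern used above can be sketched in plain Python. Every class here (`AssembleDocument`, `SplitTokens`, `SplitTokensModel`, `SimplePipeline`, `SimplePipelineModel`) is a hypothetical stand-in written for illustration, operating on a list of dicts rather than a DataFrame; it is not how `pyspark.ml.Pipeline` is implemented.

```python
class AssembleDocument:
    """Transformer-like stage: only has transform()."""
    def transform(self, row):
        return {**row, "document": row["text"]}


class SplitTokensModel:
    """Fitted stage returned by SplitTokens.fit()."""
    def transform(self, row):
        return {**row, "tokens": row["document"].split()}


class SplitTokens:
    """Estimator-like stage: fit() returns a model, even with nothing to learn."""
    def fit(self, rows):
        return SplitTokensModel()


class SimplePipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        # Apply every fitted stage in order
        for stage in self.stages:
            rows = [stage.transform(row) for row in rows]
        return rows


class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Stages with fit() are trained first; the others are used as they are
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            fitted.append(stage)
            # Later stages see the output of the earlier ones
            rows = [stage.transform(row) for row in rows]
        return SimplePipelineModel(fitted)


rows = [{"text": "this is the first sentence"}]
sketch_model = SimplePipeline([AssembleDocument(), SplitTokens()]).fit(rows)
output = sketch_model.transform(rows)
print(output[0]["tokens"])
```

The design point is that `fit` returns a separate model object whose stages are all ready to transform, which is why the notebook calls `pipeline.fit(data)` before `model.transform(data)`.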

