# Team members:

* Patrick Xavier Marquez Choque
* Jean Carlo Cornejo Cornejo

First, the library itself must be installed. After that, the pre-trained models and the Spark NLP pipeline can be used.

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-12-01 21:23:26--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2021-12-01 21:23:26--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-12-01 21:23:26--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

In [3]:
spark = sparknlp.start()
spark

In [4]:
print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 3.3.4
Apache Spark version: 3.0.3


## Loading the text to be analyzed by Spark NLP

In [5]:
text = "While not normally known for his musical talent, Elon Musk is releasing a debut album at midnight. Musk has had a busy month. In July, Tesla reported a record loss, marking the company’s first quarterly net loss in its history. Musk also landed in a feud with rapper Azealia Banks, whom he has accused of trolling him, but Musk may soon put this all behind him. The Tesla CEO has put out a single and music video for an album called “Smile from the Beginning”. The 11-minute track will debut at midnight on the 18th of August. Musk shared a video on Instagram of a cover art featuring himself smiling at the camera while perched on a motorbike. Preparing to release a brand new album, Smile from the Beginning at midnight on August 18th,” he wrote alongside the video. Along with “Smile from the Beginning”, Musk also shared the album’s tracklist, which includes “Loop”, “633”, and “Sweet Sleeping Beauty”. “Six months ago I realized I had to start taking risks,” Musk told the press last year, in reference to the “Tesla Inc” social media account. He further explained. Then he left. :c"

In [6]:
# Read the whole paper into a single string; the file is closed automatically.
with open("/content/paper.txt", "r") as text_file:
    text2 = text_file.read()

In [7]:
data = spark.createDataFrame([[text2]]).toDF('text')

In [8]:
#data2 = spark.createDataFrame([['Universidad Católica San Pablo es mi Universidad favorita. La Universidad Católica San Pablo tiene campus universitario. Me gusta el campus universitario de la Universidad Católica San Pablo.']]).toDF('text')

In [9]:
data.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Configuring the pipeline for our pre-trained models
In this case we use several pre-trained models to analyze the input text.

In [10]:
document = DocumentAssembler().setInputCol('text').setOutputCol('document').setCleanupMode('shrink')

In [11]:
document

DocumentAssembler_15cf012d8281

In [12]:
sentence = SentenceDetector().setInputCols('document').setOutputCol('sentence')

In [13]:
sentence.setExplodeSentences(True)

SentenceDetector_76234a67dcb2
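
With `setExplodeSentences(True)`, each detected sentence is emitted as its own row instead of one array per document. The sketch below imitates that behaviour in plain Python. The helper name `explode_sentences` and the naive splitting rule are assumptions made for illustration; Spark NLP's rule-based detector handles many more cases (abbreviations, quotes, and so on).

```python
import re

def explode_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace.

    Returns one dict per sentence, repeating the original text in each row,
    which mirrors how an exploded DataFrame repeats the `text` column.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [{"text": text, "sentence": s} for s in parts if s]

# Two sentences taken from the sample text defined earlier in the notebook.
sample = "Musk has had a busy month. In July, Tesla reported a record loss."
rows = explode_sentences(sample)
for row in rows:
    print(row["sentence"])
```

Each printed line corresponds to one row of the exploded output.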

In [14]:
tokenizer = Tokenizer().setInputCols('sentence').setOutputCol('token')

In [15]:
tokenizer.setExceptions(['Católica'])

Tokenizer_68ae7ff445b0

In [16]:
checker = NorvigSweetingModel.pretrained().setInputCols(['token']).setOutputCol('checked')

spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]
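
`NorvigSweetingModel` is a pretrained spell checker inspired by Peter Norvig's spelling corrector: generate candidate words a small edit distance away and keep the most frequent known one. A minimal sketch of that idea follows. The tiny `WORD_COUNTS` vocabulary and the `correct` helper are made up for illustration, and the pretrained model relies on a far larger dictionary and extra heuristics.

```python
from collections import Counter

# Hypothetical toy vocabulary with made-up frequencies (lowercase only).
WORD_COUNTS = Counter({"spark": 50, "spar": 2, "tesla": 20, "album": 15, "music": 30})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return `word` if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    # Fall back to the input when nothing in the vocabulary is close enough.
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct("sparc"))   # both "spark" and "spar" are candidates; the more frequent wins
print(correct("albmu"))   # fixed by transposing the last two letters
print(correct("xyzzy"))   # no candidate, so the word is returned unchanged
```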


In [17]:
embeddings = WordEmbeddingsModel.pretrained().setInputCols(['sentence', 'checked']).setOutputCol('embeddings')

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [18]:
ner = NerDLModel.pretrained().setInputCols(['sentence','checked', 'embeddings']).setOutputCol('ner')

ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [19]:
converter = NerConverter().setInputCols(['sentence', 'checked', 'ner']).setOutputCol('chunk')

In [20]:
from pyspark.ml import Pipeline

In [21]:
pipeline = Pipeline().setStages([document, sentence, tokenizer, checker, embeddings, ner, converter])

In [22]:
pipeline

Pipeline_c1f807de6d2a

In [23]:
model = pipeline.fit(data)

In [24]:
model

PipelineModel_c78ea880aa5b

In [25]:
result = model.transform(data)

In [26]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|             checked|          embeddings|                 ner|               chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Spark NLP: Natura...|[[document, 0, 14...|[[document, 0, 27...|[[token, 0, 4, Sp...|[[token, 0, 4, Sp...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 8, Sp...|
|Spark NLP: Natura...|[[document, 0, 14...|[[document, 275, ...|[[token, 275, 276...|[[token, 275, 276...|[[word_embeddings...|[[named_entity, 2...|[[chunk, 306, 306...|
|Spark NLP: Natura...|[[document, 0, 14...|[[document, 416, ...|[[token, 416, 420...|[[token, 416, 420...|[[word_embeddings...|[[named_entity, 4...|[[

In [27]:
result.select('sentence.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [28]:
result.select('checked.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------

In [29]:
result.select('ner.result').show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[B-ORG, I-ORG, O, O, O, O, O, O, O, B-ORG, O, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG, O, O, B-LOC, I-LOC, I-LOC, O, B-LOC, O, B-LOC, O, O, O, O, B-ORG, I-ORG, I-ORG, O, O, B-ORG, I-ORG, I-ORG, O, B-ORG, O, O, O, O, O, O, B-MISC, I-MISC, O, O]|
|[O, O, O, O

In [30]:
result.select('ner.begin', 'ner.end').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|begin                                                                                                                                                                                                                  

In [31]:
result.select('chunk.result', 'chunk.begin', 'chunk.end').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------------------------------------------+
|result                                                                                                                                                  |begin                                         |end                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------------------------------------------+
|[Spark NLP, Kocaman, David Talby John Snow Labs Inc, Coastal Highway Lewes, DE, USA, Abstract Spark NLP, Natural Language Processing, NLP, Apache Spark]|[0, 58, 67, 105, 128, 133, 176, 200, 229, 258]|[8, 64, 96, 125, 129, 135, 193, 226, 231, 269]|
|[&,
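
The `begin` and `end` columns are character offsets, and `end` is inclusive: in the output above, `Spark NLP` (9 characters) spans 0 to 8. The sketch below shows how such offsets map back to text. The helpers `inclusive_span` and `slice_chunk` are hypothetical names introduced here, and the offsets are computed in the example rather than taken from the pipeline.

```python
def inclusive_span(text, chunk):
    """Return (begin, end) of the first occurrence of `chunk`, with `end` inclusive."""
    begin = text.find(chunk)
    if begin < 0:
        raise ValueError(f"{chunk!r} not found in text")
    return begin, begin + len(chunk) - 1

def slice_chunk(text, begin, end):
    """Recover the substring for an annotation whose end offset is inclusive."""
    return text[begin:end + 1]

doc = "Donald MacDonald rides his Tesla in Africa"
for chunk in ["Donald MacDonald", "Tesla", "Africa"]:
    begin, end = inclusive_span(doc, chunk)
    print(chunk, begin, end, slice_chunk(doc, begin, end))
```

Forgetting the `+ 1` when slicing would drop the last character of every chunk.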

In [32]:
light = LightPipeline(model)

In [33]:
light

<sparknlp.base.LightPipeline at 0x7f1419288a10>

In [34]:
light.annotate("Donald MacDonald rides his Tesla in Africa")

{'checked': ['Donald', 'MacDonald', 'rides', 'his', 'Tesla', 'in', 'Africa'],
 'chunk': ['Donald MacDonald', 'Tesla', 'Africa'],
 'document': ['Donald MacDonald rides his Tesla in Africa'],
 'embeddings': ['Donald',
  'MacDonald',
  'rides',
  'his',
  'Tesla',
  'in',
  'Africa'],
 'ner': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC'],
 'sentence': ['Donald MacDonald rides his Tesla in Africa'],
 'token': ['Donald', 'MacDonald', 'rides', 'his', 'Tesla', 'in', 'Africa']}
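
The `chunk` entry in this result can be reproduced from the `token` and `ner` entries: a `B-` tag opens an entity, following `I-` tags with the same label extend it, and an `O` tag closes it. The sketch below shows that grouping in plain Python. `iob_to_chunks` is a hypothetical helper written for illustration, not the implementation of `NerConverter`, and joining tokens with a single space is a simplification.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            # Close any open entity, then start a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:
            # "O" tag: close the open entity, if any.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Tokens and tags copied from the LightPipeline output above.
tokens = ["Donald", "MacDonald", "rides", "his", "Tesla", "in", "Africa"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

The chunk texts match the `chunk` list returned by `light.annotate`, with the entity label attached to each one.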