# SPARK NLP (John Snow Labs)

Two alternatives for installation:

- Colab: `!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash`

- Jupyter/local:
```
!pip install pyspark
!pip install spark-nlp==5.1.4
```

More info and examples: https://github.com/JohnSnowLabs/spark-nlp-workshop
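After either installation route, it can help to confirm which versions actually ended up in the environment before starting a Spark session. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Spark NLP.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages installed in the steps above.
for name in ("pyspark", "spark-nlp"):
    print(name, installed_version(name) or "not installed")
```

If `spark-nlp` is reported as missing, the `pip install` cell probably ran in a different environment than the notebook kernel.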



In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2023-11-01 12:25:54--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2023-11-01 12:25:55--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’


2023-11-01 12:25:55 (78.8 MB/s) - written to stdout [1191/1191]

Installing PySpark 3.2.3 and Spark NLP 5.1.4
setup Colab for PySpark 3.2.3 and Spark NLP 5

## Building the context and Spark object

In [None]:
import sparknlp

spark = sparknlp.start()

In [None]:
print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 5.1.4
Apache Spark version: 3.2.3


In [None]:
from sparknlp.pretrained import PretrainedPipeline

In [None]:
ner = PretrainedPipeline('recognize_entities_dl', 'en')

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]


In [None]:
result = ner.annotate('The president Jesús Martínez arrived yesterday at Santa Cruz de Tenerife and he gave a nice speech.')

In [None]:
result

{'entities': ['Jesús Martínez', 'Santa Cruz de Tenerife'],
 'document': ['The president Jesús Martínez arrived yesterday at Santa Cruz de Tenerife and he gave a nice speech.'],
 'token': ['The',
  'president',
  'Jesús',
  'Martínez',
  'arrived',
  'yesterday',
  'at',
  'Santa',
  'Cruz',
  'de',
  'Tenerife',
  'and',
  'he',
  'gave',
  'a',
  'nice',
  'speech',
  '.'],
 'ner': ['O',
  'O',
  'B-PER',
  'I-PER',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-LOC',
  'I-LOC',
  'I-LOC',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'],
 'embeddings': ['The',
  'president',
  'Jesús',
  'Martínez',
  'arrived',
  'yesterday',
  'at',
  'Santa',
  'Cruz',
  'de',
  'Tenerife',
  'and',
  'he',
  'gave',
  'a',
  'nice',
  'speech',
  '.'],
 'sentence': ['The president Jesús Martínez arrived yesterday at Santa Cruz de Tenerife and he gave a nice speech.']}
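The `token` and `ner` lists in this output are aligned position by position, and the tags follow the BIO scheme (`B-` opens an entity, `I-` continues it, `O` is outside any entity). As a rough sketch of how the `entities` list relates to those tags, the hypothetical helper `group_entities` below merges tagged tokens back into typed entities. It is plain Python for illustration and not how Spark NLP builds the `entities` column internally.

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Same entity type as the open entity: extend it.
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # "B-" opens a new entity; a stray "I-" is treated as an opening tag too.
            current_tokens, current_type = [token], label
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# The 'token' and 'ner' values copied from the output above.
sample = {
    "token": ["The", "president", "Jesús", "Martínez", "arrived", "yesterday",
              "at", "Santa", "Cruz", "de", "Tenerife", "and", "he", "gave",
              "a", "nice", "speech", "."],
    "ner": ["O", "O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC",
            "I-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "O"],
}

print(group_entities(sample["token"], sample["ner"]))
# [('Jesús Martínez', 'PER'), ('Santa Cruz de Tenerife', 'LOC')]
```

Unlike the plain `entities` list, this keeps the entity type next to each span, which is often what downstream code needs.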

### LOOK UP A MODEL FOR YOUR TEXT PREDICTION TASK:
- Browse the Models Hub: `https://sparknlp.org/models`
- Check limitations and task to accomplish (e.g., max number of tokens, embeddings, fill mask, sentiment, etc.)
- Check size (some models can be very large)

In [None]:
sentiment = PretrainedPipeline('analyze_sentimentdl_glove_imdb', 'en')

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 154.1 MB
[OK!]


We can test the pipeline with toy samples:

In [None]:
result = sentiment.annotate("The Minions is an excellent movie")

In [None]:
result

{'document': ['The Minions is an excellent movie'],
 'sentiment': ['pos'],
 'word_embeddings': ['The', 'Minions', 'is', 'an', 'excellent', 'movie'],
 'sentence_embeddings': ['The Minions is an excellent movie'],
 'tokens': ['The', 'Minions', 'is', 'an', 'excellent', 'movie'],
 'sentence': ['The Minions is an excellent movie']}
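When several toy samples are annotated, each one yields a dictionary shaped like the one above, so it is convenient to flatten them into `(text, label)` rows. The sketch below assumes the results have been collected into a list of such dictionaries (to my understanding, calling `annotate` on a list of strings returns one dictionary per input, but treat that as an assumption). The helper name `to_rows` is hypothetical.

```python
def to_rows(annotations):
    """Flatten annotate()-style dictionaries into (text, sentiment) tuples."""
    rows = []
    for annotation in annotations:
        text = " ".join(annotation["document"])
        # One label per detected sentence; most toy samples have exactly one.
        for label in annotation["sentiment"]:
            rows.append((text, label))
    return rows


# The result shown above, wrapped in a list (only the keys used here).
annotations = [
    {"document": ["The Minions is an excellent movie"], "sentiment": ["pos"]},
]

for text, label in to_rows(annotations):
    print(f"{label}\t{text}")
```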

Or use a Spark pipeline to process a large dataset of texts:

https://sparknlp.org/api/python/user_guide/annotators.html

In [None]:
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

In [None]:
documentAssembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
  .setInputCols("document") \
  .setOutputCol("sentence_embeddings")

sentiment = SentimentDLModel.pretrained("sentimentdl_use_twitter")\
  .setInputCols("sentence_embeddings")\
  .setThreshold(0.7)\
  .setOutputCol("sentiment")

pipeline = Pipeline(stages=[documentAssembler, use, sentiment])

data = spark.createDataFrame([["What a nasty movie."],["Indeed a good film."]]).toDF("text")

result = pipeline.fit(data).transform(data)


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]


In [None]:
result.select("text", "sentiment.result")\
      .selectExpr( "text", "explode(result) as sentiment")\
      .show()

+-------------------+---------+
|               text|sentiment|
+-------------------+---------+
|What a nasty movie.| negative|
|Indeed a good film.| positive|
+-------------------+---------+
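The pipeline above calls `.setThreshold(0.7)`. As I understand the Spark NLP documentation, the model falls back to a neutral label when the best class score is below this threshold. The snippet below is a plain-Python sketch of that idea only: the function name, the `fallback` parameter, the `>=` comparison, and the example scores are all assumptions for illustration, not the library's implementation or real model output.

```python
def label_with_threshold(scores, threshold=0.7, fallback="neutral"):
    """Return the best-scoring label, or `fallback` if its score is below `threshold`."""
    best_label = max(scores, key=scores.get)
    if scores[best_label] >= threshold:
        return best_label
    return fallback


# Hypothetical class scores, invented for this illustration.
confident = {"positive": 0.93, "negative": 0.07}
uncertain = {"positive": 0.55, "negative": 0.45}

print(label_with_threshold(confident))  # positive
print(label_with_threshold(uncertain))  # neutral
```

Raising the threshold trades coverage for precision: fewer texts receive a positive or negative label, but those that do were scored with more confidence.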



## Dealing directly with embeddings:

In [None]:
from sparknlp.annotator import Tokenizer, RoBertaEmbeddings

In [None]:
# Turn raw text into "document" annotations.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens.
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# Compute one contextual embedding per token with a Spanish RoBERTa model.
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_roberta_base_spanish", "es") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["Me encanta spark nlp"], ["No estoy seguro que sea bueno"]]).toDF("text")

result = pipeline.fit(data).transform(data)
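The `embeddings` column produced here holds one vector per token. A common next step is to pool those token vectors into a single sentence vector and compare sentences by cosine similarity. The sketch below shows that arithmetic in plain Python; the three-dimensional vectors are made up for illustration (real RoBERTa vectors are much longer), and the helper names are hypothetical. It assumes the token vectors have already been collected from the DataFrame into Python lists.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    return [sum(component) / count for component in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors for two short sentences (not real model output).
sentence_a = [[0.2, 0.1, 0.4], [0.6, 0.3, 0.0]]
sentence_b = [[0.1, 0.2, 0.5], [0.5, 0.2, 0.1], [0.3, 0.2, 0.3]]

vector_a = mean_pool(sentence_a)
vector_b = mean_pool(sentence_b)
print(round(cosine_similarity(vector_a, vector_b), 3))
```

Averaging is only one pooling strategy; it ignores word order but is cheap and works reasonably well as a baseline.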

### Training

In [None]:
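# Trainable classifier that consumes the "sentence_embeddings" column
# produced by the UniversalSentenceEncoder stage (`use`) defined earlier.
docClassifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline().setStages([
    documentAssembler,
    use,
    docClassifier
])

# smallCorpus is not defined in this notebook: it is assumed to be a Spark DataFrame
# with a "text" column and a "label" column.
pipelineModel = pipeline.fit(smallCorpus)

# The result is a PipelineModel that can be used with transform(data) to classify sentiment.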