In [None]:
import findspark
findspark.init('/spark-3.5.1-bin-hadoop3')
from pyspark import *
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start SparkSession with Spark NLP
# start() accepts several parameters, including gpu, apple_silicon, and memory
# sparknlp.start(gpu=True) #will start the session with GPU support
spark = sparknlp.start(apple_silicon=True) # will start the session with macOS M1 & M2 support
# sparknlp.start(memory="16G") #to change the default driver memory in SparkSession
# spark = sparknlp.start()
# spark = SparkSession.builder.appName("analytics").getOrCreate()

# Text Preprocessing with Spark NLP

To preprocess the text, we will use the pretrained `explain_document_dl` pipeline.

Information about the model: [Explain Document DL Pipeline for English](https://www.johnsnowlabs.com/explain-document-pretrained-pipeline-spark-nlp-short-blogpost-series-1/)

This is a "simple" model that performs the most common text-processing tasks, plus named entity recognition.

In [None]:
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')


# Create a set of texts to analyze
texts = [
    "Apple Inc. is looking to buy a startup in the United States.",
    "Barack Obama was the 44th President of the United States.",
    "Elon Musk founded SpaceX, an aerospace manufacturer and space transportation company.",
    "The Amazon rainforest is the largest tropical rainforest in the world.",
    "Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."
]

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# First, annotate the test text with Spark NLP:
result = pipeline.annotate(text)



We use this first annotation to show the information the model extracts:

In [None]:
# What's in the pipeline
list(result.keys())

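As we can see, the model produces the following outputs:

* `entities`: Named entities extracted from the text
* `stem`: Stem of each token (stemming)
* `checked`: Spell-checked version of each token
* `lemma`: Lemma of each token (lemmatization)
* `document`: The full input text as a document
* `pos`: Part-of-speech tag of each token
* `token`: The tokens the text was split into
* `ner`: Named entity tag of each token, in IOB format
* `embeddings`: Word embedding of each token
* `sentence`: The sentences detected in the text

We can start by showing the entities that were returned: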

In [None]:
# Entities:
result['entities']

Before continuing, we create a function that shows the tokens side by side with the result of each of these analyses:

In [None]:
import pandas as pd

def create_dataframe(data, key):
    # Ensure the key exists in the data
    if key not in data:
        raise ValueError(f"The key '{key}' is not present in the data.")
    
    # Create a DataFrame from the tokens and the specified key
    df = pd.DataFrame({
        'token': data['token'],
        key: data[key]
    })
    
    return df
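
`create_dataframe` pairs every token with one value from `key`, so it is only meaningful for outputs that have exactly one item per token (such as `pos`, `ner`, `stem`, or `lemma`). Outputs such as `entities` or `sentence` usually have a different length, and pandas rejects columns of unequal length. Below is a minimal sketch of a guard for that case. The helper name `is_token_aligned` is hypothetical, and the sample dictionary is written by hand to imitate the shape of the `annotate` output; it is not real model output.

```python
def is_token_aligned(data, key):
    """Return True when data[key] has exactly one item per token."""
    if key not in data:
        raise ValueError(f"The key '{key}' is not present in the data.")
    return len(data[key]) == len(data["token"])


# Hand-written sample that imitates the shape of pipeline.annotate();
# the values are illustrative, not actual model output.
sample = {
    "token": ["The", "Mona", "Lisa", "is", "in", "Paris", "."],
    "pos": ["DT", "NNP", "NNP", "VBZ", "IN", "NNP", "."],
    "entities": ["Mona Lisa", "Paris"],
}

print(is_token_aligned(sample, "pos"))       # one tag per token
print(is_token_aligned(sample, "entities"))  # fewer entities than tokens
```

A check like this could be called before `create_dataframe` to give a clearer error message than the one pandas raises.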

We show the output of the *Part of Speech Tagger* using this function. For reference, these are the tags the POS tagger uses:

* CC coordinating conjunction
* CD cardinal number
* DT determiner
* EX existential there (like: “there is” … think of it like “there exists”)
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective "big"
* JJR adjective, comparative "bigger"
* JJS adjective, superlative "biggest"
* LS list marker 1)
* MD modal could, will
* NN noun, singular "desk"
* NNS noun plural "desks"
* NNP proper noun, singular "Harrison"
* NNPS proper noun, plural "Americans"
* PDT predeterminer "all the kids"
* POS possessive ending parent's
* PRP personal pronoun I, he, she
* PRP\$ possessive pronoun my, his, hers
* RB adverb very, silently
* RBR adverb, comparative better
* RBS adverb, superlative best
* RP particle give up
* TO to, as in go "to" the store
* UH interjection, errrrrrrrm
* VB verb, base form take
* VBD verb, past tense took
* VBG verb, gerund/present participle taking
* VBN verb, past participle taken
* VBP verb, sing. present, non-3rd person take
* VBZ verb, 3rd person sing. present takes
* WDT wh-determiner which
* WP wh-pronoun who, what
* WP\$ possessive wh-pronoun whose
* WRB wh-adverb where, when

Tags taken from [Categorizing and POS Tagging with NLTK Python](https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/)

In [None]:
create_dataframe(result,'pos')
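
A quick way to summarize a POS-tagged text is to count how often each tag appears. The sketch below uses only the standard library; `pos_frequencies` is a hypothetical helper, and the tag list is written by hand for illustration rather than copied from the model.

```python
from collections import Counter


def pos_frequencies(pos_tags):
    """Count how often each POS tag appears, most frequent first."""
    return Counter(pos_tags).most_common()


# Illustrative tags written by hand; not actual model output.
sample_pos = ["DT", "NNP", "NNP", "VBZ", "DT", "JJ", "NN", "."]

for tag, count in pos_frequencies(sample_pos):
    print(f"{tag}: {count}")
```

With the real annotation, the same function could be called as `pos_frequencies(result['pos'])`.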

Now we can see how the model classifies the tokens using named entity recognition (NER), with the *IOB (Inside, Outside, Beginning)* format.

IOB2 tags:

* `I- (Inside)`: The token is inside a chunk.
* `O (Outside)`: The token does not belong to any chunk.
* `B- (Beginning)`: The token is the beginning of a chunk.

In IOB2, the first token of every chunk carries the B- prefix, including when the chunk starts right after an O tag.


More information about the IOB format: [Inside–outside–beginning (tagging)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))
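
To make the tagging scheme concrete, here is a minimal sketch that groups IOB2-tagged tokens back into entity chunks. The function name `iob2_to_chunks` is hypothetical, and the tokens and tags are written by hand for illustration; they are not actual output of the pipeline.

```python
def iob2_to_chunks(tokens, tags):
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    chunks = []
    current_tokens = []
    current_type = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")

        # An I- tag with the same label extends the chunk that is open.
        if prefix == "I" and current_tokens and label == current_type:
            current_tokens.append(token)
            continue

        # Any other tag closes the open chunk, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

        # B- opens a new chunk; a stray I- is leniently treated the same way.
        if prefix in ("B", "I") and label:
            current_tokens, current_type = [token], label

    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


# Illustrative tokens and tags written by hand; not actual model output.
tokens = ["The", "Mona", "Lisa", "is", "held", "at", "the", "Louvre", "in", "Paris", "."]
tags = ["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]

print(iob2_to_chunks(tokens, tags))
```

This is roughly the kind of grouping that turns per-token `ner` tags into the chunk-level `entities` output, although the pipeline's own implementation may differ.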

In [None]:
create_dataframe(result,'ner')

The model also performs stemming:

In [None]:
# Check the results
create_dataframe(result,'stem')

And lemmatization:

In [None]:
# Check the results
create_dataframe(result,'lemma')

In [None]:
# Check the results
result['token']

## Sentiment Analysis

Spark NLP includes a sentiment analysis module called [SentimentDetector](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/sentiment/sentiment_detector/index.html)

This module is rule-based, unlike the models we saw in the previous class, which rely on pretrained models. Rule-based approaches use dictionaries of words with positive or negative connotations; based on those dictionaries plus a set of heuristics, the module determines whether a text expresses positive or negative sentiment.

The following example is based on [Sentiment Analysis with Spark NLP without Machine Learning](https://www.johnsnowlabs.com/sentiment-analysis-with-spark-nlp-without-machine-learning/)

For this model, we first need to download the lemmatization file and a dictionary that maps words to their associated "sentiment":
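
To illustrate the general idea of dictionary-based scoring, here is a toy sketch in plain Python. It is not the algorithm `SentimentDetector` implements: the lexicon, the `"negation"` category, the function name `score_sentiment`, and the choice to label a tie as positive are all assumptions made for this example.

```python
def score_sentiment(tokens, lexicon):
    """Toy rule-based scorer: sum word polarities, flipping after a negation."""
    score = 0
    negate = False
    for token in tokens:
        kind = lexicon.get(token.lower())
        if kind == "negation":
            # Heuristic: flip the polarity of the next sentiment word.
            negate = True
        elif kind in ("positive", "negative"):
            value = 1 if kind == "positive" else -1
            score += -value if negate else value
            negate = False
    # Arbitrary choice for this sketch: a tie counts as positive.
    return "positive" if score >= 0 else "negative"


# Hypothetical mini-lexicon, written by hand for illustration.
lexicon = {
    "fantastic": "positive",
    "promising": "positive",
    "trust": "positive",
    "ruining": "negative",
    "not": "negation",
}

print(score_sentiment("a fantastic job".split(), lexicon))
print(score_sentiment("I do not trust him".split(), lexicon))
```

The second sentence shows why heuristics matter: without the negation rule, "trust" alone would push the score toward positive.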

In [None]:
! curl -O --output-dir /tmp https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/lemma-corpus-small/lemmas_small.txt
! curl -O --output-dir /tmp https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/default-sentiment-dict.txt 
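
Both files are plain-text dictionaries. The sketch below shows how lines in these formats could be parsed, assuming the delimiters that the pipeline later passes to `setDictionary` (`->` and a tab for the lemma file, a comma for the sentiment file). The function names and sample lines are hypothetical; the sample lines are written by hand and are not copied from the downloaded files.

```python
def parse_lemma_line(line, key_delimiter="->", value_delimiter="\t"):
    """Split 'lemma -> form1<TAB>form2' into (lemma, [forms])."""
    lemma, _, forms = line.partition(key_delimiter)
    cleaned = [form.strip() for form in forms.split(value_delimiter)]
    return lemma.strip(), [form for form in cleaned if form]


def parse_sentiment_line(line, delimiter=","):
    """Split 'word,label' into (word, label)."""
    word, _, label = line.partition(delimiter)
    return word.strip(), label.strip()


# Hand-written sample lines in the assumed formats; not real file contents.
print(parse_lemma_line("be -> am\tis\tare"))
print(parse_sentiment_line("good,positive"))
```

Inspecting the first few lines of each downloaded file (for example with `head`) is a simple way to confirm that the real format matches these assumptions.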

The next step is to create the Spark pipeline that analyzes the text (taken from [Sentiment Analysis with Spark NLP without Machine Learning](https://www.johnsnowlabs.com/sentiment-analysis-with-spark-nlp-without-machine-learning/)):

In [None]:
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline, Finisher
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    Lemmatizer,
    SentimentDetector
)
import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Sentence Detection
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

# Step 3: Tokenization
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Step 4: Lemmatization
lemmatizer = (
    Lemmatizer()
    .setInputCols(["token"])
    .setOutputCol("lemma")
    .setDictionary("/tmp/lemmas_small.txt", key_delimiter="->", value_delimiter="\t")
)

# Step 5: Sentiment Detection
sentiment_detector = (
    SentimentDetector()
    .setInputCols(["lemma", "sentence"])
    .setOutputCol("sentiment_score")
    .setDictionary("/tmp/default-sentiment-dict.txt", ",")
)

# Step 6: Finisher
finisher = (
    Finisher()
    .setInputCols(["sentiment_score"])
    .setOutputCols("sentiment")
)

# Define the pipeline
pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector, 
        tokenizer, 
        lemmatizer, 
        sentiment_detector, 
        finisher
    ]
)

We use the same example texts from the previous class:

In [None]:
tweets = [
    "I think Alex Johnson is doing a fantastic job leading the country.",
    "Alex Johnson's policies are ruining our economy.",
    "I'm not sure about Alex Johnson's latest speech, it was confusing.",
    "The new reforms introduced by Alex Johnson are very promising.",
    "Alex Johnson seems to care about the people's issues, which is refreshing.",
    "I am disappointed with Alex Johnson's performance.",
    "Alex Johnson's leadership style is quite effective.",
    "The way Alex Johnson handled the recent crisis was commendable.",
    "I don't trust Alex Johnson's intentions at all.",
    "Alex Johnson has brought positive changes to the healthcare system."
]

# Create a Spark DataFrame from the list of sentences
tweets_df = spark.createDataFrame([(text,) for text in tweets], ["text"])

# Show the DataFrame
tweets_df.show(truncate=False)


And we run the pipeline:

In [None]:
# Fit-transform to get predictions
# Keep the transformed DataFrame in `result`; show() returns None, so call it separately
result = pipeline.fit(tweets_df).transform(tweets_df)
result.show(truncate=50)