# Installing Spark NLP and PySpark

In [17]:
!pip install -q pyspark==3.3.0 spark-nlp==4.0.1

Creating a Spark NLP session:

In [18]:
import sparknlp

spark = sparknlp.start()

# Creating a DataFrame

In [19]:
data = [("The Beatles", "There are places I'll remember "+
                        "All my life though some have changed "+
                        "Some forever, not for better "+
                        "Some have gone and some remain"),
        ("Oasis", "So I start a revolution from my bed " + 
                  "Cause you said the brains I had went to my head " +
                  "Step outside, summertime's in bloom "+
                  "Stand up beside the fireplace"),
        ("Pink Floyd", "How I wish you were here " +
                        "We're just two lost soul " +
                        "Swimming in a fish bowl year after year " + 
                        "Running over the same old ground")]

df_musica = spark.createDataFrame(data, ["artista", "letra"])

In [20]:
df_musica.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)



# Working with the library

In [21]:
# Spark NLP components. Each one will become a stage in our pipeline.

from sparknlp.annotator import LemmatizerModel, Stemmer, Tokenizer, StopWordsCleaner
from sparknlp.base import DocumentAssembler

## [DocumentAssembler](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/DocumentAssembler)

Let's now start building the pipeline that processes the data.


* `DocumentAssembler`: takes the raw text and converts it into a format that can be tokenized, producing one of the object types native to **spark-nlp**, the "Document";

* `.setInputCol()`: sets the annotator's input column;

* `.setOutputCol()`: sets the annotator's output column. The output column of one annotator becomes the input column of the next one.
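
The column wiring described in the list above can be pictured with a plain-Python sketch. It is only an analogy, not Spark NLP code: the `Stage` class, the `run` helper, and the two toy functions are hypothetical names introduced here, and a dict stands in for a DataFrame row.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    """Hypothetical stand-in for an annotator: reads one column, writes another."""
    input_col: str
    output_col: str
    fn: Callable


def run(stages, row):
    """Apply each stage in order, adding its output column to a copy of the row."""
    row = dict(row)
    for stage in stages:
        row[stage.output_col] = stage.fn(row[stage.input_col])
    return row


stages = [
    # "letra" -> "document": wrap the raw text (toy version of document assembly)
    Stage("letra", "document", lambda text: [text]),
    # "document" -> "token": whitespace split (much simpler than the real Tokenizer)
    Stage("document", "token", lambda docs: [w for d in docs for w in d.split()]),
]

row = run(stages, {"artista": "Pink Floyd", "letra": "How I wish you were here"})
print(row["token"])  # ['How', 'I', 'wish', 'you', 'were', 'here']
```

The second stage only works because its input column name matches the first stage's output column name, which is the same contract the annotators follow.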


In [22]:
document_assembler = DocumentAssembler()\
                       .setInputCol("letra")\
                       .setOutputCol("document")

doc_df = document_assembler.transform(df_musica)
doc_df.show(truncate=False)

+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|artista    |letra                                                                                                                                                |document                                                                                                                                                                                        |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------

In [23]:
doc_df.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



The new column is an array of structs, and each struct has the fields shown above.

In [24]:
import pyspark.sql.functions as F
doc_df.withColumn("tmp", F.explode("document"))\
        .select("tmp.*")\
        .show()

+-------------+-----+---+--------------------+---------------+----------+
|annotatorType|begin|end|              result|       metadata|embeddings|
+-------------+-----+---+--------------------+---------------+----------+
|     document|    0|126|There are places ...|{sentence -> 0}|        []|
|     document|    0|148|So I start a revo...|{sentence -> 0}|        []|
|     document|    0|121|How I wish you we...|{sentence -> 0}|        []|
+-------------+-----+---+--------------------+---------------+----------+
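
To make the `begin` and `end` fields concrete, here is a plain-Python sketch of one annotation. It assumes, based on the table above, that `begin` and `end` are inclusive character offsets into the original text. The `make_document_annotation` helper is a hypothetical name, not part of Spark NLP.

```python
def make_document_annotation(text):
    """Build a dict shaped like the 'document' struct shown in the schema."""
    return {
        "annotatorType": "document",
        "begin": 0,
        "end": len(text) - 1,  # inclusive offset of the last character
        "result": text,
        "metadata": {"sentence": "0"},
        "embeddings": [],
    }


lyric = ("There are places I'll remember "
         "All my life though some have changed "
         "Some forever, not for better "
         "Some have gone and some remain")

annotation = make_document_annotation(lyric)
print(annotation["begin"], annotation["end"])  # 0 126
```

The lyric is 127 characters long, so the last offset is 126, matching the first row of the table.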



In [25]:
doc_df.select('document.result').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|[There are places I'll remember All my life though some have changed Some forever, not for better Some have gone and some remain]                      |
|[So I start a revolution from my bed Cause you said the brains I had went to my head Step outside, summertime's in bloom Stand up beside the fireplace]|
|[How I wish you were here We're just two lost soul Swimming in a fish bowl year after year Running over the same old ground]                           |
+---------------------------------------------------------------------------

## [Tokenizer](https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer)

In [26]:
tokenizer = Tokenizer()\
              .setInputCols(["document"])\
              .setOutputCol("token")      

In [27]:
token_df = tokenizer.fit(doc_df).transform(doc_df)

token_df.select('token.result').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[There, are, places, I'll, remember, All, my, life, though, some, have, changed, Some, forever, ,, not, for, better, Some, have, gone, and, some, remain]                            |
|[So, I, start, a, revolution, from, my, bed, Cause, you, said, the, brains, I, had, went, to, my, head, Step, outside, ,, summertime's, in, bloom, Stand, up, beside, the, fireplace]|
|[How, I, wish, you, were, here, We're, just, two, lost, soul, Swimming, in, a, 

In [28]:
token_df.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- va

## Removing [Stopwords](https://nlp.johnsnowlabs.com/docs/en/annotators#stopwordscleaner)

> This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences. Source: documentation

By default it uses the same words as [MLlib's StopWordsRemover](https://spark.apache.org/docs/latest/ml-features#stopwordsremover).

In [29]:
stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")

Listing the words that are treated as stop words:

In [30]:
stopwords_cleaner.getStopWords()

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

In [31]:
token_df_clean = stopwords_cleaner.transform(token_df)
token_df_clean.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- va

In [32]:
token_df_clean.select("cleanTokens.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------+
|[places, remember, life, though, changed, forever, ,, better, gone, remain]                                               |
|[start, revolution, bed, Cause, said, brains, went, head, Step, outside, ,, summertime's, bloom, Stand, beside, fireplace]|
|[wish, two, lost, soul, Swimming, fish, bowl, year, year, Running, old, ground]                                           |
+--------------------------------------------------------------------------------------------------------------------------+
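
The filtering seen above can be reproduced in plain Python to make the matching rule explicit. This is a sketch, not the annotator's implementation: `STOP_WORDS` is a small hand-picked subset chosen for this example, and the case-insensitive default is an assumption inferred from the output, where `There`, `All`, and `Some` were dropped even though the stop word list is lowercase.

```python
# Illustrative subset of an English stop word list; the real default list is longer.
STOP_WORDS = {"there", "are", "i'll", "all", "my", "some", "have", "not", "for", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Keep only the tokens that are not stop words."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["There", "are", "places", "I'll", "remember", "All", "my", "life",
          "though", "some", "have", "changed", "Some", "forever", ",", "not",
          "for", "better", "Some", "have", "gone", "and", "some", "remain"]

clean = remove_stop_words(tokens)
print(clean)
# ['places', 'remember', 'life', 'though', 'changed', 'forever', ',', 'better', 'gone', 'remain']
```

Punctuation such as `,` survives because it is not in the list. Removing it would need a separate step.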



## [Stemmer](https://nlp.johnsnowlabs.com/docs/en/annotators#stemmer)

The stemmer replaces each word with its stem (root form). For example, changing, changed, and change would all be replaced by chang, since they share that stem. Note that the stem does not have to be a real word.
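
The idea of collapsing inflected forms onto a shared stem can be sketched with a toy suffix stripper. This is deliberately simplistic and is not the algorithm Spark NLP uses: the `toy_stem` function and its three suffix rules are assumptions made only for illustration.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy rule set, not a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "e"):
        # Only strip when at least three characters remain, to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("changing", "changed", "change")]
print(stems)  # ['chang', 'chang', 'chang']
```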


In [33]:
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

stem = stemmer.transform(token_df_clean)
stem.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- va

In [34]:
stem.select("stem.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------+
|result                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------+
|[place, rememb, life, though, chang, forev, ,, better, gone, remain]                                             |
|[start, revolut, bed, caus, said, brain, went, head, step, outsid, ,, summertime', bloom, stand, besid, fireplac]|
|[wish, two, lost, soul, swim, fish, bowl, year, year, run, old, ground]                                          |
+-----------------------------------------------------------------------------------------------------------------+



## [Lemmatizer](https://nlp.johnsnowlabs.com/docs/en/annotators#lemmatizer)

Lemmatization also reduces words, but it takes morphological analysis into account: rather than just stripping suffixes, it maps each word to a base form (the lemma) that is guaranteed to exist in the vocabulary. We will use a pretrained Spark NLP model that already ships with the English dictionary we need.

In [35]:
lemmatizer = LemmatizerModel.pretrained() \
              .setInputCols(["cleanTokens"]) \
              .setOutputCol("lemma")

lemma_antbnc download started this may take some time.
Approximate size to download 907,6 KB
[ / ]Download done! Loading the resource.
[OK!]


In [36]:
result = lemmatizer.transform(stem)

In [37]:
result.printSchema()

root
 |-- artista: string (nullable = true)
 |-- letra: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- va

In [38]:
result.select("lemma.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------+
|[place, remember, life, though, change, forever, ,, well, go, remain]                                                 |
|[start, revolution, bed, Cause, say, brain, go, head, Step, outside, ,, summertime's, bloom, Stand, beside, fireplace]|
|[wish, two, lose, soul, Swimming, fish, bowl, year, year, Running, old, ground]                                       |
+----------------------------------------------------------------------------------------------------------------------+
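
The results above are consistent with a dictionary lookup, which can be sketched in plain Python. This assumes a dictionary-based lemmatizer: `LEMMA_DICT` is a hypothetical four-entry dictionary built from the output above, whereas the pretrained model ships a far larger one.

```python
# Hypothetical mini-dictionary: inflected form -> lemma.
LEMMA_DICT = {"places": "place", "changed": "change", "better": "well", "gone": "go"}


def lemmatize(tokens, lemma_dict=LEMMA_DICT):
    """Replace each token with its dictionary lemma; unknown tokens pass through unchanged."""
    return [lemma_dict.get(t, t) for t in tokens]


clean_tokens = ["places", "remember", "life", "though", "changed",
                "forever", ",", "better", "gone", "remain"]
print(lemmatize(clean_tokens))
# ['place', 'remember', 'life', 'though', 'change', 'forever', ',', 'well', 'go', 'remain']
```

Unlike a suffix-stripping stemmer, a lookup can handle irregular forms such as `gone` to `go`, and every output is a real word.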



In [39]:
import pyspark.sql.functions as F

# Zip cleanTokens (not token) with stem and lemma: these three arrays have the
# same length, so position i in each one refers to the same word.
result_df = result.select(F.explode(F.arrays_zip(result.cleanTokens.result, result.stem.result, result.lemma.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("stem"),
                          F.expr("cols['2']").alias("lemma")).toPandas()

print("Comparison between stemmer and lemmatizer:")
result_df.head(10)

Comparison between stemmer and lemmatizer:


Unnamed: 0,token,stem,lemma
0,places,place,place
1,remember,rememb,remember
2,life,life,life
3,though,though,though
4,changed,chang,change
5,forever,forev,forever
6,",",",",","
7,better,better,well
8,gone,gone,go
9,remain,remain,remain
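
Because `arrays_zip` pairs elements by position, a row of the comparison only describes a single word when the zipped arrays have the same length and order. The stop-word-filtered array has that property relative to `stem` and `lemma`; the raw `token` array does not, because it still contains the stop words. The plain-Python sketch below illustrates the difference with the first few tokens of the first lyric. Python's `zip` truncates to the shortest input, whereas `arrays_zip` pads with nulls, but the pairing by position is the same.

```python
tokens = ["There", "are", "places", "I'll", "remember"]  # raw tokens, stop words included
clean_tokens = ["places", "remember"]                    # after stop word removal
stems = ["place", "rememb"]                              # one stem per clean token

misaligned = list(zip(tokens, stems))
aligned = list(zip(clean_tokens, stems))

print(misaligned)  # [('There', 'place'), ('are', 'rememb')]
print(aligned)     # [('places', 'place'), ('remember', 'rememb')]
```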


## Extra: creating the pipeline

A good practice is to put all the steps taken so far into a single pipeline. To do that, we will use the [`Pipeline`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) class from MLlib.

In [40]:
from pyspark.ml import Pipeline

The Finisher converts annotations into human-readable output.

In [46]:
from sparknlp.base import Finisher

finisher = Finisher() \
     .setInputCols(['stem'])
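
Conceptually, this step flattens each annotation struct down to its `result` string. The plain-Python sketch below shows that idea only. The `finish` helper is a hypothetical name, and the offsets in the sample annotations are illustrative rather than taken from a real run.

```python
def finish(annotations):
    """Collect the 'result' field of each annotation into a plain list of strings."""
    return [a["result"] for a in annotations]


# Two annotations shaped like the structs in the schema (offsets are illustrative).
stem_annotations = [
    {"annotatorType": "token", "begin": 0, "end": 3, "result": "wish",
     "metadata": {"sentence": "0"}, "embeddings": []},
    {"annotatorType": "token", "begin": 5, "end": 7, "result": "two",
     "metadata": {"sentence": "0"}, "embeddings": []},
]

finished_stem = finish(stem_annotations)
print(finished_stem)  # ['wish', 'two']
```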

In the pipeline we chose to use the **stemmer** instead of the **lemmatizer**, but you can swap in the lemmatizer without any problems, as long as the Finisher's input column is changed to `lemma`.

In [49]:
pipeline = Pipeline(stages=[
      document_assembler,
      tokenizer,
      stopwords_cleaner,
      stemmer,
      finisher
])

In [50]:
pipe = pipeline.fit(df_musica).transform(df_musica)
pipe.show(truncate=False)

+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|artista    |letra                                                                                                                                                |finished_stem                                                                                                    |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|The Beatles|There are places I'll remember All my life though some have changed Some forever, not for better Some have gone and some remain                      |[pl

The `finished_stem` column holds the stemmer's output, which is the final result of the pipeline.