# Text preparation with PySpark

## Importing libraries

For this text preparation process we are going to use **PySpark** together with the **Spark NLP** library (`sparknlp`).

In [1]:
import sparknlp
from sparknlp.annotator import Stemmer, LemmatizerModel, Tokenizer, Normalizer, StopWordsCleaner
from sparknlp.base import DocumentAssembler, Pipeline, Finisher

In [2]:
spark = sparknlp.start()

com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-41541c31-a09e-476b-bd6d-79d614a4916e;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.1 in central

## Reading Twitter Data

In [3]:
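path_in = "twitterClimateData.csv"
# The file is semicolon-separated and has a header row; let Spark infer the column types.
df = spark.read.csv(path_in, inferSchema=True, header=True, sep=";")
# Keep only the two columns used in the rest of the notebook.
df = df.select(["text", "hashtags"])
df.show()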



+--------------------+--------------------+
|                text|            hashtags|
+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|
|Winter has not st...|#climatefriday #c...|
|WEEK 55 of #Clima...|      #ClimateStrike|
|A year of resista...|#greta #gretathun...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|
|10 Questions to A...|#climatechange #n...|
|#climatestrike #F...|#climatestrike #F...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|
|My oldest daughte...|#climatestrike #l...|
|Our toddler #POTU...|#POTUS #Time #Gre...|
|"""The change is ...|#ClimateChange #c...|
|Moments after #Im...|#ImpeachmentVote ...|
|#climatestrike #C...|#climatestrike #C...|
|Keep up the great...|#ClimateChangeIsR...|
|Congratulations @...|#climatestrike #F...|
|Even though I hop...|#HongKongProteste...|
|*gretathunberg Is...|#PersonoftheYear ...|
| Congratulations ...|#vegan #climatest...|
|I get my energy a...|#ClimateStrike #F...|
| THE CHAMBER OF C...|#greta #gr

                                                                                

In [4]:
df.printSchema()

root
 |-- text: string (nullable = true)
 |-- hashtags: string (nullable = true)



In [5]:
df.count()

72405
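
The preview above includes a row whose `text` value starts with `"""`, which may indicate a field that was written with CSV-style doubled quotes. The following is a minimal sketch, using Python's standard `csv` module rather than Spark, of how a semicolon-separated line with doubled quotes is conventionally decoded. The sample line is invented for illustration and is not taken from `twitterClimateData.csv`.

```python
import csv
import io

# Hypothetical sample in the same layout as the dataset: ';' separator, header row.
# The second line uses doubled quotes ("") to embed a literal quote in a quoted field.
sample = 'text;hashtags\n"""The change is coming"" she said";#ClimateChange\n'

rows = list(csv.reader(io.StringIO(sample), delimiter=";", quotechar='"'))
header, first_row = rows[0], rows[1]

print(header)
print(first_row)
```

If the triple quotes in the Spark preview turn out to be unwanted, the `quote` and `escape` options of `spark.read.csv` are the place to look. Whether they need changing for this file has not been checked here.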

## Text preparation process

The goal of this process is to reduce the number of distinct tokens without losing the interpretability of the words, so that the resulting bag of words is as compact and meaningful as possible. The process is applied to each column of the DataFrame separately: first to the `text` column and then to the `hashtags` column.
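
To make the goal concrete, here is a minimal plain-Python sketch (not Spark NLP) that builds a bag of words from two made-up tweets, once from raw whitespace tokens and once after lowercasing, keeping letters only, and dropping stop words. The stop-word list and the tweets are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Spark NLP's default list).
STOP_WORDS = {"is", "the", "of", "a", "to", "for", "not", "has"}

def raw_tokens(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()

def prepared_tokens(text):
    """Lowercase, keep runs of ASCII letters only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# Made-up tweets for illustration.
tweets = [
    "The climate strike is the start of a movement",
    "Climate strike! The movement has not stopped",
]

raw_bag = Counter(t for tweet in tweets for t in raw_tokens(tweet))
prepared_bag = Counter(t for tweet in tweets for t in prepared_tokens(tweet))

print(len(raw_bag), "distinct raw tokens")
print(len(prepared_bag), "distinct prepared tokens")
print(prepared_bag.most_common(3))
```

Variants such as `Climate`/`climate` and `strike!`/`strike` collapse into single entries, which is the kind of reduction the Spark NLP pipeline below aims for at scale.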

### Text preparation process for `text` column

### 1) Tokenization and pipeline stages

In [6]:
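# Wrap the raw `text` column in Spark NLP's document annotation.
documentAssembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document_text")
)
# Split each document into tokens.
tokenizer = Tokenizer().setInputCols(["document_text"]).setOutputCol("text_tokens")
# Clean up each token (for example, the `#` of a hashtag is gone in the output below).
normalizer = Normalizer().setInputCols(["text_tokens"]).setOutputCol("text_normalized")
# Drop stop words.
stop_word_remover = (
    StopWordsCleaner().setInputCols(["text_normalized"]).setOutputCol("text_stop_words_cleaned")
)
# Reduce each remaining token to its stem.
stemmer = Stemmer().setInputCols(["text_stop_words_cleaned"]).setOutputCol("text_stemmed")
# Look up lemmas with the default pretrained model (downloaded on first use).
lemmatizer = (
    LemmatizerModel.pretrained()
    .setInputCols(["text_stemmed"])
    .setOutputCol("text_lemmatized")
)
# Convert the annotations back into a plain array-of-strings column.
finisher = Finisher().setInputCols(["text_lemmatized"]).setOutputCols(["refined_text"]).setOutputAsArray(True)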

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
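
Each stage above names the column it reads and the column it writes, and the next stage picks up that output. A minimal plain-Python sketch of this wiring follows. The stage functions are simplified stand-ins, not Spark NLP's annotators, and the stop-word list and input text are made up for illustration.

```python
# Each stage is (input column, output column, function), mirroring the
# setInputCols / setOutputCol wiring. Plain-Python stand-ins, not Spark NLP.
STOP_WORDS = {"is", "the", "of"}  # illustrative list only

stages = [
    # Split the text on whitespace.
    ("text", "tokens", lambda s: s.split()),
    # Keep letters only and lowercase each token.
    ("tokens", "normalized", lambda toks: [
        "".join(ch for ch in t if ch.isalpha()).lower() for t in toks
    ]),
    # Drop empty tokens and stop words.
    ("normalized", "cleaned", lambda toks: [
        t for t in toks if t and t not in STOP_WORDS
    ]),
]

def run_pipeline(row, stages):
    """Apply stages in order; each reads one column and adds a new one."""
    row = dict(row)  # do not mutate the caller's row
    for input_col, output_col, fn in stages:
        if input_col not in row:
            raise KeyError(f"stage expects missing column {input_col!r}")
        row[output_col] = fn(row[input_col])
    return row

result = run_pipeline({"text": "Week 3 of the #ClimateStrike"}, stages)
print(result["cleaned"])
```

In this sketch a stage whose input column does not match any earlier output column fails immediately, which is why the column names in the definitions above have to line up from one stage to the next.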


In [7]:
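# Chain the stages in order; each stage consumes the previous stage's output column.
pipeline = Pipeline().setStages(
    [documentAssembler, tokenizer, normalizer, stop_word_remover, stemmer, lemmatizer, finisher]
)
# Fit on the DataFrame and apply the fitted pipeline to the same data.
transformed_df = pipeline.fit(df).transform(df)
transformed_df.select(["text", "refined_text"]).show()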



+--------------------+--------------------+
|                text|        refined_text|
+--------------------+--------------------+
|2020 is the year ...|[year, votethemou...|
|Winter has not st...|[winter, stop, gr...|
|WEEK 55 of #Clima...|[week, climatestr...|
|A year of resista...|[year, resist, yo...|
| HAPPY HOLIDAYS #...|[happi, holidai, ...|
|10 Questions to A...|[question, ask, p...|
|#climatestrike #F...|[climatestrik, fr...|
|#ClimateChangeIsR...|[climatechangeisr...|
|My oldest daughte...|[old, daughter, f...|
|Our toddler #POTU...|[toddler, potu, w...|
|"""The change is ...|[chang, go, come,...|
|Moments after #Im...|[moment, impeachm...|
|#climatestrike #C...|[climatestrik, cl...|
|Keep up the great...|[keep, great, wor...|
|Congratulations @...|[congratul, greta...|
|Even though I hop...|[even, though, ho...|
|*gretathunberg Is...|[gretathunberg, y...|
| Congratulations ...|[congratul, inspi...|
|I get my energy a...|[get, energi, hop...|
| THE CHAMBER OF C...|[chamber, 

### 2) Spark NLP Stemming and Lemmatizing - Text
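
The `refined_text` output above contains forms such as `happi` and `holidai`, which are typical of rule-based stemming. The sketch below contrasts a toy suffix-stripping stemmer with a toy dictionary lemmatizer in plain Python. Both functions and the small lemma dictionary are hypothetical stand-ins, not the algorithms used by Spark NLP's `Stemmer` or `LemmatizerModel`.

```python
# Hypothetical lemma dictionary: surface form -> dictionary form.
LEMMAS = {"holidays": "holiday", "daughters": "daughter", "went": "go"}

def toy_stem(word):
    """Very rough suffix stripping in the spirit of Porter-style stemmers
    (NOT the full algorithm): drop a plural 's', then turn a final 'y' into 'i'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

def toy_lemmatize(word):
    """Dictionary lookup; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

words = ["happy", "holidays", "went"]

stemmed = [toy_stem(w) for w in words]
lemmatized = [toy_lemmatize(w) for w in words]
stem_then_lemma = [toy_lemmatize(toy_stem(w)) for w in words]

print(stemmed)
print(lemmatized)
print(stem_then_lemma)
```

In this toy setting, a dictionary lookup applied after stemming only changes words the stemmer left intact, because stems such as `holidai` are not dictionary keys. Whether the same holds for `LemmatizerModel.pretrained()` on this dataset would need to be checked against the actual output.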

### 3) Text preparation process for `hashtags` column

In [8]:
# Same stages as for `text`, rebuilt with `hashtags` column names.
# Note: the variable names from the `text` pipeline are reused and overwritten here.
documentAssembler = (
    DocumentAssembler().setInputCol("hashtags").setOutputCol("document_hashtags")
)
tokenizer = Tokenizer().setInputCols(["document_hashtags"]).setOutputCol("hashtags_tokens")
normalizer = Normalizer().setInputCols(["hashtags_tokens"]).setOutputCol("hashtags_normalized")
stop_word_remover = (
    StopWordsCleaner().setInputCols(["hashtags_normalized"]).setOutputCol("hashtags_stop_words_cleaned")
)
stemmer = Stemmer().setInputCols(["hashtags_stop_words_cleaned"]).setOutputCol("hashtags_stemmed")
lemmatizer = (
    LemmatizerModel.pretrained()
    .setInputCols(["hashtags_stemmed"])
    .setOutputCol("hashtags_lemmatized")
)
# Return the result as a plain array-of-strings column named `refined_hashtags`.
finisher = Finisher().setInputCols(["hashtags_lemmatized"]).setOutputCols(["refined_hashtags"]).setOutputAsArray(True)

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


In [9]:
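# Same stage order as for `text`, now wired to the `hashtags` columns.
pipeline = Pipeline().setStages(
    [documentAssembler, tokenizer, normalizer, stop_word_remover, stemmer, lemmatizer, finisher]
)
# Note: this reassigns `transformed_df`, so it no longer holds the `refined_text` column from the previous run.
transformed_df = pipeline.fit(df).transform(df)
transformed_df.select(["hashtags", "refined_hashtags"]).show()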

+--------------------+--------------------+
|            hashtags|    refined_hashtags|
+--------------------+--------------------+
|#votethemout #cli...|[votethemout, cli...|
|#climatefriday #c...|[climatefridai, c...|
|      #ClimateStrike|      [climatestrik]|
|#greta #gretathun...|[greta, gretathun...|
|#greta #gretathun...|[greta, gretathun...|
|#climatechange #n...|[climatechang, na...|
|#climatestrike #F...|[climatestrik, fr...|
|#ClimateChangeIsR...|[climatechangeisr...|
|#climatestrike #l...|[climatestrik, le...|
|#POTUS #Time #Gre...|[potu, time, gret...|
|#ClimateChange #c...|[climatechang, cl...|
|#ImpeachmentVote ...|[impeachmentvot, ...|
|#climatestrike #C...|[climatestrik, cl...|
|#ClimateChangeIsR...|[climatechangeisr...|
|#climatestrike #F...|[climatestrik, fr...|
|#HongKongProteste...|[hongkongprotest,...|
|#PersonoftheYear ...|[personoftheyear,...|
|#vegan #climatest...|[vegan, climatest...|
|#ClimateStrike #F...|[climatestrik, fi...|
|#greta #gretathun...|[greta, gr

### 4) Spark NLP Stemming and Lemmatizing - Hashtags
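
The `refined_hashtags` output above shows `#ClimateStrike` reduced to `climatestrik`. Part of that reduction can be illustrated without Spark: the plain-Python sketch below strips the `#`, lowercases each hashtag, and counts how many distinct hashtags remain. The rows are made up, and the ASCII-only cleanup rule is an assumption for illustration, not Spark NLP's `Normalizer` behaviour.

```python
import re
from collections import Counter

def clean_hashtags(hashtag_string):
    """Split a space-separated hashtag string, drop non-letters, lowercase."""
    tokens = []
    for raw in hashtag_string.split():
        # ASCII letters only: a simplifying assumption for this sketch.
        letters = re.sub(r"[^A-Za-z]", "", raw).lower()
        if letters:
            tokens.append(letters)
    return tokens

# Made-up rows in the same shape as the `hashtags` column.
rows = [
    "#ClimateStrike #FridaysForFuture",
    "#climatestrike #Greta",
    "#CLIMATESTRIKE",
]

raw_counts = Counter(tag for row in rows for tag in row.split())
clean_counts = Counter(tag for row in rows for tag in clean_hashtags(row))

print(len(raw_counts), "distinct raw hashtags")
print(len(clean_counts), "distinct cleaned hashtags")
print(clean_counts["climatestrike"])
```

Case variants of the same hashtag collapse into one token, so counts that were split across several spellings are merged into a single bag-of-words entry.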

## References

1) Stemming and lemmatizing:
* https://medium.com/trustyou-engineering/topic-modelling-with-pyspark-and-spark-nlp-a99d063f1a6e