# Text preparation with PySpark

## Importing libraries

For this text preparation process we are going to use the **PySpark** library, together with Spark NLP (`sparknlp`) for stemming and lemmatizing.

In [1]:
import sparknlp
from pyspark.ml.feature import StopWordsRemover, Tokenizer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, regexp_replace, udf
from pyspark.sql.types import IntegerType
from sparknlp.annotator import LemmatizerModel, Stemmer
from sparknlp.base import DocumentAssembler, Pipeline


In [2]:
#spark=SparkSession.builder.appName('nlp').getOrCreate()
spark = sparknlp.start()

com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dadb7739-8301-45e2-8dd9-fcab3894b0dd;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;5.1.1 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found com.google.code.gson#gson;2.3 in central
	found it.unimi.dsi#fastutil;7.0.12 in central
	found org.projectlombok#lombok;1.16.8 in central
	found com.google.cloud#google-cloud-storage;2.20.1 in central
	found com.google.guava#guava;31.1-jre in central
	found com.google.guava#failur

## Reading Twitter Data

In [3]:
path_in = "twitterClimateData.csv"
# The CSV is semicolon-separated and has a header row
df = spark.read.csv(path_in, inferSchema=True, header=True, sep=';')
df = df.select(["text", "hashtags"])
df.show()

+--------------------+--------------------+
|                text|            hashtags|
+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|
|Winter has not st...|#climatefriday #c...|
|WEEK 55 of #Clima...|      #ClimateStrike|
|A year of resista...|#greta #gretathun...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|
|10 Questions to A...|#climatechange #n...|
|#climatestrike #F...|#climatestrike #F...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|
|My oldest daughte...|#climatestrike #l...|
|Our toddler #POTU...|#POTUS #Time #Gre...|
|"""The change is ...|#ClimateChange #c...|
|Moments after #Im...|#ImpeachmentVote ...|
|#climatestrike #C...|#climatestrike #C...|
|Keep up the great...|#ClimateChangeIsR...|
|Congratulations @...|#climatestrike #F...|
|Even though I hop...|#HongKongProteste...|
|*gretathunberg Is...|#PersonoftheYear ...|
| Congratulations ...|#vegan #climatest...|
|I get my energy a...|#ClimateStrike #F...|
| THE CHAMBER OF C...|#greta #gr

In [4]:
df.printSchema()

root
 |-- text: string (nullable = true)
 |-- hashtags: string (nullable = true)



In [5]:
df.count()

72405

## Text preparation process

The goal of this process is to reduce the number of tokens without losing the interpretability of the words, in order to build the best possible bag of words. We split the process by DataFrame column: first the `text` column, then the `hashtags` column.
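
To make the goal concrete, here is a minimal plain-Python sketch (no Spark involved) of how tokenization and stop word removal shrink a bag of words. The sample sentence, the tiny stop word set, and the `bag_of_words` helper are illustrative assumptions, not part of the notebook or of Spark's default stop word list.

```python
from collections import Counter

# Tiny illustrative stop word set (an assumption, much smaller than Spark's default list)
STOP_WORDS = {"is", "the", "of", "a", "to"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, drop stop words, and count what is left."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept)


sample = "The year of the climate strike is the year of change"
all_tokens = set(sample.lower().split())
bag = bag_of_words(sample)

print(len(all_tokens), "distinct tokens before,", len(bag), "after")
print(bag.most_common())
```

The words that remain are the ones that carry meaning for the analysis, which is the trade-off this section aims for.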

### Text preparation process for the `text` column

### 1) Tokenization and stop word removal

In [6]:
tokenization=Tokenizer(inputCol='text',outputCol='text_tokens')

In [7]:
df_tokens=tokenization.transform(df)

In [8]:
df_tokens.show()

+--------------------+--------------------+--------------------+
|                text|            hashtags|         text_tokens|
+--------------------+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|[2020, is, the, y...|
|Winter has not st...|#climatefriday #c...|[winter, has, not...|
|WEEK 55 of #Clima...|      #ClimateStrike|[week, 55, of, #c...|
|A year of resista...|#greta #gretathun...|[a, year, of, res...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|[, happy, holiday...|
|10 Questions to A...|#climatechange #n...|[10, questions, t...|
|#climatestrike #F...|#climatestrike #F...|[#climatestrike, ...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|[#climatechangeis...|
|My oldest daughte...|#climatestrike #l...|[my, oldest, daug...|
|Our toddler #POTU...|#POTUS #Time #Gre...|[our, toddler, #p...|
|"""The change is ...|#ClimateChange #c...|["""the, change, ...|
|Moments after #Im...|#ImpeachmentVote ...|[moments, after, ...|
|#climatestrike #C...|#cl
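
The fifth row above starts with an empty token (`[, happy, holiday...`). That is consistent with `Tokenizer` lowercasing the text and splitting on single whitespace characters, so a leading space produces an empty string. The plain-Python sketch below roughly mimics that behaviour for one string. The function names are hypothetical and this is not Spark's implementation.

```python
import re


def spark_like_tokenize(text):
    """Roughly mimic lowercase + split on each single whitespace character."""
    tokens = re.split(r"\s", text.lower())
    # Java's String.split drops trailing empty strings; mirror that here
    while tokens and tokens[-1] == "":
        tokens.pop()
    return tokens


def tokenize_without_blanks(text):
    """Same split, but discard the empty tokens left by extra whitespace."""
    return [token for token in spark_like_tokenize(text) if token]


row = " HAPPY HOLIDAYS #greta "
print(spark_like_tokenize(row))
print(tokenize_without_blanks(row))
```

Empty tokens survive stop word removal and inflate the token counts. In Spark itself, `RegexTokenizer` with its default `\s+` pattern and minimum token length of 1 is one option that should avoid them; treat that as a suggestion to verify against this dataset.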

In [9]:
stopword_removal=StopWordsRemover(inputCol='text_tokens',outputCol='refined_text_tokens')

In [10]:
refined_text_df=stopword_removal.transform(df_tokens)
refined_text_df.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            hashtags|         text_tokens| refined_text_tokens|
+--------------------+--------------------+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|[2020, is, the, y...|[2020, year, #vot...|
|Winter has not st...|#climatefriday #c...|[winter, has, not...|[winter, stopped,...|
|WEEK 55 of #Clima...|      #ClimateStrike|[week, 55, of, #c...|[week, 55, #clima...|
|A year of resista...|#greta #gretathun...|[a, year, of, res...|[year, resistance...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|[, happy, holiday...|[, happy, holiday...|
|10 Questions to A...|#climatechange #n...|[10, questions, t...|[10, questions, a...|
|#climatestrike #F...|#climatestrike #F...|[#climatestrike, ...|[#climatestrike, ...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|[#climatechangeis...|[#climatechangeis...|
|My oldest daughte...|#climatestrike #l...|[my, oldest

In [11]:
from pyspark.sql.functions import size

# Count the tokens per row with the built-in `size` function instead of a Python UDF
refined_text_df = refined_text_df.withColumn("token_text_count", size(col('refined_text_tokens')))

refined_text_df.show(10)


+--------------------+--------------------+--------------------+--------------------+----------------+
|                text|            hashtags|         text_tokens| refined_text_tokens|token_text_count|
+--------------------+--------------------+--------------------+--------------------+----------------+
|2020 is the year ...|#votethemout #cli...|[2020, is, the, y...|[2020, year, #vot...|              21|
|Winter has not st...|#climatefriday #c...|[winter, has, not...|[winter, stopped,...|              11|
|WEEK 55 of #Clima...|      #ClimateStrike|[week, 55, of, #c...|[week, 55, #clima...|              32|
|A year of resista...|#greta #gretathun...|[a, year, of, res...|[year, resistance...|              25|
| HAPPY HOLIDAYS #...|#greta #gretathun...|[, happy, holiday...|[, happy, holiday...|              23|
|10 Questions to A...|#climatechange #n...|[10, questions, t...|[10, questions, a...|              21|
|#climatestrike #F...|#climatestrike #F...|[#climatestrike, ...|[#climate

### 2) Spark NLP Stemming and Lemmatizing - Text

In [13]:
#Stemming

stemmer = Stemmer().setInputCols(["refined_text_tokens"]).setOutputCol("stemming_text_tokens")

In [14]:
lemmatizer = LemmatizerModel.pretrained() \
     .setInputCols(['refined_text_tokens']) \
     .setOutputCol('lemmatize_text_tokens')

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[OK!]
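
Stemming and lemmatizing both collapse word variants, but in different ways: a stemmer strips suffixes by rule and may return a fragment that is not a word, while a lemmatizer looks up the dictionary form. The toy sketch below illustrates only that difference. `toy_stem` and `TOY_LEMMAS` are hypothetical stand-ins, not the rules used by Spark NLP's `Stemmer` or the contents of the `lemma_antbnc` model.

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a trained lemma dictionary
TOY_LEMMAS = {"strikes": "strike", "changes": "change", "children": "child"}


def toy_lemmatize(word):
    """Return the dictionary form when known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ("strikes", "changes", "children"):
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

Stems such as `strik` shrink the vocabulary more aggressively, while lemmas stay readable, which matters when the tokens have to remain interpretable.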


In [None]:

# Wrap the raw `text` column as Spark NLP `document` annotations
documentAssembler = DocumentAssembler().setInputCol('text').setOutputCol('document')

In [22]:
refined_text_df = stemmer.transform(refined_text_df)

IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in Stemmer_2ef55dc5f5bd.

Current inputCols: refined_text_tokens. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=hashtags,is_nlp_annotator=false)
(column_name=text_tokens,is_nlp_annotator=false)
(column_name=refined_text_tokens,is_nlp_annotator=false)
(column_name=token_text_count,is_nlp_annotator=false).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: token
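
The error above says that `Stemmer` needs an input column of annotator type `token`, while `refined_text_tokens` is a plain array column produced by `pyspark.ml.feature.Tokenizer`. One possible way around it, sketched below under assumptions, is to build the tokens with Spark NLP itself: `DocumentAssembler`, then Spark NLP's own `Tokenizer`, then `Stemmer` and `LemmatizerModel`. The helper name `build_text_pipeline` and the output column names are hypothetical, and this sketch has not been run against this dataset.

```python
def build_text_pipeline(input_col="text"):
    """Sketch: raw text -> document -> token -> stem / lemma annotations."""
    # Imported inside the function so the helper can be defined before
    # Spark NLP is loaded; in a notebook these would normally sit in the imports cell.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import LemmatizerModel, Stemmer, Tokenizer
    from sparknlp.base import DocumentAssembler

    # Each stage reads the annotation column written by the previous one
    document = DocumentAssembler().setInputCol(input_col).setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    stem = Stemmer().setInputCols(["token"]).setOutputCol("stem")
    lemma = (
        LemmatizerModel.pretrained()  # downloads the default pretrained lemmatizer
        .setInputCols(["token"])
        .setOutputCol("lemma")
    )
    return Pipeline(stages=[document, token, stem, lemma])
```

With a running session, something like `build_text_pipeline().fit(df).transform(df).select("stem.result", "lemma.result")` would expose the stemmed and lemmatized strings. Note that this sketch skips stop word removal; Spark NLP's `StopWordsCleaner` annotator could be placed after the tokenizer if that step is still wanted.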

In [None]:
refined_text_df = lemmatizer.transform(refined_text_df)

In [None]:
refined_text_df.show(10)

### 3) Text preparation process for the `hashtags` column

In [15]:
df = df.withColumn("hashtags_without#",regexp_replace("hashtags","#",""))
df.show()

+--------------------+--------------------+--------------------+
|                text|            hashtags|   hashtags_without#|
+--------------------+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|votethemout clima...|
|Winter has not st...|#climatefriday #c...|climatefriday cli...|
|WEEK 55 of #Clima...|      #ClimateStrike|       ClimateStrike|
|A year of resista...|#greta #gretathun...|greta gretathunbe...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|greta gretathunbe...|
|10 Questions to A...|#climatechange #n...|climatechange nat...|
|#climatestrike #F...|#climatestrike #F...|climatestrike Fri...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|ClimateChangeIsRe...|
|My oldest daughte...|#climatestrike #l...|climatestrike let...|
|Our toddler #POTU...|#POTUS #Time #Gre...|POTUS Time GretaT...|
|"""The change is ...|#ClimateChange #c...|ClimateChange cli...|
|Moments after #Im...|#ImpeachmentVote ...|ImpeachmentVote C...|
|#climatestrike #C...|#cl

In [16]:
tokenization=Tokenizer(inputCol='hashtags_without#',outputCol='hashtags_tokens')


In [17]:
df_hashtags=tokenization.transform(df)

In [18]:
df_hashtags.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            hashtags|   hashtags_without#|     hashtags_tokens|
+--------------------+--------------------+--------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|votethemout clima...|[votethemout, cli...|
|Winter has not st...|#climatefriday #c...|climatefriday cli...|[climatefriday, c...|
|WEEK 55 of #Clima...|      #ClimateStrike|       ClimateStrike|     [climatestrike]|
|A year of resista...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun...|
|10 Questions to A...|#climatechange #n...|climatechange nat...|[climatechange, n...|
|#climatestrike #F...|#climatestrike #F...|climatestrike Fri...|[climatestrike, f...|
|#ClimateChangeIsR...|#ClimateChangeIsR...|ClimateChangeIsRe...|[climatechangeisr...|
|My oldest daughte...|#climatestrike #l...|climatestri

In [19]:
stopword_removal=StopWordsRemover(inputCol='hashtags_tokens',outputCol='refined_hashtags_tokens')

In [20]:
refined_hashtags_df=stopword_removal.transform(df_hashtags)
refined_hashtags_df.show()

+--------------------+--------------------+--------------------+--------------------+-----------------------+
|                text|            hashtags|   hashtags_without#|     hashtags_tokens|refined_hashtags_tokens|
+--------------------+--------------------+--------------------+--------------------+-----------------------+
|2020 is the year ...|#votethemout #cli...|votethemout clima...|[votethemout, cli...|   [votethemout, cli...|
|Winter has not st...|#climatefriday #c...|climatefriday cli...|[climatefriday, c...|   [climatefriday, c...|
|WEEK 55 of #Clima...|      #ClimateStrike|       ClimateStrike|     [climatestrike]|        [climatestrike]|
|A year of resista...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun...|   [greta, gretathun...|
| HAPPY HOLIDAYS #...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun...|   [greta, gretathun...|
|10 Questions to A...|#climatechange #n...|climatechange nat...|[climatechange, n...|   [climatechange, n...|
|#climates

In [21]:
from pyspark.sql.functions import size

# Count the hashtag tokens per row with the built-in `size` function instead of a Python UDF
refined_hashtags_df = refined_hashtags_df.withColumn("token_hashtags_count", size(col('refined_hashtags_tokens')))

refined_hashtags_df.show(10)

+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+
|                text|            hashtags|   hashtags_without#|     hashtags_tokens|refined_hashtags_tokens|token_hashtags_count|
+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+
|2020 is the year ...|#votethemout #cli...|votethemout clima...|[votethemout, cli...|   [votethemout, cli...|                   3|
|Winter has not st...|#climatefriday #c...|climatefriday cli...|[climatefriday, c...|   [climatefriday, c...|                   3|
|WEEK 55 of #Clima...|      #ClimateStrike|       ClimateStrike|     [climatestrike]|        [climatestrike]|                   1|
|A year of resista...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun...|   [greta, gretathun...|                  11|
| HAPPY HOLIDAYS #...|#greta #gretathun...|greta gretathunbe...|[greta, gretathun..

### 4) Spark NLP Stemming and Lemmatizing - Hashtags

In [None]:
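# Stemming

# Spark NLP annotators expect annotation columns; `refined_hashtags_tokens` is a plain
# array column, so a DocumentAssembler and a Spark NLP tokenizer are still needed upstream
stemmer = Stemmer().setInputCols(["refined_hashtags_tokens"]).setOutputCol("stemming_hashtags_tokens")

In [None]:
lemmatizer = LemmatizerModel.pretrained() \
     .setInputCols(['refined_hashtags_tokens']) \
     .setOutputCol('lemmatize_hashtags_tokens')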

## References

1) Stemming and lemmatizing
* https://www.johnsnowlabs.com/boost-your-nlp-results-with-spark-nlp-stemming-and-lemmatizing-techniques/