# Preprocessing text with Spark NLP - Overview





## 1. Prepare Environment

In [None]:
import os

#Install Java 8 (headless JDK)
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq

#Point JAVA_HOME and PATH at the installed JDK
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

Selecting previously unselected package openjdk-8-jre-headless:amd64.
(Reading database ... 156210 files and directories currently installed.)
Preparing to unpack .../openjdk-8-jre-headless_8u312-b07-0ubuntu1~18.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u312-b07-0ubuntu1~18.04) ...
Selecting previously unselected package openjdk-8-jdk-headless:amd64.
Preparing to unpack .../openjdk-8-jdk-headless_8u312-b07-0ubuntu1~18.04_amd64.deb ...
Unpacking openjdk-8-jdk-headless:amd64 (8u312-b07-0ubuntu1~18.04) ...
Setting up openjdk-8-jre-headless:amd64 (8u312-b07-0ubuntu1~18.04) ...
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/orbd to provide /usr/bin/orbd (orbd) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/servertool to provide /usr/bin/servertool (servertool) in auto mode
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/tnameserv to provide /usr/bin/tnameserv (tnameserv) in auto mode
Setting up ope

In [None]:
#Install Pyspark
! pip install --ignore-installed pyspark==2.4.4

#Install Spark NLP
! pip install --ignore-installed spark-nlp==2.6.2

Collecting pyspark==2.4.4
  Downloading pyspark-2.4.4.tar.gz (215.7 MB)
     |████████████████████████████████| 215.7 MB 52 kB/s 
Collecting py4j==0.10.7
  Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
     |████████████████████████████████| 197 kB 18.6 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130392 sha256=c1e946960d551d47780eb2f140e7d2b997319577201932559fefdbbf8701c0c3
  Stored in directory: /root/.cache/pip/wheels/11/48/19/c3b6b66e4575c164407a83bc065179904ddc33c9d6500846f0
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.4
Collecting spark-nlp==2.6.2
  Downloading spark_nlp-2.6.2-py2.py3-none-any.whl (128 kB)
     |████████████████████████████████| 128 kB 5.1 MB/s 
Installing collected packages: spark-nlp
Successfully installed

## 2. Start Spark Session

In [None]:
import sparknlp
spark = sparknlp.start()

from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

## 3. Load and Read Dataset

In this part, I read the **.csv files** that I generated from the **classics** .txt files.
These .csv files were **generated in Python** by exporting the final dataframes we had already built.
I am sharing these .csv files **on GitHub** (https://github.com/mpmccord/FanFicVsClassicLiterature/tree/main/data).

There are four of them:

1.   **classics_clean.csv:** The eight classics with the whole corpus, preprocessed (lowercased, extra spaces removed, etc.)
2.   **classics_raw.csv:** The eight classics with the whole corpus, raw.
3.   **classics_clean_test.csv:** Subset of the cleaned classics. Only two of them, with only 200 words of the corpus.
4.   **classics_raw_test.csv:** Subset of the raw classics. Only two of them, with only 200 words of the corpus.
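
The export script itself is not part of this notebook, so the following is only a sketch of one way a test subset like this could be produced. It assumes the 200-word limit applies to each text separately, uses the `id`, `type`, and `text` columns that appear later in the dataframe, and relies on the standard `csv` module rather than the dataframes mentioned above. The helper names `truncate_words` and `write_subset` and the placeholder rows are hypothetical.

```python
import csv
import io

def truncate_words(text, max_words=200):
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])

def write_subset(rows, out_file, n_texts=2, max_words=200):
    """Write the first n_texts rows, each cut to max_words words, as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=["id", "type", "text"])
    writer.writeheader()
    for row in rows[:n_texts]:
        writer.writerow({**row, "text": truncate_words(row["text"], max_words)})

# Placeholder rows (hypothetical content); the real corpus holds eight classics.
rows = [
    {"id": "1905", "type": "C", "text": "word " * 500},
    {"id": "768", "type": "C", "text": "word " * 500},
    {"id": "0", "type": "C", "text": "word " * 500},
]

# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_subset(rows, buffer)

# Read the CSV back to check the shape of the subset.
subset = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(subset), [len(r["text"].split()) for r in subset])
```

Keeping the subset this small makes every pipeline run below finish quickly while still exercising the full set of stages.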



### 3.1 Read classics_raw_test.csv

For testing purposes and to keep computation times short, I am going to use this subset, which contains only two incomplete raw texts.

In [None]:
#Generate Spark dataframe from csv file
df_Spark = spark.read \
           .option("header", True) \
           .csv("/content/drive/MyDrive/Distributed-Computing/data/classics_raw_test.csv") #Change the path to match your setup

df_Spark.show(2, truncate=200)

+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  id|type|                                                                                                                                                                                                    text|
+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1905|   C|THE GOVERNESS OR, THE LITTLE FEMALE ACADEMY (1749) by Sarah Fielding There lived in the northern parts of England, a gentlewoman who undertook the education of young ladies and this trust she endea...|
| 768|   C|Transcribed from the 1910 John Murray edition by David Price, email ccx074@pglaf.org WUTHERING HEIGHTS CHAPTER I 1801.--I have just retur

## 4. Create Pipeline

Create a pipeline to preprocess the texts in the Spark dataframe.

Each of these classes receives one or more input columns and creates an output column.
At the end of the pipeline, the dataframe contains all the columns created along the way, together with their results.

The **last column** generated, in this case **token_features**, holds the final tokens after all preprocessing steps (normalization, stop-word removal, lemmatization).
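
To make the column chaining concrete, here is a small plain-Python sketch. It is not Spark NLP code: the stage functions are simplified stand-ins, and the `run` helper and the `lowercased` column name are hypothetical. It only illustrates how each stage reads one column and adds a new one, so the final row keeps every intermediate result.

```python
# Each stage is (input column, output column, function).
# The functions are simplified stand-ins, not the real annotators.
stages = [
    ("text", "document", lambda s: " ".join(s.split())),   # collapse whitespace
    ("document", "token", lambda s: s.split()),            # split on spaces
    ("token", "lowercased", lambda toks: [t.lower() for t in toks]),
]

def run(row, stages):
    """Apply the stages in order; every stage adds one new column to the row."""
    out = dict(row)
    for in_col, out_col, fn in stages:
        out[out_col] = fn(out[in_col])
    return out

result = run({"text": "There lived in the\tnorthern parts\n of England"}, stages)
print(list(result))          # column names, in the order they were created
print(result["lowercased"])  # the last column holds the final tokens
```

In the real pipeline the same idea applies: a stage can only be placed after the stage that produces its input column.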

In [None]:
#https://medium.com/spark-nlp/spark-nlp-101-document-assembler-500018f5f6b5
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document") \
    .setCleanupMode("shrink_full") #remove new lines and tabs, plus shrinking spaces and blank lines.

#We don't strictly need this, because after preprocessing the text ends up being one sentence (if removed, the Tokenizer must read 'document' instead)
sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

#https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp.annotator.Tokenizer.html
token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

#https://nlp.johnsnowlabs.com/docs/en/annotators
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["""[^A-Za-z]"""]) # remove every non-letter character (punctuation, digits)

#Stop words used by Spark NLP: http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
#https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb#scrollTo=1-eGocORg2ml
stop_words = StopWordsCleaner.pretrained('stopwords_en', 'en')\
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

#https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/LemmatizerModel.html
lemmatizer = LemmatizerModel.pretrained() \
         .setInputCols(["cleanTokens"]) \
         .setOutputCol("lemma")

finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["token_features"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


In [None]:
nlp_pipeline_lr = Pipeline(
        stages=[document, 
                sentence,
                token,
                normalizer,
                stop_words, 
                lemmatizer, 
                finisher])

## 5. Apply Pipeline to Spark Dataframe

In [None]:
processed_text = nlp_pipeline_lr.fit(df_Spark).transform(df_Spark)

In [None]:
processed_text.show(truncate=200) #Show results

+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
#Show last column, the one that has the final result
processed_text.select("token_features").show(truncate=200) 

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                          token_features|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[governess, female, academy, sarah, field, live, northern, part, england, gentlewoman, undertake, education, young, lady, trust, endeavour, faithfully, discharge, instruct, commit, care, read, writ...|
|[transcribe, john, murray, edition, david, price, email, ccxpglaforg, wuthering, height, chapter, return, visit, landlordthe, solitary, neighbour, trouble, beautiful, country, england, fi
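
Tokens such as `ccxpglaforg` and `landlordthe` in the output above are a side effect of the cleanup pattern `[^A-Za-z]`: characters are deleted inside a token, so the pieces on either side are glued together. The sketch below reproduces that behaviour with Python's `re` module. It approximates the Normalizer step rather than running it, and it assumes the raw tokens looked like `landlord--the`; the `normalize` helper is hypothetical.

```python
import re

# Same pattern that the pipeline passes to setCleanupPatterns.
CLEANUP = re.compile(r"[^A-Za-z]")

def normalize(token):
    """Delete every non-letter character, then lowercase what remains."""
    return CLEANUP.sub("", token).lower()

# Example tokens; "landlord--the" is an assumed form of the raw token.
examples = ["ccx074@pglaf.org", "landlord--the", "1801.--I", "(1749)"]
normalized = [normalize(t) for t in examples]
print(normalized)
```

If merged tokens like these are a problem, one option is to replace dashes and similar separators with spaces in the `text` column before it enters the pipeline, so that the tokenizer splits the words first.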