<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/2.ML_with_PySpark_MLlib/NLP/1.Tokenizer_UDF_StopWords_NGram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install Java 8 (headless JDK)
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# download Spark 3.1.1 (pre-built for Hadoop 2.7)
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# install findspark
!pip install -q findspark


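import os
# point the environment at the Java and Spark installations set up above
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()  # makes the PySpark libraries under SPARK_HOME importable
from pyspark.sql import SparkSession
# start a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()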

In [None]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

In [None]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [None]:
sent_df = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

In [None]:
sent_df.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish Java could...|
|  2|Logistic,regressi...|
+---+--------------------+



# Tokenizer

In [None]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

# UDF - Lambda Expression
We define a UDF (user-defined function) that counts the tokens in each row's `words` array.

In [None]:
count_tokens = udf(lambda words: len(words), IntegerType())
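
Spark hands a SQL `NULL` to a Python UDF as `None`, and `len(None)` raises a `TypeError`. The sketch below shows a null-safe version of the counting logic in plain Python. The name `count_tokens_safe` is hypothetical, and treating a missing array as zero tokens is an assumption that may not suit every pipeline.

```python
def count_tokens_safe(words):
    """Return the number of tokens, treating a missing (None) array as 0."""
    # Assumption: a null 'words' value should count as zero tokens.
    return 0 if words is None else len(words)

print(count_tokens_safe(["hi", "i", "heard"]))  # 3
print(count_tokens_safe(None))                  # 0
```

If this behavior is wanted, the function could be wrapped in the same way as the lambda, for example `udf(count_tokens_safe, IntegerType())`.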

In [None]:
tokenized = tokenizer.transform(sent_df)

In [None]:
tokenized.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|
|  1|I wish Java could...|[i, wish, java, c...|
|  2|Logistic,regressi...|[logistic,regress...|
+---+--------------------+--------------------+



After tokenization, we will count the words.

In [None]:
tokenized.withColumn('tokens', count_tokens(col('words'))).show(truncate=False)

+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+---+-----------------------------------+------------------------------------------+------+



In row 2 there is only one token: `Tokenizer` splits on whitespace only, and the words in that sentence are separated by commas.
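
The effect can be reproduced without Spark. The following is a plain-Python approximation of whitespace tokenization, not Spark's implementation; `whitespace_tokenize` is a hypothetical helper name, and `str.split()` collapses runs of whitespace, which may differ from Spark's behavior on repeated spaces.

```python
def whitespace_tokenize(sentence):
    """Lowercase the text and split it on whitespace (plain-Python stand-in)."""
    return sentence.lower().split()

sentences = [
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic,regression,models,are,neat",
]

# The comma-separated sentence contains no whitespace, so it stays one token.
token_counts = [len(whitespace_tokenize(s)) for s in sentences]
print(token_counts)  # [5, 7, 1]
```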

# RegexTokenizer

In [None]:
regex_tokenizer = RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')

In [None]:
regex_tokenized = regex_tokenizer.transform(sent_df)

In [None]:
regex_tokenized.withColumn('tokens', count_tokens(col('words'))).show(truncate=False)

+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+



This time row 2 has the correct token count of 5: with the pattern `\W`, `RegexTokenizer` splits on every non-word character, so the commas act as delimiters.
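
As a rough illustration of the same idea with Python's `re` module, the sketch below splits on non-word characters and drops empty strings. It is a stand-in for the concept, not Spark's implementation, and `regex_tokenize` is a hypothetical helper name.

```python
import re

def regex_tokenize(sentence, pattern=r"\W"):
    """Lowercase the text, split wherever the pattern matches, drop empty tokens."""
    return [tok for tok in re.split(pattern, sentence.lower()) if tok]

regex_tokens = regex_tokenize("Logistic,regression,models,are,neat")
print(regex_tokens)       # ['logistic', 'regression', 'models', 'are', 'neat']
print(len(regex_tokens))  # 5
```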

# StopWordsRemover

<p><code>StopWordsRemover</code> takes as input a sequence of strings (e.g. the output
of a <a href="ml-features.html#tokenizer">Tokenizer</a>) and drops all the stop
words from the input sequences. The list of stopwords is specified by
the <code>stopWords</code> parameter. Default stop words for some languages are accessible 
by calling <code>StopWordsRemover.loadDefaultStopWords(language)</code>, for which available 
options are &#8220;danish&#8221;, &#8220;dutch&#8221;, &#8220;english&#8221;, &#8220;finnish&#8221;, &#8220;french&#8221;, &#8220;german&#8221;, &#8220;hungarian&#8221;, 
&#8220;italian&#8221;, &#8220;norwegian&#8221;, &#8220;portuguese&#8221;, &#8220;russian&#8221;, &#8220;spanish&#8221;, &#8220;swedish&#8221; and &#8220;turkish&#8221;. 
A boolean parameter <code>caseSensitive</code> indicates if the matches should be case sensitive 
(false by default).</p>
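
To make the role of case sensitivity concrete, here is a plain-Python sketch of stop word filtering. `EXAMPLE_STOP_WORDS` is a small hand-picked set used only for illustration (it is not Spark's default English list), and `remove_stop_words` is a hypothetical helper, not part of PySpark.

```python
# Hypothetical, hand-picked stop list for illustration only.
EXAMPLE_STOP_WORDS = {"i", "the", "had", "a"}

def remove_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS, case_sensitive=False):
    """Drop every token that appears in stop_words."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    # Case-insensitive matching: compare lowercased tokens with lowercased stop words.
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

tokens = ["I", "saw", "the", "red", "balloon"]
filtered_default = remove_stop_words(tokens)
filtered_cs = remove_stop_words(tokens, case_sensitive=True)
print(filtered_default)  # ['saw', 'red', 'balloon']
print(filtered_cs)       # ['I', 'saw', 'red', 'balloon']
```

With case-sensitive matching, the capitalized "I" no longer matches the lowercase entry "i" and therefore survives the filter.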

In [None]:
from pyspark.ml.feature import StopWordsRemover

In [None]:
sentenceData = spark.createDataFrame([
                                  (0, ["I", "saw", "the", "red", "balloon"]),
                                  (1, ["Mary", "had", "a", "little", "lamb"])
                                  ], ["id", "tokens"])

sentenceData.show(truncate=False)

+---+----------------------------+
|id |tokens                      |
+---+----------------------------+
|0  |[I, saw, the, red, balloon] |
|1  |[Mary, had, a, little, lamb]|
+---+----------------------------+



In [None]:
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')  # a custom stop word list can be passed via the stopWords parameter
remover.transform(sentenceData).show(truncate=False)

+---+----------------------------+--------------------+
|id |tokens                      |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+



# n-gram
An n-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.

<p><code>NGram</code> takes as input a sequence of strings (e.g. the output of a <a href="ml-features.html#tokenizer">Tokenizer</a>).  The parameter <code>n</code> is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words.  If the input sequence contains fewer than <code>n</code> strings, no output is produced.</p>
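
The sliding-window idea behind this can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical, and the code only illustrates the concept rather than reproducing Spark's implementation.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-delimited string."""
    # range() is empty when len(tokens) < n, so short inputs yield no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["Hi", "I", "heard", "about", "Spark"], 2)
too_short = ngrams(["Hi", "I"], 3)
print(bigrams)    # ['Hi I', 'I heard', 'heard about', 'about Spark']
print(too_short)  # []
```

A sequence of $k$ tokens with $k \ge n$ produces $k - n + 1$ $n$-grams, which is why five words give four bigrams.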

In [None]:
from pyspark.ml.feature import NGram

In [None]:
wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

In [None]:
ngram = NGram(n=2, inputCol='words', outputCol='grams')
ngram.transform(wordDataFrame).show(truncate=False)

+---+------------------------------------------+------------------------------------------------------------------+
|id |words                                     |grams                                                             |
+---+------------------------------------------+------------------------------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |
|1  |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |
+---+------------------------------------------+------------------------------------------------------------------+



In [None]:
ngram = NGram(n=3, inputCol='words', outputCol='grams')
ngram.transform(wordDataFrame).show(truncate=False)

+---+------------------------------------------+--------------------------------------------------------------------------------+
|id |words                                     |grams                                                                           |
+---+------------------------------------------+--------------------------------------------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I heard, I heard about, heard about Spark]                                  |
|1  |[I, wish, Java, could, use, case, classes]|[I wish Java, wish Java could, Java could use, could use case, use case classes]|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression models, regression models are, models are neat]            |
+---+------------------------------------------+--------------------------------------------------------------------------------+

