# NLP - 1: NLP tools

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('nlp').getOrCreate()



## Tokenization

In [3]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

In [5]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [6]:
sen_df = spark.createDataFrame([
    (0, 'Hi I heard about Spark'),
    (1, 'I wish java could use case classes'),
    (2, 'Logistic,regression,models,are,neat')
], ['id','sentence'])

In [7]:
sen_df.show()

                                                                                

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish java could...|
|  2|Logistic,regressi...|
+---+--------------------+



In [8]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

In [19]:
regex_tokenizer = RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')
# \W matches any non-word character, so this splits on commas, spaces and other punctuation

In [13]:
count_tokens = udf(lambda words: len(words), IntegerType())  # user-defined function

In [14]:
tokenized = tokenizer.transform(sen_df)

In [15]:
tokenized.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|
|  1|I wish java could...|[i, wish, java, c...|
|  2|Logistic,regressi...|[logistic,regress...|
+---+--------------------+--------------------+



It doesn't look like the third sentence has been tokenized. Let's check by counting the tokens in each row.

In [17]:
tokenized.withColumn('tokens', count_tokens(col('words'))).show()


+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic,regress...|     1|
+---+--------------------+--------------------+------+



                                                                                

Yup, the standard tokenizer didn't work: it only splits on whitespace, so the comma-separated sentence comes back as a single token.

In [20]:
rg_tokenized = regex_tokenizer.transform(sen_df)

In [21]:
rg_tokenized.withColumn('tokens', count_tokens(col('words'))).show()


+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic, regres...|     5|
+---+--------------------+--------------------+------+



                                                                                

Nice, the regex tokenizer splits the third sentence into 5 tokens.
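To see why the two tokenizers disagree on the third sentence, the splitting logic can be approximated in plain Python. This is only a sketch, not Spark's implementation: `whitespace_tokens` and `regex_tokens` are hypothetical helper names, and the lowercasing and the dropping of empty tokens are assumed to mirror the default settings of `Tokenizer` and `RegexTokenizer`.

```python
import re

def whitespace_tokens(sentence):
    # Approximates Tokenizer: lowercase, then split on whitespace.
    return sentence.lower().split()

def regex_tokens(sentence, pattern=r'\W'):
    # Approximates RegexTokenizer with the pattern used as a separator:
    # lowercase, split on every match, and drop empty tokens.
    return [tok for tok in re.split(pattern, sentence.lower()) if tok]

sentence = 'Logistic,regression,models,are,neat'
print(whitespace_tokens(sentence))  # one token: there is no whitespace to split on
print(regex_tokens(sentence))       # five tokens: every comma is a non-word character
```

For sentences that only contain spaces, both helpers return the same tokens, which matches the identical counts in the first two rows above.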

## Stop word removal

Remove very common words (such as 'I', 'the' and 'a') which carry little meaning on their own.

In [22]:
from pyspark.ml.feature import StopWordsRemover

In [23]:
sentenceDataFrame = spark.createDataFrame([
    (0, ['I', 'saw', 'the', 'green', 'horse']),
    (1, ['Mary', 'had', 'a', 'little', 'lamb'])
], ['id', 'tokens'])

In [24]:
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')

In [25]:
remover.transform(sentenceDataFrame).show()

+---+--------------------+--------------------+
| id|              tokens|            filtered|
+---+--------------------+--------------------+
|  0|[I, saw, the, gre...| [saw, green, horse]|
|  1|[Mary, had, a, li...|[Mary, little, lamb]|
+---+--------------------+--------------------+
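The filtering step itself is simple, as the following plain-Python sketch shows. It is an illustration rather than Spark's code: `STOP_WORDS` is a tiny hypothetical list (the default English list in Spark is much longer), `remove_stop_words` is a made-up helper name, and the case-insensitive comparison is assumed to reflect the remover's default behaviour, which is consistent with 'I' being dropped above.

```python
# Hypothetical, tiny stop-word list for illustration only;
# Spark ships a much longer default English list.
STOP_WORDS = {'i', 'the', 'a', 'had'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that 'I' matches 'i', but keep the
    # original casing of the tokens that survive.
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stop_words(['I', 'saw', 'the', 'green', 'horse']))
print(remove_stop_words(['Mary', 'had', 'a', 'little', 'lamb']))
```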



                                                                                

## N-gram generation

Transform an input sequence of strings into n-grams: space-joined runs of N consecutive words.

In [26]:
from pyspark.ml.feature import NGram

In [27]:
word_dataframe = spark.createDataFrame([
    (0, ['Hi', 'I', 'heard', 'about', 'Spark']),
    (1, ['I', 'wish', 'java', 'could', 'use', 'case', 'classes']),
    (2, ['Logistic', 'regression', 'models', 'are', 'neat'])
], ['id','words'])

In [28]:
ngram = NGram(n=2, inputCol='words', outputCol='grams')

In [30]:
ngram.transform(word_dataframe).show(truncate=False)

                                                                                

+---+------------------------------------------+------------------------------------------------------------------+
|id |words                                     |grams                                                             |
+---+------------------------------------------+------------------------------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |
|1  |[I, wish, java, could, use, case, classes]|[I wish, wish java, java could, could use, use case, case classes]|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |
+---+------------------------------------------+------------------------------------------------------------------+
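The same sliding-window idea can be written in a few lines of plain Python. This is a minimal sketch with a hypothetical helper name (`ngrams`), not the `NGram` transformer itself.

```python
def ngrams(tokens, n=2):
    # Slide a window of n tokens over the list and join each window with a space.
    # A list shorter than n yields no n-grams.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(['Hi', 'I', 'heard', 'about', 'Spark'], n=2))
print(ngrams(['Hi', 'I', 'heard', 'about', 'Spark'], n=3))
```

A sentence of k words produces k - n + 1 n-grams, which is why the 5-word rows above each give 4 bigrams.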



## Term frequency-inverse document frequency

A statistic that reflects how important a word is to a document within a corpus of documents

In [31]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [37]:
sen_df = spark.createDataFrame([
    (0, 'Hi I heard about Spark'),
    (0, 'I wish java could use case classes'),  # shares id 0 with the row above, but IDF still treats each row as a separate document
    (1, 'Logistic regression models are neat')
], ['id','sentence'])

In [38]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

In [39]:
words_data = tokenizer.transform(sen_df)

In [40]:
words_data.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|
|  0|I wish java could...|[i, wish, java, c...|
|  1|Logistic regressi...|[logistic, regres...|
+---+--------------------+--------------------+



In [41]:
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')

In [42]:
featurized_data = hashing_tf.transform(words_data)

In [43]:
idf = IDF(inputCol='rawFeatures', outputCol='features')

In [44]:
idf_model = idf.fit(featurized_data)

                                                                                

In [45]:
rescaled_data = idf_model.transform(featurized_data)

In [48]:
rescaled_data.select('id', 'features').show(truncate=False)



+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                                                                                      |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0  |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
|0  |(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1  |(262144,[4

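Each row is a sparse vector with three parts: the vector size (262144 hash buckets), the indices that the words were hashed to, and one weight per index. Each weight is already the tf-idf value, i.e. the product of the term frequency and the inverse document frequency. Every word here occurs once per sentence, so the weights equal the idf: 0.693 for words that appear in a single row and 0.288 for 'i' (index 19036), which appears in two rows.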
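The weights in the output above can be reproduced by hand. The sketch below is plain Python and makes a few assumptions: it uses the smoothed formula log((m + 1) / (df + 1)), which is assumed to be what Spark's `IDF` computes, it treats each row as one document, and it keys the weights by word instead of by hashed index as `HashingTF` does. The helper names `idf` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    'hi i heard about spark'.split(),
    'i wish java could use case classes'.split(),
    'logistic regression models are neat'.split(),
]

m = len(docs)  # each row counts as one document
# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: log((m + 1) / (df + 1)); assumed to match Spark's IDF.
    return math.log((m + 1) / (doc_freq[term] + 1))

def tf_idf(doc):
    # Raw term counts multiplied by the IDF of each term.
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tf_idf(docs[0])
print(round(weights['i'], 4), round(weights['hi'], 4))  # 0.2877 0.6931
```

With m = 3, a word found in one row gets log(4 / 2) and a word found in two rows gets log(4 / 3), which are the two distinct values in the `features` column.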

## Count vectorizer

A bag-of-words method that converts each document into a vector of word counts

In [49]:
from pyspark.ml.feature import CountVectorizer

In [50]:
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (0, "a b b c a".split(" ")),
], ['id', 'words'])
df.show()

                                                                                

+---+---------------+
| id|          words|
+---+---------------+
|  0|      [a, b, c]|
|  0|[a, b, b, c, a]|
+---+---------------+



In [51]:
cv = CountVectorizer(inputCol='words', outputCol='features', vocabSize=3, minDF=2.0)

In [52]:
model = cv.fit(df)

                                                                                

In [53]:
result = model.transform(df)

In [54]:
result.show(truncate=False)

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|0  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
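What the fitted model does can be sketched in plain Python. This is an approximation under stated assumptions, not the `CountVectorizer` implementation: `fit_vocabulary` and `count_vector` are hypothetical helpers, `min_df` is interpreted as a minimum number of documents, and ties between equally frequent terms are broken alphabetically, which is an arbitrary choice made here for determinism.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    # Count in how many documents each term appears and how often overall.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    # Keep terms that appear in at least min_df documents.
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; the alphabetical tie-break is an assumption.
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept[:vocab_size]

def count_vector(doc, vocabulary):
    # One count per vocabulary term, in vocabulary order.
    counts = Counter(doc)
    return [float(counts[term]) for term in vocabulary]

docs = ['a b c'.split(), 'a b b c a'.split()]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
print(vocabulary)
for doc in docs:
    print(count_vector(doc, vocabulary))
```

The dense lists printed here carry the same counts as the sparse `features` vectors above, where `(3,[0,1,2],[2.0,2.0,1.0])` means a vector of size 3 with values 2.0, 2.0 and 1.0 at indices 0, 1 and 2.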

