## Setup

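Here, we install the JDK using conda (which also sets the proper paths), followed by PySpark and Spark NLP. The cell restarts the kernel when it finishes, so it only needs to run once per kernel.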

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## Start the Spark Session

Here, we start the `spark` session.

In [None]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

In [4]:
print(spark.version)

3.4.0


In [5]:
from pyspark.sql.functions import col, lower, regexp_extract, regexp_replace

## Test Reading in Data from Shared Bucket

Here, we try to read in a file that Alex created in our shared bucket, to ensure that someone in the group can read a file that is owned by someone else.

```
%%time
t = spark.read.text(
    "s3a://project17-bucket-alex/eda_ideas.txt"
)
t.show()
```

## Test Writing Data into Shared Bucket

Here, we try to write a small data frame into the shared bucket.

```
%%time

data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}
        ]

df = spark.createDataFrame(data)
df.show()
```

```
df.write.csv(
    "s3a://project17-bucket-alex/matt-test-csv.csv"
)
```

## Read in the Data

Now that our shared bucket is configured properly, we can read in our filtered project data.

### Comments - Read the Data (ONE MONTH)

```
%%time
comments = spark.read.parquet(
    's3a://project17-bucket-alex/project_jan2021/comments/*.parquet'
)
```

### Comments - Read the Data (FULL)

In [6]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
#output_prefix_data = "project_2022"

# List of 12 directories each containing 1 month of data
directories = ["project_2022_" + str(i) + "/comments" for i in range(1, 13)]

# Iterate through 12 directories and merge each monthly data set to create one big data set
comments = None
for directory in directories:
    s3_path = f"s3a://{bucket}/{directory}"
    month_df = spark.read.parquet(s3_path)  # Parquet files carry their own schema, so no header option is needed
    
    if comments is None:
        comments = month_df
    else:
        comments = comments.union(month_df)

23/11/17 19:19:40 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

CPU times: user 25.3 ms, sys: 4.27 ms, total: 29.6 ms
Wall time: 17.4 s
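
As a quick sanity check, the twelve monthly paths that the loop above visits can be listed without touching Spark. This is a minimal sketch: the helper name `monthly_paths` is hypothetical and not part of the notebook, while the bucket name and directory pattern are copied from the cell above.

```python
def monthly_paths(bucket, kind, year=2022):
    """Build the twelve monthly S3 paths in the layout used above (hypothetical helper)."""
    return [f"s3a://{bucket}/project_{year}_{month}/{kind}" for month in range(1, 13)]


paths = monthly_paths("project17-bucket-alex", "comments")
print(len(paths))  # number of monthly directories
print(paths[0])    # first month
print(paths[-1])   # last month
```

The month suffix is not zero-padded, so the directories run from `project_2022_1` to `project_2022_12`.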


                                                                                

### Comments - View the Data

In [7]:
comments.select(['subreddit', 'author', 'body', 'parent_id', 'link_id', 'id', 'created_utc']).show(10)

+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|        subreddit|              author|                body| parent_id|  link_id|     id|        created_utc|
+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|    AmItheAsshole|         beckydragon|                 NTA| t3_rz9uu3|t3_rz9uu3|hs0rusg|2022-01-10 04:49:57|
|    AmItheAsshole|        Cactus_chuck|NTA. My partners ...| t3_s0baev|t3_s0baev|hs0rusr|2022-01-10 04:49:57|
|    AmItheAsshole|   Red-belliedOrator|INFO\n\nIn genera...| t3_s0a5hn|t3_s0a5hn|hs0rut9|2022-01-10 04:49:57|
|NoStupidQuestions|  SoMuchForLongevity|You couldn't heat...| t3_s0b5be|t3_s0b5be|hs0rutc|2022-01-10 04:49:57|
|NoStupidQuestions|          MMmason651|it wouldn't taste...| t3_s0axsd|t3_s0axsd|hs0rutm|2022-01-10 04:49:57|
|           AskMen|          redditfu76|  Play with my boobs| t3_s0bc1r|t3_s0bc1r|hs0ruva|2022-01-10 04:49:58|
|

                                                                                

### Comments - Print the Shape of the Data

In [8]:
# comments.count(), len(comments.columns)

The shape is `(76503363, 21)` (no need to run the count operation again).

### Comments - Print the Schema

In [9]:
comments.printSchema()

root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)



### Submissions - Read the Data (ONE MONTH)

```
%%time
submissions = spark.read.parquet(
    's3a://project17-bucket-alex/project_jan2021/submissions/*.parquet'
)
```

### Submissions - Read the Data (FULL)

In [7]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
#output_prefix_data = "project_2022"

# List of 12 directories each containing 1 month of data
directories = ["project_2022_" + str(i) + "/submissions" for i in range(1, 13)]

# Iterate through 12 directories and merge each monthly data set to create one big data set
submissions = None
for directory in directories:
    s3_path = f"s3a://{bucket}/{directory}"
    month_df = spark.read.parquet(s3_path)  # Parquet files carry their own schema, so no header option is needed
    
    if submissions is None:
        submissions = month_df
    else:
        submissions = submissions.union(month_df)

23/11/18 15:44:16 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/11/18 15:44:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


CPU times: user 16.8 ms, sys: 9.39 ms, total: 26.2 ms
Wall time: 13.5 s


### Submissions - View the Data

In [8]:
submissions.select(['subreddit', 'author', 'title', 'selftext', 'created_utc', 'num_comments']).show(10)

+-----------------+-------------------+--------------------+--------------------+-------------------+------------+
|        subreddit|             author|               title|            selftext|        created_utc|num_comments|
+-----------------+-------------------+--------------------+--------------------+-------------------+------------+
|NoStupidQuestions|          [deleted]|Who do you call w...|           [deleted]|2022-01-22 18:14:03|           4|
|    AmItheAsshole|          [deleted]|AITA for blowing ...|           [removed]|2022-01-22 18:14:04|           7|
|    AmItheAsshole|       go_awaythrow|AITA if I cut my ...|           [removed]|2022-01-22 18:14:12|           1|
|NoStupidQuestions|          [deleted]|   [deleted by user]|           [removed]|2022-01-22 18:14:16|           1|
|           AskMen|          [deleted]|Do men actually l...|           [removed]|2022-01-22 18:14:21|           1|
|         antiwork|        Vivid_Steel|For Those of You ...|In most states in...

                                                                                

### Submissions - Print the Shape of the Data

In [9]:
# submissions.count(), len(submissions.columns)

The shape is `(3444283, 68)` (no need to run the count operation again).

### Submissions - Print the Schema

In [10]:
submissions.printSchema()

root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    

## Comments - Filter for Non-Deleted, Non-Removed, Non-Empty Posts

Here, we keep only the comments whose `body` is not `[deleted]`, `[removed]`, or empty.

In [14]:
invalid_comments = ['[deleted]', '[removed]', '']

comments_valid = comments.filter(~comments.body.isin(invalid_comments))
comments_valid.select(['subreddit', 'author', 'body', 'parent_id', 'link_id', 'id', 'created_utc']).show(10)

+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|        subreddit|              author|                body| parent_id|  link_id|     id|        created_utc|
+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|    AmItheAsshole|         beckydragon|                 NTA| t3_rz9uu3|t3_rz9uu3|hs0rusg|2022-01-10 04:49:57|
|    AmItheAsshole|        Cactus_chuck|NTA. My partners ...| t3_s0baev|t3_s0baev|hs0rusr|2022-01-10 04:49:57|
|    AmItheAsshole|   Red-belliedOrator|INFO\n\nIn genera...| t3_s0a5hn|t3_s0a5hn|hs0rut9|2022-01-10 04:49:57|
|NoStupidQuestions|  SoMuchForLongevity|You couldn't heat...| t3_s0b5be|t3_s0b5be|hs0rutc|2022-01-10 04:49:57|
|NoStupidQuestions|          MMmason651|it wouldn't taste...| t3_s0axsd|t3_s0axsd|hs0rutm|2022-01-10 04:49:57|
|           AskMen|          redditfu76|  Play with my boobs| t3_s0bc1r|t3_s0bc1r|hs0ruva|2022-01-10 04:49:58|
|

                                                                                

In [15]:
comments_relationships = comments_valid.filter(col('subreddit') == 'relationship_advice')
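
The validity filter above can be mirrored in plain Python to make its behavior explicit. This is only an illustrative sketch: the function `is_valid_body` is hypothetical, and `None` stands in for a null `body`. In Spark, `~col.isin(...)` evaluates to null for a null value, and `filter` drops rows whose condition is null, so null bodies are excluded as well.

```python
INVALID = {"[deleted]", "[removed]", ""}


def is_valid_body(body):
    """Plain-Python analogue of ~comments.body.isin(invalid_comments) inside filter()."""
    # A null body yields a null condition in Spark, and filter() drops such rows.
    if body is None:
        return False
    # isin() is an exact match, so only these three literal values are rejected.
    return body not in INVALID


sample = ["NTA", "[deleted]", "", None, "[removed]", "INFO"]
kept = [b for b in sample if is_valid_body(b)]
print(kept)
```

Because the comparison is exact, a body that consists only of spaces would still pass this filter.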

## Submissions - Filter for Non-Deleted, Non-Removed, Non-Empty Posts

Here, we keep only the submissions whose `selftext` is not `[deleted]`, `[removed]`, or empty.

In [11]:
invalid_submissions = ['[deleted]', '[removed]', '']

submissions_valid = submissions.filter(~submissions.selftext.isin(invalid_submissions))
submissions_valid.select(['subreddit', 'author', 'title', 'selftext', 'created_utc', 'num_comments']).show(10)

+-------------------+--------------------+--------------------+--------------------+-------------------+------------+
|          subreddit|              author|               title|            selftext|        created_utc|num_comments|
+-------------------+--------------------+--------------------+--------------------+-------------------+------------+
|           antiwork|         Vivid_Steel|For Those of You ...|In most states in...|2022-01-22 18:14:28|           1|
|   unpopularopinion| ballonfightaddicted|Waking up 15-30 m...|I like waking up ...|2022-01-22 18:15:24|           5|
|      AmItheAsshole|       geosunsetmoth|AITA for refusing...|I (NB 19) am auti...|2022-01-22 18:15:30|         425|
|  NoStupidQuestions|           Killdreth|Can I do anything...|I don’t know why,...|2022-01-22 18:15:45|           2|
|     TrueOffMyChest|          sadness_18|I hate people who...|I've been called ...|2022-01-22 18:15:46|           5|
|relationship_advice|  Natural_Rabbit8936|I went thru my

                                                                                

## Perform NLP with `relationship_advice` Submissions

First, we will filter the valid submissions to the `relationship_advice` subreddit.

In [12]:
submissions_relationships = submissions_valid.filter(col('subreddit') == 'relationship_advice')

Our goal is to extract the age and gender of the author of a post. We use the `relationship_advice` subreddit because authors are required to include this information in their posts. The most common way to express this information is through the use of parentheses, abbreviating the age and gender; for example, "I (23M) gave my friend...". Using a set of complex regular expressions, we will extract the age and gender information from these parentheses, where available. 

There are many considerations that we must make, which are summarized below:

* The author may choose to write the age and gender in any order. This means that we must properly extract these variables, whether they are written as "(23M)", for example, or "(M23)".
* The author may also provide the age and gender of other people in the post. For example, they might write "I (23M) gave my friend (F24)...". In these cases, we must extract the age and gender of the author, not of the other people mentioned. To do this, we look at the word immediately before the parentheses: "I" and "me" indicate the author of the post, while words such as "him" and "girlfriend" indicate someone else.
* Authors use spaces inconsistently (for example, "I(25F)", "I (23m)", or "I (24 F)"), which adds complexity to the regex search, so we remove all space characters for the purpose of performing the regex search only.
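
The space-removal and lowercasing steps can be sketched in plain Python. This is a minimal illustration rather than the notebook's Spark code, and the function name `normalize` is hypothetical. It removes only the space character, matching the `regexp_replace('selftext', ' ', '')` call used later, so tabs and newlines are left in place.

```python
def normalize(text):
    """Remove space characters and lowercase the text (hypothetical helper)."""
    return text.replace(" ", "").lower()


# Three spacing variants that appear in the posts.
variants = ["I(25F) met my bf", "I (23m) shared a", "So, I (24 F) am"]
for v in variants:
    print(normalize(v))
```

After normalization, all three variants place the pronoun directly against the opening parenthesis, which is what the pattern relies on.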

The regular expression that we will use is: `'(i)\(([0-9]{2}[a-zA-Z]{1,2})\)|(me)\(([0-9]{2}[a-zA-Z]{1,2})\)|(i)\(([a-zA-Z]{1,2}[0-9]{2})\)|(me)\(([a-zA-Z]{1,2}[0-9]{2})\)'`. It is applied to the lowercased text with spaces removed, so the actual matches are lowercase and contain no spaces; the examples below show the original text for readability. The pattern can be interpreted in four parts:

* `'(i)\(([0-9]{2}[a-zA-Z]{1,2})\)'`: Match the pronoun "I", followed by parentheses that contain two digits and then one or two letters.
    * Example: "I (22M) talked to my friend..." --> Match: "I (22M)"
* `'(me)\(([0-9]{2}[a-zA-Z]{1,2})\)'`: Match the pronoun "me", followed by parentheses that contain two digits and then one or two letters.
    * Example: "My girlfriend gave me (25M)..." --> Match: "me (25M)"
* `'(i)\(([a-zA-Z]{1,2}[0-9]{2})\)'`: Match the pronoun "I", followed by parentheses that contain one or two letters and then two digits.
    * Example: "I (F45) talked to my friend..." --> Match: "I (F45)"
* `'(me)\(([a-zA-Z]{1,2}[0-9]{2})\)'`: Match the pronoun "me", followed by parentheses that contain one or two letters and then two digits.
    * Example: "My girlfriend gave me (F31)..." --> Match: "me (F31)"
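
The four alternatives can be tried out locally with Python's `re` module before running them in Spark. This is a sketch under stated assumptions: Spark evaluates the pattern with Java's regex engine, and `re` is used here only as a stand-in that treats this particular pattern the same way. The helper names `normalize` and `first_match` are hypothetical.

```python
import re

# The same four-part pattern, split across lines for readability.
PATTERN = re.compile(
    r"(i)\(([0-9]{2}[a-zA-Z]{1,2})\)"
    r"|(me)\(([0-9]{2}[a-zA-Z]{1,2})\)"
    r"|(i)\(([a-zA-Z]{1,2}[0-9]{2})\)"
    r"|(me)\(([a-zA-Z]{1,2}[0-9]{2})\)"
)


def normalize(text):
    """Remove space characters and lowercase, mirroring the preprocessing step."""
    return text.replace(" ", "").lower()


def first_match(text):
    """Return the first full match in the normalized text, or '' if there is none."""
    match = PATTERN.search(normalize(text))
    return match.group(0) if match else ""


examples = [
    "I (22M) talked to my friend",
    "My girlfriend gave me (25M) a gift",
    "I (F45) talked to my friend",
    "My girlfriend (F31) called",  # no author pronoun before the parentheses
]
for example in examples:
    print(repr(first_match(example)))
```

One consequence of removing spaces is that the pattern cannot tell a standalone "I" from a word that merely ends in "i" or "me": for example, "Levi (25M)" normalizes to "levi(25m)" and produces the match "i(25m)".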

Let's write the regular expression here.

In [13]:
my_regex = r'(i)\(([0-9]{2}[a-zA-Z]{1,2})\)|(me)\(([0-9]{2}[a-zA-Z]{1,2})\)|(i)\(([a-zA-Z]{1,2}[0-9]{2})\)|(me)\(([a-zA-Z]{1,2}[0-9]{2})\)'

Now, we can apply it to the `selftext` column, which contains the content of the post.

In [14]:
submissions_regex_applied = submissions_relationships.withColumn(
    'selftext_nowhitespace', regexp_replace('selftext', ' ', '')             # remove space characters only
).withColumn(
    'selftext_lower', lower('selftext_nowhitespace')                         # convert to lowercase
).withColumn(
    'regex_extract', regexp_extract('selftext_lower', my_regex, 0)           # apply the complex regex
).withColumn(
    'regex_parentheses', regexp_extract('regex_extract', r'\(([^\)]+)\)', 1) # extract what is in the parentheses
).withColumn(
    'regex_age', regexp_extract('regex_parentheses', r'\d+', 0)              # extract the digits of the parentheses
).withColumn(
    'regex_gender', regexp_extract('regex_parentheses', r'\D+', 0)           # extract the non-digit characters of the parentheses
)
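
The three follow-up extractions can be sketched in plain Python with the `re` module, which handles these simple patterns the same way. This is an illustrative stand-in for the Spark calls above, and the function name `split_age_gender` is hypothetical. Like `regexp_extract`, it returns an empty string when a pattern does not match.

```python
import re


def split_age_gender(matched):
    """Split a match such as 'i(m22)' into (age, gender) strings."""
    # Step 1: keep only what sits between the parentheses.
    inner = re.search(r"\(([^\)]+)\)", matched)
    inside = inner.group(1) if inner else ""
    # Step 2: the first run of digits is the age.
    age = re.search(r"\d+", inside)
    # Step 3: the first run of non-digits is the gender.
    gender = re.search(r"\D+", inside)
    return (age.group(0) if age else "", gender.group(0) if gender else "")


for matched in ["me(19f)", "i(m22)", ""]:
    print(split_age_gender(matched))
```

Splitting on digits versus non-digits means the same code handles both orderings, "(19f)" and "(m22)", without a separate branch.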

Let's take a peek at how this worked.

In [15]:
submissions_regex_applied.select(['title', 'selftext_lower', 'regex_extract', 'regex_parentheses', 'regex_age', 'regex_gender'])\
                         .filter(col('regex_extract') != '').show(10)

+--------------------+--------------------+-------------+-----------------+---------+------------+
|               title|      selftext_lower|regex_extract|regex_parentheses|regex_age|regex_gender|
+--------------------+--------------------+-------------+-----------------+---------+------------+
|Tips for emotiona...|mygirlfriend(19f)...|      me(19f)|              19f|       19|           f|
|Partner with chro...|wheni(34f)amliste...|       i(34f)|              34f|       34|           f|
|How do I make LDR...|i(17f)havebeendat...|       i(17f)|              17f|       17|           f|
|Married sex life....|i(42m)andwife(37f...|       i(42m)|              42m|       42|           m|
|Saying I Love you...|alrightsojustalit...|      me(17m)|              17m|       17|           m|
|I can't sleep at ...|basicallyi(m22)mo...|       i(m22)|              m22|       22|           m|
|I (24F) fell in l...|i(24f)startedafwb...|       i(24f)|              24f|       24|           f|
|My girlfr

In [16]:
submissions_age_gender = submissions_regex_applied.filter(col('regex_extract') != '')

In [17]:
# submissions_age_gender.count(), len(submissions_age_gender.columns)

The shape is `(59242, 74)`.

### Submissions - Value Counts by Age and Gender

Here, we see the distribution of ages and genders within the `relationship_advice` subreddit.

In [31]:
# submissions_age_counts = submissions_age_gender.groupBy('regex_age').count().orderBy('regex_age', ascending = True).cache()
# submissions_age_counts.show()

In [32]:
# submissions_gender_counts = submissions_age_gender.groupBy('regex_gender').count().orderBy('regex_gender', ascending = False).cache()
# submissions_gender_counts.show()

### Save the Data

Let's save the dataset, after dropping the intermediate regex columns, before proceeding with more NLP work.

In [18]:
submissions_age_gender = submissions_age_gender.drop(
    'selftext_nowhitespace',
    'selftext_lower',
    'regex_extract',
    'regex_parentheses'
)

In [None]:
# submissions_age_gender.write.parquet(
#     "s3a://project17-bucket-alex/matt-submissions-age-gender"
# )

## Setup

Here, we start back up with a Spark session that is configured for Spark NLP.

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [4]:
import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

In [None]:
# Import pyspark and build Spark session
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()

In [6]:
print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")

Spark version: 3.4.0
sparknlp version: 5.1.3


## Read in the Saved Data

Here, we read in the data saved above as a fresh starting point.

In [7]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
directory = "matt-submissions-age-gender"

s3_path = f"s3a://{bucket}/{directory}"
submissions_age_gender = spark.read.parquet(s3_path)

23/11/18 15:52:16 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

CPU times: user 8.4 ms, sys: 2.75 ms, total: 11.2 ms
Wall time: 6.87 s


23/11/18 15:52:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [9]:
submissions_age_gender.select(['title', 'selftext', 'regex_age', 'regex_gender']).show(10)

+--------------------+--------------------+---------+------------+
|               title|            selftext|regex_age|regex_gender|
+--------------------+--------------------+---------+------------+
|my boyfriend(27) ...|So my boyfriend(m...|       27|           f|
|Confused in an in...|\nIn a new relati...|       21|           f|
|Asking for phone ...|So, I (21M) was a...|       21|           m|
|LDR bf of 3 month...|I(25F) met my bf(...|       25|           f|
|I break up with m...|I (23m) shared a ...|       23|           m|
|How can I get mor...|My boyfriend(32M)...|       23|           f|
|I (35F) can't get...|So I live with an...|       35|           f|
|I think I'm a les...|I (25f) have been...|       25|           f|
|I (24F) snore too...|So, I (24 F) am i...|       24|           f|
|One of my best fr...|I’m on mobile so ...|       21|           f|
+--------------------+--------------------+---------+------------+
only showing top 10 rows



In [10]:
df = submissions_age_gender.select(['selftext', 'regex_age', 'regex_gender'])
del submissions_age_gender

## Sentiment Model

In [None]:
MODEL_NAME = 'sentimentdl_use_twitter'

# Wrap the raw post text in a Spark NLP document annotation
documentAssembler = DocumentAssembler().setInputCol("selftext").setOutputCol("document")

# Encode each document with the pretrained Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Classify the sentiment of each document from its embedding
sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        use,
        sentimentdl,
    ]
)

In [12]:
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)

In [13]:
results = results.withColumn('sentiment', F.explode(results.sentiment.result))
final_data = results.select('selftext', 'regex_age', 'regex_gender', 'sentiment')
final_data.persist()
final_data.show()

+--------------------+---------+------------+---------+
|            selftext|regex_age|regex_gender|sentiment|
+--------------------+---------+------------+---------+
|So my boyfriend(m...|       27|           f| negative|
|\nIn a new relati...|       21|           f|  neutral|
|So, I (21M) was a...|       21|           m| negative|
|I(25F) met my bf(...|       25|           f| negative|
|I (23m) shared a ...|       23|           m| negative|
|My boyfriend(32M)...|       23|           f| negative|
|So I live with an...|       35|           f| negative|
|I (25f) have been...|       25|           f| negative|
|So, I (24 F) am i...|       24|           f| negative|
|I’m on mobile so ...|       21|           f| negative|
|TDLR: can ex’s be...|       22|           f| negative|
|I (22m) have been...|       22|           m| negative|
|I (21f) am having...|       21|           f| negative|
|I (16f) am thinki...|       16|           f| negative|
|Hi everyone. I (2...|       21|           f| ne

                                                                                

In [14]:
final_data = final_data.select('regex_age', 'regex_gender', 'sentiment')

In [15]:
# save the results to CSV
final_data.write.option('header', True).csv('submission_age_gender_sentiment.csv')

                                                                                

## Non-Sentiment NLP

In [35]:
documentAssembler = DocumentAssembler().setInputCol("selftext").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("tokenized")

stemmer = Stemmer().setInputCols(["tokenized"]).setOutputCol("stemmed")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          tokenizer,
          stemmer
      ])

In [36]:
pipelineModel = nlpPipeline.fit(df)
results = pipelineModel.transform(df)

In [None]:
results = results.withColumn('stemmed', F.explode(results.stemmed.result))
final_data = results.select('selftext', 'stemmed')
final_data.persist()
final_data.show()
