This notebook takes the topic distribution vectors produced by the topic model and assigns each document to its most probable topic.

In [1]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1",
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"

    }
}

In [2]:
sc.install_pypi_package('spark-nlp',"https://pypi.org/simple")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1716430456629_0004,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/13/96/a580e098e00905ef715253fc85589db00ca5bfa324deb5aa7cb4fc069004/spark_nlp-5.3.3-py2.py3-none-any.whl (568kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.3.3

In [3]:
df = spark.read.parquet("s3://finalproject-nat-s3/data_withtopics/*.parquet")

In [4]:
# Recall the structure of the dataframe
print('Total Columns: %d' % len(df.dtypes))
print('Total Rows: %d' % df.count())
df.printSchema()

Total Columns: 15
Total Rows: 1646407
root
 |-- id: string (nullable = true)
 |-- created: string (nullable = true)
 |-- author: string (nullable = true)
 |-- score: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- num_comments: string (nullable = true)
 |-- entire_text: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- person_1: string (nullable = true)
 |-- person_2: string (nullable = true)
 |-- finished_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- terms_frequencies: vector (nullable = true)
 |-- tf_idf_features: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)

In [5]:
# Preview the first 10 rows of the dataframe
df.show(10)

+-------+----------------+--------------------+-----+--------------------+--------------------+------------+--------------------+----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|     id|         created|              author|score|               title|            selftext|num_comments|         entire_text|year|person_1|person_2|     finished_tokens|   terms_frequencies|     tf_idf_features|   topicDistribution|
+-------+----------------+--------------------+-----+--------------------+--------------------+------------+--------------------+----+--------+--------+--------------------+--------------------+--------------------+--------------------+
| nao0ib|2021-05-12 07:56|   u/Colonel_Chronic|    3|I 26M love her 25...|Sorry for long po...|           7|i 26m love her 25...|2021|        |        |[love, despise, f...|(1329,[0,1,2,3,5,...|(1329,[0,1,2,3,5,...|[1.63186347610851...|
| w0i2vh|2022-07-16 09:35|u/Educational_Fox...|    1

## Categorize Documents into topics

In [6]:
import pyspark.sql.functions as F

In [7]:
df.select("topicDistribution").show(5, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topicDistribution                                                                                                                                                                                                                                                                                                                   |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1.631863476108513

In [9]:
# Recall the topics that were given a label and store those labels in a dictionary

topics = {"0": "Romantic relationships",
          "8": "Dating",
          "9": "Finances, employment and housing",
          "10": "Social media and messaging",
          "11": "Family"}

Use of AI for the following code: I was initially treating the values of topicDistribution as a list and it was not working out. I sent my code to ChatGPT:

"I have a column with my topic distribution, it's of type "vector". I am trying to create an udf that:

- checks the largest value in that vector
- identify the position/index of that vector
- associate that index with a dictionary that has the topic labels

This is my code:

def get_topic_label(lst):
    '''Takes the topic distribution for each document and returns a label'''
    try:
        topic_number = lst.index(max(lst))
        if str(topic_number) in topics:
            return topics[str(topic_number)]
        else:
            return "Unclear topic"
    except:
        return None

What am I doing wrong?"

It explained that there are different types of vectors and that PySpark cannot read all of them, so I needed to convert the numpy array into an object type that PySpark can read (a DenseVector). It also suggested using the dictionary's get method to look up the label, instead of if/else statements.
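The lookup logic itself (find the index of the largest probability, then map it to a label with get) can be checked without a Spark session. The sketch below is only an illustration: a plain Python list stands in for the vector, the function name label_from_probabilities is hypothetical, and the two 12-element distributions are made up for the example.

```python
# Labels for the topics that were considered interpretable (same dictionary as in the notebook)
topics = {"0": "Romantic relationships",
          "8": "Dating",
          "9": "Finances, employment and housing",
          "10": "Social media and messaging",
          "11": "Family"}

def label_from_probabilities(probabilities, labels=topics, default="Unclear topic"):
    """Return the label of the most probable topic, or `default` if that topic has no label.

    Hypothetical helper: a plain list stands in for the Spark vector.
    """
    # Index of the largest probability (the first one wins on ties)
    max_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    # dict.get replaces the if/else membership check
    return labels.get(str(max_index), default)

# Made-up distribution over 12 topics, peaking at index 8
example = [0.02, 0.03, 0.05, 0.01, 0.04, 0.02, 0.03, 0.05, 0.60, 0.05, 0.05, 0.05]
print(label_from_probabilities(example))

# Made-up distribution peaking at index 3, which has no entry in the dictionary
unlabeled = [0.05, 0.05, 0.05, 0.50, 0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]
print(label_from_probabilities(unlabeled))
```

The first call should print the label stored under key "8", and the second should fall back to the default because "3" is not a key in the dictionary.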

In [10]:
from pyspark.ml.linalg import DenseVector

# Define the function to get the topic label
def get_topic_label(vector):
    '''
    Takes the topic distribution for each document and returns a label

    Input (vector): topic probability distribution
    Output (string): label of the most probable topic, or "Unclear topic"
        if that topic has no entry in the topics dictionary
    '''

    # Convert the input into a DenseVector
    dense_vector = DenseVector(vector)

    # Find the index of the maximum value, which corresponds to the topic number
    max_index = dense_vector.argmax()

    # Look up the label in the dictionary based on the topic number
    return topics.get(str(max_index), "Unclear topic")

In [11]:
# Register the function as a Spark user-defined function (the return type defaults to string)
udf_get_topic_label = F.udf(get_topic_label)

In [12]:
#Apply transformation to rows with UDF 
df = df.withColumn('topic_label', udf_get_topic_label(F.col('topicDistribution')))

In [13]:
#Check topic distribution to ensure there's enough data to work on, and not only
#"Unclear topic" submissions

label_counts = df.groupBy("topic_label").count()\
                 .orderBy("count", ascending=False)

# Show the results
label_counts.show()

+--------------------+------+
|         topic_label| count|
+--------------------+------+
|       Unclear topic|921402|
|              Dating|240820|
|Romantic relation...|203455|
|              Family|101419|
|Social media and ...| 98879|
|Finances, employm...| 80432|
+--------------------+------+

Since several topics were considered vague and labeled as "Unclear topic", it is logical that this label is the largest category (about 56% of the 1,646,407 submissions). However, each of the other 5 defined labels still has between roughly 80,000 and 240,000 submissions that can be useful to explore.
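As a quick arithmetic check on the counts above, the share of each label can be computed directly from the table. This is a minimal sketch in plain Python: the counts are copied by hand from the output, and the full label names are taken from the topics dictionary because the table truncates them.

```python
# Counts copied from the label_counts output above
label_counts = {
    "Unclear topic": 921402,
    "Dating": 240820,
    "Romantic relationships": 203455,
    "Family": 101419,
    "Social media and messaging": 98879,
    "Finances, employment and housing": 80432,
}

# The total should match the row count reported earlier in the notebook
total = sum(label_counts.values())
print(f"Total submissions: {total}")

# Share of each label, as a fraction of all submissions
shares = {label: count / total for label, count in label_counts.items()}
for label, share in shares.items():
    print(f"{label:<35}{share:6.1%}")
```

The counts add up to the total number of rows reported by df.count(), so no submission was left without a label.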

Next, I'll check some examples associated with each label to see if they make sense.

In [14]:
df.filter(df.topic_label == "Dating")\
  .sample(0.0001, seed=25)\
  .limit(5)\
  .select("entire_text").show(truncate=False)


In [15]:
df.filter(df.topic_label == "Romantic relationships")\
  .sample(0.0001, seed=25)\
  .limit(5)\
  .select("entire_text").show(truncate=False)


In [16]:
df.filter(df.topic_label == "Family")\
  .sample(0.0001, seed=25)\
  .limit(5)\
  .select("entire_text").show(truncate=False)


In [17]:
df.filter(df.topic_label == "Social media and messaging")\
  .sample(0.0001, seed=25)\
  .limit(5)\
  .select("entire_text").show(truncate=False)


In [18]:
df.filter(df.topic_label == "Finances, employment and housing")\
  .sample(0.0001, seed=25)\
  .limit(5)\
  .select("entire_text").show(truncate=False)


Overall, the topics seem logical and coherent. The topic on social media and messaging is especially important for my original research questions. Skimming through the examples, it seems that the model was able to capture the 5 labeled topics.

## Save the dataframe as parquet files in S3, keeping only the columns needed for further analysis

In [19]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- created: string (nullable = true)
 |-- author: string (nullable = true)
 |-- score: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- num_comments: string (nullable = true)
 |-- entire_text: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- person_1: string (nullable = true)
 |-- person_2: string (nullable = true)
 |-- finished_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- terms_frequencies: vector (nullable = true)
 |-- tf_idf_features: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)
 |-- topic_label: string (nullable = true)

I'll exclude the following columns:

* author: not necessary, since authors add little information for this analysis

* title and selftext: not necessary, since entire_text combines them

* person_1 and person_2: these were initial attempts to extract the ages of the people involved in each submission. After exploring the texts more thoroughly, I found that some submissions involve 3+ people or do not follow the same format for indicating age, so for simplicity I'll exclude those columns. (I hadn't even finished that code, so I just deleted it from the data wrangling notebook.)

* created: not necessary, because the year is already extracted

* All of the columns created for the topic model: no longer necessary, since I can work with the label alone
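To double-check which columns remain after these exclusions, the list can be derived from the schema printed above. This is a small sketch in plain Python: the column names are copied by hand from the schema, and treating finished_tokens, terms_frequencies, tf_idf_features and topicDistribution as "the columns created for the topic model" is my reading of the list, so adjust it if that assumption is wrong.

```python
# Column names copied from the printSchema() output above
all_columns = [
    "id", "created", "author", "score", "title", "selftext", "num_comments",
    "entire_text", "year", "person_1", "person_2", "finished_tokens",
    "terms_frequencies", "tf_idf_features", "topicDistribution", "topic_label",
]

# Columns to exclude, following the list above
# (assumption: the last four are the topic model columns)
excluded = {
    "author", "title", "selftext", "person_1", "person_2", "created",
    "finished_tokens", "terms_frequencies", "tf_idf_features", "topicDistribution",
}

# Keep the original column order while dropping the excluded names
kept = [name for name in all_columns if name not in excluded]
print(kept)
```

The resulting list can then be compared against the columns passed to select in the next cell.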


In [20]:
reduced_df = df.select("id", "score", "num_comments", "entire_text", "year", "topic_label")

In [21]:
reduced_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- score: string (nullable = true)
 |-- num_comments: string (nullable = true)
 |-- entire_text: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- topic_label: string (nullable = true)

In [23]:
df_repartitioned = reduced_df.repartition(10)
df_repartitioned.write.parquet("s3://finalproject-nat-s3/data_withtopic_labels", mode = 'overwrite')
