# HW 4 — Topic Modeling

**Student Name**: Your name here  
**(Double-click this cell to type)**

---
In this homework assignment, you will be topic modeling a collection of Reddit posts from one of two subreddits, either [r/legaladvice](https://www.reddit.com/r/legaladvice) or [r/talesfromtechsupport](https://www.reddit.com/r/talesfromtechsupport/), to understand its broad themes and trends. All of the posts have an upvote score of 1,000 or more.

This homework assignment closely mirrors [chapters from our textbook](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/10-Topic-Modeling-CSV.html). You are encouraged to refer back to the textbook and our class exercises.

<div class="admonition note" name="html-admonition" style="background: orange; padding: 10px">
<p class="title">Content Warning</p>
I haven't read through all the Reddit posts in the r/legaladvice or r/talesfromtechsupport datasets, but I wanted to warn you that some of the posts are related to sexual harassment and sexual assault. Some of the posts also contain otherwise offensive or inappropriate content. You do not have to engage with these posts if you don't want to. If you would prefer to work with a different dataset, please reach out to me.</div>

## Python Review
---

## 1. Lists and List Comprehensions (3 points)

Before you topic model, you will review the two main ways that you can create Python lists.

Your task in this section is to make all the Reddit post titles in the list `titles` lowercase and then save them as a new list `lowercase_titles` or `lowercase_titles2`. This workflow will mirror the process that we will use when we prepare Reddit posts for topic modeling.
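As a quick refresher, here is a minimal sketch of the two approaches on an unrelated toy list. The list `words` and the variables `lengths` and `lengths2` are made up for illustration, and the example measures word lengths rather than lowercasing, so you will still need to adapt the pattern to the task above.

```python
# Hypothetical toy data, not part of the Reddit dataset
words = ["Apple", "Banana", "Cherry"]

# Way 1: start with an empty list, then fill it with a for loop and .append()
lengths = []
for word in words:
    lengths.append(len(word))

# Way 2: build the same list in a single line with a list comprehension
lengths2 = [len(word) for word in words]

print(lengths)
print(lengths2)
```

Both approaches produce the same list; the list comprehension is simply more compact.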

In [7]:
titles = ['I have been working remotely since march of 2020, I am moving to another state and my employer wants me to resign. State is MA.',
 'Can my landlord force me to be exposed to Covid-19? [Dillon, CO]',
 '(IL) Can I reveal a “confidential” HR investigation after leaving the company?']

First, make a new list `lowercase_titles` with a `for` loop and the `.append()` method.

In [None]:
lowercase_titles = []
# Your code here👇

Then print it out to make sure you have the right answer

In [11]:
print(lowercase_titles)

['i have been working remotely since march of 2020, i am moving to another state and my employer wants me to resign. state is ma.', 'can my landlord force me to be exposed to covid-19? [dillon, co]', '(il) can i reveal a “confidential” hr investigation after leaving the company?']


Next, make a new list `lowercase_titles2` with a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp).

In [None]:
lowercase_titles2 = #Your code here

Then print it out to make sure you have the right answer

In [10]:
print(lowercase_titles2)

['i have been working remotely since march of 2020, i am moving to another state and my employer wants me to resign. state is ma.', 'can my landlord force me to be exposed to covid-19? [dillon, co]', '(il) can i reveal a “confidential” hr investigation after leaving the company?']


## Topic Modeling
---

Import the following Python packages

In [1]:
import little_mallet_wrapper
import pandas as pd
from pathlib import Path

## 2. Read in Data (4 points)

Read in the CSV file for the [r/legaladvice](https://www.reddit.com/r/legaladvice/) data ("reddit-data/Legal-Advice-Reddit-Submissions.csv") or the [r/talesfromtechsupport](https://www.reddit.com/r/talesfromtechsupport/) data ("reddit-data/Tales-From-Tech-Support-Reddit-Submissions.csv") and then save it as `reddit_df`. (1 point)

In [None]:
reddit_df = # Your code here

How many Reddit posts are in this DataFrame? Enter your answer in the blank below. (1 point)

In [None]:
# Your code here — you can use any method to find the right answer

There are **____** Reddit posts in this DataFrame.

How many `NaN` values are in the column `"selftext"`? Enter your answer in the blank below. (2 points)

In [None]:
# Your code here — you can use any method to find the right answer

There are **____** `NaN` values in the column "selftext."
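If you are unsure where to start, the sketch below shows one possible way to count rows and missing values on a small made-up DataFrame. The name `toy_df` and its contents are hypothetical; other methods work just as well.

```python
import pandas as pd

# Hypothetical toy DataFrame with one missing value in "selftext"
toy_df = pd.DataFrame({
    "title": ["First post", "Second post", "Third post"],
    "selftext": ["Some text", None, "More text"],
})

# Number of rows in the DataFrame
num_rows = len(toy_df)

# .isna() marks each missing value as True; .sum() counts the Trues
num_missing = toy_df["selftext"].isna().sum()

print(num_rows, num_missing)
```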

## 3. Convert Column to String (1 point)

Because there are `NaN` values in the `"selftext"` column, we will get errors if we treat them as string data and try to pre-process them. Before we move on, you need to convert the `"selftext"` column to the datatype `string` and re-assign it to the same column.

*The rest of the notebook will not work if you do not get the correct answer here!*
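To see why this conversion matters, here is a small plain-Python illustration. It is not the pandas code that this step asks for. It only shows that a `NaN` is a float, so string methods fail on it until it is converted to a string. The variable names are made up for this example.

```python
# A NaN is a float, not a string
missing_value = float("nan")

# Calling a string method on a float raises an AttributeError
try:
    missing_value.lower()
    raised_error = False
except AttributeError:
    raised_error = True

# After conversion to a string, string methods work again
as_text = str(missing_value)

print(raised_error)
print(as_text.lower())
```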

In [221]:
# your code here

## 4. Process Reddit Posts and Titles (3 points)

To prepare the Reddit data for topic modeling, we need to transform all the Reddit posts to lowercase and remove stopwords, punctuation, and numbers.

To do so, we will use a `little_mallet_wrapper` function and a list comprehension. Fill in the blank (`YOUR-CODE-HERE`) in the list comprehension below.

In [222]:
training_data = [little_mallet_wrapper.process_string(YOUR-CODE-HERE, numbers='remove') for text in reddit_df['selftext']]
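To get a feel for what this kind of pre-processing does, here is a rough plain-Python sketch of the same steps: lowercasing, then removing punctuation, numbers, and stopwords. This is not how `little_mallet_wrapper` is implemented. The function `toy_process_string` and the tiny stopword list `TOY_STOPWORDS` are hypothetical and exist only for illustration.

```python
import re
import string

# Hypothetical, very short stopword list; real stopword lists are much longer
TOY_STOPWORDS = {"the", "a", "to", "my", "i", "is", "of", "in"}

def toy_process_string(text):
    """Lowercase the text, strip punctuation and digits, and drop stopwords."""
    text = text.lower()
    # Replace every punctuation character with a space
    punctuation_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(punctuation_to_space)
    # Replace runs of digits with a space
    text = re.sub(r"\d+", " ", text)
    # Split on whitespace and keep only the non-stopwords
    tokens = [token for token in text.split() if token not in TOY_STOPWORDS]
    return " ".join(tokens)

example = "Can my landlord force me to be exposed to Covid-19? [Dillon, CO]"
processed = toy_process_string(example)
print(processed)
```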

Make a list of the original (not pre-processed) Reddit posts and assign it to the variable `original_texts` 

In [25]:
# your code here

Make a list of every title from the DataFrame in the column `title` and assign it to the variable `reddit_titles` with a list comprehension

In [26]:
# Your code here

## 5. Set Up and Train Topic Model (0 points)

Below, you need to decide on the number of topics that you want to return (must be more than 15 topics).

The rest of the code is filled in for you — the filepath to Mallet and the output filepaths for our topic modeling results.

In [None]:
num_topics = # your choice of topics

path_to_mallet = '../mallet/bin/mallet'

#No need to change anything below here
training_data = training_data

#Set output directory
output_directory_path = 'topic-model-output/'

#Create output directory
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

#Create output files
path_to_training_data           = f"{output_directory_path}/training.txt"
path_to_formatted_training_data = f"{output_directory_path}/mallet.training"
path_to_model                   = f"{output_directory_path}/mallet.model.{str(num_topics)}"
path_to_topic_keys              = f"{output_directory_path}/mallet.topic_keys.{str(num_topics)}"
path_to_topic_distributions     = f"{output_directory_path}/mallet.topic_distributions.{str(num_topics)}"

Train the topic model with `little_mallet_wrapper.quick_train_topic_model()`. Remember that this may take a couple of minutes.

In [None]:
little_mallet_wrapper.quick_train_topic_model(path_to_mallet,
                                              output_directory_path,
                                              num_topics,
                                              training_data)

## 6. Display Topics and Top Words (2 points)

Fill in the code below to print every topic along with its topic number and topic key words.

In [None]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)


# your code here
    print(f"✨Topic {number}✨\n\n{topic}\n")
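As a reminder of how to loop over a list while keeping track of each item's position, here is a small sketch with made-up data. It assumes each topic is a list of words. The names `toy_topics` and `lines` are hypothetical.

```python
# Hypothetical topics: each topic is a list of key words
toy_topics = [
    ["lease", "landlord", "rent"],
    ["police", "car", "ticket"],
]

# enumerate() pairs each item with its index, starting at 0
lines = []
for number, topic in enumerate(toy_topics):
    lines.append(f"Topic {number}: {' '.join(topic)}")

for line in lines:
    print(line)
```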

## 7. Load Topic Distributions (0 points)

Load the topic distributions for all the documents (the probability that each document contains each of the topics)

In [230]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)
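To picture what these distributions look like, here is a toy sketch. It assumes that each document gets one list of probabilities, with one probability per topic. The documents and numbers are made up. Ranking the documents by their probability for a single topic is the same basic idea that the helper function in the next section relies on.

```python
# Hypothetical documents and topic distributions (2 topics, 3 documents)
toy_docs = ["doc about rent", "doc about tickets", "doc about both"]
toy_distributions = [
    [0.9, 0.1],  # mostly topic 0
    [0.2, 0.8],  # mostly topic 1
    [0.5, 0.5],  # an even mix
]

topic_index = 1

# Pair each document with its probability for the chosen topic,
# then sort from highest probability to lowest
ranked = sorted(
    [(distribution[topic_index], doc) for distribution, doc in zip(toy_distributions, toy_docs)],
    reverse=True,
)

for probability, doc in ranked:
    print(probability, doc)
```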

## 8. Identify and Label 3 Topics (9 points)

Below, you will choose and closely examine 3 of the topics in detail.

But first you need to run the cell below to create the functions that will help you examine the top Reddit posts for each topic.

In [256]:
from IPython.display import Markdown, display
import re

def make_md(string):
    """A function that transforms string data into Markdown
    so it can be nicely formatted with bolding and emojis
    """
    display(Markdown(str(string)))

def get_top_docs(docs, topic_distributions, topic_index=1, n=5):
    
    """A function that shows the top documents for a given set of topic distributions
    and a specific topic number
    """
    
    sorted_data = sorted([(_distribution[topic_index], _document) for _distribution, _document in zip(topic_distributions, docs)], reverse=True)
    topic_words = topics[topic_index]
    make_md(f"### ✨Topic {topic_index}✨\n\n{topic_words}\n\n---")
    
    for probability, doc in sorted_data[:n]:
        # Make topic words bolded
        for word in topic_words:
            if word in doc.lower():
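                # Pass flags by keyword (the 4th positional argument of re.sub is count) and keep each match's original casing
                doc = re.sub(rf"\b{re.escape(word)}\b", r"**\g<0>**", doc, flags=re.IGNORECASE)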
        make_md(f'✨  \n**Topic Probability**: {probability}  \n**Document**: {doc}\n\n')

For the following questions, you need to use **this function**:  `get_top_docs(docs, topic_distributions, topic_index=1, n=5)`

💡 Don't forget to fill in the topic words and a label for each topic below!!

### First Topic

Display the top 5 Reddit **titles** with the highest probability of containing your topic 

In [None]:
# Your code here — use the get_top_docs() function from above

Display the top 5 Reddit **posts** with the highest probability of containing your topic

In [None]:
# Your code here — use the get_top_docs() function from above

Based on the topic words and the top documents, come up with a label for your first topic and fill in the details below.

💡 **Topic *Number Here***: First Five Topic Words Here  
💡 **Topic Label**: Topic Label Here  

### Second Topic

Display the top 5 Reddit **titles** with the highest probability of containing your topic 

In [None]:
# Your code here — use the get_top_docs() function from above

Display the top 5 Reddit **posts** with the highest probability of containing your topic

In [None]:
# Your code here — use the get_top_docs() function from above

Based on the topic words and the top documents, come up with a label for your second topic and fill in the details below.

💡 **Topic *Number Here***: First Five Topic Words Here  
💡 **Topic Label**: Topic Label Here  

### Third Topic

Display the top 5 Reddit **titles** with the highest probability of containing your topic 

In [None]:
# Your code here — use the get_top_docs() function from above

Display the top 5 Reddit **posts** with the highest probability of containing your topic

In [None]:
# Your code here — use the get_top_docs() function from above

Based on the topic words and the top documents, come up with a label for your third topic and fill in the details below.

💡 **Topic *Number Here***: First Five Topic Words Here  
💡 **Topic Label**: Topic Label Here  

## 9/10. Reflection (2 points each)

In 2-5 sentences, answer the following questions:
- What kind of further research or analysis could you do with your 3 chosen topics or with any of the other topics? 

**Your answer here**

- What do you think was the most difficult topic to label of your topics? Why? Be sure to include the top 5 words from this topic in your response.

**Your answer here**

## When You're Finished

When you're finished, you should:

1. Save your Jupyter notebook with your last name. To save your notebook, you can select File -> Save Notebook or File -> Save Notebook As...

2. Download your Jupyter notebook file (.ipynb), and then upload it to Gradescope