# Week 7 Part 1 Exercises — Topic Modeling

---

## Dataset

### Am I the Asshole?

<blockquote class="epigraph" style=" padding: 10px">

AITA for lying about my biggest fear on a quiz show and subsequently winning a car and making other contestants lose?


<p class ="attribution">
    —Reddit user iwonacar, <a href="https://www.reddit.com/r/AmItheAsshole/comments/dmfsum/aita_for_lying_about_my_biggest_fear_on_a_quiz/">r/AmItheAsshole</a>
    </p>
    
</blockquote>

In this particular lesson, we're going to use [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), a Python wrapper for [MALLET](http://mallet.cs.umass.edu/topics.php), to topic model a CSV file with 2,932 Reddit posts from the subreddit [r/AmITheAsshole](https://www.reddit.com/r/AmItheAsshole/) that have an upvote score of at least 2,000. This is an online forum where people share their personal conflicts and ask the community to judge who's the a**hole in the story. This data was collected with PSAW, a wrapper for the Pushshift API.

___

<div class="admonition attention" name="html-admonition" style="background: lightyellow;padding: 10px">  
<p class="title">Attention</p>  
    
<p>If you're working in this Jupyter notebook on your own computer, you'll need to have both the Java Development Kit and MALLET pre-installed. For setup instructions, please see <a href="http://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Set-Up.html">the previous lesson</a>.</p>
    
If you're working in this Jupyter notebook in the cloud via Binder/JupyterHub, then the Java Development Kit and MALLET will already be installed. You're good to go!
     
</div>  

## Set MALLET Path

Since Little MALLET Wrapper is a Python package built around MALLET, we first need to tell it where the bigger, Java-based MALLET lives.

We're going to make a variable called `path_to_mallet` and assign it the file path of our MALLET program. We need to point it, specifically, to the "mallet" file inside the "bin" folder inside your MALLET folder (for example, "mallet-2.0.8").

In [17]:
path_to_mallet = '../mallet/bin/mallet'

If MALLET is located in another directory, then set your `path_to_mallet` to that file path.
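Before training, it can help to confirm that the path really points at a file. The following is a minimal sketch using only the standard library; the helper name `mallet_path_looks_valid` is hypothetical and not part of Little MALLET Wrapper.

```python
from pathlib import Path

def mallet_path_looks_valid(path_to_mallet):
    """Return True if the given path points to an existing file."""
    return Path(path_to_mallet).is_file()

# Substitute your own path_to_mallet value here; the result depends on your machine
print(mallet_path_looks_valid('../mallet/bin/mallet'))
```

If this prints `False`, double-check the folder names in your path before moving on.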

## Install Packages

In [2]:
#!pip install little_mallet_wrapper
#!pip install seaborn
# To install the most up-to-date version of little_mallet_wrapper:
#!pip install git+https://github.com/maria-antoniak/little-mallet-wrapper.git

## Import Packages

Now let's `import` the `little_mallet_wrapper` and the data viz library `seaborn`.

In [2]:
import little_mallet_wrapper
import seaborn
import glob
from pathlib import Path
import pandas as pd
import random
pd.options.display.max_colwidth = 100

We're also going to import [`glob`](https://docs.python.org/3/library/glob.html) and [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) for working with files and the file system, as well as pandas for working with the CSV data.

## Get Training Data From CSV File

In [148]:
reddit_df = pd.read_csv("../data/top-reddit-aita-posts.csv")

In [None]:
reddit_df.head()

## Your turn! 👩‍💻

There are NaN values in the "selftext" column. To work with this column as string data, you need to explicitly convert the column to string data.

### Process Reddit Posts

`little_mallet_wrapper.process_string(text, numbers='remove')`

Next we're going to process our texts with the function `little_mallet_wrapper.process_string()`. This function will take every individual post, transform all the text to lowercase as well as remove stopwords, punctuation, and numbers, and then add the processed text to our master list `training_data`.
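To get a feel for what this kind of pre-processing does to a single string, here is a rough standard-library approximation. It is not the library's implementation: the function `toy_process_string` and the tiny stopword list are assumptions made for illustration, and the real function uses its own, much longer stopword list.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the library's list)
TOY_STOPWORDS = {"i", "my", "the", "a", "an", "and", "to", "of", "was", "for", "on", "in"}

def toy_process_string(text):
    """Lowercase, drop digits and punctuation, and remove toy stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    # Replace every punctuation character with a space
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = [token for token in text.split() if token not in TOY_STOPWORDS]
    return " ".join(tokens)

print(toy_process_string("I won a car on a quiz show in 2019!"))
# prints: won car quiz show
```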

## Your turn! 👩‍💻
Loop through every post in the column "selftext," use Little MALLET Wrapper to pre-process each post (make sure to remove numbers), and then make a new list called `training_data` consisting of the pre-processed texts.

In [None]:
training_data =  # your code here






Now make a list of the original Reddit posts, so that we can reference them later.

In [None]:
original_texts = # your code here





### Process Reddit Post Titles

We're also going to extract the title of each Reddit post.

In [135]:
reddit_titles = [title for title in reddit_df['title']]

### Get Dataset Statistics

We can get training data summary statistics by using the function `little_mallet_wrapper.print_dataset_stats()`.

In [136]:
little_mallet_wrapper.print_dataset_stats(training_data)

Number of Documents: 2932
Mean Number of Words per Document: 147.3
Vocabulary Size: 19331
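The three figures above are simple counts over the processed texts. As a hedged sketch of how such statistics could be computed by hand (the function `toy_dataset_stats` and the toy documents are made up for illustration, and whitespace tokenization is an assumption):

```python
def toy_dataset_stats(documents):
    """Return (number of documents, mean words per document, vocabulary size)."""
    token_lists = [document.split() for document in documents]
    num_docs = len(token_lists)
    mean_words = sum(len(tokens) for tokens in token_lists) / num_docs if num_docs else 0.0
    vocabulary = {token for tokens in token_lists for token in tokens}
    return num_docs, mean_words, len(vocabulary)

toy_docs = ["won car quiz show", "car broke down", "quiz night"]
print(toy_dataset_stats(toy_docs))
# prints: (3, 3.0, 7)
```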


## Training the Topic Model

## Set Number of Topics

We need to make a variable `num_topics` and assign it the number of topics we want returned.

In [137]:
num_topics = 15

## Set Training Data

We already made a variable called `training_data`, which includes all of our processed Reddit post texts. The cell below simply sets it equal to itself, which changes nothing but makes the model's input explicit.

In [138]:
training_data = training_data

## Set Other MALLET File Paths

Then we're going to set a file path where we want all our MALLET topic modeling data to be dumped. I'm going to output everything into a folder called "topic-model-output" and a subfolder specific to the Reddit posts called "reddit."

All the other necessary file paths below `output_directory_path` will point to files inside this directory.

In [139]:
#Change to your desired output directory
output_directory_path = 'topic-model-output/reddit'

#No need to change anything below here
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

path_to_training_data           = f"{output_directory_path}/training.txt"
path_to_formatted_training_data = f"{output_directory_path}/mallet.training"
path_to_model                   = f"{output_directory_path}/mallet.model.{str(num_topics)}"
path_to_topic_keys              = f"{output_directory_path}/mallet.topic_keys.{str(num_topics)}"
path_to_topic_distributions     = f"{output_directory_path}/mallet.topic_distributions.{str(num_topics)}"

### Train Topic Model

Then we're going to train our topic model with `little_mallet_wrapper.quick_train_topic_model()`. The topic model should take about 1 to 1.5 minutes to train. If you like, you can look at your Terminal or PowerShell window to see what the model looks like as it's training.

In [None]:
little_mallet_wrapper.quick_train_topic_model(path_to_mallet,
                                             output_directory_path,
                                             num_topics,
                                             training_data)

When the topic model finishes, it will output your results to your `output_directory_path`.
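One quick way to confirm that training produced output is to list the files in the output directory. This is a small standard-library sketch; `list_output_files` is a hypothetical helper, and the file names you should expect are the ones defined in the path variables earlier.

```python
from pathlib import Path

def list_output_files(output_directory_path):
    """Return the sorted names of the files in a directory, or [] if it does not exist."""
    directory = Path(output_directory_path)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.iterdir() if path.is_file())

# Substitute your own output_directory_path here
print(list_output_files('topic-model-output/reddit'))
```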

## Display Topics and Top Words

To examine the 15 topics that the topic model extracted from the Reddit posts, run the cell below. This code uses the `little_mallet_wrapper.load_topic_keys()` function to read and process the MALLET topic model output from your computer, specifically the file "mallet.topic_keys.15".

In [157]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)

These are all the topics returned by the topic model. They are a little hard to read, so we are going to re-format them.

In [None]:
topics

## Your turn! 👩‍💻
Loop through every topic in `topics` and print out the topic number surrounded by emojis of your choice, as well as the list of top words (your output should basically look like the example below).

> ✨Topic 0✨

> ['post', 'phone', 'pictures', 'name', 'facebook', 'sent', 'social', 'account', 'number', 'posted', 'message', 'picture', 'media', 'call', 'messages', 'people', 'sarah', 'photos', 'reddit', 'found']

In [159]:
# your code here




## Load Topic Distributions

MALLET also calculates the likely mixture of these topics for every single Reddit post in the corpus. This mixture is really a probability distribution, that is, the probability that each topic exists in the document. We can use these probability distributions to examine which of the above topics are strongly associated with which specific posts.

To get the topic distributions, we're going to use the `little_mallet_wrapper.load_topic_distributions()` function, which will read and process the MALLET topic model output, specifically the file "mallet.topic_distributions.15". 

In [143]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)

In [None]:
topic_distributions[0]
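To make the idea of a topic mixture concrete, here is a toy example with made-up numbers, assuming each document's distribution is a list with one probability per topic. The helper `dominant_topic` is hypothetical.

```python
# Toy distribution over 4 topics for one document (made-up numbers)
toy_distribution = [0.05, 0.70, 0.15, 0.10]

def dominant_topic(distribution):
    """Return (topic_number, probability) for the most probable topic."""
    topic_number = max(range(len(distribution)), key=lambda i: distribution[i])
    return topic_number, distribution[topic_number]

# The probabilities in a distribution add up to 1 (within rounding)
print(round(sum(toy_distribution), 6))
print(dominant_topic(toy_distribution))
```

In this toy case the document is mostly "about" topic 1, with smaller contributions from the other three topics.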

## Display Top Titles Per Topic

We can also display the Reddit posts and titles that have the highest probability for every topic with the `little_mallet_wrapper.get_top_docs()` function.

In [145]:
training_data_reddit_titles = dict(zip(training_data, reddit_titles))
training_data_original_text = dict(zip(training_data, original_texts))
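The `dict(zip(...))` pattern pairs each processed text with the item at the same position in the other list, so a processed text can later be used to look up its title or original post. A toy sketch with made-up values shows the behavior, including one caveat: if two posts produce identical processed text, the later one overwrites the earlier one.

```python
toy_processed = ["won car quiz show", "car broke down", "won car quiz show"]
toy_titles = ["Title A", "Title B", "Title C"]

lookup = dict(zip(toy_processed, toy_titles))

print(lookup["car broke down"])     # prints: Title B
print(lookup["won car quiz show"])  # prints: Title C (the later duplicate wins)
print(len(lookup))                  # prints: 2
```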

We'll make our own function `display_top_titles_per_topic()` that will display the top Reddit post titles for every topic. This function accepts a given `topic_number` as well as a desired `number_of_documents` to display.

In [146]:
def display_top_titles_per_topic(topic_number=0, number_of_documents=5):
    
    print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):
        print(round(probability, 4), training_data_reddit_titles[document] + "\n")
    return

To display the top Reddit post titles with the highest probability of containing a topic, we will run:

In [None]:
display_top_titles_per_topic(topic_number=# your code here, number_of_documents=15)
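Conceptually, finding the top documents for a topic means pairing each document with its probability for that topic and sorting from highest to lowest. The sketch below illustrates that idea on toy data; `toy_top_docs` is a hypothetical stand-in and is not presented as the library's own code.

```python
def toy_top_docs(documents, distributions, topic_number, n=2):
    """Return the n (probability, document) pairs with the highest probability for a topic."""
    scored = [(distribution[topic_number], document)
              for document, distribution in zip(documents, distributions)]
    return sorted(scored, reverse=True)[:n]

toy_documents = ["doc a", "doc b", "doc c"]
toy_distributions = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]

print(toy_top_docs(toy_documents, toy_distributions, topic_number=0))
# prints: [(0.9, 'doc b'), (0.5, 'doc c')]
```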

## Display Topic Words in Context of Original Text

To display the original Reddit post texts that rank highly for a given topic, with the relevant topic words **bolded** for emphasis, we are going to make the function `display_bolded_topic_words_in_context()`.

In the cell below, we're importing two Jupyter notebook display tools, `Markdown` and `display`, which will allow us to make the relevant topic words **bolded**, as well as the regular expressions library `re`, which will allow us to find and replace the correct words.

In [120]:
from IPython.display import Markdown, display
import re

def display_bolded_topic_words_in_context(topic_number=3, number_of_documents=3, custom_words=None):

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):
        
        print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")
        
        probability = f"✨✨✨\n\n**{probability}**"
        reddit_title = f"**{training_data_reddit_titles[document]}**"
        original_text = training_data_original_text[document]
        topic_words = topics[topic_number]
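        topic_words = custom_words if custom_words is not None else topic_words

        for word in topic_words:
            if word in original_text:
                # Escape the word so that any regex metacharacters are matched literally
                original_text = re.sub(f"\\b{re.escape(word)}\\b", f"**{word}**", original_text)

        # Render the probability, title, and post text as Markdown
        display(Markdown(probability))
        display(Markdown(reddit_title))
        display(Markdown(original_text))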
    return

To display the top original Reddit posts with the highest probability of containing a topic and with relevant topic words bolded, we will run:

In [None]:
display_bolded_topic_words_in_context(topic_number= # your code here, number_of_documents=10)
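The bolding step relies on the regular expression word boundary `\b`, which matches whole words only. Here is a small standalone sketch of that step; `bold_words` is a hypothetical helper, and the sample sentence is made up.

```python
import re

def bold_words(text, words):
    """Wrap each whole-word occurrence of the given words in Markdown bold markers."""
    for word in words:
        # re.escape guards against words that contain regex metacharacters
        text = re.sub(rf"\b{re.escape(word)}\b", f"**{word}**", text)
    return text

print(bold_words("My phone rang; the phone number was unknown.", ["phone", "number"]))
# prints: My **phone** rang; the **phone** **number** was unknown.
```

Note that the match is case-sensitive and whole-word, so "Phone" and "phones" would not be bolded by the word "phone".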

## Your turn! 👩‍💻
Come up with a topic label for 2 topics by inspecting the top titles and posts for that topic. Why did you choose these topic labels? What, if anything, was hard about this process? Answer in at least 2-3 sentences on Canvas for today's lecture participation.