# Topic Modeling Part 2 Workbook

In these lessons, we're learning about a text analysis method called *topic modeling*. This method will help us identify the main topics or discourses within a collection of texts.

___

## Dataset

In this particular lesson, we're going to use [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), a Python wrapper for [MALLET](http://mallet.cs.umass.edu/topics.php), to topic model a CSV file of 2,932 Reddit posts from the subreddit [r/AmITheAsshole](https://www.reddit.com/r/AmItheAsshole/), each with an upvote score of at least 2,000. This is an online forum where people share their personal conflicts and ask the community to judge who's the a**hole in the story. This data was collected with PSAW, a wrapper for the Pushshift API.

___

## Does Java/Mallet Work on Your Computer?

### Java

Do you have the Java Development Kit installed and configured properly? Test it out by running this command:

In [None]:
!javac

- If you get a long message, `Usage: javac <options> <source files>...`, then JDK is installed and set up!  
- If you get some version of the message, `javac: command not found`, then it is not installed or configured properly. If you're a Mac user, make sure you downloaded and installed the JDK. If you're a Windows user, make sure you downloaded and installed the JDK and then changed your PATH environment variable, [as described here](https://info1350.github.io/Intro-CA-SP21/05-Text-Analysis/06-Topic-Modeling-Set-Up.html#windows).
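If you'd rather check from Python than read the shell message, here is a minimal sketch using the standard library's `shutil.which()`. The helper name `find_javac` is made up for this example, and the check only tells you whether `javac` is on your PATH, not whether the JDK is otherwise configured correctly.

```python
import shutil

def find_javac():
    """Return the full path to javac if it is on PATH, otherwise None."""
    return shutil.which("javac")

javac_path = find_javac()
if javac_path:
    print(f"javac found at: {javac_path}")
else:
    print("javac not found on PATH; check your JDK installation.")
```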

### Mallet

Now you need to locate the correct filepath for Mallet. Where is Mallet downloaded and installed on your computer?

Mac users should be able to use the version of Mallet in this directory, e.g. `mallet-2.0.8/bin/mallet`. Windows users need to have an environment variable for the Mallet filepath, as described in the set-up instructions, so it may be set up somewhere else, e.g. `C:/mallet-2.0.8/bin/mallet`.

We also discovered that you may need to explicitly give the Mallet file execute permission, which you can do with the `chmod +x` command below (see ["Setting Permissions,"](https://launchschool.com/books/command_line/read/permissions#settingpermissions) The Launch School). Be sure to replace the filepaths below with your correct filepath for Mallet.

Allow execute permissions for Mallet:

In [139]:
!chmod +x mallet-2.0.8/bin/mallet

Test to see if Mallet is recognized and working. If it is, you should get the message "A tool for creating instance lists of feature vectors from comma-separated-values...":

In [None]:
!mallet-2.0.8/bin/mallet import-file

If Mallet is recognized and working above, set `path_to_mallet` to that exact filepath:

In [25]:
path_to_mallet = 'mallet-2.0.8/bin/mallet'
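Before training, it can help to confirm that this filepath really points at an executable file. The sketch below is one way to do that with the standard library; `check_mallet_path` is a hypothetical helper written for this workbook, not part of Little MALLET Wrapper, and the executable check may behave differently on Windows.

```python
import os
from pathlib import Path

def check_mallet_path(path_to_mallet):
    """Return 'missing', 'not executable', or 'ok' for the given filepath."""
    mallet = Path(path_to_mallet)
    if not mallet.is_file():
        return "missing"
    if not os.access(mallet, os.X_OK):
        return "not executable"
    return "ok"

# Replace this with your own Mallet filepath
print(check_mallet_path('mallet-2.0.8/bin/mallet'))
```

If this prints `missing`, double-check the filepath; if it prints `not executable`, re-run the `chmod +x` command above.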

## Install and Import Packages

Install `little_mallet_wrapper` and the data viz library `seaborn`:

In [None]:
!pip install little_mallet_wrapper
!pip install seaborn

Now let's `import` the `little_mallet_wrapper` and the data viz library `seaborn`. We're also going to import `pandas` and [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) for working with files and the file system.

In [117]:
import little_mallet_wrapper
import seaborn as sns
import pandas as pd
from pathlib import Path

## Get Training Data From CSV File

Before we topic model the Reddit posts, we need to process the posts and prepare them for analysis. The steps below demonstrate how to process texts if they come from a CSV file.

In [141]:
reddit_df = pd.read_csv("../data/top-reddit-aita-posts.csv")

Convert every value in the `selftext` column to a string (note that this also turns any missing values into the text "nan"):

In [6]:
reddit_df['selftext'] = reddit_df['selftext'].astype(str)

### Process Reddit Posts and Titles

Next we're going to process our texts with the function `little_mallet_wrapper.process_string()`.

This function will take every individual post, transform all the text to lowercase as well as remove stopwords, punctuation, and numbers, and then add the processed text to our master list `training_data`.

In [7]:
training_data = [little_mallet_wrapper.process_string(text, numbers='remove') for text in reddit_df['selftext']]
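To get a feel for what this kind of processing does to a post, here is a simplified stand-in written with only the standard library. It is not the actual implementation of `process_string()`: the function name `toy_process_string` and the tiny stopword list are assumptions made for illustration, and the real library uses a much longer stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption, not the library's list)
TOY_STOPWORDS = {"i", "my", "the", "a", "to", "and", "was", "it", "for"}

def toy_process_string(text):
    """Lowercase, drop punctuation and digits, then drop stopwords and 1-letter tokens."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS and len(t) > 1]
    return " ".join(tokens)

processed_example = toy_process_string("I paid $200 for my sister's dinner, and it was AWFUL!")
print(processed_example)
```

Notice that splitting on punctuation breaks contractions apart, which is likely why fragments such as "doesn" and "didn" can show up among topic words.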

We're also going to make a list of the original texts.

In [81]:
original_texts = [text for text in reddit_df['selftext']]

We're also going to extract the title of each Reddit post.

In [83]:
reddit_titles = [title for title in reddit_df['title']]

## Training the Topic Model

### Set Number of Topics

We need to make a variable `num_topics` and assign it the number of topics we want returned.

In [145]:
num_topics = 15

### Set Topic Model Output Files

Finally, we need to tell Little MALLET Wrapper where to find and output all of our topic modeling results. The code below will set Little MALLET Wrapper up to output your results inside a directory called "topic-model-output" and a subdirectory called "Reddit-Workbook", all of which will be inside your current directory.

If you'd like to change this output location, simply change `output_directory_path` below.

In [144]:
# Our training data is the list of pre-processed Reddit posts created above
training_data = training_data

#Change to your desired output directory
output_directory_path = 'topic-model-output/Reddit-Workbook'

#No need to change anything below here
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

path_to_training_data           = f"{output_directory_path}/training.txt"
path_to_formatted_training_data = f"{output_directory_path}/mallet.training"
path_to_model                   = f"{output_directory_path}/mallet.model.{str(num_topics)}"
path_to_topic_keys              = f"{output_directory_path}/mallet.topic_keys.{str(num_topics)}"
path_to_topic_distributions     = f"{output_directory_path}/mallet.topic_distributions.{str(num_topics)}"

### Train Topic Model

Now we can run our topic model with `little_mallet_wrapper.quick_train_topic_model()`.

In [124]:
#path_to_mallet = 'mallet-2.0.8/bin/mallet'

In [None]:
little_mallet_wrapper.quick_train_topic_model(path_to_mallet,
                                              output_directory_path,
                                              num_topics,
                                              training_data)

## Display Topics and Top Words

This code uses the `little_mallet_wrapper.load_topic_keys()` function to read and process the MALLET topic model output from your computer, specifically the file "mallet.topic_keys.15".

In [None]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)

for topic_number, topic in enumerate(topics):
    print(f"✨Topic {topic_number}✨\n\n{topic}\n")

## Load Topic Distributions

To get the topic distributions, we're going to use the `little_mallet_wrapper.load_topic_distributions()` function, which will read and process the MALLET topic model output, specifically the file "mallet.topic_distributions.15". 

In [147]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)
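One way to read these results is to find the single most probable topic for each document. The sketch below assumes that the topic distributions are a list with one entry per document, where each entry is a list of probabilities (one per topic). It uses made-up numbers rather than the real model output, and `dominant_topic` is a hypothetical helper.

```python
# Toy distributions for 3 documents over 3 topics (made-up numbers)
toy_topic_distributions = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
]

def dominant_topic(distribution):
    """Return the index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda i: distribution[i])

dominant = [dominant_topic(d) for d in toy_topic_distributions]
print(dominant)
```

With your own results, you could pass `topic_distributions` in place of the toy list.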

## Display Top Titles Per Topic

We can also display the Reddit posts that have the highest probability for every topic with the `little_mallet_wrapper.get_top_docs()` function.

Here we `zip` together the training data (pre-processed docs) and list of titles, as well as the training data and list of original texts. Then we turn them into dictionaries.

In [148]:
training_data_reddit_titles = dict(zip(training_data, reddit_titles))
training_data_original_text = dict(zip(training_data, original_texts))
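The small sketch below, using made-up strings, shows how `dict(zip(...))` pairs each processed text with its title. It also illustrates a caveat worth keeping in mind: if two posts end up with identical processed text, they share one dictionary key and only the last title is kept.

```python
# Made-up processed texts and titles; the first and third texts are identical
processed = ["paid dinner", "dog barked", "paid dinner"]
titles = ["Title A", "Title B", "Title C"]

lookup = dict(zip(processed, titles))

print(lookup["dog barked"])   # looks up the title for that processed text
print(len(lookup))            # fewer entries than titles because of the duplicate key
```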

Then we'll make our own function `display_top_titles_per_topic()` that will display the top text titles for every topic.

In [149]:
def display_top_titles_per_topic(topic_number=0, number_of_documents=5):
    
    print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):
        print(round(probability, 4), training_data_reddit_titles[document] + "\n")
    return

To display the top 5 Reddit post titles with the highest probability of containing Topic 0, we will run:

In [None]:
display_top_titles_per_topic(topic_number=0, number_of_documents=5)

## Display Topic Words in Context

To display the original Reddit post texts that rank highly for a given topic, with the relevant topic words **bolded** for emphasis, we are going to make the function `display_bolded_topic_words_in_context()`.

In [151]:
from IPython.display import Markdown, display
import re

def display_bolded_topic_words_in_context(topic_number=3, number_of_documents=3, custom_words=None):

    # Use the custom words if provided; otherwise use the topic's own top words
    topic_words = custom_words if custom_words is not None else topics[topic_number]

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):

        print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")

        probability = f"✨✨✨\n\n**{probability}**"
        post_title = f"**{training_data_reddit_titles[document]}**"
        original_text = training_data_original_text[document]

        # Bold each topic word wherever it appears as a whole word
        for word in topic_words:
            if word in original_text:
                original_text = re.sub(f"\\b{re.escape(word)}\\b", f"**{word}**", original_text)

        # Render the probability, title, and bolded text as Markdown
        display(Markdown(probability))
        display(Markdown(post_title))
        display(Markdown(original_text))
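To see the bolding step on its own, here is a small standalone sketch of whole-word matching with `re.sub()`. The helper `bold_words` is hypothetical, and unlike the function above it matches case-insensitively, which is a design choice made for this example.

```python
import re

def bold_words(text, words):
    """Wrap whole-word matches of any word in `words` in Markdown bold markers."""
    if not words:
        return text
    # \b ensures we only match whole words; re.escape guards against special characters
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.sub(pattern, r"**\1**", text, flags=re.IGNORECASE)

bolded_example = bold_words("My dog ate dinner. Dogs love dinner.", ["dog", "dinner"])
print(bolded_example)
```

Because of the `\b` word boundaries, "Dogs" is left alone while "dog" and both occurrences of "dinner" are bolded.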

To display the top 3 Reddit posts with the highest probability of containing Topic 0 and with relevant topic words bolded, we will run:

In [None]:
display_bolded_topic_words_in_context(topic_number=0, number_of_documents=3)

## Your Turn!

With your group, come up with labels for the selected topics below. For this exercise, we will all use the exact same topic model results, which can be found in `topic-model-output/Reddit-Class-Version`.

Run the cells below to load the class version of our topic model results.

In [166]:
output_directory_path_class = 'topic-model-output/Reddit-Class-Version'

path_to_topic_keys_class              = f"{output_directory_path_class}/mallet.topic_keys.15"
path_to_topic_distributions_class     = f"{output_directory_path_class}/mallet.topic_distributions.15"

In [167]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys_class)
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions_class)

## Topics to Label

**Topic 0**: 'time', 'work', 'get', 'day', 'going', 'home'  
**Topic Label**: ?

**Topic 1**: 'name', 'friends', 'people', 'post', 'even', 'group'  
**Topic Label**: ?

**Topic 2**: 'family', 'brother', 'years', 'parents', 'sister'  
**Topic Label**: ?

**Topic 3**: 'food', 'eat', 'dinner', 'eating', 'one'   
**Topic Label**: ?

**Topic 5**: 'like', 'really', 'doesn', 'think', 'things'    
**Topic Label**: ?

**Topic 6**: 'money', 'pay', 'would', 'get', 'buy', 'since'   
**Topic Label**: ?

**Topic 12**: 'husband', 'kids', 'dog', 'said', 'home'  
**Topic Label**: ?

**Topic 14**: 'car', 'back', 'get', 'didn', 'got', 'said'  
**Topic Label**: ?


To come up with a label for each topic, examine the top Reddit post titles for that topic as well as the words in context.

*Note: To use the functions below, you need to have run most of the cells above, except for the `quick_train_topic_model()` cell.*

In [None]:
display_top_titles_per_topic(topic_number=0, number_of_documents=5)

In [None]:
display_bolded_topic_words_in_context(topic_number=0, number_of_documents=4)