# Class 13 Exercises — Topic Modeling

In these lessons, we're learning about a text analysis method called *topic modeling*. This method will help us identify the main topics or discourses within a collection of texts.

In this particular lesson, we're going to use [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), a Python wrapper for [MALLET](http://mallet.cs.umass.edu/topics.php), to topic model 379 obituaries published by *The New York Times*. This dataset is based on data originally collected by Matt Lavin for his *Programming Historian* [TF-IDF tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset). I have re-scraped the obituaries so that the subject's name and death year are included in each text file name, and I have added 13 more ["Overlooked"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries, including [Karen Spärck Jones](https://www.nytimes.com/2019/01/02/obituaries/karen-sparck-jones-overlooked.html), the computer scientist who introduced inverse document frequency, the "IDF" in TF-IDF.

___

## Install and Import Packages

To use `little_mallet_wrapper` outside JupyterHub, you will need to install it.

In [None]:
#!pip install little_mallet_wrapper

Import `little_mallet_wrapper` and the data viz library `seaborn`. We're also going to import [`glob`](https://docs.python.org/3/library/glob.html) and [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) for working with files and the file system.

In [6]:
import little_mallet_wrapper
import seaborn

import glob
from pathlib import Path

## Process Texts

First, we're going to review how to open and read text files with Python, and we're going to test out one of `little_mallet_wrapper`'s functions. This workflow will be essential for pre-processing the NYT obituaries for topic modeling.
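
As a refresher, reading a text file usually looks something like the sketch below. The file name and contents are made up and written to a temporary directory so that the example runs anywhere; for the exercise, you would point at the Ada Lovelace filepath instead.

```python
import tempfile
from pathlib import Path

# Write a small stand-in file so the example runs anywhere;
# the file name and contents are invented for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "1900-Sample-Person.txt"
    sample_path.write_text("A short sample obituary.\n\nSecond paragraph.", encoding="utf-8")

    # Option 1: open() inside a with-block, which closes the file automatically
    with open(sample_path, encoding="utf-8") as file_handle:
        text_from_open = file_handle.read()

    # Option 2: pathlib reads the whole file in a single call
    text_from_path = sample_path.read_text(encoding="utf-8")

print(text_from_open)
```

Both options return the full contents of the file as one string, so either can be assigned to a variable such as `text`.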

Open and read the text file for Ada Lovelace's obituary, which has the following relative filepath: "../NYT-Obituaries/1852-Ada-Lovelace.txt"

In [27]:
#Your code here

'A gifted mathematician who is now recognized as the first computer programmer.By CLAIRE CAIN MILLER\n\n A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843. It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves.” The computer she was writing about, the British inventor Charles Babbage’s Analytical Engine, was never built. But her writings about computing have earned Lovelace — who died of uterine cancer in 1852 at 36 — recognition as the first computer programmer. \n\n The program she wrote for the Analytical Engine was to calculate the seventh Bernoulli number. (Bernoulli numbers, named after the Swiss mathematician Jacob Bernoulli, are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculating nu

Assign the contents of Ada Lovelace's obituary to the variable `text`

In [28]:
text = #Your code here

Now run `little_mallet_wrapper.process_string()` on `text`

In [29]:
little_mallet_wrapper.process_string(text, numbers='remove')

'gifted mathematician recognized first computer programmer claire cain miller century dawn computer age ada lovelace imagined modern day general purpose computer could programmed follow instructions wrote could calculate also create weaves algebraic patterns jacquard loom weaves flowers leaves computer writing british inventor charles babbage analytical engine never built writings computing earned lovelace died uterine cancer recognition first computer programmer program wrote analytical engine calculate seventh bernoulli number bernoulli numbers named swiss mathematician jacob bernoulli used many different areas mathematics deeper influence see potential computing machines could beyond calculating numbers said understand symbols used create music art insight would become core concept digital age walter isaacson wrote book innovators piece content data information music text pictures numbers symbols sounds video could expressed digital form manipulated machines also explored ramificati

## Working with Multiple Text Files — Glob to the Rescue!

To topic model the *NYT* obituaries, we need to process and work with hundreds of text files. But how can we work with multiple text files at the same time?

This is where the `glob` library comes in handy. We can use `glob.glob()` to get a list of filepaths that match a certain pattern. Below, we are matching any file (`*`) in the "NYT-Obituaries" directory that has the file extension `.txt`

In [3]:
glob.glob( "../NYT-Obituaries/*.txt")

['../texts/history/NYT-Obituaries/1945-Adolf-Hitler.txt',
 '../texts/history/NYT-Obituaries/1915-F-W-Taylor.txt',
 '../texts/history/NYT-Obituaries/1975-Chiang-Kai-shek.txt',
 '../texts/history/NYT-Obituaries/1984-Ethel-Merman.txt',
 '../texts/history/NYT-Obituaries/1953-Jim-Thorpe.txt',
 '../texts/history/NYT-Obituaries/1964-Nella-Larsen.txt',
 '../texts/history/NYT-Obituaries/1955-Margaret-Abbott.txt',
 '../texts/history/NYT-Obituaries/1984-Lillian-Hellman.txt',
 '../texts/history/NYT-Obituaries/1959-Cecil-De-Mille.txt',
 '../texts/history/NYT-Obituaries/1928-Mabel-Craty.txt',
 '../texts/history/NYT-Obituaries/1973-Eddie-Rickenbacker.txt',
 '../texts/history/NYT-Obituaries/1989-Ferdinand-Marcos.txt',
 '../texts/history/NYT-Obituaries/1991-Martha-Graham.txt',
 '../texts/history/NYT-Obituaries/1997-Deng-Xiaoping.txt',
 '../texts/history/NYT-Obituaries/1938-George-E-Hale.txt',
 '../texts/history/NYT-Obituaries/1885-Ulysses-Grant.txt',
 '../texts/history/NYT-Obituaries/1909-Sarah-Orne-Je

Below, we are matching any files in the "NYT-Obituaries" directory that start with 1945 (`1945*`) and have the file extension `.txt`

In [32]:
glob.glob( "../NYT-Obituaries/1945*.txt")

['../NYT-Obituaries/1945-Adolf-Hitler.txt',
 '../NYT-Obituaries/1945-Bela-Bartok.txt',
 '../NYT-Obituaries/1945-Ernie-Pyle.txt',
 '../NYT-Obituaries/1945-Harry-S-Truman.txt',
 '../NYT-Obituaries/1945-George-Patton.txt',
 '../NYT-Obituaries/1945-FDR.txt',
 '../NYT-Obituaries/1945-Jerome-Kern.txt']

Below, we are matching any files in the "NYT-Obituaries" directory that start with 1990 (`1990*`) and have the file extension `.txt`

In [33]:
glob.glob( "../NYT-Obituaries/1990*.txt")

['../NYT-Obituaries/1990-Leonard-Bernstein.txt',
 '../NYT-Obituaries/1990-Erte.txt',
 '../NYT-Obituaries/1990-Ralph-David-Abernathy.txt',
 '../NYT-Obituaries/1990-Rex-Harrison.txt',
 '../NYT-Obituaries/1990-Greta-Garbo.txt',
 '../NYT-Obituaries/1990-Sammy-Davis-Jr.txt']
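
The patterns above can also be tried on a throwaway directory. In this sketch the file names are invented placeholders, and `sorted()` is added because `glob.glob()` does not promise any particular order.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty-ish placeholder files (hypothetical names)
    for name in ["1945-Person-A.txt", "1945-Person-B.txt", "1990-Person-C.txt", "notes.md"]:
        (Path(tmp_dir) / name).write_text("placeholder", encoding="utf-8")

    # Every .txt file in the directory
    all_txt = sorted(glob.glob(f"{tmp_dir}/*.txt"))
    # Only the .txt files whose names start with 1945
    only_1945 = sorted(glob.glob(f"{tmp_dir}/1945*.txt"))

    # Keep just the file names so the output is easy to read
    all_txt_names = [Path(path).name for path in all_txt]
    names_1945 = [Path(path).name for path in only_1945]

print(all_txt_names)
print(names_1945)
```

Note that `notes.md` is skipped by both patterns because it does not end in `.txt`.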

Save the file paths of all the text files in "NYT-Obituaries" to the variable `filenames`

In [49]:
directory = "../NYT-Obituaries"
filenames = glob.glob(f"{directory}/*.txt")

## Get Training Data From Text Files

Next, we need to process our texts with the function `little_mallet_wrapper.process_string()`. This function takes the text of a file as a string and transforms it to lowercase; it also removes stopwords and punctuation, and, when called with `numbers='remove'` as we did above, numbers.
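
To get a feel for what these cleaning steps do, here is a rough approximation written with only the standard library. It is not Little MALLET Wrapper's actual implementation: the function name `rough_process_string` and the tiny stopword list are invented for illustration, and the real library uses a much longer stopword list.

```python
import re

# A tiny, made-up stopword list; the real library ships a much longer one.
SAMPLE_STOPWORDS = {"a", "the", "of", "in", "who", "is", "as", "at", "now"}

def rough_process_string(text, stopwords=SAMPLE_STOPWORDS):
    """Approximate the cleaning steps described above (not the library's code)."""
    text = text.lower()                      # lowercase everything
    text = re.sub(r"\d+", " ", text)         # drop runs of digits
    text = re.sub(r"[^\w\s]", " ", text)     # replace punctuation with spaces
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)

sample = "Ada Lovelace, who died in 1852 at 36, is now recognized as the first computer programmer."
processed_sample = rough_process_string(sample)
print(processed_sample)
```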

Complete the code below to iterate through all the filenames, open and read each file, process the text, and then add the processed text to the list `training_data`.

In [None]:
#Make a list of all pre-processed NYT obituaries
training_data = []

for file in ...
    text = # Your code here — open and read text file
    processed_text = # Your code here — process text with little mallet wrapper
    # Your code here — add the pre-processed text 

We will also make a master list of the original text of the obituaries for future reference.

In [23]:
#Make a list of all original NYT obituaries (not pre-processed)
original_texts = []

for file in filenames:
    text = # Your code here — open and read text file
    # Your code here — add the original text 

Here, we will extract the relevant part of each file name by using [`Path().stem`](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem), which conveniently extracts just the last part of the file path without the ".txt" file extension.

In [65]:
obit_titles = [Path(file).stem for file in filenames]
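
To see what `.stem` returns compared with its neighbors `.name` and `.suffix`, here is a small sketch using the Ada Lovelace filepath from earlier. The final split into `year` and `name` is an optional extra, not something the rest of the lesson relies on.

```python
from pathlib import Path

sample_path = Path("../NYT-Obituaries/1852-Ada-Lovelace.txt")

print(sample_path.name)    # file name with its extension
print(sample_path.stem)    # file name without the final extension
print(sample_path.suffix)  # just the extension

# The year and the name can then be separated on the first hyphen
year, name = sample_path.stem.split("-", 1)
print(year, name)
```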

Examine the first item in the list:

In [None]:
obit_titles#Your code here

## Setting Up and Training the Topic Model

We can get training data summary statistics by using the function `print_dataset_stats()`.

In [None]:
little_mallet_wrapper.print_dataset_stats(training_data)
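
If you are curious how summary statistics like these can be computed, the sketch below calculates a document count, a mean document length, and a vocabulary size by hand. The two "processed" documents are made up, and the exact statistics and labels that `print_dataset_stats()` reports may differ from this approximation.

```python
# Two tiny, made-up "processed" documents standing in for training_data
sample_training_data = [
    "gifted mathematician first computer programmer",
    "composer conductor music orchestra music",
]

# Processed texts are space-separated words, so split() gives the tokens
token_lists = [document.split() for document in sample_training_data]

num_documents = len(token_lists)
mean_words_per_document = sum(len(tokens) for tokens in token_lists) / num_documents
# A set keeps each distinct word only once
vocabulary_size = len({token for tokens in token_lists for token in tokens})

print(f"Number of documents: {num_documents}")
print(f"Mean words per document: {mean_words_per_document:.1f}")
print(f"Vocabulary size: {vocabulary_size}")
```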

Since Little MALLET Wrapper is a Python package built around MALLET, we need to tell it where the bigger, Java-based MALLET lives (`path_to_mallet`), and we also need to give permission to MALLET to run (`chmod +x`).

In [None]:
!chmod +x '/home/jovyan/course_materials/mallet/bin/mallet'

<div class="admonition note" name="html-admonition" style="background: orange; padding: 10px">
<p class="title">Note</p>
Make sure you run the cell above or MALLET/Little MALLET Wrapper won't work! You should only have to run it once.
</div>
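
If you want to confirm from Python that a file has its execute permission set, one possible check uses the standard `os` and `stat` modules. This sketch works on a temporary stand-in file rather than the real MALLET binary, and it only adds the execute bit for the file's owner (similar to `chmod u+x`), which is an assumption about what you need.

```python
import os
import stat
import tempfile

# A temporary stand-in file; on JupyterHub the real target would be the MALLET binary
with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
    stand_in_path = tmp_file.name

def owner_can_execute(path):
    """Return True if the owner's execute bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)

executable_before = owner_can_execute(stand_in_path)

# Add the execute bit for the file's owner while keeping the other permissions
current_mode = os.stat(stand_in_path).st_mode
os.chmod(stand_in_path, current_mode | stat.S_IXUSR)

executable_after = owner_can_execute(stand_in_path)
print(executable_before, executable_after)

# Clean up the stand-in file
os.remove(stand_in_path)
```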

Below we set the number of topics we want to return, as well as file paths for our topic modeling results. If you'd like to change this output location, simply change `output_directory_path` below.

In [53]:
num_topics = 15

#path_to_mallet = '/home/jovyan/course_materials/mallet/bin/mallet'
path_to_mallet = '../../../mallet/bin/mallet'
#No need to change anything below here
training_data = training_data

#Set output directory
output_directory_path = 'topic-model-output/'

#Create output directory
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

#Create output files
path_to_training_data           = f"{output_directory_path}/training.txt"
path_to_formatted_training_data = f"{output_directory_path}/mallet.training"
path_to_model                   = f"{output_directory_path}/mallet.model.{str(num_topics)}"
path_to_topic_keys              = f"{output_directory_path}/mallet.topic_keys.{str(num_topics)}"
path_to_topic_distributions     = f"{output_directory_path}/mallet.topic_distributions.{str(num_topics)}"
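
Because `output_directory_path` ends in a slash, the f-strings above produce paths with a doubled slash such as `topic-model-output//training.txt`. That is usually harmless, but if you prefer tidier paths, one alternative is to build them with `pathlib`. This is only a sketch: whether Little MALLET Wrapper accepts `Path` objects directly is not something this lesson confirms, so the example converts back to plain strings.

```python
from pathlib import Path

num_topics = 15
output_directory_path = "topic-model-output/"

# The f-string approach keeps both slashes
f_string_path = f"{output_directory_path}/training.txt"

# pathlib normalizes the trailing slash and joins with a single separator
output_directory = Path(output_directory_path)
path_to_topic_keys = output_directory / f"mallet.topic_keys.{num_topics}"
path_to_topic_distributions = output_directory / f"mallet.topic_distributions.{num_topics}"

# as_posix() gives a forward-slash string on every operating system
topic_keys_string = path_to_topic_keys.as_posix()

print(f_string_path)
print(topic_keys_string)
```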

Finally, we can train our topic model with `little_mallet_wrapper.quick_train_topic_model()`.

In [None]:
little_mallet_wrapper.quick_train_topic_model(path_to_mallet,
                                              output_directory_path,
                                              num_topics,
                                              training_data)

## Display Topics and Top Words

To examine the 15 topics that the topic model extracted from the *NYT* obituaries, we can use `load_topic_keys()`.

The `load_topic_keys()` function will read and process the MALLET topic model output from your JupyterHub, specifically the file "mallet.topic_keys.15".

In [None]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)
topics

Complete the code below to iterate through the list `topics` and print out each topic number and list of topic words.

In [31]:
for ___, ____ in enumerate(topics):
    print(f"✨Topic {number}✨\n\n{topic}\n")

✨Topic 0✨

['art', 'paris', 'moses', 'picasso', 'work', 'artist', 'york', 'painting', 'new', 'schulz', 'pictures', 'modern', 'wright', 'works', 'disney', 'brown', 'painter', 'charlie', 'museum', 'arts']

✨Topic 1✨

['music', 'band', 'jazz', 'piano', 'sinatra', 'musical', 'composer', 'goodman', 'new', 'york', 'orchestra', 'concert', 'stravinsky', 'armstrong', 'bernstein', 'musicians', 'playing', 'davis', 'style', 'played']

✨Topic 2✨

['university', 'professor', 'research', 'science', 'institute', 'scientific', 'oppenheimer', 'atomic', 'prize', 'society', 'also', 'human', 'dewey', 'theory', 'nobel', 'work', 'vaccine', 'scientist', 'received', 'freud']

✨Topic 3✨

['one', 'years', 'first', 'said', 'time', 'new', 'later', 'would', 'world', 'life', 'two', 'many', 'made', 'man', 'year', 'also', 'could', 'became', 'american', 'old']

✨Topic 4✨

['black', 'court', 'justice', 'rights', 'warren', 'king', 'negro', 'civil', 'white', 'law', 'supreme', 'case', 'marshall', 'said', 'negroes', 'blacks
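
As a reminder of how `enumerate()` works, it pairs each item in a list with its position, starting from 0. The sketch below uses a shortened two-topic list with words borrowed from the output above; the variable names are arbitrary.

```python
# A shortened stand-in for the list returned by load_topic_keys()
sample_topics = [
    ["art", "paris", "painting"],
    ["music", "band", "jazz"],
]

lines = []
for topic_number, topic_words in enumerate(sample_topics):
    # topic_number is the position (0, 1, ...); topic_words is the list at that position
    lines.append(f"Topic {topic_number}: {', '.join(topic_words)}")

for line in lines:
    print(line)
```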

## Load Topic Distributions

MALLET also calculates the likely mixture of these topics for every single obituary in the corpus. This mixture is called a probability distribution.

To get the topic distributions, we're going to use the `little_mallet_wrapper.load_topic_distributions()` function, which will read and process the MALLET topic model output, specifically the file "mallet.topic_distributions.15". 

In [62]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)

If we look at the topic distribution at index 32 in `topic_distributions` (the 33rd item, since Python counts from 0), which corresponds to Marilyn Monroe's obituary, we will see a list of 15 probabilities. This list gives the estimated share of each of the 15 topics in Marilyn Monroe's obituary.

In [None]:
topic_distributions[32]
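
Since `glob.glob()` does not guarantee any particular file order, a given obituary may sit at a different index on your machine. One possible way around this is to look up a distribution by title instead of by position. The sketch below uses made-up three-topic distributions in place of the real `obit_titles` and `topic_distributions`.

```python
# Made-up titles and 3-topic distributions standing in for
# obit_titles and topic_distributions
sample_titles = ["1852-Ada-Lovelace", "1962-Marilyn-Monroe", "1991-Miles-Davis"]
sample_distributions = [
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
]

# Pair each title with its distribution so lookups do not depend on list position
distribution_by_title = dict(zip(sample_titles, sample_distributions))

monroe_distribution = distribution_by_title["1962-Marilyn-Monroe"]

# The probabilities for one document should add up to (approximately) 1
total = sum(monroe_distribution)

# Index of the topic with the largest share in this document
top_topic = max(range(len(monroe_distribution)), key=lambda i: monroe_distribution[i])

print(monroe_distribution, round(total, 6), top_topic)
```

This relies on the titles and the distributions being in the same order, which holds here as long as both were built from the same `filenames` list.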

## Explore Heatmap of Topics and Texts

We can visualize and compare these topic probability distributions with a heatmap by using the `little_mallet_wrapper.plot_categories_by_topics_heatmap()` function.

We have everything we need for the heatmap except for our list of `target_labels`, the sample of texts that we’d like to visualize and compare with the heatmap. Below we make our list of desired target labels.

In [67]:
target_labels = ['1852-Ada-Lovelace', '1885-Ulysses-Grant',
                 '1900-Nietzsche', '1931-Ida-B-Wells', '1940-Marcus-Garvey',
                 '1941-Virginia-Woolf', '1954-Frida-Kahlo', '1962-Marilyn-Monroe',
                 '1963-John-F-Kennedy', '1964-Nella-Larsen', '1972-Jackie-Robinson',
                 '1973-Pablo-Picasso', '1984-Ray-A-Kroc','1986-Jorge-Luis-Borges', '1991-Miles-Davis',
                 '1992-Marsha-P-Johnson', '1993-Cesar-Chavez']

# If you'd like to make a random list of target labels, you can uncomment and run the code below.
#import random
#target_labels = random.sample(obit_titles, 10)

In [None]:
little_mallet_wrapper.plot_categories_by_topics_heatmap(obit_titles,
                                      topic_distributions,
                                      topics, 
                                      output_directory_path + '/categories_by_topics.pdf',
                                      target_labels=target_labels,
                                      dim= (13, 9)
                                     )

## Examine Top Documents

The functions below will allow us to find the documents (the original texts of the NYT obituaries or their titles) that have the highest probability of containing certain topics.

In [69]:
from IPython.display import Markdown, display
import re

def make_md(string):
    """A function that transforms string data into Markdown
    so it can be nicely formatted with bolding and emojis
    """
    display(Markdown(str(string)))

def get_top_docs(docs, topic_distributions, topic_index, n=5):
    
    """A function that shows the top documents for a given set of topic distributions
    and a specific topic number
    """
    
    sorted_data = sorted([(_distribution[topic_index], _document) for _distribution, _document in zip(topic_distributions, docs)], reverse=True)
    topic_words = topics[topic_index]
    make_md(f"### ✨Topic {topic_index}✨\n\n{topic_words}\n\n---")
    
    for probability, doc in sorted_data[:n]:
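        # Make topic words bolded (whole words, any capitalization, original case kept)
        for word in topic_words:
            if word in doc.lower():
                # flags is passed by keyword because the fourth positional argument of re.sub is count
                doc = re.sub(rf"\b{re.escape(word)}\b", r"**\g<0>**", doc, flags=re.IGNORECASE)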
        make_md(f'✨  \n**Topic Probability**: {probability}  \n**Document**: {doc}\n\n')
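
The two ideas inside `get_top_docs()`, ranking documents by one topic's probability and bolding topic words, can be tried separately on toy data. In this sketch the distributions are made up, and `rank_docs` and `bold_words` are hypothetical helper names rather than part of Little MALLET Wrapper.

```python
import re

# Made-up 2-topic distributions for three obituary titles
sample_titles = ["1973-Pablo-Picasso", "1991-Miles-Davis", "1954-Frida-Kahlo"]
sample_distributions = [
    [0.80, 0.20],
    [0.10, 0.90],
    [0.65, 0.35],
]

def rank_docs(docs, distributions, topic_index, n=2):
    """Return the n (probability, doc) pairs with the highest weight for one topic."""
    pairs = [(distribution[topic_index], doc) for distribution, doc in zip(distributions, docs)]
    return sorted(pairs, reverse=True)[:n]

def bold_words(text, words):
    """Wrap whole-word, case-insensitive matches in Markdown bold, keeping original case."""
    for word in words:
        text = re.sub(rf"\b{re.escape(word)}\b", r"**\g<0>**", text, flags=re.IGNORECASE)
    return text

top_for_topic_0 = rank_docs(sample_titles, sample_distributions, topic_index=0)
bolded = bold_words("Art in Paris shaped his art.", ["art", "paris"])

print(top_for_topic_0)
print(bolded)
```

Sorting the `(probability, doc)` tuples with `reverse=True` puts the highest probability first, and `\g<0>` in the replacement re-inserts the matched text exactly as it appeared.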

Examine the top 5 NYT obituary article titles for Topic 0

In [None]:
get_top_docs(obit_titles, topic_distributions, topic_index=0, n=5)

Examine the top 5 NYT obituary articles for Topic 0

In [None]:
get_top_docs(original_texts, topic_distributions, topic_index=0, n=5)

## Your Turn!

Come up with labels for 5 of the 15 topics. To do so, examine the top obituaries for that topic as well as the words in context.

In [None]:
get_top_docs(original_texts, topic_distributions, topic_index=YOUR-TOPIC-NUMBER, n=5)

**Topic *Number Here***: First Five Topic Words Here  
**Topic Label**: Topic Label Here  

**Topic *Number Here***: First Five Topic Words Here  
**Topic Label**: Topic Label Here  

**Topic *Number Here***: First Five Topic Words Here  
**Topic Label**: Topic Label Here  

**Topic *Number Here***: First Five Topic Words Here  
**Topic Label**: Topic Label Here  

**Topic *Number Here***: First Five Topic Words Here  
**Topic Label**: Topic Label Here  

**Reflection**

Why did you label your topics the way you did? What made labeling them easy or hard? Where could you go next with this analysis?