# Topic Modeling on London court cases

We're going to use a topic model to explore transcripts of court cases in London from 1820 to 1830. A topic model is similar to a document clustering algorithm, but instead of grouping together documents we're going to group together word *tokens*. A document can thus "belong" to multiple topics, and a word can be present in multiple topics.
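
To make the difference from document clustering concrete, here is a minimal sketch with made-up data. The tokens, the two topics, and the assignments are all hypothetical, not output from the model below. Each token carries its own topic label, so one document can mix topics and one word can show up under more than one topic.

```python
from collections import Counter, defaultdict

# Hypothetical toy document: each token is paired with a topic label.
# In the real model these labels are sampled, not written by hand.
tokens = ["watch", "pocket", "stole", "watch", "night", "street"]
token_topics = [0, 0, 0, 1, 1, 1]

# Share of the document's tokens assigned to each topic
topic_counts = Counter(token_topics)
doc_topic_share = {t: c / len(tokens) for t, c in topic_counts.items()}

# Which topics each distinct word appears in
word_topics = defaultdict(set)
for word, topic in zip(tokens, token_topics):
    word_topics[word].add(topic)

print(doc_topic_share)
print(sorted(word_topics["watch"]))
```

Here the document is split between two topics, and "watch" appears in both.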

This is a small portion of the Old Bailey corpus, originally from https://www.oldbaileyonline.org/, formatted and annotated by http://fedora.clarin-d.uni-saarland.de/oldbailey/.

**This is real history. Some of the content of these cases might feel closer to your real life than other fictional works we've studied. I have made an effort to not include some of the more nightmarish cases, but be ready.**

Some of you are having trouble installing or running the `cython` package. If so, find someone at your table who has got this working. Answer each question on your own, but work with your tablemate.

I have made some changes to the code from Monday that should make it possible to run the model-training cell multiple times without errors.

* *Before* looking at documents, take a moment to think about your expectations. What crimes will people be charged with? What sort of evidence will be presented? What will punishments be? Write this **on paper** with your name and netid.

Run the topic model. I have given you a minimal set of words to ignore to get you started.

* Sometimes we are interested in small words like "the" or "and". Here we're looking for more meaning-bearing words, so we're going to selectively remove small or overly frequent words from the collection. Go to the file `data/OldBailey/stoplist.txt` and add words to the "stoplist" (words to ignore). Rerun your model. You may want to repeat this process several times, updating the stoplist based on your new results. Describe several cases of words where you were not sure whether to keep them or not. Why were you uncertain? How might your analysis change depending on whether you removed those words or not? Come to a consensus at your table about stoplist edits, so that your results will be comparable in the next sections.

[Response here]
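
If you want a starting point for stoplist edits, one option is to list the words that occur in the most documents and are not yet on the stoplist. This is only a sketch: `stoplist_candidates`, `toy_documents`, and `toy_stoplist` are hypothetical names and data, not part of the notebook. A word that is frequent is not automatically uninformative, so treat the list as candidates to discuss.

```python
import re
from collections import Counter

word_pattern = re.compile(r"\w[\w\-\']*\w|\w")

# Hypothetical stand-ins for documents.txt and stoplist.txt
toy_documents = [
    "the prisoner took the watch from the shop",
    "the prisoner said the watch was his",
    "the witness saw the prisoner near the shop",
]
toy_stoplist = {"the", "from", "was", "his"}

def stoplist_candidates(docs, stoplist, n=5):
    """Return the n unstopped words that occur in the most documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so each word counts once per document
        doc_freq.update(set(word_pattern.findall(doc)) - stoplist)
    return doc_freq.most_common(n)

candidates = stoplist_candidates(toy_documents, toy_stoplist)
print(candidates)
```

In this toy data "prisoner" appears in every document. Whether a word like that belongs on the stoplist is exactly the kind of judgment call the question asks about.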

* Compare your topics to others at your table. Find four topics that are similar across all models. Record at least three variants for each similar topic. Use the `top_docs()` function to show the most prominent documents for these topics (save these executed cells). Describe the documents that have the largest proportion of each topic, and compare those documents to your table-mates' documents. Are they similar or different?

[Response here]

* Read about Mr. Trust's mugging in `documents[780]["original"]`. I've given you a cell that shows the most represented topics for this document. Do these topics represent the content of the document? What do they include and what do they miss? How do these topics compare to your table-mates'?

[Response here]

* Is this a useful way to look at a collection? What type of analysis does it support, and what would be difficult? What do you want to know about the Old Bailey corpus now, and what methods or tools would you use to find out more?

[Response here]

* Now that you have explored the documents through the topic model, what do you think of the original questions? What crimes were people charged with? What sort of evidence was presented? What punishments were applied? (Write your answer **on paper** in class.)

[Response here]



In [None]:
import numpy as np
%load_ext cython

import re, sys, random, math
from collections import Counter
from timeit import default_timer as timer
from IPython.display import display, clear_output, Markdown, Latex

from matplotlib import pyplot

## raw string so that \w, \- and \' reach the regex engine unchanged
word_pattern = re.compile(r"\w[\w\-\']*\w|\w")

In [None]:
source_directory = "../data/OldBailey"

num_topics = 30
doc_smoothing = 0.5
word_smoothing = 0.01

In [None]:
%%cython

from cython.view cimport array as cvarray
import numpy as np
import random
from timeit import default_timer as timer

class Document:
    
    def __init__(self, long[:] doc_tokens, long[:] doc_topics, long[:] topic_changes, long[:] doc_topic_counts):
        self.doc_tokens = doc_tokens
        self.doc_topics = doc_topics
        self.topic_changes = topic_changes
        self.doc_topic_counts = doc_topic_counts

cdef class TopicModel:
    
    cdef long[:] topic_totals
    cdef long[:,:] word_topics
    cdef int num_topics
    cdef int vocab_size
    
    cdef double[:] topic_probs
    cdef double[:] topic_normalizers
    cdef float doc_smoothing
    cdef float word_smoothing
    cdef float smoothing_times_vocab_size
    
    documents = []
    vocabulary = []
    
    def __init__(self, num_topics, vocabulary, doc_smoothing, word_smoothing):
        self.num_topics = num_topics
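        ## `vocabulary` is shared at the class level, so empty it before refilling;
        ## otherwise words from earlier runs stay at the front of the list
        self.vocabulary.clear()
        self.vocabulary.extend(vocabulary)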
        self.vocab_size = len(vocabulary)
        
        self.doc_smoothing = doc_smoothing
        self.word_smoothing = word_smoothing
        self.smoothing_times_vocab_size = word_smoothing * self.vocab_size
        
        self.topic_totals = np.zeros(num_topics, dtype=int)
        self.word_topics = np.zeros((self.vocab_size, num_topics), dtype=int)
    
    def clear_documents(self):
        self.documents.clear()
    
    def add_document(self, doc):
        cdef int word_id, topic
        
        self.documents.append(doc)
        
        for i in range(len(doc.doc_tokens)):
            word_id = doc.doc_tokens[i]
            topic = random.randrange(self.num_topics)
            doc.doc_topics[i] = topic
            
            self.word_topics[word_id,topic] += 1
            self.topic_totals[topic] += 1
            doc.doc_topic_counts[topic] += 1
            
    def sample(self, iterations):
        cdef int old_topic, new_topic, word_id, topic, i, doc_length
        cdef double sampling_sum = 0
        cdef double sample
        cdef long[:] word_topic_counts
        
        cdef long[:] doc_tokens
        cdef long[:] doc_topics
        cdef long[:] doc_topic_counts
        cdef long[:] topic_changes
        
        cdef double[:] uniform_variates
        cdef double[:] topic_probs = np.zeros(self.num_topics, dtype=float)
        cdef double[:] topic_normalizers = np.zeros(self.num_topics, dtype=float)
        
        for topic in range(self.num_topics):
            topic_normalizers[topic] = 1.0 / (self.topic_totals[topic] + self.smoothing_times_vocab_size)
        
        for iteration in range(iterations):
            for document in self.documents:
                doc_tokens = document.doc_tokens
                doc_topics = document.doc_topics
                doc_topic_counts = document.doc_topic_counts
                topic_changes = document.topic_changes
                
                doc_length = len(document.doc_tokens)
                uniform_variates = np.random.random_sample(doc_length)
                
                for i in range(doc_length):
                    word_id = doc_tokens[i]
                    old_topic = doc_topics[i]
                    word_topic_counts = self.word_topics[word_id,:]
        
                    ## erase the effect of this token
                    word_topic_counts[old_topic] -= 1
                    self.topic_totals[old_topic] -= 1
                    doc_topic_counts[old_topic] -= 1
        
                    topic_normalizers[old_topic] = 1.0 / (self.topic_totals[old_topic] + self.smoothing_times_vocab_size)
        
                    ###
                    ### SAMPLING DISTRIBUTION
                    ###
        
                    sampling_sum = 0.0
                    for topic in range(self.num_topics):
                        topic_probs[topic] = (doc_topic_counts[topic] + self.doc_smoothing) * (word_topic_counts[topic] + self.word_smoothing) * topic_normalizers[topic]
                        sampling_sum += topic_probs[topic]

                    sample = uniform_variates[i] * sampling_sum
        
                    new_topic = 0
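                    ## stop at the last topic in case rounding leaves a tiny remainder
                    while new_topic < self.num_topics - 1 and sample > topic_probs[new_topic]: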
                        sample -= topic_probs[new_topic]
                        new_topic += 1
            
                    ## add the effect of this token back in
                    word_topic_counts[new_topic] += 1
                    self.topic_totals[new_topic] += 1
                    doc_topic_counts[new_topic] += 1
                    topic_normalizers[new_topic] = 1.0 / (self.topic_totals[new_topic] + self.smoothing_times_vocab_size)

                    doc_topics[i] = new_topic
        
                    if new_topic != old_topic:
                        #pass
                        topic_changes[i] += 1

    def topic_words(self, int topic, n_words=12):
        sorted_words = sorted(zip(self.word_topics[:,topic], self.vocabulary), reverse=True)
        return " ".join([w for x, w in sorted_words[:n_words]])

    def print_all_topics(self):
        for topic in range(self.num_topics):
            print(topic, self.topic_words(topic))

In [None]:
## Read the stoplist file

stoplist = set()
with open("{}/stoplist.txt".format(source_directory), encoding="utf-8") as stop_reader:
    for line in stop_reader:
        line = line.rstrip()
        stoplist.add(line)


## Read the documents file
        
word_counts = Counter()
documents = []

with open("{}/documents.txt".format(source_directory), encoding="utf-8") as doc_reader:
    for line in doc_reader:
        #line = line.lower()

        tokens = word_pattern.findall(line)

        ## remove stopwords, short words, and capitalized words
        tokens = [w for w in tokens if w not in stoplist and len(w) >= 3 and not w[0].isupper()]
        word_counts.update(tokens)

        doc_topic_counts = np.zeros(num_topics, dtype=int)

        documents.append({ "original": line, "token_strings": tokens, "topic_counts": doc_topic_counts })

## Now that we're done reading from disk, we can build the vocabulary
##  and give each distinct word a numeric ID.
vocabulary = list(word_counts.keys())
word_ids = { w: i for (i, w) in enumerate(vocabulary) }

## With the vocabulary, go back and create arrays of numeric word IDs
for document in documents:
    tokens = document["token_strings"]
    doc_topic_counts = document["topic_counts"]
    
    doc_tokens = np.zeros(len(tokens), dtype=int)
    doc_topics = np.zeros(len(tokens), dtype=int)
    topic_changes = np.zeros(len(tokens), dtype=int)
    
    for i, w in enumerate(tokens):
        doc_tokens[i] = word_ids[w]
        ## topics will be initialized by the model
    
    document["doc_tokens"] = doc_tokens
    document["doc_topics"] = doc_topics
    document["topic_changes"] = topic_changes

### This cell actually runs the model

This may take some time to run. It will print "Done!" at the end.

I've made some changes since Monday that will let you re-run this cell without errors.
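
For reference, here is a plain NumPy sketch of what the sampler does for a single token: remove the token from the counts, weight each topic by how common it is in the document and how strongly it is associated with the word, draw a new topic, and add the token back. The counts and sizes are hypothetical toy values, and `resample_topic` is an illustrative name. The Cython code draws the topic by walking through the running sum of weights; this sketch uses NumPy's `Generator.choice` instead, which should be equivalent in distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes and counts
num_topics, vocab_size = 3, 5
doc_smoothing, word_smoothing = 0.5, 0.01

doc_topic_counts = np.array([4, 0, 1])     # tokens per topic in this document
word_topic_counts = np.array([10, 2, 0])   # times this word is assigned to each topic
topic_totals = np.array([50, 40, 30])      # all tokens assigned to each topic

def resample_topic(old_topic):
    # Remove the current token from all three count arrays
    for counts in (doc_topic_counts, word_topic_counts, topic_totals):
        counts[old_topic] -= 1

    # Unnormalized weight of each topic for this token
    weights = ((doc_topic_counts + doc_smoothing)
               * (word_topic_counts + word_smoothing)
               / (topic_totals + word_smoothing * vocab_size))

    # Draw a topic with probability proportional to its weight
    new_topic = rng.choice(num_topics, p=weights / weights.sum())

    # Put the token back under its new topic
    for counts in (doc_topic_counts, word_topic_counts, topic_totals):
        counts[new_topic] += 1
    return new_topic

new_topic = resample_topic(old_topic=0)
print(new_topic, doc_topic_counts, word_topic_counts)
```

The smoothing values keep every weight above zero, so a topic that a document or word has never used can still be chosen.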

In [None]:
model = TopicModel(num_topics, vocabulary, doc_smoothing, word_smoothing)

## `documents` seems to be a class variable, not an object variable
model.clear_documents()

for document in documents:
    document["topic_changes"].fill(0)
    document["topic_counts"].fill(0)
    c_doc = Document(document["doc_tokens"], document["doc_topics"], document["topic_changes"], document["topic_counts"])
    model.add_document(c_doc)

sampling_dist = np.zeros(num_topics, dtype=float)

doc_topic_probs = np.zeros((len(model.documents), num_topics))
word_topic_probs = np.zeros((len(vocabulary), num_topics))

# Initial burn-in iterations
for i in range(10): # using 500 iterations for faster stoplist curation
    start = timer()
    model.sample(50)
    elapsed_time = timer() - start
    
    display(Markdown("### Iteration {}, {:.2f} seconds per iteration".format((i+1) * 50, elapsed_time / 50)))
    
    table_markdown = "### Iteration {}, {:.2f} seconds per iteration\n".format((i+1) * 50, elapsed_time / 50)
    table_markdown += "|Topic | Most likely words (descending)|\n"
    table_markdown += "|--|--|\n"
    for topic in range(num_topics):
        table_markdown += "|{}|{}|\n".format(topic, model.topic_words(topic, 12))
    
    clear_output()
    display(Markdown(table_markdown))
        
# Saved samples
for i in range(5):
    model.sample(10)
    
    for doc_id, doc in enumerate(model.documents):
        for word_id, topic in zip(doc.doc_tokens, doc.doc_topics):
            doc_topic_probs[doc_id,topic] += 1
            word_topic_probs[word_id,topic] += 1

            
print("Done!")
            
# Normalize
doc_row_sums = doc_topic_probs.sum(axis=1)
doc_topic_probs /= doc_row_sums[:,np.newaxis]

word_col_sums = word_topic_probs.sum(axis=0)
word_topic_probs /= word_col_sums[np.newaxis,:]

topic_top_words = []
for topic in range(num_topics):
    sorted_words = sorted(zip(word_topic_probs[:,topic], vocabulary), reverse=True)
    topic_top_words.append(" ".join([w for x, w in sorted_words[:12]]))

### Representing a document as a mixture of topics

At the end of the training process I'm saving multiple samples and averaging over them to get better estimates of the probability of words in topics and the probability of topics in documents.

For convenience, I created an array `topic_top_words` that contains the top 12 most probable words in this averaged matrix.
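
The two normalizations run in different directions, which is easy to miss. A minimal sketch with hypothetical counts (the variable names ending in `_counts` are for illustration only): each document row is divided by its total to give the share of each topic in that document, and each topic column is divided by its total to give the share of each word in that topic.

```python
import numpy as np

# Hypothetical counts summed over saved samples: 2 documents x 3 topics
doc_topic_counts = np.array([[8.0, 2.0, 0.0],
                             [1.0, 1.0, 3.0]])

# Hypothetical counts for the same tokens: 4 words x 3 topics
word_topic_counts = np.array([[5.0, 0.0, 0.0],
                              [3.0, 1.0, 0.0],
                              [1.0, 2.0, 1.0],
                              [0.0, 0.0, 2.0]])

# Each row (document) sums to 1: estimated P(topic | document)
doc_topic_probs = doc_topic_counts / doc_topic_counts.sum(axis=1, keepdims=True)

# Each column (topic) sums to 1: estimated P(word | topic)
word_topic_probs = word_topic_counts / word_topic_counts.sum(axis=0, keepdims=True)

print(doc_topic_probs)
print(word_topic_probs)
```

One thing to watch for: a document with no tokens left after stoplist filtering has a row total of zero, so its row comes out as `nan`.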

Let's look at one of the longer documents, describing an incident of violence and property crime. (Most proper names have been removed.)

In [None]:
documents[780]["original"]

Here are the most probable topics in this document:

In [None]:
sorted(zip(doc_topic_probs[780,:], topic_top_words), reverse=True)

### This cell shows the prevalence of each topic from the beginning of the work to the end.

I'm including this for comparison to Pliny's encyclopedia. It looks quite different to me, and tells us something about the construction of the collection.
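
Per-document proportions can be noisy, so if a plot is hard to read, a moving average may help. This is a sketch, not part of the original analysis: `moving_average` and the window size of 50 are assumptions, and `topic_column` is random data standing in for one column of `doc_topic_probs`.

```python
import numpy as np

def moving_average(values, window=50):
    """Average each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    # mode="valid" drops the edges where the window would run off the array
    return np.convolve(values, kernel, mode="valid")

# Hypothetical stand-in for one column of doc_topic_probs
rng = np.random.default_rng(0)
topic_column = rng.random(1000)

smoothed = moving_average(topic_column, window=50)
print(len(topic_column), len(smoothed))
```

The smoothed series is shorter than the input by `window - 1` points, and it could be passed to `pyplot.plot` in place of the raw column.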

In [None]:
for topic in range(num_topics):
    print(topic, model.topic_words(topic, n_words=6))
    pyplot.plot(doc_topic_probs[:,topic])
    pyplot.show()

### This cell prints the documents with the largest proportion of a specified topic.

Find a few topics that are of interest, and compare results around your table. Are topics that look similar from the perspective of their top words bringing together similar documents?
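
One rough way to decide which topics "look similar from the perspective of their top words" is to measure how many top words two topics share. The sketch below uses Jaccard overlap (shared words divided by all words in either list). The helper names and the word lists are hypothetical; the strings are only shaped like entries of `topic_top_words`.

```python
def jaccard(a, b):
    """Overlap between two collections: shared items / all items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical top-word strings, one list per person's model
my_topics = ["watch pocket street took", "horse stable sold bought"]
their_topics = ["horse sold stable market", "watch street pocket handkerchief"]

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = []
    for i, words_a in enumerate(topics_a):
        scores = [jaccard(words_a.split(), words_b.split()) for words_b in topics_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

matches = best_matches(my_topics, their_topics)
for mine, theirs, score in matches:
    print("my topic {} ~ their topic {} (overlap {:.2f})".format(mine, theirs, score))
```

Topic numbers are arbitrary in each model, so matching by content like this is needed before comparing results. A high word overlap still does not guarantee that the topics pick out the same documents, which is what `top_docs()` lets you check.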

In [None]:
def top_docs(topic, n_docs=10):
    for doc_id in np.argsort(-doc_topic_probs[:,topic])[:n_docs]:
        print("{} {:.1f}% | {}".format(doc_id, 100 * doc_topic_probs[doc_id,topic], documents[doc_id]["original"]))

In [None]:
top_docs(11)