# Finding Similar and Different Documents

One of the core tasks of text-based data analysis (or of natural language processing in general) is figuring out which documents are similar to one another and which ones are different. Let's say you have a ton of financial documents &mdash; far too many to read manually &mdash; a few of which you know are interesting. You want to find more documents that look like the interesting documents you have.

Even if you haven't done this yourself, you've probably seen something like this. Text-based recommendation systems, like the ones at *The New York Times* that suggest a story for you to read after your current article, almost certainly incorporate document similarity as a key feature.

Document similarity is sometimes a standalone task in natural language processing and often a feature in larger systems. It also closely resembles a number of other tasks. When you're clustering a set of documents into machine-learned topics, you're finding groups of documents that are similar to one another. And if you're classifying a set of documents, you're in effect creating a binary clustering of documents that resemble each other, at least in one specific way.

So what does this have to do with `text_data`, which is a data exploration library? Well, whether you're trying to create a model for a web application, for some data analysis, or to strengthen a journalistic story, you'll want some intuition about *why* your computer thinks some documents are similar to one another and why they're different. You want to make sure that your model is picking up on the signals you care about and not on the ones you don't. It's very easy to create biased models; data exploration is a key way to limit those biases and improve the performance of your models.

There's only one function in `text_data` specifically geared toward helping you explore similar documents. `distance_heatmap` allows you to graphically render similarity scores between two corpora (or between one corpus and itself). This requires a "distance matrix," or a matrix of pairwise distances between all of the documents in one corpus to all of the documents in another. There are a ton of different ways to create these matrices, but the basic strategy you'll take will typically go as follows:

1. You'll tokenize all of the documents, or in other words convert each document into a list of strings.
2. You'll build some sort of model that converts the tokenized documents into a vector or matrix of numbers (called an encoding).
3. You'll create a pairwise matrix between the two corpora, setting the value of each cell to the similarity or distance between the two documents at that row and column. (Typically, you'll use cosine similarity.)
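
The three steps above can be sketched in plain Python. This is only a minimal illustration, not what `Doc2Vec` or `text_data` does: it assumes whitespace tokenization and uses raw word counts as the encoding, and the helper names (`tokenize`, `encode`, `cosine_distance`, `pairwise_distances`) are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and split it on whitespace."""
    return text.lower().split()


def encode(tokens, vocabulary):
    """Step 2: turn a token list into a vector of word counts."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an empty document as unrelated
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(docs_a, docs_b):
    """Step 3: rows index docs_a, columns index docs_b."""
    tokenized_a = [tokenize(d) for d in docs_a]
    tokenized_b = [tokenize(d) for d in docs_b]
    vocabulary = sorted({t for doc in tokenized_a + tokenized_b for t in doc})
    vectors_a = [encode(doc, vocabulary) for doc in tokenized_a]
    vectors_b = [encode(doc, vocabulary) for doc in tokenized_b]
    return [[cosine_distance(u, v) for v in vectors_b] for u in vectors_a]


# Toy documents, invented for this example.
docs = ["the cat sat", "the cat sat down", "stocks fell sharply"]
matrix = pairwise_distances(docs, docs)
```

Here `matrix[0][1]` should be small (the first two documents share most of their words), while `matrix[0][2]` is the maximum distance of 1 because the documents share no words at all. Swapping the count vectors for learned embeddings changes step 2 but leaves step 3 as it is.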

For this particular notebook, I'll be using `Doc2Vec`. Another good option is the Universal Sentence Encoder. But feel free to get creative with this. As long as you end up with a matrix of numbers when all is said and done, you should be good.

## Set Up

To start, you'll need to install some packages. This notebook uses `altair_saver` to save data visualizations as images, `pandas` to deal with data selection and manipulation, and `gensim` to create our model. You can install them in the cell below.

In [1]:
%%capture
pip install altair_saver pandas gensim

## Loading the Data

Next, you'll want to load in the data. For all of the notebooks in this examples directory, I'm using the [Kaggle State of the Union Corpus](https://www.kaggle.com/rtatman/state-of-the-union-corpus-1989-2017). This dataset contains the text of all of the State of the Union Addresses between 1790 and 2018, except for 1933.

In [3]:
import multiprocessing

from gensim.models import doc2vec
from IPython.display import Image
import numpy as np
import pandas as pd
import text_data
from utilities import load_sotu_data, tokenizer

sotu_speeches = pd.DataFrame(load_sotu_data())
sotu_speeches.head()

| | president | year | speech |
|---|---|---|---|
| 0 | Bush | 2001 | To the Congress of the United States:\n\nMr. S... |
| 1 | Monroe | 1822 | Fellow-Citizens of the Senate and House of Rep... |
| 2 | Washington | 1794 | Fellow-Citizens of the Senate and House of Rep... |
| 3 | Cleveland | 1895 | To the Congress of the United States:\n\nThe p... |
| 4 | Bush | 2008 | Madam Speaker, Vice President Cheney, Members ... |


## Create a pairwise distance matrix

Next, I'm going to create a document embedding model and use it to create a matrix of similar (or dissimilar) documents. I'm using `Doc2Vec` for this. `Doc2Vec` is a slight extension of `word2vec` that creates vector representations of both documents and words. We'll use it to get a sense of which documents are similar to other documents. However, depending on your context, other methods might work. Just be careful about the length of the documents you're looking at: some methods perform very well for small documents (like tweets) but perform poorly for longer documents (like the State of the Union Addresses we're looking at).

To start, I'll create a model and use that to create a pairwise matrix of document similarities across the entire State of the Union corpus.

In [2]:
sotu_corpus = text_data.Corpus(list(sotu_speeches.speech), tokenizer)
documents = [
    doc2vec.TaggedDocument(doc, [i])
    for i, doc in enumerate(sotu_corpus.tokenized_documents)
]
# hyperparameters; feel free to experiment with these; this is just intended as a quick-and-dirty,
# *ok* representation of the documents
doc2vec_params = {
    "alpha": 0.025,
    "epochs": 10,
    "min_alpha": 0.0001,
    # this culls rare words from the trained model,
    # which lowers the memory cost of the model and *should* improve performance
    "min_count": 5,
    # this just speeds up performance on multi-core machines
    "workers": max(1, multiprocessing.cpu_count() // 2)
}
model = doc2vec.Doc2Vec(documents, **doc2vec_params)
distance_matrix = np.array(list(map(model.docvecs.distances, range(len(documents)))))

## Visualizing the Similarities

Next, let's see which State of the Union Addresses are similar to each other. To do this, we'll create a heat map using `text_data` that takes our distance matrix and graphically renders it so similar addresses will have a lighter color, while considerably different addresses will have a darker color.

There are a few things to note here. First of all, you should see a very light diagonal straight down the middle. Each cell on that diagonal compares one document to itself. Naturally, our model knows that two identical documents are pretty much the same.

But you should also see a structure to this visualization. There are entire blocks of light colors and blocks of dark colors. If you look carefully at the indices, you'll notice that they appear in alphabetical order. What's happening is something that probably makes intuitive sense to you: the State of the Union Addresses that individual presidents give are pretty similar *to one another*.

You can also see some patterns or outliers in the data that are interesting (but make sense). President George W. Bush's addresses look fairly similar to a lot of addresses from presidents that are fairly contemporary to him &mdash; presidents like Obama, Clinton, and Reagan. Finally, you can see that George Washington's inaugural address really looks unlike any of the other addresses (with the exception of itself); the heatmap is basically a dark blue line.

In [4]:
speech_indices = sotu_speeches.president + ", " + sotu_speeches.year
chart = text_data.multi_corpus.distance_heatmap(
    distance_matrix,
    speech_indices,
    speech_indices,
    "SOTU Speech",
    "SOTU Speech"
)
chart.save("speech_similarity_matrix.png")

![A heatmap showing which State of the Union Addresses are Similar to Each Other](speech_similarity_matrix.png)
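
The block structure in the heatmap can also be checked numerically rather than by eye. The sketch below averages the distances inside each (row group, column group) block of a small matrix. The matrix values and the group labels `"A"` and `"B"` are made up for illustration, and `mean_distance_by_group_pair` is a hypothetical helper, not part of `text_data`.

```python
from collections import defaultdict


def mean_distance_by_group_pair(matrix, labels):
    """Average the distances in each (row label, column label) block.

    The diagonal (a document compared with itself) is skipped so that it
    does not pull the within-group averages toward zero.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, row_label in enumerate(labels):
        for j, col_label in enumerate(labels):
            if i == j:
                continue
            totals[(row_label, col_label)] += matrix[i][j]
            counts[(row_label, col_label)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Toy, made-up distances for four speeches by two hypothetical presidents.
labels = ["A", "A", "B", "B"]
toy_matrix = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
]
block_means = mean_distance_by_group_pair(toy_matrix, labels)
```

If speeches by the same president really do cluster, the within-group averages (`("A", "A")` and `("B", "B")` here) should come out well below the cross-group ones. With the real data, the labels would be the values in the `president` column.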

While the first heatmap compares the entire corpus to itself, there's no need to do this. Here, I'm going to compare the speeches of the two Roosevelts to those of Truman and Eisenhower.

In [5]:
roosevelts = sotu_speeches[sotu_speeches.president == "Roosevelt"]
post_war = sotu_speeches[
    (sotu_speeches.president == "Truman") | (sotu_speeches.president == "Eisenhower")
]
compare_matrix = distance_matrix[roosevelts.index][:,post_war.index]
heatmap = text_data.multi_corpus.distance_heatmap(
    compare_matrix,
    speech_indices.loc[roosevelts.index],
    speech_indices.loc[post_war.index],
    "Roosevelt Speeches",
    "Truman and Eisenhower Speeches"
)
heatmap.save("recent_sotu_heatmap.png")

![A heatmap of the two Roosevelts, Truman and Eisenhower](recent_sotu_heatmap.png)
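
The row-and-column selection used to build `compare_matrix` can be sketched in plain Python. The matrix values here are invented, and `select_submatrix` is a hypothetical helper written for this illustration.

```python
def select_submatrix(matrix, row_ids, col_ids):
    """Keep only the requested rows, then only the requested columns."""
    return [[matrix[r][c] for c in col_ids] for r in row_ids]


# Toy, made-up distances between four documents.
full_matrix = [
    [0.0, 0.3, 0.6, 0.9],
    [0.3, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.4],
    [0.9, 0.8, 0.4, 0.0],
]
rows = [0, 1]  # positions of one group of documents
cols = [2, 3]  # positions of the group to compare against
sub = select_submatrix(full_matrix, rows, cols)
```

The result has one row per document in the first group and one column per document in the second. With NumPy arrays, `distance_matrix[np.ix_(rows, cols)]` should give the same selection in a single indexing step. Note that indexing the matrix with a DataFrame's `.index`, as the cell above does, only lines up if the DataFrame index matches row positions in the matrix, which appears to be the case here given the default 0, 1, 2, ... index.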

Again, there's a pattern here, this one temporal. As you'll see, the first Roosevelt's speeches have a lot less in common with Eisenhower's and Truman's speeches than the second Roosevelt's do. That makes sense, considering that the second Roosevelt served during the same general timeframe as the other two (even if the Second World War complicates things).

## Final Exploration

There's a lot more you can do with `text_data` to explore these similarities. I'm going to explore one way and hint at how you might extend that. In particular, I'm going to find the documents that are most similar to FDR's "Four Freedoms" speech, the 1941 State of the Union Address where FDR argued that the United States needed to serve as "an arsenal" for its allies and spoke of democratic principles as they were most under attack.

I'm going to start doing this by getting the 25 speeches that `Doc2Vec` thinks are most similar to this one, and I'm going to display the top 15 results in `pandas`:

In [6]:
four_freedoms = sotu_speeches[sotu_speeches.year == "1941"].index[0]
freedom_idxs, _ranks = zip(*model.docvecs.most_similar(four_freedoms, topn=25))
near_four_freedoms = sotu_speeches.loc[list((*freedom_idxs, four_freedoms))]
near_four_freedoms.head(15)

| | president | year | speech |
|---|---|---|---|
| 61 | Truman | 1951 | Mr. President, Mr. Speaker, Members of the Con... |
| 161 | Roosevelt | 1940 | Mr. Vice President, Mr. Speaker, Members of th... |
| 26 | Roosevelt | 1943 | Mr. Vice President, Mr. Speaker, Members of th... |
| 183 | Roosevelt | 1944 | To the Congress:\n\nThis Nation in the past tw... |
| 21 | Roosevelt | 1942 | In fulfilling my duty to report upon the State... |
| 45 | Roosevelt | 1939 | Mr. Vice President, Mr. Speaker, Members of th... |
| 221 | Truman | 1953 | To the Congress of the United States:\n\nI hav... |
| 32 | Kennedy | 1963 | Mr. Vice President, Mr. Speaker, Members of th... |
| 157 | Roosevelt | 1945 | To the Congress:\n\nIn considering the State o... |
| 56 | Truman | 1952 | Mr. President, Mr. Speaker, Members of the Con... |


Predictably, there are a lot of speeches from Roosevelt in that list (and interestingly, they're speeches he gave during the war). You can also see a fair amount of temporal clustering. But now I want to hint at how this kind of thing can be extended.

I'm going to find the words that most distinguish the documents that are similar to Roosevelt's Four Freedoms speech from a background corpus of all of the SOTU speeches.

In [7]:
four_freedoms_corpus = text_data.Corpus(near_four_freedoms.speech.to_list(), tokenizer)

text_data.multi_corpus.display_top_differences(
    four_freedoms_corpus,
    sotu_corpus,
    text_data.multi_corpus.log_odds_ratio
)

| Order | Word | Score |
|---|---|---|
| 1.0 | officers | -3.0808350286828308 |
| 2.0 | objects | -3.117887798901709 |
| 3.0 | tribes | -3.1245334627833383 |
| 4.0 | notes | -3.163506158320878 |
| 5.0 | ports | -3.185558219649298 |
| 6.0 | silver | -3.2192582319201986 |
| 7.0 | claims | -3.231417600316276 |
| 8.0 | intercourse | -3.400612267993276 |
| 9.0 | mexico | -3.4735328097059446 |
| 10.0 | spain | -3.600418221679119 |

| Order | Word | Score |
|---|---|---|
| 1.0 | kremlin | 2.5861379733056165 |
| 2.0 | hitler | 2.5861379733056165 |
| 3.0 | nazis | 2.58612420956848 |
| 4.0 | appeasement | 2.5861035643758257 |
| 5.0 | hemispheric | 2.586096682755084 |
| 6.0 | numerically | 2.5860898011894147 |
| 7.0 | hats | 2.5860898011894147 |
| 8.0 | amphibious | 2.5860898011894147 |
| 9.0 | versus | 2.5860829196788124 |
| 10.0 | stalinist | 2.5860829196788124 |


Predictably, words related to the Nazis and to wars are *highly associated* with the Four Freedoms speech and similar speeches, while old-timey words about silver or Mexico are much less common.
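
To give a feel for how a log-odds comparison separates two corpora, here is a generic sketch with additive smoothing. It is an assumption-laden illustration: the exact formula and smoothing that `text_data.multi_corpus.log_odds_ratio` uses may differ, and the function name `log_odds_ratios`, the `smoothing` parameter, and the toy tokens are all made up.

```python
import math
from collections import Counter


def log_odds_ratios(tokens_a, tokens_b, smoothing=1.0):
    """Score each word by log(odds in A) - log(odds in B).

    Positive scores mean the word is relatively more common in A.
    Assumes smoothing > 0 and at least two distinct words overall,
    so that no smoothed probability is exactly 0 or 1.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(counts_a) | set(counts_b)
    total_a = len(tokens_a) + smoothing * len(vocabulary)
    total_b = len(tokens_b) + smoothing * len(vocabulary)
    scores = {}
    for word in vocabulary:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Toy, made-up token lists standing in for two corpora.
tokens_a = "war war allies freedom".split()
tokens_b = "silver treaty war treaty silver silver".split()
scores = log_odds_ratios(tokens_a, tokens_b)
```

In this toy example, "allies" and "war" get positive scores because they are relatively more frequent in the first list, while "silver" gets the most negative score. Sorting the scores in each direction gives two ranked lists analogous to the two tables above.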

Finally, we can look to see examples of certain words appearing. (We could do the same with entire documents, but the State of the Union Addresses are probably too long for this to be all that useful.)

In [8]:
four_freedoms_corpus.display_search_results("appeasement", window_size=150)
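
The idea behind this kind of search display is often called "keyword in context": show each match with a little surrounding text. Below is a minimal sketch of that idea. It is not how `display_search_results` is implemented; in particular, it matches raw substrings and measures the window in characters, both of which are assumptions of this example, and `keyword_in_context` is a hypothetical helper.

```python
def keyword_in_context(text, keyword, window_size=20):
    """Return each occurrence of keyword with window_size characters of
    context on either side (fewer at the edges of the text).

    Matching is case-insensitive and works on substrings, not tokens.
    """
    snippets = []
    lowered, target = text.lower(), keyword.lower()
    start = lowered.find(target)
    while start != -1:
        left = max(0, start - window_size)
        right = min(len(text), start + len(target) + window_size)
        snippets.append(text[left:right])
        start = lowered.find(target, start + len(target))
    return snippets


# Toy text, invented for this example.
sample = "Peace through appeasement failed. Appeasement is not policy."
snippets = keyword_in_context(sample, "appeasement", window_size=6)
```

Each snippet contains the keyword plus at most six characters on either side, so a long address is reduced to a handful of short, readable excerpts.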

## Conclusion

There's a lot more that you could do to explore this data. But hopefully this provides a decent overview of how you can use the heatmap function in `text_data` to figure out what documents are similar to each other and to visually identify patterns. And hopefully, you can imagine ways in which you could take information you have about document similarity, either from this library or from another, and search through those similar (or different) documents to try to identify meaningful patterns.