Script overview

The paper describing this project is included in the repo as scale_draft1.pdf. The algorithm is based on a paper by Shuang Yang.

The main functions are split across several modules. The script files demonstrate how to use them.

The corpus used to build the initial pointwise mutual information (PMI) graph is not included; the graph would need to be rebuilt using build_graph_script.
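For readers unfamiliar with PMI, the weighting behind such a graph can be sketched in a few lines. This is only an illustration, not the code in language_model: the function name `pmi_graph`, the use of adjacent word pairs as bigrams, and the base-2 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter


def pmi_graph(tokens, min_count=1):
    """Return {(w1, w2): PMI} for adjacent word pairs.

    Hypothetical helper for illustration; the repo's own graph builder
    may count co-occurrences or normalize differently.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    graph = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI estimates are noisy
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with what
        # independence of the two words would predict.
        graph[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return graph


tokens = "the cat sat on the mat the cat ran".split()
graph = pmi_graph(tokens)
```

Pairs that co-occur more often than chance get a positive weight. On a real corpus, raising `min_count` is a common way to suppress unreliable scores for rare pairs.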


Modules:

preprocess: routines based on the NLTK library for tokenizing and cleaning text.

language_model: functions used to build a semantic graph based on pointwise mutual information.

gauss_filter: functions for filtering or smoothing 1D or 2D binary representations of text.

keyword: contains functions for building a list of keywords from a filtered text.
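The kind of smoothing gauss_filter is described as doing can be illustrated in one dimension. The sketch below is not the module's API: `gaussian_kernel` and `smooth` are hypothetical names, the kernel is truncated at three standard deviations, and positions outside the text are treated as zero. All of these are assumptions made for the example.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized, truncated Gaussian weights (hypothetical helper)."""
    if radius is None:
        radius = max(1, int(3 * sigma))  # assumption: cut off at ~3 sigma
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve a 1D signal with a Gaussian, zero-padding at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


# Binary representation of one word: 1.0 where it occurs, 0.0 elsewhere.
tokens = "scale space theory treats text as a signal in scale space".split()
binary = [1.0 if t == "scale" else 0.0 for t in tokens]
smoothed = smooth(binary, sigma=1.0)
```

Increasing `sigma` spreads each occurrence over a wider neighborhood, which corresponds to moving to a coarser scale in the scale-space view. A 2D version would apply the same idea along a second, semantic axis.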

Scripts:

Demonstrates building a graph from a text file. Building a graph from a moderate-sized corpus can take 10-20 minutes.

Demonstrates filtering on a reduced vocabulary. This allows the result to be visualized, since with a full vocabulary the semantic axis would be much longer than the spatial axis (x).

Demonstrates extracting keywords from a moderate-sized text.


The current scripts load data from the semantic_graph.p file, which contains around 82,000 bigrams over about 16,000 unique words.
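The .p extension suggests a Python pickle, though the file's internal layout is not documented here. A minimal loading sketch under that assumption follows. `load_graph` and `summarize` are hypothetical names, and the `{(word1, word2): weight}` layout is a guess used only to show how the bigram and vocabulary counts could be checked. A toy file stands in for semantic_graph.p.

```python
import os
import pickle
import tempfile


def load_graph(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(graph):
    """Count edges and unique words, ASSUMING a {(w1, w2): weight} dict."""
    words = {w for pair in graph for w in pair}
    return len(graph), len(words)


# Toy stand-in for semantic_graph.p, written to a temporary directory.
toy = {("scale", "space"): 2.1, ("space", "theory"): 1.4}
path = os.path.join(tempfile.mkdtemp(), "toy_graph.p")
with open(path, "wb") as f:
    pickle.dump(toy, f)

n_bigrams, n_words = summarize(load_graph(path))
```

If the real file holds a different structure, inspecting `type(load_graph(path))` is a reasonable first step before adapting `summarize`.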

Text files:

didion.txt : a clip from a book review

paper_draft.txt : a draft of the paper handed in with this project


Scale-space theory applied to text analysis. A paper and algorithm prototype based on a paper by Shuang Yang



