Phrase Similarity Visualizer

CIS 192 S18 Final Project, by Valencia and Arun

Calculates and visualizes several similarity metrics between two pieces of text: Hamming distance, Euclidean distance, Manhattan distance, and "meaning" distance (calculated using vector embeddings of the non-stopwords).

Project Requirements

The custom class, SenLen, is found in and handles string manipulation logic. Magic methods for operators are used to perform computations. There is also a method that returns the maximum sentence length from two sentences given as arguments.
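As a rough illustration of how magic methods can carry this kind of length logic, here is a minimal sketch. It is not the project's actual SenLen code: the method names `max_length` and `padded_to`, the choice of operators, and the padding character are all assumptions made for the example.

```python
from functools import total_ordering


@total_ordering
class SenLen:
    """Hypothetical sketch: wraps a sentence and compares by length."""

    def __init__(self, text):
        self.text = text

    def __len__(self):
        # Lets len(SenLen("abc")) work like len("abc").
        return len(self.text)

    def __eq__(self, other):
        # Two sentences are "equal" here when their lengths match.
        return len(self) == len(other)

    def __lt__(self, other):
        # total_ordering derives <=, >, >= from __eq__ and __lt__.
        return len(self) < len(other)

    def __sub__(self, other):
        # The - operator yields the absolute difference in length.
        return abs(len(self) - len(other))

    @staticmethod
    def max_length(first, second):
        # Maximum sentence length of the two arguments.
        return max(len(first), len(second))

    def padded_to(self, length, fill=" "):
        # Right-pad the text so both sentences can be compared position by position.
        return self.text.ljust(length, fill)
```

With this sketch, `SenLen("hi") < SenLen("hello")` compares lengths, and `SenLen("hi").padded_to(5)` returns the text padded to five characters.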

We use the modules math, flask, numpy, nltk, and gensim (a Python implementation of the Word2Vec model).

We use a decorator for the / route and its methods GET and POST.

Routes and Usage

To run the server, install the packages listed above and run

Type two sentences into the input areas and click 'Go!' to visualize the similarity between them. You can hover over the bar graph to view the exact similarity percentages.

Project Implementation

The Flask server is instantiated and served in In addition, we developed a DistanceCalculator class that computes the distances and passes the resulting similarities to the Flask template for rendering.

Traditional string manipulation technique implementations are in which we import into These include Hamming distance, Euclidean distance, and Manhattan distance. To determine similarity, each distance is computed and then divided by the length of the longer string to yield a value between 0 and 1. Finally, we take the complement of this value (one minus the normalized distance) to represent the similarity. In, we also implemented a class named SenLen to handle certain string manipulation logic, namely padding strings when their lengths differ. We use magic methods to handle comparisons.
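The normalize-then-complement step can be sketched for the Hamming case as follows. This is an illustrative sketch only: the function names, the use of spaces for padding, and the treatment of two empty strings as fully similar are assumptions, not details taken from the project.

```python
def hamming_distance(first, second, fill=" "):
    """Count differing positions after padding the shorter string."""
    width = max(len(first), len(second))
    a = first.ljust(width, fill)
    b = second.ljust(width, fill)
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def hamming_similarity(first, second):
    """Divide the distance by the longer length, then take the complement."""
    width = max(len(first), len(second))
    if width == 0:
        # Assumption: two empty strings count as identical.
        return 1.0
    return 1.0 - hamming_distance(first, second) / width
```

For example, "karolin" and "kathrin" differ in 3 of 7 positions, so the sketch gives a similarity of 1 - 3/7.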

The "meaning" distance heuristic is implemented in and is imported into First, we remove all stopwords from the input (e.g. words like "the" and "for" that yield no meaning) using the nltk package (English). We use the word2vec model implemented in the gensim package to generate word embeddings (numpy vectors). The model is currently trained on a corpus of Barack Obama's speeches (Donald Trump's speeches were getting a bit out of hand). Then, we average the vectors to determine a centralized meaning for the phrase. Finally, we calculate the distance between the two vectors by normalizing each vector and computing the cosine distance.
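The pipeline above (drop stopwords, look up word vectors, average them, compare with cosine distance) can be sketched without the real model. In this sketch the stopword set and the three-dimensional `TOY_VECTORS` table are made-up stand-ins for the nltk stopword list and the gensim embeddings, and returning a distance of 1.0 when a phrase has no known words is an assumption for the example.

```python
import math

# Stand-in for the nltk English stopword list (assumption: tiny sample).
STOPWORDS = {"the", "for", "a", "of"}

# Stand-in for trained word2vec embeddings (assumption: invented values).
TOY_VECTORS = {
    "economy": [0.9, 0.1, 0.0],
    "jobs": [0.8, 0.2, 0.1],
    "peace": [0.1, 0.9, 0.2],
}


def phrase_vector(phrase):
    """Average the vectors of the non-stopwords that have an embedding."""
    words = [w for w in phrase.lower().split()
             if w not in STOPWORDS and w in TOY_VECTORS]
    if not words:
        return None
    dims = len(next(iter(TOY_VECTORS.values())))
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words)
            for i in range(dims)]


def cosine_similarity(u, v):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def meaning_distance(first, second):
    """Cosine distance between the averaged phrase vectors."""
    u, v = phrase_vector(first), phrase_vector(second)
    if u is None or v is None:
        # Assumption: phrases with no known words are maximally distant.
        return 1.0
    return 1.0 - cosine_similarity(u, v)
```

Dividing by both norms is equivalent to normalizing each vector before taking the dot product, so the normalization step is folded into `cosine_similarity`.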


Visualization tool to compare various methods of phrase similarity (i.e. string, lexical, semantic) using deep learning.



