# An explainable topological search engine with giotto-tda

**Summary:** In this notebook, I will share my work on how one can use the `giotto-tda`-package to cluster words and create a semantical search engine, retrieving documents similar to one's query, thereby showing how we can use topological techniques in the context of Natural Language Processing (NLP).  

**Data:** We will be using the classical [*reuters*](https://www.nltk.org/book/ch02.html) dataset, consisting of around 10K news documents collected from Reuters (inherently available in the *nltk* package). We will also use pre-trained word vectors, available through the *gensim* API. These are the [*word2vec-google-news-300*](https://github.com/mmihaltz/word2vec-GoogleNews-vectors). 

## How do we use giotto-tda for a search engine?! 

I will try to briefly explain the methodology and intuition behind this search engine.. 

A problem with many search engine techniques today is that they rely on the words to exist in the corpus of document we already have. 

When we search for certain terms in a collection of documents, many methods often look for the specific words, which can be troublesome. An example of this is the work from a [Kaggle notebook](https://www.kaggle.com/amitkumarjaiswal/nlp-search-engine) by Amit Kumar Jaiswal. Here, words are removed if they do not exist in the vocabulary of the previous documents. This is of course not applicable to advanced search engines, and work arounds exist, but we might miss semantic meanings and important information sometimes. 

A tool that allows for constructing a search engine mitigating this problem is the Mapper in `giotto-tda`. By taking the word vectors of all the words existing in a collection of documents, we can create a graph from these word vectors by using the mapper method. Each node will then work as a cluster of words that are similar and are used in the same context. Due to the inherent nature of word vectors, words are perceived as similar when they are used in the same context (see [Mikolov et al.](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)), and this can be measured with the use of the cosine similarity. We therefore use the cosine distance as our distance metric when constructing our graph. 

In essence, the mapper therefore allows us to cluster the words into an arbitrary number of clusters, **not a determined number of clusters**, with words that are similar in terms of semantical meaning. If a word does not exist, we can rely on its neighbours, defined through the vertices it belongs to, to fetch document that might be about similar topics. 

This mapper graph can then be used to query the data and retrieve documents. Each document has words belonging to different nodes, as different words end up in different nodes based on which words they are similar to. A document will therefore implicitly belong to many different nodes as it contains many words. 

We can explain a query in a few steps. For a demonstration, look below, but here I will simply explain it in words. Let's say a word is searched for, say `China`. We then look up which nodes this word belongs to, which also contains other words and therefore many different documents. We look at all documents that have words in those nodes, and the documents are ranked *based on their percent of words contained in the same node as the query*. In the simple example of *china*, we simply rank all documents based on the percentage of their words they have in the same nodes as the query. If a word does not exist in the corpus, we can simply leverage the word vectors and derive the most similar word that exist in the corpus. 

The construction of the mapper graph therefore works as an efficient way of clustering words and implicitly creating what one might call a topological signature for each document. 


### A brief mathematical formulation

Consider the following: 
1. A vocabulary $\mathcal{V}$ consisting of words. 
2. A corpus of documents $\mathcal{C} = \{d_1, ..., d_m\} = \{\{w_{11},..,w_{1n_1}, ... ,{w_{m1},..,w_{mn_m}}\}\}$ where $w_{ij} \in \mathcal{V}$
3. A word2vec model mapping words to vectors, $f_{w2vec} : \mathcal{V} \rightarrow \mathbb{R}^D$ (in our case, $D = 300$). 
4. A `giotto-tda` mapper graph function mapping the vectors to a graph $\mathcal{G} := (V, E)$ such that $f_{mapper} : \mathbb{R}^D \rightarrow \mathcal{G}$
5. The vertices can each contain an arbitrary number of words, such that $V = \{v_1, ..., v_k\} = \{\{w_{11},..,w_{1l_1}\},...,\{w_{z1},..,w_{zl_z}\}\}$ (do not confuse these sets with the sets in point 2!).
6. Now, for a query $\mathcal{Q} = \{w_1, ... , w_s\}$, we consider a query function obtaining all the nodes that $\mathcal{Q}$ is in, such that $f_{q} : \mathcal{Q} \rightarrow V$. We also use a function mapping the documents in $\mathcal{C}$ and a set of vertices $V$ to a real-valued ranking scalar, such that $f_{rank} : \mathcal{C}, V, Q \rightarrow \mathbb{R}$. This function $f_{rank}$ is currently as seen below. In short terms, it is the number of words that the document has in the same nodes as the query has its words, in proportion to the total number of words that a document has. 

$$f_{rank}(d_i, V, Q) = \frac{\sum_{j=1}^{n_i} \sum_{a=1}^z \sum_{k=1}^{|f_{\mathcal{Q}}(Q)|} \mathbb{1}(w_{jn_i}\in v_{k})}{|d_i|}$$

It was later seen that this algorithm favoured short documents due to its nature. Therefore, I introduced a bias for larger documents, as seen below. 

$$f_{rank}(d_i, V, Q) = c \cdot \log(|d_i|) \cdot \frac{\sum_{j=1}^{n_i} \sum_{a=1}^z \sum_{k=1}^{|f_{\mathcal{Q}}(Q)|} \mathbb{1}(w_{jn_i}\in v_{k})}{|d_i|}$$

### How is giotto-tda involved? 

As I probably already answered, the most important underlying concept in this case is the mapper pipeline; mapping the data points in the point cloud to a graph which we can query is actually the main underlying mechanism of this search engine. 

## Install all necessary packages

First, let's install all the necessary packages and adjust set the settings correctly. 

In [1]:
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing the Python code
%autoreload 2

In [2]:
import sys
!{sys.executable} -m pip install gensim nltk scikit-learn rank_bm25

You should consider upgrading via the '/usr/local/opt/python@3.8/bin/python3.8 -m pip install --upgrade pip' command.[0m


## Import all the necessary packages

Second of all, let's load all the necessary packages. 

In [3]:
from sklearn.cluster import DBSCAN
from gtda.mapper import CubicalCover, make_mapper_pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from sklearn.decomposition import PCA
import pandas as pd
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
import re
import gensim.downloader as api
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import reuters
import numpy as np
from gtda.mapper import plot_static_mapper_graph
from gtda.mapper import plot_static_mapper_graph

nltk.download('reuters')
nltk.download('stopwords')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/filipcornell/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/filipcornell/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Load helper functions 

First of all, we need to load all the helper functions needed to to this. 

### Data manipulation functions

In [4]:
def load_w2vec():
    model = api.load("word2vec-google-news-300")
    return model

def flatten(li):
    return [item for sublist in li for item in sublist]

def remove_stopwords(x, stopWords):
    return " ".join([y for y in x.split(" ") if y not in stopWords])



def preprocess_text(list_of_text):
    try:
        stopWords = set(stopwords.words("english"))
    except Exception:
        nltk.download("stopwords")
    stopWords = set(stopwords.words("english"))
    text = pd.Series(list_of_text)
    text = text.str.lower()
    text = text.apply(lambda x: re.sub(r"http\S+", "weblink", str(x)))
    text = text.str.replace("[^a-zA-Z0-9]", " ", regex=True)
    text = text.str.replace(" +", " ", regex=True)
    text = text.apply(lambda x: remove_stopwords(x, stopWords))
    return text


def load_reuters():
    """ Returns a list of the reuters docs. """
    return [reuters.raw(x) for x in reuters.fileids()]


def get_vecs(words, index, vocab, model):
    vecs = []
    for word in set(words.split(" ")):
        try:
            vecs.append((index, word))
            if word not in vocab:
                vocab[word] = model[word]
        except KeyError:
            pass
    return vecs


def get_vectors(data, model):
    vocab = {}
    vecs = []
    for i, row in data.iterrows():
        vecs += get_vecs(row["Text"], i, vocab, model)
    return vecs, vocab


def convert_list(li, col_name0):
    fst = pd.Series([x[0] for x in li])
    snd = pd.DataFrame(np.array([x[1] for x in li]))
    df = pd.concat([fst, snd], axis=1)
    df.columns = [x if i > 0 else col_name0 for i, x in enumerate(df.columns)]
    return df


def load_and_preprocess_reuters(model=None):
    if model is None:
        model = load_w2vec()
    text = preprocess_text(load_reuters())
    vecs, vocab = get_vectors(pd.DataFrame(text, columns=["Text"]), model)
    docs = [(i, x) for i, x in enumerate(reuters.fileids())]
    vocab = [(word, vocab[word]) for word in vocab]
    vecs = convert_list(vecs, "index")
    vecs.columns = [x if i != 1 else "word" for i, x in enumerate(vecs.columns)]

    vocab = convert_list(vocab, "word")
    vocab.reset_index(inplace=True)
    vocab.rename({"index": "word_index"}, axis=1, inplace=True)
    return vecs, docs, vocab, model


def load_words(
    words_path="data/google-10000-english-no-swears.txt",
    curse_words_path="data/curse_words.txt",
):
    with open(curse_words_path, "r") as f:
        curse_words = list(set(re.sub(" +", "", f.read()).lower().split("\n")))
    with open(words_path, "r") as f:
        common_words = list(set(f.read().lower().split("\n")))
    return common_words, curse_words


## Load helper classes

Let's load the helper classes. 

### Custom filtering functions

A custom filtering function used for the project. 

In [5]:
class CustomTSNE(BaseEstimator, TransformerMixin):
    """
        TSNE dimensionality reduction as a filtering function. 
        Will reduce 
        
        Will account for the train data only and not account for the test data when constructing a graph
        using eccentricity.

    """

    def __init__(self, train_data, n_components=2, metric="cosine", seed=125342):
        self.train_data = train_data
        self.metric = metric
        self.n_components = n_components

        if train_data.shape[1] > 50:
            # Then perform PCA to reduce the dimensionality to speed up convergence.
            self.pca = PCA(n_components=50)
            self.train_data = self.pca.fit_transform(self.train_data)

        self.tsne = TSNE(
            n_components=self.n_components, metric=self.metric, random_state=seed
        )
        self.train_data_transformed = self.tsne.fit_transform(train_data[:, :101])

    def fit(self, X):
        self._is_fitted = True
        return self

    def transform(self, X):
        check_is_fitted(self, "_is_fitted")
        return self.train_data_transformed

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)

### The search engine

We also create a search engine class. 

In [6]:
class TopologicalSearchEngine:
    def __init__(
        self,
        vecs,  # The vectors with the document index associated
        docs,  # The documents associated
        vocab,  # The vocabulary of all the documents
        model,  # the w2vec model.
        filtering_function,
        n_intervals,
        overlap_frac,
        clusterer=DBSCAN,
        n_jobs=2,
    ):
        self.vecs = vecs
        self.docs = docs
        self.vocab = vocab
        self.model = model
        self.train_data = vocab.iloc[:, 2:].values
        self.clusterer = clusterer(metric="cosine")
        self.filter_func = filtering_function(train_data=self.train_data)
        self.cover = CubicalCover(n_intervals=n_intervals, overlap_frac=overlap_frac)

        # Initialise pipeline
        self.pipe = make_mapper_pipeline(
            filter_func=self.filter_func,
            cover=self.cover,
            clusterer=self.clusterer,
            verbose=False,
            n_jobs=n_jobs,
        )
        self.graph = self.construct_mapper_graph()
        self.train_data_transformed = self.filter_func.train_data_transformed
        self.doc_sizes = self.vecs[["index","word"]].groupby("index").count().reset_index()
        self.construct_indexes()
    
    def query_node(self, word):
        pass
    
    def plot(self):
        plot_static_mapper_graph(se.pipe, se.train_data)
    
    def construct_mapper_graph(self):
        return self.pipe.fit_transform(self.train_data)

    def construct_indexes(self):

        word_nodes = {}

        for node_index, word_indexes in enumerate(
            self.graph.vs["node_elements"]
        ):
            for word_index in word_indexes:
                if word_index not in word_nodes:
                    word_nodes[word_index] = set()
                word_nodes[word_index].add(node_index)

        # Now, we have, for every word,
        # which nodes it is contained inside.
        # Now we want to relate the nodes to the documents.
        doc_nodes = {}

        self.vecs = self.vecs.merge(self.vocab[["word_index", "word"]], on="word")

        incidence = np.zeros(
            (len(self.docs), len(self.graph.vs["node_elements"]))
        )
        for i, row in self.vecs.iterrows():
            for thing in list(word_nodes[row["word_index"]]):
                incidence[row["index"], thing] += 1
        self.word_nodes = word_nodes
        self.doc_nodes = doc_nodes
        self.incidence = incidence / incidence.sum(axis=1)[:, None]
    
    def most_similar_word(self, word):
        """If a word does not exist in the document, fetch its closest semantic neighbour
        
            Example: 
            self.most_similar_word("rejuvenating")
            returns : "rejuvenation"
        """
        return self.vocab.loc[
                        np.argmax(
                            cosine_similarity(
                                self.model[word].reshape(1, -1), self.vocab.iloc[:, 2:].values
                            )
                        ),
                        "word",
                    ]
    
    def most_similar_word(self, word):
        return self.vocab.loc[
            np.argmax(
                cosine_similarity(
                    self.model[word].reshape(1, -1), self.vocab.iloc[:, 2:].values
                )
            ),
            "word",
        ]
    
    def get_nodes(self, word):
        """Returns which nodes a word should belong to in a 
        
        """
        word_ = self.most_similar_word(word)
        word_index = self.vocab.loc[self.vocab["word"] == word_, "word_index"].iloc[0]
        return list(self.word_nodes[word_index])
        
        
        
    def get_similar_words(self, word):
        """Retrieves similar words by looking into the nodes that a word is inside. 
        
            If the work
        
        """
        word_ = self.most_similar_word(word)
        word_index = self.vocab.loc[self.vocab["word"] == word_, "word_index"].iloc[0]
        nodes = list(self.word_nodes[word_index])
        word_vector = self.vocab.loc[self.vocab["word"] == word_].iloc[0, 2:].values
        word_indexes = np.concatenate(
            [self.graph.vs["node_elements"][i] for i in nodes], axis=0
        )
        word_indexes = [x for x in word_indexes if word_index != x]
        return self.vocab.iloc[
            np.array(word_indexes)[
                np.argsort(
                    cosine_similarity(
                        word_vector.reshape(1, -1), self.vocab.iloc[word_indexes, 2:]
                    )[0, 1:]
                )[::-1]
            ],
            1,
        ].values.tolist()
    
    def query(self, query, n_max=10, punish_shortness=False, c=1):
        """Querying the data. 
        
            params:
                query: the query
                    (str)
                n_max: the number of results to retrieve 
                    (int) default: 10
                
                punish_shortness: Whether to induce a penalty on the length of the text. 
                    (boolean) default: False
                
                c: punishing coeffient. Only applicable if punish_shortness is True. Smaller coefficient => longer text.
                    (int, float) default: 1
        """
        query = preprocess_text(query).loc[0].split(" ")
        queries = []
        for q in query:
            if q not in self.vocab["word"].values:
                word = self.most_similar_word(q)
            else:
                word = q
            queries.append(word)
            word_indices = [
                self.vocab.loc[self.vocab["word"] == x, "word_index"].iloc[0]
                for x in queries
            ]

        if punish_shortness is False:
            return [
                reuters.raw(self.docs[x][1])
                for x in np.argsort(
                    self.incidence[
                        :, flatten([self.word_nodes[x] for x in word_indices])
                    ].sum(axis=1)
                )[::-1][:n_max]
            ]
        else:
            size_ = np.log1p(self.doc_sizes['word']*c)
            return [
                reuters.raw(self.docs[x][1])
                for x in np.argsort(
                    self.incidence[
                        :, flatten([self.word_nodes[x] for x in word_indices])
                    ].sum(axis=1)*size_
                )[::-1][:n_max]
            ]
        
    def pretty_result(search_engine, query, n_max = 10, punish_shortness=False, c=1, max_chars=500):
        for i, res in enumerate(se.query(query,n_max=n_max, punish_shortness=punish_shortness, c=c)):
            print("RESULT {} for the query '{}': \n \n{}".format(i, query, res[:max_chars]+"..."))


## Let's get started! 

Now, let's get started! First of all, I will retrieve the model and prepare the vectors for the dataset. 

In [7]:
vecs, docs, vocab, model = load_and_preprocess_reuters()

As a filtering function, I will use a `CustomTSNE`-filtering class in which I use T-SNE for reducing the dimensions. This is of course a discussable and configurable parameters which should be up for discussion and further investigation. 

We construct the search engine through the class `SearchEngine` with which we generate a word graph. 

In [8]:
filtering_function = CustomTSNE

se = TopologicalSearchEngine(
    vecs,  # The vectors with the document index associated
    docs,  # The documents associated
    vocab,  # The vocabulary of all the documents
    model, # The w2vec model. 
    filtering_function,
    n_intervals=10, # Use a large range of intervals - we want many nodes! 
    overlap_frac=0.3, # The overlap is adjustable
    clusterer=DBSCAN, # The clusterer should be cosine-similarity-compatible. 
    n_jobs=2,
)

Now, let's try some queries. First, let's look at searching for `soybeans`. 

Here, we see that most results do actually contain the word soybeans, but it is focused on quite short messages. 

This is expected, as the algorithm created favours shorter documents as words that match take up a larger percentage of the entire document in these. 

In [9]:
se.pretty_result("soybeans", n_max=10, punish_shortness=False)

RESULT 0 for the query 'soybeans': 
 
U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 18,345 WHEAT 11,470 CORN 34,940

  U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 18,345 WHEAT 11,470 CORN 34,940
  

...
RESULT 1 for the query 'soybeans': 
 
U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 20,349 WHEAT 14,070 CORN 21,989

  U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 20,349 WHEAT 14,070 CORN 21,989
  

...
RESULT 2 for the query 'soybeans': 
 
U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 18,616 WHEAT 16,760 CORN 25,193

  U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 18,616 WHEAT 16,760 CORN 25,193
  

...
RESULT 3 for the query 'soybeans': 
 
U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS,  SOYBEANS 16,333 WHEAT 30,917 CORN 36,781

  U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS,  SOYBEANS 16,333 WHEAT 30,917 CORN 36,781
  

...
RESULT 4 for the query 'soybeans': 
 
U.S. EXPORT INSPECTIONS, IN THOUS BUSHELS  SOYBEANS 17,683 WHEAT 20,717 CORN 36,581

  U.S. 

Let's take a look at an example where we can clearly see the semantics of the mapper graph in work. 

Below is a search for "river stops", for which one of the top results is "PANAMANIAN WHEAT SHIP STILL GROUNDED OFF SYRIA". In one way, the "river" matched the "water" part of this, while the "stops" matched the "grounded". This is of course a possible theory, but not a sufficient explanation, and more investigation is needed. Either way, the topological search engine captures semantics and meanings in the documents that BM25 does not (see example below when comparing)!

In [10]:
se.pretty_result("river stops", n_max=5, punish_shortness=True)

RESULT 0 for the query 'river stops': 
 
PANAMANIAN WHEAT SHIP STILL GROUNDED OFF SYRIA
  The Panamanian bulk carrier Juvena is
  still aground outside Tartous, Syria, despite discharging 6,400
  tons of its 39,000-ton cargo of wheat, and water has entered
  the engine-room due to a crack in the vessel bottom, Lloyds
  Shipping Intelligence Service said.
      The Juvena, 53,351 tonnes dw, ran aground outside Tartous
  port basin breakwater on February 25 in heavy weather and rough
  seas.
  

...
RESULT 1 for the query 'river stops': 
 
PENTAGON SAYS U.S. WARSHIPS BEGIN ESCORTING GULF TANKER CONVOY SOUTH FROM KUWAIT

  PENTAGON SAYS U.S. WARSHIPS BEGIN ESCORTING GULF TANKER CONVOY SOUTH FROM KUWAIT
  

...
RESULT 2 for the query 'river stops': 
 
LIBERIAN SHIP GROUNDED IN SUEZ CANAL REFLOATED
  A Liberian motor bulk carrier, the
  72,203 dw tonnes Nikitas Roussos, which was grounded in the
  Suez canal yesterday, has been refloated and is now proceeding
  through the the canal, Lloyds

We know that the word *plantation* exists. Let's have a look at the results and see what follows here. 

In [11]:
se.pretty_result("plantation", n_max=10, punish_shortness=True, c=0.001)

RESULT 0 for the query 'plantation': 
 
TROPICAL FOREST DEATH COULD SPARK NEW DEBT CRISIS
  The death of the world's tropical rain
  forests could trigger a new debt crisis and social and
  biological disasters, scientists and ecologists involved with
  the International Tropical Timber Organisation (ITTO) said.
      At stake is the ability of developing nations, including
  Brazil, Mexico and the Philippines, to service their debts and
  the loss of trade worth hundreds of billions of dollars in
  important sectors such as agricultu...
RESULT 1 for the query 'plantation': 
 
HOG REPORT SHOWS MORE HOGS ON FARMS
  The USDA quarterly hogs and pig report
  yesterday showed more hogs on U.S. farms compared to last year
  as profitability resulting from low grain prices encouraged
  producers to step up production, analysts said.
      Most analysts seemed to agree with Chicago Mercantile
  Exchange floor traders that the report will be viewed as
  bearish to pork futures and futures price

For *plantation*, we see that the results are largely focused on documents related to forests, trees and farms. In other words, we retrieve documents that are semantically similar to the query we had. 

Let's look at the result for China. Here, we can see that we have documents that is clearly about China as well for most of the top results. 

In [12]:
se.pretty_result("china",n_max=5, punish_shortness=True)

RESULT 0 for the query 'china': 
 
USDA ACCEPTS OFFERS FOR 550,000 TONNES  OF BONUS WHEAT FOR CHINA

  USDA ACCEPTS OFFERS FOR 550,000 TONNES  OF BONUS WHEAT FOR CHINA
  

...
RESULT 1 for the query 'china': 
 
U.S. EXPORTERS REPORT 455,000 TONNES  OF WHEAT SOLD TO CHINA FOR 1986/87 AND 1987/88

  U.S. EXPORTERS REPORT 455,000 TONNES  OF WHEAT SOLD TO CHINA FOR 1986/87 AND 1987/88
  

...
RESULT 2 for the query 'china': 
 
U.S. INDUSTRIAL CAPACITY USE RATE 81.2 PCT IN SEPTEMBER, UNCHANGED FROM AUGUST

  U.S. INDUSTRIAL CAPACITY USE RATE 81.2 PCT IN SEPTEMBER, UNCHANGED FROM AUGUST
  

...
RESULT 3 for the query 'china': 
 
CANADIAN PACIFIC LTD 4TH QTR OPER NET 30 CTS VS 20 CTS

  CANADIAN PACIFIC LTD 4TH QTR OPER NET 30 CTS VS 20 CTS
  

...
RESULT 4 for the query 'china': 
 
USDA ESTIMATES CHINA WHEAT
  The U.S. Agriculture Department
  projected China's 1986/87 wheat crop at 90.30 mln tonnes, vs
  88.50 mln tonnes last month. It estimated the 1985/86 crop at
  85.81 mln tonnes, vs 85

We can see that the algorithm clearly favors shorter texts. For this reason, I introduced a way to punish the size of the document, favouring larger documents. Let's compare the search for "oil" with different regularizations on the length of the text. 

In [13]:
se.pretty_result("oil",n_max=5)

RESULT 0 for the query 'oil': 
 
PERMIAN RAISES CRUDE OIL POSTINGS 50 CTS
 A BBL, WTI TO 19 DLRS

  PERMIAN RAISES CRUDE OIL POSTINGS 50 CTS
   A BBL, WTI TO 19 DLRS
  

...
RESULT 1 for the query 'oil': 
 
UNOCAL RAISED CRUDE OIL POSTINGS 50 CTS/BBL, WTI NOW 19
DLRS/BBL

  UNOCAL RAISED CRUDE OIL POSTINGS 50 CTS/BBL, WTI NOW 19
  DLRS/BBL
  

...
RESULT 2 for the query 'oil': 
 
API SAYS DISTILLATE STOCKS OFF 4.4 MLN BBLS, GASOLINE OFF 30,000, CRUDE UP 700,000

  API SAYS DISTILLATE STOCKS OFF 4.4 MLN BBLS, GASOLINE OFF 30,000, CRUDE UP 700,000
  

...
RESULT 3 for the query 'oil': 
 
API SAYS DISTILLATE STOCKS UP 628,000 BBLS, GASOLINE UP 2.29 MLN, CRUDE UP 8.52 MLN

  API SAYS DISTILLATE STOCKS UP 628,000 BBLS, GASOLINE UP 2.29 MLN, CRUDE UP 8.52 MLN
  

...
RESULT 4 for the query 'oil': 
 
API SAYS DISTILLATES OFF 1.95 MLN BARRELS, GASOLINE OFF 3.98 MLN, CRUDE UP 2.42 MLN

  API SAYS DISTILLATES OFF 1.95 MLN BARRELS, GASOLINE OFF 3.98 MLN, CRUDE UP 2.42 MLN
  

...


In [14]:
se.pretty_result("oil",n_max=5,punish_shortness=True,c=0.01)

RESULT 0 for the query 'oil': 
 
SINGAPORE PETROLEUM CO RAISES OIL PRODUCT POSTINGS
  Singapore Petroleum Co Pte Ltd said it
  will raise posted prices for its products from June 19, by one
  cent/gallon for lpg, naphtha and gasoline, two cents for gas
  oil and by 60 cents/barrel for marine diesel oil.
      New prices are - lpg 36.0 cents/gallon, chemical naphtha
  47, unleaded reformate 65.8, 0.4 gm lead 97 octane 61.3, 95
  octane 59.3, 92 octane 55.5, 85 octane 49.5, 0.125 gm lead 97
  octane 64.3, 92 octane 58.5, 85 octan...
RESULT 1 for the query 'oil': 
 
HOUSTON METALS' MINE YIELDS POSITIVE RESULTS
  &lt;Houston Metals Corp> said the
  the first phase of the underground rehabilitation, extensive
  drilling and bulk sampling program at its Silver Queen Mine has
  yielded positive results.
      Houston said representative assays from the 2,750 ft and
  2,600 ft levels at the south end of the mine established ore
  deposits in the following ranges: copper, 3.7 pct to 5.08 pct,
 

## A longer query

Let's try something a bit longer. I think we want to search for something quite specific. The search query *finance markets in china plummet in crisis* does not retrieve any particular document, but does retrieve related ones, as it seems to retrieve documents that is on the topic of finance in Asia. 

In [15]:
se.pretty_result("finance markets in china plummet in crisis", punish_shortness=True, c=0.01)

RESULT 0 for the query 'finance markets in china plummet in crisis': 
 
SINGAPORE BANKS SAY DIVERSIFICATION KEY TO GROWTH
  Singapore's major banks are
  diversifying and gradually shifting their asset holdings from
  loans to debt instruments, banking sources said.
      The banks following the trend are the &lt;Overseas Union Bank
  Ltd>, &lt;United Overseas Bank Ltd>, &lt;Oversea-Chinese Banking
  Corporation> and the &lt;Development Bank of Singapore Ltd>.
      The shift towards securitisation has been helped by
  volatile financial markets which have developed...
RESULT 1 for the query 'finance markets in china plummet in crisis': 
 
HK BANK EXPECTED TO POST 10 TO 13 PCT PROFIT RISE
  The Hongkong and Shanghai Banking Corp
  &lt;HKBH.HK> is likely to show a rise in profit of between 10 and
  13 pct for 1986, reflecting stronger than expected loan growth,
  share analysts polled by Reuters said.
      Their estimates of the bank's net earnings for last year
  ranged from 2.99 to 3

Here, I wanted to see whether an actual headline lands high. Although it does not land as the highest, it still lands among the top results. 

In [16]:
se.pretty_result("JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS", punish_shortness=True, c=0.01)

RESULT 0 for the query 'JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS': 
 
SINGAPORE BANKS SAY DIVERSIFICATION KEY TO GROWTH
  Singapore's major banks are
  diversifying and gradually shifting their asset holdings from
  loans to debt instruments, banking sources said.
      The banks following the trend are the &lt;Overseas Union Bank
  Ltd>, &lt;United Overseas Bank Ltd>, &lt;Oversea-Chinese Banking
  Corporation> and the &lt;Development Bank of Singapore Ltd>.
      The shift towards securitisation has been helped by
  volatile financial markets which have developed...
RESULT 1 for the query 'JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS': 
 
NIPPON LIFE, SHEARSON TIE-UP SEEN SETTING TREND
  Nippon Life Insurance Co's 538 mln dlr
  purchase of a 13 pct stake in Shearson Lehman Brothers Inc
  brokerage unit is a shrewd move that other Japanese insurers
  are likely to follow, securities analysts said.
      The investment in one of Wall Street's top brokerage houses
  is likely to pay of

I must admit that the algorithm is currently not perfect; it does need improvements. However, it works! And it is all thanks to `giotto-tda`!!

## How does it work? A step by step guide

Let's look at the query step by step. 

First of all, we say the query is `china`. 

In [17]:
query = "china"

Now, which words are similar according to the mapper graph? The `get_most_similar_words`-function will return all words in the same nodes, sorted by their similarity with the query word. 

In [18]:
se.get_similar_words(query)

['corp',
 'cia',
 'cia',
 'taiwan',
 'beg',
 'mir',
 'beg',
 'internatio',
 'yesterdays',
 'pref',
 'yeras',
 'futher',
 'imf',
 'abouth',
 'intel',
 'davos',
 'olympics',
 'frist',
 'proceding',
 'oponents',
 'lo',
 'gmt',
 'uni',
 'fiji',
 'intergroup',
 'ussr',
 'rpt',
 'pacific',
 'kyoto',
 'oman',
 'quo',
 'banc',
 'mot',
 'august',
 'aare',
 'anw',
 'wll',
 'comm',
 'op',
 'yesterdays',
 'abouth',
 'aboout',
 'aboout',
 'earlie',
 'todays',
 'enron',
 'alos',
 'wil',
 'fmr',
 'itc',
 'ira',
 'ltd',
 'pharm',
 'enron',
 'august',
 'actu',
 'wold',
 'alos',
 'yang',
 'abn',
 'yeras',
 'eldorado',
 'ops',
 'asst',
 'icelandic',
 'kep',
 'todays',
 'frist',
 'intel',
 'aare',
 'amt',
 'amt',
 'luna',
 'luna',
 'upt',
 'mtg',
 'togo',
 'advo',
 'mote',
 'pak',
 'revers',
 'ual',
 'assoc',
 'indo',
 'bering',
 'pdt',
 'dwi',
 'coms',
 'hoc',
 'certs',
 'usda',
 'wmd',
 'mote',
 'shing',
 'sinc',
 'earlie',
 'dd',
 'edt',
 'bering',
 'nics',
 'hugh',
 'cdt',
 'oks']

Okay, so these words are the words that the graph perceive as similar to our word query. These words therefore function as an additional link to different documents. In other words; documents that contain these words more will be favoured over others. 

Which documents are these? 

In [19]:
def get_documents_containing_words(self, list_of_words, n_max=10):
    docs_with_words = self.vecs.loc[self.vecs['word'].isin(list_of_words)]['index'].unique().tolist()
    return [reuters.raw(self.docs[x][1]) for x in docs_with_words]

for doc in get_documents_containing_words(se, se.get_similar_words(query))[:10]:
    print("The following document has words associated with {}: \n\n{}\n\n".format(query, doc))

The following document has words associated with china: 

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major 

This is of course just a subsample, but these are the documents whose relevance will be ranked based on how many words they have in **the same nodes as the query**.

## Comparison with another work 

Here, I will compare my search engine with a classical method, the *bm25* search method. Following [its guide on how to use it](https://pypi.org/project/rank-bm25/), I rank and retrieve results for the same keyword searches as for my method. 

In [20]:
from rank_bm25 import BM25Okapi

class BM25:
    def __init__(self, corpus):
        self.corpus = corpus
        self.tokenized_corpus = [doc.split(" ") for doc in corpus]
        self.bm25 = BM25Okapi(self.tokenized_corpus)
    def query(self, query, n_max=5):
        tokenized_query = query.split(" ")
        return self.bm25.get_top_n(tokenized_query, self.corpus, n=5)
    
    def pretty_result(self, query, n_max=10, max_chars=500):
        for i, res in enumerate(self.query(query,n_max=n_max)):
            print("RESULT {} for the query '{}': \n \n{}".format(i, query, res[:max_chars]))
        

In [21]:
bm25 = BM25([reuters.raw(x) for x in reuters.fileids()])

### Oil

In [22]:
bm25.pretty_result("oil")

RESULT 0 for the query 'oil': 
 
DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
  Diamond Shamrock Corp said that
  effective today it had cut its contract prices for crude oil by
  1.50 dlrs a barrel.
      The reduction brings its posted price for West Texas
  Intermediate to 16.00 dlrs a barrel, the copany said.
      "The price reduction today was made in the light of falling
  oil product prices and a weak crude oil market," a company
  spokeswoman said.
      Diamond is the latest in a line of U.S. oil companies that
  have cut
RESULT 1 for the query 'oil': 
 
GLOBAL PETROLEUM &lt;GNR> UPS HEAVY FUEL PRICES
  Global Petroleum Corp said it had raised
  the contract prices for heavy fuel oil from 25 cts to one dlr
  per barrel, effective today.
      The company said 0.3 pct fuel oil is up one dlr a barrel to
  22.25 dlrs a barrel. They said 0.5 pct fuel oil is up by 50 cts
  to 21.95 dlrs a barrel.
      Global raised one pct fuel oil by 35 cts to 20.25 dlrs a
  barrel. The company rais

### Soybeans

In [23]:
bm25.pretty_result("soybeans")

RESULT 0 for the query 'soybeans': 
 
SOYCOMPLEX COULD RALLY ON TIGHT U.S. FEED SUPPLY
  Nearby months in soybean and soymeal
  futures could post a short-term rally on tightening supply of
  livestock feed, even if favorable growing conditions keep the
  new crop outlook bearish, traders said.
      "A lot of soymeal dealers are just getting very worried
  about where processors will get their soybeans this Summer,"
  one Illinois soyproduct dealer said.
      Processors are competing vigorously with river dealers for
  the few soy
RESULT 1 for the query 'soybeans': 
 
U.S. MARKET LOAN NOT THAT ATTRACTIVE-BOSCHWITZ
  A marketing loan for U.S. wheat,
  feedgrains and soybeans would do nothing to help the surplus
  production situation and would be extremely costly, Sen. Rudy
  Boschwitz (R-Minn.) said.
      "I think I would not support a marketing loan now," he told
  the House agriculture subcommittee on wheat, soybeans and
  feedgrains. Boschwitz was one of the original supporters o

### China

Both BM25 and the topological search engine retrieves results that are not necessarily perfect. In particular, result 5 here is not exactly related to China. 

In [24]:
bm25.pretty_result("china")

RESULT 0 for the query 'china': 
 
PEARSON CONCENTRATES ON FOUR SECTORS
  Pearson Plc &lt;PSON.L> said the recent
  sale of its Fairey Engineering companies, in a 51.5 mln stg
  management buy-out, was part of its policy of concentrating on
  four key sectors.
      In a statement with its 1986 results, the company said its
  information and entertainments sector's Financial Times, FT,
  newspaper had record sales and profits.
      The FT is subject to a 70 mln stg investment programme,
  with the printing and publishing operati
RESULT 1 for the query 'china': 
 
&lt;A.H.A. AUTOMOTIVE TECHNOLOGIES CORP> YEAR NET
  Shr 43 cts vs 52 cts
      Shr diluted 41 cts vs 49 cts
      Net 1,916,000 vs 2,281,000
      Revs 32.6 mln vs 22.6 mln
  


RESULT 2 for the query 'china': 
 
ALLEGHENY INTERNATIONAL &lt;AG> SELLS WILKINSON
  Allegheny International Inc said it
  sold its Wilkinson Sword Consumer Group to the &lt;Swedish Match
  Co> of Stockholm for for 230 mln dlrs.
      After settlement

## River stops

Here is where it gets interesting. In my (subjective opinion), the topological search engine fetches more relevant results for the first result than BM25! This is actually very interesting. 

In [25]:
bm25.pretty_result("river stops")

RESULT 0 for the query 'river stops': 
 
TOTAL PETROLEUM &lt;TPN> SHUTS TEXAS PIPELINES
  Total Petroleum NA &lt;TPN> shut down
  several small crude oil pipelines operating near the
  Texas/Oklahoma border last Friday as a precaution against
  damage from local flooding, according to Gary Zollinger,
  manager of operations.
      Total shut a 12-inch line that runs across the Ouachita
  River from Wynnewood to Ardmore with a capacity of 62,000 bpd
  as well as several smaller pipelines a few inches wide with
  capacities of several th
RESULT 1 for the query 'river stops': 
 
SOME SHIPPING RESTRICTIONS REMAIN ON RHINE
  Limited shipping restrictions due to high
  water remain in force on parts of the West German stretch of
  the Rhine river between the Dutch border and the city of Mainz
  but most are expected to be lifted this weekend.
      water authority officials said The restrictions, caused by
  high water levels, include speed limits and directives to keep
  to the middle of th

### Plantation

Here, results are slightly worse for BM25, as it does not capture the related topics to plantation as the topological search engine does. 

In [26]:
bm25.pretty_result("plantation")

RESULT 0 for the query 'plantation': 
 
MIXED ASIAN REACTION TO NEW RUBBER PACT
  Governments of major Asian
  producing countries have welcomed the conclusion of a new
  International Natural Rubber Agreement (INRA), but growers and
  traders are unhappy with the development, according to views
  polled by Reuter correspondents.
      Officials in Malaysia, Indonesia and Thailand, which
  produce the bulk of the world's rubber, said they expected the
  new pact to continue to stabilise prices and help their rubber
  industries remain
RESULT 1 for the query 'plantation': 
 
ALLEGHENY INTERNATIONAL &lt;AG> SELLS WILKINSON
  Allegheny International Inc said it
  sold its Wilkinson Sword Consumer Group to the &lt;Swedish Match
  Co> of Stockholm for for 230 mln dlrs.
      After settlement of intercompany transactions between the
  Wilkinson Sword groups and Allegheny, the net payment by
  Swedish Match will amount to about 160 mln dlrs.
      The Wilkinson Sword Group was transferred to 

In [27]:
bm25.pretty_result("finance markets in china plummet in crisis")

RESULT 0 for the query 'finance markets in china plummet in crisis': 
 
BALLADUR HAS HAD CONTACT WITH G-7 MINISTERS
  French Finance Minister Edouard Balladur
  has been in contact with several Finance Ministers from the
  Group of Seven leading industrial countries, in particular West
  German Finance Minister Gerhard Stoltenberg, to discuss the
  crisis on world markets, Finance Ministry sources said.
      They did not say whether the contacts had led to concerted
  action on the markets or merely an exchange of views.
      But they added that French ministry of
RESULT 1 for the query 'finance markets in china plummet in crisis': 
 
UGANDA PULLS OUT OF COFFEE MARKET - TRADE SOURCES
  Uganda's Coffee Marketing Board (CMB)
  has stopped offering coffee on the international market because
  it is unhappy with current prices, coffee trade sources said.
      The board suspended offerings last week but because of its
  urgent need for cash it was not immediately clear how long it
  coul

For the above result, BM25 is clearly worse; it is not at all about finance markets in Asia.

### JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS

Not even BM25 manages to get the actual headline as its top result. It seems like the topological search engine is not too bad after all. 

In [28]:
bm25.pretty_result("JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS")

RESULT 0 for the query 'JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS': 
 
BALDRIGE SAYS U.S. TO GO AHEAD WITH JAPANESE SANCTIONS

  BALDRIGE SAYS U.S. TO GO AHEAD WITH JAPANESE SANCTIONS
  


RESULT 1 for the query 'JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS': 
 
JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS
  The dollar's tumble to a record low of
  144.70 yen in Tokyo today motivated some major Japanese
  investors to lighten their U.S. Bond inventory further and is
  expected to spur diversification into investment assets
  including foreign and domestic shares, dealers said.
      The key U.S. 7-1/2 pct Treasury bond due 2016 fell to a low
  of 96.08-12 in early Tokyo trade against the 98.05-06 New York
  finish, then recovered to 96.20-22.
      Some 
RESULT 2 for the query 'JAPANESE SEEN LIGHTENING U.S. BOND HOLDINGS': 
 
VIACOM SAID IT HAS NEW NATIONAL AMUSEMENTS, MCV HOLDINGS BIDS

  VIACOM SAID IT HAS NEW NATIONAL AMUSEMENTS, MCV HOLDINGS BIDS
  


RESULT 3 for the query 'JAPANE

## Wrap-up and further work

In this work, I demonstrated a proof-of-concept, showing how it is possible to use `giotto-tda` and its packages within NLP and for creating a semantic search engine. As seen with some examples (see `river stops`) the Topological Search Engine captures semantics that BM25, a classical search engine, does not manage to capture.  

Some critics might argue that this is not applied topology. I beg to differ and regard this as a potentially new use case of topology; rather than an analysis, the topological functions allowed through the package of `giotto-tda` has allowed us to build a search engine for natural language! 

There are however a few points that needs to be pointed out. 

- I do not currently leverage the edges explicitly, which probably can be of value. In an extension of this, the edges should be used in some way to increase the search radius and possibly retrieve more relevant search results. As an example, if a document has many of its words in a neighbor to the node of the query, it can still be a highly relevant document for the query in question. 
- Constructing a Mapper graph is in many ways an arbitrary process, in which it is not always clear which filtering function and which overlap fractions and number of partitions we should use. In further work, one should investigate whether it is better with richer representations containing more nodes (higher number of partitions) and rich connectivity, as opposed to larger nodes (with fewer partitions) and lower connectivity. A higher connectivity means more words will belong to more nodes, yielding a more descriptive representation of each document. It is clear that this method can be further refined by further examination. 
- The filtering function should also be investigated as to which one is the best to use. 
- Currently, due to the way we rank, the algorithm clearly favors smaller documents, simply because we disregard their size. One way to possibly mitigate this is to simply induce a bias based on the size of the document. We have introduced a punisher for smaller documents where we can intentionally introduce a bias towards larger or smaller documents, but a better solution rather than an adjustable hyper parameter would be to prefer. 
- My last point relates to my first and fourth point; how we rank the documents now, by matching the query's corresponding nodes with the document having its highest percentage of words in the same nodes will favor smaller documents and simultaneously not regard neighboring nodes, which is especially important in a rich graph format. 


More work is absolutely needed to improve these results; most of all, this method can clearly be fine-tuned to work better. Either way, this notebook serve as a demonstration that `giotto-tda` and topology has its place within Natural Language Processing. It also shows, for a few sets of queries, that it is not necessarily inferior to classical methods such as BM25; with more adjustments, it could probably even perform better. 