# Project and Developer Similarity

In this notebook, we study what projects and developers are about and how to find projects or devs that are close to them.

To achieve that, we use [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model) through the excellent [BigARTM](http://docs.bigartm.org/en/stable/index.html) library. Roughly speaking, the topic model we use sees each code file as stemming from some topics (e.g., `setup.py` might come from topics about packaging and documentation).
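
To make "stemming from some topics" a bit more concrete, here is a minimal sketch with made-up numbers (the topics, identifiers and weights are all hypothetical, not learned from any data). It mixes two topic-level identifier distributions according to a file's topic weights:

```python
# Hypothetical topics: each one is a probability distribution over identifiers.
topics = {
    "packaging": {"setup": 0.6, "version": 0.3, "doc": 0.1},
    "documentation": {"setup": 0.1, "version": 0.1, "doc": 0.8},
}
# Hypothetical topic weights for a single file (they sum to 1).
file_topics = {"packaging": 0.7, "documentation": 0.3}

# The identifier distribution expected for the file is the weighted mix of its topics.
file_identifiers = {}
for topic, weight in file_topics.items():
    for identifier, prob in topics[topic].items():
        file_identifiers[identifier] = file_identifiers.get(identifier, 0.0) + weight * prob

print(file_identifiers)
```

Training a topic model goes the other way around: it observes the identifiers of each file and estimates both the topics and the per-file weights.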

To be able to apply this topic modeling technique, we need to transform each code file into a bag of identifiers.
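
As a small illustration of what a bag of identifiers is (the identifier list below is invented, not extracted from a real file), the order of appearance is dropped and only the counts are kept:

```python
from collections import Counter

# Hypothetical identifiers, in the order they might appear in a small file.
identifiers = ["open", "path", "read", "path", "close", "path"]

# A bag of identifiers only keeps how often each identifier occurs.
bag = Counter(identifiers)
print(bag.most_common())
```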

We start by defining the paths we'll use for our inputs and outputs later in this notebook.

A note on how we use the cells in this notebook: __every cell should depend only on the first one (which defines the paths)__. To achieve that, we save all results to files and load them back where needed. This helps mitigate the problems that arise from stateful notebooks.

In [None]:
from enum import Enum
from os import makedirs
from os.path import join as path_join
from typing import Union

from utils import DirsABC, FilesABC, Run

class Files(FilesABC, Enum):
    IDENTIFIERS = ["identifiers.jsonl.bz2"]
    SPLIT_IDENTIFIERS = ["split-identifiers.jsonl.bz2"]
    FILTERED_IDENTIFIERS = ["filtered-identifiers.jsonl.bz2"]
    IDENTIFIERS_COUNTER = ["identifiers-counter.pickle"]
    COMMON_IDENTIFIERS_COUNTER = ["common-identifiers-counter.pickle"]
    VW_DATASET = ["dataset.vw"]
    ARTM_DICT = ["bigartm", "identifiers.dict"]
    ARTM_STAGE1 = ["bigartm", "stage1.model"]
    ARTM_STAGE2 = ["bigartm", "stage2.model"]
    ARTM_FILES_TOPICS = ["bigartm", "files-topics.bigartm"]
    ARTM_TOPICS_IDENTIFIERS = ["bigartm", "topics-identifiers.bigartm"]
    PYLDAVIS_DATA = ["pyldavis-data.pickle"]
    CONTRIBUTIONS = ["contributions.pickle"]
    REPOS_TOPICS = ["repos.pickle"]
    AUTHORS_TOPICS = ["authors.pickle"]


class Dirs(DirsABC, Enum):
    ARTM_LOGS = ["bigartm", "logs"]
    ARTM_BATCHES = ["bigartm", "batches"]


full_run = Run("similarity", "full")
limited_run = Run("similarity", "limited")
run = full_run

## Preprocessing

We start with the bulk of the preprocessing: extracting identifiers from code with [`gitbase`](https://docs.sourced.tech/gitbase). `gitbase` exposes git repositories as SQL databases with the following schema:

![`gitbase` schema](img/gitbase-schema.png)

Ok, it's probably hard to read. It's likely possible to use `Right click` > `View image` to explore it, but we'll extract the tables we're going to use below to make our life easy:

![tables](img/tables.png)

Using those 3 tables, we can get identifiers with the `uast_extract(blob, key) text array` [`gitbase` function](https://docs.sourced.tech/gitbase/using-gitbase/functions), that leverages [Babelfish](https://doc.bblf.sh/). The nice point about using Babelfish is that since it exposes the same API for different languages, we write a query once, and we get a preprocessing that works readily for plenty of languages: c#, c++, c, cuda, opencl, metal, bash, shell, go, java, javascript, jsx, php, python, ruby and typescript.

All we have to do now is to write the query!

In [None]:
from bz2 import open as bz2_open
from json import dumps as json_dumps, loads as json_loads
from pprint import pprint

from utils import SUPPORTED_LANGUAGES, query_gitbase


def extract_identifiers(identifiers_path: str, limit: int = 0):
    sql = """
        SELECT
            repository_id,
            LANGUAGE(file_path) AS lang,
            file_path,
            uast_extract(
                uast(blob_content,
                     LANGUAGE(file_path),
                     '//uast:Identifier'),
                'Name'
            ) AS identifiers
        FROM refs
        NATURAL JOIN commit_files
        NATURAL JOIN blobs
        WHERE
            ref_name = 'HEAD'
            AND NOT IS_VENDOR(file_path)
            AND NOT IS_BINARY(file_path)
            AND LANGUAGE(file_path) IN (%s)
        %s
    """ % (
        ",".join("'%s'" % language for language in SUPPORTED_LANGUAGES),
        "LIMIT %d" % limit if limit > 0 else ""
    )
    print("Extracting identifiers with the following gitbase query:")
    print(sql)
    print("First extracted rows:")
    with bz2_open(identifiers_path, "wt", encoding="utf8") as fh:
        shown = 0
        for row in query_gitbase(sql):
            if row["identifiers"] is None:
                continue
            row["identifiers"] = json_loads(row["identifiers"])
            # Only preview the first 10 rows (an `if`, not a loop over the same row).
            if shown < 10:
                shown += 1
                print("Row %d:" % shown)
                pprint(row)
            fh.write("%s\n" % json_dumps(row))


extract_identifiers(run.path(Files.IDENTIFIERS))

Now that we have a file that stores all the identifiers, we can refine it until it is ready for topic modeling! The remaining steps are to further split each identifier (`set_timer` should become `set` and `timer`), and to apply some stemming (`connecting` and `connection` should both result in `connect`; note that the resulting stem might not be an English word).
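
To give an idea of what the splitting step involves, here is a rough sketch. The `split_identifier` helper below is hypothetical: it is not the `TokenParser` from `utils` used in this notebook, whose internals are not shown here, and it leaves stemming out entirely.

```python
import re


def split_identifier(identifier: str) -> list:
    """Split on non-alphanumeric characters and camelCase boundaries, then lowercase."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", identifier):
        # Runs of capitals stay together ("HTTP"), capitalized words are kept
        # whole ("Server"), and digits form their own parts.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return [part.lower() for part in parts]


for name in ("set_timer", "setTimer", "HTTPServer"):
    print(name, "->", split_identifier(name))
```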

In [None]:
from bz2 import open as bz2_open
from collections import Counter
from json import dumps as json_dumps, loads as json_loads
from pickle import dump as pickle_dump

from utils import TokenParser


def split_identifiers(identifiers_path: str,
                      split_identifiers_path: str,
                      counter_path: str):
    with bz2_open(identifiers_path, "rt", encoding="utf8") as fh_identifiers, \
            bz2_open(split_identifiers_path, "wt", encoding="utf8") as fh_split_identifiers, \
            open(counter_path, "wb") as fh_counter:
        identifiers_counter = Counter()
        token_parser = TokenParser()
        shown = set()
        print("10 first splits:")
        for row_str in fh_identifiers:
            row = json_loads(row_str)
            identifiers = row.pop("identifiers")
            split_identifiers = []
            for identifier in identifiers:
                split_identifier = list(token_parser(identifier))
                split_identifiers.extend(split_identifier)
                if (len(shown) < 10
                        and identifier not in shown
                        and len(split_identifier) > 1):
                    shown.add(identifier)
                    print("Splitting %s into (%s)" % (
                        identifier,
                        ", ".join(split_identifier)
                    ))
            identifiers_counter.update(split_identifiers)
            row["split_identifiers"] = split_identifiers
            fh_split_identifiers.write("%s\n" % json_dumps(row))
        pickle_dump(identifiers_counter, fh_counter)


split_identifiers(run.path(Files.IDENTIFIERS),
                  run.path(Files.SPLIT_IDENTIFIERS),
                  run.path(Files.IDENTIFIERS_COUNTER))

The resulting identifiers still need some processing: some of them appear only a few times and will bring mostly noise to our models. We will discard them now. The first step is to find out which identifiers are common enough to be kept.

In [None]:
from pickle import dump as pickle_dump, load as pickle_load


def build_common_counter(count_threshold: int,
                         counter_path: str,
                         common_counter_path: str):
    with open(counter_path, "rb") as fh:
        identifiers_counter = pickle_load(fh)
    print("Found %d different identifiers" % len(identifiers_counter))

    common_identifiers_counter = identifiers_counter.copy()
    for identifier, count in identifiers_counter.items():
        if count < count_threshold:
            del common_identifiers_counter[identifier]
    with open(common_counter_path, "wb") as fh:
        pickle_dump(common_identifiers_counter, fh)
    print("Found %d different identifiers after pruning"
          % len(common_identifiers_counter))


build_common_counter(10,
                     run.path(Files.IDENTIFIERS_COUNTER),
                     run.path(Files.COMMON_IDENTIFIERS_COUNTER))

Now that we know which are the common identifiers, we can recreate our mapping from files to identifiers with only the ones that we want to keep.

In [None]:
from bz2 import open as bz2_open
from json import dumps as json_dumps, loads as json_loads
from pickle import load as pickle_load


def filter_identifiers(split_identifiers_path: str,
                       common_counter_path: str,
                       filtered_identifiers_path: str):
    with bz2_open(split_identifiers_path, "rt", encoding="utf8") as fh_split_identifiers, \
            open(common_counter_path, "rb") as fh_common_counter, \
            bz2_open(filtered_identifiers_path, "wt", encoding="utf8") as fh_filtered_identifiers:
        common_identifiers_counter = pickle_load(fh_common_counter)
        for row_str in fh_split_identifiers:
            row = json_loads(row_str)
            row["split_identifiers"] = [identifier
                                        for identifier in row["split_identifiers"]
                                        if identifier in common_identifiers_counter]
            if row["split_identifiers"]:
                fh_filtered_identifiers.write("%s\n" % json_dumps(row))


filter_identifiers(run.path(Files.SPLIT_IDENTIFIERS),
                   run.path(Files.COMMON_IDENTIFIERS_COUNTER),
                   run.path(Files.FILTERED_IDENTIFIERS))

## Topic Modeling

The preprocessing is over! We now create the input dataset, in the VW format (see https://bigartm.readthedocs.io/en/stable/tutorials/datasets.html). We replace spaces in `file_path` to avoid creating false identifiers (VW would consider the latter parts of a path containing spaces to be identifiers).
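
For illustration, here is roughly what one line of such a dataset looks like. The `to_vw_line` helper and the document below are hypothetical; the sketch only assumes the "title followed by `token:count` pairs" layout described above.

```python
from collections import Counter


def to_vw_line(doc_id: str, tokens: list) -> str:
    """Build one line: a title without spaces, followed by token:count pairs."""
    counts = Counter(tokens)
    pairs = " ".join("%s:%d" % (token, count)
                     for token, count in sorted(counts.items()))
    return "%s %s" % (doc_id.replace(" ", "_"), pairs)


print(to_vw_line("repo//Python//my file.py", ["set", "timer", "set"]))
```

Without the replacement of the space, `file.py` would be read as a token of the document instead of being part of its title.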

In [None]:
def build_file_id(repository_id: str, lang: str, file_path: str):
    return "%s//%s//%s" % (repository_id,
                           lang,
                           file_path.replace(" ", "_"))

In [None]:
from bz2 import open as bz2_open
from collections import Counter
from json import loads as json_loads


def build_vw_dataset(filtered_identifiers_path: str,
                     vw_dataset_path: str):
    # Braces are needed so that IPython expands the Python variable.
    !rm -rf {vw_dataset_path}
    with bz2_open(filtered_identifiers_path, "rt", encoding="utf8") as fh_filtered_identifiers, \
            open(vw_dataset_path, "w") as fh_vw:
        shown = 0
        print("Showing first 10 lines:")
        for row_str in fh_filtered_identifiers:
            counter = Counter()
            row = json_loads(row_str)
            counter.update(row["split_identifiers"])
            line = "%s %s" % (
                build_file_id(row["repository_id"],
                              row["lang"],
                              row["file_path"]),
                " ".join("%s:%d" % (identifier, count)
                     for identifier, count in counter.items())
            )
            if shown < 10:
                shown +=1
                print("Line %d: %s" % (shown, line))
            fh_vw.write("%s\n" % line)


build_vw_dataset(run.path(Files.FILTERED_IDENTIFIERS),
                 run.path(Files.VW_DATASET))

BigARTM has its own binary format to efficiently store and access the data used to build its topic models. The next step is therefore to transform our VW dataset into the correct BigARTM format.

In [None]:
def prepare_bigartm(vw_dataset_path: str,
                    artm_batches_path: str,
                    artm_dict_path: str,
                    artm_logs_path: str):
    !rm -rf {artm_batches_path} {artm_dict_path}
    !bigartm \
        --log-dir {artm_logs_path} \
        -c {vw_dataset_path} \
        -p 0 \
        --save-batches {artm_batches_path} \
        --save-dictionary {artm_dict_path}


prepare_bigartm(run.path(Files.VW_DATASET),
                run.path(Dirs.ARTM_BATCHES),
                run.path(Files.ARTM_DICT),
                run.path(Dirs.ARTM_LOGS))

We can now train our first topic model! As per the BigARTM documentation, we don't use too much magic (yet), and only use one regularizer: the decorrelation one. It pushes topics to differ from one another, so that no two topics end up covering the same concepts.

In [None]:
from multiprocessing import cpu_count


def train_topic_model(artm_batches_path: str,
                      artm_dict_path: str,
                      artm_stage1_path: str,
                      artm_logs_path: str,
                      n_topics: int = 64,
                      n_epochs: int = 100,
                      n_cpus: int = cpu_count() * 2,
                      seed: int = 2019,
                      regularizer: str = '"1000 Decorrelation"'):
    !bigartm \
        --log-dir {artm_logs_path} \
        --use-batches {artm_batches_path} \
        --use-dictionary {artm_dict_path} \
        -t {n_topics} \
        -p {n_epochs} \
        --threads {n_cpus} \
        --rand-seed {seed} \
        --regularizer {regularizer} \
        --save-model {artm_stage1_path} \
        --force


train_topic_model(run.path(Dirs.ARTM_BATCHES),
                  run.path(Files.ARTM_DICT),
                  run.path(Files.ARTM_STAGE1),
                  run.path(Dirs.ARTM_LOGS))

This topic model is probably quite good already, but since BigARTM is a powerful library, we can improve it even further by making it sparser. A document with sparse topics has a few topics with high weight, and a weight of 0 for all the others.

Example of non-sparse documents:

|           | Backend | Logging | Machine Learning | Data Processing |
|-----------|---------|---------|------------------|-----------------|
| server.py | 0.8     | 0.1     | 0.06             | 0.04            |
| utils.py  | 0.1     | 0.7     | 0.08             | 0.12            |

Same documents but with sparse topics:

|           | Backend | Logging | Machine Learning | Data Processing |
|-----------|---------|---------|------------------|-----------------|
| server.py | 0.85    | 0.15    | 0                | 0               |
| utils.py  | 0.12    | 0.88    | 0                | 0               |

This makes understanding the documents and topics easier: they contain fewer non-zero entries and are more focused on the most important stuff.
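
One simple way to put a number on this is the share of entries that are exactly zero. The `sparsity` helper below is a hypothetical illustration applied to the two example tables above; it is not a BigARTM score.

```python
def sparsity(matrix):
    """Share of entries that are exactly zero."""
    values = [value for row in matrix for value in row]
    return sum(1 for value in values if value == 0) / len(values)


# Rows are server.py and utils.py, columns are the four example topics.
dense = [[0.8, 0.1, 0.06, 0.04], [0.1, 0.7, 0.08, 0.12]]
sparse = [[0.85, 0.15, 0, 0], [0.12, 0.88, 0, 0]]
print(sparsity(dense), sparsity(sparse))
```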

In [None]:
from multiprocessing import cpu_count


def sparsify_topic_model(
    artm_batches_path: str,
    artm_dict_path: str,
    artm_stage1_path: str,
    artm_stage2_path: str,
    artm_files_topics_path: str,
    artm_topics_identifiers_path: str,
    artm_logs_path: str,
    n_epochs: int = 20,
    n_cpus: int = cpu_count() * 2,
    seed: int = 2019,
    regularizer: str = ' "1000 Decorrelation" "0.5 SparsePhi" "0.5 SparseTheta" '
):
    !bigartm \
        --log-dir {artm_logs_path} \
        --use-batches {artm_batches_path} \
        --use-dictionary {artm_dict_path} \
        --load-model {artm_stage1_path} \
        -p {n_epochs} \
        --threads {n_cpus} \
        --rand-seed {seed} \
        --regularizer {regularizer} \
        --save-model {artm_stage2_path} \
        --force \
        --write-predictions {artm_files_topics_path} \
        --write-model-readable {artm_topics_identifiers_path}


sparsify_topic_model(run.path(Dirs.ARTM_BATCHES),
                     run.path(Files.ARTM_DICT),
                     run.path(Files.ARTM_STAGE1),
                     run.path(Files.ARTM_STAGE2),
                     run.path(Files.ARTM_FILES_TOPICS),
                     run.path(Files.ARTM_TOPICS_IDENTIFIERS),
                     run.path(Dirs.ARTM_LOGS))

Our topic model should be perfectly cooked now. It's time to taste it. Let's visualize the topics with the great [pyLDAvis](https://github.com/bmabey/pyLDAvis) tool. To do that, we first extract the relevant info from our model. pyLDAvis doesn't support BigARTM out of the box, so we have a bit of work to do. If we had used Gensim or some other better-known (not better) library, this step would be a one-liner.

In [None]:
from bz2 import open as bz2_open
from json import loads as json_loads
from pickle import dump as pickle_dump, load as pickle_load
from typing import Optional

from numpy import ones as numpy_ones
from pandas import DataFrame, read_csv as pandas_read_csv
from pyLDAvis import prepare as pyldavis_prepare


def prepare_visualization(artm_files_topics_path: str,
                          artm_topics_identifiers_path: str,
                          common_counter_path: str,
                          filtered_identifiers_path: str,
                          pyldavis_data_path: str):

    def clean_artm_df(df: DataFrame,
                      to_delete: str,
                      transpose_name: Optional[str] = None):
        del df[to_delete]
        if transpose_name is not None:
            df = df.T
            df.index.name = transpose_name
        return df

    # We exchange rows and columns (transpose, .T) to have the topics as rows
    # and the identifiers as columns
    topics_identifiers_df = pandas_read_csv(
        artm_topics_identifiers_path,
        delimiter=";",
        index_col="token")
    topics_identifiers_df = clean_artm_df(topics_identifiers_df, "class_id", "topic")
    print("Start of the topics × identifiers dataframe:")
    display(topics_identifiers_df.head())

    files_topics_df = pandas_read_csv(
        artm_files_topics_path,
        delimiter=";",
        index_col="title")
    clean_artm_df(files_topics_df, "id")
    print("Start of the files × topics dataframe:")
    display(files_topics_df.head())

    # Normalize each row so that the topic weights of a file sum to 1.
    files_topics_df = files_topics_df.div(files_topics_df.sum(axis=1), axis=0)
    filler = (numpy_ones((files_topics_df.shape[1],))
              / files_topics_df.shape[1])
    for i, row in files_topics_df.iterrows():
        if not (0.9 < row.sum() < 1.1):
            files_topics_df.loc[i, :] = filler
    vocab = topics_identifiers_df.columns
    with bz2_open(filtered_identifiers_path, "rt", encoding="utf8") as fh_rj, \
            open(common_counter_path, "rb") as fh_rp:
        common_identifiers_counter = pickle_load(fh_rp)
        doc_lengths_index = {}
        for row_str in fh_rj:
            row = json_loads(row_str)
            doc_lengths_index[
                "%s//%s//%s" % (
                    row["repository_id"],
                    row["lang"],
                    row["file_path"].replace(" ", "_"))
            ] = len(row["split_identifiers"])
        term_frequency = [common_identifiers_counter[t] for t in vocab]
        doc_lengths = [doc_lengths_index[doc] for doc in files_topics_df.index]

    with open(pyldavis_data_path, "wb") as fh:
        pyldavis_data = pyldavis_prepare(topic_term_dists=topics_identifiers_df.values, 
                                         doc_topic_dists=files_topics_df.values,
                                         doc_lengths=doc_lengths,
                                         vocab=vocab,
                                         term_frequency=term_frequency,
                                         sort_topics=False)
        pickle_dump(pyldavis_data, fh)


prepare_visualization(run.path(Files.ARTM_FILES_TOPICS),
                      run.path(Files.ARTM_TOPICS_IDENTIFIERS),
                      run.path(Files.COMMON_IDENTIFIERS_COUNTER),
                      run.path(Files.FILTERED_IDENTIFIERS),
                      run.path(Files.PYLDAVIS_DATA))

We can now visualize the topics we just learned!

In [None]:
from pickle import load as pickle_load

from pyLDAvis import display as pyldavis_display


def visualize(pyldavis_data_path: str):
    with open(pyldavis_data_path, "rb") as fh:
        pyldavis_data = pickle_load(fh)
    return pyldavis_display(pyldavis_data)


visualize(full_run.path(Files.PYLDAVIS_DATA))

## Projects topics

With our learned topic model, we can now tackle our first task: understand what projects are about and find similar projects to existing ones based on their topics.

To do that, we will compute the distance between the topics of all projects, and return the closest ones.
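
Before comparing projects, we need one topic distribution per project. A minimal sketch of one way to get it, with hypothetical file-level weights: sum the topic weights of all the files of a repository, then normalize so that they sum to 1.

```python
from collections import defaultdict

# Hypothetical file-level topic weights, keyed by (repository, file path).
files_topics = {
    ("repo-a", "server.py"): [0.85, 0.15],
    ("repo-a", "utils.py"): [0.12, 0.88],
    ("repo-b", "main.go"): [1.0, 0.0],
}
n_topics = 2

# Sum the topic weights of all the files of each repository.
sums = defaultdict(lambda: [0.0] * n_topics)
for (repo, _path), weights in files_topics.items():
    sums[repo] = [s + w for s, w in zip(sums[repo], weights)]

# Normalize so that the weights of each repository sum to 1.
repos_topics = {
    repo: [w / sum(weights) for w in weights]
    for repo, weights in sums.items()
}
print(repos_topics)
```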

In [None]:
from bz2 import open as bz2_open
from json import loads as json_loads
from pickle import dump as pickle_dump, load as pickle_load
from re import compile as re_compile

from numpy import sum as np_sum, vectorize
from pandas import read_csv as pandas_read_csv


def build_projects_topics(artm_files_topics_path: str,
                          repos_topics_path: str):
    files_topics_df = pandas_read_csv(artm_files_topics_path, delimiter=";")
    all_but_repo_pattern = re_compile(r"//.+//.+$")
    files_topics_df["repository_id"] = files_topics_df["title"].apply(
        lambda x: all_but_repo_pattern.sub("", x))
    grouped_by_repo_df = files_topics_df.iloc[:, 2:].groupby("repository_id")
    repos_topics_df = grouped_by_repo_df.aggregate(np_sum)
    # Normalize each row so that the topic weights of a repository sum to 1.
    repos_topics_df = repos_topics_df.div(repos_topics_df.sum(axis=1), axis=0)
    with open(repos_topics_path, "wb") as fh:
        pickle_dump(repos_topics_df, fh)


build_projects_topics(run.path(Files.ARTM_FILES_TOPICS),
                      run.path(Files.REPOS_TOPICS))

## Developers topics

Now that we've computed file and project topics, let's compute topics for developers: we'll weight the topics of each file depending on how many lines each developer wrote in it. That'll give us a topic distribution for each developer!
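
Here is a small sketch of that weighting on invented data (the files, authors and line counts are hypothetical). Each author receives the topics of a file in proportion to the share of its lines they wrote, and the totals are normalized at the end.

```python
from collections import Counter

# Hypothetical inputs: topic weights per file and blamed lines per author.
files_topics = {"server.py": [0.8, 0.2], "utils.py": [0.0, 1.0]}
lines_by_author = {
    "server.py": Counter({"alice@example.com": 30, "bob@example.com": 10}),
    "utils.py": Counter({"bob@example.com": 50}),
}

authors_topics = {}
for file_id, counter in lines_by_author.items():
    total = sum(counter.values())
    for author, lines in counter.items():
        share = lines / total  # fraction of the file written by this author
        current = authors_topics.setdefault(
            author, [0.0] * len(files_topics[file_id]))
        for i, weight in enumerate(files_topics[file_id]):
            current[i] += weight * share

# Normalize so that the weights of each author sum to 1.
authors_topics = {
    author: [w / sum(weights) for w in weights]
    for author, weights in authors_topics.items()
}
print(authors_topics)
```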

In [None]:
from bz2 import open as bz2_open
from collections import Counter
from json import loads as json_loads
from multiprocessing import Pool
from os.path import join as path_join
from pickle import dump as pickle_dump
from typing import Any, Dict

from git import Repo


def extract_author_stats(row_str: str):
    # Each input is one JSON line of the filtered identifiers file.
    row = json_loads(row_str)
    repo = Repo(path_join("/devfest", "repos", "git-data", row["repository_id"]))
    file_id = "%s//%s//%s" % (
        row["repository_id"],
        row["lang"],
        row["file_path"].replace(" ", "_"))
    commit_counter = Counter()
    for blame_entry in repo.blame_incremental("HEAD", row["file_path"]):
        commit_counter[blame_entry.commit] += blame_entry.linenos.stop - blame_entry.linenos.start
    author_counter = Counter()
    for commit, lines in commit_counter.items():
        author = repo.git.show("-s", "--format=%ae", str(commit))
        author_counter[author] += lines
    return file_id, author_counter


def blame(filtered_identifiers_path: str,
          contributions_path: str):
    with bz2_open(filtered_identifiers_path, "rt", encoding="utf8") as fh, \
            open(contributions_path, "wb") as fh_contributions, \
            Pool() as pool:
        results = pool.map(extract_author_stats, fh.readlines())
        pickle_dump(results, fh_contributions)


blame(run.path(Files.FILTERED_IDENTIFIERS),
      run.path(Files.CONTRIBUTIONS))

In [None]:
from pickle import dump as pickle_dump, load as pickle_load

from pandas import DataFrame, read_csv as pandas_read_csv
from tqdm.notebook import tqdm


def build_authors_topics(contributions_path: str,
                         artm_files_topics_path: str,
                         authors_topics_path: str):
    with open(contributions_path, "rb") as fh_contributions:
        contribs = pickle_load(fh_contributions)
    files_topics_df = pandas_read_csv(artm_files_topics_path, delimiter=";")
    files_topics_df.set_index("title", inplace=True)
    del files_topics_df["id"]
    files_index = {f: i for i, f in enumerate(files_topics_df.index)}
    authors = sorted(set().union(*(c for _, c in contribs)))
    authors_index = {a: i for i, a in enumerate(authors)}
    # Start from floats: we accumulate fractional topic weights below.
    authors_topics_df = DataFrame(0.0,
                                  index=authors,
                                  columns=["topic_%d" % i
                                           for i in range(files_topics_df.shape[1])])
    for file_id, counter in tqdm(contribs):
        file_topics = files_topics_df.loc[file_id]
        if len(file_topics.shape) > 1:
            file_topics = file_topics.iloc[0, :].squeeze()
        total = sum(counter.values())
        for author, lines in counter.items():
            authors_topics_df.loc[author, :] += file_topics * lines / total
    # Normalize each row so that the topic weights of an author sum to 1.
    authors_topics_df = authors_topics_df.div(authors_topics_df.sum(axis=1), axis=0)
    authors_topics_df.dropna(inplace=True)
    with open(authors_topics_path, "wb") as fh:
        pickle_dump(authors_topics_df, fh)


build_authors_topics(run.path(Files.CONTRIBUTIONS),
                     run.path(Files.ARTM_FILES_TOPICS),
                     run.path(Files.AUTHORS_TOPICS))

## Projects and developers search

We have topics for projects and developers, great!

Now we can compare them: we just have to define how far a given set of topics is from another and we're good to go. In the following cell we're using cosine similarity. It's widely used for that purpose and works quite well. Plus, `scikit-learn` has it already implemented if you don't want to write the (few) lines it requires :)

With that defined, we can compare devs to devs, devs to projects, projects to devs and projects to projects! Let's go.
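
For reference, here is what those few lines could look like for two topic vectors, in plain Python. The `cosine` helper is a sketch written for this explanation, not the `scikit-learn` implementation; returning 0 when one of the vectors is all zeros is a convention chosen here.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([0.85, 0.15, 0.0], [0.8, 0.1, 0.1]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Since topic weights are never negative, the similarity between two topic distributions lies between 0 (no topic in common) and 1 (same proportions).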

In [None]:
from pickle import load as pickle_load

from numpy import vectorize
from pandas import read_csv as pandas_read_csv
from sklearn.metrics.pairwise import cosine_similarity


def build_comparison_functions(artm_topics_identifiers_path: str,
                               repos_topics_path: str,
                               authors_topics_path: str):
    topics_identifiers_df = pandas_read_csv(artm_topics_identifiers_path, delimiter=";").T
    topics_topk = topics_identifiers_df.iloc[1:, :].values.argsort()[:, -10:][:, ::-1]
    vocab = topics_identifiers_df.iloc[0, :].values
    topics_top_words = vectorize(lambda x: vocab[x])(topics_topk)
    with open(authors_topics_path, "rb") as fh_authors_topics, \
            open(repos_topics_path, "rb") as fh_repos_topics:
        authors_topics_df = pickle_load(fh_authors_topics)
        repos_topics_df = pickle_load(fh_repos_topics)

    for repo, topics_dist in repos_topics_df.iterrows():
        topk = topics_dist.argsort()[-3:][::-1]
        probk = topics_dist[topk]
        # print("%s:\n%s" % (repo, "\n".join("  %.2f: %s" % (prob, ", ".join(topics_top_words[top + 1]))
        #                                    for prob, top in zip(probk, topk))))

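    def build_comparison_function(df1, df2):
        # Cosine similarities: the higher, the closer.
        similarities = cosine_similarity(df1.values, df2.values)
        # When a set is compared with itself, the best match is the key itself:
        # we skip it. Otherwise we keep the best match.
        skip_self = df1 is df2

        def f(key: str):
            sim = similarities[df1.index.get_loc(key)]
            ranked = sim.argsort()[::-1]
            topk = ranked[1:10] if skip_self else ranked[:9]
            return [(df2.index[i], sim[i]) for i in topk]
        return f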

    return (build_comparison_function(repos_topics_df, repos_topics_df),
            build_comparison_function(repos_topics_df, authors_topics_df),
            build_comparison_function(authors_topics_df, repos_topics_df),
            build_comparison_function(authors_topics_df, authors_topics_df))


r_r, r_a, a_r, a_a = build_comparison_functions(
    full_run.path(Files.ARTM_TOPICS_IDENTIFIERS),
    full_run.path(Files.REPOS_TOPICS),
    full_run.path(Files.AUTHORS_TOPICS)
)


def display_top(key, f):
    top = f(key)
    print("**********")
    print("Closest to %s" % key)
    for o, prob in top:
        print("%60s: %.2f" % (o, prob))
    print("**********")

display_top("log4j", r_r)
display_top("log4j", r_a)
display_top("mwomack@apache.org", a_a)
display_top("lixiaojie_dev@outlook.com", a_r)