# Project and Developer Similarity

In this notebook, we study what projects and developers are about, and how to find other projects or developers that are similar to them.

We use [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model), through the excellent [BigARTM](http://docs.bigartm.org/en/stable/index.html) library, to achieve that. Roughly speaking, the topic model we use sees each code file as a mixture of topics (e.g., `setup.py` might mix topics about packaging and documentation).

To be able to apply this topic modeling technique, we need to transform each code file into a bag of identifiers.

We start by defining the paths we'll use for our inputs and outputs later in this notebook.

A note on how we use the cells in this notebook: __every cell should depend only on the first one (the one that defines paths)__. To achieve that, we save all results to files and load them back where they are needed. This helps mitigate the problems that arise from stateful notebooks.

In [None]:
from enum import Enum
from os import makedirs
from os.path import join as path_join
from typing import Union

class Files(Enum):
    IDENTIFIERS = ["identifiers.jsonl.bz2"]
    SPLIT_IDENTIFIERS = ["split-identifiers.jsonl.bz2"]
    FILTERED_IDENTIFIERS = ["filtered-identifiers.jsonl.bz2"]
    IDENTIFIERS_COUNTER = ["identifiers-counter.pickle"]
    COMMON_IDENTIFIERS_COUNTER = ["common-identifiers-counter.pickle"]
    VW_DATASET = ["dataset.vw"]
    ARTM_DICT = ["bigartm", "identifiers.dict"]
    ARTM_STAGE1 = ["bigartm", "stage1.model"]
    ARTM_STAGE2 = ["bigartm", "stage2.model"]
    ARTM_FILES_TOPICS = ["bigartm", "files-topics.bigartm"]
    ARTM_TOPICS_IDENTIFIERS = ["bigartm", "topics-identifiers.bigartm"]
    PYLDAVIS_DATA = ["pyldavis-data.pickle"]


class Dirs(Enum):
    ARTM_LOGS = ["bigartm", "logs"]
    ARTM_BATCHES = ["bigartm", "batches"]


class Run:
    def __init__(self, run_name: str):
        self._run_name = run_name

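    def path(self, file_or_dir: Union[Files, Dirs]) -> str:
        """Return the path of a file or directory of this run, creating directories as needed."""
        if isinstance(file_or_dir, Files):
            # For files, every component but the last one is a directory.
            dir_path = path_join(self._run_name, *file_or_dir.value[:-1])
        elif isinstance(file_or_dir, Dirs):
            dir_path = path_join(self._run_name, *file_or_dir.value)
        else:
            raise TypeError("Expected a Files or Dirs member, got %r" % (file_or_dir,))
        makedirs(dir_path, exist_ok=True)
        return (dir_path
                if isinstance(file_or_dir, Dirs)
                else path_join(dir_path, file_or_dir.value[-1]))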


run = Run("full")

## Preprocessing

We start with the bulk of the preprocessing: extracting identifiers with [gitbase](https://github.com/src-d/gitbase). Since gitbase exposes any codebase as a relational database, we can extract what we need with a SQL query:

In [None]:
from bz2 import open as bz2_open
from json import dumps as json_dumps, loads as json_loads

from utils import SUPPORTED_LANGUAGES, query_gitbase


def extract_identifiers(identifiers_path: str, limit: int = 0):
    sql = """
        SELECT
            repository_id,
            LANGUAGE(file_path) AS lang,
            file_path,
            uast_extract(
                uast(blob_content,
                     LANGUAGE(file_path),
                     '//uast:Identifier'),
                'Name'
            ) AS identifiers
        FROM refs
        NATURAL JOIN commit_files
        NATURAL JOIN blobs
        WHERE
            ref_name = 'HEAD'
            AND NOT IS_VENDOR(file_path)
            AND NOT IS_BINARY(file_path)
            AND LANGUAGE(file_path) IN (%s)
        %s
    """ % (
        ",".join("'%s'" % language for language in SUPPORTED_LANGUAGES),
        "LIMIT %d" % limit if limit > 0 else ""
    )

    with bz2_open(identifiers_path, "wt", encoding="utf8") as fh:
        for row in query_gitbase(sql):
            if row["identifiers"] is None:
                continue
            # for key, value in row.items():
            #     row[key] = value.decode("utf8", "replace")
            row["identifiers"] = json_loads(row["identifiers"])
            fh.write("%s\n" % json_dumps(row))


extract_identifiers(run.path(Files.IDENTIFIERS))
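
The identifier files used in this notebook are bz2-compressed JSON Lines files: one JSON object per line, one line per source file. The sketch below shows that layout on a made-up row, written to a temporary directory. The helper names `write_rows` and `read_rows` and the sample values are assumptions for illustration, not part of the pipeline.

```python
from bz2 import open as bz2_open
from json import dumps as json_dumps, loads as json_loads
from os.path import join as path_join
from tempfile import TemporaryDirectory


def write_rows(path, rows):
    """Write one JSON object per line to a bz2-compressed text file."""
    with bz2_open(path, "wt", encoding="utf8") as fh:
        for row in rows:
            fh.write("%s\n" % json_dumps(row))


def read_rows(path):
    """Read the rows back, one JSON object per line."""
    with bz2_open(path, "rt", encoding="utf8") as fh:
        return [json_loads(line) for line in fh]


# Made-up row with the same keys as the result of the query above.
sample_rows = [
    {"repository_id": "demo", "lang": "Python", "file_path": "setup.py",
     "identifiers": ["setup", "find_packages"]},
]

with TemporaryDirectory() as tmp_dir:
    sample_path = path_join(tmp_dir, "identifiers.jsonl.bz2")
    write_rows(sample_path, sample_rows)
    round_trip_rows = read_rows(sample_path)
```

Because every line is a complete JSON document, later cells can stream these files row by row without loading them in memory.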

Now that we have a file that stores all the identifiers, we can refine it until it is ready for topic modeling! The remaining steps are to further split each identifier (`set_timer` should become `set` and `timer`), and to apply some stemming (`connecting` and `connection` should both result in `connect`). Note that the resulting stem might not be an English word.

In [None]:
from bz2 import open as bz2_open
from collections import Counter
from json import dumps as json_dumps, loads as json_loads
from pickle import dump as pickle_dump

from utils import TokenParser


def split_identifiers(identifiers_path: str,
                      split_identifiers_path: str,
                      counter_path: str):
    with bz2_open(identifiers_path, "rt", encoding="utf8") as fh_identifiers, \
            bz2_open(split_identifiers_path, "wt", encoding="utf8") as fh_split_identifiers, \
            open(counter_path, "wb") as fh_counter:
        identifiers_counter = Counter()
        token_parser = TokenParser()
        for row_str in fh_identifiers:
            row = json_loads(row_str)
            identifiers = row.pop("identifiers")
            split_identifiers = []
            for identifier in identifiers:
                split_identifiers.extend(token_parser(identifier))
            identifiers_counter.update(split_identifiers)
            row["split_identifiers"] = split_identifiers
            fh_split_identifiers.write("%s\n" % json_dumps(row))
        pickle_dump(identifiers_counter, fh_counter)


split_identifiers(run.path(Files.IDENTIFIERS),
                  run.path(Files.SPLIT_IDENTIFIERS),
                  run.path(Files.IDENTIFIERS_COUNTER))
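
To make splitting and stemming more concrete, here is a minimal standalone sketch. It is not the `TokenParser` from `utils`, whose rules may well differ: the regular expression, the suffix list, and the names `split_identifier` and `toy_stem` are assumptions chosen for illustration.

```python
import re

# Matches acronyms ("HTTP"), capitalized or lowercase words, and digit runs.
_WORD_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

# Toy suffix list, for illustration only; a real stemmer has many more rules.
_TOY_SUFFIXES = ("ing", "ion", "ed", "s")


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase words."""
    return [match.group(0).lower()
            for match in _WORD_PATTERN.finditer(identifier)]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in _TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(split_identifier("set_timer"))
print([toy_stem(word) for word in ("connecting", "connection")])
```

A toy stemmer like this one happily turns `class` into `clas`, which illustrates why stems are not always English words.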

The resulting identifiers still need some processing: some of them appear only a few times and will bring mostly noise to our models. We will discard them now. The first step is to find out which identifiers are common enough to be kept.

In [None]:
from pickle import dump as pickle_dump, load as pickle_load


def build_common_counter(count_threshold: int,
                         counter_path: str,
                         common_counter_path: str):
    with open(counter_path, "rb") as fh:
        identifiers_counter = pickle_load(fh)
    print("Found %d different identifiers" % len(identifiers_counter))

    common_identifiers_counter = identifiers_counter.copy()
    for identifier, count in identifiers_counter.items():
        if count < count_threshold:
            del common_identifiers_counter[identifier]
    with open(common_counter_path, "wb") as fh:
        pickle_dump(common_identifiers_counter, fh)
    print("Found %d different identifiers after pruning"
          % len(common_identifiers_counter))


build_common_counter(10,
                     run.path(Files.IDENTIFIERS_COUNTER),
                     run.path(Files.COMMON_IDENTIFIERS_COUNTER))
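
The pruning rule is simple: an identifier is kept only if it was seen at least `count_threshold` times. A minimal sketch on made-up counts, where `prune_counter` is a hypothetical helper equivalent in spirit to the loop above:

```python
from collections import Counter


def prune_counter(counter, count_threshold):
    """Return a new Counter with the entries seen at least count_threshold times."""
    return Counter({identifier: count
                    for identifier, count in counter.items()
                    if count >= count_threshold})


# Made-up counts: "frobnicate" is too rare to be kept with a threshold of 10.
toy_counter = Counter({"get": 120, "timer": 15, "frobnicate": 2})
pruned_counter = prune_counter(toy_counter, 10)
```

Building a new `Counter` rather than deleting entries in place leaves the original counts available, which is handy when trying several thresholds.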

Now that we know which identifiers are common, we can recreate our mapping from files to identifiers with only the ones that we want to keep.

In [None]:
from bz2 import open as bz2_open
from json import dumps as json_dumps, loads as json_loads
from pickle import load as pickle_load


def filter_identifiers(split_identifiers_path: str,
                       common_counter_path: str,
                       filtered_identifiers_path: str):
    with bz2_open(split_identifiers_path, "rt", encoding="utf8") as fh_split_identifiers, \
            open(common_counter_path, "rb") as fh_common_counter, \
            bz2_open(filtered_identifiers_path, "wt", encoding="utf8") as fh_filtered_identifiers:
        common_identifiers_counter = pickle_load(fh_common_counter)
        for row_str in fh_split_identifiers:
            row = json_loads(row_str)
            row["split_identifiers"] = [identifier
                                        for identifier in row["split_identifiers"]
                                        if identifier in common_identifiers_counter]
            if row["split_identifiers"]:
                fh_filtered_identifiers.write("%s\n" % json_dumps(row))


filter_identifiers(run.path(Files.SPLIT_IDENTIFIERS),
                   run.path(Files.COMMON_IDENTIFIERS_COUNTER),
                   run.path(Files.FILTERED_IDENTIFIERS))

## Topic Modeling

The preprocessing is over! We now create the input dataset, in the VW format (see https://bigartm.readthedocs.io/en/stable/tutorials/datasets.html). We replace spaces in `file_path` to avoid creating false identifiers (VW would consider the latter parts of a path containing spaces to be identifiers).

In [None]:
from bz2 import open as bz2_open
from collections import Counter
from json import loads as json_loads


def build_vw_dataset(filtered_identifiers_path: str,
                     vw_dataset_path: str):
    with bz2_open(filtered_identifiers_path, "rt", encoding="utf8") as fh_filtered_identifiers, \
            open(vw_dataset_path, "w") as fh_vw:
        for row_str in fh_filtered_identifiers:
            counter = Counter()
            row = json_loads(row_str)
            counter.update(row["split_identifiers"])
            fh_vw.write("%s//%s//%s %s\n" % (
                row["repository_id"],
                row["lang"],
                row["file_path"].replace(" ", "_"),
                " ".join("%s:%d" % (identifier, count)
                         for identifier, count in counter.items())
            ))


build_vw_dataset(run.path(Files.FILTERED_IDENTIFIERS),
                 run.path(Files.VW_DATASET))
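
To see what one line of the resulting dataset looks like, here is a small sketch that formats a single made-up row the same way. `to_vw_line` and the sample row are assumptions for illustration only.

```python
from collections import Counter


def to_vw_line(row):
    """Format one row as '<title> <token>:<count> <token>:<count> ...'."""
    title = "%s//%s//%s" % (row["repository_id"],
                            row["lang"],
                            row["file_path"].replace(" ", "_"))
    counts = Counter(row["split_identifiers"])
    return "%s %s" % (title,
                      " ".join("%s:%d" % item for item in counts.items()))


# Made-up row; note the space in the file path.
sample_row = {
    "repository_id": "demo",
    "lang": "Java",
    "file_path": "src/My File.java",
    "split_identifiers": ["log", "append", "log"],
}
vw_line = to_vw_line(sample_row)
print(vw_line)
```

Without the replacement of spaces, `File.java` would be read as a token of the document instead of a part of its title.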

BigARTM has its own binary format to efficiently store and access the data used to build its topic models. The next step is therefore to transform our VW dataset into that format (batches plus a dictionary).

In [None]:
def prepare_bigartm(vw_dataset_path: str,
                    artm_batches_path: str,
                    artm_dict_path: str,
                    artm_logs_path: str):
    !rm -rf {artm_batches_path} {artm_dict_path}
    !bigartm \
        --log-dir {artm_logs_path} \
        -c {vw_dataset_path} \
        -p 0 \
        --save-batches {artm_batches_path} \
        --save-dictionary {artm_dict_path}


prepare_bigartm(run.path(Files.VW_DATASET),
                run.path(Dirs.ARTM_BATCHES),
                run.path(Files.ARTM_DICT),
                run.path(Dirs.ARTM_LOGS))

We can now train our first topic model! As per the BigARTM documentation, we don't use too much magic (yet), and only use one regularizer: the decorrelation one. It penalizes topics that share the same identifiers, which pushes them apart so that no two topics end up covering the same concepts.

In [None]:
from multiprocessing import cpu_count


def train_topic_model(artm_batches_path: str,
                      artm_dict_path: str,
                      artm_stage1_path: str,
                      artm_logs_path: str,
                      n_topics: int = 64,
                      n_epochs: int = 100,
                      n_cpus: int = cpu_count() * 2,
                      seed: int = 2019,
                      regularizer: str = '"1000 Decorrelation"'):
    !bigartm \
        --log-dir {artm_logs_path} \
        --use-batches {artm_batches_path} \
        --use-dictionary {artm_dict_path} \
        -t {n_topics} \
        -p {n_epochs} \
        --threads {n_cpus} \
        --rand-seed {seed} \
        --regularizer {regularizer} \
        --save-model {artm_stage1_path} \
        --force


train_topic_model(run.path(Dirs.ARTM_BATCHES),
                  run.path(Files.ARTM_DICT),
                  run.path(Files.ARTM_STAGE1),
                  run.path(Dirs.ARTM_LOGS))
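
For intuition about decorrelation: in the ARTM literature, this regularizer is usually written as $R(\Phi) = -\frac{\tau}{2} \sum_{t} \sum_{s \neq t} \sum_{w} \phi_{wt} \phi_{ws}$, so topics that put weight on the same words lower the objective. The sketch below computes the magnitude of that penalty on made-up topics, assuming the `1000` in the regularizer string plays the role of $\tau$. It is a toy computation in plain Python, not BigARTM's implementation.

```python
def topic_overlap(topics):
    """Sum of dot products over all ordered pairs of distinct topics.

    `topics` is a list of word distributions, one list of probabilities
    per topic, all over the same vocabulary.
    """
    total = 0.0
    for t, topic_t in enumerate(topics):
        for s, topic_s in enumerate(topics):
            if s != t:
                total += sum(p * q for p, q in zip(topic_t, topic_s))
    return total


def decorrelation_penalty(topics, tau):
    """Magnitude of the penalty subtracted from the objective."""
    return tau / 2 * topic_overlap(topics)


# Two topics that use disjoint words versus two identical topics.
disjoint_topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
identical_topics = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

print(decorrelation_penalty(disjoint_topics, 1000))
print(decorrelation_penalty(identical_topics, 1000))
```

Topics with disjoint vocabularies are not penalized at all, while duplicated topics get the largest penalty.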

This topic model is probably quite good already, but since BigARTM is a powerful library, we can improve it even further by making it sparser: documents and topics will be sharper, that is, each document will contain fewer topics, each topic fewer words, and both will focus on the most important ones.

In [None]:
from multiprocessing import cpu_count


def sparsify_topic_model(
    artm_batches_path: str,
    artm_dict_path: str,
    artm_stage1_path: str,
    artm_stage2_path: str,
    artm_files_topics_path: str,
    artm_topics_identifiers_path: str,
    artm_logs_path: str,
    n_epochs: int = 20,
    n_cpus: int = cpu_count() * 2,
    seed: int = 2019,
    regularizer: str = ' "1000 Decorrelation" "0.5 SparsePhi" "0.5 SparseTheta" '
):
    !bigartm \
        --log-dir {artm_logs_path} \
        --use-batches {artm_batches_path} \
        --use-dictionary {artm_dict_path} \
        --load-model {artm_stage1_path} \
        -p {n_epochs} \
        --threads {n_cpus} \
        --rand-seed {seed} \
        --regularizer {regularizer} \
        --save-model {artm_stage2_path} \
        --force \
        --write-predictions {artm_files_topics_path} \
        --write-model-readable {artm_topics_identifiers_path}


sparsify_topic_model(run.path(Dirs.ARTM_BATCHES),
                     run.path(Files.ARTM_DICT),
                     run.path(Files.ARTM_STAGE1),
                     run.path(Files.ARTM_STAGE2),
                     run.path(Files.ARTM_FILES_TOPICS),
                     run.path(Files.ARTM_TOPICS_IDENTIFIERS),
                     run.path(Dirs.ARTM_LOGS))
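
One simple way to check that this step had the intended effect is to measure the fraction of (near-)zero entries in the document-topic and topic-identifier matrices before and after it. A minimal sketch on made-up matrices, where `sparsity` is a hypothetical helper and not a BigARTM function:

```python
def sparsity(matrix, eps=1e-12):
    """Fraction of entries whose absolute value is at most eps."""
    n_entries = sum(len(row) for row in matrix)
    n_zeros = sum(1 for row in matrix for value in row if abs(value) <= eps)
    return n_zeros / n_entries


# Made-up document-topic distributions, one row per document.
dense_doc_topics = [[0.4, 0.3, 0.3], [0.2, 0.5, 0.3]]
sparse_doc_topics = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.4]]

print(sparsity(dense_doc_topics), sparsity(sparse_doc_topics))
```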

Our topic model should be perfectly cooked now. It's time to taste it. Let's visualize the topics with the great [pyLDAvis](https://github.com/bmabey/pyLDAvis) tool. To do that, we first extract the relevant information from our model. pyLDAvis does not support BigARTM out of the box, so we have a bit of work to do. Had we used Gensim or some other better-known (not better) library, this step would have been a one-liner.

In [None]:
from bz2 import open as bz2_open
from json import loads as json_loads
from pickle import dump as pickle_dump, load as pickle_load

from numpy import ones as numpy_ones
from pandas import read_csv as pandas_read_csv
from pyLDAvis import prepare as pyldavis_prepare


def prepare_visualization(artm_files_topics_path: str,
                          artm_topics_identifiers_path: str,
                          common_counter_path: str,
                          filtered_identifiers_path: str,
                          pyldavis_data_path: str):
    topics_identifiers_df = pandas_read_csv(artm_topics_identifiers_path, delimiter=";")
    files_topics_df = pandas_read_csv(artm_files_topics_path, delimiter=";")
    topic_term_dists = topics_identifiers_df.iloc[:, 2:].values.T
    doc_topic_dists = files_topics_df.iloc[:, 2:].values
    doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=1)
    for i, row in enumerate(doc_topic_dists):
        if not (0.9 < row.sum() < 1.1):
            doc_topic_dists[i] = (numpy_ones((doc_topic_dists.shape[1],))
                                  / doc_topic_dists.shape[1])
    doc_index = files_topics_df["title"].values
    vocab = topics_identifiers_df["token"].values
    with bz2_open(filtered_identifiers_path, "rt", encoding="utf8") as fh_rj, \
            open(common_counter_path, "rb") as fh_rp:
        common_identifiers_counter = pickle_load(fh_rp)
        doc_lengths_index = {}
        for row_str in fh_rj:
            row = json_loads(row_str)
            doc_lengths_index[
                "%s//%s//%s" % (
                    row["repository_id"],
                    row["lang"],
                    row["file_path"].replace(" ", "_"))
            ] = len(row["split_identifiers"])
        term_frequency = [common_identifiers_counter[t] for t in vocab]
        doc_lengths = [doc_lengths_index[doc] for doc in doc_index]

    with open(pyldavis_data_path, "wb") as fh:
        pyldavis_data = pyldavis_prepare(topic_term_dists=topic_term_dists, 
                                         doc_topic_dists=doc_topic_dists,
                                         doc_lengths=doc_lengths,
                                         vocab=vocab,
                                         term_frequency=term_frequency,
                                         sort_topics=False)
        pickle_dump(pyldavis_data, fh)


prepare_visualization(run.path(Files.ARTM_FILES_TOPICS),
                      run.path(Files.ARTM_TOPICS_IDENTIFIERS),
                      run.path(Files.COMMON_IDENTIFIERS_COUNTER),
                      run.path(Files.FILTERED_IDENTIFIERS),
                      run.path(Files.PYLDAVIS_DATA))
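
The only subtle part of this cell is the normalization: each row of the document-topic matrix has to sum to 1, and rows that cannot be normalized (for instance, all zeros after sparsification) are replaced with a uniform distribution. A plain-Python sketch of that rule, with `normalize_rows` as a hypothetical helper that simplifies the check above to "the row sum is positive":

```python
def normalize_rows(rows):
    """Scale each row to sum to 1; use a uniform row when the sum is not positive."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total > 0:
            normalized.append([value / total for value in row])
        else:
            # Nothing to scale: fall back to equal weight on every topic.
            normalized.append([1 / len(row)] * len(row))
    return normalized


print(normalize_rows([[1.0, 3.0], [0.0, 0.0]]))
```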

We can now visualize the topics we just learned!

In [None]:
from pickle import load as pickle_load

from pyLDAvis import display as pyldavis_display


def visualize(pyldavis_data_path: str):
    with open(pyldavis_data_path, "rb") as fh:
        pyldavis_data = pickle_load(fh)
    return pyldavis_display(pyldavis_data)


visualize(run.path(Files.PYLDAVIS_DATA))

## Projects search

With our learned topic model, we can now tackle our first task: understand what projects are about, and find the projects most similar to a given one based on their topics.

To do that, we sum the topic distributions of the files of each project, normalize the result, compute the cosine similarity between all pairs of projects, and return the closest ones.

In [None]:
from re import compile as re_compile

from pandas import read_csv as pandas_read_csv
from sklearn.metrics.pairwise import cosine_similarity


def build_projects_topics(artm_files_topics_path: str,
                          artm_topics_identifiers_path: str):
    files_topics_df = pandas_read_csv(artm_files_topics_path, delimiter=";")
    topics_identifiers_df = pandas_read_csv(artm_topics_identifiers_path, delimiter=";")
    # The first two columns hold the token and its class; the others hold one topic each.
    vocab = topics_identifiers_df["token"].values
    topics_tokens = topics_identifiers_df.iloc[:, 2:].values.T
    # Ten most probable identifiers of each topic, most probable first.
    topics_top_words = vocab[topics_tokens.argsort()[:, -10:][:, ::-1]]
    # Document titles look like "<repository_id>//<lang>//<file_path>".
    all_but_repo_pattern = re_compile(r"//.+//.+$")
    files_topics_df["repository_id"] = files_topics_df["title"].apply(
        lambda x: all_but_repo_pattern.sub("", x))
    # Sum the topic weights of the files of each repository, then normalize.
    repos_topics_df = files_topics_df.iloc[:, 2:].groupby("repository_id").sum()
    repos = list(repos_topics_df.index)
    repos_index = {r: i for i, r in enumerate(repos)}
    topics = repos_topics_df.values
    topics = topics / topics.sum(axis=1, keepdims=True)
    # Print the three main topics of each repository with their top identifiers.
    for i, (repo, topics_dist) in enumerate(zip(repos, topics)):
        print("%s (%d):" % (repo, i))
        for top in topics_dist.argsort()[-3:][::-1]:
            print("  %.2f: %s" % (topics_dist[top], ", ".join(topics_top_words[top])))
    # The most similar entry is normally the repository itself: skip it, keep the next nine.
    most_similar = cosine_similarity(topics).argsort()[:, -10:-1][:, ::-1]

    def query_similar(repo: str):
        return [repos[i] for i in most_similar[repos_index[repo]]]

    return query_similar


f = build_projects_topics(run.path(Files.ARTM_FILES_TOPICS),
                          run.path(Files.ARTM_TOPICS_IDENTIFIERS))

In [None]:
f("log4j")
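
To illustrate the ranking itself without any dependency, here is a plain-Python sketch of a cosine-similarity search over made-up topic distributions. The repository names, the numbers, and the helpers `cosine` and `most_similar` are assumptions for illustration; the cell above relies on scikit-learn's `cosine_similarity` instead.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def most_similar(repo, repo_topics, k=2):
    """Return the k repositories closest to `repo`, most similar first."""
    scores = [(cosine(repo_topics[repo], topics), other)
              for other, topics in repo_topics.items()
              if other != repo]
    return [other for _, other in sorted(scores, reverse=True)[:k]]


# Made-up topic distributions over three topics.
toy_repo_topics = {
    "logging-lib": [0.8, 0.1, 0.1],
    "other-logging-lib": [0.7, 0.2, 0.1],
    "web-framework": [0.1, 0.1, 0.8],
    "http-client": [0.2, 0.1, 0.7],
}

print(most_similar("logging-lib", toy_repo_topics))
```

Excluding the query repository explicitly, as done here, avoids relying on it always being ranked first.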