# README generation experiment

The discrepancy between retrieval using READMEs and code motivated the following question:
**Assuming READMEs aren't available, can we generate useful READMEs from code?**

The following results show the metrics obtained from a sample of repositories evaluated with models available in [BEIR](https://github.com/beir-cellar/beir).

## README generation

We generate a README for each repository based on selected code.

### Code selection

READMEs were generated from repository code selected in two steps:
1. select up to 10 Python files per repository using the dependency graph
2. from each file, keep only comments and function/class signatures
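The second step can be sketched with the standard library alone. This is a minimal illustration, not the code used in the experiment: the function names (`extract_comments`, `extract_signatures`, `select_code`) are hypothetical, and the dependency-graph file selection from the first step is not covered.

```python
import ast
import io
import tokenize


def extract_comments(source: str) -> list[str]:
    """Collect `#` comments with the tokenizer (the AST does not keep them)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.COMMENT]


def extract_signatures(source: str) -> list[str]:
    """Collect one-line signatures of all functions and classes in a module."""
    signatures = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signatures.append(f"{prefix} {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(base) for base in node.bases)
            signatures.append(f"class {node.name}({bases})" if bases else f"class {node.name}")
    return signatures


def select_code(source: str) -> str:
    """Keep only the comments and signatures of a single file."""
    return "\n".join(extract_comments(source) + extract_signatures(source))


example_source = '''
# Utilities for loading data
class Loader(Base):
    def load(self, path, limit=10):
        return []
'''
selected = select_code(example_source)
```

Function bodies are dropped entirely, so the selected text stays short enough to fit several files into a single prompt.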

### Generation

This approach can be described as a simplified version of the [Baleen](https://arxiv.org/pdf/2101.00436v2.pdf) system.


READMEs are generated with a two-hop process:
- summarize the selected code
- generate a README based on the file summaries (this step uses a Chain of Thought module)


Generation is implemented by a DSPy program that can be found [here](https://gist.github.com/lambdaofgod/c41cbec0063edac67de2edd2db815e9c).

```mermaid
flowchart LR

subgraph File summarizer
R[repo name] --> B["code selected\n from max k=10 files"]
B --> P1
P1{"describe what each\n file implements\n in 3 sentences\n Answer:"} --> C((code summary))
end
subgraph README generation CoT
C -->|context| P2
P2{"Given the ${context}, ${question}\n, produce the answer.\n Reasoning: "}
P2 --> R2(("generated\n reasoning"))

R2 -->|reasoning| P3{"Given the ${context}, ${question}\n, produce the answer.\n Reasoning: ${reasoning}\n Answer: "}
P3 --> GR(("generated\n README"))
end
```
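The flow in the diagram can be sketched in plain Python. This is not the DSPy program linked above: the `LLM` callable interface, the function names and the default `question` text are assumptions made for illustration, and the prompt strings only paraphrase the templates shown in the diagram.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def summarize_files(repo_name: str, selected_code: str, llm: LLM) -> str:
    """First hop: summarize the code selected from a repository."""
    prompt = (
        f"Repository: {repo_name}\n"
        f"Code:\n{selected_code}\n"
        "Describe what each file implements in 3 sentences.\n"
        "Answer:"
    )
    return llm(prompt)


def generate_readme(context: str, question: str, llm: LLM) -> dict:
    """Second hop: chain of thought, reasoning first and then the answer."""
    base = f"Given the {context}, {question}, produce the answer.\nReasoning:"
    reasoning = llm(base)
    readme = llm(f"{base} {reasoning}\nAnswer:")
    return {"rationale": reasoning, "readme": readme}


def code2doc(repo_name: str, selected_code: str, llm: LLM,
             question: str = "write a README for this repository") -> dict:
    """Run both hops and keep every intermediate text."""
    context = summarize_files(repo_name, selected_code, llm)
    return {"context": context, **generate_readme(context, question, llm)}
```

The sketch returns all three texts (context, rationale and README), so each intermediate output can be evaluated as a separate retrieval corpus.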

### Sampling

We sample repositories because generating a README takes 6-10s per repository on average, depending on the LLM, and there are 38k repositories in total. We implemented a simple heuristic that adds repositories starting from the least represented queries. It keeps track of the number of matched repositories per query and does not add any repository that would make a task overrepresented.

#### Sampling considerations

First, let us mention how sampling influences the problem:
1. highly represented queries might get harder to retrieve (it is easier to retrieve 10 relevant items when there are 1000 relevant examples than when there are 100)
2. the least common queries might get easier to retrieve because there are fewer confounding documents
3. filtering out repositories with common tasks will also result in dropping repositories with other tasks (because repositories have multiple tasks)

Experiments run on small samples (2-3k) suggested that the situation described in 1. did not actually occur.

Because common queries are easy to retrieve even after sampling, we decided to ignore them: we do not calculate their metrics, so the corresponding repositories act only as distractors.
What we consider a 'big' (common) query is based on the expected performance of a random retrieval baseline, which will be discussed in the appendix.
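To make the idea of a random baseline concrete, here is a small sketch. It assumes the baseline returns `k` repositories drawn uniformly at random without replacement; the appendix may define it differently, and the function names are hypothetical. Under that assumption, the chance of hitting at least one relevant repository follows the hypergeometric distribution.

```python
from math import comb


def random_accuracy_at_k(n_docs: int, n_relevant: int, k: int = 10) -> float:
    """Probability that k uniformly sampled documents contain at least one relevant one."""
    return 1.0 - comb(n_docs - n_relevant, k) / comb(n_docs, k)


def random_hits_at_k(n_docs: int, n_relevant: int, k: int = 10) -> float:
    """Expected number of relevant documents among k uniformly sampled ones."""
    return k * n_relevant / n_docs
```

A query with many relevant repositories gets a high `random_accuracy_at_k` even without any retrieval model, which is one way to decide that it is too common to be informative.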

#### Iterative count-based sampling 

Because of these problems, and after some experimentation, we chose to sample repositories iteratively based on task counts. We can think of this method as a way to jointly sample tasks and repositories while limiting the maximum number of repositories per task.

Input parameters:
each repository `repo` in `repos` has `repo["tasks"]`, a list of tasks for this repository.

```
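def get_repos_with_task(repos, task):
    # helper assumed for this excerpt: repositories annotated with the given task
    return [repo for repo in repos if task in repo["tasks"]]


def any_task_overrepresented(tasks, task_sample_counts, max_task_count):
    # helper assumed for this excerpt: True if any task already reached the cap
    return any(task_sample_counts.get(task, 0) >= max_task_count for task in tasks)


def sample_repos_by_task_count(repos, task_counts, min_task_count, max_task_count, n_repos_per_task):
    # task_counts: list of (task, count) pairs
    # min_task_count is not used in this excerpt
    task_sample_counts = {task: 0 for task, _ in task_counts}
    sampled_repos_by_task = {task: [] for task, _ in task_counts}
    for task, task_count in task_counts:
        task_repos = get_repos_with_task(repos, task)
        valid_task_repos = [
            repo
            for repo in task_repos
            # filtering: skip repositories whose tasks are already at the cap
            if not any_task_overrepresented(repo["tasks"], task_sample_counts, max_task_count)
        ]
        sampled_task_repos = valid_task_repos[:n_repos_per_task]

        for repo in sampled_task_repos:
            # separate loop variable so the outer `task` is not shadowed
            for repo_task in repo["tasks"]:
                task_sample_counts[repo_task] = task_sample_counts.get(repo_task, 0) + 1
        sampled_repos_by_task[task] = sampled_task_repos

    return sampled_repos_by_task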
```

This method ensures that no task becomes overrepresented in the sample.

### Queries

Repositories sampled with our method also bring in additional tasks: a repository sampled for one task can be annotated with other tasks as well. Many of these additional tasks have multiple repositories in the sample.

Because of this, we use the tasks that have over 10 repositories in the sample as queries. The final query set contains 429 tasks.
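The query selection rule can be sketched as follows. The function name `select_query_tasks` and the toy data are hypothetical; the only assumption about the data is that each repository carries a `tasks` list, as in the sampling code.

```python
from collections import Counter


def select_query_tasks(sampled_repos, min_repos=10):
    """Return tasks that occur in more than `min_repos` sampled repositories."""
    counts = Counter(task for repo in sampled_repos for task in set(repo["tasks"]))
    return sorted(task for task, n in counts.items() if n > min_repos)


# toy sample with a lowered threshold, only to show the behaviour
toy_sample = [
    {"tasks": ["a", "b"]},
    {"tasks": ["a"]},
    {"tasks": ["c", "a"]},
    {"tasks": ["b"]},
]
toy_queries = select_query_tasks(toy_sample, min_repos=2)
```

Counting is done over the sampled repositories rather than the full dataset, so tasks that entered the sample only as side effects of other tasks can still become queries.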

## Corpora

We use the following corpora (for repositories from the aforementioned sample):

- original README
- selected code
- generated README corpora
    - generated README
    - rationale (used by Chain of Thought for README generation)
    - generation context (selected repo file summaries)
- librarian corpora
    - dependency signature
    - generated tasks
    - librarian signature (dependency signature + generated tasks)

## Retrieval

We display the results for the following retrievers:
- BM25
- Word2Vec trained on Python code
- sentence transformer retrievers
    - [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
    - [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)


In addition, we tested [flax-sentence-embeddings/st-codesearch-distilroberta-base](https://huggingface.co/flax-sentence-embeddings/st-codesearch-distilroberta-base) and SPLADE. These models were outperformed by the other methods (SPLADE was better on READMEs, but only by 0.02), so we do not display their results.
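The results report Accuracy@10, Hits@10 and NDCG@10. The sketch below shows how such metrics can be computed for a single query. The definitions are assumptions, since this section does not spell them out: Accuracy@k is taken to be 1 when at least one relevant repository appears in the top k, Hits@k is taken to be the number of relevant repositories in the top k (the reported values above 1 suggest a count), and NDCG@k uses binary relevance.

```python
from math import log2


def accuracy_at_k(ranked, relevant, k=10):
    """1.0 if any of the top-k documents is relevant, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def hits_at_k(ranked, relevant, k=10):
    """Number of relevant documents among the top k."""
    return sum(doc in relevant for doc in ranked[:k])


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(1.0 / log2(rank + 2) for rank, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


ranked_repos = ["r1", "r2", "r3", "r4"]   # hypothetical retrieval output for one query
relevant_repos = {"r2", "r9"}             # hypothetical relevance judgements
```

The reported numbers would then be averages of these per-query values over all queries.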

In [1]:
import pandas as pd
import numpy as np
import ranky

In [15]:
sort_col = "borda_rank"
metric_df_cols = ["Accuracy@10", "Hits@10", "NDCG@10"] # + ["R_cap@10"]
shown_cols = ["corpus", "retriever", "generator_model", "borda_rank"] + metric_df_cols

ignored_retrievers = ["st-codesearch-distilroberta-base (sentence_transformer)", "splade"]
model_name = "llama3"


def is_sorting_ascending(sort_col):
    return sort_col == "borda_rank"

def set_model_name(df, code2doc_model):
    readme_generator_corpora = ["generated_readme", "generation_context", "generated_rationale"]
    signature_corpora = ["repository_signature", "generated_tasks"]
    df["generator_model"] = ""
    df.loc[df["corpus"].isin(readme_generator_corpora), "generator_model"] = code2doc_model
    df.loc[df["corpus"].isin(signature_corpora), "generator_model"] = "bigcode/starcoderbase-7b"
    return df

def load_beir_results_df(path, code2doc_model, metric_df_cols=metric_df_cols, ignored_retrievers=ignored_retrievers):
    
    beir_sample_results_df = pd.read_csv(path)

    beir_sample_results_df = beir_sample_results_df[~beir_sample_results_df["retriever"].isin(ignored_retrievers)]

    beir_sample_results_df = set_model_name(beir_sample_results_df, code2doc_model)
    beir_sample_results_df["borda_rank"] = ranky.borda(beir_sample_results_df[metric_df_cols])
    #beir_sample_results_df["majority_rank"] = ranky.majority(beir_sample_results_df[metric_df_cols])
    return beir_sample_results_df[shown_cols]


def merge_different_model_dfs(dfs):
    return pd.concat(dfs).drop_duplicates().reset_index(drop=True).sort_values(sort_col, ascending=is_sorting_ascending(sort_col))


str_types = {
    "readme": "",
    "generated_rationale": "generated_readme",
    "generated_readme": "generated_readme",
    "generation_context": "generated_readme",
    "dependency_signature": "librarian",
    "repository_signature": "librarian",
    "generated_tasks": "librarian",
    "selected_code": "",
}

cell_props = {
    "": "",
    "librarian": "color:red;",
    "generated_readme": "color:green;",
    "bm25": "color:blue;",
    "splade": "color:brown;",
    "Python code word2vec": "color:yellow;"
}

def string_cell_style(s):
    if s in str_types.keys():
        s = str_types[s]
    return cell_props.get(s, "")


def df_style(v):
    if type(v) is float:
        return '{:.2f}'.format(v)
    else:
        return string_cell_style(v)

def apply_style(df):
    number_cols = beir_sample_results_df.select_dtypes("number").columns
    return df.style.map(string_cell_style).format(dict([(col, "{:.3}") for col in number_cols]))

def show_grouped_results(results_df, groupby_col, agg="max", sort_by=sort_col):
    results_df = results_df.copy()
    results_df["borda_rank"] = ranky.borda(results_df[metric_df_cols])
    if agg == "max":
        agg_fn = lambda df: df.sort_values(sort_by, ascending=is_sorting_ascending(sort_by)).drop(columns=[groupby_col]).iloc[0]
    elif agg == "mean":
        agg_fn = lambda df: df.select_dtypes("number").mean()
    else:
        raise NotImplementedError(f"unsupported aggregation: {agg}")
    grouped_results = results_df.groupby(groupby_col).apply(agg_fn)
    grouped_results["borda_rank"] = ranky.borda(grouped_results[metric_df_cols])
    grouped_results = grouped_results.sort_values(sort_by, ascending=is_sorting_ascending(sort_by))

    # TODO: why this doesn't change the ranking?
    #grouped_results["borda_rank"] = ranky.borda(grouped_results[metric_df_cols])
    #grouped_results["majority_rank"] = ranky.majority(grouped_results[metric_df_cols])
    return apply_style(grouped_results.reset_index())

In [16]:
beir_sample_results_df = merge_different_model_dfs([
    load_beir_results_df('../output/code2doc/sample_per_task_5_repos/beir_results_codellama.csv', "codellama"),
    #load_beir_results_df(f"../output/code2doc/sample_small/beir_results_llama3.csv", "llama3")
])

In [17]:
apply_style(beir_sample_results_df)

| | corpus | retriever | generator_model | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| 1 | readme | all-mpnet-base-v2 (sentence_transformer) | | 1.33 | 0.916 | 4.1 | 0.455 |
| 0 | readme | bm25 | | 1.67 | 0.911 | 4.12 | 0.449 |
| 2 | readme | all-MiniLM-L12-v2 (sentence_transformer) | | 3.0 | 0.903 | 3.9 | 0.445 |
| 3 | generated_readme | all-mpnet-base-v2 (sentence_transformer) | codellama | 4.0 | 0.836 | 2.68 | 0.3 |
| 4 | generated_rationale | all-mpnet-base-v2 (sentence_transformer) | codellama | 5.33 | 0.813 | 2.61 | 0.298 |
| 5 | generated_readme | all-MiniLM-L12-v2 (sentence_transformer) | codellama | 5.67 | 0.817 | 2.6 | 0.295 |
| 6 | generated_rationale | all-MiniLM-L12-v2 (sentence_transformer) | codellama | 7.67 | 0.783 | 2.38 | 0.275 |
| 7 | generated_readme | bm25 | codellama | 8.67 | 0.765 | 2.36 | 0.274 |
| 8 | generation_context | all-mpnet-base-v2 (sentence_transformer) | codellama | 8.67 | 0.794 | 2.32 | 0.268 |
| 9 | generated_rationale | bm25 | codellama | 10.0 | 0.745 | 2.29 | 0.27 |
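The `borda_rank` values in the table above are consistent with averaging, over the three metrics, each row's rank within that metric (rank 1 is the best). The sketch below reproduces the values for the three README rows. It is an illustration of that averaging, not the `ranky` implementation; the function name `mean_rank` is hypothetical and ties are not handled. Restricting the input to these three rows does not change their ranks, because they are the top three rows on every metric.

```python
def mean_rank(rows, metrics):
    """Average, over metrics, of each row's rank (1 = highest value). Ties are not handled."""
    ranks = [[] for _ in rows]
    for metric in metrics:
        order = sorted(range(len(rows)), key=lambda i: rows[i][metric], reverse=True)
        for position, row_index in enumerate(order, start=1):
            ranks[row_index].append(position)
    return [sum(r) / len(r) for r in ranks]


metrics = ["Accuracy@10", "Hits@10", "NDCG@10"]
rows = [
    {"retriever": "all-mpnet-base-v2", "Accuracy@10": 0.916, "Hits@10": 4.10, "NDCG@10": 0.455},
    {"retriever": "bm25", "Accuracy@10": 0.911, "Hits@10": 4.12, "NDCG@10": 0.449},
    {"retriever": "all-MiniLM-L12-v2", "Accuracy@10": 0.903, "Hits@10": 3.90, "NDCG@10": 0.445},
]
scores = mean_rank(rows, metrics)
```

Because the score averages ranks rather than raw metric values, BM25 ends up second overall even though it has the highest Hits@10.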


In [18]:
type(apply_style(beir_sample_results_df).to_latex())

str

In [19]:
print(apply_style(beir_sample_results_df.reset_index(drop=True).round(3)).to_latex().replace("\\\\", "\\\\\n\\hline"))

\begin{tabular}{llllrrrr}
 & corpus & retriever & generator_model & borda_rank & Accuracy@10 & Hits@10 & NDCG@10 \\
\hline
0 & readme & all-mpnet-base-v2 (sentence_transformer) &  & 1.33 & 0.916 & 4.1 & 0.455 \\
\hline
1 & readme & \colorblue bm25 &  & 1.67 & 0.911 & 4.12 & 0.449 \\
\hline
2 & readme & all-MiniLM-L12-v2 (sentence_transformer) &  & 3.0 & 0.903 & 3.9 & 0.445 \\
\hline
3 & \colorgreen generated_readme & all-mpnet-base-v2 (sentence_transformer) & codellama & 4.0 & 0.836 & 2.67 & 0.3 \\
\hline
4 & \colorgreen generated_rationale & all-mpnet-base-v2 (sentence_transformer) & codellama & 5.33 & 0.813 & 2.62 & 0.298 \\
\hline
5 & \colorgreen generated_readme & all-MiniLM-L12-v2 (sentence_transformer) & codellama & 5.67 & 0.817 & 2.6 & 0.295 \\
\hline
6 & \colorgreen generated_rationale & all-MiniLM-L12-v2 (sentence_transformer) & codellama & 7.67 & 0.783 & 2.38 & 0.275 \\
\hline
7 & \colorgreen generated_readme & \colorblue bm25 & codellama & 8.67 & 0.765 & 2.36 & 0.274 \\
\hli

## Comparing text generators

In [20]:
show_grouped_results(beir_sample_results_df[beir_sample_results_df["generator_model"] != ""], "generator_model")

| | generator_model | corpus | retriever | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| 0 | codellama | generated_readme | all-mpnet-base-v2 (sentence_transformer) | 1.0 | 0.836 | 2.68 | 0.3 |
| 1 | bigcode/starcoderbase-7b | repository_signature | all-mpnet-base-v2 (sentence_transformer) | 2.0 | 0.734 | 2.03 | 0.238 |


In [21]:
show_grouped_results(beir_sample_results_df[beir_sample_results_df["generator_model"] != ""], "generator_model", agg="mean")

| | generator_model | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|
| 0 | codellama | 1.0 | 0.712 | 2.03 | 0.233 |
| 1 | bigcode/starcoderbase-7b | 2.0 | 0.615 | 1.44 | 0.169 |


## Aggregating over corpora

The best result for each corpus

In [22]:
show_grouped_results(beir_sample_results_df.drop(columns=["generator_model"]), "corpus")

| | corpus | retriever | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|
| 0 | readme | all-mpnet-base-v2 (sentence_transformer) | 1.0 | 0.916 | 4.1 | 0.455 |
| 1 | generated_readme | all-mpnet-base-v2 (sentence_transformer) | 2.0 | 0.836 | 2.68 | 0.3 |
| 2 | generated_rationale | all-mpnet-base-v2 (sentence_transformer) | 3.0 | 0.813 | 2.61 | 0.298 |
| 3 | generation_context | all-mpnet-base-v2 (sentence_transformer) | 4.0 | 0.794 | 2.32 | 0.268 |
| 4 | repository_signature | all-mpnet-base-v2 (sentence_transformer) | 5.0 | 0.734 | 2.03 | 0.238 |
| 5 | dependency_signature | all-mpnet-base-v2 (sentence_transformer) | 6.0 | 0.716 | 1.91 | 0.229 |
| 6 | selected_code | bm25 | 7.0 | 0.682 | 1.82 | 0.214 |
| 7 | generated_tasks | all-MiniLM-L12-v2 (sentence_transformer) | 8.0 | 0.681 | 1.68 | 0.194 |


In [23]:
print(show_grouped_results(beir_sample_results_df, "corpus").to_latex())

\begin{tabular}{llllrrrr}
 & corpus & retriever & generator_model & borda_rank & Accuracy@10 & Hits@10 & NDCG@10 \\
0 & readme & all-mpnet-base-v2 (sentence_transformer) &  & 1.0 & 0.916 & 4.1 & 0.455 \\
1 & \colorgreen generated_readme & all-mpnet-base-v2 (sentence_transformer) & codellama & 2.0 & 0.836 & 2.68 & 0.3 \\
2 & \colorgreen generated_rationale & all-mpnet-base-v2 (sentence_transformer) & codellama & 3.0 & 0.813 & 2.61 & 0.298 \\
3 & \colorgreen generation_context & all-mpnet-base-v2 (sentence_transformer) & codellama & 4.0 & 0.794 & 2.32 & 0.268 \\
4 & \colorred repository_signature & all-mpnet-base-v2 (sentence_transformer) & bigcode/starcoderbase-7b & 5.0 & 0.734 & 2.03 & 0.238 \\
5 & \colorred dependency_signature & all-mpnet-base-v2 (sentence_transformer) &  & 6.0 & 0.716 & 1.91 & 0.229 \\
6 & selected_code & \colorblue bm25 &  & 7.0 & 0.682 & 1.82 & 0.214 \\
7 & \colorred generated_tasks & all-MiniLM-L12-v2 (sentence_transformer) & bigcode/starcoderbase-7b & 8.0 & 0.68

Average results for each corpus

In [24]:
print(
    show_grouped_results(beir_sample_results_df, "corpus", agg="mean").to_latex().replace("\\\\", "\\\\\n\\hline")
)


\begin{tabular}{llrrrr}
 & corpus & borda_rank & Accuracy@10 & Hits@10 & NDCG@10 \\
\hline
0 & readme & 1.0 & 0.826 & 3.33 & 0.373 \\
\hline
1 & \colorgreen generated_readme & 2.0 & 0.744 & 2.21 & 0.25 \\
\hline
2 & \colorgreen generated_rationale & 3.0 & 0.717 & 2.08 & 0.24 \\
\hline
3 & \colorgreen generation_context & 4.0 & 0.676 & 1.8 & 0.21 \\
\hline
4 & \colorred repository_signature & 5.0 & 0.625 & 1.48 & 0.176 \\
\hline
5 & \colorred generated_tasks & 6.0 & 0.606 & 1.39 & 0.161 \\
\hline
6 & selected_code & 7.0 & 0.556 & 1.34 & 0.158 \\
\hline
7 & \colorred dependency_signature & 8.0 & 0.509 & 1.19 & 0.144 \\
\hline
\end{tabular}




In [26]:
beir_sample_results_df

| | corpus | retriever | generator_model | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| 1 | readme | all-mpnet-base-v2 (sentence_transformer) | | 1.333333 | 0.91576 | 4.09868 | 0.45536 |
| 0 | readme | bm25 | | 1.666667 | 0.91095 | 4.12154 | 0.44892 |
| 2 | readme | all-MiniLM-L12-v2 (sentence_transformer) | | 3.0 | 0.90253 | 3.90253 | 0.44474 |
| 3 | generated_readme | all-mpnet-base-v2 (sentence_transformer) | codellama | 4.0 | 0.83634 | 2.67509 | 0.29979 |
| 4 | generated_rationale | all-mpnet-base-v2 (sentence_transformer) | codellama | 5.333333 | 0.81348 | 2.61492 | 0.29785 |
| 5 | generated_readme | all-MiniLM-L12-v2 (sentence_transformer) | codellama | 5.666667 | 0.81709 | 2.59687 | 0.29502 |
| 6 | generated_rationale | all-MiniLM-L12-v2 (sentence_transformer) | codellama | 7.666667 | 0.78339 | 2.38387 | 0.27466 |
| 7 | generated_readme | bm25 | codellama | 8.666667 | 0.76534 | 2.36101 | 0.27434 |
| 8 | generation_context | all-mpnet-base-v2 (sentence_transformer) | codellama | 8.666667 | 0.79422 | 2.31769 | 0.26751 |
| 9 | generated_rationale | bm25 | codellama | 10.0 | 0.74489 | 2.2852 | 0.27014 |


In [27]:
show_grouped_results(beir_sample_results_df, "corpus", agg="mean")

| | corpus | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|
| 0 | readme | 1.0 | 0.826 | 3.33 | 0.373 |
| 1 | generated_readme | 2.0 | 0.744 | 2.21 | 0.25 |
| 2 | generated_rationale | 3.0 | 0.717 | 2.08 | 0.24 |
| 3 | generation_context | 4.0 | 0.676 | 1.8 | 0.21 |
| 4 | repository_signature | 5.0 | 0.625 | 1.48 | 0.176 |
| 5 | generated_tasks | 6.0 | 0.606 | 1.39 | 0.161 |
| 6 | selected_code | 7.0 | 0.556 | 1.34 | 0.158 |
| 7 | dependency_signature | 8.0 | 0.509 | 1.19 | 0.144 |


## Aggregating over retrievers

The best result for each retriever

In [28]:
show_grouped_results(beir_sample_results_df, "retriever", agg="max")

| | retriever | corpus | generator_model | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| 0 | all-mpnet-base-v2 (sentence_transformer) | readme | | 1.33 | 0.916 | 4.1 | 0.455 |
| 1 | bm25 | readme | | 1.67 | 0.911 | 4.12 | 0.449 |
| 2 | all-MiniLM-L12-v2 (sentence_transformer) | readme | | 3.0 | 0.903 | 3.9 | 0.445 |
| 3 | Python code word2vec | readme | | 4.0 | 0.573 | 1.21 | 0.143 |


In [29]:
show_grouped_results(beir_sample_results_df, "retriever", agg="mean")

| | retriever | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|
| 0 | all-mpnet-base-v2 (sentence_transformer) | 1.0 | 0.77 | 2.37 | 0.271 |
| 1 | all-MiniLM-L12-v2 (sentence_transformer) | 2.0 | 0.754 | 2.22 | 0.258 |
| 2 | bm25 | 3.0 | 0.697 | 2.07 | 0.241 |
| 3 | Python code word2vec | 4.0 | 0.407 | 0.761 | 0.0857 |


### Aggregating over retrievers without original READMEs

In [30]:
show_grouped_results(beir_sample_results_df[beir_sample_results_df["corpus"] != "readme"], "retriever", agg="max")

| | retriever | corpus | generator_model | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| 0 | all-mpnet-base-v2 (sentence_transformer) | generated_readme | codellama | 1.0 | 0.836 | 2.68 | 0.3 |
| 1 | all-MiniLM-L12-v2 (sentence_transformer) | generated_readme | codellama | 2.0 | 0.817 | 2.6 | 0.295 |
| 2 | bm25 | generated_readme | codellama | 3.0 | 0.765 | 2.36 | 0.274 |
| 3 | Python code word2vec | generated_readme | codellama | 4.0 | 0.556 | 1.19 | 0.131 |


In [14]:
show_grouped_results(beir_sample_results_df[beir_sample_results_df["corpus"] != "readme"], "retriever", agg="mean")

| | retriever | borda_rank | Accuracy@10 | Hits@10 | NDCG@10 |
|---|---|---|---|---|---|
| 0 | all-mpnet-base-v2 (sentence_transformer) | 8.93 | 0.749 | 2.12 | 0.245 |
| 1 | all-MiniLM-L12-v2 (sentence_transformer) | 10.6 | 0.733 | 1.97 | 0.232 |
| 2 | bm25 | 13.8 | 0.666 | 1.78 | 0.211 |
| 3 | Python code word2vec | 24.7 | 0.384 | 0.697 | 0.0775 |
