# README generation experiment

The discrepancy between retrieval using READMEs and code motivated the following question:
**If the READMEs are unavailable, can we generate them from code?**

The results below are retrieval metrics computed on a sample of repositories, using models available in [BEIR](https://github.com/beir-cellar/beir).

## README generation

We generate a README for each repository based on selected code.

### Code selection

The repository code used for README generation is selected in two steps:
1. select up to 10 Python files per repository using the dependency graph
2. for each file, keep only comments and function/class signatures
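
Step 2 can be sketched as follows. This is a minimal illustration built on Python's standard `ast` and `tokenize` modules, not necessarily how the actual pipeline does it; the names `signature_of` and `extract_signatures_and_comments` are hypothetical. It keeps comments and rebuilt `def`/`class` headers, drops function bodies, and ignores class keywords such as `metaclass=`. It needs Python 3.9+ for `ast.unparse`.

```python
import ast
import io
import tokenize


def signature_of(node):
    """Rebuild the header line of a class or function definition."""
    if isinstance(node, ast.ClassDef):
        bases = ", ".join(ast.unparse(base) for base in node.bases)
        return f"class {node.name}({bases}):" if bases else f"class {node.name}:"
    keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(node.returns)}" if node.returns is not None else ""
    return f"{keyword} {node.name}({ast.unparse(node.args)}){returns}:"


def extract_signatures_and_comments(source):
    """Return comments and class/function headers of a Python file, in source order."""
    items = []  # (line number, text) pairs

    # Comments are not part of the AST, so read them from the token stream.
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.COMMENT:
            items.append((token.start[0], token.string))

    # Class and function headers come from the AST; bodies are skipped.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            items.append((node.lineno, signature_of(node)))

    items.sort(key=lambda item: item[0])
    return [text for _, text in items]
```

Calling `"\n".join(extract_signatures_and_comments(open(path).read()))` on each selected file would give the compact text that is passed on to generation.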

### Generation

READMEs are generated in two steps:
- summarize the selected code of each file
- generate the README from the file summaries; this step uses a Chain of Thought module

Generation is implemented by a DSPy program that can be found [here](https://gist.github.com/lambdaofgod/c41cbec0063edac67de2edd2db815e9c).
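
The data flow of the two steps can be sketched without DSPy. The snippet below is only an illustration and not the linked program: `generate_readme`, `GeneratedReadme`, the prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GeneratedReadme:
    context: str    # concatenated file summaries (the "generation context")
    rationale: str  # intermediate reasoning produced before the README
    readme: str     # the generated README text


def generate_readme(selected_files: Dict[str, str], llm: Callable[[str], str]) -> GeneratedReadme:
    # Step 1: summarize each selected file separately.
    summaries = [
        f"{path}: " + llm(f"Summarize this Python file:\n{code}")
        for path, code in selected_files.items()
    ]
    context = "\n".join(summaries)

    # Step 2: chain-of-thought style generation, reasoning first and writing second.
    rationale = llm(f"Think step by step about what this repository does:\n{context}")
    readme = llm(
        f"Write a README.\nFile summaries:\n{context}\nReasoning:\n{rationale}"
    )
    return GeneratedReadme(context=context, rationale=rationale, readme=readme)
```

The three fields of the result correspond to the generation context, the rationale, and the generated README.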

### Sampling

Sampling is needed because generating a README takes ~10s per repository. The sample was built to contain more than 2k repositories; generation took ~8h for the resulting 2529 repositories.

First we keep only repositories with at least k=3 tasks.
Then we sample 200 tasks, restricted to tasks that correspond to more than 20 repositories.
Finally, up to 20 repositories are sampled for each of these tasks.
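
A minimal sketch of this procedure using only the standard library. The function name, the `repo_tasks` input (a mapping from repository name to its set of tasks), and the seed are hypothetical. The sketch also assumes that the "more than 20 repositories" count is taken after the k=3 filter, which the description above does not specify.

```python
import random
from collections import defaultdict


def sample_repositories(repo_tasks, min_tasks_per_repo=3, n_tasks=200,
                        repo_threshold=20, max_repos_per_task=20, seed=0):
    rng = random.Random(seed)

    # Keep only repositories with at least `min_tasks_per_repo` tasks.
    eligible = {repo: tasks for repo, tasks in repo_tasks.items()
                if len(tasks) >= min_tasks_per_repo}

    # Invert the mapping: task -> repositories tagged with it.
    task_repos = defaultdict(list)
    for repo, tasks in eligible.items():
        for task in tasks:
            task_repos[task].append(repo)

    # Candidate tasks correspond to more than `repo_threshold` repositories.
    candidates = sorted(task for task, repos in task_repos.items()
                        if len(repos) > repo_threshold)
    sampled_tasks = rng.sample(candidates, min(n_tasks, len(candidates)))

    # Sample up to `max_repos_per_task` repositories for each sampled task.
    sampled_repos = set()
    for task in sampled_tasks:
        repos = sorted(task_repos[task])
        sampled_repos.update(rng.sample(repos, min(max_repos_per_task, len(repos))))
    return sampled_tasks, sampled_repos
```

Because a repository can be drawn for several tasks, the final sample is smaller than 200 × 20 = 4000 repositories.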

### Queries

Because repositories were sampled per task, the sampled repositories also carry tasks other than the 200 sampled ones. Many of these additional tasks have multiple repositories in the sample.

For this reason we use as queries all tasks that have over 10 repositories in the sample. The final query set contains 429 tasks.
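
Selecting the query tasks amounts to counting repositories per task within the sample. A small sketch, where `select_query_tasks` and its input format (repository name mapped to its tasks) are hypothetical and "over 10" is read as strictly more than 10:

```python
from collections import Counter


def select_query_tasks(sampled_repo_tasks, min_repos=10):
    """Return tasks that occur in more than `min_repos` sampled repositories."""
    counts = Counter(
        task
        for tasks in sampled_repo_tasks.values()
        for task in set(tasks)  # count each task at most once per repository
    )
    return sorted(task for task, n in counts.items() if n > min_repos)
```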

## Corpora

We use the following corpora (for repositories from the aforementioned sample):

- original README
- selected code
- generated README corpora
    - generated README
    - rationale (used by Chain of Thought for README generation)
    - generation context (selected repo file summaries)
- librarian corpora
    - dependency signature
    - generated tasks
    - librarian signature (dependency signature + generated tasks; named `repository_signature` in the code and results)

## Retrieval

We display the results for the following retrievers:
- BM25
- Word2Vec trained on Python code
- sentence transformer retrievers
    - [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
    - [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)


In addition, we tested [flax-sentence-embeddings/st-codesearch-distilroberta-base](https://huggingface.co/flax-sentence-embeddings/st-codesearch-distilroberta-base) and SPLADE. Both were outperformed by the retrievers listed above (SPLADE was better on READMEs, but only by 0.02), so we do not display their results.
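
The tables in this notebook report Accuracy@10 and NDCG@10. As a rough guide to reading them, here is a minimal sketch of both metrics for a single query. It assumes binary relevance (a repository is relevant to a task query if it is tagged with that task) and, to our understanding of BEIR's top-k accuracy, counts a query as a hit when at least one relevant document is in the top k. The function names are hypothetical, and the reported numbers come from BEIR's own evaluation code, not from this snippet.

```python
import math


def accuracy_at_k(ranked, relevant, k=10):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_metric(metric, rankings, qrels, k=10):
    """Average a per-query metric over all queries in `rankings`."""
    scores = [metric(ranked, qrels.get(query, set()), k) for query, ranked in rankings.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Accuracy@10 only asks whether any relevant repository is retrieved, while NDCG@10 also rewards placing relevant repositories near the top, which is why the two columns can differ so much.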

In [1]:
import pandas as pd
import numpy as np

In [28]:
metric_df_cols = ["corpus", "retriever", "Accuracy@10", "NDCG@10"]

ignored_retrievers = ["st-codesearch-distilroberta-base (sentence_transformer)", "splade"]

def load_beir_results_df(path="../output/results/beir_sample_results.csv", metric_df_cols=metric_df_cols, ignored_retrievers=ignored_retrievers):
    
    beir_sample_results_df = pd.read_csv(path)

    beir_sample_results_df = beir_sample_results_df[~beir_sample_results_df["retriever"].isin(ignored_retrievers)]
    
    return beir_sample_results_df[metric_df_cols]


str_types = {
    "readme": "",
    "generated_rationale": "generated_readme",
    "generated_readme": "generated_readme",
    "generation_context": "generated_readme",
    "dependency_signature": "librarian",
    "repository_signature": "librarian",
    "generated_tasks": "librarian",
    "selected_code": "",
}

cell_props = {
    "": "",
    "librarian": "color:red;",
    "generated_readme": "color:green;",
    "bm25": "color:blue;",
    "splade": "color:brown;",
    "Python code word2vec": "color:yellow;"
}

def string_cell_style(s):
    # Map a corpus name to its corpus family (if any), then look up the CSS for it.
    s = str_types.get(s, s)
    return cell_props.get(s, "")


def df_style(v):
    if isinstance(v, float):
        return '{:.2f}'.format(v)
    else:
        return string_cell_style(v)

def apply_style(df, number_cols=("Accuracy@10", "NDCG@10")):
    # Format the metric columns explicitly instead of reading them from the global results frame.
    number_cols = [col for col in number_cols if col in df.columns]
    return df.style.map(string_cell_style).format({col: "{:.3}" for col in number_cols})

def show_grouped_results(results_df, groupby_col, agg="max", sort_by="Accuracy@10"):
    if agg == "max":
        agg_fn = lambda df: df.sort_values(sort_by, ascending=False).drop(columns=[groupby_col]).iloc[0]
    elif agg == "mean":
        agg_fn = lambda df: df.select_dtypes("number").mean()
    else:
        raise NotImplementedError(f"unsupported agg: {agg!r}")
    grouped_results = results_df.groupby(groupby_col).apply(agg_fn).sort_values(sort_by, ascending=False)
    return apply_style(grouped_results.reset_index())

In [29]:
beir_sample_results_df = load_beir_results_df()

## Complete results

In [30]:
apply_style(beir_sample_results_df)

|    | corpus | retriever | Accuracy@10 | NDCG@10 |
|----|--------|-----------|-------------|---------|
| 1 | readme | bm25 | 0.932 | 0.518 |
| 2 | readme | all-mpnet-base-v2 (sentence_transformer) | 0.925 | 0.506 |
| 3 | readme | all-MiniLM-L12-v2 (sentence_transformer) | 0.902 | 0.481 |
| 4 | generated_rationale | all-mpnet-base-v2 (sentence_transformer) | 0.886 | 0.411 |
| 5 | generated_readme | all-MiniLM-L12-v2 (sentence_transformer) | 0.876 | 0.396 |
| 6 | generated_readme | all-mpnet-base-v2 (sentence_transformer) | 0.876 | 0.406 |
| 7 | generated_rationale | all-MiniLM-L12-v2 (sentence_transformer) | 0.872 | 0.392 |
| 10 | generation_context | all-mpnet-base-v2 (sentence_transformer) | 0.839 | 0.369 |
| 11 | generated_readme | bm25 | 0.818 | 0.36 |
| 12 | generation_context | all-MiniLM-L12-v2 (sentence_transformer) | 0.816 | 0.325 |


## Aggregating over corpora

The best result for each corpus

In [32]:
show_grouped_results(beir_sample_results_df, "corpus")

|    | corpus | retriever | Accuracy@10 | NDCG@10 |
|----|--------|-----------|-------------|---------|
| 0 | readme | bm25 | 0.932 | 0.518 |
| 1 | generated_rationale | all-mpnet-base-v2 (sentence_transformer) | 0.886 | 0.411 |
| 2 | generated_readme | all-MiniLM-L12-v2 (sentence_transformer) | 0.876 | 0.396 |
| 3 | generation_context | all-mpnet-base-v2 (sentence_transformer) | 0.839 | 0.369 |
| 4 | repository_signature | all-mpnet-base-v2 (sentence_transformer) | 0.8 | 0.317 |
| 5 | dependency_signature | all-MiniLM-L12-v2 (sentence_transformer) | 0.79 | 0.268 |
| 6 | generated_tasks | all-MiniLM-L12-v2 (sentence_transformer) | 0.781 | 0.263 |
| 7 | selected_code | all-mpnet-base-v2 (sentence_transformer) | 0.765 | 0.256 |


Average results for each corpus

In [33]:
show_grouped_results(beir_sample_results_df, "corpus", agg="mean")

|    | corpus | Accuracy@10 | NDCG@10 |
|----|--------|-------------|---------|
| 0 | readme | 0.871 | 0.432 |
| 1 | generated_rationale | 0.827 | 0.353 |
| 2 | generated_readme | 0.822 | 0.35 |
| 3 | generation_context | 0.766 | 0.297 |
| 4 | repository_signature | 0.736 | 0.244 |
| 5 | generated_tasks | 0.723 | 0.227 |
| 6 | selected_code | 0.642 | 0.203 |
| 7 | dependency_signature | 0.608 | 0.193 |


## Aggregating over retrievers

The best result for each retriever

In [34]:
show_grouped_results(beir_sample_results_df, "retriever", agg="max")

|    | retriever | corpus | Accuracy@10 | NDCG@10 |
|----|-----------|--------|-------------|---------|
| 0 | bm25 | readme | 0.932 | 0.518 |
| 1 | all-mpnet-base-v2 (sentence_transformer) | readme | 0.925 | 0.506 |
| 2 | all-MiniLM-L12-v2 (sentence_transformer) | readme | 0.902 | 0.481 |
| 3 | Python code word2vec | generated_rationale | 0.737 | 0.245 |


Average results for each retriever

In [35]:
show_grouped_results(beir_sample_results_df, "retriever", agg="mean")

|    | retriever | Accuracy@10 | NDCG@10 |
|----|-----------|-------------|---------|
| 0 | all-mpnet-base-v2 (sentence_transformer) | 0.83 | 0.352 |
| 1 | all-MiniLM-L12-v2 (sentence_transformer) | 0.815 | 0.326 |
| 2 | bm25 | 0.761 | 0.308 |
| 3 | Python code word2vec | 0.593 | 0.163 |
