Replication package for MSR'24 Mining Challenge
https://2024.msrconf.org/track/msr-2024-mining-challenge
The code can be found in the following repository on GitHub:
https://github.com/ncusi/MSR_Challenge_2024
The data is also available on DagsHub:
https://dagshub.com/ncusi/MSR_Challenge_2024
If you find any errors in the code, or have trouble getting it to run, please report them via this project's GitHub issues.
You can set up the environment for this package, following the recommended practices (described later in this document), by running the init.bash Bash script and following its instructions.
You can configure where this script puts the local DVC storage, the DevGPT dataset, etc. by editing the values of the variables at the top of the script, in the configuration block:
# configuration
DVCSTORE_DIR='/mnt/data/dvcstore'
DEVGPT_DIR='/mnt/data/MSR_Challenge_2024/DevGPT-data.v9'
Note that this script assumes that it is run on Linux or a Linux-like system. For other operating systems, it might be better to follow the steps described in this document manually.
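For example, assuming you start in the top directory of the unpacked replication package (where init.bash resides), the script can be run with:
bash init.bash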
To avoid dependency conflicts, it is strongly recommended to create a virtual environment. This can be done with, for example:
python3 -m venv venv
This needs to be done only once, from the top directory of the project.
For each session, you should activate this virtual environment:
source venv/bin/activate
This should make the command line prompt include "(venv) " as a prefix, though the exact behavior depends on the shell used.
Using a virtual environment, either directly as shown above or via pipx, might be required if you cannot install system packages and Python is configured as externally managed, that is, if running pip install --user results in the following error:
error: externally-managed-environment
× This environment is externally managed
You can install the dependencies defined in the requirements.txt file with pip, using the following command:
python -m pip install -r requirements.txt
Note: the above assumes that you have activated the virtual environment (venv).
You can re-run the whole computation pipeline with dvc repro, or at least those parts that were made to use the DVC (Data Version Control) tool. You can also run experiments with dvc exp run.
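Before reproducing, you can check which stages DVC considers out of date with the standard dvc status command:
dvc status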
NOTE that DVC works best in a Git repository, and is by default configured to require it. If you clone this project with Git, it will work out of the box; if you get this project from Figshare (https://doi.org/10.6084/m9.figshare.24771117) you will need to either:
- use DVC without Git, by setting the core.no_scm config option to true in the local DVC configuration with dvc config --local core.no_scm true, or
- run git init inside the unpacked directory with the replication package
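For example, from inside the unpacked replication package directory, one of the following should work:
# option 1: configure DVC to work without Git (SCM)
dvc config --local core.no_scm true
# option 2: turn the unpacked directory into a Git repository
git init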
You can download all data except for the cloned repositories (see below for an explanation of why they are excluded) from DagsHub. The DVC remote that points to https://dagshub.com/ncusi/MSR_Challenge_2024 is called "dagshub". You can use dvc pull for that:
dvc pull --remote dagshub
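You can check which DVC remotes are configured (the output should include one named "dagshub") with:
dvc remote list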
Because the initial external DevGPT dataset is quite large (650 MB as a *.zip file, and 3.9 GB uncompressed), you might want to store the DVC cache somewhere other than your home directory. You can do that with the dvc cache dir command:
dvc cache dir --local /mnt/data/username/.dvc/cache
where you need to replace username with your login (on Linux you can find it with the help of the whoami command).
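On Linux, both steps can be combined with command substitution (a sketch, assuming a POSIX-like shell):
dvc cache dir --local "/mnt/data/$(whoami)/.dvc/cache"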
To avoid recomputing results, which takes time, you can configure a local DVC remote storage, for example:
cat <<EOF >>.dvc/config.local
[core]
remote = local
['remote "local"']
url = /mnt/data/dvcstore
EOF
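The same local-only configuration can alternatively be created with the dvc remote add and dvc remote default commands:
dvc remote add --local local /mnt/data/dvcstore
dvc remote default --local local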
Then you will be able to download the computed data with dvc pull, and upload your results for others in the team with dvc push. This assumes that everyone on the team has access to /mnt/data/dvcstore, either by doing the work on the same host (perhaps remotely), or because it is network storage available to the whole team.
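With that remote set as the default, plain dvc pull and dvc push will use it; you can also name it explicitly:
dvc pull --remote local
dvc push --remote local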
The DVC pipeline is composed of 14 stages (see the dvc.yaml file). The stages for analyzing commit data, pull request (PR) data, and issues data have similar dependencies. The graph of dependencies shown below (created from the output of dvc dag --mermaid) is therefore simplified for readability.
flowchart TD
node1["clone_repos"]
node13["repo_stats_git"]
node14["repo_stats_github"]
node2["{commit,pr,issues}_agg"]
node3["{commit,pr,issues}_similarities"]
node4["{commit,pr,issues}_survival"]
node5["download_DevGPT"]
node5-->node13
node1-->node2
node1-->node4
node1-->node13
node5-->node14
node1-->node14
node2-->node4
node5-->node1
node5-->node2
node2-->node3
node1-->node3
node5-->node3
The notation used here to describe the directed acyclic graph (DAG) of DVC pipeline dependencies (the goal of which is to reduce the size of the dvc dag graph) is to be understood as brace expansion. For example, {c,d,b}e expands to ce, de, be. This means that the following graph fragment:
flowchart LR
node0["clone_repos"]
node1["{commit,pr,issues}_agg"]
node2["{commit,pr,issues}_survival"]
node0-->node1
node1-->node2
expands in the following way:
flowchart LR
node0["clone_repos"]
node1a["commit_agg"]
node2a["commit_survival"]
node1b["pr_agg"]
node2b["pr_survival"]
node1c["issues_agg"]
node2c["issues_survival"]
node0-->node1a
node0-->node1b
node0-->node1c
node1a-->node2a
node1b-->node2b
node1c-->node2c
Each of the stages is described in dvc.yaml using the desc field. You can get a list of stages with their descriptions with the dvc stage list command:
Stage | Description |
---|---|
download_DevGPT | Download DevGPT dataset v9 from Zenodo |
clone_repos | Clone all repositories included in DevGPT dataset |
commit_agg | Latest commit sharings to CSV + per-project aggregates |
pr_agg | Latest pr (pull request) sharings to CSV + per-project aggregates |
issue_agg | Latest issue sharings to CSV + per-project aggregates |
commit_survival | Changes and lines survival (via blame) for latest commit sharings |
pr_survival | Changes and lines survival (via blame) for latest pr sharings |
pr_split_survival | Changes and lines survival (via blame) for pr sharings, all commits |
issue_survival | Changes and lines survival (via blame) for latest issue sharings |
repo_stats_git | Repository stats from git for all cloned project repos |
repo_stats_github | Repository info from GitHub for all cloned project repos |
commit_similarities | ChatGPT <-> commit diff similarities for commit sharings |
pr_similarities | ChatGPT <-> commit diff similarities for PR sharings |
issue_similarities | ChatGPT <-> commit diff similarities for issue sharings |
Running some of the DVC pipeline stages has additional requirements, like Internet access, git installed, or a valid GitHub API key.
The following DVC stages require Internet access to work:
- download_DevGPT
- clone_repos
- pr_agg
- issue_agg
- repo_stats_github
The following DVC stages require git to be installed to work:
- clone_repos
- commit_survival
- pr_survival
- pr_split_survival
- issue_survival
- repo_stats_git
The following DVC stage requires a GitHub API token to work, because it uses GitHub's GraphQL API (which requires authentication):
- issue_agg
The following DVC stages would run faster with a GitHub API token, because of the much higher rate limits for authenticated GitHub REST API access:
- pr_agg
- issue_agg
- repo_stats_github
To update or replace the GitHub API token, you currently need to edit the following line in src/utils/github.py:
GITHUB_API_TOKEN = "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" # from jnareb
The token shown above expires on Mon, Apr 15 2024.
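If you want to check whether a token is still valid, and see the current API rate limits, you can query GitHub's REST API rate limit endpoint (the token value below is a placeholder):
curl -H "Authorization: Bearer <your GitHub API token>" https://api.github.com/rate_limit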
Cloned repositories of the projects included in the DevGPT dataset are not stored in DVC. This is caused by space limitations and by DVC's inability to handle dangling symlinks inside directories to be put in DVC storage.
Therefore, the clone_repos stage clones the repositories and creates a JSON file containing a summary of the results.
The file in question (data/repositories_download_status.json) indicates that certain stages of the DVC pipeline need to have those repositories cloned. This file is stored neither in Git (thanks to data/.gitignore), nor in DVC (since it is marked as cache: false).
If you are interested only in modifying those stages that do not require cloned repositories (those that do not use git, see the "Additional stages' requirements" section), you can avoid re-running the whole DVC pipeline by using either:
- dvc repro --single-item <target>... to reproduce only the given stages, turning off the recursive search for changed dependencies, or
- dvc repro --downstream <starting target>... to execute only the stages after the given targets in their corresponding pipelines, including the target stages themselves.
See the dvc repro documentation.
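For example, using stage names from the table above (a sketch; substitute the stages you actually modified):
dvc repro --single-item commit_agg
dvc repro --downstream commit_similarities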
The commit_similarities, pr_similarities, and issue_similarities stages are significantly time-consuming. Therefore, to avoid having to re-run them from scratch if they are interrupted, they save their intermediate state in checkpoint files: data/interim/commit_sharings_similarities_df.checkpoint_data.json, etc.
These checkpoint files are marked as persistent DVC data files, and are not removed at the start of the stage.
Therefore, if you want to re-run those stages from scratch, you need to remove those checkpoint files before running the stage, for example with
rm data/interim/*.checkpoint_data.json
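A from-scratch re-run of one of those stages could then look like this (dvc repro --force re-runs the stage even if DVC considers it up to date):
rm data/interim/commit_sharings_similarities_df.checkpoint_data.json
dvc repro --force commit_similarities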
The final part of the computations, and the visualizations presented in the "How I Learned to Stop Worrying and Love ChatGPT" paper, were done with Jupyter Notebooks residing in the notebooks/ directory. Those notebooks are described in detail in notebooks/README.md.
To be able to use the installed dependencies when running those notebooks, it is recommended to start JupyterLab from this project's top directory with:
jupyter lab --notebook-dir='.'