
How I Learned to Stop Worrying and Love ChatGPT

https://2024.msrconf.org/details/msr-2024-mining-challenge/6/How-I-Learned-to-Stop-Worrying-and-Love-ChatGPT

Replication package for MSR'24 Mining Challenge

https://2024.msrconf.org/track/msr-2024-mining-challenge

The code can be found in the following repository on GitHub:
https://github.com/ncusi/MSR_Challenge_2024
The data is also available on DagsHub:
https://dagshub.com/ncusi/MSR_Challenge_2024

If you find any errors in the code, or have trouble getting it to run, please report them via this project's GitHub issues.

First time setup

You can set up the environment for this package, following the recommended practices described later in this document, by running the init.bash Bash script and following its instructions.
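
For example, assuming the script sits in the project's top directory (a sketch; answer its prompts as they appear):

# run the first-time setup script from the top directory of the project
bash init.bash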

You can configure where this script puts the local DVC cache etc. by editing the values of the variables at the top of the script, in the configuration block:

# configuration
DVCSTORE_DIR='/mnt/data/dvcstore'
DEVGPT_DIR='/mnt/data/MSR_Challenge_2024/DevGPT-data.v9'

Note that this script assumes that it is run on Linux or a Linux-like system. On other operating systems, it might be better to follow the steps described in this document manually.

Virtual environment

To avoid dependency conflicts, it is strongly recommended to create a virtual environment. This can be done with, for example:

python3 -m venv venv

This needs to be done only once, from the top directory of the project.
For each session, you should activate this virtual environment:

source venv/bin/activate

This will typically add a "(venv) " prefix to the command-line prompt, though the exact behavior depends on the shell used.
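
As a quick sanity check on a Linux-like system, you can verify that the activated interpreter comes from the virtual environment:

which python    # should print a path ending in venv/bin/python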

Using a virtual environment, either directly as shown above or via pipx, might be required if you cannot install system packages and Python is configured as externally managed. This is the case if running pip install --user results in the following error:

error: externally-managed-environment

× This environment is externally managed

Installing dependencies

You can install the dependencies defined in the requirements.txt file with pip, using the following command:

python -m pip install -r requirements.txt

Note: the above assumes that you have activated the virtual environment (venv).
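
As an optional sanity check, you can list what was installed into the virtual environment and verify that there are no conflicting requirements:

python -m pip list     # show installed packages
python -m pip check    # report broken or conflicting dependencies, if any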

Running with DVC

You can re-run the whole computation pipeline with dvc repro, or at least those parts of it that were made to use the DVC (Data Version Control) tool.

You can also run experiments with dvc exp run.
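
For example (standard DVC commands; only stages whose dependencies changed are re-executed):

dvc repro      # reproduce the whole pipeline
dvc exp run    # or run it as a DVC experiment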

NOTE that DVC works best in a Git repository, and by default it is configured to require one. If you clone this project with Git, it will work out of the box; if you get this project from Figshare (https://doi.org/10.6084/m9.figshare.24771117), you will need to either:

  • use DVC without Git by setting the core.no_scm config option to true with dvc config --local core.no_scm true, or
  • run git init inside the unpacked directory with the replication package.

Using DagsHub DVC remote

You can download all the data except for the cloned repositories (see below for the explanation of why they are excluded) from DagsHub. The DVC remote that points to https://dagshub.com/ncusi/MSR_Challenge_2024 is called "dagshub".

You can use dvc pull for that:

dvc pull --remote dagshub

Configuring local DVC cache (optional)

Because the initial external DevGPT dataset is quite large (650 MB as a *.zip file, 3.9 GB uncompressed), you might want to store the DVC cache somewhere other than your home directory.

You can do that with dvc cache dir command:

dvc cache dir --local /mnt/data/username/.dvc/cache

where you need to replace username with your login name (on Linux you can find it with the whoami command).
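
For example, the two can be combined into a single command (a sketch; pick whatever path suits your setup):

dvc cache dir --local "/mnt/data/$(whoami)/.dvc/cache"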

Configuring local DVC storage

To avoid recomputing results, which takes time, you can configure a local DVC remote storage, for example:

cat <<EOF >>.dvc/config.local
[core]
    remote = local
['remote "local"']
    url = /mnt/data/dvcstore
EOF

Then you will be able to download the computed data with dvc pull, and upload your results for others in the team with dvc push. This assumes that everyone has access to /mnt/data/dvcstore, either because they work on the same host (perhaps remotely), or because it is network storage available to the whole team.
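
For example, with the "local" remote configured as the default (core.remote = local), plain pull and push will use it:

dvc pull    # download data that was already computed
dvc push    # share your results via the configured storage directory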

Description of DVC stages

The DVC pipeline is composed of 14 stages (see the dvc.yaml file). The stages for analyzing commit data, pull request (PR) data, and issue data have similar dependencies. The graph of dependencies shown below (created from the output of dvc dag --mermaid) is therefore simplified for readability.

flowchart TD
        node1["clone_repos"]
        node13["repo_stats_git"]
        node14["repo_stats_github"]
        node2["{commit,pr,issues}_agg"]
        node3["{commit,pr,issues}_similarities"]
        node4["{commit,pr,issues}_survival"]
        node5["download_DevGPT"]
        node5-->node13
        node1-->node2
        node1-->node4
        node1-->node13
        node5-->node14
        node1-->node14
        node2-->node4
        node5-->node1
        node5-->node2
        node2-->node3
        node1-->node3
        node5-->node3

The notation used here to describe the directed acyclic graph (DAG) of DVC pipeline dependencies (the goal of which is to reduce the size of the dvc dag graph) is to be understood as brace expansion. For example, {c,d,b}e expands to ce, de, be. This means that the following graph fragment:

flowchart LR
    node0["clone_repos"]
    node1["{commit,pr,issues}_agg"]
    node2["{commit,pr,issues}_survival"]
    node0-->node1
    node1-->node2

expands in the following way:

flowchart LR
    node0["clone_repos"]
    node1a["commit_agg"]
    node2a["commit_survival"]
    node1b["pr_agg"]
    node2b["pr_survival"]
    node1c["issues_agg"]
    node2c["issues_survival"]
    node0-->node1a
    node0-->node1b
    node0-->node1c
    node1a-->node2a
    node1b-->node2b
    node1c-->node2c
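
The same brace expansion can be reproduced directly in a Bash shell, for example:

echo {commit,pr,issues}_agg
# prints: commit_agg pr_agg issues_agg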

Each of the stages is described in dvc.yaml using the desc field. You can get the list of stages with their descriptions using the dvc stage list command:

Stage                Description
download_DevGPT      Download DevGPT dataset v9 from Zenodo
clone_repos          Clone all repositories included in DevGPT dataset
commit_agg           Latest commit sharings to CSV + per-project aggregates
pr_agg               Latest pr (pull request) sharings to CSV + per-project aggregates
issue_agg            Latest issue sharings to CSV + per-project aggregates
commit_survival      Changes and lines survival (via blame) for latest commit sharings
pr_survival          Changes and lines survival (via blame) for latest pr sharings
pr_split_survival    Changes and lines survival (via blame) for pr sharings, all commits
issue_survival       Changes and lines survival (via blame) for latest issue sharings
repo_stats_git       Repository stats from git for all cloned project repos
repo_stats_github    Repository info from GitHub for all cloned project repos
commit_similarities  ChatGPT <-> commit diff similarities for commit sharings
pr_similarities      ChatGPT <-> commit diff similarities for PR sharings
issue_similarities   ChatGPT <-> commit diff similarities for issue sharings

Additional stages' requirements

Some of the DVC pipeline stages have additional requirements, such as Internet access, git being installed, or a valid GitHub API key.

The following DVC stages require Internet access to work:

  • download_DevGPT
  • clone_repos
  • pr_agg
  • issue_agg
  • repo_stats_github

The following DVC stages require git installed to work:

  • clone_repos
  • commit_survival
  • pr_survival
  • pr_split_survival
  • issue_survival
  • repo_stats_git

The following DVC stage requires a GitHub API token to work, because it uses GitHub's GraphQL API (which requires authentication):

  • issue_agg

The following DVC stages would run faster with a GitHub API token, because of the much higher rate limits for authenticated GitHub REST API access:

  • pr_agg
  • issue_agg
  • repo_stats_github

To update or replace the GitHub API token, you currently need to edit the following line in src/utils/github.py:

GITHUB_API_TOKEN = "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"  # from jnareb

The token shown above expires on Mon, Apr 15 2024.
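
One way to make this edit non-interactively is a sed one-liner (a hypothetical example for GNU sed on Linux; YOUR_TOKEN stands for your own GitHub API token):

sed -i 's/^GITHUB_API_TOKEN = .*/GITHUB_API_TOKEN = "YOUR_TOKEN"/' src/utils/github.py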

No cloned repositories in DVC

Cloned repositories of the projects included in the DevGPT dataset are not stored in DVC. This is caused by space limitations and by DVC's inability to handle dangling symlinks inside directories to be put in DVC storage1.

Therefore, the clone_repos stage clones the repositories and creates a JSON file containing a summary of the results.

The file in question (data/repositories_download_status.json) tells the stages of the DVC pipeline that need the cloned repositories whether cloning succeeded. This file is neither stored in Git (thanks to data/.gitignore) nor in DVC (since it is marked as cache: false).

If you are interested only in modifying those stages that do not require cloned repositories (those that do not use git; see the "Additional stages' requirements" section), you can avoid re-running the whole DVC pipeline by using either:

  • dvc repro --single-item <target>... to reproduce only the given stages by turning off the recursive search for changed dependencies, or
  • dvc repro --downstream <starting target>... to only execute the stages after the given targets in their corresponding pipelines, including the target stages themselves. See the dvc repro documentation; an example follows below.
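
For example, using the commit_agg stage from the table above (a sketch; substitute the stage you are actually working on):

dvc repro --single-item commit_agg    # re-run only this stage
dvc repro --downstream commit_agg     # re-run this stage and everything after it in its pipeline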

Stages with checkpoints

The commit_similarities, pr_similarities, and issue_similarities stages are significantly time-consuming. Therefore, to avoid having to re-run them from the beginning if they are interrupted, they save their intermediate state in checkpoint files: data/interim/commit_sharings_similarities_df.checkpoint_data.json, etc.

These checkpoint files are marked as persistent DVC data files, and are not removed at the start of the stage.

Therefore, if you want to re-run those stages from scratch, you need to remove the checkpoint files before running the stage, for example with:

rm data/interim/*.checkpoint_data.json
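
For example, to re-run just the commit_similarities stage from scratch (a sketch using the checkpoint file name mentioned above), one might do:

rm -f data/interim/commit_sharings_similarities_df.checkpoint_data.json
dvc repro --single-item commit_similarities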

Jupyter Notebooks

The final part of the computations, and the visualizations presented in the "How I Learned to Stop Worrying and Love ChatGPT" paper, were done with Jupyter Notebooks residing in the notebooks/ directory.

Those notebooks are described in detail in notebooks/README.md.

To be able to use the installed dependencies when running those notebooks, it is recommended to start JupyterLab from the project's top directory with:

jupyter lab --notebook-dir='.'

Footnotes

  1. See issue #9971 in the dvc repository