IdeNTifying RedUndancies in Fork-based DEvelopment

Python library dependencies:

sklearn, numpy, SciPy, matplotlib, gensim, nltk, bs4, flask, GitHub-Flask

Configuration:

LOCAL_DATA_PATH (local path for storing fetched data)

access_token (token for using the GitHub API to fetch data)

model_path (local path for storing the model)
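
A minimal sketch of what these three settings might look like, assuming they live in a Python module (the README does not say where they are defined). Only the three setting names come from the text above; the directory, the file name `model.bin`, and the environment variable `GITHUB_ACCESS_TOKEN` are hypothetical placeholders.

```python
import os

# Hypothetical location for cached API responses; any writable directory works.
LOCAL_DATA_PATH = os.path.expanduser("~/pr_data")

# Read the GitHub token from the environment instead of hard-coding it.
# The variable name GITHUB_ACCESS_TOKEN is an assumption, not a project convention.
access_token = os.environ.get("GITHUB_ACCESS_TOKEN", "")

# Hypothetical file name for the stored model, kept next to the cached data.
model_path = os.path.join(LOCAL_DATA_PATH, "model.bin")
```

Keeping the token out of the source file avoids committing it by accident.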


[dupPR]: Reference paper: Yu, Yue, et al. "A dataset of duplicate pull-requests in github." Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 2018. (Includes 2323 duplicate PR pairs in 26 repos.)

dupPR for training set

dupPR for testing set

Non-duplicate PRs for training set

Non-duplicate PRs for testing set

labeled results for RQ1 precision evaluation


  1. python data/random_sample_select_pr.txt 400

    (It will generate data/random_sample_select_pr.txt using random sampling)

  2. python

    (It will take data/random_sample_select_pr.txt & data/clf/second_msr_pairs.txt as input, and write the output into files: evaluation/random_sample_select_pr_result.txt & evaluation/msr_second_part_result.txt)

  3. Manually label the output file evaluation/random_sample_select_pr_result.txt by adding Y/N/Unknown at the end (see evaluation/random_sample_select_pr_result_example.txt for an example)

  4. python

    (It will print precision & recall at different thresholds to stdout.)
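
The threshold sweep in the last step can be sketched roughly as follows. This is not the project's evaluation script: the function name `precision_recall_at_thresholds` and the input shape (a list of `(score, is_duplicate)` pairs, with Unknown labels already dropped) are assumptions made for illustration, and recall is computed only against the labeled positives in the sample.

```python
def precision_recall_at_thresholds(scored, thresholds):
    """scored: list of (score, is_duplicate) pairs.

    Returns {threshold: (precision, recall)}; a value is None when undefined.
    """
    total_pos = sum(1 for _, dup in scored if dup)
    results = {}
    for t in thresholds:
        # Labels of every pair the detector would report at this threshold.
        predicted = [dup for score, dup in scored if score >= t]
        tp = sum(predicted)
        precision = tp / len(predicted) if predicted else None
        recall = tp / total_pos if total_pos else None
        results[t] = (precision, recall)
    return results


# Made-up scores and labels, for illustration only.
sample = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
for t, (p, r) in precision_recall_at_thresholds(sample, [0.3, 0.5, 0.75]).items():
    print(t, p, r)
```

Raising the threshold typically trades recall for precision, which is why the results are reported at several thresholds.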


  1. python data/clf/second_msr_pairs.txt

    python data/clf/second_nondup.txt

    (It will take data/clf/second_msr_pairs.txt & data/clf/second_nondup.txt as input, and write the output into files: evaluation/second_msr_pairs_history.txt & evaluation/second_nondup_history.txt.)

  2. python

    (It will print precision, FPR, and saved commits at different thresholds to stdout.)
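
As a rough illustration of the FPR figure mentioned above, the sketch below computes the false positive rate from scores of known non-duplicate pairs. The function name `false_positive_rate` and the flat list of scores are assumptions; the README does not define how saved commits are counted, so that metric is left out.

```python
def false_positive_rate(nondup_scores, threshold):
    """Fraction of known non-duplicate pairs that would be flagged at this threshold."""
    if not nondup_scores:
        raise ValueError("need at least one non-duplicate score")
    flagged = sum(1 for s in nondup_scores if s >= threshold)
    return flagged / len(nondup_scores)


# Made-up similarity scores for four non-duplicate pairs.
nondup_scores = [0.10, 0.35, 0.62, 0.20]
for t in (0.3, 0.5, 0.7):
    print(t, false_positive_rate(nondup_scores, t))
```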


  1. python new

    python old

    (It will take data/clf/second_msr_pairs.txt as input, and write the output into files: result_on_topk_new.txt & result_on_topk_old.txt)

  2. python new

    python old

    (It will print top-K recall for our method and another method to stdout.)
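
Top-K recall, as reported in the last step, can be sketched as the share of query PRs whose known duplicate appears among the first K ranked candidates. The function name `top_k_recall` and the dictionary layout are hypothetical; the project's result files may be structured differently.

```python
def top_k_recall(rankings, k):
    """rankings: {query_pr: (ranked_candidates, true_duplicate)}.

    Returns the fraction of queries whose true duplicate is within the top k.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, truth in rankings.values() if truth in ranked[:k])
    return hits / len(rankings)


# Made-up PR numbers: each query maps to its ranked candidates and its true duplicate.
rankings = {
    101: ([7, 42, 13], 42),
    102: ([5, 8, 9], 9),
    103: ([3, 1, 2], 4),  # the true duplicate was never retrieved
}
for k in (1, 2, 3):
    print(k, top_k_recall(rankings, k))
```

Running the same function on the rankings of two methods gives directly comparable numbers for each K.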


  1. python data/small_sample_for_precision.txt 70

    (It will generate data/small_sample_for_precision.txt using random sampling)

    python data/clf/second_msr_pairs.txt data/small_sample_for_recall.txt 200

    (It will generate data/small_sample_for_recall.txt using random sampling)

  2. python

    (It will take data/small_sample_for_precision.txt & data/small_sample_for_recall.txt as input, and write the output into files: evaluation/small_sample_for_precision.txt_XXXX.out.txt & evaluation/small_sample_for_recall.txt_XXXX.out.txt)

  3. Manually label all the output files evaluation/small_sample_for_precision.txt_XXXX.out.txt by adding Y/N/Unknown at the end (see evaluation/small_sample_for_precision.txt_new_example.out for an example)

  4. python

    (It will print precision for all the leave-one-out models under a fixed recall to stdout.)
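
One way to read "precision under a fixed recall" is: pick the highest threshold that still recovers the target share of known duplicates, then measure precision on the manually labeled sample at that threshold. The sketch below follows that reading for a single model and would be called once per leave-one-out model. The function name `precision_at_recall` and both input shapes are assumptions, not the project's code.

```python
import math


def precision_at_recall(dup_scores, labeled_sample, target_recall):
    """dup_scores: scores of known duplicate pairs (used to fix the recall).
    labeled_sample: list of (score, is_duplicate) from manual labeling.

    Returns (threshold, precision); precision is None if nothing is flagged.
    """
    if not dup_scores or not 0 < target_recall <= 1:
        raise ValueError("need duplicate scores and a recall in (0, 1]")
    ranked = sorted(dup_scores, reverse=True)
    # Number of known duplicates that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(ranked)))
    threshold = ranked[needed - 1]
    flagged = [dup for score, dup in labeled_sample if score >= threshold]
    precision = sum(flagged) / len(flagged) if flagged else None
    return threshold, precision


# Made-up numbers for illustration.
dup_scores = [0.9, 0.8, 0.6, 0.3]
labeled_sample = [(0.95, True), (0.7, False), (0.65, True), (0.5, False)]
print(precision_at_recall(dup_scores, labeled_sample, 0.75))
```

Fixing recall first makes the precision values of different models comparable, since each model is evaluated at its own threshold.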

Main API:

python repo # detect all the PRs of repo
python repo pr_num # detect one PR

python repo # detect all the open PRs of repo

python repo1 repo2 # detect the PRs between repo1 and repo2

python result_file # print HTML for the PR pairs

Classification Model using Machine Learning.

# Set up the input dataset
c = classify()

init_model_with_repo(repo) # prepare for prediction

Natural Language Processing model for calculating the text similarity.

m = Model(texts)
text_sim = query_sim_tfidf(tokens1, tokens2)

Calculate the similarity for feature extraction.

# Set up the params of compare (different metrics).
# Check for init NLP model.
feature_vector = get_pr_sim_vector(pull1, pull2)

Detection on (open) pull requests.

detect.detect_one(repo, pr_num)

Detection on pull requests of cross-projects.

detect_on_cross_forks.detect_on_pr(repo_name)

Compare on granularity of commits.

About GitHub API setting and fetching.

              'fork' / 'pull' / 'issue' / 'commit' / 'branch',

get_pull(repo, num, renew)
get_pull_commit(pull, renew)
fetch_file_list(pull, renew)
get_another_pull(pull, renew)
check_too_big(pull)

Get data from API, parse the raw diff.

parse_diff(file_name, diff) # parse raw diff
fetch_raw_diff(url) # fetch raw diff from GitHub API
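
To give a rough idea of what parsing a raw diff involves, the sketch below splits the unified-diff text of a single file into added and removed lines. It is not the project's `parse_diff`: the function name `split_patch` and its return shape are hypothetical, and it assumes the input covers one file only.

```python
def split_patch(patch):
    """Split one file's unified-diff text into (added_lines, removed_lines)."""
    added, removed = [], []
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("@@"):
            in_hunk = True  # hunk header; changed lines follow
        elif not in_hunk:
            continue  # skip "---" / "+++" file headers before the first hunk
        elif line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
        # Context lines (leading space) and "\ No newline" markers are ignored.
    return added, removed


patch = """--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-print("old")
+print("new")
 x = 1
"""
added, removed = split_patch(patch)
print(added, removed)
```

Tracking whether a hunk has started avoids mistaking the `---` and `+++` file headers for removed or added code.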

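For readers unfamiliar with TF-IDF text similarity, which `query_sim_tfidf(tokens1, tokens2)` refers to above, here is a plain-Python sketch of the general technique: weight tokens by TF-IDF, then compare two documents by cosine similarity. It does not reproduce the project's model. The function names `tfidf_vectors` and `cosine`, and the smoothed IDF formula, are choices made for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF keeps weights positive even for tokens found in every document.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Made-up tokenized PR titles.
docs = [
    ["fix", "null", "pointer", "crash"],
    ["fix", "crash", "on", "null", "pointer"],
    ["update", "readme"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Two PRs that describe the same fix in slightly different words score close to 1, while PRs with no shared tokens score 0.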






