This repo contains the appendix/instruments for our paper, "A Large-Scale Study of ML-Related Python Projects".

The contents of each folder are listed below:

data: Contains the data, ordered by filtering/processing step.

  • 1-dependents: Names of GitHub projects that depend on the scikit-learn and TensorFlow libraries.
  • 1a-dependents_queried: List of projects with additional information obtained from the GitHub REST API.
  • 2-forks_removed: List of GitHub projects left after removing forks.
  • 3a-number_commits_queried: List of projects with their number of commits.
  • 3-number_commits_filtered: List of projects with a commit count >= 50.
  • 4-library_calls_filtered: List of projects with relevant library calls.
  • 5-attributes: List of projects along with the following attributes: number of contributors, branches, pull requests, tags, number of releases, issues, files, and their ML development phases.
  • 6-ml_stages: List of projects with the number of files related to each ML stage.
  • commit_stages: Information about which ML stage was changed in which commit. Contains one .csv file for each project.
  • API-dictionary: API dictionary mapping library calls to ML development phases.
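
To illustrate how an API dictionary of this kind can be used, here is a minimal sketch that maps the library calls found in a source file to ML development phases. The dictionary entries, the phase names, and the function `phases_in_source` are assumptions made for illustration and are not taken from this repository. The scripts in this repo may work differently.

```python
import ast
from typing import Dict, Set

# Hypothetical excerpt of an API dictionary: fully qualified call -> ML phase.
# Entries and phase names are illustrative, not copied from the repository.
API_DICTIONARY: Dict[str, str] = {
    "sklearn.preprocessing.StandardScaler": "data preprocessing",
    "sklearn.linear_model.LogisticRegression": "model training",
    "sklearn.metrics.accuracy_score": "model evaluation",
}


def phases_in_source(source: str) -> Set[str]:
    """Return the ML phases whose library calls appear in the given source."""
    tree = ast.parse(source)

    # Map local names to the fully qualified names they were imported as,
    # e.g. "acc" -> "sklearn.metrics.accuracy_score".
    imported: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                local_name = alias.asname or alias.name
                imported[local_name] = f"{node.module}.{alias.name}"

    # Look up every call to a directly imported name in the dictionary.
    phases: Set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            qualified = imported.get(node.func.id)
            if qualified in API_DICTIONARY:
                phases.add(API_DICTIONARY[qualified])
    return phases
```

This sketch only resolves names imported with `from ... import ...` and called directly. Calls through module attributes (for example `sklearn.metrics.accuracy_score(...)`) would need additional handling.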

scripts: Python scripts used in this study.

  • utilities: Collection of functions required for the scripts
  • data_acquisition: Scripts required to obtain the data from GitHub
  • filtering: Scripts for filtering the data obtained from GitHub
  • data_processing: Scripts that generate further information required for the analysis
  • analysis: Scripts that generate the results presented in the paper from the output of filtering and data_processing


Setup:

  • Required Python version: 3.8
  • Dependencies can be found in requirements.txt
  • Suggested installation procedure:
    • conda create -n ml-systems-study python=3.8
    • conda activate ml-systems-study
    • pip install -r requirements.txt
    • conda develop "path/to/cloned/repository"

Reproducing the results from the paper:

  • Obtaining the data from GitHub: Execute scripts/data_acquisition/
  • Filtering the data: Execute the Python scripts in scripts/filtering in the given order
  • Additional processing required for results to RQ3: Run scripts/data_processing/
  • Analysis:
    • RQ1 (Table 1 and Figs. 4 and 5): scripts/analysis/
    • RQ2 (Figs. 6 and 7): scripts/analysis/
    • RQ2 (Figs. 8 and 9): scripts/analysis/
    • RQ3 (Figs. 10 and 11): scripts/analysis/
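
As a rough illustration of the filtering steps described in the data section (removing forks, then keeping projects with a commit count >= 50), here is a minimal sketch. The column names `name`, `is_fork`, and `commits`, the sample rows, and the function `filter_projects` are assumptions for illustration. The actual files and the scripts in scripts/filtering may be structured differently.

```python
import csv
import io

MIN_COMMITS = 50  # threshold stated above (commit count >= 50)


def filter_projects(rows):
    """Drop forks, then keep projects with at least MIN_COMMITS commits."""
    kept = []
    for row in rows:
        if row["is_fork"] == "True":  # hypothetical column name and encoding
            continue
        if int(row["commits"]) < MIN_COMMITS:  # hypothetical column name
            continue
        kept.append(row)
    return kept


# Illustrative in-memory CSV; the real data files may be laid out differently.
SAMPLE = """name,is_fork,commits
alice/model-a,False,120
bob/model-a-fork,True,300
carol/tiny,False,12
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print([row["name"] for row in filter_projects(rows)])  # ['alice/model-a']
```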