
Contributing to Soorgeon

Hi! Thanks for considering a contribution to Soorgeon. We're building Soorgeon to help data scientists convert their monolithic notebooks into maintainable pipelines. Our command-line interface takes a notebook as input and generates a Ploomber pipeline as output (see demo here), and we need your help to ensure our tool is robust to real-world code (which is messy!).

We're downloading publicly available notebooks from Kaggle and testing if our tools can successfully refactor them.

This guide explains what the process looks like, from finding a candidate notebook to merging your changes. Once your notebook is merged, it becomes part of our test suite: whenever we publish a new Soorgeon version, we'll test with all the contributed notebooks.

Setup development environment with conda

To contribute to the Soorgeon code base and make it better, you need to set up the development environment. The process is similar to how you would contribute to Ploomber. Here are more detailed instructions.

The easiest way to set up the development environment is via the setup command; you must have miniconda installed.

Click here for miniconda installation details.

Make sure conda has conda-forge as a channel by running the following:

conda config --add channels conda-forge
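
You can confirm the channel was added by listing the configured channels:

conda config --show channels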

Once you have conda ready:

# get the code
# we recommend you fork our repo and work on your fork
git clone https://github.com/{your_github_username}/soorgeon

# invoke is a library we use to manage one-off commands
pip install invoke

# move into soorgeon directory
cd soorgeon

# setup development environment
invoke setup

Then activate the environment:

conda activate soorgeon

And BAM! You are ready to go!

Branch name requirement

To prevent double execution of the same CI pipelines, we limit which GitHub push events trigger them: only pushes to certain branches will trigger the pipelines. That means if you have GitHub Actions turned on and want to run workflows in your forked repo, you will need to either push directly to your master branch or use branch names strictly following this convention: dev/{your-branch-name}.
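
For example, to create a branch whose name follows the convention (the branch name itself is just an illustration):

git checkout -b dev/add-notebook-test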

On the other hand, if you choose not to turn on GitHub Actions in your own repo and simply run tests locally, you can disregard this information, since a pull request from your forked repo to the ploomber/soorgeon repo will always trigger the pipelines.

Testing and submitting code

Please refer here for a more detailed explanation.

Adding new test notebooks

1. Find a candidate notebook

Look for notebooks in Kaggle that run fast (ideally, <1 minute), use small datasets (<20 MB), have lots of code (the longer, the better), and were executed recently (no more than three months ago).

Here's a sample notebook that has all those characteristics: kaggle.com/yuyougnchan/look-at-this-note-feature-engineering-is-easy.

2. Open an issue to suggest a notebook

Open an issue and share the URL with us; we'll take a quick look and let you know what we think.

3. Configure development environment

If we move forward, you can set up the development environment with:

pip install ".[dev]"

Note: We recommend you run the command above in a virtual environment.
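
For instance, a minimal setup using Python's built-in venv module (the .venv directory name is arbitrary):

# create and activate a virtual environment (Linux/macOS)
python -m venv .venv
source .venv/bin/activate

# then install the development dependencies
pip install ".[dev]"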

4. Configure Kaggle CLI

You must have a Kaggle account to continue. Once you create one, follow the instructions to configure the CLI client.
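
In short, the CLI reads an API token from a kaggle.json file: after generating the token from your Kaggle account page, place it where the client expects it (Linux/macOS paths shown; the download location is an assumption):

mkdir -p ~/.kaggle
# adjust the source path to wherever you saved the token
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json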

5. Download the notebook file .ipynb

Download the notebook with the following command:

python -m soorgeon._kaggle notebook user/notebook-name

# example
python -m soorgeon._kaggle notebook yuyougnchan/look-at-this-note-feature-engineering-is-easy

Note that the command above converts the notebook (.ipynb) to .py using %% cell separators. We prefer .py over .ipynb since .py files play better with git.
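
For reference, a converted nb.py uses # %% markers to delimit cells; the contents below are illustrative:

# %% [markdown]
# # Exploratory analysis

# %%
import pandas as pd

df = pd.read_csv('input/data.csv')

# %%
df.describe()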

6. Download data

If you go to the data section in the notebook, you'll see a list (right side) of input datasets (look for the label Input (X.X MB)). Sometimes, authors include many datasets, but the notebook may only use a few of them, so please check in the notebook contents which ones are actually used; we want to download as little data as possible to make testing fast.

Our example notebook takes us to this URL: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

We know this is a competition because the URL has the form: kaggle.com/c/{name}

Downloading a competition dataset

To download a competition dataset, execute the following:

# Note: make sure to execute this in the _kaggle/{notebook-name}/ directory

python -m soorgeon._kaggle competition {name}

# example
python -m soorgeon._kaggle competition house-prices-advanced-regression-techniques

Downloading a user's dataset

Other notebooks use datasets that are not part of a competition. For example, this notebook uses this dataset.

The URL is different; it has the format: kaggle.com/{user}/{dataset}

To download a dataset like that:

# Note: make sure to execute this in the _kaggle/{notebook-name}/ directory
python -m soorgeon._kaggle dataset user/dataset-name

# example
python -m soorgeon._kaggle dataset imakash3011/customer-personality-analysis

Final layout

Your layout should look like this:

_kaggle/
    {notebook-name}/
        nb.py
        input/
            data.csv

7. Notebook edits

The nb.py file may contain paths to files that are different from our setup, so locate all the relevant lines (e.g., df = pd.read_csv('/path/to/data.csv')) and change them to a relative path into the input/ directory (e.g., df = pd.read_csv('input/data.csv')).
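
For example (the original absolute path is hypothetical):

# before: absolute path from the author's environment
df = pd.read_csv('/kaggle/input/some-dataset/data.csv')

# after: relative path into the input/ directory
df = pd.read_csv('input/data.csv')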

If you find any calls to pip, such as ! pip install {package}, remove them.

8. Test the notebook

Test it:

python -m soorgeon._kaggle test nb.py

A few things may go wrong, so you may have to make some edits to nb.py.

If missing dependencies

Add a requirements.txt under _kaggle/{notebook-name} and list all the dependencies:

# requirements.txt
scikit-learn
pandas
matplotlib

If the notebook is old, you may encounter problems if the API for a specific library changed since the notebook last ran. Fixing these API incompatibility issues requires a trial-and-error process: looking at the library's changelog and figuring out either which version to use or how to fix the code. This is why we recommend notebooks that were executed recently; their code is more likely to work with the latest version of each library.
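
If you do decide to use an older library version, you can pin it in the requirements.txt (the version number below is purely illustrative):

# requirements.txt
scikit-learn==0.24.2  # hypothetical pin to match the notebook's API usage
pandas
matplotlib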

If you encounter issues like this, let us know by adding a comment in the issue you opened in Step 2.

9. Running Soorgeon

Let's now check if Soorgeon can handle the notebook:

soorgeon refactor nb.py

If the command above throws a Looks like the following functions are using global variables... error, click here to see fixing instructions.
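
The usual fix is to pass the global variable explicitly as a function parameter; here's a minimal sketch with illustrative names:

# before: clean() relies on the global variable df
def clean():
    return df.dropna()

df_clean = clean()

# after: df is passed in as an argument
def clean(df):
    return df.dropna()

df_clean = clean(df)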

Add a comment on the issue you created in Step 2 if the command throws a different error or if you cannot fix the global variables issue. Please include the entire error traceback in the GitHub issue.

10. Testing the generated pipeline

To ensure that the generated pipeline works, execute the following commands:

ploomber status
ploomber plot
ploomber build

Add a comment on the issue you created in Step 2 if any command throws an error. Please include the entire error traceback in the GitHub issue.

11. Registering the notebook

Add a new entry to _kaggle/index.yaml:

(if using a dataset from a competition)

- url: https://www.kaggle.com/{user}/{notebook-name}
  data: https://www.kaggle.com/c/{competition-name}

(if using a user's dataset)

- url: https://www.kaggle.com/{user}/{notebook-name}
  data: https://www.kaggle.com/{user-another}/{dataset-name}

Then, open a pull request to merge your changes.

Thanks a lot for helping us make Soorgeon better! Happy notebook refactoring!