[python] remove unnecessary files to reduce sdist size #3579

Merged
merged 1 commit into from Nov 19, 2020

Collaborator

@jameslamb jameslamb commented Nov 18, 2020

Short Description

This PR removes unnecessary files from the tarball created by python setup.py sdist, to make that source distribution smaller.

Long Description

This week I've been revisiting a conference talk I gave in April, where I showed how to deploy a LightGBM model on AWS Lambda, using a Python Lambda.

To make external dependencies like lightgbm available to a Python Lambda, you have to create something called a "Lambda Layer", which is basically like a volume with files on it that get mounted into the filesystem used by your code. The sum of all "layers" that you add cannot exceed 250 MB uncompressed. This made it hard to set up, for example, a Lambda that used lightgbm, pandas, and scikit-learn. I had to do some surgery on the packages' source distributions to get under the limit: https://github.com/jameslamb/talks/blob/main/cloud-intro/scripts/create-layers.sh
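A rough way to check whether a set of layer directories stays under that limit is to sum the file sizes on disk. This is only a sketch: it uses the 250 MB figure quoted above, assumes 1 MB means 1024 * 1024 bytes, and the helper names (`dir_size_bytes`, `fits_in_limit`) are hypothetical.

```python
import os

LAMBDA_LAYER_LIMIT_MB = 250  # uncompressed limit quoted in the PR description


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_limit(layer_dirs, limit_mb=LAMBDA_LAYER_LIMIT_MB):
    """Return (total_bytes, ok) for the combined size of all layer directories."""
    total = sum(dir_size_bytes(d) for d in layer_dirs)
    # Assumption: 1 MB is taken as 1024 * 1024 bytes.
    return total, total <= limit_mb * 1024 * 1024
```

Calling `fits_in_limit(["layer-a/python", "layer-b/python"])` (hypothetical paths) returns the combined byte count and whether it is within the budget.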

To help people who use Lambdas, or who work in other settings that are very sensitive to code footprint, this PR proposes changes that remove unnecessary files from the source distribution of the Python package.

If you look in the sdist logs, there are a bunch of files from the compute submodule that don't need to be included in lightgbm, like that submodule's tests and documentation.

cd python-package
python setup.py sdist
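Besides reading the sdist logs, one way to see which directories dominate the tarball is to group member sizes by path prefix. The following is a minimal sketch using the standard-library `tarfile` module; the function name `size_by_prefix` and the default `depth` are assumptions made for illustration.

```python
import tarfile
from collections import Counter


def size_by_prefix(sdist_path, depth=3):
    """Map the first `depth` path components of each file to its total bytes."""
    totals = Counter()
    with tarfile.open(sdist_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                prefix = "/".join(member.name.split("/")[:depth])
                totals[prefix] += member.size
    return totals


# Example usage (the path is hypothetical):
# for prefix, nbytes in size_by_prefix("dist/lightgbm.tar.gz").most_common(10):
#     print(f"{nbytes:>10}  {prefix}")
```

Sorting with `most_common` puts the largest prefixes first, which makes directories such as the compute submodule's tests or documentation easy to spot.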

This pull request proposes changes to exclude them. I tested on my Mac and found the following changes in the package size.

| | master | this PR |
|---|---|---|
| sdist (compressed) | 740K | 576K |
| sdist (uncompressed) | 6.1M | 4.3M |
| wheel (compressed) | 1.0M | 1.0M |
| wheel (uncompressed) | 3.6M | 3.6M |

It makes sense that the wheel didn't change size, since we don't include compute in it.
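For a sense of scale, the sdist numbers in the table work out to a reduction of roughly 22% compressed and 30% uncompressed. The arithmetic, as a small sketch (the helper name is hypothetical):

```python
def percent_reduction(before, after):
    """Relative size reduction, as a percentage of the original size."""
    return 100.0 * (before - after) / before


# Sizes copied from the table above (same unit within each pair).
print(f"sdist compressed:   {percent_reduction(740, 576):.1f}%")
print(f"sdist uncompressed: {percent_reduction(6.1, 4.3):.1f}%")
```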

Script I used to check sizes:
#!/bin/bash

pushd "$(pwd)/python-package"

    # clean up files from previous builds
    rm -rf build_cpp
    rm -rf build
    rm -rf compile
    rm -rf dist
    rm -rf lightgbm.egg-info
    rm -f ~/lgb-tmp.log

    echo ""
    echo "building source distribution"
    echo ""
    rm -rf dist/
    python setup.py sdist >> ~/lgb-tmp.log
    pushd dist/
        echo ""
        echo "sdist compressed size"
        echo ""
        du -a -h .
        tar -xf lightgbm*.tar.gz
        rm lightgbm*.tar.gz
        ls .
        echo ""
        echo "sdist uncompressed size"
        echo ""
        du -sh .
    popd

    echo ""
    echo "building wheel"
    echo ""
    rm -rf build_cpp
    rm -rf build
    rm -rf compile
    rm -rf lightgbm.egg-info
    rm -rf dist/
    python setup.py bdist_wheel --plat-name=macosx --universal >> ~/lgb-tmp.log
    pushd dist/
        echo ""
        echo "wheel compressed size"
        echo ""
        du -a -h .
        # a wheel is a zip archive; bsdtar (macOS) can read it, GNU tar cannot
        unzip -q lightgbm*.whl
        rm *.whl
        echo ""
        echo "wheel uncompressed size"
        echo ""
        du -sh .
    popd

popd
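The same measurement can be approximated without unpacking anything, by reading sizes from the archive metadata. This is a sketch under a few assumptions: the function name `archive_sizes` is hypothetical, files ending in `.whl` are treated as zip archives and everything else as a tarball, and the totals are byte counts of file contents, so they will not match `du` exactly (which reports disk usage).

```python
import os
import tarfile
import zipfile


def archive_sizes(path):
    """Return (compressed_bytes, uncompressed_bytes) for an sdist or wheel."""
    compressed = os.path.getsize(path)
    if path.endswith(".whl"):
        # Wheels are zip archives.
        with zipfile.ZipFile(path) as zf:
            uncompressed = sum(info.file_size for info in zf.infolist())
    else:
        # Assumption: anything else is a (possibly compressed) tarball.
        with tarfile.open(path, "r:*") as tar:
            uncompressed = sum(m.size for m in tar.getmembers() if m.isfile())
    return compressed, uncompressed
```

Running it on the files in `dist/` before and after a change to `MANIFEST.in` gives a quick comparison of the two builds.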

Notes for Reviewers

If you agree with the spirit of this PR, we should also be careful to exclude unnecessary files from the submodules introduced in #3405, but that PR already has a lot of comments, so I'd rather do that as a follow-up after #3405 is merged.

For reference, the compute submodule comes from https://github.com/boostorg/compute.

@@ -4,9 +4,13 @@ include *.rst *.txt
 recursive-include lightgbm *.py *.txt *.so
 recursive-include compile *.txt *.so
 recursive-include compile/Release *.dll
-recursive-include compile/compute *
+recursive-include compile/compute/ *.txt
Collaborator Author

I changed this line from * to *.txt to exclude files like .appveyor.yml while still including important files like CMakeLists.txt and the license.
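The effect of narrowing the pattern can be approximated with shell-style matching on file names. This sketch uses `fnmatch` from the standard library as a stand-in for the matching that `MANIFEST.in` processing performs (an approximation, not the actual setuptools code path). `.appveyor.yml` and `CMakeLists.txt` come from the comment above; the other file names are hypothetical.

```python
from fnmatch import fnmatchcase

# .appveyor.yml and CMakeLists.txt are named in the review comment;
# LICENSE_1_0.txt and README.md are hypothetical examples.
filenames = [".appveyor.yml", "CMakeLists.txt", "LICENSE_1_0.txt", "README.md"]


def matched(pattern, names):
    """Return the names that match a shell-style pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


print(matched("*", filenames))      # every file is included
print(matched("*.txt", filenames))  # only the .txt files remain
```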

Collaborator

@StrikerRUS StrikerRUS left a comment

I think it makes sense! Thanks!

@github-actions
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023