BUILD: missing test data in 2.1.0 sdist/install #54907

Open
1 task done
mgorny opened this issue Aug 31, 2023 · 24 comments

@mgorny
Contributor

mgorny commented Aug 31, 2023

Installation check

Platform

Linux-6.4.7-gentoo-dist-x86_64-AMD_Ryzen_5_3600_6-Core_Processor-with-glibc2.38

Installation Method

pip install

pandas Version

2.1.0

Python Version

3.11.5

Installation Logs

$ pip install pandas
Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/d9/26/895a49ebddb4211f2d777150f38ef9e538deff6df7e179a3624c663efc98/pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy>=1.23.2 (from pandas)
  Obtaining dependency information for numpy>=1.23.2 from https://files.pythonhosted.org/packages/32/6a/65dbc57a89078af9ff8bfcd4c0761a50172d90192eaeb1b6f56e5fbf1c3d/numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading pandas-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 63.1 MB/s eta 0:00:00
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: pytz, tzdata, six, numpy, python-dateutil, pandas
Successfully installed numpy-1.25.2 pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3 six-1.16.0 tzdata-2023.3
$ cd ${VIRTUAL_ENV}/lib/python3.11/site-packages
# e.g.:
$ python -m pytest pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager -x
========================================================= test session starts =========================================================
platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.3.0
rootdir: /tmp/.venv/lib/python3.11/site-packages/pandas
configfile: pyproject.toml
plugins: asyncio-0.21.1, hypothesis-6.82.7
asyncio: mode=Mode.STRICT
collected 4 items                                                                                                                     

pandas/tests/io/parser/common/test_file_buffer_url.py F

============================================================== FAILURES ===============================================================
____________________________________________________ test_context_manager[c_high] _____________________________________________________

all_parsers = <pandas.tests.io.parser.conftest.CParserHighMemory object at 0x7ff0df5f2fd0>
datapath = <function datapath.<locals>.deco at 0x7ff0df775440>

    def test_context_manager(all_parsers, datapath):
        # make sure that opened files are closed
        parser = all_parsers
    
>       path = datapath("io", "data", "csv", "iris.csv")

pandas/tests/io/parser/common/test_file_buffer_url.py:372: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = ('io', 'data', 'csv', 'iris.csv'), path = '/tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv'

    def deco(*args):
        path = os.path.join(BASE_PATH, *args)
        if not os.path.exists(path):
            if strict_data_files:
>               raise ValueError(
                    f"Could not find file {path} and --no-strict-data-files is not set."
                )
E               ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-files is not set.

pandas/conftest.py:1201: ValueError
------------------------------ generated xml file: /tmp/.venv/lib/python3.11/site-packages/test-data.xml ------------------------------
======================================================== slowest 30 durations =========================================================

(3 durations < 0.005s hidden.  Use -vv to show these durations.)
======================================================= short test summary info =======================================================
FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_context_manager[c_high] - ValueError: Could not find file /tmp/.venv/lib/python3.11/site-packages/pandas/tests/io/data/csv/iris.csv and --no-strict-data-fil...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed in 0.12s ==========================================================

The file is in the source directory, so I guess it isn't installed by meson.
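
The traceback above comes from the `datapath` helper in `pandas/conftest.py`. The following is a minimal stand-alone sketch of the check it performs, not the pandas implementation: `make_datapath` and `MissingTestData` are hypothetical names, and the non-strict branch only assumes that the test is skipped rather than failed.

```python
import os


class MissingTestData(Exception):
    """Hypothetical stand-in for a 'skip this test' signal."""


def make_datapath(base_path, strict_data_files=True):
    """Return a helper that joins path parts under base_path and checks existence."""

    def deco(*args):
        path = os.path.join(base_path, *args)
        if not os.path.exists(path):
            if strict_data_files:
                # Same message as the ValueError in the traceback.
                raise ValueError(
                    f"Could not find file {path} and --no-strict-data-files is not set."
                )
            # Assumption: in non-strict mode the test is skipped, not failed.
            raise MissingTestData(f"Could not find {path}.")
        return path

    return deco
```

With the default `strict_data_files=True`, any test that asks for a file missing from the install fails with the `ValueError` shown in the report.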

@mgorny mgorny added Build Library building on various platforms Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 31, 2023
@mgorny
Contributor Author

mgorny commented Aug 31, 2023

Ah, sorry, I was wrong. I was looking at the git repo and not sdist — the files are missing from sdist, so I guess that's why they aren't in the wheel either.

@mgorny mgorny changed the title BUILD: missing test data in 2.1.0 install BUILD: missing test data in 2.1.0 sdist/install Aug 31, 2023
@mroeschke
Member

Thanks for the report. This was done (silently) to shrink down the wheel size #54052

The "public" way to run the tests from the install is to call pd.test, but I think --no-strict-data-files will need to be set

cc @lithomas1

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 31, 2023
@lithomas1
Member

Yes, you'll need to pass --no-strict-data-files to pd.test.

IIRC, the data files for IO were never shipped, but we recently changed things to error by default when the data files are missing.
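
A minimal sketch of what passing the flag through `pd.test` could look like. `build_pytest_args` and `run_installed_tests` are hypothetical helper names; the sketch assumes `pandas.test()` accepts an `extra_args` list of pytest arguments and that pandas and pytest are installed when `run_installed_tests` is called.

```python
def build_pytest_args(strict_data_files=False, extra=()):
    """Build the argument list to hand to pytest via pandas.test()."""
    args = list(extra)
    if not strict_data_files:
        # Flag named in the ValueError from the original report.
        args.append("--no-strict-data-files")
    return args


def run_installed_tests():
    """Run the test suite of the installed pandas (needs pandas and pytest)."""
    import pandas as pd

    # Assumption: extra_args is forwarded to pytest as-is.
    pd.test(extra_args=build_pytest_args())
```

Keeping the argument construction separate from the call makes it easy to add further pytest options (for example `-x`) without touching the invocation.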

@mgorny
Contributor Author

mgorny commented Aug 31, 2023

I can understand shrinking wheel sizes but what I don't really understand is why you're also stripping it from GitHub archives that are our last fallback for when sdists are unsuitable for testing.

@lithomas1
Member

The Github archive is the sdist, or it should be at least.

(See https://pandas.pydata.org/docs/development/maintaining.html#release-process:~:text=Create%20a%20new%20GitHub%20release%3A for more info.)

Is there a reason you can't pass --no-strict-data-files to pd.test?

(You might want to consider building from the git tag of the release then, if you need all the files.)

@bnavigator
Contributor

The Github archive is the sdist, or it should be at least.

No it is not. See #54903

And more specifically, before 2.1 the Github archive contained the test data while the PyPI published sdist does not.

@lithomas1
Member

I think we are talking about different things here. In the assets tab of the release, do you mean pandas-2.1.0.tar.gz or Source code (tar.gz)?

I am talking about the first one.

@lithomas1
Member

The second one is not something that I control IIUC.

@bnavigator
Contributor

Check the contents. The second one does not have setup.py, _version_meson.py, or the test data, which makes it pretty useless.

I guess it is triggered by the change of .gitattributes.

@bnavigator
Contributor

But it is the "github archive" we are talking about. It was our fallback for the sdist that is published on PyPI and as an asset on GitHub ("the first one").

@lithomas1
Member

Sorry, I deleted my previous comment, I wanted to expand on it more.

I checked the Source code (tar.gz) file, and it has the pandas/tests/io/data directory. _version_meson.py is generated by the build system.

How are you building pandas?

Also, you mention the Github archive is a fallback, is there something wrong with the sdist on PyPI?

@bnavigator
Contributor

bnavigator commented Aug 31, 2023

(You might want to consider building from the git tag of the release then, if you need all the files.)

That's the problem: The direct download link is the same as the "Source code" on the release page: https://github.com/pandas-dev/pandas/archive/refs/tags/v2.1.0.zip does not contain the data. Only a proper git clone will have it.

I checked the Source code (tar.gz) file, and it has the pandas/tests/io/data directory.

The directories in it are empty.
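
One way to check this claim for any downloaded tarball is to count the regular files under the data directory. This is a minimal sketch using only the standard library; `count_files_under` is a hypothetical helper name, and it matches on a path substring because GitHub archives wrap everything in a top-level directory.

```python
import tarfile


def count_files_under(archive_path, subdir):
    """Count regular files whose path inside the tarball contains subdir."""
    needle = "/" + subdir.strip("/") + "/"
    with tarfile.open(archive_path) as tar:
        return sum(
            1
            for member in tar.getmembers()
            # Directory entries are ignored, so an empty skeleton counts as 0.
            if member.isfile() and needle in "/" + member.name
        )
```

For example, `count_files_under("v2.1.0.tar.gz", "pandas/tests/io/data")` returning 0 would mean the archive carries at most the directory skeleton.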

_version_meson.py is generated by the build system.

Only if there is a setup.py for versioneer to use. But that one is also missing.

Also, you mention the Github archive is a fallback, is there something wrong with the sdist on PyPI?

Yes, it lacks the test data. We distribution packagers need to run the test suite as completely as possible in order to ensure package integrity.

@bnavigator
Contributor

@mgorny
Contributor Author

mgorny commented Sep 1, 2023

The version from https://github.com/gentoo/gentoo/blob/1761e8fcdfda09370046cdd0e382c3aa206d3f61/dev-python/pandas/pandas-2.1.0.ebuild is more up-to-date.

I'm passing --no-strict-data-files for now, but 1) as @bnavigator points out, it's far from optimal, and 2) it doesn't seem to cover the lxml tests.

I've learned that .gitattributes is apparently necessary because of a bad design in meson(-python) that doesn't allow controlling sdist contents (sigh).

My only idea so far would be to move all the undesirable test data from subdirectories into one git submodule. That should prevent it from being included in sdist, and make it easy for us to fetch it independently and merge with the rest.

@lithomas1 lithomas1 added this to the 2.1.1 milestone Sep 1, 2023
@lithomas1 lithomas1 self-assigned this Sep 1, 2023
@rebecca-palmer
Contributor

It's not just the test data - the documentation, and possibly also some smaller items, have also been removed.

In Debian, I've switched to using the git repository itself, so this isn't blocking for my packaging. I don't know whether the Gentoo or openSUSE build tools have an equivalent option.

move all the undesirable test data from subdirectories into one git submodule

If you do that, you might find my patch for loading test data from a different path useful. (Debian prefers to (also) run tests against the as-installed package, and we don't want test data taking up space in the user package either.)
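
The general idea of loading test data from a different path can be sketched as follows. This is not the Debian patch itself: `resolve_test_data_base` and the `PANDAS_TEST_DATA_DIR` environment variable are hypothetical names chosen for illustration, and neither pandas nor Debian is known to define them.

```python
import os


def resolve_test_data_base(default_base, env_var="PANDAS_TEST_DATA_DIR"):
    """Return the directory to load test data from.

    env_var is a hypothetical name used only in this sketch.
    """
    override = os.environ.get(env_var)
    if override:
        # Test data lives outside the installed package.
        return os.path.abspath(override)
    # Fall back to the data directory next to the test code.
    return default_base
```

A packager could then point the variable at the unpacked source tree while running the tests against the installed package.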

@lithomas1
Member

lithomas1 commented Sep 3, 2023

It's not just the test data - the documentation, and possibly also some smaller items, have also been removed.

Yes that is intentional. I don't really think you can do anything with the raw .rst files for the docs.

Is there any way SUSE and Gentoo can just switch to using the git tag?

(I have a fix in mind for this issue, but I'm not sure it's going to work with the current state of meson/meson-python. It's also likely going to be more involved, and will take me a while.)

@mgorny
Contributor Author

mgorny commented Sep 4, 2023

Is there any way SUSE and Gentoo can just switch to using the git tag?

We can't. Gentoo users build from source, and we require sources that are available via plain HTTPS download. Fetching via git poses too many problems, in particular it doesn't support resuming and doesn't work over shoddy connections, let alone dealing with existing mirroring infrastructure.

@rebecca-palmer
Contributor

I don't really think you can do anything with the raw .rst files for the docs.

What do you mean by that? In Debian, we do build the documentation from source.

you might find my patch for loading test data from a different path useful.

This was broken with 2.1 when I posted that; I think it is fixed now, but this has not yet been tested.

@lithomas1
Member

What do you mean by that? In Debian, we do build the documentation from source.

Ah, I wasn't aware that Debian also built our docs.
Can you clarify why you're doing this? I'd expect most users to just go to pandas.pydata.org for the official pandas docs.
I also can't seem to find pandas documentation built by Debian online.

@mgorny
Contributor Author

mgorny commented Sep 5, 2023

We're doing stuff like that because there are actually people who need to work without reliable Internet access, or without Internet access at all (e.g. while traveling).

@rebecca-palmer
Contributor

The Debian-built documentation is offered as an installable package (python-pandas-doc). And yes, it's relatively rarely used.

@lithomas1 lithomas1 modified the milestones: 2.1.1, 2.1.2 Sep 21, 2023
@rockdrilla

I'm using pandas benchmarks to run them with profiling-enabled Python during PGO build (i.e. script).

IMO, the preferred way to get pandas source code is to fetch/cache tarball from GitHub via URL https://github.com/pandas-dev/pandas/archive/v${version}.tar.gz (in order to use software like Sonatype Nexus).

I'd like to see asv_bench/ back in release archives because it costs very little disk space (both compressed and uncompressed):

$ tar -cf - asv_bench > asv_bench.tar
$ gzip -k asv_bench.tar
$ ls -lh asv_bench.*
-rw-r--r-- 1 400K Oct  5 00:13 asv_bench.tar
-rw-r--r-- 1  68K Oct  5 00:13 asv_bench.tar.gz

Proposed solution:

  1. remove asv_bench export-ignore from .gitattributes.
  2. add prune asv_bench to MANIFEST.in.
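
For context, the `export-ignore` attribute tells `git archive` to leave matching paths out, which is why it also affects the auto-generated GitHub tarballs. A minimal sketch for listing which patterns a `.gitattributes` file excludes is shown below; `export_ignored_patterns` is a hypothetical helper, and it assumes simple whitespace-separated lines without quoted patterns.

```python
def export_ignored_patterns(gitattributes_text):
    """Return the path patterns marked export-ignore in a .gitattributes file."""
    patterns = []
    for raw_line in gitattributes_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and comments.
            continue
        pattern, *attrs = line.split()
        if "export-ignore" in attrs:
            patterns.append(pattern)
    return patterns
```

Running it on the repository's `.gitattributes` before and after step 1 would show whether `asv_bench` is still excluded from archives.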

@lithomas1 lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 26, 2023
@jorisvandenbossche jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023
@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue Feb 15, 2024
We don't ship these in the package,
but do want to run the tests that use them

tests_path() is removed completely because it is unclear whether it
should point to the tests code or the directory above the test data

Author: Rebecca N. Palmer <rebecca_palmer@zoho.com>
Forwarded: pandas-dev/pandas#54907


Gbp-Pq: Name find_test_data.patch
@lithomas1
Member

@mgorny @rebecca-palmer @bnavigator

Just a heads up.

I'm planning on removing tests from the pandas source distributions altogether. The plan is probably to make a separate package called pandas_tests, that will contain the tests (and the associated test data). pd.test will work as before, however you'll now need to have pandas_tests installed to run the tests.

One thing to note, though, is that the tests will stay in the main pandas repo, so if you're building from the tag, the end result will be a compiled version of pandas with the tests included (unless you build an sdist and then a wheel from the sdist, like we are planning on doing).

(Nothing is set in stone yet, but the plan is to do this for 3.0, so we still have at least several months to do this.)

Although tests will not be in the regular source distribution, I will be uploading a separate pandas_tests tarball, for the pandas_tests sdist. Would you be able to download that, and package for Linux distros that way?

Will this severely affect packaging for the Linux distros in any way?
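
Under that plan, a packaging script may want to check that the tests package is present before invoking the test suite. A minimal sketch, assuming the package ends up importable as `pandas_tests` (the name is taken from the plan above and may change); `tests_available` and `require_tests` are hypothetical helper names.

```python
import importlib.util


def tests_available(package="pandas_tests"):
    """Return True if the separate tests package can be found for import."""
    return importlib.util.find_spec(package) is not None


def require_tests(package="pandas_tests"):
    """Raise a descriptive error when the tests package is not installed."""
    if not tests_available(package):
        raise RuntimeError(
            f"{package} is not installed; install it before running the test suite."
        )
```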

@mgorny
Contributor Author

mgorny commented Feb 21, 2024

I think it'd be fine, if it's just a matter of extracting a second archive, and possibly moving stuff around.
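
Extracting a second archive on top of an already unpacked source tree can be sketched like this. `overlay_archive` is a hypothetical helper, and the layout of the future tests tarball is unknown, so any moving of files afterwards is left out.

```python
import tarfile


def overlay_archive(archive_path, dest_dir):
    """Extract a tarball on top of an existing tree; return the member names."""
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names
```

Files already present in `dest_dir` are left alone unless the archive contains a member with the same path.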

raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue Feb 22, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 removed this from the 2.2.2 milestone Apr 9, 2024
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue May 25, 2024
raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue Jun 11, 2024