Restore spark tests #2201

Closed

datapythonista opened this issue May 7, 2020 · 4 comments
Labels
ci (Continuous Integration issues or PRs) · pyspark (The Apache PySpark backend) · tests (Issues or PRs related to tests)

Comments

@datapythonista (Contributor)

To fix the CI, we had to temporarily remove the tests for spark and omnisci in #2194.

The original problem was that the CI was using more than the 10 GB of available disk space. After splitting the tests into two groups, to avoid downloading too many backend dependencies in a single build, omnisci and spark still give problems.

In the case of omnisci, the problem is that installing pymapd on top of the rest of the libraries increases the conda environment resolution time by something like 30 minutes or more. We have around 50 dependencies, and after spending a decent amount of time trying to see whether pinning something would solve the problem, I couldn't find anything.

For spark, we get some errors that seem unrelated to the split. It's difficult to tell whether these were passing recently, since the CI has been having problems for a while.

FAILED ibis/spark/tests/test_udf.py::test_udf[my_string_length_pandas] - py4j...
FAILED ibis/spark/tests/test_udf.py::test_elementwise_udf_with_non_vectors[my_add_pandas]
FAILED ibis/spark/tests/test_udf.py::test_elementwise_udf_with_non_vectors_upcast[my_add_pandas]
FAILED ibis/spark/tests/test_udf.py::test_multiple_argument_udf[my_add_pandas]
FAILED ibis/spark/tests/test_udf.py::test_multiple_argument_udf_group_by[my_add_pandas]
FAILED ibis/spark/tests/test_udf.py::test_udaf_groupby - py4j.protocol.Py4JJa...
FAILED ibis/spark/tests/test_udf.py::test_compose_udfs[add_one-times_two_pandas]
FAILED ibis/spark/tests/test_udf.py::test_compose_udfs[add_one_pandas-times_two]
FAILED ibis/spark/tests/test_udf.py::test_compose_udfs[add_one_pandas-times_two_pandas]
FAILED ibis/spark/tests/test_udf.py::test_array_return_type_reduction_window[qs0]
FAILED ibis/spark/tests/test_udf.py::test_array_return_type_reduction_window[qs1]
FAILED ibis/tests/all/test_temporal.py::test_strftime[PySpark-%Y%m%d-%Y%m%d]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-01-6-Sunday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-02-0-Monday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-03-1-Tuesday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-04-2-Wednesday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-05-3-Thursday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-06-4-Friday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_scalar[PySpark-2017-01-07-5-Saturday]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_column[PySpark] - py...
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_column_group_by[PySpark-<lambda>-<lambda>0]
FAILED ibis/tests/all/test_temporal.py::test_day_of_week_column_group_by[PySpark-<lambda>-<lambda>1]
FAILED ibis/tests/all/test_vectorized_udf.py::test_elementwise_udf[PySpark]
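
Many of the temporal failures above are day_of_week tests whose parametrize IDs encode the pandas day-of-week convention (Monday=0 through Sunday=6). A quick sanity check of that convention with plain pandas (illustrative only, not the ibis test code):

```python
import pandas as pd

# 2017-01-01 is the first date in the parametrized cases above; it was a Sunday.
ts = pd.Timestamp("2017-01-01")
assert ts.dayofweek == 6          # pandas convention: Monday=0 ... Sunday=6
assert ts.day_name() == "Sunday"  # matches the "6-Sunday" test ID
```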

An example of a spark failure:

t = SparkDatabaseTable[table]
  name: udf
  schema:
    a : string
    b : int64
    c : float64
    key : string
df =    a  b    c key
0  a  1  4.0   a
1  b  2  5.0   a
2  c  3  6.0   b
fn = <function my_string_length_pandas at 0x7f9cc0f9b6a8>

    @pytest.mark.parametrize('fn', my_string_length_fns)
    def test_udf(t, df, fn):
        expr = fn(t.a)
    
        assert isinstance(expr, ir.ColumnExpr)
    
>       result = expr.execute()
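
For context, here is a minimal sketch of the kind of vectorized (pandas) UDF these tests exercise, written against the public PySpark API with Spark 3-style type hints rather than the ibis wrappers; all names here are illustrative. Failures in this path typically surface as Py4J errors because the JVM driver has to call back into the Python worker that runs the UDF:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.master("local[1]").getOrCreate()

@pandas_udf("long")
def my_string_length_pandas(s: pd.Series) -> pd.Series:
    # Vectorized: receives and returns a whole pandas Series per batch
    return s.str.len()

df = spark.createDataFrame(pd.DataFrame({"a": ["a", "b", "c"]}))
df.select(my_string_length_pandas(df.a).alias("length")).show()
```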
@datapythonista added the ci, omnisci, pyspark, and tests labels on May 7, 2020
@jreback (Contributor)

jreback commented May 7, 2020

cc @icexelloss @xmnlab

@vnlitvinov (Contributor)

@datapythonista can you give a link to something I can read on why the tests were turned off? I only see there were some "space concerns"... I'm interested in bringing back OmniSci testing; can I help with something here?

@datapythonista (Contributor, Author)

> @datapythonista can you give a link to something I can read on why the tests were turned off? I only see there were some "space concerns"... I'm interested in bringing back OmniSci testing; can I help with something here?

I don't have a particular link. The CI went red, and we had to merge 6 or 7 PRs to fix it, since there were several problems going on at the same time.

There were two different things affecting the omnisci tests: disk space and the conda solver.

For disk space, the approach was to split the tests into groups, so that not all Docker images were loaded in the same build. #2194 was the initial fix for that problem.

Then, when splitting the tests, we hit the second problem with omnisci. Resolving the environment with pymapd in it was taking too long (around 40 minutes), and the build containing omnisci was timing out, so omnisci was initially left out of the tests.

After getting the CI green, I had a look, and apparently requiring the latest version of pymapd (0.22 or higher) in the dependencies seems to fix the problem. The environment resolution time went from 40 minutes to about 30 seconds, I think.
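
For reference, the pin looks something like the following; this is a hypothetical environment.yml excerpt, not the actual ibis file. Constraining pymapd up front shrinks the version space the conda solver has to explore:

```yaml
name: ibis-ci        # hypothetical environment name
channels:
  - conda-forge
dependencies:
  # Requiring a recent pymapd keeps the solver from backtracking
  # through old releases and their conflicting pins.
  - pymapd >=0.22
```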

I re-added omnisci to the CI in #2205. That PR also includes some improvements to the build splits. It was better to create a new build group for omnisci, and having three groups was a bit messy, so a bit of refactoring was needed (mostly naming the builds and making the patterns that select which tests to run a bit clearer; only small things).

For some reason, #2205 is failing when pytest-xdist launches the workers and splits the tests across different processes. From some research on the error, it seems to be a bug in one of the pytest-xdist dependencies. But it looks like we're running the latest version of that library, where the bug is already fixed, so in our case the problem doesn't seem to be the same as the one in the Stack Overflow discussions of that error.
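
To make the failure mode concrete: pytest-xdist runs each worker as a separate process coordinated over execnet, so a bug in that layer shows up when the workers start. A minimal sketch of the invocation (the -n option comes from pytest-xdist; the path is just the spark test directory from this repo):

```python
import pytest

# Equivalent to running `pytest -n 2 ibis/spark/tests` from a shell:
# two worker processes are spawned and the collected tests are
# distributed between them.
pytest.main(["-n", "2", "ibis/spark/tests"])
```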

If you want to work on the #2205 branch and see if you can identify the problem, that would be great. Otherwise, I'll try to have a look myself later this week.

@datapythonista changed the title from "Restore spark and omnisci tests" to "Restore spark tests" on Jul 2, 2020
@cpcloud (Member)

cpcloud commented Nov 23, 2021

Spark tests are running again as of #2937. Closing.

@cpcloud closed this as completed on Nov 23, 2021