# Inconsistencies between models evaluated with Ray and Local backends

I am seeing unit tests fail because AUC values at model evaluation time are different between Ray and Local backends.

The test proceeds as follows:

1. A CSV of 100 samples with 10% NaNs in each column is generated as dataset `D`. 
2. A model `M_ray` is instantiated with a Ray backend.
3. `M_ray` is trained on `D`.
4. `M_ray` is evaluated on `D`, creating `metric_ray`.
5. `M_ray` is saved, then reloaded with a Local backend as `M_local`.
6. `M_local` is evaluated on `D`, creating `metric_local`.
7. `metric_ray` is compared with `metric_local`.

Sometimes the test passes and sometimes it fails. Before the contents of this notebook were produced, I confirmed the following:

1. The data looks identical between the two backends right before the forward pass.
2. The output probabilities look identical between the two backends. This differs from what I initially thought; I had not accounted for batches being split between two Ray processes.
3. Tests that were passing were processing 90 samples. This is expected because `drop_rows` should be removing 10% of samples (the ones with NaNs).
4. Tests that were failing were processing 100 samples. This is unexpected for the same reason as above.

This notebook explains why 100 samples were appearing in the failing tests.

In [1]:
!pip show pandas

Name: pandas
Version: 1.4.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /Users/geoffreyangus/repositories/predibase/ludwig/venv38/lib/python3.8/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: bayesmark, ludwig, mlflow, modin, seaborn, statsmodels, whylogs, xarray


In [2]:
!pip show dask

Name: dask
Version: 2022.1.0
Summary: Parallel PyData with Task Scheduling
Home-page: https://github.com/dask/dask/
Author: 
Author-email: 
License: BSD
Location: /Users/geoffreyangus/repositories/predibase/ludwig/venv38/lib/python3.8/site-packages
Requires: cloudpickle, fsspec, packaging, partd, pyyaml, toolz
Required-by: 


In [3]:
!python --version

Python 3.8.13


In [4]:
import pandas as pd
import dask.dataframe as dd

## Passing Test

Below is a dataset that resulted in a passing test, for reference. Note that the first NaN in the binary feature column appears at index 4.

In [5]:
pd.read_csv('/Users/geoffreyangus/Downloads/unittest2/dataset.csv').head(15)

Unnamed: 0,audio_D76FB,binary_E1D6C
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True


Pandas infers dtype `object` for the binary feature column.

In [6]:
pd.read_csv('/Users/geoffreyangus/Downloads/unittest2/dataset.csv')['binary_E1D6C'].dtype

dtype('O')

When Dask reads this CSV, it does so exactly as Pandas does:

In [7]:
dd.read_csv('/Users/geoffreyangus/Downloads/unittest2/dataset.csv').compute().head(15)

Unnamed: 0,audio_D76FB,binary_E1D6C
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True


Dask also infers dtype `object` for the binary feature column.

In [8]:
dd.read_csv('/Users/geoffreyangus/Downloads/unittest2/dataset.csv').compute()['binary_E1D6C'].dtype

dtype('O')

## Failing Test

Below is a dataset that resulted in a failing test. The issue seems to depend on where exactly the first NaN appears in the binary feature column. This explanation is supported by the [Dask docs](https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html):

> Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time.
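
As a minimal sketch of the inference problem described in the quote, the snippet below uses plain pandas with `nrows` as a stand-in for sample-based inference. The toy CSV and its column names are made up for illustration and are not the test dataset.

```python
import io

import pandas as pd

# Toy CSV: rows 0-9 of the binary column are complete, row 10 is missing.
rows = ["id,binary"]
rows += [f"{i},{'True' if i % 2 == 0 else 'False'}" for i in range(10)]
rows += ["10,", "11,True"]
csv_text = "\n".join(rows) + "\n"

# dtype inferred from the first 10 rows only (stand-in for a sample).
head = pd.read_csv(io.StringIO(csv_text), nrows=10)
# dtype inferred from the whole file.
full = pd.read_csv(io.StringIO(csv_text))

print(head["binary"].dtype)  # bool
print(full["binary"].dtype)  # object
```

The two reads disagree on the dtype of the same column, which is the kind of mismatch the docs warn about.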

First, inspect the dataframe instantiated using `pd.read_csv`. The binary feature is of type `object` because of the presence of NaNs. Note that in this failing test, the first NaN of the binary column appears in the row with index `10`.

In [9]:
pd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv').head(15)

Unnamed: 0,audio_58299,binary_EB37B
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False


In [10]:
pd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv')['binary_EB37B'].dtype

dtype('O')

When we read the same dataset using Dask, the NaN at index 10 disappears and is replaced with a `True`. Further inspection shows that Dask inferred the dtype of this column to be `bool`, which does not allow the presence of NaNs.

In [11]:
dd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv').compute().head(15)

Unnamed: 0,audio_58299,binary_EB37B
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False


In [12]:
dd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv').compute()['binary_EB37B'].dtype

dtype('bool')
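
One plausible mechanism for the silent replacement is that each block is coerced to the dtype inferred from the sample. That is an assumption about Dask internals, not something confirmed here. The pandas-only sketch below, on a made-up column, shows that such a cast turns NaN into `True` without raising, and that a later row drop then has nothing to remove.

```python
import numpy as np
import pandas as pd

# Toy column as pandas parses it from the full file: object dtype, NaN intact.
col = pd.Series([True, False, np.nan, True], dtype=object)

# Coerce to the dtype that a NaN-free sample would have suggested.
as_bool = col.astype(bool)

print(as_bool.tolist())       # [True, False, True, True]: the NaN became True
print(len(col.dropna()))      # 3 rows survive dropping missing values
print(len(as_bool.dropna()))  # 4 rows survive: nothing is left to drop
```

This would be consistent with failing tests seeing 100 samples instead of 90.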

## Possible fixes

From the same [Dask docs](https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html) page linked above:

> To fix this, you have a few options: (1) Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution. (2) Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats. (3) Increase the size of the sample using the sample keyword.

Item (1) seems promising, but we don't know anything about the features at read time. Item (2) seems not to work for boolean dtypes:

In [13]:
# binary column is still cast to bool and the NaN at index 10 disappears
dd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv', assume_missing=True).compute().head(15)

Unnamed: 0,audio_58299,binary_EB37B
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False


Item (3) seems not to work despite using an obscenely large number of sampling bytes for `sample`:

In [14]:
# binary column is still cast to bool and the NaN at index 10 disappears
dd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv', sample=100000000000000).compute().head(15)

Unnamed: 0,audio_58299,binary_EB37B
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
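
A possible reason why raising `sample` has no effect: `sample` is a byte count, and if dtype inference only ever inspects a fixed number of leading rows, more bytes would not change the result. The Dask `read_csv` documentation lists a `sample_rows` parameter for that row count. Whether it is available in Dask 2022.1.0 and whether it resolves this case are assumptions that have not been verified here. The sketch below only uses pandas `nrows` on a made-up CSV to show how the inferred dtype flips once the inspected rows include the first NaN.

```python
import io

import pandas as pd

# Toy CSV: the first missing value in the binary column is at row index 10.
rows = ["id,binary"] + [f"{i},True" for i in range(10)] + ["10,", "11,False"]
csv_text = "\n".join(rows) + "\n"


def inferred_dtype(nrows: int) -> str:
    """Return the dtype pandas infers for `binary` from the first `nrows` rows."""
    frame = pd.read_csv(io.StringIO(csv_text), nrows=nrows)
    return str(frame["binary"].dtype)


# Inspect progressively more leading rows.
dtype_by_rows = {n: inferred_dtype(n) for n in (5, 10, 11, 12)}
print(dtype_by_rows)
# {5: 'bool', 10: 'bool', 11: 'object', 12: 'object'}
```

Under this reading, the number of inspected rows matters rather than the number of sampled bytes.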


The most straightforward fix seems to be assigning all columns to have dtype `object` at read time. This would preserve all NaNs up until they are handled by the `handle_missing_values` function. My concern is that this is a big upstream change that may lead to unforeseen consequences on downstream preprocessing steps.

In [15]:
# All columns are upcast to the most general dtype (object), so NaNs are preserved and dealt with later on in handle_missing_values
dd.read_csv('/Users/geoffreyangus/Downloads/unittest/dataset.csv', dtype=object).compute().head(15)

Unnamed: 0,audio_58299,binary_EB37B
0,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
1,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
2,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
3,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
4,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
5,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
6,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
7,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
8,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,True
9,/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn...,False
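
One consequence worth keeping in mind, sketched here with pandas on a made-up CSV (I would expect Dask to behave the same way since it delegates parsing to pandas, but that is an assumption): with `dtype=object` the NaNs survive, but every other value arrives as a string rather than as a parsed bool or number.

```python
import io

import pandas as pd

csv_text = "id,binary\n0,True\n1,False\n2,\n3,True\n"

# Read every column as object, as in the proposed fix.
df = pd.read_csv(io.StringIO(csv_text), dtype=object)

print(df["binary"].isna().sum())  # 1: the missing value is preserved
print(df["binary"].tolist())      # ['True', 'False', nan, 'True']
print(df["id"].tolist())          # ['0', '1', '2', '3']
```

Downstream steps that expect parsed values would therefore need to do the parsing themselves.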


## Additional Thoughts

If we were to make the above change, we would have to change the order of preprocessing. Right now, we do the following:

```
cast_columns => handle_missing_values => add_feature_data
```

Calling `cast_columns` before anything else may introduce similar errors having to do with the presence of NaNs in the raw dataframe. If we were to read all columns as `object` columns, then we likely want the following order:

```
handle_missing_values => cast_columns => add_feature_data
```

This immediately removes NaNs, then casts the columns in a way that is expected for `add_feature_data`.
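
A minimal sketch of the proposed order, assuming all columns were read as `object`. The two functions are hypothetical stand-ins that only borrow the names of the real preprocessing steps; they are not Ludwig's implementations, and the CSV is made up.

```python
import io

import pandas as pd


def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: drop every row that contains a missing value."""
    return df.dropna().reset_index(drop=True)


def cast_columns(df: pd.DataFrame, bool_columns: list) -> pd.DataFrame:
    """Hypothetical stand-in: parse string booleans once no NaNs remain."""
    out = df.copy()
    for name in bool_columns:
        mapped = out[name].map({"True": True, "False": False})
        # Fail loudly instead of letting an unparsed value turn into True.
        if mapped.isna().any():
            raise ValueError(f"unparseable boolean values in column {name!r}")
        out[name] = mapped.astype(bool)
    return out


csv_text = "id,binary\n0,True\n1,False\n2,\n3,True\n"
raw = pd.read_csv(io.StringIO(csv_text), dtype=object)

# Proposed order: remove NaNs first, then cast.
clean = cast_columns(handle_missing_values(raw), bool_columns=["binary"])

print(len(raw), len(clean))       # 4 3
print(clean["binary"].dtype)      # bool
```

Casting after the missing values are handled means the cast never sees a NaN, so it cannot silently convert one.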

This is just one thought and I am sure I am missing something. Please let me know what you think.