Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert DataFrame columns with type RangeIndex to strings #932

Merged
merged 6 commits into from
Nov 18, 2022

Conversation

hoxbro
Copy link
Member

@hoxbro hoxbro commented Oct 12, 2022

This should be seen as a small (and definitely not exhaustive) step toward working with non-strings columns and hv.Dimension.

Related to holoviz/holoviews#5353, and with the upcoming change from DeprecationWarning to FutureWarning in holoviz/holoviews#5472 will now give users warnings in their existing notebooks.

Most of these warnings will properly come from an initialized pd.DataFrame or pd.Series without defining the column names. This PR checks if the column is pd.RangeIndex and converts it to strings. An example is given below, with the upcoming change to FutureWarning and removing lru_cache to see all warnings for more information, see holoviz/holoviews#5472.

Before After
before after

@hoxbro hoxbro changed the title Convert DataFrame columns with RangeIndex to strings Convert DataFrame columns with type RangeIndex to strings Oct 12, 2022
@@ -671,6 +671,9 @@ def _process_data(self, kind, data, x, y, by, groupby, row, col,
data = data.to_frame()
if is_intake(data):
data = process_intake(data, use_dask or persist)
if isinstance(getattr(data, "columns", None), pd.RangeIndex):
data = data.rename(columns=str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hoxbro the thing I was not sure with this change is that doesn't this line actually create a copy of the data? self.source_data is supposed to be a reference to the original data.

The approach I've implemented locally is to temporarily replace the columns of the source data with columns that contain string objects, in a try/finally block.

The way I was going may just be overkill!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It makes a copy. Looking at the previous if-statements, I would not say self.source_data refers to the original data, e.g., a series is converted to a dataframe. I see self.data_source as a reference to the original data, though I have not taken a thorough look at the rest of the codebase to see if this is the case.

hvplot/hvplot/converter.py

Lines 668 to 677 in eed0e87

self.data_source = data
self.is_series = is_series(data)
if self.is_series:
data = data.to_frame()
if is_intake(data):
data = process_intake(data, use_dask or persist)
if isinstance(getattr(data, "columns", None), pd.RangeIndex):
data = data.rename(columns=str)
self.source_data = data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.data_source is unused. We don't get any warning for that as flake is set up to basically ignore everything. Something else to fix (I'll ask Philipp if I can run black once).

@maximlt
Copy link
Member

maximlt commented Nov 3, 2022

Alright @philippjfr we'll need your input here to see how to best handle that :)

Since the latest release of HoloViews users get a warning when they make a plot from a DataFrame that has non-string columns. This is what we're trying to deal with here.

The converter makes a reference source_data to the "source" data early on instantiation. That data is the source except when it's a Series:

hvplot/hvplot/converter.py

Lines 664 to 679 in 4e55da4

def _process_data(self, kind, data, x, y, by, groupby, row, col,
use_dask, persist, backlog, label, group_label,
value_label, hover_cols, attr_labels, transforms,
stream, kwds):
gridded = kind in self._gridded_types
gridded_data = False
da = None
# Validate DataSource
self.data_source = data
self.is_series = is_series(data)
if self.is_series:
data = data.to_frame()
if is_intake(data):
data = process_intake(data, use_dask or persist)
self.source_data = data

This reference is used only once later in __call__. My first question for you is what is the purpose of this part of the code below? As far as I can see its purpose is to set _dataset on the HoloViews object returned by the method (e.g. line) call. Using git blame I saw references to making linked selections work. Anecdotally I've run the test suite removing obj._dataset = dataset and there were no errors, so it'd be nice to test what this code is used for.

hvplot/hvplot/converter.py

Lines 1215 to 1234 in 4e55da4

data = self.source_data
if self.datatype in ('geopandas', 'spatialpandas'):
columns = [c for c in data.columns if c != 'geometry']
shape_dims = ['Longitude', 'Latitude'] if self.geo else ['x', 'y']
dataset = Dataset(data, kdims=shape_dims+columns)
elif self.datatype == 'xarray':
import xarray as xr
if isinstance(data, xr.Dataset):
dataset = Dataset(data, self.indexes)
else:
name = data.name or self.label or self.value_label
dataset = Dataset(data, self.indexes, name)
else:
try:
dataset = Dataset(data, self.indexes)
except Exception:
dataset = Dataset(data)
dataset = dataset.redim(**self._redim)
obj = method(x, y)
obj._dataset = dataset

The warning is raised when the Dataset is instantiated in this code. The goal is then to give it a transformed DataFrame that doesn't have string columns.

Note: the converter processes the input data in many ways, and interestingly that includes converting its non-string columns to string with the method _transform_columnar_data. This is however unrelated to the problem at hand, as that processed data - self.data - isn't involved in the code above.

Simon's suggesting to make a change early on, before setting source_data. If columns is a pandas.RangeIndex then we use .rename to convert its columns to strings. columns is a RangeIndex when you create the dataframe this way pd.DataFrame(np.random.rand(2, 2)), which should cover most cases. I think that alternatively we could use _transform_columnar_data here, which would cover all the cases. My notes and questions on this:

  • this is making a copy of the source data before setting source_data. I don't know to which extent it's important for the HoloViews Dataset to keep a reference to the original data (same id)?
  • if that's the right approach, we'll need to add a comment as this code is pretty far from the place where the issue lies
  • if we replace it by using _transform_columnar_data, I think this call to that same method that happens later should be removed:

self.data = self._transform_columnar_data(self.data)

But not this one that seems to be needed for streaming dataframes:

data = self._transform_columnar_data(data)

(there are no tests for streaming data so we need to be extra careful here, maybe keeping that for another PR but if so we should definitely open an issue)

An alternative to Simon's suggestion would be to temporarily modify in place the columns of the source data. The goal is to try to avoid making a copy. This is relevant only if making a copy is a bad idea. That would look something like the following:

source_cols = None
if hasattr(self.source_data, 'columns') and any(isinstance(col, str) for col in self.source_data.columns):
    source_cols = data.columns
    self.source_data.columns = [str(col) for col in self.source_data.columns]  # This doesn't make a copy?
try:
    ...
    dataset = hv.Dataset(self.source_data)
    ...
    obj = method(x, y)
    obj._dataset = dataset
finally:
    if source_cols:
        self.source_data.columns = source_cols

Let us know how you would approach that 🙏

@maximlt maximlt merged commit 751911a into master Nov 18, 2022
@maximlt maximlt deleted the rangeindex_to_str branch November 18, 2022 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants