Handle lists and arrays as x, y in density_scatter and siblings #81

Open · janosh opened this issue May 24, 2023 · 7 comments
Labels: bug · good first issue

Comments

@janosh (Owner) commented on May 24, 2023

This simple example

import numpy as np

from pymatviz import density_scatter


arr = np.arange(5)
lst = list(range(5))

ax = density_scatter(x=arr, y=arr)
ax = density_scatter(x=lst, y=lst)

embarrassingly raises

File ~/dev/pymatviz/pymatviz/parity.py:43, in hist_density(x, y, df, sort, bins)
     39 x, y = df_to_arrays(df, x, y)
     41 data, x_e, y_e = np.histogram2d(x, y, bins=bins)
---> 43 zs = scipy.interpolate.interpn(
     44     (0.5 * (x_e[1:] + x_e[:-1]), 0.5 * (y_e[1:] + y_e[:-1])),
     45     data,
     46     np.vstack([x, y]).T,
     47     method="splinef2d",
     48     bounds_error=False,
     49 )
...
    611 result[idx_valid] = interp.ev(xi[idx_valid, 0], xi[idx_valid, 1])
--> 612 result[np.logical_not(idx_valid)] = fill_value
    614 return result.reshape(xi_shape[:-1])

ValueError: cannot convert float NaN to integer
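
For context, a rough sketch of one possible fix (only an assumption about the cause, not necessarily how it should be solved): the inputs arrive as integer lists/arrays, so scipy cannot write its NaN fill value into the integer result array it builds from them. Casting x and y to float and passing an explicit fill_value sidesteps both issues:

import numpy as np
import scipy.interpolate

# sketch only: coerce x, y to float ndarrays before the histogram/interpolation step
x = np.asarray(range(5), dtype=float)
y = np.asarray(range(5), dtype=float)
data, x_e, y_e = np.histogram2d(x, y, bins=100)
zs = scipy.interpolate.interpn(
    (0.5 * (x_e[1:] + x_e[:-1]), 0.5 * (y_e[1:] + y_e[:-1])),
    data,
    np.vstack([x, y]).T,
    method="splinef2d",
    bounds_error=False,
    fill_value=0.0,  # avoid NaNs for points outside the interpolation grid
)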
janosh added the bug and good first issue labels on May 24, 2023

DanielYang59 mentioned this issue on Feb 2, 2024
@DanielYang59 (Contributor) commented

I'm working on this.

@DanielYang59 (Contributor) commented

Hi @janosh, sorry for the long silence.

I'm thinking about cleaning up the code a little bit (starting with density_scatter and siblings, and then others if I have the energy).

What I want to do (in terms of data handling) is to keep allowing users to pass various types (ArrayLike, pd.DataFrame and more) into the plotter functions, but internally convert them to a single data type (pd.DataFrame maybe?), to get rid of the data-processing code inside each plotter and avoid incompatibility issues like this one.

This should not be hard (a simple utility function should do the job) nor a breaking change. May I have your permission to implement this change? And do you have any suggestions on this matter? Thanks!
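
For concreteness, a minimal sketch of the kind of normalization helper I have in mind (the name, signature and behavior are only illustrative):

import numpy as np
import pandas as pd


def normalize_to_df(x, y, df=None, x_label="x", y_label="y"):
    """Illustrative only: coerce x/y given as column names, lists, Series or
    arrays into a single pd.DataFrame with float columns and no NaN rows."""
    if df is not None:
        # x and y are column names in df
        out = df[[x, y]].rename(columns={x: x_label, y: y_label})
    else:
        out = pd.DataFrame({x_label: np.asarray(x), y_label: np.asarray(y)})
    return out.astype(float).dropna()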

@janosh (Owner, Author) commented on Feb 16, 2024

> Hi @janosh, sorry for the long silence.

no worries at all.

> May I have your permission to implement this change?

of course, that would be much appreciated! 🙏

> And do you have any suggestions on this matter?

yes! have a look at

pymatviz/pymatviz/utils.py, lines 486 to 496 in a2d29e4:

def df_to_arrays(
    df: pd.DataFrame | None,
    *args: str | Sequence[str] | ArrayLike,
    strict: bool = True,
) -> list[ArrayLike | dict[str, ArrayLike]]:
    """If df is None, this is a no-op: args are returned as-is. If df is a
    dataframe, all following args are used as column names and the column data
    returned as arrays (after dropping rows with NaNs in any column).

    Args:
        df (pd.DataFrame | None): Optional pandas DataFrame.

which is used in density_scatter and many other plot functions to standardize input data:

xs, ys = df_to_arrays(df, x, y)
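
for illustration, a rough usage sketch based on the docstring above (column names and values are made up):

import numpy as np
import pandas as pd

from pymatviz.utils import df_to_arrays

df = pd.DataFrame({"actual": [1, 2, 3], "pred": [1.1, 2.2, 2.9]})

# with a dataframe, the remaining args are treated as column names
xs, ys = df_to_arrays(df, "actual", "pred")

# with df=None, the args are passed through unchanged
xs, ys = df_to_arrays(None, np.arange(5), np.arange(5))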

@DanielYang59 (Contributor) commented on Feb 16, 2024

> yes! have a look at df_to_arrays in pymatviz/pymatviz/utils.py (lines 486 to 496 in a2d29e4)

Thanks a lot @janosh. Maybe we need to add a more general "to desired data type" utility function (not just pd.DataFrame to arrays). In that case, which do you prefer: pd.DataFrame, np.ndarray, or both? (I personally prefer np.ndarray because it's more explicit and flat and has efficient data-processing support.) What do you think?
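
For comparison, the np.ndarray variant would mostly be a thin wrapper around np.asarray (again just a sketch, names are made up):

import numpy as np
import pandas as pd


def normalize_to_arrays(*cols):
    """Illustrative only: coerce lists, pd.Series or integer arrays into
    float np.ndarrays of equal shape."""
    arrays = [np.asarray(col, dtype=float) for col in cols]
    if len({arr.shape for arr in arrays}) != 1:
        raise ValueError("all inputs must have the same shape")
    return arrays


xs, ys = normalize_to_arrays(list(range(5)), pd.Series(range(5)))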

On top of all these technical things, I noticed that despite all these beautiful plots, pymatviz seems to lack a documentation site (maybe I failed to find it?). This API page seems more like an aggregation of docstrings to me? If so, I would be more than happy to help build one.

@janosh (Owner, Author) commented on Feb 17, 2024

> This API page seems more like an aggregation of docstrings to me? If so, I would be more than happy to help build one.

yes, that has been on my todo list for a long time. the current docs are terrible!

would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure. we could essentially copy all the cells in _generate_assets.py into separate files and write some docs/explanation around them.

i'd prefer not to use Jupyter notebooks for this as i find them more annoying to work with than Python scripts. luckily, VS Code's Interactive Window can run Python scripts just like notebooks and also supports exporting cells and their output to HTML. we just need to find a way to script this functionality so that the HTML is updated whenever the source files change.

[screenshot from 2024-02-17 10:37]
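
one possible way to script that step outside VS Code (just a sketch, not the approach i'm committing to; file paths are made up and this assumes the demo scripts use jupytext's py:percent cell format):

import jupytext
from nbclient import NotebookClient
from nbconvert import HTMLExporter

# parse the percent-format script into a notebook, execute it, render to HTML
nb = jupytext.read("examples/density_scatter_demo.py")
NotebookClient(nb).execute()
html, _resources = HTMLExporter().from_notebook_node(nb)

with open("site/density_scatter_demo.html", "w") as file:
    file.write(html)

(jupytext can also go the other way and convert existing notebooks to percent-format scripts.)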

> Maybe we need to add a more general "to desired data type" utility function (not just pd.DataFrame to arrays)

i'm open to that. i'd prefer dataframes over arrays though as they have a more powerful API

@DanielYang59 (Contributor) commented on Feb 17, 2024

> would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure.

I'm thinking of just building a separate docs site with Sphinx templates like "Read the Docs". I have some experience with that and will start working on it soon (I would open a separate PR, aside from the data-preprocessing topic) 😄.

> i'd prefer not to use Jupyter notebooks for this

I also find that very long Jupyter-style files (like _generate_assets.py) can be hard to navigate. It would be much better to have separate asset/example generation scripts for each section (histogram/ptable and such). I will push a new PR once I finish the draft so we can iterate together. Thanks!

@janosh (Owner, Author) commented on Feb 17, 2024

i don't think we need to start from scratch and i'm not a huge fan of sphinx tbh.

there are some preliminary example notebooks already up on the current docs page. e.g. see https://pymatviz.janosh.dev/notebooks/mp_bimodal_e_form.

you can get a list of them by hitting cmd + k to bring up the nav palette on the current docs. i think just converting those notebooks to python scripts and adding more of them would be great!

[screenshot from 2024-02-17 11:07]
