
matplotlib can not handle pandas dataframe correctly when the label of the columns/index is strings but the actual data are float. #10344

Closed
oulongwen opened this issue Jan 29, 2018 · 15 comments

@oulongwen

commented Jan 29, 2018

Bug report

Bug summary

When using matplotlib.pyplot.scatter to make a scatter plot from a pandas.DataFrame, matplotlib does not generate ticks correctly if the labels of the dataframe's columns/index are strings.
Code for reproduction

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


a = pd.DataFrame(np.random.rand(5, 1), columns=['0'])
b = np.random.randn(5, 1)
plt.scatter(a, b)
plt.show()

Actual outcome

figure1

Expected outcome
figure2

Matplotlib version

  • Operating system: Ubuntu 17.10
  • Matplotlib version: 2.1.2
  • Matplotlib backend (print(matplotlib.get_backend())): module://ipykernel.pylab.backend_inline
  • Python version: 3.6.3
  • Jupyter version: I tried this with and without Jupyter notebook; the issue persists either way.

I tried this on two systems. The first one runs ubuntu 17.10 with matplotlib installed by anaconda; the second one runs Arch Linux with matplotlib installed with pip. They both have this issue.
I also tried this on Windows 10 and macOS High Sierra; neither of them has this issue.

@WeatherGod

Member

commented Jan 29, 2018

@afvincent

Contributor

commented Jan 29, 2018

Is it really a bug? Looking at the documentation, it is not obvious to me, even though it might have been working until recently :/. See for example types-of-inputs-to-plotting-functions. If I am correct, we are rather documenting the following to “natively” plot from pandas.DataFrame instances plotting-with-keyword-strings.

BTW thank you @oulongwen for the clean code snippet: I can indeed reproduce this behavior on the master branch with a Linux system (Fedora 27) and Python 3.6 (from Anaconda).

@oulongwen

Author

commented Jan 29, 2018

@WeatherGod I think you are right. The matplotlib version on the Windows machine is v2.0.0. I can't check the version on the macOS machine right now, but I think it is also an older version. Both Linux machines have matplotlib v2.1.2.

@jklymak

Contributor

commented Jan 29, 2018

As of 2.1, strings are sent to the categorical support, so they are labeled with the string you supply and spaced an integer distance apart. For example, try

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
# Every x value is a string, so each one becomes a category at an integer position
ax.plot(['a', 'd', 'b', 'Boooo', '1.23456'], np.random.rand(5))

If you want the x values to be floats, cast them as such...
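To illustrate the cast suggested above, here is a minimal sketch. The sample strings and the variable names are made up for illustration, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical x values that arrive as strings (e.g. read from a text file)
x_as_strings = ['0.5', '1.23456', '2.0', '3.75', '10.0']

# Cast explicitly so matplotlib sees numbers rather than categories
x_as_floats = np.array([float(s) for s in x_as_strings])
y = np.random.rand(5)

fig, ax = plt.subplots()
ax.plot(x_as_floats, y)  # x axis is now numeric, spaced by value
```

With the cast in place the points are positioned by their numeric value instead of being spaced one unit apart.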

@jklymak jklymak added the categorical label Jan 29, 2018

@oulongwen

Author

commented Jan 29, 2018

@jklymak If you look at the code, the inherent data of the dataframe are actually floats, only the label of the column is string and matplotlib does not seem to handle it correctly.

@jklymak

Contributor

commented Jan 29, 2018

Can you reproduce w/o pandas? It's definitely being sent to the categorical handler.

@oulongwen

Author

commented Jan 29, 2018

@afvincent Thanks for your suggestions. Converting the dataframe to a np.array prior to making the plot would work. I was hoping that matplotlib could handle pandas dataframes directly, at least in most scenarios (especially since it worked perfectly before). But I guess the best practice is to convert the dataframe to a np.array before making the plot.
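For reference, a sketch of that workaround applied to the reproduction code from the report. It assumes `np.asarray` is an acceptable way to do the conversion; the `Agg` backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.rand(5, 1), columns=['0'])
b = np.random.randn(5, 1)

# Convert first: the array holds only the float data, not the '0' column label
x = np.asarray(a)

fig, ax = plt.subplots()
points = ax.scatter(x, b)
```

The column label is dropped by the conversion, so nothing string-like reaches the plotting call.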

@afvincent

Contributor

commented Jan 29, 2018

From a quick git bisect, the behavior started to change after:

b1848420bf80cb40211f37fb32f3a2d07e9b5f7f is the first bad commit
commit b1848420bf80cb40211f37fb32f3a2d07e9b5f7f
Author: hannah <story645@gmail.com>
Date:   Thu Jun 16 18:19:18 2016 -0400

    Basic support for plotting lists of strings/categorical data. Support for updating ticks/animation in progress/buggy, 'especially for scatter.

But again, I am not sure that this behavior was not in the first place simply working “by accident”.

@tacaswell

Member

commented Jan 29, 2018

This is the string-categorical logic triggering again; however, I would not expect this to work in general (as a data frame is not a '1D array like' thing). If you want to plot the column, do

plt.scatter(a['0'], b)   

That this ever worked is a bit of happenstance. The categorical code is being triggered because DataFrames give slightly inconsistent results when iterating through them vs. using np.array, e.g.

In [9]: next(iter(a))
Out[9]: '0'

In [10]: np.array(a)
Out[10]: 
array([[ 0.82686393],
       [ 0.89649743],
       [ 0.00660374],
       [ 0.25387538],
       [ 0.15848726]])

I am inclined to close this as no-action: supporting full data frames as input to scatter is not something that I think we should do, because it is not clear what heuristics should be used to deal with more than one column (particularly if they are of mixed types).

@afvincent

Contributor

commented Jan 29, 2018

@tacaswell Rather plt.scatter(a['0'], b) I guess :).

@tacaswell

Member

commented Jan 29, 2018

@oulongwen we support pd.Series as most inputs directly and pd.DataFrame in the data=df kwarg (ex plt.scatter('0', b, data=a) ).

@oulongwen

Author

commented Jan 29, 2018

@tacaswell Yes, I agree it is not common to send a dataframe to the scatter function. My use case is that I am trying to make a parity plot for a machine learning project where the actual target is stored in another file. After reading it with pandas it is in a dataframe, which is sent to scatter to make the parity plot. I can't think of any use case where a dataframe with more than 2 columns would be used as an argument of the scatter function.

@oulongwen

Author

commented Jan 29, 2018

Thank you @tacaswell. I think that would do the trick.

@tacaswell

Member

commented Jan 29, 2018

Closing for the reasons discussed above.

@tacaswell tacaswell closed this Jan 29, 2018

@tacaswell

Member

commented Jan 29, 2018

Thanks for reporting this, @oulongwen. Hopefully the workarounds are enough for you to keep working!
