Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX copying Pandas DataFrames for Python bindings #1711

Merged
merged 14 commits into from Feb 21, 2019

Conversation

Projects
None yet
2 participants
@Yashwants19
Copy link
Contributor

commented Feb 7, 2019

Address #1706

@mlpack-bot

This comment has been minimized.

Copy link

commented Feb 7, 2019

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:

  • All code should follow the style guide
  • Documentation added for any new functionality
  • Tests added for any new functionality
  • Headers and license information added to the top of any new code files
  • HISTORY.md updated if the changes are big or user-facing
  • All CI checks should be passing

Thank you again for your contributions! 👍

@rcurtin
Copy link
Member

left a comment

Hi @Yashwants19, thanks for taking a look into this! I have a couple comments that I hope are helpful. Let me know if I can clarify them, but I suspect that we will have to dig kind of deep into Pandas to get this working right. :)

if copy: # Copy the matrix if required.
return np.array(x, copy=True, dtype=dtype, order='C'), True
else:
return np.array(x, dtype=dtype), False

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 9, 2019

Member

What happens if we pass a Pandas Dataframe where the order is F instead of C? Try this code:

import pandas as pd
import numpy as np
d = pd.read_csv("some_data.csv")
x = np.array(d)
x.flags

You can see that F_CONTIGUOUS is true, but we need C_CONTIGUOUS to be true when we pass directly to mlpack. So I think we need a way to handle this.

In addition I think this still copies the Pandas dataframe, since when I look at x.flags I see

  C_CONTIGUOUS : False                                                                                                                                          
  F_CONTIGUOUS : True                                                                                                                                           
  OWNDATA : True                                                                                                                                                
  WRITEABLE : True                                                                                                                                              
  ALIGNED : True                                                                                                                                                
  WRITEBACKIFCOPY : False                                                                                                                                       
  UPDATEIFCOPY : False   

and OWNDATA : True implies that the matrix was copied.

Also, we should probably add a test case to src/mlpack/bindings/python/tests/test_python_binding.py. Take a look at the test testNumpyMatrix(self), and we can adapt this to use a Pandas dataframe. That would be a good way to ensure this works.

Show resolved Hide resolved src/mlpack/bindings/python/mlpack/matrix_utils.py Outdated

Yashwants19 added some commits Feb 11, 2019

@Yashwants19

This comment has been minimized.

Copy link
Contributor Author

commented Feb 15, 2019

Hi @rcurtin please review this.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Feb 15, 2019

Hi @Yashwants19, don't worry, it is on my todo list. I have been busy and have not had a chance yet. I get emails for every bit of activity on the project, so don't worry, I did see your updates on this. :)

@rcurtin
Copy link
Member

left a comment

Hi @Yashwants19, thanks for taking the time to look into this further. I know I am being a little bit picky but I'm really concerned that we test this right, otherwise we will be getting bug reports soon. :) If you can add the tests I suggested and they work, I think this is good to merge. (The tests I suggested should be pretty easy to write---you can just basically adapt the ones we already have slightly.)

int_in=12,
double_in=4.0,
matrix_in=x,
copy_all_inputs=True)

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 15, 2019

Member

Looks like the spacing is off here.

@@ -139,6 +139,52 @@ def testNumpyMatrixForceCopy(self):

for j in range(100):
self.assertEqual(2 * x[j, 2], output['matrix_out'][j, 2])

def testPandasMatrix(self):

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 15, 2019

Member

Can you also add a test for a Pandas series object please?

x = pd.DataFrame(x)
return np.array(x, copy=True, dtype=dtype, order='C'), True
if isinstance(x, pd.core.series.Series) or isinstance(x, pd.DataFrame):
y = x.to_numpy()

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 15, 2019

Member

This only exists in Pandas 0.24 or newer, so we will need to handle cases where the Pandas version is older too.

This comment has been minimized.

Copy link
@Yashwants19

Yashwants19 Feb 16, 2019

Author Contributor

We can use x.values

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 16, 2019

Member

I think you are right, I did some playing with it and x.values does not seem to copy the matrix unless it is needed.

@@ -52,10 +52,19 @@ def to_matrix(x, dtype=np.double, copy=False):
return x.copy("C"), True
else:
return x, False
elif (isinstance(x, np.ndarray) and x.dtype == dtype and x.flags.f_contiguous):
if copy: # Copy the matrix if required.
return np.ndarray(x.shape,buffer=x.flatten(),dtype=dtype,order = 'C').copy("C"),True

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 15, 2019

Member

Doesn't this implicitly transpose the matrix? I think that we should test this by making another test for test_python_binding.py where we pass a numpy matrix with F_CONTIGUOUS set. (You can just make a random matrix with order='F'.)

This comment has been minimized.

Copy link
@Yashwants19

Yashwants19 Feb 17, 2019

Author Contributor

No I think it doesn't transpose the matrix?

df = pd.read_csv('rann_test_q_3_100.csv',header=None)
a = pd.Series(np.random.rand(100))
print(df.values.flags)

it gives

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

when I compare df.values and to_matrix(df.values)[0]
as (df.values == to_matrix(df.values)[0]).all() it results True

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 18, 2019

Member

Huh, this is very strange to me. Take a look at this Python session:

>>> import numpy as np
>>> x = np.random.rand(5, 5)
>>> x = np.array(x, order='F', copy=True)
>>> x
array([[0.4374804 , 0.30758627, 0.97045243, 0.35728207, 0.95584896],
       [0.41457004, 0.82433351, 0.44260369, 0.80797501, 0.78434397],
       [0.10598124, 0.98438937, 0.13709334, 0.24600426, 0.20042625],
       [0.73949954, 0.54399577, 0.64043765, 0.52684565, 0.89393217],
       [0.61319107, 0.67340923, 0.37922999, 0.32466414, 0.91222045]])
>>> x.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
>>> y = np.ndarray(x.shape, buffer=x.flatten(), order='C')
>>> y
array([[0.4374804 , 0.30758627, 0.97045243, 0.35728207, 0.95584896],
       [0.41457004, 0.82433351, 0.44260369, 0.80797501, 0.78434397],
       [0.10598124, 0.98438937, 0.13709334, 0.24600426, 0.20042625],
       [0.73949954, 0.54399577, 0.64043765, 0.52684565, 0.89393217],
       [0.61319107, 0.67340923, 0.37922999, 0.32466414, 0.91222045]])
>>> y.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
>>> z = np.ndarray(y.shape, buffer=y.flatten(), order='F')
>>> z
array([[0.4374804 , 0.41457004, 0.10598124, 0.73949954, 0.61319107],
       [0.30758627, 0.82433351, 0.98438937, 0.54399577, 0.67340923],
       [0.97045243, 0.44260369, 0.13709334, 0.64043765, 0.37922999],
       [0.35728207, 0.80797501, 0.24600426, 0.52684565, 0.32466414],
       [0.95584896, 0.78434397, 0.20042625, 0.89393217, 0.91222045]])
>>> z.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Basically I created a matrix x that is random with F_CONTIGUOUS. Then I used np.ndarray() with buffer=x.flatten() to convert to order C_CONTIGUOUS. But somehow this does not copy the matrix or transpose it (which does not make sense to me---if it is F_CONTIGUOUS it is column-major; if it is C_CONTIGUOUS it is row-major; converting between the two by flattening should result in different element orderings). But then, when I do the same thing to convert back to F_CONTIGUOUS, now the matrix is transposed (but not copied).

So, that's very strange to me, but it doesn't matter, since you have written a test case with F_CONTIGUOUS matrices, so we know it will work either way. 👍

Yashwants19 added some commits Feb 17, 2019

@rcurtin
Copy link
Member

left a comment

@Yashwants19 this is looking pretty close to done to me---thanks again. I left just a few comments; let me know what you think. Once these are addressed I think this is ready for merge. 👍

if copy: # Copy the matrix if required.
return np.ndarray(y.shape,buffer=y.flatten(),dtype=dtype,order = 'C').copy("C"),True
else:
return np.ndarray(y.shape,buffer=y.flatten(),dtype=dtype, order = 'C'),False

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 18, 2019

Member

What about the else for this if block? We should probably fall back on the original return np.array(x, copy=True, dtype=dtype, order='C'), True that was there.

The matrix with F_CONTIGUOUS set we pass in, we should get back with the third
dimension doubled and the fifth forgotten.
"""
x = np.array(np.random.rand(100, 5),order = 'F');

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 18, 2019

Member

Can you fix the spacing here (and in general in the PR)? It should be, e.g.,

x = np.array(np.random.rand(100, 5), order='F')

The keys here are that there are spaces between each argument, but not spaces between keyword arguments.

output = test_python_binding(string_in='hello',
int_in=12,
double_in=4.0,
col_in=z)

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 18, 2019

Member

We should test a Pandas series also to be passed as matrix_in. But that may require the fix for one-dimensional objects which is another part of #1710. If so, we can wait to add that.

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 20, 2019

Member

If you can let me know what you think here before we merge, I'd appreciate that. 👍

This comment has been minimized.

Copy link
@Yashwants19

Yashwants19 Feb 20, 2019

Author Contributor

hi @rcurtin when we pass Pandas Series as matrix_in it gives Segmentation Fault as I believe this is because the function numpy_to_mat_d(), expects a numpy.ndarray[numpy.double_t, ndim=2], but we are passing one with ndim=1

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 20, 2019

Member

Ok, that's what I figured but I just wanted to check. We can handle it in a different PR then. :)

This comment has been minimized.

Copy link
@Yashwants19

Yashwants19 Feb 21, 2019

Author Contributor

Yes we can test this on different PR :)

Yashwants19 added some commits Feb 19, 2019

@rcurtin
Copy link
Member

left a comment

Thanks again for the hard work. I think that the code is fine here, so I'm happy to merge it in. But there is one more comment which should be addressed, so if you can please respond to it before I merge that would be great. 👍

output = test_python_binding(string_in='hello',
int_in=12,
double_in=4.0,
col_in=z)

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 20, 2019

Member

If you can let me know what you think here before we merge, I'd appreciate that. 👍

if copy: # Copy the matrix if required.
return np.ndarray(y.shape, buffer=y.flatten(), dtype=y.dtype, order='C').copy("C"), True
else:
return np.ndarray(y.shape, buffer=y.flatten(), dtype=y.dtype, order='C'), False

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 20, 2019

Member

Are you sure about the y.dtype here? It's possible that this may not be the desired dtype.

This comment has been minimized.

Copy link
@Yashwants19

Yashwants19 Feb 20, 2019

Author Contributor

No, I thing we should use dtype only, I am Sorry I was testing Pandas series so I have used y.dtype . But i realize dtype gives desired result correctly. Should I make another commit to correct this.??

This comment has been minimized.

Copy link
@rcurtin

rcurtin Feb 20, 2019

Member

Yes, absolutely---we should not commit incorrect code.

@mlpack-bot

mlpack-bot bot approved these changes Feb 21, 2019

Copy link

left a comment

Second approval provided automatically after 24 hours. 👍

@rcurtin

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

Thanks again for the contribution! 👍

@rcurtin rcurtin merged commit 4dbcfba into mlpack:master Feb 21, 2019

6 checks passed

LaTeX Documentation Checks Build finished.
Details
Memory Checks Build finished.
Details
Static Code Analysis Checks Build finished.
Details
Style Checks Build finished.
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@Yashwants19 Yashwants19 deleted the Yashwants19:fix_copying_Pandas_DataFrames branch Feb 23, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.