Can only use .str accessor with string values, which use np.object_ dtype in pandas #439

Closed
Clem-D opened this issue Jan 29, 2019 · 10 comments

@Clem-D

Clem-D commented Jan 29, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux release 7.5.1804 (Core)
  • Modin installed from (source or binary): pip install modin
  • Modin version: 0.3.0
  • Python version: 3.5.2
  • Exact command to reproduce: df['foo'] = df['foo'].str.replace('.', ',')

Describe the problem

This issue follows #414.
And yes, df['foo'] = df['foo'].str.replace('.', ',') worked with pandas.
Actually, all my code used to work with pandas ^^

Source code / logs

Can only use .str accessor with string values, which use np.object_ dtype in pandas

@devin-petersohn
Collaborator

Thanks @Clem-D, was this working on 0.2.5? We recently added a SeriesView class that may be interfering with the normal behavior.

Would you be able to tell me if this works instead:

df['foo'] = df['foo'].series.str.replace('.', ',')

This will call the literal pandas series code. If it is working, it is a fairly easy fix.

@Clem-D
Author

Clem-D commented Jan 30, 2019

Hello @devin-petersohn, I don't know if it works in 0.2.5 because, as I said in #414, I got another error before (with the "encoding" keyword).
Anyway, using series doesn't work either.

@devin-petersohn
Collaborator

I see, the encoding does not allow testing on 0.2.5.

This is an interesting issue, because it is using pandas for that series code.

What does print(df['foo'].dtype) print? It is giving an error related to the dtype.

Also, does this fix the issue:

df['foo'] = df['foo'].apply(str).str.replace('.', ',')

It will force the column to string dtype because it is not recognizing it as a string column.
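A minimal sketch of why converting the values first helps, assuming plain pandas and a hypothetical `foo` column that holds floats under the `object` dtype (the real column contents are not shown in this thread). Strictly speaking, `apply(str)` turns each value into a Python `str`; it is the values, not the dtype label, that the `.str` accessor checks. `regex=False` is passed explicitly because the default for that parameter has differed across pandas releases; this assumes a pandas version that has the `regex` parameter.

```python
import pandas as pd

# Hypothetical data: numbers stored in an object-dtype column, as can
# happen after reading a spreadsheet.
df = pd.DataFrame({"foo": pd.Series([1.5, 2.25, 3.0], dtype=object)})

try:
    # The accessor validates the values and rejects non-string content.
    df["foo"].str.replace(".", ",", regex=False)
    accessor_failed = False
except AttributeError as exc:
    accessor_failed = True
    print("AttributeError:", exc)

# Convert every value to str first; after that the accessor is available.
as_text = df["foo"].apply(str)
result = as_text.str.replace(".", ",", regex=False)
print(result.tolist())
```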

@Clem-D
Author

Clem-D commented Jan 31, 2019

print(df['foo'].dtype) gives "object"
df['foo'] = df['foo'].apply(str).str.replace('.', ',') actually works :)
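An `object` dtype only says that the column holds arbitrary Python objects, so it does not guarantee strings. The sketch below shows one way to look at what is actually stored, using `pandas.api.types.infer_dtype` and a count of Python types per column. The three columns are made-up examples, not data from this issue.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical columns: all three report dtype "object".
columns = {
    "all_text": pd.Series(["1.5", "2.5"], dtype=object),
    "all_float": pd.Series([1.5, 2.5], dtype=object),
    "text_and_int": pd.Series(["a", 1], dtype=object),
}

# infer_dtype inspects the values rather than the dtype label.
inferred = {name: infer_dtype(col, skipna=True) for name, col in columns.items()}

for name, col in columns.items():
    # Count the Python types of the individual values.
    type_counts = col.map(lambda v: type(v).__name__).value_counts().to_dict()
    print(name, col.dtype, inferred[name], type_counts)
```

If the inferred kind is numeric (for example "floating"), the `.str` accessor refuses the column even though `dtype` prints `object`.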

@devin-petersohn
Collaborator

devin-petersohn commented Jan 31, 2019

This is interesting. I will investigate to see if I can reproduce this. It may be that internally Modin is losing track of the dtype after encoding is set to latin1.

@devin-petersohn
Collaborator

Hi @Clem-D, I have been trying to reproduce this error, but I haven't been successful. Here is some of the code I wrote to try to reproduce the error.

import pandas
import numpy as np

frame_data = np.random.randint(0, 100, size=(1000, 100))
df = pandas.DataFrame(frame_data).add_prefix("col")

# mix the dtypes to see if that is the issue
for i in range(len(df.columns)):
    df.iloc[:, i] = ["hi " + str(o) if o > 50 else o for o in df.iloc[:, i]]

df.to_excel("temp.xlsx", encoding="latin1")

import modin.pandas as pd
df = pd.read_excel("temp.xlsx", encoding="latin1")
df["col1"] = df["col1"].str.replace("hi", "hello")

Could you provide a sample of the column data that you're trying to replace? That would help a lot.
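One possible reason the attempt above does not trigger the error: a column that mixes strings with integers is still accepted by the `.str` accessor, and the non-string entries simply come back as missing values. A small sketch with made-up values, assuming plain pandas with the `regex` parameter available:

```python
import pandas as pd

# Hypothetical mixed column, similar in spirit to the reproduction attempt.
mixed = pd.Series(["hi 60", 10, "hi 75"], dtype=object)

# No AttributeError here: strings are replaced, the integer becomes NaN.
replaced = mixed.str.replace("hi", "hello", regex=False)
print(replaced.tolist())
```

A column that contains no strings at all (for example only floats) behaves differently and is rejected by the accessor.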

@Clem-D
Author

Clem-D commented Feb 20, 2019

Sure! Here is an example file:
modinExample.xlsx

And here is the code I use:

python --version gives me 3.5.1
then I typed python to open the interactive interpreter

>>> import ray
/home/user/.pyenv/versions/3.5.1/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.

>>> ray.init()
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-02-20_12-04-38_32023/logs.
Waiting for redis server at 127.0.0.1:10671 to respond...
Waiting for redis server at 127.0.0.1:61758 to respond...
Starting Redis shard with 10.0 GB max memory.
Starting the Plasma object store with 6.6930343930000005 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
{'object_store_address': '/tmp/ray/session_2019-02-20_12-04-38_32023/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2019-02-20_12-04-38_32023/sockets/raylet', 'redis_address': '10.69.10.51:10671', 'node_ip_address': None, 'webui_url': None}

>>> ray.global_state.cluster_resources()["CPU"]
4.0 

>>> import modin.pandas as pd
>>> df = pd.read_excel("/home/user/Desktop/modinExample.xlsx", encoding="latin1")
/home/cdelestre/.pyenv/versions/noe/lib/python3.5/site-packages/modin/error_message.py:32: UserWarning: `read_excel` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
  warnings.warn(message)

>>> df['Parent(s) (by id coma separated ie. «1,2»)'] = df['Parent(s) (by id coma separated ie. «1,2»)'].str.replace('.', ',')
File "<stdin>", line 1, in <module>
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/modin/pandas/series.py", line 181, in __getattribute__
    method = self.series.__getattribute__(item)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/accessor.py", line 133, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1895, in __init__
    self._validate(data)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1917, in _validate
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

>>> df['foo'] = df['foo'].str.replace('.', ',')
File "<stdin>", line 1, in <module>
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/modin/pandas/series.py", line 181, in __getattribute__
    method = self.series.__getattribute__(item)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/accessor.py", line 133, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1895, in __init__
    self._validate(data)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1917, in _validate
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

>>> df['bar'] = df['bar'].str.replace('.', ',')
File "<stdin>", line 1, in <module>
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/modin/pandas/series.py", line 181, in __getattribute__
    method = self.series.__getattribute__(item)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/accessor.py", line 133, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1895, in __init__
    self._validate(data)
  File "/home/user/.pyenv/versions/noe/lib/python3.5/site-packages/pandas/core/strings.py", line 1917, in _validate
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

pip freeze gives me:

dask==1.1.0
(...)
modin==0.3.0
(...)
numpy==1.15.0
pandas==0.23.4
pathlib2==2.3.3
(...)
xlrd==1.0.0
XlsxWriter==1.0.2

(I filtered only interesting packages but I can give you the full output if you want).

@devin-petersohn
Collaborator

Hi @Clem-D, sorry for the late reply. With the example you provided, I get the same error in pandas and Modin. Is there a different example that is also working in pandas?

@Clem-D
Author

Clem-D commented Apr 15, 2019

Hello,
You're right, it's because I forgot to add the line
df[["Parent(s) (by id coma separated ie. «1,2»)"]] = df[["Parent(s) (by id coma separated ie. «1,2»)"]].astype(str)
before
df['Parent(s) (by id coma separated ie. «1,2»)'] = df['Parent(s) (by id coma separated ie. «1,2»)'].str.replace('.', ',')
which works with regular pandas.
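For reference, the two steps can be wrapped in a small helper. This is only a sketch: `dots_to_commas` and the `parents` series are hypothetical names, and the data is invented. It also restores missing values afterwards, since `astype(str)` can turn them into text such as "nan" depending on the pandas version.

```python
import pandas as pd


def dots_to_commas(column: pd.Series) -> pd.Series:
    """Return the column as text with every literal '.' replaced by ','."""
    text = column.astype(str)
    # regex=False makes the literal-dot intent explicit.
    replaced = text.str.replace(".", ",", regex=False)
    # Keep originally missing entries missing instead of textual placeholders.
    return replaced.where(column.notna())


# Hypothetical column mixing a float, a string and a missing value.
parents = pd.Series([1.2, "3.4", None], dtype=object, name="parents")
converted = dots_to_commas(parents)
print(converted.tolist())
```

The same helper could be applied to a DataFrame column with `df[name] = dots_to_commas(df[name])`.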

@devin-petersohn
Collaborator

Closing this. Feel free to reopen if the discussion should continue or if the issue was not resolved.
