Ambiguity in DataFrame.replace regex handling #6777

Closed
adamdivak opened this issue Apr 3, 2014 · 2 comments · Fixed by #6778
Labels: API Design, Bug, Strings

Comments

@adamdivak

Dear developers,

I find it confusing how DataFrame.replace decides whether to interpret the to_replace values as plain strings or as regular expressions.

When you pass a dictionary of from and to values, the keys are interpreted as plain string literals and work as expected. If you pass a nested dictionary with the same values, the keys are interpreted as regexes, even if regex=False is passed explicitly. This causes a problem if you want to replace values that contain special characters.

Let's see an example:

@buddha[J:T26]|19> df = pd.DataFrame({'a' : ['()', 'something else']})
@buddha[J:T26]|20> df
              <20>
                a
0              ()
1  something else

[2 rows x 1 columns]
@buddha[J:T26]|21> df.replace({'()' : 'parantheses'})
              <21>
                a
0     parantheses
1  something else

[2 rows x 1 columns]
@buddha[J:T26]|22> df.replace({'a' : {'()' : 'parantheses'}})
              <22>
                                                   a
0                parantheses(parantheses)parantheses
1  paranthesessparanthesesoparanthesesmparanthese...

[2 rows x 1 columns]

As you can see, in the first case the () was replaced as expected. In the second case, even though the only thing I changed was to specify the column in which the replacement should occur, the parentheses were treated as a regex. The same happens if I pass regex=False.

My current workaround is to make sure that the from/to values are written as regexes.
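The output of the nested-dictionary call is consistent with the key being applied as a regular expression. As a minimal sketch using only Python's re module (this illustrates regex semantics, it is not pandas' internal code path): the pattern () is an empty group, which matches the empty string at every position, so the replacement is inserted around every character.

```python
import re

# "()" as a regex is an empty capturing group: it matches the empty string
# at every position in the input, so the replacement is inserted everywhere.
as_regex = re.sub("()", "parantheses", "()")
print(as_regex)

# The same pattern applied to the second row of the example frame.
second_row = re.sub("()", "parantheses", "something else")
print(second_row[:40])

# A plain (literal) string replacement gives the result the reporter expected.
as_literal = "()".replace("()", "parantheses")
print(as_literal)
```

The first printed value has the same shape as row 0 of the unexpected DataFrame output, which suggests the nested form was being compiled as a pattern rather than compared as a literal.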

Cheers,
Adam

P.S. Pandas is awesome, thanks a lot for all the effort!

@Buddha[J:T26]|24> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 23 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.13.1
Cython: None
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.2.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.8.4
lxml: 3.3.1
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None

@cpcloud
Member

cpcloud commented Apr 3, 2014

Thanks for the report. For now, you can also use re.escape to make sure that special characters are escaped. I'll look into this.
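A minimal sketch of the re.escape workaround, assuming the example frame from the report and passing regex=True explicitly so the regex interpretation does not depend on a default. The variable names are illustrative only, and behavior on pandas 0.13.1 specifically has not been checked here.

```python
import re

import pandas as pd

df = pd.DataFrame({"a": ["()", "something else"]})

# Escape the literal so "(" and ")" lose their special regex meaning.
pattern = re.escape("()")

# Nested dict: column name -> {pattern: replacement}.
# regex=True states the intended interpretation of the keys explicitly.
result = df.replace({"a": {pattern: "parantheses"}}, regex=True)
print(result)
```

With the key escaped, only the cell containing the literal parentheses should change and the other row is left as it was.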

@adamdivak
Author

Thank you!
