Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

Merged
merged 26 commits into from
Jun 1, 2019

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Oct 15, 2018

closes #23163
closes #23011
closes #23551
#23555 / #23556

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Several times while working around str.cat (most recently #22725) @jreback and @WillAyd mentioned that the .str accessor should not work for (e.g.) bytes data. So far, the constructor of the StringMethods object didn't infer for Series -- this PR adds that.

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types, but for floats, the inference-code would need to be able to distinguish between actual floats and missing values (currently, both of those return inferred_type='mixed')

@pep8speaks
Copy link

pep8speaks commented Oct 15, 2018

Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-30 15:17:50 UTC

@h-vetinari
Copy link
Contributor Author

Hm, turns out the .str-on-bytes is used in several more places. Reverting that part for now.

@codecov
Copy link

codecov bot commented Oct 16, 2018

Codecov Report

Merging #23167 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23167      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50643    50682      +39     
==========================================
+ Hits        46515    46550      +35     
- Misses       4128     4132       +4
Flag Coverage Δ
#multiple 90.38% <100%> (ø) ⬆️
#single 41.74% <95.52%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/strings.py 98.92% <100%> (+0.05%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a60d1bd...f62e344. Read the comment docs.

@jreback jreback added the Strings String extension data type and string data label Oct 17, 2018
@@ -208,6 +208,8 @@ Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reword a bit. something like

The `.str`-accessor will raise on any non-(string or unicode) ``object`` dtype data. Previously this would accept ``bytes`` for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is how you would love to see the .str-accessor work (and it's not like my sentence is better).

The problem is that the .str-accessor has effectively been an .object-accessor for a very long time, and is used as such - even pandas-internally, like .str.join for lists or str.decode for bytes.

Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.

So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.

pandas/core/strings.py Show resolved Hide resolved
raise AttributeError('Can only use .str accessor with Index, '
'not MultiIndex')

# see src/inference.pyx which can contain string values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty is only ok if its object dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# see src/inference.pyx which can contain string values
allowed_types = ['string', 'unicode', 'empty',
'mixed', 'mixed-integer']
if isinstance(data, ABCSeries):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this accept bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See OP:

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

I originally commented out the only tests in test_strings that needed bytes and completely forbade bytes, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.

if isinstance(data, ABCSeries):
allowed_types = allowed_types + ['bytes']

values = data if isinstance(data, Index) else data.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getattr(data, 'values', data)

else:
inf_type = lib.infer_dtype(values)

all_na_obj = is_object_dtype(values.dtype) and data.isna().all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is by defintion what 'empty' means, you only need to check if its object type in that case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'.

>>> import pandas as pd
>>> import numpy as np
>>> import pandas._libs.lib as lib
>>> lib.infer_dtype(pd.Index([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([]))
'floating'
>>> lib.infer_dtype(pd.Index([]))
'empty'
>>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([], dtype=object))
'empty'
>>> lib.infer_dtype(pd.Index([], dtype=object))
'empty'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, still separating these out to separate if/elif clauses makes this easier to grok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary anymore if I use .dropna. Then the only 'mixed' are really mixed (and not caused by e.g. np.nan)


values = data if isinstance(data, Index) else data.values
if is_categorical_dtype(data.dtype):
inf_type = lib.infer_dtype(values.categories)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls pls don't use abbreviation. inferred_type is the name for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came from the existing code. Will change

all_na_obj = is_object_dtype(values.dtype) and data.isna().all()

# same for Series and Index (that is not MultiIndex)
if inf_type not in allowed_types and not all_na_obj:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing this here, I would for example have an if block that handles the empty (and checks if its object_dtype, then you immediately know if its ok or not)
then pass thru string and unicode. anything else is a raise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok in principle, but it's definitely necessary to allow 'mixed':

>>> lib.infer_dtype(pd.Series(['a', np.nan]))
'mixed'

Disallowing 'mixed-integer' would also break existing code, but I mentioned in the OP:

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types.

Check out the example in #13877 - currently the .str-methods are generally permissive about just returning NaN if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith. Could introduce an 'errors'-kwarg where necessary, but all of that needs to happen on method level - the .str-constructor must allow these to pass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above, we need a very clear path where each case can be handled independently if need be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could infer on data.dropna(), then the answer would be much clearer, I think.

I'd actually like to save the result of all this type inferring in an attribute of the StringMethods object (like ._inferred_type) so that it can be reused by the methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independently, we could also (at a later stage) add the errors-kwarg to .str itself, and the methods then inherit it.

@@ -2274,7 +2274,8 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
result = cat_core(all_cols, sep)

if isinstance(self._orig, Index):
result = Index(result, name=self._orig.name)
# add dtype for case that result is all-NA
result = Index(result, dtype='object', name=self._orig.name)
else: # Series
result = Series(result, index=data.index, name=self._orig.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob need dtype=object (you don't need quotes FYI)

pandas/tests/test_strings.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. There are still several open questions on how to proceed.

pandas/core/strings.py Show resolved Hide resolved
raise AttributeError('Can only use .str accessor with Index, '
'not MultiIndex')

# see src/inference.pyx which can contain string values
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# see src/inference.pyx which can contain string values
allowed_types = ['string', 'unicode', 'empty',
'mixed', 'mixed-integer']
if isinstance(data, ABCSeries):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See OP:

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

I originally commented out the only tests in test_strings that needed bytes and completely forbade bytes, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.


values = data if isinstance(data, Index) else data.values
if is_categorical_dtype(data.dtype):
inf_type = lib.infer_dtype(values.categories)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came from the existing code. Will change

else:
inf_type = lib.infer_dtype(values)

all_na_obj = is_object_dtype(values.dtype) and data.isna().all()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'.

>>> import pandas as pd
>>> import numpy as np
>>> import pandas._libs.lib as lib
>>> lib.infer_dtype(pd.Index([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([]))
'floating'
>>> lib.infer_dtype(pd.Index([]))
'empty'
>>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([], dtype=object))
'empty'
>>> lib.infer_dtype(pd.Index([], dtype=object))
'empty'

all_na_obj = is_object_dtype(values.dtype) and data.isna().all()

# same for Series and Index (that is not MultiIndex)
if inf_type not in allowed_types and not all_na_obj:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok in principle, but it's definitely necessary to allow 'mixed':

>>> lib.infer_dtype(pd.Series(['a', np.nan]))
'mixed'

Disallowing 'mixed-integer' would also break existing code, but I mentioned in the OP:

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types.

Check out the example in #13877 - currently the .str-methods are generally permissive about just returning NaN if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith. Could introduce an 'errors'-kwarg where necessary, but all of that needs to happen on method level - the .str-constructor must allow these to pass.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review; test additions will follow

@@ -208,6 +208,8 @@ Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is how you would love to see the .str-accessor work (and it's not like my sentence is better).

The problem is that the .str-accessor has effectively been an .object-accessor for a very long time, and is used as such - even pandas-internally, like .str.join for lists or str.decode for bytes.

Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.

So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).

pandas/tests/test_strings.py Outdated Show resolved Hide resolved
else:
inf_type = lib.infer_dtype(values)

all_na_obj = is_object_dtype(values.dtype) and data.isna().all()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary anymore if I use .dropna. Then the only 'mixed' are really mixed (and not caused by e.g. np.nan)

@@ -208,6 +208,8 @@ Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

@@ -208,6 +208,8 @@ Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 30, 2018

@jreback

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

I implemented just that for the bytes case now. PTAL

Side note: I needed to use #23422 already to be able to use the type inference.

@h-vetinari h-vetinari force-pushed the str_infer branch 3 times, most recently from 5f96220 to e1b69af Compare November 5, 2018 19:01

values = getattr(data, 'values', data) # Series / Index
values = getattr(values, 'categories', values) # categorical / normal
# missing values mess up type inference -> skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

'mess up' -> obfuscate

@@ -2400,13 +2411,17 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
'side': 'beginning',
'method': 'split'})
def split(self, pat=None, n=-1, expand=False):
if self._inferred_type in ['bytes']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a method instead and call it where appropriate (or can do a decorator)

def assert_bytes_allowed(self):
   ....

@pytest.mark.parametrize('other', [Series, Index])
def test_str_cat_all_na(self, box, other):
# check that all NaNs in caller / target work
s = Index(['a', 'b', 'c', 'd'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this more transparent and don't use abbreviations. This is very hard to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand you here - this uses the same format as the other tests for str.cat, and I can't see any abbreviations.

@@ -377,11 +377,36 @@ def test_str_cat_align_mixed_inputs(self, join):
with tm.assert_raises_regex(ValueError, rgx):
s.str.cat([t, z], join=join)

def test_str_cat_raises(self):
@pytest.mark.parametrize('box', [Series, Index])
def test_str_cat_raises(self, box):
# non-strings hiding behind object dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a test that validates that a bytes object is not wokring for all methods but decode

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

I switched to a decorator like you asked - it's much more elegant IMO, and also uncovered some deficiencies (e.g. the function names were not preserved in the existing wrappers).

pandas/core/strings.py Show resolved Hide resolved
pandas/core/strings.py Show resolved Hide resolved
pandas/core/strings.py Show resolved Hide resolved
pandas/core/strings.py Show resolved Hide resolved
@@ -2917,7 +3021,8 @@ def rindex(self, sub, start=0, end=None):
5 3.0
dtype: float64
""")
len = _noarg_wrapper(len, docstring=_shared_docs['len'], dtype=int)
len = _noarg_wrapper(len, docstring=_shared_docs['len'],
forbidden_types=None, dtype=int)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that len should still be allowed for bytes to preserve a sort of minimal functionality compared to what is working as of v.0.23.4

('find', ['some_pattern']),
('rfind', ['some_pattern']),
# ('index', ['some_pattern']),
# ('rindex', ['some_pattern']),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index and partition and get_dummies were having some errors on Index. Those will have to be investigated (or xfailed)

+ [('floating', [1.0, np.nan, 2.0]),
('integer', [1, 2, 3]),
# ('decimal', [...]),
# ('complex', [...]),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filling in these types is WIP, but will be easy.

t.str
assert not hasattr(t, 'str')

def test_api_mi_raises(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the (existing) error for trying to call MultiIndex.str was not tested so far.

t = box(values, dtype=object) # object dtype to avoid casting

bytes_allowed = method_name in ['encode', 'decode', 'len']
mixed_allowed = method_name not in ['cat']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this test is written/parametrized should make it very transparent which (inferred) dtypes are allowed for which method.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of issue refs for the bugs I found. I thought about splitting these fixes off into separate PRs, but all of them would essentially need the same parametrized-fixture-machinery I'm building in this PR, so I'm leaving them here for now.

@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0):
if regex.groups == 0:
raise ValueError("pattern contains no capture groups")

if isinstance(arr, ABCIndex):
if isinstance(arr, ABCIndexClass):
Copy link
Contributor Author

@h-vetinari h-vetinari Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line fixes #23555. (the problem was that a CategoricalIndex is not an instance of ABCIndex)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, ABCIndexClass is the base index class.

pandas/core/strings.py Show resolved Hide resolved
pandas/core/strings.py Show resolved Hide resolved

# for category, we do the stuff on the categories, so blow it up
# to the full series again
# But for some operations, we have to do the stuff on the full values,
# so make it possible to skip this step as the method already did this
# before the transformation...
if use_codes and self._is_categorical:
result = take_1d(result, self._orig.cat.codes)
# if self._orig is a CategoricalIndex, there is no .cat-accessor
result = take_1d(result, Series(self._orig).cat.codes)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lines fixes #23556

count = _pat_wrapper(str_count, flags=True, name='count')
startswith = _pat_wrapper(str_startswith, na=True, name='startswith')
endswith = _pat_wrapper(str_endswith, na=True, name='endswith')
findall = _pat_wrapper(str_findall, flags=True, name='findall')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, these methods would end up with the name 'str_count', instead of 'count', for example. xref #23551

('integer', [1, np.nan, 2]),
('mixed-integer-float', [1, np.nan, 2.0]),
('decimal', [Decimal(1), np.nan, Decimal(2)]),
# ('complex', [1 + 1j, np.nan, 2 + 2j]),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the timedelta64 case below are commented out due to #23554

pytest.xfail(reason='Method not nan-safe on Index; see GH 23558')
if (method_name == 'get_dummies' and box == Index
and inferred_dtype == 'empty' and dtype == object):
pytest.xfail(reason='Need to fortify get_dummies corner case')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case wasn't an issue up until now, because Index.str raised if the inferred dtype was 'empty'. As such, I haven't opened an issue for it yet.

@h-vetinari
Copy link
Contributor Author

@jreback
Since all these things are interrelated, the only way I could think of how to split this PR, is adding the (initially mostly xfailed) tests of this PR in #23582, and then all the code fixes can follow independently.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this PR is too much going on. pls break apart with a sepearate PR

  • for the name propagating fixing
  • .bytes accessor
  • test structure changes (you already did this i think as a pre-cursor)

@@ -2258,6 +2258,8 @@ def tuples_to_object_array(ndarray[object] tuples):
result = np.empty((n, k), dtype=object)
for i in range(n):
tup = tuples[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would need to set the result to np.nan if its a null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.empty((n, k), dtype=object) would have None in them, which then get used just like np.nan later. But can change of course.

@@ -2274,6 +2276,8 @@ def to_object_array_tuples(rows: list):

k = 0
for i in range(n):
if checknull(rows[i]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -2283,11 +2287,15 @@ def to_object_array_tuples(rows: list):
try:
for i in range(n):
row = rows[i]
if checknull(row):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for these

@@ -54,6 +55,8 @@ def cat_core(list_of_columns, sep):
nd.array
The concatenation of list_of_columns with sep
"""
if sep == '':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why dont' you

if sep == '':
    list_with_sep = list_of_columns
else: 
   ....
return np.sum(....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0):
if regex.groups == 0:
raise ValueError("pattern contains no capture groups")

if isinstance(arr, ABCIndex):
if isinstance(arr, ABCIndexClass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, ABCIndexClass is the base index class.

pandas/core/strings.py Show resolved Hide resolved
@h-vetinari
Copy link
Contributor Author

@jreback

so this PR is too much going on. pls break apart with a sepearate PR

Agreed, which is why I split off the tests (which are the main source of changes, and responsible for checking them) in #23582. Let's do that one first, and then I can easily split up this PR.

@jreback
Copy link
Contributor

jreback commented May 12, 2019

this looked ok, if @pandas-dev/pandas-core wants to finish this up.

@h-vetinari
Copy link
Contributor Author

@jreback
Couldn't respond to your last ping at the time and then forgot. Merged master.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly doc-comments. ping on green.

.. warning::

Before v.0.25.0, the ``.str``-accessor did only the most rudimentary type checks. Starting with
v.0.25.0, the type of the Series is inferred (like it has already been the case for ``Index.str``),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove what you have in the parens (like it...)

The ``.str``-accessor performs stricter type checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword this in a more positive light. (basically remove up thru numpy)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.
From now on, the inferred dtype of the Series is checked to be correct (particularly, not ``'bytes'``), as :attr:`Index.str` does already.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From now on, the inferred dtype of the Series is checked to be correct (particularly, not 'bytes'), as :attr:Index.str does alread

:attr:`Series.str` will now infer the dtype data *within* the Series, in particular, bytes will raise an exception

pandas/core/strings.py Show resolved Hide resolved
The ``.str``-accessor performs stricter type checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issues that are being closed here (before the examples)

@jreback jreback added this to the 0.25.0 milestone May 19, 2019
@jreback
Copy link
Contributor

jreback commented May 19, 2019

@h-vetinari a couple of comments, merge master and ping on green for merge.

@h-vetinari
Copy link
Contributor Author

@jreback
This is green.

Note that I enabled 'bytes' also for methods .str.get and .str.slice, as those are still useful in that case, and to reduce the potential impact of the breaking change.

@h-vetinari
Copy link
Contributor Author

Argh, just double-checked and saw that I had forgotten to save the whatsnew changes...

and inferred_dtype in ['boolean', 'date', 'time']):
pytest.xfail(reason='Inferring incorrectly because of NaNs; '
'solved by GH 23167')
if (method_name in ['partition', 'rpartition'] and box == Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice on removing some of the xfails; ideally we can remove the rest in future PRs

@jreback jreback merged commit 9a42cbe into pandas-dev:master Jun 1, 2019
@jreback
Copy link
Contributor

jreback commented Jun 1, 2019

thanks @h-vetinari

as you have seen, targeted PRs merge much much faster than giant refactors

@h-vetinari
Copy link
Contributor Author

@jreback: as you have seen, targeted PRs merge much much faster than giant refactors

Fair, but that's not a good reason to completely avoid all refactors. Sometimes, painful(ly large) PRs are necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
3 participants