REF: StringArray._from_sequence, use less memory #35519

topper-123 · 2020-08-02T19:15:46Z

closes BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 #35499
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

@ldacey, can you try if this fixes your problem?

jreback · 2020-08-02T19:35:00Z

pandas/_libs/lib.pyx

@@ -1698,6 +1698,20 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
    return validator.validate(values)


+cpdef ndarray ensure_string_array(ndarray values, object na_value):


this should return a new array, and pls add a doc string

also i think we have a routine like this already

topper-123 · 2020-08-03T18:21:45Z

I looked at how the array in pd.Series(data, str) is created. There is a similar function that ensures a string array, but it's slower.

pd.Series(data, str) ends up calling dtypes.cast.construct_1d_ndarray_preserving_na, where currently all scalars are first converted to "<U" dtype, then converted back to object dtype (and str type), and finally the nan-likes are added back where appropriate. That's a lot of steps, and in general it's the same problem as the one I'm looking into for StringArray.

By also using the new method in construct_1d_ndarray_preserving_na, we avoid creating a lot of new objects: a single iteration reuses existing str objects and adds the nan-likes in the same pass rather than in a separate step. This change means faster performance and lower memory usage:

>>> data = ["aaaaaa" for _ in range(201368)]
>>> %timeit pd.Series(data, dtype=str)
66.5 ms ± 656 µs per loop  # master
36.4 ms ± 330 µs per loop  # this PR
>>> %timeit pd.Series(data, dtype="string")
63.9 ms ± 832 µs per loop  # master
37.2 ms ± 459 µs per loop  # this PR

So a bit less than a doubling in performance.
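The difference between the two construction paths described above can be sketched in plain NumPy. This is a simplified illustration, not the Cython code from this PR: `roundtrip_to_str`, `single_pass_to_str` and `_is_na` are hypothetical names, and the NA check here only covers None and float NaN.

```python
import numpy as np


def _is_na(value):
    # Simplified NA check (assumption): only None and float NaN count as missing.
    return value is None or (isinstance(value, float) and value != value)


def roundtrip_to_str(values):
    """Object -> fixed-width unicode -> object, then restore the NA values."""
    arr = np.asarray(values, dtype=object)
    na_mask = np.array([_is_na(v) for v in arr], dtype=bool)
    result = arr.astype(str).astype(object)  # two full intermediate arrays
    result[na_mask] = arr[na_mask]           # extra step to put the NAs back
    return result


def single_pass_to_str(values):
    """One loop: keep existing str objects and NAs, convert everything else."""
    result = np.array(values, dtype=object)  # new array, elements are shared references
    for i, value in enumerate(result):
        if not _is_na(value) and not isinstance(value, str):
            result[i] = str(value)
    return result


data = ["aaaaaa", 1, float("nan"), None]
old = roundtrip_to_str(data)
new = single_pass_to_str(data)
```

In the single-pass version `new[0] is data[0]` holds, because the existing str object is stored by reference instead of being rebuilt from a unicode buffer.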

In its new form, this PR might now be not only a bug fix for #35499 but also a more fundamental change. Maybe the change to construct_1d_ndarray_preserving_na should go in a separate PR that goes into v1.2?

simonjayhawkins · 2020-08-03T19:32:04Z

pandas/_libs/lib.pyx

+    values : array-like
+        The values to be converted to str, if needed
+    na_value : Any
+        The value to use for na. For example, np.nan or pd.NAN


simonjayhawkins · 2020-08-03T19:32:20Z

pandas/_libs/lib.pyx

+    convert_na_value : bool, default True
+        If False, existing na values will be used unchanged in the new array
+    copy : bool, default True
+        Whether to wnsure that a new array is returned


typo + full stop

simonjayhawkins · 2020-08-03T19:36:53Z

pandas/_libs/lib.pyx

+    if convert_na_value:
+        for i in range(n):
+            val = result[i]
+            if not checknull(val):


do we also need to account for pandas.options.mode.use_inf_as_na

Not super familiar with pandas.options.mode.use_inf_as_na, but it looks like it only treats inf as nan, but it does not convert, right?

ok, support looks patchy, but does work as I expected for Int64, maybe I misunderstood the functionality, but I do expect it to convert in operations such as astype or DataFrame construction.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.2.0.dev0+27.g53f6b4711'
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
...
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("float32")
...
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("category")
...
Traceback (most recent call last):
...
ValueError: Categorical categories cannot be null
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("Int64")
...
>>> df
      a
0  <NA>
1  <NA>
2     1
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, 1.0])})
>>> with pd.option_context("mode.use_inf_as_na", False):
...     df = df.astype("Int64")
...
Traceback (most recent call last):
...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> df
     a
0  NaN
1  inf
2  1.0
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", False):
...     df = df.astype("string")
...
>>> df
     a
0  nan
1  inf
2  foo
>>>
>>> df = pd.DataFrame({"a": np.array([np.nan, np.inf, "foo"])})
>>> with pd.option_context("mode.use_inf_as_na", True):
...     df = df.astype("string")
...
>>> df
     a
0  nan
1  inf
2  foo

pd.options.mode.use_inf_as_na = True doesn't change inf to nan, so I'm not sure whether an astype should return a string or a nan. For example:

>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x
0   NaN
1   NaN
dtype: float64
>>> x[0]
inf  # it's still an inf; only the repr shows NaN, and it's treated as nan in calculations...

Pandas 1.0 had not decided on how to convert infs to strings when pd.options.mode.use_inf_as_na = True:

>>> pd.options.mode.use_inf_as_na = True
>>> x = pd.Series([np.inf, np.nan])
>>> x.astype(str)
0    inf  # <- string "inf", not string "nan"
1    nan
dtype: object
>>> x.astype("string")
0    <NA>  # <- pd.NA
1    <NA>
dtype: string

Probably a choice of convention on how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.

Probably a choice of convention on how inf should be converted when use_inf_as_na = True needs to be decided on? In that case it'd be a different PR IMO.

sure

simonjayhawkins · 2020-08-03T19:43:21Z

Maybe the change to construct_1d_ndarray_preserving_na should go in a separate PR that goes into v1.2?

I think now that we are using semver, I'm happy including all the bugfixes in patch versions. I think that'll keep minor releases more focused on enhancements but would also mean more work maintaining the backport branch. let's see what others think.

jreback · 2020-08-03T23:17:43Z

pandas/_libs/lib.pyx

@@ -1698,6 +1698,48 @@ cpdef bint is_string_array(ndarray values, bint skipna=False):
    return validator.validate(values)


+cpdef ndarray ensure_string_array(


this looks like a generalization of: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L621

pls de-dupe these as appropriate

Updated.

astype_str was only used once, so I've deleted it and use ensure_string_array instead.

jreback · 2020-08-04T11:12:00Z

can u add a memory asv for this case

topper-123 · 2020-08-04T17:54:19Z

I'm not familiar with memory ASVs, only timing ones. Can you point to a file where I can find one? (I can then use that as a template for mine)

jreback · 2020-08-04T19:24:17Z

I'm not familiar with memory ASVs, only timing ones. Can you point to a file where I can find one? (I can then use that as a template for mine)

https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/rolling.py#L24
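For readers unfamiliar with the convention: asv picks the benchmark type from the method-name prefix, so a `time_` method is timed while a `peakmem_` method has the peak resident memory of the process recorded. A minimal sketch of what a memory benchmark for this case could look like follows; the class name and data size are assumptions for illustration, not the benchmark that was added in this PR.

```python
import pandas as pd


class StringConstruction:
    # asv runs each benchmark method once per parameter value.
    params = ["str", "string"]
    param_names = ["dtype"]

    def setup(self, dtype):
        # Same shape of data as the timings earlier in this thread.
        self.data = ["aaaaaa" for _ in range(201368)]

    def time_series_construction(self, dtype):
        # "time_" prefix: asv measures wall-clock time.
        pd.Series(self.data, dtype=dtype)

    def peakmem_series_construction(self, dtype):
        # "peakmem_" prefix: asv records peak memory while this runs.
        pd.Series(self.data, dtype=dtype)
```

Keeping the data construction in `setup` means only the Series construction itself contributes to the timed portion of the benchmark.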

jreback · 2020-08-04T19:28:02Z

pandas/core/dtypes/cast.py

-        subarr2 = subarr.astype(object)
-        subarr2[na_values] = np.asarray(values, dtype=object)[na_values]
-        subarr = subarr2
+        values = np.asarray(values, dtype="object")


don't actually need the np.asarray call here

topper-123 · 2020-08-05T13:30:47Z

I've added ASVs. However, I'm not able to run them on my machine (Windows 10), so I'm not sure they show what I think they should show... Maybe someone could check it?

jreback · 2020-08-10T13:32:30Z

I've added ASVs. However, I'm not able to run them on my machine (Windows 10), so I'm not sure they show what I think they should show... Maybe someone could check it?

you can just run memits to assert that they are working, no?

jreback · 2020-08-13T18:12:18Z

@topper-123 can you rebase and show the %memit on before / after

topper-123 · 2020-08-16T18:05:46Z

Ok, I've used memit, didn't know about that, it's nice.

Results:

>>> data = ["aaaaaa" for _ in range(201368)]
>>>
>>> %timeit pd.Series(data, dtype=str)
75.2 ms ± 725 µs per loop  # master
40.6 ms ± 389 µs per loop  # this PR
>>> %memit pd.Series(data, dtype=str)
peak memory: 100.96 MiB, increment: 18.87 MiB  # master
peak memory: 84.96 MiB, increment: 1.63 MiB  # this PR
>>>
>>> %timeit pd.Series(data, dtype="string")
70 ms ± 928 µs per loop  # master
41.9 ms ± 910 µs per loop  # this PR
>>> %memit pd.Series(data, dtype="string")
peak memory: 96.68 MiB, increment: 13.47 MiB  # master
peak memory: 85.25 MiB, increment: 1.55 MiB  # this PR

so a significant decrease in both time and memory usage.

the time and memory improvements are of course related, coming from avoiding the (simplified) pattern np.array(data, dtype=str).astype(object), that is used in master.
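Outside IPython, a rough comparison of the two patterns can be made with the standard-library tracemalloc module. This is only a sketch: `peak_traced_bytes` is a hypothetical helper, and tracemalloc counts traced allocations rather than process memory, so the numbers will not match the %memit figures above.

```python
import tracemalloc

import numpy as np


def peak_traced_bytes(func):
    """Return the peak number of bytes traced by tracemalloc while func() runs."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


data = ["aaaaaa" for _ in range(20_000)]

# The simplified pattern attributed to master in this thread:
# a unicode buffer is built, then a new str object is created per element.
roundtrip_peak = peak_traced_bytes(lambda: np.array(data, dtype=str).astype(object))

# Storing references to the existing str objects instead of re-creating them.
direct_peak = peak_traced_bytes(lambda: np.array(data, dtype=object))

print(f"round trip: {roundtrip_peak} bytes, direct: {direct_peak} bytes")
```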

jreback · 2020-08-17T14:34:52Z

thanks @topper-123

simonjayhawkins · 2020-08-17T14:38:34Z

@meeseeksdev backport 1.1.x

jorisvandenbossche · 2020-09-19T08:19:56Z

pandas/tests/arrays/string_/test_string.py

-    tm.assert_numpy_array_equal(a, original)
+
+    expected = nan_arr if copy else na_arr
+    tm.assert_numpy_array_equal(nan_arr, expected)


I am not sure this test is correctly changed. I think we should never mutate the input, whether copy is True or False (which is what it was testing before).
IMO the copy keyword is to indicate to simply always copy, or if False it will only copy when needed. And when you need to mutate, I would say the copy is needed.

It's also not taking a copy of the original array, so not even checking it wasn't changed (because even if it was changed, it would still compare equal to itself)
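The point about copying the original can be made concrete with a small sketch. `fill_missing` below is a hypothetical stand-in for the conversion under test (it is not the pandas function); it follows the semantics described in this comment, where copy=False still copies when a mutation is needed. The check snapshots the input before the call, so a mutation would actually be detected.

```python
import numpy as np


def fill_missing(values, na_value, copy=True):
    """Hypothetical stand-in: replace None entries without mutating the input."""
    mask = np.array([v is None for v in values], dtype=bool)
    # copy=False only avoids the copy when nothing has to be changed.
    result = values.copy() if (copy or mask.any()) else values
    result[mask] = na_value
    return result


def check_input_not_mutated(copy):
    arr = np.array(["a", None, "b"], dtype=object)
    original = arr.copy()  # snapshot taken BEFORE the call
    result = fill_missing(arr, na_value="<missing>", copy=copy)
    # Compare against the snapshot, not against the array itself.
    assert arr.tolist() == original.tolist()
    assert result.tolist() == ["a", "<missing>", "b"]


check_input_not_mutated(copy=True)
check_input_not_mutated(copy=False)
```

Comparing `arr` with itself would pass even if the function had mutated it, which is why the snapshot is needed.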

jreback requested changes Aug 2, 2020

simonjayhawkins added the Performance and Strings labels Aug 3, 2020

simonjayhawkins added this to the 1.1.1 milestone Aug 3, 2020

simonjayhawkins reviewed Aug 3, 2020

jreback requested changes Aug 3, 2020

topper-123 force-pushed the refactor_StringArray._from_sequence branch from 0562d08 to 1337e7e Compare August 4, 2020 07:10

jreback reviewed Aug 4, 2020

topper-123 force-pushed the refactor_StringArray._from_sequence branch from f69e74c to bfe387e Compare August 4, 2020 22:24

jreback approved these changes Aug 14, 2020

topper-123 added 9 commits August 16, 2020 18:40

REF: StringArray._from_sequence

df8e4d6

Use ensure_string_array in also in construct_1d_ndarray_preserving_na

ac7ee27

fix linting

887736a

fix copy param

61f3bd3

fix comments

c6afa1e

delete libs_.lib.astype_str

ce18bb9

correct input parameter type

9ef0355

Add ASVs

3db2884

cleanups

47b5d69

topper-123 force-pushed the refactor_StringArray._from_sequence branch from 52e53b1 to 47b5d69 Compare August 17, 2020 13:20

jreback merged commit 3b0f23a into pandas-dev:master Aug 17, 2020

meeseeksmachine mentioned this pull request Aug 17, 2020

Backport PR #35519 on branch 1.1.x (REF: StringArray._from_sequence, use less memory) #35770

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 17, 2020

Backport PR pandas-dev#35519: REF: StringArray._from_sequence, use le…

701dc79

…ss memory

topper-123 deleted the refactor_StringArray._from_sequence branch August 17, 2020 14:41

simonjayhawkins pushed a commit that referenced this pull request Aug 17, 2020

Backport PR #35519: REF: StringArray._from_sequence, use less memory (#…

66d08dc

…35770) Co-authored-by: Terji Petersen <contribute@tensortable.com>

dsaxton mentioned this pull request Aug 24, 2020

PERF: Vectorized string operations are slower than for-loops #35864

Open

3 tasks

This was referenced Sep 12, 2020

PERF: creating string Series/Arrays from sequence with many strings #36304

Merged

PERF: constructing string Series #36317

Merged

PERF: StringArray construction #36325

Merged

PERF: construct DataFrame with string array and dtype=str #36432

Merged

dsaxton mentioned this pull request Sep 19, 2020

BUG: conversion of float32 to string shows too much precision #36451

Closed

jorisvandenbossche reviewed Sep 19, 2020

dsaxton mentioned this pull request Oct 7, 2020

REGR: change in Series.astype(str) behavior for None #36904

Closed

topper-123 mentioned this pull request Oct 23, 2020

PERF: ensure_string_array with non-numpy input array #37371

Merged
