ENH: add to_records() option to output NumPy string dtypes, not objects #18146

Closed
jzwinck opened this issue Nov 7, 2017 · 5 comments · Fixed by #22229
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Output-Formatting __repr__ of pandas objects, to_string

@jzwinck
Contributor

jzwinck commented Nov 7, 2017

DataFrame.to_records() outputs string columns with the object dtype, which can be inefficient (e.g. for short, similar-length strings, or when storing with np.save(), which has to pickle object arrays). I wrote the following function to fix this:

def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.
    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1,  0.5, b'x'), (1, 2,  0.9, b'yyy')], 
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
    records = df.to_records()
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':
            length = df[name].str.len().max()
            descr[ii] = (name, 'S{}'.format(length))

    return records.astype(descr)

I suggest exposing something like this as an option in DataFrame.to_records(). An option to convert to Unicode ('U') would be good too (NumPy's 'S' is effectively bytes in Python 3).
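The Unicode variant suggested above could be sketched roughly as follows. This is a minimal illustration, not the code that was eventually merged: `to_records_unicode` is a hypothetical helper name, it drops the index for brevity, and it assumes a non-empty frame whose object columns contain only `str` values. It copies field by field into a preallocated array rather than calling `astype` on the whole record array.

```python
import numpy as np
import pandas as pd


def to_records_unicode(df):
    """Like df.to_records(index=False), but object fields become fixed-width 'U' fields.

    Hypothetical helper; assumes every object column holds only str values.
    """
    records = df.to_records(index=False)
    descr = []
    for name in records.dtype.names:
        field_dtype = records.dtype[name]
        if field_dtype == np.dtype(object):
            # Width of the longest string; at least 1 so the field is never zero-width.
            width = max(int(df[name].str.len().max()), 1)
            field_dtype = np.dtype("U{}".format(width))
        descr.append((name, field_dtype))

    # Copy field by field into a new structured array with the adjusted dtypes.
    out = np.empty(len(records), dtype=descr)
    for name in out.dtype.names:
        out[name] = records[name]
    return out.view(np.recarray)


df = pd.DataFrame({"a": [1, 2], "c": ["x", "yyy"]})
rec = to_records_unicode(df)
```

Note that a 'U' field uses four bytes per character in NumPy, so the bytes ('S') form is more compact when the data is known to be ASCII.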

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Output-Formatting __repr__ of pandas objects, to_string Enhancement labels Nov 7, 2017
@gfyoung
Member

gfyoung commented Nov 7, 2017

@jzwinck : Thanks for sharing this! Seems reasonable to add a parameter to .to_records() which attempts to optimize the dtype if possible. Have a look around and see how you can incorporate your idea.

@jreback
Contributor

jreback commented Nov 7, 2017

What would you propose to name this option?

@gfyoung
Member

gfyoung commented Nov 7, 2017

Well, in to_numeric we have a parameter called downcast, so either that or compress_dtype?

@jzwinck
Contributor Author

jzwinck commented Nov 8, 2017

It's different to downcast (which concerns narrowing). Maybe inplace_strings or direct_strings. Support for downcast in to_records() would be handy too, but that's a separate issue.
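For context, the narrowing that `downcast` performs in `pd.to_numeric` can be seen in a small sketch like the one below. The variable names are made up for illustration; the point is that `downcast` picks a smaller numeric type and has nothing to say about how strings are stored.

```python
import numpy as np
import pandas as pd

# A Series built from Python ints is stored as int64.
values = pd.Series([1, 2, 3])

# downcast="integer" narrows to the smallest integer dtype that holds the data.
narrowed = pd.to_numeric(values, downcast="integer")
```

Here `narrowed` ends up with an 8-bit integer dtype, whereas the option discussed in this issue changes the kind of storage (object versus fixed-width string) rather than the width of a numeric type.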

@gfyoung
Member

gfyoung commented Nov 8, 2017

True, I only threw it out there because of the similar names. But yeah, your suggestions are also reasonable.

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 7, 2018
This option records the dtype for string arrays as 'Sx', where x
is the length of the longest string, instead of 'O'
qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 7, 2018
This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'
qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 10, 2018
…ev#18146)

This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'
gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 19, 2018
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.
@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
gfyoung pushed a commit that referenced this issue Dec 30, 2018
… (#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments
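
The commit message above does not name the dtype parameters it added. To my knowledge, in released pandas (0.24 and later) they are `column_dtypes` and `index_dtypes` on `DataFrame.to_records()`. The sketch below assumes those names and shows how a caller might request fixed-width string fields and then save the result without pickling. The frame and the chosen widths are illustrative only.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 0.9], "c": ["x", "yyy"]})

# The caller picks the dtype (and therefore the width) for each column by name.
bytes_rec = df.to_records(index=False, column_dtypes={"c": "S3"})
unicode_rec = df.to_records(index=False, column_dtypes={"c": "U3"})

# With no object fields left, np.save works even with pickling disabled.
buffer = io.BytesIO()
np.save(buffer, unicode_rec, allow_pickle=False)
buffer.seek(0)
restored = np.load(buffer, allow_pickle=False)
```

As the second bullet of the commit message says, this design leaves the choice of width to the user. pandas does not compute the longest string itself, so a width that is too small would truncate values.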
thoo added a commit to thoo/pandas that referenced this issue Dec 30, 2018
* upstream/master:
  REF/TST: replace capture_stdout with pytest capsys fixture (pandas-dev#24501)
  BUG: fix .iat assignment creates a new column (pandas-dev#24495)
  DOC: add checks on the returns section in the docstrings (pandas-dev#23138) (pandas-dev#23432)
  ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229)
  TST: Skip db tests unless explicitly specified in -m pattern (pandas-dev#24492)
  Mix EA into DTA/TDA; part of 24024 (pandas-dev#24502)
  DOC: Fix building of a single API document (pandas-dev#24506)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
…s-dev#18146) (pandas-dev#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments