ENH: add to_records() option to output NumPy string dtypes, not objects #18146

Closed
jzwinck opened this issue Nov 7, 2017 · 5 comments

@jzwinck
Contributor

commented Nov 7, 2017

DataFrame.to_records() outputs string columns with the object dtype, which is sometimes not efficient (e.g. for short, similar-length strings, or when storing with np.save()). I wrote the following function to fix this:

import pandas as pd

def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.
    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1,  0.5, b'x'), (1, 2,  0.9, b'yyy')],
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
    records = df.to_records()
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':
            # Field width is the longest string in the column; int() keeps the
            # dtype string valid when max() returns a float (e.g. 3.0 with NaNs).
            length = int(df[name].str.len().max())
            descr[ii] = (name, 'S{}'.format(length))

    return records.astype(descr)

I suggest exposing something like this as an option in DataFrame.to_records(). An option to convert to Unicode ('U') would be good too (NumPy's 'S' is effectively bytes in Python 3).
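
The Unicode variant suggested above might look like the following sketch. The name `to_records_fixed` and its `kind` parameter are hypothetical and not part of pandas. The sketch omits the index for brevity and assumes every object column holds only strings (ASCII-only when `kind="S"`).

```python
import numpy as np
import pandas as pd


def to_records_fixed(df, kind="U"):
    """Build a recarray from df with object columns stored as fixed-width strings.

    kind="U" produces NumPy Unicode fields, kind="S" produces bytes fields.
    Hypothetical helper: assumes object columns contain only str values.
    """
    if kind not in ("U", "S"):
        raise ValueError("kind must be 'U' or 'S'")

    arrays, names = [], []
    for name in df.columns:
        col = df[name]
        values = col.to_numpy()
        if values.dtype == object:
            # Width is the character count of the longest string in the column.
            width = int(col.str.len().max())
            values = values.astype("{}{}".format(kind, width))
        arrays.append(values)
        names.append(str(name))

    # Field dtypes are taken from the per-column arrays.
    return np.rec.fromarrays(arrays, names=names)


df = pd.DataFrame({"a": [1, 2], "c": ["x", "yyy"]})
rec_unicode = to_records_fixed(df)           # field 'c' stored as 'U3'
rec_bytes = to_records_fixed(df, kind="S")   # field 'c' stored as 'S3'
```

Because no field has the object dtype any more, the result can be written with `np.save(..., allow_pickle=False)`.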

@gfyoung

Member

commented Nov 7, 2017

@jzwinck : Thanks for sharing this! Seems reasonable to add a parameter to .to_records() which attempts to optimize the dtype if possible. Have a look around and see how you can incorporate your idea.

@jreback

Contributor

commented Nov 7, 2017

What would you propose to name this option?

@gfyoung

Member

commented Nov 7, 2017

Well, in to_numeric we have a parameter called downcast, so either that or compress_dtype?

@jzwinck

Contributor Author

commented Nov 8, 2017

It's different to downcast (which concerns narrowing). Maybe inplace_strings or direct_strings. Support for downcast in to_records() would be handy too, but that's a separate issue.
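
To illustrate the distinction drawn here, a small sketch: `downcast` in `pd.to_numeric` narrows a numeric type, while the fixed-width string conversion replaces object pointers with inline character data. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Narrowing: the values stay integers, only the item size shrinks.
ints = pd.Series([1, 2, 3], dtype="int64")
narrowed = pd.to_numeric(ints, downcast="integer")

# Fixed-width strings: storage changes from references to Python objects
# to character data held inline in the array.
objs = np.array(["x", "yyy"], dtype=object)
inline = objs.astype("U3")

print(narrowed.dtype, objs.dtype, inline.dtype)
```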

@gfyoung

Member

commented Nov 8, 2017

True, I only threw it out there because of the similar names. But yeah, your suggestions are also reasonable.

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 7, 2018

ENH: Add string_as_bytes option for df.to_records() (pandas-dev#18146)
This option records the dtype for string arrays as 'Sx', where x
is the length of the longest string, instead of 'O'

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 7, 2018

ENH: Add strings_as_bytes option for df.to_records() (pandas-dev#18146)
This option records the dtype for string arrays as 'Sx', where x
is the length of the longest string, instead of 'O'

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 7, 2018

ENH: Add strings_as_bytes option for df.to_records() (pandas-dev#18146)
This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 8, 2018

ENH: Add strings_as_bytes option for df.to_records() (pandas-dev#18146)
This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 10, 2018

ENH: Add strings_as_bytes option for df.to_records() (pandas-dev#18146)
This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'

qinghao1 added a commit to qinghao1/pandas that referenced this issue Aug 10, 2018

ENH: Add strings_as_fixed_length option for df.to_records() (pandas-dev#18146)

This option changes DataFrame.to_records() dtype for string arrays
to 'Sx', where x is the length of the longest string, instead of 'O'

gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 19, 2018

ENH: Allow fixed-length strings in df.to_records()
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 26, 2018

ENH: Allow fixed-length strings in df.to_records()
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 28, 2018

ENH: Allow fixed-length strings in df.to_records()
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 29, 2018

ENH: Allow fixed-length strings in df.to_records()
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

gfyoung added a commit to qinghao1/pandas that referenced this issue Dec 30, 2018

ENH: Allow fixed-length strings in df.to_records()
Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018

gfyoung added a commit that referenced this issue Dec 30, 2018

ENH: Add strings_as_fixed_length parameter for df.to_records() (#18146) (#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments
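
As the commit message notes, the merged change leaves the choice of width to the caller. A minimal sketch of what that could look like, assuming pandas 0.24 or later. The parameter name `column_dtypes` (and its counterpart `index_dtypes` for index levels) comes from the pandas `DataFrame.to_records` documentation, not from this thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 0.9], "c": ["x", "yyy"]})

# The caller computes the field width; pandas does not infer it.
width = int(df["c"].str.len().max())

# Map column name -> dtype; columns not listed keep their default dtype.
rec = df.to_records(index=False, column_dtypes={"c": "U{}".format(width)})

print(rec.dtype)
```

Passing `"S{}".format(width)` instead would request a bytes field, which mirrors the original `to_records_plain` helper.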

thoo added a commit to thoo/pandas that referenced this issue Dec 30, 2018

Merge remote-tracking branch 'upstream/master' into read_excel-docstring
* upstream/master:
  REF/TST: replace capture_stdout with pytest capsys fixture (pandas-dev#24501)
  BUG: fix .iat assignment creates a new column (pandas-dev#24495)
  DOC: add checks on the returns section in the docstrings (pandas-dev#23138) (pandas-dev#23432)
  ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229)
  TST: Skip db tests unless explicitly specified in -m pattern (pandas-dev#24492)
  Mix EA into DTA/TDA; part of 24024 (pandas-dev#24502)
  DOC: Fix building of a single API document (pandas-dev#24506)

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019

ENH: Add strings_as_fixed_length parameter for df.to_records() (pandas-dev#18146) (pandas-dev#22229)

* ENH: Allow fixed-length strings in df.to_records()

Adds parameter to allow string-like columns to be
cast as fixed-length string-like dtypes for more
efficient storage.

Closes gh-18146.

Originally authored by @qinghao1 but cleaned up
by @gfyoung to fix merge conflicts.

* Add dtype parameters instead of fix-string-like

The original parameter was causing a lot of acrobatics
with regards to string dtypes between 2.x and 3.x.

The new parameters simplify the internal logic and
pass the responsibility and motivation of memory
efficiency back to the users.

* MAINT: Use is_dict_like in to_records

More generic than checking whether our
mappings are instances of dict.

Expands is_dict_like check to include
whether it has a __contains__ method.

* TST: Add test for is_dict_like expanded def

* MAINT: Address final comments
