append a categorical with different categories to the existing #12509

dneise · 2016-03-02T11:21:17Z

I just ran into the same problem as the person asking this question on SO

http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

Jeff gave an excellent answer as usual, I believe he is a pandas developer as well?

So I was wondering whether something like his answer might be planned to become the default behaviour when appending Categoricals.

jreback · 2016-03-02T12:34:16Z

this was discussed in #9927

would be ok with adding this as a sub-section in the Cookbook somewhere.

would you like to do a pull-request? you can point to the SO post and do a short in-line version.
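For readers of this thread, here is a minimal sketch of what such a short in-line version might look like. It assumes the goal is to combine two Categoricals whose category sets differ without losing the categorical dtype. The process names are made up for illustration, and this is not necessarily the approach taken in the SO answer.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals with different category sets (illustrative data).
a = pd.Categorical(["daq", "trigger", "daq"])
b = pd.Categorical(["trigger", "monitor"])

# Plain concatenation of Series whose categories differ
# does not keep the categorical dtype.
plain = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)

# union_categoricals merges the category sets first,
# so the combined result stays categorical.
combined = pd.Series(union_categoricals([a, b]))
```

The categories of the result are the categories of `a` followed by any new ones from `b`, in order of appearance.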

dneise · 2016-03-02T14:43:24Z

Thanks for the quick reply.

I'm a physicist having no experience in collaborating on such a big project as pandas. Sure I would like to gain some experience by improving the docs, but I need to learn how.

Also, the problem I have goes a tiny step further than the SO question. I am parsing log files (1.5k files with 100M lines in total, 12 GB altogether) into a DataFrame, so I can get some insight into our experiment. I am parsing the log files one by one and would like to append them to a table in an HDF5 file. Part of each log message is the name of the process that created it, and I know there are far fewer process names than lines. So I thought using Categoricals would be feasible here. (It might just be another form of efficient string storage... I don't know.)

I have no way of knowing the complete set of categories in advance. From your SO answer, I learned how to create subsequent Categoricals using an explicit set of categories, but I have not yet understood how, or whether, I can append those subsequent Categoricals to a table in an HDF5 file.

(I should add an example here)
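A minimal sketch of the per-file workflow described above, under stated assumptions: `encode_chunk` and the running `known` list are hypothetical bookkeeping, not pandas API, and the process names are invented. New names are appended at the end of `known`, so the integer codes assigned to earlier files stay valid. The sketch does not write to HDF5 and does not answer whether an existing table accepts enlarged categories on append.

```python
import pandas as pd


def encode_chunk(names, known):
    """Encode one file's process names against a growing category list.

    `known` is the list of names seen in earlier files (hypothetical
    bookkeeping kept outside pandas). Returns the categorical Series
    and the updated list.
    """
    known = list(known)
    seen = set(known)
    # dict.fromkeys keeps first-appearance order while dropping repeats.
    for name in dict.fromkeys(names):
        if name not in seen:
            known.append(name)
            seen.add(name)
    dtype = pd.CategoricalDtype(categories=known)
    return pd.Series(names, dtype=object).astype(dtype), known


known = []
first, known = encode_chunk(["daq", "trigger", "daq"], known)
second, known = encode_chunk(["monitor", "daq"], known)

# Re-express the older chunk with the full category list so both
# chunks share one dtype; concat then keeps the categorical dtype.
combined = pd.concat(
    [first.cat.set_categories(known), second], ignore_index=True
)
```

Because categories are only ever appended, `first.cat.codes` is unchanged by the later `set_categories` call, which is what makes storing codes per file and the name list separately a plausible workaround.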

jreback · 2016-03-03T02:03:08Z

docs for contributing are here

ngirase10 · 2023-11-28T21:44:17Z

Is this still open? Interested in working on this — thanks!

jonathanho168 · 2023-12-01T03:39:57Z

It seems like the overall idea @dneise wants to accomplish is to dynamically create and update a set of categories based on the data from multiple dataframes.

We haven't fully explored the codebase yet, but from a cursory exploration, it seems that there are two ways to potentially accomplish this:

1. Extend Categorical with a new method that takes in the same inputs as the constructor? Kind of like pandas.Categorical.from_codes but with a potentially incomplete set of categories, which can be added to later.
2. Revise the spec of pandas.factorize, so that it takes in an additional optional parameter -- we want to pass in a set of mappings that can be added to if we encounter new data values in the new dataframe.
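To make the factorize idea concrete, here is a rough sketch of the behaviour being proposed, written as a standalone helper. `factorize_with_known` is a hypothetical name and not part of pandas; the thread proposes this behaviour as an addition to `pandas.factorize`. The sketch relies only on existing calls (`Index.get_indexer`, `pandas.factorize`, `Index.append`, `Categorical.from_codes`) and ignores missing values.

```python
import numpy as np
import pandas as pd


def factorize_with_known(values, uniques=None):
    """Hypothetical helper: factorize `values`, reusing earlier `uniques`.

    Values already present in `uniques` keep their old codes; unseen
    values get new codes appended after the existing ones.
    """
    values = pd.Index(values)
    if uniques is None:
        uniques = pd.Index([], dtype=values.dtype)
    else:
        uniques = pd.Index(uniques)
    # get_indexer returns -1 for values not found in `uniques`.
    codes = np.array(uniques.get_indexer(values))
    new_mask = codes == -1
    if new_mask.any():
        new_codes, new_uniques = pd.factorize(values[new_mask])
        codes[new_mask] = new_codes + len(uniques)
        uniques = uniques.append(pd.Index(new_uniques))
    return codes, uniques


codes1, uniques = factorize_with_known(["daq", "trigger", "daq"])
codes2, uniques = factorize_with_known(["monitor", "daq", "monitor"], uniques)

# The codes can be turned back into a Categorical with the full mapping.
cat2 = pd.Categorical.from_codes(codes2, categories=uniques)
```

The first idea in the list has a close existing counterpart: `Categorical.add_categories` already lets categories be added after construction.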

jreback added the Docs, Difficulty Novice, and Categorical (Categorical Data Type) labels Mar 2, 2016

jreback added this to the 0.18.1 milestone Mar 2, 2016

jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016

jreback modified the milestones: 0.19.0, Next Major Release Sep 28, 2016

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

pdpark pushed a commit to pdpark/pandas that referenced this issue Jan 15, 2018

Doc: Adds example of categorical data for efficient storage and consistency across DataFrames Resolves pandas-dev#12509

5c05768

pdpark mentioned this issue Jan 15, 2018

Doc: Adds example of categorical data for efficient storage and consistency across DataFrames #19245

Closed

1 task

pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 3, 2018

Doc: Different example using categorical data type for efficient storage

428f9af

Closes: pandas-dev#12509

pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 18, 2018

Doc: Adds example of categorical data for efficient storage and consistency across DataFrames Resolves pandas-dev#12509

360e8a1

pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 18, 2018

Doc: Different example using categorical data type for efficient storage

fdc51c2

Closes: pandas-dev#12509

pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 18, 2018

Doc: Updated example of using categorical data type to save on storage space Resolves: pandas-dev#12509

5c2b355

jbrockmendel removed the Effort Low label Oct 21, 2019

jbrockmendel mentioned this issue Apr 15, 2022

API: make CategoricalIndex._concat consistent with pd.concat #41626

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jonathanho168 mentioned this issue Dec 12, 2023

Add optional parameter for pd.factorize to handle new categories in categorical data dynamically #56466

Closed

4 tasks
