append a categorical with different categories to the existing #12509

Open
dneise opened this issue Mar 2, 2016 · 5 comments

@dneise

dneise commented Mar 2, 2016

I just ran into the same problem as the person asking this question on SO:

http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

Jeff gave an excellent answer, as usual. I believe he is a pandas developer as well?

So I was wondering whether something like his answer might be planned to become the default behaviour when appending Categoricals.
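
For illustration, a minimal sketch of the situation described above. It assumes a pandas version that provides `pandas.api.types.union_categoricals`; the process names are made up. It is not taken from the linked answer, only one way to combine two Categoricals whose categories differ.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two Categoricals with different (overlapping) categories; names are hypothetical.
a = pd.Categorical(["reader", "writer", "reader"])
b = pd.Categorical(["writer", "monitor"])

# Plain concatenation of Series with differing categories
# does not keep the category dtype.
plain = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)

# union_categoricals builds one Categorical whose categories
# are the union of the inputs' categories.
combined = union_categoricals([a, b])

print(plain.dtype)
print(combined.categories)
```

The union has to be requested explicitly here, which is the step the question asks to have happen by default.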

@jreback jreback added this to the 0.18.1 milestone Mar 2, 2016
@jreback
Contributor

jreback commented Mar 2, 2016

this was discussed in #9927

would be ok with adding this as a sub-section in the Cookbook somewhere.

would you like to do a pull-request? you can point to the SO post and do a short inline version.

@dneise
Author

dneise commented Mar 2, 2016

Thanks for the quick reply.

I'm a physicist with no experience collaborating on a project as big as pandas. I would certainly like to gain some experience by improving the docs, but I need to learn how.

Also, the problem I have goes a small step further than the SO question. I am parsing log files (1.5k files with 100M lines in total, 12 GB altogether) into a dataframe so I can get some insight into our experiment. I parse the log files one by one and would like to append each one to a table in an HDF5 file. Part of each log message is the name of the process that created it, and I know there are far fewer process names than lines, so I thought using Categoricals would be feasible here. (It might just be another form of efficient string storage ... I don't know.)

I have no way of knowing the complete set of categories in advance. From your SO answer, I learned how to create each subsequent Categorical using an explicit set of categories. But I have not yet understood how, or whether, I can append those subsequent Categoricals to a table in an HDF5 file.
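
A rough sketch of the in-memory part of this workflow, under stated assumptions: the chunks and process names are invented stand-ins for parsed log files, and `known` is a hypothetical running list of the categories seen so far. The HDF5 step is deliberately left out, since whether a table can be appended to while its categories change is exactly the open question.

```python
import pandas as pd

# Hypothetical stand-ins for the process-name column of two parsed log files.
chunks = [
    ["daq", "trigger", "daq"],
    ["trigger", "cooling"],
]

known = []   # running list of every category seen so far (assumption)
parts = []
for names in chunks:
    # Grow the category list with any names not seen before.
    for name in names:
        if name not in known:
            known.append(name)
    # Build this chunk's Categorical against the explicit list known so far.
    parts.append(pd.Series(pd.Categorical(names, categories=list(known))))

# Earlier chunks were built with fewer categories; re-code all of them
# against the final list so that concatenation keeps the category dtype.
aligned = [part.cat.set_categories(known) for part in parts]
result = pd.concat(aligned, ignore_index=True)

print(result.dtype)
```

The re-coding at the end is the costly part for data already written to disk, because earlier chunks would need their categories updated once new names appear.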

(I should add an example here)

@jreback
Contributor

jreback commented Mar 3, 2016

docs for contributing are here

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Sep 28, 2016
pdpark pushed a commit to pdpark/pandas that referenced this issue Jan 15, 2018
pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 18, 2018
pdpark pushed a commit to pdpark/pandas that referenced this issue Feb 18, 2018
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@ngirase10

Is this still open? Interested in working on this — thanks!

@jonathanho168

It seems like the overall idea @dneise wants to accomplish is to dynamically create and update a set of categories based on the data from multiple dataframes.

We haven't fully explored the codebase yet, but from a cursory exploration, it seems that there are two ways to potentially accomplish this:

  1. Extend Categorical with a new method that takes the same inputs as the constructor, somewhat like pandas.Categorical.from_codes but with a potentially incomplete set of categories that can be added to later.
  2. Revise the spec of pandas.factorize so that it takes an additional optional parameter: a set of mappings that can be extended when new data values are encountered in a new dataframe.
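
The second idea can be sketched outside pandas as follows. `factorize_incremental` is a hypothetical helper written for this illustration, not an existing pandas function or parameter, and the process names are made up. It keeps a value-to-code mapping that grows as new values appear, so codes assigned in earlier batches stay valid.

```python
import numpy as np
import pandas as pd


def factorize_incremental(values, mapping):
    """Return integer codes for values, adding unseen values to mapping.

    Hypothetical helper: mapping is a dict from value to code and is
    updated in place, so it can be reused across batches.
    """
    codes = np.empty(len(values), dtype=np.int64)
    for i, value in enumerate(values):
        # A new value receives the next free code; a known value keeps its code.
        codes[i] = mapping.setdefault(value, len(mapping))
    return codes


mapping = {}
first = factorize_incremental(["daq", "trigger", "daq"], mapping)
second = factorize_incremental(["trigger", "cooling"], mapping)

# Codes from all batches refer to the same mapping, so they can be joined
# and turned into one Categorical at the end.
all_codes = np.concatenate([first, second])
combined = pd.Categorical.from_codes(all_codes, categories=list(mapping))

print(first, second)
print(combined)
```

Because existing codes never change, only the integer codes would need to be appended per batch, with the mapping stored once alongside them.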
