Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New DataFrame feature: listify() and unlistify() #10511

Closed
dov opened this issue Jul 4, 2015 · 20 comments · Fixed by #27267
Closed

New DataFrame feature: listify() and unlistify() #10511

dov opened this issue Jul 4, 2015 · 20 comments · Fixed by #27267
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@dov
Copy link

dov commented Jul 4, 2015

While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:

  1. Add a new column whose values at each cell is a python list of the values the new column takes on.
  2. Unlistify the column thereby creating a new row for each element in the above lists.

I.e. I propose two new DataFrame methods, listify() and unlistify().

listify(df, column): Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column values.

unlistify(df, column): Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column for each row and generate a new row for each value.

The functions may be expanded to support multiple columns. listify() may e.g. support a post processing function, that will be applied on the list.

The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.

#!/usr/bin/python

import pandas as pd

def listify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  old_index = df.index
  col_idx = matches[0] + len(old_index.shape) # Since we will reset the index

  column_names = list(df.index.names) + list(df.columns)
  gb_cols = [c for c in column_names
             if c!= column]

  # Helper function to generate the squashed dataframe
  def fnc(d):
    row = list(d.values[0])
    return pd.DataFrame([row[:col_idx]
                         + [[v[col_idx] for v in list(d.values)]]
                         + row[col_idx+1:]])

  return (df
          .reset_index()
          .groupby(gb_cols)
          .apply(fnc)
          .rename(columns = lambda i : column_names[i])
          .set_index(old_index.names)
          )

def unlistify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  col_idx = matches[0]

  # Helper function to expand and repeat the column col_idx
  def fnc(d):
    row = list(d.values[0])
    bef = row[:col_idx]
    aft = row[col_idx+1:]
    col = row[col_idx]
    z = [bef + [c] + aft for c in col]
    return pd.DataFrame(z)

  col_idx += len(df.index.shape) # Since we will push reset the index
  index_names = list(df.index.names)
  column_names = list(index_names) + list(df.columns)
  return (df
          .reset_index()
          .groupby(level=0,as_index=0)
          .apply(fnc)
          .rename(columns = lambda i :column_names[i])
          .set_index(index_names)
          )


# Examples of how to listify and unlistify a column.

df = pd.DataFrame([[11,range(5),10],
                   [22,range(3),20]],
                   columns = ['A','B','C']).set_index('C')
print 'org'
print df
print '--'
df = unlistify(df,'B')
print 'unlistify(df,B)'
print df
print '--'
df = listify(df,'B')
print 'listify(df,B)'
print df

The corresponding output:

org
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]
--
unlistify(df,B)
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2
--
listify(df,B)
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]
@shoyer
Copy link
Member

shoyer commented Jul 5, 2015

How do you create these data frames with nested lists? My sense is that there is almost always a better way.

I'm reluctant to expand the dataframe API for new methods unless they are broadly useful -- this is a large part of why we added the .pipe method.

@dov
Copy link
Author

dov commented Jul 5, 2015

My use case is the use of a dataframe through various part of an experiment, containing planning, execution, data collection and analysis. At the planning stage when building the initial data frame I still don't know what one of the columns -- containing "perpendicular" (data independent from all the other columns) data will contain. Once I know it, I want to insert the new data. I.e. I want to go from:

     A 
C                      
10  11
20  22

to:
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2

With the unlistify() function this is trivial:

df['B'] = [range(5),range(3)]
df = df.unlistify('B')

But perhaps there is a different simple way that I have missed?

I don't see how the pipe() method is relevant as it seems that it is just syntactic sugar for doing right-hand chaining of operators instead of left-hand function calls. I agree that it can be used together with my listify() and unlistify() implementations. If my proposal is not accepted, I will certainly use my implemtations, with or without pipe.

@shoyer
Copy link
Member

shoyer commented Jul 5, 2015

Maybe you could show an example of how you make the dataframe? Presumably you're not reading it in from CSV. I guess my thought is that you might be able to easily create the "unlisted" dataframe in the first place.

Pipe is indeed a side point here. Mostly I mentioned it to point out that we are trying to put other libraries on equal footings to what we put in pandas proper.

@shoyer
Copy link
Member

shoyer commented Jul 5, 2015

For example, suppose the number of items in the lists depends on the other columns, e.g., suppose df['c'] = [list(range(a, b)) for a, b in zip(df.a, df.b)]. It might seem natural to use unlistify in this context. But you might just as easily write df = pd.concat([pd.DataFrame({'a': a, 'b': b, 'c': list(range(a, b))}) for a, b in zip(df.a, df.b)].

@dov
Copy link
Author

dov commented Jul 5, 2015

Thanks for the pd.concat() example. It is indeed simpler than my use of groupby(). But still if e.g. df has 10 columns before adding a new column, the use of pd.concat() becomes quite cumbersome.

In any case, I think it is a legitimate use case to start off with a a few dimensions and then unlisting new dimensions to add additional complexity. I don't think it is justified to force the user to declare all dimenions in advance. Initially I thought of adding a function to MultiIndex.add_factor() that would allow adding a new factor (like adding a new dimension to MultiIndex.from_product()). But I think the unlistify() concept is more general.

@jreback
Copy link
Contributor

jreback commented Jul 5, 2015

@dov you are just using a much less-efficient form of multi-index. You lose all performance and indexing with the list/unlistify. Mainly the mixed-list structures force you to be in object dtype. Further I am not sure that the list-structure actually allows you to do any operations easily. Everything first requires conversions, again leading to less that desirable performance / storage characteristics.

Compare to a regular multi-index

In [24]: df.reset_index().set_index(['C','B'])
Out[24]: 
       A
C  B    
10 0  11
   1  11
   2  11
   3  11
   4  11
20 0  22
   1  22
   2  22

In [28]: df2 = df.reset_index().set_index(['C','B'])

In [29]: df2.loc[10]
Out[29]: 
    A
B    
0  11
1  11
2  11
3  11
4  11

In [30]: df2.dtypes
Out[30]: 
A    int64
dtype: object

In [31]: df2.index
Out[31]: 
MultiIndex(levels=[[10, 20], [0, 1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2]],
           names=[u'C', u'B'])


Not sure why one would prefer a non-native structure that would have not any advantages (and several key disadvantages) over the multi-index structure.

Maybe you could shed some light on why you are not using a multi-index structure.

@dov
Copy link
Author

dov commented Jul 5, 2015

I think the difference is that I see the DataFrame as something evolving in which you don't see the whole picture at the time of its construction. I will try to give an example. Let's say that you have an xy-table with a camera pictures of a plate filled with micro organisms. The field of view is much smaller than the plate. The goal is to image processing and classification of the micro organisms in the images.

Here are the steps that needs to be carried out:

  1. Create a plan containing x,y coordinates of where to snapshot the images. These may be stored in two columns of a dataframe.
  2. Take the images. The references to the images may be stored in a third column of the dataframe in 1.
  3. Do image detection. Each image generates a list of microorganisms with properties such as size x,y position etc. Each organism's properties should be stored on a separate line in a dataframe.

Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property.

In this sense the dataframe is like a logbook for the experiment. It grows and possibly contracts as the experiment progresses.

This is my goal. The idea of the unlistify() function was just a means of doing this. shoyer showed me the same functionality can be achieved through pd.concat().

@shoyer
Copy link
Member

shoyer commented Jul 6, 2015

Of course the dataframe in 3 could be created with references back to the image dataframe of 1. But it may make more sense to expand the dataframe in 1 to make room for the detected image property.

I think the idiomatic way to do such an operation in pandas would be to use Database-style DataFrame joining/merging.

@libbkmz
Copy link

libbkmz commented Apr 22, 2016

I'll save it here for better search for this issue.

This SO answer explain how to do this thing really easily.
http://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list

@jreback
Copy link
Contributor

jreback commented Apr 22, 2016

as said above, this is not good data management, nor likely to be supported by pandas. this leads to really inefficient representations of data.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 22, 2016
@libbkmz
Copy link

libbkmz commented Apr 22, 2016

@jreback Yeah, I agree with you. But, when I'm getting data from MongoDB with really nested structure, and I want to make some columns for data processing, this is only way to preprocess data and save shape references between origianl data and preprocessed dataset

@jreback
Copy link
Contributor

jreback commented Apr 22, 2016

@libbkmz then you need a proper layer in between. you can certainly pre-process using pandas, but de-listifying (ala exploding) encourages poor data representation in the first place :<

@NirantK
Copy link

NirantK commented Jan 8, 2018

Hello @jreback , just wanted to ask if you want to close this issue? Do you want to tag it as won't fix for now?

@chaxor
Copy link

chaxor commented Jan 17, 2018

I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue. I fully understand that it is less efficient and there are likely better dataframe architectures. However, I (and looking at the number of stackoverflow questions on this - many, many others) have run into this an astounding number of times.
Many times this is not even the developer that makes this decision but is handed some datatype which translates to a dataframe with this structure. Also, you can always make the argument of efficiency / performance for architecture differences and ultimately end up not using interpreted languages at all, but here we are. Python is used in many communities because of it's ability to do higher level functions, not because of it's performance.

Therefore, I would think it highly useful to have at least an unlistify method for dataframes, as it would make things much faster than trying to find the most efficient / least amount of lines of code stackoverflow question every time you run into this issue in different situations. That way people at least have a very simple method baked into pandas to 'correct' some of the bad architecture yielded by other data structures given to them.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2018

@chaxor

I would think that the number of people that run into this problem of having lists or iterable structures within cells of a dataframe would give credence to this issue.

not sure what to make of that statement. pandas has almost 2200 issues, and no full time folks working on. People prioritize what they will. Do you want to submit a pull-request to fix this? if so great.

@dmarinav
Copy link

Pandas might have 2200 issues (like any other software or app nowadays), but I believe this one is very important when dealing with unstructured data. I work with unstructured data, and listify() and unlistify() can be very handy here and save lots of valuable time.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2018

@dmarinav and you are welcome to submit a fix. what folks choose to work on is pretty much up to them.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 1, 2018

FWIW, I can see someone building a JSONArray on top of #19268 (probably not within pandas). I think that + a custom .json accessor would make for a useful library. Users could store nested data in that array type, and have a bunch of methods at only apply to nested data, like listify and unlistify.

@summerela
Copy link

I end up having to do this kind of thing all the time.. and it's a complete PITA. Having a feature that allows us to "explode" a column containing lists into multiple rows would be wonderful.

@orenovadia
Copy link

+1

I stumble upon a need for this many times (and made package with just this function).

I think that explode might be a better name for unlistify (see spark).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants