Nested dataframe not behaving like dplyr nested dataframe #93

Closed
cah-dipanjan opened this issue Mar 17, 2022 · 2 comments
Labels
question Further information is requested

cah-dipanjan commented Mar 17, 2022

Code -

import pandas as pd
from datar.all import *

df1 = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [11, 12, 13, 14], "z": [21, 22, 23, 24]}
)

df1 = df1 >> nest(data1=~f.x)


df2 = pd.DataFrame(
    {"x": [1, 2, 2, 6], "y": [11, 12, 10, 14], "l": [21, 22, 23, 24]}
)

df2 = df2 >> nest(data2=~f.x)


df = (
    df1
    >> nest_join(df2)
    >> rename(data2=f._y_joined)
    >> group_by(f.x)
    >> mutate(ct=f.data2.size)
    >> ungroup()
)


df

Result -

x data1 data2 ct
1 <DF 1x2> <DF 1x1> <bound method GroupBy.size of <pandas.core.gro...
2 <DF 1x2> <DF 1x1> <bound method GroupBy.size of <pandas.core.gro...
3 <DF 1x2> <DF 0x1> <bound method GroupBy.size of <pandas.core.gro...
4 <DF 1x2> <DF 0x1> <bound method GroupBy.size of <pandas.core.gro...

Expected -

x data1 data2 ct
1 <DF 1x2> <DF 1x1> 1
2 <DF 1x2> <DF 1x1> 1
3 <DF 1x2> <DF 0x1> 0
4 <DF 1x2> <DF 0x1> 0

pwwang (Owner) commented Mar 17, 2022

Looks like you want to get the number of rows of nested frames in column data2?

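f.data2.size (which should actually be called, as f.data2.size()) gets the size of the series in each group, and that is 1 for every group. Without the parentheses, the bound method itself ends up in the column, which is what your result shows.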
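
To illustrate the difference, here is a minimal plain-pandas sketch. It leaves datar out entirely, and the frame and column values are made up for the example, so treat it as an approximation of what happens under group_by rather than the exact internals:

```python
import pandas as pd

# Illustrative data only, not the frames from the issue.
df = pd.DataFrame({"x": [1, 2, 3, 4], "data2": ["a", "b", "c", "d"]})
grouped = df.groupby("x")["data2"]

# Without parentheses this is only a reference to the method, not a result.
is_method = callable(grouped.size)
print(is_method)

# Calling it returns the number of rows in each group, not the size of
# whatever object is stored in the cells.
sizes = grouped.size()
print(sizes.tolist())
```

Since every value of x is unique here, each group holds exactly one row, so the group sizes say nothing about the nested frames themselves.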

What you need to do is:

>>> df = (
...     df1
...     >> nest_join(df2)
...     >> rename(data2=f._y_joined)
...     >> group_by(f.x)
...     >> mutate(ct=f.data2.transform(lambda x: nrow(x.iloc[0])))
...     >> ungroup()
... )
>>> df
        x     data1     data2      ct
  <int64>  <object>  <object> <int64>
0       1  <DF 1x2>  <DF 1x1>       1
1       2  <DF 1x2>  <DF 1x1>       1
2       3  <DF 1x2>  <DF 0x1>       0
3       4  <DF 1x2>  <DF 0x1>       0

Explanation of mutate(ct=f.data2.transform(lambda x: nrow(x.iloc[0]))):

  • f.data2 is the column holding the nested frames. Because the whole frame is grouped, it is evaluated as a SeriesGroupBy object.
  • .transform() maps the nested frame in each group to its number of rows.
  • lambda x: nrow(x.iloc[0]) returns the number of rows of the nested frame in each group of the f.data2 column. Note that x is a Series with the nested frame as its only element. However, that Series keeps its original row label (0 to 3 across the groups), so x[0] can't be used directly to get the element; x.iloc[0] is needed instead.
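
The same idea can be sketched in plain pandas, without datar. This is only an illustration under assumptions: the nested frames below are hypothetical stand-ins for the nest_join result, and len() is used in place of nrow():

```python
import numpy as np
import pandas as pd

# Hypothetical nested frames: two with one row, two with no rows.
nested = [
    pd.DataFrame({"y": [11]}),
    pd.DataFrame({"y": [12]}),
    pd.DataFrame({"y": []}),
    pd.DataFrame({"y": []}),
]

# Store the frames in an object array so that each cell holds one frame.
cells = np.empty(len(nested), dtype=object)
for i, frame in enumerate(nested):
    cells[i] = frame

df = pd.DataFrame({"x": [1, 2, 3, 4]})
df["data2"] = cells

grouped = df.groupby("x")["data2"]

# Each group's Series keeps its original row label, so only the first
# group has a label 0. Positional access with .iloc[0] works for all groups.
labels = [group.index.tolist() for _, group in grouped]
print(labels)

# Map the single nested frame in each group to its row count.
ct = grouped.transform(lambda s: len(s.iloc[0]))
print(ct.tolist())
```

The labels list shows why x[0] breaks: it is a label lookup, and the label 0 exists only in the first group.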

pwwang added the question label Mar 17, 2022

cah-dipanjan commented Apr 6, 2022

Thanks a lot for the explanation!
