Do you support two DataFrame operations now? #83

banduoba · 2021-12-23T15:13:35Z

banduoba
Dec 23, 2021

Is the use of pdpipe supported if there are two or more DataFrames?


shaypal5 · 2021-12-24T12:55:40Z

shaypal5
Dec 24, 2021
Maintainer

It depends on what exactly you mean. Do you mean applying a pipeline to several dataframes one after the other, or operations that by nature operate on two or more dataframes?

0 replies

banduoba · 2021-12-24T14:21:56Z

banduoba
Dec 24, 2021
Author

For example: df.drop(df[df['col1'].isin(df_other['col1'])].index). This is what I mean by using two or more DataFrames.

I'd also like to know: if I want to do df['col1'].fillna(0), which method should I use? I could not get it to work through pdpipe.df. Is this built-in method not supported?

0 replies

shaypal5 · 2021-12-29T09:59:55Z

shaypal5
Dec 29, 2021
Maintainer

Processing two dataframes

If one of the dataframes is an input dataframe to a pipeline, and the other is either a static one, or is completely derived from the input, then yes - pdpipe is adequate.

If, on the other hand, both dataframes are input dataframes to a pipeline, then no, pdpipe is currently not built to handle such joint operations. This is because it's really not clear how to streamline such pipelines. I have some ideas, but this basically requires multi-input pipelines that must have some merge/join stages, and possibly multiple outputs.
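For the first case (one input dataframe, one static dataframe), here is a minimal sketch in plain pandas of the operation from the question. It is not pdpipe API: the helper name `make_drop_known_rows` and the sample data are made up for illustration. The idea is only that the static dataframe is captured once, so the resulting step takes a single input.

```python
import pandas as pd


def make_drop_known_rows(static_df, column):
    """Build a single-input step that drops rows whose `column` value
    appears in `static_df[column]`. (Hypothetical helper, plain pandas.)"""
    # The static dataframe is read once, when the step is built.
    known = set(static_df[column])

    def step(df):
        # For a unique index this keeps the same rows as
        # df.drop(df[df[column].isin(...)].index), written as a boolean filter.
        return df[~df[column].isin(known)]

    return step


# Static dataframe, fixed when the step is defined.
df_other = pd.DataFrame({"col1": [2, 3]})
drop_known = make_drop_known_rows(df_other, "col1")

# Input dataframe, supplied each time the step runs.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": list("abcd")})
result = drop_known(df)
```

The returned `step` only ever sees one dataframe, which is the shape a single-input pipeline expects.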

Using the `fillna` method

Regarding your second question, pdpipe.df.fillna() actually works great:

>>> import numpy as np; import pandas as pd; import pdpipe as pdp
>>> df = pd.DataFrame([[23, np.nan, 1, None], [19, 'Bo', 3, '4$'], [15, 'Di', -2, '53.2$'], [5, 'Mo', 3, '200,00$']], columns=['age', 'name', 4, 'salary'], index=[0, 1, 2, 3])
>>> df
   age name  4   salary
0   23  NaN  1     None
1   19   Bo  3       4$
2   15   Di -2    53.2$
3    5   Mo  3  200,00$
>>> nafiller = pdp.df.fillna(value=0)
>>> nafiller
PdPipelineStage: Apply dataframe method fillna with kwargs {'value': 0}
>>> nafiller(df)
   age name  4   salary
0   23    0  1        0
1   19   Bo  3       4$
2   15   Di -2    53.2$
3    5   Mo  3  200,00$

0 replies

banduoba · 2021-12-29T14:50:04Z

banduoba
Dec 29, 2021
Author

0 replies

banduoba · 2021-12-29T14:50:15Z

banduoba
Dec 29, 2021
Author

@shaypal5

0 replies

shaypal5 · 2022-01-02T08:14:42Z

shaypal5
Jan 2, 2022
Maintainer

The df attribute of the pdpipe package (or pdp if you import pdpipe as pdp) is a handle for all the wrappers that wrap pandas.DataFrame methods as pdpipe.PdPipelineStage constructors.

So when you write pdp.df you're not getting the same object - a dataframe - that you defined using df = .... Similarly, the dtypes attribute yields different results for different dataframe objects: it is a property, which is a dynamic kind of attribute in Python.

So I think you should ramp up a bit more on how attributes, submodules and properties work in Python.

Additionally, this is not the kind of use that makes sense for pdpipe. This package is of use when you want to define a pipeline object that is independent of a dataframe. I don't see a lot of use for it when you have a dataframe loaded in a notebook and you want to process it - why not just use the pandas methods?

Also, if you want to get my example to work, why didn't you just use it as I gave it to you? You changed it in a way that makes no sense.

Again, this is how you initialize a stage that fills na values:

nafiller = pdp.df.fillna(value=0)
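To illustrate why a call like this yields a stage rather than a filled dataframe, here is a toy sketch of the "handle" idea. It is not how pdpipe is actually implemented: the classes `MethodStage` and `DataFrameMethodHandle` are hypothetical stand-ins, written only to show that attribute access on such a handle returns a constructor and that no data is involved until the stage is applied.

```python
import pandas as pd


class MethodStage:
    """Toy stand-in for a pipeline stage: remembers a DataFrame method name
    and keyword arguments, and applies them to whatever frame it is given."""

    def __init__(self, method_name, **kwargs):
        self.method_name = method_name
        self.kwargs = kwargs

    def __call__(self, df):
        # Look the method up on the dataframe passed in at call time.
        return getattr(df, self.method_name)(**self.kwargs)


class DataFrameMethodHandle:
    """Toy stand-in for a handle like pdp.df: attribute access returns a
    stage constructor, not a dataframe."""

    def __getattr__(self, method_name):
        if not hasattr(pd.DataFrame, method_name):
            raise AttributeError(method_name)

        def constructor(**kwargs):
            return MethodStage(method_name, **kwargs)

        return constructor


handle = DataFrameMethodHandle()
nafiller = handle.fillna(value=0)  # a stage object; no data is touched yet

data = pd.DataFrame({"age": [23.0, None], "score": [1.0, 2.0]})
filled = nafiller(data)  # the stage is applied to a concrete dataframe here
```

The same `nafiller` object can be applied to any number of dataframes, which is the point of defining the stage independently of the data.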

0 replies

banduoba · 2022-01-03T12:30:08Z

banduoba
Jan 3, 2022
Author

Thank you for your answer.

0 replies

shaypal5 · 2022-01-03T16:03:54Z

shaypal5
Jan 3, 2022
Maintainer

No problem. Thank you for taking an interest in my project, and for taking the time to ask me about it. :)

0 replies

shaypal5 · 2022-02-22T08:17:27Z

shaypal5
Feb 22, 2022
Maintainer

Update: pdpipe supports multi-dataframe operations using application context objects:

See the pipeline stage here:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/pipeline.py#L20

It gets a product-review dataframe as context, used to enrich some of the rows in the input dataframe with product sentiment features.

In another file, on fit-transform the pipeline is provided with the train review dataframe:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/buyer.py#L112

And on transform it is provided with the rollout/holdout review dataframe:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/buyer.py#L283

If you want the same dataframe to be merged/joined to input dataframes each time, you should supply it to the fit_context parameter of the pipeline (note: it will be kept in memory as part of the pipeline, and written to disk if the pipeline is serialized). If you want a different one used depending on the specific application of the (possibly fitted) pipeline, provide it to the application_context parameter each time.
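The difference between a context stored with the pipeline and one passed per application can be sketched with a toy class. This is a conceptual illustration in plain pandas only: `ToyContextPipeline`, `fixed_context` and `call_context` are hypothetical names, not pdpipe's API, and the precedence rule used here is an assumption of the sketch.

```python
import pandas as pd


class ToyContextPipeline:
    """Hypothetical mini-pipeline (not pdpipe): left-joins a context
    dataframe onto the input on a shared key column."""

    def __init__(self, key, fixed_context=None):
        self.key = key
        # A context given here is stored on the object and reused on every call.
        self.fixed_context = fixed_context

    def apply(self, df, call_context=None):
        # In this sketch, a context given per call takes precedence.
        context = call_context if call_context is not None else self.fixed_context
        if context is None:
            raise ValueError("no context dataframe available")
        return df.merge(context, on=self.key, how="left")


orders = pd.DataFrame({"product": ["a", "b"], "qty": [1, 2]})
train_reviews = pd.DataFrame({"product": ["a", "b"], "sentiment": [0.9, 0.1]})
holdout_reviews = pd.DataFrame({"product": ["a", "b"], "sentiment": [0.5, 0.4]})

pipeline = ToyContextPipeline(key="product", fixed_context=train_reviews)
with_stored = pipeline.apply(orders)  # uses the stored context
with_per_call = pipeline.apply(orders, call_context=holdout_reviews)  # per-call context
```

The stored context travels with the pipeline object, while the per-call context has to be supplied again on every application.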

2 replies

banduoba Feb 23, 2022
Author

Thank you very much, it's great!

shaypal5 Feb 23, 2022
Maintainer

Yay! :)
So glad to help.
