Accessing a complete series when using ApplyByCols #106

vecorro · 2021-11-06T18:54:36Z

vecorro
Nov 6, 2021

I have a transformation on a text column where I'd like to limit the number of words in each string. The limit should be the average number of words (mean) plus 2x the standard deviation (std) of the number of words. To achieve this, I need to scan all the text values in the series, calculate the mean and std, and then perform the transformation that limits the number of words in each string. I tried several things, but none of them worked. I had to do that transform as an additional step outside the pdpipe pipeline.

thanks!
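For reference, the out-of-pipeline version of this transform might look like the following minimal sketch in plain pandas. The helper name `clip_text_column`, its parameters, and the sample data are hypothetical and not taken from the original post; the sketch assumes words are separated by single spaces and relies on pandas' default sample standard deviation (ddof=1).

```python
import pandas as pd


def clip_text_column(
    df: pd.DataFrame, column: str = "text", n_std: float = 2.0
) -> pd.DataFrame:
    """Clip every string in `column` to int(mean + n_std * std) words.

    Hypothetical helper for illustration; not part of pdpipe.
    """
    # Word count per row, splitting on single spaces.
    nwords = df[column].str.split(" ").str.len()
    # The limit depends on the whole series, not on a single value.
    lim = int(nwords.mean() + n_std * nwords.std())
    res = df.copy()
    res[column] = res[column].map(lambda txt: " ".join(txt.split(" ")[:lim]))
    return res


# Made-up sample data: word counts are 2, 12, 1, 2, 3, 1.
df = pd.DataFrame(
    {"text": ["a b", "a b c d e f g h i j k l", "a", "a b", "a b c", "a"]}
)
clipped = clip_text_column(df)
```

The key point is that the limit is computed from the full column before any single value is transformed, which is what a purely element-wise stage cannot do on its own.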

Answered by shaypal5

shaypal5 · 2021-11-15T15:47:33Z

shaypal5
Nov 15, 2021
Maintainer

I think this can already be done with current stages, but I can't get to it right now. This is on the top of my list.

shaypal5 · 2022-07-03T15:09:14Z

shaypal5
Jul 3, 2022
Maintainer

Hey @vecorro !

Sorry for the extremely belated response.
I still wanted to address this, for future users.

Ok, if we first define this:

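import pandas as pd
import pdpipe as pdp

class SentenceClipper:
  """Callable that keeps only the first `lim` space-separated words of a string."""
  def __init__(self, lim: int) -> None:
    self.lim = lim
  def __call__(self, text: str) -> str:
    words = text.split(' ')
    words = words[:self.lim]
    return ' '.join(words)

class SentenceColumnClipper(pdp.PdPipelineStage):
  """Clips the 'text' column to the word limit stored in the application context."""

  def _prec(self, df: pd.DataFrame) -> bool:
    # The stage only applies to dataframes that have a 'text' column.
    return 'text' in df.columns

  def _transform(self, df: pd.DataFrame, verbose: bool) -> pd.DataFrame:
    # The limit is put in the application context by an earlier stage.
    lim = self.application_context['nwords_lim']
    print(f'Calculated word limit: {lim}')
    clipper = SentenceClipper(lim=lim)
    res_df = df.copy()
    res_df['text'] = res_df['text'].map(clipper)
    return res_df

text_pline = pdp.PdPipeline([
  # Count the words in each text and store the counts in a helper column.
  pdp.MapColVals(
    columns='text',
    value_map=lambda txt: len(txt.split(' ')),
    result_columns='nwords',
    drop=False,
  ),
  # Compute the limit from the whole 'nwords' series: here mean + 1 * std.
  # Change the factor to 2 for the limit described in the question.
  pdp.ApplicationContextEnricher(
    nwords_lim=lambda X: int(X['nwords'].mean() + 1 * X['nwords'].std()),
  ),
  SentenceColumnClipper(),
  # Drop the helper column once the text has been clipped.
  pdp.ColDrop('nwords'),
])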

Then this works:

>>> textdf = pd.DataFrame(
...   data=[
...     [23, 'help me'],
...     [19, 'dont touch that please you dumb person'],
...     [15, 'yes!'],
...     [5, 'no way'],
...   ],
...   columns=['age', 'text'],
...   index=[0, 1, 2, 3],
... )
>>> textdf
   age                                    text
0   23                                 help me
1   19  dont touch that please you dumb person
2   15                                    yes!
3    5                                  no way
>>> text_pline.fit_transform(textdf)
   age                        text
0   23                     help me
1   19  dont touch that please you
2   15                        yes!
3    5                      no way
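
As a sanity check on the output above, the limit can be recomputed by hand. The sketch below uses only the standard library and assumes that pandas' `Series.std()` is used with its default sample standard deviation (ddof=1), which `statistics.stdev` matches. The variable names are illustrative.

```python
import statistics

texts = [
    "help me",
    "dont touch that please you dumb person",
    "yes!",
    "no way",
]

# Word counts per row: [2, 7, 1, 2]
nwords = [len(t.split(" ")) for t in texts]

mean = statistics.mean(nwords)   # 3
std = statistics.stdev(nwords)   # sample std: sqrt(22 / 3), about 2.708
lim = int(mean + 1 * std)        # int(5.708...) == 5

# Keep at most `lim` words per text, as SentenceClipper does.
clipped = [" ".join(t.split(" ")[:lim]) for t in texts]
```

Only the second row has more than 5 words, so it is the only one clipped. With the factor of 2 from the original question, the limit would be int(3 + 2 * 2.708) = 8, and none of these four rows would change.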

Please reopen this issue if this doesn't work for you.

I'm also converting it to a discussion so it can be used as a reference for this type of thing.

I'm working on a way to streamline the use of the application context, to make this kind of thing more natural and to avoid defining custom pipeline stages as much as possible.
