I have a transformation on a text column where I'd like to limit the number of words per string. The limit is the mean number of words plus 2x the standard deviation of the number of words. To achieve this, I need to scan all the text values in the series, calculate the mean and std, and then apply the transformation that clips each string to that number of words. I tried several things but none of them worked, so I had to run that transform as an additional step outside the pdpipe pipeline. Thanks!
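For reference, the standalone step described above (outside any pipeline) might look like the following minimal sketch in plain pandas. The function names `clip_words` and `clip_column_by_word_stats` and the `n_std` parameter are hypothetical, introduced only for illustration. It splits on any whitespace and assumes the column has at least two rows, since the sample standard deviation is undefined otherwise.

```python
import pandas as pd


def clip_words(text: str, limit: int) -> str:
    """Keep at most `limit` whitespace-separated words of `text`."""
    return " ".join(text.split()[:limit])


def clip_column_by_word_stats(
    df: pd.DataFrame, column: str = "text", n_std: float = 2.0
) -> pd.DataFrame:
    """Return a copy of `df` with `column` clipped to mean + n_std * std words."""
    # First pass: count the words in every row.
    nwords = df[column].str.split().str.len()
    # Word limit derived from the whole column (pandas uses the sample std, ddof=1).
    limit = int(nwords.mean() + n_std * nwords.std())
    # Second pass: clip each string to that limit.
    out = df.copy()
    out[column] = out[column].map(lambda text: clip_words(text, limit))
    return out
```

The two passes are the crux of the question: the limit depends on the whole column, so it has to be computed before any single value can be transformed.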
Replies: 2 comments
I think this can already be done with the current stages, but I can't get to it right now. This is at the top of my list.
Hey @vecorro! Sorry for the extremely belated response. I still wanted to address this, for future users.

OK, if we first define this:

    import pandas as pd
    import pdpipe as pdp


    class SentenceClipper:
        """Callable that keeps only the first `lim` words of a string."""

        def __init__(self, lim: int) -> None:
            self.lim = lim

        def __call__(self, text: str) -> str:
            words = text.split(' ')
            words = words[:self.lim]
            return ' '.join(words)


    class SentenceColumnClipper(pdp.PdPipelineStage):
        """Clips the 'text' column to the limit stored in the application context."""

        def _prec(self, df: pd.DataFrame) -> bool:
            return 'text' in df.columns

        def _transform(self, df: pd.DataFrame, verbose: bool) -> pd.DataFrame:
            # The limit is put into the context by ApplicationContextEnricher below.
            lim = self.application_context['nwords_lim']
            print(f'Calculated word limit: {lim}')
            clipper = SentenceClipper(lim=lim)
            res_df = df.copy()
            res_df['text'] = res_df['text'].map(clipper)
            return res_df


    text_pline = pdp.PdPipeline([
        # Count the words in each row into a helper column.
        pdp.MapColVals(
            columns='text',
            value_map=lambda txt: len(txt.split(' ')),
            result_columns='nwords',
            drop=False,
        ),
        # 1 * std is used here; with 2 * std (as in the question) the limit for
        # the sample data below would be 8 words and nothing would be clipped.
        pdp.ApplicationContextEnricher(
            nwords_lim=lambda X: int(X['nwords'].mean() + 1 * X['nwords'].std()),
        ),
        SentenceColumnClipper(),
        pdp.ColDrop('nwords'),
    ])

Then this works:

    >>> textdf = pd.DataFrame(
    ...     data=[
    ...         [23, 'help me'],
    ...         [19, 'dont touch that please you dumb person'],
    ...         [15, 'yes!'],
    ...         [5, 'no way'],
    ...     ],
    ...     columns=['age', 'text'],
    ...     index=[0, 1, 2, 3],
    ... )
    >>> textdf
       age                                    text
    0   23                                 help me
    1   19  dont touch that please you dumb person
    2   15                                    yes!
    3    5                                  no way
    >>> text_pline.fit_transform(textdf)
       age                        text
    0   23                     help me
    1   19  dont touch that please you
    2   15                        yes!
    3    5                      no way

Please reopen this issue if this doesn't work for you. I'm also converting it to a discussion so it can be used as a reference for this kind of thing. I'm working on a way to streamline the use of the application context, to make this kind of thing more natural and to avoid defining custom pipeline stages as much as possible.
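As a sanity check on the clipped output, the word limit for the four sample sentences can be worked out by hand. This is only a sketch of the arithmetic. It assumes pandas' default sample standard deviation (`ddof=1`), which is what `Series.std()` uses unless told otherwise.

```python
import pandas as pd

# Word counts of the four sample sentences:
# 'help me', 'dont touch that please you dumb person', 'yes!', 'no way'
nwords = pd.Series([2, 7, 1, 2])

mean = nwords.mean()  # 3.0
std = nwords.std()    # sample std (ddof=1), roughly 2.708

limit_1x = int(mean + 1 * std)  # int(5.708...) == 5
limit_2x = int(mean + 2 * std)  # int(8.416...) == 8

print(limit_1x, limit_2x)  # prints: 5 8
```

With a multiplier of 1 the limit is 5 words, which is why only the seven-word sentence is shortened. With the multiplier of 2 from the original question the limit would be 8, and none of these sample rows would change.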