
Vectorize apply_replacement #207

Open · wants to merge 7 commits into master
Conversation

@matheusfacure (Member) commented Aug 10, 2022

Status

IN DEVELOPMENT

Todo list

  • Documentation
  • Tests added and passed

Background context

Pandas .apply is incredibly slow to run. Since this function is used in multiple learners, speeding it up should yield tremendous gains in performance. As a side note, we should never introduce new calls to .apply; they are a major source of headaches.

Description of the changes proposed in the pull request

Vectorize apply_replacements
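
For context, a rough sketch of the kind of change being proposed (not the actual diff; the signature and names follow the discussion below), contrasting a per-element .apply with a per-column dict lookup via Series.map:

import pandas as pd

# .apply-based version: a Python function is called once per element, which is slow on large frames
# (NaN handling omitted for brevity)
def apply_replacements_with_apply(df, columns, vec, replace_unseen):
    return df.assign(**{
        col: df[col].apply(lambda x: vec[col].get(x, replace_unseen))
        for col in columns
    })

# vectorized version: one dict lookup pass per column via Series.map
def apply_replacements_with_map(df, columns, vec, replace_unseen):
    def column_categorizer(col):
        replaced = df[col].map(vec[col])                # values missing from the dict become NaN
        unseen = df[col].notnull() & replaced.isnull()  # NaN introduced by map, not already in the data
        return replaced.mask(unseen, replace_unseen)    # fill only the unseen values with the sentinel
    return df.assign(**{col: column_categorizer(col) for col in columns})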

@matheusfacure requested a review from a team as a code owner · August 10, 2022 16:52
@jmoralez

WDYT about using map on the dicts instead?

import numpy as np
import pandas as pd

nrows = 1_000_000
ncols = 20
n_unique_vals = 100
df = pd.DataFrame(
    np.random.randint(0, n_unique_vals, (nrows, ncols)),
    columns=[f'x{i}' for i in range(ncols)],
    dtype=str
)
cols_to_replace = [f'x{i}' for i in range(0, 20, 2)]
vec = {col: {str(i): str(i + 1) for i in range(n_unique_vals - 10)}  # last 10 values were not seen
       for col in cols_to_replace}  
# define one of the replacement columns as float
df['x0'] = df['x0'].astype('float')
vec['x0'] = {float(k): float(v) for k, v in vec['x0'].items()}
replace_unseen = -1

def apply_replacements(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        return np.select(
            # the original had an and here so I guess it should be &
            [df[col].isna() & (df[col].dtype == "float"), ~df[col].isin(vec[col].keys())],
            [np.nan, replace_unseen],
            df[col].replace(vec[col])
        )
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res1 = apply_replacements(df, cols_to_replace, vec, replace_unseen)
# Wall time: 1min 22s

# proposal
def apply_replacements2(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        replaced = df[col].map(vec[col])
        unseen = df[col].notnull() & replaced.isnull()
        replaced[unseen] = replace_unseen
        return replaced
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res2 = apply_replacements2(df, cols_to_replace, vec, replace_unseen)
# Wall time: 3.93 s

pd.testing.assert_frame_equal(res1, res2)

@matheusfacure (Member Author)

WOW! What is this magic? How does map work?

@jmoralez

The main difference is that replace only changes the values you provide in the dict, whereas map tries to replace all of them and, when there isn't a match, sets the value to null. In this case I think that's helpful for us, because we can get the ones that didn't match very easily.
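
A toy illustration of that difference (hypothetical values, not from the PR):

import pandas as pd

s = pd.Series(['a', 'b', 'z'])
mapping = {'a': 1, 'b': 2}

s.replace(mapping)  # 1, 2, 'z'     -> values missing from the dict are left untouched
s.map(mapping)      # 1.0, 2.0, NaN -> values missing from the dict become NaN

Because map marks every unmatched value as NaN, the unseen entries can be found with a single boolean mask instead of a per-row check.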

@codecov-commenter commented Aug 10, 2022

Codecov Report

Merging #207 (8c5c9b0) into master (3cd7bec) will decrease coverage by 0.39%.
The diff coverage is 93.20%.

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   94.69%   94.29%   -0.40%     
==========================================
  Files          25       34       +9     
  Lines        1507     2050     +543     
  Branches      203      269      +66     
==========================================
+ Hits         1427     1933     +506     
- Misses         48       80      +32     
- Partials       32       37       +5     
Impacted Files Coverage Δ
src/fklearn/causal/validation/cate.py 0.00% <0.00%> (ø)
src/fklearn/data/datasets.py 100.00% <ø> (ø)
src/fklearn/tuning/parameter_tuners.py 79.48% <ø> (ø)
src/fklearn/tuning/selectors.py 90.47% <ø> (ø)
src/fklearn/validation/validator.py 88.88% <71.42%> (-5.40%) ⬇️
src/fklearn/preprocessing/splitting.py 95.00% <92.59%> (-0.84%) ⬇️
src/fklearn/training/calibration.py 96.36% <94.73%> (-3.64%) ⬇️
src/fklearn/causal/cate_learning/meta_learners.py 94.93% <94.93%> (ø)
src/fklearn/training/transformation.py 93.97% <95.34%> (+0.04%) ⬆️
src/fklearn/validation/evaluators.py 93.95% <96.29%> (+4.32%) ⬆️
... and 18 more


Comment on lines 360 to 361:

replaced[unseen] = replace_unseen
return replaced

Suggested change:

-   replaced[unseen] = replace_unseen
-   return replaced
+   return replaced.mask(unseen, replace_unseen)

Another suggestion:

def column_categorizer(col: str) -> pd.Series:
    return df[col].map(lambda x: vec[col].get(x, replace_unseen), na_action='ignore')
@matheusfacure (Member Author)

Is this faster than .apply? I don't know how map works under the hood, but if it runs a Python for loop in the backend for the lambda function, then it's just as bad as apply, no?

@jmoralez

You're right; I wasn't benchmarking against the original implementation. This one takes about the same time (4.5 s for the original, 3.8 s for this one). Do you have an example where the original is very slow? That may be a better case to benchmark.
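
For reference, a self-contained sketch of the two forms being compared (sizes, names, and the '-1' sentinel are arbitrary; timings will vary by machine):

import numpy as np
import pandas as pd

n = 1_000_000
s = pd.Series(np.random.randint(0, 100, n).astype(str))
mapping = {str(i): str(i + 1) for i in range(90)}  # last 10 values unseen

# dict form: the lookups happen inside pandas, with no Python call per element
dict_form = s.map(mapping)

# lambda form: the function is invoked once per element, so the cost is roughly that of .apply
lambda_form = s.map(lambda x: mapping.get(x, '-1'), na_action='ignore')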

4 participants