
Vectorize apply_replacement #207

Open · wants to merge 7 commits into master
Conversation

@matheusfacure (Member) commented Aug 10, 2022

Status

IN DEVELOPMENT

Todo list

  • Documentation
  • Tests added and passed

Background context

Pandas .apply is incredibly slow to run. Since this function is used in multiple learners, speeding it up should yield tremendous gains in performance. As a side note, we should never introduce new calls to .apply; they are a major source of headaches.

Description of the changes proposed in the pull request

Vectorize apply_replacements
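
For context, a rough sketch of the kind of change being proposed (not the actual diff; the signature and names follow the discussion below), contrasting a per-element .apply with a per-column dict lookup via Series.map:

import pandas as pd

# .apply-based version: a Python function is called once per element, which is slow on large frames
# (NaN handling omitted for brevity)
def apply_replacements_with_apply(df, columns, vec, replace_unseen):
    return df.assign(**{
        col: df[col].apply(lambda x: vec[col].get(x, replace_unseen))
        for col in columns
    })

# vectorized version: one dict lookup pass per column via Series.map
def apply_replacements_with_map(df, columns, vec, replace_unseen):
    def column_categorizer(col):
        replaced = df[col].map(vec[col])                # values missing from the dict become NaN
        unseen = df[col].notnull() & replaced.isnull()  # NaN introduced by map, not already in the data
        return replaced.mask(unseen, replace_unseen)    # fill only the unseen values with the sentinel
    return df.assign(**{col: column_categorizer(col) for col in columns})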

@matheusfacure requested a review from a team as a code owner · August 10, 2022 16:52
@jmoralez

WDYT about using map on the dicts instead?

import numpy as np
import pandas as pd

nrows = 1_000_000
ncols = 20
n_unique_vals = 100
df = pd.DataFrame(
    np.random.randint(0, n_unique_vals, (nrows, ncols)),
    columns=[f'x{i}' for i in range(ncols)],
    dtype=str
)
cols_to_replace = [f'x{i}' for i in range(0, 20, 2)]
vec = {col: {str(i): str(i + 1) for i in range(n_unique_vals - 10)}  # last 10 values were not seen
       for col in cols_to_replace}  
# define one of the replacement columns as float
df['x0'] = df['x0'].astype('float')
vec['x0'] = {float(k): float(v) for k, v in vec['x0'].items()}
replace_unseen = -1

def apply_replacements(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        return np.select(
            # the original had an and here so I guess it should be &
            [df[col].isna() & (df[col].dtype == "float"), ~df[col].isin(vec[col].keys())],
            [np.nan, replace_unseen],
            df[col].replace(vec[col])
        )
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res1 = apply_replacements(df, cols_to_replace, vec, replace_unseen)
# Wall time: 1min 22s

# proposal
def apply_replacements2(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        replaced = df[col].map(vec[col])
        unseen = df[col].notnull() & replaced.isnull()
        replaced[unseen] = replace_unseen
        return replaced
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res2 = apply_replacements2(df, cols_to_replace, vec, replace_unseen)
# Wall time: 3.93 s

pd.testing.assert_frame_equal(res1, res2)

@matheusfacure (Member Author)

WOW! What is this magic? How does map work?

@jmoralez

The main difference is that replace only changes the values you provide in the dict, whereas map tries to replace all of them and, when there isn't a match, sets the value to null. In this case I think that's helpful for us, because we can get the ones that didn't match very easily.
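
A toy illustration of that difference (hypothetical values, not from the PR):

import pandas as pd

s = pd.Series(['a', 'b', 'z'])
mapping = {'a': 1, 'b': 2}

s.replace(mapping)  # 1, 2, 'z'     -> values missing from the dict are left untouched
s.map(mapping)      # 1.0, 2.0, NaN -> values missing from the dict become NaN

Because map marks every unmatched value as NaN, the unseen entries can be found with a single boolean mask instead of a per-row check.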

@codecov-commenter commented Aug 10, 2022

Codecov Report

Merging #207 (8c5c9b0) into master (3cd7bec) will decrease coverage by 0.39%.
The diff coverage is 93.20%.

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   94.69%   94.29%   -0.40%     
==========================================
  Files          25       34       +9     
  Lines        1507     2050     +543     
  Branches      203      269      +66     
==========================================
+ Hits         1427     1933     +506     
- Misses         48       80      +32     
- Partials       32       37       +5     
Impacted Files Coverage Δ
src/fklearn/causal/validation/cate.py 0.00% <0.00%> (ø)
src/fklearn/data/datasets.py 100.00% <ø> (ø)
src/fklearn/tuning/parameter_tuners.py 79.48% <ø> (ø)
src/fklearn/tuning/selectors.py 90.47% <ø> (ø)
src/fklearn/validation/validator.py 88.88% <71.42%> (-5.40%) ⬇️
src/fklearn/preprocessing/splitting.py 95.00% <92.59%> (-0.84%) ⬇️
src/fklearn/training/calibration.py 96.36% <94.73%> (-3.64%) ⬇️
src/fklearn/causal/cate_learning/meta_learners.py 94.93% <94.93%> (ø)
src/fklearn/training/transformation.py 93.97% <95.34%> (+0.04%) ⬆️
src/fklearn/validation/evaluators.py 93.95% <96.29%> (+4.32%) ⬆️
... and 18 more


Comment on lines 360 to 361:

replaced[unseen] = replace_unseen
return replaced

Suggested change:

-   replaced[unseen] = replace_unseen
-   return replaced
+   return replaced.mask(unseen, replace_unseen)

Another suggestion:

def column_categorizer(col: str) -> pd.Series:
    return df[col].map(lambda x: vec[col].get(x, replace_unseen), na_action='ignore')
@matheusfacure (Member Author)

Is this faster than .apply? I don't know how map works under the hood, but if it runs a Python for loop in the backend for the lambda function, then it's just as bad as apply, no?

@jmoralez

You're right; I wasn't benchmarking against the original implementation. This one takes about the same time (4.5 s for the original, 3.8 s for this one). Do you have an example where the original is very slow? That may be a better case to benchmark.
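
For reference, a self-contained sketch of the two forms being compared (sizes, names, and the '-1' sentinel are arbitrary; timings will vary by machine):

import numpy as np
import pandas as pd

n = 1_000_000
s = pd.Series(np.random.randint(0, 100, n).astype(str))
mapping = {str(i): str(i + 1) for i in range(90)}  # last 10 values unseen

# dict form: the lookups happen inside pandas, with no Python call per element
dict_form = s.map(mapping)

# lambda form: the function is invoked once per element, so the cost is roughly that of .apply
lambda_form = s.map(lambda x: mapping.get(x, '-1'), na_action='ignore')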

4 participants