 ## Profiling faster alternatives to DataFrame.drop

What?
For e.g. LogTransformer, dropping the original column is one of the slower operations. Would be good to look into faster alternatives, e.g. df.columns.isin(wanted_columns)

Why?
Speed up tubular

How?
Explore alternatives to pd.DataFrame.drop, profile candidates to confirm efficiency benefits

In [1]:
import pandas as pd
import numpy as np
import time

We'll be profiling df.drop as well as two alternatives: df.columns.isin and df.filter in this notebook. Profiling will be done in terms of the average time taken over 50 runs by each approach. We'll be using a randomly generated DataFrame with 1 million rows and 100 columns for each run. The first 10 columns will be used for dropping purpose

In [2]:
np.random.seed(0)  # For reproducibility

# Columns to be dropped
cols_to_drop = [f'col_{i}' for i in range(10)]
cols_to_keep = [f'col_{i}' for i in range(10,100)]

# Initialize accumulators for time measurements
total_time_drop = 0
total_time_isin = 0
total_time_filter = 0
total_time_del = 0

runs = 50  # Number of runs

In [3]:
for _ in range(runs):
    # Regenerate DataFrame with 1 million rows and 100 columns for each run
    df = pd.DataFrame(np.random.rand(1000000, 100), columns=[f'col_{i}' for i in range(100)])

    # Measure time for DataFrame.drop()
    start_time_drop = time.time()
    df.drop(cols_to_drop, axis=1)
    total_time_drop += time.time() - start_time_drop

    # Measure time for Columns.isin()
    start_time_isin = time.time()
    df.loc[:, df.columns.isin(cols_to_keep)]
    total_time_isin += time.time() - start_time_isin

    # Measure time for DataFrame.filter()
    start_time_filter = time.time()
    df.filter(items=cols_to_keep)
    total_time_filter += time.time() - start_time_filter

    # del statement to drop columns from DataFrame
    start_time_del = time.time()
    for col in cols_to_drop:
        del df[col]
    total_time_del = time.time() - start_time_del

# Calculate average times
average_time_drop = total_time_drop / runs
average_time_isin = total_time_isin / runs
average_time_filter = total_time_filter / runs
average_time_del = total_time_del / runs

print("Average time taken for drop across 50 iterations: ", average_time_drop)
print("Average time taken for isin across 50 iterations: ", average_time_isin)
print("Average time taken for average_time_filter across 50 iterations: ", average_time_filter)
print("Average time taken for del across 50 iterations: ", average_time_del)

Average time taken for drop across 50 iterations:  0.1302215051651001
Average time taken for isin across 50 iterations:  0.1303914499282837
Average time taken for average_time_filter across 50 iterations:  0.12967921733856203
Average time taken for del across 50 iterations:  0.00010490894317626953
