# Efficiency Examples in Data Engineering

This notebook demonstrates three efficiency techniques for data engineering workflows:

- Optimize data processing with Pandas
- Implement batch processing
- Use code profiling to check resource usage

## Optimize Data Processing with Pandas

### Before Optimization

In [1]:
import pandas as pd

df = pd.read_csv("sales_data.csv")

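# Accumulate each customer's sales one row at a time.
# iterrows() builds a Series for every row, which makes this loop slow.
total_sales = {}
for index, row in df.iterrows():
    customer_id = row["customer_id"]
    sales = row["quantity"] * row["price"]
    if customer_id in total_sales:
        total_sales[customer_id] += sales
    else:
        total_sales[customer_id] = sales

# Convert the accumulated dictionary back into a DataFrame
result = pd.DataFrame(list(total_sales.items()), columns=["customer_id", "total_sales"])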

print(result)

   customer_id  total_sales
0            1         60.0
1            2        120.0
2            3        180.0
3            4         75.0
4            5        135.0


### After Optimization

In [2]:
import pandas as pd

df = pd.read_csv("sales_data.csv")

# Vectorized: compute sales for every row at once, then sum per customer
df["sales"] = df["quantity"] * df["price"]
result = df.groupby("customer_id", as_index=False)["sales"].sum()

print(result)

   customer_id  sales
0            1   60.0
1            2  120.0
2            3  180.0
3            4   75.0
4            5  135.0
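
One way to check how much the vectorized version saves is to time both approaches on the same data. The sketch below is an illustration, not part of the original notebook. It assumes a synthetic DataFrame with the same `customer_id`, `quantity`, and `price` columns (standing in for `sales_data.csv`), and the helper names `total_sales_loop` and `total_sales_vectorized` are hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for sales_data.csv (assumed data, same column names)
rng = np.random.default_rng(seed=0)
n_rows = 5_000
df = pd.DataFrame(
    {
        "customer_id": rng.integers(1, 101, size=n_rows),
        "quantity": rng.integers(1, 10, size=n_rows),
        "price": rng.uniform(1.0, 50.0, size=n_rows).round(2),
    }
)


def total_sales_loop(frame):
    """Row-by-row aggregation, mirroring the 'before' cell."""
    totals = {}
    for _, row in frame.iterrows():
        customer_id = row["customer_id"]
        sales = row["quantity"] * row["price"]
        totals[customer_id] = totals.get(customer_id, 0.0) + sales
    return pd.Series(totals).sort_index()


def total_sales_vectorized(frame):
    """Vectorized aggregation, mirroring the 'after' cell."""
    sales = frame["quantity"] * frame["price"]
    return sales.groupby(frame["customer_id"]).sum()


# Time three runs of each approach on identical input
loop_time = timeit.timeit(lambda: total_sales_loop(df), number=3)
vec_time = timeit.timeit(lambda: total_sales_vectorized(df), number=3)
print(f"iterrows loop: {loop_time:.4f} s, groupby: {vec_time:.4f} s")
```

Absolute timings depend on the machine and the pandas version, so the useful figure is the ratio between the two numbers rather than either value alone.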


## Implement Batch Processing

### Before Batch Processing

In [5]:
import pandas as pd

df = pd.read_csv("transactions.csv")

total_amount = df["amount"].sum()

print(f"Total amount (without batch processing): {total_amount}")

Total amount (without batch processing): 1505.0


### Using Batch Processing

In [7]:
import pandas as pd

# Function to process data in batches
def process_data_in_batches(file_path, batch_size):
    total_amount = 0
    for chunk in pd.read_csv(file_path, chunksize=batch_size):
        total_amount += chunk["amount"].sum()
    return total_amount


# Batch processing: processing data in batches of 3 rows
file_path = "transactions.csv"
batch_size = 3
total_amount = process_data_in_batches(file_path, batch_size)

print(f"Total amount (with batch processing): {total_amount}")

Total amount (with batch processing): 1505.0
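
A single running total is the simplest case for chunked reading. A grouped aggregation needs one more step, because the same key can appear in several chunks. The following is a minimal sketch under stated assumptions: the function name `total_sales_in_batches`, the sample rows, and the temporary file `sales_sample.csv` are all hypothetical and only serve to make the example self-contained.

```python
import os
import tempfile

import pandas as pd


def total_sales_in_batches(file_path, batch_size):
    """Sum quantity * price per customer without loading the whole file."""
    partial_sums = []
    for chunk in pd.read_csv(file_path, chunksize=batch_size):
        chunk_sales = chunk["quantity"] * chunk["price"]
        # Per-customer totals for this chunk only
        partial_sums.append(chunk_sales.groupby(chunk["customer_id"]).sum())
    # A customer can appear in several chunks, so combine the partial sums
    combined = pd.concat(partial_sums).groupby(level=0).sum()
    return combined.rename("sales").reset_index()


# Hypothetical sample rows written to a temporary CSV file
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3, 2, 1],
        "quantity": [2, 1, 3, 4, 2, 1],
        "price": [10.0, 20.0, 10.0, 5.0, 20.0, 10.0],
    }
)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sales_sample.csv")
    sample.to_csv(csv_path, index=False)
    batched = total_sales_in_batches(csv_path, batch_size=2)

print(batched)
```

For the six sample rows, customers 1, 2, and 3 end up with totals of 60.0, 60.0, and 20.0. Only one chunk plus the small list of partial sums is held in memory at a time, which is the point of the pattern when the file is larger than available RAM.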


## Use Code Profiling

### Code Profiling Example 1

In [1]:
import pandas as pd
import cProfile
import pstats


def load_data(file_path):
    df = pd.read_csv(file_path)
    return df


def calculate_total_amount(df):
    total_amount = df["amount"].sum()
    return total_amount


def main():
    file_path = "transactions.csv"
    df = load_data(file_path)
    total_amount = calculate_total_amount(df)
    print(f"Total amount: {total_amount}")


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()

    main()

    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumtime")
    stats.print_stats()

Total amount: 1505.0
         3434 function calls (3298 primitive calls) in 0.007 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.007    0.007 /tmp/ipykernel_9601/2783729850.py:16(main)
        1    0.000    0.000    0.006    0.006 /tmp/ipykernel_9601/2783729850.py:6(load_data)
        1    0.000    0.000    0.006    0.006 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:170(wrapper)
      2/1    0.000    0.000    0.006    0.006 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:323(wrapper)
        1    0.000    0.000    0.006    0.006 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:854(read_csv)
        1    0.000    0.000    0.006    0.006 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:571(_read)
        1    0.000    0.000    0.003    0.003 /home/nadir/anaconda3/lib/
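
The full `print_stats()` listing can run to thousands of lines with long absolute paths. A possible way to keep the report readable is to strip directory names and print only the top entries. This sketch uses only the standard-library `cProfile`, `pstats`, and `io` modules; `busy_work` is a hypothetical workload standing in for `main()`.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical workload standing in for main()."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy_work(100_000)
profiler.disable()

# Send the report to a string buffer instead of stdout
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# strip_dirs() shortens file paths; print_stats(5) keeps the top 5 entries
stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)
report = buffer.getvalue()
print(report)
```

Calling `strip_dirs()` before `sort_stats()` matters, because stripping paths resets the ordering of the collected entries.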

### Code Profiling Example 2

In [21]:
import pandas as pd
import cProfile
import pstats


def process_data_in_batches(file_path, batch_size):
    total_amount = 0
    for chunk in pd.read_csv(file_path, chunksize=batch_size):
        total_amount += chunk["amount"].sum()
    return total_amount


def main():
    file_path = "transactions.csv"
    batch_size = 3
    total_amount = process_data_in_batches(file_path, batch_size)
    print(f"Total amount: {total_amount}")


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()

    main()

    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumtime")
    stats.print_stats()

Total amount: 1505.0
         8518 function calls (8413 primitive calls) in 0.060 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.060    0.060 /tmp/ipykernel_9601/219282257.py:13(main)
        1    0.000    0.000    0.060    0.060 /tmp/ipykernel_9601/219282257.py:6(process_data_in_batches)
        5    0.000    0.000    0.049    0.010 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1696(__next__)
        5    0.000    0.000    0.049    0.010 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1803(get_chunk)
        5    0.000    0.000    0.049    0.010 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1762(read)
        4    0.000    0.000    0.040    0.010 /home/nadir/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:609(__init__)
        4    0.001    0.000    0.039    0.010 /home/nadir/a
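
In the two profiles above, the batched run took 0.060 seconds against 0.007 seconds for the single read, most likely because per-chunk overhead dominates with only 3 rows per chunk. Batch processing is usually chosen to limit memory rather than to save time, and `cProfile` does not measure memory. A rough way to compare memory is the standard-library `tracemalloc` module, sketched below. The helper `peak_memory`, the synthetic file `transactions_sample.csv`, and the chunk size of 10,000 rows are assumptions for illustration. `tracemalloc` reports only the allocations it can trace, so the numbers are an indication and not an exact measurement.

```python
import os
import tempfile
import tracemalloc

import numpy as np
import pandas as pd


def peak_memory(func, *args):
    """Run func(*args) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def total_full(file_path):
    """Load the whole file, then sum the amount column."""
    return pd.read_csv(file_path)["amount"].sum()


def total_batched(file_path, batch_size):
    """Sum the amount column one chunk at a time."""
    total = 0.0
    for chunk in pd.read_csv(file_path, chunksize=batch_size):
        total += chunk["amount"].sum()
    return total


with tempfile.TemporaryDirectory() as tmp_dir:
    # Synthetic stand-in for transactions.csv (assumed data)
    csv_path = os.path.join(tmp_dir, "transactions_sample.csv")
    rng = np.random.default_rng(seed=0)
    amounts = rng.uniform(1, 100, size=200_000).round(2)
    pd.DataFrame({"amount": amounts}).to_csv(csv_path, index=False)

    full_total, full_peak = peak_memory(total_full, csv_path)
    batch_total, batch_peak = peak_memory(total_batched, csv_path, 10_000)

print(f"Full read:    total={full_total:.2f}, peak={full_peak / 1e6:.2f} MB")
print(f"Batched read: total={batch_total:.2f}, peak={batch_peak / 1e6:.2f} MB")
```

Both functions should report the same total up to floating-point rounding. The peak figures will vary with the pandas version and platform.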