# Efficiency Examples in Data Engineering

This notebook demonstrates several efficiency techniques for data engineering:

- Optimize data processing with Pandas
- Implement batch processing
- Use code profiling to check resource usage

## Optimize Data Processing with Pandas

### Before Optimization

In [1]:
import pandas as pd

df = pd.read_csv("sales_data.csv")

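# Accumulate each customer's total by visiting rows one at a time.
# iterrows() builds a Series for every row, which is slow on large frames.
total_sales = {}
for _, row in df.iterrows():
    customer_id = row["customer_id"]
    sales = row["quantity"] * row["price"]
    if customer_id in total_sales:
        total_sales[customer_id] += sales
    else:
        total_sales[customer_id] = sales

# Convert the accumulated dictionary into a DataFrame.
result = pd.DataFrame(list(total_sales.items()), columns=["customer_id", "total_sales"])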

print(result)

   customer_id  total_sales
0            1         60.0
1            2        120.0
2            3        180.0
3            4         75.0
4            5        135.0


### After Optimization

In [2]:
import pandas as pd

df = pd.read_csv("sales_data.csv")

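# Compute per-row sales for the whole column in one vectorized operation.
df["sales"] = df["quantity"] * df["price"]
# Sum sales per customer; as_index=False keeps customer_id as a regular column.
result = df.groupby("customer_id", as_index=False)["sales"].sum()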

print(result)

   customer_id  sales
0            1   60.0
1            2  120.0
2            3  180.0
3            4   75.0
4            5  135.0
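To make the difference between the two cells measurable, the following is a minimal timing sketch. It assumes a synthetic DataFrame in place of `sales_data.csv` (the row count, value ranges, and the helper names `total_sales_loop` and `total_sales_vectorized` are hypothetical), so the absolute timings will differ on real data.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for sales_data.csv (hypothetical sizes and value ranges).
rng = np.random.default_rng(seed=0)
n_rows = 10_000
df = pd.DataFrame({
    "customer_id": rng.integers(1, 101, size=n_rows),
    "quantity": rng.integers(1, 10, size=n_rows),
    "price": rng.uniform(1.0, 50.0, size=n_rows).round(2),
})


def total_sales_loop(frame):
    """Row-by-row aggregation, as in the 'before' cell."""
    totals = {}
    for _, row in frame.iterrows():
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + row["quantity"] * row["price"]
    return pd.Series(totals).sort_index()


def total_sales_vectorized(frame):
    """Column arithmetic plus groupby, as in the 'after' cell."""
    sales = frame["quantity"] * frame["price"]
    return sales.groupby(frame["customer_id"]).sum()


start = time.perf_counter()
loop_result = total_sales_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_result = total_sales_vectorized(df)
vectorized_seconds = time.perf_counter() - start

# Both approaches should agree up to floating-point rounding.
results_match = np.allclose(loop_result.to_numpy(), vectorized_result.to_numpy())

print(f"iterrows loop: {loop_seconds:.4f} s")
print(f"groupby:       {vectorized_seconds:.4f} s")
print(f"results match: {results_match}")
```

Timing a single run with `time.perf_counter` is only a rough indicator; repeating the measurement (for example with `timeit`) gives steadier numbers.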


## Implement Batch Processing

### Before Batch Processing

In [5]:
import pandas as pd

# Load the entire file into memory at once.
df = pd.read_csv("transactions.csv")

# Sum the "amount" column over the full DataFrame.
total_amount = df["amount"].sum()

print(f"Total amount (without batch processing): {total_amount}")

Total amount (without batch processing): 1505.0
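Loading the whole file is only a problem when the resulting DataFrame is large relative to available memory. One way to check this is `DataFrame.memory_usage`. The sketch below assumes a synthetic one-million-row frame standing in for `transactions.csv`; the row count and value range are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for transactions.csv: one million synthetic amounts.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"amount": rng.uniform(1.0, 500.0, size=1_000_000)})

# deep=True also counts the contents of object (string) columns.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"DataFrame footprint: {total_bytes / 1024**2:.1f} MiB")
```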


### Using Batch Processing

In [7]:
import pandas as pd

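# Sum the "amount" column while holding only one chunk in memory at a time
def process_data_in_batches(file_path, batch_size):
    total_amount = 0
    # With chunksize set, read_csv yields DataFrames of at most batch_size rows
    for chunk in pd.read_csv(file_path, chunksize=batch_size):
        total_amount += chunk["amount"].sum()
    return total_amount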


# Process the data in batches of 3 rows
file_path = "transactions.csv"
batch_size = 3
total_amount = process_data_in_batches(file_path, batch_size)

print(f"Total amount (with batch processing): {total_amount}")

Total amount (with batch processing): 1505.0
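The same chunked pattern can be extended from a single grand total to per-group totals by aggregating each chunk and then combining the partial results. The following is a minimal sketch under stated assumptions: the inline CSV text stands in for `transactions.csv`, the `customer_id` column is assumed for illustration, and `sum_by_key_in_batches` is a hypothetical helper name.

```python
import io

import pandas as pd

# Hypothetical stand-in for transactions.csv; the customer_id column is an
# assumption made for this sketch.
csv_text = """customer_id,amount
1,100.0
2,250.5
1,50.0
3,75.0
2,24.5
3,5.0
1,10.0
"""


def sum_by_key_in_batches(source, key, value, batch_size):
    """Sum `value` per `key`, reading `source` one chunk at a time."""
    partials = []
    for chunk in pd.read_csv(source, chunksize=batch_size):
        # Each partial result has one row per key seen in this chunk.
        partials.append(chunk.groupby(key)[value].sum())
    # A key can appear in several chunks, so combine the partial sums.
    return pd.concat(partials).groupby(level=0).sum()


totals = sum_by_key_in_batches(
    io.StringIO(csv_text), key="customer_id", value="amount", batch_size=3
)
print(totals)
```

Only the per-chunk partial sums are kept between iterations, so memory use depends on the number of distinct keys rather than on the number of rows in the file.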
