# Data Cleaning and Transformation with Pandas `.apply`

This notebook explores the pandas `.apply` method. On a Series, `.apply` calls a function on each value; on a DataFrame, it calls a function on each column (`axis=0`, the default) or each row (`axis=1`). That makes it flexible enough for complex data transformations.

## Objectives

1. **Data Cleaning**:
    - **Standardizing Categories**: Clean and standardize the 'category' column to handle inconsistencies such as extra spaces, mixed case, and missing values.
    - **Handling Missing Values**: Replace missing or empty values with the placeholder `'Unknown'`.

2. **Data Transformation**:
    - **Applying Discounts**: Calculate final prices after applying category-based discounts.
    - **Complex Operations**: Demonstrate the use of `.apply` for complex row-wise operations involving multiple columns.

## `.apply` Method Overview

The `.apply` method is a versatile tool for applying custom functions to DataFrame rows or columns. It is particularly useful for operations that are too complex for `.map` or `.replace`.
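
As a minimal sketch of what "rows or columns" means in practice, the following uses a small hypothetical DataFrame (`toy` and its values are illustrative and not part of this notebook's data) to show what the function receives in each case.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis argument.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.apply calls the function once per element.
doubled = toy["a"].apply(lambda x: x * 2)

# DataFrame.apply with axis=0 (the default) passes each column as a Series.
column_ranges = toy.apply(lambda col: col.max() - col.min(), axis=0)

# DataFrame.apply with axis=1 passes each row as a Series.
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(doubled.tolist())        # [2, 4, 6]
print(column_ranges.tolist())  # [2, 20]
print(row_sums.tolist())       # [11, 22, 33]
```

The row-wise form (`axis=1`) is the one used later in this notebook, because the discount calculation needs two columns at once.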

### Pros:
- **Flexibility**: Can handle more complex operations and custom functions.
- **Versatility**: Can be applied to both Series and DataFrames.

### Cons:
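- **Performance**: Row-wise `.apply` (`axis=1`) calls a Python function once per row, so it is usually slower than `.map` or vectorized column operations, especially on large DataFrames.
- **Complexity**: Slightly more complex to use than `.map`, because on a DataFrame the function must handle a whole row or column (a Series) rather than a single value.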
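
To make the performance trade-off concrete, here is a hedged sketch of a vectorized alternative to a row-wise `.apply`. It assumes the same discount rates used later in this notebook (10% for electronics, 15% for clothing, 5% for home). The `sample` DataFrame and the `DISCOUNTS` dictionary are illustrative names introduced here, not part of the notebook's code.

```python
import pandas as pd

# Illustrative data; categories are assumed to be cleaned already.
sample = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "books"],
    "price": [100.0, 200.0, 300.0, 400.0],
})

# Hypothetical lookup table mirroring the notebook's discount rates.
DISCOUNTS = {"electronics": 0.10, "clothing": 0.15, "home": 0.05}

# Row-wise version: the lambda runs once per row in Python.
row_wise = sample.apply(
    lambda row: row["price"] * (1 - DISCOUNTS.get(row["category"], 0.0)),
    axis=1,
)

# Vectorized version: look up each category's rate, default to 0.0,
# then do the arithmetic on whole columns.
rates = sample["category"].map(DISCOUNTS).fillna(0.0)
vectorized = sample["price"] * (1 - rates)

print(vectorized.round(2).tolist())  # [90.0, 170.0, 285.0, 400.0]
```

Both versions produce the same prices. On small data the difference is negligible; on large DataFrames the vectorized form is typically faster, while the row-wise form stays easier to extend when the logic grows beyond a lookup.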

In this notebook, we will demonstrate how to use the `.apply` method effectively to clean and transform data, highlighting its strengths in handling complex scenarios.


In [None]:
import pandas as pd

def build_sample_dataframe() -> pd.DataFrame:
    """Build a small product table with deliberately messy category values."""
    # Categories include stray spaces, mixed case, None, and an empty string.
    data = {
        'category': ['Electronics', 'Clothing', 'Home', 'Electronics ', ' clothing', 'Home', 'electronics', 'Clothing', 'Books', None, ''],
        'item': ['Laptop', 'T-Shirt', 'Sofa', 'Smartphone', 'Jeans', 'Table', 'Headphones', 'Jacket', 'Novel', 'Lamp', ''],
        'price': [999.50, 19.75, 299.00, 699.25, 49.50, 199.00, 89.75, 79.50, 14.25, 39.00, 0.0]
    }
    return pd.DataFrame(data)


In [None]:
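def clean_category(category):
    """Return a trimmed, lowercase category, or 'Unknown' if missing or blank."""
    # Check for missing values first so .strip() is only called on strings.
    if pd.isnull(category) or category.strip() == '':
        return 'Unknown'
    return category.strip().lower()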

In [None]:
def apply_discount(row):
    """Return the row's price after its category-based discount."""
    # Discount rate per cleaned (lowercase) category; any other category gets none.
    discounts = {'electronics': 0.10, 'clothing': 0.15, 'home': 0.05}
    discount = discounts.get(row['category'], 0.0)
    return row['price'] * (1 - discount)

In [3]:
df = build_sample_dataframe()
df['category'] = df['category'].apply(clean_category)
df['final_price'] = df.apply(apply_discount, axis=1)
print(df)

       category        item   price
0   electronics      Laptop  999.50
1      clothing     T-Shirt   19.75
2          home        Sofa  299.00
3   electronics  Smartphone  699.25
4      clothing       Jeans   49.50
5          home       Table  199.00
6   electronics  Headphones   89.75
7      clothing      Jacket   79.50
8         books       Novel   14.25
9       Unknown        Lamp   39.00
10      Unknown                0.00


The `.map` method works well for element-wise transformations of a single Series, including ones with conditional logic, but its function only ever sees one value from one column. In the example above, the final price depends on both `category` and `price`, so `.map` alone is not sufficient: the calculation needs a row-wise operation, which is what `df.apply(apply_discount, axis=1)` provides.
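
The difference in what each method's function receives can be sketched as follows. The `labels` DataFrame and the `describe` helper are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
labels = pd.DataFrame({
    "category": ["electronics", "books"],
    "item": ["Laptop", "Novel"],
    "price": [999.50, 14.25],
})

# Series.map: the function sees one value from one column at a time.
upper = labels["category"].map(str.upper)

def describe(row):
    # DataFrame.apply with axis=1: the function sees the whole row,
    # so it can combine several columns.
    return f"{row['item']} ({row['category']}): {row['price']:.2f}"

described = labels.apply(describe, axis=1)

print(upper.tolist())      # ['ELECTRONICS', 'BOOKS']
print(described.tolist())  # ['Laptop (electronics): 999.50', 'Novel (books): 14.25']
```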