# Pandas Methods: `.map`, `.apply`, and `.replace`

## Introduction

In this notebook, we will explore three Pandas methods: `.map`, `.apply`, and `.replace`. They are useful for data cleaning and transformation tasks, for example when you need to assign integer keys to string values so that a dimension can be split out of a fact table.

### Objectives

- Understand the usage of `.map`, `.apply`, and `.replace`.
- Learn how to use these methods to clean data by mapping string values to integers.
- Compare the performance and complexity of each method.

### Prerequisites

- Basic knowledge of Python and Pandas.
- Familiarity with data manipulation and cleaning tasks.

### Setup

Before getting started, please make sure you have completed the initial setup for working locally, described in `../SETUP.md`.

### Why Create a Function to Return the Sample Data?

To ensure consistency and flexibility in our examples, we created a function to build and return the sample DataFrame. Here are the key reasons for this approach:

1. **Consistent Dataset Across Multiple Examples**:
   - By using a function to generate the sample data, we ensure that all examples in this notebook use the same dataset. This consistency helps in comparing the results of different methods without discrepancies caused by varying data.

2. **Flexibility for Experimentation**:
   - The function allows anyone exploring this notebook to easily adjust the dataset for their own experiments. They can modify the data within the function and observe how the changes affect the cleaning and mapping processes.

3. **Separation of Data Creation and Cleaning Processes**:
   - In real-life scenarios, the creation of data is often separate from the data cleaning process. By encapsulating the data creation in a function, we mimic this separation, making the examples more realistic and aligned with typical data workflows.




In [1]:
# Create Sample Data
import pandas as pd

def build_sample_dataframe() -> pd.DataFrame:
    """
    Builds and returns a sample DataFrame for demonstration purposes.

    The DataFrame contains three columns: 'category', 'item', and 'price'.
    The 'category' column includes some inconsistencies such as extra spaces,
    mixed case, and missing values to simulate real-world data cleaning scenarios.

    Returns:
        pd.DataFrame: A DataFrame with sample data.
    """
    data = {
        'category': ['Electronics', 'Clothing', 'Home', 'Electronics ', ' clothing', 'Home', 'electronics', 'Clothing', 'Books', None, ''],
        'item': ['Laptop', 'T-Shirt', 'Sofa', 'Smartphone', 'Jeans', 'Table', 'Headphones', 'Jacket', 'Novel', 'Lamp', ''],
        'price': [999.50, 19.75, 299.00, 699.25, 49.50, 199.00, 89.75, 79.50, 14.25, 39.00, 0.0]
    }
    _df = pd.DataFrame(data)
    return _df


### .map **Example**
#### Using `.map`

The `.map` method is a straightforward way to substitute each value in a Series with another value from a dictionary. It is particularly useful for simple value replacements.

**Pros**:
- **Efficiency**: `.map` is fast and efficient for simple value replacements.
- **Simplicity**: Easy to use and understand for basic mapping tasks.

**Cons**:
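- **Limited Scope**: `Series.map` works on one column at a time; a dictionary mapping cannot be applied to a whole DataFrame in a single call.
- **Handling Missing Values**: Values that are missing or not present in the dictionary become `NaN`, which also turns an otherwise integer result into floats, so additional steps are needed to handle them.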

In [2]:
# Load sample data
df = build_sample_dataframe()

# Clean the data: strip spaces, convert to title case, and fill NaNs
df['category'] = df['category'].str.strip().str.title().fillna('Unknown')

# Dictionary for mapping string data to integers
category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

# Apply .map; values with no key in the dictionary (such as the empty string in row 10) become NaN
df['category'] = df['category'].map(category_dict)
print(df)

    category        item   price
0        1.0      Laptop  999.50
1        2.0     T-Shirt   19.75
2        3.0        Sofa  299.00
3        1.0  Smartphone  699.25
4        2.0       Jeans   49.50
5        3.0       Table  199.00
6        1.0  Headphones   89.75
7        2.0      Jacket   79.50
8        4.0       Novel   14.25
9        0.0        Lamp   39.00
10       NaN                0.00
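
Row 10 in the output above comes back as `NaN` because the empty string is not a key in `category_dict`, and that single missing value forces the whole column to float. A minimal sketch of one way to handle this, assuming unmapped values should fall back to the `Unknown` code (the small `raw` Series is illustrative, not the notebook's dataset):

```python
import pandas as pd

category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

# Illustrative values: trailing space, lowercase, an empty string, and a missing value
raw = pd.Series(['Electronics ', ' clothing', 'Books', '', None], name='category')

cleaned = raw.str.strip().str.title()

# .map returns NaN for anything that is not a dictionary key (here '' and the missing value),
# so fill those with the 'Unknown' code and restore an integer dtype
codes = cleaned.map(category_dict).fillna(category_dict['Unknown']).astype(int)
print(codes.tolist())  # [1, 2, 4, 0, 0]
```

Filling after the mapping, rather than before it, also catches categories that were never added to the dictionary.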


### .apply **Example**
#### Using `.apply`

The `.apply` method applies a function to each value of a Series, or along an axis of a DataFrame (rows or columns). It offers more flexibility than `.map`.

**Pros**:
- **Flexibility**: Can handle more complex operations and custom functions.
- **Versatility**: Can be applied to both Series and DataFrames.

**Cons**:
- **Performance**: May be slower than `.map`, especially with large DataFrames or complex functions.
- **Complexity**: Slightly more complex to use compared to `.map`.

In [3]:
# Load sample data
df = build_sample_dataframe()

# Dictionary for mapping string data to integers
category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

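# Function to clean and map a single value
def clean_and_map(value):
    # Missing values and blank strings both map to the 'Unknown' code
    if pd.isna(value) or value.strip() == '':
        return category_dict['Unknown']
    # Categories that are not in the dictionary are returned unchanged
    return category_dict.get(value.strip().title(), value)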

# Applying .apply to clean data
df['category'] = df['category'].apply(clean_and_map)
print(df)

    category        item   price
0          1      Laptop  999.50
1          2     T-Shirt   19.75
2          3        Sofa  299.00
3          1  Smartphone  699.25
4          2       Jeans   49.50
5          3       Table  199.00
6          1  Headphones   89.75
7          2      Jacket   79.50
8          4       Novel   14.25
9          0        Lamp   39.00
10         0                0.00
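
The example above calls `.apply` on a Series, so the function receives one value at a time. On a DataFrame, `axis=1` passes each row to the function as a Series instead. A small sketch of that form, where the helper `label_row`, the price threshold of 100, and the three-row DataFrame are all assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Home'],
    'price': [999.50, 19.75, 299.00],
})

def label_row(row: pd.Series) -> str:
    """Hypothetical helper: combine two columns of one row into a label."""
    tier = 'premium' if row['price'] >= 100 else 'budget'
    return f"{row['category']}-{tier}"

# axis=1 passes each row to the function; axis=0 (the default) would pass each column
df['label'] = df.apply(label_row, axis=1)
print(df['label'].tolist())  # ['Electronics-premium', 'Clothing-budget', 'Home-premium']
```

Row-wise `.apply` runs a Python function call per row, which is the main reason it tends to be slower than `.map` on large data.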


### .replace **Example**
#### Using `.replace`

The `.replace` method is used to replace values in a DataFrame or Series. It can handle both simple and complex replacements.

**Pros**:
- **Versatility**: Can replace values in both Series and DataFrames.
- **Flexibility**: Can handle complex replacements, including multiple values and patterns.

**Cons**:
- **Performance**: May be slower for large DataFrames compared to `.map`.
- **Complexity**: Slightly more complex syntax for advanced replacements.

In [4]:
# Load sample data
df = build_sample_dataframe()

# Clean the data: strip spaces, convert to title case, and fill NaNs
df['category'] = df['category'].str.strip().str.title().fillna('Unknown')

# Dictionary for mapping string data to integers
category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

# Apply .replace; values with no key in the dictionary (such as the empty string in row 10) are left unchanged
df['category'] = df['category'].replace(category_dict)
print(df)

   category        item   price
0         1      Laptop  999.50
1         2     T-Shirt   19.75
2         3        Sofa  299.00
3         1  Smartphone  699.25
4         2       Jeans   49.50
5         3       Table  199.00
6         1  Headphones   89.75
7         2      Jacket   79.50
8         4       Novel   14.25
9         0        Lamp   39.00
10                         0.00
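
In the output above, row 10 still holds an empty string because `.replace` leaves unmatched values untouched. The sketch below shows one way to use a pattern-based replacement to label blank entries first. It assumes blanks and missing values should both be treated as `Unknown`, and the `raw` Series is illustrative rather than the notebook's dataset:

```python
import pandas as pd

category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

raw = pd.Series(['Electronics ', ' clothing', '', None], name='category')

# Normalise the text; fillna('') turns missing values into blanks so one rule covers both
cleaned = raw.str.strip().str.title().fillna('')

# regex=True treats the first argument as a pattern: relabel empty or whitespace-only strings
labelled = cleaned.replace(r'^\s*$', 'Unknown', regex=True)

# Every remaining value is now a dictionary key, so the result can be cast to int
codes = labelled.replace(category_dict).astype(int)
print(codes.tolist())  # [1, 2, 0, 0]
```

Depending on the pandas version, the dictionary replacement may emit a warning about dtype downcasting or return an object column; the explicit `astype(int)` makes the intended dtype unambiguous.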


### Conclusion
In this notebook, we explored how to use `.map`, `.apply`, and `.replace` to clean data and map string values to integers in Pandas. Each method has its own strengths and is suitable for different scenarios. Understanding these differences can help you choose the right method for your data manipulation tasks, balancing complexity and performance.
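
The performance differences mentioned throughout are easiest to judge by measuring them on your own data. A rough timing sketch follows, assuming an already-clean synthetic column; the row count and number of runs are arbitrary, and results will vary with the pandas version and hardware:

```python
import timeit

import pandas as pd

category_dict = {'Electronics': 1, 'Clothing': 2, 'Home': 3, 'Books': 4, 'Unknown': 0}

# Synthetic column: the five dictionary keys repeated to 10,000 rows
s = pd.Series(list(category_dict) * 2_000)

candidates = {
    '.map': lambda: s.map(category_dict),
    '.apply': lambda: s.apply(lambda value: category_dict[value]),
    '.replace': lambda: s.replace(category_dict),
}

# Time five runs of each approach on the same Series
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=5)
    print(f'{name:<9} {seconds:.4f} s for 5 runs')
```

No expected timings are shown because they depend on the environment; the point is the comparison on a column shaped like yours.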