
# 40 | Defining Functions to Clean Data

## Creating a Cleaning Function

### Defining the Function

In [None]:
import pandas as pd

# Define a function to clean a DataFrame
def clean_data(df):

  # Standardize text columns (trim spaces, convert to lowercase)
  # Select all columns in DataFrame with object data type
  for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip().str.lower()

  df = df.drop_duplicates()  # Remove duplicate rows
  df = df.dropna()  # Drop missing values (or use fillna if needed)
  return df

### Adding the DataFrame

In [None]:
# Create a sample DataFrame with duplicates, missing values, and non-standardized text
sample_data = {
    "Name": [" Alice  ", "BOB", "Alice", "charlie", None, "BOB"],
    "Age": [25, 30, 25, 35, 40, 30],
    "City": ["New York", "Los Angeles", "new york", "Chicago", "Chicago", "LOS ANGELES"]
}

df = pd.DataFrame(sample_data)

print("Original DataFrame:")
df

Original DataFrame:


|  | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Alice | 25 | New York |
| 1 | BOB | 30 | Los Angeles |
| 2 | Alice | 25 | new york |
| 3 | charlie | 35 | Chicago |
| 4 |  | 40 | Chicago |
| 5 | BOB | 30 | LOS ANGELES |


### Apply Function to DataFrame

In [None]:
df_cleaned = clean_data(df)

print("\nCleaned DataFrame:")
df_cleaned


Cleaned DataFrame:


|  | Name | Age | City |
| --- | --- | --- | --- |
| 0 | alice | 25 | new york |
| 1 | bob | 30 | los angeles |
| 3 | charlie | 35 | chicago |
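
Note that `clean_data` assigns to `df[col]` before it rebinds `df`, so the text standardization also changes the DataFrame that was passed in. The following is a minimal sketch of a variant that leaves the caller's data untouched. The name `clean_data_copy` and the explicit `text_cols` parameter are assumptions made for this illustration, not part of the original function.

```python
import pandas as pd

def clean_data_copy(df, text_cols):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    for col in text_cols:
        df[col] = df[col].str.strip().str.lower()
    # Remove duplicate rows, then rows with missing values
    return df.drop_duplicates().dropna()

raw = pd.DataFrame({
    "Name": [" Alice  ", "BOB", "Alice", "charlie", None, "BOB"],
    "Age": [25, 30, 25, 35, 40, 30],
    "City": ["New York", "Los Angeles", "new york", "Chicago", "Chicago", "LOS ANGELES"],
})

cleaned = clean_data_copy(raw, text_cols=["Name", "City"])
print(cleaned)
print(raw["Name"].tolist())  # the original values are unchanged
```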


## The `lambda` Function

A `lambda` function is a small, anonymous function. It is useful when you need a quick, one-off function for a short task, instead of writing a full `def` block.

```python
# basic syntax
lambda arguments: expression
```

- `arguments` → the inputs; a `lambda` can take any number of them
- `expression` → a single expression whose value is returned

For example:

```python
lambda x: x + 10
```

will do the same as:

```python
def add_ten(x):
  return x + 10
```

In [None]:
# Define a function to multiply by 2
def multiply_by_two(x):
    return x * 2

print(multiply_by_two(5))

10


In [None]:
# Using a lambda function to multiply by 2
lambda_times_two = lambda x: x * 2

print(lambda_times_two(5))

10


## The `.apply()` Method

The `.apply()` method applies a function along an axis of a DataFrame (to each column or to each row) or to each value of a Series.

```python
# Basic syntax
dataframe.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
```

| Parameter | Status | Description |
| --- | --- | --- |
| `func` | Required | Function to apply |
| `axis` | Optional | 0 or `index` → apply to columns (_default_); 1 or `columns` → apply to rows |
| `raw` | Optional | Pass row or column as Series (`False` (_default_)) or ndarray object (`True`) |
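| `result_type` | Optional | Only used when `axis=1`; controls the return format (`'expand'`, `'reduce'`, `'broadcast'`) |
| `args` | Optional | Tuple of additional positional arguments passed to `func` |
| `kwargs` | Optional | Additional keyword arguments passed to `func` |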
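
As a rough illustration of the less obvious parameters in the table, the sketch below passes extra arguments through `args` and a keyword argument, and uses `result_type="expand"` to turn a list returned per row into columns. The function `scale_range` and its parameters are hypothetical and exist only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def scale_range(col, factor, offset=0):
    # Range of the column, scaled and shifted
    return (col.max() - col.min()) * factor + offset

# axis=0 (default): scale_range is called once per column;
# factor comes from args, offset is forwarded as a keyword argument
col_result = df.apply(scale_range, args=(2,), offset=1)
print(col_result)

# axis=1 with result_type="expand": each row returns a list,
# and the list items become separate columns
expanded = df.apply(
    lambda row: [row["A"] + row["B"], row["A"] * row["B"]],
    axis=1,
    result_type="expand",
)
print(expanded)
```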

In [10]:
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [100, 200, 300]
})

print("Original DataFrame:")
print(df)

# Row-wise aggregation: sum
df_row_sum = df.apply(np.sum, axis=1)

# Row-wise transformation: square each row (works elementwise on the row Series)
df_row_squared = df.apply(lambda row: row**2, axis=1)

print("\nUsing apply to sum rows:")
print(df_row_sum)

print("\nUsing apply to square values row-wise:")
print(df_row_squared)

Original DataFrame:
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300

Using apply to sum rows:
0    111
1    222
2    333
dtype: int64

Using apply to square values row-wise:
   A    B      C
0  1  100  10000
1  4  400  40000
2  9  900  90000


## The `.map()` Method

The `.map()` method applies a function elementwise to a Series or, starting with pandas 2.1, to a DataFrame.

```python
# Basic syntax
dataframe.map(func, na_action=None, **kwargs)
```

| Parameter | Status | Description |
| --- | --- | --- |
| `func` | Required | Function to apply |
| `na_action` | Optional | `'ignore'` leaves missing values (NaN) as they are without passing them to `func`; with `None` (_default_), `func` is called on missing values too |
| `kwargs` | Optional | Additional keyword arguments for the function |

⚠️ _**Note**: `DataFrame.applymap()` was deprecated in pandas 2.1 in favor of `DataFrame.map()`._
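
To make `na_action` more concrete, here is a small sketch on a Series. It also shows that `Series.map()` accepts a dictionary as a lookup table, where values without a matching key become missing. The sample values are invented for the example.

```python
import pandas as pd

s = pd.Series(["a", None, "c"])

# na_action="ignore": str.upper is never called on the missing value
upper = s.map(str.upper, na_action="ignore")
print(upper)

# Series.map() with a dict: unmatched values ("east") become NaN
regions = pd.Series(["north", "south", "east"])
codes = regions.map({"north": "N", "south": "S"})
print(codes)
```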

In [11]:
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [100, 200, 300]
})

print("Original DataFrame:")
print(df)

# Attempted aggregation: sum
# np.sum has no visible effect with .map() because it receives one scalar at a time
df_sum = df.map(np.sum)

# DataFrame transformation: square each value
df_squared = df.map(lambda x: x**2)

print("\nUsing map to sum:")
print(df_sum)

print("\nUsing map to square:")
print(df_squared)

Original DataFrame:
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300

Using map to sum:
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300

Using map to square:
   A    B      C
0  1  100  10000
1  4  400  40000
2  9  900  90000


## Using `.apply()` and `.map()` with `lambda` Functions to Clean Data

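`.apply()` is the more flexible of the two (it works column-wise, row-wise, or, on a Series, elementwise), while `.map()` is limited to elementwise operations and can be the faster option for some of them. Points to keep in mind when using either with a `lambda`:

- Usually more concise and easier to read than an explicit `for` loop over rows or values
- Not automatically faster: a Python function passed to `.apply()` or `.map()` is still called once per column, row, or element, so built-in vectorized methods such as `.str.strip()` are often quicker on large datasets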

### Using a `for` Loop in the Cleaning Function

In [None]:
import pandas as pd
import time

# Define a function to clean a DataFrame
def clean_data(df):

  # Standardize text columns (trim spaces, convert to lowercase)
  # Select all columns in DataFrame with object data type
  for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip().str.lower()

  # Clean and convert the 'Salary' column to numeric
  # pandas Series.str.replace() → regex=False by default since pandas 2.0 (it was True before)
  # passing regex=False explicitly treats "$" literally on every version
  if 'Salary' in df.columns:
    df['Salary'] = df['Salary'].str.replace("$", "", regex=False).str.replace(" ", "", regex=False).astype(float)

  df = df.drop_duplicates()  # Remove duplicate rows
  df = df.dropna()  # Drop missing values (or use fillna if needed)
  return df

# Create a sample DataFrame with duplicates, missing values, and non-standardized text
sample_data = {
    "Name": [" Alice  ", "BOB", "Alice", "charlie", None, "BOB"],
    "Age": [25, 30, 25, 35, 40, 30],
    "City": ["New York", "Los Angeles", "new york", "Chicago", "Chicago", "LOS ANGELES"],
    "Salary": [" $1000 ", " $2000", " $1000", " $3000 ", None, " $2000"]
}

df = pd.DataFrame(sample_data)

print("Original DataFrame:")
print(df)

# Clean the DataFrame with the function
start = time.time()
df_cleaned = clean_data(df)
end = time.time()

print("\nCleaned DataFrame:")
print(df_cleaned)

print(f"\nCleaning took {end - start:.4f} seconds")

Original DataFrame:
       Name  Age         City   Salary
0   Alice     25     New York   $1000 
1       BOB   30  Los Angeles    $2000
2     Alice   25     new york    $1000
3   charlie   35      Chicago   $3000 
4      None   40      Chicago     None
5       BOB   30  LOS ANGELES    $2000

Cleaned DataFrame:
      Name  Age         City  Salary
0    alice   25     new york  1000.0
1      bob   30  los angeles  2000.0
3  charlie   35      chicago  3000.0

Cleaning took 0.0048 seconds


### Using `.apply` in the Cleaning Function Instead

This version replaces the `for` loop with `.apply()`, which shortens the text-cleaning step to a single statement. It also applies the transformations to named columns directly, without the `if 'Salary' in df.columns` check, so it assumes that `Name`, `City`, and `Salary` are all present.

In [None]:
import pandas as pd
import time

# Define a function to clean a DataFrame
def clean_data(df):

  # Clean text columns using .apply instead of a for loop
  df[['Name', 'City']] = df[['Name', 'City']].apply(lambda col: col.str.strip().str.lower())

  # Clean and convert the 'Salary' column to numeric
  # x is a Python string, so str.replace() is always literal
  # no regex flag required
  df['Salary'] = df['Salary'].apply(lambda x: float(x.replace('$', '').strip()) if pd.notnull(x) else x)

  df = df.drop_duplicates()  # Remove duplicate rows
  df = df.dropna()  # Drop missing values (or use fillna if needed)
  return df

# Create a sample DataFrame with duplicates, missing values, and non-standardized text
sample_data = {
    "Name": [" Alice  ", "BOB", "Alice", "charlie", None, "BOB"],
    "Age": [25, 30, 25, 35, 40, 30],
    "City": ["New York", "Los Angeles", "new york", "Chicago", "Chicago", "LOS ANGELES"],
    "Salary": [" $1000 ", " $2000", " $1000", " $3000 ", None, " $2000"]
}

df = pd.DataFrame(sample_data)

print("Original DataFrame:")
print(df)

# Clean the DataFrame with the function
start = time.time()
df_cleaned = clean_data(df)
end = time.time()

print("\nCleaned DataFrame:")
print(df_cleaned)

print(f"\nCleaning took {end - start:.4f} seconds")

Original DataFrame:
       Name  Age         City   Salary
0   Alice     25     New York   $1000 
1       BOB   30  Los Angeles    $2000
2     Alice   25     new york    $1000
3   charlie   35      Chicago   $3000 
4      None   40      Chicago     None
5       BOB   30  LOS ANGELES    $2000

Cleaned DataFrame:
      Name  Age         City  Salary
0    alice   25     new york  1000.0
1      bob   30  los angeles  2000.0
3  charlie   35      chicago  3000.0

Cleaning took 0.0053 seconds


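⚠️ On a DataFrame this small, the timing difference between the two versions is mostly noise. Both rely on the same `.str` methods for the text columns, and a per-element `lambda` in `.apply()` is generally not faster than vectorized methods, so measure on realistic data before choosing one for performance.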
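
One rough way to compare the two styles on a larger column is to time them with `timeit`. The sketch below uses an invented 30,000-row Series and two hypothetical helper functions; the timings depend on the machine and pandas version, so no expected numbers are shown.

```python
import timeit
import pandas as pd

# Invented sample: 30,000 salary strings
s = pd.Series([" $1000 ", " $2000", " $3000 "] * 10_000)

def with_str_methods(series):
    # Vectorized string methods on the whole Series
    return series.str.replace("$", "", regex=False).str.strip().astype(float)

def with_apply(series):
    # Python lambda called once per element
    return series.apply(lambda x: float(x.replace("$", "").strip()))

# Both approaches should produce the same values
same_values = with_str_methods(s).tolist() == with_apply(s).tolist()
print("Same result:", same_values)

t_str = timeit.timeit(lambda: with_str_methods(s), number=5)
t_apply = timeit.timeit(lambda: with_apply(s), number=5)
print(f".str methods: {t_str:.4f} s, .apply(): {t_apply:.4f} s")
```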

## Cleaning & Manipulating Data

### Defining the Cleaning Function

In [None]:
import pandas as pd

# Define a function to clean a DataFrame
def clean_data(df):

  # Clean text columns using .apply
  df[df.select_dtypes(include=['object']).columns] = (
      df.select_dtypes(include=['object']).apply(lambda col: col.str.strip().str.title()))

  # Clean and convert the 'Revenue' column to numeric
  if 'Revenue' in df.columns:
    df['Revenue'] = df['Revenue'].apply(lambda x: float(x.replace("$", "").replace(" ", "")) if isinstance(x, str) else x)

  df = df.drop_duplicates()  # Remove duplicate rows
  df = df.dropna()  # Drop missing values (or use fillna if needed)
  return df

### Creating a Sales DataFrame

In [None]:
# Create a sample DataFrame with sales data
sales_data = {
    "Salesperson": [" Alice  ", "BOB", "Alice", "charlie", "Alice", "BOB"],
    "Region": ["North", "South", "North", "West", "West", "South"],
    "Revenue": [" $5000 ", " $7000", " $5000", " $9000 ", " $11000", " $7000"],
    "Products_Sold": [50, 70, 50, 90, 110, 70]
}

df = pd.DataFrame(sales_data)

print("Original DataFrame:")
print(df)

Original DataFrame:
  Salesperson Region  Revenue  Products_Sold
0     Alice    North   $5000              50
1         BOB  South    $7000             70
2       Alice  North    $5000             50
3     charlie   West   $9000              90
4       Alice   West   $11000            110
5         BOB  South    $7000             70


### Cleaning the DataFrame

In [None]:
# Clean the DataFrame with the function
df_cleaned = clean_data(df)

print("\nCleaned DataFrame:")
print(df_cleaned)


Cleaned DataFrame:
  Salesperson Region  Revenue  Products_Sold
0       Alice  North   5000.0             50
1         Bob  South   7000.0             70
3     Charlie   West   9000.0             90
4       Alice   West  11000.0            110


### Manipulating the Cleaned DataFrame

In [None]:
# Using .apply() to calculate 10% commission on Revenue
df_cleaned["Commission"] = df_cleaned["Revenue"].apply(lambda x: x * 0.10)

print("\nCommission added using .apply():")
print(df_cleaned)


Commission added using .apply():
  Salesperson Region  Revenue  Products_Sold  Commission
0       Alice  North   5000.0             50       500.0
1         Bob  South   7000.0             70       700.0
3     Charlie   West   9000.0             90       900.0
4       Alice   West  11000.0            110      1100.0


In [None]:
# Using .map() to give a 5% sales bonus on Products_Sold
df_cleaned["Sales_Bonus"] = df_cleaned["Products_Sold"].map(lambda x: int(x * 1.05))

print("\nSales_Bonus added using .map():")
print(df_cleaned)


Sales_Bonus added using .map():
  Salesperson Region  Revenue  Products_Sold  Commission  Sales_Bonus
0       Alice  North   5000.0             50       500.0           52
1         Bob  South   7000.0             70       700.0           73
3     Charlie   West   9000.0             90       900.0           94
4       Alice   West  11000.0            110      1100.0          115
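
For simple arithmetic like this, plain column operations give the same values without a `lambda`. The sketch below rebuilds the two numeric columns in a small stand-alone frame (named `sales` here only for the example) and compares both approaches.

```python
import pandas as pd

sales = pd.DataFrame({
    "Revenue": [5000.0, 7000.0, 9000.0, 11000.0],
    "Products_Sold": [50, 70, 90, 110],
})

# 10% commission: elementwise lambda vs. column arithmetic
commission_apply = sales["Revenue"].apply(lambda x: x * 0.10)
commission_vectorized = sales["Revenue"] * 0.10

# 5% bonus, truncated to an integer: lambda vs. column arithmetic
bonus_map = sales["Products_Sold"].map(lambda x: int(x * 1.05))
bonus_vectorized = (sales["Products_Sold"] * 1.05).astype(int)

print(commission_apply.tolist() == commission_vectorized.tolist())
print(bonus_map.tolist() == bonus_vectorized.tolist())
```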


## Grouping & Pivot Tables

This section uses `seaborn`’s built‑in `mpg` dataset to illustrate data cleaning and manipulation with grouping and pivot tables.

### Importing `mpg` Dataset

In [None]:
import seaborn as sns
import pandas as pd

mpg = sns.load_dataset('mpg')
mpg.head()

|  | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


### Inspect and Clean `mpg` Dataset

In [None]:
# Check basic information about the dataset
print("Dataset Info:")
print(mpg.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None


In [None]:
# Check for missing values
print("\nMissing Values:")
print(mpg.isnull().sum())


Missing Values:
mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64


In [None]:
# isna() is an alias of isnull(), so the counts match the previous cell
print("\nNaN Values:")
print(mpg.isna().sum())


NaN Values:
mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64


In [None]:
# Drop rows with any NaN values (inplace modifies the DataFrame directly)
mpg.dropna(inplace=True)

In [None]:
# Recheck for NaN values
print("\nNaN Values:")
print(mpg.isna().sum())


NaN Values:
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64
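
Dropping every row with a missing value is only one option. As a hedged sketch on a small invented frame (not the real `mpg` data), the code below contrasts dropping rows for a single column with filling the gaps using the column median. Both calls return a new DataFrame instead of modifying the original in place.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 21.0],
    "horsepower": [130.0, np.nan, 67.0, np.nan],
})

# Option 1: drop only the rows where horsepower is missing
dropped = cars.dropna(subset=["horsepower"])

# Option 2: keep every row and fill the gaps with the column median
filled = cars.fillna({"horsepower": cars["horsepower"].median()})

print(dropped)
print(filled)
```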


In [None]:
# Check for duplicate rows
print("Duplicate Rows:", mpg.duplicated().sum())

Duplicate Rows: 0


### Grouping `mpg` Dataset

In [None]:
# Apply groupby operation - average horsepower by origin
grouped_hp = mpg.groupby('origin')['horsepower'].mean()
print("\nAverage Horsepower by Origin:")
print(grouped_hp)


Average Horsepower by Origin:
origin
europe     80.558824
japan      79.835443
usa       119.048980
Name: horsepower, dtype: float64
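
A grouped object can also compute several statistics at once with `.agg()`. The sketch below uses named aggregation on a tiny invented frame with the same column names as `mpg`; the output column names `mean_hp` and `n_cars` are chosen only for this example.

```python
import pandas as pd

cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "europe", "japan"],
    "horsepower": [130.0, 165.0, 88.0, 95.0, 67.0],
})

# Named aggregation: new_column=(source_column, function)
summary = cars.groupby("origin").agg(
    mean_hp=("horsepower", "mean"),
    n_cars=("horsepower", "count"),
)
print(summary)
```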


### Pivot Table on `mpg` Dataset

In [None]:
# Apply pivot table - Average MPG by origin and cylinders
pivot_mpg = mpg.pivot_table(values='mpg', index='origin', columns='cylinders', aggfunc='mean')
print("\nPivot Table - MPG by Origin and Cylinders:")
print(pivot_mpg)


Pivot Table - MPG by Origin and Cylinders:
cylinders      3          4          5          6          8
origin                                                      
europe       NaN  28.106557  27.366667  20.100000        NaN
japan      20.55  31.595652        NaN  23.883333        NaN
usa          NaN  28.013043        NaN  19.645205  14.963107
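
The `NaN` cells mark origin and cylinder combinations that have no rows in the data. The sketch below, on a small invented frame, shows that the same table can be built with `groupby()` plus `unstack()`, and that `fill_value` replaces the empty cells if a placeholder is preferred.

```python
import pandas as pd

cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe"],
    "cylinders": [8, 8, 4, 4, 4],
    "mpg": [18.0, 16.0, 30.0, 32.0, 26.0],
})

# Pivot table: mean mpg per origin (rows) and cylinders (columns)
pivot = cars.pivot_table(values="mpg", index="origin", columns="cylinders", aggfunc="mean")

# Equivalent result with groupby + unstack
via_groupby = cars.groupby(["origin", "cylinders"])["mpg"].mean().unstack()

# Same pivot, with empty combinations filled with 0
filled = cars.pivot_table(
    values="mpg", index="origin", columns="cylinders", aggfunc="mean", fill_value=0
)

print(pivot)
print(via_groupby)
print(filled)
```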


## Exercise 1: Standardizing Names

Create a function to **clean names** by:

- Removing extra spaces
- Converting names to **title case**
- Replacing any special characters (e.g., `é` → `e`)

Then, use **`apply` with a lambda function** to clean the `"Name"` column.

```python
data = {'Name': [' john doe ', 'Jane SMITH', 'michael O’CONNOR', 'sAm   wilson', 'Mary-jane']}
```

In [None]:
import pandas as pd

def clean_names(df):
  df['Name'] = df['Name'].apply(lambda x: x.strip().title().replace('é', 'e'))
  return df

data = {'Name': [' john doe ', 'Jane SMITH', 'michael O’CONNOR', 'sAm   wilson', 'Mary-jane']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

clean_df = clean_names(df)

print("\nCleaned DataFrame:")
print(clean_df)

Original DataFrame:
               Name
0         john doe 
1        Jane SMITH
2  michael O’CONNOR
3      sAm   wilson
4         Mary-jane

Cleaned DataFrame:
               Name
0          John Doe
1        Jane Smith
2  Michael O’Connor
3      Sam   Wilson
4         Mary-Jane
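
In the output above, `Sam   Wilson` still contains extra inner spaces, because `strip()` only trims the ends, and only `é` is replaced. One possible extension, sketched here with a hypothetical helper `clean_name`, collapses inner whitespace with `split()`/`join()` and removes accents in general using the standard-library `unicodedata` module. The accented sample name is invented for the example.

```python
import unicodedata
import pandas as pd

def clean_name(name):
    # Collapse runs of whitespace and trim both ends
    name = " ".join(name.split())
    # Decompose accented characters, then drop the combining marks (é → e)
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return name.title()

names = pd.DataFrame({"Name": [" john doe ", "sAm   wilson", " josé  garcía "]})
names["Name"] = names["Name"].apply(lambda x: clean_name(x))
print(names)
```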


## Exercise 2: Converting Age to Integers

Create a function to:

- Convert **text-based numbers** (e.g., `"twenty-one"`) into **integers**.
- Remove any **non-numeric values**.

Then, **apply the function with a lambda** to clean the `"Age"` column.

```python
data = {'Age': ['25', 'Thirty-two', 40, 'twenty-one', 'N/A']}
```

In [None]:
import pandas as pd

def text2num(text):
  try:
    replace = {
        "thirty-two": 32,
        "twenty-one": 21}
    text = str(text).lower().strip()
    if text in replace:
      return replace[text]
    return int(text)
  except ValueError:  # int() raises ValueError for text it cannot parse
    return 0

data = {'Age': ['25', 'Thirty-two', 40, 'twenty-one', 'N/A']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

df['Clean_Age'] = df['Age'].apply(lambda x: text2num(x))

print("\nCleaned DataFrame:")
print(df)

Original DataFrame:
          Age
0          25
1  Thirty-two
2          40
3  twenty-one
4         N/A

Cleaned DataFrame:
          Age  Clean_Age
0          25         25
1  Thirty-two         32
2          40         40
3  twenty-one         21
4         N/A          0
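
Returning `0` keeps the `N/A` row with an age that looks valid. If the goal is to remove non-numeric values, one alternative is to return a missing value and drop it afterwards. The sketch below assumes a hypothetical helper `text_to_age` and the same two word-based ages as above.

```python
import pandas as pd

WORDS = {"thirty-two": 32, "twenty-one": 21}

def text_to_age(value):
    text = str(value).strip().lower()
    if text in WORDS:
        return WORDS[text]
    try:
        return int(text)
    except ValueError:
        return None  # mark unparseable values as missing

ages = pd.DataFrame({"Age": ["25", "Thirty-two", 40, "twenty-one", "N/A"]})

# Nullable integer dtype keeps whole numbers alongside missing values
ages["Clean_Age"] = ages["Age"].apply(lambda x: text_to_age(x)).astype("Int64")

# Remove the rows that could not be converted
valid_ages = ages.dropna(subset=["Clean_Age"])
print(valid_ages)
```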


## Exercise 3: Cleaning Salary Data

Create a function to:

- **Remove currency symbols** (`$`, `€`, `₹`).
- **Convert text-based salaries** like `"50K"` into numbers (`50000`).
- Handle missing or invalid values properly.

Then, **apply the function with a lambda** to clean the `"Salary"` column.

```python
data = {'Salary': ['$50,000', '€60K', '75000', 'Eighty Thousand', 'N/A']}
```

In [None]:
import pandas as pd

def clean_salary(salary):
  try:
    s = str(salary).lower().strip()

    if s == "eighty thousand":
      return 80000

    for sym in ["$", "€", "₹", ","]:
      s = s.replace(sym, "")

    if s.endswith("k") and s[:-1].isdigit():
      return int(s[:-1]) * 1000

    return float(s)
  except ValueError:  # float() raises ValueError for text it cannot parse
    return 0


data = {'Salary': ['$50,000', '€60K', '75000', 'Eighty Thousand', 'N/A']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

df['Clean_Salary'] = df['Salary'].apply(lambda x: clean_salary(x))

print("\nCleaned DataFrame:")
print(df)

Original DataFrame:
            Salary
0          $50,000
1             €60K
2            75000
3  Eighty Thousand
4              N/A

Cleaned DataFrame:
            Salary  Clean_Salary
0          $50,000       50000.0
1             €60K       60000.0
2            75000       75000.0
3  Eighty Thousand       80000.0
4              N/A           0.0
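
The `isdigit()` check above means a value such as `"1.5K"` would fall through to the `0` fallback. A slightly more general sketch, using a hypothetical helper `parse_salary`, applies the `K` multiplier after converting to `float` and returns `NaN` for anything it cannot parse. Word-based salaries such as `"Eighty Thousand"` are deliberately not handled here and also become `NaN`.

```python
import math
import pandas as pd

def parse_salary(value):
    text = str(value).strip().lower()
    # Remove currency symbols and thousands separators
    for sym in ("$", "€", "₹", ","):
        text = text.replace(sym, "")
    # Treat a trailing "k" as a factor of 1000
    multiplier = 1
    if text.endswith("k"):
        multiplier, text = 1000, text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return float("nan")  # missing or invalid value

salaries = pd.DataFrame({"Salary": ["$50,000", "€60K", "1.5k", "N/A"]})
salaries["Clean_Salary"] = salaries["Salary"].apply(lambda x: parse_salary(x))
print(salaries)
```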


## Exercise 4: Standardizing Dates

Create a function to:

- Convert **various date formats** into **YYYY-MM-DD**.
- Handle missing or incorrectly formatted dates properly.

Then, **apply the function with a lambda** to clean the `"Date Joined"` column.

```python
data = {'Date Joined': ['01/15/2020', '2021-02-20', 'March 10, 2019', 'April 25, 20', 'N/A']}
```

In [None]:
import pandas as pd

def clean_dates(val):
  try:
    val = pd.to_datetime(val)
    return val.strftime('%Y-%m-%d')
  except (ValueError, TypeError):  # unparseable text or unsupported input type
    return pd.NaT

data = {'Date Joined': ['01/15/2020', '2021-02-20', 'March 10, 2019', 'April 25, 20', 'N/A']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

df['Clean_Date'] = df['Date Joined'].apply(lambda x: clean_dates(x))

print("\nCleaned DataFrame:")
print(df)

Original DataFrame:
      Date Joined
0      01/15/2020
1      2021-02-20
2  March 10, 2019
3    April 25, 20
4             N/A

Cleaned DataFrame:
      Date Joined  Clean_Date
0      01/15/2020  2020-01-15
1      2021-02-20  2021-02-20
2  March 10, 2019  2019-03-10
3    April 25, 20  2020-04-25
4             N/A         NaT
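
An alternative to `try`/`except` is the `errors="coerce"` option of `pd.to_datetime()`, which returns `NaT` for values it cannot parse. The sketch below uses a hypothetical helper `to_iso_date` and leaves out the ambiguous `'April 25, 20'` entry, since how a two-digit year is interpreted depends on the parser.

```python
import pandas as pd

def to_iso_date(value):
    # errors="coerce" yields NaT instead of raising on bad input
    ts = pd.to_datetime(value, errors="coerce")
    return ts.strftime("%Y-%m-%d") if pd.notna(ts) else None

joined = pd.DataFrame({"Date Joined": ["01/15/2020", "2021-02-20", "March 10, 2019", "N/A"]})
joined["Clean_Date"] = joined["Date Joined"].apply(lambda x: to_iso_date(x))
print(joined)
```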
