Data cleaning: remove empty strings/values and replace them with None/NaN

In [7]:
import pandas as pd
import numpy as np

Function that replaces empty cells with NaN

In [8]:
def replace_empty_with_none(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace all empty or whitespace-only strings in a DataFrame with NaN,
    which pandas treats as a missing value just like None.
    """
    cleaned_df = df.copy()
    cleaned_df = cleaned_df.replace(r'^\s*$', np.nan, regex=True)
    return cleaned_df
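
A quick usage sketch of the same regex replace on a frame with mixed column types. The column names and values below are made up for illustration; the point is only that the pattern `^\s*$` matches empty or whitespace-only string cells and leaves numeric columns alone.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column, one numeric column
df = pd.DataFrame({
    "name": ["Anna", "", "   "],
    "price": [100, 200, 300],
})

# Same call as in replace_empty_with_none above
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

print(cleaned["name"].isna().tolist())   # [False, True, True]
print(cleaned["price"].tolist())         # [100, 200, 300]
```

`replace` returns a new DataFrame here, so the original `df` still holds the empty strings.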


Test for replacing empty cells

In [9]:
def test_replace_empty_with_none():
    # arrange
    df = pd.DataFrame({
        "A": ["", " ", "Hello", None]
    })

    # act
    cleaned = replace_empty_with_none(df)

    # assert
    assert isinstance(cleaned, pd.DataFrame), "Output should be a pandas DataFrame"

    assert cleaned.shape == df.shape, "Shape of DataFrame should not change"

    # pd.isna(value) → checks if a value (or values) is missing (i.e., NaN, None, or NaT)
    assert pd.isna(cleaned.loc[0, "A"]), "Row 0, Col A should be NaN"
    assert pd.isna(cleaned.loc[1, "A"]), "Row 1, Col A should be NaN"

    assert cleaned.loc[2, "A"] == "Hello", "Non-empty text should remain the same"
   
    nan_count = cleaned.isna().sum().sum()
    assert nan_count == 3, f"Expected 3 NaN values, got {nan_count}"

    print("✅ All assertions passed - function works correctly!")

# run it
test_replace_empty_with_none()

✅ All assertions passed - function works correctly!


Remove duplicate rows that share the same ID

In [3]:
def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows with a duplicate ID, keeping the first occurrence of each ID.
    """
    cleaned_df = df.drop_duplicates(subset=["ID"])
    return cleaned_df


Test for the function that removes duplicates with the same ID

In [4]:
test_df = pd.DataFrame({
    "ID": [101, 101],
    "Owner’s Name": ["Michiel", "Hennuyères"],
    "City/Region ": ["Paris", "Lyon"],
    "Sale-Price (€)": [250000, 310000],
    "  Date of-Sale ": ["2025-01-15", "2025-03-10"]
})

print("🧾 Original dataframe:")
print(test_df)

# === Remove duplicates ===
cleaned_df = remove_duplicates(test_df)

print("\nCleaned dataframe:")
print(cleaned_df)

🧾 Original dataframe:
    ID Owner’s Name City/Region   Sale-Price (€)   Date of-Sale 
0  101      Michiel        Paris          250000      2025-01-15
1  101   Hennuyères         Lyon          310000      2025-03-10

Cleaned dataframe:
    ID Owner’s Name City/Region   Sale-Price (€)   Date of-Sale 
0  101      Michiel        Paris          250000      2025-01-15
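
Since `drop_duplicates` drops rows silently, it can help to look at the affected rows first. A minimal sketch, assuming the ID column is called `ID` as above; the `City` column and its values are invented for the example. `DataFrame.duplicated` with `keep=False` flags every row whose ID occurs more than once.

```python
import pandas as pd

# Hypothetical sample data with one repeated ID
df = pd.DataFrame({
    "ID": [101, 101, 102],
    "City": ["Paris", "Lyon", "Nice"],
})

# keep=False marks all rows of a duplicated ID, not only the later ones
dupe_mask = df.duplicated(subset=["ID"], keep=False)
print("Rows sharing an ID:")
print(df[dupe_mask])

# drop_duplicates keeps the first occurrence by default (keep="first")
deduped = df.drop_duplicates(subset=["ID"])

assert deduped["ID"].is_unique, "IDs should be unique after cleaning"
assert deduped["City"].tolist() == ["Paris", "Nice"], "First row per ID should be kept"
```

Which row to keep is a design choice: `keep="last"` would retain the Lyon row instead, so the default is only right when the first record is the one to trust.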


The column headers from the CSV should be converted to snake_case, because that makes the columns easier to select in a DataFrame:

df.property_id
vs
df["Property ID"]

In [5]:
import pandas as pd
import re
import unicodedata

def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """
    Return a copy of the DataFrame with normalized, snake_case column names.
    - Removes accents
    - Converts to lowercase
    - Replaces spaces and symbols with underscores
    - Removes non-alphanumeric characters
    - Collapses multiple underscores
    """
    def clean(col):
        # Normalize accents (é → e)
        col = unicodedata.normalize('NFKD', col)
        col = ''.join(c for c in col if not unicodedata.combining(c))
        
        # Lowercase and replace separators
        col = col.lower()
        col = re.sub(r"[ \-()/.,:;+]", "_", col)
        
        # Remove remaining special characters
        col = re.sub(r"[^0-9a-z_]", "", col)
        
        # Collapse multiple underscores and trim edges
        col = re.sub(r"_+", "_", col).strip("_")
        
        return col

    df_copy = df.copy()
    df_copy.columns = [clean(str(col)) for col in df_copy.columns]
    return df_copy




In [6]:
test_df = pd.DataFrame({
    "Property ID (Ref#)": [101, 102],
    "Owner’s Name": ["Michiel", "Hennuyères"],
    "City/Region ": ["Paris", "Lyon"],
    "Sale-Price (€)": [250000, 310000],
    "  Date of-Sale ": ["2025-01-15", "2025-03-10"]
})

print("🧾 Original columns:")
print(test_df.columns.tolist())

# === Normalize ===
clean_df = normalize_column_names(test_df)

print("\nCleaned columns:")
print(clean_df.columns.tolist())

print("\nCleaned dataframe:")
print(clean_df.head())

print("\nSee the columns")

print(clean_df.columns)

print("\nSelect the column:")

print(clean_df["property_id_ref"])
print("\nOr select it this way:")
print(clean_df.property_id_ref)



🧾 Original columns:
['Property ID (Ref#)', 'Owner’s Name', 'City/Region ', 'Sale-Price (€)', '  Date of-Sale ']

Cleaned columns:
['property_id_ref', 'owners_name', 'city_region', 'sale_price', 'date_of_sale']

Cleaned dataframe:
   property_id_ref owners_name city_region  sale_price date_of_sale
0              101     Michiel       Paris      250000   2025-01-15
1              102  Hennuyères        Lyon      310000   2025-03-10

See the columns
Index(['property_id_ref', 'owners_name', 'city_region', 'sale_price',
       'date_of_sale'],
      dtype='object')

Select the column:
0    101
1    102
Name: property_id_ref, dtype: int64

Or select it this way:
0    101
1    102
Name: property_id_ref, dtype: int64
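
The cell above only prints the result. An assert-based check in the style of `test_replace_empty_with_none` could look like the sketch below. The function body is repeated in compact form so the cell runs on its own, and the `Prix Élevé` column is a made-up header used to exercise the accent handling.

```python
import re
import unicodedata

import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the function above, repeated so this cell is self-contained
    def clean(col: str) -> str:
        col = unicodedata.normalize("NFKD", col)
        col = "".join(c for c in col if not unicodedata.combining(c))
        col = re.sub(r"[ \-()/.,:;+]", "_", col.lower())
        col = re.sub(r"[^0-9a-z_]", "", col)
        return re.sub(r"_+", "_", col).strip("_")

    df_copy = df.copy()
    df_copy.columns = [clean(str(col)) for col in df_copy.columns]
    return df_copy


def test_normalize_column_names():
    # arrange (hypothetical headers)
    df = pd.DataFrame({
        "Property ID (Ref#)": [101],
        "Prix Élevé": [1],
        "  Date of-Sale ": ["2025-01-15"],
    })

    # act
    cleaned = normalize_column_names(df)

    # assert
    expected = ["property_id_ref", "prix_eleve", "date_of_sale"]
    assert cleaned.columns.tolist() == expected, "Headers should be snake_case"
    assert df.columns.tolist()[0] == "Property ID (Ref#)", "Input must not be modified"
    # Attribute access (df.name) only works for valid Python identifiers
    assert all(name.isidentifier() for name in cleaned.columns)
    assert cleaned.property_id_ref.tolist() == [101]


# run it
test_normalize_column_names()
```

Attribute access is a convenience only: a column whose cleaned name matches an existing DataFrame attribute or method still has to be selected with brackets.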
