# <font color="#418FDE" size="6.5" uppercase>**Cleaning Columns**</font>

>Last update: 20251225.
    
By the end of this Lecture, you will be able to:
- Handle missing values in DataFrames using strategies such as dropping, filling, and interpolation with Pandas 2.3.1 methods. 
- Standardize text and categorical columns by trimming, case-normalizing, and mapping inconsistent labels. 
- Convert column dtypes to appropriate numeric, datetime, and categorical types to support accurate computations. 


## **1. Handling Missing Data**

### **1.1. Finding Missing Values**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_01_01.jpg?v=1766709096" width="250">



>* Identify how missing values are encoded and stored
>* Treat missing values as unknowns, not real data

>* Use column summaries to spot missing data patterns
>* Inspect specific rows to see systematic missingness

>* Detect disguised placeholders and treat them as missing
>* Profile columns, flag invalid values, then clean



In [None]:
#@title Python Code - Finding Missing Values

# Show how pandas marks missing values clearly.
# Create a tiny DataFrame with different missing markers.
# Summarize missing value counts for each column.

import pandas as pd

# Create a small example dataset with different missing styles.
data = {
    "height_inches": [70, None, 65, -1],
    "weight_pounds": [180, 150, "N/A", 200],
    "city": ["New York", "", "unknown", "Chicago"],
}

# Build a DataFrame from the dictionary data.
df = pd.DataFrame(data)

# Replace placeholder codes with proper missing-value markers.
df_clean = df.replace({"N/A": pd.NA, "": pd.NA, -1: pd.NA, "unknown": pd.NA})

# Show the cleaned DataFrame to inspect missing markers visually.
print("Cleaned DataFrame with standardized missing markers:")
print(df_clean)

# Use isna with sum to count missing values per column.
missing_counts = df_clean.isna().sum()

# Print missing counts to understand missingness pattern quickly.
print("\nMissing values per column:")
print(missing_counts)
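
The bullets above also mention column summaries and row-level inspection. A minimal sketch of both follows, using a hypothetical `survey` DataFrame that is not part of the lecture data: `isna().mean()` gives the share of missing values per column, and a boolean mask pulls out the rows that contain at least one gap.

```python
import pandas as pd

# Hypothetical survey data (illustrative only); None marks missing entries.
survey = pd.DataFrame(
    {
        "age": [34, None, 29, None],
        "income": [52000, 61000, None, None],
        "city": ["Boston", "Denver", "Austin", "Miami"],
    }
)

# Share of missing values per column (0.0 is complete, 1.0 is entirely missing).
missing_share = survey.isna().mean()
print("Share of missing values per column:")
print(missing_share)

# Rows with at least one missing value, useful for spotting systematic gaps.
rows_with_gaps = survey[survey.isna().any(axis=1)]
print("\nRows containing at least one missing value:")
print(rows_with_gaps)
```

Looking at the rows themselves, rather than only the counts, can reveal whether gaps cluster in particular records.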



### **1.2. Choosing Dropna Or Fillna**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_01_02.jpg?v=1766709244" width="250">



>* Choose between dropping or filling missing entries
>* Dropping works for rare, random gaps; beware bias

>* Fill gaps to keep all valuable observations
>* Choose constants or statistics, acknowledging strong assumptions

>* Weigh data loss, missingness pattern, and assumptions
>* Mix drop, fill, and documentation to manage bias



In [None]:
#@title Python Code - Choosing Dropna Or Fillna

# Demonstrate choosing dropna or fillna for missing values in columns.
# Show how dropping removes rows and filling keeps all observations.
# Compare effects on a tiny sales DataFrame with missing discounts.

import pandas as pd

# Create small sales DataFrame with some missing discount values.
data = {"order_id": [1, 2, 3, 4], "amount_dollars": [50, 80, 40, 60], "discount_dollars": [0, None, 5, None]}

df = pd.DataFrame(data)

# Show original data with missing discount values clearly visible.
print("Original DataFrame with missing discounts:")
print(df)

# Drop rows where discount is missing, keep only complete discount information.
dropped = df.dropna(subset=["discount_dollars"])

print("\nAfter dropping rows with missing discounts:")
print(dropped)

# Fill missing discounts with zero dollars, assuming no discount was applied.
filled = df.fillna({"discount_dollars": 0})

print("\nAfter filling missing discounts with zero dollars:")
print(filled)
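
Filling with a statistic instead of a constant, and documenting which values were filled, could look like the sketch below. The `orders` DataFrame and its column names are hypothetical, and the median is only one possible choice of statistic.

```python
import pandas as pd

# Hypothetical orders data (illustrative only) with missing shipping times.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3, 4, 5], "shipping_days": [2.0, None, 4.0, 3.0, None]}
)

# Record which rows were missing before filling, so the choice stays documented.
orders["shipping_days_was_missing"] = orders["shipping_days"].isna()

# Fill gaps with the median of the observed values (assumes gaps are roughly random).
median_days = orders["shipping_days"].median()
orders["shipping_days_filled"] = orders["shipping_days"].fillna(median_days)

print(orders)
```

Keeping the indicator column lets later analysis check whether filled rows behave differently from observed ones.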



### **1.3. Interpolation For Missing Values**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_01_03.jpg?v=1766709270" width="250">



>* Interpolation estimates ordered missing values using neighbors
>* Preserves series continuity with methods matching data behavior

>* Linear interpolation estimates values between nearby points
>* Pandas offers time-based interpolation and forward/backward filling

>* Use interpolation carefully; values are only estimates
>* Limit to short gaps and combine with other methods



In [None]:
#@title Python Code - Interpolation For Missing Values

# Demonstrate simple interpolation for missing numeric values in ordered data.
# Show differences between raw data and interpolated temperature readings.
# Use pandas interpolate methods with a small time indexed DataFrame.

import pandas as pd
import numpy as np

# Create hourly temperature data with some missing values in degrees Fahrenheit.
times = pd.date_range(start="2024-01-01 08:00", periods=6, freq="h")
values = [68.0, np.nan, np.nan, 74.0, 75.0, np.nan]

df = pd.DataFrame({"temperature_F": values}, index=times)

# Show original data with missing values clearly visible for comparison.
print("Original temperature data with missing values:")
print(df)

# Apply simple linear interpolation based on position between known values.
df_linear = df.interpolate(method="linear", limit_direction="forward")

# Apply time based interpolation using actual timestamp distances.
df_time = df.interpolate(method="time", limit_direction="forward")

# Combine original and interpolated values into one comparison DataFrame.
combined = pd.concat([df, df_linear, df_time], axis=1)
combined.columns = ["original_F", "linear_F", "time_based_F"]

# Print comparison to see how interpolation filled missing temperature values.
print("\nComparison of original and interpolated temperatures:")
print(combined)
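
To keep interpolation conservative, as the bullets above suggest, the `limit` and `limit_area` parameters of `interpolate` can restrict how much is filled. The sketch below uses a hypothetical `readings` Series. Note that `limit=1` fills only the first missing value of each gap; it does not skip long gaps entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings (illustrative only) with short and long gaps.
readings = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, np.nan, 22.0, np.nan])

# Fill only gaps that lie between valid values; the trailing NaN is left alone.
inside_only = readings.interpolate(method="linear", limit_area="inside")

# Additionally fill at most one consecutive NaN per gap.
short_gaps = readings.interpolate(method="linear", limit=1, limit_area="inside")

# Forward fill carries the last known value ahead, here by at most one step.
carried = readings.ffill(limit=1)

comparison = pd.DataFrame(
    {
        "raw": readings,
        "inside_only": inside_only,
        "short_gaps": short_gaps,
        "carried": carried,
    }
)
print(comparison)
```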



## **2. Text Cleaning Basics**

### **2.1. Trimming And Lowercasing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_02_01.jpg?v=1766709297" width="250">



>* Trim spaces and lowercase to standardize text
>* Prevents identical values being counted as different

>* Inconsistent case or padding splits one value into several categories
>* Trim and lowercase before grouping, counting, or joining

>* Use trimming and lowercasing, but think first
>* Protect codes where case or spaces matter



In [None]:
#@title Python Code - Trimming And Lowercasing

# Demonstrate trimming spaces in text columns using pandas string methods.
# Demonstrate converting text to lowercase for consistent comparisons.
# Show before and after cleaning for a small example DataFrame.

import pandas as pd

# Create a small DataFrame with messy city and country text values.
raw_data = {"city": [" New York ", "los angeles", "Chicago  "], "country": [" USA", "usa ", "UsA"]}

# Build the DataFrame from the raw_data dictionary using pandas.
df = pd.DataFrame(raw_data)

# Show the original messy text values before any cleaning operations.
print("Original DataFrame with messy text values:\n", df)

# Trim spaces from both ends of each string in city and country columns.
df["city"] = df["city"].str.strip()

df["country"] = df["country"].str.strip()

# Convert all letters in city and country columns to lowercase consistently.
df["city"] = df["city"].str.lower()

df["country"] = df["country"].str.lower()

# Show the cleaned DataFrame after trimming and lowercasing operations.
print("\nCleaned DataFrame with trimmed lowercase text:\n", df)
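
The last bullet pair above warns about codes where case or spaces matter. One way to respect that, sketched here with a hypothetical `df_codes` DataFrame, is to normalize only an explicit list of columns and leave identifier columns untouched.

```python
import pandas as pd

# Hypothetical data (illustrative only): city is free text, product_code is case-sensitive.
df_codes = pd.DataFrame(
    {
        "city": [" Boston", "boston ", "BOSTON"],
        "product_code": ["aB12", "Ab12", "AB12"],
    }
)

# Columns judged safe to normalize; product_code is deliberately excluded.
text_columns = ["city"]

for column in text_columns:
    df_codes[column] = df_codes[column].str.strip().str.lower()

print("Unique cities after cleaning:", df_codes["city"].nunique())
print("Unique product codes (unchanged):", df_codes["product_code"].nunique())
```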



### **2.2. Standardizing Category Labels**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_02_02.jpg?v=1766709318" width="250">



>* Standardize categories so each concept has one label
>* Map all label variations to chosen canonical forms

>* Inspect unique values, find variants of same category
>* Map variants to canonical labels and document choices

>* Balance merging categories with keeping important distinctions
>* Create clear rules for unknowns and edge cases



In [None]:
#@title Python Code - Standardizing Category Labels

# Demonstrate standardizing messy category labels using simple Pandas mapping.
# Show how different text variants become one consistent canonical category label.
# Help beginners see why label standardization improves clean analysis results.

import pandas as pd

# Create a small DataFrame with messy country labels and product categories.
data = {
    "country": ["USA", "U.S.A.", "US", "United States", "usa"],
    "product": ["Electronics", "electronics", "Elec.", "Elec", "ELECTRONICS"],
}

# Build the DataFrame from the dictionary using Pandas DataFrame constructor.
df = pd.DataFrame(data)

# Show the original messy categories before any cleaning or standardization.
print("Original categories table before standardization:\n", df)

# Define canonical mapping for country labels using a simple Python dictionary.
country_map = {
    "usa": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states": "United States",
}

# Convert country text to lowercase, map variants, and keep unmapped labels unchanged.
df["country_clean"] = df["country"].str.lower().map(country_map).fillna(df["country"])

# Define canonical mapping for product labels using lowercase keys and full names.
product_map = {
    "electronics": "Electronics",
    "elec.": "Electronics",
    "elec": "Electronics",
}

# Convert product text to lowercase, map variants, and keep unmapped labels unchanged.
df["product_clean"] = df["product"].str.lower().map(product_map).fillna(df["product"])

# Show the cleaned categories and compare them with the original messy values.
print("\nStandardized categories table after mapping variants:\n", df)
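
Before writing a mapping, it helps to inspect the distinct labels, and after mapping, to flag anything the dictionary did not cover instead of silently assigning it a category. The sketch below assumes a hypothetical `labels` Series and `canonical` dictionary.

```python
import pandas as pd

# Hypothetical raw labels (illustrative only), including a typo not in the mapping.
labels = pd.Series(["USA", "U.S.A.", "usa", "Canada", "Untied States"], name="country")

# Inspect distinct raw labels and their frequencies first.
print("Raw label counts:")
print(labels.value_counts())

# Hypothetical canonical mapping with lowercase keys.
canonical = {"usa": "United States", "u.s.a.": "United States", "canada": "Canada"}

# Normalize, then map; labels without a mapping entry become NaN.
mapped = labels.str.strip().str.lower().map(canonical)

# Collect the original labels that were not mapped, for manual review.
unmapped = labels[mapped.isna()]
print("\nLabels needing review:")
print(unmapped.tolist())
```

Unmapped labels can then be added to the dictionary or assigned an explicit "unknown" category, depending on the rule chosen for edge cases.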



### **2.3. Whitespace And Symbol Cleanup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_02_03.jpg?v=1766709347" width="250">



>* Messy whitespace and stray symbols distort text columns
>* Left uncleaned, they create spurious categories and inflate unique counts

>* Invisible characters such as non-breaking spaces make identical-looking values differ
>* Normalize spaces and dashes to unify categories

>* Decide which symbols are meaningful or noise
>* Strip unwanted symbols and standardize remaining characters



In [None]:
#@title Python Code - Whitespace And Symbol Cleanup

# Demonstrate cleaning messy whitespace and symbols in Pandas text columns.
# Show how similar-looking strings become consistent categories.
# Use simple replacements for spaces, dashes, and decorative symbols.

import pandas as pd

# Create a small DataFrame with messy city and department text values.
data = {
    "city": ["New York", "New  York", "New\u00a0York", "New York."],
    "department": ["Sales", "Sales!", "Sales ", "Sales✅"],
}

df = pd.DataFrame(data)

# Display the original messy text values for both columns.
print("Original values with messy whitespace and symbols:\n")
print(df)

# Replace non breaking spaces and tabs with normal spaces in city column.
df["city_clean"] = df["city"].str.replace("\u00a0", " ", regex=False).str.replace("\t", " ", regex=False)

# Remove trailing punctuation like periods and commas from city values.
df["city_clean"] = df["city_clean"].str.replace(r"[.,]+$", "", regex=True)

# Collapse multiple spaces into a single space and strip edges in city values.
df["city_clean"] = df["city_clean"].str.replace(r"\s+", " ", regex=True).str.strip()

# Remove decorative symbols and emojis from department values.
df["dept_clean"] = df["department"].str.replace(r"[^A-Za-z ]", "", regex=True)

# Collapse spaces and strip edges in department cleaned values.
df["dept_clean"] = df["dept_clean"].str.replace(r"\s+", " ", regex=True).str.strip()

# Show cleaned columns and compare unique category counts before and after.
print("\nCleaned values with normalized whitespace and symbols:\n")
print(df[["city_clean", "dept_clean"]])

print("\nUnique city values before and after cleaning:")
print(len(df["city"].unique()), "before,", len(df["city_clean"].unique()), "after")
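
Dash normalization, mentioned in the bullets above, can be handled in the same style. This sketch assumes a hypothetical `regions` Series where the same label appears with a hyphen, an en dash (U+2013), an em dash (U+2014), and a spaced hyphen.

```python
import pandas as pd

# Hypothetical region labels (illustrative only) that differ only in dash style.
regions = pd.Series(["North-East", "North\u2013East", "North\u2014East", "North - East"])

# Replace en dashes and em dashes with a plain hyphen.
cleaned = regions.str.replace("[\u2013\u2014]", "-", regex=True)

# Remove any spaces around the hyphen so spacing variants collapse too.
cleaned = cleaned.str.replace(r"\s*-\s*", "-", regex=True)

print("Unique values before:", regions.nunique())
print("Unique values after:", cleaned.nunique())
print(cleaned.tolist())
```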



## **3. Converting Column Types**

### **3.1. Numeric Conversion Errors**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_03_01.jpg?v=1766709375" width="250">



>* Messy nonnumeric entries often break numeric conversion
>* These errors reveal data quality issues and risks

>* Different locales write numbers with varying separators
>* Mismatched formats cause failed conversions and bias

>* Text-like numbers break sorting and calculations
>* Consistent error handling keeps numeric columns reliable



In [None]:
#@title Python Code - Numeric Conversion Errors

# Demonstrate numeric conversion errors with messy column values.
# Show how invalid strings become missing numeric values using Pandas.
# Compare original strings and converted numeric column side by side.

import pandas as pd

# Create a small DataFrame with messy numeric looking strings.
data = {"sales_dollars": ["1200", "2,500", "N/A", "?", "3500"]}

df = pd.DataFrame(data)

# Show the original column values before any numeric conversion attempt.
print("Original sales_dollars column values:")
print(df["sales_dollars"].tolist())

# Convert to numeric with errors coerced into missing values using NaN.
df["sales_numeric"] = pd.to_numeric(df["sales_dollars"], errors="coerce")

# Show the converted numeric column and highlight missing conversion results.
print("\nConverted sales_numeric column values:")
print(df["sales_numeric"].tolist())

# Count how many values failed conversion and became missing numeric entries.
failed_count = df["sales_numeric"].isna().sum()
print("\nNumber of failed numeric conversions:", int(failed_count))
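
In the example above, "2,500" is coerced to missing even though it is a valid amount. A possible refinement, assuming US-style formatting where commas are thousands separators and `$` is a currency prefix, is to strip those characters first and then list the values that still fail. The `raw` Series here is hypothetical.

```python
import pandas as pd

# Hypothetical sales strings (illustrative only) in assumed US-style formatting.
raw = pd.Series(["1200", "2,500", "$3,000", "N/A", "?"], name="sales_dollars")

# Remove dollar signs and thousands separators before converting.
stripped = raw.str.replace(r"[$,]", "", regex=True)

# Convert to numeric; anything still unparseable becomes NaN.
numeric = pd.to_numeric(stripped, errors="coerce")

# List the original strings that failed, so they can be reviewed rather than ignored.
failed = raw[numeric.isna()]

print("Converted values:", numeric.tolist())
print("Values that failed conversion:", failed.tolist())
```

For data written with a different locale convention, such as a comma as the decimal separator, the stripping rule would need to change accordingly.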



### **3.2. Datetime Parsing Formats**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_03_02.jpg?v=1766709399" width="250">



>* Correctly parse date strings into datetime objects
>* Avoid ambiguous formats that distort time-based analysis

>* Recognize different date, time, and timezone patterns
>* Specify formats to reduce ambiguity and ensure consistency

>* Standardize messy datetime strings before parsing them
>* Define supported formats, review failures, protect analysis



In [None]:
#@title Python Code - Datetime Parsing Formats

# Demonstrate parsing different datetime string formats with pandas to_datetime function.
# Show how specifying exact format removes ambiguity between similar date patterns.
# Compare automatic parsing with explicit format strings for safer datetime conversions.

import pandas as pd

# Create a small DataFrame containing several differently formatted date strings.
raw_dates = ["03-04-2024", "2024/03/04", "04-03-2024", "December 31, 2024"]

df = pd.DataFrame({"raw_date": raw_dates})

# Show the original raw strings so we can compare before and after parsing.
print("Original raw_date column values:")
print(df["raw_date"].to_string(index=False))

# Parse without an explicit format: pandas 2.x infers one format from the first value,
# so strings in other formats become NaT under errors="coerce", and day/month order is assumed.
df["auto_parsed"] = pd.to_datetime(df["raw_date"], errors="coerce", dayfirst=False)

# Parse using an explicit month-day-year format for clearly defined numeric patterns.
mask_numeric = df["raw_date"].str.contains("-", regex=False)

format_mdy = "%m-%d-%Y"

# Apply explicit format only to matching rows, leave others as missing for now.
df.loc[mask_numeric, "explicit_mdy"] = pd.to_datetime(
    df.loc[mask_numeric, "raw_date"], format=format_mdy, errors="coerce"
)

# Finally, display the DataFrame to compare automatic and explicit parsing results.
print("\nDataFrame with automatic and explicit parsed columns:")
print(df.to_string(index=False))
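
The idea of defining supported formats and reviewing failures can be sketched as a loop that tries each format in turn and keeps the first successful parse. The `supported_formats` list and `raw_dates` Series below are assumptions for illustration, and the month-name format assumes English month names.

```python
import pandas as pd

# Hypothetical raw date strings (illustrative only), one of them invalid.
raw_dates = pd.Series(["03-04-2024", "2024/03/04", "December 31, 2024", "not a date"])

# Assumed list of formats this dataset is expected to contain.
supported_formats = ["%m-%d-%Y", "%Y/%m/%d", "%B %d, %Y"]

# Start with all-missing results, then fill in from each format attempt.
parsed = pd.Series(pd.NaT, index=raw_dates.index, dtype="datetime64[ns]")
for fmt in supported_formats:
    attempt = pd.to_datetime(raw_dates, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Strings that matched none of the supported formats, kept for review.
failures = raw_dates[parsed.isna()]

print(parsed)
print("\nUnparsed values:", failures.tolist())
```

Because every format is stated explicitly, an ambiguous string such as "03-04-2024" is always read as month-day-year here, which is a decision to document rather than leave to inference.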



### **3.3. Efficient Categorical Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_B/image_03_03.jpg?v=1766709416" width="250">



>* Categorical dtypes store repeated labels more efficiently
>* They reduce memory, speed analysis, clarify group structure

>* Define categories and order to reflect meaning
>* Catch unexpected or misspelled labels as errors

>* Categorical columns improve grouping, summaries, and visuals
>* They guide modeling choices and prevent misleading calculations



In [None]:
#@title Python Code - Efficient Categorical Types

# Demonstrate converting text columns into efficient categorical types.
# Show memory usage difference between object and categorical columns.
# Show ordered categories enabling meaningful sorting for ordinal responses.

import pandas as pd

# Create a small DataFrame with repeated text categories.
data = {
    "payment_method": ["card", "cash", "card", "card", "cash", "check", "card", "cash"],
    "satisfaction": ["very dissatisfied", "dissatisfied", "satisfied", "very satisfied", "neutral", "satisfied", "neutral", "very satisfied"],
}

df = pd.DataFrame(data)

# Show original dtypes and memory usage for the DataFrame.
print("Original dtypes and memory usage:")
print(df.dtypes)
print("Memory usage bytes:", df.memory_usage(deep=True).sum())

# Convert payment_method to categorical type for efficiency.
df["payment_method_cat"] = df["payment_method"].astype("category")

# Define ordered categories for satisfaction, representing an ordinal scale.
order = ["very dissatisfied", "dissatisfied", "neutral", "satisfied", "very satisfied"]

df["satisfaction_cat"] = pd.Categorical(df["satisfaction"], categories=order, ordered=True)

# Show new dtypes and memory usage after categorical conversion.
print("\nWith categorical columns dtypes and memory usage:")
print(df.dtypes)
print("Memory usage bytes per column:")
print(df.memory_usage(deep=True))

# Sort by ordered satisfaction categories to show meaningful ordering.
sorted_df = df.sort_values("satisfaction_cat")

print("\nSorted by ordered satisfaction category:")
print(sorted_df[["satisfaction_cat", "payment_method_cat"]])
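
Declaring the allowed categories up front also helps surface unexpected labels. When `pd.Categorical` is built with an explicit `categories` list, values outside that list become missing rather than raising an error, so they need to be checked for. The `allowed` list and `payments` Series below are hypothetical.

```python
import pandas as pd

# Assumed set of valid labels for this column (illustrative only).
allowed = ["card", "cash", "check"]

# Hypothetical raw values containing a misspelling and a case variant.
payments = pd.Series(["card", "cash", "chek", "check", "Card"])

# Values not in the allowed list are stored as missing in the categorical.
payment_cat = pd.Categorical(payments, categories=allowed)

# Recover the original strings that did not match any allowed category.
unexpected = payments[payment_cat.isna()]

print("Categorical values:", list(payment_cat))
print("Unexpected labels:", unexpected.tolist())
```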



# <font color="#418FDE" size="6.5" uppercase>**Cleaning Columns**</font>


In this lecture, you learned to:
- Handle missing values in DataFrames using strategies such as dropping, filling, and interpolation with Pandas 2.3.1 methods. 
- Standardize text and categorical columns by trimming, case-normalizing, and mapping inconsistent labels. 
- Convert column dtypes to appropriate numeric, datetime, and categorical types to support accurate computations. 

In the next Module (Module 3), we will go over 'Transforming Data'.