# <font color="#418FDE" size="6.5" uppercase>**Cleaning And Scaling**</font>

>Last update: 20260201.
    
By the end of this lecture, you will be able to:
- Identify common data issues such as missing values and inconsistent formats. 
- Describe simple strategies for handling missing values in features. 
- Explain why scaling numeric features can help some models behave more sensibly. 


## **1. Common Data Issues**

### **1.1. Handling Missing Data**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_01_01.jpg?v=1769965126" width="250">



>* Missing data are absent, unknown, or unusable values
>* They appear as blanks, codes, or impossible numbers

>* Missing data can be random or systematic
>* Check where gaps cluster and what they correlate with

>* Use checks and domain knowledge to spot missingness
>* Question odd values, gaps, and placeholder codes
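
To see where gaps cluster, one option is to compare the share of missing values across groups. The sketch below assumes a hypothetical table with `region` and `income` columns; the names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical survey table; column names and values are made up.
survey = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "income": [52000.0, np.nan, 48000.0, np.nan, np.nan, 61000.0],
})

# Flag missing incomes, then average the flag within each region.
# The mean of a boolean column is the fraction of True values.
missing_rate_by_region = (
    survey["income"].isna().groupby(survey["region"]).mean()
)

print("Share of missing income per region:")
print(missing_rate_by_region)
```

A clearly higher rate in one group hints that the missingness may be systematic rather than random, which is worth checking with someone who knows how the data were collected.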



In [None]:
#@title Python Code - Handling Missing Data

# This script shows basic missing data handling.
# It uses a tiny table with simple issues.
# Focus on detecting and counting missing entries.

# import pandas for simple table handling.
import pandas as pd

# create a tiny dataset with obvious missing values.
data = {
    "age": [25, None, 40, -1],
    "city": ["Paris", "", None, "Berlin"],
}

# build a DataFrame from the dictionary.
df = pd.DataFrame(data)

# replace impossible age values with proper missing markers.
df.loc[df["age"] < 0, "age"] = pd.NA

# treat empty city strings as missing markers also.
df.loc[df["city"] == "", "city"] = pd.NA

# show the small cleaned table with missing markers.
print("Cleaned tiny dataset with missing markers:")
print(df)

# count missing values in each column for inspection.
missing_counts = df.isna().sum()

# print a short summary of missing values per column.
print("\nMissing values per column:")
print(missing_counts)




### **1.2. Unit Mismatches**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_01_02.jpg?v=1769965158" width="250">



>* Same feature recorded in different units misleads
>* Model learns unit errors instead of real patterns

>* Unit confusion appears across domains and sources
>* Plausible numbers hide issues, causing bad models

>* Inspect distributions and clusters to spot inconsistent units
>* Use metadata and experts to standardize measurements
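
One way to spot mixed units is to look for separate clusters in a feature's values. The sketch below assumes a hypothetical `height` feature where most values are in centimetres but a few were entered in metres. The cutoff of 3 is an assumption that only makes sense for this toy case, not a general rule.

```python
import pandas as pd

# Hypothetical heights: most in centimetres, two entered in metres.
heights = pd.Series([172.0, 165.0, 1.80, 158.0, 1.68, 181.0], name="height")

# Assumption for this toy case: nobody is shorter than 3 cm or
# taller than 3 m, so values below 3 are treated as metres.
looks_like_metres = heights < 3.0

# Keep centimetre entries as they are; convert suspected metre entries.
heights_cm = heights.where(~looks_like_metres, heights * 100.0)

print("Entries suspected to be in metres:")
print(heights[looks_like_metres])
print("\nHeights standardized to centimetres:")
print(heights_cm)
```

In real data the two clusters are rarely this clean, so a rule like this would normally be confirmed against metadata or with a domain expert before converting anything.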



### **1.3. Detecting Outliers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_01_03.jpg?v=1769965172" width="250">



>* Outliers are values far from most data
>* They may be real events or errors

>* Use plots and summaries to spot extremes
>* Treat flagged values as candidates to review, not confirmed errors

>* Use judgment and context when evaluating extreme values
>* Distinguish meaningful rare events from errors or noise
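
A quick summary-based check is to compare the mean and the median: a single extreme value pulls the mean much more than the median. The sketch below uses a hypothetical list of purchase amounts and only the standard library.

```python
from statistics import mean, median

# Hypothetical purchase amounts; the last value is an extreme entry.
purchases = [48.0, 52.0, 50.0, 47.0, 53.0, 400.0]

mean_all = mean(purchases)
median_all = median(purchases)

# Repeat the summary without the extreme entry for comparison.
mean_trimmed = mean(purchases[:-1])
median_trimmed = median(purchases[:-1])

print("With extreme value:    mean =", round(mean_all, 2), " median =", median_all)
print("Without extreme value: mean =", mean_trimmed, " median =", median_trimmed)
```

A large gap between mean and median suggests that extreme values are present, but it does not say whether they are errors or real events.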



In [None]:
#@title Python Code - Detecting Outliers

# This script shows simple numeric outlier detection.
# It uses pandas and matplotlib for tiny examples.
# Focus on visualizing unusual values in small data.

# !pip install numpy pandas matplotlib

# Import required libraries for data handling.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set deterministic random seed for reproducibility.
np.random.seed(42)

# Create mostly normal purchase amounts with noise.
normal_purchases = np.random.normal(loc=50.0, scale=10.0, size=40)

# Manually add a few extreme purchase amounts.
outlier_purchases = np.array([5.0, 200.0, 300.0, 400.0])

# Combine normal and outlier purchases into one array.
all_purchases = np.concatenate((normal_purchases, outlier_purchases))

# Build a DataFrame to hold purchase amounts.
data = pd.DataFrame({"purchase_amount": all_purchases})

# Compute simple summary statistics for quick overview.
summary = data["purchase_amount"].describe()

# Print summary to see min, max, and quartiles.
print("Purchase amount summary statistics:\n", summary)

# Compute interquartile range for rule based detection.
q1 = summary["25%"]
q3 = summary["75%"]

# Calculate IQR and lower and upper detection bounds.
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Create boolean mask for suspected outlier rows.
outlier_mask = (
    (data["purchase_amount"] < lower_bound)
    | (data["purchase_amount"] > upper_bound)
)

# Select only rows flagged as potential outliers.
suspected_outliers = data.loc[outlier_mask, "purchase_amount"]

# Print detected outliers with their index positions.
print("\nSuspected outliers using IQR rule:")
print(suspected_outliers)

# Create a simple boxplot to visualize outliers.
plt.figure(figsize=(6, 4))
plt.boxplot(data["purchase_amount"])

# Label axes and title for clear interpretation.
plt.ylabel("Purchase amount in dollars")
plt.title("Boxplot highlighting potential purchase outliers")

# Display the plot to visually inspect unusual values.
plt.show()




## **2. Handling Missing Data**

### **2.1. Removing Incomplete Data**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_02_01.jpg?v=1769965219" width="250">



>* Drop rows or features with missing values
>* Method is simple but can remove important information

>* Use missing proportion and feature importance together
>* Drop rare incomplete rows or mostly empty features

>* Dropping missing data can bias your results
>* Check missingness patterns and trade-offs before deleting
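
The idea of dropping mostly empty features first, and only then the few remaining incomplete rows, can be sketched as follows. The table, the column names, and the 50% threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: "notes" is mostly empty, "age" has one gap.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0, 35.0],
    "score": [4.0, 3.0, 5.0, 3.0, 4.0],
    "notes": [np.nan, np.nan, "late", np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_fraction = df.isna().mean()

# Illustrative threshold (an assumption): drop columns that are
# more than half empty.
max_missing = 0.5
columns_to_drop = missing_fraction[missing_fraction > max_missing].index.tolist()
df_reduced = df.drop(columns=columns_to_drop)

# Then drop the few rows that are still incomplete.
df_complete = df_reduced.dropna()

print("Missing fraction per column:")
print(missing_fraction)
print("\nDropped columns:", columns_to_drop)
print("Rows kept:", len(df_complete), "of", len(df))
```

Dropping the sparse column first keeps four of the five rows here; calling `dropna()` on the original table would have kept none, because every row has at least one gap.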



In [None]:
#@title Python Code - Removing Incomplete Data

# This script shows how to remove incomplete data.
# We use pandas to drop rows with missing values.
# Focus on understanding exactly what gets removed.

# import pandas for simple table handling.
import pandas as pd

# create a tiny survey style dataset.
data = {
    "age": [25, 30, None, 40],
    "satisfaction": [4, None, 5, 3],
    "comment": ["good", "ok", "great", None],
}

# build a DataFrame from the dictionary.
df = pd.DataFrame(data)

# show the original data with missing values.
print("Original data with possible missing values:")
print(df)

# count missing values per column for inspection.
missing_counts = df.isna().sum()

# print a short summary of missing values.
print("\nMissing values per column before dropping:")
print(missing_counts)

# drop rows where any value is missing.
df_dropped_any = df.dropna(how="any")

# show the remaining rows after dropping any missing.
print("\nData after dropping rows with any missing:")
print(df_dropped_any)

# drop rows only if satisfaction is missing specifically.
df_dropped_target = df.dropna(subset=["satisfaction"])

# show data where satisfaction is always present.
print("\nData after keeping rows with satisfaction present:")
print(df_dropped_target)




### **2.2. Basic Imputation Methods**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_02_02.jpg?v=1769965253" width="250">



>* Fill missing values using simple imputation guesses
>* Use mean or median to keep data

>* Fill missing categories with the most common value
>* Method is simple but can add bias

>* Use nearby time or space values to fill
>* Assumes gradual change; imputed values add uncertainty
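
Filling from nearby values in time can be sketched with linear interpolation. The example assumes a hypothetical series of evenly spaced daily temperature readings; it also keeps a flag marking which points were imputed, so the added uncertainty stays visible.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with two gaps.
temps = pd.Series(
    [20.0, np.nan, 22.0, 23.0, np.nan, 25.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Remember which readings were missing before filling.
was_imputed = temps.isna()

# Linear interpolation fills each gap from its neighbours, which
# assumes the value changes gradually between readings.
temps_filled = temps.interpolate(method="linear")

result = pd.DataFrame({"temperature": temps_filled, "was_imputed": was_imputed})
print(result)
```

If the underlying quantity can jump suddenly between readings, the gradual-change assumption fails and the filled values can be misleading.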



In [None]:
#@title Python Code - Basic Imputation Methods

# This script shows basic imputation methods.
# It uses a tiny housing style dataset.
# Focus is on simple missing value strategies.

# import pandas for simple table handling.
import pandas as pd

# create a tiny dataset with some missing values.
data = {
    "size_sqm": [50, 60, None, 80, 90, None],
    "bedrooms": [1, 2, 2, None, 3, None],
    "city": [
        "Springfield",
        "Springfield",
        None,
        "Shelbyville",
        "Springfield",
        None,
    ],
}

# build a pandas dataframe from the dictionary.
df = pd.DataFrame(data)

# show the original data with missing values.
print("Original data with missing values:")
print(df)

# compute mean and median for numeric size feature.
size_mean = df["size_sqm"].mean()
size_median = df["size_sqm"].median()

# compute mode for the categorical city feature.
city_mode_series = df["city"].mode()
city_mode = city_mode_series.iloc[0] if not city_mode_series.empty else None

# create copies for different imputation strategies.
df_mean_imputed = df.copy()
df_median_imputed = df.copy()
df_mode_imputed = df.copy()

# fill numeric size using mean imputation.
df_mean_imputed["size_sqm"] = df_mean_imputed["size_sqm"].fillna(size_mean)

# fill numeric size using median imputation.
df_median_imputed["size_sqm"] = df_median_imputed["size_sqm"].fillna(
    size_median
)

# fill categorical city using mode imputation.
if city_mode is not None:
    df_mode_imputed["city"] = df_mode_imputed["city"].fillna(city_mode)

# demonstrate forward fill on bedrooms, assuming row order is meaningful.
df_ffill = df.copy()
df_ffill["bedrooms"] = df_ffill["bedrooms"].ffill()

# print summary of each imputation result.
print("\nMean imputed size_sqm column:")
print(df_mean_imputed["size_sqm"])

print("\nMedian imputed size_sqm column:")
print(df_median_imputed["size_sqm"])

print("\nMode imputed city column:")
print(df_mode_imputed["city"])

print("\nForward filled bedrooms column:")
print(df_ffill["bedrooms"])



### **2.3. Recording Data Assumptions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_02_03.jpg?v=1769965316" width="250">



>* Handling missing data always involves hidden assumptions
>* Write assumptions clearly to ensure transparency and critique

>* Link each missing-data choice to its reason
>* Document assumptions and impacts in shared project notes

>* Note uncertainty and bias from missing-data choices
>* Document impacts to compare methods and ensure fairness
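
One lightweight way to record such decisions is to keep them as structured notes next to the code. The sketch below is only an illustration: the `ImputationNote` class, its fields, and the example entries are hypothetical, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ImputationNote:
    """One documented missing-data decision (hypothetical schema)."""
    feature: str
    method: str
    reason: str
    share_missing: float


# Example entries; the features and numbers are invented.
notes = [
    ImputationNote("age", "median", "skewed values, few gaps", 0.04),
    ImputationNote("city", "most frequent value", "assumes gaps are random", 0.10),
]

# Serialize the notes so they can be stored with the project files.
notes_json = json.dumps([asdict(note) for note in notes], indent=2)
print(notes_json)
```

Storing the reason and the share of missing values alongside each method makes it easier to revisit a choice later and to compare alternatives.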



## **3. Why Scale Features**

### **3.1. Uneven Feature Scales**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_03_01.jpg?v=1769965330" width="250">



>* Different features can have very different ranges
>* Many algorithms implicitly treat larger numbers as more important

>* Uneven scales let large-magnitude features dominate learning
>* Small-range but important features may be ignored

>* Uneven scales hide true feature importance in models
>* They distort similarity, leading to biased decisions
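
A simple first check for uneven scales is to compare the range of each feature. The sketch below uses a small hypothetical house table; the feature names and values are assumptions for illustration.

```python
import numpy as np

# Hypothetical house features: size (sqm), price, bedrooms.
features = np.array([
    [60.0, 120000.0, 2.0],
    [80.0, 150000.0, 2.0],
    [120.0, 220000.0, 3.0],
])
names = ["size_sqm", "price", "bedrooms"]

# Range (max minus min) of each column.
ranges = features.max(axis=0) - features.min(axis=0)

for name, value_range in zip(names, ranges):
    print(f"{name}: range = {value_range:g}")

# Ratio between the widest and the narrowest range.
range_ratio = ranges.max() / ranges.min()
print("Largest / smallest range:", range_ratio)
```

When this ratio spans several orders of magnitude, distance-based or gradient-based methods are likely to be driven mostly by the widest feature.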



In [None]:
#@title Python Code - Uneven Feature Scales

# This script shows uneven feature scales clearly.
# We compare raw and scaled numeric house features.
# Focus on how large scales dominate distances.

# import required numeric and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# create tiny synthetic house dataset with uneven scales.
sizes = np.array([60, 80, 120, 200, 250], dtype=float)

# create prices with much larger numeric magnitudes.
prices = np.array([120000, 150000, 220000, 400000, 520000], dtype=float)

# create bedroom counts with very small magnitudes.
bedrooms = np.array([2, 2, 3, 4, 5], dtype=float)

# stack features into a matrix with one column per feature.
features = np.column_stack((sizes, prices, bedrooms))

# confirm feature matrix shape is as expected.
print("Feature matrix shape:", features.shape)

# choose reference house index for distance comparisons.
ref_index = 0

# compute raw Euclidean distances to reference house.
raw_diffs = features - features[ref_index]
raw_distances = np.sqrt(np.sum(raw_diffs ** 2, axis=1))

# print raw distances to show price domination.
print("Raw distances from first house:")
print(np.round(raw_distances, 2))

# manually scale each feature to zero mean unit variance.
means = features.mean(axis=0)
stds = features.std(axis=0)

# avoid division by zero using safe replacement.
stds_safe = np.where(stds == 0, 1.0, stds)

# compute standardized feature matrix using safe denominators.
features_scaled = (features - means) / stds_safe

# compute scaled distances to same reference house.
scaled_diffs = features_scaled - features_scaled[ref_index]
scaled_distances = np.sqrt(np.sum(scaled_diffs ** 2, axis=1))

# print scaled distances where features contribute comparably.
print("Scaled distances from first house:")
print(np.round(scaled_distances, 2))

# prepare x positions for plotting raw and scaled distances.
indices = np.arange(features.shape[0])

# create bar plot comparing raw and scaled distances.
plt.figure(figsize=(6, 4))
plt.bar(indices - 0.15, raw_distances, width=0.3, label="Raw distances")

# add bars for scaled distances beside raw distances.
plt.bar(indices + 0.15, scaled_distances, width=0.3, label="Scaled distances")

# label axes and add legend for clarity.
# a log scale keeps the much smaller scaled distances visible.
plt.yscale("log")
plt.xlabel("House index")
plt.ylabel("Distance from first house (log scale)")
plt.title("Effect of uneven feature scales on distances")
plt.legend()

# display the plot to visually compare distances.
plt.show()




### **3.2. Model Sensitivity to Scale**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_03_02.jpg?v=1769965391" width="250">



>* Large-number features can unfairly dominate learning
>* Distance-based models ignore small-scale feature differences

>* Large-scale features dominate gradients and slow learning
>* Distance-based models overemphasize big-number features

>* Unscaled features distort coefficient penalties and interpretation
>* Scaling makes regularization fair and reflects true importance
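
The effect on coefficient penalties can be sketched with single-feature ridge regression without an intercept, where the slope has the closed form `sum(x * y) / (sum(x * x) + penalty)`. The helper `ridge_slope` and the toy distance and cost data are hypothetical; the same feature is simply expressed once in kilometres and once in metres.

```python
import numpy as np


def ridge_slope(x, y, penalty):
    """Closed-form ridge slope for one feature and no intercept."""
    return float(np.dot(x, y) / (np.dot(x, x) + penalty))


# Same hypothetical trips, recorded in two different units.
distance_km = np.array([1.0, 2.0, 3.0, 4.0])
distance_m = distance_km * 1000.0

# Hypothetical target: the cost is exactly 2 per kilometre.
cost = 2.0 * distance_km

# Use the same penalty strength for both versions of the feature.
penalty = 10.0
slope_km = ridge_slope(distance_km, cost, penalty)
slope_m = ridge_slope(distance_m, cost, penalty)

# Predict the cost of the same 4 km trip with each model.
pred_km = slope_km * 4.0
pred_m = slope_m * 4000.0

print("Prediction with kilometres:", round(pred_km, 3))
print("Prediction with metres:    ", round(pred_m, 3))
```

The true cost of the trip is 8. With kilometres the penalty shrinks the slope noticeably, while with metres the same penalty has almost no effect, so the choice of unit silently changes how strongly the model is regularized.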



In [None]:
#@title Python Code - Model Sensitivity to Scale

# This script shows why feature scaling matters.
# We compare distances before and after scaling.
# Focus on how one large feature dominates distances.

# Required libraries are already available in Colab.
# Uncomment the next line only if numpy is missing.
# !pip install numpy

# Import numpy for simple numeric operations.
import numpy as np

# Create tiny dataset with two numeric features.
# Feature one is small, feature two is very large.
small_feature = np.array([1.0, 2.0, 3.0, 4.0])
large_feature = np.array([1000.0, 2000.0, 3000.0, 4000.0])

# Stack features into a matrix with shape checks.
X = np.column_stack((small_feature, large_feature))
assert X.shape == (4, 2)

# Choose a reference point midway between the first two rows.
reference_point = np.array([1.5, 1500.0])
assert reference_point.shape == (2,)

# Define a simple Euclidean distance function.
def euclidean_distance(a, b):
    diff = a - b
    return float(np.sqrt(np.sum(diff ** 2)))

# Compute distances before scaling to show dominance.
distances_raw = []
for row in X:
    distances_raw.append(euclidean_distance(row, reference_point))

# Manually standardize each feature using mean and std.
means = X.mean(axis=0)
stds = X.std(axis=0)

# Avoid division by zero using safe replacement.
stds_safe = np.where(stds == 0, 1.0, stds)

# Scale dataset and reference point using statistics.
X_scaled = (X - means) / stds_safe
reference_scaled = (reference_point - means) / stds_safe

# Validate scaled shapes before computing distances.
assert X_scaled.shape == X.shape
assert reference_scaled.shape == reference_point.shape

# Compute distances after scaling to compare effects.
distances_scaled = []
for row in X_scaled:
    distances_scaled.append(euclidean_distance(row, reference_scaled))

# Print raw and scaled distances side by side.
print("Index  Raw_Distance  Scaled_Distance")
for i, (d_raw, d_scaled) in enumerate(zip(distances_raw, distances_scaled)):
    print(i, round(d_raw, 2), round(d_scaled, 2))

# Print short explanation about model sensitivity.
print("\nNotice how scaling balances feature influence on distance.")



### **3.3. Normalization Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_B/image_03_03.jpg?v=1769965414" width="250">



>* Normalization puts numeric features on similar scales
>* Keeps information while preventing one feature dominating

>* Turn different units into comparable standardized scores
>* Keep order of values while adjusting their scale

>* Normalization stabilizes training and speeds model learning
>* It prevents large-scale features from overpowering others
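
Two common rescalings can be written compactly. As a sketch, assume a feature $x$ with observed minimum $x_{\min}$, maximum $x_{\max}$, mean $\mu$, and standard deviation $\sigma$ (symbols introduced here for illustration):

$$
x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},
\qquad
x_{\text{zscore}} = \frac{x - \mu}{\sigma}
$$

For example, for the ages 20, 35, 50, 65, min-max scaling maps 35 to $(35 - 20) / (65 - 20) = 1/3 \approx 0.33$. With $\mu = 42.5$ and population standard deviation $\sigma \approx 16.77$, the z-score of 35 is $(35 - 42.5) / 16.77 \approx -0.45$. Both formulas are increasing in $x$ (provided $x_{\max} > x_{\min}$ and $\sigma > 0$), which is why the order of values is preserved.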



In [None]:
#@title Python Code - Normalization Basics

# This script shows basic feature normalization concepts.
# We compare raw and normalized numeric feature scales.
# Focus is on simple min-max and z-score scaling.

# No extra installs are required for this script.
# All used libraries are available by default.

# Import numpy for numeric array handling.
import numpy as np

# Create a tiny dataset with different feature scales.
raw_data = np.array([
    [20.0, 50000.0],
    [35.0, 80000.0],
    [50.0, 120000.0],
    [65.0, 200000.0],
])

# Print the raw data to inspect scales.
print("Raw data age income:", raw_data)

# Compute column wise minimum and maximum values.
col_min = raw_data.min(axis=0)
col_max = raw_data.max(axis=0)

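# Avoid division by zero for constant columns.
col_range = col_max - col_min
col_range_safe = np.where(col_range == 0.0, 1.0, col_range)

# Apply simple min max normalization per column.
minmax_data = (raw_data - col_min) / col_range_safe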

# Print normalized data to compare scales.
print("Minmax normalized data values:", minmax_data)

# Compute column wise mean and standard deviation.
col_mean = raw_data.mean(axis=0)
col_std = raw_data.std(axis=0)

# Avoid division by zero using safe replacement.
col_std_safe = np.where(col_std == 0.0, 1.0, col_std)

# Apply zscore normalization centering and scaling.
zscore_data = (raw_data - col_mean) / col_std_safe

# Print zscore normalized data for comparison.
print("Zscore normalized data values:", zscore_data)

# Show how a simple distance changes after scaling.
point_a = raw_data[0]
point_b = raw_data[3]

# Compute Euclidean distance on raw features.
raw_distance = np.linalg.norm(point_a - point_b)

# Compute Euclidean distance on minmax normalized features.
minmax_distance = np.linalg.norm(
    minmax_data[0] - minmax_data[3]
)

# Print both distances to highlight scale effects.
print("Raw distance value:", float(raw_distance))
print("Minmax distance value:", float(minmax_distance))




# <font color="#418FDE" size="6.5" uppercase>**Cleaning And Scaling**</font>


In this lecture, you learned to:
- Identify common data issues such as missing values and inconsistent formats. 
- Describe simple strategies for handling missing values in features. 
- Explain why scaling numeric features can help some models behave more sensibly. 

In the next module (Module 8), we will cover 'Nonlinear Models'.