# <font color="#418FDE" size="6.5" uppercase>**Simplifying Features**</font>

>Last update: 20260201.
    
By the end of this lecture, you will be able to:
- Explain why high-dimensional data can be hard to visualize and reason about. 
- Describe the idea of representing data with fewer combined features. 
- Interpret simple low-dimensional visualizations that summarize higher-dimensional data. 


## **1. High Dimensional Intuition**

### **1.1. Intuition for Many Features**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_01_01.jpg?v=1769972596" width="250">



>* Many features describe each item as coordinates
>* Human intuition fails beyond three or four dimensions

>* Few features are easy to plot visually
>* Many features break our ability to see structure

>* Many features make comparisons and decisions overwhelming
>* Our intuition misses subtle multi-feature patterns and structure
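
The idea that features act as coordinates can be made concrete with a minimal sketch. The feature names and values below are made up for illustration only. The point is that an item with five features is a single point in five dimensions: we cannot draw it, but we can still compute with it.

```python
import numpy as np

# Hypothetical feature names, used only for illustration.
feature_names = ["height_cm", "weight_kg", "age_years", "resting_pulse", "sleep_hours"]

# Each item is one point whose coordinates are its feature values.
item_a = np.array([170.0, 65.0, 30.0, 60.0, 7.5])
item_b = np.array([172.0, 80.0, 45.0, 72.0, 6.0])

# The number of features equals the number of dimensions.
num_dimensions = item_a.shape[0]
print("Dimensions:", num_dimensions)

# With only two features we could draw both items on a flat plot.
print("2D view of item A:", dict(zip(feature_names[:2], item_a[:2])))

# With five features no single plot shows the full point,
# but the straight-line distance between items is still well defined.
distance = np.linalg.norm(item_a - item_b)
print("Distance between items using all features:", round(float(distance), 2))
```

The arithmetic stays simple as dimensions grow. What stops scaling is our ability to picture the result.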



### **1.2. Visualizing Many Dimensions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_01_02.jpg?v=1769972609" width="250">



>* We easily understand data in two or three dimensions
>* High-dimensional datasets exceed our visual and cognitive limits

>* We use 2D plots and color encodings
>* These views fragment patterns and strain our reasoning

>* Complex multi-axis charts quickly become cluttered
>* Human perception struggles to see subtle high-dimensional patterns
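
One rough way to see why 2D views fragment patterns is to count how many two-feature scatter plots a dataset needs. The sketch below uses a helper name of our own choosing, `count_pairwise_plots`, and the standard pair-counting formula d(d-1)/2.

```python
from itertools import combinations


def count_pairwise_plots(num_features):
    """Return how many distinct two-feature scatter plots exist."""
    return num_features * (num_features - 1) // 2


# The number of plots grows much faster than the number of features.
for num_features in (3, 10, 50):
    print(num_features, "features ->", count_pairwise_plots(num_features), "pairwise plots")

# Cross-check the formula by listing every pair for ten features.
pairs = list(combinations(range(10), 2))
print("Pairs listed explicitly for ten features:", len(pairs))
```

Even if every one of those plots were drawn, each shows only two features at a time, so a pattern that involves three or more features together never appears in a single view.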



### **1.3. Noise and Clutter**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_01_03.jpg?v=1769972620" width="250">



>* Many features add noise and overwhelming clutter
>* Extra weak features hide the real underlying patterns

>* Extra dimensions each add a little noise, and it accumulates
>* Noisy features distort distances and visual projections

>* Too many weak cues overwhelm human reasoning
>* High dimensions blur signal, distances, and patterns



In [None]:
#@title Python Code - Noise and Clutter

# This script shows noise clutter in high dimensions.
# We compare clean and noisy features visually and numerically.
# Use this to build intuition about confusing extra dimensions.

# NumPy and Matplotlib are already available in Colab.
# They provide arrays and simple visualization tools.
# No additional installations are needed for this short example.

import numpy as np
import matplotlib.pyplot as plt

# Set a deterministic random seed for reproducible results.
np.random.seed(42)

# Create a small number of clean two dimensional points.
clean_points = np.random.normal(loc=0.0, scale=1.0, size=(40, 2))

# Create several extra noisy dimensions that carry no real signal.
noisy_extra = np.random.normal(loc=0.0, scale=3.0, size=(40, 8))

# Combine clean and noisy features into one high dimensional array.
high_dim_points = np.concatenate((clean_points, noisy_extra), axis=1)

# Validate that shapes match expectations before further operations.
assert clean_points.shape[0] == high_dim_points.shape[0]

# Compute pairwise distances using only the two clean dimensions.
clean_distances = np.linalg.norm(
    clean_points[None, :, :] - clean_points[:, None, :], axis=2
)

# Compute pairwise distances using all noisy high dimensional features.
high_dim_distances = np.linalg.norm(
    high_dim_points[None, :, :] - high_dim_points[:, None, :], axis=2
)

# Select a reference point index for distance comparison.
reference_index = 0

# Extract distances from the reference point in both spaces.
clean_from_ref = clean_distances[reference_index]

# Extract high dimensional distances from the same reference point.
high_from_ref = high_dim_distances[reference_index]

# Sort indices by distance in the clean two dimensional space.
clean_order = np.argsort(clean_from_ref)

# Sort indices by distance in the noisy high dimensional space.
high_order = np.argsort(high_from_ref)

# Print nearest neighbors in clean space and noisy space.
# Position 0 of each ordering is the reference point itself, so skip it.
print("Nearest neighbors using only two clean features:")
print(clean_order[1:6])
print("Nearest neighbors using all noisy extra features:")
print(high_order[1:6])

# Create a simple scatter plot of the clean two dimensional points.
plt.scatter(clean_points[:, 0], clean_points[:, 1], c="lightgray", label="all points")

# Highlight the reference point in the clean two dimensional view.
plt.scatter(
    clean_points[reference_index, 0], clean_points[reference_index, 1], c="red", label="reference"
)

# Highlight the nearest neighbor in clean space for visual comparison.
plt.scatter(
    clean_points[clean_order[1], 0], clean_points[clean_order[1], 1], c="blue", label="clean neighbor"
)

# Add labels and title explaining noise clutter intuition.
plt.xlabel("Feature one (clean signal)")
plt.ylabel("Feature two (clean signal)")

# Show how extra noisy dimensions can reshuffle neighbor relationships.
plt.title("Clean view versus hidden noisy clutter in extra dimensions")

# Display legend to distinguish reference and neighbor points.
plt.legend()

# Render the final plot window for this short teaching script.
plt.show()




## **2. Summarizing Features**

### **2.1. Merging Related Features**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_02_01.jpg?v=1769972662" width="250">



>* Combine measurements that tell the same story
>* Create one summary feature to capture broad patterns

>* Combine related variables into one composite score
>* Composite scores simplify comparisons while keeping key information

>* Merging correlated features cuts noise and redundancy
>* Combined features reveal clearer, simpler data patterns
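
A minimal sketch of merging related features into one composite score follows. The data is synthetic and the feature names are hypothetical. Averaging z-scores is one simple way to build a composite, assumed here for illustration rather than prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: three noisy measurements of one hidden fitness level.
num_people = 500
fitness = rng.normal(loc=0.0, scale=1.0, size=num_people)
steps = 8000 + 2000 * fitness + rng.normal(0.0, 2000.0, num_people)
active_minutes = 40 + 15 * fitness + rng.normal(0.0, 15.0, num_people)
stairs_climbed = 10 + 4 * fitness + rng.normal(0.0, 4.0, num_people)

# Stack the related features into one matrix, one row per person.
X = np.column_stack((steps, active_minutes, stairs_climbed))

# Put every feature on the same scale so no single unit dominates.
z_scores = (X - X.mean(axis=0)) / X.std(axis=0)

# The composite score is the average of the standardized features.
composite = z_scores.mean(axis=1)

# Compare how well each feature and the composite track the hidden level.
corr_single = [np.corrcoef(X[:, j], fitness)[0, 1] for j in range(X.shape[1])]
corr_composite = np.corrcoef(composite, fitness)[0, 1]

print("Correlation of each single feature with fitness:", np.round(corr_single, 2))
print("Correlation of composite score with fitness:", round(float(corr_composite), 2))
```

Because the three measurements share the same signal but have independent noise, averaging them cancels part of the noise. This is why the composite tracks the hidden level more closely than any single feature does.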



### **2.2. Key Patterns in Data**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_02_02.jpg?v=1769972672" width="250">



>* Look for underlying themes across many features
>* Use pattern scores to summarize each person

>* Patterns appear from features changing together across data
>* These patterns become new combined features describing behavior

>* Focus on main variation patterns, not individual measurements
>* Pattern scores summarize each person with fewer features



In [None]:
#@title Python Code - Key Patterns in Data

# This script shows key patterns using combined features.
# We create lifestyle data and summarize main patterns.
# Focus is representing many features with fewer pattern scores.

# import required numerical and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# set deterministic random seed for reproducible results.
np.random.seed(42)

# create small synthetic lifestyle dataset with three features.
num_people = 30
exercise_hours = np.random.normal(loc=4.0, scale=1.0, size=num_people)

# create sleep hours positively related to exercise hours.
sleep_hours = 6.0 + 0.5 * exercise_hours + np.random.normal(
    loc=0.0, scale=0.5, size=num_people
)

# create stress score negatively related to exercise and sleep.
stress_score = 8.0 - 0.6 * exercise_hours - 0.4 * sleep_hours + np.random.normal(
    loc=0.0, scale=0.7, size=num_people
)

# stack features into one data matrix for processing.
X = np.column_stack((exercise_hours, sleep_hours, stress_score))

# validate matrix shape before further operations.
assert X.shape == (num_people, 3)

# center each feature by subtracting its mean value.
X_mean = X.mean(axis=0)
X_centered = X - X_mean

# compute covariance matrix capturing how features move together.
cov_matrix = np.cov(X_centered.T)

# compute eigenvalues and eigenvectors of the symmetric covariance matrix.
# eigh is designed for symmetric matrices and always returns real values.
values, vectors = np.linalg.eigh(cov_matrix)

# sort eigenvalues in descending order to find strongest shared variation direction.
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]

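# take first eigenvector as main lifestyle pattern direction.
main_pattern_direction = vectors[:, 0]

# eigenvector signs are arbitrary, so flip the direction if needed
# so that more exercise always raises the pattern score.
if main_pattern_direction[0] < 0:
    main_pattern_direction = -main_pattern_direction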

# project each person onto this main pattern direction.
pattern_scores = X_centered.dot(main_pattern_direction)

# print brief explanation and first few pattern scores.
print("Each person now has one main pattern score summarizing lifestyle.")
print("First five pattern scores (higher means healthier overall pattern):")
print(np.round(pattern_scores[:5], 2))

# create scatter plot comparing exercise and stress colored by pattern score.
plt.figure(figsize=(6, 4))
plt.scatter(
    exercise_hours,
    stress_score,
    c=pattern_scores,
    cmap="viridis",
)

# label axes and add colorbar explaining pattern strength.
plt.xlabel("Exercise hours per week")
plt.ylabel("Stress score (higher means more stress)")
plt.title("One combined pattern summarizing three lifestyle features")
plt.colorbar(label="Main lifestyle pattern score")
plt.tight_layout()
plt.show()




### **2.3. Balancing Detail Loss**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_02_03.jpg?v=1769972707" width="250">



>* Combining features trades detail for simpler patterns
>* Check simplified data still supports reliable conclusions

>* Reducing features should remove noise, not signal
>* Broader features reveal patterns while hiding tiny details

>* Over-simplifying features can hide important local patterns
>* We must test simplifications for accuracy and fairness
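
The risk of over-simplifying can be shown with a tiny made-up example. The student scores below are hypothetical. Averaging two features makes two very different groups look identical, while a second summary recovers the detail that was lost.

```python
import numpy as np

# Hypothetical scores (math, writing) for two groups of students.
# Group A is strong in math and weak in writing; group B is the reverse.
group_a = np.array([[9.0, 3.0], [8.0, 4.0], [9.0, 5.0]])
group_b = np.array([[3.0, 9.0], [4.0, 8.0], [5.0, 9.0]])

# Simplify each student to one combined feature: the average of both scores.
avg_a = group_a.mean(axis=1)
avg_b = group_b.mean(axis=1)
print("Combined scores, group A:", avg_a)
print("Combined scores, group B:", avg_b)

# The averages match, so the combined feature cannot tell the groups apart.
print("Groups look identical after averaging:", np.array_equal(avg_a, avg_b))

# A second summary, math minus writing, keeps the detail the average dropped.
diff_a = group_a[:, 0] - group_a[:, 1]
diff_b = group_b[:, 0] - group_b[:, 1]
print("Math minus writing, group A:", diff_a)
print("Math minus writing, group B:", diff_b)
```

A check like this, comparing what a simplified feature can and cannot distinguish, is one simple way to test whether a simplification still supports the conclusions you need.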



In [None]:
#@title Python Code - Balancing Detail Loss

# This script shows balancing detail loss visually.
# We compare original features and simplified features.
# Focus on keeping signal while discarding noisy detail.

# Required libraries are available in Colab by default.
# If running elsewhere, install numpy and matplotlib first.

# Import required numerical and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# Set deterministic random seed for reproducible behavior.
np.random.seed(42)

# Create a small synthetic dataset driven by one underlying trend.
num_points = 40
base_trend = np.linspace(0, 10, num_points)

# Add two noisy features around the base trend.
feature_one = base_trend + np.random.normal(0, 0.8, num_points)
feature_two = base_trend + np.random.normal(0, 1.2, num_points)

# Stack features into a matrix and validate shape.
data_matrix = np.column_stack((feature_one, feature_two))
assert data_matrix.shape == (num_points, 2)

# Create a combined feature by averaging both features.
combined_feature = data_matrix.mean(axis=1)

# Compute a simple detail loss measure: the ratio of combined to original variance.
original_variance = data_matrix.var(axis=0, ddof=1).mean()
combined_variance = combined_feature.var(ddof=1)
variance_ratio = combined_variance / original_variance

# Print short summary about simplification and variance.
print("Original average variance across features:", round(original_variance, 3))
print("Combined feature variance after simplification:", round(combined_variance, 3))
print("A variance ratio close to one means little detail was lost.")
print("Here ratio value is:", round(variance_ratio, 3))

# Prepare figure with two subplots for comparison.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot original two features against base trend.
axes[0].scatter(base_trend, feature_one, color="tab:blue", label="Feature one")
axes[0].scatter(base_trend, feature_two, color="tab:orange", label="Feature two")
axes[0].set_title("Original separate noisy features")
axes[0].set_xlabel("Underlying trend index value")
axes[0].set_ylabel("Measured feature values scale")
axes[0].legend(loc="upper left")

# Plot combined feature showing simplified representation.
axes[1].scatter(base_trend, combined_feature, color="tab:green", label="Combined feature")
axes[1].set_title("Simplified combined feature view")
axes[1].set_xlabel("Underlying trend index value")
axes[1].set_ylabel("Combined feature values scale")
axes[1].legend(loc="upper left")

# Adjust layout and display the single figure.
plt.tight_layout()
plt.show()




## **3. Reading 2D Projections**

### **3.1. Basic 2D Plots**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_03_01.jpg?v=1769972746" width="250">



>* 2D plots are snapshots of complex data
>* Each dot summarizes many original feature values

>* Abstract axes summarize key patterns in data
>* Translate axis directions into real-world meanings mentally

>* Look for dense areas, gaps, and directions
>* Treat distance as similarity to spot patterns and outliers



In [None]:
#@title Python Code - Basic 2D Plots

# This script shows a simple two dimensional projection plot.
# We summarize higher dimensional style scores into two combined axes.
# Use this to practice reading basic two dimensional scatter plots.

# Import required numerical and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# Set a deterministic random seed for reproducible synthetic data.
np.random.seed(42)

# Create small synthetic data representing three customer style features.
base_points = np.array([
    [2.0, 7.0, 3.0],
    [6.0, 2.0, 8.0],
    [4.0, 5.0, 6.0],
    [8.0, 1.0, 9.0],
    [1.0, 8.0, 2.0],
    [7.0, 3.0, 7.0],
])

# Check that each record has the expected three features.
assert base_points.shape[1] == 3

# Define simple weights to create two combined projection directions.
weights_one = np.array([0.6, 0.3, 0.1])
weights_two = np.array([-0.2, 0.5, 0.7])

# Compute first projection as weighted sum for each record.
proj_one = base_points.dot(weights_one)

# Compute second projection as another weighted sum for each record.
proj_two = base_points.dot(weights_two)

# Stack projections into two dimensional coordinates for plotting.
projected_points = np.column_stack((proj_one, proj_two))

# Confirm projected data has two columns representing two axes.
assert projected_points.shape[1] == 2

# Prepare short labels describing each synthetic customer record.
labels = [
    "Eco casual", "Luxury formal", "Balanced style",
    "Bold luxury", "Relaxed eco", "Modern luxury",
]

# Create a scatter plot showing the two dimensional projection.
plt.figure(figsize=(6, 5))
plt.scatter(projected_points[:, 0], projected_points[:, 1], color="teal")

# Label each point to support interpretation of plot positions.
for point, label in zip(projected_points, labels):
    plt.text(point[0] + 0.05, point[1] + 0.05, label, fontsize=8)

# Add axis labels that describe abstract combined style directions.
plt.xlabel("Component one: budget to luxury style direction")
plt.ylabel("Component two: relaxed to bold style direction")

# Add a concise title explaining this two dimensional projection.
plt.title("Reading a simple two dimensional projection of style data")

# Adjust layout and display the final scatter plot figure.
plt.tight_layout()
plt.show()




### **3.2. Seeing Clusters and Trends**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_03_02.jpg?v=1769972785" width="250">



>* Look for tight groups of nearby points
>* Separated dense groups suggest meaningful data subgroups

>* Projections can show smooth gradients, not just clusters
>* Gradual changes reveal fuzzy boundaries and related groups

>* Use density, overlap, and outliers to interpret the plot
>* Form hypotheses about structure, reliability, and anomalies



In [None]:
#@title Python Code - Seeing Clusters and Trends

# This script shows clusters and trends visually.
# We use simple synthetic data with two projections.
# You will compare clusters and gradients between projections.

# import required libraries for arrays and plotting.
import numpy as np
import matplotlib.pyplot as plt

# set deterministic random seed for reproducible results.
np.random.seed(42)

# create the first tight cluster of points.
cluster_one = np.random.normal(loc=(-2, -2), scale=0.4, size=(40, 2))

# create second cluster with different center location.
cluster_two = np.random.normal(loc=(2, 2), scale=0.4, size=(40, 2))

# create gradient points stretching between clusters.
gradient_line = np.linspace(-2.0, 2.0, 40)

# stack gradient coordinates into two dimensional array.
gradient_cluster = np.column_stack((gradient_line, gradient_line))

# combine all points into one dataset array.
data = np.vstack((cluster_one, cluster_two, gradient_cluster))

# validate dataset shape before plotting anything.
assert data.shape == (120, 2), "Unexpected data shape detected"

# create figure with two side by side subplots.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# first subplot shows original coordinates scatter.
axes[0].scatter(data[:, 0], data[:, 1], c="steelblue", s=20, alpha=0.8)

# add helpful title and axis labels for first subplot.
axes[0].set_title("Projection A: clear clusters and gradient")

# label axes to remind these are combined features.
axes[0].set_xlabel("Combined feature one")
axes[0].set_ylabel("Combined feature two")

# build a second projection mixing original coordinates.
proj_x = (0.7 * data[:, 0]) + (0.3 * data[:, 1])

# build second projection y using different mixing weights.
proj_y = (0.2 * data[:, 0]) - (0.6 * data[:, 1])

# second subplot shows mixed projection scatter.
axes[1].scatter(proj_x, proj_y, c="darkorange", s=20, alpha=0.8)

# add title describing weaker separation in second projection.
axes[1].set_title("Projection B: fuzzier clusters and trends")

# label axes to emphasize different combined features.
axes[1].set_xlabel("Combined feature three")
axes[1].set_ylabel("Combined feature four")

# adjust layout for readability and display the figure.
plt.tight_layout()
plt.show()



### **3.3. Insights From Plots**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_B/image_03_03.jpg?v=1769972806" width="250">



>* Treat projections as clues about data organization
>* Clusters, gradients, and shapes reveal hidden relationships

>* Axes are abstract combined features, not originals
>* Use axis movement and distances to interpret patterns

>* 2D projections lose information, so interpret cautiously
>* Use plots to form hypotheses, then verify elsewhere
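
The warning that 2D projections lose information can be illustrated with a minimal sketch. The two customers and the projection matrix below are made up for illustration. The projection simply keeps the first two features and drops the third.

```python
import numpy as np

# Two hypothetical customers who differ only in their third feature.
customer_a = np.array([4.0, 5.0, 1.0])
customer_b = np.array([4.0, 5.0, 9.0])

# A simple projection that keeps features one and two and ignores feature three.
projection = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

# Project both customers down to two dimensions.
point_a = projection @ customer_a
point_b = projection @ customer_b

# Compare how far apart they are before and after projecting.
dist_3d = np.linalg.norm(customer_a - customer_b)
dist_2d = np.linalg.norm(point_a - point_b)

print("Distance using all three features:", dist_3d)
print("Distance in the 2D projection:", dist_2d)
```

The two customers land on the same spot in the plot even though they are far apart in the original data. Points that sit close together in a projection are therefore only candidates for being similar, which is why patterns seen in a plot should be verified against the original features.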



In [None]:
#@title Python Code - Insights From Plots

# This script shows simple two dimensional projections visually.
# It helps connect scatter patterns with underlying feature structure.
# We use synthetic customers to keep ideas very clear.

# Uncomment the lines below only if a library is missing.
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn

# Import required libraries for numerical work and plotting.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set deterministic random seed for reproducible synthetic data.
rng = np.random.default_rng(seed=42)

# Create three small customer groups with simple spending patterns.
cluster_centers = np.array([[2.0, 8.0, 3.0], [7.0, 2.0, 6.0], [4.5, 5.0, 8.0]])

# Choose number of customers per group and total count.
points_per_cluster = 20
num_clusters = cluster_centers.shape[0]

# Generate reproducible random noise for every customer.
noise = rng.normal(loc=0.0, scale=0.6, size=(points_per_cluster * num_clusters, 3))

# Repeat each center once per customer in its group.
base = np.repeat(cluster_centers, repeats=points_per_cluster, axis=0)

# Combine base and noise to obtain final three dimensional features.
data = base + noise

# Validate shape to avoid unexpected broadcasting mistakes.
assert data.shape == (points_per_cluster * num_clusters, 3)

# Manually build two combined features as simple projections.
combined_one = (0.5 * data[:, 0]) + (0.5 * data[:, 1])

# Build second combined feature emphasizing third original feature strongly.
combined_two = (0.2 * data[:, 1]) + (0.8 * data[:, 2])

# Stack combined features into two dimensional projection array.
projection_2d = np.column_stack((combined_one, combined_two))

# Validate projection shape before plotting for safety.
assert projection_2d.shape[1] == 2 and projection_2d.shape[0] == data.shape[0]

# Build simple cluster labels to color points by original group.
labels = np.repeat(np.arange(num_clusters), repeats=points_per_cluster)

# Print short explanation connecting plot to high dimensional structure.
print("Each point is a customer summarized by two combined features.")
print("Colors show original groups that lived in three feature dimensions.")
print("Look for separated colored clouds or smooth gradients across axes.")

# Create scatter plot showing two dimensional projection of customers.
plt.figure(figsize=(6, 5))

# Use seaborn scatterplot for nicer default styling and legend.
sns.scatterplot(x=projection_2d[:, 0], y=projection_2d[:, 1], hue=labels, palette="deep")

# Label axes to emphasize they are abstract combined features.
plt.xlabel("Combined feature one summarizing two spending types")
plt.ylabel("Combined feature two summarizing another spending mix")

# Add title reminding viewers this is a compressed projection.
plt.title("Reading a two dimensional projection of three dimensional customers")

# Show legend and final plot for visual interpretation practice.
plt.show()



# <font color="#418FDE" size="6.5" uppercase>**Simplifying Features**</font>


In this lecture, you learned to:
- Explain why high-dimensional data can be hard to visualize and reason about. 
- Describe the idea of representing data with fewer combined features. 
- Interpret simple low-dimensional visualizations that summarize higher-dimensional data. 

In the next module (Module 10), we will cover 'Responsible Practice'.