### 📘 Notebook 03 - Feature Engineering & Clustering

In this notebook, we move from **data preparation** to **unsupervised learning**.  
Using the cleaned dataset (`books_final_1000.csv`), we'll extract key features such as ratings, price, and genre to group similar books with **K-Means clustering**.

**Goals:**
- Prepare numerical and categorical features  
- Apply **K-Means** and evaluate with **Elbow Method** & **Silhouette Score**  
- Visualize and interpret clusters for future recommendations


### Step 1 - Imports & Setup

Import core libraries for clustering and visualization,  
reload the shared `functions.py` module, and verify that all paths from `config.yaml` are available.


In [None]:
# ============================================================
# Step 1 - Imports & Setup
# ============================================================

# --- System and project setup ---
import sys
from pathlib import Path

# Add 'notebooks' folder to path (functions.py lives there)
sys.path.append("notebooks")

# --- Load shared utilities ---
from functions import load_config, ensure_directories

# --- ML and visualization libraries ---
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# --- Visualization settings (as in class) ---
sns.set(style="whitegrid", palette="muted")
plt.rcParams["figure.figsize"] = (8, 5)

# --- Load configuration and verify folders ---
config_path = Path("..") / "config.yaml"
config = load_config(config_path)
ensure_directories(config["paths"])

print("✅ Environment ready - config loaded and directories verified.")


### Step 2 ‚Äî Load Final Dataset  

Load the cleaned and standardized dataset (`books_final_1000.csv`) generated in the previous notebook.  
We'll inspect its structure, check column types, and verify that all key variables are ready for feature preparation.  


In [None]:
# ============================================================
# Step 2 - Load Final Dataset
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load dataset from data/clean ---
data_path = Path("..") / config["paths"]["data_clean"] / "books_final_1000.csv"
df = pd.read_csv(data_path)

print(f"✅ Dataset loaded successfully: {data_path}")
print(f"Shape: {df.shape}\n")

# --- Quick overview ---
display(df.head(5))

# --- Basic info and types ---
print("\n🔍 DataFrame Info:")
df.info()  # info() prints directly and returns None, so it needs no print() wrapper

# --- Missing values summary ---
missing_summary = df.isna().sum()
missing_summary = missing_summary[missing_summary > 0]

if not missing_summary.empty:
    print("\n⚠️ Missing values summary:")
    print(missing_summary)
else:
    print("\n✅ No missing values detected.")

# --- Optional: Unique values check (for categorical columns) ---
print("\n🧩 Unique values per column:")
print(df.nunique())


### Step 3 - Feature Preparation

Select relevant features for clustering:
- `avg_rating` (numerical)
- `price` (numerical)
- `genre` (categorical)

We'll:
1. Fill missing prices with the median value  
2. Encode `genre` using One-Hot Encoding  
3. Standardize numerical features with `StandardScaler`  
4. Combine all into a clean feature matrix for K-Means
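
The four steps above can also be expressed as a single scikit-learn preprocessing object. The sketch below is an alternative formulation, not the code this notebook runs: the toy values are hypothetical, the names `toy`, `numeric_steps`, and `preprocess` are illustrative, and `OneHotEncoder` keeps every genre column, whereas the notebook cell uses `pd.get_dummies(..., drop_first=True)`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical values (not taken from books_final_1000.csv)
toy = pd.DataFrame({
    "avg_rating": [4.2, 3.9, 4.5, 4.0],
    "price": [9.0, np.nan, 15.0, 7.0],
    "genre": ["Fiction", "History", "Fiction", "Science"],
})

# Steps 1 and 3: median imputation, then standardization of numeric columns
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Steps 2 and 4: one-hot encode genre and combine everything into one matrix
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["avg_rating", "price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
])

X_toy = preprocess.fit_transform(toy)
# The transformer may return a sparse matrix; convert to a dense array either way
X_toy = X_toy.toarray() if hasattr(X_toy, "toarray") else np.asarray(X_toy)
print(X_toy.shape)
```

Keeping imputation and scaling inside one fitted object means the same transformation can be reapplied to new books with `preprocess.transform(...)`.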


In [None]:
# ============================================================
# Step 3 - Feature Preparation (Fix Missing Price Fill)
# ============================================================

from sklearn.preprocessing import StandardScaler
import pandas as pd

# --- Select relevant columns ---
features = ["avg_rating", "price", "genre"]
df_features = df[features].copy()

# --- Handle missing prices ---
median_price = df_features["price"].median()
df_features["price"] = df_features["price"].fillna(median_price)

# --- Reflect the filled prices back into df ---
df["price"] = df["price"].fillna(median_price)
print(f"Filled missing 'price' values with median: {median_price:.2f}")

# --- One-Hot Encode 'genre' (dtype=float keeps the dummy columns numeric) ---
df_encoded = pd.get_dummies(df_features, columns=["genre"], drop_first=True, dtype=float)
print(f"Encoded dataset has {df_encoded.shape[1]} columns after one-hot encoding.")

# --- Standardize numeric columns ---
scaler = StandardScaler()
numeric_cols = ["avg_rating", "price"]
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
print("📏 Numeric columns standardized (avg_rating, price).")

# --- Optional sanity check ---
missing_check = df_encoded.isna().sum().sum()
if missing_check == 0:
    print("✅ No missing values remain in feature matrix.")
else:
    print(f"⚠️ {missing_check} missing values still present - check source data.")

# --- Assign to feature matrix for clustering ---
X = df_encoded.values

print(f"\n✅ Feature matrix ready for clustering. Shape: {df_encoded.shape}")

# --- Quick preview ---
display(df_encoded.head(5))


### Step 4 - K-Means Clustering (Elbow & Silhouette Method)

We'll apply **K-Means clustering** to group similar books based on their features.  
To choose the optimal number of clusters (`k`), we'll use:
- the **Elbow Method** (inertia plot), and  
- the **Silhouette Score** (cluster separation quality).  

This helps identify a balance between compactness and separation of clusters.
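
For intuition about the silhouette score: for a point i it is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below checks this by hand on four hypothetical 1-D points. The helper `silhouette_by_hand` is illustrative only and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four 1-D points in two hand-assigned clusters (hypothetical toy data)
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_by_hand(X, labels, i):
    """Compute s(i) = (b - a) / max(a, b) for point i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = dists[same].mean()  # cohesion: mean distance within the cluster
    b = min(  # separation: mean distance to the nearest other cluster
        dists[labels == c].mean() for c in np.unique(labels) if c != labels[i]
    )
    return (b - a) / max(a, b)


manual = np.array([silhouette_by_hand(X_toy, labels, i) for i in range(len(X_toy))])
print(manual.round(3))
print(silhouette_samples(X_toy, labels).round(3))
```

`silhouette_score`, used in the next cell, is the mean of these per-point values.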


In [None]:
# ============================================================
# Step 4 - K-Means Clustering (Elbow & Silhouette Method)
# ============================================================

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import numpy as np

# --- Feature matrix ---
X = df_encoded.copy()

# --- Initialize lists ---
inertias = []
silhouette_scores = []
K_range = range(2, 11)

print("🔹 Running K-Means for k = 2 to 10...\n")

# --- Run K-Means across different k values ---
for k in K_range:
    try:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        score = silhouette_score(X, kmeans.labels_)
        silhouette_scores.append(score)
        print(f"✅ k={k} - Inertia={kmeans.inertia_:.2f}, Silhouette={score:.4f}")
    except Exception as e:
        print(f"⚠️ Error for k={k}: {e}")
        inertias.append(np.nan)
        silhouette_scores.append(np.nan)

# --- Determine best k by silhouette score ---
# nanargmax skips the NaN placeholders left by any failed fit
best_k = K_range[int(np.nanargmax(silhouette_scores))]
print(f"\n🌟 Best k by silhouette score: {best_k}\n")

# ============================================================
# 📈 Visualization - Elbow & Silhouette
# ============================================================

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# --- Elbow Method ---
ax1.plot(K_range, inertias, marker='o', color='steelblue')
ax1.set_title("Elbow Method - K-Means Inertia", fontsize=11)
ax1.set_xlabel("Number of Clusters (k)")
ax1.set_ylabel("Inertia")

# --- Silhouette Scores ---
ax2.plot(K_range, silhouette_scores, marker='o', color='orange')
ax2.set_title("Silhouette Scores by Number of Clusters", fontsize=11)
ax2.set_xlabel("Number of Clusters (k)")
ax2.set_ylabel("Silhouette Score")

plt.tight_layout()
plt.show()

# ============================================================
# 💾 Save Results
# ============================================================

viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)

fig.savefig(viz_path / "kmeans_elbow_silhouette_combined.png", dpi=300, bbox_inches="tight")
print(f"💾 Combined plot saved → {viz_path / 'kmeans_elbow_silhouette_combined.png'}")

# Use the configured clean-data folder, as in Step 2
metrics_path = Path("..") / config["paths"]["data_clean"]
metrics_path.mkdir(parents=True, exist_ok=True)

metrics_df = pd.DataFrame({
    "k": list(K_range),
    "inertia": inertias,
    "silhouette": silhouette_scores
})
metrics_df.to_csv(metrics_path / "kmeans_metrics.csv", index=False, encoding="utf-8-sig")

print(f"💾 Metrics saved → {metrics_path / 'kmeans_metrics.csv'}")


### Step 4.1 - Interpretation of Clustering Metrics

The Elbow Method shows a sharp drop in inertia from k = 2 to k = 3,
after which the curve flattens, suggesting that adding more clusters
beyond k = 3 provides limited improvement in compactness.
The elbow alone therefore points to a small k without settling the choice.

The Silhouette Scores favor the smaller value:
- k = 2 achieves the highest separation (≈ 0.75),
  meaning clusters are well-defined and internally cohesive.
- Higher k values slightly decrease silhouette quality,
  indicating overlapping or less distinct groups.

📊 **Decision:** We'll proceed with **k = 2 clusters**, as it offers 
a strong balance between compactness and separation.

🧠 **Interpretation:**
Each book is represented by numerical and categorical features such as
*avg_rating*, *genre*, and *price*.  
K-Means groups books with similar characteristics into two main clusters,
possibly reflecting broad patterns such as
"high-rated / premium" vs. "lower-rated / affordable" categories,
which can later inform recommendation logic.
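
One way to read an elbow curve numerically is to tabulate how much inertia each extra cluster removes and set it beside the silhouette score. The sketch below does this on synthetic data from `make_blobs`, not on the books dataset. All names ending in `_demo` and the `inertia_drop_pct` column are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the books dataset)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

rows = []
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    rows.append({
        "k": k,
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X_demo, model.labels_),
    })

metrics_demo = pd.DataFrame(rows)
# Percentage of inertia removed by going from k-1 to k (NaN for the first row)
metrics_demo["inertia_drop_pct"] = (
    metrics_demo["inertia"].pct_change().mul(-100).round(1)
)
best_k_demo = int(metrics_demo.loc[metrics_demo["silhouette"].idxmax(), "k"])

print(metrics_demo)
print("Best k by silhouette:", best_k_demo)
```

When the largest inertia drop and the silhouette maximum point to different values of k, as they can on real data, the choice becomes a judgment call between compactness and separation.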


### Step 5 - Apply Final K-Means & Visualize Clusters (PCA 2D Projection)

Now that we've decided on **k = 2 clusters**, we'll train the final K-Means model.  
Then, we'll apply **Principal Component Analysis (PCA)** to reduce the high-dimensional
feature space into **two components**, allowing us to visualize the book clusters in 2D.  

This helps identify group patterns: for example,  
books that share similar ratings, prices, or genres might fall close together.
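
A 2D projection shows only part of the variation in the data, so it helps to check how much variance the two components retain before reading too much into the scatter plot. The sketch below uses a random stand-in matrix, not the notebook's features. `X_demo`, `pca_demo`, and `coords` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical stand-in for a scaled feature matrix: 200 rows, 6 columns
X_demo = rng.normal(size=(200, 6))
# Make the second column strongly correlated with the first
X_demo[:, 1] = 0.9 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

pca_demo = PCA(n_components=2, random_state=42)
coords = pca_demo.fit_transform(X_demo)

# Share of total variance captured by each of the two components
ratio = pca_demo.explained_variance_ratio_
print("Per component:", ratio.round(3))
print("Retained in 2D:", round(float(ratio.sum()), 3))
```

If the retained share is low, clusters that overlap in the plot may still be well separated in the full feature space.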


In [None]:
# ============================================================
# Step 5 - Apply Final K-Means & Visualize Clusters (PCA 2D)
# ============================================================

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import numpy as np

# --- Final number of clusters ---
k_final = 2
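print(f"🏷️ Applying final K-Means model with k = {k_final}...\n")

# --- Scale features ---
# Note: avg_rating and price were already standardized in Step 3. This pass also
# standardizes the one-hot genre columns, so the matrix differs from the one
# used for the k search in Step 4.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded)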

# --- Train K-Means ---
kmeans_final = KMeans(n_clusters=k_final, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_scaled)

# --- Assign clusters safely ---
df = df.copy()
df["cluster"] = cluster_labels

# --- PCA for visualization ---
pca = PCA(n_components=2, random_state=42)
pca_components = pca.fit_transform(X_scaled)
df["pca_1"] = pca_components[:, 0]
df["pca_2"] = pca_components[:, 1]

# ============================================================
# 🎨 Enhanced Visualization - PCA 2D Scatter Plot
# ============================================================

fig_pca, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="pca_1", y="pca_2",
    hue="cluster",
    palette="Set2",
    s=65,
    alpha=0.85,
    edgecolor="white",
    linewidth=0.7,
    ax=ax
)
ax.set_title(f"Book Clusters - PCA 2D Projection (k = {k_final})", fontsize=13, pad=10)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.legend(title="Cluster", loc="best", fontsize=9)
ax.grid(alpha=0.25, linestyle="--")
plt.tight_layout()
plt.show()

# ============================================================
# 💾 Save Plot
# ============================================================

viz_path = Path("..") / "visualizations"
fig_pca.savefig(viz_path / f"pca_clusters_k{k_final}.png", dpi=300, bbox_inches="tight")

print(f"Clustering completed. {df['cluster'].nunique()} clusters created.")
print(f"PCA cluster visualization saved → {viz_path / f'pca_clusters_k{k_final}.png'}\n")

# --- Preview sample ---
display(df[["title", "author", "avg_rating", "genre", "price", "cluster"]].head(10))


### 📊 Interpretation of Final Clusters

The 2-cluster configuration reveals two groups of books that differ in average rating, price level, and genre.
Genre entered the model through the one-hot encoded columns, and it also helps describe each cluster's general tendencies.

📘 **Cluster 0 (Green):**

- Prices vary more than in Cluster 1 and sometimes reach premium levels, suggesting a mix of editions or market positions, although the mean price is lower (7.89 EUR against 9.15 EUR in the final summary table).
- Average ratings remain solid (≈ 4.1).
- The genre distribution is more diverse, often including specialized or non-mainstream categories.

📙 **Cluster 1 (Orange):**

- Concentrates books with moderate, consistent prices (around the dataset's median ≈ 9 EUR).
- Also maintains strong average ratings (≈ 4.1–4.3), representing a stable quality baseline.
- This group likely includes popular, widely accessible titles, often in fiction.

🧠 **Interpretation:**
The model separates books mainly by price consistency and rating level:

- Cluster 1 groups books that are consistently priced and evenly rated, suggesting mainstream popularity.
- Cluster 0 includes more varied titles, pointing toward specialized or niche segments.

This segmentation provides a useful foundation for recommendation logic: books within the same cluster share similar value and rating profiles, which can guide price-sensitive or quality-focused suggestions.

### Step 5.1 - Analyze Cluster Centroids

Let's inspect the numerical centroids of each cluster to understand
how books are grouped, for example, by their average rating or price.
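
K-Means stores its centroids in the scaled feature space, so they need to be mapped back before they can be read as ratings or prices. The sketch below, on hypothetical toy values, shows that passing `cluster_centers_` through the scaler's `inverse_transform` gives the same numbers as a per-cluster mean in original units. The names `toy`, `scaler_demo`, and `km_demo` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values: three cheap books and three expensive ones
toy = pd.DataFrame({
    "avg_rating": [4.0, 4.1, 3.9, 4.6, 4.7, 4.5],
    "price": [5.0, 6.0, 5.5, 20.0, 22.0, 21.0],
})

scaler_demo = StandardScaler()
X_demo = scaler_demo.fit_transform(toy)
km_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Centroids mapped from scaled space back to original units
centroids = pd.DataFrame(
    scaler_demo.inverse_transform(km_demo.cluster_centers_),
    columns=toy.columns,
)

# Per-cluster means computed directly in original units
group_means = toy.groupby(km_demo.labels_).mean()

print(centroids.round(2))
print(group_means.round(2))
```

The equivalence holds for the numeric columns. A dominant genre per cluster still has to come from a `groupby` on the original frame, as in the next cell.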


In [None]:
# ============================================================
# Step 5.1 - Analyze Cluster Centroids
# ============================================================

"""
🎯 Step 5.1 - Analyze Cluster Centroids
Summarize the mean rating, mean price, and dominant genre
to understand the main characteristics of each cluster.
"""

cluster_centroids = (
    df.groupby("cluster")
    .agg({
        "avg_rating": "mean",
        "price": "mean",
        "genre": lambda x: x.mode().iloc[0] if not x.mode().empty else "Unknown"
    })
    .round(2)
)

# --- Add cluster sizes ---
cluster_counts = df["cluster"].value_counts().sort_index()
cluster_centroids["count"] = cluster_counts.values

print("📊 Cluster Centroids Summary:\n")
display(cluster_centroids)

print("\n🧭 Interpretation Guide:")
print("- avg_rating → Average book rating per cluster.")
print("- price → Mean price, useful to detect premium vs. budget titles.")
print("- genre → Most common genre in each cluster.")
print("- count → Number of books in each cluster.")


### Step 5.2 - Cluster Interpretation & Summary

🧭 **Interpretation**

**Cluster 0: "Diverse & Low-Data Group"**  
Books in this cluster display greater variability and include several entries
with missing or undefined genre information (often labeled as *Unknown*).  
Average prices are slightly lower or absent, and the group blends fiction with
a few non-fiction or academic titles.  
These books may represent **niche or irregular records**: titles with limited metadata,
specialized topics, or editions without consistent pricing.

**Cluster 1: "Mainstream & Popular Fiction"**  
This cluster gathers the majority of the dataset, dominated by **Fiction**
and related genres.  
Books here show **consistent prices**, slightly **higher average ratings** (≈ 4.1–4.3),
and reflect the **core of widely read, well-rated works**, from bestsellers
to recognized literary classics.

💬 **Overall Insight**  
With the expanded dataset of ≈ 1,100 books, the clustering naturally separates titles
into two broad groups:

- A **mainstream segment** (Cluster 1) capturing structured, popular fiction.  
- A **diverse / incomplete-metadata segment** (Cluster 0) containing outliers
  or books missing genre and pricing details.

This suggests that **genre completeness and pricing consistency**
are major differentiators in how the model groups books.  
Future data enrichment (for instance, filling missing *genre* values through
an API like *Google Books*) could sharpen these boundaries
and yield finer thematic clusters for recommendation logic.
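
The claim that genre completeness separates the clusters can be checked by comparing the share of missing or *Unknown* genres per cluster. The sketch below uses a hypothetical six-row frame with the same `cluster` and `genre` column names. It assumes that undefined genres appear either as nulls or as the literal string `"Unknown"`.

```python
import pandas as pd

# Hypothetical toy frame mirroring the `cluster` and `genre` columns
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "genre": [None, "Unknown", "History", "Fiction", "Fiction", "Fantasy"],
})

# Treat both nulls and the literal "Unknown" label as missing genre metadata
missing_genre = toy["genre"].isna() | toy["genre"].eq("Unknown")

# The mean of a boolean column is the share of True values in each cluster
share_missing = missing_genre.groupby(toy["cluster"]).mean()
print(share_missing.round(2))
```

A large gap between the two shares would support the reading that one cluster collects the records with incomplete metadata.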


### Step 6 - Export Final Clustered Dataset

We'll now export the final dataset including the cluster labels (`cluster`)
and PCA coordinates (`pca_1`, `pca_2`) for visualization and further analysis.  
This dataset can be used in **Tableau**, **Power BI**, or in the next notebook
for building the **Recommendation System**.
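
Selecting a list of columns from a DataFrame raises a `KeyError` if any of them is absent, so checking the export list against the frame first can save a failed run. The sketch below is a minimal guard. The helper `split_export_columns` and the toy frame are hypothetical and not part of `functions.py`.

```python
import pandas as pd


def split_export_columns(
    frame: pd.DataFrame, wanted: list[str]
) -> tuple[list[str], list[str]]:
    """Return (present, missing) column names, preserving the requested order."""
    present = [c for c in wanted if c in frame.columns]
    missing = [c for c in wanted if c not in frame.columns]
    return present, missing


# Toy frame lacking the `currency` column (hypothetical values)
toy = pd.DataFrame({"title": ["A"], "cluster": [0], "pca_1": [0.1], "pca_2": [-0.2]})

present, missing = split_export_columns(
    toy, ["title", "currency", "cluster", "pca_1", "pca_2"]
)
print("Exporting:", present)
print("Missing:", missing)
```

Whether to export only the present columns or stop on any missing one depends on how strict the downstream dashboard or notebook needs to be.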


In [None]:
# ============================================================
# Step 6 - Export Final Clustered Dataset & Cluster Summary
# ============================================================

from pathlib import Path
import pandas as pd

# --- Define export paths ---
data_clean_path = Path("..") / config["paths"]["data_clean"]
viz_path = Path("..") / "visualizations"
data_clean_path.mkdir(parents=True, exist_ok=True)
viz_path.mkdir(parents=True, exist_ok=True)

# --- Export final dataset ---
final_cluster_path = data_clean_path / "books_clustered_final.csv"

export_cols = [
    "title", "author", "avg_rating", "genre", "price", "currency",
    "cover_url", "link", "cluster", "pca_1", "pca_2"
]

df[export_cols].to_csv(final_cluster_path, index=False, encoding="utf-8-sig")
print(f"💾 Final clustered dataset saved successfully → {final_cluster_path.resolve()}")

# --- Quick preview ---
print("\n📘 Sample of exported dataset:")
display(df[export_cols].head(10))

# ============================================================
# 📊 Cluster Summary (for visualization or presentation)
# ============================================================

cluster_summary = (
    df.groupby("cluster")
    .agg({
        "avg_rating": "mean",
        "price": "mean",
        "genre": lambda x: x.mode().iloc[0] if not x.mode().empty else "Unknown"
    })
    .round(2)
    .reset_index()
)

# --- Add cluster sizes ---
cluster_summary["count"] = df["cluster"].value_counts().sort_index().values

# --- Save summary ---
summary_path = data_clean_path / "cluster_summary.csv"
cluster_summary.to_csv(summary_path, index=False, encoding="utf-8-sig")

print("\n📗 Cluster Summary:")
display(cluster_summary)

print(f"\n💾 Cluster summary saved → {summary_path.resolve()}")


In [None]:
# ============================================================
# 📊 Cluster Summary Table - Final Styled Version (Updated for k = 2)
# ============================================================

import pandas as pd
from IPython.display import display, HTML

# --- Summary values from the notebook run, hard-coded for presentation ---
cluster_summary = pd.DataFrame({
    "Cluster": [0, 1],
    "Dominant Genre": ["Unknown", "Fiction"],
    "Avg Rating": [4.05, 4.21],
    "Avg Price (EUR)": [7.89, 9.15]
})

# --- Styling (same as before) ---
styled_summary = (
    cluster_summary.style
    .set_caption("📚 Cluster Summary - Average Features per Group (k = 2)")
    .set_table_styles([
        {"selector": "caption", 
         "props": [("text-align", "left"), ("font-size", "16px"), 
                   ("font-weight", "bold"), ("color", "#00c3ff")]},
        {"selector": "table", 
         "props": [("border", "2px solid #00c3ff"), ("border-radius", "8px"), 
                   ("border-collapse", "collapse")]},
        {"selector": "th", 
         "props": [("background-color", "#1c1c1c"), ("color", "white"), 
                   ("text-align", "center"), ("font-size", "14px")]},
        {"selector": "td", 
         "props": [("background-color", "#505050"), ("color", "#f2f2f2"), 
                   ("font-size", "13px"), ("text-align", "center")]}
    ])
    .hide(axis="index")
    .format({
        "Avg Rating": "{:.2f}",
        "Avg Price (EUR)": "{:.2f}"
    })
)

# --- Display ---
display(styled_summary)
