### 📘 Notebook 03: Feature Engineering & Clustering

In this notebook, we move from **data preparation** to **unsupervised learning**.  
Using the cleaned dataset (`books_final_1000.csv`), we'll prepare ratings and price as clustering features, tidy the genre labels for later description, and group similar books with **K-Means clustering**.

**Goals:**
- Prepare numerical and categorical features  
- Apply **K-Means** and evaluate with **Elbow Method** & **Silhouette Score**  
- Visualize and interpret clusters for future recommendations


### Step 1: Imports & Setup

Import core libraries for clustering and visualization,  
import the shared helpers from `functions.py`, and verify that all paths from `config.yaml` are available.


In [None]:
# ============================================================
# Step 1: Imports & Setup
# ============================================================

# --- System and project setup ---
import sys
from pathlib import Path

# Add 'notebooks' folder to path (functions.py lives there)
sys.path.append("notebooks")

# --- Load shared utilities ---
from functions import load_config, ensure_directories

# --- ML and visualization libraries ---
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# --- Visualization settings (as in class) ---
sns.set(style="whitegrid", palette="muted")
plt.rcParams["figure.figsize"] = (8, 5)

# --- Load configuration and verify folders ---
config_path = Path("..") / "config.yaml"
config = load_config(config_path)
ensure_directories(config["paths"])

print("✅ Environment ready: config loaded and directories verified.")


### Step 2: Load Final Dataset

Load the cleaned and standardized dataset (`books_final_1000.csv`) generated in the previous notebook.  
We'll inspect its structure, check column types, and verify that all key variables are ready for feature preparation.


In [None]:
# ============================================================
# Step 2: Load Final Dataset
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load dataset from data/clean ---
data_path = Path("..") / config["paths"]["data_clean"] / "books_final_1000.csv"
df = pd.read_csv(data_path)

print(f"✅ Dataset loaded successfully: {data_path}")
print(f"Shape: {df.shape}\n")

# --- Quick overview ---
display(df.head(5))

# --- Basic info and types ---
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("\n🔍 DataFrame Info:")
df.info()

# --- Missing values summary ---
missing_summary = df.isna().sum()
missing_summary = missing_summary[missing_summary > 0]

if not missing_summary.empty:
    print("\n⚠️ Missing values summary:")
    print(missing_summary)
else:
    print("\n✅ No missing values detected.")

# --- Optional: Unique values check (for categorical columns) ---
print("\n🧩 Unique values per column:")
print(df.nunique())


### Step 2.1: Clean & Normalize Genres

In [None]:
# ============================================================
# Step 2.1: Clean & Normalize Genres
# ============================================================

"""
🎯 Ensure genre names are consistent and meaningful.
Preserve subcategories (e.g., Young Adult Fiction, Juvenile Fiction)
and fix missing or inconsistent values.
"""

import numpy as np

# --- Clean genre text ---
df["genre"] = (
    df["genre"]
    .astype(str)
    .str.strip()
    .replace({"nan": np.nan, "None": np.nan})
)

# --- Replace NaN with "Unknown" ---
df["genre"] = df["genre"].fillna("Unknown")

# --- Title case normalization ---
df["genre"] = df["genre"].str.title()

# --- Fix common variants ---
genre_replacements = {
    "Juvenile Fiction ": "Juvenile Fiction",
    "Young Adult Fiction ": "Young Adult Fiction",
    "Biography & Autobiography ": "Biography & Autobiography",
    "Nan": "Unknown"
}
df["genre"] = df["genre"].replace(genre_replacements)

# --- Final check ---
print("✅ Genre normalization complete.\n")
print("Top 10 genres after cleaning:")
display(df["genre"].value_counts().head(10))


### Step 2.2: Consolidate Main Genres (for Descriptive Analysis)

Before running clustering, we standardize genre labels and fold their variants into main categories
(e.g., any label containing "Juvenile" becomes Juvenile Fiction, and any remaining label containing "Fiction" becomes Fiction).

This step does not affect the clustering model directly,
but ensures cleaner genre information for later descriptive and visualization steps
(e.g., identifying dominant genres within each cluster).
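
Because the consolidation matches on substrings, the order of the checks matters: a generic keyword such as "Fiction" would swallow "Juvenile Fiction" if it were tested first. The sketch below illustrates that idea with a small ordered rule table. The names `GENRE_RULES` and `consolidate` are hypothetical and only cover three of the categories, so treat it as an illustration rather than the notebook's actual function.

```python
# Ordered (keyword, label) pairs: specific labels first, the generic "Fiction" last.
# GENRE_RULES and consolidate are illustrative names, not part of functions.py.
GENRE_RULES = [
    ("Juvenile", "Juvenile Fiction"),
    ("Young Adult", "Young Adult Fiction"),
    ("Fiction", "Fiction"),
]


def consolidate(label):
    """Return the first matching main category, or the label unchanged."""
    for keyword, main in GENRE_RULES:
        if keyword in label:
            return main
    return label


print(consolidate("Juvenile Fiction"))  # Juvenile Fiction
print(consolidate("Science Fiction"))   # Fiction
print(consolidate("Art"))               # Art
```

Keeping the rules in a list makes the precedence explicit and easy to reorder.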

In [None]:
# ============================================================
# Step 2.2: Consolidate Main Genres (for Descriptive Analysis)
# ============================================================

def normalize_genre(value):
    if pd.isna(value):
        return "Unknown"
    value = value.strip().title()
    # Simplify subcategories
    if "Juvenile" in value:
        return "Juvenile Fiction"
    if "Young Adult" in value:
        return "Young Adult Fiction"
    if "Biography" in value:
        return "Biography & Autobiography"
    if "Comics" in value or "Graphic" in value:
        return "Comics & Graphic Novels"
    if "Poet" in value:
        return "Poetry"
    if "Drama" in value:
        return "Drama"
    if "Relig" in value:
        return "Religion"
    if "History" in value:
        return "History"
    if "Fiction" in value:
        return "Fiction"
    return value

# Apply normalization
df["genre"] = df["genre"].apply(normalize_genre)

print("✅ Genres normalized successfully.\n")
print("Top 10 genres after normalization:")
display(df["genre"].value_counts().head(10))


### Step 3: Feature Preparation

We now prepare the numerical features for K-Means clustering.

- Selected features: **avg_rating** and **price**
- Missing prices are filled with the median value
- Features are standardized using **StandardScaler**
- The resulting matrix `X` will be used for clustering
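
To make the standardization step concrete, here is a minimal sketch on made-up values (the `toy` array is illustrative and not taken from the dataset). It checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation, which is what puts rating and price on a comparable scale before distances are computed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values standing in for avg_rating and price (illustrative only)
toy = np.array([[4.0, 8.0], [4.5, 10.0], [3.5, 42.0]])

# StandardScaler subtracts the column mean and divides by the population std (ddof=0)
scaled = StandardScaler().fit_transform(toy)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so a price in the tens of euros no longer dominates a rating that varies by tenths of a point.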


In [None]:
# ============================================================
# Step 3: Feature Preparation (Numeric Features Only)
# ============================================================

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# --- Select relevant numerical columns for clustering ---
features = ["avg_rating", "price"]
df_features = df[features].copy()

# --- Handle missing prices ---
median_price = df_features["price"].median()
df_features["price"] = df_features["price"].fillna(median_price)

# --- Reflect the filled prices back into the main DataFrame ---
df["price"] = df["price"].fillna(median_price)
print(f"Filled missing 'price' values with median: {median_price:.2f}")

# --- Standardize numeric columns ---
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_features)

# --- Convert back to DataFrame for easier handling ---
df_encoded = pd.DataFrame(df_scaled, columns=features)

# --- Sanity check ---
missing_check = df_encoded.isna().sum().sum()
if missing_check == 0:
    print("✅ No missing values remain in feature matrix.")
else:
    print(f"⚠️ {missing_check} missing values still present, check source data.")

# --- Assign to feature matrix for clustering ---
X = df_encoded.values

print(f"\n✅ Feature matrix ready for clustering. Shape: {df_encoded.shape}")

# --- Quick preview ---
display(df_encoded.head(5))


## Step 4: K-Means Clustering (Elbow & Silhouette Method)

In this step, we apply the K-Means algorithm to identify potential groups (clusters) among the books.

We will:
1. Run K-Means for different values of *k* (from 2 to 10)
2. Record both the **Inertia** (Elbow Method) and the **Silhouette Score**
3. Identify the optimal number of clusters (`best_k`) based on the highest silhouette score
4. Visualize both metrics side by side for comparison

The **Elbow Method** helps detect the point where adding more clusters no longer improves the model significantly,  
while the **Silhouette Score** evaluates how well-separated the clusters are (higher values indicate better-defined clusters).

Finally, both the visualizations and metrics are saved for later analysis.
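
As a rough illustration of what the two metrics measure, the sketch below runs K-Means on synthetic, well-separated points (the `X_demo` data is invented for the example) and recomputes inertia by hand as the sum of squared distances from each point to its assigned centroid. It is a sketch under those assumptions, not a replacement for the evaluation cell.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two synthetic, well-separated groups in 2D (illustrative data only)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centers) ** 2).sum())

# Silhouette: mean of per-point scores in [-1, 1]; higher means better separation
sil = silhouette_score(X_demo, km.labels_)

print(f"inertia (sklearn) = {km.inertia_:.4f}, inertia (manual) = {manual_inertia:.4f}")
print(f"silhouette = {sil:.4f}")
```

Inertia always decreases as *k* grows, which is why it is read through the elbow shape, whereas the silhouette score can be compared directly across values of *k*.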


In [None]:
# ============================================================
# Step 4: K-Means Clustering (Elbow & Silhouette Method)
# ============================================================

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd

# --- Feature matrix (numeric only) ---
X = df_encoded.values

# --- Initialize lists ---
inertias = []
silhouette_scores = []
K_range = range(2, 11)

print("Running K-Means for k = 2 to 10...\n")

# --- Run K-Means across different k values ---
for k in K_range:
    try:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        score = silhouette_score(X, kmeans.labels_)
        silhouette_scores.append(score)
        print(f"k={k}: Inertia={kmeans.inertia_:.2f}, Silhouette={score:.4f}")
    except Exception as e:
        print(f"⚠️ Error for k={k}: {e}")
        inertias.append(np.nan)
        silhouette_scores.append(np.nan)

# --- Determine best k by silhouette score (NaN entries from failed fits are ignored) ---
best_k = list(K_range)[int(np.nanargmax(silhouette_scores))]

# --- Show top 3 silhouette values (NaN entries are dropped before sorting) ---
valid_pairs = [(k, s) for k, s in zip(K_range, silhouette_scores) if not np.isnan(s)]
sorted_scores = sorted(valid_pairs, key=lambda x: x[1], reverse=True)
print("\nTop 3 silhouette scores:")
for i, (k_val, s_val) in enumerate(sorted_scores[:3], start=1):
    print(f"{i}. k={k_val} → silhouette={s_val:.4f}")

print(f"\nBest k by silhouette score: {best_k}")

# ============================================================
# Visualization: Elbow & Silhouette
# ============================================================

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Elbow Method
ax1.plot(K_range, inertias, marker='o', color='steelblue')
ax1.set_title("Elbow Method: K-Means Inertia", fontsize=11)
ax1.set_xlabel("Number of Clusters (k)")
ax1.set_ylabel("Inertia")

# Silhouette Scores
ax2.plot(K_range, silhouette_scores, marker='o', color='orange')
ax2.set_title("Silhouette Scores by Number of Clusters", fontsize=11)
ax2.set_xlabel("Number of Clusters (k)")
ax2.set_ylabel("Silhouette Score")

plt.tight_layout()
plt.show()

# ============================================================
# Save Results
# ============================================================

viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)
fig.savefig(viz_path / "kmeans_elbow_silhouette_combined.png", dpi=300, bbox_inches="tight")

metrics_path = Path("..") / "data" / "clean"
metrics_path.mkdir(parents=True, exist_ok=True)

metrics_df = pd.DataFrame({
    "k": list(K_range),
    "inertia": inertias,
    "silhouette": silhouette_scores
})
metrics_df.to_csv(metrics_path / "kmeans_metrics.csv", index=False, encoding="utf-8-sig")

print(f"\n💾 Results saved → {metrics_path / 'kmeans_metrics.csv'}")
print(f"🏷️ Best number of clusters: {best_k}")



### Interpretation of Clustering Metrics


The Elbow Method shows a sharp drop in inertia from k = 2 to k = 3,
after which the curve flattens, suggesting that adding further clusters
provides limited improvement in compactness.

The Silhouette Scores confirm that:
- k = 2 achieves the highest separation (≈ 0.75),
  meaning clusters are well-defined and internally cohesive.
- Higher k values slightly decrease silhouette quality,
  indicating overlapping or less distinct groups.

📊 **Decision:** We'll proceed with **k = 2 clusters**, the value with the
highest silhouette score, which is the selection criterion set in Step 4.

🧠 **Interpretation:**
Each book is represented by two standardized numerical features,
*avg_rating* and *price*. *Genre* is not part of the feature matrix.  
K-Means groups books with similar rating and price profiles into two main clusters,
possibly reflecting broad patterns such as
“high-rated / premium” vs. “lower-rated / affordable” categories,
which can later inform recommendation logic.


## Step 5: Apply Final K-Means & Visualize Clusters (PCA 2D)

After determining the optimal number of clusters (`k=2`), we apply the final K-Means model to assign each book to a specific cluster.

Steps performed:
1. Reuse the standardized numeric features from Step 3 so that both features contribute equally to the distance calculation.
2. Train the K-Means model using the selected value of *k*.
3. Assign the resulting cluster labels to each book.
4. Apply **Principal Component Analysis (PCA)** to project the features onto two principal components for visualization. With only two input features, this rotates the axes rather than reducing dimensionality.
5. Plot the resulting clusters using a 2D scatter plot, where each point represents a book and color indicates its cluster.

The resulting visualization helps identify distinct groups of books based on their ratings and price levels.
Both the model output and visualization are saved for further analysis.
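
A small sketch of what PCA does when the input already has two columns, using random stand-in data (`X_two` is illustrative, not the notebook's matrix). With as many components as features, PCA centers and rotates the data, so all variance is retained and distances between points are unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Illustrative stand-in for the standardized two-column feature matrix
X_two = rng.normal(size=(200, 2))

pca_demo = PCA(n_components=2).fit(X_two)
coords = pca_demo.transform(X_two)

# With as many components as features, no variance is discarded
total_ratio = pca_demo.explained_variance_ratio_.sum()
print(total_ratio)

# Pairwise distances are unchanged, so the cluster geometry looks the same
d_before = np.linalg.norm(X_two[0] - X_two[1])
d_after = np.linalg.norm(coords[0] - coords[1])
print(np.isclose(d_before, d_after))
```

In practice this means the PCA scatter plot shows the same cluster structure as a plain rating-versus-price plot of the scaled features, only along rotated axes.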


In [None]:
# ============================================================
# Step 5: Apply Final K-Means & Visualize Clusters (PCA 2D)
# ============================================================

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import numpy as np

# --- Final number of clusters ---
k_final = 2
print(f"Applying final K-Means model with k = {k_final}...\n")

# --- Use scaled numeric features from df_encoded (already standardized) ---
X_scaled = df_encoded.values

# --- Train K-Means ---
kmeans_final = KMeans(n_clusters=k_final, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_scaled)

# --- Assign clusters safely ---
df = df.copy()
df["cluster"] = cluster_labels

# --- PCA for visualization ---
pca = PCA(n_components=2, random_state=42)
pca_components = pca.fit_transform(X_scaled)
df["pca_1"] = pca_components[:, 0]
df["pca_2"] = pca_components[:, 1]

# ============================================================
# Visualization: PCA 2D Scatter Plot
# ============================================================

fig_pca, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="pca_1", y="pca_2",
    hue="cluster",
    palette="Set2",
    s=65,
    alpha=0.85,
    edgecolor="white",
    linewidth=0.7,
    ax=ax
)
ax.set_title(f"Book Clusters: PCA 2D Projection (k = {k_final})", fontsize=13, pad=10)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.legend(title="Cluster", loc="best", fontsize=9)
ax.grid(alpha=0.25, linestyle="--")
plt.tight_layout()
plt.show()

# ============================================================
# Save Plot
# ============================================================

viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)
fig_pca.savefig(viz_path / f"pca_clusters_k{k_final}.png", dpi=300, bbox_inches="tight")

print(f"Clustering completed. {df['cluster'].nunique()} clusters created.")
print(f"PCA cluster visualization saved → {viz_path / f'pca_clusters_k{k_final}.png'}\n")

# --- Preview sample ---
display(df[["title", "author", "avg_rating", "genre", "price", "cluster"]].head(10))


### Cluster Interpretation & Summary

### 📊 Interpretation of Final Clusters

The 2-cluster configuration reveals two distinct groups of books based primarily on their **average rating** and **price level**.  
Although genre information wasn't directly used in the clustering, it helps describe each cluster's general tendencies.  
K-Means label numbers are arbitrary, so confirm which label carries the higher mean price against the centroid table in Step 5.1.

---

#### 📘 **Cluster 0 (Green)**
- Contains books with **higher average prices**, sometimes reaching premium levels.  
- ⭐ Average ratings remain solid (≈ 4.1), but these titles show greater **variability in pricing**, suggesting a mix of editions or market positions.  
- The genre distribution is more diverse, often including **specialized or non-mainstream categories**.

---

#### 📙 **Cluster 1 (Orange)**
- Concentrates books with **moderate, consistent prices** (around the dataset's median ≈ 9 EUR).  
- ⭐ Maintains strong average ratings (≈ 4.1–4.3), representing a stable quality baseline.  
- This group likely includes **popular, widely accessible titles**, often within **fiction**.

---

### 🧠 **Interpretation**

The model separates books mainly by **price range** and **rating consistency**:

- **Cluster 1** groups books that are **affordable and evenly rated**, suggesting **mainstream popularity**.  
- **Cluster 0** includes **higher-priced or more varied titles**, pointing toward **specialized or premium segments**.

This segmentation provides a solid foundation for future **recommendation logic**:  
books within the same cluster share similar *value* and *rating profiles*,  
helping guide both **price-sensitive** and **quality-focused** suggestions.


### Step 5.1: Analyze Cluster Centroids

Let's inspect the numerical centroids of each cluster to understand
how books are grouped, for example by their average rating or price.
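
The fitted model stores its centroids in standardized units, which are hard to read. The sketch below, on invented data (`demo` and the `demo_*` names are illustrative), maps the centroids back to rating points and prices with the scaler's `inverse_transform` and checks that they match the per-cluster means, which is the quantity a `groupby(...).mean()` summary reports.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative data only: five cheaper books and two expensive ones
demo = pd.DataFrame({
    "avg_rating": [4.1, 4.2, 4.0, 4.3, 4.1, 4.2, 4.0],
    "price": [8.0, 9.0, 10.0, 9.5, 8.5, 40.0, 45.0],
})

demo_scaler = StandardScaler()
demo_X = demo_scaler.fit_transform(demo)
demo_km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(demo_X)

# Centroids live in standardized space; map them back to original units
centroids_original = pd.DataFrame(
    demo_scaler.inverse_transform(demo_km.cluster_centers_),
    columns=demo.columns,
)

# Standardization is linear, so these equal the per-cluster means
group_means = demo.groupby(demo_km.labels_).mean()

print(centroids_original.round(2))
print(group_means.round(2))
```

Row *i* of `centroids_original` corresponds to cluster label *i*, so the two tables can be compared line by line.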


In [None]:
# ============================================================
# Step 5.1: Cluster Composition Summary
# ============================================================

"""
🎯 Step 5.1: Analyze Cluster Composition by Genre
This version summarizes how genres distribute within each cluster,
while also showing the average rating and price per cluster.
"""

# --- Numeric centroids (average rating & price) ---
cluster_centroids = (
    df.groupby("cluster", as_index=False)
    .agg({
        "avg_rating": "mean",
        "price": "mean"
    })
    .round(2)
)

# --- Genre distribution within each cluster ---
genre_distribution = (
    df.groupby(["cluster", "genre"])
    .size()
    .reset_index(name="count")
)

# --- Calculate proportions per cluster ---
cluster_sizes = df["cluster"].value_counts().to_dict()
genre_distribution["proportion_%"] = genre_distribution.apply(
    lambda row: round((row["count"] / cluster_sizes[row["cluster"]]) * 100, 2),
    axis=1
)

# --- Merge numeric averages with genre distribution ---
cluster_summary = (
    genre_distribution.merge(cluster_centroids, on="cluster", how="left")
    .sort_values(["cluster", "count"], ascending=[True, False])
)

# --- Display top 5 genres per cluster ---
print("📊 Cluster Composition Summary (Top 5 Genres per Cluster):\n")
display(cluster_summary.groupby("cluster").head(5))

print("\n🧭 Interpretation Guide:")
print("- avg_rating → Average rating within the cluster.")
print("- price → Mean price (EUR) within the cluster.")
print("- count → Number of books belonging to that genre in the cluster.")
print("- proportion_% → Relative share (%) of that genre within the cluster.")


### 📚 Interpretation: Cluster Composition by Genre

The genre distribution confirms that **Cluster 0** represents the mainstream market:
books with average prices around €9 and high reader satisfaction (≈ 4.1 ★),
mostly from **fiction-related categories**.

**Cluster 1**, in contrast, groups a few high-priced titles (≈ €43),
covering specialized or collector genres such as *Art* or *Literary Criticism*.
These books maintain high ratings, suggesting that higher price is associated
with niche appeal or premium editions rather than lower quality.

Overall, while genre was not part of the clustering features,
its distribution provides valuable context:
the clusters reflect **economic segmentation** in the book market,
with Fiction dominating the accessible range and rare genres defining the premium range.
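
For a compact view of genre shares per cluster, one option is `pd.crosstab` with `normalize="index"`, which divides each row by its own total. The sketch below uses a tiny invented frame (`demo_df`) and is an alternative presentation, not the notebook's own summary code.

```python
import pandas as pd

# Illustrative labels only, not taken from the dataset
demo_df = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1],
    "genre": ["Fiction", "Fiction", "History", "Fiction", "Art", "Literary Criticism"],
})

# normalize="index" turns counts into within-cluster proportions
genre_share = (
    pd.crosstab(demo_df["cluster"], demo_df["genre"], normalize="index")
    .mul(100)
    .round(2)
)
print(genre_share)
```

Each row sums to 100, so small clusters can be compared with large ones on the same footing.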



## Step 6: Export Final Clustered Dataset

We'll now export the final dataset including the cluster labels (`cluster`)
and PCA coordinates (`pca_1`, `pca_2`) for visualization and further analysis.  
This dataset can be used in **Tableau**, **Power BI**, or in the next notebook
for building the **Recommendation System**.
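
Since the export selects a fixed list of columns, a missing column would raise a `KeyError` at write time. A minimal guard, sketched here with a hypothetical helper named `missing_columns` and a made-up frame, lists any absent columns before the export is attempted.

```python
import pandas as pd


def missing_columns(frame, required):
    """Return the required column names that are absent from the frame.

    Hypothetical helper for illustration; it is not part of functions.py.
    """
    return [col for col in required if col not in frame.columns]


# Illustrative frame that lacks the PCA coordinates
demo_frame = pd.DataFrame({"title": ["A"], "author": ["B"], "cluster": [0]})
required = ["title", "author", "cluster", "pca_1", "pca_2"]

absent = missing_columns(demo_frame, required)
print(absent)  # ['pca_1', 'pca_2']
```

An empty list means every expected column is present and the export can proceed.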


In [None]:
# ============================================================
# Step 6: Export Final Clustered Dataset & Cluster Summary
# ============================================================

from pathlib import Path
import pandas as pd

# --- Define export paths ---
data_clean_path = Path("..") / config["paths"]["data_clean"]
viz_path = Path("..") / "visualizations"
data_clean_path.mkdir(parents=True, exist_ok=True)
viz_path.mkdir(parents=True, exist_ok=True)

# --- Export final dataset ---
final_cluster_path = data_clean_path / "books_clustered_final.csv"

export_cols = [
    "title", "author", "avg_rating", "genre", "price", "currency",
    "cover_url", "link", "cluster", "pca_1", "pca_2"
]

df[export_cols].to_csv(final_cluster_path, index=False, encoding="utf-8-sig")
print(f"💾 Final clustered dataset saved successfully → {final_cluster_path.resolve()}")

print("\n📘 Sample of exported dataset:")
display(df[export_cols].head(10))

# ============================================================
# 📊 Cluster Summary (improved: true genre distribution)
# ============================================================

# --- Numeric stats ---
cluster_centroids = (
    df.groupby("cluster")
    .agg({
        "avg_rating": "mean",
        "price": "mean"
    })
    .round(2)
    .reset_index()
)

# --- Genre distribution ---
genre_distribution = (
    df.groupby(["cluster", "genre"])
    .size()
    .reset_index(name="count")
)

# --- Add proportions per cluster ---
cluster_sizes = df["cluster"].value_counts().to_dict()
genre_distribution["proportion_%"] = genre_distribution.apply(
    lambda row: round((row["count"] / cluster_sizes[row["cluster"]]) * 100, 2),
    axis=1
)

# --- Merge numeric stats with genre composition ---
cluster_summary = genre_distribution.merge(cluster_centroids, on="cluster", how="left")

# --- Sort by cluster and descending count ---
cluster_summary = cluster_summary.sort_values(["cluster", "count"], ascending=[True, False])

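print("\n📗 Cluster Summary (Top Genres per Cluster):")
display(cluster_summary.groupby("cluster").head(5))

# --- Save the summary so the message below refers to a file that was actually written ---
cluster_summary_path = data_clean_path / "cluster_summary.csv"
cluster_summary.to_csv(cluster_summary_path, index=False, encoding="utf-8-sig")
print(f"\n💾 Cluster summary saved → {cluster_summary_path}")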
