### 📘 Notebook 03: Feature Engineering & Clustering

In this notebook, we move from **data preparation** to **unsupervised learning**.  
Using the cleaned dataset (`books_final_1000.csv`), we'll extract key features such as ratings, price, and genre to group similar books with **K-Means clustering**.

**Goals:**
- Prepare numerical and categorical features  
- Apply **K-Means** and evaluate it with the **Elbow Method** & **Silhouette Score**  
- Visualize and interpret clusters for future recommendations


### Step 1: Imports & Setup

Import core libraries for clustering and visualization,  
reload the shared `functions.py` module, and verify that all paths from `config.yaml` are available.


In [None]:
# ============================================================
# Step 1: Imports & Setup
# ============================================================

import sys
from pathlib import Path
import importlib

# --- Access shared functions ---
sys.path.append("notebooks")
import functions
importlib.reload(functions)  # reload first so the names imported below are fresh
from functions import load_config, ensure_directories

# --- Additional libraries for ML and visualization ---
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# --- Load configuration from root ---
config_path = Path("..") / "config.yaml"
config = load_config(config_path)

# --- Verify folders ---
ensure_directories(config["paths"])

print("✅ Environment ready: config loaded and directories verified.")


### Step 2: Load Final Dataset  

Load the cleaned and standardized dataset (`books_final_1000.csv`) generated in the previous notebook.  
We'll inspect its structure, check column types, and verify that all key variables are ready for feature preparation.  
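
One way to make the "key variables are ready" check explicit is a small schema assertion. The sketch below is only an illustration: `check_required_columns` is a hypothetical helper (it is not part of `functions.py`), the column list assumes the three features used later in this notebook, and the toy frame uses made-up values.

```python
import pandas as pd

# Columns this notebook relies on for clustering (assumed from Step 3)
REQUIRED_COLUMNS = ["avg_rating", "price", "genre"]


def check_required_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required columns that are missing from the DataFrame."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for books_final_1000.csv (values are made up)
toy = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "avg_rating": [4.1, 3.9],
    "price": [9.99, None],
})

missing = check_required_columns(toy, REQUIRED_COLUMNS)
print(f"Missing columns: {missing}")
```

On the real data, the same call would be `check_required_columns(df, REQUIRED_COLUMNS)`, and an empty list means the feature step can proceed.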


In [None]:
# ============================================================
# Step 2: Load Final Dataset
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load dataset from data/clean ---
data_path = Path("..") / config["paths"]["data_clean"] / "books_final_1000.csv"
df = pd.read_csv(data_path)

print(f"✅ Dataset loaded successfully: {data_path}")
print(f"Shape: {df.shape}\n")

# --- Quick overview ---
display(df.head(5))

# --- Basic info and types ---
print("\n🔍 DataFrame Info:")
df.info()  # info() prints its report directly and returns None

# --- Missing values summary ---
missing_summary = df.isna().sum()
missing_summary = missing_summary[missing_summary > 0]

if not missing_summary.empty:
    print("\n⚠️ Missing values summary:")
    print(missing_summary)
else:
    print("\n✅ No missing values detected.")


### Step 3: Feature Preparation

Select relevant features for clustering:
- `avg_rating` (numerical)
- `price` (numerical)
- `genre` (categorical)

We'll:
1. Fill missing prices with the median value  
2. Encode `genre` using One-Hot Encoding  
3. Standardize numerical features with `StandardScaler`  
4. Combine all into a clean feature matrix for K-Means
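
The four steps above can be sketched on a tiny made-up frame. This is an illustration under assumptions, not the notebook's exact cell: the values are invented, and the example keeps every genre column, whereas the cell below passes `drop_first=True`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "avg_rating": [4.2, 3.8, 4.5, 4.0],
    "price": [10.0, None, 14.0, 6.0],
    "genre": ["Fiction", "Poetry", "Fiction", "History"],
})

# 1. Fill missing prices with the median of the observed prices
toy["price"] = toy["price"].fillna(toy["price"].median())

# 2. One-hot encode genre (one float column per genre)
encoded = pd.get_dummies(toy, columns=["genre"], dtype=float)

# 3. Standardize only the numeric columns
numeric_cols = ["avg_rating", "price"]
encoded[numeric_cols] = StandardScaler().fit_transform(encoded[numeric_cols])

# 4. The result is a fully numeric feature matrix
print(encoded.round(2))
```

After standardization each numeric column has mean 0 and unit variance, so rating and price contribute on a comparable scale to the Euclidean distances K-Means uses.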


In [None]:
# ============================================================
# Step 3: Feature Preparation
# ============================================================

from sklearn.preprocessing import StandardScaler

# --- Select relevant columns ---
features = ["avg_rating", "price", "genre"]
df_features = df[features].copy()

# --- Handle missing values ---
# Fill missing prices with median (robust against outliers)
median_price = df_features["price"].median()
df_features["price"] = df_features["price"].fillna(median_price)

print(f"💰 Filled missing 'price' values with median: {median_price:.2f}")

# --- One-Hot Encode 'genre' ---
# dtype=float keeps the whole feature matrix numeric for scikit-learn
df_encoded = pd.get_dummies(df_features, columns=["genre"], drop_first=True, dtype=float)

# --- Standardize numeric columns ---
scaler = StandardScaler()
numeric_cols = ["avg_rating", "price"]
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

# --- Apply the same fill to the original DataFrame ---
# df is reused for the cluster summaries and the export in later steps
df["price"] = df["price"].fillna(median_price)

print(f"✅ Feature matrix ready for clustering. Shape: {df_encoded.shape}")

# --- Quick preview ---
display(df_encoded.head(5))


### Step 4: K-Means Clustering (Elbow & Silhouette Method)

We'll apply **K-Means clustering** to group similar books based on their features.  
To choose the optimal number of clusters (`k`), we'll use:
- the **Elbow Method** (inertia plot), and  
- the **Silhouette Score** (cluster separation quality).  

This helps identify a balance between compactness and separation of clusters.
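
To see how the two criteria behave when the right answer is known, here is a minimal sketch on synthetic data (three well-separated blobs, not the book features). The variable names and the blob centers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the book dataset)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

inertia_by_k = {}
silhouette_by_k = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_toy)
    inertia_by_k[k] = model.inertia_  # within-cluster sum of squared distances
    silhouette_by_k[k] = silhouette_score(X_toy, model.labels_)

# Pick the k with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

for k in inertia_by_k:
    print(f"k={k}: inertia={inertia_by_k[k]:.1f}, silhouette={silhouette_by_k[k]:.3f}")
print(f"Highest silhouette at k = {best_k}")
```

Inertia keeps falling as `k` grows, which is why the elbow is read as the point of diminishing returns, while the silhouette score peaks at the `k` that best separates the groups. On real data the two signals can disagree, as the interpretation in Step 4.1 discusses.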


In [None]:
# ============================================================
# Step 4: K-Means Clustering (Elbow & Silhouette Method)
# ============================================================

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from pathlib import Path

# --- Feature matrix ---
X = df_encoded.copy()

# --- Initialize lists ---
inertias = []
silhouette_scores = []
K_range = range(2, 11)  # test k between 2 and 10

print("🔹 Running K-Means for k = 2 to 10...\n")

# --- Run K-Means across different k values ---
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

print("✅ K-Means training completed.")

# --- Elbow Method ---
fig_elbow, ax1 = plt.subplots(figsize=(6, 4))
ax1.plot(K_range, inertias, marker='o', color='steelblue')
ax1.set_title("Elbow Method: K-Means Inertia")
ax1.set_xlabel("Number of Clusters (k)")
ax1.set_ylabel("Inertia")
plt.tight_layout()
plt.show()

# --- Silhouette Scores ---
fig_silhouette, ax2 = plt.subplots(figsize=(6, 4))
ax2.plot(K_range, silhouette_scores, marker='o', color='orange')
ax2.set_title("Silhouette Scores by Number of Clusters")
ax2.set_xlabel("Number of Clusters (k)")
ax2.set_ylabel("Silhouette Score")
plt.tight_layout()
plt.show()

# --- Save both plots ---
viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)

fig_elbow.savefig(viz_path / "elbow_method_kmeans.png", dpi=300, bbox_inches="tight")
fig_silhouette.savefig(viz_path / "silhouette_scores.png", dpi=300, bbox_inches="tight")

print(f"💾 Elbow plot saved → {viz_path / 'elbow_method_kmeans.png'}")
print(f"💾 Silhouette plot saved → {viz_path / 'silhouette_scores.png'}")


### Step 4.1: Interpretation of Clustering Metrics

The **Elbow Method** shows a clear bend between *k = 3* and *k = 4*,  
indicating that adding more clusters beyond this point brings little improvement in compactness.  

The **Silhouette Scores** show that:
- *k = 2* yields the highest separation (≈ 0.75),  
  but that configuration is too broad and oversimplifies book diversity.  
- *k = 3–4* offers a better trade-off between cohesion (books inside each cluster are similar)  
  and separation (clusters differ from each other).  

📊 **Decision:** We'll proceed with **k = 3 clusters**,  
as it balances interpretability and internal consistency.

🧠 **What we're analyzing:**  
Each book is represented by numerical and categorical features such as  
`avg_rating`, `genre`, and `price`.  
K-Means groups books with similar characteristics into clusters,  
helping us identify **reader preference patterns** or **content similarities**  
that could later power a **recommendation system**.
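
For reference, the two metrics discussed in this step have standard definitions. The symbols below ($x_i$, $\mu_{c(i)}$, $a(i)$, $b(i)$, $s(i)$) are introduced here for illustration and do not appear elsewhere in the notebook.

Inertia (the quantity plotted for the Elbow Method) is the sum of squared distances from each book $x_i$ to the centroid $\mu_{c(i)}$ of its assigned cluster:

$$
\text{inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2
$$

The silhouette coefficient of a single book $i$ compares cohesion and separation:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Here $a(i)$ is the mean distance from book $i$ to the other books in its own cluster (cohesion), and $b(i)$ is the mean distance from book $i$ to the books in the nearest other cluster (separation). The Silhouette Score reported per `k` is the mean of $s(i)$ over all books, so values near 1 indicate compact, well-separated clusters.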


### Step 5: Apply Final K-Means & Visualize Clusters (PCA 2D Projection)

Now that we've decided on **k = 3 clusters**, we'll train the final K-Means model.  
Then, we'll apply **Principal Component Analysis (PCA)** to reduce the high-dimensional
feature space into **two components**, allowing us to visualize the book clusters in 2D.  

This helps identify group patterns: for example,  
books that share similar ratings, prices, or genres might fall close together.
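
A 2D projection is only as informative as the variance it retains. The sketch below, on synthetic 5-dimensional data rather than the book features, shows how `explained_variance_ratio_` can be read after fitting PCA; the data-generating setup and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 5-dimensional data (not the book features):
# two strong underlying directions plus a little noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

pca_toy = PCA(n_components=2, random_state=42)
coords = pca_toy.fit_transform(X_toy)

# Share of the total variance captured by each of the two components
ratios = pca_toy.explained_variance_ratio_
print(f"Projected shape: {coords.shape}")
print(f"Explained variance: PC1={ratios[0]:.1%}, PC2={ratios[1]:.1%}, total={ratios.sum():.1%}")
```

If the total is low on the real feature matrix, clusters that overlap in the scatter plot may still be well separated in the full feature space, so the plot should be read as a rough view rather than proof of overlap.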


In [None]:
# ============================================================
# Step 5: Apply Final K-Means & Visualize Clusters (Enhanced)
# ============================================================

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

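# --- Final number of clusters ---
k_final = 3
print(f"🏷️ Applying final K-Means model with k = {k_final}...\n")

# --- Scale features ---
# Note: this standardizes every column, including the one-hot genre columns,
# so the final model is fitted on a different scaling than the matrix X used
# to choose k in Step 4 (where only avg_rating and price were standardized).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded)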

# --- Train K-Means ---
kmeans_final = KMeans(n_clusters=k_final, random_state=42, n_init=10)
df["cluster"] = kmeans_final.fit_predict(X_scaled)

# --- PCA for visualization ---
pca = PCA(n_components=2, random_state=42)
pca_components = pca.fit_transform(X_scaled)
df["pca_1"] = pca_components[:, 0]
df["pca_2"] = pca_components[:, 1]

# --- Enhanced Visualization ---
fig_pca, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(
    data=df,
    x="pca_1", y="pca_2",
    hue="cluster",
    palette="Set2",
    s=60,
    alpha=0.85,
    edgecolor="white",
    linewidth=0.6,
    ax=ax
)
ax.set_title(f"Book Clusters: PCA 2D Projection (k = {k_final})", fontsize=13, pad=10)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.legend(title="Cluster", loc="upper right", fontsize=9)
ax.grid(alpha=0.3, linestyle="--")
plt.tight_layout()
plt.show()

# --- Save plot for presentation ---
viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)
fig_pca.savefig(viz_path / "pca_clusters_k3.png", dpi=300, bbox_inches="tight")

print(f"✅ Clustering completed. {df['cluster'].nunique()} clusters created.")
print(f"💾 PCA cluster visualization saved → {viz_path / 'pca_clusters_k3.png'}\n")

# --- Preview sample ---
display(df[["title", "author", "avg_rating", "genre", "price", "cluster"]].head(10))


### Step 5.1: Analyze Cluster Centroids

Let's inspect the numerical centroids of each cluster (the per-cluster means in original units) to understand
how books are grouped, for example by their average rating or price.
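
K-Means stores its centroids in the standardized feature space, so they are not directly readable as ratings or prices. A minimal sketch, on synthetic (rating, price) pairs with made-up values, of how `inverse_transform` maps them back to original units and why that matches a plain per-cluster mean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (rating, price) pairs in three well-separated groups (made-up values)
rng = np.random.default_rng(0)
group_means = np.array([[3.5, 5.0], [4.0, 15.0], [4.5, 25.0]])
X_raw = np.vstack([m + rng.normal(scale=[0.05, 0.5], size=(50, 2)) for m in group_means])

scaler_toy = StandardScaler()
X_std = scaler_toy.fit_transform(X_raw)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_std)

# Centroids live in standardized space; map them back to original units
centroids_original = scaler_toy.inverse_transform(km.cluster_centers_)

# Per-cluster means computed directly on the raw data, for comparison
means_by_label = np.array([X_raw[km.labels_ == c].mean(axis=0) for c in range(3)])

print(np.round(centroids_original, 2))
print(np.round(means_by_label, 2))
```

Because standardization is an affine transform, the back-transformed centroids coincide with the group means, which is why a `groupby("cluster")` mean is a convenient way to read centroids for the numeric features.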


In [None]:
# ============================================================
# Step 5.1: Analyze Cluster Centroids
# ============================================================

cluster_centroids = (
    df.groupby("cluster")
    .agg({
        "avg_rating": "mean",
        "price": "mean",
        "genre": lambda x: x.mode().iloc[0] if not x.mode().empty else "Unknown"
    })
    .round(2)
)

print("Cluster Centroids Summary:\n")
display(cluster_centroids)

print("\nInterpretation Guide:")
print("- avg_rating → Average book rating per cluster.")
print("- price → Mean price, useful to detect premium vs. budget titles.")
print("- genre → Most common genre in each cluster.")


### Step 5.2: Cluster Interpretation & Summary

🧭 Interpretation

- **Cluster 0: “Youth-Oriented Titles”**  
  Books in this group are mainly **Juvenile Fiction**, slightly higher in rating (≈ 4.18), and moderately priced (≈ 8.66 EUR on average, the lowest of the three clusters).  
  They likely attract younger readers and reflect more accessible, engaging narratives.

- **Cluster 1: “General Fiction Classics”**  
  Dominated by **Fiction**, this cluster includes a broad mix of popular titles and literary works.  
  Prices are a bit higher on average (≈ 9.25 EUR), suggesting the presence of well-known or premium editions.

- **Cluster 2: “Specialized Literature”**  
  This cluster groups more niche genres such as *Governesses in Literature*.  
  These books maintain solid ratings (≈ 4.16) but cater to smaller, topic-focused audiences.

---

💬 Overall Insight

Across all clusters, average ratings remain consistently high (≈ 4.1–4.2),  
indicating that Goodreads' most popular books share strong reader approval.  
Price variation is modest, meaning **genre and thematic focus** play a greater role in the clustering than cost.  
This insight will be valuable for the **next phase: building the recommendation engine**,  
where similar clusters can guide personalized book suggestions.


### Step 6: Export Final Clustered Dataset

We'll now export the final dataset including the cluster labels (`cluster`)
and PCA coordinates (`pca_1`, `pca_2`) for visualization and further analysis.  
This dataset can be used in **Tableau**, **Power BI**, or in the next notebook
for building the **Recommendation System**.
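
The export cell below writes the CSV with `encoding="utf-8-sig"`. As a small illustration of what that choice does, the sketch here round-trips a made-up frame through a temporary file; the file name and values are assumptions, not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a non-ASCII title (made-up values)
toy = pd.DataFrame({
    "title": ["Les Misérables", "Jane Eyre"],
    "cluster": [1, 2],
    "pca_1": [0.25, -1.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "books_clustered_toy.csv"
    # "utf-8-sig" prepends a byte-order mark, which helps Excel detect UTF-8
    toy.to_csv(out_path, index=False, encoding="utf-8-sig")
    has_bom = out_path.read_bytes().startswith(b"\xef\xbb\xbf")
    # Reading with the same encoding strips the mark again
    restored = pd.read_csv(out_path, encoding="utf-8-sig")

print(f"BOM written: {has_bom}")
print(restored)
```

When the next notebook reloads the exported file, passing `encoding="utf-8-sig"` to `pd.read_csv` keeps the first column name free of the byte-order mark.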


In [None]:
# ============================================================
# Step 6: Export Final Clustered Dataset & Cluster Summary
# ============================================================

from pathlib import Path
import pandas as pd

# --- Define export paths ---
data_clean_path = Path("..") / config["paths"]["data_clean"]
viz_path = Path("..") / "visualizations"
data_clean_path.mkdir(parents=True, exist_ok=True)
viz_path.mkdir(parents=True, exist_ok=True)

final_cluster_path = data_clean_path / "books_clustered_final.csv"

# --- Select relevant columns ---
export_cols = [
    "title",
    "author",
    "avg_rating",
    "genre",
    "price",
    "currency",
    "cover_url",
    "link",
    "cluster",
    "pca_1",
    "pca_2"
]

# --- Save dataset ---
df[export_cols].to_csv(final_cluster_path, index=False, encoding="utf-8-sig")
print(f"💾 Final clustered dataset saved successfully → {final_cluster_path.resolve()}")

# --- Quick preview ---
print("\n📘 Sample of exported dataset:")
display(df[export_cols].head(10))

# ============================================================
# 📊 Cluster Summary (for visualization or presentation)
# ============================================================

cluster_summary = (
    df.groupby("cluster")
    .agg({
        "avg_rating": "mean",
        "price": "mean",
        "genre": lambda x: x.mode().iloc[0] if not x.mode().empty else "Unknown"
    })
    .round(2)
    .reset_index()
)

print("\n📗 Cluster Summary:")
display(cluster_summary)




In [None]:
# ============================================================
# 📊 Cluster Summary Table: Final Styled Version
# ============================================================
import pandas as pd
from IPython.display import display

# --- Cluster centroids summary ---
cluster_summary = pd.DataFrame({
    "Cluster": [0, 1, 2],
    "Dominant Genre": ["Juvenile Fiction", "Fiction", "Governesses in Literature"],
    "Avg Rating": [4.18, 4.10, 4.16],
    "Avg Price (EUR)": [8.66, 9.25, 8.99],
})

# --- Styling ---
styled_summary = (
    cluster_summary.style
    .set_caption("📚 Cluster Summary: Average Features per Group")
    .set_table_styles([
    {"selector": "caption", 
     "props": [("text-align", "left"), ("font-size", "16px"), ("font-weight", "bold"), ("color", "#00c3ff")]},
    {"selector": "table", 
     "props": [("border", "2px solid #00c3ff"), ("border-radius", "8px"), ("border-collapse", "collapse")]},
    {"selector": "th", 
     "props": [("background-color", "#1c1c1c"), ("color", "white"), ("text-align", "center"), ("font-size", "14px")]},
    {"selector": "td", 
     "props": [("background-color", "#505050"), ("color", "#f2f2f2"), ("font-size", "13px"), ("text-align", "center")]}
])
    .hide(axis="index")
    .format({
        "Avg Rating": "{:.2f}",
        "Avg Price (EUR)": "{:.2f}"
    })
)

# --- Display ---
display(styled_summary)
