## Step 1 - Load Configuration & Base Dataset

In this step, we load the main configuration file (`config.yaml`) to access all project paths, and then import the base dataset `books_clustered_final.csv` from the `data/clean` directory.

This dataset contains the books used for clustering in the previous step. It will serve as the foundation for enriching missing information such as ratings and genres using external data sources.


In [None]:
# ============================================================
# Step 1 - Load Configuration & Base Dataset
# ============================================================
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
import pandas as pd
from pathlib import Path
from functions import load_config, ensure_directories

# --- Load configuration from project root ---
config_path = Path("..") / "config.yaml"
config = load_config(config_path)

# --- Ensure all folders exist ---
ensure_directories(config["paths"])

# --- Load base dataset ---
data_clean_path = Path("..") / config["paths"]["data_clean"]
input_file = data_clean_path / "books_clustered_final.csv"

df_main = pd.read_csv(input_file)

print(f"Dataset loaded successfully: {df_main.shape}")
df_main.head(3)


## Step 2 - Load Goodreads Dataset (Kaggle goodbooks-10k)

In this step, we load the `books.csv` file from the Kaggle dataset "goodbooks-10k".  
This dataset contains around 10,000 books with standardized metadata such as title, author, average rating, number of ratings, and publication year.  
It is a lighter and cleaner dataset than the previous BrightData version and aligns well with our book titles.



In [None]:
# ============================================================
# Step 2 - Load Goodreads Dataset (Kaggle goodbooks-10k)
# ============================================================

import pandas as pd
from pathlib import Path

# Define path to the raw data folder (hard-coded relative to the notebook)
data_raw_path = Path("..") / "data" / "raw"
goodreads_file = data_raw_path / "books.csv"  # books.csv as downloaded from the Kaggle dataset

# Load dataset
df_goodreads = pd.read_csv(goodreads_file)
print(f"Kaggle Goodreads dataset loaded: {df_goodreads.shape}")

# Display available columns
print("Columns:", df_goodreads.columns.tolist())

# Preview
df_goodreads.head(3)


In [None]:
print(df_goodreads.columns.tolist())


## Step 3 - Preprocess Titles & Authors for Merging

Before merging both datasets, we standardize and align the column names used as matching keys.

In the Kaggle dataset, the relevant columns are:
- `title` → book title  
- `authors` → author name(s)  
- `average_rating` → Goodreads average rating  
- `ratings_count` → total number of ratings  
- `original_publication_year` → publication year  
- `image_url` → cover image

We will:
1. Keep only these relevant columns.  
2. Rename them for consistency.  
3. Normalize `title` and `author` text (lowercase, surrounding whitespace stripped) for reliable matching.
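
As a small illustration of item 3, the sketch below applies a lowercase-and-strip normalization to a toy Series. The helper name `normalize_key` and the sample titles are assumptions made for this example; they are not part of the project code.

```python
import pandas as pd


def normalize_key(series: pd.Series) -> pd.Series:
    """Lowercase and trim a text column so it can serve as a join key."""
    return series.str.lower().str.strip()


# Toy titles that differ only in case and surrounding whitespace
titles = pd.Series(["  The Hobbit", "the hobbit ", "THE HOBBIT"])
normalized = normalize_key(titles)

# All three variants collapse to the same key
print(normalized.tolist())
```

This kind of normalization does not handle punctuation, subtitles, or several authors listed in one field, so some legitimate matches may still be missed.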


In [None]:
# ============================================================
# Step 3 - Preprocess Titles & Authors for Merging
# ============================================================

# Select and rename relevant columns
cols_to_keep = [
    "title",
    "authors",
    "average_rating",
    "ratings_count",
    "original_publication_year",
    "image_url"
]

df_goodreads = df_goodreads[cols_to_keep].rename(columns={
    "authors": "author",
    "average_rating": "avg_rating_goodreads",
    "ratings_count": "ratings_count_goodreads",
    "original_publication_year": "published_year_goodreads",
    "image_url": "cover_url_goodreads"
})

# Normalize titles and authors in both datasets
df_main["title_clean"] = df_main["title"].str.lower().str.strip()
df_main["author_clean"] = df_main["author"].str.lower().str.strip()

df_goodreads["title_clean"] = df_goodreads["title"].str.lower().str.strip()
df_goodreads["author_clean"] = df_goodreads["author"].str.lower().str.strip()

print("Columns prepared for merging:")
print(df_goodreads.head(3))


## Step 4 - Merge Datasets (Left Join by Title & Author)

In this step, we merge our main dataset (`books_clustered_final.csv`) with the Kaggle Goodreads dataset (`books.csv`)
using the normalized columns `title_clean` and `author_clean` as join keys.

This allows us to enrich our dataset with:
- More accurate average ratings from Goodreads
- Total number of ratings (`ratings_count_goodreads`)
- Publication year
- Cover image URL

We use a **left join** to keep all entries from our main dataset.
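
One way to check that a left join really keeps every row, and to measure how many rows found a match, is sketched below on toy data. The frames `main` and `goodreads` and their values are hypothetical stand-ins; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for df_main and df_goodreads (hypothetical rows and values)
main = pd.DataFrame({
    "title_clean": ["dune", "emma", "ulysses"],
    "author_clean": ["frank herbert", "jane austen", "james joyce"],
})
goodreads = pd.DataFrame({
    "title_clean": ["dune", "emma"],
    "author_clean": ["frank herbert", "jane austen"],
    "avg_rating_goodreads": [4.25, 4.0],
})

# validate="many_to_one" raises MergeError if the right-hand keys are duplicated;
# indicator=True adds a "_merge" column marking matched rows as "both"
merged = pd.merge(
    main,
    goodreads,
    on=["title_clean", "author_clean"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

match_rate = (merged["_merge"] == "both").mean()
print(f"Rows kept: {len(merged)} | matched: {match_rate:.0%}")
```

If the validation fails on the real data, the Goodreads table contains several rows for the same title and author, and a plain left join would duplicate rows of the main dataset.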


In [None]:
# ============================================================
# Step 4 - Merge Datasets (Left Join by Title & Author)
# ============================================================

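# Columns taken from the Goodreads side (join keys + enrichment fields)
goodreads_cols = [
    "title_clean",
    "author_clean",
    "avg_rating_goodreads",
    "ratings_count_goodreads",
    "published_year_goodreads",
    "cover_url_goodreads"
]

# Keep one Goodreads row per (title, author) key so the left join
# cannot duplicate rows of df_main
df_goodreads_unique = df_goodreads[goodreads_cols].drop_duplicates(
    subset=["title_clean", "author_clean"]
)

# Perform left join
df_merged = pd.merge(
    df_main,
    df_goodreads_unique,
    on=["title_clean", "author_clean"],
    how="left"
)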

print(f"Merge completed: {df_merged.shape}")

# Display sample of enriched data
df_merged[
    ["title", "author", "avg_rating", "avg_rating_goodreads", "ratings_count_goodreads"]
].head(10)


## Step 5 - Replace Imputed Ratings and Save Enriched Dataset

In this step, we replace the imputed values from our main dataset
with the real Goodreads data obtained from the merge.

Specifically:
- Replace `avg_rating` values equal to 4.11 with the Goodreads rating when available.
- Add the Goodreads `ratings_count` as a new feature.
- Save the enriched dataset as `books_final_enriched.csv` in the `data/clean` folder.


In [None]:
# ============================================================
# Step 5 - Safely Replace Imputed Ratings and Save Enriched Dataset
# ============================================================

from functions import save_dataset
from pathlib import Path

# --- Create a copy to be safe ---
df_enriched = df_merged.copy()

# Replace only imputed ratings (4.11) with Goodreads ratings when available
mask_replace = (
    df_enriched["avg_rating"].round(2) == 4.11
) & (df_enriched["avg_rating_goodreads"].notna())

df_enriched.loc[mask_replace, "avg_rating"] = df_enriched.loc[
    mask_replace, "avg_rating_goodreads"
]

# Keep Goodreads ratings_count as a new column (optional feature)
df_enriched["ratings_count"] = df_enriched["ratings_count_goodreads"]

# Remove helper columns but keep your core structure intact
df_enriched = df_enriched.drop(columns=["avg_rating_goodreads", "ratings_count_goodreads"])

# Save the enriched dataset
output_path = Path("..") / "data" / "clean" / "books_final_enriched.csv"
save_dataset(df_enriched, output_path)

# --- Summary ---
print("✅ Enriched dataset saved safely → books_final_enriched.csv")
print(f"Ratings replaced (4.11 → Goodreads): {mask_replace.sum()}")
print(df_enriched[["title", "author", "avg_rating", "ratings_count"]].head(10))


## Step 6 - Summary & Quality Check

In this final step, we evaluate how much the dataset improved after enrichment.

We will:
- Count how many books had their imputed `avg_rating` (4.11) replaced with real Goodreads values.
- Compare the average rating before and after enrichment.
- Show basic statistics for the new `ratings_count` feature.
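
The checks listed above could be computed along the following lines. This is a minimal sketch on toy data: `enrichment_report` is a hypothetical helper, and the two small frames stand in for the dataset before and after enrichment (in the notebook, `df_merged` and `df_enriched`), which are assumed to share the same row order.

```python
import pandas as pd

IMPUTED_RATING = 4.11  # placeholder value used for imputed ratings in this notebook


def enrichment_report(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarize how enrichment changed avg_rating and describe ratings_count."""
    was_imputed = before["avg_rating"].round(2) == IMPUTED_RATING
    # A replacement that happens to equal 4.11 exactly is not counted here
    changed = was_imputed & (after["avg_rating"] != before["avg_rating"])
    return {
        "imputed_before": int(was_imputed.sum()),
        "replaced": int(changed.sum()),
        "mean_before": round(before["avg_rating"].mean(), 3),
        "mean_after": round(after["avg_rating"].mean(), 3),
        "ratings_count_stats": after["ratings_count"].describe(),
    }


# Toy frames standing in for the data before and after enrichment
before = pd.DataFrame({"avg_rating": [4.11, 4.11, 3.80]})
after = pd.DataFrame({
    "avg_rating": [4.30, 4.11, 3.80],
    "ratings_count": [1200.0, None, None],
})

report = enrichment_report(before, after)
print(report["replaced"], "of", report["imputed_before"], "imputed ratings replaced")
print(report["ratings_count_stats"])
```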


## Step 7 - Clean Final Dataset for Re-Training (Overwrite Existing File)

Before re-running the Machine Learning pipeline (PCA, Elbow, K-Means),
we clean the enriched dataset to remove columns that are no longer needed.

This step:
- Removes outdated columns from the previous clustering (`cluster`, `pca_1`, `pca_2`).
- Drops helper columns created during the enrichment (`title_clean`, `author_clean`, `cover_url_goodreads`).
- Keeps relevant features for the next model training:
  - **avg_rating** (quality)
  - **ratings_count** (popularity)
  - **price**, **genre**, **published_year**, etc.
- Overwrites the file `books_final_enriched.csv` in the `data/clean` folder.


In [None]:
# ============================================================
# Step 7 - Clean Final Dataset for Re-Training (Overwrite File)
# ============================================================

import pandas as pd
from pathlib import Path

# Load enriched dataset
path_enriched = Path("..") / "data" / "clean" / "books_final_enriched.csv"
df = pd.read_csv(path_enriched)

# Drop unnecessary columns
cols_to_drop = [
    "cluster",
    "pca_1",
    "pca_2",
    "title_clean",
    "author_clean",
    "cover_url_goodreads"
]
df = df.drop(columns=[col for col in cols_to_drop if col in df.columns])

# Ensure ratings_count is numeric
df["ratings_count"] = pd.to_numeric(df["ratings_count"], errors="coerce")

# Overwrite the same file
df.to_csv(path_enriched, index=False, encoding="utf-8-sig")

print("✅ Cleaned and overwritten successfully → books_final_enriched.csv")
print(f"Final shape: {df.shape}")
print("Columns ready for re-training:")
print(df.columns.tolist())


## Step 8 - Data Health Check (Missing Values & Completeness)

Before deciding which features to include in the clustering model,
we examine the completeness of the key numeric and categorical columns.

This helps ensure that we don't include variables with too many missing values,
which could distort scaling, PCA, or clustering results.

We will check:
- `avg_rating`
- `price`
- `ratings_count`
- `genre`



In [None]:
# ============================================================
# Step 8 - Data Health Check (Missing Values & Completeness)
# ============================================================

import pandas as pd
from pathlib import Path

# Load the cleaned enriched dataset
path_data = Path("..") / "data" / "clean" / "books_final_enriched.csv"
df = pd.read_csv(path_data)

# Select relevant columns to inspect
cols_to_check = [
    "avg_rating",
    "price",
    "ratings_count",
    "genre"    
]

# Calculate missing counts and percentages
missing_counts = df[cols_to_check].isna().sum()
missing_pct = (missing_counts / len(df)) * 100

# Combine into summary DataFrame
missing_summary = pd.DataFrame({
    "Missing Values": missing_counts,
    "Missing %": missing_pct.round(2)
}).sort_values("Missing %", ascending=False)

print("Missing Value Summary:")
display(missing_summary)


## Step 9 - Feature Preparation (Final Set for Clustering)

Based on the data health check, we use only the numeric columns `avg_rating` and `price`; any missing `price` values are filled with the median before scaling.

Selected features for clustering:
- **avg_rating** → reader-perceived quality
- **price** → economic value

The **genre** column is not part of the feature matrix; it is kept to describe the clusters afterwards.

The column **published_year_goodreads** will be kept in the dataset for visualization in the Streamlit app, but it won't be used in the clustering model because it has too many missing values (~49%).


In [None]:
# ============================================================
# Step 9 - Feature Preparation (Price & Rating Only)
# ============================================================

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# --- Select relevant numeric features ---
features = ["avg_rating", "price"]
df_features = df[features].copy()

# --- Handle missing prices ---
median_price = df_features["price"].median()
df_features["price"] = df_features["price"].fillna(median_price)
df["price"] = df["price"].fillna(median_price)
print(f"Filled missing 'price' values with median: {median_price:.2f}")

# --- Standardize numeric features ---
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_features),
    columns=features
)
print("📏 Numeric columns standardized (avg_rating, price).")

# --- Check for missing values ---
missing_check = df_scaled.isna().sum().sum()
if missing_check == 0:
    print("✅ No missing values remain in scaled feature matrix.")
else:
    print(f"⚠️ {missing_check} missing values still present - check source data.")

# --- Create feature matrix for clustering ---
X = df_scaled.values
print(f"\n✅ Feature matrix ready for clustering. Shape: {df_scaled.shape}")

# --- Quick preview ---
display(df_scaled.head(5))


## Step 10 - K-Means Clustering (Elbow & Silhouette Analysis)

We now test multiple K-Means clustering configurations (k = 2 to 10)
to determine the optimal number of clusters.

Steps:
1. Run K-Means for different values of *k*.
2. Compute the **inertia** (Elbow Method) and **silhouette score** for each model.
3. Plot both metrics side by side.
4. Identify the best *k* value according to the silhouette score.
5. Save plots and metrics for later use in the Streamlit dashboard.
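
For reference, the two metrics in item 2 are commonly defined as follows. Here $C_j$ is cluster $j$ with centroid $\mu_j$, $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster. These symbols are introduced only for this illustration.

$$
\text{inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The value returned by `silhouette_score` is the mean of $s(i)$ over all points, so it lies in $[-1, 1]$; higher values indicate better-separated clusters. Inertia always decreases as *k* grows, which is why the elbow plot is read for the point where the decrease flattens rather than for a minimum.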

In [None]:
# ============================================================
# Step 10 - K-Means Clustering (Elbow & Silhouette Method) ✅
# ============================================================

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import numpy as np

# --- Feature matrix (scaled numeric data) ---
X = df_scaled.copy()

# --- Initialize lists ---
inertias = []
silhouette_scores = []
K_range = range(2, 11)

print("Running K-Means for k = 2 to 10...\n")

# --- Run K-Means across different k values ---
for k in K_range:
    try:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        score = silhouette_score(X, kmeans.labels_)
        silhouette_scores.append(score)
        print(f"k={k} - Inertia={kmeans.inertia_:.2f}, Silhouette={score:.4f}")
    except Exception as e:
        print(f"Error for k={k}: {e}")
        inertias.append(np.nan)
        silhouette_scores.append(np.nan)

# --- Determine best k by silhouette score ---
valid_scores = [s for s in silhouette_scores if not np.isnan(s)]
best_k = list(K_range)[silhouette_scores.index(max(valid_scores))]
print(f"\n✅ Best k by silhouette score: {best_k}\n")

# ============================================================
# 📈 Visualization - Elbow & Silhouette
# ============================================================

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# --- Elbow Method ---
ax1.plot(K_range, inertias, marker='o', color='steelblue')
ax1.set_title("Elbow Method - K-Means Inertia", fontsize=11)
ax1.set_xlabel("Number of Clusters (k)")
ax1.set_ylabel("Inertia")
ax1.grid(True, linestyle="--", alpha=0.5)

# --- Silhouette Score ---
ax2.plot(K_range, silhouette_scores, marker='o', color='orange')
ax2.set_title("Silhouette Scores by Number of Clusters", fontsize=11)
ax2.set_xlabel("Number of Clusters (k)")
ax2.set_ylabel("Silhouette Score")
ax2.set_ylim(-1, 1)  # ✅ Correct axis range: Silhouette ∈ [-1, 1]
ax2.grid(True, linestyle="--", alpha=0.5)

plt.tight_layout()
plt.show()

# ============================================================
# 💾 Save Results
# ============================================================

viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)
fig.savefig(viz_path / "kmeans_elbow_silhouette_combined.png", dpi=300, bbox_inches="tight")
print(f"Plot saved → {viz_path / 'kmeans_elbow_silhouette_combined.png'}")

metrics_path = Path("..") / "data" / "clean"
metrics_df = pd.DataFrame({
    "k": list(K_range),
    "inertia": inertias,
    "silhouette": silhouette_scores
})
metrics_df.to_csv(metrics_path / "kmeans_metrics.csv", index=False, encoding="utf-8-sig")
print(f"Metrics saved → {metrics_path / 'kmeans_metrics.csv'}")


## Step 10.1 - Re-run K-Means with k=3 (Manual Selection for Interpretability)

Although k=2 gave the highest Silhouette Score (≈0.87),
the clusters were highly imbalanced: one cluster contained almost all books,
which reduced interpretability.

To gain richer insights, we manually re-run K-Means with k=3, 
balancing statistical validity and business relevance.
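
The imbalance mentioned above can be quantified as the share of books assigned to each cluster. The sketch below uses a hypothetical helper, `cluster_shares`, and made-up label vectors; the numbers are not results from this notebook.

```python
from collections import Counter


def cluster_shares(labels):
    """Return the fraction of points assigned to each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical label vectors: a lopsided 2-cluster split and a more even 3-cluster split
labels_k2 = [0] * 95 + [1] * 5
labels_k3 = [0] * 50 + [1] * 30 + [2] * 20

print(cluster_shares(labels_k2))
print(cluster_shares(labels_k3))
```

With a fitted model, the labels could be passed in as `kmeans.labels_.tolist()`. A largest share close to 1 indicates that one cluster absorbs nearly all books.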


## Step 11 - Train Final K-Means Model & Visualize Clusters (PCA 2D)

Using the manually selected *k* = 3 from Step 10.1,  
we train the final K-Means model and project the results in 2D using PCA for visualization.

Steps:
1. Reuse the scaled feature matrix (`df_scaled`) from Step 9.  
2. Train K-Means with *k* = 3.  
3. Apply PCA to obtain two principal components.  
4. Visualize the clusters in 2D.  
5. Save the updated dataset and visualization.

In [None]:
# ============================================================
# Step 11 - Train K-Means (k=3) & Visualize Clusters (PCA 2D)
# ============================================================

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import numpy as np
import pandas as pd

# --- Set final k manually ---
k_final = 3
print(f"Applying final K-Means model with k = {k_final}...\n")

# --- Use scaled numeric features (avg_rating, price) ---
X_scaled = df_scaled.copy()

# --- Train K-Means ---
kmeans_final = KMeans(n_clusters=k_final, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_scaled)

# --- Assign clusters ---
df = df.copy()
df["cluster"] = cluster_labels

# --- PCA for visualization ---
pca = PCA(n_components=2, random_state=42)
pca_components = pca.fit_transform(X_scaled)
df["pca_1"] = pca_components[:, 0]
df["pca_2"] = pca_components[:, 1]

# ============================================================
# 📊 PCA 2D Visualization
# ============================================================

fig_pca, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="pca_1", y="pca_2",
    hue="cluster",
    palette="Set2",
    s=65,
    alpha=0.85,
    edgecolor="white",
    linewidth=0.7,
    ax=ax
)
ax.set_title(f"Book Clusters - PCA 2D Projection (k = {k_final})", fontsize=13, pad=10)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.legend(title="Cluster", loc="best", fontsize=9)
ax.grid(alpha=0.25, linestyle="--")
plt.tight_layout()
plt.show()

# ============================================================
# 📋 Cluster Summary - Detailed Genre Composition
# ============================================================

cluster_summary = (
    df.groupby(["cluster", "genre"])
    .agg(
        Count=("title", "count"),
        Avg_Rating=("avg_rating", "mean"),
        Avg_Price_EUR=("price", "mean")
    )
    .reset_index()
)

# Proportion within each cluster
cluster_summary["cluster"] = cluster_summary["cluster"].astype(int)
cluster_totals = cluster_summary.groupby("cluster")["Count"].transform("sum")
cluster_summary["Proportion (%)"] = (cluster_summary["Count"] / cluster_totals * 100).round(2)

# Order and round values
cluster_summary = cluster_summary.sort_values(["cluster", "Count"], ascending=[True, False])
cluster_summary["Avg_Rating"] = cluster_summary["Avg_Rating"].round(2)
cluster_summary["Avg_Price_EUR"] = cluster_summary["Avg_Price_EUR"].round(2)

print(f"📚 Cluster Summary - Detailed Genre Composition (k = {k_final})")
display(cluster_summary)

# ============================================================
# 💾 Save Outputs (clean version - overwrite previous files)
# ============================================================

viz_path = Path("..") / "visualizations"
viz_path.mkdir(parents=True, exist_ok=True)
fig_pca.savefig(viz_path / "pca_clusters_final.png", dpi=300, bbox_inches="tight")

# Save dataset & cluster summary (overwrite existing)
output_path = Path("..") / "data" / "clean"

df.to_csv(output_path / "books_clustered_final_enriched.csv", index=False, encoding="utf-8-sig")
cluster_summary.to_csv(output_path / "cluster_summary.csv", index=False, encoding="utf-8-sig")

print(f"✅ Clustering completed - {df['cluster'].nunique()} clusters created.")
print(f"💾 PCA plot saved → {viz_path / 'pca_clusters_final.png'}")
print(f"💾 Dataset saved → {output_path / 'books_clustered_final_enriched.csv'}")
print(f"💾 Summary saved → {output_path / 'cluster_summary.csv'}")



## Step 12 - Cluster Profiling and Centroid Analysis

To interpret the K-Means results, we summarize the characteristics of each cluster.
The profile is built on the two numeric features used to train the clustering model: average rating and price.
Specifically, we:
- Calculate the **mean rating** and **mean price** for each cluster.
- Add the **number of books** per genre in each cluster, together with its share of the cluster.

These summaries help us understand the general profile of each group,  
for example whether a cluster represents "affordable popular titles" or "high-priced premium books."

Genre information is used here descriptively only; it was not part of the model's features.


In [None]:
# ============================================================
# Step 12 - Cluster Profiling and Genre Composition (Descriptive Only)
# ============================================================

"""
🎯 Step 12 - Cluster Profiling and Genre Composition (Descriptive Only)
Although the clustering was trained only on numeric features (avg_rating, price),
we include genre information here to interpret and describe each group.

The final model uses k = 3 clusters for better interpretability.
"""

# --- Genre distribution within clusters ---
genre_summary = (
    df.groupby(["cluster", "genre"])
    .size()
    .reset_index(name="count")
)

# --- Add proportion (%) within each cluster ---
cluster_sizes = df["cluster"].value_counts().to_dict()
genre_summary["proportion_%"] = genre_summary.apply(
    lambda row: round((row["count"] / cluster_sizes[row["cluster"]]) * 100, 2),
    axis=1
)

# --- Add numeric averages per cluster ---
cluster_means = (
    df.groupby("cluster")[["avg_rating", "price"]]
    .mean()
    .round(2)
    .reset_index()
)

# --- Merge to get complete profile ---
cluster_profile = genre_summary.merge(cluster_means, on="cluster", how="left")
cluster_profile = cluster_profile.sort_values(["cluster", "count"], ascending=[True, False])

# --- Display top genres per cluster ---
print("📊 Cluster Composition Summary (Genres used for interpretation only):\n")
display(cluster_profile.groupby("cluster").head(6))

print("\n🧭 Interpretation Guide:")
print("- avg_rating → Average rating in the cluster")
print("- price → Average price in the cluster")
print("- genre → Genre distribution (not used in clustering)")
print("- count → Number of books per genre")
print("- proportion_% → Share of that genre within its cluster")


## Step 13 - Export Final Clustered Dataset (Enriched Version)

We now export the final clustered dataset and summary table generated from the enriched data.
This version is saved under a different name to allow comparison with the results from Notebook 03.

Outputs:
- `books_clustered_final_enriched.csv` → selected columns of the dataset with PCA coordinates and clusters (overwrites the full version written in Step 11)  
- `cluster_summary_detailed.csv` → genre composition with mean rating and price per cluster


In [None]:
# ============================================================
# Step 13 - Export Final Clustered Dataset + Detailed Genre Summary (k = 3)
# ============================================================

from pathlib import Path
import pandas as pd

# --- Define export paths ---
data_clean_path = Path("..") / "data" / "clean"
viz_path = Path("..") / "visualizations"
data_clean_path.mkdir(parents=True, exist_ok=True)
viz_path.mkdir(parents=True, exist_ok=True)

# --- Export enriched clustered dataset ---
final_cluster_path = data_clean_path / "books_clustered_final_enriched.csv"

export_cols = [
    "title", "author", "avg_rating", "genre", "price", "currency",
    "cover_url", "link", "cluster", "pca_1", "pca_2"
]

df[export_cols].to_csv(final_cluster_path, index=False, encoding="utf-8-sig")
print(f"💾 Enriched clustered dataset saved successfully → {final_cluster_path.resolve()}")

# ============================================================
# 📊 Detailed Cluster Composition Summary (with genre proportions) - k = 3
# ============================================================

# --- Count genres per cluster ---
genre_summary = (
    df.groupby(["cluster", "genre"])
    .size()
    .reset_index(name="count")
)

# --- Add proportion (%) within each cluster ---
cluster_sizes = df["cluster"].value_counts().to_dict()
genre_summary["proportion_%"] = genre_summary.apply(
    lambda row: round((row["count"] / cluster_sizes[row["cluster"]]) * 100, 2),
    axis=1
)

# --- Add numeric averages per cluster ---
cluster_means = (
    df.groupby("cluster")[["avg_rating", "price"]]
    .mean()
    .round(2)
    .reset_index()
)

# --- Merge into one summary table ---
cluster_profile = genre_summary.merge(cluster_means, on="cluster", how="left")
cluster_profile = cluster_profile.sort_values(["cluster", "count"], ascending=[True, False])

print("\n📗 Cluster Composition Summary (Top Genres per Cluster) - k = 3")
display(cluster_profile.groupby("cluster").head(8))

# --- Save detailed summary ---
summary_path = data_clean_path / "cluster_summary_detailed.csv"
cluster_profile.to_csv(summary_path, index=False, encoding="utf-8-sig")

print(f"\n💾 Detailed cluster summary saved → {summary_path.resolve()}")


In [None]:
# ============================================================
# 📊 Cluster Summary Table - Detailed Genre Breakdown (k = 3)
# ============================================================

import pandas as pd
from IPython.display import display

# --- Use the actual table generated in the notebook (cluster_profile) ---
# If it is already in memory, you can use `cluster_profile` directly
# If not, load it from CSV:
# cluster_profile = pd.read_csv("../data/clean/cluster_summary_detailed.csv")

# --- Top 6 genres per cluster ---
cluster_top_genres = (
    cluster_profile.groupby("cluster")
    .head(6)
    .reset_index(drop=True)
    .rename(columns={
        "cluster": "Cluster",
        "genre": "Genre",
        "count": "Count",
        "proportion_%": "Proportion (%)",
        "avg_rating": "Avg Rating",
        "price": "Avg Price (EUR)"
    })
)

# --- Same visual style as the presentation table ---
styled_genres = (
    cluster_top_genres.style
    .set_caption("📚 Cluster Summary - Detailed Genre Composition (k = 3)")
    .set_table_styles([
        {"selector": "caption", 
         "props": [("text-align", "left"), ("font-size", "16px"), 
                   ("font-weight", "bold"), ("color", "#00c3ff")]},
        {"selector": "table", 
         "props": [("border", "2px solid #00c3ff"), 
                   ("border-radius", "8px"),
                   ("border-collapse", "collapse")]},
        {"selector": "th", 
         "props": [("background-color", "#1c1c1c"), ("color", "white"), 
                   ("text-align", "center"), ("font-size", "14px")]},
        {"selector": "td", 
         "props": [("background-color", "#505050"), ("color", "#f2f2f2"), 
                   ("font-size", "13px"), ("text-align", "center")]}
    ])
    .hide(axis="index")
    .format({
        "Avg Rating": "{:.2f}",
        "Avg Price (EUR)": "{:.2f}",
        "Proportion (%)": "{:.2f}"
    })
)

# --- Display styled table ---
display(styled_genres)
