# Sephora Customer Segmentation (Refined)

This notebook performs customer segmentation based on processed customer and product data. The primary goal is to group customers into distinct clusters using relevant features and output a CSV file mapping each `client_id` to a `cluster_id`.

**Key changes in this version:**
* Loads data from `data/processed/reviews.csv` and `data/processed/skincare_product_info.csv`.
* Uses the Silhouette method for determining the optimal number of clusters.
* Engineers and selects per-customer features using the column names present in the processed files.

The process involves:
1.  Loading pre-processed data.
2.  Merging the processed data.
3.  Engineering features that describe customer behavior and preferences.
4.  Determining optimal cluster count using Silhouette analysis.
5.  Applying K-Means clustering.
6.  Analyzing cluster characteristics.
7.  Exporting the segmentation results.
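
The deliverable named above is a mapping from `client_id` to `cluster_id`. As a rough sketch of that last step only, the snippet below turns a series of cluster labels into the two-column CSV layout. The customer IDs and labels are made-up toy values, and the CSV is written to an in-memory buffer because the output file name is not specified here.

```python
import io

import pandas as pd

# Hypothetical cluster labels indexed by client_id (toy values, not real results).
assignments = pd.Series(
    [0, 2, 1],
    index=pd.Index([101, 102, 103], name="client_id"),
    name="cluster_id",
)

# One row per customer, two columns: client_id and cluster_id.
mapping = assignments.reset_index()

# Write to an in-memory buffer; a real run would pass a file path instead.
buffer = io.StringIO()
mapping.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the default integer index out of the file, so the CSV contains exactly the two columns.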

In [None]:
import pandas as pd
import numpy as np
import glob
import os

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import silhouette_score # For Silhouette analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Define the base path for PROCESSED data
PROCESSED_DATA_PATH = "../data/processed/"

print("Libraries imported and PROCESSED_DATA_PATH set to:", PROCESSED_DATA_PATH)

## Cell 3: Load Processed Data & Initial Merge
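
The cell below relies on suffixed column names such as `rating_product`. One detail worth keeping in mind is that `pd.merge` applies `suffixes` only to columns that exist in both frames; a column unique to one frame keeps its original name. The sketch below illustrates this on toy data and also shows the price-to-category-average ratio. All frames, values, and the column names `price_usd` and `primary_category` are illustrative assumptions, not the real schema.

```python
import pandas as pd

# Toy stand-ins for reviews.csv and skincare_product_info.csv (made-up values).
reviews = pd.DataFrame({
    "author_id": [1, 1, 2],
    "product_id": ["P1", "P2", "P1"],
    "rating": [5, 3, 4],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "rating": [4.5, 3.9],
    "price_usd": [30.0, 10.0],
    "primary_category": ["Skincare", "Skincare"],
})

merged = pd.merge(
    reviews, products, on="product_id", how="left",
    suffixes=("_review", "_product"),
)

# Only "rating" appears in both frames, so only it is suffixed.
print(merged.columns.tolist())

# Price of each reviewed product relative to its category mean.
# The mean is taken over merged rows, so it is weighted by review count.
category_mean = merged.groupby("primary_category")["price_usd"].transform("mean")
merged["price_ratio_to_category_avg"] = merged["price_usd"] / category_mean
print(merged[["product_id", "price_usd", "price_ratio_to_category_avg"]])
```

If a suffixed name expected by later cells is missing after the merge, checking `merged.columns` for the unsuffixed name is a quick first diagnostic.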

In [None]:
# This cell loads processed reviews.csv and skincare_product_info.csv, then merges them.
# The output of this cell will be `merged_df`.

print(f"Looking for processed data in: {PROCESSED_DATA_PATH}")

# Define file names
processed_reviews_file = "reviews.csv"
processed_products_file = "skincare_product_info.csv"

reviews_csv_path = os.path.join(PROCESSED_DATA_PATH, processed_reviews_file)
products_csv_path = os.path.join(PROCESSED_DATA_PATH, processed_products_file)

# Load processed reviews data
try:
    reviews_df = pd.read_csv(reviews_csv_path)
    print(f"Successfully loaded {processed_reviews_file}. Shape: {reviews_df.shape}")
    # print("Columns in processed reviews_df:", reviews_df.columns.tolist())
    # print("\nProcessed Reviews DataFrame Sample:")
    # print(reviews_df.head())
except FileNotFoundError:
    print(f"Error: {reviews_csv_path} not found.")
    reviews_df = pd.DataFrame()
except Exception as e:
    print(f"An error occurred loading {processed_reviews_file}: {e}")
    reviews_df = pd.DataFrame()

# Load processed product information
try:
    products_df = pd.read_csv(products_csv_path)
    print(f"\nSuccessfully loaded {processed_products_file}. Shape: {products_df.shape}")
    # print("Columns in processed skincare_product_info_df (products_df):", products_df.columns.tolist())
    # print("\nProcessed Products DataFrame Sample:")
    # print(products_df.head())
except FileNotFoundError:
    print(f"Error: {products_csv_path} not found.")
    products_df = pd.DataFrame()
except Exception as e:
    print(f"An error occurred loading {processed_products_file}: {e}")
    products_df = pd.DataFrame()

# --- Proceed with merging if both DataFrames are loaded ---
if products_df.empty or reviews_df.empty:
    print("\nCannot proceed with merging as one or both processed dataframes are missing/empty.")
    merged_df = pd.DataFrame()
else:
    print("\n--- Starting Data Preprocessing & Merge of Processed Files ---")
    
    if 'product_id' not in reviews_df.columns:
        print("Error: 'product_id' column missing in processed reviews.csv.")
    if 'product_id' not in products_df.columns:
        print("Error: 'product_id' column missing in processed skincare_product_info.csv.")

    if 'product_id' in reviews_df.columns and 'product_id' in products_df.columns:
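        # Drop missing IDs before casting: astype(str) would otherwise turn NaN into the string 'nan',
        # which dropna() cannot detect and which could match across both frames during the merge.
        products_df.dropna(subset=['product_id'], inplace=True)
        reviews_df.dropna(subset=['product_id'], inplace=True)
        products_df['product_id'] = products_df['product_id'].astype(str)
        reviews_df['product_id'] = reviews_df['product_id'].astype(str)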

        client_id_col_in_reviews = 'author_id' 
        if client_id_col_in_reviews not in reviews_df.columns:
            print(f"Error: Client identifier column '{client_id_col_in_reviews}' not found in reviews.csv.")
            merged_df = pd.DataFrame() 
        else:
            reviews_df.dropna(subset=[client_id_col_in_reviews, 'product_id'], inplace=True)
            print(f"Reviews_df shape after dropping NA {client_id_col_in_reviews}/product_id: {reviews_df.shape}")
            
            merged_df = pd.merge(reviews_df, products_df, on="product_id", how="left", suffixes=('_review', '_product'))
            print(f"\nMerged data shape: {merged_df.shape}")
            # print("Columns in merged_df after merge:", merged_df.columns.tolist())
            
            discount_column_from_product_info = 'discount_usd_product' 
            if discount_column_from_product_info in merged_df.columns:
                merged_df[discount_column_from_product_info] = pd.to_numeric(merged_df[discount_column_from_product_info], errors='coerce')
                merged_df['is_on_sale'] = merged_df[discount_column_from_product_info] > 0
                print(f"Created 'is_on_sale' based on '{discount_column_from_product_info} > 0'.")
            else:
                print(f"Warning: Column '{discount_column_from_product_info}' not found. 'is_on_sale' defaulting to False.")
                merged_df['is_on_sale'] = False
            
            # *** NEW FEATURE ENGINEERING: Price relative to category average ***
            PRODUCT_PRICE_COL_MERGED = 'actual_price_usd_product' # From skincare_product_info, suffixed
            PRODUCT_CATEGORY_COL_MERGED = 'primary_category_product' # From skincare_product_info, suffixed
            
            if PRODUCT_PRICE_COL_MERGED in merged_df.columns and PRODUCT_CATEGORY_COL_MERGED in merged_df.columns:
                # Ensure price is numeric before computing category averages (the feature engineering cell repeats this conversion)
                merged_df[PRODUCT_PRICE_COL_MERGED] = pd.to_numeric(merged_df[PRODUCT_PRICE_COL_MERGED], errors='coerce')
                
                # Calculate average price per category
                print(f"\nCalculating average price for each '{PRODUCT_CATEGORY_COL_MERGED}' using '{PRODUCT_PRICE_COL_MERGED}'.")
                # Use transform to get a series of the same length as merged_df for easy division
                category_avg_prices = merged_df.groupby(PRODUCT_CATEGORY_COL_MERGED)[PRODUCT_PRICE_COL_MERGED].transform('mean')
                
                # Calculate the price ratio feature
                merged_df['price_ratio_to_category_avg'] = merged_df[PRODUCT_PRICE_COL_MERGED] / category_avg_prices
                
                # Handle potential inf values (if category_avg_price was 0) and NaNs by replacing inf with NaN,
                # then deciding on imputation (e.g., 1.0 if product price matches category average, or let imputer handle later)
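                # Assign the result back: an inplace replace on a selected column may not modify merged_df in recent pandas versions
                merged_df['price_ratio_to_category_avg'] = merged_df['price_ratio_to_category_avg'].replace([np.inf, -np.inf], np.nan)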
                # Optional: fill NaNs with 1, assuming if data is missing, price is "average" for its category. 
                # Otherwise, the imputer in the pipeline will handle it.
                # merged_df['price_ratio_to_category_avg'].fillna(1.0, inplace=True) 
                print("Created 'price_ratio_to_category_avg' feature.")
                print("Sample of new price ratio feature (first 5 with actual price and category for context):")
                print(merged_df[[PRODUCT_CATEGORY_COL_MERGED, PRODUCT_PRICE_COL_MERGED, 'price_ratio_to_category_avg']].head())
            else:
                print(f"Warning: Columns '{PRODUCT_PRICE_COL_MERGED}' or '{PRODUCT_CATEGORY_COL_MERGED}' not found. Cannot create 'price_ratio_to_category_avg'.")
                merged_df['price_ratio_to_category_avg'] = np.nan # Create column as NaN

            print("\nMerged Data Sample (selected columns):")
            display_cols_sample = [client_id_col_in_reviews, 'product_id', 'is_on_sale']
            price_col_for_sample = PRODUCT_PRICE_COL_MERGED if PRODUCT_PRICE_COL_MERGED in merged_df.columns else ('actual_price_usd_review' if 'actual_price_usd_review' in merged_df.columns else None)
            if price_col_for_sample:
                display_cols_sample.insert(2, price_col_for_sample)
            if 'price_ratio_to_category_avg' in merged_df.columns:
                 display_cols_sample.append('price_ratio_to_category_avg')
            print(merged_df[display_cols_sample].head())
    else:
        print("Halting merge due to missing 'product_id' columns in one or both dataframes.")
        merged_df = pd.DataFrame()

## Cell 4: Feature Engineering & Selection
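
The cell below builds a dictionary of named aggregations and unpacks it into `groupby().agg()`. As a minimal sketch of that pattern, the example below collapses review-level rows into one row per customer. The data and the unsuffixed column names (`price`, `brand`) are toy assumptions chosen for brevity.

```python
import pandas as pd

# Toy review-level rows: one row per (customer, product) review.
merged = pd.DataFrame({
    "author_id": [1, 1, 2],
    "product_id": ["P1", "P2", "P1"],
    "rating_review": [5, 3, 4],
    "price": [30.0, 10.0, 30.0],
    "brand": ["A", "B", "A"],
    "is_on_sale": [True, False, False],
})

# Each entry maps an output feature name to (source column, aggregation).
agg_spec = {
    "avg_rating_given": ("rating_review", "mean"),
    "num_reviews": ("product_id", "count"),
    "total_value_reviewed": ("price", "sum"),
    "num_unique_brands_reviewed": ("brand", "nunique"),
    # The mean of a boolean column is the share of True values.
    "prop_on_sale_reviewed": ("is_on_sale", "mean"),
}

customers = (
    merged.groupby("author_id")
    .agg(**agg_spec)
    .rename_axis("client_id")
)
print(customers)
```

Building the spec as a dictionary makes it easy to skip a feature when its source column is absent, which is what the conditional checks in the cell do.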

In [None]:
import pandas as pd
import numpy as np

if 'merged_df' not in locals() or not isinstance(merged_df, pd.DataFrame) or merged_df.empty:
    print("merged_df is not available or empty. Skipping Feature Engineering & Selection.")
    customer_df = pd.DataFrame() 
    customer_df_for_clustering = pd.DataFrame() 
else:
    print("\n--- Starting Feature Engineering (from merged_df) ---")
    
    CLIENT_ID_COLUMN_NAME = 'author_id'

    if CLIENT_ID_COLUMN_NAME not in merged_df.columns:
        print(f"CRITICAL ERROR: Client ID column '{CLIENT_ID_COLUMN_NAME}' not found in merged_df.")
        customer_df = pd.DataFrame()
        customer_df_for_clustering = pd.DataFrame()
    else:
        merged_df.dropna(subset=[CLIENT_ID_COLUMN_NAME], inplace=True)
        merged_df[CLIENT_ID_COLUMN_NAME] = pd.to_numeric(merged_df[CLIENT_ID_COLUMN_NAME], errors='coerce')
        merged_df.dropna(subset=[CLIENT_ID_COLUMN_NAME], inplace=True) 
        merged_df[CLIENT_ID_COLUMN_NAME] = merged_df[CLIENT_ID_COLUMN_NAME].astype(np.int64)

        REVIEW_RATING_COL = 'rating_review' # Rating given in the review (suffixed by the merge; previously 'user_rating')
        REVIEW_RECOMMENDED_COL = 'is_recommended' # Assumed to exist only in reviews.csv, so the merge adds no suffix
        
        PRODUCT_PRICE_COL = 'actual_price_usd_product' 
        PRODUCT_RATING_COL = 'rating_product'  # Product-level rating (suffixed by the merge; previously 'rating')
        PRODUCT_LOVES_COL = 'loves_count_product'      
        PRODUCT_BRAND_COL = 'brand_name_product'      
        PRODUCT_CATEGORY_COL = 'primary_category_product' 
        PRODUCT_LIMITED_EDITION_COL = 'limited_edition_product' 
        PRODUCT_NEW_COL = 'new_product'               
        PRODUCT_ONLINE_ONLY_COL = 'online_only_product' 
        PRODUCT_SEPHORA_EXCLUSIVE_COL = 'sephora_exclusive_product' 
        PRODUCT_IS_ON_SALE_COL = 'is_on_sale'
        
        # *** NEW: Define column name for the price ratio feature ***
        PRICE_RATIO_VS_CATEGORY_COL = 'price_ratio_to_category_avg' # Created in the previous cell

        print("\n--- Confirming availability and Pre-cleaning Numeric Source Columns ---")
        source_columns_map = {
            REVIEW_RATING_COL: "reviews.csv ('rating_review')",
            REVIEW_RECOMMENDED_COL: "reviews.csv ('is_recommended')",
            PRODUCT_PRICE_COL: "skincare_product_info.csv ('actual_price_usd_product')",
            PRODUCT_RATING_COL: "skincare_product_info.csv ('rating_product')",
            PRODUCT_LOVES_COL: "skincare_product_info.csv ('loves_count_product')",
            PRODUCT_BRAND_COL: "skincare_product_info.csv ('brand_name_product')",
            PRODUCT_CATEGORY_COL: "skincare_product_info.csv ('primary_category_product')",
            PRODUCT_LIMITED_EDITION_COL: "skincare_product_info.csv ('limited_edition_product')",
            PRODUCT_NEW_COL: "skincare_product_info.csv ('new_product')",
            PRODUCT_ONLINE_ONLY_COL: "skincare_product_info.csv ('online_only_product')",
            PRODUCT_SEPHORA_EXCLUSIVE_COL: "skincare_product_info.csv ('sephora_exclusive_product')",
            PRODUCT_IS_ON_SALE_COL: "Engineered ('is_on_sale')",
            PRICE_RATIO_VS_CATEGORY_COL: "Engineered ('price_ratio_to_category_avg')" # *** NEW ***
        }
        
        all_source_cols_found = True
        for col_name, col_source_desc in source_columns_map.items():
            if col_name not in merged_df.columns:
                print(f"Warning: Expected source column '{col_name}' (for {col_source_desc}) NOT FOUND in merged_df. Related features will be NaN or cause errors.")
                all_source_cols_found = False
            else:
                if col_name == PRODUCT_PRICE_COL:
                    print(f"Cleaning and converting '{col_name}' to numeric...")
                    merged_df[col_name] = merged_df[col_name].astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False)
                    merged_df[col_name] = pd.to_numeric(merged_df[col_name], errors='coerce')
                elif col_name in [REVIEW_RATING_COL, PRODUCT_RATING_COL, PRODUCT_LOVES_COL, REVIEW_RECOMMENDED_COL, PRICE_RATIO_VS_CATEGORY_COL]: # *** Added PRICE_RATIO_VS_CATEGORY_COL ***
                    print(f"Converting '{col_name}' to numeric...")
                    merged_df[col_name] = pd.to_numeric(merged_df[col_name], errors='coerce')


        if all_source_cols_found:
            print("All key source columns for feature engineering appear to be available and have been pre-processed in merged_df.")
        print("--- End of source column confirmation and pre-cleaning ---\n")
        
        print("Calculating customer features using groupby().agg()...")
        
        agg_functions = {}
        if REVIEW_RATING_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[REVIEW_RATING_COL]):
            agg_functions['avg_rating_given'] = (REVIEW_RATING_COL, 'mean')
        
        if REVIEW_RECOMMENDED_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[REVIEW_RECOMMENDED_COL]):
            agg_functions['prop_recommended'] = (REVIEW_RECOMMENDED_COL, 'mean')
        
        agg_functions['num_reviews'] = (CLIENT_ID_COLUMN_NAME, 'count')

        if PRODUCT_PRICE_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[PRODUCT_PRICE_COL]):
            agg_functions['avg_price_reviewed'] = (PRODUCT_PRICE_COL, 'mean')
            agg_functions['total_value_reviewed'] = (PRODUCT_PRICE_COL, 'sum')
        
        if PRODUCT_RATING_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[PRODUCT_RATING_COL]):
            agg_functions['avg_product_rating_reviewed'] = (PRODUCT_RATING_COL, 'mean')
        
        if PRODUCT_LOVES_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[PRODUCT_LOVES_COL]):
            agg_functions['avg_loves_count_reviewed'] = (PRODUCT_LOVES_COL, 'mean')
        
        if PRODUCT_BRAND_COL in merged_df.columns: 
            agg_functions['num_unique_brands_reviewed'] = (PRODUCT_BRAND_COL, 'nunique')
        if PRODUCT_CATEGORY_COL in merged_df.columns: 
            agg_functions['num_unique_categories_reviewed'] = (PRODUCT_CATEGORY_COL, 'nunique')
        
        # *** NEW: Add aggregation for the price ratio feature ***
        if PRICE_RATIO_VS_CATEGORY_COL in merged_df.columns and pd.api.types.is_numeric_dtype(merged_df[PRICE_RATIO_VS_CATEGORY_COL]):
            agg_functions['avg_price_ratio_vs_category'] = (PRICE_RATIO_VS_CATEGORY_COL, 'mean')
        else:
            print(f"Warning: '{PRICE_RATIO_VS_CATEGORY_COL}' not available or not numeric in merged_df. Cannot create 'avg_price_ratio_vs_category' feature.")

        bool_like_cols = {
            PRODUCT_LIMITED_EDITION_COL: 'prop_limited_edition',
            PRODUCT_NEW_COL: 'prop_new_product',
            PRODUCT_ONLINE_ONLY_COL: 'prop_online_only',
            PRODUCT_SEPHORA_EXCLUSIVE_COL: 'prop_sephora_exclusive',
            PRODUCT_IS_ON_SALE_COL: 'prop_on_sale_reviewed'
        }
        for col, new_name in bool_like_cols.items():
            if col in merged_df.columns:
                 agg_functions[new_name] = (col, lambda x: pd.to_numeric(x, errors='coerce').mean())


        if not agg_functions: 
            print("Error: No valid aggregation functions could be defined. Cannot create customer_df robustly.")
            customer_df = pd.DataFrame()
            customer_df_for_clustering = pd.DataFrame()
        else:
            print(f"Attempting aggregation with functions: {list(agg_functions.keys())}") # Show all keys
            try:
                customer_df = merged_df.groupby(CLIENT_ID_COLUMN_NAME).agg(
                    **agg_functions
                ).reset_index()
                customer_df.rename(columns={CLIENT_ID_COLUMN_NAME: 'client_id'}, inplace=True)
                customer_df.set_index('client_id', inplace=True)
                print("Customer features calculated.")
            except Exception as e:
                print(f"Error during groupby().agg(): {e}")
                cols_in_agg = [val[0] for val in agg_functions.values() if isinstance(val, tuple)]
                if cols_in_agg:
                    # print(f"Dtypes in merged_df for columns used in aggregation: \n{merged_df[list(set(cols_in_agg))].info()}")
                    pass # Avoid overly verbose output
                customer_df = pd.DataFrame() 
                customer_df_for_clustering = pd.DataFrame() 

        if not customer_df.empty:
            print(f"\nCustomer features DataFrame created. Shape: {customer_df.shape}")
            print("Customer DataFrame sample:")
            print(customer_df.head())
            
            print("\nSelecting all available engineered numeric features for clustering as a starting point.")
            existing_numeric_cols = [col for col in customer_df.columns if pd.api.types.is_numeric_dtype(customer_df[col])]

            if not existing_numeric_cols:
                 print("Error: No numeric features available in customer_df for clustering.")
                 customer_df_for_clustering = pd.DataFrame()
            else:
                customer_df_for_clustering = customer_df[existing_numeric_cols].copy()
                print(f"Using these features for clustering: {customer_df_for_clustering.columns.tolist()}")
                print("\nSample of data selected for clustering (customer_df_for_clustering):")
                print(customer_df_for_clustering.head())
                print("\nNaNs per selected feature (before imputation for clustering):")
                print(customer_df_for_clustering.isnull().sum())
        else:
            print("Customer DataFrame is empty after feature engineering attempt.")
            customer_df_for_clustering = pd.DataFrame() 
            
if 'customer_df' not in locals():
    customer_df = pd.DataFrame()
if 'customer_df_for_clustering' not in locals():
    customer_df_for_clustering = pd.DataFrame()

print("\n--- Feature Engineering & Selection Cell Finished ---")
print(f"Shape of customer_df: {customer_df.shape if isinstance(customer_df, pd.DataFrame) else 'Not a DataFrame'}")
print(f"Shape of customer_df_for_clustering: {customer_df_for_clustering.shape if isinstance(customer_df_for_clustering, pd.DataFrame) else 'Not a DataFrame'}")

## Cell 5: Prepare for Clustering (Imputation & Scaling)
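
K-Means groups points by Euclidean distance, so a feature with a large numeric range (for example total spend) would dominate one with a small range (for example a proportion) unless the features are put on a common scale. The sketch below shows median imputation followed by standardization on a toy frame; the values and the two feature names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy customer features with one missing value (made-up numbers).
features = pd.DataFrame(
    {
        "avg_price_reviewed": [10.0, np.nan, 30.0, 50.0],
        "num_reviews": [1, 2, 3, 10],
    },
    index=pd.Index([101, 102, 103, 104], name="client_id"),
)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scaler", StandardScaler()),                   # zero mean, unit variance per column
])

scaled = pd.DataFrame(
    prep.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled.round(3))
```

Because every column goes through the same two steps here, the `Pipeline` alone is enough; a `ColumnTransformer` becomes useful once different columns need different treatment.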

In [None]:
if 'customer_df_for_clustering' not in locals() or customer_df_for_clustering.empty:
    print("Data for clustering is not available. Skipping preparation.")
    processed_customer_df = pd.DataFrame()
else:
    print("\n--- Preparing Data for Clustering (Imputation & Scaling) ---")
    
    features_to_process = customer_df_for_clustering.columns.tolist()
    
    numerical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')), # Impute NaNs
        ('scaler', StandardScaler())                   # Scale features
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_pipeline, features_to_process)
        ],
        remainder='passthrough'
    )
    
    try:
        processed_features_array = preprocessor.fit_transform(customer_df_for_clustering)
        processed_customer_df = pd.DataFrame(processed_features_array, 
                                             columns=features_to_process, 
                                             index=customer_df_for_clustering.index)
        
        print("\nProcessed (scaled and imputed) customer features for clustering (sample):")
        print(processed_customer_df.head())
        
        if processed_customer_df.isnull().sum().any():
            print("\nWARNING: NaNs found in processed_customer_df AFTER imputation. Check pipeline/data.")
            print(processed_customer_df.isnull().sum())
        else:
            print("\nNo NaNs in processed_customer_df. Ready for clustering.")

    except Exception as e:
        print(f"Error during preprocessing for clustering: {e}")
        processed_customer_df = pd.DataFrame()

## Cell 6: Determine Optimal Number of Clusters (Silhouette Method)
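
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster. The score reported by `silhouette_score` is the average over all points and lies between -1 and 1. The sketch below scans a few values of K on synthetic, well-separated blobs, which stand in for the scaled customer features; the data and the K range are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (not customer data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest average silhouette.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Best K:", best_k)
```

On real customer data the curve is usually much flatter than on blobs like these, so the highest score is better treated as a candidate than as a final answer.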

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go # Import Plotly
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fallback for running this cell in isolation: if processed_customer_df was never defined,
# substitute random data so the code path can be tested. Do not interpret results from it.
if 'processed_customer_df' not in locals() or not isinstance(processed_customer_df, pd.DataFrame):
    num_samples_dummy = 100 
    num_features_dummy = 5
    processed_customer_df = pd.DataFrame(np.random.rand(num_samples_dummy, num_features_dummy))
    print("WARNING: processed_customer_df was not defined; using random dummy data for testing only.")


if 'processed_customer_df' not in locals() or not isinstance(processed_customer_df, pd.DataFrame) or processed_customer_df.empty:
    print("Processed customer DataFrame is empty or not a DataFrame. Skipping Silhouette analysis.")
else:
    print("\n--- Determining Optimal K using Silhouette Method ---")
    print(f"Shape of processed_customer_df: {processed_customer_df.shape}")

    silhouette_scores = []
    k_range_start = 2
    k_range_end = 30 
    k_range_step = 2  
    k_range = range(k_range_start, k_range_end, k_range_step)

    current_max_k = max(k_range) if list(k_range) else 0

    if current_max_k > 0 and processed_customer_df.shape[0] < current_max_k:
        print(f"Warning: Number of samples ({processed_customer_df.shape[0]}) is less than the maximum K being tested ({current_max_k}).")
        new_upper_k_limit = processed_customer_df.shape[0] -1
        if new_upper_k_limit < k_range_start:
            k_range = [] 
        else:
            k_range = range(k_range_start, new_upper_k_limit + 1, k_range_step)
        current_max_k = max(k_range) if list(k_range) else 0
        print(f"Adjusted k_range due to small sample size: {list(k_range)}")

    optimal_k_silhouette = 3 # Default K, will be updated if analysis runs

    if not list(k_range):
        print("Not enough data points or k_range is empty/too small to perform Silhouette analysis. Setting K=3 as default for next step if applicable.")
        k_range_for_plot = []
        # optimal_k_silhouette is already 3
    else:
        k_range_for_plot = list(k_range)
        print(f"Testing K values in range: {k_range_for_plot}")
        
        silhouette_sample_size = None
        if processed_customer_df.shape[0] > 5000: 
            silhouette_sample_size = 5000 
            print(f"Using sample_size={silhouette_sample_size} for Silhouette score calculation due to large dataset.")

        for k_val in k_range_for_plot:
            if k_val >= processed_customer_df.shape[0]:
                print(f"Skipping K={k_val} as it's >= number of samples ({processed_customer_df.shape[0]}). Appending NaN.")
                silhouette_scores.append(np.nan)
                continue
            
            print(f"\nProcessing K={k_val}...")
            try:
                print(f"  Fitting KMeans for K={k_val} (n_init=3)...")
                kmeans_model = KMeans(n_clusters=k_val, random_state=42, n_init=3, algorithm='lloyd')
                cluster_labels_temp = kmeans_model.fit_predict(processed_customer_df)
                
                num_unique_labels = len(np.unique(cluster_labels_temp))
                if num_unique_labels > 1 and processed_customer_df.shape[0] > k_val:
                    print(f"  Calculating Silhouette Score for K={k_val} (unique labels: {num_unique_labels})...")
                    score = silhouette_score(processed_customer_df, cluster_labels_temp, sample_size=silhouette_sample_size, random_state=42)
                    silhouette_scores.append(score)
                    print(f"  Silhouette Score for K={k_val}: {score:.4f}")
                else:
                    print(f"  Could not calculate Silhouette Score for K={k_val}. Conditions not met:")
                    print(f"    Unique clusters found: {num_unique_labels} (need > 1)")
                    print(f"    Samples ({processed_customer_df.shape[0]}) vs K ({k_val}) (need samples > K)")
                    silhouette_scores.append(np.nan)
            except Exception as e:
                print(f"  Error calculating Silhouette for K={k_val}: {e}. Appending NaN.")
                silhouette_scores.append(np.nan)

    # --- Plotting with Plotly ---
    if k_range_for_plot and any(not np.isnan(s) for s in silhouette_scores if isinstance(s, float)):
        valid_k_for_plot = [k_range_for_plot[i] for i, s in enumerate(silhouette_scores) if not np.isnan(s)]
        valid_scores_for_plot = [s for s in silhouette_scores if not np.isnan(s)]

        if valid_k_for_plot:
            fig = go.Figure()
            fig.add_trace(go.Scatter(
                x=valid_k_for_plot,
                y=valid_scores_for_plot,
                mode='lines+markers',
                marker=dict(color='red', size=8), # Red markers
                line=dict(color='red', width=2),   # Red line
                name='Silhouette Score'
            ))

            fig.update_layout(
                title='Silhouette Scores for Optimal K',
                xaxis_title='Number of Clusters (K)',
                yaxis_title='Silhouette Score',
                xaxis=dict(tickmode='array', tickvals=valid_k_for_plot, showgrid=True, gridcolor='LightGrey'),
                yaxis=dict(showgrid=True, gridcolor='LightGrey'),
                plot_bgcolor='white',
                height=600,
                width=800,
                font=dict(family='Arial', size=12)
            )
            fig.show()
            
            if valid_scores_for_plot:
                optimal_k_silhouette = valid_k_for_plot[np.argmax(valid_scores_for_plot)]
                print(f"Recommended K based on highest Silhouette Score: {optimal_k_silhouette}")
            # else optimal_k_silhouette remains 3 (default)
        else:
            print("No valid data to plot Silhouette scores (all were NaN or k_range was empty).")
            # optimal_k_silhouette remains 3 (default)

    elif k_range_for_plot:
        print("Silhouette analysis completed, but all scores were NaN. Cannot plot or recommend K. Setting K=3 as default.")
        # optimal_k_silhouette remains 3 (default)
    else: # k_range_for_plot was empty from the start
        print("Silhouette analysis skipped due to insufficient data or k_range size. Optimal K not determined by Silhouette.")
        # optimal_k_silhouette remains 3 (default)

    # Ensure optimal_k_silhouette has a value for subsequent cells
    # This was already handled by initializing to 3 and updating if possible.
    print(f"Final optimal_k_silhouette to be used: {optimal_k_silhouette}")

## Cell 7: K-Means Clustering
Based on the Silhouette plot and scores from the previous step (and potentially other domain knowledge), choose a value for `K_CLUSTERS`. The K with the highest Silhouette score is often a good choice, but also consider the interpretability of the clusters.
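
One way to keep the choice of K explicit is to default to the Silhouette recommendation and allow a manual override when domain knowledge suggests a different value. The sketch below shows that pattern on synthetic data with two obvious groups. The names `silhouette_pick` and `manual_override` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic "scaled features": two tight groups of 20 customers each.
rng = np.random.default_rng(0)
scaled = pd.DataFrame(
    np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))]),
    columns=["f1", "f2"],
    index=pd.RangeIndex(1000, 1040, name="client_id"),
)

silhouette_pick = 2     # hypothetical result of the Silhouette scan
manual_override = None  # set to an int to overrule the scan

k_clusters = manual_override if manual_override is not None else silhouette_pick

model = KMeans(n_clusters=k_clusters, n_init=10, random_state=42)
labels = pd.Series(model.fit_predict(scaled), index=scaled.index, name="cluster_id")

# Cluster sizes are a quick sanity check before interpreting the segments.
sizes = labels.value_counts().sort_index()
print(sizes)
```

Very small clusters in the size table often indicate outliers rather than a meaningful segment, which is one reason to consider a K other than the top-scoring one.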

In [None]:
if 'processed_customer_df' not in locals() or processed_customer_df.empty:
    print("Processed customer DataFrame is empty. Skipping K-Means clustering.")
    customer_df_final_with_clusters = pd.DataFrame() 
else:
    print("\n--- Performing K-Means Clustering ---")
    # !!! SET THIS VALUE BASED ON YOUR SILHOUETTE ANALYSIS (and Elbow if you run it) !!!
    K_CLUSTERS = 12 # Example value, ADJUST AS NEEDED (e.g., optimal_k_silhouette if defined)
    print(f"Using K = {K_CLUSTERS} clusters.")

    if K_CLUSTERS <= 1 or K_CLUSTERS > processed_customer_df.shape[0]:
        print(f"Error: Invalid K_CLUSTERS value ({K_CLUSTERS}) for dataset size {processed_customer_df.shape[0]}. Cannot proceed.")
        customer_df_final_with_clusters = pd.DataFrame()
    else:
        kmeans_final_model = KMeans(n_clusters=K_CLUSTERS, random_state=42, n_init='auto')
        cluster_labels = kmeans_final_model.fit_predict(processed_customer_df)
        
        # Use customer_df_for_clustering (unscaled, selected features) for easier interpretation of cluster means
        customer_df_interpretable = customer_df_for_clustering.copy()
        
        if customer_df_interpretable.isnull().values.any():
            print("Imputing NaNs in customer_df_interpretable for cluster analysis (using median)...")
            for col in customer_df_interpretable.columns:
                if customer_df_interpretable[col].isnull().any():
                    median_val = customer_df_interpretable[col].median()
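                    # Assign the result back: an inplace fillna on a selected column may not modify the DataFrame in recent pandas versions
                    customer_df_interpretable[col] = customer_df_interpretable[col].fillna(median_val)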
        
        customer_df_final_with_clusters = customer_df_interpretable.copy()
        customer_df_final_with_clusters['cluster_id'] = cluster_labels
        
        print(f"\nCustomers with assigned cluster IDs (sample):")
        display_cols = ['cluster_id'] + customer_df_for_clustering.columns.tolist()[:min(3, len(customer_df_for_clustering.columns))]
        print(customer_df_final_with_clusters[display_cols].head())

## Cell 8: Analyze Clusters
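
A cluster profile is typically the mean of each unscaled feature per cluster, plus the cluster size. Dividing those means by the overall mean gives a simple index, where values above 1 mark features on which a cluster sits above the average customer. The sketch below shows both tables on toy data; the values are made up and the relative index is an optional addition, not something the cell below computes.

```python
import pandas as pd

# Toy customers with unscaled features and an assigned cluster.
clustered = pd.DataFrame(
    {
        "avg_price_reviewed": [20.0, 30.0, 80.0, 100.0],
        "avg_rating_given": [4.0, 5.0, 3.0, 4.0],
        "cluster_id": [0, 0, 1, 1],
    },
    index=pd.Index([101, 102, 103, 104], name="client_id"),
)

feature_cols = [c for c in clustered.columns if c != "cluster_id"]

# Mean feature values per cluster, plus how many customers each cluster holds.
summary = clustered.groupby("cluster_id")[feature_cols].mean()
summary["cluster_size"] = clustered["cluster_id"].value_counts().sort_index()

# Cluster means relative to the overall mean (1.0 = average customer).
relative = summary[feature_cols] / clustered[feature_cols].mean()

print(summary)
print(relative.round(2))
```

Reading the profile on unscaled values keeps the numbers in their original units, such as dollars and rating points.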

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots


if 'customer_df_final_with_clusters' not in locals() or not isinstance(customer_df_final_with_clusters, pd.DataFrame) or customer_df_final_with_clusters.empty:
    print("Final customer DataFrame with clusters is not available or not a DataFrame. Skipping cluster analysis.")
else:
    print("\n--- Analyzing Cluster Characteristics ---")
    
    if 'cluster_id' not in customer_df_final_with_clusters.columns:
        print("Error: 'cluster_id' column not found in customer_df_final_with_clusters. Skipping analysis.")
    else:
        numeric_cols = customer_df_final_with_clusters.select_dtypes(include=np.number).columns.tolist()
        if 'cluster_id' in numeric_cols:
            numeric_cols.remove('cluster_id')
        
        if not numeric_cols:
            print("Error: No numeric columns found for cluster summary (excluding cluster_id).")
            cluster_summary = pd.DataFrame()
        else:
            cluster_summary = customer_df_final_with_clusters.groupby('cluster_id')[numeric_cols].mean()

        cluster_sizes = customer_df_final_with_clusters['cluster_id'].value_counts().sort_index()
        cluster_summary['cluster_size'] = cluster_sizes
        
        print("\nCluster Summary (Mean feature values and size per cluster):")
        with pd.option_context('display.float_format', '{:,.2f}'.format, 'display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
            print(cluster_summary)
        
        # --- Bar plots with Plotly graph_objects (red and black) ---
        features_for_bar_plot = ['avg_price_reviewed', 'avg_rating_given'] 
        available_features_for_barplot = [col for col in features_for_bar_plot if col in cluster_summary.columns]

        if len(available_features_for_barplot) > 0:
            print(f"\n--- Generating Plotly Bar Plots for: {available_features_for_barplot} ---")
            num_bar_plots = len(available_features_for_barplot)
            
            fig_bar_plotly = make_subplots(
                rows=1, cols=num_bar_plots,
                subplot_titles=[f'Mean {feature} per Cluster' for feature in available_features_for_barplot]
            )

            bar_plot_colors = ['red', 'black'] # Define color cycle for bar plots

            for i, feature_to_plot in enumerate(available_features_for_barplot):
                current_bar_color = bar_plot_colors[i % len(bar_plot_colors)] # Cycle through red and black
                fig_bar_plotly.add_trace(
                    go.Bar(
                        x=cluster_summary.index, 
                        y=cluster_summary[feature_to_plot],
                        text=cluster_summary[feature_to_plot].apply(lambda x: f'{x:.2f}'),
                        textposition='auto',
                        name=feature_to_plot,
                        marker_color=current_bar_color, # Apply red or black
                        hovertemplate=(
                            f"<b>Cluster ID: %{{x}}</b><br>"
                            f"Mean {feature_to_plot}: %{{y:.2f}}"
                            "<extra></extra>"
                        )
                    ),
                    row=1, col=i+1
                )
                fig_bar_plotly.update_xaxes(title_text="Cluster ID", row=1, col=i+1, type='category')
                fig_bar_plotly.update_yaxes(title_text=f'Mean {feature_to_plot}', row=1, col=i+1)

            fig_bar_plotly.update_layout(
                height=500,
                showlegend=False, 
                title_text="Cluster Characteristics (Bar Plots)",
                plot_bgcolor='white',
                font=dict(family='Arial', size=12),
                margin=dict(t=80, b=60, l=60, r=60)
            )
            fig_bar_plotly.update_xaxes(showgrid=False)
            fig_bar_plotly.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')
            
            fig_bar_plotly.show()
        else:
            print(f"\nNote: No features from '{features_for_bar_plot}' available in cluster_summary for Plotly bar plotting.")

        # --- PLOTLY GO SCATTER PLOT (Red and Black based on condition) ---
        print("\n--- Generating Plotly Scatter Plot for Cluster Analysis ---")
        
        x_feature_scatter = 'avg_price_reviewed'
        y_feature_scatter = 'avg_price_ratio_vs_category'
        size_feature_scatter = 'cluster_size'

        if x_feature_scatter in cluster_summary.columns and \
           y_feature_scatter in cluster_summary.columns and \
           size_feature_scatter in cluster_summary.columns:

            # Determine colors for scatter plot based on y_feature_scatter's relation to 1.0
            scatter_marker_colors = np.where(cluster_summary[y_feature_scatter] > 1.0, 'red', 'black')

            fig_scatter = go.Figure()
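            min_marker_size, max_marker_size = 5, 40
            # Scale marker size relative to the largest cluster. Applying sqrt to the raw counts
            # and clipping at 40 would give every cluster of a few hundred customers or more the same size.
            size_ratio = cluster_summary[size_feature_scatter] / cluster_summary[size_feature_scatter].max()
            scaled_sizes = min_marker_size + (max_marker_size - min_marker_size) * np.sqrt(size_ratio)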

            fig_scatter.add_trace(go.Scatter(
                x=cluster_summary[x_feature_scatter],
                y=cluster_summary[y_feature_scatter],
                mode='markers+text',
                marker=dict(
                    size=scaled_sizes,
                    color=scatter_marker_colors, # Apply conditional red/black colors
                    line=dict(width=1, color='DarkSlateGrey') # Marker outline
                ),
                text=cluster_summary.index.astype(str), 
                textposition="top center",
                hovertemplate=(
                    f"<b>Cluster ID: %{{text}}</b><br>"
                    f"{x_feature_scatter}: %{{x:.2f}}<br>"
                    f"{y_feature_scatter}: %{{y:.2f}}<br>"
                    f"{size_feature_scatter}: %{{customdata[0]}}"
                    "<extra></extra>"
                ),
                customdata=cluster_summary[[size_feature_scatter]] 
            ))

            median_x = cluster_summary[x_feature_scatter].median()
            median_y = cluster_summary[y_feature_scatter].median()
            
            fig_scatter.add_hline(y=1.0, line_dash="dash", line_color="grey",
                                  annotation_text="Ratio = 1 (Category Avg Price)", annotation_position="bottom right")
            fig_scatter.add_hline(y=median_y, line_dash="dot", line_color="lightgrey",
                                  annotation_text=f"Median Ratio: {median_y:.2f}", annotation_position="top right")
            fig_scatter.add_vline(x=median_x, line_dash="dot", line_color="lightgrey",
                                  annotation_text=f"Median Avg Price: {median_x:.2f}", annotation_position="bottom right")
            
            y_range_scatter = [cluster_summary[y_feature_scatter].min() * 0.9, cluster_summary[y_feature_scatter].max() * 1.1]
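            if len(cluster_summary[y_feature_scatter].unique()) == 1:
                # A single distinct value gives a very narrow range; widen it by 0.5 on each side.
                upper_bound = y_range_scatter[1] if y_range_scatter[1] > y_range_scatter[0] else y_range_scatter[0]
                y_range_scatter = [y_range_scatter[0] - 0.5, upper_bound + 0.5]
            if y_range_scatter[0] >= y_range_scatter[1]:
                # Degenerate range: fall back to a small fixed window around the lower bound.
                y_range_scatter = [y_range_scatter[0] - 0.1, y_range_scatter[0] + 0.1]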


            band_count = 10
            band_shapes = []
            if y_range_scatter[1] > y_range_scatter[0]: # Ensure valid range for bands
                band_height = (y_range_scatter[1] - y_range_scatter[0]) / band_count
                if band_height > 0: 
                    for i in range(0, band_count, 2):
                        band_shapes.append(dict(
                            type='rect', xref='paper', yref='y',
                            x0=0, x1=1,
                            y0=y_range_scatter[0] + i * band_height,
                            y1=y_range_scatter[0] + (i + 1) * band_height,
                            fillcolor='rgba(0, 0, 0, 0.05)',
                            layer='below', line_width=0
                        ))

            fig_scatter.update_layout(
                title=f'Cluster Analysis: {y_feature_scatter} vs. {x_feature_scatter}',
                xaxis_title=f'Mean {x_feature_scatter} per Cluster',
                yaxis_title=f'Mean {y_feature_scatter} per Cluster',
                yaxis=dict(range=y_range_scatter if y_range_scatter[0] < y_range_scatter[1] else None, showgrid=False),
                xaxis=dict(showgrid=False),
                plot_bgcolor='white',
                height=700,
                font=dict(family='Arial', size=12),
                margin=dict(t=80, b=60, l=80, r=80),
                showlegend=False,
                shapes=band_shapes
            )
            fig_scatter.show()
        else:
            print(f"\nNote: One or more features ('{x_feature_scatter}', '{y_feature_scatter}', '{size_feature_scatter}') "
                  "not available in cluster_summary for Plotly scatter plot.")


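if 'optimal_k_silhouette' not in locals():
    # No default is assigned here; this only flags that the Silhouette analysis cell has not been run.
    print("\nWarning: 'optimal_k_silhouette' is not defined. Run the Silhouette analysis cell before any cell that depends on it.")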

## Cell 9: Export Results

In [None]:
if 'customer_df_final_with_clusters' not in locals() or customer_df_final_with_clusters.empty or 'cluster_id' not in customer_df_final_with_clusters.columns:
    print("Clustering was not performed or 'cluster_id' is missing. Skipping export of segments.")
else:
    print("\n--- Exporting Customer Segments ---")
    output_df = customer_df_final_with_clusters[['cluster_id']].reset_index()
    # reset_index() names the new column after the index, or 'index' if the index is unnamed.
    # Rename it to 'client_id' (assuming the index holds the client IDs) so the CSV header is consistent.
    index_col_name = customer_df_final_with_clusters.index.name or 'index'
    if index_col_name != 'client_id':
        output_df = output_df.rename(columns={index_col_name: 'client_id'})
    
    output_filename = "customer_segments_refined.csv"
    try:
        output_df.to_csv(output_filename, index=False)
        print(f"Successfully exported refined customer segments to {output_filename}")
        print("Output CSV sample:")
        print(output_df.head())
    except Exception as e:
        print(f"Error exporting CSV: {e}")

## Customer Cluster Summaries

---

### **Cluster 0: Contented Loyalists**
* **Size:** 82,734 customers
* **Summary:** Satisfied customers who stick to familiar, moderately priced, popular products and aren't swayed by trends or sales.

---

### **Cluster 1: Disappointed Critics of Popular Goods**
* **Size:** 58,380 customers
* **Summary:** Reviewers who rate relatively low-priced items very critically, even though those items are generally popular.

---

### **Cluster 2: Engaged Brand Explorers**
* **Size:** 41,327 customers
* **Summary:** Active and generally satisfied reviewers who try a good number of brands and show moderate interest in trends and sales.

---

### **Cluster 3: Online Channel Loyalists**
* **Size:** 43,529 customers
* **Summary:** Highly satisfied customers who predominantly purchase online-only products, often less "viral" items.

---

### **Cluster 4: Limited Edition & Sephora Exclusive Hunters (Sale Savvy)**
* **Size:** 9,678 customers
* **Summary:** Seek out limited editions and Sephora exclusives, and are the most interested in finding these items on sale.

---

### **Cluster 5: Budget Fans of Viral Hits (Sephora Exclusives)**
* **Size:** 18,507 customers
* **Summary:** Purchase very low-priced but extremely popular (viral) items, with a high preference for Sephora exclusives.

---

### **Cluster 6: Devoted Sephora Exclusive Fans**
* **Size:** 63,823 customers
* **Summary:** Highly satisfied, budget-conscious customers who overwhelmingly prefer Sephora exclusive products.

---

### **Cluster 7: New Product Aficionados (High Quality Focus)**
* **Size:** 8,112 customers
* **Summary:** Early adopters who focus on new products that also have high community ratings, often Sephora exclusives or online-only items.

---

### **Cluster 8: High-Spending, Prolific Reviewing Brand Connoisseurs**
* **Size:** 4,263 customers
* **Summary:** Highly engaged customers who spend and review a lot, explore many brands, and show interest in new products.

---

### **Cluster 9: Premium Sale Seekers (Viral Products)**
* **Size:** 10,492 customers
* **Summary:** Purchase high-priced, popular items almost exclusively when they are on sale, often Sephora exclusives.

---

### **Cluster 10: Quality-Focused Budget Buyers**
* **Size:** 123,569 customers
* **Summary:** The largest group: very satisfied customers who buy low-priced items with excellent community ratings and are the least interested in trends or sales.

---

### **Cluster 11: Luxury Niche Loyalists**
* **Size:** 10,212 customers
* **Summary:** Spend the most per item by far, focus on expensive, niche (less viral) products, and are brand loyal.

---

### **Cluster 12: Sephora Exclusive Fans (Less Critical of Product Ratings)**
* **Size:** 27,802 customers
* **Summary:** Strongly prefer Sephora exclusives, similar to Cluster 6, but are satisfied even if those exclusives have lower overall community ratings.

---

### **Cluster 13: Super Reviewers / High-Spending Trendsetters**
* **Size:** 767 customers
* **Summary:** A very small group of power users who review, spend, and explore brands the most, focusing on high-quality, new, and often niche items.

---
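
The persona names above can be attached to the exported segments for reporting. The following is a minimal sketch, assuming the export has `client_id` and `cluster_id` columns as produced in Cell 9. The `CLUSTER_NAMES` dictionary and the `add_persona_names` helper are illustrative names that do not appear elsewhere in the notebook, and the demo frame uses made-up client IDs.

```python
import pandas as pd

# Persona names copied from the cluster summaries above (illustrative mapping).
CLUSTER_NAMES = {
    0: "Contented Loyalists",
    1: "Disappointed Critics of Popular Goods",
    2: "Engaged Brand Explorers",
    3: "Online Channel Loyalists",
    4: "Limited Edition & Sephora Exclusive Hunters (Sale Savvy)",
    5: "Budget Fans of Viral Hits (Sephora Exclusives)",
    6: "Devoted Sephora Exclusive Fans",
    7: "New Product Aficionados (High Quality Focus)",
    8: "High-Spending, Prolific Reviewing Brand Connoisseurs",
    9: "Premium Sale Seekers (Viral Products)",
    10: "Quality-Focused Budget Buyers",
    11: "Luxury Niche Loyalists",
    12: "Sephora Exclusive Fans (Less Critical of Product Ratings)",
    13: "Super Reviewers / High-Spending Trendsetters",
}


def add_persona_names(segments: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `segments` with a `cluster_name` column added."""
    out = segments.copy()
    # Cluster IDs missing from the mapping become NaN instead of raising.
    out["cluster_name"] = out["cluster_id"].map(CLUSTER_NAMES)
    return out


# Made-up client IDs, purely for demonstration.
demo_segments = pd.DataFrame({"client_id": ["c1", "c2", "c3"], "cluster_id": [0, 10, 13]})
named_segments = add_persona_names(demo_segments)
print(named_segments)
```

Because the names are tied to cluster IDs from one particular K-Means run, the mapping would need to be reviewed whenever the clustering is re-fitted with different features or a different number of clusters.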

## Cell 10: Conclusion & Next Steps (Refined Segmentation)

This notebook has performed customer segmentation using data from `data/processed/`, with an emphasis on careful feature selection and Silhouette analysis for determining the number of clusters. The output is `customer_segments_refined.csv`.

**Interpreting Your Clusters:**
Carefully analyze the `Cluster Summary` table (Cell 8). The key is to identify how the average feature values differ across clusters, defining distinct customer personas. Consider not just the means, but also the relative importance of features that differentiate the groups.
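
One way to see which features differentiate the groups is to express each cluster mean as a deviation from the overall mean, in units of the overall standard deviation. The following is a minimal sketch, assuming a per-customer DataFrame with a `cluster_id` column similar to `customer_df_final_with_clusters`. The helper name `relative_cluster_profile` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def relative_cluster_profile(df: pd.DataFrame, cluster_col: str = "cluster_id") -> pd.DataFrame:
    """Cluster means expressed as z-scores relative to all customers."""
    numeric_cols = df.select_dtypes(include=np.number).columns
    features = numeric_cols.drop(cluster_col, errors="ignore").tolist()
    overall_mean = df[features].mean()
    # Constant features have zero spread; mark them NaN instead of dividing by zero.
    overall_std = df[features].std(ddof=0).replace(0, np.nan)
    cluster_means = df.groupby(cluster_col)[features].mean()
    return (cluster_means - overall_mean) / overall_std


# Toy data for demonstration only (not the notebook's customers).
toy = pd.DataFrame({
    "cluster_id": [0, 0, 1, 1],
    "avg_price_reviewed": [10.0, 20.0, 50.0, 60.0],
    "avg_rating_given": [4.0, 4.0, 4.0, 4.0],
})
profile = relative_cluster_profile(toy)
print(profile.round(2))
```

Values far from zero indicate features on which a cluster departs most from the average customer. Values near zero, or NaN for constant features, contribute little to telling the clusters apart.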

**Potential Next Steps & Improvements:**
* **Iterative Feature Engineering & Selection:** The selection in Cell 4 is crucial. You might iterate on this: try different sets of features, use techniques like PCA for dimensionality reduction, or look at feature importance from predictive models (if applicable to a related task) to guide selection.
* **Hyperparameter Tuning for K-Means:** `random_state` only makes results reproducible; other `KMeans` parameters, such as `n_init`, `init`, and `max_iter`, could also be explored.
* **Alternative Clustering Algorithms:** If K-Means results are not satisfactory or its assumptions don't hold, explore DBSCAN, Agglomerative Clustering, etc.
* **Business Validation:** Ultimately, the "best" segmentation is the one that is most actionable and provides meaningful insights for your business goals.
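
Several of the steps above can be tried with a few lines of scikit-learn. The following is a minimal sketch that compares K-Means settings and Agglomerative Clustering on PCA-reduced features using the Silhouette score. Synthetic blobs from `make_blobs` stand in for the notebook's customer features, and the choices `n_components=2` and `n_clusters=4` are illustrative assumptions rather than recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer feature matrix (not the notebook's data).
X, _ = make_blobs(n_samples=600, n_features=6, centers=4, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Optional dimensionality reduction before clustering.
X_reduced = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

# Candidate models: two K-Means settings and one alternative algorithm.
candidates = {
    "kmeans_n_init_10": KMeans(n_clusters=4, n_init=10, random_state=42),
    "kmeans_n_init_1": KMeans(n_clusters=4, n_init=1, random_state=42),
    "agglomerative_ward": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X_reduced)
    scores[name] = silhouette_score(X_reduced, labels)

# Print candidates from highest to lowest Silhouette score.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: silhouette = {score:.3f}")
```

On a data set with hundreds of thousands of customers, the pairwise distances behind `silhouette_score` and `AgglomerativeClustering` become expensive. Scoring on a subsample via the `sample_size` argument of `silhouette_score`, or fitting the alternative algorithm on a random sample of customers, is one way to keep the comparison tractable.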