{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Shapley Value Analysis for Customer Satisfaction\n",
    "\n",
    "This notebook implements the Shapley Value analysis to identify key drivers of customer satisfaction (or dissatisfaction) based on the methodology described in the paper \"Customer Satisfaction Analysis: Identification of Key Drivers\" by Conklin, Powaga, and Lipovetsky.\n",
    "\n",
    "**Steps:**\n",
    "1. **Setup & Configuration**: Define file paths, column names, and thresholds.\n",
    "2. **Data Preprocessing**: Load and binarize data.\n",
    "3. **Shapley Value Calculation**: Compute Shapley values for each feature.\n",
    "4. **Key Driver Identification**: Determine the optimal set of key dissatisfiers.\n",
    "5. **Results**: View the Shapley values and the identified key drivers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import itertools\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Define Helper Functions\n",
    "\n",
    "These functions implement the core logic: binarizing the data, computing coalition values and Shapley values, and selecting the key drivers."
   ]
  },
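  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The quantities computed by the helper functions below can be sketched as follows. The notation is an assumption made here for illustration and may differ from the paper's. Let $N$ be the set of $n$ features, and say a respondent *fails on* a coalition $M \\subseteq N$ when at least one feature in $M$ is flagged as failed. Write $d(M)$ and $s(M)$ for the numbers of dissatisfied and not-dissatisfied respondents who fail on $M$, and $n_D$ and $n_S$ for the totals in each group.\n",
    "\n",
    "$$\\text{Reach}(M) = \\frac{d(M)}{n_D}, \\qquad \\text{Noise}(M) = \\frac{s(M)}{n_S}, \\qquad v(M) = \\text{Reach}(M) - \\text{Noise}(M), \\qquad v(\\emptyset) = 0$$\n",
    "\n",
    "The Shapley value of feature $k$ is its marginal contribution averaged over coalitions $M$ of the other features, with $m$ the size of $M$:\n",
    "\n",
    "$$\\phi_k = \\sum_{M \\subseteq N \\setminus \\lbrace k \\rbrace} \\frac{m!\\,(n - m - 1)!}{n!} \\left[ v(M \\cup \\lbrace k \\rbrace) - v(M) \\right]$$\n",
    "\n",
    "Because $v(\\emptyset) = 0$, the values sum to the value of the full feature set, $\\sum_k \\phi_k = v(N)$, which gives a quick sanity check on the output."
   ]
  },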
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocess_data(filepath, overall_satisfaction_col, feature_cols, \n",
    "                    dissatisfaction_threshold, failure_threshold_map,\n",
    "                    overall_score_higher_is_better=True, \n",
    "                    feature_score_higher_is_better=True):\n",
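    "    \"\"\"\n",
    "    Loads the CSV at `filepath` and adds binary indicator columns.\n",
    "\n",
    "    Adds `<overall>_Dissatisfied` and one `<feature>_Failed` column per feature.\n",
    "    Thresholds are inclusive. Each `failure_threshold_map` value is either a\n",
    "    threshold or a `(threshold, higher_is_better)` tuple that overrides the default.\n",
    "    Returns the DataFrame, the overall column name, and the list of feature column names.\n",
    "    \"\"\"\n",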
    "    df = pd.read_csv(filepath)\n",
    "\n",
    "    # Binarize overall satisfaction\n",
    "    binarized_overall_col = f\"{overall_satisfaction_col}_Dissatisfied\"\n",
    "    if overall_score_higher_is_better:\n",
    "        df[binarized_overall_col] = (df[overall_satisfaction_col] <= dissatisfaction_threshold).astype(int)\n",
    "    else:\n",
    "        df[binarized_overall_col] = (df[overall_satisfaction_col] >= dissatisfaction_threshold).astype(int)\n",
    "\n",
    "    # Binarize feature columns\n",
    "    binarized_feature_cols = []\n",
    "    for col in feature_cols:\n",
    "        bin_col_name = f\"{col}_Failed\"\n",
    "        threshold_info = failure_threshold_map.get(col)\n",
    "        \n",
    "        current_feature_higher_is_better = feature_score_higher_is_better # Default\n",
    "        current_failure_threshold = None\n",
    "\n",
    "        if isinstance(threshold_info, tuple): # (threshold, specific_higher_is_better)\n",
    "            current_failure_threshold = threshold_info[0]\n",
    "            current_feature_higher_is_better = threshold_info[1]\n",
    "        elif threshold_info is not None: # Just threshold, use global feature_score_higher_is_better\n",
    "            current_failure_threshold = threshold_info\n",
    "        else:\n",
    "            raise ValueError(f\"Failure threshold not defined for feature: {col}\")\n",
    "\n",
    "        if current_feature_higher_is_better:\n",
    "            df[bin_col_name] = (df[col] <= current_failure_threshold).astype(int)\n",
    "        else:\n",
    "            df[bin_col_name] = (df[col] >= current_failure_threshold).astype(int)\n",
    "        binarized_feature_cols.append(bin_col_name)\n",
    "        \n",
    "    return df, binarized_overall_col, binarized_feature_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_value_of_coalition(df, coalition_bin_feature_cols, binarized_overall_col):\n",
    "    \"\"\"\n",
    "    Calculates the value v(M) = Reach_M - Noise_M for a given coalition M.\n",
    "    M is represented by a list of binarized feature column names.\n",
    "    \"\"\"\n",
    "    if not coalition_bin_feature_cols: # Empty coalition\n",
    "        return 0.0\n",
    "\n",
    "    # Mask for rows where at least one feature in the coalition has \"Failed\" (is 1)\n",
    "    failed_on_any_in_coalition_mask = df[coalition_bin_feature_cols].any(axis=1)\n",
    "    \n",
    "    df_dissatisfied = df[df[binarized_overall_col] == 1]\n",
    "    df_not_dissatisfied = df[df[binarized_overall_col] == 0]\n",
    "\n",
    "    num_total_dissatisfied = len(df_dissatisfied)\n",
    "    num_total_not_dissatisfied = len(df_not_dissatisfied)\n",
    "\n",
    "    if num_total_dissatisfied == 0 and num_total_not_dissatisfied == 0:\n",
    "        print(\"Warning: No data points found for calculating coalition value.\")\n",
    "        return 0.0\n",
    "    \n",
    "    if num_total_dissatisfied > 0:\n",
    "        num_failed_and_dissatisfied = df_dissatisfied[failed_on_any_in_coalition_mask[df_dissatisfied.index]].shape[0]\n",
    "        reach_M = num_failed_and_dissatisfied / num_total_dissatisfied\n",
    "    else:\n",
    "        reach_M = 0.0 \n",
    "\n",
    "    if num_total_not_dissatisfied > 0:\n",
    "        num_failed_and_not_dissatisfied = df_not_dissatisfied[failed_on_any_in_coalition_mask[df_not_dissatisfied.index]].shape[0]\n",
    "        noise_M = num_failed_and_not_dissatisfied / num_total_not_dissatisfied\n",
    "    else:\n",
    "        noise_M = 0.0\n",
    "\n",
    "    value_M = reach_M - noise_M\n",
    "    return value_M"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_shapley_values(df, binarized_feature_cols, binarized_overall_col):\n",
    "    \"\"\"\n",
    "    Calculates Shapley values for each feature.\n",
    "    \"\"\"\n",
    "    num_features = len(binarized_feature_cols)\n",
    "    feature_indices = list(range(num_features))\n",
    "    shapley_values = np.zeros(num_features)\n",
    "\n",
    "    for i in feature_indices: \n",
    "        feature_k_col_name = binarized_feature_cols[i]\n",
    "        remaining_feature_indices = [idx for idx in feature_indices if idx != i]\n",
    "        \n",
    "        for m_size in range(num_features):\n",
    "            # Weight for coalitions of size m_size: m! * (n - m - 1)! / n!\n",
    "            gamma_weight = (math.factorial(m_size) * math.factorial(num_features - m_size - 1)) / math.factorial(num_features)\n",
    "\n",
    "            for M_indices_tuple in itertools.combinations(remaining_feature_indices, m_size):\n",
    "                M_cols = [binarized_feature_cols[idx] for idx in M_indices_tuple]\n",
    "                M_union_k_cols = M_cols + [feature_k_col_name]\n",
    "\n",
    "                v_M_union_k = get_value_of_coalition(df, M_union_k_cols, binarized_overall_col)\n",
    "                v_M = get_value_of_coalition(df, M_cols, binarized_overall_col)\n",
    "                \n",
    "                shapley_values[i] += gamma_weight * (v_M_union_k - v_M)\n",
    "                \n",
    "    return {binarized_feature_cols[i]: shapley_values[i] for i in range(num_features)}"
   ]
  },
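  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough count of the work done by `calculate_shapley_values`, assuming the loops above run as written: for each of the $n$ features, the inner loops visit every subset of the other $n - 1$ features, so the number of marginal-contribution terms is\n",
    "\n",
    "$$n \\sum_{m=0}^{n-1} \\binom{n-1}{m} = n \\cdot 2^{n-1}$$\n",
    "\n",
    "and each term calls `get_value_of_coalition` twice. For $n = 4$ this is 32 terms (64 calls); for $n = 12$ it is 24,576 terms (49,152 calls), each of which scans the full DataFrame."
   ]
  },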
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reach_noise_success_for_set(df, set_bin_feature_cols, binarized_overall_col):\n",
    "    \"\"\"Calculates Reach, Noise, and Success for a given set of (binarized) features.\"\"\"\n",
    "    if not set_bin_feature_cols:\n",
    "        return 0.0, 0.0, 0.0\n",
    "\n",
    "    failed_on_any_in_set_mask = df[set_bin_feature_cols].any(axis=1)\n",
    "    \n",
    "    df_dissatisfied = df[df[binarized_overall_col] == 1]\n",
    "    df_not_dissatisfied = df[df[binarized_overall_col] == 0]\n",
    "\n",
    "    num_total_dissatisfied = len(df_dissatisfied)\n",
    "    num_total_not_dissatisfied = len(df_not_dissatisfied)\n",
    "\n",
    "    if num_total_dissatisfied == 0 and num_total_not_dissatisfied == 0:\n",
    "        return 0.0, 0.0, 0.0\n",
    "\n",
    "    if num_total_dissatisfied > 0:\n",
    "        num_failed_and_dissatisfied = df_dissatisfied[failed_on_any_in_set_mask[df_dissatisfied.index]].shape[0]\n",
    "        reach = num_failed_and_dissatisfied / num_total_dissatisfied\n",
    "    else:\n",
    "        reach = 0.0\n",
    "\n",
    "    if num_total_not_dissatisfied > 0:\n",
    "        num_failed_and_not_dissatisfied = df_not_dissatisfied[failed_on_any_in_set_mask[df_not_dissatisfied.index]].shape[0]\n",
    "        noise = num_failed_and_not_dissatisfied / num_total_not_dissatisfied\n",
    "    else:\n",
    "        noise = 0.0\n",
    "        \n",
    "    success = reach - noise\n",
    "    return reach, noise, success"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def determine_key_drivers(df, binarized_feature_cols, binarized_overall_col, shapley_values_dict):\n",
    "    \"\"\"\n",
    "    Determines the key dissatisfiers: adds features in descending Shapley order\n",
    "    and stops at the first step where Success (Reach - Noise) decreases.\n",
    "    \"\"\"\n",
    "    sorted_features = sorted(shapley_values_dict.items(), key=lambda item: item[1], reverse=True)\n",
    "    \n",
    "    print(\"\\n--- Determining Key Dissatisfiers (Cumulative Analysis) ---\")\n",
    "    print(f\"{'Step':<5} {'Added Feature':<30} {'Cumulative Set Size':<20} {'Reach':<10} {'Noise':<10} {'Success':<10}\")\n",
    "    \n",
    "    cumulative_set_cols = []\n",
    "    optimal_set_cols = []\n",
    "    max_success_achieved = -float('inf')\n",
    "    results_log = []\n",
    "\n",
    "    for i, (feature_col, sv) in enumerate(sorted_features):\n",
    "        current_cumulative_cols_for_step = cumulative_set_cols + [feature_col]\n",
    "        reach, noise, success = get_reach_noise_success_for_set(df, current_cumulative_cols_for_step, binarized_overall_col)\n",
    "        \n",
    "        results_log.append({\n",
    "            'step': i + 1,\n",
    "            'added_feature': feature_col.replace('_Failed', ''),\n",
    "            'set_size': len(current_cumulative_cols_for_step),\n",
    "            'reach': reach,\n",
    "            'noise': noise,\n",
    "            'success': success\n",
    "        })\n",
    "        print(f\"{i+1:<5} {feature_col.replace('_Failed', ''):<30} {len(current_cumulative_cols_for_step):<20} {reach:<10.3f} {noise:<10.3f} {success:<10.3f}\")\n",
    "\n",
    "        if success >= max_success_achieved:\n",
    "            max_success_achieved = success\n",
    "            optimal_set_cols = list(current_cumulative_cols_for_step)\n",
    "            cumulative_set_cols.append(feature_col)\n",
    "        else:\n",
    "            print(f\"Success decreased. Optimal set identified before adding '{feature_col.replace('_Failed', '')}'.\")\n",
    "            break \n",
    "            \n",
    "    return [col.replace('_Failed', '') for col in optimal_set_cols], results_log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configuration and Execution\n",
    "\n",
    "Modify the parameters in the next cell to match your dataset and analysis requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Configuration ---\n",
    "filepath = 'sample_customer_data.csv' # Replace with your actual file path\n",
    "\n",
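    "overall_satisfaction_col = 'OverallSatisfaction'\n",
    "dissatisfaction_threshold = 5  # Inclusive: scores <= 5 count as dissatisfied\n",
    "overall_score_higher_is_better = True\n",
    "\n",
    "feature_cols = ['FeatureA', 'FeatureB', 'FeatureC', 'FeatureD']\n",
    "# Each value is a threshold or a (threshold, higher_is_better) tuple; thresholds are inclusive.\n",
    "failure_threshold_map = {\n",
    "    'FeatureA': 3,           # Fails when score <= 3\n",
    "    'FeatureB': 3,\n",
    "    'FeatureC': (7, False),  # Lower is better here: fails when score >= 7\n",
    "    'FeatureD': 2\n",
    "}\n",
    "feature_score_higher_is_better = True  # Default for features given without a tuple\n",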
    "\n",
    "# --- Create a dummy sample_customer_data.csv for testing (Optional) ---\n",
    "# Comment this block out if you have your own CSV file: it overwrites whatever is at `filepath`.\n",
    "data = {\n",
    "    'OverallSatisfaction': [2, 8, 5, 10, 3, 6, 1, 9, 4, 7, 2, 5, 8, 3, 6, 10, 1, 4, 9, 7],\n",
    "    'FeatureA':            [1, 5, 3,  4, 2, 5, 1, 4, 2, 3, 1, 3, 5, 2, 4, 5, 1, 2, 5, 3],\n",
    "    'FeatureB':            [2, 4, 2,  5, 1, 3, 2, 5, 1, 4, 2, 2, 4, 1, 3, 5, 2, 1, 5, 4],\n",
    "    'FeatureC':            [8, 3, 6,  2, 9, 4, 10,1, 7, 5, 8, 6, 2, 9, 4, 1, 10,7, 3, 5],\n",
    "    'FeatureD':            [1, 3, 1,  3, 1, 2, 1, 3, 1, 2, 1, 1, 3, 1, 2, 3, 1, 1, 3, 2] \n",
    "}\n",
    "df_sample = pd.DataFrame(data)\n",
    "df_sample.to_csv(filepath, index=False)\n",
    "print(f\"Created/Replaced dummy data at {filepath}\")\n",
    "# --- End of dummy data creation ---\n",
    "\n",
    "# 1. Preprocess data\n",
    "df_processed, bin_overall_col, bin_feature_cols = preprocess_data(\n",
    "    filepath,\n",
    "    overall_satisfaction_col,\n",
    "    feature_cols,\n",
    "    dissatisfaction_threshold,\n",
    "    failure_threshold_map,\n",
    "    overall_score_higher_is_better,\n",
    "    feature_score_higher_is_better\n",
    ")\n",
    "print(\"\\n--- Processed Data Head ---\")\n",
    "print(df_processed[[bin_overall_col] + bin_feature_cols].head())\n",
    "\n",
    "if df_processed[bin_overall_col].nunique() < 2:\n",
    "    print(f\"\\nWarning: The binarized overall satisfaction column '{bin_overall_col}' has only one unique value.\")\n",
    "    print(\"This will lead to Reach or Noise (or both) being undefined or zero for all coalitions.\")\n",
    "    print(\"Please check your dissatisfaction_threshold and data distribution.\")\n",
    "    if df_processed[df_processed[bin_overall_col] == 1].empty:\n",
    "        print(\"No customers are marked as 'Dissatisfied'.\")\n",
    "    if df_processed[df_processed[bin_overall_col] == 0].empty:\n",
    "        print(\"No customers are marked as 'Not Dissatisfied'.\")\n",
    "else:\n",
    "    # 2. Calculate Shapley values\n",
    "    print(\"\\nCalculating Shapley values... (this may take time for many features)\")\n",
    "    shapley_values = calculate_shapley_values(df_processed, bin_feature_cols, bin_overall_col)\n",
    "    \n",
    "    print(\"\\n--- Shapley Values ---\")\n",
    "    for feature, sv in sorted(shapley_values.items(), key=lambda item: item[1], reverse=True):\n",
    "        print(f\"{feature.replace('_Failed', ''):<30}: {sv:.4f}\")\n",
    "\n",
    "    # 3. Determine Key Dissatisfiers\n",
    "    key_dissatisfiers, full_log = determine_key_drivers(df_processed, bin_feature_cols, bin_overall_col, shapley_values)\n",
    "    \n",
    "    print(\"\\n--- Final Set of Key Dissatisfiers ---\")\n",
    "    if key_dissatisfiers:\n",
    "        for kd in key_dissatisfiers:\n",
    "            print(f\"- {kd}\")\n",
    "    else:\n",
    "        print(\"No key dissatisfiers identified based on the criteria (or no features provided).\")"
   ]
  },
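  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hand check of the coalition value, assuming the dummy data and thresholds from the configuration cell above are used unchanged: 10 of the 20 respondents have `OverallSatisfaction` at or below 5, so 10 are dissatisfied and 10 are not. For the single-feature coalition containing only `FeatureA` (which fails when the score is at or below 3), all 10 dissatisfied respondents fail and 2 of the 10 others fail:\n",
    "\n",
    "$$\\text{Reach} = \\frac{10}{10} = 1.0, \\qquad \\text{Noise} = \\frac{2}{10} = 0.2, \\qquad v = 1.0 - 0.2 = 0.8$$\n",
    "\n",
    "Calling `get_value_of_coalition(df_processed, ['FeatureA_Failed'], bin_overall_col)` should return the same value, up to floating-point rounding."
   ]
  },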
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. How to Use and Interpret\n",
    "\n",
    "1.  **Modify Configuration**: \n",
    "    * Update `filepath` to point to your CSV data file.\n",
    "    * Set `overall_satisfaction_col` to the name of your overall satisfaction score column.\n",
    "    * Adjust `dissatisfaction_threshold` and `overall_score_higher_is_better` based on your overall satisfaction scale.\n",
    "    * List your raw feature column names in `feature_cols`.\n",
    "    * Carefully define `failure_threshold_map`. For each feature, give the threshold that denotes a \"failure\"; thresholds are inclusive. To override the default direction for one feature, pass a `(threshold, higher_is_better)` tuple, e.g., `{'PriceComplaints': (1, False)}` marks 1 or more complaints as a failure because fewer complaints are better.\n",
    "    * Set `feature_score_higher_is_better` as the default for your feature scales.\n",
    "2.  **Run All Cells**: Execute the cells in order (e.g., by clicking \"Run All\" in your Jupyter environment).\n",
    "3.  **Interpret Results**:\n",
    "    * **Processed Data Head**: Shows the first few rows of your data after binarization. Check if `_Dissatisfied` and `_Failed` columns look correct.\n",
    "    * **Shapley Values**: Lists each feature and its calculated Shapley value. Higher values indicate a greater contribution to the overall \"Success\" metric (Reach - Noise), meaning the feature is more important in distinguishing dissatisfied customers.\n",
    "    * **Determining Key Dissatisfiers (Cumulative Analysis)**: This table shows the step-by-step process of adding features (ordered by Shapley value) to a potential set of key dissatisfiers. It tracks the cumulative Reach, Noise, and Success of the set at each step.\n",
    "    * **Final Set of Key Dissatisfiers**: This is the primary output. It is the cumulative set reached just before `Success = Reach - Noise` first decreased as features were added in Shapley order. The search is greedy, so it is not guaranteed to find the best subset overall. These are the features your analysis suggests are the most critical drivers of dissatisfaction.\n",
    "\n",
    "**Important Considerations:**\n",
    "* **Computational Cost**: Calculating exact Shapley values by enumerating coalitions is computationally intensive: the number of coalitions grows exponentially with the number of features. For more than ~10-12 features, this script might become very slow. The original paper suggests sampling for larger datasets.\n",
    "* **Data Quality**: The quality of your input data and the appropriateness of your thresholds significantly impact the results.\n",
    "* **Definition of \"Failure\" and \"Dissatisfaction\"**: Carefully consider how you define these for your specific context. The thresholds directly influence the binarization and subsequent calculations."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
