## **Contextual Bandits vs. Full RL:**

| Aspect                    | **Contextual Bandits**                           | **Full RL (e.g., Q-learning, DQN)**             |
|---------------------------|--------------------------------------------------|--------------------------------------------------|
| **Decision Focus**        | Single-step decisions (one interaction at a time) | Multi-step decisions (long-term sequences)      |
| **State Transitions**     | No explicit state transitions                     | Explicit state transitions and environment states|
| **Complexity**            | Simpler & quicker to train                        | More complex; takes longer to train             |
| **Use-case suitability**  | Optimal when each decision is independent         | Optimal for long-term, sequential decisions     |

---

## **Is a Contextual Bandit suitable for churn intervention?**

In our scenario (deciding which customers to target with promotions):

- Each decision (promotion or no promotion) is mostly independent of the others.
- Each decision takes the customer's current state into account (context: churn probability, CLV, etc.).
- The goal is to optimize the immediate reward (send promotion → customer stays → immediate reward).

These properties match the single-step, independent-decision column of the table above, so a contextual bandit is a reasonable fit and a full RL formulation is not required.

---

## **Our Contextual Bandit setup:**

- **Context (State)**: Customer features (CLV, purchase frequency, return ratio, etc.)
- **Actions**: 
  - `0`: **No action**
  - `1`: **Send promotion**
- **Reward**:
  - **Positive reward** when the customer remains active after the intervention (economic gain = CLV − intervention cost).
  - **Negative reward** if the intervention fails (loss = intervention cost).

---

## **Problem Restatement:**

- **Context** (features of each customer):
  - Churn Probability (from predictive model)
  - Customer Lifetime Value (**CLV**)
  - Promotion cost (fixed per customer)

- **Actions**:
  - `0`: **No action** (no cost, but potential churn)
  - `1`: **Send promotion** (incurs the promotion cost, but potential retention)

- **Reward**, defined for each of the four possible outcomes:
  - If **customer retained** after promotion: reward = **CLV − promotion cost**
  - If **customer churns** after promotion: reward = **− promotion cost**
  - If **no action taken** and customer retained: reward = **CLV**
  - If **no action taken** and customer churns: reward = **0**
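
The four reward cases above collapse into one rule: the retained value (CLV if the customer stays, otherwise 0) minus the promotion cost whenever action `1` is chosen. A minimal sketch of that rule follows; the function name `compute_reward` and the example values (CLV = 200, cost = 50) are assumptions made for illustration and do not come from the notebook or the dataset.

```python
def compute_reward(action: int, customer_stays: bool, clv: float, promotion_cost: float) -> float:
    """Return the economic reward for one customer under the four cases listed above."""
    # Value is realized only if the customer is retained.
    retained_value = clv if customer_stays else 0.0
    # The promotion cost is paid whenever action 1 is chosen, whatever the outcome.
    cost = promotion_cost if action == 1 else 0.0
    return retained_value - cost


# Illustrative values only (hypothetical, not taken from the dataset).
for action in (0, 1):
    for stays in (True, False):
        reward = compute_reward(action, stays, clv=200.0, promotion_cost=50.0)
        print(f"action={action} stays={stays} reward={reward}")
```

Written this way, it is easy to see that for the same retention outcome the promotion reward is always the no-action reward minus the promotion cost.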

---

## **Thompson Sampling Implementation (Contextual Bandit)**:

- **Initialization**: 
  - Initializes the success and failure counts for each action (0 and 1) to 1, which corresponds to a uniform Beta(1, 1) prior.
- **Action selection** (`select_action`):
  - Samples from a Beta distribution for each action.
  - Chooses the action with the highest sampled probability (Thompson Sampling).
- **Reward Update** (`update` method):
  - Updates the counts based on the received reward (success if positive, failure if zero or negative).
- **Simulation (`run_bandit` function)**:
  - Iterates through each customer and selects an action.
  - Calculates rewards based on the defined economics (CLV, churn probability, promotion cost).

Note that `select_action` takes no customer features as input. As written, the class is a two-armed Thompson Sampling bandit, and the context (churn probability, CLV) enters only through the simulated reward.
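
The sampling and update steps listed above follow the usual Beta-Bernoulli form of Thompson Sampling. As a sketch, write $\alpha_a$ and $\beta_a$ for the success and failure counts of action $a$, and $r_t$ for the reward received at step $t$ (these symbols are introduced here for illustration and do not appear in the notebook):

$$
\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a), \qquad a_t = \arg\max_{a \in \{0, 1\}} \theta_a
$$

$$
(\alpha_{a_t}, \beta_{a_t}) \leftarrow
\begin{cases}
(\alpha_{a_t} + 1,\ \beta_{a_t}) & \text{if } r_t > 0 \\
(\alpha_{a_t},\ \beta_{a_t} + 1) & \text{if } r_t \le 0
\end{cases}
$$

Both counts start at 1, so each action begins with a uniform prior over its success probability.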

```python
import numpy as np
import pandas as pd

class ThompsonSamplingBandit:
    def __init__(self):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per action
        self.success_counts = np.ones(2)  # successes for each action
        self.failure_counts = np.ones(2)  # failures for each action

    def select_action(self):
        # Thompson Sampling: draw one sample per action and pick the largest.
        # No customer features are passed in, so the choice does not depend on context.
        sampled_probs = np.random.beta(self.success_counts, self.failure_counts)
        return np.argmax(sampled_probs)

    def update(self, action, reward, threshold=0):
        # Count a success if the reward exceeds the threshold, otherwise a failure
        if reward > threshold:
            self.success_counts[action] += 1
        else:
            self.failure_counts[action] += 1

# Run Thompson Sampling over the customer table for churn intervention
def run_bandit(data, cost_promotion):
    bandit = ThompsonSamplingBandit()
    history = []

    for _, row in data.iterrows():
        churn_prob = row['Churn_Prob']
        clv = row['CLV']

        action = bandit.select_action()

        # Simulate the customer's response.
        # The draw uses churn_prob only, so it is the same whichever action was chosen.
        customer_stays = np.random.rand() > churn_prob

        if action == 1:  # Send promotion
            reward = (clv - cost_promotion) if customer_stays else -cost_promotion
        else:  # No action
            reward = clv if customer_stays else 0

        # Update the bandit with the observed reward
        bandit.update(action, reward)

        history.append({
            'Churn_Prob': churn_prob,
            'CLV': clv,
            'Action': 'Promotion' if action == 1 else 'No Action',
            'Customer_Stays': customer_stays,
            'Reward': reward
        })

    return pd.DataFrame(history)
```
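
Because `select_action` above takes no customer features, one simple way to make the choice depend on context is to keep separate Beta counts per context bucket. The following is a minimal sketch of that idea and not the notebook's implementation: the class name `BucketedThompsonBandit`, the three-bucket split on churn probability, and the `seed` parameter are all assumptions for illustration. It uses the standard library's `random.betavariate` instead of NumPy so that it runs on its own.

```python
import random


class BucketedThompsonBandit:
    """Thompson Sampling with separate Beta counts per context bucket.

    Hypothetical illustration: the context is reduced to a churn-probability
    bucket, which is an assumption and not part of the original notebook.
    """

    def __init__(self, n_buckets=3, n_actions=2, seed=None):
        self.n_buckets = n_buckets
        self.rng = random.Random(seed)
        # counts[bucket][action] = [successes, failures], starting from Beta(1, 1)
        self.counts = [[[1, 1] for _ in range(n_actions)] for _ in range(n_buckets)]

    def bucket(self, churn_prob):
        # Map a probability in [0, 1] to a bucket index in [0, n_buckets - 1].
        return min(int(churn_prob * self.n_buckets), self.n_buckets - 1)

    def select_action(self, churn_prob):
        # Sample one success probability per action within this customer's bucket.
        b = self.bucket(churn_prob)
        samples = [self.rng.betavariate(s, f) for s, f in self.counts[b]]
        return samples.index(max(samples))

    def update(self, churn_prob, action, reward, threshold=0):
        b = self.bucket(churn_prob)
        # Index 0 holds successes, index 1 holds failures.
        self.counts[b][action][0 if reward > threshold else 1] += 1


bandit = BucketedThompsonBandit(n_buckets=3, seed=0)
action = bandit.select_action(churn_prob=0.85)
bandit.update(churn_prob=0.85, action=action, reward=-50)
print(bandit.counts[bandit.bucket(0.85)])
```

Bucketing is the crudest form of context; a model-based approach (for example, a per-action regression on the features) would be the next step, at the cost of more code.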

```python
data = pd.read_csv('processed_customer_churn_data.csv')

# For demonstration, assume churn probability is already predicted.
# Otherwise, use your trained model to add this column.
data['Churn_Prob'] = np.random.uniform(0.2, 0.9, size=len(data))

cost_promotion = 50  # promotion cost per customer

results = run_bandit(data, cost_promotion)
print(results.head())

print("\nSummary explicitly stated:")
print(results.groupby('Action')['Reward'].mean())
```

```text
   Churn_Prob         CLV     Action  Customer_Stays      Reward
0    0.694373  386.961240  Promotion            True  336.961240
1    0.868660  100.981938  Promotion           False  -50.000000
2    0.417101  164.802792  Promotion            True  114.802792
3    0.829524  180.403407  Promotion           False  -50.000000
4    0.466314  269.700620  Promotion            True  219.700620

Summary explicitly stated:
Action
No Action    688.762266
Promotion    867.086631
Name: Reward, dtype: float64
```


## **Interpretation of Your Results:**

### **First rows explained:**

- **Row 0:** The customer had a **69% churn probability**. A promotion was sent, the customer stayed, and the reward was positive (**CLV − promotion cost**).
- **Row 1:** The customer had a **high churn probability (87%)**, received a promotion, and churned anyway, resulting in a negative reward (**promotion cost lost**).
- The same logic applies to the other rows.

---

### **Summary:**

```text
Action
No Action    688.76 (average reward)
Promotion    867.09 (average reward)
```

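- In this run, customers who received a **promotion** produced a higher average reward (**867.09**) than customers who received no action (**688.76**).

**Conclusion:**  
On the surface, the bandit's results favor promotions for this dataset. This should be read with caution: in `run_bandit`, `customer_stays` is drawn from `Churn_Prob` alone, so for any individual customer the promotion reward equals the no-action reward minus the promotion cost. The gap between the two averages therefore reflects which customers happened to receive each action (for example, their CLV), not a retention effect of the promotion. Both actions also have the same success probability under the `reward > 0` rule, so the bandit has no real signal to learn from until the simulation models how a promotion changes churn.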

## **Recommended Next Steps:**

- **Refine the simulation** by:
  - Adjusting promotion costs.
  - Using real predicted churn probabilities from your Random Forest or SVM models (instead of randomly generated values).
  
- **Run the bandit again** on the refined data for more robust insights.

- **Implement exploration analysis** to visualize:
  - How the bandit's strategy evolves over time.
  - Economic performance across customer segments (high CLV vs. low CLV).
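
The two analyses in the last step can be sketched without any plotting. The example below works on a list of row dictionaries with the same keys as the `results` DataFrame (which could be obtained with `results.to_dict("records")`). The helper names, the CLV threshold of 200, and the four hand-written records are assumptions for illustration only and are not real results.

```python
from itertools import accumulate


def promotion_share_over_time(records):
    """Cumulative fraction of customers sent a promotion after each step."""
    flags = [1 if r["Action"] == "Promotion" else 0 for r in records]
    return [total / (i + 1) for i, total in enumerate(accumulate(flags))]


def mean_reward_by_segment(records, clv_threshold):
    """Mean reward for each (CLV segment, action) pair."""
    sums, counts = {}, {}
    for r in records:
        segment = "high CLV" if r["CLV"] >= clv_threshold else "low CLV"
        key = (segment, r["Action"])
        sums[key] = sums.get(key, 0.0) + r["Reward"]
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Tiny hand-written records for illustration only (hypothetical values).
records = [
    {"CLV": 300.0, "Action": "Promotion", "Reward": 250.0},
    {"CLV": 100.0, "Action": "No Action", "Reward": 0.0},
    {"CLV": 400.0, "Action": "Promotion", "Reward": -50.0},
    {"CLV": 120.0, "Action": "No Action", "Reward": 120.0},
]
print(promotion_share_over_time(records))
print(mean_reward_by_segment(records, clv_threshold=200.0))
```

Plotting the first list against the step index shows whether the bandit converges toward one action, and the second dictionary gives the per-segment comparison.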