In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Explanation of the dataset:

* num_summaries: The number of summaries created by the user.
* slack_integration: Whether the user integrated Slack (1 = Yes, 0 = No).
* days_to_first_summary: Days taken by the user to create their first summary.
* email_clicks: Number of email clicks the user interacted with.
* retained_week4: Whether the user was retained by week 4 (1 = Retained, 0 = Churned).

In [2]:
# Simplified sample user engagement mock dataset for 20 users

num_summaries = [15, 8, 12, 17, 2, 20, 2, 24, 26, 3, 1, 32, 34, 1, 38, 40, 42, 7, 46, 4]
slack_integration = [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
days_to_first_summary = [2, 1, 1, 1, 9, 2, 8, 1, 2, 8, 6, 1, 2, 5, 1, 1, 2, 18, 1, 8]
email_clicks = [3, 5, 7, 3, 0, 13, 1, 16, 3, 1, 0, 12, 3, 2, 5, 3, 32, 1, 9, 2]
retained_week4 = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0]

data = {
    'num_summaries': num_summaries,
    'slack_integration': slack_integration,
    'days_to_first_summary': days_to_first_summary,
    'email_clicks': email_clicks,
    'retained_week4': retained_week4
}

df = pd.DataFrame(data)

In [3]:
df.head()

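|   | num_summaries | slack_integration | days_to_first_summary | email_clicks | retained_week4 |
|---|---------------|-------------------|-----------------------|--------------|----------------|
| 0 | 15            | 0                 | 2                     | 3            | 1              |
| 1 | 8             | 1                 | 1                     | 5            | 1              |
| 2 | 12            | 1                 | 1                     | 7            | 1              |
| 3 | 17            | 1                 | 1                     | 3            | 1              |
| 4 | 2             | 0                 | 9                     | 0            | 0              |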


In [4]:
# Features to evaluate (X) and target variable of retained after week 4 (y)

X = df[['num_summaries', 'slack_integration', 'days_to_first_summary', 'email_clicks']]
y = df['retained_week4']

In [5]:
# Split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
# Train logistic regression model

model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [7]:
# Get feature importance (coefficients)

coefficients = model.coef_[0]
features = X.columns
for feature, coef in zip(features, coefficients):
    print(f"Feature: {feature}, Coefficient: {coef:.4f}")

Feature: num_summaries, Coefficient: 0.4940
Feature: slack_integration, Coefficient: 0.0304
Feature: days_to_first_summary, Coefficient: -0.3009
Feature: email_clicks, Coefficient: 0.2333


<h3>Features, Coefficients, and Their Interpretation</h3>

| **Feature**               | **Coefficient** | **Interpretation**                                                                 |
|---------------------------|-----------------|-----------------------------------------------------------------------------------|
| `num_summaries`            | **+0.4940**       | More summaries are associated with higher retention (the largest coefficient).    |
| `slack_integration`        | **+0.0304**       | Users who integrate Slack show a very slightly higher association with retention. |
| `days_to_first_summary`    | **-0.3009**       | Users who take longer to create their first summary are less likely to be retained. |
| `email_clicks`             | **+0.2333**       | Users who click more emails show a small-to-moderate positive association with retention. |

In [8]:
# Predict retention probabilities

y_pred_prob = model.predict_proba(X_test)[:, 1]  # Probability of retention
y_pred = model.predict(X_test)
y_pred

array([1, 0, 1, 1, 1, 1])

In [9]:
# Evaluate model performance (accuracy comes out at 1.0, but the test set holds only 6 users)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0
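A perfect score on six held-out users is weak evidence of how the model would generalize. One common way to get a steadier estimate from a small dataset is stratified k-fold cross-validation. The sketch below is an illustration under assumptions that are not part of the original notebook: five folds, a fixed shuffle seed, and a `StandardScaler` step added so the solver behaves well on unscaled features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same mock dataset as above, rebuilt so this cell runs on its own
df = pd.DataFrame({
    'num_summaries': [15, 8, 12, 17, 2, 20, 2, 24, 26, 3, 1, 32, 34, 1, 38, 40, 42, 7, 46, 4],
    'slack_integration': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    'days_to_first_summary': [2, 1, 1, 1, 9, 2, 8, 1, 2, 8, 6, 1, 2, 5, 1, 1, 2, 18, 1, 8],
    'email_clicks': [3, 5, 7, 3, 0, 13, 1, 16, 3, 1, 0, 12, 3, 2, 5, 3, 32, 1, 9, 2],
    'retained_week4': [1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0],
})
X = df.drop(columns='retained_week4')
y = df['retained_week4']

# Assumption: 5 stratified folds, so each fold keeps roughly the same retained/churned mix
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling happens inside the pipeline, so it is re-fit on each training fold only
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

With 20 rows each fold tests on only four users, so the per-fold scores will still be coarse; the spread across folds is the useful signal here.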


<h2>Interpretation of Logistic Regression Coefficients:</h2>

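* Magnitude: The magnitude of the coefficient indicates how strongly the feature is associated with the outcome. Unlike a correlation coefficient, it is not constrained to a range like -1 to 1, and it depends on the feature's units, so coefficients of features measured on different scales are not directly comparable.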
* Sign: The sign of the coefficient (+ or −) tells you the direction of the relationship:
  * Positive coefficient: The feature is correlated with an increase in the likelihood of the outcome (retention in this case).
  * Negative coefficient: The feature is correlated with a decrease in the likelihood of the outcome.
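Because coefficient magnitude depends on each feature's units, one way to make magnitudes roughly comparable is to standardize the features before fitting. The following is a minimal sketch of that idea, not the notebook's original model: the `StandardScaler` step and the name `scaled_model` are assumptions introduced here, while the data and the train/test split match the cells above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same mock dataset as above, rebuilt so this cell runs on its own
df = pd.DataFrame({
    'num_summaries': [15, 8, 12, 17, 2, 20, 2, 24, 26, 3, 1, 32, 34, 1, 38, 40, 42, 7, 46, 4],
    'slack_integration': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    'days_to_first_summary': [2, 1, 1, 1, 9, 2, 8, 1, 2, 8, 6, 1, 2, 5, 1, 1, 2, 18, 1, 8],
    'email_clicks': [3, 5, 7, 3, 0, 13, 1, 16, 3, 1, 0, 12, 3, 2, 5, 3, 32, 1, 9, 2],
    'retained_week4': [1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0],
})
X = df.drop(columns='retained_week4')
y = df['retained_week4']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Assumption: z-score each feature so one "unit" means one standard deviation
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
std_coefs = scaled_model.named_steps['logisticregression'].coef_[0]

# Print features from largest to smallest absolute standardized coefficient
for feature, coef in sorted(zip(X.columns, std_coefs), key=lambda fc: -abs(fc[1])):
    print(f"Feature: {feature}, Standardized coefficient: {coef:.4f}")
```

Each standardized coefficient is the change in log-odds per one standard deviation of the feature, so the ranking may differ from the raw coefficients shown earlier.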

It is important to note that a logistic regression shows correlation, not causality.

<b> Why?</b>

1. Correlation vs. Causation – Logistic regression can identify associations between independent variables (features) and the dependent variable (outcome), but it does not prove that changes in one variable cause changes in the outcome.

2. Omitted Variable Bias – If important confounding variables are missing from the model, the relationships detected may be spurious or misleading.

3. Reverse Causality – Logistic regression does not establish the direction of influence. For example, if you find a significant relationship between stress and heart disease, it's unclear whether stress causes heart disease or if heart disease leads to stress.

4. Experimental vs. Observational Data – Logistic regression is often used with observational data, where confounding factors and biases make it difficult to determine causality. Causal inference usually requires experimental designs (like randomized controlled trials) or advanced statistical techniques such as instrumental variables or causal inference frameworks (e.g., Directed Acyclic Graphs, propensity score matching).


<b>Example</b>

Feature: num_summaries, Coefficient: 0.4940

This means that for every one-unit increase in num_summaries, the log-odds of the user being retained increase by 0.4940 (that is, the probability of retention goes up), holding all other variables constant.
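A fixed change in log-odds does not translate into a fixed change in probability; the effect depends on where the user starts. The sketch below illustrates this with the `num_summaries` coefficient from the output above. The baseline log-odds values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

coef = 0.4940  # num_summaries coefficient from the model output above

# Hypothetical baseline log-odds for three different users
for baseline in (-2.0, 0.0, 2.0):
    before = sigmoid(baseline)
    after = sigmoid(baseline + coef)  # one additional summary
    print(f"baseline p={before:.3f} -> p={after:.3f} (change {after - before:+.3f})")
```

The same +0.4940 shift moves the probability most for a user near 50% and much less for users already close to 0% or 100%.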

<h3>Converting Coefficients to Odds Ratio:</h3>

To better interpret the impact, you can convert the log-odds to an odds ratio by exponentiating the coefficient:

$\text{Odds Ratio} = e^{\text{coefficient}}$

For the coefficient of 0.4940, the odds ratio would be:

$e^{0.4940} \approx 1.639$

<h3>Interpretation of Odds Ratio:</h3>

* An odds ratio of 1.639 means that for every additional summary a user creates, the odds of retention are multiplied by 1.639 (about 64% higher), holding the other features constant.
* If the odds ratio were less than 1 (e.g., 0.5), it would mean the feature is associated with lower odds of retention.
* An odds ratio of 1 means there is no effect on retention.
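The same conversion can be applied to every coefficient at once. This sketch hard-codes the four coefficients printed earlier so it runs on its own; in the notebook you could pass `model.coef_[0]` to `np.exp` instead.

```python
import numpy as np

# Coefficients copied from the model output above
coefficients = {
    'num_summaries': 0.4940,
    'slack_integration': 0.0304,
    'days_to_first_summary': -0.3009,
    'email_clicks': 0.2333,
}

# Exponentiate each log-odds coefficient to get an odds ratio
odds_ratios = {feature: float(np.exp(coef)) for feature, coef in coefficients.items()}

for feature, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100  # percent change in odds per one-unit increase
    print(f"{feature}: odds ratio {ratio:.3f} ({pct_change:+.1f}% odds per unit)")
```

Features with negative coefficients, such as `days_to_first_summary`, come out with odds ratios below 1.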


<b>Takeaway</b>: Once we have identified which features are most strongly associated with our target (retention at week 4 in this case; there are many other ways to define customer retention, and the right one largely depends on long-term goals), we can take actions aimed at influencing customer behavior, such as making onboarding as frictionless as possible so that users generate their first few summaries quickly. If a customer has not met those key metrics by a certain point, trigger an action such as an outreach email or a trial extension. These actions can then be validated with A/B testing over time to see which are most effective at reducing churn.

---
<h2>Quick Recap of A/B Testing</h2>

<h4>1. Define the experiment goals</h4>

Example goal: "Test if personalized onboarding emails or extended trials improve retention."

Hypothesis: Either intervention increases retention after 30 days.

<b>Note that this is actually an A/B/C test</b>

* Null Hypothesis ($H_{0}$): No difference in retention between groups.
* Alternative Hypothesis ($H_{1}$): At least one treatment group improves retention over the control.

<h4>2. Clarify what you define as success</h4>

* Statistical significance (e.g., p-value < 0.05)
   * A p-value is the probability of seeing a difference at least as large as the observed one if the null hypothesis were true; a small p-value means the result is unlikely to be due to random chance alone.
   * If p < 0.05, we reject the null hypothesis and conclude that at least one group's retention differs from the others.
   * A p-value gives a simple yes/no answer to "is there a difference between the groups?", which suits one-time, discrete decisions such as whether to launch the new onboarding process.



* Confidence Intervals (CIs)
   * A 95% CI gives a range of plausible values for the true effect.
   * For example, if Group B had an observed retention rate of 40% with a 95% CI of 35% to 45%, we are 95% confident that the true retention rate lies in that range. The width of the interval depends on the sample size.
   * If the confidence intervals of two groups overlap heavily, there may not be a significant difference, although overlap alone does not rule one out; a CI on the difference between groups is more direct.
   * Use CIs when you care about the magnitude of the effect, not just statistical significance.
   * CIs are better for long-term business decisions where you want a range estimate instead of a strict cutoff.

* <b>It is best practice to report both p-values and confidence intervals: p-values indicate whether a difference exists, and confidence intervals indicate how large it is.</b>

* Will the retention be measured at a fixed time point (30 days) or as a survival curve (Kaplan-Meier analysis)?
   * This can be used to track how long users stay engaged over time (not just a strict cutoff like 30 days).
   * Kaplan-Meier curves can show differences in retention trends.
   * <b>For example, maybe Group C shows higher retention early on, but drops off later.</b>
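To make the confidence-interval example above concrete, here is a minimal sketch of a normal-approximation (Wald) interval for a retention rate. The helper name `wald_ci` and the user counts are hypothetical; other interval methods (for example Wilson) are often preferred for small samples.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (z=1.96)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical Group B: 148 of 370 users retained (40%)
low, high = wald_ci(148, 370)
print(f"n=370: 40% retained, 95% CI = ({low:.3f}, {high:.3f})")

# Same 40% rate with far fewer users gives a much wider interval
low_small, high_small = wald_ci(20, 50)
print(f"n=50:  40% retained, 95% CI = ({low_small:.3f}, {high_small:.3f})")
```

With roughly 370 users the interval is close to the 35% to 45% range used in the example; with 50 users the same observed rate is compatible with a much wider range.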

<h4>3. Choose the key metrics:</h4>

*  Retention Rate (e.g., % of users still active after 30 days)
*  Churn Rate (e.g., % of users who stopped using the product)
*  Engagement Metrics (e.g., number of summaries generated, Slack integration usage)

<h4>4. Select the target audience, ensuring randomization</h4>

* New Users Only: Focus on users who sign up within a certain period (e.g., this month).
* Randomized Assignment: Use a random split to ensure fair comparisons and avoid bias.
* Example random split:
  * <b>Group A (Control Group)</b>: Standard onboarding
  * <b>Group B (Personal Email Group)</b>: Personalized onboarding email
  * <b>Group C (Trial Extension Group)</b>: Extended trial if inactive by day 5
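The random split above can be sketched as a shuffle followed by round-robin assignment, which keeps the three groups the same size. The function name `assign_groups`, the user IDs, and the seed are all hypothetical; a production system would more likely assign users at sign-up time and persist the assignment.

```python
import random

def assign_groups(user_ids, groups=('A', 'B', 'C'), seed=42):
    """Shuffle users with a fixed seed, then deal them out to groups in turn."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    return {uid: groups[i % len(groups)] for i, uid in enumerate(shuffled)}

# Hypothetical cohort of 300 new users
user_ids = [f"user_{i}" for i in range(300)]
assignment = assign_groups(user_ids)

# Count how many users landed in each group
counts = {g: list(assignment.values()).count(g) for g in ('A', 'B', 'C')}
print(counts)  # {'A': 100, 'B': 100, 'C': 100}
```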

<h4>5. Analyze Results</h4>

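* Use a statistical test suited to the metric: a chi-square test for binary outcomes such as retained vs. churned, and a t-test (two groups) or ANOVA (three or more groups) for continuous engagement metrics.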
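For a binary outcome such as retained vs. churned across the three groups, a chi-square test of independence is one reasonable choice. The sketch below uses `scipy.stats.chi2_contingency` on made-up counts; the numbers are hypothetical and only show the mechanics.

```python
from scipy.stats import chi2_contingency

# Hypothetical results after 30 days: [retained, churned] per group
observed = [
    [120, 280],  # Group A (control)
    [150, 250],  # Group B (personalized email)
    [160, 240],  # Group C (trial extension)
]

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: retention differs between at least two groups.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

A significant result only says that the groups differ somewhere; pairwise comparisons against the control (with a multiple-comparison correction) would be needed to say which treatment drives the difference.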


---
<h2>Better Alternatives for Retention Analysis</h2>

* Logistic regression is a good starting point for ease of interpretation, but it has limitations.

<b>Why Logistic Regression Works for Retention Analysis</b>

1. Binary Outcome: Customer retention is often framed as a binary classification problem (e.g., "retained" = 1, "churned" = 0), making logistic regression a natural choice.
2. Interpretability: Unlike complex machine learning models, logistic regression provides coefficients that indicate the impact of each factor on retention, making it easier to explain insights to stakeholders.
3. Baseline Model: It serves as a good starting point before testing more advanced models like decision trees, random forests, or neural networks.


<b>Potential Limitations</b>

1. Assumes Linearity in Log-Odds: Logistic regression assumes a linear relationship between independent variables and log-odds of retention, which may not always be realistic.
2. Ignores Interactions & Nonlinearity: Important interactions between factors (e.g., usage frequency × customer support engagement) may be missed unless explicitly modeled.
3. Feature Engineering is Key: Retention is driven by complex behaviors, and raw variables alone may not be enough. You may need to create meaningful features (e.g., last login time, feature usage trends).


<b>For deeper insights, consider using:</b>

1. Random Forests/XGBoost – Capture nonlinear relationships and interactions.
2. Survival Analysis – Predicts "time until churn" rather than just whether a customer churns.
3. Causal Inference Methods – If you need to identify causal drivers rather than just correlations, methods like propensity score matching or instrumental variables might be useful.
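To illustrate the survival-analysis option, here is a hand-rolled Kaplan-Meier estimator in plain Python. It is a teaching sketch with hypothetical data; in practice a dedicated survival-analysis library would normally be used. The function name `kaplan_meier` and the sample durations are assumptions introduced here.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time a churn event occurs.

    durations: days each user was followed
    observed:  1 if the user churned at that time, 0 if still active (censored)
    """
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)  # users still being followed at t
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival *= 1 - events / at_risk  # product-limit update
        curve.append((t, survival))
    return curve

# Hypothetical users: days observed, and whether churn was seen
durations = [5, 10, 10, 20, 30]
observed = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"day {t}: estimated share still retained = {s:.2f}")
```

Users who are still active at the end of the observation window count as "at risk" up to that point without being treated as churned, which is the key difference from a fixed-cutoff retention rate.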

<b>A good first-pass approach:</b>

1. Start with logistic regression for interpretability.
2. Explore more advanced models (e.g., XGBoost, survival analysis) for improved accuracy.
3. Test causal inference techniques if the goal is to find actionable retention levers (e.g., does sending an email boost retention?).