# ML + Product Exercise: Churn Prediction as a PM

What this exercise is about

You are a Product Manager at a subscription SaaS company (XYZ).
Your job is to decide which users are likely to churn in the next 30 days, so Marketing can intervene early.

This is not about building a perfect ML system.
It’s about using models to make better product decisions under constraints.

We’re using past user behavior to estimate who is likely to cancel soon, so the company can reach out early and keep them.

We test a few simple models, compare them using metrics that reflect real business costs, and choose the option that balances recall, trust, and explainability rather than raw accuracy.

Business Goal

Predict churn early enough to:
	•	proactively retain users
	•	reduce wasted marketing spend
	•	understand behavioral drivers of churn

Key Business Constraints
	•	False Negatives are expensive
→ missing a churned user means lost revenue
	•	False Positives are cheap
→ contacting a non-churn user is low risk
	•	Explainability matters
→ PMs must justify decisions to non-technical stakeholders

This means:
➡️ Recall > accuracy
➡️ Simple, interpretable models are preferred


Dataset (Business Meaning, not math)

Each row = one user.
| Feature | What it represents in product terms |
| --- | --- |
| tenure_months | How embedded the user is |
| sessions_per_week | How habitual the product is |
| feature_usage | How broadly the product is used |
| support_tickets | Friction or problems experienced |
| is_premium | Level of commitment / payment |


Target: churn
	•	1 → user left
	•	0 → user stayed



What models we try (and why)
	•	Logistic Regression
		•	Strong baseline
		•	Very explainable
		•	Easy to justify in meetings
	•	KNN
		•	Captures similarity between users
		•	Can boost recall
		•	Harder to explain
	•	Decision Tree
		•	Human-readable logic
		•	Risk of overfitting if too deep
The goal is not accuracy alone, but business-aligned performance.

How models are evaluated

We focus on:
	•	Recall → don’t miss churners
	•	Precision → acceptable noise level
	•	F1 → tradeoff summary

We also:
	•	tune hyperparameters
	•	adjust decision thresholds (because contacting users is cheap)

Why this dataset matters

This exercise teaches PMs to:
	•	connect features to real product behavior
	•	reason about trade-offs (risk vs explainability)
	•	translate ML outputs into policy decisions
	•	understand bias–variance as product risk

⸻

What the reflection questions test

They’re not ML theory questions.
They test whether you can:
	•	justify a model choice in business terms
	•	reason about costs of mistakes
	•	explain model behavior to stakeholders
	•	decide when a model is “good enough to ship”

In [23]:
import os
os.listdir()

['requirements.txt',
 'exercise.ipynb',
 'README.md',
 '.gitignore',
 '.venv',
 '.python-version',
 'task.md',
 '.git',
 'data']

In [24]:
import os
os.listdir("data")

['synthetic_customer_churn.csv']

In [25]:
import pandas as pd

df = pd.read_csv("data/synthetic_customer_churn.csv")

In [26]:
df.head()

   tenure_months  sessions_per_week  feature_usage  support_tickets  is_premium  churn
0             29                 14             46                2           0      0
1             15                  0             14                2           0      0
2              8                  7             36                0           0      0
3             21                 11             12                0           0      0
4             19                  7             17                0           0      0


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   tenure_months      1500 non-null   int64
 1   sessions_per_week  1500 non-null   int64
 2   feature_usage      1500 non-null   int64
 3   support_tickets    1500 non-null   int64
 4   is_premium         1500 non-null   int64
 5   churn              1500 non-null   int64
dtypes: int64(6)
memory usage: 70.4 KB


In [28]:
df["churn"].value_counts(normalize=True)

churn
0    0.758
1    0.242
Name: proportion, dtype: float64

What this tells us
	•	1,500 users → small, illustrative dataset (as advertised)
	•	All features are numeric → no encoding needed
	•	No missing values → no imputation risk
	•	Binary target (churn) → standard classification

Why this is good
	•	No data cleaning distractions
	•	Focus stays on model trade-offs and decisions
	•	Exactly what this exercise is meant to test

Business implication
	•	Any model differences are due to behavior patterns, not messy data
	•	Results are easier to explain to stakeholders


Churn rate
	•	24.2% churn
	•	75.8% retained

What this means for decision-making
	•	Churn is a clear minority class
	•	A naive model that predicts “no churn” for everyone gets 75.8% accuracy
	•	Therefore:
		•	Accuracy is misleading
		•	Recall for churn (class = 1) matters most

This reinforces the stated business constraint:

Missing a churned user is very costly.

So the evaluation strategy is now locked in: Optimize for recall, not accuracy.
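The 75.8% baseline can be checked with plain arithmetic. This is a minimal sketch that uses only the counts implied by the outputs above (1,500 users, 24.2% churn); the variable names are illustrative and not part of the notebook.

```python
# Counts implied by the notebook output: 1,500 users, 24.2% churn.
n_users = 1500
n_churn = round(n_users * 0.242)   # 363 churners
n_stay = n_users - n_churn         # 1137 retained users

# A model that predicts "no churn" for everyone:
tp, fn = 0, n_churn   # every churner is missed
tn, fp = n_stay, 0    # every retained user is counted as correct

accuracy = (tp + tn) / n_users
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The model looks 75.8% "right" while catching zero churners, which is why accuracy cannot be the headline metric here.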

With a 24% churn rate:
	•	Every false negative = real revenue loss
	•	False positives are acceptable noise (cheap outreach)
	•	Threshold tuning is likely to matter at least as much as model choice

In [29]:
y = df["churn"]
X = df.drop(columns=["churn"])

from sklearn.model_selection import train_test_split

RSEED = 42

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RSEED,
    stratify=y
)

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, random_state=RSEED))
])

logreg.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('model',
                 LogisticRegression(max_iter=2000, random_state=42))])


In [31]:
# Confusion matrix
# We use a confusion matrix because accuracy hides the mistakes that actually matter.


from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, f1_score

y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))

print("Recall:", round(recall_score(y_test, y_pred), 3))
print("Precision:", round(precision_score(y_test, y_pred), 3))
print("F1:", round(f1_score(y_test, y_pred), 3))

[[215  12]
 [ 63  10]]
              precision    recall  f1-score   support

           0      0.773     0.947     0.851       227
           1      0.455     0.137     0.211        73

    accuracy                          0.750       300
   macro avg      0.614     0.542     0.531       300
weighted avg      0.696     0.750     0.696       300

Recall: 0.137
Precision: 0.455
F1: 0.211


This tells us:
	•	You caught 10 churners
	•	You missed 63 churners ← this is the real problem
	•	You annoyed 12 non-churn users (cheap)

Accuracy alone doesn’t show this imbalance.
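Every figure in the classification report follows from the four cells of the matrix. The sketch below recomputes them by hand, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes); the numbers are copied from the output above.

```python
# Confusion matrix from the cell above:
# rows = actual (0, 1), columns = predicted (0, 1).
tn, fp = 215, 12
fn, tp = 63, 10

recall = tp / (tp + fn)        # share of real churners that were caught
precision = tp / (tp + fp)     # share of flagged users who really churned
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print("Recall:", round(recall, 3))
print("Precision:", round(precision, 3))
print("Accuracy:", round(accuracy, 3))
print("F1:", round(f1, 3))
```

Accuracy stays at 0.75 because the 215 correctly ignored non-churners dominate the count, while recall exposes the 63 missed churners.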

We use a confusion matrix because it shows the business cost of each type of mistake, not just how often the model is right

In [32]:
# Lower the decision threshold and compare the metrics at each cut-off

y_proba = logreg.predict_proba(X_test)[:, 1]

for t in [0.5, 0.4, 0.3, 0.2]:
    y_pred_t = (y_proba >= t).astype(int)

    print(f"\nThreshold: {t}")
    print(confusion_matrix(y_test, y_pred_t))
    print("Recall:", round(recall_score(y_test, y_pred_t), 3))
    print("Precision:", round(precision_score(y_test, y_pred_t), 3))
    print("F1:", round(f1_score(y_test, y_pred_t), 3))


Threshold: 0.5
[[215  12]
 [ 63  10]]
Recall: 0.137
Precision: 0.455
F1: 0.211

Threshold: 0.4
[[199  28]
 [ 53  20]]
Recall: 0.274
Precision: 0.417
F1: 0.331

Threshold: 0.3
[[172  55]
 [ 35  38]]
Recall: 0.521
Precision: 0.409
F1: 0.458

Threshold: 0.2
[[120 107]
 [ 18  55]]
Recall: 0.753
Precision: 0.34
F1: 0.468


The model is too conservative at the default 0.5 threshold. Change the decision rule, not the model: this aligns the model with the business rule that missing churn is expensive.

Is Logistic Regression good enough? Yes, with threshold tuning.

By lowering the decision threshold, recall improved substantially while precision declined, which is acceptable given the business constraints.

Threshold 0.3
	•	Recall: 0.52 → catches ~half of churners
	•	Precision: 0.41 → ~6 in 10 contacted users won’t churn
	•	Trade-off: Balanced, moderate outreach

Threshold 0.2
	•	Recall: 0.75 → catches 3 out of 4 churners
	•	Precision: 0.34 → more false alarms
	•	Trade-off: Aggressive retention, higher outreach volume

Logistic Regression becomes usable once the decision threshold is adjusted to reflect the real business cost of churn. At a threshold of 0.2–0.3, the model captures a majority of churners while remaining simple, stable, and explainable to stakeholders.

The baseline model was initially unusable, but threshold tuning aligned it with business priorities and made it ready for a pilot without increasing model complexity. Because the thresholds were compared on the test set itself, the reported recall and precision are likely somewhat optimistic; a separate validation split would give a cleaner estimate.

I would start with a 0.3 threshold to balance recall and operational cost, then monitor retention lift before expanding outreach.
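To make "operational cost" concrete, outreach volume at each threshold can be read off the confusion matrices above: everyone predicted as churn (true positives plus false positives) gets contacted. The sketch below only rearranges numbers already printed by the notebook; the dictionary layout is illustrative.

```python
# (tn, fp, fn, tp) per threshold, copied from the confusion matrices above.
results = {
    0.5: (215, 12, 63, 10),
    0.4: (199, 28, 53, 20),
    0.3: (172, 55, 35, 38),
    0.2: (120, 107, 18, 55),
}

summary = {}
for threshold, (tn, fp, fn, tp) in results.items():
    total = tn + fp + fn + tp
    contacted = tp + fp  # everyone flagged as likely to churn
    summary[threshold] = {
        "contacted": contacted,
        "share_contacted": contacted / total,
        "churners_reached": tp,
        "churners_missed": fn,
    }
    print(threshold, summary[threshold])
```

On this 300-user test set, a 0.3 threshold means contacting 93 users (31%) to reach 38 churners, while 0.2 means contacting 162 users (54%) to reach 55.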


In [33]:
# Encode business cost into the model (class weights)
# This usually boosts recall without extreme thresholding

logreg_bal = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        max_iter=2000,
        random_state=RSEED,
        class_weight="balanced"
    ))
])

logreg_bal.fit(X_train, y_train)

y_pred = logreg_bal.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))

[[146  81]
 [ 23  50]]
Recall: 0.684931506849315
Precision: 0.3816793893129771
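For intuition on what class_weight="balanced" does, the sketch below applies the formula given in the scikit-learn documentation, n_samples / (n_classes * np.bincount(y)), to the training-set class counts implied by the outputs above. It assumes 1,137 retained and 363 churned users overall (75.8% and 24.2% of 1,500), minus the 227 and 73 that landed in the test split. This is plain arithmetic, not a call into scikit-learn.

```python
# Training-set class counts implied by the notebook outputs (an inference,
# not something the notebook prints): overall counts minus the test split.
counts = {0: 1137 - 227, 1: 363 - 73}   # {0: 910, 1: 290}

n_samples = sum(counts.values())        # 1200 training rows
n_classes = len(counts)

# Documented "balanced" formula: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under these counts each churner weighs roughly three times as much as a retained user during training, which pushes the model toward flagging more users at the default 0.5 threshold.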


In [34]:
# Decision Tree (explainability comparison)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=20,
    random_state=RSEED
)

tree.fit(X_train, y_train)

y_pred_tree = tree.predict(X_test)

print(confusion_matrix(y_test, y_pred_tree))
print("Recall:", round(recall_score(y_test, y_pred_tree), 3))
print("Precision:", round(precision_score(y_test, y_pred_tree), 3))
print("F1:", round(f1_score(y_test, y_pred_tree), 3))

[[220   7]
 [ 56  17]]
Recall: 0.233
Precision: 0.708
F1: 0.351


	•	Recall: 0.23 → catches only ~1 in 4 churners
	•	Precision: 0.71 → very “clean”, but too conservative
	•	F1: 0.35

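I'm choosing Logistic Regression over the Decision Tree because the tree, at these settings, is highly precise but misses most churners, which violates the primary business constraint that false negatives are very costly.

	•	At the default 0.5 cut-off, a leaf flags churn only when most of its training users churned, so the tree behaves conservatively
	•	It favors precision over coverage
	•	High precision is not valuable when missing churn is expensive
	•	Recall is far below acceptable levels

The tree is explainable, but an explainable model that misses most churners is worse than an explainable model aligned with business cost.

One caveat: the tree was evaluated without threshold tuning or class weights, so this compares a default tree against a tuned Logistic Regression.

Logistic Regression remains:
	•	explainable
	•	tunable via threshold
	•	expected to be more stable across retrains (not measured here)
	•	far better at catching churners once the threshold or class weights are adjusted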

Questions to answer

Model Choice & Business Fit

1. Which model would you recommend deploying and why?
Cost-sensitive Logistic Regression (threshold tuning and/or class_weight="balanced"). It hits the business goal (high recall), stays simple, and is explainable.

2. Which constraint mattered most: recall, explainability, or consistency?
Recall first. Missing churners is the expensive failure mode. Explainability is second (we still need buy-in).

3. Would your recommendation change for an early-stage vs a mature company? Why?
	•	Early-stage: still Logistic Regression, but I’d move faster, accept rougher monitoring, iterate weekly.
	•	Mature: still Logistic Regression, but with stricter governance: calibration, monitoring, segmentation, A/B testing, audit trails. More emphasis on risk controls than speed.



Metrics & Trade-offs

4. Which metric did you prioritize and why?
Recall (for churn = 1) because false negatives cost revenue.

5. Business consequences of FN vs FP?
	•	False negative: we don’t contact a real churner → lost subscription revenue + higher reacquisition cost later.
	•	False positive: we contact someone who wouldn’t churn → minor annoyance + small marketing cost.
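The FN/FP trade-off can be turned into a rough expected-cost comparison. In the sketch below the two cost figures are purely hypothetical placeholders (the exercise does not give any monetary values); the confusion-matrix counts are the ones printed for each threshold earlier.

```python
# HYPOTHETICAL costs, for illustration only; replace with real figures.
COST_FALSE_NEGATIVE = 100   # e.g. lost revenue from one missed churner
COST_FALSE_POSITIVE = 5     # e.g. one unnecessary outreach

# (fp, fn) per threshold, copied from the confusion matrices above.
errors = {
    0.5: (12, 63),
    0.4: (28, 53),
    0.3: (55, 35),
    0.2: (107, 18),
}

total_cost = {
    threshold: fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE
    for threshold, (fp, fn) in errors.items()
}
best_threshold = min(total_cost, key=total_cost.get)

for threshold, cost in total_cost.items():
    print(f"threshold {threshold}: total cost {cost}")
print("lowest cost at threshold", best_threshold)
```

With a 20:1 cost ratio the most aggressive threshold wins; as the two costs get closer, the preferred threshold moves back up, which is the conversation to have with leadership.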

6. If leadership demanded higher accuracy but lower recall, how do you respond?
I’d show that with 24% churn, accuracy can be inflated by predicting “no churn.” I’d propose a compromise: keep accuracy reporting, but set a minimum recall target and measure business outcomes (retention lift, cost per save). If they insist, it’s a business decision—but they should explicitly accept the revenue loss from missed churners.

Bias–Variance as Product Risk

7. Any signs of over/underfitting? Product risks?
	•	The Decision Tree behaved too conservatively at the chosen settings (low recall) and trees in general can be high variance if deep → unstable targeting, inconsistent campaigns.
	•	Baseline Logistic at 0.5 threshold was effectively under-sensitive → misses churners → wasted opportunity.
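The notebook does not compare training and test scores, which is the usual way to look for over- or underfitting. A minimal sketch of that check follows. It runs on stand-in data from make_classification (not the exercise CSV), and the depth settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (NOT the exercise CSV): 1,500 rows, 5 features, ~24% positives.
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.76], flip_y=0.1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gaps = {}
for name, depth in [("shallow", 3), ("unrestricted", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_recall = recall_score(y_train, model.predict(X_train))
    test_recall = recall_score(y_test, model.predict(X_test))
    gaps[name] = train_recall - test_recall
    print(f"{name}: train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

A large gap between training and test recall points to overfitting; low recall on both points to underfitting or, as with the baseline here, a decision rule that is too strict.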

8. Prefer slightly underfit or overfit here? Why?
Slightly underfit / stable. Overfit models can swing outreach decisions unpredictably and hurt trust. For retention campaigns, consistency matters.

Decision Trees & Explainability

9. How does tree depth / leaf count affect UX and trust?
	•	Deeper tree = more complex rules → more erratic targeting → users get inconsistent messaging.
	•	Stakeholder trust drops if the logic becomes too long to explain or changes a lot between retrains.

10. Max complexity you’d explain to non-technical stakeholders?
A shallow tree (depth ~2–4) or Logistic Regression with a small set of clear drivers. If I can’t explain it in 60 seconds, it’s too complex for this use case.
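One way to test the "60 seconds" rule is to print the tree as text and see whether it fits on a slide. The sketch below uses scikit-learn's export_text on a tree with the same settings as the notebook (max_depth=3, min_samples_leaf=20), but fits it on stand-in data, so the printed split values are meaningless; only the size and shape of the rule set are the point.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = [
    "tenure_months", "sessions_per_week", "feature_usage",
    "support_tickets", "is_premium",
]

# Stand-in data (NOT the exercise CSV); the columns only borrow the real names.
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.76], random_state=42,
)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=42)
tree.fit(X, y)

rules = export_text(tree, feature_names=feature_names)
n_leaves = tree.get_n_leaves()

print(rules)
print("leaves:", n_leaves)
```

A depth-3 tree has at most 8 leaves, so at most 8 rules to explain; each extra level can double that number.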

Feature Interpretation & Product Insight

11. Which features seemed most useful?
Practically, the likely strongest drivers are engagement and friction signals: sessions_per_week, feature_usage, support_tickets, and tenure_months. (To be precise, you’d confirm via LR coefficients / feature importance.)
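Confirming the drivers via Logistic Regression coefficients could look like the sketch below. Because the features are standardized inside the pipeline, coefficient magnitudes are roughly comparable across features. The data and the relationship between features and churn are made up so the example runs on its own; the real coefficients would come from the notebook's fitted logreg pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = [
    "tenure_months", "sessions_per_week", "feature_usage",
    "support_tickets", "is_premium",
]

# Made-up data and a made-up relationship, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(1500, 5))
logits = -1.2 - 0.8 * X[:, 1] + 0.6 * X[:, 3]
y = (rng.random(1500) < 1 / (1 + np.exp(-logits))).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])
pipe.fit(X, y)

# One coefficient per feature; positive values push toward churn.
coefs = pipe.named_steps["model"].coef_[0]
ranked = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    print(f"{name:20s} {coef:+.3f}")
```

For stakeholders, the sign answers "does more of this raise or lower churn risk?" and the ranking answers "which signals matter most?".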

12. Any feature surprise + hypothesis?
If support_tickets strongly increases churn risk, the hypothesis is: unresolved issues or poor support experience drives cancellations. If is_premium lowers churn, it reflects higher commitment and switching costs.

13. Product/UX changes to reduce churn for flagged users?
	•	Trigger in-app guidance/onboarding when usage drops.
	•	Improve support flows: faster resolution, proactive help, better self-serve.
	•	“Save” offers: targeted discounts or plan downgrades for high-risk users.
	•	Increase feature adoption: push 1–2 “sticky” features tied to retention.

Feature Engineering 

14. Did scaling/normalization change performance?
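Not measured directly in this notebook: every Logistic Regression run already includes StandardScaler. In general, scale-sensitive models (regularized Logistic Regression, KNN) benefit from scaling, which stabilizes training and makes coefficients comparable across features.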

15. When is changing feature treatment acceptable vs risky (PM view)?
	•	Acceptable: when it improves stability/recall without harming interpretability (e.g., scaling).
	•	Risky: when it changes meaning or introduces leakage (using future info, post-churn behavior, or fitting transforms on full data).
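Both points can be illustrated with KNN, which is listed as a candidate model but has no cell in this notebook. The sketch below compares KNN with and without scaling using cross-validated recall. Putting the scaler inside the Pipeline means it is refit on each training fold, which avoids the "fitting transforms on full data" leak. The data is a stand-in with one feature deliberately inflated, and n_neighbors=15 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (NOT the exercise CSV).
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.76], random_state=42,
)
X = X.copy()
X[:, 0] *= 100  # put one feature on a much larger scale than the others

knn_raw = KNeighborsClassifier(n_neighbors=15)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),  # fitted on each training fold only
    ("model", KNeighborsClassifier(n_neighbors=15)),
])

scores = {
    "raw": cross_val_score(knn_raw, X, y, cv=5, scoring="recall").mean(),
    "scaled": cross_val_score(knn_scaled, X, y, cv=5, scoring="recall").mean(),
}
print(scores)
```

Without scaling, distances are dominated by the inflated feature, so the two recall scores usually differ; how much depends on the data.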

Threshold & Policy Decisions

16. How do you choose a decision threshold for contacting users?
Pick a threshold based on:
	•	required recall target (FN cost)
	•	outreach capacity / budget
	•	acceptable false positive rate
Then validate with a small rollout and measure retention lift.
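The steps above can be sketched as a small procedure: score users out-of-fold, then take the highest threshold that still meets the recall target. This differs from the notebook, which compares thresholds directly on the test set. The data is a stand-in, and TARGET_RECALL is a hypothetical business target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (NOT the exercise CSV).
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.76], flip_y=0.1, random_state=42,
)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

# Out-of-fold probabilities: each row is scored by a model that never saw it.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y, proba)

TARGET_RECALL = 0.75  # hypothetical target agreed with the business

# precision/recall have one more entry than thresholds; drop the final point.
meets_target = recall[:-1] >= TARGET_RECALL
chosen_threshold = thresholds[meets_target].max()

flagged = proba >= chosen_threshold
achieved_recall = flagged[y == 1].mean()
share_contacted = flagged.mean()

print(f"threshold={chosen_threshold:.3f}, recall={achieved_recall:.3f}, "
      f"share contacted={share_contacted:.3f}")
```

The share contacted is the number to check against outreach capacity; if it exceeds the budget, the recall target or the intervention cost has to change.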

17. Different thresholds for premium vs free?
Yes, potentially. Premium users have higher LTV, so it can justify a lower threshold (more aggressive outreach). Free users may get a higher threshold or cheaper interventions.
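A per-segment policy is simple to express once the model outputs probabilities. In the sketch below the probabilities, plan flags, and both thresholds are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical churn probabilities and plan flags for six users.
churn_proba = np.array([0.15, 0.25, 0.35, 0.22, 0.28, 0.45])
is_premium = np.array([1, 1, 1, 0, 0, 0])

# Hypothetical policy: more aggressive outreach for premium users.
PREMIUM_THRESHOLD = 0.2
FREE_THRESHOLD = 0.3

# Pick the threshold that applies to each user, then compare element-wise.
thresholds = np.where(is_premium == 1, PREMIUM_THRESHOLD, FREE_THRESHOLD)
contact = churn_proba >= thresholds

print(contact.tolist())  # [False, True, True, False, False, True]
```

The model stays the same; only the policy layer changes, which keeps the segmentation decision visible and easy to audit.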

Shipping & Ethics

18. Would you ship today? What’s missing if not?
I’d ship as a controlled pilot. Missing for “full ship”: monitoring, segmentation checks, calibration, and a clear operational playbook (who contacts whom, when, and how).

19. Safeguards before production use
	•	Monitor recall/precision drift + outreach volume
	•	A/B test retention campaigns
	•	Regular retrain schedule + rollback plan
	•	Logging and auditability of decisions
	•	Guardrails to avoid spammy outreach and protect user trust

20. Who is accountable if it makes a bad decision?
Ultimately Product + Marketing leadership jointly. Product owns model policy and risk, Marketing owns execution. Data/ML supports with monitoring and model quality.

I built a churn model designed around the real business cost: missing churners is expensive, while outreach is cheap. The baseline looked “accurate” but missed most churners, so I adjusted the decision threshold (and validated cost-sensitive weighting) to significantly improve recall without increasing model complexity.

This keeps the model explainable and operationally controllable: we can tune aggressiveness via the threshold, based on campaign capacity and desired recall.

We’ll start with a moderate threshold, run a pilot, measure retention lift and cost per save, then scale or adjust based on results, with monitoring and safeguards in place.

We choose Logistic Regression over a Decision Tree because it catches far more churners while staying stable and explainable, whereas the tree misses too many churners and adds variance without business upside.