# Waze User Churn - 02. Hypothesis Testing

**Project Goal:** To analyze user churn behavior and build a model to predict it.

**Notebook 2 Goal:** In this notebook, I move from the descriptive analysis in Notebook 01 to formal statistical hypothesis testing, focusing on two key questions:

1.  Do iPhone and Android users differ meaningfully in their driving behaviour?
2.  Is there a statistically significant association between device type and churn?

Answering these questions will help me understand whether `device_type` should be treated as a major driver of churn or as a low-importance feature in the model.

To do this, I will load the **cleaned data** from Notebook 01 and conduct two separate tests:
1. **Two-Sample T-Test:** To determine if there is a significant difference in the mean number of `drives` based on `device` type.
2. **Chi-Square Test:** To determine if there is a significant association between `device` type and user `label` (churned vs. retained).

## 0. Import Libraries and Load Data

I will import `pandas` for data manipulation, `numpy` for calculations (such as Cohen's d), and `scipy.stats` for the statistical tests. I then load the `waze_clean.csv` file that I processed and saved in Notebook 01.

In [2]:
# import libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# import library for statistical testing
from scipy import stats

In [None]:
# load the cleaned data 
df = pd.read_csv('../data/processed/waze_clean.csv')

# display the first few rows to verify
df.head()

| | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | retained | 243 | 200 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 8898.716275 | 3160.472914 | 13 | 11 | iPhone |
| 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


## 1. Hypothesis 1 – Difference in Average Drives by Device Type

**Research question**

> Do iPhone users and Android users complete a different average number of drives?

If one platform showed consistently higher or lower driving activity, device-specific UX or technical issues might be contributing to churn.

### 1.1. Statistical formulation

Let:

- $\mu_{\text{iPhone}}$: mean number of drives for iPhone users
- $\mu_{\text{Android}}$: mean number of drives for Android users

We test:

- **$H_0$ (Null Hypothesis)**: there is no difference in average drives between iPhone and Android users.
  $$\mu_{\text{iPhone}} = \mu_{\text{Android}}$$

- **$H_A$ (Alternative Hypothesis)**: there is a difference in average drives between the two groups.
  $$\mu_{\text{iPhone}} \neq \mu_{\text{Android}}$$

I use **Welch’s two-sample t-test** (unequal variances) at a significance level of **$\alpha = 0.05$**.

### 1.2. Method

-   Split the cleaned dataset into two samples based on `device_type`:
    -   `iphone_drives`: `drives` for users on iPhone
    -   `android_drives`: `drives` for users on Android
-   Because the sample size is large, the Central Limit Theorem makes the t-test reasonably robust to non-normality.
-   Run Welch’s t-test (`equal_var=False`) to compare the two means.
-   Compute **Cohen’s d** as an effect-size measure to quantify how large any difference is in practical terms.
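To make explicit what `equal_var=False` changes, here is a minimal pure-Python sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the two toy samples are my own illustrations and are not part of the analysis; the actual test in this notebook uses `scipy.stats.ttest_ind`.

```python
import math
import statistics


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    nx, ny = len(x), len(y)
    # variance of each sample mean: sample variance (ddof=1) divided by n
    vx = statistics.variance(x) / nx
    vy = statistics.variance(y) / ny
    # t statistic: difference in means divided by its standard error
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# made-up toy samples, NOT the Waze data
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]

t_toy, df_toy = welch_t(group_a, group_b)
print(f"t = {t_toy:.4f}, df = {df_toy:.2f}")
```

Unlike the pooled-variance t-test, each group contributes its own variance term, so the degrees of freedom are generally non-integer and never exceed `nx + ny - 2`.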

In [None]:
# ensure 'device_type' column is present, if not, create it
if 'device_type' not in df.columns:
    print("Creating 'device_type' column...")
    map_dictionary = {'iPhone': 1, 'Android': 0}
    df['device_type'] = df['device'].map(map_dictionary)

# isolate the 'drives' data for each group
iphone_drives = df[df['device_type'] == 1]['drives']
android_drives = df[df['device_type'] == 0]['drives']

In [10]:
# calculate and print the means
mean_iphone = iphone_drives.mean()
mean_android = android_drives.mean()
print(f"Mean drives for iPhone: {mean_iphone:.2f}")
print(f"Mean drives for Android: {mean_android:.2f}")

Mean drives for iPhone: 64.44
Mean drives for Android: 63.10


In [11]:
# perform Welch's t-test
t_stat_1, p_value_1 = stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)
print(f"\nWelch's T-statistic: {t_stat_1:.4f}")
print(f"P-value: {p_value_1:.4f}")


Welch's T-statistic: 1.4057
P-value: 0.1598


In [12]:
# calculate Cohen's d (effect size)
def cohens_d(x, y):
    """Calculates Cohen's d for independent samples."""
    nx = len(x)
    ny = len(y)
    # calculate pooled standard deviation
    s_pooled = np.sqrt(((nx - 1) * np.std(x, ddof=1)**2 + (ny - 1) * np.std(y, ddof=1)**2) / (nx + ny - 2))
    # calculate Cohen's d
    return (np.mean(x) - np.mean(y)) / s_pooled

d_value = cohens_d(iphone_drives, android_drives)
print(f"Cohen's d: {d_value:.4f}")

Cohen's d: 0.0244


### 1.3. Results

-   Sample means (based on the cleaned data):

    -   $\hat{\mu}_{\text{iPhone}} = 64.44$ drives
    -   $\hat{\mu}_{\text{Android}} = 63.10$ drives

-   Welch’s t-test:

    -   Test statistic: $t = 1.4057$
    -   p-value: $p = 0.1598$

-   Effect size (Cohen’s d):

    -   $d = 0.0244$

The p-value is **greater than 0.05**, and Cohen’s $d$ is close to zero.
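As a rough cross-check, the reported means and t statistic can be turned into an approximate 95% confidence interval for the difference in means. This is a sketch under two assumptions: the means are only available rounded to two decimals, and I use the normal critical value instead of the exact t quantile, which should be very close given the large sample noted in 1.2. The variable names are illustrative.

```python
from statistics import NormalDist

# values reported in the results above (means are rounded to two decimals)
mean_iphone = 64.44
mean_android = 63.10
t_stat = 1.4057

diff = mean_iphone - mean_android
# the t statistic is diff / SE, so the standard error can be backed out
se_diff = diff / t_stat
# normal approximation to the critical value (assumption: large samples)
z_crit = NormalDist().inv_cdf(0.975)

ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff
print(f"difference = {diff:.2f} drives, SE = {se_diff:.3f}")
print(f"approx. 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval spans zero, which is the same conclusion as the p-value expressed on the scale of drives.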

### 1.4. Interpretation

From a statistical perspective, I **fail to reject the null hypothesis $H_0$**.
The average number of drives is very similar for iPhone and Android users, and the effect size is practically negligible.

**From a business perspective, this means:**

-   Device type does **not** appear to drive materially different driving behaviour in this dataset.
-   There is **no evidence** that one platform systematically encourages more or fewer drives than the other.

As a result, I **do not treat device type as a primary driver of driving intensity**; instead, I will rely on direct usage metrics (e.g., drives, sessions, duration) as the main behavioural features.

## 2. Hypothesis 2 – Association Between Device Type and Churn

**Research question**

> Are churn rates different between iPhone and Android users?

If one platform had a significantly higher churn rate, that would suggest device-specific friction, bugs, or UX problems that require targeted interventions.

### 2.1. Statistical formulation

Consider the relationship between:

-   `device_type` (two categories: iPhone vs Android), and
-   `label` (binary churn label: retained vs churned).

Test:

-   **$H_0$ (Null Hypothesis)**:
    `device_type` and `label` are **independent**.
    (device type has no association with churn status.)

-   **$H_A$ (Alternative Hypothesis)**:
    `device_type` and `label` are **not independent**.
    (there is an association between device type and churn.)

Use a **Chi-square test of independence** at **$\alpha = 0.05$**.

### 2.2. Method

-   Construct a contingency table (crosstab) of observed frequencies:

$$\text{contingency} =
\begin{bmatrix}
\text{count of Android churned} & \text{count of Android retained} \\
\text{count of iPhone churned} & \text{count of iPhone retained}
\end{bmatrix}$$

-   Apply `scipy.stats.chi2_contingency` to obtain:
    -   Chi-square statistic
    -   p-value
    -   Degrees of freedom
    -   Expected frequencies under $H_0$

-   Verify that **all expected cell counts are larger than 5**, so the chi-square test assumptions are satisfied.

In [14]:
# create the contingency table
contingency_table = pd.crosstab(df['device'], df['label'])

print("===== Contingency Table (Observed Frequencies) =====")
display(contingency_table)

===== Contingency Table (Observed Frequencies) =====


| device | churned | retained |
|---|---|---|
| Android | 891 | 4183 |
| iPhone | 1645 | 7580 |
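Before running the test, it helps to read the observed counts as churn rates per device. The snippet below is a small standalone sketch that copies the counts from the contingency table above into a plain dictionary; the names `observed` and `churn_rates` are illustrative.

```python
# observed counts copied from the contingency table above
observed = {
    "Android": {"churned": 891, "retained": 4183},
    "iPhone": {"churned": 1645, "retained": 7580},
}

churn_rates = {}
for device, counts in observed.items():
    total = counts["churned"] + counts["retained"]
    # share of users on this device who churned
    churn_rates[device] = counts["churned"] / total
    print(f"{device}: {churn_rates[device]:.2%} churned ({total} users)")
```

Both rates land within a fraction of a percentage point of each other, which is what the chi-square test then evaluates formally.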


In [13]:
# perform the Chi-Square test
chi2_stat, p_value_chi2, dof, expected_freq = stats.chi2_contingency(contingency_table)

print(f"\n===== Chi-Square Test Results =====")
print(f"Chi2 Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value_chi2:.4f}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies (under H0):")
display(pd.DataFrame(expected_freq, 
                      index=contingency_table.index, 
                      columns=contingency_table.columns))


===== Chi-Square Test Results =====
Chi2 Statistic: 0.1477
P-value: 0.7007
Degrees of Freedom: 1

Expected Frequencies (under H0):


| device | churned | retained |
|---|---|---|
| Android | 899.899573 | 4174.100427 |
| iPhone | 1636.100427 | 7588.899573 |
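The expected frequencies and the test statistic can be reproduced by hand from the observed counts. The sketch below uses only the standard library and assumes Yates' continuity correction, which `scipy.stats.chi2_contingency` applies by default to 2×2 tables (`correction=True`). The closed-form p-value via `math.erfc` is valid only for 1 degree of freedom.

```python
import math

# observed counts (rows: Android, iPhone; columns: churned, retained)
observed = [[891, 4183], [1645, 7580]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Yates' correction: shrink each |O - E| by 0.5
# (every |O - E| here is larger than 0.5, so no clipping is needed)
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# upper-tail probability of a chi-square variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

min_expected = min(min(row) for row in expected)
print(f"smallest expected count: {min_expected:.1f}")
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}")
```

The smallest expected count is far above 5, so the chi-square approximation is reasonable for this table.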


### 2.3. Results

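-   Chi-square test of independence:

    -   Test statistic: $\chi^2 = 0.1477$ (`chi2_contingency` applies Yates' continuity correction by default for 2×2 tables)
    -   p-value: $p = 0.7007$
    -   Degrees of freedom: 1

The p-value is **greater than 0.05**, and all expected cell counts are well above 5 (the smallest is about 900), so the chi-square approximation is appropriate.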

### 2.4. Interpretation

Again, I **fail to reject the null hypothesis $H_0$**.
Within this dataset, there is **no statistically significant evidence** that churn depends on device type.

**From a business standpoint:**

* I do **not** observe a strong platform-specific churn pattern (iOS vs Android).
* Churn is therefore more likely driven by engagement and usage patterns than by the underlying device ecosystem, although these two tests do not examine those patterns directly.

As a result, I **will not include** `device_type` in the feature set for the churn model. I do not expect it to carry any significant predictive power compared with core usage features (drives, sessions, activity days, etc.), and removing it helps simplify the model.

## 3. Overall Conclusions from Hypothesis Testing

Both tests point to the same conclusion:

1.  **Driving behaviour (average number of drives)** is essentially the same between iPhone and Android users.
2.  **Churn rates** do not differ significantly by device type.

These results confirm my visual analysis from Notebook 01 and suggest that **device type is not a primary driver of churn** in this sample.

In the subsequent modeling step (`03_modeling.ipynb`), I will therefore **exclude** `device_type` from the feature set and focus exclusively on behavioural and recency-related features as the main predictors of churn risk.