In [1]:
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

# Add the src folder to the Python path
import sys
sys.path.append("../src")

# Import your test functions
from utils.tests import chi_square_test, t_test_analysis

## Load the preprocessed data

In [2]:

data = pd.read_csv("../data/preprocessed_data.csv")  # forward slashes work on Windows, macOS, and Linux
data.head()


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0,1,0,1,29.85,29.85,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,34,1,0,56.95,1889.5,0,...,0,0,0,0,0,1,0,0,0,1
2,1,0,0,0,2,1,1,53.85,108.15,1,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,45,0,0,42.3,1840.75,0,...,1,0,0,0,0,1,0,0,0,0
4,0,0,0,0,2,1,1,70.7,151.65,1,...,0,0,0,0,0,0,0,0,1,0


In [3]:

target = 'Churn'

# Continuous numeric features
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Binary features = all other columns except numeric and target
binary_features = [col for col in data.columns if col not in numeric_features + [target]]

# Run tests
t_test_table = t_test_analysis(data, target, numeric_features)
chi_square_table = chi_square_test(data, target, binary_features)

# Display both tables separately
print("===== T-Test Results =====")
display(t_test_table.sort_values('p-value'))

print("===== Chi-Square Test Results =====")
display(chi_square_table.sort_values('p-value'))


===== T-Test Results =====


Unnamed: 0,Feature,Test,p-value,Reject_H0,Relationship_Strength
0,tenure,T-Test,0.0,Yes,Significant
1,MonthlyCharges,T-Test,0.0,Yes,Significant
2,TotalCharges,T-Test,0.0,Yes,Significant


===== Chi-Square Test Results =====


Unnamed: 0,Feature,Test,p-value,Reject_H0,Cramer_V,Relationship_Strength
13,OnlineBackup_Yes,Chi-Square,0.0,Yes,0.0819,Very Weak
24,PaymentMethod_Credit card (automatic),Chi-Square,0.0,Yes,0.1339,Weak
23,Contract_Two year,Chi-Square,0.0,Yes,0.3019,Moderate
22,Contract_One year,Chi-Square,0.0,Yes,0.1774,Weak
21,StreamingMovies_Yes,Chi-Square,0.0,Yes,0.0611,Very Weak
20,StreamingMovies_No internet service,Chi-Square,0.0,Yes,0.2275,Weak
19,StreamingTV_Yes,Chi-Square,0.0,Yes,0.0629,Very Weak
18,StreamingTV_No internet service,Chi-Square,0.0,Yes,0.2275,Weak
17,TechSupport_Yes,Chi-Square,0.0,Yes,0.1643,Weak
16,TechSupport_No internet service,Chi-Square,0.0,Yes,0.2275,Weak


# 🧠 Advanced Data Analysis: Statistical Testing Summary

This section presents the results of **t-tests** and **chi-squared tests** conducted to examine which features are significantly related to customer churn.  

---

## 📊 1. T-Test Results (Continuous Features)

| Feature | p-value | Significance | Interpretation |
|----------|----------|--------------|----------------|
| `tenure` | < 0.0001 | ✅ Significant | Churned customers have a **shorter mean tenure** than retained customers. |
| `MonthlyCharges` | < 0.0001 | ✅ Significant | Churned customers pay **higher monthly charges** on average. |
| `TotalCharges` | < 0.0001 | ✅ Significant | Churned customers have **paid less in total**, consistent with their being newer customers. |

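**Interpretation:**  
The means of all three continuous features differ significantly between churned and retained customers (the p-values are reported as 0.0 in the output, i.e. below the displayed precision).  
New customers (low tenure) with higher monthly charges are the most likely to churn, while long-term, lower-cost customers tend to stay. The t-test table reports no effect size, so these p-values show that a difference exists, not how large it is.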
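
The comparison behind one row of the t-test table can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation of `t_test_analysis` (which is not shown in this notebook). It assumes Welch's variant of the test (`equal_var=False`) and adds group means and Cohen's d, which the original table does not report, to show direction and size of the difference. The function name `compare_groups` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for the churn data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
churn = rng.integers(0, 2, size=n)
# Churners are deliberately given shorter tenure so the test has something to detect.
tenure = np.where(churn == 1, rng.normal(18, 10, n), rng.normal(38, 15, n)).clip(min=0)
toy = pd.DataFrame({"Churn": churn, "tenure": tenure})


def compare_groups(df, target, feature, alpha=0.05):
    """Hypothetical helper: Welch's t-test, group means, and Cohen's d for one feature."""
    stayed = df.loc[df[target] == 0, feature]
    churned = df.loc[df[target] == 1, feature]

    # equal_var=False selects Welch's t-test (assumption: the source does not
    # say which variant its helper uses).
    t_stat, p_value = ttest_ind(churned, stayed, equal_var=False)

    # Cohen's d with a pooled standard deviation as a simple effect size.
    pooled_var = (
        (len(churned) - 1) * churned.var(ddof=1)
        + (len(stayed) - 1) * stayed.var(ddof=1)
    ) / (len(churned) + len(stayed) - 2)
    cohens_d = (churned.mean() - stayed.mean()) / np.sqrt(pooled_var)

    return {
        "Feature": feature,
        "mean_churned": churned.mean(),
        "mean_stayed": stayed.mean(),
        "t_stat": t_stat,
        "p-value": p_value,
        "cohens_d": cohens_d,
        "Reject_H0": "Yes" if p_value < alpha else "No",
    }


result = compare_groups(toy, "Churn", "tenure")
print(result)
```

Reporting the group means next to the p-value makes the direction of each claim in the table checkable, and an effect size such as Cohen's d separates large differences from ones that are significant only because the sample is large.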

---

## 🧩 2. Chi-Squared Test Results (Categorical/Binary Features)

### ✅ Significant Predictors (p < 0.05)

| Strength | Features | Insight |
|-----------|-----------|----------|
| **Moderate** | `Contract_Two year`, `PaymentMethod_Electronic check`, `InternetService_Fiber optic` | These features show the strongest association with churn. Fiber-optic users and electronic-check payers have the highest churn rates, while customers on two-year contracts churn far less often (Cramér's V = 0.30 for `Contract_Two year`). |
| **Weak** | `Partner`, `Dependents`, `SeniorCitizen`, `PaperlessBilling`, `TechSupport_Yes`, `OnlineSecurity_Yes`, `Contract_One year` | Customers with partners, dependents, or technical support/security services are **less likely** to churn. The test output also labels `PaymentMethod_Credit card (automatic)` (V = 0.13) and the `*_No internet service` indicators (V = 0.23) as weak. |
| **Very Weak** | `DeviceProtection_Yes`, `OnlineBackup_Yes`, `StreamingTV_Yes`, `StreamingMovies_Yes` | Statistically significant, but with very small effect sizes (Cramér's V between 0.06 and 0.08 for `OnlineBackup_Yes`, `StreamingTV_Yes`, and `StreamingMovies_Yes`). |

### ❌ Not Significant (p ≥ 0.05)

| Features | Interpretation |
|-----------|----------------|
| `gender`, `PhoneService`, `MultipleLines_No phone service` | These have **no significant relationship** with churn and can be deprioritized in feature selection. |
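
The strength labels above depend on Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. The sketch below shows one way to compute it for a binary feature. It is not the code of `chi_square_test`; the helper names and toy counts are hypothetical, and the label thresholds (0.1, 0.3, 0.5) are an assumption inferred from the labels in the output above, where 0.0819 is "Very Weak", 0.1339 is "Weak", and 0.3019 is "Moderate".

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data with fixed counts (NOT the real dataset):
# 500 customers without a two-year contract (200 churned),
# 200 customers with one (10 churned).
toy = pd.DataFrame({
    "Contract_Two year": [0] * 500 + [1] * 200,
    "Churn": [0] * 300 + [1] * 200 + [0] * 190 + [1] * 10,
})


def cramers_v(x, y):
    """Return (Cramér's V, p-value) for two categorical series."""
    table = pd.crosstab(x, y)
    # Note: for 2x2 tables SciPy applies Yates' continuity correction by default;
    # whether the original helper does the same is unknown.
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # equals 1 for a 2x2 table
    return np.sqrt(chi2 / (n * k)), p_value


def strength_label(v):
    """Assumed thresholds, chosen to be consistent with the labels in the output."""
    if v < 0.1:
        return "Very Weak"
    if v < 0.3:
        return "Weak"
    if v < 0.5:
        return "Moderate"
    return "Strong"


v, p = cramers_v(toy["Contract_Two year"], toy["Churn"])
label = strength_label(v)
print(f"Cramér's V = {v:.4f}, p-value = {p:.3g}, strength = {label}")
```

With thousands of rows, even tiny associations produce p-values near zero, which is why the "Very Weak" features are still significant; V is the more useful number for ranking.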

---

## 🧠 3. Key Insights

| Category | Main Finding | Impact on Churn |
|-----------|---------------|----------------|
| **Contract Type** | Longer contracts, especially 2-year plans, significantly reduce churn. | 🔽 Decreases churn |
| **Payment Method** | Customers using electronic checks churn at much higher rates. | 🔼 Increases churn |
| **Internet Service** | Fiber optic users are more likely to leave than DSL users. | 🔼 Increases churn |
| **Tenure & Charges** | New customers with high monthly charges are the most at-risk group. | 🔼 Increases churn |
| **Customer Profile** | Having a partner or dependents correlates with higher loyalty. | 🔽 Decreases churn |
| **Gender & Phone Service** | Gender and phone service show no statistically significant association with churn. | ➖ Negligible impact |
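
Neither test statistic carries a sign, so the "increases" and "decreases" arrows in this table have to come from comparing churn rates between groups. A minimal sketch of that check, using toy counts rather than the real data and a hypothetical helper name:

```python
import pandas as pd

# Toy data with fixed counts (NOT the real dataset).
toy = pd.DataFrame({
    "Contract_Two year": [0] * 500 + [1] * 200,
    "Churn": [0] * 300 + [1] * 200 + [0] * 190 + [1] * 10,
})


def churn_rate_by_group(df, feature, target="Churn"):
    """Mean of a 0/1 target per feature level, i.e. the churn rate in each group."""
    return df.groupby(feature)[target].mean()


rates = churn_rate_by_group(toy, "Contract_Two year")
# Negative difference: the group with the feature churns less often.
difference = rates.loc[1] - rates.loc[0]
print(rates)
print(f"Difference in churn rate: {difference:+.2f}")
```

Running the same comparison for each feature in the real data would confirm the direction of every row in the table.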

---

## 💼 4. Business Implications

- **Retention Focus:** Target **fiber-optic** and **electronic-check** users with retention offers or loyalty incentives.  
- **Early Engagement:** Provide **onboarding support** or discounts for **new, high-paying** customers.  
- **Contract Strategy:** Encourage **long-term contracts (1–2 years)** to stabilize customer retention.  
- **Service Bundles:** Promote **Tech Support** and **Online Security** services that correlate with lower churn.
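
The segments named in this list can be turned into a simple screening rule. The sketch below counts how many of the risk criteria each customer meets. The column names match the preprocessed data, but the sample rows, the function name, and the cut-offs (`max_tenure=12`, `min_monthly=70.0`) are illustrative assumptions, not values derived from the analysis.

```python
import pandas as pd

# Hypothetical sample rows using the column names of the preprocessed data.
customers = pd.DataFrame({
    "tenure": [2, 45, 5, 60, 3],
    "MonthlyCharges": [95.0, 42.3, 89.5, 25.0, 30.0],
    "InternetService_Fiber optic": [1, 0, 1, 0, 0],
    "PaymentMethod_Electronic check": [1, 0, 0, 0, 1],
})


def add_risk_flags(df, max_tenure=12, min_monthly=70.0):
    """Count how many of three assumed risk criteria each customer meets."""
    new_high_paying = (df["tenure"] <= max_tenure) & (df["MonthlyCharges"] >= min_monthly)
    fiber = df["InternetService_Fiber optic"] == 1
    e_check = df["PaymentMethod_Electronic check"] == 1
    flags = new_high_paying.astype(int) + fiber.astype(int) + e_check.astype(int)
    return df.assign(risk_flags=flags)


flagged = add_risk_flags(customers)
print(flagged.sort_values("risk_flags", ascending=False))
```

A count like this is only a first pass for choosing whom to contact; it weights all criteria equally, which a fitted model would not.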

---

## 🏁 5. Conclusion

> Statistical analysis shows that **tenure**, **charges**, **contract type**, **payment method**, and **internet service** are the features most strongly associated with churn.  
> Conversely, **gender** and **phone service** show no statistically significant association.  
> These findings will guide both **feature selection** for model building and **business decisions** for churn prevention.
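
For feature selection, the chi-squared results can be filtered and ranked directly. In the output above every displayed p-value is 0.0, so sorting by p-value cannot rank the features; sorting by `Cramer_V` can. The sketch below uses five rows copied from that output. The function name and the minimum effect size of 0.1 are assumptions for illustration.

```python
import pandas as pd

# Five rows copied from the chi-square output shown earlier in this notebook.
chi_square_rows = pd.DataFrame({
    "Feature": [
        "OnlineBackup_Yes",
        "PaymentMethod_Credit card (automatic)",
        "Contract_Two year",
        "Contract_One year",
        "StreamingMovies_Yes",
    ],
    "p-value": [0.0, 0.0, 0.0, 0.0, 0.0],
    "Reject_H0": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Cramer_V": [0.0819, 0.1339, 0.3019, 0.1774, 0.0611],
})


def shortlist_features(results, min_v=0.1):
    """Keep significant features with at least a weak effect, strongest first."""
    keep = results[(results["Reject_H0"] == "Yes") & (results["Cramer_V"] >= min_v)]
    return keep.sort_values("Cramer_V", ascending=False)["Feature"].tolist()


selected = shortlist_features(chi_square_rows)
print(selected)
```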


