# Analysis and Prediction of Customer Churn

### 1. Project Overview
This project analyzes a telecom customer dataset to identify key drivers of churn. The primary goals are to understand the characteristics of customers who leave the service and to build a predictive model that can identify at-risk customers. This is a binary classification problem (churned vs. stayed), and flagging at-risk customers early gives the business a chance to retain them.

First, let's set up our environment and load the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Small illustrative customer churn dataset (10 rows)
# churn: 1 = Customer left, 0 = Customer stayed
# tenure: How many months the customer has been with the company
# monthly_charges: The customer's monthly bill
# age: The customer's age
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)

# Display the first 5 rows to understand the data
churn_df.head()
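
Before comparing groups, it helps to confirm how many rows there are and how the two classes are balanced, since accuracy is easiest to interpret when neither class dominates. The snippet below is a minimal sketch that repeats the sample data so it runs on its own; `n_rows`, `n_cols`, and `class_counts` are names introduced here for illustration.

```python
import pandas as pd

# Same sample data as the cell above, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)

# Number of rows and columns in the dataset
n_rows, n_cols = churn_df.shape

# How many customers fall into each class (1 = churned, 0 = stayed)
class_counts = churn_df['churn'].value_counts().sort_index()

print(f"Rows: {n_rows}, columns: {n_cols}")
print(class_counts)
```

Here the classes are evenly split, five customers each.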

### 2. Exploratory Data Analysis (EDA)
My initial hypothesis was that customers with higher monthly charges would be more likely to churn. Let's test this by analyzing the average monthly charges for customers who churned versus those who did not.

In [None]:
# Calculate and print the average monthly charges by churn status
avg_charges_by_churn = churn_df.groupby('churn')['monthly_charges'].mean()

print("--- Average Monthly Charges by Churn Status ---")
print(avg_charges_by_churn)

# Visualize the result
plt.figure(figsize=(8, 5))
sns.barplot(x=avg_charges_by_churn.index, y=avg_charges_by_churn.values)
plt.title('Average Monthly Charges for Churn vs. Non-Churn Customers')
plt.xlabel('Churn Status (1 = Churned, 0 = Stayed)')
plt.ylabel('Average Monthly Charges')
plt.show()
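
The same group comparison can be extended to every feature, which helps show whether monthly charges stand out or simply move together with tenure and age. The following is a minimal sketch on the sample data; `group_means` and `churn_corr` are illustrative names, and the Pearson correlation is only a rough indicator on ten rows.

```python
import pandas as pd

# Same sample data as in the setup cell, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)

# Mean of every feature for stayed (0) vs. churned (1) customers
group_means = churn_df.groupby('churn').mean()

# Pearson correlation of each feature with the churn label
churn_corr = churn_df.corr()['churn'].drop('churn')

print("--- Feature Means by Churn Status ---")
print(group_means)
print("\n--- Correlation with Churn ---")
print(churn_corr)
```

A negative correlation means higher values of that feature go with staying rather than churning.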

### 3. Predictive Modeling
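The EDA shows a surprising result: in this sample, customers who churned paid less on average (57.0 per month) than those who stayed (103.4). This contradicts my initial hypothesis, at least for these ten customers, and suggests that other factors, such as tenure or age, may matter more.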

Now, let's build machine learning models to predict churn based on all available features. We will prepare the data and then compare two different classification models: **Decision Tree** and **Logistic Regression**.

In [None]:
# Define features (X) and target (y)
X = churn_df.drop(columns=['churn'])
y = churn_df['churn']

# Split data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
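
With only ten rows, a purely random split can leave the three test customers with an uneven class mix. One option, shown here as a sketch rather than as the split used in the rest of this notebook, is to pass `stratify=y` so that both classes appear on each side. The `_s` suffix on the variable names is an illustrative choice to avoid overwriting the split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same sample data as in the setup cell, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)
X = churn_df.drop(columns=['churn'])
y = churn_df['churn']

# stratify=y keeps the class proportions as similar as possible in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training rows: {len(X_train_s)}, test rows: {len(X_test_s)}")
print("Test set class counts:")
print(y_test_s.value_counts().sort_index())
```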

In [None]:
# --- Model 1: Decision Tree ---
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_score = accuracy_score(y_test, dt_predictions)

# --- Model 2: Logistic Regression ---
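# max_iter is raised because the default (100) may not converge on unscaled features;
# random_state is omitted because the default lbfgs solver does not use it
lr_model = LogisticRegression(max_iter=1000)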
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_score = accuracy_score(y_test, lr_predictions)

print("--- MODEL PERFORMANCE COMPARISON ---")
print(f"Decision Tree Accuracy: {dt_score:.2f}")
print(f"Logistic Regression Accuracy: {lr_score:.2f}")
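
A 30% hold-out of ten rows leaves three test customers, so each accuracy above can only be 0, 1/3, 2/3, or 1. Stratified k-fold cross-validation is one way to get a somewhat steadier comparison. The sketch below assumes 5 folds and, as an extra step not used elsewhere in this notebook, standardizes the features before Logistic Regression; `cv_scores` is an illustrative name.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same sample data as in the setup cell, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)
X = churn_df.drop(columns=['churn'])
y = churn_df['churn']

# 5 stratified folds: each fold holds out two customers, one from each class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    # Scaling inside the pipeline is refit on each training fold
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
}

# Collect the per-fold accuracy of each model
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f"{name}: mean accuracy {cv_scores[name].mean():.2f} "
          f"over {len(cv_scores[name])} folds")
```

Even with cross-validation, ten rows are too few to rank the models reliably; the point is the workflow, not the numbers.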

### 4. Model Interpretation: What Drives Churn?
An accuracy score tells us *how* well the model performs, but not *why*. Let's inspect the Decision Tree model to understand which features it found most important when making a prediction.

In [None]:
# Get feature importances from the trained Decision Tree model
feature_importances = dt_model.feature_importances_
feature_names = X.columns

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
importance_df = importance_df.sort_values(by='importance', ascending=False)

print("--- FEATURE IMPORTANCE ANALYSIS (from Decision Tree) ---")
print(importance_df)
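
Feature importances summarize the tree, but printing the learned rules shows exactly where it splits. The sketch below refits the same Decision Tree on the same split and prints its rules with scikit-learn's `export_text`; `tree_rules` is an illustrative name. On a sample this small and cleanly separated, the tree may need only a single split, in which case one feature receives all of the importance and the others receive none.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same sample data as in the setup cell, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)
X = churn_df.drop(columns=['churn'])
y = churn_df['churn']

# Same split and model settings as in the modeling section
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Render the fitted tree as indented if/else rules
tree_rules = export_text(dt_model, feature_names=list(X.columns))

print(tree_rules)
print(f"Tree depth: {dt_model.get_depth()}")
```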

### 5. Prediction on a New Customer
Finally, let's use our trained Decision Tree model to predict whether a new, hypothetical customer is likely to churn.

In [None]:
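# New customer features: tenure=6, monthly_charges=90, age=40
# Use a DataFrame with the training column names so scikit-learn
# does not warn about missing feature names
new_customer = pd.DataFrame([[6, 90, 40]], columns=X.columns)
prediction = dt_model.predict(new_customer)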

print("\n--- NEW CUSTOMER CHURN PREDICTION ---")
if prediction[0] == 1:
    print("Prediction: This new customer is LIKELY TO CHURN.")
else:
    print("Prediction: This new customer is LIKELY TO STAY.")
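
A hard label hides how confident the model is. As a sketch, `predict_proba` returns one probability per class for the same hypothetical customer; `churn_proba` is an illustrative name. A fully grown tree fitted on so few rows tends to have pure leaves, so the probabilities will often be exactly 0 or 1 and should not be read as calibrated risk estimates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same sample data as in the setup cell, repeated so this snippet runs on its own
data = {
    'tenure': [2, 48, 5, 12, 55, 24, 1, 70, 3, 65],
    'monthly_charges': [50, 100, 60, 75, 110, 80, 45, 115, 55, 112],
    'age': [25, 50, 22, 30, 60, 28, 21, 55, 24, 48],
    'churn': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
}
churn_df = pd.DataFrame(data)
X = churn_df.drop(columns=['churn'])
y = churn_df['churn']

# Same split and model settings as in the modeling section
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Hypothetical new customer: tenure=6, monthly_charges=90, age=40
new_customer = pd.DataFrame([[6, 90, 40]], columns=X.columns)

# One probability per class, in the order given by dt_model.classes_
churn_proba = dt_model.predict_proba(new_customer)[0]

for label, proba in zip(dt_model.classes_, churn_proba):
    print(f"P(churn = {label}) = {proba:.2f}")
```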