# Analysis and Prediction of Telco Customer Churn

### 1. Project Overview
This project analyzes a telecom customer dataset to identify key drivers of churn. The goal is to build a classification model that can predict which customers are most likely to leave, enabling the business to take proactive retention measures.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the full dataset from the .csv file
file_path = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'
churn_df = pd.read_csv(file_path)

# Display basic information and the first 5 rows
print(f"Dataset loaded successfully. Shape of the data: {churn_df.shape}")
print("\nData Info:")
churn_df.info()

print("\nSample of the data:")
display(churn_df.head())

### 2. Data Cleaning and Preprocessing

Real-world data often needs cleaning. We need to handle non-numeric values before we can use the data in a machine learning model.

* The `TotalCharges` column should be numeric, but some rows contain blank strings. We'll convert it to a numeric type.
* Categorical columns like `gender` and `Partner` would need to be encoded as numbers before a model could use them. For simplicity, this notebook models only the numeric columns.
* We will convert the target variable `Churn` from 'Yes'/'No' to 1/0.
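
To make the `TotalCharges` point concrete, here is a minimal sketch of how `pd.to_numeric` with `errors='coerce'` treats blank strings. The series below is a made-up stand-in, not rows from the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for TotalCharges: numeric strings plus blank entries.
raw_charges = pd.Series(["29.85", "1889.5", " ", "108.15", " "])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric_charges = pd.to_numeric(raw_charges, errors="coerce")

n_missing = int(numeric_charges.isna().sum())
print(numeric_charges)
print(f"Values coerced to NaN: {n_missing}")
```

Counting the NaNs right after the conversion is a cheap way to see how many rows a later drop or fill step will affect.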

In [None]:
# Handle non-numeric values in 'TotalCharges'
churn_df['TotalCharges'] = pd.to_numeric(churn_df['TotalCharges'], errors='coerce')
# Drop the rows where the conversion produced missing values (NaNs)
churn_df.dropna(inplace=True)

# Convert 'Churn' column from Yes/No to 1/0
churn_df['Churn'] = (churn_df['Churn'] == 'Yes').astype(int)

# Select features and target
# For simplicity, we'll use only the numeric features for this model
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
target = 'Churn'

X = churn_df[numeric_features]
y = churn_df[target]

print("Data preprocessing complete. Features used for modeling:")
print(X.columns.tolist())
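
The numeric-only feature set leaves out categorical columns such as `gender` and `Partner`. One common way to bring them in is one-hot encoding with `pd.get_dummies`. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the dataset) and is an optional extension rather than part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's column names.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "Yes"],
    "tenure": [1, 34, 2],
})

# drop_first=True keeps one indicator per binary column and avoids redundant columns.
encoded = pd.get_dummies(
    toy, columns=["gender", "Partner"], drop_first=True, dtype=int
)
print(encoded)
```

With `drop_first=True`, each two-level column becomes a single 0/1 indicator (`gender_Male`, `Partner_Yes`), which is usually enough for linear models.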



In [None]:
# --- EDA: Visualize Churn vs. Key Features ---
# Compare average MonthlyCharges for customers who stayed vs. customers who churned.

# Calculate average monthly charges by churn status
avg_charges_by_churn = churn_df.groupby('Churn')['MonthlyCharges'].mean()

# Create the bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x=avg_charges_by_churn.index, y=avg_charges_by_churn.values, palette='viridis')

# Add titles and labels for clarity
plt.title('Average Monthly Charges by Churn Status', fontsize=16)
plt.xlabel('Churn Status', fontsize=12)
plt.ylabel('Average Monthly Charges ($)', fontsize=12)
# Make x-axis labels more readable
plt.xticks([0, 1], ['Stayed', 'Churned']) 
plt.show()

### 3. Predictive Modeling

Now that our data is clean, we can split it into training and test sets, train two classifiers (a Decision Tree and a Logistic Regression), and compare their performance.
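
Churn labels are typically imbalanced, so it can help to keep the class ratio the same in both splits. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels. The 25% positive rate is an assumption chosen for illustration, not a figure from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 250 positives and 750 negatives (assumed ratio, for illustration only).
y_toy = np.array([1] * 250 + [0] * 750)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy asks the splitter to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```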

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Model 1: Decision Tree ---
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_score = accuracy_score(y_test, dt_predictions)

# --- Model 2: Logistic Regression ---
# A higher max_iter gives the solver more room to converge on unscaled features
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_score = accuracy_score(y_test, lr_predictions)

print("--- MODEL PERFORMANCE COMPARISON ---")
print(f"Decision Tree Accuracy: {dt_score:.4f}")
print(f"Logistic Regression Accuracy: {lr_score:.4f}")
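
Accuracy alone can be misleading when one class dominates, because a model that always predicts "stayed" already scores close to the majority share. A rough sketch of two sanity checks follows: a majority-class baseline via `DummyClassifier`, and per-class precision and recall via `classification_report`. It runs on synthetic data from `make_classification` (the 75/25 class weights are an assumption), so the numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumed 75/25 split).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Actual model, scored with the same metric for comparison.
model = LogisticRegression().fit(X_tr, y_tr)
model_preds = model.predict(X_te)
model_acc = accuracy_score(y_te, model_preds)

# Per-class precision/recall shows how well the minority class is caught.
report = classification_report(y_te, model_preds, target_names=["Stayed", "Churned"])

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
print(report)
```

If a model's accuracy is only slightly above the baseline, the recall for the "Churned" class is usually the more informative number for a retention use case.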

### 4. Model Interpretation: What Drives Churn?

Let's inspect the Decision Tree model to understand which of our numeric features it found most important.

In [None]:
# Get feature importances from the trained Decision Tree model
feature_importances = dt_model.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'feature': numeric_features, 'importance': feature_importances})
importance_df = importance_df.sort_values(by='importance', ascending=False)

print("--- FEATURE IMPORTANCE ANALYSIS (from Decision Tree) ---")
display(importance_df)
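
The `feature_importances_` attribute of a tree is impurity-based and is computed from the training data. As a cross-check, scikit-learn also offers `permutation_importance`, which measures how much the score drops on held-out data when a feature's values are shuffled. The sketch below compares the two on synthetic data. The feature names are reused from this notebook only as labels, and the data itself is generated, not real.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the names below are borrowed labels, not the real columns.
feature_names = ["tenure", "MonthlyCharges", "TotalCharges"]
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances (normalized to sum to 1).
impurity_importance = tree.feature_importances_

# Permutation importances, evaluated on the held-out split.
perm = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=42)

comparison = pd.DataFrame({
    "feature": feature_names,
    "impurity_importance": impurity_importance,
    "permutation_importance": perm.importances_mean,
}).sort_values(by="permutation_importance", ascending=False)
print(comparison)
```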

### 5. Prediction on a New Customer

Finally, let's use our trained Decision Tree model to predict whether a new, hypothetical customer with specific features is likely to churn. This demonstrates the practical application of our model.

In [None]:
# Create a dictionary for a new customer's data
# The keys must match the feature names used for training
new_customer_data = {
    'tenure': [12],           # 12 months with the company
    'MonthlyCharges': [75.5], # 75.5 monthly charge
    'TotalCharges': [900.5]   # 900.5 total charges
}

# Convert the dictionary to a DataFrame to ensure feature names are correct
new_customer_df = pd.DataFrame(new_customer_data)

# Use the trained Decision Tree model to make a prediction
prediction = dt_model.predict(new_customer_df)

print("\n--- PREDICTION FOR A NEW CUSTOMER ---")
if prediction[0] == 1:
    print("Prediction Result: This new customer is LIKELY TO CHURN.")
else:
    print("Prediction Result: This new customer is LIKELY TO STAY.")
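
A hard yes/no label hides how confident the model is. For retention work, a churn probability from `predict_proba` is often more useful because customers can be ranked by risk. The sketch below is self-contained and trains on generated data with an invented labeling rule, so the resulting probability is purely illustrative. It also limits tree depth (an assumption, not a setting used above), since a fully grown tree tends to have pure leaves and returns probabilities of only 0 or 1.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Generated stand-in data; the labeling rule is invented for illustration.
rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
toy_X = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})
churn_chance = 1 / (1 + np.exp(0.05 * tenure - 0.02 * monthly))
toy_y = (rng.random(n) < churn_chance).astype(int)

# A shallow tree keeps leaves mixed, so probabilities are graded rather than 0/1.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(toy_X, toy_y)

new_customer = pd.DataFrame(
    {"tenure": [12], "MonthlyCharges": [75.5], "TotalCharges": [900.5]}
)

# predict_proba returns one column per class, ordered as in classes_.
proba = shallow_tree.predict_proba(new_customer)[0]
churn_probability = proba[list(shallow_tree.classes_).index(1)]
print(f"Estimated churn probability: {churn_probability:.2f}")
```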