<a href="https://colab.research.google.com/github/nguyenhson03/Telco-Customer-churn/blob/main/Telco_Customer_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
blastchar_telco_customer_churn_path = kagglehub.dataset_download('blastchar/telco-customer-churn')

print('Data source import complete.')


# 📊 Telco Customer Churn  

## 📌 About Dataset  
_"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs."_  
📍 **Source:** [IBM Sample Data Sets](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113)  

This dataset provides insights into **customer churn**—helping businesses understand why customers leave and how to improve retention strategies.  

---

## 📄 Content  
Each row represents a **customer**, and each column describes customer **attributes**.  

### 🔹 The dataset includes:  
- **Churn Status** – Whether the customer left within the last month 🏃‍♂️  
- **Services Signed Up** – Phone, multiple lines, internet, online security, backups, device protection, tech support, and streaming services 📡  
- **Account Details** – Tenure, contract type, payment method, paperless billing, monthly charges, and total charges 💳  
- **Demographics** – Gender, age range, presence of partners & dependents 👨‍👩‍👧‍👦  

---

## 💡 Inspiration  
- 🔍 What insights can be gained from different **customer behaviors**?  
- 📈 Can we predict churn based on features such as **tenure, contract type, and monthly charges**?  
- 🏆 Which customer segments churn the most, and why?  
- 📆 Are there noticeable differences in **customer retention** across contract types and payment methods?  
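
Questions like these usually start with a churn rate per customer segment. The sketch below shows one way to compute it with a pandas `groupby`. The `toy` rows are made up for illustration and say nothing about the real data; only the column names `Contract` and `Churn` come from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Telco file.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Share of churned customers within each contract type.
churn_rate = (
    toy["Churn"].eq("Yes")          # True where the customer churned
    .groupby(toy["Contract"])       # group the boolean flags by contract type
    .mean()                         # mean of booleans = churn rate
    .sort_values(ascending=False)
)
print(churn_rate)
```

The same pattern should work on the full DataFrame by swapping `toy` for `df` and `Contract` for any other categorical column.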

---

## 🏷️ Key Features  
| Feature | Description |
|---------|------------|
| **🆔 customerID** | Unique Customer ID |
| **👤 gender** | Customer gender (Male/Female) |
| **👴 SeniorCitizen** | Whether the customer is a senior (1 = Yes, 0 = No) |
| **💑 Partner** | Whether the customer has a partner (Yes/No) |
| **👶 Dependents** | Whether the customer has dependents (Yes/No) |
| **📅 tenure** | Number of months the customer has stayed with the company |
| **📞 PhoneService** | Whether the customer has phone service (Yes/No) |
| **📡 MultipleLines** | Whether they have multiple lines (Yes/No/No phone service) |
| **🌐 InternetService** | Type of internet service (DSL/Fiber optic/No) |
| **🔐 OnlineSecurity** | Whether they have online security (Yes/No/No internet service) |
| **💰 MonthlyCharges** | Customer's monthly bill amount |
| **💳 TotalCharges** | Total amount billed |
| **📆 Contract** | Type of contract (Month-to-month, One year, Two year) |
| **💵 PaymentMethod** | How the customer pays (Electronic check, Bank transfer, etc.) |
| **📜 Churn** | Whether the customer has churned (Yes/No) |
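
The table mixes two encodings: `SeniorCitizen` is already numeric (1/0), while columns such as `Partner`, `Dependents`, and `Churn` hold Yes/No strings. A minimal sketch of bringing them onto the same 1/0 scale follows. The `sample` rows, including the customer IDs, are made up for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only; values are not from the real file.
sample = pd.DataFrame({
    "customerID": ["A-001", "A-002", "A-003"],
    "SeniorCitizen": [0, 1, 0],            # already 1/0
    "Partner": ["Yes", "No", "No"],
    "Dependents": ["No", "No", "Yes"],
    "Churn": ["No", "Yes", "No"],
})

yes_no_columns = ["Partner", "Dependents", "Churn"]

# Map Yes/No strings to 1/0 so they match the SeniorCitizen encoding.
encoded = sample.copy()
encoded[yes_no_columns] = encoded[yes_no_columns].apply(
    lambda col: col.map({"Yes": 1, "No": 0})
)
print(encoded)
```

Columns with a third value, such as `MultipleLines`, would need a different treatment (for example one-hot encoding), since a plain Yes/No map would leave the third value as NaN.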

---



# 📊 Data Analysis: EDA & Visualization

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

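# Load dataset from the directory returned by kagglehub.dataset_download above
# (the hard-coded /kaggle/input path only exists on Kaggle itself)
import os
file_path = os.path.join(blastchar_telco_customer_churn_path, "WA_Fn-UseC_-Telco-Customer-Churn.csv")
df = pd.read_csv(file_path)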

# Display basic information
print("\nDataset Overview:")
df.info()  # info() prints its summary directly and returns None

# Convert 'TotalCharges' to numeric; entries that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check missing values (after the conversion, so unparseable entries are counted)
print("\nMissing Values:")
print(df.isnull().sum())

# Fill missing 'TotalCharges' with the column median
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Customer Churn distribution
plt.figure(figsize=(7, 4))
# Assumes seaborn >= 0.13, where palette without hue is deprecated
sns.countplot(data=df, x='Churn', hue='Churn', palette='coolwarm', legend=False)
plt.title("Customer Churn Distribution")
plt.xlabel("Churn")
plt.ylabel("Count")
plt.show()

# Correlation heatmap (excluding categorical variables)
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
corr_matrix = df[numeric_features].corr()

plt.figure(figsize=(7, 5))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Between Numeric Features")
plt.show()

# Monthly Charges distribution
plt.figure(figsize=(10, 5))
sns.histplot(df['MonthlyCharges'], bins=40, kde=True)
plt.title("Monthly Charges Distribution")
plt.xlabel("Monthly Charges ($)")
plt.ylabel("Frequency")
plt.show()
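
The `errors='coerce'` conversion of `TotalCharges` in this cell matters because a column read as text can hide missing entries from `isnull()`. The sketch below illustrates this on made-up values, assuming the missing entries are stored as blank strings; `raw` is a hypothetical stand-in for the column, not data from the file.

```python
import pandas as pd

# Hypothetical raw values; the second entry is a blank string.
raw = pd.Series(["29.85", " ", "1889.5"])

# A null check on the raw strings finds nothing, because " " is a valid string.
missing_before = int(raw.isna().sum())

# errors="coerce" turns anything that cannot be parsed into NaN.
coerced = pd.to_numeric(raw, errors="coerce")
missing_after = int(coerced.isna().sum())

# Median imputation, as used in the cell above.
filled = coerced.fillna(coerced.median())

print(missing_before, missing_after)
print(filled.tolist())
```

Median imputation is only one option. If the blank entries turn out to belong to customers with `tenure` of 0, filling with 0 may be the more natural choice.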


# 🤖 Machine Learning: Predicting Customer Churn

In [None]:
# Import ML libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report

# Select relevant features
features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Contract', 'PaymentMethod', 'InternetService']
target = 'Churn'

df = df.dropna(subset=[target])  # Remove missing values in target
X = df[features]
y = df[target].map({'Yes': 1, 'No': 0})  # Convert Churn to binary (1 for Yes, 0 for No)

# Handling categorical data
categorical_features = ['Contract', 'PaymentMethod', 'InternetService']
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Preprocessing pipeline
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combining transformers
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Splitting data (stratified so train and test keep the same churn ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
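
Accuracy alone can look good when one class dominates, so it helps to compare against a trivial baseline and to report a threshold-independent metric. The sketch below does this on synthetic data, not the Telco data. `X_toy` and `y_toy` are hypothetical, generated so that roughly a quarter of the labels are positive, and the `DummyClassifier` baseline and ROC AUC are additions that the notebook itself does not use.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only.
rng = np.random.default_rng(42)
n = 1000
X_toy = rng.normal(size=(n, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=n) > 1.0).astype(int)

# stratify keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
forest_auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Forest accuracy:   {forest_acc:.2f}")
print(f"Forest ROC AUC:    {forest_auc:.2f}")
```

If the forest's accuracy is only slightly above the baseline, the per-class recall in the classification report and the ROC AUC give a better picture of how well churners are actually identified.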
