# Customer Churn Prediction System

## Project Overview
This project builds an end-to-end machine learning system to predict customer churn for a telecommunications service using the IBM Telco Customer Churn dataset.

### Business Problem
Customer churn is a critical business metric that directly impacts revenue. By predicting which customers are likely to churn, businesses can:
- Implement targeted retention strategies
- Reduce the cost of acquiring replacement customers
- Improve customer lifetime value
- Optimize resource allocation for retention efforts

### Dataset
The Telco Customer Churn dataset contains information about:
- Customer demographics (gender, senior-citizen status, partner, dependents)
- Account information (tenure, contract, payment method)
- Service usage (phone, internet, streaming, etc.)
- Charges (monthly and total charges)
- Churn status (target variable)
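
The categories above can be written down as a small column map, which makes it easy to confirm that a loaded file has the expected schema. This is a minimal sketch: the column names follow the publicly distributed Telco Customer Churn CSV, while `COLUMN_GROUPS`, `missing_columns`, and the `sample` frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Assumed grouping of the public Telco Customer Churn columns (illustrative only).
COLUMN_GROUPS = {
    "demographics": ["gender", "SeniorCitizen", "Partner", "Dependents"],
    "account": ["tenure", "Contract", "PaperlessBilling", "PaymentMethod"],
    "services": [
        "PhoneService", "MultipleLines", "InternetService",
        "OnlineSecurity", "OnlineBackup", "DeviceProtection",
        "TechSupport", "StreamingTV", "StreamingMovies",
    ],
    "charges": ["MonthlyCharges", "TotalCharges"],
    "target": ["Churn"],
}


def missing_columns(frame, groups=COLUMN_GROUPS):
    """Return {group: [expected columns absent from frame]} for groups with gaps."""
    present = set(frame.columns)
    gaps = {}
    for group, columns in groups.items():
        absent = [col for col in columns if col not in present]
        if absent:
            gaps[group] = absent
    return gaps


# Tiny hypothetical frame that contains only three of the expected columns.
sample = pd.DataFrame({"gender": ["Female"], "tenure": [1], "Churn": ["No"]})
gaps = missing_columns(sample)
print(gaps)
```

Calling `missing_columns(df)` on the real data should return an empty dictionary if the file matches this assumed schema.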


## Part 1: Data Analysis & Preprocessing


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings

# Machine Learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Model interpretation
import shap

# Configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


### 1.1 Data Loading and Initial Exploration


In [None]:
# Load the dataset
df = pd.read_csv('data/Telco-Customer-Churn.csv')

print(f"Dataset shape: {df.shape}")
print("\nDataset info:")
df.info()  # info() prints its summary directly and returns None


In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()


In [None]:
# Check for missing values and data quality issues
print("Missing values:")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Values'] > 0])

# Check TotalCharges column for data quality issues
print(f"\nTotalCharges data type: {df['TotalCharges'].dtype}")
print(f"Sample TotalCharges values: {df['TotalCharges'].head(10).tolist()}")

# Check for non-numeric values in TotalCharges.
# errors='coerce' turns unparseable entries (e.g. blank strings) into NaN,
# so no try/except is needed to detect them.
total_charges_numeric = pd.to_numeric(df['TotalCharges'], errors='coerce')
non_numeric_mask = total_charges_numeric.isna() & df['TotalCharges'].notna()
non_numeric_count = non_numeric_mask.sum()
if non_numeric_count == 0:
    print("All TotalCharges values are numeric")
else:
    print(f"Non-numeric values in TotalCharges: {non_numeric_count}")
    print("Non-numeric values:", df.loc[non_numeric_mask, 'TotalCharges'].unique())
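
If the check above reports non-numeric entries, one possible follow-up is to convert the column to a numeric dtype and decide what the unparseable values should become. The sketch below works on a hypothetical three-row frame (`toy`) rather than the real data, and `clean_total_charges` is an illustrative helper, not part of this project. It assumes that blank `TotalCharges` values belong to customers with `tenure == 0`, who have not been billed yet, and sets those to `0.0`. That assumption should be verified against the real data before reuse.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset.
toy = pd.DataFrame({
    "tenure": [0, 2, 10],
    "MonthlyCharges": [52.55, 53.85, 70.70],
    "TotalCharges": [" ", "108.15", "707.0"],
})


def clean_total_charges(frame):
    """Return a copy with TotalCharges as float; blanks at tenure 0 become 0.0."""
    out = frame.copy()
    # Unparseable strings (such as a single space) become NaN.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Assumption: a missing total for a zero-tenure customer means nothing billed yet.
    unbilled = (out["tenure"] == 0) & out["TotalCharges"].isna()
    out.loc[unbilled, "TotalCharges"] = 0.0
    return out


cleaned = clean_total_charges(toy)
print(cleaned.dtypes)
print(cleaned)
```

Any NaN values that remain after this step (blank totals with non-zero tenure) are left in place on purpose, so they can be inspected rather than silently filled.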
