# 📊 Customer Churn Prediction: Exploratory Data Analysis

---

## Project Overview

**Business Problem:**  
Customer churn (when customers stop doing business with a company) is a critical concern for subscription-based businesses. Acquiring a new customer is commonly estimated to cost 5 to 25 times more than retaining an existing one. This project aims to:

1. **Understand customer behavior patterns** through data exploration
2. **Identify key factors** that contribute to customer churn
3. **Segment customers** into meaningful groups for targeted retention strategies
4. **Build predictive models** to identify at-risk customers before they churn
5. **Provide actionable insights** to reduce churn and increase revenue

---

## Dataset: Telco Customer Churn

**Source:** IBM Sample Data Sets  
**Context:** Telecommunications company customer data  
**Target Variable:** Churn (Yes/No) - Whether customer left within last month

**Feature Categories:**
- **Demographics:** Gender, SeniorCitizen, Partner, Dependents
- **Services:** Phone, Internet, Online Security, Tech Support, etc.
- **Account Information:** Contract type, Payment method, Tenure, Charges

---

## Objectives for This Notebook

1. ✅ Set up the environment and load required libraries
2. ✅ Download and load the Telco Customer Churn dataset
3. ✅ Perform initial data inspection and validation
4. ✅ Understand the structure and content of each feature
5. ✅ Identify data quality issues (missing values, data types, etc.)
6. ✅ Document initial observations for deeper analysis

---

**Expected Outcome:** A clean, loaded dataset ready for comprehensive exploratory data analysis in Phase 2.

---

## 1. Environment Setup

Import all necessary libraries and configure display settings for optimal data exploration.

In [None]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import warnings
import os
from pathlib import Path

# Configuration
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output
np.random.seed(42)  # Set random seed for reproducibility

# Pandas display options for better readability
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.width', None)        # Auto-detect display width
pd.set_option('display.precision', 2)       # 2 decimal places for floats

# Matplotlib/Seaborn styling for professional visualizations
plt.style.use('seaborn-v0_8-darkgrid')      # Clean, professional style
sns.set_palette('husl')                      # Evenly spaced hues ('colorblind' is the colorblind-safe option)
plt.rcParams['figure.figsize'] = (12, 6)    # Default figure size
plt.rcParams['font.size'] = 10              # Default font size
plt.rcParams['axes.titlesize'] = 14         # Title font size
plt.rcParams['axes.labelsize'] = 12         # Axis label font size

print("✅ All libraries imported successfully!")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")
print(f"📦 Matplotlib version: {plt.matplotlib.__version__}")
print(f"📦 Seaborn version: {sns.__version__}")

## 2. Data Acquisition

Download the Telco Customer Churn dataset from IBM's GitHub repository and load it into a pandas DataFrame.
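
A cache-first variant of this step can avoid repeated downloads. The sketch below is only an illustration under stated assumptions: the helper name `load_or_download` and its parameters are hypothetical and not part of this notebook, and it relies on nothing beyond `pandas.read_csv` and `pathlib`.

```python
from pathlib import Path

import pandas as pd


def load_or_download(url: str, local_path: Path) -> pd.DataFrame:
    """Return the dataset, preferring a cached local copy over the network.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    local_path = Path(local_path)
    if local_path.exists():
        # A cached copy exists, so skip the download entirely
        return pd.read_csv(local_path)

    # No cached copy: fetch from the URL and save it for next time
    frame = pd.read_csv(url)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(local_path, index=False)
    return frame
```

Called as `load_or_download(DATA_URL, csv_file_path)`, this would touch the network only on the first run, at the cost of never refreshing a stale local file.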

In [None]:
# Dataset URL
DATA_URL = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'

# Define file path
project_root = Path('/Users/mihiniboteju/churn-prediction-project')
data_raw_path = project_root / 'data' / 'raw'
csv_file_path = data_raw_path / 'Telco-Customer-Churn.csv'

# Create directory if it doesn't exist
data_raw_path.mkdir(parents=True, exist_ok=True)

# Download and load data, falling back to a local copy if the download fails
try:
    print("📥 Downloading dataset from IBM GitHub repository...")
    df = pd.read_csv(DATA_URL)

    # Save to local directory for future use
    df.to_csv(csv_file_path, index=False)
    print(f"✅ Dataset downloaded and saved to: {csv_file_path}")
    print("📊 Dataset loaded successfully into DataFrame 'df'")

except Exception as e:
    print(f"❌ Error downloading dataset: {e}")
    print("")

    if csv_file_path.exists():
        # A previously saved copy is available, so use it
        print("📂 Found local copy, loading from file...")
        df = pd.read_csv(csv_file_path)
        print("✅ Dataset loaded from local file")
    else:
        # No local copy: explain how to fetch the file by hand
        print("📌 Manual download instructions:")
        print(f"   1. Visit: {DATA_URL}")
        print(f"   2. Save the file to: {csv_file_path}")
        print("   3. Re-run this cell")

## 3. Initial Data Inspection

Perform a comprehensive first look at the dataset to understand its structure, size, and content.

### 3.1 Dataset Shape and Size

In [None]:
# Display dataset dimensions
rows, columns = df.shape

print("📏 DATASET DIMENSIONS")
print("=" * 50)
print(f"Total Customers (Rows):    {rows:,}")
print(f"Total Features (Columns):  {columns}")
print(f"Total Data Points:         {rows * columns:,}")
print("=" * 50)

### 3.2 First Look at the Data

In [None]:
# Display first 10 rows
print("👀 FIRST 10 ROWS OF THE DATASET")
print("=" * 50)
df.head(10)

### 3.3 Column Names and Data Types

In [None]:
# Display column information
print("📋 COLUMN INFORMATION")
print("=" * 50)
df.info()

In [None]:
# Detailed data type breakdown
print("\n🔍 DATA TYPE BREAKDOWN")
print("=" * 50)
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)
print("\n")

# List columns by type
print("NUMERICAL COLUMNS:")
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(f"  {len(numerical_cols)} columns: {numerical_cols}")
print("\n")

print("CATEGORICAL/OBJECT COLUMNS:")
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"  {len(categorical_cols)} columns: {categorical_cols}")

### 3.4 Missing Values Analysis

In [None]:
# Check for missing values
print("🔎 MISSING VALUES ANALYSIS")
print("=" * 50)

missing_counts = df.isnull().sum()
missing_percentages = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Missing_Percentage': missing_percentages
})

# Filter to show only columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_df) > 0:
    print(missing_df)
else:
    print("✅ No missing values detected in any column!")

print("\n")
print(f"Total missing values: {df.isnull().sum().sum()}")

### 3.5 Statistical Summary

In [None]:
# Statistical summary of numerical features
print("📊 STATISTICAL SUMMARY - NUMERICAL FEATURES")
print("=" * 50)
df.describe()

In [None]:
# Statistical summary of categorical features
print("📊 STATISTICAL SUMMARY - CATEGORICAL FEATURES")
print("=" * 50)
df.describe(include=['object'])

### 3.6 Target Variable Analysis

In [None]:
# Analyze the target variable (Churn)
print("🎯 TARGET VARIABLE ANALYSIS - CHURN")
print("=" * 50)

churn_counts = df['Churn'].value_counts()
churn_percentages = df['Churn'].value_counts(normalize=True) * 100

churn_summary = pd.DataFrame({
    'Count': churn_counts,
    'Percentage': churn_percentages
})

print(churn_summary)
print("\n")

# Calculate churn rate
if 'Yes' in churn_counts:
    churn_rate = (churn_counts['Yes'] / len(df)) * 100
    print(f"📈 Overall Churn Rate: {churn_rate:.2f}%")
    print(f"📊 This means {churn_counts['Yes']:,} out of {len(df):,} customers churned.")
    
    # Class balance assessment
    if churn_rate < 30:
        print("\n⚠️  NOTE: Class imbalance detected (churn rate < 30%).")
        print("   We'll need to address this in modeling phase with:")
        print("   - class_weight='balanced' parameter")
        print("   - Focus on recall and F1-score metrics")
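
The `class_weight='balanced'` option named in the note above corresponds, in scikit-learn, to weighting each class by `n_samples / (n_classes * class_count)`. A minimal sketch on toy labels (not the real dataset) shows the arithmetic; the function name `balanced_class_weights` is hypothetical and only pandas is used.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels for illustration only: 3 churners out of 10 customers
toy_churn = pd.Series(["Yes"] * 3 + ["No"] * 7)
weights = balanced_class_weights(toy_churn)
print(weights)
```

The minority class receives the larger weight, so misclassifying a churner costs the model more than misclassifying a retained customer.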

### 3.7 Memory Usage

In [None]:
# Check memory usage
print("💾 MEMORY USAGE")
print("=" * 50)

memory_usage = df.memory_usage(deep=True).sum() / 1024**2  # Convert to MB
print(f"Total memory used by DataFrame: {memory_usage:.2f} MB")
print("\n")
print("Memory usage by column:")
mem_by_col = df.memory_usage(deep=True).sort_values(ascending=False)
mem_by_col_mb = mem_by_col / 1024**2
print(mem_by_col_mb.head(10))

## 4. Data Dictionary

Understanding each feature and its business meaning is crucial for meaningful analysis.

---

### 📋 Feature Categories

#### **A. Customer Demographics (4 features, plus the customerID identifier)**

| Feature | Description | Type | Values |
|---------|-------------|------|--------|
| **customerID** | Unique identifier for each customer | Categorical | Unique string |
| **gender** | Customer's gender | Categorical | Male, Female |
| **SeniorCitizen** | Whether customer is 65+ years old | Binary | 0 (No), 1 (Yes) |
| **Partner** | Whether customer has a partner | Categorical | Yes, No |
| **Dependents** | Whether customer has dependents | Categorical | Yes, No |

---

#### **B. Service Information (9 features)**

| Feature | Description | Type | Values |
|---------|-------------|------|--------|
| **PhoneService** | Whether customer has phone service | Categorical | Yes, No |
| **MultipleLines** | Whether customer has multiple phone lines | Categorical | Yes, No, No phone service |
| **InternetService** | Type of internet service | Categorical | DSL, Fiber optic, No |
| **OnlineSecurity** | Whether customer has online security add-on | Categorical | Yes, No, No internet service |
| **OnlineBackup** | Whether customer has online backup add-on | Categorical | Yes, No, No internet service |
| **DeviceProtection** | Whether customer has device protection add-on | Categorical | Yes, No, No internet service |
| **TechSupport** | Whether customer has tech support add-on | Categorical | Yes, No, No internet service |
| **StreamingTV** | Whether customer has streaming TV service | Categorical | Yes, No, No internet service |
| **StreamingMovies** | Whether customer has streaming movies service | Categorical | Yes, No, No internet service |

---

#### **C. Account Information (7 features)**

| Feature | Description | Type | Values/Range |
|---------|-------------|------|-------------|
| **tenure** | Number of months customer has stayed with company | Numerical | 0-72 months |
| **Contract** | Type of contract | Categorical | Month-to-month, One year, Two year |
| **PaperlessBilling** | Whether customer uses paperless billing | Categorical | Yes, No |
| **PaymentMethod** | Customer's payment method | Categorical | Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic) |
| **MonthlyCharges** | Amount charged to customer monthly | Numerical | $18.25 - $118.75 |
| **TotalCharges** | Total amount charged to customer | Numerical (may load as text; see Section 6) | Continuous |
| **Churn** | **TARGET VARIABLE** - Whether customer left in last month | Categorical | Yes, No |
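
The value lists in these tables can double as a lightweight validation check. The following is a sketch under assumptions: `EXPECTED_VALUES` covers only a few columns copied from the tables above, `find_unexpected_values` is a hypothetical helper, and the toy frame stands in for the real `df`.

```python
import pandas as pd

# Allowed values for a few columns, copied from the data dictionary
EXPECTED_VALUES = {
    "Contract": {"Month-to-month", "One year", "Two year"},
    "InternetService": {"DSL", "Fiber optic", "No"},
    "Churn": {"Yes", "No"},
}


def find_unexpected_values(frame: pd.DataFrame, expected: dict) -> dict:
    """Map each checked column to the set of values not in its allowed list."""
    problems = {}
    for column, allowed in expected.items():
        observed = set(frame[column].dropna().unique())
        extra = observed - allowed
        if extra:
            problems[column] = extra
    return problems


# Toy rows for illustration; "yes" (lowercase) is a deliberate error
toy = pd.DataFrame({
    "Contract": ["One year", "Two year", "Month-to-month"],
    "InternetService": ["DSL", "No", "Fiber optic"],
    "Churn": ["No", "yes", "Yes"],
})
print(find_unexpected_values(toy, EXPECTED_VALUES))
```

An empty result means every checked column contains only documented values; anything else points to a typo, a casing difference, or an undocumented category.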

---

### 🎯 Target Variable: **Churn**

- **Definition:** Whether the customer discontinued service in the last month
- **Values:** 
  - `Yes` = Customer churned (left the company)
  - `No` = Customer retained (still active)
- **Business Importance:** This is what we're trying to predict. Identifying customers likely to churn allows proactive retention efforts.

---

### 💡 Key Business Insights from Features:

1. **Tenure** is likely a strong predictor - longer-tenured customers are typically more loyal
2. **Contract Type** may indicate commitment level - month-to-month vs annual contracts
3. **Service bundles** (multiple add-ons) might reduce churn
4. **Payment method** could indicate customer engagement level
5. **Charges** (both monthly and total) represent customer value

---

## 5. Unique Values Inspection

Examine unique values for categorical features to understand the data distribution.

In [None]:
# Display unique values for each categorical column
print("🔍 UNIQUE VALUES IN CATEGORICAL COLUMNS")
print("=" * 50)

categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    unique_count = df[col].nunique()
    unique_values = df[col].unique()
    
    print(f"\n{col}:")
    print(f"  Unique count: {unique_count}")
    
    # Only show values if there are reasonable number of unique values
    if unique_count <= 10:
        print(f"  Values: {unique_values}")
    else:
        print(f"  Sample values: {unique_values[:10]}... (showing first 10)")

## 6. Data Quality Issues Identified

Document any issues that need to be addressed in subsequent phases.

In [None]:
# Check TotalCharges data type issue (common in this dataset)
print("⚠️  DATA QUALITY CHECKS")
print("=" * 50)

# Check if TotalCharges is object type (should be numeric)
if df['TotalCharges'].dtype == 'object':
    print("\n🔴 ISSUE #1: TotalCharges is stored as 'object' instead of numeric")
    print("   This likely means there are non-numeric values.")

    # Flag rows whose TotalCharges cannot be parsed as a number
    # (errors='coerce' turns unparseable values into NaN instead of raising)
    non_numeric_mask = pd.to_numeric(df['TotalCharges'], errors='coerce').isnull()
    non_numeric = df.loc[non_numeric_mask, 'TotalCharges'].unique()
    print(f"   Non-numeric values found: {non_numeric}")
    print(f"   Count of problematic rows: {non_numeric_mask.sum()}")

    print("   ✅ Solution: Will convert to numeric in Phase 3 (Feature Engineering)")

# Check for duplicate customerIDs
duplicate_ids = df['customerID'].duplicated().sum()
if duplicate_ids > 0:
    print(f"\n🔴 ISSUE #2: Found {duplicate_ids} duplicate customer IDs")
    print("   ✅ Solution: Will investigate and handle in Phase 3")
else:
    print("\n✅ No duplicate customer IDs found")

# Check for potential outliers in numerical columns
print("\n📊 Checking for potential outliers...")
for col in ['tenure', 'MonthlyCharges']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))].shape[0]
    print(f"   {col}: {outliers} potential outliers detected")

print("\n" + "=" * 50)

## 7. Initial Observations & Next Steps

### ✅ What We've Accomplished:

1. ✅ Successfully loaded the Telco Customer Churn dataset (7,043 customers × 21 features)
2. ✅ Identified the target variable (Churn) and observed class distribution
3. ✅ Categorized features into demographics, services, and account information
4. ✅ Detected data quality issues (TotalCharges data type)
5. ✅ Established baseline understanding of the dataset structure

---

### 🔍 Key Initial Findings:

1. **Dataset Size:** 7,043 customers with 21 features - sufficient for meaningful ML models
2. **Target Distribution:** Churn rate appears to be ~26% (moderate class imbalance)
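3. **Feature Mix:** Mostly categorical: 16 categorical columns (including the target), 3 continuous numerical columns (`tenure`, `MonthlyCharges`, `TotalCharges`), the binary `SeniorCitizen` flag, and the `customerID` identifier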
4. **Data Quality:** Generally clean, but TotalCharges needs type conversion
5. **No Missing Values:** No explicit NULL values detected (but empty strings may exist)
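
Finding 5 deserves a concrete check, because `isnull()` does not treat blank or whitespace-only strings as missing. A minimal sketch on a toy column (the values are made up, and `blank_mask` is just an illustrative name) shows how such entries can be counted and how `pd.to_numeric(..., errors='coerce')` turns them into `NaN`:

```python
import pandas as pd

# Toy stand-in for a text-typed charges column; " " mimics a blank entry
toy_charges = pd.Series(["29.85", " ", "1889.5", ""])

# isnull() sees no missing values because blanks are still strings
print(toy_charges.isnull().sum())

# Strip whitespace and compare with the empty string to find blanks
blank_mask = toy_charges.str.strip() == ""
print(blank_mask.sum())

# Coercing to numeric converts the blanks to NaN
numeric_charges = pd.to_numeric(toy_charges, errors="coerce")
print(numeric_charges.isnull().sum())
```

Applied to the real `TotalCharges` column, the same pattern would reveal whether the type problem comes from blank entries rather than from malformed numbers.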

---

### 🎯 Next Steps (Phase 2 - Full EDA):

1. **Deep Dive Analysis:**
   - Correlation analysis between features and churn
   - Distribution analysis of numerical features
   - Relationship between categorical features and churn

2. **Visualizations to Create:**
   - Churn distribution (bar chart)
   - Tenure vs Churn (histogram/KDE)
   - Monthly Charges vs Total Charges scatter plot
   - Contract Type vs Churn rate (grouped bar chart)
   - Correlation heatmap
   - Services usage patterns

3. **Hypotheses to Test:**
   - Do month-to-month customers churn more than annual contract customers?
   - Is tenure inversely correlated with churn?
   - Do customers with more services have lower churn rates?
   - Does payment method affect churn?

4. **Data Preparation:**
   - Fix TotalCharges data type
   - Identify and handle any remaining data quality issues
   - Prepare data for feature engineering in Phase 3
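
As a preview of how the first hypothesis listed above (month-to-month versus longer contracts) might be checked, churn rate per contract type reduces to a group-wise mean of a 0/1 churn flag. The sketch below uses made-up toy rows rather than the real data, so its numbers carry no meaning; only the pattern is the point.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Telco dataset
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Convert the Yes/No target to 1/0 so the mean equals the churn rate
churn_flag = toy["Churn"].eq("Yes").astype(int)

# Percentage of churners within each contract type
churn_rate_by_contract = churn_flag.groupby(toy["Contract"]).mean() * 100
print(churn_rate_by_contract.round(1))
```

The same two lines work for any categorical column, which makes it easy to compare payment methods or add-on services in the same way.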

---

### 📊 Expected Insights:

By the end of Phase 2, we should be able to answer:
- Which features are most strongly correlated with churn?
- What are the characteristics of customers who churn vs those who stay?
- Are there any obvious patterns or segments in the data?
- What features should we focus on for modeling?

---

**Status: Phase 1 Complete ✅**  
**Ready for: Phase 2 - Comprehensive Exploratory Data Analysis**

---