In [1]:
from pathlib import Path
import pandas as pd

NOTEBOOK_DIR = Path.cwd()
DATA_RAW = (NOTEBOOK_DIR / "../data/raw").resolve()

## Step 1: Load CSV

Load the five raw Telco customer churn CSV files from `DATA_RAW` into separate DataFrames.

In [2]:
# Load all CSV files
demographics = pd.read_csv(DATA_RAW / "Telco_customer_churn_demographics.csv")
location     = pd.read_csv(DATA_RAW / "Telco_customer_churn_location.csv")
population   = pd.read_csv(DATA_RAW / "Telco_customer_churn_population.csv")
services     = pd.read_csv(DATA_RAW / "Telco_customer_churn_services.csv")
status       = pd.read_csv(DATA_RAW / "Telco_customer_churn_status.csv")

# Quick shapes to confirm they loaded
for name, df in {
    "Demographics": demographics,
    "Location": location,
    "Population": population,
    "Services": services,
    "Status": status,
}.items():
    print(f"{name:12s} -> {df.shape}")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\miga\\Documents\\GitHub\\Project_EDSB\\Sandbox\\data\\raw\\Telco_customer_churn_demographics.csv'
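The traceback above shows that the resolved `DATA_RAW` folder did not contain the expected file, which usually points to the notebook running from a different working directory than assumed. A minimal sketch of a pre-flight check follows. The helper `find_missing_files` is hypothetical (it is not part of the original notebook), and a temporary folder stands in for `DATA_RAW` so the example runs anywhere.

```python
from pathlib import Path
import tempfile


def find_missing_files(data_dir, filenames):
    """Return the filenames that do not exist inside data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return [name for name in filenames if not (data_dir / name).is_file()]


# Demonstration: a temporary directory stands in for DATA_RAW.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Create only one of the two expected files.
    (tmp_dir / "Telco_customer_churn_status.csv").write_text("Customer ID,Count\n")
    expected = [
        "Telco_customer_churn_status.csv",
        "Telco_customer_churn_services.csv",
    ]
    missing = find_missing_files(tmp_dir, expected)

print("Missing files:", missing)
```

Running such a check before `pd.read_csv` reports every missing file at once instead of stopping at the first one.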

## Step 2: Initial Data Exploration

Before merging the datasets, it's important to understand what each table represents
and how they relate to one another.  
We'll start by exploring them individually to inspect their structure, size, and key variables.

In [None]:
datasets = {
    "Demographics": demographics,
    "Location": location,
    "Population": population,
    "Services": services,
    "Status": status,
}

# Print shape and preview each dataset
for name, df in datasets.items():
    print(f"===== {name} =====")
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    display(df.head(3))
    print("\nColumn names:\n", list(df.columns))
    print("-" * 60)

### Step 2.1: Data Overview and Descriptive Statistics

Now that we have inspected each dataset's structure, we'll examine their **data types**, 
**numeric distributions**, and **categorical summaries**.  
This step helps identify potential data-quality issues, redundant columns, and 
features that might need cleaning or transformation later on.

In [None]:
# Combined data overview (includes .info, describe, missing, uniques)

for name, df in datasets.items():
    print(f"\n{'=' * 25} {name} {'=' * 25}")
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")

    # --- 1️⃣ Data types and non-null counts
    print("📘 Data Types & Non-Null Values:")
    df.info()

    # Identify numeric & categorical columns
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(include=["object", "category", "bool"]).columns

    # --- 2️⃣ Numeric summary
    if len(num_cols) > 0:
        print("\n📊 Numeric Summary:")
        display(df[num_cols].describe().T)
    else:
        print("\n📊 Numeric Summary: (none)")

    # --- 3️⃣ Categorical summary
    if len(cat_cols) > 0:
        print("\n🔠 Categorical Summary:")
        display(df[cat_cols].describe().T)
    else:
        print("\n🔠 Categorical Summary: (none)")

    # --- 4️⃣ Missing & unique values
    print("\n🧹 Missing Values (Top 10):")
    display(df.isna().sum().sort_values(ascending=False).head(10))

    print("🔢 Unique Values (Top 10):")
    display(df.nunique().sort_values(ascending=False).head(10).to_frame("nunique"))

    print("-" * 80)

## **Observations**

### **Demographics:**  

The **Demographics** dataset contains information describing each customer's personal and family profile.  
It includes **7,043 customers** and **9 variables**: 3 numeric (`Count`, `Age`, `Number of Dependents`) and 6 categorical.

**Key takeaways:**

- **Data quality:**  
  - No missing values across any column.  
  - Data types are correctly assigned (`int64` for numeric, `object` for categorical).  

- **Numeric overview:**  
  - `Count` is constant (=1) → non-informative and can be dropped later.  
  - `Age` ranges from **19 to 80** (mean ≈ 46.5 years).  
  - `Number of Dependents` ranges from **0 to 9**, with an average of 0.47, meaning most customers have few or no dependents.  

- **Categorical overview:**  
  - Gender distribution is balanced (Male ≈ 3.6k, Female ≈ 3.5k).  
  - A slight majority of customers are **not married** (≈ 52%).  
  - About **84% are not senior citizens** and **80% are not under 30**, suggesting the typical customer is middle-aged.  
  - Dependents are mostly “No” (≈ 77%).  

**Interpretation:**  
This table provides socio-demographic context for each customer, which may influence churn behaviour.  
Variables such as **Age**, **Senior Citizen**, and **Dependents** could serve as useful predictors, while `Count` is non-informative.  
`Under 30` may be redundant (as it is derived from `Age`), but it will be **kept for interpretability** and to facilitate descriptive comparisons between age groups.
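The two redundancy claims above (a constant `Count` and an `Under 30` flag derived from `Age`) can be checked directly. The sketch below uses a made-up four-row stand-in for the Demographics table, and it assumes the flag is defined as `Age < 30`, which the source does not state explicitly.

```python
import pandas as pd

# Toy stand-in for the Demographics table (values are invented for illustration).
demo = pd.DataFrame({
    "Customer ID": ["A", "B", "C", "D"],
    "Count": [1, 1, 1, 1],
    "Age": [25, 47, 29, 71],
    "Under 30": ["Yes", "No", "Yes", "No"],
})

# Columns with a single distinct value carry no information.
constant_cols = [c for c in demo.columns if demo[c].nunique(dropna=False) == 1]

# Rebuild the flag from Age (assumed rule: Age < 30) and compare with the stored column.
derived_flag = (demo["Age"] < 30).map({True: "Yes", False: "No"})
under_30_is_redundant = bool((derived_flag == demo["Under 30"]).all())

print("Constant columns:", constant_cols)
print("Under 30 fully derivable from Age:", under_30_is_redundant)
```

If the comparison holds on the real data, keeping `Under 30` is purely a convenience for grouped summaries, as noted above.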



### **Location:**  

The **Location** dataset provides geographic and positional information for each customer.  
It includes **7,043 customers** and **9 variables**, with 4 numeric columns (`Count`, `Zip Code`, `Latitude`, `Longitude`) and 5 categorical columns.

**Key takeaways:**
- **Data quality:**  
  - No missing values.  
  - Data types are appropriate (`object` for text, `int64` and `float64` for numeric).  

- **Numeric overview:**  
  - `Count` is constant (=1) - can be dropped.  
  - `Zip Code` ranges from **90001 to 96150**, covering southern and northern California regions.  
  - Latitude and longitude values confirm all customers are located within **California, United States**.

- **Categorical overview:**  
  - `Country` = “United States” and `State` = “California” for all records.  
  - `City` has **1,106 unique values**, with Los Angeles being the most frequent (293 customers).  
  - `Lat Long` is a textual combination of latitude and longitude, redundant given the numeric columns.  

**Interpretation:**  
This table adds **geospatial context** to the dataset.  
It allows customer-level geographic segmentation (e.g., by city or ZIP code) and later enables merging with **Population** data using `Zip Code`.  
Columns like `Lat Long` and `Count` are redundant, while `Zip Code` serves as a key linking variable to external demographic data.
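Whether `Lat Long` really duplicates the numeric columns can be verified before dropping it. The sketch below assumes the text column is formatted as `"latitude, longitude"` separated by a comma. That format is an assumption, and the two rows are illustrative rather than taken from the file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Location table (assumed "lat, long" text format).
loc = pd.DataFrame({
    "Lat Long": ["34.02381, -118.156582", "33.936291, -118.332639"],
    "Latitude": [34.02381, 33.936291],
    "Longitude": [-118.156582, -118.332639],
})

# Split the text column into two parts and convert each part to a number.
parsed = (
    loc["Lat Long"]
    .str.split(",", expand=True)
    .apply(lambda col: pd.to_numeric(col.str.strip()))
)

# Compare the parsed values with the existing numeric columns.
lat_long_is_redundant = bool(
    np.allclose(parsed.to_numpy(), loc[["Latitude", "Longitude"]].to_numpy())
)
print("Lat Long duplicates Latitude/Longitude:", lat_long_is_redundant)
```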



### **Population:**  

The **Population** dataset contains ZIP-code-level demographic information.  
It includes **1,671 rows** and **3 variables**, all of which are numeric (`int64`).

**Key takeaways:**
- **Data quality:**  
  - No missing values in any column.  
  - Data types are correctly assigned as integers.  

- **Structure and uniqueness:**  
  - Each `Zip Code` is unique (1,671 distinct ZIP codes).  
  - The `ID` column is also unique and functions only as an internal index; it does not link to customers directly.  
  - `Population` has 1,607 unique values across 1,671 rows, so some ZIP codes share exactly the same population figure.  

- **Numeric overview:**  
  - `Zip Code` ranges from **90001 to 96161**, consistent with California ZIP codes.  
  - `Population` ranges from **11** to **105,285**, with an average of about **20,276** people per ZIP code.  

**Interpretation:**  
This table provides **contextual demographic data** that can be linked to customers through their `Zip Code` from the **Location** table.  
Since it operates at the **ZIP-code level**, it will be joined later via `Zip Code`, not `Customer ID`.  
The `ID` column is only an index field and can be dropped before merging.
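The join described above can be sketched as a left merge on `Zip Code`. The frames below are small invented stand-ins for the real tables. The `validate="many_to_one"` argument makes pandas raise an error if `Zip Code` is not unique on the Population side, which guards against accidental row duplication.

```python
import pandas as pd

# Toy stand-ins for the Location and Population tables (values are invented).
location = pd.DataFrame({
    "Customer ID": ["A", "B", "C"],
    "Zip Code": [90001, 90001, 96150],
})
population = pd.DataFrame({
    "ID": [1, 2, 3],
    "Zip Code": [90001, 96150, 96161],
    "Population": [54492, 1234, 5678],
})

# Drop the index-like ID column, then attach population to each customer.
merged = location.merge(
    population.drop(columns="ID"),
    on="Zip Code",
    how="left",              # keep every customer, even without a ZIP match
    validate="many_to_one",  # many customers per ZIP, one population row per ZIP
)
print(merged)
```

A left join keeps the customer count unchanged, so `len(merged)` should equal the number of rows in the Location table.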



### **Services:**  

The **Services** dataset captures customer service usage, subscription details, and billing information.  
It includes **7,043 customers** and **30 variables**, combining both service attributes and financial metrics.

**Key takeaways:**
- **Data quality:**  
  - No missing values in most columns.  
  - The columns `Offer` and `Internet Type` contain missing data (≈55% and ≈22% respectively), suggesting that not all customers were offered promotions or subscribed to Internet services.  
  - Data types are consistent: numeric for billing and tenure, categorical for service indicators.  

- **Numeric overview:**  
  - `Count` is constant (=1) - can be dropped.  
  - `Tenure in Months` ranges from **1 to 72**, indicating customer relationships lasting up to six years.  
  - `Monthly Charge` varies from **$18.25 to $118.75** (mean ≈ $64.8).  
  - `Total Charges` and `Total Revenue` are highly variable, reflecting differences in service plans and tenure.  
  - Financial columns such as `Total Refunds`, `Total Extra Data Charges`, and `Total Long Distance Charges` are mostly small relative to overall revenue.  

- **Categorical overview:**  
  - `Quarter` = “Q3” for all entries - not informative.  
  - Service adoption patterns:  
    - **Phone Service:** 90% “Yes”  
    - **Internet Service:** 78% “Yes”  
    - **Contract:** Most common is “Month-to-Month” (~51%)  
    - **Payment Method:** Most common is “Bank Withdrawal” (~55%)  
  - Value-added services (`Online Security`, `Streaming TV`, etc.) are mostly “No,” suggesting many customers subscribe to basic plans.

**Interpretation:**  
This table provides a detailed view of **customer engagement and spending behaviour**.  
It combines tenure, billing, and service usage information, all of which are likely **strong predictors of churn**.  
Columns like `Count` and `Quarter` can be dropped, while `Offer` and `Internet Type` require cleaning or imputation.  
The mix of continuous (e.g., `Tenure in Months`, `Monthly Charge`) and binary categorical features will be useful for both descriptive and predictive analyses.
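Before choosing how to clean `Offer` and `Internet Type`, it helps to quantify the gaps and test whether they are structural. The reported 22% missing `Internet Type` lines up with the 22% of customers without Internet Service, so one plausible check is whether the two groups coincide. The sketch below runs that check on invented rows and is not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy stand-in for the Services table (values are invented for illustration).
services = pd.DataFrame({
    "Offer": ["Offer A", None, None, "Offer B"],
    "Internet Service": ["Yes", "Yes", "No", "Yes"],
    "Internet Type": ["DSL", "Fiber Optic", None, "Cable"],
})

# Share of missing values per column, in percent.
missing_pct = services.isna().mean().mul(100).round(1)

# Is Internet Type missing exactly for customers without Internet Service?
missing_type = services["Internet Type"].isna()
no_internet = services["Internet Service"].eq("No")
missing_matches_no_internet = bool((missing_type == no_internet).all())

print(missing_pct)
print("Missing Internet Type == no Internet Service:", missing_matches_no_internet)
```

If the check holds on the real data, the gaps could be filled with an explicit category rather than imputed statistically.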



### **Status:**  

The **Status** dataset captures customer satisfaction, churn outcomes, and value metrics.  
It contains **7,043 customers** and **11 variables**, mixing satisfaction scores, churn labels, and lifetime value indicators.

**Key takeaways:**
- **Data quality:**  
  - No missing values for most columns.  
  - The fields `Churn Category` and `Churn Reason` have missing data in **≈73% of rows**, which aligns with the fact that these fields are only populated for customers who have churned.  
  - Data types are correctly assigned (`int64` for numerical measures, `object` for categorical variables).  

- **Numeric overview:**  
  - `Count` is constant (=1) - can be dropped.  
  - `Satisfaction Score` ranges from **1 to 5** (mean ≈ 3.24).  
  - `Churn Score` ranges from **5 to 96** (mean ≈ 58.5), showing a wide variation in churn risk.  
  - `CLTV` (Customer Lifetime Value) ranges from **2003 to 6500**, indicating differing customer profitability levels.  

- **Categorical overview:**  
  - `Quarter` = “Q3” for all entries - not informative.  
  - `Customer Status`:  
    - **Stayed** – 4,720 customers  
    - **Churned** – 1,869 customers  
    - **Joined** – 454 customers  
  - `Churn Label`: Binary “Yes”/“No” indicator of churn (Yes = 1,869; No = 5,174).  
  - `Churn Category`: 5 categories for churned customers (most common: *Competitor*).  
  - `Churn Reason`: 20 reasons reported (most frequent: *Competitor had better devices*).  

**Interpretation:**  
This table provides the **core churn information** and customer satisfaction measures, which form the foundation for our prediction target.  
`Churn Label` will serve as the **dependent variable (target)** in the churn prediction model.  
Columns such as `Count` and `Quarter` are not useful analytically and can be removed.  
Although `Churn Category` and `Churn Reason` have many missing values, they still offer valuable insight for **post-model interpretation** and business recommendations.
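As a minimal sketch of how `Churn Label` could become a model target, the example below maps “Yes”/“No” to 1/0 and confirms that `Churn Category` is populated only for churned customers. The four rows are invented, and the name `y` is an illustrative choice rather than something defined in the notebook.

```python
import pandas as pd

# Toy stand-in for the Status table (values are invented for illustration).
status = pd.DataFrame({
    "Churn Label": ["Yes", "No", "No", "Yes"],
    "Churn Category": ["Competitor", None, None, "Price"],
})

# Encode the target: 1 = churned, 0 = not churned.
y = status["Churn Label"].map({"Yes": 1, "No": 0})

# Churn Category should be filled if and only if the customer churned.
category_only_for_churned = bool(
    (status["Churn Category"].notna() == (y == 1)).all()
)

churn_rate = y.mean()
print(f"Churn rate: {churn_rate:.1%}")
print("Category present only for churned customers:", category_only_for_churned)
```

On the full table, the churn rate computed this way should be about 26.5% (1,869 of 7,043), consistent with the ≈73% of rows where the churn detail fields are empty.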