In [None]:
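from pathlib import Path
import pandas as pd

NOTEBOOK_DIR = Path.cwd()
DATA_RAW = (NOTEBOOK_DIR / "../data/raw").resolve()

print("Notebook dir:", NOTEBOOK_DIR)
print("DATA_RAW:", DATA_RAW)
print("Exists?", DATA_RAW.exists())
print("CSV files:", [p.name for p in DATA_RAW.glob("*.csv")])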

Notebook dir: /Users/pedro.cabeco/Desktop/Project_EDSB/notebooks
DATA_RAW: /Users/pedro.cabeco/Desktop/Project_EDSB/data/raw
Exists? True
CSV files: ['Telco_customer_churn_status.csv', 'Telco_customer_churn_services.csv', 'Telco_customer_churn_demographics.csv', 'Telco_customer_churn_location.csv', 'Telco_customer_churn_population.csv']


## Step 1: Load CSV

Read each of the five raw CSV files into its own DataFrame, then print the shapes to confirm they loaded.

In [5]:
# Load all CSV files
demographics = pd.read_csv(DATA_RAW / "Telco_customer_churn_demographics.csv")
location     = pd.read_csv(DATA_RAW / "Telco_customer_churn_location.csv")
population   = pd.read_csv(DATA_RAW / "Telco_customer_churn_population.csv")
services     = pd.read_csv(DATA_RAW / "Telco_customer_churn_services.csv")
status       = pd.read_csv(DATA_RAW / "Telco_customer_churn_status.csv")

# Quick shapes to confirm they loaded
for name, df in {
    "Demographics": demographics,
    "Location": location,
    "Population": population,
    "Services": services,
    "Status": status,
}.items():
    print(f"{name:12s} -> {df.shape}")

Demographics -> (7043, 9)
Location     -> (7043, 9)
Population   -> (1671, 3)
Services     -> (7043, 30)
Status       -> (7043, 11)


## Step 2: Initial Data Exploration

Before merging the datasets, it's important to understand what each table represents
and how they relate to one another.  
We'll start by exploring them individually to inspect their structure, size, and key variables.
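Four of the five tables carry a `Customer ID` column, so one quick way to see how they relate is to check that the key is unique in each table and that the key sets match. The snippet below is a minimal sketch on toy stand-in frames; `describe_key`, `demo_toy`, and `status_toy` are hypothetical names and the values are illustrative, not taken from the data.

```python
import pandas as pd

# Toy stand-ins for two of the real tables (illustrative values only).
demo_toy = pd.DataFrame({"Customer ID": ["A1", "B2", "C3"], "Age": [30, 45, 61]})
status_toy = pd.DataFrame({"Customer ID": ["C3", "A1", "B2"], "Churn Value": [1, 0, 0]})


def describe_key(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Report whether `key` is unique in each table and how the key sets overlap."""
    left_keys, right_keys = set(left[key]), set(right[key])
    return {
        "left_unique": bool(left[key].is_unique),
        "right_unique": bool(right[key].is_unique),
        "only_in_left": len(left_keys - right_keys),
        "only_in_right": len(right_keys - left_keys),
    }


key_report = describe_key(demo_toy, status_toy, "Customer ID")
print(key_report)
```

On the real tables the same helper could be called as `describe_key(demographics, status, "Customer ID")`. Zero entries in both `only_in_*` fields would suggest a clean one-to-one relationship.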


In [7]:
datasets = {
    "Demographics": demographics,
    "Location": location,
    "Population": population,
    "Services": services,
    "Status": status,
}

# Print shape and preview each dataset
for name, df in datasets.items():
    print(f"===== {name} =====")
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    display(df.head(3))
    print("\nColumn names:\n", list(df.columns))
    print("-" * 60)

===== Demographics =====
Shape: 7043 rows × 9 columns


Unnamed: 0,Customer ID,Count,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents
0,8779-QRDMV,1,Male,78,No,Yes,No,No,0
1,7495-OOKFY,1,Female,74,No,Yes,Yes,Yes,1
2,1658-BYGOY,1,Male,71,No,Yes,No,Yes,3



Column names:
 ['Customer ID', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents']
------------------------------------------------------------
===== Location =====
Shape: 7043 rows × 9 columns


Unnamed: 0,Customer ID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude
0,8779-QRDMV,1,United States,California,Los Angeles,90022,"34.02381, -118.156582",34.02381,-118.156582
1,7495-OOKFY,1,United States,California,Los Angeles,90063,"34.044271, -118.185237",34.044271,-118.185237
2,1658-BYGOY,1,United States,California,Los Angeles,90065,"34.108833, -118.229715",34.108833,-118.229715



Column names:
 ['Customer ID', 'Count', 'Country', 'State', 'City', 'Zip Code', 'Lat Long', 'Latitude', 'Longitude']
------------------------------------------------------------
===== Population =====
Shape: 1671 rows × 3 columns


Unnamed: 0,ID,Zip Code,Population
0,1,90001,54492
1,2,90002,44586
2,3,90003,58198



Column names:
 ['ID', 'Zip Code', 'Population']
------------------------------------------------------------
===== Services =====
Shape: 7043 rows × 30 columns


Unnamed: 0,Customer ID,Count,Quarter,Referred a Friend,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,...,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue
0,8779-QRDMV,1,Q3,No,0,1,,No,0.0,No,...,No,Month-to-Month,Yes,Bank Withdrawal,39.65,39.65,0.0,20,0.0,59.65
1,7495-OOKFY,1,Q3,Yes,1,8,Offer E,Yes,48.85,Yes,...,Yes,Month-to-Month,Yes,Credit Card,80.65,633.3,0.0,0,390.8,1024.1
2,1658-BYGOY,1,Q3,No,0,18,Offer D,Yes,11.33,Yes,...,Yes,Month-to-Month,Yes,Bank Withdrawal,95.45,1752.55,45.61,0,203.94,1910.88



Column names:
 ['Customer ID', 'Count', 'Quarter', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Offer', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Internet Type', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing', 'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue']
------------------------------------------------------------
===== Status =====
Shape: 7043 rows × 11 columns


Unnamed: 0,Customer ID,Count,Quarter,Satisfaction Score,Customer Status,Churn Label,Churn Value,Churn Score,CLTV,Churn Category,Churn Reason
0,8779-QRDMV,1,Q3,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,1,Q3,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,1,Q3,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer



Column names:
 ['Customer ID', 'Count', 'Quarter', 'Satisfaction Score', 'Customer Status', 'Churn Label', 'Churn Value', 'Churn Score', 'CLTV', 'Churn Category', 'Churn Reason']
------------------------------------------------------------


### Step 2.1: Data Overview and Descriptive Statistics

Now that we have inspected each dataset's structure, we'll examine their **data types**, 
**numeric distributions**, and **categorical summaries**.  
This step helps identify potential data-quality issues, redundant columns, and 
features that might need cleaning or transformation later on.
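As a compact complement to the full overview, the two checks that matter most for later cleaning (share of missing values and constant columns) can be collected into one small table. This is a sketch under assumptions: `quality_flags` is a hypothetical helper and `toy` is made-up data that mimics a constant `Count` column and a partly missing `Offer` column.

```python
import pandas as pd


def quality_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing share (%) and a flag for columns with one distinct value."""
    return pd.DataFrame({
        "missing_pct": (df.isna().mean() * 100).round(1),
        "is_constant": df.nunique(dropna=False) <= 1,
    })


# Made-up frame: `Count` never varies, `Offer` is missing in half of the rows.
toy = pd.DataFrame({
    "Count": [1, 1, 1, 1],
    "Offer": ["Offer A", None, None, "Offer B"],
})

flags = quality_flags(toy)
print(flags)
```

Applied to each entry of `datasets`, this would surface drop candidates and columns that need imputation at a glance.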

In [12]:
# Combined data overview (includes .info, describe, missing, uniques)

for name, df in datasets.items():
    print(f"\n{'=' * 25} {name} {'=' * 25}")
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")

    # --- 1️⃣ Data types and non-null counts
    print("📘 Data Types & Non-Null Values:")
    df.info()

    # Identify numeric & categorical columns
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(include=["object", "category", "bool"]).columns

    # --- 2️⃣ Numeric summary
    if len(num_cols) > 0:
        print("\n📊 Numeric Summary:")
        display(df[num_cols].describe().T)
    else:
        print("\n📊 Numeric Summary: (none)")

    # --- 3️⃣ Categorical summary
    if len(cat_cols) > 0:
        print("\n🔠 Categorical Summary:")
        display(df[cat_cols].describe().T)
    else:
        print("\n🔠 Categorical Summary: (none)")

    # --- 4️⃣ Missing & unique values
    print("\n🧹 Missing Values (Top 10):")
    display(df.isna().sum().sort_values(ascending=False).head(10))

    print("🔢 Unique Values (Top 10):")
    display(df.nunique().sort_values(ascending=False).head(10).to_frame("nunique"))

    print("-" * 80)


========================= Demographics =========================
Shape: 7043 rows × 9 columns

📘 Data Types & Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Customer ID           7043 non-null   object
 1   Count                 7043 non-null   int64 
 2   Gender                7043 non-null   object
 3   Age                   7043 non-null   int64 
 4   Under 30              7043 non-null   object
 5   Senior Citizen        7043 non-null   object
 6   Married               7043 non-null   object
 7   Dependents            7043 non-null   object
 8   Number of Dependents  7043 non-null   int64 
dtypes: int64(3), object(6)
memory usage: 495.3+ KB

📊 Numeric Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Count,7043.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Age,7043.0,46.509726,16.750352,19.0,32.0,46.0,60.0,80.0
Number of Dependents,7043.0,0.468692,0.962802,0.0,0.0,0.0,0.0,9.0



🔠 Categorical Summary:


Unnamed: 0,count,unique,top,freq
Customer ID,7043,7043,8779-QRDMV,1
Gender,7043,2,Male,3555
Under 30,7043,2,No,5642
Senior Citizen,7043,2,No,5901
Married,7043,2,No,3641
Dependents,7043,2,No,5416



🧹 Missing Values (Top 10):


Customer ID             0
Count                   0
Gender                  0
Age                     0
Under 30                0
Senior Citizen          0
Married                 0
Dependents              0
Number of Dependents    0
dtype: int64

🔢 Unique Values (Top 10):


Unnamed: 0,nunique
Customer ID,7043
Age,62
Number of Dependents,10
Gender,2
Under 30,2
Senior Citizen,2
Married,2
Dependents,2
Count,1


--------------------------------------------------------------------------------

========================= Location =========================
Shape: 7043 rows × 9 columns

📘 Data Types & Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Customer ID  7043 non-null   object 
 1   Count        7043 non-null   int64  
 2   Country      7043 non-null   object 
 3   State        7043 non-null   object 
 4   City         7043 non-null   object 
 5   Zip Code     7043 non-null   int64  
 6   Lat Long     7043 non-null   object 
 7   Latitude     7043 non-null   float64
 8   Longitude    7043 non-null   float64
dtypes: float64(2), int64(2), object(5)
memory usage: 495.3+ KB

📊 Numeric Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Count,7043.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Zip Code,7043.0,93486.070567,1856.767505,90001.0,92101.0,93518.0,95329.0,96150.0
Latitude,7043.0,36.197455,2.468929,32.555828,33.990646,36.205465,38.161321,41.962127
Longitude,7043.0,-119.756684,2.154425,-124.301372,-121.78809,-119.595293,-117.969795,-114.192901



🔠 Categorical Summary:


Unnamed: 0,count,unique,top,freq
Customer ID,7043,7043,8779-QRDMV,1
Country,7043,1,United States,7043
State,7043,1,California,7043
City,7043,1106,Los Angeles,293
Lat Long,7043,1679,"33.362575, -117.299644",43



🧹 Missing Values (Top 10):


Customer ID    0
Count          0
Country        0
State          0
City           0
Zip Code       0
Lat Long       0
Latitude       0
Longitude      0
dtype: int64

🔢 Unique Values (Top 10):


Unnamed: 0,nunique
Customer ID,7043
Lat Long,1679
Zip Code,1626
Latitude,1626
Longitude,1625
City,1106
Count,1
Country,1
State,1


--------------------------------------------------------------------------------

========================= Population =========================
Shape: 1671 rows × 3 columns

📘 Data Types & Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1671 entries, 0 to 1670
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ID          1671 non-null   int64
 1   Zip Code    1671 non-null   int64
 2   Population  1671 non-null   int64
dtypes: int64(3)
memory usage: 39.3 KB

📊 Numeric Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,1671.0,836.0,482.520466,1.0,418.5,836.0,1253.5,1671.0
Zip Code,1671.0,93678.99222,1817.763591,90001.0,92269.0,93664.0,95408.0,96161.0
Population,1671.0,20276.384201,20689.1173,11.0,1789.0,14239.0,32942.5,105285.0



🔠 Categorical Summary: (none)

🧹 Missing Values (Top 10):


ID            0
Zip Code      0
Population    0
dtype: int64

🔢 Unique Values (Top 10):


Unnamed: 0,nunique
ID,1671
Zip Code,1671
Population,1607


--------------------------------------------------------------------------------

========================= Services =========================
Shape: 7043 rows × 30 columns

📘 Data Types & Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 30 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Count                              7043 non-null   int64  
 2   Quarter                            7043 non-null   object 
 3   Referred a Friend                  7043 non-null   object 
 4   Number of Referrals                7043 non-null   int64  
 5   Tenure in Months                   7043 non-null   int64  
 6   Offer                              3166 non-null   object 
 7   Phone Service                      7043 non-null   object 
 8   Avg Monthly Long Distance Charges  7043 non-null   float64
 9   Multiple Lines                     

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Count,7043.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Number of Referrals,7043.0,1.951867,3.001199,0.0,0.0,0.0,3.0,11.0
Tenure in Months,7043.0,32.386767,24.542061,1.0,9.0,29.0,55.0,72.0
Avg Monthly Long Distance Charges,7043.0,22.958954,15.448113,0.0,9.21,22.89,36.395,49.99
Avg Monthly GB Download,7043.0,20.515405,20.41894,0.0,3.0,17.0,27.0,85.0
Monthly Charge,7043.0,64.761692,30.090047,18.25,35.5,70.35,89.85,118.75
Total Charges,7043.0,2280.381264,2266.220462,18.8,400.15,1394.55,3786.6,8684.8
Total Refunds,7043.0,1.962182,7.902614,0.0,0.0,0.0,0.0,49.79
Total Extra Data Charges,7043.0,6.860713,25.104978,0.0,0.0,0.0,0.0,150.0
Total Long Distance Charges,7043.0,749.099262,846.660055,0.0,70.545,401.44,1191.1,3564.72



🔠 Categorical Summary:


Unnamed: 0,count,unique,top,freq
Customer ID,7043,7043,8779-QRDMV,1
Quarter,7043,1,Q3,7043
Referred a Friend,7043,2,No,3821
Offer,3166,5,Offer B,824
Phone Service,7043,2,Yes,6361
Multiple Lines,7043,2,No,4072
Internet Service,7043,2,Yes,5517
Internet Type,5517,3,Fiber Optic,3035
Online Security,7043,2,No,5024
Online Backup,7043,2,No,4614



🧹 Missing Values (Top 10):


Offer                          3877
Internet Type                  1526
Customer ID                       0
Premium Tech Support              0
Total Long Distance Charges       0
Total Extra Data Charges          0
Total Refunds                     0
Total Charges                     0
Monthly Charge                    0
Payment Method                    0
dtype: int64

🔢 Unique Values (Top 10):


Unnamed: 0,nunique
Customer ID,7043
Total Revenue,6985
Total Charges,6540
Total Long Distance Charges,6091
Avg Monthly Long Distance Charges,3584
Monthly Charge,1585
Total Refunds,500
Tenure in Months,72
Avg Monthly GB Download,50
Total Extra Data Charges,16


--------------------------------------------------------------------------------

========================= Status =========================
Shape: 7043 rows × 11 columns

📘 Data Types & Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Customer ID         7043 non-null   object
 1   Count               7043 non-null   int64 
 2   Quarter             7043 non-null   object
 3   Satisfaction Score  7043 non-null   int64 
 4   Customer Status     7043 non-null   object
 5   Churn Label         7043 non-null   object
 6   Churn Value         7043 non-null   int64 
 7   Churn Score         7043 non-null   int64 
 8   CLTV                7043 non-null   int64 
 9   Churn Category      1869 non-null   object
 10  Churn Reason        1869 non-null   object
dtypes: int64(5), object(6)
memory usage: 605.4+ KB

📊 Numeric Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Count,7043.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Satisfaction Score,7043.0,3.244924,1.201657,1.0,3.0,3.0,4.0,5.0
Churn Value,7043.0,0.26537,0.441561,0.0,0.0,0.0,1.0,1.0
Churn Score,7043.0,58.50504,21.170031,5.0,40.0,61.0,75.5,96.0
CLTV,7043.0,4400.295755,1183.057152,2003.0,3469.0,4527.0,5380.5,6500.0



🔠 Categorical Summary:


Unnamed: 0,count,unique,top,freq
Customer ID,7043,7043,8779-QRDMV,1
Quarter,7043,1,Q3,7043
Customer Status,7043,3,Stayed,4720
Churn Label,7043,2,No,5174
Churn Category,1869,5,Competitor,841
Churn Reason,1869,20,Competitor had better devices,313



🧹 Missing Values (Top 10):


Churn Category        5174
Churn Reason          5174
Customer ID              0
Count                    0
Quarter                  0
Satisfaction Score       0
Customer Status          0
Churn Label              0
Churn Value              0
Churn Score              0
dtype: int64

🔢 Unique Values (Top 10):


Unnamed: 0,nunique
Customer ID,7043
CLTV,3438
Churn Score,81
Churn Reason,20
Satisfaction Score,5
Churn Category,5
Customer Status,3
Churn Label,2
Churn Value,2
Count,1


--------------------------------------------------------------------------------


## **Observations**

### **Demographics:**  

The **Demographics** dataset contains information describing each customer's personal and family profile.  
It includes **7,043 customers** and **9 variables**: 3 numeric (`Count`, `Age`, `Number of Dependents`) and 6 categorical.

**Key takeaways:**

- **Data quality:**  
  - No missing values across any column.  
  - Data types are correctly assigned (`int64` for numeric, `object` for categorical).  

- **Numeric overview:**  
  - `Count` is constant (=1), so it is non-informative and can be dropped later.  
  - `Age` ranges from **19 to 80** (mean ≈ 46.5 years).  
  - `Number of Dependents` ranges from **0 to 9** with a mean of 0.47; the 75th percentile is 0, so at least three-quarters of customers have no dependents.  

- **Categorical overview:**  
  - Gender distribution is balanced (Male = 3,555; Female = 3,488).  
  - Slightly more than half of customers are **not married** (≈ 52%).  
  - About **84% are not senior citizens** and **80% are not under 30**, suggesting the typical customer is middle-aged.  
  - `Dependents` is mostly “No” (≈ 77%).  

**Interpretation:**  
This table provides socio-demographic context for each customer, which may influence churn behaviour.  
Variables such as **Age**, **Senior Citizen**, and **Dependents** could serve as useful predictors, while `Count` is non-informative.  
`Under 30` may be redundant (as it is derived from `Age`), but it will be **kept for interpretability** and to facilitate descriptive comparisons between age groups.
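The claim that `Under 30` is derived from `Age` can be tested directly. The sketch below assumes the rule is `Age < 30`; the exact boundary is an assumption, not something the data dictionary confirms here. It runs on a made-up frame (`demo_toy`) rather than the real table.

```python
import pandas as pd

# Made-up rows; on the real data, use the `demographics` DataFrame instead.
demo_toy = pd.DataFrame({
    "Age": [78, 25, 30, 19],
    "Under 30": ["No", "Yes", "No", "Yes"],
})

# Assumed rule: the flag is "Yes" exactly when Age is strictly below 30.
derived_flag = (demo_toy["Age"] < 30).map({True: "Yes", False: "No"})

# Count rows where the stored flag disagrees with the derived one.
mismatches = int((derived_flag != demo_toy["Under 30"]).sum())
print(f"Rows where `Under 30` disagrees with Age < 30: {mismatches}")
```

A result of zero on the full table would confirm that the column adds no information beyond `Age`; any mismatch would point to a different cut-off or a data-entry issue.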



### **Location:**  

The **Location** dataset provides geographic and positional information for each customer.  
It includes **7,043 customers** and **9 variables**, with 4 numeric columns (`Count`, `Zip Code`, `Latitude`, `Longitude`) and 5 categorical columns.

**Key takeaways:**
- **Data quality:**  
  - No missing values.  
  - Data types are appropriate (`object` for text, `int64` and `float64` for numeric).  

- **Numeric overview:**  
  - `Count` is constant (=1) and can be dropped.  
  - `Zip Code` ranges from **90001 to 96150**, covering southern and northern California regions. It is stored as `int64` but acts as an identifier, not a quantity.  
  - Latitude (≈ 32.6 to 42.0) and longitude (≈ -124.3 to -114.2) are consistent with all customers being located within **California, United States**.

- **Categorical overview:**  
  - `Country` = “United States” and `State` = “California” for all records.  
  - `City` has **1,106 unique values**, with Los Angeles being the most frequent (293 customers).  
  - `Lat Long` is a textual combination of latitude and longitude, redundant given the numeric columns.  

**Interpretation:**  
This table adds **geospatial context** to the dataset.  
It allows customer-level geographic segmentation (e.g., by city or ZIP code) and later enables merging with **Population** data using `Zip Code`.  
Columns like `Lat Long` and `Count` are redundant, while `Zip Code` serves as a key linking variable to external demographic data.
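Before dropping `Lat Long`, its redundancy can be verified by parsing the text and comparing it with the numeric columns. The rows below are the three preview rows shown earlier for the Location table; the variable names are hypothetical, and whether the match holds for every row of the full table is not checked here.

```python
import numpy as np
import pandas as pd

# The three rows from the Location preview.
loc_preview = pd.DataFrame({
    "Lat Long": [
        "34.02381, -118.156582",
        "34.044271, -118.185237",
        "34.108833, -118.229715",
    ],
    "Latitude": [34.02381, 34.044271, 34.108833],
    "Longitude": [-118.156582, -118.185237, -118.229715],
})

# Split the text on the comma, trim whitespace, and convert both halves to floats.
parts = loc_preview["Lat Long"].str.split(",", expand=True)
parsed = parts.apply(lambda col: pd.to_numeric(col.str.strip()))

# Compare the parsed pair with the stored numeric columns.
lat_long_is_redundant = bool(
    np.allclose(parsed.to_numpy(), loc_preview[["Latitude", "Longitude"]].to_numpy())
)
print(lat_long_is_redundant)
```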



### **Population:**  

The **Population** dataset contains ZIP-code-level demographic information.  
It includes **1,671 rows** and **3 variables**, all of which are numeric (`int64`).

**Key takeaways:**
- **Data quality:**  
  - No missing values in any column.  
  - Data types are correctly assigned as integers.  

- **Structure and uniqueness:**  
  - Each `Zip Code` is unique (1,671 distinct ZIP codes).  
  - The `ID` column is also unique and functions only as an internal index; it does not link to customers directly.  
  - `Population` has 1,607 unique values across 1,671 rows, so some ZIP codes share exactly the same population count.  

- **Numeric overview:**  
  - `Zip Code` ranges from **90001 to 96161**, consistent with California ZIP codes.  
  - `Population` ranges from **11** to **105,285**, with an average of about **20,276** people per ZIP code.  

**Interpretation:**  
This table provides **contextual demographic data** that can be linked to customers through their `Zip Code` from the **Location** table.  
Since it operates at the **ZIP-code level**, it will be joined later via `Zip Code`, not `Customer ID`.  
The `ID` column is only an index field and can be dropped before merging.
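A minimal sketch of that join, assuming a left merge so that every customer row is kept. The frames are toy stand-ins: the two population figures come from the preview above, while the customer IDs and ZIP code 99999 are made up to show what an unmatched ZIP looks like.

```python
import pandas as pd

# Toy customer-level table (hypothetical IDs; 99999 is a deliberately unmatched ZIP).
location_toy = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Zip Code": [90001, 90002, 99999],
})

# Toy ZIP-level table, using two rows from the Population preview.
population_toy = pd.DataFrame({
    "ID": [1, 2],
    "Zip Code": [90001, 90002],
    "Population": [54492, 44586],
})

merged = location_toy.merge(
    population_toy.drop(columns="ID"),  # drop the index-only column first
    on="Zip Code",
    how="left",                 # keep every customer, matched or not
    validate="many_to_one",     # raise if a ZIP code repeats in the population table
    indicator=True,             # adds `_merge` so unmatched rows are easy to count
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"Customers without a population match: {unmatched}")
```

`validate="many_to_one"` guards against accidental row duplication, and the `_merge` column makes it easy to report how many customers lack a population value.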



### **Services:**  

The **Services** dataset captures customer service usage, subscription details, and billing information.  
It includes **7,043 customers** and **30 variables**, combining both service attributes and financial metrics.

**Key takeaways:**
- **Data quality:**  
  - No missing values in most columns.  
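  - The columns `Offer` and `Internet Type` contain missing data (≈ 55% and ≈ 22% respectively). The 1,526 missing `Internet Type` values equal the number of customers without Internet service (7,043 - 5,517), so they are structural; the missing `Offer` values suggest that not all customers received a promotion.  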
  - Data types are consistent: numeric for billing and tenure, categorical for service indicators.  

- **Numeric overview:**  
  - `Count` is constant (=1) and can be dropped.  
  - `Tenure in Months` ranges from **1 to 72**, indicating customer relationships lasting up to six years.  
  - `Monthly Charge` varies from **$18.25 to $118.75** (mean ≈ $64.8).  
  - `Total Charges` and `Total Revenue` are highly variable, reflecting differences in service plans and tenure.  
  - `Total Refunds` and `Total Extra Data Charges` are zero for at least 75% of customers, whereas `Total Long Distance Charges` is substantial (mean ≈ $749, compared with ≈ $2,280 for `Total Charges`).  

- **Categorical overview:**  
  - `Quarter` = “Q3” for all entries, so it is not informative.  
  - Service adoption patterns:  
    - **Phone Service:** 90% “Yes”  
    - **Internet Service:** 78% “Yes”  
    - **Contract:** Dominated by “Month-to-Month” (~51%)  
    - **Payment Method:** Most common is “Bank Withdrawal” (~55%)  
  - Value-added services (`Online Security`, `Streaming TV`, etc.) are mostly “No,” suggesting many customers subscribe to basic plans.
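The billing columns appear to be linked by a simple identity: in the three preview rows, `Total Revenue` equals `Total Charges` minus `Total Refunds` plus the two extra-charge columns. The sketch below checks that on those rows only. Treating this as the general definition of `Total Revenue` is an assumption until it is run on the full table.

```python
import pandas as pd

# Billing values copied from the three Services preview rows.
services_preview = pd.DataFrame({
    "Total Charges": [39.65, 633.30, 1752.55],
    "Total Refunds": [0.0, 0.0, 45.61],
    "Total Extra Data Charges": [20, 0, 0],
    "Total Long Distance Charges": [0.0, 390.80, 203.94],
    "Total Revenue": [59.65, 1024.10, 1910.88],
})

# Assumed identity: revenue = charges - refunds + extra data + long distance.
reconstructed = (
    services_preview["Total Charges"]
    - services_preview["Total Refunds"]
    + services_preview["Total Extra Data Charges"]
    + services_preview["Total Long Distance Charges"]
)

# Allow one cent of tolerance for floating-point rounding.
revenue_gap = (reconstructed - services_preview["Total Revenue"]).abs()
identity_holds = bool((revenue_gap < 0.01).all())
print(identity_holds)
```

If the identity holds for all rows, `Total Revenue` is a linear combination of the other four columns, which matters for models that are sensitive to multicollinearity.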

**Interpretation:**  
This table provides a detailed view of **customer engagement and spending behaviour**.  
It combines tenure, billing, and service usage information, all of which are likely **strong predictors of churn**.  
Columns like `Count` and `Quarter` can be dropped, while `Offer` and `Internet Type` require cleaning or imputation.  
The mix of continuous (e.g., `Tenure in Months`, `Monthly Charge`) and binary categorical features will be useful for both descriptive and predictive analyses.
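One possible cleaning approach for `Offer` and `Internet Type` is to treat the gaps as a category of their own, not as unknown values. The sketch below first checks that `Internet Type` is missing only where `Internet Service` is “No”, then fills both columns with explicit labels. The labels `"No Internet"` and `"No Offer"` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy rows (illustrative); on the real data, use the `services` DataFrame.
services_toy = pd.DataFrame({
    "Internet Service": ["Yes", "No", "Yes"],
    "Internet Type": ["Fiber Optic", None, "DSL"],
    "Offer": [None, "Offer E", None],
})

# Structural check: is `Internet Type` missing exactly when there is no Internet service?
missing_is_structural = bool(
    (services_toy["Internet Type"].isna() == (services_toy["Internet Service"] == "No")).all()
)

# Replace the gaps with explicit category labels (hypothetical label names).
cleaned = services_toy.fillna({"Internet Type": "No Internet", "Offer": "No Offer"})

print(missing_is_structural)
print(cleaned)
```

Explicit labels keep all 7,043 rows usable and let a model learn from the absence of an offer or of Internet service.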



### **Status:**  

The **Status** dataset captures customer satisfaction, churn outcomes, and value metrics.  
It contains **7,043 customers** and **11 variables**, mixing satisfaction scores, churn labels, and lifetime value indicators.

**Key takeaways:**
- **Data quality:**  
  - No missing values for most columns.  
  - The fields `Churn Category` and `Churn Reason` are missing in **≈ 73% of rows** (5,174 of 7,043), which equals the number of customers with `Churn Label` = “No”: these fields are only populated for customers who have churned.  
  - Data types are correctly assigned (`int64` for numerical measures, `object` for categorical variables).  

- **Numeric overview:**  
  - `Count` is constant (=1) and can be dropped.  
  - `Satisfaction Score` ranges from **1 to 5** (mean ≈ 3.24).  
  - `Churn Score` ranges from **5 to 96** (mean ≈ 58.5), showing a wide variation in churn risk.  
  - `CLTV` (Customer Lifetime Value) ranges from **2003 to 6500**, indicating differing customer profitability levels.  

- **Categorical overview:**  
  - `Quarter` = “Q3” for all entries, so it is not informative.  
  - `Customer Status`:  
    - **Stayed**: 4,720 customers  
    - **Churned**: 1,869 customers  
    - **Joined**: 454 customers  
  - `Churn Label`: Binary “Yes”/“No” indicator of churn (Yes = 1,869; No = 5,174).  
  - `Churn Category`: 5 categories for churned customers (most common: *Competitor*).  
  - `Churn Reason`: 20 reasons reported (most frequent: *Competitor had better devices*).  

**Interpretation:**  
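This table provides the **core churn information** and customer satisfaction measures, which form the foundation for our prediction target.  
`Churn Label` (or its numeric equivalent, `Churn Value`) will serve as the **dependent variable (target)** in the churn prediction model; the mean of `Churn Value` puts the overall churn rate at ≈ 26.5%.  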
Columns such as `Count` and `Quarter` are not useful analytically and can be removed.  
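Although `Churn Category` and `Churn Reason` have many missing values, they still offer valuable insight for **post-model interpretation** and business recommendations. Because they are only known once a customer has churned, they should not be used as model features.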
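A minimal sketch of preparing the target, assuming `Churn Label` is encoded as 1 for “Yes” and 0 for “No”. It also checks two properties noted above: the label agrees with `Churn Value`, and `Churn Category` is filled only for churned customers. The frame `status_toy` is made-up data, not rows from the real table.

```python
import pandas as pd

# Made-up rows; on the real data, use the `status` DataFrame.
status_toy = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3", "D4"],
    "Churn Label": ["Yes", "No", "No", "Yes"],
    "Churn Value": [1, 0, 0, 1],
    "Churn Category": ["Competitor", None, None, "Competitor"],
})

# Encode the target as 0/1.
y = status_toy["Churn Label"].map({"Yes": 1, "No": 0})

# Consistency checks between the label, its numeric twin, and the churn-only field.
labels_agree = bool((y == status_toy["Churn Value"]).all())
category_only_for_churners = bool((status_toy["Churn Category"].notna() == (y == 1)).all())

churn_rate = float(y.mean())
print(labels_agree, category_only_for_churners, churn_rate)
```

On the full table, `churn_rate` should come out near 0.265, matching the mean of `Churn Value` in the numeric summary.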

In [None]:
print("Data loading and initial exploration complete.")
