# Data Overview and Quality
EDA Entry & Data Contract Validation
## Objective

This notebook provides a systematic overview of the dataset to:

- Validate schema and data types

- Assess data completeness and consistency

- Detect structural data quality issues

- Establish a data contract for downstream analysis

This notebook answers:

    “Is this dataset fit to be analyzed at all?”

## Why Data Overview Comes First

Without a data overview:

- Silent schema errors propagate

- Downstream metrics become meaningless

- Models fail due to preventable issues

This notebook defines what the data is, before asking what it means.

# Imports and Configuration

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


# Load Dataset

Dataset name (canonical):


    synthetic_customer_churn_eda_benchmark
    
For this benchmark notebook, we generate the dataset inline.

In [4]:
np.random.seed(2010)

N = 5000

df = pd.DataFrame({
    "customer_id": range(1, N + 1),
    "age": np.random.randint(18, 75, size=N),
    "income": np.random.lognormal(mean=10.8, sigma=0.6, size=N),
    "tenure_years": np.random.exponential(scale=6, size=N),
    "transactions_last_30d": np.random.poisson(lam=4, size=N),
    "region": np.random.choice(
        ["North", "South", "East", "West"],
        size=N,
        p=[0.35, 0.25, 0.25, 0.15]
    ),
    "churn": np.random.binomial(1, 0.28, size=N)
})

df.head()


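|     | customer_id | age | income       | tenure_years | transactions_last_30d | region | churn |
| --- | ----------- | --- | ------------ | ------------ | --------------------- | ------ | ----- |
| 0   | 1           | 18  | 45868.374647 | 5.749047     | 4                     | North  | 0     |
| 1   | 2           | 18  | 74287.388492 | 1.537824     | 5                     | South  | 0     |
| 2   | 3           | 67  | 78586.352313 | 24.502748    | 3                     | North  | 0     |
| 3   | 4           | 64  | 56102.92543  | 1.502888     | 3                     | South  | 0     |
| 4   | 5           | 37  | 25639.985952 | 3.9506       | 6                     | North  | 0     |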


## Save the Dataset
- Optionally write the dataset to the `datasets` folder (the call below is left commented out)

In [8]:
# df.to_csv("../datasets/synthetic_customer_churn_eda_benchmark.csv", index=False)

# Step 2 – Dataset Shape and Structure

In [11]:
df.shape

(5000, 7)

In [12]:
df.columns

Index(['customer_id', 'age', 'income', 'tenure_years', 'transactions_last_30d',
       'region', 'churn'],
      dtype='object')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   customer_id            5000 non-null   int64  
 1   age                    5000 non-null   int32  
 2   income                 5000 non-null   float64
 3   tenure_years           5000 non-null   float64
 4   transactions_last_30d  5000 non-null   int32  
 5   region                 5000 non-null   object 
 6   churn                  5000 non-null   int32  
dtypes: float64(2), int32(3), int64(1), object(1)
memory usage: 215.0+ KB


### Validation Checklist

- Unique row identifier present (customer_id)

- No unexpected object dtypes (`region` is the only object column, as expected for a categorical)

- Target variable identified (churn)

# Step 3 – Schema Validation (Conceptual)


| Column                | Expected Type | Role        |
| --------------------- | ------------- | ----------- |
| customer_id           | int           | Primary key |
| age                   | int           | Numeric     |
| income                | float         | Numeric     |
| tenure_years          | float         | Numeric     |
| transactions_last_30d | int           | Count       |
| region                | category      | Categorical |
| churn                 | binary        | Target      |
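The table above is conceptual, but the same contract can be checked in code. The following is a minimal sketch, not the notebook's own method: `EXPECTED_SCHEMA`, `schema_violations`, and the small `sample` frame are hypothetical names introduced for illustration, and `region` is assumed to be acceptable as either a string/object or a categorical column.

```python
import pandas as pd
from pandas.api import types as ptypes


def is_categorical_like(dtype) -> bool:
    """Accept string/object columns or true pandas categoricals (an assumption)."""
    return ptypes.is_string_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype)


# Hypothetical encoding of the schema table: column name -> dtype predicate.
EXPECTED_SCHEMA = {
    "customer_id": ptypes.is_integer_dtype,
    "age": ptypes.is_integer_dtype,
    "income": ptypes.is_float_dtype,
    "tenure_years": ptypes.is_float_dtype,
    "transactions_last_30d": ptypes.is_integer_dtype,
    "region": is_categorical_like,
    "churn": ptypes.is_integer_dtype,
}


def schema_violations(frame: pd.DataFrame) -> list:
    """Return readable schema problems; an empty list means the frame conforms."""
    problems = []
    for column, is_expected in EXPECTED_SCHEMA.items():
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
        elif not is_expected(frame[column].dtype):
            problems.append(f"unexpected dtype for {column}: {frame[column].dtype}")
    # Columns that are present but not part of the contract are reported too.
    for column in sorted(set(frame.columns) - set(EXPECTED_SCHEMA)):
        problems.append(f"unexpected column: {column}")
    return problems


# Small stand-in frame with the same columns as the benchmark dataset.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [18, 67, 37],
    "income": [45868.37, 78586.35, 25639.99],
    "tenure_years": [5.75, 24.50, 3.95],
    "transactions_last_30d": [4, 3, 6],
    "region": ["North", "North", "South"],
    "churn": [0, 0, 0],
})

print(schema_violations(sample))
```

In the notebook itself, `schema_violations(df)` would be the natural call. Using dtype predicates rather than exact dtype names keeps the check tolerant of platform differences such as `int32` versus `int64`.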


# Step 4 – Missing Value Assessment

In [14]:
missing_summary = (
    df.isna()
    .mean()
    .sort_values(ascending=False))

missing_summary


customer_id              0.0
age                      0.0
income                   0.0
tenure_years             0.0
transactions_last_30d    0.0
region                   0.0
churn                    0.0
dtype: float64

## Interpretation

- No missing values in synthetic data

- Real-world datasets rarely behave this well
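Because the synthetic data has nothing missing, the summary above does not show what a problem looks like. The sketch below uses a deliberately incomplete toy frame to illustrate one possible report; `missing_report`, the `max_fraction` tolerance, and the toy values are assumptions for illustration, not part of the benchmark dataset.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, max_fraction: float = 0.05) -> pd.DataFrame:
    """Per-column missing count and fraction, flagged against a tolerance."""
    report = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "fraction_missing": frame.isna().mean(),
    })
    # True where the missing fraction is strictly above the tolerance.
    report["exceeds_tolerance"] = report["fraction_missing"] > max_fraction
    return report.sort_values("fraction_missing", ascending=False)


# Toy frame with injected gaps (illustrative values only).
toy_missing = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "region": ["North", "South", None, "West"],
    "churn": [0, 1, 0, 0],
})

report = missing_report(toy_missing, max_fraction=0.3)
print(report)
```

Reporting counts next to fractions helps when columns have very different practical importance; the tolerance itself is a project decision rather than a fixed rule.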

# Step 5 – Duplicate Records

In [18]:
# Count fully duplicated rows
df.duplicated().sum()

np.int64(0)

In [19]:
# Count duplicated customer IDs
df["customer_id"].duplicated().sum()

np.int64(0)

## Rule

- Duplicate primary keys are unacceptable

- Duplicated rows require business investigation
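One way to apply both rules is sketched below. Since `customer_id` is part of every row, a fully duplicated row always implies a duplicated key, so this sketch also compares rows on the non-key columns only. That extra comparison is an assumption about what "business investigation" might cover, and `duplicate_report` and the toy frame are hypothetical.

```python
import pandas as pd


def duplicate_report(frame: pd.DataFrame, key: str):
    """Return (rows sharing a key value, rows identical on all non-key columns)."""
    key_dupes = frame[frame[key].duplicated(keep=False)]
    payload_cols = [c for c in frame.columns if c != key]
    payload_dupes = frame[frame.duplicated(subset=payload_cols, keep=False)]
    return key_dupes, payload_dupes


# Toy frame: IDs 1 and 2 carry identical attributes; ID 3 appears twice.
toy_dupes = pd.DataFrame({
    "customer_id": [1, 2, 3, 3],
    "age": [30, 30, 45, 52],
    "region": ["North", "North", "East", "West"],
})

key_dupes, payload_dupes = duplicate_report(toy_dupes, key="customer_id")
print(key_dupes)      # rows that violate primary-key uniqueness
print(payload_dupes)  # rows to hand over for investigation
```

A non-empty `key_dupes` would be treated as a hard failure, while `payload_dupes` is only a list of candidates: two different customers can legitimately share the same attributes.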

# Step 6 – Basic Validity Rules
## Range Checks

- `age`: must lie within [0, 120] years
- `income`: must be strictly positive (> 0)
- `tenure_years`, `transactions_last_30d`: must be non-negative (>= 0)
- Each check uses `assert`: if any observation fails its condition, Python raises an `AssertionError`.

In [20]:
assert df["age"].between(0, 120).all()
assert (df["income"] > 0).all()
assert (df["tenure_years"] >= 0).all()
assert (df["transactions_last_30d"] >= 0).all()

In [23]:
# This assertion would fail: generated ages span [18, 74], not [0, 20]

#assert df["age"].between(0, 20).all()
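A bare `assert` stops at the first failing rule and does not say how many rows are affected. As a hedged alternative, the sketch below counts violations per rule; `range_violations` and the toy frame are hypothetical, and the bounds simply mirror the assertions above.

```python
import pandas as pd


def range_violations(frame: pd.DataFrame) -> pd.Series:
    """Count rows that break each range rule (NaN values count as violations)."""
    passes = {
        "age in [0, 120]": frame["age"].between(0, 120),
        "income > 0": frame["income"] > 0,
        "tenure_years >= 0": frame["tenure_years"] >= 0,
        "transactions_last_30d >= 0": frame["transactions_last_30d"] >= 0,
    }
    return pd.Series(
        {rule: int((~ok).sum()) for rule, ok in passes.items()},
        name="n_violations",
    )


# Toy frame with one deliberate violation in each of the first three rules.
toy_ranges = pd.DataFrame({
    "age": [25, 130, 40],
    "income": [50000.0, 0.0, 30000.0],
    "tenure_years": [1.0, 2.0, -0.5],
    "transactions_last_30d": [3, 0, 5],
})

violation_counts = range_violations(toy_ranges)
print(violation_counts)
```

The counts can still feed a hard stop, for example `assert violation_counts.sum() == 0`, while leaving a readable record of which rule failed and how often.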

## Cardinality Checks

In [26]:
df["region"].value_counts()

region
North    1761
East     1246
South    1219
West      774
Name: count, dtype: int64

In [27]:
df["churn"].value_counts()

churn
0    3618
1    1382
Name: count, dtype: int64

### Interpretation

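- Region cardinality acceptable: four levels, all expected, with the smallest (West) at about 15% of rows

- Target moderately imbalanced: 1,382 of 5,000 customers churned (about 27.6%)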
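Reading `value_counts()` by eye works for four levels, but the same judgment can be written down as a check. The sketch below flags category values outside an allowed set and recomputes the churn rate from the counts shown in the output above; `ALLOWED_REGIONS`, `unexpected_levels`, and the messy toy series are assumptions for illustration.

```python
import pandas as pd

# Allowed levels taken from the dataset generation step.
ALLOWED_REGIONS = {"North", "South", "East", "West"}


def unexpected_levels(series: pd.Series, allowed: set) -> set:
    """Return observed values (ignoring NaN) that are not in the allowed set."""
    return set(series.dropna().unique()) - set(allowed)


# Toy series with typical real-world defects: wrong case and trailing space.
toy_region = pd.Series(["North", "south", "East", "West", "North "])
print(unexpected_levels(toy_region, ALLOWED_REGIONS))

# Churn rate from the value counts reported above (3618 zeros, 1382 ones).
churn_counts = pd.Series({0: 3618, 1: 1382})
churn_rate = churn_counts.loc[1] / churn_counts.sum()
print(round(churn_rate, 4))  # 0.2764
```

On the benchmark data, `unexpected_levels(df["region"], ALLOWED_REGIONS)` would be expected to return an empty set, since the generator only draws from the four listed regions.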


# Step 7 – Consistency & Business Rules

Example rules:

- High tenure should not coexist with zero age

- Transactions should not be negative

- Income must be positive

These rules pass for this dataset.
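The example rules can be expressed as boolean masks, one column per rule. This is a sketch under stated assumptions: the source does not define "high tenure", so `HIGH_TENURE_YEARS = 10` is a hypothetical cutoff, and `business_rule_flags` and the toy frame are illustrative names and values.

```python
import pandas as pd

HIGH_TENURE_YEARS = 10  # hypothetical cutoff; the notebook does not define "high"


def business_rule_flags(frame: pd.DataFrame) -> pd.DataFrame:
    """One boolean column per rule; True marks a row that breaks the rule."""
    return pd.DataFrame({
        "high_tenure_with_zero_age": (
            (frame["tenure_years"] >= HIGH_TENURE_YEARS) & (frame["age"] == 0)
        ),
        "negative_transactions": frame["transactions_last_30d"] < 0,
        "non_positive_income": frame["income"] <= 0,
    })


# Toy frame in which each row breaks exactly one rule.
toy_rules = pd.DataFrame({
    "age": [0, 35, 50],
    "tenure_years": [12.0, 3.0, 20.0],
    "transactions_last_30d": [2, -1, 4],
    "income": [40000.0, 52000.0, 0.0],
})

flags = business_rule_flags(toy_rules)
print(flags.sum())                      # violations per rule
print(toy_rules[flags.any(axis=1)])     # rows needing attention
```

Keeping one mask per rule makes cross-field rules, such as the tenure and age combination, as easy to audit as single-column ones.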


# Step 8 – Early Data Quality Risks



| Risk             | Status |
| ---------------- | ------ |
| Missing data     | None   |
| Duplicates       | None   |
| Invalid ranges   | None   |
| Schema mismatch  | None   |
| Target ambiguity | None   |


# Step 9 – Data Quality Output Artifacts

This notebook establishes:

- Dataset schema contract

- Feature roles and expectations

- Baseline data quality guarantees

Feeds directly into:

    01_Exploratory_Data_Analysis/
    └── 02_univariate_analysis.ipynb

# Summary

This notebook demonstrated:

- Structural data validation

- Completeness and consistency checks

- Business rule enforcement

- EDA entry gating

If this notebook fails, EDA must stop.
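The stop condition can be made explicit with a small gate that runs named checks and raises if any fail. This is a minimal sketch of the idea rather than a prescribed implementation: `DataQualityError`, `CHECKS`, and `run_gate` are hypothetical names, and the registry holds only a subset of the checks covered in this notebook.

```python
from typing import Callable, Dict

import pandas as pd


class DataQualityError(Exception):
    """Raised when the dataset fails the EDA entry gate."""


# Hypothetical registry: check name -> function returning True when it passes.
CHECKS: Dict[str, Callable[[pd.DataFrame], bool]] = {
    "no_missing_values": lambda f: bool(f.notna().all().all()),
    "unique_customer_id": lambda f: bool(f["customer_id"].is_unique),
    "age_within_0_120": lambda f: bool(f["age"].between(0, 120).all()),
    "churn_is_binary": lambda f: bool(f["churn"].isin([0, 1]).all()),
}


def run_gate(frame: pd.DataFrame, checks: Dict[str, Callable] = CHECKS) -> None:
    """Run every check, then raise once with the names of all failures."""
    failed = [name for name, check in checks.items() if not check(frame)]
    if failed:
        raise DataQualityError("EDA gate failed: " + ", ".join(failed))


good = pd.DataFrame({"customer_id": [1, 2, 3], "age": [18, 67, 37], "churn": [0, 0, 1]})
run_gate(good)  # passes silently

# Same frame with a duplicated key and an invalid target value.
bad = good.assign(customer_id=[1, 1, 3], churn=[0, 2, 1])
try:
    run_gate(bad)
except DataQualityError as exc:
    print(exc)
```

Collecting all failures before raising gives one complete report per run, which tends to be more useful than stopping at the first broken check.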