# 06 – Data Type and Schema Management
Enforcing Structural Consistency Before Modeling
     



## Objective

This notebook establishes a rigorous approach to data type and schema management, covering:

- Why explicit schemas matter in Data Science
- Detecting incorrect or implicit data types
- Managing numeric, categorical, ordinal, boolean, and datetime features
- Enforcing schemas programmatically
- Schema validation as a preprocessing safeguard

It answers:

    How do we ensure data is structurally correct, consistent, and model-ready before feature engineering?
    
    
## Why Schema Management Matters

Incorrect data types silently cause:

- Invalid aggregations
- Broken encodings
- Incorrect scaling
- Hidden leakage
- Production failures

In practice, data science pipelines often break because of **schema drift**, not because of the model that was chosen.

Schema enforcement is not bureaucracy; it is **risk control**.
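As a small illustration of an invalid aggregation, the sketch below uses a hypothetical toy column (`amount`, which is not part of the churn dataset) whose numbers arrived as text. The maximum is computed lexicographically until the column is cast explicitly.

```python
import pandas as pd

# Hypothetical toy data: numeric amounts that were read as text
raw = pd.DataFrame({"amount": ["10", "9", "100"]})

# Text values are compared character by character, so "9" beats "100"
text_max = raw["amount"].max()

# After an explicit cast the aggregation is numeric
numeric_max = pd.to_numeric(raw["amount"]).max()

print(text_max, numeric_max)
```

No error is raised in either case, which is why this kind of problem tends to go unnoticed.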



## Imports and Dataset



In [41]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


In [42]:
df = pd.read_csv("../datasets/synthetic_customer_churn_classification_complete.csv")
df.head()


| | customer_id | age | income | tenure_years | avg_monthly_usage | support_tickets_last_year | satisfaction_level | customer_segment | region | churn | future_retention_offer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | NaN | 2.012501 | 138.021163 | 1 | NaN | segment_18 | South | 0 | -0.069047 |
| 1 | 2 | 18 | 58991.061162 | 9.00555 | 213.043003 | 2 | Very High | segment_98 | West | 0 | -0.226607 |
| 2 | 3 | 67 | 31130.298545 | 3.633058 | 68.591582 | 2 | Medium | segment_134 | North | 0 | -0.065741 |
| 3 | 4 | 64 | NaN | 4.295957 | 28.790894 | 1 | NaN | segment_72 | North | 0 | 0.061886 |
| 4 | 5 | 37 | 22301.231175 | 2.549855 | 100.136569 | 2 | High | segment_147 | East | 1 | 1.073678 |


## Step 1 – Initial Schema Inspection


In [43]:
# Data Structure

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer_id                10000 non-null  int64  
 1   age                        10000 non-null  int64  
 2   income                     7000 non-null   float64
 3   tenure_years               10000 non-null  float64
 4   avg_monthly_usage          9600 non-null   float64
 5   support_tickets_last_year  10000 non-null  int64  
 6   satisfaction_level         7500 non-null   object 
 7   customer_segment           10000 non-null  object 
 8   region                     10000 non-null  object 
 9   churn                      10000 non-null  int64  
 10  future_retention_offer     10000 non-null  float64
dtypes: float64(4), int64(4), object(3)
memory usage: 859.5+ KB


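### Common Issues to Look For

- Numeric features stored as `object`
- Dates parsed as strings
- Binary flags encoded inconsistently
- Ordinal categories treated as nominal
- Identifiers mistakenly used as numeric predictors

In this dataset, `df.info()` shows that `satisfaction_level` (an ordered scale) is stored as `object`, `customer_id` is an `int64` that could be mistaken for a numeric predictor, and `income`, `avg_monthly_usage`, and `satisfaction_level` contain missing values.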
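One way to look for numeric features stored as text is to test how many values in each text column parse as numbers. The helper below is a minimal sketch: the function name, the 0.95 threshold, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def numeric_like_text_columns(df, threshold=0.95):
    """Return text columns whose non-null values mostly parse as numbers."""
    flagged = {}
    for col in df.columns:
        # Only text-like columns are candidates
        if not (is_object_dtype(df[col]) or is_string_dtype(df[col])):
            continue
        non_null = df[col].dropna()
        if non_null.empty:
            continue
        # Share of values that survive a numeric parse
        share = pd.to_numeric(non_null, errors="coerce").notna().mean()
        if share >= threshold:
            flagged[col] = float(share)
    return flagged


# Hypothetical toy frame, not the churn dataset
toy = pd.DataFrame({
    "price": ["1.5", "2", "3"],
    "city": ["Oslo", "Lima", "Pune"],
    "units": [1, 2, 3],
})
flagged = numeric_like_text_columns(toy)
print(flagged)
```

Only `price` is flagged here. A column that is flagged is a candidate for explicit casting, not proof of an error, since some codes (postal codes, for example) are digits but not quantities.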


## Step 2 – Explicit Feature Grouping

### Feature Group Definition


In [44]:
identifier_features = ["customer_id"]
target = "churn"

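numeric_features = [
    "age",
    "income",
    "monthly_charges",  # not in this dataset; skipped during casting
    "tenure_years",
    "avg_monthly_usage"
]

categorical_features = [
    "contract_type",   # not in this dataset; skipped during casting
    "payment_method",  # not in this dataset; skipped during casting
    "region"
]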

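# customer_segment holds nominal labels such as "segment_18", so it is not ordinal.
# satisfaction_level is the ordered feature. The level list below is an assumption
# based on the values visible in df.head(); confirm it with
# df["satisfaction_level"].dropna().unique() before casting.
ordinal_features = {
    "satisfaction_level": ["Low", "Medium", "High", "Very High"]
}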

boolean_features = [
    "has_dependents",    # not in this dataset; skipped during casting
    "paperless_billing"  # not in this dataset; skipped during casting
]

# datetime_features = [
#     "signup_date"
# ]
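Feature groups declared by hand can fall out of step with the columns that are actually loaded. A possible safeguard is to audit the declaration against the column list. The sketch below assumes a hypothetical helper `audit_feature_groups` and toy inputs; it is not part of the original notebook.

```python
def audit_feature_groups(columns, groups):
    """Compare declared feature groups with the columns actually present."""
    declared = [name for names in groups.values() for name in names]
    return {
        # Declared but absent from the data
        "missing": sorted(set(declared) - set(columns)),
        # Present in the data but not assigned to any group
        "unassigned": sorted(set(columns) - set(declared)),
        # Assigned to more than one group
        "duplicated": sorted({n for n in declared if declared.count(n) > 1}),
    }


# Hypothetical toy inputs, not the churn dataset
columns = ["customer_id", "age", "region", "churn"]
groups = {
    "identifier": ["customer_id"],
    "numeric": ["age", "monthly_charges"],
    "categorical": ["region"],
    "target": ["churn"],
}
audit = audit_feature_groups(columns, groups)
print(audit)
```

With real data, `columns` would be `df.columns` and `groups` would be built from the lists defined in this step.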


## Step 3 – Numeric Type Enforcement
### Numeric Casting

In [45]:
for col in numeric_features:
    # Skip declared features that are absent from this dataset
    if col in df.columns:
        # errors="coerce" turns unparseable values into NaN instead of raising
        df[col] = pd.to_numeric(df[col], errors="coerce")
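Because `errors="coerce"` converts unparseable values to `NaN` without any warning, it can help to count how many values were lost in the cast. The following is a minimal sketch with a hypothetical helper name and toy data.

```python
import pandas as pd


def to_numeric_with_report(series):
    """Cast to numeric and count values that coercion turned into NaN."""
    converted = pd.to_numeric(series, errors="coerce")
    # Values that were present before the cast but are missing after it
    newly_missing = int((converted.isna() & series.notna()).sum())
    return converted, newly_missing


# Hypothetical toy data: one unparseable token and one genuine null
raw = pd.Series(["42", "3.5", "unknown", None])
converted, newly_missing = to_numeric_with_report(raw)
print(newly_missing)
```

A non-zero count is a prompt to inspect the offending values before deciding whether treating them as missing is acceptable.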


## Step 4 – Categorical and Ordinal Handling

Categorical and ordinal variables are **not interchangeable**.

### Nominal Categories

In [46]:
for col in categorical_features:
    # Skip declared features that are absent from this dataset
    if col in df.columns:
        df[col] = df[col].astype("category")


### Ordinal Categories

In [47]:
for col, order in ordinal_features.items():
    df[col] = pd.Categorical(
        df[col],
        categories=order,
        ordered=True
    )
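One caveat with `pd.Categorical` is that any value missing from `categories` becomes `NaN` without an error, so the declared order should be checked against the data first. The toy example below (hypothetical `sizes` feature, not the churn dataset) shows this behaviour together with what an ordered dtype provides.

```python
import pandas as pd

# Hypothetical toy feature, not the churn dataset
sizes = pd.Series(["S", "L", "M", "XL"])
order = ["S", "M", "L"]  # "XL" is deliberately left out

# Check for unexpected levels before casting
unexpected = sorted(set(sizes.dropna()) - set(order))

typed = pd.Series(pd.Categorical(sizes, categories=order, ordered=True))

# Integer position in the declared order; -1 marks values that became missing
codes = typed.cat.codes.tolist()

# Comparisons follow the declared order, not alphabetical order
at_least_medium = (typed >= "M").tolist()

print(unexpected, codes, at_least_medium)
```

If `unexpected` is not empty, the category list (or the data) needs attention before the cast is applied.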


## Step 5 – Boolean Feature Normalization


In [48]:
for col in boolean_features:
    # Neither boolean feature exists in this dataset, so this loop is a no-op here
    if col in df.columns:
        df[col] = df[col].astype("boolean")
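When binary flags arrive as inconsistent text (for example `Yes`, `no`, `Y`, `0`), they need to be mapped to a common representation before casting to the nullable `boolean` dtype. The sketch below is one possible approach; the token sets and the helper name are assumptions and would need to match the real data.

```python
import pandas as pd

# Assumed token sets; extend them to match the encodings found in the data
TRUE_TOKENS = {"yes", "y", "true", "1"}
FALSE_TOKENS = {"no", "n", "false", "0"}


def to_nullable_boolean(series):
    """Map inconsistent flag encodings to the nullable boolean dtype."""
    def parse(value):
        if pd.isna(value):
            return pd.NA
        token = str(value).strip().lower()
        if token in TRUE_TOKENS:
            return True
        if token in FALSE_TOKENS:
            return False
        return pd.NA  # unrecognised token: surface it as missing

    parsed = pd.array([parse(v) for v in series], dtype="boolean")
    return pd.Series(parsed, index=series.index, name=series.name)


# Hypothetical toy data
raw = pd.Series(["Yes", "no", "Y", "0", None, "maybe"])
flags = to_nullable_boolean(raw)
print(flags)
```

Unrecognised tokens such as `maybe` end up as `<NA>`, so the null count before and after the mapping shows how many values were not understood.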


## Step 6 – Datetime Parsing


### Datetime Conversion

In [50]:
# for col in datetime_features:
#     df[col] = pd.to_datetime(df[col], errors="coerce")
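The churn dataset has no date column, so the conversion above stays commented out. To show what it would do, here is a minimal sketch on a hypothetical toy column named after the commented-out `signup_date` feature, with an assumed `%Y-%m-%d` format.

```python
import pandas as pd

# Hypothetical toy column; "2023-02-30" is not a valid calendar date
toy = pd.DataFrame({"signup_date": ["2023-01-15", "2023-02-30", None]})

# An explicit format avoids ambiguous day/month inference
parsed = pd.to_datetime(toy["signup_date"], format="%Y-%m-%d", errors="coerce")

# Values that were present but could not be parsed
invalid = int((parsed.isna() & toy["signup_date"].notna()).sum())

# Datetime semantics become available only after parsing
signup_month = parsed.dt.month

print(invalid)
```

As with numeric coercion, counting the values that failed to parse keeps `errors="coerce"` from hiding data quality problems.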


## Step 7 – Identifier Protection

Identifiers must:
- Never be scaled
- Never be encoded
- Never enter models

### Sanity Check

In [30]:
df[identifier_features].nunique()


customer_id    10000
dtype: int64
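The three rules above can be enforced by dropping identifiers (and the target) when the feature matrix is built. The sketch below uses a hypothetical toy frame that reuses the notebook's `identifier_features` and `target` names.

```python
import pandas as pd

# Hypothetical toy frame with the same roles as the churn dataset
toy = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [30, 41, 25],
    "churn": [0, 1, 0],
})
identifier_features = ["customer_id"]
target = "churn"

# One row per identifier; duplicates would point to a join or export problem
assert toy[identifier_features].nunique().eq(len(toy)).all()

# Identifiers and the target never reach the feature matrix
X = toy.drop(columns=identifier_features + [target])
y = toy[target]

print(list(X.columns))
```

Keeping the exclusion in one place means later scaling and encoding steps never see the identifier at all.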

## Step 8 – Schema Validation Checks


### Validation Assertions

In [58]:
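# Keep only the numeric features that exist in this dataset (safe to re-run)
numeric_features = [col for col in numeric_features if col in df.columns]
numeric_features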

['age', 'income', 'tenure_years', 'avg_monthly_usage']

In [77]:
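# The target must be binary
assert df[target].isin([0, 1]).all(), "target must contain only 0 and 1"

# Check only the declared features that exist in this dataset
assert all(df[col].dtype.kind in "if" for col in numeric_features if col in df.columns)
assert all(df[col].dtype.name == "category" for col in categorical_features if col in df.columns)
assert all(df[col].dtype.name == "boolean" for col in boolean_features if col in df.columns)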
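Bare assertions stop at the first failure and do not say which column is wrong. An alternative is to collect every violation in one pass. The sketch below assumes a hypothetical `validate_schema` helper and a hand-written `expected` mapping; note that pandas treats boolean columns as numeric, so the `numeric` check would also accept them.

```python
import pandas as pd
from pandas.api import types as ptypes

# Assumed mapping from a schema label to a dtype check
CHECKS = {
    "numeric": ptypes.is_numeric_dtype,
    "category": lambda s: isinstance(s.dtype, pd.CategoricalDtype),
    "boolean": ptypes.is_bool_dtype,
    "datetime": ptypes.is_datetime64_any_dtype,
}


def validate_schema(df, expected):
    """Return a list of readable schema violations (empty if none)."""
    problems = []
    for col, kind in expected.items():
        if col not in df.columns:
            problems.append(f"{col}: column is missing")
        elif not CHECKS[kind](df[col]):
            problems.append(f"{col}: expected {kind}, found {df[col].dtype}")
    return problems


# Hypothetical toy frame and expected schema
toy = pd.DataFrame({"age": [30, 41], "region": ["North", "South"]})
expected = {"age": "numeric", "region": "category", "signup_date": "datetime"}

before = validate_schema(toy, expected)
toy["region"] = toy["region"].astype("category")
after = validate_schema(toy, expected)

print(before)
print(after)
```

Returning a list rather than raising immediately makes it possible to log all problems and then decide whether to fail the run.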


## Step 9 – Schema Drift Risk

In production:
- New categories appear
- Types change silently
- Null rates increase

Schema validation should run **before inference**.
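The three drift symptoms listed above can be checked by comparing an incoming batch against a reference frame, such as the training data. The function below is a minimal sketch: the name `drift_report`, the 0.05 null-rate tolerance, and the toy frames are assumptions.

```python
import pandas as pd


def drift_report(reference, batch, categorical_cols, null_tolerance=0.05):
    """Compare a new batch with a reference frame and list schema drift."""
    report = {"dtype_changes": {}, "new_categories": {}, "null_rate_increase": {}}
    for col in reference.columns:
        if col not in batch.columns:
            report["dtype_changes"][col] = (str(reference[col].dtype), "missing")
            continue
        # Types that changed silently
        if reference[col].dtype != batch[col].dtype:
            report["dtype_changes"][col] = (
                str(reference[col].dtype), str(batch[col].dtype)
            )
        # Null rates that grew beyond the tolerance
        increase = batch[col].isna().mean() - reference[col].isna().mean()
        if increase > null_tolerance:
            report["null_rate_increase"][col] = round(float(increase), 3)
    for col in categorical_cols:
        if col in reference.columns and col in batch.columns:
            # Categories never seen in the reference data
            unseen = set(batch[col].dropna()) - set(reference[col].dropna())
            if unseen:
                report["new_categories"][col] = sorted(unseen)
    return report


# Hypothetical toy frames
reference = pd.DataFrame(
    {"age": [30, 40], "income": [1.0, 2.0], "region": ["North", "South"]}
)
batch = pd.DataFrame(
    {"age": ["30", "40"], "income": [None, 3.0], "region": ["North", "Central"]}
)
report = drift_report(reference, batch, categorical_cols=["region"])
print(report)
```

In this toy case the report flags `age` (type change), `region` (new category `Central`), and `income` (higher null rate). Running such a check on every batch before inference turns silent drift into an explicit signal.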


## Best Practices Summary

| Aspect | Best Practice |
|------|--------------|
| Numeric | Explicit casting |
| Categorical | Category dtype |
| Ordinal | Ordered categories |
| Boolean | Boolean dtype |
| Datetime | Parsed early |
| IDs | Explicit exclusion |


## Common Mistakes (Avoided)

- Trusting inferred dtypes
- Mixing ordinal and nominal variables
- Scaling identifiers
- Ignoring datetime semantics
- No schema validation


## Key Takeaways

- Schema is part of the model
- Explicit > implicit
- Data types encode meaning
- Validation prevents silent failures
- Production systems demand rigor


## Next Notebook

02_Data_Preprocessing/

└── [07_preprocessing_pipelines.ipynb](./07_preprocessing_pipelines.ipynb/)
