# Data Type Validation
Always validate data types before analysis or modeling, as incorrect types can silently produce wrong results.
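
As a small illustration of a silently wrong result, the sketch below sums a column of numbers stored as strings. The Series is a made-up sample (not the notebook's CSV), and `dtype=object` is set explicitly to mimic how such a column typically arrives from `read_csv`.

```python
import pandas as pd

# Hypothetical sample: numbers that arrived as text
amounts = pd.Series(["100", "250", "75"], dtype=object)

# Summing object-typed strings concatenates them instead of adding
concatenated = amounts.sum()

# Converting to a numeric dtype first gives the arithmetic total
numeric_total = pd.to_numeric(amounts).sum()

print(repr(concatenated))
print(numeric_total)
```

No error is raised in either case, which is why the dtype has to be checked rather than assumed.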

## Dataset Summary
- **Mixed-Type Numeric (age)**: Contains integers, floats, strings like `42 years`, and non-numeric labels like `Unknown`.
- **Object-Typed Numbers (salary)**: Numbers are stored as strings because of formatting (commas, dollar signs, and decimals).
- **Inconsistent Dates (date of joining)**: Uses multiple formats like `DD-MM-YYYY`, `YYYY/MM/DD`, and `Month DD, YYYY`.
- **The Boolean Mess (is_full_time)**: A common real-world problem where true is written as `1`, `Yes`, or `True` and false as `0`, `No`, or `False`, often with inconsistent casing.

In [1]:
import pandas as pd, numpy as np

In [2]:
df = pd.read_csv("10_datatype_validation.csv")
df

Unnamed: 0,name,gender,age,salary,city,date of joining,is_full_time
0,Amit,Male,28,40769,Kolkata,30-10-2021,TRUE
1,Riya,Female,41,99735,Pune,09-02-2018,Yes
2,John,Male,36,96101,Pune,02-06-2019,1
3,Neha,Female,32,42433.5,Kolkata,04-03-2020,TRUE
4,Siddharth,Male,29,45311,Pune,15-10-2022,TRUE
5,Zoe,Female,42 years,77819,Bangalore,28-11-2023,TRUE
6,Ken,Male,28,79188,Chennai,12-05-2018,TRUE
7,Anjali,Female,47,57568,Kolkata,25-10-2020,TRUE
8,Vijay,Male,40,$93707,Chennai,06-09-2022,TRUE
9,Priya,Female,44,59769,Delhi,03-09-2022,TRUE


## Memory Usage Before Conversion

In [3]:
df.memory_usage(deep=True)

Index               132
name               1073
gender             1080
age                1031
salary             1084
city               1114
date of joining    1181
is_full_time       1052
dtype: int64

## Clean and Convert Numeric Columns

### Handle `age` Column
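- Use the nullable `Int64` dtype so the column can hold integers alongside missing values (`<NA>`).
- `errors='coerce'` turns unparseable values such as `Unknown` into NaN.
- When memory is critical, a smaller nullable type such as `Int8` (range -128 to 127) is enough for ages.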

In [4]:
df['age'] = df['age'].str.replace(' years', '', regex=False)  # Remove the ' years' suffix
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')  # Unparseable values such as 'Unknown' become <NA>

**Note**: `errors='coerce'` converts invalid values to NaN, which should be handled explicitly afterward.
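
One way to handle the coerced values explicitly is to list which raw entries were lost in the conversion. The sketch below uses a made-up sample Series, and the variable names (`raw_age`, `failed`) are illustrative rather than part of the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values described for `age`
raw_age = pd.Series(["28", "42 years", "Unknown", None], dtype=object)

cleaned = raw_age.str.replace(" years", "", regex=False)
age = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

# Entries that were present in the raw data but could not be parsed
failed = raw_age[raw_age.notna() & age.isna()]

print(age.tolist())
print(failed.tolist())
```

Reviewing `failed` before imputing or dropping rows separates values that were genuinely missing from values that were present but malformed.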

### Handle `salary` Column
Strip currency symbols and thousands separators before converting.

In [5]:
df['salary'] = df['salary'].str.replace(r'[$,]', '', regex=True)
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
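
The same two steps can be wrapped in a small function and tried on a few representative values. This is a sketch only: `clean_salary` is a hypothetical helper, the sample values are invented, and the `astype(str)` call is an added assumption so the function also accepts a column that pandas already parsed as numeric.

```python
import pandas as pd

def clean_salary(column: pd.Series) -> pd.Series:
    """Hypothetical helper: drop '$' and ',' and convert to float."""
    # astype(str) lets the .str accessor work even on numeric input
    text = column.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    # Anything still non-numeric (e.g. 'n/a', 'None') becomes NaN
    return pd.to_numeric(text, errors="coerce")

raw_salary = pd.Series(["40769", "$93707", "120,000.50", "n/a", None], dtype=object)
salary = clean_salary(raw_salary)

print(salary.tolist())
```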

### .apply() → Apply any function to rows or columns
Classify each employee as junior or senior by salary, with 75000 as the cutoff (the code below treats exactly 75000 as senior).

In [6]:
# Missing salaries (NaN) fail the >= comparison and therefore fall into "Junior"
df["Junior or Senions"] = df["salary"].apply(lambda salary: "Senions" if salary >= 75000 else "Junior")
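
For a simple threshold like this, a vectorized comparison is a common alternative to `.apply()`. The sketch below uses `numpy.where` on an invented salary Series; the label strings and the choice to leave missing salaries unlabeled are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, including the boundary value and a missing entry
salary = pd.Series([40769.0, 99735.0, 75000.0, np.nan])

# Vectorized labeling with the same cutoff (75000 or more counts as senior)
labels = pd.Series(
    np.where(salary >= 75000, "Senior", "Junior"),
    index=salary.index,
)

# Keep missing salaries as missing instead of defaulting them to "Junior"
labels = labels.where(salary.notna())

print(labels.tolist())
```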

## Standardize Date Formats

### Handle `date of joining` Column

In [7]:
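# format='mixed' (pandas 2.0+) infers the format per value; dayfirst=True reads 04-03-2020 as 4 March
df['date of joining'] = pd.to_datetime(df['date of joining'], format='mixed', dayfirst=True, errors='coerce')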
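
When the set of formats is known in advance, a stricter alternative is to try each format explicitly and keep the first successful parse. This is a sketch under that assumption: the sample values and the `CANDIDATE_FORMATS` list are illustrative and would need to match the formats actually present in the data.

```python
import pandas as pd

# Hypothetical sample covering the three formats listed in the dataset summary
raw_dates = pd.Series(
    ["30-10-2021", "2019/06/02", "March 04, 2020", "not a date"], dtype=object
)

# Assumed formats: DD-MM-YYYY, YYYY/MM/DD, Month DD, YYYY
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y/%m/%d", "%B %d, %Y"]

# Parse with the first format, then fill remaining gaps with the others
parsed = pd.to_datetime(raw_dates, format=CANDIDATE_FORMATS[0], errors="coerce")
for fmt in CANDIDATE_FORMATS[1:]:
    attempt = pd.to_datetime(raw_dates, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Values that matched none of the formats
unparsed = raw_dates[parsed.isna()]

print(parsed.tolist())
print(unparsed.tolist())
```

Explicit formats avoid guessing whether a value like `04-03-2020` is day-first or month-first, at the cost of having to enumerate every format up front.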

## Map Booleans

### Handle `is_full_time` Column

In [8]:
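# Keys are lowercase strings because the column is normalized to lowercase text first
bool_map = {
    'true': True, 'yes': True, '1': True,
    'false': False, 'no': False, '0': False
}
df['is_full_time'] = df['is_full_time'].astype(str).str.strip().str.lower()
# Anything not in bool_map (including missing values, which become 'nan') maps to <NA>
df['is_full_time'] = df['is_full_time'].map(bool_map).astype('boolean')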
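
Because `.map()` silently turns any unlisted spelling into a missing value, it can help to list the raw values that were not recognized. The sketch below repeats the mapping on an invented sample; `unmapped` and the sample values are illustrative names and data, not part of the notebook.

```python
import pandas as pd

bool_map = {
    "true": True, "yes": True, "1": True,
    "false": False, "no": False, "0": False,
}

# Hypothetical sample with mixed spellings, stray spaces, and one bad value
raw_flags = pd.Series(["TRUE", "Yes", "1", " no ", "0", "maybe", None], dtype=object)

normalized = raw_flags.str.strip().str.lower()
flags = normalized.map(bool_map).astype("boolean")

# Present in the raw data but absent from bool_map
unmapped = raw_flags[raw_flags.notna() & flags.isna()]

print(flags.tolist())
print(unmapped.tolist())
```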

In [9]:
df[['gender', 'city']].memory_usage(deep=True)

Index      132
gender    1080
city      1114
dtype: int64

## Conversion to 'category' type

### Handle `gender` Column

In [10]:
# Before converting to categories, we must ensure 'Male' and 'male' are treated as the same
df['gender'] = df['gender'].str.strip().str.capitalize()
df['gender'] = df['gender'].astype('category')
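
To see why normalization comes first, the sketch below converts an invented, deliberately messy Series both ways and compares the resulting categories and memory footprint. The sample values are assumptions chosen to exaggerate the effect.

```python
import pandas as pd

# Hypothetical messy labels repeated to make the memory difference visible
raw_gender = pd.Series(["Male", "male ", "FEMALE", "Female"] * 250, dtype=object)

# Converting without cleaning keeps every spelling as its own category
naive = raw_gender.astype("category")

# Cleaning first collapses the spellings into the two intended categories
clean = raw_gender.str.strip().str.capitalize().astype("category")

print(list(naive.cat.categories))
print(list(clean.cat.categories))
print(raw_gender.memory_usage(deep=True), clean.memory_usage(deep=True))
```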

### Handle `city` Column

In [11]:
df['city'] = df['city'].str.strip().str.title()
df['city'] = df['city'].astype('category')
df

Unnamed: 0,name,gender,age,salary,city,date of joining,is_full_time,Junior or Senions
0,Amit,Male,28.0,40769.0,Kolkata,2021-10-30,True,Junior
1,Riya,Female,41.0,99735.0,Pune,2018-02-09,True,Senions
2,John,Male,36.0,96101.0,Pune,2019-06-02,True,Senions
3,Neha,Female,32.0,42433.5,Kolkata,2020-03-04,True,Junior
4,Siddharth,Male,29.0,45311.0,Pune,2022-10-15,True,Junior
5,Zoe,Female,42.0,77819.0,Bangalore,2023-11-28,True,Senions
6,Ken,Male,28.0,79188.0,Chennai,2018-05-12,True,Senions
7,Anjali,Female,47.0,57568.0,Kolkata,2020-10-25,True,Junior
8,Vijay,Male,40.0,93707.0,Chennai,2022-09-06,True,Senions
9,Priya,Female,44.0,59769.0,Delhi,2022-09-03,True,Junior


## Accessing Category Codes

In [12]:
df['city'].cat.codes.head()

0    4
1    6
2    6
3    4
4    6
dtype: int8
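
Each code is the position of the value in `cat.categories`, and missing values get the code `-1`. A minimal sketch on an invented Series (not the notebook's data) shows the mapping in both directions:

```python
import pandas as pd

# Hypothetical sample with one missing city
city = pd.Series(["Kolkata", "Pune", "Pune", None], dtype=object).astype("category")

codes = city.cat.codes  # integer position of each value; -1 marks missing

# Lookup table from code back to label
code_to_label = dict(enumerate(city.cat.categories))

print(codes.tolist())
print(code_to_label)
```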

## Memory Usage After Conversion

In [13]:
df.memory_usage(deep=True)

Index                 132
name                 1073
gender                236
age                   180
salary                160
city                  710
date of joining       160
is_full_time           40
Junior or Senions    1108
dtype: int64

## Summary
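- Data type validation is a critical cleaning step.
- Blind casting is dangerous: `astype` raises on dirty values, and `errors='coerce'` silently turns them into NaN unless they are checked afterward.
- Correct types improve analysis and modeling reliability, and compact types such as `category` and `boolean` also cut memory usage.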
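
As a closing check, the expected dtypes can be written down and compared against the DataFrame. The sketch below is one possible approach: `EXPECTED_DTYPES` and `dtype_mismatches` are hypothetical names, and the small DataFrames stand in for the cleaned notebook data.

```python
import pandas as pd

# Assumed target dtypes for a subset of the cleaned columns
EXPECTED_DTYPES = {
    "age": "Int64",
    "salary": "float64",
    "is_full_time": "boolean",
    "city": "category",
}

def dtype_mismatches(df: pd.DataFrame, expected: dict) -> dict:
    """Return {column: actual dtype or 'missing'} for every deviation."""
    problems = {}
    for column, wanted in expected.items():
        if column not in df.columns:
            problems[column] = "missing"
        elif str(df[column].dtype) != wanted:
            problems[column] = str(df[column].dtype)
    return problems

clean = pd.DataFrame({
    "age": pd.array([28, None], dtype="Int64"),
    "salary": [40769.0, 99735.0],
    "is_full_time": pd.array([True, None], dtype="boolean"),
    "city": pd.Categorical(["Kolkata", "Pune"]),
})
# Same data, but with `age` stored as a plain int64 column
dirty = clean.assign(age=pd.Series([28, 41], dtype="int64"))

print(dtype_mismatches(clean, EXPECTED_DTYPES))
print(dtype_mismatches(dirty, EXPECTED_DTYPES))
```

An empty result means every listed column has the expected type; anything else names the column and what it actually contains.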