# **Data Cleaning**

# **9. Type Conversions & Consistency Checks**

In [69]:
import numpy as np
import pandas as pd 

## ✅ What Are Type Conversions & Consistency Checks?

In real-world data:

* You might have numbers stored as strings.
* Dates could be read as objects.
* Booleans might be `'Yes'/'No'` or `1/0`.
* Categorical data may not be explicitly typed.

To **make computations accurate, memory usage efficient, and validations easier**, you need to convert columns to appropriate data types and ensure consistency across them.

### 🔹 1. `astype()` – Convert to a Specific Data Type

#### ✅ Syntax:

```python
df['col'] = df['col'].astype('desired_dtype')
```

#### ✅ Common Conversions:

* String: `'str'`
* Integer: `'int'`
* Float: `'float'`
* Boolean: `'bool'`
* Categorical: `'category'`

In [70]:
df = pd.DataFrame({
    'id': ['001', '002', '003'],
    'age': ['25', '30', '45'],
    'salary': ['55000', '₹63000.5', 'Not Provided'],
    'join_date': ['2023-07-01', '2023/10/15', 'invalid_date'],
    'duration': ['1 days 12:00:00', '2 days 00:00:00', '5:30:00'],
    'department': ['HR', 'IT', 'Finance'],
    'active': ['Yes', 'No', 'Yes'],
    'is_member': [1, 0, 1]
})

df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000,2023-07-01,1 days 12:00:00,HR,Yes,1
1,2,30,₹63000.5,2023/10/15,2 days 00:00:00,IT,No,0
2,3,45,Not Provided,invalid_date,5:30:00,Finance,Yes,1


In [71]:
df.dtypes

id            object
age           object
salary        object
join_date     object
duration      object
department    object
active        object
is_member      int64
dtype: object

In [72]:
df['age'] = df['age'].astype(int)
df['is_member'] = df['is_member'].astype(bool)

df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,₹63000.5,2023/10/15,2 days 00:00:00,IT,No,False
2,3,45,Not Provided,invalid_date,5:30:00,Finance,Yes,True


In [73]:
df.dtypes

id            object
age            int32
salary        object
join_date     object
duration      object
department    object
active        object
is_member       bool
dtype: object

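#### 📌 **Real-World Use Case**:

You receive a survey dataset where numeric IDs are loaded as strings due to Excel formatting. Use `astype(int)` so that sorting and aggregation treat them as numbers. Note that this drops leading zeros (`'001'` becomes `1`), so keep IDs as strings when the zeros are significant.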

#### ✅ Why `astype()`?

It’s the **simplest and fastest method** when you’re confident of clean and convertible values. If any value cannot be converted, `astype()` raises a `ValueError` instead of skipping it.
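As a minimal sketch of that behavior, the snippet below uses a hypothetical `ages` Series (not part of the dataset above) to show `astype(int)` failing on a dirty value, followed by two alternatives: an explicit `'int64'` width for clean data, and the nullable `'Int64'` dtype when some values are missing.

```python
import pandas as pd

# Hypothetical column with one value that cannot be converted.
ages = pd.Series(['25', '30', 'unknown'])

try:
    ages.astype(int)
    failed = False
except ValueError as exc:
    # astype() is all-or-nothing: one bad value aborts the whole conversion.
    failed = True
    print(f"astype(int) failed: {exc}")

# Clean data converts directly; 'int64' pins the width on every platform.
clean_ages = pd.Series(['25', '30', '45']).astype('int64')

# For dirty data, coerce first, then use the nullable integer dtype
# so missing values do not force the column to float.
nullable_ages = pd.to_numeric(ages, errors='coerce').astype('Int64')
print(nullable_ages)
```

Asking for `'int64'` explicitly avoids the platform-dependent width that plain `int` can produce (the `int32` seen in the output above).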

### 🔹 2. `pd.to_numeric()` – Safely Convert to Numbers

#### ✅ Syntax:

```python
df['col'] = pd.to_numeric(df['col'], errors='coerce')
```

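* `errors='coerce'`: Non-convertible values are turned into `NaN`.
* `errors='ignore'`: Returns the input unchanged if any value is invalid. This option is deprecated in recent pandas versions; prefer `'coerce'` or catch the exception yourself.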


In [74]:
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,₹63000.5,2023/10/15,2 days 00:00:00,IT,No,False
2,3,45,Not Provided,invalid_date,5:30:00,Finance,Yes,True


In [75]:
df.dtypes

id            object
age            int32
salary        object
join_date     object
duration      object
department    object
active        object
is_member       bool
dtype: object

In [76]:
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000.0,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,,2023/10/15,2 days 00:00:00,IT,No,False
2,3,45,,invalid_date,5:30:00,Finance,Yes,True


In [77]:
df.dtypes

id             object
age             int32
salary        float64
join_date      object
duration       object
department     object
active         object
is_member        bool
dtype: object

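#### 📌 **Real-World Use Case**:

CSV import with currency values like `'₹100,000'` or corrupted data like `'N/A'`. With `errors='coerce'` both become `NaN`, as the `'₹63000.5'` row above shows, so strip currency symbols and thousands separators first if those values should be kept.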

#### ✅ Why `to_numeric()`?

Best for **robust parsing of mixed-type columns** (strings, numbers, etc.) and safe error handling.
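The salary example above loses `'₹63000.5'` because `to_numeric()` does not understand currency symbols. A small sketch of one way to keep such values, assuming the only noise is the `₹` sign, commas, and whitespace (the `raw_salary` Series is illustrative):

```python
import pandas as pd

# Illustrative values: plain number, currency-prefixed, comma-grouped, and text.
raw_salary = pd.Series(['55000', '₹63000.5', '₹1,00,000', 'Not Provided'])

# Remove the currency sign, thousands separators, and stray whitespace.
stripped = raw_salary.str.replace(r'[₹,\s]', '', regex=True)

# Anything still non-numeric (here 'NotProvided') becomes NaN.
salary = pd.to_numeric(stripped, errors='coerce')
print(salary)

# Count the values that coercion turned into NaN, to review them later.
lost = int((salary.isna() & raw_salary.notna()).sum())
print(f"values lost to coercion: {lost}")
```

Counting the coerced values is a cheap consistency check: a large number usually means the cleaning pattern is missing a case.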

### 🔹 3. `pd.to_datetime()` – Convert to DateTime

#### ✅ Syntax:

```python
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```


In [78]:
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000.0,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,,2023/10/15,2 days 00:00:00,IT,No,False
2,3,45,,invalid_date,5:30:00,Finance,Yes,True


In [79]:
df['join_date'] = pd.to_datetime(df['join_date'], format='ISO8601', errors='coerce')
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000.0,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,,2023-10-15,2 days 00:00:00,IT,No,False
2,3,45,,NaT,5:30:00,Finance,Yes,True


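#### 📌 **Real-World Use Case**:

Log data where dates arrive as `'2023/10/15'`, `'15-10-2023'`, etc. Use this method to **parse and normalize** them to the pandas datetime format.

#### ✅ Why?

It’s essential for **time-based indexing, filtering, and plotting**. It can handle inconsistent formats, but in pandas 2.x a column that mixes formats generally needs `format='mixed'` or one explicit format per pass, and day-first strings such as `'15-10-2023'` should be parsed with `dayfirst=True` or an explicit `format`.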
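One conservative way to normalize a column with several known layouts is to try each explicit format in turn and keep the first successful parse. This is a sketch under the assumption that the three formats listed are the only ones present; the `raw_dates` Series and the `formats` list are illustrative.

```python
import pandas as pd

# Illustrative log dates in three layouts plus one unparseable value.
raw_dates = pd.Series(['2023-07-01', '2023/10/15', '15-10-2023', 'invalid_date'])

# Formats to try, in priority order (an assumption about the data).
formats = ['%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y']

# Start with the first format, then fill the remaining gaps with later ones.
parsed = pd.to_datetime(raw_dates, format=formats[0], errors='coerce')
for fmt in formats[1:]:
    attempt = pd.to_datetime(raw_dates, format=fmt, errors='coerce')
    parsed = parsed.fillna(attempt)

print(parsed)
```

Listing the formats explicitly avoids guessing whether `'01-02-2023'` means January 2 or February 1; values that match none of the formats stay `NaT` and can be reviewed separately.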

### 🔹 4. `pd.to_timedelta()` – Convert Duration Strings to Timedelta

In [80]:
df['duration'] = pd.to_timedelta(df['duration'])
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000.0,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,,2023-10-15,2 days 00:00:00,IT,No,False
2,3,45,,NaT,0 days 05:30:00,Finance,Yes,True


In [81]:
df.dtypes

id                     object
age                     int32
salary                float64
join_date      datetime64[ns]
duration      timedelta64[ns]
department             object
active                 object
is_member                bool
dtype: object

#### 📌 **Real-World Use Case**:

When parsing logs or telemetry data with durations like `'2 days 5:00:00'`.
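Once durations are `timedelta64` values they support arithmetic. The sketch below, using an illustrative `raw_durations` Series, coerces an unparseable entry to `NaT` and converts the rest to seconds and hours.

```python
import pandas as pd

# Illustrative telemetry durations, one of them unparseable.
raw_durations = pd.Series(['1 days 12:00:00', '5:30:00', 'unknown'])

# errors='coerce' turns values that cannot be parsed into NaT.
durations = pd.to_timedelta(raw_durations, errors='coerce')

# Convert to plain numbers for aggregation or plotting.
seconds = durations.dt.total_seconds()
hours = seconds / 3600
print(hours)
```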

### 🔹 5. `astype('category')` – Convert to Categorical Type

In [82]:
df['department']

0         HR
1         IT
2    Finance
Name: department, dtype: object

In [83]:
df['department'] = df['department'].astype('category')
df['department']

0         HR
1         IT
2    Finance
Name: department, dtype: category
Categories (3, object): ['Finance', 'HR', 'IT']

#### 📌 **Real-World Use Case**:

Columns like `gender`, `department`, or `region` have **limited unique values**. Storing them as `category` reduces memory usage significantly and can improve groupby performance.

#### ✅ Why?

Each distinct value is stored once and every row holds only a small integer code, which is efficient for **classification models and analytics**.
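The memory effect is easy to measure with `memory_usage(deep=True)`. This sketch repeats the three department names many times to imitate a larger table; the row count is arbitrary.

```python
import pandas as pd

# 30,000 rows drawn from only three distinct labels.
departments = pd.Series(['HR', 'IT', 'Finance'] * 10_000)
as_category = departments.astype('category')

# deep=True counts the string payload, not just the pointers.
string_bytes = departments.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"strings:  {string_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

# The integer codes are what each row actually stores.
print(as_category.cat.codes.head(3).tolist())
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for low-cardinality columns.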

### 🔹 6. Converting Boolean-like Strings to `bool`

In [84]:
df

Unnamed: 0,id,age,salary,join_date,duration,department,active,is_member
0,1,25,55000.0,2023-07-01,1 days 12:00:00,HR,Yes,True
1,2,30,,2023-10-15,2 days 00:00:00,IT,No,False
2,3,45,,NaT,0 days 05:30:00,Finance,Yes,True


In [85]:
df['active'].map({'Yes': True, 'No': False})

0     True
1    False
2     True
Name: active, dtype: bool

In [86]:
df['active'].astype(bool)  # Pitfall: every non-empty string is truthy, so 'No' becomes True; use only on 0/1 data

0    True
1    True
2    True
Name: active, dtype: bool

#### 📌 **Real-World Use Case**:

Forms or survey tools export checkboxes as `'Yes'/'No'`.
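Exports like these often mix casing and stray spaces, and may contain answers outside the expected set. A minimal sketch, assuming only `'yes'` and `'no'` are valid answers (the `responses` Series is illustrative), normalizes the text, maps it, and uses the nullable `'boolean'` dtype so unmapped answers stay missing instead of silently becoming `True`.

```python
import pandas as pd

# Illustrative survey answers with inconsistent casing and whitespace.
responses = pd.Series([' Yes', 'no', 'YES ', 'maybe'])

# Normalize first so one mapping covers every spelling.
normalized = responses.str.strip().str.lower()

# Values not in the mapping become NaN; 'boolean' keeps them as <NA>.
flags = normalized.map({'yes': True, 'no': False}).astype('boolean')
print(flags)

# Report anything the mapping did not recognize.
unmapped = responses[flags.isna()].tolist()
print(f"unrecognized answers: {unmapped}")
```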

## 📋 Consistency Checks After Conversion

### ✅ Check Data Types

In [87]:
df.dtypes

id                     object
age                     int32
salary                float64
join_date      datetime64[ns]
duration      timedelta64[ns]
department           category
active                 object
is_member                bool
dtype: object
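Reading `df.dtypes` by eye works for a few columns; for larger tables the check can be automated. The sketch below compares each column against an expected type using the predicates in `pandas.api.types`. The `df_check` frame and the `expected` mapping are illustrative, and `active` is deliberately left as strings to show a mismatch being reported.

```python
import pandas as pd
from pandas.api import types as ptypes

# Small illustrative frame; 'active' was never converted to bool.
df_check = pd.DataFrame({
    'age': [25, 30, 45],
    'salary': [55000.0, None, None],
    'join_date': pd.to_datetime(['2023-07-01', '2023-10-15', None]),
    'active': ['Yes', 'No', 'Yes'],
})

# Expected type of each column, expressed as a dtype predicate.
expected = {
    'age': ptypes.is_integer_dtype,
    'salary': ptypes.is_float_dtype,
    'join_date': ptypes.is_datetime64_any_dtype,
    'active': ptypes.is_bool_dtype,
}

# Collect the columns whose dtype does not satisfy its predicate.
mismatched = [col for col, check in expected.items() if not check(df_check[col])]
print(mismatched)  # ['active']
```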

### ✅ Check Unique Values

Useful for spotting inconsistent casing or typos.

In [88]:
df['department'].unique()

['HR', 'IT', 'Finance']
Categories (3, object): ['Finance', 'HR', 'IT']

### ✅ Normalize Text Before Type Conversion

Strip whitespace and lowercase string columns before mapping or converting them:

```python
df['col'] = df['col'].str.strip().str.lower()
```
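To illustrate why this matters, the following sketch uses an illustrative `dept` Series in which the same two departments appear in five spellings; comparing the distinct values before and after normalization exposes the inconsistency.

```python
import pandas as pd

# Two real departments written five different ways.
dept = pd.Series(['HR', ' hr', 'Hr ', 'IT', 'it'])
print(dept.nunique())  # 5 distinct raw strings

# After stripping and lowercasing, only the real categories remain.
normalized = dept.str.strip().str.lower()
distinct = sorted(normalized.unique())
print(distinct)  # ['hr', 'it']
```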



## 🧠 Real-World Use Cases Summary

| Scenario                                 | Method                        | Why It Works                              |
| ---------------------------------------- | ----------------------------- | ----------------------------------------- |
| Salary loaded as strings                 | `pd.to_numeric()`             | Safely handles errors and corrupt strings |
| Joining on Date                          | `pd.to_datetime()`            | Ensures proper date comparison            |
| User responses in ‘Yes/No’               | `.map()`                      | Converts into binary logical form         |
| Survey column like “region”              | `astype('category')`          | Saves memory, speeds up groupby           |
| Mixed numeric + text (e.g. '500', 'abc') | `to_numeric(errors='coerce')` | Keeps good data, flags bad                |


## 🔚 Summary of Methods

| Method                  | Use                                    |
| ----------------------- | -------------------------------------- |
| `astype()`              | Direct conversion for clean data       |
| `pd.to_numeric()`       | Convert mixed string/number to numeric |
| `pd.to_datetime()`      | Handle and normalize dates             |
| `pd.to_timedelta()`     | Parse durations                        |
| `map()` or `.replace()` | String to bool or categorical          |
| `astype('category')`    | Save memory, speed up processing       |

