# **Data Cleaning**

## **3. Fixing Incorrect Data Types**

In [1]:
import numpy as np
import pandas as pd 

## 🔍 Why Fix Data Types?

Incorrect or inconsistent data types can cause:

* Errors in computation (e.g., strings instead of numbers)
* Problems during merging, grouping, or modeling
* Inefficient memory usage and slower performance
* Incompatibility with machine learning algorithms
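The first point is easy to see in a small sketch. The series below is a hypothetical example (not part of the dataset used later): with strings, `+` concatenates, and arithmetic only behaves as expected after conversion.

```python
import pandas as pd

# Hypothetical column where ages were read as text
ages_as_text = pd.Series(['25', '30', '35'], dtype=object)

# On strings, "+" concatenates instead of adding
concatenated = ages_as_text + ages_as_text

# After conversion, arithmetic works on the numeric values
ages = pd.to_numeric(ages_as_text)
doubled = ages + ages
mean_age = ages.mean()

print(concatenated.tolist())  # ['2525', '3030', '3535']
print(doubled.tolist())       # [50, 60, 70]
print(mean_age)               # 30.0
```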


### ✅ Real-Life Examples of Type Issues

| Field         | Incorrect Type          | Desired Type      |
| ------------- | ----------------------- | ----------------- |
| Age           | String (`"25"`)         | Integer (`25`)    |
| Date of Birth | Object (`'2023-07-01'`) | Datetime          |
| Salary        | Object (`'60K'`)        | Float (`60000.0`) |
| Category      | Object                  | Categorical       |


## 🛠️ Key Techniques for Fixing Data Types

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '35'],            # Strings instead of int
    'Salary': ['50000', '60000', '70000'],  # Strings instead of numbers
    'JoinDate': ['2022-01-15', '2021-11-10', '2023-03-05'],  # Strings, not datetime
    'Department': ['Sales', 'HR', 'Sales'] # Candidate for category
}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000,2022-01-15,Sales
1,Bob,30,60000,2021-11-10,HR
2,Charlie,35,70000,2023-03-05,Sales


In [4]:
df.dtypes

Name          object
Age           object
Salary        object
JoinDate      object
Department    object
dtype: object

## 🔹 **1. Converting a Column to Numeric**

#### ▶️ Method: `pd.to_numeric()`

In [5]:
df['Age'] = pd.to_numeric(df['Age'])
df['Salary'] = pd.to_numeric(df['Salary'])

In [6]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000,2022-01-15,Sales
1,Bob,30,60000,2021-11-10,HR
2,Charlie,35,70000,2023-03-05,Sales


In [7]:
df.dtypes

Name          object
Age            int64
Salary         int64
JoinDate      object
Department    object
dtype: object

#### ✅ Real-World Use Case:

Numeric columns in a CSV exported from Excel can load as strings, for example when values are quoted or mixed with text. To perform analytics, columns like `"Age"` and `"Salary"` must be converted to numbers.

🔹 *Why this method?*
It parses strings into int or float and has an `errors` option (e.g., `errors='coerce'`) so that invalid values do not abort the conversion.
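As a minimal sketch of that option, assume a hypothetical `raw_age` series containing a placeholder and a missing value. Coercion produces `NaN`, which forces a float result; the nullable `Int64` dtype is one way to get whole numbers back.

```python
import pandas as pd

# Hypothetical messy input: a text placeholder and a missing value
raw_age = pd.Series(['25', '30', 'unknown', None], dtype=object)

# Invalid entries become NaN; the result is float64 because NaN is a float
age_float = pd.to_numeric(raw_age, errors='coerce')

# Optional: the nullable integer dtype keeps whole numbers and shows <NA> for gaps
age_int = age_float.astype('Int64')

print(age_float.dtype)  # float64
print(age_int.dtype)    # Int64
```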

## 🔹 **2. Converting to Datetime**

#### ▶️ Method: `pd.to_datetime()`

In [8]:
df['JoinDate'] = pd.to_datetime(df['JoinDate'])

In [9]:
df.dtypes

Name                  object
Age                    int64
Salary                 int64
JoinDate      datetime64[ns]
Department            object
dtype: object

#### ✅ Real-World Use Case:

For **employee tenure analysis**, `JoinDate` must be datetime so that you can filter by year or sort chronologically.

🔹 *Why this method?*
Robust conversion with format inference and error handling (e.g., `errors='coerce'`).
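The sketch below shows both ideas on a hypothetical `join_raw` series: an explicit `format` plus `errors='coerce'` turns an unparseable entry into `NaT`, and the `.dt` accessor then supports the year filter mentioned in the use case.

```python
import pandas as pd

# Hypothetical input with one value that is not a date
join_raw = pd.Series(['2022-01-15', '2021-11-10', 'not a date'], dtype=object)

# An explicit format avoids ambiguous parsing; bad values become NaT
join_date = pd.to_datetime(join_raw, format='%Y-%m-%d', errors='coerce')

# Datetime columns expose the .dt accessor for filtering
joined_2022 = join_date[join_date.dt.year == 2022]

# min() skips NaT, so this is the earliest valid date
earliest = join_date.min()
```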


## 🔹 **3. Using `astype()` for Type Casting**

In [10]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000,2022-01-15,Sales
1,Bob,30,60000,2021-11-10,HR
2,Charlie,35,70000,2023-03-05,Sales


In [11]:
df.dtypes

Name                  object
Age                    int64
Salary                 int64
JoinDate      datetime64[ns]
Department            object
dtype: object

In [12]:
df['Age'] = df['Age'].astype(np.int16)
df['Salary'] = df['Salary'].astype(np.float32)
df['Department'] = df['Department'].astype('category')

In [13]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000.0,2022-01-15,Sales
1,Bob,30,60000.0,2021-11-10,HR
2,Charlie,35,70000.0,2023-03-05,Sales


In [14]:
df.dtypes

Name                  object
Age                    int16
Salary               float32
JoinDate      datetime64[ns]
Department          category
dtype: object

#### ✅ Real-World Use Cases:

* Reduce memory by converting `"Department"` from object to `"category"` in large datasets (e.g., millions of rows).
* Ensure `"Age"` and `"Salary"` are numeric before modeling.

🔹 *Why this method?*
Fast and explicit type conversion when data is already clean.
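To check the memory claim on your own data, one option is to compare `memory_usage(deep=True)` before and after the cast. The column below is a hypothetical stand-in for a large, low-cardinality field; actual byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical large column with only two distinct values
dept_object = pd.Series(['Sales', 'HR'] * 50_000, dtype=object)
dept_category = dept_object.astype('category')

# deep=True counts the Python string objects, not just the pointers to them
bytes_object = dept_object.memory_usage(deep=True)
bytes_category = dept_category.memory_usage(deep=True)

print(bytes_object, bytes_category)
```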


## 🔹 **4. Inferring Object Types Automatically**

#### ▶️ Method: `infer_objects()`

In [15]:
df.infer_objects()

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000.0,2022-01-15,Sales
1,Bob,30,60000.0,2021-11-10,HR
2,Charlie,35,70000.0,2023-03-05,Sales


In [16]:
df.infer_objects().dtypes

Name                  object
Age                    int16
Salary               float32
JoinDate      datetime64[ns]
Department          category
dtype: object

#### ✅ Use Case:

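You loaded data whose columns have `object` dtype but actually hold Python numbers (not text), and you want pandas to pick a better dtype. Note that `infer_objects()` does not parse strings: a column of values like `'25'` stays non-numeric, which is why the output above is unchanged. Use `pd.to_numeric()` or `pd.to_datetime()` for strings.

🔹 *Why this method?*
Useful as a quick first pass over `object` columns that hold non-string values.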
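A small sketch of what `infer_objects()` does and does not convert, using a hypothetical frame with two `object` columns (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: both columns start out as object dtype
mixed = pd.DataFrame({
    'py_ints': pd.Series([1, 2, 3], dtype=object),         # real Python ints
    'num_text': pd.Series(['1', '2', '3'], dtype=object),  # digits stored as text
})

inferred = mixed.infer_objects()

# 'py_ints' becomes int64; 'num_text' is not turned into a number
print(inferred.dtypes)
```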


## 🔹 **5. Parsing Numbers with Non-Numeric Characters**

In [17]:
df

Unnamed: 0,Name,Age,Salary,JoinDate,Department
0,Alice,25,50000.0,2022-01-15,Sales
1,Bob,30,60000.0,2021-11-10,HR
2,Charlie,35,70000.0,2023-03-05,Sales


```python
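# Cast to str so .str works even if the column is already numeric; regex=False treats '$' literally
df['Salary'] = df['Salary'].astype(str).str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)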
```

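Or using regex with `str.extract()`, which captures only the first run of digits (so `"$60,000"` would yield `60` unless separators are removed first):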

```python
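# Raw string avoids an invalid escape sequence; expand=False returns a Series
df['Salary'] = df['Salary'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)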
```

#### ✅ Real-World Use Case:

Financial systems often store `"Salary"` as `$60,000`; the currency symbol and thousands separator must be stripped before numeric conversion.

🔹 *Why this method?*
Cleaning before casting ensures no conversion errors.
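Because the `df` above already holds numeric salaries, here is a self-contained sketch on a hypothetical `salary_raw` series that actually contains currency formatting. It strips the characters literally and then lets `pd.to_numeric()` do the conversion.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from an export
salary_raw = pd.Series(['$60,000', '$1,200.50', '75000'], dtype=object)

# regex=False treats '$' and ',' as literal characters, not regex syntax
salary_clean = (
    salary_raw
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
)

# Anything still unparseable would become NaN instead of raising
salary = pd.to_numeric(salary_clean, errors='coerce')

print(salary.tolist())  # [60000.0, 1200.5, 75000.0]
```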


## 🔹 **6. Converting to Boolean**

```python
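# 'IsActive' is a hypothetical column; normalize case and whitespace so "Yes"/"No" also match
df['IsActive'] = df['IsActive'].str.strip().str.lower().map({'yes': True, 'no': False})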
```

#### ✅ Use Case:

In survey data, answers like “Yes”/“No” need to be converted into **boolean** to make them usable in models or filters.

🔹 *Why this method?*
Mapping string flags to `True`/`False` ensures consistency.
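A self-contained sketch of the same mapping, assuming a hypothetical `answers` series with inconsistent case, stray whitespace, a missing value, and an unexpected answer. Values outside the mapping become missing, and the nullable `boolean` dtype is one way to keep them as `<NA>` rather than falling back to `object`.

```python
import pandas as pd

# Hypothetical survey answers
answers = pd.Series(['Yes', 'no ', 'YES', None, 'maybe'], dtype=object)

# Normalize before mapping so 'Yes', 'YES', and 'no ' all match
normalized = answers.str.strip().str.lower()
is_active = normalized.map({'yes': True, 'no': False})

# Unmapped and missing values become <NA> under the nullable boolean dtype
is_active = is_active.astype('boolean')

print(is_active.tolist())
```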


## 🔹 **7. Handling Errors During Type Conversion**

```python
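# Assign the result back; invalid entries become NaN (numeric) or NaT (datetime)
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')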
```

#### ✅ Use Case:

Your `"Age"` column contains `"unknown"` or `"N/A"`: coerce such values to `NaN` (or `NaT` for dates) instead of failing.

🔹 *Why this method?*
Graceful fallback for cleaning large dirty datasets.
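Coercion is silent, so it can help to list which inputs were lost. The sketch below uses a hypothetical `age_raw` series and compares the missing-value masks before and after conversion.

```python
import pandas as pd

# Hypothetical input with placeholders and one genuinely missing value
age_raw = pd.Series(['25', 'unknown', '35', 'N/A', None], dtype=object)

age = pd.to_numeric(age_raw, errors='coerce')

# Entries that were present in the input but could not be parsed
failed = age_raw[age.isna() & age_raw.notna()]

print(failed.tolist())  # ['unknown', 'N/A']
```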


## 📌 Summary Table

| Technique                   | Purpose                               | Real-World Use Case                          |
| --------------------------- | ------------------------------------- | -------------------------------------------- |
| `pd.to_numeric()`           | Convert string to int/float           | Convert a salary field stored as text        |
| `pd.to_datetime()`          | Convert string to datetime            | Working with timestamps or event logs        |
| `astype()`                  | Cast dtype explicitly                 | Ensure numeric types before modeling         |
| `infer_objects()`           | Guess types from object columns       | General cleanup after import                 |
| `.str.replace() + astype()` | Strip characters and convert          | Handle values like "\$60,000" or "1,200"     |
| `str.extract()`             | Regex-based number extraction         | Pull numeric portion from noisy strings      |
| `map()`                     | Map to custom types (e.g. boolean)    | Convert "yes"/"no" to True/False             |
| `errors='coerce'`           | Handle invalid conversions gracefully | Set faulty values to NaN instead of crashing |


### 🧠 Best Practices

* Always inspect with `df.dtypes` or `df.info()` before cleaning.
* Be cautious when using `astype()`: it raises an error if the data isn't clean.
* Use `errors='coerce'` to avoid unexpected crashes, then check how many values became `NaN`/`NaT`.
* Convert objects to `category` for large low-cardinality columns to save memory.
