# **Data Cleaning**

## **11. Fixing Structural Errors in Data**

In [12]:
import numpy as np
import pandas as pd 

## 🔍 What Are Structural Errors?

Structural errors are **inconsistencies in the structure or representation** of data values. These typically arise from:

* Typos
* Irregular capitalization
* Extra whitespace
* Mixed formats (e.g., `500`, `'500'`)
* Inconsistent labeling (`"Male"`, `"male"`, `"M"`)
* Improper column names
* Incorrect delimiters or malformed strings

These errors **lead to duplicate categories or distorted statistics** and can corrupt data analysis or machine learning models.
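
A quick way to see the effect described above is to count categories before and after a rough normalization. The snippet below is a minimal sketch on a made-up `labels` Series (not part of the dataset used later); the names `raw_counts` and `normalized_counts` are illustrative.

```python
import pandas as pd

# Hypothetical column with the kinds of inconsistencies listed above
labels = pd.Series(['Male', 'male', 'M', ' Male', 'Female', 'female'])

# Every distinct spelling shows up as its own category
raw_counts = labels.value_counts()

# A rough normalization (strip + lowercase) shows how many categories remain
normalized_counts = labels.str.strip().str.lower().value_counts()

print(len(raw_counts), "raw categories")
print(len(normalized_counts), "categories after strip + lowercase")
```

Abbreviations such as `'M'` survive this step as their own category, which is why an explicit mapping is still needed afterwards.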

---

## 🧠 Common Structural Error Types & Fixing Techniques

| Structural Error Type       | Fixing Method              | Example Technique              |
| --------------------------- | -------------------------- | ------------------------------ |
| Inconsistent Capitalization | Convert case               | `.str.lower()`, `.str.title()` |
| Leading/Trailing Spaces     | Strip whitespace           | `.str.strip()`                 |
| Typos or Misspelled Entries | Standardize or fuzzy match | `.replace()`, `fuzzywuzzy`     |
| Mixed Format Values         | Convert to standard type   | `.astype()`                    |
| Inconsistent Labels         | Replace or map             | `.replace()`, `.map()`         |
| Improper Column Names       | Rename                     | `df.rename()`, `.str.strip()`  |


In [13]:
df = pd.DataFrame({
    'Gender': ['Male', 'male', 'FEMALE', 'female', 'F', 'M'],
    'Product': ['Mobile', 'mobile ', 'MOBILE', 'Laptop', 'laptop', 'Lap Top'],
    'Price': ['500', 600, ' 700', 800, 'Eight Hundred', 900],
    'Customer Type': ['Regular', ' Regular', 'VIP', 'vip ', 'V.I.P', 'regular']
})

df

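|   | Gender | Product | Price         | Customer Type |
| - | ------ | ------- | ------------- | ------------- |
| 0 | Male   | Mobile  | 500           | Regular       |
| 1 | male   | mobile  | 600           | Regular       |
| 2 | FEMALE | MOBILE  | 700           | VIP           |
| 3 | female | Laptop  | 800           | vip           |
| 4 | F      | laptop  | Eight Hundred | V.I.P         |
| 5 | M      | Lap Top | 900           | regular       |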


## 🔧 1. Fixing Capitalization

In [14]:
df['Gender'] = df['Gender'].str.lower()
df['Product'] = df['Product'].str.title()

df

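|   | Gender | Product | Price         | Customer Type |
| - | ------ | ------- | ------------- | ------------- |
| 0 | male   | Mobile  | 500           | Regular       |
| 1 | male   | Mobile  | 600           | Regular       |
| 2 | female | Mobile  | 700           | VIP           |
| 3 | female | Laptop  | 800           | vip           |
| 4 | f      | Laptop  | Eight Hundred | V.I.P         |
| 5 | m      | Lap Top | 900           | regular       |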


🧠 **Why?**
Case normalization collapses values like `'mobile'`, `'Mobile'`, and `'MOBILE'` into a single representation (`'Mobile'`), so they are counted as one category.
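
To make the effect measurable, the number of distinct categories can be compared before and after case normalization. This is a small sketch on a made-up `products` Series; it also shows one caveat of `.str.title()` that is worth checking on real data.

```python
import pandas as pd

products = pd.Series(['Mobile', 'mobile', 'MOBILE', 'Laptop', 'laptop'])

# Distinct categories before and after normalizing case
before = products.nunique()
after = products.str.title().nunique()
print(before, after)

# Caveat: title case rewrites every word, including acronyms
acronym = pd.Series(['USB cable']).str.title().iloc[0]
print(acronym)
```

If a column contains acronyms or brand names, `.str.lower()` is often the safer choice for a comparison key.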

## 🔧 2. Removing Extra Whitespace

In [15]:
df.columns

Index(['Gender', 'Product', 'Price', 'Customer Type'], dtype='object')

In [16]:
df['Product'] = df['Product'].str.strip()
df['Customer Type'] = df['Customer Type'].str.strip()

df

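|   | Gender | Product | Price         | Customer Type |
| - | ------ | ------- | ------------- | ------------- |
| 0 | male   | Mobile  | 500           | Regular       |
| 1 | male   | Mobile  | 600           | Regular       |
| 2 | female | Mobile  | 700           | VIP           |
| 3 | female | Laptop  | 800           | vip           |
| 4 | f      | Laptop  | Eight Hundred | V.I.P         |
| 5 | m      | Lap Top | 900           | regular       |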


In [17]:
df.columns

Index(['Gender', 'Product', 'Price', 'Customer Type'], dtype='object')

🧠 **Why?**
Removes leading and trailing spaces from values such as `'Mobile '` or `' Regular'`, which would otherwise be counted as separate categories.
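
When several text columns need the same treatment, the strip step can be applied in a loop. The sketch below uses a made-up frame `df_ws` and an explicit, assumed list of text columns. It also collapses repeated internal spaces, which `.str.strip()` alone does not touch.

```python
import pandas as pd

df_ws = pd.DataFrame({
    'Product': ['Mobile ', ' Laptop', 'Lap  Top'],
    'Customer Type': [' Regular', 'VIP ', 'Regular'],
    'Price': [500, 600, 700],
})

# Columns assumed to hold text; numeric columns are left alone
text_cols = ['Product', 'Customer Type']

# Strip leading/trailing whitespace in every text column
for col in text_cols:
    df_ws[col] = df_ws[col].str.strip()

# Collapse runs of internal whitespace to a single space
df_ws['Product'] = df_ws['Product'].str.replace(r'\s+', ' ', regex=True)

print(df_ws)
```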

## 🔧 3. Correcting Typos / Inconsistent Labels

In [18]:
df['Product'] = df['Product'].replace({
    'Lap Top': 'Laptop'
})

df['Gender'] = df['Gender'].replace({
    'm': 'male', 'f': 'female'
})

df

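|   | Gender | Product | Price         | Customer Type |
| - | ------ | ------- | ------------- | ------------- |
| 0 | male   | Mobile  | 500           | Regular       |
| 1 | male   | Mobile  | 600           | Regular       |
| 2 | female | Mobile  | 700           | VIP           |
| 3 | female | Laptop  | 800           | vip           |
| 4 | female | Laptop  | Eight Hundred | V.I.P         |
| 5 | male   | Laptop  | 900           | regular       |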


🧠 **Why?**
Prevents misclassification and misleading counts. For example, `Lap Top` and `Laptop` refer to the same product and should be a single category.
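
After applying a replacement dictionary, it can help to check whether any values still fall outside the expected set. The following is a sketch under the assumption that the valid labels are known in advance; the `allowed` set and the sample `gender` Series are hypothetical.

```python
import pandas as pd

gender = pd.Series(['male', 'female', 'f', 'm', 'mle'])

# Known corrections; anything not listed is left unchanged by replace()
gender = gender.replace({'m': 'male', 'f': 'female'})

# Hypothetical whitelist of valid labels for this column
allowed = {'male', 'female'}

# Values still outside the whitelist need a new mapping entry
leftover = sorted(set(gender.unique()) - allowed)
print(leftover)
```

An empty `leftover` list suggests the mapping covers every variant present in the column.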

## 🔧 4. Converting Mixed Formats (e.g., numeric values stored as strings)

In [20]:
# Replace string-based numbers like 'Eight Hundred' manually or using mapping
df['Price'] = df['Price'].replace({'Eight Hundred': 800})
# Remove spaces and convert to integer
df['Price'] = df['Price'].astype(str).str.strip().astype(int)


df

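|   | Gender | Product | Price | Customer Type |
| - | ------ | ------- | ----- | ------------- |
| 0 | male   | Mobile  | 500   | Regular       |
| 1 | male   | Mobile  | 600   | Regular       |
| 2 | female | Mobile  | 700   | VIP           |
| 3 | female | Laptop  | 800   | vip           |
| 4 | female | Laptop  | 800   | V.I.P         |
| 5 | male   | Laptop  | 900   | regular       |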


🧠 **Why?**
Ensures the column has a single numeric data type, which arithmetic, aggregation, and modeling all depend on.
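
Calling `.astype(int)` raises an error on the first value it cannot parse, so it helps to find those values beforehand. One possible approach, sketched here on a standalone copy of the price data, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` so they can be listed.

```python
import pandas as pd

price = pd.Series(['500', 600, ' 700', 800, 'Eight Hundred', 900])

# Normalize to stripped strings so every entry is parsed the same way
cleaned = price.astype(str).str.strip()

# Entries that cannot be parsed as numbers become NaN instead of raising
parsed = pd.to_numeric(cleaned, errors='coerce')

# Original values that failed to parse and need a manual mapping
unparsed = price[parsed.isna()].tolist()
print(unparsed)
```

The values in `unparsed` are the ones that need an explicit entry in a replacement dictionary before the final type conversion.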

## 🔧 5. Standardizing Categories

In [22]:
# Lowercase first, then fold the remaining variant spelling into 'vip'
df['Customer Type'] = df['Customer Type'].str.lower().replace({
    'v.i.p': 'vip'
})

df

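|   | Gender | Product | Price | Customer Type |
| - | ------ | ------- | ----- | ------------- |
| 0 | male   | Mobile  | 500   | regular       |
| 1 | male   | Mobile  | 600   | regular       |
| 2 | female | Mobile  | 700   | vip           |
| 3 | female | Laptop  | 800   | vip           |
| 4 | female | Laptop  | 800   | vip           |
| 5 | male   | Laptop  | 900   | regular       |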


🧠 **Why?**
Combines `vip`, `V.I.P`, `VIP` into one standardized form.
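
`.replace()` and `.map()` treat values that are missing from the dictionary differently, which matters when choosing between them. The sketch below uses a made-up `ctype` Series with one unexpected label (`'gold'`) to show the difference.

```python
import pandas as pd

ctype = pd.Series(['regular', 'vip', 'v.i.p', 'gold'])

# Full mapping of every label that is expected in this column
full_mapping = {'regular': 'regular', 'vip': 'vip', 'v.i.p': 'vip'}

# replace(): labels without a mapping entry are kept as they are
replaced = ctype.replace(full_mapping)

# map(): labels without a mapping entry become NaN
mapped = ctype.map(full_mapping)

print(replaced.tolist())
print(mapped.tolist())
```

`.map()` with a complete dictionary is therefore a simple way to surface unexpected labels, while `.replace()` is more convenient when only a few variants need correcting.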

## 🔧 6. Renaming Columns (Structural Fix)

In [24]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

df

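|   | gender | product | price | customer_type |
| - | ------ | ------- | ----- | ------------- |
| 0 | male   | Mobile  | 500   | regular       |
| 1 | male   | Mobile  | 600   | regular       |
| 2 | female | Mobile  | 700   | vip           |
| 3 | female | Laptop  | 800   | vip           |
| 4 | female | Laptop  | 800   | vip           |
| 5 | male   | Laptop  | 900   | regular       |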


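🧠 **Why?**
Lowercase, underscore-separated column names are easier to type and reference consistently, and they can be used with attribute access (for example `df.customer_type`). This also avoids lookup errors in ML pipelines caused by stray spaces or inconsistent case.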
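
Column names from real sources often contain hyphens or repeated spaces as well, which a single `.str.replace(' ', '_')` does not handle. A minimal sketch of a more general cleaner follows; `to_snake_case` is a hypothetical helper and `df_cols` is a made-up empty frame.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: trim, lowercase, and join words with underscores."""
    name = name.strip().lower()
    # Replace any run of non-alphanumeric characters with a single underscore
    name = re.sub(r'[^0-9a-z]+', '_', name)
    return name.strip('_')

df_cols = pd.DataFrame(columns=[' Customer  Type ', 'Unit-Price', 'Gender'])

# rename() accepts a function that is applied to each column name
df_cols = df_cols.rename(columns=to_snake_case)
print(list(df_cols.columns))
```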

## 🔧 7. Fuzzy Matching (Advanced)

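For large datasets with many misspellings (e.g., `"Laptoop"`, `"Lap Top"`, `"Lap-top"`), writing every correction by hand does not scale. A fuzzy-matching library such as `fuzzywuzzy` (since renamed `thefuzz`) can score each value against a list of known-good choices: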


In [26]:
from fuzzywuzzy import process

choices = df['product'].unique()
print(f"Choices: {choices}")

process.extractOne('Laptoop', choices)

Choices: ['Mobile' 'Laptop']


('Laptop', 92)

🧠 **Why?**
Helpful when **manual mapping is impractical**, e.g., thousands of unique values.
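
To apply fuzzy matching to a whole column, each value can be replaced by its closest canonical label when the similarity is high enough. The sketch below swaps in the standard library's `difflib.get_close_matches` instead of `fuzzywuzzy` so that it runs without extra dependencies. Its similarity scores are on a 0 to 1 scale and will not match `fuzzywuzzy`'s numbers. The `canonical` list, the `closest` helper, and the `0.8` cutoff are assumptions for illustration.

```python
import difflib
import pandas as pd

canonical = ['Mobile', 'Laptop']  # assumed list of valid product names
raw = pd.Series(['Laptoop', 'Lap-top', 'Mobille', 'Tablet'])

def closest(value, choices, cutoff=0.8):
    """Return the best match at or above the cutoff, else the value unchanged."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

# Values with no sufficiently close match (here 'Tablet') are left as they are
fixed = raw.map(lambda v: closest(v, canonical))
print(fixed.tolist())
```

The cutoff controls the trade-off: a lower value corrects more typos but risks merging labels that are genuinely different, so the results are worth reviewing before overwriting the original column.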

In [28]:
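# Sanity check: the column was already cleaned in steps 1-3
print(df['product'].value_counts())

# Re-applying the same pipeline is idempotent, so the counts do not change
df['product'] = df['product'].str.strip().str.title().replace({'Lap Top': 'Laptop'})
print(df['product'].value_counts())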

product
Mobile    3
Laptop    3
Name: count, dtype: int64
product
Mobile    3
Laptop    3
Name: count, dtype: int64


## ✅ Summary Table of Techniques

| Task                    | Method                         | Use Case                           |
| ----------------------- | ------------------------------ | ---------------------------------- |
| Lowercase or title case | `.str.lower()`, `.str.title()` | Fix capitalization issues          |
| Remove spaces           | `.str.strip()`                 | Remove trailing/leading whitespace |
| Fix typos               | `.replace()`, `fuzzywuzzy`     | Inconsistent spelling              |
| Convert mixed types     | `.astype()`                    | Strings vs numbers                 |
| Combine categories      | `.map()`, `.replace()`         | Normalize labels                   |
| Rename columns          | `.rename()`, `.columns = ...`  | Improve structure                  |



## 📦 Real-Life Use Cases

| Domain                    | Problem                           | Structural Fix                |
| ------------------------- | --------------------------------- | ----------------------------- |
| E-commerce                | `Mobile`, `MOBILE`, `mobile`      | `.str.title()`                |
| HR                        | `M`, `Male`, `male`               | `.str.lower()` + `.replace()` |
| Healthcare                | `V.I.P`, `vip`, `VIP `            | `.str.lower()` + `.str.strip()` + `.replace()` |
| Banking                   | `' 500'`, `'Five Hundred'`, `500` | `.replace()` + `.astype()`    |
| Data ingestion from forms | Columns like `'Customer Type '`   | `.columns.str.strip()`        |

