# **Data Cleaning**

# **12. Removing or Treating Noise in Pandas**

In [1]:
import numpy as np
import pandas as pd 

Data noise refers to meaningless or inconsistent data that can obscure patterns or lead to incorrect analysis. Common types of noise include:

* **Spelling inconsistencies** in categorical labels
* **Rare categories** with very low frequencies
* **Data entry errors, typos, or format mismatches**

We'll now cover:

1. **Spelling Correction using `fuzzywuzzy` or `difflib`**
2. **Frequency Filtering of Rare Categories**

## 🔹 1. Spelling Correction Using `fuzzywuzzy` or `difflib`

### 🔧 Techniques

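* `fuzzywuzzy.process.extractOne()` – Returns the best match from a list of choices as a `(match, score)` tuple, where the score ranges from 0 to 100.
* `difflib.get_close_matches()` – Returns a list of up to `n` candidates whose similarity ratio (0 to 1) is at least `cutoff`, best match first.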

### ✅ Use Case:

Suppose a column like `"department"` contains spelling variations:

In [2]:
df = pd.DataFrame({
    'department': ['Finance', 'finanace', 'Fiance', 'HR', 'H.R.', 'Human Resources', 'IT', 'I.T']
})

df

|   | department      |
| - | --------------- |
| 0 | Finance         |
| 1 | finanace        |
| 2 | Fiance          |
| 3 | HR              |
| 4 | H.R.            |
| 5 | Human Resources |
| 6 | IT              |
| 7 | I.T             |


### 🎯 Goal:

Normalize all department names to consistent values like `['Finance', 'HR', 'IT']`.


### 📌 Using `fuzzywuzzy`

In [5]:
from fuzzywuzzy import process

# Define standard set
standard_departments = ['Finance', 'HR', 'IT']

# Function to correct spelling
def correct_dept(dept):
    best_match = process.extractOne(dept, standard_departments)
    print(f"Best match: {best_match}")
    return best_match[0] if best_match[1] >= 80 else dept

df['department'].apply(correct_dept)

Best match: ('Finance', 100)
Best match: ('Finance', 93)
Best match: ('Finance', 92)
Best match: ('HR', 100)
Best match: ('HR', 80)
Best match: ('HR', 45)
Best match: ('IT', 100)
Best match: ('IT', 80)


0            Finance
1            Finance
2            Finance
3                 HR
4                 HR
5    Human Resources
6                 IT
7                 IT
Name: department, dtype: object

### ✅ Why `fuzzywuzzy`?

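* Well suited to messy, human-entered categories.
* Scores each candidate from 0 to 100 using edit-distance-based similarity, so it handles typos and keyboard slips well.
* It does not recognize synonyms: in the output above, `'Human Resources'` scores only 45 against `'HR'`, falls below the threshold of 80, and is left unchanged.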
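
The score of 45 for `'Human Resources'` in the output above suggests that string similarity alone will not map a spelled-out name to its abbreviation. One possible workaround is an explicit alias table applied before (or instead of) fuzzy matching. The sketch below is only an illustration: the `ALIASES` dictionary and the `apply_aliases` helper are hypothetical names, and the alias entries are assumptions about which variants should collapse together.

```python
import pandas as pd

# Hypothetical alias table: known variants mapped to canonical labels.
# Keys are lowercase so the lookup can ignore case.
ALIASES = {
    "human resources": "HR",
    "h.r.": "HR",
    "i.t": "IT",
}

def apply_aliases(value: str) -> str:
    """Return the canonical label for a known variant, else the value unchanged."""
    key = value.strip().lower()
    return ALIASES.get(key, value)

departments = pd.Series(["Finance", "H.R.", "Human Resources", "I.T"])
resolved = departments.map(apply_aliases)
print(resolved.tolist())
```

Values that are not in the alias table pass through untouched, so they can still be handed to a fuzzy matcher afterwards.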


### 📌 Using `difflib`

In [6]:
import difflib

In [7]:
def get_best_match(word, valid_list):
    matches = difflib.get_close_matches(word, valid_list, n=1, cutoff=0.7)
    print(f"Matches: {matches}")
    return matches[0] if matches else word

df['cleaned_dept'] = df['department'].apply(lambda x: get_best_match(x, standard_departments))
df

Matches: ['Finance']
Matches: ['Finance']
Matches: ['Finance']
Matches: ['HR']
Matches: []
Matches: []
Matches: ['IT']
Matches: ['IT']


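|   | department      | cleaned_dept    |
| - | --------------- | --------------- |
| 0 | Finance         | Finance         |
| 1 | finanace        | Finance         |
| 2 | Fiance          | Finance         |
| 3 | HR              | HR              |
| 4 | H.R.            | H.R.            |
| 5 | Human Resources | Human Resources |
| 6 | IT              | IT              |
| 7 | I.T             | IT              |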


### ✅ Why `difflib`?

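* Built in (part of the standard library, no external dependencies).
* Good for simpler fuzzy matching scenarios.
* Stricter here than the `fuzzywuzzy` example: with `cutoff=0.7`, `'H.R.'` and `'Human Resources'` find no match and keep their original values.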
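
To see why `'H.R.'` returned an empty match list while `'I.T'` matched, it can help to compute the similarity ratio directly. `difflib.SequenceMatcher.ratio()` returns `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The sketch below also shows one possible fix, stripping punctuation before matching; the `similarity` and `strip_punctuation` helpers are illustrative names, not part of `difflib`.

```python
import re
from difflib import SequenceMatcher, get_close_matches

standard_departments = ["Finance", "HR", "IT"]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], the measure get_close_matches relies on."""
    return SequenceMatcher(None, a, b).ratio()

def strip_punctuation(text: str) -> str:
    """Remove everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text)

hr_raw = similarity("H.R.", "HR")  # 2 matching chars, 6 total: 2*2/6, below 0.7
it_raw = similarity("I.T", "IT")   # 2 matching chars, 5 total: 2*2/5, above 0.7

cleaned = strip_punctuation("H.R.")  # becomes "HR"
matches = get_close_matches(cleaned, standard_departments, n=1, cutoff=0.7)

print(round(hr_raw, 3), round(it_raw, 3), cleaned, matches)
```

Lowering `cutoff` would also let `'H.R.'` through, but at the cost of accepting more false matches; cleaning the input first keeps the threshold strict.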

## 🔹 2. Frequency Filtering of Rare Categories

### 🔧 Technique

* Count category frequency and filter out or group rare categories.
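
As a sketch of the "filter out" option, rows belonging to rare categories can be dropped rather than relabeled. The small DataFrame and the `min_count` value below are made up for illustration.

```python
import pandas as pd

# Illustrative data: 'Pen' and 'Toy' each appear only once.
df_demo = pd.DataFrame(
    {"product_category": ["Book", "Pen", "Book", "Toy", "Shoes", "Shoes"]}
)

min_count = 2  # assumed threshold for this example
counts = df_demo["product_category"].value_counts()

# Labels that occur at least min_count times.
frequent = counts[counts >= min_count].index

# Keep only rows whose category is frequent enough.
kept = df_demo[df_demo["product_category"].isin(frequent)]
print(kept)
```

Dropping rows discards data, so grouping into a catch-all label is often the gentler choice when every row matters.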


### ✅ Use Case:

For a product classification model, suppose we have:

In [8]:
df = pd.DataFrame({
    'product_category': ['Book', 'Electronics', 'Pen', 'Book', 'Laptop', 'Shoes', 'Book', 'Shoes', 'Laptop', 'Toy']
})

df

|   | product_category |
| - | ---------------- |
| 0 | Book             |
| 1 | Electronics      |
| 2 | Pen              |
| 3 | Book             |
| 4 | Laptop           |
| 5 | Shoes            |
| 6 | Book             |
| 7 | Shoes            |
| 8 | Laptop           |
| 9 | Toy              |


Categories such as `'Electronics'`, `'Pen'`, and `'Toy'` appear only once each, which may be too few examples for a model to learn from.

### 📌 Method: Replace infrequent labels

In [11]:
threshold = 2
value_counts = df['product_category'].value_counts()
value_counts

product_category
Book           3
Laptop         2
Shoes          2
Electronics    1
Pen            1
Toy            1
Name: count, dtype: int64

In [14]:
df['filtered_category'] = df['product_category'].apply(lambda x: x if value_counts[x] >= threshold else 'Other')
df

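|   | product_category | filtered_category |
| - | ---------------- | ----------------- |
| 0 | Book             | Book              |
| 1 | Electronics      | Other             |
| 2 | Pen              | Other             |
| 3 | Book             | Book              |
| 4 | Laptop           | Laptop            |
| 5 | Shoes            | Shoes             |
| 6 | Book             | Book              |
| 7 | Shoes            | Shoes             |
| 8 | Laptop           | Laptop            |
| 9 | Toy              | Other             |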


### ✅ Why this method?

* Can reduce overfitting: a model given only one or two examples of a category tends to memorize them rather than generalize.
* Simplifies visualizations (e.g., pie charts) by merging many small slices into one.
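
The `apply` with a lambda used above works well on small data. A vectorized variant, sketched here with `Series.map` and `Series.where`, should give the same result and may scale better on large columns. The column and variable names mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "product_category": ["Book", "Electronics", "Pen", "Book", "Laptop",
                         "Shoes", "Book", "Shoes", "Laptop", "Toy"]
})

threshold = 2

# Look up each row's category frequency in one vectorized step.
row_counts = df["product_category"].map(df["product_category"].value_counts())

# Keep the label where it is frequent enough; otherwise substitute 'Other'.
df["filtered_category"] = df["product_category"].where(row_counts >= threshold, "Other")
print(df)
```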

## 🧠 Summary: Choosing the Right Technique

| **Technique**                 | **Use Case**                                               | **Why Use It**                     |
| ----------------------------- | ---------------------------------------------------------- | ---------------------------------- |
| `fuzzywuzzy` or `difflib`     | Normalize inconsistent spellings in categorical columns    | Handles typos, variations          |
| Frequency filtering           | Reduce noise in modeling or visuals due to rare categories | Improves generalization & clarity  |
| Combined (Normalize + Filter) | Clean and group data for better modeling                   | Best practice in many ML pipelines |
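
The combined approach in the last row can be sketched as two steps: normalize spellings first, then group whatever is still rare. Order matters, because misspelled variants would otherwise be counted as separate rare categories. The data, the `CANONICAL` list, and the helper names `normalize` and `group_rare` are all assumptions made for this illustration.

```python
import difflib
import pandas as pd

# Assumed canonical labels; lookup keys are lowercase for case-insensitive matching.
CANONICAL = ["Book", "Laptop", "Shoes", "Pen"]
LOOKUP = {name.lower(): name for name in CANONICAL}

def normalize(label: str) -> str:
    """Map a label to its closest canonical spelling, or return it unchanged."""
    hits = difflib.get_close_matches(
        label.strip().lower(), list(LOOKUP), n=1, cutoff=0.7
    )
    return LOOKUP[hits[0]] if hits else label

def group_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace labels occurring fewer than min_count times with a catch-all label."""
    row_counts = series.map(series.value_counts())
    return series.where(row_counts >= min_count, other)

raw = pd.Series(["Book", "book", "Bok", "Laptop", "laptop ", "Pen", "Shoes", "Shoes"])
cleaned = group_rare(raw.map(normalize), min_count=2)
print(cleaned.tolist())
```

After normalization, `'Book'` appears three times instead of once, so only `'Pen'` ends up in `'Other'`.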

