## Identifying Data Quality Issues

Data cleaning is a core part of data science and is commonly estimated to take 60–80% of project time. Before fixing data, you must first identify what’s wrong using a structured approach.

### Why Data Becomes “Dirty”

Data issues arise from:

- Human error
- System glitches
- Data integration problems
- Real-world inconsistencies

In [11]:
# imports 
import pandas as pd 

In [12]:
# load data 

In [13]:
# check shape

In [14]:
# check info
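
The cells above could be filled in along the following lines. No dataset is named in this section, so the sketch builds a small hypothetical `df` in memory; the column names and the `customers.csv` file mentioned in the comment are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data standing in for a real file. With a real dataset
# you would typically call something like pd.read_csv("customers.csv").
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "ben", None],
    "Age": ["34", "41", "41", "twenty"],
    "Salary": [52000.0, None, 61000.0, 48000.0],
})

# shape is a (rows, columns) tuple
print(df.shape)

# info() prints column names, non-null counts, and dtypes in one view
df.info()
```

Even on this tiny frame, `info()` already hints at two problems: `Age` is stored as text, and `Name` and `Salary` have fewer non-null values than there are rows.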

### The Main Categories of Data Quality Issues

#### 1. Missing Values

Missing values show up as empty cells, `NaN`, `NULL`, or placeholder values such as `"N/A"` or `999`.

**Problem:**  
Missing values can break calculations, distort averages, and introduce bias into analysis if not handled properly.



In [15]:
# how to check for missing values 
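
A minimal sketch of a missing-value check, using a hypothetical frame. Which values count as placeholders (here `"N/A"` and `999`) is an assumption that depends on the dataset; pandas only counts them once they are converted to `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing real NaNs and placeholder values
df = pd.DataFrame({
    "age": [25, np.nan, 999, 40],
    "city": ["Lagos", "N/A", None, "Accra"],
})

# Count of true missing values per column (placeholders are NOT counted yet)
print(df.isnull().sum())

# Convert the assumed placeholders to NaN so they become visible
df_marked = df.copy()
df_marked["age"] = df_marked["age"].replace(999, np.nan)
df_marked["city"] = df_marked["city"].replace("N/A", np.nan)

missing_count = df_marked.isnull().sum()
missing_pct = df_marked.isnull().mean() * 100  # share of missing values, in percent
print(missing_pct)
```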



#### 2. Duplicate Records

Exact duplicates or near-duplicate records with slight variations.

**Problem:**  
Duplicates inflate counts, distort summary statistics, and can lead to double-counting in business metrics.


In [16]:
# example check for duplicates 
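
One way to check for both kinds of duplicates is sketched below. The frame and its column names are hypothetical, and treating emails as equal after trimming and lowercasing is an assumed matching rule.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "B@X.com "],
})

# Exact duplicates: rows identical across every column
exact_dupes = df.duplicated().sum()

# keep=False shows every copy of a duplicated row, not only the later ones
print(df[df.duplicated(keep=False)])

# Near-duplicates: normalise the key first, then check again
email_norm = df["email"].str.strip().str.lower()
near_dupes = email_norm.duplicated().sum()
```

The exact check finds only the repeated row, while the normalised check also catches the entry that differs by casing and trailing whitespace.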

#### 3. Inconsistent Formatting

Different formats used for the same type of data, such as:
- Multiple date formats
- Inconsistent name casing
- Different phone number formats

**Problem:**  
Prevents accurate grouping, sorting, filtering, and matching of records.


In [17]:
# example  check for inconsistent Formats 
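
A possible check for inconsistent formats, on hypothetical data. The expected date pattern (ISO `YYYY-MM-DD`) is an assumption; substitute whatever format the dataset is supposed to follow.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "ALICE", "alice ", "Bob"],
    "signup_date": ["2024-01-05", "05/01/2024", "Jan 5, 2024", "2024-02-10"],
})

# value_counts() exposes casing and whitespace variants of the same value
print(df["name"].value_counts())

# Compare the number of distinct values before and after normalisation
raw_unique = df["name"].nunique()
clean_unique = df["name"].str.strip().str.lower().nunique()

# Flag dates that do not match the expected ISO pattern
is_iso = df["signup_date"].str.fullmatch(r"\d{4}-\d{2}-\d{2}")
non_iso_dates = df.loc[~is_iso, "signup_date"].tolist()
```

A large drop from `raw_unique` to `clean_unique` is a sign that the column mixes formatting variants of the same underlying values.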

#### 4. Invalid Data Types

- Numbers stored as text
- Dates stored as strings
- Units embedded in values (e.g., `"25 years"`)

**Problem:**  
Prevents mathematical operations, proper sorting, and accurate analysis.



In [18]:
# example 
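
A sketch of how invalid types might be detected, using hypothetical columns. The idea is to attempt the conversion with `errors="coerce"` and inspect whatever fails to parse.

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["10.5", "20", "abc"],
    "age": ["25 years", "31 years", "40"],
    "joined": ["2024-01-05", "2024-02-30", "2024-03-01"],
})

print(df.dtypes)  # every column was loaded as text

# errors="coerce" turns unparseable entries into NaN/NaT instead of raising
price_num = pd.to_numeric(df["price"], errors="coerce")
bad_prices = df.loc[price_num.isna(), "price"].tolist()

# Pull the leading digits out of values that carry an embedded unit
age_num = pd.to_numeric(df["age"].str.extract(r"(\d+)")[0], errors="coerce")

# An impossible calendar date (30 February) fails to parse and becomes NaT
joined_dt = pd.to_datetime(df["joined"], format="%Y-%m-%d", errors="coerce")
bad_dates = df.loc[joined_dt.isna(), "joined"].tolist()
```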

#### 5. Structural Issues

- Poor column names (e.g., `"Unnamed: 3"`, `"Col_A"`)  
- Spaces or special characters in column names  
- Inconsistent naming conventions  

**Problem:**  
Makes code harder to write, maintain, and debug.

In [19]:
# example
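
Structural problems can be listed programmatically. The sketch below assumes lowercase snake_case is the target convention and uses hypothetical column names.

```python
import pandas as pd

df = pd.DataFrame(columns=["First Name", "Unnamed: 3", "total_sales", "E-mail!"])

# Column names that are anything other than lowercase snake_case
cols = pd.Series(df.columns)
is_clean = cols.str.fullmatch(r"[a-z][a-z0-9_]*")
problem_cols = cols[~is_clean].tolist()

# Columns pandas auto-named because the header cell was empty
unnamed_cols = [c for c in df.columns if c.startswith("Unnamed")]
```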

#### 6. Outliers and Impossible Values

Examples:
- Negative ages  
- Unrealistic salaries  
- Future birth dates  

**Problem:**  
Skews statistical analysis and leads to misleading conclusions.


In [20]:
# example 
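
Two complementary checks are sketched here on hypothetical data: rule-based filters for impossible values, and the common 1.5 × IQR fence for statistical outliers. The 0 to 120 age range is an assumed business rule, and the IQR fence is one convention among several.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, -3, 40, 230],
    "salary": [50_000, 52_000, 48_000, 5_000_000],
    "birth_date": pd.to_datetime(
        ["1990-05-01", "2001-07-15", "2090-01-01", "1985-03-20"]
    ),
})

# Rule-based checks for values that cannot be real
impossible_age = df[(df["age"] < 0) | (df["age"] > 120)]
future_births = df[df["birth_date"] > pd.Timestamp.today()]

# Statistical check: values beyond 1.5 * IQR from the quartiles
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
salary_outliers = df[
    (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
]
```

An outlier flagged by the IQR fence is not necessarily wrong; it is a candidate for review, whereas a negative age is invalid by definition.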

### The 5-Step Data Quality Assessment Framework

| Step | Check        | Method                           | Red Flag                 |
| ---- | ------------ | -------------------------------- | ------------------------ |
| 1    | Completeness | `df.info()`, `df.isnull().sum()` | >10% missing             |
| 2    | Uniqueness   | `df.duplicated().sum()`          | Duplicates in key fields |
| 3    | Validity     | `df.describe()`                  | Impossible values        |
| 4    | Consistency  | `df[col].value_counts()`         | Multiple formats         |
| 5    | Data Types   | `df.dtypes`                      | Numeric stored as object |
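
The five steps in the table can be bundled into one helper. This is a minimal sketch, not a standard pandas function: `assess_quality`, its parameters, and the keys of the returned dict are hypothetical names, and it assumes the frame has at least one numeric column.

```python
import pandas as pd

def assess_quality(df: pd.DataFrame, key_cols: list[str],
                   missing_threshold: float = 0.10) -> dict:
    """Run the five checks from the table and return the findings as a dict."""
    missing_share = df.isnull().mean()
    non_numeric = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    return {
        # 1. Completeness: columns above the missing-value threshold
        "high_missing": missing_share[missing_share > missing_threshold].index.tolist(),
        # 2. Uniqueness: repeated values in the key fields
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        # 3. Validity: min/max per numeric column, to compare against known limits
        "numeric_ranges": df.describe().loc[["min", "max"]].to_dict(),
        # 4. Consistency: distinct values per text column, for manual review
        "distinct_values": {c: int(df[c].nunique()) for c in non_numeric},
        # 5. Data types: text columns where every non-null value parses as a number
        "numeric_as_text": [
            c for c in non_numeric
            if df[c].notna().any()
            and pd.to_numeric(df[c], errors="coerce").notna().sum() == df[c].notna().sum()
        ],
    }

# Hypothetical data with one problem of each kind
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [30, None, 25, -5],
    "amount": ["10", "20", "30", "40"],
})
report = assess_quality(df, key_cols=["id"])
print(report)
```

The report only surfaces candidates. Deciding whether a minimum age of -5 is invalid, for example, still requires knowing what the column is supposed to contain.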


## Handling Missing Values

Missing values are unavoidable in real-world datasets. The goal is not to eliminate them blindly, but to handle them strategically based on context.

There is no universal solution. Always ask: **Why is this data missing?**


### Types of Missing Data

#### 1. Missing Completely at Random (MCAR)

No identifiable pattern in the missingness.

**Example:**  
Survey responses skipped accidentally.

**Strategy:**  
- Generally safe to drop rows if the share of missing data is small (e.g., <5%)  
- Or fill with the mean or median



#### 2. Missing at Random (MAR)

Missing values are related to other observed variables.

**Example:**  
Older users are less likely to provide email addresses.

**Strategy:**  
- Fill using group averages  
- Use related columns to inform imputation
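
Filling with group averages can be sketched with `groupby(...).transform(...)`. The `department` and `salary` columns are hypothetical; the assumption is that missingness in `salary` is related to `department`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
    "salary": [40_000, np.nan, 44_000, 70_000, 74_000, np.nan],
})

# Fill each gap with the mean of its own group rather than the global mean
group_mean = df.groupby("department")["salary"].transform("mean")
df["salary_filled"] = df["salary"].fillna(group_mean)
```

Here the missing Sales salary is filled with the Sales mean and the missing IT salary with the IT mean, which avoids pulling both toward a single overall average.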



#### 3. Missing Not at Random (MNAR)

Missingness itself carries meaning.

**Example:**  
High earners decline to report their salary.

**Strategy:**  
- Requires domain knowledge  
- May require modeling the missingness
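
One simple way to keep the information carried by the missingness, short of modeling it, is an indicator column recorded before any imputation. This is a common workaround rather than a full treatment of MNAR data, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [52_000, np.nan, 61_000, np.nan]})

# Record the missingness first, so the signal survives imputation
df["salary_missing"] = df["salary"].isna()

# Then fill; the median is used here purely as a placeholder strategy
df["salary_filled"] = df["salary"].fillna(df["salary"].median())
```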



#### Method 1: Dropping Missing Values with `.dropna()`

Removes rows or columns containing missing values.
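
To make the effect concrete, the sketch below counts how many rows survive under different `dropna()` settings. The frame is hypothetical; `how`, `subset`, and `thresh` are standard `DataFrame.dropna` parameters.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Email": ["ana@x.com", None, None, "dee@x.com"],
    "Phone": ["555-0101", "555-0102", None, None],
})

rows_any = len(df.dropna())                    # rows with no missing values at all
rows_all = len(df.dropna(how="all"))           # drops only the fully empty row
rows_email = len(df.dropna(subset=["Email"]))  # rows that have an email
rows_thresh = len(df.dropna(thresh=2))         # rows with at least 2 non-null values

print(rows_any, rows_all, rows_email, rows_thresh)
```

Comparing these counts before committing to one option shows how much data each choice would discard.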

```python
# Drop rows with ANY missing values
df_clean = df.dropna()

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows with missing values in specific columns
df_clean = df.dropna(subset=['Email', 'Phone'])

# Drop columns with ANY missing values
df_clean = df.dropna(axis=1)

# Keep rows with at least 3 non-null values
df_clean = df.dropna(thresh=3)
```

In [21]:
# example

#### Method 2: Filling Missing Values with `.fillna()`

Replaces missing values while preserving rows.

```python
# Fill with a specific value
df['Status'] = df['Status'].fillna('Unknown')

# Fill with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Fill with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Forward fill (fillna(method=...) is deprecated in recent pandas; use .ffill())
df['Price'] = df['Price'].ffill()

# Backward fill
df['Price'] = df['Price'].bfill()

# Different strategies per column
df = df.fillna({
    'Age': 0,
    'City': 'Unknown',
    'Score': df['Score'].mean()
})
```


In [22]:
# examples
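
A short worked example of the fill strategies on a hypothetical frame, so the results can be inspected directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Status": ["active", None, "inactive"],
    "Salary": [40_000.0, np.nan, 60_000.0],
    "Price": [10.0, np.nan, 12.0],
})

df["Status"] = df["Status"].fillna("Unknown")              # fixed placeholder
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # median of known values
df["Price"] = df["Price"].ffill()                          # carry last known price forward

remaining_missing = int(df.isnull().sum().sum())
print(df)
```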

#### Choosing the Right Fill Strategy

**Numeric Data**

- Mean: Normally distributed data without outliers
- Median: Skewed data or presence of outliers
- 0 or -1: When missing indicates absence

**Categorical Data**

- Mode: Most frequent category
- `"Unknown"`: When missing has meaning
- Forward/Backward fill: Time-series data only

| Situation              | Recommended Action | Reason         |
| ---------------------- | ------------------ | -------------- |
| <5% missing            | Drop rows          | Minimal impact |
| 5–15% missing          | Fill strategically | Preserve data  |
| >30% missing column    | Drop column        | Too unreliable |
| Critical field missing | Drop rows          | Cannot proceed |
| Optional field missing | Fill placeholder   | Keep record    |
| Time series data       | Forward/back fill  | Maintain order |


In [23]:
# example
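
The difference between mean and median on skewed data is easy to see with a single extreme value. The salary figures below are made up for illustration.

```python
import numpy as np
import pandas as pd

salary = pd.Series([40_000, 42_000, 45_000, 47_000, 1_000_000, np.nan])

mean_fill = salary.mean()      # pulled upward by the single extreme value
median_fill = salary.median()  # stays close to the typical salary

print(mean_fill, median_fill)

# With an outlier present, the median is the safer fill value
filled = salary.fillna(median_fill)
```

Filling with the mean here would assign the missing person a salary far above four of the five observed values, which is why the median is preferred for skewed columns.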

## Column Renaming and Standardization

Clean and consistent column names improve readability, reduce errors, and make code easier to maintain. Poor column names create confusion and require extra handling in code.


### Why Column Names Matter

Messy column names often include:
- Spaces
- Special characters
- Inconsistent casing
- Unclear wording

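Examples of problematic names:
- `"First Name!"`
- `"Total Sales ($)"`
- `"E-mail Address"`
- `"Unnamed: 7"`
- `"Customer's Age (Years)"`

These cause:
- Syntax issues (names with spaces or symbols cannot be used with attribute access such as `df.first_name`)
- Hard-to-read code
- Inconsistent references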

Clean versions:
- `first_name`
- `total_sales`
- `email`
- `purchase_count`
- `customer_age`

Benefits:
- Easy to type
- No special characters
- Consistent structure
- Cleaner code (`df.first_name` instead of `df['First Name']`)


### Golden Rules of Column Naming

1. Use lowercase  
2. Replace spaces with underscores  
3. Remove special characters  
4. Keep names descriptive but concise  
5. Use consistent naming patterns (e.g., snake_case)
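
Rules 1 to 3 and 5 are mechanical and can be applied to every column at once. The helper below is a minimal sketch; `clean_column_name` is a hypothetical function, not part of pandas.

```python
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Lowercase the name and collapse anything non-alphanumeric into underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)  # any run of other characters becomes "_"
    return name.strip("_")                   # no leading or trailing underscores

df = pd.DataFrame(columns=[
    "First Name!", "Total Sales ($)", "E-mail Address", "Customer's Age (Years)",
])
df.columns = [clean_column_name(c) for c in df.columns]
print(list(df.columns))
```

The mechanical result for the last two columns is `e_mail_address` and `customer_s_age_years`. Rule 4 (descriptive but concise) still calls for a manual rename to something like `email` and `customer_age`.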


### Renaming Columns with `.rename()`

#### Rename Specific Columns

```python
df = df.rename(columns={
    'First Name': 'first_name',
    'E-mail': 'email',
    'Total Sales ($)': 'total_sales'
})
```

In [24]:
# example
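
A runnable version of the rename on a hypothetical one-row frame. It also shows the `errors="raise"` option of `DataFrame.rename`, which is useful because keys that match no column are otherwise ignored silently.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Ana"],
    "E-mail": ["ana@x.com"],
    "Total Sales ($)": [120.5],
    "Region": ["West"],
})

# rename() returns a new DataFrame; columns not in the mapping are left alone
df = df.rename(columns={
    "First Name": "first_name",
    "E-mail": "email",
    "Total Sales ($)": "total_sales",
})
print(list(df.columns))

# A misspelled key does nothing by default; errors="raise" surfaces the typo
try:
    df.rename(columns={"Frist Name": "first_name"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
```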
