## Identifying Data Quality Issues

Data cleaning is a core part of data science and is often estimated to take 60–80% of project time. Before fixing data, you must first identify what is wrong using a structured approach.

### Why Data Becomes “Dirty”

Data issues arise from:

- Human error
- System glitches
- Data integration problems
- Real-world inconsistencies

In [34]:
# imports 
import pandas as pd 

In [35]:
# load data 
df_titanic = pd.read_csv("titanic.csv")


In [36]:
# show head 
df_titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [37]:
# check shape
df_titanic.shape 

(893, 13)

In [38]:
# check info
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Name         893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [39]:
df_titanic["Age"]

0      220.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
888      NaN
889     26.0
890     32.0
891     32.0
892     27.0
Name: Age, Length: 893, dtype: float64

### The Main Categories of Data Quality Issues

#### 1. Missing Values

Empty cells, `NaN`, `NULL`, or placeholder values such as `"N/A"` or `"999"`.

**Problem:**  
Missing values can break calculations, distort averages, and introduce bias into analysis if not handled properly.



In [40]:
# how to check for missing values 
df_titanic.isna().sum()


Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64

In [41]:
# isnull() is an alias of isna(), so the result is identical
df_titanic.isnull().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64
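
`isna()` only counts values that pandas already treats as missing. Placeholders such as `"?"` or `"N/A"` are ordinary strings and are not counted. The sketch below uses a small made-up frame (the values and the `placeholders` list are assumptions for illustration) to convert such placeholders into real missing values and report the share missing per column:

```python
import numpy as np
import pandas as pd

# Toy data: "?" and "N/A" are placeholders, not real NaN
df = pd.DataFrame({
    "Pclass": ["3", "1", "?", "2"],
    "Embarked": ["S", "N/A", "C", None],
})

raw_counts = df.isna().sum()  # only the None in Embarked is counted
print(raw_counts)

# Hypothetical list of placeholder strings to treat as missing
placeholders = ["?", "N/A", "NULL", ""]
df_fixed = df.replace(placeholders, np.nan)

missing_counts = df_fixed.isna().sum()
missing_pct = df_fixed.isna().mean() * 100  # percentage missing per column
print(missing_pct)
```

When the placeholders are known before loading, passing them to `pd.read_csv` through its `na_values` parameter achieves the same result at read time.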

#### 2. Duplicate Records

Exact duplicates or near-duplicate records with slight variations.

**Problem:**  
Duplicates inflate counts, distort summary statistics, and can lead to double-counting in business metrics.


In [42]:
# example check for duplicates 
df_titanic.duplicated().sum()

np.int64(1)
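
`duplicated()` with no arguments only flags rows that match an earlier row in every column. To check key fields, or to look at the duplicated rows themselves, the `subset` and `keep` parameters help. A minimal sketch on toy data (the rows are made up, loosely modeled on the passenger records above):

```python
import pandas as pd

# Toy data: the last two rows describe the same passenger
df = pd.DataFrame({
    "PassengerId": [889, 890, 891, 891],
    "Name": ["Johnston, Miss. Catherine", "Behr, Mr. Karl",
             "Dooley, Mr. Patrick", "Dooley, Mr. Patrick"],
    "Fare": [23.45, 30.0, 7.75, 7.75],
})

# Rows identical to an earlier row in every column
n_exact = df.duplicated().sum()

# Repeats of a key column, even if other columns differ
n_key = df.duplicated(subset=["PassengerId"]).sum()

# keep=False marks every member of a duplicate group, useful for inspection
dupes = df[df.duplicated(subset=["PassengerId"], keep=False)]
print(dupes)
```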

#### 3. Inconsistent Formatting

Different formats used for the same type of data, such as:
- Multiple date formats
- Inconsistent name casing
- Different phone number formats

**Problem:**  
Prevents accurate grouping, sorting, filtering, and matching of records.


In [43]:
# example: check for inconsistent formats
df_titanic["Pclass"].value_counts()

Pclass
3    470
1    201
2    173
?     49
Name: count, dtype: int64

In [44]:
df_titanic["Sex"].value_counts()

Sex
male      579
female    314
Name: count, dtype: int64

In [45]:
df_titanic["Cabin"].value_counts()

Cabin
G6             4
C23 C25 C27    4
B96 B98        4
F33            3
E101           3
              ..
E17            1
A24            1
C50            1
B42            1
C148           1
Name: count, Length: 147, dtype: int64
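
`value_counts()` makes inconsistent spellings visible because each variant gets its own row. The following sketch uses made-up values (not taken from the Titanic file, where `Sex` is already consistent) to show how trimming whitespace and lowercasing collapses the variants:

```python
import pandas as pd

# Made-up values: the same two categories with different casing and stray spaces
sex = pd.Series(["male", "Male ", "FEMALE", "female", " male"])

print(sex.value_counts())  # five distinct strings

# Trim whitespace and lowercase before counting
sex_clean = sex.str.strip().str.lower()
counts = sex_clean.value_counts()
print(counts)
```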

#### 4. Invalid Data Types

- Numbers stored as text  
- Dates stored as strings  
- Mixed units (e.g., `"25 years"`)

**Problem:**  
Prevents mathematical operations, proper sorting, and accurate analysis.



In [46]:
# example: Pclass should be numeric but is stored as str
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Name         893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [47]:
df_titanic["Pclass"].value_counts()

Pclass
3    470
1    201
2    173
?     49
Name: count, dtype: int64
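
A single non-numeric entry such as `?` is enough to make pandas store the whole column as text. One way to recover a numeric column is `pd.to_numeric` with `errors="coerce"`, sketched here on a toy Series that mirrors the `Pclass` values above:

```python
import pandas as pd

# Toy column mirroring the Pclass values shown above
pclass = pd.Series(["3", "1", "2", "?", "3"])

# errors="coerce" turns anything unparseable into NaN instead of raising
pclass_num = pd.to_numeric(pclass, errors="coerce")

n_invalid = pclass_num.isna().sum()
print(pclass_num.dtype, n_invalid)
```

The coerced entries become missing values, so they then need one of the missing-value strategies rather than being silently ignored.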

#### 5. Structural Issues

- Poor column names (e.g., `"Unnamed: 3"`, `"Col_A"`)  
- Spaces or special characters in column names  
- Inconsistent naming conventions  

**Problem:**  
Makes code harder to write, maintain, and debug.

In [48]:
# example
df_titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [49]:
df_titanic.columns

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')
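
A column called `Unnamed: 0` usually appears when a CSV was saved together with its index and then read back. The sketch below simulates such a file with `io.StringIO` (the CSV text is made up) and shows two common ways to deal with it:

```python
import io
import pandas as pd

# Simulate a CSV that was saved with its index (note the leading comma)
csv_text = ",PassengerId,Survived\n0,1,0\n1,2,1\n"

df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw.columns.tolist())  # ['Unnamed: 0', 'PassengerId', 'Survived']

# Option 1: treat the first column as the index while reading
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop leftover "Unnamed" columns after loading
df_dropped = df_raw.loc[:, ~df_raw.columns.str.startswith("Unnamed")]
```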

#### 6. Outliers and Impossible Values

Examples:
- Negative ages  
- Unrealistic salaries  
- Future birth dates  

**Problem:**  
Skews statistical analysis and leads to misleading conclusions.


In [50]:
# example 
df_titanic.describe()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare
count,893.0,893.0,893.0,716.0,893.0,893.0,893.0
mean,445.989922,446.992161,0.382979,29.975098,0.521837,0.380739,32.155318
std,257.913891,257.917707,0.486386,16.153539,1.101784,0.805355,49.64858
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.0,224.0,0.0,20.375,0.0,0.0,7.8958
50%,446.0,447.0,0.0,28.0,0.0,0.0,14.4542
75%,669.0,670.0,1.0,38.0,1.0,0.0,31.0
max,890.0,891.0,1.0,220.0,8.0,6.0,512.3292


In [51]:
df_titanic["Fare"].value_counts()

Fare
8.0500     43
13.0000    43
7.8958     38
7.7500     35
26.0000    31
           ..
13.8583     1
50.4958     1
5.0000      1
9.8458      1
10.5167     1
Name: count, Length: 248, dtype: int64
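
`describe()` shows a maximum age of 220, which is not a plausible human age. Two simple checks can surface such values: a rule-based range and the 1.5 × IQR rule. The sketch uses toy ages, and the 0 to 120 range is an assumed plausibility limit rather than anything defined by the dataset:

```python
import pandas as pd

# Toy ages, including the 220.0 seen in the first row above
age = pd.Series([220.0, 38.0, 26.0, 35.0, 35.0, 0.42, 28.0])

# Rule-based check: assumed plausible range for a human age
impossible = age[(age < 0) | (age > 120)]

# Statistical check: values beyond 1.5 * IQR from the quartiles
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
outliers = age[(age < q1 - 1.5 * iqr) | (age > q3 + 1.5 * iqr)]

print(impossible.tolist())
print(outliers.tolist())
```

The IQR rule also flags 0.42, which is a plausible infant age. Statistical flags are candidates for review, not proof of an error.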

### The 5-Step Data Quality Assessment Framework

| Step | Check        | Method                           | Red Flag                 |
| ---- | ------------ | -------------------------------- | ------------------------ |
| 1    | Completeness | `df.info()`, `df.isnull().sum()` | >10% missing             |
| 2    | Uniqueness   | `df.duplicated().sum()`          | Duplicates in key fields |
| 3    | Validity     | `df.describe()`                  | Impossible values        |
| 4    | Consistency  | `df[col].value_counts()`         | Multiple formats         |
| 5    | Data Types   | `df.dtypes`                      | Numeric stored as text   |
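
The five checks can be bundled into one helper so they run the same way on every dataset. This is a rough sketch, not a standard API: `quality_report`, its parameters, and the 10% default threshold are illustrative names and choices, and the toy frame is made up.

```python
import pandas as pd

def quality_report(df, key_cols, missing_threshold=10.0):
    """Collect the five checks from the table into one dict (illustrative helper)."""
    missing_pct = df.isna().mean() * 100
    return {
        # 1. Completeness: columns above the missing threshold
        "high_missing": missing_pct[missing_pct > missing_threshold].round(1).to_dict(),
        # 2. Uniqueness: repeated values in the key fields
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        # 3. Validity: min and max of numeric columns for a range sanity check
        "numeric_ranges": df.describe().loc[["min", "max"]].to_dict(),
        # 4. Consistency: number of distinct values per non-numeric column
        "n_categories": df.select_dtypes(exclude="number").nunique().to_dict(),
        # 5. Data types as strings
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Toy frame with a duplicate key, missing ages, and a "?" placeholder
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 3],
    "Age": [220.0, 38.0, None, None],
    "Pclass": ["3", "1", "?", "?"],
})
report = quality_report(df, key_cols=["PassengerId"])
print(report["high_missing"], report["duplicate_keys"])
```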


## Handling Missing Values

Missing values are unavoidable in real-world datasets. The goal is not to eliminate them blindly, but to handle them strategically based on context.

There is no universal solution. Always ask: **Why is this data missing?**


### Types of Missing Data

#### 1. Missing Completely at Random (MCAR)

No identifiable pattern in the missingness.

**Example:**  
Survey responses skipped accidentally.

**Strategy:**  
- Safe to drop rows if missing data is small (e.g., <5%)  
- Or fill using averages



#### 2. Missing at Random (MAR)

Missing values are related to other observed variables.

**Example:**  
Older users are less likely to provide email addresses.

**Strategy:**  
- Fill using group averages  
- Use related columns to inform imputation



#### 3. Missing Not at Random (MNAR)

Missingness itself carries meaning.

**Example:**  
High earners refuse to report salary.

**Strategy:**  
- Requires domain knowledge  
- May require modeling the missingness
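
As an illustration of the MAR strategy of filling with group averages, the sketch below fills each missing age with the median age of that passenger's class. The data is made up, and the extra `Age_missing` flag is one common, assumed way to keep the missingness visible when it might carry meaning (the MNAR case):

```python
import pandas as pd

# Toy data: Age is missing for some passengers in each class
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, None, 20.0, None, 24.0],
})

# Record which rows were missing before filling
df["Age_missing"] = df["Age"].isna()

# Median age within each class, aligned back to the original rows
group_median = df.groupby("Pclass")["Age"].transform("median")

# Fill each gap with the median of that row's own class
df["Age_filled"] = df["Age"].fillna(group_median)
print(df)
```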



#### Method 1: Dropping Missing Values with `.dropna()`

Removes rows or columns containing missing values.

```python
# Drop rows with ANY missing values
df_clean = df.dropna()

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows missing specific columns
df_clean = df.dropna(subset=['Email', 'Phone'])

# Drop columns with missing values
df_clean = df.dropna(axis=1)

# Keep rows with at least 3 non-null values
df_clean = df.dropna(thresh=3)
```


In [52]:
df_titanic.isna().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64

In [53]:
# example
df_titanic_clean = df_titanic.dropna()
df_titanic_clean.info()

<class 'pandas.DataFrame'>
Index: 183 entries, 1 to 889
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   183 non-null    int64  
 1   PassengerId  183 non-null    int64  
 2   Survived     183 non-null    int64  
 3   Pclass       183 non-null    str    
 4   Name         183 non-null    str    
 5   Sex          183 non-null    str    
 6   Age          183 non-null    float64
 7   SibSp        183 non-null    int64  
 8   Parch        183 non-null    int64  
 9   Ticket       183 non-null    str    
 10  Fare         183 non-null    float64
 11  Cabin        183 non-null    str    
 12  Embarked     183 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 20.0 KB


In [54]:
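# drop the Cabin column; .drop() returns a new DataFrame
df_titanic_no_cabin = df_titanic.drop("Cabin", axis=1)
# df_titanic itself is unchanged and still contains Cabin
df_titanic.info()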

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Name         893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [55]:
df_titanic_no_cabin_clean = df_titanic_no_cabin.dropna()
df_titanic_no_cabin_clean.info()

<class 'pandas.DataFrame'>
Index: 714 entries, 0 to 892
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   714 non-null    int64  
 1   PassengerId  714 non-null    int64  
 2   Survived     714 non-null    int64  
 3   Pclass       714 non-null    str    
 4   Name         714 non-null    str    
 5   Sex          714 non-null    str    
 6   Age          714 non-null    float64
 7   SibSp        714 non-null    int64  
 8   Parch        714 non-null    int64  
 9   Ticket       714 non-null    str    
 10  Fare         714 non-null    float64
 11  Embarked     714 non-null    str    
dtypes: float64(2), int64(5), str(5)
memory usage: 72.5 KB
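
The Titanic cells above show how much the choice of `dropna()` options matters: dropping on every column keeps 183 rows, while removing `Cabin` first keeps 714. The same comparison can be scripted before committing to a strategy. A small sketch on made-up data:

```python
import pandas as pd

# Toy frame with gaps spread over three columns
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Number of rows that survive each strategy
kept = {
    "any": len(df.dropna()),                       # rows with no gaps at all
    "subset Age": len(df.dropna(subset=["Age"])),  # only Age must be present
    "thresh=2": len(df.dropna(thresh=2)),          # at least 2 non-null values
    "drop Cabin first": len(df.drop(columns="Cabin").dropna()),
}
print(kept)
```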


#### Method 2: Filling Missing Values with `.fillna()`

Replaces missing values while preserving rows.

```python
# Fill with a specific value
df['Status'] = df['Status'].fillna('Unknown')

# Fill with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Fill with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Forward fill (fillna(method='ffill') is deprecated in recent pandas)
df['Price'] = df['Price'].ffill()

# Backward fill
df['Price'] = df['Price'].bfill()

# Different strategies per column
df = df.fillna({
    'Age': 0,
    'City': 'Unknown',
    'Score': df['Score'].mean()
})
```


In [60]:
# examples
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Name         893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [61]:
df_titanic["Age"] = df_titanic["Age"].fillna(df_titanic["Age"].median())

In [68]:
mid = df_titanic["Age"].median()
mid

np.float64(28.0)

In [None]:
# note the parentheses: .median() returns the value, .median is only the method
df_titanic["Age"] = df_titanic["Age"].fillna(df_titanic["Age"].median())

In [None]:

df_titanic

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_clean
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.2500,,S,220.0
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,26.0
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,35.0
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
888,888,889,0,?,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S,28.0
889,889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,26.0
890,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q,32.0
891,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q,32.0


In [62]:
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Name         893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          893 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


#### Choosing the Right Fill Strategy

**Numeric Data**

- Mean: Normally distributed data without outliers
- Median: Skewed data or presence of outliers
- 0 or -1: When missing indicates absence

**Categorical Data**

- Mode: Most frequent category
- "Unknown": When missing has meaning
- Forward/Backward fill: Time-series data only
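
To see why the median is preferred when outliers are present, compare both fills on a toy column that contains one gap and one extreme value (220.0, as in the first Titanic row). The numbers are made up for illustration:

```python
import pandas as pd

# Toy ages with one gap and one extreme value
age = pd.Series([220.0, 38.0, 26.0, 35.0, None])

mean_fill = age.fillna(age.mean())      # pulled upward by 220.0
median_fill = age.fillna(age.median())  # robust to the extreme value

print(age.mean(), age.median())
```

Here the mean is 79.75 while the median is 36.5, so a mean fill would insert an age that none of the typical passengers has.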

| Situation              | Recommended Action | Reason         |
| ---------------------- | ------------------ | -------------- |
| <5% missing            | Drop rows          | Minimal impact |
| 5–15% missing          | Fill strategically | Preserve data  |
| >30% missing column    | Drop column        | Too unreliable |
| Critical field missing | Drop rows          | Cannot proceed |
| Optional field missing | Fill placeholder   | Keep record    |
| Time series data       | Forward/back fill  | Maintain order |
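
The percentage rows of the table can be turned into a small helper. This is only a sketch: `suggest_action` is a hypothetical function, and because the table does not cover the 15–30% range, the helper flags that range for manual review instead of guessing. The counts are taken from the `isna().sum()` output shown earlier (893 rows).

```python
import pandas as pd

def suggest_action(missing_pct):
    """Map a column's missing percentage to the table's suggestion (illustrative)."""
    if missing_pct == 0:
        return "nothing to do"
    if missing_pct < 5:
        return "drop rows"
    if missing_pct <= 15:
        return "fill strategically"
    if missing_pct > 30:
        return "consider dropping column"
    # The table leaves 15-30% open, so flag it for a manual decision
    return "review manually"

# Missing counts from the isna() output above, for a frame of 893 rows
missing_counts = pd.Series({"Age": 177, "Cabin": 689, "Embarked": 2})
missing_pct = missing_counts / 893 * 100
suggestions = missing_pct.apply(suggest_action)
print(suggestions)
```

The thresholds are rules of thumb. Whether a field is critical or optional, as in the lower rows of the table, still has to be decided per column.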


In [80]:
# example: .mode() returns a Series, not a single value
mode = df_titanic["Embarked"].mode()

type(mode)

pandas.Series

In [81]:
print(mode)

0    S
Name: Embarked, dtype: str
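
Because `.mode()` returns a Series (there can be ties), a single value has to be selected before it is passed to `fillna()`. A minimal sketch on a toy column:

```python
import pandas as pd

# Toy column with gaps
embarked = pd.Series(["S", "C", "S", None, "Q", None])

modes = embarked.mode()      # a Series, possibly with several tied values
most_common = modes.iloc[0]  # take the first one as a scalar

embarked_filled = embarked.fillna(most_common)
print(embarked_filled.tolist())
```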


### Column Renaming and Standardization

Clean and consistent column names improve readability, reduce errors, and make code easier to maintain. Poor column names create confusion and require extra handling in code.


#### Why Column Names Matter

Messy column names often include:
- Spaces
- Special characters
- Inconsistent casing
- Unclear wording

Example of problematic names:
- `"First Name!"`
- `"Total Sales ($)"`
- `"E-mail Address"`
- `"Unnamed: 7"`
- `"Customer's Age (Years)"`

These cause:
- Syntax issues
- Hard-to-read code
- Inconsistent references

Clean versions:
- `first_name`
- `total_sales`
- `email`
- `purchase_count`
- `customer_age`

Benefits:
- Easy to type
- No special characters
- Consistent structure
- Cleaner code (`df.first_name` instead of `df['First Name']`)


### Golden Rules of Column Naming

1. Use lowercase  
2. Replace spaces with underscores  
3. Remove special characters  
4. Keep names descriptive but concise  
5. Use consistent naming patterns (e.g., snake_case)
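
Rules 1 to 3 are mechanical and can be applied to all columns at once. The helper below is a sketch under that assumption; `to_snake_case` is a hypothetical name, and the regular expression simply replaces every run of characters other than lowercase letters and digits with one underscore.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lowercase, drop special characters, and join words with underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)  # any run of other characters -> "_"
    return name.strip("_")

# Empty frame that only carries the problematic column names
df = pd.DataFrame(columns=["First Name!", "Total Sales ($)",
                           "E-mail Address", "Customer's Age (Years)"])
df.columns = [to_snake_case(c) for c in df.columns]
print(df.columns.tolist())
```

The result is `first_name`, `total_sales`, `e_mail_address`, and `customer_s_age_years`. Rule 4 (descriptive but concise) still needs a human decision, for example shortening the last two to `email` and `customer_age`.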


### Renaming Columns with `.rename()`

#### Rename Specific Columns

```python
df = df.rename(columns={
    'First Name': 'first_name',
    'E-mail': 'email',
    'Total Sales ($)': 'total_sales'
})
```
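
Applied to a few of the column names reported by `df_titanic.columns`, a rename could look like the sketch below. The frame is a toy stand-in and the new names are illustrative choices, not the only reasonable ones:

```python
import pandas as pd

# Toy frame with a few of the column names shown by df_titanic.columns
df = pd.DataFrame({"Unnamed: 0": [0, 1], "PassengerId": [1, 2], "SibSp": [1, 1]})

# Map old names to new names; unlisted columns keep their name
df = df.rename(columns={
    "Unnamed: 0": "row_id",
    "PassengerId": "passenger_id",
    "SibSp": "siblings_spouses",
})

# Keys that do not match any column are silently ignored by default
df = df.rename(columns={"Not A Column": "x"})
print(df.columns.tolist())
```

Because unmatched keys are ignored by default, a typo in an old name produces no error. Passing `errors="raise"` to `rename()` makes such mistakes visible.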


