
# Task
Clean, assess, and flag data quality issues in the "products_dimension.csv" file. The workflow loads and profiles the data; identifies quality issues (missing values, duplicates, invalid formats, outliers, inconsistent text); standardizes data formats; fills missing values while flagging where they were originally missing; validates IDs and other data formats; applies business logic validations and flags violations; detects and flags duplicate rows; and finally produces a cleaned DataFrame, a detailed data quality report, and a 'Before vs After' comparison summary.

## Load Data and Initial Profiling

### Subtask:
Load the 'products_dimension.csv' file into a pandas DataFrame named 'df'. Then, perform initial data profiling, which includes displaying the DataFrame's shape, data types of all columns, and summary statistics (e.g., .describe()). This step will also identify numerical versus categorical columns for subsequent cleaning steps.


**Reasoning**:
First, I'll import the pandas library to enable data manipulation, then load the 'products_dimension.csv' file into a DataFrame, and display its shape, data types, and descriptive statistics. Afterwards, I will identify and print the numerical and categorical columns.



In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/products_dimension.csv')

# Display the DataFrame's shape
print(f"DataFrame shape: {df.shape}")

# Display data types of all columns
print("\nDataFrame Info:")
df.info()

# Display descriptive statistics for numerical columns
print("\nDescriptive Statistics:")
print(df.describe())

# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(exclude=np.number).columns.tolist()

print(f"\nNumerical Columns: {numerical_cols}")
print(f"Categorical Columns: {categorical_cols}")

DataFrame shape: (62700, 8)

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62700 entries, 0 to 62699
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   product_id    62700 non-null  object 
 1   product_name  62700 non-null  object 
 2   Brand         59595 non-null  object 
 3   Division      61877 non-null  object 
 4   Category      61877 non-null  object 
 5   Subcategory   61877 non-null  object 
 6   Segment       61877 non-null  object 
 7   unit_price    62700 non-null  float64
dtypes: float64(1), object(7)
memory usage: 3.8+ MB

Descriptive Statistics:
         unit_price
count  62700.000000
mean      52.468453
std       27.391753
min        5.000000
25%       28.800000
50%       52.500000
75%       76.160000
max      100.000000

Numerical Columns: ['unit_price']
Categorical Columns: ['product_id', 'product_name', 'Brand', 'Division', 'Category', 'Subcategory', 'Segment']
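In this file `product_id` was inferred as `object`, which keeps the raw text intact. If a future extract contained only digits, pandas would infer an integer column and silently drop any leading zeros. A hedged sketch of pinning the type at load time, using an in-memory CSV with invented values rather than the real file:

```python
import io
import pandas as pd

# Invented sample; the first ID has leading zeros.
csv_text = "product_id,unit_price\n0012345678905,10.5\n6221000012345,20.0\n"

# Default inference turns all-digit IDs into integers.
as_default = pd.read_csv(io.StringIO(csv_text))

# Pinning the dtype keeps the ID exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"product_id": "string"})

print(as_default["product_id"].tolist())
print(as_text["product_id"].tolist())
```

Reading identifiers as text is generally the safer default, since arithmetic on an ID is never meaningful.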


## Data Quality Assessment (Identification)

### Subtask:
Identify and document data quality issues, including missing values, duplicate rows, invalid formats, outliers using the IQR method, and inconsistent text formatting in string columns. No data will be modified in this step, only assessed.


**Reasoning**:
First, I'll calculate and display the missing values count and percentage for each column in the DataFrame, storing this information in a `missing_data_report` DataFrame.



In [2]:
missing_count = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_data_report = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

print("Missing Data Report:")
print(missing_data_report[missing_data_report['Missing Count'] > 0])

Missing Data Report:
             Missing Count  Missing Percentage
Brand                 3105            4.952153
Division               823            1.312600
Category               823            1.312600
Subcategory            823            1.312600
Segment                823            1.312600
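The identical counts for `Division`, `Category`, `Subcategory`, and `Segment` (823 each) suggest, but do not prove, that these fields are missing on the same rows. A minimal sketch of how that could be checked, using a small made-up frame in place of `df` (the helper name `co_missing_rows` is hypothetical):

```python
import pandas as pd

def co_missing_rows(frame: pd.DataFrame, cols: list[str]) -> dict[str, int]:
    """Count rows where all / any of the given columns are missing."""
    missing = frame[cols].isnull()
    return {
        "all_missing": int(missing.all(axis=1).sum()),
        "any_missing": int(missing.any(axis=1).sum()),
    }

# Toy data standing in for the real product dimension (values are invented).
toy = pd.DataFrame({
    "Division": ["A", None, None, "B"],
    "Category": ["x", None, None, "y"],
    "Segment":  ["s", None, "t", "u"],
})

result = co_missing_rows(toy, ["Division", "Category", "Segment"])
print(result)
```

If the two counts are equal on the real data, the columns are always missing together, which would point to whole hierarchy records being absent rather than independent gaps.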


**Reasoning**:
Next, I will detect and report on duplicate rows in the DataFrame by counting them and displaying all identified duplicates for review.



In [3]:
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Total number of duplicate rows (keeping all occurrences): {len(duplicate_rows)}")

if not duplicate_rows.empty:
    print("\nDuplicate Rows (first 50 occurrences):")
    print(duplicate_rows.head(50))
else:
    print("\nNo duplicate rows found.")

Total number of duplicate rows (keeping all occurrences): 0

No duplicate rows found.


**Reasoning**:
I will attempt to convert the 'product_id' column to a numeric type, identifying and displaying any values that cause conversion errors, as these indicate invalid data formats.



In [4]:
invalid_product_ids = df[pd.to_numeric(df['product_id'], errors='coerce').isna() & df['product_id'].notna()]

print(f"Number of rows with invalid 'product_id' format: {len(invalid_product_ids)}")

if not invalid_product_ids.empty:
    print("\nRows with invalid 'product_id' format (first 20 occurrences):")
    print(invalid_product_ids.head(20))
else:
    print("\nNo invalid 'product_id' formats found.")

Number of rows with invalid 'product_id' format: 2

Rows with invalid 'product_id' format (first 20 occurrences):
      product_id     product_name Brand       Division       Category  \
42809          V     VOUCHER DEMO   NaN  PINK ELEPHANT  PINK ELEPHANT   
44062         V1  VOUCHER 0.5 EGP   NaN            AUX            AUX   

         Subcategory        Segment  unit_price  
42809  PINK ELEPHANT  PINK ELEPHANT       39.92  
44062            AUX            AUX       67.18  
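Passing `pd.to_numeric` does not guarantee that an ID is intact. The `product_id` column in this dataset also contains values such as `5.28102E+12`, which parse as numbers but look like identifiers that were rounded into scientific notation somewhere upstream. A hedged sketch of how such values could be counted, assuming a regular expression is an acceptable heuristic (the pattern, the name `SCI_NOTATION`, and every sample value except the first are illustrative):

```python
import pandas as pd

# Optional sign, digits, optional fraction, then an exponent part.
SCI_NOTATION = r"[+-]?\d+(?:\.\d+)?[eE][+-]?\d+"

ids = pd.Series(["5.28102E+12", "6221000012345", "V1", "3.0e5", None], dtype="string")

# fullmatch returns <NA> for missing IDs; treat those as "not scientific notation".
is_sci = ids.str.fullmatch(SCI_NOTATION).fillna(False).astype(bool)
print(is_sci.tolist())
```

Rows matched this way cannot be repaired from the file alone, because the lost digits are not recoverable; they can only be flagged.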


**Reasoning**:
I will inspect the unique values of categorical columns (excluding 'product_id' which was already checked for format) to identify any potential inconsistent entries or unexpected formats, displaying a sample if there are many unique values.



In [5]:
print("\nInspecting unique values in categorical columns for inconsistencies:")
for col in categorical_cols:
    if col == 'product_id': # Already handled
        continue
    print(f"\nColumn: {col}")
    unique_count = df[col].nunique(dropna=False)
    print(f"Number of unique values (including NaN): {unique_count}")

    # Display unique values or a sample if there are too many
    if unique_count < 50:
        print(df[col].unique())
    else:
        print(f"Top 50 unique values (out of {unique_count}):\n{df[col].value_counts(dropna=False).head(50).index.tolist()}")


Inspecting unique values in categorical columns for inconsistencies:

Column: product_name
Number of unique values (including NaN): 62556
Top 50 unique values (out of 62556):
['XXX', 'X', '++', 'تى شيرت دليفرى رمادى شتوى7XL', 'SOFTMINA5001PANTYHOSE CCL2AT(XXL)BLACK▓', 'توريد وتركيب عود داكت 2*2', 'سدادة ليجراند', 'NEEDLE BURNER ELECTRIC THERMINATOR▓', 'EAR WASHING SYRING(METAL STAINLESS TY.)▓', 'NEW PHARMA INQUIRY', 'تى شيرت دليفرى رمادى شتوى6XL', 'RESUSCITATOR APPARATUS▓', '0', 'DIAGNOSTIC SET (EAR-NOSE-THROAT) E.N.T▓', '###', 'ISOPTO CARPINE 1% EYE DROP', 'مسدس تسخين', 'حزام', 'DEBORAH 24ORE CARE PERF.(SPF20)FOUND./0#', 'شريط لحام', 'TS JAGUAR STANCE EDT F/M 100ML', 'TS JAGUAR CLASSIC GOLD EDT F/M 100ML', 'TS JAGUAR CLASSIC RED EDT F/M 100ML', '$$', 'XXXX', 'تابنتانيتانتابنت', 'CHICCO SOFT SILICONE SPOON RED 6M+™', 'BIODERMA ATODERM ULTRA NOURISH.CR 200M#', 'سخان كهرباء 20  لتر فريش', 'xxxx', 'ACCU-CHEK PERFORMA 50/STRIP▓', 'حبارة طابعة LASER UP107', 'BACK SUPPORT CUSHION F/BED LE15

**Reasoning**:
Next, I will detect outliers in the 'unit_price' column using the Interquartile Range (IQR) method by calculating Q1, Q3, and IQR, and then determining the lower and upper bounds to identify and count outliers.



In [6]:
Q1 = df['unit_price'].quantile(0.25)
Q3 = df['unit_price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['unit_price'] < lower_bound) | (df['unit_price'] > upper_bound)]

print(f"\nNumber of outliers in 'unit_price' using IQR method: {len(outliers)}")
if not outliers.empty:
    print("Outliers in 'unit_price' (first 20 occurrences):")
    print(outliers.head(20))
else:
    print("No outliers found in 'unit_price'.")


Number of outliers in 'unit_price' using IQR method: 0
No outliers found in 'unit_price'.
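The zero-outlier result follows directly from the quartiles reported in the descriptive statistics. With Q1 = 28.80 and Q3 = 76.16, the IQR is 47.36, so the fences are 28.80 - 71.04 = -42.24 and 76.16 + 71.04 = 147.20; the observed range of 5 to 100 sits entirely inside them. The same calculation as a small reusable helper, shown on invented prices (the name `iqr_bounds` is hypothetical, not part of the notebook):

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy prices (invented); 100 is far above the rest.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper, outliers.tolist())
```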


**Reasoning**:
I will examine the categorical columns for inconsistent text formatting by comparing the number of unique values before and after applying `.str.strip().str.lower()`, which will highlight variations due to casing or leading/trailing spaces.



In [7]:
print("\nInspecting categorical columns for inconsistent text formatting (casing, leading/trailing spaces):")
for col in categorical_cols:
    if col in ['product_id', 'product_name']: # product_id already checked, product_name has too many unique values to be practical for this check
        continue

    original_unique_count = df[col].nunique(dropna=False)

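    # Normalize with strip + lower, then count unique values.
    # astype(str) turns NaN into the string 'nan', so missing values still count as one
    # unique value, matching nunique(dropna=False) above.
    processed_unique_count = df[col].astype(str).str.strip().str.lower().nunique(dropna=False)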

    if original_unique_count != processed_unique_count:
        print(f"Column '{col}': Inconsistent formatting detected (Original unique: {original_unique_count}, Processed unique: {processed_unique_count})")
        # Optionally print some examples if needed, but for now, just count is sufficient
    else:
        print(f"Column '{col}': No significant text formatting inconsistencies found (Unique values count: {original_unique_count})")


Inspecting categorical columns for inconsistent text formatting (casing, leading/trailing spaces):
Column 'Brand': Inconsistent formatting detected (Original unique: 11852, Processed unique: 11848)
Column 'Division': No significant text formatting inconsistencies found (Unique values count: 16)
Column 'Category': No significant text formatting inconsistencies found (Unique values count: 115)
Column 'Subcategory': No significant text formatting inconsistencies found (Unique values count: 316)
Column 'Segment': No significant text formatting inconsistencies found (Unique values count: 1211)
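The counts above show that the number of distinct `Brand` values drops by four after normalization, but not which spellings are involved. One way to list them, sketched here on invented values (the helper name `formatting_variants` is hypothetical), is to group the raw values by their normalized form and keep groups with more than one spelling:

```python
import pandas as pd

def formatting_variants(values: pd.Series) -> dict[str, list[str]]:
    """Map each normalized value to its raw spellings, where there are several."""
    raw = values.dropna().astype(str)
    key = raw.str.strip().str.lower()
    grouped = raw.groupby(key).unique()
    return {k: sorted(v) for k, v in grouped.items() if len(v) > 1}

# Invented brand values with casing and whitespace variants.
brands = pd.Series(["Nivea", "NIVEA", " nivea", "Dove", "Dove", None])
variants = formatting_variants(brands)
print(variants)
```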


## Standardize Data Formats

### Subtask:
Clean and standardize various data formats. This includes converting all column names to snake_case, trimming leading and trailing spaces from string columns, and standardizing text casing (e.g., to lowercase or title case) where appropriate. Convert columns to their correct inferred data types based on their content, handling errors by coercing to NaN where necessary.


**Reasoning**:
First, I'll convert all column names to snake_case to ensure consistency and ease of use in subsequent operations.



In [8]:
import re

def to_snake_case(name):
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    name = re.sub('__([A-Z])', r'_\1', name)
    name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
    return name

df.columns = [to_snake_case(col) for col in df.columns]

print("Columns after converting to snake_case:")
print(df.columns.tolist())

Columns after converting to snake_case:
['product_id', 'product_name', 'brand', 'division', 'category', 'subcategory', 'segment', 'unit_price']
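The column names in this file contain no spaces or hyphens, so `to_snake_case` is sufficient here. For headers that do, a slightly more permissive variant could look like the sketch below; `to_snake_case_loose` and the sample headers are illustrative assumptions, not part of the notebook.

```python
import re

def to_snake_case_loose(name: str) -> str:
    """Convert a header to snake_case, also treating spaces and hyphens as separators."""
    name = re.sub(r"[\s\-]+", "_", name.strip())          # spaces/hyphens -> underscore
    name = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)    # split before capitalized words
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)   # split lower/digit -> upper
    name = re.sub(r"_+", "_", name)                       # collapse repeated underscores
    return name.lower()

headers = ["Unit Price", "productID", "Subcategory", "unit_price", "Brand-Name"]
converted = [to_snake_case_loose(h) for h in headers]
print(converted)
```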


**Reasoning**:
Next, I will iterate through all object (string) columns to remove leading/trailing whitespace and then apply appropriate casing (title case for descriptive columns, lowercase for 'product_name').



In [9]:
for col in df.select_dtypes(include='object').columns:
    # Strip whitespace from all object columns
    df[col] = df[col].str.strip()

    # Apply specific casing
    if col in ['brand', 'division', 'category', 'subcategory', 'segment']:
        df[col] = df[col].str.title()
    elif col == 'product_name':
        df[col] = df[col].str.lower()

print("String columns after stripping whitespace and applying casing (first 5 rows):")
print(df.head())

String columns after stripping whitespace and applying casing (first 5 rows):
    product_id                              product_name         brand  \
0  5.28102E+12          beesline elder rose deo r.on 50m  Beesline Deo   
1  5.28102E+12  beesline green forst whit.r.on deod.50m#  Beesline Deo   
2  5.28102E+12    beesline pacific islands deo r.on 50m#  Beesline Deo   
3  5.28102E+12         beesline cool breeze deo r.on 50m  Beesline Deo   
4  5.28102E+12        beesline hr delaying deo r.on 50m#  Beesline Deo   

              division         category      subcategory         segment  \
0  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
1  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
2  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
3  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
4  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   

   unit_price  
0   

**Reasoning**:
I will infer and convert data types for all columns using `df.convert_dtypes()` to optimize memory and type representation. Then, I'll explicitly convert the 'product_id' column to a numeric type, coercing any non-numeric values to NaN, and create a new boolean flag column, 'is_invalid_product_id_format', to mark rows where this conversion failed.



In [10]:
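# Snapshot of the standardized data (renamed columns, trimmed and re-cased text) before dtype conversion
df_original = df.copy()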
# Infer and convert data types for all columns
df = df.convert_dtypes()

# Convert 'product_id' to numeric, coercing errors to NaN
df['product_id_numeric'] = pd.to_numeric(df['product_id'], errors='coerce')

# Create a flag for invalid 'product_id' formats
df['is_invalid_product_id_format'] = df['product_id_numeric'].isna() & df['product_id'].notna()

# Display the DataFrame info after type conversion and the first few rows with the new columns
print("\nDataFrame Info after type conversion and flagging:")
df.info()
print("\nDataFrame head with new 'product_id_numeric' and 'is_invalid_product_id_format' columns:")
print(df.head())

print(f"\nNumber of rows flagged with invalid product_id format: {df['is_invalid_product_id_format'].sum()}")


DataFrame Info after type conversion and flagging:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62700 entries, 0 to 62699
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   product_id                    62700 non-null  string 
 1   product_name                  62700 non-null  string 
 2   brand                         59595 non-null  string 
 3   division                      61877 non-null  string 
 4   category                      61877 non-null  string 
 5   subcategory                   61877 non-null  string 
 6   segment                       61877 non-null  string 
 7   unit_price                    62700 non-null  Float64
 8   product_id_numeric            62698 non-null  Float64
 9   is_invalid_product_id_format  62700 non-null  bool   
dtypes: Float64(2), bool(1), string(7)
memory usage: 4.5 MB

DataFrame head with new 'product_id_numeric' and 'is_invalid_product_i

## Handle Missing Values and Flag `is_missing_original`

### Subtask:
Address missing values without deleting rows. For numerical columns, replace missing values with the column's median. For categorical columns, replace missing values with the string 'Unknown'. For each column that had original missing values, create a new boolean flag column named `is_missing_original_{column_name}` to indicate where values were originally missing.


**Reasoning**:
First, I'll identify columns with missing values and create a new boolean flag column for each to indicate where values were originally missing. Then, I will fill the missing values in these categorical columns with the string 'Unknown'.



In [11]:
missing_cols = df.columns[df.isnull().any()].tolist()

for col in missing_cols:
    # Create a flag column for original missingness
    df[f'is_missing_original_{col}'] = df[col].isnull()

    # Handle categorical columns by filling with 'Unknown'
    if df[col].dtype == 'string': # Check for pandas StringDtype
        df[col] = df[col].fillna('Unknown')
    # Numerical columns with missing values (here only product_id_numeric, which holds
    # NaN for the two non-numeric IDs) are not filled in this loop; they are handled
    # separately with the column median:
    # elif pd.api.types.is_numeric_dtype(df[col]):
    #     median_val = df[col].median()
    #     df[col] = df[col].fillna(median_val)

print("Missing values handled and flag columns created.")
print("New flag columns and their sum of True values:")
for col in missing_cols:
    print(f"is_missing_original_{col}: {df[f'is_missing_original_{col}'].sum()} True values")

print("\nVerifying no remaining missing values:")
print(df.isnull().sum())

Missing values handled and flag columns created.
New flag columns and their sum of True values:
is_missing_original_brand: 3105 True values
is_missing_original_division: 823 True values
is_missing_original_category: 823 True values
is_missing_original_subcategory: 823 True values
is_missing_original_segment: 823 True values
is_missing_original_product_id_numeric: 2 True values

Verifying no remaining missing values:
product_id                                0
product_name                              0
brand                                     0
division                                  0
category                                  0
subcategory                               0
segment                                   0
unit_price                                0
product_id_numeric                        2
is_invalid_product_id_format              0
is_missing_original_brand                 0
is_missing_original_division              0
is_missing_original_category              0
is_missi

**Reasoning**:
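The previous step showed that 'product_id_numeric', a numerical column, still has two missing values (the rows whose 'product_id' is not numeric). Following the subtask's instruction, I will fill them with the column's median. The filled value is only a placeholder, not a real identifier; the affected rows remain marked by `is_invalid_product_id_format` and `is_missing_original_product_id_numeric`.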



In [12]:
if 'product_id_numeric' in df.columns and df['product_id_numeric'].isnull().any():
    median_product_id_numeric = df['product_id_numeric'].median()
    df['product_id_numeric'] = df['product_id_numeric'].fillna(median_product_id_numeric)
    print(f"Missing values in 'product_id_numeric' filled with median: {median_product_id_numeric}")

print("\nVerifying no remaining missing values after numerical imputation:")
print(df.isnull().sum())

Missing values in 'product_id_numeric' filled with median: 3012287777.0

Verifying no remaining missing values after numerical imputation:
product_id                                0
product_name                              0
brand                                     0
division                                  0
category                                  0
subcategory                               0
segment                                   0
unit_price                                0
product_id_numeric                        0
is_invalid_product_id_format              0
is_missing_original_brand                 0
is_missing_original_division              0
is_missing_original_category              0
is_missing_original_subcategory           0
is_missing_original_segment               0
is_missing_original_product_id_numeric    0
dtype: int64
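The flag-then-fill pattern used in this section can be captured in one small helper. The sketch below is an assumption about how it might be packaged, with a hypothetical name `flag_and_fill` and invented sample values; it records original missingness before the fill so that information is not lost.

```python
import pandas as pd

def flag_and_fill(frame: pd.DataFrame, col: str, fill_value) -> pd.DataFrame:
    """Return a copy with is_missing_original_<col> added and missing values in col filled."""
    out = frame.copy()
    out[f"is_missing_original_{col}"] = out[col].isnull()  # record before filling
    out[col] = out[col].fillna(fill_value)
    return out

# Invented sample with one missing brand.
toy = pd.DataFrame({"brand": ["Beesline", None, "Chicco"]})
filled = flag_and_fill(toy, "brand", "Unknown")
print(filled)
```

Returning a copy rather than mutating in place makes it harder to fill a column twice and overwrite the flag with all-`False` values.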


## Validate IDs and Flag `is_corrupted_id` and `is_invalid_format`

### Subtask:
Apply specific ID validation rules for 'product_id'. Create a new boolean column `is_corrupted_id` to flag rows where 'product_id' fails these validations (non-numeric, not 13-digit integer). Additionally, create a general `is_invalid_format` flag column to mark any other invalid formats, using the previously created `is_invalid_product_id_format` for now.


**Reasoning**:
First, I will initialize the `is_corrupted_id` column to `False` and populate it based on existing `is_invalid_product_id_format` flags. Then, I'll convert valid `product_id_numeric` values to integer strings to check if their length is exactly 13 digits, updating `is_corrupted_id` accordingly. Finally, I will create the `is_invalid_format` column and print the counts for the new flag columns.



In [13]:
# 1. Create is_corrupted_id, initialized to False.
df['is_corrupted_id'] = False

# 2. Flag rows whose original product_id was non-numeric
#    (is_invalid_product_id_format is already True for them).
df.loc[df['is_invalid_product_id_format'], 'is_corrupted_id'] = True

# 3. For rows not yet flagged, require the numeric ID to have exactly 13 digits.
#    product_id_numeric is Float64, so cast to Int64 before converting to string;
#    otherwise a trailing '.0' (e.g. '123.0' instead of '123') would inflate the length.
id_digits = df['product_id_numeric'].astype('Int64').astype(str)
df.loc[~df['is_corrupted_id'] & (id_digits.str.len() != 13), 'is_corrupted_id'] = True

# 4. Create a new boolean column named is_invalid_format and assign the values
#    from the existing is_invalid_product_id_format column to it.
df['is_invalid_format'] = df['is_invalid_product_id_format']

# 5. Print the sum of True values for both is_corrupted_id and is_invalid_format.
print(f"Number of rows flagged as 'is_corrupted_id': {df['is_corrupted_id'].sum()}")
print(f"Number of rows flagged as 'is_invalid_format': {df['is_invalid_format'].sum()}")

print("\nDataFrame head with new flag columns:")
print(df.head())

Number of rows flagged as 'is_corrupted_id': 39547
Number of rows flagged as 'is_invalid_format': 2

DataFrame head with new flag columns:
    product_id                              product_name         brand  \
0  5.28102E+12          beesline elder rose deo r.on 50m  Beesline Deo   
1  5.28102E+12  beesline green forst whit.r.on deod.50m#  Beesline Deo   
2  5.28102E+12    beesline pacific islands deo r.on 50m#  Beesline Deo   
3  5.28102E+12         beesline cool breeze deo r.on 50m  Beesline Deo   
4  5.28102E+12        beesline hr delaying deo r.on 50m#  Beesline Deo   

              division         category      subcategory         segment  \
0  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
1  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
2  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
3  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
4  Everyday Essentials  Body Freshen
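An alternative to the numeric round trip is to validate the original text directly. The sketch below, on invented IDs, assumes the rule is "exactly 13 ASCII digits" and uses `Series.str.fullmatch`. Under that stricter reading, a value like `5.28102E+12` is also flagged, even though its numeric form (5281020000000) has 13 digits and therefore passes the length check above.

```python
import pandas as pd

# Invented IDs: one valid, one in scientific notation, one too short, one non-numeric.
ids = pd.Series(["6221000012345", "5.28102E+12", "12345", "V1"], dtype="string")

# True where the raw text is NOT exactly 13 ASCII digits.
is_corrupted = ~ids.str.fullmatch(r"[0-9]{13}").fillna(False).astype(bool)
print(is_corrupted.tolist())
```

Which of the two rules is correct depends on what the business considers a valid product ID, which the source does not state.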

## Business Logic Validation and Flag `is_outlier`

### Subtask:
Perform business logic validations on the numerical columns and create an `is_outlier` flag based on IQR. Finally, combine all critical flags into a `validation_flag`.


**Reasoning**:
First, I will create a new boolean column `is_negative_unit_price` to flag any rows where `unit_price` is negative. Then, I will calculate the IQR, lower, and upper bounds for the `unit_price` column to identify outliers and create an `is_outlier` flag.



In [14]:
# 1. Create a new boolean column named is_negative_unit_price
df['is_negative_unit_price'] = df['unit_price'] < 0

# 2. Calculate Q1, Q3, and IQR for unit_price
Q1 = df['unit_price'].quantile(0.25)
Q3 = df['unit_price'].quantile(0.75)
IQR = Q3 - Q1

# 3. Determine lower and upper bounds for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 4. Create a new boolean column named is_outlier
df['is_outlier'] = (df['unit_price'] < lower_bound) | (df['unit_price'] > upper_bound)

print(f"Number of rows with negative unit_price: {df['is_negative_unit_price'].sum()}")
print(f"Number of rows identified as outliers in unit_price: {df['is_outlier'].sum()}")

print("\nDataFrame head with new flags:")
print(df[['unit_price', 'is_negative_unit_price', 'is_outlier']].head())

Number of rows with negative unit_price: 0
Number of rows identified as outliers in unit_price: 0

DataFrame head with new flags:
   unit_price  is_negative_unit_price  is_outlier
0       40.58                   False       False
1       95.32                   False       False
2       74.54                   False       False
3       61.87                   False       False
4       19.82                   False       False


**Reasoning**:
Next, I will initialize the `validation_flag` column to `False` and then update it to `True` for any rows where `is_corrupted_id`, `is_invalid_format`, `is_negative_unit_price`, or `is_outlier` are `True`, finally printing the sum of `True` values for the new flag and the previously created flags.



In [15]:
df['validation_flag'] = False
df.loc[
    df['is_corrupted_id']
    | df['is_invalid_format']
    | df['is_negative_unit_price']
    | df['is_outlier'],
    'validation_flag',
] = True

print(f"\nNumber of rows with negative unit_price: {df['is_negative_unit_price'].sum()}")
print(f"Number of rows identified as outliers in unit_price: {df['is_outlier'].sum()}")
print(f"Total number of rows flagged by 'validation_flag': {df['validation_flag'].sum()}")

print("\nDataFrame head with all validation flags:")
print(df[['product_id', 'unit_price', 'is_corrupted_id', 'is_invalid_format', 'is_negative_unit_price', 'is_outlier', 'validation_flag']].head())


Number of rows with negative unit_price: 0
Number of rows identified as outliers in unit_price: 0
Total number of rows flagged by 'validation_flag': 39547

DataFrame head with all validation flags:
    product_id  unit_price  is_corrupted_id  is_invalid_format  \
0  5.28102E+12       40.58            False              False   
1  5.28102E+12       95.32            False              False   
2  5.28102E+12       74.54            False              False   
3  5.28102E+12       61.87            False              False   
4  5.28102E+12       19.82            False              False   

   is_negative_unit_price  is_outlier  validation_flag  
0                   False       False            False  
1                   False       False            False  
2                   False       False            False  
3                   False       False            False  
4                   False       False            False  
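As more flags are added, chaining `|` by hand becomes easy to get wrong. A compact alternative, sketched on invented flag values, is to list the flag columns once and reduce across them with `DataFrame.any(axis=1)`:

```python
import pandas as pd

# Invented flag values for three rows.
flags = pd.DataFrame({
    "is_corrupted_id":        [False, True,  False],
    "is_invalid_format":      [False, False, False],
    "is_negative_unit_price": [False, False, False],
    "is_outlier":             [False, False, True],
})

critical_flags = ["is_corrupted_id", "is_invalid_format", "is_negative_unit_price", "is_outlier"]

# A row is flagged if at least one critical flag is True.
flags["validation_flag"] = flags[critical_flags].any(axis=1)
print(flags["validation_flag"].tolist())
```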


## Detect and Flag Duplicates (is_duplicate)

### Subtask:
Identify duplicate rows within the DataFrame and add a new boolean column, `is_duplicate`, to flag all occurrences of these duplicate rows. Duplicate rows will be flagged but not removed.


**Reasoning**:
I will create a new boolean column 'is_duplicate' by using `df.duplicated(keep=False)` to flag all occurrences of duplicate rows as True, then print the total number of flagged duplicates and display the DataFrame's head with the new column.



In [16]:
df['is_duplicate'] = df.duplicated(keep=False)

print(f"Total number of rows flagged as 'is_duplicate': {df['is_duplicate'].sum()}")

print("\nDataFrame head with new 'is_duplicate' column:")
print(df.head())

Total number of rows flagged as 'is_duplicate': 0

DataFrame head with new 'is_duplicate' column:
    product_id                              product_name         brand  \
0  5.28102E+12          beesline elder rose deo r.on 50m  Beesline Deo   
1  5.28102E+12  beesline green forst whit.r.on deod.50m#  Beesline Deo   
2  5.28102E+12    beesline pacific islands deo r.on 50m#  Beesline Deo   
3  5.28102E+12         beesline cool breeze deo r.on 50m  Beesline Deo   
4  5.28102E+12        beesline hr delaying deo r.on 50m#  Beesline Deo   

              division         category      subcategory         segment  \
0  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
1  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
2  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
3  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
4  Everyday Essentials  Body Fresheners  Lady Fresheners  Lady Deodorant   
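`df.duplicated()` with no arguments only catches rows that match on every column. A product dimension is usually expected to have one row per key, so a key-based check may also be informative. The sketch below uses invented rows and assumes `product_id` is the intended key, which the source does not confirm:

```python
import pandas as pd

# Invented rows: the first two share a product_id but differ in name.
toy = pd.DataFrame({
    "product_id":   ["100", "100", "200"],
    "product_name": ["soap a", "soap b", "shampoo"],
})

# Full-row comparison finds nothing; comparing on the key alone finds the clash.
full_row_dupes = toy.duplicated(keep=False)
key_dupes = toy.duplicated(subset=["product_id"], keep=False)
print(full_row_dupes.tolist(), key_dupes.tolist())
```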



## Generate Final Output and Report

### Subtask:
Construct the final cleaned DataFrame (`cleaned_df`), generate a comprehensive data quality report (`data_quality_report`), and create a 'Before vs After' comparison summary.


**Reasoning**:
I will create a copy of the current DataFrame `df` and name it `cleaned_df` to represent the final cleaned dataset. Then, I will initialize an empty dictionary `data_quality_report` to store the data quality metrics.



In [17]:
cleaned_df = df.copy()

data_quality_report = {}

print("Created cleaned_df and initialized data_quality_report.")

Created cleaned_df and initialized data_quality_report.


**Reasoning**:
I will populate the `data_quality_report` dictionary with the 'Overall Data Quality Summary' by calculating the total number of rows, the total number of rows flagged by `validation_flag`, and the percentage of flagged rows. Then, I will iterate through all columns with `is_missing_original_` prefix and populate the 'Missing Values (Original)' section with their respective counts and percentages.



In [18]:
total_rows = cleaned_df.shape[0]

# Overall Data Quality Summary
data_quality_report['Overall Data Quality Summary'] = {
    'Total Rows': total_rows,
    'Total Flagged Rows (validation_flag)': cleaned_df['validation_flag'].sum(),
    'Percentage Flagged Rows': (cleaned_df['validation_flag'].sum() / total_rows) * 100
}

# Missing Values (Original)
data_quality_report['Missing Values (Original)'] = {}
original_missing_cols = [col for col in cleaned_df.columns if col.startswith('is_missing_original_')]
for col in original_missing_cols:
    count = cleaned_df[col].sum()
    percentage = (count / total_rows) * 100
    data_quality_report['Missing Values (Original)'][col] = {
        'Count': count,
        'Percentage': percentage
    }

# Invalid Formats
invalid_format_count = cleaned_df['is_invalid_format'].sum()
invalid_format_percentage = (invalid_format_count / total_rows) * 100
data_quality_report['Invalid Formats'] = {
    'Count': invalid_format_count,
    'Percentage': invalid_format_percentage
}

# Corrupted IDs
corrupted_id_count = cleaned_df['is_corrupted_id'].sum()
corrupted_id_percentage = (corrupted_id_count / total_rows) * 100
data_quality_report['Corrupted IDs'] = {
    'Count': corrupted_id_count,
    'Percentage': corrupted_id_percentage
}

# Outliers
outlier_count = cleaned_df['is_outlier'].sum()
outlier_percentage = (outlier_count / total_rows) * 100
data_quality_report['Outliers'] = {
    'Count': outlier_count,
    'Percentage': outlier_percentage
}

# Negative Unit Price
negative_unit_price_count = cleaned_df['is_negative_unit_price'].sum()
negative_unit_price_percentage = (negative_unit_price_count / total_rows) * 100
data_quality_report['Negative Unit Price'] = {
    'Count': negative_unit_price_count,
    'Percentage': negative_unit_price_percentage
}

# Duplicates
duplicate_count = cleaned_df['is_duplicate'].sum()
duplicate_percentage = (duplicate_count / total_rows) * 100
data_quality_report['Duplicates'] = {
    'Count': duplicate_count,
    'Percentage': duplicate_percentage
}

print("Data quality report populated with summaries.")

Data quality report populated with summaries.
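
The cell above repeats the same count-and-percentage calculation for each flag column. As a minimal sketch, that pattern could be folded into one helper. The name `summarize_flag` and the toy frame are hypothetical and only stand in for `cleaned_df`; they are not part of the original notebook.

```python
import pandas as pd

def summarize_flag(frame: pd.DataFrame, flag_col: str) -> dict:
    """Return the count and percentage of rows where a boolean flag is True."""
    total = len(frame)
    count = int(frame[flag_col].sum())
    percentage = (count / total) * 100 if total else 0.0
    return {"Count": count, "Percentage": percentage}

# Toy frame standing in for cleaned_df (values are illustrative only)
toy_df = pd.DataFrame({
    "is_invalid_format": [True, False, False, False],
    "is_duplicate": [False, False, False, False],
})

toy_report = {
    "Invalid Formats": summarize_flag(toy_df, "is_invalid_format"),
    "Duplicates": summarize_flag(toy_df, "is_duplicate"),
}
print(toy_report)
```

Casting the count to `int` also keeps NumPy scalar types out of the report, which helps if the dictionary is later serialized to JSON.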


**Reasoning**:
Now that the data quality report is populated, I will create the 'Before vs After' comparison summary, drawing on the initial data profiling and the results of the cleaning and flagging steps. Finally, I will print both the `data_quality_report` and the `before_after_summary` to fulfill the subtask requirements.



In [19]:
before_after_summary = {}

total_rows = cleaned_df.shape[0]

# Missing Values Comparison
before_missing_brand = missing_data_report.loc['Brand', 'Missing Count']
before_missing_division = missing_data_report.loc['Division', 'Missing Count']
before_missing_category = missing_data_report.loc['Category', 'Missing Count']
before_missing_subcategory = missing_data_report.loc['Subcategory', 'Missing Count']
before_missing_segment = missing_data_report.loc['Segment', 'Missing Count']

before_after_summary['Missing Values (Categorical Columns)'] = {
    'Before Cleaning (Count)': before_missing_brand + before_missing_division + before_missing_category + before_missing_subcategory + before_missing_segment,
    'After Cleaning (Count)': 0 # All categorical missing values were filled
}

before_after_summary['Missing Values (product_id_numeric)'] = {
    'Before Cleaning (Count)': cleaned_df['is_missing_original_product_id_numeric'].sum(),
    'After Cleaning (Count)': 0 # Numerical missing values were filled with median
}

# Invalid product_id format comparison
before_after_summary['Invalid product_id Format'] = {
    'Before Cleaning (Count)': len(invalid_product_ids),
    'After Cleaning (Flagged Count)': cleaned_df['is_invalid_product_id_format'].sum() # Flagged, not removed
}

# Duplicate Rows Comparison
before_after_summary['Duplicate Rows'] = {
    'Before Cleaning (Count)': len(duplicate_rows),
    'After Cleaning (Flagged Count)': cleaned_df['is_duplicate'].sum() # Flagged, not removed
}

# Overall Validation Flag
before_after_summary['Overall Flagged Rows (Validation Flag)'] = {
    'Total Rows': total_rows,
    'Flagged Rows Count': cleaned_df['validation_flag'].sum(),
    'Flagged Rows Percentage': (cleaned_df['validation_flag'].sum() / total_rows) * 100
}

print("\nData Quality Report:")
for section, content in data_quality_report.items():
    print(f"\n{section}:")
    if isinstance(content, dict):
        for key, value in content.items():
            if isinstance(value, dict):
                print(f"  {key}:")
                for sub_key, sub_value in value.items():
                    print(f"    {sub_key}: {sub_value}")
            else:
                print(f"  {key}: {value}")
    else:
        print(f"  {content}")

print("\nBefore vs After Comparison Summary:")
for section, content in before_after_summary.items():
    print(f"\n{section}:")
    for key, value in content.items():
        print(f"  {key}: {value}")


Data Quality Report:

Overall Data Quality Summary:
  Total Rows: 62700
  Total Flagged Rows (validation_flag): 39547
  Percentage Flagged Rows: 63.07336523125997

Missing Values (Original):
  is_missing_original_brand:
    Count: 3105
    Percentage: 4.952153110047846
  is_missing_original_division:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_category:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_subcategory:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_segment:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_product_id_numeric:
    Count: 2
    Percentage: 0.0031897926634768745

Invalid Formats:
  Count: 2
  Percentage: 0.0031897926634768745

Corrupted IDs:
  Count: 39547
  Percentage: 63.07336523125997

Outliers:
  Count: 0
  Percentage: 0.0

Negative Unit Price:
  Count: 0
  Percentage: 0.0

Duplicates:
  Count: 0
  Percentage: 0.0

Before vs After Comparison Summary:

Missing

## Final Task

### Subtask:
Review the generated `cleaned_df`, `data_quality_report`, and the Before vs After comparison summary. This step concludes the data cleaning and preparation process.


## Summary:

### Data Analysis Key Findings

*   The initial dataset contained 62,700 rows and 8 columns. Key columns included `product_id`, `product_name`, `Brand`, `Division`, `Category`, `Subcategory`, `Segment`, and `unit_price`.
*   **Missing Values (Initial State)**:
    *   `Brand`: 3,105 values (4.95%) were missing.
    *   `Division`, `Category`, `Subcategory`, `Segment`: 823 values each (1.31% each) were missing.
    *   `unit_price`: No missing values.
*   **Duplicate Rows**: No perfectly duplicate rows were found in the dataset.
*   **Invalid Formats**:
    *   Two rows had `product_id` values ('V', 'V1') that were non-numeric, indicating an invalid format.
    *   `Brand` column showed inconsistent text formatting (e.g., casing, leading/trailing spaces), with 11,852 original unique values reducing to 11,848 after standardization.
*   **Outliers**: No outliers were detected in the `unit_price` column using the IQR method.
*   **Standardization**:
    *   All column names were converted to snake\_case (e.g., `product_id`, `unit_price`).
    *   String columns were stripped of leading/trailing whitespace.
    *   Categorical columns (`brand`, `division`, `category`, `subcategory`, `segment`) were standardized to Title Case, while `product_name` was converted to lowercase.
*   **Missing Value Handling**:
    *   Missing values in categorical columns were replaced with 'Unknown'.
    *   The 2 missing `product_id_numeric` values were imputed with the column's median (3,012,287,777.0).
    *   Boolean flag columns (e.g., `is_missing_original_brand`, `is_missing_original_product_id_numeric`) were created to indicate original missingness, with a total of 6,397 original missing categorical values and 2 original missing numeric product IDs.
*   **ID Validation and Business Logic**:
    *   **`is_invalid_format`**: Two rows were flagged due to non-numeric `product_id` values.
    *   **`is_corrupted_id`**: A significant 39,547 rows (63.07%) were flagged because their `product_id` was either non-numeric or not a 13-digit integer.
    *   **`is_negative_unit_price`**: No rows had negative unit prices.
    *   **`is_outlier`**: No outliers were detected in `unit_price`.
*   **Combined Validation Flag**: A `validation_flag` identified 39,547 rows (63.07% of the dataset) with at least one critical data quality issue (primarily corrupted IDs or invalid ID format).
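
The flag logic summarized above can be sketched on a toy frame. This is an illustration under stated assumptions, not the notebook's actual implementation: it assumes a valid ID is a whole number with exactly 13 digits and that `validation_flag` is the logical OR of the individual flags. The toy values are invented.

```python
import pandas as pd

# Toy rows: a 13-digit ID, a short ID, and a non-numeric ID (illustrative only)
toy = pd.DataFrame({
    "product_id": ["5281020000000", "12345", "V1"],
    "unit_price": [40.58, 95.32, -1.0],
})

# Non-numeric strings become NaN after coercion
id_numeric = pd.to_numeric(toy["product_id"], errors="coerce")
toy["is_invalid_format"] = id_numeric.isna()

# Assumed rule: a valid ID is a whole number with exactly 13 digits
is_13_digits = id_numeric.between(10**12, 10**13 - 1) & (id_numeric % 1 == 0)
toy["is_corrupted_id"] = ~is_13_digits

toy["is_negative_unit_price"] = toy["unit_price"] < 0

# Assumed combination: a row is flagged if any individual check fails
flag_cols = ["is_invalid_format", "is_corrupted_id", "is_negative_unit_price"]
toy["validation_flag"] = toy[flag_cols].any(axis=1)
print(toy[["product_id"] + flag_cols + ["validation_flag"]])
```

Under these assumptions a non-numeric ID is always also a corrupted ID, which is consistent with the corrupted-ID count matching the overall flagged count in the report.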

### Insights or Next Steps

*   **Address Corrupted IDs**: Given that 63.07% of products have corrupted IDs, immediate action is required to investigate the source of this widespread issue. This could involve data source verification or a more robust ID generation/capture process.
*   **Utilize Flag Columns**: The `cleaned_df` with its comprehensive set of flag columns can be used for downstream analysis, allowing analysts to filter out or specifically analyze problematic data points without losing the original context or values.
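
As a small sketch of how the flag columns might be used downstream, the frame can be split by flag without dropping any rows. The toy frame below is hypothetical and reuses only column names that appear in `cleaned_df`.

```python
import pandas as pd

# Toy stand-in for cleaned_df (values are illustrative only)
toy_cleaned = pd.DataFrame({
    "product_name": ["a", "b", "c"],
    "validation_flag": [False, True, False],
    "is_missing_original_brand": [False, False, True],
})

# Rows that passed every validation check
trusted = toy_cleaned[~toy_cleaned["validation_flag"]]

# Rows whose brand was originally missing, kept aside for separate review
imputed_brand = toy_cleaned[toy_cleaned["is_missing_original_brand"]]

print(len(trusted), len(imputed_brand))
```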


In [23]:
# Save the cleaned_df to a CSV file
cleaned_df.to_csv('cleaned_products_dimension.csv', index=False)
print("cleaned_products_dimension.csv saved successfully.")

cleaned_products_dimension.csv saved successfully.


In [22]:
display(cleaned_df.head())

Unnamed: 0,product_id,product_name,brand,division,category,subcategory,segment,unit_price,product_id_numeric,is_invalid_product_id_format,...,is_missing_original_category,is_missing_original_subcategory,is_missing_original_segment,is_missing_original_product_id_numeric,is_corrupted_id,is_invalid_format,is_negative_unit_price,is_outlier,validation_flag,is_duplicate
0,5281020000000.0,beesline elder rose deo r.on 50m,beesline deo,everyday essentials,body fresheners,lady fresheners,lady deodorant,40.58,5281020000000.0,False,...,False,False,False,False,False,False,False,False,False,False
1,5281020000000.0,beesline green forst whit.r.on deod.50m#,beesline deo,everyday essentials,body fresheners,lady fresheners,lady deodorant,95.32,5281020000000.0,False,...,False,False,False,False,False,False,False,False,False,False
2,5281020000000.0,beesline pacific islands deo r.on 50m#,beesline deo,everyday essentials,body fresheners,lady fresheners,lady deodorant,74.54,5281020000000.0,False,...,False,False,False,False,False,False,False,False,False,False
3,5281020000000.0,beesline cool breeze deo r.on 50m,beesline deo,everyday essentials,body fresheners,lady fresheners,lady deodorant,61.87,5281020000000.0,False,...,False,False,False,False,False,False,False,False,False,False
4,5281020000000.0,beesline hr delaying deo r.on 50m#,beesline deo,everyday essentials,body fresheners,lady fresheners,lady deodorant,19.82,5281020000000.0,False,...,False,False,False,False,False,False,False,False,False,False


### Additional Text Column Refinement

**Reasoning:**
This user-provided step casts the listed text columns to the pandas `string` dtype, trims whitespace, converts empty or placeholder strings (`""`, `"nan"`, `"None"`) to missing values, lowercases the text, and fills the remaining missing values with `"unknown"`. Several of these operations repeat earlier steps; what this snippet adds is the explicit `astype("string")` cast and the placeholder replacement. Note that the lowercasing overrides the Title Case applied to the categorical columns during standardization.

In [20]:
import pandas as pd
import numpy as np

# Target text columns
text_columns = [
    "brand",
    "division",
    "category",
    "subcategory",
    "segment",
    "product_name"
]

# Make sure the columns exist in the DataFrame
existing_cols = [col for col in text_columns if col in df.columns]

for col in existing_cols:

    # Convert to string dtype to avoid dtype issues
    df[col] = df[col].astype("string")

    # Remove leading/trailing whitespace
    df[col] = df[col].str.strip()

    # Replace empty or placeholder strings with NaN
    df[col] = df[col].replace(["", " ", "nan", "None"], np.nan)

    # Standardize text case (optional)
    df[col] = df[col].str.lower()

    # Replace missing values with "unknown"
    df[col] = df[col].fillna("unknown")

print("Cleaning completed successfully ✅")
print("\nDataFrame head after additional text cleaning:")
print(df.head())

Cleaning completed successfully ✅

DataFrame head after additional text cleaning:
    product_id                              product_name         brand  \
0  5.28102E+12          beesline elder rose deo r.on 50m  beesline deo   
1  5.28102E+12  beesline green forst whit.r.on deod.50m#  beesline deo   
2  5.28102E+12    beesline pacific islands deo r.on 50m#  beesline deo   
3  5.28102E+12         beesline cool breeze deo r.on 50m  beesline deo   
4  5.28102E+12        beesline hr delaying deo r.on 50m#  beesline deo   

              division         category      subcategory         segment  \
0  everyday essentials  body fresheners  lady fresheners  lady deodorant   
1  everyday essentials  body fresheners  lady fresheners  lady deodorant   
2  everyday essentials  body fresheners  lady fresheners  lady deodorant   
3  everyday essentials  body fresheners  lady fresheners  lady deodorant   
4  everyday essentials  body fresheners  lady fresheners  lady deodorant   

   unit_price  p
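
The cell above matches placeholders case-sensitively and fills missing values without recording which rows were filled. As a hedged sketch of one alternative, the hypothetical helper `clean_text` below lowercases before matching and leaves the fill to the caller, so that missingness can be flagged first. The toy series is invented for illustration.

```python
import pandas as pd

def clean_text(series: pd.Series, placeholders=("", "nan", "none")) -> pd.Series:
    """Strip, lowercase, and turn placeholder strings into missing values."""
    s = series.astype("string").str.strip().str.lower()
    # mask() replaces matching entries with the dtype's missing value
    return s.mask(s.isin(list(placeholders)))

toy_brand = pd.Series(["  Beesline Deo ", "None", "   ", None, "NaN"])
cleaned = clean_text(toy_brand)

was_missing = cleaned.isna()         # record missingness before filling
filled = cleaned.fillna("unknown")
print(filled.tolist(), was_missing.tolist())
```

Lowercasing before the comparison means variants such as `"NaN"` or `"NONE"` are caught by the same short placeholder list.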

## Generate Final Output and Report (Re-run)

### Subtask:
Re-construct the final `cleaned_df` from the latest state of `df`, then re-generate the comprehensive `data_quality_report` dictionary and the 'Before vs After' comparison summary to include the effects of the recent text cleaning step. This ensures all reports are up-to-date.


**Reasoning**:
I will re-create `cleaned_df` from the current state of `df` to capture all the latest cleaning changes. Then, I will re-initialize the `data_quality_report` and `before_after_summary` dictionaries to ensure they reflect the most up-to-date information, including the effects of the recent text cleaning step. Finally, I will populate these reports with the requested metrics and comparisons, and print them.



In [21]:
cleaned_df = df.copy()

data_quality_report = {}
before_after_summary = {}

total_rows = cleaned_df.shape[0]

# Overall Data Quality Summary
data_quality_report['Overall Data Quality Summary'] = {
    'Total Rows': total_rows,
    'Total Flagged Rows (validation_flag)': cleaned_df['validation_flag'].sum(),
    'Percentage Flagged Rows': (cleaned_df['validation_flag'].sum() / total_rows) * 100
}

# Missing Values (Original)
data_quality_report['Missing Values (Original)'] = {}
original_missing_cols = [col for col in cleaned_df.columns if col.startswith('is_missing_original_')]
for col in original_missing_cols:
    count = cleaned_df[col].sum()
    percentage = (count / total_rows) * 100
    data_quality_report['Missing Values (Original)'][col] = {
        'Count': count,
        'Percentage': percentage
    }

# Invalid Formats
invalid_format_count = cleaned_df['is_invalid_format'].sum()
invalid_format_percentage = (invalid_format_count / total_rows) * 100
data_quality_report['Invalid Formats'] = {
    'Count': invalid_format_count,
    'Percentage': invalid_format_percentage
}

# Corrupted IDs
corrupted_id_count = cleaned_df['is_corrupted_id'].sum()
corrupted_id_percentage = (corrupted_id_count / total_rows) * 100
data_quality_report['Corrupted IDs'] = {
    'Count': corrupted_id_count,
    'Percentage': corrupted_id_percentage
}

# Outliers
outlier_count = cleaned_df['is_outlier'].sum()
outlier_percentage = (outlier_count / total_rows) * 100
data_quality_report['Outliers'] = {
    'Count': outlier_count,
    'Percentage': outlier_percentage
}

# Negative Unit Price
negative_unit_price_count = cleaned_df['is_negative_unit_price'].sum()
negative_unit_price_percentage = (negative_unit_price_count / total_rows) * 100
data_quality_report['Negative Unit Price'] = {
    'Count': negative_unit_price_count,
    'Percentage': negative_unit_price_percentage
}

# Duplicates
duplicate_count = cleaned_df['is_duplicate'].sum()
duplicate_percentage = (duplicate_count / total_rows) * 100
data_quality_report['Duplicates'] = {
    'Count': duplicate_count,
    'Percentage': duplicate_percentage
}

# Before vs After Comparison Summary

# Missing Values (Categorical Columns)
before_missing_brand = missing_data_report.loc['Brand', 'Missing Count']
before_missing_division = missing_data_report.loc['Division', 'Missing Count']
before_missing_category = missing_data_report.loc['Category', 'Missing Count']
before_missing_subcategory = missing_data_report.loc['Subcategory', 'Missing Count']
before_missing_segment = missing_data_report.loc['Segment', 'Missing Count']

before_after_summary['Missing Values (Categorical Columns)'] = {
    'Before Cleaning (Count)': before_missing_brand + before_missing_division + before_missing_category + before_missing_subcategory + before_missing_segment,
    'After Cleaning (Count)': 0 # All categorical missing values were filled
}

before_after_summary['Missing Values (product_id_numeric)'] = {
    'Before Cleaning (Count)': cleaned_df['is_missing_original_product_id_numeric'].sum(),
    'After Cleaning (Count)': 0 # Numerical missing values were filled with median
}

# Invalid product_id format comparison
before_after_summary['Invalid product_id Format'] = {
    'Before Cleaning (Count)': len(invalid_product_ids),
    'After Cleaning (Flagged Count)': cleaned_df['is_invalid_product_id_format'].sum() # Flagged, not removed
}

# Duplicate Rows Comparison
before_after_summary['Duplicate Rows'] = {
    'Before Cleaning (Count)': len(duplicate_rows),
    'After Cleaning (Flagged Count)': cleaned_df['is_duplicate'].sum() # Flagged, not removed
}

# Overall Validation Flag
before_after_summary['Overall Flagged Rows (Validation Flag)'] = {
    'Total Rows': total_rows,
    'Flagged Rows Count': cleaned_df['validation_flag'].sum(),
    'Flagged Rows Percentage': (cleaned_df['validation_flag'].sum() / total_rows) * 100
}

print("\nData Quality Report:")
for section, content in data_quality_report.items():
    print(f"\n{section}:")
    if isinstance(content, dict):
        for key, value in content.items():
            if isinstance(value, dict):
                print(f"  {key}:")
                for sub_key, sub_value in value.items():
                    print(f"    {sub_key}: {sub_value}")
            else:
                print(f"  {key}: {value}")
    else:
        print(f"  {content}")

print("\nBefore vs After Comparison Summary:")
for section, content in before_after_summary.items():
    print(f"\n{section}:")
    for key, value in content.items():
        print(f"  {key}: {value}")


Data Quality Report:

Overall Data Quality Summary:
  Total Rows: 62700
  Total Flagged Rows (validation_flag): 39547
  Percentage Flagged Rows: 63.07336523125997

Missing Values (Original):
  is_missing_original_brand:
    Count: 3105
    Percentage: 4.952153110047846
  is_missing_original_division:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_category:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_subcategory:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_segment:
    Count: 823
    Percentage: 1.3125996810207337
  is_missing_original_product_id_numeric:
    Count: 2
    Percentage: 0.0031897926634768745

Invalid Formats:
  Count: 2
  Percentage: 0.0031897926634768745

Corrupted IDs:
  Count: 39547
  Percentage: 63.07336523125997

Outliers:
  Count: 0
  Percentage: 0.0

Negative Unit Price:
  Count: 0
  Percentage: 0.0

Duplicates:
  Count: 0
  Percentage: 0.0

Before vs After Comparison Summary:

Missing

## Final Task

### Subtask:
Review the generated `cleaned_df`, `data_quality_report`, and the Before vs After comparison summary. This step concludes the data cleaning and preparation process.


## Summary:

### Data Analysis Key Findings

*   **Overall Data Quality:** Out of 62,700 total rows, 39,547 rows (63.07%) were flagged by the `validation_flag`, indicating a significant number of data quality issues or business rule violations.
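*   **Corrupted IDs are Dominant:** The primary contributor to flagged rows is corrupted product IDs, affecting 39,547 rows (63.07% of the dataset). This count is identical to the overall flagged-row count, which suggests that every flagged row carries a corrupted ID and that the other checks flagged no additional rows.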
*   **Missing Values (Original):**
    *   'Brand' had 3,105 original missing values (4.95%).
    *   'Division', 'Category', 'Subcategory', and 'Segment' each had 823 original missing values (1.31%).
    *   'product\_id\_numeric' had 2 original missing values (0.003%).
*   **Missing Value Handling Effectiveness:** All 6,397 original missing values across the five categorical columns (3,105 in `brand` plus 823 in each of the other four) and the 2 missing values in 'product\_id\_numeric' were filled during cleaning, leaving 0 missing values in these columns.
*   **Invalid Formats:** 2 rows (0.003%) were identified and flagged for invalid formats, specifically concerning `product_id`.
*   **No Outliers or Negative Unit Prices:** The analysis found 0 outliers and 0 instances of negative unit prices.
*   **No Duplicate Rows:** The dataset did not contain any duplicate rows, either before or after the cleaning process.
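
For reference, a minimal sketch of an IQR-based outlier check like the one reported above. The 1.5 × IQR fences are the conventional choice and an assumption here, since the exact multiplier used in the notebook is not shown in this section. The toy prices are illustrative.

```python
import pandas as pd

# Toy unit prices with one obviously extreme value (illustrative only)
toy_prices = pd.Series([19.82, 40.58, 61.87, 74.54, 95.32, 900.0])

q1 = toy_prices.quantile(0.25)
q3 = toy_prices.quantile(0.75)
iqr = q3 - q1

# Assumed fences: 1.5 * IQR below Q1 and above Q3
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (toy_prices < lower) | (toy_prices > upper)
print(is_outlier.tolist())
```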

### Insights or Next Steps

*   **Address Corrupted IDs:** The high percentage of corrupted IDs (63.07%) is a critical issue. Future steps should involve a deep dive into the source of these corrupted IDs and implementing measures to prevent them. It's crucial to determine if these rows are salvageable or if they should be excluded from analysis.
*   **Review Validation Criteria:** Given the high number of flagged rows, it would be beneficial to review the business logic and validation criteria that led to such a high flagging rate. This ensures that the flags accurately reflect actionable data quality problems rather than overly strict or misapplied rules.
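
One way to start reviewing the ID rule is to look at how the numeric IDs are distributed by digit length. The sketch below uses invented toy IDs; if many flagged IDs in the real data shared a plausible shorter length, that might indicate the 13-digit requirement is stricter than the data warrants.

```python
import pandas as pd

# Toy IDs of mixed length, plus one non-numeric value (illustrative only)
toy_ids = pd.Series(["5281020000000", "6221000012", "6221000013", "V1", "12345678"])

id_numeric = pd.to_numeric(toy_ids, errors="coerce")

# Digit count per numeric ID; non-numeric values are dropped first
digit_length = id_numeric.dropna().astype("int64").astype(str).str.len()
length_counts = {int(k): int(v) for k, v in digit_length.value_counts().items()}
print(length_counts)
```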
