<a href="https://colab.research.google.com/github/mohamedbahaaeldien92-max/Elezaby-project/blob/main/recoverd_failed_tabels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Analyze and clean product IDs in "products_dimension.csv" by matching them against "order_products_fact.csv", recovering 'product_id' values based on exact or prefix-with-price-tolerance matches, and then categorize and summarize the recovery process. Save the recovered products to "products_dimension_recovered.csv" and failed recoveries to "products_dimension_failed.csv".

## Load and Prepare DataFrames

### Subtask:
Load 'order_products_fact.csv' and 'products_dimension.csv' into pandas DataFrames, ensuring correct data types for 'product_id' and 'unit_price'. Initialize a 'recovered_product_id' column and a recovery status flag in the 'products_dimension' DataFrame.


**Reasoning**:
First, I'll import the pandas library to enable data manipulation. Then, I'll load both CSV files into pandas DataFrames, `fact_df` and `products_df`, specifying the correct data types for 'product_id' and 'unit_price' as requested. After loading, I'll initialize the 'recovered_product_id' and 'is_recovered' columns in the `products_df`.



In [1]:
import pandas as pd

# Load 'order_products_fact.csv' into fact_df
fact_df = pd.read_csv(
    '/content/order_products_fact.csv',
    dtype={'product_id': str, 'unit_price': float}
)

# Load 'products_dimension.csv' into products_df
products_df = pd.read_csv(
    '/content/products_dimension.csv',
    dtype={'product_id': str, 'unit_price': float}
)

# Create 'recovered_product_id' column in products_df
products_df['recovered_product_id'] = products_df['product_id']

# Create 'is_recovered' boolean column in products_df and initialize to False
products_df['is_recovered'] = False

print("DataFrames 'fact_df' and 'products_df' loaded and prepared.")
print("fact_df head:")
print(fact_df.head())
print("\nproducts_df head:")
print(products_df.head())

DataFrames 'fact_df' and 'products_df' loaded and prepared.
fact_df head:
   order_id     product_id  quantity  unit_price   sales
0  O1000000  6251001217308         1       85.79   85.79
1  O1000000        3500958         4       24.94   99.76
2  O1000000        6000400         2       12.55   25.10
3  O1000000         121018         2       77.30  154.60
4  O1000000        2705575         1       58.66   58.66

products_df head:
    product_id                              product_name         Brand  \
0  5.28102E+12          BEESLINE ELDER ROSE DEO R.ON 50M  BEESLINE DEO   
1  5.28102E+12  BEESLINE GREEN FORST WHIT.R.ON DEOD.50M#  BEESLINE DEO   
2  5.28102E+12    BEESLINE PACIFIC ISLANDS DEO R.ON 50M#  BEESLINE DEO   
3  5.28102E+12         BEESLINE COOL BREEZE DEO R.ON 50M  BEESLINE DEO   
4  5.28102E+12        BEESLINE HR DELAYING DEO R.ON 50M#  BEESLINE DEO   

              Division         Category      Subcategory         Segment  \
0  EVERYDAY ESSENTIALS  BODY FRESHENERS  LAD

## Clean Product IDs in Products Dimension

### Subtask:
Clean the 'product_id' column in the `products_df` DataFrame by removing trailing '.0', converting scientific notation to standard numeric strings, and removing all non-numeric characters.


**Reasoning**:
The current subtask requires cleaning the 'product_id' column in the `products_df` DataFrame. This involves defining a function to handle scientific notation, remove trailing '.0', and remove non-numeric characters, then applying it to the column. I will also display the head of the DataFrame and its dtypes to verify the cleaning.



In [2]:
import re

def clean_product_id(product_id_str):
    # Explicitly convert to string
    product_id_str = str(product_id_str)

    # Handle scientific notation: Convert to float, then to int, then to string
    try:
        # Check if it looks like scientific notation, but avoid converting
        # legitimate string IDs that contain 'e' or 'E' but are not scientific numbers
        if re.match(r'^-?\d+(\.\d+)?[eE][+-]?\d+$', product_id_str):
            product_id_str = str(int(float(product_id_str)))
    except ValueError:
        pass # If it's not a valid scientific notation number, keep as is for now

    # Remove trailing '.0'
    if product_id_str.endswith('.0'):
        product_id_str = product_id_str[:-2]

    # Remove all non-digit characters
    product_id_str = re.sub(r'\D', '', product_id_str)

    return product_id_str

# Apply the cleaning function to the 'product_id' column
products_df['product_id'] = products_df['product_id'].apply(clean_product_id)

print("product_id column in products_df has been cleaned.")
print("\nproducts_df head after cleaning:")
print(products_df.head())

print("\nData types of products_df after cleaning:")
print(products_df.dtypes)

product_id column in products_df has been cleaned.

products_df head after cleaning:
      product_id                              product_name         Brand  \
0  5281020000000          BEESLINE ELDER ROSE DEO R.ON 50M  BEESLINE DEO   
1  5281020000000  BEESLINE GREEN FORST WHIT.R.ON DEOD.50M#  BEESLINE DEO   
2  5281020000000    BEESLINE PACIFIC ISLANDS DEO R.ON 50M#  BEESLINE DEO   
3  5281020000000         BEESLINE COOL BREEZE DEO R.ON 50M  BEESLINE DEO   
4  5281020000000        BEESLINE HR DELAYING DEO R.ON 50M#  BEESLINE DEO   

              Division         Category      Subcategory         Segment  \
0  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
1  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
2  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
3  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
4  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   

 

## Create Canonical Reference from Fact Table

### Subtask:
From 'order_products_fact', create a reference DataFrame containing unique combinations of 'product_id' and 'unit_price'. This will be used as the authoritative source for lookup and validation. Also, identify potential initial 'damaged' product_ids in products_dimension for summary.


**Reasoning**:
First, I'll create the `canonical_products_df` from `fact_df` by selecting the 'product_id' and 'unit_price' columns and dropping duplicates. Then, I'll apply the previously defined `clean_product_id` function to its 'product_id' column. Finally, I will identify and count 'damaged' product IDs by comparing the original and cleaned 'product_id' values in `products_df`.



In [3]:
import numpy as np

# Create canonical_products_df from fact_df
canonical_products_df = fact_df[['product_id', 'unit_price']].drop_duplicates().copy()

# Apply the cleaning function to the 'product_id' column of canonical_products_df
canonical_products_df['product_id'] = canonical_products_df['product_id'].apply(clean_product_id)

# Identify 'damaged' product_ids in products_df based on the initial cleaning step
# We compare the 'recovered_product_id' (original value) with the 'product_id' (cleaned value)
initial_damaged_product_ids = products_df[products_df['recovered_product_id'] != products_df['product_id']].shape[0]

print("Canonical Products DataFrame created and cleaned.")
print("canonical_products_df head:")
print(canonical_products_df.head())

print(f"\nNumber of initial 'damaged' product_ids in products_df: {initial_damaged_product_ids}")


Canonical Products DataFrame created and cleaned.
canonical_products_df head:
      product_id  unit_price
0  6251001217308       85.79
1        3500958       24.94
2        6000400       12.55
3         121018       77.30
4        2705575       58.66

Number of initial 'damaged' product_ids in products_df: 26084


## Attempt Exact Match Recovery

### Subtask:
Iterate through 'products_dimension'. For each row, attempt to find an exact match for its *cleaned* 'product_id' and 'unit_price' in the canonical reference created from 'order_products_fact'. If an exact match is found, update 'recovered_product_id' and mark the row as successfully recovered.


**Reasoning**:
I will perform an inner merge between `products_df` and `canonical_products_df` on cleaned 'product_id' and 'unit_price' to find exact matches. Then, I'll update the `is_recovered` and `recovered_product_id` columns in `products_df` for the matched rows and print the results.



In [4]:
print(f"Initial number of unrecovered products: {products_df[~products_df['is_recovered']].shape[0]}")

# Create a temporary DataFrame of exact matches
merged_exact_matches = pd.merge(
    products_df[~products_df['is_recovered']], # Only consider unrecovered products for matching
    canonical_products_df,
    on=['product_id', 'unit_price'],
    how='inner'
)

# Get the indices of the products_df rows that were successfully matched
exact_match_indices = merged_exact_matches.index

# Update 'is_recovered' for these rows in products_df
products_df.loc[exact_match_indices, 'is_recovered'] = True

# Update 'recovered_product_id' for these rows in products_df
# The product_id from products_df is already cleaned and is the one that matched
products_df.loc[exact_match_indices, 'recovered_product_id'] = products_df.loc[exact_match_indices, 'product_id']

num_exact_matches = len(exact_match_indices)
print(f"Number of products recovered by exact match: {num_exact_matches}")
print(f"Total recovered products after exact match: {products_df['is_recovered'].sum()}")
print("\nproducts_df head after exact match recovery:")
print(products_df.head())

Initial number of unrecovered products: 62700
Number of products recovered by exact match: 36618
Total recovered products after exact match: 36618

products_df head after exact match recovery:
      product_id                              product_name         Brand  \
0  5281020000000          BEESLINE ELDER ROSE DEO R.ON 50M  BEESLINE DEO   
1  5281020000000  BEESLINE GREEN FORST WHIT.R.ON DEOD.50M#  BEESLINE DEO   
2  5281020000000    BEESLINE PACIFIC ISLANDS DEO R.ON 50M#  BEESLINE DEO   
3  5281020000000         BEESLINE COOL BREEZE DEO R.ON 50M  BEESLINE DEO   
4  5281020000000        BEESLINE HR DELAYING DEO R.ON 50M#  BEESLINE DEO   

              Division         Category      Subcategory         Segment  \
0  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
1  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
2  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
3  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY 

## Attempt Prefix Match with Price Tolerance Recovery

### Subtask:
For rows in 'products_dimension' that were not successfully recovered by exact match: Extract the first 6 digits (prefix) from their *cleaned* 'product_id'. Search the canonical reference for product_ids that share this prefix AND have a 'unit_price' within a \u00b10.01 tolerance of the 'unit_price' in the current 'products_dimension' row. If exactly one such match is found, update 'recovered_product_id' and mark the row as recovered. If zero or more than one match, mark the row as unrecoverable.


**Reasoning**:
I will iterate through unrecovered products, extract a product ID prefix, and then search the canonical reference for products within a price tolerance. If a unique match is found, I will update the product information in `products_df` and track the recovery count.



In [5]:
print(f"Initial number of unrecovered products before prefix matching: {products_df[~products_df['is_recovered']].shape[0]}")

# Get indices of currently unrecovered products
unrecovered_indices = products_df[~products_df['is_recovered']].index

num_prefix_matches_recovered = 0

# Iterate through unrecovered products
for idx in unrecovered_indices:
    current_product_id = products_df.loc[idx, 'product_id']
    current_unit_price = products_df.loc[idx, 'unit_price']

    # Ensure product_id is long enough for a 6-digit prefix
    if len(current_product_id) < 6:
        continue

    prefix = current_product_id[:6]

    # Define price tolerance bounds
    lower_bound = current_unit_price - 0.01
    upper_bound = current_unit_price + 0.01

    # Filter canonical_products_df for matches based on prefix and price tolerance
    potential_matches = canonical_products_df[
        canonical_products_df['product_id'].str.startswith(prefix) &
        (canonical_products_df['unit_price'] >= lower_bound) &
        (canonical_products_df['unit_price'] <= upper_bound)
    ]

    # If exactly one match is found, update products_df
    if len(potential_matches) == 1:
        matched_product_id = potential_matches['product_id'].iloc[0]
        products_df.loc[idx, 'recovered_product_id'] = matched_product_id
        products_df.loc[idx, 'is_recovered'] = True
        num_prefix_matches_recovered += 1

print(f"Number of products recovered by prefix match with price tolerance: {num_prefix_matches_recovered}")
print(f"Total recovered products after prefix match: {products_df['is_recovered'].sum()}")
print("\nproducts_df head after prefix match recovery:")
print(products_df.head())

Initial number of unrecovered products before prefix matching: 26082
Number of products recovered by prefix match with price tolerance: 19308
Total recovered products after prefix match: 55926

products_df head after prefix match recovery:
      product_id                              product_name         Brand  \
0  5281020000000          BEESLINE ELDER ROSE DEO R.ON 50M  BEESLINE DEO   
1  5281020000000  BEESLINE GREEN FORST WHIT.R.ON DEOD.50M#  BEESLINE DEO   
2  5281020000000    BEESLINE PACIFIC ISLANDS DEO R.ON 50M#  BEESLINE DEO   
3  5281020000000         BEESLINE COOL BREEZE DEO R.ON 50M  BEESLINE DEO   
4  5281020000000        BEESLINE HR DELAYING DEO R.ON 50M#  BEESLINE DEO   

              Division         Category      Subcategory         Segment  \
0  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
1  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
2  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   

## Categorize Products and Generate Summary

### Subtask:
Separate the 'products_dimension' DataFrame into two new DataFrames: one containing products with successfully recovered 'product_id's ('products_dimension_recovered') and another for those that failed recovery ('products_dimension_failed'). Calculate and print summary statistics including: total products in dimension, count of values that needed recovery (initially 'damaged'), number of successfully recovered values, number of values that failed recovery, and the overall recovery success rate percentage.


**Reasoning**:
I will create two new DataFrames, `products_dimension_recovered` and `products_dimension_failed`, by filtering the `products_df` based on the `is_recovered` column. Then, I will calculate and print the requested summary statistics to provide an overview of the recovery process.



In [6]:
products_dimension_recovered = products_df[products_df['is_recovered']].copy()
products_dimension_failed = products_df[~products_df['is_recovered']].copy()

total_products_in_dimension = len(products_df)
num_successfully_recovered = products_df['is_recovered'].sum()
num_failed_recovery = total_products_in_dimension - num_successfully_recovered

# Ensure initial_damaged_product_ids is correctly interpreted as needing recovery
# It was calculated in a previous step as rows where original product_id != cleaned product_id
# which signifies products that needed some form of cleaning/recovery.
# The variable `initial_damaged_product_ids` holds this count.

recovery_success_rate = (num_successfully_recovered / total_products_in_dimension) * 100 if total_products_in_dimension > 0 else 0

print("--- Product ID Recovery Summary ---")
print(f"Total products in products_dimension: {total_products_in_dimension}")
print(f"Number of values that initially needed recovery (damaged): {initial_damaged_product_ids}")
print(f"Number of successfully recovered product IDs: {num_successfully_recovered}")
print(f"Number of product IDs that failed recovery: {num_failed_recovery}")
print(f"Overall recovery success rate: {recovery_success_rate:.2f}%")

print("\nproducts_dimension_recovered head:")
print(products_dimension_recovered.head())
print("\nproducts_dimension_failed head:")
print(products_dimension_failed.head())

--- Product ID Recovery Summary ---
Total products in products_dimension: 62700
Number of values that initially needed recovery (damaged): 26084
Number of successfully recovered product IDs: 55926
Number of product IDs that failed recovery: 6774
Overall recovery success rate: 89.20%

products_dimension_recovered head:
      product_id                              product_name         Brand  \
0  5281020000000          BEESLINE ELDER ROSE DEO R.ON 50M  BEESLINE DEO   
1  5281020000000  BEESLINE GREEN FORST WHIT.R.ON DEOD.50M#  BEESLINE DEO   
2  5281020000000    BEESLINE PACIFIC ISLANDS DEO R.ON 50M#  BEESLINE DEO   
3  5281020000000         BEESLINE COOL BREEZE DEO R.ON 50M  BEESLINE DEO   
4  5281020000000        BEESLINE HR DELAYING DEO R.ON 50M#  BEESLINE DEO   

              Division         Category      Subcategory         Segment  \
0  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORANT   
1  EVERYDAY ESSENTIALS  BODY FRESHENERS  LADY FRESHENERS  LADY DEODORAN

**Reasoning**:
I will save the `products_dimension_recovered` DataFrame to a CSV file named 'products_dimension_recovered.csv' and the `products_dimension_failed` DataFrame to 'products_dimension_failed.csv' as required by the overall task.



In [7]:
products_dimension_recovered.to_csv('products_dimension_recovered.csv', index=False)
products_dimension_failed.to_csv('products_dimension_failed.csv', index=False)

print("'products_dimension_recovered.csv' and 'products_dimension_failed.csv' have been saved.")

'products_dimension_recovered.csv' and 'products_dimension_failed.csv' have been saved.


## Final Task

### Subtask:
Provide a summary of the data quality process, the number of product IDs successfully recovered, and highlight any remaining unrecovered entries, along with insights into potential further investigation if needed.


## Summary:

### Q&A
*   **Summary of the data quality process:** The data quality process involved several key steps:
    1.  **Data Loading and Preparation:** `products_dimension.csv` and `order_products_fact.csv` were loaded, with `product_id` and `unit_price` columns cast to appropriate types. A `recovered_product_id` column and an `is_recovered` flag were initialized in the `products_df`.
    2.  **Product ID Cleaning:** The `product_id` column in `products_df` was cleaned by removing trailing '.0', converting scientific notation to standard numeric strings, and eliminating all non-numeric characters. This process identified 26,084 product IDs that initially needed cleaning or recovery.
    3.  **Canonical Reference Creation:** A canonical reference DataFrame was created from `order_products_fact.csv` containing unique, cleaned `product_id` and `unit_price` combinations to serve as the authoritative source for lookups.
    4.  **Exact Match Recovery:** An initial recovery attempt was made by finding exact matches for cleaned `product_id` and `unit_price` in the canonical reference.
    5.  **Prefix Match with Price Tolerance Recovery:** For products not recovered by an exact match, a second recovery attempt was made. This involved using the first 6 digits of the cleaned `product_id` as a prefix and searching the canonical reference for product IDs with this prefix, provided their `unit_price` was within a \$0.01 tolerance.
    6.  **Categorization and Summary:** Finally, products were categorized into successfully recovered and failed recovery groups, and comprehensive statistics were generated.
*   **Number of product IDs successfully recovered:** A total of 55,926 product IDs were successfully recovered.
*   **Remaining unrecovered entries:** 6,774 product IDs remained unrecovered after the process.

### Data Analysis Key Findings
*   The `products_dimension` dataset initially contained 62,700 entries.
*   Out of these, 26,084 entries (approximately 41.6% of the total) had `product_id` values that required initial cleaning due to formatting inconsistencies like scientific notation, trailing decimal zeros, or non-numeric characters.
*   The exact matching recovery strategy successfully identified and recovered 36,618 product IDs.
*   An additional 19,308 product IDs were recovered using the prefix matching with price tolerance strategy.
*   Overall, 55,926 product IDs were successfully recovered, leading to a high recovery success rate of 89.20%.
*   Despite the extensive recovery efforts, 6,774 product IDs remained unrecovered, indicating persistent data quality issues for these specific entries.

### Insights or Next Steps
*   Investigate the 6,774 unrecovered product IDs. This could involve manual review, analyzing common patterns or errors in their original `product_id` and `unit_price` values, or exploring alternative fuzzy matching techniques (e.g., Levenshtein distance) if initial prefixes or prices are too dissimilar.
*   Review the initial data entry or generation processes for `products_dimension.csv` to identify root causes of the `product_id` inconsistencies and prevent similar issues in the future.
