# Data Cleaning and Preparation for PostgreSQL Import

Before importing our datasets into PostgreSQL, we need to make sure the data is clean, correctly structured, and compatible with PostgreSQL's requirements. The cleaning process below targets data integrity, consistency, and compatibility, all of which matter for efficient data handling and analysis in PostgreSQL.

## Project Overview:
To achieve our goals, we will follow a structured data-cleaning process detailed in the steps below, and work from there to address any issues that arise:

1. **Load Data:** Import the dataset into the notebook to begin cleaning.
2. **Initial Inspection:** Perform an initial inspection to understand the structure and content of the data.
3. **Check Data Types:** Ensure each column has the appropriate data type matching PostgreSQL's data types.
4. **Identify Problematic Entries:** Identify any entries that may cause issues in analysis or importing.
5. **Replace Invalid Entries:** Correct or remove any invalid entries identified in the previous step.
6. **Detect and Handle Duplicates:** Identify and handle duplicate records in the dataset.
7. **Detect and Handle Missing Values:** Address missing values appropriately based on the context.
8. **Detect and Handle Outliers:** Identify and manage outliers that could skew analysis.
9. **Data Consistency:** Ensure consistency in data entries, such as standardizing units or formats.
10. **Final Cleaning:** Perform any final cleaning steps to ensure the dataset is ready.
11. **Log Issues:** Log any issues encountered during the cleaning process for future reference.
12. **Save Cleaned Data:** Save the cleaned dataset to a file.
13. **Processing and Saving Cleaned CSV Files:** Process and save the cleaned data into CSV files ready for import.


## Key Requirements for Data Cleaning:
### Data Integrity and Consistency:
- Ensure all data is accurate, complete, and consistent. This includes checking for missing values and outliers, and correcting any erroneous or inconsistent entries.

### Data Types:
- Ensure that each column in your dataset has an appropriate data type that matches PostgreSQL's data types. Common types include integer, numeric, text, date, timestamp, etc. Make sure dates and timestamps are formatted correctly.
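
As a rough illustration of matching pandas dtypes to PostgreSQL types, the sketch below suggests a candidate type for each column. The function name `suggest_postgres_type` and the mapping itself are assumptions made for this example; the real choice (for instance `integer` versus `bigint`, or `date` versus `timestamp`) depends on the table definitions.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_postgres_type(series: pd.Series) -> str:
    """Return a candidate PostgreSQL type name for a pandas Series (illustrative mapping)."""
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_integer_dtype(series):
        return "bigint"
    if ptypes.is_float_dtype(series):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(series):
        return "timestamp"
    # Object/string columns fall back to text
    return "text"


# Made-up rows that mimic the column kinds in this project
example_df = pd.DataFrame({
    "sku": ["AN201-RED-L", "AN201-RED-M"],
    "stock": [5.0, 3.0],
    "qty": [1, 2],
    "date": pd.to_datetime(["2022-04-30", "2022-05-01"]),
})
suggested = {col: suggest_postgres_type(example_df[col]) for col in example_df.columns}
print(suggested)
```

A column that comes back as `text` when a number was expected is a hint that it still contains non-numeric entries.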

### Null Values:
- Handle null values appropriately. Decide whether to fill in missing values, discard rows with nulls (if applicable), or set default values. In PostgreSQL, NULL is distinct from an empty string and from 0, so make sure missing values reach the database as NULL rather than as placeholder text.
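
In CSV format, PostgreSQL's COPY reads an unquoted empty field as NULL by default. The minimal sketch below, using made-up data, shows that pandas writes NaN as exactly such an empty field, so missing values left as NaN should arrive as NULL without extra handling.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows: the second SKU has no stock value
example_df = pd.DataFrame({"sku": ["A1", "A2"], "stock": [5.0, np.nan]})

buffer = io.StringIO()
example_df.to_csv(buffer, index=False)  # na_rep defaults to '', so NaN becomes an empty field
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The last line of the output ends with a bare comma, which is the empty field COPY interprets as NULL.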

### Data Cleaning Tools and Techniques:
- Use appropriate tools and techniques to clean and transform the data as per PostgreSQL's requirements. This notebook uses pandas and NumPy.

### File Format and Size:
- Ensure your dataset is in a compatible file format for import into PostgreSQL, such as CSV, JSON, or SQL dump files. Pay attention to file size limitations and consider breaking large datasets into manageable chunks if needed.
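
One way to work through a large file in manageable pieces is the `chunksize` argument of `pd.read_csv`. The sketch below uses an in-memory stand-in for a large file and a chunk size of 4, both chosen only for illustration.

```python
import io

import pandas as pd

# Stand-in for a large raw file; in practice this would be a path on disk
raw_csv = io.StringIO("sku,qty\n" + "\n".join(f"SKU{i},{i}" for i in range(10)))

chunk_sizes = []
for chunk in pd.read_csv(raw_csv, chunksize=4):
    # Each chunk is an ordinary DataFrame that can be cleaned and written out separately
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [4, 4, 2]
```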

### Import Method:
- Choose the appropriate method for importing data into PostgreSQL: the COPY command (or psql's \copy for client-side files) for CSV data, psql for plain SQL dumps, pg_restore for archive-format dumps created by pg_dump, or pgAdmin for interactive imports. Each method has its own requirements and optimizations.
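
For CSV files, a COPY-based import might look like the sketch below. It assumes a psycopg2-style cursor (whose `copy_expert` method streams a file object into a COPY statement) and a table named `amazon_products`. Neither the driver nor the table definition is shown in this notebook, so treat both as assumptions. Identifiers are inserted unquoted, which is only safe for trusted, lowercase names.

```python
def build_copy_statement(table_name, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    column_list = ", ".join(columns)
    return f"COPY {table_name} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def copy_csv(cursor, csv_path, table_name, columns):
    """Stream a CSV file into a table through a psycopg2-style cursor (assumed driver)."""
    statement = build_copy_statement(table_name, columns)
    with open(csv_path, "r", encoding="utf-8") as handle:
        cursor.copy_expert(statement, handle)


# Hypothetical table and column names, used for illustration only
statement = build_copy_statement("amazon_products", ["sku", "style", "category", "size", "asin"])
print(statement)
```

Listing the columns explicitly means the import does not depend on the table's column order matching the CSV header.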

## Step 1: Load Data

In [10]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Define the base path to the CSV files
base_path = 'C:/Users/matth/ecommerce_mba_project/data/raw'
cleaned_path = 'C:/Users/matth/ecommerce_mba_project/data/cleaned'
log_path = 'C:/Users/matth/ecommerce_mba_project/data/logs'

# Ensure the cleaned path and log path exist
os.makedirs(cleaned_path, exist_ok=True)
os.makedirs(log_path, exist_ok=True)

# Define the paths to the CSV files
csv_files = {
    'sale_report': os.path.join(base_path, 'Sale Report.csv'),
    'p_and_l_march_2021': os.path.join(base_path, 'P  L March 2021.csv'),
    'amazon_sale_report': os.path.join(base_path, 'Amazon Sale Report.csv'),
    'may_2022': os.path.join(base_path, 'May-2022.csv')
}

# Load CSV files into DataFrames
sale_report_df = pd.read_csv(csv_files['sale_report'])
p_and_l_march_2021_df = pd.read_csv(csv_files['p_and_l_march_2021'])
amazon_sale_report_df = pd.read_csv(csv_files['amazon_sale_report'], low_memory=False)
may_2022_df = pd.read_csv(csv_files['may_2022'])

# Display initial rows
print("Sale Report DataFrame:")
print(sale_report_df.head())

print("\nP & L March 2021 DataFrame:")
print(p_and_l_march_2021_df.head())

print("\nAmazon Sale Report DataFrame:")
print(amazon_sale_report_df.head())

print("\nMay 2022 DataFrame:")
print(may_2022_df.head())

Sale Report DataFrame:
   index       SKU Code Design No.  Stock       Category Size Color
0      0    AN201-RED-L      AN201    5.0  AN : LEGGINGS    L   Red
1      1    AN201-RED-M      AN201    5.0  AN : LEGGINGS    M   Red
2      2    AN201-RED-S      AN201    3.0  AN : LEGGINGS    S   Red
3      3   AN201-RED-XL      AN201    6.0  AN : LEGGINGS   XL   Red
4      4  AN201-RED-XXL      AN201    3.0  AN : LEGGINGS  XXL   Red

P & L March 2021 DataFrame:
   index             Sku    Style Id  Catalog Category Weight TP 1    TP 2  \
0      0    Os206_3141_S  Os206_3141  Moments    Kurta    0.3  538  435.78   
1      1    Os206_3141_M  Os206_3141  Moments    Kurta    0.3  538  435.78   
2      2    Os206_3141_L  Os206_3141  Moments    Kurta    0.3  538  435.78   
3      3   Os206_3141_XL  Os206_3141  Moments    Kurta    0.3  538  435.78   
4      4  Os206_3141_2XL  Os206_3141  Moments    Kurta    0.3  538  435.78   

  MRP Old Final MRP Old Ajio MRP Amazon MRP Amazon FBA MRP Flipkart MRP

## Step 2: Initial Inspection

In [11]:
# Display summary statistics
print("Sale Report Summary:")
print(sale_report_df.describe(include='all'))

print("\nP & L March 2021 Summary:")
print(p_and_l_march_2021_df.describe(include='all'))

print("\nAmazon Sale Report Summary:")
print(amazon_sale_report_df.describe(include='all'))

print("\nMay 2022 Summary:")
print(may_2022_df.describe(include='all'))

Sale Report Summary:
              index SKU Code Design No.        Stock Category  Size Color
count   9271.000000     9188       9235  9235.000000     9226  9235  9226
unique          NaN     9170       1594          NaN       21    11    62
top             NaN    #REF!      J0096          NaN    KURTA     S  Blue
freq            NaN       15         10          NaN     3726  1353   782
mean    4635.000000      NaN        NaN    26.246454      NaN   NaN   NaN
std     2676.451507      NaN        NaN    58.462891      NaN   NaN   NaN
min        0.000000      NaN        NaN     0.000000      NaN   NaN   NaN
25%     2317.500000      NaN        NaN     3.000000      NaN   NaN   NaN
50%     4635.000000      NaN        NaN     8.000000      NaN   NaN   NaN
75%     6952.500000      NaN        NaN    31.000000      NaN   NaN   NaN
max     9270.000000      NaN        NaN  1234.000000      NaN   NaN   NaN

P & L March 2021 Summary:
              index           Sku Style Id Catalog Category Weig

## Step 3: Identify Problematic Entries

To ensure data quality, we'll inspect the first ten unique values in each column to spot problematic entries. This gives an early view of anomalies, such as placeholder strings or spreadsheet error codes, that could affect our analysis.

In [12]:
# Inspect unique values to identify problematic entries
def inspect_problematic_entries(df):
    for col in df.columns:
        unique_values = df[col].unique()
        print(f"\nUnique values in column '{col}':")
        print(unique_values[:10])  # Display first 10 unique values for inspection

inspect_problematic_entries(sale_report_df)
inspect_problematic_entries(p_and_l_march_2021_df)
inspect_problematic_entries(amazon_sale_report_df)
inspect_problematic_entries(may_2022_df)


Unique values in column 'index':
[0 1 2 3 4 5 6 7 8 9]

Unique values in column 'SKU Code':
['AN201-RED-L' 'AN201-RED-M' 'AN201-RED-S' 'AN201-RED-XL' 'AN201-RED-XXL'
 'AN202-ORANGE-L' 'AN202-ORANGE-M' 'AN202-ORANGE-S' 'AN202-ORANGE-XL'
 'AN202-ORANGE-XXL']

Unique values in column 'Design No.':
['AN201' 'AN202' 'AN203' 'AN204' 'AN205' 'AN206' 'AN207' 'AN209' 'AN210'
 'AN211']

Unique values in column 'Stock':
[ 5.  3.  6. 11. 16.  8. 14.  1.  2. 10.]

Unique values in column 'Category':
['AN : LEGGINGS' 'BLOUSE' 'PANT' 'BOTTOM' 'PALAZZO' 'SHARARA' 'SKIRT'
 'DRESS' 'KURTA SET' 'LEHENGA CHOLI']

Unique values in column 'Size':
['L' 'M' 'S' 'XL' 'XXL' 'FREE' 'XS' 'XXXL' '4XL' '5XL']

Unique values in column 'Color':
['Red' 'Orange' 'Maroon' 'Purple' 'Yellow' 'Green' 'Pink' 'Beige'
 'Navy Blue' 'Black']

Unique values in column 'index':
[0 1 2 3 4 5 6 7 8 9]

Unique values in column 'Sku':
['Os206_3141_S' 'Os206_3141_M' 'Os206_3141_L' 'Os206_3141_XL'
 'Os206_3141_2XL' 'Os206_3141_3XL' 'Os

## Step 4: Replace Invalid Entries

From the inspection of unique values, we identified several invalid entries that need to be addressed: 'Nill', 'nil', 'None', '', '#VALUE!', and 'NaN'. Left as they are, these entries would distort the analysis.

To standardize the handling of missing or invalid entries, we replace them with NaN.

In [13]:
# Replace identified invalid entries with NaN
def replace_invalid_entries(df):
    df.replace(['Nill', 'nil', 'None', '', '#VALUE!', 'NaN'], np.nan, inplace=True)
    return df

sale_report_df = replace_invalid_entries(sale_report_df)
p_and_l_march_2021_df = replace_invalid_entries(p_and_l_march_2021_df)
amazon_sale_report_df = replace_invalid_entries(amazon_sale_report_df)
may_2022_df = replace_invalid_entries(may_2022_df)

# Check missing values
print("Sale Report Missing Values:")
print(sale_report_df.isnull().sum())

print("\nP & L March 2021 Missing Values:")
print(p_and_l_march_2021_df.isnull().sum())

print("\nAmazon Sale Report Missing Values:")
print(amazon_sale_report_df.isnull().sum())

print("\nMay 2022 Missing Values:")
print(may_2022_df.isnull().sum())

Sale Report Missing Values:
index          0
SKU Code      83
Design No.    36
Stock         36
Category      45
Size          36
Color         45
dtype: int64

P & L March 2021 Missing Values:
index              0
Sku                0
Style Id           0
Catalog           73
Category          73
Weight            73
TP 1               6
TP 2               6
MRP Old           37
Final MRP Old     37
Ajio MRP          37
Amazon MRP        37
Amazon FBA MRP    37
Flipkart MRP      37
Limeroad MRP      37
Myntra MRP        31
Paytm MRP         37
Snapdeal MRP      37
dtype: int64

Amazon Sale Report Missing Values:
index                     0
Order ID                  0
Date                      0
Status                    0
Fulfilment                0
Sales Channel             0
ship-service-level        0
Style                     0
SKU                       0
Category                  0
Size                      0
ASIN                      0
Courier Status         6872
Qty            
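
The Step 2 summary lists `#REF!` as the most frequent `SKU Code` value (15 occurrences), and that token is not in the replacement list used above. As a hedged sketch, remaining spreadsheet error tokens could be counted per column as follows. The token set and the helper name `count_error_tokens` are assumptions for illustration, and the example rows are made up.

```python
import pandas as pd

# Common spreadsheet error tokens; extend the set as new ones turn up
SPREADSHEET_ERRORS = {"#REF!", "#VALUE!", "#N/A", "#DIV/0!", "#NAME?", "#NUM!", "#NULL!"}


def count_error_tokens(df):
    """Count spreadsheet error tokens per column, reporting only columns that contain any."""
    counts = {}
    for col in df.columns:
        hits = int(df[col].isin(SPREADSHEET_ERRORS).sum())
        if hits:
            counts[col] = hits
    return counts


example_df = pd.DataFrame({
    "SKU Code": ["AN201-RED-L", "#REF!", "#REF!"],
    "Stock": [5.0, 3.0, 6.0],
})
error_counts = count_error_tokens(example_df)
print(error_counts)
```

Any column reported here is a candidate for adding its token to the list passed to `replace_invalid_entries`.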

## Step 5: Check for Duplicates

In [14]:
# Check for duplicates
def check_duplicates(df, df_name):
    duplicate_count = df.duplicated().sum()
    print(f"\nNumber of duplicate rows in {df_name}: {duplicate_count}")
    df.drop_duplicates(inplace=True)
    return df

sale_report_df = check_duplicates(sale_report_df, 'sale_report')
p_and_l_march_2021_df = check_duplicates(p_and_l_march_2021_df, 'p_and_l_march_2021')
amazon_sale_report_df = check_duplicates(amazon_sale_report_df, 'amazon_sale_report')
may_2022_df = check_duplicates(may_2022_df, 'may_2022')


Number of duplicate rows in sale_report: 0

Number of duplicate rows in p_and_l_march_2021: 0

Number of duplicate rows in amazon_sale_report: 0

Number of duplicate rows in may_2022: 0
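
Each raw file carries a running `index` column, so a whole-row comparison cannot match two rows even when the rest of their content is identical. The zero counts above may partly reflect that. A minimal sketch with made-up rows shows the difference when the `index` column is left out of the comparison:

```python
import pandas as pd

example_df = pd.DataFrame({
    "index": [0, 1, 2],
    "SKU Code": ["AN201-RED-L", "AN201-RED-L", "AN201-RED-M"],
    "Stock": [5.0, 5.0, 3.0],
})

# With the running 'index' column included, no two rows can match
with_index = int(example_df.duplicated().sum())

# Compare only the content columns
content_cols = [col for col in example_df.columns if col != "index"]
without_index = int(example_df.duplicated(subset=content_cols).sum())

print(with_index, without_index)  # 0 1
```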


## Step 6: Check Data Types

Here we check the current data types of the columns that should be numeric, before converting any that are not.

In [15]:
# Check current data types of the columns
def check_data_types(df, columns):
    for col in columns:
        print(f"Data type of column '{col}': {df[col].dtype}")

print("Sale Report Data Types:")
check_data_types(sale_report_df, ['Stock'])

print("\nP & L March 2021 Data Types:")
check_data_types(p_and_l_march_2021_df, ['Weight', 'TP 1', 'TP 2', 'MRP Old', 'Final MRP Old'])

print("\nAmazon Sale Report Data Types:")
check_data_types(amazon_sale_report_df, ['Amount'])

print("\nMay 2022 Data Types:")
check_data_types(may_2022_df, ['Weight', 'TP', 'MRP Old', 'Final MRP Old'])


Sale Report Data Types:
Data type of column 'Stock': float64

P & L March 2021 Data Types:
Data type of column 'Weight': object
Data type of column 'TP 1': object
Data type of column 'TP 2': object
Data type of column 'MRP Old': object
Data type of column 'Final MRP Old': object

Amazon Sale Report Data Types:
Data type of column 'Amount': float64

May 2022 Data Types:
Data type of column 'Weight': object
Data type of column 'TP': object
Data type of column 'MRP Old': object
Data type of column 'Final MRP Old': object


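The weight and price columns in the P & L March 2021 and May 2022 DataFrames are stored as object (string) and need to be converted to numeric types. Stock and Amount are already float64.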

## Step 6 (continued): Convert Data Types

In [16]:
# Convert columns to appropriate data types
def convert_data_types(df, columns):
    for col in columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

sale_report_df = convert_data_types(sale_report_df, ['Stock'])
p_and_l_march_2021_df = convert_data_types(p_and_l_march_2021_df, ['Weight', 'TP 1', 'TP 2', 'MRP Old', 'Final MRP Old'])
amazon_sale_report_df = convert_data_types(amazon_sale_report_df, ['Amount'])
may_2022_df = convert_data_types(may_2022_df, ['Weight', 'TP', 'MRP Old', 'Final MRP Old'])
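
Because `errors='coerce'` silently turns anything unparseable into NaN, it can be useful to see which raw values are lost before converting. The helper below is a sketch for that purpose. The name `find_coerced_values` and the example values are assumptions, not part of the original pipeline.

```python
import pandas as pd


def find_coerced_values(series):
    """Return the non-null raw values that pd.to_numeric would turn into NaN."""
    converted = pd.to_numeric(series, errors="coerce")
    lost = series.notna() & converted.isna()
    return series[lost]


# Made-up values: two parse cleanly, one is already missing, two would be lost
example_weights = pd.Series(["0.3", "0.45", "Nill", None, "1,200"])
lost_values = find_coerced_values(example_weights)
print(lost_values.tolist())
```

Values such as `'1,200'` are lost only because of their formatting, so they may be worth repairing rather than discarding.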

## Step 7: Handle Missing Values

Missing values in the numeric columns converted in the previous step are replaced with 0, so those columns no longer contain NaN values that could disrupt numerical analysis. Missing values in all other columns are left in place.

In [17]:
# Handle missing values
def handle_missing_values(df, columns):
    df.fillna({col: 0 for col in columns}, inplace=True)
    return df

sale_report_df = handle_missing_values(sale_report_df, ['Stock'])
p_and_l_march_2021_df = handle_missing_values(p_and_l_march_2021_df, ['Weight', 'TP 1', 'TP 2', 'MRP Old', 'Final MRP Old'])
amazon_sale_report_df = handle_missing_values(amazon_sale_report_df, ['Amount'])
may_2022_df = handle_missing_values(may_2022_df, ['Weight', 'TP', 'MRP Old', 'Final MRP Old'])

# Check missing values after handling
print("Sale Report Missing Values After Handling:")
print(sale_report_df.isnull().sum())

print("\nP & L March 2021 Missing Values After Handling:")
print(p_and_l_march_2021_df.isnull().sum())

print("\nAmazon Sale Report Missing Values After Handling:")
print(amazon_sale_report_df.isnull().sum())

print("\nMay 2022 Missing Values After Handling:")
print(may_2022_df.isnull().sum())

Sale Report Missing Values After Handling:
index          0
SKU Code      83
Design No.    36
Stock          0
Category      45
Size          36
Color         45
dtype: int64

P & L March 2021 Missing Values After Handling:
index              0
Sku                0
Style Id           0
Catalog           73
Category          73
Weight             0
TP 1               0
TP 2               0
MRP Old            0
Final MRP Old      0
Ajio MRP          37
Amazon MRP        37
Amazon FBA MRP    37
Flipkart MRP      37
Limeroad MRP      37
Myntra MRP        31
Paytm MRP         37
Snapdeal MRP      37
dtype: int64

Amazon Sale Report Missing Values After Handling:
index                     0
Order ID                  0
Date                      0
Status                    0
Fulfilment                0
Sales Channel             0
ship-service-level        0
Style                     0
SKU                       0
Category                  0
Size                      0
ASIN                      

## Step 8: Detect and Handle Outliers

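Outliers can skew the analysis and produce misleading results. We drop every row in which a numeric column lies more than three standard deviations from that column's mean, so that the dataset better represents typical values. Rows with a missing value in the column are kept.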

In [18]:
# Detect and handle outliers
def detect_handle_outliers(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df = df[(np.abs(df[col] - df[col].mean()) <= (3 * df[col].std())) | df[col].isnull()]
    return df

sale_report_df = detect_handle_outliers(sale_report_df)
p_and_l_march_2021_df = detect_handle_outliers(p_and_l_march_2021_df)
amazon_sale_report_df = detect_handle_outliers(amazon_sale_report_df)
may_2022_df = detect_handle_outliers(may_2022_df)
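
To make the three-standard-deviation rule concrete, the sketch below flags outliers in a made-up series instead of dropping them, which makes it easy to review what would be removed. The helper name `flag_outliers` is an assumption for illustration.

```python
import pandas as pd


def flag_outliers(series, n_std=3):
    """Return a boolean mask marking values more than n_std standard deviations from the mean."""
    deviation = (series - series.mean()).abs()
    return deviation > n_std * series.std()


# 20 typical stock values plus one extreme value
stock = pd.Series([10.0] * 20 + [1000.0])
mask = flag_outliers(stock)
flagged = stock[mask].tolist()
print(int(mask.sum()), flagged)
```

Note that the rule needs enough rows to work: in a very small sample a single extreme value inflates the standard deviation so much that it can never exceed the threshold.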


## Step 9: Data Consistency

Here we'll ensure that the 'Date' column in the amazon_sale_report_df DataFrame is consistently formatted as a datetime.

In [19]:
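# Ensure consistency in date formats
def ensure_consistency(df):
    # Convert the 'Date' column to datetime
    if 'Date' in df.columns:
        raw_dates = df['Date']
        # First pass: parse with the known format; non-matching values become NaT
        parsed = pd.to_datetime(raw_dates, format='%m-%d-%y', errors='coerce')
        # errors='coerce' never raises, so a try/except fallback would not trigger;
        # instead, retry only the values that failed with general parsing
        failed = parsed.isna() & raw_dates.notna()
        if failed.any():
            parsed.loc[failed] = pd.to_datetime(raw_dates[failed], errors='coerce')
        df['Date'] = parsed
    return df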

amazon_sale_report_df = ensure_consistency(amazon_sale_report_df)

# Display the first few rows to verify date parsing
print(amazon_sale_report_df.head())

   index             Order ID       Date                        Status  \
0      0  405-8078784-5731545 2022-04-30                     Cancelled   
1      1  171-9198151-1101146 2022-04-30  Shipped - Delivered to Buyer   
2      2  404-0687676-7273146 2022-04-30                       Shipped   
3      3  403-9615377-8133951 2022-04-30                     Cancelled   
4      4  407-1069790-7240320 2022-04-30                       Shipped   

  Fulfilment Sales Channel  ship-service-level    Style              SKU  \
0   Merchant      Amazon.in           Standard   SET389   SET389-KR-NP-S   
1   Merchant      Amazon.in           Standard  JNE3781  JNE3781-KR-XXXL   
2     Amazon      Amazon.in          Expedited  JNE3371    JNE3371-KR-XL   
3   Merchant      Amazon.in           Standard    J0341       J0341-DR-L   
4     Amazon      Amazon.in          Expedited  JNE3671  JNE3671-TU-XXXL   

        Category  ... currency  Amount    ship-city   ship-state  \
0            Set  ...      INR
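
With `errors='coerce'`, values that do not match the `%m-%d-%y` format become NaT rather than raising an error, so a quick count of NaT values after parsing is a cheap sanity check. A minimal sketch with made-up date strings:

```python
import pandas as pd

raw_dates = pd.Series(["04-30-22", "05-01-22", "not a date"])
parsed = pd.to_datetime(raw_dates, format="%m-%d-%y", errors="coerce")

# Values that did not match the format come back as NaT instead of raising
unparsed = raw_dates[parsed.isna()]
print(unparsed.tolist())
```

An empty list here would mean that every date string matched the expected format.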

## Step 10: Implementing Consistent Column Naming

Because the raw column names are formatted inconsistently, we rename them to match the column names used in the create_table.sql script, which prevents mismatches during import:


In [20]:
# Column mappings
sale_report_columns = {
    'SKU Code': 'sku_code',
    'Design No.': 'design_no',
    'Stock': 'stock',
    'Category': 'category',
    'Size': 'size',
    'Color': 'color'
}

p_and_l_march_2021_columns = {
    'Sku': 'sku',
    'Style Id': 'style_id',
    'Catalog': 'catalog',
    'Category': 'category',
    'Weight': 'weight',
    'TP 1': 'tp1',
    'TP 2': 'tp2',
    'MRP Old': 'mrp_old',
    'Final MRP Old': 'final_mrp_old',
    'Ajio MRP': 'ajio_mrp',
    'Amazon MRP': 'amazon_mrp',
    'Amazon FBA MRP': 'amazon_fba_mrp',
    'Flipkart MRP': 'flipkart_mrp',
    'Limeroad MRP': 'limeroad_mrp',
    'Myntra MRP': 'myntra_mrp',
    'Paytm MRP': 'paytm_mrp',
    'Snapdeal MRP': 'snapdeal_mrp'
}

amazon_sale_report_columns = {
    'Order ID': 'order_id',
    'Date': 'date',
    'Status': 'status',
    'Fulfilment': 'fulfillment',
    'Sales Channel ': 'sales_channel',
    'ship-service-level': 'ship_service_level',
    'Style': 'style',
    'SKU': 'sku',
    'Category': 'category',
    'Size': 'size',
    'ASIN': 'asin',
    'Courier Status': 'courier_status',
    'Qty': 'qty',
    'currency': 'currency',
    'Amount': 'amount',
    'ship-city': 'ship_city',
    'ship-state': 'ship_state',
    'ship-postal-code': 'ship_postal_code',
    'ship-country': 'ship_country',
    'promotion-ids': 'promotion_ids',
    'B2B': 'b2b',
    'fulfilled-by': 'fulfilled_by'
}

may_2022_columns = {
    'Sku': 'sku',
    'Style Id': 'style_id',
    'Catalog': 'catalog',
    'Category': 'category',
    'Weight': 'weight',
    'TP': 'tp',
    'MRP Old': 'mrp_old',
    'Final MRP Old': 'final_mrp_old',
    'Ajio MRP': 'ajio_mrp',
    'Amazon MRP': 'amazon_mrp',
    'Amazon FBA MRP': 'amazon_fba_mrp',
    'Flipkart MRP': 'flipkart_mrp',
    'Limeroad MRP': 'limeroad_mrp',
    'Myntra MRP': 'myntra_mrp',
    'Paytm MRP': 'paytm_mrp',
    'Snapdeal MRP': 'snapdeal_mrp'
}
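
`DataFrame.rename` silently skips mapping keys that do not match an existing column, so a typo or a stray space in a key goes unnoticed. The sketch below lists such keys. The helper name `unmatched_mapping_keys` and the example header are made up for illustration; whether the raw Amazon header really carries the trailing space used in the `'Sales Channel '` key is not visible in the output shown in this notebook.

```python
import pandas as pd


def unmatched_mapping_keys(df, column_mapping):
    """Return mapping keys that are not column names in df (rename would skip them silently)."""
    return [key for key in column_mapping if key not in df.columns]


# Made-up header without a trailing space after 'Sales Channel'
example_df = pd.DataFrame(columns=["Order ID", "Sales Channel", "ship-city"])
example_mapping = {
    "Order ID": "order_id",
    "Sales Channel ": "sales_channel",  # trailing space, so it does not match
    "ship-city": "ship_city",
}
unmatched = unmatched_mapping_keys(example_df, example_mapping)
print(unmatched)
```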

## Step 11: Define Helper Functions for Data Cleaning and Manipulation

This cell includes functions that clean the data frame by removing unnecessary columns, standardizing column names, handling duplicates, and verifying SKUs.

In [21]:
# Function to remove the 'index' column if it exists
def remove_index_column(df):
    if 'index' in df.columns:
        df.drop(columns=['index'], inplace=True)
    return df

# Function to remove 'Unnamed' columns if they exist
def remove_unnamed_columns(df):
    unnamed_cols = [col for col in df.columns if col.lower().startswith('unnamed')]
    if unnamed_cols:
        df.drop(columns=unnamed_cols, inplace=True)
    return df

# Function to standardize column names
def standardize_column_names(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')
    return df

# Function to validate and add missing columns
def add_missing_columns(df, required_columns):
    for col in required_columns:
        if col not in df.columns:
            df[col] = np.nan
    return df

# Function to clean and rename columns
def clean_and_rename(df, column_mapping):
    df.rename(columns=column_mapping, inplace=True)
    return df

# Handle duplicates in primary key columns
def handle_duplicates(df, primary_key):
    if primary_key in df.columns:
        duplicates = df[df.duplicated(subset=[primary_key], keep=False)]
        if not duplicates.empty:
            print(f"Found duplicates in {primary_key} column:")
            print(duplicates)
            df = df.drop_duplicates(subset=[primary_key])
    else:
        print(f"{primary_key} column not found in DataFrame.")
    return df

# Ensure all SKUs in sale_report, p_and_l_march_2021, and may_2022 exist in amazon_products
def verify_and_add_skus(reference_df, target_df, key_column):
    # Rows whose SKU is present but not yet listed in the reference table
    missing_skus = target_df[target_df[key_column].notna() & ~target_df[key_column].isin(reference_df['sku'])]
    if not missing_skus.empty:
        new_products = missing_skus[[key_column, 'style', 'category', 'size']].drop_duplicates()
        # Align the key with the reference table's 'sku' column (sale_report uses 'sku_code')
        new_products = new_products.rename(columns={key_column: 'sku'})
        new_products['asin'] = ''  # Add ASIN if available or keep as empty
        reference_df = pd.concat([reference_df, new_products], ignore_index=True)
    return reference_df

## Step 12: Renaming, Standardizing, Column Removal, and Augmentation

In [22]:
# Apply cleaning and renaming to DataFrames
sale_report_df = clean_and_rename(sale_report_df, sale_report_columns)
p_and_l_march_2021_df = clean_and_rename(p_and_l_march_2021_df, p_and_l_march_2021_columns)
amazon_sale_report_df = clean_and_rename(amazon_sale_report_df, amazon_sale_report_columns)
may_2022_df = clean_and_rename(may_2022_df, may_2022_columns)

# Apply standardization and cleaning to DataFrames
sale_report_df = standardize_column_names(remove_unnamed_columns(remove_index_column(sale_report_df)))
p_and_l_march_2021_df = standardize_column_names(remove_unnamed_columns(remove_index_column(p_and_l_march_2021_df)))
amazon_sale_report_df = standardize_column_names(remove_unnamed_columns(remove_index_column(amazon_sale_report_df)))
may_2022_df = standardize_column_names(remove_unnamed_columns(remove_index_column(may_2022_df)))

# Required columns for amazon_products
required_product_columns = ['sku', 'style', 'category', 'size', 'asin']
sale_report_df = add_missing_columns(sale_report_df, required_product_columns)
p_and_l_march_2021_df = add_missing_columns(p_and_l_march_2021_df, required_product_columns)
may_2022_df = add_missing_columns(may_2022_df, required_product_columns)

## Step 13: Split Amazon Sale Report into Separate Dataframes

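The Amazon sale report mixes order, product, and line-item data in a single table. Splitting it into three DataFrames organizes the information the way the database expects and helps preserve data integrity. The resulting tables are: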

- amazon_orders_df: Contains order-specific information, ensuring that each order is unique by order_id.
- amazon_products_df: Contains product-specific information, ensuring each product is unique by sku.
- amazon_order_items_df: Contains detailed information about items in each order, linking orders and products.

In [23]:
# Function to split amazon_sale_report_df into orders, products, and order_items
def split_amazon_sale_report(amazon_sale_report_df):
    required_columns = ['order_id', 'date', 'status', 'fulfillment', 'sales_channel', 'ship_service_level', 'courier_status', 'currency', 'amount', 'ship_city', 'ship_state', 'ship_postal_code', 'ship_country', 'promotion_ids', 'b2b', 'fulfilled_by', 'sku', 'style', 'category', 'size', 'asin', 'qty']
    amazon_sale_report_df = add_missing_columns(amazon_sale_report_df, required_columns)

    amazon_orders_df = amazon_sale_report_df.drop_duplicates(subset=['order_id']).copy()
    amazon_orders_df = amazon_orders_df[['order_id', 'date', 'status', 'fulfillment', 'sales_channel', 'ship_service_level', 'courier_status', 'currency', 'amount', 'ship_city', 'ship_state', 'ship_postal_code', 'ship_country', 'promotion_ids', 'b2b', 'fulfilled_by']]
    
    amazon_products_df = amazon_sale_report_df.drop_duplicates(subset=['sku']).copy()
    amazon_products_df = amazon_products_df[['sku', 'style', 'category', 'size', 'asin']]
    
    amazon_order_items_df = amazon_sale_report_df[['order_id', 'sku', 'style', 'category', 'size', 'asin', 'qty', 'amount']]
    
    return amazon_orders_df, amazon_products_df, amazon_order_items_df

# Split amazon_sale_report_df into orders, products, and order_items
amazon_orders_df, amazon_products_df, amazon_order_items_df = split_amazon_sale_report(amazon_sale_report_df)
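
Since the order items table links orders and products, every `order_id` and `sku` it contains should also appear in the corresponding parent table. The sketch below checks this on made-up data. The helper name `orphan_keys` is an assumption for illustration.

```python
import pandas as pd


def orphan_keys(child_df, parent_df, key):
    """Return the distinct key values in child_df that have no match in parent_df."""
    child_keys = child_df[key].dropna()
    return sorted(set(child_keys) - set(parent_df[key].dropna()))


# Made-up tables: the third order item refers to an unknown order and an unknown SKU
orders = pd.DataFrame({"order_id": ["A-1", "A-2"]})
products = pd.DataFrame({"sku": ["SET389-KR-NP-S"]})
order_items = pd.DataFrame({
    "order_id": ["A-1", "A-2", "A-3"],
    "sku": ["SET389-KR-NP-S", "SET389-KR-NP-S", "J0341-DR-L"],
})

orphan_orders = orphan_keys(order_items, orders, "order_id")
orphan_skus = orphan_keys(order_items, products, "sku")
print(orphan_orders, orphan_skus)
```

Running the same check on the real DataFrames before saving would surface rows that would violate a foreign-key constraint on import.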

## Step 14: Handle Duplicates

In [31]:
# Apply the removal functions
sale_report_df = handle_duplicates(remove_unnamed_columns(remove_index_column(sale_report_df)), 'sku_code')
p_and_l_march_2021_df = handle_duplicates(remove_unnamed_columns(remove_index_column(p_and_l_march_2021_df)), 'sku')
amazon_sale_report_df = handle_duplicates(remove_unnamed_columns(remove_index_column(amazon_sale_report_df)), 'order_id')
may_2022_df = handle_duplicates(remove_unnamed_columns(remove_index_column(may_2022_df)), 'sku')


## Step 15: Verify and Add Missing SKUs

This cell verifies that all SKUs are present and adds any missing SKUs to the Amazon products DataFrame.

In [32]:
# Add missing SKUs to amazon_products_df
amazon_products_df = verify_and_add_skus(amazon_products_df, sale_report_df, 'sku_code')
amazon_products_df = verify_and_add_skus(amazon_products_df, p_and_l_march_2021_df, 'sku')
amazon_products_df = verify_and_add_skus(amazon_products_df, amazon_order_items_df, 'sku')
amazon_products_df = verify_and_add_skus(amazon_products_df, may_2022_df, 'sku')

## Step 16: Final Cleaning

The final clean-up removes rows with null SKUs from the products DataFrame and drops duplicate (order_id, sku) pairs from the order items DataFrame.

In [33]:
# Remove rows with null SKU in amazon_products_df
amazon_products_df = amazon_products_df.dropna(subset=['sku'])

# Ensure no duplicates in amazon_order_items
amazon_order_items_df = amazon_order_items_df.drop_duplicates(subset=['order_id', 'sku'])

print("\nUpdated amazon_products with missing SKUs.")


Updated amazon_products with missing SKUs.


## Step 17: Handling and Updating Missing SKUs

This step ensures that all SKUs from the sale report and the other data sources are included in the amazon_products list, which is crucial for accurate inventory and sales analysis.

In [34]:
# Identify missing SKUs from each table
missing_skus = pd.Series(dtype='object')

for df in [sale_report_df, p_and_l_march_2021_df, may_2022_df]:
    if 'sku_code' in df.columns:
        missing_skus = pd.concat([missing_skus, df[~df['sku_code'].isin(amazon_products_df['sku']) & df['sku_code'].notna()]['sku_code']])
    elif 'sku' in df.columns:
        missing_skus = pd.concat([missing_skus, df[~df['sku'].isin(amazon_products_df['sku']) & df['sku'].notna()]['sku']])

missing_skus = missing_skus.drop_duplicates()

# Add missing SKUs to amazon_products
if len(missing_skus) > 0:
    new_products = pd.DataFrame({
        'sku': missing_skus,
        'style': 'Unknown',
        'category': 'Unknown',
        'size': 'Unknown',
        'asin': 'Unknown'
    })
    amazon_products_df = pd.concat([amazon_products_df, new_products], ignore_index=True)
    amazon_products_df.to_csv(os.path.join(cleaned_path, 'amazon_products_cleaned.csv'), index=False)
    print("Added missing SKUs to amazon_products.")
else:
    print("No missing SKUs found.")

No missing SKUs found.


## Step 18: Log Issues

This step identifies rows that still contain missing values in each DataFrame and writes them to a log file for further inspection and potential correction.

In [35]:
# Log problematic rows
def log_problematic_rows(df, key):
    problematic_rows = df[df.isnull().any(axis=1)]
    if not problematic_rows.empty:
        log_file = os.path.join(log_path, f'{key}_log.csv')
        problematic_rows.to_csv(log_file, index=False)
        print(f"\nLogged problematic rows for {key} to {log_file}")

log_problematic_rows(sale_report_df, 'sale_report')
log_problematic_rows(p_and_l_march_2021_df, 'p_and_l_march_2021')
log_problematic_rows(amazon_sale_report_df, 'amazon_sale_report')
log_problematic_rows(may_2022_df, 'may_2022')



Logged problematic rows for sale_report to C:/Users/matth/ecommerce_mba_project/data/logs\sale_report_log.csv

Logged problematic rows for p_and_l_march_2021 to C:/Users/matth/ecommerce_mba_project/data/logs\p_and_l_march_2021_log.csv

Logged problematic rows for amazon_sale_report to C:/Users/matth/ecommerce_mba_project/data/logs\amazon_sale_report_log.csv

Logged problematic rows for may_2022 to C:/Users/matth/ecommerce_mba_project/data/logs\may_2022_log.csv


## Step 19: Save Cleaned Data

In [36]:

# Save cleaned data to new CSV files
def save_cleaned_data(df, key):
    new_path = os.path.join(cleaned_path, key + '_cleaned.csv')
    df.to_csv(new_path, index=False)
    print(f"\nSaved cleaned data to {new_path}")

save_cleaned_data(amazon_orders_df, 'amazon_orders')
save_cleaned_data(amazon_products_df, 'amazon_products')
save_cleaned_data(amazon_order_items_df, 'amazon_order_items')
save_cleaned_data(sale_report_df, 'sale_report')
save_cleaned_data(p_and_l_march_2021_df, 'p_and_l_march_2021')
save_cleaned_data(may_2022_df, 'may_2022')


Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\amazon_orders_cleaned.csv

Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\amazon_products_cleaned.csv

Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\amazon_order_items_cleaned.csv

Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\sale_report_cleaned.csv

Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\p_and_l_march_2021_cleaned.csv

Saved cleaned data to C:/Users/matth/ecommerce_mba_project/data/cleaned\may_2022_cleaned.csv


## Step 20: Processing and Saving Cleaned CSV Files

This step re-reads each cleaned CSV file, keeps only the expected columns, and saves it again. The result is a consistent set of column names across all files, which is crucial for the import and for downstream analysis.

In [37]:
# List of CSV files and their expected columns
csv_files = {
    'amazon_orders_cleaned.csv': ['order_id', 'date', 'status', 'fulfillment', 'sales_channel', 'ship_service_level', 'courier_status', 'currency', 'amount', 'ship_city', 'ship_state', 'ship_postal_code', 'ship_country', 'promotion_ids', 'b2b', 'fulfilled_by'],
    'amazon_products_cleaned.csv': ['sku', 'style', 'category', 'size', 'asin'],
    'amazon_order_items_cleaned.csv': ['order_id', 'sku', 'style', 'category', 'size', 'asin', 'qty', 'amount'],
    'sale_report_cleaned.csv': ['sku_code', 'design_no', 'stock', 'category', 'size', 'color'],
    'p_and_l_march_2021_cleaned.csv': ['sku', 'style_id', 'catalog', 'category', 'weight', 'tp1', 'tp2', 'mrp_old', 'final_mrp_old', 'ajio_mrp', 'amazon_mrp', 'amazon_fba_mrp', 'flipkart_mrp', 'limeroad_mrp', 'myntra_mrp', 'paytm_mrp', 'snapdeal_mrp'],
    'may_2022_cleaned.csv': ['sku', 'style_id', 'catalog', 'category', 'weight', 'tp', 'mrp_old', 'final_mrp_old', 'ajio_mrp', 'amazon_mrp', 'amazon_fba_mrp', 'flipkart_mrp', 'limeroad_mrp', 'myntra_mrp', 'paytm_mrp', 'snapdeal_mrp']
}

# Reuse cleaned_path from Step 1 rather than overwriting base_path (the raw-data folder)
for file_name, columns in csv_files.items():
    file_path = os.path.join(cleaned_path, file_name)
    try:
        # usecols ignores list order, so reindex to the expected column order
        df = pd.read_csv(file_path, usecols=columns)[columns]
        df.to_csv(file_path, index=False)
        print(f"Cleaned and saved {file_name}")
    except Exception as e:
        print(f"Error processing {file_name}: {e}")

Cleaned and saved amazon_orders_cleaned.csv
Cleaned and saved amazon_products_cleaned.csv
Cleaned and saved amazon_order_items_cleaned.csv
Cleaned and saved sale_report_cleaned.csv
Cleaned and saved p_and_l_march_2021_cleaned.csv
Cleaned and saved may_2022_cleaned.csv
