# Amazon Data Cleaning project


This Jupyter notebook is part of an automation project for data cleaning in Amazon Ads. Managing two distinct ad accounts, each with its unique set of brands, campaign types, and targetings, poses significant challenges in data processing and management. Initially, the data cleaning process was cumbersome and inefficient, primarily conducted in Excel, which proved to be unreliable and time-consuming.

The primary objective of this project is to streamline the data cleaning process, reducing the steps involved to a minimum and significantly cutting down the time required. This notebook demonstrates a sample of the data cleaning procedure applied to various sheets, which previously took 1-2 hours, but can now be completed in a matter of seconds.

Along the way I've created multiple functions that will come in handy in my future projects!

Future enhancements include the integration of Watchdog, an automation tool, to further simplify the process. The goal is to achieve a system where new files are automatically detected and processed, producing the final output without manual intervention.
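
As a rough illustration of the detect-and-process idea, the sketch below polls a folder using only the standard library; it is a stand-in for Watchdog, not Watchdog itself. The function name `find_new_files` and the `seen` set are hypothetical names introduced for this example and are not part of the project.

```python
import os


def find_new_files(directory, seen, extension=".csv"):
    """Return paths of matching files in `directory` that are not yet in `seen`.

    Hypothetical helper: a simple polling stand-in for a Watchdog event handler.
    Every path that is returned is also added to `seen`.
    """
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        is_match = name.lower().endswith(extension) and os.path.isfile(path)
        if is_match and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files
```

Calling this in a loop with a short pause, and passing each new path to the cleaning function, would approximate the planned behavior. Watchdog would replace the polling with file-system events.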

This code, along with its detailed documentation, will be shared on my GitHub account, offering insights into the methods and reasoning behind this automated data cleaning approach.


In [None]:
# Import the libraries used in the data cleaning process

import pandas as pd
import unicodedata  # used to detect and remove invisible characters that accompany Arabic text
import re  # used to extract the version number from the file name
import sys
import os
import concurrent.futures  # used to read the input files in parallel

In [None]:
# Paths to the input files (placeholders)
path_main_file = 'file path'
# Products appear in the report as codes, so a lookup file (product name, code, brand) is imported for better analysis
products_file_path = 'Product file'


def read_csv_file(file_path, columns=None):
    """Read a CSV file into a DataFrame, optionally loading only the given columns."""
    if columns:
        return pd.read_csv(file_path, usecols=columns)
    else:
        return pd.read_csv(file_path)


# Read both files in parallel threads; map() returns the results in input order
with concurrent.futures.ThreadPoolExecutor() as executor:
    SBASIN_main, products_file_main = executor.map(read_csv_file, [path_main_file, products_file_path])


# Convert the dates to datetime so they can be sorted and manipulated reliably.
SBASIN_main['Date'] = pd.to_datetime(SBASIN_main['Date'])

SBASIN = SBASIN_main
# In one report the file names are identical except for a version number in parentheses at the end.
# Extracting that number makes it possible to tell old files from new ones and keep only the new files.
SBASIN['source_file'] = os.path.basename(path_main_file)
SBASIN['source_file_n'] = SBASIN['source_file'].str.extract(r'\((\d+)\)', expand=False)
# Nullable Int64 keeps the conversion from failing when a file name has no version number.
SBASIN['source_file_number'] = pd.to_numeric(SBASIN['source_file_n']).astype('Int64')

products_file = products_file_main
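
The version-number idea described in the comments above can be sketched on a toy table. This is a minimal illustration, assuming several exports have been stacked into one DataFrame and that duplicate rows are identified by `Date` and `Campaign Name`. The helper name `keep_latest_version`, the `Clicks` column, and the sample file names are made up for the example.

```python
import pandas as pd


def keep_latest_version(df, key_columns):
    """Keep, for each key, the row that came from the highest-numbered file."""
    out = df.copy()
    # Pull the digits inside the parentheses, e.g. "report (3).csv" -> "3".
    number = out["source_file"].str.extract(r"\((\d+)\)", expand=False)
    # Assumption: a file name without "(n)" is treated as version 0.
    out["source_file_number"] = pd.to_numeric(number).fillna(0).astype(int)
    # Put the newest version first, then drop the older duplicates of each key.
    out = out.sort_values("source_file_number", ascending=False)
    out = out.drop_duplicates(subset=key_columns, keep="first")
    return out.sort_values(key_columns).reset_index(drop=True)


sample = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "Campaign Name": ["A", "A", "A"],
    "Clicks": [10, 12, 7],
    "source_file": ["report (1).csv", "report (2).csv", "report.csv"],
})
latest = keep_latest_version(sample, ["Date", "Campaign Name"])
```

Here the row for 2024-01-01 from `report (1).csv` is dropped in favor of the one from `report (2).csv`, while the 2024-01-02 row survives because it has no competing version.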


In [None]:
def Sponsored_Brand_ASIN(SBASIN):
    """Clean the Sponsored Brands ASIN report and return a DataFrame ready for analysis."""

    def clean_convert_to_numeric(column):
        """Strip invisible characters and the currency label, then convert to float.

        Arabic is written right to left (RTL), so the sales cells carry invisible
        RTL characters next to the currency label. Removing the label alone left
        behind what looked like a space that could not be deleted, so the RTL
        characters are removed explicitly first.
        """
        def remove_rtl_characters(text):
            """Drop modifier letters ('Lm') and invisible format characters ('Cf')."""
            return ''.join(char for char in text if unicodedata.category(char) not in ('Lm', 'Cf'))

        cleaned_column = column.astype(str).apply(remove_rtl_characters)
        cleaned_column = cleaned_column.str.replace('ج.م.', '', regex=False)
        cleaned_column = cleaned_column.astype(float)
        return cleaned_column

    # Overwrite the sales columns with their cleaned, numeric versions
    SBASIN['14 Day Total Sales'] = clean_convert_to_numeric(SBASIN['14 Day Total Sales'])
    SBASIN['14 Day New-to-brand Sales'] = clean_convert_to_numeric(SBASIN['14 Day New-to-brand Sales'])

    # Sort by date, campaign name, attribution type, then purchased ASIN (all ascending)
    SBASIN.sort_values(by=['Date', 'Campaign Name', 'Attribution type', 'Purchased ASIN'], inplace=True)

    # Merge to get information about each product (Brand - Product name)
    cleaned_file = SBASIN.merge(products_file, how='left', left_on='Purchased ASIN', right_on= 'ASIN')
    
    # Drop columns after merge
    cleaned_file.drop(columns=['ASIN','Product Name'], inplace=True)

    # Rename columns to shorter, clearer names
    cleaned_file = cleaned_file.rename(columns={
        'Brand Description': 'Brand',
        'Category Description': 'Category',
        'Purchased ASIN': 'ASIN',
        'Viewable impressions': 'Vimp',
        '14-day Detail Page Views (DPV)': 'DPV',
        '14 Day Total Orders (#)': 'Total Orders',
        '14 Day Total Units (#)': 'Total Units',
        '14 Day Total Sales': 'Total Sales',
        '14 Day New-to-brand Orders (#)': 'NTB Orders',
        '14 Day New-to-brand Sales': 'NTB Sales',
        '14 Day New-to-brand Units (#)': 'NTB Units',
        '14-Day Total Orders (#) \u2013 (Click)': 'Total Orders(click)',
        '14-Day Total Units (#) \u2013 (Click)': 'Total Units(click)',
        '14-Day Total Sales \u2013 (Click)': 'Total Sales(click)',
        '14-Day New-to-brand Orders (#) \u2013 (Click)': 'NTB Orders(click)',
        '14-Day New-to-brand Sales \u2013 (Click)': 'NTB Sales (click)',
        '14-Day New-to-Brand Units (#) \u2013 (Click)': 'NTB Units (click)',
    })


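    # Replace every remaining missing value with a placeholder label,
    # such as product details for ASINs with no match in the products file
    cleaned_file.fillna("Not Advertised ASIN", inplace=True)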

    # Add the name of the weekday (Monday to Sunday) for each date
    cleaned_file['Day of week'] = cleaned_file['Date'].dt.day_name()

    # Keep only the columns needed for analysis, in this order
    cleaned_file = cleaned_file.reindex(columns=['Date',
    'Day of week',
    'Campaign Name',
    'Attribution type',
    'ASIN',
    'Brand',
    'Product',
    'Category',
    'Total Orders',
    'Total Units',
    'Total Sales',
    'NTB Orders',
    'NTB Sales',
    'NTB Units'])


    return cleaned_file

# Finally, export the result to CSV rather than XLSX, because the difference in time was massive:
# CSV takes at most 2 seconds, while XLSX takes up to 30-40 seconds.

Sponsored_Brand_ASIN(SBASIN).to_csv(r"C:file-destination.csv", index=False)
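
To see why the character filter in the cleaning function matters, here is a small standalone sketch of the same step on made-up values. The sample strings are assumptions about what the report cells might look like (a currency label, a non-breaking space, and invisible right-to-left marks around the number). The function name `clean_currency` is hypothetical, and the comma and whitespace handling is an addition that the function above does not perform.

```python
import unicodedata

import pandas as pd


def clean_currency(column, currency_label="ج.م."):
    """Strip invisible format characters and a currency label, then convert to float."""
    def remove_format_characters(text):
        # Category 'Cf' covers invisible marks such as U+200F (right-to-left mark);
        # 'Lm' covers modifier letters such as the Arabic tatweel (U+0640).
        return "".join(ch for ch in text if unicodedata.category(ch) not in ("Lm", "Cf"))

    cleaned = column.astype(str).apply(remove_format_characters)
    cleaned = cleaned.str.replace(currency_label, "", regex=False)
    # Addition for this sketch: drop thousands separators and surrounding
    # whitespace (including non-breaking spaces) before converting.
    cleaned = cleaned.str.replace(",", "", regex=False).str.strip()
    return cleaned.astype(float)


raw = pd.Series(["\u200fج.م.\u00a01,250.50\u200f", "ج.م. 75", "0"])
values = clean_currency(raw)
```

Without the filter, the invisible marks stay in the string after the label is removed and the conversion to `float` raises an error.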
