# Data Cleaning and Transformation for Merlin Cycles

## 1. Cleaning the FACT_InternetSales Table

In [None]:
-- Cleansed FACT_InternetSales Table
SELECT 
  [ProductKey], 
  [OrderDateKey], 
  [DueDateKey], 
  [ShipDateKey], 
  [CustomerKey], 
  [SalesOrderNumber], 
  [SalesAmount]
FROM 
  [AdventureWorksDW2019].[dbo].[FactInternetSales]
WHERE 
  LEFT(OrderDateKey, 4) >= YEAR(GETDATE()) - 2 -- Keeps orders from the current calendar year and the two preceding years at extraction time.
ORDER BY
  OrderDateKey ASC;

The **FACT_InternetSales** table was cleansed to retain only the fields needed for analysis: `ProductKey`, `OrderDateKey`, `DueDateKey`, `ShipDateKey`, `CustomerKey`, `SalesOrderNumber`, and `SalesAmount`. The `WHERE` clause keeps orders whose order year is no earlier than two years before the current year, so each extraction covers the current calendar year and the two preceding ones. This keeps the dataset current, supports temporal analysis, and limits the volume of data loaded into Power BI.
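The same year cutoff can also be written directly against the integer key, which avoids converting `OrderDateKey` to text for every row. The following is a minimal sketch, assuming `OrderDateKey` is an integer in `YYYYMMDD` form as in the AdventureWorksDW sample; the `@CutoffKey` variable and the sample keys are hypothetical and not part of the original query.

```sql
-- Sketch: compute the cutoff once as an integer key (YYYY0000).
-- Assumes OrderDateKey is an INT such as 20230115; @CutoffKey is an illustrative name.
DECLARE @CutoffKey INT = (YEAR(GETDATE()) - 2) * 10000;

-- Made-up keys relative to the current year, to compare what each form keeps.
SELECT
  s.OrderDateKey,
  CASE WHEN LEFT(s.OrderDateKey, 4) >= YEAR(GETDATE()) - 2 THEN 1 ELSE 0 END AS KeptByLeftFilter,
  CASE WHEN s.OrderDateKey >= @CutoffKey THEN 1 ELSE 0 END AS KeptByIntegerFilter
FROM (VALUES
  ((YEAR(GETDATE()) - 3) * 10000 + 1231),  -- 31 Dec, three years back: excluded
  ((YEAR(GETDATE()) - 2) * 10000 + 101),   -- 1 Jan, two years back: included
  (YEAR(GETDATE()) * 10000 + 615)          -- 15 Jun, current year: included
) AS s (OrderDateKey);
```

Both flag columns should agree on every row. The integer form leaves the column unwrapped, which may let SQL Server use an index on `OrderDateKey` if one exists.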

## 2. Cleaning the DIM_Products Table

In [None]:
-- Cleansed DIM_Products Table
SELECT 
  p.[ProductKey], 
  p.[ProductAlternateKey] AS ProductItemCode, 
  p.[EnglishProductName] AS [Product Name], 
  ps.EnglishProductSubcategoryName AS [Sub Category],
  pc.EnglishProductCategoryName AS [Product Category], 
  p.[Color] AS [Product Color], 
  p.[Size] AS [Product Size], 
  p.[ProductLine] AS [Product Line], 
  p.[ModelName] AS [Product Model Name], 
  p.[EnglishDescription] AS [Product Description]
FROM 
  [AdventureWorksDW2019].[dbo].[DimProduct] AS p
  LEFT JOIN dbo.DimProductSubcategory AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
  LEFT JOIN dbo.DimProductCategory AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
ORDER BY 
  p.[ProductKey] ASC;

The **DIM_Products** table was refined to focus on key attributes: `ProductKey`, `ProductItemCode`, `Product Name`, `Sub Category`, `Product Category`, `Product Color`, `Product Size`, `Product Line`, `Product Model Name`, and `Product Description`. Joining the product data to the subcategory and category tables adds the full product hierarchy to each row. Because both joins are `LEFT JOIN`s, products without a subcategory are kept, with `NULL` in `Sub Category` and `Product Category`. This hierarchy supports product-level analysis of sales performance across categories, subcategories, and product lines.
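Before loading the table, it can help to know how many products fall outside the hierarchy. The query below is a sketch that reuses only the tables and join columns from the query above; the output column names are illustrative.

```sql
-- Sketch: count products that have no subcategory, and therefore no category,
-- after the LEFT JOINs. Output column names are illustrative.
SELECT
  COUNT(*) AS TotalProducts,
  SUM(CASE WHEN ps.ProductSubcategoryKey IS NULL THEN 1 ELSE 0 END) AS ProductsWithoutSubcategory,
  SUM(CASE WHEN pc.ProductCategoryKey IS NULL THEN 1 ELSE 0 END) AS ProductsWithoutCategory
FROM
  [AdventureWorksDW2019].[dbo].[DimProduct] AS p
  LEFT JOIN [AdventureWorksDW2019].[dbo].[DimProductSubcategory] AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
  LEFT JOIN [AdventureWorksDW2019].[dbo].[DimProductCategory] AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey;
```

If either count is non-zero, one option is to wrap the name columns in `ISNULL(..., 'Unassigned')` so that those products appear under an explicit label in Power BI; the label text here is a placeholder, not something the original query uses.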

## 3. Cleaning the DIM_Customers Table

In [None]:
-- Cleansed DIM_Customers Table
SELECT 
  c.customerkey AS CustomerKey, 
  c.firstname AS [First Name], 
  c.lastname AS [Last Name], 
  c.firstname + ' ' + c.lastname AS [Full Name], 
  CASE c.gender WHEN 'M' THEN 'Male' WHEN 'F' THEN 'Female' END AS Gender,
  c.datefirstpurchase AS DateFirstPurchase, 
  g.city AS [Customer City]
FROM 
  [AdventureWorksDW2019].[dbo].[DimCustomer] AS c
  LEFT JOIN dbo.dimgeography AS g ON g.geographykey = c.geographykey 
ORDER BY 
  c.CustomerKey ASC;

The **DIM_Customers** table was built by selecting essential customer information: `CustomerKey`, `First Name`, `Last Name`, `Full Name`, `Gender`, `DateFirstPurchase`, and `Customer City`. The single-letter gender codes are expanded to `Male` and `Female`, and a join with the geography table adds each customer's city. This makes it possible to segment customers by location, gender, and date of first purchase.
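Two expressions in this query quietly return `NULL` in edge cases: the `+` concatenation when either name part is `NULL` (under SQL Server's default settings), and the `CASE` when the gender code is neither `'M'` nor `'F'`. The sketch below uses made-up rows to show both behaviors next to one possible alternative. `CONCAT_WS` requires SQL Server 2017 or later, and the `'Unknown'` label is a hypothetical choice.

```sql
-- Sketch with made-up rows: how the name and gender expressions treat NULLs and unexpected codes.
SELECT
  c.firstname + ' ' + c.lastname AS FullNamePlus,              -- NULL when either part is NULL
  CONCAT_WS(' ', c.firstname, c.lastname) AS FullNameConcatWs, -- skips NULL parts (SQL Server 2017+)
  CASE c.gender WHEN 'M' THEN 'Male' WHEN 'F' THEN 'Female' END AS GenderNoElse,  -- NULL for other codes
  CASE c.gender WHEN 'M' THEN 'Male' WHEN 'F' THEN 'Female' ELSE 'Unknown' END AS GenderWithElse
FROM (VALUES
  ('Ana', 'Lopez', 'F'),
  ('Ben', NULL,    'M'),
  (NULL,  'Chen',  'X')
) AS c (firstname, lastname, gender);
```

Whether to keep the `NULL`s or substitute a label depends on how blanks should appear in the report; either is a reasonable choice as long as it is deliberate.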

## 4. Cleaning the DIM_Date Table

In [None]:
-- Cleansed DIM_Date Table
SELECT 
  [DateKey], 
  [FullDateAlternateKey] AS [Date], 
  [EnglishDayNameOfWeek] AS [Day], 
  [EnglishMonthName] AS [Month], 
  LEFT([EnglishMonthName], 3) AS [MonthShort], 
  [MonthNumberOfYear] AS [MonthNo], 
  [CalendarQuarter] AS [Quarter], 
  [CalendarYear] AS [Year]
FROM 
  [AdventureWorksDW2019].[dbo].[DimDate]
WHERE 
  CalendarYear >= 2019
ORDER BY 
  [DateKey] ASC;

The **DIM_Date** table was tailored to support time-based analysis by including `DateKey`, `Date`, `Day`, `Month`, `MonthShort`, `MonthNo`, `Quarter`, and `Year`, and keeping only dates from 2019 onwards. Unlike the rolling filter on the fact table, this cutoff is a fixed year. `MonthShort` provides a three-letter month label, and `MonthNo` can be used to sort months chronologically rather than alphabetically. Together with the fact and dimension tables above, this date dimension gives Power BI a consistent basis for trend analysis and visualizations.
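Because the fact table and the date dimension are filtered independently, it is worth confirming that every retained order date has a matching date row. The query below is a sketch of such a check, reusing the two filters exactly as written above. An empty result means the date dimension covers all retained orders.

```sql
-- Sketch: list order date keys that pass the fact-table filter but have
-- no matching row in the filtered date dimension.
SELECT DISTINCT
  f.OrderDateKey
FROM
  [AdventureWorksDW2019].[dbo].[FactInternetSales] AS f
  LEFT JOIN [AdventureWorksDW2019].[dbo].[DimDate] AS d
    ON d.DateKey = f.OrderDateKey
   AND d.CalendarYear >= 2019
WHERE
  LEFT(f.OrderDateKey, 4) >= YEAR(GETDATE()) - 2
  AND d.DateKey IS NULL
ORDER BY
  f.OrderDateKey ASC;
```

The same pattern can be repeated for `DueDateKey` and `ShipDateKey` if those keys will also be related to the date table in the Power BI model.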