# Exploratory Data Analysis using SQL Guide

## Table of Contents

* [Introduction](#Introduction)
* [What is Exploratory Data Analysis?](#What-is-Exploratory-Data-Analysis?)
* [Database Overview](#Database-Overview)
* [SQL Queries for Exploratory Data Analysis](#SQL-Queries-for-Exploratory-Data-Analysis)
* [Data Quality Assessment](#Data-Quality-Assessment)
* [Sales Analysis](#Sales-Analysis)
* [Sales Team Performance](#Sales-Team-Performance)
* [Product Analysis](#Product-Analysis)
* [Customer Analysis](#Customer-Analysis)
* [Geographical Analysis](#Geographical-Analysis)
* [Key Findings](#Key-Findings)
* [Recommendations](#Recommendations)

## Introduction

Exploratory Data Analysis (EDA) is not just a preliminary step; it is the foundation upon which robust data-driven decisions are built. By understanding the intricacies of the data, we can uncover hidden patterns and insights that inform our analysis and guide our strategies.

Let's start by understanding what EDA is and why it's crucial for data science projects. We'll set our objectives for this guide and introduce the AdventureWorks database.

Exploratory Data Analysis (EDA) is a crucial first step in any data science project and, by some estimates, takes up to 70% of a data scientist's time. This guide focuses on performing EDA using SQL, specifically with the AdventureWorks database, to help you uncover insights and prepare for more advanced analyses. By leveraging SQL queries, you'll learn how to efficiently explore large datasets, identify patterns, and extract meaningful information from relational databases.

Throughout this guide, you'll work with various tables from the AdventureWorks database, including sales orders, products, customers, and territories. You'll learn how to investigate database schemas, examine table structures, check data quality, calculate basic statistics, and analyze distributions and trends.

### Objective
The primary goal of this guide is to equip you with the skills and knowledge to:
- Gain confidence in exploring and understanding the structure of relational databases.
- Use SQL queries to explore and summarize data.
- Develop the ability to spot trends, anomalies, and correlations that can drive business decisions.
- Prepare the groundwork for more advanced statistical analyses and modeling.

By the end of this guide, you'll be able to confidently navigate a new database, extract meaningful insights, and set the stage for deeper data analysis tasks.

### What You'll Learn
- How to read and interpret database schemas, the starting point for effective data exploration.
- Techniques for examining table structures and relationships.
- Checks that assess the integrity and reliability of your data before you analyze it.
- Basic statistical methods and aggregations that summarize your data's behavior.
- Approaches to analyze distributions and trends.
- Strategies for identifying top performers and key metrics.

Let's begin our exploration of the AdventureWorks database using SQL!

## What is Exploratory Data Analysis?

EDA is a critical phase in the data science workflow, where the goal is to summarize the main characteristics of the dataset, often using visual methods. This process not only helps in understanding the data but also in identifying potential issues that may affect subsequent analyses.

According to a [GeeksforGeeks article](https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/), exploratory data analysis is one of the basic and essential steps of a data science project, and a data scientist may spend almost 70% of their time on EDA of the dataset.

### Key aspects of EDA include:
1. **Distribution of Data**: Examining the distribution of data points to understand their range, central tendencies (mean, median), and dispersion (variance, standard deviation).

2. **Graphical Representations**: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize relationships within the data and distributions of variables.

3. **Outlier Detection**: Identifying unusual values that deviate from other data points.

4. **Correlation Analysis**: Checking the relationships between variables to understand how they might affect each other.

5. **Handling Missing Values**: Detecting and deciding how to address missing data points.

6. **Summary Statistics**: Calculating key statistics that provide insight into data trends and nuances.

7. **Testing Assumptions**: Verifying data meets the conditions required for further analysis.

### In the context of SQL and relational databases, EDA involves:

1. **Database Schema Investigation**: Examining the structure of the database, including tables, relationships, and key fields.

2. **Data Quality Assessment**: Checking for null values, inconsistencies, or anomalies in the data.

3. **Basic Statistics Calculation**: Computing summary statistics like min, max, average, and standard deviation for numerical fields.

4. **Distribution Analysis**: Understanding the spread of data across different categories or ranges.

5. **Trend Identification**: Analyzing patterns over time or across different dimensions.

6. **Relationship Exploration**: Investigating connections between different variables or tables.

7. **Top Performers Analysis**: Identifying key metrics, best-selling products, or high-value customers.

8. **Segmentation**: Grouping data into meaningful categories for further analysis.

## Database Overview

A comprehensive overview of the database is essential for understanding its structure and the relationships between different tables. This section will provide insights into the size and complexity of the data, which is crucial for planning your analysis.

### Database Schema and Documentation

Before diving into the data, it's important to understand the database schema. This involves examining the structure of the database, including tables, relationships, and key fields. Use the provided schema diagram and documentation to familiarize yourself with the database.

#### Step 1: Investigate the Schema
Start with the schema diagram: [AdventureWorks_Schema_2005.pdf](https://drive.google.com/file/d/1BIbfjvGUHn4Y0Ll3dZNkTfWUnIUptOQF/view?usp=sharing). Use **CTRL + F** or **CMD + F** to search for relevant keywords such as Total, Cost, Date, Order, or ID, and pay attention to the relationships between tables.

![adventureworks_schema_2005](images/adventureworks_schema_2005.png)

#### Step 2: Key Tables Overview
Identify the key tables in the AdventureWorks database, such as:
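- **SalesOrderHeader**: Contains information about sales orders.
- **SalesOrderDetail**: Contains the individual line items (product, quantity, line total) of each sales order.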
- **Product**: Details about products available for sale.
- **Customer**: Information about customers making purchases.
- **SalesTerritory**: Data regarding sales territories.

#### Step 3: Relationships Between Tables
Understanding the relationships between tables is essential for querying the database effectively. For example:
- The **SalesOrderHeader** table is linked to the **Customer** table through the CustomerID.
- The **SalesOrderDetail** table connects to both **SalesOrderHeader** and **Product** tables, allowing for detailed sales analysis.
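As a minimal sketch of how these relationships translate into joins, the query below follows an order from its header to its line items and products. It assumes the BigQuery dataset `adwentureworks_db` and the lowercase table names used in the queries later in this guide; adjust them if your copy differs.

```sql
-- Follow the keys: order header -> order lines -> product
SELECT
  soh.SalesOrderID,
  soh.CustomerID,
  p.Name AS product_name,
  sod.OrderQty,
  sod.LineTotal
FROM `adwentureworks_db.salesorderheader` soh
JOIN `adwentureworks_db.salesorderdetail` sod
  ON soh.SalesOrderID = sod.SalesOrderID   -- one order can have many lines
JOIN `adwentureworks_db.product` p
  ON sod.ProductID = p.ProductID           -- each line references one product
ORDER BY soh.SalesOrderID
LIMIT 10;
```

If an order appears on several rows here, that is expected: each row is one line item, which matters later when summing header-level columns such as `TotalDue` after a join.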

#### Step 4: Documentation Review
Read more about each interesting table and column in the documentation (if available): [AdventureWorks – Data Dictionary](https://dataedo.com/samples/html/AdventureWorks/doc/AdventureWorks_2/home.html)

![adventureworks_documentation](images/adventureworks_documentation.png)

#### Step 5: Investigate Table Schema in BigQuery
After choosing the specific table you’re interested in, investigate the table SCHEMA in BigQuery closely, and pay attention to data types and mode:

![salesorderheader_schema](images/salesorderheader_schema.png)

Then check the table's details:

![salesorderheader_details](images/salesorderheader_details.png)

#### Step 6: Preview Table Data
Closely check the preview of a table:

![salesorderheader_preview](images/salesorderheader_preview.png)

## SQL Queries for Exploratory Data Analysis

Now that you have a solid understanding of the database schema and documentation, we will begin using SQL queries to explore the data further. This section will guide you through various SQL queries that will help you analyze the AdventureWorks database effectively.

#### Step 1: Listing All Tables

To begin your exploration, it's important to know what tables are available in the database. The following query lists all the tables in the `adwentureworks_db` database.

In [None]:
-- Listing all tables
SELECT
 table_name
FROM
 `adwentureworks_db.INFORMATION_SCHEMA.TABLES`
WHERE
 table_type = 'BASE TABLE'
ORDER BY
 table_name;

![listing_all_tables](images/listing_all_tables.png)

This query provides a comprehensive list of all base tables, which serves as a starting point for understanding the data structure and the entities involved in the AdventureWorks database.

#### Step 2: Checking Row Counts for All Tables

Once you have a list of tables, the next step is to understand the size of each table. Knowing the row counts can help you gauge the volume of data you are dealing with, which is crucial for performance considerations and planning your analysis.

In [None]:
-- Checking row counts for all tables
SELECT
 table_id AS table_name,
 row_count
FROM
 `adwentureworks_db.__TABLES__`
ORDER BY
 row_count DESC;

![checking_row_counts_all_tables](images/checking_row_counts_all_tables.png)

This query retrieves the row counts for each table, allowing you to structure your analysis based on the size of the datasets.
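Row counts alone do not show how much data a query will scan. As a hedged variation, the sketch below also reads the `size_bytes` column of the same `__TABLES__` metadata view and converts it to megabytes; the `size_mb` alias is only an illustrative name.

```sql
-- Row counts plus approximate storage size per table
SELECT
  table_id AS table_name,
  row_count,
  ROUND(size_bytes / POW(1024, 2), 2) AS size_mb  -- bytes -> MB
FROM
  `adwentureworks_db.__TABLES__`
ORDER BY
  size_bytes DESC;
```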

#### Step 3: Investigating Table Schema

After identifying the tables and their sizes, it's essential to delve deeper into the structure of specific tables. Understanding the schema of a table, including its columns and data types, is vital for writing accurate SQL queries.

In [None]:
-- Examining table structure
SELECT
 column_name,
 data_type,
 is_nullable
FROM
 `adwentureworks_db.INFORMATION_SCHEMA.COLUMNS`
WHERE
 table_name = 'salesorderheader'
ORDER BY
 ordinal_position;

![examin_table_structure](images/examin_table_structure.png)

This query examines the structure of the `salesorderheader` table, providing insights into the data types and nullability of each column. This knowledge is crucial for ensuring that your queries are both efficient and effective.
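The keyword search suggested for the schema diagram can also be done in SQL. The sketch below looks across every table in the dataset for columns whose names contain `date` or `total`; the two keywords are only examples and can be swapped for any others.

```sql
-- Find columns by keyword across all tables in the dataset
SELECT
  table_name,
  column_name,
  data_type
FROM
  `adwentureworks_db.INFORMATION_SCHEMA.COLUMNS`
WHERE
  LOWER(column_name) LIKE '%date%'
  OR LOWER(column_name) LIKE '%total%'
ORDER BY
  table_name,
  ordinal_position;
```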

#### Step 4: Previewing Table Data

To gain a better understanding of the actual data contained within a table, it's helpful to preview a few records. This allows you to see the data in context and assess its quality.

In [None]:
-- Head query for salesorderheader
SELECT *
FROM `adwentureworks_db.salesorderheader`
ORDER BY OrderDate DESC
LIMIT 10;

![preview_table](images/preview_table.png)

#### Step 5: Random Sampling from Table

In addition to previewing the most recent records, obtaining a random sample of data can provide a broader perspective on the dataset. This is particularly useful for initial exploration and identifying potential data quality issues.


In [None]:
-- Random sampling from salesorderheader
SELECT *
FROM `adwentureworks_db.salesorderheader`
ORDER BY RAND()
LIMIT 10;

![random_sampling](images/random_sampling.png)
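`ORDER BY RAND()` returns different rows on every run, which makes findings hard to reproduce. One possible alternative, sketched below, is to hash the key and keep a fixed slice of the hash space; it assumes `SalesOrderID` uniquely identifies an order, and the divisor 20 (roughly a 5% sample) is an arbitrary choice.

```sql
-- Repeatable pseudo-random sample: the same rows come back on every run
SELECT *
FROM `adwentureworks_db.salesorderheader`
WHERE
  -- FARM_FINGERPRINT hashes the key to an INT64; keep about 1 in 20 hash values
  MOD(FARM_FINGERPRINT(CAST(SalesOrderID AS STRING)), 20) = 0
LIMIT 10;
```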

## Data Quality Assessment

Data quality is paramount in EDA. This section will guide you through assessing the integrity of your data, ensuring that your findings are based on reliable information. Techniques for identifying null values, duplicates, and inconsistencies will be discussed.

Let's assess the data quality by checking for null values and basic statistics.

#### Step 1: Check for Null Values

To ensure the reliability of your analysis, it's essential to check for null values in key columns of the `salesorderheader` table. This helps identify missing information that could impact your results.

In [None]:
-- Check for null values in key columns of salesorderheader
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(SalesOrderID IS NOT NULL) AS non_null_order_id,
  COUNTIF(OrderDate IS NOT NULL) AS non_null_order_date,
  COUNTIF(CustomerID IS NOT NULL) AS non_null_customer_id,
  COUNTIF(TotalDue IS NOT NULL) AS non_null_total_due,
  COUNTIF(SalesOrderID IS NULL) AS null_order_id,
  COUNTIF(OrderDate IS NULL) AS null_order_date,
  COUNTIF(CustomerID IS NULL) AS null_customer_id,
  COUNTIF(TotalDue IS NULL) AS null_total_due
FROM `adwentureworks_db.salesorderheader`;

![null_values](images/null_values.png)
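Raw null counts are easier to judge as a share of all rows. The following sketch expresses nulls as percentages for a few columns of the same table; the column choice (`SalesPersonID`, `ShipDate`, `TotalDue`) is an assumption about which fields are worth checking, not a list taken from the documentation.

```sql
-- Null share per column, as a percentage of all rows
SELECT
  COUNT(*) AS total_rows,
  ROUND(SAFE_DIVIDE(COUNTIF(SalesPersonID IS NULL), COUNT(*)) * 100, 2) AS pct_null_salesperson_id,
  ROUND(SAFE_DIVIDE(COUNTIF(ShipDate IS NULL), COUNT(*)) * 100, 2) AS pct_null_ship_date,
  ROUND(SAFE_DIVIDE(COUNTIF(TotalDue IS NULL), COUNT(*)) * 100, 2) AS pct_null_total_due
FROM `adwentureworks_db.salesorderheader`;
```

`SAFE_DIVIDE` returns `NULL` instead of raising an error if the table happens to be empty. A high null share is not necessarily a defect: it may reflect a business rule, so check the data dictionary before treating it as missing data.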

#### Step 2: Check for Null Values in Product Table

Similarly, assessing null values in the `product` table is crucial, as missing data in product information can lead to incomplete analyses and misinformed decisions.

In [None]:
-- Null value check in key columns in Product table
SELECT
  COUNTIF(ProductID IS NULL) AS null_product_ids,
  COUNTIF(Name IS NULL) AS null_names,
  COUNTIF(ProductNumber IS NULL) AS null_product_numbers,
  COUNTIF(Color IS NULL) AS null_colors
FROM `adwentureworks_db.product`;

![null_values_2](images/null_values_2.png)

#### Step 3: Check for Duplicate Records

Identifying duplicate records in the `salesorderheader` table is vital for ensuring data integrity. Duplicates can skew your analysis and lead to incorrect conclusions.

In [None]:
-- Check for duplicate records in salesorderheader
SELECT
 SalesOrderID,
 COUNT(*) AS count
FROM
 `adwentureworks_db.salesorderheader`
GROUP BY
 SalesOrderID
HAVING
 COUNT(*) > 1;

![duplicates_check](images/duplicates_check.png)

#### Step 4: Value Range Check

Performing a value range check on the `TotalDue` column in the `salesorderheader` table is crucial for identifying outliers and ensuring data quality. This check reveals the minimum, maximum, average, and standard deviation of `TotalDue`, providing insights into the distribution of transaction amounts. Understanding these statistics helps in detecting incorrect entries that may impact your analysis.

In [None]:
-- Value range check for TotalDue in salesorderheader
SELECT
  FORMAT_TIMESTAMP('%Y-%m-%d', MIN(OrderDate)) AS min_order_date,
  FORMAT_TIMESTAMP('%Y-%m-%d', MAX(OrderDate)) AS max_order_date,
  ROUND(MIN(TotalDue), 2) AS min_total_due,
  ROUND(MAX(TotalDue), 2) AS max_total_due,
  ROUND(AVG(TotalDue), 2) AS avg_total_due,
  ROUND(STDDEV(TotalDue), 2) AS stddev_total_due,
  COUNT(*) AS total_records
FROM
  `adwentureworks_db.salesorderheader`;

![values_check](images/values_check.png)
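Minimum, maximum, and mean can be dominated by a few extreme orders. As a complementary sketch, the query below estimates quartiles with `APPROX_QUANTILES` and counts orders above the common `Q3 + 1.5 * IQR` fence. The fence rule is a general convention rather than something prescribed by this dataset, and the quantiles are approximate.

```sql
-- Quartiles and a simple IQR-based outlier count for TotalDue
WITH quartiles AS (
  SELECT
    APPROX_QUANTILES(TotalDue, 4) AS q  -- array: [min, Q1, median, Q3, max]
  FROM `adwentureworks_db.salesorderheader`
),
bounds AS (
  SELECT
    q[OFFSET(1)] AS q1,
    q[OFFSET(2)] AS median,
    q[OFFSET(3)] AS q3,
    q[OFFSET(3)] + 1.5 * (q[OFFSET(3)] - q[OFFSET(1)]) AS upper_fence
  FROM quartiles
)
SELECT
  ROUND(b.q1, 2) AS q1,
  ROUND(b.median, 2) AS median,
  ROUND(b.q3, 2) AS q3,
  ROUND(b.upper_fence, 2) AS upper_fence,
  COUNTIF(soh.TotalDue > b.upper_fence) AS orders_above_fence
FROM `adwentureworks_db.salesorderheader` soh
CROSS JOIN bounds b  -- bounds has exactly one row
GROUP BY b.q1, b.median, b.q3, b.upper_fence;
```

Orders above the fence are candidates for inspection, not automatically errors; large wholesale orders can be legitimate.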

#### Step 5: Consistency Check

Verifying the consistency between `OrderDate` and `ShipDate` in the `salesorderheader` table is important to ensure that the data adheres to logical constraints.

In [None]:
-- Consistency check for OrderDate and ShipDate in salesorderheader
SELECT
 COUNT(*) AS inconsistent_records
FROM (
  SELECT
  DATE(OrderDate) AS formatted_order_date,
  ShipDate
  FROM `adwentureworks_db.salesorderheader`
)
WHERE
 formatted_order_date > ShipDate;

![consistency_check](images/consistency_check.png)

#### Step 6: Unique Constraint Violations

Checking for unique constraint violations in the `product` table helps ensure that each product has a unique `ProductNumber`, which is essential for accurate data representation.

In [None]:
-- Check for unique constraint violations in Product table
SELECT
 ProductNumber,
 COUNT(*) AS count
FROM
 `adwentureworks_db.product`
GROUP BY
 ProductNumber
HAVING
 COUNT(*) > 1;

![unique_check](images/unique_check.png)

## Sales Analysis

Sales data is often the most critical aspect of business analysis. This section will delve into various metrics and trends, providing insights that can drive strategic decisions. Understanding sales patterns can help identify opportunities for growth and areas needing improvement.

#### Step 1: Basic Statistics for Sales

To gain a quick overview of sales performance, it's important to calculate basic statistics such as minimum, maximum, average sales, and the number of unique customers and salespersons.

In [None]:
-- Basic statistics for salesorderheader
SELECT
 ROUND(MIN(TotalDue), 2) AS min_total_due,
 ROUND(MAX(TotalDue), 2) AS max_total_due,
 ROUND(AVG(TotalDue), 2) AS avg_total_due,
 COUNT(DISTINCT CustomerID) AS unique_customers,
 COUNT(DISTINCT SalesPersonID) AS unique_salespersons
FROM `adwentureworks_db.salesorderheader`;

![basic_statistics](images/basic_statistics.png)

#### Step 2: Distribution of Order Total Amounts

Analyzing the distribution of order total amounts helps in understanding how sales are spread across different ranges. This can reveal trends in customer spending behavior.

In [None]:
-- Distribution of order total amounts
SELECT
 CASE
   WHEN TotalDue < 100 THEN '0-100'
   WHEN TotalDue < 500 THEN '100-500'
   WHEN TotalDue < 1000 THEN '500-1000'
   WHEN TotalDue < 5000 THEN '1000-5000'
   ELSE '5000+'
 END AS total_due_range,
 COUNT(*) AS order_count
FROM `adwentureworks_db.salesorderheader`
GROUP BY total_due_range
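-- Sort buckets by their smallest value; sorting the labels as text would put '1000-5000' before '500-1000'
ORDER BY MIN(TotalDue);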

![distribution_of_order_total](images/distribution_of_order_total.png)
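Order counts per range do not show how much revenue each range contributes. The sketch below reuses the same buckets and adds each bucket's share of orders and of revenue; the aliases `pct_of_orders` and `pct_of_revenue` are illustrative names.

```sql
-- Share of orders and of revenue per order-size bucket
WITH bucketed AS (
  SELECT
    CASE
      WHEN TotalDue < 100 THEN '0-100'
      WHEN TotalDue < 500 THEN '100-500'
      WHEN TotalDue < 1000 THEN '500-1000'
      WHEN TotalDue < 5000 THEN '1000-5000'
      ELSE '5000+'
    END AS total_due_range,
    TotalDue
  FROM `adwentureworks_db.salesorderheader`
)
SELECT
  total_due_range,
  COUNT(*) AS order_count,
  -- Window functions over the grouped rows give the grand totals
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_orders,
  ROUND(SUM(TotalDue) * 100 / SUM(SUM(TotalDue)) OVER (), 2) AS pct_of_revenue
FROM bucketed
GROUP BY total_due_range
ORDER BY MIN(TotalDue);  -- numeric bucket order
```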

#### Step 3: Sales by Year with Growth Rate

Examining sales data by year allows you to identify trends over time. Calculating the growth rate year-over-year can provide insights into the overall performance of the business.

In [None]:
-- Sales by Year with Growth Rate
WITH yearly_sales AS (
  SELECT
    FORMAT_TIMESTAMP('%Y', OrderDate) AS year,
    COUNT(*) AS order_count,
    ROUND(SUM(TotalDue), 2) AS total_sales
  FROM
    `adwentureworks_db.salesorderheader`
  GROUP BY
    year
)
SELECT
  year,
  order_count,
  total_sales,
  LAG(total_sales) OVER (ORDER BY year) AS previous_year_sales,
  ROUND((total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100, 2) AS growth_rate
FROM
  yearly_sales
ORDER BY
  year;

![sales_by_year](images/sales_by_year.png)
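Year-over-year growth is only meaningful when both years cover a comparable period. Before reading the growth rates, it may help to check how many months each year actually contains, as in this sketch. It assumes `OrderDate` is a `TIMESTAMP`, as the other queries in this guide do.

```sql
-- Coverage check: how complete is each calendar year?
SELECT
  EXTRACT(YEAR FROM OrderDate) AS year,
  COUNT(DISTINCT EXTRACT(MONTH FROM OrderDate)) AS months_with_orders,
  FORMAT_TIMESTAMP('%Y-%m-%d', MIN(OrderDate)) AS first_order_date,
  FORMAT_TIMESTAMP('%Y-%m-%d', MAX(OrderDate)) AS last_order_date
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  year
ORDER BY
  year;
```

A year with fewer than 12 months of orders will understate annual sales, so its growth rate should be read with caution.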

#### Step 4: Sales by Quarter with Growth Rate

Breaking down sales data by quarter offers a more granular view of performance. This analysis can help in understanding seasonal trends and making informed decisions for future strategies.

In [None]:
-- Sales by Quarter with Growth Rate
WITH quarterly_sales AS (
  SELECT
    EXTRACT(YEAR FROM OrderDate) AS year,
    EXTRACT(QUARTER FROM OrderDate) AS quarter,
    COUNT(*) AS order_count,
    ROUND(SUM(TotalDue), 2) AS total_sales
  FROM
    `adwentureworks_db.salesorderheader`
  GROUP BY
    year,
    quarter
)
SELECT
  year,
  quarter,
  order_count,
  total_sales,
  LAG(total_sales) OVER (ORDER BY year, quarter) AS previous_quarter_sales,
  ROUND((total_sales - LAG(total_sales) OVER (ORDER BY year, quarter)) / LAG(total_sales) OVER (ORDER BY year, quarter) * 100, 2) AS growth_rate
FROM
  quarterly_sales
ORDER BY
  year,
  quarter;

![sales_by_quarter](images/sales_by_quarter.png)

#### Step 5: Sales by Month

Analyzing sales on a monthly basis helps in tracking performance and identifying patterns that may not be visible in annual or quarterly data.

In [None]:
-- Sales by Month
WITH monthly_sales AS (
  SELECT
    FORMAT_TIMESTAMP('%Y-%m', OrderDate) AS month,
    COUNT(*) AS order_count,
    ROUND(SUM(TotalDue), 2) AS total_sales
  FROM
    `adwentureworks_db.salesorderheader`
  GROUP BY
    month
)
SELECT
  month,
  order_count,
  total_sales,
  LAG(total_sales) OVER (ORDER BY month) AS previous_month_sales,
  ROUND((total_sales - LAG(total_sales) OVER (ORDER BY month)) / LAG(total_sales) OVER (ORDER BY month) * 100, 2) AS growth_rate
FROM
  monthly_sales
ORDER BY
  month;

![sales_by_month](images/sales_by_month.png)
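Month-to-month growth rates tend to be noisy. One way to smooth them, sketched here under the same assumptions as the query above, is a three-month moving average computed with a window frame; the alias `moving_avg_3m` is an illustrative name.

```sql
-- Three-month moving average of monthly sales
WITH monthly_sales AS (
  SELECT
    FORMAT_TIMESTAMP('%Y-%m', OrderDate) AS month,
    SUM(TotalDue) AS total_sales
  FROM
    `adwentureworks_db.salesorderheader`
  GROUP BY
    month
)
SELECT
  month,
  ROUND(total_sales, 2) AS total_sales,
  -- Average of the current row and the two rows before it
  ROUND(AVG(total_sales) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ), 2) AS moving_avg_3m
FROM
  monthly_sales
ORDER BY
  month;
```

The frame counts rows rather than calendar months, so a month with no orders at all would be skipped silently. The first two rows average fewer than three months.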

#### Step 6: Sales by Day of Week

Understanding sales patterns by day of the week can inform marketing strategies and operational decisions, helping to optimize staffing and inventory management.

In [None]:
-- Sales by Day of Week
SELECT
  CASE
    WHEN EXTRACT(DAYOFWEEK FROM OrderDate) = 1 THEN 7
    ELSE EXTRACT(DAYOFWEEK FROM OrderDate) - 1
  END AS day_of_week,
  COUNT(*) AS order_count,
  ROUND(SUM(TotalDue), 2) AS total_sales
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  day_of_week
ORDER BY
  day_of_week;

![sales_by_day](images/sales_by_day.png)
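Numeric weekdays are easy to misread. As a small variation, the sketch below labels each day by name with the `%A` format element and keeps Monday-first ordering; it assumes `OrderDate` is a `TIMESTAMP` and uses the default UTC time zone.

```sql
-- Sales by named weekday, ordered Monday through Sunday
SELECT
  FORMAT_TIMESTAMP('%A', OrderDate) AS day_name,
  COUNT(*) AS order_count,
  ROUND(AVG(TotalDue), 2) AS avg_order_value
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  day_name
ORDER BY
  -- DAYOFWEEK is 1 (Sunday) to 7 (Saturday); shift so that Monday sorts first
  MIN(MOD(EXTRACT(DAYOFWEEK FROM OrderDate) + 5, 7));
```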

Analyzing basic statistics such as minimum, maximum, and average sales can provide a quick overview of the sales performance. Additionally, identifying trends over time can help in forecasting future sales.

## Sales Team Performance

Evaluate the performance of the sales team based on order count and total sales.

#### Step 1: Salesperson Performance with Targets

To assess the performance of each salesperson, it's important to compare their total sales against their sales targets. This analysis can highlight top performers and those who may need additional support. Additionally, calculating bonuses based on sales performance can motivate sales staff and align their efforts with business objectives. This step will analyze the total sales for each salesperson and determine their eligibility for bonuses.

In [None]:
-- Salesperson Performance with Targets on a Yearly Basis including Bonus Calculation
WITH SalesData AS (
  SELECT
    sp.SalesPersonID,
    EXTRACT(YEAR FROM soh.OrderDate) AS sales_year,
    COUNT(DISTINCT soh.SalesOrderID) AS order_count,
    ROUND(SUM(soh.TotalDue), 2) AS total_sales,
    -- Quota and bonus are stored once per salesperson, so MAX picks that single value
    ROUND(MAX(sp.SalesQuota), 2) AS sales_quota,
    ROUND(MAX(sp.Bonus), 2) AS bonus
  FROM
    `adwentureworks_db.salesperson` sp
  JOIN
    `adwentureworks_db.salesorderheader` soh ON sp.SalesPersonID = soh.SalesPersonID
  WHERE
    sp.SalesQuota IS NOT NULL
  GROUP BY
    sp.SalesPersonID,
    sales_year
)
SELECT
  SalesPersonID,
  sales_year,
  order_count,
  total_sales,
  sales_quota,
  CASE 
    WHEN total_sales >= sales_quota THEN 'Achieved Quota'
    ELSE 'Did Not Achieve Quota'
  END AS quota_achievement,
  CASE
    WHEN total_sales >= sales_quota THEN bonus
    ELSE 0
  END AS calculated_bonus
FROM
  SalesData
ORDER BY
  total_sales DESC;

![quota_bonus](images/quota_bonus.png)

#### Step 2: Salesperson Commission Calculation

Understanding how commissions are calculated can provide insights into the incentives driving sales behavior. This analysis will compute the total commission earned by each salesperson based on their sales performance.

In [None]:
-- Salesperson Commission Calculation
SELECT
  sp.SalesPersonID,
  ROUND(SUM(soh.TotalDue), 2) AS total_sales,
  sp.CommissionPct,
  ROUND(SUM(soh.TotalDue) * sp.CommissionPct, 2) AS commission
FROM
  `adwentureworks_db.salesperson` sp
JOIN
  `adwentureworks_db.salesorderheader` soh ON sp.SalesPersonID = soh.SalesPersonID
GROUP BY
  sp.SalesPersonID, sp.CommissionPct
ORDER BY
  total_sales DESC;

![commission](images/commission.png)

#### Step 3: Salesperson Performance by Territory

Analyzing sales performance by territory can help identify regional strengths and weaknesses. This step will evaluate how each salesperson is performing in their respective territories.

In [None]:
-- Salesperson Performance by Territory
SELECT
  st.Name AS territory_name,
  sp.SalesPersonID,
  COUNT(DISTINCT soh.SalesOrderID) AS order_count,
  ROUND(SUM(soh.TotalDue), 2) AS total_sales,
  sp.SalesYTD,
  sp.SalesLastYear
FROM
  `adwentureworks_db.salesperson` sp
JOIN
  `adwentureworks_db.salesorderheader` soh ON sp.SalesPersonID = soh.SalesPersonID
JOIN
  `adwentureworks_db.salesterritory` st ON soh.TerritoryID = st.TerritoryID
GROUP BY
  st.Name, sp.SalesPersonID, sp.SalesYTD, sp.SalesLastYear
ORDER BY
  total_sales DESC;

![performance_by_territory](images/performance_by_territory.png)

By following these steps, you will gain valuable insights into the performance of your sales team, enabling you to make informed decisions that can enhance sales effectiveness and drive business growth.

## Product Analysis

Examine product data to identify top-selling products, price distributions, and inventory status.


#### Step 1: Top Selling Products by Revenue

Identifying the top-selling products based on the revenue they generate is crucial for effective resource allocation, strategic planning, and profit maximization. Understanding which products contribute most to revenue helps businesses focus their inventory and marketing efforts, adapt to market trends, and manage stock levels efficiently.

In [None]:
-- Top selling products
SELECT
 p.Name AS product_name,
 SUM(sod.OrderQty) AS total_quantity_sold,
 SUM(sod.LineTotal) AS total_revenue
FROM `adwentureworks_db.salesorderdetail` sod
JOIN `adwentureworks_db.product` p ON sod.ProductID = p.ProductID
GROUP BY p.Name
ORDER BY total_revenue DESC
LIMIT 10;

![top_selling_products](images/top_selling_products.png)

#### Step 2: Top 10 Best-Selling Products by Quantity

A focused analysis of the top 10 best-selling products based on volume can highlight key items that drive sales and warrant special attention in inventory management and marketing efforts. Understanding which products sell the most units helps businesses optimize stock levels, identify trends in customer demand, and tailor promotional strategies to enhance overall sales performance.

In [None]:
-- Top 10 Best-Selling Products
SELECT
    p.ProductID,
    p.Name AS product_name,
    SUM(sod.OrderQty) AS total_quantity_sold,
    ROUND(SUM(sod.LineTotal), 0) AS total_revenue
FROM `adwentureworks_db.product` p
JOIN `adwentureworks_db.salesorderdetail` sod ON p.ProductID = sod.ProductID
GROUP BY p.ProductID, p.Name
ORDER BY total_quantity_sold DESC
LIMIT 10;

![top_selling_products_quantity](images/top_selling_products_quantity.png)

#### Step 3: Product Category Distribution

Analyzing the distribution of products across different categories can help understand market segmentation and identify areas for potential growth.

In [None]:
-- Product Category distribution
SELECT
    pc.Name AS category_name,
    COUNT(*) AS product_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS percentage
FROM `adwentureworks_db.product` p
JOIN `adwentureworks_db.productsubcategory` ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
JOIN `adwentureworks_db.productcategory` pc ON ps.ProductCategoryID = pc.ProductCategoryID
GROUP BY pc.Name
ORDER BY product_count DESC;

![product_category_distribution](images/product_category_distribution.png)

#### Step 4: Sales by Product Category

Examining sales performance by product category allows for a deeper understanding of which categories are performing well and which may need additional support.

In [None]:
-- Sales by Product Category
SELECT
  pc.Name AS category_name,
  ROUND(SUM(sod.LineTotal), 0) AS total_sales
FROM `adwentureworks_db.salesorderdetail` sod
JOIN `adwentureworks_db.product` p ON sod.ProductID = p.ProductID
JOIN `adwentureworks_db.productsubcategory` ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
JOIN `adwentureworks_db.productcategory` pc ON ps.ProductCategoryID = pc.ProductCategoryID
GROUP BY pc.Name
ORDER BY total_sales DESC;

![sales_by_product_category](images/sales_by_product_category.png)
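Category totals hide which products drive them. The sketch below ranks products by revenue within each category and keeps the top three; the cut-off of three and the alias `revenue_rank` are arbitrary choices for illustration.

```sql
-- Top 3 products by revenue within each category
WITH product_revenue AS (
  SELECT
    pc.Name AS category_name,
    p.Name AS product_name,
    SUM(sod.LineTotal) AS total_revenue
  FROM `adwentureworks_db.salesorderdetail` sod
  JOIN `adwentureworks_db.product` p ON sod.ProductID = p.ProductID
  JOIN `adwentureworks_db.productsubcategory` ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
  JOIN `adwentureworks_db.productcategory` pc ON ps.ProductCategoryID = pc.ProductCategoryID
  GROUP BY category_name, product_name
),
ranked AS (
  SELECT
    category_name,
    product_name,
    total_revenue,
    -- Ranking restarts at 1 for every category
    RANK() OVER (PARTITION BY category_name ORDER BY total_revenue DESC) AS revenue_rank
  FROM product_revenue
)
SELECT
  category_name,
  product_name,
  ROUND(total_revenue, 2) AS total_revenue,
  revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY category_name, revenue_rank;
```

Because the joins are inner joins, products without a subcategory are left out of the ranking.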

#### Step 5: Product List Price Statistics

Calculating statistics such as minimum, maximum, average, and standard deviation of product list prices can provide insights into pricing strategies.

In [None]:
-- Product List Price statistics
SELECT
    MIN(ListPrice) AS min_price,
    MAX(ListPrice) AS max_price,
    ROUND(AVG(ListPrice), 2) AS avg_price,
    ROUND(STDDEV(ListPrice), 2) AS stddev_price
FROM `adwentureworks_db.product`
WHERE ListPrice > 0;

![product_list_price_stats](images/product_list_price_stats.png)

#### Step 6: Category vs. Average List Price

Comparing average list prices across different product categories can help identify pricing trends and inform pricing strategies.

In [None]:
-- Category vs. Average List Price
SELECT
    pc.Name AS category_name,
    ROUND(AVG(p.ListPrice), 2) AS avg_list_price
FROM `adwentureworks_db.product` p
JOIN `adwentureworks_db.productsubcategory` ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
JOIN `adwentureworks_db.productcategory` pc ON ps.ProductCategoryID = pc.ProductCategoryID
GROUP BY pc.Name
ORDER BY avg_list_price DESC;

![category_average_list_price](images/category_average_list_price.png)

#### Step 7: Product Sales Trend Over Time

Analyzing the sales trend of individual products over time can reveal patterns in customer demand and help forecast future sales.

In [None]:
-- Product sales trend over time
SELECT
    p.ProductID,
    p.Name AS product_name,
    FORMAT_TIMESTAMP('%Y-%m', soh.OrderDate) AS order_month,
    SUM(sod.OrderQty) AS quantity_sold,
    ROUND(SUM(sod.LineTotal), 2) AS total_revenue
FROM `adwentureworks_db.product` p
JOIN `adwentureworks_db.salesorderdetail` sod ON p.ProductID = sod.ProductID
JOIN `adwentureworks_db.salesorderheader` soh ON sod.SalesOrderID = soh.SalesOrderID
GROUP BY p.ProductID, p.Name, order_month
ORDER BY p.ProductID, order_month;
-- Add LIMIT 20 before the semicolon to preview only the first rows.

![products_sales_over_time](images/products_sales_over_time.png)

#### Step 8: Product Inventory Analysis

Evaluating the current inventory levels of products can help identify stock shortages or overstock situations, allowing for better inventory management.

In [None]:
-- Product inventory analysis
-- ProductInventory may hold one row per product and location, so sum across locations first
SELECT
  p.ProductID,
  p.Name AS product_name,
  p.SafetyStockLevel,
  SUM(pi.Quantity) AS current_inventory,
  CASE
    WHEN SUM(pi.Quantity) < p.SafetyStockLevel THEN 'Low Stock'
    WHEN SUM(pi.Quantity) > p.SafetyStockLevel * 2 THEN 'Overstocked'
    ELSE 'Adequate'
  END AS inventory_status
FROM
  `adwentureworks_db.product` p
JOIN
  `adwentureworks_db.productinventory` pi
ON
  p.ProductID = pi.ProductID
GROUP BY
  p.ProductID, p.Name, p.SafetyStockLevel
ORDER BY
  current_inventory DESC
LIMIT
  200;

![product_inventory](images/product_inventory.png)

By following these steps, you will gain a comprehensive understanding of product performance, enabling you to make data-driven decisions that can enhance product offerings and optimize inventory management.

## Customer Analysis

Understanding customer behavior is key to driving sales and improving customer satisfaction. This section will explore how to segment customers based on their purchasing behavior, allowing for targeted marketing strategies and personalized experiences.

Analyze customer data to identify top customers and segment them based on purchase behavior.

#### Step 1: Top 10 Customers by Total Purchase

Identifying the top customers by total purchase can help recognize your most valuable clients and inform strategies for retention and engagement.

In [None]:
-- Top 10 Customers by Total Purchase
SELECT
    c.CustomerID,
    ROUND(SUM(soh.TotalDue), 2) AS total_purchase
FROM `adwentureworks_db.customer` c
JOIN `adwentureworks_db.salesorderheader` soh ON c.CustomerID = soh.CustomerID
GROUP BY c.CustomerID
ORDER BY total_purchase DESC
LIMIT 10;

![top_10_customers](images/top_10_customers.png)

#### Step 2: Top Customers by Order Count

Analyzing the top customers by order count provides insights into customer loyalty and frequency of purchases, which can inform marketing and sales strategies.

In [None]:
-- Top Customers by Order Count
SELECT
  CustomerID,
  COUNT(SalesOrderID) AS order_count
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  CustomerID
ORDER BY
  order_count DESC
LIMIT
  10;

![top_10_customers_by_quantity](images/top_10_customers_by_quantity.png)

#### Step 3: Average Order Value by Customer

Calculating the average order value for each customer helps in understanding spending behavior and can guide pricing and promotional strategies.

In [None]:
-- Average Order Value by Customer
SELECT
  CustomerID,
  ROUND(AVG(TotalDue), 2) AS avg_order_value
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  CustomerID
ORDER BY
  avg_order_value DESC
LIMIT
  10;

![average_order_by_customer](images/average_order_by_customer.png)
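A ranking by average order value can be dominated by customers who placed a single large order. As a hedged refinement, the sketch below only considers customers with at least three orders; the threshold of three is an arbitrary assumption and should be tuned to the data.

```sql
-- Average order value, restricted to repeat customers
SELECT
  CustomerID,
  COUNT(*) AS order_count,
  ROUND(AVG(TotalDue), 2) AS avg_order_value
FROM
  `adwentureworks_db.salesorderheader`
GROUP BY
  CustomerID
HAVING
  COUNT(*) >= 3  -- hypothetical minimum number of orders
ORDER BY
  avg_order_value DESC
LIMIT
  10;
```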

#### Step 4: Customer Segmentation by Purchase Behavior

Segmenting customers based on their purchase frequency and total spend can provide insights into different customer groups, allowing for tailored marketing approaches.

In [None]:
-- Customer segmentation by purchase frequency and total spend
SELECT
    CustomerID,
    COUNT(DISTINCT SalesOrderID) AS order_count,
    ROUND(SUM(TotalDue), 2) AS total_spend,
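    -- Thresholds are illustrative; CASE returns the first branch that matches
    CASE
        -- High Value: more than 10 orders AND more than 10,000 in total spend
        WHEN COUNT(DISTINCT SalesOrderID) > 10 AND SUM(TotalDue) > 10000 THEN 'High Value'
        -- Medium Value: more than 5 orders OR more than 5,000 in total spend
        WHEN COUNT(DISTINCT SalesOrderID) > 5 OR SUM(TotalDue) > 5000 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS customer_segment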
FROM `adwentureworks_db.salesorderheader`
GROUP BY CustomerID
ORDER BY total_spend DESC
LIMIT 200;

![customer_segmentation](images/customer_segmentation.png)

The top-customer lists show who contributes the most revenue and who orders most often. The two lists need not overlap, since a customer can place a few large orders or many small ones. Segmenting on both purchase frequency and total spend combines the two measures, so that each customer group can be approached differently.
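
Because the segmentation query is limited to the 200 highest-spending customers, it does not show how large each segment is overall. The sketch below reuses the same thresholds to count customers and total spend per segment across the whole table. The CTE names `customer_totals` and `segmented` are introduced here for illustration.

```sql
-- Segment sizes and spend, using the same thresholds as the segmentation query
WITH customer_totals AS (
    SELECT
        CustomerID,
        COUNT(DISTINCT SalesOrderID) AS order_count,
        SUM(TotalDue) AS total_spend
    FROM `adwentureworks_db.salesorderheader`
    GROUP BY CustomerID
),
segmented AS (
    SELECT
        CustomerID,
        total_spend,
        CASE
            WHEN order_count > 10 AND total_spend > 10000 THEN 'High Value'
            WHEN order_count > 5 OR total_spend > 5000 THEN 'Medium Value'
            ELSE 'Low Value'
        END AS customer_segment
    FROM customer_totals
)
SELECT
    customer_segment,
    COUNT(*) AS customer_count,
    ROUND(SUM(total_spend), 2) AS segment_spend,
    ROUND(AVG(total_spend), 2) AS avg_spend_per_customer
FROM segmented
GROUP BY customer_segment
ORDER BY segment_spend DESC;
```

If one segment holds nearly all customers, the thresholds are probably too coarse for this dataset and worth adjusting.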

## Geographical Analysis

Company-wide sales totals can hide regional differences. This section breaks sales down by country and by territory, then tracks month-over-month growth per territory, to identify strong markets, market opportunities, and areas that need a regional strategy.

#### Step 1: Sales by Country

Analyzing sales by country provides a high-level view of performance across different regions, helping to identify strong markets and areas for potential growth.

In [None]:
-- Sales by Country
SELECT
    st.CountryRegionCode AS country,
    COUNT(soh.SalesOrderID) AS order_count,
    ROUND(SUM(soh.TotalDue), 0) AS total_sales
FROM `adwentureworks_db.salesterritory` st
JOIN `adwentureworks_db.salesorderheader` soh ON st.TerritoryID = soh.TerritoryID
GROUP BY st.CountryRegionCode
ORDER BY total_sales DESC;

![sales_by_country](images/sales_by_country.png)
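
Absolute totals are easier to compare when expressed as a share of overall sales. The sketch below adds a percentage column by dividing each country's total by the grand total, computed with a window function over the grouped rows. The column name `pct_of_total_sales` is introduced here for illustration.

```sql
-- Each country's share of total sales
SELECT
    st.CountryRegionCode AS country,
    ROUND(SUM(soh.TotalDue), 0) AS total_sales,
    -- SUM(...) OVER () sums the per-country totals into a grand total
    ROUND(100 * SUM(soh.TotalDue) / SUM(SUM(soh.TotalDue)) OVER (), 2) AS pct_of_total_sales
FROM `adwentureworks_db.salesterritory` st
JOIN `adwentureworks_db.salesorderheader` soh ON st.TerritoryID = soh.TerritoryID
GROUP BY st.CountryRegionCode
ORDER BY total_sales DESC;
```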

#### Step 2: Sales by Territory

Examining sales data by territory allows for a more granular understanding of regional performance, highlighting specific areas that may require targeted marketing efforts.

In [None]:
-- Sales by Territory
SELECT
    st.Name AS territory_name,
    COUNT(soh.SalesOrderID) AS order_count,
    ROUND(SUM(soh.TotalDue), 0) AS total_sales
FROM `adwentureworks_db.salesterritory` st
JOIN `adwentureworks_db.salesorderheader` soh ON st.TerritoryID = soh.TerritoryID
GROUP BY st.Name
ORDER BY total_sales DESC;

![sales_by_territory](images/sales_by_territory.png)

#### Step 3: Sales Growth by Territory

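Calculating month-over-month sales growth by territory can help assess the effectiveness of regional strategies and identify trends in market performance over time. The growth rate is the change from the previous month's sales, expressed as a percentage of the previous month's sales.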

In [None]:
-- Sales Growth by Territory
WITH territory_sales AS (
    SELECT
        st.Name AS territory_name,
        FORMAT_DATE('%Y-%m', soh.OrderDate) AS order_month,
        ROUND(SUM(soh.TotalDue), 0) AS total_sales
    FROM `adwentureworks_db.salesterritory` st
    JOIN `adwentureworks_db.salesorderheader` soh ON st.TerritoryID = soh.TerritoryID
    GROUP BY st.Name, order_month
)
SELECT
    territory_name,
    order_month,
    total_sales,
    -- LAG reads the previous row for the territory, i.e. the previous month that has sales
    LAG(total_sales) OVER w AS previous_month_sales,
    -- SAFE_DIVIDE returns NULL instead of raising an error when the previous month's sales are zero
    ROUND(SAFE_DIVIDE(total_sales - LAG(total_sales) OVER w, LAG(total_sales) OVER w) * 100, 2) AS growth_rate
FROM territory_sales
WINDOW w AS (PARTITION BY territory_name ORDER BY order_month)
ORDER BY territory_name, order_month;

![sales_growth_by_territory](images/sales_growth_by_territory.png)
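
Monthly growth rates can be noisy, so a yearly view is a useful complement. The following is a sketch of the same calculation at yearly granularity. The names `yearly_sales` and `yoy_growth_rate` are introduced here for illustration, and the first and last years in the data may be partial, which would distort their growth rates.

```sql
-- Year-over-year sales growth by territory
WITH yearly_sales AS (
    SELECT
        st.Name AS territory_name,
        EXTRACT(YEAR FROM soh.OrderDate) AS order_year,
        SUM(soh.TotalDue) AS total_sales
    FROM `adwentureworks_db.salesterritory` st
    JOIN `adwentureworks_db.salesorderheader` soh ON st.TerritoryID = soh.TerritoryID
    GROUP BY st.Name, order_year
)
SELECT
    territory_name,
    order_year,
    ROUND(total_sales, 0) AS total_sales,
    -- NULL for each territory's first year, which has no previous year to compare against
    ROUND(SAFE_DIVIDE(total_sales - LAG(total_sales) OVER w, LAG(total_sales) OVER w) * 100, 2) AS yoy_growth_rate
FROM yearly_sales
WINDOW w AS (PARTITION BY territory_name ORDER BY order_year)
ORDER BY territory_name, order_year;
```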

Analyzing sales by country and territory shows where revenue is concentrated, while the growth rates show which territories are expanding or contracting. Together, these results can inform strategic decisions about market expansion.

## Key Findings


1. **Importance of EDA**: The exploratory data analysis (EDA) conducted on the AdventureWorks database using SQL demonstrates the crucial role of EDA in understanding the data and uncovering valuable insights. By following a structured approach to EDA, we can gain a comprehensive understanding of the database schema, assess data quality, and identify patterns and trends in sales, product performance, customer behavior, and geographical distribution.
2. **SQL as a Powerful Tool**: The analysis showcases the power of SQL in performing EDA on large datasets. SQL provides a wide range of functions and techniques to efficiently query and manipulate data, calculate statistics, and aggregate information. The ability to join tables, filter data, and perform complex calculations makes SQL an indispensable tool for exploratory data analysis.
3. **Iterative Nature of EDA**: The EDA process is iterative, allowing for a gradual exploration of the data. By starting with an overview of the database schema and progressing to more specific analyses of sales, products, customers, and geography, we can systematically uncover insights and refine our understanding of the data. Each step of the EDA builds upon the previous findings, enabling a deeper comprehension of the relationships and patterns within the dataset.
4. **Foundation for Further Analysis**: The EDA conducted in this analysis lays the foundation for further statistical analysis and predictive modeling. By thoroughly exploring the data, identifying key variables, and understanding their distributions and relationships, we can make informed decisions about which variables to include in subsequent models and analyses. EDA helps in feature selection, data preprocessing, and hypothesis generation, setting the stage for more advanced analytical techniques.

## Recommendations

1. **Expand EDA Scope**: While the current EDA covers essential aspects of the AdventureWorks database, there is potential to expand the scope of the analysis. Additional areas to explore could include customer demographics, product reviews, website traffic data, and marketing campaign effectiveness. Incorporating these additional data sources can provide a more comprehensive understanding of the business and its customers.
2. **Leverage Data Visualization**: To enhance the EDA process and communicate insights effectively, use data visualization techniques. Visual representations such as charts, graphs, and dashboards make patterns, outliers, and trends easier to identify. SQL can be combined with data visualization tools like Tableau, Power BI, or Looker, or with Python libraries (e.g., Matplotlib, Seaborn), to create interactive and informative visualizations.
3. **Perform Statistical Analysis**: Building upon the EDA, it is recommended to perform statistical analysis to validate hypotheses and quantify relationships between variables. This can include hypothesis testing, correlation analysis, regression analysis, and time series analysis. Statistical techniques can help in determining the significance of observed patterns and making data-driven decisions.
4. **Collaborate with Domain Experts**: EDA is most effective when conducted in collaboration with domain experts who have deep knowledge of the business and its operations. Engaging with stakeholders from various departments, such as sales, marketing, product development, and customer service, can provide valuable context and insights during the EDA process. Their expertise can guide the exploration, help interpret findings, and ensure the analysis aligns with business objectives.
5. **Continuous Monitoring and Updating**: EDA should not be a one-time exercise but rather an ongoing process. As new data is collected and business dynamics change, it is crucial to regularly update the EDA to capture the latest trends and insights. Establishing a framework for continuous data exploration and monitoring can help in identifying emerging patterns, detecting anomalies, and adapting strategies accordingly.

By treating EDA as a structured, ongoing practice and using SQL to carry it out, organizations can gain valuable insights from their data. The recommendations above aim to broaden the analysis with additional data sources, communicate findings through visualization, validate them with statistical analysis, ground them in domain expertise, and keep them current through continuous monitoring, so that business decisions rest on an accurate picture of the data.