This project focuses on analyzing retail sales data using SQL. It involves setting up a database, creating tables, populating them with sample data, and performing various data cleaning, exploration, and analysis tasks. The goal is to practice and showcase SQL skills by solving real-world business questions.
- Database Setup: Create and configure a database and table structure for storing retail sales data.
- Data Cleaning: Identify and handle missing or incorrect data.
- Data Exploration: Understand the dataset by calculating basic statistics and identifying unique elements.
- Data Analysis: Answer specific business questions through SQL queries to gain insights from the data.
- Database Setup: Instructions to create the database and tables.
- Data Cleaning: Queries to clean and prepare the data for analysis.
- Data Exploration: Queries to explore the dataset and gather initial insights.
- Data Analysis: Queries to analyze the data and answer various business questions.
- Queries: SQL queries grouped by difficulty level with explanations.
To set up the database and tables,
Create the Database: Run the CREATE DATABASE command to create a new database named sql_practice_project1.
Create the Table: Use the CREATE TABLE command to set up the retail_sales table within the newly created database. This table will store information related to retail sales transactions.
-- Create Database
CREATE DATABASE sql_practice_project1;
-- Create Table
CREATE TABLE retail_sales
(
transactions_id INT PRIMARY KEY,
sale_date DATE,
sale_time TIME,
customer_id INT,
gender VARCHAR(15),
age INT,
category VARCHAR(15),
quantity INT,
price_per_unit FLOAT,
cogs FLOAT,
total_sale FLOAT
);
--QUERIES
--Beginner Level
--Q1:Retrieve all columns for sales made on '2022-11-05'
SELECT *
FROM retail_sales
WHERE sale_date = '2022-11-05';
--This query fetches all sales records for a specific date.
--Q2:Retrieve transactions where the category is 'Clothing' and the quantity sold is more than 4 in November 2022
SELECT *
FROM retail_sales
WHERE
category = 'Clothing'
AND
TO_CHAR(sale_date, 'YYYY-MM') = '2022-11'
AND
quantity >= 4;
--This query filters transactions for a particular category and quantity within a specified month.
--Q3:How many sales do we have?
SELECT COUNT(*) AS total_sales
FROM retail_sales;
--Counts the total number of sales transactions.
--Q4:How many unique customers do we have?
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM retail_sales;
--Counts the number of distinct customers.
--Q5:Find all transactions where total_sale is greater than 1000
SELECT *
FROM retail_sales
WHERE total_sale > 1000;
--Filters transactions with a total sale amount greater than 1000.
--Intermediate Level
--Q1:Calculate total sales for each category
SELECT category, SUM(total_sale) AS total_sales, COUNT(*) AS total_orders
FROM retail_sales
GROUP BY category;
--Aggregates total sales and order count for each category.
--Q2:Find the average age of customers who purchased items from the 'Beauty' category
SELECT ROUND(AVG(age)) AS avg_age
FROM retail_sales
WHERE category = 'Beauty';
--Calculates the average age of customers in the 'Beauty' category.
--Q3:Find the total revenue generated by each gender
SELECT
gender,
SUM(total_sale) AS total_revenue
FROM retail_sales
GROUP BY gender;
--Aggregates total revenue by gender.
--Q4:Find the average price_per_unit for each category
SELECT
category,
ROUND(AVG(price_per_unit)::numeric, 2) AS avg_price_per_unit
FROM retail_sales
WHERE price_per_unit IS NOT NULL
GROUP BY category;
--Calculates the average price per unit for each category.
--Q5:Find transactions where cogs (Cost of Goods Sold) is greater than total_sale
SELECT *
FROM retail_sales
WHERE cogs > total_sale;
--Identifies transactions where the cost of goods sold exceeds the total sale amount.
--Advanced Level
--Q1:Find the top 5 customers based on the highest total sales
SELECT
customer_id,
SUM(total_sale) AS total_sales
FROM retail_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
--Lists the top 5 customers by total sales.
--Q2:Determine the total quantity of items sold per age group
SELECT
CASE
WHEN age < 20 THEN 'Below 20'
WHEN age BETWEEN 20 AND 40 THEN '20-40'
WHEN age BETWEEN 40 AND 60 THEN '40-60'
ELSE 'Above 60'
END AS age_group,
SUM(quantity) AS total_quantity
FROM retail_sales
GROUP BY age_group;
--Aggregates total quantity sold by age group.
--Q3:Find the total number of transactions made by each gender in each category
SELECT category, gender, COUNT(*) AS total_transactions
FROM retail_sales
GROUP BY 1, 2
ORDER BY 1;
--Counts transactions by gender and category.
--Q4:Find the most common age group among customers who made purchases in the 'Electronics' category
SELECT
CASE
WHEN age < 20 THEN 'Below 20'
WHEN age BETWEEN 20 AND 40 THEN '20-40'
WHEN age BETWEEN 40 AND 60 THEN '40-60'
ELSE 'Above 60'
END AS age_group,
COUNT(*) AS customer_count
FROM retail_sales
WHERE category = 'Electronics'
GROUP BY age_group
ORDER BY customer_count DESC
LIMIT 1;
--Identifies the most common age group for customers buying 'Electronics'.
--Q5:Calculate the average sale for each month and find the best-selling month in each year
SELECT year, month, avg_sale
FROM
(
SELECT
EXTRACT(YEAR FROM sale_date) AS year,
EXTRACT(MONTH FROM sale_date) AS month,
AVG(total_sale) AS avg_sale,
RANK() OVER(PARTITION BY EXTRACT(YEAR FROM sale_date) ORDER BY AVG(total_sale) DESC) AS rank
FROM retail_sales
GROUP BY 1, 2
) AS t1
WHERE rank = 1;
--Finds the best-selling month for each year based on average sales.
--Q6:Find the top 3 categories with the highest average sales per transaction
SELECT
category,
ROUND(AVG(total_sale)::numeric, 2) AS avg_sales
FROM retail_sales
GROUP BY category
ORDER BY avg_sales DESC
LIMIT 3;
--Lists the top 3 categories with the highest average sales.
--Q7:Calculate the percentage of total sales made by male customers
SELECT
ROUND(
(SUM(CASE WHEN gender = 'Male' THEN total_sale ELSE 0 END) / NULLIF(SUM(total_sale), 0))::numeric * 100, 2
) AS male_sales_percentage
FROM retail_sales;
--Calculates the percentage of total sales attributed to male customers.
--Q8:Identify which day of the week had the highest total sales
SELECT
TO_CHAR(sale_date, 'Day') AS day_of_week,
SUM(total_sale) AS total_sales
FROM retail_sales
GROUP BY day_of_week
ORDER BY total_sales DESC
LIMIT 1;
--Finds the day of the week with the highest total sales.
--Q9:Find the total cogs (Cost of Goods Sold) for each month and year
SELECT
EXTRACT(YEAR FROM sale_date) AS year,
EXTRACT(MONTH FROM sale_date) AS month,
SUM(cogs) AS total_cogs
FROM retail_sales
GROUP BY year, month
ORDER BY year, month;
--Aggregates total cost of goods sold by month and year.
--Q10:Create each shift and number of orders (e.g., morning < 12, afternoon between 12 & 17, evening > 17)
WITH hourly_sale AS
(
SELECT *,
CASE
WHEN EXTRACT(HOUR FROM sale_time) < 12 THEN 'Morning'
WHEN EXTRACT(HOUR FROM sale_time) BETWEEN 12 AND 17 THEN 'Afternoon'
ELSE 'Evening'
END AS shift
FROM retail_sales
)
SELECT
shift,
COUNT(*) AS total_orders
FROM hourly_sale
GROUP BY shift;
--Categorizes sales into shifts (morning, afternoon, evening) and counts orders.This project was inspired by a YouTube video tutorial on SQL. I have expanded upon the original content by adding additional queries and analyses to deepen the exploration of the retail sales dataset. This project was completed to practice and enhance my SQL skills.