# Pareto Analysis

The Pareto principle is a well-known empirical observation:
roughly 20% of the causes account for roughly 80% of the
results.


This pattern shows up in many settings, for example:

*   In retail, about 80% of sales come from about 20% of customers.
*   About 80% of suffering is due to 20% of the problems :P

Here is the problem: you are given a retail sales dataset with
the columns `customer_id`, `date`, `transaction_id`,
`product_category`, and `total_amount`. For each product
category, what percentage of its customers accounts for about
80% of that category's total sales in the year 2023?
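
Before turning to Spark, the calculation can be sketched in plain Python on a handful of made-up rows. The function name `pareto_customer_share` and the toy data are hypothetical and are not part of the dataset; the sketch only mirrors the steps used later (sum per customer, sort descending, keep customers while the running total stays within 80% of the category total) and assumes non-negative amounts.

```python
from collections import defaultdict

def pareto_customer_share(rows, threshold=0.8):
    """rows: iterable of (product_category, customer_id, amount) tuples."""
    # Step 1: total sales per (category, customer)
    per_customer = defaultdict(float)
    for category, customer_id, amount in rows:
        per_customer[(category, customer_id)] += amount

    # Step 2: collect the customer totals of each category
    by_category = defaultdict(list)
    for (category, _customer_id), total in per_customer.items():
        by_category[category].append(total)

    shares = {}
    for category, totals in by_category.items():
        totals.sort(reverse=True)          # biggest customers first
        category_sales = 0.0
        for total in totals:
            category_sales += total

        # Step 3: count customers whose running total stays within the threshold
        running, kept = 0.0, 0
        for total in totals:
            running += total
            if running > threshold * category_sales:
                break                      # running total only grows from here
            kept += 1
        shares[category] = kept / len(totals)
    return shares

# Made-up rows; customer C1 has two Beauty transactions that get summed
toy_rows = [
    ("Beauty", "C1", 300), ("Beauty", "C1", 200), ("Beauty", "C2", 250),
    ("Beauty", "C3", 150), ("Beauty", "C4", 60), ("Beauty", "C5", 40),
    ("Clothing", "C1", 90), ("Clothing", "C2", 10),
]
print(pareto_customer_share(toy_rows))
```

In the toy data, two of the five Beauty customers stay within the 80% mark. For Clothing the single largest customer already exceeds it, so the share is zero under this strict cutoff.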

## Boilerplate Code

In [None]:
# Import PySpark
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("pareto_sql") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

In [None]:
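# Import only the functions used below.
# Note: this `sum` shadows Python's built-in sum() in the notebook.
from pyspark.sql.functions import count, desc, sum, year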
from pyspark.sql.window import Window

In [None]:
!kaggle datasets download -d mohammadtalib786/retail-sales-dataset

In [None]:
# unpacking the zip file
import zipfile
with zipfile.ZipFile('retail-sales-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
#loading the data
df1 = spark.read\
    .format("csv")\
    .option("inferSchema","true")\
    .option("header","true")\
    .option("delimiter",",")\
    .load("/content/retail_sales_dataset.csv")

In [None]:
# renaming columns
cols = df1.columns
cols_new = [col.replace(" ", "_").lower() for col in cols]
df1 = df1.toDF(*cols_new)
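
To make the renaming step concrete, here is the same normalization applied to a plain list. The header names are assumed for illustration, inferred from the snake_case column names in the problem statement; they are not read from the actual CSV.

```python
# Hypothetical raw headers, as they might appear in the CSV
raw_cols = ["Transaction ID", "Date", "Customer ID", "Product Category", "Total Amount"]

# Same rule as in the notebook: spaces become underscores, then lowercase
clean_cols = [name.replace(" ", "_").lower() for name in raw_cols]
print(clean_cols)
```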

In [None]:
#creating a view for Spark SQL
df1.createOrReplaceTempView("sales")

## Using DataFrames

In [None]:
# keep only transactions from 2023
df1 = df1.filter(year(df1.date) == 2023)

In [None]:
# total sales per customer within each category
df2 = df1.\
  groupBy(["product_category","customer_id"]).\
  agg(sum("total_amount").alias("total_sales"))

In [None]:
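# number of customers per category (df2 has one row per customer and category)
df_main = df2.groupBy("product_category").agg(count("*").alias("counts"))

win_1 = Window.partitionBy("product_category")

# running total of sales within each category, largest customers first
df3 = df2.withColumn(
    "cumulative_sales",
    sum("total_sales").over(
        win_1.orderBy(desc("total_sales"))
             .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    ),
)

# total sales of the whole category, repeated on every row
df4 = df3.withColumn("category_sales", sum("total_sales").over(win_1))

# keep customers whose running total stays within 80% of the category total
df5 = df4.filter(df4.cumulative_sales <= 0.8 * df4.category_sales)

df_pareto = df5.groupBy("product_category").agg(count("customer_id").alias("customers"))

# share of customers that falls inside the 80% bucket
df_joined = df_main.join(df_pareto, "product_category", "inner")

df_joined.selectExpr("product_category", "ROUND(customers/counts,2) as ratio").show()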

+----------------+-----+
|product_category|ratio|
+----------------+-----+
|          Beauty|  0.3|
|        Clothing|  0.3|
|     Electronics|  0.3|
+----------------+-----+
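
One detail worth keeping in mind: the `<=` filter keeps only customers whose running total is still at or below 80% of the category total, so the customer whose purchases cross the mark is left out. The plain-Python sketch below compares that strict rule with an inclusive one on made-up totals for a single category. The helper name `cutoff_counts` and the numbers are hypothetical, and this is only an illustration of the boundary choice, not a replacement for the Spark code.

```python
def cutoff_counts(customer_totals, threshold=0.8):
    """Return (strict, inclusive, n_customers) for one category."""
    ordered = sorted(customer_totals, reverse=True)

    # Accumulate by hand rather than calling the built-in sum(), which the
    # notebook's pyspark.sql.functions import shadows.
    category_sales = 0.0
    for total in ordered:
        category_sales += total
    target = threshold * category_sales

    strict = inclusive = n_customers = 0
    running = 0.0
    for total in ordered:
        before = running
        running += total
        n_customers += 1
        if running <= target:   # same condition as the notebook's filter
            strict += 1
        if before < target:     # also counts the customer who crosses the mark
            inclusive += 1
    return strict, inclusive, n_customers

# Toy per-customer totals for a single category (made-up numbers)
strict, inclusive, n = cutoff_counts([500, 250, 150, 60, 40])
print(strict / n, inclusive / n)
```

On this toy input the strict rule keeps 2 of 5 customers and the inclusive rule keeps 3 of 5. In the DataFrame version, the inclusive rule would correspond to a filter such as `cumulative_sales - total_sales < 0.8 * category_sales`; which rule is appropriate depends on how "about 80%" is meant to be read.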



## Using SQL

In [None]:
df_sql = spark.sql("""
WITH CTE1 as (
  SELECT product_category, customer_id, SUM(total_amount) as total_sales
  FROM sales
  WHERE year(date) = 2023
  GROUP BY product_category, customer_id
)
,
CTE2 as (
SELECT
  product_category,
  customer_id,
  -- category_sales: total of the whole category (named to avoid clashing with CTE1.total_sales)
  SUM(total_sales) OVER (PARTITION BY product_category) as category_sales,
  COUNT(customer_id) OVER (PARTITION BY product_category) as total_customers,
  -- running total within the category, largest customers first
  SUM(total_sales) OVER (
    PARTITION BY product_category
    ORDER BY total_sales DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as cumulative_sales
FROM CTE1
)
SELECT
  product_category,
  ROUND(COUNT(customer_id)/total_customers,2) as top_customers_ratio
FROM CTE2
WHERE cumulative_sales <= 0.8 * category_sales
GROUP BY
  product_category,
  total_customers
;
""")

df_sql.show()

+----------------+-------------------+
|product_category|top_customers_ratio|
+----------------+-------------------+
|          Beauty|                0.3|
|        Clothing|                0.3|
|     Electronics|                0.3|
+----------------+-------------------+

