In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=25f8d41ad533ad0c95a7b75061c64a73bda6bac44b49151ab733d2d52f16f802
  Stored in directory: /root/.cache/pip/wheels/5a/54/9b/a89cac960efb57c4c35d41cc7c9f7b80daa21108bc376339b7
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
  

## Ways of Clustering Data: Spark

With any dataset, we usually want to cluster, categorize, partition, and divide the data so that we can learn from it quickly. How does Spark support this?

This notebook discusses the ways we can cluster the data inside a Spark table.

1) Distribute By

2) Cluster By

3) Partition By

4) Bucket By

5) Group By

We will see each of these using the Dmart sales data.

In [2]:
# Creating the Spark session

from pyspark.sql import SparkSession
# A namespaced import avoids shadowing Python built-ins such as sum and round
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName('Cluster_Spark')
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/04/03 03:19:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
#importing the dataset

sales_raw = spark.read.csv('/kaggle/input/datasetbackups/dmart/Sales.csv',
                          inferSchema=True, sep='\t',header=True)

                                                                                

In [4]:
# Create a database so that Spark SQL tables can be saved into it.
# IF NOT EXISTS keeps the cell re-runnable.
spark.sql("CREATE DATABASE IF NOT EXISTS dmart_db")
spark.sql("USE dmart_db")

DataFrame[]

In [5]:
sales_raw.createOrReplaceTempView("sales_view")

In [6]:
# Temporary views are listed with isTemporary = true
spark.sql("SHOW TABLES").show()

+---------+----------+-----------+
|namespace| tableName|isTemporary|
+---------+----------+-----------+
|         |sales_view|       true|
+---------+----------+-----------+



## Introducing Distribute By

The DISTRIBUTE BY clause is used to repartition the data based on the input expressions. Unlike the CLUSTER BY clause, this does not sort the data within each partition.
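To make the idea concrete, here is a minimal pure-Python sketch of hash-based distribution. It is a toy model only: the `distribute_by` helper, the sample rows, and the use of `zlib.crc32` are assumptions for illustration, and Spark uses its own hash function internally. The point it shows is that rows with the same key always land in the same partition, while the order inside a partition is left untouched.

```python
import zlib
from collections import defaultdict


def distribute_by(rows, key, num_partitions):
    """Toy model: route each row to a partition chosen by hashing its key."""
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash; Spark's real hash function differs
        pid = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[pid].append(row)  # arrival order is kept, nothing is sorted
    return dict(partitions)


# Hypothetical rows, shaped like a slice of the sales data
rows = [
    {"ID": "A", "GMV": 8389},
    {"ID": "B", "GMV": 3175},
    {"ID": "C", "GMV": 8389},
    {"ID": "D", "GMV": 1829},
    {"ID": "E", "GMV": 3175},
]

parts = distribute_by(rows, "GMV", num_partitions=2)
for pid, part in sorted(parts.items()):
    print(pid, [r["GMV"] for r in part])
```

Equal GMV values end up together, but nothing guarantees that the values within one partition appear in ascending order.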

In [8]:
SQL = spark.sql

SQL("""SELECT ID, Date, GMV, MRP, Units_sold FROM sales_view LIMIT 5""").show(2)

+----------------+----------------+----+----+----------+
|              ID|            Date| GMV| MRP|Units_sold|
+----------------+----------------+----+----+----------+
|ACCCX3S58G7B5F6P|17-10-2015 15:11|6400|7190|         1|
|ACCCX3S58G7B5F6P|19-10-2015 10:07|6900|7190|         1|
+----------------+----------------+----+----+----------+
only showing top 2 rows



In [12]:
SQL("""SELECT ID, Date, GMV, MRP, Units_sold 
        FROM sales_view
        DISTRIBUTE BY CAST(GMV as int)
        LIMIT 20""").show()



+----------------+----------------+----+-----+----------+
|              ID|            Date| GMV|  MRP|Units_sold|
+----------------+----------------+----+-----+----------+
|ACCCYH98AZH5WHDF|04-10-2015 10:53|9900|15200|         1|
|ACCCYZFZZJYF95F3|04-10-2015 15:22|8389|10400|         1|
|ACCCYZFZZJYF95F3|03-10-2015 22:31|8389|10400|         1|
|ACCCYZFZZJYF95F3|02-10-2015 10:03|8389|10400|         1|
|ACCCYZFZZJYF95F3|02-10-2015 15:28|8389|10400|         1|
|ACCCYZFZZJYF95F3|02-10-2015 17:08|8389|10400|         1|
|ACCCYZFZZJYF95F3|03-10-2015 18:02|8389|10400|         1|
|ACCCYZFZZJYF95F3|05-10-2015 10:54|8389|10400|         1|
|ACCCYZFZZJYF95F3|04-10-2015 20:43|8389|10400|         1|
|ACCCZ34CBVZJTVQF|30-10-2015 15:52|3175| 3999|         1|
|ACCCZ34CBVZJTVQF|27-10-2015 13:36|3175| 3999|         1|
|ACCCZ34CBZFWKPBQ|20-10-2015 17:27|1829| 3500|         1|
|ACCCZ34CBZFWKPBQ|18-10-2015 19:32|1829| 3500|         1|
|ACCCZ34CBZFWKPBQ|20-10-2015 09:37|1829| 3500|         1|
|ACCCZ34CBZFWK

                                                                                

## Introducing Cluster By

The CLUSTER BY clause is used to first repartition the data based on the input expressions and then sort the data within each partition. This is semantically equivalent to performing a **DISTRIBUTE BY** followed by a **SORT BY**.
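The "distribute, then sort" equivalence can be sketched in plain Python. This is a toy model under stated assumptions: `cluster_by`, the sample MRP values, and the `zlib.crc32` hash are hypothetical stand-ins, not Spark's implementation. It shows that sorting happens inside each partition only, so the result is not globally ordered.

```python
import zlib


def cluster_by(rows, key, num_partitions):
    """Toy model of CLUSTER BY: hash-distribute, then sort inside each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # DISTRIBUTE BY step (crc32 is a stand-in for Spark's own hash)
        pid = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[pid].append(row)
    # SORT BY step: each partition is ordered on its own
    return [sorted(part, key=lambda r: r[key]) for part in partitions]


# Hypothetical MRP values
mrps = [3999, 49, 10400, 49, 3500, 7190, 3999]
rows = [{"ID": i, "MRP": mrp} for i, mrp in enumerate(mrps)]

clustered = cluster_by(rows, "MRP", num_partitions=2)
for pid, part in enumerate(clustered):
    print(pid, [r["MRP"] for r in part])
```

Reading the partitions one after another can therefore show the values restarting from a low number at each partition boundary.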

In [14]:
SQL("SET spark.sql.shuffle.partitions = 2;")

DataFrame[key: string, value: string]

In [16]:
SQL("""SELECT ID, Date, GMV, MRP, Units_sold 
        FROM sales_view
        CLUSTER BY MRP
        LIMIT 20""").show()



+----------------+----------------+---+---+----------+
|              ID|            Date|GMV|MRP|Units_sold|
+----------------+----------------+---+---+----------+
|CGEECC2GSWH4YYYY|29-11-2015 15:46| 79| 49|         1|
|CGEECC2GGDHVCH7F|22-12-2015 18:24| 79| 49|         1|
|CGEECC2GSWH4YYYY|07-12-2015 22:23| 79| 49|         1|
|CGEECC2GSWH4YYYY|24-12-2015 15:25| 79| 49|         1|
|CGEECC2GSWH4YYYY|05-12-2015 15:25| 79| 49|         1|
|CGEECC2GSWH4YYYY|11-12-2015 12:00| 79| 49|         1|
|CGEECC2GSWH4YYYY|03-12-2015 09:29| 79| 49|         1|
|CGEECC2GSWH4YYYY|23-12-2015 10:22| 79| 49|         1|
|CGEECC2GGDHVCH7F|07-01-2016 18:39| 79| 49|         1|
|CGEECC2GGDHVCH7F|18-01-2016 19:47| 60| 49|         1|
|CGEECC2GGDHVCH7F|24-01-2016 16:26| 60| 49|         1|
|CGEECC2GGDHVCH7F|11-01-2016 21:04| 79| 49|         1|
|CGEECC2GGDHVCH7F|03-01-2016 00:58| 79| 49|         1|
|CGEECC2GGDHVCH7F|20-01-2016 17:09| 60| 49|         1|
|CGEECC2GSWH4YYYY|22-01-2016 21:08| 49| 49|         1|
|CGEECC2GS

                                                                                

## Group By Needs No Intro

GROUP BY aggregates the values within each category and returns one row per group. The other clauses discussed in this notebook keep every row and only rearrange the data so that related rows sit together, which helps when viewing and understanding it.
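The contrast can be seen in a small pure-Python sketch that mirrors a typical aggregation: average MRP, total units, and MRP times units summed per GMV value. The three sample rows are hypothetical. Three input rows collapse into two output rows, one per distinct key.

```python
from collections import defaultdict

# Hypothetical rows, shaped like a slice of the sales data
rows = [
    {"GMV": 6900, "MRP": 7190, "Units_sold": 1},
    {"GMV": 6900, "MRP": 8000, "Units_sold": 2},
    {"GMV": 1990, "MRP": 3500, "Units_sold": 1},
]

# Collect the rows that share a key
groups = defaultdict(list)
for row in rows:
    groups[row["GMV"]].append(row)

# Reduce each group to a single summary row
summary = {
    gmv: {
        "average_mrp": round(sum(r["MRP"] for r in group) / len(group), 1),
        "total_units": sum(r["Units_sold"] for r in group),
        "total_sales": sum(r["MRP"] * r["Units_sold"] for r in group),
    }
    for gmv, group in groups.items()
}

for gmv, stats in summary.items():
    print(gmv, stats)
# 6900 {'average_mrp': 7595.0, 'total_units': 3, 'total_sales': 23190}
# 1990 {'average_mrp': 3500.0, 'total_units': 1, 'total_sales': 3500}
```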

In [18]:
SQL("""SELECT GMV, ROUND(AVG(MRP), 1) as average_mrp, 
            SUM(Units_sold) as total_units,
            SUM(MRP * Units_sold) as total_sales
        FROM sales_view
        GROUP BY GMV
        LIMIT 20""").show()



+----+-----------+-----------+-----------+
| GMV|average_mrp|total_units|total_sales|
+----+-----------+-----------+-----------+
|6900|     7763.9|         49|     380429|
|1990|     3884.1|       1408|    5375955|
|1618|     2798.8|         15|      40084|
|3324|     3508.0|          8|      22037|
|6749|    13957.8|        254|    3545274|
|6003|     6718.0|          6|      38580|
|6073|     7574.5|          2|      15149|
|6002|     7150.0|          1|       7150|
|6565|     7584.8|         52|     394410|
|6554|     7150.0|         12|      85800|
|6533|     7174.3|         14|     100440|
|6133|     7150.0|          1|       7150|
|6695|     7198.6|          7|      50390|
|6589|     7208.4|         13|      93709|
|6550|     8355.4|         63|     526390|
|6435|     7454.0|         20|     149079|
|6500|    10908.2|         56|     575472|
|5898|     7256.0|          6|      44280|
|6700|     9848.1|         14|     137873|
|6025|     7377.4|          9|      66397|
+----+-----

                                                                                

# Moving to Partitioning

Partitioning takes effect when the table is written as files to HDFS or the local file system. The clauses above only rearrange rows logically while a query runs, whereas Partition By lays the data out physically: one folder per partition value, with the data files placed inside it.
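As a rough picture of what "physical" means here, the following pure-Python sketch writes a few hypothetical rows into one folder per GMV value. The `write_partitioned` helper, the `part-0.csv` file name, and the CSV format are assumptions for illustration; Spark writes Parquet part files with generated names. The partition value is carried by the folder name, so it is left out of the file itself.

```python
import csv
import os
import tempfile


def write_partitioned(rows, partition_col, root):
    """Toy model of a partitioned write: one folder per distinct value."""
    for row in rows:
        folder = os.path.join(root, f"{partition_col}={row[partition_col]}")
        os.makedirs(folder, exist_ok=True)
        # The partition value lives in the folder name, not in the data file
        data = {k: v for k, v in row.items() if k != partition_col}
        path = os.path.join(folder, "part-0.csv")
        is_new = not os.path.exists(path)
        with open(path, "a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(data))
            if is_new:
                writer.writeheader()
            writer.writerow(data)


# Hypothetical rows
rows = [
    {"ID": "A", "GMV": "100", "MRP": 150},
    {"ID": "B", "GMV": "1000", "MRP": 1200},
    {"ID": "C", "GMV": "100", "MRP": 160},
]

with tempfile.TemporaryDirectory() as root:
    write_partitioned(rows, "GMV", root)
    layout = sorted(os.listdir(root))
    print(layout)  # ['GMV=100', 'GMV=1000']
```

One consequence of this layout is that a column with many distinct values produces just as many folders.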

In [19]:
sales_raw.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- ID_Order: double (nullable = true)
 |-- ID_Item_ordered: double (nullable = true)
 |-- GMV: string (nullable = true)
 |-- Units_sold: integer (nullable = true)
 |-- SLA: integer (nullable = true)
 |-- Product_Category: string (nullable = true)
 |-- Analytic_Category: string (nullable = true)
 |-- Sub_category: string (nullable = true)
 |-- product_analytic_vertical: string (nullable = true)
 |-- MRP: integer (nullable = true)
 |-- Procurement_SLA: integer (nullable = true)



In [23]:
# Let's create a table with partitioning built in

sales_raw.write.saveAsTable("sales_partition",
                           mode='overwrite',
                           partitionBy='GMV',
                           format='parquet')

                                                                                

In [25]:
SQL("""SHOW PARTITIONS sales_partition""").show(5)

+---------+
|partition|
+---------+
|    GMV= |
|    GMV=0|
|   GMV=10|
|  GMV=100|
| GMV=1000|
+---------+
only showing top 5 rows



In [26]:
SQL("""DESCRIBE TABLE EXTENDED sales_partition""").show()

+--------------------+---------------+-------+
|            col_name|      data_type|comment|
+--------------------+---------------+-------+
|                  ID|         string|   null|
|                Date|         string|   null|
|            ID_Order|         double|   null|
|     ID_Item_ordered|         double|   null|
|          Units_sold|            int|   null|
|                 SLA|            int|   null|
|    Product_Category|         string|   null|
|   Analytic_Category|         string|   null|
|        Sub_category|         string|   null|
|product_analytic_...|         string|   null|
|                 MRP|            int|   null|
|     Procurement_SLA|            int|   null|
|                 GMV|         string|   null|
|# Partition Infor...|               |       |
|          # col_name|      data_type|comment|
|                 GMV|         string|   null|
|                    |               |       |
|# Detailed Table ...|               |       |
|            

In [29]:
# Let's take another file

first_sales = spark.read.csv("/kaggle/input/datasetbackups/dmart/firstfile.csv",
                            inferSchema=True,header=True,sep=',')

                                                                                

In [31]:
first_sales.select("Date","Sales_name","gmv_new","units").show(2)

+-------------------+------------+-------+-----+
|               Date|  Sales_name|gmv_new|units|
+-------------------+------------+-------+-----+
|2015-07-01 00:00:00|No Promotion| 3040.0|    1|
|2015-07-01 00:00:00|No Promotion|  310.0|    1|
+-------------------+------------+-------+-----+
only showing top 2 rows



In [32]:
# Let's partition this DataFrame on Sales_name and gmv_new. First, count the rows.
first_sales.count()

                                                                                

1578079

In [35]:
# Before partitioning, let's trim the time off the date and select four columns

first_sales.selectExpr("split_part(Date, ' ',1) as day_date",
                    "Sales_name", "gmv_new", "units").show(2)

+----------+------------+-------+-----+
|  day_date|  Sales_name|gmv_new|units|
+----------+------------+-------+-----+
|2015-07-01|No Promotion| 3040.0|    1|
|2015-07-01|No Promotion|  310.0|    1|
+----------+------------+-------+-----+
only showing top 2 rows



In [36]:
trimed_sales = first_sales.selectExpr("split_part(Date, ' ',1) as day_date",
                    "Sales_name", "gmv_new", "units")

In [37]:
trimed_sales.write.saveAsTable("trimed_sales_partition",
                              mode='overwrite',
                              format='parquet',
                              partitionBy=['Sales_name','gmv_new'])

                                                                                

In [40]:
!ls /kaggle/working/spark-warehouse/dmart_db.db/sales_partition/ | head -n 10

GMV= 
GMV=0
GMV=10
GMV=100
GMV=1000
GMV=10000
GMV=10002
GMV=1001
GMV=10010
GMV=10017
ls: write error: Broken pipe


In [44]:
SQL("SHOW PARTITIONS trimed_sales_partition").show(2,truncate=False)

+-----------------------------+
|partition                    |
+-----------------------------+
|Sales_name=BED/gmv_new=1000.0|
|Sales_name=BED/gmv_new=1003.0|
+-----------------------------+
only showing top 2 rows



In [42]:
!ls /kaggle/working/spark-warehouse/dmart_db.db/trimed_sales_partition/

'Sales_name=BED'			'Sales_name=Independence Sale'
'Sales_name=BSD-5'			'Sales_name=No Promotion'
'Sales_name=Big Diwali Sale'		'Sales_name=Pacman'
'Sales_name=Christmas & New Year Sale'	'Sales_name=Rakshabandhan Sale'
'Sales_name=Daussera sale'		'Sales_name=Republic Day'
'Sales_name=Eid & Rathayatra sale'	'Sales_name=Valentine%27s Day'
'Sales_name=FHSD'			 _SUCCESS


In [46]:
!ls /kaggle/working/spark-warehouse/dmart_db.db/trimed_sales_partition/Sales_name\=BED | head -n 5

gmv_new=1000.0
gmv_new=1003.0
gmv_new=1004.0
gmv_new=10049.0
gmv_new=1010.0
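The folder names in the listings above follow a `column=value` pattern, nested in the order the partition columns were given, and special characters are percent-encoded (the apostrophe in "Valentine's Day" shows up as `%27`). A minimal sketch of that naming, assuming a small illustrative subset of escaped characters rather than Spark's exact list:

```python
# Assumed subset of characters that get percent-encoded in folder names;
# Spark's actual list may be longer.
ESCAPED = set("\"#%'*/:=?\\{[]^")


def escape(value):
    """Percent-encode the characters in ESCAPED, leave everything else as-is."""
    return "".join(f"%{ord(ch):02X}" if ch in ESCAPED else ch for ch in str(value))


def partition_path(row, partition_cols):
    """Build the nested folder path for one row, outermost column first."""
    return "/".join(f"{col}={escape(row[col])}" for col in partition_cols)


# Hypothetical row using column names from trimed_sales
row = {"day_date": "2015-07-01", "Sales_name": "Valentine's Day",
       "gmv_new": 1000.0, "units": 1}

print(partition_path(row, ["Sales_name", "gmv_new"]))
# Sales_name=Valentine%27s Day/gmv_new=1000.0
```

Swapping the order of the column list would swap the nesting, with `gmv_new` folders on the outside.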


## Introducing Bucket By

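What is the difference between Bucket By and Partition By?

Bucket By creates only files: rows are hashed on the bucket column into a fixed number of buckets, and the files sit directly in the table folder. Partition By creates one folder per partition value and then places the files inside it. An example follows.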
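Before running it on the table, the bucket assignment itself can be pictured with a toy model. The `bucket_id` helper and the `zlib.crc32` hash are assumptions for illustration, so the ids printed here will not match the ones Spark computes. What carries over is that the number of buckets is fixed up front, no matter how many distinct values the column holds.

```python
import zlib
from collections import Counter


def bucket_id(value, num_buckets):
    """Toy bucket assignment: hash the value, then take it modulo the bucket count."""
    # crc32 is a stand-in; Spark uses its own hash, so real bucket ids differ
    return zlib.crc32(str(value).encode()) % num_buckets


# A few Sales_name values taken from the partition folders listed earlier
sales_names = ["BED", "BSD-5", "Big Diwali Sale", "No Promotion",
               "Pacman", "FHSD", "Republic Day"]

assignment = {name: bucket_id(name, 10) for name in sales_names}
print(assignment)
print(Counter(assignment.values()))  # several names may share one bucket
```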

In [47]:
# Hash Sales_name into 10 buckets; mode('overwrite') keeps the cell re-runnable
trimed_sales.write.bucketBy(10, 'Sales_name').mode('overwrite').saveAsTable("bucketed_sales")

                                                                                

In [48]:
!ls /kaggle/working/spark-warehouse/dmart_db.db/bucketed_sales/ 

_SUCCESS
part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet
part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00002.c000.snappy.parquet
part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00003.c000.snappy.parquet
part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00006.c000.snappy.parquet
part-00001-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet
part-00001-9bf476ea-c936-417c-90d8-4af308d855ed_00003.c000.snappy.parquet
part-00001-9bf476ea-c936-417c-90d8-4af308d855ed_00005.c000.snappy.parquet
part-00001-9bf476ea-c936-417c-90d8-4af308d855ed_00009.c000.snappy.parquet
part-00002-9bf476ea-c936-417c-90d8-4af308d855ed_00000.c000.snappy.parquet
part-00002-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet
part-00002-9bf476ea-c936-417c-90d8-4af308d855ed_00005.c000.snappy.parquet
part-00002-9bf476ea-c936-417c-90d8-4af308d855ed_00007.c000.snappy.parquet
part-00003-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet
part-00003-9bf476ea-c936-417c
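The file names in the listing above appear to carry two numbers: the one after `part-` identifies the writing task, and the five digits after the underscore identify the bucket. That reading of the pattern is an assumption based on the listing. A short sketch that groups a few of the listed files by bucket:

```python
import re
from collections import defaultdict

# Pattern assumed from the listing: part-<task>-<uuid>_<bucket>.c000.snappy.parquet
PATTERN = re.compile(r"part-(\d{5})-[0-9a-f-]+_(\d{5})\.c\d+\.snappy\.parquet$")

# Four file names copied from the listing above
files = [
    "part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet",
    "part-00000-9bf476ea-c936-417c-90d8-4af308d855ed_00002.c000.snappy.parquet",
    "part-00001-9bf476ea-c936-417c-90d8-4af308d855ed_00001.c000.snappy.parquet",
    "part-00002-9bf476ea-c936-417c-90d8-4af308d855ed_00000.c000.snappy.parquet",
]

files_per_bucket = defaultdict(list)
for name in files:
    match = PATTERN.match(name)
    if match:
        task, bucket = int(match.group(1)), int(match.group(2))
        files_per_bucket[bucket].append(task)

print(dict(files_per_bucket))  # {1: [0, 1], 2: [0], 0: [2]}
```

Under this reading, each task writes its own file for every bucket it holds rows for, which would explain why the folder contains more than 10 files for 10 buckets.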

## Why So Many Partitioned Tables?

Table partitioning is only the beginning. When we read these partitions back, we can learn a lot more.

In part 2 of this notebook, we will see how to read the files and folders using Spark. Stay tuned.