Preparation Instructions

1. Create a PySpark DataFrame with the following schema:
OrderID (int)
CustomerName (string)
Product (string)
Category (string)
Quantity (int)
UnitPrice (int)
OrderDate (string in YYYY-MM-DD format)

2. Sample at least 12 rows across multiple categories:
"Electronics", "Clothing", "Furniture", "Books"

3. Create:
A local temporary view: "orders_local"
A global temporary view: "orders_global"

In [31]:
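from pyspark.sql import Row, SparkSession

# Reuse the notebook's existing session if there is one; otherwise start one.
spark = SparkSession.builder.getOrCreate()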

data = [
    Row(OrderID=1, CustomerName="Ahana", Product="Laptop", Category="Electronics", Quantity=2, UnitPrice=1000, OrderDate="2023-01-10"),
    Row(OrderID=2, CustomerName="Brindha", Product="Smartphone", Category="Electronics", Quantity=1, UnitPrice=800, OrderDate="2023-02-15"),
    Row(OrderID=3, CustomerName="Sindhana", Product="T-Shirt", Category="Clothing", Quantity=5, UnitPrice=20, OrderDate="2023-01-05"),
    Row(OrderID=4, CustomerName="Zara", Product="Sofa", Category="Furniture", Quantity=1, UnitPrice=15000, OrderDate="2023-03-20"),
    Row(OrderID=5, CustomerName="Elakkiya", Product="Bookshelf", Category="Furniture", Quantity=2, UnitPrice=3000, OrderDate="2023-01-30"),
    Row(OrderID=6, CustomerName="Ferose", Product="Novel", Category="Books", Quantity=3, UnitPrice=15, OrderDate="2023-04-01"),
    Row(OrderID=7, CustomerName="Goutham", Product="Tablet", Category="Electronics", Quantity=3, UnitPrice=400, OrderDate="2023-03-25"),
    Row(OrderID=8, CustomerName="Harish", Product="Jeans", Category="Clothing", Quantity=2, UnitPrice=40, OrderDate="2023-02-20"),
    Row(OrderID=9, CustomerName="Sathya", Product="Notebook", Category="Books", Quantity=4, UnitPrice=10, OrderDate="2023-01-12"),
    Row(OrderID=10, CustomerName="Jackie", Product="Chair", Category="Furniture", Quantity=5, UnitPrice=2500, OrderDate="2023-01-08"),
    Row(OrderID=11, CustomerName="Krishna", Product="Dress", Category="Clothing", Quantity=1, UnitPrice=60, OrderDate="2023-05-18"),
    Row(OrderID=12, CustomerName="Nikitha", Product="Camera", Category="Electronics", Quantity=2, UnitPrice=1200, OrderDate="2023-01-03"),
    Row(OrderID=12, CustomerName="Nikitha", Product="Dress", Category="Clothig", Quantity=2, UnitPrice=1100, OrderDate="2023-02-03"),
]
df = spark.createDataFrame(data)

df.createOrReplaceTempView("orders_local")
df.createOrReplaceGlobalTempView("orders_global")


Part A: Local View – orders_local

1. List all orders placed for "Electronics" with a Quantity of 2 or more.
2. Calculate TotalAmount (Quantity × UnitPrice) for each order.
3. Show the total number of orders per Category.
4. List orders placed in "January 2023" only.
5. Show the average UnitPrice per category.
6. Find the order with the highest total amount.
7. Drop the local view and try querying it again.

In [18]:
# 1. List all orders placed for "Electronics" with a Quantity of 2 or more.
spark.sql("select * from orders_local where Category = 'Electronics' and Quantity >= 2").show()

+-------+------------+-------+-----------+--------+---------+----------+
|OrderID|CustomerName|Product|   Category|Quantity|UnitPrice| OrderDate|
+-------+------------+-------+-----------+--------+---------+----------+
|      1|       Ahana| Laptop|Electronics|       2|     1000|2023-01-10|
|      7|     Goutham| Tablet|Electronics|       3|      400|2023-03-25|
|     12|     Nikitha| Camera|Electronics|       2|     1200|2023-01-03|
+-------+------------+-------+-----------+--------+---------+----------+



In [19]:

# 2. Calculate TotalAmount (Quantity × UnitPrice) for each order.
spark.sql("select orderId,CustomerName,Product,Category,OrderDate,Quantity*UnitPrice as TotalAmount from orders_local").show()

+-------+------------+----------+-----------+----------+-----------+
|orderId|CustomerName|   Product|   Category| OrderDate|TotalAmount|
+-------+------------+----------+-----------+----------+-----------+
|      1|       Ahana|    Laptop|Electronics|2023-01-10|       2000|
|      2|     Brindha|Smartphone|Electronics|2023-02-15|        800|
|      3|    Sindhana|   T-Shirt|   Clothing|2023-01-05|        100|
|      4|        Zara|      Sofa|  Furniture|2023-03-20|      15000|
|      5|    Elakkiya| Bookshelf|  Furniture|2023-01-30|       6000|
|      6|      Ferose|     Novel|      Books|2023-04-01|         45|
|      7|     Goutham|    Tablet|Electronics|2023-03-25|       1200|
|      8|      Harish|     Jeans|   Clothing|2023-02-20|         80|
|      9|      Sathya|  Notebook|      Books|2023-01-12|         40|
|     10|      Jackie|     Chair|  Furniture|2023-01-08|      12500|
|     11|     Krishna|     Dress|   Clothing|2023-05-18|         60|
|     12|     Nikitha|    Camera|E

In [20]:
# 3. Show the total number of orders per Category
spark.sql("select Category,count(*) as total_count from orders_local group by Category").show()

+-----------+-----------+
|   Category|total_count|
+-----------+-----------+
|Electronics|          4|
|   Clothing|          3|
|      Books|          2|
|  Furniture|          3|
+-----------+-----------+



In [22]:
# 4. List orders placed in "January 2023" only
spark.sql("select * from orders_local where OrderDate like '2023-01%'").show()

+-------+------------+---------+-----------+--------+---------+----------+
|OrderID|CustomerName|  Product|   Category|Quantity|UnitPrice| OrderDate|
+-------+------------+---------+-----------+--------+---------+----------+
|      1|       Ahana|   Laptop|Electronics|       2|     1000|2023-01-10|
|      3|    Sindhana|  T-Shirt|   Clothing|       5|       20|2023-01-05|
|      5|    Elakkiya|Bookshelf|  Furniture|       2|     3000|2023-01-30|
|      9|      Sathya| Notebook|      Books|       4|       10|2023-01-12|
|     10|      Jackie|    Chair|  Furniture|       5|     2500|2023-01-08|
|     12|     Nikitha|   Camera|Electronics|       2|     1200|2023-01-03|
+-------+------------+---------+-----------+--------+---------+----------+



In [23]:
# 5. Show the average UnitPrice per category
spark.sql("select Category,avg(UnitPrice) as avg_price from orders_local group by Category").show()

+-----------+-----------------+
|   Category|        avg_price|
+-----------+-----------------+
|Electronics|            850.0|
|   Clothing|             40.0|
|      Books|             12.5|
|  Furniture|6833.333333333333|
+-----------+-----------------+



In [24]:
# 6. Find the order with the highest total amount
spark.sql("select orderId,CustomerName,Product,Category,OrderDate,Quantity*UnitPrice as TotalAmount from orders_local order by TotalAmount desc limit 1").show()

+-------+------------+-------+---------+----------+-----------+
|orderId|CustomerName|Product| Category| OrderDate|TotalAmount|
+-------+------------+-------+---------+----------+-----------+
|      4|        Zara|   Sofa|Furniture|2023-03-20|      15000|
+-------+------------+-------+---------+----------+-----------+



In [25]:
# 7. Drop the local view and try querying it again
spark.catalog.dropTempView("orders_local")

True

In [26]:
spark.sql("Select * from orders_local").show()

AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `orders_local` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [orders_local], [], false


Part B: Global View – orders_global

1. Display all "Furniture" orders with TotalAmount above 10,000.
2. Create a column called DiscountFlag:
Mark "Yes" if Quantity > 3
Otherwise "No"
3. List customers who ordered more than 1 product type (Hint: use GROUP BY and HAVING).
4. Count number of orders per month across the dataset.
5. Rank all products by total quantity sold across all orders using a window function.
6. Run a query using a new SparkSession and the global view.

In [28]:
spark.sql("""
select *, Quantity * UnitPrice AS TotalAmount
from global_temp.orders_global
where Category = 'Furniture' and (Quantity * UnitPrice) > 10000
""").show()

+-------+------------+-------+---------+--------+---------+----------+-----------+
|OrderID|CustomerName|Product| Category|Quantity|UnitPrice| OrderDate|TotalAmount|
+-------+------------+-------+---------+--------+---------+----------+-----------+
|      4|        Zara|   Sofa|Furniture|       1|    15000|2023-03-20|      15000|
|     10|      Jackie|  Chair|Furniture|       5|     2500|2023-01-08|      12500|
+-------+------------+-------+---------+--------+---------+----------+-----------+



In [29]:
# 2. Create a column called DiscountFlag: mark "Yes" if Quantity > 3, otherwise "No".
spark.sql("""
select *, case when quantity > 3 then 'yes' else 'no' end as discountflag
from global_temp.orders_global
""").show()

+-------+------------+----------+-----------+--------+---------+----------+------------+
|OrderID|CustomerName|   Product|   Category|Quantity|UnitPrice| OrderDate|discountflag|
+-------+------------+----------+-----------+--------+---------+----------+------------+
|      1|       Ahana|    Laptop|Electronics|       2|     1000|2023-01-10|          no|
|      2|     Brindha|Smartphone|Electronics|       1|      800|2023-02-15|          no|
|      3|    Sindhana|   T-Shirt|   Clothing|       5|       20|2023-01-05|         yes|
|      4|        Zara|      Sofa|  Furniture|       1|    15000|2023-03-20|          no|
|      5|    Elakkiya| Bookshelf|  Furniture|       2|     3000|2023-01-30|          no|
|      6|      Ferose|     Novel|      Books|       3|       15|2023-04-01|          no|
|      7|     Goutham|    Tablet|Electronics|       3|      400|2023-03-25|          no|
|      8|      Harish|     Jeans|   Clothing|       2|       40|2023-02-20|          no|
|      9|      Sathya

In [32]:
# 3. List customers who ordered more than 1 product type (Hint: use GROUP BY and HAVING).
spark.sql("""
select customername, count(distinct product) as productcount
from global_temp.orders_global
group by customername
having count(distinct product) > 1""").show()

+------------+------------+
|customername|productcount|
+------------+------------+
|     Nikitha|           2|
+------------+------------+



In [34]:
# 4. Count number of orders per month across the dataset.

spark.sql("""
select substring(orderdate, 1, 7) as ordermonth, count(*) as ordercount
from global_temp.orders_global
group by substring(orderdate, 1, 7)
order by ordermonth
""").show()

+----------+----------+
|ordermonth|ordercount|
+----------+----------+
|   2023-01|         6|
|   2023-02|         3|
|   2023-03|         2|
|   2023-04|         1|
|   2023-05|         1|
+----------+----------+
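
The month grouping works because OrderDate is a string in YYYY-MM-DD format, so its first seven characters identify the month. As a rough cross-check outside Spark, the following plain-Python sketch counts the same prefixes; the `order_dates` list is copied by hand from the sample rows and `orders_per_month` is an illustrative name.

```python
from collections import Counter

# OrderDate values copied from the 13 sample rows.
order_dates = [
    "2023-01-10", "2023-02-15", "2023-01-05", "2023-03-20", "2023-01-30",
    "2023-04-01", "2023-03-25", "2023-02-20", "2023-01-12", "2023-01-08",
    "2023-05-18", "2023-01-03", "2023-02-03",
]

# date[:7] keeps "YYYY-MM", the same prefix that substring(orderdate, 1, 7)
# returns (SQL positions are 1-based, Python slices are 0-based).
orders_per_month = Counter(date[:7] for date in order_dates)

# Sorting the string keys gives chronological order because the format is
# zero-padded year-month.
for month, count in sorted(orders_per_month.items()):
    print(month, count)
```

This relies on every date being zero-padded; a value such as "2023-1-5" would land in a different group.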



In [36]:
# 5. Rank all products by total quantity sold across all orders using a window function.
from pyspark.sql.functions import col, rank, sum as _sum
from pyspark.sql.window import Window

products_df = spark.sql("""
    select product, sum(quantity) as totalsold
    from global_temp.orders_global
    group by product
""")

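# No partitionBy here: Spark moves all rows into a single partition for this
# window, which is acceptable for a small aggregate like this one.
windowSpec = Window.orderBy(col("totalsold").desc())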
ranked = products_df.withColumn("Rank", rank().over(windowSpec))
ranked.show()


+----------+---------+----+
|   product|totalsold|Rank|
+----------+---------+----+
|   T-Shirt|        5|   1|
|     Chair|        5|   1|
|  Notebook|        4|   3|
|     Novel|        3|   4|
|     Dress|        3|   4|
|    Tablet|        3|   4|
|    Laptop|        2|   7|
| Bookshelf|        2|   7|
|    Camera|        2|   7|
|     Jeans|        2|   7|
|      Sofa|        1|  11|
|Smartphone|        1|  11|
+----------+---------+----+
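
The gaps in the Rank column (1, 1, 3, 4, 4, 4, 7, ...) come from how rank() treats ties: tied rows share a rank, and the next rank skips ahead by the number of tied rows. The sketch below reproduces that behaviour in plain Python and contrasts it with dense_rank-style numbering. It is only an illustration: the totals are copied from the output above, and the helper names `rank_with_gaps` and `dense_rank` are hypothetical, not Spark APIs.

```python
# Total quantity per product, copied from the ranked output above.
total_sold = {
    "T-Shirt": 5, "Chair": 5, "Notebook": 4, "Novel": 3, "Dress": 3,
    "Tablet": 3, "Laptop": 2, "Bookshelf": 2, "Camera": 2, "Jeans": 2,
    "Sofa": 1, "Smartphone": 1,
}


def rank_with_gaps(totals):
    """Like rank(): 1 + the number of rows with a strictly larger total."""
    values = list(totals.values())
    return {name: 1 + sum(v > total for v in values) for name, total in totals.items()}


def dense_rank(totals):
    """Like dense_rank(): 1 + the number of distinct larger totals."""
    distinct = set(totals.values())
    return {name: 1 + sum(v > total for v in distinct) for name, total in totals.items()}


ranks = rank_with_gaps(total_sold)
dense = dense_rank(total_sold)
for name in total_sold:
    print(f"{name:<10} total={total_sold[name]} rank={ranks[name]} dense_rank={dense[name]}")
```

If consecutive numbering without gaps is preferred, swapping rank() for dense_rank() in the window expression is the usual choice; row_number() would instead assign a unique number to each row and break ties arbitrarily.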



In [37]:
# 6. Run a query using a new SparkSession and the global view.

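# SparkSession.builder.getOrCreate() would return the session that is already
# running, so use newSession() to get a separate session on the same SparkContext.
# Global temp views are shared across sessions; local temp views are not.
new_spark = spark.newSession()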
new_spark.sql("select * from global_temp.orders_global where Category = 'Books'").show()


+-------+------------+--------+--------+--------+---------+----------+
|OrderID|CustomerName| Product|Category|Quantity|UnitPrice| OrderDate|
+-------+------------+--------+--------+--------+---------+----------+
|      6|      Ferose|   Novel|   Books|       3|       15|2023-04-01|
|      9|      Sathya|Notebook|   Books|       4|       10|2023-01-12|
+-------+------------+--------+--------+--------+---------+----------+



Bonus Challenges

In [42]:
# Save only the "Books" orders as a separate global temporary view.
books_df = df.filter(col("Category") == "Books")
books_df.createOrReplaceGlobalTempView("books_global")
books_df.show()

+-------+------------+--------+--------+--------+---------+----------+
|OrderID|CustomerName| Product|Category|Quantity|UnitPrice| OrderDate|
+-------+------------+--------+--------+--------+---------+----------+
|      6|      Ferose|   Novel|   Books|       3|       15|2023-04-01|
|      9|      Sathya|Notebook|   Books|       4|       10|2023-01-12|
+-------+------------+--------+--------+--------+---------+----------+



In [39]:
# Find the top-selling product in each category by total quantity.
from pyspark.sql.functions import row_number

agg_df = df.groupBy("Category", "Product").agg(_sum("Quantity").alias("TotalQty"))
windowSpec = Window.partitionBy("Category").orderBy(col("TotalQty").desc())
top_products = agg_df.withColumn("rank", row_number().over(windowSpec)).filter("rank = 1")
top_products.show()


+-----------+--------+--------+----+
|   Category| Product|TotalQty|rank|
+-----------+--------+--------+----+
|      Books|Notebook|       4|   1|
|    Clothig|   Dress|       2|   1|
|   Clothing| T-Shirt|       5|   1|
|Electronics|  Tablet|       3|   1|
|  Furniture|   Chair|       5|   1|
+-----------+--------+--------+----+



In [41]:
# Keep every order whose Category is not "Clothing" and expose it as a temp view.
filtered_df = df.filter(col("Category") != "Clothing")
filtered_df.createOrReplaceTempView("filtered_orders")
filtered_df.show()

+-------+------------+----------+-----------+--------+---------+----------+
|OrderID|CustomerName|   Product|   Category|Quantity|UnitPrice| OrderDate|
+-------+------------+----------+-----------+--------+---------+----------+
|      1|       Ahana|    Laptop|Electronics|       2|     1000|2023-01-10|
|      2|     Brindha|Smartphone|Electronics|       1|      800|2023-02-15|
|      4|        Zara|      Sofa|  Furniture|       1|    15000|2023-03-20|
|      5|    Elakkiya| Bookshelf|  Furniture|       2|     3000|2023-01-30|
|      6|      Ferose|     Novel|      Books|       3|       15|2023-04-01|
|      7|     Goutham|    Tablet|Electronics|       3|      400|2023-03-25|
|      9|      Sathya|  Notebook|      Books|       4|       10|2023-01-12|
|     10|      Jackie|     Chair|  Furniture|       5|     2500|2023-01-08|
|     12|     Nikitha|    Camera|Electronics|       2|     1200|2023-01-03|
|     12|     Nikitha|     Dress|    Clothig|       2|     1100|2023-02-03|
+-------+---