# Aggregating Features Regarding Orders by Merchants
## Author: Dulan Wijeratne 1181873

In this notebook we will aggregate the features regarding orders by Merchant, as well as creating new features related to orders.

To start we will create a Spark session and import the orders dataset that contains all the features that relate to orders.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)

your 131072x1 screen size is bogus. expect trouble
23/09/25 17:53:29 WARN Utils: Your hostname, DulanComputer resolves to a loopback address: 127.0.1.1; using 172.30.15.25 instead (on interface eth0)
23/09/25 17:53:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/25 17:53:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [18]:
orders = spark.read.parquet("../../../data/insights/pre_insights/orders.parquet/")

In [None]:
orders.columns

In [None]:
orders.show(truncate= False)

### Aggreagation

Next we will aggregate the orders data by merchant abn.

In [12]:
orders_aggregated = orders.groupBy("merchant_abn").agg\
                        (f.count("order_datetime").alias("number_of_orders"),
                         f.avg("dollar_value").alias("average_cost_of_order"))

### Feature Engineering

1. Finding average difference in consumers per month

In [19]:
orders = orders.withColumn("order_month", f.month(orders.order_datetime)).withColumn("order_year", f.year(orders.order_datetime))

In [20]:
orders_by_month_year = orders.groupBy("merchant_abn","order_year","order_month").agg(f.countDistinct((orders.consumer_id)).alias("monthly_distinct_consumers"))

In [21]:
orders_by_month_year = orders_by_month_year.orderBy(f.col("order_year").asc(),f.col("order_month").asc(),f.col("monthly_distinct_consumers").desc())

In [7]:
from pyspark.sql.window import Window
window_spec = Window.partitionBy("merchant_abn").orderBy("order_year","order_month")
orders_month_diff = orders_by_month_year.withColumn("month_diff", f.col("monthly_distinct_consumers") - f.lag(f.col("monthly_distinct_consumers")).over(window_spec))
orders_month_diff = orders_month_diff.filter(orders_month_diff.month_diff.isNull() == False)
orders_month_diff = orders_month_diff.groupby("merchant_abn").agg(f.avg("month_diff").alias("average_monthly_diff_consumers"),f.sum("month_diff").alias("consumer_diff_over_period"))

In [13]:
orders_month_diff.orderBy(f.col("average_monthly_diff_consumers").asc()).show()

                                                                                

+------------+------------------------------+
|merchant_abn|average_monthly_diff_consumers|
+------------+------------------------------+
| 31686734877|           -0.5789473684210527|
| 26259500279|           -0.5263157894736842|
| 84391677958|           -0.5263157894736842|
| 55555661470|                          -0.5|
| 54277261175|                          -0.5|
| 84787662573|                          -0.5|
| 27657048362|          -0.47368421052631576|
| 38672941140|          -0.47368421052631576|
| 91178385997|          -0.47368421052631576|
| 33846458525|          -0.42105263157894735|
| 37442271968|          -0.42105263157894735|
| 48890662808|          -0.42105263157894735|
| 69073210783|          -0.42105263157894735|
| 70713877189|          -0.42105263157894735|
| 77421432003|          -0.42105263157894735|
| 96230979998|           -0.3888888888888889|
| 37358528402|                        -0.375|
| 25607153542|           -0.3684210526315789|
| 58459771721|           -0.368421

2. Finding Rate of growth

In [32]:
from pyspark.sql.window import Window
window_spec = Window.partitionBy("merchant_abn").orderBy("order_year","order_month")
orders_growth_rate = orders_by_month_year.withColumn("growth_rate",f.col("monthly_distinct_consumers") / f.lag(f.col("monthly_distinct_consumers")).over(window_spec))
orders_growth_rate = orders_growth_rate.filter(orders_growth_rate.growth_rate.isNull() == False)
orders_growth_rate = orders_growth_rate.withColumn("growth_rate", f.col("growth_rate") - 1)

In [33]:
orders_growth_rate.show()

                                                                                

+------------+----------+-----------+--------------------------+--------------------+
|merchant_abn|order_year|order_month|monthly_distinct_consumers|         growth_rate|
+------------+----------+-----------+--------------------------+--------------------+
| 10023283211|      2021|          2|                         3|                null|
| 10023283211|      2021|          3|                        97|  31.333333333333336|
| 10023283211|      2021|          4|                       111| 0.14432989690721643|
| 10023283211|      2021|          5|                       122|  0.0990990990990992|
| 10023283211|      2021|          6|                       117|-0.04098360655737...|
| 10023283211|      2021|          7|                       130| 0.11111111111111116|
| 10023283211|      2021|          8|                       119|-0.08461538461538465|
| 10023283211|      2021|          9|                       150| 0.26050420168067223|
| 10023283211|      2021|         10|                 

3. Finding the average dollar amount a consumer spent on a particular merchant.

In [14]:
merchant_consumer_average = orders.groupBy("merchant_abn","consumer_id").agg(f.sum("dollar_value").alias("total_spend_per_consumer"))
merchant_consumer_average = merchant_consumer_average.groupBy("merchant_abn").agg(f.avg("total_spend_per_consumer").alias("average_spend_per_consumer"))

In [None]:
merchant_consumer_average.show()

### Join
Now we will join the new features back into the aggregated dataset

In [15]:
orders_aggregated = orders_aggregated.join(merchant_consumer_average, on = "merchant_abn", how = "inner").\
                    join(orders_by_month_year, on = "merchant_abn", how = "inner").\
                    join(orders_growth_rate, on = "merchant_abn", how = "inner")

In [16]:
orders_aggregated.show(truncate=False)

                                                                                

+------------+----------------+---------------------+--------------------------+------------------------------+
|merchant_abn|number_of_orders|average_cost_of_order|average_spend_per_consumer|average_monthly_diff_consumers|
+------------+----------------+---------------------+--------------------------+------------------------------+
|19839532017 |614             |157.0                |159.86401326699834        |1.0                           |
|83412691377 |11928           |34.97122412593614    |46.4000846912309          |29.4                          |
|15613631617 |1483            |303.77953163770877   |315.700802676049          |3.65                          |
|38700038932 |5944            |1344.3426950083588   |1550.4022078249302        |13.3                          |
|73256306726 |4361            |283.94461992895316   |317.58976340347914        |9.1                           |
|35344855546 |1274            |89.12365168735937    |91.78943593346469         |3.1                     

### Save the dataset

In [17]:
orders_aggregated.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/orders_agg.parquet")

                                                                                

In [None]:
spark.stop()