# Aggregating Order Features by Merchant
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the order features by merchant and create new order-related features.

To start, we create a Spark session and load the orders dataset, which contains all the order-related features.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("orders_insights")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)



In [3]:
orders = spark.read.parquet("../../../data/insights/pre_insights/orders.parquet/")

                                                                                

In [4]:
orders.columns

['merchant_abn',
 'merchant_name',
 'consumer_id',
 'order_datetime',
 'dollar_value']

In [5]:
orders.show(truncate=False)


+------------+-------------------------------------+-----------+--------------+------------------+
|merchant_abn|merchant_name                        |consumer_id|order_datetime|dollar_value      |
+------------+-------------------------------------+-----------+--------------+------------------+
|16570599421 |Non Magna Nam PC                     |1343547    |2021-08-16    |12.142216147150515|
|63290521567 |Vehicula Pellentesque Corporation    |1343547    |2021-05-21    |4.845198213342123 |
|13514558491 |Magna Praesent PC                    |1343547    |2021-06-20    |128.238067391034  |
|22227727512 |Malesuada Integer Id Foundation      |1343547    |2021-07-24    |223.82738875061688|
|79827781481 |Amet Risus Inc.                      |1343547    |2021-08-17    |1430.2642751352505|
|35556933338 |Semper Cursus Integer Limited        |1343547    |2021-06-23    |36.771899091579016|
|64203420245 |Pede Nonummy Corp.                   |1343547    |2021-06-12    |13.270749560868401|
|470864120

                                                                                

In [7]:
# Extract the month and year of each order for the monthly aggregations below.
orders = (
    orders.withColumn("order_month", f.month("order_datetime"))
    .withColumn("order_year", f.year("order_datetime"))
)

### Aggregation

Next we aggregate the orders data by merchant ABN, counting each merchant's orders and averaging their dollar value.

In [7]:
orders_aggregated = orders.groupBy("merchant_abn").agg(
    f.count("order_datetime").alias("number_of_orders"),
    f.avg("dollar_value").alias("average_cost_of_order"),
)

### Feature Engineering

1. Finding the average month-to-month change in distinct consumers for each merchant

In [8]:
# Count the distinct consumers each merchant served in each calendar month.
orders_by_month_year = orders.groupBy("merchant_abn", "order_year", "order_month").agg(
    f.countDistinct("consumer_id").alias("monthly_distinct_consumers")
)
orders_by_month_year = orders_by_month_year.orderBy(
    f.col("order_year").asc(), f.col("order_month").asc(), f.col("monthly_distinct_consumers").desc()
)

In [9]:
from pyspark.sql.window import Window

# Order each merchant's months chronologically so lag() returns the previous recorded month.
window_spec = Window.partitionBy("merchant_abn").orderBy("order_year", "order_month")
orders_month_diff = orders_by_month_year.withColumn(
    "month_diff",
    f.col("monthly_distinct_consumers") - f.lag("monthly_distinct_consumers").over(window_spec),
)
# lag() is null for a merchant's first month; avg() and sum() skip nulls, so no filter is needed.
orders_month_diff = orders_month_diff.groupBy("merchant_abn").agg(
    f.avg("month_diff").alias("average_monthly_diff_consumers"),
    f.sum("month_diff").alias("consumer_diff_over_period"),
)

In [10]:
orders_month_diff.count()
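
To make the two aggregates concrete, here is a minimal plain-Python sketch of the same arithmetic for a single merchant. The monthly counts are made-up illustrative values, not taken from the dataset, and the sketch assumes the months are already in chronological order.

```python
# Hypothetical monthly distinct-consumer counts for one merchant, in date order.
monthly_distinct_consumers = [10, 14, 12, 30]

# Equivalent of lag(): the previous month's value, None for the first month.
lagged = [None] + monthly_distinct_consumers[:-1]

# month_diff is null (None) wherever there is no previous month.
month_diff = [
    None if prev is None else cur - prev
    for cur, prev in zip(monthly_distinct_consumers, lagged)
]

# Like Spark's avg() and sum(), skip the nulls when aggregating.
non_null = [d for d in month_diff if d is not None]
average_monthly_diff_consumers = sum(non_null) / len(non_null)
consumer_diff_over_period = sum(non_null)

print(month_diff)                  # [None, 4, -2, 18]
print(consumer_diff_over_period)   # 20
```

Because the differences telescope, `consumer_diff_over_period` is simply the last month's count minus the first month's count (30 - 10 = 20 here). Note also that a month with no orders produces no row in `orders_by_month_year`, so the lag compares against the previous recorded month rather than the previous calendar month.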

2. Finding the rate of growth in consumers for each merchant

In [11]:
from pyspark.sql.window import Window
window_spec = Window.partitionBy("merchant_abn").orderBy("order_year","order_month")
# Each month's distinct consumers relative to the previous recorded month, minus 1, gives the growth rate.
orders_growth_rate = orders_by_month_year.withColumn(
    "growth_rate",
    f.col("monthly_distinct_consumers") / f.lag("monthly_distinct_consumers").over(window_spec) - 1,
)
# A merchant's first month has no predecessor, so its growth_rate is null and avg() skips it.
orders_growth_rate = orders_growth_rate.groupBy("merchant_abn").agg(
    f.avg("growth_rate").alias("average_growth_consumers")
)

In [12]:
orders_growth_rate.count()
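
As a worked illustration of the growth-rate feature, the following plain-Python sketch repeats the calculation for one merchant. The counts are hypothetical, and the compound rate at the end is shown only for comparison; it is not a feature computed in this notebook.

```python
# Hypothetical monthly distinct-consumer counts for one merchant, in date order.
monthly_distinct_consumers = [10, 15, 12, 18]

# Month-over-month growth: current / previous - 1 (undefined for the first month).
growth_rate = [
    cur / prev - 1
    for prev, cur in zip(monthly_distinct_consumers[:-1], monthly_distinct_consumers[1:])
]

# The feature is the arithmetic mean of the monthly growth rates.
average_growth_consumers = sum(growth_rate) / len(growth_rate)

# For comparison only: the compound monthly rate over the same period.
periods = len(monthly_distinct_consumers) - 1
compound_rate = (monthly_distinct_consumers[-1] / monthly_distinct_consumers[0]) ** (1 / periods) - 1

print(growth_rate)
print(average_growth_consumers)
print(compound_rate)
```

The arithmetic mean of the rates (about 0.27 here) is higher than the compound rate (about 0.22), so the feature should be read as an average of monthly changes rather than as an overall growth rate for the period.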

3. Finding the average dollar amount a consumer spent on a particular merchant.

In [13]:
merchant_consumer_average = orders.groupBy("merchant_abn","consumer_id").agg(f.sum("dollar_value").alias("total_spend_per_consumer"))
merchant_consumer_average = merchant_consumer_average.groupBy("merchant_abn").agg(f.avg("total_spend_per_consumer").alias("average_spend_per_consumer"))

In [14]:
merchant_consumer_average.count()
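
The two-step grouping above differs from the per-order average computed in the aggregation step. A small plain-Python sketch with made-up orders for a single merchant shows the difference; the variable `orders_for_merchant` is hypothetical.

```python
from collections import defaultdict

# Hypothetical (consumer_id, dollar_value) orders for a single merchant.
orders_for_merchant = [(1, 10.0), (1, 30.0), (2, 20.0)]

# Step 1: total spend per consumer (the inner groupBy on merchant_abn, consumer_id).
total_spend_per_consumer = defaultdict(float)
for consumer_id, dollar_value in orders_for_merchant:
    total_spend_per_consumer[consumer_id] += dollar_value

# Step 2: average those totals across consumers (the outer groupBy on merchant_abn).
average_spend_per_consumer = sum(total_spend_per_consumer.values()) / len(total_spend_per_consumer)

# For comparison: the per-order average from the aggregation step.
average_cost_of_order = sum(value for _, value in orders_for_merchant) / len(orders_for_merchant)

print(average_spend_per_consumer)  # 30.0
print(average_cost_of_order)       # 20.0
```

The two values coincide only when every consumer places exactly one order, so a gap between them indicates repeat customers.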

4. Finding revenue

In [15]:
merchant_revenue = orders.groupBy("merchant_abn").agg(f.sum("dollar_value").alias("merchant_revenue"))
merchant_revenue = merchant_revenue.select("*", f.round(f.col("merchant_revenue"),2).alias("merchant_revenue_rounded")).drop("merchant_revenue")

In [16]:
merchant_revenue.count()

5. Finding how many months lie between the first and last transaction for each merchant

In [17]:
# min() and max() return the earliest and latest order dates deterministically;
# first() and last() after an orderBy are not guaranteed to respect the sort once rows are grouped.
grouped_sorted_orders = orders.groupBy("merchant_abn").agg(
    f.min("order_datetime").alias("first_recorded_transaction"),
    f.max("order_datetime").alias("last_recorded_transaction"),
)
grouped_sorted_orders = grouped_sorted_orders.withColumn(
    "transcation_period_months",
    f.months_between(f.col("last_recorded_transaction"), f.col("first_recorded_transaction")),
)

In [18]:
grouped_sorted_orders.count()
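
The value returned by `months_between` can be fractional. The sketch below is a plain-Python approximation of the rule described in the Spark documentation for date-only inputs (whole months when both dates share a day of month or are both month ends, otherwise a 31-day month is assumed). The helper `months_between_dates` is a hypothetical name used only for illustration, not part of PySpark.

```python
import calendar
from datetime import date


def months_between_dates(end: date, start: date) -> float:
    """Approximate Spark's months_between for date-only inputs (illustrative sketch)."""
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_month_end = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_month_end = start.day == calendar.monthrange(start.year, start.month)[1]
    # Same day of month, or both dates are month ends: return a whole number of months.
    if end.day == start.day or (end_is_month_end and start_is_month_end):
        return float(whole_months)
    # Otherwise the day difference is converted assuming 31 days per month.
    return round(whole_months + (end.day - start.day) / 31, 8)


print(months_between_dates(date(2022, 8, 28), date(2021, 2, 28)))  # 18.0
print(months_between_dates(date(2021, 3, 31), date(2021, 2, 28)))  # 1.0
print(months_between_dates(date(2021, 3, 15), date(2021, 2, 28)))
```

A merchant whose first and last orders fall on different days of the month will therefore get a non-integer `transcation_period_months`, which is worth keeping in mind if the feature is later rounded or bucketed.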

6. Finding revenue growth rate 

### Join
Now we join the new features back into the aggregated dataset. Every feature table is keyed by `merchant_abn`, so the inner joins keep one row per merchant.

In [19]:
orders_aggregated = (
    orders_aggregated.join(merchant_consumer_average, on="merchant_abn", how="inner")
    .join(orders_month_diff, on="merchant_abn", how="inner")
    .join(orders_growth_rate, on="merchant_abn", how="inner")
    .join(merchant_revenue, on="merchant_abn", how="inner")
    .join(grouped_sorted_orders, on="merchant_abn", how="inner")
)

In [20]:
orders_aggregated.show()

                                                                                

+------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+-------------------+------------------------+--------------------------+-------------------------+-------------------------+
|merchant_abn|number_of_orders|average_cost_of_order|average_spend_per_consumer|average_monthly_diff_consumers|consumer_diff_over_period|     average_growth|merchant_revenue_rounded|first_recorded_transaction|last_recorded_transaction|transcation_period_months|
+------------+----------------+---------------------+--------------------------+------------------------------+-------------------------+-------------------+------------------------+--------------------------+-------------------------+-------------------------+
| 19839532017|             614|                157.0|        159.86401326699834|                           1.0|                       20| 0.5481852724993155|                 96398.0|                2021-02-28|     

### Save the dataset

In [21]:
orders_aggregated.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/orders_agg.parquet")

                                                                                

In [23]:
spark.stop()