# Aggregating Features Regarding Postcodes by Merchants
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the postcode-level features by merchant and create new features related to postcodes.
To start, we create a Spark session and import the postcode dataset, which contains all the features that relate to consumer postcodes.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("feature_engineering")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)



In [3]:
postcode = spark.read.parquet("../../../data/insights/pre_insights/postcode.parquet")

                                                                                

In [4]:
postcode.columns

['merchant_abn',
 'consumer_postcode',
 'consumer_id',
 'merchant_name',
 'estimated_population',
 'median_age',
 'median_mortgage_monthly',
 'total_weekly_personal_income',
 'median_weekly_rent',
 'total_weekly_fam_income',
 'avg_num_persons_per_bedroom',
 'total_hhd_income_weekly',
 'avg_household_size']

In [5]:
postcode.show()

                                                                                

+------------+-----------------+-----------+--------------------+--------------------+----------+-----------------------+----------------------------+------------------+-----------------------+---------------------------+-----------------------+------------------+
|merchant_abn|consumer_postcode|consumer_id|       merchant_name|estimated_population|median_age|median_mortgage_monthly|total_weekly_personal_income|median_weekly_rent|total_weekly_fam_income|avg_num_persons_per_bedroom|total_hhd_income_weekly|avg_household_size|
+------------+-----------------+-----------+--------------------+--------------------+----------+-----------------------+----------------------------+------------------+-----------------------+---------------------------+-----------------------+------------------+
| 20985347699|             3332|    1343547|    Semper Tellus PC|              8074.0|      40.0|                 1733.0|                       780.0|             308.0|                 2103.0|            

In [6]:
postcode_agg = postcode.groupBy("merchant_abn").agg(
    # first() returns an arbitrary row's value; this is safe only if each ABN has a single merchant_name
    f.first("merchant_name").alias("name"),
    f.countDistinct("consumer_postcode").alias("number_of_postcodes"),
    # The averages below are taken over all rows for the merchant
    f.avg("total_weekly_personal_income").alias("avg_total_weekly_personal_income"),
    f.avg("total_weekly_fam_income").alias("avg_total_weekly_fam_income"),
    f.avg("median_age").alias("avg_median_age"),
    f.avg("avg_household_size").alias("avg_household_size"),
)
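
Because `f.avg` averages over every row in a merchant's group, a postcode that appears on many rows pulls the average towards its own values. The plain-Python sketch below illustrates this on invented toy rows; whether the real `postcode` table actually repeats a postcode across rows for one merchant is an assumption here, not something this notebook confirms.

```python
# Toy (merchant_abn, consumer_postcode, median_age) rows; all values are invented.
rows = [
    (111, 3000, 30.0), (111, 3000, 30.0), (111, 3000, 30.0),
    (111, 3051, 50.0),
]

# Row-weighted mean: what an average over all rows of the group produces.
ages = [age for _, _, age in rows]
row_weighted = sum(ages) / len(ages)

# Mean over distinct postcodes: each postcode contributes once.
age_by_postcode = {postcode: age for _, postcode, age in rows}
per_postcode = sum(age_by_postcode.values()) / len(age_by_postcode)

print(row_weighted, per_postcode)
```

The row-weighted version describes the typical customer record of a merchant, while the per-postcode version describes the typical area it serves; which one is wanted depends on how the feature is used downstream.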

### Feature Engineering

1. Finding the reach of a merchant
For this analysis we define reach as the number of distinct postcodes that a merchant serves divided by the total number of distinct postcodes in the dataset.

In [7]:
total_number_postcodes = postcode.select(f.col("consumer_postcode")).distinct().count()
postcode_agg = postcode_agg.withColumn("postcode_reach", postcode_agg.number_of_postcodes/total_number_postcodes)
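
As a rough illustration of the reach definition, here is the same arithmetic in plain Python on a handful of invented rows. The merchant numbers and postcodes are hypothetical and stand in for `merchant_abn` and `consumer_postcode`.

```python
# Toy (merchant_abn, consumer_postcode) rows; all values are invented.
rows = [
    (111, 3000), (111, 3000), (111, 3051),
    (222, 3000), (222, 3051), (222, 3121), (222, 3142),
]

# Denominator: distinct postcodes seen anywhere in the data.
all_postcodes = {postcode for _, postcode in rows}

# Numerator: distinct postcodes seen for each merchant.
postcodes_by_merchant = {}
for merchant, postcode in rows:
    postcodes_by_merchant.setdefault(merchant, set()).add(postcode)

reach = {m: len(p) / len(all_postcodes) for m, p in postcodes_by_merchant.items()}
print(reach)
```

Note that the denominator is the set of postcodes present in the data, so a reach of 1.0 means a merchant appears in every postcode in this dataset, not necessarily in every postcode that exists.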

                                                                                

2. Finding the average number of consumers that a merchant serves per postcode.

In [8]:
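# f.count counts every row with a non-null consumer_id, so a consumer who appears on several rows is counted each time
consumer_id_count_per_postcode = postcode.groupBy("merchant_abn","consumer_postcode").agg(f.count("consumer_id").alias("number_of_consumers"))
# Average the per-postcode counts within each merchant
avg_num_of_consumers_per_postcode = consumer_id_count_per_postcode.groupBy("merchant_abn").agg(f.avg("number_of_consumers").alias("avg_num_of_consumers_per_postcode"))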
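
The two-step aggregation above (count per merchant and postcode, then average per merchant) can be sketched in plain Python. The toy rows are invented, and the sketch assumes a consumer may appear on more than one row, which this notebook does not confirm. It contrasts a row count, as `f.count` gives, with a distinct count, as `f.countDistinct` would give.

```python
from collections import defaultdict

# Toy (merchant_abn, consumer_postcode, consumer_id) rows; all values are invented.
rows = [
    (111, 3000, "a"), (111, 3000, "a"), (111, 3000, "b"),
    (111, 3051, "c"),
]

row_counts = defaultdict(int)   # rows per (merchant, postcode)
unique_ids = defaultdict(set)   # distinct consumers per (merchant, postcode)
for merchant, postcode, consumer in rows:
    row_counts[(merchant, postcode)] += 1
    unique_ids[(merchant, postcode)].add(consumer)

def mean_per_merchant(per_pair):
    """Average the per-(merchant, postcode) values within each merchant."""
    grouped = defaultdict(list)
    for (merchant, _), value in per_pair.items():
        grouped[merchant].append(value)
    return {m: sum(v) / len(v) for m, v in grouped.items()}

avg_rows = mean_per_merchant(row_counts)
avg_unique = mean_per_merchant({k: len(v) for k, v in unique_ids.items()})
print(avg_rows, avg_unique)
```

If the two results differ on the real data, the row-count version is closer to "records per postcode" than "consumers per postcode", and the distinct count may be the better match for the feature name.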

### Join
Now we join the average number of consumers per postcode onto the aggregated dataset, matching on `merchant_abn`.

In [9]:
postcode_agg = postcode_agg.join(avg_num_of_consumers_per_postcode, on = "merchant_abn", how = "inner")

In [10]:
postcode_agg.orderBy(f.col("avg_num_of_consumers_per_postcode").desc()).show(truncate = False)



+------------+---------------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+------------------+---------------------------------+
|merchant_abn|name                             |number_of_postcodes|avg_total_weekly_personal_income|avg_total_weekly_fam_income|avg_median_age    |avg_household_size|postcode_reach    |avg_num_of_consumers_per_postcode|
+------------+---------------------------------+-------------------+--------------------------------+---------------------------+------------------+------------------+------------------+---------------------------------+
|24852446429 |Erat Vitae LLP                   |2639               |789.8770611930597               |1977.015175100498          |43.09343523206664 |2.4571954526448265|1.0               |91.52974611595302                |
|86578477987 |Leo In Consulting                |2639               |790.1220326409496               |1978.2443615788

                                                                                

### Saving the Data

In [11]:
postcode_agg.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/postcode_agg.parquet")

                                                                                

In [12]:
spark.stop()