# Aggregating Features Regarding Postcodes by Merchants
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the postcode-related features by merchant and create new features based on postcode.
We start by creating a Spark session and loading the postcode dataset, which contains all the features that relate to consumer postcodes.

In [None]:
from pyspark.sql import SparkSession, functions as f

In [None]:
spark = (
    SparkSession.builder.appName("feature_engineering")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "3g")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

In [None]:
postcode = spark.read.parquet("../../../data/insights/pre_insights/postcode.parquet")

In [None]:
postcode.columns

In [None]:
postcode.show()

In [None]:
# One row per merchant: name, number of distinct postcodes, and row-level averages
postcode_agg = postcode.groupBy("merchant_abn").agg(
    # Each ABN is expected to carry a single merchant name, so the first value is used
    f.first("merchant_name").alias("name"),
    f.countDistinct("consumer_postcode").alias("number_of_postcodes"),
    f.avg("total_weekly_personal_income").alias("avg_total_weekly_personal_income"),
    f.avg("total_weekly_fam_income").alias("avg_total_weekly_fam_income"),
    f.avg("median_age").alias("avg_median_age"),
    f.avg("avg_household_size").alias("avg_household_size"),
)
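
To make the aggregation concrete, the following plain-Python sketch mimics the `groupBy("merchant_abn")` step on a few made-up rows. The ABNs, postcodes, and ages are invented for illustration, and the helper `aggregate_by_merchant` is hypothetical. It shows that the distinct count includes each postcode once, while the average is taken over rows, so a postcode that appears on more rows weighs more in the average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (merchant_abn, consumer_postcode, median_age)
rows = [
    (111, "3000", 30.0),
    (111, "3000", 30.0),
    (111, "3051", 42.0),
    (222, "3000", 30.0),
]


def aggregate_by_merchant(rows):
    """Mimic countDistinct(consumer_postcode) and avg(median_age) per merchant."""
    grouped = defaultdict(list)
    for abn, postcode, age in rows:
        grouped[abn].append((postcode, age))
    return {
        abn: {
            # Each postcode is counted once, however many rows it has
            "number_of_postcodes": len({p for p, _ in values}),
            # The mean is over rows, so repeated postcodes weigh more
            "avg_median_age": mean(a for _, a in values),
        }
        for abn, values in grouped.items()
    }


result = aggregate_by_merchant(rows)
print(result[111])
```

For merchant 111, postcode 3000 appears on two of the three rows, which pulls the average median age towards 30 rather than the midpoint of the two postcodes.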

### Feature Engineering

1. Finding the reach of a merchant
For this analysis we define reach as the number of distinct postcodes that a merchant serves divided by the total number of distinct postcodes in the dataset.

In [None]:
total_number_postcodes = postcode.select(f.col("consumer_postcode")).distinct().count()
postcode_agg = postcode_agg.withColumn("postcode_reach", postcode_agg.number_of_postcodes/total_number_postcodes)
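
As a small worked example of the reach calculation, the plain-Python sketch below uses invented merchant and postcode pairs, and the helper `postcode_reach` is hypothetical. The denominator is the number of distinct postcodes observed in the data, not the number of postcodes that exist nationally.

```python
# Hypothetical (merchant_abn, consumer_postcode) pairs
pairs = [
    (111, "3000"),
    (111, "3051"),
    (222, "3000"),
    (333, "3182"),
]


def postcode_reach(pairs):
    """Share of all observed postcodes that each merchant serves."""
    all_postcodes = {postcode for _, postcode in pairs}
    served = {}
    for abn, postcode in pairs:
        served.setdefault(abn, set()).add(postcode)
    return {abn: len(ps) / len(all_postcodes) for abn, ps in served.items()}


reach = postcode_reach(pairs)
print(reach)
```

Here three distinct postcodes are observed, so merchant 111 (two postcodes) has a reach of 2/3 and the other two merchants each have a reach of 1/3.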

2. Finding the average number of consumers that a merchant serves per postcode.

In [None]:
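# Count each consumer once per (merchant, postcode) pair, even if they appear on several rows
consumer_id_count_per_postcode = postcode.groupBy("merchant_abn", "consumer_postcode").agg(
    f.countDistinct("consumer_id").alias("number_of_consumers")
)
# Average the per-postcode consumer counts for each merchant
avg_num_of_consumers_per_postcode = consumer_id_count_per_postcode.groupBy("merchant_abn").agg(
    f.avg("number_of_consumers").alias("avg_num_of_consumers_per_postcode")
)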
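
The value of this feature depends on whether a `consumer_id` that repeats within a merchant and postcode pair is counted once or once per row. The plain-Python sketch below contrasts the two on invented data. The helper `avg_consumers_per_postcode` and its `distinct` flag are hypothetical, and whether repeats actually occur depends on the granularity of the input dataset, which is assumed here rather than known.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (merchant_abn, consumer_postcode, consumer_id)
rows = [
    (111, "3000", "c1"),
    (111, "3000", "c1"),  # same consumer appears twice
    (111, "3000", "c2"),
    (111, "3051", "c3"),
]


def avg_consumers_per_postcode(rows, distinct=True):
    """Average per-postcode consumer count for each merchant."""
    ids_per_pair = defaultdict(list)
    for abn, postcode, consumer_id in rows:
        ids_per_pair[(abn, postcode)].append(consumer_id)

    counts_per_merchant = defaultdict(list)
    for (abn, _), ids in ids_per_pair.items():
        # distinct=True counts each consumer once; distinct=False counts rows
        counts_per_merchant[abn].append(len(set(ids)) if distinct else len(ids))

    return {abn: mean(counts) for abn, counts in counts_per_merchant.items()}


distinct_avg = avg_consumers_per_postcode(rows, distinct=True)
row_avg = avg_consumers_per_postcode(rows, distinct=False)
print(distinct_avg[111], row_avg[111])
```

With repeated consumers the row count overstates the number of consumers, so the two variants only agree when each consumer appears at most once per merchant and postcode.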

### Join
Now we will join the created features to the aggregated dataset.

In [None]:
postcode_agg = postcode_agg.join(avg_num_of_consumers_per_postcode, on = "merchant_abn", how = "inner")

In [None]:
postcode_agg.orderBy(f.col("avg_num_of_consumers_per_postcode").desc()).show(truncate = False)

### Saving the Data

In [None]:
postcode_agg.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/postcode_agg.parquet")

In [None]:
spark.stop()

### Summary
- Data was aggregated by merchant ABN (`merchant_abn`):
    1. Took the first value of merchant name as the merchant's name, as each merchant ABN has a single merchant name.

    2. Counted the distinct consumer postcodes to find the number of postcodes the merchant served.

    3. Took the average of total weekly personal income.

    4. Took the average of total weekly family income.

    5. Took the average of median age.

    6. Took the average of household size.

- We created the following features:
    1. The reach of a merchant.

    2. The average number of consumers that a merchant serves per postcode.

- The aggregated data was saved in Parquet format to the `agg_insight_data` directory.
