# Aggregating Features Regarding Consumers by Merchant
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the consumer features by merchant ABN and create new consumer-related features for each merchant.

To start, we create a Spark session and load the consumers dataset, which contains all the features that relate to consumers.

In [None]:
from pyspark.sql import SparkSession, functions as f

In [None]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)

In [None]:
consumers = spark.read.parquet("../../../data/insights/pre_insights/consumers.parquet")

### Aggregation
Next we aggregate by merchant ABN, counting the distinct consumers and averaging the consumer fraud probability for each merchant.

In [None]:
consumers_aggregated = consumers.groupBy("merchant_abn").agg(
    f.countDistinct("consumer_id").alias("number_of_unique_consumers"),
    f.avg("consumer_fraud_probability_%").alias("average_consumer_fraud_probability"),
)
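
To make the two aggregates concrete, here is a minimal plain-Python sketch on made-up rows. The ABNs, consumer ids, and probabilities are hypothetical and not taken from the dataset, and the sketch assumes there are no null fraud probabilities. It mirrors what `countDistinct` and `avg` compute per merchant.

```python
from collections import defaultdict

# Hypothetical toy transactions:
# (merchant_abn, consumer_id, consumer_fraud_probability_%)
rows = [
    (111, "c1", 10.0),
    (111, "c1", 20.0),
    (111, "c2", 30.0),
    (222, "c3", 5.0),
]

unique_consumers = defaultdict(set)   # merchant -> set of consumer ids
fraud_values = defaultdict(list)      # merchant -> fraud probability per row

for abn, consumer_id, fraud in rows:
    unique_consumers[abn].add(consumer_id)
    fraud_values[abn].append(fraud)

toy_aggregated = {
    abn: {
        "number_of_unique_consumers": len(unique_consumers[abn]),
        "average_consumer_fraud_probability": sum(fraud_values[abn]) / len(fraud_values[abn]),
    }
    for abn in unique_consumers
}
```

Note that the average is taken over transaction rows rather than over unique consumers, so a consumer with many transactions carries more weight in a merchant's average.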

### Feature Engineering
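1. Finding the repeat consumers of each merchant, that is, consumers who ordered from the merchant more than once.
2. Finding the average number of times a consumer orders from a particular merchant. This average is computed over all of the merchant's consumers, including those who ordered only once.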

In [None]:
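# Count how many times each consumer ordered from each merchant.
repeat_consumers = consumers.groupBy("merchant_abn", "consumer_id").agg(f.count("consumer_id").alias("consumer_order_times"))
# Average orders per consumer for each merchant (computed before filtering, so one-time consumers are included).
repeat_consumers_order_times = repeat_consumers.groupBy("merchant_abn").agg(f.avg("consumer_order_times").alias("average_repeat_transactions_per_consumer"))
# Keep only consumers who ordered more than once, then count them per merchant.
repeat_consumers = repeat_consumers.filter(repeat_consumers.consumer_order_times > 1)
repeat_consumers_count = repeat_consumers.groupBy("merchant_abn").agg(f.count("consumer_order_times").alias("number_of_repeat_consumers"))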
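
The same two features can be sketched in plain Python on toy data to show what each step produces. The orders below are hypothetical, and the dictionary names are chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy orders: one (merchant_abn, consumer_id) pair per transaction.
orders = [
    (111, "c1"), (111, "c1"), (111, "c1"),
    (111, "c2"),
    (222, "c3"),
]

# Step 1: number of orders per (merchant, consumer) pair.
order_times = Counter(orders)

# Collect the per-consumer order counts for each merchant.
counts_per_merchant = defaultdict(list)
for (abn, _consumer_id), n in order_times.items():
    counts_per_merchant[abn].append(n)

# Step 2: average orders per consumer, taken over ALL consumers of the merchant.
average_orders_per_consumer = {
    abn: sum(counts) / len(counts) for abn, counts in counts_per_merchant.items()
}

# Step 3: count only consumers with more than one order. Merchants with no
# such consumer get no entry, like the filtered Spark DataFrame.
number_of_repeat_consumers = {}
for abn, counts in counts_per_merchant.items():
    repeats = sum(1 for n in counts if n > 1)
    if repeats > 0:
        number_of_repeat_consumers[abn] = repeats
```

Merchant 222 has no entry in `number_of_repeat_consumers`, which is why a left outer join is needed to keep such merchants in the aggregated data.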

In [None]:
repeat_consumers_order_times.orderBy(f.col("average_repeat_transactions_per_consumer").desc()).show()

In [None]:
repeat_consumers.show()

### Join
Next we will join the newly created features back into the aggregated data.

In [None]:
consumers_aggregated = consumers_aggregated.join(repeat_consumers_count, on="merchant_abn", how="leftouter")
consumers_aggregated = consumers_aggregated.join(repeat_consumers_order_times, on="merchant_abn", how="leftouter")
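
As a rough illustration of the left outer join semantics, the sketch below joins two toy dictionaries keyed by merchant ABN. The values are hypothetical. Every merchant on the left side is kept, and a merchant missing from the right side receives `None`, the plain-Python analogue of a Spark null.

```python
# Hypothetical left side: one entry per merchant.
toy_aggregated = {
    111: {"number_of_unique_consumers": 2},
    222: {"number_of_unique_consumers": 1},
}

# Hypothetical right side: merchant 222 has no repeat consumers, so no entry.
toy_repeat_counts = {111: 1}

# Left outer join on the merchant key: dict.get returns None when the
# merchant is absent from the right side.
joined = {
    abn: {**features, "number_of_repeat_consumers": toy_repeat_counts.get(abn)}
    for abn, features in toy_aggregated.items()
}
```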

In [None]:
consumers_aggregated.show()

After joining, we expect some null values, because we used a left outer join and some merchants may not have any repeat consumers.

We first count the merchants without repeat consumers.

Then we replace the null values with 0, so that the absence of repeat consumers is recorded numerically and the column can be used later.

In [None]:
consumers_aggregated.filter(consumers_aggregated["number_of_repeat_consumers"].isNull()).count()

In [None]:
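# Replace nulls left by the joins with 0 (fillna(0) without `subset` applies to every numeric column).
consumers_aggregated = consumers_aggregated.fillna(0)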
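
The null check and the fill can be sketched on toy rows as follows. The rows are hypothetical, and the sketch fills only the repeat-consumer count. In PySpark, `fillna(0)` without a `subset` argument fills nulls in every numeric column, while passing `subset=["number_of_repeat_consumers"]` would restrict the fill to that one column.

```python
# Hypothetical joined rows; None stands in for a Spark null.
joined_rows = [
    {"merchant_abn": 111, "number_of_repeat_consumers": 1,
     "average_consumer_fraud_probability": 20.0},
    {"merchant_abn": 222, "number_of_repeat_consumers": None,
     "average_consumer_fraud_probability": 5.0},
]

# Count the merchants with no repeat consumers before filling.
merchants_without_repeats = sum(
    1 for row in joined_rows if row["number_of_repeat_consumers"] is None
)

# Replace a missing count with 0 and leave every other column untouched.
filled_rows = [
    {
        **row,
        "number_of_repeat_consumers": (
            0 if row["number_of_repeat_consumers"] is None
            else row["number_of_repeat_consumers"]
        ),
    }
    for row in joined_rows
]
```

Filling changes values only and never the number of rows, so the merchant count should be the same before and after.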

In [None]:
# Check that the number of merchants is unchanged after the joins and the fill.
consumers_aggregated.count()

### Saving the data

In [None]:
consumers_aggregated.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/consumers_agg.parquet")

In [None]:
spark.stop()

### Summary
In this notebook the following was achieved:

- Data was aggregated by merchant ABN:
    1. consumer_id was aggregated by counting the distinct consumer_ids for each merchant, giving the number of unique consumers.

    2. The consumer fraud probability was averaged for each merchant.

- We created the following features:
    1. The number of repeat consumers, found by checking whether a consumer's number of orders from a merchant was greater than 1.

    2. The average number of times consumers ordered, found by averaging the number of times each consumer ordered from a particular merchant.

- There were 1456 merchants with no repeat consumers.

- The aggregated data was saved as a parquet file in the agg_insight_data directory.