# Aggregating Features Regarding Consumers by Merchant
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the consumer-related features by merchant ABN and create new consumer-related features.

To start, we create a Spark session and load the orders dataset that contains all the features that relate to consumers.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)

23/10/03 19:59:14 WARN Utils: Your hostname, DulanComputer resolves to a loopback address: 127.0.1.1; using 172.30.15.25 instead (on interface eth0)
23/10/03 19:59:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/03 19:59:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
consumers = spark.read.parquet("../../../data/insights/pre_insights/consumers.parquet")

                                                                                

### Aggregation
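Next we aggregate by merchant ABN, counting each merchant's distinct consumers and averaging the consumer fraud probability over its orders.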

In [4]:
consumers_aggregated = consumers.groupBy("merchant_abn").agg(
                        f.countDistinct("consumer_id").alias("number_of_unique_consumers"),
                        f.avg("consumer_fraud_probability_%").alias("average_consumer_fraud_probability"))
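
To make the two aggregates concrete, here is a minimal plain-Python sketch of what the `groupBy`/`agg` call above computes. The toy rows and their values are hypothetical and are not taken from the dataset; the real computation runs in Spark.

```python
from collections import defaultdict

# Hypothetical toy orders: (merchant_abn, consumer_id, consumer_fraud_probability_%)
orders = [
    (111, 1, 10.0),
    (111, 1, 20.0),
    (111, 2, 30.0),
    (222, 3, 40.0),
]

unique_consumers = defaultdict(set)   # distinct consumer ids per merchant
fraud_values = defaultdict(list)      # one fraud probability per order row

for merchant_abn, consumer_id, fraud_probability in orders:
    unique_consumers[merchant_abn].add(consumer_id)
    fraud_values[merchant_abn].append(fraud_probability)

aggregated = {
    abn: {
        "number_of_unique_consumers": len(unique_consumers[abn]),
        "average_consumer_fraud_probability": sum(fraud_values[abn]) / len(fraud_values[abn]),
    }
    for abn in unique_consumers
}

print(aggregated)
```

In this sketch consumer 1 ordered twice from merchant 111, so they count once towards the unique consumers but both of their rows contribute to the average.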

### Feature Engineering
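1. Finding repeat consumers, i.e. consumers who ordered from the same merchant more than once.
2. Finding the average number of times a consumer orders from a particular merchant. This is stored as `average_repeat_transactions_per_consumer` and is averaged over all of the merchant's consumers, not only the repeat ones.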

In [5]:
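# Count how many orders each consumer placed with each merchant
repeat_consumers = consumers.groupBy("merchant_abn", "consumer_id").agg(f.count("consumer_id").alias("consumer_order_times"))
# Average orders per consumer for each merchant (taken over all consumers, before filtering to repeats)
repeat_consumers_order_times = repeat_consumers.groupBy("merchant_abn").agg(f.avg("consumer_order_times").alias("average_repeat_transactions_per_consumer"))
# Keep only the consumers who ordered from the same merchant more than once
repeat_consumers = repeat_consumers.filter(repeat_consumers.consumer_order_times > 1)
# Count the repeat consumers for each merchant
repeat_consumers_count = repeat_consumers.groupBy("merchant_abn").agg(f.count("consumer_order_times").alias("number_of_repeat_consumers"))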
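
The two features can be illustrated with a small plain-Python sketch. The toy order rows are hypothetical; the sketch only mirrors the order of operations in the cell above, where the average is taken before the `> 1` filter and the repeat-consumer count is taken after it.

```python
from collections import Counter, defaultdict

# Hypothetical toy orders: one (merchant_abn, consumer_id) pair per order row
orders = [(111, 1), (111, 1), (111, 1), (111, 2), (222, 3)]

# Orders per (merchant, consumer) pair, analogous to consumer_order_times
order_times = Counter(orders)

# Collect the per-consumer order counts for each merchant
per_merchant = defaultdict(list)
for (merchant_abn, _consumer_id), times in order_times.items():
    per_merchant[merchant_abn].append(times)

# Average order count over ALL consumers of a merchant (computed before filtering)
average_orders = {abn: sum(t) / len(t) for abn, t in per_merchant.items()}

# Number of consumers with more than one order; merchants with none drop out,
# just as they do after the Spark filter
repeat_counts = {}
for abn, times_list in per_merchant.items():
    n_repeat = sum(1 for t in times_list if t > 1)
    if n_repeat > 0:
        repeat_counts[abn] = n_repeat

print(average_orders)
print(repeat_counts)
```

Merchant 222 has a single one-off consumer here, so it has an average but no entry in `repeat_counts`. This is why nulls can appear for the repeat-consumer count once the features are joined back.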

In [6]:
repeat_consumers_order_times.orderBy(f.col("average_repeat_transactions_per_consumer").desc()).show()



+------------+----------------------------------------+
|merchant_abn|average_repeat_transactions_per_consumer|
+------------+----------------------------------------+
| 24852446429|                      12.008899274137416|
| 86578477987|                      11.325478498632862|
| 64203420245|                      10.818791946308725|
| 49891706470|                      10.287674638293641|
| 46804135891|                        9.72962760403719|
| 45629217853|                       9.485508327119065|
| 89726005175|                       8.960375857611615|
| 43186523025|                       8.348582794629538|
| 80324045558|                        8.17741534483616|
| 63290521567|                       7.564222465426326|
| 68216911708|                       7.516937770482017|
| 21439773999|                       5.043445617898794|
| 64403598239|                       4.755238095238095|
| 72472909171|                       4.294793609193085|
| 94493496784|                        4.19430392

                                                                                

In [8]:
repeat_consumers.show()



+------------+-----------+--------------------+
|merchant_abn|consumer_id|consumer_order_times|
+------------+-----------+--------------------+
| 68559320474|    1343547|                   3|
| 49505931725|    1463076|                   3|
| 89726005175|     298861|                   8|
| 48534649627|     298861|                   3|
| 75944642726|    1230828|                   3|
| 80324045558|    1163184|                   2|
| 56945597985|    1109504|                   2|
| 35223308778|     589921|                   3|
| 72472909171|     621089|                   2|
| 96566672398|     983685|                   2|
| 68559320474|     754347|                   2|
| 37459245212|    1454029|                   2|
| 47086412084|     738210|                   3|
| 16993524298|     738210|                   2|
| 32361057556|     608308|                   4|
| 98269572896|     608308|                   2|
| 98269572896|     131147|                   2|
| 93558142492|    1201519|              

                                                                                

### Join
Next we will join the newly created features back into the aggregated data.

In [7]:
consumers_aggregated = consumers_aggregated.join(repeat_consumers_count, on = "merchant_abn", how = "leftouter")
consumers_aggregated = consumers_aggregated.join(repeat_consumers_order_times, on = "merchant_abn", how = "leftouter")

In [8]:
consumers_aggregated.show()

                                                                                

+------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+
|merchant_abn|number_of_unique_consumers|average_consumer_fraud_probability|number_of_repeat_consumers|average_repeat_transactions_per_consumer|
+------------+--------------------------+----------------------------------+--------------------------+----------------------------------------+
| 19839532017|                       603|              0.047899319371961166|                        11|                      1.0182421227197347|
| 83412691377|                      8990|               0.03216135694921109|                      2419|                      1.3268075639599555|
| 15613631617|                      1427|              0.026029083563212478|                        53|                      1.0392431674842326|
| 38700038932|                      5154|                0.6055661032397236|                       707|                       1.15

After joining, we expect some null values in `number_of_repeat_consumers`, because we used a left outer join and some merchants may not have had any repeat consumers.

We will first check the number of merchants without repeat consumers.

Then we will replace the null values with 0, so that the absence of repeat consumers is represented numerically and the column can be used later.

In [9]:
consumers_aggregated.filter(consumers_aggregated["number_of_repeat_consumers"].isNull()).count()

                                                                                

1456

In [10]:
consumers_aggregated = consumers_aggregated.fillna(0)
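
The join-then-fill step can be sketched in plain Python as follows. The dictionaries are hypothetical toy inputs keyed by merchant ABN, and `dict.get` stands in for the left outer join: every merchant on the left is kept, and a missing match becomes `None`.

```python
# Hypothetical toy inputs keyed by merchant_abn
unique_consumers = {111: 2, 222: 1}
repeat_counts = {111: 1}  # merchant 222 has no repeat consumers

# Left outer join: keep every merchant from the left side
joined = {
    abn: {
        "number_of_unique_consumers": n_unique,
        "number_of_repeat_consumers": repeat_counts.get(abn),  # None if no match
    }
    for abn, n_unique in unique_consumers.items()
}

# Count merchants whose repeat-consumer count is missing
merchants_without_repeats = sum(
    1 for row in joined.values() if row["number_of_repeat_consumers"] is None
)

# Replace missing values with 0, analogous to fillna(0)
filled = {
    abn: {key: (0 if value is None else value) for key, value in row.items()}
    for abn, row in joined.items()
}

print(merchants_without_repeats)
print(filled)
```

Note that `fillna(0)` without further arguments fills nulls in every numeric column. If only the repeat-consumer count should be filled, PySpark's `fillna` also accepts a `subset` argument listing the columns to fill.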

In [11]:
# Check that the number of merchants did not change after the joins
consumers_aggregated.count()

4026

### Saving the data

In [15]:
consumers_aggregated.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/consumers_agg.parquet")

                                                                                

In [16]:
spark.stop()

### Summary
In this notebook the following was achieved:

- Data was aggregated by merchant abn:
    1. consumer_id was aggregated by counting the distinct number of consumer_ids for each merchant to give the unique number of consumers

    2. Average was taken for the consumer fraud probability

- We created the following features:
    1. The number of repeat consumers was created by checking whether a consumer's number of orders from a merchant was greater than 1.

    2. The average number of times consumers ordered was created by averaging, for each merchant, the number of times each consumer ordered from that merchant.

- There were 1456 merchants with no repeat consumers.

- Aggregated data was saved as a parquet file (`consumers_agg.parquet`) for later use.