# Aggregating Merchant Features by Merchant ABN
## Author: Dulan Wijeratne 1181873

In this notebook we will aggregate the merchant features by merchant ABN.

First we will create a new directory to store the aggregated data in. Then we will create a Spark session and import the merchant dataset, which contains all the features that relate to the merchants.

In [1]:
from pyspark.sql import SparkSession, functions as f
import os

In [2]:
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs("../../../data/insights/agg_insight_data", exist_ok=True)

In [3]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '3g')   
    .config('spark.executor.memory', '4g')  
    .config('spark.executor.instances', '2')  
    .config('spark.executor.cores', '2')
    .getOrCreate()
)



In [4]:
merchant = spark.read.parquet("../../../data/insights/pre_insights/merchant.parquet/")

                                                                                

In [5]:
merchant.show(truncate=False)

+------------+--------------------------------+----------------------+--------------------+----------------------------+
|merchant_abn|merchant_name                   |merchant_revenue_level|merchant_take_rate_%|merchant_fraud_probability_%|
+------------+--------------------------------+----------------------+--------------------+----------------------------+
|20985347699 |Semper Tellus PC                |a                     |6.1                 |0.0                         |
|45629217853 |Lacus Consulting                |a                     |6.98                |0.0                         |
|71528203369 |Ipsum Primis Associates         |a                     |6.94                |0.0                         |
|75900778714 |Laoreet Posuere Foundation      |c                     |2.61                |0.0                         |
|66626020312 |In Magna Incorporated           |c                     |2.46                |0.0                         |
|17324645993 |Eget Metus In Corp

                                                                                

In [6]:
merchant.dtypes

[('merchant_abn', 'bigint'),
 ('merchant_name', 'string'),
 ('merchant_revenue_level', 'string'),
 ('merchant_take_rate_%', 'float'),
 ('merchant_fraud_probability_%', 'double')]

Next we want to check how many unique merchants are in the data. The cell below counts the distinct pairs of merchant name and revenue level.

In [7]:
merchant.select(merchant.merchant_name, merchant.merchant_revenue_level).distinct().count()

                                                                                

4026
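
The count above is taken over distinct (name, revenue level) pairs, so it equals the number of merchants only if no two ABNs share both a name and a revenue level. The following is a minimal, Spark-free sketch of how the two counts can diverge. The rows and ABNs are made up for illustration and are not taken from the dataset. In the notebook itself, a comparable check would be `merchant.select("merchant_abn").distinct().count()`.

```python
# Hypothetical sample rows: (merchant_abn, merchant_name, merchant_revenue_level).
rows = [
    (111, "Alpha Ltd", "a"),
    (111, "Alpha Ltd", "a"),  # same merchant appearing in a second record
    (222, "Beta Corp", "b"),
    (333, "Beta Corp", "b"),  # different ABN, same name and revenue level
]

# Distinct (name, revenue level) pairs, as counted in the cell above.
distinct_pairs = {(name, level) for _, name, level in rows}

# Distinct ABNs, the key used for the aggregation later on.
distinct_abns = {abn for abn, _, _ in rows}

print(len(distinct_pairs))  # 2
print(len(distinct_abns))   # 3
```

If both counts agree on the real data, the pair count is a safe stand-in for the number of merchants.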

Then we want to check whether any merchants have multiple take rates or revenue levels.

In [8]:
merchant_revenue_level_count = merchant.groupBy("merchant_abn").agg(f.countDistinct("merchant_revenue_level").alias("revenue_level_count"))
merchant_take_rate_count = merchant.groupBy("merchant_abn").agg(f.countDistinct("merchant_take_rate_%").alias("take_rate_count"))

In [9]:
print(merchant_revenue_level_count.filter(merchant_revenue_level_count.revenue_level_count > 1).count())
print(merchant_take_rate_count.filter(merchant_take_rate_count.take_rate_count > 1).count())

                                                                                

0




0
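
Both counts are 0, so no ABN maps to more than one revenue level or take rate. As an illustration of what this check computes, here is a small plain-Python sketch of the same idea on made-up rows. The helper name `keys_with_multiple_values` and the sample data are hypothetical. Note that Spark's `countDistinct` ignores nulls, whereas this sketch would treat `None` as a value.

```python
from collections import defaultdict


def keys_with_multiple_values(rows, key_index, value_index):
    """Return the keys that map to more than one distinct value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key_index]].add(row[value_index])
    return sorted(key for key, values in seen.items() if len(values) > 1)


# Hypothetical rows: (merchant_abn, revenue_level, take_rate).
rows = [
    (111, "a", 6.1),
    (111, "a", 6.1),
    (222, "b", 4.2),
    (222, "c", 4.2),  # conflicting revenue level for the same ABN
]

print(keys_with_multiple_values(rows, 0, 1))  # [222]
print(keys_with_multiple_values(rows, 0, 2))  # []
```

An empty result for a column corresponds to the `0` printed by the notebook cell.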


                                                                                

### Aggregation

In [10]:
merchant_aggregated = merchant.groupBy("merchant_abn").agg(
    f.first("merchant_name").alias("name"),
    f.first("merchant_revenue_level").alias("revenue_level"),
    f.first("merchant_take_rate_%").alias("take_rate"),
    f.avg("merchant_fraud_probability_%").alias("average_merchant_fraud_probability"),
)
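
Spark's `first` depends on row order, which is not guaranteed after a shuffle. It is safe for the revenue level and take rate here only because the earlier check found a single value per ABN. To make the semantics of the aggregation concrete, here is a plain-Python sketch of the same logic on made-up rows: keep the first value seen for each descriptive column and average the fraud probability. The sample data is hypothetical.

```python
from statistics import mean

# Hypothetical rows:
# (merchant_abn, name, revenue_level, take_rate, fraud_probability).
rows = [
    (111, "Alpha Ltd", "a", 6.1, 0.0),
    (111, "Alpha Ltd", "a", 6.1, 30.0),
    (222, "Beta Corp", "b", 4.2, 0.0),
]

grouped = {}
for abn, name, level, take_rate, fraud in rows:
    # setdefault keeps the values from the first row seen for each ABN.
    entry = grouped.setdefault(
        abn,
        {"name": name, "revenue_level": level, "take_rate": take_rate, "fraud": []},
    )
    entry["fraud"].append(fraud)

aggregated = {
    abn: {
        "name": entry["name"],
        "revenue_level": entry["revenue_level"],
        "take_rate": entry["take_rate"],
        "average_merchant_fraud_probability": mean(entry["fraud"]),
    }
    for abn, entry in grouped.items()
}

print(aggregated[111]["average_merchant_fraud_probability"])  # 15.0
print(aggregated[222]["name"])  # Beta Corp
```

The result has one entry per ABN, matching the one-row-per-merchant shape of `merchant_aggregated`.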

In [11]:
merchant_aggregated.show()



+------------+--------------------+-------------+---------+----------------------------------+
|merchant_abn|                name|revenue_level|take_rate|average_merchant_fraud_probability|
+------------+--------------------+-------------+---------+----------------------------------+
| 10023283211|       Felis Limited|            e|     0.18|                               0.0|
| 10142254217|Arcu Ac Orci Corp...|            b|     4.22|                               0.0|
| 10165489824|    Nunc Sed Company|            b|      4.4|                               0.0|
| 10187291046|Ultricies Digniss...|            b|     3.29|                               0.0|
| 10192359162| Enim Condimentum PC|            a|     6.33|                               0.0|
| 10206519221|       Fusce Company|            a|     6.34|                               0.0|
| 10255988167|Aliquam Enim Inco...|            b|     4.32|                               0.0|
| 10264435225|    Ipsum Primis Ltd|            c| 

                                                                                

### Saving the data

In [12]:
merchant_aggregated.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/merchant_agg.parquet")

                                                                                

In [13]:
spark.stop()

### Summary
In this notebook the following was achieved:
- Created a directory to store the aggregated data.
- Found that there are 4026 merchants, counted as distinct name and revenue level pairs.
- Found that each merchant has only one take rate and one revenue level.
- Aggregated the data by merchant ABN:
    1. The name, take rate, and revenue level were taken using the first function.
    2. The merchant fraud probability was averaged.
- Saved the aggregated data as Parquet to the agg_insight_data directory.