# Aggregating Features Regarding Descriptions by Merchant
## Author: Dulan Wijeratne 1181873

In this notebook we aggregate the description features by merchant ABN and group merchants into segments.

To start, we create a Spark session and load the descriptions dataset, which contains all description-related features.

In [1]:
from pyspark.sql import SparkSession, functions as f

In [2]:
spark = (
    SparkSession.builder.appName("Preprocessing_Yellow")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "3g")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .getOrCreate()
) 



In [3]:
descriptions = spark.read.parquet("../../../data/insights/pre_insights/descriptions.parquet/")

                                                                                

Next, we check whether any merchant has more than one description.

In [11]:
tags_count = descriptions.groupBy("merchant_abn").agg(f.countDistinct("merchant_description").alias("description_count"))
print(tags_count.filter(tags_count.description_count > 1).count())

0
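The check above counts distinct descriptions per ABN and looks for any count above one. As a minimal plain-Python sketch of the same logic, the snippet below runs it on made-up rows; the ABNs and descriptions are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (merchant_abn, merchant_description) rows, for illustration only.
rows = [
    (111, "tent and awning shops"),
    (111, "tent and awning shops"),  # repeated row, same description
    (222, "florists supplies"),
    (333, "computer programming"),
    (333, "shoe shops"),             # second, different description
]


def abns_with_multiple_descriptions(rows):
    """Return the sorted ABNs that map to more than one distinct description."""
    distinct = defaultdict(set)
    for abn, description in rows:
        distinct[abn].add(description)
    return sorted(abn for abn, descs in distinct.items() if len(descs) > 1)


print(abns_with_multiple_descriptions(rows))
```

In the notebook the count is 0, so no ABN falls into this list and keeping a single description per merchant loses nothing.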


                                                                                

### Aggregation
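Here we aggregate the data by merchant ABN. Because each merchant has exactly one description, taking the first description per ABN loses no information.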

In [4]:
descriptions_agg = (
    descriptions.groupBy("merchant_abn")
    .agg(f.first("merchant_description").alias("merchant_description"))
)

### Segmenting the data

Here we segment the data by description into 5 groups:
1. Tech and Electronics - Contains anything to do with computers, digital goods, telecom, etc.
2. Retail and Novelty - Contains anything found in a department store or a novelty store.
3. Garden and Furnishings - Contains anything to do with furniture and gardening, including florists and tents.
4. Antiques and Jewellery - Contains anything to do with jewellery, galleries and antiques.
5. Specialized Services - Contains specialized services such as opticians, motor vehicle services and health.

First, we check how many unique descriptions there are and get an idea of the number of merchants with each description.

In [None]:
descriptions.groupBy("merchant_description").agg(f.first("merchant_abn")).count()

In [None]:
descriptions_count = descriptions_agg.groupBy("merchant_description").agg(f.count("merchant_abn").alias("number_of_merchants_with_description"))

In [None]:
descriptions_count.orderBy("number_of_merchants_with_description")

Now we will begin segmenting the data.

In [5]:
# lists which contain keywords for each segment
tech_and_electronics = ["computer", "digital", "television", "telecom"]
retail_and_novelty = ["newspaper", "novelty", "hobby", "shoe", "instruments", "bicycle", "craft", "office"]
garden_and_furnishings = ["florists", "furniture", "garden", "tent"]
antiques_and_jewellery = ["galleries", "antique", "jewelry"]
specialized_services = ["health", "motor", "opticians"]

In [6]:
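# Assign a description to the first segment whose keyword list matches it
def segment(description):
    # Guard against missing descriptions, which would otherwise raise a TypeError
    if description is None:
        return "other"
    # The loop variable is named `name` so it does not shadow the function itself
    for name, keywords in [("tech_and_electronics", tech_and_electronics),
                           ("retail_and_novelty", retail_and_novelty),
                           ("garden_and_furnishings", garden_and_furnishings),
                           ("antiques_and_jewellery", antiques_and_jewellery),
                           ("specialized_services", specialized_services)]:
        # Plain substring match: a keyword may match inside a longer word
        if any(keyword in description for keyword in keywords):
            return name
    return "other"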
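Two properties of the function above are easy to miss: segments are tried in a fixed order, so the first match wins, and matching is by substring, so a keyword can match inside a longer word. The standalone sketch below repeats the keyword lists so it runs on its own; the example descriptions are hypothetical and chosen only to show these behaviours.

```python
# Keyword lists repeated from the notebook so this sketch is self-contained.
SEGMENT_KEYWORDS = [
    ("tech_and_electronics", ["computer", "digital", "television", "telecom"]),
    ("retail_and_novelty", ["newspaper", "novelty", "hobby", "shoe",
                            "instruments", "bicycle", "craft", "office"]),
    ("garden_and_furnishings", ["florists", "furniture", "garden", "tent"]),
    ("antiques_and_jewellery", ["galleries", "antique", "jewelry"]),
    ("specialized_services", ["health", "motor", "opticians"]),
]


def segment(description):
    """Return the first segment with a keyword that is a substring of description."""
    for name, keywords in SEGMENT_KEYWORDS:
        if any(keyword in description for keyword in keywords):
            return name
    return "other"


# Hypothetical descriptions, for illustration only.
examples = [
    "computer and office supplies",  # "computer" and "office" both match; first wins
    "garden tools",
    "watch and jewelry repair",
    "content writers",               # "tent" matches inside "content"
    "lawyers",                       # no keyword matches
]

for description in examples:
    print(f"{description!r} -> {segment(description)}")
```

If substring hits such as "content" turned out to matter for real descriptions, matching on whole words would be one way to tighten the rule; that is a possible refinement, not something this notebook does.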

In [7]:
from pyspark.sql.types import StringType
segment_udf = f.udf(segment, StringType())

In [8]:
descriptions_agg = descriptions_agg.withColumn("segment", segment_udf(descriptions_agg.merchant_description))

In [9]:
# check whether any merchant was left without a segment (i.e. labelled "other")
other_df = descriptions_agg.filter(f.col("segment") == "other")

In [10]:
other_df.show()

+------------+--------------------+-------+
|merchant_abn|merchant_description|segment|
+------------+--------------------+-------+
+------------+--------------------+-------+
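An empty "other" table shows that every description matched at least one keyword, but it does not show whether a description matched more than one segment. The sketch below is one way to report both cases. It uses plain Python, repeats the keyword lists so it runs on its own, and uses hypothetical descriptions; the helper name `matching_segments` is introduced here for illustration only.

```python
# Keyword lists repeated from the notebook so this sketch is self-contained.
SEGMENT_KEYWORDS = {
    "tech_and_electronics": ["computer", "digital", "television", "telecom"],
    "retail_and_novelty": ["newspaper", "novelty", "hobby", "shoe",
                           "instruments", "bicycle", "craft", "office"],
    "garden_and_furnishings": ["florists", "furniture", "garden", "tent"],
    "antiques_and_jewellery": ["galleries", "antique", "jewelry"],
    "specialized_services": ["health", "motor", "opticians"],
}


def matching_segments(description):
    """Return every segment with at least one keyword found in description."""
    return [
        name
        for name, keywords in SEGMENT_KEYWORDS.items()
        if any(keyword in description for keyword in keywords)
    ]


# Hypothetical descriptions, for illustration only.
sample_descriptions = [
    "computer and office supplies",
    "tent and awning shops",
    "lawyers",
]

report = {d: matching_segments(d) for d in sample_descriptions}
unmatched = [d for d, matches in report.items() if not matches]
ambiguous = [d for d, matches in report.items() if len(matches) > 1]

print("unmatched:", unmatched)
print("ambiguous:", ambiguous)
```

Running the same report over the distinct values of `merchant_description` would show which assignments depend on the order in which segments are tried.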



                                                                                

### Saving the Data

In [None]:
descriptions_agg.write.mode("overwrite").parquet("../../../data/insights/agg_insight_data/descriptions_agg.parquet")

In [None]:
spark.stop()

### Summary
- Each merchant has only one description.
- Data was aggregated by merchant ABN:
    1. Merchant description was aggregated using the `first` function, as each merchant has a single description.

- There are only 25 distinct merchant descriptions.
- Merchants were put into 1 of 5 segments:
    1. Tech and Electronics
    2. Retail and Novelty
    3. Garden and Furnishings
    4. Antiques and Jewellery
    5. Specialized Services
    
- Aggregated data was saved as Parquet to `data/insights/agg_insight_data/descriptions_agg.parquet`.