# Loyalty Market Test

The data is stored in an S3 bucket. I need to fetch it using the Hadoop AWS library (`hadoop-aws`) and then start testing it.

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
s3_access_key = os.environ.get("S3_ACCESS_KEY")
s3_secret_key = os.environ.get("S3_SECRET_KEY")
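
`os.environ.get` returns `None` when a variable is missing, and the problem would only surface later, when Spark tries to authenticate against S3. A minimal sketch of an early check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Possible usage after load_dotenv() (assumes the same variable names as above):
# s3_access_key = require_env("S3_ACCESS_KEY")
# s3_secret_key = require_env("S3_SECRET_KEY")
```

Failing at this point keeps the error close to its cause instead of burying it in a Spark stack trace.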

In [2]:
from pyspark.sql import SparkSession

# Remember to use 2 cores on the laptop and 4 cores on the local machine
# (change both "local[N]" and "spark.executor.cores" accordingly)
spark = SparkSession.builder \
    .appName("Loyalty") \
    .config("spark.master", "local[4]") \
    .config("spark.executor.cores", "4") \
    .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.3.1") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", s3_access_key) \
    .config("spark.hadoop.fs.s3a.secret.key", s3_secret_key) \
    .getOrCreate()
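
Switching between the laptop and the local machine means editing the core count by hand. One possible way to avoid that is to read it from the environment. This is only a sketch: the variable name `SPARK_LOCAL_CORES` and the helper `local_master` are assumptions made for illustration, not Spark settings.

```python
import os


def local_master(default_cores: int = 2) -> str:
    """Build a Spark local master URL such as 'local[4]' from an environment variable."""
    # SPARK_LOCAL_CORES is a hypothetical variable name chosen for this sketch.
    cores = int(os.environ.get("SPARK_LOCAL_CORES", default_cores))
    if cores < 1:
        raise ValueError("The number of cores must be at least 1")
    return f"local[{cores}]"


# Possible usage in the builder above:
# .config("spark.master", local_master())
```

With this in place, the `.env` file on each machine could carry its own core count next to the S3 credentials.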

In [6]:
# The header option only needs to be set once
df = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://data-test-202302/product.csv")
df.show(5)

+-----------+---+----------+---------+----------------+----------+--------+
|product_key|sku|       upc|item_name|item_description|department|category|
+-----------+---+----------+---------+----------------+----------+--------+
| 7652613339|  0|7652613339|   324168|              NA|  651b1068|8312aed6|
| 1810063322|  0|1810063322|   276973|              NA|  651b1068|7aaa7a34|
| 5274585486|  0|5274585486|   794396|              NA|  c81ba571|54ea8364|
| 6978362094|  0|6978362094|   510386|              NA|  b947a4a9| 382cf3a|
| 6978396053|  0|6978396053|   120105|              NA|  b947a4a9| 382cf3a|
+-----------+---+----------+---------+----------------+----------+--------+
only showing top 5 rows



## Understanding how partitions work
To make the most of the available cores, the number of partitions should be a multiple of the number of cores, so that no core sits idle while the others are still working.

**- On Laptop** With 4 partitions and 2 cores, each core processes 2 partitions.

**- On Local machine** With 4 partitions and 4 cores, each core processes just one partition.

As expected, processing the data takes longer on the laptop.
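
The reasoning above can be sketched as simple arithmetic. This is a simplification that assumes one task per core at a time and partitions of roughly equal size; the function names are hypothetical and do not come from Spark.

```python
import math


def task_waves(num_partitions: int, num_cores: int) -> int:
    """Number of sequential rounds needed when each core runs one task at a time."""
    return math.ceil(num_partitions / num_cores)


def idle_cores_in_last_wave(num_partitions: int, num_cores: int) -> int:
    """Cores without a task in the final round (0 when partitions are a multiple of cores)."""
    remainder = num_partitions % num_cores
    return 0 if remainder == 0 else num_cores - remainder


# 4 partitions on the two machines described above
for label, cores in [("laptop", 2), ("local machine", 4)]:
    print(label, task_waves(4, cores), idle_cores_in_last_wave(4, cores))
```

For 4 partitions this gives 2 waves on the laptop and 1 wave on the local machine, with no idle cores in either case. A count such as 5 partitions on 4 cores would need 2 waves and leave 3 cores idle in the second one, which is why multiples of the core count are preferable.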

TODO: Show processing times and analyse the Spark UI

In [4]:
print("Number of partitions: ",df.rdd.getNumPartitions())

Number of partitions:  4


In [5]:
# len(row) is the number of columns in a Row, so this counts field values per partition, not bytes
partition_sizes = df.rdd.mapPartitions(lambda partition: [sum(map(len, partition))]).collect()
for i, size in enumerate(partition_sizes):
    print(f"Partition {i}: {size} field values")

Partition 0: 1155399 field values
Partition 1: 1156855 field values
Partition 2: 1155973 field values
Partition 3: 720132 field values
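
Calling `len` on a PySpark `Row` returns its number of fields, so the figures above count values per partition (every total is a multiple of the 7 columns). For a rough size in bytes, one option is to sum the UTF-8 length of each field. The sketch below assumes all columns are strings or null, which holds for a CSV read without a schema; `partition_text_bytes` is a hypothetical helper, and the result estimates text size, not Spark's in-memory footprint.

```python
from typing import Iterable, Optional, Sequence


def partition_text_bytes(rows: Iterable[Sequence[Optional[str]]]) -> int:
    """Sum the UTF-8 encoded length of every non-null field in a partition."""
    total = 0
    for row in rows:
        for value in row:
            if value is not None:
                total += len(str(value).encode("utf-8"))
    return total


# Possible usage against the DataFrame above (not executed here):
# sizes = df.rdd.mapPartitions(lambda part: [partition_text_bytes(part)]).collect()

# Plain tuples behave like Rows for this purpose
sample = [("7652613339", "0", None), ("é", "ab", "c")]
print(partition_text_bytes(sample))
```

The sample rows contain 16 bytes of text: 11 in the first row and 5 in the second, since `é` takes two bytes in UTF-8.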
