# Dataset

## Products  

: product_id - The product ID  
: product_name  - The product name   
: price - The product price (stored as `product_price` in the data generated below)   

## Sellers  

: seller_id  - The seller ID   
: seller_name  - The seller name    
: daily_target - The number of items (regardless of product type) that the seller must sell to hit their quota. For example, if the daily target is 100,000, the seller can hit the quota by selling 100,000 units of product_0, or by selling 30,000 units of product_1 and 70,000 units of product_2   
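
The quota rule described for `daily_target` can be sketched in plain Python. The `hits_quota` helper and the dictionaries below are illustrative assumptions, not part of the dataset or of this notebook; the figures are the ones used in the description.

```python
def hits_quota(units_by_product, daily_target):
    """Return True when the total units sold, across all products, reach the target."""
    return sum(units_by_product.values()) >= daily_target


# Two ways to reach a daily target of 100,000 units
single_product = {"product_0": 100_000}
mixed_products = {"product_1": 30_000, "product_2": 70_000}

print(hits_quota(single_product, 100_000))
print(hits_quota(mixed_products, 100_000))
```

Only the total matters: the product mix has no effect on whether the quota is met.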

## Sales  

: order_id  - The order ID   
: product_id  - The single product sold in the order (all orders have exactly one product)   
: seller_id  - The ID of the employee who sold the product   
: date - The date of the order.    
: num_pieces_sold - The number of units sold for the specific product in the order    
: bill_raw_text  -  A string that represents the raw text of the bill associated with the order   


In [1]:
import os
import pandas as pd
from tqdm import tqdm
import csv
import random
import string
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import Row
from numpy.random import rand
from pyspark.sql.types import IntegerType, StringType


random.seed(42)

My machine has the following configuration:
- 6 cores (12 vCores)
- 32GB RAM

Spark Standalone server:
```
cd /opt/softwares/spark-3.0.1-bin-hadoop3.2/

# export your python bin path
export PYSPARK_PYTHON=/opt/envs/ai4e/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/envs/ai4e/bin/python

# start the master and worker daemons
sbin/start-all.sh

# stop them again when you are done
sbin/stop-all.sh
```
Spark UI: [http://localhost:8080](http://localhost:8080)   
Spark Master URL : spark://IMCHLT276:7077

In [2]:
spark = SparkSession.builder \
    .master("spark://IMCHLT276:7077") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "10") \
    .config("spark.local.dir", "/opt/tmp/spark-temp/") \
    .appName("DataSkewness") \
    .getOrCreate()

# https://stackoverflow.com/questions/27774443/is-it-safe-to-temporarily-rename-tmp-and-then-create-a-tmp-symlink-to-a-differe
# sudo mount --bind /path/to/dir/with/plenty/of/space /tmp
# sudo lsof /tmp # check for apps
# sudo umount /tmp
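
As a rough sanity check on the settings above, the arithmetic below estimates how many executors this session can obtain and how much executor memory that implies. It assumes the usual standalone-mode behaviour, where the executor count is capped by `spark.cores.max` divided by `spark.executor.cores`. The variable names are illustrative, and driver and OS overheads are ignored.

```python
# Values copied from the SparkSession configuration and machine description
executor_cores = 2      # spark.executor.cores
cores_max = 10          # spark.cores.max
executor_memory_gb = 4  # spark.executor.memory ("4g")
machine_ram_gb = 32     # RAM available on the machine

# Upper bound on executors and the heap they would use in total
max_executors = cores_max // executor_cores
total_executor_memory_gb = max_executors * executor_memory_gb

print("executors (upper bound):", max_executors)
print("executor heap in total (GB):", total_executor_memory_gb)
print("fits in RAM:", total_executor_memory_gb < machine_ram_gb)
```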

In [3]:
spark

### Generate products

In [4]:
products_df = spark.range(75000000).withColumnRenamed("id", "product_id").select(F.col("product_id"))

In [5]:
def get_product_name(product_id):
    return f"product_{product_id}"

get_product_name_udf = F.udf(get_product_name, StringType())
products_df = products_df.withColumn("product_name", get_product_name_udf(F.col("product_id")))

In [6]:
def get_price():
    return random.randint(1, 150)
get_price_udf = F.udf(get_price, IntegerType())

products_df = products_df.withColumn("product_price", get_price_udf())

In [7]:
if os.environ.get("PYSPARK_PYTHON", 'error') == 'error':
    print("Export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON and restart Jupyter!")

In [8]:
products_df.count(), products_df.show()

+----------+------------+-------------+
|product_id|product_name|product_price|
+----------+------------+-------------+
|         0|   product_0|           59|
|         1|   product_1|           56|
|         2|   product_2|          144|
|         3|   product_3|            3|
|         4|   product_4|           56|
|         5|   product_5|          104|
|         6|   product_6|           15|
|         7|   product_7|           31|
|         8|   product_8|          135|
|         9|   product_9|          142|
|        10|  product_10|           61|
|        11|  product_11|           80|
|        12|  product_12|           14|
|        13|  product_13|           26|
|        14|  product_14|           82|
|        15|  product_15|          148|
|        16|  product_16|           33|
|        17|  product_17|          149|
|        18|  product_18|          134|
|        19|  product_19|           16|
+----------+------------+-------------+
only showing top 20 rows



(75000000, None)

In [9]:
products_df.repartition(10).write.parquet("data/products_parquet")

### Generate sellers

In [10]:
seller_ids = list(range(1, 10))  # IDs 1 to 9; seller 0 is added separately below
print("Sellers between {} and {}".format(seller_ids[0], seller_ids[-1]))

Sellers between 1 and 9


In [11]:
sellers = [[0, "seller_0", 2500000]]

for s in tqdm(seller_ids):
    sellers.append([s, "seller_{}".format(s), random.randint(12000, 2000000)])
    
#   Save dataframe
df = pd.DataFrame(sellers)
df.columns = ["seller_id", "seller_name", "daily_target"]
df = spark.createDataFrame(df)
df.show()
df.write.parquet("data/sellers_parquet")

100%|██████████| 9/9 [00:00<00:00, 24291.34it/s]


+---------+-----------+------------+
|seller_id|seller_name|daily_target|
+---------+-----------+------------+
|        0|   seller_0|     2500000|
|        1|   seller_1|     1352975|
|        2|   seller_2|      245478|
|        3|   seller_3|       64451|
|        4|   seller_4|     1567144|
|        5|   seller_5|      588778|
|        6|   seller_6|      525575|
|        7|   seller_7|      480106|
|        8|   seller_8|      304632|
|        9|   seller_9|     1556492|
+---------+-----------+------------+



### Generate sales

Prepare skewed data in which product 0 accounts for 19 million of the 20 million orders.

In [12]:
dates = ['2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08',
         '2020-07-09', '2020-07-10']
def get_dates():
    return random.choice(dates)

get_dates_udf = F.udf(get_dates, StringType())


In [13]:
def get_num_pieces_sold():
    return random.randint(1, 100)

get_num_pieces_sold_udf = F.udf(get_num_pieces_sold, IntegerType())

In [23]:
letters = string.ascii_lowercase
letters_upper = string.ascii_uppercase

for _i in range(0, 10):
    letters += letters

for _i in range(0, 10):
    letters += letters_upper

print("Number of chars to choose from", len(letters))
sample_string = random.sample(letters, 500)
print("sample_string", ''.join(sample_string))

def random_string(stringLength=200):
    """Generate a random string of fixed length """
    return ''.join(random.sample(letters, stringLength))

random_string_udf = F.udf(random_string,StringType())


def static_string():
    """static string of fixed length """
    return ''.join(sample_string)

static_string_udf = F.udf(static_string,StringType())

Number of chars to choose from 26884
sample_string brjbogvcakmjwhlzxxpdgxkmijtywakzvnhohgtenzzjksfteonrfovokxzgrudowgpbedxpwjmxmzvspapuvffmttwkusxjogzzvulocfeyfrbiusfycnhitypxtpizsvoqmimcaeSqcyzydoljvxjxodmcsmqrwdbxplbtuoagtcfzohvflrzffqockedhugrregsbxvaqvyqyubpvwtcnvbxrrvdaezolfdjdwmzaczneonhnvxwwggtjpspzpmhskcqczfpduxlkbtqnpatvnsueojmdarsqkgbgudcoajvzfMohrzrcnbzbmjkqwkhksqotlcsuwzqqiodddtqjjtlvbilbhckvrskvylgkccczhrlslavtztcnhcusgxdibrqwfcycaunftwmtlvefvameybjvlfiuqldoqarseukwemhipxawdbbkkqcerjdubyxgtyxgjbsssjhimsDlFduoopqrfrctdzgaktsmritlljrc
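
The reported pool size follows from how `letters` is built: ten doublings of the 26 lowercase letters, then ten appended copies of the 26 uppercase letters. The short check below reproduces that arithmetic. It is a standalone sketch, and the names `lower_count`, `upper_count` and `pool_size` are illustrative.

```python
import string

# Ten doublings of the lowercase alphabet: 26 * 2**10 characters
lower_count = len(string.ascii_lowercase) * 2 ** 10
# Ten appended copies of the uppercase alphabet: 26 * 10 characters
upper_count = len(string.ascii_uppercase) * 10

pool_size = lower_count + upper_count
print("pool size:", pool_size)
print("share of uppercase characters: {:.2%}".format(upper_count / pool_size))
```

Uppercase characters make up under 1% of the pool, which is why they appear only occasionally in the sampled strings.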


In [15]:
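#TODO : pass the range as input

def get_product_ids():
    # random.randint includes both endpoints, so this returns IDs in [1, 75000000];
    # the product IDs generated above run from 0 to 74999999
    return random.randint(1, 75000000)

get_product_ids_udf = F.udf(get_product_ids, IntegerType())


def get_seller_ids():
    # returns IDs in [1, 10]; the seller IDs generated above run from 0 to 9
    return random.randint(1, 10)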

get_seller_ids_udf = F.udf(get_seller_ids, IntegerType())
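
`random.randint(a, b)` includes both endpoints, so the two generators above can occasionally produce an ID that has no matching row in the products or sellers data. The sketch below contrasts it with `random.randrange`, which excludes the upper bound. The seed, the draw count and the name `n_sellers` are illustrative assumptions.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

n_sellers = 10  # stand-in for the ten sellers, whose IDs run from 0 to 9

# Draw many IDs with each function and keep the distinct values
randint_values = {random.randint(1, n_sellers) for _ in range(10_000)}
randrange_values = {random.randrange(1, n_sellers) for _ in range(10_000)}

print("largest ID from randint:", max(randint_values))
print("largest ID from randrange:", max(randrange_values))
```

If every sale must join to an existing seller or product, `randrange` (or an adjusted upper bound) is one way to keep the generated IDs inside the valid range.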


In [24]:
start_index = 0
chunk_size = 100000 # 100 thousand, i.e. 1 lakh
end_index = chunk_size

PROD_ZERO_CHUNKS = 190 
OTHER_PROD_CHUNKS = 10

product_zero_sales = PROD_ZERO_CHUNKS * chunk_size
other_product_sales = OTHER_PROD_CHUNKS * chunk_size

product_zero_sales, other_product_sales

(19000000, 1000000)
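
The two loops that follow advance `start_index` and `end_index` by `chunk_size` on each pass, so the `order_id` ranges are consecutive and do not overlap. The same bookkeeping can be expressed as a generator of half-open ranges. `chunk_ranges` is a hypothetical helper, not something this notebook defines.

```python
def chunk_ranges(n_chunks, chunk_size, start=0):
    """Yield (start, end) pairs for n_chunks consecutive half-open ranges."""
    for i in range(n_chunks):
        lo = start + i * chunk_size
        yield lo, lo + chunk_size


chunk_size = 100_000
# 190 chunks for product 0, then 10 chunks for the other products
zero_ranges = list(chunk_ranges(190, chunk_size))
other_ranges = list(chunk_ranges(10, chunk_size, start=zero_ranges[-1][1]))

print(zero_ranges[0], zero_ranges[-1])
print(other_ranges[0], other_ranges[-1])
```

Each pair can be passed straight to `spark.range(start, end)`, which also treats the end as exclusive.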

In [25]:
for i in tqdm(range(PROD_ZERO_CHUNKS)):
    sales_df = spark.range(start_index, end_index).withColumnRenamed('id', 'order_id')
    sales_df = sales_df.withColumn("product_id", F.lit(0))
    sales_df = sales_df.withColumn("seller_id", F.lit(0))
    sales_df = sales_df.withColumn("date", get_dates_udf())
    sales_df = sales_df.withColumn("num_pieces_sold", get_num_pieces_sold_udf())
    sales_df = sales_df.withColumn("bill_raw_text", random_string_udf())
    sales_df.write.parquet("data/sales_parquet_temp", mode="append")
    start_index = end_index
    end_index = end_index + chunk_size


100%|██████████| 190/190 [09:43<00:00,  3.07s/it]


In [26]:
for i in tqdm(range(OTHER_PROD_CHUNKS)):
    sales_df = spark.range(start_index, end_index).withColumnRenamed('id', 'order_id')
    sales_df = sales_df.withColumn("product_id", get_product_ids_udf())
    sales_df = sales_df.withColumn("seller_id", get_seller_ids_udf())
    sales_df = sales_df.withColumn("date", get_dates_udf())
    sales_df = sales_df.withColumn("num_pieces_sold", get_num_pieces_sold_udf())
    sales_df = sales_df.withColumn("bill_raw_text", random_string_udf())
    sales_df.write.parquet("data/sales_parquet_temp", mode="append")
    start_index = end_index
    end_index = end_index + chunk_size

100%|██████████| 10/10 [00:31<00:00,  3.16s/it]


In [27]:
sales_df = spark.read.parquet("data/sales_parquet_temp")
sales_df.show()

+--------+----------+---------+----------+---------------+--------------------+
|order_id|product_id|seller_id|      date|num_pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+---------------+--------------------+
|19020000|  19788583|        6|2020-07-06|             90|ypbcmaeftyaxnyept...|
|19020001|  51274713|        8|2020-07-06|             78|mtadxvnxakwnxzspm...|
|19020002|  70376547|       10|2020-07-09|             96|xmclnwkgktnjhpzdf...|
|19020003|  10864788|        4|2020-07-03|             47|fwayvxaxbvjjgysle...|
|19020004|   7442520|        2|2020-07-01|             19|rYmbdxgfnzywzhboA...|
|19020005|  27704185|        1|2020-07-02|             55|yxdamsyxqbbsgudyt...|
|19020006|  31776189|        4|2020-07-06|              2|crnabzjyjfiabicqy...|
|19020007|  44260415|        7|2020-07-10|             62|hJnxwncokcugfzvtn...|
|19020008|  55961415|        5|2020-07-10|             39|jsqlemxosuzzohxjk...|
|19020009|  57684076|        3|2020-07-0

In [28]:
sales_df.count()

20000000

In [29]:
%%time
sales_df.repartition(200, F.col("product_id")).write.parquet("data/sales_parquet", mode="overwrite")

CPU times: user 12.8 ms, sys: 1.61 ms, total: 14.4 ms
Wall time: 1min 23s
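
Repartitioning by `product_id` sends every row with the same key to the same partition, so all 19 million `product_id = 0` rows land in one of the 200 output partitions. The simulation below illustrates this on a scaled-down dataset with the same 95/5 split. It uses Python's built-in `hash` modulo the partition count as a stand-in for Spark's hash partitioner, so the partition numbers themselves will differ from Spark's.

```python
import random
from collections import Counter

random.seed(42)
NUM_PARTITIONS = 200

# Scaled-down key column: 95% of rows use product 0, the rest use random products
keys = [0] * 19_000 + [random.randint(1, 75_000_000) for _ in range(1_000)]

# Stand-in for hash partitioning: partition = hash(key) mod number of partitions
rows_per_partition = Counter(hash(k) % NUM_PARTITIONS for k in keys)

largest_partition, largest_size = rows_per_partition.most_common(1)[0]
print("largest partition holds {:.1%} of the rows".format(largest_size / len(keys)))
print("non-empty partitions:", len(rows_per_partition))
```

However many partitions are requested, the one holding product 0 keeps at least 95% of the rows, which is the skew this dataset is meant to exhibit.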
