# Dataset

As part of this exercise, we are going to create our own dataset simulating a shop that sells products.
To introduce data skew, we favour seller `0` and product `0` over all the others.


## Products  

: product_id - The product ID  
: product_name  - The product name   
: product_price - The product price   

## Sellers  

: seller_id  - The seller ID   
: seller_name  - The seller name    
: daily_target - The number of items (regardless of product type) the seller must sell to hit their quota. For example, if the daily target is 100,000, the seller can hit the quota by selling 100,000 units of product_0, or by selling 30,000 units of product_1 and 70,000 units of product_2   
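
The quota rule can be sketched in plain Python. The helper `quota_met` and the sample unit counts are illustrative assumptions, not part of the dataset code; the point is only that units are summed across products before being compared with `daily_target`.

```python
def quota_met(units_by_product, daily_target):
    """Return True when the units sold across all products reach the daily target."""
    return sum(units_by_product.values()) >= daily_target


# Both mixes from the example reach a target of 100,000 units.
single_product = {"product_0": 100_000}
mixed_products = {"product_1": 30_000, "product_2": 70_000}

print(quota_met(single_product, 100_000))
print(quota_met(mixed_products, 100_000))
```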

## Sales  

: order_id  - The order ID   
: product_id  - The single product sold in the order (all orders have exactly one product)   
: seller_id  - The ID of the employee who sold the product   
: date - The date of the order.    
: num_pieces_sold - The number of units sold for the specific product in the order    
: bill_raw_text  -  A string that represents the raw text of the bill associated with the order   


`bill_raw_text` is what actually controls the data size in our experiment.
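
A back-of-the-envelope estimate shows why this column dominates. The sketch below is an assumption-laden approximation: `estimate_text_bytes` is a hypothetical helper, it counts one byte per ASCII character, and it ignores Parquet encoding and compression. The 200-character bill length is the default of `random_string()` defined later in this notebook, and the 20,000,000-row figure assumes the commented-out `chunk_size` of 100000.

```python
def estimate_text_bytes(num_rows, chars_per_bill):
    """Rough size of the bill_raw_text column, assuming one byte per ASCII character."""
    return num_rows * chars_per_bill


# Assumed inputs: 20,000 sales rows with 200-character bills (the small run).
small_run = estimate_text_bytes(20_000, 200)
# Assumed full-scale run: 20,000,000 rows with the same bill length.
full_run = estimate_text_bytes(20_000_000, 200)

print(f"small run: {small_run / 1024**2:.1f} MiB")
print(f"full run:  {full_run / 1024**3:.1f} GiB")
```

Changing the bill length scales the estimate linearly, which makes it a convenient knob for sizing the experiment.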

In [1]:
import os
import pandas as pd
from tqdm import tqdm
import csv
import random
import string
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import Row
from numpy.random import rand
from pyspark.sql.types import IntegerType, StringType


random.seed(42)

In [2]:
!python --version

Python 3.7.9


In [3]:
!pip list  | grep -i spark

findspark           1.4.2
pyspark             2.4.7
pytest-spark        0.6.0
spark-nlp           2.7.3
sparknlp            1.0.0


**Note: PySpark and the Spark standalone installation should be the same version.**
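
One way to sanity-check this is to compare the PySpark version with the version embedded in the Spark install directory name. This is only a sketch: `spark_versions_match` is a hypothetical helper, and it assumes the directory follows the usual `spark-<version>-bin-...` naming of the official downloads.

```python
import re


def spark_versions_match(pyspark_version, spark_home):
    """Compare a PySpark version with the version embedded in a Spark directory name."""
    found = re.search(r"spark-(\d+\.\d+\.\d+)", spark_home)
    return found is not None and found.group(1) == pyspark_version


# Values taken from the pip listing and the standalone install path used in this notebook.
print(spark_versions_match("2.4.7", "/opt/softwares/spark-2.4.7-bin-hadoop2.7"))
```

In a live session the first argument could come from `pyspark.__version__`.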

My machine has the following configuration:
- 6 cores (12 vCores)
- 32 GB RAM

Spark Standalone server:
```
cd /opt/softwares/spark-2.4.7-bin-hadoop2.7

# export your python bin path
export PYSPARK_PYTHON=/opt/envs/ai4e/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/envs/ai4e/bin/python

# start the master and worker daemons
sbin/start-all.sh
# stop them again once you are done
sbin/stop-all.sh
```
Spark UI: [http://localhost:8080](http://localhost:8080)   
Spark Master URL : spark://IMCHLT276:7077

In [4]:
spark = SparkSession.builder \
    .master("spark://IMCHLT276:7077") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "10") \
    .config("spark.local.dir", "/opt/tmp/spark-temp/") \
    .appName("DataSkewness") \
    .getOrCreate()

# https://stackoverflow.com/questions/27774443/is-it-safe-to-temporarily-rename-tmp-and-then-create-a-tmp-symlink-to-a-differe
# sudo mount --bind /path/to/dir/with/plenty/of/space /tmp
# sudo lsof /tmp # check for apps
# sudo umount /tmp
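
With `spark.cores.max` set to 10 and `spark.executor.cores` set to 2, the standalone cluster can launch at most 5 executors for this application, which at 4g each amounts to 20 GB of executor heap on the 32 GB machine. The arithmetic is sketched below; `executor_plan` is a hypothetical helper, and the result is an upper bound that assumes the worker has enough free cores and memory and ignores driver memory and per-executor overhead.

```python
def executor_plan(cores_max, executor_cores, executor_memory_gb):
    """Upper bound on executors and their combined heap for the given settings."""
    executors = cores_max // executor_cores
    return executors, executors * executor_memory_gb


# Settings copied from the SparkSession builder in this notebook.
executors, total_memory_gb = executor_plan(
    cores_max=10, executor_cores=2, executor_memory_gb=4
)
print(executors, "executors,", total_memory_gb, "GB of executor memory")
```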

In [5]:
spark

**Number of Records**

Adjust the number of records to suit your needs.

In [5]:
PRODUCTS_COUNT = 7500#75000000 # number of records for the product table
SELLERS_COUNT = 10 # number of records for the seller table

In [6]:
start_index = 0
# Number of records per chunk
chunk_size = 100 #100000 # 100 thousand (1 lakh)
end_index = chunk_size

PROD_ZERO_CHUNKS = 190 # Number of chunks for product zero
OTHER_PROD_CHUNKS = 10 # Number of chunks for other products

product_zero_sales = PROD_ZERO_CHUNKS * chunk_size
other_product_sales = OTHER_PROD_CHUNKS * chunk_size

product_zero_sales, other_product_sales

(19000, 1000)
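
These two counts fix the degree of skew. A minimal sketch of the share of rows that belong to product 0, where `skew_share` is a hypothetical helper and the numbers are the ones computed above:

```python
def skew_share(hot_rows, other_rows):
    """Fraction of all rows that belong to the favoured key."""
    return hot_rows / (hot_rows + other_rows)


chunk_size = 100
prod_zero_rows = 190 * chunk_size   # PROD_ZERO_CHUNKS * chunk_size
other_rows = 10 * chunk_size        # OTHER_PROD_CHUNKS * chunk_size

print(f"{skew_share(prod_zero_rows, other_rows):.0%} of sales rows are product 0")
```

The share depends only on the ratio of the two chunk counts, so raising `chunk_size` grows the dataset without changing the skew.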

### Generate products

In [7]:
products_df = spark.range(PRODUCTS_COUNT).withColumnRenamed("id", "product_id").select(F.col("product_id"))

In [8]:
def get_product_name(product_id):
    return f"product_{product_id}"

get_product_name_udf = F.udf(get_product_name, StringType())
products_df = products_df.withColumn("product_name", get_product_name_udf(F.col("product_id")))

In [9]:
def get_price():
    return random.randint(1, 150)
get_price_udf = F.udf(get_price, IntegerType())

products_df = products_df.withColumn("product_price", get_price_udf())
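
Note that `get_price` draws from the module-level `random` generator inside the executors' Python workers. Those are separate processes from the driver where `random.seed(42)` was called, and Spark may re-evaluate the UDF for each action, so the prices are not guaranteed to repeat between runs or between `show()` and the later write. One possible alternative, sketched here in plain Python rather than as a Spark UDF, derives the price from the product ID alone. `price_for` is a hypothetical helper, not part of the original notebook.

```python
import random


def price_for(product_id, seed=42, low=1, high=150):
    """Price derived only from (seed, product_id), so repeated calls agree."""
    rng = random.Random(f"{seed}:{product_id}")  # private generator, string-seeded
    return rng.randint(low, high)


prices = [price_for(pid) for pid in range(5)]
print(prices)
# Recomputing gives the same prices because nothing depends on global state.
print(prices == [price_for(pid) for pid in range(5)])
```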

In [10]:
if "PYSPARK_PYTHON" not in os.environ:
    print("Export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON and restart your Jupyter!")

In [11]:
products_df.count(), products_df.show()

+----------+------------+-------------+
|product_id|product_name|product_price|
+----------+------------+-------------+
|         0|   product_0|          135|
|         1|   product_1|            3|
|         2|   product_2|           95|
|         3|   product_3|           87|
|         4|   product_4|          150|
|         5|   product_5|           59|
|         6|   product_6|          140|
|         7|   product_7|          110|
|         8|   product_8|           31|
|         9|   product_9|          108|
|        10|  product_10|           65|
|        11|  product_11|          138|
|        12|  product_12|          116|
|        13|  product_13|           79|
|        14|  product_14|           12|
|        15|  product_15|           78|
|        16|  product_16|           16|
|        17|  product_17|          110|
|        18|  product_18|          148|
|        19|  product_19|           86|
+----------+------------+-------------+
only showing top 20 rows



(7500, None)

In [13]:
products_df.repartition(10).write.parquet("data/products_parquet")

### Generate sellers

In [14]:
seller_ids = [x for x in range(1, SELLERS_COUNT)]
print("Sellers between {} and {}".format(seller_ids[0], seller_ids[-1]))

Sellers between 1 and 9


In [15]:
sellers = [[0, "seller_0", 2500000]]

for s in tqdm(seller_ids):
    sellers.append([s, "seller_{}".format(s), random.randint(12000, 2000000)])
    
#   Save dataframe
df = pd.DataFrame(sellers)
df.columns = ["seller_id", "seller_name", "daily_target"]
df = spark.createDataFrame(df)
df.show()
df.write.parquet("data/sellers_parquet")

100%|██████████| 9/9 [00:00<00:00, 44047.53it/s]


+---------+-----------+------------+
|seller_id|seller_name|daily_target|
+---------+-----------+------------+
|        0|   seller_0|     2500000|
|        1|   seller_1|     1352975|
|        2|   seller_2|      245478|
|        3|   seller_3|       64451|
|        4|   seller_4|     1567144|
|        5|   seller_5|      588778|
|        6|   seller_6|      525575|
|        7|   seller_7|      480106|
|        8|   seller_8|      304632|
|        9|   seller_9|     1556492|
+---------+-----------+------------+



### Generate sales

Prepare skewed data in which product 0 has far more entries than any other product. This means we need to create two sales dataframes and concatenate them.

Due to memory constraints on a single machine, each sales dataframe is further divided into chunks, and each chunk is written to disk as soon as it is computed.
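
The chunk bookkeeping amounts to walking through consecutive half-open `order_id` ranges. A minimal sketch of that idea follows. `chunk_ranges` is a hypothetical helper; the notebook itself advances `start_index` and `end_index` by hand.

```python
def chunk_ranges(start, chunk_size, num_chunks):
    """Yield consecutive half-open (start, end) order_id ranges, one per chunk."""
    for _ in range(num_chunks):
        end = start + chunk_size
        yield start, end
        start = end


ranges = list(chunk_ranges(start=0, chunk_size=100, num_chunks=3))
print(ranges)  # [(0, 100), (100, 200), (200, 300)]
```

Each pair can be passed to `spark.range(start, end)`, which also treats the end as exclusive, so no `order_id` is repeated across chunks.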

Let's start by defining the UDFs needed...

In [16]:
dates = ['2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08',
         '2020-07-09', '2020-07-10']
def get_dates():
    return random.choice(dates)

get_dates_udf = F.udf(get_dates, StringType())


In [17]:
def get_num_pieces_sold():
    return random.randint(1, 100)

get_num_pieces_sold_udf = F.udf(get_num_pieces_sold, IntegerType())

In [18]:
letters = string.ascii_lowercase
letters_upper = string.ascii_uppercase

for _i in range(0, 10):
    letters += letters

for _i in range(0, 10):
    letters += letters_upper

print("Number of chars to choose from", len(letters))
sample_string = random.sample(letters, 500)
# print("sample_string", ''.join(sample_string))

def random_string(stringLength=200):
    """Generate a random string of fixed length """
    return ''.join(random.sample(letters, stringLength))

random_string_udf = F.udf(random_string,StringType())


def static_string():
    """static string of fixed length """
    return ''.join(sample_string)

static_string_udf = F.udf(static_string, StringType())

Number of chars to choose from 26884
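
The pool size follows from the two loops: the first doubles the lowercase alphabet ten times, the second appends the uppercase alphabet ten times. The sketch below rebuilds the pool with the same logic (the name `pool` is used here only to avoid clobbering `letters`) and shows that uppercase characters make up roughly 1% of it, which is why they are rare in `bill_raw_text`.

```python
import string

# Rebuild the pool the same way as the cell above.
pool = string.ascii_lowercase
for _ in range(10):
    pool += pool                      # doubles the pool: 26 * 2**10 characters
for _ in range(10):
    pool += string.ascii_uppercase    # appends 26 uppercase letters each time

upper_count = sum(ch.isupper() for ch in pool)
print(len(pool), upper_count, f"{upper_count / len(pool):.2%}")
```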


In [19]:
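# TODO: pass the range as input

def get_product_ids():
    # random.randint includes both endpoints, so PRODUCTS_COUNT itself can be returned
    return random.randint(1, PRODUCTS_COUNT)

get_product_ids_udf = F.udf(get_product_ids, IntegerType())


def get_seller_ids():
    # random.randint includes both endpoints, so seller 10 can be returned
    return random.randint(1, 10)

get_seller_ids_udf = F.udf(get_seller_ids, IntegerType())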
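
Because `random.randint(a, b)` includes `b`, these UDFs can produce a seller ID of 10 and a product ID equal to `PRODUCTS_COUNT`, neither of which exists in the generated sellers (0 to 9) or products (0 to `PRODUCTS_COUNT - 1`) tables. If every sale should match a row in those tables, `random.randrange`, which excludes the upper bound, is one option. The comparison below is a plain-Python sketch with an arbitrary seed, not a change to the notebook's UDFs.

```python
import random

rng = random.Random(0)  # private generator so the sketch does not disturb global state

# randint(1, 10) can return 10; randrange(1, 10) stops at 9.
with_randint = {rng.randint(1, 10) for _ in range(10_000)}
with_randrange = {rng.randrange(1, 10) for _ in range(10_000)}

print(max(with_randint), max(with_randrange))
```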


Each loop below writes one chunk of `chunk_size` rows at a time, using the `start_index` and `end_index` defined earlier.

Product zero entries...

In [20]:
for i in tqdm(range(PROD_ZERO_CHUNKS)):
    sales_df = spark.range(start_index, end_index).withColumnRenamed('id', 'order_id')
    sales_df = sales_df.withColumn("product_id", F.lit(0))
    sales_df = sales_df.withColumn("seller_id", F.lit(0))
    sales_df = sales_df.withColumn("date", get_dates_udf())
    sales_df = sales_df.withColumn("num_pieces_sold", get_num_pieces_sold_udf())
    sales_df = sales_df.withColumn("bill_raw_text", random_string_udf())
    sales_df.write.parquet("data/sales_parquet_temp", mode="append")
    start_index = end_index
    end_index = end_index + chunk_size


100%|██████████| 190/190 [00:39<00:00,  4.80it/s]


Other product entries in sales DF...

In [21]:
for i in tqdm(range(OTHER_PROD_CHUNKS)):
    sales_df = spark.range(start_index, end_index).withColumnRenamed('id', 'order_id')
    sales_df = sales_df.withColumn("product_id", get_product_ids_udf())
    sales_df = sales_df.withColumn("seller_id", get_seller_ids_udf())
    sales_df = sales_df.withColumn("date", get_dates_udf())
    sales_df = sales_df.withColumn("num_pieces_sold", get_num_pieces_sold_udf())
    sales_df = sales_df.withColumn("bill_raw_text", random_string_udf())
    sales_df.write.parquet("data/sales_parquet_temp", mode="append")
    start_index = end_index
    end_index = end_index + chunk_size

100%|██████████| 10/10 [00:01<00:00,  5.87it/s]


In [22]:
sales_df = spark.read.parquet("data/sales_parquet_temp")
sales_df.show()

+--------+----------+---------+----------+---------------+--------------------+
|order_id|product_id|seller_id|      date|num_pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+---------------+--------------------+
|   19560|      7011|        7|2020-07-07|             17|bcboaxagnoatzzcsl...|
|   19561|      3094|        1|2020-07-02|             43|otvbjmdiwcspdkrkl...|
|   19562|      7320|       10|2020-07-01|             43|rbttlqnsfiprszvwj...|
|   19563|      5604|        6|2020-07-04|             54|wqlrhepzuvmiyquei...|
|   19564|      7075|       10|2020-07-04|             43|onycjowovrmhdoejr...|
|   19565|      2847|        3|2020-07-09|              1|rcocliwbzaumdstpk...|
|   19566|      3725|        1|2020-07-08|             74|zeqijvkodkdvfxsfy...|
|   19567|        11|        4|2020-07-02|             74|cltqkvaoufaSfwlgc...|
|   19568|      6416|        7|2020-07-10|             21|tsfrzxqbsotnycqtu...|
|   19569|      4148|        9|2020-07-0

In [23]:
sales_df.count()

20000

In [24]:
%%time
sales_df.repartition(200, F.col("product_id")).write.parquet("data/sales_parquet", mode="overwrite")

CPU times: user 3.74 ms, sys: 0 ns, total: 3.74 ms
Wall time: 4.35 s
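
Repartitioning by `product_id` sends every row with the same key to the same partition, so the product 0 rows end up together however many partitions are requested. The rough plain-Python simulation below illustrates this. It assumes a simple modulo as a stand-in for Spark's actual hash function, uses the small-run row counts from this notebook, and draws the other product IDs with an arbitrary seed.

```python
from collections import Counter
import random

NUM_PARTITIONS = 200
rng = random.Random(0)

# 19,000 rows for product 0 plus 1,000 rows spread over the other product IDs.
product_ids = [0] * 19_000 + [rng.randint(1, 7_499) for _ in range(1_000)]

# Stand-in for Spark's hash partitioner: plain modulo on the key.
rows_per_partition = Counter(pid % NUM_PARTITIONS for pid in product_ids)

largest = max(rows_per_partition.values())
print(f"largest partition: {largest} of {len(product_ids)} rows")
```

One partition holds at least 95% of the rows, which is the imbalance the skew exercises are meant to expose.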


In [10]:
import pyspark.sql.functions as F

In [11]:
given_df=spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

In [12]:
given_df.show()

+-----------------+
|      SampleField|
+-----------------+
|The old brown fox|
|       jumps over|
|     the lazy log|
+-----------------+



In [13]:
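from pyspark.sql.types import ArrayType, StringType

def someNLP(text):
    # Split the text into whitespace-separated tokens
    return text.split()

# Declare the return type: F.udf defaults to StringType, which would not keep the list as an array
nlp_udf = F.udf(someNLP, ArrayType(StringType()))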

In [None]:
given_df.withColumn("new_col", nlp_udf(F.col("SampleField"))).show(truncate=False)
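
For reference, the tokens the UDF should place in `new_col` can be previewed without Spark. The sketch below reuses the same whitespace split on the three sample rows; `some_nlp` is simply a local copy of the logic in `someNLP`.

```python
def some_nlp(text):
    """Whitespace tokenizer, the same logic as someNLP above."""
    return text.split()


rows = ["The old brown fox", "jumps over", "the lazy log"]
for row in rows:
    print(row, "->", some_nlp(row))
```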