## Generating data

Let's generate some data to build our data lake. We will generate a SALES_ORDER_FACT dataset in a versioning-enabled S3 bucket, partitioned by year/month/day/hour. Each hour gets one Parquet file of roughly 100-128 MB, so a full year of data comes to 8,760 files and about 1 TB in total.
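
The size estimate above follows from simple arithmetic. Here is a minimal sketch of that calculation, assuming a non-leap year and decimal units (1 TB = 1,000,000 MB); the function name `estimate_dataset` is only illustrative and is not part of the notebook.

```python
# Back-of-the-envelope check of the dataset size described above.
HOURS_PER_DAY = 24
DAYS_IN_YEAR = 365  # assumption: a non-leap year such as 2018

def estimate_dataset(min_file_mb=100, max_file_mb=128):
    """Return (file_count, low_tb, high_tb) for one Parquet file per hour."""
    file_count = DAYS_IN_YEAR * HOURS_PER_DAY
    low_tb = file_count * min_file_mb / 1_000_000   # MB -> TB, decimal units
    high_tb = file_count * max_file_mb / 1_000_000
    return file_count, low_tb, high_tb

file_count, low_tb, high_tb = estimate_dataset()
print(f"{file_count} files, roughly {low_tb:.2f}-{high_tb:.2f} TB")
```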

Package dependencies for generating the data:

* pandas
* PyArrow
* s3fs

This was run on a 300-unit EMR instance fleet with a mix of R4 and R5 Spot Instances, with the Python packages above bootstrapped onto the cluster. The total run time should be under 2 hours.

In [20]:
%%configure -f
{"driverMemory": "8000M","executorMemory": "8000M", "executorCores": 1, "numExecutors":152, "conf":  { "spark.executor.memoryOverhead":"2G"}}

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 6 | application_1562004426807_0007 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 5 | application_1562004426807_0006 | pyspark | idle | Link | Link | |
| 6 | application_1562004426807_0007 | pyspark | idle | Link | Link | ✔ |
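
To get a feel for what the `%%configure` settings request from the cluster, the following sketch adds up the memory figures. It assumes that each executor container is sized as executor memory plus memory overhead, that `M` and `G` denote MiB and GiB, and it ignores the driver's own overhead and any rounding YARN applies to container sizes. The function name `cluster_memory` is hypothetical.

```python
# Rough memory footprint implied by the %%configure cell above.
EXECUTOR_MEMORY_MIB = 8000       # "executorMemory": "8000M"
MEMORY_OVERHEAD_MIB = 2 * 1024   # "spark.executor.memoryOverhead": "2G"
NUM_EXECUTORS = 152              # "numExecutors": 152
DRIVER_MEMORY_MIB = 8000         # "driverMemory": "8000M"

def cluster_memory():
    """Return (MiB per executor container, total GiB requested)."""
    per_executor = EXECUTOR_MEMORY_MIB + MEMORY_OVERHEAD_MIB
    total_mib = per_executor * NUM_EXECUTORS + DRIVER_MEMORY_MIB
    return per_executor, total_mib / 1024

per_executor_mib, total_gib = cluster_memory()
print(f"{per_executor_mib} MiB per executor, about {total_gib:.0f} GiB in total")
```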


The *gen_order* function generates one order with 1 to 5 order lines of random data and writes each line as a JSON record to the open file `f`.

In [21]:
import datetime
import os
import random
import string
from random import randint, randrange

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import Row
from pyspark.sql.functions import input_file_name
from s3fs import S3FileSystem

def gen_order(f,order_id,i,startDate):
    """Write one order (1-5 order lines) to file f, one JSON record per line.

    i (the hour index of the file) is currently unused.
    """
    # Order-level attributes are shared by every line of the order.
    site_id=random.randint(1,500)
    order_date=startDate
    ship_mode=random.choice(['STANDARD','ONE-DAY','TWO-DAY','NO-RUSH'])
    # The last modification falls at a random second within 24 hours of the order date.
    last_modified_timestamp=startDate + datetime.timedelta(seconds=randrange(86400))
    lines=random.randint(1,5)
    for k in range(lines):
        line_id=k+1
        line_number=k+1
        product_id=random.randint(0,1000)
        quantity=random.randint(0,100)
        unit_price=float(random.randint(0,1000))
        # Cost, discount and tax are the unit price divided by a random factor.
        supply_cost=unit_price/random.uniform(0.1, 5.0)
        discount=unit_price/random.uniform(0.1, 5.0)
        tax=unit_price/random.uniform(0.1, 5.0)
        # One JSON object per line, so pandas can read the file with lines=True.
        f.write("{"+f'"ORDER_ID": {order_id}, "SITE_ID": {site_id}, \
"ORDER_DATE": "{order_date.isoformat()}", \
"SHIP_MODE": "{ship_mode}", "LINE_ID": {line_id}, "LINE_NUMBER": {line_number}, \
"PRODUCT_ID": {product_id}, "QUANTITY": {quantity}, "UNIT_PRICE": {unit_price}, \
"DISCOUNT": {discount}, "SUPPLY_COST": {supply_cost}, "TAX": {tax}, \
"LAST_MODIFIED_TIMESTAMP": "{last_modified_timestamp.isoformat()}"'+"}\n")

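The record format can also be produced without hand-built strings. The sketch below is an alternative, not the notebook's method: it builds each order line as a dict and serializes it with `json.dumps`, which takes care of quoting and escaping. The helper name `gen_order_lines` and the optional `rng` parameter (used here for a reproducible example) are hypothetical.

```python
import datetime
import json
import random

SHIP_MODES = ['STANDARD', 'ONE-DAY', 'TWO-DAY', 'NO-RUSH']

def gen_order_lines(order_id, start_date, rng=random):
    """Return a list of JSON strings, one per order line (hypothetical helper)."""
    # Order-level attributes are shared by every line of the order.
    site_id = rng.randint(1, 500)
    ship_mode = rng.choice(SHIP_MODES)
    last_modified = start_date + datetime.timedelta(seconds=rng.randrange(86400))
    records = []
    for k in range(rng.randint(1, 5)):
        unit_price = float(rng.randint(0, 1000))
        records.append(json.dumps({
            "ORDER_ID": order_id,
            "SITE_ID": site_id,
            "ORDER_DATE": start_date.isoformat(),
            "SHIP_MODE": ship_mode,
            "LINE_ID": k + 1,
            "LINE_NUMBER": k + 1,
            "PRODUCT_ID": rng.randint(0, 1000),
            "QUANTITY": rng.randint(0, 100),
            "UNIT_PRICE": unit_price,
            "DISCOUNT": unit_price / rng.uniform(0.1, 5.0),
            "SUPPLY_COST": unit_price / rng.uniform(0.1, 5.0),
            "TAX": unit_price / rng.uniform(0.1, 5.0),
            "LAST_MODIFIED_TIMESTAMP": last_modified.isoformat(),
        }))
    return records

# Seeded generator so the example is reproducible.
lines = gen_order_lines(42, datetime.datetime(2018, 1, 1), random.Random(0))
first = json.loads(lines[0])
print(len(lines), first["ORDER_ID"], first["ORDER_DATE"])
```

Each returned string can be written to the file followed by a newline to get the same JSON-lines layout.
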
Set `startDate = datetime.datetime(2018, 1, 1, 0, 0)` so that the data starts on January 1st, 2018. Each file index `i` is an offset in hours from that date.

In [23]:
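# Number of orders per file; each order expands to 1-5 order-line records.
recordsPerFile=1031500

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    """Return a random alphanumeric string, used as a temporary file name."""
    return ''.join(random.choice(chars) for _ in range(size))

def drop_file(i):
    """Write the orders for hour i to a new local file; return (filename, startDate)."""
    startDate = datetime.datetime(2018, 1, 1,0,0)+ datetime.timedelta(hours=i)
    filename='/tmp/'+id_generator()+'.txt'
    with open(filename, 'w') as f:
        # Order IDs are contiguous within a file and do not overlap across files.
        for k in range(i*recordsPerFile,(i+1)*recordsPerFile):
            gen_order(f,k,i,startDate)
    return (filename,startDate)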
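
To make the mapping from file index to data concrete, the following sketch computes the start timestamp and the order ID range that a given hour index produces, without writing any file. It reuses the notebook's start date and orders-per-file value; the helper name `file_window` is hypothetical.

```python
import datetime

RECORDS_PER_FILE = 1031500  # same value as recordsPerFile in the notebook
BASE_DATE = datetime.datetime(2018, 1, 1, 0, 0)

def file_window(i):
    """Return (start_date, order_id_range) for hour index i (hypothetical helper)."""
    start_date = BASE_DATE + datetime.timedelta(hours=i)
    order_ids = range(i * RECORDS_PER_FILE, (i + 1) * RECORDS_PER_FILE)
    return start_date, order_ids

start, ids = file_window(25)
print(start.isoformat(), ids.start, ids.stop - 1)
```

Because the ID ranges of consecutive indexes do not overlap, files can be generated in any order and in parallel without coordinating order IDs.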

In [24]:
bucket='neilawsversionedo'
t='SALES_ORDER_FACT'

# Convert the local JSON-lines file to Parquet and place it in the S3 data lake.
def generate_file(i):
    f,startDate=drop_file(i)
    try:
        df=pd.read_json(f,lines=True)
        table = pa.Table.from_pandas(df)
        # Hive-style partition path: year=/month=/day=/hour=
        s3Location='s3://{0}/cdc/{1}/year={2}/month={3}/day={4}/hour={5}'\
            .format(bucket,t,startDate.year,startDate.month,startDate.day,startDate.hour)
        s3 = S3FileSystem()
        pq.write_to_dataset(table, s3Location, filesystem=s3, use_dictionary=True, compression='snappy')
    finally:
        # Always remove the temporary local file, even if the write fails.
        os.remove(f)
    return True
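
The partition layout is easier to see in isolation. This sketch only builds the S3 prefix for a given timestamp, using the same format string as `generate_file`; the helper name `partition_path` and the bucket name `example-bucket` are placeholders, not values from the notebook.

```python
import datetime

def partition_path(bucket, table, ts):
    """Build the year/month/day/hour S3 prefix for the hour containing ts."""
    return 's3://{0}/cdc/{1}/year={2}/month={3}/day={4}/hour={5}'.format(
        bucket, table, ts.year, ts.month, ts.day, ts.hour)

path = partition_path('example-bucket', 'SALES_ORDER_FACT',
                      datetime.datetime(2018, 3, 9, 14, 0))
print(path)
```

Note that the partition values are not zero-padded, so March is written as `month=3` rather than `month=03`.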

`n` is the number of hours of data to generate, i.e. the number of days × 24. For example, three months of data is (31+30+31) × 24 = 2208 hours, and a full year is 365 × 24 = 8760 hours.
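
Rather than adding up month lengths by hand, `n` can be derived from a date range. A minimal sketch, assuming a half-open interval that includes the start and excludes the end; the helper name `hours_between` is hypothetical.

```python
import datetime

def hours_between(start, end):
    """Number of whole hours in the half-open interval [start, end)."""
    return int((end - start).total_seconds() // 3600)

# A full year starting January 1st, 2018.
n_year = hours_between(datetime.datetime(2018, 1, 1), datetime.datetime(2019, 1, 1))
# Three months with 31 + 30 + 31 days (October through December 2018).
n_quarter = hours_between(datetime.datetime(2018, 10, 1), datetime.datetime(2019, 1, 1))
print(n_year, n_quarter)
```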

In [25]:
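n=8760
# One row per hour index i.
df=sc.parallelize(range(n)).map(lambda x:Row(x)).toDF(["i"])
# Repartition into n partitions so the file-generation tasks spread across the executors.
df=df.repartition(n)
df.show(10)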

+----+
|   i|
+----+
|2680|
|1844|
| 584|
|2469|
|1019|
|3130|
| 616|
| 917|
|2675|
| 652|
+----+
only showing top 10 rows

In [None]:
df.rdd.map(lambda x:generate_file(x["i"])).collect()