# Retail Sales Data Preparation using Spark

This notebook prepares retail data for training a regression model that predicts the total sales revenue of a product at a store, using the following features:
- Brand (The brand of the product)
- Quantity (Quantity of product purchased)
- Advert (Whether the product had an advertisement or not)
- Price (How much the product costs)

<div><img src="https://stanalyticssolutionsdev.blob.core.windows.net/assets/sales_forecasting.jpg?sp=r&st=2022-09-23T16:12:34Z&se=2025-01-01T01:12:34Z&spr=https&sv=2021-06-08&sr=b&sig=l8Prl1UTwclNsUJQhhCKGxL%2B21dGPvUQVJKnEpB0NRk%3D" width="500" height="300"/></div>

In [0]:
%pip install dlt

Python interpreter will be restarted.
Collecting dlt
  Downloading dlt-0.2.3-py3-none-any.whl (9.3 kB)
Collecting tensorflow
  Downloading tensorflow-2.11.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
Collecting tensorflow-estimator<2.12,>=2.11.0
  Downloading tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)
Collecting keras
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
Installing collected packages: tensorflow-estimator, tensorboard, keras, tensorflow, dlt
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.10.0
    Not uninstalling tensorflow-estimator at /databricks/python3/lib/python3.9/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-0f70eb4a-282f-464a-9dc7-9ebf1de80d26
    Can't uninstall 'tensorflow-estimator'. No files were found to uninstall.
  Attempting uninstall

## Importing Libraries

In [0]:
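# Standard library
import logging
from copy import deepcopy
from datetime import datetime
from io import BytesIO

# Third-party
from dateutil import parser

# Spark and Delta Live Tables
import dlt
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import percent_rank
# Explicit type imports instead of a wildcard, so the names used by the schema are visible
from pyspark.sql.types import (
    DateType, FloatType, IntegerType, StringType, StructField, StructType
)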

## Defining the schema for the data

In [0]:
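# Column layout of the source file; passing it explicitly avoids schema inference.
Dataschema = StructType([
    StructField("ID", StringType()),          # transaction identifier
    StructField("WeekStarting", DateType()),  # start date of the sales week
    StructField("Store", IntegerType()),      # store identifier
    StructField("Brand", StringType()),       # brand of the product
    StructField("Quantity", IntegerType()),   # quantity of product purchased
    StructField("Advert", IntegerType()),     # whether the product had an advertisement
    StructField("Price", FloatType()),        # how much the product costs
    StructField("Revenue", FloatType())       # total sales revenue (the prediction target)
])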


## Load the data from the source and perform the transformations

In [0]:
@dlt.table(comment="Raw data")
def bronze_SalesTrans():
  return (spark.read.csv('/mnt/data-source/Store Transactions Data/dbo.SalesTransData.txt',schema=Dataschema))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File <command-2734317015726152>:1
----> 1 @dlt.table(comment="Raw data")
      2 def bronze_SalesTrans():
      3   return (spark.read.csv('/mnt/data-source/Store Transactions Data/dbo.SalesTransData.txt',schema=Dataschema))

AttributeError: module 'dlt' has no attribute 'table'

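The `AttributeError` above suggests that the imported `dlt` module is not the Delta Live Tables module. The `%pip install dlt` cell installs a PyPI package that happens to share the name, while the Delta Live Tables `dlt` module is normally supplied by the runtime when the notebook runs as part of a pipeline. The sketch below is one possible guard, not part of the original notebook: `has_dlt_table_api` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins used only to illustrate the check.

```python
from types import SimpleNamespace


def has_dlt_table_api(module: object) -> bool:
    """Return True if the module exposes the two callables this notebook uses."""
    return all(callable(getattr(module, name, None)) for name in ("table", "read"))


# Stand-ins for illustration only (assumptions, not real modules):
# one mimics a same-named package without the API, the other mimics the expected API.
wrong_package = SimpleNamespace(__name__="dlt")
pipeline_like = SimpleNamespace(
    table=lambda **kwargs: (lambda func: func),
    read=lambda name: name,
)

print(has_dlt_table_api(wrong_package))  # False
print(has_dlt_table_api(pipeline_like))  # True
```

In the notebook, calling `has_dlt_table_api(dlt)` right after `import dlt` would surface the mismatch before any table definition is evaluated.
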
In [0]:
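@dlt.table(comment="Silver data")
def silver_rank_data():
    # Rank every row by week over one global window. percent_rank() is 0.0 for the
    # earliest week and grows toward 1.0 for later weeks, which enables a time-based split.
    window = Window.orderBy("WeekStarting")
    return dlt.read("bronze_SalesTrans").withColumn("rank", percent_rank().over(window))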
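
To make the `rank` column concrete, here is a minimal pure-Python sketch of how a percent rank is computed: `(rank - 1) / (n - 1)`, where tied values share the rank of the first of them in sorted order. The function below is an illustration and not Spark's implementation, and the dates are made up for the example.

```python
from datetime import date
from typing import List, Sequence


def percent_rank(values: Sequence) -> List[float]:
    """Percent rank of each value: (rank - 1) / (n - 1), with ties sharing a rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    first_position = {}
    for position, value in enumerate(sorted(values)):
        # The first sorted position of a value equals its rank minus one.
        first_position.setdefault(value, position)
    return [first_position[value] / (n - 1) for value in values]


# Hypothetical week-start dates; two weeks contain more than one row.
weeks = [
    date(2020, 1, 6), date(2020, 1, 6),
    date(2020, 1, 13),
    date(2020, 1, 20), date(2020, 1, 20),
]
ranks = percent_rank(weeks)
print(ranks)  # [0.0, 0.0, 0.5, 0.75, 0.75]
```

Because rows from the same week tie, they receive the same rank, so a threshold on `rank` never separates rows that belong to one week.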



In [0]:
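@dlt.table(comment="Gold data")
def gold_train():
    # Training set: rows whose percent rank by week is at most 0.8 (the earlier weeks).
    return dlt.read("silver_rank_data").where("rank <= .8").drop("rank")


@dlt.table(comment="Gold data")
def gold_test():
    # Test set: the remaining rows with percent rank above 0.8 (the most recent weeks).
    return dlt.read("silver_rank_data").where("rank > .8").drop("rank")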
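
The two gold tables amount to a chronological split at the 0.8 percent rank. The sketch below mirrors the `where(...)` and `drop("rank")` steps on plain dictionaries; `split_by_rank` is a hypothetical helper, and the rows and their precomputed `rank` values are invented for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_rank(rows: List[Row], threshold: float = 0.8) -> Tuple[List[Row], List[Row]]:
    """Split rows on their 'rank' value, then drop the 'rank' key from both parts."""

    def without_rank(row: Row) -> Row:
        return {key: value for key, value in row.items() if key != "rank"}

    train = [without_rank(row) for row in rows if row["rank"] <= threshold]
    test = [without_rank(row) for row in rows if row["rank"] > threshold]
    return train, test


# Hypothetical rows, one per week, with the percent rank already attached.
rows = [
    {"WeekStarting": "2020-01-06", "Revenue": 10.0, "rank": 0.0},
    {"WeekStarting": "2020-01-13", "Revenue": 12.5, "rank": 0.25},
    {"WeekStarting": "2020-01-20", "Revenue": 11.0, "rank": 0.5},
    {"WeekStarting": "2020-01-27", "Revenue": 14.0, "rank": 0.75},
    {"WeekStarting": "2020-02-03", "Revenue": 13.0, "rank": 1.0},
]
train, test = split_by_rank(rows)
print(len(train), len(test))  # 4 1
```

Splitting by time rather than at random keeps the most recent weeks out of training, so the test set resembles the future periods the regression model would be asked to predict.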

