# Data Engineer Technical Assignment
### Author: Konrad Wronski
Description: Technical assignment description

You are a data engineer working for a retail company. The company’s database handles large volumes of data. You are tasked with creating a datalake in ADLS for reporting purposes.

For simplicity, instead of tables, assume files are in a source folder “Sales_Data” in csv format and instead of ADLS, target location is “cleansed” folder (Create it yourself). Source folder contains monthly sales data files.

1. Design the process to read all the files from source folder using PySpark, combine them as a single file, and write it to the cleansed folder.
2. Which file format would you choose to write in the cleansed folder and why?
3. Mention data partitioning strategy you would propose for this table and justify your choice of partitioning method.
4. Additionally, outline the steps you would take to implement this partitioning strategy, considering both the technical aspects and potential challenges.

Note: Process all the files in a single run. Ensure that there is no data duplication.
[Hint: Make use of Window function for deduplication]

### Expected results of assignment:
- A notebook or Python script.
- A separate documentation file with a brief explanation of the approach, data exploration,
assumptions/considerations, and instructions on how to run the application (if any).
- Output dataset in cleansed folder in your preferred file format.
- Data quality checks (like input/output dataset validation)
### Metyis development guidelines
- We appreciate a combination of Software and Data Engineering good practices.
- Proper logging and exception handling
### Evaluation criteria for results of technical assignment
We use the following criteria to evaluate results:
- Well-structured code: we expect maintainability, readability.
- Scalability: Should be able to handle high volumes of data.
- Documentation

## 1. Data Ingestion 

### 1.1 Importing packages

In [60]:
from pyspark.sql import SparkSession
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt
import os 
from pyspark.sql.types import *


### 1.2 Initializing Session

In [61]:
spark = SparkSession.builder.appName("SalesETL").getOrCreate()

In [62]:
source = "/Users/konradwronski/Desktop/Projects/Grind/DataScienceJungle/MetyisTask/Sales_Data"
endpoint = "/Users/konradwronski/Desktop/Projects/Grind/DataScienceJungle/MetyisTask/Cleansed"
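
The two absolute paths above only resolve on the author's machine. A minimal sketch of deriving both folders from one base directory, and creating the target folder if it is missing (the assignment asks for it to be created), might look like the following. The helper name `resolve_folders` and its `base_dir` parameter are hypothetical and not part of the original notebook; the folder names `Sales_Data` and `Cleansed` are taken from the paths above.

```python
from pathlib import Path


def resolve_folders(base_dir):
    """Return (source, endpoint) paths under base_dir, creating the target folder.

    Hypothetical helper, not part of the original notebook.
    """
    base = Path(base_dir)
    source = base / "Sales_Data"
    endpoint = base / "Cleansed"
    if not source.is_dir():
        raise FileNotFoundError(f"Source folder not found: {source}")
    # exist_ok=True makes the call safe to repeat on later runs
    endpoint.mkdir(parents=True, exist_ok=True)
    # Spark readers and writers accept plain string paths
    return str(source), str(endpoint)
```

Assuming the notebook is started from the project folder, `source, endpoint = resolve_folders(Path.cwd())` would replace the two hard-coded assignments.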

#### 1.2.1 Checking the file

In [63]:
# Read a single month as a sample, reusing the `source` folder defined above
df = spark.read.options(header=True, inferSchema=True).csv(os.path.join(source, "Sales_April_2019.csv"))

In [64]:
df.show()

+--------+--------------------+----------------+----------+--------------+--------------------+
|Order ID|             Product|Quantity Ordered|Price Each|    Order Date|    Purchase Address|
+--------+--------------------+----------------+----------+--------------+--------------------+
|  176558|USB-C Charging Cable|               2|     11.95|04/19/19 08:46|917 1st St, Dalla...|
|    NULL|                NULL|            NULL|      NULL|          NULL|                NULL|
|  176559|Bose SoundSport H...|               1|     99.99|04/07/19 22:30|682 Chestnut St, ...|
|  176560|        Google Phone|               1|     600.0|04/12/19 14:38|669 Spruce St, Lo...|
|  176560|    Wired Headphones|               1|     11.99|04/12/19 14:38|669 Spruce St, Lo...|
|  176561|    Wired Headphones|               1|     11.99|04/30/19 09:27|333 8th St, Los A...|
|  176562|USB-C Charging Cable|               1|     11.95|04/29/19 13:03|381 Wilson St, Sa...|
|  176563|Bose SoundSport H...|         

In [65]:
df.printSchema()

root
 |-- Order ID: integer (nullable = true)
 |-- Product: string (nullable = true)
 |-- Quantity Ordered: integer (nullable = true)
 |-- Price Each: double (nullable = true)
 |-- Order Date: string (nullable = true)
 |-- Purchase Address: string (nullable = true)



### 1.3 Loading the data 

Creating an empty DataFrame with an explicit schema to store the merged files.

In [None]:
# Explicit schema matching the columns inferred from the sample file above
salesSchema = StructType([
    StructField('Order ID', IntegerType(), True),
    StructField('Product', StringType(), True),
    StructField('Quantity Ordered', IntegerType(), True),
    StructField('Price Each', DoubleType(), True),
    StructField('Order Date', StringType(), True),
    StructField('Purchase Address', StringType(), True),
])
# Empty DataFrame that each monthly file is appended to
sales = spark.createDataFrame(data=[], schema=salesSchema)
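
The schema above fixes six column names. As a lightweight input check before handing files to Spark, each file's header row can be compared with those names using only the standard library. This is a sketch under assumptions: `EXPECTED_COLUMNS` simply repeats the field names from `salesSchema`, and `header_matches` is a hypothetical helper that is not part of the original notebook.

```python
import csv

# Column names copied from salesSchema; kept here so the check runs without Spark
EXPECTED_COLUMNS = [
    "Order ID", "Product", "Quantity Ordered",
    "Price Each", "Order Date", "Purchase Address",
]


def header_matches(path, expected=EXPECTED_COLUMNS):
    """Return True if the first row of the CSV at `path` equals `expected`."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), None)  # None for an empty file
    return header == list(expected)
```

Files that fail the check could be logged and skipped, or the run could be stopped, depending on how strict the pipeline should be.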

In [79]:
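def load(file, base):
    '''Read one CSV file and append its rows to the running DataFrame `base`.'''
    # header=True takes column names from the first row; inferSchema=True lets Spark guess the types
    dfFile = spark.read.options(header=True, inferSchema=True).csv(file)
    # unionByName matches columns by name rather than by position
    return base.unionByName(dfFile)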
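
When the source folder is enumerated with `os.listdir`, entries come back in arbitrary order and include everything in the folder, such as sub-folders or the hidden `.DS_Store` file that macOS may create. A possible sketch of a helper that returns only the CSV files, sorted by name, is shown here; `list_csv_files` is a hypothetical name that does not appear in the original notebook.

```python
import os


def list_csv_files(folder):
    """Return full paths of the .csv files in `folder`, sorted by file name."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip sub-folders and anything that is not a CSV file (e.g. .DS_Store)
        if os.path.isfile(full_path) and name.lower().endswith(".csv"):
            paths.append(full_path)
    return paths
```

The resulting list could be fed to `load` one path at a time, or passed as a whole to `spark.read.csv`, which also accepts a list of paths.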

Looping through all of the files in the Sales_Data folder to merge them into one DataFrame.

In [81]:
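# os.fsencode turns the path into bytes, so os.listdir returns bytes names (hence the b'...' in the log)
directory = os.fsencode(source)

for file in os.listdir(directory):
    # Decode each name back to str before building the full path
    filename = os.path.join(source, os.fsdecode(file))
    sales = load(filename, sales)
    # count() is a Spark action, so the accumulated union is recomputed on every iteration
    print(f"File {file} has been loaded and appended to the main file. Current Length of the file {sales.count()}")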

File b'Sales_December_2019.csv' has been loaded and appended to the main file. Current Length of the file 25117
File b'Sales_April_2019.csv' has been loaded and appended to the main file. Current Length of the file 43500
File b'Sales_February_2019.csv' has been loaded and appended to the main file. Current Length of the file 55536
File b'Sales_March_2019.csv' has been loaded and appended to the main file. Current Length of the file 70762
File b'Sales_August_2019.csv' has been loaded and appended to the main file. Current Length of the file 82773
File b'Sales_May_2019.csv' has been loaded and appended to the main file. Current Length of the file 99408
File b'Sales_November_2019.csv' has been loaded and appended to the main file. Current Length of the file 117069
File b'Sales_October_2019.csv' has been loaded and appended to the main file. Current Length of the file 137448
File b'Sales_January_2019.csv' has been loaded and appended to the main file. Current Length of the file 147171
File