# Trusted Data Generation - Status Items

## Scope of notebook

> Create `Status` dataset following the requirements below.

* Order statuses - Dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
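
To make the target shape concrete, here is a minimal plain-Python sketch of the reshaping the requirement describes. It is not the notebook's implementation: the helper name `one_row_per_order` and the `STATUSES` tuple are hypothetical, and keeping the earliest timestamp when a status repeats is an assumption, not something the requirement states. The three events are taken from the raw sample shown in this notebook, with timestamps kept as strings (this format sorts correctly as text).

```python
from collections import defaultdict

# The four event types named in the requirement.
STATUSES = ("REGISTERED", "PLACED", "CONCLUDED", "CANCELLED")


def one_row_per_order(events):
    """Collapse (order_id, status, created_at) events into one dict per order."""
    by_order = defaultdict(dict)
    for order_id, status, created_at in events:
        # Assumption: if a status repeats for an order, keep the earliest timestamp.
        previous = by_order[order_id].get(status)
        if previous is None or created_at < previous:
            by_order[order_id][status] = created_at
    # Emit every status as a column; events that never happened stay None.
    return [
        {"order_id": order_id, **{s: seen.get(s) for s in STATUSES}}
        for order_id, seen in by_order.items()
    ]


events = [
    ("0002fe02-d7dc-4232-b7ac-3394019ce240", "CONCLUDED", "2019-01-25 01:05:07"),
    ("0002fe02-d7dc-4232-b7ac-3394019ce240", "REGISTERED", "2019-01-24 23:04:27"),
    ("0002fe02-d7dc-4232-b7ac-3394019ce240", "PLACED", "2019-01-24 23:04:28"),
]
rows = one_row_per_order(events)
```

Each element of `rows` holds one order with a timestamp (or `None`) for every status, which is the layout the Spark aggregation in this notebook aims for.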

Import libraries.

In [1]:
%load_ext autoreload
%autoreload 2
from src.config import RAW_DATA_PATH
from src.IOController import create_pyspark_session
from src.DataProcessor import create_trusted_order_items, explore_dataframe

Spin up a PySpark session.

In [2]:
spark = create_pyspark_session()

Starting PySpark session. Check your terminal for detailed logging...
PySpark session sucessfully created.


Read the raw `status` dataset and inspect its schema.

In [3]:
df = spark.read.parquet(str(RAW_DATA_PATH / 'status'))
df.describe()

DataFrame[summary: string, order_id: string, status_id: string, value: string]

In [4]:
df.limit(3).toPandas()

Unnamed: 0,created_at,order_id,status_id,value
0,2019-01-25 01:05:07,0002fe02-d7dc-4232-b7ac-3394019ce240,b4298862-fa38-499a-93e2-a76930fb2bce,CONCLUDED
1,2019-01-24 23:04:27,0002fe02-d7dc-4232-b7ac-3394019ce240,7964bf63-007a-484d-a321-e9118ccc2f97,REGISTERED
2,2019-01-24 23:04:28,0002fe02-d7dc-4232-b7ac-3394019ce240,ca16b92b-db8f-4274-b165-929675541a9f,PLACED


In [11]:
import pyspark.sql.functions as F


df.limit(10).dropDuplicates().groupBy("order_id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("value", "created_at"))).alias("status_created_at")
).toPandas()

Unnamed: 0,order_id,status_created_at
0,0002fe02-d7dc-4232-b7ac-3394019ce240,"{'PLACED': 2019-01-24 23:04:28, 'CONCLUDED': 2..."
1,000cef8c-83c7-49eb-a0fb-404e6dc2150e,"{'PLACED': 2019-01-17 22:42:18, 'CONCLUDED': 2..."
2,0010995b-9212-455a-85ea-11ea7dd526c1,{'REGISTERED': 2019-01-01 22:11:21}
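
The map column above packs all statuses into a single field. If explicit timestamp columns are preferred, one possible alternative is a pivot, sketched below. This is an illustration under assumptions, not the notebook's method: `pivot_statuses` and `STATUSES` are hypothetical names, and taking `F.min` (the earliest timestamp) when a status repeats for an order is an assumed rule. The function expects a DataFrame with the `order_id`, `value`, and `created_at` columns shown earlier.

```python
# Hypothetical constant: the four statuses listed in the notebook's scope.
STATUSES = ["REGISTERED", "PLACED", "CONCLUDED", "CANCELLED"]


def pivot_statuses(df):
    """Return one row per order_id with one timestamp column per status."""
    # Imported lazily so this sketch can be loaded without a Spark installation.
    import pyspark.sql.functions as F

    return (
        df.groupBy("order_id")
        # Listing the pivot values up front fixes the output columns and
        # spares Spark an extra pass to discover the distinct statuses.
        .pivot("value", STATUSES)
        # Assumption: when a status repeats for an order, keep the earliest event.
        .agg(F.min("created_at"))
    )
```

Calling `pivot_statuses(df)` on the DataFrame loaded above would yield the columns `order_id`, `REGISTERED`, `PLACED`, `CONCLUDED`, and `CANCELLED`, with nulls for events an order never reached. Unlike `map_from_entries`, this aggregation does not depend on each status appearing at most once per order.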


Conclusion:

> The resulting DataFrame matches the requirements. It has a one-to-many relationship to the `Orders` dataset (primary key is `order_id`). Checking its shape and first three rows gives more confidence that the processing went well. A quick remark: it has 7.5 million rows, an average of about 2 items per order, since the `Orders` dataset has 3.6 million items. Seems reasonable!

Order Items dataset successfully generated!