# Build Base Feature/Future Store
The output is a feature store (when built on train) or a future store (when built on test) at the item-loc-day level, that is, one row per item, store and date, with all item, store and event information attached.

These stores are known as the *BASE FEATURE/FUTURE STORE*. Further information will be added to them, but the base will remain relatively static.
For example, a pipeline that tries to better model sales during events will engineer its own features on top of the base feature store, rather than having them built into the base feature store here.


In [None]:
from config import proj
import pyspark.sql.functions as sf
from src.utils.validation_utils import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

build_on = "test" # train builds the feature store, test builds the future store.

## Add provided data

### Pull in train or test

In [2]:
if build_on == "train":
    base = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("train.parquet")))
elif build_on == "test":
    base = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("test.parquet")))
else:
    raise NotImplementedError("Can only build feature or future store")

### Add item

In [3]:
items = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("items.parquet")))
base = base.join(items, base.item_nbr == items.item_nbr, "left").drop(items.item_nbr)
base.show(5)

+---------+----------+---------+--------+-----------+------------+-----+----------+
|       id|      date|store_nbr|item_nbr|onpromotion|      family|class|perishable|
+---------+----------+---------+--------+-----------+------------+-----+----------+
|125497040|2017-08-16|        1|   96995|      false|   GROCERY I| 1093|         0|
|125497041|2017-08-16|        1|   99197|      false|   GROCERY I| 1067|         0|
|125497042|2017-08-16|        1|  103501|      false|    CLEANING| 3008|         0|
|125497043|2017-08-16|        1|  103520|      false|   GROCERY I| 1028|         0|
|125497044|2017-08-16|        1|  103665|      false|BREAD/BAKERY| 2712|         1|
+---------+----------+---------+--------+-----------+------------+-----+----------+
only showing top 5 rows



### Add store

In [4]:
stores = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("stores.parquet")))
base = base.join(stores, base.store_nbr == stores.store_nbr, "left").drop(stores.store_nbr)
base.show(5)

+---------+----------+---------+--------+-----------+------------+-----+----------+-----+---------+----+-------+
|       id|      date|store_nbr|item_nbr|onpromotion|      family|class|perishable| city|    state|type|cluster|
+---------+----------+---------+--------+-----------+------------+-----+----------+-----+---------+----+-------+
|125497040|2017-08-16|        1|   96995|      false|   GROCERY I| 1093|         0|Quito|Pichincha|   D|     13|
|125497041|2017-08-16|        1|   99197|      false|   GROCERY I| 1067|         0|Quito|Pichincha|   D|     13|
|125497042|2017-08-16|        1|  103501|      false|    CLEANING| 3008|         0|Quito|Pichincha|   D|     13|
|125497043|2017-08-16|        1|  103520|      false|   GROCERY I| 1028|         0|Quito|Pichincha|   D|     13|
|125497044|2017-08-16|        1|  103665|      false|BREAD/BAKERY| 2712|         1|Quito|Pichincha|   D|     13|
+---------+----------+---------+--------+-----------+------------+-----+----------+-----+---------+----+-------+

## Feature engineering

### New and cleared item flag
These flags tell us how to treat particular items. Cleared items (present in train but not in test) won't need predictions, so we can possibly filter them out. New items (present in test but not in train) need a different treatment, since the model won't have seen them before.

In [5]:
train = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("train.parquet")))
test = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("test.parquet")))

train_items = train.select("item_nbr", sf.lit(1).alias("train_fl")).distinct()
test_items = test.select("item_nbr", sf.lit(1).alias("test_fl")).distinct()

item_coverage = train_items.join(test_items, train_items.item_nbr == test_items.item_nbr, "full")

new_items = item_coverage.filter("train_fl is null")\
    .drop(train_items.item_nbr)\
    .select(test_items.item_nbr)\
    .withColumn("new_item", sf.lit(True)) # items not in train

cleared_items = item_coverage.filter("test_fl is null")\
    .drop(test_items.item_nbr)\
    .select(train_items.item_nbr)\
    .withColumn("cleared_item", sf.lit(True)) # items not in test

In [6]:
base = base\
    .join(new_items, ["item_nbr"], "left")\
    .join(cleared_items, ["item_nbr"], "left")\
    .na.fill(value = False, subset=["new_item", "cleared_item"])
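
The new/cleared logic above boils down to two set differences. The following is a minimal sketch in plain Python rather than Spark, using made-up item sets and a hypothetical helper `item_flags`, just to make the intended flag values explicit:

```python
# Toy item sets standing in for the distinct item_nbr values of train and test.
train_items = {96995, 99197, 103501}
test_items = {99197, 103501, 103665}

# Items only in test were never seen in training: "new".
new_items = test_items - train_items
# Items only in train no longer need forecasts: "cleared".
cleared_items = train_items - test_items


def item_flags(item_nbr):
    """Return the (new_item, cleared_item) flags for one item."""
    return item_nbr in new_items, item_nbr in cleared_items


print(item_flags(103665))  # (True, False)
print(item_flags(96995))   # (False, True)
print(item_flags(99197))   # (False, False)
```

One consequence of this definition: a store built on test only contains test items, so `cleared_item` can never be true there, and likewise `new_item` can never be true in a store built on train.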

### Add events
Engineer appropriate flags for events. Holidays marked as transferred are excluded.
This adds the following columns to the base store:
- event_nat - flag for national event
- event_reg - flag for regional event
- event_loc - flag for local event
- event_type - type of event (the national type takes precedence, then regional, then local)

In [7]:
holidays = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("holidays_events.parquet")))

In [8]:
holidays_nat = holidays\
    .filter("locale == 'National'")\
    .filter("transferred == false")\
    .select(["date", "type"])\
    .withColumnRenamed("type", "type_nat")\
    .withColumn("event_nat", sf.lit(True))
holidays_reg = holidays\
    .filter("locale == 'Regional'")\
    .filter("transferred == false")\
    .select(["date", "type", "locale_name"])\
    .withColumnRenamed("locale_name", "state")\
    .withColumnRenamed("type", "type_reg")\
    .withColumn("event_reg", sf.lit(True))
holidays_loc = holidays\
    .filter("locale == 'Local'")\
    .filter("transferred == false")\
    .select(["date", "type", "locale_name"])\
    .withColumnRenamed("locale_name", "city")\
    .withColumnRenamed("type", "type_loc")\
    .withColumn("event_loc", sf.lit(True))

In [9]:
base = base\
    .join(holidays_nat, ["date"], "left")\
    .join(holidays_reg, ["date", "state"], "left")\
    .join(holidays_loc, ["date", "city"], "left")\
    .na.fill(value = False, subset = ["event_nat", "event_reg", "event_loc"])

In [10]:
base = base.withColumn("event_type", sf.coalesce(base["type_nat"], base["type_reg"], base["type_loc"]))\
    .drop("type_nat", "type_reg", "type_loc")
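
The `sf.coalesce` call above keeps the first non-null type in the order national, regional, local. A minimal plain-Python sketch of that precedence rule, with a hypothetical `coalesce` helper and made-up type values:

```python
def coalesce(*values):
    """Return the first value that is not None, mirroring SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical (type_nat, type_reg, type_loc) values for one store and date.
print(coalesce("Holiday", None, "Event"))  # Holiday
print(coalesce(None, None, "Additional"))  # Additional
print(coalesce(None, None, None))          # None
```

When several event levels fall on the same date, only one type survives in `event_type`, while the three flags still record each level separately.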

# Validation
- Count rows
- Check for nulls, etc

Check row counts

In [11]:
if build_on == "train":
    assert base.count() == 125497040, "Feature store rows have changed due to bad join. Should be the same as train."
elif build_on == "test":
    assert base.count() == 3370464, "Future store rows have changed due to bad join. Should be the same as test."
else:
    raise NotImplementedError("Can only build feature or future store")
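
A left join only changes the row count when the right-hand table has more than one row for a join key. If the assertion above fails, checking each lookup table for duplicate keys is one way to find the cause. This is a minimal plain-Python sketch of that check; `duplicate_keys` is a hypothetical helper and the rows are made up:

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the join keys that occur more than once in `rows`."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical lookup rows: two national events recorded on the same date.
holidays_nat = [
    {"date": "2017-08-16", "type_nat": "Holiday"},
    {"date": "2017-08-16", "type_nat": "Event"},
    {"date": "2017-08-17", "type_nat": "Holiday"},
]

# Joining on "date" alone would duplicate every base row for 2017-08-16.
print(duplicate_keys(holidays_nat, ["date"]))  # {('2017-08-16',): 2}
```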

Check for missing values

In [17]:
null_allowed = ["event_type"]
null_not_allowed = [col_name for col_name in base.columns if col_name not in null_allowed]

for col_name in null_not_allowed:
    assert_col_has_no_null(base, col_name)
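
`assert_col_has_no_null` comes from the project's `validation_utils` module and its implementation is not shown here. As an illustration of the kind of check involved, here is a plain-Python sketch over a list of row dictionaries; `assert_no_nulls` and the sample rows are hypothetical and are not the project's helper:

```python
def assert_no_nulls(rows, col_name):
    """Raise AssertionError if any row has None in `col_name`."""
    n_null = sum(1 for row in rows if row[col_name] is None)
    assert n_null == 0, f"Column '{col_name}' has {n_null} null value(s)"


# Made-up rows: event_type is legitimately null on days without an event.
rows = [
    {"item_nbr": 96995, "event_nat": False, "event_type": None},
    {"item_nbr": 99197, "event_nat": True, "event_type": "Holiday"},
]

null_allowed = ["event_type"]
for col_name in rows[0]:
    if col_name not in null_allowed:
        assert_no_nulls(rows, col_name)  # passes: only event_type has a null
```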

In [None]:
base.show(5)

# Write feature store
TODO