# Variant of load-features that de-duplicates the product info for candidate search

##### The `.jsonl` file is described in `./training-data-schema.json`

#### What this notebook does

1. Reads parameters for the existing data
2. Creates a small development dataset
3. Expands to the full data and streams it into formatted TFRecord files in a new bucket, `TF_RECORDS_DIR`

In [8]:
!gsutil -m rm gs://prod-catalog-central/* # clears the target directory; use this if you are adding additional fields

Removing gs://prod-catalog-central/file_00-0.tfrec...
Removing gs://prod-catalog-central/file_01-1.tfrec...
Removing gs://prod-catalog-central/file_02-1.tfrec...
Removing gs://prod-catalog-central/file_03-1.tfrec...
Removing gs://prod-catalog-central/file_04-1.tfrec...
Removing gs://prod-catalog-central/file_05-1.tfrec...
Removing gs://prod-catalog-central/file_06-1.tfrec...
Removing gs://prod-catalog-central/file_07-1.tfrec...
Removing gs://prod-catalog-central/file_104-1.tfrec...
Removing gs://prod-catalog-central/file_10-1.tfrec...
Removing gs://prod-catalog-central/file_08-1.tfrec...
Removing gs://prod-catalog-central/file_09-1.tfrec...
Removing gs://prod-catalog-central/file_103-1.tfrec...
Removing gs://prod-catalog-central/file_100-1.tfrec...
Removing gs://prod-catalog-central/file_101-1.tfrec...
Removing gs://prod-catalog-central/file_102-1.tfrec...
Removing gs://prod-catalog-central/file_105-1.tfrec...                          
Removing gs://prod-catalog-central/file_106-1.tfre

In [9]:
#set parameters
BUCKET = 'gs://mcskinner-sample-data/2tower/last-view'
SCHEMA_JSON = 'gs://mcskinner-sample-data/2tower/last-view/training-data-schema.json'
TRAIN_JSON = 'gs://tfrs-sample-data/training-data.jsonl'
MY_BUCKET = 'gs://tfrs-sample-data'
TF_RECORDS_DIR = 'gs://prod-catalog-central'
SMALL_DATASET = 'gs://tfrs-sample-data/training-data_dev.jsonl'

In [10]:
### Create a smaller dev dataset first
import json
import os
import subprocess
from tensorflow.python.lib.io import file_io


ROW_LIMIT = 10000
jsonl_path = os.path.join(MY_BUCKET, 'training-data.jsonl')
SMALL_DATASET = 'gs://tfrs-sample-data/training-data_dev.jsonl'

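# Stream the first ROW_LIMIT lines of the full file into the dev dataset (the returned text is gsutil's console output)
input_file_columns = subprocess.getoutput(f'gsutil cp {jsonl_path} - | head -n {ROW_LIMIT} | gsutil cp - {SMALL_DATASET}')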

### Read in the schema for later parsing

#### FYI: Pre-calculated counts

In [11]:
num_records = 4293302 #sum(1 for _ in file_io.FileIO(SMALL_DATASET, 'rb')) #CHANGE THIS TO LARGE DATASET WHEN READY
print("Total number of records: {}".format(num_records))

Total number of records: 4293302


### Establish the parameters

`num_samples` is the number of data samples in each TFRecord file.

`num_tfrecords` is the total number of TFRecord files that we will create.

As a general guideline, aim for TFRecord files of around 100 MB each.
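
The sizing arithmetic can be sketched as follows. This is only a rough illustration, assuming you have measured an average serialized record size. The helper names `samples_per_shard` and `shard_count` are hypothetical and not part of this notebook, and the 2,700-byte record size is an illustrative value rather than a measurement.

```python
import math


def samples_per_shard(target_shard_bytes, avg_record_bytes):
    """Number of records that fit in one shard of roughly the target size."""
    if avg_record_bytes <= 0:
        raise ValueError("avg_record_bytes must be positive")
    return max(1, target_shard_bytes // avg_record_bytes)


def shard_count(num_records, num_samples):
    """Number of shards needed, rounding up so a final partial shard is counted."""
    return math.ceil(num_records / num_samples)


# Illustrative numbers only: a 100 MB target and an assumed 2,700-byte average record.
per_shard = samples_per_shard(100 * 10**6, 2700)
print(per_shard, shard_count(4293302, per_shard))
```

Rounding up with `math.ceil` gives the same result as the floor-division-plus-remainder check used in the next cell.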

In [12]:
#To be tuned later

num_samples = 12228 * 3 # tripled because the original number produced shards of only about 33 MB
num_tfrecords = num_records // num_samples 
if num_records % num_samples:
    num_tfrecords += 1

print("Number of Expected TFRecords: {}".format(num_tfrecords))

Number of Expected TFRecords: 118


### These helper functions declare different feature types
These are used to parse the jsonl file.

Note: [this Keras guide](https://keras.io/examples/keras_recipes/creating_tfrecords/#define-dataset-helper-functions) is a good resource.

#### Notes on data transforms:
* Grabbing all fields available for query
* Transforming and flattening array / ragged data inputs (`last_viewed`, `ss_prodTypeCombo_ss`, etc.)
* Using a special delimiter (`|` in this case) so the values can later be unpacked with a string split for the text vectorizer layer
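
As a small illustration of the delimiter approach, the sketch below joins a list of product types into one string and splits it back. It mirrors the leading-delimiter form built later in this notebook (for example `"|a|b"`), which produces an empty first token on a plain split, so the sketch drops empty tokens. The function names and the sample values are hypothetical. In a model, the split would more likely be done in-graph with `tf.strings.split`.

```python
DELIMITER = "|"


def flatten_with_delimiter(items, delimiter=DELIMITER):
    """Concatenate items into one string, each prefixed by the delimiter."""
    return "".join(delimiter + item for item in items)


def unpack_delimited(text, delimiter=DELIMITER):
    """Split a delimited string back into items, dropping empty tokens."""
    return [token for token in text.split(delimiter) if token]


# Made-up product types, used only to show the round trip.
flat = flatten_with_delimiter(["Sofas", "Sectionals"])
print(flat)
print(unpack_delimited(flat))
```

The delimiter must be a character that never appears inside a value, otherwise the split will cut a single value into several tokens.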

In [13]:
import tensorflow as tf
import numpy as np


def string_array(value):
    """Returns a bytes_list Feature from an iterable of strings.

    Note: a plain str is iterated character by character.
    """
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v.encode('utf-8') for v in value]))


def float_feature(value):
    """Returns a float_list Feature from an iterable of floats / doubles."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[float(v) for v in value]))


def int64_feature(value):
    """Returns an int64_list Feature from an iterable of bool / enum / int / uint values."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(v) for v in value]))


def float_feature_list(value):
    """Returns a float_list Feature from a list of floats, passed through without conversion."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

    
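def parse_line_de_dupe(ln):
    """Builds a tf.train.Example holding only the candidate (product) features of one record."""
    c_ln = ln['candidate']

    # Flatten the product-type list into one delimited string, e.g. "|a|b"
    pt = ""
    for item in c_ln['productTypeCombo_ss']:
        pt = pt + "|" + item

    pt_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(pt).encode('utf-8')]))

    feature = {
        # Query features are intentionally left out: this is a de-duplicated, candidate-only TFRecord
        # candidate features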
        "IVM_s": string_array(c_ln['IVM_s']), 
        "description": string_array(c_ln['description']),
        "price_td": float_feature(c_ln['price_td']),
        "PriceRange_s": string_array(c_ln['PriceRange_s']),
        "productTypeCombo_ss": pt_feature, 
        "visual": float_feature(c_ln['visual'])
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

### Create a target TF Records dataset for later use

In [None]:
import time
from tqdm import tqdm

# Generate the data in the correct format for TFRecords
record_counter = 0
lns = []  # holder for the lines in the current batch
tfrec_counter = 0
sku_accum = []  # running in-memory list of IVM_s values already written (used to de-dupe)
# Make sure the machine has enough memory to hold every unique IVM_s in the data

# Write one batch of lines to a TFRecord shard, skipping products already seen
def write_a_tfrec(lns):
    # The file name carries the shard index and the batch size before de-duplication
    with tf.io.TFRecordWriter(
        TF_RECORDS_DIR + "/file_%.2i-%i.tfrec" % (tfrec_counter, len(lns))
    ) as writer:
        for ln in lns:
            c_ln = ln['candidate']
            if c_ln['IVM_s'] not in sku_accum:
                sku_accum.append(c_ln['IVM_s'])
                example = parse_line_de_dupe(ln)
                writer.write(example.SerializeToString())
            

###### This still needs work: shards will not be evenly sized. They get smaller as the shard sequence increases (unique products become harder to find as you go)
with file_io.FileIO(TRAIN_JSON, 'r') as reader:
    for line in reader:
        record_counter += 1
        lns.append(json.loads(line))  # add the parsed record to the current batch; without this every shard is written empty
        if record_counter % num_samples == 0 or record_counter == num_records:
            write_a_tfrec(lns)  # write out a batch
            lns = []  # reset to a new batch
            tfrec_counter += 1
#             pbar.update(record_counter)
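
One possible way to address the uneven shard sizes noted above is to de-duplicate while streaming and cut a shard only once enough unique records have accumulated. The sketch below is a minimal pure-Python illustration, not this notebook's method: `unique_batches` and `key_fn` are hypothetical names, the records are toy dictionaries, and it assumes the de-duplication key is hashable (a list-valued key would need converting to a tuple first). Using a `set` also makes membership checks constant time on average, whereas scanning a list slows down as more products are seen.

```python
def unique_batches(records, key_fn, batch_size):
    """Yield lists of up to batch_size records, skipping keys already seen."""
    seen = set()
    batch = []
    for record in records:
        key = key_fn(record)
        if key in seen:
            continue  # duplicate product, skip it
        seen.add(key)
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Toy records standing in for parsed JSONL lines.
toy = [{"candidate": {"IVM_s": s}} for s in ["a", "b", "a", "c", "b", "d", "e"]]
for i, batch in enumerate(unique_batches(toy, lambda r: r["candidate"]["IVM_s"], 2)):
    print(i, [r["candidate"]["IVM_s"] for r in batch])
```

Each yielded batch could then be handed to a writer function, so every shard except possibly the last holds the same number of unique products.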
