# Kaggle competition: Shopee - Price Match Guarantee 

#### Full code is publicly available on Kaggle at www.kaggle.com/mvenou/productmatching-siamese-network

*In this Kaggle challenge our goal is to develop a discriminator network capable of identifying unique products by partitioning a set of images into distinct groups. During inference we are provided a dataset of 70,000+ product images, along with perceptual hash codes and user-submitted descriptions. A product may appear in up to 50 images in the set, and we are to return a CSV file listing each "post ID" together with the IDs of all images containing that same product. Our training set contains 32,000+ samples.*

*This is a heavily imbalanced problem. On the data side, we have a classification problem with vastly more classes than positive examples per class. On the technological side we have unconstrained machinery during training, but face CPU / GPU budget constraints (8 hours CPU or 2 hours GPU) for inference. With a 70,000+ image inference data set, this resource limitation is significant.*
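
To put the inference budget in perspective, the average time available per image can be worked out from the figures above. This is a back-of-the-envelope sketch: the function name is illustrative, and it assumes exactly 70,000 images and that the whole budget goes to inference with no start-up or I/O overhead.

```python
def seconds_per_image(budget_hours, num_images):
    """Average wall-clock seconds available per image within the budget."""
    return budget_hours * 3600 / num_images


NUM_IMAGES = 70_000  # assumed round figure for the "70,000+" inference set

print(f"GPU budget: {seconds_per_image(2, NUM_IMAGES):.3f} s per image")
print(f"CPU budget: {seconds_per_image(8, NUM_IMAGES):.3f} s per image")
```

Both figures are well under half a second per image, and that allowance has to cover loading, encoding and grouping.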

---

Model Architecture

My strategy is to build a one-shot learning model similar to [1], composed of a (resource-intensive) encoder and a very lightweight discriminator network. This model is trained on "Siamese pairs" of examples, where each training sample consists of an anchor image and two comparison images: one from the same class as the anchor and one from a different class. This Siamese structure alleviates the enormous class imbalance within the dataset. However, instead of using a distance metric to produce a decision boundary as in [1], [2] and [3], I use a modified binary classification (sigmoid output) with a loss function based on false positives and false negatives, which is discussed as an alternative option in [2].

Encodings for all of the test images will be produced and stored using a modestly sized network. We can then use an extremely light discriminator to categorize images into groups. In this way, while categorizing images may require a huge number of comparisons, the resource-intensive encoding process occurs only once per sample. On the other hand, classification of 70,000 images requires a great many pairwise comparisons that cannot benefit from GPU acceleration. Because of this, I limit my encoder's parameter count to a fraction of that used in [3] and run inference on CPU. Encoder training is run on GPU because our resources are only limited during inference.
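
The "encode once, compare many times" idea can be sketched in plain Python. Everything here is a toy stand-in: `toy_encoder` and `toy_discriminator` are hypothetical placeholders for the neural networks, and the product titles are made up. The point is only that the expensive step runs once per product while the cheap step runs once per pair.

```python
from itertools import combinations


def encode_all(products, encoder):
    """Run the (expensive) encoder exactly once per product."""
    return {post_id: encoder(features) for post_id, features in products.items()}


def score_pairs(encodings, discriminator):
    """Run the (cheap) discriminator on every unordered pair of encodings."""
    return {
        (id_a, id_b): discriminator(encodings[id_a], encodings[id_b])
        for id_a, id_b in combinations(sorted(encodings), 2)
    }


# Toy stand-ins (assumptions): the real encoder and discriminator are networks.
encoder_calls = []


def toy_encoder(features):
    encoder_calls.append(features)
    return float(len(features))


def toy_discriminator(enc_a, enc_b):
    return abs(enc_a - enc_b)


products = {"post_a": "red shoes", "post_b": "red shoe", "post_c": "blue phone case"}
encodings = encode_all(products, toy_encoder)
scores = score_pairs(encodings, toy_discriminator)
print(len(encoder_calls), "encoder calls,", len(scores), "pairwise comparisons")
```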

The encoder uses a pretrained TensorFlow Hub multilingual text embedding network [4] [5] [6] and a newly trained CNN based on [1] to extract features from the image and its text description. The discriminator takes the encoded images, encoded text, and perceptual hashes of two products as input, computes the "distance" along each of these features, and outputs an overall distance.
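
A rough sketch of that discriminator shape is given below. It is an illustration under stated assumptions rather than the trained model: the per-feature distance (mean absolute difference), the fixed `weights` and `bias`, and the dictionary layout of an encoded product are all hypothetical, standing in for parameters the real network learns.

```python
import math


def mean_abs_difference(vec_a, vec_b):
    """Per-feature 'distance': mean absolute difference of two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(vec_a, vec_b)) / len(vec_a)


def discriminator_sketch(product_a, product_b, weights=(1.0, 1.0, 1.0), bias=-1.0):
    """Combine per-feature distances into one sigmoid score in (0, 1).

    Higher scores mean 'more likely a non-match'. Weights and bias are
    placeholders for learned parameters.
    """
    feature_names = ("image", "text", "phash")
    distances = [
        mean_abs_difference(product_a[name], product_b[name])
        for name in feature_names
    ]
    logit = bias + sum(w * d for w, d in zip(weights, distances))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical encoded products (tiny vectors for readability)
shoe_1 = {"image": [0.1, 0.9], "text": [0.5, 0.5], "phash": [1, 0, 1, 1]}
shoe_2 = {"image": [0.2, 0.8], "text": [0.5, 0.4], "phash": [1, 0, 1, 0]}
phone = {"image": [0.9, 0.1], "text": [0.0, 1.0], "phash": [0, 1, 0, 0]}

print("similar pair:   ", round(discriminator_sketch(shoe_1, shoe_2), 3))
print("dissimilar pair:", round(discriminator_sketch(shoe_1, phone), 3))
```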

*This model architecture and training strategy is based on "Learning a Similarity Metric Discriminatively, with Application to Face Verification" by Sumit Chopra, Raia Hadsell and Yann LeCun (2005) [1]. The choice of loss metric is based on DeepLearning.ai's Deep Learning Specialization course on Coursera [2], which credits [3]. See the bottom of this readme for additional citations.*

---

Data Pipeline

During inference we will process products (in batches) through the encoder, one at a time. During training, however, products will be processed three at a time using the "Siamese" training structure. These "product triples" consist of an anchor product, a matching product and a non-matching product. The main effort in our data preprocessing is creating an efficient pipeline for the product triples of the form [(image1, title1, phash1), (image2, title2, phash2), (image3, title3, phash3)].
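
As a minimal illustration of how such triples can be drawn from group labels, here is a plain-Python sketch. The function name, the toy labels and the fixed seed are assumptions made for the example. Like the NumPy-based sampling in the notebook, it samples with replacement, so an anchor can occasionally be paired with itself as its own match.

```python
import random


def sample_triples(label_groups, per_item=1, seed=0):
    """Return (anchor, match, non_match) post-id triples.

    label_groups maps post id -> group label. For each group we draw
    len(group) * per_item triples, sampling with replacement.
    """
    rng = random.Random(seed)
    by_group = {}
    for post_id, group in label_groups.items():
        by_group.setdefault(group, []).append(post_id)

    triples = []
    for group, members in by_group.items():
        outsiders = [p for p, g in label_groups.items() if g != group]
        if not outsiders:  # a single-group dataset has no non-matches
            continue
        for _ in range(len(members) * per_item):
            anchor = rng.choice(members)
            match = rng.choice(members)
            non_match = rng.choice(outsiders)
            triples.append((anchor, match, non_match))
    return triples


toy_labels = {"p1": "g1", "p2": "g1", "p3": "g2", "p4": "g2", "p5": "g3"}
for triple in sample_triples(toy_labels):
    print(triple)
```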

---

Training

Product triples are fed through the network in product pairs (anchor, match), (anchor, non-match) to yield a "matching distance" and a "non-matching distance." Our goal is to define a decision boundary using this distance metric. There are two approaches suggested in [2]. One is to think in terms of a true distance metric (as in [1], [2] and [3]). The other approach is to treat this as a binary classification problem (match = 0 / non-match = 1) using a sigmoid output we can interpret as the probability that two images do not match. This idea is presented as an alternative approach in [2].

I chose to use binary classification (to let the network learn its own decision boundary) while maintaining the triplet / siamese structure (to deal with the large class imbalance). In this context our triplet loss function is "loss = (matching_dist + (1 - non_matching_dist))/2."
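
The formula above can be checked with a few lines of plain Python. This is a sketch: averaging the per-triple losses over a batch is an assumption made here, and the function name is illustrative. Both inputs are sigmoid outputs in [0, 1], where 0 means "match" and 1 means "non-match".

```python
def triplet_classification_loss(matching_dist, non_matching_dist):
    """Mean over triples of (matching_dist + (1 - non_matching_dist)) / 2.

    The loss is 0 when every match scores 0 and every non-match scores 1,
    and 1 when every prediction is exactly wrong.
    """
    per_triple = [
        (m + (1.0 - n)) / 2.0
        for m, n in zip(matching_dist, non_matching_dist)
    ]
    return sum(per_triple) / len(per_triple)


print(triplet_classification_loss([0.0], [1.0]))  # perfect predictions
print(triplet_classification_loss([1.0], [0.0]))  # worst-case predictions
print(triplet_classification_loss([0.2, 0.4], [0.9, 0.7]))
```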

---

My Journey

This being my first Kaggle challenge, my first time working with a large unprepared dataset, and my first time facing serious GPU compute constraints, I learned a great deal working through the project.

My first major obstacle was developing an efficient data pipeline, as my initial one produced data much too slowly to handle 70,000+ inference samples within a time-restricted environment. This difficulty arose from working with matched Siamese triples of data, which ruled out the use of TensorFlow's high-level image processing pipeline tools. Instead, I learned to use 'tf.data.experimental.CsvDataset' and low-level data loading tools. This increased my data pipeline speed by a factor of 3 compared to my original implementation.

My second obstacle was in stripping down my encoder network to handle the processing and saving of 70,000+ images in a time-limited environment. My first models included elements of an object localization process and a pretrained TensorFlow Hub image feature extractor, both of which I sadly had to remove due to how much processing time they required. Instead I ended with an encoder model with under 40,000 parameters, which is tiny compared to what I had been using in my Coursera deep learning specialization.

The third obstacle was cutting down the number of operations required to classify 70,000+ products, when each category contained no more than 50 products. My initial, naive approach would require an impossible number of pairwise comparisons. This was very interesting and was my first encounter with using algorithm optimization (ideas I learned while completing Google's Foobar challenge) in a data science context.
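
The scale of the naive approach is easy to quantify. The sketch below assumes exactly 70,000 products and uses the 50-per-group cap stated above. The function names are illustrative, and this is only the arithmetic, not the optimization used in the notebook.

```python
import math


def naive_pair_count(num_products):
    """Number of unordered pairs when every product is compared with every other."""
    return math.comb(num_products, 2)


def max_matching_pairs(num_products, max_group_size):
    """Upper bound on true matches: each product has at most (group size - 1) partners."""
    return num_products * (max_group_size - 1) // 2


all_pairs = naive_pair_count(70_000)
match_pairs = max_matching_pairs(70_000, 50)
print(f"all pairs:          {all_pairs:,}")    # 2,449,965,000
print(f"max matching pairs: {match_pairs:,}")  # 1,715,000
print(f"share that can be matches: {match_pairs / all_pairs:.4%}")
```

At most about 0.07% of all pairs can be true matches, so almost every naive comparison is spent on a non-match.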

The fourth obstacle was unexpected NaN values that wiped out all image information in the model. Adding gradient clipping and batch normalization and carefully checking all divisions did not remove the problem, which I have still been unable to track down. My makeshift "running out of time" solution is to babysit the training process and stop training before NaNs appear.

---

Results

Having finally resolved the above challenges and left my model to train overnight, I awoke to find that the notebook had shut down, erasing my model checkpoints. I do not know why this occurred, as I stayed within Kaggle's documented resource limits. After frantically retraining the model up until the last possible minute I found that, although the rules explicitly allowed the use of pretrained models, they prohibited me from installing TensorFlow Text, which the TensorFlow Hub embedding [4] requires in order to run. Disappointed and without time to implement an alternative solution, I was not able to submit a solution in time for the competition deadline.

This of course exposes my own fault in not allotting myself an adequate amount of time to work through the unanticipated issues that arose. However, I learned a great deal throughout the process and have come out the better for it regardless of the ultimate outcome.

One of my main takeaways is the importance of producing a minimal working end-to-end process before spending too much time experimenting with architecture. I spent several days attempting to build an advanced model with novel ideas (nearly ALL of which I had to strip out after truly understanding my computational resource limitations). Had I first developed a minimal working model under the exact competition restrictions, I would have overcome the unexpected "nuts and bolts" obstacles with enough time remaining to produce a good model and a satisfactory submission to the challenge.

---

Citations:

    [1] @INPROCEEDINGS{1467314, author={Chopra, S. and Hadsell, R. and LeCun, Y.}, booktitle={2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)}, title={Learning a similarity metric discriminatively, with application to face verification}, year={2005}, volume={1}, number={}, pages={539-546 vol. 1}, doi={10.1109/CVPR.2005.202}}

    [2] @misc{author = {Andrew Ng}, title = {Special Applications: Face recognition & Neural Style Transfer}, howpublished = {Available at \url{https://www.coursera.org/learn/convolutional-neural-networks#syllabus} (2020/05/09)}}

    [3] @article{DBLP:journals/corr/SchroffKP15, author= {Florian Schroff and Dmitry Kalenichenko and James Philbin}, title = {FaceNet: {A} Unified Embedding for Face Recognition and Clustering}, journal = {CoRR}, volume = {abs/1503.03832}, year={2015}}

    [4] TensorFlow Hub pretrained multilingual word embedding model, available at https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3, which in turn credits the two works cited below.

    [5] Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego , Steve Yuan, Chris Tar, Yun-hsuan Sung, Ray Kurzweil. Multilingual Universal Sentence Encoder for Semantic Retrieval. July 2019

    [6] Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model. Repl4NLP@ACL, July 2019.




In [None]:
#### PACKAGE IMPORTS ####

# TF Model design
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
!pip install tensorflow_text
import tensorflow_text as text 
#import tensorflow_probability as tfp
#!pip install -q -U tensorflow-addons
#import tensorflow_addons as tfa

# Visualizations
import matplotlib.pyplot as plt
#!pip install -U tensorboard-plugin-profile
#%load_ext tensorboard

# data management
import numpy as np
import pandas as pd
import string

# file management
import datetime
import os

## Parameters

Class offering easy access to hyperparameters and file directory structure

In [None]:
class ModelParameters:
    def __init__(self, model_name, cloud_server='kaggle'):
               
        # universal parameters
        self._image_size = (196, 196)  # shape to process images in data pipeline
                
        # File Paths
        if cloud_server == 'colab':  # Google Colab with GDrive
            from google.colab import drive
            drive.mount('/content/gdrive')        
            base_dir = '/content/gdrive/MyDrive/Colab_Notebooks/models/ProductComparison/'
            self._ds_prep_dir = base_dir + 'data_prep/'
            self._prep_save_dir = self._ds_prep_dir
            self._dataset_dir = ''
            self._labels_dir = self._dataset_dir + 'labels/'
            self._saved_weights_dir = base_dir + 'saved_weights'
            os.chdir(base_dir)  

        elif cloud_server == 'kaggle': # Kaggle cloud notebook
            base_dir = ''  # working directory
            
            self._dataset_dir = '../input/shopee-product-matching/'
            self._ds_prep_dir = '../input/productmatching-shopee/'
            self._prep_save_dir = base_dir
            self._saved_weights_dir = '../input/productmatching-siamese-network/saved_weights'
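            self._labels_dir = self._dataset_dir

        else:
            # fail early instead of hitting undefined paths further down
            raise ValueError("cloud_server must be 'colab' or 'kaggle'")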
                  
        self._test_images_dir = self._dataset_dir + 'test_images/'
        self._train_images_dir = self._dataset_dir + 'train_images/'
        self._test_labels_csv = self._labels_dir + 'test.csv'
        self._train_labels_csv = self._labels_dir + 'train.csv'
        self._sample_submission_csv = self._labels_dir + 'sample_submission.csv'

        # set model subfolders (unique for each model name)
        self._model_dir = model_name
        
        # ## checkpoints
        self._checkpoint_dir = base_dir

        # ## predictions and tensorboard logs
        # Only the paths are recorded here, so that saved_predictions_dir() and
        # logdir() below do not raise AttributeError. Directory creation is left
        # disabled; call os.makedirs(path, exist_ok=True) before writing to them.
        self._saved_predictions_dir = base_dir + 'predictions/'
        self._logdir = model_name + '/logs'
            

    # functions to access params
    def image_size(self):
        return self._image_size
    def dataset_dir(self):
        return self._dataset_dir
    def train_images_dir(self):
        return self._train_images_dir
    def test_images_dir(self):
        return self._test_images_dir   
    def train_labels_csv(self):
        return self._train_labels_csv
    def test_labels_csv(self):
        return self._test_labels_csv  
    def saved_predictions_dir(self):
        return self._saved_predictions_dir 
    def checkpoint_dir(self):
        return self._checkpoint_dir
    def saved_weights_dir(self):
        return self._saved_weights_dir
    def logdir(self):
        return self._logdir
    def model_dir(self):
        return self._model_dir
    def ds_prep_dir(self):
        return self._ds_prep_dir
    def prep_save_dir(self):
        return self._prep_save_dir

In [None]:
PARAMETERS = ModelParameters(model_name='version_1', cloud_server='kaggle')

# Prepare Dataset for training

#### *During inference we will process products (in batches) through the encoder, one at a time. During training, however, products will be processed three at a time using the "Siamese" training structure. These "product triples" consist of an anchor product, a matching product and a non-matching product. The main challenge in our data preprocessing is creating an efficient pipeline for the product triples of the form [(image1, title1, phash1), (image2, title2, phash2), (image3, title3, phash3)].*

### Training / Validation Datasets

In [None]:
# load train csv as dataframe
FULL_TRAIN_DF = pd.read_csv(PARAMETERS.train_labels_csv())
FULL_TRAIN_DF['id'] = FULL_TRAIN_DF['posting_id']
FULL_TRAIN_DF = FULL_TRAIN_DF.set_index('id')
FULL_TRAIN_DF['title'] = FULL_TRAIN_DF['title'].apply(lambda x: x.lower())
FULL_TRAIN_DF = FULL_TRAIN_DF.rename(columns={'image':'image_path'})

# load test csv as dataframe
TEST_DF = test_df = pd.read_csv(PARAMETERS.test_labels_csv())
TEST_DF['id'] = TEST_DF['posting_id']
TEST_DF = TEST_DF.set_index('id')
TEST_DF['title'] = TEST_DF['title'].apply(lambda x: x.lower())
TEST_DF = TEST_DF.rename(columns={'image':'image_path'})

In [None]:
# train / valid split
cutoff = int(.2* len(FULL_TRAIN_DF))
VALID_DF = FULL_TRAIN_DF[: cutoff].sample(frac=1)
TRAIN_DF = FULL_TRAIN_DF[cutoff :].sample(frac=1)

## Preprocessing

Functions to find all sets  of (matching, nonmatching) post ids

In [None]:
# function to find all sets  of (matching, nonmatching) post ids
def matching_labels(loc_id, data_subset):
    if data_subset == 'test':
        df = TEST_DF
    elif data_subset == 'train':
        df = TRAIN_DF
    elif data_subset == 'valid':
        df = VALID_DF
           
    posting_id = df.loc[loc_id]['posting_id']
    label_group = df.loc[posting_id]['label_group']

    # find all matching and non-matching products
    match_df = df[df['label_group'] == label_group]
    matches_array = match_df.index.to_numpy()
    
    non_match_df = df[df['label_group'] != label_group]
    non_match_array = non_match_df.index.to_numpy()

    return matches_array, non_match_array


def get_matches(data_subset):
    
    if data_subset == 'test':
        df = TEST_DF.copy()
    elif data_subset == 'train':
        df = TRAIN_DF.copy()
    elif data_subset == 'valid':
        df = VALID_DF.copy()
        
    df = df.drop_duplicates(subset=['label_group'])
    
    matches_df = df['posting_id'].apply(lambda x: matching_labels(x ,data_subset)[0])
    non_matches_df = df['posting_id'].apply(lambda x: matching_labels(x, data_subset)[1])
    
    return matches_df, non_matches_df

Functions to generate triples of post ids

In [None]:
# functions to generate triples of post ids
def triples_from_row(row_num, selections_per_example, matches_df, non_matches_df):
    # isolate row
    match_row = matches_df.iloc[row_num]
    non_match_row = non_matches_df.iloc[row_num]

    # choose number of selections to make
    num_selections = match_row.shape[0]  * selections_per_example

    # get matches
    match_pairs = np.random.choice(match_row, size=2*num_selections, replace=True).reshape(-1,2)
    non_match_selections = np.random.choice(non_match_row, size=num_selections, replace=True).reshape(-1,1)
    return np.concatenate((match_pairs[:, :1], match_pairs[:, 1:], non_match_selections), axis=1)


# function to create dataframe consisting of the three post ids (anchor, match, non-match)
def get_triples_post_ids(matches_df, non_matches_df, selections_per_example=3):

    # initialize container to hold results
    triples = np.empty((0,3), dtype=str)

    for row_num in range(len(matches_df)):
        # get triples and update found results
        temp = triples_from_row(row_num, selections_per_example, matches_df, non_matches_df)
        triples = np.concatenate((triples, temp), axis=0)
        
    return pd.DataFrame(triples, columns = ['post_id_1', 'post_id_2', 'post_id_3'])

Functions to generate product info from post id (image, title, phash)

In [None]:
# functions to generate info on a single product (image, title, phash) or (image, title, phash, post id)
def product_singles(posting_id, data_subset, return_post_id, return_label_group, inference_df=None):
    
    # select df (note: product info is contained in original dataframe)
    if data_subset == 'test':
        this_df = TEST_DF
        image_directory = PARAMETERS.test_images_dir()
    elif data_subset == 'train':
        this_df = TRAIN_DF
        image_directory = PARAMETERS.train_images_dir()
    elif data_subset == 'valid':
        this_df = VALID_DF
        image_directory = PARAMETERS.train_images_dir()
    elif data_subset == 'inference':
        this_df = inference_df
        image_directory = PARAMETERS.test_images_dir()

    # select product
    row = this_df.loc[posting_id]

    # get info
    image_path = image_directory + row['image_path']
    title = row['title']
    phash = row['image_phash']
    output = image_path, title, phash

    if return_label_group:
        label_group = row['label_group']
        output = image_path, title, phash, posting_id, label_group
    
    elif return_post_id:
        output = image_path, title, phash, posting_id
    
    else:
        output = image_path, title, phash  


    return output



def get_single_product_info_df(id_column_name, df, data_subset, return_post_id, return_label_group, inference_df=None):   
    info = df[id_column_name].apply(lambda x: product_singles(x, data_subset, return_post_id, return_label_group, inference_df))

    info_df = pd.concat([pd.DataFrame({'image_path':info.apply(lambda x: x[0]).values}, index=info.index), 
                         pd.DataFrame({'title':info.apply(lambda x: x[1]).values}, index=info.index),
                         pd.DataFrame({'phash':info.apply(lambda x: x[2]).values}, index=info.index)],
                        axis=1)
    
    # prepare phashes
    info_df['phash'] = info_df['phash'].apply(lambda x: ' '.join(list(x)))
    
    if return_post_id:
        info_df = pd.concat([info_df, 
                             pd.DataFrame({'posting_id':info.apply(lambda x: x[3]).values}, index=info.index)],
                            axis=1)
    if return_label_group:
        info_df = pd.concat([info_df, 
                             pd.DataFrame({'label_group':info.apply(lambda x: x[4]).values}, index=info.index)],
                            axis=1)
    return info_df


def get_triples_product_info_df(triples_df, data_subset, return_post_id=False, return_label_group=False, inference_df=None):   

    product_1_info = get_single_product_info_df('post_id_1', triples_df, data_subset, return_post_id, return_label_group, inference_df)
    product_2_info = get_single_product_info_df('post_id_2', triples_df, data_subset, return_post_id, return_label_group, inference_df)
    product_3_info = get_single_product_info_df('post_id_3', triples_df, data_subset, return_post_id, return_label_group, inference_df)
    
    def rename_columns(number, df):
        num = str(number)
        columns={'image_path':'image_path_' + num, 'title':'title_' + num, 'phash':'phash_' + num, 
                 'posting_id': 'posting_id' + num, 'label_group': 'label_group' + num}
        return df.rename(columns=columns)

    product_1_info = rename_columns(1, product_1_info)
    product_2_info = rename_columns(2, product_2_info)
    product_3_info = rename_columns(3, product_3_info)
    
    triples_df = pd.concat([product_1_info, product_2_info, product_3_info], axis=1)
    
    return triples_df

Function to load images and prepare phash as int

In [None]:
# functions to load images and prepare phash
def load_images_as_tensor(image_path, perturb_image):
   
    target_size = PARAMETERS.image_size()

    # preprocessing
    image = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image, channels=1)
    image = keras.layers.experimental.preprocessing.Rescaling(1./255)(image)
    image = keras.layers.experimental.preprocessing.Resizing(height=target_size[0], width=target_size[1])(image)
    
    if perturb_image:  # add random rotation
        image = tf.keras.layers.experimental.preprocessing.RandomRotation(.3, fill_mode='nearest')(image)
        
    return image

def prepare_phash_as_tensor(phash):
    # placeholder
    return phash
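
`prepare_phash_as_tensor` is left as a placeholder above, and the dataframe step stores each hash as space-separated hex characters. If a numeric representation were wanted, one possible conversion is to expand each hex character into four bits. This is an assumption for illustration and not something the notebook does; `phash_to_bits` and the sample hash are hypothetical.

```python
def phash_to_bits(spaced_phash):
    """Expand a space-separated hex string such as 'f 0' into a flat list of bits."""
    bits = []
    for hex_char in spaced_phash.split():
        value = int(hex_char, 16)
        # most significant bit first, four bits per hex character
        bits.extend((value >> shift) & 1 for shift in (3, 2, 1, 0))
    return bits


example = ' '.join('a3f0')  # same spacing step as in get_single_product_info_df
print(example)              # a 3 f 0
print(phash_to_bits(example))
```

A 16-character hex hash would expand to 64 bits under this scheme.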

Final functions for dataset creation

In [None]:
# function to create Dataset for a single product
def prepare_singles_ds_from_csv(single_product_csv, return_label_group):

    if not return_label_group:
        ds = tf.data.experimental.CsvDataset(
                filenames=single_product_csv, 
                record_defaults=[tf.string, tf.string, tf.string, tf.string], 
                exclude_cols=[0],
                compression_type=None, buffer_size=None,
                header=True, field_delim=',', use_quote_delim=True,
                na_value='')

        ds = ds.cache()

        # load images and update phash
        def map_fn(x):
            return (load_images_as_tensor(x[0], False),  # 'image_path'
                    x[1],  # 'title'
                    prepare_phash_as_tensor(x[2]),  # 'phash',
                    x[3]  # product id
            )

        ds = ds.map(lambda x0, x1, x2, x3: map_fn([x0, x1, x2, x3]),
                    num_parallel_calls=tf.data.AUTOTUNE)
    
    else:
        ds = tf.data.experimental.CsvDataset(
                filenames=single_product_csv, 
                record_defaults=[tf.string, tf.string, tf.string, tf.string, tf.string], 
                exclude_cols=[0],
                compression_type=None, buffer_size=None,
                header=True, field_delim=',', use_quote_delim=True,
                na_value='')

        ds = ds.cache()

        # load images and update phash
        def map_fn(x):
            return (load_images_as_tensor(x[0], False),  # 'image_path'
                    x[1],  # 'title'
                    prepare_phash_as_tensor(x[2]),  # 'phash',
                    x[3],  # product id
                    x[4]  # label group id
            )

        ds = ds.map(lambda x0, x1, x2, x3, x4: map_fn([x0, x1, x2, x3, x4]),
                    num_parallel_calls=tf.data.AUTOTUNE)
    
    ds = ds.batch(1)
    #ds = ds.prefetch(buffer_size=150)#tf.data.AUTOTUNE)

    return ds


# function to create dataset for 3 products
def prepare_triples_ds_from_csv(triples_product_csv, perturb_image=True):

    ds = tf.data.experimental.CsvDataset(
        filenames=triples_product_csv, 
        record_defaults=3*[tf.string, tf.string, tf.string], 
        exclude_cols=[0],
        compression_type=None, buffer_size=None,
        header=True, field_delim=',', use_quote_delim=True,
        na_value='')

    ds = ds.cache()
    
    # load images and update phash
    def map_fn(x):
        return (load_images_as_tensor(x[0], False),  # 'image_path_1'
                x[1],  # 'title_1'
                prepare_phash_as_tensor(x[2]),  # 'phash_1'
                load_images_as_tensor(x[3], perturb_image),  # 'image_path_2'
                x[4],  # 'title_2'
                prepare_phash_as_tensor(x[5]),  # 'phash_2'
                load_images_as_tensor(x[6], perturb_image),  # 'image_path_3'
                x[7],  # 'title_3'
                prepare_phash_as_tensor(x[8])  # 'phash_3'
        )

    ds = ds.map(lambda x0, x1, x2, x3, x4, x5, x6, x7, x8: 
                map_fn([x0, x1, x2, x3, x4, x5, x6, x7, x8]),
                num_parallel_calls=tf.data.AUTOTUNE)   
    ds = ds.batch(1)
    return ds
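
The `header=True` and `exclude_cols=[0]` settings above fit files written by `DataFrame.to_csv`, which by default puts a header row first and the dataframe index in the first column. The sketch below mimics that column handling with Python's standard `csv` module, purely to show which fields reach `map_fn`. It is a stand-in, not TensorFlow's implementation, and the file contents are made up.

```python
import csv
import io

# Hypothetical two-row file in the layout DataFrame.to_csv produces:
# an unnamed index column first, then the product columns.
csv_text = (
    ",image_path,title,phash,posting_id\n"
    "0,train_images/a.jpg,red shoes,9 4 9 7,post_a\n"
    "1,train_images/b.jpg,\"case, blue\",a f 0 1,post_b\n"
)


def read_records(text, exclude_cols=(0,), header=True):
    """Plain-Python stand-in for skipping the header row and excluded columns."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        rows = rows[1:]  # drop the column-name row
    return [
        tuple(value for i, value in enumerate(row) if i not in exclude_cols)
        for row in rows
    ]


for record in read_records(csv_text):
    print(record)
```

Each record keeps four string fields in the order (image path, title, phash, post id), which is the order `map_fn` indexes into.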

Functions above are combined to allow easy dataset generation 

In [None]:
# compilation functions
# to apply above procedures
# NOTE: this cell is an earlier, in-memory version of the pipeline. The
# CSV-based functions defined further below are the ones used to load data.
def create_fresh_train_dataframe():
    train_matches_df, train_non_matches_df = get_matches(data_subset='train')
    valid_matches_df, valid_non_matches_df = get_matches(data_subset='valid')
    return train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df

def create_triples_dataframe(train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df):
    # match tables are passed in explicitly instead of being read from undefined globals
    train_triples_df = get_triples_post_ids(train_matches_df, train_non_matches_df)
    valid_triples_df = get_triples_post_ids(valid_matches_df, valid_non_matches_df)
    return train_triples_df, valid_triples_df

def extract_products_from_dataframes(train_triples_df, valid_triples_df):
    # look up (image_path, title, phash) for each column of post ids
    def product_info(column, triples_df, data_subset):
        return get_single_product_info_df(column, triples_df, data_subset,
                                          return_post_id=False, return_label_group=False)

    train_product_1 = product_info('post_id_1', train_triples_df, 'train')
    train_product_2 = product_info('post_id_2', train_triples_df, 'train')
    train_product_3 = product_info('post_id_3', train_triples_df, 'train')

    valid_product_1 = product_info('post_id_1', valid_triples_df, 'valid')
    valid_product_2 = product_info('post_id_2', valid_triples_df, 'valid')
    valid_product_3 = product_info('post_id_3', valid_triples_df, 'valid')

    return train_product_1, train_product_2, train_product_3, valid_product_1, valid_product_2, valid_product_3

def convert_product_triples_to_ds(train_product_1, train_product_2, train_product_3, valid_product_1, valid_product_2, valid_product_3, perturb_image=True):
    # NOTE: create_triples_ds is not defined anywhere in this notebook, so calling
    # this function raises NameError; use create_triples_ds_from_csv (below) instead.
    train_triples_ds = create_triples_ds(train_product_1, train_product_2, train_product_3, perturb_image)
    valid_triples_ds = create_triples_ds(valid_product_1, valid_product_2, valid_product_3, perturb_image)
    return train_triples_ds, valid_triples_ds

## Create test datasets
(To be used for converting single product to match predictions)

In [None]:
# Utility functions for saving generated dataframes and arrays
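def save_to_csv(filename, df):
    # DataFrame.to_csv creates (or overwrites) the file itself, so no separate
    # file-creation step is needed. The index is written as the first column,
    # which the CsvDataset loaders skip via exclude_cols=[0].
    df.to_csv(filename)
    return None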

In [None]:
# compilation functions to apply above procedures

# Triple Products
def create_matches_df():
    train_matches_df, train_non_matches_df = get_matches(data_subset='train')
    valid_matches_df, valid_non_matches_df = get_matches(data_subset='valid')
    return train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df

def create_triples_ID_df(train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df):
    train_triples_df = get_triples_post_ids(train_matches_df, train_non_matches_df, selections_per_example=3)
    valid_triples_df = get_triples_post_ids(valid_matches_df, valid_non_matches_df, selections_per_example=3)
    return train_triples_df, valid_triples_df

def create_triple_products_df(train_triples_df, valid_triples_df):
    train_triples_product_df = get_triples_product_info_df(train_triples_df, data_subset='train', return_post_id=False)
    valid_triples_product_df = get_triples_product_info_df(valid_triples_df, data_subset='valid', return_post_id=False)
    return train_triples_product_df, valid_triples_product_df

def create_triples_ds_from_csv(train_triples_product_csv, valid_triples_product_csv, batch_size, perturb_image):
    triples_ds_train = prepare_triples_ds_from_csv(train_triples_product_csv, perturb_image)
    triples_ds_valid = prepare_triples_ds_from_csv(valid_triples_product_csv, perturb_image)
    return triples_ds_train, triples_ds_valid


"""
file_dir = PARAMETERS.prep_save_dir()

train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df = \
    create_matches_df()

save_to_csv(file_dir + 'train_matches_df.csv', train_matches_df)
save_to_csv(file_dir + 'train_non_matches_df.csv', train_non_matches_df)
save_to_csv(file_dir + 'valid_matches_df.csv', valid_matches_df)
save_to_csv(file_dir + 'valid_non_matches_df.csv', valid_non_matches_df)

train_triples_df, valid_triples_df = \
    create_triples_ID_df(train_matches_df, train_non_matches_df, valid_matches_df, valid_non_matches_df)

save_to_csv(file_dir + 'train_triples_df.csv', train_triples_df)
save_to_csv(file_dir + 'valid_triples_df.csv', valid_triples_df)

train_triples_product_df, valid_triples_product_df = \
    create_triple_products_df(train_triples_df, valid_triples_df)

save_to_csv(file_dir + 'train_triples_product_df.csv', train_triples_product_df)
save_to_csv(file_dir + 'valid_triples_product_df.csv', valid_triples_product_df)
"""



In [None]:
# Single Products
def create_single_product_df(inference_df=None):
    if inference_df is None:  # during training
        test_single_product_df = get_single_product_info_df('posting_id', TEST_DF, 'test', return_post_id=True, return_label_group=False, inference_df=inference_df)
        train_single_product_df = get_single_product_info_df('posting_id', TRAIN_DF, 'train', return_post_id=True, return_label_group=True, inference_df=inference_df)
        valid_single_product_df = get_single_product_info_df('posting_id', VALID_DF, 'valid', return_post_id=True, return_label_group=True, inference_df=inference_df)
        
        return test_single_product_df, train_single_product_df, valid_single_product_df
        
    else:  # for inference
        inference_single_product_df = get_single_product_info_df('posting_id', inference_df, 'inference', return_post_id=True, return_label_group=False, inference_df=inference_df)
        return inference_single_product_df
    

def create_singles_ds_from_csv(test_single_product_csv=None, train_single_product_csv=None, valid_single_product_csv=None, inference_csv=None):
    
    if inference_csv is None:
        single_ds_test = prepare_singles_ds_from_csv(test_single_product_csv, return_label_group=False)
        single_ds_train = prepare_singles_ds_from_csv(train_single_product_csv, return_label_group=True)
        single_ds_valid = prepare_singles_ds_from_csv(valid_single_product_csv, return_label_group=True)
        return single_ds_test, single_ds_train, single_ds_valid
    
    else:
        single_ds_inference = prepare_singles_ds_from_csv(inference_csv, return_label_group=False)
        return single_ds_inference

"""
# Use this to recreate dataframes / csv files
file_dir = PARAMETERS.prep_save_dir()

test_single_product_df, train_single_product_df, valid_single_product_df = \
    create_single_product_df()

save_to_csv(file_dir + 'test_single_product_df.csv', test_single_product_df)
save_to_csv(file_dir + 'train_single_product_df.csv', train_single_product_df)
save_to_csv(file_dir + 'valid_single_product_df.csv', valid_single_product_df)
"""

## Load Datasets

In [None]:
file_dir = PARAMETERS.ds_prep_dir()
batch_size = 1

# load singles datasets from csv
single_ds_test, single_ds_train, single_ds_valid = \
    create_singles_ds_from_csv(file_dir + 'test_single_product_df.csv',
                               file_dir + 'train_single_product_df.csv',
                               file_dir + 'valid_single_product_df.csv')

# generate triples datasets from csv
triples_ds_train, triples_ds_valid = \
    create_triples_ds_from_csv(file_dir + 'train_triples_product_df.csv', 
                               file_dir + 'valid_triples_product_df.csv',
                               batch_size=batch_size,
                               perturb_image=False)

In [None]:
# test the training data pipeline
for temp in triples_ds_train.take(1):
    print('triples_ds_train:'.upper())
    print('batch size:', temp[0].shape[0])  # temp[0] is the anchor image batch
    print('image:', temp[0].shape)
    print('title:', temp[1].shape)
    print('phash:', temp[2].shape)

# test the singles data pipeline (used for encoding and inference)
for temp in single_ds_train.take(1):
    print('\nsingle_ds_train:'.upper())
    print('batch size:', temp[0].shape[0])
    print('image:', temp[0].shape)
    print('title:', temp[1].shape)
    print('phash:', temp[2].shape)
    print('phash value:', temp[2])
    print('post id:', temp[3].shape)
    print('label group id:', temp[4].shape)  # labels not available in test set

# Siamese Model

## Encoders

### Title Encoder

In [None]:
# Title Encoding (the multilingual Universal Sentence Encoder below is disabled; a small trainable embedding is used instead)
class TitleEmbedding(keras.Model):

    def __init__(self, units, name='TitleEmbedding', **kwargs):
        super().__init__(name=name, **kwargs)
        
        self.units = units

        # layers
        #self.embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
        self.embed = tf.keras.layers.Embedding(input_dim=10, output_dim=50)
        self.dense1 = keras.layers.Dense(self.units, activation='relu', 
                                        kernel_initializer= tf.keras.initializers.HeNormal(),
                                        kernel_regularizer=keras.regularizers.l1_l2(l1=1e-2, l2=1e-1))
        # give each layer a unique name so that name-based weight saving cannot collide
        self.batch_norm1 = keras.layers.BatchNormalization(name='batch_norm1')
        self.batch_norm2 = keras.layers.BatchNormalization(name='batch_norm2')
        self.dense2 = keras.layers.Dense(self.units, activation=None,
                                         kernel_regularizer=keras.regularizers.L2(l2=10)
                                        )

    def call(self, inputs):      
        x = inputs
        x = self.embed(x)
        x = self.dense1(x)
        x = self.batch_norm1(x)
        x = self.dense2(x)
        x = self.batch_norm2(x)
        
        return x
    
    def get_config(self):
        config = {"units": self.units}
        return config

In [None]:
a = TitleEmbedding(10, name='TitleEmbedding')
for data in triples_ds_train.take(1):
    title = data[1]
a(title)
a.summary()  # summary() prints directly and returns None

### Phash Encoder

In [None]:
# PHash Encoding (processing handled in data pipeline)
def pHashEncoder(name='pHashEmbedding'):    
    phash = keras.layers.Input((), dtype=tf.string, name='phash')
    
    vocabulary = list(string.printable)
    vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(output_mode='int',
                    output_sequence_length=16, pad_to_max_tokens=True, vocabulary=vocabulary)
    
    # model path
    inputs = [phash]
    phash = vectorizer(phash)
    phash = tf.one_hot(phash, depth=len(vocabulary))
    outputs = [phash]
    
    return keras.Model(inputs, outputs, name=name)

In [None]:
pHashEncoder(name='pHashEmbedding').summary()
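
To make the shape of the pHash encoding concrete, the sketch below reproduces the character-level one-hot step in plain Python. It assumes a 16-character hash and uses `string.printable` as the vocabulary, as `pHashEncoder` does. The helper name `one_hot_phash` and the sample hash are illustrative only, and the index mapping is simpler than the one `TextVectorization` builds (which also reserves indices for padding and unknown tokens).

```python
import string

VOCABULARY = list(string.printable)  # 100 printable ASCII characters


def one_hot_phash(phash, length=16, vocabulary=VOCABULARY):
    """Encode a hash string as a (length x vocab_size) one-hot matrix (hypothetical helper)."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    rows = []
    for ch in phash[:length].ljust(length):  # pad short hashes with spaces
        row = [0.0] * len(vocabulary)
        row[index[ch]] = 1.0
        rows.append(row)
    return rows


encoded = one_hot_phash("94974f937d4c2433")  # made-up example hash
print(len(encoded), len(encoded[0]))  # 16 100
```

Each product therefore contributes a fixed 16 x 100 block, which is the `encoded_phash_shape=[16, 100]` passed to the metric network.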

### Image Encoders

In [None]:
# Component A: CNN image encoder
def ImageEncoder(units, input_shape, name='ImageEncoder'):

    # Inputs
    x = keras.layers.Input(input_shape, name='image')  # image
    inputs = [x]

    # standardize for CNN. Architecture comes from [1]
    x = keras.layers.experimental.preprocessing.Resizing(height=196, width=196)(x)  
    
    x = keras.layers.Conv2D(3, kernel_size=7, strides=1, activation='relu')(x)  # Conv2D 
    x = keras.layers.MaxPool2D(pool_size=3)(x)
    x = keras.layers.BatchNormalization()(x)
    
    x = keras.layers.Conv2D(15, kernel_size=7, strides=1, activation='relu')(x)  # Conv2D 
    x = keras.layers.MaxPool2D(pool_size=2)(x)
    x = keras.layers.BatchNormalization()(x)

    x = keras.layers.Conv2D(45, kernel_size=6, strides=1, activation='relu')(x)  # Conv2D 
    x = keras.layers.MaxPool2D(pool_size=4)(x)
    x = keras.layers.BatchNormalization()(x)

    x = keras.layers.Conv2D(250, kernel_size=5, strides=1, activation='relu')(x)  # Conv2D 
    x = keras.layers.BatchNormalization()(x)

    # standardize output shape
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(units, activation=None, 
                           kernel_regularizer=keras.regularizers.L2(l2=1)
                          )(x)
    x = keras.layers.BatchNormalization()(x)
    
    outputs= [x]

    return keras.Model(inputs, outputs, name=name)

In [None]:
ImageEncoder(40, input_shape=(196,196,1), name='ImageEncoder').summary()
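
As a sanity check on the encoder size, the spatial dimensions can be traced by hand. The sketch below assumes the Keras defaults used above: unpadded ('valid') convolutions with stride 1, and max-pooling whose stride equals its pool size. The helper names are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool):
    """Output width of a 'valid' max-pool whose stride equals its pool size."""
    return (size - pool) // pool + 1


size = 196
trace = [size]
for layer, arg in [("conv", 7), ("pool", 3), ("conv", 7), ("pool", 2),
                   ("conv", 6), ("pool", 4), ("conv", 5)]:
    size = conv_out(size, arg) if layer == "conv" else pool_out(size, arg)
    trace.append(size)

print(trace)              # [196, 190, 63, 57, 28, 23, 5, 1]
print(size * size * 250)  # flattened feature count: 250
```

The last convolution reduces each image to a 1 x 1 x 250 tensor, so the final `Dense(units)` layer only sees 250 features. This keeps the encoder small, in line with the CPU inference budget.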

## Discriminator Network

In [None]:
def ProductEncoder(image_units, title_units, image_input_shape, name='ProductEncoder'):

    # Inputs
    image = keras.layers.Input(image_input_shape, dtype=tf.float32, name='image')
    title = keras.layers.Input((), dtype=tf.string, name='title', )
    phash = keras.layers.Input((), dtype=tf.string, name='phash')

    inputs = [image, title, phash]

    # Encodings
    image = ImageEncoder(image_units, image_input_shape, name='ImageEncoder')(image)  # (batch, image_units)   
    title = TitleEmbedding(title_units, name='TitleEmbedding')(title)  # (batch, title_units)
    phash = pHashEncoder(name='pHashEncoder')(phash)  # (batch, 16, vocab_size)
    
    # Normalize
    image = image / 100.0
    title = title / (100.0 + tf.norm(title, ord='euclidean', axis=-1, keepdims=True))  # keepdims so the norm broadcasts per example
    phash = phash / 16
    
    outputs = [image, title, phash]
    return keras.Model(inputs, outputs, name=name)

In [None]:
ProductEncoder(50, 20, (196,196,1)).summary()
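
The title normalization divides each vector by `100 + ||x||`, which bounds its norm below 1 while leaving small vectors almost linearly scaled. A minimal plain-Python sketch of that behaviour follows; the function name `soft_normalize` and the sample vectors are illustrative.

```python
import math


def soft_normalize(vector, offset=100.0):
    """Scale a vector by 1 / (offset + ||vector||), as done for the title encoding."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (offset + norm) for v in vector]


norms = []
for scale in (1.0, 100.0, 10000.0):
    normalized = soft_normalize([3.0 * scale, 4.0 * scale])
    norms.append(math.hypot(*normalized))
    print(scale, norms[-1])
```

The resulting norm equals `||x|| / (100 + ||x||)`, so it grows with the input but never reaches 1.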

## Metric

In [None]:
# can be used as (bounded) distance measurement for TripletLoss
# or a probability estimate that the products do not match (for F1 score)
def PseudoMetric(units, encoded_image_shape, encoded_title_shape, encoded_phash_shape, name='PseudoMetric'):
             
    # Inputs
    image_encoding_1 = keras.layers.Input(encoded_image_shape, dtype=tf.float32, name='image_encoding_1')  # (batch, image units)
    title_encoding_1 = keras.layers.Input(encoded_title_shape, dtype=tf.float32, name='title_encoding_1')  # (batch, title units)
    phash_1 = keras.layers.Input(encoded_phash_shape, dtype=tf.float32, name='phash_1')  # one-hot encoding (batch, 16, vocab_size) 
    
    image_encoding_2 = keras.layers.Input(encoded_image_shape, dtype=tf.float32, name='image_encoding_2')  # (batch, image units)
    title_encoding_2 = keras.layers.Input(encoded_title_shape, dtype=tf.float32, name='title_encoding_2')  # (batch, title units)
    phash_2 = keras.layers.Input(encoded_phash_shape, dtype=tf.float32, name='phash_2')  # one-hot encoding (batch, 16, vocab_size) 

    inputs= [image_encoding_1, title_encoding_1, phash_1, image_encoding_2, title_encoding_2, phash_2]
   
    # image distance
    image_dist = image_encoding_1 - image_encoding_2
    image_dist = keras.layers.Dense(1, activation='sigmoid')(image_dist)

    # title distance 
    title_dist = title_encoding_1 - title_encoding_2
    title_dist = keras.layers.Dense(1, activation='sigmoid')(title_dist)   
    
    # phash distance  
    phash_dist = keras.losses.categorical_crossentropy(phash_1, phash_2)
    phash_dist = tf.reduce_mean(phash_dist, axis=[-1], keepdims=True)
    
    # combine metrics
    metric = keras.layers.Concatenate(axis=-1)([image_dist, title_dist, phash_dist])
    metric = keras.layers.Dense(1, activation='sigmoid')(metric)  # learned weighted combination, squashed to (0, 1)
    
    # final distance value
    outputs = [metric]

    return keras.Model(inputs, outputs, name=name)

In [None]:
PseudoMetric(units=20, encoded_image_shape=[40], encoded_title_shape=[20], encoded_phash_shape=[16, 100]).summary()
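
The final layer of `PseudoMetric` is a single logistic unit over the three partial distances. The sketch below shows that combination in plain Python. The weights and bias are made-up stand-ins for the learned `Dense(1)` parameters, chosen only to illustrate how small partial distances map to an output below 0.5 and large ones to an output above it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combine(image_dist, title_dist, phash_dist,
            weights=(2.0, 2.0, 1.0), bias=-2.5):
    """Logistic combination of three partial distances (hypothetical parameters)."""
    z = (weights[0] * image_dist
         + weights[1] * title_dist
         + weights[2] * phash_dist
         + bias)
    return sigmoid(z)


close = combine(0.1, 0.2, 0.0)  # similar image, title and hash
far = combine(0.9, 0.8, 1.5)    # dissimilar on all three
print(close, far)
```

Because the output is a sigmoid, it always lies strictly between 0 and 1, which is what allows it to be read either as a bounded distance or as a non-match probability.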

## Combined Model

# Full Model

#### *In order to alleviate our dataset's enormous class imbalance, product triples are fed through the network as product pairs (anchor, match) and (anchor, non-match), each yielding a (sigmoid output) binary classification probability predicting whether or not the products match. Our triplet setup allows us to measure the prevalence of false positives and false negatives. Our loss function is the average of these errors over each batch of product triples.*

#### *The model is written as a TensorFlow subclassed model with custom training and inference steps. Precision and recall metrics are included to track the errors that can reduce our F1 score.*

Define the model with its metrics, training step and inference step using the TensorFlow model subclassing API. This allows training and checkpointing to be run directly through model.fit().
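
The loss is easy to check by hand. The sketch below computes it in plain Python for a toy batch of three triples; the distance values are made up, and `triplet_batch_loss` is an illustrative name rather than a function from the notebook.

```python
def triplet_batch_loss(matching_dist, non_matching_dist):
    """Mean of (d_match + (1 - d_non_match)) / 2 over a batch of triples."""
    per_triple = [(m + (1.0 - n)) / 2.0
                  for m, n in zip(matching_dist, non_matching_dist)]
    return sum(per_triple) / len(per_triple)


matching = [0.1, 0.7, 0.4]      # distances for (anchor, match) pairs
non_matching = [0.9, 0.6, 0.2]  # distances for (anchor, non-match) pairs
loss = triplet_batch_loss(matching, non_matching)
print(loss)
```

The first term penalizes soft false negatives (matching pairs with a large distance) and the second penalizes soft false positives (non-matching pairs with a small distance). A perfect model reaches 0 and a fully inverted one reaches 1.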

In [None]:
class MatchPredictor(keras.Model):

    def __init__(self, image_units, title_units, metric_units, name='MatchPredictor', **kwargs):
        super().__init__(name=name, **kwargs)
        
        self.image_units = image_units
        self.title_units = title_units
        self.metric_units = metric_units
        
        self.loss_tracker = keras.metrics.Mean(name="loss")  
        self.triplet_ratio_tracker = keras.metrics.Mean(name="triplet_ratio")
        self.recall_tracker = keras.metrics.Mean(name="recall")
        self.precision_tracker = keras.metrics.Mean(name="precision")
        self.f1_tracker = keras.metrics.Mean(name="f1_score")
    
    def get_config(self):
        config = {'image_units': self.image_units, 'title_units': self.title_units, 'metric_units': self.metric_units}
        return config   
 
    def build(self, input_shape):

        # Encoder params
        image_shape = input_shape[0][0][1:]
        self.product_encoder = ProductEncoder(self.image_units, self.title_units, image_shape)

        # Metric params
        encoded_image_shape = self.product_encoder.output_shape[0][-1:]  # drops batch dim
        encoded_title_shape = [self.title_units]  #self.product_encoder.output_shape[1][-1:] 
        encoded_phash_shape = self.product_encoder.output_shape[2][-2:]  
                
        self.product_metric = PseudoMetric(self.metric_units, encoded_image_shape, encoded_title_shape, encoded_phash_shape)

    def call(self, inputs, **kwargs):
        # Inputs   
        product_1 = inputs[0]
        product_2 = inputs[1]
        
        # Encode Products
        image_encoding_1, title_encoding_1, phash_1 = self.product_encoder(product_1)
        image_encoding_2, title_encoding_2, phash_2 = self.product_encoder(product_2)

        # Compute product distances
        distance = self.product_metric([image_encoding_1, title_encoding_1, phash_1, 
                                        image_encoding_2, title_encoding_2, phash_2])    
        return distance
    
    def encoder(self, inputs, **kwargs):
        
        # Encode Products
        image_encoding, title_encoding, phash = self.product_encoder(inputs)
  
        return image_encoding, title_encoding, phash
    
    def train_step(self, data):  
        """ Strategy: One-shot learning via Siamese Network """        
        product_1 = data[0:3]
        product_2 = data[3:6]
        product_3 = data[6:9]
               
        # forward pass
        with tf.GradientTape() as tape:
            tape.watch(self.trainable_variables)
            
            matching_dist = self([product_1, product_2], training=True)                
            non_matching_dist = self([product_1, product_3], training=True)       
            
            loss = (matching_dist + (1 - non_matching_dist)) / 2.
            loss = tf.math.reduce_mean(loss)

        # compute grads and apply updates
        grads = tape.gradient(loss, self.trainable_variables)
        grads = [tf.clip_by_norm(g, 10.) for g in grads]  # clipping
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        
        # Update metrics
        self.loss_tracker.update_state(loss)   
        self.compiled_metrics.update_state(matching_dist, non_matching_dist)
        
        # additional metrics        
        self.triplet_ratio_tracker.update_state(
            self.triplet_ratio(matching_dist, non_matching_dist)
        )
        self.precision_tracker.update_state(
            self.precision(matching_dist, non_matching_dist, soft=False)
        )
        self.recall_tracker.update_state(
            self.recall(matching_dist, non_matching_dist, soft=False)
        )
        self.f1_tracker.update_state(
            self.f1_score(matching_dist, non_matching_dist, soft=False)
        )

        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
        
    def predict_step(self, data):
        product_1 = data[0][0]       
        product_2 = data[0][1]
        
        # Encode Products
        image_encoding_1, title_encoding_1, phash_1 = self.product_encoder(product_1)
        image_encoding_2, title_encoding_2, phash_2 = self.product_encoder(product_2)

        # Compute product distances
        distance = self.product_metric([image_encoding_1, title_encoding_1, phash_1, 
                                        image_encoding_2, title_encoding_2, phash_2])
        
        return tf.math.less_equal(distance, .5)  # predict match=True when distance is small
    
    def ProductEncoder(self, product):
        return self.product_encoder(product)
    
    def PseudoMetric(self, encoded_product_pair):
        product_1, product_2 = encoded_product_pair
        # the metric network expects a flat list of six tensors
        return self.product_metric([*product_1, *product_2]) 
    
    # CONFUSION-MATRIX TERMS AND DERIVED METRICS
    def false_neg(self, matching_dist, non_matching_dist, soft):
        # count of matching products predicted as non-matching
        if soft:
            return matching_dist
        else:
            return tf.cast(tf.greater(matching_dist, .5), tf.float32)
    
    def false_pos(self, matching_dist, non_matching_dist, soft):
        # nonmatching products named matching
        if soft:
            return 1 - non_matching_dist
        else:
            return tf.cast(tf.less_equal(non_matching_dist, .5), tf.float32)
    
    def true_neg(self, matching_dist, non_matching_dist, soft):
        # note: non_matching products correctly classified
        if soft:
            return non_matching_dist
        else:
            return tf.cast(tf.greater(non_matching_dist, .5), tf.float32)
    
    def true_pos(self, matching_dist, non_matching_dist, soft):
        # note: matching products correctly classified
        if soft:
            return 1 - matching_dist
        else:
            return tf.cast(tf.less_equal(matching_dist, .5), tf.float32)
    
    def precision(self, matching_dist, non_matching_dist, soft):
        # fraction of predicted matches that are correct
        false_pos = tf.reduce_sum(self.false_pos(matching_dist, non_matching_dist, soft))
        true_pos = tf.reduce_sum(self.true_pos(matching_dist, non_matching_dist, soft))
        return true_pos / (true_pos + false_pos + 1e-7)
        
    def recall(self, matching_dist, non_matching_dist, soft):
        # percent of "positive" ground truth examples identified
        false_neg = tf.reduce_sum(self.false_neg(matching_dist, non_matching_dist, soft))
        true_pos = tf.reduce_sum(self.true_pos(matching_dist, non_matching_dist, soft))
        return true_pos / (true_pos + false_neg + 1e-7)
        
    def f1_score(self, matching_dist, non_matching_dist, soft):
        precision = self.precision(matching_dist, non_matching_dist, soft)
        recall = self.recall(matching_dist, non_matching_dist, soft)
        return 2. * precision * recall / (precision + recall + 1e-7)

    def triplet_ratio(self, matching_dist, non_matching_dist):
        loss = (matching_dist + 1e-7) / (non_matching_dist + 1e-7)    
        return tf.reduce_mean(loss, axis=-1)
    
    @property
    def metrics(self):
        return [self.loss_tracker, self.triplet_ratio_tracker, self.recall_tracker, self.precision_tracker, self.f1_tracker]
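
For reference, the hard (thresholded) versions of these metrics can be reproduced without TensorFlow. The sketch below uses the same 0.5 cut-off and the same `1e-7` smoothing term as the methods above. The distance values are made up and the helper names are illustrative.

```python
def hard_counts(matching_dist, non_matching_dist, threshold=0.5):
    """Count confusion-matrix terms; a distance <= threshold is a predicted match."""
    tp = sum(d <= threshold for d in matching_dist)
    fn = sum(d > threshold for d in matching_dist)
    fp = sum(d <= threshold for d in non_matching_dist)
    tn = sum(d > threshold for d in non_matching_dist)
    return tp, fn, fp, tn


def f1_from_counts(tp, fn, fp, eps=1e-7):
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1


matching = [0.1, 0.7, 0.4, 0.2]      # (anchor, match) distances
non_matching = [0.9, 0.3, 0.8, 0.6]  # (anchor, non-match) distances
tp, fn, fp, tn = hard_counts(matching, non_matching)
precision, recall, f1 = f1_from_counts(tp, fn, fp)
print(tp, fn, fp, tn)  # 3 1 1 3
print(precision, recall, f1)
```

Since every triple contributes exactly one matching and one non-matching pair, these batch metrics are computed on balanced data and will generally differ from the competition score measured on the real class distribution.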

Initialize and build model

In [None]:
# params
for triplet in triples_ds_train.take(1): 
    image_1 = triplet[0]
    title_1 = triplet[1]
    phash_1 = triplet[2]
    
    image_shape = image_1.shape[1:]
    batch_size = image_1.shape[0]
    phash_units = phash_1.shape[-1]
    
    product_1 = triplet[0:3]
    product_2 = triplet[3:6]
    product_3 = triplet[6:9]

image_units = 50
title_units = phash_units
metric_units = 1
    

# Initialize and build Full Model
MatchPredictorModel = MatchPredictor(image_units, title_units, metric_units)
MatchPredictorModel([product_1, product_2])

# Load weights
try:
    #** Choose ONE of the below options **
    # WARNING: loads weights from the saved notebook output directory, NOT the working directory!!
    MatchPredictorModel.load_weights(PARAMETERS.saved_weights_dir())  
    
    # WARNING: loads weights from the working directory, NOT previously saved notebook outputs
    #MatchPredictorModel.load_weights('saved_weights')  
    print('Loaded saved weights.')
except Exception as e:  # avoid a bare except, and report why loading failed
    print('No weights loaded:', e)
    
MatchPredictorModel.summary()


## Training

Training Loop

In [None]:
"""
# optional Tensorboard callback
# This code works on Google Colab, but not Kaggle.
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, #histogram_freq=1,
                                                       profile_batch=1)

#%reload_ext tensorboard
#%tensorboard --logdir=logs
"""

Run model.fit()

In [None]:
"""
# callbacks
checkpoint = tf.keras.callbacks.ModelCheckpoint('./checkpoints', monitor='loss', save_weights_only=True, 
    save_freq='epoch')

# compile model
MatchPredictorModel.compile(optimizer=tf.keras.optimizers.Adagrad(.001))

   
# triples_ds_train is already batched, so batch_size is not passed to fit()
MatchPredictorModel.fit(triples_ds_train, epochs=1, steps_per_epoch=100,
                        callbacks=[checkpoint], verbose=2)

# grab one batch of triples to inspect the encoder outputs
for triplet in triples_ds_train.take(1): 

    product_1 = triplet[0:3]
    product_2 = triplet[3:6]
    product_3 = triplet[6:9]

# check for NaN
print(MatchPredictorModel.product_encoder(product_1)[0])
print(MatchPredictorModel.product_encoder(product_1)[1])
"""

In [None]:
#MatchPredictorModel.save_weights('saved_weights')

# Inference

#### *Inference is a two-step process. Each image is run once through the encoder network, with the encodings stored in a dataset. We then run pairwise comparisons in order to categorize images into product groups and record the resulting groups in a CSV file for submission. With 70,000+ images, a reduction in pairwise comparisons is required to keep the processing time manageable. This reduction is accomplished by noting that (if predicted accurately) a positively matched (A, [A, B, C]) comparison automatically identifies the matchings (B, [A, B, C]) and (C, [A, B, C]), and implies that none of these elements can match any other image group.*
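
The grouping shortcut can be illustrated without the model. In the sketch below, `is_match` is a hypothetical stand-in for the trained pairwise predictor (here it simply compares toy labels), and `group_products` is an illustrative name. The point is only to show how many anchors actually need to be compared against the full set.

```python
def group_products(ids, is_match):
    """Greedy grouping: products already placed in a group are never used as anchors."""
    matches, completed, anchors_used = {}, set(), 0
    for anchor in ids:
        if anchor in completed:
            continue  # already resolved through an earlier anchor
        anchors_used += 1
        group = {anchor} | {other for other in ids if is_match(anchor, other)}
        for member in group:
            matches[member] = group  # every member shares the same answer
        completed |= group
    return matches, anchors_used


labels = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 3}  # toy ground truth
matches, anchors_used = group_products(
    list(labels), lambda a, b: labels[a] == labels[b])
print(anchors_used)          # 3 anchors instead of 6
print(sorted(matches["B"]))  # ['A', 'B', 'C']
```

The number of full passes over the data drops from the number of images to the number of predicted groups. The trade-off is that a single wrong prediction for an anchor is copied to every member of its group.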

Function to collect product encodings

In [None]:
"""
for triplet in triples_ds_train.skip(20).take(5): 
    image_1 = triplet[0]
    title_1 = triplet[1]
    phash_1 = triplet[2]
    
    image_shape = image_1.shape[1:]
    batch_size = image_1.shape[0]
    phash_units = phash_1.shape[-1]
    
    product_1 = triplet[0:3]
    product_2 = triplet[3:6]
    product_3 = triplet[6:9]

    # forward pass
    matching_dist = MatchPredictorModel([product_1, product_2])                
    non_matching_dist = MatchPredictorModel([product_1, product_3])   
    
    image_encoding_1, title_encoding_1, phash_1 = MatchPredictorModel.product_encoder(product_1)
    image_encoding_2, title_encoding_2, phash_2 = MatchPredictorModel.product_encoder(product_2)
        
for single in single_ds_valid.take(1): 
    image_1 = single[0]
    title_1 = single[1]
    phash_1 = single[2]
    
    image_shape = image_1.shape[1:]
    batch_size = image_1.shape[0]
    phash_units = phash_1.shape[-1]
    
    product_1 = single[0:3]
    
    #print(product_1[0].shape, product_1[1].shape, product_1[2].shape)
    print(MatchPredictorModel.ProductEncoder(single[:3]))
"""

In [None]:
def get_batch_encodings_ds(singles_ds, batch_size):

    encoded_ds = None
    for prod_with_id in singles_ds:
        
        # separate out id's
        products = prod_with_id[:3]
        product_ids = prod_with_id[3]

        # encoder   
        image_encoding, title_encoding, phash_encoding = MatchPredictorModel.ProductEncoder(products)

        # sub-dataset
        image_ds = tf.data.Dataset.from_tensor_slices(image_encoding)
        title_ds = tf.data.Dataset.from_tensor_slices(title_encoding)
        phash_ds = tf.data.Dataset.from_tensor_slices(phash_encoding)
        product_id_ds = tf.data.Dataset.from_tensor_slices(product_ids)
        
        # combine sub-datasets
        batch_ds = tf.data.Dataset.zip((image_ds, title_ds, phash_ds, product_id_ds))
        
        # update full dataset
        if encoded_ds is None:
            encoded_ds = batch_ds
        else:
            encoded_ds = encoded_ds.concatenate(batch_ds)

    print('complete')

    return encoded_ds

Test the function on our validation set

In [None]:
"""
encoding_ds_valid = get_batch_encodings_ds(singles_ds=single_ds_valid.take(1000), 
                                           batch_size=256)  # use same batch size as in training
"""

Function to make predictions and save to CSV

In [None]:
def collect_predictions(encoded_ds, batch_size, for_submission, save_frequency=1000):
    
    # initialize containers and dataset
    matches_dict = {}
    comparison_ds = encoded_ds.batch(batch_size).prefetch(5)  # keep the final partial batch so no product is skipped
    completed_predictions = set([])
    
    # utility function for saving intermediate & final results   
    def save_results(matches_dict, number=None):
        df = pd.DataFrame.from_dict(matches_dict, orient='index')
        df = df.fillna('**')
        df = df.apply(lambda x: ', '.join(x), axis=1)
        df = df.apply(lambda x: x.replace(', **', ''))
        df.index.name = 'posting_id'
        df.name = 'matches'

        # filename suffix for intermediate saves (empty for the final save)
        number = '_' + str(number) if number else ''
        
        if for_submission:
            filename = 'submission.csv'
        else:
            filename = 'train_submission' + number + '.csv'
        df.to_csv(filename, index=True)
        return df
    
    i = 1
    for prod1 in encoded_ds.batch(1).prefetch(5):
        
        # separate encoding and id
        encoded_prod_1 = prod1[:3]
        post_id_1 = prod1[3].numpy()[0].decode()  # extract and convert to string      
        
        if post_id_1 in completed_predictions:
            continue

        # make sure to self-match
        matches_dict[post_id_1] = set([post_id_1])
       
        # compare with comparison product encodings
        for comparison_batch in comparison_ds:
            
            # get encodings and ids
            comparison_encodings = comparison_batch[:3]
            comparison_product_id = comparison_batch[3]
 
            # tile the anchor encoding to the size of this batch (the last batch may be smaller)
            current_batch_size = comparison_batch[0].shape[0]
            encoding_1 = tuple(tf.repeat(enc, current_batch_size, axis=0) for enc in encoded_prod_1)
 
            # Compute product distances
            distance = MatchPredictorModel.PseudoMetric([encoding_1, comparison_encodings])
 
            # make predictions: the metric is a sigmoid output in (0, 1), so a cut-off above 1
            # would match every pair; use the same 0.5 cut-off as predict_step
            predictions = tf.math.less_equal(distance, .5)  # predict match=True when distance is small
 
            # decode matches
            predictions = predictions.numpy().reshape(-1)  # flat boolean mask, also valid for a batch of one
            matching_prod_ids = comparison_product_id.numpy()[predictions]
            matching_prod_ids = matching_prod_ids.tolist()
            matching_prod_ids = [prod_id.decode() for prod_id in matching_prod_ids]
            
            # record matches
            matches_dict[post_id_1].update(matching_prod_ids)
        
        # assume prediction correct, so that all products in group have identical solutions
        for post in matches_dict[post_id_1]:
            matches_dict[post] = matches_dict[post_id_1]
            
        # mark as completed
        completed_predictions.update(matches_dict[post_id_1])
        
        # report and save progress
        if i % 100 == 0:
            print('completed_predictions:', len(completed_predictions))          
        if i % save_frequency == 0:
            _ = save_results(matches_dict, number=str(i))  # pass the counter value, not the literal 'i'

        # advance counter
        i += 1

    # save final results
    df = save_results(matches_dict, number='')
    print('finished')     

    return df

Test the function on our validation set

In [None]:
#matches_df_valid = collect_predictions(encoded_ds=encoding_ds_valid.take(1000), batch_size=256, for_submission=False)

In [None]:
#matches_df_valid.head()

#### Function to compute F1 scores (model evaluation on validation set)

In [None]:
def compute_f1_score(matches_dict, data_df):
    """Mean per-posting F1 between predicted and ground-truth match sets."""

    f1_score_mean_accumulator = keras.metrics.Mean(name='f1_mean')
    for i, prod_num in enumerate(matches_dict, start=1):
    
        # find correct matches from dataframe
        group_id = data_df.loc[prod_num]['label_group']
        correct_matches = set(data_df[data_df['label_group'] == group_id].index.to_list())

        # predictions may be a set of ids or a ', '-joined string (as written by save_results)
        our_matches = matches_dict[prod_num]
        if isinstance(our_matches, str):
            our_matches = our_matches.split(', ')
        our_matches = set(our_matches)

        true_pos_count = len(our_matches.intersection(correct_matches))
        false_pos_count = len(our_matches.difference(correct_matches))
        actual_pos_count = len(correct_matches)

        # guard against empty sets and zero overlap
        precision = true_pos_count / max(true_pos_count + false_pos_count, 1)
        recall = true_pos_count / max(actual_pos_count, 1)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.
        f1_score_mean_accumulator.update_state(f1)
        
        if i % 100 == 0:
            print('mean f1_score:', f1_score_mean_accumulator.result().numpy())

    print('finished')   
    final_f1 = f1_score_mean_accumulator.result().numpy()
    print('final f1_score:', final_f1)
    
    return final_f1

In [None]:
#compute_f1_score(matches_dict=matches_df_valid.to_dict(), data_df=VALID_DF)
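
A small worked example of the per-posting F1 computation, in plain Python. The posting ids, predictions and ground-truth groups are made up, and `posting_f1` is an illustrative helper, not part of the notebook.

```python
def posting_f1(predicted, actual):
    """F1 between the predicted and true match sets of a single posting."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


# toy ground truth: postings p1, p2, p3 show one product, p4 another
truth = {"p1": {"p1", "p2", "p3"}, "p2": {"p1", "p2", "p3"},
         "p3": {"p1", "p2", "p3"}, "p4": {"p4"}}
predictions = {"p1": {"p1", "p2"}, "p2": {"p1", "p2"},
               "p3": {"p3", "p4"}, "p4": {"p4"}}

scores = {pid: posting_f1(predictions[pid], truth[pid]) for pid in truth}
mean_f1 = sum(scores.values()) / len(scores)
print(scores)
print(mean_f1)
```

Here p1 and p2 each miss one true match (F1 = 0.8), p3 has one correct and one wrong match (F1 = 0.4), and p4 is exact (F1 = 1.0), giving a mean of 0.75.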

# FINAL INFERENCE SUBMISSION CREATION

Run these functions to generate submission on the test set

In [None]:
# INFERENCE DATA PREP


# load inference csv as dataframe
file_dir = PARAMETERS.dataset_dir()

inference_DF = pd.read_csv(file_dir + 'test.csv')
inference_DF['id'] = inference_DF['posting_id']
inference_DF = inference_DF.set_index('id')
inference_DF['title'] = inference_DF['title'].apply(lambda x: x.lower())
inference_DF = inference_DF.rename(columns={'image':'image_path'})

# pre-processing step
file_dir = PARAMETERS.prep_save_dir()
inference_single_product_df = create_single_product_df(inference_DF)
save_to_csv(file_dir + 'inference_single_product_df.csv', inference_single_product_df)

# create the singles dataset from the csv saved above
single_ds_inference = create_singles_ds_from_csv(inference_csv=file_dir + 'inference_single_product_df.csv')

In [None]:
# get encodings
encoding_ds_inference = get_batch_encodings_ds(singles_ds=single_ds_inference, batch_size=64)

In [None]:
# make predictions
matches_df_inference = collect_predictions(encoded_ds=encoding_ds_inference, 
                                       batch_size=128, for_submission=True, save_frequency=250)
matches_df_inference.head()