# Task: Basket Completion

The recommendation task of basket completion is a key part of many online retail applications. Basket completion involves computing predictions for the next item that should be added to a shopping basket, given a collection of items that the user has already added to the basket.


# Dataset

Amazon Baby Registries - This is a public dataset that consists of registries of baby products from 15 different categories (such as 'feeding', 'diapers', 'toys', etc.), where the item catalog and registries for each category are disjoint. Each category therefore provides a small dataset, with a maximum of 15,000 purchased baskets per category.



# Solution:

1-dimensional CNNs, similar to what we presented in the TextCNN section of our course.

![TextCNN architecture](./images/textCNN.png)



# Metrics:

### Precision@k
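
Each basket in this notebook has exactly one held-out target item, so precision@k is used here in the sense of top-k accuracy: the fraction of baskets whose target item appears among the k highest-scored products. Below is a minimal NumPy sketch, assuming `scores` is an `(n_baskets, n_products)` array of model outputs. The function name `top_k_hit_rate` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np

def top_k_hit_rate(scores: np.ndarray, y_true: np.ndarray, k: int) -> float:
    """Fraction of rows whose true label is among the k highest-scored classes."""
    # Indices of the k largest scores per row (order inside the top k is irrelevant)
    top_k = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 baskets, 4 products
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.5, 0.1, 0.3, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
y_true = np.array([1, 2, 3])

p_at_1 = top_k_hit_rate(scores, y_true, k=1)  # only the first basket is a hit
p_at_2 = top_k_hit_rate(scores, y_true, k=2)  # the first two baskets are hits
```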

## Data preparation
We are going to load the data from a provided URL, build the product vocab, ...

But first, let's import everything.

In [None]:
from typing import NamedTuple, List, Tuple
from collections import Counter
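# Import Keras consistently through TensorFlow instead of mixing it with the standalone keras package
from tensorflow import keras
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Concatenate, Flatten, Dropout, Dense
from tensorflow.keras.models import Model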
from collections import defaultdict
import numpy as np
from sklearn.model_selection import train_test_split
import urllib.request

Let's define a few useful functions

In [None]:
def read_basket_data_from_url(url: str) -> List[List[int]]:
    # We read the data from the provided url
    # Each product is represented by a number from 1 to nb_products, so we subtract 1 to be zero-indexed
    dataset = []
    with urllib.request.urlopen(url) as data:
        for line in data:
            line = line.decode("utf-8")
            products = [int(p)-1 for p in line.split(',')]
            if len(products) > 1:
                dataset.append(products)
    return dataset

def build_vocab(dataset: List[List[int]]) -> Tuple[List[int], Counter]:
    # Count how many times each product appears and build the list of unique products
    counter = Counter()
    for basket in dataset:
        counter.update(basket)
    return list(counter.keys()), counter

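def to_size(data: List[int], size: int) -> np.ndarray:
    # Here we are going to make the baskets of a predetermined size
    # either by truncating them, or by duplicating some products
    if len(data) >= size:
        # Enough products: sample without replacement so none is duplicated
        return np.random.choice(data, size=size, replace=False)
    else:
        # Too few products: sample with replacement to pad up to the target size
        return np.random.choice(data, size=size, replace=True)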
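
To make the resizing behaviour concrete, the sketch below applies the same logic to two toy baskets (the product ids are made up, and `resize_basket` is a standalone copy so the cell runs on its own). Both branches sample at random, so the original order of the products is not preserved, and padding with replacement can repeat some products while leaving others out.

```python
import numpy as np

def resize_basket(data, size):
    # Standalone copy of the resizing logic, written for illustration only
    replace = len(data) < size  # sample with replacement only when padding is needed
    return np.random.choice(data, size=size, replace=replace)

short_basket = resize_basket([4, 9], size=5)             # padded: products are repeated
long_basket = resize_basket([1, 2, 3, 4, 5, 6], size=4)  # truncated: 4 distinct products kept

print(short_basket, long_basket)
```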

We now define our main dataset class `BasketData`.
It contains several fields:
* dataset: (X, y) tuple where X is an np.array with one basket per row and y is an np.array with the target item to predict for each basket
* vocab: list of unique products
* counter: Counter of unique products and their counts
* vocab_size: number of unique products
* max_basket_length: size of the largest basket seen in the data

In [None]:
class BasketData(NamedTuple):
    dataset: Tuple[np.ndarray, np.ndarray]
    vocab: List[int]
    counter: Counter
    vocab_size: int
    max_basket_length: int
    
    @staticmethod
    def build_from_url(url: str):
        dataset = read_basket_data_from_url(url)
        print(f"Read {len(dataset)} baskets from {url}")
        
        vocab, counter = build_vocab(dataset)
        print(f"Number of distinct products {len(vocab)}")
        
        max_basket_length = max(len(b) for b in dataset)
        print(f"Max basket size {max_basket_length}")
        
        dataset = BasketData.build_input_and_labels(dataset, max_basket_length)
        
        print("Done building dataset")
        return BasketData(dataset, vocab, counter, len(vocab), max_basket_length)
    
    @staticmethod
    def build_input_and_labels(baskets: List[List[int]], max_length: int) -> Tuple[np.ndarray, np.ndarray]:
        inputs = []
        labels = []
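        for basket in baskets:
            # The last product of the basket is the target, the rest is the input
            input_basket = basket[:-1]
            label_product = basket[-1]

            # Resample the input to a fixed length so that all rows have the same shape
            inputs.append(to_size(input_basket, max_length))
            labels.append(label_product)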

        inputs = np.array(inputs)
        labels = np.array(labels)
        return inputs, labels

Let's load the data from the URL.

In [None]:
data_url = "https://www.dropbox.com/s/hkwnwlut4mb5yyb/1_100_100_100_apparel_regs.csv?dl=1"
basket_data = BasketData.build_from_url(data_url)

We are now going to define the BasketCNN model using Keras layers. The most important operation is the 1D convolution, applied with several kernel sizes and followed by max pooling. We will be changing these parameters in the exercises.

In [None]:
def BasketCNN(max_sequence_length, vocab_size, embedding_dim=100, num_filters=16, dropout_rate=0.25):
    """
    Input:
        - max_sequence_length: maximum length of baskets
        - vocab_size: number of distinct products
        - embedding_dim: size of the product embedding vectors
        - num_filters: number of convolution filters per kernel size
        - dropout_rate: dropout rate for flattened pooled outputs
    Returns:
        - model: Model class created with specified inputs
    """        
    x_input = Input(shape=(max_sequence_length,), dtype='int32')

    embedding_layer = Embedding(input_dim=vocab_size,
                                output_dim=embedding_dim)

    x = embedding_layer(x_input)

    kernel_sizes = [3, 5, 7]
    pooled = []

    for kernel in kernel_sizes:

        conv = Conv1D(filters=num_filters,
                      kernel_size=kernel,
                      padding='valid',
                      strides=1,
                      activation='relu')(x)
        
        pool = MaxPooling1D(pool_size=max_sequence_length - kernel + 1)(conv)

        pooled.append(pool)

    merged = Concatenate(axis=-1)(pooled)

    flatten = Flatten()(merged)

    drop = Dropout(rate=dropout_rate)(flatten)
    
    x_output = Dense(vocab_size, activation='softmax')(drop)

    return Model(inputs=x_input, outputs=x_output)
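
With `padding='valid'` and `strides=1`, a kernel of size `k` sliding over a sequence of length `L` produces `L - k + 1` outputs. This is why the pooling window in the model is `max_sequence_length - kernel + 1`: it spans the whole convolution output and keeps a single maximum per filter. The NumPy sketch below illustrates this for one filter. The sizes and random values are toy assumptions, not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, kernel = 10, 4, 3          # toy sizes, assumed for illustration
x = rng.normal(size=(seq_len, emb_dim))      # one embedded basket
w = rng.normal(size=(kernel, emb_dim))       # weights of a single filter (bias omitted)

# 'valid' convolution with stride 1: one output per full window
conv = np.array([np.sum(x[i:i + kernel] * w) for i in range(seq_len - kernel + 1)])
conv = np.maximum(conv, 0.0)                 # ReLU activation

pooled = conv.max()                          # pooling over all positions: one value per filter
print(conv.shape, pooled)
```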

In [None]:
model = BasketCNN(basket_data.max_basket_length, basket_data.vocab_size, dropout_rate=0.25)
model.summary()

Let's split our data into train and test sets and train our model!
We will be using `sparse_categorical_crossentropy` as our loss and measuring precision@1 and precision@5 (5 is the default `k` value of `sparse_top_k_categorical_accuracy`).

In [None]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='Adam',
              metrics=['sparse_categorical_accuracy', 'sparse_top_k_categorical_accuracy'])

In [None]:
X, y = basket_data.dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=128, epochs=10, verbose=1)

Let's compare the performance of this model to a very naive baseline that just always predicts the most popular product (the goal is obviously to beat it ^^ )

In [None]:
most_common_product, max_count = basket_data.counter.most_common(1)[0]

In [None]:
# Build an array with the same shape as y_test, filled with the most popular product
naive_labels = np.full_like(y_test, most_common_product)

In [None]:
(naive_labels == y_test).mean()
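
The model is also evaluated with precision@5, so a comparable baseline would always recommend the k most popular products. Below is a minimal sketch on toy data, assuming the same conventions as the notebook (a `Counter` of product counts and one target product per basket). The helper name `popularity_hit_rate` is hypothetical.

```python
from collections import Counter

def popularity_hit_rate(counter: Counter, targets, k: int) -> float:
    """Share of targets that fall within the k most frequent products."""
    top_k = {product for product, _ in counter.most_common(k)}
    return sum(t in top_k for t in targets) / len(targets)

# Toy data: product 0 is the most popular, then 1, then 2, then 3
toy_counter = Counter({0: 50, 1: 30, 2: 15, 3: 5})
toy_targets = [0, 1, 1, 3]

hit_at_1 = popularity_hit_rate(toy_counter, toy_targets, k=1)
hit_at_2 = popularity_hit_rate(toy_counter, toy_targets, k=2)
```

In the notebook this could be called as `popularity_hit_rate(basket_data.counter, y_test, k=5)`. Keep in mind that `basket_data.counter` is built from all baskets, including those that end up in the test split.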

### Questions

* Q1: What is the impact of the number of filters, the kernel sizes, and the embedding size on the metrics?
* Q2: Add a second convolutional layer to the model. How does it impact the model? Does it overfit?
* Q3: Try augmenting the dataset with shuffled versions of the baskets.
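
As one possible starting point for Q3, the sketch below adds shuffled copies of each raw basket before inputs and labels are built, so that a different product can end up as the last (target) item. The helper `augment_with_shuffles` and its parameters are hypothetical, and it should probably be applied to training baskets only so that shuffled copies of a test basket do not leak into training.

```python
import random
from typing import List

def augment_with_shuffles(baskets: List[List[int]], n_copies: int = 2, seed: int = 0) -> List[List[int]]:
    """Return the original baskets followed by n_copies shuffled versions of each."""
    rng = random.Random(seed)
    augmented = [list(b) for b in baskets]
    for basket in baskets:
        for _ in range(n_copies):
            shuffled = list(basket)
            rng.shuffle(shuffled)  # in-place shuffle of the copy, the original is untouched
            augmented.append(shuffled)
    return augmented

toy_baskets = [[1, 2, 3], [4, 5]]
augmented = augment_with_shuffles(toy_baskets, n_copies=2)
```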