# CA684 Machine Learning Assignment Spring 2022

Dublin City University has teamed up with leading online fashion retailer Zalando to create the 2022 CA684 Machine Learning assignment.

## Introduction

As a customer proposition, Zalando strives for “trustworthy” prices. That is, the company wants to offer competitive prices in each of its dynamic market environments, to relieve its customers of having to compare prices, and to drive revenue growth. To do that for its hundreds of thousands of individual products, Zalando needs to identify exact product matches across the relevant European competitors.

A very similar use case exists at stores like Amazon or Walmart, which allow multiple sellers to offer the same product on their platform: identical products need to be grouped together, even when the names, descriptions, images, etc. are not exactly the same.

## Challenge

Barcode systems like EAN allow for unique identification of every product. Unfortunately, reliable EAN information is not always available. Zalando uses multi-modal data to solve the problem, relying on images and text. For this challenge, we ask you to make intelligent use of text data (such as product titles, colors and descriptions). As these fields are not standardized, and are often manually written or changed for marketing purposes, matching products is a non-trivial task.

This challenge has a direct business impact for a retailer like Zalando. It is also closely related to many other problems, like record deduplication in heterogeneous catalogues, document retrieval, and many more.

## Getting Started

Here is some sample code to get you started on the challenge!

Happy Hacking!

In [82]:
import argparse
import random
import hashlib
import os
import numpy as np
import re
import operator
import tensorflow as tf
import math

In [1]:
# libraries
import os
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
import urllib.request  # plot_images() calls urllib.request.urlopen
from random import choices
from itertools import chain
# Levenshtein Distance in Python
# https://github.com/seatgeek/thefuzz
from thefuzz import fuzz, process

# Pandas config
pd.options.mode.chained_assignment = None  # default='warn'



In [208]:
#  separating shops 
zalando_df = offers_training_df[offers_training_df['shop'] == 'zalando'][['offer_id','brand','color','title','description']].reset_index()
aboutyou_df = offers_training_df[offers_training_df['shop'] == 'aboutyou'][['offer_id','brand','color','title','description']].reset_index()

In [220]:
offers_training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102884 entries, 0 to 102883
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   offer_id     102884 non-null  object 
 1   shop         102884 non-null  object 
 2   lang         102884 non-null  object 
 3   brand        102884 non-null  object 
 4   color        102882 non-null  object 
 5   title        102884 non-null  object 
 6   description  102884 non-null  object 
 7   price        102882 non-null  float64
 8   url          102884 non-null  object 
 9   image_urls   102858 non-null  object 
dtypes: float64(1), object(9)
memory usage: 7.8+ MB


In [218]:
zalando_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40904 entries, 0 to 40903
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            40904 non-null  int64 
 1   offer_id         40904 non-null  object
 2   brand            40904 non-null  object
 3   color            40904 non-null  object
 4   title            40904 non-null  object
 5   description      40904 non-null  object
 6   combo_text       40904 non-null  object
 7   norm_combo_text  40904 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.5+ MB


In [219]:
aboutyou_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61980 entries, 0 to 61979
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            61980 non-null  int64 
 1   offer_id         61980 non-null  object
 2   brand            61980 non-null  object
 3   color            61978 non-null  object
 4   title            61980 non-null  object
 5   description      61980 non-null  object
 6   combo_text       61980 non-null  object
 7   norm_combo_text  61980 non-null  object
dtypes: int64(1), object(7)
memory usage: 3.8+ MB


In [210]:
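# combine brand, color, title and description in one column
# (fill missing values first so a single NaN does not blank out the whole row)
def combo(df):
    cols = ['brand', 'color', 'title', 'description']
    return df[cols].fillna('').astype(str).agg(' '.join, axis=1)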

In [211]:
# create new column using combo function
zalando_df['combo_text'] = combo(zalando_df)
zalando_df['combo_text'] = zalando_df['combo_text'].fillna('').apply(str)

In [212]:
# normalize combined text
def normalizeText(description):
    description = re.sub('<.>|<..>', ' ', description.lower())
    return re.sub('[^a-z0-9 ]', '', description)

In [214]:
# create new column for normalized text 
zalando_df['norm_combo_text'] = zalando_df['combo_text'].apply(normalizeText)

In [215]:
zalando_df.head(1)

Unnamed: 0,index,offer_id,brand,color,title,description,combo_text,norm_combo_text
0,5,02df5ca3-8adc-48fa-bf42-91b41c3ea5a9,Guess,weiß,JUNIOR REVERSIBLE HOODED LONG Wintermantel,skirt_details Eingrifftaschen | Ziersteine $ n...,Guess weiß JUNIOR REVERSIBLE HOODED LONG Winte...,guess wei junior reversible hooded long winter...
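
Note that the normalized text above turns `weiß` into `wei`, because `normalizeText` drops every character outside `a-z0-9`. A minimal sketch of one possible alternative, assuming it is acceptable to transliterate German umlauts and `ß` before stripping. `normalize_german` and `GERMAN_MAP` are hypothetical names, not part of the provided code:

```python
import re

# Hypothetical transliteration table (an assumption, not from the notebook)
GERMAN_MAP = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'})

def normalize_german(text):
    """Lowercase, transliterate German characters, then strip other symbols."""
    text = text.lower().translate(GERMAN_MAP)
    # Replace very short tags such as <b> or <br> with a space
    # (similar in spirit to the rule in normalizeText)
    text = re.sub(r'<[^>]{1,2}>', ' ', text)
    # Replace any remaining non-alphanumeric character with a space
    text = re.sub(r'[^a-z0-9 ]', ' ', text)
    # Collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_german('Guess weiß JUNIOR Wintermantel'))
# guess weiss junior wintermantel
```

Whether transliteration helps matching is something to verify on the validation data; both shops would need to be normalized the same way.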


In [216]:
# repeat for aboutyou_df

aboutyou_df['combo_text'] = combo(aboutyou_df)
aboutyou_df['combo_text'] = aboutyou_df['combo_text'].fillna('').apply(str)

aboutyou_df['norm_combo_text'] = aboutyou_df['combo_text'].apply(normalizeText)

In [217]:
aboutyou_df.head(1)

Unnamed: 0,index,offer_id,brand,color,title,description,combo_text,norm_combo_text
0,0,d8e0dba8-98e8-48db-9850-dd30cff374e0,PIECES,hellblau | Blau,Kleid,"{""Material"": [""Baumwolle""], ""\u00c4rmell\u00e4...","PIECES hellblau | Blau Kleid {""Material"": [""Ba...",pieces hellblau blau kleid material baumwolle...
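
The AboutYou description in the row above looks like a JSON object rather than free text. A hedged sketch of how such a field might be flattened before normalization, assuming the descriptions map attribute names to lists of strings (the sample is truncated, so this structure is a guess). `flatten_description` is a hypothetical helper and the second attribute in the example input is made up:

```python
import json

def flatten_description(description):
    """Turn a JSON-like description into plain text; pass other text through."""
    try:
        data = json.loads(description)
    except (TypeError, ValueError):
        # Not JSON (or not a string at all): keep plain strings unchanged
        return description if isinstance(description, str) else ''
    if not isinstance(data, dict):
        return str(description)
    parts = []
    for key, values in data.items():
        if not isinstance(values, list):
            values = [values]
        parts.append(f"{key} {' '.join(str(v) for v in values)}")
    return ' '.join(parts)

# Illustrative input only; "Muster" is an invented attribute
print(flatten_description('{"Material": ["Baumwolle"], "Muster": ["Uni"]}'))
# Material Baumwolle Muster Uni
```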


In [234]:
from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings

In [224]:
def get_matches(query, choices, limit=3):
    results = process.extract(query, choices, limit=limit)
    return results

In [238]:
list1= [aboutyou_df.norm_combo_text[0],aboutyou_df.norm_combo_text[1],aboutyou_df.norm_combo_text[2]]

In [245]:
list2= [zalando_df.norm_combo_text[10],zalando_df.norm_combo_text[20],zalando_df.norm_combo_text[30]]

In [248]:
model = PolyFuzz("EditDistance")
model.match(list2, list1)

<polyfuzz.polyfuzz.PolyFuzz at 0x7ff2d802e340>

In [249]:
model.get_matches()

Unnamed: 0,From,To,Similarity
0,pieces schwarz pcmaggi midi dress freizeitklei...,pieces hellblau blau kleid material baumwolle...,0.855
1,vero moda hellgrn noos strickpullover mainsupp...,pieces hellblau blau kleid material baumwolle...,0.855
2,vero moda schwarz vmmollie jacket daunenjacke ...,pieces hellblau blau kleid material baumwolle...,0.421171


In [250]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
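
The `CountVectorizer` above is created but not yet used. One way word counts could feed a matcher is cosine similarity between bag-of-words vectors. The sketch below uses `collections.Counter` instead of scikit-learn so that it runs standalone; it is a stand-in for the idea, not the notebook's method, and `cosine_similarity` is a hypothetical helper:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the tokens that both texts share
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is empty
    return dot / (norm_a * norm_b)

# Illustrative strings in the style of norm_combo_text
score = cosine_similarity('pieces schwarz kleid', 'pieces hellblau blau kleid')
print(round(score, 3))
```

Unlike edit distance on the full string, this score ignores word order, which may matter when shops order attributes differently.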

PROVIDED CONTENT

In [2]:
# set random seed
np.random.seed(seed=42)

## Dataset

The dataset will contain files as follows. 

* Two files containing **offers of products**, for training and testing respectively. An offer is a particular description of a product by an online shop, either Zalando or one of its competitors. They contain the following fields:

| Label | Description |
|:-----:|:------------|
| offer_id | unique identifier for an offer of a product (i.e. a product x shop combination, where we don’t know the product component) |
| shop | "zalando", "aboutyou" |
| lang | "de" (German) |
| brand | e.g. "Nike" - note that different `shop`s might have different `brand` nomenclature |
| color | e.g. "blue" - note that different `shop`s might have different `color` nomenclature ("ocean", "light blue", "...") and that an offer may list more than one color (ordering matters) |
| title | e.g. "White Nike tennis top" |
| description | a long product description that may contain material composition, cleaning instructions, etc. |
| price | price in euro without any discount |
| url | url of the product description page |
| image_urls | list of product images such as stock photo, with model, lifestyle photo, or close up |

* A separate file containing the **matches** between offers that describe the same product, identified by offer id. Note that this file is only provided for the training offers.

| Label | Description |
|:-----:|:------------|
| zalando | offer_id from “zalando” shop |
| aboutyou | offer_id from “aboutyou” shop |
| brand | unique identifier for the brand representing the match |


!ls {offers,matches}_{training,test}.parquet

## Exploratory Data Analysis

It is important to familiarize yourself with the dataset by using measures of centrality (e.g. mean) and statistical dispersion (e.g. variance) and data visualization methods. The following is just some Pandas preprocessing and Matplotlib visualizations to get you started. Feel free to explore the data much further and come up with ideas that might help you in the matching task!

### Offers of Products

offers_training_df = pd.read_parquet('offers_training.parquet')

offers_test_df = pd.read_parquet('offers_test.parquet')

matches_training_df = pd.read_parquet('matches_training.parquet')

offers_training_df[offers_training_df['offer_id'] == 'b33f55d6-0149-4063-8b63-3eeae63562a2']

matches_training_df.head(100)

f'Number of products in training: {len(offers_training_df):,}'

list(offers_training_df.columns)

offers_training_df.head(4)

offers_training_df['lang'].unique()

offers_training_df['shop'].value_counts(sort=True, ascending=False)

figure, ax = plt.subplots(figsize=(15, 5))
offers_training_df[
    offers_training_df['title'].str.lower().str.contains("t-shirt", na=False)
]['shop'].value_counts(sort=True, ascending=False).plot.barh(color='darkcyan')
plt.title('T-Shirt Offers of Products per Shop')
xlabels = [f'{x:,}' for x in range(0, 3500, 500)]
plt.xticks(range(0, 3500, 500), xlabels)
plt.xlabel('Number of products')
plt.setp(ax.get_xticklabels()[0], visible=False)
plt.show()

figure, ax = plt.subplots(figsize=(15, 5))
plt.title('Histogram of Prices (Products under 300€)')
offers_training_df[
    offers_training_df['price'] < 300
]['price'].plot.hist(bins=50)
plt.show()

brands_training = offers_training_df['brand'].unique()

brands_training[:5]

f'Number of unique brands in training: {len(brands_training):,}'

offers_test_df = pd.read_parquet('offers_test.parquet')

f'Number of products in test: {len(offers_test_df):,}'

brands_test = offers_test_df['brand'].unique()

f'Number of unique brands in test: {len(brands_test):,}'

**Note** that the brands in training and test are different!

# Intersection between brands in training and test
f'Number of brands in train and test: {sum(np.isin(brands_training, brands_test, assume_unique=True)):,}'

### Matches

**Note** that matches for the offers in testing are hidden!

matches_training_df = pd.read_parquet('matches_training.parquet')

f'Number of groundtruth matches: {len(matches_training_df):,}'

matches_training_df.head()

matches_training_df.iloc[0]

def get_offer(products, match, shop):
    return products[
        products['offer_id'] == match[shop]
    ].iloc[0]

f"Number of unique brands in training matches: {len(matches_training_df['brand'].unique()):,}"

def plot_images(product):
    
    # Data
    images = product['image_urls']
    
    # Plot it!
    fig, axes = plt.subplots(nrows=1, ncols=len(images), figsize=(12, 4), dpi=100)
    
    if len(images) > 1:     
        axes = axes.flatten()
        for i, axis in enumerate(axes):
            url = images[i]
            image = np.array(Image.open(urllib.request.urlopen(url)))
            axis.imshow(image)
            axis.axis('off')
    else:
        url = images[0]
        image = np.array(Image.open(urllib.request.urlopen(url)))
        axes.imshow(image)
        axes.axis('off')

    fig.tight_layout()
    plt.show()

index = 9209 # particular index
product = get_offer(offers_training_df, matches_training_df.iloc[index], 'zalando')
product

print('Zalando')
plot_images(product)

product = get_offer(offers_training_df, matches_training_df.iloc[index], 'aboutyou')
product

print('AboutYou')
plot_images(product)

## Matching

The task is to predict the matches for the test offers, making use of the training offers and the corresponding ground-truth matches. Feel free to use any machine learning library you like, such as PyTorch, TensorFlow or scikit-learn. One idea is to split the training data into training and validation sets to measure how well your approach generalizes. The following is just a slow, naive baseline to get you started; it only looks at a few brands.
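
The train/validation split suggested above could be sketched as follows. Because the brands in training and test differ, this sketch assumes it is more realistic to hold out whole brand blocks than random rows. `split_by_brand` is a hypothetical helper that works on plain records such as `matches_training_df.to_dict('records')`:

```python
import random

def split_by_brand(matches, validation_fraction=0.2, seed=42):
    """Split match records so that no brand block appears on both sides."""
    brands = sorted({match['brand'] for match in matches})
    rng = random.Random(seed)
    rng.shuffle(brands)
    # Hold out at least one brand block for validation
    n_val = max(1, int(len(brands) * validation_fraction))
    val_brands = set(brands[:n_val])
    train = [m for m in matches if m['brand'] not in val_brands]
    val = [m for m in matches if m['brand'] in val_brands]
    return train, val

# Toy records with the same keys as the matches file
toy = [{'zalando': f'z{i}', 'aboutyou': f'a{i}', 'brand': i % 10} for i in range(50)]
train, val = split_by_brand(toy)
print(len(train), len(val))
```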

**Note** that it might help to map brands between Zalando and AboutYou first.
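
One possible way to map brands between the shops is to compare a normalized key instead of the raw brand string. This is a minimal sketch under the assumption that spelling differences are mostly case, spacing and punctuation; `brand_key` and `group_brand_variants` are hypothetical helpers and the brand spellings in the example are illustrative:

```python
import re
from collections import defaultdict

def brand_key(brand):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r'[^a-z0-9]', '', brand.lower())

def group_brand_variants(brands_by_shop):
    """Group raw brand names per shop under a shared normalized key."""
    groups = defaultdict(lambda: defaultdict(set))
    for shop, brands in brands_by_shop.items():
        for brand in brands:
            groups[brand_key(brand)][shop].add(brand)
    return groups

# Illustrative spellings, not taken from the dataset
groups = group_brand_variants({
    'zalando': ['Tommy Hilfiger'],
    'aboutyou': ['TOMMY HILFIGER', 'Veja'],
})
for key, shops in groups.items():
    print(key, {shop: sorted(names) for shop, names in shops.items()})
```

Keys that appear under both shops are candidate brand blocks. Sub-brands and non-ASCII names would need extra handling.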

Get creative!

def get_shops_for_brand(offers, brands):
    """ Get mapping for brands in between the two shops """
    
    mapping = {}
    for brand in brands:
        shops = offers[offers["brand"] == brand]["shop"].unique()
        for shop in shops:
            mapping.setdefault(shop, [])
            mapping[shop].append(brand)
        print(f'Brand: "{brand}" is in {", ".join(shops)}')
    return mapping

def get_offers_by_shop(offers, mapping):
    """ Get offers per shop """
    
    offers_zal = offers[
        (offers['shop'] == 'zalando') & 
        (offers['brand'].isin(mapping['zalando']))
    ]
    offers_comp = offers[
        (offers['shop'] == 'aboutyou') &
        (offers['brand'].isin(mapping['aboutyou']))
    ]
    return offers_zal, offers_comp

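def get_features(offers):
    """ Extract some text features using title and color """
    
    # Work on a copy so the caller's DataFrame slice is not modified
    offers = offers.copy()
    # Title plus the first listed color; a missing color becomes an empty string
    first_color = offers['color'].fillna('').str.split('|').str[0]
    offers['text'] = offers['title'] + ' ' + first_color
    
    return offers[['offer_id', 'text']].values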

def dummy_matcher(zal_offers, comp_offers, brand_block_index):
    """
    Slow and dummy matcher that matches each Zalando offer to an AboutYou offer 
    Note: there is no need to match all offers as not all of them can be matched
    """
    
    # Get text from offers
    comp_text = comp_offers[:, 1]
    choices_dict = {idx: el for idx, el in enumerate(comp_text)}
    
    predicted_matches = []

    # For each zalando offer
    for zal_offer_id, zal_text in zal_offers:
        
        # Extract the best match using TheFuzz's package
        title, score, index = process.extractOne(zal_text, choices_dict) 
        comp_offer_id = comp_offers[index][0]

        # Add predicted match
        predicted_matches.append(
            {
                'zalando': zal_offer_id,
                'aboutyou': comp_offer_id,
                'brand': brand_block_index
            }
        )

    return pd.DataFrame(predicted_matches)
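
The matcher above always returns a match, although its docstring notes that not every offer can be matched. A hedged sketch of one way to abstain is to keep the best candidate only when its score clears a threshold. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for TheFuzz so that it runs standalone; `best_match_with_threshold` and the `min_score` value are assumptions to be tuned on validation data:

```python
from difflib import SequenceMatcher

def best_match_with_threshold(query, candidates, min_score=0.8):
    """Return (candidate_id, score) for the best candidate, or None if too weak.

    candidates maps a candidate id to its text; scores are in [0, 1].
    """
    best_id, best_score = None, 0.0
    for cand_id, cand_text in candidates.items():
        score = SequenceMatcher(None, query, cand_text).ratio()
        if score > best_score:
            best_id, best_score = cand_id, score
    if best_id is None or best_score < min_score:
        return None  # abstain: no candidate is similar enough
    return best_id, best_score

# Illustrative texts only
candidates = {'a1': 'hoodie black', 'a2': 'sneaker white'}
print(best_match_with_threshold('hoodie black', candidates))
print(best_match_with_threshold('winter coat red', candidates, min_score=0.9))
```

Raising the threshold trades recall for precision, so it directly affects F1.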

def get_brand_predictions(brand_pattern, brand_unique_index):
    """ 
    Custom pipeline to get the brand mapping, offers per shop, extract the features and generate predictions
    """

    list_brands = [
        brand
        for brand in brands_training
        if brand_pattern in brand.lower()
    ]

    # Get brand mapping
    brand_mapping = get_shops_for_brand(offers_training_df, list_brands)
    print(f'Mapping: {brand_mapping}')

    # Get offers
    brand_offers_zal, brand_offers_comp = get_offers_by_shop(offers_training_df, brand_mapping)
    
    print(f'Number of "{brand_pattern}" products: {len(brand_offers_zal) + len(brand_offers_comp):,} (' + \
          f'Zalando: {len(brand_offers_zal):,} ' + \
          f'and AboutYou: {len(brand_offers_comp):,})')

    # Get features
    brand_offers_zal_features = get_features(brand_offers_zal)
    brand_offers_comp_features = get_features(brand_offers_comp)

    # Match!
    predictions = dummy_matcher(
        brand_offers_zal_features, 
        brand_offers_comp_features, 
        brand_unique_index
    )
    
    print(f'Number of predicted matches for "{brand_pattern}": {len(predictions):,}')
    
    return brand_offers_zal, brand_offers_comp, predictions

quiksilver_offers_zal, quiksilver_offers_comp, quiksilver_predicted_matches = get_brand_predictions('quiksilver', 0)

burberry_offers_zal, burberry_offers_comp, burberry_predicted_matches = get_brand_predictions('burberry', 1)

veja_offers_zal, veja_offers_comp, veja_predicted_matches = get_brand_predictions('veja', 2)

## Evaluation

Now that we have predicted some matches between the offers, we can evaluate the performance of our matching algorithm. Qualitatively, we can inspect predicted matches using the actual images, which quickly shows whether a pair is really a match. Quantitatively, we can measure performance with precision, recall and F1, computed from the confusion matrix between the actual and the predicted matches.

The goal of the assignment is to maximize F1 overall!
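
For reference, these are the standard definitions of the three metrics in terms of true positives ($TP$), false positives ($FP$) and false negatives ($FN$). How $FP$ is counted in this notebook (per AboutYou offer) is specific to the evaluation code that follows:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$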

def explore_match(match):
    """ Explore a match with the offers' images """
    
    # get offer ids
    zal_offer_id = match['zalando']
    comp_offer_id = match['aboutyou']
    
    # get offers
    zalando_offer = offers_training_df[offers_training_df['offer_id'] == zal_offer_id].iloc[0]
    comp_offer = offers_training_df[offers_training_df['offer_id'] == comp_offer_id].iloc[0]
    
    # show images and text
    print(f"Zalando: {zalando_offer['title']} {zalando_offer['color']}")
    plot_images(zalando_offer)
    print(f"AboutYou: {comp_offer['title']} {comp_offer['color']}")
    plot_images(comp_offer)

def get_true_matches_brand(zal_offers):
    """ Get true matches based on their brand block """
    
    # get brand block / mapping index from the training matches
    indexes = zal_offers.merge(
        matches_training_df,
        left_on='offer_id',
        right_on='zalando',
        suffixes=['offer', 'match']
    )['brandmatch'].unique()
    
    return matches_training_df[matches_training_df['brand'].isin(indexes)]

def get_metrics(true_matches, predicted_matches, offers_comp):
    """ Calculate performance metrics """
    
    # True Positives
    TP = len(
        true_matches.merge(
            predicted_matches, 
            on=['zalando', 'aboutyou'], 
            how='inner', 
        )
    )
    
    # False Negatives
    FN = len(true_matches) - TP
    
    # Actual Positives
    positives = len(true_matches)
    assert positives == TP + FN
    
    # Actual Negatives (with respect to the competitor)
    negatives = len(offers_comp) - positives
    
    # Actual negative values (with respect to the competitor)
    offers_comp_with_matches = offers_comp.merge(
        true_matches, 
        left_on='offer_id',
        right_on='aboutyou',
        how='outer',
        indicator=True
    )
    negative_values = offers_comp_with_matches[
        offers_comp_with_matches['_merge'] == 'left_only'
    ]['offer_id'].unique()
    
    assert negatives == len(negative_values)
    
    # Competitor predictions
    comp_preds = predicted_matches['aboutyou'].unique()
    
    # False Positives (with respect to the competitor)
    FP = len(np.intersect1d(negative_values, comp_preds))
    
    # True Negatives
    TN = negatives - FP
    
    # Precision, Recall and F1 metrics (guard against empty denominators)
    precision = TP / (TP + FP) if TP + FP > 0 else 0
    recall = TP / (TP + FN) if TP + FN > 0 else 0
    F1 = 0
    if precision + recall > 0:
        F1 = 2 * precision * recall / (precision + recall)
    
    metrics = dict(
        TP=TP,
        FN=FN,
        FP=FP,
        TN=TN,
        positives=positives,
        negatives=negatives,
        precision=precision,
        recall=recall,
        F1=F1,
    )
        
    return metrics

def get_brand_metrics(brand_offers_zal, brand_offers_comp, brand_predicted_matches):

    # Get groundtruth
    brand_true_matches = get_true_matches_brand(brand_offers_zal)
    
    print(f'Number of true matches: {len(brand_true_matches):,}')

    # Get metrics
    brand_metrics = get_metrics(brand_true_matches, brand_predicted_matches, brand_offers_comp)
    
    return brand_true_matches, brand_metrics

# Explore a particular predicted match
predicted_match = quiksilver_predicted_matches.iloc[27]
explore_match(predicted_match)

Not a correct match!

Let's see some details:

get_offer(offers_training_df, predicted_match, 'zalando')

get_offer(offers_training_df, predicted_match, 'aboutyou')

# Groundtruth match (note: there is a true match for this particular offer)
true_match = matches_training_df[
    matches_training_df['zalando'] == quiksilver_predicted_matches.iloc[27]['zalando']
].iloc[0]
explore_match(true_match)

quiksilver_true_matches, quiksilver_metrics = get_brand_metrics(
    quiksilver_offers_zal, quiksilver_offers_comp, quiksilver_predicted_matches)
quiksilver_metrics

Let's look at some of these results!

For Quiksilver, we found 10 true matches (true positives) out of the 98 actual matches (positives), using only the title and color of the offers. The remaining 88 are false negatives: we failed to predict them as matches. As for the actual negatives, 243 AboutYou offers had no corresponding match. We predicted a match for 54 of those although they do not have one; those are our false positives. The remaining 189 are true negatives.
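
As a worked example, the Quiksilver counts quoted above (10 true positives, 54 false positives, 88 false negatives) can be plugged into the metric formulas. `f1_from_counts` is a hypothetical helper that mirrors the arithmetic in `get_metrics`:

```python
def f1_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = f1_from_counts(tp=10, fp=54, fn=88)
print(f'precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}')
# precision=0.156 recall=0.102 F1=0.123
```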

burberry_true_matches, burberry_metrics = get_brand_metrics(
    burberry_offers_zal, burberry_offers_comp, burberry_predicted_matches)
burberry_metrics

veja_true_matches, veja_metrics = get_brand_metrics(veja_offers_zal, veja_offers_comp, veja_predicted_matches)
veja_metrics

Let's measure our metrics over all the offers in the ground truth. **Note** that there is no need to perform predictions per brand block; it is just an approach used in this notebook to showcase matches and non-matches.

offers_zal = offers_training_df[offers_training_df['shop'] == 'zalando']
offers_comp = offers_training_df[offers_training_df['shop'] == 'aboutyou']

predicted_matches = pd.concat([
    quiksilver_predicted_matches,
    burberry_predicted_matches,
    veja_predicted_matches
])

get_metrics(matches_training_df, predicted_matches, offers_comp)

The goal of the assignment is to maximize that F1 over all the test offers!

## Submission

Prepare a submission for matching the test offers between Zalando and AboutYou. The following example makes predictions for just a few brand blocks that are identified from the test offers. The objective is to make predictions for all test offers. Remember not all of them will have matches!

Happy Hacking!

dkny_brands = [
    brand
    for brand in brands_test
    if 'dkny' in brand.lower()
]

dkny_brand_mapping = get_shops_for_brand(offers_test_df, dkny_brands)
dkny_brand_mapping

gant_brands = [
    brand
    for brand in brands_test
    if 'gant' in brand.lower()
]

gant_brand_mapping = get_shops_for_brand(offers_test_df, gant_brands)
gant_brand_mapping

# Brand mappings for brands in test offers
test_mapping = [
    dkny_brand_mapping,
    gant_brand_mapping
]

def get_test_predictions(mapping):
    """ Predicts matches per brand block for the test set """
    
    predictions = []

    for brand_index, brand_mapping in enumerate(mapping):

        print(f'Predicting for block {", ".join(list(chain.from_iterable(brand_mapping.values())))}...')

        # Get offers
        brand_offers_zal, brand_offers_comp = get_offers_by_shop(offers_test_df, brand_mapping)
        print(f'Number of offers: {len(brand_offers_zal) + len(brand_offers_comp):,}')

        # Get features
        brand_offers_zal_features = get_features(brand_offers_zal)
        brand_offers_comp_features = get_features(brand_offers_comp)

        # Match!
        brand_pred_matches = dummy_matcher(
            brand_offers_zal_features, 
            brand_offers_comp_features, 
            brand_index
        )
        print(f'Number of matches: {len(brand_pred_matches):,}')

        # Add the predictions
        predictions.append(brand_pred_matches)

    return pd.concat(predictions)

predictions_df = get_test_predictions(test_mapping)

predictions_df.to_parquet('matches_test_predicted.parquet')
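
The baseline can assign the same AboutYou offer to several Zalando offers. If a product is assumed to appear at most once per shop (an assumption worth checking against the training matches), a greedy one-to-one filter could be applied before writing the file. The sketch below expects `(zalando_id, aboutyou_id, score)` tuples, so the matcher would need to keep the score that `process.extractOne` returns; `greedy_one_to_one` is a hypothetical helper:

```python
def greedy_one_to_one(scored_pairs):
    """Keep the highest-scoring pairs so that each offer id is used at most once."""
    used_zal, used_comp, kept = set(), set(), []
    # Visit pairs from the most to the least similar
    for zal_id, comp_id, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if zal_id in used_zal or comp_id in used_comp:
            continue  # one side is already matched with a better score
        used_zal.add(zal_id)
        used_comp.add(comp_id)
        kept.append((zal_id, comp_id, score))
    return kept

# Illustrative ids and scores
pairs = [('z1', 'a1', 90), ('z2', 'a1', 95), ('z1', 'a2', 80)]
print(greedy_one_to_one(pairs))
# [('z2', 'a1', 95), ('z1', 'a2', 80)]
```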