# Abstract

We demonstrate a text classification approach based on an open dataset from https://world.openfoodfacts.org/. The goal of our talk is to categorize food items with various tags, where a food item is described by a product name, a generic name, and a brand. For classification we use fastText, Facebook's NLP library, which provides a text classification model based on word embeddings as well as character n-gram embeddings. In a first experiment, we keep only a single tag per data point and remove any additional ones. In this case, the evaluation is straightforward. However, since some classes are more closely related than others, we don't want to score predictions as simply right or wrong, as one would typically do. To this end, we implement a similarity concept and a multilabel classification approach. Additionally, we present some applications of a standardized food catalog, for instance search and recommendations.

In [46]:
import pandas as pd
import numpy as np
from fasttext import *
%matplotlib inline
import evaluation_metrics

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import scipy.spatial.distance as sd
import sklearn.metrics.pairwise as pw
import re
import warnings
warnings.filterwarnings('ignore')

In [47]:
FASTTEXT_HOME="/Users/evelyn.trautmann/repos/fasttext3"

# Load Data

The following classification is based on an open dataset from https://world.openfoodfacts.org/

In [48]:
df = pd.read_csv("data/foodcategories_single_label.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

df.count()

product_name            324820
generic_name             83152
brands                  269465
categories              329303
categories_tags         329303
origins                  47351
manufacturing_places     73767
labels                  118779
emb_codes                52104
countries               329099
main_category           329297
en_tags                 329303
dtype: int64

We want to classify tags based on product name, generic name and brand. In the first dataset we have a single tag assigned to each data point.

In [49]:
df.en_tags.str.len().value_counts(), "%i labels "%len(set(np.concatenate(df.en_tags.values)))

(1    329303
 Name: en_tags, dtype: int64, '87 labels ')

# Extract Train and Test Data

In [50]:
#generate features
feature_cols = ['product_name', 'generic_name', 'brands']
assert(df[df.en_tags.str.len()==0].shape[0]==0)
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)
y = df.en_tags

In [51]:
# preprocess and split into train, validation and test data
X = X.str.lower().apply(lambda x: re.sub(r'[^\w\s]','',x))

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25, random_state=13)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train

In [52]:
dftrain_emb = y_train.apply(lambda x: " ".join(x)) + " " + X_train


dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test
dfvalid = y_valid.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_valid
dftest.shape, dfvalid.shape

((65861,), (65861,))

In [8]:
dftrain.to_csv(
    "train.csv", index = False, sep=";")
dftest.to_csv(
    "test.csv", index = False, sep=";")
dfvalid.to_csv(
    "valid.csv", index = False, sep=";")
dftrain_emb.to_csv(
    "train_emb.csv", index = False, sep=";")

# Train

For classification we use fastText, which can either be run from the command line or called through its Python API.
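
In supervised mode, fastText reads one example per line, with every label marked by the `__label__` prefix and followed by the text. The following is a minimal sketch of how such a line could be assembled, using the same normalisation as the preprocessing cell above; the helper name `to_fasttext_line` and the sample product are hypothetical and not part of the fasttext API or the dataset.

```python
import re


def to_fasttext_line(labels, text):
    """Build one training line of the form '__label__a __label__b some text'.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and strip punctuation, as in the preprocessing cell.
    clean = re.sub(r"[^\w\s]", "", text.lower())
    # Prefix every label with fastText's default label marker.
    prefix = " ".join("__label__" + label for label in labels)
    return prefix + " " + clean


line = to_fasttext_line(["cakes"], "Marble Cake, Brand X")
print(line)
```

For multilabel data, the same helper would simply receive several labels, producing several `__label__` tokens at the start of the line.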

In [9]:
model_singlelabel = train_supervised("train.csv",
                lr = 0.01,
                autotuneValidationFile = "valid.csv",
                autotuneDuration = 1000)

model_singlelabel.save_model("model_singlelabel.bin")

In [53]:
model_singlelabel = load_model("model_singlelabel.bin")




In [54]:
!$FASTTEXT_HOME/fastText/fasttext dump model_singlelabel.bin args

dim 41
ws 5
epoch 100
minCount 1
neg 5
wordNgrams 3
loss softmax
model sup
bucket 10000000
minn 0
maxn 0
lrUpdateRate 100
t 0.0001


In [55]:
!$FASTTEXT_HOME/fastText/fasttext test model_singlelabel.bin test.csv 1

N	65860
P@1	0.778
R@1	0.778


# Test

In [56]:
model_singlelabel.test("test.csv", k=1)

(65860, 0.7776040085028849, 0.7776040085028849)

In [57]:
model_singlelabel.test("valid.csv", k=1)

(65861, 0.7778199541458526, 0.7778199541458526)

In [58]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

1

# Single Label Evaluation 

In [59]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_singlelabel.predict(x))

df_test["label_predicted"] = df_test.prediction.str[0].str[0].str.replace("__label__","")
df_test["confidence"] = df_test.prediction.str[1].str[0]
df_test["truth"] = df_test.en_tags.str[0].str.replace("__label__","")


In [60]:
report = classification_report(df_test.truth, df_test.label_predicted)
print(report)

                                 precision    recall  f1-score   support

            alcoholic-beverages       0.79      0.81      0.80      1657
                     baby-foods       0.77      0.74      0.76       288
                   bee-products       1.00      0.38      0.55         8
                          beers       0.00      0.00      0.00         5
                      beverages       0.00      0.00      0.00        42
             biscuits-and-cakes       0.63      0.50      0.55       569
                     breakfasts       0.87      0.87      0.87      3202
                          cakes       0.50      0.40      0.45       423
                        candies       0.76      0.26      0.39        85
                  canned-fishes       0.74      0.77      0.76       503
                   canned-foods       0.74      0.19      0.31        72
       canned-plant-based-foods       0.56      0.35      0.43        80
              carbonated-drinks       0.65      0.

In [61]:
# assert that there are no empty tags
assert(len(df_test.en_tags[df_test.en_tags.str.len()==0])==0)

In [62]:
label = "cakes"
df_test[df_test.truth==label].label_predicted.value_counts()

cakes                  171
snacks                 156
biscuits-and-cakes      28
desserts                17
pastries                14
plant-based-foods        8
chocolate-biscuits       6
ice-creams               4
meals                    3
fresh-foods              2
frozen-foods             2
breakfasts               2
sweet-snacks             2
dairies                  1
nuts                     1
alcoholic-beverages      1
fruits-based-foods       1
meats                    1
sweetened-beverages      1
sauces                   1
spreads                  1
Name: label_predicted, dtype: int64

# Similarity

Sometimes the model predicts a class other than the ground truth but still makes a reasonable assignment. To distinguish these reasonable assignments from genuinely wrong classifications, we introduce similarities between classes and count similar predictions, weighted by their similarity, among the accurate predictions.

Example
Consider the following common products:
chicken, poultries, seafood
Given that the first two are quite similar and the last is very different, a possible similarity matrix S could look like



|Labels | chicken | poultries |seafood|
|---------|:------------------|:-------------------|:--------------|
|chicken | 1 | 0.9 | 0 |
| poultries | 0.9 | 1 | 0 |
| seafood | 0 | 0 | 1 |


If we now want to include similarity in the calculation of precision and recall, the numerator contains not only

```math
Count[Predicted = chicken, Truth = chicken]
```

but

```math
Count[Predicted = chicken, Truth = chicken] * 1 
+ Count[Predicted  = chicken, Truth = poultries ] * 0.9.
```
Hence the diagonal entries of the confusion matrix

$$ C_{ii} = Count[Predicted=i, Truth=i] $$

become

$$ \hat{C}_{ii} = \sum_j Count[Predicted=i, Truth=j] * S_{ij} $$

That is, we count all similar predictions, weighted by their degree of similarity.
Technically, this means we take the diagonal entries of

$$ \hat{C} = C * S $$


where * denotes the matrix product (S is symmetric, so $S_{ij} = S_{ji}$). The denominator remains the same: the row sum in the case of precision, the column sum in the case of recall. We don't weight the denominator by similarity, since it only counts the exact number of predicted (for precision) or true (for recall) occurrences.
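
The weighting can be checked on the three-class example. The sketch below uses the similarity matrix from the table above together with made-up confusion counts (an assumption for illustration only); rows are indexed by the predicted label and columns by the true label, as in the formulas above.

```python
import numpy as np

labels = ["chicken", "poultries", "seafood"]

# Hypothetical confusion counts: rows = predicted label, columns = true label.
C = np.array([
    [8, 1, 0],
    [1, 8, 1],
    [1, 1, 9],
], dtype=float)

# Similarity matrix S from the example table.
S = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Similarity-weighted hits: the diagonal of the matrix product C S.
weighted_hits = np.diag(C @ S)

# Denominators stay unweighted: row sums for precision, column sums for recall.
precision = weighted_hits / C.sum(axis=1)
recall = weighted_hits / C.sum(axis=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:10s} precision={p:.3f} recall={r:.3f}")
```

With these counts, the single "chicken" prediction whose true label is "poultries" contributes 0.9 instead of 0 to the numerator for chicken, while seafood is unaffected because its similarity to the other classes is 0.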

In [63]:
truth=["truth"]
predictions=["label_predicted"]
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))


In [65]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)
confusion.to_csv("confusion_single_label.csv")

compute diffecence sets
compute intersections
determine vectors
CPU times: user 3min 27s, sys: 3.49 s, total: 3min 31s
Wall time: 3min 30s


In [66]:
confusion = pd.read_csv("confusion_single_label.csv", index_col=0)

In [23]:
model_emb = train_unsupervised("train_emb.csv")

In [24]:
# Compute word embeddings for each class and pairwise distance
# between every two classes. Set all distances that exceed a
# pre-defined threshold to null and compute rbf function as similarity.
#
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))
MAGIC_NUMBER = 0.7
mask = Dst > MAGIC_NUMBER
Dst[mask]=np.nan
# for each class, print the most distant class that is still within the threshold
print(Dst.idxmax()[Dst.max()>0.0])

S = np.exp(-Dst**2).fillna(0)

frozen-foods                                       frozen-desserts
seafood                                              smoked-fishes
fermented-milk-products                            fermented-foods
fermented-foods                            fermented-milk-products
olive-tree-products                                   bee-products
chickens                                           chicken-breasts
plant-based-foods-and-beverages              unsweetened-beverages
chicken-breasts                                           chickens
non-alcoholic-beverages                        alcoholic-beverages
wines                                            wines-from-france
canned-foods                              canned-plant-based-foods
plant-based-beverages                      non-alcoholic-beverages
canned-plant-based-foods                  frozen-plant-based-foods
hot-beverages                              non-alcoholic-beverages
fresh-plant-based-foods                     vegetables-based-f
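
The same thresholded RBF similarity can be illustrated on toy data. This is a sketch under stated assumptions: the 2-d vectors below are invented for illustration and are not real fastText embeddings; only the threshold of 0.7 and the kernel `exp(-d**2)` are taken from the cell above.

```python
import numpy as np

# Hypothetical 2-d class embeddings, for illustration only.
vecs = {
    "chickens": np.array([0.0, 0.0]),
    "chicken-breasts": np.array([0.3, 0.4]),  # distance 0.5 from "chickens"
    "seafood": np.array([3.0, 4.0]),          # distance 5.0 from "chickens"
}
names = list(vecs)
V = np.stack([vecs[n] for n in names])

# Pairwise Euclidean distances via broadcasting.
dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)

# Keep the RBF similarity exp(-d^2) only where the distance does not exceed
# the threshold; every other pair gets similarity 0.
threshold = 0.7
sim = np.where(dist <= threshold, np.exp(-dist ** 2), 0.0)

print(names)
print(np.round(sim, 3))
```

Each class keeps similarity 1 to itself, the two chicken classes get a similarity of about 0.78, and seafood is cut off from both.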

In [25]:

df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = True)[:20]

Unnamed: 0,precision,recall,f1score,support
hot-beverages,0.0,0.164437,0.0,25.0
plant-based-beverages,0.0,0.36796,0.0,2.0
chicken-breasts,0.0,0.793644,0.0,1.0
smoked-fishes,0.0,0.626288,0.0,8.0
chocolates,0.380952,0.119403,0.181818,67.0
beverages,0.331648,0.129354,0.186116,42.0
frozen-ready-made-meals,0.352941,0.256684,0.297214,187.0
canned-foods,0.736842,0.194444,0.307692,72.0
fats,0.477273,0.230769,0.311111,364.0
syrups,0.418605,0.272727,0.330275,66.0


In [26]:
evaluation_metrics.get_summary(df_report, confusion, S)

Unnamed: 0,precision,recall,f1-score,support
macro avg,0.604499,0.550221,0.576084,65861.0
weighted avg,0.792592,0.799328,0.795946,65861.0


# Multilabel Evaluation 

In [28]:
#load data
df = pd.read_csv("data/foodcategories_3labels.csv.zip", 
                 sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

In [29]:
y = df.en_tags
feature_cols = ['product_name', 'generic_name', 'brands']
assert(df[df.en_tags.str.len()==0].shape[0]==0)
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, 
                                                      random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, 
                                                    test_size=0.25, random_state=23)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test
dfvalid = y_valid.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_valid


In [30]:
dftrain.to_csv(
    "train_3label.csv", index = False, sep=";")
dftest.to_csv(
    "test3label.csv", index = False, sep=";")
dfvalid.to_csv(
    "valid3label.csv", index = False, sep=";")

In [31]:
model_3label = train_supervised("train_3label.csv",
                lr = 0.01,
                autotunePredictions = 3,
                autotuneValidationFile = "valid3label.csv",
                autotuneDuration = 1000)
model_3label.save_model("model_3label.bin")


In [32]:
model_3label = load_model("model_3label.bin")




In [33]:
model_3label.test("test3label.csv", 3)

(65844, 0.7418241095113703, 0.8264795627724917)

Let $\{T_1^{(1)}, ...,T_M^{(1)}\},...,\{T_1^{(N)}, ...,T_M^{(N)}\}$ be our ground truth and
$\{P_1^{(1)}, ...,P_M^{(1)}\},...,\{P_1^{(N)}, ...,P_M^{(N)}\}$ the predictions.

On the diagonal we count the events where the truth matches the prediction:

$\sum_{i=1}^N |\{T_1^{(i)}, ...,T_M^{(i)}\}\cap \{P_1^{(i)}, ...,P_M^{(i)}\}|$

So for each diagonal entry we count, over all data points (N) and all predictions (M), how often the truth matches the prediction:

$C_{kk} =\sum_{i=1}^N \sum_{m=1}^M \sum_{m'=1}^M \chi{1}\{T_m^{(i)}=k,P_{m'}^{(i)}=k\} $

In contrast, on the off-diagonal we count all events where the truth was not covered by the prediction.
For the off-diagonal we take only the difference sets into account. If the truth for data point i was, for example, {fish, curry, rice-dish} and the prediction was {fish, curry, soup}, we only count
$\tilde{T}^{(i)}$ = {rice-dish} and $\tilde{P}^{(i)}$ = {soup} for the off-diagonal. Formally, we apply the following transformation:

$\{\tilde{T}_{m1}^{(i)}\}_{m1=1,..,M1} = \{T_m^{(i)}\}_{m=1,..,M}\setminus \{P_m^{(i)}\}_{m=1,..,M}$

$\{\tilde{P}_{m2}^{(i)}\}_{m2=1,..,M2} = \{P_m^{(i)}\}_{m=1,..,M}\setminus \{T_m^{(i)}\}_{m=1,..,M}$

Finally, we count all events of the ground truth that were not captured by the predictions, which results in

$C_{kl} =\sum_{i=1}^N \sum_{m1=1}^{M1} \sum_{m2=1}^{M2}
\frac{1}{M1}\chi{1}\{\tilde{T}_{m1}^{(i)}=k,\tilde{P}_{m2}^{(i)}=l\} $
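
The counting scheme can be traced for a single data point. The following is a minimal sketch that follows the formulas above literally; the helper `confusion_contributions` is hypothetical and is not the implementation in `evaluation_metrics`.

```python
from collections import Counter


def confusion_contributions(truth, predicted):
    """Return {(k, l): weight} for one data point (hypothetical helper)."""
    truth, predicted = set(truth), set(predicted)
    counts = Counter()

    # Diagonal: labels that occur in both truth and prediction.
    for label in truth & predicted:
        counts[(label, label)] += 1.0

    # Off-diagonal: only the difference sets are taken into account.
    missed = truth - predicted      # T-tilde: truth not covered by prediction
    spurious = predicted - truth    # P-tilde: predictions not in the truth
    for k in missed:
        for l in spurious:
            # Weight 1/M1, where M1 is the size of the truth difference set.
            counts[(k, l)] += 1.0 / len(missed)
    return counts


c = confusion_contributions(
    {"fish", "curry", "rice-dish"},
    {"fish", "curry", "soup"},
)
for key, weight in sorted(c.items()):
    print(key, weight)
```

For the example from the text, "fish" and "curry" each add 1 to their diagonal entry, and the only off-diagonal contribution is the pair (rice-dish, soup).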

Once we have computed the multi-class, multi-label confusion matrix, the classification report is straightforward.

Precision for class k is computed as

$Pr_k=\frac{C_{kk}}{\sum_j C_{kj}}$

and Recall

$R_k=\frac{C_{kk}}{\sum_j C_{jk}}$
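
As a rough illustration of these two formulas, the sketch below derives per-class precision and recall from a small confusion matrix. The counts and the function name `precision_recall` are assumptions made for this example; classes with an empty row or column get NaN instead of raising a division error.

```python
import numpy as np


def precision_recall(C):
    """Per-class precision (diagonal / row sum) and recall (diagonal / column sum)."""
    C = np.asarray(C, dtype=float)
    diag = np.diag(C)
    row_sums = C.sum(axis=1)
    col_sums = C.sum(axis=0)
    # Divide only where the denominator is positive; leave NaN elsewhere.
    precision = np.divide(diag, row_sums,
                          out=np.full_like(diag, np.nan), where=row_sums > 0)
    recall = np.divide(diag, col_sums,
                       out=np.full_like(diag, np.nan), where=col_sums > 0)
    return precision, recall


# Hypothetical 3-class confusion matrix; the third class never occurs.
C = [[8, 2, 0],
     [1, 9, 0],
     [0, 0, 0]]
precision, recall = precision_recall(C)
print(precision)
print(recall)
```

The NaN entries correspond to classes that cannot be scored, which is why a report built this way would typically be filtered with `dropna()` before sorting.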



In [34]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

3

In [35]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_3label.predict(x, k=K))

for k in range(K):
    df_test["label_predicted%i" %k] = df_test.prediction.str[0].str[k].str.replace("__label__","")
    df_test["confidence%i" %k] = df_test.prediction.str[1].str[k]
    df_test["truth%i" %k] = df_test.en_tags.str[k].str.replace("__label__","")


truth = list()
predictions = list()
for k in range(int(K)):
    truth.append("truth%i" %k)
    predictions.append("label_predicted%i" %k)
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))
classes.remove("nan")


In [36]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)

confusion.to_csv("confusion_3_label.csv")

compute diffecence sets
compute intersections
determine vectors
CPU times: user 3min 46s, sys: 4.3 s, total: 3min 50s
Wall time: 3min 49s


In [37]:
confusion = pd.read_csv("confusion_3_label.csv", index_col=0)

In [38]:
df_report = evaluation_metrics.get_report(confusion)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

Unnamed: 0,precision,recall,f1score,support
cooked-pressed-cheeses,0.973046,0.95,0.961385,380.0
mustards,0.979504,0.928425,0.953281,489.0
prepared-meats,0.941559,0.943049,0.942303,3582.0
honeys,0.920375,0.959004,0.939293,683.0
hams,0.921358,0.953741,0.93727,735.0
fish-and-meat-and-eggs,0.940775,0.931646,0.936188,395.0
fermented-milk-products,0.932867,0.935103,0.933984,5424.0
poultries,0.925694,0.932216,0.928943,1372.0
sauces,0.920702,0.934432,0.927516,3096.0
olive-tree-products,0.935105,0.913174,0.924009,668.0


In [39]:
evaluation_metrics.get_summary(df_report, confusion)

Unnamed: 0,precision,recall,f1-score,support
macro avg,0.715043,0.655135,0.683779,177565.0
weighted avg,0.822135,0.826278,0.824201,177565.0


# Combining Multilabel Classification with Near Accuracy 

In [40]:
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))

mask = Dst > 0.7
Dst[mask]=np.nan
# for each class, show the most distant class that is still within distance 0.7
Dst.idxmax()[Dst.max()>0]

pastas                                                       dry-pasta
milk-chocolates                                      chocolate-candies
frozen-foods                                           frozen-desserts
plant-based-foods                              dried-plant-based-foods
prepared-vegetables                                     vegetable-oils
seafood                                                  smoked-fishes
fermented-milk-products                                fermented-foods
juices-and-nectars                     plant-based-foods-and-beverages
fruit-based-beverages                                    hot-beverages
fermented-foods                                fermented-milk-products
dark-chocolates                                      chocolate-candies
olive-tree-products                                       bee-products
chickens                                               chicken-breasts
fishes                                                    pasta-dishes
french

In [41]:
S = np.exp(-Dst**2).fillna(0)
df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

Unnamed: 0,precision,recall,f1score,support
cooked-pressed-cheeses,0.979032,0.968234,0.973603,380.0
mustards,0.979504,0.928425,0.953281,489.0
prepared-meats,0.941559,0.943049,0.942303,3582.0
honeys,0.920375,0.959004,0.939293,683.0
hams,0.921358,0.953741,0.93727,735.0
fish-and-meat-and-eggs,0.940775,0.931646,0.936188,395.0
fermented-milk-products,0.932909,0.935375,0.934141,5424.0
poultries,0.930048,0.935996,0.933013,1372.0
sauces,0.920702,0.934432,0.927516,3096.0
olive-tree-products,0.935105,0.913174,0.924009,668.0


In [42]:
evaluation_metrics.get_summary(df_report, confusion, S)

Unnamed: 0,precision,recall,f1-score,support
macro avg,0.735391,0.694294,0.714252,177565.0
weighted avg,0.833269,0.836318,0.834791,177565.0


In [43]:
confusion.sum().sum()

177565.0

In [44]:
df_test.en_tags.apply(len).sum()

177565