# Abstract

We demonstrate text classification on an open dataset from https://world.openfoodfacts.org/. The goal is to assign category tags to food items based on product name, generic name and brand.
For classification we use fastText, Facebook's NLP library, which provides text classification based on word2vec embeddings.
In the first dataset a single tag is assigned to each data point, so evaluation is straightforward. However, some classes are more closely related than others, so treating exactly one prediction as correct and all others as equally wrong is too coarse and calls for a more sophisticated analysis.
We address this point with similarity concepts and multilabel classification.
A multilabel dataset is shown as a second example.
Finally, we present some applications of a standardized food catalogue for search and recommendations.


In [1]:
import pandas as pd
import numpy as np
from fasttext import *
%matplotlib inline
import evaluation_metrics

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import scipy.spatial.distance as sd
import sklearn.metrics.pairwise as pw
import re

# Load Data

The following classification is based on an open dataset from https://world.openfoodfacts.org/

In [2]:
df = pd.read_csv("data/foodcategories_single_label.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

In [3]:
label, freq = np.unique(np.concatenate(df.en_tags.values), return_counts = True)
sr_labels = pd.Series(index=label, data=freq).sort_values()


We want to classify tags based on product name, generic name and brand. In the first dataset we have a single tag assigned to each data point.

In [4]:
df.en_tags.str.len().value_counts()

1    329303
Name: en_tags, dtype: int64

# Extract Train and Test Data

In [5]:
#generate features
feature_cols = ['product_name', 'generic_name', 'brands']
print(df[df.en_tags.str.len()==0])
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)
y = df.en_tags

Empty DataFrame
Columns: [product_name, generic_name, brands, categories, categories_tags, origins, manufacturing_places, labels, emb_codes, countries, main_category, en_tags]
Index: []


In [6]:
# preprocess and split train and test data
X = X.str.lower().apply(lambda x: re.sub(r'[^\w\s]','',x))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
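
fastText's supervised mode reads one example per line, with every label prefixed by `__label__` and followed by the text. The following is a minimal sketch of the formatting done in the cell above; the helper name `to_fasttext_line` and the rows are made up for illustration and are not part of the dataset.

```python
def to_fasttext_line(tags, text):
    """Prefix every tag with __label__ and append the feature text."""
    labels = " ".join("__label__" + tag for tag in tags)
    return labels + " " + text


# Made-up rows for illustration only (not taken from the dataset).
rows = [
    (["breakfasts"], "crunchy muesli oat cereal acme"),
    (["snacks", "sweet-snacks"], "chocolate bar acme"),
]

lines = [to_fasttext_line(tags, text) for tags, text in rows]
for line in lines:
    print(line)
```

The same helper covers the single-label and the multilabel case, since a single tag is just a list of length one.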

In [7]:
dftrain_emb = y_train.apply(lambda x: " ".join(x)) + " " + X_train
dftrain_emb.to_csv(
    "train_emb.csv", index = False, sep=";")

In [8]:
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test


In [9]:
dftrain.to_csv(
    "train.csv", index = False, sep=";")
dftest.to_csv(
    "test.csv", index = False, sep=";")

# Train

For classification we use fastText, which can be executed either on the command line or through its Python bindings (see the commented-out `train_supervised` call below).

In [10]:
# train_supervised("/Users/evelyn.trautmann/projects/openfood/train.csv",
#                 autotuneValidationFile = "test.csv")

!/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext supervised \
    -input "train.csv" -output model_singlelabel \
    -autotune-validation "test.csv" -autotune-duration 1000

Progress: 100.0% Trials:   35 Best score:  0.783370 ETA:   0h 0m 0s
Training again with best arguments
Read 1M words
Number of words:  79212
Number of labels: 86
Progress: 100.0% words/sec/thread:  210917 lr:  0.000000 avg.loss:  0.203952 ETA:   0h 0m 0s


In [11]:
!/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext dump model_singlelabel.bin args

dim 58
ws 5
epoch 77
minCount 1
neg 5
wordNgrams 3
loss softmax
model sup
bucket 10000000
minn 0
maxn 0
lrUpdateRate 100
t 0.0001


In [12]:
!/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext test model_singlelabel.bin test.csv 1

N	108669
P@1	0.783
R@1	0.783


In [13]:
model = load_model("model_singlelabel.bin")




In [14]:
model_emb = train_unsupervised("/Users/evelyn.trautmann/projects/openfood/train_emb.csv")

In [15]:
# model = train_supervised(input="/Users/evelyn.trautmann/projects/openfood/train.csv", 
#                          autotuneValidationFile="test.csv")

# Test

In [16]:
model.test("test.csv", k=1)

(108669, 0.7831488280926483, 0.7831488280926483)
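
The three values returned by `model.test` match N, P@1 and R@1 from the command-line test above. As a rough sketch of what these numbers mean: precision at k divides the number of correct predictions by the number of predictions made, recall at k divides it by the number of true labels. With exactly one true label and k=1 both coincide, which is why the two values above are identical. The function name and the toy labels below are assumptions for illustration only.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """Micro-averaged precision and recall.

    Both arguments are lists of label lists, one entry per example.
    """
    hits = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)
    n_true = sum(len(t) for t in true_labels)
    return hits / n_predicted, hits / n_true


# Toy single-label example: one true label and one prediction per item.
y_true_toy = [["cakes"], ["beers"], ["meats"], ["cakes"]]
y_pred_toy = [["cakes"], ["wines"], ["meats"], ["pastries"]]

precision, recall = precision_recall_at_k(y_true_toy, y_pred_toy)
print(precision, recall)  # 0.5 0.5
```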

In [21]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

# Single Label Evaluation 

In [23]:
df_test["prediction"] = df_test.feature.apply(lambda x: model.predict(x))

df_test["label_predicted"] = df_test.prediction.str[0].str[0].str.replace("__label__","")
df_test["confidence"] = df_test.prediction.str[1].str[0]
df_test["truth"] = df_test.en_tags.str[0].str.replace("__label__","")


In [24]:
report = classification_report(df_test.truth, df_test.label_predicted)
print(report)


                                 precision    recall  f1-score   support

            alcoholic-beverages       0.83      0.81      0.82      2683
                     baby-foods       0.82      0.72      0.77       497
                   bee-products       0.82      0.74      0.78        19
                          beers       0.00      0.00      0.00         5
                      beverages       0.59      0.45      0.51        51
             biscuits-and-cakes       0.68      0.50      0.58       875
                     breakfasts       0.90      0.88      0.89      5242
                          cakes       0.54      0.42      0.47       714
                        candies       0.48      0.31      0.38       133
                  canned-fishes       0.76      0.74      0.75       907
                   canned-foods       0.42      0.15      0.22       115
       canned-plant-based-foods       0.63      0.46      0.53       138
              carbonated-drinks       0.65      0.

In [25]:
# assert that no row has an empty tag list
assert (df_test.en_tags.str.len() == 0).sum() == 0

# Similarity

Sometimes the model predicts a class other than the ground truth but still makes a reasonable assignment. To distinguish such reasonable assignments from genuinely wrong classifications, we introduce similarities between classes and count similar predictions towards the accurate ones.

Example
Consider the following common products:
spaghetti bolognese, linguine bolognese, chicken biryani
Given that the first two are very similar and the last is very different, a possible similarity matrix S could look like



|Labels | spaghetti bolognese | linguine bolognese |chicken biryani|
|---------|:------------------|:-------------------|:--------------|
|spaghetti bolognese | 1 | 0.9 | 0 |
| linguine bolognese | 0.9 | 1 | 0 |
| chicken biryani | 0 | 0 | 1 |


If we now want to include similarity in the calculation of precision and recall, the numerator contains not only

#[ Actual = spaghetti bolognese, Predicted = spaghetti bolognese ]

but

#[ Actual = spaghetti bolognese, Predicted = spaghetti bolognese ] * 1 + 
#[ Actual = spaghetti bolognese, Predicted = linguine bolognese ] * 0.9.

Hence the diagonal entries of the confusion matrix

$$ C_{ii} = \#[Actual=i,Predicted=i]$$

become

$$ \hat{C}_{ii} = \sum_j \#[Actual=i,Predicted=j] * S_{ij} $$

That means we count all similar predictions, weighted by their degree of similarity.
Technically, we take the diagonal entries of

$$\hat{C} = C * S$$

where * denotes the matrix product. The denominator remains the same: the row sum in the case of precision, the column sum in the case of recall. We do not weight the denominator by similarity, since only the exact number of Actuals (resp. Predicted) counts there.
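
To make the weighting concrete, here is a small sketch using the three dishes from the example. The similarity matrix is the one from the table; the confusion counts are made up, with rows taken as actual classes and columns as predicted classes.

```python
import numpy as np

labels = ["spaghetti bolognese", "linguine bolognese", "chicken biryani"]

# Similarity matrix S from the table above.
S_example = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Made-up confusion counts: C[i, j] = #[Actual = i, Predicted = j].
C_example = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
])

exact_hits = np.diag(C_example)                 # plain diagonal of C
weighted_hits = np.diag(C_example @ S_example)  # diagonal of the matrix product C * S

for name, exact, weighted in zip(labels, exact_hits, weighted_hits):
    print(f"{name}: exact={exact}, similarity-weighted={weighted:.1f}")
```

For spaghetti bolognese the two linguine predictions contribute 2 * 0.9 on top of the 8 exact hits, while the biryani row is unchanged because it has no similar class.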

In [28]:
# column names holding ground truth and prediction (single-label case)
truth = ["truth"]
predictions = ["label_predicted"]
# all classes that occur either as ground truth or as prediction
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))


In [29]:
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

In [30]:
Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))

In [31]:
mask = Dst > 0.5
Dst[mask]=np.nan
Dst.idxmax()  # label of the most distant class that is still within the 0.5 cutoff


breakfasts                                         breakfasts
wines-from-france                                       wines
canned-plant-based-foods             canned-plant-based-foods
poultries                                            chickens
sweeteners                                         sweeteners
legume-seeds                                     legume-seeds
meats                                                   meats
milk-substitute                               milk-substitute
cakes                                                   cakes
fermented-milk-products                       fermented-foods
dietary-supplements                       dietary-supplements
meals-with-fish                               meals-with-fish
pastries                                             pastries
salty-snacks                                     salty-snacks
smoked-fishes                                   smoked-fishes
hot-beverages                                            teas
sauces  

In [32]:
Dst[mask].count().value_counts()

0    87
dtype: int64

In [33]:
S = np.exp(-Dst**2).fillna(0)
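
The kernel `np.exp(-Dst**2)` maps distance 0 to similarity 1. Because distances above 0.5 were masked earlier, every non-zero off-diagonal entry of S is at least exp(-0.25) ≈ 0.78. A minimal sketch of the same construction on made-up 2-d vectors follows; the function name, the toy vectors and the use of NumPy broadcasting instead of the nested `apply` are assumptions for illustration.

```python
import numpy as np


def similarity_from_vectors(vectors, cutoff=0.5):
    """Pairwise Euclidean distances -> exp(-d^2), set to zero beyond the cutoff."""
    vectors = np.asarray(vectors, dtype=float)
    diff = vectors[:, None, :] - vectors[None, :, :]  # shape (n, n, dim)
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # shape (n, n)
    sim = np.exp(-dist ** 2)
    sim[dist > cutoff] = 0.0
    return sim


# Made-up 2-d vectors standing in for class embeddings.
toy_vectors = [[0.0, 0.0], [0.3, 0.0], [2.0, 0.0]]
S_toy = similarity_from_vectors(toy_vectors)
print(np.round(S_toy, 4))
```

The first two toy vectors are 0.3 apart and keep a similarity of exp(-0.09), while the third is further away than the cutoff and is treated as unrelated.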

In [34]:

truth = ["truth"]
predictions = ["label_predicted"]

%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)

df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna()

CPU times: user 7min 18s, sys: 9.11 s, total: 7min 27s
Wall time: 7min 25s


| class | precision | recall | f1score | support |
|---------|:----------|:----------|:----------|:--------|
| breakfasts | 0.878863 | 0.895607 | 0.887156 | 5144.0 |
| wines-from-france | 0.572707 | 0.640000 | 0.604486 | 400.0 |
| canned-plant-based-foods | 0.463768 | 0.627451 | 0.533333 | 102.0 |
| poultries | 0.897295 | 0.911843 | 0.904510 | 626.0 |
| sweeteners | 0.840304 | 0.902041 | 0.870079 | 735.0 |
| legume-seeds | 0.791463 | 0.817380 | 0.804213 | 794.0 |
| meats | 0.808917 | 0.835068 | 0.821785 | 1825.0 |
| milk-substitute | 0.125000 | 1.000000 | 0.222222 | 1.0 |
| cakes | 0.418768 | 0.537770 | 0.470866 | 556.0 |
| fermented-milk-products | 0.921645 | 0.914085 | 0.917850 | 4738.0 |


In [35]:
evaluation_metrics.get_summary(df_report, confusion, S)

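| average | precision | recall | f1-score | support |
|---------|:----------|:----------|:----------|:--------|
| micro avg | 0.806341 | 0.806341 | 0.806341 | 108670.0 |
| macro avg | 0.58026 | 0.687775 | 0.62946 | 108670.0 |
| weighted avg | 0.821831 | 0.806341 | 0.814012 | 108670.0 |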


# Multilabel Evaluation 

In [78]:
#load data
df = pd.read_csv("data/foodcategories_3labels.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

In [73]:
y = df.en_tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test


In [74]:
dftrain.to_csv(
    "train_3label.csv", index = False, sep=";")
dftest.to_csv(
    "test3label.csv", index = False, sep=";")

In [39]:
# train_supervised("/Users/evelyn.trautmann/projects/openfood/train.csv",
#                 autotuneValidationFile = "test.csv")

!/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext supervised \
    -input "train_3label.csv" -output model_3label \
    -autotune-validation "test3label.csv" -autotune-duration 1000

Progress: 100.0% Trials:   19 Best score:  0.473753 ETA:   0h 0m 0s
Training again with best arguments
Read 2M words
Number of words:  79212
Number of labels: 137
Progress: 100.0% words/sec/thread:  150387 lr:  0.000000 avg.loss:  1.298227 ETA:   0h 0m 0s


In [49]:
model = load_model("model_3label.bin")




Let $\{T_1^{(1)}, \dots, T_M^{(1)}\}, \dots, \{T_1^{(N)}, \dots, T_M^{(N)}\}$ be our ground truth and
$\{P_1^{(1)}, \dots, P_M^{(1)}\}, \dots, \{P_1^{(N)}, \dots, P_M^{(N)}\}$ the predictions, for $N$ data points with $M$ labels each.

On the diagonal we count the events where the truth matches the prediction:

$\sum_{i=1}^N |\{T_1^{(i)}, \dots, T_M^{(i)}\}\cap \{P_1^{(i)}, \dots, P_M^{(i)}\}|$

So for each diagonal entry we count, over all data points ($N$) and all labels ($M$), how often class $k$ occurs both in the truth and in the prediction:

$C_{kk} =\sum_{i=1}^N \sum_{m=1}^M \sum_{m'=1}^M \mathbb{1}\{T_m^{(i)}=k,P_{m'}^{(i)}=k\} $

In contrast, on the off-diagonal we count all events where the truth was not covered by the prediction.
For the off-diagonal we take only the difference sets into account. If the truth for data point $i$ was, for example, {fish, curry, rice-dish} and the prediction was {fish, curry, soup}, we only count
$\tilde{T}^{(i)}$ = {rice-dish} and $\tilde{P}^{(i)}$ = {soup} for the off-diagonal. Formally, we apply the following transformation:

$\{\tilde{T}_{m_1}^{(i)}\}_{m_1=1,\dots,M_1} = \{T_m^{(i)}\}_{m=1,\dots,M}\setminus \{P_m^{(i)}\}_{m=1,\dots,M}$

$\{\tilde{P}_{m_2}^{(i)}\}_{m_2=1,\dots,M_2} = \{P_m^{(i)}\}_{m=1,\dots,M}\setminus \{T_m^{(i)}\}_{m=1,\dots,M}$

Here $M_1$ and $M_2$ are the sizes of the two difference sets for data point $i$. Finally, we count all events of the ground truth that were not captured by the predictions, which for $k \neq l$ results in

$C_{kl} =\sum_{i=1}^N \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2}
\frac{1}{M_1}\mathbb{1}\{\tilde{T}_{m_1}^{(i)}=k,\tilde{P}_{m_2}^{(i)}=l\} $

Once we have computed the multi-class, multi-label confusion matrix, the classification report is straightforward.

Precision for class k is computed as

$Pr_k=\frac{C_{kk}}{\sum_j C_{kj}}$

and Recall

$R_k=\frac{C_{kk}}{\sum_j C_{jk}}$
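
The definitions above can be sketched in a few lines. This is only an illustration of the formulas, not the implementation in the `evaluation_metrics` module (which is not shown here); the function name `multilabel_confusion` and the toy data are assumptions.

```python
import numpy as np


def multilabel_confusion(truths, preds, classes):
    """Confusion matrix for label sets: matches on the diagonal,
    difference sets on the off-diagonal, weighted by 1 / M1."""
    index = {c: i for i, c in enumerate(classes)}
    C = np.zeros((len(classes), len(classes)))
    for t, p in zip(truths, preds):
        t, p = set(t), set(p)
        for k in t & p:                 # truth covered by prediction
            C[index[k], index[k]] += 1
        missed, wrong = t - p, p - t    # the two difference sets
        for k in missed:
            for l in wrong:
                C[index[k], index[l]] += 1 / len(missed)
    return C


toy_classes = ["fish", "curry", "rice-dish", "soup"]
C_toy = multilabel_confusion(
    truths=[{"fish", "curry", "rice-dish"}],
    preds=[{"fish", "curry", "soup"}],
    classes=toy_classes,
)
print(C_toy)
```

For the example from the text, fish and curry each add one count on the diagonal, and the only off-diagonal entry is (rice-dish, soup).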



In [79]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

3

In [80]:
df_test["prediction"] = df_test.feature.apply(lambda x: model.predict(x, k=K))

for k in range(K):
    df_test["label_predicted%i" %k] = df_test.prediction.str[0].str[k].str.replace("__label__","")
    df_test["confidence%i" %k] = df_test.prediction.str[1].str[k]
    df_test["truth%i" %k] = df_test.en_tags.str[k].str.replace("__label__","")


In [101]:
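# column names holding the K ground-truth labels and the K predictions
truth = list()
predictions = list()
for k in range(int(K)):
    truth.append("truth%i" %k)
    predictions.append("label_predicted%i" %k)

# all classes that occur either as ground truth or as prediction
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))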

In [102]:
'fruit-juices-and-nectars' in classes
#'fruit-juices-and-nectars' in np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)

True

In [None]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)

In [89]:
df_report = evaluation_metrics.get_report(confusion)
df_report.dropna()

| class | precision | recall | f1score | support |
|---------|:----------|:----------|:----------|:--------|
| plant-based-foods | 0.367657 | 0.524608 | 0.432328 | 13898.0 |
| plant-based-foods-and-beverages | 0.345391 | 0.179624 | 0.236338 | 8824.0 |
| snacks | 0.404350 | 0.643587 | 0.496661 | 7337.0 |
| beverages | 0.764706 | 0.005796 | 0.011504 | 6729.0 |
| dairies | 0.516105 | 0.139411 | 0.219524 | 5057.0 |
| sweet-snacks | 0.281621 | 0.102224 | 0.150000 | 2788.0 |
| fermented-milk-products | 0.396876 | 0.487173 | 0.437413 | 3859.0 |
| meats | 0.811040 | 0.325939 | 0.465003 | 4688.0 |
| meals | 0.634295 | 0.458142 | 0.532016 | 5232.0 |
| prepared-meats | 0.421537 | 0.567039 | 0.483580 | 2506.0 |


In [90]:
evaluation_metrics.get_summary(df_report, confusion)

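| average | precision | recall | f1-score | support |
|---------|:----------|:----------|:----------|:--------|
| micro avg | 0.358425 | 0.358425 | 0.358425 | 108670.0 |
| macro avg | 0.188461 | 0.287717 | 0.227745 | 108670.0 |
| weighted avg | 0.384383 | 0.358425 | 0.37095 | 108670.0 |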


In [91]:
df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna()

| class | precision | recall | f1score | support |
|---------|:----------|:----------|:----------|:--------|


In [63]:
evaluation_metrics.get_summary(df_report, confusion, S)


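| average | precision | recall | f1-score | support |
|---------|:----------|:----------|:----------|:--------|
| micro avg | 0 | 0 | | 108670.0 |
| macro avg | 0 | 0 | | 108670.0 |
| weighted avg | 0 | 0 | | 108670.0 |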


In [64]:
S

| class | breakfasts | wines-from-france | canned-plant-based-foods | poultries | sweeteners | legume-seeds | meats | milk-substitute | cakes | fermented-milk-products | ... | meals | alcoholic-beverages | unsweetened-beverages | farming-products | fermented-foods | olives | fruits-based-foods | fats | legumes-and-their-products | wines |
|---------|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| breakfasts | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| wines-from-france | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.807667 |
| canned-plant-based-foods | 0.0 | 0.000000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| poultries | 0.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| sweeteners | 0.0 | 0.000000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| legume-seeds | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| meats | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| milk-substitute | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| cakes | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| fermented-milk-products | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.793113 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
