# Abstract

We demonstrate a text classification approach based on an open dataset from https://world.openfoodfacts.org/. The goal of our talk is to categorize food items with various tags, where a food item is described by a product name, a generic name, and a brand. For classification, we use Facebook's NLP library fastText, which provides a text classification model based on word embeddings as well as character n-gram embeddings. In a first experiment, we keep only a single tag per data point and remove any additional ones. In this case, the evaluation is straightforward. However, since some classes are more closely related than others, we do not want to score predictions as simply right or wrong, as one would typically do. To this end, we implement a similarity concept and a multilabel classification approach. Additionally, we present some applications of a standardized food catalog, for instance search and recommendations.

In [1]:
import pandas as pd
import numpy as np
from fasttext import *
%matplotlib inline
import evaluation_metrics

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import scipy.spatial.distance as sd
import sklearn.metrics.pairwise as pw
import re

In [2]:
FASTTEXT_HOME="/Users/evelyn.trautmann/repos/fasttext3"

# Load Data

The following classification is based on an open dataset from https://world.openfoodfacts.org/

In [3]:
df = pd.read_csv("data/foodcategories_single_label.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

df.count()

product_name            324820
generic_name             83152
brands                  269465
categories              329303
categories_tags         329303
origins                  47351
manufacturing_places     73767
labels                  118779
emb_codes                52104
countries               329099
main_category           329297
en_tags                 329303
dtype: int64

We want to predict the tags from the product name, generic name, and brand. In the first dataset, a single tag is assigned to each data point.

In [4]:
df.en_tags.str.len().value_counts(), "%i labels "%len(set(np.concatenate(df.en_tags.values)))

(1    329303
 Name: en_tags, dtype: int64, '87 labels ')

# Extract Train and Test Data

In [5]:
#generate features
feature_cols = ['product_name', 'generic_name', 'brands']
assert(df[df.en_tags.str.len()==0].shape[0]==0)
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)
y = df.en_tags

In [6]:
# preprocess and split train and test data
X = X.str.lower().apply(lambda x: re.sub(r'[^\w\s]','',x))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train

In [7]:
dftrain_emb = y_train.apply(lambda x: " ".join(x)) + " " + X_train
dftrain_emb.to_csv(
    "train_emb.csv", index = False, sep=";")

In [8]:
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test


In [9]:
dftrain.to_csv(
    "train.csv", index = False, sep=";")
dftest.to_csv(
    "test.csv", index = False, sep=";")
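
For reference, each line written above follows fastText's supervised input format: one or more `__label__<tag>` tokens followed by the text. The following is a minimal, dependency-free sketch of the same preprocessing and formatting applied to a single made-up product. The helper name `to_fasttext_line` and the example product are illustrative assumptions, not part of the notebook.

```python
import re

def to_fasttext_line(tags, *fields):
    """Format one product as a fastText supervised training line.

    `tags` is a list of category tags; `fields` are the raw text columns
    (product name, generic name, brand), where a missing value is None.
    """
    # Join the text columns, treating missing values as empty strings.
    text = " ".join(f or "" for f in fields)
    # Lowercase, then drop everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    labels = " ".join("__label__" + t for t in tags)
    return labels + " " + text

# Hypothetical product with a missing generic name.
line = to_fasttext_line(["teas"], "Earl Grey, Bio!", None, "Tea-Time")
print(line)
```

Note that removing punctuation merges hyphenated words, so "Tea-Time" becomes the single token "teatime".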

# Train

For classification we use fastText, which can be run either from the command line or through its Python bindings.

In [10]:
model_singlelabel = train_supervised("train.csv",
                lr = 0.01,
                autotuneValidationFile = "test.csv",
                autotuneDuration = 1000)

model_singlelabel.save_model("model_singlelabel.bin")
# !/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext supervised \
#     -input "train.csv" -output model_singlelabel \
#     -autotune-validation "test.csv" -autotune-duration 1000

In [11]:
!$FASTTEXT_HOME/fastText/fasttext dump model_singlelabel.bin args

dim 40
ws 5
epoch 100
minCount 1
neg 5
wordNgrams 3
loss softmax
model sup
bucket 10000000
minn 0
maxn 0
lrUpdateRate 100
t 0.0001


In [12]:
!$FASTTEXT_HOME/fastText/fasttext test model_singlelabel.bin test.csv 1

N	108669
P@1	0.782
R@1	0.782


In [13]:
model = load_model("model_singlelabel.bin")




In [14]:
model_emb = train_unsupervised("/Users/evelyn.trautmann/projects/openfood/train_emb.csv")

In [15]:
# model = train_supervised(input="/Users/evelyn.trautmann/projects/openfood/train.csv", 
#                          autotuneValidationFile="test.csv")

# Test

In [16]:
model_singlelabel.test("test.csv", k=1)

(108669, 0.7820905686074225, 0.7820905686074225)

In [17]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

1

# Single Label Evaluation 

In [18]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_singlelabel.predict(x))

df_test["label_predicted"] = df_test.prediction.str[0].str[0].str.replace("__label__","")
df_test["confidence"] = df_test.prediction.str[1].str[0]
df_test["truth"] = df_test.en_tags.str[0].str.replace("__label__","")


In [19]:
report = classification_report(df_test.truth, df_test.label_predicted)
print(report)



                                 precision    recall  f1-score   support

            alcoholic-beverages       0.81      0.82      0.81      2683
                     baby-foods       0.80      0.72      0.76       497
                   bee-products       1.00      0.26      0.42        19
                          beers       0.00      0.00      0.00         5
                      beverages       0.50      0.20      0.28        51
             biscuits-and-cakes       0.64      0.52      0.58       875
                     breakfasts       0.89      0.88      0.88      5242
                          cakes       0.53      0.42      0.47       714
                        candies       0.74      0.26      0.38       133
                  canned-fishes       0.76      0.75      0.76       907
                   canned-foods       0.38      0.11      0.17       115
       canned-plant-based-foods       0.54      0.42      0.47       138
              carbonated-drinks       0.67      0.

In [20]:
# assert that there are no empty tags
assert(len(df_test.en_tags[df_test.en_tags.str.len()==0])==0)

# Similarity

Sometimes the model predicts a class other than the ground truth but still makes a reasonable class assignment. To distinguish these reasonable assignments from genuinely wrong classifications, we introduce similarities between classes and count similar predictions toward the accurate ones.

Example
Consider the following common products:
chicken, poultries, seafood
Given that the first two are quite similar and the last one is very different, a possible similarity matrix S could look like



| Labels | chicken | poultries | seafood |
|---------|:------------------|:-------------------|:--------------|
| chicken | 1 | 0.9 | 0 |
| poultries | 0.9 | 1 | 0 |
| seafood | 0 | 0 | 1 |


If we now want to include similarity in the calculation of precision and recall, the numerator contains not only

#[ Truth = chicken, Predicted = chicken ]

but

#[ Truth = chicken, Predicted = chicken ] * 1 + #[ Truth = chicken, Predicted = poultries ] * 0.9.

Hence the diagonal entries of the confusion matrix

$$ C_{ii} = \#[Truth=i,Predicted=i]$$

become 

$$ C_{ii} = \sum_j \#[Truth=i,Predicted=j]* S_{ij} $$

That means we count in all similar predictions, weighted by their degree of similarity.
Technically, we take the diagonal entries of

$$\hat{C} = C * S$$


where * denotes the matrix product. The denominators remain the same: the row sum in the case of precision, the column sum in the case of recall. We do not weight the denominators by similarity, since only the exact number of true (resp. predicted) instances counts there.
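
To make the weighting concrete, here is a small dependency-free sketch that applies the similarity matrix from the example table to a confusion matrix. The counts in `C` are invented for illustration, and the helper `matmul` is a hypothetical stand-in for a NumPy matrix product; this is not the code in `evaluation_metrics`.

```python
labels = ["chicken", "poultries", "seafood"]

# C[i][j] = number of items with truth labels[i] and prediction labels[j].
# These counts are made up for illustration only.
C = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]

# Similarity matrix from the example table.
S = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def matmul(A, B):
    """Plain matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

C_hat = matmul(C, S)

# Similarity-weighted diagonal: exact hits plus similar hits weighted by S.
weighted_diag = [C_hat[i][i] for i in range(len(labels))]

# The denominators come from the unweighted matrix C.
row_sums = [sum(row) for row in C]
col_sums = [sum(C[i][j] for i in range(len(labels))) for j in range(len(labels))]

for i, name in enumerate(labels):
    print(name, weighted_diag[i],
          weighted_diag[i] / row_sums[i],
          weighted_diag[i] / col_sums[i])
```

For chicken, the weighted diagonal is 8 * 1 + 2 * 0.9 = 9.8 instead of 8, because the two items predicted as poultries now count almost fully. Seafood is unaffected, since it is not similar to any other class.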

In [21]:

truth = ["truth"]
predictions = ["label_predicted"]
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))

%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)


compute diffecence sets
compute intersections
compute onehot vectors
CPU times: user 7min 14s, sys: 8.94 s, total: 7min 23s
Wall time: 7min 23s


In [22]:
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))

mask = Dst > 0.7
Dst[mask]=np.nan
Dst.apply(np.argmax)[Dst.max()>0.0]



meals                                      frozen-ready-made-meals
frozen-plant-based-foods                    vegetables-based-foods
chocolates                                         milk-chocolates
seafood                                              canned-fishes
plant-based-foods-and-beverages                  plant-based-foods
teas                                                 hot-beverages
hot-beverages                                                 teas
olives                                                     pickles
fruits-based-foods                          vegetables-based-foods
fresh-plant-based-foods                     vegetables-based-foods
pickles                                                     olives
snacks                                                sweet-snacks
non-alcoholic-beverages                      plant-based-beverages
meats                                               prepared-meats
sauces                                                   groce

In [23]:
S = np.exp(-Dst**2).fillna(0)
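
The cells above turn embedding distances into similarities: pairs farther apart than a cutoff are dropped, and every remaining distance d is mapped to exp(-d²). The following dependency-free sketch shows the same mapping on made-up two-dimensional vectors. The vectors and the helper name `similarity` are illustrative assumptions; the notebook itself uses fastText sentence vectors and `scipy.spatial.distance.euclidean`.

```python
import math

# Hypothetical 2-d "embeddings"; the real class vectors come from model_emb.
vectors = {
    "teas": (0.10, 0.20),
    "hot-beverages": (0.30, 0.50),
    "seafood": (2.00, -1.00),
}

def similarity(u, v, cutoff=0.7):
    """Gaussian similarity exp(-d^2) of the Euclidean distance d, or 0 beyond the cutoff."""
    d = math.dist(u, v)
    return math.exp(-d ** 2) if d <= cutoff else 0.0

names = list(vectors)
S = {a: {b: similarity(vectors[a], vectors[b]) for b in names} for a in names}

for a in names:
    print(a, [round(S[a][b], 3) for b in names])
```

Each class has similarity 1 with itself (distance 0), nearby classes get a value between 0 and 1, and classes beyond the cutoff get exactly 0, so they never contribute to the weighted diagonal.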

In [24]:

df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

Unnamed: 0,precision,recall,f1score,support
fish-and-meat-and-eggs,0.943307,0.956869,0.95004,626.0
farming-products,0.933202,0.946215,0.939664,502.0
olives,0.945248,0.922032,0.933496,588.0
sauces,0.935191,0.921683,0.928388,5064.0
fermented-milk-products,0.927594,0.923374,0.925479,4708.0
chickens,0.936173,0.904721,0.920178,1288.0
hams,0.932464,0.902675,0.917327,1346.0
poultries,0.900126,0.907548,0.903822,631.0
prepared-meats,0.902423,0.892035,0.897199,3414.0
olive-tree-products,0.916589,0.857895,0.886271,1140.0


In [25]:
evaluation_metrics.get_summary(df_report, confusion, S)

Unnamed: 0,precision,recall,f1-score,support
micro avg,0.814828,0.814828,0.814828,108670.0
macro avg,0.677636,0.762359,0.717505,108670.0
weighted avg,0.82895,0.814828,0.821828,108670.0


# Multilabel Evaluation 

In [26]:
#load data
df = pd.read_csv("data/foodcategories_3labels.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

In [27]:
y = df.en_tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test


In [28]:
dftrain.to_csv(
    "train_3label.csv", index = False, sep=";")
dftest.to_csv(
    "test3label.csv", index = False, sep=";")

In [43]:
model_3label = train_supervised("train_3label.csv",
                lr = 0.01,
                autotuneValidationFile = "test3label.csv",
                autotuneDuration = 1000)

# !/Users/evelyn.trautmann/repos/fasttext3/fastText/fasttext supervised \
#     -input "train_3label.csv" -output model_3label \
#     -autotune-validation "test3label.csv" -autotune-duration 1000

In [31]:
model_3label.save_model("model_3label.bin")

In [46]:
model_3label.test("test3label.csv", 3)

(108670, 0.7585074077482286, 0.843550758503529)
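
The two values after the sample count are precision and recall at k=3. They can differ because the model always returns three labels, whereas a product may carry fewer than three tags. Below is a minimal sketch, assuming the usual micro-averaged definitions (correct predictions divided by all predictions, and correct predictions divided by all true labels). The function name and the toy data are illustrative and not taken from fastText.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k):
    """Micro-averaged precision@k and recall@k over a list of examples."""
    correct = predicted = gold = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        correct += len(set(truth) & set(top_k))  # labels that were hit
        predicted += len(top_k)                  # labels the model returned
        gold += len(set(truth))                  # labels that should be found
    return correct / predicted, correct / gold

# Toy data: the first product has only two true tags.
truth = [["fishes", "seafood"], ["teas", "hot-beverages", "beverages"]]
preds = [["seafood", "fishes", "meats"], ["teas", "beverages", "snacks"]]

p, r = precision_recall_at_k(truth, preds, k=3)
print(p, r)
```

In the toy data, 4 of 6 predictions are correct while 4 of 5 true labels are found, so precision is lower than recall, which is the same pattern as in the result above.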

In [47]:
model = load_model("model_3label.bin")




Let \{$T_1^{(1)}, ...,T_M^{(1)}$\},...,\{$T_1^{(N)}, ...,T_M^{(N)}$\} be our ground truth and
\{$P_1^{(1)}, ...,P_M^{(1)}$\},...,\{$P_1^{(N)}, ...,P_M^{(N)}$\} the predictions.

On the diagonal we count the events where the truth matches the prediction:

$\sum_{i=1}^N |\{T_1^{(i)}, ...,T_M^{(i)}\}\cap \{P_1^{(i)}, ...,P_M^{(i)}\}|$

So for each diagonal entry we count, over all data points ($N$) and all label slots ($M$), how often class $k$ appears in both the truth and the prediction, regardless of its position:

$C_{kk} =\sum_{i=1}^N \sum_{m=1}^M \sum_{m'=1}^M \mathbb{1}\{T_m^{(i)}=k,\,P_{m'}^{(i)}=k\} $

In contrast, on the off-diagonal we count all events where the truth was not covered by the prediction.
For the off-diagonal we take only the difference sets into account. If the truth for data point $i$ was, for example, {fish, curry, rice-dish} and the prediction was {fish, curry, soup}, we only count
$\tilde{T}^{(i)}$ = {rice-dish} and $\tilde{P}^{(i)}$ = {soup} for the off-diagonal. Formally, we apply the following transformation:

$\{\tilde{T}_{m_1}^{(i)}\}_{m_1=1,..M_1} = \{T_m^{(i)}\}_{m=1,..M}\setminus \{P_m^{(i)}\}_{m=1,..M}$ 

$\{\tilde{P}_{m_2}^{(i)}\}_{m_2=1,..M_2} = \{P_m^{(i)}\}_{m=1,..M}\setminus \{T_m^{(i)}\}_{m=1,..M}$

Here $M_1$ and $M_2$ are the sizes of the two difference sets for data point $i$. Finally, we count all events of the ground truth that were not captured by the predictions, which results in

$C_{kl} =\sum_{i=1}^N \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2}
\frac{1}{M_1}\mathbb{1}\{\tilde{T}_{m_1}^{(i)}=k,\,\tilde{P}_{m_2}^{(i)}=l\} $
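
The counting rules above can be sketched in a few lines of plain Python. This is only an illustration of the formulas under the assumption that truths and predictions are given as sets of labels; it is not the implementation of `evaluation_metrics.get_confusion`, which is not shown here, and the function name `multilabel_confusion` is hypothetical.

```python
from collections import defaultdict

def multilabel_confusion(truths, predictions):
    """Confusion counts C[(k, l)] for set-valued truths and predictions.

    Diagonal: labels present in both sets. Off-diagonal: every missed truth k
    paired with every wrong prediction l, weighted by 1 / (number of missed truths).
    """
    C = defaultdict(float)
    for truth, pred in zip(truths, predictions):
        truth, pred = set(truth), set(pred)
        for k in truth & pred:
            C[(k, k)] += 1.0
        missed = truth - pred   # truth labels not covered by the prediction
        wrong = pred - truth    # predicted labels not in the truth
        for k in missed:
            for other in wrong:
                C[(k, other)] += 1.0 / len(missed)
    return dict(C)

# The example from the text: one label missed, one label wrongly predicted.
truths = [{"fish", "curry", "rice-dish"}]
predictions = [{"fish", "curry", "soup"}]
C = multilabel_confusion(truths, predictions)
print(C)
```

For the example from the text, fish and curry each add one count on the diagonal, and the only off-diagonal entry is (rice-dish, soup) with weight 1.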

Once we have computed the multi-class, multi-label confusion matrix, the classification report is straightforward.

Precision for class k is computed as

$Pr_k=\frac{C_{kk}}{\sum_j C_{kj}}$

and Recall

$R_k=\frac{C_{kk}}{\sum_j C_{jk}}$



In [48]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

3

In [49]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_3label.predict(x, k=K))

for k in range(K):
    df_test["label_predicted%i" %k] = df_test.prediction.str[0].str[k].str.replace("__label__","")
    df_test["confidence%i" %k] = df_test.prediction.str[1].str[k]
    df_test["truth%i" %k] = df_test.en_tags.str[k].str.replace("__label__","")


In [50]:
truth = list()
predictions = list()
for k in range(int(K)):
    truth.append("truth%i" %k)
    predictions.append("label_predicted%i" %k)
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))


In [51]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)

compute diffecence sets
compute intersections
compute onehot vectors
CPU times: user 7min 52s, sys: 9.92 s, total: 8min 2s
Wall time: 13min 35s


In [52]:
df_report = evaluation_metrics.get_report(confusion)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

Unnamed: 0,precision,recall,f1score,support
cooked-pressed-cheeses,0.967187,0.977883,0.972506,633.0
mustards,0.940711,0.982569,0.961185,726.666667
prepared-meats,0.95707,0.953065,0.955063,5894.666667
honeys,0.966783,0.937818,0.95208,1179.333333
fish-and-meat-and-eggs,0.959119,0.940874,0.949909,648.333333
hams,0.949772,0.942005,0.945873,1324.833333
fermented-milk-products,0.945569,0.939884,0.942718,9149.0
meats,0.960818,0.923369,0.941721,9215.166667
poultries,0.950667,0.932568,0.94153,2293.666667
seafood,0.937128,0.931326,0.934218,3553.0


In [53]:
evaluation_metrics.get_summary(df_report, confusion)

Unnamed: 0,precision,recall,f1-score,support
micro avg,0.843548,0.843548,0.843548,293144.0
macro avg,0.733785,0.800435,0.765662,293144.0
weighted avg,0.852484,0.843548,0.847992,293144.0


# Combining Multilabel Classification with Near Accuracy 

In [54]:
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))

mask = Dst > 0.5
Dst[mask]=np.nan
# show entries with highest distance <= 0.5 for each class
Dst.apply(np.argmax)[Dst.max()>0]

frozen-desserts                                          ice-creams
chocolates                                          milk-chocolates
seafood                                                      fishes
dried-products                      dried-products-to-be-rehydrated
plant-based-foods-and-beverages                   plant-based-foods
teas                                                  hot-beverages
hot-beverages                                                  teas
chicken-breasts                                            chickens
appetizers                                             salty-snacks
artificially-sweetened-beverages                sweetened-beverages
fishes                                                      seafood
non-alcoholic-beverages                       unsweetened-beverages
sauces                                                    groceries
groceries                                                    sauces
dried-plant-based-foods                    canne

In [55]:
S = np.exp(-Dst**2).fillna(0)
df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

Unnamed: 0,precision,recall,f1score,support
cooked-pressed-cheeses,0.967187,0.977883,0.972506,633.0
mustards,0.940711,0.982569,0.961185,726.666667
prepared-meats,0.95707,0.953065,0.955063,5894.666667
honeys,0.966783,0.937818,0.95208,1179.333333
fish-and-meat-and-eggs,0.959119,0.940874,0.949909,648.333333
hams,0.949772,0.942005,0.945873,1324.833333
poultries,0.950846,0.935507,0.943114,2293.666667
fermented-milk-products,0.945946,0.940057,0.942992,9149.0
meats,0.960818,0.923369,0.941721,9215.166667
seafood,0.937684,0.931326,0.934494,3553.0


In [56]:
evaluation_metrics.get_summary(df_report, confusion, S)

Unnamed: 0,precision,recall,f1-score,support
micro avg,0.847782,0.847782,0.847782,293144.0
macro avg,0.745112,0.809841,0.776129,293144.0
weighted avg,0.855626,0.847782,0.851686,293144.0
