# Abstract

We demonstrate a text classification approach based on an open dataset from https://world.openfoodfacts.org/. The goal of this talk is to categorize food items with various tags, where a food item is described by its product name, its generic name, and its brand. For classification we use fastText, Facebook's NLP library, which provides a text classification model based on word embeddings as well as character n-gram embeddings. In a first experiment, we keep only a single tag per data point and remove any additional ones. In this case, the evaluation is straightforward. However, since some classes are more closely related than others, we do not want to score predictions as simply right or wrong, as one would typically do. To this end, we implement a similarity concept and a multilabel classification approach. Additionally, we present some applications of a standardized food catalog, for instance search and recommendations.

In [1]:
import pandas as pd
import numpy as np
from fasttext import *
%matplotlib inline
import evaluation_metrics

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import scipy.spatial.distance as sd
import sklearn.metrics.pairwise as pw
import re

In [2]:
FASTTEXT_HOME="/Users/evelyn.trautmann/repos/fasttext3"

# Load Data

The following classification is based on an open dataset from https://world.openfoodfacts.org/

In [3]:
df = pd.read_csv("data/foodcategories_single_label.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

df.count()

product_name            324820
generic_name             83152
brands                  269465
categories              329303
categories_tags         329303
origins                  47351
manufacturing_places     73767
labels                  118779
emb_codes                52104
countries               329099
main_category           329297
en_tags                 329303
dtype: int64

We want to classify tags based on product name, generic name and brand. In the first dataset we have a single tag assigned to each data point.

In [4]:
df.en_tags.str.len().value_counts(), "%i labels "%len(set(np.concatenate(df.en_tags.values)))

(1    329303
 Name: en_tags, dtype: int64, '87 labels ')

# Extract Train and Test Data

In [5]:
#generate features
feature_cols = ['product_name', 'generic_name', 'brands']
assert(df[df.en_tags.str.len()==0].shape[0]==0)
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)
y = df.en_tags

In [6]:
# preprocess and split into train, validation and test data
X = X.str.lower().apply(lambda x: re.sub(r'[^\w\s]','',x))

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25, random_state=13)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
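
The cells above turn each row into one line of fastText training input: the tags prefixed with `__label__`, followed by the lowercased, punctuation-free text. The sketch below shows the same transformation for a single made-up product. `to_fasttext_line` and the example values are hypothetical and only illustrate the format.

```python
import re

def to_fasttext_line(labels, product_name, generic_name, brands):
    """Build one fastText supervised-training line (hypothetical helper)."""
    # Missing fields are treated as empty strings, mirroring fillna("").
    text = " ".join(field or "" for field in (product_name, generic_name, brands))
    # Lowercase and drop everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Every tag gets the __label__ prefix that fastText expects.
    prefix = " ".join("__label__" + label for label in labels)
    return prefix + " " + text

line = to_fasttext_line(["cakes"], "Marble Cake", None, "Baker's Best")
print(line)
```

Note that an empty field leaves a double space in the text; fastText splits on whitespace, so this does not change the tokens.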

In [7]:
dftrain_emb = y_train.apply(lambda x: " ".join(x)) + " " + X_train


dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test
dfvalid = y_valid.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_valid
dftest.shape, dfvalid.shape

((65861,), (65861,))
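
The two calls to `train_test_split` above produce a 60/20/20 split: 20% of the data is held out for validation, and 25% of the remaining 80% (that is, 20% of the total) for testing. The following sketch reproduces the resulting sizes with plain arithmetic. It assumes that scikit-learn rounds a fractional `test_size` up to the next integer, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, valid_fraction=0.2, test_fraction=0.25):
    """Return (train, valid, test) sizes for the two-stage split (illustrative)."""
    # First split: hold out the validation set (assumed to be rounded up).
    n_valid = math.ceil(valid_fraction * n_samples)
    n_rest = n_samples - n_valid
    # Second split: hold out the test set from what is left.
    n_test = math.ceil(test_fraction * n_rest)
    n_train = n_rest - n_test
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(329303)
print(n_train, n_valid, n_test)
```

For the 329303 rows of the single-label dataset this gives 65861 validation rows and 65861 test rows, matching the output above.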

In [8]:
dftrain.to_csv(
    "train.csv", index = False, sep=";")
dftest.to_csv(
    "test.csv", index = False, sep=";")
dfvalid.to_csv(
    "valid.csv", index = False, sep=";")
dftrain_emb.to_csv(
    "train_emb.csv", index = False, sep=";")

# Train

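For classification we use fastText, which can be run either from the command line or through its Python bindings. The cell below uses the Python API; the equivalent command-line call is included as a comment.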

In [10]:
model_singlelabel = train_supervised("train.csv",
                lr = 0.01,
                autotuneValidationFile = "valid.csv",
                autotuneDuration = 1000)

model_singlelabel.save_model("model_singlelabel.bin")
# !$FASTTEXT_HOME/fastText/fasttext supervised \
#     -input "train.csv" -output model_singlelabel \
#     -autotune-validation "valid.csv" -autotune-duration 1000

In [9]:
model_singlelabel = load_model("model_singlelabel.bin")




In [10]:
!$FASTTEXT_HOME/fastText/fasttext dump model_singlelabel.bin args

dim 43
ws 5
epoch 100
minCount 1
neg 5
wordNgrams 3
loss softmax
model sup
bucket 10000000
minn 0
maxn 0
lrUpdateRate 100
t 0.0001


In [11]:
!$FASTTEXT_HOME/fastText/fasttext test model_singlelabel.bin test.csv 1

N	65860
P@1	0.777
R@1	0.777


In [12]:
model_emb = train_unsupervised("/Users/evelyn.trautmann/projects/openfood/train_emb.csv")

# Test

In [13]:
model_singlelabel.test("test.csv", k=1)

(65860, 0.7772547828727604, 0.7772547828727604)

In [14]:
model_singlelabel.test("valid.csv", k=1)

(65861, 0.7778199541458526, 0.7778199541458526)

In [15]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

1

# Single Label Evaluation 

In [16]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_singlelabel.predict(x))

df_test["label_predicted"] = df_test.prediction.str[0].str[0].str.replace("__label__","")
df_test["confidence"] = df_test.prediction.str[1].str[0]
df_test["truth"] = df_test.en_tags.str[0].str.replace("__label__","")
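
The column extraction above relies on the shape of a fastText prediction: a pair of labels and matching probabilities. The sketch below unpacks one such pair by hand. The prediction value is made up, and `top_label` is a hypothetical helper that mirrors what the pandas expressions do per row.

```python
# Hypothetical prediction in the (labels, probabilities) shape used above for k=1.
prediction = (("__label__cakes",), [0.83])

def top_label(prediction, prefix="__label__"):
    """Return the first predicted label without its prefix, plus its confidence."""
    labels, probabilities = prediction
    label = labels[0]
    # Strip the fastText label prefix so the value can be compared with the truth.
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(probabilities[0])

label, confidence = top_label(prediction)
print(label, confidence)
```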


In [17]:
report = classification_report(df_test.truth, df_test.label_predicted)
print(report)


                                 precision    recall  f1-score   support

            alcoholic-beverages       0.79      0.81      0.80      1657
                     baby-foods       0.76      0.73      0.74       288
                   bee-products       1.00      0.38      0.55         8
                          beers       0.00      0.00      0.00         5
                      beverages       0.00      0.00      0.00        42
             biscuits-and-cakes       0.64      0.50      0.56       569
                     breakfasts       0.87      0.87      0.87      3202
                          cakes       0.50      0.41      0.45       423
                        candies       0.77      0.24      0.36        85
                  canned-fishes       0.74      0.77      0.75       503
                   canned-foods       0.68      0.21      0.32        72
       canned-plant-based-foods       0.54      0.35      0.42        80
              carbonated-drinks       0.65      0.

In [18]:
# assertion no empty tags
assert(len(df_test.en_tags[df_test.en_tags.str.len()==0])==0)

In [19]:
label = "cakes"
df_test[df_test.truth==label].label_predicted.value_counts()

cakes                  172
snacks                 152
biscuits-and-cakes      27
desserts                17
pastries                15
plant-based-foods        9
chocolate-biscuits       6
ice-creams               4
sweet-snacks             4
meals                    3
breakfasts               2
frozen-foods             2
dairies                  2
sauces                   1
meats                    1
spreads                  1
alcoholic-beverages      1
nuts                     1
sweetened-beverages      1
fruits-based-foods       1
fresh-foods              1
Name: label_predicted, dtype: int64

# Similarity

Sometimes the model predicts a class other than the ground truth but still makes a reasonable assignment. To distinguish such reasonable assignments from genuinely wrong classifications, we introduce similarities between classes and let similar predictions count towards the accurate predictions.

Example
Consider the following common product classes:
chicken, poultries, seafood
Given that the first two are very similar and the last one is very different, a possible similarity matrix S could look like



| Labels | chicken | poultries | seafood |
|:----------|:--------|:----------|:--------|
| chicken | 1 | 0.9 | 0 |
| poultries | 0.9 | 1 | 0 |
| seafood | 0 | 0 | 1 |


If we now want to include similarity in the calculation of precision and recall, the numerator contains not only

Count[ Truth = chicken, Predicted = chicken ]

but

Count[ Truth = chicken, Predicted = chicken ] * 1 + Count[Predicted  = chicken, Truth = poultries ] * 0.9.

Hence the diagonal entries of the confusion matrix

$$ C_{ii} = Count[Truth=i,Predicted=i]$$

become

$$ \hat{C}_{ii} = \sum_j Count[Predicted=i,Truth=j]* S_{ij} $$

That means we count in all similar predictions, weighted by their degree of similarity.
Technically, we take the diagonal entries of

$$\hat{C} = C * S$$

where * denotes the matrix product and $C_{ij} = Count[Predicted=i,Truth=j]$. Because S is symmetric, the i-th diagonal entry of this product is exactly the sum above. The denominator remains the same: the row sum in the case of precision, the column sum in the case of recall. We do not weight the denominator by similarity, since there only the exact number of Predicted (resp. True) items counts.
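
As a worked example of the weighting described above, the sketch below applies the chicken/poultries/seafood similarity matrix to a small confusion matrix. The counts in `C` are made up for illustration, and the orientation (rows are predictions, columns are truth) follows the row-sum/column-sum convention stated above.

```python
import numpy as np

classes = ["chicken", "poultries", "seafood"]

# Hypothetical confusion counts: rows are predicted classes, columns are true classes.
C = np.array([
    [8.0, 2.0, 0.0],
    [1.0, 9.0, 0.0],
    [1.0, 0.0, 9.0],
])

# Similarity matrix from the example above.
S = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Similarity-weighted hits: diagonal of the matrix product C * S.
weighted_hits = np.diag(C @ S)

# Denominators stay unweighted: row sums for precision, column sums for recall.
precision = weighted_hits / C.sum(axis=1)
recall = weighted_hits / C.sum(axis=0)

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

With these made-up counts, the precision for chicken rises from 0.80 (exact matches only) to 0.98, because the two poultries items predicted as chicken count with weight 0.9.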

In [21]:
truth=["truth"]
predictions=["label_predicted"]
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))


In [88]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)
confusion.to_csv("confusion_single_label.csv")

compute diffecence sets
compute intersections
determine vectors
CPU times: user 3min 38s, sys: 3.56 s, total: 3min 42s
Wall time: 3min 41s


In [172]:
evaluation_metrics.get_confusion??

In [26]:
confusion = pd.read_csv("confusion_single_label.csv", index_col=0)

In [27]:
# Compute word embeddings for each class and pairwise distance
# between every two classes. Set all distances that exceed a
# pre-defined threshold to null and compute rbf function as similarity.
#
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))
MAGIC_NUMBER = 0.7
mask = Dst > MAGIC_NUMBER
Dst[mask]=np.nan
# print similar but not identical classes
print(Dst.idxmax()[Dst.max()>0.0].sample(10))

S = np.exp(-Dst**2).fillna(0)

plant-based-foods-and-beverages              plant-based-beverages
fermented-milk-products                                    dairies
plant-based-beverages              plant-based-foods-and-beverages
groceries                                                   sauces
seafood                                              smoked-fishes
vegetables-based-foods                    frozen-plant-based-foods
snacks                                                sweet-snacks
fresh-plant-based-foods                     vegetables-based-foods
canned-fishes                                              seafood
pickles                                                     olives
dtype: object
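
The similarity construction in the cell above (pairwise Euclidean distance, cut-off at a threshold, RBF kernel) can be illustrated on toy data. In the sketch below the two-dimensional vectors and class names are made up; in the notebook the vectors come from the embedding model.

```python
import numpy as np

# Hypothetical 2-d class embeddings; real vectors would come from the embedding model.
class_names = ["a", "b", "c"]
vectors = np.array([
    [0.0, 0.0],
    [0.3, 0.4],   # distance 0.5 from "a"
    [3.0, 4.0],   # distance 5.0 from "a"
])

threshold = 0.7  # plays the same role as MAGIC_NUMBER in the cell above

# Pairwise Euclidean distances between all class vectors.
diff = vectors[:, None, :] - vectors[None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# RBF similarity for close pairs, zero for pairs beyond the threshold.
similarity = np.where(dist <= threshold, np.exp(-dist ** 2), 0.0)
print(np.round(similarity, 3))
```

Every class is fully similar to itself (distance 0 gives similarity 1), close classes get a value between 0 and 1, and classes beyond the threshold do not contribute at all.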




In [50]:

df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = True)[:10]

| class | precision | recall | f1score | support |
|:------|----------:|-------:|--------:|--------:|
| pasteurized-cheeses | 0.363636 | 0.030303 | 0.055944 | 132.0 |
| refrigerated-foods | 0.278381 | 0.13467 | 0.181526 | 349.0 |
| canned-plant-based-foods | 0.493671 | 0.118182 | 0.190709 | 110.0 |
| microwave-meals | 0.393443 | 0.183746 | 0.250502 | 283.0 |
| bee-products | 1.0 | 0.222222 | 0.363636 | 9.0 |
| canned-foods | 0.407878 | 0.35786 | 0.381235 | 299.0 |
| fats | 0.352941 | 0.484765 | 0.408481 | 361.0 |
| fresh-foods | 0.456874 | 0.416988 | 0.436021 | 777.0 |
| fresh-plant-based-foods | 0.584229 | 0.356674 | 0.442935 | 457.0 |
| frozen-ready-made-meals | 0.499261 | 0.426768 | 0.460177 | 396.0 |


In [30]:
evaluation_metrics.get_summary(df_report, confusion, S)

| | precision | recall | f1-score | support |
|:------|----------:|-------:|--------:|--------:|
| macro avg | 0.617111 | 0.57801 | 0.596921 | 65861.0 |
| weighted avg | 0.804638 | 0.811064 | 0.807839 | 65861.0 |
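
The gap between the macro and the weighted average in the summary above comes from class imbalance. The sketch below shows the usual definitions of both averages on made-up numbers. It is an assumption that `evaluation_metrics.get_summary` follows these definitions, since its source is not shown here.

```python
import numpy as np

# Hypothetical per-class precision values and supports (number of true items).
precision = np.array([0.5, 1.0])
support = np.array([30, 10])

# Macro average: every class counts equally.
macro_avg = precision.mean()

# Weighted average: every class counts in proportion to its support.
weighted_avg = np.average(precision, weights=support)

print(macro_avg, weighted_avg)
```

A large, poorly predicted class pulls the weighted average down more than the macro average; in the report above the opposite holds, which suggests that the weak classes are mostly small ones.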


# Multilabel Evaluation 

In [31]:
#load data
df = pd.read_csv("data/foodcategories_3labels.csv.zip", sep = "\x01", compression="zip")
df.en_tags = df.en_tags.apply(eval)
df = df[df.en_tags.str.len()>0]

In [32]:
y = df.en_tags
feature_cols = ['product_name', 'generic_name', 'brands']
assert(df[df.en_tags.str.len()==0].shape[0]==0)
X = df[feature_cols].fillna("").apply(lambda x: " ".join(x), axis = 1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25, random_state=23)

dftrain = y_train.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_train
dftest = y_test.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_test
dfvalid = y_valid.apply(lambda x: " ".join(["__label__" + y
                    for y in x])) + " " + X_valid


In [98]:
dftrain.to_csv(
    "train_3label.csv", index = False, sep=";")
dftest.to_csv(
    "test3label.csv", index = False, sep=";")
dfvalid.to_csv(
    "valid3label.csv", index = False, sep=";")

In [39]:
model_3label = train_supervised("train_3label.csv",
                lr = 0.01,
                autotunePredictions = 3,
                autotuneValidationFile = "valid3label.csv",
                autotuneDuration = 1000)
model_3label.save_model("model_3label.bin")

# !$FASTTEXT_HOME/fastText/fasttext supervised \
#     -input "train_3label.csv" -output model_3label \
#     -autotune-validation "valid3label.csv" -autotune-predictions 3 \
#     -autotune-duration 1000

In [33]:
model_3label = load_model("model_3label.bin")




In [34]:
model_3label.test("test3label.csv", 3)

(65844, 0.7418443593949334, 0.8265021235314356)

Let $\{T_1^{(1)}, ...,T_M^{(1)}\},...,\{T_1^{(N)}, ...,T_M^{(N)}\}$ be our ground truth and
$\{P_1^{(1)}, ...,P_M^{(1)}\},...,\{P_1^{(N)}, ...,P_M^{(N)}\}$ the predictions.

On the diagonal we count the events where truth matches prediction:

$\sum_{i=1}^N |\{T_1^{(i)}, ...,T_M^{(i)}\}\cap \{P_1^{(i)}, ...,P_M^{(i)}\}|$

So for each diagonal entry we count, over all data points (N) and all labels (M), how often the true label is among the predictions:

$C_{kk} =\sum_{i=1}^N \sum_{m=1}^M \mathbf{1}\{T_m^{(i)}=k,\ k \in \{P_1^{(i)}, ...,P_M^{(i)}\}\} $

In contrast, on the off-diagonal we count all events where the truth was not covered by the prediction.
For the off-diagonal we take only the difference sets into account. If the truth for data point i was, for example, {fish, curry, rice-dish} and the prediction was {fish, curry, soup}, we only count
$\tilde{T}^{(i)}$ = {rice-dish} and $\tilde{P}^{(i)}$ = {soup} for the off-diagonal. Formally, we apply the following transformation:

$\{\tilde{T}_{m_1}^{(i)}\}_{m_1=1,..,M_1} = \{T_m^{(i)}\}_{m=1,..,M}\setminus \{P_m^{(i)}\}_{m=1,..,M}$

$\{\tilde{P}_{m_2}^{(i)}\}_{m_2=1,..,M_2} = \{P_m^{(i)}\}_{m=1,..,M}\setminus \{T_m^{(i)}\}_{m=1,..,M}$

Finally, we count all events of the ground truth that were not captured by the predictions, which results in

$C_{kl} =\sum_{i=1}^N \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2}
\frac{1}{M_1}\mathbf{1}\{\tilde{T}_{m_1}^{(i)}=k,\tilde{P}_{m_2}^{(i)}=l\} $

where $M_1$ and $M_2$ are the sizes of the two difference sets for data point i.

Once we have computed the multi-class, multi-label confusion matrix, the classification report is straightforward.

Precision for class k is computed as

$Pr_k=\frac{C_{kk}}{\sum_j C_{kj}}$

and Recall

$R_k=\frac{C_{kk}}{\sum_j C_{jk}}$
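
The counting rules above can be sketched for a single data point as follows. This is a minimal illustration of the formulas, not the implementation of `evaluation_metrics.get_confusion`, whose source is not shown here; `add_to_confusion` and the sparse dictionary layout keyed by (truth, predicted) are hypothetical.

```python
from collections import defaultdict

def add_to_confusion(confusion, truth, predicted):
    """Add one data point to a sparse confusion matrix keyed by (truth, predicted)."""
    truth, predicted = set(truth), set(predicted)
    # Diagonal: labels that appear in both truth and prediction.
    for label in truth & predicted:
        confusion[(label, label)] += 1.0
    # Off-diagonal: only the difference sets are paired up.
    missed = truth - predicted       # true labels that were not predicted
    spurious = predicted - truth     # predicted labels that are not true
    for t in missed:
        for p in spurious:
            # Weight 1/M1 as in the formula above, with M1 = len(missed).
            confusion[(t, p)] += 1.0 / len(missed)
    return confusion

confusion = defaultdict(float)
add_to_confusion(confusion, {"fish", "curry", "rice-dish"}, {"fish", "curry", "soup"})
print(dict(confusion))
```

For the fish/curry example, the two matching labels each add one count on the diagonal, and the only off-diagonal contribution is the pair (rice-dish, soup).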



In [35]:
df_test = X_test.to_frame()
df_test.columns = ["feature"]

# determine number of labels
df_test = df_test.join(y_test)

K = int(df_test.en_tags.str.len().quantile(0.75))
K

3

In [36]:
df_test["prediction"] = df_test.feature.apply(lambda x: model_3label.predict(x, k=K))

for k in range(K):
    df_test["label_predicted%i" %k] = df_test.prediction.str[0].str[k].str.replace("__label__","")
    df_test["confidence%i" %k] = df_test.prediction.str[1].str[k]
    df_test["truth%i" %k] = df_test.en_tags.str[k].str.replace("__label__","")


truth = list()
predictions = list()
for k in range(int(K)):
    truth.append("truth%i" %k)
    predictions.append("label_predicted%i" %k)
classes = list(set(np.concatenate(df_test[predictions + truth].apply(tuple).apply(list).values)))
classes.remove("nan")


In [234]:
%time confusion = evaluation_metrics.get_confusion(df_test, classes, K, truth, predictions)

confusion.to_csv("confusion_3_label.csv")

compute diffecence sets
compute intersections
determine vectors
CPU times: user 3min 56s, sys: 4.25 s, total: 4min 1s
Wall time: 4min


In [38]:
confusion = pd.read_csv("confusion_3_label.csv", index_col=0)

In [39]:
df_report = evaluation_metrics.get_report(confusion)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

| class | precision | recall | f1score | support |
|:------|----------:|-------:|--------:|--------:|
| cooked-pressed-cheeses | 0.969397 | 0.944737 | 0.956908 | 380.0 |
| mustards | 0.979548 | 0.93047 | 0.954379 | 489.0 |
| prepared-meats | 0.941215 | 0.946119 | 0.943661 | 3582.0 |
| hams | 0.92887 | 0.956463 | 0.942465 | 735.0 |
| honeys | 0.920911 | 0.95754 | 0.938868 | 683.0 |
| fish-and-meat-and-eggs | 0.940222 | 0.929114 | 0.934635 | 395.0 |
| fermented-milk-products | 0.930561 | 0.936394 | 0.933468 | 5424.0 |
| poultries | 0.926476 | 0.932216 | 0.929337 | 1372.0 |
| sauces | 0.917505 | 0.932817 | 0.925097 | 3096.0 |
| olive-tree-products | 0.933401 | 0.916168 | 0.924704 | 668.0 |


In [40]:
evaluation_metrics.get_summary(df_report, confusion)

| | precision | recall | f1-score | support |
|:------|----------:|-------:|--------:|--------:|
| macro avg | 0.715004 | 0.655128 | 0.683758 | 177565.0 |
| weighted avg | 0.822256 | 0.8263 | 0.824273 | 177565.0 |


# Combining Multilabel Classification with Near Accuracy 

In [41]:
class_vecs = pd.Series({cl:model_emb.get_sentence_vector(cl) for cl in classes})

Dst = class_vecs.apply(lambda x: class_vecs.apply(lambda y: sd.euclidean(x,y)))

mask = Dst > 0.5
Dst[mask]=np.nan
# show the entry with the highest distance <= 0.5 for each class
Dst.idxmax()[Dst.max()>0]



plant-based-foods-and-beverages                   plant-based-foods
wines                                             wines-from-france
frozen-desserts                                          ice-creams
artificially-sweetened-beverages                sweetened-beverages
milk-chocolates                                          chocolates
fermented-milk-products                             fermented-foods
teas                                                  hot-beverages
pastas                                                    dry-pasta
dried-plant-based-foods                    canned-plant-based-foods
fruit-yogurts                                               yogurts
groceries                                                    sauces
dried-products-to-be-rehydrated                      dried-products
wines-from-france                                             wines
fermented-foods                             fermented-milk-products
dry-pasta                                       

In [42]:
S = np.exp(-Dst**2).fillna(0)
df_report = evaluation_metrics.get_report(confusion, S)
df_report.dropna().sort_values(by="f1score", ascending = False)[:10]

| class | precision | recall | f1score | support |
|:------|----------:|-------:|--------:|--------:|
| cooked-pressed-cheeses | 0.969397 | 0.944737 | 0.956908 | 380.0 |
| mustards | 0.979548 | 0.93047 | 0.954379 | 489.0 |
| prepared-meats | 0.941215 | 0.946119 | 0.943661 | 3582.0 |
| hams | 0.92887 | 0.956463 | 0.942465 | 735.0 |
| honeys | 0.920911 | 0.95754 | 0.938868 | 683.0 |
| fish-and-meat-and-eggs | 0.940222 | 0.929114 | 0.934635 | 395.0 |
| fermented-milk-products | 0.930609 | 0.936906 | 0.933747 | 5424.0 |
| poultries | 0.933975 | 0.932819 | 0.933397 | 1372.0 |
| sauces | 0.918065 | 0.933386 | 0.925663 | 3096.0 |
| olive-tree-products | 0.933401 | 0.916168 | 0.924704 | 668.0 |


In [43]:
evaluation_metrics.get_summary(df_report, confusion, S)

| | precision | recall | f1-score | support |
|:------|----------:|-------:|--------:|--------:|
| macro avg | 0.723461 | 0.666248 | 0.693677 | 177565.0 |
| weighted avg | 0.826741 | 0.830441 | 0.828587 | 177565.0 |


In [44]:
confusion.sum().sum()

177565.0

In [47]:
df_test.en_tags.apply(len).sum()

177565

