# Course 2: Project - Task C - Text data

<a name="top"></a>
This notebook is concerned with task C.

**Contents:**
* [Imports](#task-c-imports)
* [Data loading](#task-c-data-loading)
* [Exploration](#task-c-exploration)
* [Implementation & execution](#task-c-implementation-and-execution)
* [Result](#task-c-result)

## Imports<a name="task-c-imports"></a> ([top](#top))
---

In [1]:
# Standard library:
import pathlib
import typing as t

# 3rd party:
import numpy as np
import pandas as pd
import pandas.io.formats.style

# Project:
import ingredients
import utils

## Data loading<a name="task-c-data-loading"></a> ([top](#top))
---

First, we load the subset of the cleaned-up dataset that we need:

In [2]:
base_name = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.clean')

In [3]:
# The columns to load:
usecols=['countries_tags', 'ingredients_text']

# Load:
data_types, parse_dates  = utils.load_and_amend_dtypes(base_name)
# We can only parse dates in the columns that we are loading:
parse_dates = list(set(parse_dates) & set(usecols))
df = pd.read_csv(
        f'{base_name}.csv',
        header=0,
        parse_dates=parse_dates,
        usecols=usecols,
        dtype=data_types)

In [4]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 355569 rows and 2 columns


Here are the first few rows:

In [5]:
df.head()

| | countries_tags | ingredients_text |
|---|---|---|
| 0 | en:france | |
| 1 | en:united-states | Bananas, vegetable oil (coconut oil, corn oil ... |
| 2 | en:united-states | Peanuts, wheat flour, sugar, rice flour, tapio... |
| 3 | en:united-states | Organic hazelnuts, organic cashews, organic wa... |
| 4 | en:united-states | Organic polenta |


## Exploration<a name="task-c-exploration"></a> ([top](#top))
---

**Observations:** After manually inspecting multiple records (spot-checking), we noticed that:

* **Content:** The majority of entries are valid and contain a list of ingredients. Some invalid entries are clearly malicious in nature (e.g. for `code`: *355951*, we have `product_name`: _Ma b***_ (_My d***_)). Other invalid entries might be the result of a genuine mix-up (e.g. for `code`: *355945*, we have `product_name`: *La pratique du vocabulaire allemand* (*German vocabulary*)).

* **Language:** In the majority of cases, the list of ingredients uses a single language. In some cases it uses two or more languages (e.g. for `code`: *930*, it uses English, French and Dutch). In the majority of cases the `countries_tags` column seems to be a good proxy for the language used.

* **Format:** The "ideal" underlying format seems to be a comma-delimited list of ingredients, where each ingredient may be either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion), enclosed in matching parentheses. Very few entries conform to this "ideal" format, though:
  * **Parentheses:** In some cases they are used to specify the percentage of an ingredient (e.g. for `code`: *155*, we have *"[...] milk chocolate (32%) (sugar, cocoa butter [...]"*) or to provide clarification (e.g. for `code`: *24* we have *"[...] soy lecithin (an emulsifier) [...]"*). Different types of parentheses are used, such as round brackets and square brackets.
  * **Delimiters:** Different types of delimiters are used, such as commas, colons and bullets, to name a few.

**Assumptions:** In order to keep complexity under control, we make the following assumptions:
* Entries are valid and contain a list of ingredients.
* Entries use a single language. The `countries_tags` column is a good proxy for the language used.
* The format is a delimited list of ingredients (that uses some type of delimiter), where each ingredient may be either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion). The list in question starts with some type of opening parenthesis and contains at least one ingredient.

We expect entries that do not conform to the above to be "infrequent enough" so as to not impact the final result.

**Scope:** Also in order to keep complexity under control, we take the following decisions:
* We will limit ourselves to 2 markets: the USA and France. Each one is a rather large country, with a single official language (that we understand). (For the USA, although there is no official language, English is considered to be the de-facto official language. For France, French is the official language.)

## Implementation & execution<a name="task-c-implementation-and-execution"></a> ([top](#top))
---

### Implementation

**Note:** In order not to clutter the notebook, most of the code is in a separate module - see `ingredients.py`.

**Tokenization:** The first step is to split a given list of ingredients into tokens. For that we decided to hand-roll our own tokenizer; the code is quite short (< 150 lines). An alternative would have been to use NLTK (e.g. the [nltk.chunk](https://www.nltk.org/api/nltk.chunk.html) package) or another 3rd party library. We use 2 phases:

* **Phase 1:** Language agnostic. We use regular expressions to split the text into tokens:
  * `SPACE`: A sequence of 1+ white spaces
  * `DELIM`: A delimiter
  * `LPAR`: A left parenthesis
  * `RPAR`: A right parenthesis
  * `FIELD`: This is everything in-between tokens of the above types. If a *'.'* or a *','* is surrounded by digits, we consider it as part of a decimal number and do not treat it as a delimiter
  * `END`: This is an additional type, used to communicate that we reached the end-of-input
  * `INVALID`: This is an additional type, used to communicate that we encountered an error; it covers the sequence of characters that we had to skip until we could re-synchronize with the input
* **Phase 2:** Language-specific. Tokens of type `FIELD` are further split on language-specific delimiters (e.g. _and_, _or_, for English; _et_, _ou_ for French).
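
The sketch below illustrates how phase 1 could look. It is not the code from `ingredients.py`: the names `Token`, `SPECIAL_RE` and `tokenize_sketch` are hypothetical, the character classes are assumptions, white space around a field is simply trimmed instead of being emitted as `SPACE` tokens, and `INVALID` handling is left out.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str   # 'FIELD', 'DELIM', 'LPAR', 'RPAR' or 'END'
    start: int  # index of the first character in the text
    end: int    # index one past the last character


# Assumed character classes; a real tokenizer may recognise more symbols.
SPECIAL_RE = re.compile(r"""
    (?P<LPAR>[(\[{])
  | (?P<RPAR>[)\]}])
  | (?P<DELIM>
        [;:•]            # always a delimiter
      | (?<!\d)[.,]      # '.' or ',' not preceded by a digit
      | [.,](?!\d)       # '.' or ',' not followed by a digit
    )
""", re.VERBOSE)


def _field(text: str, start: int, end: int) -> Iterator[Token]:
    # Emit a FIELD for text[start:end] if anything remains after trimming.
    chunk = text[start:end]
    stripped = chunk.strip()
    if stripped:
        offset = start + len(chunk) - len(chunk.lstrip())
        yield Token('FIELD', offset, offset + len(stripped))


def tokenize_sketch(text: str) -> Iterator[Token]:
    pos = 0
    for match in SPECIAL_RE.finditer(text):
        # Everything between two special tokens is a FIELD:
        yield from _field(text, pos, match.start())
        yield Token(match.lastgroup, match.start(), match.end())
        pos = match.end()
    yield from _field(text, pos, len(text))
    yield Token('END', len(text), len(text))


text = 'milk chocolate 7,5 % (sugar, cocoa butter).'
for token in tokenize_sketch(text):
    print(f'{token.type:5} | {text[token.start:token.end]!r}')
```

Because a `.` or `,` only counts as a delimiter when at least one neighbour is not a digit, *7,5* stays inside its field.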

Here are the language-specific delimiters:

In [6]:
# Order matters due to substrings:
delims_en = ['and/or', 'or/and', 'and', 'or']  
delims_fr = ['et/ou', 'ou/et', 'et',  'ou']
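
To illustrate why the order of the delimiters matters, here is a minimal sketch of phase 2. The function `split_on_words` is hypothetical and the use of word boundaries (`\b`) is an assumption; the actual module may split differently.

```python
import re
from typing import List

delims_en = ['and/or', 'or/and', 'and', 'or']


def split_on_words(field: str, delims: List[str]) -> List[str]:
    # The alternation is tried left to right, so 'and/or' must come
    # before 'and' and 'or' to be matched as a whole.
    alternation = '|'.join(re.escape(d) for d in delims)
    pattern = re.compile(rf'\b(?:{alternation})\b', re.IGNORECASE)
    parts = (part.strip() for part in pattern.split(field))
    return [part for part in parts if part]


print(split_on_words('water and salt', delims_en))
print(split_on_words('corn oil and/or sandwich bread', delims_en))
# With the short delimiters first, 'and/or' is split twice and a stray '/' remains:
print(split_on_words('butter and/or oil', ['and', 'or', 'and/or']))
```

The word boundaries keep *corn* and *sandwich* intact even though they contain *or* and *and*.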

Here is a small example (for each token, we print its type and highlight its position in the text with `[` and `]`):

In [7]:
text = 'Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt), garlic and cayenne.'

for token in ingredients.tokenize(text, delims_en):
    print(f"{token.type.name:5} | {ingredients.highglight_token(text, token, '[', ']')}")

FIELD | [Organic dry roasted pumpkin seeds], tamari (soybeans, water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds[,] tamari (soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, [tamari] (soybeans, water and salt), garlic and cayenne.
LPAR  | Organic dry roasted pumpkin seeds, tamari [(]soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari ([soybeans], water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans[,] water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, [water] and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans, water [and] salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, water and [salt]), garlic and cayenne.
RPAR  | Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt[)], garlic and cayenne.


**Normalization:** We perform the following steps to normalize the tokens:
- Convert to lower-case
- Remove numbers
- Replace punctuation marks by a single space
- Remove accents
- Replace a sequence of 1+ white spaces by a single space
- Strip leading/trailing white spaces
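
The steps above can be sketched with the standard library only. The function name `normalize_sketch` is hypothetical, and treating the underscore as punctuation is an assumption (some entries mark allergens as `_Blé_`); the code in `ingredients.py` may differ.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    text = text.lower()                       # lower-case
    text = re.sub(r'\d+', '', text)           # remove numbers
    text = re.sub(r'[^\w\s]|_', ' ', text)    # punctuation -> single space
    # Remove accents: decompose, then drop the combining marks.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r'\s+', ' ', text)          # collapse white spaces
    return text.strip()                       # strip both ends


print(normalize_sketch("zestes d'oranges 7,5 %"))  # zestes d oranges
print(normalize_sketch('farine de _Blé_'))         # farine de ble
```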

**Finding ingredients:** Finally, we keep only tokens of type `FIELD`. We also need a criterion to decide whether an ingredient is simple or composite. For that, we look at a window of 3 consecutive tokens: to be classified as "composite", an ingredient must be the 1st field in a sequence `FIELD`, `LPAR`, `FIELD`.

Here is a small example:

In [8]:
text = "Thé noir de Chine, zestes d'oranges 7,5 %, arômes naturels (cannelle 4,7 %, orange 4,7 %, poudre de cannelle 3,9 %)."

ingredients.texts_to_ingredients_df([text], delims_fr)

| | ingredient | is_composite |
|---|---|---|
| 0 | the noir de chine | 0 |
| 1 | zestes d oranges | 0 |
| 2 | aromes naturels | 1 |
| 3 | cannelle | 0 |
| 4 | orange | 0 |
| 5 | poudre de cannelle | 0 |
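
The "composite" criterion can be sketched as follows. This is an illustration only: tokens are represented as hypothetical `(type, text)` pairs, white space is assumed to have been dropped already, and `find_ingredients_sketch` is not a function of `ingredients.py`.

```python
from typing import List, Tuple

TokenPair = Tuple[str, str]  # (type, normalized text)


def find_ingredients_sketch(tokens: List[TokenPair]) -> List[Tuple[str, bool]]:
    result = []
    for i, (kind, text) in enumerate(tokens):
        if kind != 'FIELD':
            continue
        # An ingredient is "composite" if it starts a FIELD, LPAR, FIELD run.
        window = [t[0] for t in tokens[i:i + 3]]
        result.append((text, window == ['FIELD', 'LPAR', 'FIELD']))
    return result


tokens = [
    ('FIELD', 'aromes naturels'), ('LPAR', '('),
    ('FIELD', 'cannelle'), ('DELIM', ','),
    ('FIELD', 'orange'), ('RPAR', ')'),
    ('END', ''),
]
print(find_ingredients_sketch(tokens))
```

Requiring a `FIELD` right after the opening parenthesis matches the assumption that a sub-list contains at least one ingredient.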


### Execution

We create a separate data-frame for each country and discard rows with NA values:

In [9]:
def filter_country(country: str) -> pd.DataFrame:
    is_country = df['countries_tags'].str.contains(country, na=False, regex=False)
    df_country = df[is_country & df['ingredients_text'].notna()]
    return df_country[['ingredients_text']]

In [10]:
df_us = filter_country('united-states')
nrows, ncols = df_us.shape
print(f"the dataset 'us' contains {nrows} row(s) and {ncols} column(s)")

df_fr = filter_country('france')
nrows, ncols = df_fr.shape
print(f"the dataset 'fr' contains {nrows} row(s) and {ncols} column(s)")

the dataset 'us' contains 171605 row(s) and 1 column(s)
the dataset 'fr' contains 86654 row(s) and 1 column(s)


We take a peek at the result so far:

In [11]:
display(
    df_fr.head(),
    df_us.head()
)

| | ingredients_text |
|---|---|
| 184 | lentilles vertes |
| 185 | Eau gazéifiée, sirop de maïs à haute teneur en... |
| 186 | Sucre, farine de _Blé_, graisse et huiles végé... |
| 189 | Thé noir aromatisé à la fleur de violette et p... |
| 190 | Thé noir de Chine, zestes d'oranges 7,5 %, arô... |


| | ingredients_text |
|---|---|
| 1 | Bananas, vegetable oil (coconut oil, corn oil ... |
| 2 | Peanuts, wheat flour, sugar, rice flour, tapio... |
| 3 | Organic hazelnuts, organic cashews, organic wa... |
| 4 | Organic polenta |
| 5 | Rolled oats, grape concentrate, expeller press... |


We convert each dataset into a data-frame with columns _(ingredient, is-composite)_:

In [12]:
df_us_ingredients = ingredients.texts_to_ingredients_df(df_us['ingredients_text'], delims_en)
df_fr_ingredients = ingredients.texts_to_ingredients_df(df_fr['ingredients_text'], delims_fr)

**Note:** Some ingredients appear multiple times, sometimes marked as "simple" and sometimes marked as "composite". Some form of "reconciliation" is thus required. For an ingredient to be considered "composite", we require that the fraction of its instances marked as "composite" be strictly greater than 0.5.

In [13]:
df_us_final = ingredients.reconcile_composite(df_us_ingredients, 0.5)
df_fr_final = ingredients.reconcile_composite(df_fr_ingredients, 0.5)
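
A possible shape for such a reconciliation is sketched below. The function `reconcile_sketch` is hypothetical and is not the implementation in `ingredients.py`; it assumes an input with columns `ingredient` and `is_composite` (0 or 1) and produces one row per ingredient with a `count` and a boolean `is_composite`.

```python
import pandas as pd


def reconcile_sketch(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    grouped = df.groupby('ingredient')['is_composite']
    return pd.DataFrame({
        # Number of occurrences of each ingredient:
        'count': grouped.size(),
        # Composite only if the fraction of "composite" marks exceeds the threshold:
        'is_composite': grouped.mean() > threshold,
    })


df_demo = pd.DataFrame({
    'ingredient':   ['salt', 'salt', 'salt', 'butter', 'butter', 'oil', 'oil'],
    'is_composite': [0, 0, 1, 1, 1, 1, 0],
})
print(reconcile_sketch(df_demo, 0.5))
```

With a strict comparison, an ingredient marked "composite" in exactly half of its occurrences (*oil* in this toy data) ends up "simple".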

## Result<a name="task-c-result"></a> ([top](#top))
---

We implement utility functions to find the 5 most common "simple" ingredients, the 5 most common ingredients overall and finally the 5 most common "composite" ingredients. We also compute their prevalence, i.e. the ingredient count divided by the number of products, which we read as the percentage of products that contain that ingredient:

In [14]:
def compute_prevalence(df: pd.DataFrame, nproducts: int) -> pd.DataFrame:
    # Return a copy, so that we never write into a slice of another data-frame:
    return df.assign(prevalence=df['count'] / nproducts)


# Custom format for the 'prevalence' column:
def styled(df: pd.DataFrame) -> pd.io.formats.style.Styler:
    return df.style.format({
        'prevalence': lambda n: f'{n*100:.2f} %'
    })


def top_n_simple(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df[~df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))


def top_n_overall(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df.nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))


def top_n_composite(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df[df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))

The 5 most common "simple" ingredients for the USA and France are:

In [15]:
print("USA:")
display(top_n_simple(df_us_final, len(df_us)))
print("France:")
display(top_n_simple(df_fr_final, len(df_fr)))

USA:


| ingredient | count | prevalence |
|---|---|---|
| salt | 104086 | 60.65 % |
| sugar | 77270 | 45.03 % |
| water | 71749 | 41.81 % |
| citric acid | 34792 | 20.27 % |
| riboflavin | 22443 | 13.08 % |


France:


| ingredient | count | prevalence |
|---|---|---|
| sel | 51785 | 59.76 % |
| sucre | 37253 | 42.99 % |
| eau | 35592 | 41.07 % |
| farine de ble | 13048 | 15.06 % |
| huile de tournesol | 9060 | 10.46 % |


**Comment:** Salt, sugar and water top the ranking for both the USA and France and are present in a large percentage of all products.

The 5 most common ingredients overall for the USA and France are:

In [16]:
print('USA:')
display(top_n_overall(df_us_final, len(df_us)))
print('France:')
display(top_n_overall(df_fr_final, len(df_fr)))

USA:


| ingredient | count | prevalence |
|---|---|---|
| salt | 104086 | 60.65 % |
| sugar | 77270 | 45.03 % |
| water | 71749 | 41.81 % |
| citric acid | 34792 | 20.27 % |
| riboflavin | 22443 | 13.08 % |


France:


| ingredient | count | prevalence |
|---|---|---|
| sel | 51785 | 59.76 % |
| sucre | 37253 | 42.99 % |
| eau | 35592 | 41.07 % |
| farine de ble | 13048 | 15.06 % |
| huile de tournesol | 9060 | 10.46 % |


**Comment:** The 5 most common "simple" ingredients are also the 5 most common ingredients overall.

Finally, the 5 most common "composite" ingredients are:

In [17]:
print('USA:')
display(top_n_composite(df_us_final, len(df_us)))
print('France:')
display(top_n_composite(df_fr_final, len(df_fr)))

USA:


| ingredient | count | prevalence |
|---|---|---|
| ascorbic acid | 8937 | 5.21 % |
| vegetable oil | 7915 | 4.61 % |
| potassium sorbate | 7683 | 4.48 % |
| butter | 5620 | 3.27 % |
| sodium benzoate | 5380 | 3.14 % |


France:


| ingredient | count | prevalence |
|---|---|---|
| emulsifiant | 2776 | 3.20 % |
| legumes | 1628 | 1.88 % |
| huiles vegetales | 1607 | 1.85 % |
| colorant | 1463 | 1.69 % |
| acidifiant | 1202 | 1.39 % |


**Comment:** We see that most of the 5 most common "composite" ingredients seem to be some sort of additive. The result is not perfect, though. For instance, we would not put *legumes* on the same level as *soy sauce*, as the former would not appear as-is in a recipe book while the latter would. Since this was not part of the requirements for this task, we will not pursue this further.