# Course 2: Project - Task C: text data

<a name="top"></a>
This notebook is concerned with task C.

**Contents:**
* [Imports](#imports)
* [Preparatives](#preparatives)
* [Data loading](#task-c-data-loading)
* [Data exploration](#task-c-data-exploration)
* [Implementation](#task-c-implementation)
* [Processing](#task-c-processing)

## Imports<a name="imports"></a>
---

In [11]:
# Standard library:
import functools
import itertools
import pathlib
import re
import typing as t
import unicodedata

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Project:
import ingredients

%matplotlib inline

## Preparatives<a name="preparatives"></a>
---

This section gathers the utility functions that we will use later in this notebook.

### Utilities

In [2]:
@functools.wraps(display)  # nicer for interactive use
def display_allcols(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns."""
    with pd.option_context('display.max_columns', None):
        display(*args, **kwargs)
        
        
@functools.wraps(display)  # nicer for interactive use
def display_allcols_notrunc(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns with no truncation."""
    # Note: ``None`` means "no limit"; recent Pandas versions no longer accept ``-1`` here.
    with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
        display(*args, **kwargs)


def move_after(words: t.List[str], word: str, word_to_move: str) -> t.List[str]:
    """Moves ``word_to_move`` right after ``word`` (in place) and returns the list; used to re-order columns."""
    try:
        word_idx = words.index(word)
        word_to_move_idx = words.index(word_to_move)
    except ValueError:
        pass
    else:
        if word_idx < word_to_move_idx:
            words.pop(word_to_move_idx)
            words.insert(word_idx + 1, word_to_move)
        else:
            words.insert(word_idx + 1, word_to_move)
            words.pop(word_to_move_idx)
    return words

## Data loading<a name="task-c-data-loading"></a> ([top](#top))
---

First, we load the subset of the cleaned-up dataset that we need:

**/!\ TODO: Load the cleaned-up data-frame once it is available!**

In [3]:
data_filename = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.tsv')

In [67]:
df = pd.read_csv(data_filename,
                 sep='\t',
                 usecols=['countries_tags', 'ingredients_text'],
                 lineterminator='\n')

In [72]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 356001 rows and 2 columns


Here are the first few rows:

In [71]:
df.head()

,countries_tags,ingredients_text
0,en:france,
1,en:united-states,"Bananas, vegetable oil (coconut oil, corn oil ..."
2,en:united-states,"Peanuts, wheat flour, sugar, rice flour, tapio..."
3,en:united-states,"Organic hazelnuts, organic cashews, organic wa..."
4,en:united-states,Organic polenta


## Data exploration<a name="task-c-data-exploration"></a> ([top](#top))
---

**Observations:** After manually inspecting multiple records (spot-checking), we were able to make the following observations:

* **Content:** The majority of entries seem to be valid entries that contain a list of ingredients. Some invalid entries were clearly created on purpose (e.g. for *code: 355951*, we have *product\_name:* _Ma b***_ (English: _My d***_)). Other invalid entries might be the result of a genuine mix-up (e.g. for *code: 355945*, we have *product\_name:* _La pratique du vocabulaire allemand_ (English: _German vocabulary_)).

* **Language:** In the majority of cases, the list of ingredients seems to use a single language. In some cases it uses two or more languages (e.g. for *code: 930*, it uses English, French and Dutch). In the majority of cases the *countries\_tags* column seems to be a good proxy for the language used.

* **Format:** The "ideal" underlying format seems to be a comma-delimited list of ingredients, where each ingredient is either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion), enclosed in matching parentheses. Very few entries conform to this "ideal" format, though:
  * **Parentheses:** In some cases they are used to specify the percentage of an ingredient (e.g. for *code: 155*, we have _"[...] milk chocolate (32%) (sugar, cocoa butter [...]"_) or to provide clarification (e.g. for *code: 24* we have _"[...] soy lecithin (an emulsifier) [...]"_). Different types of parentheses are used, such as round brackets and square brackets.
  * **Delimiters:** Different types of delimiters are used, such as commas, colons and bullets, to name a few.
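To make the "ideal" format described above concrete, here is a minimal sketch of a recursive parser for it. It is only an illustration: the function name `parse_ideal` and the nested `(name, sub-ingredients)` representation are assumptions, and it handles only commas and round brackets, which is exactly why it would not cope with most real entries.

```python
import typing as t

# Hypothetical nested representation: (name, list of sub-ingredients).
Ingredient = t.Tuple[str, t.List[t.Any]]


def parse_ideal(text: str) -> t.List[Ingredient]:
    """Parses a comma-delimited list where '(' ... ')' encloses sub-ingredients."""
    pos = 0

    def parse_list() -> t.List[Ingredient]:
        nonlocal pos
        items: t.List[Ingredient] = []
        name = ''
        children: t.List[Ingredient] = []
        while pos < len(text):
            ch = text[pos]
            pos += 1
            if ch == '(':
                # A composite ingredient: recurse to read its own ingredients.
                children = parse_list()
            elif ch in ',)':
                if name.strip():
                    items.append((name.strip(), children))
                name, children = '', []
                if ch == ')':
                    return items
            else:
                name += ch
        if name.strip():
            items.append((name.strip(), children))
        return items

    return parse_list()


parsed = parse_ideal('tamari (soybeans, water), garlic')
```

An ingredient with a non-empty list of children corresponds to what we call "composite"; all others are "simple".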

**Assumptions:** In order to keep complexity under control, we make the following assumptions:
* Entries are valid and contain a list of ingredients.
* Entries use a single language. The *countries\_tags* column is a good proxy for the language used.
* The format is a delimited list of ingredients (that uses "some" type of delimiter), where each ingredient may be either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion). The list in question starts with "some" type of opening parenthesis.

We expect entries that do not conform to the above to be "infrequent enough" so as to not impact the final result.

**Scope:** In order to keep complexity under control, we take the following decisions:
* We will limit ourselves to the **USA** and **France**. Each one is a rather large country with a single dominant language (that we understand): French is the official language of France, and although the USA has no official language, English is the de facto one.

## Implementation<a name="task-c-implementation"></a>
---

In order not to clutter the notebook, most of the code is in a separate module - _ingredients.py_.

**Tokenization:** The first step is to split a given list of ingredients into tokens. For that we decided to roll our own tokenizer; the code is quite short (< 150 lines). An alternative would have been to use NLTK (e.g. the [nltk.chunk](https://www.nltk.org/api/nltk.chunk.html) package) or another 3rd party library. The tokenizer works in 2 phases.

**Phase 1:** Language agnostic. We use regular expressions to split the text into tokens:
* `SPACE`: A sequence of white spaces
* `DELIM`: A delimiter
* `LPAR`: A left parenthesis
* `RPAR`: A right parenthesis
* `FIELD`: This is everything in-between tokens of the above types. If a _'.'_ or a _','_ is surrounded by digits, we do not treat it as a delimiter
* `END`: This is an additional type, used to communicate that we reached the end-of-input
* `INVALID`: This is an additional type, used to communicate that we encountered an error; it carries the sequence of characters that had to be skipped until we could re-synchronize with the input
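The token types above could be recognized along the following lines. This is a simplified sketch and not the code from _ingredients.py_: the name `tokenize_phase1`, the exact character sets and the `(type, text)` output are assumptions, and the `END` and `INVALID` types are left out.

```python
import re
import typing as t

# A "word" is a run of characters that are neither white space, parentheses nor
# delimiters. A '.' or ',' surrounded by digits (e.g. "7,5") stays inside the word.
_WORD = r'(?:[^\s()\[\]{}.,;:•]|(?<=\d)[.,](?=\d))+'

# Hypothetical token specification; the character sets are assumptions.
TOKEN_SPEC = [
    ('SPACE', r'\s+'),
    ('LPAR', r'[(\[{]'),
    ('RPAR', r'[)\]}]'),
    ('FIELD', _WORD + r'(?:\s+' + _WORD + r')*'),  # words separated by white space
    ('DELIM', r'[.,;:•]'),
]
TOKEN_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))


def tokenize_phase1(text: str) -> t.Iterator[t.Tuple[str, str]]:
    """Yields (token type, token text) pairs, skipping white space between tokens."""
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != 'SPACE':
            yield match.lastgroup, match.group()


tokens = list(tokenize_phase1('tamari (soybeans, water), zestes 7,5 %'))
```

Note that the comma in _7,5_ stays inside the `FIELD` token, whereas the other commas become `DELIM` tokens.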


**Phase 2:** Language-specific. Tokens of type `FIELD` are further split on language-specific delimiters (e.g. _and, or, and/or, or/and_ for English, _et, ou, et/ou, ou/et_ for French).

In [130]:
delims_en = ['and/or', 'or/and', 'and', 'or']  # order matters due to substrings
delims_fr = ['et/ou', 'ou/et', 'et',  'ou']  # ditto
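As an illustration of phase 2, a field could be split on such language-specific delimiters as sketched below. The helper `split_field` is hypothetical (the actual logic lives in _ingredients.py_), and unlike the real tokenizer it simply drops the delimiters instead of emitting `DELIM` tokens.

```python
import re
import typing as t


def split_field(field: str, delims: t.Sequence[str]) -> t.List[str]:
    """Splits a field on delimiters that appear as whole, space-separated words."""
    # Longest alternatives first, so that 'and/or' is tried before 'and' and 'or'.
    ordered = sorted(delims, key=len, reverse=True)
    pattern = r'\s+(?:' + '|'.join(re.escape(d) for d in ordered) + r')\s+'
    parts = re.split(pattern, field, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


example_delims = ['and/or', 'or/and', 'and', 'or']
parts_simple = split_field('water and salt', example_delims)
parts_slash = split_field('sodium benzoate and/or potassium sorbate', example_delims)
parts_none = split_field('organic corn', example_delims)
```

Requiring white space on both sides of a delimiter prevents the _or_ inside words such as _organic_ or _corn_ from being treated as a delimiter.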

Here is a small example. We print the type of the token and highlight its position in the text with `[` and `]`:

In [131]:
text = 'Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt), garlic and cayenne.'

for token in ingredients.tokenize(text, delims_en):
    print(f"{token.type.name:5} | {ingredients.highglight_token(text, token, '[', ']')}")

FIELD | [Organic dry roasted pumpkin seeds], tamari (soybeans, water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds[,] tamari (soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, [tamari] (soybeans, water and salt), garlic and cayenne.
LPAR  | Organic dry roasted pumpkin seeds, tamari [(]soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari ([soybeans], water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans[,] water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, [water] and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans, water [and] salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, water and [salt]), garlic and cayenne.
RPAR  | Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt[)], garlic and cayenne.


We will use the utility function below to iterate over tokens using a sliding window of width 2:

In [151]:
# This is an old itertools recipe:
def window(seq: t.Iterable[t.Any], n=2) -> t.Iterable[t.Sequence[t.Any]]:
    """\
    Returns a sliding window (of width ``n``) over data from the iterable.
    For a sequence ``s``, the result will be ``(s0, s1, ..., s[n-1]), (s1, s2, ..., s[n]), ...``.
    """
    it = iter(seq)
    result = tuple(itertools.islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

We will use the utility function below to convert a sequence of tokens into a list of pairs _(ingredient, is-composite)_:

In [175]:
def tokens_to_ingredients(tokens: t.Iterable[ingredients.Token]) -> t.Iterable[t.Tuple[str, bool]]:
    from ingredients import TokenType
    for t1, t2 in window(tokens, 2):
        if t1.type == TokenType.END:
            break
        # Discard tokens of a type other than 'FIELD':
        if t1.type == TokenType.FIELD:
            is_composite = (t2.type == TokenType.LPAR)  # see assumptions
            yield (t1.text, is_composite)
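The pairing logic can be tried out without the _ingredients_ module. In the sketch below, `TokenType` and `Token` are hypothetical stand-ins for the classes of the same name in _ingredients.py_, and `pair_with_next` is an assumed helper that plays the role of a sliding window of width 2.

```python
import enum
import itertools
import typing as t


class TokenType(enum.Enum):  # hypothetical stand-in for ingredients.TokenType
    FIELD = enum.auto()
    DELIM = enum.auto()
    LPAR = enum.auto()
    RPAR = enum.auto()
    END = enum.auto()


class Token(t.NamedTuple):  # hypothetical stand-in for ingredients.Token
    type: TokenType
    text: str


def pair_with_next(tokens: t.Iterable[Token]) -> t.Iterator[t.Tuple[Token, Token]]:
    """Pairs each token with its successor (a sliding window of width 2)."""
    first, second = itertools.tee(tokens)
    next(second, None)  # advance the second iterator by one token
    return zip(first, second)


tokens = [
    Token(TokenType.FIELD, 'tamari'),
    Token(TokenType.LPAR, '('),
    Token(TokenType.FIELD, 'soybeans'),
    Token(TokenType.DELIM, ','),
    Token(TokenType.FIELD, 'water'),
    Token(TokenType.RPAR, ')'),
    Token(TokenType.END, ''),
]

# A field is considered "composite" when it is directly followed by a left parenthesis.
pairs = [
    (t1.text, t2.type == TokenType.LPAR)
    for t1, t2 in pair_with_next(tokens)
    if t1.type == TokenType.FIELD
]
```

Here only _tamari_ is flagged as composite, because it is the only field that is immediately followed by an `LPAR` token.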

We will use the function below to convert the text of a list of ingredients into a list of pairs _(normalized-ingredient, is-composite)_:

In [197]:
def text_to_ingredients(text: str, delims: t.Sequence[str]) -> t.Iterable[t.Tuple[str, bool]]:
    from ingredients import tokenize, normalize
    yield from tokens_to_ingredients(normalize(tokenize(text, delims)))
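The `normalize` function is defined in _ingredients.py_ and is not shown in this notebook. As a rough idea of what normalizing an ingredient name may involve, the sketch below lower-cases the text, strips accents and collapses white space. The helper `normalize_text` is hypothetical and works on plain strings, whereas `ingredients.normalize` works on tokens, so the actual implementation may well differ.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lower-cases, strips accents and collapses white space (hypothetical helper)."""
    # NFKD decomposes an accented character into a base character + combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    without_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(without_accents.lower().split())


normalized = normalize_text('Eau gazéifiée')
```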

## Processing<a name="task-c-processing"></a>
---

We create a separate data-frame for each country and discard rows with NA values:

**/!\ TODO: Load the cleaned-up data-frame once it is available!**

In [177]:
def filter_country(country: str) -> pd.DataFrame:
    is_country = df['countries_tags'].str.contains(country, na=False, regex=False)
    df_country = df[is_country & df['ingredients_text'].notna()]
    return df_country[['ingredients_text']]
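The behaviour of this filter can be checked on a toy data-frame. The rows below are made up for illustration, including the assumption that a product sold in several countries carries a comma-separated list of tags.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'countries_tags': ['en:france', 'en:france,en:united-states', None, 'en:united-states'],
    'ingredients_text': ['sel', 'salt', 'sugar', None],
})

# Plain substring match; rows without a country tag are treated as "no match".
is_usa = toy['countries_tags'].str.contains('united-states', na=False, regex=False)
toy_usa = toy.loc[is_usa & toy['ingredients_text'].notna(), ['ingredients_text']]
```

Because the match is a substring match, a product tagged with several countries ends up in the data-frame of each of them. Rows without an ingredients text are dropped.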

In [178]:
df_usa = filter_country('united-states')

nrows, ncols = df_usa.shape
print(f"the dataset 'usa' contains {nrows} rows and {ncols} columns")

df_usa.head()

the dataset 'usa' contains 171874 rows and 1 columns


,ingredients_text
1,"Bananas, vegetable oil (coconut oil, corn oil ..."
2,"Peanuts, wheat flour, sugar, rice flour, tapio..."
3,"Organic hazelnuts, organic cashews, organic wa..."
4,Organic polenta
5,"Rolled oats, grape concentrate, expeller press..."


In [179]:
df_france = filter_country('france')

nrows, ncols = df_france.shape
print(f"the dataset 'france' contains {nrows} rows and {ncols} columns")

df_france.head()

the dataset 'france' contains 86743 rows and 1 columns


,ingredients_text
184,lentilles vertes
185,"Eau gazéifiée, sirop de maïs à haute teneur en..."
186,"Sucre, farine de _Blé_, graisse et huiles végé..."
190,Thé noir aromatisé à la fleur de violette et p...
191,"Thé noir de Chine, zestes d'oranges 7,5 %, arô..."


We convert each dataset into a data-frame with columns _(ingredient, is-composite)_:

In [201]:
def texts_to_ingredients(texts: pd.Series, delims: t.Sequence[str]) -> t.Iterable[t.Tuple[str, bool]]:
    for _, text in texts.items():  # 'iteritems' was removed in Pandas 2.0
        yield from text_to_ingredients(text, delims)

        
def texts_to_ingredients_df(texts: pd.Series, delims: t.Sequence[str]) -> pd.DataFrame:
    return pd.DataFrame.from_records(
        texts_to_ingredients(texts, delims), columns=['ingredient', 'is_composite'])

In [203]:
df_usa_ingredients = texts_to_ingredients_df(df_usa['ingredients_text'], delims_en)

nrows, ncols = df_usa_ingredients.shape
print(f"the dataset 'usa_ingredients' contains {nrows} rows and {ncols} columns")

df_usa_ingredients.head()

the dataset 'usa_ingredients' contains 2224133 rows and 2 columns


,ingredient,is_composite
0,bananas,False
1,vegetable oil,True
2,coconut oil,False
3,corn oil,False
4,palm oil,False


In [204]:
df_france_ingredients = texts_to_ingredients_df(df_france['ingredients_text'], delims_fr)

nrows, ncols = df_france_ingredients.shape
print(f"the dataset 'france_ingredients' contains {nrows} rows and {ncols} columns")

df_france_ingredients.head()

the dataset 'france_ingredients' contains 1131704 rows and 2 columns


,ingredient,is_composite
0,lentilles vertes,False
1,eau gazeifiee,False
2,sirop de mais a haute teneur en fructose,False
3,colorant caramel,False
4,conservateur e211,False


Some ingredients appear multiple times, sometimes marked as "simple" and sometimes marked as "composite". Some form of "reconciliation" is thus required. In order for an ingredient to be considered as "composite", we require that the fraction of instances marked as "composite" be greater than 0.5:

In [205]:
def reconcile_composite(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # Compute count and mean:
    df_reconciled = (
        df
        .groupby(by='ingredient')
        .agg(['count', 'mean'])  # can be applied to a column of type 'bool'
        .rename(columns={'count': 'count', 'mean': 'composite_mean'})
    )
    # Get rid of the multi-index:
    df_reconciled.columns = df_reconciled.columns.get_level_values(1)
    # Reconcile:
    df_reconciled['is_composite'] = (df_reconciled['composite_mean'] > threshold)
    df_reconciled = df_reconciled.drop(columns=['composite_mean'])
    return df_reconciled
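The reconciliation rule can be illustrated on a few made-up rows. The sketch below uses named aggregation as a compact equivalent of the function above; it is not a replacement for it.

```python
import pandas as pd

# Made-up occurrences, for illustration only.
toy = pd.DataFrame({
    'ingredient': ['vegetable oil', 'vegetable oil', 'vegetable oil', 'salt', 'salt'],
    'is_composite': [True, True, False, False, False],
})

# The mean of a boolean column is the fraction of 'True' values.
summary = toy.groupby('ingredient')['is_composite'].agg(count='count', composite_mean='mean')
summary['is_composite'] = summary['composite_mean'] > 0.5
```

Here _vegetable oil_ is marked as composite in 2 out of 3 occurrences and is therefore reconciled as "composite", while _salt_ stays "simple". Since the comparison is strict, an ingredient with a fraction of exactly 0.5 is considered "simple".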

In [233]:
df_usa_final = reconcile_composite(df_usa_ingredients)
df_france_final = reconcile_composite(df_france_ingredients)

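We can easily compute the prevalence of each ingredient, i.e. the percentage of products that contain that ingredient. We approximate it by the number of occurrences of the ingredient divided by the number of products: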

In [257]:
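def compute_prevalence(df: pd.DataFrame, nproducts: int) -> 'pd.io.formats.style.Styler':
    # Add the column on a copy, so that the caller's data-frame is not modified:
    df = df.assign(prevalence=df['count'] / nproducts)
    # Note: the result is a Styler (meant for display), not a plain data-frame.
    return df.style.format({
        'prevalence': lambda n: f'{n*100:.2f} %'
    })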
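One caveat is that dividing an occurrence count by the number of products over-estimates the prevalence whenever an ingredient is listed more than once for the same product. The sketch below contrasts the two ways of counting on made-up data. It assumes a product identifier (`product_id`) is kept alongside each ingredient, which is not the case in the pipeline of this notebook.

```python
import pandas as pd

# Hypothetical long-format table: one row per ingredient occurrence.
occurrences = pd.DataFrame({
    'product_id': [1, 1, 1, 2, 3],
    'ingredient': ['salt', 'sugar', 'salt', 'salt', 'sugar'],
})
nproducts = occurrences['product_id'].nunique()

# Occurrence-based: 'salt' is counted twice for product 1.
by_occurrence = occurrences['ingredient'].value_counts() / nproducts

# Product-based: each ingredient counts at most once per product.
by_product = occurrences.groupby('ingredient')['product_id'].nunique() / nproducts
```

In this toy example the occurrence-based figure for _salt_ is 3/3, whereas only 2 of the 3 products actually contain it.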

We find the 5 most common "simple" ingredients:

In [260]:
def top_n_simple(df: pd.DataFrame, nproducts: int) -> 'pd.io.formats.style.Styler':
    df_result = df[~df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return compute_prevalence(df_result, nproducts)

In [261]:
top_n_simple(df_usa_final, len(df_usa))

ingredient,count,prevalence
salt,104185,60.62 %
sugar,77324,44.99 %
water,71782,41.76 %
citric acid,34816,20.26 %
riboflavin,22454,13.06 %


In [262]:
top_n_simple(df_france_final, len(df_france))

ingredient,count,prevalence
sel,51797,59.71 %
sucre,37257,42.95 %
eau,35605,41.05 %
farine de ble,13053,15.05 %
huile de tournesol,9063,10.45 %


These are also the 5 most common ingredients overall:

In [263]:
def top_n_overall(df: pd.DataFrame, nproducts: int) -> 'pd.io.formats.style.Styler':
    df_result = df.nlargest(5, columns=['count'])[['count']]
    return compute_prevalence(df_result, nproducts)

In [264]:
top_n_overall(df_usa_final, len(df_usa))

ingredient,count,prevalence
salt,104185,60.62 %
sugar,77324,44.99 %
water,71782,41.76 %
citric acid,34816,20.26 %
riboflavin,22454,13.06 %


In [265]:
top_n_overall(df_france_final, len(df_france))

ingredient,count,prevalence
sel,51797,59.71 %
sucre,37257,42.95 %
eau,35605,41.05 %
farine de ble,13053,15.05 %
huile de tournesol,9063,10.45 %


We find the 5 most common "composite" ingredients:

In [266]:
def top_n_composite(df: pd.DataFrame, nproducts: int) -> 'pd.io.formats.style.Styler':
    df_result = df[df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return compute_prevalence(df_result, nproducts)

In [267]:
top_n_composite(df_usa_final, len(df_usa))

ingredient,count,prevalence
ascorbic acid,8938,5.20 %
vegetable oil,7922,4.61 %
potassium sorbate,7681,4.47 %
butter,5626,3.27 %
sodium benzoate,5380,3.13 %


In [269]:
top_n_composite(df_france_final, len(df_france))

ingredient,count,prevalence
emulsifiant,2781,3.21 %
legumes,1624,1.87 %
huiles vegetales,1614,1.86 %
colorant,1461,1.68 %
acidifiant,1202,1.39 %


In [272]:
display_allcols_notrunc(df_usa[df_usa['ingredients_text'].str.contains('ascorbic acid')])

,ingredients_text
200,"Wheat flour, butter (cream), water, yeast, sugar, salt, eggs, wheat gluten, ascorbic acid, enzymes."
202,"Enriched wheat flour (wheat flour niacin, reduced iron, thiamine mononitrate, riboflavin, folic acid) malted barley flour, water, soybean oil, wheat gluten, yeast, sugar, salt, calcium sulfate, ascorbic acid, enzymes, sorbic acid, vegetable oil, monoglyce"
212,"Enriched wheatflour (wheat flour, niacin, reduced iron, thiamin mononitrate, riboflavin, folic acid), water, yeast, vegetable oil, salt, dough improver (yeast, wheatflour. enzymes, soybean oil, ascorbic acid)."
215,"Pastry: unbleached wheat flour (wheat flour, malted barley flour, niacin, iron, thiamine mononitrate, riboflavin, folic acid), water, non hydrogenated palm oil, sugar, salt, ascorbic acid, beta carotene. filling: cherries, sugar, water, corn syrup, modifi"
216,"Pastry: unbleached wheat flour (wheat flour, malted barley flour, niacin, iron, thiamine mononitrate, riboflavin, folic acid), water, non hydrogenated palm oil, sugar, salt, ascorbic acid, beta carotene. filling: apples, sugar, water, corn syrup, modified"
218,"Pastry dough [unbleached enriched wheat flour (wheat flour, malted barley flour, ascorbic acid, niacin, thiamin mononitrate, riboflavin, folic acid), water, shortening (palm oil, beta-carotene [for color]), sugar, salt], fruit filling [water, sugar, apple"
219,"Wheat flour, butter, water, chocolate (sugar, cocoa mass, cocoa butter, natural vanilla flavor, soy lecithin), sugar, yeast, whole milk powder, salt, wheat gluten, rapeseed oil lecithin, egg wash (eggs, water), ascorbic acid, enzymes."
338,"Water, xylitol, citric acid, lactic acid, apple juice concentrate*, ascorbic acid, malic acid, natural and artificial flavors, sucralose, blueberry juice concentrate, sodium benzoate and potassium sorbate (as a preservative)."
339,"Water, glycerin, citric acid, xylitol, apple juice concentrate*, ascorbic acid, xanthan gum, sucralose, malic acid, natural and artificial flavors, sodium benzoate and potassium sorbate (as preservatives), yellow #5, blue #1."
359,"Concentrated apple puree, concentrated apple juice, ascorbic acid (vitamin c), natural pineapple flavor, natural mango flavor, concentrated black carrot juice (for color), pectin, coconut oil, shellac, beeswax."
