# Course 2: Project - Task C - Text data

<a name="top"></a>
This notebook is concerned with task C.

**Contents:**
* [Imports](#task-c-imports)
* [Data loading](#task-c-data-loading)
* [Data exploration](#task-c-data-exploration)
* [Implementation & execution](#task-c-implementation-and-execution)
* [Results](#task-c-results)

## Imports<a name="task-c-imports"></a> ([top](#top))
---

In [3]:
pathlib.Path.cwd()

PosixPath('/Users/taariet1/ContEd/Adsml/git/course-02-project')

In [14]:
# Standard library:
import itertools
import pathlib
import re
import typing as t
import unicodedata

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pandas.io.formats.style
import seaborn as sns

# Project:
import ingredients
import utils

%matplotlib inline

## Data loading<a name="task-c-data-loading"></a> ([top](#top))
---

First, we load the subset of the cleaned-up dataset that we need:

In [7]:
base_name = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.clean')

In [8]:
# The columns to load:
usecols = ['countries_tags', 'ingredients_text']

# Load:
data_types, parse_dates = utils.amend_dtypes(utils.load_dtypes(base_name))
# We can only parse dates in the columns that we are loading:
parse_dates = list(set(parse_dates) & set(usecols))
df = pd.read_csv(
    f'{base_name}.csv',
    header=0,
    parse_dates=parse_dates,
    usecols=usecols,
    dtype=data_types,
)

In [9]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 355569 rows and 2 columns


Here are the first few rows:

In [10]:
df.head()

|   | countries_tags | ingredients_text |
|---|---|---|
| 0 | en:france | NaN |
| 1 | en:united-states | Bananas, vegetable oil (coconut oil, corn oil ... |
| 2 | en:united-states | Peanuts, wheat flour, sugar, rice flour, tapio... |
| 3 | en:united-states | Organic hazelnuts, organic cashews, organic wa... |
| 4 | en:united-states | Organic polenta |


## Data exploration<a name="task-c-data-exploration"></a> ([top](#top))
---

**Observations:** After manually inspecting multiple records (spot-checking), we noticed the following things:

* **Content:** The majority of entries seem to be valid entries that contain a list of ingredients. Some invalid entries were clearly created on purpose (e.g. for *code: 355951*, we have *product\_name:* _Ma b***_ (English: _My d***_)). Other invalid entries might be the result of a genuine mix-up (e.g. for *code: 355945*, we have *product\_name:* _La pratique du vocabulaire allemand_ (English: _German vocabulary_)).

* **Language:** In the majority of cases, the list of ingredients seems to use a single language. In some cases it uses two or more languages (e.g. for *code: 930*, it uses English, French and Dutch). In the majority of cases the *countries\_tags* column seems to be a good proxy for the language used.

* **Format:** The "ideal" underlying format seems to be a comma-delimited list of ingredients, where each ingredient is either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion), enclosed in matching parentheses. Very few entries conform to this "ideal" format, though.
    * **Parentheses:** In some cases they are used to specify the percentage of an ingredient (e.g. for *code: 155*, we have _"[...] milk chocolate (32%) (sugar, cocoa butter [...]"_) or to provide clarification (e.g. for *code: 24* we have _"[...] soy lecithin (an emulsifier) [...]"_). Different types of parentheses are used, such as round brackets and square brackets.
    * **Delimiters:** Different types of delimiters are used, such as commas, colons and bullets, to name a few.

**Assumptions:** In order to keep complexity under control, we make the following assumptions:
* Entries are valid and contain a list of ingredients.
* Entries use a single language. The *countries\_tags* column is a good proxy for the language used.
* The format is a delimited list of ingredients (that uses some type of delimiter), where each ingredient is either "simple" or "composite". A "composite" ingredient is followed by the list of its own ingredients (recursion). The list in question starts with some type of opening parenthesis and contains at least one ingredient.

We expect entries that do not conform to the above to be "infrequent enough" so as to not impact the final result.

**Scope:** Again in order to keep complexity under control, we make the following decision:
* We will limit ourselves to the **USA** and **France**. Each is a rather large country with a single dominant language that we understand. (The USA has no official language, but English is the de facto one. For France, French is the official language.)

## Implementation & execution<a name="task-c-implementation-and-execution"></a> ([top](#top))
---

### Implementation

In order not to clutter the notebook, most of the code is in a separate module - _ingredients.py_.

**Tokenization:** The first step is to split a given list of ingredients into tokens. For that we decided to hand-roll our own tokenizer. Another way would have been to use NLTK (e.g. the [nltk.chunk](https://www.nltk.org/api/nltk.chunk.html) package) or another 3rd party library. The code is quite short (< 150 lines). We use 2 phases.

**Phase 1:** Language agnostic. We use regular expressions to split the text into tokens:
* `SPACE`: A sequence of 1+ white spaces
* `DELIM`: A delimiter
* `LPAR`: A left parenthesis
* `RPAR`: A right parenthesis
* `FIELD`: This is everything in-between tokens of the above types. If a _'.'_ or a _','_ is surrounded by digits, we consider it as part of a decimal number and do not treat it as a delimiter
* `END`: This is an additional type, used to communicate that we reached the end-of-input
* `INVALID`: This is an additional type, used to communicate that we encountered an error (and the sequence of characters that had to be skipped until we could re-synchronize with the input)
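As a rough illustration of phase 1, the token types above could be recognised with a single regular expression, along the following lines. This is a minimal sketch and not the code in _ingredients.py_: the `Token` tuple, the `tokenize_phase1` name and the exact character sets chosen for parentheses and delimiters are assumptions made for the example. `SPACE` tokens are simply skipped, and `INVALID` is never produced here because every character falls into one of the classes.

```python
import re
import typing as t


class Token(t.NamedTuple):
    """Hypothetical token: a type name plus the [start, end) span in the text."""
    type: str
    start: int
    end: int


# Assumed character sets; the real module may recognise more (or fewer) symbols.
SPECIAL = r'()\[\]{}.,;:•'
# A '.' or ',' surrounded by digits belongs to a decimal number, not a delimiter.
DECIMAL_MARK = r'(?<=\d)[.,](?=\d)'
NONSPACE = rf'(?:[^\s{SPECIAL}]|{DECIMAL_MARK})'
ANY = rf'(?:[^{SPECIAL}]|{DECIMAL_MARK})'

PATTERNS = [
    ('SPACE', r'\s+'),
    ('LPAR', r'[(\[{]'),
    ('RPAR', r'[)\]}]'),
    # A field starts and ends on a non-space character but may contain spaces.
    ('FIELD', rf'{NONSPACE}(?:{ANY}*{NONSPACE})?'),
    ('DELIM', r'[.,;:•]'),
]
TOKEN_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in PATTERNS))


def tokenize_phase1(text: str) -> t.Iterator[Token]:
    """Yield language-agnostic tokens, skipping white space, then an END token."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match.lastgroup != 'SPACE':
            yield Token(match.lastgroup, match.start(), match.end())
        pos = match.end()
    yield Token('END', pos, pos)
```

With this sketch, the text `'tamari (soybeans, water 7,5 %)'` is split into the fields `tamari`, `soybeans` and `water 7,5 %`: the comma inside `7,5` stays part of the field, whereas the comma after `soybeans` becomes a `DELIM` token.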

**Phase 2:** Language-specific. Tokens of type `FIELD` are further split on language-specific delimiters (e.g. _and_, _or_, for English; _et_, _ou_ for French).

In [11]:
# Order matters due to substrings:
delims_en = ['and/or', 'or/and', 'and', 'or']  
delims_fr = ['et/ou', 'ou/et', 'et',  'ou']
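To make phase 2 more concrete, here is one possible way to split a `FIELD` on such word delimiters. It is only a sketch under assumptions: the helper name `split_field` is hypothetical, and requiring white space on both sides of a delimiter is a choice made for this example so that words such as _sorbate_ or _candy_ are left intact.

```python
import re
import typing as t


def split_field(field: str, delims: t.Sequence[str]) -> t.List[str]:
    """Split a field on whole-word delimiters such as 'and' or 'et/ou'."""
    # Alternatives are tried in the order given by the caller.
    alternation = '|'.join(re.escape(d) for d in delims)
    # White space is required on both sides, so 'or' inside 'sorbate' is ignored.
    pattern = re.compile(rf'\s+(?:{alternation})\s+', re.IGNORECASE)
    return [part.strip() for part in pattern.split(field) if part.strip()]


print(split_field('water and salt', ['and/or', 'or/and', 'and', 'or']))
print(split_field('graisse et/ou huiles', ['et/ou', 'ou/et', 'et', 'ou']))
```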

Here is a small example (for each token, we print its type and highlight its position in the text with `[` and `]`):

In [12]:
text = 'Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt), garlic and cayenne.'

for token in ingredients.tokenize(text, delims_en):
    print(f"{token.type.name:5} | {ingredients.highglight_token(text, token, '[', ']')}")

FIELD | [Organic dry roasted pumpkin seeds], tamari (soybeans, water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds[,] tamari (soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, [tamari] (soybeans, water and salt), garlic and cayenne.
LPAR  | Organic dry roasted pumpkin seeds, tamari [(]soybeans, water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari ([soybeans], water and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans[,] water and salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, [water] and salt), garlic and cayenne.
DELIM | Organic dry roasted pumpkin seeds, tamari (soybeans, water [and] salt), garlic and cayenne.
FIELD | Organic dry roasted pumpkin seeds, tamari (soybeans, water and [salt]), garlic and cayenne.
RPAR  | Organic dry roasted pumpkin seeds, tamari (soybeans, water and salt[)], garlic and cayenne.


**Normalization:** We perform the following steps to normalize the names of ingredients:
- Convert to lower-case
- Remove numbers
- Replace punctuation marks by a single space
- Remove accents
- Replace a sequence of 1+ white spaces by a single space
- Strip leading and trailing white spaces
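The steps above can be sketched with the standard `re` and `unicodedata` modules. This is an illustrative version only: the function name `normalize_name` is hypothetical and the actual code in _ingredients.py_ may differ in its details (for instance in what exactly counts as punctuation).

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Apply the normalization steps in the order listed above."""
    name = name.lower()                          # lower-case
    name = re.sub(r'\d+', '', name)              # remove numbers
    name = re.sub(r'[^\w\s]|_', ' ', name)       # punctuation -> single space
    # Remove accents: decompose characters, then drop the combining marks.
    name = unicodedata.normalize('NFKD', name)
    name = ''.join(ch for ch in name if not unicodedata.combining(ch))
    name = re.sub(r'\s+', ' ', name)             # collapse runs of white space
    return name.strip()                          # strip both ends


print(normalize_name("zestes d'oranges 7,5 %"))
```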

Here is a small example:

In [17]:
text = "Thé noir de Chine, zestes d'oranges 7,5 %, arômes naturels (cannelle 4,7 %, orange 4,7 %, poudre de cannelle 3,9 %)."

ingredients.texts_to_ingredients_df([text], delims_fr)

|   | ingredient | is_composite |
|---|---|---|
| 0 | the noir de chine | 0 |
| 1 | zestes d oranges | 0 |
| 2 | aromes naturels | 1 |
| 3 | cannelle | 0 |
| 4 | orange | 0 |
| 5 | poudre de cannelle | 0 |


### Execution

We create a separate data-frame for each country and discard rows with NA values:

In [18]:
def filter_country(country: str) -> pd.DataFrame:
    is_country = df['countries_tags'].str.contains(country, na=False, regex=False)
    df_country = df[is_country & df['ingredients_text'].notna()]
    return df_country[['ingredients_text']]

In [19]:
df_usa = filter_country('united-states')
nrows, ncols = df_usa.shape
print(f"the dataset 'usa' contains {nrows} rows and {ncols} columns")
df_usa.head()

the dataset 'usa' contains 171606 rows and 1 columns


|   | ingredients_text |
|---|---|
| 1 | Bananas, vegetable oil (coconut oil, corn oil ... |
| 2 | Peanuts, wheat flour, sugar, rice flour, tapio... |
| 3 | Organic hazelnuts, organic cashews, organic wa... |
| 4 | Organic polenta |
| 5 | Rolled oats, grape concentrate, expeller press... |


In [20]:
df_france = filter_country('france')
nrows, ncols = df_france.shape
print(f"the dataset 'france' contains {nrows} rows and {ncols} columns")
df_france.head()

the dataset 'france' contains 86673 rows and 1 columns


|   | ingredients_text |
|---|---|
| 184 | lentilles vertes |
| 185 | Eau gazéifiée, sirop de maïs à haute teneur en... |
| 186 | Sucre, farine de \_Blé\_, graisse et huiles végé... |
| 189 | Thé noir aromatisé à la fleur de violette et p... |
| 190 | Thé noir de Chine, zestes d'oranges 7,5 %, arô... |


We convert each dataset into a data-frame with columns _(ingredient, is-composite)_:

In [21]:
df_usa_ingredients = ingredients.texts_to_ingredients_df(df_usa['ingredients_text'], delims_en)
df_france_ingredients = ingredients.texts_to_ingredients_df(df_france['ingredients_text'], delims_fr)
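The _is-composite_ flag follows from the assumption made earlier: an ingredient counts as "composite" when an opening parenthesis comes right after it. The snippet below sketches that rule on a simplified token stream. The `(type, text)` tuples and the `mark_composites` helper are hypothetical stand-ins for illustration, not the interface of _ingredients.py_, and the sketch does not check that the parenthesised list actually contains an ingredient.

```python
import typing as t

# Hypothetical simplified token: (type name, text), with white space already dropped.
SimpleToken = t.Tuple[str, str]


def mark_composites(tokens: t.Sequence[SimpleToken]) -> t.List[t.Tuple[str, bool]]:
    """Return (ingredient, is_composite) pairs for every FIELD token."""
    rows = []
    for i, (kind, text) in enumerate(tokens):
        if kind != 'FIELD':
            continue
        # A field is composite when the very next token opens a parenthesis.
        next_kind = tokens[i + 1][0] if i + 1 < len(tokens) else 'END'
        rows.append((text, next_kind == 'LPAR'))
    return rows


tokens = [
    ('FIELD', 'aromes naturels'), ('LPAR', '('),
    ('FIELD', 'cannelle'), ('DELIM', ','),
    ('FIELD', 'orange'), ('RPAR', ')'),
    ('END', ''),
]
print(mark_composites(tokens))
```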

**Note:** Some ingredients appear multiple times, sometimes marked as "simple" and sometimes marked as "composite". Some form of "reconciliation" is thus required. For an ingredient to be considered "composite", we require that the fraction of its instances marked as "composite" be greater than 0.5.

In [25]:
df_usa_final = ingredients.reconcile_composite(df_usa_ingredients, 0.5)
df_france_final = ingredients.reconcile_composite(df_france_ingredients, 0.5)
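For readers who want to see the idea behind the reconciliation step, it can be approximated with a pandas `groupby`. This is a sketch under assumptions: the function name `reconcile_sketch` is hypothetical, and the output layout (one row per ingredient, with a `count` column and a boolean `is_composite` column) is inferred from how the result is used later in the notebook rather than taken from _ingredients.py_.

```python
import pandas as pd


def reconcile_sketch(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Collapse repeated ingredients into one row each (illustrative only)."""
    grouped = df.groupby('ingredient')['is_composite']
    return pd.DataFrame({
        # Number of occurrences of the ingredient:
        'count': grouped.size(),
        # Composite only if the fraction of composite instances exceeds the threshold:
        'is_composite': grouped.mean() > threshold,
    })


sample = pd.DataFrame({
    'ingredient': ['salt', 'salt', 'salt', 'butter', 'butter', 'butter'],
    'is_composite': [0, 0, 1, 1, 1, 0],
})
print(reconcile_sketch(sample))
```

In this toy input, _salt_ is marked composite in one of three instances and therefore ends up "simple", whereas _butter_ is marked composite in two of three instances and ends up "composite".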

Finally, we can easily compute the prevalence of each ingredient, i.e. the percentage of products that contain that ingredient:

In [26]:
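def compute_prevalence(df: pd.DataFrame, nproducts: int) -> pd.DataFrame:
    # Return a copy with the new column instead of mutating the caller's frame
    # (this avoids pandas' SettingWithCopyWarning when df is a slice):
    return df.assign(prevalence=df['count'] / nproducts)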


# Custom format for the 'prevalence' column:
def styled(df: pd.DataFrame) -> pd.io.formats.style.Styler:
    df = df.style.format({
        'prevalence': lambda n: f'{n*100:.2f} %'
    })
    return df

## Results<a name="task-c-results"></a> ([top](#top))
---

We find the 5 most common "simple" ingredients:

In [27]:
def top_n_simple(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df[~df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))

In [28]:
top_n_simple(df_usa_final, len(df_usa))

| ingredient | count | prevalence |
|---|---|---|
| salt | 104084 | 60.65 % |
| sugar | 77268 | 45.03 % |
| water | 71744 | 41.81 % |
| citric acid | 34786 | 20.27 % |
| riboflavin | 22442 | 13.08 % |


In [29]:
top_n_simple(df_france_final, len(df_france))

| ingredient | count | prevalence |
|---|---|---|
| sel | 51781 | 59.74 % |
| sucre | 37245 | 42.97 % |
| eau | 35591 | 41.06 % |
| farine de ble | 13048 | 15.05 % |
| huile de tournesol | 9060 | 10.45 % |


**Comment:** Salt, sugar and water top the ranking for both USA and France and are present in a large percentage of all products.

The 5 most common "simple" ingredients are also the 5 most common ingredients overall:

In [30]:
def top_n_overall(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df.nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))

In [32]:
top_n_overall(df_usa_final, len(df_usa))

| ingredient | count | prevalence |
|---|---|---|
| salt | 104084 | 60.65 % |
| sugar | 77268 | 45.03 % |
| water | 71744 | 41.81 % |
| citric acid | 34786 | 20.27 % |
| riboflavin | 22442 | 13.08 % |


In [31]:
top_n_overall(df_france_final, len(df_france))

| ingredient | count | prevalence |
|---|---|---|
| sel | 51781 | 59.74 % |
| sucre | 37245 | 42.97 % |
| eau | 35591 | 41.06 % |
| farine de ble | 13048 | 15.05 % |
| huile de tournesol | 9060 | 10.45 % |


Finally, we find the 5 most common "composite" ingredients:

In [33]:
def top_n_composite(df: pd.DataFrame, nproducts: int) -> pd.io.formats.style.Styler:
    df_result = df[df['is_composite']].nlargest(5, columns=['count'])[['count']]
    return styled(compute_prevalence(df_result, nproducts))

In [34]:
top_n_composite(df_usa_final, len(df_usa))

| ingredient | count | prevalence |
|---|---|---|
| ascorbic acid | 8935 | 5.21 % |
| vegetable oil | 7915 | 4.61 % |
| potassium sorbate | 7678 | 4.47 % |
| butter | 5620 | 3.27 % |
| sodium benzoate | 5376 | 3.13 % |


In [36]:
top_n_composite(df_france_final, len(df_france))

| ingredient | count | prevalence |
|---|---|---|
| emulsifiant | 2776 | 3.20 % |
| legumes | 1624 | 1.87 % |
| huiles vegetales | 1607 | 1.85 % |
| colorant | 1461 | 1.69 % |
| acidifiant | 1202 | 1.39 % |
