# Formulation Symbolic Language

## Dependencies
FSL is based on the "Formulate" library, available at https://github.com/l0d0v1c/formulate, and on the local module FSL.py.

In [23]:
!pip install "https://github.com/l0d0v1c/formulate/blob/main/dist/formulate-1.3-py3-none-any.whl?raw=true"
!pip install pandas
# Comment out the two lines above if the packages are already installed
import pandas as pd
from formulate.components import components
from FSL import formulationsymboliclanguage

## Purpose
FSL is a language focused on formulation description and deep learning. A formulation is a list of ingredients and quantities. FSL transforms this recipe into a string, inspired by the SMILES notation used to represent molecules. These strings may be used, for instance, to train a deep autoencoder and generate new formulations from existing ones.
## Encoding process
Ingredients can be either major or minor. Major components are the ones usually present in significant amounts; minor ones are usually additives used to modify properties of the formulation, such as colouring or viscosity agents. Major ingredients are encoded with the Latin alphabet and minor ones with the Greek alphabet. To be included in FSL, each FORMULATE object must embed a minor <True|False> property.
### Example
Consider air as a mixture of two major ingredients, oxygen and nitrogen, with water as a minor additive.

In [24]:
c=components(physical={"∆Hf":True,"rho":None,"minor":None})
c.add("Water","H2O",{'∆Hf':-285.83,"rho":1.0,'minor':True})
c.add("Nitrogen","N2",{'∆Hf':0,"rho":0.01,'minor':False})
c.add("Oxygen","O2",{'∆Hf':0,"rho":0.01,'minor':False})
c.setrates({"Water":0.01,"Oxygen":0.19,'Nitrogen':0.8})
c.mixing()

Unnamed: 0,Component,Rate,N,O,H,∆Hf,rho,minor
0,Water,0.01,0.0,55.508,111.017,-15865.97,1.0,1
1,Nitrogen,0.8,71.394,0.0,0.0,0.0,0.01,0
2,Oxygen,0.19,0.0,62.502,0.0,0.0,0.01,0
3,Formulation,1.0,57.1152,12.43046,1.11017,-158.6597,Non additive,Non additive


We can now encode the air formulation as

In [25]:
from IPython.display import display, HTML
f=formulationsymboliclanguage([c])
e=f.encode([c])
display(HTML(f"<span style='font-size:3em'>{e[0]}</span>"))

The dictionary of ingredients is

In [26]:
f.dict

{'Water': 'α', 'Nitrogen': 'A', 'Oxygen': 'B'}
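
The mapping above is consistent with a simple rule: walk through the ingredients in the order they were added and hand out the next free letter from the Latin list (major) or the Greek list (minor). The sketch below illustrates that rule. `assign_symbols` is a hypothetical helper written for this illustration, and FSL.py may well proceed differently.

```python
MAJOR = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
MINOR = list("αβγδεζηθικλμνξοπρστυφχψω")

def assign_symbols(ingredients):
    """Map each (name, is_minor) pair to the next free letter.

    Hypothetical helper: ingredients that do not fit in the
    alphabet are simply left out of the mapping.
    """
    major, minor = iter(MAJOR), iter(MINOR)
    mapping = {}
    for name, is_minor in ingredients:
        if name in mapping:
            continue  # this ingredient already has a symbol
        symbol = next(minor if is_minor else major, None)
        if symbol is not None:
            mapping[name] = symbol
    return mapping

# Same ingredients, in the same order, as the air example
air = [("Water", True), ("Nitrogen", False), ("Oxygen", False)]
symbols = assign_symbols(air)
print(symbols)
```

With this rule, an ingredient that arrives after the alphabet is exhausted receives no symbol, which matches the behaviour described for rare minor ingredients later in this document.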

### Formulation list with several quantities
To train an autoencoder we need a list of formulations that share the same ingredients in varying quantities. When initialising FSL you can define a "dose". In formulation recipes, the quantity of each component is often given in units (oz, parts, ...). FSL uses the same representation:

    formulationsymboliclanguage(formulae,granulo=5)

means that, for each ingredient, the range between the minimum and the maximum quantity is split into 5 doses. So CCCD means 3 doses of C and one dose of D. Minor components are represented by a single letter only.
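
As a rough illustration of the dose idea, the sketch below computes a dose size as (max - min) / granulo and repeats each major letter once per dose. The helper names, the rounding rule (`round(quantity / dose)`, with at least one letter per major ingredient) and the sample quantities are all assumptions made for this example; they are not taken from FSL.py.

```python
def dose_size(observed, granulo=5):
    """Size of one dose: range of the observed quantities split into granulo parts."""
    return (max(observed) - min(observed)) / granulo

def encode_recipe(recipe, symbols, deltas, minors):
    """Turn {ingredient: quantity} into a string of repeated letters."""
    parts = []
    for name, quantity in recipe.items():
        if name in minors:
            parts.append(symbols[name])  # minor: a single letter
        else:
            doses = max(1, round(quantity / deltas[name]))
            parts.append(symbols[name] * doses)  # major: one letter per dose
    return "".join(parts)

# Hypothetical quantities seen across a recipe book
deltas = {
    "rum": dose_size([0.0, 0.2, 0.5]),   # 0.5 / 5 = 0.1
    "juice": dose_size([0.0, 0.5]),      # 0.5 / 5 = 0.1
}
symbols = {"rum": "C", "juice": "D", "bitters": "α"}
recipe = {"rum": 0.3, "juice": 0.1, "bitters": 0.001}

fsl_string = encode_recipe(recipe, symbols, deltas, minors={"bitters"})
print(fsl_string)
```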

Let's try encoding a cocktail recipe book

In [5]:
import pandas as pd
df=pd.read_excel("cocktails.xlsx")
df.head()

Unnamed: 0.1,Unnamed: 0,nom,categ,i1,d1,i2,d2,i3,d3,i4,d4,i5,d5,i6,d6
0,0,Gauguin,Cocktail Classics,Light Rum,2.0,Passion Fruit Syrup,1.0,Lemon Juice,1.0,Lime Juice,1.0,,,,
1,1,Fort Lauderdale,Cocktail Classics,Light Rum,1.5,Sweet Vermouth,0.5,Juice of Orange,0.25,Juice of a Lime,0.25,,,,
2,2,Apple Pie,Cordials and Liqueurs,Apple schnapps,3.0,Cinnamon schnapps,1.0,,,,,,,,
3,3,Cuban Cocktail No. 1,Cocktail Classics,Juice of a Lime,0.5,Powdered Sugar,0.5,Light Rum,2.0,,,,,,
4,4,Cool Carlos,Cocktail Classics,Dark rum,1.5,Cranberry Juice,2.0,Pineapple Juice,2.0,Orange curacao,1.0,Sour Mix,1.0,,


Now we have to transform this sheet into a list of formulations. Many ingredients appear in only a few recipes, which makes them unusable for deep learning training. We therefore limit the list of major ingredients to those used in more than 30 recipes. The rare ingredients are represented as minors.

In [27]:
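from collections import Counter

# Count how many recipes use each ingredient (columns i1..i6)
ingredients=[]
for i in range(1,7):
    ingredients.extend(df[f"i{i}"].tolist())
ingredients=Counter(ingredients)

# Ingredients used in more than 30 recipes are treated as major
composant={}
for name,cnt in ingredients.items():
    if cnt>30:
        composant[name]={'minor':False}
print(f"based on {len(composant)} ingredients")

# Build one components object per recipe
listcompo=[]
for i,j in df.iterrows():
    try:
        cp=components(physical={"minor":None})
        rates={}
        for k in range(1,7):
            if pd.notna(j[f"d{k}"]) and pd.notna(j[f"i{k}"]): # skip empty slots
                name=j[f"i{k}"]
                if name in composant:
                    # major ingredient: keep the quantity given in the recipe
                    cp.add(name,"",{'minor':False})
                    rates[name]=j[f"d{k}"]
                else:
                    # rare ingredient: flag it as minor with a token quantity
                    cp.add(name,"",{'minor':True})
                    rates[name]=0.001
        cp.setrates(rates)
        cp.mixing()
    except Exception:
        # a recipe that fails to mix is still appended, so indices match df rows
        pass
    listcompo.append(cp)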

based on 23 ingredients


For instance, we can inspect the first cocktail

In [28]:
listcompo[0].formulationlist

Unnamed: 0,Component,Rate,minor
0,Light Rum,0.666,0
1,Passion Fruit Syrup,0.0,1
2,Lemon Juice,0.333,0
3,Lime Juice,0.0,1
4,Formulation,1.0,Non additive
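
The rates in this table look like the raw quantities divided by their sum. This is an assumption about what `mixing()` does, inferred from the Formulation row summing to 1.0. Under that assumption the figures can be reproduced by hand from the Gauguin row of the spreadsheet: the two major ingredients keep their quantities (2.0 and 1.0) and the two minor ones receive the token quantity 0.001.

```python
# Raw quantities for the first cocktail, as set by the loop above
raw = {
    "Light Rum": 2.0,
    "Passion Fruit Syrup": 0.001,  # minor, token quantity
    "Lemon Juice": 1.0,
    "Lime Juice": 0.001,           # minor, token quantity
}

# Assumed normalisation: each quantity divided by the total
total = sum(raw.values())
rates = {name: round(q / total, 3) for name, q in raw.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}")
```

Rounded to three decimals, the minor ingredients show up as 0.0, which explains why they seem to vanish from the table even though they are still part of the formulation.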


Then encode the full recipe book

In [29]:
cocktails=formulationsymboliclanguage(listcompo,granulo=10,verbose=False)

Because the number of minor ingredients is limited to the length of the Greek alphabet, some of them are not encoded. Longer alphabets can be used by changing the lists

    formulationsymboliclanguage.major=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    formulationsymboliclanguage.minor=list('αβγδεζηθικλμνξοπρστυφχψω')

We can now get an encoded training set. To display the ingredients that could not be encoded, specify verbose=True.
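
One possible way to build longer alphabets is sketched below, using only the standard library. The choice of lower-case Latin letters for extra major symbols and Cyrillic letters for extra minor symbols is purely illustrative, and whether FSL.py handles these characters correctly has not been verified here.

```python
import string

# Major symbols: upper-case then lower-case Latin letters
major = list(string.ascii_uppercase + string.ascii_lowercase)

# Minor symbols: lower-case Greek (final sigma skipped) then lower-case Cyrillic
greek = [chr(c) for c in range(0x3B1, 0x3CA) if c != 0x3C2]
cyrillic = [chr(c) for c in range(0x430, 0x450)]
minor = greek + cyrillic

# Every symbol must be unique across both lists,
# otherwise a decoded string would be ambiguous
assert len(set(major + minor)) == len(major) + len(minor)

print(len(major), "major symbols,", len(minor), "minor symbols")
```

The resulting lists could then be assigned to the two alphabet attributes shown above. Note that some Cyrillic letters look identical to Latin ones on screen even though they are distinct code points, so the encoded strings become harder to read by eye.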
    

In [30]:
encoded=cocktails.encode(listcompo)

The first cocktail is encoded as 

In [34]:
display(HTML(f"<div style='font-size:3em;'>Encoded recipe 0 : {encoded[0]}</div>"))
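
Before such strings can be fed to an autoencoder they usually have to be turned into fixed-length numeric sequences. The sketch below shows one common approach (character-level integer encoding with right padding). The sample strings, the padding character and the helper names are hypothetical and are not part of FSL.

```python
def build_vocab(strings, pad="_"):
    """Assign an integer to every character; 0 is reserved for padding."""
    chars = sorted(set("".join(strings)))
    vocab = {pad: 0}
    for index, char in enumerate(chars, start=1):
        vocab[char] = index
    return vocab

def vectorise(strings, vocab, pad="_"):
    """Pad every string to the longest length and map characters to integers."""
    width = max(len(s) for s in strings)
    return [[vocab[ch] for ch in s.ljust(width, pad)] for s in strings]

# Hypothetical FSL strings; the pad character must not occur in them
samples = ["AAABα", "ABBβ", "AAAAAABαβ"]
vocab = build_vocab(samples)
matrix = vectorise(samples, vocab)

for row in matrix:
    print(row)
```

Each row has the same length, so the list can be converted directly into an array or tensor by the deep learning framework of your choice.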


To check what the letter A stands for:

In [16]:
name={j:i for i,j in cocktails.dict.items()}['A']
print(f"Ingredient: {name}")
print(f"Minimum in recipes : {cocktails.min[name]}, maximum: {cocktails.max[name]}")
print(f"One dose of {name} is {cocktails.delta[name]}")

Ingredient:  Light Rum
Minimum in recipes : 0.04, maximum: 1.0
One dose of  Light Rum is 0.096


Encoding is a trade-off between accuracy (quantities are encoded as a whole number of doses) and the number of available recipes. Long FSL strings give good accuracy but require many recipes to train a deep autoencoder. For instance, let's decode the encoded recipe and compare it with the original

In [40]:
display(HTML("<span style='font-size:2em;'>FSL encoded recipe is:</span>"))
display(cocktails.decode([encoded[0]])[0].formulationlist)
display(HTML("<span style='font-size:2em;'>And the original recipe was:</span>"))
display(listcompo[0].formulationlist)

Unnamed: 0,Component,Rate,minor
0,Light Rum,0.633,False
1,Lemon Juice,0.365,False
2,Passion Fruit Syrup,0.001,True
3,Lime Juice,0.001,True
4,Formulation,1.0,Non additive


Unnamed: 0,Component,Rate,minor
0,Light Rum,0.666,0
1,Passion Fruit Syrup,0.0,1
2,Lemon Juice,0.333,0
3,Lime Juice,0.0,1
4,Formulation,1.0,Non additive
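
The decoded rates differ slightly from the original ones because each major quantity is stored as a whole number of doses. The sketch below illustrates the effect for Light Rum, using the minimum (0.04) and maximum (1.0) reported earlier. It assumes the number of doses is `round(quantity / dose)`, which may not be the exact rule used by FSL.py, and it ignores the renormalisation applied when the full recipe is decoded.

```python
def quantise(quantity, delta):
    """Round a quantity to a whole number of doses (assumed rule)."""
    doses = round(quantity / delta)
    return doses, doses * delta

lo, hi = 0.04, 1.0   # Light Rum minimum and maximum across the recipes
quantity = 0.666     # Light Rum rate in the original recipe

results = {}
for granulo in (5, 10, 20):
    delta = (hi - lo) / granulo          # size of one dose
    doses, approx = quantise(quantity, delta)
    error = abs(approx - quantity)
    results[granulo] = (doses, error, delta)
    print(f"granulo={granulo:2d}  doses={doses:2d}  error={error:.3f}")
```

Under this rule the rounding error never exceeds half a dose, so a larger `granulo` tightens the bound, at the cost of longer strings and therefore more recipes needed for training.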


# Limits of the current version
This published version has the following limits:
* Ingredients are unordered. A development version is in progress to take into account a complete sequential manufacturing process.
* Generation by exploring the autoencoder's latent space has been tested successfully on cocktails, but it still has to be assessed in other contexts.

# Licence
MIT

2021/2022 https://www.rd-mediation.com