# 2. Extracting data from *Common Fragrance and Flavor Materials*

## Purpose, contents, & conclusions

**Purpose:** This notebook was used to extract data about aroma/flavor chemicals from a PDF copy of *Common Fragrance and Flavor Materials*. Data is given only basic cleaning and treatment. *Note:* The PDF copy is excluded from the public repository for copyright reasons.

**Reference:** Surburg, H., & Panten, J. (Eds.). (2016). *Common Fragrance and Flavor Materials: Preparation, Properties, and Uses.* Wiley. https://doi.org/10.1002/9783527693153

**Contents:** The notebook contains:
* Extraction of data from one ~300-page PDF.

**Conclusions:** Key outputs are:
* Lightly cleaned data are written to the `common-materials` table of the `aromas-flavors.db` SQLite database. Each row in the table is one flavor/aroma chemical, with three data fields: the chemical `name`, its unique `CAS_num`, and a `full_description`.

## Test drive `pdfminer` again

Let's start by tinkering with the options available from `pdfminer` to see what will provide the data we need without too much trouble.

The PDF contains 326 pages. From visual inspection, the chemical information of interest **begins on page 20 and ends on page 249**.

In [1]:
pagerange = [19, 25]
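
The `page_numbers` argument of pdfminer's `extract_text` takes zero-indexed page numbers, which is why PDF page 20 appears here as index 19. The small helper below is only a sketch (the name `pdf_pages_to_indices` is made up for illustration) that converts a 1-based, inclusive page range as read off a PDF viewer into those indices:

```python
def pdf_pages_to_indices(first_page: int, last_page: int) -> list:
    """Convert a 1-based, inclusive PDF page range to zero-based indices."""
    # Hypothetical helper, not part of pdfminer.
    return list(range(first_page - 1, last_page))


# PDF pages 20 through 25 correspond to range(19, 25), i.e. pagerange = [19, 25].
print(pdf_pages_to_indices(20, 25))
```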

In [2]:
from pdfminer.high_level import extract_text

#### Read text with `extract_text`

In [3]:
# page_numbers is zero-indexed, so index 19 is PDF page 20.
# extract_text already returns a single string, so no joining is needed.
extracted = extract_text("raw-data/common-fragrance-and-flavor-materials.pdf",
                         page_numbers=list(range(pagerange[0], pagerange[1])))

In [4]:
extracted[:2000]

'8\n\nIndividual Fragrance and Flavor Materials\n\nSymrise\nTakasago\nTreatt\nVioryl\n\n= Symrise GmbH & Co KG, Germany\n= Takasago Perfumery Co., Japan\n= R.C. Treatt & Co., Ltd., UK\n= Vioryl S.A., Greece\n\nMonographs on fragrance materials and essential oils which have been published by\nthe Research Institute for Fragrance Materials (RIFM) in “Food and Chemical Toxi-\ncology” are cited below the individual compounds as “FCT” with year, volume, and\npage of publication.\n\n2.1 Aliphatic Compounds\n\nThe acyclic terpenes are discussed separately in Section 2.2. Some of the cycloali-\nphatic fragrance and flavor materials are structurally related to the cyclic terpenes\nand are, therefore, discussed in Section 2.4 after the cyclic terpenes.\n\n2.1.1 Hydrocarbons\n\nSaturated and unsaturated aliphatic hydrocarbons with straight as well as branched\nchains occur abundantly in natural foodstuffs, but they contribute to the odor and\ntaste only to a limited extent and have not therefore 

This output is pretty reasonable. It has a fair number of extra line breaks, but that's fixable. 

#### Parse chemical names & CAS numbers

*Common Fragrance and Flavor Materials* is more simply formatted than *Fenaroli's Handbook* was, so less of the content seems to be in a jumbled order. In fact, it might be possible to fully parse out the chemical data via regex, using the unique CAS numbers as punctuation: they appear to always be enclosed in square brackets and followed by line breaks, e.g. "[51447-08-6]".

In [5]:
import re

In [6]:
titles = re.findall(r"[\n\f]\S+\s\[[\d-]+\]", extracted)
titles[:5]

['\n(E,Z)-1,3,5-Undecatriene [51447-08-6]',
 '\x0c1,3-Undecadien-5-yne [166432-52-6]',
 '\n3-Octanol [589-98-0]',
 '\n2,6-Dimethyl-2-heptanol [13254-34-7]',
 '\ntrans-2-Hexen-l-ol [928-95-0]']

In [7]:
len(titles)

20

Fantastic! This was a pretty easy way to pull out the molecule names and their corresponding CAS numbers. Now all that remains is to also collect the corresponding technical description.
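
As an optional sanity check on the parsed CAS numbers, the check digit can be verified: the last digit of a CAS number equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is not part of the original workflow, and the function name `cas_checksum_ok` is made up for illustration. It is applied to the five CAS numbers shown in the output above.

```python
import re


def cas_checksum_ok(cas: str) -> bool:
    """Return True if `cas` has the form N...N-NN-N and a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


sample = ["51447-08-6", "166432-52-6", "589-98-0", "13254-34-7", "928-95-0"]
failures = [cas for cas in sample if not cas_checksum_ok(cas)]
print(failures)  # an empty list means every check digit is consistent
```

A failed check would point to a misread digit in the PDF extraction rather than a problem with the regex itself.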

#### Parse chemical descriptions

Again because of their clean & simple formatting in this book, chemical names & CAS numbers will be a convenient way to extract out the chemical descriptions: each one starts with the same basic format of `"\nchemical name [CAS number]"`.

Using regex to split at the same pattern as above ought to give a complete set of descriptions.

In [8]:
# The regex below includes both newline (\n) and form feed (\f) at the start
# of the line. Some entries start at the very top of the PDF page, and these
# begin with the form feed character. All others begin with newline.
descriptions = re.split(r"[\n\f]\S+\s\[[\d-]+\]", extracted)

In [9]:
len(descriptions)

21

As expected, there is an extra entry in `descriptions` because of the text present on the first page before the first actual chemical entry begins. Remove the unnecessary first list entry.

In [10]:
descriptions.pop(0)[:50]

'8\n\nIndividual Fragrance and Flavor Materials\n\nSymr'
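
The off-by-one relationship between `re.findall` and `re.split` is easy to see on a toy string. In the sketch below the text and the CAS-like numbers are made up purely for illustration. Because the pattern has no capture groups, `re.split` drops the matched titles and returns one more piece than there are matches.

```python
import re

pattern = r"[\n\f]\S+\s\[[\d-]+\]"

# Made-up text: one leading chunk, then two "entries" (the second starts a page).
toy = "front matter\nAlpha [111-11-1]\nfirst text\fBeta [222-22-2]\nsecond text"

toy_titles = re.findall(pattern, toy)
toy_pieces = re.split(pattern, toy)

print(len(toy_titles), len(toy_pieces))
print(toy_pieces[0])  # the leading chunk that precedes the first title
```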

Double-check to make sure this approach is working as expected:

In [11]:
for i, entry in enumerate(zip(titles, descriptions)):
    if i < 4:
        print("Title: ", entry[0])
        print("Description: ", entry[1][:500])
        print("\n====\n")

Title:  
(E,Z)-1,3,5-Undecatriene [51447-08-6]
Description:  

C11H18, Mr 150.26, is a colorless liquid with a strong green, galbanum-like odor. It
occurs naturally in galbanum oil (see p. 207) and is the odor-determining constituent
of the oil. The commercial qualities also contain some all-trans isomer and are of-
fered only in dilution due to better stability.

Numerous synthetic routes for the production of 1,3,5-undecatrienes have been

developed. Typical routes are described in [10] – [10b].
FCT 1988 (26), p. 415.

Trade Names. Galbanolene Super (Fi

====

Title:  1,3-Undecadien-5-yne [166432-52-6]
Description:  

Aliphatic Compounds

9

C11H16, Mr 148.24, n20
20 0.845 – 0.855, is a colorless liquid with a
nice fruity-green strong violet-leaf note. It recommended as an alternative to methyl
octynoate and methyl nonynoate [10c].

D 1.44 – 1.444, d20

Trade Name. Violettyne MIP (Firmenich)

2.1.2 Alcohols

Free and esterified, saturated primary alcohols occur widely in nature, e.g

Great! Each entry begins with a chemical name and unique CAS number as expected. Spot checking these results vs. the actual PDF indicates that the chemical names in `titles` are correctly matched with their corresponding `descriptions` using this approach.

The only drawbacks are:
* Some unnecessary (i.e. non-specific, non-chemical) section header information is present in some of the `descriptions` entries. This is visible, for example, in the description for *1,3-Undecadien-5-yne [166432-52-6]*, which also contains the start of the generic alcohols section ("2.1.2 Alcohols Free and esterified, saturated..."). These need to be removed before more detailed parsing for odor or taste descriptors, because a section header could contain a vague description of organoleptic properties that represents the whole category and does not necessarily apply to the specific molecule whose entry it ended up in.
* Page numbers and chapter/section titles are present in some of the `descriptions`, too. For example, the entry for *1,3-Undecadien-5-yne [166432-52-6]* also shows the book's page number ("9") and the running title ("Aliphatic Compounds"), each on its own line. Although it is ugly, it is probably not necessary to remove this information because it does not appear to contain anything that would confound further analysis (like the word "odor" or "flavor", for example).

#### Remove generic section headers and content

Section headers take the format of "n.n.n Title", such as "2.1.2 Alcohols". In the list of `descriptions`, they come *after* the chemically interesting information about individual molecules. That means it should be feasible to split each chemical description at a regex pattern that matches "n.n.n Title", discarding everything after the split.

In [12]:
re.split(r"[\d.]+\s[\w\d]+[\n\f]", descriptions[1])[0]

'\n\nAliphatic Compounds\n\n9\n\nC11H16, Mr 148.24, n20\n20 0.845 – 0.855, is a colorless liquid with a\nnice fruity-green strong violet-leaf note. It recommended as an alternative to methyl\noctynoate and methyl nonynoate [10c].\n\nD 1.44 – 1.444, d20\n\nTrade Name. Violettyne MIP (Firmenich)\n\n'

That looks like it will work! Compare this to the output for *1,3-Undecadien-5-yne [166432-52-6]* a few cells up. The "2.1.2 Alcohols" header and the text after it have been removed.
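
One caveat is that the pattern above is fairly loose. `[\d.]+` also matches a lone period, so a sentence break such as "odor. It" followed by a line break is treated as a section header. The sketch below contrasts it with a stricter pattern, which is an assumption of this example and has not been checked against the whole book. It requires a dotted section number at the start of a line followed by a capitalized title. The sample strings are taken from the outputs above.

```python
import re

loose = r"[\d.]+\s[\w\d]+[\n\f]"  # pattern used in the cell above
# Assumed stricter alternative: line start, "n.n" or "n.n.n", capitalized title.
strict = r"[\n\f]\d+(?:\.\d+)+\s[A-Z][^\n\f]*[\n\f]"

sentence_break = ("with a strong green, galbanum-like odor. It\n"
                  "occurs naturally in galbanum oil")
with_header = ("Trade Name. Violettyne MIP (Firmenich)\n\n"
               "2.1.2 Alcohols\n\nFree and esterified")

# The loose pattern cuts the sentence short; the strict one leaves it alone.
print(repr(re.split(loose, sentence_break)[0]))
print(repr(re.split(strict, sentence_break)[0]))

# Both patterns remove a real section header and the text after it.
print(repr(re.split(loose, with_header)[0]))
print(repr(re.split(strict, with_header)[0]))
```

The stricter pattern would miss top-level headers without a dot (e.g. a bare chapter number), so it trades one kind of error for another and would need spot checks against the PDF.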

## Build parsing function

Using the scratch work above as a template, let's build some functions to handle parsing the whole PDF book in a more organized fashion.

In [13]:
def remove_breaks(text: str) -> str:
    """
    Called:
        By parse_extracted.
    Purpose:
        Remove all line breaks from the input text
    Accepts:
        text: A string of text that may or may not contain
            a linebreak, form feed, carriage return, etc.
    Returns:
        A modified string free of such breaks
    """
    
    nolines = re.sub(r"[\n\f\r]", " ", text)
    return nolines.strip()
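
Because `remove_breaks` replaces every break with a space, a word hyphenated across a line break in the PDF (e.g. "reminis-" / "cent") comes out as "reminis- cent". If that matters for later text analysis, one possible variant is sketched below. The function name is hypothetical, and the rule it uses (join only when lowercase letters sit on both sides of the hyphen) is an assumption that will also merge genuine hyphenated compounds such as "galbanum-like" when they happen to break across lines.

```python
import re


def remove_breaks_dehyphenated(text: str) -> str:
    """Like remove_breaks, but first rejoin words hyphenated across a break."""
    # Hypothetical variant: drop "-<break>" only between two lowercase letters.
    joined = re.sub(r"(?<=[a-z])-[\n\f\r]+(?=[a-z])", "", text)
    # Then replace the remaining breaks with spaces, as remove_breaks does.
    return re.sub(r"[\n\f\r]", " ", joined).strip()


print(remove_breaks_dehyphenated("odor reminis-\ncent of freesia"))
print(remove_breaks_dehyphenated("galbanum-\nlike odor"))  # shows the caveat
```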

In [14]:
def parse_extracted(extract: str) -> list:
    """
    Called:
        By user.
    Purpose:
        Split a large chunk of extracted text into chemical names, unique
        CAS numbers, and chemical descriptions.
    Accepts:
        extract: One long string containing the extracted text as directly
            read from the PDF by pdfminer
    Returns:
        a list of dicts, one per chemical, with the keys
        name, CAS_num, and full_description (or None, after printing
        diagnostics, if the parsed lists differ in length)
    """
    
    # Extract entry titles (chemical names + CAS numbers) and roughly split 
    # into corresponding descriptions using regex.
    titles = re.findall(r"[\n\f]\S+\s\[[\s\d-]+\]", extract)
    raw_descriptions = re.split(r"[\n\f]\S+\s\[[\s\d-]+\]", extract)
    
    # Split the entry titles into separate lists of chemical names and
    # CAS numbers.
    names = []
    cas_nums = []
    for title in titles:
        match = re.split(r"\s(?:\[)", title)
        names.append(remove_breaks(match[0]))
        cas_nums.append(remove_breaks(match[1][:-1]))
        # the last character of the cas_nums match will always be "]",
        # which should be excluded. That is why the string slice is present above.
    
    # Remove the first entry in the descriptions list, which is currently the
    # leading text _before_ the first real chemical description begins.
    raw_descriptions.pop(0)
    
    # Remove extraneous, generic subsection text that is not actually a part
    # of a single chemical's data
    descriptions = [re.split(r"[\d.]+\s[\w\d]+[\n\f]", desc)[0]
                    for desc in raw_descriptions]
    # Also remove line breaks \n and form feeds \f from the descriptions
    descriptions = [remove_breaks(desc) for desc in descriptions]
    
    # If all parsed lists are the same length, bundle them into a single
    # list of dicts.
    parsed_data = []
    if len(names) == len(cas_nums) == len(descriptions):
        for i, chem_name in enumerate(names):
            parsed_data.append(dict(name = chem_name,
                                    CAS_num = cas_nums[i],
                                    full_description = descriptions[i]))
        
        return parsed_data
    
    else:
        print("Error: Data not parsed correctly.")
        print("Names: {} found\n{}".format(len(names), names[-5:]))
        print("CAS nums: {} found\n{}".format(len(cas_nums), cas_nums[-5:]))
        print("Descriptions: {} found\n{}".format(
            len(descriptions), [desc[:50] for desc in descriptions[-5:]]))
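
An alternative design, sketched below under stated assumptions, pairs each title with its description by match position using `re.finditer`. With that approach the names, CAS numbers, and descriptions cannot drift out of sync, so no length check is needed. The function name `parse_by_position` is hypothetical. This sketch skips the section-header trimming step, and it collapses all runs of whitespace into single spaces, which differs slightly from `remove_breaks`. The toy text and CAS-like numbers are made up.

```python
import re

# Same title pattern as above, with capture groups for the name and CAS number.
TITLE = re.compile(r"[\n\f](\S+)\s\[([\s\d-]+)\]")


def parse_by_position(text: str) -> list:
    """Pair each title with the text that runs up to the next title."""
    matches = list(TITLE.finditer(text))
    records = []
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(text)
        records.append({
            "name": current.group(1),
            "CAS_num": current.group(2).strip(),
            # Collapse line breaks, form feeds, and repeated spaces.
            "full_description": " ".join(text[current.end():end].split()),
        })
    return records


toy = "intro\nAlpha [111-11-1]\nfirst\ntext\fBeta [222-22-2]\nsecond"
toy_records = parse_by_position(toy)
print(toy_records)
```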

## Parse the full PDF

Use the functions from above to extract and parse the full PDF document.

In [15]:
pagerange = (19, 187)
extracted = extract_text("raw-data/common-fragrance-and-flavor-materials.pdf",
                              page_numbers = [page for page in range(pagerange[0], pagerange[1])])

In [16]:
aromas_flavors = parse_extracted(extracted)

Visually examine the first few and last few entries, comparing them to the actual text of the PDF:

In [17]:
aromas_flavors[:5], aromas_flavors[-5:]

([{'name': '(E,Z)-1,3,5-Undecatriene',
   'CAS_num': '51447-08-6',
   'full_description': 'C11H18, Mr 150.26, is a colorless liquid with a strong green, galbanum-like odor'},
  {'name': '1,3-Undecadien-5-yne',
   'CAS_num': '166432-52-6',
   'full_description': 'Aliphatic Compounds  9  C11H16, Mr 148.24, n20 20 0.845 – 0.855, is a colorless liquid with a nice fruity-green strong violet-leaf note. It recommended as an alternative to methyl octynoate and methyl nonynoate [10c].  D 1.44 – 1.444, d20  Trade Name. Violettyne MIP (Firmenich)'},
  {'name': '3-Octanol',
   'CAS_num': '589-98-0',
   'full_description': 'C, d20 CH3(CH2)4CH(OH)CH2CH3, C8H18O, Mr 130.23, bp97.6 kPa 176 –'},
  {'name': '2,6-Dimethyl-2-heptanol',
   'CAS_num': '13254-34-7',
   'full_description': '10  Individual Fragrance and Flavor Materials  (cid:1)  C9H20O, Mr 144.26, bp101.3 kPa 170 – 172 D 1.4248, which has not yet been found in nature, is a colorless liquid with a delicate, flowery odor reminis- cent of freesi

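This sampling lines up with the PDF document: names, CAS numbers, and descriptions are paired correctly. Two caveats remain. There is still some extraneous information present, which will be dealt with in subsequent processing. Some descriptions are also cut short: the first entry, for example, stops at "galbanum-like odor" because the section-header pattern `[\d.]+\s[\w\d]+[\n\f]` also matches the sentence break "odor. It" followed by a line break.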

Data for chemicals without a listed CAS number (e.g. "alkyldimethyl-1,3,5-dithiazines" in the final entry) are annoyingly difficult to separate out. On visual inspection of the PDF, this appears to be the *one, single entry* with that problem. So... I'll just fix it manually.

In [18]:
troublesome = aromas_flavors.pop(-1)
troublesome["full_description"] = "cooked beef odor"
aromas_flavors.append(troublesome)
aromas_flavors[-1]

{'name': '2-acetyl-2-thiazoline',
 'CAS_num': '29926-41-8',
 'full_description': 'cooked beef odor'}

That's an inelegant solution, but it's good enough.

## Write to the database

Finally, write the extracted data to the `common-materials` table of the SQLite database, `aromas-flavors.db`. There are three fields: the chemical `name`, its unique `CAS_num`, and a `full_description`.

In [19]:
import dataset

In [20]:
db = dataset.connect("sqlite:///aromas-flavors.db")

In [21]:
db["common-materials"].insert_many(aromas_flavors)

Great -- all done. The entries from *Common Fragrance and Flavor Materials* have been extracted and written to the `common-materials` table of `aromas-flavors.db`. They're searchable by chemical name and CAS number.
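
For reference, a lookup by CAS number can be sketched with the standard-library `sqlite3` module. Because the table name contains a hyphen, it has to be double-quoted in raw SQL. The example uses an in-memory database as a stand-in for `aromas-flavors.db`, and its three-column schema is a simplifying assumption (the table created by `dataset` may carry additional columns). The two rows are taken from the outputs above.

```python
import sqlite3

# In-memory stand-in for aromas-flavors.db, with an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "common-materials" '
    "(name TEXT, CAS_num TEXT, full_description TEXT)"
)
conn.executemany(
    'INSERT INTO "common-materials" VALUES (?, ?, ?)',
    [
        ("(E,Z)-1,3,5-Undecatriene", "51447-08-6",
         "C11H18, Mr 150.26, is a colorless liquid with a strong green, "
         "galbanum-like odor"),
        ("2-acetyl-2-thiazoline", "29926-41-8", "cooked beef odor"),
    ],
)

# Look up one chemical by its CAS number; the hyphenated table name is quoted.
row = conn.execute(
    'SELECT name, full_description FROM "common-materials" WHERE CAS_num = ?',
    ("29926-41-8",),
).fetchone()
print(row)
conn.close()
```

Note that re-running the `insert_many` cell appends the same rows again, so the notebook's final cell is best run once against a fresh table.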