# 1. Extracting data from *Fenaroli's Handbook*

## Purpose, contents, & conclusions

**Purpose:** This notebook was used to extract data about aroma/flavor chemicals from a PDF copy of *Fenaroli's Handbook*. Data is given only basic cleaning and treatment. *Note:* The PDF copy is excluded from the public repository for copyright reasons.

**Reference:** Burdock, G.A. (2009). *Fenaroli's Handbook of Flavor Ingredients (6th ed.).* CRC Press. https://doi.org/10.1201/9781439847503

**Contents:** The notebook contains:
* Extraction of data from one ~2000-page PDF.

**Conclusions:** Key outputs are:
* Lightly cleaned data are written to the `fenaroli` table of the `aromas-flavors.db` SQLite database.

## Test drive a few PDF tools

Before investing much time in any one approach to extracting the data from the PDF copy of *Fenaroli's Handbook*, let's try out two different libraries on the first few pages of the document: `PyPDF` and `pdfminer`. Then we can move ahead with whichever one works most conveniently on this particular PDF.

### Test drive of `PyPDF`

In [1]:
from pypdf import PdfReader

#### Load the pdf

In [2]:
reader = PdfReader("raw-data/fenarolis-handbook.pdf")

In [3]:
len(reader.pages)

2162

The PDF contains 2162 pages. From visual inspection, the chemical information of interest **begins on page 26 and ends on page 2035**.

#### First pass -- let's just see what happens

In [4]:
pagerange = [25, 30]

In [5]:
extracted = ""
for page in range(pagerange[0], pagerange[1]):
    extracted += reader.pages[page].extract_text()

In [6]:
extracted[:500]

'1A\nACACIA GUM\nBotanical name: .Acacia senegal .(L.).Willd.\nBot\nanical family: .Le\nguminosae\nOther names: .Aca\ncia.[JA\nN];.Aca\ncia senegal .gum; .Ara\nbic.gum; .gum.Ara\nbic;.Aca\ncia delbata .gum; .Aca\ncia.sol\nution; .Aca\ncia.syr\nup;.\nAus\ntralian .gu\nm;.Gu\nm.Ar\nabic; .In\ndian.gu\nm;.Wa\nttle.gu\nm\nCAS .No.: 9000 -01-05 FL .No.: n/a FEMA .No.: 2001 NAS .No.: 2001\nCoE.No.: n/a EINECS .No.: 232 -519-5 JECF\nA.No.: n/a E.No.: 414\nD\nescription: .Ara\nbic.or.aca\ncia.gum.is.the.dri\ned.exu\ndate.obt\nained .fro\nm'

**Oh no.** PyPDF has found a lot more line breaks than are visually present in the document. We'll also want to make sure page numbers/headers are excluded; here one shows up at the very beginning ("1A").

### Test drive of `pdfminer`

In [7]:
from pdfminer.high_level import extract_text

#### Via `extract_text`

In [8]:
extracted = extract_text("raw-data/fenarolis-handbook.pdf",
                         page_numbers = [page for page in range(pagerange[0], pagerange[1])]
                        )

In [9]:
extracted[:500]

'A\n\nACACIA GUM\n\nBotanical name:.Acacia senegal.(L.).Willd.\n\nBotanical family:.Leguminosae\n\nOther names:.Acacia.[JAN];.Acacia senegal.gum;.Arabic.gum;.gum.Arabic;.Acacia delbata.gum;.Acacia.solution;.Acacia.syrup;.\nAustralian.gum;.Gum.Arabic;.Indian.gum;.Wattle.gum\n\nCAS.No.:\nCoE.No.:\n\n9000-01-05\nn/a\n\nFL.No.:\nEINECS.No.: 232-519-5\n\nn/a\n\nFEMA.No.:\nJECFA.No.:\n\n2001\nn/a\n\nNAS.No.:\nE.No.:\n\n2001\n414\n\nDescription:.Arabic.or.acacia.gum.is.the.dried.exudate.obtained.from.the.stems.and.branches.of.Acacia sen'

Unexpectedly, this output is already cleaner than the `PyPDF` output; fewer extraneous line breaks are present, and words are no longer split mid-word.

#### Via `extract_pages`

In [10]:
from pdfminer.high_level import extract_pages

In [11]:
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

In [12]:
extracted = []
for page in extract_pages("raw-data/fenarolis-handbook.pdf",
                          page_numbers = [page for page in range(pagerange[0], pagerange[1])]):
    for element in page:
        extracted.append(element)

The extracted elements can be read directly...

In [13]:
extracted[:10]

[<LTTextBoxHorizontal(0) 54.000,690.564,79.992,726.564 'A\n'>,
 <LTTextBoxHorizontal(1) 54.000,655.514,126.105,666.514 'ACACIA GUM\n'>,
 <LTTextBoxHorizontal(2) 54.000,640.341,224.800,649.851 'Botanical name:.Acacia senegal.(L.).Willd.\n'>,
 <LTTextBoxHorizontal(3) 54.000,624.837,179.295,634.337 'Botanical family:.Leguminosae\n'>,
 <LTTextBoxHorizontal(4) 53.981,597.829,560.350,618.843 'Other names:.Acacia.[JAN];.Acacia senegal.gum;.Arabic.gum;.gum.Arabic;.Acacia delbata.gum;.Acacia.solution;.Acacia.syrup;.\nAustralian.gum;.Gum.Arabic;.Indian.gum;.Wattle.gum\n'>,
 <LTTextBoxHorizontal(5) 56.250,569.613,93.728,592.603 'CAS.No.:\nCoE.No.:\n'>,
 <LTTextBoxHorizontal(6) 116.252,569.613,160.579,592.603 '9000-01-05\nn/a\n'>,
 <LTTextBoxHorizontal(7) 182.248,569.613,281.827,592.603 'FL.No.:\nEINECS.No.: 232-519-5\n'>,
 <LTTextBoxHorizontal(8) 242.251,583.103,253.859,592.603 'n/a\n'>,
 <LTTextBoxHorizontal(9) 308.247,569.613,354.522,592.603 'FEMA.No.:\nJECFA.No.:\n'>]

...or they can be filtered by the LTTextContainer type to eliminate other elements (lines, etc.):

In [14]:
for element in extracted[:10]:
    if isinstance(element, LTTextContainer):
        for text_line in element:
            print(text_line)

<LTTextLineHorizontal 54.000,690.564,79.992,726.564 'A\n'>
<LTTextLineHorizontal 54.000,655.514,126.105,666.514 'ACACIA GUM\n'>
<LTTextLineHorizontal 54.000,640.341,224.800,649.851 'Botanical name:.Acacia senegal.(L.).Willd.\n'>
<LTTextLineHorizontal 54.000,624.837,179.295,634.337 'Botanical family:.Leguminosae\n'>
<LTTextLineHorizontal 54.000,609.333,560.350,618.843 'Other names:.Acacia.[JAN];.Acacia senegal.gum;.Arabic.gum;.gum.Arabic;.Acacia delbata.gum;.Acacia.solution;.Acacia.syrup;.\n'>
<LTTextLineHorizontal 53.981,597.829,265.252,607.329 'Australian.gum;.Gum.Arabic;.Indian.gum;.Wattle.gum\n'>
<LTTextLineHorizontal 56.250,583.103,93.728,592.603 'CAS.No.:\n'>
<LTTextLineHorizontal 56.250,569.613,92.141,579.113 'CoE.No.:\n'>
<LTTextLineHorizontal 116.252,583.103,160.579,592.603 '9000-01-05\n'>
<LTTextLineHorizontal 116.252,569.613,127.861,579.113 'n/a\n'>
<LTTextLineHorizontal 182.249,583.103,212.335,592.603 'FL.No.:\n'>
<LTTextLineHorizontal 182.248,569.613,281.827,579.113 'EINECS

Apparently reading the text by character rather than text box is also an option. Let's try that.

In [15]:
extracted_chars = []
for page in extract_pages("raw-data/fenarolis-handbook.pdf",
                          page_numbers = [page for page in range(pagerange[0], pagerange[1])]
                          ):
    for element in page:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        extracted_chars.append((character.get_text(), character.fontname, round(character.size, 1)))

In [16]:
extracted_chars[:20]

[('A', 'TimesLTStd-Bold', 36.0),
 ('A', 'TimesLTStd-Bold', 11.0),
 ('C', 'TimesLTStd-Bold', 11.0),
 ('A', 'TimesLTStd-Bold', 11.0),
 ('C', 'TimesLTStd-Bold', 11.0),
 ('I', 'TimesLTStd-Bold', 11.0),
 ('A', 'TimesLTStd-Bold', 11.0),
 (' ', 'TimesLTStd-Bold', 11.0),
 ('G', 'TimesLTStd-Bold', 11.0),
 ('U', 'TimesLTStd-Bold', 11.0),
 ('M', 'TimesLTStd-Bold', 11.0),
 ('B', 'TimesLTStd-Bold', 9.5),
 ('o', 'TimesLTStd-Bold', 9.5),
 ('t', 'TimesLTStd-Bold', 9.5),
 ('a', 'TimesLTStd-Bold', 9.5),
 ('n', 'TimesLTStd-Bold', 9.5),
 ('i', 'TimesLTStd-Bold', 9.5),
 ('c', 'TimesLTStd-Bold', 9.5),
 ('a', 'TimesLTStd-Bold', 9.5),
 ('l', 'TimesLTStd-Bold', 9.5)]

This could be a convenient break! The header information (here, the first "A") is in a unique font size. Each chemical entry (here, "ACACIA GUM") is both in all-caps and a unique font size. Finally, within that, data labels (such as "Botanical name:") are bolded. This font information could be a useful way to parse the document.
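
To make the font pattern easier to see, consecutive characters that share a font name and size can be merged into runs. The following is a minimal sketch, assuming tuples in the same `(text, fontname, size)` shape as `extracted_chars`; the `group_by_font` helper and the `sample` list are hypothetical and built by hand rather than read from the PDF.

```python
from itertools import groupby


def group_by_font(chars):
    """Merge consecutive (text, fontname, size) tuples that share a font and size."""
    runs = []
    for (fontname, size), group in groupby(chars, key=lambda c: (c[1], c[2])):
        text = "".join(c[0] for c in group)
        runs.append((text, fontname, size))
    return runs


# Hypothetical sample in the same shape as extracted_chars (not read from the PDF)
sample = [("A", "TimesLTStd-Bold", 36.0)]
sample += [(ch, "TimesLTStd-Bold", 11.0) for ch in "ACACIA GUM"]
sample += [(ch, "TimesLTStd-Bold", 9.5) for ch in "Botanical"]

runs = group_by_font(sample)
for text, fontname, size in runs:
    print(size, repr(text))
# 36.0 'A'
# 11.0 'ACACIA GUM'
# 9.5 'Botanical'
```

Each run then maps onto one of the roles described above: the size-36 run is the page header, the size-11 run is an entry title, and the size-9.5 run is entry content.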

#### Try parsing the document by font formatting.

I'll do this in some nice, organized functions so that after I troubleshoot them, I can easily reuse them to extract the entire ~2000-page document.

In [17]:
def read_line(textline: LTTextLine) -> str:
    
    """
    Called:
         By parse_pages function for partial parsing of a PDF document.
    Accepts:
         textline: A LTTextLine object from partial PDF parsing by pdfminer
    Returns:
         a string of text from the TextLine object, terminating with a pipe (|)
    """
    
    text = []
    for character in textline:
        if isinstance(character, LTChar):
            text.append(character.get_text())
    text.append("|") # Include pipe as a separator at the end of each text line.
                     # This will be a fallback in case values cannot readily be parsed
                     # in a more elegant way down the road.
    return "".join(text)

In [18]:
def start_new_entry_content(segmented_entry: list) -> str:
    
    """
    Called:
        By parse_pages function whenever a new encyclopedia entry is detected.
    Purpose:
        Clean up the *prior entry's* content, which at the time of calling is
        stored as a disjointed list of strings, segmented_entry.
    Accepts:
        segmented_entry: A list of strings containing the body of reference
              information for a single encyclopedia entry. The strings are
              of varied length, and their length isn't particularly
              meaningful.
    Returns:
        a single string combining all the entries from the input list.
    """
    
    return " ".join(segmented_entry)

In [19]:
# Display a progress bar so this long process doesn't occur
# entirely in a black box.
from tqdm import tqdm

In [20]:
def parse_pages(start_page: int, end_page: int) -> tuple:
    
    """
    Called:
        By user.
    Purpose:
        Manage data extraction from the multipage Fenaroli's Handbook PDF.
    Accepts:
        start_page: int indicating the first page of the PDF for extraction
        end_page: int indicating the final page of the PDF for extraction
    Returns:
        a tuple containing:
            - a list of titles for articles from the encyclopedia-style book
            - a list containing the corresponding article content (with one
              list entry for each article)
    """
    
    
    extracted = []
    entry_titles = []
    entry_content = []
    partial_content = []

    for page in tqdm(extract_pages("raw-data/fenarolis-handbook.pdf",
                                   page_numbers = [page for page in range(start_page, end_page)])):
        for element in page:
            # extracted.append(element)
            if isinstance(element, LTTextContainer):
                for text in element:
                    
                    if isinstance(text, LTTextLine):
                        # If we are, in fact, iterating over a TextLine object
                        # as expected instead of, e.g., a LTChar single
                        # character, then proceed. Single characters in text
                        # containers will be ignored.
                        
                        # First, extract the text from this line.
                        newline = read_line(text)

                        # Then figure out how to categorize the line.
                        # Determine the font size for this line based on the first
                        # character present. Is this the title of an encyclopedia entry
                        # (font size 11), the body of an entry (font size 9.5), or
                        # something else (in which case it can be ignored)?
                        chars = [char for char in text]
                        if isinstance(chars[0], LTChar):
                            if round(chars[0].size, 1) == 11:
                                entry_titles.append(newline)
                                # When a new entry is detected, wrap up the old entry &
                                # prepare to start a fresh new one.
                                entry_content.append(start_new_entry_content(partial_content))
                                partial_content = []
                            elif round(chars[0].size, 1) == 9.5:
                                partial_content.append(newline)
                            else:
                                pass
    
    # Before finishing, append the final partial_content to entry_content, if any
    entry_content.append(start_new_entry_content(partial_content))
      
    return entry_titles, entry_content

Now, let's test these functions to make sure everything is working as expected:

In [21]:
entry_titles, entry_content = parse_pages(25, 30)

5it [00:02,  1.69it/s]


In [22]:
entry_titles

['ACACIA GUM|',
 'ACETAL|',
 'ACETALDEHYDE|',
 'ACETALDEHYDE, BUTYL PHENETHYL ACETAL|',
 'ACETALDEHYDE DI-cis-3-HEXENYL ACETAL|']

In [23]:
entry_content

['',
 'Botanical name:.Acacia senegal.(L.).Willd.| Botanical family:.Leguminosae| Other names:.Acacia.[JAN];.Acacia senegal.gum;.Arabic.gum;.gum.Arabic;.Acacia delbata.gum;.Acacia.solution;.Acacia.syrup;.| Australian.gum;.Gum.Arabic;.Indian.gum;.Wattle.gum| CAS.No.:| CoE.No.:| 9000-01-05| n/a| FL.No.:| EINECS.No.:232-519-5| n/a| FEMA.No.:| JECFA.No.:| 2001| n/a| NAS.No.:| E.No.:| 2001| 414| Description:.Arabic.or.acacia.gum.is.the.dried.exudate.obtained.from.the.stems.and.branches.of.Acacia senegal (L.).Willd..or.of.| related.species.of.Acacia..Injured.trees.exude.gum.Arabic;.heat,.poor.nutrition.and.drought.stimulate.its.production..Most.of.the.gum.| Arabic.production.is.from.wild.trees,.but.some.from.privately.owned.and.cultivated.gardens.are.tapped.and.collected.on.a.systematic.| basis.| The.gum.called.Hashab.geneina.(garden.gum).is.the.cleanest.and.lightest.grade.and.is.most.preferred.for.the.U.S..market..The.wild.| gum.(called.Hashab.wady).is.collected.on.a.part-time.basis.in.the.

**Fantastic.** This is working fairly nicely. The last thing that needs to be done (at a bare minimum) before turning this loose on the full document is to extract the unique CAS identification numbers. These will be the most reliable key for searching the data later.

#### Extract unique CAS numbers

Because of the structure of the PDF document and my approach to extracting it, `pdfminer` reads the stacked "CAS.No.:" and "CoE.No.:" labels as one text box before it reaches their values. As a result, the unique CAS numbers that we need have an extraneous data label ("CoE.No.:") between the "CAS.No.:" label and the number itself, e.g., "CAS.No.:| CoE.No.:| 105-57-7|". So, the simplest way to find the CAS number will likely be to search for the extraneous "CoE.No.:" and snag the number that follows it.

In [24]:
import re

In [25]:
def extract_cas_num(chem_desc: str) -> str:
        
    """
    Called:
        By user.
    Purpose:
        Extract a chemical's unique CAS number identifier from the chemical
        description extracted from the Fenaroli's Handbook.
    Accepts:
        chem_desc: String containing the entirety of one chemical's entry
            in the Fenaroli's Handbook.
    Returns:
        a string: the unique CAS number
    """
    
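    # The unescaped "." matches the placeholder character that the PDF
    # extraction leaves where a space is printed (e.g., "CoE.No.:").
    regex = r"CoE.No.:\|\s([\d\-]+)"
    found = re.search(regex, chem_desc)
    if found:
        return found.group(1)  # Only the digits and hyphens after the label
    return None  # No CAS number could be found in this entry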

In [26]:
extract_cas_num(entry_content[2])

'105-57-7'
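
Since the CAS numbers will serve as search keys, it may be worth sanity-checking them. A CAS Registry Number has the layout "2 to 7 digits, 2 digits, 1 check digit", where the check digit is a weighted sum of the other digits modulo 10. The following is a minimal sketch of that check; `cas_checksum_ok` is a hypothetical helper and is not part of the extraction pipeline in this notebook.

```python
import re


def cas_checksum_ok(cas_num: str) -> bool:
    """Return True if cas_num has the standard CAS layout and a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_num)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(cas_checksum_ok("105-57-7"))    # True
print(cas_checksum_ok("9000-01-05"))  # False
```

The second value fails only because the extracted text shows a two-digit final group ("05") for acacia gum; the same number written without the leading zero, "9000-01-5", passes the check. Values like this may need normalizing before they are compared against other data sources.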

## Extract the complete data set

Using the functions above, extract the complete data set from the Fenaroli's Handbook. At this stage, we will generate a fairly crude data set:
* `titles`: A list of entry titles, one per entry in the Handbook. Each title is the name of an individual chemical.
* `descriptions`: A list containing the full text of the Handbook, with one list entry per chemical.
* `cas_nums`: A list of unique CAS numbers, with one list entry per chemical.

### Parse the PDF

Crudely parse the PDF into lists of chemical `titles` and their `descriptions`.

In [27]:
titles, descriptions = parse_pages(25, 2061)

2036it [06:59,  4.86it/s]


The list of extracted data entries, `descriptions`, has an unnecessary blank entry at the beginning (because of line 51 in `parse_pages`, which closes out the still-empty "previous entry" when the first title is detected). Remove that first, then double-check that an equal number of entry titles and entry descriptions were found:

In [28]:
descriptions.pop(0)

''

In [29]:
len(titles), len(descriptions)

(2735, 2735)
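
The blank leading entry is a side effect of flushing the previous entry every time a new title is found. The following toy version of that logic is a sketch on plain `(font_size, text)` tuples; `split_entries` and the sample `lines` are hypothetical and do not come from the Handbook.

```python
def split_entries(lines):
    """Toy version of the title/body split, using (font_size, text) tuples."""
    titles, contents, partial = [], [], []
    for size, text in lines:
        if size == 11:
            titles.append(text)
            # Flush whatever body text came before this title, even if empty.
            contents.append(" ".join(partial))
            partial = []
        elif size == 9.5:
            partial.append(text)
        # Any other font size (e.g., page headers) is ignored.
    contents.append(" ".join(partial))  # Flush the final entry.
    return titles, contents


# Hypothetical lines, not taken from the Handbook
lines = [(36, "A"), (11, "ENTRY ONE"), (9.5, "body 1a"), (9.5, "body 1b"),
         (11, "ENTRY TWO"), (9.5, "body 2")]
toy_titles, toy_contents = split_entries(lines)
print(toy_titles)    # ['ENTRY ONE', 'ENTRY TWO']
print(toy_contents)  # ['', 'body 1a body 1b', 'body 2']
```

With the first element dropped, the contents line up one-to-one with the titles, which is what the length check above confirms for the real data.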

### Extract CAS numbers

Generate a list of CAS numbers, `cas_nums`, to serve as unique identifiers for each chemical entry. The CAS numbers are currently buried in the list of `descriptions`.

In [30]:
cas_nums = []
for desc in descriptions:
    cas_nums.append(extract_cas_num(desc))
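
Because `extract_cas_num` yields nothing when no number is found, a quick tally of missing and repeated values can show how well the CAS numbers work as unique identifiers. This is a minimal sketch; `summarize_ids` and `sample_ids` are hypothetical, and the notebook itself does not report these counts.

```python
from collections import Counter


def summarize_ids(ids):
    """Count missing (None) identifiers and list values that occur more than once."""
    missing = sum(1 for value in ids if value is None)
    counts = Counter(value for value in ids if value is not None)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical sample; with real data this would be called as summarize_ids(cas_nums)
sample_ids = ["105-57-7", None, "75-07-0", "105-57-7"]
missing, duplicates = summarize_ids(sample_ids)
print(missing)     # 1
print(duplicates)  # ['105-57-7']
```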

### Clean up the descriptions

The `descriptions` are pretty ugly right now, featuring a lot of unnecessary periods that are artifacts from the PDF reading, as well as a bunch of pipes (|) we inserted earlier as a backup plan for text separation. Go ahead and remove both.

In [31]:
clean_descriptions = [desc.replace(".", " ").replace("|", "") for desc in descriptions]
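
Replacing every period also removes the genuine ones at the ends of sentences. In the extracted text, a real period followed by a space shows up as two dots in a row (e.g., "production..Most"), so one possible refinement is to treat ".." as a period plus a space and a lone "." as a space. This is only a heuristic sketch; `clean_description` is a hypothetical helper and is not what the notebook uses.

```python
import re


def clean_description(desc: str) -> str:
    """Heuristic: '..' becomes a period plus a space; a lone '.' becomes a space."""
    no_pipes = desc.replace("|", "")
    spaced = re.sub(r"\.\.?",
                    lambda m: ". " if m.group() == ".." else " ",
                    no_pipes)
    return " ".join(spaced.split())  # Collapse any repeated whitespace.


sample = "stimulate.its.production..Most.of.the.gum.| Arabic.production"
print(clean_description(sample))
# stimulate its production. Most of the gum Arabic production
```

The heuristic is not perfect: abbreviations lose their inner periods (for example, "U.S..market" becomes "U S. market"), so the simpler blanket replacement may still be the better trade-off for light cleaning.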

### Fix the casing of entry `titles`

The `titles` are currently being yelled in all caps. Replace that with title case.

In [32]:
clean_titles = [title.replace("|", "").title() for title in titles] # Hey, I heard you like titles.
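
One caveat is that `str.title()` capitalizes every run of letters, including lowercase stereochemistry prefixes such as "cis". If preserving those matters, a possible alternative is to re-case only the runs of uppercase letters. This is a sketch; `title_case_caps_only` is a hypothetical helper, not part of the notebook.

```python
import re


def title_case_caps_only(title: str) -> str:
    """Capitalize runs of uppercase letters; leave lowercase tokens such as 'cis' alone."""
    return re.sub(r"[A-Z]+",
                  lambda m: m.group().capitalize(),
                  title.replace("|", ""))


raw = "ACETALDEHYDE DI-cis-3-HEXENYL ACETAL|"
print(raw.replace("|", "").title())  # Acetaldehyde Di-Cis-3-Hexenyl Acetal
print(title_case_caps_only(raw))     # Acetaldehyde Di-cis-3-Hexenyl Acetal
```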

### Bundle the extracted data into a list of dicts

Combine the lists of `clean_titles`, `clean_descriptions`, and `cas_nums` into a single list of dicts, with one dict per chemical.

In [33]:
len(clean_titles), len(clean_descriptions), len(cas_nums)

(2735, 2735, 2735)

In [34]:
aromas = []
for cas_num, name, description in zip(cas_nums, clean_titles, clean_descriptions):
    aromas.append(dict(CAS_num = cas_num,
                       name = name,
                       full_description = description))

In [35]:
aromas[0]

{'CAS_num': '9000-01-05',
 'name': 'Acacia Gum',
 'full_description': 'Botanical name: Acacia senegal (L ) Willd  Botanical family: Leguminosae Other names: Acacia [JAN]; Acacia senegal gum; Arabic gum; gum Arabic; Acacia delbata gum; Acacia solution; Acacia syrup;  Australian gum; Gum Arabic; Indian gum; Wattle gum CAS No : CoE No : 9000-01-05 n/a FL No : EINECS No :232-519-5 n/a FEMA No : JECFA No : 2001 n/a NAS No : E No : 2001 414 Description: Arabic or acacia gum is the dried exudate obtained from the stems and branches of Acacia senegal (L ) Willd  or of  related species of Acacia  Injured trees exude gum Arabic; heat, poor nutrition and drought stimulate its production  Most of the gum  Arabic production is from wild trees, but some from privately owned and cultivated gardens are tapped and collected on a systematic  basis  The gum called Hashab geneina (garden gum) is the cleanest and lightest grade and is most preferred for the U S  market  The wild  gum (called Hashab wady) i

## Write it to the database

We'll write the extracted data to the `fenaroli` table of the SQLite database, `aromas-flavors.db`.

In [36]:
import dataset

In [37]:
db = dataset.connect("sqlite:///aromas-flavors.db")

In [38]:
db["fenaroli"].insert_many(aromas)

Great -- all done. The entries from *Fenaroli's Handbook* have been extracted and written to `aromas-flavors.db`. They're searchable by chemical name and CAS number.
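
As an illustration of searching by CAS number, the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for `aromas-flavors.db`. The table layout is an assumption based on the dict keys above, the inserted row is a placeholder, and `find_by_cas` is a hypothetical helper.

```python
import sqlite3

# In-memory stand-in for aromas-flavors.db (assumed columns match the dict keys)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fenaroli (CAS_num TEXT, name TEXT, full_description TEXT)"
)
conn.execute(
    "INSERT INTO fenaroli VALUES (?, ?, ?)",
    ("105-57-7", "Acetal", "placeholder description"),
)
conn.commit()


def find_by_cas(connection, cas_num):
    """Return (name, full_description) rows whose CAS_num matches exactly."""
    cursor = connection.execute(
        "SELECT name, full_description FROM fenaroli WHERE CAS_num = ?",
        (cas_num,),
    )
    return cursor.fetchall()


rows = find_by_cas(conn, "105-57-7")
print(rows)  # [('Acetal', 'placeholder description')]
```

Pointing `sqlite3.connect` at the real database file instead of `":memory:"` should allow the same query to run against the extracted data, assuming the column names match.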