<a href="https://colab.research.google.com/github/KewBridge/CalamusTraits/blob/main/notebooks/treatment_extraction_from_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how we can extract data from a **PDF format** publication.

First, we need to install a library that helps us process PDF files.

In [6]:
!pip install pypdf



Because the Calamus monograph is published closed access, we'll need to download the PDF from the publisher and then upload it into colab storage to access it from this notebook. Click on the folder icon in the left-hand toolbar in colab and then on the file upload icon. Upload the PDF and make a variable (`pdf_filename`) that holds the filename.

In [7]:
pdf_filename = "Calamus Monograph.pdf"

First we'll extract all the text and save it in a text file for examination.

In [8]:
from pypdf import PdfReader
reader = PdfReader(pdf_filename)

with open('calamus_monograph.txt', 'w') as f:
  # Loop over each page in the PDF and write it to a text file
  # using a marker to indicate page boundaries
  for page in reader.pages:
      page_number = page.page_number + 1
      page_text = page.extract_text()
      f.write('-'*30 + str(page_number) + '-'*30 + '\n')
      f.write(page_text)
      f.write('\n'+'-'*80+'\n')

You should see a new file appear on the left in the file browser section of the colab interface. If the cell has finished executing and you don't see it, click the refresh icon. If you double-click the text file you'll be able to view the contents.
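
As a quick check on the export, the text file can be split back into pages using the marker lines. The following is a minimal sketch, assuming the marker format written by the cell above (30 dashes, the page number, 30 dashes, and a closing line of 80 dashes). `split_pages` is a hypothetical helper name and the sample text is a made-up stand-in for the contents of `calamus_monograph.txt`.

```python
import re

# Marker line written before each page: 30 dashes, page number, 30 dashes
PAGE_MARKER = re.compile(r'^-{30}(\d+)-{30}$')

def split_pages(text):
    """Return {page_number: page_text} from marker-delimited text (hypothetical helper)."""
    pages = {}
    current = None
    for line in text.split('\n'):
        m = PAGE_MARKER.match(line)
        if m:
            # Start collecting lines for a new page
            current = int(m.group(1))
            pages[current] = []
        elif line == '-' * 80:
            # End-of-page marker: stop collecting until the next page starts
            current = None
        elif current is not None:
            pages[current].append(line)
    return {n: '\n'.join(lines) for n, lines in pages.items()}

# Made-up stand-in for the contents of calamus_monograph.txt
sample = '\n'.join([
    '-' * 30 + '1' + '-' * 30,
    'First page text',
    '-' * 80,
    '-' * 30 + '2' + '-' * 30,
    'Second page',
    'second line',
    '-' * 80,
    '',
])

pages_from_text = split_pages(sample)
print(sorted(pages_from_text))
```

With the real file, the text would be read with `open('calamus_monograph.txt').read()` and the number of keys compared against `len(reader.pages)`.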

Now we'll extract the text from the PDF and store it in a pandas dataframe for easier processing.

In [9]:
from pypdf import PdfReader
import pandas as pd
import re

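# Define a function that will help us clean boilerplate text from lines
def cleanLine(s, page_number):
  # Default: return the line unchanged
  cleaned = s
  if s.startswith('HENDERSON {page_number}'.format(page_number=page_number)) and '© 2020 Magnolia Press' in s:
    # Running header that starts with the author name:
    # keep the text after the copyright notice
    cleaned = s.split('© 2020 Magnolia Press')[1]
  elif s.startswith('A REVISION OF CALAMUS'):
    # Running header that starts with the title:
    # keep the text after the page-number marker
    try:
      cleaned = s.split('•   {page_number}'.format(page_number=page_number))[1]
    except IndexError:
      # Marker not found: report the line and leave it unchanged
      print(s, page_number)
  return cleaned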

# Open the PDF file for reading
reader = PdfReader(pdf_filename)

# Extract pages and store with their page number in a dictionary
pages = dict()
for page in reader.pages:
    page_number = page.page_number + 1
    page_text = page.extract_text()
    pages[page_number] = page_text

# Make a dataframe from the pages dictionary
df_pages = pd.DataFrame(data={'page_number':pages.keys(),'page_text':pages.values()})

print(df_pages.head(n=10))

   page_number                                          page_text
0            1  Phytotaxa  445 (1): 001–656\nhttps://www.mapre...
1            2  HENDERSON 2   •   Phytotaxa  445 (1) © 2020 Ma...
2            3  A REVISION OF CALAMUS Phytotaxa  445 (1) © 202...
3            4  HENDERSON 4   •   Phytotaxa  445 (1) © 2020 Ma...
4            5  A REVISION OF CALAMUS Phytotaxa  445 (1) © 202...
5            6  HENDERSON 6   •   Phytotaxa  445 (1) © 2020 Ma...
6            7  A REVISION OF CALAMUS Phytotaxa  445 (1) © 202...
7            8  HENDERSON 8   •   Phytotaxa  445 (1) © 2020 Ma...
8            9  A REVISION OF CALAMUS Phytotaxa  445 (1) © 202...
9           10  HENDERSON 10   •   Phytotaxa  445 (1) © 2020 M...
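
To see what the header cleaning does, it can help to run it on a few single lines. The sketch below repeats the logic of `cleanLine` under the hypothetical name `clean_line` so that it runs on its own. The three sample lines are invented for illustration: they follow the header pattern visible in the output above, but the text after each header is made up.

```python
def clean_line(s, page_number):
    """Same logic as cleanLine, repeated here so the example is self-contained."""
    if s.startswith('HENDERSON {}'.format(page_number)) and '© 2020 Magnolia Press' in s:
        return s.split('© 2020 Magnolia Press')[1]
    if s.startswith('A REVISION OF CALAMUS'):
        parts = s.split('•   {}'.format(page_number))
        if len(parts) > 1:
            return parts[1]
    return s

# Invented sample lines (assumption: headers run straight into the page text)
author_header = 'HENDERSON 2   •   Phytotaxa  445 (1) © 2020 Magnolia PressAbstract'
title_header = 'A REVISION OF CALAMUS Phytotaxa  445 (1) © 2020 Magnolia Press   •   3Introduction'
body_line = 'Stems clustered, climbing'

cleaned_author = clean_line(author_header, 2)
cleaned_title = clean_line(title_header, 3)
cleaned_body = clean_line(body_line, 3)
print(repr(cleaned_author), repr(cleaned_title), repr(cleaned_body))
```

Lines that do not start with a running header pass through unchanged, so the function is safe to apply to every line of every page.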


In [10]:
# Make a new column which holds the individual lines in each page
# First as a list of lines
df_pages['line'] = df_pages['page_text'].apply(lambda x: x.split('\n'))
# Now "explode" the list of lines so that each line is in its own row
# in the dataframe
df_lines = df_pages.explode('line')
# page text can be dropped as no longer needed
df_lines.drop(columns=['page_text'],inplace=True)
# Clean any boilerplate footer text from the lines
df_lines['line_cleaned'] = df_lines.apply(lambda row: cleanLine(row['line'], row['page_number']), axis=1)

print(df_lines)

     page_number                                               line  \
0              1                        Phytotaxa  445 (1): 001–656   
0              1                      https://www.mapress.com/j/pt/   
0              1  Copyright © 2020 Magnolia PressPHYTOTAXAISSN 1...   
0              1                    ISSN 1179-3163 (online edition)   
0              1  Accepted by William Baker: 11 Apr. 2020; publi...   
..           ...                                                ...   
654          655  leaf with regularly arranged, linear pinnae sp...   
655          656  HENDERSON 656   •   Phytotaxa  445 (1) © 2020 ...   
655          656  Plate 139. Calamus zollingeri  subsp. zollinge...   
655          656  brown, the other shorter, needle-like, black, ...   
655          656  Regularly arranged, lanceolate pinnae ( Hender...   

                                          line_cleaned  
0                          Phytotaxa  445 (1): 001–656  
0                        https://
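
The split-then-explode step is easier to follow on a tiny dataframe. This is an illustrative sketch with two made-up pages, where `toy_pages` and `toy_lines` are hypothetical names. It also shows why the index values repeat in the output above: `explode` copies the page's index label onto every line of that page. Resetting the index is optional and is shown only as one way to give each line a unique label.

```python
import pandas as pd

# Two made-up pages standing in for df_pages
toy_pages = pd.DataFrame({
    'page_number': [1, 2],
    'page_text': ['line a\nline b', 'line c'],
})

# Split each page into a list of lines, then give each line its own row
toy_pages['line'] = toy_pages['page_text'].str.split('\n')
exploded = toy_pages.explode('line').drop(columns=['page_text'])
print(exploded.index.tolist())   # index labels repeat within a page

# Optional: give every line a unique row label
toy_lines = exploded.reset_index(drop=True)
print(toy_lines)
```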

Next, we'll see if we can identify the different sections of the article

In [11]:
# Here we map between the section identifier that we want to establish,
# and the line that represents the start of this section

section_mapper={'abstract':'Abstract',
'introduction':'Introduction',
'materials_and_methods':'Materials and Methods',
'distribution':'Distribution',
'morphology':'Morphology',
'taxonomic_treatment':'Taxonomic Treatment',
'acknowledgements':'Acknowledgements',
'references':'References',
'appendix_1':'Appendix I. Quantitative variables',
'appendix_2':'Appendix II. Qualitative Variables',
'appendix_3':'Appendix III. Excluded and Uncertain Names',
'appendix_4':'Appendix IV . Species by Region/Island/Country',
'appendix_6':'Appendix VI. Plates'}

# Add a column to df_lines so that we can categorise the line
# into its enclosing section
df_lines['section']=[None]*len(df_lines)

# Loop over the section mapper and update the df_lines dataframe with
# the appropriate section
for key, value in section_mapper.items():
  mask=(df_lines.line_cleaned.str.match(r'^{}\s*$'.format(value.replace('.','\\.')), case=False, flags=0, na=None))
  df_lines.loc[mask, 'section'] = key
print(df_lines[df_lines.section.notnull()][['line_cleaned','section']])

                                       line_cleaned                section
2                                          Abstract               abstract
2                                      Introduction           introduction
3                             Materials and Methods  materials_and_methods
6                                     Distribution            distribution
8                                       Morphology              morphology
38                              Taxonomic Treatment    taxonomic_treatment
468                               Acknowledgements        acknowledgements
469                                      References             references
477              Appendix I. Quantitative variables             appendix_1
478              Appendix II. Qualitative Variables             appendix_2
486      Appendix III. Excluded and Uncertain Names             appendix_3
491  Appendix IV . Species by Region/Island/Country             appendix_4
517                      

In [12]:

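# Now we have the start lines for each section, we can use the
# pandas ffill (fill-down) method to categorise each line.
# Assigning the result back avoids relying on inplace=True on a column,
# which newer pandas versions warn about
df_lines['section'] = df_lines['section'].ffill()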
print(df_lines.groupby(df_lines.section).size())

section
abstract                    10
acknowledgements            13
appendix_1                   5
appendix_2                  85
appendix_3                 249
appendix_4                1511
appendix_6                 524
distribution                42
introduction                45
materials_and_methods       88
morphology                 397
references                 454
taxonomic_treatment      10359
dtype: int64
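
The "mark the heading lines, then fill down" approach can be seen end to end on a handful of rows. The sketch below uses a made-up list of lines and a two-entry mapper (`toy` and `toy_mapper` are hypothetical names). It uses `re.escape` to escape the heading text, which is an alternative to the manual `replace('.', '\\.')` used above rather than the notebook's own method.

```python
import re
import pandas as pd

# Made-up lines standing in for df_lines.line_cleaned
toy = pd.DataFrame({'line_cleaned': [
    'Title', 'Abstract', 'some text', 'Introduction', 'more text', 'even more',
]})
toy_mapper = {'abstract': 'Abstract', 'introduction': 'Introduction'}

# Step 1: label only the lines that are section headings
toy['section'] = None
for key, heading in toy_mapper.items():
    patt = r'^{}\s*$'.format(re.escape(heading))
    mask = toy['line_cleaned'].str.match(patt, case=False)
    toy.loc[mask, 'section'] = key

# Step 2: fill each label down until the next heading
toy['section'] = toy['section'].ffill()

counts = toy.groupby('section').size()
print(counts)
```

Lines before the first heading (here, `Title`) stay unlabelled and are left out of the grouped counts, which is why the counts above need not add up to the total number of lines.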


Now we'll look for the lines which indicate the start of a taxon treatment, and we'll establish a column identifying the taxon treatment, using the same (filldown) principle as above.

In [13]:
# Define regular expression patterns to find the start lines of taxon treatments
species_treatment_patt = r'^[0-9]+\s*\.\s+Calamus\s+.*$'
infraspecies_treatment_patt = r'^[0-9]+[a-z]\s*\.\s+Calamus\s+[a-z]+\s+subsp\.\s+.*$'

# Utility functions to extract number and name from the line indicating the
# start of a taxonomic treatment
def extractSpeciesNumberAndName(s):
  speciesNumberAndName = None
  if s is not None:
    patt = r'^(?P<species_id>[0-9]+)\s*\.\s+(?P<species_name>Calamus\s+[a-z]+)\s+.*$'
    m = re.match(patt, s)
    if m is not None:
      speciesNumberAndName = m.groupdict()['species_id'] + ' ' + m.groupdict()['species_name']
  return speciesNumberAndName

def extractInfraSpeciesNumberAndName(s):
  infraspeciesNumberAndName = None
  if s is not None:
    patt = r'^(?P<species_id>[0-9]+[a-z])\s*\.\s+(?P<species_name>Calamus\s+[a-z]+\s+subsp\.\s+[a-z]+).*$'
    m = re.match(patt, s)
    if m is not None:
      infraspeciesNumberAndName = m.groupdict()['species_id'] + ' ' + m.groupdict()['species_name']
  return infraspeciesNumberAndName

# Make a new column to hold the taxon_id_and_name
df_lines['taxon_id_and_name']=[None]*len(df_lines)
# As we're using filldown, we want to set the endpoint after the taxon
# treatments. This prevents the last taxon treatment being filled down right
# to the end of the article
mask=(df_lines.line_cleaned.str.match(r'^Acknowledgements\s*$'))
df_lines.loc[mask, 'taxon_id_and_name'] = 'na'

# Find the lines which indicate the start of an infraspecific treatment
mask=(df_lines.line_cleaned.str.match(infraspecies_treatment_patt, case=True, flags=0, na=None))
# Use the utility function to extract the taxon name and number from the text
# of the line
df_lines.loc[mask, 'taxon_id_and_name'] = df_lines[mask].line_cleaned.apply(extractInfraSpeciesNumberAndName)

# As above - find the lines which indicate the start of a species treatment
mask=(df_lines.line_cleaned.str.match(species_treatment_patt, case=True, flags=0, na=None))
# Skip lines that already have a value (infraspecific treatments and the endpoint)
mask = (df_lines.taxon_id_and_name.isnull() & mask)
# Use the utility function to extract the taxon name and number from the text
# of the line
df_lines.loc[mask, 'taxon_id_and_name'] = df_lines[mask].line_cleaned.apply(extractSpeciesNumberAndName)

# Fill down the taxon name and number values
df_lines['taxon_id_and_name'] = df_lines['taxon_id_and_name'].ffill()

# Group to show how many lines were allocated to each taxonomic treatment
pd.set_option('display.max_rows',500)
print(df_lines.groupby(df_lines.taxon_id_and_name).size())

taxon_id_and_name
1 Calamus acamptostachys                                    16
10 Calamus andamanicus                                      28
100 Calamus erectus                                         35
101 Calamus erinaceus                                       25
101a Calamus erinaceus subsp.  erinaceus                     9
101b Calamus erinaceus subsp.  daemonoropoides              11
102 Calamus erioacanthus                                    48
103 Calamus erythrocarpus                                   14
104 Calamus essigii                                         12
105 Calamus eugenei                                         13
106 Calamus  evansii                                        15
107 Calamus exiguus                                         25
108 Calamus exilis                                          41
109 Calamus fertilis                                        14
11 Calamus anomalus                                         23
110 Calamus filipendulus             
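
Two things stand out in the grouped output: the labels sort as text (`1`, `10`, `100`, ...), and some names keep the doubled spaces from the PDF (for example `106 Calamus  evansii`). The sketch below is one possible way to tidy this, not part of the original workflow. `split_label` and `LABEL_PATT` are hypothetical names, and the sample labels are copied from the output above plus the `na` endpoint value.

```python
import re

# Labels copied from the grouped output, plus the 'na' endpoint marker
labels = [
    '1 Calamus acamptostachys',
    '10 Calamus andamanicus',
    '100 Calamus erectus',
    '101a Calamus erinaceus subsp.  erinaceus',
    '106 Calamus  evansii',
    '11 Calamus anomalus',
    'na',
]

# Treatment number, optional subspecies letter, then the name
LABEL_PATT = re.compile(r'^(?P<number>[0-9]+)(?P<letter>[a-z]?)\s+(?P<name>.+)$')

def split_label(label):
    """Return (number, letter, name) or None if the label is not a taxon label."""
    m = LABEL_PATT.match(label)
    if m is None:
        return None
    # Collapse runs of whitespace left over from the PDF extraction
    name = re.sub(r'\s+', ' ', m.group('name')).strip()
    return int(m.group('number')), m.group('letter'), name

# Sorting the tuples orders treatments numerically, then by subspecies letter
parsed = sorted(p for p in map(split_label, labels) if p is not None)
for number, letter, name in parsed:
    print(number, letter, name)
```

Normalising the names this way would also make exact matching against a species list more reliable than substring matching on the raw labels.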

In [14]:
species_names = ['Calamus concolor',
                'Calamus disjunctus',
                'Calamus glaucescens',
                'Calamus hallierianus',
                'Calamus pseudoconcolor',
                'Calamus subangulatus']
for species_name in species_names:
  mask = (df_lines.taxon_id_and_name.str.contains(species_name,na=False))
  print(df_lines[mask][['line_cleaned']])

                                          line_cleaned
122  66. Calamus concolor  (Blume) Baker (2015a: 15...
122  concolo r Pritzel (1854: 245), orth. var. Lect...
122                               s.n. (lectotype L!).
122  Stems  clustered, climbing, 3.0 m long, 0.7(0....
122  a knee below the petiole; leaf sheaths with nu...
122  lateral veins diverging, the two distal margin...
122  terminating in a short, obscure stub; peduncle...
122  and color not recorded; fruiting perianths not...
122   Distribution and habitat:— Central-western Su...
122   Taxonomic notes:— Calamus concolor  belongs t...
122  C. disjunctus, C. glaucescens, C. hallierianus...
122  Ceratolobus . They are characterized by their ...
122  at the apex by two short, lateral splits (see ...
122  species in the group by its pinnae that are no...
123                                                   
123  Fig. 43. Distribution maps of Calamus concolor...
123   Calamus concolor  is a problematic species kn...
123  speci

Now we have a dataframe of lines, and we have identified the section for each line (see the dataframe column `section`) and for the lines which are in the taxonomic treatment section, we have identified the taxa for each line (see the dataframe column `taxon_id_and_name`).

The next task will be to subdivide the taxon treatment into its subsections. These are:

- Nomenclatural details
- Description - starts with "Stems" for species treatments, or "Pinnae" for infraspecific treatments
- Distribution
- Taxonomic notes
- Subspecific variation (only present for species which include subspecific taxa)
- Key to the subspecies (as above - only present for species which include subspecific taxa)

The next code cell defines a mapping between an identifier for the treatment subsection and a regular expression that finds the first line of the treatment subsection.

It should be possible to use this in the same way as the section mapper above: match each regular expression against the lines, then populate a column with the identifier of the matching subsection.

In [15]:
treatment_subsection_mapper={'nomenclatural_details':r'^[0-9]+[a-z]?\s*\.\s+Calamus',
'description':r'^\s*(Stems|Pinnae)',
'distribution':r'^\s*Distribution',
'taxonomic_notes':r'^\s*Taxonomic notes',
'subspecific_variation':r'^\s*Subspecific variation',
'key_to_subspecies':r'^\s*Key to the subspecies of',
}
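
Before applying the mapper to the whole dataframe, the patterns can be tried on a few individual lines. This is a small sketch in which `classify` and `subsection_patterns` are hypothetical names. The mapper is repeated so the example runs on its own, and the sample lines are shortened from lines that appear in the monograph text. Unlike the dataframe loop, `classify` returns the first matching key, which gives the same result here because each pattern starts with a different word.

```python
import re

# Same patterns as treatment_subsection_mapper, repeated for a standalone example
subsection_patterns = {
    'nomenclatural_details': r'^[0-9]+[a-z]?\s*\.\s+Calamus',
    'description': r'^\s*(Stems|Pinnae)',
    'distribution': r'^\s*Distribution',
    'taxonomic_notes': r'^\s*Taxonomic notes',
    'subspecific_variation': r'^\s*Subspecific variation',
    'key_to_subspecies': r'^\s*Key to the subspecies of',
}

def classify(line):
    """Return the subsection key whose pattern matches the start of the line, else None."""
    for key, patt in subsection_patterns.items():
        if re.match(patt, line):
            return key
    return None

sample_lines = [
    '1. Calamus acamptostachys',
    '411b. Calamus zonatus subsp.  corneri',
    'Stems  solitary, non-climbing',
    'Pinnae 26(19–32) per side of rachis',
    ' Distribution and habitat',
    ' Taxonomic notes',
    'a knee below the petiole',
]
results = [classify(line) for line in sample_lines]
for line, result in zip(sample_lines, results):
    print(repr(line), '->', result)
```

Continuation lines such as the last sample return `None`; in the dataframe these are the rows that the later filldown assigns to the preceding subsection.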

A new column (`treatment_subsection`) is added to `df_lines`.

For each line that starts a treatment subsection, `treatment_subsection` is set to the identifier of that subsection.

In [16]:
# Add a column to df_lines so that we can categorise the taxonomic treatment line
# into its enclosing treatment_subsection
df_lines['treatment_subsection']=[None]*len(df_lines)

# Loop over the treatment_subsection_mapper mapper and update the df_lines dataframe with
# the appropriate treatment_subsection
for key, value in treatment_subsection_mapper.items():
  # make a mask with the regular expression in "value"
  mask=(df_lines.line_cleaned.str.match(value, case=True, flags=0, na=None) & (df_lines.section == "taxonomic_treatment"))
  # Use the mask to find the matching lines and write the key into treatment_subsection
  df_lines.loc[mask, 'treatment_subsection'] = key

print(df_lines[df_lines.treatment_subsection.notnull()][['line_cleaned','treatment_subsection']])

                                          line_cleaned   treatment_subsection
38   Stems  solitary or clustered, climbing or non-...            description
64   1. Calamus acamptostachys  (Beccari) Baker (20...  nomenclatural_details
64   Stems  solitary, non-climbing, erect or creepi...            description
64    Distribution and habitat:— Borneo (Sarawak) (...           distribution
64    Taxonomic notes:— Calamus acamptostachys  was...        taxonomic_notes
..                                                 ...                    ...
468   Taxonomic notes:— Calamus zonatus  subsp. zon...        taxonomic_notes
468  411b. Calamus zonatus subsp.  corneri  (Furtad...  nomenclatural_details
468  Pinnae 26(19–32) per side of rachis; middle pi...            description
468   Distribution and habitat :—Peninsular Malaysi...           distribution
468   Taxonomic notes:— In Peninsular Malaysia, Cal...        taxonomic_notes

[1879 rows x 2 columns]


Now apply the filldown and see if you can extract just the description lines for the species of interest (those listed in `species_names`).

In [18]:
# Filldown the treatment_subsection column to fill missing values
df_lines['treatment_subsection'] = df_lines['treatment_subsection'].ffill()

# Iterate over each species name in species_names
# For each species, create a mask to select rows where the taxon_id_and_name column contains the species name, and the treatment_subsection column = 'description'
for species_name in species_names:
    mask = (df_lines.taxon_id_and_name.str.contains(species_name, na=False) & (df_lines.treatment_subsection == 'description'))
    print(df_lines[mask][['line_cleaned']])


                                          line_cleaned
122  Stems  clustered, climbing, 3.0 m long, 0.7(0....
122  a knee below the petiole; leaf sheaths with nu...
122  lateral veins diverging, the two distal margin...
122  terminating in a short, obscure stub; peduncle...
122  and color not recorded; fruiting perianths not...
                                          line_cleaned
149  Stems  clustered, climbing, 7.7(3.0–15.0) m lo...
149  petiole, with a knee below the petiole; leaf s...
149  rhombic, the lateral veins diverging, the two ...
149  of divergence, erect to arching, short, withou...
149  orangey-brown, or red; fruiting perianths expl...
                                          line_cleaned
183  Stems  clustered, climbing, 4.0 m long, 1.0(0....
183  with a knee below the petiole; leaf sheath spi...
183  without spinules on veins, grayish-white indum...
183  divergence, arching, short, without any recurv...
                                          line_cleaned
191  Stems
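
A possible next step is to join the description lines of each taxon into one string. The sketch below illustrates this on a toy dataframe rather than on `df_lines`. The column names match the ones used above, the line fragments are shortened from the output, and `toy` and `descriptions` are hypothetical names.

```python
import pandas as pd

# Toy rows standing in for df_lines after both filldown steps
toy = pd.DataFrame({
    'taxon_id_and_name': [
        '1 Calamus acamptostachys',
        '1 Calamus acamptostachys',
        '66 Calamus concolor',
        '66 Calamus concolor',
        '66 Calamus concolor',
    ],
    'treatment_subsection': [
        'description', 'distribution',
        'description', 'description', 'distribution',
    ],
    'line_cleaned': [
        'Stems  solitary, non-climbing,',
        ' Distribution and habitat',
        'Stems  clustered, climbing, 3.0 m long,',
        'a knee below the petiole;',
        ' Distribution and habitat',
    ],
})

# Keep only description lines, then join them per taxon with a single space
is_description = toy['treatment_subsection'] == 'description'
descriptions = (
    toy[is_description]
    .groupby('taxon_id_and_name')['line_cleaned']
    .agg(' '.join)
)
print(descriptions)
```

Joining with a space keeps the original line order within each taxon, but it does not repair words that were hyphenated across line breaks in the PDF; those would need separate handling.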