<a href="https://colab.research.google.com/github/KewBridge/CalamusTraits/blob/main/notebooks/treatment_extraction_from_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how we can extract data from a publication in PDF format.

First we need to install a library that helps us process PDF files.

In [1]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.0.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-5.0.0-py3-none-any.whl (292 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.0.0


Because the Calamus monograph is published closed access, we'll need to download the PDF from the publisher and then upload it into Colab storage to access it from this notebook. Click on the folder icon in the left-hand toolbar in Colab and then on the file upload icon. Upload the PDF and make a variable (`pdf_filename`) that holds the filename.

In [2]:
pdf_filename = "38576-Article Text-39176-42143-10-20200525.pdf"
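
Before going further it can help to confirm that the upload worked. The cell below is a minimal sketch: `check_pdf_uploaded` is a hypothetical helper (not part of pypdf) which checks that the file exists and begins with the `%PDF` signature that PDF files start with. The filename is the one assigned to `pdf_filename` above.

```python
from pathlib import Path

def check_pdf_uploaded(filename):
    """Return True if the file exists and starts with the PDF signature."""
    path = Path(filename)
    if not path.is_file():
        return False
    with path.open('rb') as f:
        # PDF files begin with the bytes %PDF
        return f.read(4) == b'%PDF'

# Same filename as assigned to pdf_filename in the cell above
candidate = "38576-Article Text-39176-42143-10-20200525.pdf"
if check_pdf_uploaded(candidate):
    print('Found', candidate)
else:
    print('Could not find a PDF called', candidate, '- check the upload and the filename')
```

If the check fails, the most likely causes are a filename that differs from the uploaded one or an upload that has not finished.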

First we'll extract all the text and save it in a text file for examination.

In [3]:
from pypdf import PdfReader
reader = PdfReader(pdf_filename)

with open('calamus_monograph.txt', 'w') as f:
  # Loop over each page in the PDF and write it to a text file
  # using a marker to indicate page boundaries
  for page in reader.pages:
      page_number = page.page_number + 1
      page_text = page.extract_text()
      f.write('-'*30 + str(page_number) + '-'*30 + '\n')
      f.write(page_text)
      f.write('\n'+'-'*80+'\n')

You should see a new file appear on the left in the file browser section of the Colab interface. If the cell has finished executing and you don't see it, click the refresh icon. If you double-click the text file you'll be able to view the contents.
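
The page markers written above also make it possible to read the text file back in and recover the individual pages. The following is a minimal sketch, assuming the marker format used in the previous cell (30 dashes, the page number, 30 dashes before each page, and 80 dashes after it). `split_pages` is a hypothetical helper and the sample text is made up.

```python
import re

# Header written before each page: 30 dashes, the page number, 30 dashes
header_patt = re.compile(r'^-{30}(\d+)-{30}$')
# Footer written after each page: 80 dashes
footer = '-' * 80

def split_pages(text):
    """Return {page_number: page_text} from text written with the markers above."""
    pages = {}
    current = None
    buffer = []
    for line in text.split('\n'):
        m = header_patt.match(line)
        if m:
            # Start of a new page: remember its number and reset the buffer
            current = int(m.group(1))
            buffer = []
        elif line == footer and current is not None:
            # End of the current page: store the accumulated lines
            pages[current] = '\n'.join(buffer)
            current = None
        elif current is not None:
            buffer.append(line)
    return pages

# Made-up sample in the same layout as calamus_monograph.txt
sample = ('-'*30 + '1' + '-'*30 + '\n' + 'first page\nsecond line' + '\n' + '-'*80 + '\n'
          + '-'*30 + '2' + '-'*30 + '\n' + 'another page' + '\n' + '-'*80 + '\n')
print(split_pages(sample))
```

This would misbehave if a page's own text contained a line of exactly 80 dashes, so treat it as a convenience for inspection rather than a robust parser.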

Now we'll extract the text from the PDF and store it in a pandas dataframe for easier processing.

In [4]:
from pypdf import PdfReader
import pandas as pd
import re

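# Define a function that will help us clean boilerplate text from lines
def cleanLine(s, page_number):
  # Default: return the line unchanged
  cleaned = s
  if s.startswith('HENDERSON {page_number}'.format(page_number=page_number)) and '© 2020 Magnolia Press' in s:
    # Boilerplate starting with the author name and page number:
    # keep only the text after the copyright notice
    cleaned = s.split('© 2020 Magnolia Press')[1]
  elif s.startswith('A REVISION OF CALAMUS'):
    # Boilerplate starting with the article title:
    # keep only the text after the page number
    try:
      cleaned = s.split('•   {page_number}'.format(page_number=page_number))[1]
    except IndexError:
      # The expected page number marker was not found: report the line
      # and leave it unchanged
      print(s, page_number)
  return cleaned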

# Open the PDF file for reading
reader = PdfReader(pdf_filename)

# Extract pages and store with their page number in a dictionary
pages = dict()
for page in reader.pages:
    page_number = page.page_number + 1
    page_text = page.extract_text()
    pages[page_number] = page_text

# Make a dataframe from the pages dictionary
df_pages = pd.DataFrame(data={'page_number':pages.keys(),'page_text':pages.values()})

# Make a new column which holds the individual lines in each page
# First as a list of lines
df_pages['line'] = df_pages['page_text'].apply(lambda x: x.split('\n'))
# Now "explode" the list of lines so that each line is in its own row
# in the dataframe
df_lines = df_pages.explode('line')
# page text can be dropped as no longer needed
df_lines.drop(columns=['page_text'],inplace=True)
# Clean any boilerplate footer text from the lines
df_lines['line_cleaned'] = df_lines.apply(lambda row: cleanLine(row['line'], row['page_number']), axis=1)

print(df_lines)

     page_number                                               line  \
0              1                        Phytotaxa  445 (1): 001–656   
0              1                      https://www.mapress.com/j/pt/   
0              1  Copyright © 2020 Magnolia PressPHYTOTAXAISSN 1...   
0              1                    ISSN 1179-3163 (online edition)   
0              1  Accepted by William Baker: 11 Apr. 2020; publi...   
..           ...                                                ...   
654          655  leaf with regularly arranged, linear pinnae sp...   
655          656  HENDERSON 656   •   Phytotaxa  445 (1) © 2020 ...   
655          656  Plate 139. Calamus zollingeri  subsp. zollinge...   
655          656  brown, the other shorter, needle-like, black, ...   
655          656  Regularly arranged, lanceolate pinnae ( Hender...   

                                          line_cleaned  
0                          Phytotaxa  445 (1): 001–656  
0                        https://
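
The split-then-explode step used above is easier to see on a small example. This is a sketch on a made-up two-page dataframe; the column names match those used in the cell above.

```python
import pandas as pd

# Made-up data: two "pages", the first with two lines
toy = pd.DataFrame({'page_number': [1, 2],
                    'page_text': ['line a\nline b', 'line c']})

# Split each page into a list of lines, then give each line its own row
toy['line'] = toy['page_text'].str.split('\n')
toy_lines = toy.explode('line').drop(columns=['page_text'])
print(toy_lines)
```

Note that `explode` keeps the index of the original row, which is why the index values repeat in the printed output above (every line from the first page has index 0). If a unique index per line is needed, `reset_index(drop=True)` can be applied afterwards.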

Next, we'll see if we can identify the different sections of the article.

In [5]:
# Here we map between the section identifier that we want to establish,
# and the line that represents the start of this section

section_mapper={'abstract':'Abstract',
'introduction':'Introduction',
'materials_and_methods':'Materials and Methods',
'distribution':'Distribution',
'morphology':'Morphology',
'taxonomic_treatment':'Taxonomic Treatment',
'acknowledgements':'Acknowledgements',
'references':'References',
'appendix_1':'Appendix I. Quantitative variables',
'appendix_2':'Appendix II. Qualitative Variables',
'appendix_3':'Appendix III. Excluded and Uncertain Names',
'appendix_4':'Appendix IV . Species by Region/Island/Country',
'appendix_6':'Appendix VI. Plates'}

# Add a column to df_lines so that we can categorise the line
# into its enclosing section
df_lines['section']=[None]*len(df_lines)

# Loop over the section mapper and update the df_lines dataframe with
# the appropriate section
for key, value in section_mapper.items():
  mask=(df_lines.line_cleaned.str.match(r'^{}\s*$'.format(value.replace('.','\\.')), case=False, flags=0, na=None))
  df_lines.loc[mask, 'section'] = key

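# Now we have the start lines for each section, we can use the
# pandas filldown function (ffill) to categorise each line.
# Assigning the result back avoids relying on an in-place update of a column
df_lines['section'] = df_lines['section'].ffill()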

print(df_lines.groupby(df_lines.section).size())

section
abstract                    10
acknowledgements            13
appendix_1                   5
appendix_2                  85
appendix_3                 249
appendix_4                1511
appendix_6                 524
distribution                42
introduction                45
materials_and_methods       88
morphology                 397
references                 454
taxonomic_treatment      10359
dtype: int64
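
The "mark the start lines, then fill down" idea can be tried on a few made-up lines. This is a minimal sketch using an exact string comparison rather than the regular expression match used above; the headings and text are invented for illustration.

```python
import pandas as pd

# Made-up lines: a title, then two sections
toy = pd.DataFrame({'line': ['Title line', 'Abstract', 'some text',
                             'Introduction', 'more text', 'even more']})
markers = {'abstract': 'Abstract', 'introduction': 'Introduction'}

# Label only the lines that start a section
toy['section'] = None
for key, heading in markers.items():
    toy.loc[toy['line'] == heading, 'section'] = key

# Fill each label down until the next labelled line
toy['section'] = toy['section'].ffill()
print(toy)
```

Lines that come before the first heading (here the title line) have nothing to fill from and so stay unlabelled, which is why they do not appear in a `groupby` count.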


Now we'll look for the lines which indicate the start of a taxon treatment, and we'll establish a column identifying the taxon treatment, using the same (filldown) principle as above.

In [6]:
# Define regular expression patterns to find the start lines of taxon treatments
species_treatment_patt = r'^[0-9]+\s*\.\s+Calamus\s+.*$'
infraspecies_treatment_patt = r'^[0-9]+[a-z]\s*\.\s+Calamus\s+[a-z]+\s+subsp\.\s+.*$'

# Utility functions to extract number and name from the line indicating the
# start of a taxonomic treatment
def extractSpeciesNumberAndName(s):
  speciesNumberAndName = None
  if s is not None:
    patt = r'^(?P<species_id>[0-9]+)\s*\.\s+(?P<species_name>Calamus\s+[a-z]+)\s+.*$'
    m = re.match(patt, s)
    if m is not None:
      speciesNumberAndName = m.groupdict()['species_id'] + ' ' + m.groupdict()['species_name']
  return speciesNumberAndName

def extractInfraSpeciesNumberAndName(s):
  infraspeciesNumberAndName = None
  if s is not None:
    patt = r'^(?P<species_id>[0-9]+[a-z])\s*\.\s+(?P<species_name>Calamus\s+[a-z]+\s+subsp\.\s+[a-z]+).*$'
    m = re.match(patt, s)
    if m is not None:
      infraspeciesNumberAndName = m.groupdict()['species_id'] + ' ' + m.groupdict()['species_name']
  return infraspeciesNumberAndName

# Make a new column to hold the taxon_id_and_name
df_lines['taxon_id_and_name']=[None]*len(df_lines)
# As we're using filldown, we want to set the endpoint after the taxon
# treatments. This prevents the last taxon treatment being filled down right
# to the end of the article
mask=(df_lines.line_cleaned.str.match(r'^Acknowledgements\s*$'))
df_lines.loc[mask, 'taxon_id_and_name'] = 'na'

# Find the lines which indicate the start of an infraspecific treatment
mask=(df_lines.line_cleaned.str.match(infraspecies_treatment_patt, case=True, flags=0, na=None))
# Use the utility function to extract the taxon name and number from the text
# of the line
df_lines.loc[mask, 'taxon_id_and_name'] = df_lines[mask].line_cleaned.apply(extractInfraSpeciesNumberAndName)

# As above - find the lines which indicate the start of a species treatment
mask=(df_lines.line_cleaned.str.match(species_treatment_patt, case=True, flags=0, na=None))
# Use the utility function to extract the taxon name and number from the text
# of the line
mask = (df_lines.taxon_id_and_name.isnull() & mask)
df_lines.loc[mask, 'taxon_id_and_name'] = df_lines[mask].line_cleaned.apply(extractSpeciesNumberAndName)

# Fill down the taxon name and number values
# (assigning the result back rather than updating the column in place)
df_lines['taxon_id_and_name'] = df_lines['taxon_id_and_name'].ffill()

# Group to show how many lines were allocated to each taxonomic treatment
pd.set_option('display.max_rows',500)
print(df_lines.groupby(df_lines.taxon_id_and_name).size())

taxon_id_and_name
1 Calamus acamptostachys                                    16
10 Calamus andamanicus                                      28
100 Calamus erectus                                         35
101 Calamus erinaceus                                       25
101a Calamus erinaceus subsp.  erinaceus                     9
101b Calamus erinaceus subsp.  daemonoropoides              11
102 Calamus erioacanthus                                    48
103 Calamus erythrocarpus                                   14
104 Calamus essigii                                         12
105 Calamus eugenei                                         13
106 Calamus  evansii                                        15
107 Calamus exiguus                                         25
108 Calamus exilis                                          41
109 Calamus fertilis                                        14
11 Calamus anomalus                                         23
110 Calamus filipendulus             
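
To see what the two heading patterns capture, they can be run on individual lines. The sketch below reuses the regular expressions from the utility functions above. `parse_heading` is a hypothetical helper that also collapses repeated whitespace in the name, which is an addition of ours and not something the cell above does (names such as "Calamus  evansii" keep their double space there).

```python
import re

species_patt = re.compile(
    r'^(?P<species_id>[0-9]+)\s*\.\s+(?P<species_name>Calamus\s+[a-z]+)\s+.*$')
infraspecies_patt = re.compile(
    r'^(?P<species_id>[0-9]+[a-z])\s*\.\s+(?P<species_name>Calamus\s+[a-z]+\s+subsp\.\s+[a-z]+).*$')

def parse_heading(line):
    """Return (id, name) for a treatment heading, or None if the line is not one."""
    # Try the more specific infraspecific pattern first
    for patt in (infraspecies_patt, species_patt):
        m = patt.match(line)
        if m:
            # Collapse runs of whitespace left over from PDF extraction
            name = ' '.join(m.group('species_name').split())
            return m.group('species_id'), name
    return None

print(parse_heading('1. Calamus acamptostachys  (Beccari) Baker (2015a: 141).'))
print(parse_heading('73a. Calamus crinitus subsp.  crinitus'))
print(parse_heading('Stems  clustered, climbing, 3.0 m long'))
```

The first two lines return an identifier and a name, and the third returns `None` because it is not a heading.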

In [7]:
species_names = ['Calamus concolor',
                'Calamus disjunctus',
                'Calamus glaucescens',
                'Calamus hallierianus',
                'Calamus pseudoconcolor',
                'Calamus subangulatus']
for species_name in species_names:
  mask = (df_lines.taxon_id_and_name.str.contains(species_name,na=False))
  print(df_lines[mask][['line_cleaned']])

                                          line_cleaned
122  66. Calamus concolor  (Blume) Baker (2015a: 15...
122  concolo r Pritzel (1854: 245), orth. var. Lect...
122                               s.n. (lectotype L!).
122  Stems  clustered, climbing, 3.0 m long, 0.7(0....
122  a knee below the petiole; leaf sheaths with nu...
122  lateral veins diverging, the two distal margin...
122  terminating in a short, obscure stub; peduncle...
122  and color not recorded; fruiting perianths not...
122   Distribution and habitat:— Central-western Su...
122   Taxonomic notes:— Calamus concolor  belongs t...
122  C. disjunctus, C. glaucescens, C. hallierianus...
122  Ceratolobus . They are characterized by their ...
122  at the apex by two short, lateral splits (see ...
122  species in the group by its pinnae that are no...
123                                                   
123  Fig. 43. Distribution maps of Calamus concolor...
123   Calamus concolor  is a problematic species kn...
123  speci

In [8]:
for line in df_lines.line_cleaned.tolist():
  if re.match(r'^[0-9]+[a-z]*\. Calamus', line):
    print(line)

1. Calamus acamptostachys  (Beccari) Baker (2015a: 141).  Daemonorops acamptostachys  Beccari (1911a: 209). 
2. Calamus acanthochlamys  Dransfield (1990: 85). Type:—MALAYSIA. Sarawak, 4
3. Calamus acanthophyllus  Beccari (1910: 229).  Type:—THAILAND. Expédition du Me-Kong, Rivière d’Ubon, 
4. Calamus acanthospathus Griffith (1845: 39).  Palmijuncus acanthospathus (Griffith) Kuntze (1891: 733). Lectotype 
5. Calamus acaulis  Henderson, Ninh Khac Ban & Nguyen Quoc Dung (2008: 188).  Type:—VIETNAM. Phu Yen 
6. Calamus adspersus  (Blume) Blume (1847: 40). Daemonorops adspersa  Blume (1847: t. 142, 143). Palmijuncus 
7. Calamus aidae  Fernando (1989: 49) . Type:—PHILIPPINES. Samar, Basey, Guirang, Rawis Baswood concession 
8. Calamus albidus  Guo Lixiu & Henderson (2007: 352).  Calamus oxycarpus  var. angustifolius  Chen San-Yang 
9. Calamus altiscandens  Burret (1939: 196).  Lectotype (designated here):—PAPUA NEW GUINEA. Palmer River, 
10. Calamus andamanicus  Kurz (1874: 211).  Palmijuncu

You can try modifying this so that all the descriptive information for a treatment is saved in a data structure. You'll need to:
1. Extract the species name (or the infraspecific name; these are indicated by an alphabetic character after the number, e.g. "73a. Calamus crinitus subsp.  crinitus").
2. Find the lines which make up the description. These start with the word "Stems" and continue down to the line which starts "Distribution and habitat".
3. Save these lines in a data structure, maybe a dictionary keyed by species name. Initially we will only want to work with the species that were previously regarded as *Ceratolobus*, i.e.:
    - *Calamus concolor*
    - *Calamus disjunctus*
    - *Calamus glaucescens*
    - *Calamus hallierianus*
    - *Calamus pseudoconcolor*
    - *Calamus subangulatus*
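
Steps 2 and 3 can be tried in isolation before working on the whole article. The following is a minimal sketch on made-up lines standing in for a single treatment. `extract_description` is a hypothetical helper, and the toy lines are abbreviated placeholders rather than the real text.

```python
def extract_description(lines):
    """Return the text from the line starting 'Stems' up to, but not
    including, the line starting 'Distribution and habitat'."""
    description = []
    collecting = False
    for line in lines:
        stripped = line.strip()
        if collecting and stripped.startswith('Distribution and habitat'):
            # End of the description
            break
        if stripped.startswith('Stems'):
            # Start of the description
            collecting = True
        if collecting:
            description.append(stripped)
    return ' '.join(description)

# Made-up, abbreviated lines standing in for one treatment
toy_treatment = ['66. Calamus concolor  (Blume) Baker',
                 'Stems  clustered, climbing.',
                 'Leaf sheaths tubular.',
                 ' Distribution and habitat: (text omitted)',
                 'Taxonomic notes: (text omitted)']

# Step 3: a dictionary keyed by species name
descriptions = {'Calamus concolor': extract_description(toy_treatment)}
print(descriptions)
```

Joining with a space rather than a newline gives one continuous string per species. Whether that is preferable depends on how the descriptions will be processed later.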

In [9]:
species_names = ['Calamus concolor',
                'Calamus disjunctus',
                'Calamus glaucescens',
                'Calamus hallierianus',
                'Calamus pseudoconcolor',
                'Calamus subangulatus']
description_lines = []
in_species_treatment = False
in_species_description = False
species_id = None
species_name = None
species_descriptions = dict()
for line in df_lines.line_cleaned.tolist():
  # Define a pattern that indicates the start of a species treatment
  # This syntax uses "named groups" to extract the species_id and species name
  patt = r'^(?P<species_id>[0-9]+[a-z]*)\. (?P<species_name>Calamus [a-z]+) .*$'
  m = re.match(patt, line)
  # This tests if we found a match
  if m is not None:
    # Now we test if the species name is one that we want to extract
    # ie is it one of the list defined above ("species_names")
    if m.groupdict()['species_name'] in species_names:
      species_id = m.groupdict()['species_id']
      species_name = m.groupdict()['species_name']
      # Set a flag to indicate that we're currently inside a species treatment
      in_species_treatment = True
      continue
  if in_species_treatment:
    # When we're in a species treatment, we don't want to save lines until
    # we hit the line starting with "Stems"
    if line.startswith('Stems'):
      # Set a flag to indicate that we're currently inside a species description
      in_species_description = True
  if in_species_description:
    # This test is for the end of the species description
    if line.strip().startswith('Distribution') or line.strip().startswith('Fig'):
      in_species_description = False
      # Save the data into our dictionary
      species_descriptions[species_name] = '\n'.join(description_lines)
      # clear all variables for next loop
      description_lines = []
      in_species_treatment = False
      in_species_description = False
      species_id = None
      species_name = None
    else:
      # as we're not yet at the end, just keep accumulating lines
      description_lines.append(line)

# Output our saved description data
for species_name, species_description in species_descriptions.items():
  print(species_name)
  print(species_description)

Calamus concolor
Stems  clustered, climbing, 3.0 m long, 0.7(0.5–1.0) cm diameter. Leaf sheaths  tubular, closed opposite the petiole, with 
a knee below the petiole; leaf sheaths with numerous spicules borne on short, low, horizontal ridges, easily detached and leaving the sheaths with ridges only; ocreas short, membranous, not spiny, with external and internal abscission zone, splitting and falling early; flagella absent; petioles 4.3(1.0–6.0) cm long; rachises 36.2(28.5–45.0) cm long, the apices extended into an elongate cirrus, without reduced or vestigial pinnae, adaxially flat, abaxially with more or less regularly arranged (at least proximally), distantly spaced clusters of dark–tipped, recurved spines, terminating in a stub, without a shallow groove adaxially; pinnae  5(5–6) per side of rachis, regularly arranged, rhombic, the 
lateral veins diverging, the two distal margins praemorse, without spinules on veins, sometimes with few to numerous, small, brown scales abaxially; pro