
This notebook shows how to extract data from a publication in PDF format.

First we need to install a library that helps us process PDF files:

In [1]:
!pip install pypdf



Because the Calamus monograph is published closed access, we need to download the PDF from the publisher and then upload it into Colab storage so that this notebook can read it. Click on the folder icon in the left-hand toolbar in Colab and then on the file upload icon. Upload the PDF of the article and create a variable (`pdf_filename`) that holds the filename:

In [2]:
pdf_filename = "38576-Article Text-39176-42143-10-20200525.pdf"

First we'll extract all the text and save it in a text file for examination:

In [3]:
from pypdf import PdfReader
reader = PdfReader(pdf_filename)

# Write as UTF-8 so that characters such as '©' and '•' are saved reliably
with open('calamus_monograph.txt', 'w', encoding='utf-8') as f:
  # Loop over each page in the PDF and write it to a text file
  # using a marker to indicate page boundaries
  for page in reader.pages:
    # page.page_number is zero-based, so add 1 for a human-readable number
    page_number = page.page_number + 1
    page_text = page.extract_text()
    f.write('-'*30 + str(page_number) + '-'*30 + '\n')
    f.write(page_text)
    f.write('\n'+'-'*80+'\n')

Now we'll extract the text from the PDF and store it in a pandas dataframe for easier processing.

In [4]:
from pypdf import PdfReader
import pandas as pd
import re

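# Define a function that will help us clean boilerplate text from lines
def cleanLine(s, page_number):
  cleaned = s
  if s.startswith('HENDERSON {page_number}'.format(page_number=page_number)) and '© 2020 Magnolia Press' in s:
    # Boilerplate starting with the author name: keep the text after the copyright notice
    cleaned = s.split('© 2020 Magnolia Press')[1]
  elif s.startswith('A REVISION OF CALAMUS'):
    # Boilerplate starting with the title: keep the text after the page number
    parts = s.split('•   {page_number}'.format(page_number=page_number))
    if len(parts) > 1:
      cleaned = parts[1]
    else:
      # Unexpected layout: report the line and leave it unchanged
      print(s, page_number)
  return cleaned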

# Open the PDF file for reading
reader = PdfReader(pdf_filename)

# Extract pages and store with their page number in a dictionary
pages = dict()
for page in reader.pages:
    page_number = page.page_number + 1
    page_text = page.extract_text()
    pages[page_number] = page_text

# Make a dataframe from the pages dictionary
df_pages = pd.DataFrame(data={'page_number':pages.keys(),'page_text':pages.values()})

# Make a new column which holds the individual lines in each page
# First as a list of lines
df_pages['line'] = df_pages['page_text'].apply(lambda x: x.split('\n'))
# Now "explode" the list of lines so that each line is in its own row
# in the dataframe
df_lines = df_pages.explode('line')
# page text can be dropped as no longer needed
df_lines.drop(columns=['page_text'],inplace=True)
# Clean any boilerplate footer text from the lines
df_lines['line_cleaned'] = df_lines.apply(lambda row: cleanLine(row['line'], row['page_number']), axis=1)

print(df_lines)

     page_number                                               line  \
0              1                        Phytotaxa  445 (1): 001–656   
0              1                      https://www.mapress.com/j/pt/   
0              1  Copyright © 2020 Magnolia PressPHYTOTAXAISSN 1...   
0              1                    ISSN 1179-3163 (online edition)   
0              1  Accepted by William Baker: 11 Apr. 2020; publi...   
..           ...                                                ...   
654          655  leaf with regularly arranged, linear pinnae sp...   
655          656  HENDERSON 656   •   Phytotaxa  445 (1) © 2020 ...   
655          656  Plate 139. Calamus zollingeri  subsp. zollinge...   
655          656  brown, the other shorter, needle-like, black, ...   
655          656  Regularly arranged, lanceolate pinnae ( Hender...   

                                          line_cleaned  
0                          Phytotaxa  445 (1): 001–656  
0                        https://
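
The repeated index values in the printed dataframe (0, 0, 0, ...) come from the "explode" step: every line keeps the index of the page it came from. As a small illustration, here is the same reshaping on a made-up two-page document. The names `toy_pages`, `df_toy` and `df_toy_lines` are hypothetical and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical two-page document, used only to illustrate the reshaping
toy_pages = {1: "first line\nsecond line", 2: "third line"}

df_toy = pd.DataFrame({
    "page_number": list(toy_pages.keys()),
    "page_text": list(toy_pages.values()),
})

# Split each page into a list of lines, then give every line its own row
df_toy["line"] = df_toy["page_text"].str.split("\n")
df_toy_lines = df_toy.explode("line").drop(columns=["page_text"])

print(df_toy_lines)
```

If a unique row index is more convenient for later processing, calling `reset_index(drop=True)` on the exploded dataframe is one option.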

Now we'll look for the lines which indicate the start of a taxon treatment:

In [5]:
for line in df_lines.line_cleaned.tolist():
  # Use a raw string so that '\.' is passed to the regex engine unchanged
  if re.match(r'^[0-9]+[a-z]*\. Calamus', line):
    print(line)

1. Calamus acamptostachys  (Beccari) Baker (2015a: 141).  Daemonorops acamptostachys  Beccari (1911a: 209). 
2. Calamus acanthochlamys  Dransfield (1990: 85). Type:—MALAYSIA. Sarawak, 4
3. Calamus acanthophyllus  Beccari (1910: 229).  Type:—THAILAND. Expédition du Me-Kong, Rivière d’Ubon, 
4. Calamus acanthospathus Griffith (1845: 39).  Palmijuncus acanthospathus (Griffith) Kuntze (1891: 733). Lectotype 
5. Calamus acaulis  Henderson, Ninh Khac Ban & Nguyen Quoc Dung (2008: 188).  Type:—VIETNAM. Phu Yen 
6. Calamus adspersus  (Blume) Blume (1847: 40). Daemonorops adspersa  Blume (1847: t. 142, 143). Palmijuncus 
7. Calamus aidae  Fernando (1989: 49) . Type:—PHILIPPINES. Samar, Basey, Guirang, Rawis Baswood concession 
8. Calamus albidus  Guo Lixiu & Henderson (2007: 352).  Calamus oxycarpus  var. angustifolius  Chen San-Yang 
9. Calamus altiscandens  Burret (1939: 196).  Lectotype (designated here):—PAPUA NEW GUINEA. Palmer River, 
10. Calamus andamanicus  Kurz (1874: 211).  Palmijuncu
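
The pattern above only tests whether a line starts a treatment. With capture groups, the same idea can also pull out the treatment number, the optional letter suffix and the name. The sketch below is one possible way to do this, not a tested parser for the whole monograph. It assumes that epithets contain only lowercase letters and hyphens, and that infraspecific ranks are written `subsp.` or `var.` directly after the species epithet. The names `HEADING` and `parse_heading` are hypothetical.

```python
import re

# Assumed layout: number, optional lowercase letter, full stop, genus and
# epithet, then optionally an infraspecific rank and epithet
HEADING = re.compile(
    r'^(?P<number>[0-9]+)(?P<suffix>[a-z]*)\. '
    r'(?P<species>Calamus [a-z-]+)'
    r'(?:\s+(?P<rank>subsp\.|var\.)\s+(?P<epithet>[a-z-]+))?'
)

def parse_heading(line):
    """Return the parts of a treatment heading, or None if the line is not one."""
    m = HEADING.match(line)
    if m is None:
        return None
    name = m.group('species')
    if m.group('rank'):
        # Rebuild the infraspecific name with single spaces
        name = '{} {} {}'.format(name, m.group('rank'), m.group('epithet'))
    return {
        'number': int(m.group('number')),
        'suffix': m.group('suffix'),
        'name': name,
    }

# Heading lines taken from the text shown in this notebook
print(parse_heading('4. Calamus acanthospathus Griffith (1845: 39).'))
print(parse_heading('73a. Calamus crinitus subsp.  crinitus'))
```

A non-empty `suffix` marks an infraspecific treatment, so it can be used to tell species and subspecies headings apart.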

You can try modifying this so that all the descriptive information for a treatment is saved in a data structure. You'll need to:
1. Extract the species name (or the infraspecific name, which is indicated by an alphabetic character after the number, e.g. "73a. Calamus crinitus subsp.  crinitus").
2. Find the lines that make up the description. These start with the word "Stems" and run down to the line that starts with "Distribution and habitat".
3. Save these lines in a data structure, for example a dictionary keyed by species name. Initially we only want to work with the species that were previously regarded as *Ceratolobus*, i.e.:
    - *Calamus concolor*
    - *Calamus disjunctus*
    - *Calamus glaucescens*
    - *Calamus hallierianus*
    - *Calamus pseudoconcolor*
    - *Calamus subangulatus*
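
As a starting point, the three steps above could be sketched as follows. This is only one possible approach under some assumptions: the "Distribution and habitat" line is treated as the end marker and is not saved, and infraspecific treatments are stored under their species name. The names `HEADING`, `TARGETS`, `collect_descriptions` and `toy_lines` are hypothetical, and the toy lines are invented placeholders, not text from the monograph.

```python
import re

# Assumed heading pattern, with the species name captured
HEADING = re.compile(r'^[0-9]+[a-z]*\. (Calamus [a-z-]+)')

# Species previously regarded as Ceratolobus, as listed above
TARGETS = {
    'Calamus concolor',
    'Calamus disjunctus',
    'Calamus glaucescens',
    'Calamus hallierianus',
    'Calamus pseudoconcolor',
    'Calamus subangulatus',
}

def collect_descriptions(lines, targets=TARGETS):
    """Return a dictionary mapping species name to its description lines."""
    descriptions = {}
    current = None          # name of the treatment currently being read
    in_description = False  # True between "Stems" and "Distribution and habitat"
    for line in lines:
        m = HEADING.match(line)
        if m:
            # A new treatment starts: remember its name
            current = m.group(1)
            in_description = False
        elif current in targets:
            if line.startswith('Stems'):
                in_description = True
                descriptions.setdefault(current, [])
            elif line.startswith('Distribution and habitat'):
                in_description = False
            if in_description:
                descriptions[current].append(line)
    return descriptions

# Invented placeholder lines that mimic the layout of a treatment
toy_lines = [
    '1. Calamus concolor  placeholder heading text',
    'placeholder type information',
    'Stems placeholder description, first line',
    'placeholder description, second line',
    'Distribution and habitat placeholder',
    '2. Calamus acaulis  placeholder heading text',
    'Stems placeholder description for a species we skip',
    'Distribution and habitat placeholder',
]

descriptions = collect_descriptions(toy_lines)
print(descriptions)
```

In the notebook, the same function could be called with `df_lines.line_cleaned.tolist()` in place of `toy_lines`. It is worth checking the results against the text file saved earlier, since descriptions that run across a page boundary may still contain boilerplate lines.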