# PDF to Text with Page Number Detection

**Running this notebook in a browser with Jupyter Notebook is recommended, since the user-input prompts work more reliably there.**

This notebook:

- converts a PDF to a text file
- detects page numbers printed in each page's header or footer
- inserts both the PDF page number and the printed page number (where found) at each page break in the text file

If more than one potential page number is detected on a page, the user is asked to confirm which one is correct. The printed page number (where found) and the indexed page number (the page's position in the PDF) are added after each page's text in the text file in the following format:

\~printed_page_number:[NUM HERE]\~ 

\~indexed_page_number:[NUM HERE]\~
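
To show how these markers can be read back out of the converted text, here is a minimal sketch. The function name `split_pages` and the regular expressions are illustrative and not part of this notebook; the sketch assumes the markers appear exactly as described above, with the printed-page marker (if present) directly before the indexed-page marker.

```python
import re

# Assumed marker patterns, based on the format described above.
PRINTED_RE = re.compile(r"~printed_page_number:([^~]*)~\s*\Z")
INDEXED_RE = re.compile(r"~indexed_page_number:(\d+)~")


def split_pages(text):
    """Split converted text into (indexed_number, printed_number, page_text) tuples.

    printed_number is None when no printed page number was recorded for the page.
    """
    pages = []
    start = 0
    for match in INDEXED_RE.finditer(text):
        # Everything since the previous indexed marker belongs to this page.
        chunk = text[start:match.start()]
        printed = PRINTED_RE.search(chunk)
        if printed:
            body = chunk[:printed.start()]
            printed_number = printed.group(1)
        else:
            body = chunk
            printed_number = None
        pages.append((int(match.group(1)), printed_number, body.strip()))
        start = match.end()
    return pages
```

Calling `split_pages` on the contents of the output text file would give one tuple per PDF page, which makes it straightforward to look up the printed page number for any passage.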

The PDFs are converted to text using the pdfplumber package: https://github.com/jsvine/pdfplumber#extracting-text. Follow its installation instructions before running this notebook. If you're running Jupyter Notebook in your browser, you can instead run the cell below to install it.

In [4]:
# run this cell to install pdfplumber if you're running Jupyter Notebook in a browser
# (re is part of the Python standard library, so it does not need to be installed with pip)
import sys
!{sys.executable} -m pip install pdfplumber
import re
import pdfplumber


In [8]:
# set the folder that contains the PDF (filepath) and the filename of the PDF you want to convert (PDFname) here

filepath = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Said_1979_Orientalism"
PDFname = "Said_1979_Orientalism.pdf"

# derive the project name from the PDF filename so the output file gets a matching name

textname = PDFname.replace('.pdf', '')
print(textname)


Said_1979_Orientalism


In [9]:
full_pdf_text = ""

with pdfplumber.open(f"{filepath}/{PDFname}") as pdf:
    for i, page in enumerate(pdf.pages):
        # guard against a None or empty result for pages with no extractable text
        full_page = page.extract_text() or ""
        full_pdf_text += full_page

        if len(full_page) != 0:
            lines = full_page.split("\n")
            first_line = lines[0] # where page numbers at the top of the page are likely to be found
            last_line = lines[-1] # where page numbers at the bottom of the page are likely to be found

            # raw strings keep \d and \s from being read as string escape sequences
            top_left_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})\s", first_line)
            top_right_page_matches = re.findall(r"\s([xXvViIlL]+|\d{1,3})$", first_line)
            bottom_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})$", last_line)

            all_matches = []
            if top_left_page_matches:
                all_matches.append(top_left_page_matches[0].strip())
            if top_right_page_matches:
                all_matches.append(top_right_page_matches[-1].strip())
            if bottom_page_matches:
                all_matches.append(bottom_page_matches[-1].strip())

            printed_page_number = None
            if len(all_matches) > 1: # more than one potential page number
                print(f'\n{len(all_matches)} potential page numbers were found for this page.')
                print(all_matches)
                print(f'Check PDF page {i+1} for the correct page number and enter it below. If the page has no printed page number, do not input anything.')
                correct_page_number = input()
                print(correct_page_number)
                if len(correct_page_number.strip()) != 0:
                    printed_page_number = correct_page_number.strip()
            elif len(all_matches) == 1:
                printed_page_number = all_matches[0]

            if printed_page_number is not None:
                full_pdf_text += f'\n\n~printed_page_number:{printed_page_number}~'

        # the indexed (PDF) page number is written for every page, including empty ones
        full_pdf_text += f'\n~indexed_page_number:{i+1}~\n\n'

# save output as a plain text file
with open(f"{filepath}/{textname}_plaintext.txt", "w", encoding="utf-8") as text_file:
    text_file.write(full_pdf_text)



2 potential page numbers were found for this page.
['42', 'I']
Check PDF page 51 for the correct page number and enter it below. If the page has no printed page number, do not input anything.
42
42

2 potential page numbers were found for this page.
['44', 'l']
Check PDF page 53 for the correct page number and enter it below. If the page has no printed page number, do not input anything.
44
44

2 potential page numbers were found for this page.
['100', 'i']
Check PDF page 109 for the correct page number and enter it below. If the page has no printed page number, do not input anything.
100
100

2 potential page numbers were found for this page.
['152', 'l']
Check PDF page 161 for the correct page number and enter it below. If the page has no printed page number, do not input anything.
152
152

2 potential page numbers were found for this page.
['176', 'I']
Check PDF page 185 for the correct page number and enter it below. If the page has no printed page number, do not input anything.
1
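
In the prompts shown above, each candidate list pairs a multi-digit number with a single stray letter (`I`, `l`, or `i`). One possible way to reduce the number of prompts, sketched below, is to accept a candidate automatically when it is the only all-digit one and to fall back to asking the user otherwise. The helper `pick_candidate` is hypothetical and not part of this notebook, and the heuristic is an assumption: on front-matter pages a Roman numeral may be the real page number, so the result should still be spot-checked.

```python
def pick_candidate(candidates):
    """Return the only all-digit candidate, or None if the choice is unclear.

    A None result means the caller should still ask the user.
    """
    numeric = [c for c in candidates if c.isdigit()]
    if len(numeric) == 1:
        return numeric[0]
    return None
```

In the conversion cell, this could be called on `all_matches` before `input()`, with the prompt shown only when it returns `None`.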

# Note

The output is saved to the folder set in `filepath`. If that folder is inside the GitHub repo (for example, its preprocessing subfolder), do not keep either the PDF or the text file there, since this is copyrighted material; move both back to our private shared Google Drive folder. Eventually someone should revise the code to do this automatically.
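
As a starting point for automating that move, here is a minimal sketch using the standard library. The function `move_outputs` and its parameters are hypothetical; the sketch assumes the PDF is named `<stem>.pdf` and the text file `<stem>_plaintext.txt`, matching the filenames used in this notebook, and that the private folder is available as a local path.

```python
import shutil
from pathlib import Path


def move_outputs(source_dir, private_dir, stem):
    """Move <stem>.pdf and <stem>_plaintext.txt from source_dir to private_dir.

    Returns the list of filenames that were actually moved.
    """
    source_dir = Path(source_dir)
    private_dir = Path(private_dir)
    private_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in (f"{stem}.pdf", f"{stem}_plaintext.txt"):
        src = source_dir / name
        if src.exists():  # skip files that are already gone
            shutil.move(str(src), str(private_dir / name))
            moved.append(name)
    return moved
```

It could be called at the end of the notebook as `move_outputs(filepath, private_folder, textname)`, where `private_folder` is a placeholder for the local path of the shared Google Drive folder.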