# Convert PDF to Markdown Sidecar File

## Summary

Read the PDF provided and extract its text. Turn the metadata into a YAML header; the body text becomes a Markdown text file. Name the output the same as the original, except with a `.md` extension.
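
A minimal sketch of the naming rule, assuming `pathlib` is acceptable here. The helper name `sidecar_path` is hypothetical and does not appear in the cells further down.

```python
from pathlib import Path


def sidecar_path(pdf_path, extension=".md"):
    """Return the sidecar path: same folder and base name, new extension."""
    return Path(pdf_path).with_suffix(extension)


# Same base name, different extension
print(sidecar_path("./test/202301191803-Easy-No-Yeast-Flatbread.pdf"))
print(sidecar_path("./test/202301191803-Easy-No-Yeast-Flatbread.pdf", ".yml"))
```

Because only the suffix changes, going from the sidecar back to the original is the same call with `".pdf"`.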

## Outline

1. Get the file name.
2. Check that the file exists.
3. Open the PDF file.
4. Create the output file next to the original.
5. Read the metadata and write it to the YAML header.
6. Read the OCR text and write it to the Markdown export.
7. Close the output file.
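
The outline could be sketched roughly as follows. This is an illustration only: `write_sidecar` is a hypothetical name, and the metadata dictionary and body text are assumed to come from extraction steps that are not shown. Values are written with `json.dumps`, which produces double-quoted strings that are also valid YAML scalars.

```python
import json
import os


def write_sidecar(pdf_path, metadata, body):
    """Write a Markdown sidecar next to pdf_path and return its path.

    metadata is assumed to be a dict of simple values; body is the
    extracted text. Both come from steps not shown in this sketch.
    """
    # Steps 1-2: take the file name and confirm the PDF exists
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)

    # Step 4: the output file sits next to the original
    out_path = os.path.splitext(pdf_path)[0] + ".md"

    with open(out_path, "w", encoding="utf-8") as out_file:
        # Step 5: metadata becomes the YAML header
        out_file.write("---\n")
        for key, value in metadata.items():
            out_file.write(f"{key}: {json.dumps(str(value), ensure_ascii=False)}\n")
        out_file.write("---\n\n")
        # Step 6: extracted text becomes the Markdown body
        out_file.write(body)
    # Step 7: leaving the with-block closes the output file
    return out_path
```

Step 3 (opening the PDF) is left out on purpose, since it depends on which extraction tool ends up being used.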

## Requirements
- If the PDF does not contain OCR text, use a tool to process it in the background, or provide an error message that flags the file for future processing. The script could create the sidecar with an error message or with a different extension.
- Extract metadata from the file's creation and modification dates, the PDF's own metadata, and other attributes. This needs a way to read the attributes baked into the PDF, plus a way to format YAML front matter. Build this table in a separate function (it may have utility value elsewhere).
- Use a standard extraction tool to get text out of the PDF file. (`pandoc` was the first idea, but it does not accept PDF as an input format, so a dedicated extractor such as `pdftotext` from Poppler may be needed.) Check the result to make sure it is only ASCII or UTF-8 text; the Markdown file should not contain special characters or garbage. If the extraction contains binary data, strip it but leave a warning in the header.
- Maintain the original file name in the front matter and in the sidecar name, so that a simple extension change leads from the sidecar to the original document and back. Keep extensions simple: `.yml` for YAML and `.md` for Markdown.
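
Two of these requirements can be sketched with the standard library alone. The function names `file_metadata` and `clean_text` are hypothetical, and the sketch rests on a few assumptions: only file-system attributes are collected (metadata embedded in the PDF would need a PDF library), the modification time stands in for the dates because a true creation time is not available on every platform, and "binary" is taken to mean undecodable bytes plus control characters, such as the form feed that OCR output may contain.

```python
import os
import re
from datetime import datetime, timezone


def file_metadata(path):
    """Collect file-system attributes for the front matter table."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "source": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),
    }


# Control characters other than tab, newline, and carriage return
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(raw):
    """Return (text, warning); warning is True if anything was stripped."""
    if isinstance(raw, bytes):
        # Drop byte sequences that are not valid UTF-8
        decoded = raw.decode("utf-8", errors="ignore")
        lossy = decoded.encode("utf-8") != raw
    else:
        decoded, lossy = raw, False
    cleaned = _CONTROL_CHARS.sub("", decoded)
    return cleaned, lossy or cleaned != decoded


text, warning = clean_text(b"Flat\x00bread \xff ok")
print(repr(text), warning)
```

The warning flag is what would end up in the YAML header when something had to be stripped.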

## Modules

Uses Python 3 or later.

- PIL
- pdf2image
- pytesseract

## Functions

### PDF OCR Conversion

Read a PDF file and turn it into a text file using Optical Character Recognition (OCR).

Requires:

- Python 3 or later, with:
    - PIL
    - pdf2image
    - pytesseract
- tesseract-ocr


In [10]:
import os
from PIL import Image
from pdf2image import convert_from_path
import pytesseract

Set a file name for testing. Later, the function will receive the file path from a caller, or process a whole directory.

In [11]:
filePath = './test/202301191803-Easy-No-Yeast-Flatbread.pdf'
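
For the directory case mentioned above, one possible approach is to list the PDFs that do not yet have a sidecar. This is only a sketch: `pdfs_missing_sidecar` is a hypothetical helper, it looks at a single folder without recursing, and the `*.pdf` pattern is case-sensitive on most non-Windows systems.

```python
from pathlib import Path


def pdfs_missing_sidecar(directory):
    """Return PDFs in directory that have no .md file beside them, sorted."""
    return sorted(
        pdf
        for pdf in Path(directory).glob("*.pdf")
        if not pdf.with_suffix(".md").exists()
    )


# Each returned path could then be handed to the conversion function
for pdf in pdfs_missing_sidecar("./test"):
    print(pdf)
```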

Convert the PDF pages to images, then split the path into folder, base name, and extension.

In [12]:
doc = convert_from_path(filePath)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

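Take each page as an image, then turn the image into text. Print each page number with its text. `convert_from_path` already returns PIL images, so wrapping a page in `Image.fromarray` fails with `AttributeError: __array_interface__`; the page is passed straight to `pytesseract` instead.

In [13]:
for page_number, page_data in enumerate(doc, start=1):
    # page_data is already a PIL image, so no conversion is needed
    txt = pytesseract.image_to_string(page_data)
    print("Page # {} - {}".format(page_number, txt))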

#### Reference

- Geeks for Geeks. Python | Reading contents of PDF using OCR (Optical Character Recognition). (2022 June 16). https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/
- Stack Overflow. Python - OCR - pytesseract for PDF. (Last accessed 14 February 2023). https://stackoverflow.com/questions/60754884/python-ocr-pytesseract-for-pdf#60755272 (Lambo)

In [17]:
import os
from PIL import Image
import pytesseract
import pdf2image

# Path to the PDF file
filePath = "./test/202301191803-Easy-No-Yeast-Flatbread.pdf"

# Convert PDF to images
pages = pdf2image.convert_from_path(filePath)

# Initialize an empty string for the final text output
text = ""

# Loop through each image and extract text using OCR
for page in pages:
    # Convert the image to grayscale
    img = page.convert('L')
    # Apply threshold to the image to remove noise
    threshold = 200
    img = img.point(lambda x: 0 if x < threshold else 255)
    # Extract text using OCR
    text += pytesseract.image_to_string(img)


Write the OCR text from the PDF to a text file. This works very well.

In [18]:
# Construct the output file path
outFilePath = os.path.splitext(filePath)[0] + '.txt'

# Write the extracted text to the output file
with open(outFilePath, 'w', encoding='utf-8') as outFile:
    outFile.write(text)

print(f"Extracted text written to {outFilePath}")

Extracted text written to ./test/202301191803-Easy-No-Yeast-Flatbread.txt


Convert the OCR-generated text into a Markdown file. The result is rough.

In [19]:
# Convert the extracted text to Markdown syntax
markdownText = ""
for line in text.split("\n"):
    if line.strip() == "":
        # Skip empty lines
        continue
    elif line.isupper():
        # Treat lines that are all uppercase as section headers
        markdownText += f"\n# {line.strip()}\n\n"
    elif line[0].isdigit():
        # Treat lines that start with a digit as subsection headers
        markdownText += f"\n## {line.strip()}\n\n"
    else:
        # Treat all other lines as body text
        markdownText += line.strip() + " "

# Construct the output file path
outFilePath = os.path.splitext(filePath)[0] + '.md'

# Write the extracted text in Markdown syntax to the output file
with open(outFilePath, 'w', encoding='utf-8') as outFile:
    outFile.write(markdownText)

print(f"Extracted text written to {outFilePath}")

Extracted text written to ./test/202301191803-Easy-No-Yeast-Flatbread.md


New try from scratch.

In [20]:
import os
import re
from PIL import Image
import pytesseract
import pdf2image

# Path to the PDF file
filePath = "./test/202301191803-Easy-No-Yeast-Flatbread.pdf"

# Convert PDF to images
pages = pdf2image.convert_from_path(filePath)

# Initialize a list to store the text blocks
textBlocks = []

# Loop through each image and extract text using OCR
for page in pages:
    # Convert the image to grayscale
    img = page.convert('L')
    # Apply threshold to the image to remove noise
    threshold = 200
    img = img.point(lambda x: 0 if x < threshold else 255)
    # Extract text using OCR
    text = pytesseract.image_to_string(img)
    # Add the text block to the list
    textBlocks.append(text)

# Combine the text blocks into a single string
text = "\n".join(textBlocks)

# Convert the extracted text to Markdown syntax
markdownText = ""
for line in text.split("\n"):
    if line.strip() == "":
        # Skip empty lines
        continue
    elif re.match(r"^[A-Z\s]+$", line):
        # Treat lines that are all uppercase as section headers
        markdownText += f"\n# {line.strip()}\n\n"
    elif re.match(r"^[A-Z][a-z\s]+$", line):
        # Treat lines of one capital letter followed only by lowercase letters and spaces as subsection headers
        markdownText += f"\n## {line.strip()}\n\n"
    else:
        # Treat all other lines as body text
        markdownText += line.strip() + " "

# Construct the output file path
outFilePath = os.path.splitext(filePath)[0] + '.md'

# Write the extracted text in Markdown syntax to the output file
with open(outFilePath, 'w', encoding='utf-8') as outFile:
    outFile.write(markdownText)

print(f"Extracted text written to {outFilePath}")

Extracted text written to ./test/202301191803-Easy-No-Yeast-Flatbread.md


Last chance for ChatGPT Feb 13 Version. 

In [22]:
import os
import re
from PIL import Image
import pytesseract
import pdf2image

# Path to the PDF file
filePath = "./test/202301191803-Easy-No-Yeast-Flatbread.pdf"

# Convert PDF to images
pages = pdf2image.convert_from_path(filePath)

# Initialize a list to store the text blocks
textBlocks = []

# Loop through each image and extract text using OCR
for page in pages:
    # Convert the image to grayscale
    img = page.convert('L')
    # Apply threshold to the image to remove noise
    threshold = 200
    img = img.point(lambda x: 0 if x < threshold else 255)
    # Extract text using OCR
    text = pytesseract.image_to_string(img)
    # Add the text block to the list
    textBlocks.append(text)

# Combine the text blocks into a single string
text = "\n".join(textBlocks)

# Convert the extracted text to Markdown syntax
markdownText = ""
for line in text.split("\n"):
    if line.strip() == "":
        # Skip empty lines
        continue
    elif re.match(r"^[A-Z\s]+$", line):
        # Treat lines that are all uppercase as section headers
        markdownText += f"\n# {line.strip()}\n\n"
    elif re.match(r"^[A-Z][a-z\s]+$", line):
        # Treat lines of one capital letter followed only by lowercase letters and spaces as subsection headers
        markdownText += f"\n## {line.strip()}\n\n"
    else:
        # Treat all other lines as body text
        stripped = line.strip()
        isNumbered = re.match(r"^\d+\.\s", stripped)
        isBullet = re.match(r"^[*-]\s", stripped)
        if isNumbered or isBullet:
            if markdownText and not markdownText.endswith("\n"):
                # Close the running paragraph before starting a list
                markdownText += "\n\n"
            if isNumbered:
                # The line already carries its own number, so keep it as-is
                markdownText += f"{stripped}\n"
            else:
                # Normalize bullet markers to "* "
                markdownText += f"* {stripped[2:].strip()}\n"
        else:
            if markdownText.endswith("\n") and not markdownText.endswith("\n\n"):
                # Leave a blank line between a list and the next paragraph
                markdownText += "\n"
            # Continue the current paragraph
            markdownText += f"{stripped} "

# Construct the output file path
outFilePath = os.path.splitext(filePath)[0] + '.md'

# Write the extracted text in Markdown syntax to the output file
with open(outFilePath, 'w', encoding='utf-8') as outFile:
    outFile.write(markdownText)

print(f"Extracted text written to {outFilePath}")


Extracted text written to ./test/202301191803-Easy-No-Yeast-Flatbread.md


NOTES: ChatGPT creates code similar to what was found online. It does a good job, enough to get a novice programmer on the right track.

- [ ] Wrap the code in a single function. [created::2023-02-14] The last version is the best.
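
One possible shape for that refactor, shown only for the text-to-Markdown step so it can be tried without a PDF or Tesseract. The function name `ocr_text_to_markdown` is hypothetical. It reuses the regular expressions from the last cell, with two assumptions of its own: numbered lines are kept as they are rather than prefixed again, and every block is separated by a blank line.

```python
import re


def ocr_text_to_markdown(text):
    """Convert OCR text to rough Markdown using simple line heuristics."""
    parts = []       # finished blocks: headers, list items, paragraphs
    paragraph = []   # lines of the paragraph currently being collected

    def flush():
        # Move the running paragraph, if any, into the finished blocks
        if paragraph:
            parts.append(" ".join(paragraph))
            paragraph.clear()

    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        if re.match(r"^[A-Z\s]+$", stripped):
            flush()
            parts.append(f"# {stripped}")    # all uppercase: section header
        elif re.match(r"^[A-Z][a-z\s]+$", stripped):
            flush()
            parts.append(f"## {stripped}")   # capitalized words: subsection
        elif re.match(r"^\d+\.\s", stripped):
            flush()
            parts.append(stripped)           # already a numbered list item
        elif re.match(r"^[*-]\s", stripped):
            flush()
            parts.append(f"* {stripped[2:].strip()}")  # bullet list item
        else:
            paragraph.append(stripped)       # continue the paragraph
    flush()
    return "\n\n".join(parts) + "\n"


sample = "FLATBREAD\nEasy recipe\n2 cups flour, sifted\nand a pinch of salt.\n1. Mix it.\n- Serve warm\n"
print(ocr_text_to_markdown(sample))
```

Keeping this step as a pure function of the text means the OCR loop and the file writing can each live in their own small function around it.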

/EOF/