<a href="https://colab.research.google.com/github/jorisschellekens/borb-google-colab-examples/blob/main/using_borb_to_create_an_e_book_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ![borb logo](https://github.com/jorisschellekens/borb/raw/master/logo/borb_64.png) Using `borb` to create an e-book PDF

[`borb`](https://github.com/jorisschellekens/borb) is a library for reading, creating and manipulating PDF files in Python. `borb` was created in 2020 by Joris Schellekens and is still in active development. Check out the [GitHub repository](https://github.com/jorisschellekens/borb) or the [borb website](https://borbpdf.com).

Let's start by installing `borb`, along with `unidecode`, which is used later to convert the text to ASCII.

In [55]:
pip install borb



In [56]:
pip install unidecode



With that out of the way, you can now copy the imports needed to create a basic PDF document.

In [57]:
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.canvas.layout.text.heading import Heading

from borb.pdf.canvas.color.color import HexColor, X11Color

import typing
import re
from decimal import Decimal

This is the part where it gets fun. You're now going to set up everything to be able to add content to your PDF.

In [58]:
# create empty Document
pdf = Document()

# create empty Page
page = Page()

# add Page to Document
pdf.append_page(page)

# create PageLayout
layout: PageLayout = SingleColumnLayout(page)

Not all fonts can handle all characters. Rather than dealing with this in a more elaborate way, here you'll use `unidecode`, which finds the nearest matching ASCII character for a given non-ASCII character.

In [59]:
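from unidecode import unidecode

def to_ascii(s: str) -> str:
  s_out: str = ""
  for c in s:
    # curly quotes (and 'â') become a plain double quote
    if c == '“' or c == '”' or c == 'â':
      s_out += '"'
    else:
      # let unidecode pick the nearest ASCII equivalent
      s_out += unidecode(c)
  return s_out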
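
If `unidecode` is not available, a rougher approximation can be sketched with the standard library alone. The helper name `to_ascii_stdlib` and the `QUOTE_MAP` table are assumptions made for illustration; they are not part of `borb` or `unidecode`. The sketch decomposes accented letters with `unicodedata` and drops whatever is still non-ASCII, so its output is cruder than what `unidecode` produces.

```python
import unicodedata

# Hypothetical stand-in for unidecode, using only the standard library.
# Typographic quotes have no Unicode decomposition, so they are mapped by hand.
QUOTE_MAP = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}

def to_ascii_stdlib(s: str) -> str:
    # replace typographic quotes with their plain ASCII counterparts
    s = "".join(QUOTE_MAP.get(c, c) for c in s)
    # split accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", s)
    # drop every character that is still outside the ASCII range
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "\u201cCaf\u00e9\u201d na\u00efve" is the text: curly-quoted Café, then naïve
print(to_ascii_stdlib("\u201cCaf\u00e9\u201d na\u00efve"))
```

Characters without an ASCII base letter are silently removed by this sketch, whereas `unidecode` tries to transliterate them.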

Now you can move on to the bulk processing of the text.
You'll start by downloading the full text from the Project Gutenberg website.

In [60]:
from unidecode import unidecode

# define which ebook to fetch
url = 'https://www.gutenberg.org/files/863/863-0.txt'

# download text
import requests
txt = requests.get(url).text
print("Downloaded %d bytes of text .." % len(txt))

# split to lines
lines_of_text: typing.List[str] = re.split('\r\n', txt)
lines_of_text = [to_ascii(x) for x in lines_of_text]

# debug
print("This ebook contains %d lines .. " % len(lines_of_text))

Downloaded 361353 bytes of text ..
This ebook contains 8892 lines .. 
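
Splitting on `'\r\n'` assumes the file uses Windows-style line endings throughout. As a small sketch on a made-up sample string (not the downloaded text), the comparison below shows how `str.splitlines()` also copes with bare `\n` endings:

```python
import re

# made-up sample mixing both kinds of line endings
sample = "Line one\r\nLine two\nLine three\r\n\r\nLine five"

# splitting on '\r\n' only leaves "Line two\nLine three" glued together
crlf_only = re.split("\r\n", sample)

# splitlines() breaks on '\r\n' as well as on a bare '\n'
any_ending = sample.splitlines()

print(len(crlf_only))   # 4
print(len(any_ending))  # 5
```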


You don't really need the first couple of lines of text. They're just the copyright headers Project Gutenberg puts on all these works.

In [61]:
# skip header
header_offset: int = 0
for i in range(0, len(lines_of_text)):
  if lines_of_text[i].startswith("*** START OF THE PROJECT GUTENBERG EBOOK"):
    header_offset = i + 1
    break
while lines_of_text[header_offset].isspace():
  header_offset += 1
lines_of_text = lines_of_text[header_offset:]
print("The first %d lines are the gutenberg header .." % header_offset)

The first 24 lines are the gutenberg header ..


Next, you'll trim the copyright/legal footer at the end of the text as well.

In [62]:
# skip footer
footer_offset: int = len(lines_of_text)
for i in range(0, len(lines_of_text)):
  if "*** END OF THE PROJECT GUTENBERG EBOOK" in lines_of_text[i]:
    footer_offset = i
    break
# count the footer lines before trimming the list
footer_length: int = len(lines_of_text) - footer_offset
lines_of_text = lines_of_text[0:footer_offset]
print("The last %d lines are the gutenberg footer .." % footer_length)
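
The header and footer trimming can also be sketched as a single function and tried on a tiny made-up sample, which makes the offsets easy to check by hand. The function name `strip_gutenberg_boilerplate` and the sample lines are assumptions for illustration; only the two marker strings come from the cells above.

```python
import typing

START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"

def strip_gutenberg_boilerplate(
    lines: typing.List[str],
) -> typing.Tuple[typing.List[str], int, int]:
    """Return (body, number of header lines, number of footer lines)."""
    start = 0
    end = len(lines)
    # the body starts on the line after the START marker
    for idx, line in enumerate(lines):
        if line.startswith(START_MARKER):
            start = idx + 1
            break
    # the body ends on the line before the END marker
    for idx in range(start, len(lines)):
        if END_MARKER in lines[idx]:
            end = idx
            break
    return lines[start:end], start, len(lines) - end

# hypothetical miniature "ebook"
sample = [
    "Some licence text",
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***",
    "CHAPTER I.",
    "",
    "It was a dark night.",
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***",
    "More licence text",
]
body, n_header, n_footer = strip_gutenberg_boilerplate(sample)
print(body)                # ['CHAPTER I.', '', 'It was a dark night.']
print(n_header, n_footer)  # 2 2
```

If either marker is missing, the sketch keeps the text from the start or through to the end, the same fallback the cells above use.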


With that out of the way, you can move on to processing the main body of text.

In [None]:
# main processing loop
i: int = 0
while i < len(lines_of_text):
  
  # process lines
  paragraph_text: str = ""
  while i < len(lines_of_text) and not len(lines_of_text[i]) == 0:
    paragraph_text += lines_of_text[i]
    paragraph_text += " "
    i += 1

  # empty
  if len(paragraph_text) == 0:
    i += 1
    continue

  # space
  if paragraph_text.isspace():
    i += 1
    continue

  # contains the word 'CHAPTER' more than twice (likely to be the table of contents)
  if sum([1 for x in paragraph_text.split(' ') if 'CHAPTER' in x]) > 2:
    i += 1
    continue

  # debug
  print("Processing line %d / %d" % (i, len(lines_of_text)))

  # outline
  if paragraph_text.startswith("CHAPTER"):
    print("Adding Header of %d bytes .." % len(paragraph_text))
    try:
      # start each chapter on a new page
      page = Page()
      pdf.append_page(page)
      layout = SingleColumnLayout(page)
      layout.add(Heading(paragraph_text, font_color=HexColor("13505B"), font_size=Decimal(20)))
    except Exception:
      # skip headings that borb fails to lay out
      pass
    continue

  # default
  try:
    layout.add(Paragraph(paragraph_text))
  except Exception:
    # skip paragraphs that borb fails to lay out
    pass

  # default behaviour
  i += 1
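
The grouping step of this loop (joining consecutive non-empty lines into one paragraph) can be sketched separately from the PDF layout, which makes it easy to test without `borb`. The generator name `iter_paragraphs` and the sample lines are assumptions for illustration. Unlike the loop above, this sketch strips each line and joins with single spaces, so paragraphs carry no trailing space.

```python
import typing

def iter_paragraphs(lines: typing.List[str]) -> typing.Iterator[str]:
    buffer: typing.List[str] = []
    for line in lines:
        if line.strip():
            # non-empty line: it belongs to the current paragraph
            buffer.append(line.strip())
        elif buffer:
            # blank line: the current paragraph is complete
            yield " ".join(buffer)
            buffer = []
    # emit the last paragraph if the text does not end with a blank line
    if buffer:
        yield " ".join(buffer)

# hypothetical sample lines
sample = ["CHAPTER I.", "", "It was a dark", "and stormy night.", "", "", "The end."]
for paragraph in iter_paragraphs(sample):
    kind = "heading" if paragraph.startswith("CHAPTER") else "paragraph"
    print(kind, "->", paragraph)
```

Each yielded string could then be passed to `Heading` or `Paragraph` exactly as in the loop above.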

Now, you can store the `Document` as a PDF using the `PDF.dumps` method.

In [65]:
with open("output.pdf", "wb") as pdf_file_handle:
  PDF.dumps(pdf_file_handle, pdf)
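
As a quick sanity check after writing the file, you can confirm that it begins with the `%PDF-` marker that opens every PDF file. The helper `looks_like_pdf` is a hypothetical name used for illustration; it only inspects the first five bytes and does not validate the rest of the document.

```python
import os

def looks_like_pdf(path: str) -> bool:
    # a PDF file begins with the marker "%PDF-" followed by a version number
    with open(path, "rb") as file_handle:
        return file_handle.read(5) == b"%PDF-"

# only run the check if the file from the previous cell exists
if os.path.exists("output.pdf"):
    print(looks_like_pdf("output.pdf"))
```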

That's it! You now have a PDF e-book. That's how easy it is to create a PDF using `borb`.