# doc2txt

> Extract text from EPUB, PDF, and DOCX files

In [1]:
# 👇 for colab only
# !pip install -r https://raw.githubusercontent.com/hoishing/doc2txt/main/requirements.txt
# !wget https://github.com/hoishing/doc2txt/raw/main/dummy.docx
# !wget https://github.com/hoishing/doc2txt/raw/main/dummy.epub
# !wget https://github.com/hoishing/doc2txt/raw/main/dummy.pdf

In [2]:
# utils
import re

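def remove_empty_rows(txt: str) -> str:
    """Collapse runs of three or more line breaks (whitespace allowed between them) into one blank line."""
    return re.sub(r'(\n\s*){3,}', r'\n\n', txt)

def remove_bad_characters(txt: str) -> str:
    """Delete every em dash together with the spaces and newlines around it."""
    pattern = r'[ \n]*—[ \n]*'
    return re.sub(pattern, '', txt)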
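
To make the behaviour of the two helpers concrete, here is a small sketch that runs them on made-up sample strings (the strings are illustrative and not taken from the dummy files). The helpers are repeated so the snippet runs on its own; the em dash is written as the escape `\u2014`, which `re` treats the same as the literal character used above.

```python
import re

def remove_empty_rows(txt: str) -> str:
    # 3+ line breaks (optionally separated by whitespace) become one blank line
    return re.sub(r'(\n\s*){3,}', r'\n\n', txt)

def remove_bad_characters(txt: str) -> str:
    # drop each em dash plus any spaces/newlines directly around it
    return re.sub(r'[ \n]*\u2014[ \n]*', '', txt)

# hypothetical sample input, for illustration only
messy = "Title\n\n\n\nBody\n \n\t\nEnd"
print(repr(remove_empty_rows(messy)))        # -> 'Title\n\nBody\n\nEnd'

# a single blank line (two line breaks) is left untouched
print(repr(remove_empty_rows("a\n\nb")))     # -> 'a\n\nb'

# the dash and its surrounding whitespace vanish, so the neighbours are joined
print(repr(remove_bad_characters("foo \u2014\n bar")))  # -> 'foobar'
```

Note that `remove_bad_characters` joins the words on either side of the dash without a space, which may or may not be what a given document needs.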

## docx

In [3]:
import docx2txt

txt = docx2txt.process("dummy.docx")
txt = remove_empty_rows(txt)
print(txt)

Lorem Ipsum

simply dummy text 

🤣 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent interdum ultrices nisi vel ullamcorper. Aliquam fermentum porttitor ligula in pharetra. Proin sed tristique sapien. Quisque tempor dictum nibh, in interdum dui gravida in. 

What is Lorem Ipsum?

Lorem Ipsum is simply dummy text of the printing and 

Lorem Ipsum has been the industry's standard dummy 

光嗎開計，場熱使造不品不智定內她充日如生分音我，思那上感認主學道，此盡上的畫身起他，血選兩造不新從謝三母時！立因打大事細團，力記廣進算期的，了老吃水年不。🥳


## pdf

In [4]:
from pypdf import PdfReader

reader = PdfReader("dummy.pdf")

txt_pages = [page.extract_text() for page in reader.pages]
txt = '\n'.join(txt_pages)
print(txt)

Lorem Ipsum  
simply dummy text  
 
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent interdum ultrices nisi vel 
ullamcorper. Aliquam fermentum porttitor ligula in pharetra. Proin sed tristique sapien. Quisque 
tempor dictum nibh, in interdum dui gravida in.  
 
What is Lorem Ipsum?  
 
• Lorem Ipsum is simply dummy text of the printing and  
• Lorem Ipsum has been the industry's standard dummy  
 
光嗎開計，場熱使造不品不智定內她充日如生分音我，思那上感認主學道，此盡上的畫身
起他，血選兩造不新從謝三母時！立因打大事細團，力記廣進算期的，了老吃水年不。
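
Unlike the docx output, the PDF text above keeps the page's hard line breaks, so sentences are split mid-paragraph. One possible post-processing step is sketched below; `unwrap_lines` and `BULLET` are hypothetical names that are not part of the notebook or of pypdf. The sketch assumes blank (or whitespace-only) lines separate paragraphs and that a line starting with a bullet character begins a new item.

```python
import re

# a line that starts a list item (assumed bullet characters)
BULLET = re.compile(r'^[•\-*]\s')

def unwrap_lines(txt: str) -> str:
    """Join hard-wrapped lines inside each paragraph; keep bullets on their own lines."""
    blocks = []
    # paragraphs are separated by lines that are empty or contain only spaces/tabs
    for block in re.split(r'\n[ \t]*\n', txt):
        items = []
        for raw in block.splitlines():
            line = raw.strip()
            if not line:
                continue
            if items and not BULLET.match(line):
                items[-1] += ' ' + line   # continuation of the previous line
            else:
                items.append(line)        # first line of the block, or a new bullet
        if items:
            blocks.append('\n'.join(items))
    return '\n\n'.join(blocks)

# hypothetical sample shaped like the PDF output
sample = (
    "Lorem ipsum dolor sit amet, consectetur \n"
    "adipiscing elit.  \n"
    " \n"
    "• first item  \n"
    "• second item  \n"
)
print(unwrap_lines(sample))
```

This is a heuristic with known limits: consecutive short lines such as a title and subtitle get merged, and joining with a space is not appropriate for Chinese text, where wrapped lines would need to be concatenated directly.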
            


## epub

In [5]:
import warnings

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

# suppress UserWarnings (e.g. from ebooklib) so they do not clutter the output
warnings.simplefilter(action='ignore', category=UserWarning)

In [6]:
bk = epub.read_epub('dummy.epub')

# extract HTML from ITEM_DOCUMENT
htmls = [
    item.get_body_content().decode('utf-8')
    for item in bk.get_items()
    if item.get_type() == ebooklib.ITEM_DOCUMENT
]
html = ''.join(htmls)

# parse html to txt
txt = BeautifulSoup(html, 'html.parser').get_text()
txt = remove_empty_rows(txt)
print(txt)



Lorem Ipsum

simply dummy text 

🤣 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent interdum ultrices nisi vel ullamcorper. Aliquam fermentum porttitor ligula in pharetra. Proin sed tristique sapien. Quisque tempor dictum nibh, in interdum dui gravida in. 

What is Lorem Ipsum?

Lorem Ipsum is simply dummy text of the printing and 

Lorem Ipsum has been the industry's standard dummy 

光嗎開計，場熱使造不品不智定內她充日如生分音我，思那上感認主學道，此盡上的畫身起他，血選兩造不新從謝三母時！立因打大事細團，力記廣進算期的，了老吃水年不。
🥳
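
The three cells above can be combined into a single entry point that picks an extractor from the file extension. The following is a minimal sketch, not part of the original notebook: `extract_text`, `EXTRACTORS`, and the `*_to_text` functions are hypothetical names, while the library calls are the same ones used in the cells above. Imports are done lazily inside each function so that only the library for the requested format needs to be installed.

```python
from pathlib import Path

def docx_to_text(path: str) -> str:
    import docx2txt  # imported lazily, only needed for .docx
    return docx2txt.process(path)

def pdf_to_text(path: str) -> str:
    from pypdf import PdfReader
    reader = PdfReader(path)
    # one string per page, joined with a line break
    return '\n'.join(page.extract_text() for page in reader.pages)

def epub_to_text(path: str) -> str:
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup
    book = epub.read_epub(path)
    html = ''.join(
        item.get_body_content().decode('utf-8')
        for item in book.get_items()
        if item.get_type() == ebooklib.ITEM_DOCUMENT
    )
    return BeautifulSoup(html, 'html.parser').get_text()

# hypothetical registry: file suffix -> extractor function
EXTRACTORS = {
    '.docx': docx_to_text,
    '.pdf': pdf_to_text,
    '.epub': epub_to_text,
}

def extract_text(path: str) -> str:
    """Dispatch to the extractor that matches the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f'unsupported file type: {suffix!r}')
    return EXTRACTORS[suffix](path)
```

With the dummy files in the working directory, `extract_text("dummy.pdf")` would return the same text as the pdf cell; the result can then be passed through `remove_empty_rows` as in the docx and epub cells.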


