# RecDP LLM - Document Extract

The standard input for LLM pretraining or fine-tuning is a folder of files, each containing multiple samples. Each sample is one row in JSON or tabular format.

The document extractor converts text, images, PDFs, and Word documents to JSONL files, which can then be used for LLM data processing.

Output format:

| text                | meta                              |
| ------------------- | --------------------------------- |
| This is a cool tool | {'source': 'dummy', 'lang': 'en'} |
| llm is fun          | {'source': 'dummy', 'lang': 'en'} |
| ...                 | {'source': 'dummy', 'lang': 'en'} |
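
As a rough illustration of this layout, the sketch below writes two records shaped like the table rows to a JSONL file and reads them back, using only the Python standard library. The helper names `write_jsonl` and `read_jsonl` and the file name `demo.jsonl` are assumptions made for this example; they are not part of pyrecdp.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not a pyrecdp API)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Two samples mirroring the rows of the table above.
samples = [
    {"text": "This is a cool tool", "meta": {"source": "dummy", "lang": "en"}},
    {"text": "llm is fun", "meta": {"source": "dummy", "lang": "en"}},
]

# Write to a temporary directory so the example leaves no files behind in the working tree.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
write_jsonl(samples, demo_path)
loaded = read_jsonl(demo_path)
print(len(loaded), loaded[0]["text"])
```

Keeping one record per line means a large file can be streamed line by line instead of being loaded in one piece.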

Supported input types:
* image (png, jpg)
* pdf
* docx
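
Before converting a mixed folder, it can help to see which files fall under each supported type. The sketch below groups file names by extension. The extension-to-type mapping and the helper `group_by_type` are assumptions for illustration (Word documents are assumed to be `.docx`, matching the sample file used later), so check what your pyrecdp version actually accepts.

```python
from collections import defaultdict
from pathlib import Path

# Assumed mapping from file extension to the input types listed above.
# This is illustrative only and is not read from pyrecdp.
EXTENSION_TO_TYPE = {
    ".png": "image",
    ".jpg": "image",
    ".pdf": "pdf",
    ".docx": "docx",
}


def group_by_type(paths):
    """Group paths by input type; unknown extensions go under 'unsupported'."""
    groups = defaultdict(list)
    for p in paths:
        kind = EXTENSION_TO_TYPE.get(Path(p).suffix.lower(), "unsupported")
        groups[kind].append(str(p))
    return dict(groups)


# The last name is a made-up file with an extension outside the list.
files = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
    "notes.xyz",
]
print(group_by_type(files))
```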

# Get started

## 1. Install pyrecdp and dependencies

In [22]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
%mkdir -p /content/doc_jsonl
file_names = ['english-and-korean.png', 'handbook-872p.docx', 'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/{i}" for i in file_names]
!wget -P /content/test_data/document/ {" ".join(file_list)}
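
After the download, a quick check such as the following can confirm that every sample file is present and non-empty before conversion starts. The helper `find_missing` is a hypothetical name introduced for this sketch; the directory and file names are the ones used in the cell above.

```python
import os


def find_missing(directory, names):
    """Return the names that are absent or empty in `directory`."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


expected = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
]
missing = find_missing("/content/test_data/document/", expected)
print("missing files:", missing or "none")
```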

## 3. Convert data

#### 3.1 Convert PDF

In [15]:
from pyrecdp.primitives.llmutils.document_extractor import pdf_to_text

file_name = "layout-parser-paper.pdf"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
pdf_to_text(in_file, out_file)

! head {out_file}

{'text': 'LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines l

#### 3.2 Convert docx

In [16]:
from pyrecdp.primitives.llmutils.document_extractor import docx_to_text

file_name = "handbook-872p.docx"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
docx_to_text(in_file, out_file)

! head {out_file}

{'text': 'U.S. Department of Justice', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'Executive Office for United States Trustees\t\t         \t\t\t\t   ', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'Handbook for', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'Chapter 13 ', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'Standing Trustees', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'October 1, 2012 ', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'Table of Contents', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'CHAPTER 1 - INTRODUCTION', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'A. PURPOSE1-1', 'metadata': {'source': 'handbook-872p.docx'}}
{'text': 'B. ROLE OF UNITED STATES TRUSTEE1-1', 'metadata': {'source': 'handbook-872p.docx'}}


#### 3.3 Convert images

In [23]:
! pip install -q pytesseract
! apt-get -qq install tesseract-ocr


In [24]:
from pyrecdp.primitives.llmutils.document_extractor import image_to_text

file_name = "layout-parser-paper-10p.jpg"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
image_to_text(in_file, out_file)

! head {out_file}

{'text': '2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\nLayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\n\nZejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain\nLee*, Jacob Carlson’, and Weining Li®\n\n1 Allen Institute for AI\nshannons@allenai.org\n? Brown University\nruochen_zhang@brown.edu\n° Harvard University\n{melissadell, jacob_carlson}@fas.harvard.edu\n* University of Washington\nbeg1@cs.washington. edu\n© University of Waterloo\nw4221i@uwaterloo.ca\n\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplif

#### 3.4 Convert an entire directory

In [27]:
from pyrecdp.primitives.llmutils.document_extractor import document_to_text

in_file = "/content/test_data/document/"
out_file = "/content/doc_jsonl/" + "document.jsonl"
document_to_text(input_dir = in_file, output_file = out_file)

! head {out_file}

Document converter with args [input_dir= /content/test_data/document/, glob=**/*.*, input_files=None] started ...


100%|██████████| 4/4 [02:37<00:00, 39.39s/it]

Document converter with args [input_dir= /content/test_data/document/, glob=**/*.*, input_files=None] took 157.5721603879997 sec
{'text': 'RULES AND INSTRUCTIONS\n\n1. Template for day 1 (korean) , for day 2\n(English) for day 3 both English and korean.\n2. Use all your accounts. use different emails\nto send. Its better to have many email\naccounts.\n\nNote: Remember to write your own "OPENING\nMESSAGE" before you copy and paste the\ntemplate. please always include [TREASURE\nHARUTO] for example:\n\nStS ofAl2, AS|E YGEAS TB TREASUREWH\nHARUTOM| 2] PHEYLICH. BHO AY, HARUTO M BE\n= WSO Hot Wat SSBstS LRU, O| Gil\nBS SH AS2 ASS AZOpO} HAS] TAI St a\nBat FSAel SHS HS + U7/S LICH.\n\n3. CC Harutonations@gmail.com so we can\nkeep track of how many emails were\nsuccessfully sent\n\n4. Use the hashtag of Haruto on your tweet to\nshow that vou have sent your email]\n\x0c', 'metadata': {'source': '/content/test_data/document/english-and-korean.png'}}
{'text': 'U.S. Department of Justice', 'meta
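
To see how many records each input file contributed to the combined output, the records can be counted by their `metadata` source. The sketch below is one way to do that with the standard library. Because the output above is displayed as Python-style dicts with single quotes, the hypothetical helper `parse_record` tries strict JSON first and falls back to `ast.literal_eval`. That fallback is an assumption based on the displayed output, not on documented pyrecdp behaviour.

```python
import ast
import json
from collections import Counter


def parse_record(line):
    """Parse one output line; accept JSON or a Python-style dict literal."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Single-quoted dicts are not valid JSON, so evaluate them as literals.
        return ast.literal_eval(line)


def count_by_source(lines):
    """Count records per metadata source, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = parse_record(line)
        counts[record.get("metadata", {}).get("source", "unknown")] += 1
    return counts


# Two demo lines in the style of the output shown above (one per quoting style).
demo_lines = [
    "{'text': 'U.S. Department of Justice', 'metadata': {'source': 'handbook-872p.docx'}}",
    '{"text": "Handbook for", "metadata": {"source": "handbook-872p.docx"}}',
]
print(count_by_source(demo_lines))
```

For a real run, pass an open file handle for the output file to `count_by_source` instead of `demo_lines`.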


