# RecDP LLM - Document Extract

The standard input for LLM pretraining and fine-tuning is a folder of files, each containing multiple samples. Each sample is one row in JSON or tabular format.

The document extractor converts text, images, PDFs, and DOCX documents into JSONL files, which are then used for LLM data processing.

Output format:

| text                | meta                              |
| ------------------- | --------------------------------- |
| This is a cool tool | {'source': 'dummy', 'lang': 'en'} |
| llm is fun          | {'source': 'dummy', 'lang': 'en'} |
| ...                 | {'source': 'dummy', 'lang': 'en'} |
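
As a minimal sketch of this layout, assuming the files are JSON Lines (one JSON object per line), the rows above could be written and read back as follows. The file name `samples.jsonl` and the helpers `write_jsonl` and `read_jsonl` are hypothetical and not part of pyrecdp.

```python
import json
import os
import tempfile

# Rows from the table above; the values are illustrative only.
samples = [
    {"text": "This is a cool tool", "meta": {"source": "dummy", "lang": "en"}},
    {"text": "llm is fun", "meta": {"source": "dummy", "lang": "en"}},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "samples.jsonl")
write_jsonl(path, samples)
loaded = read_jsonl(path)
print(loaded[0]["text"])
```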

Supported input types:
* image (png, jpg, jpeg)
* pdf
* docx
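
The sketch below shows one way these input types could be routed to the converter functions used later in this notebook (`pdf_to_text`, `docx_to_text`, `image_to_text`). The mapping table and the `pick_converter` helper are illustrative assumptions, not a pyrecdp API; the function returns only the converter's name so the example runs without pyrecdp installed.

```python
from pathlib import Path

# Extension -> name of the converter function used later in this notebook.
# The mapping itself is an illustrative assumption, not a pyrecdp API.
CONVERTER_BY_EXT = {
    ".pdf": "pdf_to_text",
    ".docx": "docx_to_text",
    ".png": "image_to_text",
    ".jpg": "image_to_text",
    ".jpeg": "image_to_text",
}

def pick_converter(file_path):
    """Return the converter name for a file, or None if the type is unsupported."""
    return CONVERTER_BY_EXT.get(Path(file_path).suffix.lower())

print(pick_converter("layout-parser-paper.pdf"))
```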

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [8]:
%mkdir -p /content/test_data
%cd /content/test_data
%mkdir -p /content/doc_jsonl
file_names = ['english-and-korean.png', 'handbook-872p.docx', 'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/{i}" for i in file_names]
!wget -P /content/test_data/document/ {" ".join(file_list)}

/content/test_data
--2023-10-20 10:37:33--  https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/english-and-korean.png
Resolving child-prc.intel.com (child-prc.intel.com)... 10.239.120.56
Connecting to child-prc.intel.com (child-prc.intel.com)|10.239.120.56|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 305401 (298K) [image/png]
Saving to: ‘/content/test_data/document/english-and-korean.png.7’


2023-10-20 10:37:34 (361 KB/s) - ‘/content/test_data/document/english-and-korean.png.7’ saved [305401/305401]

--2023-10-20 10:37:34--  https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/handbook-872p.docx
Reusing existing connection to raw.githubusercontent.com:443.
Proxy request sent, awaiting response... 200 OK
Length: 482883 (472K) [application/octet-stream]
Saving to: ‘/content/test_data/document/handbook-872p.docx.7’


2023-10-20 10:37:35 (1.13 MB/s) - ‘/content/test_data/document/handbook
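
The `.7` suffix in the log above suggests that the download cell had been run several times, with wget keeping a numbered copy on each run. A minimal sketch, assuming you want to fetch only the files that are not yet present; `missing_files` is a hypothetical helper and not part of pyrecdp.

```python
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/"

def missing_files(directory, names):
    """Return the names that do not yet exist in `directory`."""
    root = Path(directory)
    return [n for n in names if not (root / n).exists()]

file_names = ['english-and-korean.png', 'handbook-872p.docx',
              'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']

# Build download URLs only for files that are still missing locally.
to_fetch = [BASE_URL + n for n in missing_files("/content/test_data/document", file_names)]
print(to_fetch)
```

The resulting list can be passed to `wget` in place of `file_list`. Alternatively, wget's `-nc` (`--no-clobber`) option skips files that already exist.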

## 3. Convert data

#### 3.1 Convert PDF

In [9]:
from pyrecdp.primitives.llmutils.document_extractor import pdf_to_text

file_name = "layout-parser-paper.pdf"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
pdf_to_text(in_file, out_file)

! head {out_file}

Document extract for '/content/test_data/document/layout-parser-paper.pdf' with [glob=**/*.pdf, required_exts=None, recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:00<00:00,  3.06it/s]

Document extract for '/content/test_data/document/layout-parser-paper.pdf' with [glob=**/*.pdf, required_exts=None, recursive=False, multithread=False] took 0.33163048420101404 sec
{'text': 'LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reu
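
The record printed above uses single quotes, which looks like a Python dict literal rather than strict JSON. A minimal sketch for loading such a file, assuming each line is either valid JSON or a Python literal; `parse_record` and `load_records` are hypothetical helpers and not part of pyrecdp.

```python
import ast
import json

def parse_record(line):
    """Parse one output line as JSON, falling back to a Python literal."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # ast.literal_eval only evaluates literals, never arbitrary code.
        return ast.literal_eval(line)

def load_records(path):
    """Load every non-empty line of an extracted file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [parse_record(line) for line in f if line.strip()]

# Two spellings of the same illustrative record.
print(parse_record('{"text": "llm is fun"}'))
print(parse_record("{'text': 'llm is fun'}"))
```

Under these assumptions, `load_records(out_file)` would return one dict per extracted document.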




#### 3.2 Convert DOCX

In [10]:
from pyrecdp.primitives.llmutils.document_extractor import docx_to_text

file_name = "handbook-872p.docx"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
docx_to_text(in_file, out_file)

! head {out_file}

Document extract for '/content/test_data/document/handbook-872p.docx' with [glob=**/*.docx, required_exts=None, recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:00<00:00,  1.12it/s]

Document extract for '/content/test_data/document/handbook-872p.docx' with [glob=**/*.docx, required_exts=None, recursive=False, multithread=False] took 0.896975084207952 sec
{'text': 'U.S. Department of Justice\nExecutive Office for United States Trustees\t\t         \t\t\t\t   \nHandbook for\nChapter 13 \nStanding Trustees\nOctober 1, 2012 \nTable of Contents\nCHAPTER 1 - INTRODUCTION\nA. PURPOSE1-1\nB. ROLE OF UNITED STATES TRUSTEE1-1\nC. STATUTORY DUTIES OF A STANDING TRUSTEE1-2\nD. STANDING TRUSTEE PLEDGE OF EXCELLENCE1-4\nCHAPTER 2 – APPOINTMENT, QUALIFICATIONS, PERCENTAGE FEE AND COMPENSATION OF THE STANDING TRUSTEE\nA. ELIBILILITY2-1\n    B. RECRUITMENT AND SELECTION2-1\n    C. STANDING TRUSTEE COMPENSATION AND BENEFITS2-2\n    D. CALCULATION AND COLLECTION OF PERCENTAGE FEE2-3\n    E. SURETY BONDS2-4\nCHAPTER 3 – ADMINISTRATION OF CHAPTER 13 CASES\n   A. INITIAL REVIEW OF CHAPTER 13 CASES3-1\n        1. CONFLICTS OF INTEREST3-1\n        2. INITIAL REVIEW OF SCHEDULES AND PETIT




tion of the standing trustee’s office, including charges for preparation of payroll, payroll taxes, annual reports, and reconciliation of bank accounts. \nd.\tComputer Services: This line item should include charges for software, data conversion, related consulting.  All computer related training, unless conducted in the standing trustee’s office as part of a conversion, should be itemized under non-UST training.\ne.\tAudit Services: This line item should include charges incurred for the services of any independent audit firms. Each standing trustee will have at least one audit per year.\nf.\tConsulting Services: This line item includes charges incurred under contract with individuals for services by attorneys, appraisers, and other professionals. Accountants should be itemized under Bookkeeping/Accounting Services.  Computer related consulting should be itemized under Computer Services.  Each consultant and area of expertise must be specifically identified in the budget.  \nChapter 3,

#### 3.3 Convert images

In [11]:
from pyrecdp.primitives.llmutils.document_extractor import image_to_text

file_name = "layout-parser-paper-10p.jpg"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
image_to_text(in_file, out_file)

! head {out_file}

Document extract for '/content/test_data/document/layout-parser-paper-10p.jpg' with [glob=**/*.*, required_exts=['.jpeg', '.jpg', '.png'], recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:32<00:00, 32.06s/it]

Document extract for '/content/test_data/document/layout-parser-paper-10p.jpg' with [glob=**/*.*, required_exts=['.jpeg', '.jpg', '.png'], recursive=False, multithread=False] took 32.0684085926041 sec
{'text': '2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\nLayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\n\nZejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain\nLee*, Jacob Carlson’, and Weining Li®\n\n1 Allen Institute for AI\nshannons@allenai.org\n? Brown University\nruochen_zhang@brown.edu\n° Harvard University\n{melissadell, jacob_carlson}@fas.harvard.edu\n* University of Washington\nbeg1@cs.washington. edu\n© University of Waterloo\nw4221i@uwaterloo.ca\n\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organize




#### 3.4 Convert an entire directory

In [12]:
from pyrecdp.primitives.llmutils.document_extractor import document_to_text

in_file = "/content/test_data/document/"
out_file = "/content/doc_jsonl/" + "document.jsonl"
# Convert all files in the directory, using multiple threads
document_to_text(in_file, out_file, use_multithreading=True)

! head {out_file}

Document extract for '/content/test_data/document/' with [glob=**/*.*, required_exts=None, recursive=False, multithread=True] started ...


100%|██████████| 32/32 [00:32<00:00,  1.02s/it]


Document extract for '/content/test_data/document/' with [glob=**/*.*, required_exts=None, recursive=False, multithread=True] took 32.658647892065346 sec
{'text': 'RULES AND INSTRUCTIONS\n\n1. Template for day 1 (korean) , for day 2\n(English) for day 3 both English and korean.\n2. Use all your accounts. use different emails\nto send. Its better to have many email\naccounts.\n\nNote: Remember to write your own "OPENING\nMESSAGE" before you copy and paste the\ntemplate. please always include [TREASURE\nHARUTO] for example:\n\nStS ofAl2, AS|E YGEAS TB TREASUREWH\nHARUTOM| 2] PHEYLICH. BHO AY, HARUTO M BE\n= WSO Hot Wat SSBstS LRU, O| Gil\nBS SH AS2 ASS AZOpO} HAS] TAI St a\nBat FSAel SHS HS + U7/S LICH.\n\n3. CC Harutonations@gmail.com so we can\nkeep track of how many emails were\nsuccessfully sent\n\n4. Use the hashtag of Haruto on your tweet to\nshow that vou have sent your email]\n\x0c', 'metadata': {'source': '/content/test_data/document/english-and-korean.png'}}
{'text': 'U.S. Depa
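
Each record from the directory run carries the path of its source file under `metadata['source']`. A minimal sketch, assuming every record has that field, for counting how many records came from each file type. `source_extension` and `count_by_extension` are hypothetical helpers, and the records below are shortened, illustrative stand-ins for the real output.

```python
from collections import Counter
from pathlib import Path

def source_extension(source):
    """Return the file extension, ignoring a trailing numeric suffix such as '.7'."""
    suffixes = [s.lower() for s in Path(source).suffixes]
    while suffixes and suffixes[-1][1:].isdigit():
        suffixes.pop()
    return suffixes[-1] if suffixes else ""

def count_by_extension(records):
    """Count records per source file extension using metadata['source']."""
    return Counter(source_extension(r["metadata"]["source"]) for r in records)

# Illustrative records shaped like the output above (text shortened).
records = [
    {"text": "RULES AND INSTRUCTIONS",
     "metadata": {"source": "/content/test_data/document/english-and-korean.png"}},
    {"text": "U.S. Depa",
     "metadata": {"source": "/content/test_data/document/handbook-872p.docx.7"}},
]
print(count_by_extension(records))
```

Stripping the numeric suffix groups wget's numbered duplicates (for example `handbook-872p.docx.7`) with their original file type.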