# RecDP LLM - Document Extract

The standard input for LLM pretraining or fine-tuning is a folder of files, each containing multiple samples. Each sample is one row in JSON or tabular format.

The document extractor converts text, images, PDFs, and Word documents to JSONL files, which are then used for LLM data processing.

Output format:

| text                | meta                              |
| ------------------- | --------------------------------- |
| This is a cool tool | {'source': 'dummy', 'lang': 'en'} |
| llm is fun          | {'source': 'dummy', 'lang': 'en'} |
| ...                 | {'source': 'dummy', 'lang': 'en'} |
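
To make the row layout concrete, here is a minimal sketch of how such samples look on disk as JSONL, using only the Python standard library. The `write_jsonl` and `read_jsonl` helpers and the file name `samples.jsonl` are illustrative assumptions, not part of pyrecdp, and the records simply mirror the table above.

```python
import json
import os
import tempfile

# Two illustrative records that mirror the table above (values are made up).
samples = [
    {"text": "This is a cool tool", "meta": {"source": "dummy", "lang": "en"}},
    {"text": "llm is fun", "meta": {"source": "dummy", "lang": "en"}},
]


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip through a temporary file to show the on-disk layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "samples.jsonl")
    write_jsonl(samples, path)
    loaded = read_jsonl(path)
```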

Supported input types:
* image (png, jpg)
* pdf
* docx
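
As a rough illustration of how these input types relate to the extractor functions used later in this notebook, the sketch below maps a file extension to a function name. The lookup table and the `pick_extractor` helper are hypothetical and only return names as strings; `document_to_text`, shown in section 3.4, takes a whole directory directly, so this mapping is not needed in practice.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> name of the extractor function
# used later in this notebook. Only the function names come from the notebook.
EXTRACTOR_BY_EXTENSION = {
    ".pdf": "pdf_to_text",
    ".docx": "docx_to_text",
    ".png": "image_to_text",
    ".jpg": "image_to_text",
    ".jpeg": "image_to_text",
}


def pick_extractor(file_name):
    """Return the extractor name for a file, or None if the type is unsupported."""
    return EXTRACTOR_BY_EXTENSION.get(Path(file_name).suffix.lower())
```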

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
%mkdir -p /content/doc_jsonl
file_names = ['english-and-korean.png', 'handbook-872p.docx', 'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/{i}" for i in file_names]
!wget -P /content/test_data/document/ {" ".join(file_list)}
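
Outside a notebook, where `!wget` is not available, the same download could be sketched with the standard library. The `build_urls` and `download_all` helpers are assumptions for illustration; the base URL and file names are the ones used in the cell above. Unlike `wget`, this sketch skips files that already exist in the target directory.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/"
FILE_NAMES = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
]


def build_urls(file_names, base_url=BASE_URL):
    """Build the full download URL for each file name (hypothetical helper)."""
    return [base_url + name for name in file_names]


def download_all(file_names, target_dir):
    """Download each file into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in zip(file_names, build_urls(file_names)):
        destination = target / name
        if not destination.exists():
            urllib.request.urlretrieve(url, str(destination))


# Example call (needs network access), left commented out:
# download_all(FILE_NAMES, "/content/test_data/document/")
```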

## 3. Convert data

#### 3.1 Convert PDF

In [15]:
from pyrecdp.primitives.llmutils.document_extractor import pdf_to_text
import pandas as pd

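file_name = "layout-parser-paper.pdf"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
# Extract the PDF text and write it to out_file as JSON lines.
pdf_to_text(in_file, out_file)
# Load the result to inspect the extracted text and metadata.
display(pd.read_json(out_file, lines=True))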

 

Document extract for '/content/test_data/document/layout-parser-paper.pdf' with [glob=**/*.pdf, required_exts=None, recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:00<00:00,  3.16it/s]

Document extract for '/content/test_data/document/layout-parser-paper.pdf' with [glob=**/*.pdf, required_exts=None, recursive=False, multithread=False] took 0.32126119174063206 sec





|   | text | metadata |
| - | ---- | -------- |
| 0 | LayoutParser : A Uniﬁed Toolkit for Deep\nLear... | {'source': '/content/test_data/document/layout... |


#### 3.2 Convert docx

In [16]:
from pyrecdp.primitives.llmutils.document_extractor import docx_to_text
import pandas as pd

file_name = "handbook-872p.docx"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
docx_to_text(in_file, out_file)

display(pd.read_json(out_file, lines=True))

Document extract for '/content/test_data/document/handbook-872p.docx' with [glob=**/*.docx, required_exts=None, recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:00<00:00,  1.94it/s]

Document extract for '/content/test_data/document/handbook-872p.docx' with [glob=**/*.docx, required_exts=None, recursive=False, multithread=False] took 0.520426164381206 sec





|   | text | metadata |
| - | ---- | -------- |
| 0 | U.S. Department of Justice\nExecutive Office f... | {'source': '/content/test_data/document/handbo... |


#### 3.3 Convert images

In [17]:
from pyrecdp.primitives.llmutils.document_extractor import image_to_text
import pandas as pd

file_name = "layout-parser-paper-10p.jpg"
in_file = "/content/test_data/document/" + file_name
out_file = "/content/doc_jsonl/" + file_name + ".jsonl"
image_to_text(in_file, out_file)

display(pd.read_json(out_file, lines=True))

Document extract for '/content/test_data/document/layout-parser-paper-10p.jpg' with [glob=**/*.*, required_exts=['.jpeg', '.jpg', '.png'], recursive=False, multithread=False] started ...


100%|██████████| 1/1 [00:31<00:00, 31.67s/it]

Document extract for '/content/test_data/document/layout-parser-paper-10p.jpg' with [glob=**/*.*, required_exts=['.jpeg', '.jpg', '.png'], recursive=False, multithread=False] took 31.676521027460694 sec





|   | text | metadata |
| - | ---- | -------- |
| 0 | 2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\nL... | {'source': '/content/test_data/document/layout... |


#### 3.4 Convert an entire directory

In [14]:
from pyrecdp.primitives.llmutils.document_extractor import document_to_text
import pandas as pd
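in_file = "/content/test_data/document/"
# The output is read back with lines=True, so it is named as a JSON lines file.
out_file = "/content/doc_jsonl/" + "document.jsonl"
# Convert every supported file in the directory, using multiple threads.
document_to_text(in_file, out_file, use_multithreading=True)
display(pd.read_json(out_file, lines=True))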

Document extract for '/content/test_data/document/' with [glob=**/*.*, required_exts=None, recursive=False, multithread=True] started ...


100%|██████████| 40/40 [00:32<00:00,  1.22it/s]

Document extract for '/content/test_data/document/' with [glob=**/*.*, required_exts=None, recursive=False, multithread=True] took 32.91086395457387 sec





|   | text | metadata |
| - | ---- | -------- |
| 0 | RULES AND INSTRUCTIONS\n\n1. Template for day ... | {'source': '/content/test_data/document/englis... |
| 1 | U.S. Department of Justice\nExecutive Office f... | {'source': '/content/test_data/document/handbo... |
| 2 | 2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\nL... | {'source': '/content/test_data/document/layout... |
| 3 | LayoutParser : A Uniﬁed Toolkit for Deep\nLear... | {'source': '/content/test_data/document/layout... |
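
Before passing the combined file to later processing steps, it can help to check that every record has the expected fields. The sketch below is one possible check using only the standard library; the `summarize_jsonl` helper is hypothetical, and it assumes each line is a JSON object with `text` and `metadata` keys, where `metadata` is a dict that may carry a `source` entry, as the output above suggests. The example lines are made up.

```python
import json


def summarize_jsonl(lines):
    """Count records and collect the distinct 'source' values from metadata."""
    count = 0
    sources = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumption: every record carries 'text' and 'metadata' keys.
        if "text" not in record or "metadata" not in record:
            raise ValueError("record is missing 'text' or 'metadata'")
        count += 1
        source = record["metadata"].get("source")
        if source is not None:
            sources.add(source)
    return count, sorted(sources)


# Made-up lines standing in for the contents of an extracted JSONL file.
example_lines = [
    '{"text": "first document", "metadata": {"source": "/tmp/a.pdf"}}',
    '{"text": "second document", "metadata": {"source": "/tmp/b.docx"}}',
]
record_count, record_sources = summarize_jsonl(example_lines)
```

To run the same check on a real output file, open it and pass the file object, for example `with open(out_file, encoding="utf-8") as f: summarize_jsonl(f)`.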
