# RecDP LLM - Document Extract

The standard input for LLM pretraining or fine-tuning is a folder of files, each containing multiple samples. Each sample is either a JSON object or a row in a tabular format.

This function converts text, images, PDFs, and Word documents into JSONL files, which can then be used for LLM data processing.

Output format:

| text                | meta                              |
| ------------------- | --------------------------------- |
| This is a cool tool | {'source': 'dummy', 'lang': 'en'} |
| llm is fun          | {'source': 'dummy', 'lang': 'en'} |
| ...                 | {'source': 'dummy', 'lang': 'en'} |
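
As a minimal sketch of what this layout looks like on disk, the snippet below writes the first two rows of the table to a JSONL file and reads them back using only the Python standard library. The file name `sample.jsonl` and the helpers `write_jsonl` and `read_jsonl` are illustrative assumptions, not part of pyrecdp.

```python
import json
import tempfile
from pathlib import Path

# Rows mirroring the table above: one JSON object per sample.
samples = [
    {"text": "This is a cool tool", "meta": {"source": "dummy", "lang": "en"}},
    {"text": "llm is fun", "meta": {"source": "dummy", "lang": "en"}},
]


def write_jsonl(rows, path):
    """Write each row as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# "sample.jsonl" is a hypothetical file name used only for this illustration.
out_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(samples, out_path)
loaded = read_jsonl(out_path)
print(loaded[0]["text"])
```

Keeping one sample per line means a file can be streamed or split without parsing it as a whole.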

Supported input types:
* image (png, jpg)
* pdf
* Word documents (docx)
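
To make the list above concrete, here is a small sketch that classifies file names by suffix. The mapping is an assumption: in particular it treats the Word entry as `.docx` files, as in the sample data used later in this notebook. The `classify` helper and `notes.txt` are hypothetical, and pyrecdp may detect file types differently.

```python
from pathlib import Path

# Assumed mapping from the input types listed above to file suffixes.
SUPPORTED_SUFFIXES = {
    "image": {".png", ".jpg"},
    "pdf": {".pdf"},
    "docx": {".docx"},
}


def classify(file_name):
    """Return the input type for a file name, or None if it is not supported."""
    suffix = Path(file_name).suffix.lower()
    for kind, suffixes in SUPPORTED_SUFFIXES.items():
        if suffix in suffixes:
            return kind
    return None


# "notes.txt" is a made-up name that shows the unsupported case.
names = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
    "notes.txt",
]
for name in names:
    print(name, "->", classify(name))
```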

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
%mkdir -p /content/doc_jsonl
file_names = ['english-and-korean.png', 'handbook-872p.docx', 'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/{i}" for i in file_names]
!wget -P /content/test_data/document/ {" ".join(file_list)}
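
The cell above relies on IPython magics and `wget`. Outside a notebook, a rough standard-library equivalent could look like the sketch below. The helpers `build_urls` and `download_all` are hypothetical names, and the download call is left commented out because it needs network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document"
FILE_NAMES = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
]


def build_urls(base_url, names):
    """Join the base URL with each file name."""
    return [f"{base_url}/{name}" for name in names]


def download_all(urls, target_dir):
    """Download each URL into target_dir and return the local paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        dest = target / url.rsplit("/", 1)[-1]
        urlretrieve(url, str(dest))
        paths.append(dest)
    return paths


urls = build_urls(BASE_URL, FILE_NAMES)
# Uncomment to download the files (requires network access):
# download_all(urls, "/content/test_data/document")
```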

## 3. Load documents

#### 3.1 Load PDF documents

In [1]:
from pyrecdp.primitives.operations import DirectoryLoader

from pyrecdp.LLM import TextPipeline
 
pipeline = TextPipeline()
ops = [
    DirectoryLoader(input_dir="/content/test_data/document", glob="**/*.pdf")
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())
 

_get_loader
2023-12-20 14:52:16.382 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pypdf

_get_loader
2023-12-20 14:52:16.469 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pypdf
init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 14:52:21,156	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

100%|██████████| 1/1 [00:00<00:00,  2.85it/s]
2023-12-20 14:52:22,927	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
2023-12-20 14:52:22,932	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 14:52:22,933	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

execute with ray took 1.1216576620936394 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `\n\n LayoutParser : A Uniﬁed Toolkit for Deep\...` | `{'source': '/content/test_data/document/layout...` |
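
The `glob="**/*.pdf"` pattern in the cell above selects PDF files at any depth under `input_dir`. The sketch below illustrates the pattern on a throwaway directory tree, assuming the `glob` argument follows the same rules as Python's `pathlib`; that has not been verified against the pyrecdp source. All file names and the `match` helper are made up for the illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree; every file name here is hypothetical.
root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()
for rel in ["a.pdf", "b.docx", "nested/c.pdf", "nested/d.jpg"]:
    (root / rel).touch()


def match(pattern):
    """Return the matching paths relative to root, sorted for stable output."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


print(match("**/*.pdf"))  # PDFs at any depth, including the top level
print(match("*.pdf"))     # PDFs in the top-level directory only
```

Under this assumption, changing the suffix in the pattern is what selects a different file type in the following sections.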

(raylet) [2023-12-20 14:52:30,040 E 2010245 2010266] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_14-52-18_287180_2002748 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.

#### 3.2 Load Word documents

In [2]:
from pyrecdp.primitives.operations import DirectoryLoader

from pyrecdp.LLM import TextPipeline
 
pipeline = TextPipeline()
ops = [
    DirectoryLoader(input_dir="/content/test_data/document", glob="**/*.docx")
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

_get_loader
2023-12-20 14:53:12.711 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install python-docx
_get_loader
2023-12-20 14:53:12.803 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install python-docx
init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 14:53:17,517	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

100%|██████████| 1/1 [00:00<00:00,  1.64it/s]
2023-12-20 14:53:19,445	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
2023-12-20 14:53:19,448	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 14:53:19,450	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

execute with ray took 1.2985519338399172 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `\n\n U.S. Department of Justice\n\n Executive ...` | `{'source': '/content/test_data/document/handbo...` |

(raylet) [2023-12-20 14:53:26,399 E 2015878 2015899] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_14-53-14_626786_2002748 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.


#### 3.3 Load images

In [3]:
from pyrecdp.primitives.operations import DirectoryLoader

from pyrecdp.LLM import TextPipeline
 
pipeline = TextPipeline()
ops = [
    DirectoryLoader(input_dir="/content/test_data/document", glob="**/*.jpg")
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

_get_loader
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 14:53:36.861 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 14:53:46.379 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
_get_loader
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 14:53:46.933 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 14:53:54.602 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 14:53:59,250	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

  0%|          | 0/1 [00:00<?, ?it/s]
(raylet) [2023-12-20 14:54:08,227 E 2021288 2021308] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_14-53-56_430977_2002748 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.
100%|██████████| 1/1 [00:35<00:00, 35.02s/it]
2023-12-20 14:54:35,490	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOpera

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

execute with ray took 35.629308976233006 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\n...` | `{'source': '/content/test_data/document/layout...` |


#### 3.4 Load an entire directory

In [1]:
from pyrecdp.primitives.operations import DirectoryLoader

from pyrecdp.LLM import TextPipeline
 
pipeline = TextPipeline()
ops = [
    DirectoryLoader(input_dir="/content/test_data/document")
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

2023-12-20 15:20:13.169 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install emoji==2.2.0

_get_loader
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 15:20:13.692 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 15:20:23.167 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
2023-12-20 15:20:23.172 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pypdf
2023-12-20 15:20:23.382 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install python-docx
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 15:20:23.902 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 15:20:31.290 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
_get_loader
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 15:20:31.823 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 15:20:38.180 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
2023-12-20 15:20:38.183 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pypdf
2023-12-20 15:20:38.185 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install python-docx
Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 134 not upgraded.
2023-12-20 15:20:38.670 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pillow

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 15:20:52.243 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install pytesseract
init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 15:20:56,945	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

 92%|█████████▎| 37/40 [00:00<00:00, 91.06it/s]
(raylet) [2023-12-20 15:21:05,827 E 2140220 2140243] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_15-20-54_049751_2138313 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.
 98%|█████████▊| 39/40 [00:19<00:00, 91.06it/s]
100%|██████████| 40/40 [00:35<00:00,  1.13it/s]
2023-12-20 15:21:33,755	INFO streaming_executor.py:93 

Running 0:   0%|          | 0/4 [00:00<?, ?it/s]

execute with ray took 36.26759174466133 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `\n\n RULES AND INSTRUCTIONS\n\n1. Template for...` | `{'source': '/content/test_data/document/englis...` |
| 1   | `\n\n U.S. Department of Justice\n\n Executive ...` | `{'source': '/content/test_data/document/handbo...` |
| 2   | `2103.15348v2 [cs.CV] 21 Jun 2021\n\narXiv\n\n...` | `{'source': '/content/test_data/document/layout...` |
| 3   | `\n\n LayoutParser : A Uniﬁed Toolkit for Deep\...` | `{'source': '/content/test_data/document/layout...` |
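
Each row of the result carries a `metadata` dict whose `source` entry is the path of the originating file, so a quick sanity check is to count rows per file type. The sketch below does this on stand-in rows. The paths and the `count_by_suffix` helper are hypothetical, and converting the real result with `ds.to_pandas().to_dict("records")` is an assumption about how you might obtain such rows.

```python
from collections import Counter
from pathlib import Path

# Stand-in rows shaped like the table above; the paths are made up.
rows = [
    {"text": "...", "metadata": {"source": "/data/docs/scan.png"}},
    {"text": "...", "metadata": {"source": "/data/docs/manual.docx"}},
    {"text": "...", "metadata": {"source": "/data/docs/page.jpg"}},
    {"text": "...", "metadata": {"source": "/data/docs/paper.pdf"}},
]


def count_by_suffix(records):
    """Count records per lowercase file suffix of their metadata source."""
    return Counter(Path(r["metadata"]["source"]).suffix.lower() for r in records)


counts = count_by_suffix(rows)
print(dict(counts))
```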


#### 3.5 URL loader with LangChain RecursiveUrlLoader

In [4]:
url = 'https://app.cnvrg.io/docs/'

from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import DocumentLoader

pipeline = TextPipeline()
ops = [
    DocumentLoader(loader='RecursiveUrlLoader', loader_args={"url": url}),
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())



init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 15:28:21,062	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

(raylet) [2023-12-20 15:28:30,032 E 2159771 2159790] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_15-28-18_227590_2138313 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.
2023-12-20 15:28:41,766	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
2023-12-20 15:28:41,767	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 15:28:41,768	INFO streaming_executor.py:96 -- Tip: F

Running 0:   0%|          | 0/17 [00:00<?, ?it/s]

execute with ray took 20.301080273464322 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 1   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 2   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 3   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 4   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 5   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 6   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 7   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 8   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
| 9   | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |
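
The `text` column above still holds raw HTML, so a downstream step usually has to reduce it to plain text. The sketch below shows one way to do that with Python's standard `html.parser`. It is an illustrative post-processing step, not something pyrecdp is known to do here, and `TextExtractor`, `html_to_text`, and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# A made-up page in the same shape as the rows above.
sample = (
    '<!DOCTYPE html>\n<html lang="en-US">\n <head><title>Docs</title>'
    "<style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```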

#### 3.6 Online PDF loader

In [5]:
attention_is_all_you_need_pdf = 'https://arxiv.org/pdf/1706.03762.pdf'

from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import DocumentLoader

pipeline = TextPipeline()
ops = [
    DocumentLoader(loader='OnlinePDFLoader', loader_args={"file_path": attention_is_all_you_need_pdf}),
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

init ray
init ray with total mem of 324413575987, total core of 48

2023-12-20 15:37:49,289	INFO worker.py:1642 -- Started a local Ray instance.

execute with ray started ...

(raylet) [2023-12-20 15:37:58,263 E 2179392 2179417] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_15-37-46_470728_2138313 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.
2023-12-20 15:37:58,360	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
2023-12-20 15:37:58,362	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 15:37:58,363	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

execute with ray took 8.47647582553327 sec

|     | text | metadata |
| --- | ---- | -------- |
| 0   | `3 2 0 2\n\ng u A 2\n\n] L C . s c [\n\n7 v 2 6...` | `{'source': '/tmp/tmp7lpqus0e/tmp.pdf'}` |
