# RecDP LLM - Document Ingestion

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
%mkdir -p /content/doc_jsonl
file_names = ['english-and-korean.png', 'handbook-872p.docx', 'layout-parser-paper-10p.jpg', 'layout-parser-paper.pdf']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document/{i}" for i in file_names]
!wget -P /content/test_data/document/ {" ".join(file_list)}
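
Outside a notebook, the same download step can be sketched in plain Python with the standard library. This is only an illustration: the helper names `build_urls` and `download_all` are hypothetical and not part of pyrecdp, and the target directory is the one used by the `wget` command above.

```python
import urllib.request
from pathlib import Path

# Base URL and file names are taken from the cell above.
BASE_URL = "https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/document"
FILE_NAMES = [
    "english-and-korean.png",
    "handbook-872p.docx",
    "layout-parser-paper-10p.jpg",
    "layout-parser-paper.pdf",
]


def build_urls(file_names, base_url=BASE_URL):
    """Return the full download URL for each sample file name."""
    return [f"{base_url}/{name}" for name in file_names]


def download_all(urls, target_dir):
    """Download each URL into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        destination = target / url.rsplit("/", 1)[-1]
        if not destination.exists():
            urllib.request.urlretrieve(url, destination)
        saved.append(destination)
    return saved


urls = build_urls(FILE_NAMES)
# Uncomment to fetch the files (requires network access):
# download_all(urls, "/content/test_data/document")
```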

## 3. DocumentIngestion

### 3.1 Load document

In [1]:
from pyrecdp.primitives.operations import DocumentSplit,DocumentLoader

loader = DocumentLoader(loader="RecursiveUrlLoader", loader_args={"url": "https://app.cnvrg.io/docs/core_concepts/python_sdk_v2.html"})

ds = loader.process_rayds()
display(ds.to_pandas())

2023-12-20 16:05:50.470 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install emoji==2.2.0
2023-12-20 16:05:55,014	INFO worker.py:1642 -- Started a local Ray instance.


| | text | metadata |
|---|---|---|
| 0 | `<!DOCTYPE html>\n<html lang="en-US">\n <head>...` | `{'description': 'Documentation website for cnv...` |

(raylet) [2023-12-20 16:06:04,911 E 2251227 2251243] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-12-20_16-05-53_056733_2250766 is over 95% full, available space: 0; capacity: 422146228224. Object creation will fail if spilling is required.

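The raylet warning in the output above reports that the Ray session directory is over 95% full. A quick way to check free space before running the pipeline is sketched below using only the standard library. It assumes Ray keeps its session directory under the system temp directory, as the `/tmp/ray/...` path in the warning suggests, and the helper name `percent_full` is hypothetical.

```python
import shutil
import tempfile


def percent_full(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * (usage.total - usage.free) / usage.total


# Assumption: the Ray session directory lives under the system temp directory.
temp_dir = tempfile.gettempdir()
used = percent_full(temp_dir)
print(f"{temp_dir} is {used:.1f}% full")
if used > 95.0:  # threshold quoted in the raylet warning
    print("Free up space before running the pipeline.")
```

The number may differ slightly from what Ray reports, since this sketch only approximates the check behind the warning.
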
### 3.2 Embedding document

We support ingesting documents into the following document stores:
- [faiss](https://github.com/facebookresearch/faiss) vector store
- [chroma](https://github.com/chroma-core/chroma) vector store
- [elasticsearch](https://github.com/elastic/elasticsearch)

We provide the [DocumentIngestion](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/operations/text_ingestion.py#L239) operator for ingesting documents into a document store.
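
As a rough mental model of what ingesting into a vector store involves, the toy sketch below embeds each text, stores the vector next to it, and answers a query by cosine similarity. It is not how pyrecdp, FAISS, or Chroma are implemented: `toy_embed` (a bag-of-words counter) and `ToyVectorStore` are hypothetical stand-ins for a real embedding model and a real store.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and searches them by brute force."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((text, toy_embed(text)))

    def similarity_search(self, query, k=1):
        query_vec = toy_embed(query)
        scored = [(cosine(query_vec, vec), text) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add_texts([
    "Intel reports third-quarter 2023 financial results",
    "FAISS is a library for similarity search",
    "Chroma is an open-source embedding database",
])
top_hit = store.similarity_search("similarity search library", k=1)[0]
print(top_hit)
```

A real store replaces the brute-force scan with an index and the word counts with dense model embeddings, but the add-then-search flow is the same idea.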

#### 3.2.1 Ingest into FAISS vector store

In [4]:
from pyrecdp.LLM import TextPipeline
import os
from pyrecdp.primitives.operations import DocumentIngestion,DocumentLoader,DocumentSplit
from pyrecdp.core.cache_utils import RECDP_MODELS_CACHE
model_root_path = os.path.join(RECDP_MODELS_CACHE, "huggingface")
model_name = f"{model_root_path}/sentence-transformers/all-mpnet-base-v2"
faiss_output_dir = 'tests/data/faiss'
pipeline = TextPipeline()

url = "https://www.intc.com/news-events/press-releases/detail/1655/intel-reports-third-quarter-2023-financial-results"
ops = [
    DocumentLoader(loader="RecursiveUrlLoader" , loader_args={"url":url}),
    DocumentSplit(text_splitter='RecursiveCharacterTextSplitter'),
    DocumentIngestion(
        vector_store='FAISS',
        vector_store_args={
            "output_dir": faiss_output_dir,
            "index": "test_index"
        },
        embeddings='HuggingFaceEmbeddings',
        embeddings_args={'model_name': model_name}
    ),
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

2023-12-20 16:17:21.188 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['langchain']
2023-12-20 16:17:25.209 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['langchain']


init ray
execute with ray started ...
2023-12-20 16:17:31.848 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['faiss-cpu', 'faiss-gpu', 'langchain']


Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.

2023-12-20 16:18:07,819	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[FlatMap(<lambda>)->MapBatches(TextEmbedding)]
2023-12-20 16:18:07,820	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 16:18:07,822	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2023-12-20 16:18:09,643	INFO actor_pool_map_operator.py:106 -- FlatMap(<lambda>)->MapBatches(TextEmbedding): Wait

execute with ray took 307.69363911263645 sec


Batches: 100%|██████████| 4/4 [02:04<00:00, 31.15s/it]


| | text | embedding |
|---|---|---|
| 0 | `<!DOCTYPE html>\r\n<html lang="en">\r\n <he...` | `[0.02279656007885933, -0.004381305538117886, -...` |
| 1 | `<script type="text/javascript">\r\n //c...` | `[-0.013796613551676273, -0.04453909769654274, ...` |
| 2 | `<a href="/news-events/press-releases" >Press R...` | `[0.009947280399501324, 0.010400775820016861, -...` |
| 3 | `</li>\r\n <...` | `[0.02152082696557045, 0.03438470885157585, -0....` |
| 4 | `<div class="text">\r\n ...` | `[0.0012505522463470697, -0.03194170445203781, ...` |
| ... | ... | ... |
| 113 | `</td>\r\n<td width="5" valign="bottom" style="...` | `[-0.009348627179861069, -0.031560588628053665,...` |
| 114 | `</svg>\r\n <span>Tear Sheet</sp...` | `[-0.018031703308224678, -0.08937390893697739, ...` |
| 115 | `</svg>\r\n <span>RSS News Feed<...` | `[0.029551001265645027, -0.027967212721705437, ...` |
| 116 | `</svg>\r\n </a>\r\n ...` | `[0.031037108972668648, -0.05818837508559227, -...` |
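
Rows 0 to 116 in the output above are the chunks produced by `DocumentSplit(text_splitter='RecursiveCharacterTextSplitter')`. The idea behind recursive character splitting can be sketched as follows: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no chunk overlap, and small neighbouring pieces from different recursion levels are not merged.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\nzeta"
chunks = recursive_split(sample, chunk_size=12)
print(chunks)
```

On the sample string this yields four chunks, each at most 12 characters long, with paragraph and line boundaries respected before word boundaries.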


#### 3.2.2 Ingest into Chroma vector store

In [2]:
from pyrecdp.LLM import TextPipeline
import os
from pyrecdp.primitives.operations import DocumentIngestion,DocumentLoader,DocumentSplit
from pyrecdp.core.cache_utils import RECDP_MODELS_CACHE
model_root_path = os.path.join(RECDP_MODELS_CACHE, "huggingface")
model_name = f"{model_root_path}/sentence-transformers/all-mpnet-base-v2"
chroma_output_dir = 'tests/data/chroma'
pipeline = TextPipeline()

url = "https://www.intc.com/news-events/press-releases/detail/1655/intel-reports-third-quarter-2023-financial-results"
ops = [
    DocumentLoader(loader="RecursiveUrlLoader" , loader_args={"url":url}),
    DocumentSplit(text_splitter='RecursiveCharacterTextSplitter'),
    DocumentIngestion(
        vector_store='chroma',
        vector_store_args={
            "persist_directory": chroma_output_dir,
            "collection_name": "test_index"
        },
        embeddings='HuggingFaceEmbeddings',
        embeddings_args={'model_name': model_name}
    ),
]
pipeline.add_operations(ops)
ds = pipeline.execute()
display(ds.to_pandas())

2023-12-20 16:46:02.538 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['langchain']
2023-12-20 16:46:03.760 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['langchain']
init ray
execute with ray started ...
2023-12-20 16:46:09.008 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:47 - check_availability_and_install ['chromadb==0.4.15', 'langchain']


2023-12-20 16:46:09,014	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[FlatMap(<lambda>)]
2023-12-20 16:46:09,016	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-20 16:46:09,017	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


(FlatMap(<lambda>) pid=2326233) 2023-12-20 16:46:10.667 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:52 - check_availability_and_install emoji==2.2.0

execute with ray took 20.456643318757415 sec


| | text | metadata |
|---|---|---|
| 0 | `<!DOCTYPE html>\r\n<html lang="en">\r\n <he...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 1 | `<script type="text/javascript">\r\n //c...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 2 | `<a href="/news-events/press-releases" >Press R...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 3 | `</li>\r\n <...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 4 | `<div class="text">\r\n ...` | `{'language': 'en', 'source': 'https://www.intc...` |
| ... | ... | ... |
| 113 | `</td>\r\n<td width="5" valign="bottom" style="...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 114 | `</svg>\r\n <span>Tear Sheet</sp...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 115 | `</svg>\r\n <span>RSS News Feed<...` | `{'language': 'en', 'source': 'https://www.intc...` |
| 116 | `</svg>\r\n </a>\r\n ...` | `{'language': 'en', 'source': 'https://www.intc...` |
