# RecDP LLM - RAG Data Processing Pipeline

This component shows you how to construct a RAG data processing pipeline that builds your own knowledge database for use by LLM-RAG.

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Set parameters according to your environment



### 2.1 Parameters for the vector store


In [None]:
# Directory where the vector store data is written
out_dir = "/content/vs_store"
vector_store_type = "FAISS"
index_name = "knowledge_db"

### 2.2 Parameters for the TextSplitter

In [None]:
text_splitter = "RecursiveCharacterTextSplitter"
text_splitter_args = {"chunk_size": 500, "chunk_overlap": 0}
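
To make the two splitter arguments concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified illustration only: `chunk_text` is a hypothetical helper, not part of pyrecdp, and the real `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks instead of cutting at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration; it is not the splitter used by the pipeline.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap` set to 0, as in the cell above, no text is repeated between chunks. A positive overlap repeats the tail of each chunk at the start of the next, which can help preserve context across chunk boundaries at the cost of storing more text.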

### 2.3 Parameters for the embedding model

In [None]:
embeddings_type = "HuggingFaceEmbeddings"
embeddings_args = {"model_name": "sentence-transformers/all-mpnet-base-v2"}
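
The embedding model maps each text chunk to a fixed-length vector, and a vector store retrieves the chunks whose vectors are closest to the query vector. The sketch below shows cosine similarity, one common closeness measure, on made-up 3-dimensional vectors. The vectors and the `cosine_similarity` helper are assumptions for illustration: real embeddings have far more dimensions, and the metric actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (not real model output).
query = [1.0, 0.0, 1.0]
chunk_vectors = {
    "chunk_a": [2.0, 0.0, 2.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Pick the chunk whose vector is most similar to the query vector.
best_match = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(query, chunk_vectors[name]),
)
print(best_match)
```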

### 2.4 Specify the data you need to process

In [None]:
# web data
target_urls = ["https://www.intc.com/news-events/press-releases/detail/1655/intel-reports-third-quarter-2023-financial-results"]
# or some file data
# data_path = "/content/my_pdf_path"

## 3. Use RecDP to extract data and build a knowledge database

### 3.1 Import related packages





In [None]:
from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import Url_Loader, DocumentSplit, DocumentIngestion, RAGTextFix

### 3.2 Set the RAG data processing pipeline






In [None]:
pipeline = TextPipeline()
ops = [
    Url_Loader(urls=target_urls),
    # DirectoryLoader(data_path, glob="**/*.pdf"),
    # Use operators provided by RecDP to process the data
    RAGTextFix(),
    DocumentSplit(text_splitter=text_splitter, text_splitter_args=text_splitter_args),
    DocumentIngestion(
        vector_store=vector_store_type,
        vector_store_args={
            "output_dir": out_dir,
            "index": index_name
        },
        embeddings=embeddings_type,
        embeddings_args=embeddings_args
    ),
]
pipeline.add_operations(ops)


### 3.3 Run your pipeline



In [None]:
pipeline.execute()
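
After the run finishes, it can be useful to confirm that something was written to the output directory. The sketch below lists the files under a directory using only the standard library. `list_output_files` is a hypothetical helper, and the exact file names produced depend on the vector store type, so none are assumed here.

```python
from pathlib import Path


def list_output_files(directory):
    """Return the relative paths of all files under directory, sorted."""
    root = Path(directory)
    if not root.is_dir():
        return []  # directory missing: nothing has been written there
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# In the notebook, pass the out_dir configured in section 2.1.
for name in list_output_files("/content/vs_store"):
    print(name)
```

An empty result means the directory does not exist or contains no files, which suggests the ingestion step did not write to the expected location.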