# RecDP LLM - RAG Data Processing Pipeline

This notebook shows how to construct a RAG data processing pipeline that builds your own knowledge database for use by LLM-RAG.

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Set parameters

In [None]:
out_dir="/content/vs_store"
vector_store_type="FAISS"
index_name="knowledge_db"

text_splitter = "RecursiveCharacterTextSplitter"
text_splitter_args = {"chunk_size": 500, "chunk_overlap": 0}

target_urls = ["https://www.intc.com/news-events/press-releases/detail/1655/intel-reports-third-quarter-2023-financial-results"]

embeddings_type="HuggingFaceEmbeddings"
embeddings_args={'model_name': "sentence-transformers/all-mpnet-base-v2"}
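
To make the `chunk_size` and `chunk_overlap` settings above more concrete, here is a minimal sketch of fixed-size character chunking. It is a simplified illustration only: `split_text` is a hypothetical helper, not part of pyrecdp, and the real `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences rather than cutting at exact offsets.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Hypothetical helper for illustration; consecutive windows share
    chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Same settings as text_splitter_args above, applied to a 1200-character string.
sample = "a" * 1200
chunks = split_text(sample, chunk_size=500, chunk_overlap=0)
print([len(c) for c in chunks])  # → [500, 500, 200]
```

With `chunk_overlap` set to 0, as in this notebook, no text is repeated between chunks. A small positive overlap can help keep sentences that straddle a boundary retrievable from either side, at the cost of storing some text twice.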

## 3. Extract data and build a knowledge database

### 3.1 Import related packages

In [None]:
from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import Url_Loader, DocumentSplit, DocumentIngestion

### 3.2 Set the RAG data processing pipeline

In [None]:
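pipeline = TextPipeline()
ops = [
    # Load the target pages, keeping only <div class="main-content"> elements.
    Url_Loader(urls=target_urls, target_tag='div', target_attrs={'class': 'main-content'}),
    # Split the loaded documents into chunks using the configured text splitter.
    DocumentSplit(text_splitter=text_splitter, text_splitter_args=text_splitter_args),
    # Embed each chunk and store it in the vector store under out_dir.
    DocumentIngestion(
        vector_store=vector_store_type,
        vector_store_args={
            "output_dir": out_dir,
            "index": index_name
        },
        embeddings=embeddings_type,
        embeddings_args=embeddings_args
    ),
]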
pipeline.add_operations(ops)

### 3.3 Run your pipeline

In [None]:
pipeline.execute()
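
Once the pipeline finishes, a quick sanity check is to look at what was written to the output directory. The sketch below assumes the vector store is persisted as files directly under `out_dir`; `list_store_files` is a hypothetical helper, not part of pyrecdp, and the exact file names depend on the vector store backend.

```python
from pathlib import Path


def list_store_files(store_dir: str) -> list[str]:
    """Return sorted names of files directly under store_dir ([] if it is missing).

    Hypothetical helper for illustration; it only inspects the directory
    and does not load or validate the index itself.
    """
    root = Path(store_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())


# Path taken from the out_dir parameter set earlier in this notebook.
print(list_store_files("/content/vs_store"))
```

An empty list suggests the pipeline did not persist anything to that location, in which case the `output_dir` setting and the pipeline logs are worth checking.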