# RecDP LLM - RAG Indexer Pipeline (ElasticSearch)

This notebook will show you how to construct a RAG indexer pipeline with RecDP-LLM, it use Elasticsearch as backend document store

# Get started

## 1. Install pyrecdp and dependencies

In [2]:
! DEBIAN_FRONTEND=noninteractive apt-get install -qq -y openjdk-8-jre
! pip install -q pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Setup and run RAG indexer pipeline

### 2.1 Loader documents

In [1]:
urls = ['https://app.cnvrg.io/docs/', 
        'https://app.cnvrg.io/docs/core_concepts/python_sdk_v2.html',
        'https://app.cnvrg.io/docs/cli_v2/cnvrgv2_cli.html',
        'https://app.cnvrg.io/docs/collections/tutorials.html']

from pyrecdp.primitives.operations import DocumentLoader
url_loader = DocumentLoader(loader='UnstructuredURLLoader', loader_args={'urls': urls}, requirements=['unstructured'])

ds = url_loader.process_rayds()
display(ds.to_pandas())

[32m2023-12-13 11:22:36.363[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m50[0m - [1mcheck_availability_and_install emoji==2.2.0[0m


[32m2023-12-13 11:22:36.397[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m45[0m - [1mcheck_availability_and_install ['unstructured'][0m


2023-12-13 11:22:58,390	INFO worker.py:1642 -- Started a local Ray instance.


Unnamed: 0,text,metadata
0,# Welcome to cnvrg.io\n\ncnvrg.io is a machine...,{'source': 'https://app.cnvrg.io/docs/'}
1,# cnvrg SDK V2\n\n# Getting Started\n\nThe cnv...,{'source': 'https://app.cnvrg.io/docs/core_con...
2,"# cnvrgv2 CLI\n\nAs data scientists, we use th...",{'source': 'https://app.cnvrg.io/docs/cli_v2/c...
3,# Tutorials and Examples\n\nTo help you get st...,{'source': 'https://app.cnvrg.io/docs/collecti...


### 2.2 Preprocess documents

#### 2.2.1 Remove header separator and Extra white space

In [2]:
from pyrecdp.primitives.operations import RAGTextFix
text_fixer = RAGTextFix(str_to_replace={'\n###': '', '\n##': '', '\n#': ''}, remove_extra_whitespace=True)
ds = text_fixer.process_rayds(ds)
display(ds.to_pandas())


[32m2023-12-13 11:23:32.657[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m45[0m - [1mcheck_availability_and_install ['ftfy', 'selectolax'][0m
[32m2023-12-13 11:23:32.711[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m50[0m - [1mcheck_availability_and_install nltk[0m
[32m2023-12-13 11:23:32.712[0m | [1mINFO    [0m | [36mpyrecdp.core.model_utils[0m:[36mprepare_nltk_model[0m:[36m164[0m - [1mLoading nltk punkt split model...[0m


2023-12-13 11:23:32,723	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)]
2023-12-13 11:23:32,724	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-13 11:23:32,726	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/4 [00:00<?, ?it/s]

[2m[36m(Map(<lambda>) pid=1791194)[0m 2023-12-13 11:23:34.531 | INFO     | pyrecdp.core.import_utils:check_availability_and_install:50 - check_availability_and_install emoji==2.2.0


Unnamed: 0,text,metadata
0,# cnvrg SDK V2 Getting Started The cnvrg SDK w...,{'source': 'https://app.cnvrg.io/docs/core_con...
1,"# cnvrgv2 CLI As data scientists, we use the t...",{'source': 'https://app.cnvrg.io/docs/cli_v2/c...
2,# Tutorials and Examples To help you get start...,{'source': 'https://app.cnvrg.io/docs/collecti...
3,# Welcome to cnvrg.io cnvrg.io is a machine le...,{'source': 'https://app.cnvrg.io/docs/'}


#### 2.2.2 Split document

In [3]:
from pyrecdp.primitives.operations import CustomerDocumentSplit
split_doc_op = CustomerDocumentSplit(func=lambda text: text.split('# ')[1:])
ds = split_doc_op.process_rayds(ds)
display(ds.to_pandas())

2023-12-13 11:23:37,686	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)->FlatMap(<lambda>)]
2023-12-13 11:23:37,688	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-13 11:23:37,689	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,text,metadata
0,Welcome to cnvrg.io cnvrg.io is a machine lear...,{'source': 'https://app.cnvrg.io/docs/'}
1,"cnvrgv2 CLI As data scientists, we use the ter...",{'source': 'https://app.cnvrg.io/docs/cli_v2/c...
2,Tutorials and Examples To help you get started...,{'source': 'https://app.cnvrg.io/docs/collecti...
3,cnvrg SDK V2 Getting Started The cnvrg SDK was...,{'source': 'https://app.cnvrg.io/docs/core_con...
4,"An array of numerical values values: [0.1, 0.0...",{'source': 'https://app.cnvrg.io/docs/core_con...
5,"An array of string values values: [""linear"", ""...",{'source': 'https://app.cnvrg.io/docs/core_con...
6,inclusive max: 20,{'source': 'https://app.cnvrg.io/docs/core_con...
7,"not inclusive scale: ""linear"" steps: 2",{'source': 'https://app.cnvrg.io/docs/core_con...
8,"The number of linear steps to produce. """""" e =...",{'source': 'https://app.cnvrg.io/docs/core_con...


#### 2.2.3 Add customer filter

In [4]:
def custom_filter(text):
    from nltk.tokenize import word_tokenize
    ret_txt = None
    if len(word_tokenize(text)) >10:
        if text.split(' ')[0].lower()!='version':
            ret_txt = text
    return ret_txt != None
    
from pyrecdp.primitives.operations import TextCustomerFilter
custom_filter_op = TextCustomerFilter(custom_filter)

ds = custom_filter_op.process_rayds(ds)
display(ds.to_pandas())


2023-12-13 11:23:41,022	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)->FlatMap(<lambda>)->Filter(<lambda>)]
2023-12-13 11:23:41,023	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-13 11:23:41,025	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,text,metadata
0,Welcome to cnvrg.io cnvrg.io is a machine lear...,{'source': 'https://app.cnvrg.io/docs/'}
1,"cnvrgv2 CLI As data scientists, we use the ter...",{'source': 'https://app.cnvrg.io/docs/cli_v2/c...
2,Tutorials and Examples To help you get started...,{'source': 'https://app.cnvrg.io/docs/collecti...
3,cnvrg SDK V2 Getting Started The cnvrg SDK was...,{'source': 'https://app.cnvrg.io/docs/core_con...
4,"An array of numerical values values: [0.1, 0.0...",{'source': 'https://app.cnvrg.io/docs/core_con...
5,"An array of string values values: [""linear"", ""...",{'source': 'https://app.cnvrg.io/docs/core_con...
6,"The number of linear steps to produce. """""" e =...",{'source': 'https://app.cnvrg.io/docs/core_con...


#### 2.2.4 chunk document

In [5]:
def chunk_doc(text,max_num_of_words):
    from nltk.tokenize import word_tokenize,sent_tokenize
    text= text.strip()
    if len(word_tokenize(text)) <= max_num_of_words:
        return [text]
    else:
        chunks = []
        # split by sentence
        sentences = sent_tokenize(text)
        # print('number of sentences: ', len(sentences))
        words_count = 0
        temp_chunk = ""
        for s in sentences:
            temp_chunk+=(s+" ")
            words_count += len(word_tokenize(s))
            if len(word_tokenize(temp_chunk))> max_num_of_words:
                chunks.append(temp_chunk)
                words_count = 0
                temp_chunk = ""
                
        return chunks
    
from pyrecdp.primitives.operations import CustomerDocumentSplit
chunk_doc_op = CustomerDocumentSplit(chunk_doc, max_num_of_words=50)

ds = chunk_doc_op.process_rayds(ds)
display(ds.to_pandas())

2023-12-13 11:23:46,037	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Map(<lambda>)->FlatMap(<lambda>)->Filter(<lambda>)->FlatMap(<lambda>)]
2023-12-13 11:23:46,039	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-12-13 11:23:46,040	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,text,metadata
0,Welcome to cnvrg.io cnvrg.io is a machine lear...,{'source': 'https://app.cnvrg.io/docs/'}
1,Dataset Use Datasets to manage data with versi...,{'source': 'https://app.cnvrg.io/docs/'}
2,Use Papers to consolidate comparison across ex...,{'source': 'https://app.cnvrg.io/docs/'}
3,Easily update your running serving to keep it ...,{'source': 'https://app.cnvrg.io/docs/'}
4,Tutorials and Examples To help you get started...,{'source': 'https://app.cnvrg.io/docs/collecti...
...,...,...
92,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
93,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
94,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
95,"options are: cnvrg, dockerhub, gcr, acr, ecr, ...",{'source': 'https://app.cnvrg.io/docs/core_con...


### 2.3 Store documents 



#### 2.3.1 Setup ElasticSearch
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the [Docker Engine Server installation instructions](https://docs.docker.com/engine/install/#server) for details.

In [None]:
!docker run -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.15

Dec 08, 2023 1:47:14 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
{"type": "server", "timestamp": "2023-12-08T13:47:14,874Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "d36a0b817d49", "message": "version[7.17.15], pid[7], build[default/docker/0b8ecfb4378335f4689c4223d1f1115f16bef3ba/2023-11-10T22:03:46.987399016Z], OS[Linux/5.15.0-79-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/21.0.1/21.0.1+12-29]" }
{"type": "server", "timestamp": "2023-12-08T13:47:14,877Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "d36a0b817d49", "message": "JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]" }
{"type": "server", "timestamp": "2023-12-08T13:47:14,878Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "d36a0b817d49", "message": "JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cac

In [None]:
! curl -X GET "localhost:9200"

{
  "name" : "efc38f135e49",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "aevYlbdERfqS-jC8RXmeLw",
  "version" : {
    "number" : "7.17.15",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "0b8ecfb4378335f4689c4223d1f1115f16bef3ba",
    "build_date" : "2023-11-10T22:03:46.987399016Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


#### 2.3.2 Ingest documents into document store

In [6]:
from pyrecdp.primitives.operations import DocumentIngestion
indexer_op = DocumentIngestion(
    rag_framework='haystack',
    vector_store='elasticsearch',
    vector_store_args={'host': 'localhost', 'port': 9200},
)

ds = indexer_op.process_rayds(ds)
display(ds.to_pandas())

[32m2023-12-13 11:23:53.342[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m45[0m - [1mcheck_availability_and_install ['langchain'][0m
[32m2023-12-13 11:23:53.346[0m | [1mINFO    [0m | [36mpyrecdp.core.import_utils[0m:[36mcheck_availability_and_install[0m:[36m45[0m - [1mcheck_availability_and_install ['farm-haystack', 'farm-haystack[elasticsearch7]'][0m


Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.


Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.




Unnamed: 0,text,metadata
0,Welcome to cnvrg.io cnvrg.io is a machine lear...,{'source': 'https://app.cnvrg.io/docs/'}
1,Dataset Use Datasets to manage data with versi...,{'source': 'https://app.cnvrg.io/docs/'}
2,Use Papers to consolidate comparison across ex...,{'source': 'https://app.cnvrg.io/docs/'}
3,Easily update your running serving to keep it ...,{'source': 'https://app.cnvrg.io/docs/'}
4,Tutorials and Examples To help you get started...,{'source': 'https://app.cnvrg.io/docs/collecti...
...,...,...
92,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
93,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
94,"For example, gputype=v100 .To specify several,...",{'source': 'https://app.cnvrg.io/docs/core_con...
95,"options are: cnvrg, dockerhub, gcr, acr, ecr, ...",{'source': 'https://app.cnvrg.io/docs/core_con...
