<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Data Processing for RAG with Data Prep Kit (Python)</h1>
    
</div>


## Before Running the notebook

Please complete [setting up the Python dev environment](./setup-python-dev-env.md) first.

## Overview

This notebook processes PDF documents as part of a RAG pipeline.

![](media/rag-overview-2.png)

This notebook performs steps 1, 2, 3 and 4 of the RAG pipeline.

Here are the processing steps:

- **pdf2parquet**: Extract text (as Markdown) from the PDFs and store it in Parquet files
- **Exact Dedup**: Filter out documents whose content is identical to another document's
- **Chunk documents**: Split the documents into 'meaningful sections' (paragraphs, sentences, etc.)
- **Text encoder**: Convert chunks into vectors using embedding models

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG
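
The contents of `my_config` are not shown in this notebook. As a rough sketch, the attributes used in the cells that follow could be defined as shown here. The class name `MyConfig` is hypothetical, and the values are inferred from the paths and model name that appear in the cell outputs, so treat them as assumptions rather than the actual file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MyConfig:
    # Hypothetical stand-in for my_config.py; values inferred from the notebook's outputs.
    INPUT_DATA_DIR: str = "input"
    OUTPUT_FOLDER: str = "output"
    OUTPUT_FOLDER_FINAL: str = "output/output_final"
    EMBEDDING_MODEL: str = "ibm-granite/granite-embedding-30m-english"


MY_CONFIG = MyConfig()
print(MY_CONFIG.INPUT_DATA_DIR, MY_CONFIG.EMBEDDING_MODEL)
```

Keeping all paths and the model name in one place means a later stage can be re-pointed without editing each cell.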

## Step-2:  Data

We will use white papers about LLMs:

- [Granite Code Models](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/abs/1706.03762)

You can, of course, substitute your own data below.

### 2.1 - Download data

In [2]:
import os, sys
import shutil
from utils import download_file

shutil.rmtree(MY_CONFIG.INPUT_DATA_DIR, ignore_errors=True)
os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)
print ("✅ Cleared input directory")
 
download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attention.pdf' ))
download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))
download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite2.pdf' )) # duplicate


✅ Cleared input directory

input/attention.pdf (2.22 MB) downloaded successfully.

input/granite.pdf (1.27 MB) downloaded successfully.

input/granite2.pdf (1.27 MB) downloaded successfully.
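
The `download_file` helper comes from the local `utils` module, which is not shown here. A minimal sketch of a helper with similar behavior, using only the standard library, might look like the following. The name `download_file_sketch` and its body are assumptions for illustration, not the actual `utils` implementation.

```python
import os
import urllib.request


def download_file_sketch(url: str, local_file: str) -> None:
    """Fetch `url` and save it to `local_file`, creating parent folders if needed."""
    parent = os.path.dirname(local_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Read the whole response into memory; fine for small PDFs like these.
    with urllib.request.urlopen(url) as response, open(local_file, "wb") as out:
        out.write(response.read())
    size_mb = os.path.getsize(local_file) / (1024 * 1024)
    print(f"{local_file} ({size_mb:.2f} MB) downloaded successfully.")
```

For large files, streaming the response in blocks (for example with `shutil.copyfileobj`) would avoid holding the full download in memory.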


### 2.2 - Set input/output path variables for the pipeline

In [3]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_dedupe_out')
output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_chunk_out')
output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_embeddings_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: pdf2parquet -  Convert data from PDF to Parquet

This step reads all PDF files in the input folder and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted to Markdown here (JSON is also supported; see the commented-out option in the code below), which makes them easy to chunk in later steps.



### 3.1 - Execute 

In [4]:
%%time 

from dpk_pdf2parquet.transform_python import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

STAGE = 1 
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{MY_CONFIG.INPUT_DATA_DIR}' --> output='{output_parquet_dir}'\n", flush=True)

result = Pdf2Parquet(input_folder= MY_CONFIG.INPUT_DATA_DIR,
                    output_folder= output_parquet_dir,
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    #    pdf2parquet_contents_type=pdf2parquet_contents_types.JSON   # JSON
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'



12:21:44 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
12:21:44 INFO - pipeline id pipeline_id
12:21:44 INFO - code location None
12:21:44 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out
12:21:44 INFO - data factory data_ max_files -1, n_sample -1
12:21:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
12:21:44 INFO - orchestrator pdf2parquet started at 2025-02-19 12:21:44
12:21:44 INFO - Number of files is 3, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710

ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS


12:22:02 INFO - Completed 1 files (33.33%) in 0.237 min
12:22:31 INFO - Completed 2 files (66.67%) in 0.716 min
12:22:59 INFO - Completed 3 files (100.0%) in 1.19 min
12:22:59 INFO - Done processing 3 files, waiting for flush() completion.
12:22:59 INFO - done flushing in 0.0 sec
12:22:59 INFO - Completed execution in 1.254 min, execution result 0


✅ Stage:1 completed successfully
CPU times: user 3min 15s, sys: 11.4 s, total: 3min 26s
Wall time: 1min 19s


### 3.2 -  Inspect Generated output

Here we should see one entry per input file processed

In [5]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_parquet_dir)

# print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | attention.pdf | Provided proper attribution is provided, Googl... | 15 | 6 | 147 | 48c14faf-6f1d-4edc-8c29-2cf36a5bbdcc | 2949302674760005271 | pdf | f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... | 46040 | 2025-02-19T12:22:02.413652 | 14.201284 | attention.pdf |
| 1 | granite2.pdf | ## Granite Code Models: A Family of Open Found... | 28 | 19 | 295 | b0668fa2-cabc-4b85-bbcd-407759af1e23 | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:59.617806 | 28.445347 | granite2.pdf |
| 2 | granite.pdf | ## Granite Code Models: A Family of Open Found... | 28 | 19 | 295 | 0dc95fd2-332c-4711-aa81-9fd529fcd3fc | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:31.132946 | 28.668649 | granite.pdf |


## Step-4: Eliminate Duplicate Documents

We have 2 duplicate documents here: `granite.pdf` and `granite2.pdf`.

Note that the `hash` values for these documents are the same.

We are going to perform **de-dupe**.

A SHA256 hash is computed over the content of each document, and records with identical hashes are then de-duplicated.

[Dedupe transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)
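
To make the idea concrete, here is a minimal standalone sketch of hash-based exact de-duplication. It is not the `Ededup` transform's implementation: the function name `exact_dedup`, the toy documents, and the keep-the-first-seen policy are assumptions for illustration.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct SHA-256 content hash."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["contents"].encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(doc)
    return kept


# Toy stand-ins for the three converted PDFs (contents are made up).
docs = [
    {"filename": "attention.pdf", "contents": "text of the attention paper"},
    {"filename": "granite.pdf", "contents": "text of the granite paper"},
    {"filename": "granite2.pdf", "contents": "text of the granite paper"},
]
print([d["filename"] for d in exact_dedup(docs)])  # ['attention.pdf', 'granite.pdf']
```

Because only identical byte content produces the same hash, this approach catches exact copies but not near-duplicates such as the same paper with different formatting.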

### 4.1 - Execute 

In [6]:
%%time 

from dpk_ededup.transform_python import Ededup

STAGE = 2
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_parquet_dir}' --> output='{output_exact_dedupe_dir}'\n", flush=True)

result = Ededup(input_folder=output_parquet_dir,
    output_folder=output_exact_dedupe_dir,
    ededup_doc_column="contents",
    ededup_doc_id_column="document_id"
    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_dedupe_out'



12:22:59 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}
12:22:59 INFO - pipeline id pipeline_id
12:22:59 INFO - code location None
12:22:59 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_dedupe_out
12:22:59 INFO - data factory data_ max_files -1, n_sample -1
12:22:59 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:22:59 INFO - orchestrator ededup started at 2025-02-19 12:22:59
12:22:59 INFO - Number of files is 3, source profile {'max_file_size': 0.04436779022216797, 'min_file_size': 0.02082538604736328, 'total_file_size': 0.10954761505126953}
12:22:59 INFO - Starting from the beginning
12:22:59 INFO - Completed 1 files (33.33%) in 0.0 min
12:22:59 INFO - Completed 2 files (66.67%) in 0.0 min
12:22:59 INFO - Com

✅ Stage:2 completed successfully
CPU times: user 36.2 ms, sys: 2.95 ms, total: 39.2 ms
Wall time: 37 ms


### 4.2 - Inspect Generated output

We should see 2 documents: `attention.pdf` and `granite.pdf`. The duplicate `granite2.pdf` has been filtered out!

In [7]:
from utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(output_parquet_dir)
output_df = read_parquet_files_as_df(output_exact_dedupe_dir)

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before exact dedupe : {input_df.shape[0]:,}")
print (f"Output files after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.sample(min(3, output_df.shape[0]))

Input files before exact dedupe : 3
Output files after exact dedupe : 2
Duplicate files removed :   1


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename | removed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | attention.pdf | Provided proper attribution is provided, Googl... | 15 | 6 | 147 | 48c14faf-6f1d-4edc-8c29-2cf36a5bbdcc | 2949302674760005271 | pdf | f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... | 46040 | 2025-02-19T12:22:02.413652 | 14.201284 | attention.pdf | [] |
| 1 | granite.pdf | ## Granite Code Models: A Family of Open Found... | 28 | 19 | 295 | 0dc95fd2-332c-4711-aa81-9fd529fcd3fc | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:31.132946 | 28.668649 | granite.pdf | [] |


##  Step-5: Doc chunks

Split the documents into chunks.

[Chunking transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)

**Experiment with chunking size to find the setting that works best for your documents**
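
To build intuition for how chunk size and overlap interact, here is a simplified sketch of fixed-size chunking with overlap. It is not the `li_markdown` chunker used below: it treats whitespace-separated words as tokens, which is an assumption made only to keep the example self-contained, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text, chunk_size=128, overlap=30):
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # crude tokenization: whitespace-separated words
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(300))
chunks = chunk_tokens(sample)
print(len(chunks))  # 3
```

With 300 tokens, a size of 128 and an overlap of 30, each window starts 98 tokens after the previous one, so the last 30 tokens of one chunk reappear at the start of the next. Larger overlaps preserve more context across chunk boundaries at the cost of more chunks to embed.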

### 5.1 - Execute 

In [8]:
%%time

from dpk_doc_chunk.transform_python import DocChunk

STAGE = 3
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_chunk_dir}'\n", flush=True)

result = DocChunk(input_folder=output_exact_dedupe_dir,
        output_folder=output_chunk_dir,
        doc_chunk_chunking_type= "li_markdown",
        # doc_chunk_chunking_type= "dl_json",
        doc_chunk_chunk_size_tokens = 128,  # default 128
        doc_chunk_chunk_overlap_tokens=30   # default 30
        ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-3: Processing input='output/02_dedupe_out' --> output='output/03_chunk_out'



12:23:01 INFO - doc_chunk parameters are : {'chunking_type': 'li_markdown', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}
12:23:01 INFO - pipeline id pipeline_id
12:23:01 INFO - code location None
12:23:01 INFO - data factory data_ is using local data access: input_folder - output/02_dedupe_out output_folder - output/03_chunk_out
12:23:01 INFO - data factory data_ max_files -1, n_sample -1
12:23:01 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:23:01 INFO - orchestrator doc_chunk started at 2025-02-19 12:23:01
12:23:01 INFO - Number of files is 3,

✅ Stage:3 completed successfully
CPU times: user 980 ms, sys: 138 ms, total: 1.12 s
Wall time: 1.43 s


### 5.2 - Inspect Generated output

We should see that the documents are split into many chunks.

In [9]:
from utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(output_exact_dedupe_dir)  ## for debug purposes
output_df = read_parquet_files_as_df(output_chunk_dir)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(min(3, output_df.shape[0]))

Files processed : 2
Chunks created : 60


| | filename | num_pages | num_tables | num_doc_elements | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename | removed | source_document_id | contents | document_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | attention.pdf | 15 | 6 | 147 | 2949302674760005271 | pdf | f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... | 46040 | 2025-02-19T12:22:02.413652 | 14.201284 | attention.pdf | [] | 48c14faf-6f1d-4edc-8c29-2cf36a5bbdcc | ## 5.1 Training Data and Batching\n\nWe traine... | 7b0b35235086ad981660e6b105f5a16542a6bee2c70b6d... |
| 34 | granite.pdf | 28 | 19 | 295 | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:31.132946 | 28.668649 | granite.pdf | [] | 0dc95fd2-332c-4711-aa81-9fd529fcd3fc | ## 2.4 Natural Language Datasets\n\nIn additio... | 1289d564d558c095b4d4ba62c845f33bf6c666ea8d9a13... |
| 42 | granite.pdf | 28 | 19 | 295 | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:31.132946 | 28.668649 | granite.pdf | [] | 0dc95fd2-332c-4711-aa81-9fd529fcd3fc | ## 6 Evaluation\n\nWe evaluate Granite Code mo... | cc0b25f87c100d4e2009c7f4bc688030446590aff7c692... |


## Step-6:   Calculate Embeddings for Chunks

We will calculate embeddings for each chunk using an open source embedding model.

[Embeddings / Text Encoder documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/README.md)
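
Embeddings matter for RAG because chunks with similar meaning end up with nearby vectors, so a query vector can be matched against chunk vectors. The sketch below illustrates that comparison with cosine similarity in plain Python. The three-dimensional vectors and chunk labels are made up for illustration; real embeddings from the model have many more dimensions, and a vector database would normally do this search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for chunk embeddings.
chunk_vectors = {
    "chunk about training hardware": [0.9, 0.1, 0.0],
    "chunk about evaluation": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]  # toy embedding of a user question

best_match = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
)
print(best_match)  # chunk about training hardware
```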

### 6.1 - Execute

In [10]:
%%time 

from dpk_text_encoder.transform_python import TextEncoder

STAGE  = 4
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_chunk_dir}' --> output='{output_embeddings_dir}'\n", flush=True)


result = TextEncoder(input_folder= output_chunk_dir, 
               output_folder= output_embeddings_dir, 
               text_encoder_model_name = MY_CONFIG.EMBEDDING_MODEL
               ).transform()
if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-4: Processing input='output/03_chunk_out' --> output='output/04_embeddings_out'



12:23:02 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'ibm-granite/granite-embedding-30m-english'}
12:23:02 INFO - pipeline id pipeline_id
12:23:02 INFO - code location None
12:23:02 INFO - data factory data_ is using local data access: input_folder - output/03_chunk_out output_folder - output/04_embeddings_out
12:23:02 INFO - data factory data_ max_files -1, n_sample -1
12:23:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:23:02 INFO - orchestrator text_encoder started at 2025-02-19 12:23:02
12:23:02 INFO - Number of files is 2, source profile {'max_file_size': 0.04669189453125, 'min_file_size': 0.02893352508544922, 'total_file_size': 0.07562541961669922}


12:23:15 INFO - Completed 1 files (50.0%) in 0.003 min
12:23:15 INFO - Completed 2 files (100.0%) in 0.008 min
12:23:15 INFO - Done processing 2 files, waiting for flush() completion.
12:23:15 INFO - done flushing in 0.0 sec
12:23:15 INFO - Completed execution in 0.218 min, execution result 0


✅ Stage:4 completed successfully
CPU times: user 2.14 s, sys: 784 ms, total: 2.93 s
Wall time: 14.2 s


### 6.2 - Inspect Generated output

In [11]:
from utils import read_parquet_files_as_df

input_df = read_parquet_files_as_df(output_chunk_dir)
output_df = read_parquet_files_as_df(output_embeddings_dir)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(min(3, output_df.shape[0]))

Input data dimensions (rows x columns)=  (60, 15)
Output data dimensions (rows x columns)=  (60, 16)


| | filename | num_pages | num_tables | num_doc_elements | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename | removed | source_document_id | contents | document_id | embeddings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | attention.pdf | 15 | 6 | 147 | 2949302674760005271 | pdf | f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... | 46040 | 2025-02-19T12:22:02.413652 | 14.201284 | attention.pdf | [] | 48c14faf-6f1d-4edc-8c29-2cf36a5bbdcc | ## 5.2 Hardware and Schedule\n\nWe trained our... | 1fbbe83ca7d88df5de6e0bd418dea4d4955675051315e9... | [-0.041457232, 0.021141231, -0.04439012, -0.01... |
| 26 | attention.pdf | 15 | 6 | 147 | 2949302674760005271 | pdf | f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... | 46040 | 2025-02-19T12:22:02.413652 | 14.201284 | attention.pdf | [] | 48c14faf-6f1d-4edc-8c29-2cf36a5bbdcc | ## Attention Visualizations Input-Input Layer5... | 277f4805febaea0efaaf3ffcc724f026e6e8b15782e4e6... | [-0.03068907, 0.027044704, 0.013957209, 0.0387... |
| 56 | granite.pdf | 28 | 19 | 295 | 3127757990743433032 | pdf | 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... | 127678 | 2025-02-19T12:22:31.132946 | 28.668649 | granite.pdf | [] | 0dc95fd2-332c-4711-aa81-9fd529fcd3fc | ## 7 Conclusion\n\nWe presented a family of de... | 9aa27c7a5c1d9f20caaea40601d491849a1a513fa0af83... | [0.011081366, 0.0014233455, 0.10855648, 0.0181... |


## Step-7: Copy output to final output dir

In [12]:
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
shutil.copytree(src=output_embeddings_dir, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)

print (f"✅ Copied output from '{output_embeddings_dir}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'")

✅ Copied output from 'output/04_embeddings_out' --> 'output/output_final'
