# Data Prep Kit Demo 1

[Data prep kit](https://ibm.github.io/data-prep-kit)  is an open source framework that helps with data wrangling.

[github repo](https://github.com/IBM/data-prep-kit)

This notebook will introduce DPK and showcase some of its capabilities.

## How to run this notebook

Two options:

- **Option 1 - Google Colab:** The easiest option; no setup required.  Click this badge to open the notebook in Google Colab.  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk-intro-1.ipynb)
- **Option 2 - Local Python dev environment:**  Set up your environment using this guide: [setting up python dev environment](../setup-python-dev-env.md)

The notebook works as-is in both environments.

## Step-1: Inspect the Data

We will use simple PDFs about the solar system.  The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system)

- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)
- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)

### How to generate PDFs?

If you would like to experiment with different input files, follow these steps to regenerate the PDFs.

**Option 1: markdown -> pdf**

First edit the markdown files using any text editor.

Then use [pandoc](https://pandoc.org/) to convert them to pdfs.

```bash
pandoc  earth.md  -o earth.pdf
pandoc  mars.md  -o mars.pdf
```
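
If you have many markdown files, the same conversion can be scripted. The following is a minimal sketch, assuming `pandoc` is installed and on your `PATH`; the helper names `build_pandoc_command` and `convert_folder` are made up for this example and are not part of DPK.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(md_file: Path) -> list[str]:
    """Return the pandoc command that converts one markdown file to a PDF next to it."""
    return ["pandoc", str(md_file), "-o", str(md_file.with_suffix(".pdf"))]


def convert_folder(folder: str) -> list[Path]:
    """Convert every .md file in `folder` to PDF and return the PDF paths."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    pdf_files = []
    for md_file in sorted(Path(folder).glob("*.md")):
        # check=True raises CalledProcessError if pandoc exits with an error
        subprocess.run(build_pandoc_command(md_file), check=True)
        pdf_files.append(md_file.with_suffix(".pdf"))
    return pdf_files
```

Calling `convert_folder(".")` in the folder that holds `earth.md` and `mars.md` would run the same two commands shown above.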

**Option 2: Use a word editor or google docs editor**

Write your content and export as PDF

## Step-2: Figure out Runtime Environment

### 2.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Download Data if running on Google Colab

In [2]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input'
    !wget -O 'input/earth.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
    !wget -O 'input/mars.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/mars.pdf'
    !wget -O 'utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/utils.py'

### 2.3 - Install dependencies if running on Google Colab

In [3]:
if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  data-prep-toolkit-transforms-ray==0.2.1.dev1

### 2.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font>, so the libraries will be loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

## Step-2: Configuration

### 2.1 - Basic Config

In [4]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


In [5]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

if RUNNING_IN_COLAB:
  MY_CONFIG.INPUT_DATA_DIR = 'input'
else:
  MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')
MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")

## Embedding model
MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'

## RAY CONFIGURATION
num_cpus_available =  os.cpu_count()
# print (num_cpus_available)
# MY_CONFIG.RAY_NUM_CPUS = num_cpus_available // 2  ## use half the available cores for processing
MY_CONFIG.RAY_NUM_CPUS =  1
# print (MY_CONFIG.RAY_NUM_CPUS)
MY_CONFIG.RAY_MEMORY_GB = 2  # GB
# MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3
MY_CONFIG.RAY_RUNTIME_WORKERS = 1

In [6]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 2.2 - Setup input/output directories

In [7]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

STAGE = 0

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: pdf2parquet -  Convert data from PDF to Parquet

This step reads all PDF files in the input folder and ingests them into a parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted into a JSON format, which makes them easy to chunk in later steps.



### 3.1 - Set Input/output Folder

In [8]:
STAGE  += 1
# STAGE = 1  ## DEBUG

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_parquet_out")

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'


### 3.2 - Execute

In [9]:
%%time

import ast
import os
import sys

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing_ray.runtime.ray import RayTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration

from data_processing.utils import GB, ParamsUtils


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS, "memory": MY_CONFIG.RAY_MEMORY_GB * GB}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")


00:21:47 INFO - Running locally
00:21:47 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
00:21:47 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out
00:21:47 INFO - data factory data_ max_files -1, n_sample -1
00:21:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
00:21:47 INFO - pipeline id pipeline_id
00:21:47 INFO - code location None
00:21:47 INFO - number of workers 1 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
00:21:47 INFO - actor creation delay 0
00:21:47 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': '

✅ Stage:1 completed successfully
CPU times: user 4.01 s, sys: 1.58 s, total: 5.59 s
Wall time: 43.5 s


### 3.3 - Inspect Generated output

Here we should see one entry per input file processed.

In [10]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (2, 12)


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename
0,mars.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf
1,earth.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf



### 3.4 - Understand the output

Here are some interesting attributes to note:

- **filename** : original filename
- **contents** : document content, stored as a JSON string
- **document_id**: unique id (UUID) assigned to this document
- **hash** : hash of document
- **pdf_convert_time** : time to convert this pdf in seconds

Let's inspect the **contents** column.  See how the text is being divided up!

In [11]:
import pprint
import json

pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))
# json.loads(output_df.iloc[0, ]['contents'])

{'_name': '',
 'description': {'logs': []},
 'equations': [],
 'figures': [],
 'file-info': {'#-pages': 1,
               'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',
               'filename': 'mars.pdf',
               'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',
                                'model': 'default',
                                'page': 1}]},
 'footnotes': [],
 'main-text': [{'name': 'Section-header',
                'prov': [{'bbox': [133.35, 657.11, 169.63, 667.98],
                          'page': 1,
                          'span': [0, 4]}],
                'text': 'Mars',
                'type': 'subtitle-level-1'},
               {'name': 'Section-header',
                'prov': [{'bbox': [133.1, 630.68, 210.45, 642.34],
                          'page': 1,
                          'span': [0, 12]}],
                'text': 'Solar System',
                'type': 'subtit

In [12]:
pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))

{'_name': '',
 'description': {'logs': []},
 'equations': [],
 'figures': [],
 'file-info': {'#-pages': 1,
               'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',
               'filename': 'earth.pdf',
               'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',
                                'model': 'default',
                                'page': 1}]},
 'footnotes': [],
 'main-text': [{'name': 'Section-header',
                'prov': [{'bbox': [133.31, 656.9, 173.65, 667.93],
                          'page': 1,
                          'span': [0, 5]}],
                'text': 'Earth',
                'type': 'subtitle-level-1'},
               {'name': 'Section-header',
                'prov': [{'bbox': [133.13, 630.69, 210.44, 642.28],
                          'page': 1,
                          'span': [0, 12]}],
                'text': 'Solar System',
                'type': 'subt
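
To get a quick overview of such a document without reading the full dump, you can walk the `main-text` list and print one line per element. This is a small sketch over a hand-written sample with the same shape as the output above; the first entry is copied from the `mars.pdf` output, while the second entry (its `name`, `type`, and `text` values) is an assumption made for illustration. The function name `summarize_main_text` is also made up for this example.

```python
# Hand-written sample shaped like the pprint output above.
sample_doc = {
    "file-info": {"filename": "mars.pdf", "#-pages": 1},
    "main-text": [
        {   # copied from the mars.pdf output
            "name": "Section-header",
            "prov": [{"bbox": [133.35, 657.11, 169.63, 667.98], "page": 1, "span": [0, 4]}],
            "text": "Mars",
            "type": "subtitle-level-1",
        },
        {   # hypothetical entry, values assumed for illustration
            "name": "Text",
            "prov": [{"page": 1}],
            "text": "Mars is the fourth planet from the Sun.",
            "type": "paragraph",
        },
    ],
}


def summarize_main_text(doc: dict) -> list[tuple]:
    """Return (page, type, text) for each element in the document's 'main-text' list."""
    rows = []
    for element in doc.get("main-text", []):
        prov = element.get("prov") or [{}]  # tolerate elements without provenance
        rows.append((prov[0].get("page"), element.get("type"), element.get("text")))
    return rows


for page, kind, text in summarize_main_text(sample_doc):
    print(f"page={page}  type={str(kind):<18} text={text!r}")
```

With the real data you would pass `json.loads(output_df.iloc[0]['contents'])` instead of `sample_doc`.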

##  Step-4: Doc chunks

In the previous step, we extracted text from our PDFs.  But the content of each file is still stored as a single row in our parquet output.

In this step, we are going to split the documents in chunks, according to their layout segmentation.

This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`
to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.
It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: "application/json"`,
which provides the required JSON structure.
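
To build some intuition for layout-based chunking, here is a deliberately simplified sketch. It is not the `HierarchicalChunker` implementation: it assumes a flat list of elements with a hypothetical `kind` field, and simply prefixes each body element with the most recent heading.

```python
def chunk_by_layout(elements: list[dict]) -> list[str]:
    """Turn each body element into one chunk, prefixed with the latest heading seen."""
    chunks = []
    current_heading = None
    for element in elements:
        if element["kind"] == "heading":
            # Headings are not chunks themselves; they label what follows
            current_heading = element["text"]
            continue
        prefix = f"{current_heading}\n" if current_heading else ""
        chunks.append(prefix + element["text"])
    return chunks


# The 'kind' field is an assumption of this sketch, not the Docling schema.
elements = [
    {"kind": "heading", "text": "Solar System"},
    {"kind": "paragraph", "text": "For more details about the Solar system see Chapter 1."},
    {"kind": "heading", "text": "Mars"},
    {"kind": "paragraph", "text": "Mars, the fourth planet from the Sun, is a cold, desert world."},
]

for chunk in chunk_by_layout(elements):
    print(chunk, end="\n-------\n")
```

The real chunker handles nested headings, lists, and tables; the sketch only shows why a chunk can carry its section title along with the paragraph text.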

### 4.1 - Set Input/output Folder

In [13]:
STAGE  += 1
# STAGE = 2  ## DEBUG

input_folder = output_folder # previous output folder is the input folder for the current stage
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_chunk_out")

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'


### 4.2 - Execute

In [14]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_chunk_transform_ray import DocChunkRayTransformConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc_chunk arguments
    # ...
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/sujee/apps/anaconda3/envs/data-prep-
[nltk_data]     kit-3-py311/lib/python3.11/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!
00:22:27 INFO - Running locally
00:22:27 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}
00:22:27 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out
00:22:27 INFO - data factory data_ max_files -1, n_sample -1
00:22:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:22:27 INFO - pipeline id pi

✅ Stage:2 completed successfully
CPU times: user 1.33 s, sys: 995 ms, total: 2.32 s
Wall time: 18.9 s


### 4.3 - Inspect Generated output

We should see that the documents are split into many chunks.

In [15]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 2
Chunks created : 8
Input data dimensions (rows x columns)=  (2, 12)
Output data dimensions (rows x columns)=  (8, 15)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename,contents,doc_jsonpath,page_number,bbox
0,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.85, 590.24, 478.57, 623.03]"
1,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.19, 570.83, 374.24, 581.07]"
2,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87, 500.92, 477.3, 534.56]"
3,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Basic facts about Mars:\n· Distance from the S...,$.main-text[7],1,"[147.73, 452.7, 476.98, 474.97]"
4,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87, 590.26, 478.57, 623.05]"
5,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.21, 570.82, 374.81, 581.08]"
6,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91, 512.46, 477.41, 534.48]"
7,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\n· Distance from the Sun: Average of 149...,$.main-text[7],1,"[147.68, 464.66, 477.03, 487.03]"


### 4.4 - Understanding the Output

Here we see 2 PDF files are split into 8 chunks (4 per file).  Basically, the documents are being split along 'natural boundaries': paragraphs and bullet points.

See how **document_id** is carried throughout.  This helps us identify original documents.

Also note **contents** is now plain text (not JSON as before)
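
Because every chunk keeps its **document_id**, chunks can be traced back to their source document at any later stage. A minimal sketch using plain dictionaries (the rows below copy only two columns from the table above, and `group_chunks_by_document` is a name made up for this example):

```python
from collections import defaultdict

# A few rows shaped like the chunk table above (most columns omitted).
rows = [
    {"document_id": "26d247d6-60f7-42a5-9be4-3524391010dd", "doc_jsonpath": "$.main-text[2]"},
    {"document_id": "26d247d6-60f7-42a5-9be4-3524391010dd", "doc_jsonpath": "$.main-text[3]"},
    {"document_id": "347b47c6-7172-4e23-be0c-89581cf0d33d", "doc_jsonpath": "$.main-text[2]"},
]


def group_chunks_by_document(rows: list[dict]) -> dict[str, list[str]]:
    """Map each document_id to the JSON paths of the chunks that came from it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["document_id"]].append(row["doc_jsonpath"])
    return dict(grouped)


for doc_id, paths in group_chunks_by_document(rows).items():
    print(doc_id, "->", paths)
```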

In [16]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nOur solar system is a vast and f...
1,mars.pdf,Solar System\nFor more details about the Solar...
2,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
3,mars.pdf,Basic facts about Mars:\n· Distance from the S...
4,earth.pdf,Solar System\nOur solar system is a vast and f...
5,earth.pdf,Solar System\nFor more details about our Solar...
6,earth.pdf,Earth\nEarth is the third planet from the Sun....
7,earth.pdf,Earth\n· Distance from the Sun: Average of 149...


In [17]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 2------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 3------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celes

## Step-5: Exact Dedup



### 5.1 - Set Input/output Folder

In [18]:
STAGE  += 1
# STAGE  = 3  ## DEBUG

input_folder = output_folder # previous output folder is the input folder for the current stage
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_ededupe_out")

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_ededupe_out'


### 5.2 - Execute

In [19]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from ededup_transform_ray import EdedupRayTransformConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # ededup parameters
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

00:22:45 INFO - Running locally
00:22:45 INFO - exact dedup params are {'doc_column': 'contents', 'hash_cpu': 0.5, 'num_hashes': 2}
00:22:45 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_ededupe_out
00:22:45 INFO - data factory data_ max_files -1, n_sample -1
00:22:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:22:45 INFO - pipeline id pipeline_id
00:22:45 INFO - code location None
00:22:45 INFO - number of workers 1 worker options {'num_cpus': 1, 'max_restarts': -1}
00:22:45 INFO - actor creation delay 0
00:22:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
2024-09-09 00:22:46,937	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(orchestrate pid=1545148) 0

✅ Stage:3 completed successfully
CPU times: user 126 ms, sys: 151 ms, total: 278 ms
Wall time: 14.9 s


### 5.3 - Inspect Generated output

In [20]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input chunks before exact dedupe : {input_df.shape[0]:,}")
print (f"Output chunks after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate chunks removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 15)
Output data dimensions (rows x columns)=  (7, 15)
Input chunks before exact dedupe : 8
Output chunks after exact dedupe : 7
Duplicate chunks removed :   1


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename,contents,doc_jsonpath,page_number,bbox
0,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.19, 570.83, 374.24, 581.07]"
1,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87, 500.92, 477.3, 534.56]"
2,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Basic facts about Mars:\n· Distance from the S...,$.main-text[7],1,"[147.73, 452.7, 476.98, 474.97]"
3,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87, 590.26, 478.57, 623.05]"
4,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.21, 570.82, 374.81, 581.08]"
5,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91, 512.46, 477.41, 534.48]"
6,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\n· Distance from the Sun: Average of 149...,$.main-text[7],1,"[147.68, 464.66, 477.03, 487.03]"


In [21]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nFor more details about the Solar...
1,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
2,mars.pdf,Basic facts about Mars:\n· Distance from the S...
3,earth.pdf,Solar System\nOur solar system is a vast and f...
4,earth.pdf,Solar System\nFor more details about our Solar...
5,earth.pdf,Earth\nEarth is the third planet from the Sun....
6,earth.pdf,Earth\n· Distance from the Sun: Average of 149...


In [22]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 1------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 2------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 2------
Earth
Earth is the third planet from the Sun. It's our home plan

### 5.4 - Understanding the output

Remember we had 8 chunks initially.  Now we have 7!  One duplicate chunk is removed.

If you look at the PDFs, the following paragraph, common to `earth.pdf` and `mars.pdf`, has been removed from one of the documents (in this run, from `mars.pdf`)!  Pretty neat, eh!

```text
## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
```
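
The idea behind exact dedup can be sketched in a few lines: hash each chunk and drop any chunk whose hash has already been seen. This is not the DPK `ededup` implementation; the use of SHA-256 and the keep-first policy are assumptions of this sketch (in the run above, the copy in `earth.pdf` happened to survive).

```python
import hashlib


def exact_dedup(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing chunks by SHA-256 digest."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # byte-for-byte identical to an earlier chunk
        seen.add(digest)
        unique.append(chunk)
    return unique


shared = "Solar System\nOur solar system is a vast and fascinating expanse."
chunks = [
    shared,
    "Solar System\nFor more details about the Solar system see Chapter 1.",
    shared,  # exact duplicate, will be dropped
    "Solar System\nFor more details about our Solar system see Chapter 1.",
]

deduped = exact_dedup(chunks)
print(f"{len(chunks)} chunks in, {len(deduped)} chunks out")
```

Note that the two "For more details..." chunks differ by a single word ("the" vs "our"), so exact dedup keeps both. Catching near-duplicates like these is what fuzzy dedup is for.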

## Step-6:  DOC ID generation

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.
 - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.

**This is a pre-requisite for fuzzy dedup** in the pipeline.
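
To make the two annotations concrete, here is a minimal sketch of what they compute. The hash formula is the one quoted above; assigning integer ids sequentially from a `start_id` is an assumption of this sketch (the actual transform coordinates ids across tables, so the values you see in the output may follow a different order). The function name `add_doc_ids` is made up for this example.

```python
import hashlib


def add_doc_ids(docs: list[str], start_id: int = 0) -> list[dict]:
    """Attach a content hash and a sequential integer id to each document."""
    rows = []
    for offset, doc in enumerate(docs):
        rows.append({
            "contents": doc,
            # Same formula as described above
            "hash_column": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
            # Assumption: ids are simply consecutive integers
            "int_id_column": start_id + offset,
        })
    return rows


for row in add_doc_ids(["Mars", "Earth"], start_id=1):
    print(row["int_id_column"], row["hash_column"][:16], row["contents"])
```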

### 6.1 - Set Input/output Folder

In [23]:

# Input for this stage is the output of the exact dedup component.
# The output of this component makes it possible for the fdedup component to run on the data.

STAGE  += 1
# STAGE  = 4  ## DEBUG

input_folder = output_folder # previous output folder is the input folder for the current stage
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_doc_id_out")

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_ededupe_out' --> output='output/04_doc_id_out'


### 6.2 - Execute

In [24]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_id_transform_ray import DocIDRayTransformConfiguration
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "hash_column",
    "doc_id_int_column": "int_id_column",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(DocIDRayTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

00:23:00 INFO - Running locally
00:23:00 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'hash_column', 'int_column': 'int_id_column'}
00:23:00 INFO - data factory data_ is using local data access: input_folder - output/03_ededupe_out output_folder - output/04_doc_id_out
00:23:00 INFO - data factory data_ max_files -1, n_sample -1
00:23:00 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:23:00 INFO - pipeline id pipeline_id
00:23:00 INFO - code location None
00:23:00 INFO - number of workers 1 worker options {'num_cpus': 1, 'max_restarts': -1}
00:23:00 INFO - actor creation delay 0
00:23:00 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
2024-09-09 00:23:01,907	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(o

✅ Stage:4 completed successfully
CPU times: user 140 ms, sys: 145 ms, total: 285 ms
Wall time: 15.1 s


### 6.3 - Inspect Generated output

You will notice we have two extra columns

- **hash_column**
- **int_id_column**

But still the same number of rows as before.

In [25]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Input data dimensions (rows x columns)=  (7, 15)
Output data dimensions (rows x columns)=  (7, 17)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename,contents,doc_jsonpath,page_number,bbox,hash_column,int_id_column
0,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.19, 570.83, 374.24, 581.07]",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,5
1,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87, 500.92, 477.3, 534.56]",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6
2,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Basic facts about Mars:\n· Distance from the S...,$.main-text[7],1,"[147.73, 452.7, 476.98, 474.97]",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7
3,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87, 590.26, 478.57, 623.05]",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,1
4,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.21, 570.82, 374.81, 581.08]",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,2
5,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91, 512.46, 477.41, 534.48]",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,3
6,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\n· Distance from the Sun: Average of 149...,$.main-text[7],1,"[147.68, 464.66, 477.03, 487.03]",0337be5c940b3df316e391ae3dc20379442257498812e4...,4
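To make the two new columns more concrete, here is a minimal sketch of how a content hash and an integer id could be attached to each chunk. It is not the DPK implementation: the `add_doc_ids` helper is hypothetical, SHA-256 is an assumed hash function (the values in the table above are truncated, so the actual algorithm cannot be confirmed from them), and the sequential numbering is only one possible way to hand out unique integers.

```python
import hashlib


def add_doc_ids(chunks, start_id=1):
    """Attach a content hash and a unique integer id to each text chunk.

    Hypothetical helper for illustration only; SHA-256 and sequential
    numbering are assumptions, not confirmed DPK behavior.
    """
    rows = []
    for offset, text in enumerate(chunks):
        rows.append({
            "contents": text,
            # identical text always produces the identical hash
            "hash_column": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            # every row gets its own integer id
            "int_id_column": start_id + offset,
        })
    return rows


rows = add_doc_ids([
    "Earth is the third planet from the Sun.",
    "Mars, the fourth planet from the Sun, is a cold, desert world.",
])
for row in rows:
    print(row["int_id_column"], row["hash_column"][:16], row["contents"][:30])
```

The hash depends only on the text, so it can be used to spot identical chunks, while the integer id gives each row a compact key that later stages can refer to.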


## Step-7: Fuzzy Dedup

After exact deduplication, fuzzy deduplication is applied to remove content that differs only by **slight variations**, which further reduces bias in the data.

Small variations are common in code data, for example changed variable values or added logging statements. Here we apply the same idea to text chunks.
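Fuzzy deduplication is commonly built on MinHash signatures, which estimate how much two sets of shingles overlap without comparing them element by element. The sketch below illustrates that general technique using only the standard library. It is an assumption-laden illustration, not the DPK code: the helper names are hypothetical, and salted `blake2b` hashes stand in for the random permutations.

```python
import hashlib


def word_shingles(text, size=5, delimiter=" "):
    """Split text on the delimiter and return the set of overlapping word shingles."""
    words = text.split(delimiter)
    if len(words) <= size:
        return {delimiter.join(words)}
    return {delimiter.join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(shingles, num_permutations=64):
    """For each seeded hash function, keep the smallest hash value over all shingles."""
    signature = []
    for seed in range(num_permutations):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two code-like snippets that differ only by an added logging statement
original = "total = 0 for item in items : total = total + item return total"
variant = "total = 0 for item in items : log ( item ) total = total + item return total"

sig_a = minhash_signature(word_shingles(original))
sig_b = minhash_signature(word_shingles(variant))
print(f"Estimated similarity: {estimate_similarity(sig_a, sig_b):.2f}")
```

More hash functions give a steadier estimate at the cost of more computation, which is the trade-off a setting like the number of permutations controls.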

### 7.1 - Set Input/output Folder

In [26]:
## Input to this component is the output of doc_id generator component.

STAGE  += 1
# STAGE  = 5  ## DEBUG

input_folder = output_folder # previous output folder is the input folder for the current stage
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_fdedupe_out")
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/04_doc_id_out' --> output='output/05_fdedupe_out'


### 7.2 - Execute

In [27]:
%%time

import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration
from data_processing_ray.runtime.ray import RayTransformLauncher

# create parameters

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Orchestration parameters
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    # infrastructure
    "fdedup_bucket_cpu": 0.5,
    "fdedup_doc_cpu": 0.5,
    "fdedup_mhash_cpu": 0.5,
    "fdedup_num_doc_actors": 2,
    "fdedup_num_bucket_actors": 1,
    "fdedup_num_minhash_actors": 1,
    "fdedup_num_preprocessors": 2,
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.7, # (default 0.8)
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
}

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(FdedupRayTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

00:23:15 INFO - Running locally
00:23:15 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'int_id_column', 'cluster_column': 'hash_column', 'bucket_cpu': 0.5, 'mhash_cpu': 0.5, 'doc_cpu': 0.5, 'num_doc_actors': 2, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 2, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}
00:23:15 INFO - data factory data_ is using local data access: input_folder - output/04_doc_id_out output_folder - output/05_fdedupe_out
00:23:15 INFO - data factory data_ max_files -1, n_sample -1
00:23:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:23:15 INFO - pipeline id pipeline_id
00:23:15 INFO - code location None
00:23:15 INFO - number of wor

✅ Stage:5 completed successfully
CPU times: user 188 ms, sys: 183 ms, total: 370 ms
Wall time: 35.8 s


### 7.3 - Inspect Generated output

In [28]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

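# Note: input_df is not reloaded in 7.1, so it still holds the Step-6 input (15 columns); only the row counts are compared here
print ("Input data dimensions (rows x columns)= ", input_df.shape)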
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print ("Duplicate chunks removed  by fuzzy-dedupe:  ", (input_df.shape[0] - output_df.shape[0]))

output_df.head(10)

Input data dimensions (rows x columns)=  (7, 15)
Output data dimensions (rows x columns)=  (6, 17)
Duplicate chunks removed  by fuzzy-dedupe:   1


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename,contents,doc_jsonpath,page_number,bbox,int_id_column,hash_column
0,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87, 500.92, 477.3, 534.56]",6,-1
1,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Basic facts about Mars:\n· Distance from the S...,$.main-text[7],1,"[147.73, 452.7, 476.98, 474.97]",7,-1
2,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87, 590.26, 478.57, 623.05]",1,-1
3,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.21, 570.82, 374.81, 581.08]",2,5
4,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91, 512.46, 477.41, 534.48]",3,-1
5,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Earth\n· Distance from the Sun: Average of 149...,$.main-text[7],1,"[147.68, 464.66, 477.03, 487.03]",4,-1


In [29]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
1,mars.pdf,Basic facts about Mars:\n· Distance from the S...
2,earth.pdf,Solar System\nOur solar system is a vast and f...
3,earth.pdf,Solar System\nFor more details about our Solar...
4,earth.pdf,Earth\nEarth is the third planet from the Sun....
5,earth.pdf,Earth\n· Distance from the Sun: Average of 149...


In [30]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 1------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 2------
Earth
Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.
-------
-------Chunk 3------
Earth
· Distance f

### 7.4 - Understanding the output

We started with 7 rows and ended up with 6: fuzzy dedupe removed one chunk that was **very similar** to another.

The two chunks below differ only in the words 'the' and 'our'. The copy from **mars.pdf** was dropped and the copy from **earth.pdf** was kept.

**earth.pdf**

`For more details about our Solar system see Chapter 1.`

**mars.pdf**

`For more details about the Solar system see Chapter 1.`

Pretty neat, eh? 👏
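To confirm which chunk was dropped, you can compare the ids present before and after the stage. The sketch below hard-codes the `int_id_column` values shown in the Step-6 and Step-7 output tables; in the notebook these lists would come from the corresponding DataFrames, and the variable names here are only illustrative.

```python
# int_id_column values copied from the Step-6 output table (before fuzzy dedupe)
ids_before = [5, 6, 7, 1, 2, 3, 4]
# int_id_column values copied from the Step-7 output table (after fuzzy dedupe)
ids_after = [6, 7, 1, 2, 3, 4]

# ids that exist before the stage but not after it
removed_ids = sorted(set(ids_before) - set(ids_after))
print("Removed ids:", removed_ids)  # Removed ids: [5]
```

Id 5 belongs to the mars.pdf "For more details about the Solar system" chunk in the Step-6 table.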

### Configuring Fuzzy de-dupe

You can tune fuzzy dedupe by adjusting the following parameters:

```python
# fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.7, #  (default 0.8)
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
```

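In our case, we set the `fdedup_threshold` parameter to 0.7, below the default of 0.8, so chunks need to be less similar before they are treated as duplicates.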
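To build intuition for the threshold, the sketch below computes an exact Jaccard similarity over word shingles and applies a cutoff to it. This is a simplified illustration, assuming the threshold acts as a similarity cutoff. The helper names are hypothetical, and the real transform works on MinHash signatures rather than exact sets, so its decisions can differ from what this exact calculation gives.

```python
def word_shingles(text, size=5, delimiter=" "):
    """Return the set of overlapping word shingles of the given size."""
    words = text.split(delimiter)
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(set_a, set_b):
    """Shared items divided by all distinct items."""
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(text_a, text_b, threshold=0.8, shingles_size=5):
    """Treat two texts as duplicates when their shingle overlap reaches the threshold."""
    similarity = jaccard(
        word_shingles(text_a, shingles_size),
        word_shingles(text_b, shingles_size),
    )
    return similarity >= threshold


# Two 11-word sentences that differ only in the last word
text_a = "Earth is the third planet from the Sun in our system"
text_b = "Earth is the third planet from the Sun in our galaxy"

similarity = jaccard(word_shingles(text_a), word_shingles(text_b))
print(similarity)                                         # 0.75 (6 shared shingles out of 8 distinct)
print(is_near_duplicate(text_a, text_b, threshold=0.8))   # False
print(is_near_duplicate(text_a, text_b, threshold=0.7))   # True
```

The same pair is kept at a cutoff of 0.8 and flagged at 0.7, which shows how a lower threshold makes matching more permissive.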


## Step-8:   Text encoding

Encode the text of each chunk into embedding vectors for vector storage.

### 8.1 - Set Input/output Folder

In [31]:
STAGE  += 1
# STAGE  = 6 ## DEBUG

input_folder = output_folder # previous output folder is the input folder for the current stage
output_folder =  os.path.join(MY_CONFIG.OUTPUT_FOLDER, f"{STAGE:02}_encoder_out")

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-6: Processing input='output/05_fdedupe_out' --> output='output/06_encoder_out'


### 8.2 - Execute

In [32]:
%%time

from text_encoder_transform_ray import TextEncoderRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # text_encoder
    "text_encoder_model_name": MY_CONFIG.EMBEDDING_MODEL,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())
# Launch the ray actor(s) to process the input

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

00:23:51 INFO - Running locally
00:23:51 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}
00:23:51 INFO - data factory data_ is using local data access: input_folder - output/05_fdedupe_out output_folder - output/06_encoder_out
00:23:51 INFO - data factory data_ max_files -1, n_sample -1
00:23:51 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:23:51 INFO - pipeline id pipeline_id
00:23:51 INFO - code location None
00:23:51 INFO - number of workers 1 worker options {'num_cpus': 1, 'max_restarts': -1}
00:23:51 INFO - actor creation delay 0
00:23:51 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}
2024-09-09 00:23:53,548	INFO worker.py:1744 -- Started a local Ray instance. View th

✅ Stage:6 completed successfully
CPU times: user 582 ms, sys: 257 ms, total: 839 ms
Wall time: 21.3 s


### 8.3 - Inspect Generated output

You will see a column called `embeddings` added at the end. It holds the text content converted into vectors (embeddings). We used the model `sentence-transformers/all-MiniLM-L6-v2`.

In [33]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(3)

Input data dimensions (rows x columns)=  (6, 17)
Output data dimensions (rows x columns)=  (6, 18)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename,contents,doc_jsonpath,page_number,bbox,int_id_column,hash_column,embeddings
3,earth.pdf,1,0,11,347b47c6-7172-4e23-be0c-89581cf0d33d,pdf,0260c5d12de8d1ce27a6c096f7983336b1ded574ad3ab8...,2407,2024-09-09T00:22:12.874147,1.662447,earth.pdf,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.21, 570.82, 374.81, 581.08]",2,5,"[-0.062105577, -0.0053322953, 0.03127779, 0.04..."
1,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,Basic facts about Mars:\n· Distance from the S...,$.main-text[7],1,"[147.73, 452.7, 476.98, 474.97]",7,-1,"[0.1059802, 0.025460616, 0.02362733, 0.0390564..."
0,mars.pdf,1,0,11,26d247d6-60f7-42a5-9be4-3524391010dd,pdf,6dfaeed444efc80bf32e0c67308a57c38cc8d27a6405ec...,2531,2024-09-09T00:22:14.490039,1.602514,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87, 500.92, 477.3, 534.56]",6,-1,"[0.07728298, 0.024971062, -0.04318075, 0.05809..."
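Embeddings are useful because similar texts end up with vectors that point in similar directions, so a vector store can rank chunks by cosine similarity to a query vector. The sketch below shows that ranking step with the standard library only. The 3-dimensional vectors are made-up toy values (the real model produces 384-dimensional vectors), and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_chunks(query_vector, chunk_vectors):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in chunk_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real embeddings come from the encoder model
chunk_vectors = {
    "mars": [0.9, 0.1, 0.0],
    "earth": [0.1, 0.9, 0.0],
}
query_vector = [1.0, 0.2, 0.0]

ranked = rank_chunks(query_vector, chunk_vectors)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In a real pipeline the query would be encoded with the same model as the chunks, since vectors from different models are not comparable.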


## Step-9: Copy output to final output dir

In [34]:
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)

print (f"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'")

✅ Copied output from 'output/06_encoder_out' --> 'output/output_final'
