# Processing PDFs using Data Prep Kit (Ray version)

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb)

This notebook introduces Data Prep Kit (DPK) and showcases some of its capabilities.

Here is the workflow:

- pdf2parquet: extract text from PDF documents
- docid: compute hashes
- exact dedupe: filter out identical documents
- fuzzy dedupe: filter out 'near duplicates'
- document quality: score documents for quality

![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev//examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png)


## How to run this notebook

Two options:

- **Option 1 - Google Colab:** The easiest option; no setup required. Click this badge to open the notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb)
- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)

The notebook works in both environments.

## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Install dependencies if running on Google Colab

In [2]:
%%capture

if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        data-prep-toolkit-transforms[ray,all]==1.0.0 \
        humanfriendly

### 1.3 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so the libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

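## Step-2: Configuration & Utils

### 2.1 - Determine runtime (again)

Restarting the runtime clears all variables, so we detect the environment again.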

In [3]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Setup input/output directories

In [4]:
import os, sys
import shutil

if RUNNING_IN_COLAB:
    input_dir = "input"
    os.makedirs(input_dir, exist_ok=True)
else:
    input_dir = "../../data-files/pdf-processing-1/"

output_dir = "output"

output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')
output_docid_dir = os.path.join (output_dir, '02_docid_out')
output_exact_dedupe_dir = os.path.join (output_dir, '03_exact_dedupe_out')
output_fuzzy_dedupe_dir = os.path.join (output_dir, '04_fuzzy_dedupe_out')
output_doc_quality_dir = os.path.join (output_dir, '05_doc_quality_out')
output_final_dir = os.path.join (output_dir, 'output_final')

## clear output folder
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)
print ("✅ Cleared output directory")

✅ Cleared output directory


### 2.3 - Runtime Configuration

In [5]:
from data_processing.utils import GB

CONFIG_RAY_NUM_CPUS = 1 # CPUs per worker
CONFIG_RAY_MEMORY = 2 * GB  # memory per worker
CONFIG_RAY_RUNTIME_WORKERS = 2
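
As a rough sanity check, the total resources requested from Ray can be estimated from these three settings. This is a minimal sketch, assuming `GB` is 1024**3 bytes (consistent with the `'memory': 2147483648` worker option logged by the later stages). The helper name `estimate_ray_footprint` is hypothetical and not part of DPK.

```python
# Assumption: GB mirrors data_processing.utils.GB as 1024**3 bytes,
# which matches 'memory': 2147483648 for 2 * GB in the stage logs.
GB = 1024 ** 3

def estimate_ray_footprint(num_cpus, memory_bytes, num_workers):
    """Hypothetical helper: total CPUs and memory requested by all workers."""
    return num_cpus * num_workers, memory_bytes * num_workers

total_cpus, total_memory = estimate_ray_footprint(1, 2 * GB, 2)
print(f"Workers request {total_cpus} CPUs and {total_memory / GB:.0f} GB in total")
```

If the machine has less than this available, lowering the worker count or the per-worker memory is the first thing to try.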

### 2.4 - Handy Utils

In [6]:
import os
import requests
from humanfriendly import format_size
import pandas as pd
import glob

## Reads parquet files in a folder into a pandas dataframe
def read_parquet_files_as_df (parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')
    # read each parquet file into a DataFrame and store in a list
    dfs = [pd.read_parquet (f) for f in parquet_files]
    dfs = [df for df in dfs if not df.empty]  # filter out empty dataframes
    # Concatenate all DataFrames into a single DataFrame
    if len(dfs) > 0:
        data_df = pd.concat(dfs, ignore_index=True)
        return data_df
    else:
        return pd.DataFrame() # return empty df
# ------------


def download_file(url, local_file, chunk_size=1024*1024):
    """
    Downloads a remote URL to a local file.

    Args:
        url (str): The remote URL.
        local_file (str): The path of the local file to save the downloaded content.
        chunk_size (int): The size in bytes of each chunk. Defaults to 1 MiB (1024*1024).

    Returns:
        None

    Example usage:
        download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024)  # Download in chunks of 1MB
    """
    # Check if the local file already exists
    if os.path.exists(local_file):
        file_size = format_size(os.path.getsize(local_file))
        print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.")
        return

    # Create the directory if it doesn't exist (skip when the path has no directory part)
    local_dir = os.path.dirname(local_file)
    if local_dir:
        os.makedirs(local_dir, exist_ok=True)

    # Stream the file download
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
        print()
        file_size = format_size(os.path.getsize(local_file))
        print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------



## Step-3: Inspect the Data

We will use simple PDFs.  The files are [here](https://github.com/IBM/data-prep-kit/tree/dev/examples/data-files/pdf-processing-1/)

- [earth.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf) and exact duplicate [earth-copy.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf)
- [earth2.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf) - nearly identical to earth.pdf (ONE word difference!)
- [mars.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf)
- [spam.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf) - contains spammy content
- [lorem-ipsum.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf) - contains 'lorem ipsum' placeholder text


### 3.1 - Download Data

In [None]:
if RUNNING_IN_COLAB:

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf', os.path.join(input_dir, 'earth.pdf'))

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf', os.path.join(input_dir, 'earth-copy.pdf'))

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf', os.path.join(input_dir, 'earth2.pdf'))

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf', os.path.join(input_dir, 'mars.pdf'))

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf', os.path.join(input_dir, 'spam.pdf'))

    download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf', os.path.join(input_dir, 'lorem-ipsum.pdf'))
else:
    print ('Using input files from : ', input_dir)

Using input files from :  ../../data-files/pdf-processing-1/


## Step-4: Extract Data from PDF (pdf2parquet)

In this step, we read the PDF files and extract their text.

[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)

We use the [Docling package](https://github.com/DS4SD/docling).


### 4.1 - Execute

In [8]:
from dpk_pdf2parquet.ray.transform import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

STAGE = 1
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_dir}' --> output='{output_pdf2pq_dir}'\n", flush=True)


result =  Pdf2Parquet(input_folder= input_dir,
                    output_folder= output_pdf2pq_dir,
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    
                    # runtime config
                    run_locally= True,
                    num_cpus= CONFIG_RAY_NUM_CPUS,
                    memory= CONFIG_RAY_MEMORY,
                    runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-1: Processing input='../../data-files/pdf-processing-1/' --> output='output/01_pdf2pq_out'



14:19:07 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
14:19:07 INFO - pipeline id pipeline_id
14:19:07 INFO - code location None
14:19:07 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
14:19:07 INFO - actor creation delay 0
14:19:07 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
14:19:07 INFO - data factory data_ is using local data access: input_folder - ../../data-files/pdf-processing-1/ output_folder - output/01_pdf2pq_out
14:19:07 INFO - data factory data_ max_files -1, n_sample -1
14:19:07 INFO - data factory data_ Not using data set

✅ Stage:1 completed successfully


### 4.2 - Inspect Generated output

Here we should see one entry per input file processed.

In [9]:
print ("Displaying contents of : ", output_pdf2pq_dir)
output_df = read_parquet_files_as_df(output_pdf2pq_dir)
# print ("Output dimensions (rows x columns)= ", output_df.shape)
output_df.head(10)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Displaying contents of :  output/01_pdf2pq_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,pdf_convert_time,source_filename
0,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,8dc8970e-215a-44fe-a7bf-946c03f36c60,6571294142213095721,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-02-06T14:19:29.408910,1.912304,lorem-ipsum.pdf
1,spam.pdf,Free xxx,1,0,2,9ac78463-b325-406b-891e-c9e84722eb34,10026122586747302274,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-02-06T14:19:30.986464,1.573836,spam.pdf
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,2025-02-06T14:19:29.335271,1.850426,earth2.pdf
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,2025-02-06T14:19:30.950673,1.6122,mars.pdf
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.470409,2.071769,earth-copy.pdf
5,earth.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,18d940f3-f4b4-46ac-9147-077675aead1d,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.492574,2.093768,earth.pdf



### 4.3 - Understand the output

Here are some interesting attributes to note:

- **filename**: original filename
- **contents**: extracted text
- **document_id**: unique ID (UUID) assigned to this document
- **document_hash**: hash of the document
- **hash**: hash of the `contents` column
- **pdf_convert_time**: time taken to convert this PDF, in seconds

**Note: the hash values are identical for the duplicate documents (earth.pdf and earth-copy.pdf).**

Let's inspect the **contents** column.

In [10]:
print (output_df[output_df['filename'] == 'earth.pdf'].iloc[0,]['contents'])

## Earth

## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, ﬁve dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.

For more details about our Solar system see Chapter 1.

## Earth

Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.

Basic facts about Earth:

- · Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
- · Moons: One moon, called Luna or simply "the Moon".
- · Rotation Period: 24 hours (one day)


In [11]:
print (output_df[output_df['filename'] == 'spam.pdf'].iloc[0,]['contents'])


Free xxx


In [12]:
print (output_df[output_df['filename'] == 'lorem-ipsum.pdf'].iloc[0,]['contents'])

Lorem ipsum Lorem ipsum Lorem ipsum


## Step-5:  Create DOC ID for Documents

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding document hash: adds a hash-based document id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.
 - Adding integer document id: adds an integer document id that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column where you want to store it.
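
The two annotations can be illustrated with the standard library alone. This is a minimal sketch, not the transform's implementation: the hash call is the one quoted above, while the `annotate` helper, the sample texts, and the sequential numbering from `start_id` are illustrative assumptions.

```python
import hashlib

def annotate(docs, start_id=0):
    """Hypothetical helper: attach a content hash and an integer id to each text."""
    rows = []
    for offset, doc in enumerate(docs):
        rows.append({
            "contents": doc,
            # Same expression as quoted in the description above
            "doc_hash": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
            # Assumption: ids are handed out sequentially from start_id
            "int_id_column": start_id + offset,
        })
    return rows

rows = annotate([
    "Our solar system is vast.",
    "Our solar system is vast.",   # exact copy: same hash
    "The solar system is vast.",   # one word differs: different hash
])
for row in rows:
    print(row["int_id_column"], row["doc_hash"][:16], row["contents"])
```

Identical texts always produce the same hash, while a single changed word produces an unrelated one. That is why a content hash can detect exact copies but not near duplicates.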

**This step is a prerequisite for fuzzy dedupe** in the pipeline.

[DocID documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/doc_id)

### 5.1 - Execute

In [13]:
%%time

from dpk_doc_id.ray.transform import DocID

STAGE = 2
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_pdf2pq_dir}' --> output='{output_docid_dir}'\n", flush=True)

result = DocID(input_folder= output_pdf2pq_dir,
                output_folder= output_docid_dir,
                doc_id_doc_column= "contents",
                doc_id_hash_column= "doc_hash",
                # doc_id_int_column= "doc_id_int",
                doc_id_int_column= "int_id_column",
                
                # runtime config
                run_locally= True,
                num_cpus= CONFIG_RAY_NUM_CPUS,
                memory= CONFIG_RAY_MEMORY,
                runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,
                ).transform()
        
if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-2: Processing input='output/01_pdf2pq_out' --> output='output/02_docid_out'



14:19:42 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'doc_hash', 'int_column': 'int_id_column', 'start_id': 0}
14:19:42 INFO - pipeline id pipeline_id
14:19:42 INFO - code location None
14:19:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
14:19:42 INFO - actor creation delay 0
14:19:42 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
14:19:42 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_docid_out
14:19:42 INFO - data factory data_ max_files -1, n_sample -1
14:19:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
14:19:42 INFO - Running locally
2025-02-06 14:19:43,706	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://1

✅ Stage:2 completed successfully
CPU times: user 115 ms, sys: 137 ms, total: 251 ms
Wall time: 15.3 s


### 5.2 - Inspect Generated output

You should see two new columns: **doc_hash** and **int_id_column**.

In [14]:
print ("Displaying contents of : ", output_docid_dir)
output_df = read_parquet_files_as_df(output_docid_dir)
output_df.head(10)

Displaying contents of :  output/02_docid_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,pdf_convert_time,source_filename,doc_hash,int_id_column
0,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,8dc8970e-215a-44fe-a7bf-946c03f36c60,6571294142213095721,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-02-06T14:19:29.408910,1.912304,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,3
1,spam.pdf,Free xxx,1,0,2,9ac78463-b325-406b-891e-c9e84722eb34,10026122586747302274,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-02-06T14:19:30.986464,1.573836,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,5
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,2025-02-06T14:19:29.335271,1.850426,earth2.pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,2
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,2025-02-06T14:19:30.950673,1.6122,mars.pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,4
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.470409,2.071769,earth-copy.pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,1
5,earth.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,18d940f3-f4b4-46ac-9147-077675aead1d,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.492574,2.093768,earth.pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,0


## Step-6: Eliminate Duplicate Documents

We have two exact duplicates: **earth.pdf** and **earth-copy.pdf**.

Note that **doc_hash** is the same for both documents.

[Exact dedupe information](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/ededup)
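
Conceptually, exact dedupe keeps one document per distinct content hash. The sketch below illustrates that idea with the standard library. It is not the Ededup implementation, and the `exact_dedupe` helper and the sample texts are made up for illustration.

```python
import hashlib

def exact_dedupe(docs):
    """Keep the first document seen for each distinct content hash."""
    seen = set()
    kept = []
    for name, contents in docs:
        digest = hashlib.sha256(contents.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical contents were already kept under another name
        seen.add(digest)
        kept.append(name)
    return kept

# Illustrative texts, not the real PDF contents
docs = [
    ("earth.pdf", "Earth is the third planet from the Sun."),
    ("earth-copy.pdf", "Earth is the third planet from the Sun."),
    ("mars.pdf", "Mars is the fourth planet from the Sun."),
]
kept = exact_dedupe(docs)
print(kept)
```

Which copy survives depends on the order in which documents are processed. This sketch keeps the first one it sees; a distributed run may keep either.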

### 6.1 - Execute

In [15]:
%%time

from dpk_ededup.ray.transform import Ededup

STAGE = 3
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_docid_dir}' --> output='{output_exact_dedupe_dir}'\n", flush=True)

result = Ededup(input_folder=output_docid_dir,
                output_folder=output_exact_dedupe_dir,
                ededup_doc_column="contents",
                ededup_doc_id_column="doc_hash",
                ededup_num_hashes= 2,
                
                # runtime config
                run_locally= True,
                num_cpus= CONFIG_RAY_NUM_CPUS,
                memory= CONFIG_RAY_MEMORY,
                runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,
                ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed")

🏃🏼 STAGE-3: Processing input='output/02_docid_out' --> output='output/03_exact_dedupe_out'



14:19:57 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'doc_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}
14:19:57 INFO - pipeline id pipeline_id
14:19:57 INFO - code location None
14:19:57 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
14:19:57 INFO - actor creation delay 0
14:19:57 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
14:19:57 INFO - data factory data_ is using local data access: input_folder - output/02_docid_out output_folder - output/03_exact_dedupe_out
14:19:57 INFO - data factory data_ max_files -1, n_sample -1
14:19:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
14:19:57 INFO - Running locally
2025-02-06 14:19:58,746	INFO worker.py:1777 -- Started a local Ray ins

✅ Stage:3 completed successfully
CPU times: user 98.9 ms, sys: 129 ms, total: 228 ms
Wall time: 15 s


### 6.2 - Inspect Generated output

You can see that one of **earth.pdf** or **earth-copy.pdf** has been eliminated.
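
To see which file was dropped, rather than only how many, compare the filename sets before and after the stage. In this sketch the filenames are copied by hand from the recorded outputs of this run; in the notebook you could build the same sets from the `filename` column of each DataFrame.

```python
# Filenames copied from the recorded stage outputs of this run
before = {"lorem-ipsum.pdf", "spam.pdf", "earth2.pdf",
          "mars.pdf", "earth-copy.pdf", "earth.pdf"}
after = {"lorem-ipsum.pdf", "spam.pdf", "earth2.pdf",
         "mars.pdf", "earth-copy.pdf"}

removed = sorted(before - after)
print("Removed by exact dedupe:", removed)
```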

In [16]:
input_df = read_parquet_files_as_df(output_docid_dir)
output_df = read_parquet_files_as_df(output_exact_dedupe_dir)

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before exact dedupe : {input_df.shape[0]:,}")
print (f"Output files after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

print ("Displaying contents of : ", output_exact_dedupe_dir)
output_df.head(10)

Input files before exact dedupe : 6
Output files after exact dedupe : 5
Duplicate files removed :   1
Displaying contents of :  output/03_exact_dedupe_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,pdf_convert_time,source_filename,doc_hash,int_id_column,removed
0,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,8dc8970e-215a-44fe-a7bf-946c03f36c60,6571294142213095721,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-02-06T14:19:29.408910,1.912304,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,3,[]
1,spam.pdf,Free xxx,1,0,2,9ac78463-b325-406b-891e-c9e84722eb34,10026122586747302274,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-02-06T14:19:30.986464,1.573836,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,5,[]
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,2025-02-06T14:19:29.335271,1.850426,earth2.pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,2,[]
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,2025-02-06T14:19:30.950673,1.6122,mars.pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,4,[]
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.470409,2.071769,earth-copy.pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,1,[]


## Step-7: Fuzzy Dedupe

In the previous step, we removed **exact duplicates (identical documents)**.

Fuzzy dedupe can further filter out documents that are **not exactly identical, but nearly identical**.

Here is a simple example:

`Our solar system is a vast and fascinating expanse`

`The solar system is a vast and fascinating expanse`

Only one word is different: `Our` vs `The`.

Or imagine two documents that differ only by one extra blank line. For our purposes they are the same.
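
One way to quantify "nearly identical" is the Jaccard similarity between the sets of word shingles (overlapping word n-grams) of two texts. The sketch below computes it exactly for the two example sentences. This is an illustration only: the stage log reports minhash parameters, so the transform appears to estimate the similarity from minhash signatures rather than compute it exactly. The shingle size of 5 matches the `'word_shingle_size': 5` value in that log, and the helper names are hypothetical.

```python
def word_shingles(text, size):
    """Return the set of overlapping word n-grams ('shingles') in text."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = "Our solar system is a vast and fascinating expanse"
s2 = "The solar system is a vast and fascinating expanse"

for size in (1, 5):
    score = jaccard(word_shingles(s1, size), word_shingles(s2, size))
    print(f"shingle size {size}: Jaccard similarity = {score:.3f}")
```

For a text this short, a single changed word removes a large share of the shingles, so the similarity drops well below 1.0. Longer documents with one changed word score much closer to 1.0.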

[Fuzzy dedupe documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/fdedup)

### Tweaking fuzzy matches

**`jaccard_similarity_threshold`** is the parameter that controls how similar two documents must be to count as near duplicates. Its value is between 0 and 1.0. Values close to 1.0 mean stricter checking (fewer documents qualify as duplicates). A lower threshold means more lenient matching (more documents qualify as duplicates).

Adjust this value to find what works for your documents.
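
The Execute cell in 7.1 also sets `num_bands=14` and `num_minhashes_per_band=8` (14 × 8 = 112, the value of `num_permutations`). Assuming the transform follows the standard MinHash LSH banding scheme, which this notebook does not confirm, two documents with Jaccard similarity `s` become duplicate candidates with probability `1 - (1 - s**r)**b`, where `b` is the number of bands and `r` the number of minhashes per band. The sketch below tabulates that textbook formula; `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(similarity, num_bands, rows_per_band):
    """Standard LSH banding estimate: P = 1 - (1 - s**r)**b."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    p = candidate_probability(s, num_bands=14, rows_per_band=8)
    print(f"similarity {s:.2f} -> candidate probability {p:.3f}")
```

Under this assumption, more bands with fewer minhashes each make the filter more permissive, while fewer bands with more minhashes each make it stricter.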

### 7.1 - Execute

In [17]:
%%time

from dpk_fdedup.ray.transform import Fdedup

STAGE = 4
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_fuzzy_dedupe_dir}'\n", flush=True)

result = Fdedup(input_folder=output_exact_dedupe_dir,
                output_folder=output_fuzzy_dedupe_dir,
                contents_column= "contents",
                # document_id_column= "doc_id",
                document_id_column= "int_id_column",
                num_permutations= 112,
                num_bands= 14,
                num_minhashes_per_band= 8,
                jaccard_similarity_threshold = 0.9, # between 0 - 1.  higher means more strict checking
                operation_mode="filter_duplicates",
                # operation_mode="annotate",
                
                # runtime config
                run_locally= True,
                ).transform()

# if result == 0:
#     print (f"✅ Stage:{STAGE} completed successfully")
# else:
#     raise Exception (f"❌ Stage:{STAGE}  failed (result={result})")

🏃🏼 STAGE-4: Processing input='output/03_exact_dedupe_out' --> output='output/04_fuzzy_dedupe_out'



14:20:12 INFO - Starting SignatureCalculation step
14:20:12 INFO - Got parameters for SignatureCalculation
14:20:12 INFO - minhash parameters are : {'document_id_column': 'int_id_column', 'contents_column': 'contents', 'seed': 42, 'num_permutations': 112, 'jaccard_similarity_threshold': 0.9, 'word_shingle_size': 5, 'num_bands': 14, 'num_minhashes_per_band': 8, 'num_segments': 1, 'shingle_option': 'word'}
14:20:12 INFO - data factory scdata_ is using local configuration without input/output path
14:20:12 INFO - data factory scdata_ max_files -1, n_sample -1
14:20:12 INFO - data factory scdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
14:20:12 INFO - pipeline id pipeline_id
14:20:12 INFO - code location None
14:20:12 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}
14:20:12 INFO - actor creation delay 0
14:20:12 INFO - job details {'job category': 'preprocessing', 

CPU times: user 558 ms, sys: 526 ms, total: 1.08 s
Wall time: 59.5 s


### 7.2 - Inspect Output

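Fuzzy dedupe writes the documents that survive filtering to the **output/04_fuzzy_dedupe_out/cleaned** folder.

We would expect only one of the Earth documents to remain, since **earth2.pdf** differs from **earth.pdf** by a single word. In the run recorded below, however, both **earth2.pdf** and **earth-copy.pdf** are still present and no documents were removed. If you see the same result, try lowering `jaccard_similarity_threshold`.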

In [18]:
input_df = read_parquet_files_as_df(output_exact_dedupe_dir)
output_df = read_parquet_files_as_df(os.path.join(output_fuzzy_dedupe_dir, "cleaned"))

# print ("Input data dimensions (rows x columns)= ", input_df.shape)
# print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input files before fuzzy dedupe : {input_df.shape[0]:,}")
print (f"Output files after fuzzy dedupe : {output_df.shape[0]:,}")
print ("Near duplicate files removed :  ", (input_df.shape[0] - output_df.shape[0]))

print ("Displaying contents of : ", output_fuzzy_dedupe_dir)
output_df.head(10)

Input files before fuzzy dedupe : 5
Output files after fuzzy dedupe : 5
Near duplicate files removed :   0
Displaying contents of :  output/04_fuzzy_dedupe_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,date_acquired,pdf_convert_time,source_filename,doc_hash,int_id_column,removed
0,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,8dc8970e-215a-44fe-a7bf-946c03f36c60,6571294142213095721,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,2025-02-06T14:19:29.408910,1.912304,lorem-ipsum.pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,3,[]
1,spam.pdf,Free xxx,1,0,2,9ac78463-b325-406b-891e-c9e84722eb34,10026122586747302274,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,2025-02-06T14:19:30.986464,1.573836,spam.pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,5,[]
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,2025-02-06T14:19:29.335271,1.850426,earth2.pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,2,[]
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,2025-02-06T14:19:30.950673,1.6122,mars.pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,4,[]
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,2025-02-06T14:19:27.470409,2.071769,earth-copy.pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,1,[]


## Step-8: Document Quality

This transform scores documents across many metrics.

Here we will look at the 'bad words' metric.

[Document quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)

By default it uses this [bad words collection](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/dpk_doc_quality/ldnoobw). You can supply a custom file by passing the argument `bad_word_filepath=/path/to/badwords_file`.

### 8.1 - Execute

In [19]:
%%time

from dpk_doc_quality.ray.transform import DocQuality

STAGE = 5
output_fuzzy_dedupe_cleaned_dir = os.path.join(output_fuzzy_dedupe_dir, "cleaned")
print (f"🏃🏼 STAGE-{STAGE}: Processing input='{output_fuzzy_dedupe_cleaned_dir}' --> output='{output_doc_quality_dir}'\n", flush=True)

result = DocQuality(input_folder=output_fuzzy_dedupe_cleaned_dir,
                    output_folder= output_doc_quality_dir,
                    docq_text_lang = "en",
                    docq_doc_content_column ="contents",
                    
                    # runtime config
                    run_locally= True,
                    num_cpus= CONFIG_RAY_NUM_CPUS,
                    memory= CONFIG_RAY_MEMORY,
                    runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,
                    ).transform()

if result == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE}  failed (result={result})")

🏃🏼 STAGE-5: Processing input='output/04_fuzzy_dedupe_out/cleaned' --> output='output/05_doc_quality_out'



14:21:12 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x73d0a1aa2810>}
14:21:12 INFO - data factory docq_ is using local configuration without input/output path
14:21:12 INFO - data factory docq_ max_files -1, n_sample -1
14:21:12 INFO - data factory docq_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
14:21:12 INFO - pipeline id pipeline_id
14:21:12 INFO - code location None
14:21:12 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
14:21:12 INFO - actor creation delay 0
14:21:12 INFO - job details {'job category': 'preprocessing', 'job name': 

✅ Stage:5 completed successfully
CPU times: user 122 ms, sys: 128 ms, total: 250 ms
Wall time: 14.9 s


### 8.2 - Inspect the Output

We will see several new columns starting with the name **docq_**.

Look at the column **docq_contain_bad_word**; it flags documents that contain 'bad words'.

Also inspect the column **docq_lorem_ipsum_ratio**; it flags documents that contain 'lorem ipsum' text.

For more information see : [Doc Quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)
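
Two of these columns are easy to reproduce by hand. The sketch below is a guess at the formulas, not the transform's code. It assumes the mean word length is taken over whitespace-separated words, and that the lorem ipsum ratio is the number of case-insensitive occurrences of "lorem ipsum" divided by the text length in characters. Both assumptions reproduce the values recorded below for lorem-ipsum.pdf (5.0 and 0.085714) and spam.pdf (3.5 and 0.0). The function names are hypothetical.

```python
def mean_word_len(text):
    """Assumed formula: average length of whitespace-separated words."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def lorem_ipsum_ratio(text):
    """Assumed formula: occurrences of 'lorem ipsum' per character of text."""
    return text.lower().count("lorem ipsum") / len(text) if text else 0.0

# Extracted contents of the two files, as shown in section 4.3
lorem = "Lorem ipsum Lorem ipsum Lorem ipsum"
spam = "Free xxx"

for name, text in (("lorem-ipsum.pdf", lorem), ("spam.pdf", spam)):
    print(name, mean_word_len(text), round(lorem_ipsum_ratio(text), 6))
```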

In [20]:
output_df = read_parquet_files_as_df(output_doc_quality_dir)
print ("Displaying contents of : ", output_doc_quality_dir)
output_df.head()

Displaying contents of :  output/05_doc_quality_out


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,...,docq_mean_word_len,docq_symbol_to_word_ratio,docq_sentence_count,docq_lorem_ipsum_ratio,docq_curly_bracket_ratio,docq_contain_bad_word,docq_bullet_point_ratio,docq_ellipsis_line_ratio,docq_alphabet_word_ratio,docq_contain_common_en_words
0,lorem-ipsum.pdf,Lorem ipsum Lorem ipsum Lorem ipsum,1,0,2,8dc8970e-215a-44fe-a7bf-946c03f36c60,6571294142213095721,pdf,bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...,35,...,5.0,0.0,1,0.085714,0.0,False,0.0,0.0,1.0,False
1,spam.pdf,Free xxx,1,0,2,9ac78463-b325-406b-891e-c9e84722eb34,10026122586747302274,pdf,543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...,8,...,3.5,0.0,1,0.0,0.0,True,0.0,0.0,1.0,False
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,...,4.541284,0.027523,9,0.0,0.0,False,0.176471,0.0,0.880734,True
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,...,4.688,0.032,8,0.0,0.0,False,0.176471,0.0,0.88,True
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,...,4.541284,0.027523,9,0.0,0.0,False,0.176471,0.0,0.880734,True
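
Because `head()` elides the middle columns with `...`, it can help to view only the quality columns. The following is a minimal sketch, not part of the original notebook: `toy_df` is a hypothetical stand-in for `output_df`, holding a few values copied from the table above, and `size` is included only to show that non-`docq_` columns are dropped.

```python
import pandas as pd

# Hypothetical stand-in for output_df; values copied from two rows of the table above.
toy_df = pd.DataFrame(
    {
        "filename": ["lorem-ipsum.pdf", "spam.pdf"],
        "size": [35, 8],
        "docq_lorem_ipsum_ratio": [0.085714, 0.0],
        "docq_contain_bad_word": [False, True],
    }
)

# Keep the filename plus every column whose name starts with "docq_".
docq_columns = [col for col in toy_df.columns if col.startswith("docq_")]
docq_view = toy_df[["filename"] + docq_columns]

print(docq_view.to_string(index=False))
```

The same two lines of selection logic should apply to `output_df` unchanged, assuming its quality columns all share the `docq_` prefix as shown above.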


### 8.3 - Filtering 'quality' documents

From the output above, we see that **spam.pdf** is flagged for containing bad words (**docq_contain_bad_word=True**).

Also, **lorem-ipsum.pdf** is flagged for **lorem ipsum** placeholder content (**docq_lorem_ipsum_ratio > 0**).

We are going to filter out both of them.

In [21]:
all_docs_df = read_parquet_files_as_df(output_doc_quality_dir)

# remove documents that contain bad words
clean_docs_df = all_docs_df[all_docs_df['docq_contain_bad_word'] == False]

# also remove documents that contain any 'lorem ipsum' text
clean_docs_df = clean_docs_df[clean_docs_df['docq_lorem_ipsum_ratio'] == 0]

clean_docs_df.head(10)

Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,document_hash,ext,hash,size,...,docq_mean_word_len,docq_symbol_to_word_ratio,docq_sentence_count,docq_lorem_ipsum_ratio,docq_curly_bracket_ratio,docq_contain_bad_word,docq_bullet_point_ratio,docq_ellipsis_line_ratio,docq_alphabet_word_ratio,docq_contain_common_en_words
2,earth2.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,b3ed1942-54a6-49fc-bcbc-2d8c438adef3,10729312978404042321,pdf,f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...,610,...,4.541284,0.027523,9,0.0,0.0,False,0.176471,0.0,0.880734,True
3,mars.pdf,## Mars\n\n## Solar System\n\nOur solar system...,1,0,11,6d882651-2506-41cb-8704-85575c64b143,7758129997476962679,pdf,a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...,717,...,4.688,0.032,8,0.0,0.0,False,0.176471,0.0,0.88,True
4,earth-copy.pdf,## Earth\n\n## Solar System\n\nOur solar syste...,1,0,11,f8ccec16-576c-4e3e-8bec-359dff01d6d2,14711865278795535908,pdf,6140cf695f269a3ddca6568536076756105ad3186086b2...,610,...,4.541284,0.027523,9,0.0,0.0,False,0.176471,0.0,0.880734,True
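
When filtering, it is often useful to record which documents were rejected and why. The sketch below is one possible way to do that, not part of the original notebook: `toy_df` is a hypothetical stand-in for `all_docs_df` with values taken from the table in section 8.2, and it assumes the ratio column is never negative or missing, so that `> 0` is the exact complement of the `== 0` filter.

```python
import pandas as pd

# Hypothetical stand-in for all_docs_df, using values from the table in section 8.2.
toy_df = pd.DataFrame(
    {
        "filename": ["lorem-ipsum.pdf", "spam.pdf", "earth2.pdf", "mars.pdf"],
        "docq_contain_bad_word": [False, True, False, False],
        "docq_lorem_ipsum_ratio": [0.085714, 0.0, 0.0, 0.0],
    }
)

# One boolean mask per rejection rule.
has_bad_word = toy_df["docq_contain_bad_word"]
has_lorem_ipsum = toy_df["docq_lorem_ipsum_ratio"] > 0

# A document is rejected if any rule fires; everything else is kept.
rejected_df = toy_df[has_bad_word | has_lorem_ipsum]
kept_df = toy_df[~(has_bad_word | has_lorem_ipsum)]

# Report the reason(s) for each rejected document.
for _, row in rejected_df.iterrows():
    reasons = []
    if row["docq_contain_bad_word"]:
        reasons.append("bad word")
    if row["docq_lorem_ipsum_ratio"] > 0:
        reasons.append("lorem ipsum")
    print(f"{row['filename']}: {', '.join(reasons)}")
```

Keeping the masks as named variables makes it straightforward to add further rules later without rewriting the filter.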


## Step-9: Copy output to final output dir

In [22]:
import shutil

# start from an empty final output directory
shutil.rmtree(output_final_dir, ignore_errors=True)
os.makedirs(output_final_dir, exist_ok=True)

output_final_dir_parquet = os.path.join(output_final_dir, 'pq')
os.makedirs(output_final_dir_parquet, exist_ok=True)

output_final_dir_markdown = os.path.join(output_final_dir, 'markdown')
os.makedirs(output_final_dir_markdown, exist_ok=True)

In [23]:
## save parquet

clean_docs_df.to_parquet(os.path.join(output_final_dir_parquet, "clean_docs.parquet"))
print (f"✅ Saved CLEAN parquet output to '{output_final_dir_parquet}'")

✅ Saved CLEAN parquet output to 'output/output_final/pq'


In [24]:
## save markdown text

for _, row in clean_docs_df.iterrows():
    output_file_name = os.path.join(output_final_dir_markdown, row['filename'] + '.md')
    # write as UTF-8 explicitly so the result does not depend on the platform default
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['contents'])

print (f"✅ Saved CLEAN markdown output to '{output_final_dir_markdown}'")


✅ Saved CLEAN markdown output to 'output/output_final/markdown'
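
The loop above appends `.md` to the full file name, so `earth2.pdf` is saved as `earth2.pdf.md`. If names such as `earth2.md` are preferred, the original extension can be dropped first. The following is a minimal sketch under stated assumptions: `markdown_name` is a hypothetical helper, the `docs` list uses placeholder contents rather than the real extracted text, and a temporary directory stands in for `output_final_dir_markdown`.

```python
import tempfile
from pathlib import Path


def markdown_name(pdf_filename: str) -> str:
    """Replace the final extension of a file name with '.md' (hypothetical helper)."""
    return Path(pdf_filename).stem + ".md"


# Placeholder rows; in the notebook these would come from clean_docs_df.
docs = [
    {"filename": "earth2.pdf", "contents": "## Earth\n"},
    {"filename": "mars.pdf", "contents": "## Mars\n"},
]

# A temporary directory stands in for the markdown output directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for doc in docs:
        target = Path(tmp_dir) / markdown_name(doc["filename"])
        target.write_text(doc["contents"], encoding="utf-8")
    written = sorted(path.name for path in Path(tmp_dir).iterdir())

print(written)
```

Note that two inputs differing only by extension would map to the same markdown name, so keeping the full name, as the notebook does, is the safer default when that can happen.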
