## **Demo on building data prep pipeline for fine tuning text data**

**Authors**: Pooja Holkar, Aisha Darga

**email**: poholkar@in.ibm.com,aisdarg1@in.ibm.com


<a href="https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/fine tuning/language/fine-tune-language.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


This demo notebook shows how to use IBM's Data Prep Kit (DPK) to build a data preparation pipeline for fine-tuning or extended pre-training on legal contracts. It walks through the steps that process raw legal documents (contracts), clean them, and tokenize them so that they can be used to fine-tune a large language model.

Organizations working with legal contracts, such as Master Service Agreements (MSAs), often face challenges in managing, reviewing, and refining legal documents at scale. With a large repository of contracts, they need a streamlined process to analyze, extract, and refine critical clauses and terms to ensure compliance, clarity, and adaptability.

The data preparation steps demonstrated in this notebook include:

- **Conversion of PDF to Parquet**
- **Identification of Hate, Abuse and Profanity (HAP)**
- **Identification of Personally Identifiable Information (PII)**
- **De-duplication of Data**
- **Document Chunking**
- **Document Quality Assessment**
- **Tokenization of the Data**



## Setup

Install the `data-prep-toolkit-transforms` package with the `language` extra. This notebook requires at least 8 CPUs.
To run on Google Colab, it is recommended to change the runtime to TPU to get the required number of CPUs.


In [1]:
%%capture
!pip install 'data-prep-toolkit-transforms[language]'
import pyarrow.parquet as pq
import pandas as pd

In [2]:
# Enable nested asyncio event loops, which coroutine-based code needs when it runs inside a notebook
import nest_asyncio
nest_asyncio.apply()

import os

### Download Data

In [3]:
import os
import urllib.request
os.makedirs("input-data", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/refs/heads/dev/examples/data-files/fine-tuning/language/MSA-DPK-1.pdf", "input-data/MSA-DPK-1.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/refs/heads/dev/examples/data-files/fine-tuning/language/MSA-2.pdf", "input-data/MSA-2.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/refs/heads/dev/examples/data-files/fine-tuning/language/MSA-3.pdf", "input-data/MSA-3.pdf")

('input-data/MSA-3.pdf', <http.client.HTTPMessage at 0x16c92aed0>)

In [4]:
# create parameters
input_folder = os.path.join("input-data")


##### The downloaded files are placed in the `input-data` folder. For our use case, three MSA contracts go through the pipeline. Each transform writes its output to a separate folder named `files-<transform_name>`, which makes it easy to verify the output of each transform. This concludes the setup section.

## Data Preparation Steps

We now walk through the data preparation steps that clean the raw legal contract data and transform it into a tokenized format. We use the [parquet data format](https://parquet.apache.org/) for all our operations, which helps the pipeline scale efficiently to production runs beyond this demo.

1. Pdf2Parquet: Read the PDF files from `input-data` and convert them into parquet format.
2. HAP: Annotate each document with a hate, abuse, and profanity score.
3. PIIRedactor: Redact sensitive information.
4. Ededup: Remove exact duplicates.
5. DocChunk: Split large legal documents into smaller, coherent sections.
6. DocQuality: Assess completeness, consistency, and relevance.
7. Tokenization: Tokenize the data for model fine-tuning.

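The data processing pipeline is organized so that the output of one transform is used as the input to the next. The one exception is the HAP step: it writes to `files-hapoutput`, while the PII redactor reads from `files-pdf2parquet`.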
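The chaining can be checked mechanically. The sketch below lists the input and output folders as they appear in the transform calls in this notebook and reports any stage that does not read from the previous stage's output. `STAGES` and `find_breaks` are illustrative names, not part of the toolkit.

```python
# Folder wiring copied from the transform calls in this notebook.
STAGES = [
    ("Pdf2Parquet", "input-data", "files-pdf2parquet"),
    ("HAP", "files-pdf2parquet", "files-hapoutput"),
    ("PIIRedactor", "files-pdf2parquet", "files-piiredacted"),
    ("Ededup", "files-piiredacted", "files-ededup"),
    ("DocChunk", "files-ededup", "files-doc-chunk"),
    ("DocQuality", "files-doc-chunk", "files-doc-quality"),
    ("Tokenization", "files-doc-quality", "files-tokenization"),
]


def find_breaks(stages):
    """Return names of stages whose input is not the previous stage's output."""
    breaks = []
    for (_, _, prev_out), (name, cur_in, _) in zip(stages, stages[1:]):
        if cur_in != prev_out:
            breaks.append(name)
    return breaks


print(find_breaks(STAGES))
```

With the calls as written, only `PIIRedactor` is reported, so the `hap_score` column produced by the HAP step is not carried into the later stages.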

## 1. PDF data to Parquet

This is the first component of the pipeline. It ingests all the legal contracts from `input-data` and converts them into
parquet files for consumption by the next steps in the pipeline.

The output of this stage is written to `files-pdf2parquet`.

In [5]:
%%capture
from dpk_pdf2parquet.transform_python import Pdf2Parquet
Pdf2Parquet(input_folder= input_folder,
               output_folder= "files-pdf2parquet",
               data_files_to_use=['.pdf'],
               pdf2parquet_contents_type='text/markdown').transform()

## 2. Identification of HAP content

HAP identification helps ensure that the data used to train models is free from harmful or inappropriate content that could introduce bias into large language model (LLM) outputs.

Hate, Abuse, or Profanity (HAP) detection is not typically relevant in legal contract analysis, but it is becoming important when training legal language models. For this proof of concept, we include HAP detection to showcase its value in managing user-generated data.

In [6]:
from dpk_hap.transform_python import HAP
# create parameters
HAP(input_folder="files-pdf2parquet",
        output_folder="files-hapoutput",
        model_name_or_path= 'ibm-granite/granite-guardian-hap-38m',
        annotation_column= "hap_score",
        doc_text_column= "contents",
        inference_engine= "CPU",
        max_length= 512,
        batch_size= 128,
        ).transform()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/poojaholkar/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
09:45:56 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 


Processing batch: 0/0
    filename                                           contents  num_pages  \
0  MSA-2.pdf  ## MASTER SERVICE AGREEMENT\n\nThis Master Ser...          7   

   num_tables  num_doc_elements                           document_id  \
0           0               120  493846b6-70c4-4092-be5a-0384af88cec5   

          document_hash  ext  \
0  11474200003073967609  pdf   

                                                hash   size  \
0  f226dd1bcbfae3cff93fd977b566e307b06b3843575cc3...  12546   

                date_acquired  pdf_convert_time source_filename  hap_score  
0  2025-01-31T09:45:49.483514          3.385261       MSA-2.pdf   0.011435  
Processing batch: 0/0
    filename                                           contents  num_pages  \
0  MSA-3.pdf  ## MASTER SERVICE AGREEMENT\n\nThis Master Ser...          6   

   num_tables  num_doc_elements                           document_id  \
0           0               141  b3994352-ed1e-4d76-9ed9-94d5be04fd1a   

  

0
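The output above shows that each row gains a `hap_score` column; what to do with the score is left to the user. Below is a minimal sketch of one option, assuming that higher scores indicate more likely HAP content. The 0.5 cutoff and the `split_by_hap` helper are hypothetical and are not prescribed by this notebook.

```python
# Hypothetical cutoff: the notebook does not prescribe a threshold for hap_score.
HAP_THRESHOLD = 0.5


def split_by_hap(rows, threshold=HAP_THRESHOLD):
    """Split rows into (kept, flagged) by comparing hap_score to the threshold."""
    kept = [r for r in rows if r["hap_score"] < threshold]
    flagged = [r for r in rows if r["hap_score"] >= threshold]
    return kept, flagged


rows = [
    {"filename": "MSA-2.pdf", "hap_score": 0.011435},  # score from the output above
    {"filename": "example.pdf", "hap_score": 0.93},    # made-up row for illustration
]
kept, flagged = split_by_hap(rows)
print([r["filename"] for r in kept])
print([r["filename"] for r in flagged])
```

Under this assumed cutoff, the MSA-2 contract is kept because its score is close to zero.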

## 3. PII Redactor Transform
This transform redacts Personally Identifiable Information (PII) from the input data, which here is the set of parquet files generated in `files-pdf2parquet` in step 1.

The transform leverages the Microsoft Presidio SDK for PII detection and uses the Flair recognizer for entity recognition.

The transform detects the following PII entities in legal contracts:

- **PERSON:** Names of individuals
- **EMAIL_ADDRESS:** Email addresses
- **ORGANIZATION:** Names of organizations
- **PHONE_NUMBER:** Phone numbers
- **LOCATION:** Addresses

The redacted output is written to `files-piiredacted`. The redaction technique used here is `replace`, the default, which replaces detected PII with a placeholder. The alternative, `redact`, removes the detected PII from the text.
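To make the difference between the two operators concrete, here is a small standard-library sketch. It is not how the transform detects PII: a simple email regular expression stands in for the real detector, and `apply_operator` is a hypothetical helper. The `<EMAIL_ADDRESS>` placeholder follows the `<ENTITY>` form visible in the redacted output of this notebook.

```python
import re

# Toy detector: a simple email pattern stands in for real PII detection.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def apply_operator(text, operator="replace"):
    """Apply a 'replace' or 'redact' style operator to detected email addresses."""
    if operator == "replace":
        return EMAIL_PATTERN.sub("<EMAIL_ADDRESS>", text)
    if operator == "redact":
        return EMAIL_PATTERN.sub("", text)
    raise ValueError(f"unknown operator: {operator}")


sample = "Send notices to legal@example.com within 30 days."
print(apply_operator(sample, "replace"))
print(apply_operator(sample, "redact"))
```

The `replace` variant keeps a marker showing what kind of entity was removed, which preserves sentence structure for training. The `redact` variant leaves a gap in the text.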


In [7]:
%%capture
from dpk_pii_redactor.transform_python import PIIRedactor
PIIRedactor(input_folder='files-pdf2parquet',
            output_folder= 'files-piiredacted',
            pii_redactor_entities = ["PERSON", "EMAIL_ADDRESS","ORGANIZATION","PHONE_NUMBER", "LOCATION"],
            pii_redactor_operator = "replace",
            pii_redactor_transformed_contents = "title").transform()

The redacted text and the list of detected PII entities can be viewed by inspecting one of the generated files in the `files-piiredacted` folder, such as `MSA-DPK-1.parquet`. The redacted text is stored in the `title` column, as set by `pii_redactor_transformed_contents`.

In [8]:
data_pii = pd.read_parquet('files-piiredacted/MSA-DPK-1.parquet')
print(data_pii["title"][0])
print(data_pii["detected_pii"][0])

## MASTER SERVICE AGREEMENT

This Master Service Agreement (the "Agreement") dated on this 15th day of January, 2025 (the "Effective Date")  is  made  between <ORGANIZATION>.  (the  "Client")  with  an  address  of  123  <LOCATION>,  Suite  100, <LOCATION>,  <LOCATION>  10001,  and  <ORGANIZATION>  (the  "Service  Provider"),  with  an  address  of  456 <LOCATION>, Suite 200, <LOCATION>, <LOCATION> 90210, for the particular purpose of setting forth the exclusive terms and conditions by which the Client desires to acquire the described services from the Service Provider.

In consideration of the mutual obligations specified in this Agreement, the parties, intending to be legally bound hereby, agree to the following:

Scope of Services. Client retains the above Service Provider, and the Service Provider agrees to perform for the Client, the services set forth in Exhibit A to this Agreement (the "Services"). Any Service outside of the scope as defined in Exhibit A to this Agreement will r

#### The function below reads all the parquet files in the folder at once.

In [9]:
import glob
def read_parquet_files_as_df(parquet_dir):
    """Read every parquet file in parquet_dir into a single DataFrame."""
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')
    # Read each parquet file into a DataFrame and store them in a list
    dfs = [pd.read_parquet(f) for f in parquet_files]
    dfs = [df for df in dfs if not df.empty]  # filter out empty DataFrames
    # Concatenate all DataFrames into a single DataFrame
    if dfs:
        return pd.concat(dfs, ignore_index=True)
    return pd.DataFrame()  # no non-empty files found

## 4. Exact deduplication

This step finds exact duplicates in the `contents` column and removes them. It does so by computing a SHA-256 hash of each document's text and keeping only one record per distinct hash.
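The idea can be sketched in a few lines of standard-library Python. This illustrates hash-based exact deduplication in general, not the transform's implementation. The `exact_dedup` helper is hypothetical, and it hashes the raw string without any normalization.

```python
import hashlib


def exact_dedup(rows, doc_column="contents"):
    """Keep the first row for each distinct SHA-256 digest of the text column."""
    seen = set()
    unique = []
    for row in rows:
        digest = hashlib.sha256(row[doc_column].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


rows = [
    {"document_id": "a", "contents": "Scope of Services."},
    {"document_id": "b", "contents": "Scope of Services."},  # exact duplicate of "a"
    {"document_id": "c", "contents": "Scope of services."},  # differs in case, so kept
]
print([r["document_id"] for r in exact_dedup(rows)])
```

Because the comparison is exact, even a one-character difference produces a different hash. Near-duplicates therefore survive this step.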

The transform-specific parameters for exact deduplication are:

- `ededup_doc_column`: name of the column to check for duplicates
- `ededup_doc_id_column`: name of the column containing the document ID

The output of this stage is written to `files-ededup`.

In [10]:
from dpk_ededup.transform_python import Ededup
Ededup(input_folder="files-piiredacted",
    output_folder="files-ededup",
    ededup_doc_column="contents",
    ededup_doc_id_column="document_id").transform()

09:46:41 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}
09:46:41 INFO - Starting from the beginning


0

##### The deduplicated output can be verified across the three files generated in the `files-ededup` folder, as shown in the example below.

In [11]:
data_dedup = read_parquet_files_as_df('files-ededup')
print ("Displaying contents of : ", 'files-ededup')
data_dedup.head()

Displaying contents of :  files-ededup


| | detected_pii | title | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename | removed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-DPK-1.pdf | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | 9 | 0 | 105 | fbfc4f90-9bca-4a87-9ec5-ff69cb26722b | 17485373327219494888 | pdf | 18ac515df9bc19ec6124a02c43518c6c8631f84a8b97bf... | 22613 | 2025-01-31T09:45:55.562620 | 3.657493 | MSA-DPK-1.pdf | [] |
| 1 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-2.pdf | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | 7 | 0 | 120 | 493846b6-70c4-4092-be5a-0384af88cec5 | 11474200003073967609 | pdf | f226dd1bcbfae3cff93fd977b566e307b06b3843575cc3... | 12546 | 2025-01-31T09:45:49.483514 | 3.385261 | MSA-2.pdf | [] |
| 2 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-3.pdf | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | 6 | 0 | 141 | b3994352-ed1e-4d76-9ed9-94d5be04fd1a | 6447590240996797786 | pdf | a2eea6404558a7f8e5f938004f7d3e88ff814f5399349b... | 10071 | 2025-01-31T09:45:51.902796 | 2.413715 | MSA-3.pdf | [] |


## 5. Document chunking

This transform splits documents into chunks. It supports multiple chunker modules, selected with the `doc_chunk_chunking_type` parameter; this notebook uses `li_markdown`.

The output of this stage is written to `files-doc-chunk`.

In [12]:
%%capture
from dpk_doc_chunk.transform_python import DocChunk
DocChunk(input_folder='files-ededup',
        output_folder='files-doc-chunk',
        doc_chunk_chunking_type= "li_markdown").transform()
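For intuition about markdown-aware chunking, the sketch below splits a markdown string at heading lines using only the standard library. It is a rough approximation for illustration and not the code behind `li_markdown`, which may handle nesting, code fences, and metadata differently. `chunk_markdown` is a hypothetical helper.

```python
def chunk_markdown(text):
    """Split markdown into chunks, starting a new chunk at every heading line."""
    chunks, current = [], []
    for line in text.splitlines():
        # A heading closes the chunk collected so far (if any).
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = (
    "## MASTER SERVICE AGREEMENT\n\nThis is the preamble.\n\n"
    "## Scope of Services\n\nClient retains the Service Provider."
)
for chunk in chunk_markdown(doc):
    print(repr(chunk))
```

Each chunk keeps its heading, so a clause stays attached to the section title that gives it context.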

## 6. Doc Quality

This step evaluates the completeness, consistency, and relevance of documents to ensure high-quality input for downstream processing and model training.

The transform-specific parameters are:

- `docq_text_lang`: language of the text content. By default, "en" is used.
- `docq_doc_content_column`: name of the column that contains the document text. By default, "contents" is used.

The output of this stage is written to the `files-doc-quality` folder.

In [13]:
%%capture
from dpk_doc_quality.transform_python import DocQuality
DocQuality(input_folder='files-doc-chunk',
            output_folder= 'files-doc-quality',
            docq_text_lang = "en",
            docq_doc_content_column ="contents").transform()

##### The output contains several new columns whose names start with `docq_`.

The document quality output can be verified across the three files generated in the `files-doc-quality` folder, as shown in the example below. We will look at the `docq_contain_bad_word` metric to check whether any documents contain bad words and would need to be filtered out.

In [14]:
docq_df = read_parquet_files_as_df('files-doc-quality')
print ("Displaying contents of : ", 'files-doc-quality')
docq_df.head()

Displaying contents of :  files-doc-quality


| | detected_pii | title | filename | num_pages | num_tables | num_doc_elements | document_hash | ext | hash | size | ... | docq_mean_word_len | docq_symbol_to_word_ratio | docq_sentence_count | docq_lorem_ipsum_ratio | docq_curly_bracket_ratio | docq_contain_bad_word | docq_bullet_point_ratio | docq_ellipsis_line_ratio | docq_alphabet_word_ratio | docq_contain_common_en_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-DPK-1.pdf | 9 | 0 | 105 | 17485373327219494888 | pdf | 18ac515df9bc19ec6124a02c43518c6c8631f84a8b97bf... | 22613 | ... | 5.132964 | 0.00277 | 18 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.947368 | True |
| 1 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-DPK-1.pdf | 9 | 0 | 105 | 17485373327219494888 | pdf | 18ac515df9bc19ec6124a02c43518c6c8631f84a8b97bf... | 22613 | ... | 5.581663 | 0.000426 | 101 | 0.0 | 0.0 | False | 0.016 | 0.0 | 0.990618 | True |
| 2 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-DPK-1.pdf | 9 | 0 | 105 | 17485373327219494888 | pdf | 18ac515df9bc19ec6124a02c43518c6c8631f84a8b97bf... | 22613 | ... | 7.6 | 0.2 | 1 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.8 | False |
| 3 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-DPK-1.pdf | 9 | 0 | 105 | 17485373327219494888 | pdf | 18ac515df9bc19ec6124a02c43518c6c8631f84a8b97bf... | 22613 | ... | 25.079545 | 0.005682 | 6 | 0.0 | 0.0 | False | 0.017544 | 0.0 | 0.778409 | True |
| 4 | [ORGANIZATION, LOCATION, LOCATION, LOCATION, O... | ## MASTER SERVICE AGREEMENT\n\nThis Master Ser... | MSA-2.pdf | 7 | 0 | 120 | 11474200003073967609 | pdf | f226dd1bcbfae3cff93fd977b566e307b06b3843575cc3... | 12546 | ... | 5.4 | 0.009091 | 3 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.927273 | True |


Based on the document quality results, no chunk is flagged: in the rows shown, `docq_lorem_ipsum_ratio` is 0.0 and `docq_contain_bad_word` is `False` for every chunk, so there is no placeholder text and there are no flagged words. Therefore, there is no need to filter out any documents.
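If some chunks were flagged, a filter on these two columns could look like the sketch below. It uses plain dictionaries so that it runs without the pipeline outputs. `passes_quality` is a hypothetical helper, and the rejection rule (any bad word, or any lorem ipsum text) is an assumption rather than a toolkit default.

```python
def passes_quality(row):
    """Keep a chunk only if it has no flagged bad words and no lorem ipsum text."""
    return (not row["docq_contain_bad_word"]) and row["docq_lorem_ipsum_ratio"] == 0.0


# Two rows from the output above, reduced to the columns used here,
# plus one made-up row that should be rejected.
rows = [
    {"filename": "MSA-DPK-1.pdf", "docq_contain_bad_word": False, "docq_lorem_ipsum_ratio": 0.0},
    {"filename": "MSA-2.pdf", "docq_contain_bad_word": False, "docq_lorem_ipsum_ratio": 0.0},
    {"filename": "placeholder.pdf", "docq_contain_bad_word": False, "docq_lorem_ipsum_ratio": 0.4},
]
kept = [r["filename"] for r in rows if passes_quality(r)]
print(kept)
```

The same condition can be applied to the DataFrame returned by `read_parquet_files_as_df` with a boolean mask on the two columns.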

## 7. Tokenization

Next, we tokenize the data to be used for fine tuning.

In [15]:
%%capture
from dpk_tokenization.transform_python import Tokenization
Tokenization(input_folder= "files-doc-quality",
        output_folder= "files-tokenization",
        tkn_tokenizer=  "hf-internal-testing/llama-tokenizer",
        tkn_chunk_size= 20_000).transform()
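As a rough mental model of this step, the sketch below splits text into bounded pieces at word boundaries and maps each word to an integer ID. It assumes that `tkn_chunk_size` bounds how many characters are tokenized at a time, which this notebook does not state. The whitespace "tokenizer" is a toy stand-in for the real subword tokenizer, and both helper names are hypothetical.

```python
def split_on_words(text, chunk_size):
    """Split text into pieces of roughly chunk_size characters, breaking at spaces.

    A single word longer than chunk_size is kept whole.
    """
    pieces, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > chunk_size and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def toy_tokenize(text, vocab):
    """Map each whitespace-separated word to an integer ID, growing vocab as needed."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


vocab = {}
text = "the client retains the service provider"
token_ids = []
for piece in split_on_words(text, chunk_size=20):
    token_ids.extend(toy_tokenize(piece, vocab))
print(token_ids)
```

Repeated words map to the same ID across pieces, so chunking the input changes how much text is processed at once but not the resulting token sequence in this toy setup.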

This concludes the preprocessing steps needed to prepare legal documents for fine-tuning. The processed data consists of tokenized, quality-checked text ready for legal domain tasks. The final output files can be found in the `files-tokenization` folder.

To fine-tune a model using the processed data, follow [this recipe](https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/Fine_Tuning/Finetuning_Granite_Pirate_Style.ipynb).