# RecDP LLM - text normalization

Text normalization converts raw text into a normalized form. The process uses ftfy to repair text with broken characters, then uses regular expressions to remove punctuation and converts everything to lower case.
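
The three steps can be sketched in a few lines. This is only an approximation under stated assumptions, not RecDP's actual implementation: `normalize_sketch` is a hypothetical helper, the punctuation set is assumed to be ASCII punctuation, and whitespace collapsing is an extra step added for illustration. `ftfy.fix_text` is the real ftfy entry point; the sketch skips it when ftfy is not installed.

```python
import re
import string

try:
    import ftfy  # third-party; repairs mojibake and other broken Unicode
except ImportError:  # keep the sketch runnable without ftfy
    ftfy = None

# Assumption: "punctuation" means the ASCII punctuation characters.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def normalize_sketch(text: str) -> str:
    """Hypothetical helper approximating the normalization described above."""
    if ftfy is not None:
        text = ftfy.fix_text(text)       # step 1: repair bad characters
    text = _PUNCT_RE.sub("", text)       # step 2: remove punctuation
    text = re.sub(r"\s+", " ", text)     # extra: collapse runs of whitespace
    return text.lower().strip()          # step 3: lower-case


print(normalize_sketch("This project is implementing 36-month follow-up"))
# this project is implementing 36month followup
```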

The same normalization is applied during global deduplication and fuzzy deduplication to increase the hit ratio, since texts that differ only in case or punctuation become identical once normalized.
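
To illustrate why normalization helps deduplication, the sketch below hashes the normalized text and uses the hash as a duplicate key. It is an assumption-laden illustration, not how RecDP computes its keys: `simple_norm` is a stand-in normalizer and `dedup_key` is a hypothetical helper.

```python
import hashlib
import re


def simple_norm(text: str) -> str:
    """Stand-in normalizer: drop punctuation, collapse whitespace, lower-case."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_key(text: str) -> str:
    """SHA-256 of the normalized text, usable as an exact-duplicate key."""
    return hashlib.sha256(simple_norm(text).encode("utf-8")).hexdigest()


a = "Improving outcomes for low-income fathers."
b = "improving outcomes for  LOW-INCOME fathers"

print(a == b)                        # False: the raw strings differ
print(dedup_key(a) == dedup_key(b))  # True: they collapse to the same key
```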

Compare the `text` and `norm_text` columns in the example below:

| id   | text                                              | meta                        | norm_text                                         |
| ---- | ------------------------------------------------- | --------------------------- | ------------------------------------------------- |
| 0    | The National Domestic Violence Hotline (NDVH) ... | {'APPLICATION_ID': 100065}  | the national domestic violence hotline ndvh an... |
| 1    | The Office of Planning, Research and Evaluatio... | {'APPLICATION_ID': 100066}  | the office of planning research and evaluation... |
| 2    | Improving outcomes for low-income fathers and ... | {'APPLICATION_ID': 100067}  | improving outcomes for lowincome fathers and t... |
| 3    | This project is implementing 36-month follow-u... | {'APPLICATION_ID': 100068}  | this project is implementing 36month followup ... |
| 4    | The CCDF Policies Database is a source of info... | {'APPLICATION_ID': 100069}  | the ccdf policies database is a source of info... |
| ...  | ...                                               | ...                         | ...                                               |
| 9995 | Project: Research and produce a videotape that... | {'APPLICATION_ID': 2120612} | project research and produce a videotape that ... |
| 9996 | While relapse prevention has been studied and ... | {'APPLICATION_ID': 2120613} | while relapse prevention has been studied and ... |
| 9997 | The proposed study on recruitment, adherence a... | {'APPLICATION_ID': 2120616} | the proposed study on recruitment adherence an... |
| 9998 | Recent studies suggest that HIV epidemics are ... | {'APPLICATION_ID': 2120620} | recent studies suggest that hiv epidemics are ... |
| 9999 | The overall goal of this study is to develop a... | {'APPLICATION_ID': 2120624} | the overall goal of this study is to develop a... |

10000 rows × 4 columns

Supported input file types are `jsonl` and `parquet`. A Spark DataFrame can also be normalized directly with `text_normalization_spk`.

# Get started

## 1. Install pyrecdp and dependencies

In [None]:
! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
! pip install pyrecdp --pre
# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'

## 2. Prepare your own data

In [None]:
%mkdir -p /content/test_data
%cd /content/test_data
file_names = ['NIH_sample.jsonl', 'NIH_sample.parquet']
file_list = [f"https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/PILE/{i}" for i in file_names]
!wget -P /content/test_data {" ".join(file_list)}
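
If you would rather try the pipeline on your own documents than on the downloaded sample, a jsonl file with the same shape could be produced roughly as follows. The field names `text` and `meta` come from the sample data shown in this notebook; the records, the file name `my_sample.jsonl`, and the temporary output directory are hypothetical placeholders, and storing `meta` as a nested JSON object is an assumption.

```python
import json
import os
import tempfile

# Hypothetical records; only the field names `text` and `meta` mirror the sample.
records = [
    {"text": "First example document.", "meta": {"APPLICATION_ID": 1}},
    {"text": "Second example document!", "meta": {"APPLICATION_ID": 2}},
]

out_dir = tempfile.mkdtemp()  # swap in your own data directory
out_path = os.path.join(out_dir, "my_sample.jsonl")

# jsonl means one JSON object per line.
with open(out_path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), sorted(loaded[0]))
```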

## 3. Text normalization - input jsonl

In [6]:
from pyrecdp.primitives.llmutils import text_normalization
import pandas as pd

data_dir = "/content/test_data"
out_dir = "/content/text_norm"

text_normalization(data_dir, 'jsonl', out_dir)

# validate
print("Result after normalization")
pdf = pd.read_parquet(out_dir)
display(pdf)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
processing  started ...
processing  took 14.307068254 sec
data is written to /content/text_norm
  document count is 10000
Result after normalization


|      | filename_docid        | text                                              | meta                        | norm_text                                         |
| ---- | --------------------- | ------------------------------------------------- | --------------------------- | ------------------------------------------------- |
| 0    | NIH_sample.jsonl@0    | The National Domestic Violence Hotline (NDVH) ... | {"APPLICATION_ID":100065}   | the national domestic violence hotline ndvh an... |
| 1    | NIH_sample.jsonl@1    | The Office of Planning, Research and Evaluatio... | {"APPLICATION_ID":100066}   | the office of planning research and evaluation... |
| 2    | NIH_sample.jsonl@2    | Improving outcomes for low-income fathers and ... | {"APPLICATION_ID":100067}   | improving outcomes for lowincome fathers and t... |
| 3    | NIH_sample.jsonl@3    | This project is implementing 36-month follow-u... | {"APPLICATION_ID":100068}   | this project is implementing 36month followup ... |
| 4    | NIH_sample.jsonl@4    | The CCDF Policies Database is a source of info... | {"APPLICATION_ID":100069}   | the ccdf policies database is a source of info... |
| ...  | ...                   | ...                                               | ...                         | ...                                               |
| 9995 | NIH_sample.jsonl@9995 | Project: Research and produce a videotape that... | {"APPLICATION_ID":2120612}  | project research and produce a videotape that ... |
| 9996 | NIH_sample.jsonl@9996 | While relapse prevention has been studied and ... | {"APPLICATION_ID":2120613}  | while relapse prevention has been studied and ... |
| 9997 | NIH_sample.jsonl@9997 | The proposed study on recruitment, adherence a... | {"APPLICATION_ID":2120616}  | the proposed study on recruitment adherence an... |
| 9998 | NIH_sample.jsonl@9998 | Recent studies suggest that HIV epidemics are ... | {"APPLICATION_ID":2120620}  | recent studies suggest that hiv epidemics are ... |
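
Beyond eyeballing the displayed rows, a quick programmatic check can confirm that the output looks normalized. The sketch below assumes, based only on the rows shown here, that normalized text contains no upper-case letters and no ASCII punctuation; `looks_normalized` is a hypothetical helper, not part of pyrecdp.

```python
import string


def looks_normalized(value: str) -> bool:
    """True if the string has no upper-case letters and no ASCII punctuation."""
    has_punct = any(ch in string.punctuation for ch in value)
    return value == value.lower() and not has_punct


# Prefixes taken from the `text` and `norm_text` columns shown above.
print(looks_normalized("the national domestic violence hotline ndvh"))    # True
print(looks_normalized("The National Domestic Violence Hotline (NDVH)"))  # False
```

On the DataFrame loaded in the cell above, the same check could be applied with `pdf["norm_text"].map(looks_normalized).all()`.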


## 4. Text normalization - input parquet

In [5]:
from pyrecdp.primitives.llmutils import text_normalization
import pandas as pd

data_dir = "/content/test_data"
out_dir = "/content/text_norm"

text_normalization(data_dir, 'parquet', out_dir)

# validate
print("Result after normalization")
pdf = pd.read_parquet(out_dir)
display(pdf)

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
processing  started ...
processing  took 1.8964475989999983 sec
data is written to /content/text_norm
  document count is 10000
Result after normalization


|      | filename_docid          | text                                              | meta                        | norm_text                                         |
| ---- | ----------------------- | ------------------------------------------------- | --------------------------- | ------------------------------------------------- |
| 0    | NIH_sample.parquet@0    | The National Domestic Violence Hotline (NDVH) ... | {'APPLICATION_ID': 100065}  | the national domestic violence hotline ndvh an... |
| 1    | NIH_sample.parquet@1    | The Office of Planning, Research and Evaluatio... | {'APPLICATION_ID': 100066}  | the office of planning research and evaluation... |
| 2    | NIH_sample.parquet@2    | Improving outcomes for low-income fathers and ... | {'APPLICATION_ID': 100067}  | improving outcomes for lowincome fathers and t... |
| 3    | NIH_sample.parquet@3    | This project is implementing 36-month follow-u... | {'APPLICATION_ID': 100068}  | this project is implementing 36month followup ... |
| 4    | NIH_sample.parquet@4    | The CCDF Policies Database is a source of info... | {'APPLICATION_ID': 100069}  | the ccdf policies database is a source of info... |
| ...  | ...                     | ...                                               | ...                         | ...                                               |
| 9995 | NIH_sample.parquet@9995 | Project: Research and produce a videotape that... | {'APPLICATION_ID': 2120612} | project research and produce a videotape that ... |
| 9996 | NIH_sample.parquet@9996 | While relapse prevention has been studied and ... | {'APPLICATION_ID': 2120613} | while relapse prevention has been studied and ... |
| 9997 | NIH_sample.parquet@9997 | The proposed study on recruitment, adherence a... | {'APPLICATION_ID': 2120616} | the proposed study on recruitment adherence an... |
| 9998 | NIH_sample.parquet@9998 | Recent studies suggest that HIV epidemics are ... | {'APPLICATION_ID': 2120620} | recent studies suggest that hiv epidemics are ... |


## 5. Text normalization - input Spark DataFrame

In [7]:
from pyrecdp.primitives.llmutils import text_normalization_spk
from pyrecdp.core import SparkDataProcessor
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as F

data_file = "/content/test_data/NIH_sample.jsonl"

# SparkDataProcessor provides the Spark session used below
rdp = SparkDataProcessor()
spark = rdp.spark

# Parse each jsonl line into two string columns: text and meta
schema = StructType([
    StructField("text", StringType(), True),
    StructField("meta", StringType(), True)
])
spark_df = spark.read.text(data_file)
spark_df = spark_df.withColumn('jsonData', F.from_json(F.col('value'), schema)).select("jsonData.*")

print("input is")
spark_df.show()
print(f"input num_row is {spark_df.count()}")

# Normalize the Spark DataFrame directly, without writing files
ret = text_normalization_spk(spark_df)

print("output is")
ret.show()
print(f"output num_row is {ret.count()}")

del rdp

Will assign 1 cores and 10386 M memory for spark
per core memory size is 10.143 GB and shuffle_disk maximum capacity is 8589934592.000 GB
input is
+--------------------+--------------------+
|                text|                meta|
+--------------------+--------------------+
|The National Dome...|{"APPLICATION_ID"...|
|The Office of Pla...|{"APPLICATION_ID"...|
|Improving outcome...|{"APPLICATION_ID"...|
|This project is i...|{"APPLICATION_ID"...|
|The CCDF Policies...|{"APPLICATION_ID"...|
|The overall purpo...|{"APPLICATION_ID"...|
|This contract wil...|{"APPLICATION_ID"...|
|The purpose of th...|{"APPLICATION_ID"...|
|The purpose of th...|{"APPLICATION_ID"...|
|Intimate partner ...|{"APPLICATION_ID"...|
|ACF's Office of R...|{"APPLICATION_ID"...|
|The Temporary Ass...|{"APPLICATION_ID"...|
|Investing in Qual...|{"APPLICATION_ID"...|
|Current developme...|{"APPLICATION_ID"...|
|The proposed diss...|{"APPLICATION_ID"...|
|As the US populat...|{"APPLICATION_ID"...|
|Through employin