# 01_Preprocessing
Tips for data preprocessing to create search indexes for RAG

## Document format examples and extraction tools

TBD
- Sample use case: searching manuals
- Preparing the dataset: PDF, Word
- Implementing the sample code

In [1]:
! pip install PyMuPDF python-docx beautifulsoup4

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.5-cp312-none-win_amd64.whl.metadata (3.4 kB)
Collecting python-docx
  Using cached python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Using cached PyMuPDFb-1.24.3-py3-none-win_amd64.whl.metadata (1.4 kB)
Collecting lxml>=3.1.0 (from python-docx)
  Downloading lxml-5.2.2-cp312-cp312-win_amd64.whl.metadata (3.5 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading PyMuPDF-1.24.5-cp312-none-win_amd64.whl (3.2 MB)

In [2]:
import fitz  # PyMuPDF
from docx import Document
from bs4 import BeautifulSoup

# PDF to text
def extract_text_from_pdf(pdf_path):
    # Open the PDF in a context manager so the file handle is closed afterwards
    with fitz.open(pdf_path) as pdf_document:
        # Iterating over the document yields its pages in order
        return "".join(page.get_text() for page in pdf_document)


# Word to text
def extract_text_from_word(docx_path):
    doc = Document(docx_path)
    text = "\n".join([para.text for para in doc.paragraphs])
    return text


# HTML to text
def extract_text_from_html(html_path):
    with open(html_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        text = soup.get_text()
    return text


In [5]:
# Example of usage
example_extracted_text = extract_text_from_pdf("../data/02_article/Retrieval-Augmented-Generation-for-LLM.pdf")
example_extracted_text

'1\nRetrieval-Augmented Generation for Large\nLanguage Models: A Survey\nYunfan Gaoa, Yun Xiongb, Xinyu Gaob, Kangxiang Jiab, Jinliu Panb, Yuxi Bic, Yi Daia, Jiawei Suna, Meng\nWangc, and Haofen Wang a,c\naShanghai Research Institute for Intelligent Autonomous Systems, Tongji University\nbShanghai Key Laboratory of Data Science, School of Computer Science, Fudan University\ncCollege of Design and Innovation, Tongji University\nAbstract—Large Language Models (LLMs) showcase impres-\nsive capabilities but encounter challenges like hallucination,\noutdated knowledge, and non-transparent, untraceable reasoning\nprocesses. Retrieval-Augmented Generation (RAG) has emerged\nas a promising solution by incorporating knowledge from external\ndatabases. This enhances the accuracy and credibility of the\ngeneration, particularly for knowledge-intensive tasks, and allows\nfor continuous knowledge updates and integration of domain-\nspecific information. RAG synergistically merges LLMs’ intrin-\nsic
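
The extracted text above keeps the PDF's line layout, including words split across lines such as `impres-\nsive`. Below is a minimal sketch of one way to repair this before indexing, using only the standard library. `unwrap_pdf_text` is a hypothetical helper, not part of PyMuPDF. It assumes that a hyphen at the end of a line always marks a split word, which wrongly merges genuine compounds such as `domain-\nspecific`, so treat it as a heuristic.

```python
import re

def unwrap_pdf_text(text):
    """Heuristically undo line wrapping in text extracted from a PDF."""
    # Rejoin words split with a hyphen at a line break: "impres-\nsive" -> "impressive"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Turn the remaining single line breaks into spaces, keeping blank lines
    # (two or more consecutive newlines) as paragraph boundaries
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    return text

sample = "LLMs showcase impres-\nsive capabilities but encounter\nchallenges.\n\nNext paragraph."
print(unwrap_pdf_text(sample))
```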

## Document Intelligence sample
### Reference
- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/unlocking-advanced-document-insights-with-azure-ai-document/ba-p/4109675
- https://github.com/Azure-Samples/document-intelligence-code-samples/blob/main/Python(v4.0)/Retrieval_Augmented_Generation_(RAG)_samples/sample_figure_understanding.ipynb

In [6]:
! pip install python-dotenv openai azure-ai-documentintelligence azure-identity pillow PyMuPDF

Collecting pillow
  Using cached pillow-10.3.0-cp312-cp312-win_amd64.whl.metadata (9.4 kB)
Using cached pillow-10.3.0-cp312-cp312-win_amd64.whl (2.5 MB)
Installing collected packages: pillow
Successfully installed pillow-10.3.0


In [7]:
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import ContentFormat
from openai import AzureOpenAI

load_dotenv()

doc_intelligence_endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
doc_intelligence_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")

aoai_api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
aoai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
aoai_deployment_name = 'gpt-4o'  # your Azure OpenAI deployment name (here a GPT-4o deployment)
aoai_api_version = '2024-02-01'  # API version used when this notebook was run; it may change
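
`os.getenv` returns `None` when a variable is missing, and the failure then surfaces only later, inside the client call. Below is a minimal sketch of a fail-fast check. `require_env` is a hypothetical helper, and the variable names in the usage note are the ones this notebook reads from `.env`.

```python
import os

def require_env(*names):
    """Return the values of the given environment variables, or raise if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [os.environ[name] for name in names]
```

After `load_dotenv()`, it could be used as `doc_intelligence_endpoint, doc_intelligence_key = require_env("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT", "AZURE_DOCUMENT_INTELLIGENCE_KEY")`.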

In [8]:
import re

# Unify the format of headings in markdown text
def convert_markdown_headings(markdown_text):
    # Convert setext-style "===" headings to "#".
    # (.+) requires a non-empty title line, so a rule that follows a blank line is left alone.
    markdown_text = re.sub(r'^(.+)\n={3,}$', r'# \1', markdown_text, flags=re.MULTILINE)

    # Convert setext-style "---" headings to "##"
    markdown_text = re.sub(r'^(.+)\n-{3,}$', r'## \1', markdown_text, flags=re.MULTILINE)

    return markdown_text
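
Once every heading uses the `#` syntax, a single pattern is enough to find them, for example to build an outline or to split a document by section. Below is a minimal sketch. `list_headings` and `HEADING_PATTERN` are hypothetical names, and the pattern does not skip `#` lines that appear inside code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "## Section" at the start of a line
HEADING_PATTERN = re.compile(r'^(#{1,6})[ \t]+(.+?)[ \t]*$', flags=re.MULTILINE)

def list_headings(markdown_text):
    """Return (level, title) pairs for every ATX heading in the text."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_PATTERN.finditer(markdown_text)]

sample = "# Azure AI Search\n\nIntro text.\n\n## Index schema\nDetails."
print(list_headings(sample))  # [(1, 'Azure AI Search'), (2, 'Index schema')]
```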

In [9]:
def analyze_layout(input_file_path, output_folder):
    """
    Analyzes the layout of a document with the prebuilt-layout model, unifies the heading syntax of the Markdown output, and saves the result as a .md file.

    Args:
        input_file_path (str): The path to the input document file.
        output_folder (str): The path to the output folder where the Markdown file will be saved.

    Returns:
        str: The Markdown content of the document with unified heading syntax.

    """
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=doc_intelligence_endpoint, 
        credential=AzureKeyCredential(doc_intelligence_key),
        headers={"x-ms-useragent":"sample-code-figure-understanding/1.0.0"},
    )

    with open(input_file_path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream", output_content_format=ContentFormat.MARKDOWN 
        )

    result = poller.result()
    md_content = convert_markdown_headings(result.content)
            
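    # Create the output folder if needed, then save the Markdown under the source file's base name
    os.makedirs(output_folder, exist_ok=True)
    output_name = os.path.splitext(os.path.basename(input_file_path))[0] + ".md"
    with open(os.path.join(output_folder, output_name), 'w', encoding='utf-8') as f:
        f.write(md_content)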
    
    return md_content

In [13]:
analyze_layout("../data/01_aisearch_docs/azure-search-concept.pdf", "../output")

Incomplete download: ('Connection broken: IncompleteRead(21061632 bytes read, 1430334 more expected)', IncompleteRead(21061632 bytes read, 1430334 more expected))


'# Azure Al Search の検索インデックス\n\n[アーティクル]·2024/02/16\n\nAzure Al Search の “検索インデックス"は検索可能なコンテンツであり、検索エンジン でインデックス作成、全文検索、ベクトル検索、ハイブリッド検索、フィルターされた クエリに使用できます。インデックスは、スキーマによって定義され、検索サービス に保存されます。2 番目のステップとしてデータのインポートが続きます。このコン テンツは検索サービス内に存在します。これは、最新の検索アプリケーションで想定さ れるミリ秒単位の応答時間に必要な、プライマリ データ ストアとは別のものです。イ ンデクサー主導のインデックス作成シナリオを除き、検索サービスがソース データに 接続したり、クエリを実行したりすることはありません。\n\n検索インデックスを作成して管理する場合、この記事は次の点を理解するのに役立ちま す。\n\n● コンテンツ(ドキュメントおよびスキーマ)\n\n● 物理データ構造\n\n● 基本操作\n\n今すぐに使いたいですか?代わりに、検索インデックスの作成に関する記事を参照して ください。\n\n\n## 検索インデックスのスキーマ\n\nAzure Al Search のインデックスには 検索ドキュメントが格納されます。概念的に、ド キュメントはインデックス内で検索可能なデータの1 つの単位です。たとえば、小売 業者に製品ごとのドキュメントがあり、ニュース組織に記事ごとのドキュメントがある 場合、旅行サイトにはホテルと目的地ごとのドキュメントがある場合があります。こ れらの概念をなじみのあるデータベースの同等のものに対応させるなら、検索インデッ クスはテーブルと同じで、ドキュメントはテーブルにおける行とほぼ同じです。\n\n次の例に示すように、ドキュメントの構造は“インデックス スキーマ"によって決まり ます。"フィールド"コレクションは通常、インデックスの最大の部分であり、各フィ ールドには、名前、データ型の割り当て、および使用方法を決定する許容される動作を 示す属性が設定されます。\n\nJSON\n\n{ "name": "name\\_of\\_index, unique across the service",\n\n"fiel

### [TBD] With images
https://github.com/Azure-Samples/document-intelligence-code-samples/blob/main/Python(v4.0)/Retrieval_Augmented_Generation_(RAG)_samples/sample_figure_understanding.ipynb

## [Optional] Text standardization and normalization
- LLMs are effective for text standardization and normalization, for example fixing OCR misrecognitions that fixed rules cannot anticipate.
- They can extend the capabilities of traditional rule-based text transformation.
- However, LLMs do not guarantee the same output every time, so use traditional rule-based transformations when output consistency is required.

### Using an LLM such as GPT


In [15]:
import re
import os
from openai import AzureOpenAI
import json

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
  api_version="2024-02-01"
)

system_message = """
# Your Role
You are an excellent AI assistant for proofreading text data. Your task is to ensure the provided text data is of high quality. You are only allowed to proofread. Adding or removing context from the original document is not allowed. Additionally, you cannot change the structure of the document.

# Examples of Corrections
- Grammar errors and typos
- OCR misrecognitions
- Inconsistencies in terminology and expressions

# Your input
text: 
"""

def correct_text_gpt(text):
    message_text = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": text},
    ]
    completion = client.chat.completions.create(
        model="gpt-4o",  # your Azure OpenAI deployment name
        messages=message_text,
        temperature=0,  # reduces, but does not eliminate, output variation
    )
    return completion.choices[0].message.content
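
Because the prompt only asks the model to proofread, the corrected text should stay close to the input. Below is a minimal sketch of a guard that rejects outputs that drift too far, using `difflib` from the standard library. `accept_correction` and the 0.8 threshold are assumptions chosen for illustration, not values from the original notebook.

```python
from difflib import SequenceMatcher

def accept_correction(original, corrected, min_similarity=0.8):
    """Return the corrected text if it stays close to the original, else the original."""
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches
    matcher = SequenceMatcher(None, original, corrected, autojunk=False)
    return corrected if matcher.ratio() >= min_similarity else original

print(accept_correction("Azure Al Search", "Azure AI Search"))  # Azure AI Search
```

In the notebook it would wrap the LLM call, for example `accept_correction(text, correct_text_gpt(text))`. `SequenceMatcher` can be slow on long inputs, so applying it per chunk is likely more practical than per document.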



### Using traditional rule-based text transformation

In [16]:
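# Example function for cleaning text data with fixed rules
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove duplicate lines, keeping the first occurrence of each;
    # blank lines are always kept so that paragraph breaks survive
    seen = set()
    unique_lines = []
    for line in text.split("\n"):
        if line.strip() and line in seen:
            continue
        seen.add(line)
        unique_lines.append(line)
    return "\n".join(unique_lines)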
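
For Japanese documents like the sample above, Unicode normalization is another deterministic rule worth considering: NFKC folds full-width letters, digits, and spaces into their ASCII equivalents. Below is a minimal sketch using the standard library. `normalize_text` is a hypothetical helper, and whether NFKC is appropriate depends on the corpus, since it also folds other compatibility characters (for example, half-width katakana becomes full-width).

```python
import re
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization and collapse runs of spaces and tabs."""
    # NFKC maps compatibility characters to canonical forms, e.g. "Ａ" -> "A", "１" -> "1"
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces/tabs into one space; newlines are left untouched
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()

print(normalize_text("Ａｚｕｒｅ　ＡＩ　 Ｓｅａｒｃｈ"))  # Azure AI Search
```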