# Teigaku Genzei (fixed-amount tax reduction) example

## Preparation
### Please read README.md and understand the licensing conditions and requirements for the source data distributed by the National Tax Agency of Japan.
- [README.md](./README.md)

### Source data
The URLs of the source PDF files are listed in [README.md](./README.md). Please download the following files yourself and store them in the `source/` directory.
- 0024001-021.pdf
- 0024004-072_01.pdf
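
Before running the later cells, it can help to confirm that both PDFs are in place. The following is a minimal sketch; the helper name `find_missing_sources` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_missing_sources(source_dir: str, names: list[str]) -> list[str]:
    """Return the expected file names that are absent from source_dir."""
    base = Path(source_dir)
    return [name for name in names if not (base / name).is_file()]


# File names taken from the list above; "source" is the directory named above.
expected = ["0024001-021.pdf", "0024004-072_01.pdf"]
missing = find_missing_sources("source", expected)
if missing:
    print(f"Missing source files: {missing}")
else:
    print("All source files are present.")
```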

### Prerequisites
Please prepare a Python virtual environment with the following dependencies.
- Python >= 3.10
- sdg_hub >= 0.7.1
- notebook
- ipykernel

No GPU or external LLM inference service is required in this notebook.
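
As a quick sanity check of the environment, the installed versions can be printed with the standard library. This is only a sketch: the helper `installed_versions` is hypothetical, and it assumes that the distribution names match the names in the list above.

```python
import sys
from importlib import metadata
from typing import Optional

# Names taken from the dependency list above. That these are the exact
# distribution names on the package index is an assumption.
REQUIRED = ["sdg_hub", "notebook", "ipykernel"]


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


python_ok = sys.version_info >= (3, 10)
print(f"Python {sys.version.split()[0]} (>= 3.10: {python_ok})")
for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

The sketch only reports versions; comparing `sdg_hub` against 0.7.1 is left to the reader.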

## Overview
This notebook covers the following topics, each of which corresponds to a section in [README.md](./README.md).
- Common preprocessing of source document
  - Extraction of text information from PDF source files by calling Docling.
  - QA pair extraction.
- SDG seed generation
  - Construction of seed contexts.
  - Construction of in-context learning examples.
  - SDG seed generation.

README.md describes how to perform the same steps with command-line Python tools and shell scripts.
For details of the input and output files and the options of each step, please refer to README.md.

For other topics, such as synthetic training data generation, fine-tuning, and task-specific evaluation of models, please refer to README.md as well as the sdg_hub and training_hub documentation.
Synthetic data generation (SDG) requires external LLM inference services or GPUs.
Training and evaluating models requires GPUs.


## Configuration.


In [None]:
source_input_dir = "source"
docparser_dir = "docparser"
source_name_list = ["0024001-021", "0024004-072_01"]
qa_table_dir = "qa_table"
context_dir = "context"
icl_source_name = source_name_list[1]
icl_path = "icl/icl.jsonl"
seed_source_name = source_name_list[0]
seed_path = "seed/seed_ja.jsonl"

# Docling configurations.
# Constants and type definitions
EXPORT_FORMATS = {
    "json": ("json", "export_to_dict"),  # Deep Search JSON format
    "text": ("txt", "export_to_text"),  # Plain text
    "markdown": ("md", "export_to_markdown"),  # Markdown with structure
    "html": ("html", "export_to_html"),  # HTML with styling
    "doctags": ("doctags", "export_to_document_tokens"),  # Document tokens
}

DEFAULT_CONFIG = {
    "pipeline": {
        "ocr": {
            "enabled": True,  # Enable/disable OCR processing
            "languages": ["es"],  # OCR language codes (engine-specific); note that the source PDFs are Japanese
        },
        "tables": {
            "enabled": True,  # Enable/disable table detection
            "cell_matching": True,  # Enable/disable cell matching in tables
        },
        "performance": {
            "threads": 4,  # Number of processing threads
            "device": "auto",  # Name of an AcceleratorDevice member (e.g., auto, cpu, cuda, mps)
        },
    },
    "export": {
        "formats": {
            "json": True,  # Deep Search JSON format
            "text": True,  # Plain text
            "markdown": True,  # Markdown with structure
            "html": True,  # HTML with styling
            "doctags": True,  # Document tokens
        }
    },
}


## Common imports

In [None]:
import os
import pandas as pd

import jsonl_util

from sdg_hub.core.utils.logger_config import setup_logger

logger = setup_logger(__name__)


## Extraction of text information from PDF source files by calling Docling.

The code in this block is a simplified version of [docparser_v2.py](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/docparser_v2.py).

The text information in the source PDF files is extracted using Docling.

- Inputs: Source PDF files. These files contain frequently asked questions and their answers.
- Outputs: Two sets of (.doctags, .html, .json, .md, .txt) files. We only use .json files in the following sections.
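
Since only the .json files are used later, the other export formats could be switched off. The sketch below mirrors the format filter in `export_document`; the helpers `json_only` and `enabled_extensions` and the reduced `EXPORT_EXTENSIONS` table are hypothetical names introduced for illustration.

```python
import copy

# Format name -> file extension, as in the EXPORT_FORMATS table of the
# configuration cell (the export method names are omitted here).
EXPORT_EXTENSIONS = {
    "json": "json",
    "text": "txt",
    "markdown": "md",
    "html": "html",
    "doctags": "doctags",
}


def json_only(config: dict) -> dict:
    """Return a copy of config in which only the JSON export is enabled."""
    new_config = copy.deepcopy(config)
    formats = new_config["export"]["formats"]
    for name in EXPORT_EXTENSIONS:
        formats[name] = name == "json"
    return new_config


def enabled_extensions(config: dict) -> list[str]:
    """List the extensions that would be written for the given config."""
    formats = config["export"]["formats"]
    return [ext for name, ext in EXPORT_EXTENSIONS.items() if formats.get(name, True)]


all_formats = {"export": {"formats": {name: True for name in EXPORT_EXTENSIONS}}}
print(enabled_extensions(all_formats))             # ['json', 'txt', 'md', 'html', 'doctags']
print(enabled_extensions(json_only(all_formats)))  # ['json']
```

Because `export_document_new_docling` reads `DEFAULT_CONFIG` directly, one way to apply this would be to assign `DEFAULT_CONFIG = json_only(DEFAULT_CONFIG)` before calling it.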

In [None]:

# Standard
from pathlib import Path
import json
import time

# Third Party
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
)
from docling.datamodel.accelerator_options import (
    AcceleratorDevice,
    AcceleratorOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

def setup_pipeline_options(config: dict) -> PdfPipelineOptions:
    """Configure pipeline options from config dictionary."""
    pipeline_config = config["pipeline"]

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = pipeline_config["ocr"]["enabled"]
    pipeline_options.do_table_structure = pipeline_config["tables"]["enabled"]
    if isinstance(pipeline_options.table_structure_options, TableStructureOptions):
        pipeline_options.table_structure_options.do_cell_matching = pipeline_config[
            "tables"
        ]["cell_matching"]
    pipeline_options.ocr_options.lang = pipeline_config["ocr"]["languages"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=pipeline_config["performance"]["threads"],
        device=getattr(
            AcceleratorDevice, pipeline_config["performance"]["device"].upper()
        ),
    )
    return pipeline_options


def export_document(
    conv_result, doc_filename: str, output_dir: Path, config: dict
) -> None:
    """Export document in configured formats."""
    enabled_formats = {
        k: v
        for k, v in EXPORT_FORMATS.items()
        if config["export"]["formats"].get(k, True)
    }

    for format_name, (extension, export_method) in enabled_formats.items():
        try:
            content = getattr(conv_result.document, export_method)()
            output_path = output_dir / f"{doc_filename}.{extension}"

            with output_path.open("w", encoding="utf-8") as fp:
                if isinstance(content, (dict, list)):
                    json.dump(content, fp, ensure_ascii=False, indent=2)
                else:
                    fp.write(content)

            logger.debug(f"Successfully exported {format_name} format to {output_path}")

        except Exception as e:
            logger.error(f"Failed to export {format_name} format: {str(e)}")
            raise

def export_document_new_docling(
    input_dir: Path,
    output_dir: Path,
    # config: Optional[Path],
):
    """Convert PDF documents and export them in multiple formats."""
    # config_data = load_config(config)
    config_data = DEFAULT_CONFIG

    file_paths = list(input_dir.glob("*.pdf"))
    if not file_paths:
        logger.warning(f"No PDF files found in {input_dir}")
        return

    logger.info(f"Found {len(file_paths)} PDF files to process")

    pipeline_options = setup_pipeline_options(config_data)
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    output_dir.mkdir(parents=True, exist_ok=True)
    success_count = failure_count = 0
    start_time = time.time()

    for file_path in file_paths:
        logger.info(f"Processing {file_path}")
        try:
            conv_result = doc_converter.convert(file_path)
            doc_filename = conv_result.input.file.stem

            export_document(conv_result, doc_filename, output_dir, config_data)
            success_count += 1
            logger.info(f"Successfully processed {file_path}")

        except Exception as e:
            failure_count += 1
            logger.error(f"Failed to process {file_path}: {str(e)}")
            continue

    processing_time = time.time() - start_time

    logger.info(
        f"Processed {success_count + failure_count} docs in {processing_time:.2f} seconds"
        f"\n  Successful: {success_count}"
        f"\n  Failed: {failure_count}"
    )


In [None]:
export_document_new_docling(input_dir=Path(source_input_dir), output_dir=Path(docparser_dir))
logger.info(f"Done! Please check the outputs in {docparser_dir}.")

## QA pair extraction.

The QA pairs in the JSON files from the previous step are extracted using a Python script.
The extracted QA pairs are used as test cases in the evaluation, and as in-context learning examples and seed contexts in the later sections.

The glossary sections (term-definition pairs) in the same files are extracted in the same way.
The glossary files are currently not used, but are reserved for future extension.

- Inputs: JSON files that are outputs of Docling.
- Outputs: Two CSV files per source, one for the QA pairs (\*.csv) and one for the glossary (\*_glossary.csv).
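
Once the extraction cell has run, the CSV files can be inspected without assuming any column names. The helper `summarize_csv` below is a hypothetical sketch that reports the header row and the number of data rows; files that do not exist yet are skipped.

```python
import csv
import os


def summarize_csv(path: str) -> tuple[list[str], int]:
    """Return the header row and the number of data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


# Directory and source names follow the configuration cell of this notebook.
for name in ["0024001-021", "0024004-072_01"]:
    for suffix in [".csv", "_glossary.csv"]:
        path = os.path.join("qa_table", name + suffix)
        if os.path.exists(path):
            print(path, summarize_csv(path))
```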


In [None]:

import json_qa
import json_glossary
import json_util

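def make_common_fn(docparser_dir: str, source_name_list: list[str], qa_table_dir: str):
    """Extract QA pairs and glossary entries from each Docling JSON file into CSV files."""
    # Make sure the output directory exists before the CSV files are written.
    os.makedirs(qa_table_dir, exist_ok=True)
    for source_name in source_name_list: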
        in_file = os.path.join(docparser_dir, source_name + ".json")
        out_file = os.path.join(qa_table_dir, source_name + ".csv")
        out_glossary_file = os.path.join(qa_table_dir, source_name + "_glossary.csv")

        data = json_util.read_json_file(in_file)

        qa_pairs = json_qa.extract_qa_pairs(data)
        json_qa.save_to_csv(qa_pairs, out_file)

        qa_glossary_pairs = json_glossary.extract_glossary(data)
        json_glossary.save_to_csv(qa_glossary_pairs, out_glossary_file)

make_common_fn(
    docparser_dir=docparser_dir,
    source_name_list=source_name_list,
    qa_table_dir=qa_table_dir
)
logger.info(f"Done! Please check the outputs in {qa_table_dir}.")


## Construction of seed contexts.

A context is a chunk of text which contains the knowledge to be taught to the student model.
Since the source files are FAQ documents, we compose contexts by concatenating the answers of the QA pairs.
The questions are not used in this step.

- Inputs: CSV files of the QA pairs from the previous step.
- Outputs: CSV files of the contexts.
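
The idea of composing contexts from answers can be illustrated with a toy example. This is not the logic of `make_context.make_context`, whose grouping rules are not shown in this notebook; the fixed group size, the newline separator, and the helper name `concat_answers` are all assumptions made for illustration.

```python
def concat_answers(answers: list[str], group_size: int = 3) -> list[str]:
    """Join consecutive answers into contexts of up to group_size answers each."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        "\n".join(answers[start:start + group_size])
        for start in range(0, len(answers), group_size)
    ]


# Placeholder answers; real answers would come from the QA CSV files.
answers = ["A1", "A2", "A3", "A4", "A5"]
print(concat_answers(answers, group_size=2))
# ['A1\nA2', 'A3\nA4', 'A5']
```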


In [None]:
import make_context

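def make_context_fn(qa_table_dir: str, source_name_list: list[str], context_dir: str):
    """Build a context CSV for each source from its QA and glossary CSV files."""
    # DataFrame.to_csv does not create missing directories.
    os.makedirs(context_dir, exist_ok=True)
    for source_name in source_name_list: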
        input_qa_path = os.path.join(qa_table_dir, source_name + ".csv")
        input_glossary_path = os.path.join(qa_table_dir, source_name + "_glossary.csv")
        output_context_path = os.path.join(context_dir, source_name + "_context.csv")

        qa_df = pd.read_csv(input_qa_path, encoding="utf8")
        glossary_df = pd.read_csv(input_glossary_path, encoding="utf8")

        context_df = make_context.make_context(qa_df, glossary_df, make_context.OPT_GLOSSARY_APPENDIX)

        context_df.to_csv(output_context_path, index=False, encoding="utf8")

make_context_fn(
    qa_table_dir=qa_table_dir,
    source_name_list=source_name_list,
    context_dir=context_dir
)
logger.info(f"Done! Please check the outputs in {context_dir}.")


## Construction of in-context learning examples.

In-context learning (ICL) examples show the teacher model how to synthesize QA pairs from a context.
In sdg_hub, an ICL example consists of three QA pairs and one context that is related to the QAs.
We compose ICL examples from one source, 0024004-072_01.pdf.
The other source is used for the seed contexts.

- Inputs:
  - CSV file of QA pairs extracted from the source for ICL (see above).
  - CSV file of contexts extracted from the source for ICL (see above).
- Outputs:
  - ICL file in JSONL format.
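
To make the "three QA pairs and one context" structure concrete, here is a minimal sketch of one ICL record serialized as a JSONL line. The field names (`icl_document`, `icl_query_N`, `icl_response_N`) are assumptions about what sdg_hub flows expect, and `build_icl_record` is a hypothetical helper; the actual output of `make_icl.make_icl` may differ.

```python
import json


def build_icl_record(context: str, qa_pairs: list[tuple[str, str]]) -> dict:
    """Build one ICL example from a context and exactly three (question, answer) pairs."""
    if len(qa_pairs) != 3:
        raise ValueError("an ICL example needs exactly three QA pairs")
    # Field names below are assumed, not confirmed by this notebook.
    record = {"icl_document": context}
    for i, (question, answer) in enumerate(qa_pairs, start=1):
        record[f"icl_query_{i}"] = question
        record[f"icl_response_{i}"] = answer
    return record


record = build_icl_record(
    "Context text.",
    [("Q1?", "A1."), ("Q2?", "A2."), ("Q3?", "A3.")],
)
# One JSON object per line; ensure_ascii=False keeps Japanese text readable.
print(json.dumps(record, ensure_ascii=False))
```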

In [None]:
import make_icl

def make_icl_fn(input_qa_path: str, input_context_path: str, output_icl_path: str, short_context: bool):
    qa_df = pd.read_csv(input_qa_path, encoding="utf8")
    context_df = pd.read_csv(input_context_path, encoding="utf8")

    icl_list = make_icl.make_icl(qa_df, context_df, short_context)

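    # Create the output directory if needed before writing the JSONL file.
    os.makedirs(os.path.dirname(output_icl_path) or ".", exist_ok=True)
    jsonl_util.write_jsonl_file(output_icl_path, icl_list)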

make_icl_fn(
    input_qa_path=os.path.join(qa_table_dir, icl_source_name + ".csv"),
    input_context_path=os.path.join(context_dir, icl_source_name + "_context.csv"),
    output_icl_path=icl_path,
    short_context=True
)
logger.info(f"Done! Please check the outputs in {icl_path}.")


## SDG seed generation.

The SDG seed data is the input to sdg_hub, where the seed contexts are converted into
a set of question-answer pairs synthesized by a teacher model.

The seed contexts come from the other source, 0024001-021.pdf.
They are combined with the ICL examples in the output.

Note that the `OPT_JOIN_METHOD_CARTESIAN` option generates every combination of ICL examples and seed contexts,
which diversifies the synthesized QA pairs.
For example, with 4 ICL examples and 7 seed contexts, there will be 4 x 7 = 28 SDG seeds.

- Inputs: 
  - CSV file of the seed contexts.
  - JSONL file of the ICL examples.
- Outputs: 
  - JSONL file of the SDG seeds.
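
The Cartesian join can be sketched with `itertools.product`. This only illustrates the counting argument; it is not the implementation of `make_seed.make_seed`, and the keys `icl_id` and `document` as well as the helper `cartesian_seeds` are hypothetical.

```python
from itertools import product


def cartesian_seeds(icl_examples: list[dict], contexts: list[str]) -> list[dict]:
    """Pair every ICL example with every seed context."""
    return [
        {**icl, "document": context}
        for icl, context in product(icl_examples, contexts)
    ]


# Placeholder data: 4 ICL examples and 7 seed contexts.
icl_examples = [{"icl_id": i} for i in range(4)]
contexts = [f"context {j}" for j in range(7)]
seeds = cartesian_seeds(icl_examples, contexts)
print(len(seeds))  # 28
```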

In [None]:
import make_seed

def make_seed_fn(input_context_path: str, input_icl_path: str, output_seed_path: str, opt_join_method: str):
    context_df = pd.read_csv(input_context_path, encoding="utf8")[["context", "qindex"]]
    icl_list = jsonl_util.read_jsonl_file(input_icl_path)
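    out_df = make_seed.make_seed(context_df, icl_list, opt_join_method)

    # DataFrame.to_json does not create missing directories.
    os.makedirs(os.path.dirname(output_seed_path) or ".", exist_ok=True)
    out_df.to_json(output_seed_path, orient="records", lines=True, force_ascii=False)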

make_seed_fn(
    input_context_path=os.path.join(context_dir, seed_source_name + "_context.csv"),
    input_icl_path=icl_path,
    output_seed_path=seed_path,
    opt_join_method=make_seed.OPT_JOIN_METHOD_CARTESIAN
)
logger.info(f"Done! Please check the outputs in {seed_path}.")
