<a href="https://colab.research.google.com/github/juhumkwon/Data/blob/main/%EB%B9%84%EC%A0%95%ED%98%95%ED%8C%8C%EC%9D%BC2XML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers sentencepiece



In [6]:

import xml.etree.ElementTree as ET
from xml.dom import minidom

# Structured information (entered by hand)
extracted_info = {
    "vendor": "LG CNS",
    "date": "2025-07-18",
    "amount": "5000"
}

# Create the XML elements
invoice = ET.Element("invoice")
ET.SubElement(invoice, "vendor").text = extracted_info["vendor"]
ET.SubElement(invoice, "date").text = extracted_info["date"]
ET.SubElement(invoice, "amount").text = extracted_info["amount"]

# Convert to a string (pretty print)
rough_string = ET.tostring(invoice, 'utf-8')
reparsed = minidom.parseString(rough_string)
pretty_xml = reparsed.toprettyxml(indent="  ")   # two-space indentation

print("[XML Output]:")
print(pretty_xml)


[XML Output]:
<?xml version="1.0" ?>
<invoice>
  <vendor>LG CNS</vendor>
  <date>2025-07-18</date>
  <amount>5000</amount>
</invoice>
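
The dictionary-to-XML step above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the helper name `dict_to_xml` is hypothetical, and it assumes Python 3.9 or later, where `ET.indent` is available as an alternative to the `minidom` round trip.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(root_tag, fields):
    """Build an indented XML string from a flat dict of tag -> text.

    Hypothetical helper for illustration; assumes every key is a valid XML tag name.
    """
    root = ET.Element(root_tag)
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = str(value)
    ET.indent(root, space="  ")  # in-place indentation, Python 3.9+
    return ET.tostring(root, encoding="unicode")


extracted_info = {"vendor": "LG CNS", "date": "2025-07-18", "amount": "5000"}
print(dict_to_xml("invoice", extracted_info))
```

Unlike `toprettyxml`, this variant emits no XML declaration line, and ElementTree escapes special characters such as `&` in the text values when serializing.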



In [7]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import xml.etree.ElementTree as ET
from xml.dom import minidom

# 1. Load the model (T5-small)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

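# 2. Natural-language input sentence (kept for reference; the prompt below hard-codes its fields and never reads this variable)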
input_text = "Invoice from LG CNS dated July 18, 2025, with a total amount of $5000."

# 3. Prompt intended to elicit structured output (instructs the model to return XML)
prompt = (
    "Extract information and return XML:\n"
    "Vendor: LG CNS\n"
    "Date: July 18, 2025\n"
    "Amount: $5000"
)

# 4. Run model inference
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=128)
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

# 5. Print the inference result as-is
print("[T5 model output]:")
print(decoded_output)

# 6. Build the XML elements directly from hand-entered values (the model output is not used)
extracted_info = {
    "vendor": "LG CNS",
    "date": "2025-07-18",
    "amount": "5000"
}

invoice = ET.Element("invoice")
ET.SubElement(invoice, "vendor").text = extracted_info["vendor"]
ET.SubElement(invoice, "date").text = extracted_info["date"]
ET.SubElement(invoice, "amount").text = extracted_info["amount"]

# 7. Pretty-print the XML
rough_string = ET.tostring(invoice, 'utf-8')
reparsed = minidom.parseString(rough_string)
pretty_xml = reparsed.toprettyxml(indent="  ")

print("\n[Final XML Output]:")
print(pretty_xml)


[T5 model output]:
Extract information and return XML: Vendor: LG CNS Date: July 18, 2025 Mont

[Final XML Output]:
<?xml version="1.0" ?>
<invoice>
  <vendor>LG CNS</vendor>
  <date>2025-07-18</date>
  <amount>5000</amount>
</invoice>
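
In the cell above, `extracted_info` is typed in by hand rather than derived from `input_text`. As one possible way to close that gap for this specific sentence shape, here is a rule-based sketch using a regular expression. This is a swapped-in technique, not what the notebook does: the function name `extract_invoice_fields` and the pattern are assumptions for illustration, and the pattern only handles sentences worded like the example.

```python
import re
from datetime import datetime

# Assumed sentence shape: "Invoice from <vendor> dated <Month D, YYYY>, with a total amount of $<amount>"
INVOICE_PATTERN = re.compile(
    r"Invoice from (?P<vendor>.+?) dated (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}),"
    r" with a total amount of \$(?P<amount>[\d,]+(?:\.\d+)?)"
)


def extract_invoice_fields(text):
    """Hypothetical helper: pull vendor, ISO date, and amount out of one sentence."""
    match = INVOICE_PATTERN.search(text)
    if match is None:
        raise ValueError("text does not match the expected invoice pattern")
    # %B parses the full month name; this relies on an English (or C) locale.
    iso_date = datetime.strptime(match["date"], "%B %d, %Y").date().isoformat()
    return {
        "vendor": match["vendor"],
        "date": iso_date,
        "amount": match["amount"].replace(",", ""),  # drop thousands separators
    }


input_text = "Invoice from LG CNS dated July 18, 2025, with a total amount of $5000."
print(extract_invoice_fields(input_text))
```

For the example sentence this yields the same three values that the notebook enters by hand, so the result can be passed to the XML-building code unchanged.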



In [8]:
# TensorFlow-based T5 model + XML output example (single cell)

from transformers import T5Tokenizer, TFT5ForConditionalGeneration
import xml.etree.ElementTree as ET
from xml.dom import minidom
import tensorflow as tf

# 1. Load the model and tokenizer (TensorFlow version)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")

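# 2. Input sentence (kept for reference; the prompt below hard-codes its fields and never reads this variable)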
input_text = "Invoice from LG CNS dated July 18, 2025, with a total amount of $5000."

# 3. Prompt setup (T5 depends on an explicit task instruction in the input)
prompt = (
    "Extract information and return XML:\n"
    "Vendor: LG CNS\n"
    "Date: July 18, 2025\n"
    "Amount: $5000"
)

# 4. Tokenize (returns TensorFlow tensors)
inputs = tokenizer(prompt, return_tensors="tf")

# 5. Run model inference
output = model.generate(**inputs, max_length=128)
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

# 6. Print the result generated by the T5 model
print("[T5 model output]:")
print(decoded_output)

# 7. Build the structured XML directly from hand-entered values (the model output is not used)
extracted_info = {
    "vendor": "LG CNS",
    "date": "2025-07-18",
    "amount": "5000"
}

invoice = ET.Element("invoice")
ET.SubElement(invoice, "vendor").text = extracted_info["vendor"]
ET.SubElement(invoice, "date").text = extracted_info["date"]
ET.SubElement(invoice, "amount").text = extracted_info["amount"]

# 8. Pretty-print the XML
rough_string = ET.tostring(invoice, 'utf-8')
reparsed = minidom.parseString(rough_string)
pretty_xml = reparsed.toprettyxml(indent="  ")

print("\n[Final XML Output]:")
print(pretty_xml)


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.


[T5 model output]:
Extract information and return XML: Vendor: LG CNS Date: July 18, 2025 Mont

[Final XML Output]:
<?xml version="1.0" ?>
<invoice>
  <vendor>LG CNS</vendor>
  <date>2025-07-18</date>
  <amount>5000</amount>
</invoice>
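
The recorded model output in both T5 cells is an echo of the prompt, not XML, which is why the final XML is built by hand. A small guard can make that fallback explicit. The sketch below is an illustration only: the helper name `parse_xml_or_none` is hypothetical, and `model_output` is copied from the output recorded above rather than regenerated.

```python
import xml.etree.ElementTree as ET


def parse_xml_or_none(text):
    """Hypothetical helper: return the parsed root element, or None if text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# Copied from the recorded T5 output above.
model_output = "Extract information and return XML: Vendor: LG CNS Date: July 18, 2025 Mont"

root = parse_xml_or_none(model_output)
if root is None:
    print("Model output is not well-formed XML; falling back to the hand-built XML.")
else:
    print("Model returned XML with root tag:", root.tag)
```

A check like this only confirms well-formedness. It does not verify that the expected `vendor`, `date`, and `amount` elements are present.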

