# Version 1

## Original prompt

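Your task is to write complete code that finds every element tagged `<ref>` in the reference XML files, uses the plain text inside each `<ref>` (with tags stripped) as the search target, then locates the txt file with the same filename in each of two different txt sources, finds the string in each txt file's content that is most similar to the search target, and compares the two. Because the `<ref>` strings are citations of Buddhist scriptures quoted in the papers, they usually contain source information such as the canon name, volume number, sutra number, page, column, and line. The comparison focuses on whether the volume/sutra numbers and the page/column/line information were lost during OCR, so the TXT with the longer string is treated as the version with fewer omissions.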

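Proceed as follows:

1. Build a list of all XML files in the XML source folder ("..\XML").
2. In each XML file, find all `<ref>` tags and build a list of plain-text strings with all tags removed, to serve as search targets.
3. In the two TXT source folders ("..\TXT" and "..\NTXT"), find the txt file with the same filename as the XML file, and use each search target as the key to find the most similar string in each. The string found may be longer than the search target, because fewer characters are missing.
4. Compare the similar strings from the two sources and keep a tally. If they are identical, add 1 to each and do not list them; if they differ (including not found), list both strings, but add 1 only to the longer version.
5. Finally, print the tally for reference.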

## LLM response

Okay, I will provide the Python code to perform the analysis as you've described. The code will:
1.  Identify all XML files in a specified `../XML` directory.
2.  For each XML file, extract text from all `<ref>` tags.
3.  For each extracted reference text, search for the most similar string in corresponding text files (same filename, different extension) located in `../TXT` and `../NTXT` directories.
4.  Compare these two found strings:
    *   If they are identical and found in both sources, scores for both sources are incremented. These are not listed.
    *   If they are different, or if one is missing, the original reference and both (attempted) matches will be listed. The source providing the longer string gets a score increment. If only one source finds a match, it gets the score.
5.  Finally, it will print the overall statistics, including scores for each text source.
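The scoring rule in step 4 can be isolated into a small function, which makes it easier to check against the prompt. This is only a sketch: `score_pair` is a hypothetical helper name, not part of the generated script, and it assumes a missing match is represented as `None`.

```python
def score_pair(match1, match2):
    """Return (score1, score2, listed) for one reference.

    Hypothetical helper illustrating the rule described above.
    A match of None means "not found" in that source.
    """
    if match1 is not None and match1 == match2:
        # Identical in both sources: both score, pair is not listed.
        return 1, 1, False
    # Treat "not found" as shorter than any found string.
    len1 = len(match1) if match1 is not None else -1
    len2 = len(match2) if match2 is not None else -1
    if len1 > len2:
        return 1, 0, True
    if len2 > len1:
        return 0, 1, True
    # Same length but different text, or missing from both: listed, no score.
    return 0, 0, True
```

Keeping the rule in one place would also make the "same length but different" case explicit, which the prompt does not cover.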

Here's the plan:
*   Define paths to the XML, TXT, and NTXT directories relative to the notebook's parent directory.
*   Implement a helper function to extract clean text content from XML elements, specifically for `<ref>` tags.
*   Implement a helper function to find the best matching line in a given text file for a target string using `difflib`.
*   The main part of the script will iterate through XML files, process `<ref>` tags, perform comparisons, and tally scores.
*   Print detailed information for differing/missing matches and summary statistics at the end.
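One caveat with line-level matching: a citation is often only a fragment of a longer line, so comparing the target against whole lines may score low even when the citation is present. A possible alternative, sketched here under that assumption, slides a window over the text and scores each window with `difflib.SequenceMatcher`. The function name `best_window_match` and the `extra` parameter are hypothetical, and the approach is quadratic, so it would likely need pre-filtering on large files.

```python
import difflib

def best_window_match(target, text, extra=4):
    """Return (best_substring, ratio) for the window of `text` most similar to `target`.

    Hypothetical sketch: window sizes range from len(target) - extra to
    len(target) + extra, so slightly shorter or longer variants are considered.
    """
    best, best_ratio = None, 0.0
    if not target or not text:
        return best, best_ratio
    matcher = difflib.SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(target)  # seq2 is cached, so only seq1 changes per window
    for size in range(max(1, len(target) - extra), len(target) + extra + 1):
        for start in range(0, max(1, len(text) - size + 1)):
            window = text[start:start + size]
            matcher.set_seq1(window)
            ratio = matcher.ratio()
            if ratio > best_ratio:
                best, best_ratio = window, ratio
    return best, best_ratio
```

For example, `best_window_match("T25, no. 1509, p. 57c", line)` would be called once per candidate line or once on the whole file content, depending on how much text there is.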

### [exploration.ipynb](file:///g%3A/Shared%20drives/CBETA%E5%8F%83%E7%85%A7%E5%81%B5%E6%B8%AC/Journal-Data/%E6%97%A5%E6%96%87%E6%9C%9F%E5%88%8A/JIBS%20%E5%8D%B0%E5%BA%A6%E5%AD%B8%E4%BD%9B%E6%95%99%E5%AD%B8%E7%A0%94%E7%A9%B6/work/xml-20250212-happypig/exploration/exploration.ipynb)
Update the Python cell with the complete script.


In [2]:
import os
import xml.etree.ElementTree as ET
import difflib

def get_clean_text_from_element(element):
    """Extracts and cleans all text from an XML element, joining child text."""
    if element is None:
        return ""
    # Join text pieces without injecting spaces at tag boundaries, then collapse whitespace runs
    return " ".join("".join(element.itertext()).split())

def find_best_match_in_lines(target_text, lines, cutoff=0.6):
    """Finds the best matching line from a list of lines."""
    if not lines or not target_text:
        return None
    matches = difflib.get_close_matches(target_text, lines, n=1, cutoff=cutoff)
    if matches:
        return matches[0]
    return None

def process_files():
    # Assuming the notebook is in 'exploration', so '..' goes to 'xml-20250212-happypig'
    current_notebook_dir = os.getcwd() # Should be .../exploration
    base_dir = os.path.abspath(os.path.join(current_notebook_dir, ".."))

    xml_dir_name = "XML"
    txt_dir1_name = "TXT"  # Source 1
    txt_dir2_name = "NTXT" # Source 2

    xml_dir = os.path.join(base_dir, xml_dir_name)
    txt_dir1 = os.path.join(base_dir, txt_dir1_name)
    txt_dir2 = os.path.join(base_dir, txt_dir2_name)

    print(f"Base directory: {base_dir}")
    print(f"Looking for XML files in: {xml_dir}")
    print(f"Looking for TXT files (Source 1) in: {txt_dir1}")
    print(f"Looking for TXT files (Source 2) in: {txt_dir2}")
    print("-" * 30)

    if not os.path.isdir(xml_dir):
        print(f"Error: XML directory not found at {xml_dir}")
        return
    if not os.path.isdir(txt_dir1):
        print(f"Error: {txt_dir1_name} directory not found at {txt_dir1}")
        return
    if not os.path.isdir(txt_dir2):
        print(f"Error: {txt_dir2_name} directory not found at {txt_dir2}")
        return

    txt_dir1_score = 0
    txt_dir2_score = 0
    identical_unlisted_count = 0
    total_refs_processed = 0
    total_xml_files_processed = 0
    mismatched_refs_count = 0

    xml_files = [f for f in os.listdir(xml_dir) if f.endswith(".xml")]
    
    if not xml_files:
        print(f"No XML files found in {xml_dir}.")
        return

    for xml_filename in xml_files:
        total_xml_files_processed += 1
        xml_filepath = os.path.join(xml_dir, xml_filename)
        base_filename = os.path.splitext(xml_filename)[0]

        txt1_filepath = os.path.join(txt_dir1, base_filename + ".txt")
        txt2_filepath = os.path.join(txt_dir2, base_filename + ".txt")

        try:
            tree = ET.parse(xml_filepath)
            root = tree.getroot()
        except ET.ParseError as e:
            print(f"Error parsing XML file {xml_filename}: {e}")
            continue

        refs_in_file = root.findall(".//ref")
        if not refs_in_file:
            # print(f"No <ref> tags found in {xml_filename}")
            continue
            
        # print(f"\nProcessing {xml_filename}...")

        txt1_lines = []
        if os.path.exists(txt1_filepath):
            with open(txt1_filepath, 'r', encoding='utf-8') as f:
                txt1_lines = [line.strip() for line in f.readlines()]
        else:
            print(f"Warning: {txt_dir1_name} file not found: {txt1_filepath}")

        txt2_lines = []
        if os.path.exists(txt2_filepath):
            with open(txt2_filepath, 'r', encoding='utf-8') as f:
                txt2_lines = [line.strip() for line in f.readlines()]
        else:
            print(f"Warning: {txt_dir2_name} file not found: {txt2_filepath}")

        for ref_element in refs_in_file:
            original_ref_text = get_clean_text_from_element(ref_element)

            if not original_ref_text: # Skip if ref tag is empty
                # print(f"Skipping empty <ref> tag in {xml_filename}")
                continue
            # Count only non-empty refs so the total equals listed + unlisted
            total_refs_processed += 1

            match1 = find_best_match_in_lines(original_ref_text, txt1_lines)
            match2 = find_best_match_in_lines(original_ref_text, txt2_lines)

            found1 = match1 is not None
            found2 = match2 is not None

            if found1 and found2 and match1 == match2:
                txt_dir1_score += 1
                txt_dir2_score += 1
                identical_unlisted_count += 1
            else:
                mismatched_refs_count +=1
                print(f"\n--- Difference found for ref in: {xml_filename} ---")
                print(f"Original <ref> text: \"{original_ref_text}\"")
                print(f"  Match from {txt_dir1_name}: \"{match1 if found1 else 'Not found'}\"")
                print(f"  Match from {txt_dir2_name}: \"{match2 if found2 else 'Not found'}\"")
                
                if found1 and found2: # Both found but different
                    if len(match1) > len(match2):
                        txt_dir1_score += 1
                        print(f"  Conclusion: {txt_dir1_name} version is longer.")
                    elif len(match2) > len(match1):
                        txt_dir2_score += 1
                        print(f"  Conclusion: {txt_dir2_name} version is longer.")
                    else:
                        print(f"  Conclusion: Versions differ but are same length.")
                elif found1 and not found2: # Only match1 found
                    txt_dir1_score += 1
                    print(f"  Conclusion: Only found in {txt_dir1_name}.")
                elif not found1 and found2: # Only match2 found
                    txt_dir2_score += 1
                    print(f"  Conclusion: Only found in {txt_dir2_name}.")
                else: # Neither found
                    print(f"  Conclusion: Not found in either source.")
                print("---")


    print("\n" + "=" * 30)
    print("Final Statistics:")
    print("=" * 30)
    print(f"Total XML files processed: {total_xml_files_processed}")
    print(f"Total <ref> tags processed: {total_refs_processed}")
    print(f"References identical in both sources (not listed): {identical_unlisted_count}")
    print(f"References with differences or missing (listed): {mismatched_refs_count}")
    print("-" * 30)
    print(f"Score for {txt_dir1_name} source: {txt_dir1_score}")
    print(f"Score for {txt_dir2_name} source: {txt_dir2_score}")
    print("=" * 30)

if __name__ == '__main__':
    # This block allows the script to be run if saved as .py, 
    # but in Jupyter, you'll just run the cell.
    # To run in Jupyter, call process_files() directly.
    process_files()


Base directory: g:\Shared drives\CBETA參照偵測\Journal-Data\日文期刊\JIBS 印度學佛教學研究\work\xml-20250212-happypig
Looking for XML files in: g:\Shared drives\CBETA參照偵測\Journal-Data\日文期刊\JIBS 印度學佛教學研究\work\xml-20250212-happypig\XML
Looking for TXT files (Source 1) in: g:\Shared drives\CBETA參照偵測\Journal-Data\日文期刊\JIBS 印度學佛教學研究\work\xml-20250212-happypig\TXT
Looking for TXT files (Source 2) in: g:\Shared drives\CBETA參照偵測\Journal-Data\日文期刊\JIBS 印度學佛教學研究\work\xml-20250212-happypig\NTXT
------------------------------
Error: XML directory not found at g:\Shared drives\CBETA參照偵測\Journal-Data\日文期刊\JIBS 印度學佛教學研究\work\xml-20250212-happypig\XML




Made changes.
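
The run above stopped because no `XML` folder exists under the computed base directory. One way to make that failure easier to diagnose is to pass the base directory in explicitly and report every missing folder at once. The following is a minimal sketch, assuming the three folders sit side by side under one base; `resolve_source_dirs` is a hypothetical helper, not part of the generated script.

```python
from pathlib import Path

def resolve_source_dirs(base_dir, xml_name="XML", txt_names=("TXT", "NTXT")):
    """Return ({name: Path}, [names of folders that do not exist]).

    Hypothetical helper: the folder names default to those in the prompt,
    and base_dir is supplied by the caller instead of derived from os.getcwd().
    """
    base = Path(base_dir)
    dirs = {name: base / name for name in (xml_name, *txt_names)}
    missing = [name for name, path in dirs.items() if not path.is_dir()]
    return dirs, missing
```

A caller could then print all missing folders before doing any work, rather than returning at the first one.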

# Version 2

# Same prompt + source folders