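Your task is to write complete code that finds every element marked with <ref> in the reference XML files and uses the plain text inside each <ref> (with tags stripped) as the search target. For each XML file, locate the txt file of the same name in each of two different txt sources, find the string in each txt file that is most similar to the search target, and then compare the two. Because a <ref> string represents a Buddhist canon source cited in a paper, it usually contains source information such as the canon name, volume number, sutra number, page, column, and line number. The comparison focuses on whether the volume/sutra numbers and the page/column/line information were lost during OCR processing, so the TXT whose matched string is longer is treated as the version with fewer omissions.

The steps are as follows:

Build a list of all XML files under the XML source folder (".\XML"). Each journal's XML files are placed in a folder named after the journal issue number.
For each XML file, find all <ref> tags and build a list of plain-text strings with all tags removed, to serve as the search targets.
In the two TXT source folders (".\TXT" and ".\NTXT"), find the txt file with the same name as the XML file and, using each search target as the key, find the most similar string in each. The string may be longer than the search target because it has fewer missing characters. Each journal's txt files are placed in a folder named after the journal issue number.
Compare the similar strings from the two sources and tally the results. If they are identical, add 1 to each and do not list them; if they differ (including the not-found case), list both strings, but add 1 only to the longer version.
Finally, list the statistics for reference.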

Okay, I can help you with that. This script will perform the comparison as you've outlined.

It will:
1.  Iterate through all XML files in the specified XML directory and its subdirectories.
2.  For each XML file, it will find all `<ref>` tags and extract their text content (stripping inner tags).
3.  It will then locate corresponding `.txt` files in the TXT and NTXT directories.
4.  For each extracted reference string, it will search for the most similar string within the content of the corresponding TXT and NTXT files. The search for similar strings is done line by line, and within each line, it looks for candidate substrings of varying lengths around the length of the reference string.
5.  It compares the findings from TXT and NTXT:
    *   If identical, a counter is incremented.
    *   If different (including cases where one or both are not found), the script logs the details for printing and increments a counter for the version that has a longer string. If one is found and the other isn't, the found one is considered "longer".
6.  Finally, it prints the detailed differences and a summary of the statistics.
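
The tallying rule in step 5 can be sketched in isolation as follows. This is a minimal illustration, not the script's own code: the function name `classify_pair` and its label strings are assumptions made for the example, and an empty string stands in for "not found". The sample pairs are taken from the output printed at the end of this run.

```python
from collections import Counter


def classify_pair(txt_match, ntxt_match):
    """Label one reference by which source kept the longer matched string.

    An empty string means "not found" in that source (an assumption of this sketch).
    """
    if not txt_match and not ntxt_match:
        return "neither_found"
    if txt_match == ntxt_match:
        return "identical"
    if len(txt_match) > len(ntxt_match):
        return "txt_longer"      # also covers "found in TXT only"
    if len(ntxt_match) > len(txt_match):
        return "ntxt_longer"     # also covers "found in NTXT only"
    return "same_length_different"


# (TXT match, NTXT match) pairs copied from the run's printed differences
pairs = [
    ("T12,no.384", "12,no.384"),
    ("T85,no.2769,p.34", "T85,no.2769,p. 34"),
    ("T. Taisho Tripitaka", "T. taisho Tripitaka"),
]
tally = Counter(classify_pair(t, n) for t, n in pairs)
print(tally)
```

The second pair is a reminder that length is only a proxy for completeness: the NTXT string is longer there solely because of a space after `p.`.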

Please note that finding the "most similar string" in a large text is computationally expensive: for every reference, the script scores every window of every candidate length in every line with `SequenceMatcher`. Runtime therefore grows with the number of references, the size of the text files, and the length of each reference.
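
As a hedged sketch of how the per-line window scan could be made cheaper without changing what it returns: `difflib.SequenceMatcher` provides `real_quick_ratio()` and `quick_ratio()`, both documented as upper bounds on `ratio()`, so any window whose upper bound falls below the best score so far can be skipped. The function name `best_window` and its parameters are illustrative assumptions, not part of the script.

```python
from difflib import SequenceMatcher


def best_window(query, line, min_len, max_len):
    """Return (ratio, substring) for the window of `line` most similar to `query`.

    Scans every window of length min_len..max_len, but calls the costly
    ratio() only when the cheap upper bounds could still reach the current best.
    Ties are resolved in favor of the longer window.
    """
    best_ratio, best_sub = -1.0, ""
    matcher = SequenceMatcher(None, query, "")
    for cand_len in range(min_len, max_len + 1):
        for start in range(len(line) - cand_len + 1):
            candidate = line[start:start + cand_len]
            matcher.set_seq2(candidate)
            # Both calls are upper bounds on ratio(); skip hopeless windows.
            if matcher.real_quick_ratio() < best_ratio:
                continue
            if matcher.quick_ratio() < best_ratio:
                continue
            ratio = matcher.ratio()
            if ratio > best_ratio or (
                ratio == best_ratio and len(candidate) > len(best_sub)
            ):
                best_ratio, best_sub = ratio, candidate
    return best_ratio, best_sub


# Hypothetical line of OCR text; the query is a <ref> string with a dropped digit.
score, found = best_window("T12,no.84", "cf. T12,no.384,p.1058", 7, 12)
print(found)  # T12,no.384
```

Because windows are skipped only when their upper bound is strictly below the current best, ties are still evaluated, so the result should match the exhaustive scan. How much time this saves depends on the data and was not measured here.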



In [1]:
import os
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
import collections

# Helper: Get text from XML element, stripping tags and normalizing whitespace
def get_element_text(element):
    if element is None:
        return ""
    # Concatenate all text within the element and its descendants
    text = ''.join(element.itertext())
    # Normalize whitespace: split() drops leading/trailing whitespace and
    # collapses internal runs (spaces, tabs, newlines) into single spaces
    text = ' '.join(text.split())
    return text

# Helper: Calculate similarity ratio between two strings
def get_similarity_ratio(s1, s2):
    if not s1 and not s2:  # Both empty
        return 1.0
    if not s1 or not s2:  # One empty
        return 0.0
    return SequenceMatcher(None, s1, s2).ratio()

# Helper: Find the best matching substring for a query within a larger text content
def find_best_match_in_txt_content(query, text_content, min_similarity_threshold=0.6):
    if not query or not text_content:
        return ""

    lines = text_content.splitlines()
    best_overall_match_str = ""
    max_overall_similarity = -1.0  # Initialize lower than any possible ratio
    query_len = len(query)

    # Candidate lengths: allow shorter windows (the query may contain noise) or
    # longer ones (the text may be more complete than the query).
    # Lower bound: 80% of the query length, but at least 1 character.
    min_cand_len = max(1, int(query_len * 0.8))
    # Upper bound: the more generous of (120% of query length + 10) and (query length + 25).
    max_cand_len = max(int(query_len * 1.2) + 10, query_len + 25)

    for line in lines:
        if not line.strip(): # Skip empty or whitespace-only lines
            continue

        line_best_candidate_str = ""
        line_max_similarity_for_query = -1.0

        for cand_len in range(min_cand_len, max_cand_len + 1):
            for i in range(len(line) - cand_len + 1):
                candidate_substring = line[i : i + cand_len]
                similarity = get_similarity_ratio(query, candidate_substring)

                if similarity > line_max_similarity_for_query:
                    line_max_similarity_for_query = similarity
                    line_best_candidate_str = candidate_substring
                elif similarity == line_max_similarity_for_query:
                    # If similarity is tied, prefer the longer candidate string
                    if len(candidate_substring) > len(line_best_candidate_str):
                        line_best_candidate_str = candidate_substring
        
        # Update overall best match based on this line's findings
        if line_max_similarity_for_query > max_overall_similarity:
            max_overall_similarity = line_max_similarity_for_query
            best_overall_match_str = line_best_candidate_str
        elif line_max_similarity_for_query == max_overall_similarity:
            # If similarity is tied across lines, prefer the longer candidate string
            if len(line_best_candidate_str) > len(best_overall_match_str):
                best_overall_match_str = line_best_candidate_str

    if max_overall_similarity >= min_similarity_threshold:
        return best_overall_match_str
    return "" # Return empty if no match meets the threshold

# Main processing function
def process_files(xml_root_dir, txt_root_dir, ntxt_root_dir):
    stats = {
        "total_refs_processed": 0,
        "identical_matches": 0,
        "txt_longer": 0,
        "ntxt_longer": 0,
        "txt_found_ntxt_not": 0,
        "ntxt_found_txt_not": 0,
        "neither_found": 0,
        "different_content_same_length": 0,
        "problematic_files": collections.defaultdict(list) # To log errors for specific files
    }
    
    differences_log = [] # To store details of non-identical matches

    # Walk through the XML directory
    for subdir, _, files in os.walk(xml_root_dir):
        for file_name in files:
            if file_name.endswith(".xml"):
                xml_file_path = os.path.join(subdir, file_name)
                
                # Determine corresponding paths for TXT and NTXT files
                relative_path_to_xml_file = os.path.relpath(xml_file_path, xml_root_dir)
                base_name_no_ext, _ = os.path.splitext(relative_path_to_xml_file) # e.g., v55.3/148109
                
                txt_file_path = os.path.join(txt_root_dir, base_name_no_ext + ".txt")
                ntxt_file_path = os.path.join(ntxt_root_dir, base_name_no_ext + ".txt")

                # Read content of TXT file
                txt_content = ""
                if os.path.exists(txt_file_path):
                    try:
                        with open(txt_file_path, 'r', encoding='utf-8') as f:
                            txt_content = f.read()
                    except Exception as e:
                        stats["problematic_files"][txt_file_path].append(f"Error reading: {e}")
                else:
                    stats["problematic_files"][txt_file_path].append("File not found")

                # Read content of NTXT file
                ntxt_content = ""
                if os.path.exists(ntxt_file_path):
                    try:
                        with open(ntxt_file_path, 'r', encoding='utf-8') as f:
                            ntxt_content = f.read()
                    except Exception as e:
                        stats["problematic_files"][ntxt_file_path].append(f"Error reading: {e}")
                else:
                    stats["problematic_files"][ntxt_file_path].append("File not found")
                
                # Process the XML file
                try:
                    tree = ET.parse(xml_file_path)
                    root = tree.getroot()
                    
                    # Note: this path matches only un-namespaced <ref> elements. If an XML file
                    # declares a default namespace, the tag must be qualified as '{uri}ref'.
                    ref_elements = root.findall('.//ref')

                    for ref_elem in ref_elements:
                        stats["total_refs_processed"] += 1
                        search_target_ref = get_element_text(ref_elem)

                        if not search_target_ref: # Skip if ref text is empty after cleaning
                            continue

                        # Find best matches in TXT and NTXT content
                        match_in_txt = find_best_match_in_txt_content(search_target_ref, txt_content)
                        match_in_ntxt = find_best_match_in_txt_content(search_target_ref, ntxt_content)

                        found_in_txt = bool(match_in_txt)
                        found_in_ntxt = bool(match_in_ntxt)

                        log_this_difference = False

                        if found_in_txt and found_in_ntxt:
                            if match_in_txt == match_in_ntxt:
                                stats["identical_matches"] += 1
                            else: # Different content
                                log_this_difference = True
                                if len(match_in_txt) > len(match_in_ntxt):
                                    stats["txt_longer"] += 1
                                elif len(match_in_ntxt) > len(match_in_txt):
                                    stats["ntxt_longer"] += 1
                                else: # Same length, but different content
                                    stats["different_content_same_length"] += 1
                        elif found_in_txt and not found_in_ntxt:
                            stats["txt_found_ntxt_not"] += 1
                            stats["txt_longer"] += 1 # TXT is 'longer' than not found
                            log_this_difference = True
                        elif not found_in_txt and found_in_ntxt:
                            stats["ntxt_found_txt_not"] += 1
                            stats["ntxt_longer"] += 1 # NTXT is 'longer' than not found
                            log_this_difference = True
                        else: # Neither found
                            stats["neither_found"] += 1
                            log_this_difference = True # Log if neither found as per prompt
                        
                        if log_this_difference:
                            differences_log.append({
                                "xml_file": relative_path_to_xml_file,
                                "ref_target": search_target_ref,
                                "txt_match": match_in_txt if found_in_txt else "--- NOT FOUND ---",
                                "ntxt_match": match_in_ntxt if found_in_ntxt else "--- NOT FOUND ---"
                            })

                except ET.ParseError as e:
                    stats["problematic_files"][xml_file_path].append(f"XML ParseError: {e}")
                except Exception as e: # Catch any other errors during processing of one XML
                    stats["problematic_files"][xml_file_path].append(f"Generic error processing XML: {e}")
    
    # Output the differences
    if differences_log:
        print("\n--- Differences Found (and cases where one or both not found) ---")
        for entry in differences_log:
            txt_match_len = len(entry["txt_match"]) if entry["txt_match"] != "--- NOT FOUND ---" else 0
            ntxt_match_len = len(entry["ntxt_match"]) if entry["ntxt_match"] != "--- NOT FOUND ---" else 0
            print(f"\nXML File: {entry['xml_file']}")
            print(f"  Ref Target: '{entry['ref_target']}'")
            print(f"  TXT Match : '{entry['txt_match']}' (len: {txt_match_len})")
            print(f"  NTXT Match: '{entry['ntxt_match']}' (len: {ntxt_match_len})")

    # Output the final statistics
    print("\n--- Statistics ---")
    print(f"Total <ref> elements processed: {stats['total_refs_processed']}")
    print(f"Identical matches in TXT and NTXT: {stats['identical_matches']}")
    print(f"TXT version longer (or found when NTXT not): {stats['txt_longer']}")
    print(f"NTXT version longer (or found when TXT not): {stats['ntxt_longer']}")
    print(f"Different content, but same length: {stats['different_content_same_length']}")
    print(f"Found only in TXT (NTXT not found): {stats['txt_found_ntxt_not']}")
    print(f"Found only in NTXT (TXT not found): {stats['ntxt_found_txt_not']}")
    print(f"Not found in either TXT or NTXT: {stats['neither_found']}")

    if stats["problematic_files"]:
        print("\n--- Problematic Files/Paths Encountered ---")
        for f_path, errors in stats["problematic_files"].items():
            print(f"File/Path: {f_path}")
            for err_msg in errors:
                print(f"  - {err_msg}")
    
    return stats


# --- Main execution block for the notebook cell ---
# Define root directories relative to the workspace (c:\Project\CBRD)
# The script assumes it's being run from a notebook located at c:\Project\CBRD
xml_directory = "XML"
txt_directory = "TXT"
ntxt_directory = "NTXT"

# Convert to absolute paths for clarity in output, though relative paths work too
abs_xml_dir = os.path.abspath(xml_directory)
abs_txt_dir = os.path.abspath(txt_directory)
abs_ntxt_dir = os.path.abspath(ntxt_directory)

print("Starting comparison process...")
print(f"Using XML directory: {abs_xml_dir}")
print(f"Using TXT directory: {abs_txt_dir}")
print(f"Using NTXT directory: {abs_ntxt_dir}")

# Check if directories exist before processing
all_dirs_exist = True
if not os.path.isdir(abs_xml_dir):
    print(f"Error: XML directory not found at {abs_xml_dir}")
    all_dirs_exist = False
if not os.path.isdir(abs_txt_dir):
    print(f"Error: TXT directory not found at {abs_txt_dir}")
    all_dirs_exist = False
if not os.path.isdir(abs_ntxt_dir):
    print(f"Error: NTXT directory not found at {abs_ntxt_dir}")
    all_dirs_exist = False

if all_dirs_exist:
    process_files(abs_xml_dir, abs_txt_dir, abs_ntxt_dir)
    print("\nProcessing complete.")
else:
    print("\nProcessing aborted due to missing directories.")


Starting comparison process...
Using XML directory: c:\Project\CBRD\XML
Using TXT directory: c:\Project\CBRD\TXT
Using NTXT directory: c:\Project\CBRD\NTXT

--- Differences Found (and cases where one or both not found) ---

XML File: v55.3\148109.xml
  Ref Target: 'T. taisho tripitaka'
  TXT Match : 'T. Taisho Tripitaka' (len: 19)
  NTXT Match: 'T. taisho Tripitaka' (len: 19)

XML File: v55.3\148112.xml
  Ref Target: 'T12,no.84'
  TXT Match : 'T12,no.384' (len: 10)
  NTXT Match: '12,no.384' (len: 9)

XML File: v55.3\148112.xml
  Ref Target: 'T85,no.276,p.34'
  TXT Match : 'T85,no.2769,p.34' (len: 16)
  NTXT Match: 'T85,no.2769,p. 34' (len: 17)

XML File: v55.3\148112.xml
  Ref Target: 'T85,no.2777,p.487b'
  TXT Match : 'T85, no. 2777, p. 487b' (len: 22)
  NTXT Match: 'T85,no.2777,p.487b' (len: 18)

XML File: v55.3\148112.xml
  Ref Target: 'T85,no.2778,p.532b'
  TXT Match : 'T85, no. 2778, p. 532b' (len: 22)
  NTXT Match: 'T85,no.2778,p.532b' (len: 18)

XML File: v55.3\174575.xml
  Ref 