I'll help you convert the XML-style completion format to the tuple format. Here's a Python function that parses the XML and extracts the elements in order:



Please help convert the `completion` field of the following examples from
```"<ref><canon>T</canon>.<v>22</v>,<p>761</p><c>a</c><l>21</l>-<l>29</l></ref>"```
to
```
[("canon", "str", "v", "str", "p", "c", "l", "str", "l"),
("T", ".", "22", ",", "761", "a", "21", "-", "29")]
```


examples:
```
[
            {
                "prompt": "t.23,926b29-c12",
                "completion": "<ref><canon>T</canon>.<v>23</v>,<p>926</p><c>b</c><l>29</l>-<c>c</c><l>12</l></ref>"
            },
            {
                "prompt": "t.43,295c1-296a19",
                "completion": "<ref><canon>T</canon>.<v>43</v>,<p>295</p><c>c</c><l>1</l>-<p>296</p><c>a</c><l>19</l></ref>"
            },
            {
                "prompt": "t.31,757a15-17;924c10-12",
                "completion": "<ref><canon>T</canon>.<v>31</v>,<p>757</p><c>a</c><l>15</l>-<l>17</l>;<p>924</p><c>c</c><l>10</l>-<l>12</l></ref>"
            },
            {
                "prompt": "cht.2,p.69c5-9",
                "completion": "<ref><canon>ChT</canon>.<v>2</v>,p.<p>69</p><c>c</c><l>5</l>-<l>9</l></ref>"
            },
            {
                "prompt": "大正18 no.865 p.216.b.4-12",
                "completion": "<ref><canon>大正</canon><v>18</v> No.<w>865</w> p.<p>216</p>.<c>b</c>.<l>4</l>-<l>12</l></ref>"
            },
            {
                "prompt": "大正:no.310, 373c 上段 4-5",
                "completion": "<ref><canon>大正</canon>:No.<w>310</w>, <p>373</p>, <c>上段</c> <l>4</l>-<l>5</l></ref>"
            },
            {
                "prompt": "大正二八、二七三上、七行",
                "completion": "<ref><canon>大正</canon><v>二八</v>、<p>二七三</p><c>上</c>、<l>七</l>行</ref>"
            },
            {
                "prompt": "卍続一-三四、五七a-b",
                "completion": "<ref><canon>卍続</canon>一-三四、<p>五七</p><c>a</c>-<c>b</c></ref>"
            },
            {
                "prompt": "『大正藏』第四卷359a1-b7",
                "completion": "<ref>『<canon>大正藏</canon>』第<v>四</v>卷<p>359</p><c>a </c><l>1</l>-<c>b </c><l>7</l></ref>"
            }
        ]
```

In [2]:
import re
from xml.etree import ElementTree as ET

def convert_completion_to_tuples(completion):
    """
    Convert XML completion format to tuple format.
    Returns (tags_tuple, values_tuple)
    """
    try:
        # Parse once with ElementTree purely to validate the XML;
        # malformed input is reported and yields empty tuples.
        ET.fromstring(completion)
    except ET.ParseError as e:
        print(f"Error parsing: {completion}")
        print(f"Error: {e}")
        return (), ()

    result_tags = []
    result_values = []

    # Strip the <ref> wrapper so only the inner elements and separators remain
    xml_content = completion.replace('<ref>', '').replace('</ref>', '')

    # Match either a <tag>content</tag> element or a run of plain text,
    # scanning left to right so document order is preserved
    pattern = r'<(\w+)>([^<]*)</\1>|([^<]+)'
    matches = re.findall(pattern, xml_content)

    for tag, content, text in matches:
        if tag and content:
            # A tagged element; elements with empty content are skipped
            result_tags.append(tag)
            result_values.append(content)
        elif text and text.strip():
            # Text between elements (punctuation, "No.", "p.") is tagged "str"
            result_tags.append("str")
            result_values.append(text.strip())

    return tuple(result_tags), tuple(result_values)

# Test with your examples
examples = [
    {
        "prompt": "t.23,926b29-c12",
        "completion": "<ref><canon>T</canon>.<v>23</v>,<p>926</p><c>b</c><l>29</l>-<c>c</c><l>12</l></ref>"
    },
    {
        "prompt": "t.43,295c1-296a19", 
        "completion": "<ref><canon>T</canon>.<v>43</v>,<p>295</p><c>c</c><l>1</l>-<p>296</p><c>a</c><l>19</l></ref>"
    },
    {
        "prompt": "t.31,757a15-17;924c10-12",
        "completion": "<ref><canon>T</canon>.<v>31</v>,<p>757</p><c>a</c><l>15</l>-<l>17</l>;<p>924</p><c>c</c><l>10</l>-<l>12</l></ref>"
    },
    {
        "prompt": "cht.2,p.69c5-9",
        "completion": "<ref><canon>ChT</canon>.<v>2</v>,p.<p>69</p><c>c</c><l>5</l>-<l>9</l></ref>"
    },
    {
        "prompt": "大正18 no.865 p.216.b.4-12",
        "completion": "<ref><canon>大正</canon><v>18</v> No.<w>865</w> p.<p>216</p>.<c>b</c>.<l>4</l>-<l>12</l></ref>"
    }
]

# Convert each example and replace the completion value
for example in examples:
    tags, values = convert_completion_to_tuples(example['completion'])
    example['completion'] = {'tags': tags, 'values': values}

# Display the modified examples
for i, example in enumerate(examples):
    print(f"Example {i+1}:")
    print(f"Prompt: {example['prompt']}")
    print(f"Completion: {example['completion']}")
    print()

Example 1:
Prompt: t.23,926b29-c12
Completion: {'tags': ('canon', 'str', 'v', 'str', 'p', 'c', 'l', 'str', 'c', 'l'), 'values': ('T', '.', '23', ',', '926', 'b', '29', '-', 'c', '12')}

Example 2:
Prompt: t.43,295c1-296a19
Completion: {'tags': ('canon', 'str', 'v', 'str', 'p', 'c', 'l', 'str', 'p', 'c', 'l'), 'values': ('T', '.', '43', ',', '295', 'c', '1', '-', '296', 'a', '19')}

Example 3:
Prompt: t.31,757a15-17;924c10-12
Completion: {'tags': ('canon', 'str', 'v', 'str', 'p', 'c', 'l', 'str', 'l', 'str', 'p', 'c', 'l', 'str', 'l'), 'values': ('T', '.', '31', ',', '757', 'a', '15', '-', '17', ';', '924', 'c', '10', '-', '12')}

Example 4:
Prompt: cht.2,p.69c5-9
Completion: {'tags': ('canon', 'str', 'v', 'str', 'p', 'c', 'l', 'str', 'l'), 'values': ('ChT', '.', '2', ',p.', '69', 'c', '5', '-', '9')}

Example 5:
Prompt: 大正18 no.865 p.216.b.4-12
Completion: {'tags': ('canon', 'v', 'str', 'w', 'str', 'p', 'str', 'c', 'str', 'l', 'str', 'l'), 'values': ('大正', '18', 'No.', '865', 'p.', '21



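The function parses each XML completion and returns two tuples:
1. The tag names, with "str" standing in for any text that sits between elements
2. The corresponding values, in the same order

The return value is `(tags_tuple, values_tuple)`, where position i of the first tuple labels position i of the second. The loop above stores the pair under the `tags` and `values` keys of each example. Separator text is stripped of surrounding whitespace, so ` No.` is recorded as `No.`.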
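If you want exactly the list-of-two-tuples shape from your request, plus a quick way to confirm nothing was lost, here is a minimal sketch. The names `TOKEN`, `to_tag_value_lists`, and `rebuild` are hypothetical helpers introduced only for illustration, not part of the cell above. Unlike the function above, this sketch keeps separator text verbatim (no stripping), which is an assumption made so that the output can be turned back into the original string.

```python
import re

# Hypothetical helpers for illustration; the pattern mirrors the one used above.
TOKEN = re.compile(r"<(\w+)>([^<]*)</\1>|([^<]+)")

def to_tag_value_lists(completion):
    """Return [tags_tuple, values_tuple] for a <ref>...</ref> string."""
    inner = completion.replace("<ref>", "").replace("</ref>", "")
    tags, values = [], []
    for m in TOKEN.finditer(inner):
        tag, content, text = m.groups()
        if tag is not None:
            # A tagged element such as <v>22</v>
            tags.append(tag)
            values.append(content)
        else:
            # Plain text between elements, kept verbatim (spaces included)
            tags.append("str")
            values.append(text)
    return [tuple(tags), tuple(values)]

def rebuild(tags, values):
    """Reassemble the <ref> string from the two tuples."""
    parts = [v if t == "str" else f"<{t}>{v}</{t}>" for t, v in zip(tags, values)]
    return "<ref>" + "".join(parts) + "</ref>"

sample = "<ref><canon>T</canon>.<v>22</v>,<p>761</p><c>a</c><l>21</l>-<l>29</l></ref>"
result = to_tag_value_lists(sample)
print(result)

# Round-trip check: rebuilding from the tuples should give back the input
assert rebuild(*result) == sample
```

The round-trip assertion is also a cheap sanity check on the data itself: a completion containing a stray `<` will not rebuild to the same string, so it gets flagged instead of being converted silently.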

In [3]:
examples

[{'prompt': 't.23,926b29-c12',
  'completion': {'tags': ('canon',
    'str',
    'v',
    'str',
    'p',
    'c',
    'l',
    'str',
    'c',
    'l'),
   'values': ('T', '.', '23', ',', '926', 'b', '29', '-', 'c', '12')}},
 {'prompt': 't.43,295c1-296a19',
  'completion': {'tags': ('canon',
    'str',
    'v',
    'str',
    'p',
    'c',
    'l',
    'str',
    'p',
    'c',
    'l'),
   'values': ('T', '.', '43', ',', '295', 'c', '1', '-', '296', 'a', '19')}},
 {'prompt': 't.31,757a15-17;924c10-12',
  'completion': {'tags': ('canon',
    'str',
    'v',
    'str',
    'p',
    'c',
    'l',
    'str',
    'l',
    'str',
    'p',
    'c',
    'l',
    'str',
    'l'),
   'values': ('T',
    '.',
    '31',
    ',',
    '757',
    'a',
    '15',
    '-',
    '17',
    ';',
    '924',
    'c',
    '10',
    '-',
    '12')}},
 {'prompt': 'cht.2,p.69c5-9',
  'completion': {'tags': ('canon',
    'str',
    'v',
    'str',
    'p',
    'c',
    'l',
    'str',
    'l'),
   'values': ('ChT