# FIT5196 Assessment 1
#### Student Name:
#### Student ID: 

Date: 02/04/2017

Version: 2.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* xml.etree.ElementTree (for parsing XML doc)
* pandas 0.20.2 (for cut function) 
* re 2.2.1 (for regular expression) 
* os (for joining paths, splitting file names, checking whether a path exists)


## 1. Introduction
Meeting transcripts are stored in three types of XML files, whose names end with ".words.xml", ".topic.xml" and ".segments.xml". The task is to reconstruct the original meeting transcripts, with their topic and paragraph boundaries, from these files.

## 2.  Import libraries 

In [1]:
import xml.etree.ElementTree as ET
import re
import pandas as pd
from os import listdir,makedirs
from os.path import isfile, join, split, exists, splitext

## 3. Define constants

This program first defines the constants that are used in the later steps.  
TOPIC_TAG is the namespaced nite:id attribute; it identifies the elements that represent a topic.  
SEGMENTS_TAG is the href attribute; it identifies the elements that point to a segment of words.  
WORD_XML_FILE_DIR is the directory where the .words.xml files are stored.  
TOPIC_XML_FILE_DIR is the directory where the .topic.xml files are stored.  
SEGMENTS_XML_FILE_DIR is the directory where the .segments.xml files are stored.  
TXT_OUTPUT_DIR is the directory in which the results of task 1 are stored.  
BRAKETS_CONTENS_PATTERN is a regular expression used to extract the contents of parentheses.  
WORD_ID_PATTERN is a regular expression used to extract the word id from a string.  
WORD_ID_PATTERN_X does the same as WORD_ID_PATTERN, but for ids whose number is prefixed with an x.  

In [2]:
TOPIC_TAG = "{http://nite.sourceforge.net/}id"
SEGMENTS_TAG = "href"
WORD_XML_FILE_DIR = './words'
TOPIC_XML_FILE_DIR = './topics'
SEGMENTS_XML_FILE_DIR = './segments'
TXT_OUTPUT_DIR='./txt_files'
BRAKETS_CONTENS_PATTERN = r'\((.*?)\)'
WORD_ID_PATTERN = r'words([0-9]+)'
WORD_ID_PATTERN_X = r'wordsx([0-9]+)'
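
As a quick check of the three patterns, the sketch below applies them to a made-up href value. The sample string and the ids in it are assumptions chosen to resemble the corpus format; they are not copied from a file.

```python
import re

# Same patterns as in the cell above
BRAKETS_CONTENS_PATTERN = r'\((.*?)\)'
WORD_ID_PATTERN = r'words([0-9]+)'
WORD_ID_PATTERN_X = r'wordsx([0-9]+)'

# Hypothetical href value of a <nite:child> element
href = "TS3009c.A.words.xml#id(TS3009c.A.words0)..id(TS3009c.A.words43)"

# The part before "#" is the words file, the part after it holds the id range
file_name, id_part = href.split("#")
inside = re.findall(BRAKETS_CONTENS_PATTERN, id_part)
ids = [re.findall(WORD_ID_PATTERN, s)[0] for s in inside]

print(file_name)   # TS3009c.A.words.xml
print(inside)      # ['TS3009c.A.words0', 'TS3009c.A.words43']
print(ids)         # ['0', '43']

# An id with the "x" prefix is only matched by the second pattern
print(re.findall(WORD_ID_PATTERN, "TS3009c.A.wordsx7"))    # []
print(re.findall(WORD_ID_PATTERN_X, "TS3009c.A.wordsx7"))  # ['7']
```

The ids come back as strings, which is why the later steps convert them with int() before comparing them.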

## 4. Resolve a single .topic.xml file

First, the root node of the XML file is obtained, and each topic element directly under the root is processed as follows:
1) Read the value of the nite:id attribute of the <topic> element and use it as the topic.
2) Iterate over the elements below that topic. Skip every <nite:pointer> element, which is recognised by its role attribute. For every <nite:child> element, read its href attribute and extract the name of the words.xml file, the start word ID and the end word ID; together these represent one segment. For every nested <topic> element, repeat step 2) recursively until all segments have been retrieved. Segments found in a subtopic are recorded under the top-level topic.

In [3]:
# The input is a topic.xml file
def parse_topic_xml(topic_xml):
    tree = ET.parse(topic_xml) 
    #Get root node
    root = tree.getroot() 
    parse_list = [] #Create a list for storing the parsed topic, vocabulary file name, vocabulary start ID, vocabulary end ID
    for child in root: #Traversing the next level of the root node
        topic = child.attrib.get(TOPIC_TAG) # Get the topic value from the topic tag
        for sub in child: # Traverse every element below the layer
            # An element with a role attribute is a <nite:pointer>;
            # it does not reference a segment, so skip it
            if "role" in sub.attrib: 
                pass
            # An element with a nite:id attribute is a subtopic of this topic
            elif sub.attrib.get(TOPIC_TAG): 
                # Descend only if the first child of the subtopic is not a <nite:pointer>
                if "role" not in sub[0].attrib: 
                    # Parse the subtopic recursively and add its segments to parse_list
                    parse_sub(sub, parse_list, topic) 
            else: 
                # Any other element is a <nite:child>, which references a segment of this topic
                # Get the href value of the <nite:child> element
                info_str = sub.attrib.get(SEGMENTS_TAG) 
                # Extract the words file name, the start word ID and the end word ID from info_str
                (word_xml_file, word_start_id, word_end_id) = extracte_word_file_and_id(info_str) 
                # Add the three fields extracted from info_str, together with the topic, to parse_list
                parse_list.append((topic, word_xml_file, word_start_id, word_end_id)) 
    return parse_list

# Recursively collect the segments of a subtopic and of any topics nested inside it.
# Inputs: the subtopic node, the result list to append to, and the top-level topic the segments belong to
def parse_sub(sub, parse_list, belong_topic):
    # Traverse each element in the subtopic
    for sub_sub in sub: 
        # Skip <nite:pointer> elements, which carry a role attribute
        if "role" not in sub_sub.attrib: 
            # An element without a nite:id attribute references a segment
            if TOPIC_TAG not in sub_sub.attrib: 
                # Call function to extract vocabulary file name from info_str, vocabulary start ID, vocabulary end ID
                (word_xml_file, word_start_id, word_end_id) = extracte_word_file_and_id(sub_sub.attrib.get(SEGMENTS_TAG))
                # Save the parsed field data
                parse_list.append((belong_topic, word_xml_file, word_start_id, word_end_id)) 
            else:
                # When the sub topic still has children, call again for parsing
                parse_sub(sub_sub, parse_list, belong_topic) 

# Input: the href attribute value of a <nite:child> element
def extracte_word_file_and_id(attrib_str):
    sub_attrib_split = attrib_str.split("#")
    word_xml_file = sub_attrib_split[0] # Extract the name of the .words.xml file
    word_ids = re.findall(BRAKETS_CONTENS_PATTERN, sub_attrib_split[1])
    # Extract word_start_id and word_end_id; when there is only one word, word_start_id equals word_end_id
    condition = len(word_ids) > 1
    (word_start_str, word_end_str) = (word_ids[0], word_ids[1]) if condition else (word_ids[0], word_ids[0])
    word_start_id = re.findall(WORD_ID_PATTERN, word_start_str)[0]
    word_end_id = re.findall(WORD_ID_PATTERN, word_end_str)[0]
    # Return the extraction result
    return word_xml_file, word_start_id, word_end_id

In [4]:
# test case
topic_xml_file = "./topics/TS3009c.topic.xml"
parse_element_list = parse_topic_xml(topic_xml_file)
topic_xml_filename_prefix = split(topic_xml_file)[1].replace(".topic.xml", "")
sample = 10
for parse_element in parse_element_list[:sample]:
    print(parse_element)

('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', '0', '43')
('TS3009c.topic.vkaraisk.2', 'TS3009c.C.words.xml', '0', '0')
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', '0', '4')
('TS3009c.topic.vkaraisk.2', 'TS3009c.C.words.xml', '1', '2')
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', '5', '6')
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', '44', '64')
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', '7', '7')
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', '65', '107')
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', '8', '8')
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', '108', '165')
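
To make the recursive traversal easier to follow, here is a simplified sketch that runs on a tiny in-memory document instead of a corpus file. The XML content, the ids and the helper name collect are all made up for illustration, and the sketch leaves out the extra check on the first child of a subtopic that parse_topic_xml performs. It also shows how ElementTree spells a namespaced attribute such as nite:id.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the nite: prefix to "{namespace-uri}name"
ID_ATTR = "{http://nite.sourceforge.net/}id"

# Made-up miniature topic file: one top-level topic with a segment and a subtopic
SAMPLE = """
<nite:root xmlns:nite="http://nite.sourceforge.net/">
  <topic nite:id="T1">
    <nite:pointer role="scenario_topic_type" href="default-topics.xml#id(top.4)"/>
    <nite:child href="M.A.words.xml#id(M.A.words0)..id(M.A.words5)"/>
    <topic nite:id="T2">
      <nite:pointer role="scenario_topic_type" href="default-topics.xml#id(top.5)"/>
      <nite:child href="M.B.words.xml#id(M.B.words0)"/>
    </topic>
  </topic>
</nite:root>
"""

def collect(node, root_topic, out):
    """Append (root_topic, href) for every segment below node, depth first."""
    for elem in node:
        if "role" in elem.attrib:
            # nite:pointer: carries no segment
            continue
        if ID_ATTR in elem.attrib:
            # Nested topic: recurse, but keep the top-level topic
            collect(elem, root_topic, out)
        else:
            # nite:child: a segment reference
            out.append((root_topic, elem.attrib["href"]))

root = ET.fromstring(SAMPLE.strip())
rows = []
for topic in root:
    collect(topic, topic.attrib[ID_ATTR], rows)

for row in rows:
    print(row)
```

Both segments are reported under T1, including the one that sits inside the subtopic T2, which mirrors how belong_topic is passed down in parse_sub.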


## 4. Reconstruct paragraphs with "*.segments.xml"

For each segment extracted from an href attribute, read the paragraph boundaries from the matching ".segments.xml" file and adjust word_start_id and word_end_id accordingly. For example, if a segment covers word0..word30 and the ".segments.xml" file delimits the paragraphs word0..word10 and word11..word30, the segment is split into those two paragraphs.

In [5]:
#Read the .segments.xml file to correct paragraphs
def adjust_paragraph_boundary(parse_element_list):
    # Create a list to save the result of adjusting the paragraph delimitation
    adjust_parse_element_list = [] 
    for topic, word_xml_file, word_start_id, word_end_id in parse_element_list:
        #Get the corresponding "segments.xml" file
        segments_xml_file =  word_xml_file.replace("words", "segments") 
        # Load the upper bound (last word id) of every paragraph from the segments file
        paragraph_up_boundaries = load_paragraph_boundary(segments_xml_file) 
        paragraph_boundaries = cut_segment_with_paragraph_boundary(int(word_start_id), int(word_end_id), paragraph_up_boundaries)
        adjust_parse_element_list.append((topic, word_xml_file, paragraph_boundaries))
    return adjust_parse_element_list

# Reading the upper bound of the paragraph delimitation from the segments.xml file
def load_paragraph_boundary(segments_xml_file):
    segments_xml_file_path = join(SEGMENTS_XML_FILE_DIR, segments_xml_file)
    tree = ET.parse(segments_xml_file_path)
    root = tree.getroot()
    # List for storing demarcation points
    paragraph_boundaries_bins = [] 
    # Parse each layer under the root node of the segments_xml file
    for child in root:
        for sub in child:
            #Parse the vocabulary file name from the content of the <nite:child href> tag, the vocabulary start ID, the vocabulary end ID
            (word_xml_file, word_start_id, word_end_id) = extracte_word_file_and_id(sub.attrib.get(SEGMENTS_TAG))
            #Save the upper bound of each paragraph
            paragraph_boundaries_bins.append(int(word_end_id)) 
    return paragraph_boundaries_bins 

def cut_segment_with_paragraph_boundary(word_start_id, word_end_id, paragraph_up_boundaries):
    # Used to store results
    adjust_paragraph_boundaries = [] 
    # Keep only the boundaries that fall strictly inside (word_start_id, word_end_id)
    temp_paragraph_boundaries = [up_bound for up_bound in paragraph_up_boundaries if word_start_id < up_bound < word_end_id]
    # Insert word_start_id at the head of the list, plus word_end_id at the end
    # Generate an interval that can be used to segment the word sequence word_start_id:word_end_id
    temp_paragraph_boundaries.insert(0, word_start_id)
    temp_paragraph_boundaries.append(word_end_id)
    interval_list = []
    # Divide the sequence of word ids at the paragraph boundaries
    if len(set(temp_paragraph_boundaries)) == 1:
        # The segment consists of a single word
        interval_list = [(temp_paragraph_boundaries[0], temp_paragraph_boundaries[0])]
    else:
        cut_result = pd.cut(range(word_start_id, word_end_id + 1), temp_paragraph_boundaries)
        # Turn each category into a (lower, upper) tuple and add it to the result list
        for interval in cut_result.categories:
            if hasattr(interval, "left"):
                # Newer pandas versions: categories are Interval objects
                interval_list.append((int(interval.left), int(interval.right)))
            else:
                # Older pandas versions: categories are strings such as "(0, 2]"
                interval_str = interval.replace("(", "").replace("]", "").replace(" ","").split(",")
                interval_list.append((int(interval_str[0]), int(interval_str[1])))
    for idx, interval in enumerate(interval_list):
        if idx == 0:
            adjust_paragraph_boundaries.append((interval[0], interval[1]))
        else:
            # Add 1 to the lower bound because the intervals are left-open and right-closed; the first interval already starts at word_start_id
            adjust_paragraph_boundaries.append((interval[0] + 1, interval[1]))
    return adjust_paragraph_boundaries

In [6]:
# test case
adjust_parse_element_list = adjust_paragraph_boundary(parse_element_list)
for adjust_parse_element in  adjust_parse_element_list[:sample]:
    print(adjust_parse_element)

('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', [(0, 43)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.C.words.xml', [(0, 0)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', [(0, 2), (3, 4)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.C.words.xml', [(1, 2)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', [(5, 6)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', [(44, 45), (46, 64)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', [(7, 7)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', [(65, 107)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.D.words.xml', [(8, 8)])
('TS3009c.topic.vkaraisk.2', 'TS3009c.A.words.xml', [(108, 165)])
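
The splitting rule can also be written without pandas. The sketch below is an alternative formulation for cross-checking the results, not the method used in the cell above; the function name split_range is made up, and it keeps the same strict comparison when selecting boundaries.

```python
def split_range(start, end, upper_bounds):
    """Split the inclusive id range [start, end] at every paragraph end strictly inside it."""
    cuts = sorted(b for b in upper_bounds if start < b < end)
    pieces = []
    lower = start
    for cut in cuts:
        # A paragraph ends at cut, so the next one starts at cut + 1
        pieces.append((lower, cut))
        lower = cut + 1
    pieces.append((lower, end))
    return pieces

# The example from the text: word0..word30 with paragraphs ending at 10 and 30
print(split_range(0, 30, [10, 30]))       # [(0, 10), (11, 30)]
# A segment that contains one paragraph end
print(split_range(0, 4, [2, 4, 6]))       # [(0, 2), (3, 4)]
# A single-word segment is returned unchanged
print(split_range(7, 7, [7]))             # [(7, 7)]
```

The upper bounds passed in here are made-up values; in the notebook they come from load_paragraph_boundary.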


## 5. Match the "*.words.xml"

Read the words.xml file and match the words by id. For example:  
Segments parsed from the topic file are:  
topic1 word0..word3  
topic2 word4..word5  
In the corresponding .words.xml file:  
word0: "this"  
word1: "is"  
word2: "the"  
word3: "kick-off"  
word4: "meeting"  
The following will be generated:  
topic1 "this is the kick-off"  
topic2 "meeting"

In [7]:
def match_the_word_xml(adjust_parse_element_list, topic_xml_filename_prefix):
    #Used to save data after matching word
    paragraph_contents_data = [] 
    # The word content dictionary is read from the word.xml file with the prefix topic_xml_filename_prefix
    word_dict_with_id = load_word_dict_with_id(topic_xml_filename_prefix) 
    # Iterate through the adjust parsed results of paragraph boundaries
    for topic, word_xml_file, paragraph_boundaries in adjust_parse_element_list:  
        word_list_with_id = word_dict_with_id[word_xml_file]
        # According to the division interval, the words in the interval are taken out into the word dictionary, 
        # and the words in the interval are connected with spaces.
        for paragraph_start, paragraph_end in paragraph_boundaries:
            wordid_word_tuple_list = get_word(word_list_with_id, int(paragraph_start), int(paragraph_end))
            paragraph = " ".join([word for word_id,word in wordid_word_tuple_list])
            # Add the word join result and topic together to the result list, remove the empty
            if len(paragraph) > 0:
                paragraph_contents_data.append((topic, paragraph))
    return paragraph_contents_data

# Given a prefix, find the matching words.xml files and read them to build a word dictionary
# The format is {words.xml file name: [(word_id, word)]}
def load_word_dict_with_id(prefix):
    # Match files based on the prefix
    onlyfiles = [f for f in listdir(WORD_XML_FILE_DIR) if f.startswith(prefix) and isfile(join(WORD_XML_FILE_DIR, f))]
    # Create an empty dict 
    word_id_dict = {}
    #Traversing a file that satisfies the conditions
    for word_xml_file in onlyfiles:
        word_list_with_id = []
        word_tree = ET.parse(join(WORD_XML_FILE_DIR, word_xml_file))
        word_root = word_tree.getroot()
        # Traverse each layer under the root node of the "word.xml" file and extract the word corresponding to each word_id.
        for child in word_root:
            word_key = child.attrib.get(TOPIC_TAG)
            word_key_id_list = re.findall(WORD_ID_PATTERN, word_key)
            if len(word_key_id_list) > 0:
                word_key_id = word_key_id_list[0]
            else:
                word_key_id = re.findall(WORD_ID_PATTERN_X, word_key)[0]
            word = child.text
            word_list_with_id.append((int(word_key_id), word))
        # Put the word taken from the "word.xml" file in a list as the value of the dict
        word_id_dict[word_xml_file] = word_list_with_id
    return word_id_dict

# Determine if the id corresponding to a word is within the interval [word_start_id, word_end_id]
def filter_fun(tuple2, word_start_id, word_end_id):
    word_id, word = tuple2[0], tuple2[1]
    return word_id >= word_start_id and word_id <= word_end_id and word is not None
# Filter all words in the range [word_start_id, word_end_id]
def get_word(word_list_with_id, word_start_id, word_end_id):
    return filter(lambda x: filter_fun(x, word_start_id, word_end_id), word_list_with_id)
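
The range lookup can be illustrated on a small hand-written word list. This is a minimal sketch with made-up data; the helper name words_in_range is hypothetical, and it returns a list rather than the lazy iterator that filter produces in Python 3.

```python
def words_in_range(word_list, start_id, end_id):
    """Return the words whose id lies in [start_id, end_id], skipping entries without text."""
    return [word for word_id, word in word_list
            if start_id <= word_id <= end_id and word is not None]

# Hypothetical contents of one words file as (word id, text) pairs;
# None stands for an element that has no text
words = [(0, "this"), (1, "is"), (2, "the"), (3, "kick-off"), (4, "meeting"), (5, None)]

print(" ".join(words_in_range(words, 0, 3)))   # this is the kick-off
print(" ".join(words_in_range(words, 4, 5)))   # meeting
```

Entries whose text is None are dropped before joining, so they do not leave stray spaces in the paragraph.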

In [8]:
# test case
match_result_list = match_the_word_xml(adjust_parse_element_list, topic_xml_filename_prefix)
for match_result in match_result_list[:sample]:
    print( match_result)

('TS3009c.topic.vkaraisk.2', "Okay . Uh door is closed . Well , let's begin . Because if we have as much time as the last uh meeting , we'll have to hurry up . Um well I'll start with the presentation again , the agenda .")
('TS3009c.topic.vkaraisk.2', "I'm listening .")
('TS3009c.topic.vkaraisk.2', 'Right .')
('TS3009c.topic.vkaraisk.2', 'Great .')
('TS3009c.topic.vkaraisk.2', 'Yo .')
('TS3009c.topic.vkaraisk.2', 'So . Uh This one I think . Uh yeah . Well alright .')
('TS3009c.topic.vkaraisk.2', "Um well , I'll show you the notes . It's not as uh interesting as it should be because we just uh had the meeting , but I'll show them . We'll get your presentations again on the conceptual design . Um")
('TS3009c.topic.vkaraisk.2', "Then we'll have to dec decide about the control , the remote control concepts . I've put a f uh a file in the project management folder , which says exactly uh what kind of decisions we should take . So this time we exactly know what to decide about . And then we

## 6. Format output

Loop through every topic.xml file and, for each one:  
1) Call the function that parses a single topic.xml file.  
2) For the result of 1), read the matching segments.xml files to reconstruct the paragraphs.  
3) For the result of 2), read the words.xml files and extract the matching words.  
4) Write the paragraphs to a txt file, with a line of asterisks marking each topic boundary.  

In [9]:
# Format the segments that have been parsed, split into paragraphs and matched with their words, and write them to a txt file
def format_output(match_result_list, output_file):
    # The with statement closes the file even if a write fails
    with open(output_file, 'w') as f:
        # Guard against an empty result list
        prev_topic = match_result_list[0][0] if match_result_list else None
        for topic, paragraph in match_result_list:
            # Write a separator line whenever the topic changes
            if topic != prev_topic:
                f.write("**********" + "\n")
            f.write(" %s\n" % paragraph)
            prev_topic = topic
        f.write("**********" + "\n")
    
# Parse files in batches, correct paragraphs, match texts, and save to txt files
def batch_parse_topic_xml(output_dir):
    # Create the output directory if it does not exist yet
    if not exists(output_dir):
        makedirs(output_dir)
    onlyfiles = [join(TOPIC_XML_FILE_DIR, f) for f in listdir(TOPIC_XML_FILE_DIR) if (isfile(join(TOPIC_XML_FILE_DIR, f))) and (splitext(f)[1] == '.xml')]
    file_size = len(onlyfiles)
    for i in range(file_size):
        # The "topic.xml" file that needs to be parsed
        topic_xml_file = onlyfiles[i]
        # Generate output file path
        topic_prefix =  split(topic_xml_file)[1].replace(".topic.xml", "")
        output_file = join(output_dir, "%s.txt" % topic_prefix)
        # Parse the topic file
        parse_element_list = parse_topic_xml(topic_xml_file)
        # Correct paragraph demarcation
        adjust_parse_element_list = adjust_paragraph_boundary(parse_element_list)
        # Match vocabulary content
        match_result_list = match_the_word_xml(adjust_parse_element_list, topic_prefix)
        # Output to file
        format_output(match_result_list, output_file)    
        

In [10]:
batch_parse_topic_xml(TXT_OUTPUT_DIR)
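
To show the layout of the generated txt files without touching the file system, the following sketch writes a few made-up rows to an in-memory buffer. The function name write_paragraphs and the sample rows are assumptions for illustration; the separator logic follows format_output.

```python
import io

def write_paragraphs(rows, stream):
    """Write one paragraph per line and a line of asterisks at every topic change and at the end."""
    prev_topic = rows[0][0] if rows else None
    for topic, paragraph in rows:
        if topic != prev_topic:
            stream.write("**********\n")
        stream.write(" %s\n" % paragraph)
        prev_topic = topic
    stream.write("**********\n")

# Made-up (topic, paragraph) pairs
rows = [("topic1", "Okay ."), ("topic1", "Right ."), ("topic2", "Great .")]
buffer = io.StringIO()
write_paragraphs(rows, buffer)
print(buffer.getvalue())
```

Each paragraph line starts with a single space, and the last topic is also closed by a separator line.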