## Task 1: Reconstruct the Original Meeting Transcripts

Libraries used:
* bs4 (Beautiful Soup) to parse the XML files
* re for regular expressions
* os to list directories and join file paths


## 1. Introduction

This project comprises two tasks; this Jupyter notebook contains Task 1. There are 139 XML files for topics and 589 XML files for words and segments. The task here is to reconstruct the original meeting transcripts and write them to one text file per topic file by reading the correct values from the XML files. The scenario chosen for this project is as follows:

#### Scenario 1:
 
* A linebreak after each `<nite:child>` tag of the topic files.
* A linebreak after each segment.
* 10 '*' after each root topic (i.e. '\n**********\n').
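
To make the layout concrete, here is a minimal sketch of how these three rules combine. The helper `format_topics` and the sample strings are hypothetical and are not part of the notebook; it assumes each root topic has already been reduced to a list of segment strings.

```python
def format_topics(topics):
    """Join segment strings following the Scenario 1 layout.

    `topics` is a list of root topics, each given as a list of segment strings.
    """
    lines = []
    for segments in topics:
        lines.extend(segments)   # one line per segment
        lines.append("*" * 10)   # ten '*' after each root topic
    return "\n".join(lines)


# Hypothetical sample data: two root topics with made-up segments.
sample = [["hello everyone", "shall we start"], ["okay thanks"]]
print(format_topics(sample))
```

Joining with `"\n"` puts every segment and every row of stars on its own line, which is also how the transcript function in section 4.4 assembles its result.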

## 2.  Import libraries 

In [1]:
#Importing libraries
from bs4 import BeautifulSoup as bsoup
import re
import os

## 3.  Defining file path 

In [2]:
#Defining file path
theTopics = "./topics/"
theSegments = "./segments/"
theWords= "./words/"

## 4.  Functions

### 4.1  Function to create word dictionary

In [3]:
#Creating word dictionary
def wordDict(file,fileType):
    #joining the file name to the path
    file = os.path.join(fileType, file)
    
    #opening the file and creating bsoup object (the file is closed afterwards)
    with open(file) as theFile:
        xmlSoup = bsoup(theFile,"lxml")
    
    #defining empty dictionary and list
    words = []
    wordDict={}
    
    #finding all the words
    wordSoup = xmlSoup.find('nite:root')
    for word in wordSoup.findAll('w'):
        indexList = []
        
        #Using regex to extract the numeric index from the word id
        index = re.findall(r'words(\d+)',dict(word.attrs)['nite:id'])[0]
        indexList.append(index)
        text = word.text
        indexList.append(text)
        
        #adding words to the list
        words.append(indexList)
        
    #Creating dictionary from the list
    wordDict = {t[0]:t[1:] for t in words} 
    return wordDict

#Defining big empty dictionary
wordDicts={}

#Loop to create big dictionary
#looping over all the files in the words folder
for file in os.listdir(theWords):
    
    #creating key for the segments
    file_1 = file[0:9]
    #creating big dictionary
    wordDicts[file_1]=wordDict(file,theWords)
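
As a rough illustration of the dictionary this cell builds, the sketch below applies the same regular expression to a few word ids without parsing any XML. The ids and texts are made-up samples, and `word_index` is a hypothetical helper, not a function from the notebook.

```python
import re


def word_index(nite_id):
    """Return the numeric part of a word id as a string, or None if absent."""
    match = re.search(r"words(\d+)", nite_id)
    return match.group(1) if match else None


# Hypothetical (id, text) pairs, standing in for the <w> tags of one file.
sample_words = [
    ("ES2002a.A.words0", "Okay"),
    ("ES2002a.A.words1", ","),
    ("ES2002a.A.words2", "right"),
]

# Same shape as the result of wordDict: index string -> one-element list.
lookup = {word_index(nite_id): [text] for nite_id, text in sample_words}
print(lookup)
```

The keys stay strings, which is why the later functions look words up with `str(i)`.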

### 4.2  Function to create segment dictionary

In [4]:
#function to create dictionary of segments
def parseSegments(file,fileType):
    
    #joining the file name to the path
    file = os.path.join(fileType, file)
    
    #opening the file and creating bsoup object (the file is closed afterwards)
    with open(file) as theFile:
        xmlSoup = bsoup(theFile,"lxml")
    
    #finding all the segments
    segments = xmlSoup.findAll("segment")
    
    #creating final list which will contain all the segments
    finalList=[]
    for segment in segments:
        #Creating list of current iteration segment
        theList=[]
        
        #getting the href for the child tag
        x = segment.find('nite:child').get('href')
        
        #regular expression to find the start and the end position
        w=re.findall(r'.words(\d*)\)',x)
        
        #if only one position was found (single-word segment)
        if len(w)==1:
            start = w[0]
            theList.append(start)
        
        #if length of position was 2
        else:
            start = w[0]
            end = w[1]
            theList.append(start)
            theList.append(end)
            
        #appending to make final list of list    
        finalList.append(theList)
    
    #creating the dictionary
    segDict = {t[0]:t[0:] for t in finalList} 
    return segDict

#big empty dictionary       
segDict={}

#Loop to create big dictionary
#looping over all the files in the segments folder
for file in os.listdir(theSegments):
    
    #creating key for the segments
    file_1 = file[0:9]
    #creating big dictionary
    segDict[file_1]=parseSegments(file,theSegments)
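
The sketch below shows how a start and end position can be read from an `href` value. The two sample strings are assumptions about the href format, inferred from the regular expression used above; `href_range` is a hypothetical helper.

```python
import re


def href_range(href):
    """Return (start, end) word indices referenced by an href string."""
    ids = re.findall(r"words(\d+)\)", href)
    start = int(ids[0])
    end = int(ids[-1])  # equals start when only one id is present
    return start, end


# Hypothetical href values: one single-word reference and one range.
single = "ES2002a.A.words.xml#id(ES2002a.A.words7)"
span = "ES2002a.A.words.xml#id(ES2002a.A.words0)..id(ES2002a.A.words12)"

print(href_range(single))
print(href_range(span))
```

Requiring the closing parenthesis after the digits keeps the pattern from matching the `words.xml` part of the file name.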

### 4.3  Function to give segment partitions

In [5]:
#Function that lists the segments lying between start and end
def giveSegs(file, start, end,theDict):
    
    #If the topic starts exactly on a segment boundary
    if str(start) in theDict[file]:
        i= start
        theList=[]
        while i <= end:
            theList.append(theDict[file][str(i)])
            i = int(theDict[file][str(i)][-1]) +1 
        return theList
    #If the topic starts in the middle of a segment
    else:
        t=[]
        finalList=[]
        #collecting the starts of all segments that begin before the topic
        for items in theDict[file]:
            if int(items)< int(start):
                t.append(int(items))
        #end of the segment that contains the topic start
        #([-1] also works for single-word segments, where [1] would fail)
        t1= theDict[file][str(max(t))][-1]
        theList1=[int(start),int(t1)]
        finalList.append(theList1)
        i= int(t1) + 1
        
        theList=[]
        while i <= end:
            theList.append(theDict[file][str(i)])
            i = int(theDict[file][str(i)][-1]) +1 
        finalList.extend(theList)
        return finalList
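
The two branches are easier to follow on a toy example. The following is a simplified sketch of the same idea, not the notebook's function: `split_range` is a hypothetical name, the segment dictionary is made up, and it assumes the segments are contiguous (every segment starts right after the previous one ends).

```python
def split_range(segments, start, end):
    """Return [first, last] word-index pairs covering a topic from `start`.

    `segments` maps a segment's start index (as a string) to a list holding
    its start and, for multi-word segments, its end index.
    """
    pieces = []
    i = start
    if str(start) not in segments:
        # The topic starts mid-segment: keep only the tail of that segment.
        prev = max(int(k) for k in segments if int(k) < start)
        seg_end = int(segments[str(prev)][-1])
        pieces.append([start, seg_end])
        i = seg_end + 1
    while i <= end:
        seg = segments[str(i)]
        pieces.append([int(seg[0]), int(seg[-1])])
        i = int(seg[-1]) + 1
    return pieces


# Hypothetical segments: words 0-4, words 5-9, and the single word 10.
toy = {"0": ["0", "4"], "5": ["5", "9"], "10": ["10"]}

print(split_range(toy, 0, 9))   # topic aligned with segment boundaries
print(split_range(toy, 7, 10))  # topic starting inside the second segment
```

As in the notebook, the last piece is not trimmed here when the topic ends mid-segment; that trimming is left to the word lookup step.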

### 4.4  Function to create the transcript

In [6]:
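#Function to create topic that uses the segment and word functions
#(parsingWord is defined in section 4.5 and must be run before calling this)
def parsingTopic(file,fileType):
    file = os.path.join(fileType, file)
    with open(file) as theFile:
        xmlSoup = bsoup(theFile,"lxml")
    
    #root topics only; nested topics are reached through find_all below
    topics = xmlSoup.find("nite:root").find_all('topic',recursive=False)
    
    #for each root topic, collecting all of its nite:child tags
    children=[]
    for t in topics:
        children.append(t.find_all(re.compile(r'nite:child')))
    
    bigL=[]
    for childList in children:
        finalWord=[]
        for child in childList:
            x= child.get('href')
            #first 9 characters of the href are the key used in both dictionaries
            m=x[0:9]
            #word positions at the start and the end of the referenced range
            w=re.findall(r'.words(\d*)\)',x)
            start = int(w[0])
            #a single position means the range is one word long
            end = int(w[0]) if len(w)==1 else int(w[1])
            s=giveSegs(m, start, end,segDict)
            finalWord.extend(parsingWord(m,wordDicts,s,end))
        #10 '*' after each root topic
        finalWord.append('**********')
        bigL.extend(finalWord)
    return "\n".join(bigL)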

### 4.5  Function to obtain the words from the dictionary

In [7]:
#Function to give the words according to the segments created
def parsingWord (file,wordDict,segs,end1):
    finalWord = []
    for items in segs:
        word1=''
        i = int(items[0])
        #clipping the segment at the end of the topic
        end = min(int(items[-1]), end1)
        while i<=end:
            if str(i) in wordDict[file]:
                x = ''.join(wordDict[file][str(i)])
                #skipping empty words
                if len(x)>0:
                    word1+=' '+x
            i=i+1
        #adding the segment only if it contains words
        if len(word1)>0:
            finalWord.append(word1)
    return finalWord
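
A small worked example may help show the clipping at the topic end. This is a simplified sketch under assumptions: `words_for_pieces` is a hypothetical helper, the word dictionary maps indices directly to strings rather than to one-element lists, and words are joined with single spaces without the leading space that the function above produces.

```python
def words_for_pieces(words, pieces, topic_end):
    """Build one text line per [first, last] piece, clipped at `topic_end`."""
    lines = []
    for first, last in pieces:
        last = min(last, topic_end)  # never read past the end of the topic
        tokens = [
            words[str(i)]
            for i in range(first, last + 1)
            if words.get(str(i))     # skip missing indices and empty strings
        ]
        if tokens:                   # drop pieces that contain no words
            lines.append(" ".join(tokens))
    return lines


# Hypothetical words; index 1 is an empty entry.
toy_words = {"0": "Okay", "1": "", "2": "let's", "3": "start", "4": "now"}

# The second piece runs to index 9, but the topic ends at index 3.
print(words_for_pieces(toy_words, [[0, 2], [3, 9]], topic_end=3))
```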

## 5. Main body to create text files

In [8]:
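#Finally calling the function in the main body to create text files
#making sure the output folder exists
os.makedirs('./txt_files/', exist_ok=True)
for file in os.listdir(theTopics):
    #file[:-10] strips the last 10 characters (assumed to be the suffix '.topic.xml')
    finalFile = './txt_files/'+file[:-10]+'.txt'
    with open(finalFile,'w') as f:
        f.write(parsingTopic(file,theTopics))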
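
The slice `file[:-10]` relies on every topic file name ending in a 10-character suffix. A slightly more defensive way to derive the output path is sketched below; the suffix `.topic.xml`, the sample file name, and the helper `transcript_path` are assumptions for illustration only.

```python
import os


def transcript_path(topic_file, out_dir="./txt_files/"):
    """Map a topic file name to the path of its transcript text file."""
    suffix = ".topic.xml"  # assumed naming convention of the topic files
    if topic_file.endswith(suffix):
        stem = topic_file[: -len(suffix)]
    else:
        # Fall back to dropping only the last extension.
        stem = os.path.splitext(topic_file)[0]
    return os.path.join(out_dir, stem + ".txt")


print(transcript_path("ES2002a.topic.xml"))
```

Checking the suffix explicitly avoids silently truncating a name that does not follow the expected pattern.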

## 6. Summary
1. In this task, the input files came from three different folders named topics, segments and words.
2. Several approaches were combined to reconstruct the original meeting transcripts.
3. The summary of the task is as follows:
    - Created dictionaries of words and segments.
    - Created a list of segments to parse the words properly; where a topic cuts through a segment, the topic boundary was given higher preference.
    - Created the meeting transcripts as text files.
4. The final meeting transcripts were exported to the relevant topic text files.