## .DOCX to .TXT file converter

#### The `python-docx` package may need to be installed
#### To install, enter the command "pip install python-docx" (the package is imported as `docx`)
#### [Documentation can be found here](https://python-docx.readthedocs.io/en/latest/index.html#)

In [None]:
# !pip install python-docx
import docx

#### Global Variables to aid with filenames

In [None]:
path_to_data = '../data/Lexis Cases/'
file_prefix = 'P'
file_suffix = '.DOCX'
file_identifiers = range(1, 86) # Range from 1 to 85

out_path = '../data/Lexis Cases txt header/'
out_file_suffix = '.txt'
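
The conversion loops in this notebook rebuild the input and output paths by string concatenation on every iteration. As a sketch of an alternative, the same paths could be built in one place with `pathlib`. The helper name `case_paths` is hypothetical and not part of the original notebook; its defaults simply mirror the global variables above.

```python
from pathlib import Path


def case_paths(file_number,
               in_dir='../data/Lexis Cases/',
               out_dir='../data/Lexis Cases txt header/',
               prefix='P', in_suffix='.DOCX', out_suffix='.txt'):
    """Return (input_path, output_path) for one numbered case file.

    Hypothetical helper; the defaults mirror the notebook's globals.
    """
    stem = prefix + str(file_number)
    return Path(in_dir) / (stem + in_suffix), Path(out_dir) / (stem + out_suffix)


docx_path, txt_path = case_paths(1)
print(docx_path.name, txt_path.name)  # P1.DOCX P1.txt
```

If this helper were used with `docx.Document`, wrapping the path in `str()` would be the cautious choice, since older releases of the library may expect a string or a file-like object.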

#### Text Extraction

The `docx` library has no simple built-in way to walk every element of a document in order. I found a workaround online that iterates over the children of the document body and constructs a `Paragraph` or `Table` object for each one. The code and a link to its source are below.

In [None]:
# This code is sourced from: https://github.com/python-openxml/python-docx/issues/40#issuecomment-90710401
# with modification made from comment: 

# Currently, there is no built-in method of iterating over all items sequentially
# We must manually iterate over children and call the constructors of Paragraph, Table
# Code below summary:
#      1. Grab body of document
#      2. Check if the child in "document.element.body.iterchildren()" is CT_P, or CT_Tbl
#      3. Call the Paragraph or Table constructor with the child and its parent (the document)
#      4. Yield the paragraph/table for user to manipulate

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("parent must be a Document or _Cell object")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

The code below opens each `.DOCX` file, reads its contents, and writes them out as a `.txt` file. For tables, our strategy is to join the cells of each row with spaces and treat the row as a single sentence (one line of output).
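
To illustrate the table strategy on plain data, the sketch below joins the cells of each row into one line. It works on lists of strings rather than `docx` objects so that it runs without a document; the function name `rows_to_lines` and the sample data are made up for illustration.

```python
def rows_to_lines(rows):
    """Join the stripped cell texts of each row with single spaces, one line per row."""
    return [' '.join(cell.strip() for cell in row) for row in rows]


# Made-up sample standing in for the cell texts of a two-row table
sample_table = [
    ['Counsel ', ' For the plaintiff'],
    ['Judge', 'Smith J.'],
]
for line in rows_to_lines(sample_table):
    print(line)
# Counsel For the plaintiff
# Judge Smith J.
```

As in the notebook's own loop, an empty cell is kept as an empty string, so it leaves a double space in the joined line.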

To get the case title, we access the document's `header` via the document's `sections` and pull the text from row 1, cell 0 of the header table. I believe the "Page X of Y" text is stored in row 0, cell 0, but this information isn't needed.

### Convert DOCX to TXT case files (without HEADER tags)

In [None]:
for file_number in file_identifiers:
    print('Processing ' + path_to_data + file_prefix + str(file_number) + file_suffix)
    
    # Open the document
    # (Note: docx.Document is different from docx.document.Document)
    document = docx.Document(path_to_data + file_prefix + str(file_number) + file_suffix)
    file_text = ''

    start_of_case = True
    doc_section = 0
    
    # Iterate over each item in the body
    for item in iter_block_items(document):

        # If at start of case, store the header (case name)
        # as the first entry. Must access 'document.sections[].header' to do so.
        if start_of_case:
            header = document.sections[doc_section].header
            case_title = header.tables[0].cell(1, 0).text
            file_text += case_title.strip() + '\n'
            doc_section += 2
            start_of_case = False
        
        # If item is a PARAGRAPH
        if isinstance(item, Paragraph):
            # Skip over any empty lines
            if len(item.text.strip()) > 0:
                file_text += item.text.strip() + '\n'

            # Check if we are at the last line of a case
            # to make sure we insert the header of the following case
            if item.text.strip() == 'End of Document':
                start_of_case = True 
          
        # If item is a TABLE
        elif isinstance(item, Table): 
            # We group everything across a row as one "sentence" in our .txt file
            # Currently have no better ideas. This seems to be the most common format of tables in our .docx files
            for row in item.rows:
                text = [cell.text.strip() for cell in row.cells]
                file_text += ' '.join(text) + '\n'
        
            
    # Save as .txt
    with open(out_path + file_prefix + str(file_number) + out_file_suffix, 'w') as out_file:
        out_file.write(file_text)
        

### Convert DOCX to TXT case files (with HEADER tags)
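
The cell below tags a paragraph as a section header when a single bold run accounts for the paragraph's entire text, that text is shorter than 12 words, and it is not the `End of Document` marker. Here is a minimal sketch of that rule on plain data, assuming each run is represented by a hypothetical `(text, bold)` tuple instead of a `docx` run object. The function name `is_section_header` is also an assumption made for illustration.

```python
def is_section_header(paragraph_text, runs, max_words=12):
    """Return True if one bold run makes up the whole paragraph and it is short.

    `runs` is a list of (text, bold) tuples standing in for docx runs.
    """
    text = paragraph_text.strip()
    # Empty paragraphs and the case terminator are never headers
    if not text or text == 'End of Document':
        return False
    return any(
        bold and run_text.strip() == text and len(run_text.split()) < max_words
        for run_text, bold in runs
    )


print(is_section_header('Opinion', [('Opinion', True)]))  # True
print(is_section_header('The court held that',
                        [('The court', True), (' held that', False)]))  # False
```

A paragraph whose bold text is split across several runs would not be tagged by this rule, which matches the behavior of the cell below.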

In [None]:
for file_number in file_identifiers:
    print('Processing ' + path_to_data + file_prefix + str(file_number) + file_suffix)
    
    # Open the document
    # (Note: docx.Document is different from docx.document.Document)
    document = docx.Document(path_to_data + file_prefix + str(file_number) + file_suffix)
    file_text = ''

    start_of_case = True
    doc_section = 0
    
    # Iterate over each item in the body
    for item in iter_block_items(document):

        # If at start of case, store the header (case name)
        # as the first entry. Must access 'document.sections[].header' to do so.
        if start_of_case:
            header = document.sections[doc_section].header
            case_title = header.tables[0].cell(1, 0).text
            file_text += case_title.strip() + '\n'
            doc_section += 2
            start_of_case = False
        
        # If item is a PARAGRAPH
        if isinstance(item, Paragraph):
            # Tag the paragraph as a header if a single bold run makes up its
            # whole text, it is under 12 words, and it is not the case terminator
            paragraph_text = item.text.strip()
            is_header = any(
                run.bold
                and run.text.strip() == paragraph_text
                and len(run.text.split()) < 12
                and paragraph_text != 'End of Document'
                for run in item.runs
            )
            # Skip over any empty lines
            if len(paragraph_text) > 0:
                if is_header:
                    file_text += '<header>' + paragraph_text + '</header>' + '\n'
                else:
                    file_text += paragraph_text + '\n'

            # Check if we are at the last line of a case
            # to make sure we insert the header of the following case
            if item.text.strip() == 'End of Document':
                start_of_case = True 

            
              
        # If item is a TABLE
        elif isinstance(item, Table): 
            # We group everything across a row as one "sentence" in our .txt file
            # Currently have no better ideas. This seems to be the most common format of tables in our .docx files
            for row in item.rows:
                text = [cell.text.strip() for cell in row.cells]
                file_text += ' '.join(text) + '\n'
            
    # Save as .txt
    with open(out_path + file_prefix + str(file_number) + out_file_suffix, 'w') as out_file:
        out_file.write(file_text)
        

### Converting current annotations to include headers

In [None]:
with open('../data/annotations/all_annotations_CN.txt') as f:
    annotation = f.read()
annotation = annotation.split('\n')

for file_number in file_identifiers:
    print('Processing ' + path_to_data + file_prefix + str(file_number) + file_suffix)
    
    # Open the document
    # (Note: docx.Document is different from docx.document.Document)
    document = docx.Document(path_to_data + file_prefix + str(file_number) + file_suffix)
    file_text = ''
    
    start_of_case = True
    doc_section = 0
    
    # Iterate over each item in the body
    for item in iter_block_items(document):
        if start_of_case:
            line_no = 0
            header = document.sections[doc_section].header
            case_title = header.tables[0].cell(1, 0).text
            
            # Check if this case is in our annotations.
            val_case = False
            
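            annotation_line = 0 # Keeps case data and annotations txt parallel
            # Compare stripped text on both sides so stray whitespace
            # around the title does not hide a match
            for line_index, line in enumerate(annotation):
                if line.strip() == case_title.strip():
                    val_case = True
                    annotation_line = line_index
                    break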

            file_text += case_title.strip() + '\n'
            doc_section += 2
            start_of_case = False
        
        # If item is a PARAGRAPH
        if isinstance(item, Paragraph):
            if len(item.text.strip()) == 0:
                continue
            annotation_line += 1
            header = False
            for run in item.runs:
                if run.bold :
                    if run.text.strip() == item.text.strip():
                        if len(run.text.split()) < 12:
                            header = True
            # Skip over any empty lines
            if len(item.text.strip()) > 0:
                if header:
                    if val_case:
                        if item.text.strip() != 'End of Document':
                            # If we find a header
                            # We modify the specific line from the annotation file we read
                            annotation[annotation_line] = '<header>' + item.text.strip() + '</header>'


            # Check if we are at the last line of a case
            # to make sure we insert the header of the following case
            if item.text.strip() == 'End of Document':
                start_of_case = True 

            
              
        # If item is a TABLE
        elif isinstance(item, Table): 
            # We group everything across a row as one "sentence" in our .txt file
            # Currently have no better ideas. This seems to be the most common format of tables in our .docx files
            for row in item.rows:
                text = [cell.text.strip() for cell in row.cells]
                text = ' '.join(text) + '\n'
                annotation_line += text.count('\n')
                
# Save the new annotations
with open(out_path + 'all_annotations_CN_headers.txt', 'w') as out_file:
    out_file.write('\n'.join(annotation))
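
The bookkeeping in the cell above relies on the annotation file and the extracted text staying line-for-line parallel: once the case title is found at some index, the n-th non-empty line of the case is expected n lines after it. Below is a sketch of that idea on plain lists, with hypothetical names (`tag_headers`, `case_lines`, `header_flags`) and made-up data. It is not the notebook's implementation.

```python
def tag_headers(annotation_lines, case_title, case_lines, header_flags):
    """Wrap flagged lines of one case in <header> tags in a copy of the annotations.

    Assumes case_lines[i] sits i + 1 lines after the title line.
    """
    tagged = list(annotation_lines)
    stripped = [line.strip() for line in annotation_lines]
    if case_title.strip() not in stripped:
        return tagged  # case is not annotated; leave everything unchanged
    title_index = stripped.index(case_title.strip())
    for offset, (line, flagged) in enumerate(zip(case_lines, header_flags), start=1):
        if flagged and title_index + offset < len(tagged):
            tagged[title_index + offset] = '<header>' + line.strip() + '</header>'
    return tagged


# Made-up sample: one case whose second line is a header
sample_annotation = ['Smith v. Jones', 'Opinion', 'The appeal is dismissed.', 'End of Document']
result = tag_headers(sample_annotation, 'Smith v. Jones',
                     ['Opinion', 'The appeal is dismissed.', 'End of Document'],
                     [True, False, False])
print(result[1])  # <header>Opinion</header>
```

The bounds check in the sketch guards against the two sources drifting out of step, which would otherwise raise an `IndexError` or tag the wrong line.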
        