## .DOCX to .TXT file converter

#### The python-docx package may need to be installed
#### To install it, enter the command "pip install python-docx" (the package is imported as `docx`)
#### [Documentation can be found here](https://python-docx.readthedocs.io/en/latest/index.html#)

In [1]:
# !pip install python-docx
from docx import Document

#### Global Variables to aid with filenames

In [2]:
path_to_data = '../data/Lexis Cases/'
file_prefix = 'P'
file_suffix = '.DOCX'
file_identifiers = range(1, 86) # Range from 1 to 85

out_path = '../data/Lexis Cases txt/'
out_file_suffix = '.txt'
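
As a possible alternative to concatenating these strings by hand, the input and output filenames could be built with `pathlib`. This is only a sketch: the helper name `build_paths` and its parameter names are hypothetical and are not used elsewhere in this notebook; the default values simply mirror the globals above.

```python
from pathlib import Path


def build_paths(file_number,
                in_dir='../data/Lexis Cases/',
                out_dir='../data/Lexis Cases txt/',
                prefix='P',
                in_suffix='.DOCX',
                out_suffix='.txt'):
    """Return (input_path, output_path) for one numbered file.

    Hypothetical helper: the defaults mirror the notebook's global variables.
    """
    name = f'{prefix}{file_number}'
    return Path(in_dir) / (name + in_suffix), Path(out_dir) / (name + out_suffix)


# Example: paths for the first file in the range
docx_path, txt_path = build_paths(1)
print(docx_path.as_posix())
print(txt_path.name)
```

`Path` objects can be passed directly to `open()`, so the rest of a conversion loop would not need to change much if this approach were adopted.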

#### Text Extraction

When iterating over the `paragraphs` of a `Document` object, the `docx` library returns only the paragraphs in the main body of the text. Headers, footers, and text inside tables are not included.

To get the case title we need to access the document's `header` via the document's `sections` and pull out the text from the table it contains.
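
Because the title comes from the header while the body comes from the paragraphs, the two have to be interleaved when a file holds several cases. The sketch below shows one way that interleaving could work on plain Python lists, without `docx`, so the logic can be checked in isolation. The function name `assemble_cases` is hypothetical, the sample strings are made up, and the `'End of Document'` marker is assumed to close each case, as in this notebook's conversion loop.

```python
def assemble_cases(paragraphs, titles, end_marker='End of Document'):
    """Interleave case titles with body paragraphs (illustrative helper).

    paragraphs: body text lines in document order
    titles:     one title per case, in the same order as the cases
    """
    lines = []
    title_iter = iter(titles)
    start_of_case = True

    for text in paragraphs:
        text = text.strip()

        # At the start of a case, emit its title before any body text
        if start_of_case:
            title = next(title_iter, None)
            if title is not None:
                lines.append(title.strip())
            start_of_case = False

        # Skip empty lines
        if text:
            lines.append(text)

        # The end marker means the next paragraph starts a new case
        if text == end_marker:
            start_of_case = True

    return ('\n'.join(lines) + '\n') if lines else ''


# Made-up sample data standing in for two cases in one file
sample_paragraphs = ['Body of case A', '', 'End of Document',
                     '  Body of case B  ', 'End of Document']
sample_titles = ['Title of case A', 'Title of case B']
print(assemble_cases(sample_paragraphs, sample_titles))
```

Keeping this step separate from the file reading makes it easier to test the title placement without needing real `.DOCX` files.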

In [None]:
for file_number in file_identifiers:
    print('Processing ' + path_to_data + file_prefix + str(file_number) + file_suffix)
    
    # Open the document
    document = Document(path_to_data + file_prefix + str(file_number) + file_suffix)
    file_text = ''

    start_of_case = True
    doc_section = 0
    
    # Iterate over each paragraph in the body
    for para in document.paragraphs:

        # If at start of case, store the header (case name)
        # as the first entry
        if start_of_case:
            header = document.sections[doc_section].header
            case_title = header.tables[0].cell(1, 0).text
            file_text += case_title.strip() + '\n'
            doc_section += 2
            start_of_case = False
            
        # Skip over any empty lines
        if len(para.text.strip()) > 0:
            file_text += para.text.strip() + '\n'
            
        # Check if we are at the last line of a case
        # to make sure we insert the header of the following case
        if para.text.strip() == 'End of Document':
            start_of_case = True 
        
            
    # Save as .txt
    with open(out_path + file_prefix + str(file_number) + out_file_suffix, 'w', encoding='utf-8') as out_file:
        out_file.write(file_text)

Processing ../data/Lexis Cases/P1.DOCX
Processing ../data/Lexis Cases/P2.DOCX
Processing ../data/Lexis Cases/P3.DOCX
Processing ../data/Lexis Cases/P4.DOCX
Processing ../data/Lexis Cases/P5.DOCX
Processing ../data/Lexis Cases/P6.DOCX
Processing ../data/Lexis Cases/P7.DOCX
Processing ../data/Lexis Cases/P8.DOCX
Processing ../data/Lexis Cases/P9.DOCX
Processing ../data/Lexis Cases/P10.DOCX
Processing ../data/Lexis Cases/P11.DOCX
Processing ../data/Lexis Cases/P12.DOCX
Processing ../data/Lexis Cases/P13.DOCX
Processing ../data/Lexis Cases/P14.DOCX
Processing ../data/Lexis Cases/P15.DOCX
Processing ../data/Lexis Cases/P16.DOCX
Processing ../data/Lexis Cases/P17.DOCX
Processing ../data/Lexis Cases/P18.DOCX
Processing ../data/Lexis Cases/P19.DOCX
Processing ../data/Lexis Cases/P20.DOCX
Processing ../data/Lexis Cases/P21.DOCX
Processing ../data/Lexis Cases/P22.DOCX
Processing ../data/Lexis Cases/P23.DOCX
Processing ../data/Lexis Cases/P24.DOCX
Processing ../data/Lexis Cases/P25.DOCX
Processin