# Working with MS Word documents
docx is the MS Word Office Open XML format. The file is essentially a zip archive of XML files, with text, objects, styles, etc. So in theory, you can open one as a binary file like this:

`f = open('data/sample_doc.docx', 'rb')`

Word has used this format by default since Word 2007.
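
To make the "zipped XML" point concrete, here is a minimal sketch that looks inside the archive with the standard `zipfile` module. The helper names `list_docx_parts` and `read_docx_part` are my own, and the tiny in-memory archive is only a stand-in so the snippet runs without a real file; a real .docx contains more parts than this.

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the names of the files stored inside a .docx (zip) archive."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_docx_part(source, part_name="word/document.xml"):
    """Return one part of the archive decoded as text."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part_name).decode("utf-8")


# Stand-in archive built in memory (assumption: not a valid Word document,
# it only mimics the layout of one).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document><w:body/></w:document>")

parts = list_docx_parts(buffer)
body_xml = read_docx_part(buffer)
print(parts)
print(body_xml)
```

With a real document, passing the path (for example `'data/sample_doc.docx'`) instead of `buffer` should work the same way.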

## python-docx package
python-docx seems to be the go-to package for working with docx files in Python. I didn't do a thorough search, but there seem to be few alternatives. So I've installed it and am experimenting with it here.

To install, use either pip (`pip install python-docx`) or conda:

`conda install -c conda-forge python-docx`

In [1]:
import docx

### Working with existing docx documents

In [5]:
doc = docx.Document('data/sample_doc.docx')

#### Sections
Sections hold page-layout properties, such as margins, etc.
https://python-docx.readthedocs.io/en/latest/api/section.html

In [11]:
len(doc.sections)

2

In [13]:
doc.sections[0].start_type

2
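
The bare `2` is easier to read with a name attached. As far as I can tell, the integers follow Word's `WD_SECTION_START` enumeration (0 continuous, 1 new column, 2 new page, 3 even page, 4 odd page). Below is a small sketch that maps them to names. The function `section_start_names` is my own helper, and the section objects are stand-ins with made-up values so the snippet runs without the sample file.

```python
from types import SimpleNamespace

# Integer values of the WD_SECTION_START enumeration
SECTION_START_NAMES = {
    0: "CONTINUOUS",
    1: "NEW_COLUMN",
    2: "NEW_PAGE",
    3: "EVEN_PAGE",
    4: "ODD_PAGE",
}


def section_start_names(sections):
    """Return a readable start type for each section."""
    names = []
    for section in sections:
        value = section.start_type
        if value is None:
            names.append("UNSPECIFIED")
        else:
            names.append(SECTION_START_NAMES.get(int(value), "UNKNOWN"))
    return names


# Stand-in sections (assumption: real code would pass doc.sections)
fake_sections = [SimpleNamespace(start_type=2), SimpleNamespace(start_type=0)]
start_names = section_start_names(fake_sections)
print(start_names)
```

With the real document this would be called as `section_start_names(doc.sections)`.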

#### Extracting the heading text

In [15]:
for paragraph in doc.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)

SUMMARY
BACKGROUND

Other heading


APPENDIX 1: Blah
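
The blank lines in that output come from empty paragraphs that carry the 'Heading 1' style. A possible variation, sketched below, collects headings of every level and skips the empty ones. It assumes the built-in style names follow the pattern 'Heading N'. `list_headings` and `fake_paragraph` are my own names, and the paragraphs are stand-ins so the snippet runs without the sample file.

```python
from types import SimpleNamespace


def fake_paragraph(text, style_name):
    """Stand-in exposing the two attributes used below: .text and .style.name."""
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


def list_headings(paragraphs, prefix="Heading "):
    """Return (level, text) for every non-empty heading paragraph."""
    headings = []
    for paragraph in paragraphs:
        style_name = paragraph.style.name
        text = paragraph.text.strip()
        if not style_name.startswith(prefix) or not text:
            continue
        level = style_name[len(prefix):]
        if level.isdigit():
            headings.append((int(level), text))
    return headings


# Hypothetical content, loosely modelled on the sample document
paragraphs = [
    fake_paragraph("SUMMARY", "Heading 1"),
    fake_paragraph("Some intro text", "Normal"),
    fake_paragraph("", "Heading 1"),
    fake_paragraph("Details", "Heading 2"),
    fake_paragraph("Other heading", "Heading 1"),
]
headings = list_headings(paragraphs)
print(headings)
```

With the real document this would be called as `list_headings(doc.paragraphs)`.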


#### Extracting the text after a specific heading

In [17]:
# Indexes of all paragraphs whose text mentions the heading
start_index = [i for i, parag in enumerate(doc.paragraphs) if "Other heading" in parag.text]

if start_index:
    # Print every paragraph after the first match
    for parag in doc.paragraphs[start_index[0] + 1:]:
        print(parag.text)


Bla-bla-bla Bla-bla-bla Bla-bla-bla Bla-bla-bla
Bla-bla-bla
Table 3: Summary of Stuff


APPENDIX 1: Blah

Different tables for your consideration 		




#### Extracting the text between a specific heading and the next one

Note that `doc.paragraphs` does not include the text inside tables. Tables have to be processed separately, see below.

In [25]:
start_index = [i for i, parag in enumerate(doc.paragraphs) if "Other heading" in parag.text]

In [26]:
start_index

[50]

In [27]:
header_indexes = [i for i, parag in enumerate(doc.paragraphs) if parag.style.name == "Heading 1"]
header_indexes

[24, 29, 49, 50, 55, 56, 57]

In [31]:
end_index = header_indexes[header_indexes.index(start_index[0]) + 1]
end_index

55

In [32]:
# Print the paragraphs strictly between the two headings
for parag in doc.paragraphs[start_index[0] + 1:end_index]:
    print(parag.text)


Bla-bla-bla Bla-bla-bla Bla-bla-bla Bla-bla-bla
Bla-bla-bla
Table 3: Summary of Stuff
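
The cells above can be folded into one function. Two edge cases seem worth handling: the heading may be the last one in the document (the `header_indexes.index(...) + 1` lookup would then raise an `IndexError`), and a body paragraph may mention the heading text (its index would not be in `header_indexes`). This is only a sketch. `text_under_heading` and `fake_paragraph` are my own names, and the paragraphs are stand-ins so the snippet runs without the sample file.

```python
from types import SimpleNamespace


def fake_paragraph(text, style_name="Normal"):
    """Stand-in exposing the two attributes used below: .text and .style.name."""
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


def text_under_heading(paragraphs, title, heading_style="Heading 1"):
    """Return the text of the paragraphs between a heading and the next one."""
    paragraphs = list(paragraphs)
    # First paragraph that has the heading style and mentions the title
    start = next(
        (i for i, p in enumerate(paragraphs)
         if p.style.name == heading_style and title in p.text),
        None,
    )
    if start is None:
        return []
    # Next paragraph with the same heading style, or the end of the document
    end = next(
        (i for i in range(start + 1, len(paragraphs))
         if paragraphs[i].style.name == heading_style),
        len(paragraphs),
    )
    return [p.text for p in paragraphs[start + 1:end]]


# Hypothetical content, loosely modelled on the sample document
paragraphs = [
    fake_paragraph("BACKGROUND", "Heading 1"),
    fake_paragraph("See Other heading for details"),
    fake_paragraph("Other heading", "Heading 1"),
    fake_paragraph("Bla-bla-bla"),
    fake_paragraph("Table 3: Summary of Stuff"),
    fake_paragraph("APPENDIX 1: Blah", "Heading 1"),
    fake_paragraph("Different tables for your consideration"),
]
section_text = text_under_heading(paragraphs, "Other heading")
print(section_text)
```

With the real document this would be called as `text_under_heading(doc.paragraphs, "Other heading")`.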


#### Tables

In [4]:
doc.tables

[<docx.table.Table at 0x10f548550>,
 <docx.table.Table at 0x111836160>,
 <docx.table.Table at 0x1118360f0>,
 <docx.table.Table at 0x111836128>,
 <docx.table.Table at 0x1118361d0>]

In [22]:
doc.tables[-1]

<docx.table.Table at 0x11e277470>

In [3]:
doc.core_properties.title

''

In [24]:
# Print each row of the last table on its own line, cells separated by '|'
for row in doc.tables[-1].rows:
    print()
    for cell in row.cells:
        print(cell.text, end='|')


Group2|Group2|Group2|Group2|Group2|
Pest Name|Present |Interesting|Fluffy|Information|
Name 3|Y|N|N|bla: lalalalala lal alalala alala alamala|
Name 4|Y|N|N|bla: Some description text|
Name 5|N|N|N|bla: Some text that is not interesting and not intended to be read by a sensible being)
la: Some text that is not interesting and not intended to be read by a sensible being|
Name 6|Y|N|N|bla: Some text that is not interesting and not intended to be read by a sensible being|

In [25]:
# Read a table in the docx document into a dataframe
import pandas as pd

def load_table(docx_table):
    """Return the cell text of a python-docx table as a DataFrame."""
    table_data = []

    for row in docx_table.rows:
        # Collect the text of every cell in this row
        row_data = [cell.text for cell in row.cells]
        table_data.append(row_data)

    return pd.DataFrame(table_data)


In [26]:
df = load_table(doc.tables[-1])
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Group2 | Group2 | Group2 | Group2 | Group2 |
| 1 | Pest Name | Present | Interesting | Fluffy | Information |
| 2 | Name 3 | Y | N | N | bla: lalalalala lal alalala alala alamala |
| 3 | Name 4 | Y | N | N | bla: Some description text |
| 4 | Name 5 | N | N | N | bla: Some text that is not interesting and not... |
| 5 | Name 6 | Y | N | N | bla: Some text that is not interesting and not... |


In [28]:
df = load_table(doc.tables[3])
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Group1 | Group1 | Group1 | Group1 | Group1 |
| 1 | Name | Present | Interesting | Fluffy | Information |
| 2 | Name 1 | Y | N | N | Some info about name 1 |
| 3 | Name 2 | Y | N | N | bla: notinteresting |
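
In both tables the real column names sit in the second row. The repeated group name in the first row looks like a merged title cell, which python-docx appears to report once per column. A possible next step is to use the header row as keys, as sketched below. `table_rows` and `rows_to_records` are my own helper names. The demo uses the cell text shown above as plain lists, so it runs without the sample file.

```python
def table_rows(docx_table):
    """Return the cell text of a python-docx table as a list of rows."""
    return [[cell.text for cell in row.cells] for row in docx_table.rows]


def rows_to_records(rows, header_index=0):
    """Turn rows into dicts keyed by the header row; earlier rows are dropped."""
    header = [name.strip() for name in rows[header_index]]
    return [dict(zip(header, row)) for row in rows[header_index + 1:]]


# Cell text of the 'Group1' table, copied from the output above
rows = [
    ["Group1", "Group1", "Group1", "Group1", "Group1"],
    ["Name", "Present", "Interesting", "Fluffy", "Information"],
    ["Name 1", "Y", "N", "N", "Some info about name 1"],
    ["Name 2", "Y", "N", "N", "bla: notinteresting"],
]
records = rows_to_records(rows, header_index=1)
for record in records:
    print(record)
```

With the real document, `rows_to_records(table_rows(doc.tables[3]), header_index=1)` should give the same records, and `pd.DataFrame(records)` would then produce a dataframe with named columns.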
