<a href="https://colab.research.google.com/github/gnurock/nlp_extraccion_sentencias/blob/main/Copia_de_01_extract_from_MSWord_checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook goes into detail on extracting information from Microsoft Word documents locally. Since many companies and roles are inseparable from the Microsoft Office Suite, this is a useful reference for anyone faced with data transferred as Word files. The techniques shown here rely on the .docx format, which is a zip archive of XML files; the older binary .doc format cannot be read this way.

This is a companion post to my entry about Thomas Edison State University's (TESU) open source materials accessibility initiative. <a href = "https://medium.com/@NatalieOlivo/preserving-web-content-of-links-provided-in-a-word-doc-using-aws-services-ec2-and-s3-2c4f0cee0a26">Medium post</a> / <a href = "https://github.com/nmolivo/tesu_scraper">Github repository</a><br><br>
The initial post I wrote for this project was very specific and bordered on unwieldy, so this post is the first in a series in which I take deep dives into separate aspects of the TESU project to make the material more accessible.

In [None]:
#specific to extracting information from word documents
import os
import zipfile

#useful tool for extracting information from XML
import re

#to pretty print our xml:
import xml.dom.minidom

A sample Word document can be found in this GitHub repository. Let's find it using the packages we've imported above.
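When it is not obvious which folder holds the document, a recursive search can save some manual listing. The following is a minimal sketch, assuming we only care about files ending in `.docx`; the helper name `find_docx` is my own and not part of any library.

```python
import os

def find_docx(root):
    """Return the paths of all .docx files found under root (hypothetical helper)."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            #Word creates lock files starting with '~$' while a document is open; skip those
            if name.lower().endswith('.docx') and not name.startswith('~$'):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling `find_docx('.')` searches the current directory and everything below it.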

In [None]:
#to check files in the current directory, use a single period
os.listdir('.')

['.config', 'sample_data']

In [None]:
#to check files in the directory above the current directory, use double periods
os.listdir('..')

['boot',
 'opt',
 'root',
 'proc',
 'sbin',
 'dev',
 'home',
 'sys',
 'bin',
 'lib',
 'etc',
 'srv',
 'var',
 'run',
 'mnt',
 'lib64',
 'usr',
 'tmp',
 'media',
 '.dockerenv',
 'tools',
 'datalab',
 'swift',
 'tensorflow-1.15.2',
 'content',
 'lib32']

In [None]:
#the sample Word document is expected in the "home" folder
os.listdir('../home')

[]

Great! We can find where the Word document is!<br><br>
We will now use the <a href = "https://docs.python.org/3/library/zipfile.html">zipfile</a> library to help us read our document. The defaults are listed below, and they all suit our purpose of reading the Word document.<br><br>
```python
class zipfile.ZipFile(file, mode='r', compression=ZIP_STORED, allowZip64=True, compresslevel=None)
```
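As a small illustration of those defaults, here is a sketch that opens an archive with `mode='r'` inside a `with` block, so the archive is closed automatically. It builds a tiny stand-in archive in memory instead of using a real Word file; the entry name and its contents are placeholders for illustration only.

```python
import io
import zipfile

#build a tiny stand-in archive in memory (a real .docx is a zip archive with entries like this one)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('word/document.xml', '<w:document/>')

#mode='r' is the default; the with-block closes the archive when it ends
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    #testzip() returns None when no entry is corrupt
    is_valid = archive.testzip() is None

print(names)
print(is_valid)
```

For a file on disk, `zipfile.is_zipfile(path)` is a quick way to check that it really is a zip archive before opening it.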

In [None]:
document = zipfile.ZipFile('/home/012 a Sent. Toca Penal 07-2016 GES MUJER.docx')

In [None]:
document

OK, now to read the XML out of the archive using
```python
ZipFile.read(name, pwd=None)
```

As you can see, we need a name. This is not the filename of the Word document; it is the name of a member inside the archive, and it tells ZipFile which part of the document to read. For our purposes, we'll want to see the XML that makes up the file.
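To make the difference between a member name and a filename concrete, here is a minimal sketch using an in-memory archive. The member names mimic those of a Word document, but the contents are placeholders I made up for illustration.

```python
import io
import zipfile

#stand-in archive with two members; the contents are placeholders
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('word/document.xml', '<doc>hello</doc>')
    archive.writestr('docProps/core.xml', '<core/>')

with zipfile.ZipFile(buffer) as archive:
    #ZipFile.read takes the name of an archive member, not a path on disk
    raw = archive.read('word/document.xml')
    #asking for a member that is not in the archive raises KeyError
    try:
        archive.read('word/people.xml')
        missing = False
    except KeyError:
        missing = True

#read() returns bytes, so the content needs decoding before string operations
print(type(raw))
print(missing)
```

Checking `namelist()` first avoids the `KeyError` when a document lacks an optional part.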

In [None]:
document.namelist()

['[Content_Types].xml',
 '_rels/.rels',
 'word/_rels/document.xml.rels',
 'word/document.xml',
 'word/header1.xml',
 'word/_rels/header1.xml.rels',
 'word/endnotes.xml',
 'word/footnotes.xml',
 'word/footer1.xml',
 'word/media/image1.png',
 'word/theme/theme1.xml',
 'word/settings.xml',
 'customXml/itemProps2.xml',
 'customXml/itemProps1.xml',
 'customXml/item2.xml',
 'customXml/_rels/item2.xml.rels',
 'customXml/_rels/item1.xml.rels',
 'customXml/item1.xml',
 'docProps/core.xml',
 'word/numbering.xml',
 'word/styles.xml',
 'word/webSettings.xml',
 'word/fontTable.xml',
 'docProps/app.xml']

Just to see what kind of information is stored in our document, let's look at a few of these parts.
I'm going to pretty print each one so it's a little more legible. This approach comes from an answer by user <a href = "https://stackoverflow.com/users/47775/nick-bolton">Nick Bolton</a> to the StackOverflow question <a href = "https://stackoverflow.com/questions/749796/pretty-printing-xml-in-python">Pretty Printing XML in Python</a>.
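Since the same pretty-printing steps are repeated for each part, they can be wrapped in a small function. This is a sketch only: the name `pretty_xml` is my own, and the sample input is a made-up snippet rather than real Word XML.

```python
import re
import xml.dom.minidom

def pretty_xml(xml_bytes):
    """Return an indented copy of xml_bytes with text kept inline (hypothetical helper)."""
    ugly = xml.dom.minidom.parseString(xml_bytes).toprettyxml(indent='  ')
    #some Python versions put text nodes on their own lines; this pulls them back inline
    #(where toprettyxml already keeps text inline, the substitution changes nothing)
    text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
    return text_re.sub(r'>\g<1></', ugly)

sample = b'<root><item>text</item><empty/></root>'
pretty = pretty_xml(sample)
print(pretty)
```

With a real document, the call would look like `pretty_xml(document.read('word/fontTable.xml'))`.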

In [None]:
#name = 'word/document.xml'
#this is the main part of the document, which holds its body text
uglyXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml(indent='  ')

#raw strings keep the regex backslashes intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

<?xml version="1.0" ?>
<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessing

In [None]:
#name = 'word/fontTable.xml'
#This looks like the fonts used in the document style
uglyXml = xml.dom.minidom.parseString(document.read('word/fontTable.xml')).toprettyxml(indent='  ')

#raw strings keep the regex backslashes intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

<?xml version="1.0" ?>
<w:fonts mc:Ignorable="w14 w15" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml">
  <w:font w:name="Times New Roman">
    <w:panose1 w:val="02020603050405020304"/>
    <w:charset w:val="00"/>
    <w:family w:val="roman"/>
    <w:pitch w:val="variable"/>
    <w:sig w:csb0="000001FF" w:csb1="00000000" w:usb0="E0002EFF" w:usb1="C0007843" w:usb2="00000009" w:usb3="00000000"/>
  </w:font>
  <w:font w:name="Calibri">
    <w:panose1 w:val="020F0502020204030204"/>
    <w:charset w:val="00"/>
    <w:family w:val="swiss"/>
    <w:pitch w:val="variable"/>
    <w:sig w:csb0="0000019F" w:csb1="00000000" w:usb0="E00002FF" w:usb1="4000ACFF" w:usb2="00000001" w:usb3="00000000"/>
  </

In [None]:
#ok cool let's get the xml that has the text contained in the document
#name = 'word/document.xml'
uglyXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml(indent='  ')

#raw strings keep the regex backslashes intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

<?xml version="1.0" ?>
<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessing

Per the scope of this contract, we'll need to find all hyperlinks stored in this document and add them to a list. Let's take a look at what hyperlinks look like in the xml:
```
      <w:hyperlink r:id="rId27">
        <w:r w:rsidRPr="0005387D">
          <w:rPr>
            <w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
            <w:color w:val="0000FF"/>
            <w:sz w:val="20"/>
            <w:szCs w:val="20"/>
            <w:u w:val="single"/>
          </w:rPr>
          <w:t>https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager</w:t>
        </w:r>
      </w:hyperlink>
```

A simple pattern I notice is that the link text always starts with the characters `>http` and ends with the characters `</`.<br> Now we can convert our XML to a string and use a regex to collect all text between those characters.

To write the regex that collects all text between those characters, I used the following StackOverflow question, whose initial ask contains exactly what I am looking for: <a href = "https://stackoverflow.com/questions/1454913/regular-expression-to-find-a-string-included-between-two-characters-while-exclud">Regular Expression to find a string included between two characters while EXCLUDING the delimiters.</a>

While I do want to keep the `http`, I do not want to keep the `<` or `>`. I will make these modifications to my list items using list comprehension.
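Here is a minimal sketch of that idea on a made-up XML string, so it runs without a Word file. The sample markup and URLs are placeholders, and the pattern follows the `>http` ... `<` description above, so it may differ slightly from the one used in the notebook cells.

```python
import re

#a small stand-in for the decoded contents of word/document.xml
sample_xml = (
    '<w:p><w:hyperlink r:id="rId1"><w:r><w:t>https://example.com/a</w:t></w:r></w:hyperlink>'
    '<w:r><w:t>plain text</w:t></w:r>'
    '<w:hyperlink r:id="rId2"><w:r><w:t>http://example.org/b?x=1</w:t></w:r></w:hyperlink></w:p>'
)

#non-greedy match from '>http' up to the next '<', delimiters included
matches = re.findall(r'>http.*?<', sample_xml)
#strip the leading '>' and the trailing '<' with a list comprehension
links = [m[1:-1] for m in matches]
print(links)
```

Anchoring the pattern on `>http` skips the namespace URLs in the document header, because those appear inside attribute values and not directly after a `>`.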

In [None]:
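#first, read the xml content (bytes) and decode it into a string:
xml_content = document.read('word/document.xml')
document.close()
#str() on bytes would give the "b'...'" representation, so decode instead
xml_str = xml_content.decode('utf-8')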

ValueError: ignored

In [None]:
!pip install docx2python



In [None]:
from docx2python import docx2python
from docx2python.iterators import enum_paragraphs

In [None]:
doc_result = docx2python('/home/012 a Sent. Toca Penal 07-2016 GES MUJER.docx')

NameError: ignored

In [None]:
doc_result.body[0]

[[['TRIBUNAL SUPERIOR DE JUSTICIA DEL ESTADO. PRIMERA SALA PENAL Y ESPECIALIZADA EN JUSTICIA PARA ADOLESCENTES. MAGISTRADOS DOCTORA, MARIBEL MENDOZA FLORES, PRESIDENTA; LICENCIADOS ALEJANDRO ENRIQUE FIGUEROA Y MANUEL DE JESÚS LÓPEZ LÓPEZ.',
   'REYES MANTECÓN, SAN BARTOLO COYOTEPEC, OAXACA, SEIS DE JULIO DE DOS MIL DIECISÉIS.',
   '\tV I S T O S los autos del toca penal oral número 07/2016, formado con motivo del RECURSO DE CASACIÓN interpuesto por el sentenciado  GABRIEL VICHIDO RUÍZ en contra de la sentencia pronunciada el nueve de marzo de dos mil dieciséis, por el Tribunal de Juicio Oral de la región Istmo, con sede en Salina Cruz, Oaxaca, en la causa penal 365/2014, dictada, en vía de reposición de sentencia, al acreditarse la existencia del delito de EQUIPARADO A LA VIOLACIÓN en agravio de la víctima FRANCISCA MARCIAL VILLALANA y la plena responsabilidad de GABRIEL VICHIDO RUÍZ, en su comisión.',
   'R E S U L T A N D O:',
   'PRIMERO.  El día nueve de marzo de dos mil dieciséis,

In [None]:
doc_result.html_map

'<html><body><table border="1"><tr><td><pre>(0, 0, 0, 0) PRIMERA SALA PENAL Y ESPECIALIZADA EN JUSTICIA PARA ADOLESCENTES.</pre></td><td><pre>(0, 0, 1, 0) ----media/image1.png----</pre></td></tr></table><table border="1"><tr><td><pre>(1, 0, 0, 0) Recurso de casación                                                                       Toca penal oral   07/2016.</pre><pre>(1, 0, 0, 1) </pre><pre>(1, 0, 0, 2) </pre><pre>(1, 0, 0, 3) </pre></td></tr></table><table border="1"><tr><td><pre>(2, 0, 0, 0) TRIBUNAL SUPERIOR DE JUSTICIA DEL ESTADO. PRIMERA SALA PENAL Y ESPECIALIZADA EN JUSTICIA PARA ADOLESCENTES. MAGISTRADOS DOCTORA, MARIBEL MENDOZA FLORES, PRESIDENTA; LICENCIADOS ALEJANDRO ENRIQUE FIGUEROA Y MANUEL DE JESÚS LÓPEZ LÓPEZ.</pre><pre>(2, 0, 0, 1) REYES MANTECÓN, SAN BARTOLO COYOTEPEC, OAXACA, SEIS DE JULIO DE DOS MIL DIECISÉIS.</pre><pre>(2, 0, 0, 2) \tV I S T O S los autos del toca penal oral número 07/2016, formado con motivo del RECURSO DE CASACIÓN interpuesto por el sentenciado

In [None]:
doc_result.footnotes

[['<td><pre>(4, 0, 0, 0) </pre><pre>(4, 0, 0, 1) </pre></td>']]

In [None]:
content= doc_result.document

In [None]:
s=enum_paragraphs(content)

In [None]:
s

<generator object enum_at_depth.<locals>.enumerate_next_depth at 0x7f049d319e60>

In [None]:
!pip install docx2txt

Collecting docx2txt
  Downloading https://files.pythonhosted.org/packages/7d/7d/60ee3f2b16d9bfdfa72e8599470a2c1a5b759cb113c6fe1006be28359327/docx2txt-0.8.tar.gz
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-cp36-none-any.whl size=3965 sha256=d00ac7f71ed29d3ba4d66b23c1509b862830ecff117af6fd3c3a45764d83aa82
  Stored in directory: /root/.cache/pip/wheels/b2/1f/26/a051209bbb77fc6bcfae2bb7e01fa0ff941b82292ab084d596
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8


In [None]:
import docx2txt
 
result = docx2txt.process("/home/012 a Sent. Toca Penal 07-2016 GES MUJER.docx")

FileNotFoundError: ignored

In [None]:
result

'PRIMERA SALA PENAL Y ESPECIALIZADA EN JUSTICIA PARA ADOLESCENTES.\n\n\n\nRecurso de casación                                                                       Toca penal oral   07/2016.\n\n\n\n\n\n\n\nTRIBUNAL SUPERIOR DE JUSTICIA DEL ESTADO. PRIMERA SALA PENAL Y ESPECIALIZADA EN JUSTICIA PARA ADOLESCENTES. MAGISTRADOS DOCTORA, MARIBEL MENDOZA FLORES, PRESIDENTA; LICENCIADOS ALEJANDRO ENRIQUE FIGUEROA Y MANUEL DE JESÚS LÓPEZ LÓPEZ.\n\nREYES MANTECÓN, SAN BARTOLO COYOTEPEC, OAXACA, SEIS DE JULIO DE DOS MIL DIECISÉIS.\n\n\tV I S T O S los autos del toca penal oral número 07/2016, formado con motivo del RECURSO DE CASACIÓN interpuesto por el sentenciado  GABRIEL VICHIDO RUÍZ en contra de la sentencia pronunciada el nueve de marzo de dos mil dieciséis, por el Tribunal de Juicio Oral de la región Istmo, con sede en Salina Cruz, Oaxaca, en la causa penal 365/2014, dictada, en vía de reposición de sentencia, al acreditarse la existencia del delito de EQUIPARADO A LA VIOLACIÓN en agravio 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os


In [None]:
ruta="/content/drive/My Drive/Colab Notebooks/saturday ia/justicia abierta/0 versiones finales/"

In [None]:
local_download_path = os.path.expanduser(ruta)

In [None]:
ruta_salida = "/content/drive/My Drive/Colab Notebooks/saturday ia/justicia abierta/salida_doc2text/"

In [None]:
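for file in os.listdir(local_download_path):
    if file.endswith(".docx"):
        #the first three characters of the file name are the document number
        print(file[0:3])
        filename = os.path.join(ruta, file)
        result = docx2txt.process(filename)
        archivo_salida = os.path.join(ruta_salida, file[0:3] + ".txt")
        #append mode: files that share a number end up in the same output file
        with open(archivo_salida, "a") as f:
            f.write(result)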
      

004
004
006
001
007
009
011
003
012
005
006
007
014
015
016
017
012
018
014
020
015
022
019
026
021
027
022
028
023
024
024
030
025
031
026
032
027
028
034
031
036
032
033
038
034
040
035
042
037
043
044
044
040
041
042
043
048
049
050
046
051
052
047
053
049
055
056
056
057
052
053
059
054
058
062
060
065
066
061
067
069
070
070
072
064
073
075
068
077
072
079
074
081
084
085
078
079
087
080
088
089
090
091
086
092
088
089
094
095
090
091
096
097
093
098
099
100
095
101
102
097
103
104
099
100
101
107
108
103
111
106
107
114
108
109
116
110
119
114
121
123
117
118
124
120
126
122
127
128
123
129
125
127
132
128
130
131
138
139
133
140
134
141
135
136
137
143
139
140
142


In [None]:
#write the text of the last processed document to a single file
with open("salida.txt", "w") as f:
    f.write(result)


In [None]:
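#collect everything from 'http' up to the next '<', skipping the first match
link_list = re.findall(r'http.*?<', xml_str)[1:]
#drop the trailing '<' from each match
link_list = [x[:-1] for x in link_list]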

In [None]:
link_list

In [None]:
len(link_list)

Great! We've now collected all the URLs from this Word document! Check out <a href= "https://github.com/nmolivo/tesu_scraper/blob/master/01_scraper.ipynb">the Notebook</a> uploaded to <a href = "https://github.com/nmolivo/tesu_scraper">the TESU Scraper Repo</a>, where we use the technique covered in this notebook to collect a link list for each Word document contained in a folder stored in an AWS S3 bucket.

Created by Natalie Olivo
<a href = "https://www.linkedin.com/in/natalie-olivo-82548951/">LinkedIn</a> | <a href = "https://github.com/nmolivo">GitHub</a> | <a href = "https://medium.com/@NatalieOlivo">Blog</a>