# Extracting Full Text from a .docx File Using Python

Source: [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/chapter13/)

First, we'll start by installing the `python-docx` module.

In [1]:
!pip3 install python-docx



Next, we import the module. Note that the package is installed as `python-docx` but imported as `docx`.

In [2]:
import docx

## Overall look

To open a .docx file in Python, call `docx.Document()` and pass it the filename `IRS_Article.docx`. This returns a `Document` object whose `paragraphs` attribute is a list of `Paragraph` objects. Calling `len()` on `doc.paragraphs` returns 34, which tells us that the document contains thirty-four `Paragraph` objects.

In [3]:
doc = docx.Document('IRS_Article.docx')
len(doc.paragraphs)

34
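The count returned by `len()` includes paragraphs with no visible text, such as the blank lines an author leaves between sections. The following is a minimal sketch of how the non-empty paragraphs could be counted separately. The helper name `count_nonempty()` is hypothetical, and `SimpleNamespace` objects stand in for real `Paragraph` objects so that the snippet runs without a .docx file; the only thing it assumes about each object is a `text` attribute.

```python
from types import SimpleNamespace


def count_nonempty(paragraphs):
    """Count paragraphs whose text contains something other than whitespace.

    Hypothetical helper: works on any iterable of objects with a `text` attribute.
    """
    return sum(1 for para in paragraphs if para.text.strip())


# Stand-ins for Paragraph objects (assumption for illustration only).
sample = [
    SimpleNamespace(text='II. THE IRS '),
    SimpleNamespace(text=''),
    SimpleNamespace(text='A. IRS Data Collection '),
]

print(len(sample))             # 3
print(count_nonempty(sample))  # 2
```

With a real document, the same helper could be called as `count_nonempty(doc.paragraphs)`.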

Each of these `Paragraph` objects has a `text` attribute that contains the text of that paragraph as a string (without the style information). Here, the first `text` attribute contains *'II. THE IRS '*, and the second contains *'The IRS is the branch of the United States Department of Treasury that is responsible for administering the Internal Revenue Code and enforcing tax law [...]'*.

In [4]:
doc.paragraphs[0].text

'II. THE IRS '

In [5]:
doc.paragraphs[1].text

'The IRS is the branch of the United States Department of Treasury that is responsible for administering the Internal Revenue Code and enforcing tax law. Income taxes were introduced to the United States in 1913 when the Sixteenth Amendment was enacted. While the Treasury Department collects the taxes, the IRS is responsible for examining the tax returns for accuracy and bringing criminal action against those who file incorrect returns. Each tax return is checked internally for mathematical accuracy and consistency, regardless of whether it is submitted via mail or electronically. The IRS also compares the submitted returns to third-party materials that are required to be filed with the IRS, such as W-2s and 1099s. Today the IRS is taking advantage of the large amount of data that can be purchased from data brokers as well as amassing its own data sets. '

If you care only about the text of the Word document, not its styling information, you can write a `getText()` function that accepts the filename of a *.docx* file and returns its text as a single string.

The `getText()` function below opens the Word document, loops over all the `Paragraph` objects in the `paragraphs` list, and appends each one's text to the `fullText` list. After the loop, the strings in `fullText` are joined together with newline characters.

In [6]:
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

In [7]:
print(getText('IRS_Article.docx'))

II. THE IRS 
The IRS is the branch of the United States Department of Treasury that is responsible for administering the Internal Revenue Code and enforcing tax law. Income taxes were introduced to the United States in 1913 when the Sixteenth Amendment was enacted. While the Treasury Department collects the taxes, the IRS is responsible for examining the tax returns for accuracy and bringing criminal action against those who file incorrect returns. Each tax return is checked internally for mathematical accuracy and consistency, regardless of whether it is submitted via mail or electronically. The IRS also compares the submitted returns to third-party materials that are required to be filed with the IRS, such as W-2s and 1099s. Today the IRS is taking advantage of the large amount of data that can be purchased from data brokers as well as amassing its own data sets. 

A. IRS Data Collection 
Prior to discussing the potential issues with the IRS’s use of data analytics, it is important t

You can also adjust `getText()` to modify the string before returning it. For example, to indent each paragraph, replace the `append()` call with `fullText.append('\t' + para.text)`. To add a blank line between paragraphs, change the `join()` call to `return '\n\n'.join(fullText)`. With that said, let's refine our `getText()` function with both changes and compare the results.

In [8]:
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append('\t' + para.text)
    return '\n\n'.join(fullText)

In [9]:
getText('IRS_Article.docx')

'\tII. THE IRS \n\n\tThe IRS is the branch of the United States Department of Treasury that is responsible for administering the Internal Revenue Code and enforcing tax law. Income taxes were introduced to the United States in 1913 when the Sixteenth Amendment was enacted. While the Treasury Department collects the taxes, the IRS is responsible for examining the tax returns for accuracy and bringing criminal action against those who file incorrect returns. Each tax return is checked internally for mathematical accuracy and consistency, regardless of whether it is submitted via mail or electronically. The IRS also compares the submitted returns to third-party materials that are required to be filed with the IRS, such as W-2s and 1099s. Today the IRS is taking advantage of the large amount of data that can be purchased from data brokers as well as amassing its own data sets. \n\n\t\n\n\tA. IRS Data Collection \n\n\tPrior to discussing the potential issues with the IRS’s use of data analy

In [10]:
print(getText('IRS_Article.docx'))

	II. THE IRS 

	The IRS is the branch of the United States Department of Treasury that is responsible for administering the Internal Revenue Code and enforcing tax law. Income taxes were introduced to the United States in 1913 when the Sixteenth Amendment was enacted. While the Treasury Department collects the taxes, the IRS is responsible for examining the tax returns for accuracy and bringing criminal action against those who file incorrect returns. Each tax return is checked internally for mathematical accuracy and consistency, regardless of whether it is submitted via mail or electronically. The IRS also compares the submitted returns to third-party materials that are required to be filed with the IRS, such as W-2s and 1099s. Today the IRS is taking advantage of the large amount of data that can be purchased from data brokers as well as amassing its own data sets. 

	

	A. IRS Data Collection 

	Prior to discussing the potential issues with the IRS’s use of data analytics, it is im
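The printed output contains a line that holds only a tab, which comes from an empty paragraph in the document, and each paragraph's text keeps its trailing space. If that is not wanted, the indentation and separator can be made configurable and blank paragraphs dropped. The following is a minimal sketch under those assumptions; `format_paragraphs()` and its parameters are hypothetical names, and it works on a plain list of strings so it can be tried without a .docx file.

```python
def format_paragraphs(texts, indent='\t', separator='\n\n', skip_empty=True):
    """Join paragraph strings with a configurable indent and separator.

    Hypothetical helper: `texts` is any iterable of strings.
    """
    cleaned = []
    for text in texts:
        text = text.strip()          # drop leading and trailing whitespace
        if skip_empty and not text:  # skip paragraphs with no visible text
            continue
        cleaned.append(indent + text)
    return separator.join(cleaned)


# Sample strings taken from the paragraphs shown earlier.
texts = ['II. THE IRS ', '', 'A. IRS Data Collection ']
result = format_paragraphs(texts)
print(repr(result))  # '\tII. THE IRS\n\n\tA. IRS Data Collection'
```

With a real document, the list of strings could be built as `[para.text for para in doc.paragraphs]` and passed to the helper.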