# Reading other kinds of files

## Word Documents

The best package for reading the contents of modern Word documents (i.e., files with a `.docx` extension) is `docx2txt`. It returns the full text, stripping out all formatting information.

To install (from within a notebook):
~~~~
    %conda install -c conda-forge docx2txt
~~~~

In [None]:
import docx2txt

In [None]:
text = docx2txt.process('data/pandas_wiki.docx')

In [None]:
text

In [None]:
print(text)
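
For intuition about what "stripping out all formatting" involves, here is a minimal sketch using only the standard library. It relies on the fact that a `.docx` file is a ZIP archive whose main text lives in `word/document.xml`. The helper name `docx_paragraphs` and the tiny in-memory document are illustrative assumptions, and this is not a description of how `docx2txt` itself is implemented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_paragraphs(docx_file):
    '''Return the text of each paragraph in a .docx file (hypothetical helper).'''
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read('word/document.xml')

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + 'p'):
        # Each <w:t> element holds a run of text; formatting is stored in other elements
        runs = [node.text or '' for node in paragraph.iter(W_NS + 't')]
        paragraphs.append(''.join(runs))
    return paragraphs


# Build a tiny stand-in document in memory so the sketch runs without a real file
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', sample_xml)
buffer.seek(0)

sample_paragraphs = docx_paragraphs(buffer)
print(sample_paragraphs)
```

The bold marker (`<w:b/>`) in the first run is ignored, so only the plain text survives. Real documents also contain tables, headers, and footnotes that this sketch does not handle.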


<div class="alert alert-info">
<h3>Your Turn</h3>


<p>View the contents of one of your Word documents in Python.

</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
text = docx2txt.process('path/to/file.docx')
    
</code>
</details>


## PDF Documents

The best package for reading the contents of PDF documents is `PyPDF2`. It can be used to read and write PDF documents. It can extract the text, but only when the text is stored in the document, as is commonly the case when PDFs are created on a computer from the original document. It cannot extract the text from scanned documents, because the text there is stored as an image.

To install (from within a notebook):
~~~~
    %conda install -c conda-forge PyPDF2
~~~~

In [None]:
import PyPDF2

Extracting the text of a PDF is more complicated than extracting it from a Word document. After the file is opened, it needs to be parsed, and then the text is extracted from each page.

In [None]:
pdfFileObj = open('docs/19-1222.pdf', 'rb')

In [None]:
# PdfReader replaces the older PdfFileReader name, which PyPDF2 3.0 removed
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [None]:
print(len(pdfReader.pages))

In [None]:
first_page_text = pdfReader.pages[0].extract_text()

print(first_page_text[:100])
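
Because scanned pages store their text as an image, extraction typically returns little or nothing for them. A minimal sketch of a check for such pages follows. It assumes that a page without stored text yields an empty or whitespace-only string, which is common but not guaranteed. The helper name `pages_without_text` and the stand-in strings are illustrative, not part of `PyPDF2`.

```python
def pages_without_text(page_texts):
    '''Return 1-based page numbers whose extracted text is empty or whitespace.'''
    return [number
            for number, text in enumerate(page_texts, start=1)
            if not (text or '').strip()]


# Stand-in strings in place of real per-page extraction results
example_page_texts = ['Department of Justice\nPress release', '   \n', '']
empty_pages = pages_without_text(example_page_texts)
print(empty_pages)
```

Pages flagged this way are candidates for optical character recognition, which is outside what `PyPDF2` does.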

If you want all of the text from a particular PDF, you can iterate over each page. In this case, I add a line break (`\n`) between pages.

In [None]:
full_text = ''

# Iterate over the page objects directly and append each page's text
for page in pdfReader.pages:
    new_page_text = page.extract_text()
    full_text = full_text + '\n' + new_page_text

In [None]:
len(full_text)

In [None]:
print(full_text)
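
The loop above puts a line break in front of every page, including the first. If that leading break is unwanted, one alternative is to collect the page texts in a list and join them. This is a sketch using stand-in strings rather than a real PDF, and `join_pages` is a hypothetical helper name.

```python
def join_pages(page_texts, separator='\n'):
    '''Join per-page strings with a separator placed only between pages.'''
    return separator.join(page_texts)


# Stand-in strings in place of real per-page extraction results
example_page_texts = ['Page one text', 'Page two text', 'Page three text']
joined_text = join_pages(example_page_texts)
print(joined_text)
```

Joining a list also avoids building a new, longer string on every pass through the loop, which can matter for very long documents.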


<div class="alert alert-info">
<h3>Your Turn</h3>


<p>View the text of the DOJ press release named 19-1223.pdf (in the docs folder).

</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
pdfFileObj = open('docs/19-1223.pdf', 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)
full_text = ''

for page in pdfReader.pages:
    new_page_text = page.extract_text()
    full_text = full_text + '\n' + new_page_text
print(full_text)
</code>
</details>

This could all be bundled into a function:

In [None]:
def extract_text(pdf_file_name):
    '''Extract text contents from a PDF file'''

    # The with block closes the file once every page has been read
    with open(pdf_file_name, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfReader(pdfFileObj)

        full_text = ''

        for page in pdfReader.pages:
            new_page_text = page.extract_text()
            full_text = full_text + '\n' + new_page_text
    return full_text
    

In [None]:
text = extract_text('docs/19-1222.pdf')

In [None]:
print(text)

Alternatively, if you wanted to preserve the pages as individual documents:

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 200)


def extract_page(pdfReader, page_number):
    '''Extract the text of one page as a dictionary'''
    text = pdfReader.pages[page_number].extract_text()
    return {'page' : page_number + 1,
            'text' : text}

def extract_pages(pdf_file_name):
    '''Extract the text of each page of a PDF file into a DataFrame'''

    with open(pdf_file_name, 'rb') as pdfFileObj:
        # Pass this file's reader explicitly instead of relying on a global variable
        pdfReader = PyPDF2.PdfReader(pdfFileObj)

        pages = []

        for page_number in range(len(pdfReader.pages)):
            new_page_dict = extract_page(pdfReader, page_number)
            pages.append(new_page_dict)
    df = pd.DataFrame(pages)
    df['source'] = pdf_file_name

    return df
    

In [None]:
extract_pages('docs/19-1222.pdf')

## Plain text files

The contents of a plain text file can be read into a string.

In [None]:
filename = 'docs/19-1223.txt'

In [None]:
text = open(filename,'r').read()

In [None]:
print(text[:250])

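The above method does not scale well, because Python keeps the file open until it is closed explicitly or the file object is garbage-collected. Leaving files open ties up file handles, which causes problems when opening hundreds or thousands of files, so the preferred way to open files is with a `with` statement, which closes the file automatically: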

In [None]:
with open(filename, 'r') as infile:
    text = infile.read()

In [None]:
print(text[:250])
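
To see that `with` really does close the file, you can inspect the file object's `closed` attribute inside and after the block. The sketch below writes a small stand-in file to a temporary folder so it runs anywhere. The file name and contents are made up for illustration, and the explicit `encoding='utf-8'` is an assumption about how your files are stored.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, 'example.txt')

    # Create a small stand-in file to read back
    with open(example_path, 'w', encoding='utf-8') as outfile:
        outfile.write('A short stand-in press release.')

    with open(example_path, 'r', encoding='utf-8') as infile:
        example_text = infile.read()
        open_inside_block = not infile.closed   # still open here

    closed_after_block = infile.closed          # closed once the block ends

print(open_inside_block, closed_after_block)
```

The file is closed even if reading raises an error partway through, which is the main advantage over calling `close()` by hand.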


<div class="alert alert-info">
<h3>Your Turn</h3>


<p>View the text of the DOJ press release named 19-1224.txt (in the docs folder).

</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
fn = 'docs/19-1224.txt'
with open(fn, 'r') as infile:
    text = infile.read() 
</code>
</details>

## Loading multiple files

To load many files at once, use `glob` to list every file name that matches a pattern, then read each file in a loop.

In [None]:
from glob import glob

In [None]:
glob('docs/*.txt')

In [None]:
filename_list = glob('docs/*.txt')

In [None]:
pr_list = []
for filename in filename_list:
    with open(filename, 'r') as infile:
        text = infile.read()
        
    info = {'filename' : filename,
            'text'     : text}
    
    pr_list.append(info)

In [None]:
pr_df = pd.DataFrame(pr_list)

pr_df.head()

In [None]:
print(pr_df['text'].values[0])
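
The same loop can be wrapped in a reusable function. The sketch below uses `pathlib` instead of `glob.glob` and sorts the matches, because glob results come back in arbitrary order. The function name `read_text_files`, the UTF-8 encoding, and the stand-in files in a temporary folder are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_text_files(folder, pattern='*.txt'):
    '''Read every file in folder matching pattern into a list of dictionaries.'''
    records = []
    # Sort the matches so the order is the same on every run
    for path in sorted(Path(folder).glob(pattern)):
        records.append({'filename': path.name,
                        'text': path.read_text(encoding='utf-8')})
    return records


# Create a few stand-in files so the sketch runs without the docs folder
with tempfile.TemporaryDirectory() as folder:
    for name, body in [('b.txt', 'second'), ('a.txt', 'first'), ('notes.md', 'skipped')]:
        (Path(folder) / name).write_text(body, encoding='utf-8')

    example_records = read_text_files(folder)

print(example_records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `pr_list` above.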