# Sidebar on Dealing with Online PDFs and Other Formats

As I was fooling around with Ricardo's Principles of Economics and Smith's Wealth of Nations, I found that restricting myself to only plain text files was a bit of a handicap. So: how hard is it to wrangle text in other formats? Also, now that I am starting to get this up and running, it has been great that Jonathan (Conning) has offered a few comments here and there. 

Good question. I started by checking out [this web site](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167). This put me on the scent of the PDFminer package. The original package will not work under Python 3, because it was written for Python 2. An active fork, however, can be found at [this site](https://pypi.python.org/pypi/pdfminer2). I ```pip install```ed this with no problem.

As with almost all Python packages, there were some hiccups when it came to actually getting it to work. Two that I encountered were that pdfminer requires the cStringIO and chardet packages (I ```pip install```ed them both).
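A quick way to see which of these dependencies are actually importable is sketched below. The helper name `missing_modules` is made up for illustration. One caveat worth knowing: `cStringIO` was part of the Python 2 standard library, and in Python 3 its job is done by `io.StringIO`, so it may well show up as missing without that being a problem.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "io" ships with Python 3; the other names are the ones mentioned above.
print(missing_modules(["io", "cStringIO", "chardet", "pdfminer"]))
```

Anything in the printed list is a candidate for ```pip install```, with the `cStringIO` exception noted above.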

So, let's see how it works. First, I should mention that I tried using a wrapper for all of this called slate, but couldn't get it working. Instead, I fumbled around Stack Overflow and found some resources on how to work with pdfminer. The relevant question [is here](http://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python). The code there took a little modifying, mainly because of differences in how StringIO works these days, but here it is:

In [1]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os # Merely so we can easily get to the local document repository
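The StringIO difference mentioned above can be seen in isolation. This is only a small illustration, not part of the converter: in Python 3, `io.StringIO` holds text (`str`), while `io.BytesIO` holds raw bytes, and mixing the two raises a `TypeError`.

```python
from io import BytesIO, StringIO

# A text buffer accepts str and hands back str.
text_buffer = StringIO()
text_buffer.write("Principles of Political Economy")
extracted = text_buffer.getvalue()

# A byte buffer wants bytes, so text has to be encoded first.
byte_buffer = BytesIO()
byte_buffer.write("Principles".encode("utf-8"))

# Writing bytes into a text buffer fails in Python 3.
try:
    text_buffer.write(b"raw bytes")
    problem = None
except TypeError as err:
    problem = str(err)

print(extracted)
print(problem)
```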

The next task is to write a little program (actually borrowed from Stack Exchange and updated to work with the latest flavor of Python). I'm deliberately avoiding putting comments in the code block so I can write it to a file.

In [2]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os # Merely so we can easily get to the local document repository

def pdf2textConverter(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # Had to change this line, as it used the Python 2-only "file" built-in
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Now that we have a reader set up, let's try and do some stuff with it. A nice PDF copy of Malthus's Principles of Political Economy can be found [at this site](http://lf-oll.s3.amazonaws.com/titles/2188/Malthus_1462_EBk_v6.0.pdf), and will serve as a bit of a test document. To begin, change the directory so that we have a dedicated location for whatever we are looking at. Note that we have included both ```.txt``` and ```.pdf``` files in the ```.gitignore``` file for this project, so that we don't use up too much storage. If anybody ever reads this, they will have to download the files themselves.
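Since the files are not in the repository, a download step might look roughly like the sketch below. The helper `fetch_if_missing` is a made-up name, and the local file name is taken from the cells that follow. It skips the download when the file is already there.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL of the Malthus PDF linked above.
MALTHUS_URL = "http://lf-oll.s3.amazonaws.com/titles/2188/Malthus_1462_EBk_v6.0.pdf"

def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless a file already exists there."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest

# Usage (needs network access, so it is left commented out here):
# fetch_if_missing(MALTHUS_URL, Path("Texts") / "MalthusPrinciples.pdf")
```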

In [3]:
os.chdir('.\\Texts')

In [4]:
os.getcwd()

'C:\\Users\\mjbaker\\OneDrive - CUNY\\Documents\\github\\NLTKEconExp\\Texts'
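The `'.\\Texts'` path above is Windows-specific. A sketch of a more portable alternative, assuming the notebook is started from the folder that contains `Texts`, is to build paths with `pathlib` instead of changing directory:

```python
from pathlib import Path

# Relative to wherever the notebook was started (an assumption).
texts_dir = Path("Texts")

# The "/" operator joins path pieces with the right separator for the OS.
pdf_path = texts_dir / "MalthusPrinciples.pdf"

print(pdf_path.name, pdf_path.suffix)
```

A path built this way can be passed straight to `open`, so the working directory never needs to change.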

In [5]:
MalthusPrinciples=pdf2textConverter('MalthusPrinciples.pdf')

TypeError: TextConverter.__init__() got an unexpected keyword argument 'codec'

In [6]:
help(TextConverter)

Help on class TextConverter in module pdfminer.converter:

class TextConverter(PDFConverter)
 |  TextConverter(rsrcmgr, outfp, pageno=1, laparams=None, showpageno=False, imagewriter=None)
 |  
 |  ##  TextConverter
 |  ##
 |  
 |  Method resolution order:
 |      TextConverter
 |      PDFConverter
 |      PDFLayoutAnalyzer
 |      pdfminer.pdfdevice.PDFTextDevice
 |      pdfminer.pdfdevice.PDFDevice
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, rsrcmgr, outfp, pageno=1, laparams=None, showpageno=False, imagewriter=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  paint_path(self, gstate, stroke, fill, evenodd, path)
 |  
 |  receive_layout(self, ltpage)
 |  
 |  render_image(self, name, stream)
 |      # Some dummy functions to save memory/CPU when all that is wanted
 |      # is text.  This stops all the image and drawing output from being
 |      # recorded and taking up RAM.
 |  
 |  write_text(self, text)
 |  
 | 
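The signature printed by `help` has no `codec` parameter, which is what the `TypeError` is complaining about. A minimal sketch of the converter with that argument dropped is below. The name `pdf_to_text` is new here, the file check and the `with`/`finally` cleanup are additions rather than part of the original function, and the whole thing assumes the installed pdfminer matches the signature shown above.

```python
from io import StringIO
from pathlib import Path

def pdf_to_text(path):
    """Sketch of pdf2textConverter without the rejected `codec` argument."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so a missing file is reported before pdfminer is needed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    # Only arguments that appear in the signature printed by help() above.
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(path, "rb") as fp:
            # An empty page set and maxpages=0 mean "all pages", as in the original.
            for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                          caching=True, check_extractable=True):
                interpreter.process_page(page)
        return retstr.getvalue()
    finally:
        device.close()
        retstr.close()
```

With the PDF in place, `MalthusPrinciples = pdf_to_text('MalthusPrinciples.pdf')` would take the place of the failing call.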

In [24]:
print(MalthusPrinciples[:1000])

The Online Library of Liberty
A Project Of Liberty Fund, Inc.

Thomas Robert Malthus, Principles of Political
Economy [1836]

The Online Library Of Liberty

This E-Book (PDF format) is published by Liberty Fund, Inc., a private,
non-profit, educational foundation established in 1960 to encourage study of the ideal
of a society of free and responsible individuals. 2010 was the 50th anniversary year of
the founding of Liberty Fund.

It is part of the Online Library of Liberty web site http://oll.libertyfund.org, which
was established in 2004 in order to further the educational goals of Liberty Fund, Inc.
To find out more about the author or title, to use the site's powerful search engine, to
see other titles in other formats (HTML, facsimile PDF), or to make use of the
hundreds of essays, educational aids, and study guides, please visit the OLL web site.
This title is also part of the Portable Library of Liberty DVD which contains over
1,000 books and quotes about liberty and power, and 

Absolutely gorgeous! With this accomplished, let's turn to the job of creating a corpus of a few classics and comparing them. We will tackle this job in [this notebook](EconCorpus.ipynb).
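One way to carry the extracted text over to that notebook is to save it as plain text. This is only a sketch: it assumes the corpus notebook reads ```.txt``` files from the same `Texts` folder, and the helper name `save_text` is made up.

```python
from pathlib import Path

def save_text(text, dest):
    """Write extracted text to `dest` as UTF-8 and return the character count."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    return dest.write_text(text, encoding="utf-8")

# Usage, with the variable from the cells above:
# save_text(MalthusPrinciples, Path("Texts") / "MalthusPrinciples.txt")
```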