# Creating a sample Corpus and Analyzing it
In this notebook, the goal is to create a corpus from a group of classic economics texts. The first step is to get some classic texts and download them to a local directory in our repository, which we have called, simply, `Texts`.

I chose the following 10 economics books (based on interest and availability):
1. [*Wealth of Nations*](http://www.gutenberg.org/ebooks/3300), Adam Smith
2. [*Principles of Political Economy*](http://www.gutenberg.org/ebooks/33310), David Ricardo
3. [*Principles of Political Economy*](http://lf-oll.s3.amazonaws.com/titles/2188/Malthus_1462_EBk_v6.0.pdf), Thomas Malthus
4. [*An Essay on the Nature of Commerce in General*](http://lf-oll.s3.amazonaws.com/titles/285/Cantillon_0039_EBk_v6.0.pdf), Richard Cantillon
5. [*Reflections on the Formation and Distribution of Wealth*](http://oll.libertyfund.org/titles/turgot-reflections-on-the-formation-and-distribution-of-riches), M. Turgot
6. [*An Inquiry into the Principles of Political Economy*](http://livros01.livrosgratis.com.br/mc000259.pdf), James Steuart
7. [*England's Treasure by Forraign Trade*](https://archive.org/details/englandstre00muntuoft), Thomas Mun
8. [*A Treatise on Political Economy*](https://mises.org/library/treatise-political-economy), Jean-Baptiste Say
9. [*An Outline of the Science of Political Economy*](), Nassau Senior
10. [*Principles of Political Economy*](http://www.gutenberg.org/files/30107/30107-pdf.pdf), John Stuart Mill

I downloaded each of the above books (in text format where possible) and placed them in the local directory. Once again, anyone who wants to follow my work should do the same, as I don't want to use up all the GitHub storage! First, let's import everything we need. If anything isn't working, it is a good idea to take a look at [the previous notebook](NLTKTest.ipynb) to see what could be missing or going wrong.

In [32]:
import re
import itertools
import collections
import nltk
import urllib3
import unicodedata
import numpy as np
from pylab import rcParams
from nltk.corpus import stopwords
from math import sqrt                 # Thanks to Jonathan for reminding me to include this!
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

# Import my purpose-built pdf reader
import pdf2textConverter

# Also import os package so we can check for files 
import os

In [33]:
if os.path.isdir('.\\Texts'): os.chdir('.\\Texts') #Set to Texts if we aren't already there
os.listdir()

['EETRC.pdf',
 'ETFTTM.txt',
 'IPPEJS.txt',
 'MalthusPrinciples.pdf',
 'OSPENS.pdf',
 'pdf2textConverter.py',
 'PPEDR.txt',
 'PPEDRedit.txt',
 'PPEJSM.pdf',
 'PPETM.pdf',
 'PPETMedit.txt',
 'RPWMT.txt',
 'TPEJBS.pdf',
 'WONAS.txt',
 'WONASedit.txt']

The file names are about as expected: an abbreviation of the work's title, followed by the author's initials. Let's take them one by one and get them ready. Each book has to be opened, stripped of the boilerplate at its beginning and end, and then saved as a text file, just to keep things simple. I'll then save everything as a combined corpus (with no stopwords, I guess) that can be used for analysis.
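The open, trim, save pattern used in the cells that follow can be sketched as one small helper. This is only an illustration: `trim_between` is a hypothetical name that does not appear elsewhere in this notebook, and the sample text is made up. Unlike the cells below, it looks for the end marker only after the start marker, and it raises an error when a marker is missing, because `str.find` returns -1 in that case and the slice would otherwise be silently wrong.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Hypothetical helper, not part of the original notebook.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: " + repr(start_marker))
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found: " + repr(end_marker))
    return text[start:end]


# Made-up sample standing in for a downloaded book
sample = ("License header\n"
          "AN INQUIRY into something\n"
          "Body text\n"
          "END OF THIS PROJECT GUTENBERG EBOOK\n"
          "footer")
trimmed = trim_between(sample, "AN INQUIRY", "END OF THIS PROJECT GUTENBERG")
print(trimmed)
```

Failing loudly on a missing marker is a design choice for this sketch; the cells below keep the simpler slicing and rely on the markers being present.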

In [34]:
# WEALTH OF NATIONS - Adam Smith

# Save edited version if not already there:
if os.path.isfile('WONASedit.txt'):
    print("File already present")
else:
    with open('WONAS.txt','r') as source:    # Context manager closes the file for us
        WONAS=source.read()
    start=WONAS.find("AN INQUIRY")
    end=WONAS.find("END OF THIS PROJECT GUTENBERG")
    WONAS=WONAS[start:end]
    print(WONAS[:1000])
    with open('WONASedit.txt', 'w') as target:
        target.write(WONAS)

File already present


In [35]:
# PRINCIPLES OF POLITICAL ECONOMY - David Ricardo

# Save edited version if not already there:
if os.path.isfile('PPEDRedit.txt'):
    print("File already present")
else:
    with open('PPEDR.txt','r') as source:    # Context manager closes the file for us
        PPEDR=source.read()
    start=PPEDR.find("THE PRINCIPLES")
    end=PPEDR.find("END OF THIS PROJECT GUTENBERG")
    PPEDR=PPEDR[start:end]
    print(PPEDR[:1000])
    with open('PPEDRedit.txt', 'w') as target:
        target.write(PPEDR)

File already present


For some reason, I couldn't get my packaged version going, so I'm going to have to just rewrite the function here:


In [36]:
def pdf2textConverter(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')    # Changed from the deprecated file() call to open()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

In [37]:
# PRINCIPLES OF POLITICAL ECONOMY - Thomas Malthus

if os.path.isfile('PPETMedit.txt'):
    print("File already present")
else:
    PPETM=pdf2textConverter('PPETM.pdf')
    start=PPETM.find("Memoir of Robert Malthus")
    end_marker="can only obtain it by charity."
    end=PPETM.find(end_marker)+len(end_marker)    # Keep the closing sentence, full stop included
    PPETM=PPETM[start:end]
    # Decompose accented characters, then drop anything that is not ASCII
    PPETM=unicodedata.normalize('NFKD', PPETM).encode('ascii','ignore')
    with open('PPETMedit.txt', 'w') as target:
        target.write(PPETM.decode())

File already present


Note that there are a few additional gyrations in the above because the Malthus text was extracted from a PDF. It turns out that text pulled from a typical PDF needs to be stripped down a bit because of stray non-ASCII symbols introduced during conversion. So, we first normalize the text and encode it into ASCII bytes, ignoring any characters that cannot be represented, and then decode those bytes back into a string while writing the file.
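To make the normalize, encode, decode step concrete, here is a minimal sketch on a made-up string (the sample text is an assumption, not taken from any of the books). It uses the same `unicodedata.normalize('NFKD', ...)` and `.encode('ascii','ignore')` calls as the cell above.

```python
import unicodedata

# Made-up sample: an accented letter, an "fi" ligature (U+FB01)
# and a curly apostrophe (U+2019)
raw = "caf\u00e9 \ufb01nance, the labourer\u2019s wage"

# NFKD splits the accented e into "e" plus a combining accent
# and expands the ligature into the two letters "fi"
decomposed = unicodedata.normalize('NFKD', raw)

# Encoding to ASCII with 'ignore' drops whatever is still non-ASCII;
# decode() turns the resulting bytes back into a str
cleaned = decomposed.encode('ascii', 'ignore').decode()
print(cleaned)
```

Note the side effect: characters with no ASCII decomposition, such as the curly apostrophe, are dropped rather than replaced, so "labourer’s" comes out as "labourers".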

Also note that we encountered a bit of a hiccup with the first download of Cantillon: it turned out that the PDF files from the [Mises Institute library](http://mises.org) don't play very nicely with my program!

In [63]:
# ESSAY ON THE NATURE OF COMMERCE IN GENERAL - Richard Cantillon
if os.path.isfile('EETRCedit.txt'):
    print("File already present")
else:
    EETRC=pdf2textConverter('EETRC.pdf')
    start=EETRC.find("ESSAY ON THE NATURE OF TRADE IN GENERAL")
    end=EETRC.find("Richard Cantillon And The Nationality Of Political Economy")
    EETRC=EETRC[start:end]   
    EETRC=unicodedata.normalize('NFKD', EETRC).encode('ascii','ignore')
    with open('EETRCedit.txt', 'w') as target:
        target.write(EETRC.decode())

# Note in passing that some additional cleaning is needed:
#   1. Get rid of the URLs scattered throughout
#   2. Get rid of references to the Online Library of Liberty
#   3. Get rid of the recurring [RETURN TO TABLE OF CONTENTS] markers
#   4. There may be more!

File already present
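
The extra cleaning noted in the Cantillon cell (URLs and the recurring table-of-contents markers) could be handled with a few regular expressions. The following is only a sketch: `strip_web_residue` is a hypothetical helper, the sample sentence is made up, and the patterns are assumptions about what the residue looks like, so they would need checking against the actual text.

```python
import re


def strip_web_residue(text):
    """Remove URLs and bracketed navigation markers (hypothetical helper)."""
    # Drop anything that looks like an http(s) URL, up to the next whitespace
    text = re.sub(r'https?://\S+', '', text)
    # Drop navigation markers such as [RETURN TO TABLE OF CONTENTS]
    text = re.sub(r'\[RETURN TO [^\]]*\]', '', text, flags=re.IGNORECASE)
    # Collapse the runs of spaces or tabs left behind by the removals
    text = re.sub(r'[ \t]{2,}', ' ', text)
    return text


# Made-up sample line containing both kinds of residue
sample = ("Land is the source of wealth. [RETURN TO TABLE OF CONTENTS] "
          "See http://example.org/titles/285 for more.")
print(strip_web_residue(sample))
```

Newlines are left alone on purpose so that paragraph breaks in the book survive the cleaning.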


In [70]:
# REFLECTIONS ON WEALTH - M. Turgot 
if os.path.isfile('RFWMTedit.txt'):
    print("File already present")
else:
    RFWMT=pdf2textConverter('RFWMT.pdf')
    start=RFWMT.find("Table of Contents")
    end=len(RFWMT)+1
    RFWMT=RFWMT[start:end]   
    RFWMT=unicodedata.normalize('NFKD', RFWMT).encode('ascii','ignore')
    with open('RFWMTedit.txt', 'w') as target:
        target.write(RFWMT.decode())


In [80]:
# INQUIRY INTO THE PRINCIPLES OF POLITICAL ECONOMY - James Steuart
if os.path.isfile('IPPEJSedit.txt'):
    print("File already present")
else:
    IPPEJS=pdf2textConverter('IPPEJS.pdf')
    with open('IPPEJSedit.txt', 'w') as target:
        target.write(IPPEJS)