# Creating a Sample Corpus and Analyzing It
In this notebook, the goal is to create a corpus from a group of classic economics texts. The first step is to get some classic texts and download them as files into a local directory, which we have simply called `Texts`.

I chose the following 10 economics books (based on interest and availability):
1. [*Wealth of Nations*](http://www.gutenberg.org/ebooks/3300), Adam Smith
2. [*Principles of Political Economy*](http://www.gutenberg.org/ebooks/33310), David Ricardo
3. [*Principles of Political Economy*](http://lf-oll.s3.amazonaws.com/titles/2188/Malthus_1462_EBk_v6.0.pdf), Thomas Malthus
4. [*An Essay on Economic Theory*](https://mises.org/library/essay-economic-theory-0), Richard Cantillon
5. [*Reflections on the Formation and Distribution of Wealth*](https://archive.org/details/reflectionsonfor01turg), M. Turgot
6. [*An Inquiry into the Principles of Political Economy*](https://archive.org/details/inquiryintoprinc01steu), James Steuart
7. [*England's Treasure by Forraign Trade*](https://archive.org/details/englandstre00muntuoft), Thomas Mun
8. [*A Treatise on Political Economy*](https://mises.org/library/treatise-political-economy), Jean-Baptiste Say
9. [*An Outline of the Science of Political Economy*](), Nassau Senior
10. [*Principles of Political Economy*](http://www.gutenberg.org/files/30107/30107-pdf.pdf), John Stuart Mill

I downloaded each of the above books (in text format where possible) and placed them in the local directory. Anyone who wants to follow my work should do the same, as I don't want to use up the GitHub repository's storage. First, let's import everything we need. If anything isn't working, it is a good idea to take a look at [the previous notebook](NLTKTest.ipynb) to see what could be missing or going wrong.

In [58]:
import re
import itertools
import collections
import nltk
import urllib3
import numpy as np
from pylab import rcParams
from nltk.corpus import stopwords
from math import sqrt                 # Thanks to Jonathan for reminding me to include this!
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

# Import my purpose-built pdf reader
import pdf2textConverter

# Also import os package so we can check for files 
import os

In [35]:
# Move into Texts if we aren't already there (a relative path works on any OS)
if os.path.isdir('Texts'):
    os.chdir('Texts')
os.listdir()

['EETRC.pdf',
 'ETFTTM.txt',
 'IPPEJS.txt',
 'MalthusPrinciples.pdf',
 'OSPENS.pdf',
 'PPEDR.txt',
 'PPEJSM.pdf',
 'PPETM.pdf',
 'RPWMT.txt',
 'TPEJBS.pdf',
 'WONAS.txt']
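
Directory checks like the one above can also be written with `pathlib`, which builds paths that work across operating systems and avoids changing the working directory. The following is a minimal sketch, not the notebook's own method; the helper name `list_texts` is hypothetical, and the default folder name `Texts` is taken from the cell above.

```python
from pathlib import Path

def list_texts(folder="Texts"):
    """Return the sorted file names in `folder`, or an empty list if it does not exist."""
    path = Path(folder)
    if not path.is_dir():
        return []
    # Keep regular files only, so sub-directories are not mistaken for books
    return sorted(p.name for p in path.iterdir() if p.is_file())

print(list_texts())
```

Because the helper takes the folder as an argument, the same call works whether the notebook is started from the repository root or from somewhere else, as long as the right path is passed in.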

The abbreviations are about as expected: an abbreviation for the work, followed by the author's initials. Let's take the books one by one and get them ready. Each book has to be opened, stripped of the boilerplate at its beginning and end, and then saved as a text file, just to keep things simple. I'll then save everything as a combined corpus (probably with stopwords removed) that can be used for analysis.
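
To give a rough idea of what removing stopwords means for the combined corpus, here is a minimal sketch. It is an illustration only: the helper name `remove_stopwords` is hypothetical, and the tiny `SAMPLE_STOPWORDS` set is a stand-in assumption for the full list that `stopwords.words('english')` from NLTK would provide.

```python
import re

# Assumption: a tiny illustrative subset standing in for
# nltk.corpus.stopwords.words('english')
SAMPLE_STOPWORDS = {"the", "of", "and", "is", "it", "with", "which", "all", "every"}

def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

sample = "The annual labour of every nation is the fund"
print(remove_stopwords(sample))   # ['annual', 'labour', 'nation', 'fund']
```

Passing the stopword collection as a parameter keeps the sketch independent of NLTK, so the real list can be swapped in once its data has been downloaded.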

In [49]:
# WEALTH OF NATIONS - Adam Smith

# Save edited version if not already there:
if os.path.isfile('WONASedit.txt'):
    print("File already present")
else:
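    with open('WONAS.txt', 'r') as source:
        WONAS = source.read()
    # Keep only the text between the title and the Project Gutenberg footer
    start = WONAS.find("AN INQUIRY")
    end = WONAS.find("END OF THIS PROJECT GUTENBERG")
    WONAS = WONAS[start:end]
    print(WONAS[:1000])
    # The context manager makes sure the edited file is flushed and closed
    with open('WONASedit.txt', 'w') as target:
        target.write(WONAS)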

AN INQUIRY INTO THE NATURE AND CAUSES OF THE WEALTH OF NATIONS.


By Adam Smith




INTRODUCTION AND PLAN OF THE WORK.


The annual labour of every nation is the fund which originally supplies
it with all the necessaries and conveniencies of life which it annually
consumes, and which consist always either in the immediate produce
of that labour, or in what is purchased with that produce from other
nations.

According, therefore, as this produce, or what is purchased with it,
bears a greater or smaller proportion to the number of those who are
to consume it, the nation will be better or worse supplied with all the
necessaries and conveniencies for which it has occasion.

But this proportion must in every nation be regulated by two different
circumstances: first, by the skill, dexterity, and judgment with which
its labour is generally applied; and, secondly, by the proportion
between the number of those who are employed in useful labour, and that
of those who are not so employed. Whateve

In [50]:
# PRINCIPLES OF POLITICAL ECONOMY - David Ricardo

# Save edited version if not already there:
if os.path.isfile('PPEDRedit.txt'):
    print("File already present")
else:
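    with open('PPEDR.txt', 'r') as source:
        PPEDR = source.read()
    # Keep only the text between the title and the Project Gutenberg footer
    start = PPEDR.find("THE PRINCIPLES")
    end = PPEDR.find("END OF THIS PROJECT GUTENBERG")
    PPEDR = PPEDR[start:end]
    print(PPEDR[:1000])
    # Write Ricardo's text (not WONAS) to the edited file
    with open('PPEDRedit.txt', 'w') as target:
        target.write(PPEDR)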

THE PRINCIPLES

  OF

  POLITICAL ECONOMY,

  AND

  TAXATION.

  BY DAVID RICARDO, Esq.

  LONDON:

  JOHN MURRAY, ALBEMARLE-STREET

  1817.

  J. M^{c}CREERY. Printer,

  Black Horse Court, London.




PREFACE.


The produce of the earth--all that is derived from its surface by the
united application of labour, machinery, and capital, is divided among
three classes of the community; namely, the proprietor of the land, the
owner of the stock or capital necessary for its cultivation, and the
labourers by whose industry it is cultivated.

But in different stages of society, the proportions of the whole produce
of the earth which will be allotted to each of these classes, under the
names of rent, profit, and wages, will be essentially different;
depending mainly on the actual fertility of the soil, on the
accumulation of capital and population, and on the skill, ingenuity, and
instruments employed in agriculture.

To determine the laws which regulate this distribution, is the principal
p
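
The Smith and Ricardo cells above follow the same pattern: find a start marker, find an end marker, and keep what lies between. A minimal sketch of a helper that wraps this pattern is shown below. The name `trim_between` and its `keep_end` flag are hypothetical and not part of the notebook. Unlike the cells above, it raises an error when a marker is missing, because `str.find` returns `-1` in that case and the slice would silently produce the wrong text.

```python
def trim_between(text, start_marker, end_marker, keep_end=False):
    """Return the part of `text` from `start_marker` up to `end_marker`.

    If `keep_end` is True, the end marker itself is included in the result.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    if keep_end:
        end += len(end_marker)
    return text[start:end]

sample = "License header\nAN INQUIRY into things\nbody\nEND OF THIS PROJECT GUTENBERG EBOOK"
print(trim_between(sample, "AN INQUIRY", "END OF THIS PROJECT GUTENBERG"))
```

The `keep_end` option is meant for texts whose end marker is the last sentence of the book itself rather than a footer that should be discarded.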

For some reason, I couldn't get my packaged version of the PDF reader (`pdf2textConverter`) to work, so I'll just redefine the function here:


In [62]:
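def pdf2textConverter(path):
    """Extract the text of the PDF at `path` with pdfminer and return it as a string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()                 # In-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()               # Default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')               # open() replaces the Python 2 file() built-in
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0                        # 0 means no page limit
    caching = True
    pagenos = set()                     # An empty set means every page is processed

    # Feed each page through the interpreter; the text accumulates in retstr
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text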

In [99]:
# PRINCIPLES OF POLITICAL ECONOMY - Thomas Malthus

if os.path.isfile('PPETMedit.txt'):
    print("File already present")
else:
    PPETM = pdf2textConverter('PPETM.pdf')
    start = PPETM.find("Memoir of Robert Malthus")
    closing = "can only obtain it by charity."
    end = PPETM.find(closing) + len(closing)    # Keep the closing sentence itself
    PPETM = PPETM[start:end]
    with open('PPETMedit.txt', 'w') as target:
        target.write(PPETM)

In [100]:
start,end

(2881, 1035882)

Memoir of Robert Malthus.
Principles of Political Economy.
Introduction.
Book I
Chapter I.: Of the Definitions of Wealth and of Productive Labour.
Section I.—: On the Definitions of Wealth.
Section II.—: On Productive Labour.
Chapter II.: On the Nature, Causes, and Measures of Value.
Section I.—: On the Different Sorts of Value.
Section II.—: Of Demand and Supply As They Affect Exchangeable Value.
Section III.—: Of the Cost of Production As Affected By the Demand and

Supply, and On the Mode of Representing Demand.

Section IV.—: Of the Labour Which Has Been Employed On a Commodity

Considered As a Measure of Its Exchangeable Value.

Section V.—: Of the Labour Which a Commodity Will Command, Considered

As a Measure of Value In Exchange.

Section VI.—: On the Practical Application of the Measure of Value, and Its

Section VII.—: On the Variations In the Value of Money In the Same, and

General Use and Advantages.

Different Countries.

Chapter III.: Of the Rent of Land.
Section I.—: Of