# Comparing Gibbon and Hume
Micah D. Saxton

## Getting started

### Using Google's Colaboratory Notebooks
The webpage you are looking at is an example of a Colab notebook. Notebooks are a convenient way to write and execute Python code. Google's Colab notebooks provide the extra benefit of installing packages and running code on the cloud, rather than on your own CPU.

### First steps:
Navigate to "File" and select "Save a copy in Drive."
Navigate to "Edit", select "Notebook settings" and make sure that "Runtime type" is set to "Python 3."

### Creating code and text cells
Colab notebooks are divided into cells which can contain either text or Python code. Although I have created all the cells we will be using for this workshop, it may be helpful to learn how to add cells of your own.

If you hover your mouse at the top or bottom of an already existing cell, you will have an option of adding a new code or text cell. Additionally, you can select the three dots on the right side of a cell for more options.

### Running code cells
There are two ways to run code cells:

Click the "play" button on the left side of the code cell
Press SHIFT+RETURN (or SHIFT+ENTER)

## Set up

In [1]:
# installations
!pip install scattertext



In [2]:
# imports
import nltk
from nltk import word_tokenize
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import spacy
import scattertext as st
from IPython.core.display import HTML

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\msaxto01\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\msaxto01\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
!python -m spacy download en_core_web_sm

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
nlp = spacy.load('en_core_web_sm')
nlp.max_length=2000000  # current longest: 2005761
nlp.disable_pipes('ner', 'tagger')

['ner', 'tagger']

## Pre-process

In [5]:
gibbon_corpus = PlaintextCorpusReader('./gibbon/gibbon_decline_and_fall/', '.*\.txt')
hume_corpus = PlaintextCorpusReader('./hume/hume-history-of-england/', '.*\.txt')

In [21]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [43]:
def preprocess(nltk_corpus):
    results = []
    for fileid in nltk_corpus.fileids()[:1]:  # testing first 6
        tokens_list = nltk_corpus.words(fileid)
        print(tokens_list[:10])  # TEST
        tokens_list_no_stops = [token for token in tokens_list if token.lower in stopwords.words('english')]
        print(tokens_list_no_stops[:10])  # TEST
        tokens_list_no_punct = [token for token in tokens_list if token.isalpha()]
        print(tokens_list_no_punct[:10])  # TEST
        tokens_string = ' '.join(tokens_list_no_punct)
        tokens_string = tokens_string.lower()
        results.append(tokens_string)
    return results
    

In [44]:
gibbon_strings = preprocess(gibbon_corpus)

['The', 'extent', 'and', 'military', 'force', 'of', 'the', 'Roman', 'empire', ',']
['The', 'extent', 'and', 'military', 'force', 'of', 'the', 'Roman', 'empire', ',']
['The', 'extent', 'and', 'military', 'force', 'of', 'the', 'Roman', 'empire', 'in']


In [36]:
gibbon_strings[0]

'the extent and military force of the roman empire in the age of the antonines introduction in the second century of the christian era the empire of rome comprehended the fairest part of the earth and the most civilised portion of mankind the frontiers of that extensive monarchy were guarded by ancient renown and disciplined valour the gentle but powerful influence of laws and manners had gradually cemented the union of the provinces their peaceful inhabitants enjoyed and abused the advantages of wealth and luxury the image of a free constitution was preserved with decent reverence the roman senate appeared to possess the sovereign authority and devolved on the emperors all the executive powers of government during a happy period a d of more than fourscore years the public administration was conducted by the virtue and abilities of nerva trajan hadrian and the two antonines it is the design of this and of the two succeeding chapters to describe the prosperous condition of their empire 

In [24]:
for s in gibbon_strings:
    print(len(s))

53774
60815
48406
40592
47521
82242
54229
32983
46975
88524
67300
64182
130790
97667
140787
247322
108827
84342
85310
76518
125401
69184
83674
99361
133513
129636
1791419
51003
47439
88322
2912040
68129
42067
61284
65669
102822
75565
108874
60688
124435
145575
92941
100384
151977
73447
108667
134763
167947
224030
159989
11062
109395
86391
34525
59798
1049462
61972
118300
82463
90890
80438
61966
53403
74217
75235
84197
47877
85675
73842
139867
20919


In [15]:
hume_strings = preprocess(hume_corpus)

## ScatterText
[ScatterText](https://github.com/JasonKessler/scattertext) is a python library used to visually compare texts according to two categories.

**Technical note**: Due to the large corpora we will be comparing, I have made adjustments to [spaCy](https://spacy.io/) to reduce processing time.

In [25]:
gibbon_df = pd.DataFrame(data={'author': 'Gibbon', 'text': gibbon_strings})
hume_df = pd.DataFrame(data={'author': 'Hume', 'text': hume_strings})
author_df = gibbon_df.append(hume_df)

In [26]:
author_df.head

<bound method NDFrame.head of     author                                               text
0   Gibbon  the extent and military force of the roman emp...
1   Gibbon  of the union and internal prosperity of the ro...
2   Gibbon  of the constitution of the roman empire in the...
3   Gibbon  the cruelty follies and murder of commodus ele...
4   Gibbon  public sale of the empire to didius julianus b...
..     ...                                                ...
68    Hume  the english nation ever since fatal league fra...
69    Hume  the king observing whole nation concurred firs...
70    Hume  when cabal entered mysterious alliance france ...
71    Hume  the first act james reign assemble privy counc...
72    Hume  while every motive civil religious concurred a...

[144 rows x 2 columns]>

In [18]:
author_corpus = st.CorpusFromPandas(author_df,
                                    category_col='author',
                                    text_col='text',
                                    nlp=nlp,
                                    ).build()





























































ValueError: [E088] Text of length 2005761 exceeds maximum of 2000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [None]:
html = st.produce_scattertext_explorer(author_corpus, category='Gibbon',
                                       category_name='Gibbon',
                                       not_category_name='Hume',
                                       width_in_pixels=900)
HTML(html)