# Pre-Processing Plato's Dialogues
## Iris Wu (iw5hte@virginia.edu) DS 5001 Spring 2023

## End goal of this notebook:
Convert the collection from its source format (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2), then annotate these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).

### Setting up necessary tools:

Importing useful packages -

In [205]:
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk
import plotly.express as px
from textparser import TextParser

Defining useful filepaths for reading and outputting data -

In [206]:
source_files = 'data'
out_path = 'data/output/plato'

Defining useful lists and patterns corresponding to each document for later preprocessing -

In [207]:
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

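ohco_pat_list pairs each book ID with the regular expression that marks its chapter boundaries. Most of Plato's dialogues are undivided, or at least not divided in any sensible fashion, so each is kept as one big chunk of text that starts at the "PERSONS OF THE DIALOGUE:" line. Others, like The Republic, are divided into books. beginning_pat is a dictionary that maps each book ID to a pair of clip patterns: the closing words of Benjamin Jowett's introduction and the marker where the text ends. These are used to strip Jowett's superfluous commentary and introductions from each document.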
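As a minimal sketch of how a milestone pattern behaves, the snippet below tests the BOOK pattern against a few made-up lines. The sample lines are hypothetical and not quoted from the source files, and it is an assumption here that the parser checks the pattern line by line.

```python
import re

# Same building block as the notebook: one or more Roman-numeral characters.
roman = '[IVXLCM]+'
book_pat = re.compile(rf"^\s*BOOK\s+{roman}\.\s*$")

# Hypothetical sample lines, for illustration only.
sample_lines = [
    "BOOK I.",
    "  BOOK IV.  ",
    "BOOK X. continues the argument",
    "In Book I. we learn",
]

# Only lines that consist solely of a BOOK heading count as milestones.
milestones = [line for line in sample_lines if book_pat.search(line)]
print(milestones)
```

Anchoring the pattern with `^` and `$` keeps a passing mention of a book inside running prose from being treated as a chapter break.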

In [210]:
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"

ohco_pat_list = [
    
    (1676,   rf"PERSONS OF THE DIALOGUE:"),
    (1677,   rf"PERSONS OF THE DIALOGUE:"),
    (1600,   rf"PERSONS OF THE DIALOGUE:"),
    (1656, rf"APOLOGY"),
    (1580,   rf"PERSONS OF THE DIALOGUE:"),
    (1616,   rf"PERSONS OF THE DIALOGUE:"),
    (1571,   rf"PERSONS OF THE DIALOGUE:"),
    (1657,   rf"PERSONS OF THE DIALOGUE:"),
    (1681,   rf"PERSONS OF THE DIALOGUE:"),
    (1598,   rf"PERSONS OF THE DIALOGUE:"),
    (1642,   rf"PERSONS OF THE DIALOGUE:"),
    (1672,   rf"PERSONS OF THE DIALOGUE:"),
    (1635,   rf"PERSONS OF THE DIALOGUE:"),
    (1584,   rf"PERSONS OF THE DIALOGUE:"),
    (1673,   rf"PERSONS OF THE DIALOGUE:"),
    (1682,   rf"PERSONS OF THE DIALOGUE:"),
    (1579,   rf"PERSONS OF THE DIALOGUE:"),
    (1643,   rf"PERSONS OF THE DIALOGUE:"),
    (1687,   rf"PERSONS OF THE DIALOGUE:"),
    (1658,   rf"PERSONS OF THE DIALOGUE:"),
    (1636,   rf"PERSONS OF THE DIALOGUE:"),
    (1744,   rf"PERSONS OF THE DIALOGUE:"),
    (1591,   rf"PERSONS OF THE DIALOGUE:"),
    (1735,   rf"PERSONS OF THE DIALOGUE:"),
    (1738,   rf"PERSONS OF THE DIALOGUE:"),
    (1726,   rf"PERSONS OF THE DIALOGUE:"),
    (1572,   rf"^Section\s+\d+.$"),
    (1497,  rf"^\s*BOOK\s+{roman}\.\s*$"),
    (1750,  rf"^\s*BOOK\s+{roman}\.\s*$")
    
]

beginning_pat = {
    1571 : ["sense of the artistic difficulty of the design, cannot be determined.", r"\*\*\*\s*END OF"],
    1572 : ["or anticipates the discoveries of modern science.", r"\*\*\*\s*END OF"],
    1580 : ["rather to belong to a later stage of the philosophy of Plato.", r"\*\*\*\s*END OF"],
    1676 : [r"\(see Appendix I above\)", r"\*\*\*\s*END OF"],
    1616 : ["to the latter work the author of this Essay is largely indebted", r"\*\*\*\s*END OF"],
    1677 : ["century before Christ.", r"\*\*\*\s*END OF"],
    1656 : ["the eyes of the Athenian public.", r"\*\*\*\s*END OF"],
    1681 : ["an imitator of Plato.", r"\*\*\*\s*END OF"],
    1598 : ["assigning to the Euthydemus any other position in the series.", r"\*\*\*\s*END OF"],
    1642 : ["trial or the reverse, can any evidence of the date be obtained.", r"\*\*\*\s*END OF"],
    1672 : ["daily life are not overlooked.", r"\*\*\*\s*END OF"],
    1635 : ["this truly Platonic little work is not a forgery of later times.", r"\*\*\*\s*END OF"],
    1584 : ["could not have been a young man at any time after the battle of Delium.", r"\*\*\*\s*END OF"],
    1673 : ["sufficient reasons for doubting the genuineness of the work.", r"\*\*\*\s*END OF"],
    1579 : ["Friendship; Cic. de Amicitia.", r"\*\*\*\s*END OF"],
    1682 : ["Platonic writings.", r"\*\*\*\s*END OF"],
    1643 : ["another.", r"\*\*\*\s*END OF"],
    1750 : ["THE PREAMBLE.", r"\*\*\*\s*END OF"],
    1591 : ["elements of human nature are reconciled.", r"\*\*\*\s*END OF"],
    1687 : ["but deeply rooted in history and in the human", r"\*\*\*\s*END OF"],
    1636 : ["the fear that literature will ever die out.", r"\*\*\*\s*END OF"],
    1658 : ["linger among critical uncertainties.", r"\*\*\*\s*END OF"],
    1744 : ["'spectator of all time and of all existence'?", r"\*\*\*\s*END OF"],
    1735 : ["'fragments of the great banquet' of Hegel.", r"\*\*\*\s*END OF"],
    1738 : ["be reunited with the great body of the Platonic writings.", r"\*\*\*\s*END OF"],
    1600 : ["together in a series the memorials of the life of Socrates.", r"\*\*\*\s*END OF"],
    1497 : ["introduced in the Timaeus.", r"\*\*\*\s*END OF"],
    1726 : ["opportunity of learning.", r"\*\*\*\s*END OF"],
    1657 : ["occur in Plato.", r"End of this Project Gutenberg Etext of Crito, by Plato"]
}
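The line `self.LINES.line_str.str.contains(start_pat, regex=True)` that surfaces in the tokenizer output later in this notebook suggests that clip patterns are applied as regular expressions. Under that assumption, a phrase containing parentheses, such as the one used for book 1676, is read as a capture group rather than as literal text. The sketch below shows the difference using plain `re`; the two sample strings are hypothetical.

```python
import re

# A clip phrase that contains regex metacharacters (parentheses).
phrase = "(see Appendix I above)"

# Hypothetical lines, for illustration only.
with_parens = "as has been said (see Appendix I above)"
without_parens = "please see Appendix I above for details"

# Used as a raw regex, the parentheses form a group and match no literal characters.
raw_hits = [bool(re.search(phrase, s)) for s in (with_parens, without_parens)]

# re.escape turns every metacharacter into a literal.
escaped = re.escape(phrase)
escaped_hits = [bool(re.search(escaped, s)) for s in (with_parens, without_parens)]

print(raw_hits)      # both lines match
print(escaped_hits)  # only the line with literal parentheses matches
```

Escaping matters most for short phrases, where an unintended looser match could clip the text at the wrong line.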

Functions to be used in processing - 

The function below tokenizes every document in the library. It is adapted, with a few modifications, from Professor Alvarado's function in Module 4 and relies on his TextParser class (textparser.py).

In [209]:
def tokenize_collection(LIB, clip_pats):

    books = []
    for book_id in LIB.index:

        # Announce
        print("Tokenizing", book_id, LIB.loc[book_id].raw_title)

        # Define vars
        chap_regex = LIB.loc[book_id].chap_regex
        ohco_pats = [('chap', chap_regex, 'm')]
        src_file_path = LIB.loc[book_id].source_file_path

        # Create object
        text = TextParser(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats[book_id], use_nltk=True)

        # Define parameters
        text.verbose = True
        text.strip_hyphens = True
        text.strip_whitespace = True

        # Parse
        text.import_source().parse_tokens()

        # Name things
        text.TOKENS['book_id'] = book_id
        text.TOKENS = text.TOKENS.reset_index().set_index(['book_id'] + text.OHCO)

        # Add to list
        books.append(text.TOKENS)
        
    # Combine into a single dataframe
    CORPUS = pd.concat(books).sort_index()

    # Clean up
    del(books)
    del(text)
        
    print("Done")
        
    return CORPUS

The function below gathers a token table into a document table at a chosen OHCO level (as defined in the OHCO list above). Passing ohco_level=4, for example, produces one row per sentence. This function also draws on Professor Alvarado's Module 4 code.

In [212]:
def gather(TOKENS, ohco_level):
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.reset_index().groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df
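To make the behaviour of gather concrete, here is a small sketch that runs it on a toy token table. The five-token table is hypothetical; the function body is copied from the cell above so that the example runs on its own.

```python
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

def gather(TOKENS, ohco_level):
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.reset_index().groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df

# Hypothetical toy token table: one book, one chapter, one paragraph, two sentences.
rows = [
    (1, 1, 0, 0, 0, 'I'),
    (1, 1, 0, 0, 1, 'went'),
    (1, 1, 0, 0, 2, 'down'),
    (1, 1, 0, 1, 0, 'Very'),
    (1, 1, 0, 1, 1, 'true'),
]
TOKENS = pd.DataFrame(rows, columns=OHCO + ['token_str']).set_index(OHCO)

sents = gather(TOKENS, 4)   # sentence level, column named 'sent_str'
paras = gather(TOKENS, 3)   # paragraph level, column named 'para_str'
print(sents)
print(paras)
```

The level number is simply how many OHCO columns are kept in the group key, so lower numbers yield longer strings and fewer rows.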

Importing the data and creating the LIB table -

In [213]:
source_file_list = sorted(glob(f"{source_files}/*.*"))
source_file_list

['data/ALCIBIADES_I-pg1676.txt',
 'data/ALCIBIADES_II-pg1677.txt',
 'data/APOLOGY-pg1656.txt',
 'data/CHARMIDES-pg1580.txt',
 'data/CRATYLUS-pg1616.txt',
 'data/CRITIAS-pg1571.txt',
 'data/CRITO-pg1657.txt',
 'data/ERYXIAS-pg1681.txt',
 'data/EUTHYDEMUS-pg1598.txt',
 'data/EUTHYPHRO-pg1642.txt',
 'data/GORGIAS-pg1672.txt',
 'data/ION-pg1635.txt',
 'data/LACHES-pg1584.txt',
 'data/LAWS-pg1750.txt',
 'data/LESSER_HIPPIAS-pg1673.txt',
 'data/LYSIS-pg1579.txt',
 'data/MENEXENUS-pg1682.txt',
 'data/MENO-pg1643.txt',
 'data/PARMENIDES-pg1687.txt',
 'data/PHAEDO-pg1658.txt',
 'data/PHAEDRUS-pg1636.txt',
 'data/PHILEBUS-pg1744.txt',
 'data/PROTAGORAS-pg1591.txt',
 'data/SOPHIST-pg1735.txt',
 'data/STATESMAN-pg1738.txt',
 'data/SYMPOSIUM-pg1600.txt',
 'data/THEAETETUS-pg1726.txt',
 'data/THE_REPUBLIC-pg1497.txt',
 'data/TIMAEUS-pg1572.txt']

In [214]:
book_data = []
for source_file_path in source_file_list:
    book_id = int(source_file_path.split('-')[-1].split('.')[0].replace('pg',''))
    book_title = source_file_path.split('/')[-1].split('-')[0].replace('_', ' ')
    book_data.append((book_id, source_file_path, book_title))
LIB = pd.DataFrame(book_data, columns=['book_id','source_file_path','raw_title'])\
    .set_index('book_id').sort_index()
LIB

book_id,source_file_path,raw_title
1497,data/THE_REPUBLIC-pg1497.txt,THE REPUBLIC
1571,data/CRITIAS-pg1571.txt,CRITIAS
1572,data/TIMAEUS-pg1572.txt,TIMAEUS
1579,data/LYSIS-pg1579.txt,LYSIS
1580,data/CHARMIDES-pg1580.txt,CHARMIDES
1584,data/LACHES-pg1584.txt,LACHES
1591,data/PROTAGORAS-pg1591.txt,PROTAGORAS
1598,data/EUTHYDEMUS-pg1598.txt,EUTHYDEMUS
1600,data/SYMPOSIUM-pg1600.txt,SYMPOSIUM
1616,data/CRATYLUS-pg1616.txt,CRATYLUS
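The chained split calls work for the file names listed above. As an alternative sketch, a single regular expression can validate the name and extract both fields at once. The naming scheme `TITLE-pg<ID>.txt` is inferred from the file listing, and `parse_source_path` is a hypothetical helper, not part of the notebook.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the file listing: TITLE_WORDS-pg<ID>.txt
FILENAME_PAT = re.compile(r"^(?P<title>.+)-pg(?P<book_id>\d+)$")

def parse_source_path(path):
    """Return (book_id, title) for a path like 'data/THE_REPUBLIC-pg1497.txt'."""
    m = FILENAME_PAT.match(Path(path).stem)
    if m is None:
        # Fail loudly instead of producing a wrong ID from an unexpected name.
        raise ValueError(f"Unexpected file name: {path}")
    return int(m.group('book_id')), m.group('title').replace('_', ' ')

print(parse_source_path('data/THE_REPUBLIC-pg1497.txt'))
print(parse_source_path('data/ALCIBIADES_II-pg1677.txt'))
```

Because the title group is greedy, a title that itself contained a hyphen would still be captured whole, which the split-based version would not handle.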


In [215]:
LIB['chap_regex'] = LIB.index.map(pd.Series({x[0]:x[1] for x in ohco_pat_list}))
LIB

book_id,source_file_path,raw_title,chap_regex
1497,data/THE_REPUBLIC-pg1497.txt,THE REPUBLIC,^\s*BOOK\s+[IVXLCM]+\.\s*$
1571,data/CRITIAS-pg1571.txt,CRITIAS,PERSONS OF THE DIALOGUE:
1572,data/TIMAEUS-pg1572.txt,TIMAEUS,^Section\s+\d+.$
1579,data/LYSIS-pg1579.txt,LYSIS,PERSONS OF THE DIALOGUE:
1580,data/CHARMIDES-pg1580.txt,CHARMIDES,PERSONS OF THE DIALOGUE:
1584,data/LACHES-pg1584.txt,LACHES,PERSONS OF THE DIALOGUE:
1591,data/PROTAGORAS-pg1591.txt,PROTAGORAS,PERSONS OF THE DIALOGUE:
1598,data/EUTHYDEMUS-pg1598.txt,EUTHYDEMUS,PERSONS OF THE DIALOGUE:
1600,data/SYMPOSIUM-pg1600.txt,SYMPOSIUM,PERSONS OF THE DIALOGUE:
1616,data/CRATYLUS-pg1616.txt,CRATYLUS,PERSONS OF THE DIALOGUE:


Applying the tokenize_collection function to make the token table -

In [216]:
CORPUS = tokenize_collection(LIB, beginning_pat)
CORPUS

Tokenizing 1497 THE REPUBLIC
Importing  data/THE_REPUBLIC-pg1497.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^\s*BOOK\s+[IVXLCM]+\.\s*$
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 1571 CRITIAS
Importing  data/CRITIAS-pg1571.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone PERSONS OF THE DIALOGUE:
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 1572 TIMAEUS
Importing  data/TIMAEUS-pg1572.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^Section\s+\d+.$
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model


  start = self.LINES.line_str.str.contains(start_pat, regex=True)


Parsing OHCO level 3 token_num by NLTK model
Tokenizing 1677 ALCIBIADES II
Importing  data/ALCIBIADES_II-pg1677.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone PERSONS OF THE DIALOGUE:
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 1681 ERYXIAS
Importing  data/ERYXIAS-pg1681.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone PERSONS OF THE DIALOGUE:
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 1682 MENEXENUS
Importing  data/MENEXENUS-pg1682.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone PERSONS OF THE DIALOGUE:
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by

book_id,chap_id,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
1497,1,1,0,0,"(I, PRP)",PRP,I,i
1497,1,1,0,1,"(went, VBD)",VBD,went,went
1497,1,1,0,2,"(down, RB)",RB,down,down
1497,1,1,0,3,"(yesterday, NN)",NN,yesterday,yesterday
1497,1,1,0,4,"(to, TO)",TO,to,to
...,...,...,...,...,...,...,...,...
1750,12,127,0,5,"(EBook, NNP)",NNP,EBook,ebook
1750,12,127,0,6,"(of, IN)",IN,of,of
1750,12,127,0,7,"(Laws,, NNP)",NNP,"Laws,",laws
1750,12,127,0,8,"(by, IN)",IN,by,by


Adding more metadata to the LIB table, such as book length and the number of chapters -

In [189]:
LIB['book_len'] = CORPUS.groupby('book_id').term_str.count()
LIB['n_chaps'] = CORPUS.reset_index()[['book_id','chap_id']]\
    .drop_duplicates()\
    .groupby('book_id').chap_id.count()
LIB

book_id,source_file_path,raw_title,chap_regex,book_len,n_chaps
1497,data/THE_REPUBLIC-pg1497.txt,THE REPUBLIC,^\s*BOOK\s+[IVXLCM]+\.\s*$,118510,10
1571,data/CRITIAS-pg1571.txt,CRITIAS,PERSONS OF THE DIALOGUE:,6792,1
1572,data/TIMAEUS-pg1572.txt,TIMAEUS,^Section\s+\d+.$,69918,8
1579,data/LYSIS-pg1579.txt,LYSIS,PERSONS OF THE DIALOGUE:,9188,1
1580,data/CHARMIDES-pg1580.txt,CHARMIDES,PERSONS OF THE DIALOGUE:,10751,1
1584,data/LACHES-pg1584.txt,LACHES,PERSONS OF THE DIALOGUE:,10286,1
1591,data/PROTAGORAS-pg1591.txt,PROTAGORAS,PERSONS OF THE DIALOGUE:,22998,1
1598,data/EUTHYDEMUS-pg1598.txt,EUTHYDEMUS,PERSONS OF THE DIALOGUE:,15882,1
1600,data/SYMPOSIUM-pg1600.txt,SYMPOSIUM,PERSONS OF THE DIALOGUE:,22251,1
1616,data/CRATYLUS-pg1616.txt,CRATYLUS,PERSONS OF THE DIALOGUE:,23939,1


Extracting a vocab table with annotations for the stopwords and stemming - 

In [217]:
VOCAB = CORPUS.term_str.value_counts().to_frame('n').sort_index()
VOCAB.index.name = 'term_str'
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1
VOCAB['stop'] = VOCAB.index.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
VOCAB

term_str,n,stop
,95,0
1,38,0
10,1,0
100,13,0
10000,1,0
...,...,...
zones,9,0
zopyrus,1,0
zoroaster,1,0
zosin,1,0
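The stopword flag can also be computed without the helper dataframe. The sketch below uses a hypothetical three-term vocabulary and a hand-written stopword set in place of the NLTK list, so it runs without any downloaded corpus data.

```python
import pandas as pd

# Hypothetical miniature vocabulary and stopword set, for illustration only.
VOCAB = pd.DataFrame(
    {'n': [3, 120, 7]},
    index=pd.Index(['justice', 'the', 'virtue'], name='term_str'),
)
stopwords = {'the', 'and', 'of'}

# Flag each term with 1 if it appears in the stopword set, else 0.
VOCAB['stop'] = VOCAB.index.isin(stopwords).astype(int)
print(VOCAB)
```

In the notebook itself, the set would be built from `nltk.corpus.stopwords.words('english')`, which yields the same 0/1 column as the map-and-fillna approach.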


In [218]:
from nltk.stem.porter import PorterStemmer
stemmer1 = PorterStemmer()
VOCAB['stem_porter'] = VOCAB.apply(lambda x: stemmer1.stem(x.name), 1)

from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer("english")
VOCAB['stem_snowball'] = VOCAB.apply(lambda x: stemmer2.stem(x.name), 1)

from nltk.stem.lancaster import LancasterStemmer
stemmer3 = LancasterStemmer()
VOCAB['stem_lancaster'] = VOCAB.apply(lambda x: stemmer3.stem(x.name), 1)
VOCAB

term_str,n,stop,stem_porter,stem_snowball,stem_lancaster
,95,0,,,
1,38,0,1,1,1
10,1,0,10,10,10
100,13,0,100,100,100
10000,1,0,10000,10000,10000
...,...,...,...,...,...
zones,9,0,zone,zone,zon
zopyrus,1,0,zopyru,zopyrus,zopyr
zoroaster,1,0,zoroast,zoroast,zoroast
zosin,1,0,zosin,zosin,zosin


Applying the gather function to make a document table at the sentence level - 

In [203]:
DOC = gather(CORPUS, 4)
DOC

book_id,chap_id,para_num,sent_num,sent_str
1497,1,1,0,I went down yesterday to the Piraeus with Glau...
1497,1,1,1,); and also because I wanted to see in what ma...
1497,1,1,2,I was delighted with the procession of the inh...
1497,1,1,3,When we had finished our prayers and viewed th...
1497,1,1,4,The servant took hold of me by the cloak behin...
...,...,...,...,...
1750,12,121,3,And the state will be perfected and become a w...
1750,12,122,0,"MEGILLUS: Dear Cleinias, after all that has be..."
1750,12,123,0,"CLEINIAS: Very true, Megillus; and you must jo..."
1750,12,124,0,MEGILLUS: I will.


Outputting all the tables as CSV files - 

In [204]:
LIB.to_csv(f'{out_path}-LIB.csv')
VOCAB.to_csv(f'{out_path}-VOCAB.csv')
CORPUS.to_csv(f'{out_path}-CORPUS.csv')
DOC.to_csv(f'{out_path}-DOC.csv')
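Two practical points about these CSV files can be sketched as follows: `to_csv` does not create a missing output folder, and the OHCO index has to be named explicitly when a table is read back. The two-row table is a hypothetical stand-in for CORPUS, and a temporary directory replaces data/output so the example is self-contained.

```python
import os
import tempfile
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

# Hypothetical two-row token table standing in for CORPUS.
CORPUS = pd.DataFrame(
    [(1497, 1, 1, 0, 0, 'I'), (1497, 1, 1, 0, 1, 'went')],
    columns=OHCO + ['token_str'],
).set_index(OHCO)

# A temporary directory stands in for data/output.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'data', 'output')
    os.makedirs(out_dir, exist_ok=True)   # create the folder before writing
    out_path = os.path.join(out_dir, 'plato')

    CORPUS.to_csv(f'{out_path}-CORPUS.csv')
    # Naming the OHCO columns as the index restores the MultiIndex on reload.
    reloaded = pd.read_csv(f'{out_path}-CORPUS.csv', index_col=OHCO)

print(reloaded.index.names)
```

Note that `read_csv` reads empty fields as NaN by default, so the empty-string term in VOCAB would not survive a round trip unchanged unless the reader is told otherwise (for example with `keep_default_na=False`).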