<a href="https://colab.research.google.com/github/kcalizadeh/phil_nlp/blob/main/data_load_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Introduction

A book of philosophy represents an effort to systematically organize one's thought about the world. Using the data from the history of philosophy to classify texts thus enable us to, by proxy, classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological analysis.

This project uses 51 texts spanning 10 schools of philosophical thought. Based on these, we develop classification models, word vectors, and general EDA. This can then be used to understand user's worldviews by comparing them to historical schools of thought. And once we understand a person's worldview, there is no limit to what we can do with that information - from advertising to political campaigning through to self-exploration and therapy.

This notebook contains the first steps of that project, where we load the 51 texts in the corpus, clean them, and then produce and export a dataframe for use in modeling.

### Imports and Mounting Drive

In [2]:
from functions import *

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kcali\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
import spacy.cli
spacy.cli.download("en_core_web_lg")
import en_core_web_lg
nlp = en_core_web_lg.load()

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


### Load the Texts

With the functions loaded, we bring in the various texts. For access to them via Google Drive, use this [link](https://drive.google.com/drive/folders/1OdTQzRboTOozJqX1INJoYljuA4ctttx8?usp=sharing).

In [4]:
# load the texts

## plato
plato_complete = get_text(drive_path + '/phil_txts/plato_complete_works.txt')

# aristotle
aristotle_vol_1 = get_text(drive_path + '/phil_txts/aristotle_complete_works_v1.txt')
aristotle_vol_2 = get_text(drive_path + '/phil_txts/aristotle_complete_works_v2.txt')

## rationalists
spinoza_ethics = get_guten('http://www.gutenberg.org/cache/epub/3800/pg3800.txt')
spinoza_improve_understanding = get_guten('http://www.gutenberg.org/cache/epub/1016/pg1016.txt')
leibniz_theodicy = get_guten('http://www.gutenberg.org/cache/epub/17147/pg17147.txt')
descartes_discourse_method = get_guten('http://www.gutenberg.org/cache/epub/59/pg59.txt')
descartes_meditations = get_text(drive_path + '/phil_txts/descartes_meditations.txt')
malebranche_search_truth = get_text(drive_path + '/phil_txts/malebranche_search_truth.txt')

## empiricists
locke_understanding_1 = get_guten('http://www.gutenberg.org/cache/epub/10615/pg10615.txt')
locke_understanding_2 = get_guten('http://www.gutenberg.org/cache/epub/10616/pg10616.txt')
locke_treatise_gov = get_guten('http://www.gutenberg.org/cache/epub/7370/pg7370.txt')
hume_treatise = get_guten('http://www.gutenberg.org/cache/epub/4705/pg4705.txt')
hume_natural_religion = get_guten('http://www.gutenberg.org/cache/epub/4583/pg4583.txt')
berkeley_treatise = get_guten('http://www.gutenberg.org/cache/epub/4723/pg4723.txt')
berkeley_three_dialogues = get_guten('http://www.gutenberg.org/cache/epub/4724/pg4724.txt')

## german idealism
kant_practical_reason = get_text(drive_path + '/phil_txts/kant_critique_practical_reason.txt')
kant_judgement = get_text(drive_path + '/phil_txts/kant_critique_judgement.txt')
kant_pure_reason = get_text(drive_path + '/phil_txts/kant_pure_reason.txt')
fichte_ethics = get_text(drive_path + '/phil_txts/fichte_system_of_ethics.txt')
hegel_logic = get_text(drive_path + '/phil_txts/hegel_science_of_logic.txt')
hegel_phenomenology = get_text(drive_path + '/phil_txts/hegel_phenomenology_of_spirit.txt')
hegel_right = get_text(drive_path + '/phil_txts/hegel_elements_of_right.txt')

## analytic
russell_problems_of_phil = get_guten('http://www.gutenberg.org/cache/epub/5827/pg5827.txt')
russell_analylsis_of_mind = get_guten('http://www.gutenberg.org/cache/epub/2529/pg2529.txt')
moore_studies = get_guten('http://www.gutenberg.org/files/50141/50141-0.txt')
wittgenstein_tractatus = get_text(drive_path + '/phil_txts/wittgenstein_tractatus.txt')
wittgenstein_investigations = get_text(drive_path + '/phil_txts/wittgenstien_philosophical_investigations.txt')
lewis_papers1 = get_text(drive_path + '/phil_txts/lewis_papers_1.txt')
lewis_papers2 = get_text(drive_path + '/phil_txts/lewis_papers_2.txt')
quine_quintessence = get_text(drive_path + '/phil_txts/quine_quintessence.txt')
popper_science = get_text(drive_path + '/phil_txts/popper_logic_of_science.txt')
kripke_troubles = get_text(drive_path + '/phil_txts/kripke_philosophical_troubles.txt')
kripke_naming = get_text(drive_path + '/phil_txts/kripke_naming_necessity.txt')

## phenomenology
ponty_perception = get_text(drive_path + '/phil_txts/merleau-ponty_phenomenology_of_perception.txt')
husserl_idea_of = get_text(drive_path + '/phil_txts/husserl_idea_of_phenomenology.txt')
husserl_crisis = get_text(drive_path + '/phil_txts/husserl_crisis_of_euro_sciences.txt')
heidegger_being_time = get_text(drive_path + '/phil_txts/heidegger_being_and_time.txt')
heidegger_track = get_text(drive_path + '/phil_txts/heidegger_off_the_beaten_track.txt')

## continental
foucault_order = get_text(drive_path + '/phil_txts/foucault_order_of_things.txt')
foucault_madness = get_text(drive_path + '/phil_txts/foucault_history_of_madness.txt')
foucault_clinic = get_text(drive_path + '/phil_txts/foucault_birth_of_clinic.txt')
derrida_writing = get_text(drive_path + '/phil_txts/derrida_writing_difference.txt')
deleuze_oedipus = get_text(drive_path + '/phil_txts/deleuze_guattari_anti-oedipus.txt')
deleuze_difference = get_text(drive_path + '/phil_txts/deleuze_difference_repetition.txt')

## marxism
marx_kapital = get_text(drive_path + '/phil_txts/marx_kapital.txt')
marx_manifesto = get_text(drive_path + '/phil_txts/marx_manifesto.txt')
lenin_essential = get_text(drive_path + '/phil_txts/lenin_essential_works.txt')

## capitalist economics
smith_wealth = get_guten('http://www.gutenberg.org/files/3300/3300-0.txt')
ricardo_political_economy = get_guten('http://www.gutenberg.org/cache/epub/33310/pg33310.txt')
keynes_employment = get_text(drive_path + '/phil_txts/keynes_theory_of_employment.txt')


In [None]:
# new texts added after the original project's inception

## stoicism


Now we cut out front and end-matter. This needs to be done ad hoc, since there is no consistent marker for it.

In [5]:
# original texts
plato_complete = plato_complete.split('find that an enticing')[1][388:].split('Demeter, whose cult at')[0]

aristotle_vol_1 = aristotle_vol_1.split('1a20-1b9')[1].split('799a16')[0]
aristotle_vol_2 = aristotle_vol_2.split('830a5-830b4')[1].split('1462a5-1462a13')[0]

spinoza_ethics = spinoza_ethics.split('ranslated from the Latin by R.')[1][71:].split('End of the Ethics')[0]
spinoza_improve_understanding = spinoza_improve_understanding.split('Farewell.*')[1][20:].split('End of ')[0]
leibniz_theodicy = leibniz_theodicy.split('appeared in 1710 as the')[1][202:].split('SUMMARY OF THE CON')[0][:-140]
descartes_discourse_method = descartes_discourse_method.split('PREFATORY NOTE')[1][18:].split('End of the Pr')[0]
descartes_meditations = descartes_meditations.split('LETTER')[1][1:].split('AND REPLIES')[0]
malebranche_search_truth = malebranche_search_truth.split("n's Mind and the Use H")[1][64:].split('Beati qui')[0]

locke_understanding_1 = locke_understanding_1.split('2 Dorset Court, 24th of May, 1689')[1][50:].split('End of the Pro')[0][:-30]
locke_understanding_2 = locke_understanding_2.split('1. Man fitted to form articulated Sounds.')[1][4:].split('End of the Pro')[0][:-25]
locke_treatise_gov = locke_treatise_gov.split('now lodged in Christ College, Cambridge.')[1][21:].split('FINIS.')[0]
hume_treatise = hume_treatise.split('ADVERTISEMENT')[1][9:].split('End of Pro')[0][:-14]
hume_natural_religion = hume_natural_religion.split('PAMPHILUS TO HERMIPPUS')[1][6:].split('End of the Pro')[0][:-22]
berkeley_treatise = berkeley_treatise.split('are too apt to condemn an opinion before they rightly')[1][47:].split('End of the Pr')[0][:-22]
berkeley_three_dialogues = berkeley_three_dialogues.split('THE FIRST DIALOGUE')[1][17:].split('End of the Pro')[0][:-22]

kant_practical_reason = kant_practical_reason.split('erner Pluhar an')[1][329:].split('stone of the wi')[0][:-20]
kant_judgement = kant_judgement.split('TO THE FIRST EDITION,* 1790')[1][1:].split('EXPLANATORY NOTES')[0][:-39]
kant_pure_reason = kant_pure_reason.split('Bacon of Verulam')[1][33:].split('(Persius, Satires, iii, 78-9).')[0][:-1]
fichte_ethics = fichte_ethics.split('(“Krause Nachschrift,” 1798/99)')[1][111:].split('Page 345')[0][:-2]
hegel_logic = hegel_logic.split('complete transformati')[1][249:].split('It is a matter of speculation how Hegel would have rev')[0][:-80]
hegel_phenomenology = hegel_phenomenology.split('PREFACE: ON SCIENTIFIC')[1][1:].split('1I Adaptation')[0][:-62]
hegel_right = hegel_right.split('he immediate occasion f')[1][184:].split('I Hegel lectured on the topics in')[0][:-28]

russell_problems_of_phil = russell_problems_of_phil.split('n the following pages')[1].split('BIBLIOGRAPHICAL NOTE')[0]
russell_analylsis_of_mind = russell_analylsis_of_mind.split('H. D. Lewis')[2][21:].split('End of Pro')[0]
moore_studies = moore_studies.split('Aristotelian Society,_ 1919-20.')[1][23:].split('E Wes')[0][:-10]
wittgenstein_tractatus = wittgenstein_tractatus.split('TRACTATUS LOGICO-PHILOSOPHICUS')[1][70:].split('I NDEX')[0][:-8]
wittgenstein_investigations = wittgenstein_investigations.split('catty')[1][787:].split("above', 351")[0]
lewis_papers1 = lewis_papers1.split('The fifteen papers')[1][61:].split('Acquai')[0][:-10]
lewis_papers2 = lewis_papers2.split('Part Four Counterfactuals and Time')[1][17:].split('end p.342')[0]
quine_quintessence = quine_quintessence.split('T R UT H B Y C O N V E N T I O N')[1].split('CREDITS')[0][:-7]
popper_science = popper_science.split('F IRST E NGLISH E DITION, 1959')[1][2:].split('This is the end of the text of the original book.')[0]
kripke_troubles = kripke_troubles.split('apters 2, 3, 7, 10, 11, and 13 are previously unpublish')[1][103:].split('ans, Gareth. 198')[0][:-25]
kripke_naming = kripke_naming.split('xjvdsa')[1][10:].split('hese addenda represe')[0][:-35]

ponty_perception = ponty_perception.split('P REFACE')[1].split('B IBLIOGRAPHY')[0][:-65]
husserl_idea_of = husserl_idea_of.split('LECTUREl')[1][9:].split('Abstraction, ideating, 47, 50, 65')[0][:-10]
husserl_crisis = husserl_crisis.split('§ 1.')[1].split('Appendix X:')[0]
heidegger_being_time = heidegger_being_time.split("AUTHOR'S PREFACE TO THE")[1][25:].split('Not "the" sole way.')[0][:-8]
heidegger_track = heidegger_track.split('translated in several ')[1][15:].split('et-up [dar Gestellj as the uunost obli')[0][:-32]

foucault_order = foucault_order.split('P REFACE')[1]
foucault_madness = foucault_madness.split('ickering simulacra, an')[1][112:].split('Page 591')[0]
foucault_clinic = foucault_clinic.split('iagnostic (Paris, 1962, p.')[1][15:].split('de Sade.')[0][:-33]
derrida_writing = derrida_writing.split('(Flaubert, Preface d la d')[1][10:].split('Reb Derissa')[0]
deleuze_oedipus = deleuze_oedipus.split('xjdsde')[1].split('jajielaks')[0]
deleuze_difference = deleuze_difference.split('Introduction:')[1].split('Plateaus')[0][:-65]

marx_kapital = ((marx_kapital.split('E MAGNITUDE OF VALUE)')[1].split('expropriation of the laborer.')[0])+'expropriation of the laborer.')
marx_manifesto = marx_manifesto.split('page 29')[1].split('Mao')[0][:-15]
lenin_essential = lenin_essential.split('We will now sum up the theoretical')[1].split('SUGGESTIONS FOR FURTHER READING')[0]

smith_wealth = smith_wealth.split('INTRODUCTION AND PLAN OF THE WORK.')[2].split('End of the Project Gutenberg EBook of An Inquiry into the Nat')[0]
ricardo_political_economy = ricardo_political_economy.split('ON VALUE.')[1].split('  FOOTNOTES:')[0]
keynes_employment = keynes_employment.split('GENERAL INTRODUCTION')[1].split('PRINTING ERRORS IN THE FIRST EDITION CORRECTE')[0][:-145]


In [None]:
# new texts

Having isolated the relevant portions of each document, we can now unify all the texts in each school.

In [6]:
# a list of books for each school, then aggregated and entered into a dictionary

## original texts
plato_texts = [plato_complete]
aristotle_texts = [aristotle_vol_1, aristotle_vol_2]
rationalist_texts = [spinoza_ethics, spinoza_improve_understanding, 
                    leibniz_theodicy, descartes_discourse_method, 
                     descartes_meditations, malebranche_search_truth]
empiricist_texts = [locke_treatise_gov, locke_understanding_1, locke_understanding_2, 
                    hume_treatise, hume_natural_religion, berkeley_three_dialogues, 
                    berkeley_treatise]
german_idealist_texts = [kant_practical_reason, kant_judgement, kant_pure_reason, 
                         fichte_ethics, hegel_logic, hegel_phenomenology, hegel_right]
analytic_texts = [russell_analylsis_of_mind, russell_problems_of_phil, 
                  moore_studies, wittgenstein_investigations, wittgenstein_tractatus, 
                  lewis_papers1, lewis_papers2, quine_quintessence, popper_science, 
                  kripke_naming, kripke_troubles]
phenomenology_texts = [ponty_perception, husserl_crisis, 
                       husserl_idea_of, heidegger_being_time, heidegger_track]
continental_texts = [foucault_clinic, foucault_madness, foucault_order, 
                     derrida_writing, deleuze_difference, deleuze_oedipus]
marxist_texts = [marx_kapital, marx_manifesto, lenin_essential]
capitalist_texts = [smith_wealth, ricardo_political_economy, keynes_employment]

# new texts
stoiticism_texts = 


all_texts = plato_texts + aristotle_texts + empiricist_texts + rationalist_texts + analytic_texts + continental_texts + phenomenology_texts + german_idealist_texts + marxist_texts + capitalist_texts + stoiticism_texts
all_texts_string = ' . '.join(all_texts)

text_dict_list = {'plato': plato_texts, 'aristotle': aristotle_texts, 
             'empiricism': empiricist_texts, 'rationalism': rationalist_texts, 
            'german_idealism': german_idealist_texts, 
             'phenomenology': phenomenology_texts, 'analytic': analytic_texts, 
            'continental': continental_texts, 'marxism': marxist_texts,
             'capitalism': capitalist_texts, 'stoicism': stoicism_texts}

text_dict = {}
for school in text_dict_list.keys():
    text_dict[school] = ' . '.join(text_dict_list[school])

### More In-Depth Cleaning

All the previous explorations were done with some basic cleaning methods like removing stopwords. And while we brushed over it thus far, in the process of dealing with the data we encountered a lot of oddities. These include 
- words fused together (e.g., 'aconcept' for 'a concept')
- headers of pages occurring repeatedly in the text
- page numbers and citation numbers
- footnotes, roman numerals, titles of chapters

All these would need to be removed if we are to train a model on the actual content of these thinkers, and especially if we want a neural network to do any kind of predictive work where it will look at full sentences.

The process of dealing with these and getting the data ready for our models has a few steps:
1. develop a general cleaning function that can work for every text (removing roman numerals, for example)
2. examine each text itself and remove the specific headers that are relevant to it
  - look for features that could capture all the footnotes here as well
3. tokenize the text using spacy
4. examine the tokens for unusual patterns 
  - there should be virtually no duplicate sentences
  - we can remove sentences that are too short to mean anything
  - remove sentences that contain terms that must be from footnotes (the author's name should be very rare in the actual text, for example)

Unfortunately many of these steps can only be done ad hoc; there is no real way to know whether and what headers are in a text without examining the files individually. So the process is a bit tedious and time-consuming. Still, when we finish we will have data that is much cleaner and more useful for modeling. 


#### 1. Universal Cleaning Steps

In [23]:
def baseline_clean(to_correct, capitals=True, bracketed_fn=False, odd_words_dict={}):
  # remove utf8 encoding characters and some punctuations
  result = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\xad\x0c6§\\\£\Â*_<>""⎫•{}Γ~]', ' ', to_correct)
  result = re.sub(r'[\u2014\u2013\u2012-]', ' ', result)

  # replace whitespace characters with actual whitespace
  result = re.sub(r'\s', ' ', result)

  # replace odd quotation marks with a standard
  result = re.sub(r'[‘’“”]', "'", result)

  # replace the ﬀ, ﬃ and ﬁ with the appropriate counterparts
  result = re.sub(r'ﬀ', 'ff', result)
  result = re.sub(r'ﬁ', 'fi', result)
  result = re.sub(r'ﬃ', 'ffi', result)

  # remove or standardize some recurring common and meaninless words/phrases
  result = re.sub(r'\s*This\s*page\s*intentionally\s*left\s*blank\s*', ' ', result)
  result = re.sub(r'(?i)Aufgabe\s+', ' ', result)
  result = re.sub(r',*\s+cf\.', ' ', result)

  # some texts have footnotes conveniently in brackets - this removes them all, 
  # with a safety measure for unpaired brackets, and deletes all brackets afterwards
  if bracketed_fn:
    result = re.sub(r'\[.{0,300}\]|{.{0,300}}|{.{0,300}\]|\[.{0,300}}', ' ', result)
  result = re.sub(r'[\[\]{}]', ' ', result)

  # unify some abbreviations
  result = re.sub(r'&', 'and', result)
  result = re.sub(r'\se\.g\.\s', ' eg ', result)
  result = re.sub(r'\si\.e\.\s', ' ie ', result)
  result = re.sub('coroll\.', 'coroll', result)
  result = re.sub('pt\.', 'pt', result)

  # remove roman numerals, first capitalized ones
  result = re.sub(r'\s((I{2,}V*X*\.*)|(IV\.*)|(IX\.*)|(V\.*)|(V+I*\.*)|(X+L*V*I*]\.*))\s', ' ', result)
  # then lowercase
  result = re.sub(r'\s((i{2,}v*x*\.*)|(iv\.*)|(ix\.*)|(v\.*)|(v+i*\.*)|(x+l*v*i*\.*))\s', ' ', result)

  # remove periods and commas flanked by numbers
  result = re.sub(r'\d\.\d', ' ', result)
  result = re.sub(r'\d,\d', ' ', result)

  # remove the number-letter-number pattern used for many citations
  result = re.sub(r'\d*\w{,2}\d', ' ', result)

  # remove numerical characters
  result = re.sub(r'\d+', ' ', result)

  # remove words of 2+ characters that are entirely capitalized 
  # (these are almost always titles, headings, or speakers in a dialogue)
  # remove capital I's that follow capital words - these almost always roman numerals
  # some texts do use these capitalizations meaningfully, so we make this optional
  if capitals:
    result = re.sub(r'[A-Z]{2,}\s+I', ' ', result)
    result = re.sub(r'[A-Z]{2,}', ' ', result)

  # remove isolated colons and semicolons that result from removal of titles
  result = re.sub(r'\s+:\s*', ' ', result)
  result = re.sub(r'\s+;\s*', ' ', result)

  # remove isolated letters (do it several times because strings of isolated letters do not get captured properly)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)
  result = re.sub(r'\s[^aAI\.]\s', ' ', result)

  # remove isolated letters at the end of sentences or before commas
  result = re.sub(r'\s[^aI]\.', '.', result)
  result = re.sub(r'\s[^aI],', ',', result)

  # deal with spaces around periods and commas
  result = re.sub(r'\s+,\s+', ', ', result)
  result = re.sub(r'\s+\.\s+', '. ', result)

  # remove empty parantheses
  result = re.sub(r'(\(\s*\.*\s*\))|(\(\s*,*\s*)\)', ' ', result)

  # reduce multiple periods, commas, or whitespaces into a single one
  result = re.sub(r'\.+', '.', result)
  result = re.sub(r',+', ',', result)
  result = re.sub(r'\s+', ' ', result)

  # deal with isolated problem cases discovered in the data:
  for key in odd_words_dict.keys():
    result = re.sub(r''+key+'', odd_words_dict[key], result)

  return result

This step is relatively easy - all we had to do was use regex to capture and remove the patterns that we needed to remove. 

The next step requires deeper examination of the individual texts, however. 

#### 2. Text-by-Text Cleaning

In this step we will remove headers and other offensive features of specific texts. 

The most common problem the presence of headings that appear at the top of each page in the original book. When converted to a string, these then get interpolated into the text, interrupting the normal flow of sentences (this happens for page numbers and citations as well, but in those cases is much easier to deal with). 

To deal with this, we build a list of the headers for each book and then delete them from the string that represents the book. In the process, we may create some issues if the header is common, so we are careful to only delete when the loss is worth it.

In some cases, of course, the texts are already clean and no extra steps are required.

In [24]:
plato_to_rm = ['Apology', 'Sophist', 'Statesman', 'Symposium', 
                 'Second Alcibiades', 'Rival Lovers', 'Greater Hippias', 
                 'Lesser Hippias', 'Republic', 'Laws', 'Letters', 'Definitions',
                 'On Virtue', 'On Justice', 'Epigrams', 'Translated b.+\.']

aristotle_to_rm = ['Aristotle', 'Book']

descartes_meditations_to_rm = ['Letter of Dedication', 'Preface to the Reader', 
                               'Synopsis', 'First Meditation', 'Second Meditation', 
                               'Third Meditation', 'Fourth Meditation', 'Fifth Meditation', 
                               'Sixth Meditation']

malebranche_search_to_rm = ['Nicolas Malebranche', 'Truth', 'Nicolas',]

locke_gov_to_rm = ['Sect.']

berkeley_dialogues_to_rm = ['PHIL.', 'PHILONOUS.', 'HYL.', 'HYLAS.']

kant_judgement_to_rm = ['Introduction', 'Preface to the First Edition', 
                        'Critique of Aesthetic Judgement', 'Critique of Teleological Judgement',
                        'Analytic of the Beautiful', 'Analytic of the Sublime',
                        'Anaytic of Teleological Judgement', 'Dialectic of Aeshetic Judgement',
                        'Critique of Teleological Judgement', 'Dialectic of Teleological Judgement',
                        'Theory of the Method of Teleological Judgement',
                        '‘First Introduction’ to the Critique of Judgement']

kant_pure_reason_to_rm = ['Introduction', '\s+Section\s.+', '\sDoctrine\s*of\s*Elements\.\s*.+',
                          '\sDoctrine\s*of\s*Method\.\s*.+']                        


fichte_ethics_to_rm = ['Page']

hegel_SoL_to_rm = ['Georg Wilhelm Friedrich Hegel', 'The Science of Logic']

hegel_right_to_rm = ['Preface', 'Philosophy of Right', 'Philosophy ofRight', 
                     'Philosophy ojRight', 'Philosophy oj Right', 
                     'Introduction', 'Abstract Right', 'Ethical Life', 'Ethical Lift',
                     'Morality']

wittgenstein_tract_to_rm = ['tractatus logico-philosophicus']

lewis_papers_1_to_rm = ['Introduction', 'Ontology', 'Holes', 'Anselm and Actuality', 
                        'Counterpart Theory and Quantified Modal Logic', 
                        'Counterparts of Persons and Their Bodies', 'Survival and Identity',
                        'How to Define Theoretical Terms', 'Philosophy of Mind', 
                        'An Argument for the Identity Theory', 'Radical Interpretation',
                        'Mad Pain and Partian Pain', 'Attitudes De Dicto and De Se',
                        'Philosophy of Language', 'Languages and Language',
                        'General Semantics', 'Scorekeeping in a Language Game',
                        'Tensions', 'Truth in Fiction', 'This page intentionally left blank']

lewis_papers_2_to_rm = ['end\sp\.']

quine_quintessence_to_rm = ['(?i)\s+Truth\s+by\s+Convention\s+', '(?i)\s+Two\s+Dogams\s+of\s+Empiricism',
                            '(?i)\s+Two\s+in\s+Retrospect\s+', '(?i)\s+Carnap\s+and\s+Logical\s+Truth',
                            '(?i)\s+Speaking\s+of\s+Objects\s+', '(?i)\s+Reference\s+',
                            '(?i)\s+Translation\s+and\s+Meaning', '(?i)\s+Progress\s+on\s+Two\s+Fronts\s+',
                            '(?i)\s+On\s+What\s+There\s+is\s+', '(?i)\s+the\s+scope\s+and\s+language\s+of\s+science\s+',
                            '(?i)\s+on\s+simple\s+theories\s+of\s+a\s+complex\s+world\s+',
                            '(?i)\s+ontic\s+decision\s+', '(?i)\s+things\s+and\s+their\s+place\s+in\s+theories\s+',
                            "(?i)\s+on\s+Carnap\s*'s\s+views\s+on\s+ontology\s+",
                            '(?i)\s+empistemology\s+naturalized\s+', 
                            "(?i)\s+naturalism\s;\s+or,\s+living\s+within\s+one\s*'s\s+means\s+",
                            '(?i)\s+the\s+nature\s+of\s+natural\s+knowledge\s+',
                            '(?i)\s+five\s+milestones\s+of\s+empiricism\s+',
                            '(?i)\s+on\s+mental\s+entities\s+', '(?i)\s+mind\s+and\s+verbal\s+dispositions\s+',
                            '(?i)\s+confessions\s+of\s+a\s+confirmed\s+extensionalist\s+',
                            '(?i)\s+quantifiers\s+and\s+propositional\s+attitudes\s+',
                            '(?i)\s+intensions\s+revisited\s+', '(?i)\s+reference\s+and\s+modality\s+',
                            '(?i)\s+three\s+grades\s+of\s+modal\s+involvement\s+']

popper_science_to_rm = ['the logic of science', 'a survey of some fundamental problems',
                        'preface', 'on the problem of a theory of scientific method',
                        'some structural components of a theory of experience',
                        'degrees of testability', 'some observations on quantum theory',
                        'corroboration, or how a theory stands up to tests']

kripke_troubles_to_rm = ['Identity and Necessity', 'On Two Paradoxes of Knowledge',
                          'Vacuous Names and Fictional Entities', 'Outline of a Theory of Truth',
                          "Speaker's Reference and Semantic Reference",
                          'A Puzzle about Belief', 'Nozick on Knowledge', 
                          "Russell's Notion of Scope", "Frege's Theory of Sense and Reference",
                          'The First Person', 'Unrestricted Exportation and Some Morals for the Philosophy',
                          'Presupposition and Anaphora', 'A Puzzle about Time and Thought']

ponty_perception_to_rm = ['phenomenology of perception', 'preface',
                          "the 'sensation' as a unit of experience", 
                          "'association' and the 'projection of memories'",
                          "'attention' and 'judgement'", 'the phenomenal field', 
                          'experience and objective thought', 'the body as object and mechanistic psychology',
                          'the experience of the body and classical psychology',
                          "the spatiality of one’s own body and motility",
                          "the synthesis of one's own body", 'the body in its sexual being',
                          'the body as expression, and speech',
                          'theory of the body is already a theory of perception',
                          'sense experience', 'the thing and the natural world',
                          'other selves and the human world']

husserl_crisis_to_rm = ['Part\s+', 'Idealization and the Science of Reality', 
                        'Denial of Scientific Philosophy', 'The Origin of Geometry',
                        'Natural Science and Humanistic Science',
                        'The Vienna Lecture']

heidegger_b_and_t_to_rm = ['\s+Being\s*and\s*Time\s+', 'Int.', 'I.m', 'I.n']                                                  

foucault_order_to_rm = ['the order of things', 'the prose of the world', '\s+classifying\s+',
                        'exchanging', 'the limits of representation',
                        'labour, life, language', 'man and his doubles']

foucault_madness_to_rm = ['\s+Page\s']

foucault_clinic_to_rm = ['\(\(.+\)\)']

deleuze_difference_to_rm = ['Difference and Repetition', 'Difference in Itself',
                            'Repetition for Itself', 'The Image of Thought',
                            'Ideas and the Synthesis of Difference',
                            'Asymmetrical Synthesis of the Sensible', 'Conclusion']

marx_kapital_to_rm = ['http.+', 'Capital\s+Vol\..+']

marx_manifesto_to_rm = ['\s+page\s+\d+']

keynes_employment_to_rm = ['(?i)\s+The\s+General\s+Theory\s+of\s+Employment,*\s+interest\s+and\s+money\s+by\s+john\s+maynard\s+keynes\s+',
                           'Table of Contents \| Previous Chapter \| Next Chapter', 
                           'Chapter\s+\d+']       

do_not_remove_capitals = ['essay concerning human understanding bk 2', 
                          'a treatise of human nature', 
                          'dialogues concerning natural religion', 'three dialogues', 
                          'a treatise concerning the principles of human knowledge']

bracketed_fn = ['critique of practical reason', 'the communist manifesto']                                          

#### 3. Tokenizing and Rendering the Texts as a Dataframe

We now are in a position to apply these methods to each text and return a dataframe for each of them. Although we are interested primarily in the schools of thought in general, it would be convenient and more useful for future projects if we also include the specific authors and titles. 

To prepare for this project, we build a dictionary for each school, so that we can then iterate over a list of dictionaries to create a dataframe for each.

In [25]:
# prepare lists that will be zipped into a dictionary

# texts 
all_texts

# titles
title_list = ["plato - complete works", "aristotle - complete works", "aristotle - complete works",
              'second treatise on government', 'essay concerning human understanding',
              'essay concerning human understanding', 'a treatise of human nature', 
              'dialogues concerning natural religion', 'three dialogues', 
              'a treatise concerning the principles of human knowledge',
              'ethics', 'on the improvement of understanding', 'theodicy',
              'discourse on method', 'meditations', 'the search after truth',
              'the analysis of mind', 'the problems of philosophy', 'philosophical studies',
              'philosophical investigations', 'tractatus logico-philosophicus',
              "lewis - papers", "lewis - papers", 'quintessence', 
              'the logic of scientific discovery', 'naming and necessity', 
              'philosophical troubles', 'the birth of the clinic', 'madness and civilization',
              'the order of things', 'writing and difference', 'difference and repetition',
              'anti-oedipus', 'the phenomenology of perception', 
              'the crisis of the european sciences and phenomenology', 
              'the idea of phenomenology',
              'being and time', 'off the beaten track', 'critique of practical reason', 
              'critique of judgement', 'critique of pure reason', 
              'the system of ethics', 'science of logic', 'the phenomenology of spirit',
              'elements of the philosophy of right', 'kapital', 'the communist manifesto', 
              'essential works of lenin', 'the wealth of nations', 
              'on the principles of political economy and taxation',
              'a general theory of employment, interest, and money']

# authors
author_list = ['plato', 'aristotle', 'aristotle', 'locke', 'locke', 'locke',
               'hume', 'hume', 'berkeley', 'berkeley', 'spinoza', 'spinoza',
               'leibniz', 'descartes', 'descartes', 'malebranche', 'russell', 
               'russell', 'moore', 'wittgenstein', 'wittgenstein', 'lewis', 'lewis',
               'quine', 'popper', 'kripke', 'kripke', 'foucault', 'foucault', 
               'foucault', 'derrida', 'deleuze', 'deleuze', 'merleau-ponty', 
               'husserl', 'husserl', 'heidegger', 'heidegger', 'kant', 'kant',
               'kant', 'fichte', 'hegel', 'hegel', 'hegel', 'marx', 'marx', 'lenin',
               'smith', 'ricardo', 'keynes']

school_list = ['plato', 'aristotle', 'aristotle', 'empiricism', 'empiricism', 
               'empiricism', 'empiricism', 'empiricism', 'empiricism', 'empiricism',
               'rationalism', 'rationalism', 'rationalism', 'rationalism', 
               'rationalism', 'rationalism', 'analytic', 'analytic', 'analytic', 
               'analytic', 'analytic', 'analytic', 'analytic', 'analytic', 'analytic', 
               'analytic', 'analytic', 'continental', 'continental', 'continental', 
               'continental', 'continental', 'continental', 'phenomenology', 
               'phenomenology', 'phenomenology', 'phenomenology', 'phenomenology', 
               'german_idealism', 'german_idealism', 'german_idealism', 'german_idealism',
               'german_idealism', 'german_idealism', 'german_idealism', 'communism', 
               'communism', 'communism', 'capitalism', 'capitalism', 'capitalism']               

# words to remove 
to_rm_list = [plato_to_rm, aristotle_to_rm, aristotle_to_rm, locke_gov_to_rm, 
              [], [], [], [], berkeley_dialogues_to_rm, [], [], [], [], [], 
              descartes_meditations_to_rm, malebranche_search_to_rm, [],
              [], [], [], wittgenstein_tract_to_rm, lewis_papers_1_to_rm, 
              lewis_papers_2_to_rm, quine_quintessence_to_rm, popper_science_to_rm,
              [], kripke_troubles_to_rm, foucault_clinic_to_rm, foucault_madness_to_rm,
              foucault_order_to_rm, [], deleuze_difference_to_rm, [],
              ponty_perception_to_rm, husserl_crisis_to_rm, [], heidegger_b_and_t_to_rm,
              [], [], kant_judgement_to_rm, kant_pure_reason_to_rm, fichte_ethics_to_rm, 
              hegel_SoL_to_rm, [], hegel_right_to_rm, marx_kapital_to_rm, 
              marx_manifesto_to_rm, [], [], [], keynes_employment_to_rm]

# check lengths to make sure all are present
len(to_rm_list), len(all_texts), len(author_list), len(title_list), len(school_list)

(51, 51, 51, 51, 51)

In [26]:
# combine all these into a single list of dictionaries
book_dicts = []
for i in range(0, 51):
  book_dict = {}
  book_dict['author'] = author_list[i].title()
  book_dict['title'] = title_list[i].title()
  book_dict['text'] = all_texts[i]
  book_dict['school'] = school_list[i]
  book_dict['words to remove'] = to_rm_list[i]
  book_dict['remove capitals'] = True
  book_dict['bracketed fn'] = False
  book_dicts.append(book_dict)

# mark the ones with bracketed footnotes 
for book in book_dicts:
  if book['title'] in bracketed_fn:
    book['bracketed fn'] = True

# mark the ones with capitals we want to keep
for book in book_dicts:
  if book['title'] in do_not_remove_capitals:
    book['remove capitals'] = False
  
# check length again to make sure
len(book_dicts)

51

With a dictionary for each text, we are prepared to clean them, build dataframes for each text, and combine them into a master dataframe for all our data.

In [29]:
#@title Oddities Dictionary for Cleaning
# a dictionary of oddities to clean up
odd_words_dict = {'\sderstanding': 'derstanding',
                  '\sforthe\s': ' for the ',
                  '\sject': 'ject',
                  '\sjects': 'jects', 
                  '\sness': 'ness',
                  '\sper\scent\s': ' percent ',
                  '\sper\scent\.': ' percent.',
                  '\sper\scent,': ' percent,',
                  '\wi\son': 'ion',
                  '\spri\sori': ' priori',
                  '\stences\s': 'tences ',
                  '\sprincipleb': ' principle',
                  '\ssciousness': 'sciousness',
                  '\stion': 'tion',
                  '\spri\s': ' pri',
                  '\scluding': 'cluding',
                  '\sdom': 'dom',
                  '\sers': 'ers',
                  '\scritiq\s': ' critique ',
                  '\ssensati\s': ' sensation ',
                  '(?i)\syou\sll': " you'll",
                  '\sI\sll': " I'll",
                  '(?i)\swe\sll': " we'll",
                  '(?i)he\sll': " he'll",
                  '(?i)who\sll': "who'll",
                  '(?i)\sthere\sll\s': " there'll ",
                  '\seduca\s': ' education ',
                  '\slity\s': 'lity ',
                  '\smultaneously\s': 'multaneously ',
                  '\stically\s': 'tically ',
                  '\sDa\ssein\s': ' Dasein ',
                  '(?i)\sthey\sll\s': " they'll ",
                  '(?i)\sin\tum\s': ' in turn ',
                  '\scon~\s': ' con',
                  '\sà\s': ' a ',
                  '\sjor\s': ' for ',
                  '\sluminating\s': 'luminating ',
                  '\sselj\s': ' self ',
                  '\stial\s': 'tial ',
                  '\sversal\s': 'versal ',
                  '\sexis\st': ' exist',
                  '\splauded\s': 'plauded ',
                  '\suiry\s': 'uiry ',
                  '\svithin\s': ' within ',
                  '\soj\s': ' of ',
                  '\sposi\st': ' posit',
                  '\sra\sther\s': ' rather ',
                  '(?i)\sthat\sll\s': " that'll ",
                  '(?i)\sa\sll\s': ' all ',
                  '\so\sther\s': ' other ',
                  '\sra\sther\s': ' rather ',
                  '\snei\sther\s': ' neither ',
                  '\sei\sther\s': ' either ',
                  '\sfur\sther\s': ' further ',
                  '\sano\sther': ' another ',
                  '\sneces\s': ' neces',
                  'u\slar\s': 'ular ',
                  '\sference\s': 'ference ',
                  '(?i)it\sll\s': "it'll ",
                  '\stoge\sther': ' together ',
                  '\sknowledgeb\s': ' knowledge ',
                  'r\stain\s': 'rtain ',
                  'on\stain\s': 'ontain',
                  '(?i)j\sect\s': 'ject',
                  '\sob\sect\s': ' object ',
                  '\sbtle\s': 'btle ',
                  '\snition\s': 'nition ',
                  '\sdering\s': 'dering ', 
                  '\sized\s': 'ized ',
                  '\sther\shand': ' other hand',
                  '\ture\s': 'ture ',
                  '\sabso\sl': ' absol',
                  '\stly\s': 'tly ',
                  '\serty\s': 'erty ',
                  '\sobj\se': ' obj',
                  '\sffiir\s': ' for ',
                  '\sndeed\s': ' indeed ',
                  '\sfonn\s': ' form ',
                  '\snally\s': 'nally ',
                  'ain\sty\s': 'ainty ',
                  'ici\sty\s': 'icity ',
                  '\scog\sni': ' cogni',
                  '\sacc\s': ' acc',
                  '\sindi\svid\sual': ' individual', 
                  '\sintu\sit': ' intuit',
                  'r\sance\s': 'rance ',
                  '\ssions\s': 'sions ',
                  '\sances\s': 'ances ',
                  '\sper\sception\s': ' perception ',
                  '\sse\sries\s': ' series ',
                  '\sque\sries\s': ' queries ',
                  '\sessary\s': 'essary ',
                  '\sofa\s': ' of a ',
                  '\scer\stainty\s': ' certainty ',
                  'ec\stivity\s': 'ectivity ',
                  '\stivity\s': 'tivity ',
                  '\slation\s': 'lation ',
                  '\sir\sr': ' irr',
                  '\ssub\sstance\s': ' substance ',
                  'sec\sond\s': 'second ',
                  '\s\.rv': '',
                  '\story\s': 'tory ',
                  '\sture\s': 'ture ',
                  '\sminate\s': 'minate ',
                  '\sing\s': 'ing ',
                  '\splicity\s': 'plicity ',
                  '\ssimi\slar\s': ' similar ',
                  '\scom\smunity\s': ' community ',
                  '\sitselfa\s': ' itself a ',
                  '\ssimp\s': ' simply ',
                  '\scon\stex': ' contex',
                  '\scon\sseq': ' conseq',
                  '\scon\stai': ' contai',
                  '\sofwhat\s': ' of what ',
                  '\sui\s': 'ui',
                  '\sofan\s': ' of an ',
                  '\saccor\sdance\s': ' accordance ',
                  '\stranscen\sdental\s': ' transcendental ',
                  '\sap\spearances\s': ' appearances ',
                  'e\squences\s': 'equences ',
                  '\sorits\s': ' or its ',
                  '\simma\sn': ' imman',
                  '\seq\sua': ' equa',
                  '\simpl\sied\s': ' implied ',
                  '\sbuta\s': ' but a ',
                  '\sa\snd\s': ' and ',
                  '\sence\s': 'ence ',
                  '\stain\s': 'tain ',
                  '\sunder\sstanding\s': ' understanding ',
                  'i\sence\s': 'ience ',
                  'r\sence\s': 'rence ',
                  '\stical\s': 'tical ',
                  '\sobjectsb\s': ' objects ',
                  '\stbe\s': ' the ',
                  '\smul\st': ' mult',
                  '\sgen\seral\s': ' general ',
                  '\suniver\ssal\s': ' universal ',
                  '\scon\stent\s': ' content ',
                  '\spar\sticular\s': ' particular ',
                  'ver\ssity\s': 'versity ',
                  '\sCritiq\s': ' Critique ',
                  '\sphilo\ssophy\s': ' philosophy ',
                  '\seq\s': ' eq'}

In [30]:
def from_raw_to_df(text_dict):
  nlp.max_length = 9000000
  text = text_dict['text']
  text = remove_words(text, text_dict['words to remove'])
  text = baseline_clean(text, capitals=text_dict['remove capitals'],
                        bracketed_fn=text_dict['bracketed fn'],
                        odd_words_dict=odd_words_dict)
  text_nlp = nlp(text, disable=['ner'])
  text_df = pd.DataFrame(columns=['title', 'author', 'school', 'sentence_spacy'])
  text_df['sentence_spacy'] = list(text_nlp.sents)
  text_df['author'] = text_dict['author']
  text_df['title'] = text_dict['title']
  text_df['school'] = text_dict['school']
  text_df['sentence_str'] = text_df['sentence_spacy'].apply(lambda x: ''.join(list(str(x))))
  return text_df

df = pd.DataFrame(columns=['title', 'author', 'school', 'sentence_spacy', 'sentence_str'])
for book in book_dicts:
  book_df = from_raw_to_df(book)
  df = df.append(book_df, ignore_index=True)

len(df)

361382

In [None]:
#@title 
# ## some code for checking oddities

# # unfortunately it has to be done ad hoc and as they are discovered, 
# # so we will not go too deep into it 
# #
# # the code is commented out to make running the notebook smoother


# word_list = [
# 'tent',
# 'per',
# 'cent'
# 'imma']

# word_checker = pd.DataFrame()
# for word in word_list:
#   word_check_slice = df[(df['sentence_str'].str.contains('\s'+word+'\s'.lower()))].copy()
#   word_check_slice['word'] = word
#   word_checker = word_checker.append(word_check_slice)

# print(len(word_checker))
# print(len(word_list))
# word_checker['word'].value_counts()

# pd.options.display.max_colwidth = 300
# word_checker[word_checker['word']=='con']

And *voila*! There we have it. We can do some quick exploring to make sure everything came out ok, but we are nearing the end of our task!

In [32]:
pd.options.display.max_colwidth = 200
df.sample(10)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str
84134,Aristotle - Complete Works,Aristotle,aristotle,"(For, if, the, points, form, a, series, ,, the, line, will, be, divided, not, at, either, of, the, points, ,, but, between, them, ;, whilst, if, they, are, in, contact, ,, a, line, will, be, the, ...","For if the points form a series, the line will be divided not at either of the points, but between them; whilst if they are in contact, a line will be the place of the single point."
201094,Philosophical Troubles,Kripke,analytic,"(Quine, seems, to, define, ', referentially, transparent, ', contexts, so, as, to, imply, that, coreferential, names, and, definite, descriptions, must, be, interchangeable, salva, veritate, .)",Quine seems to define 'referentially transparent' contexts so as to imply that coreferential names and definite descriptions must be interchangeable salva veritate.
2500,Plato - Complete Works,Plato,plato,"(In, any, way, you, like, ,, said, Socrates, ,, if, you, can, catch, me, and, I, do, not, escape, you, .)","In any way you like, said Socrates, if you can catch me and I do not escape you."
157355,Philosophical Studies,Moore,analytic,"(It, is, not, childishly, obvious, that, I, am, not, judging, it, to, be, part, of, the, surface, of, an, inkstand, ,, as, it, is, that, I, am, not, judging, it, to, be, an, inkstand, a, whole, on...","It is not childishly obvious that I am not judging it to be part of the surface of an inkstand, as it is that I am not judging it to be an inkstand a whole one."
211076,Madness And Civilization,Foucault,continental,"(Meaning, was, no, longer, read, in, an, immediate, perception, ,, and, accordingly, objects, ceased, to, speak, directly, :, between, the, knowledge, that, animated, the, figures, of, objects, an...","Meaning was no longer read in an immediate perception, and accordingly objects ceased to speak directly: between the knowledge that animated the figures of objects and the forms they were transfor..."
28588,Plato - Complete Works,Plato,plato,"(We, alone, could, not, bring, ourselves, to, betray, them, or, swear, the, oath, .)",We alone could not bring ourselves to betray them or swear the oath.
288141,Critique Of Pure Reason,Kant,german_idealism,"(The, human, being, is, one, of, the, appearances, in, the, world, of, sense, ,, and, to, that, extent, also, one, of, the, natural, causes)","The human being is one of the appearances in the world of sense, and to that extent also one of the natural causes"
52093,Aristotle - Complete Works,Aristotle,aristotle,"(And, if, anyone, disputes, whether, something, has, been, deduced, or, not, ,, we, meet, him, by, saying, that, ', that, is, what, a, deduction, is, ', ;, and, if, anyone, says, that, what, it, i...","And if anyone disputes whether something has been deduced or not, we meet him by saying that 'that is what a deduction is'; and if anyone says that what it is to be it has not been deduced we can ..."
205110,Philosophical Troubles,Kripke,analytic,"(The, switch, from, ', green, ', to, ', red, ', becomes, clear, when, the, omitted, material, ,, also, involving, color, blindness, ,, is, read, .)","The switch from 'green' to 'red' becomes clear when the omitted material, also involving color blindness, is read."
245994,The Phenomenology Of Perception,Merleau-Ponty,phenomenology,"(In, order, that, we, may, be, able, to, move, our, body, towards, an, object, ,, the, object, must, first, exist, for, it, ,, our, body, must, not, belong, to, the, realm, of, the, ', in, itself,...","In order that we may be able to move our body towards an object, the object must first exist for it, our body must not belong to the realm of the 'in itself'."


In [33]:
df['school'].value_counts(normalize=True)

analytic           0.162042
aristotle          0.149783
plato              0.133009
german_idealism    0.132193
continental        0.097943
phenomenology      0.084407
rationalism        0.072743
empiricism         0.057817
communism          0.057518
capitalism         0.052546
Name: school, dtype: float64

The texts look more or less ok, though there is still some cleaning to be done. And there is definitely some class imabalance (with 10 schools, ideally they'd each be at 10%). 

That said, the numbers look reasonable enough that we have something we can work with. We can even do some fun stuff like find the average length of a sentence for each school or run other little tests.

### Cleaning the Dataframe 

But before the fun EDA stuff, to ensure that we get good results, we need to clean the dataframe. This will take a few steps:
1. Determine a threshold length and cut the (so-called) sentences that are shorter than that length (this will already eliminate meaningless duplicates like punctuations)
2. Check for words that indicate footnotes (words like 'edition' or 'ibid'); we can then cut these sentences from the data
3. Check for duplicates; there should be few if any duplicates in the dataframe for each school
4. Check for words that indicate other languages so that we can eliminate quotations, citations, or other non-English sentences

#### 1. Deal with Short Sentences

In [34]:
df['sentence_length'] = df['sentence_str'].map(lambda x: len(x))
num_of_short_entries = len(df[df['sentence_length'] < 20])
print(f"there are {num_of_short_entries} so-called sentences with fewer than 20 characters")
df[df['sentence_length'] < 20].sample(5)

there are 29972 so-called sentences with fewer than 20 characters


Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length
174990,Lewis - Papers,Lewis,analytic,"(Which, facts, ?)",Which facts?,12
359564,"A General Theory Of Employment, Interest, And Money",Keynes,capitalism,"(,, .)",",.",2
286703,Critique Of Pure Reason,Kant,german_idealism,(C),C,1
23365,Plato - Complete Works,Plato,plato,"(He, 'd, say, yes, ., ')",He'd say yes. ',15
43433,Plato - Complete Works,Plato,plato,"((, a, ))",(a),3


Sentences with fewer than 20 characters tend to be more or less meaningless, so we will drop them.

In [35]:
df = df.drop(df[df['sentence_length'] < 20].index)
len(df)

331410

#### 2. Look at Words that Indicate Footnotes

Now let's look at footnote-indicator words.

In [36]:
fn_words = ['ch\.', 'bk', 'sect\.', 'div\.', 'cf', 'ibid', 'prop\.', 'Q\.E\.D\.',
            'pt\.', 'coroll\.', 'cf\.']

df['sentence_lowered'] = df['sentence_str'].map(lambda x: x.lower())

fn_df = pd.DataFrame()
for word in fn_words:
  found_word = df[df['sentence_lowered'].str.contains('\s' + word.lower())].copy()
  found_word['word'] = word
  fn_df = fn_df.append(found_word)

len(fn_df)

705

In [37]:
fn_df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered,word
191532,The Logic Of Scientific Discovery,Popper,analytic,"(Given, a, sequence, ,, we, may, construct, a, new, sequence, ,, of, segments, of, ,, in, such, a, way, that, we, Cf, .)","Given a sequence, we may construct a new sequence, of segments of, in such a way that we Cf.",92,"given a sequence, we may construct a new sequence, of segments of, in such a way that we cf.",cf\.
304764,Science Of Logic,Hegel,german_idealism,"(Inasmuch, as, judgment, is, the, concept, as, determinate, ,, the, only, Cf, .)","Inasmuch as judgment is the concept as determinate, the only Cf.",64,"inasmuch as judgment is the concept as determinate, the only cf.",cf\.
305255,Science Of Logic,Hegel,german_idealism,"(If, the, disjunction, of, a, genus, into, species, has, not, yet, attained, this, form, ,, this, is, proof, that, the, disjunction, has, Cf, .)","If the disjunction of a genus into species has not yet attained this form, this is proof that the disjunction has Cf.",117,"if the disjunction of a genus into species has not yet attained this form, this is proof that the disjunction has cf.",cf\.
123578,Ethics,Spinoza,rationalism,"(But, in, nature, (, by, Prop, ., xiv, ., ,, Coroll, ., ))","But in nature (by Prop. xiv., Coroll.)",38,"but in nature (by prop. xiv., coroll.)",coroll\.
114633,A Treatise Of Human Nature,Hume,empiricism,"(This, division, of, the, impressions, is, the, same, with, that, which, I, formerly, made, use, of, Book, I., Part, I., Sect, .)",This division of the impressions is the same with that which I formerly made use of Book I. Part I. Sect.,105,this division of the impressions is the same with that which i formerly made use of book i. part i. sect.,sect\.


Unfortunately, there was too much noise and too many differences in how the sentences were tokenized, so this kind of cleaning did not prove useful. As can be seen above, many of the relevant terms were used in meaningful sentences attributable to the correct authors. 

We were able to tell that 'bk.' was almost never used productively, so we cut that. For others, this was instructive in helping us clean the text and making revisions to the baseline cleaning function.

In [50]:
df = df.drop(df[df['sentence_lowered'].str.contains('\s+bk'.lower())].index)

len(df)

331388

#### 3. Look at cases of Self-Mention by Authors

Another approach is to check for sentences where an author mentions themselves - these are almost always not their own writing. 

In [51]:
self_mention_df = pd.DataFrame()
for author in df['author'].unique():
  self_mention_slice = df[(df['author'] == author) & 
                          (df['sentence_lowered'].str.contains('\s'+author.lower()))].copy()
  self_mention_df = self_mention_df.append(self_mention_slice)

len(self_mention_df)

853

In [52]:
self_mention_df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered
287870,Critique Of Pure Reason,Kant,german_idealism,"(In, the, first, edition, :, proofs, Notes, in, Kant, 's, copy, of, the, first, edition, :, In, the, first, class, of, antinomical, propositions, both, are, false, ,, because, they, say, more, tha...","In the first edition: proofs Notes in Kant's copy of the first edition: In the first class of antinomical propositions both are false, because they say more than is true, namely that there is an a...",229,"in the first edition: proofs notes in kant's copy of the first edition: in the first class of antinomical propositions both are false, because they say more than is true, namely that there is an a..."
37885,Plato - Complete Works,Plato,plato,"(The, classic, example, of, such, an, excursus, is, the, Atlantis, myth, in, Plato, 's, Timaeus, and, Critias, ,, and, there, are, other, examples, in, Alcibiades, ,, Hipparchus, ,, and, probably,...","The classic example of such an excursus is the Atlantis myth in Plato's Timaeus and Critias, and there are other examples in Alcibiades, Hipparchus, and probably in the (now mostly lost) Socratic ...",235,"the classic example of such an excursus is the atlantis myth in plato's timaeus and critias, and there are other examples in alcibiades, hipparchus, and probably in the (now mostly lost) socratic ..."
236466,Anti-Oedipus,Deleuze,continental,"(Where, Nietzsche, grew, progressively, more, isolated, to, the, point, of, madness, ,, Deleuze, and, Guattari, call, for, actions, and, passions, of, a, collective, nature, ,, here, and, now, .)","Where Nietzsche grew progressively more isolated to the point of madness, Deleuze and Guattari call for actions and passions of a collective nature, here and now.",162,"where nietzsche grew progressively more isolated to the point of madness, deleuze and guattari call for actions and passions of a collective nature, here and now."
14707,Plato - Complete Works,Plato,plato,"(Cephalus, is, prominent, in, the, opening, section, of, Plato, 's, which, is, set, in, his, home, in, Piraeus, ,, the, port, of, Athens, .)","Cephalus is prominent in the opening section of Plato's which is set in his home in Piraeus, the port of Athens.",112,"cephalus is prominent in the opening section of plato's which is set in his home in piraeus, the port of athens."
205461,Philosophical Troubles,Kripke,analytic,"((, At, the, end, of, Kripke)",(At the end of Kripke,21,(at the end of kripke


These seem pretty clearly to be notes. We drop them.

In [53]:
for author in df['author'].unique():
  df = df.drop(df[(df['author'] == author) & 
                  (df['sentence_lowered'].str.contains('\s'+author.lower()))].index)

len(df)

330535

#### 4. Deal with Duplicates

Now let's look at how many duplicates we have.

In [55]:
# find the total number of duplicates
len(df['sentence_str'])-len(df['sentence_str'].drop_duplicates())

4691

In [56]:
# get the number of duplicates in each school
for school in df['school'].unique():
  print(school)
  print(len(df.loc[df['school'] == school]['sentence_str']) - 
        len(df.loc[df['school'] == school]['sentence_str'].drop_duplicates()))

plato
291
aristotle
3351
empiricism
14
rationalism
39
analytic
289
continental
97
phenomenology
87
german_idealism
287
communism
84
capitalism
92


In [57]:
doubles_df = pd.concat(g for _, g in df.groupby("sentence_str") if len(g) > 1)
doubles_df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered
66166,Aristotle - Complete Works,Aristotle,aristotle,"(In, his, will, he, left, the, perpetual, copyright, on, his, writings, to, Balliol, College, ,, desiring, that, any, royalties, should, be, invested, and, that, the, income, from, the, investment...","In his will he left the perpetual copyright on his writings to Balliol College, desiring that any royalties should be invested and that the income from the investment should be applied 'in the fir...",335,"in his will he left the perpetual copyright on his writings to balliol college, desiring that any royalties should be invested and that the income from the investment should be applied 'in the fir..."
281373,Critique Of Judgement,Kant,german_idealism,"(First, to, the, Critique, of, Judgement)",First to the Critique of Judgement,34,first to the critique of judgement
100586,Aristotle - Complete Works,Aristotle,aristotle,"(Second, Printing, ,, Fourth, Printing, ,, Contents, Preface, ., ., ., ., ., ., ., .)","Second Printing, Fourth Printing, Contents Preface. . . . . . . .",65,"second printing, fourth printing, contents preface. . . . . . . ."
94221,Aristotle - Complete Works,Aristotle,aristotle,"(The, original, Translation, is, often, paraphrastic, :, some, of, the, translators, used, paraphrase, freely, and, deliberately, ,, attempting, not, so, much, to, English, 's, Greek, as, to, expl...","The original Translation is often paraphrastic: some of the translators used paraphrase freely and deliberately, attempting not so much to English 's Greek as to explain in their own words what he...",274,"the original translation is often paraphrastic: some of the translators used paraphrase freely and deliberately, attempting not so much to english 's greek as to explain in their own words what he..."
209254,The Birth Of The Clinic,Foucault,continental,"(But, what, is, this, order, ?)",But what is this order?,23,but what is this order?


From this it is clear that many of these duplicates are notes, meaninglessly short, or else headings that somehow escaped earlier efforts. Oddly, an enormous number of aristotle's sentences seem to be doubled. Looking at the doubled sentences, this appears to be because similar notes were made in both of the two volumes of the text. 

Let's eliminate the aristotle doubles first, then take another look to see what the others are like. 

In [58]:
len(doubles_df[doubles_df['author'] != 'Aristotle'])

2164

In [59]:
doubles_df[doubles_df['author'] != 'Aristole'].sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered
63748,Aristotle - Complete Works,Aristotle,aristotle,"(I., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., ., .,...",I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....,316,i. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....
274640,Critique Of Practical Reason,Kant,german_idealism,"(the, Grounding, for, the, Metaphysics, of, Morals, ,)","the Grounding for the Metaphysics of Morals,",44,"the grounding for the metaphysics of morals,"
52529,Aristotle - Complete Works,Aristotle,aristotle,"(The, fourth, class, of, alterations, accounts, for, the, majority, of, changes, made, by, the, reviser, .)",The fourth class of alterations accounts for the majority of changes made by the reviser.,89,the fourth class of alterations accounts for the majority of changes made by the reviser.
75781,Aristotle - Complete Works,Aristotle,aristotle,"(Drossaart, Lulofs, excises, the, parenthetical, sentence, .)",Drossaart Lulofs excises the parenthetical sentence.,52,drossaart lulofs excises the parenthetical sentence.
244871,The Phenomenology Of Perception,Merleau-Ponty,phenomenology,"(Lhermitte, ,, L'Image, de, notre, Corps, ,, .)","Lhermitte, L'Image de notre Corps,.",35,"lhermitte, l'image de notre corps,."


Deeper exploration of the duplicates reveals that Kant has a lot of doubles that seem to be authentically from his texts. Plato also has several duplicate sentences, but these are almost all short phrases from the dialogues ('of course, yes' and that kind of thing). 

To preserve the Kant, while also removing the irrelevant duplicates, we will remove both copies of all duplicates from texts other than the Kant's *Critique of Pure Reason*. For that text, we will remove the short duplicates and keep one copy of the longer ones, thus preserving the meaningful sentences. 

In [60]:
non_kant_indexes = df[(df['title'] != 'critique of pure reason') & 
                       (df['sentence_str'].duplicated(keep=False))].index
kant_short_indexes = df[(df['title'] == 'critique of pure reason') &
                        (df['sentence_str'].duplicated(keep=False)) &
                        (df['sentence_length'] < 40)].index
kant_long_indexes = df[(df['title'] == 'critique of pure reason') &
                        (df['sentence_str'].duplicated(keep='first')) &
                        (df['sentence_length'] >= 40)].index

indexes_to_drop = [non_kant_indexes, kant_short_indexes, kant_long_indexes]
for index in indexes_to_drop:
  df = df.drop(index)

len(df)

324817

#### Check for Foreign-Language Sentences

With this cleared up, let's do a couple quick tests to check for other languages in our texts. We use 'der' to check for German, since it is a common article and is not an English word. Similarly 'il' can be used to check for French. 

In [61]:
(df[df['sentence_str'].str.contains('\sder\s')]).sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered
203863,Philosophical Troubles,Kripke,analytic,"(Grundlagen, der, Mathematik, ,, Vol, .)","Grundlagen der Mathematik, Vol.",31,"grundlagen der mathematik, vol."
273983,Critique Of Practical Reason,Kant,german_idealism,"(Reading, ,, with, Erich, Adickes, and, with, Paul, Natorp, in, the, Akademie, edition, ,, des, letzteren, for, der, letzteren, ,, which, would, refer, (, implicitly, ), to, practical, reason, .)","Reading, with Erich Adickes and with Paul Natorp in the Akademie edition, des letzteren for der letzteren, which would refer (implicitly) to practical reason.",158,"reading, with erich adickes and with paul natorp in the akademie edition, des letzteren for der letzteren, which would refer (implicitly) to practical reason."
192536,The Logic Of Scientific Discovery,Popper,analytic,"(Thirring, ,, Die, Wandlung, des, Begriffssystems, der, Physik, (, essay, in, Krise, und, Neuaufbau, in, den, exakten, Wissenschaften, ,, nf, Wiener, Vortr, ge, ,, by, Mark, ,, Thirring, ,, Hahn, ...","Thirring, Die Wandlung des Begriffssystems der Physik (essay in Krise und Neuaufbau in den exakten Wissenschaften, nf Wiener Vortr ge, by Mark, Thirring, Hahn, Nobeling, Menger; Verlag Deuticke, W...",215,"thirring, die wandlung des begriffssystems der physik (essay in krise und neuaufbau in den exakten wissenschaften, nf wiener vortr ge, by mark, thirring, hahn, nobeling, menger; verlag deuticke, w..."
181708,Quintessence,Quine,analytic,"(A, lifhall, der, Welt, ,, in, his, derivations, from, similarities, of, global, experiences, .)","A lifhall der Welt, in his derivations from similarities of global experiences.",79,"a lifhall der welt, in his derivations from similarities of global experiences."
330545,Kapital,Marx,communism,"((, Wilhelm, Schulz, :, Die, Bewegung, der, Produktion, .)",(Wilhelm Schulz: Die Bewegung der Produktion.,45,(wilhelm schulz: die bewegung der produktion.


In [62]:
len((df[df['sentence_str'].str.contains('\sder\s')]))

314

These all seem questionable at best - the 'der' indicates a German phrase or book title where it doesn't just denote a fully German quote. Let's drop those and check 'il.'

In [63]:
df = df.drop(df[df['sentence_str'].str.contains('\sder\s')].index)

len(df)

324503

In [68]:
df[df['sentence_str'].str.contains('\sil\s')].sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered
336978,Kapital,Marx,communism,"(Si, les, Tartares, inondaient, I'Europe, aujourd'hui, ,, il, faudrait, bien, des, affaires, pour, leur, faire, entendre, ce, que, c'est, qu'un, financier, parmi, nous, .)","Si les Tartares inondaient I'Europe aujourd'hui, il faudrait bien des affaires pour leur faire entendre ce que c'est qu'un financier parmi nous.",144,"si les tartares inondaient i'europe aujourd'hui, il faudrait bien des affaires pour leur faire entendre ce que c'est qu'un financier parmi nous."
180463,Quintessence,Quine,analytic,"(oh, ,, et, de, conteml, ,, ier, eternellement, il, pro, /, Jre, l, (, Il, ', brti, .)","oh,et de conteml,ier eternellement il pro/Jre l( Il' brti.",58,"oh,et de conteml,ier eternellement il pro/jre l( il' brti."
200954,Philosophical Troubles,Kripke,analytic,"(If, God, did, not, exist, ,, Voltaire, said, ,, il, faudrait, l'inventer, ., ')","If God did not exist, Voltaire said, il faudrait l'inventer.'",61,"if god did not exist, voltaire said, il faudrait l'inventer.'"
185745,Quintessence,Quine,analytic,"(The, Ca, rtesian, quest, for, certainty, had, been, the, remote, motivation, of, epistemology, ,, both, il, its, conceptual, and, its, doctrinal, side, ;, but, that, quest, was, seen, as, a, lost...","The Ca rtesian quest for certainty had been the remote motivation of epistemology, both il its conceptual and its doctrinal side; but that quest was seen as a lost cause.",170,"the ca rtesian quest for certainty had been the remote motivation of epistemology, both il its conceptual and its doctrinal side; but that quest was seen as a lost cause."
323583,Kapital,Marx,communism,"(Pour, avoir, cet, argent, ,, il, faut, avoir, vendu, ,, ., ,, .)","Pour avoir cet argent, il faut avoir vendu, .,.",47,"pour avoir cet argent, il faut avoir vendu, .,."


It seems clear that those using 'il' are predominantly notes in French, especially from Marx. We drop them; even those with some meaning must have some errors - 'il' is not an English word.

In [69]:
df = df.drop(df[df['sentence_str'].str.contains('\sil\s')].index)

len(df)

324455

#### Some Ad Hoc Cleaning

These last cells show us cleaning up some things that we noticed as we read over the data and explored it in other ways. There is nothing systematic here, just us noticing bad data and deleting as we go.

In [79]:
# miscellaneous nonsense sentences
df = df.drop(df[df['sentence_str'].str.contains('\spp\s')].index)
df = df.drop(df[df['sentence_str'].str.contains('\stotam\s')].index)
df = df.drop(df[df['sentence_str'].str.contains('\srree\s')].index)
df = df.drop(df[df['sentence_str'].str.contains('\sflir\s')].index)
df = df.drop(df[(df['sentence_str'].str.contains('\smodis\s')) & (df['author'] != 'Kant')].index)

len(df)

324432

In [80]:
# markers of french and notes
df = df.drop(df[df['sentence_str'].str.contains('\schapitre')].index)
df = df.drop(df[df['sentence_str'].str.contains('\salisme')].index)
df = df.drop(df[df['sentence_str'].str.contains('\sHahn')].index)

len(df)

324432

In [81]:
# some notes in Kant
df = df.drop(df[df['sentence_str'].str.contains('\sVorl\s')].index)

len(df)

324414

In [82]:
# a common phrase in Plato / Aristotle footnotes
df = df.drop(df[(df['author']=='Plato') & (df['sentence_str'].str.contains('(?i)reading')) & (df['sentence_length'] < 40)].index)
df = df.drop(df[(df['author']=='Aristotle') & (df['sentence_str'].str.contains('(?i)reading')) & (df['sentence_length'] < 40)].index)

len(df)

324140

### Lemmatizing, Tokenizing, and Exporting

This brings us to the end of our pruning the data. Let's take a quick look at how the schools break down after the cleaning.

In [83]:
df['school'].value_counts(normalize=True)

analytic           0.164882
aristotle          0.150487
german_idealism    0.129996
plato              0.118440
continental        0.104217
phenomenology      0.088150
rationalism        0.070803
empiricism         0.061489
capitalism         0.056130
communism          0.055405
Name: school, dtype: float64

Things seem slightly more balanced, if only because there are a few thousand 
less sentences overall. At this point, we are ready to export the dataframe. Before doing so, we will make future work easier by adding adding two columns: one with lemmatized text and another with tokenized text.

In [84]:
df['tokenized_txt'] = df['sentence_str'].map(lambda x: simple_preprocess(x.lower(),deacc=True,
                                                        max_len=200))

def lemmatize_sentence(sentence):
  lemmatized_txt = ''
  for word in sentence:
    lemmatized_txt += ' ' + str(word.lemma_)
  return lemmatized_txt

In [85]:
df['lemmatized_str'] = df['sentence_spacy'].apply(lemmatize_sentence)

One last preview before we finalize the document:

In [88]:
df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered,tokenized_txt,lemmatized_str
112719,A Treatise Of Human Nature,Hume,empiricism,"(In, order, to, put, this, whole, affair, in, a, fuller, light, ,, let, us, consider, it, as, a, question, in, natural, philosophy, ,, which, we, must, determine, by, experience, and, observation, .)","In order to put this whole affair in a fuller light, let us consider it as a question in natural philosophy, which we must determine by experience and observation.",163,"in order to put this whole affair in a fuller light, let us consider it as a question in natural philosophy, which we must determine by experience and observation.","[in, order, to, put, this, whole, affair, in, fuller, light, let, us, consider, it, as, question, in, natural, philosophy, which, we, must, determine, by, experience, and, observation]","in order to put this whole affair in a full light , let -PRON- consider -PRON- as a question in natural philosophy , which -PRON- must determine by experience and observation ."
294919,The System Of Ethics,Fichte,german_idealism,"(And, this, is, supposed, to, be, something, special, and, grand, !)",And this is supposed to be something special and grand!,55,and this is supposed to be something special and grand!,"[and, this, is, supposed, to, be, something, special, and, grand]",and this be suppose to be something special and grand !
195249,Naming And Necessity,Kripke,analytic,"(One, does, n't, say, that, ', two, plus, two, equals, four, ', is, contingent, because, people, might, have, spoken, a, language, in, which, ', two, plus, two, equals, four, ', meant, that, seven...",One doesn't say that 'two plus two equals four' is contingent because people might have spoken a language in which 'two plus two equals four' meant that seven is even.,167,one doesn't say that 'two plus two equals four' is contingent because people might have spoken a language in which 'two plus two equals four' meant that seven is even.,"[one, doesn, say, that, two, plus, two, equals, four, is, contingent, because, people, might, have, spoken, language, in, which, two, plus, two, equals, four, meant, that, seven, is, even]",one do not say that ' two plus two equal four ' be contingent because people may have speak a language in which ' two plus two equal four ' mean that seven be even .
347010,The Wealth Of Nations,Smith,capitalism,"(When, an, artificer, has, acquired, a, little, more, stock, than, is, necessary, for, carrying, on, his, own, business, in, supplying, the, neighbouring, country, ,, he, does, not, ,, in, North, ...","When an artificer has acquired a little more stock than is necessary for carrying on his own business in supplying the neighbouring country, he does not, in North America, attempt to establish wit...",306,"when an artificer has acquired a little more stock than is necessary for carrying on his own business in supplying the neighbouring country, he does not, in north america, attempt to establish wit...","[when, an, artificer, has, acquired, little, more, stock, than, is, necessary, for, carrying, on, his, own, business, in, supplying, the, neighbouring, country, he, does, not, in, north, america, ...","when an artificer have acquire a little more stock than be necessary for carry on -PRON- own business in supply the neighbour country , -PRON- do not , in North America , attempt to establish wit..."
74797,Aristotle - Complete Works,Aristotle,aristotle,"(Animals, have, parts, of, a, similar, kind, ,, their, organs, ,, the, sinewy, tendons, to, wit, and, the, bones, ;, the, bones, are, like, the, pegs, and, the, iron, ;, the, tendons, are, like, t...","Animals have parts of a similar kind, their organs, the sinewy tendons to wit and the bones; the bones are like the pegs and the iron; the tendons are like the strings; for when these are slackene...",226,"animals have parts of a similar kind, their organs, the sinewy tendons to wit and the bones; the bones are like the pegs and the iron; the tendons are like the strings; for when these are slackene...","[animals, have, parts, of, similar, kind, their, organs, the, sinewy, tendons, to, wit, and, the, bones, the, bones, are, like, the, pegs, and, the, iron, the, tendons, are, like, the, strings, fo...","animal have part of a similar kind , -PRON- organ , the sinewy tendon to wit and the bone ; the bone be like the peg and the iron ; the tendon be like the string ; for when these be slacken or re..."


Looks good! Let's export.

In [89]:
from google.colab import files
df.to_csv('phil_nlp.csv', index=False) 
files.download('phil_nlp.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

And that's it! When it came to modeling the data, we first worked on some basic [Bayesian models](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/2_non-neural_models.ipynb), before then moving on to [w2v](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/3_w2v.ipynb) and [neural networks](https://github.com/kcalizadeh/phil_nlp/blob/master/Notebooks/4_neural_networks.ipynb).