In [1]:
import pandas as pd
import requests
import tweepy
import numpy as np
from nltk import sent_tokenize
import re
import json


def get_Guten(url):
    # retrieve the source text
    r = requests.get(url)
    r.encoding = 'utf-8'
    text = r.text
    return text

def get_text(path):
    # a context manager closes the file even if reading fails
    with open(path, 'r') as f:
        text = f.read()
    return text

def clean_text(text):
    with open("data/replace_chars.json") as f:
        replace_these = json.load(f)
        for k in replace_these.keys():
            text = text.replace(k, replace_these[k])
    return text

def make_into_quotes_guten(text, source):
    # make a list of quotes and clean them up
    quotes = sent_tokenize(text)
    # remove unnecessary spaces
    quotes = [x.strip() for x in quotes]
    # remove empty quotes
    quotes = list(filter(None, quotes))
    # cut out very short ones as they often have no real meaning
    quotes = [x for x in quotes if len(x) > 15]
    # remove the titles of sections & citation-type stuff
    quotes = [x for x in quotes if not x.isupper()]
    quotes = [x for x in quotes if not x.replace('the', '').replace('of', '').replace('and', '').replace('II', '').istitle()]
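    # note: set('Werke').issubset(x) is true when x contains each of the letters W, e, r, k anywhere, not only the word 'Werke'
    quotes = [x for x in quotes if not set('Werke').issubset(x)]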
    # remove oddities
    quotes = [x for x in quotes if x[0].isupper()]
    quotes = [x.replace('.', '') for x in quotes]
    quotes = [x for x in quotes if not x[-1].isupper()]
    # add the source
    quotes = [x+'\n- '+ source for x in quotes]
    return quotes

def make_into_quotes_pdf(text, source):
    # make the text into a list
    quotes = sent_tokenize(text)
    # remove unnecessary spaces
    quotes = [x.strip() for x in quotes]
    # remove empty quotes
    quotes = list(filter(None, quotes))
    # cut out very short ones as they often have no real meaning
    quotes = [x for x in quotes if len(x) > 15]
    # remove the titles of sections & citation-type stuff
    quotes = [x for x in quotes if not x.isupper()]
    quotes = [x for x in quotes if not x.replace('the', '').replace('of', '').replace('and', '').replace('II', '').istitle()]
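    # note: set('Werke').issubset(x) is true when x contains each of the letters W, e, r, k anywhere, not only the word 'Werke'
    quotes = [x for x in quotes if not set('Werke').issubset(x)]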
    # this looks at all quotes and removes headers/footers/page numbers that are sometimes in the text accidentally
    holding = []
    for quote in quotes:
        for word in quote.split(' '):
            if word.isupper() and len(word) > 2 and word != 'A' and word != 'OF':
                quote = quote.replace(word, '')
        quote = re.sub('[1234567890]', '', quote).replace(' s ', ' ').replace(' S ', ' ').replace('OF', '').replace(' ) ', '').replace(' ( ', '').replace(' , ', '').replace('  ', ' ').replace('- ', '-').replace('  ', ' ').replace('  ', ' ')
        holding.append(quote)
    # remove oddities
    quotes = [x for x in holding if x[0].isupper()]
    quotes = [x.replace('.', '') for x in quotes]
    quotes = [x for x in quotes if not x[-1].isupper()]
    quotes = [x for x in quotes if not set('~').issubset(x)]
    quotes = [x for x in quotes if not set('=').issubset(x)]
    # add the source
    quotes = [x+'\n- '+ source for x in quotes]
    return quotes
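
A set-based test such as set('Werke').issubset(x) looks at individual characters, so it can match sentences that never mention 'Werke'. The sketch below contrasts it with a plain substring test. The helper names and the sample sentence are hypothetical and only illustrate the difference; which behaviour is wanted for the quote filters is a judgment call.

```python
def has_all_letters(sentence, word="Werke"):
    # True when every distinct character of `word` occurs somewhere in `sentence`
    return set(word).issubset(sentence)


def has_substring(sentence, word="Werke"):
    # True only when `word` occurs as a contiguous substring
    return word in sentence


# hypothetical sample: contains W, e, r and k, but not the word 'Werke'
sample = "We know that every thinker works with concepts"
citation = "Hegel, Werke, vol. xiii"

letters_match = has_all_letters(sample)
substring_match = has_substring(sample)
citation_match = has_substring(citation)
```

With these inputs the character test flags the sample sentence while the substring test does not, so the character test is the stricter filter and will drop more sentences.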

In [2]:
# import different texts, cut out their front and end matter
hop1 = clean_text(get_Guten('http://www.gutenberg.org/files/51635/51635-0.txt'))[10545:-34150]
hop2 = clean_text(get_Guten('http://www.gutenberg.org/files/51636/51636-0.txt'))[4489:-42865]
hop3 = clean_text(get_Guten('http://www.gutenberg.org/files/58169/58169-0.txt'))[10068:-125524]
enc_logic = clean_text(get_Guten('http://www.gutenberg.org/files/55108/55108-0.txt'))[36755:-134712]
phil_of_nature = clean_text(get_text('data/Phil_of_Nature.txt'))[500403:-278448]
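
The hard-coded slice offsets above depend on the exact bytes of each download. As a rough alternative for the licence header and footer only, a sketch could look for the "*** START OF ..." and "*** END OF ..." marker lines. The helper name strip_gutenberg_boilerplate is hypothetical and the marker wording is an assumption that should be checked against each file. This would not remove prefaces, tables of contents, or indexes, which the offsets above also cut.

```python
import re

# assumed marker wording; individual Project Gutenberg files may differ
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S
)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_gutenberg_boilerplate(text):
    # keep only the text between the start and end markers, when they are found
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# synthetic sample, not a real download
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body text here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer"
)
trimmed = strip_gutenberg_boilerplate(sample)
```

If neither marker is found, the function returns the whole text stripped of surrounding whitespace, so it fails soft rather than silently discarding content.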

In [3]:
# turn these texts into quotes and assemble a list
hop1_quotes = make_into_quotes_guten(hop1, 'HoP 1')
hop2_quotes = make_into_quotes_guten(hop2, 'HoP 2')
hop3_quotes = make_into_quotes_guten(hop3, 'HoP 3')
enc_logic_quotes = make_into_quotes_guten(enc_logic, 'EnL')
pon_quotes = make_into_quotes_pdf(phil_of_nature, 'PoN')

master_q = hop1_quotes + hop2_quotes + hop3_quotes + enc_logic_quotes + pon_quotes

# preview the quote list to see if there are any aberrations
random_range_start = np.random.randint(0, len(master_q))
master_q[random_range_start:random_range_start + 10], len(master_q)

(['These sciences progress through a process of juxtaposition\n- HoP 1',
  'It is true that in Botany, Mineralogy, and so on, much is dependent on what was previously known, but by far the greatest part remains stationary and by means of fresh matter is merely added to without itself being affected by the addition\n- HoP 1',
  'Thus to take an example, elementary geometry in so far as it was created by Euclid, may from his time on be regarded as having no further history\n- HoP 1',
  'The history of Philosophy, on the other hand, shows neither the motionlessness of a complete, simple content, nor altogether the onward movement of a peaceful addition of new treasures to those already acquired\n- HoP 1',
  'It seems merely to afford the spectacle of ever-recurring changes in the whole, such as finally are no longer even connected by a common aim\n- HoP 1',
  'At this point appear these ordinary superficial ideas regarding the history of Philosophy which have to be referred to and correct

In [4]:
# import the original set of quotes and prepare it for merging
old_quotes = pd.read_csv('data/Original_Quote_sheet.csv')
old_quotes = old_quotes.drop('Unnamed: 0', axis=1)
old_quotes = old_quotes.rename(columns={'Select one from each column':'quotes'})
old_quotes = old_quotes.iloc[3:]
old_quotes['quotes'] = old_quotes['quotes'].str.capitalize()

In [5]:
# turn the list into a dataframe and weed out quotes too long to tweet
quote_df = pd.DataFrame(master_q, columns=['quotes'])
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
quote_df = pd.concat([old_quotes, quote_df])
quote_df['length'] = quote_df['quotes'].str.len()
quote_tweetable = quote_df.loc[quote_df['length'] <= 240].copy()

# preview again, see how many we have
quote_tweetable.iloc[random_range_start:random_range_start + 10], len(quote_tweetable), len(quote_df)

(                                                quotes  length
 148  Public opinion has common sense, but is infect...     106
 149  Necessity appears to itself in the shape of fr...      59
 150  It is the art of music which conducts us to th...      94
 151  The tool lasts, while the immediate enjoyments...      75
 152            No act of revenge is justified. - hegel      39
 153  By the act of reflection something is altered ...     139
 154  The man who wonders at nothing lives in a stat...      76
 155  The fact is, no man can think for another, any...      84
 156  Existence as determinate being is in essence b...      63
 157  Amid the pressure of great events, a general p...      77,
 14269,
 18726)
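
The merge-and-filter step can be checked on a toy example. The sketch below uses made-up quotes and pd.concat to combine two frames, then applies the same 240-character cutoff. Passing ignore_index=True is an assumption that differs from the cell above: it renumbers the rows so that index labels from the two frames cannot collide.

```python
import pandas as pd

# made-up data standing in for the old quote sheet and the new quote list
old = pd.DataFrame({"quotes": ["Short old quote"]})
new = pd.DataFrame({"quotes": ["A new quote\n- HoP 1", "x" * 300]})

# stack the two frames and renumber the rows
combined = pd.concat([old, new], ignore_index=True)

# the length includes the newline and the source suffix
combined["length"] = combined["quotes"].str.len()

# keep only rows at or under the 240-character cutoff
tweetable = combined.loc[combined["length"] <= 240].copy()
```

Here the 300-character row is dropped and the two short rows remain, which mirrors the gap between len(quote_tweetable) and len(quote_df) in the preview.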

In [6]:
# export csv for use by tweeter program
quote_tweetable.to_csv('Quote List.csv')

In [7]:
258

258

In [8]:
phil_of_nature[-1000:-1]

'r way, the self-motivation of the solar system is the sublation of the merely ideal nature of being-for-self, of mere spatiality of determination, of not-being-for-selÂ£ In the Notion, the negation of place does not 35 merely give rise to its re-instatement; the negation of not-being-for-self 282  \x0cABSOLUTE MECHANICS  5  is a negation of the negation, ie an affirmation, so that what comes forth is real being-for-selÂ£ This is the abstractly logical determination of the transition. It is precisely the total development of being-for-self which is real being-for-self; this might be expressed as the freeing of the form of matter. The determinations of form which constitute the solar system are the determinations of matter itself, and these determinations constitute the being of matter, so that determination and being are essentially identical. This is of the nature of quality, for if the determination is removed here, being also disappears. This is the transition from mechanics to phys

In [9]:
enc_logic[-1000:-1]

"ending the notion of itself, as of the pure idea for which the idea is.  244. The Idea which is independent or for itself, when viewed on the point of this its unity with itself, is Perception or Intuition, and the percipient Idea is Nature. But as intuition the idea is, through an external 'reflection,' invested with the one-sided characteristic of immediacy, or of negation. Enjoying however an absolute liberty, the Idea does not merely pass over into life, or as finite cognition allow life to show in it: in its own absolute truth it resolves to let the 'moment' of its particularity, or of the first characterisation and other-being, the immediate idea, as its reflected image, go forth freely as Nature.         *       *       *       *       *  We have now returned to the notion of the Idea with which we began. This return to the beginning is also an advance. We began with Being, abstract Being: where we now are we also have the Idea as Being: but this Idea which has Being is Nature.