<a href="https://colab.research.google.com/github/kwaldenphd/poemBot/blob/master/gutenberg_explorations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup & Environment



## Install

In [None]:
!pip install pronouncing # https://pronouncing.readthedocs.io/en/latest/
!pip install markovify # https://pypi.org/project/markovify/
!pip install numpy # https://pypi.org/project/numpy/
!pip install scipy # https://pypi.org/project/scipy/

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import

In [None]:
# standard library
import gzip, json, random, re, sys, textwrap
from collections import Counter, defaultdict

# third-party
import numpy as np
import pandas as pd

# All The Allison Parrish Things

## Overview

### Project Gutenberg
- [Gutenberg, dammit](https://github.com/aparrish/gutenberg-dammit/) (full corpus)
- [Gutenberg corpus](https://github.com/aparrish/gutenberg-poetry-corpus) (poetry corpus)
  - ["Quick Experiments" Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)
  - ["Plot to Poem" 2017 NaNoGenMo Jupyter Notebook](https://github.com/aparrish/plot-to-poem/blob/master/plot-to-poem.ipynb)
- [Gutenberg Poetry Autocomplete](http://gutenberg-poetry.decontextualize.com/)

## Shallow Dives

### Project Gutenberg Poetry Corpus

- [GitHub](https://github.com/aparrish/gutenberg-poetry-corpus)
- [Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)

#### Build & Load

In [None]:
# build
!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52.2M  100 52.2M    0     0  67.1M      0 --:--:-- --:--:-- --:--:-- 67.0M


In [None]:
# load data
# import gzip, json
all_lines = []
for line in gzip.open("gutenberg-poetry-v001.ndjson.gz"):
    all_lines.append(json.loads(line.strip()))
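
Each line of the `.ndjson.gz` file is one JSON object with an `s` field (the line of verse) and a `gid` field (the Project Gutenberg ID), as the random samples in this notebook show. A minimal sketch of the same load wrapped in a generator, so the file handle is closed afterwards and lines can be filtered while streaming. The name `iter_lines` and the tiny demo file are illustrative assumptions, not part of the corpus tooling.

```python
import gzip
import json
import os
import tempfile

def iter_lines(path):
    """Yield one dict per line of a gzipped newline-delimited JSON file."""
    # "rt" opens the gzip stream in text mode; the with-block closes it.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:  # skip blank lines
                yield json.loads(raw)

# Demo on a tiny made-up file with the same {"s": ..., "gid": ...} shape.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.ndjson.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    out.write(json.dumps({"s": "On an Easter-lily stalk.", "gid": "1664"}) + "\n")
    out.write(json.dumps({"s": "If no better feast is ready,", "gid": "33089"}) + "\n")

demo_lines = list(iter_lines(demo_path))
print(len(demo_lines), demo_lines[0]["gid"])
```

With the real file, `all_lines = list(iter_lines("gutenberg-poetry-v001.ndjson.gz"))` would produce the same list as the cell above.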

In [None]:
# show random sample
# import random
random.sample(all_lines, 8)

[{'s': 'The lady, ever watchful, penetrant,', 'gid': '2490'},
 {'s': '"Some are labelled \'Knots to tie men--', 'gid': '8187'},
 {'s': "Milton! thou should'st be living at this hour:", 'gid': '41016'},
 {'s': 'The stranger smiled: "Since to your home', 'gid': '28287'},
 {'s': '"Gee up, my little horse!" he cried,', 'gid': '16686'},
 {'s': 'On an Easter-lily stalk.', 'gid': '1664'},
 {'s': 'Give half the world to sunshine, half to shade,', 'gid': '232'},
 {'s': 'When Roland saw that life had fled,', 'gid': '14019'}]

#### Markov Text Chains

In [None]:
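# markov text chains
import markovify
# build one newline-delimited string from a random 250,000-line sample
big_poem = "\n".join(line['s'] for line in random.sample(all_lines, 250000))
model = markovify.NewlineText(big_poem)
for i in range(14):
    # make_sentence() can return None when no sentence passes markovify's checks
    print(model.make_sentence())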
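
As a rough illustration of the idea behind a Markov text model (this is not markovify's actual implementation), a word-level chain can be sketched in plain Python: record which word follows each word, then walk that table at random. The names `build_chain` and `generate` and the toy lines are illustrative assumptions.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking line boundaries

def build_chain(lines):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = [START] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_words=12):
    """Walk the chain from START until END or max_words is reached."""
    word, out = START, []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)

toy_lines = [
    "the lady ever watchful",
    "the stranger smiled",
    "ever watchful the stranger",
]
chain = build_chain(toy_lines)
rng = random.Random(0)  # seeded only so the demo is repeatable
new_line = generate(chain, rng)
print(new_line)
```

Repeated followers are kept in the lists on purpose, so a more frequent continuation is proportionally more likely to be chosen.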

In [None]:
# another sentence
model.make_short_sentence(60)

In [None]:
# randomly-generated poem
for _ in range(6):
    print()
    for _ in range(random.randrange(1, 5)):
        print(model.make_short_sentence(40))
    # ensure last line has a period at the end, for closure
    # (make_short_sentence can return None, which re.sub would reject)
    last_line = model.make_short_sentence(40) or ""
    print(re.sub(r"(\w)[^\w.]?$", r"\1.", last_line))
    print()
    print("～ ❀ ～")
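
The `re.sub` call above swaps at most one trailing non-word character for a period, or appends a period when the line ends in a word character. A line that already ends in a period, or in more than one punctuation mark, is left alone. A quick check on short sample strings (the helper name `close_line` is an assumption):

```python
import re

def close_line(text):
    """End a line with a period, replacing one trailing punctuation mark if present."""
    return re.sub(r"(\w)[^\w.]?$", r"\1.", text)

for sample in ["Give half to shade,", "When Roland saw", "Here in this church.", 'he cried!"']:
    print(close_line(sample))
```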

# Katie's Explorations

## Clean poem text, merge with metadata

In [None]:
# show random sample
# import random
random.sample(all_lines, 8)

[{'s': 'With flaming torch, withstood the arms of France,', 'gid': '42422'},
 {'s': 'That searched the mysteries of leafy shade,', 'gid': '38135'},
 {'s': 'Here in this old neglected church,', 'gid': '1365'},
 {'s': "(The port once gain'd) uncabled ride secure.", 'gid': '24269'},
 {'s': '_Faust_. Thee, flame-born creature, shall I fear?', 'gid': '14460'},
 {'s': 'On her ensnared in Káma’s net', 'gid': '24869'},
 {'s': 'If no better feast is ready,', 'gid': '33089'},
 {'s': 'What were thy lips the worse for one poor kiss?', 'gid': '1045'}]

In [None]:
# all lines df
allLines = pd.DataFrame.from_dict(all_lines)
allLines.head()

Allison Parrish's corpus is also published on Kaggle: https://www.kaggle.com/datasets/terminate9298/gutenberg-poetry-dataset

In [None]:
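# group df by gid: join every line sharing a gid into one string
allLines['poem'] = allLines.groupby(['gid'])['s'].transform(lambda x : ' \n '.join(x))
# keep one row per joined text; .copy() avoids a SettingWithCopyWarning on the next line
allLines = allLines.drop_duplicates(subset="poem", keep="first").copy()
allLines['gid'] = allLines['gid'].astype(int) # integer IDs, to match the catalog
# the path below assumes Google Drive is already mounted at /content/drive
allLines.to_csv("/content/drive/Shareddrives/Kaneb Center Course Design Academy/Notebooks/gutenberg_output.csv", index=False)
allLines.head()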
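
The `groupby(...).transform(...)` step joins every line that shares a `gid` into one string, so each "poem" here is really all of the corpus lines drawn from one Project Gutenberg text. The same grouping in plain Python, as a sketch on made-up rows (`group_lines` is an illustrative name):

```python
from collections import defaultdict

def group_lines(rows, sep=" \n "):
    """Join the 's' values of rows sharing a 'gid', keeping first-seen order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["gid"]].append(row["s"])
    # Cast gid to int so it can later be matched against integer catalog IDs.
    return {int(gid): sep.join(lines) for gid, lines in grouped.items()}

rows = [
    {"s": "first line of text 20", "gid": "20"},
    {"s": "only line of text 232", "gid": "232"},
    {"s": "second line of text 20", "gid": "20"},
]
poems = group_lines(rows)
print(poems[20])
```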

In [None]:
# get metadata
metadata = pd.read_csv("https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv")
# rename positionally; this assumes the catalog still has exactly these nine columns in this order
columns = ['gid', 'type', 'issued', 'title', 'language', 'authors', 'subjects', 'locc', 'bookshelves']
metadata.columns = columns
metadata.to_csv("/content/drive/Shareddrives/Kaneb Center Course Design Academy/Notebooks/gutenberg_metadata.csv", index=False)
metadata.head()

In [None]:
# merge dfs
combined = pd.merge(allLines, metadata, how="left", on="gid")
combined.to_csv("/content/drive/Shareddrives/Kaneb Center Course Design Academy/Notebooks/gutenberg_combined.csv", index=False)
combined
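
`how="left"` keeps every row of `allLines` and leaves the catalog columns empty when a `gid` has no match in the metadata. The join keys also need to be the same type to line up. A plain-Python sketch of that logic on made-up records (`left_join` and the placeholder title are assumptions, not real catalog data):

```python
def left_join(poems, catalog, key="gid"):
    """Attach catalog fields to each poem; unmatched poems get None values."""
    by_key = {record[key]: record for record in catalog}
    catalog_fields = [f for f in catalog[0] if f != key] if catalog else []
    joined = []
    for poem in poems:
        match = by_key.get(poem[key], {})
        extra = {field: match.get(field) for field in catalog_fields}
        joined.append({**poem, **extra})
    return joined

# The first poem has an integer gid, the second a string gid.
poems = [{"gid": 20, "poem": "..."}, {"gid": "20", "poem": "..."}]
catalog = [{"gid": 20, "title": "Placeholder Title"}]

joined = left_join(poems, catalog)
for row in joined:
    print(row)
```

Only the integer key finds its catalog row, which is the reason for the `astype(int)` step before merging.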

## Keyword Return

The user enters a keyword, and the program returns a single randomly chosen poem that contains it.

In [None]:
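keyword = input("Enter a search term: ")

# literal, case-insensitive match; na=False skips rows with no poem text
matches = combined[combined['poem'].str.contains(keyword, case=False, regex=False, na=False)]

if matches.empty:
    print(f"No poem contains {keyword!r}.")
else:
    # .sample() picks one matching row at random
    poem = matches.sample().to_dict('records')
    print(poem[0]['poem'])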
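
A substring test also matches inside longer words, so `ear` would match `fear`. If whole-word matching is wanted, one option is a regular expression with lookarounds that reject a word character on either side of the keyword. A sketch in plain Python on a few corpus lines (`find_poem` is an illustrative name, not part of the notebook):

```python
import random
import re

def find_poem(poems, keyword, rng=random):
    """Return one random poem containing keyword as a whole word, or None."""
    # re.escape makes characters such as "?" match literally.
    pattern = re.compile(
        r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
    )
    matches = [poem for poem in poems if pattern.search(poem)]
    return rng.choice(matches) if matches else None

poems = [
    "What were thy lips the worse for one poor kiss?",
    "Here in this old neglected church,",
    "Thee, flame-born creature, shall I fear?",
]
print(find_poem(poems, "Fear"))
print(find_poem(poems, "ear"))
```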

## Old code

In [None]:
# testing on a subset
subset = allLines.iloc[:30000,:].copy() # subset all lines; .copy() so the new column is not written to a view
subset['poem'] = subset.groupby(['gid'])['s'].transform(lambda x : ' \n '.join(x)) # lambda function to group by id and combine individual lines in new columns
subset2 = subset.drop_duplicates(subset='poem', keep='first') # remove duplicates
subset2 # show updated df

In [None]:
# isolate single poem
poem = subset2.loc[subset2['gid'] == "20"] # build the mask from subset2 itself; "20" only matches while gid is still a string
poemDict = poem.to_dict('records')
poemStr = str(poemDict[0]['poem'])
print(poemStr)

In [None]:
# get unique list of ids
ids = list({line['gid'] for line in all_lines})
len(ids)
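
Beyond the number of distinct IDs, `Counter` (already imported above) can report how many corpus lines each `gid` contributes. A small sketch on made-up rows with the same shape as `all_lines`:

```python
from collections import Counter

sample_lines = [
    {"s": "line a", "gid": "20"},
    {"s": "line b", "gid": "20"},
    {"s": "line c", "gid": "232"},
]
lines_per_gid = Counter(line["gid"] for line in sample_lines)
print(len(lines_per_gid))            # number of distinct IDs
print(lines_per_gid.most_common(1))  # the ID with the most lines
```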

Not having any luck with the gutenberg libraries:
- `gutenberg`
- `gutenbergpy`

Trying the machine-readable catalog files that Project Gutenberg makes available instead: https://www.gutenberg.org/ebooks/offline_catalogs.html#the-project-gutenberg-catalog-metadata-in-machine-readable-format

In [None]:
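# df is not defined anywhere in this notebook; assumed here to be the raw catalog CSV
df = pd.read_csv("https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv")
metadata = df.loc[df["Text#"] == 42422]
metadata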

In [None]:
metadata = df.loc[df["Text#"] == 42422] 

dtest = metadata.to_dict('records')

dtest

In [None]:
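# fighting my way through dynamic page stuff
import bs4 as bs
import requests, re, json

# poemUrls is not defined in this notebook; it must hold a list of poem page URLs
page = requests.get(poemUrls[5])
soup = bs.BeautifulSoup(page.text, 'html.parser')

# the third <script> tag appears to hold JSON-LD metadata (note the '@graph' key)
data = soup.find_all('script')[2].contents[0]
data2 = json.loads(data)
data3 = data2['@graph'][0]
data3.keys()

# author name and attribution text, with HTML tags stripped
author = soup.find(attrs={"itemprop" : "author"}).contents[0].strip()
attribution = re.sub('<.*?>', '', str(soup.find(class_="card--poem__attribution text-muted-dark font-sans p-3")))

# strip HTML tags from the description to get plain text
string = str(data3['description'])
text = re.sub('<.*?>', '', string)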
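
Picking the third `<script>` tag by position is fragile. JSON-LD blocks are conventionally marked with `type="application/ld+json"`, so selecting on that attribute is one alternative. A sketch using only the standard library on a made-up HTML snippet (the markup and values are assumptions and do not reflect any particular site):

```python
import json
import re
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> block."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.blocks.append(json.loads("".join(self.buffer)))
            self.in_jsonld, self.buffer = False, []

html_doc = """
<html><head>
<script>var unrelated = 1;</script>
<script type="application/ld+json">
{"@graph": [{"name": "Example Poem", "description": "First line<br>Second line"}]}
</script>
</head><body></body></html>
"""

parser = JsonLdParser()
parser.feed(html_doc)
record = parser.blocks[0]["@graph"][0]
# Replace each tag with a space so adjacent lines do not run together.
text = re.sub(r"<.*?>", " ", record["description"]).strip()
print(record["name"], "|", text)
```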

In [None]:
# random hunting for poetryDB
# urls is not defined in this notebook; it must hold a list of PoetryDB request URLs
page = requests.get(urls[3])
info = page.json()
poem = info[0]
poem.keys()

# print each line followed by extra blank lines, as before
for line in poem['lines']:
  print(line)
  print("\n")

dict_keys(['title', 'author', 'lines', 'linecount'])
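
Given the four keys shown above, a record can be rendered as a titled poem without the extra blank lines. A sketch on a made-up record with the same keys (all values are placeholders, and `format_poem` is an illustrative name):

```python
import textwrap

def format_poem(poem, width=60):
    """Render a PoetryDB-style record as title, author, and wrapped lines."""
    header = f"{poem['title']}\nby {poem['author']}\n"
    body = "\n".join(
        # Wrap long lines with a hanging indent; keep empty strings as stanza breaks.
        textwrap.fill(line, width=width, subsequent_indent="    ") if line else ""
        for line in poem["lines"]
    )
    return header + "\n" + body

sample = {
    "title": "Placeholder Title",
    "author": "Placeholder Author",
    "lines": ["A first made-up line,", "", "and a second one after a stanza break."],
    "linecount": "2",
}
formatted = format_poem(sample)
print(formatted)
```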