
# Setup & Environment



## Install

In [None]:
!pip install pronouncing # https://pronouncing.readthedocs.io/en/latest/
!pip install markovify # https://pypi.org/project/markovify/
!pip install numpy # https://pypi.org/project/numpy/
!pip install scipy # https://pypi.org/project/scipy/

The `Pincelate` install (https://pincelate.readthedocs.io/en/latest/) is commented out below because of dependency issues with `tensorflow` and `keras`.

In [None]:
# !pip install tensorflow==1.15.0 # https://pypi.org/project/tensorflow/
# !pip install keras==2.2.5 "h5py<3.0.0" # https://pypi.org/project/keras/
# !pip install pincelate # https://pypi.org/project/pincelate/

## Import

In [None]:
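# standard library, data, and text-generation imports
import sys, json, random, re, gzip, textwrap, codecs
from collections import Counter, defaultdict
import pandas as pd, numpy as np
import markovify
import pronouncing as pr # used as `pr` in the rhyme cells below
# from pincelate import Pincelate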

# All The Allison Parrish Things

## Overview

### Project Gutenberg
- [Gutenberg, dammit](https://github.com/aparrish/gutenberg-dammit/) (full corpus)
- [Gutenberg corpus](https://github.com/aparrish/gutenberg-poetry-corpus) (poetry corpus)
  - ["Quick Experiments" Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)
  - ["Plot to Poem" 2017 NaNoGenMo Jupyter Notebook](https://github.com/aparrish/plot-to-poem/blob/master/plot-to-poem.ipynb)
- [Gutenberg Poetry Autocomplete](http://gutenberg-poetry.decontextualize.com/)

## Shallow Dives

### Project Gutenberg Poetry Corpus

- [GitHub](https://github.com/aparrish/gutenberg-poetry-corpus)
- [Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)

#### Build & Load

In [None]:
# download the corpus (gzipped, newline-delimited JSON)
!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

In [None]:
# load data: one JSON object per line; the line of verse is under the 's' key
# import gzip, json
all_lines = []
with gzip.open("gutenberg-poetry-v001.ndjson.gz") as f:
    for line in f:
        all_lines.append(json.loads(line.strip()))
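
The file format can be tried out without downloading the corpus. The sketch below writes two made-up records to an in-memory gzip buffer and reads them back line by line. The record contents are hypothetical, and the `gid` field (a Project Gutenberg book ID) is an assumption about the corpus layout; only the `s` key is used elsewhere in this notebook.

```python
import gzip
import io
import json

# Hypothetical records: "s" holds the line of verse, "gid" is assumed
# to hold the Project Gutenberg ID of the source book.
sample_records = [
    {"s": "A made-up line about a flower", "gid": "1"},
    {"s": "Another made-up line of verse", "gid": "2"},
]

# Write the records as gzipped, newline-delimited JSON, in memory.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# Read them back one line at a time, parsing each line as its own JSON object.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["s"])
```

Because every line is a complete JSON document, the file can be streamed without loading it all into a single parser call.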

In [None]:
# show random sample
# import random
random.sample(all_lines, 8)

#### Concordances & Counts

In [None]:
# create concordance for "flower"
# import re
flower_lines = [line['s'] for line in all_lines if re.search(r'\bflower\b', line['s'], re.I)]
random.sample(flower_lines, 8)

In [None]:
# keyword-in-context view: sort on the text after "flower", then align every line on "flower"
longest = max([len(x) for x in flower_lines]) # find the length of the longest line
center = longest - len("flower") # and use it to create a "center" offset that will work for all lines

sorted_flower_lines = sorted(
    [line for line in flower_lines if re.search(r"\bflower\b\s\w", line)], # only lines with word following
    key=lambda line: line[re.search(r"\bflower\b\s", line).end():]) # sort on the substring following the match

for line in sorted_flower_lines[350:400]: # change these numbers to see a different slice
    offset = center - re.search(r'\bflower\b', line, re.I).start()
    print((" "*offset)+line) # left-pad the string with spaces to align on "flower"
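
The alignment step is easier to see on a handful of lines. This is a minimal sketch with made-up sample lines; `align_on_keyword` is a hypothetical helper, and it pads to the widest keyword offset rather than to the longest line length used in the cell above.

```python
import re

# Hypothetical sample lines, not taken from the corpus.
sample_lines = [
    "The flower that blooms in spring",
    "I found a flower beside the road",
    "No flower here",
    "Each flower of the field",
]

keyword = re.compile(r"\bflower\b", re.I)

def align_on_keyword(lines):
    """Left-pad each line so the keyword starts in the same column."""
    starts = [keyword.search(line).start() for line in lines]
    widest = max(starts)
    return [" " * (widest - start) + line for line, start in zip(lines, starts)]

# Sort on whatever follows the keyword, then align.
by_following_text = sorted(
    sample_lines, key=lambda line: line[keyword.search(line).end():]
)
aligned = align_on_keyword(by_following_text)
for line in aligned:
    print(line)
```

Sorting on the text after the keyword groups similar continuations together, which is what makes the aligned column readable as a concordance.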

In [None]:
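# words between "the"/"a" and "flower" (mostly adjectives)
found_adj = []
for line in flower_lines:
    # leading \b stops the article from matching the tail of a longer word, e.g. "sea"
    matches = re.findall(r"\b(the|a)\s(\w+)\s(flower)\b", line, re.I)
    for match in matches:
        found_adj.append(match[1])
random.sample(found_adj, 12)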

In [None]:
# counting most common adjectives
# from collections import Counter
Counter(found_adj).most_common(12)

#### Rhymes & Phones

In [None]:
# rhymes
# import pronouncing as pr
source_word = "flowering"
source_word_rhymes = set(pr.rhymes(source_word)) # set for fast membership tests
for line in all_lines:
    text = line['s']
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group(1).lower() # group(1) drops trailing punctuation
        if last_word in source_word_rhymes:
            print(text)

In [None]:
# create list of phones
phones = pr.phones_for_word(source_word)[0] # words may have multiple pronunciations, so this returns a list
pr.rhyming_part(phones)
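
As a rough illustration of what a "rhyming part" is, the sketch below keeps everything from the last stressed vowel to the end of the word. It is not the `pronouncing` implementation: `rhyming_part_sketch` is a hypothetical helper, and the phone strings are hand-typed in CMUdict style and assumed to match the dictionary entries.

```python
def rhyming_part_sketch(phones):
    """Return the phones from the last stressed vowel to the end of the word."""
    phone_list = phones.split()
    # Walk backwards looking for a vowel marked with primary (1) or secondary (2) stress.
    for i in range(len(phone_list) - 1, -1, -1):
        if phone_list[i][-1] in "12":
            return " ".join(phone_list[i:])
    return phones  # no stressed vowel found, so keep the whole word

# Hand-typed, assumed pronunciations in CMUdict (ARPAbet) notation.
flowering = "F L AW1 ER0 IH0 NG"
towering = "T AW1 ER0 IH0 NG"

print(rhyming_part_sketch(flowering))
print(rhyming_part_sketch(flowering) == rhyming_part_sketch(towering))
```

Two words count as rhymes under this definition when their rhyming parts are identical, which is the key the grouping cell below relies on.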

In [None]:
# random rhymes
# from collections import defaultdict
by_rhyming_part = defaultdict(lambda: defaultdict(list))
for line in all_lines:
    text = line['s']
    if not(32 < len(text) < 48): # only use lines of similar length (33 to 47 characters)
        continue
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group(1) # group(1) drops trailing punctuation
        pronunciations = pr.phones_for_word(last_word)
        if len(pronunciations) > 0:
            rhyming_part = pr.rhyming_part(pronunciations[0])
            # group by rhyming phones (for rhymes) and words (to avoid duplicate words)
            by_rhyming_part[rhyming_part][last_word.lower()].append(text)

random_rhyming_part = random.choice(list(by_rhyming_part.keys()))
random_rhyming_part, by_rhyming_part[random_rhyming_part]

In [None]:
# rhyming groups
rhyme_groups = [group for group in by_rhyming_part.values() if len(group) >= 2]
for i in range(7):
    group = random.choice(rhyme_groups)
    words = random.sample(list(group.keys()), 2)
    print(random.choice(group[words[0]]))
    print(random.choice(group[words[1]]))

#### Markov Text Chains

In [None]:
# markov text chains
# import markovify
big_poem = "\n".join([line['s'] for line in random.sample(all_lines, 250000)])
model = markovify.NewlineText(big_poem)
for i in range(14):
    print(model.make_sentence())

In [None]:
# another sentence
model.make_short_sentence(60)
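
For intuition about what a Markov text model does, here is a minimal word-level chain in plain Python. It is a sketch, not how `markovify` is implemented: it uses a state of one word (markovify defaults to two), and `build_chain`, `generate_line`, and the sample lines are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical training lines, not taken from the corpus.
sample_lines = [
    "the flower opens in the morning light",
    "the morning opens like a flower",
    "a flower in the light of morning",
]

def build_chain(lines):
    """Map each word to the list of words that follow it in the training lines."""
    chain = defaultdict(list)
    for line in lines:
        words = ["<START>"] + line.split() + ["<END>"]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate_line(chain, rng, max_words=12):
    """Walk the chain from <START>, picking each next word at random."""
    word = "<START>"
    output = []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == "<END>":
            break
        output.append(word)
    return " ".join(output)

rng = random.Random(0)  # seeded so repeated runs give the same line
chain = build_chain(sample_lines)
print(generate_line(chain, rng))
```

Repeated followers appear more than once in each list, so frequent transitions are picked more often without any explicit probability table.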

In [None]:
# randomly-generated poem
for _ in range(6):
    print()
    for _ in range(random.randrange(1, 5)):
        print(model.make_short_sentence(40))
    # ensure last line has a period at the end, for closure
    last_line = model.make_short_sentence(40) or "" # None is returned when no sentence could be made
    print(re.sub(r"(\w)[^\w.]?$", r"\1.", last_line))
    print()
    print("～ ❀ ～")