# Webscraping

## Using Selenium and BeautifulSoup

- This notebook details my process for scraping genre-labeled poetry from [PoetryFoundation.org](https://www.poetryfoundation.org/).

#### Important note
Due to the imperfection and idiosyncracies of scraping text from images, a lot of rescraping was necessary, sometimes in a manner that is best described, rather unfortunately, as nonprogrammatic. As a result, this notebook is extremely messy, which is not a reflection on the other notebooks for this project.

Thank you for understanding :)

## Table of contents

1. [Import necessary packages](#Import-necessary-packages)
2. [Initial scrape](#Initial-scrape)

    - [Text poems](#Text-poems)
    - [Scanned poems](#Scanned-poems)
    - [March](#March)
        - [First half](#First-half)
        - [Second half](#Second-half)
    - [April](#April)
    - [May](#May)
3. [Combine DataFrames](#Combine-DataFrames)

    - [Save](#Save)
    
## Import necessary packages

[[go back to the top](#Webscraping)]

In [170]:
# custom functions for webscraping
from functions_webscraping import *

# standard dataframe packages
import pandas as pd
import numpy as np

# # string manipulation libraries
# import re
# from unicodedata import normalize
# from ast import literal_eval

# # webscraping libraries
# import requests as rq
# from bs4 import BeautifulSoup as bs
# from selenium import webdriver

# timekeeping/progress packages
import time
from tqdm import tqdm

# saving packages
import gzip
import pickle

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# increase column width of dataframe
pd.set_option('max_colwidth', 150)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


- **Manually create dictionary with URL codes for each genre.**

## Initial scrape

[[go back to the top](#Webscraping)]

- Load URL codes for each genre.
- Create dictionary of URLs to poets' pages within each genre.
    - *NOTE: The function for this process uses Selenium, which will open dummy browser windows, so to speak.*
- Scrape URLs to poems' pages within each genre, separating into two groups:
    - Poems known to be in text format on the site.
    - Poem suspected to be within scanned images.
- Attempt to scrape each variety of poem.

In [536]:
# dictionary of genre codes found in poetryfoundation.org urls
genre_codes = load_genre_codes()
genre_codes

{'augustan': 149,
 'beat': 150,
 'black_arts_movement': 304,
 'black_mountain': 151,
 'confessional': 152,
 'fugitive': 153,
 'georgian': 154,
 'harlem_renaissance': 155,
 'imagist': 156,
 'language_poetry': 157,
 'middle_english': 158,
 'modern': 159,
 'new_york_school': 160,
 'new_york_school_2nd_generation': 161,
 'objectivist': 162,
 'renaissance': 163,
 'romantic': 164,
 'victorian': 165}

- Run function in a loop to create dictionary of poet urls.

In [None]:
# dictionary creation using custom function
poet_urls = {genre: poet_urls_by_genre(genre_code, 3) for genre, genre_code in genre_codes.items()}

# check a genre
poet_urls['augustan']

- Selenium can be finicky, so the loop only partially worked.
- I'll re-run sections in which some URLs are missing.

In [196]:
# re-run on genre
poet_urls['black_arts_movement'] = poet_urls_by_genre(genre_codes['black_arts_movement'])

In [198]:
# re-run on genre
poet_urls['modern'] = poet_urls_by_genre(genre_codes['modern'])

In [200]:
# re-run on genre
poet_urls['renaissance'] = poet_urls_by_genre(genre_codes['renaissance'])

In [203]:
# re-run on genre
poet_urls['romantic'] = poet_urls_by_genre(genre_codes['romantic'])

In [206]:
# re-run on genre
poet_urls['victorian'] = poet_urls_by_genre(genre_codes['victorian'])

In [207]:
# confirm all urls have been grabbed
url_lens = {k:len(v) for k,v in poet_urls.items()}
url_lens

{'augustan': 23,
 'beat': 13,
 'black_arts_movement': 23,
 'black_mountain': 10,
 'confessional': 7,
 'fugitive': 7,
 'georgian': 22,
 'harlem_renaissance': 17,
 'imagist': 6,
 'language_poetry': 18,
 'middle_english': 3,
 'modern': 54,
 'new_york_school': 9,
 'new_york_school_2nd_generation': 16,
 'objectivist': 5,
 'renaissance': 41,
 'romantic': 51,
 'victorian': 55}

- Ezra Pound and Richard Aldington both appear in two genres: Imagist and Modern.
- Since Modern has so many poets within it, and Imagist so few, I'll give them to the Imagists.

In [541]:
# remove urls that appear in two genres
poet_urls['modern'] = [url for url in poet_urls['modern'] \
                        if url not in \
                        ['https://www.poetryfoundation.org/poets/richard-aldington', 
                         'https://www.poetryfoundation.org/poets/ezra-pound']]

In [544]:
# confirm drop
url_lens = {k:len(v) for k,v in poet_urls.items()}
url_lens['modern']

52

#### üíæ Save/Load poet URLs dictionary

In [545]:
# # uncomment to save
# with gzip.open('data/poet_url_dict.pkl', 'wb') as goodbye:
#     pickle.dump(poet_urls, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# # uncomment to load
# with gzip.open('data/poet_url_dict.pkl', 'rb') as hello:
#     poet_url_dict = pickle.load(hello)

- Scrape poem URLs.

In [7]:
%%time

# loop over keys and values of dictionary
for genre, poet_urls in poet_url_dict.items():
    # scrape poem urls (text and scan poems) for each poet in each genre
    # now each poet's url with be a key and
    # the value will be a tuple of their text poems' urls and their scan poems' urls 
    poet_url_dict[genre] = [{poet_url: poem_url_scraper(poet_url)} for poet_url in poet_urls]

CPU times: user 41.1 s, sys: 699 ms, total: 41.8 s
Wall time: 9min 32s


- Simplify the structure of the dictionary.

In [8]:
# instantiate dictionaries of text and scan urls in each genre
poem_url_dict = {genre:{'text_urls':[],'scan_urls':[]} for genre in poet_url_dict}

# fill in empty lists with each type of url
for genre, poets in poet_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            poem_url_dict[genre]['text_urls'].extend(poems[0])
            poem_url_dict[genre]['scan_urls'].extend(poems[1])

In [12]:
#-------DATA STRUCTURE--------#
#
# genre ==> 'text_urls' ==> list of urls known to be text-based
#    \
#     ==> 'scan_urls' ==> list of urls thought to be scanned images

#### üíæ Save/Load poem URLs dictionary

In [9]:
# # uncomment to save
# with gzip.open('data/poem_url_dict.pkl', 'wb') as goodbye:
#     pickle.dump(poem_url_dict, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# # uncomment to load
# with gzip.open('data/poem_url_dict.pkl', 'rb') as hello:
#     poem_url_dict = pickle.load(hello)

In [10]:
# confirm everything's there
poem_url_dict.keys()

dict_keys(['augustan', 'beat', 'black_arts_movement', 'black_mountain', 'confessional', 'fugitive', 'georgian', 'harlem_renaissance', 'imagist', 'language_poetry', 'middle_english', 'modern', 'new_york_school', 'new_york_school_2nd_generation', 'objectivist', 'renaissance', 'romantic', 'victorian'])

In [13]:
%%time

poem_dicts = []
error_poems = []
for genre in tqdm(poem_url_dict.keys()):
    for text_url in poem_url_dict[genre]['text_urls']:
        poem = text_poem_scraper(text_url)
        poem['genre'] = genre
        poem['poem_url'] = text_url
        poem_dicts.append(poem)
        time.sleep(0.01)
        
    for scan_url in poem_url_dict[genre]['scan_urls']:
        try:
            poem = text_poem_scraper(scan_url)
            poem['genre'] = genre
            poem['poem_url'] = scan_url
            poem_dicts.append(poem)
            poem_url_dict[genre]['text_urls'].append(scan_url)
            poem_url_dict[genre]['scan_urls'].remove(scan_url)
            time.sleep(0.01)
        except:
            try:
                poem = scan_poem_scraper(scan_url)
                poem['genre'] = genre
                poem['poem_url'] = scan_url
                poem_dicts.append(poem)
                time.sleep(0.01)
            except:
                error_poems.append(scan_url)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 18/18 [4:52:18<00:00, 974.35s/it]   

CPU times: user 15min 58s, sys: 1min 56s, total: 17min 54s
Wall time: 4h 52min 18s





In [14]:
len(poem_dicts), len(error_poems)

(4923, 80)

In [15]:
poems = pd.DataFrame(poem_dicts)
poems.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Mary Barber,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Advice to Her Son on Marriage,"[When you gain her Affection, take care to preserve it;, Lest others persuade her, you do not deserve it., Still study to heighten the Joys of her...","When you gain her Affection, take care to preserve it;\nLest others persuade her, you do not deserve it.\nStill study to heighten the Joys of her ...",augustan
1,Susanna Blamire,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Auld Robin Forbes,"[And auld Robin Forbes hes gien tem a dance,, I pat on my speckets to see them aw prance;, I thout o‚Äô the days when I was but fifteen,, And skipp‚Äô...","And auld Robin Forbes hes gien tem a dance,\nI pat on my speckets to see them aw prance;\nI thout o‚Äô the days when I was but fifteen,\nAnd skipp‚Äôd...",augustan
2,Susanna Blamire,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,O Donald! Ye Are Just the Man,"[O Donald! ye are just the man, Who, when he‚Äôs got a wife,, Begins to fratch‚Äî nae notice ta‚Äôen‚Äî, They‚Äôre strangers a‚Äô their life., The fan may dro...","O Donald! ye are just the man\nWho, when he‚Äôs got a wife,\nBegins to fratch‚Äî nae notice ta‚Äôen‚Äî\nThey‚Äôre strangers a‚Äô their life.\nThe fan may drop...",augustan
3,Susanna Blamire,https://www.poetryfoundation.org/poems/50532/the-siller-croun,The Siller Croun,"[And ye shall walk in silk attire,, And siller hae to spare,, Gin ye‚Äôll consent to be his bride,, Nor think o‚Äô Donald mair., O wha wad buy a silke...","And ye shall walk in silk attire,\nAnd siller hae to spare,\nGin ye‚Äôll consent to be his bride,\nNor think o‚Äô Donald mair.\nO wha wad buy a silken...",augustan
4,Henry Carey,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,The Ballad of Sally in our Alley,"[Of all the Girls that are so smart, There‚Äôs none like pretty SALLY,, She is the Darling of my Heart,, And she lives in our Alley., There is no La...","Of all the Girls that are so smart\nThere‚Äôs none like pretty SALLY,\nShe is the Darling of my Heart,\nAnd she lives in our Alley.\nThere is no Lad...",augustan


In [16]:
poems.to_csv('data/poems_v2.csv')

In [150]:
poems = pd.read_csv('data/poems_v2.csv', index_col=0)

In [17]:
%%time

text_poems = []
for genre in tqdm(poem_url_dict.keys()):
    for text_url in poem_url_dict[genre]['text_urls']:
        poem = text_poem_scraper(text_url)
        poem['genre'] = genre
        poem['poem_url'] = text_url
        text_poems.append(poem)
        time.sleep(0.01)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 18/18 [1:26:33<00:00, 288.51s/it]

CPU times: user 5min 46s, sys: 38.3 s, total: 6min 25s
Wall time: 1h 26min 33s





In [18]:
text_poems_df = pd.DataFrame(text_poems)
text_poems_df.shape

(3261, 6)

In [19]:
text_poems_df[text_poems_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
154,Allen Ginsberg,https://www.poetryfoundation.org/poems/47660/a-supermarket-in-california,A Supermarket in California,[],,beat
166,Bob Kaufman,https://www.poetryfoundation.org/poems/55713/a-terror-is-more-certain-,A Terror is More Certain . . .,[],,beat
210,Lawrence Ferlinghetti,https://www.poetryfoundation.org/poetrymagazine/poems/58150/beatitudes-visuales-mexicanas,Beatitudes Visuales Mexicanas,[],,beat
268,Henry Dumas,https://www.poetryfoundation.org/poems/53477/kef-21,Kef 21,[],,black_arts_movement
288,Nikki Giovanni,https://www.poetryfoundation.org/poems/90181/no-complaints,No Complaints,[],,black_arts_movement
290,Nikki Giovanni,https://www.poetryfoundation.org/poems/90180/rosa-parks,Rosa Parks,[],,black_arts_movement
298,Etheridge Knight,https://www.poetryfoundation.org/poems/51371/a-fable-56d22f0fa5920,A Fable,[],,black_arts_movement
401,Robert Duncan,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,A Poem Beginning with a Line by Pindar,[],,black_mountain
505,Anne Sexton,https://www.poetryfoundation.org/poems/152252/o-ye-tongues,O Ye Tongues,[],,confessional
683,W. E. B. Du Bois,https://www.poetryfoundation.org/poems/43026/my-country-tis-of-thee,My Country ‚ÄôTis of Thee,[],,harlem_renaissance


In [20]:
for index in text_poems_df[text_poems_df.poem_string == ''].index:
    try:
        text_poems_df.loc[index,'poem_lines'] = PoemView_rescraper(text_poems_df.loc[index,'poem_url'])[0]
        text_poems_df.loc[index,'poem_string'] = PoemView_rescraper(text_poems_df.loc[index,'poem_url'])[1]
    except:
        print(index)

In [21]:
text_poems_df[text_poems_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1388,Dylan Thomas,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Poem on His Birthday [Facs. drafts],[],,modern
1540,Barbara Guest,https://www.poetryfoundation.org/poems/49367/imagined-room,Imagined Room,[],,new_york_school


In [22]:
text_df = text_poems_df[text_poems_df.poem_string != '']
text_df.shape

(3259, 6)

In [23]:
text_df.to_csv('data/text_poems_df.csv')

In [203]:
%%time

scan_poem_dicts = []
need_to_rescrape = []
for genre in tqdm(poem_url_dict.keys()):
    for scan_url in poem_url_dict[genre]['scan_urls']:
        try:
            poem = scan_poem_scraper(scan_url)
            poem['genre'] = genre
            poem['poem_url'] = scan_url
            scan_poem_dicts.append(poem)
        except:
            need_to_rescrape.append(scan_url)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 18/18 [3:25:25<00:00, 684.77s/it]   

CPU times: user 7min 49s, sys: 5min 10s, total: 12min 59s
Wall time: 3h 25min 25s





In [204]:
len(scan_poem_dicts), len(need_to_rescrape)

(1775, 161)

In [205]:
scan_poem_df = pd.DataFrame(scan_poem_dicts)
scan_poem_df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/31338/wood,Wood,"[We age in darkness like wood, and watch our phantoms change, eir clothes, of shingles and boards, for a purpose that can only be, described as wo...",We age in darkness like wood\nand watch our phantoms change\neir clothes\nof shingles and boards\nfor a purpose that can only be\ndescribed as wood.,beat
1,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21676/dust-and-the-glory,Dust And The Glory,"[On a low Lorrainian knoll a leaning peasant sinking a pit, Meets rotted rock and a slab., The slab cracks and is split, the old grave opened,, Hi...","On a low Lorrainian knoll a leaning peasant sinking a pit\nMeets rotted rock and a slab.\nThe slab cracks and is split, the old grave opened,\nHis...",beat
2,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21675/we-in-the-fields,We In The Fields,"[Dawn and a high film, the sun burned it,, But noon had a thick sheet, and the clouds coming,, The low rain-bringers, trooping in from the north,,...","Dawn and a high film, the sun burned it,\nBut noon had a thick sheet, and the clouds coming,\nThe low rain-bringers, trooping in from the north,\n...",beat
3,Allen Ginsberg,https://www.poetryfoundation.org/poetrymagazine/poems/36505/written-in-my-dream-by-w-c-williams,Written In My Dream By W C Williams,"[‚ÄúAs Is, you're bearing, a common, Truth, Commonly known, as desire, No need, to dress, it up, as beauty, No need, to distort, what‚Äôs not, standar...",‚ÄúAs Is\nyou're bearing\na common\nTruth\nCommonly known\nas desire\nNo need\nto dress\nit up\nas beauty\nNo need\nto distort\nwhat‚Äôs not\nstandard...,beat
4,Jack Hirschman,https://www.poetryfoundation.org/poetrymagazine/poems/30162/the-baseball-poem,The Baseball Poem,"[A wrist (to repeat, with a shift, of ac-, cent, mood, of emphasis, attentive to) now, needed, The wrist I lost, hold of, of, what was most, loved...","A wrist (to repeat\nwith a shift\nof ac-\ncent, mood, of emphasis\nattentive to) now\nneeded\nThe wrist I lost\nhold of, of\nwhat was most\nloved ...",beat


In [522]:
scan_poem_df.to_csv('data/scan_poems_df.csv')

In [262]:
scan_poem_df.iloc[35:45]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
35,Kenneth Rexroth,https://www.poetryfoundation.org/poetrymagazine/poems/27134/a-dialogue-of-watching,A Dialogue Of Watching,"[Let me celebrate you. I, Have never known anyone, More beautiful than you. I,, Walking beside you, watching, You move beside me, watching, That s...","Let me celebrate you. I\nHave never known anyone\nMore beautiful than you. I,\nWalking beside you, watching\nYou move beside me, watching\nThat st...",beat
36,Kenneth Rexroth,https://www.poetryfoundation.org/poetrymagazine/poems/31055/the-spark-in-the-tinder-of-knowing,The Spark In The Tinder Of Knowing,"[Profound stillness in the greystone, Romanesque chapel, the rus, Of wheels beyond the door only, Underlines the silence. The wheels, Of life turn...","Profound stillness in the greystone\nRomanesque chapel, the rus\nOf wheels beyond the door only\nUnderlines the silence. The wheels\nOf life turn ...",beat
37,Kenneth Rexroth,https://www.poetryfoundation.org/poetrymagazine/poems/27131/marthe-away,Marthe Away,"[All night I lay awake beside you,, Leaning on my elbow, watching your, Sleeping face, that face whose purity, Never ceases to astonish me., I cou...","All night I lay awake beside you,\nLeaning on my elbow, watching your\nSleeping face, that face whose purity\nNever ceases to astonish me.\nI coul...",beat
38,Kenneth Rexroth,https://www.poetryfoundation.org/poetrymagazine/poems/27132/marthe-lonely,Marthe Lonely,"[To think of you surcharged with, Loneliness. To hear your voice, Over the recorder say,, ‚ÄúLoneliness.‚Äù The word, the voice,, So full of it, and I...","To think of you surcharged with\nLoneliness. To hear your voice\nOver the recorder say,\n‚ÄúLoneliness.‚Äù The word, the voice,\nSo full of it, and I,...",beat
39,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/32642/the-story-of-richard-maxfield,The Story Of Richard Maxfield,"[He jumped out of a window., Or did he shoot himself?, Was there a gun?, Or was it pills?, Did anyone see blood?, Was he holding water in his lung...",He jumped out of a window.\nOr did he shoot himself?\nWas there a gun?\nOr was it pills?\nDid anyone see blood?\nWas he holding water in his lungs...,beat
40,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/28674/apparitions-are-not-singular-occurrences,Apparitions Are Not Singular Occurrences,"[When I rode the zebra past your door,, wearing nothing but my diamonds, I expected to hear bells, and see your face behind the thin curtain:, But...","When I rode the zebra past your door,\nwearing nothing but my diamonds, I expected to hear bells\nand see your face behind the thin curtain:\nBut ...",beat
41,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/33706/the-ring-56d21729d7aa4,The Ring,"[I carry it on my keychain, which itself, is a big brass ring, large enough for my wrist,, holding keys for safe deposit box,, friends‚Äô apartments...","I carry it on my keychain, which itself\nis a big brass ring\nlarge enough for my wrist,\nholding keys for safe deposit box,\nfriends‚Äô apartments,...",beat
42,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/33707/tearing-up-my-mothers-letters,Tearing Up My Mothers Letters,"[The rain of summer thunders down past the sweet peas, trailing up the staves, of my balcony,, and I,, just returned from a journey,, am sitting a...","The rain of summer thunders down past the sweet peas\ntrailing up the staves\nof my balcony,\nand I,\njust returned from a journey,\nam sitting am...",beat
43,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/28675/picture-of-a-girl-drawn-in-black-and-white,Picture Of A Girl Drawn In Black And White,"[A girl sits in a black room., She is so, the plums have fallen off the trees outside., Icy winds blow geese, into her hair., The room is black, b...",A girl sits in a black room.\nShe is so\nthe plums have fallen off the trees outside.\nIcy winds blow geese\ninto her hair.\nThe room is black\nbu...,beat
44,Diane Wakoski,https://www.poetryfoundation.org/poetrymagazine/poems/31844/smudging,Smudging,"[Smudging is the term used for lighting small oil fires in the orange, groves at night when the temperatures are too low, to keep the leaves, and ...","Smudging is the term used for lighting small oil fires in the orange\ngroves at night when the temperatures are too low, to keep the leaves\nand f...",beat


## Individual rescrapes pt. I

In [263]:
scan_poem_df[scan_poem_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
6,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
23,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,[],,beat
606,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot,Some Simple Measures In The American Idiom And The Variable Foot,[],,imagist
723,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25655/toward-the-south-tr-by-harry-duncan,Toward The South Tr By Harry Duncan,[],,modern
775,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968,[],,modern
778,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19926/the-urn-enrich-my-resignation,The Urn Enrich My Resignation,[],,modern
779,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn Purgatorio,[],,modern
780,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,[],,modern
782,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn The Sad Indian,[],,modern
1170,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/22310/poem-after-the-wrestling,Poem After The Wrestling,[],,modern


In [71]:
rescrapes = []

In [72]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke'
rescrape = scan_poem_scraper(url, input_poet='Michael McClure', input_title='2 For Theodore Roethke: Premonition')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [73]:
url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=87&issue=4&page=28'
rescrape = scan_poem_scraper(url, input_poet='Michael McClure', input_title='2 For Theodore Roethke: 2')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [74]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968'
rescrape = scan_poem_scraper(url, input_poet='Malcolm Cowley', input_title='A Countryside 1918 1968: Boy in Sunlight')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [75]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot'
rescrape = scan_poem_scraper(url, 
                             input_poet='William Carlos Williams',
                             input_title='Some Simple Measures In The American Idiom And The Variable Foot',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!COMMENT).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
rescrapes.append(rescrape)

In [76]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes'
rescrape = scan_poem_scraper(url, 
                             input_poet='Kenneth Patchen',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!comment).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [79]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19915/the-urn-reliquary'
rescrape = scan_poem_scraper(url, input_poet='Hart Crane')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [81]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=2'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Purgatorio')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [83]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=6'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: The Sad Indian')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [85]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=7'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Reply')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [87]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=10'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Enrich My Resignation')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [94]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/31123/places-for-oscar-salvador'
rescrape = scan_poem_scraper(url, 
                             input_poet="Frank O'Hara",
                             input_title='Places for Oscar Salvador',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!SUDDEN SNOW).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school'
rescrapes.append(rescrape)

In [523]:
rescrapes_pt1 = pd.DataFrame(rescrapes)
rescrapes_pt1

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke: Premonition,"[My bones ascend by arsenics of sight., Where noise is all the sound there is to hear,, Beginning in the heart I work towards light., My toes are ...","My bones ascend by arsenics of sight.\nWhere noise is all the sound there is to hear,\nBeginning in the heart I work towards light.\nMy toes are c...",beat
1,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/browse?volume=87&issue=4&page=28,2 For Theodore Roethke: 2,"[This copse is earth‚Äôs cockade, this corpse my drum, To beat upon and play the mole a dance;, These hands are my defeat, these eyes my thumb., Opp...","This copse is earth‚Äôs cockade, this corpse my drum\nTo beat upon and play the mole a dance;\nThese hands are my defeat, these eyes my thumb.\nOppo...",beat
2,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968: Boy in Sunlight,"[The boy having fished alone, down Empfield Run from where it started on stony ground,, in oak and chestnut timber,, then crossed the Nicktown Roa...","The boy having fished alone\ndown Empfield Run from where it started on stony ground,\nin oak and chestnut timber,\nthen crossed the Nicktown Road...",modern
3,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot,Some Simple Measures In The American Idiom And The Variable Foot,"[EXERCISE IN TIMING, Oh, the sumac died, it‚Äôs, the first time, I, noticed it, HISTOLOGY, There is, the, microscopic, anatomy, of, the whale, this ...",EXERCISE IN TIMING\nOh\nthe sumac died\nit‚Äôs\nthe first time\nI\nnoticed it\nHISTOLOGY\nThere is\nthe\nmicroscopic\nanatomy\nof\nthe whale\nthis i...,imagist
4,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,"[XVI, No sooner had the clowns got a new house built,, a worse wind than the first blew it down. And it also, re-blew down the old house which the...","XVI\nNo sooner had the clowns got a new house built,\na worse wind than the first blew it down. And it also\nre-blew down the old house which they...",beat
5,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19915/the-urn-reliquary,The Urn Reliquary,"[ENDERNESS and resolution!, What is our life without a sudden pillow,, What is death without a ditch?, The harvest laugh of bright Apollo, And the...","ENDERNESS and resolution!\nWhat is our life without a sudden pillow,\nWhat is death without a ditch?\nThe harvest laugh of bright Apollo\nAnd the ...",modern
6,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn: Purgatorio,"[My country, O my land, my friends‚Äî, Am I apart‚Äîhere from you in a land, Where all your gas-lights, faces, sputum gleam, Like something left, fors...","My country, O my land, my friends‚Äî\nAm I apart‚Äîhere from you in a land\nWhere all your gas-lights, faces, sputum gleam\nLike something left, forsa...",modern
7,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn: The Sad Indian,"[Sad heart, the gymnast of inertia, does not count, Hours, days‚Äîand scarcely sun and moon., The warp is in his woof, and his keen vision, Spells w...","Sad heart, the gymnast of inertia, does not count\nHours, days‚Äîand scarcely sun and moon.\nThe warp is in his woof, and his keen vision\nSpells wh...",modern
8,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn: Reply,"[Thou canst read nothing except through appetite,, And here we join eyes in that sanctity, Where brother passes brother without sight,, But finall...","Thou canst read nothing except through appetite,\nAnd here we join eyes in that sanctity\nWhere brother passes brother without sight,\nBut finally...",modern
9,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn: Enrich My Resignation,"[Enrich my resignation as I usurp those far, Feints of control, hear rifles blown out on the stag, Below the aeroplane, and see the fox‚Äôs brush, W...","Enrich my resignation as I usurp those far\nFeints of control, hear rifles blown out on the stag\nBelow the aeroplane, and see the fox‚Äôs brush\nWh...",modern


In [265]:
rescrapes_pt1.to_csv('data/temp_rescrapes_pt1.csv')

## Individual rescrapes pt. II

In [264]:
need_to_rescrape

['https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29577/valery-as-dictator',
 'https://www.poetryfoundation.org/poetrymagazine/poems/146231/haiku-and-tanka-for-harriet-tubman',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30270/ritual-ix',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note',
 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law',
 'https://www.poetryfoundation.org/poetrym

In [266]:
error_rescrapes = []
still_errors = need_to_rescrape.copy()

In [299]:
%%time

for url in tqdm(still_errors):
    try:
        rescrape = text_poem_scraper(url)
        error_rescrapes.append(rescrape)
        still_errors.remove(url)
    except:
        continue

 72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 106/148 [07:49<03:05,  4.43s/it] 

CPU times: user 55.4 s, sys: 1min 40s, total: 2min 35s
Wall time: 7min 49s





In [270]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=18'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: We Shall Be Free')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [271]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: When Spirit Has No Edge')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [272]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=5&page=6'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Enough: Left After That')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [273]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=3&page=19'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Walking: In My Head')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [274]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=13&issue=6&page=13'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Epitaph')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [275]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28310/elaine'
rescrape = scan_poem_scraper(url, input_poet='William Carlos Williams', input_title='Elainb')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [276]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28312/emily'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=3'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Emily')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [277]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28311/erica'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=2'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Erica')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [278]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/18899/poem-as-the-cat'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=36&issue=4&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Poem: As the cat')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [279]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
rescrape = scan_poem_scraper(url, input_poet='Conrad Aiken', input_title='Discordants')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [280]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=6&issue=6&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='Conrad Aiken', input_title='Discordants IV')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)

In [283]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25512/jacks-white-horseup'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="jack's white")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [290]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25263/imc-a-tmo'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='E. E. Cummings')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrape['title'] = 'Untitled [5]'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [298]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29577/valery-as-dictator'
rescrape = text_poem_scraper(url)
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [308]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30270/ritual-ix'
rescrape = scan_poem_scraper(url, input_poet='Paul Blackburn', input_title='Ritual IX: Gathering Winter Fuel')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [305]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=108&issue=1&page=42',
                  input_poet='Paul Blackburn', input_title='the same barrels')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['the same barrels',
 '& cans & older men in long',
 'overcoats from the mission,',
 '& here, the scene unabated, 20-odd years later',
 'the fruit & vegetable market, First Ave. & Ninth, using',
 'wood from crates',
 'New Jersey, Delaware, Cali-',
 'for-ni-yay,',
 'Florida, New Mexico, Georgia, Louisiana, Texas, all',
 'e same fire, how',
 'reunite the South & North, the West & East',
 'IN SUNLIGHT YOU NEVER SEE Ir, ry just walking by &',
 'feel the warmth e.',
 'Fire in a barrel, burning',
 'the hands,',
 'the hands / the italian',
 'bakery next door is still discreet,',
 'but the kosher butcher shop next to',
 'that comes out for a word or two, the',
 'gesture',
 'palms stiff out at arms‚Äô length, passing',
 'the time of day, their magic hands',
 'reddened & liverspotted maybe,',
 'no peyis or beard, sti',
 'here at First Ave. & Ninth St. it‚Äôs',
 'the jews uniting the world, the country, the city,',
 'mankind down geological time perhaps,',
 'to keep their hands warm']

In [312]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks'
rescrape = scan_poem_scraper(url, 
                             input_poet='Edward Dorn', 
                             input_title='The Sundering U.P. Tracks: The End of the North Atlantic Turbine Poem')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [310]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=8',
                  input_poet='Edward Dorn', input_title='"Compared to the majestic legal thievery')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['"Compared to the majestic legal thievery',
 'of Commodore Vanderbilt men like Jay G Gould',
 'and Jim Fisk were second-story workers .',
 'Each side of the shining double knife',
 'from Chicago to Fri',
 'to Denver, the Cheyenne cutoff',
 'the Right of Way they called it',
 'and still it runs that way',
 'right through the heart',
 'the Union Pacific rails run also to Portland.',
 'Even through the heart of the blue beech',
 'hard as it is.',
 'each hamlet',
 'the winter sanctuar',
 'of the rare Jailbird',
 'and the Ishmaelite',
 'the esoteric summer firebombs',
 'of Chicago',
 'the same scar tissue',
 'I saw in Pocatello',
 'made',
 'by the rapacious geo-economic',
 'surgery of Harriman, the old isolator',
 'that ambassador-at-large',
 'You talk of color?',
 'Ob cosmological america, how well',
 'and with what geometry',
 'you teach your citizens']

In [315]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=9'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Edward Dorn', 
                             input_title='The First Note: From London')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [320]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=107&issue=5&page=46'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Creeley', 
                             input_title='Song')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [330]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
rescrape = scan_poem_scraper(url, 
                             input_poet='Robert Duncan', 
                             input_title='The Law: A Series in Variation')

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [331]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=33'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: Song's Fateful Crime")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [332]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=34'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: Cursed be he that")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [333]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=35'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: No! Took an Other way as its law")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [337]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=90&issue=6&page=20'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="Poem")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [340]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55733/what-next'))

In [342]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/41677/pacemaker'))

In [344]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/13075/the-dead'))

In [352]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55313/god-56d236c65624c'))

In [354]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55537/and'))

In [357]:
remove_list = [
    'https://www.poetryfoundation.org/poetrymagazine/poems/55733/what-next',
    'https://www.poetryfoundation.org/poetrymagazine/poems/41677/pacemaker',
    'https://www.poetryfoundation.org/poetrymagazine/poems/13075/the-dead',
    'https://www.poetryfoundation.org/poetrymagazine/poems/55313/god-56d236c65624c',
    'https://www.poetryfoundation.org/poetrymagazine/poems/55537/and'
]

for item in remove_list:
    still_errors.remove(item)

In [356]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/54278/accounts'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [359]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/150946/elsewhere-5d70274a8beed'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [363]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/53487/paragraph'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [364]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/89349/object-permanence'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [365]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/92669/natural-histories'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [370]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/144605/my-house-59df845fd8d77'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [373]:
%%time

for url in tqdm(still_errors[20:]):
    try:
        rescrape = text_poem_scraper(url)
        error_rescrapes.append(rescrape)
        still_errors.remove(url)
    except:
        continue

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 54/54 [00:52<00:00,  1.02it/s]

CPU times: user 4.2 s, sys: 248 ms, total: 4.45 s
Wall time: 52.8 s





In [383]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28835/pity-his-how-illimitable-plight'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=9'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title='pity his how illimitable plight')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_lines'].extend(temp_rescrape_lines2)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [379]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=10',
                  input_poet='E. E. Cummings', input_title='without the mercy of')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['without the mercy of',
 'your eyes your',
 'voice your',
 'ways (o very most my shining love)',
 'how more than dark i am,',
 'no song (no',
 'thing) no',
 'silence ever told; it has no name‚Äî',
 'but should this namelessness',
 '(completely',
 'fleetly',
 'vanish, at the infinite precise',
 'thrill of your beauty, then',
 'my lost my',
 'my',
 'whereful selves they put on here again',
 '‚Äîto livingest one star',
 'as small these',
 'all these',
 'thankful (hark) birds singing wholly are']

In [381]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=11',
                  input_poet='E. E. Cummings', input_title='annie died the other day')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])
temp_rescrape_lines2

['annie died the other day',
 'never was there such a lay‚Äî',
 'whom, among her dollies, dad',
 'first (‚Äúdon‚Äôt tell your mother‚Äù) had;',
 'making annie slightly mad',
 'but very wonderful in bed',
 '‚Äî-saints and satyrs, go your way',
 'youths and maidens: let us pray']

In [392]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/22223/six'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=53&issue=4&page=5'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title='six')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [396]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25960/springmay0151'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='spring! may')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [399]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25513/-oroundmoonhow'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='o(rounD)moon, how')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [403]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28566/why-dont-be'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="why don't be sil ly o no in deed; money")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)
# rescrape

In [405]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/24807/thislets-rememberday'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="this(let's remember)day died again and")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [409]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28834/for-any-ruffian-of-the-sky'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=8'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title="for any ruffian of the sky")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [410]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28833/if-seventy-were-young'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=7'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title="if seventy were young")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [412]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/26048/rosetreerosetree'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="rosetree, rosetree")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [415]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/15533/lenvoi'
rescrape = scan_poem_scraper(url, input_poet='Marion Strobel', input_title="envoi")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [421]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/17874/discus-thrower'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=31&issue=5&page=3'
rescrape = scan_poem_scraper(actual_url, input_poet='Marion Strobel', input_title="Discus-Thrower",
                            next_pattern='\n((?:\r?\n(?!SURF-BOARDING).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [423]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/33672/poem-green-things-are-flowers'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=130&issue=2&page=10'
rescrape = scan_poem_scraper(actual_url, input_poet="Frank O'Hara", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [426]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30246/poem-to-simply-talk'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=107&issue=6&page=10'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [428]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30549/poem-like-musical-instruments'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=14'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [429]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30775/sonnet-five-am-on-east-fourteenth'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=8'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Sonnet")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [434]:
# i've truly lost my mind
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30723/jungle-56d21432117a5'
rescrape = scan_poem_scraper(url, input_poet="Aram Saroyan", input_title="saroyan")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
rescrape['poem_lines'] = ['j;u;n;g;l;e']
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
rescrape['title'] = 'Untitled'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [439]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30140/spring-stood-there'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=29'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Spring")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [440]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30139/march-56d213a2b802b'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=28'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="March")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [443]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29475/now-in-one-year'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=5&page=27'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Now in one year")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [446]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30141/the-park-a-darling-walk'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=30'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="The park")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [449]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30138/consider-at-the-outset'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=28'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Consider at the outset")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [450]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30790/smile-to-see-the-lake'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Smile")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [455]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29745/giovannis-rape-of-the-sabine-women-at-wildensteins'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?contentId=29745'
rescrape = scan_poem_scraper(actual_url, input_poet='George Oppen', input_title="Giovanni's Rape of the Sabine Women at Wildenstein")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [457]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19125/1930s'
rescrape = scan_poem_scraper(url, input_poet='George Oppen', input_title="1930")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [461]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19940/along-the-flat-roofs'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Charles Reznikoff', input_title="Along the flat roofs beneath our window")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [467]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/31024/a-21-rudens-act-iii'
rescrape = scan_poem_scraper(url, input_poet='Louis Zukofsky', input_title='"A"21: Rudens, Dads')
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [468]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/32643/a-22-an-era-any-time'
rescrape = scan_poem_scraper(url, input_poet='Louis Zukofsky', input_title='"A"22: An Era Any Time Of Year')
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [469]:
rescrape

{'poet': 'Louis Zukofsky',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/32643/a-22-an-era-any-time',
 'title': '"A"22: An Era Any Time Of Year',
 'poem_lines': ['Others letters a sum owed',
  'ages account years each year',
  'out of old fields, permute',
  'blow blue up against yellow',
  '‚Äîscapes welcome young birds‚Äîinitial',
  'transmutes itself, swim near and',
  'read a weed‚Äôs reward‚Äîgrain',
  'an omen a good omen',
  'the chill mists greet woods',
  'ice, flowers‚Äîtheir soul‚Äôs return',
  'let me live here ever,',
  'sweet now, silence foison to',
  'on top of the weather',
  'it has said it before',
  'why that was you that',
  'is how you weather division',
  'a peacocks grammar perching‚Äîand',
  'perhaps think that they see',
  'or they fly thru a',
  'window not knowing it there'],
 'poem_string': 'Others letters a sum owed\nages account years each year\nout of old fields, permute\nblow blue up against yellow\n‚Äîscapes welcome young birds‚Äî

In [435]:
scan_poem_df[scan_poem_df.poet == 'Lorine Niedecker']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1694,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/30789/fall-56d21442f2ddc,Fall,"[are at their highest thoughts, of leaving, Middle life said nothing‚Äî, grounde, to a livelihood, Old age‚Äîa high gabbling gathering, before goodbye...",are at their highest thoughts\nof leaving\nMiddle life said nothing‚Äî\ngrounde\nto a livelihood\nOld age‚Äîa high gabbling gathering\nbefore goodbye\...,objectivist
1695,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/29472/three-poems-56d21304a5f2d,Three Poems,"[I, River-marsh-drowse, and in flood, of no land., They fish, a man, takes his wife to town, with his rowboat‚Äôs 1o-horse, ships his voice, to the ...","I\nRiver-marsh-drowse\nand in flood\nof no land.\nThey fish, a man\ntakes his wife to town\nwith his rowboat‚Äôs 1o-horse\nships his voice\nto the h...",objectivist
1696,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/30787/young-in-fall-i-said,Young In Fall I Said,"[are at their highest thoughts, of leaving, Middle life said nothing‚Äî, grounde, to a livelihood, Old age‚Äîa high gabbling gathering, before goodbye...",are at their highest thoughts\nof leaving\nMiddle life said nothing‚Äî\ngrounde\nto a livelihood\nOld age‚Äîa high gabbling gathering\nbefore goodbye\...,objectivist
1697,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/20187/promise-of-brilliant-funeral,Promise Of Brilliant Funeral,"[Travel, said he of the broken umbrella, enervates, the point of stop; once indoors, theology,, for want of a longer telescope, is made, of the mo...","Travel, said he of the broken umbrella, enervates\nthe point of stop; once indoors, theology,\nfor want of a longer telescope, is made\nof the moo...",objectivist
1698,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/30136/five-poems-56d213a21e011,Five Poems,"[I, To my pres-, sure pump, I‚Äôve been free, with less, and clean, I plumbed for principles, Now I'm jet-bound, by faucet shower, heater valve, rin...",I\nTo my pres-\nsure pump\nI‚Äôve been free\nwith less\nand clean\nI plumbed for principles\nNow I'm jet-bound\nby faucet shower\nheater valve\nring...,objectivist
1699,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/30788/we-are-what-the-seas,We Are What The Seas,"[have made us, longingly immense, the very veery, on the fence, Fall, We must pull, the curtains‚Äî, we haven‚Äôt any, leaves, Smile, to see the lake,...",have made us\nlongingly immense\nthe very veery\non the fence\nFall\nWe must pull\nthe curtains‚Äî\nwe haven‚Äôt any\nleaves\nSmile\nto see the lake\n...,objectivist
1700,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/20188/when-ecstasy-is-inconvenient,When Ecstasy Is Inconvenient,"[Feign a great calm;, all gay transport soon ends., Chant: who knows‚Äî, flight‚Äôs end or flight‚Äôs beginning, for the resting gull?, Heart, be still....","Feign a great calm;\nall gay transport soon ends.\nChant: who knows‚Äî\nflight‚Äôs end or flight‚Äôs beginning\nfor the resting gull?\nHeart, be still.\...",objectivist
1701,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/30137/to-my-pres-sure-pump,To My Pres Sure Pump,"[I‚Äôve been free, with less, and clean, I plumbed for principles, Now I'm jet-bound, by faucet shower, heater valve, ring seal service, cost to my ...",I‚Äôve been free\nwith less\nand clean\nI plumbed for principles\nNow I'm jet-bound\nby faucet shower\nheater valve\nring seal service\ncost to my l...,objectivist
1702,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/29473/river-marsh-drowse,River Marsh Drowse,"[and in flood, of no land., They fish, a man, takes his wife to town, with his rowboat‚Äôs 1o-horse, ships his voice, to the herons., Sure they drin...","and in flood\nof no land.\nThey fish, a man\ntakes his wife to town\nwith his rowboat‚Äôs 1o-horse\nships his voice\nto the herons.\nSure they drink...",objectivist
1703,Lorine Niedecker,https://www.poetryfoundation.org/poetrymagazine/poems/29474/prosperity-is-poverty,Prosperity Is Poverty,"[I‚Äôve foreclosed., I own again, these walls thin, as the back, of my writing tablet.]",I‚Äôve foreclosed.\nI own again\nthese walls thin\nas the back\nof my writing tablet.,objectivist


In [442]:
scan_poem_df.loc[1695, 'poem_string']

'I\nRiver-marsh-drowse\nand in flood\nof no land.\nThey fish, a man\ntakes his wife to town\nwith his rowboat‚Äôs 1o-horse\nships his voice\nto the herons.\nSure they drink\n‚Äîfull foamy folk‚Äî\ntill asleep.\nThe place is asleep\non one leg in the weeds.\nProsperity is poverty‚Äî\nI‚Äôve foreclosed.\nI own again\nthese walls thin\nas the back\nof my writing tablet.'

In [350]:
scan_poem_df[scan_poem_df.title == 'God']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre


In [424]:
still_errors

['https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0',
 'https://www.poetryfoundation.org/poetrymagazine/poems/19645/persons-seen',
 'https://www.poetryfoundation.org/poetrymagazine/poems/32401/sonnets-of-the-blood',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14312/calligraphy-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14296/on-seeing-the-portrait-of-a-beautiful-concubine-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14316/on-the-classic-of-the-hills-and-sea-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14321/the-inn-at-the-western-lake-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14313/one-goes-a-journey-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14315/the-palace-blossoms-tr-by-amy-lowel

In [None]:
scan_poem_scraper(url, 
                             input_poet="Frank O'Hara",
                             input_title='Places for Oscar Salvador',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!SUDDEN SNOW).*)*)')

In [401]:
text = pytesseract.image_to_string('data/temp.png')
text

"E. E. CUMMINGS\n\nmoney\ncan‚Äôt do(never\ndid &\n\nnever will)any\n\ncfar\n\nfrom it;you\nre wrong,my friend. But\nwhat does\n\n0,\nhas always done\n\nwill do alw\n\n-ays something\n\nis(guess)yes\n‚Äòou're\n\nright:my enemy\n\n. Love"

In [None]:
https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=9

In [470]:
rescrapes_pt2 = pd.DataFrame(error_rescrapes)
rescrapes_pt2

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free,Mad Sonnet: We Shall Be Free,"[IN THE WORLD OF DESTINY, when Heaven and Hell are a dream, left draped like blue silk upon a useless chair., But now we have come together by ato...",IN THE WORLD OF DESTINY\nwhen Heaven and Hell are a dream\nleft draped like blue silk upon a useless chair.\nBut now we have come together by atom...,beat
1,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge,Mad Sonnet: When Spirit Has No Edge,"[the Human frame. Men swell to blindness, without pain and are stupefied. Smooth fingertips, receive no pleasure. They become what they call Soul,...",the Human frame. Men swell to blindness\nwithout pain and are stupefied. Smooth fingertips\nreceive no pleasure. They become what they call Soul\n...,beat
2,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another,Enough: Left After That,"[not to my own mind,, but stayed, and stayed. Years, went by. What, were they. Days‚Äî, some happy,, but some bitter, and sad. If I walked, across t...","not to my own mind,\nbut stayed\nand stayed. Years\nwent by. What\nwere they. Days‚Äî\nsome happy,\nbut some bitter\nand sad. If I walked\nacross th...",black_mountain
3,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892,Walking: In My Head,"[is there to walk,, not thought of, is, the road itself more, than seen. I think, it might be, feel, as my feet do, and, continue, and, at last re...","is there to walk,\nnot thought of, is\nthe road itself more\nthan seen. I think\nit might be, feel\nas my feet do, and\ncontinue, and\nat last rea...",black_mountain
4,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow,Epitaph,"[An old willow with hollow branches, Slowly swayed his few high bright tendrils, And sang:, ‚ÄúLove is a young green willow, Shimmering at the bare ...",An old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n‚ÄúLove is a young green willow\nShimmering at the bare w...,imagist
...,...,...,...,...,...,...
122,George Oppen,https://www.poetryfoundation.org/poetrymagazine/poems/29745/giovannis-rape-of-the-sabine-women-at-wildensteins,Giovanni's Rape of the Sabine Women at Wildenstein,"[Showing the girl, On the shoulder of the warrior, calling, Behind her in the young body‚Äôs triumph, With its slight, despairing arms aloft, And th...","Showing the girl\nOn the shoulder of the warrior, calling\nBehind her in the young body‚Äôs triumph\nWith its slight, despairing arms aloft\nAnd the...",objectivist
123,George Oppen,https://www.poetryfoundation.org/poetrymagazine/poems/19125/1930s,1930,"[Thus, Hides the, Parts‚Äîthe prudery, Of Frigidaire, of, Soda-jerking‚Äî, Thus, Above the, Plane of lunch, of wives,, Removes itself, (As soda-jerkin...","Thus\nHides the\nParts‚Äîthe prudery\nOf Frigidaire, of\nSoda-jerking‚Äî\nThus\nAbove the\nPlane of lunch, of wives,\nRemoves itself\n(As soda-jerking...",objectivist
124,Charles Reznikoff,https://www.poetryfoundation.org/poetrymagazine/poems/19940/along-the-flat-roofs,Along the flat roofs beneath our window,"[in the morning sunshine, I read the signature of last night‚Äôs rain., v, The squads, platoons, and regiments, of lighted windows,, ephemeral under...","in the morning sunshine\nI read the signature of last night‚Äôs rain.\nv\nThe squads, platoons, and regiments\nof lighted windows,\nephemeral under ...",objectivist
125,Louis Zukofsky,https://www.poetryfoundation.org/poetrymagazine/poems/31024/a-21-rudens-act-iii,"""A""21: Rudens, Dads","[Miraculously gods playfellows dream in, men, don‚Äôt let us sleep, like me last night dreaming, this weird and silly dream:, a swallow‚Äôs nest, a mo...","Miraculously gods playfellows dream in\nmen, don‚Äôt let us sleep\nlike me last night dreaming\nthis weird and silly dream:\na swallow‚Äôs nest, a mon...",objectivist


In [521]:
rescrapes_pt2.to_csv('data/temp_rescrapes_pt2.csv')

In [524]:
text_poems_df.shape, scan_poem_df.shape, rescrapes_pt1.shape, rescrapes_pt2.shape

((3261, 6), (1774, 6), (11, 6), (124, 6))

In [526]:
df = pd.concat([text_poems_df, scan_poem_df, rescrapes_pt1, rescrapes_pt2], axis=0, ignore_index=True)
df.shape

(5170, 6)

In [529]:
df.iloc[3255:3270]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
3255,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45491/telling-the-bees,Telling the Bees,"[Here is the place; right over the hill, Runs the path I took;, You can see the gap in the old wall still,, And the stepping-stones in the shallow...","Here is the place; right over the hill\nRuns the path I took;\nYou can see the gap in the old wall still,\nAnd the stepping-stones in the shallow ...",victorian
3256,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/53105/the-pumpkin,The Pumpkin,"[Oh, greenly and fair in the lands of the sun,, The vines of the gourd and the rich melon run,, And the rock and the tree and the cottage enfold,,...","Oh, greenly and fair in the lands of the sun,\nThe vines of the gourd and the rich melon run,\nAnd the rock and the tree and the cottage enfold,\n...",victorian
3257,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45482/an-autograph,An Autograph,"[I write my name as one,, On sands by waves o‚Äôerrun, Or winter‚Äôs frosted pane,, Traces a record vain., Oblivion‚Äôs blankness claims, Wiser and bett...","I write my name as one,\nOn sands by waves o‚Äôerrun\nOr winter‚Äôs frosted pane,\nTraces a record vain.\nOblivion‚Äôs blankness claims\nWiser and bette...",victorian
3258,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45492/what-the-birds-said,What the Birds Said,"[The birds against the April wind, Flew northward, singing as they flew;, They sang, ‚ÄúThe land we leave behind, Has swords for corn-blades, blood ...","The birds against the April wind\nFlew northward, singing as they flew;\nThey sang, ‚ÄúThe land we leave behind\nHas swords for corn-blades, blood f...",victorian
3259,Oscar Wilde,https://www.poetryfoundation.org/poems/45495/the-ballad-of-reading-gaol,The Ballad of Reading Gaol,"[I, He did not wear his scarlet coat,, For blood and wine are red,, And blood and wine were on his hands, When they found him with the dead,, The ...","I\nHe did not wear his scarlet coat,\nFor blood and wine are red,\nAnd blood and wine were on his hands\nWhen they found him with the dead,\nThe p...",victorian
3260,A. E. Housman,https://www.poetryfoundation.org/poetrymagazine/poems/55409/to-my-comrade-moses-j-jackson-scoffer-at-this-scholarship,"To my Comrade, Moses J. Jackson, Scoffer at this Scholarship","[As we went walking far and wide, Through silent fields and countryside,, We watched together star signs brim, And rise above the ocean‚Äôs rim,, An...","As we went walking far and wide\nThrough silent fields and countryside,\nWe watched together star signs brim\nAnd rise above the ocean‚Äôs rim,\nAnd...",victorian
3261,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/31338/wood,Wood,"[We age in darkness like wood, and watch our phantoms change, eir clothes, of shingles and boards, for a purpose that can only be, described as wo...",We age in darkness like wood\nand watch our phantoms change\neir clothes\nof shingles and boards\nfor a purpose that can only be\ndescribed as wood.,beat
3262,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21676/dust-and-the-glory,Dust And The Glory,"[On a low Lorrainian knoll a leaning peasant sinking a pit, Meets rotted rock and a slab., The slab cracks and is split, the old grave opened,, Hi...","On a low Lorrainian knoll a leaning peasant sinking a pit\nMeets rotted rock and a slab.\nThe slab cracks and is split, the old grave opened,\nHis...",beat
3263,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21675/we-in-the-fields,We In The Fields,"[Dawn and a high film, the sun burned it,, But noon had a thick sheet, and the clouds coming,, The low rain-bringers, trooping in from the north,,...","Dawn and a high film, the sun burned it,\nBut noon had a thick sheet, and the clouds coming,\nThe low rain-bringers, trooping in from the north,\n...",beat
3264,Allen Ginsberg,https://www.poetryfoundation.org/poetrymagazine/poems/36505/written-in-my-dream-by-w-c-williams,Written In My Dream By W C Williams,"[‚ÄúAs Is, you're bearing, a common, Truth, Commonly known, as desire, No need, to dress, it up, as beauty, No need, to distort, what‚Äôs not, standar...",‚ÄúAs Is\nyou're bearing\na common\nTruth\nCommonly known\nas desire\nNo need\nto dress\nit up\nas beauty\nNo need\nto distort\nwhat‚Äôs not\nstandard...,beat


In [532]:
df.sort_values(by=['genre', 'poet', 'title'], inplace=True)
df.reset_index(drop=True, inplace=True)

In [533]:
df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Alexander Pope,https://www.poetryfoundation.org/poems/44896/an-essay-on-criticism-part-1,An Essay on Criticism: Part 1,"[PART 1, 'Tis hard to say, if greater want of skill, Appear in writing or in judging ill;, But, of the two, less dang'rous is th' offence, To tire...","PART 1\n'Tis hard to say, if greater want of skill\nAppear in writing or in judging ill;\nBut, of the two, less dang'rous is th' offence\nTo tire ...",augustan
1,Alexander Pope,https://www.poetryfoundation.org/poems/44897/an-essay-on-criticism-part-2,An Essay on Criticism: Part 2,"[Of all the causes which conspire to blind, Man's erring judgment, and misguide the mind,, What the weak head with strongest bias rules,, Is pride...","Of all the causes which conspire to blind\nMan's erring judgment, and misguide the mind,\nWhat the weak head with strongest bias rules,\nIs pride,...",augustan
2,Alexander Pope,https://www.poetryfoundation.org/poems/44898/an-essay-on-criticism-part-3,An Essay on Criticism: Part 3,"[Learn then what morals critics ought to show,, For 'tis but half a judge's task, to know., 'Tis not enough, taste, judgment, learning, join;, In ...","Learn then what morals critics ought to show,\nFor 'tis but half a judge's task, to know.\n'Tis not enough, taste, judgment, learning, join;\nIn a...",augustan
3,Alexander Pope,https://www.poetryfoundation.org/poems/44899/an-essay-on-man-epistle-i,An Essay on Man: Epistle I,"[Awake, my St. John! leave all meaner things, To low ambition, and the pride of kings., Let us (since life can little more supply, Than just to lo...","Awake, my St. John! leave all meaner things\nTo low ambition, and the pride of kings.\nLet us (since life can little more supply\nThan just to loo...",augustan
4,Alexander Pope,https://www.poetryfoundation.org/poems/44900/an-essay-on-man-epistle-ii,An Essay on Man: Epistle II,"[I., Know then thyself, presume not God to scan;, The proper study of mankind is man., Plac'd on this isthmus of a middle state,, A being darkly w...","I.\nKnow then thyself, presume not God to scan;\nThe proper study of mankind is man.\nPlac'd on this isthmus of a middle state,\nA being darkly wi...",augustan


In [534]:
df.tail()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
5165,William Barnes,https://www.poetryfoundation.org/poems/52365/tokens,Tokens,"[Green mwold on zummer bars do show, That they‚Äôve a-dripp‚Äôd in winter wet;, The hoof-worn ring o‚Äô groun‚Äô below, The tree, do tell o‚Äô storms or het...","Green mwold on zummer bars do show\nThat they‚Äôve a-dripp‚Äôd in winter wet;\nThe hoof-worn ring o‚Äô groun‚Äô below\nThe tree, do tell o‚Äô storms or het;...",victorian
5166,William Barnes,https://www.poetryfoundation.org/poems/52362/zun-zet,Zun-zet,"[Where the western zun, unclouded,, Up above the grey hill-tops,, Did sheen drough ashes, lofty sh‚Äôouded,, On the turf beside the copse,, In zumme...","Where the western zun, unclouded,\nUp above the grey hill-tops,\nDid sheen drough ashes, lofty sh‚Äôouded,\nOn the turf beside the copse,\nIn zummer...",victorian
5167,William Ernest Henley,https://www.poetryfoundation.org/poems/51642/invictus,Invictus,"[Out of the night that covers me,, Black as the pit from pole to pole,, I thank whatever gods may be, For my unconquerable soul., In the fell clut...","Out of the night that covers me,\nBlack as the pit from pole to pole,\nI thank whatever gods may be\nFor my unconquerable soul.\nIn the fell clutc...",victorian
5168,William Makepeace Thackeray,https://www.poetryfoundation.org/poems/52711/the-cane-bottomd-chair,The Cane-Bottom‚Äôd Chair,"[In tattered old slippers that toast at the bars,, And a ragged old jacket perfumed with cigars,, Away from the world and its toils and its cares,...","In tattered old slippers that toast at the bars,\nAnd a ragged old jacket perfumed with cigars,\nAway from the world and its toils and its cares,\...",victorian
5169,William Miller,https://www.poetryfoundation.org/poems/46949/willie-winkie-56d2271169ef7,Willie Winkie,"[Wee Willie Winkie, Rins through the toun,, Up stairs and doun stairs, In his nicht-gown,, Tirling at the window,, Crying at the lock,, ‚ÄúAre the w...","Wee Willie Winkie\nRins through the toun,\nUp stairs and doun stairs\nIn his nicht-gown,\nTirling at the window,\nCrying at the lock,\n‚ÄúAre the we...",victorian


In [535]:
df.to_csv('data/poems_df_pre_clean.csv')

In [255]:
scan_poem_df.loc[4, 'poem_string']

'A wrist (to repeat\nwith a shift\nof ac-\ncent, mood, of emphasis\nattentive to) now\nneeded\nThe wrist I lost\nhold of, of\nwhat was most\nloved as a kid\nin the swing of\nTed Williams,\nthe effortlessly\nbreaking as\nof the curved\nway true to the\nmark, of the stuff of Prince\nal,\nthe poem ought to be,\nas love is\na style she holds\nout to me\nto perch on the pulse of,\naristo-\ncratically, as I am not so but can\nat will flop for,\nsloppily,\nto please the crowd\nFor the game (the\njugglery,\nevasion\nthat‚Äôs the invasion of\nprivacy in, say\nyes,\nsay it, him, and by\nall means\nempathetically)\nCharlie Chaplin)\nis as it must be\ndrawing\nto the close of its\nnight now,\none is 30,\nhaving learned control,\nhow to pick the spots,\nand sits\nhigh up in the tiers\ninstead o\nthe lobsterbacked\nbleachers faraway close,\nand knows no wrist of a\nrare kid‚Äôs form\nwill bring down\nthat bird with the sun in its beak to break all records.'

In [259]:
scan_poem_scraper(scan_poem_df.loc[4, 'poem_url'])

{'poet': 'Jack Hirschman',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/30162/the-baseball-poem',
 'title': 'The Baseball Poem',
 'poem_lines': ['A wrist (to repeat',
  'with a shift',
  'of ac-',
  'cent, mood, of emphasis',
  'attentive to) now',
  'needed',
  'The wrist I lost',
  'hold of, of',
  'what was most',
  'loved as a kid',
  'in the swing of',
  'Ted Williams,',
  'the effortlessly',
  'breaking as',
  'of the curved',
  'way true to the',
  'mark, of the stuff of Prince',
  'al,',
  'the poem ought to be,',
  'as love is',
  'a style she holds',
  'out to me',
  'to perch on the pulse of,',
  'aristo-',
  'cratically, as I am not so but can',
  'at will flop for,',
  'sloppily,',
  'to please the crowd'],
 'poem_string': 'A wrist (to repeat\nwith a shift\nof ac-\ncent, mood, of emphasis\nattentive to) now\nneeded\nThe wrist I lost\nhold of, of\nwhat was most\nloved as a kid\nin the swing of\nTed Williams,\nthe effortlessly\nbreaking as\nof the curv

In [241]:
page = rq.get('https://www.poetryfoundation.org/poetrymagazine/browse?volume=50&issue=3&page=12')
soup = bs(page.content, 'html.parser')
img_link = soup.find('img', src=re.compile('.*/jstor/.*'))['src']
img_data = rq.get(img_link).content
with open('data/temp.png', 'wb') as handle:
    handle.write(img_data)

In [390]:
text = pytesseract.image_to_string('data/temp.png')
text

'POETRY\n\nA MAGAZINE OF VERSE\n\nVOL. LIIIL NO. IV\n\nJANUARY 1939\n\nSEVEN POEMS\n\nI\nmortals)\nclimbi\nng i\nnto eachness begi\nn\ndizzily\nswingthings\nof speeds of\ntrapeze gush somersaults\nopen ing\nhes shes\n[169]'

In [254]:
re.match('[\[\(\{]?\s?[\d]+\s?[\]\)\}]?', text)

In [243]:
re.match('^[A-HJ-Z][A-HJ-Z ][A-Z ]+\n', text)

In [236]:
scan_pattern = '\n\n((?:\r?\n?(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'
lines = re.search(scan_pattern, text, re.MULTILINE).group(1).splitlines()
lines

['Though words are littered to my hand',
 'nothing they build can house my need.',
 'Though words, a masked bedizened band,',
 'surround me, mock ‚Äî assail ‚Äî evade ‚Äî',
 '',
 'though words come flowing from afar',
 'having from ancient hills their red',
 '',
 'and from this sky their cloud, their star,',
 'still thirsty, mute, I bow my head.',
 '',
 'For I am caught here needing speech,',
 'sick with a lovely song unsung.',
 '‚ÄòWaves broken on a desolate beach,',
 'O not your strange confusing tongue',
 '',
 'but rather the enchanted beat,',
 '',
 'the deep eternal surge and sway ‚Äî',
 '',
 'silence, then running rapturous feet ‚Äî',
 '',
 'comes nearer what my heart would say.',
 'Grace Fallow Norton',
 '',
 '[133]']

In [151]:
poems.shape

(4923, 6)

In [156]:
poems.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Mary Barber,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Advice to Her Son on Marriage,"['When you gain her Affection, take care to preserve it;', 'Lest others persuade her, you do not deserve it.', 'Still study to heighten the Joys o...","When you gain her Affection, take care to preserve it;\nLest others persuade her, you do not deserve it.\nStill study to heighten the Joys of her ...",augustan
1,Susanna Blamire,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Auld Robin Forbes,"['And auld Robin Forbes hes gien tem a dance,', 'I pat on my speckets to see them aw prance;', 'I thout o‚Äô the days when I was but fifteen,', 'And...","And auld Robin Forbes hes gien tem a dance,\nI pat on my speckets to see them aw prance;\nI thout o‚Äô the days when I was but fifteen,\nAnd skipp‚Äôd...",augustan
2,Susanna Blamire,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,O Donald! Ye Are Just the Man,"['O Donald! ye are just the man', 'Who, when he‚Äôs got a wife,', 'Begins to fratch‚Äî nae notice ta‚Äôen‚Äî', 'They‚Äôre strangers a‚Äô their life.', 'The fa...","O Donald! ye are just the man\nWho, when he‚Äôs got a wife,\nBegins to fratch‚Äî nae notice ta‚Äôen‚Äî\nThey‚Äôre strangers a‚Äô their life.\nThe fan may drop...",augustan
3,Susanna Blamire,https://www.poetryfoundation.org/poems/50532/the-siller-croun,The Siller Croun,"['And ye shall walk in silk attire,', 'And siller hae to spare,', 'Gin ye‚Äôll consent to be his bride,', 'Nor think o‚Äô Donald mair.', 'O wha wad bu...","And ye shall walk in silk attire,\nAnd siller hae to spare,\nGin ye‚Äôll consent to be his bride,\nNor think o‚Äô Donald mair.\nO wha wad buy a silken...",augustan
4,Henry Carey,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,The Ballad of Sally in our Alley,"['Of all the Girls that are so smart', 'There‚Äôs none like pretty SALLY,', 'She is the Darling of my Heart,', 'And she lives in our Alley.', 'There...","Of all the Girls that are so smart\nThere‚Äôs none like pretty SALLY,\nShe is the Darling of my Heart,\nAnd she lives in our Alley.\nThere is no Lad...",augustan


In [158]:
poems[poems.poem_string.isna()]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
154,Allen Ginsberg,https://www.poetryfoundation.org/poems/47660/a-supermarket-in-california,A Supermarket in California,[],,beat
166,Bob Kaufman,https://www.poetryfoundation.org/poems/55713/a-terror-is-more-certain-,A Terror is More Certain . . .,[],,beat
212,Lawrence Ferlinghetti,https://www.poetryfoundation.org/poetrymagazine/poems/58150/beatitudes-visuales-mexicanas,Beatitudes Visuales Mexicanas,[],,beat
215,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
232,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,[],,beat
312,Henry Dumas,https://www.poetryfoundation.org/poems/53477/kef-21,Kef 21,[],,black_arts_movement
332,Nikki Giovanni,https://www.poetryfoundation.org/poems/90181/no-complaints,No Complaints,[],,black_arts_movement
334,Nikki Giovanni,https://www.poetryfoundation.org/poems/90180/rosa-parks,Rosa Parks,[],,black_arts_movement
342,Etheridge Knight,https://www.poetryfoundation.org/poems/51371/a-fable-56d22f0fa5920,A Fable,[],,black_arts_movement
453,Robert Duncan,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,A Poem Beginning with a Line by Pindar,[],,black_mountain


In [159]:
%%time

for index in poems[poems.poem_string.isna()].index:
    try:
        poems.loc[index,'poem_lines'] = PoemView_rescraper(poems.loc[index,'poem_url'])[0]
        poems.loc[index,'poem_string'] = PoemView_rescraper(poems.loc[index,'poem_url'])[1]
    except:
        print(index)

215
232
1461
2141
2196
2199
2200
2201
2203
2910
3099
CPU times: user 8.18 s, sys: 269 ms, total: 8.44 s
Wall time: 22.3 s


In [27]:
poems[poems.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
215,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
232,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,[],,beat
1461,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot,Some Simple Measures In The American Idiom And The Variable Foot,[],,imagist
2053,Dylan Thomas,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Poem on His Birthday [Facs. drafts],[],,modern
2141,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25655/toward-the-south-tr-by-harry-duncan,Toward The South Tr By Harry Duncan,[],,modern
2196,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968,[],,modern
2199,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19926/the-urn-enrich-my-resignation,The Urn Enrich My Resignation,[],,modern
2200,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn Purgatorio,[],,modern
2201,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,[],,modern
2203,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn The Sad Indian,[],,modern


In [28]:
poems = poems[poems.poem_string != '']
poems.shape

(4910, 6)

In [99]:
error_rescrapes = []
still_errors = error_poems.copy()

In [100]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=18'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: We Shall Be Free')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [102]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: When Spirit Has No Edge')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [107]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=5&page=6'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Enough: Left After That')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [109]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=3&page=19'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Walking: In My Head')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [112]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=13&issue=6&page=13'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Epitaph')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [119]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28310/elaine'
rescrape = scan_poem_scraper(url, input_poet='William Carlos Williams', input_title='Elainb')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [122]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28312/emily'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=3'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Emily')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [124]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28311/erica'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=2'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Erica')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [162]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/18899/poem-as-the-cat'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=36&issue=4&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Poem: As the cat')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [165]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
rescrape = scan_poem_scraper(url, input_poet='Conrad Aiken', input_title='Discordants')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [169]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=6&issue=6&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='Conrad Aiken', input_title='Discordants IV')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)

In [97]:
error_poems

['https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30270/ritual-ix',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note',
 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/19645/persons-seen',
 'https://www.poetryfoundation.org/poetrymagazine/poems

In [173]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25512/jacks-white-horseup'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="jack's white")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [182]:
rescrapes_v2 = pd.DataFrame(error_rescrapes)

In [184]:
rescrapes_v2.to_csv('data/temp_rescrapes_2.csv')

In [181]:
poems.loc[2247, 'poem_string']

'climbi\nng i\nnto eachness begi\ndizzily\nswingthings\nof speeds of\ntrapeze gush somersaults\nopen ing\nhes shes\n&meet &\nswoop\nfully is are ex\nquisite theys of re\nfall which now drop who all dreamlike\nim)\nJanuary 1939'

In [180]:
poems[poems.poet == 'E. E. Cummings']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1755,E. E. Cummings,https://www.poetryfoundation.org/poems/49744/chimneys-xii-kitty-sixteen51whiteprostitute,"[""kitty"". sixteen,5'1"",white,prostitute]","['""kitty"". sixteen,5\'1"",white,prostitute.', 'ducking always the touch of must and shall,', ""whose slippery body is Death's littlest pal,"", 'skill...","""kitty"". sixteen,5'1"",white,prostitute.\nducking always the touch of must and shall,\nwhose slippery body is Death's littlest pal,\nskilled in qui...",modern
1756,E. E. Cummings,https://www.poetryfoundation.org/poems/148504/the-bigness-of-cannon,[the bigness of cannon],"['the bigness of cannon', 'is skilful,', 'but i have seen', 'death‚Äôs clever enormous voice', 'which hides in a fragility', 'of poppies. . . .', 'i...","the bigness of cannon\nis skilful,\nbut i have seen\ndeath‚Äôs clever enormous voice\nwhich hides in a fragility\nof poppies. . . .\ni say that some...",modern
1757,E. E. Cummings,https://www.poetryfoundation.org/poems/148505/o-sweet-spontaneous-5bf31932ce110,[O sweet spontaneous],"['O sweet spontaneous', 'earth how often have', 'the', 'doting', 'fingers of', 'prurient philosophers pinched', 'and', 'poked', 'thee', ',has the ...","O sweet spontaneous\nearth how often have\nthe\ndoting\nfingers of\nprurient philosophers pinched\nand\npoked\nthee\n,has the naughty thumb\nof sc...",modern
1758,E. E. Cummings,https://www.poetryfoundation.org/poems/47247/in-just,[in Just-],"['in Just-', 'spring when the world is mud-', 'luscious the little', 'lame balloonman', 'whistles far and wee', 'and ed...",in Just-\nspring when the world is mud-\nluscious the little\nlame balloonman\nwhistles far and wee\nand eddieandbill c...,modern
1759,E. E. Cummings,https://www.poetryfoundation.org/poems/47304/little-tree,[little tree],"['little tree', 'little silent Christmas tree', 'you are so little', 'you are more like a flower', 'who found you in the green forest', 'and were ...",little tree\nlittle silent Christmas tree\nyou are so little\nyou are more like a flower\nwho found you in the green forest\nand were you very sor...,modern
1760,E. E. Cummings,https://www.poetryfoundation.org/poems/47245/the-cambridge-ladies-who-live-in-furnished-souls,the Cambridge ladies who live in furnished souls,"['the Cambridge ladies who live in furnished souls', 'are unbeautiful and have comfortable minds', ""(also, with the church's protestant blessings""...","the Cambridge ladies who live in furnished souls\nare unbeautiful and have comfortable minds\n(also, with the church's protestant blessings\ndaugh...",modern
1761,E. E. Cummings,https://www.poetryfoundation.org/poems/47244/buffalo-bill-s,[Buffalo Bill 's],"['Buffalo Bill ‚Äôs', 'defunct', 'who used to', 'ride a watersmooth-silver', 'stallion', 'and break onetwothreefourfive pigeonsjustlikethat', 'Jesus...",Buffalo Bill ‚Äôs\ndefunct\nwho used to\nride a watersmooth-silver\nstallion\nand break onetwothreefourfive pigeonsjustlikethat\nJesus\nhe was a han...,modern
1762,E. E. Cummings,https://www.poetryfoundation.org/poems/153876/what-if-a-much-of-a-which-of-a-wind,[what if a much of a which of a wind],"['what if a much of a which of a wind', ""gives truth to the summer's lie;"", 'bloodies with dizzying leaves the sun', 'and yanks immortal stars awr...",what if a much of a which of a wind\ngives truth to the summer's lie;\nbloodies with dizzying leaves the sun\nand yanks immortal stars awry?\nBlow...,modern
1763,E. E. Cummings,https://www.poetryfoundation.org/poems/148502/vi-into-the-strenuous-briefness,[into the strenuous briefness],"['into the strenuous briefness', 'Life:', 'handorgans and April', 'darkness,friends', 'i charge laughing.', 'Into the hair-thin tints', 'of yellow...","into the strenuous briefness\nLife:\nhandorgans and April\ndarkness,friends\ni charge laughing.\nInto the hair-thin tints\nof yellow dawn,\ninto t...",modern
1764,E. E. Cummings,https://www.poetryfoundation.org/poems/148503/all-in-green-went-my-love-riding,[All in green went my love riding],"['All in green went my love riding', 'on a great horse of gold', 'into the silver dawn.', 'four lean hounds crouched low and smiling', 'the merry ...",All in green went my love riding\non a great horse of gold\ninto the silver dawn.\nfour lean hounds crouched low and smiling\nthe merry deer ran b...,modern


In [174]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25263/imc-a-tmo'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="(im)c-a-t(mo)")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrape
# error_rescrapes.append(rescrape)
# still_errors.remove(url)

AttributeError: 'NoneType' object has no attribute 'group'

In [178]:
poems[poems.contains('cummings')]

AttributeError: 'DataFrame' object has no attribute 'contains'

In [190]:
title = 'HEY'

In [195]:
pattern=r'\b.*((?:\r?\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'
f'{title}' + pattern

'HEY\\b.*((?:\\r?\\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'

In [131]:
re.match(r'[\[\(]?\s?[\d]+\s?[\]\)]?', process_image(actual_url, poet='William Carlos Williams', title='Erica')[0][-1])

<_sre.SRE_Match object; span=(0, 3), match='326'>

In [128]:
rescrape['poem_lines']

['the melody line is',
 'everything',
 'in this composition',
 'when I first witnessed',
 'your hea',
 'and held it',
 'admiringly between',
 'in approval',
 'at the Scandinavian',
 'name they‚Äôd',
 'given you Erica after',
 'your father‚Äôs',
 'forebears',
 'the rest remains a',
 'mystery',
 'your snub nose spinning']

In [134]:
process_image('https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=3', first=False)

([],
 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=4')

In [114]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28309/3-stances'
rescrape = scan_poem_scraper(url, 
                             input_poet='William Carlos Williams',
                             input_title='3 Stances',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!THEODORE HOLMES).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
rescrape

{'poet': 'William Carlos Williams',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/28309/3-stances',
 'title': '3 Stances',
 'poem_lines': ['I',
  'ELAINB',
  'poised for the leap she',
  'is not yet ready for',
  '‚Äî-save in her eyes',
  'her bare toes',
  'starting over the clipt',
  'lawn where she may',
  'not go emphasize summer',
  'and the curl',
  'of her blond hair'],
 'poem_string': 'I\nELAINB\npoised for the leap she\nis not yet ready for\n‚Äî-save in her eyes\nher bare toes\nstarting over the clipt\nlawn where she may\nnot go emphasize summer\nand the curl\nof her blond hair',
 'genre': 'imagist'}

In [147]:
poems.loc[1505, 'poem_string']

'I\nWhen a man had gone\nin Russia from a small\ntown\nto the University\nhe\nreturned a hero‚Äî\npeople\nbowed down to him‚Äî\nhis\nego, nourished by this,\nmount-\ned to notable works.\nHere\nin the streets the kids\nsay\nHello Pete! to me\nWhat\ncan one be or\nimagine?\nNothing is reverenced\nnothing\nlooked up to. Nothing\ncan\ncome of that sort of\nis-\nrespect for the under-\nstanding'

In [146]:
poems[poems.title == 'The Unfrocked Priest']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1505,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/18898/the-unfrocked-priest,The Unfrocked Priest,"[I, When a man had gone, in Russia from a small, town, to the University, he, returned a hero‚Äî, people, bowed down to him‚Äî, his, ego, nourished by...","I\nWhen a man had gone\nin Russia from a small\ntown\nto the University\nhe\nreturned a hero‚Äî\npeople\nbowed down to him‚Äî\nhis\nego, nourished by ...",imagist


In [115]:
poems = poems[poems.title != '3 Stances']
poems.iloc[1489:1492]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1493,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/22755/the-world-narrowed-to-a-point,The World Narrowed To A Point,"[Liquor and love, when the mind is dull, focus the wit, on a world of form, The eye awakes, perfumes are defined, inflections, ride the quick ear,...",Liquor and love\nwhen the mind is dull\nfocus the wit\non a world of form\nThe eye awakes\nperfumes are defined\ninflections\nride the quick ear\n...,imagist
1494,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/13515/marriage,Marriage,"[So different, this man, And this woman:, A stream flowing, In a field.]","So different, this man\nAnd this woman:\nA stream flowing\nIn a field.",imagist
1495,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/15467/spouts,Spouts,"[In this world of, as fine a pair of breasts, as ever I saw,, the fountain in, Madison Square, spouts up of water, a white tree,, that dies and li...","In this world of\nas fine a pair of breasts\nas ever I saw,\nthe fountain in\nMadison Square\nspouts up of water\na white tree,\nthat dies and liv...",imagist


In [110]:
error_rescrapes[-1]

{'poet': 'Robert Creeley',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'title': 'Walking: In My Head',
 'poem_lines': ['is there to walk,',
  'not thought of, is',
  'the road itself more',
  'than seen. I think',
  'it might be, feel',
  'as my feet do, and',
  'continue, and',
  'at last reach, slowly,',
  'one end of my intention.'],
 'poem_string': 'is there to walk,\nnot thought of, is\nthe road itself more\nthan seen. I think\nit might be, feel\nas my feet do, and\ncontinue, and\nat last reach, slowly,\none end of my intention.',
 'genre': 'black_mountain'}

In [161]:
text = pytesseract.image_to_string('data/temp.png')
text

'POETRY: 4 Magazine of Verse\nTWO LADIES\n\nEPITAPH FOR TABITHA\nThief at her left and whore at her right,\nPause and pity Tabitha‚Äôs plight!\n\nPoor old kindly Tabitha who\n\nShunned the devil and all his crew;\n\nSqueezed the shillings and pinched the groats,\nKept the heathen in petticoats;\n\nTook her heart when she felt it prod,\nSalted it down like a frisky cod.\n\nGood old proper Tabitha Tubb,\nMeasuring bones with Beelzebub,\n\nCovered with daisies and chagrin,\nHere she lies where they heeled her in-\n\nThief at her left and whore at her right,\nPause and pity Tabitha‚Äôs plight.\nLOCAL COLOR\n\nMrs. Leander came to town,\nOgled us deftly up and down;\nPaused at Mrs. O‚ÄôReilly‚Äôs door,\nComplimented her sycamore;\n\n[196]'

In [144]:
lines = re.search('WILLIAM CARLOS WILLIAMS\n((?:\r?\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)', text, re.MULTILINE).group(1).splitlines()
lines

[]

In [699]:
text = 'William Carlos Williams\n\nEPITAPH\n\nAn old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n\n‚ÄúLove is a young green willow\nShimmering at the bare wood‚Äôs edge.‚Äù\n\nSPIRIT\n\nO my grey hairs!\nYou are truly white as plum blossoms.\n\nSTROLLER\n\nI have seen the hills blue,\n\nI have seen them purple;\n\nAnd it is as hard to know\n\nThe words of a woman\n\nAs to straighten the crumpled branch\nOf an old willow.\n\nMEMORY OF APRIL\n\nYou say love is this, love is that:\nPoplar tassels, willow tendrils\n\nThe wind and the rain comb,\nTinkle and drip, tinkle and drip‚Äî\nBranches drifting apart. Hagh!\nLove has not even visited this country.\n\n[303]'

In [700]:
title = 'Epitaph'
scan_pattern = fr'{title.split()[-1].upper()}\b.*((?:\r?\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'
lines = re.search(scan_pattern, text, re.MULTILINE).group(1).splitlines()
lines

['',
 '',
 'An old willow with hollow branches',
 'Slowly swayed his few high bright tendrils',
 'And sang:',
 '',
 '‚ÄúLove is a young green willow',
 'Shimmering at the bare wood‚Äôs edge.‚Äù']

In [413]:
poems.to_csv('data/poems_df.csv')

In [561]:
poems.loc[550:600, 'poem_string']

550    Don‚Äôt ste}\nso lightly. Break\nyour back, missed\nthe step. Don‚Äôt go\naway mad, lady in\nthe nightmare. You\nare central,\neven necessary.\nI will...
551    Remember the way you\nhunched up the first\ntimes in bed, all your\nbody as you walked\nseemed centered\nin your breasts. It\nwas watching the wor...
552    Thinking of you asleep on a\nbed on a pillow, on a\nbed‚Äîthe ground or space\nyou lie on. That‚Äôs enough to\ntalk to now I got space and\ntime like ...
553    For fear I want\nto make myself again\nunder the thumb\nof old love, old time\nsubservience\nand pain, bent\ninto a nail that will\nnot come out.\...
554    For Robert Duncan\nIt is hard going to the door\ncut so small in the wall where\nthe vision which echoes loneliness\nbrings a scent of wild flower...
555    For whatever, it could\nbe done, simply\nremove it, cut the\noffending member. Once\nin a photograph by\nFrederick Sommer a leg\nlay on what was a...
556    Say that you‚Äôre\nlonely‚Äîand want\nsomet

In [436]:
poems[poems.title == 'A Boat']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
116,Richard Brautigan,https://www.poetryfoundation.org/poems/48576/a-boat,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
209,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/56423/a-boat-56d238e754f45,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat


In [443]:
poems[poems.duplicated(subset=['poet', 'poem_string', 'genre'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
116,Richard Brautigan,https://www.poetryfoundation.org/poems/48576/a-boat,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
154,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"[I, I saw the best minds of my generation destroyed by madness, starving hysterical naked,, dragging themselves through the negro streets at dawn ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
183,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"[I, I saw the best minds of my generation destroyed by madness, starving hysterical naked,, dragging themselves through the negro streets at dawn ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
209,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/56423/a-boat-56d238e754f45,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
589,Edward Dorn,https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note,The First Note,"[THE END OF THE NORTH ATLANTIC TURBINE POEM, I never hear the Supremes, but what I remember Leroy., McLucas came, to Pocatello the summer of 1965,...",THE END OF THE NORTH ATLANTIC TURBINE POEM\nI never hear the Supremes\nbut what I remember Leroy.\nMcLucas came\nto Pocatello the summer of 1965\n...,black_mountain
590,Edward Dorn,https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks,The Sundering Up Tracks,"[THE END OF THE NORTH ATLANTIC TURBINE POEM, I never hear the Supremes, but what I remember Leroy., McLucas came, to Pocatello the summer of 1965,...",THE END OF THE NORTH ATLANTIC TURBINE POEM\nI never hear the Supremes\nbut what I remember Leroy.\nMcLucas came\nto Pocatello the summer of 1965\n...,black_mountain
2095,Wallace Stevens,https://www.poetryfoundation.org/poems/45235/the-snow-man-56d224a6d4e90,The Snow Man,"[One must have a mind of winter, To regard the frost and the boughs, Of the pine-trees crusted with snow;, And have been cold a long time, To beho...",One must have a mind of winter\nTo regard the frost and the boughs\nOf the pine-trees crusted with snow;\nAnd have been cold a long time\nTo behol...,modern
2105,Wallace Stevens,https://www.poetryfoundation.org/poems/51648/anecdote-of-the-jar-56d22f87dc64f,Anecdote of the Jar,"[I placed a jar in Tennessee,, And round it was, upon a hill., It made the slovenly wilderness, Surround that hill., The wilderness rose up to it,...","I placed a jar in Tennessee,\nAnd round it was, upon a hill.\nIt made the slovenly wilderness\nSurround that hill.\nThe wilderness rose up to it,\...",modern
2237,Wallace Stevens,https://www.poetryfoundation.org/poetrymagazine/poems/14575/anecdote-of-the-jar,Anecdote of the Jar,"[I placed a jar in Tennessee,, And round it was, upon a hill., It made the slovenly wilderness, Surround that hill., The wilderness rose up to it,...","I placed a jar in Tennessee,\nAnd round it was, upon a hill.\nIt made the slovenly wilderness\nSurround that hill.\nThe wilderness rose up to it,\...",modern
2350,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,"[RELIQUARY, ENDERNESS and resolution!, What is our life without a sudden pillow,, What is death without a ditch?, The harvest laugh of bright Apol...","RELIQUARY\nENDERNESS and resolution!\nWhat is our life without a sudden pillow,\nWhat is death without a ditch?\nThe harvest laugh of bright Apoll...",modern


In [452]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=9'

poems.loc[589,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[589,'poet'], 
                                                input_title=poems.loc[589,'title'])['poem_lines']

poems.loc[589,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[589,'poet'], 
                                                input_title=poems.loc[589,'title'])['poem_string']

In [459]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=7'

poems.loc[2350,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2350,'poet'], 
                                                input_title=poems.loc[2350,'title'])['poem_lines']

poems.loc[2350,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2350,'poet'], 
                                                input_title=poems.loc[2350,'title'])['poem_string']

In [462]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=6'

poems.loc[2351,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2351,'poet'], 
                                                input_title=poems.loc[2351,'title'])['poem_lines']

poems.loc[2351,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2351,'poet'], 
                                                input_title=poems.loc[2351,'title'])['poem_string']

In [498]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=2'

poems.loc[2354,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2354,'poet'], 
                                                input_title=poems.loc[2354,'title'])['poem_lines']

poems.loc[2354,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2354,'poet'], 
                                                input_title=poems.loc[2354,'title'])['poem_string']

In [500]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=10'

poems.loc[2359,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2359,'poet'], 
                                                input_title=poems.loc[2359,'title'])['poem_lines']

poems.loc[2359,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2359,'poet'], 
                                                input_title=poems.loc[2359,'title'])['poem_string']

In [511]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=2'

poems.loc[2621,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2621,'poet'], 
                                                input_title=poems.loc[2621,'title'])['poem_lines']

poems.loc[2621,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2621,'poet'], 
                                                input_title=poems.loc[2621,'title'])['poem_string']

In [515]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=3'

poems.loc[2629,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2629,'poet'], 
                                                input_title=poems.loc[2629,'title'])['poem_lines']

poems.loc[2629,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2629,'poet'], 
                                                input_title=poems.loc[2629,'title'])['poem_string']

In [519]:
poems.drop_duplicates(subset=['poet', 'poem_string', 'genre'], inplace=True)

In [524]:
poempara_rescraper('https://www.poetryfoundation.org/poems/57369/the-send-off')

(['Down the close, darkening lanes they sang their way',
  'To the siding-shed,',
  'And lined the train with faces grimly gay.',
  '',
  'Their breasts were stuck all white with wreath and spray',
  "As men's are, dead.",
  '',
  'Dull porters watched them, and a casual tramp',
  'Stood staring hard,',
  'Sorry to miss them from the upland camp.',
  'Then, unmoved, signals nodded, and a lamp',
  'Winked to the guard.',
  '',
  'So secretly, like wrongs hushed-up, they went.',
  'They were not ours:',
  'We never heard to which front these were sent.',
  '',
  'Nor there if they yet mock what women meant',
  'Who gave them flowers.',
  '',
  'Shall they return to beatings of great bells',
  'In wild trainloads?',
  'A few, a few, too few for drums and yells,',
  'May creep back, silent, to still village wells',
  'Up half-known roads.'],
 "Down the close, darkening lanes they sang their way\nTo the siding-shed,\nAnd lined the train with faces grimly gay.\n\nTheir breasts were stuck all

In [526]:
poems[poems.duplicated(subset=['poet', 'poem_string'], keep=False)].head(20)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
743,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1231,Richard Aldington,https://www.poetryfoundation.org/poems/53969/le-maudit,Le Maudit,"[Women‚Äôs tears are but water;, The tears of men are blood., He sits alone in the firelight, And on either side drifts by, Sleep, like a torrent wh...","Women‚Äôs tears are but water;\nThe tears of men are blood.\nHe sits alone in the firelight\nAnd on either side drifts by\nSleep, like a torrent whi...",imagist
1270,Ezra Pound,https://www.poetryfoundation.org/poems/54314/canto-i,Canto I,"[And then went down to the ship,, Set keel to breakers, forth on the godly sea, and, We set up mast and sail on that swart ship,, Bore sheep aboar...","And then went down to the ship,\nSet keel to breakers, forth on the godly sea, and\nWe set up mast and sail on that swart ship,\nBore sheep aboard...",imagist
1271,Ezra Pound,https://www.poetryfoundation.org/poems/52318/cantico-del-sole,Cantico del Sole,"[The thought of what America would be like, If the Classics had a wide circulation, Troubles my sleep,, The thought of what America,, The thought ...","The thought of what America would be like\nIf the Classics had a wide circulation\nTroubles my sleep,\nThe thought of what America,\nThe thought o...",imagist
1272,Ezra Pound,https://www.poetryfoundation.org/poems/54317/canto-xvi-56d234860e2a1,Canto XVI,"[And before hell mouth; dry plain, and two mountains;, On the one mountain, a running form,, and another, In the turn of the hill; in hard steel, ...","And before hell mouth; dry plain\nand two mountains;\nOn the one mountain, a running form,\nand another\nIn the turn of the hill; in hard steel\nT...",imagist
1273,Ezra Pound,https://www.poetryfoundation.org/poems/54321/from-canto-cxv,Canto CXV,"[The scientists are in terror, and the European mind stops, Wyndham Lewis chose blindness, rather than have his mind stop., Night under wind mid g...",The scientists are in terror\nand the European mind stops\nWyndham Lewis chose blindness\nrather than have his mind stop.\nNight under wind mid ga...,imagist
1274,Ezra Pound,https://www.poetryfoundation.org/poems/44915/hugh-selwyn-mauberley-part-i,Hugh Selwyn Mauberley [Part I],"[E. P. ODE POUR L‚ÄôEÃÅLECTION DE SON SEÃÅPULCHRE, , For three years, out of key with his time,, He strove to resuscitate the dead art, Of poetry; to...","E. P. ODE POUR L‚ÄôEÃÅLECTION DE SON SEÃÅPULCHRE\n \nFor three years, out of key with his time,\nHe strove to resuscitate the dead art\nOf poetry; to ...",imagist
1275,Ezra Pound,https://www.poetryfoundation.org/poems/54315/canto-iii-56d234851afde,Canto III,"[I sat on the Dogana‚Äôs steps, For the gondolas cost too much, that year,, And there were not ‚Äúthose girls‚Äù, there was one face,, And the Buccentor...","I sat on the Dogana‚Äôs steps\nFor the gondolas cost too much, that year,\nAnd there were not ‚Äúthose girls‚Äù, there was one face,\nAnd the Buccentoro...",imagist
1276,Ezra Pound,https://www.poetryfoundation.org/poems/57353/hugh-selwyn-mauberley-part-ii,Hugh Selwyn Mauberley [Part II],"[Par Jaquemart‚Äù, To the strait head, Of Messalina:, ‚ÄúHis True Penelope, Was Flaubert,‚Äù, And his tool, The engraver's., Firmness,, Not the full smi...","Par Jaquemart‚Äù\nTo the strait head\nOf Messalina:\n‚ÄúHis True Penelope\nWas Flaubert,‚Äù\nAnd his tool\nThe engraver's.\nFirmness,\nNot the full smil...",imagist
1277,Ezra Pound,https://www.poetryfoundation.org/poems/54320/canto-lxxxi,Canto LXXXI,"[Zeus lies in Ceres‚Äô bosom, Taishan is attended of loves, under Cythera, before sunrise, And he said: ‚ÄúHay aquiÃÅ mucho catolicismo‚Äî(sounded, catol...","Zeus lies in Ceres‚Äô bosom\nTaishan is attended of loves\nunder Cythera, before sunrise\nAnd he said: ‚ÄúHay aquiÃÅ mucho catolicismo‚Äî(sounded\ncatoli...",imagist


In [531]:
poems_double_backup = poems.copy()

In [536]:
poems.shape

(5151, 6)

In [549]:
rial_mod_ind = list(poems[(poems.poet == 'Richard Aldington') & (poems.genre == 'modern')].index)

poems.drop(rial_mod_ind, inplace=True)

poems.shape

(5048, 6)

In [551]:
poems[poems.poet == 'Li Bai']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1285,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs‚Äô Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",imagist
2003,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs‚Äô Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",modern


In [550]:
poems[poems.duplicated(subset=['poet', 'poem_string'], keep=False)].head(20)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
743,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1285,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs‚Äô Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",imagist
2003,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs‚Äô Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",modern
2133,Henry Wadsworth Longfellow,https://www.poetryfoundation.org/poems/44637/the-landlords-tale-paul-reveres-ride,The Landlord's Tale. Paul Revere's Ride,"[Listen, my children, and you shall hear, Of the midnight ride of Paul Revere,, On the eighteenth of April, in Seventy-five;, Hardly a man is now ...","Listen, my children, and you shall hear\nOf the midnight ride of Paul Revere,\nOn the eighteenth of April, in Seventy-five;\nHardly a man is now a...",modern
2196,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58343/ocean-of-earth,Ocean of Earth,"[I have built a house in the middle of the Ocean, Its windows are the rivers flowing from my eyes, Octopi are crawling all over where the walls ar...",I have built a house in the middle of the Ocean\nIts windows are the rivers flowing from my eyes\nOctopi are crawling all over where the walls are...,modern
2197,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58342/the-lady,The Lady,"[Knock knock He has closed his door, The garden‚Äôs lilies have started to rot, So who is the corpse being carried from the house, You just knocked ...",Knock knock He has closed his door\nThe garden‚Äôs lilies have started to rot\nSo who is the corpse being carried from the house\nYou just knocked o...,modern
2292,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58341/the-seasons-56d23ca091a25,The Seasons,"[It was a blesseÃÄd time we were at the beach, Go out early in the morning no shoes no hats no ties, And quick as a toad‚Äôs tongue can reach, Love w...",It was a blesseÃÄd time we were at the beach\nGo out early in the morning no shoes no hats no ties\nAnd quick as a toad‚Äôs tongue can reach\nLove wo...,modern
3478,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58343/ocean-of-earth,Ocean of Earth,"[I have built a house in the middle of the Ocean, Its windows are the rivers flowing from my eyes, Octopi are crawling all over where the walls ar...",I have built a house in the middle of the Ocean\nIts windows are the rivers flowing from my eyes\nOctopi are crawling all over where the walls are...,new_york_school_2nd_generation
3480,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58342/the-lady,The Lady,"[Knock knock He has closed his door, The garden‚Äôs lilies have started to rot, So who is the corpse being carried from the house, You just knocked ...",Knock knock He has closed his door\nThe garden‚Äôs lilies have started to rot\nSo who is the corpse being carried from the house\nYou just knocked o...,new_york_school_2nd_generation
3529,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58341/the-seasons-56d23ca091a25,The Seasons,"[It was a blesseÃÄd time we were at the beach, Go out early in the morning no shoes no hats no ties, And quick as a toad‚Äôs tongue can reach, Love w...",It was a blesseÃÄd time we were at the beach\nGo out early in the morning no shoes no hats no ties\nAnd quick as a toad‚Äôs tongue can reach\nLove wo...,new_york_school_2nd_generation


In [552]:
poems.drop([2003, 2133, 3478, 3480, 3529, 4976], inplace=True)

poems.shape

(5042, 6)

In [554]:
poems[poems.duplicated(subset=['poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
4754,A. E. Housman,https://www.poetryfoundation.org/poems/58269/a-shropshire-lad-52-far-in-a-western-brookland-,A Shropshire Lad 52:¬†Far in a western brookland,[ ],,victorian
5142,Katharine Tynan,https://www.poetryfoundation.org/poems/57349/a-lament-56d23ac7ae84a,A Lament,[ ],,victorian


In [555]:
poems.loc[988,'poem_lines'] = poempara_rescraper(poems.loc[988,'poem_url'])[0]
poems.loc[988,'poem_string'] = poempara_rescraper(poems.loc[988,'poem_url'])[1]

poems.loc[4754,'poem_lines'] = poempara_rescraper(poems.loc[4754,'poem_url'])[0]
poems.loc[4754,'poem_string'] = poempara_rescraper(poems.loc[4754,'poem_url'])[1]

poems.loc[5142,'poem_lines'] = poempara_rescraper(poems.loc[5142,'poem_url'])[0]
poems.loc[5142,'poem_string'] = poempara_rescraper(poems.loc[5142,'poem_url'])[1]

In [560]:
poems.to_csv('data/poems_df.csv')

In [404]:
poems_backup = poems.copy()

In [559]:
# uncomment to save
with gzip.open('data/poems_df.pkl', 'wb') as goodbye:
    pickle.dump(poems, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# uncomment to load
with gzip.open('data/poems_df.pkl', 'rb') as hello:
    poems_df = pickle.load(hello)

RecursionError: maximum recursion depth exceeded while getting the str of an object

In [389]:
error_poems = error_poems_orig.copy()
len(error_poems)

223

In [563]:
poems[poems.poet == 'Michael McClure']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
169,Michael McClure,https://www.poetryfoundation.org/poems/54613/dream-the-night-of-december-23rd-,Dream: The Night of December 23rd Ôªø,"[‚ÄîALL HUGE LIKE GIANT FLIGHTLESS KIWIS TWICE THE, SIZE OF OSTRICHES,, they turned and walked away from us, and you were there Jane and you were t...","‚ÄîALL HUGE LIKE GIANT FLIGHTLESS KIWIS TWICE THE\nSIZE OF OSTRICHES,\nthey turned and walked away from us\nand you were there Jane and you were tw...",beat
170,Michael McClure,https://www.poetryfoundation.org/poems/54612/the-chamber,The Chamber,"[IN LIGHT ROOM IN DARK HELL IN UMBER IN CHROME,, I sit feeling the swell of the cloud made about by movement, of arm leg and tongue. In reflection...","IN LIGHT ROOM IN DARK HELL IN UMBER IN CHROME,\nI sit feeling the swell of the cloud made about by movement\nof arm leg and tongue. In reflections...",beat
171,Michael McClure,https://www.poetryfoundation.org/poems/54614/mexico-seen-from-the-moving-car-,Mexico Seen from the Moving Car Ôªø,"[THERE ARE HILLS LIKE SHARKFINS, and clods of mud., The mind drifts through, in the shape of a museum,, in the guise of a museum, dreaming dead fr...","THERE ARE HILLS LIKE SHARKFINS\nand clods of mud.\nThe mind drifts through\nin the shape of a museum,\nin the guise of a museum\ndreaming dead fri...",beat
172,Michael McClure,https://www.poetryfoundation.org/poems/54611/the-mystery-of-the-hunt,The Mystery of the Hunt,"[It‚Äôs the mystery of the hunt that intrigues me,, That drives us like lemmings, but cautiously‚Äî, The search for a bright square cloud‚Äîthe scent of...","It‚Äôs the mystery of the hunt that intrigues me,\nThat drives us like lemmings, but cautiously‚Äî\nThe search for a bright square cloud‚Äîthe scent of ...",beat
218,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,"[PREMONITION, My bones ascend by arsenics of sight., Where noise is all the sound there is to hear,, Beginning in the heart I work towards light.,...","PREMONITION\nMy bones ascend by arsenics of sight.\nWhere noise is all the sound there is to hear,\nBeginning in the heart I work towards light.\n...",beat
219,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29414/the-child,The Child,"[Who were the Lion Men who walked in my dreams, when I was a fat and sleeping babe, in a room whose walls were miracles?, Who were the lion men wi...",Who were the Lion Men who walked in my dreams\nwhen I was a fat and sleeping babe\nin a room whose walls were miracles?\nWho were the lion men wit...,beat


In [573]:
poems.loc[574, 'poem_string']

'I\nIt is possible, in words, to speak\nof what has happened‚Äîa sense\nof there and here, now\nand then. It is some other\nway of being, prized enough,\nthat it makes a common\nground. Once\nyou were\nalone and I\nmet you. It was late\nat night.\nT never'

In [576]:
scan_poem_scraper(poems.loc[574, 'poem_url'])

{'poet': 'Robert Creeley',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/30524/enough-56d214013c576',
 'title': 'Enough',
 'poem_lines': ['I',
  'It is possible, in words, to speak',
  'of what has happened‚Äîa sense',
  'of there and here, now',
  'and then. It is some other',
  'way of being, prized enough,',
  'that it makes a common',
  'ground. Once',
  'you were',
  'alone and I',
  'met you. It was late',
  'at night.',
  'T never'],
 'poem_string': 'I\nIt is possible, in words, to speak\nof what has happened‚Äîa sense\nof there and here, now\nand then. It is some other\nway of being, prized enough,\nthat it makes a common\nground. Once\nyou were\nalone and I\nmet you. It was late\nat night.\nT never'}

In [582]:
text = pytesseract.image_to_string('data/temp.png')
text

'POETRY\n\nleft after that,\nnot to my own mind,\n\nbut stayed\nand stayed. Years\n\nwent by. What\nwere they. Days‚Äî\n\nsome happy,\nbut some bitter\n\nand sad. If I walked\nacross the room, then,\n\nand saw you un-\nexpected, saw the particular\n\nwhiteness of\nyour body, a little\n\nolder, more\ntired‚Äîin words\n\nI possessed it, in\nmy mind I thought, and\n\nyou never knew\nit, there I danced\n\nfor you, stumbling, in\nthe corner of my eye.'

In [591]:
poems[poems.title == 'Four Dream Songs']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
758,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29165/four-dream-songs,Four Dream Songs,"[I, To Ralph Ross, The greens of the Ganges delta foliate., Of heartless youth made late aware he pled:, Brownies, please come., To Henry in his s...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional


In [592]:
poems.loc[758, 'poem_string']

"I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sparest times sometimes\nthe little people spread, & did friendly things;\nthen he was glad.\nPleased, at the worst, except with man, he shook\nthe brightest winter sun.\nAll the green lives\nof the great delta, hours, hurt his migrant heart\nin a safety of the steady plane. Please, please\ncome.\nMy friends,‚Äîhe has been known to mourn,‚ÄîI'll die;\nlive you, in the most wild, kindly, green\npartly forgiving wood,\nsort of forever and all those human sings\nclose not your better ears to, while good Spring\nreturns with a dance and a sigh.\ni\nHenry‚Äôs pelt was put on sundry walls\nwhere it did much resemble Henry and\nthem persons was delighted."

In [645]:
error_poems

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29525/viii-he-yelled-at-me-in-greek',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29166/the-greens-of-the-ganges',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29552/you-search-in-rome-for-rome',
 'https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d2296992

In [637]:
%%time

simple_rescrapes = []
still_errors = []
for url in tqdm(error_poems):
    try:
        info = scan_poem_scraper(url)
        info['poem_url'] = url
        info['genre'] = ''
        simple_rescrapes.append(info)
    except:
        still_errors.append(url)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 102/102 [05:06<00:00,  3.00s/it]

CPU times: user 14.6 s, sys: 45.9 s, total: 1min
Wall time: 5min 6s





In [638]:
simple_rescrapes

[{'poet': 'John Berryman',
  'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
  'title': 'Spellbound Held Subtle Henry',
  'poem_lines': ['the little people spread, & did friendly things;',
   'then he was glad.',
   'Pleased, at the worst, except with man, he shook',
   'the brightest winter sun.',
   'All the green lives',
   'of the great delta, hours, hurt his migrant heart',
   'in a safety of the steady plane. Please, please',
   'come.',
   "My friends,‚Äîhe has been known to mourn,‚ÄîI'll die;",
   'live you, in the most wild, kindly, green',
   'partly forgiving wood,',
   'sort of forever and all those human sings',
   'close not your better ears to, while good Spring',
   'returns with a dance and a sigh.',
   'i',
   'Henry‚Äôs pelt was put on sundry walls',
   'where it did much resemble Henry and',
   'them persons was delighted.'],
  'poem_string': "the little people spread, & did friendly things;\nthen he was glad.\

In [639]:
still_errors

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0',
 'https://www.poetryfoundation.org/poetrymagazine/poems/19645/persons-seen',
 'https://www.poetryfoundation.org/poetrymagazine/poems/32401/sonnets-of-the-blood',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14310/an-evening-meeting-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14321/the-inn-at-the-w

In [648]:
poems[poems.poet == 'William Carlos Williams']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1289,William Carlos Williams,https://www.poetryfoundation.org/poems/148460/spring-and-all-chapter-xiii-thus-weary-of-life,"Spring and All: Chapter XIII [Thus, weary of life]","[Thus, weary of life, in view of the great consummation which awaits us ‚Äî tomorrow, we rush among our friends congratulating ourselves upon the jo...","Thus, weary of life, in view of the great consummation which awaits us ‚Äî tomorrow, we rush among our friends congratulating ourselves upon the joy...",imagist
1290,William Carlos Williams,https://www.poetryfoundation.org/poems/56159/this-is-just-to-say,This Is Just To Say,"[I have eaten, the plums, that were in, the icebox, and which, you were probably, saving, for breakfast, Forgive me, they were delicious, so sweet...",I have eaten\nthe plums\nthat were in\nthe icebox\nand which\nyou were probably\nsaving\nfor breakfast\nForgive me\nthey were delicious\nso sweet\...,imagist
1291,William Carlos Williams,https://www.poetryfoundation.org/poems/148462/spring-and-all-xi-in-passing-with-my-mind,Spring and All: XI [In passing with my mind],"[In passing with my mind, on nothing in the world, but the right of way, I enjoy on the road by, virtue of the law ‚Äî, I saw, an elderly man...",In passing with my mind\non nothing in the world\nbut the right of way\nI enjoy on the road by\nvirtue of the law ‚Äî\nI saw\nan elderly man ...,imagist
1292,William Carlos Williams,https://www.poetryfoundation.org/poems/53078/flowers-by-the-sea-56d23210587cf,Flowers by the Sea,"[When over the flowery, sharp pasture‚Äôs, edge, unseen, the salt ocean, lifts its form‚Äîchicory and daisies, tied, released, seem hardly flowers alo...","When over the flowery, sharp pasture‚Äôs\nedge, unseen, the salt ocean\nlifts its form‚Äîchicory and daisies\ntied, released, seem hardly flowers alon...",imagist
1293,William Carlos Williams,https://www.poetryfoundation.org/poems/49849/between-walls,Between Walls,"[the back wings, of the, hospital where, nothing, will grow lie, cinders, in which shine, the broken, pieces of a green, bottle]",the back wings\nof the\nhospital where\nnothing\nwill grow lie\ncinders\nin which shine\nthe broken\npieces of a green\nbottle,imagist
...,...,...,...,...,...,...
1571,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/14364/the-dark-day,The Dark Day,"[A three-day-long rain from the east‚Äî, An interminable talking, talking, Of no consequence‚Äîpatter, patter, patter., Hand in hand little winds, Blo...","A three-day-long rain from the east‚Äî\nAn interminable talking, talking\nOf no consequence‚Äîpatter, patter, patter.\nHand in hand little winds\nBlow...",imagist
1572,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/13517/summer-song,Summer Song,"[Wanderer moon,, Smiling, A faintly ironical smile, At this brilliant,, Dew-moistened, Summer morning‚Äî, A detached,, Sleepily indifferent, Smile,,...","Wanderer moon,\nSmiling\nA faintly ironical smile\nAt this brilliant,\nDew-moistened\nSummer morning‚Äî\nA detached,\nSleepily indifferent\nSmile,\n...",imagist
1573,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/22430/the-forgotten-city,The Forgotten City,"[When I was coming down from the country, with my mother, the day of the storm,, trees were across the road and small branches, kept rattling on t...","When I was coming down from the country\nwith my mother, the day of the storm,\ntrees were across the road and small branches\nkept rattling on th...",imagist
1574,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/18898/the-unfrocked-priest,The Unfrocked Priest,"[I, When a man had gone, in Russia from a small, town, to the University, he, returned a hero‚Äî, people, bowed down to him‚Äî, his, ego, nourished by...","I\nWhen a man had gone\nin Russia from a small\ntown\nto the University\nhe\nreturned a hero‚Äî\npeople\nbowed down to him‚Äî\nhis\nego, nourished by ...",imagist


In [640]:
error_rescrapes = []

In [641]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=18'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: We Shall Be Free')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [642]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: When Spirit Has No Edge')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [643]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=5&page=6'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Enough: Left After That')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [644]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=3&page=19'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Walking: In My Head')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [654]:
img_data = rq.get('https://static.poetryfoundation.org/jstor/i20572016/pages/13.png').content
with open('data/temp.png', 'wb') as handle:
    handle.write(img_data)
text = pytesseract.image_to_string('data/temp.png')
text

'William Carlos Williams\n\nEPITAPH\n\nAn old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n\n‚ÄúLove is a young green willow\nShimmering at the bare wood‚Äôs edge.‚Äù\n\nSPRING\n\nO my grey hairs!\nYou are truly white as plum blossoms.\n\nSTROLLER\n\nI have seen the hills blue,\n\nI have seen them purple;\n\nAnd it is as hard to know\n\nThe words of a woman\n\nAs to straighten the crumpled branch\nOf an old willow.\n\nMEMORY OF APRIL\n\nYou say love is this, love is that:\nPoplar tassels, willow tendrils\n\nThe wind and the rain comb,\nTinkle and drip, tinkle and drip‚Äî\nBranches drifting apart. Hagh!\nLove has not even visited this country.\n\n[303]'

In [699]:
text = 'William Carlos Williams\n\nEPITAPH\n\nAn old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n\n‚ÄúLove is a young green willow\nShimmering at the bare wood‚Äôs edge.‚Äù\n\nSPIRIT\n\nO my grey hairs!\nYou are truly white as plum blossoms.\n\nSTROLLER\n\nI have seen the hills blue,\n\nI have seen them purple;\n\nAnd it is as hard to know\n\nThe words of a woman\n\nAs to straighten the crumpled branch\nOf an old willow.\n\nMEMORY OF APRIL\n\nYou say love is this, love is that:\nPoplar tassels, willow tendrils\n\nThe wind and the rain comb,\nTinkle and drip, tinkle and drip‚Äî\nBranches drifting apart. Hagh!\nLove has not even visited this country.\n\n[303]'

In [700]:
title = 'Epitaph'
scan_pattern = fr'{title.split()[-1].upper()}\b.*((?:\r?\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'
lines = re.search(scan_pattern, text, re.MULTILINE).group(1).splitlines()
lines

['',
 '',
 'An old willow with hollow branches',
 'Slowly swayed his few high bright tendrils',
 'And sang:',
 '',
 '‚ÄúLove is a young green willow',
 'Shimmering at the bare wood‚Äôs edge.‚Äù']

In [701]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=13&issue=6&page=13'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Epitaph')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
rescrape
# error_rescrapes.append(rescrape)

{'poet': 'William Carlos Williams',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow',
 'title': 'Epitaph',
 'poem_lines': ['An old willow with hollow branches',
  'Slowly swayed his few high bright tendrils',
  'And sang:',
  '‚ÄúLove is a young green willow',
  'Shimmering at the bare wood‚Äôs edge.‚Äù'],
 'poem_string': 'An old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n‚ÄúLove is a young green willow\nShimmering at the bare wood‚Äôs edge.‚Äù',
 'genre': 'imagist'}

In [374]:
from tqdm import tqdm

In [391]:
%%time

rescraped = []
for url in tqdm(error_poems):
    try:
        poem = scan_poem_rescrape(url)
        rescraped.append(poem)
        error_poems.remove(url)
    except:
        continue

 65%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå   | 145/223 [06:41<03:36,  2.77s/it]

CPU times: user 28 s, sys: 52.3 s, total: 1min 20s
Wall time: 6min 41s





In [392]:
len(error_poems)

145

In [393]:
pd.DataFrame(rescraped)[pd.DataFrame(rescraped).poet == 'Kenneth Rexroth']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string


In [384]:
error_poems

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27130/in-my-childhood-when-i-first',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29522/v-tell-it-to-the-forest-fire',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29525/viii-he-yelled-at-me-in-greek',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29166/the-greens

In [400]:
poem_url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'

jumble_pattern = r'-[0-9]+[a-z][0-9a-z]*$'
clean_url = re.sub(jumble_pattern, '', poem_url)
# try:
#     title = soup.find('h1').contents[-1].strip()
# except:
title_pattern = r'[a-z0-9\-]*$'
title = re.search(
    title_pattern,
    clean_url,
    re.I).group().replace(
    '-',
    ' ').title()
    
title

'Walking'

In [403]:
title.split()[-1].upper()

'WALKING'

In [402]:
scan_poem_rescrape('https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892')

AttributeError: 'NoneType' object has no attribute 'group'

In [346]:
type(poems.loc[157,'poem_lines'])

list

In [345]:
# rescrape poem based on index from above 
poems.loc[157,'poem_lines'] = PoemView_rescraper(poems.loc[157,'poem_url'])[0]
poems.loc[157,'poem_string'] = PoemView_rescraper(poems.loc[157,'poem_url'])[1]

In [None]:
# rescrape poem based on index from above 
poems.loc[157,'poem_lines'] = PoemView_rescraper(poems.loc[157,'poem_url'])[0]
poems.loc[157,'poem_string'] = PoemView_rescraper(poems.loc[157,'poem_url'])[1]

poems.loc[165,'poem_lines'] = PoemView_rescraper(poems.loc[165,'poem_url'])[0]
poems.loc[165,'poem_string'] = PoemView_rescraper(poems.loc[165,'poem_url'])[1]

poems.loc[210,'poem_lines'] = PoemView_rescraper(poems.loc[210,'poem_url'])[0]
poems.loc[210,'poem_string'] = PoemView_rescraper(poems.loc[210,'poem_url'])[1]

df_trim.loc[165,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[703,'poem_url'])[0])
df_trim.loc[703,'poem_string'] = PoemView_rescraper(df_trim.loc[703,'poem_url'])[1]

df_trim.loc[952,'poem_lines'] = str(poempara_rescraper(df_trim.loc[952,'poem_url'])[0])
df_trim.loc[952,'poem_string'] = poempara_rescraper(df_trim.loc[952,'poem_url'])[1]

df_trim.loc[953,'poem_lines'] = str(modified_regular_rescraper(df_trim.loc[953,'poem_url'])[0])
df_trim.loc[953,'poem_string'] = modified_regular_rescraper(df_trim.loc[953,'poem_url'])[1]

df_trim.loc[1231,'poem_lines'] = str(justify_rescraper(df_trim.loc[1231,'poem_url'])[0])
df_trim.loc[1231,'poem_string'] = justify_rescraper(df_trim.loc[1231,'poem_url'])[1]

df_trim.loc[1234,'poem_lines'] = str(justify_rescraper(df_trim.loc[1234,'poem_url'])[0])
df_trim.loc[1234,'poem_string'] = justify_rescraper(df_trim.loc[1234,'poem_url'])[1]

df_trim.loc[1389,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1389,'poem_url'])[0])
df_trim.loc[1389,'poem_string'] = PoemView_rescraper(df_trim.loc[1389,'poem_url'])[1]

df_trim.loc[1603,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1603,'poem_url'])[0])
df_trim.loc[1603,'poem_string'] = PoemView_rescraper(df_trim.loc[1603,'poem_url'])[1]

df_trim.loc[2514,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2514,'poem_url'])[0])
df_trim.loc[2514,'poem_string'] = PoemView_rescraper(df_trim.loc[2514,'poem_url'])[1]

df_trim.loc[2517,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2517,'poem_url'])[0])
df_trim.loc[2517,'poem_string'] = PoemView_rescraper(df_trim.loc[2517,'poem_url'])[1]

df_trim.loc[3335,'poem_lines'] = str(ranged_rescraper(df_trim.loc[3335,'poem_url'])[0])
df_trim.loc[3335,'poem_string'] = ranged_rescraper(df_trim.loc[3335,'poem_url'])[1]

df_trim.loc[3418,'poem_lines'] = str(center_rescraper(df_trim.loc[3418,'poem_url'])[0])
df_trim.loc[3418,'poem_string'] = center_rescraper(df_trim.loc[3418,'poem_url'])[1]

df_trim.loc[3421,'poem_lines'] = str(justify_rescraper(df_trim.loc[3421,'poem_url'])[0])
df_trim.loc[3421,'poem_string'] = justify_rescraper(df_trim.loc[3421,'poem_url'])[1]

df_trim.loc[4217,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4217,'poem_url'])[0])
df_trim.loc[4217,'poem_string'] = poempara_rescraper(df_trim.loc[4217,'poem_url'])[1]

df_trim.loc[4611,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4611,'poem_url'])[0])
df_trim.loc[4611,'poem_string'] = poempara_rescraper(df_trim.loc[4611,'poem_url'])[1]

In [338]:
PoemView_rescraper('https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv')

(['XIV1',
  'The brutal Lord of All will rip us from each other‚Äîleave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching‚Äîeyes looking, clouds rising‚Äîlips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream‚Äî: Of Ymir‚Äôs flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya!  ________________',
  'Out of bitterness itself the clear wine of the imagination will be pressed and the dance prosper thereby.  2',
  'To you! whoever you are, wherever you are! (But I know where you are!) There‚Äôs DuÃàrer‚Äôs ‚ÄúNemesis‚Äù naked on her sphere over the little town by the river‚Äîexcept she‚Äôs too old. There‚Äôs a dancing burgess by Tenier and Villon‚Äôs maitresse‚Äîafter he‚Äôd gone bald and was skin pocked and toothless: she that had him ducked in the sewage drain. Then there‚Äôs that miller‚Äôs daughter of ‚Äúbuttocks broad and breastes high.‚Äù So

In [333]:
from unicodedata import normalize

In [337]:
normalize('NFKD', lines_raw[1]).replace('\ufeff', '')

'\n       The brutal Lord of All will rip us from each other‚Äîleave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching‚Äîeyes looking, clouds rising‚Äîlips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream‚Äî: Of Ymir‚Äôs flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya!  ________________'

In [None]:
[line.normalize('NFKD', )]

In [331]:
poem_url = 'https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv'

page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')
lines_raw = soup.find(
                    'div', {
                        'data-view': 'PoemView'}).get_text().split('\r')

lines_raw

['\nXIV1',
 '\n\xa0 \xa0\xa0\xa0\xa0 The brutal Lord of All will rip us from each other‚Äîleave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching‚Äîeyes looking, clouds rising‚Äîlips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream‚Äî: Of Ymir‚Äô\ufeff\ufeffs flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya! \xa0________________',
 '\n\xa0\xa0\xa0\xa0\xa0 Out of bitterness itself the clear wine of the imagination will be pressed and the dance prosper thereby. \xa02',
 '\n\xa0 \xa0\xa0\xa0\xa0 To you! whoever you are, wherever you are! (But I know where you are!) There‚Äô\ufeff\ufeffs D√º\ufeffrer‚Äô\ufeff\ufeffs ‚ÄúNemesis‚Äù naked on her sphere over the little town by the river‚Äîexcept she‚Äô\ufeff\ufeffs too old. There‚Äô\ufeff\ufeff\ufeffs a dancing burgess by Tenier and Villon‚Äô\ufeff\ufeff\ufeffs maitresse‚Äîafter he‚Äô\ufeff\ufeff\

In [None]:
page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')

In [135]:
text_poems = text_poems[text_poems.poem_string != ''].reset_index(drop=True)
text_poems.shape

(3082, 6)

In [136]:
text_poems.to_csv('data/text_poems.csv')

In [137]:
text_poems[text_poems.poet == 'William Carlos Williams']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
810,William Carlos Williams,https://www.poetryfoundation.org/poems/148460/spring-and-all-chapter-xiii-thus-weary-of-life,"Spring and All: Chapter XIII [Thus, weary of life]","[Thus, weary of life, in view of the great consummation which awaits us ‚Äî tomorrow, we rush among our friends congratulating ourselves upon the jo...","Thus, weary of life, in view of the great consummation which awaits us ‚Äî tomorrow, we rush among our friends congratulating ourselves upon the joy...",imagist
811,William Carlos Williams,https://www.poetryfoundation.org/poems/56159/this-is-just-to-say,This Is Just To Say,"[I have eaten, the plums, that were in, the icebox, and which, you were probably, saving, for breakfast, Forgive me, they were delicious, so sweet...",I have eaten\nthe plums\nthat were in\nthe icebox\nand which\nyou were probably\nsaving\nfor breakfast\nForgive me\nthey were delicious\nso sweet\...,imagist
812,William Carlos Williams,https://www.poetryfoundation.org/poems/148462/spring-and-all-xi-in-passing-with-my-mind,Spring and All: XI [In passing with my mind],"[In passing with my mind, on nothing in the world, but the right of way, I enjoy on the road by, virtue of the law¬† ¬†¬†¬†¬†¬†¬†‚Äî, I saw, an elderly man...",In passing with my mind\non nothing in the world\nbut the right of way\nI enjoy on the road by\nvirtue of the law¬† ¬†¬†¬†¬†¬†¬†‚Äî\nI saw\nan elderly man ...,imagist
813,William Carlos Williams,https://www.poetryfoundation.org/poems/53078/flowers-by-the-sea-56d23210587cf,Flowers by the Sea,"[When over the flowery, sharp pasture‚Äôs, edge, unseen, the salt ocean, lifts its form‚Äîchicory and daisies, tied, released, seem hardly flowers alo...","When over the flowery, sharp pasture‚Äôs\nedge, unseen, the salt ocean\nlifts its form‚Äîchicory and daisies\ntied, released, seem hardly flowers alon...",imagist
814,William Carlos Williams,https://www.poetryfoundation.org/poems/49849/between-walls,Between Walls,"[the back wings, of the, hospital where, nothing, will grow lie, cinders, in which shine, the broken, pieces of a green, bottle]",the back wings\nof the\nhospital where\nnothing\nwill grow lie\ncinders\nin which shine\nthe broken\npieces of a green\nbottle,imagist
815,William Carlos Williams,https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv,Kora in Hell: Improvisations XIÔªøV,"[XIV1, The brutal Lord of All will rip us from each other‚Äîleave the one to suffer here alone. No need belief in god or hell to postulate that much...",XIV1\nThe brutal Lord of All will rip us from each other‚Äîleave the one to suffer here alone. No need belief in god or hell to postulate that much....,imagist
816,William Carlos Williams,https://www.poetryfoundation.org/poems/46485/to-elsie,To Elsie,"[The pure products of America, go crazy‚Äî, mountain folk from Kentucky, or the ribbed north end of, Jersey, with its isolate lakes and, valleys, it...","The pure products of America\ngo crazy‚Äî\nmountain folk from Kentucky\nor the ribbed north end of\nJersey\nwith its isolate lakes and\nvalleys, its...",imagist
817,William Carlos Williams,https://www.poetryfoundation.org/poems/54564/kora-in-hell-improvisations-xxvii,Kora in Hell: Improvisations XXVII,"[XXVII ¬† 1, This particular thing, whether it be four pinches of four divers white powders cleverly compounded to cure surely, safely, pleasantly ...","XXVII ¬† 1\nThis particular thing, whether it be four pinches of four divers white powders cleverly compounded to cure surely, safely, pleasantly a...",imagist
818,William Carlos Williams,https://www.poetryfoundation.org/poems/54326/love-song-56d2348bab385,Love Song,"[I lie here thinking of you:‚Äî ¬† the stain of love is upon the world! Yellow, yellow, yellow it eats into the leaves, smears with saffron the horn...","I lie here thinking of you:‚Äî ¬† the stain of love is upon the world! Yellow, yellow, yellow it eats into the leaves, smears with saffron the horne...",imagist
819,William Carlos Williams,https://www.poetryfoundation.org/poems/46484/queen-annes-lace,Queen-Anne‚Äôs Lace,"[Her body is not so white as, anemony petals nor so smooth‚Äînor, so remote a thing. It is a field, of the wild carrot taking, the field by force; t...",Her body is not so white as\nanemony petals nor so smooth‚Äînor\nso remote a thing. It is a field\nof the wild carrot taking\nthe field by force; th...,imagist


In [202]:
poet_poems_url_dict

{'augustan': [{'https://www.poetryfoundation.org/poets/mary-barber': (['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage'],
    [])},
  {'https://www.poetryfoundation.org/poets/susanna-blamire': (['https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
     'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
     'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man'],
    [])},
  {'https://www.poetryfoundation.org/poets/henry-carey': (['https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley'],
    [])},
  {'https://www.poetryfoundation.org/poets/thomas-chatterton': (['https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
     'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude'],
    [])},
  {'https://www.poetryfoundation.org/poets/william-collins': (['https://www.poetryfoundation.org/poems/44003/ode-to-evening',
     'https://www.poetryfoundation.

In [204]:
poet_poems_url_dict['augustan']

[{'https://www.poetryfoundation.org/poets/mary-barber': (['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage'],
   [])},
 {'https://www.poetryfoundation.org/poets/susanna-blamire': (['https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
    'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
    'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man'],
   [])},
 {'https://www.poetryfoundation.org/poets/henry-carey': (['https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley'],
   [])},
 {'https://www.poetryfoundation.org/poets/thomas-chatterton': (['https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
    'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude'],
   [])},
 {'https://www.poetryfoundation.org/poets/william-collins': (['https://www.poetryfoundation.org/poems/44003/ode-to-evening',
    'https://www.poetryfoundation.org/poems/44002/an-ode-on

In [207]:
test = {genre:{'text_urls':[],'scan_urls':[]} for genre in poet_poems_url_dict}
for genre,poets in poet_poems_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            test[genre]['text_urls'].extend(poems[0])
            test[genre]['scan_urls'].extend(poems[1])
            
test

{'augustan': {'text_urls': ['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage',
   'https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
   'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
   'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man',
   'https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley',
   'https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
   'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude',
   'https://www.poetryfoundation.org/poems/44003/ode-to-evening',
   'https://www.poetryfoundation.org/poems/44002/an-ode-on-the-popular-superstitions-of-the-highlands-of-scotland-considered-as-the-subject-of-poetry',
   'https://www.poetryfoundation.org/poems/52293/eclogue-the-second-hassan-or-the-camel-driver',
   'https://www.poetryfoundation.org/poems/44001/ode-on-the-poetical-character',
   'https://www.poetryfoundation.org

In [201]:
%%time

poem_dicts = []
error_poems = []
for genre,poets in poet_poems_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            for text_url in poems[0]:
                poem = text_poem_scraper(text_url)
                poem['genre'] = genre
                poem['poem_url'] = text_url
                poem_dicts.append(poem)
            
            if poems[1]:
                for scan_url in poems[1]:
                    try:
                        poem = text_poem_scraper(scan_url)
                        poem['genre'] = genre
                        poem['poem_url'] = scan_url
                        poem_dicts.append(poem)
                        poems[0].append(scan_url)
                        poems[1].remove(scan_url)
                    except:
                        try:
                            poem = scan_poem_scraper(scan_url)
                            poem['genre'] = genre
                            poem['poet_url'] = scan_url
                            poem_dicts.append(poem)
                        except:
                            error_poems.append(scan_url)

KeyboardInterrupt: 

In [197]:
'https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool' in image_urls

True

In [196]:
text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool')

{'poet': 'H. D.',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool',
 'title': 'The Pool',
 'poem_lines': ['Are you alive?',
  'I touch you.',
  'You quiver like a sea-fish.',
  'I cover you with my net.',
  'What are you‚Äîbanded one?'],
 'poem_string': 'Are you alive?\nI touch you.\nYou quiver like a sea-fish.\nI cover you with my net.\nWhat are you‚Äîbanded one?'}

In [195]:
poet_poems_url_dict['imagist'][1]

{'https://www.poetryfoundation.org/poets/h-d': (['https://www.poetryfoundation.org/poems/47927/leda-56d228c3a5948',
   'https://www.poetryfoundation.org/poems/51856/evening-56d22fe15dc07',
   'https://www.poetryfoundation.org/poems/44133/cassandra-56d2231be6015',
   'https://www.poetryfoundation.org/poems/48186/oread',
   'https://www.poetryfoundation.org/poems/48187/sea-poppies',
   'https://www.poetryfoundation.org/poems/46541/helen-56d22674d6e41',
   'https://www.poetryfoundation.org/poems/48189/sheltered-garden',
   'https://www.poetryfoundation.org/poems/44134/cities',
   'https://www.poetryfoundation.org/poems/48188/sea-rose',
   'https://www.poetryfoundation.org/poems/53970/sea-heroes',
   'https://www.poetryfoundation.org/poems/51870/sea-iris',
   'https://www.poetryfoundation.org/poems/51869/eurydice-56d22fe6d049d',
   'https://www.poetryfoundation.org/poems/48190/wash-of-cold-river'],
  ['https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool',
   'https://www.p

In [194]:
len(image_urls)

2221

In [193]:
text_poems = pd.DataFrame(poem_dicts)
text_poems.shape

(3084, 7)

In [190]:
text_poems = text_poems[text_poems.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1321,Dylan Thomas,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Poem on His Birthday [Facs. drafts],[],,modern
1433,Barbara Guest,https://www.poetryfoundation.org/poems/49367/imagined-room,Imagined Room,[],,new_york_school


In [184]:
poem_url = 'https://www.poetryfoundation.org/poems/51653/to-a-poor-old-woman'

# load a page and soupify it
page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')

# most frequent formatting
lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
# normalize text (from unicode)
lines = [normalize('NFKD', str(line.contents[0])) for line in lines_raw if line.contents]
# remove some hanging html
lines = [line.replace('<br/>', '') for line in lines]
line_pattern = '>(.*?)<'
lines = [re.search(line_pattern, line, re.I).group(1) if '<' in line else line for line in lines]
# scrape poem
# lines_raw = soup.find('div', {'data-view': 'PoemView'}).strings
# lines = [line.strip() for line in lines_raw if line.strip()]

# if not lines:
#     lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
#     lines = [line.get_text().strip() for line in lines_raw if line.get_text().strip()]

# # create string version of poem
# poem_string = '\n'.join(lines)

# info = {'poet': poet,
#         'poem_url': poem_url,
#         'title': title,
#         'poem_lines': lines,
#         'poem_string': poem_string}

lines

['munching a plum on   ',
 '\r the street a paper bag',
 '\r of them in her hand',
 '',
 '\r They taste good to her',
 '\r They taste good   ',
 '\r to her. They taste',
 '\r good to her',
 '',
 '\r You can see it by',
 '\r the way she gives herself',
 '\r to the one half',
 '\r sucked out in her hand',
 '',
 'Comforted',
 '\r a solace of ripe plums',
 '\r seeming to fill the air',
 '\r They taste good to her',
 '']

In [181]:
lines_raw

[<div style="text-indent: -1em; padding-left: 1em;">munching a plum on¬†¬†¬†<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  the street a paper bag<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  of them in her hand<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;"><br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  They taste good to her<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  They taste good¬†¬†¬†<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  to her. They taste<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  good to her<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;"><br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  You can see it by<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  the way she gives herself<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  to the one half<br/></div>,

In [177]:
lines2 = [str(line) for line in lines_raw]
lines2

['\n',
 '\n',
 '\n',
 '\n',
 'Highlight Actions',
 '\n',
 'Enable or disable annotations',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'munching a plum on\xa0\xa0\xa0',
 '\r the street a paper bag',
 '\r of them in her hand',
 '\r They taste good to her',
 '\r They taste good\xa0\xa0\xa0',
 '\r to her. They taste',
 '\r good to her',
 '\r You can see it by',
 '\r the way she gives herself',
 '\r to the one half',
 '\r sucked out in her hand',
 'Comforted',
 'Comforted',
 ' When originally published in the journal ',
 'Smoke',
 ' (Autumn 1934), the line read: ‚ÄúComforted, Relieved‚Äî‚Äù',
 '\r a solace of ripe plums',
 '\r seeming to fill the air',
 '\r They taste good to her',
 '\n']

In [178]:
[line.strip() for line in lines2 if line.strip()]

['Highlight Actions',
 'Enable or disable annotations',
 'munching a plum on',
 'the street a paper bag',
 'of them in her hand',
 'They taste good to her',
 'They taste good',
 'to her. They taste',
 'good to her',
 'You can see it by',
 'the way she gives herself',
 'to the one half',
 'sucked out in her hand',
 'Comforted',
 'Comforted',
 'When originally published in the journal',
 'Smoke',
 '(Autumn 1934), the line read: ‚ÄúComforted, Relieved‚Äî‚Äù',
 'a solace of ripe plums',
 'seeming to fill the air',
 'They taste good to her']

In [113]:
pd.read_csv('data/text_poems.csv', index_col=0)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Mary Barber,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Advice to Her Son on Marriage,[],,augustan
1,Susanna Blamire,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Auld Robin Forbes,"['And auld Robin Forbes hes gien tem a dance,', 'I pat on my speckets to see them aw prance;', 'I thout o‚Äô the days when I was but fifteen,', 'And...","And auld Robin Forbes hes gien tem a dance,\nI pat on my speckets to see them aw prance;\nI thout o‚Äô the days when I was but fifteen,\nAnd skipp‚Äôd...",augustan
2,Susanna Blamire,https://www.poetryfoundation.org/poems/50532/the-siller-croun,The Siller Croun,"['And ye shall walk in silk attire,', 'And siller hae to spare,', 'Gin ye‚Äôll consent to be his bride,', 'Nor think o‚Äô Donald mair.', 'O wha wad bu...","And ye shall walk in silk attire,\nAnd siller hae to spare,\nGin ye‚Äôll consent to be his bride,\nNor think o‚Äô Donald mair.\nO wha wad buy a silken...",augustan
3,Susanna Blamire,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,O Donald! Ye Are Just the Man,"['O Donald! ye are just the man', 'Who, when he‚Äôs got a wife,', 'Begins to fratch‚Äî nae notice ta‚Äôen‚Äî', 'They‚Äôre strangers a‚Äô their life.', 'The fa...","O Donald! ye are just the man\nWho, when he‚Äôs got a wife,\nBegins to fratch‚Äî nae notice ta‚Äôen‚Äî\nThey‚Äôre strangers a‚Äô their life.\nThe fan may drop...",augustan
4,Henry Carey,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,The Ballad of Sally in our Alley,"['The ARGUMENT. A Vulgar Error having long prevailed among many Persons, who imagine Sally Salisbury the Subject of this Ballad, the Author begs ...","The ARGUMENT. A Vulgar Error having long prevailed among many Persons, who imagine Sally Salisbury the Subject of this Ballad, the Author begs le...",augustan
...,...,...,...,...,...,...
3079,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45487/in-school-days,In School-days,"['Still sits the school-house by the road, \xa0\xa0\xa0A ragged beggar sleeping; Around it still the sumachs grow, \xa0\xa0\xa0And blackberry-vine...","Still sits the school-house by the road, ¬†¬†¬†A ragged beggar sleeping; Around it still the sumachs grow, ¬†¬†¬†And blackberry-vines are creeping. With...",victorian
3080,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45483/barbara-frietchie,Barbara Frietchie,"['Up from the meadows rich with corn,', 'Clear in the cool September morn,', 'The clustered spires of Frederick stand', 'Green-walled by the hills...","Up from the meadows rich with corn,\nClear in the cool September morn,\nThe clustered spires of Frederick stand\nGreen-walled by the hills of Mary...",victorian
3081,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45489/skipper-iresons-ride,Skipper Ireson‚Äôs Ride,"['Of all the rides since the birth of time,', 'Told in story or sung in rhyme, ‚Äî', 'On Apuleius‚Äôs Golden Ass,', 'Or one-eyed Calender‚Äôs horse of b...","Of all the rides since the birth of time,\nTold in story or sung in rhyme, ‚Äî\nOn Apuleius‚Äôs Golden Ass,\nOr one-eyed Calender‚Äôs horse of brass,\nW...",victorian
3082,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45493/the-worship-of-nature,The Worship of Nature,['The harp at Nature‚Äôs advent strung \xa0\xa0\xa0\xa0\xa0\xa0Has never ceased to play; The song the stars of morning sung \xa0\xa0\xa0\xa0\xa0\xa0...,The harp at Nature‚Äôs advent strung ¬†¬†¬†¬†¬†¬†Has never ceased to play; The song the stars of morning sung ¬†¬†¬†¬†¬†¬†Has never died away. And prayer is mad...,victorian


In [110]:
# uncomment to save
with gzip.open('data/text_poems.pkl', 'wb') as goodbye:
    pickle.dump(text_poems, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# uncomment to load
with gzip.open('data/text_poems.pkl', 'rb') as hello:
    df = pickle.load(hello)

RecursionError: maximum recursion depth exceeded while getting the str of an object

In [37]:
page = rq.get(list(poet_urls_dict['augustan'][10].values())[0][0][0])
soup = bs(page.content, 'html.parser')

In [38]:
poet = soup.find('a', href=re.compile('.*/poets/.*')).contents[0]
title = soup.find('h1').contents[-1].strip()
poet,title

('Thomas Gray', 'On the Death of Richard West')

In [80]:
pd.DataFrame(text_poem_scraper(list(poet_poems_url_dict['black_mountain'][1].values())[0][0][0]))

Unnamed: 0,0
0,Robert Creeley
1,After Frost
2,"[He comes here, by whatever way he can,, not too late,, not too soon., He sits, waiting., He doesn‚Äôt know, why he should, have such a patience., H..."
3,"He comes here\nby whatever way he can,\nnot too late,\nnot too soon.\nHe sits, waiting.\nHe doesn‚Äôt know\nwhy he should\nhave such a patience.\nHe..."


In [39]:
# most frequent formatting
lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
lines_raw

[<div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And reddening Ph≈ìbus lifts his golden fire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race to cheer,
 <br/></div

In [41]:
# if 'text-align' is justified
lines_raw = soup.find_all('div', {'style': 'text-align: justify;'})
lines_raw

[]

In [50]:
lines_raw

['\n',
 <div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And reddening Ph≈ìbus lifts his golden fire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race to cheer,
 <br

In [65]:
lines_raw = soup.find('div', {'data-view': 'PoemView'}).get_text().split('\r')
lines = [line.strip() for line in lines_raw if line.strip()]
lines

['In vain to me the smiling Mornings shine,',
 'And reddening Ph≈ìbus lifts his golden fire;',
 'The birds in vain their amorous descant join;',
 'Or cheerful fields resume their green attire;',
 'These ears, alas! for other notes repine,',
 'A different object do these eyes require;',
 'My lonely anguish melts no heart but mine;',
 'And in my breast the imperfect joys expire.',
 'Yet Morning smiles the busy race to cheer,',
 'And new-born pleasure brings to happier men;',
 'The fields to all their wonted tribute bear;',
 'To warm their little loves the birds complain;',
 'I fruitless mourn to him that cannot hear,',
 'And weep the more because I weep in vain.']

In [64]:
lines_raw

['\nIn vain to me the smiling Mornings shine,',
 '    And reddening Ph≈ìbus lifts his golden fire;',
 'The birds in vain their amorous descant join;',
 '    Or cheerful fields resume their green attire;',
 'These ears, alas! for other notes repine,',
 '    A different object do these eyes require;',
 'My lonely anguish melts no heart but mine;',
 '    And in my breast the imperfect joys expire.',
 'Yet Morning smiles the busy race to cheer,',
 '    And new-born pleasure brings to happier men;',
 'The fields to all their wonted tribute bear;',
 '    To warm their little loves the birds complain;',
 'I fruitless mourn to him that cannot hear,',
 '    And weep the more because I weep in vain.',
 '\n']

In [49]:
# scrape 'PoemView' html type
lines_raw = soup.find('div', {'data-view': 'PoemView'})

line_pattern = '>(.*?)<'
lines = [re.search(line_pattern, line, re.I).group(1) if '<' in line else line for line in lines]

# normalize text (from unicode)
lines = [normalize('NFKD', str(line)) for line in lines_raw if line]

# lines = [line.replace('<br/>', '') for line in lines]
lines = [line.strip() for line in lines if line]
lines

['',
 '<div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    And reddening Ph≈ìbus lifts his golden fire;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race

- **Check for duplicate values**

In [4]:
# create dataframe from poet_urls_dict
poet_df = pd.DataFrame([(genre,v) for genre in poet_urls_dict.keys() for v in poet_urls_dict[genre]])

# check if any URLs appear more than once
pd.concat(g for _, g in poet_df.groupby(1) if len(g) > 1)

Unnamed: 0,0,1
126,imagist,https://www.poetryfoundation.org/poets/ezra-pound
186,modern,https://www.poetryfoundation.org/poets/ezra-pound
122,imagist,https://www.poetryfoundation.org/poets/richard-aldington
150,modern,https://www.poetryfoundation.org/poets/richard-aldington


- **I'll give those poets to the imagist genre, since it has so few already.**

In [5]:
# list of duplicate URLs
dups = [value for value in poet_df[poet_df.duplicated(1)][1]]
dups

['https://www.poetryfoundation.org/poets/richard-aldington',
 'https://www.poetryfoundation.org/poets/ezra-pound']

In [6]:
# number of modern poets before
len(poet_urls_dict['modern'])

54

In [7]:
# re-listify the modernist URLs without Pound and Aldington
poet_urls_dict['modern'] = [url for url in poet_urls_dict['modern'] if url not in dups]

# number of modern poets after
len(poet_urls_dict['modern'])

52

## Build a dataframe
- **Scrape poems and other info.**

In [15]:
%%time

# instantiate an empty dataframe
df = pd.DataFrame()

# loop over each genre, create dataframe with desired information,
# concat to original dataframe, then save it before looping again
for genre in list(poet_urls_dict.keys()):
    genre_df = pf_scraper(poet_urls_dict, genre, 0.5)
    df = pd.concat([df, genre_df])
    df.to_csv('data/poetry_foundation_raw.csv')

KeyboardInterrupt: 

### Save/load dataframe

In [2]:
# # uncomment to save
# df.to_csv('data/poetry_foundation_raw.csv')

# # uncomment to load
# df = pd.read_csv('data/poetry_foundation_raw.csv', index_col=0)

In [3]:
# rename the columns
df.columns = ['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_lines', 'poem_string']
df.head()

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
0,https://www.poetryfoundation.org/poets/mary-barber,augustan,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Mary Barber,Advice to Her Son on Marriage,,"['When you gain her Affection, take care to preserve it;\r', 'Lest others persuade her, you do not deserve it.\r', 'Still study to heighten the Jo...","When you gain her Affection, take care to preserve it;\r\nLest others persuade her, you do not deserve it.\r\nStill study to heighten the Joys of ..."
1,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Susanna Blamire,Auld Robin Forbes,,"['And auld Robin Forbes hes gien tem a dance,\r', 'I pat on my speckets to see them aw prance;\r', 'I thout o‚Äô the days when I was but fifteen,\r'...","And auld Robin Forbes hes gien tem a dance,\r\nI pat on my speckets to see them aw prance;\r\nI thout o‚Äô the days when I was but fifteen,\r\nAnd s..."
2,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,Susanna Blamire,O Donald! Ye Are Just the Man,,"['O Donald! ye are just the man\r', ' Who, when he‚Äôs got a wife,\r', 'Begins to fratch‚Äî nae notice ta‚Äôen‚Äî\r', ' They‚Äôre strangers a‚Äô their life....","O Donald! ye are just the man\r\n Who, when he‚Äôs got a wife,\r\nBegins to fratch‚Äî nae notice ta‚Äôen‚Äî\r\n They‚Äôre strangers a‚Äô their life.\r\n\nTh..."
3,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50532/the-siller-croun,Susanna Blamire,The Siller Croun,,"['And ye shall walk in silk attire,\r', ' And siller hae to spare,\r', 'Gin ye‚Äôll consent to be his bride,\r', ' Nor think o‚Äô Donald mair.\r'...","And ye shall walk in silk attire,\r\n And siller hae to spare,\r\nGin ye‚Äôll consent to be his bride,\r\n Nor think o‚Äô Donald mair.\r\nO wha w..."
4,https://www.poetryfoundation.org/poets/henry-carey,augustan,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,Henry Carey,The Ballad of Sally in our Alley,,"['Of all the Girls that are so smart\r', ' There‚Äôs none like pretty SALLY,\r', 'She is the Darling of my Heart,\r', ' And she lives in our...","Of all the Girls that are so smart\r\n There‚Äôs none like pretty SALLY,\r\nShe is the Darling of my Heart,\r\n And she lives in our Alley.\..."


- **Explore how the data looks.**

In [4]:
df.shape

(5295, 8)

In [5]:
df.genre.unique()

array(['augustan', 'beat', 'black_arts_movement', 'black_mountain',
       'confessional', 'fugitive', 'georgian', 'harlem_renaissance',
       'imagist', 'language_poetry', 'middle_english', 'modern',
       'new_york_school', 'new_york_school_2nd_generation', 'objectivist',
       'renaissance', 'romantic', 'victorian'], dtype=object)

In [6]:
df.genre.value_counts()

modern                            1324
victorian                          674
renaissance                        430
romantic                           407
imagist                            370
new_york_school                    265
black_mountain                     257
new_york_school_2nd_generation     193
language_poetry                    192
confessional                       176
georgian                           167
black_arts_movement                165
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           121
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

- **Check for duplicate values across multiple columns and drop those rows.**

In [7]:
df.duplicated(subset=['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_string'], keep='last').sum()

98

In [8]:
# drop duplicates
df.drop_duplicates(subset=['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_string'],
                   keep='last',
                   inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)

In [9]:
# check changes
df.shape

(5197, 8)

In [10]:
df.genre.value_counts()

modern                            1284
victorian                          643
renaissance                        427
romantic                           398
imagist                            370
new_york_school                    265
black_mountain                     257
new_york_school_2nd_generation     192
language_poetry                    192
confessional                       176
black_arts_movement                165
georgian                           160
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           114
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

- **Looks like the poem_lines column converted to a list inside of a string while saving to CSV.**
- **I'll wait to convert it until I can fill some missing values for that column, a process I found to be more easily done as a list inside of a string.**

In [11]:
df.loc[0,'poem_lines']

"['When you gain her Affection, take care to preserve it;\\r', 'Lest others persuade her, you do not deserve it.\\r', 'Still study to heighten the Joys of her Life;\\r', 'Not treat her the worse, for her being your Wife.\\r', 'If in Judgment she errs, set her right, without Pride:\\r', '‚ÄôTis the Province of insolent Fools, to deride.\\r', 'A Husband‚Äôs first Praise, is a ', 'Then change not these Titles, for ', 'Let your Person be neat, unaffectedly clean,\\r', 'Tho‚Äô alone with your wife the whole Day you remain.\\r', 'Chuse Books, for her study, to fashion her Mind,\\r', 'To emulate those who excell‚Äôd of her Kind.\\r', 'Be Religion the principal Care of your Life,\\r', 'As you hope to be blest in your Children and Wife:\\r', 'So you, in your Marriage, shall gain its true End;\\r', 'And find, in your Wife, a ', '', '']"

- **Check for missing values.**

In [12]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet             13
title           215
year           1649
poem_lines      410
poem_string     412
dtype: int64

In [13]:
df[df.poet.isna()]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
858,https://www.poetryfoundation.org/poets/w-d-snodgrass,confessional,https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0,,,2006.0,"['ILEANA MALANCIOIU', '', 'Road', '', 'I walk on a dark road so that I won‚Äôt see', '', 'The way my young oxen limp so much;', '', 'The horseshoes ...",ILEANA MALANCIOIU\n\nRoad\n\nI walk on a dark road so that I won‚Äôt see\n\nThe way my young oxen limp so much;\n\nThe horseshoes gouging into their...
1409,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14311/after-how-many-years-tr-by-amy-lowell-and-florence-ayscough,,After How Many Years Tr By Amy Lowell And Florence Ayscough,1919.0,,
1410,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14312/calligraphy-tr-by-amy-lowell-and-florence-ayscough,,Calligraphy Tr By Amy Lowell And Florence Ayscough,1919.0,,
1411,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14322/the-emperors-return-from-a-journey-to-the-south-tr-by-amy-lowell-and-florence-ayscough,,The Emperors Return From A Journey To The South Tr By Amy Lowell And Florence Ayscough,1919.0,,
1412,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14310/an-evening-meeting-tr-by-amy-lowell-and-florence-ayscough,,An Evening Meeting Tr By Amy Lowell And Florence Ayscough,1919.0,,
1413,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14314/from-the-straw-hut-among-the-seven-peaks-tr-by-amy-lowell-and-florence-ayscough,,From The Straw Hut Among The Seven Peaks Tr By Amy Lowell And Florence Ayscough,1919.0,,
1414,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14321/the-inn-at-the-western-lake-tr-by-amy-lowell-and-florence-ayscough,,The Inn At The Western Lake Tr By Amy Lowell And Florence Ayscough,1919.0,,
1415,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14296/on-seeing-the-portrait-of-a-beautiful-concubine-tr-by-amy-lowell-and-florence-ayscough,,On Seeing The Portrait Of A Beautiful Concubine Tr By Amy Lowell And Florence Ayscough,1919.0,,
1416,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14316/on-the-classic-of-the-hills-and-sea-tr-by-amy-lowell-and-florence-ayscough,,On The Classic Of The Hills And Sea Tr By Amy Lowell And Florence Ayscough,1919.0,,
1417,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14313/one-goes-a-journey-tr-by-amy-lowell-and-florence-ayscough,,One Goes A Journey Tr By Amy Lowell And Florence Ayscough,1919.0,,


- **The Amy Lowell and Ben Jonson entries appear unuseable, so I'll drop those rows.**
- **I'll go ahead and fill in the missing info for the Snodgrass poem (which is actually a translation of another poet, but a Confessional translator will probably produce a Confessional work).**

In [14]:
# manually load in information to the poet and title column
df.loc[858,'poet'] = 'ILEANA MALANCIOIU'.title()
df.loc[858,'title'] = 'Road'
df[df.index == 858]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
858,https://www.poetryfoundation.org/poets/w-d-snodgrass,confessional,https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0,Ileana Malancioiu,Road,2006.0,"['ILEANA MALANCIOIU', '', 'Road', '', 'I walk on a dark road so that I won‚Äôt see', '', 'The way my young oxen limp so much;', '', 'The horseshoes ...",ILEANA MALANCIOIU\n\nRoad\n\nI walk on a dark road so that I won‚Äôt see\n\nThe way my young oxen limp so much;\n\nThe horseshoes gouging into their...


In [15]:
# drop the rows with missing values in the poet column
df.dropna(subset=['poet'], inplace=True)

In [16]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines      398
poem_string     400
dtype: int64

## Rescraping
- **After reworking the scraping function a bit, I can try to fill in some missing poem_lines and poem_string values.**

### Round 1

In [17]:
# create a list of index numbers with NaN values in the poem_lines column
lookups = list(df[df.poem_lines.isna()].index)
lookups

[158,
 168,
 169,
 171,
 175,
 183,
 184,
 200,
 203,
 210,
 229,
 254,
 283,
 324,
 325,
 336,
 351,
 354,
 361,
 458,
 466,
 482,
 484,
 487,
 490,
 503,
 511,
 512,
 513,
 531,
 532,
 542,
 558,
 568,
 576,
 578,
 624,
 626,
 648,
 660,
 661,
 663,
 664,
 694,
 701,
 702,
 703,
 704,
 705,
 707,
 708,
 711,
 714,
 715,
 716,
 717,
 719,
 727,
 736,
 749,
 751,
 753,
 769,
 770,
 817,
 834,
 853,
 872,
 881,
 885,
 886,
 892,
 897,
 900,
 917,
 921,
 940,
 942,
 943,
 944,
 945,
 946,
 947,
 1004,
 1025,
 1123,
 1163,
 1169,
 1171,
 1184,
 1186,
 1192,
 1234,
 1297,
 1299,
 1319,
 1326,
 1345,
 1348,
 1363,
 1367,
 1371,
 1379,
 1383,
 1392,
 1395,
 1404,
 1440,
 1446,
 1452,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1515,
 1516,
 1517,
 1518,
 1519,
 1551,
 1552,
 1553,
 1554,
 1555,
 1556,
 1560,
 1565,
 1566,
 1587,
 1591,
 1594,
 1602,
 1604,
 1617,
 1618,
 1623,
 1631,
 1711,
 1731,
 1732,
 1743,
 1748,
 1770,
 1786,
 1815,
 1816

In [18]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups:
    info = poem_scraper(df.loc[i, 'poem_url'])
    try:
        df.loc[i,'poem_lines'] = str(info[3])
        df.loc[i,'poem_string'] = info[4]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 158
Success -- 168
Success -- 169
Success -- 171
Success -- 175
Success -- 183
Success -- 184
Success -- 200
Success -- 203
Success -- 210
Success -- 229
Success -- 254
Success -- 283
Success -- 324
Success -- 325
Success -- 336
Success -- 351
Success -- 354
Success -- 361
Success -- 458
Success -- 466
Success -- 482
Success -- 484
Success -- 487
Success -- 490
Success -- 503
Success -- 511
Success -- 512
Success -- 513
Success -- 531
Success -- 532
Success -- 542
Success -- 558
Success -- 568
Success -- 576
Success -- 578
Success -- 624
Success -- 626
Success -- 648
Success -- 660
Success -- 661
Success -- 663
Success -- 664
Success -- 694
Success -- 701
Success -- 702
Success -- 703
Success -- 704
Success -- 705
Success -- 707
Success -- 708
Success -- 711
Success -- 714
Success -- 715
Success -- 716
Success -- 717
Success -- 719
Success -- 727
Success -- 736
Success -- 749
Success -- 751
Success -- 753
Success -- 769
Success -- 770
Success -- 817
Success -- 834
Success --

- **Looks like the loop was somewhat successful though it did turn NaN values into the string 'nan'.**
- **I'll look first for other NaNs I may want to get rid of.**

In [20]:
df['poem_lines'] = df['poem_lines'].apply(destringify)

In [21]:
df.loc[0,'poem_lines']

['When you gain her Affection, take care to preserve it;\r',
 'Lest others persuade her, you do not deserve it.\r',
 'Still study to heighten the Joys of her Life;\r',
 'Not treat her the worse, for her being your Wife.\r',
 'If in Judgment she errs, set her right, without Pride:\r',
 '‚ÄôTis the Province of insolent Fools, to deride.\r',
 'A Husband‚Äôs first Praise, is a ',
 'Then change not these Titles, for ',
 'Let your Person be neat, unaffectedly clean,\r',
 'Tho‚Äô alone with your wife the whole Day you remain.\r',
 'Chuse Books, for her study, to fashion her Mind,\r',
 'To emulate those who excell‚Äôd of her Kind.\r',
 'Be Religion the principal Care of your Life,\r',
 'As you hope to be blest in your Children and Wife:\r',
 'So you, in your Marriage, shall gain its true End;\r',
 'And find, in your Wife, a ',
 '',
 '']

In [23]:
# convert the string 'nan' back to NaN value
df['poem_lines'] = np.where(df['poem_lines'] == 'nan', np.nan, df['poem_lines'])

# check
df.loc[169,'poem_lines']

nan

In [24]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines      344
poem_string     346
dtype: int64

### Round 2

In [34]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups2 = list(df[df.poem_lines.isna()].index)
lookups2

[169,
 171,
 183,
 184,
 200,
 203,
 210,
 229,
 254,
 283,
 324,
 325,
 458,
 466,
 482,
 484,
 487,
 490,
 503,
 511,
 512,
 513,
 531,
 532,
 558,
 568,
 576,
 578,
 624,
 626,
 648,
 660,
 661,
 663,
 664,
 694,
 701,
 702,
 703,
 704,
 705,
 707,
 708,
 711,
 714,
 715,
 716,
 717,
 719,
 727,
 736,
 749,
 751,
 753,
 769,
 770,
 834,
 853,
 872,
 881,
 885,
 886,
 892,
 897,
 900,
 917,
 921,
 940,
 942,
 943,
 944,
 945,
 946,
 947,
 1004,
 1025,
 1163,
 1169,
 1171,
 1184,
 1186,
 1234,
 1297,
 1299,
 1319,
 1363,
 1367,
 1371,
 1379,
 1383,
 1392,
 1395,
 1404,
 1440,
 1446,
 1452,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1551,
 1552,
 1553,
 1554,
 1555,
 1556,
 1560,
 1565,
 1566,
 1587,
 1591,
 1594,
 1602,
 1604,
 1617,
 1618,
 1623,
 1711,
 1834,
 1836,
 1837,
 1839,
 1844,
 1865,
 1867,
 1870,
 1875,
 1876,
 1877,
 1906,
 1914,
 1915,
 1940,
 1965,
 1975,
 1976,
 1977,
 1978,
 1979,
 1993,
 1994,
 1997,
 1999,
 2000,
 20

In [41]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups2:
    try:
        info = image_rescraper_poet(df.loc[i, 'poem_url'], df.loc[i, 'poet'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 169
Success -- 171
Failure -- 183
Failure -- 184
Failure -- 200
Failure -- 203
Success -- 210
Failure -- 229
Failure -- 254
Failure -- 283
Failure -- 324
Failure -- 325
Success -- 458
Success -- 466
Success -- 482
Success -- 484
Success -- 487
Success -- 490
Success -- 503
Success -- 511
Success -- 512
Failure -- 513
Success -- 531
Success -- 532
Success -- 558
Success -- 568
Failure -- 576
Failure -- 578
Success -- 624
Success -- 626
Failure -- 648
Success -- 660
Success -- 661
Success -- 663
Success -- 664
Success -- 694
Success -- 701
Success -- 702
Success -- 703
Failure -- 704
Success -- 705
Failure -- 707
Success -- 708
Success -- 711
Success -- 714
Failure -- 715
Success -- 716
Failure -- 717
Success -- 719
Success -- 727
Success -- 736
Success -- 749
Failure -- 751
Failure -- 753
Failure -- 769
Failure -- 770
Success -- 834
Success -- 853
Success -- 872
Failure -- 881
Success -- 885
Success -- 886
Failure -- 892
Success -- 897
Failure -- 900
Failure -- 917
Success --

### Round 3

In [42]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups3 = list(df[df.poem_lines.isna()].index)
lookups3

[183,
 184,
 200,
 203,
 229,
 254,
 283,
 324,
 325,
 513,
 576,
 578,
 648,
 704,
 707,
 715,
 717,
 751,
 753,
 769,
 770,
 881,
 892,
 900,
 917,
 940,
 943,
 945,
 946,
 947,
 1025,
 1163,
 1169,
 1184,
 1234,
 1297,
 1299,
 1319,
 1363,
 1367,
 1371,
 1383,
 1392,
 1404,
 1440,
 1446,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1552,
 1554,
 1587,
 1594,
 1604,
 1617,
 1618,
 1623,
 1711,
 1834,
 1836,
 1837,
 1839,
 1865,
 1870,
 1915,
 1975,
 1976,
 1977,
 1978,
 1979,
 1993,
 1997,
 2003,
 2008,
 2011,
 2013,
 2019,
 2021,
 2023,
 2026,
 2032,
 2037,
 2042,
 2044,
 2050,
 2055,
 2091,
 2092,
 2093,
 2117,
 2122,
 2123,
 2156,
 2163,
 2165,
 2171,
 2193,
 2206,
 2240,
 2249,
 2293,
 2307,
 2310,
 2336,
 2349,
 2412,
 2417,
 2421,
 2424,
 2425,
 2434,
 2444,
 2451,
 2452,
 2457,
 2458,
 2461,
 2464,
 2488,
 2492,
 2528,
 2546,
 2572,
 2647,
 2648,
 2649,
 2728,
 2730,
 2744,
 2746,
 2776,
 2787,
 2803,
 2829,
 2851,
 2869,
 2877,
 

In [46]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups3:
    try:
        info = image_rescraper_POETRY(df.loc[i, 'poem_url'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Failure -- 183
Failure -- 184
Success -- 200
Success -- 203
Success -- 229
Success -- 254
Failure -- 283
Success -- 324
Success -- 325
Success -- 513
Success -- 576
Success -- 578
Success -- 648
Success -- 704
Success -- 707
Success -- 715
Success -- 717
Success -- 751
Success -- 753
Success -- 769
Success -- 770
Success -- 881
Failure -- 892
Success -- 900
Success -- 917
Success -- 940
Success -- 943
Failure -- 945
Success -- 946
Success -- 947
Success -- 1025
Success -- 1163
Success -- 1169
Success -- 1184
Success -- 1234
Failure -- 1297
Success -- 1299
Success -- 1319
Failure -- 1363
Success -- 1367
Failure -- 1371
Failure -- 1383
Failure -- 1392
Success -- 1404
Failure -- 1440
Failure -- 1446
Success -- 1456
Success -- 1467
Success -- 1468
Success -- 1477
Failure -- 1482
Failure -- 1489
Failure -- 1495
Failure -- 1496
Success -- 1498
Failure -- 1500
Success -- 1502
Success -- 1503
Success -- 1505
Failure -- 1552
Success -- 1554
Success -- 1587
Success -- 1594
Success -- 1604
Failur

In [47]:
df.loc[200,'poem_lines']

"['¬© SHE IS AS LOVELY-OFTEN', 'And tallness stood upon the sky like a sparkling mane', 'O she is as lovely-often as every day; the day', 'following the day . . the day of our lives, the brief day.', 'Within this moving room, this shadowy often-', 'ness of days where the little hurry of our lives is said. .', 'O as lovely-often as the moving wing of a bird.', 'But ah, alas, sooner or later each of us must', 'stand before that Roman Court, and be judged free of', 'even such lies as I told about the imperishable beauty of', 'her hair. But that time is not now, and even such lies as', 'I said about the enduring wonder of her grace, are lies', 'that contain within them the only truth by which a', 'man may live in this world.', 'she is as lovely-often as every day; the day', 'following the little day . . the day of our lives, ah, alas,', 'the brief day.', 'FIRST CAME THE LION-RIDER', 'First came the Lion-Rider, across the green', 'fields of the morning, holding golden in his golden', 'hands

### Round 4

In [48]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups4 = list(df[df.poem_lines.isna()].index)
lookups4

[183,
 184,
 283,
 892,
 945,
 1297,
 1363,
 1371,
 1383,
 1392,
 1440,
 1446,
 1482,
 1489,
 1495,
 1496,
 1500,
 1552,
 1617,
 1836,
 1839,
 1865,
 1870,
 1975,
 1976,
 1977,
 1978,
 1979,
 2003,
 2013,
 2050,
 2093,
 2122,
 2123,
 2412,
 2424,
 2434,
 2451,
 2452,
 2457,
 2458,
 2546,
 2572,
 2728,
 2776,
 2933,
 3004,
 3327,
 3335,
 3336,
 3452,
 4309]

In [60]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so I'm running that again
for i in lookups4:
    try:
        info = image_rescraper_poet(df.loc[i, 'poem_url'], df.loc[i, 'poet'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 183
Success -- 184
Failure -- 283
Failure -- 892
Failure -- 945
Success -- 1297
Failure -- 1363
Failure -- 1371
Failure -- 1383
Failure -- 1392
Failure -- 1440
Failure -- 1446
Failure -- 1482
Failure -- 1489
Failure -- 1495
Failure -- 1496
Failure -- 1500
Failure -- 1552
Failure -- 1617
Failure -- 1836
Failure -- 1839
Failure -- 1865
Failure -- 1870
Failure -- 1975
Failure -- 1976
Failure -- 1977
Failure -- 1978
Failure -- 1979
Success -- 2003
Success -- 2013
Failure -- 2050
Success -- 2093
Failure -- 2122
Failure -- 2123
Failure -- 2412
Failure -- 2424
Success -- 2434
Failure -- 2451
Failure -- 2452
Failure -- 2457
Failure -- 2458
Failure -- 2546
Failure -- 2572
Failure -- 2728
Failure -- 2776
Failure -- 2933
Failure -- 3004
Success -- 3327
Success -- 3335
Success -- 3336
Failure -- 3452
Failure -- 4309
CPU times: user 5.96 s, sys: 798 ms, total: 6.75 s
Wall time: 1min 13s


### Round 5

In [61]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups5 = list(df[df.poem_lines.isna()].index)
lookups5

[283,
 892,
 945,
 1363,
 1371,
 1383,
 1392,
 1440,
 1446,
 1482,
 1489,
 1495,
 1496,
 1500,
 1552,
 1617,
 1836,
 1839,
 1865,
 1870,
 1975,
 1976,
 1977,
 1978,
 1979,
 2050,
 2122,
 2123,
 2412,
 2424,
 2451,
 2452,
 2457,
 2458,
 2546,
 2572,
 2728,
 2776,
 2933,
 3004,
 3452,
 4309]

In [69]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so am running that again
for i in lookups5:
    try:
        info = image_rescraper_title(df.loc[i, 'poem_url'], df.loc[i, 'title'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 283
Success -- 892
Success -- 945
Success -- 1363
Success -- 1371
Success -- 1383
Success -- 1392
Failure -- 1440
Success -- 1446
Failure -- 1482
Success -- 1489
Failure -- 1495
Success -- 1496
Failure -- 1500
Success -- 1552
Failure -- 1617
Success -- 1836
Failure -- 1839
Success -- 1865
Success -- 1870
Failure -- 1975
Failure -- 1976
Success -- 1977
Failure -- 1978
Failure -- 1979
Failure -- 2050
Success -- 2122
Failure -- 2123
Failure -- 2412
Success -- 2424
Failure -- 2451
Success -- 2452
Success -- 2457
Failure -- 2458
Success -- 2546
Success -- 2572
Success -- 2728
Failure -- 2776
Success -- 2933
Success -- 3004
Success -- 3452
Success -- 4309
CPU times: user 4.89 s, sys: 663 ms, total: 5.56 s
Wall time: 58.8 s


### A little excessive, but not bad!

In [73]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines        7
poem_string       9
dtype: int64

- **I'll drop the remaining rows with missing poem_lines values.**

In [75]:
# drop the rows with missing values in the poem_lines column
df.dropna(subset=['poem_lines'], inplace=True)

In [76]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines        0
poem_string       2
dtype: int64

- **The pages for the rows with missing poem_string values appear to be blank so I'll drop those.**

In [77]:
df[df.poem_string.isna()]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
2941,https://www.poetryfoundation.org/poets/dylan-thomas,modern,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Dylan Thomas,Poem on His Birthday [Facs. drafts],,[],
3230,https://www.poetryfoundation.org/poets/barbara-guest,new_york_school,https://www.poetryfoundation.org/poems/49367/imagined-room,Barbara Guest,Imagined Room,,[],


In [78]:
# drop the rows with missing values in the poem_string column, the pages for which do appear blank
df.dropna(subset=['poem_string'], inplace=True)

- **I'll try to fill in the title column using Regex.**

In [79]:
# create a list of index numbers with NaN values in the title column
lookups_title = list(df[df.title.isna()].index)
lookups_title

[166,
 251,
 275,
 285,
 306,
 459,
 460,
 462,
 463,
 469,
 470,
 471,
 472,
 514,
 517,
 521,
 522,
 523,
 552,
 556,
 557,
 559,
 561,
 563,
 567,
 619,
 631,
 639,
 641,
 642,
 696,
 710,
 779,
 780,
 830,
 831,
 906,
 908,
 922,
 924,
 986,
 999,
 1012,
 1046,
 1112,
 1136,
 1143,
 1164,
 1174,
 1261,
 1262,
 1296,
 1349,
 1455,
 1539,
 1540,
 1586,
 1588,
 1596,
 1599,
 1609,
 1757,
 1842,
 1848,
 1849,
 1903,
 1907,
 1908,
 1930,
 1935,
 1946,
 1947,
 1955,
 2028,
 2034,
 2118,
 2159,
 2160,
 2167,
 2177,
 2182,
 2188,
 2198,
 2210,
 2211,
 2212,
 2219,
 2223,
 2291,
 2363,
 2415,
 2426,
 2428,
 2460,
 2466,
 2493,
 2494,
 2522,
 2757,
 2758,
 2760,
 2767,
 2778,
 2781,
 2796,
 2806,
 2816,
 2820,
 2830,
 2845,
 2847,
 2858,
 2862,
 2864,
 2871,
 2953,
 2955,
 2969,
 2996,
 2997,
 3002,
 3008,
 3167,
 3271,
 3309,
 3346,
 3360,
 3369,
 3380,
 3381,
 3390,
 3430,
 3431,
 3433,
 3449,
 3456,
 3533,
 3592,
 3593,
 3641,
 3644,
 3677,
 3696,
 3704,
 3705,
 3707,
 3708,
 3709,
 3714,

In [80]:
%%time

# create regex pattern to capture the ending of the url
title_pattern = '.+/([a-z\-]*).*$'

# iterate over the list, attempting to fill in the title with re-stylized url ending
for i in lookups_title:
    title = re.search(title_pattern, df.loc[i,'poem_url'], re.I).group(1).replace('-', ' ').title()
    try:
        df.loc[i,'title'] = title
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 166
Success -- 251
Success -- 275
Success -- 285
Success -- 306
Success -- 459
Success -- 460
Success -- 462
Success -- 463
Success -- 469
Success -- 470
Success -- 471
Success -- 472
Success -- 514
Success -- 517
Success -- 521
Success -- 522
Success -- 523
Success -- 552
Success -- 556
Success -- 557
Success -- 559
Success -- 561
Success -- 563
Success -- 567
Success -- 619
Success -- 631
Success -- 639
Success -- 641
Success -- 642
Success -- 696
Success -- 710
Success -- 779
Success -- 780
Success -- 830
Success -- 831
Success -- 906
Success -- 908
Success -- 922
Success -- 924
Success -- 986
Success -- 999
Success -- 1012
Success -- 1046
Success -- 1112
Success -- 1136
Success -- 1143
Success -- 1164
Success -- 1174
Success -- 1261
Success -- 1262
Success -- 1296
Success -- 1349
Success -- 1455
Success -- 1539
Success -- 1540
Success -- 1586
Success -- 1588
Success -- 1596
Success -- 1599
Success -- 1609
Success -- 1757
Success -- 1842
Success -- 1848
Success -- 1849
Su

In [81]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title             0
year           1647
poem_lines        0
poem_string       0
dtype: int64

- **I'll drop the year column, as that didn't seem to be too successful.**

In [83]:
df.drop(columns='year', inplace=True)
df.isna().sum()

poet_url       0
genre          0
poem_url       0
poet           0
title          0
poem_lines     0
poem_string    0
dtype: int64

In [84]:
df.shape

(5176, 7)

### Save a copy

In [87]:
df.to_csv('data/poetry_foundation_raw_rescrape.csv')


- **I'll look at a breakdown of genres and see if there are any I should get rid of.**
- **My initial thoughts are to limit it in time period, so as to remove any language barriers, so to speak (between, say, Shakespearean English and modern English).**

In [88]:
df.genre.value_counts()

modern                            1279
victorian                          643
renaissance                        426
romantic                           398
imagist                            356
new_york_school                    264
black_mountain                     257
new_york_school_2nd_generation     192
language_poetry                    192
confessional                       176
black_arts_movement                165
georgian                           160
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           114
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

In [89]:
# check a sample Middle English poem
print(df[df.genre == 'middle_english'].iloc[0,-1])

Whan that Aprille with his shour
The droghte of March hath perc
And bath
Of which vertuÃÅ engendr
Whan Zephirus eek with his swet
Inspir
The tendr
Hath in the Ram his half
And smal
That slepen al the nyght with open y
So priketh hem NatuÃÅre in hir corag
Thanne longen folk to goon on pilgrimag
And palmeres for to seken straung
To fern
And specially, from every shir
Of Eng
The hooly blisful martir for to sek
That hem hath holpen whan that they were seek

Bifil that in that seson on a day, 
In Southwerk at the Tabard as I lay, 
Redy to wenden on my pilgrymag
To Caunterbury with ful devout corag
At nyght were come into that hostelry
Wel nyne and twenty in a compaigny
Of sondry folk, by aÃÅventure y-fall
In felaweshipe, and pilgrimes were they all
That toward Caunterbury wolden ryd
The chambr
And wel we weren es
And shortly, whan the sonn
So hadde I spoken with hem everychon, 
That I was of hir felaweshipe anon, 
And mad
To take oure wey, ther as I yow devys

But nath
Er that I ferther in 

- **Indeed, Middle English is definitely out.**

In [90]:
df = df[df.genre != 'middle_english']
df.shape

(5166, 7)

In [91]:
# check a sample Renaissance poem
print(df[df.genre == 'renaissance'].iloc[0,-1])

Long have I long‚Äôd to see my love againe,
   Still have I wisht, but never could obtaine it;
   Rather than all the world (if I might gaine it)
Would I desire my love‚Äôs sweet precious gaine.
Yet in my soule I see him everie day,
   See him, and see his still sterne countenaunce,
   But (ah) what is of long continuance,
Where majestie and beautie beares the sway?
Sometimes, when I imagine that I see him,
   (As love is full of foolish fantasies)
   Weening to kisse his lips, as my love‚Äôs fees,
I feele but aire: nothing but aire to bee him.
   Thus with Ixion, kisse I clouds in vaine:
   Thus with Ixion, feele I endles paine.





In [92]:
# check a sample Augustan poem
print(df[df.genre == 'augustan'].iloc[1,-1])

And auld Robin Forbes hes gien tem a dance,
I pat on my speckets to see them aw prance;
I thout o‚Äô the days when I was but fifteen,
And skipp‚Äôd wi‚Äô the best upon Forbes‚Äôs green.
Of aw things that is I think thout is meast queer,
It brings that that‚Äôs by-past and sets it down here;
I see Willy as plain as I dui this bit leace,
When he tuik his cwoat lappet and deeghted his feace.

The lasses aw wonder‚Äôd what Willy cud see
In yen that was dark and hard featur‚Äôd leyke me;
And they wonder‚Äôd ay mair when they talk‚Äôd o‚Äô my wit,
And slily telt Willy that cudn‚Äôt be it:
But Willy he laugh‚Äôd, and he meade me his weyfe,
And whea was mair happy thro‚Äô aw his lang leyfe?
It‚Äôs e‚Äôen my great comfort, now Willy is geane,
The he offen said‚Äî nae place was leyke his awn heame!

I mind when I carried my wark to yon steyle
Where Willy was deykin, the time to beguile,
He wad fling me a daisy to put i‚Äô my breast,
And I hammer‚Äôd my noddle to mek out a jest

- **According to Poetry Foundation's website, Renaissance and Augustan poems are from the years 1500 - 1780, and the differences in the English are fairly clear.**
- **For now, I'll drop these.**

In [93]:
df_trim = df[df.genre != 'renaissance']
df_trim = df_trim[df_trim.genre != 'augustan']
df_trim.shape

(4626, 7)

In [94]:
# check a sample Victorian poem
print(df[df.genre == 'victorian'].iloc[1,-1])

I
The evening comes, the fields are still. 
The tinkle of the thirsty rill, 
Unheard all day, ascends again; 
Deserted is the half-mown plain, 
Silent the swaths! the ringing wain, 
The mower's cry, the dog's alarms, 
All housed within the sleeping farms! 
The business of the day is done, 
The last-left haymaker is gone. 
And from the thyme upon the height, 
And from the elder-blossom white 
And pale dog-roses in the hedge, 
And from the mint-plant in the sedge, 
In puffs of balm the night-air blows 
The perfume which the day forgoes. 
And on the pure horizon far, 
See, pulsing with the first-born star, 
The liquid sky above the hill! 
The evening comes, the fields are still. 

       Loitering and leaping, 
       With saunter, with bounds‚Äî 
       Flickering and circling 
       In files and in rounds‚Äî 
       Gaily their pine-staff green 
       Tossing in air, 
       Loose o'er their shoulders white 
       Showering their hair‚Äî 
       See! the wild Maenads 
       Break fr

In [95]:
# check a sample Romantic poem
print(df[df.genre == 'romantic'].iloc[1,-1])

Now in thy dazzling half-oped eye, 
Thy curled nose and lip awry, 
Uphoisted arms and noddling head, 
And little chin with crystal spread, 
Poor helpless thing! what do I see, 
That I should sing of thee? 

From thy poor tongue no accents come, 
Which can but rub thy toothless gum: 
Small understanding boasts thy face, 
Thy shapeless limbs nor step nor grace: 
A few short words thy feats may tell, 
And yet I love thee well. 

When wakes the sudden bitter shriek, 
And redder swells thy little cheek 
When rattled keys thy woes beguile, 
And through thine eyelids gleams the smile, 
Still for thy weakly self is spent 
Thy little silly plaint. 

But when thy friends are in distress. 
Thou‚Äôlt laugh and chuckle n‚Äôertheless, 
Nor with kind sympathy be smitten, 
Though all are sad but thee and kitten; 
Yet puny varlet that thou art, 
Thou twitchest at the heart. 

Thy smooth round cheek so soft and warm; 
Thy pinky hand and dimpled arm; 
Thy silken locks that scantly peep, 
With gold tipped

- **Romantic and Victorian poems are from 1781-1900, but the language seems fairly similar.**
- **Plus, these are some very formative genres for poetry in English. For now, I'll keep these.**

- **All other genres are from after 1900.**

In [96]:
# let's reindex
df_trim.reset_index(drop=True, inplace=True)

## Rescraping (again)
- **Look more closely at how the scraping went.**
- **Eventually, I'll want to create some new features, like number of lines and average line length.**
    - **Since I can't divide by zero, this is a good opportunity to look for any unsuccessful scrapes--those where 0 or too few lines were scraped.**
    - **NOTE: I'm checking if length of poem_lines is less than or equal to 1 because that yielded the desired results, whereas seeing if length equaled 0 did not.**

In [97]:
df_trim[df_trim['poem_lines'].map(lambda x: len(x)) <= 1]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
222,https://www.poetryfoundation.org/poets/henry-dumas,black_arts_movement,https://www.poetryfoundation.org/poems/53477/kef-21,Henry Dumas,Kef 21,"[First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling...","First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling ..."
428,https://www.poetryfoundation.org/poets/robert-duncan,black_mountain,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,Robert Duncan,A Poem Beginning with a Line by Pindar,[I],I
703,https://www.poetryfoundation.org/poets/anne-sexton,confessional,https://www.poetryfoundation.org/poems/152252/o-ye-tongues,Anne Sexton,O Ye Tongues,[First Psalm],First Psalm
952,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57369/the-send-off,Wilfred Owen,The Send-Off,[ ],
953,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57347/smile-smile-smile,Wilfred Owen,"Smile, Smile, Smile","[Head to limp head, the sunk-eyed wounded scanned]","Head to limp head, the sunk-eyed wounded scanned"
1231,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53772/spring-day-56d233626c49b,Amy Lowell,Spring Day,[<em> Bath</em>],<em> Bath</em>
1234,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53773/towns-in-colour,Amy Lowell,Towns in Colour,"[Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!]","Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!"
1389,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poems/54567/kora-in-hell-improvisations-xi,William Carlos Williams,Kora in Hell: Improvisations XI,[XI],XI
1603,https://www.poetryfoundation.org/poets/lyn-hejinian,language_poetry,https://www.poetryfoundation.org/poems/47892/my-life-a-name-trimmed-with-colored-ribbons,Lyn Hejinian,My Life: A name trimmed with colored ribbons,[A name trimmed],A name trimmed
1615,https://www.poetryfoundation.org/poets/fanny-howe,language_poetry,https://www.poetryfoundation.org/poems/46762/everythings-a-fake,Fanny Howe,Everything‚Äôs a Fake,"[Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it‚Äôs spring. At night the mockingbirds ring their warnings of ...","Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it‚Äôs spring. At night the mockingbirds ring their warnings of c..."


- **After building out some specific rescraping functions, I can replace the poem_lines and poem_string values.**

In [100]:
# rescrape poem based on index from above 
df_trim.loc[428,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[428,'poem_url'])[0])
df_trim.loc[428,'poem_string'] = PoemView_rescraper(df_trim.loc[428,'poem_url'])[1]

df_trim.loc[703,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[703,'poem_url'])[0])
df_trim.loc[703,'poem_string'] = PoemView_rescraper(df_trim.loc[703,'poem_url'])[1]

df_trim.loc[952,'poem_lines'] = str(poempara_rescraper(df_trim.loc[952,'poem_url'])[0])
df_trim.loc[952,'poem_string'] = poempara_rescraper(df_trim.loc[952,'poem_url'])[1]

df_trim.loc[953,'poem_lines'] = str(modified_regular_rescraper(df_trim.loc[953,'poem_url'])[0])
df_trim.loc[953,'poem_string'] = modified_regular_rescraper(df_trim.loc[953,'poem_url'])[1]

df_trim.loc[1231,'poem_lines'] = str(justify_rescraper(df_trim.loc[1231,'poem_url'])[0])
df_trim.loc[1231,'poem_string'] = justify_rescraper(df_trim.loc[1231,'poem_url'])[1]

df_trim.loc[1234,'poem_lines'] = str(justify_rescraper(df_trim.loc[1234,'poem_url'])[0])
df_trim.loc[1234,'poem_string'] = justify_rescraper(df_trim.loc[1234,'poem_url'])[1]

df_trim.loc[1389,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1389,'poem_url'])[0])
df_trim.loc[1389,'poem_string'] = PoemView_rescraper(df_trim.loc[1389,'poem_url'])[1]

df_trim.loc[1603,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1603,'poem_url'])[0])
df_trim.loc[1603,'poem_string'] = PoemView_rescraper(df_trim.loc[1603,'poem_url'])[1]

df_trim.loc[2514,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2514,'poem_url'])[0])
df_trim.loc[2514,'poem_string'] = PoemView_rescraper(df_trim.loc[2514,'poem_url'])[1]

df_trim.loc[2517,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2517,'poem_url'])[0])
df_trim.loc[2517,'poem_string'] = PoemView_rescraper(df_trim.loc[2517,'poem_url'])[1]

df_trim.loc[3335,'poem_lines'] = str(ranged_rescraper(df_trim.loc[3335,'poem_url'])[0])
df_trim.loc[3335,'poem_string'] = ranged_rescraper(df_trim.loc[3335,'poem_url'])[1]

df_trim.loc[3418,'poem_lines'] = str(center_rescraper(df_trim.loc[3418,'poem_url'])[0])
df_trim.loc[3418,'poem_string'] = center_rescraper(df_trim.loc[3418,'poem_url'])[1]

df_trim.loc[3421,'poem_lines'] = str(justify_rescraper(df_trim.loc[3421,'poem_url'])[0])
df_trim.loc[3421,'poem_string'] = justify_rescraper(df_trim.loc[3421,'poem_url'])[1]

df_trim.loc[4217,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4217,'poem_url'])[0])
df_trim.loc[4217,'poem_string'] = poempara_rescraper(df_trim.loc[4217,'poem_url'])[1]

df_trim.loc[4611,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4611,'poem_url'])[0])
df_trim.loc[4611,'poem_string'] = poempara_rescraper(df_trim.loc[4611,'poem_url'])[1]

In [104]:
# found some more...
df_trim.loc[1388,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1388,'poem_url'])[0])
df_trim.loc[1388,'poem_string'] = PoemView_rescraper(df_trim.loc[1388,'poem_url'])[1]

df_trim.loc[1390,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1390,'poem_url'])[0])
df_trim.loc[1390,'poem_string'] = PoemView_rescraper(df_trim.loc[1390,'poem_url'])[1]

df_trim.loc[1391,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1391,'poem_url'])[0])
df_trim.loc[1391,'poem_string'] = PoemView_rescraper(df_trim.loc[1391,'poem_url'])[1]

df_trim.loc[1392,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1392,'poem_url'])[0])
df_trim.loc[1392,'poem_string'] = PoemView_rescraper(df_trim.loc[1392,'poem_url'])[1]

In [106]:
# another one...
df_trim.loc[3399,'poem_lines'] = str(image_rescraper(df_trim.loc[3399,'poem_url'])[0])
df_trim.loc[3399,'poem_string'] = image_rescraper(df_trim.loc[3399,'poem_url'])[1]

- **Some scrapings contain only BeautifulSoup garbage, so I'll try to re-scrape those.**

In [108]:
# check if html tags are in the string
df_trim[df_trim.poem_string.str.contains('<div')]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
237,https://www.poetryfoundation.org/poets/nikki-giovanni,black_arts_movement,https://www.poetryfoundation.org/poems/90181/no-complaints,Nikki Giovanni,No Complaints,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917‚Äî2001)</span></p><...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917‚Äî2001)</span></p></..."
1687,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55563/you-part-i,Ron Silliman,"You, part I","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
1688,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55564/you-part-xii,Ron Silliman,"You, part XII","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
4260,https://www.poetryfoundation.org/poets/emma-lazarus,victorian,https://www.poetryfoundation.org/poems/46791/by-the-waters-of-babylon,Emma Lazarus,By the Waters of Babylon,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>, ]","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>\n"


In [159]:
# rescrape poem based on index from above 
df_trim.loc[237,'poem_lines'] = str(PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[0])
df_trim.loc[237,'poem_string'] = PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[1]

df_trim.loc[1687,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1687,'poem_url'])[0])
df_trim.loc[1687,'poem_string'] = PoemView_rescraper(df_trim.loc[1687,'poem_url'])[1]

df_trim.loc[1688,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1688,'poem_url'])[0])
df_trim.loc[1688,'poem_string'] = PoemView_rescraper(df_trim.loc[1688,'poem_url'])[1]

df_trim.loc[4260,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[4260,'poem_url'])[0])
df_trim.loc[4260,'poem_string'] = PoemView_rescraper(df_trim.loc[4260,'poem_url'])[1]

In [160]:
# re-run the destringify function
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)

- **Re-check for any missing poem_lines values that aren't NaNs.**

In [165]:
df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
783,https://www.poetryfoundation.org/poets/randall-jarrell,fugitive,https://www.poetryfoundation.org/poetrymagazine/poems/25237/goodbye-wendover-goodbye-mountain-home,Randall Jarrell,Goodbye Wendover Goodbye Mountain Home,[],
1326,https://www.poetryfoundation.org/poets/ezra-pound,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/13071/dogmatic-statement-concerning-the-game-of-chess-theme-for-a-series-of-pictures,Ezra Pound,Dogmatic Statement Concerning The Game Of Chess Theme For A Series Of Pictures,[],
1433,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/20226/a-foot-note,William Carlos Williams,A Foot Note,[],
1438,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/24855/paterson-book-ii,William Carlos Williams,Paterson Book Ii,[],
1736,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/22702/poem-he-watched-with-all-his,W. H. Auden,Poem He Watched With All His,[],
1738,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21500/poem-o-who-can-ever-praise-enough-the-price,W. H. Auden,Poem O Who Can Ever Praise Enough The Price,[],
1775,https://www.poetryfoundation.org/poets/louise-bogan,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21807/untitled-tender-and-insolent,Louise Bogan,Untitled Tender And Insolent,[],
1826,https://www.poetryfoundation.org/poets/hart-crane,modern,https://www.poetryfoundation.org/poetrymagazine/poems/17345/at-melvilles-tomb,Hart Crane,At Melvilles Tomb,[],
2056,https://www.poetryfoundation.org/poets/a-m-klein,modern,https://www.poetryfoundation.org/poetrymagazine/poems/23448/come-two-like-shadows,A. M. Klein,Come Two Like Shadows,[],
2582,https://www.poetryfoundation.org/poets/wallace-stevens,modern,https://www.poetryfoundation.org/poetrymagazine/poems/19837/good-man-bad-woman,Wallace Stevens,Good Man Bad Woman,[],


In [169]:
# create a list of indices
lookups6 = list(df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0].index)
lookups6

[783,
 1326,
 1433,
 1438,
 1736,
 1738,
 1775,
 1826,
 2056,
 2582,
 2685,
 2790,
 2817,
 3191]

In [174]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so I'm running that again
for i in lookups6:
    try:
        info = image_rescraper_title(df_trim.loc[i, 'poem_url'], df_trim.loc[i, 'title'])
        df_trim.loc[i,'poem_lines'] = str(info[0])
        df_trim.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 783
Success -- 1326
Success -- 1433
Success -- 1438
Success -- 1736
Success -- 1738
Success -- 1775
Success -- 1826
Success -- 2056
Success -- 2582
Success -- 2685
Success -- 2790
Success -- 2817
Failure -- 3191
CPU times: user 1.58 s, sys: 214 ms, total: 1.79 s
Wall time: 51.6 s


In [177]:
# one final one to redo
df_trim.loc[3191,'title'] = 'Radio'
info = image_rescraper_title(df_trim.loc[3191, 'poem_url'], df_trim.loc[3191, 'title'])
df_trim.loc[3191,'poem_lines'] = str(info[0])
df_trim.loc[3191,'poem_string'] = info[1]

In [181]:
# re-run destringify
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)

## SAVE IT!

In [182]:
df_trim.to_csv('data/poetry_foundation_raw_rescrape.csv')