# Data Cleaning

- In this notebook, I look at the data I scraped from [PoetryFoundation.org](https://www.poetryfoundation.org/) in the previous [notebook](01_webscraping.ipynb).

#### NOTE: WORK IN PROGRESS
Because I overhauled my original webscrape, I am currently working on re-running, organizing, and cleaning up this notebook.

Thank you for understanding :)

## Table of contents

1. [Import necessary packages](#Import-necessary-packages)
    
## Import necessary packages

[[go back to the top](#Data-Cleaning)]

In [226]:
# custom functions for webscraping
from functions_webscraping import *
from functions import destringify

# standard dataframe packages
import pandas as pd
import numpy as np

# timekeeping/progress packages
import time
from tqdm import tqdm

# saving packages
import gzip
import pickle

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# increase column width of dataframe
pd.set_option('max_colwidth', 150)


In [3]:
df = pd.read_csv('data/poems_df_pre_clean.csv', index_col=0)
df.shape

(5168, 6)

In [4]:
df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Alexander Pope,https://www.poetryfoundation.org/poems/44896/an-essay-on-criticism-part-1,An Essay on Criticism: Part 1,"['PART 1', ""'Tis hard to say, if greater want of skill"", 'Appear in writing or in judging ill;', ""But, of the two, less dang'rous is th' offence"",...","PART 1\n'Tis hard to say, if greater want of skill\nAppear in writing or in judging ill;\nBut, of the two, less dang'rous is th' offence\nTo tire ...",augustan
1,Alexander Pope,https://www.poetryfoundation.org/poems/44897/an-essay-on-criticism-part-2,An Essay on Criticism: Part 2,"['Of all the causes which conspire to blind', ""Man's erring judgment, and misguide the mind,"", 'What the weak head with strongest bias rules,', 'I...","Of all the causes which conspire to blind\nMan's erring judgment, and misguide the mind,\nWhat the weak head with strongest bias rules,\nIs pride,...",augustan
2,Alexander Pope,https://www.poetryfoundation.org/poems/44898/an-essay-on-criticism-part-3,An Essay on Criticism: Part 3,"['Learn then what morals critics ought to show,', ""For 'tis but half a judge's task, to know."", ""'Tis not enough, taste, judgment, learning, join;...","Learn then what morals critics ought to show,\nFor 'tis but half a judge's task, to know.\n'Tis not enough, taste, judgment, learning, join;\nIn a...",augustan
3,Alexander Pope,https://www.poetryfoundation.org/poems/44899/an-essay-on-man-epistle-i,An Essay on Man: Epistle I,"['Awake, my St. John! leave all meaner things', 'To low ambition, and the pride of kings.', 'Let us (since life can little more supply', 'Than jus...","Awake, my St. John! leave all meaner things\nTo low ambition, and the pride of kings.\nLet us (since life can little more supply\nThan just to loo...",augustan
4,Alexander Pope,https://www.poetryfoundation.org/poems/44900/an-essay-on-man-epistle-ii,An Essay on Man: Epistle II,"['I.', 'Know then thyself, presume not God to scan;', 'The proper study of mankind is man.', ""Plac'd on this isthmus of a middle state,"", 'A being...","I.\nKnow then thyself, presume not God to scan;\nThe proper study of mankind is man.\nPlac'd on this isthmus of a middle state,\nA being darkly wi...",augustan


- Saving to CSV converts each `poem_lines` list to its string representation, so I'll use my `destringify` function to turn those strings back into lists.

In [13]:
df['poem_lines'] = df['poem_lines'].apply(destringify)
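
The real `destringify` lives in my `functions` module and isn't shown here. As a rough sketch only, assuming each cell holds a Python list literal, a helper like the hypothetical `destringify_sketch` below could do the same job with `ast.literal_eval`:

```python
import ast


def destringify_sketch(value):
    """Turn a stringified list such as "['a', 'b']" back into a real list.

    Hypothetical stand-in for `destringify`; the actual helper may differ.
    """
    # Leave anything that is not a string (an actual list, NaN, ...) alone.
    if not isinstance(value, str):
        return value
    # literal_eval only parses Python literals, so it is safer than eval().
    return ast.literal_eval(value)


raw = "['PART 1', \"'Tis hard to say, if greater want of skill\"]"
lines = destringify_sketch(raw)
print(lines)
```

Running the conversion twice is harmless with this version, because values that are already lists pass through unchanged.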

In [7]:
len(df[df.duplicated(subset=['poet', 'poem_string'])])

21
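
By default, `duplicated` uses `keep='first'`, so it flags only the later copies in each duplicate group, while `keep=False` flags every member of the group. A toy example (the data here is made up purely for illustration):

```python
import pandas as pd

# Made-up rows: poet 'A' has the same poem three times.
toy = pd.DataFrame({
    'poet': ['A', 'A', 'B', 'A'],
    'poem_string': ['x', 'x', 'y', 'x'],
})

# Default keep='first': the first occurrence is NOT flagged.
later_copies = toy.duplicated(subset=['poet', 'poem_string'])

# keep=False: every row that belongs to a duplicate group is flagged.
all_members = toy.duplicated(subset=['poet', 'poem_string'], keep=False)

print(later_copies.sum(), all_members.sum())
```

The first form is handy for counting how many rows would be dropped; the second is better for eyeballing the groups side by side.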

In [9]:
df[df.duplicated(subset=['poet', 'poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
119,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"['I', 'I saw the best minds of my generation destroyed by madness, starving hysterical naked,', 'dragging themselves through the negro streets at ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
120,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"['I', 'I saw the best minds of my generation destroyed by madness, starving hysterical naked,', 'dragging themselves through the negro streets at ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
249,Richard Brautigan,https://www.poetryfoundation.org/poems/48576/a-boat,A Boat,"['O beautiful', 'was the werewolf', 'in his evil forest.', 'We took him', 'to the carnival', 'and he started', 'crying', 'when he saw', 'the Ferri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
250,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/56423/a-boat-56d238e754f45,A Boat,"['O beautiful', 'was the werewolf', 'in his evil forest.', 'We took him', 'to the carnival', 'and he started', 'crying', 'when he saw', 'the Ferri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
613,Robert Creeley,https://www.poetryfoundation.org/poems/42840/the-rescue-56d2217b24ec4,The Rescue,"['The man sits in a timelessness', 'with the horse under him in time', 'to a movement of legs and hooves', 'upon a timeless sand.', 'Distance come...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
614,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/28665/the-rescue,The Rescue,"['The man sits in a timelessness', 'with the horse under him in time', 'to a movement of legs and hooves', 'upon a timeless sand.', 'Distance come...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
738,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29165/four-dream-songs,Four Dream Songs,"['I', 'To Ralph Ross', 'The greens of the Ganges delta foliate.', 'Of heartless youth made late aware he pled:', 'Brownies, please come.', 'To Hen...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
741,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29167/henrys-pelt-was-put-on,Henrys Pelt Was Put On,"['I', 'To Ralph Ross', 'The greens of the Ganges delta foliate.', 'Of heartless youth made late aware he pled:', 'Brownies, please come.', 'To Hen...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
852,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"['Sweet beast, I have gone prowling,', 'a proud rejected man', 'who lived along the edges', 'catch as catch can;', 'in darkness and in hedges', 'I...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1845,Alfred Kreymborg,https://www.poetryfoundation.org/poetrymagazine/poems/14702/cradle,Cradle,"['The blue-eyed youngster', 'And the fat old man', 'Play ball in me;', 'And music—', 'The one on his penny flute,', 'The other on his bassoon.', '...","The blue-eyed youngster\nAnd the fat old man\nPlay ball in me;\nAnd music—\nThe one on his penny flute,\nThe other on his bassoon.\nTheir tolerati...",modern


- I'm actually somewhat happy about this: where the same poem was scraped from both a regular poem page and a magazine page, the strings match exactly, which suggests my image scraper worked pretty darn well.
- That said, there may be more near-duplicates whose strings differ slightly, so I'll also check for duplicates across poet and title.
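
Exact string matching can't catch near-duplicates. As a sketch of one way to score them (not something I run in this notebook), the standard library's `difflib.SequenceMatcher` gives a similarity ratio between two strings; the `0.9` threshold below is an arbitrary assumption:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Two versions of the same opening line that differ only by a trailing comma.
magazine = "Often beneath the wave, wide from this ledge,"
website = "Often beneath the wave, wide from this ledge"

score = similarity(magazine, website)
is_near_duplicate = score >= 0.9  # hypothetical threshold
print(round(score, 3), is_near_duplicate)
```

Comparing every pair of poems this way would be slow on roughly 5,000 rows, so it makes more sense to compare only poems by the same poet.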

In [53]:
# drop duplicates
to_drop = [120, 250, 614, 2338, 2358, 2367, 2931, 2995, 3454, 3455, 3642, 4481, 5159]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5155, 6)

- For some multi-part poems, my scraper stored every part in each part's row. Rather than dropping those rows, I trim each row's `poem_lines` down to its own part and rebuild `poem_string` from the trimmed lines.

In [21]:
df.loc[738, 'poem_lines'] = df.loc[738, 'poem_lines'][:-1]
df.loc[738, 'poem_string'] = '\n'.join(df.loc[738, 'poem_lines'])
df.loc[738, 'poem_string']

"I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sparest times sometimes\nthe little people spread, & did friendly things;\nthen he was glad.\nPleased, at the worst, except with man, he shook\nthe brightest winter sun.\nAll the green lives\nof the great delta, hours, hurt his migrant heart\nin a safety of the steady plane. Please, please\ncome.\nMy friends,—he has been known to mourn,—I'll die;\nlive you, in the most wild, kindly, green\npartly forgiving wood,\nsort of forever and all those human sings\nclose not your better ears to, while good Spring\nreturns with a dance and a sigh."

In [17]:
df.loc[741, 'poem_lines'] = df.loc[741, 'poem_lines'][-3:]
df.loc[741, 'poem_string'] = '\n'.join(df.loc[741, 'poem_lines'])
df.loc[741, 'poem_string']

'Henry’s pelt was put on sundry walls\nwhere it did much resemble Henry and\nthem persons was delighted.'

In [25]:
df.loc[1845, 'poem_lines'] = df.loc[1845, 'poem_lines'][:19]
df.loc[1845, 'poem_string'] = '\n'.join(df.loc[1845, 'poem_lines'])

df.loc[1867, 'poem_lines'] = df.loc[1867, 'poem_lines'][20:]
df.loc[1867, 'poem_string'] = '\n'.join(df.loc[1867, 'poem_lines'])

In [34]:
df.loc[2195, 'poem_lines'] = df.loc[2195, 'poem_lines'][:12]
df.loc[2195, 'poem_string'] = '\n'.join(df.loc[2195, 'poem_lines'])

df.loc[2236, 'poem_lines'] = df.loc[2236, 'poem_lines'][12:51]
df.loc[2236, 'poem_string'] = '\n'.join(df.loc[2236, 'poem_lines'])

In [49]:
df.loc[2571, 'poem_lines'] = df.loc[2571, 'poem_lines'][:-1]
df.loc[2571, 'poem_string'] = '\n'.join(df.loc[2571, 'poem_lines'])
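
The trims above all follow the same pattern: slice `poem_lines`, then rebuild `poem_string`. A hypothetical helper could wrap that pattern (`trim_poem` is not part of my `functions` module; this is only a sketch on a made-up DataFrame). It uses `.at`, which sets exactly one cell and so stores the list as a single value:

```python
import pandas as pd


def trim_poem(frame, idx, start=None, stop=None):
    """Keep poem_lines[start:stop] for row `idx` and rebuild poem_string."""
    kept = frame.at[idx, 'poem_lines'][start:stop]
    # .at addresses a single cell, so the list is stored as one object.
    frame.at[idx, 'poem_lines'] = kept
    frame.at[idx, 'poem_string'] = '\n'.join(kept)


# Toy stand-in for the real DataFrame.
toy = pd.DataFrame(
    {
        'poem_lines': [['I', 'first part', 'II', 'second part']],
        'poem_string': ['I\nfirst part\nII\nsecond part'],
    },
    index=[738],
)

trim_poem(toy, 738, stop=2)
print(toy.at[738, 'poem_string'])
```

Keeping both columns in one function makes it harder to update `poem_lines` and forget `poem_string`.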

- Two rows need a full rescrape because the stored URL led to the wrong page.

In [43]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=2'
url = df.loc[2567, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2567, 'poet'], input_title=df.loc[2567, 'title'])
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2567, 'genre']
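# assign by label (.loc) rather than position (.iloc), so the right row is hit even after rows are dropped
df.loc[2567] = rescrape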

In [47]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=3'
url = df.loc[2575, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2575, 'poet'], input_title=df.loc[2575, 'title'])
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2575, 'genre']
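# assign by label (.loc) rather than position (.iloc), so the right row is hit even after rows are dropped
df.loc[2575] = rescrape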
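
A note on indexing: `.loc` selects by index label while `.iloc` selects by position, and the two stop agreeing once rows have been dropped. A toy sketch (the DataFrame is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'title': ['a', 'b', 'c', 'd']})  # labels 0, 1, 2, 3
toy = toy.drop(index=[1])                            # labels are now 0, 2, 3

by_label = toy.loc[2, 'title']       # the row whose label is 2
by_position = toy.iloc[2]['title']   # the third row that remains

print(by_label, by_position)
```

Since this notebook drops rows by label in several places, label-based access is the safer choice when re-running the cells from top to bottom.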

- A slight title adjustment so all the words in the first line will be accounted for.

In [51]:
df.loc[3666, 'title'] = 'Young in Fall I said: the birds'

- Check for duplicates again.

In [54]:
df[df.duplicated(subset=['poet', 'poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
2355,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19926/the-urn-enrich-my-resignation,The Urn Enrich My Resignation,[],,modern
2356,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn Purgatorio,[],,modern
2359,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,[],,modern
2360,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn The Sad Indian,[],,modern


- I know I scraped these in the [rescrape](01_webscraping.ipynb#Rescrape) portion of the previous notebook, so I'll confirm those are somewhere in the DataFrame.

In [55]:
df[df.poet == 'Hart Crane']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
2341,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19925/a-postscript,A Postscript,"[Friendship agony! words came to me, At last shyly. My only final friends,, The wren and thrush, made solid print for me, Across dawn’s broken arc...","Friendship agony! words came to me\nAt last shyly. My only final friends,\nThe wren and thrush, made solid print for me\nAcross dawn’s broken arc....",modern
2342,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17345/at-melvilles-tomb,At Melvilles Tomb,"[Often beneath the wave, wide from this ledge,, The dice of drowned men’s bones he saw bequeath, An embassy. Their numbers, as he watched,, Beat o...","Often beneath the wave, wide from this ledge,\nThe dice of drowned men’s bones he saw bequeath\nAn embassy. Their numbers, as he watched,\nBeat on...",modern
2343,Hart Crane,https://www.poetryfoundation.org/poems/43260/at-melvilles-tomb-56d221f8f2f82,At Melville’s Tomb,"[Often beneath the wave, wide from this ledge, The dice of drowned men’s bones he saw bequeath, An embassy. Their numbers as he watched,, Beat on ...","Often beneath the wave, wide from this ledge\nThe dice of drowned men’s bones he saw bequeath\nAn embassy. Their numbers as he watched,\nBeat on t...",modern
2344,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19923/by-nilus-once,By Nilus Once,"[Some old Egyptian joke is in the air,, Dear lady, the poet said, release your hair;, Come, search the marshes for a friendly bed,, Or let us bump...","Some old Egyptian joke is in the air,\nDear lady, the poet said, release your hair;\nCome, search the marshes for a friendly bed,\nOr let us bump ...",modern
2345,Hart Crane,https://www.poetryfoundation.org/poems/43257/chaplinesque,Chaplinesque,"[We make our meek adjustments,, Contented with such random consolations, As the wind deposits, In slithered and too ample pockets., For we can sti...","We make our meek adjustments,\nContented with such random consolations\nAs the wind deposits\nIn slithered and too ample pockets.\nFor we can stil...",modern
2346,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17746/the-bridge-cutty-sark,Cutty Sark,"[I met a man in South Street, tall—, a nervous shark tooth swung on his chain., His eyes pressed through green glass, —green glasses, or bar light...","I met a man in South Street, tall—\na nervous shark tooth swung on his chain.\nHis eyes pressed through green glass\n—green glasses, or bar lights...",modern
2347,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/18754/eldorado,Eldorado,"[The morning glory, climbing the morning long, Over the lintel on its wiry vine,, Closes before the dusk, furls in its song, As I close mine..., A...","The morning glory, climbing the morning long\nOver the lintel on its wiry vine,\nCloses before the dusk, furls in its song\nAs I close mine...\nAn...",modern
2348,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19919/havana-rose,Havana Rose,"[Let us strip the desk for action, now we have a house in, Mexico. . . . That night in Vera Cruz—verily for me “the, True Cross”—let us remember t...","Let us strip the desk for action, now we have a house in\nMexico. . . . That night in Vera Cruz—verily for me “the\nTrue Cross”—let us remember th...",modern
2349,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19918/imperator-victus,Imperator Victus,"[Big guns again, No speakee well, But plain., Again, again—, And they shall tell, The Spanish Main, The Dollar from the Cross., Big guns again—, B...","Big guns again\nNo speakee well\nBut plain.\nAgain, again—\nAnd they shall tell\nThe Spanish Main\nThe Dollar from the Cross.\nBig guns again—\nBu...",modern
2350,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17747/o-carib-isle,O Carib Isle!,"[O Carib Isle!, The tarantula rattling at the lily’s foot, Across the feet of the dead, laid in white sand, Near the coral beach—nor zigzag fiddle...","O Carib Isle!\nThe tarantula rattling at the lily’s foot\nAcross the feet of the dead, laid in white sand\nNear the coral beach—nor zigzag fiddle ...",modern


- Found them! Interestingly, I also found a duplicate ("At Melvilles Tomb", row 2342) that my image-text scraper must not have scraped entirely properly.
- I'll go ahead and drop all of those now.

In [58]:
# drop duplicates
to_drop = [2342, 2355, 2356, 2359, 2360]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5150, 6)

- Check for poems with the same poet and title, which may indicate duplicate poems that were scraped differently.
- Since there were so many, I did these in batches of 40, using the code in the following cell.
- The subsequent cells detail my process for fixing any legitimate duplicates, by either dropping, rescraping, adding titles, etc.

In [150]:
df[df.duplicated(subset=['poet', 'title'], keep=False)].head(40)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
622,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29778/the-window-56d2134a5739a,The Window,"[Position is where you, put it, where it is,, did you, for example, that, large tank there, silvered,, with the white church along-, side, lift, a...","Position is where you\nput it, where it is,\ndid you, for example, that\nlarge tank there, silvered,\nwith the white church along-\nside, lift\nal...",black_mountain
623,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30219/the-window-56d213b62fd88,The Window,"[There will be no simple, way to avoid what, confronts me. Again and, again I know it, but, take heart, hopefully,, in the world unavoidably, pres...","There will be no simple\nway to avoid what\nconfronts me. Again and\nagain I know it, but\ntake heart, hopefully,\nin the world unavoidably\nprese...",black_mountain
624,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29768/the-woman-56d2134871089,The Woman,"[I have never, clearly given to you, the associations, you have for me, you, with such, divided presence my dream, does not show, you. I do not dr...","I have never\nclearly given to you\nthe associations\nyou have for me, you\nwith such\ndivided presence my dream\ndoes not show\nyou. I do not dre...",black_mountain
625,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/28342/the-woman-56d212141ed89,The Woman,"[I called her across the room,, could see that what she stood on, held her up, and now she came, as if she moved in time., In time to what she mov...","I called her across the room,\ncould see that what she stood on\nheld her up, and now she came\nas if she moved in time.\nIn time to what she move...",black_mountain
851,W. D. Snodgrass,https://www.poetryfoundation.org/poems/42801/song-56d2216f550b2,Song,"[Observe the cautious toadstools, still on the lawn today, though they grow over-evening;, sun shrinks them away., Pale and proper and rootless,, ...","Observe the cautious toadstools\nstill on the lawn today\nthough they grow over-evening;\nsun shrinks them away.\nPale and proper and rootless,\nt...",confessional
852,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1019,Rupert Brooke,https://www.poetryfoundation.org/poems/47294/the-dead-56d227a2ea215,The Dead,"[These hearts were woven of human joys and cares,, Washed marvellously with sorrow, swift to mirth., The years had given them kindness. Dawn was t...","These hearts were woven of human joys and cares,\nWashed marvellously with sorrow, swift to mirth.\nThe years had given them kindness. Dawn was th...",georgian
1020,Rupert Brooke,https://www.poetryfoundation.org/poetrymagazine/poems/13075/the-dead,The Dead,"[Blow out, you bugles, over the rich Dead!, There’s none of these so lonely and poor of old,, But, dying, has made us rarer gifts than gold., Thes...","Blow out, you bugles, over the rich Dead!\nThere’s none of these so lonely and poor of old,\nBut, dying, has made us rarer gifts than gold.\nThese...",georgian
1532,William Carlos Williams,https://www.poetryfoundation.org/poems/54326/love-song-56d2348bab385,Love Song,"[I lie here thinking of you:—, the stain of love, is upon the world!, Yellow, yellow, yellow, it eats into the leaves,, smears with saffron, the h...","I lie here thinking of you:—\nthe stain of love\nis upon the world!\nYellow, yellow, yellow\nit eats into the leaves,\nsmears with saffron\nthe ho...",imagist
1533,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/13513/love-song,Love Song,"[What have I to say to you, When we shall meet ?, Yet—, T lie here thinking of you., The stain of love, Is upon the world., Yellow, yellow,. yello...","What have I to say to you\nWhen we shall meet ?\nYet—\nT lie here thinking of you.\nThe stain of love\nIs upon the world.\nYellow, yellow,. yellow...",imagist


#### Drop duplicates

In [91]:
# drop duplicates
to_drop = [176, 461, 547, 603, 737, 803, 823, 930, 939, 1267, 1506, 1556, 1930, 1936]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5136, 6)

In [123]:
# drop duplicates
to_drop = [2010, 2052, 2119, 2236, 2315, 2397, 2504, 2505, 2574, 2831, 2863, 2866, 2900]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5123, 6)

In [139]:
# drop duplicates
to_drop = [2903, 2914, 2969, 3238, 3264, 3287, 3902]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5116, 6)

In [149]:
# drop duplicates
to_drop = [4041]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5115, 6)

#### Scrape extra pages

In [75]:
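# temp_rescrape_lines and temp_rescrape_lines2 come from rescrape cells that are not shown in this notebook
temp_lines = df.loc[824, 'poem_lines'].copy()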
temp_lines.extend(temp_rescrape_lines)
temp_lines.extend(temp_rescrape_lines2)
temp_string = '\n'.join(temp_lines)

In [78]:
df.loc[824, 'poem_lines'] = temp_lines
df.loc[824, 'poem_string'] = temp_string

In [103]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=55&issue=5&page=4',
                  input_poet='Gertrude Stein', input_title='Before')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['Before',
 'As much as and alike and because',
 'Once before always before afraid in a dog fight',
 'But not now.',
 'Not at all now not when they not only wish to do',
 'Can they be ours and very pretty too.',
 'And you.',
 'Once more I think about a lake for her',
 'I do not think about a lake for them',
 'And I can be not only there not in the rain',
 'But when it is with them this it is soon seen',
 'So much comes so many co:',
 'Comfortably if they like what they come',
 'Tables of tables and frames of frames.',
 'For which they ask many permissions.',
 'I do know that now I do know why they went',
 '‘When they came',
 'To be',
 'And interested to be which name.',
 'Who comes to easily not know',
 'How many days they do know',
 'Or whether better either and or',
 'Before.',
 'She can be eight in wishes',
 'I said the difference is complicated',
 'And she said yes is it it is',
 'There seems so much to do',
 'With one or two with six not seven',
 'Either or.',
 'Or believe.',
 'Th

In [104]:
temp_lines = df.loc[2316, 'poem_lines'].copy()
temp_lines.extend(temp_rescrape_lines)
temp_string = '\n'.join(temp_lines)

In [106]:
df.loc[2316, 'poem_lines'] = temp_lines
df.loc[2316, 'poem_string'] = temp_string

#### Add titles

In [92]:
df.loc[1167, 'title'] = "Lady’s Boogie"
df.loc[1168, 'title'] = 'Flatted Fifths'

In [137]:
df.loc[3416, 'title'] = 'Christmas Eve'
df.loc[3417, 'title'] = 'The Obvious Tradition'
df.loc[3568, 'title'] = 'Deerfield:1703'
df.loc[3569, 'title'] = 'Slave Sale: New Orleans'
df.loc[3702, 'title'] = 'Song: to Celia [Come, my Celia, let us prove]'
df.loc[3703, 'title'] = 'Song: to Celia [“Drink to me only with thine eyes”]'
df.loc[3704, 'title'] = '“Though I am young, and cannot tell”'
df.loc[3705, 'title'] = 'Ode to Himself [“Come leave the loathéd stage”]'
df.loc[3900, 'title'] = 'Delia 36: But love whilst that thou mayst be loved again'
df.loc[3901, 'title'] = 'Delia 2: Go wailing verse, the infants of my love'
df.loc[3903, 'title'] = 'Delia 53: Unhappy pen and ill accepted papers'

In [141]:
df.loc[3904, 'title'] = 'Delia 47: Read in my face a volume of despairs'
df.loc[3905, 'title'] = 'Delia 1: Unto the boundless Ocean of thy beauty'
df.loc[3906, 'title'] = "Delia 6: Fair is my love, and cruel as she's fair"
df.loc[3907, 'title'] = 'Delia 46: Let others sing of knights and paladins'
df.loc[3908, 'title'] = 'Delia 37: When men shall find thy flower, thy glory pass'
df.loc[3909, 'title'] = 'Delia 45: Care-charmer Sleep, son of the sable Night'
df.loc[3917, 'title'] = 'Astrophil and Stella 3: Let dainty wits cry on the sisters nine'
df.loc[3918, 'title'] = 'Astrophil and Stella 41: Having this day my horse, my hand, my lance'
df.loc[3919, 'title'] = 'Astrophil and Stella 63: O Grammar rules, O now your virtues show'
df.loc[3920, 'title'] = 'Astrophil and Stella 64: No more, my dear, no more these counsels try'
df.loc[3921, 'title'] = 'Astrophil and Stella 52: A strife is grown between Virtue and Love'
df.loc[3922, 'title'] = 'Astrophil and Stella 21: Your words my friend (right healthful caustics) blame'
df.loc[3923, 'title'] = 'Astrophil and Stella 15: You that do search for every purling spring'
df.loc[3924, 'title'] = 'Astrophil and Stella 72: Desire, though thou my old companion art'
df.loc[3925, 'title'] = 'Astrophil and Stella 90: Stella, think not that I by verse seek fame'
df.loc[3926, 'title'] = 'Astrophil and Stella 92: Be your words made, good sir, of Indian ware'
df.loc[3927, 'title'] = 'Astrophil and Stella 49: I on my horse, and Love on me, doth try'
df.loc[3928, 'title'] = 'Astrophil and Stella 47: What, have I thus betrayed my liberty?'
df.loc[3929, 'title'] = 'Astrophil and Stella 107: Stella, since thou so right a princess art'
df.loc[3930, 'title'] = 'Astrophil and Stella 20: Fly, fly, my friends, I have my death wound, fly'
df.loc[3931, 'title'] = 'Astrophil and Stella 23: The curious wits, seeing dull pensiveness'
df.loc[3932, 'title'] = 'Astrophil and Stella 25: The wisest scholar of the wight most wise'
df.loc[3933, 'title'] = 'Astrophil and Stella 48: Soul’s joy, bend not those morning stars from me'
df.loc[3934, 'title'] = 'Astrophil and Stella 71: Who will in fairest book of nature know'

In [143]:
df.loc[3935, 'title'] = 'Astrophil and Stella 84: Highway, since you my chief Parnassus be'
df.loc[3936, 'title'] = "Astrophil and Stella 31: With how sad steps, O Moon, thou climb'st the skies"
df.loc[3937, 'title'] = 'Astrophil and Stella 33: I might!—unhappy word—O me, I might'
df.loc[3938, 'title'] = 'Astrophil and Stella 1: Loving in truth, and fain in verse my love to show'
df.loc[3939, 'title'] = "Astrophil and Stella 7: When Nature made her chief work, Stella's eyes"
df.loc[3940, 'title'] = 'Astrophil and Stella 14: Alas, have I not pain enough, my friend'
df.loc[3941, 'title'] = 'Song from Arcadia: “My True Love Hath My Heart”'
df.loc[3942, 'title'] = 'Astrophil and Stella 39: Come Sleep! O Sleep, the certain knot of peace'
df.loc[3997, 'title'] = 'Book 1, Epigram 39: Ad librum suum.'
df.loc[3998, 'title'] = 'Book 1, Epigram 5: Ad lectorem de subjecto operis sui.'
df.loc[3999, 'title'] = 'Book 7, Epigram 9: De senectute & iuuentute.'
df.loc[4000, 'title'] = 'Book 2, Epigram 4: Ad Henricum Wottonum.'
df.loc[4001, 'title'] = 'Book 5, Epigram 20: In Misum & Mopsam.'
df.loc[4002, 'title'] = 'Book 1, Epigram 34: Ad. Thomam Freake armig. de veris adventu.'
df.loc[4009, 'title'] = 'Book 6, Epigram 17: In Sextum.'
df.loc[4010, 'title'] = 'Book 2, Epigram 21: In Momum.'
df.loc[4011, 'title'] = 'Book 6, Epigram 7: In prophanationem nominis Dei.'
df.loc[4012, 'title'] = 'Book 7, Epigram 36: De puero balbutiente.'
df.loc[4013, 'title'] = 'Book 7, Epigram 47: De Hominis Ortu & Sepultura.'
df.loc[4014, 'title'] = 'Book 2, Epigram 40: De libro suo.'
df.loc[4015, 'title'] = 'Book 6, Epigram 14: De Piscatione.'
df.loc[4042, 'title'] = 'Song: “Come away, come away, death”'
df.loc[4043, 'title'] = 'Song: “Where the bee sucks, there suck I”'

In [146]:
df.loc[4044, 'title'] = 'Song: “When daisies pied and violets blue”'
df.loc[4045, 'title'] = 'Song: “Sigh no more, ladies, sigh no more”'
df.loc[4046, 'title'] = 'Song: “Orpheus with his lute made trees”'
df.loc[4047, 'title'] = 'Song: “Fear no more the heat o’ the sun”'
df.loc[4048, 'title'] = 'Sonnet 135: Whoever hath her wish, thou hast thy Will'
df.loc[4049, 'title'] = 'Song: “O Mistress mine where are you roaming?”'
df.loc[4050, 'title'] = 'Song: “Who is Silvia? what is she”'
df.loc[4051, 'title'] = 'Song: “When that I was and a little tiny boy (With hey, ho, the wind and the rain)”'
df.loc[4052, 'title'] = "Song: “Hark, hark! the lark at heaven's gate sings”"
df.loc[4113, 'title'] = 'Speech: “O Romeo, Romeo, wherefore art thou Romeo?”'
df.loc[4114, 'title'] = 'Speech: “Is this a dagger which I see before me”'
df.loc[4115, 'title'] = 'Speech: “No matter where; of comfort no man speak”'
df.loc[4116, 'title'] = 'Speech: “To be, or not to be, that is the question”'
df.loc[4117, 'title'] = 'Speech: “This day is called the feast of Crispian”'
df.loc[4118, 'title'] = 'Speech: “Friends, Romans, countrymen, lend me your ears”'
df.loc[4119, 'title'] = 'Speech: “Once more unto the breach, dear friends, once more”'
df.loc[4120, 'title'] = 'Speech: “Tomorrow, and tomorrow, and tomorrow”'
df.loc[4121, 'title'] = 'Speech: “The raven himself is hoarse”'
df.loc[4122, 'title'] = 'Song: “Take, oh take those lips away”'
df.loc[4123, 'title'] = 'Speech: “Time hath, my lord, a wallet at his back”'
df.loc[4124, 'title'] = 'Song: “It was a lover and his lass”'
df.loc[4125, 'title'] = "Speech: “All the world's a stage”"
df.loc[4126, 'title'] = 'Song: “Blow, blow, thou winter wind”'

In [148]:
df.loc[4139, 'title'] = "Sonnet 92: Behold that tree, in Autumn’s dim decay"
df.loc[4140, 'title'] = 'Sonnet 91: On the fleet streams, the Sun, that late arose'

#### Miscellaneous

In [88]:
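df.loc[1931, 'poem_lines'] = df.loc[1931, 'poem_lines'][:4]
# rebuild the string so it stays in sync with the trimmed lines
df.loc[1931, 'poem_string'] = '\n'.join(df.loc[1931, 'poem_lines'])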

- Check for NaN values.

In [151]:
df.isna().sum()

poet           0
poem_url       0
title          9
poem_lines     0
poem_string    8
genre          0
dtype: int64
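
The mismatch between `poem_lines` (0) and `poem_string` (8) is probably because an empty string is read back from CSV as `NaN`, while an empty list is a perfectly valid value that `isna` doesn't flag. A toy sketch of the difference (made-up data, not the real DataFrame):

```python
import numpy as np
import pandas as pd

# Made-up rows: the second poem has no text at all.
toy = pd.DataFrame({
    'poem_lines': [['a line'], []],
    'poem_string': ['a line', np.nan],
})

# isna() flags the missing string but not the empty list.
print(toy.isna().sum())

# Empty lists need their own check, for example by length.
empty_lines = toy['poem_lines'].apply(len) == 0
print(empty_lines.sum())
```

So a length check on `poem_lines` is worth running alongside `isna()` to catch poems that were scraped with no text.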

In [152]:
df[df.title.isna()]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
114,Allen Ginsberg,https://www.poetryfoundation.org/poems/50123/new-stanzas-for-amazing-grace,,"[I dreamed I dwelled in a homeless place, Where I was lost alone, Folk looked right through me into space, And passed with eyes of stone, O homele...",I dreamed I dwelled in a homeless place\nWhere I was lost alone\nFolk looked right through me into space\nAnd passed with eyes of stone\nO homeles...,beat
264,Amiri Baraka,https://www.poetryfoundation.org/poems/52777/an-agony-as-now,,"[I am inside someone, who hates me. I look, out from his eyes. Smell, what fouled tunes come in, to his breath. Love his, wretched women., Slits i...",I am inside someone\nwho hates me. I look\nout from his eyes. Smell\nwhat fouled tunes come in\nto his breath. Love his\nwretched women.\nSlits in...,black_arts_movement
306,Gwendolyn Brooks,https://www.poetryfoundation.org/poems/58377/riot-56d23cb395a01,,"[A Poem in Three Parts, , John Cabot, out of Wilma, once a Wycliffe, , Because the “Negroes” were coming down the street. , Because ...","A Poem in Three Parts\n \nJohn Cabot, out of Wilma, once a Wycliffe, \nBecause the “Negroes” were coming down the street. \nBecause t...",black_arts_movement
1045,Walter de La Mare,https://www.poetryfoundation.org/poems/48215/gloria-mundi,,"[Upon a bank, easeless with knobs of gold,, Beneath a canopy of noonday smoke,, I saw a measureless Beast, morose and bold,, With eyes like one fr...","Upon a bank, easeless with knobs of gold,\nBeneath a canopy of noonday smoke,\nI saw a measureless Beast, morose and bold,\nWith eyes like one fro...",georgian
2186,Edgar Lee Masters,https://www.poetryfoundation.org/poems/56348/archibald-higbie,,"[I loathed you, Spoon River. I tried to rise above you,, I was ashamed of you. I despised you, As the place of my nativity., And there in Rome, am...","I loathed you, Spoon River. I tried to rise above you,\nI was ashamed of you. I despised you\nAs the place of my nativity.\nAnd there in Rome, amo...",modern
3874,Mary Sidney Herbert Countess of Pembroke,https://www.poetryfoundation.org/poems/55249/o-56d2369e67a1d,,"[Oh, what a lantern, what a lamp of light, Is thy pure word to me, To clear my paths and guide my goings right!, I swore and swear again,, I of th...","Oh, what a lantern, what a lamp of light\nIs thy pure word to me\nTo clear my paths and guide my goings right!\nI swore and swear again,\nI of the...",renaissance
3985,Sir Walter Ralegh,https://www.poetryfoundation.org/poems/57130/on-the-cards-and-dice,,"[Before the sixth day of the next new year,, Strange wonders in this kingdom shall appear:, Four kings shall be assembled in this isle,, Where the...","Before the sixth day of the next new year,\nStrange wonders in this kingdom shall appear:\nFour kings shall be assembled in this isle,\nWhere they...",renaissance
4216,John Keats,https://www.poetryfoundation.org/poems/44468/bright-star-would-i-were-stedfast-as-thou-art,,"[Bright star, would I were stedfast as thou art—, Not in lone splendour hung aloft the night, And watching, with eternal lids apart,, Like nature'...","Bright star, would I were stedfast as thou art—\nNot in lone splendour hung aloft the night\nAnd watching, with eternal lids apart,\nLike nature's...",romantic
4766,Elizabeth Barrett Browning,https://www.poetryfoundation.org/poems/43733/sonnets-from-the-portuguese-1-i-thought-once-how-theocritus-had-sung,,"[I thought once how Theocritus had sung, Of the sweet years, the dear and wished for years,, Who each one in a gracious hand appears, To bear a gi...","I thought once how Theocritus had sung\nOf the sweet years, the dear and wished for years,\nWho each one in a gracious hand appears\nTo bear a gif...",victorian


In [153]:
df.loc[114, 'title'] = 'New Stanzas for Amazing Grace'
df.loc[264, 'title'] = 'An Agony. As Now.'
df.loc[306, 'title'] = 'RIOT'
df.loc[1045, 'title'] = 'Gloria Mundi'
df.loc[2186, 'title'] = 'Archibald Higbie'
df.loc[3874, 'title'] = 'O'
df.loc[3985, 'title'] = 'On the Cards and Dice'
df.loc[4216, 'title'] = '“Bright star, would I were stedfast as thou art”'
df.loc[4766, 'title'] = 'Sonnets from the Portuguese 1: I thought once how Theocritus had sung'
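
The manual title fixes above could also be kept in a single mapping and applied in a loop, which makes the list of corrections easier to audit. This is only a minimal sketch on a toy frame: the `title_fixes` name and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# toy stand-in for the real dataframe (illustrative data only)
toy = pd.DataFrame(
    {"title": [None, "Some Title", None]},
    index=[114, 120, 264],
)

# hypothetical mapping of row index -> corrected title
title_fixes = {
    114: "New Stanzas for Amazing Grace",
    264: "An Agony. As Now.",
}

for idx, title in title_fixes.items():
    toy.loc[idx, "title"] = title

# confirm that no titles are still missing
print(toy.title.isna().sum())
```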

In [154]:
df[df.poem_string.isna()]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
239,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
2339,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25655/toward-the-south-tr-by-harry-duncan,Toward The South Tr By Harry Duncan,[],,modern
2526,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968,[],,modern
2886,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/22310/poem-after-the-wrestling,Poem After The Wrestling,[],,modern
3007,William Butler Yeats,https://www.poetryfoundation.org/poetrymagazine/poems/20737/a-full-moon-in-march,A Full Moon In March,[],,modern
3143,Frank O'Hara,https://www.poetryfoundation.org/poetrymagazine/poems/31123/places-for-oscar-salvador,Places For Oscar Salvador,[],,new_york_school
3386,Anne Waldman,https://www.poetryfoundation.org/poetrymagazine/poems/56845/history-will-decide,History Will Decide,[],,new_york_school_2nd_generation
3516,Tom Clark,https://www.poetryfoundation.org/poetrymagazine/poems/30773/fig-1,Fig 1,[],,new_york_school_2nd_generation


In [199]:
# drop duplicates
to_drop = [239, 2339, 2526, 3007, 3143]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5110, 6)

In [198]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=54&issue=1&page=17'
url = df.loc[2886, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2886, 'poet'], input_title='Poem')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2886, 'genre']
df.loc[2886, 'poem_lines'] = rescrape['poem_lines']
df.loc[2886, 'poem_string'] = rescrape['poem_string']

In [172]:
# call the rescraper once and unpack its (lines, string) result
lines, string = justify_rescraper(df.loc[3386, 'poem_url'])
df.loc[3386, 'poem_lines'] = lines
df.loc[3386, 'poem_string'] = string

In [191]:
# rescrape (NOTE: only grabs first page)
url = df.loc[3516, 'poem_url']
rescrape = scan_poem_scraper(url, input_poet=df.loc[3516, 'poet'], input_title='FIG. 1: Weakly cuddling the telephone as a last')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[3516, 'genre']

In [192]:
# scrape second page
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=6',
                  input_poet='Tom Clark', input_title='POETRY')
temp_rescrape_lines = temp_rescrape['poem_lines']

In [193]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[3516, 'poem_lines'] = rescrape['poem_lines']
df.loc[3516, 'poem_string'] = rescrape['poem_string']

- Re-check for NaN values.

In [213]:
df.isna().sum()

poet           0
poem_url       0
title          0
poem_lines     0
poem_string    0
genre          0
dtype: int64

In [219]:
# sort dataframe
df.sort_values(by=['genre', 'poet', 'title'], inplace=True)

# reset indices
df.reset_index(drop=True, inplace=True)

In [220]:
df.to_csv('data/poems_df_cleaner.csv')

In [218]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5110 entries, 0 to 2334
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   poet         5110 non-null   object
 1   poem_url     5110 non-null   object
 2   title        5110 non-null   object
 3   poem_lines   5110 non-null   object
 4   poem_string  5110 non-null   object
 5   genre        5110 non-null   object
dtypes: object(6)
memory usage: 279.5+ KB


In [221]:
# character count of each poem, to spot suspiciously short scrapes
df['temp_len'] = df.poem_string.str.len()
df.temp_len.describe()

count     5110.000000
mean      1486.975342
std       2715.662463
min          1.000000
25%        474.250000
50%        715.000000
75%       1404.000000
max      53241.000000
Name: temp_len, dtype: float64

In [222]:
df[df.temp_len <= 10]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len
1072,Wilfred Owen,https://www.poetryfoundation.org/poems/57369/the-send-off,The Send-Off,[ ],,georgian,1
1473,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/28309/3-stances,3 Stances,[I],I,imagist,1
2731,Paul Valéry,https://www.poetryfoundation.org/poetrymagazine/poems/27998/to-the-plane-tree-tr-by-louise-bogan-and-may-sarton,To The Plane Tree Tr By Louise Bogan And May Sarton,[ ],,modern,1
2844,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/29238/journal-leaves,Journal Leaves,[I],I,modern,1
3308,Alice Notley,https://www.poetryfoundation.org/poems/58243/gift-56d23c725d4d9,Gift,"[, ]",\n,new_york_school_2nd_generation,1
3347,Aram Saroyan,https://www.poetryfoundation.org/poetrymagazine/poems/30722/cham-pagne,Cham Pagne,[cham.],cham.,new_york_school_2nd_generation,5
3544,Gaius Valerius Catullus,https://www.poetryfoundation.org/poetrymagazine/poems/31271/peliaco-quondam-with-celia-zukofsky,Peliaco Quondam With Celia Zukofsky,[ ],,objectivist,1
4474,A. E. Housman,https://www.poetryfoundation.org/poems/58269/a-shropshire-lad-52-far-in-a-western-brookland-,A Shropshire Lad 52: Far in a western brookland,[ ],,victorian,1
4876,Katharine Tynan,https://www.poetryfoundation.org/poems/57349/a-lament-56d23ac7ae84a,A Lament,[ ],,victorian,1
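
The length check above can be sketched on a toy frame to show how the cutoff separates stub scrapes from real poems. The toy data and the `MAX_SUSPECT_LEN` name are assumptions for illustration; the value 10 mirrors the threshold used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "poem_string": ["I", "A full stanza of text\nwith two lines", " "],
})

# character count per poem
toy["temp_len"] = toy.poem_string.str.len()

# hypothetical cutoff, matching the <= 10 used above
MAX_SUSPECT_LEN = 10
suspects = toy[toy.temp_len <= MAX_SUSPECT_LEN]
print(suspects.index.tolist())
```

A character cutoff is a blunt tool: it will also flag genuinely tiny poems, so each hit still needs a manual look at the source page.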


In [229]:
poempara_rescraper('https://www.poetryfoundation.org/poems/57369/the-send-off')

(['Down the close, darkening lanes they sang their way',
  'To the siding-shed,',
  'And lined the train with faces grimly gay.',
  'Their breasts were stuck all white with wreath and spray',
  "As men's are, dead.",
  'Dull porters watched them, and a casual tramp',
  'Stood staring hard,',
  'Sorry to miss them from the upland camp.',
  'Then, unmoved, signals nodded, and a lamp',
  'Winked to the guard.',
  'So secretly, like wrongs hushed-up, they went.',
  'They were not ours:',
  'We never heard to which front these were sent.',
  'Nor there if they yet mock what women meant',
  'Who gave them flowers.',
  'Shall they return to beatings of great bells',
  'In wild trainloads?',
  'A few, a few, too few for drums and yells,',
  'May creep back, silent, to still village wells',
  'Up half-known roads.'],
 "Down the close, darkening lanes they sang their way\nTo the siding-shed,\nAnd lined the train with faces grimly gay.\nTheir breasts were stuck all white with wreath and spray\nAs


- **I'll look at a breakdown of genres and see if there are any I should get rid of.**
- **My initial thought is to limit the dataset by time period, so that it doesn't mix very different stages of the language (say, Shakespearean English and modern English).**

In [88]:
df.genre.value_counts()

modern                            1279
victorian                          643
renaissance                        426
romantic                           398
imagist                            356
new_york_school                    264
black_mountain                     257
new_york_school_2nd_generation     192
language_poetry                    192
confessional                       176
black_arts_movement                165
georgian                           160
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           114
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

In [89]:
# check a sample Middle English poem
print(df[df.genre == 'middle_english'].poem_string.iloc[0])

Whan that Aprille with his shour
The droghte of March hath perc
And bath
Of which vertú engendr
Whan Zephirus eek with his swet
Inspir
The tendr
Hath in the Ram his half
And smal
That slepen al the nyght with open y
So priketh hem Natúre in hir corag
Thanne longen folk to goon on pilgrimag
And palmeres for to seken straung
To fern
And specially, from every shir
Of Eng
The hooly blisful martir for to sek
That hem hath holpen whan that they were seek

Bifil that in that seson on a day, 
In Southwerk at the Tabard as I lay, 
Redy to wenden on my pilgrymag
To Caunterbury with ful devout corag
At nyght were come into that hostelry
Wel nyne and twenty in a compaigny
Of sondry folk, by áventure y-fall
In felaweshipe, and pilgrimes were they all
That toward Caunterbury wolden ryd
The chambr
And wel we weren es
And shortly, whan the sonn
So hadde I spoken with hem everychon, 
That I was of hir felaweshipe anon, 
And mad
To take oure wey, ther as I yow devys

But nath
Er that I ferther in thi

- **Indeed, Middle English is definitely out.**

In [90]:
df = df[df.genre != 'middle_english']
df.shape

(5166, 7)

In [91]:
# check a sample Renaissance poem
print(df[df.genre == 'renaissance'].poem_string.iloc[0])

Long have I long’d to see my love againe,
   Still have I wisht, but never could obtaine it;
   Rather than all the world (if I might gaine it)
Would I desire my love’s sweet precious gaine.
Yet in my soule I see him everie day,
   See him, and see his still sterne countenaunce,
   But (ah) what is of long continuance,
Where majestie and beautie beares the sway?
Sometimes, when I imagine that I see him,
   (As love is full of foolish fantasies)
   Weening to kisse his lips, as my love’s fees,
I feele but aire: nothing but aire to bee him.
   Thus with Ixion, kisse I clouds in vaine:
   Thus with Ixion, feele I endles paine.





In [92]:
# check a sample Augustan poem
print(df[df.genre == 'augustan'].poem_string.iloc[1])

And auld Robin Forbes hes gien tem a dance,
I pat on my speckets to see them aw prance;
I thout o’ the days when I was but fifteen,
And skipp’d wi’ the best upon Forbes’s green.
Of aw things that is I think thout is meast queer,
It brings that that’s by-past and sets it down here;
I see Willy as plain as I dui this bit leace,
When he tuik his cwoat lappet and deeghted his feace.

The lasses aw wonder’d what Willy cud see
In yen that was dark and hard featur’d leyke me;
And they wonder’d ay mair when they talk’d o’ my wit,
And slily telt Willy that cudn’t be it:
But Willy he laugh’d, and he meade me his weyfe,
And whea was mair happy thro’ aw his lang leyfe?
It’s e’en my great comfort, now Willy is geane,
The he offen said— nae place was leyke his awn heame!

I mind when I carried my wark to yon steyle
Where Willy was deykin, the time to beguile,
He wad fling me a daisy to put i’ my breast,
And I hammer’d my noddle to mek out a jest.
But merry or grave, Willy often w

- **According to Poetry Foundation's website, Renaissance and Augustan poems date from 1500 to 1780, and the differences from present-day English are fairly clear in the samples above.**
- **For now, I'll drop these.**

In [93]:
df_trim = df[df.genre != 'renaissance']
df_trim = df_trim[df_trim.genre != 'augustan']
df_trim.shape

(4626, 7)
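
The genre filtering above chains one `!=` comparison per genre. A compact alternative, sketched here on toy data, is a single `isin` mask; the `genres_to_drop` name is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre": ["modern", "renaissance", "augustan", "victorian", "middle_english"],
})

# genres excluded in this notebook
genres_to_drop = ["middle_english", "renaissance", "augustan"]

# ~ negates the boolean mask, keeping rows whose genre is NOT in the list
toy_trim = toy[~toy.genre.isin(genres_to_drop)].reset_index(drop=True)
print(toy_trim.genre.tolist())
```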

In [94]:
# check a sample Victorian poem
print(df[df.genre == 'victorian'].poem_string.iloc[1])

I
The evening comes, the fields are still. 
The tinkle of the thirsty rill, 
Unheard all day, ascends again; 
Deserted is the half-mown plain, 
Silent the swaths! the ringing wain, 
The mower's cry, the dog's alarms, 
All housed within the sleeping farms! 
The business of the day is done, 
The last-left haymaker is gone. 
And from the thyme upon the height, 
And from the elder-blossom white 
And pale dog-roses in the hedge, 
And from the mint-plant in the sedge, 
In puffs of balm the night-air blows 
The perfume which the day forgoes. 
And on the pure horizon far, 
See, pulsing with the first-born star, 
The liquid sky above the hill! 
The evening comes, the fields are still. 

       Loitering and leaping, 
       With saunter, with bounds— 
       Flickering and circling 
       In files and in rounds— 
       Gaily their pine-staff green 
       Tossing in air, 
       Loose o'er their shoulders white 
       Showering their hair— 
       See! the wild Maenads 
       Break from the

In [95]:
# check a sample Romantic poem
print(df[df.genre == 'romantic'].poem_string.iloc[1])

Now in thy dazzling half-oped eye, 
Thy curled nose and lip awry, 
Uphoisted arms and noddling head, 
And little chin with crystal spread, 
Poor helpless thing! what do I see, 
That I should sing of thee? 

From thy poor tongue no accents come, 
Which can but rub thy toothless gum: 
Small understanding boasts thy face, 
Thy shapeless limbs nor step nor grace: 
A few short words thy feats may tell, 
And yet I love thee well. 

When wakes the sudden bitter shriek, 
And redder swells thy little cheek 
When rattled keys thy woes beguile, 
And through thine eyelids gleams the smile, 
Still for thy weakly self is spent 
Thy little silly plaint. 

But when thy friends are in distress. 
Thou’lt laugh and chuckle n’ertheless, 
Nor with kind sympathy be smitten, 
Though all are sad but thee and kitten; 
Yet puny varlet that thou art, 
Thou twitchest at the heart. 

Thy smooth round cheek so soft and warm; 
Thy pinky hand and dimpled arm; 
Thy silken locks that scantly peep, 
With gold tipped end

- **Romantic and Victorian poems date from 1781 to 1900, but their language seems fairly close to modern English.**
- **Plus, these are some very formative genres for poetry in English. For now, I'll keep these.**

- **All other genres are from after 1900.**

In [96]:
# let's reindex
df_trim.reset_index(drop=True, inplace=True)

## Rescraping (again)
- **Look more closely at how the scraping went.**
- **Eventually, I'll want to create some new features, like number of lines and average line length.**
    - **Since I can't divide by zero, this is a good opportunity to look for any unsuccessful scrapes--those where 0 or too few lines were scraped.**
    - **NOTE: I'm checking if length of poem_lines is less than or equal to 1 because that yielded the desired results, whereas seeing if length equaled 0 did not.**

In [97]:
df_trim[df_trim['poem_lines'].map(lambda x: len(x)) <= 1]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
222,https://www.poetryfoundation.org/poets/henry-dumas,black_arts_movement,https://www.poetryfoundation.org/poems/53477/kef-21,Henry Dumas,Kef 21,"[First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling...","First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling ..."
428,https://www.poetryfoundation.org/poets/robert-duncan,black_mountain,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,Robert Duncan,A Poem Beginning with a Line by Pindar,[I],I
703,https://www.poetryfoundation.org/poets/anne-sexton,confessional,https://www.poetryfoundation.org/poems/152252/o-ye-tongues,Anne Sexton,O Ye Tongues,[First Psalm],First Psalm
952,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57369/the-send-off,Wilfred Owen,The Send-Off,[ ],
953,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57347/smile-smile-smile,Wilfred Owen,"Smile, Smile, Smile","[Head to limp head, the sunk-eyed wounded scanned]","Head to limp head, the sunk-eyed wounded scanned"
1231,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53772/spring-day-56d233626c49b,Amy Lowell,Spring Day,[<em> Bath</em>],<em> Bath</em>
1234,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53773/towns-in-colour,Amy Lowell,Towns in Colour,"[Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!]","Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!"
1389,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poems/54567/kora-in-hell-improvisations-xi,William Carlos Williams,Kora in Hell: Improvisations XI,[XI],XI
1603,https://www.poetryfoundation.org/poets/lyn-hejinian,language_poetry,https://www.poetryfoundation.org/poems/47892/my-life-a-name-trimmed-with-colored-ribbons,Lyn Hejinian,My Life: A name trimmed with colored ribbons,[A name trimmed],A name trimmed
1615,https://www.poetryfoundation.org/poets/fanny-howe,language_poetry,https://www.poetryfoundation.org/poems/46762/everythings-a-fake,Fanny Howe,Everything’s a Fake,"[Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it’s spring. At night the mockingbirds ring their warnings of ...","Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it’s spring. At night the mockingbirds ring their warnings of c..."


- **After building out some specific rescraping functions, I can replace the poem_lines and poem_string values.**

In [100]:
# rescrape poem based on index from above 
df_trim.loc[428,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[428,'poem_url'])[0])
df_trim.loc[428,'poem_string'] = PoemView_rescraper(df_trim.loc[428,'poem_url'])[1]

df_trim.loc[703,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[703,'poem_url'])[0])
df_trim.loc[703,'poem_string'] = PoemView_rescraper(df_trim.loc[703,'poem_url'])[1]

df_trim.loc[952,'poem_lines'] = str(poempara_rescraper(df_trim.loc[952,'poem_url'])[0])
df_trim.loc[952,'poem_string'] = poempara_rescraper(df_trim.loc[952,'poem_url'])[1]

df_trim.loc[953,'poem_lines'] = str(modified_regular_rescraper(df_trim.loc[953,'poem_url'])[0])
df_trim.loc[953,'poem_string'] = modified_regular_rescraper(df_trim.loc[953,'poem_url'])[1]

df_trim.loc[1231,'poem_lines'] = str(justify_rescraper(df_trim.loc[1231,'poem_url'])[0])
df_trim.loc[1231,'poem_string'] = justify_rescraper(df_trim.loc[1231,'poem_url'])[1]

df_trim.loc[1234,'poem_lines'] = str(justify_rescraper(df_trim.loc[1234,'poem_url'])[0])
df_trim.loc[1234,'poem_string'] = justify_rescraper(df_trim.loc[1234,'poem_url'])[1]

df_trim.loc[1389,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1389,'poem_url'])[0])
df_trim.loc[1389,'poem_string'] = PoemView_rescraper(df_trim.loc[1389,'poem_url'])[1]

df_trim.loc[1603,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1603,'poem_url'])[0])
df_trim.loc[1603,'poem_string'] = PoemView_rescraper(df_trim.loc[1603,'poem_url'])[1]

df_trim.loc[2514,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2514,'poem_url'])[0])
df_trim.loc[2514,'poem_string'] = PoemView_rescraper(df_trim.loc[2514,'poem_url'])[1]

df_trim.loc[2517,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2517,'poem_url'])[0])
df_trim.loc[2517,'poem_string'] = PoemView_rescraper(df_trim.loc[2517,'poem_url'])[1]

df_trim.loc[3335,'poem_lines'] = str(ranged_rescraper(df_trim.loc[3335,'poem_url'])[0])
df_trim.loc[3335,'poem_string'] = ranged_rescraper(df_trim.loc[3335,'poem_url'])[1]

df_trim.loc[3418,'poem_lines'] = str(center_rescraper(df_trim.loc[3418,'poem_url'])[0])
df_trim.loc[3418,'poem_string'] = center_rescraper(df_trim.loc[3418,'poem_url'])[1]

df_trim.loc[3421,'poem_lines'] = str(justify_rescraper(df_trim.loc[3421,'poem_url'])[0])
df_trim.loc[3421,'poem_string'] = justify_rescraper(df_trim.loc[3421,'poem_url'])[1]

df_trim.loc[4217,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4217,'poem_url'])[0])
df_trim.loc[4217,'poem_string'] = poempara_rescraper(df_trim.loc[4217,'poem_url'])[1]

df_trim.loc[4611,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4611,'poem_url'])[0])
df_trim.loc[4611,'poem_string'] = poempara_rescraper(df_trim.loc[4611,'poem_url'])[1]

In [104]:
# found some more...
df_trim.loc[1388,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1388,'poem_url'])[0])
df_trim.loc[1388,'poem_string'] = PoemView_rescraper(df_trim.loc[1388,'poem_url'])[1]

df_trim.loc[1390,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1390,'poem_url'])[0])
df_trim.loc[1390,'poem_string'] = PoemView_rescraper(df_trim.loc[1390,'poem_url'])[1]

df_trim.loc[1391,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1391,'poem_url'])[0])
df_trim.loc[1391,'poem_string'] = PoemView_rescraper(df_trim.loc[1391,'poem_url'])[1]

df_trim.loc[1392,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1392,'poem_url'])[0])
df_trim.loc[1392,'poem_string'] = PoemView_rescraper(df_trim.loc[1392,'poem_url'])[1]

In [106]:
# another one...
df_trim.loc[3399,'poem_lines'] = str(image_rescraper(df_trim.loc[3399,'poem_url'])[0])
df_trim.loc[3399,'poem_string'] = image_rescraper(df_trim.loc[3399,'poem_url'])[1]
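
Each pair of lines above calls the same rescraper twice, which means two requests per poem. A small helper could call it once and write both values back. This is only a sketch: `apply_rescrape` and `fake_rescraper` are hypothetical names, and the stand-in assumes the real rescrapers return a `(lines, string)` pair, as the `[0]` and `[1]` indexing above implies.

```python
import pandas as pd

def fake_rescraper(url):
    """Stand-in for a real rescraper; returns (lines, string)."""
    lines = ["first line", "second line"]
    return lines, "\n".join(lines)

def apply_rescrape(frame, idx, rescraper):
    """Call the rescraper once and write both results back to the row."""
    lines, string = rescraper(frame.loc[idx, "poem_url"])
    # stored as a string here, mirroring the cells above; converted back later
    frame.loc[idx, "poem_lines"] = str(lines)
    frame.loc[idx, "poem_string"] = string

toy = pd.DataFrame(
    {
        "poem_url": ["https://example.com/a"],
        "poem_lines": ["['I']"],
        "poem_string": ["I"],
    },
    index=[428],
)
apply_rescrape(toy, 428, fake_rescraper)
print(toy.loc[428, "poem_string"])
```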

- **Some scraped entries contain only leftover HTML markup rather than poem text, so I'll try to re-scrape those.**

In [108]:
# check if html tags are in the string
df_trim[df_trim.poem_string.str.contains('<div')]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
237,https://www.poetryfoundation.org/poets/nikki-giovanni,black_arts_movement,https://www.poetryfoundation.org/poems/90181/no-complaints,Nikki Giovanni,No Complaints,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917—2001)</span></p><...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917—2001)</span></p></..."
1687,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55563/you-part-i,Ron Silliman,"You, part I","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
1688,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55564/you-part-xii,Ron Silliman,"You, part XII","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
4260,https://www.poetryfoundation.org/poets/emma-lazarus,victorian,https://www.poetryfoundation.org/poems/46791/by-the-waters-of-babylon,Emma Lazarus,By the Waters of Babylon,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>, ]","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>\n"
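
Searching for `'<div'` only catches one kind of tag; an entry like `<em> Bath</em>` from the earlier table would slip through. A broader check could use a regular expression that matches any opening or closing tag. The pattern and the toy data below are assumptions for illustration, not the notebook's method.

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "poem_string": [
        "A clean line\nand another",
        '\n<div class="c-epigraph">\n<p>for someone</p>\n</div>\n',
        "<em> Bath</em>",
    ],
})

# matches an opening or closing tag such as <div ...>, </p>, <em>
TAG_PATTERN = r"</?[a-zA-Z][^>]*>"

has_html = toy.poem_string.str.contains(TAG_PATTERN, regex=True)
print(toy[has_html].index.tolist())

# strip tags from one string as a quick fallback (a rescrape is still preferable)
print(re.sub(TAG_PATTERN, "", toy.loc[2, "poem_string"]).strip())
```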


In [159]:
# rescrape poem based on index from above 
df_trim.loc[237,'poem_lines'] = str(PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[0])
df_trim.loc[237,'poem_string'] = PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[1]

df_trim.loc[1687,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1687,'poem_url'])[0])
df_trim.loc[1687,'poem_string'] = PoemView_rescraper(df_trim.loc[1687,'poem_url'])[1]

df_trim.loc[1688,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1688,'poem_url'])[0])
df_trim.loc[1688,'poem_string'] = PoemView_rescraper(df_trim.loc[1688,'poem_url'])[1]

df_trim.loc[4260,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[4260,'poem_url'])[0])
df_trim.loc[4260,'poem_string'] = PoemView_rescraper(df_trim.loc[4260,'poem_url'])[1]

In [160]:
# re-run the destringify function
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)
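
The real `destringify` lives in `functions.py` and isn't shown in this notebook. As a rough guess at what such a helper might do, the sketch below turns a stringified list back into a list with `ast.literal_eval` and leaves real lists untouched. The name `destringify_sketch` is hypothetical.

```python
import ast

def destringify_sketch(value):
    """Turn a stringified list back into a list; leave real lists untouched."""
    if isinstance(value, str):
        # literal_eval only parses Python literals, so it is safer than eval
        return ast.literal_eval(value)
    return value

print(destringify_sketch("['line one', 'line two']"))
print(destringify_sketch(["already", "a list"]))
```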

- **Re-check for poems whose poem_lines is an empty list; these don't show up as NaNs.**

In [165]:
df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
783,https://www.poetryfoundation.org/poets/randall-jarrell,fugitive,https://www.poetryfoundation.org/poetrymagazine/poems/25237/goodbye-wendover-goodbye-mountain-home,Randall Jarrell,Goodbye Wendover Goodbye Mountain Home,[],
1326,https://www.poetryfoundation.org/poets/ezra-pound,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/13071/dogmatic-statement-concerning-the-game-of-chess-theme-for-a-series-of-pictures,Ezra Pound,Dogmatic Statement Concerning The Game Of Chess Theme For A Series Of Pictures,[],
1433,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/20226/a-foot-note,William Carlos Williams,A Foot Note,[],
1438,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/24855/paterson-book-ii,William Carlos Williams,Paterson Book Ii,[],
1736,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/22702/poem-he-watched-with-all-his,W. H. Auden,Poem He Watched With All His,[],
1738,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21500/poem-o-who-can-ever-praise-enough-the-price,W. H. Auden,Poem O Who Can Ever Praise Enough The Price,[],
1775,https://www.poetryfoundation.org/poets/louise-bogan,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21807/untitled-tender-and-insolent,Louise Bogan,Untitled Tender And Insolent,[],
1826,https://www.poetryfoundation.org/poets/hart-crane,modern,https://www.poetryfoundation.org/poetrymagazine/poems/17345/at-melvilles-tomb,Hart Crane,At Melvilles Tomb,[],
2056,https://www.poetryfoundation.org/poets/a-m-klein,modern,https://www.poetryfoundation.org/poetrymagazine/poems/23448/come-two-like-shadows,A. M. Klein,Come Two Like Shadows,[],
2582,https://www.poetryfoundation.org/poets/wallace-stevens,modern,https://www.poetryfoundation.org/poetrymagazine/poems/19837/good-man-bad-woman,Wallace Stevens,Good Man Bad Woman,[],


In [169]:
# create a list of indices
lookups6 = list(df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0].index)
lookups6

[783,
 1326,
 1433,
 1438,
 1736,
 1738,
 1775,
 1826,
 2056,
 2582,
 2685,
 2790,
 2817,
 3191]

In [174]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so I'm running that again
for i in lookups6:
    try:
        info = image_rescraper_title(df_trim.loc[i, 'poem_url'], df_trim.loc[i, 'title'])
        df_trim.loc[i,'poem_lines'] = str(info[0])
        df_trim.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except Exception:
        print(f'Failure -- {i}')
        continue

Success -- 783
Success -- 1326
Success -- 1433
Success -- 1438
Success -- 1736
Success -- 1738
Success -- 1775
Success -- 1826
Success -- 2056
Success -- 2582
Success -- 2685
Success -- 2790
Success -- 2817
Failure -- 3191
CPU times: user 1.58 s, sys: 214 ms, total: 1.79 s
Wall time: 51.6 s
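
A variation on the loop above would collect the failures, along with the reason for each, instead of only printing them, so they can be retried afterwards. This is a sketch with a stand-in scraper; `flaky_scraper`, the URLs, and the `lookups` mapping are all hypothetical.

```python
def flaky_scraper(url, title):
    """Stand-in for a real rescraper; fails for one hypothetical URL."""
    if "bad" in url:
        raise ValueError("could not locate poem text")
    return ["a line"], "a line"

# hypothetical index -> (url, title) lookups
lookups = {
    783: ("https://example.com/good", "Some Title"),
    3191: ("https://example.com/bad", "Other Title"),
}

results, failures = {}, {}
for idx, (url, title) in lookups.items():
    try:
        results[idx] = flaky_scraper(url, title)
    except Exception as exc:  # narrower than a bare except; keeps Ctrl-C working
        failures[idx] = repr(exc)

print(sorted(results), sorted(failures))
```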


In [177]:
# one final one to redo
df_trim.loc[3191,'title'] = 'Radio'
info = image_rescraper_title(df_trim.loc[3191, 'poem_url'], df_trim.loc[3191, 'title'])
df_trim.loc[3191,'poem_lines'] = str(info[0])
df_trim.loc[3191,'poem_string'] = info[1]

In [181]:
# re-run destringify
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)

## 💾 SAVE IT!

In [182]:
df_trim.to_csv('data/poetry_foundation_raw_rescrape.csv')
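
One thing to keep in mind when saving to CSV is that list-valued columns such as `poem_lines` come back as strings, which is why `destringify` is needed after reloading. Since `gzip` and `pickle` are already imported at the top of this notebook, a compressed pickle is one possible alternative that preserves the lists. The sketch below uses a toy frame and temporary files; the paths are illustrative.

```python
import gzip
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame({"poem_lines": [["line one", "line two"]], "title": ["T"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl.gz")

    # CSV round trip: the list comes back as its string representation
    toy.to_csv(csv_path)
    from_csv = pd.read_csv(csv_path, index_col=0)

    # gzip + pickle round trip: the list survives as a list
    with gzip.open(pkl_path, "wb") as f:
        pickle.dump(toy, f)
    with gzip.open(pkl_path, "rb") as f:
        from_pkl = pickle.load(f)

print(type(from_csv.loc[0, "poem_lines"]).__name__)
print(type(from_pkl.loc[0, "poem_lines"]).__name__)
```

The trade-off is that pickles are not human-readable and should only be loaded from trusted sources, so keeping the CSV alongside is still reasonable.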

## Next notebook: [NLP, Feature Engineering, and EDA](03_nlp_features_eda.ipynb)

[[go back to the top](#Data-Cleaning)]

- The next notebook includes natural language processing, engineering of features, exploring data, and analyzing data.