## Web scraping with Python

### Requests library
Used to send GET requests and collect the responses.

In [112]:
#!conda install requests

In [203]:
import requests

#### Make the actual request

In [204]:
response = requests.get('https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/?st=list&c=180')

In [205]:
type(response)

requests.models.Response

In [206]:
response.status_code

200

#### Status codes:
- 2xx: successful (200 is the usual "OK")
- 3xx: redirects
- Everything starting with a 4 is an error on the client side (your request)
    - 400: bad request
    - 401: not authorized
    - 404: page not found
- Everything starting with a 5 is an error on the server side
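
As a rough illustration of the classes listed above, the first digit of a status code can be mapped to a category. The helper name `status_class` is hypothetical; `http.HTTPStatus` comes from the standard library.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Return a coarse category for an HTTP status code, based on its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, status_class(code), HTTPStatus(code).phrase)
```

With `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is often more convenient than comparing `status_code` by hand.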

In [207]:
response.url

'https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/?st=list&c=180'
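
The part of the URL after `?` is the query string. The sketch below takes it apart with the standard library. Reading `st=list` as "list view" and `c=180` as "180 items per page" is an assumption about this site, not something the response confirms.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.loc.gov/programs/poetry-and-literature/poet-laureate/'
       'poet-laureate-projects/poetry-180/all-poems/?st=list&c=180')

parts = urlsplit(url)           # split into scheme, host, path, query, fragment
params = parse_qs(parts.query)  # query string -> dict mapping each key to a list of values

print(parts.netloc)
print(params)
```

With `requests`, the same parameters can be passed separately as `requests.get(base_url, params={'st': 'list', 'c': 180})`, where `base_url` is the address without the query string.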

In [208]:
response.text

'<!DOCTYPE html>\n\n\n<html lang="en" class="no-js" prefix="lc: http://loc.gov/#">\n<head>\n\n    \n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width,initial-scale=1"/>\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="version" content="$Revision$"/>\n<meta name="msvalidate.01" content="5C89FB9D99590AB2F55BD95C3A59BD81"/>\n<link title="schema(DC)" rel="schema.dc" href="http://purl.org/dc/elements/1.1/"/>\n<meta name="dc.language" content="eng" />\n<meta name="dc.source" content="Library of Congress, Washington, D.C. 20540 USA" />\n\n\n    <link rel="alternate" type="application/json" href="https://www.loc.gov/programs/poetry-and-literature/poet-laureate/poet-laureate-projects/poetry-180/all-poems/?c=180&amp;fa=partof:poetry+180&amp;fo=json&amp;st=list" />\n\n\n<meta property="fb:admins" content="libraryofcongress"/>\n<meta property="og:site_name" content="The Library of Congress"/>\n<meta property="og:type" content="article" />\n\n\n<meta prope

In [209]:
# Write the HTML to a file so it can be inspected in a proper editor
with open('poems_list.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

#### Identify the structure of the elements we want to extract, here **poem titles** and **links**
- Poem titles: each is preceded by "Poem ###:" (a three-digit number)
- Links: each starts with `rel='http://www.loc.gov/item/poetry-180-`, followed by the three-digit number and the poem's title as a slug

In [120]:
import re

In [121]:
poems_html = response.text

In [122]:
title_pattern = r"Poem \d{3}:\s*([\w \"\?\']*)"  # raw string, so \d and \s reach the regex engine unchanged

In [123]:
poem_titles = re.findall(title_pattern, poems_html)

In [124]:
len(poem_titles)

180

In [125]:
poem_titles

['Introduction to Poetry',
 'The Good Life',
 'Abecedarian Requiring Further Examination of Anglikan Seraphym Subjugation of a Wild Indian Rezervation',
 'Question',
 'Thanks',
 'How Bright It Is',
 '"Do You Have Any Advice For Those of Us Just Starting Out?"',
 'Numbers',
 'The Cord',
 'At the Un',
 'Girls',
 'The Bat',
 'Did I Miss Anything?',
 'Neglect',
 'The Poet',
 'Radio',
 'Bad Day',
 'The Farewell',
 'The Partial Explanation',
 'Dorie Off To Atlanta',
 'Wheels',
 'Remora',
 'Tour',
 'After Us',
 'Domestic Work',
 'Before She Died',
 'Poetry',
 'American Cheese',
 'Advice from the Experts',
 'One Morning',
 'Walking Home',
 'Publication Date',
 'The Meadow',
 'The Summer I Was Sixteen',
 'Hand Shadows',
 'El Florida Room',
 "She Didn't Mean to Do It",
 'Cartoon Physics',
 'Snow',
 'Driving to Town Late to Mail a Letter',
 'Halloween',
 'The Poetry of Bad Weather',
 'The Green One Over There',
 'A Man I Knew',
 'Nights',
 'Grammar',
 'Fault',
 'Thanks For Remembering Us',
 'Beca
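
One entry in the output, `'At the Un'`, looks cut off. A likely cause is that the character class in `title_pattern` has no hyphen, so matching stops at the first `-`. The sketch below reproduces this on a made-up heading: the poem number and full title in `sample` are assumed for illustration, and the `extended` pattern is only one possible fix, since other punctuation may also occur in titles.

```python
import re

# Hypothetical heading text; number and full title are assumptions for the demo.
sample = "Poem 010: At the Un-National Monument Along the Canadian Border"

original = r"Poem \d{3}:\s*([\w \"?']*)"      # same character class as title_pattern
extended = r"Poem \d{3}:\s*([\w \"?',.\-]*)"  # also allows comma, period and hyphen

print(re.findall(original, sample))  # stops at the hyphen
print(re.findall(extended, sample))  # captures the whole title
```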

In [126]:
link_pattern = r"rel='(http://www.loc.gov/item/poetry-180-\d{3}/[\w\-]*)"  # raw string for the regex escapes

In [127]:
poem_links = re.findall(link_pattern, poems_html)

In [128]:
import pandas as pd
df = pd.DataFrame({'title':poem_titles, 'link': poem_links})

In [129]:
df.head()

| | title | link |
|---|---|---|
| 0 | Introduction to Poetry | http://www.loc.gov/item/poetry-180-001/introdu... |
| 1 | The Good Life | http://www.loc.gov/item/poetry-180-002/the-goo... |
| 2 | Abecedarian Requiring Further Examination of A... | http://www.loc.gov/item/poetry-180-003/abeceda... |
| 3 | Question | http://www.loc.gov/item/poetry-180-004/question |
| 4 | Thanks | http://www.loc.gov/item/poetry-180-005/thanks |
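
Because titles and links come from two separate `re.findall` calls, the DataFrame pairs them purely by position. A small sanity check can confirm that each link matches its title. The `slugify` rule below is a guess inferred from the links shown in this notebook, and both helper names are hypothetical.

```python
import re


def slugify(title):
    """Guess at the site's slug rule: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def mismatches(titles, links):
    """Return the (title, link) pairs whose link does not end with the guessed slug."""
    if len(titles) != len(links):
        raise ValueError(f"{len(titles)} titles but {len(links)} links")
    return [(title, link) for title, link in zip(titles, links)
            if not link.endswith("/" + slugify(title))]


titles = ['Introduction to Poetry', 'The Good Life', 'Question']
links = [
    'http://www.loc.gov/item/poetry-180-001/introduction-to-poetry',
    'http://www.loc.gov/item/poetry-180-002/the-good-life',
    'http://www.loc.gov/item/poetry-180-004/question',
]
print(mismatches(titles, links))  # an empty list means every pair lines up
```

Pairs flagged by this check are not necessarily wrong, since the real slug rule may treat apostrophes or quotes differently; they are just candidates for a manual look.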


----

### Warmup Wednesday morning

Go through the links and extract from each page:
- poem text
- author

#### Define and test the pattern on a single site

In [130]:
## Go to the site and inspect the structure of the poem text and the author
poem_links[0]

'http://www.loc.gov/item/poetry-180-001/introduction-to-poetry'

In [131]:
poem_response = requests.get(poem_links[0])

In [132]:
res1 = requests.get('https://www.loc.gov/item/poetry-180-001/introduction-to-poetry')

In [133]:
res1_text=res1.text

In [134]:
with open('poem1.html','w') as file:
    file.write(res1_text)

##### Poem texts

In [135]:
## Hint: use the distinctive HTML tag structure to define the pattern
poem_pattern = r"<pre>([\W\w]*?)</pre>"  # non-greedy, so the match stops at the first </pre>

In [136]:
## With the right pattern you should see the poem's text when running this
re.findall(poem_pattern, res1_text)

["I ask them to take a poem\r\nand hold it up to the light\r\nlike a color slide\r\n                  \r\nor press an ear against its hive.\r\n                \r\nI say drop a mouse into a poem\r\nand watch him probe his way out,\r\nor walk inside the poem's room\r\nand feel the walls for a light switch.\r\n                  \r\nI want them to waterski\r\nacross the surface of a poem\r\nwaving at the author's name on the shore.\r\n                 \r\nBut all they want to do\r\nis tie the poem to a chair with rope\r\nand torture a confession out of it.\r\n                 \r\nThey begin beating it with a hose\r\nto find out what it really means."]
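
The extracted text still carries Windows line endings (`\r\n`) and lines that contain only spaces. A minimal cleanup sketch follows, assuming the goal is plain text with stanza breaks kept as empty lines. The helper name `clean_poem` is hypothetical, and `html.unescape` is included in case a page contains entities such as `&amp;`.

```python
import html


def clean_poem(raw):
    """Decode HTML entities, normalize line endings, and drop trailing spaces per line."""
    text = html.unescape(raw).replace("\r\n", "\n")
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).strip()


raw = ("I ask them to take a poem\r\nand hold it up to the light\r\n"
       "like a color slide\r\n                  \r\n"
       "or press an ear against its hive.")
print(clean_poem(raw))
```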

##### Poem author

In [137]:
author_pattern = r"<p>—([\w \-]*)"  # raw string for the regex escapes

In [138]:
re.findall(author_pattern, poem_response.text)

['Billy Collins']

In [139]:
poem_links[:5]

['http://www.loc.gov/item/poetry-180-001/introduction-to-poetry',
 'http://www.loc.gov/item/poetry-180-002/the-good-life',
 'http://www.loc.gov/item/poetry-180-003/abecedarian-requiring-further-examination-of-anglikan-seraphym-subjugation-of-a-wild-indian-rezervation',
 'http://www.loc.gov/item/poetry-180-004/question',
 'http://www.loc.gov/item/poetry-180-005/thanks']

#### Go through list of links and extract all texts and authors

In [140]:
poems = []
authors = []
for link in poem_links[0:5]:
    poem_response = requests.get(link)
    poems.append(re.findall(poem_pattern, poem_response.text))
    authors.append(re.findall(author_pattern, poem_response.text))

In [141]:
from tqdm import tqdm

In [111]:
poems = []
authors = []
for link in tqdm(poem_links):
    poem_response = requests.get(link)
    poems.append(re.findall(poem_pattern, poem_response.text))
    authors.append(re.findall(author_pattern, poem_response.text))

100%|██████████| 180/180 [03:38<00:00,  1.22s/it]
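
Since `re.findall` returns a list, the loops above build lists of lists, and a page where a pattern does not match silently contributes an empty list. The sketch below is one way to tighten that: it keeps the first match or `None`, and pauses between requests. The names `first_match`, `scrape`, `fetch` and `delay` are hypothetical, and the fake pages stand in for the network so the example runs offline.

```python
import re
import time

POEM_PATTERN = r"<pre>([\W\w]*?)</pre>"
AUTHOR_PATTERN = r"<p>—([\w \-]*)"


def first_match(pattern, html):
    """Return the first captured group, or None when the pattern does not match."""
    found = re.findall(pattern, html)
    return found[0] if found else None


def scrape(links, fetch, delay=1.0):
    """Fetch each link and extract poem and author.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    """
    rows = []
    for link in links:
        page = fetch(link)
        rows.append({
            "link": link,
            "poem": first_match(POEM_PATTERN, page),
            "author": first_match(AUTHOR_PATTERN, page),
        })
        time.sleep(delay)  # pause between requests to go easy on the server
    return rows


# Made-up pages keyed by made-up URLs, used instead of real requests.
fake_pages = {
    "u1": "<p>—Billy Collins</p><pre>line one\r\nline two</pre>",
    "u2": "<html>no poem here</html>",
}
rows = scrape(list(fake_pages), fake_pages.get, delay=0)
print(rows)
```

Against the real site, `fetch` could be `lambda url: requests.get(url, timeout=10).text`, and the result can be passed to `pd.DataFrame(rows)`.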


In [142]:
from bs4 import BeautifulSoup

In [143]:
poem_soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

In [155]:
poem_link=poem_soup.find_all('span',class_="item-description-title")

In [170]:
poem_link[0].a.text

'\n        \n\n        \n            Poem 001:\n        \n\n\t    \n            \n                \n                    \n                    Introduction to Poetry\n                    \n                \n            \n        \n        \n        \n        '
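
The link text is mostly whitespace around two useful pieces. A short sketch of one way to collapse it and split the number from the title follows. The string `raw` is a shortened copy of the output above, and the variable names are illustrative.

```python
import re

raw = ('\n        \n\n        \n            Poem 001:\n        \n\n\t    \n'
       '                    Introduction to Poetry\n                    \n        ')

text = " ".join(raw.split())  # collapse every run of whitespace into a single space
match = re.match(r"Poem (\d{3}):\s*(.*)", text)
number, title = match.groups()

print(text)
print(number, title)
```

Beautiful Soup can do the first step itself: `tag.get_text(" ", strip=True)` joins the text pieces with a space and strips each one.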

In [167]:
link_list = []
for link in poem_link:
    link_list.append(link.a['href'])    

In [169]:
len(link_list)

180

In [188]:
author_name = poem_soup.find_all('li', class_='contributor')

In [190]:
author_name[0]

<li class="contributor">
<strong class="search-results-label">Contributor:</strong>
<span>
                        Collins, Billy
                        </span>
</li>

In [191]:
author = []
for component in author_name:
    author.append(component.span.text)

In [193]:
author[0]  # still padded with whitespace; clean up later, e.g. with a regular expression

'\n                        Collins, Billy\n                        '
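
The padded "Last, First" value can be tidied without a regular expression. A minimal sketch, assuming names contain exactly one comma; anything else is only stripped of extra whitespace. The helper name `normalize_author` is hypothetical.

```python
def normalize_author(raw):
    """Turn '  Collins, Billy  ' into 'Billy Collins'; other shapes are only stripped."""
    name = " ".join(raw.split())  # collapse newlines and repeated spaces
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        return f"{first} {last}"
    return name


print(normalize_author('\n                        Collins, Billy\n                        '))
```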

In [195]:
import os

### Useful things to know when handling files and folders in Python
Some useful functions from the os package (do **import os**):
- list files and folders: **os.listdir(path)**
- create a folder: **os.mkdir(path)**
- check if a file exists: **os.path.isfile(fname)**
- check if a folder exists: **os.path.isdir(path)** (**os.path.exists(path)** is true for files as well as folders)
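
The functions listed above can be tried safely inside a temporary directory. In this sketch the folder and file names are made up, and everything is deleted again when the `with` block ends.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "exercise")     # hypothetical folder name
    os.mkdir(folder)                            # create the folder

    fname = os.path.join(folder, "poem1.html")  # hypothetical file name
    with open(fname, "w", encoding="utf-8") as file:
        file.write("<pre>example</pre>")

    listing = os.listdir(folder)                # names of files and folders inside
    checks = {
        "isfile(file)": os.path.isfile(fname),
        "isfile(folder)": os.path.isfile(folder),  # a folder is not a file
        "isdir(folder)": os.path.isdir(folder),
        "exists(folder)": os.path.exists(folder),
    }

print(listing)
print(checks)
```

Relative paths are resolved against the current working directory, so joining paths with `os.path.join` from a known base avoids surprises.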

In [196]:
pwd

'/Users/lilycheng/Documents/jalapenalty-student-code/04_week/exercise'

In [197]:
os.listdir('/Users/lilycheng/Documents/jalapenalty-student-code/04_week/exercise')

['poems_list.html',
 'poem_list.html',
 'poem1.html',
 'Web_scrapping_poems-Copy1.ipynb',
 'bow_lili.ipynb',
 'my first web.html',
 '.ipynb_checkpoints',
 'bs4_lili.ipynb']

In [202]:
os.path.isfile('exercise')

False