# DSCI 511: Data acquisition and pre-processing<br>Chapter 5: Harvesting Content from the World Wide Web
The internet is an unbelievably large source of information (data). In this chapter we'll be talking
about methods for obtaining data directly from the internet for our own use (as opposed to using
an API to obtain it). There's so much freely available data out there that obtaining data is no
problem at all. What's much more difficult is obtaining data which is useful and easy to work with
(structured). Of course, creating ordered and clean datasets out of chaos is one of the key skills of
a data scientist!

## 5.1 HTML
Modern webpages are implemented using a combination of the JavaScript, CSS, and HTML languages. For the purpose
of harvesting data from a webpage, HTML is the one to focus on. HTML is really just a (semi-)structured means
of expressing the basic layout and content of a webpage (CSS and JavaScript are what take barebones webpages
and make them look "shiny"). The key word here is content.

Since analyzing the HTML code behind a webpage is so essential to understanding its content, it's important to have
at least a basic understanding of the HTML language:

* HTML stands for Hypertext Markup Language
* The code semantically describes the structure of web pages; originally it also described some aspects of appearance, but that is now handled by CSS
* The building blocks of HTML documents or web pages are called HTML elements
* These elements are denoted by tags, which are written using angle brackets (< and >). They represent various structural items such as headings, paragraphs, lists, links, quotes, etc.
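
To make the idea of opening and closing tags concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module to list the tags and text in a small, made-up HTML string. The `TagLogger` class and the sample markup are illustrative assumptions, not part of the course code.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and text chunk in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("text", data.strip()))


# Made-up sample markup: a heading and a paragraph with one bold word
logger = TagLogger()
logger.feed("<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

Each element shows up as an "open" event, its content, and then a matching "close" event, which is the nesting that the rest of this chapter relies on.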

### 5.1.1 HTML Basics
Okay, so maybe that was a tiny bit confusing. It's probably easier to understand by seeing some basic examples.
It's always good to start with Hello World!:

In [1]:
%%html
<html>
Hello World!
</html>

The above `<html>` tag signifies the beginning of an HTML document, and `</html>` denotes the end. The same open/close pattern applies
to most other tags, as well.

For example:

In [2]:
%%html
<html>
Hello <b>World</b>!
</html>

### 5.1.2 Some common HTML tags
Here are some of the most common HTML tags:
* `<i>` makes text italic
* `<u>` underlines text
* `<br>` inserts a line break
* `<body>` defines the document's body
* `<p>` defines a paragraph

Of course, there are a lot more possibilities with HTML other than just this list, but this isn't a course on
HTML. We should just check out some examples and do our best to figure it out as we go along! Web scraping is far from a science, and it takes a bit of creativity and perhaps even luck to get it right.

## 5.2 Beautiful Soup
The Python module we'll be working with to do our web scraping is called _Beautiful Soup_. Let's use it to 
examine an example of a webpage with really basic HTML code backing it!

In [4]:
from bs4 import BeautifulSoup
import urllib.request # We'll still need this to download webpages

# This just downloads the html text with full markup
html_text = urllib.request.urlopen("http://www.example.com/").read()

# We want a markup-interpreted (structured) version of the html using BeautifulSoup:
soup = BeautifulSoup(html_text, 'html.parser')

print(soup)

<!DOCTYPE doctype html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used

### 5.2.1 Beautiful soup objects
Here, printing out our `BeautifulSoup()` object (loaded from the webpage) just exhibits the underlying HTML text, but the module can do so much more! For example, it can be used to search an HTML document for specific tags, and return all the code that is contained inside of them. 
Let's examine the `<title>` and `<head>` tags.
Inside `<title>` is just a short title, but inside `<head>` is the majority of the code. 
In larger HTML documents, it might be really hard to even find the `</head>` tag ending the head section.
BeautifulSoup makes it easy to get it all:

In [5]:
# The head
head = soup.find('head')
print(head)

<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>


In [6]:
# The title
title = soup.find('title')
print(title)

<title>Example Domain</title>


#### 5.2.1.1 BeautifulSoup's objects are nested, like the html they represent 
The objects returned by these find operations are `Tag` objects, which support the same search methods as the `BeautifulSoup()` object itself, so you can continue performing operations on them. For example, since the title is inside the head:

In [7]:
title2 = head.find('title')
print(title2)

<title>Example Domain</title>


And we can display just the text inside the tags of a `BeautifulSoup()` object using its `.text` attribute:

In [8]:
print(title.text)

Example Domain


#### 5.2.1.2 Searching for content in the body
The *meat* of most webpages will be contained in the `<body>` section, which is formulated exactly as you might
expect. Let's take a look:

In [9]:
# The body
body = soup.find('body')
print(body)

<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>


Here we see some more commonly-used tags:
* `<h1>` denotes a heading
* `<p>` is a paragraph
* `<a href="http://www.iana.org/domains/example">` is a link 

If you look closely, you'll notice there's more than one paragraph in the body.  In this case,
running `soup.find` will just return the first instance found:

In [10]:
# First paragraph
para1 = soup.find('p')
print(para1)

<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>


If you'd like to instead return all instances of paragraphs, use `find_all`:

In [11]:
paras = soup.find_all('p')
for para in paras:
    print(para)

<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
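
It helps to know what happens when nothing matches. The sketch below uses a small made-up HTML string (an assumption for illustration, so no download is needed) to contrast the two methods: `find` gives back `None` when there is no match, while `find_all` gives back an empty collection.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with two paragraphs and no table
sample = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

first_p = sample.find("p")              # first match only
all_p = sample.find_all("p")            # every match, in document order
missing = sample.find("table")          # no <table> here, so this is None
none_found = sample.find_all("table")   # empty, so it is safe to loop over

print(first_p.text)
print([p.text for p in all_p])
print(missing is None, len(none_found))

# Guard before chaining; calling a method on None raises AttributeError
rows = missing.find_all("tr") if missing is not None else []
```

Checking for `None` before chaining another `find` is a cheap way to avoid crashes when a page is missing the section you expected.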


#### 5.2.1.3 Examining tag attributes
The attributes of a tag can be examined by using the attribute names as dictionary keys on the tag object, but
first you need to make sure they exist. Here we create some custom paragraph tags, two of which have a 'class' attribute that I've set
to `salutation`. First you can check which attributes a tag has using `.attrs`:

In [12]:
# Make the custom Soup object
new_soup = BeautifulSoup(
    """
    <p class="salutation"> hello </p>
    <p> something in between </p>
    <p class="salutation"> goodbye </p>
    """, 
    'html.parser'
)
new_para1 = new_soup.find('p')
new_para1.attrs

{'class': ['salutation']}

In [13]:
# We can access the class attribute just like a dict
print(new_para1['class'])
#or
print(new_para1.get('class'))

['salutation']
['salutation']


We can use this also to get several tags simultaneously that all have an attribute in common:

In [14]:
paras_with_class = [p for p in new_soup.find_all('p') if p.get('class')]
print(paras_with_class)

[<p class="salutation"> hello </p>, <p class="salutation"> goodbye </p>]


You can search for only tags with a specific attribute value like this (calling the soup object directly is shorthand for `find_all`):

In [15]:
hello_paras = new_soup('p', {'class' : 'salutation'})
print(hello_paras)

# Or, for the most part equivalently:
hello_paras2 = new_soup('p', 'salutation')
print(hello_paras2)

[<p class="salutation"> hello </p>, <p class="salutation"> goodbye </p>]
[<p class="salutation"> hello </p>, <p class="salutation"> goodbye </p>]
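
Another common spelling of the same search is the `class_` keyword argument of `find_all` (with a trailing underscore, because `class` is a reserved word in Python). The sketch below is an illustration on made-up markup; the second class name `formal` is an assumption added to show that a tag with several classes still matches.

```python
from bs4 import BeautifulSoup

# Made-up markup; the last paragraph carries two CSS classes
doc = BeautifulSoup(
    """
    <p class="salutation"> hello </p>
    <p> something in between </p>
    <p class="salutation formal"> goodbye </p>
    """,
    "html.parser",
)

# Matches any <p> that has "salutation" among its classes
by_keyword = doc.find_all("p", class_="salutation")

# get_text(strip=True) removes the surrounding whitespace
texts = [p.get_text(strip=True) for p in by_keyword]
print(texts)
```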


### 5.2.2 Hyperlinks: harvesting relies upon interconnectivity
The page we just looked at is admittedly extremely basic (we have to start somewhere!). Where can we go from here?
Well, one way to grab more data from this page is to follow any links it has and scrape those pages too!
Remember, the `<a>` tag denotes a link. So, we can look for links using the `<a>` tag, then load up the HTML
obtained from requesting the link's URL, and perform the same scraping process!

Note that the `.text` attribute of a hyperlink tag gives the text that shows up for the link in the browser, whereas the value in `href=` is the URL we are looking for.

In [16]:
links = soup.find_all('a')
print(links)
print('\n')
for link in links:
    print(link.text)
    print(link['href'])

[<a href="http://www.iana.org/domains/example">More information...</a>]


More information...
http://www.iana.org/domains/example
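
Not every `<a>` tag carries an `href`, and indexing a missing attribute raises a `KeyError`. One hedged way to guard against this is to pass `href=True` to `find_all`, which keeps only the tags that have the attribute. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: two real links and one anchor without an href
page = BeautifulSoup(
    """
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
    <p><a name="top">An anchor without a URL</a></p>
    <p><a href="/about">About</a></p>
    """,
    "html.parser",
)

# href=True keeps only <a> tags that actually carry an href attribute
link_table = [
    {"text": a.get_text(strip=True), "url": a["href"]}
    for a in page.find_all("a", href=True)
]

for entry in link_table:
    print(entry["text"], "->", entry["url"])
```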


You never know what you're going to wind up with, and many pages are incredibly complicated in design and will be
extremely difficult to scrape (particularly for those of us who are just starting out!). So, it's a good idea to check out any websites you wish to scrape ahead of time, and gauge their likely difficulty. 

#### 5.2.2.1 Understanding what's on the page
We should always look for convenient structure around desirable web content, i.e., data.
The page we've been linked to has a table, so let's take a look at the code and see if we can identify tags that will help us access its content in structured form.

In [17]:
## pull out the hyperlink URL (there's only one)
new_URL = soup.find('a')['href']

## download the linked webpage
new_html_text = urllib.request.urlopen(new_URL).read()

## We want a markup-interpreted (structured) version of the html using BeautifulSoup:
new_soup = BeautifulSoup(new_html_text, 'html.parser')

print(new_soup.find('body'))



<body>
<header>
<div id="header">
<div id="logo">
<a href="/"><img alt="Homepage" src="/_img/2013.1/iana-logo-header.svg"/></a>
</div>
<div class="navigation">
<ul>
<li><a href="/domains">Domains</a></li>
<li><a href="/numbers">Numbers</a></li>
<li><a href="/protocols">Protocols</a></li>
<li><a href="/about">About Us</a></li>
</ul>
</div>
</div>
</header>
<div id="body">
<div id="main_right">
<h1>IANA-managed Reserved Domains</h1>
<p>Certain domains are set aside, and nominally registered to “IANA”, for specific
		policy or technical purposes.</p>
<h2>Example domains</h2>
<p>As described in <a href="/go/rfc2606">RFC 2606</a> and <a href="/go/rfc6761">RFC 6761</a>,
	a number of domains such as <span class="domain label">example.com</span> and <span class="domain label">example.org</span>
	are maintained for documentation purposes. These domains may be used as illustrative
	examples in documents without prior coordination with us. They are 
	not available for registration or transfer.</p

#### 5.2.2.2 Example: extracting a table

Our table is inside of the `<table>` tags and uses a `<thead>` section for the column headers. `<tbody>` indicates the table body, which contains `<tr>` rows made up of `<td>` cells.

Let's pull out the header and subsequent rows and store the data as a dictionary (for JSON serialization), using the column names as dictionary keys and the columns as list values. Note that the second column has both a URL and a name; with some extra work we can save these as separate values in our dictionary.

In [19]:
# search for the `table` tag in the `body`
table = new_soup.find('body').find('table')
# load the header
header = [name.text for name in table.find('thead').find_all('th')]
header.append("URL")
# initialize data with each key/header label having empty list value
data = {name: [] for name in header}

for row in table.find('tbody').find_all("tr"):
    ## grab the text from each cell and place it under the correct dictionary key
    cols = [col.text for col in row.find_all('td')]
    for i, col in enumerate(cols):
        data[header[i]].append(col)
    ## grab the URL from the row's link (there's only one per row)
    URL = row.find('a')['href']
    data["URL"].append(URL)
    
data

{'Domain': ['إختبار',
  'آزمایشی',
  '测试',
  '測試',
  'испытание',
  'परीक्षा',
  'δοκιμή',
  '테스트',
  'טעסט',
  'テスト',
  'பரிட்சை'],
 'Domain (A-label)': ['XN--KGBECHTV',
  'XN--HGBK6AJ7F53BBA',
  'XN--0ZWM56D',
  'XN--G6W251D',
  'XN--80AKHBYKNJ4F',
  'XN--11B5BS3A9AJ6G',
  'XN--JXALPDLP',
  'XN--9T4B11YI5A',
  'XN--DEBA0AD',
  'XN--ZCKZAH',
  'XN--HLCJ6AYA9ESC7A'],
 'Language': ['Arabic',
  'Persian',
  'Chinese',
  'Chinese',
  'Russian',
  'Hindi',
  'Greek, Modern (1453-)',
  'Korean',
  'Yiddish',
  'Japanese',
  'Tamil'],
 'Script': ['Arabic',
  'Arabic',
  'Han (Simplified variant)',
  'Han (Traditional variant)',
  'Cyrillic',
  'Devanagari (Nagari)',
  'Greek',
  'Hangul (Hangŭl, Hangeul)',
  'Hebrew',
  'Katakana',
  'Tamil'],
 'URL': ['/domains/root/db/xn--kgbechtv.html',
  '/domains/root/db/xn--hgbk6aj7f53bba.html',
  '/domains/root/db/xn--0zwm56d.html',
  '/domains/root/db/xn--g6w251d.html',
  '/domains/root/db/xn--80akhbyknj4f.html',
  '/domains/root/db/xn--11b5bs3a9aj6g
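
The dictionary above is column-oriented: one list per column. For JSON output it is sometimes handier to have one record per table row. The following sketch shows one way to convert between the two using only the standard library; the `columns` dictionary is a two-row excerpt typed in by hand from the output above, not a fresh download.

```python
import json

# A two-row excerpt in the same column-oriented layout as `data`
columns = {
    "Language": ["Arabic", "Persian"],
    "Script": ["Arabic", "Arabic"],
    "URL": [
        "/domains/root/db/xn--kgbechtv.html",
        "/domains/root/db/xn--hgbk6aj7f53bba.html",
    ],
}

# zip(*values) walks the columns in parallel, yielding one tuple per table row
records = [dict(zip(columns, row)) for row in zip(*columns.values())]

# ensure_ascii=False keeps non-Latin text readable in the JSON output
as_json = json.dumps(records, ensure_ascii=False, indent=2)
print(as_json)
```

Either layout serializes fine; the row-oriented form is simply easier to filter or append to one entry at a time.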

#### 5.2.2.3 Making sure the data are useful
On many sites, including this one, URLs are relative, so if we want them to be useful to someone out of context, we should make them absolute by adding the site root as a prefix: 

- `https://www.iana.org`

In [20]:
## loop over the enumerated URLs and prepend the URL root
## (each relative URL already starts with "/", so the root has no trailing slash)
for i, url in enumerate(data['URL']):
    data['URL'][i] = "https://www.iana.org" + url
    
data['URL']

['https://www.iana.org/domains/root/db/xn--kgbechtv.html',
 'https://www.iana.org/domains/root/db/xn--hgbk6aj7f53bba.html',
 'https://www.iana.org/domains/root/db/xn--0zwm56d.html',
 'https://www.iana.org/domains/root/db/xn--g6w251d.html',
 'https://www.iana.org/domains/root/db/xn--80akhbyknj4f.html',
 'https://www.iana.org/domains/root/db/xn--11b5bs3a9aj6g.html',
 'https://www.iana.org/domains/root/db/xn--jxalpdlp.html',
 'https://www.iana.org/domains/root/db/xn--9t4b11yi5a.html',
 'https://www.iana.org/domains/root/db/xn--deba0ad.html',
 'https://www.iana.org/domains/root/db/xn--zckzah.html',
 'https://www.iana.org/domains/root/db/xn--hlcj6aya9esc7a.html']
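
Plain string concatenation is sensitive to whether the base ends with a slash and the path starts with one. A more forgiving option, sketched here as an alternative rather than as the method used above, is `urljoin` from the standard library's `urllib.parse`. The two relative paths are copied from the table output; the `example.com` URL is an assumption for illustration.

```python
from urllib.parse import urljoin

base = "https://www.iana.org/"
relative_urls = [
    "/domains/root/db/xn--kgbechtv.html",
    "/domains/root/db/xn--0zwm56d.html",
]

# urljoin resolves each path against the base without doubling the slash
absolute_urls = [urljoin(base, rel) for rel in relative_urls]
for url in absolute_urls:
    print(url)

# A URL that is already absolute is returned unchanged
already_absolute = urljoin(base, "https://www.example.com/page.html")
print(already_absolute)
```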

## 5.3 Extended example: a personal copy of a song lyrics collection
Here's a fun challenge: can we go to one of those big song lyrics websites and grab everything they've got? Yes, but there are a number of challenges that go beyond control flow and syntax stuff:

+ Understanding the terms of use
+ Determining website (page-page) structure
+ Strategizing how to cover the whole site
+ Deciding which data to look for
+ Finding out where that data is stored
+ Determining a good data structure (schema)
+ Determining a storage structure (files and directories)

### 5.3.1 An initial look
We're gonna go for songlyrics.com, but let's start by taking a look at the terms and conditions. For fun, we'll take a look with Beautiful Soup. What a mess...


In [22]:
import requests, re
html = requests.get("http://www.songlyrics.com/termsconditions.php").text
soup = BeautifulSoup(html, 'html.parser')
print(html)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">
<head>
	<title>Terms and Conditions | Collection of Song Lyrics at SongLyrics.com</title>
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
	<meta name="description" content="" />
	
	<meta name="msvalidate.01" content="CF28C9C2E5FDBD5C7CA7F6FE394BD121" />
        	<meta name="robots" content="noydir, noodp" />
        	<meta property="fb:admins" content="31113169,100000933112467,1383859676" />
	<meta name="fb:app_id" content="1418134321780018" />
	<meta name="viewport" content="width=device-width, initial-scale=1"> <!-- viewport -->
	
	<link rel="preconnect" href="http://www.songlyrics.com" />
        <link rel="dns-prefetch" href="//cdn2.songlyricscom.netdna-cdn.com" />
        <link rel="dns-prefetch" href="//cdn1.songlyricscom.netdna-cdn.com" />
        <link rel="dns-prefetch" href="

#### 5.3.1.1 Finding the structure
Let's see if we can get a hint by looking at the div class (css) names. The 'maintext' class might be a good one.

In [23]:
for div in soup.body.find_all('div'):
    print(div.get('class'))

None
None
None
['headinner']
None
None
None
None
None
None
['headinner']
None
['submit-btn']
['headinner']
None
['submit-btn']
None
None
None
['masthead']
['wrapper-inner']
['topnav']
['pw_widget', 'pw_size_24', 'pw_post_false', 'pw_counter_true']
['colone-wide']
['maintext', 'padder']
['colthree']
['adblock']
['box']
['listbox']
None
['footer']
None


#### 5.3.1.2 Getting to the point
In the maintext class, there are headers (`'h1'`, `'h2'`, etc.) and paragraphs (`'p'`) that contain the content (this is common), and to get them both in the right order, we'll need to be flexible. BeautifulSoup can do this with some regular expressions using `re.compile()`. Here's a simple one that says 
>Take any tag whose name _contains_ an `'h'` _or_ `'p'`.

Notice in the "Use of Material" section that it says I can download a single copy of the material on the website! Yay, this means I can take one copy?

In [24]:
for div in soup.body.find_all('div'):
    if div.get('class') is not None:
        if div['class'][0] == 'maintext':
            for element in div.find_all(re.compile('h|p')):
                print(element.text)
                print("")

Terms and Conditions

This page contains the "Terms and Conditions" under which you may use www.SongLyrics.com. Please read this page carefully. If you do not accept the Terms and Conditions stated here, do not use this web site and service. By using this web site, you are indicating your acceptance to be bound by the terms of these Terms and Conditions. SongLyrics.com, (the "Company") owner of SongLyrics.com.com, may revise these Terms and Conditions at any time by updating this posting. You should visit this page periodically to review the Terms and Conditions, because they are binding on you and may change without notice. The terms "You" and "User" as used herein refer to all individuals and/or entities accessing this web site for any reason.

Use of Material. 

The Company authorizes you to view and download a single copy of the material on SongLyrics.com.com (the "Web Site") solely for your personal, noncommercial use. 

The contents of this Web Site, such as text, graphics, image
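
Because Beautiful Soup tests a compiled pattern against each tag name with the pattern's `search()` method, the loose pattern `'h|p'` also matches names such as `span` or `th`. The sketch below compares it with a stricter, anchored pattern on a hand-picked list of tag names; the stricter pattern is a suggestion, not something the original cell uses.

```python
import re

# Hand-picked tag names to test against
tag_names = ["h1", "h2", "p", "span", "ul", "th", "html", "div"]

loose = re.compile("h|p")             # the pattern used above
strict = re.compile(r"^(h[1-6]|p)$")  # only headings h1-h6 and paragraphs

loose_matches = [name for name in tag_names if loose.search(name)]
strict_matches = [name for name in tag_names if strict.search(name)]

print("loose: ", loose_matches)
print("strict:", strict_matches)
```

On this terms page the loose pattern happens to work, but on a page with tables or spans inside the target div the anchored version would avoid printing the same text twice.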

### 5.3.2 Finding site-level structure
SongLyrics stores their song lyrics in a very nice A-Z artists index right off of their main URL, like:
- `http://www.songlyrics.com/[<letter>]/`
    
So, we'll play around with the letter `'a'`. Let's look at the HTML for this. Of course, they don't just present ALL artists; the information is paginated! So, we'll have to jump into each of the subdirectories for this letter, but first we need to gather these links.

In [26]:
html = requests.get("http://www.songlyrics.com/a/").text
soup = BeautifulSoup(html, 'html.parser')
print(html)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">
<head>
	<title>A Artist Song Lyrics</title>
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
	<meta name="description" content="Song lyrics for artists that start with the letter A." />
	
	<meta name="msvalidate.01" content="CF28C9C2E5FDBD5C7CA7F6FE394BD121" />
        	<meta name="robots" content="noydir, noodp" />
        	<meta property="fb:admins" content="31113169,100000933112467,1383859676" />
	<meta name="fb:app_id" content="1418134321780018" />
	<meta name="viewport" content="width=device-width, initial-scale=1"> <!-- viewport -->
	
	<link rel="preconnect" href="http://www.songlyrics.com" />
        <link rel="dns-prefetch" href="//cdn2.songlyricscom.netdna-cdn.com" />
        <link rel="dns-prefetch" href="//cdn1.songlyricscom.netdna-cdn.com" />
        <link rel="dns-prefetch"

#### 5.3.2.1 Regular expressions can navigate content
It's easy enough to get links, but we just want the ones with a title of the form "Page #". Using the `re` module once again, this time on the `title` attribute of each link, gets us what we want. Also, since there are a few artists on the landing page for each letter, we need to include it, too.

In [27]:
pages = ["/a/"]
for x in soup.find_all('a'):
    if x.get("href") is not None and re.search(r"^Page \d+$", x.get("title", "NOTITLE")):
        print(x['href'])
        pages.append(x['href'])

/a/1/
/a/2/
/a/3/
/a/4/
/a/5/
/a/6/
/a/7/
/a/8/
/a/9/
/a/10/
/a/11/
/a/12/
/a/13/
/a/14/
/a/15/
/a/16/
/a/17/
/a/18/
/a/19/
/a/20/
/a/21/
/a/22/
/a/23/
/a/24/
/a/25/
/a/26/
/a/27/
/a/28/
/a/29/
/a/30/
/a/31/
/a/32/
/a/33/
/a/34/
/a/35/
/a/36/
/a/37/
/a/38/
/a/39/
/a/40/
/a/41/
/a/42/
/a/43/
/a/44/
/a/45/
/a/46/
/a/47/
/a/48/
/a/49/
/a/50/
/a/51/
/a/52/
/a/53/
/a/54/
/a/55/
/a/56/
/a/57/
/a/58/
/a/59/
/a/60/
/a/61/
/a/62/
/a/63/
/a/64/
/a/65/
/a/66/
/a/67/
/a/68/
/a/69/
/a/70/
/a/71/
/a/72/
/a/73/
/a/74/
/a/75/
/a/76/
/a/77/
/a/78/
/a/79/
/a/80/
/a/81/
/a/82/
/a/83/
/a/84/
/a/85/
/a/86/
/a/87/
/a/88/
/a/89/
/a/90/
/a/91/
/a/92/
/a/93/
/a/94/
/a/95/
/a/96/
/a/97/
/a/98/
/a/99/
/a/100/
/a/101/
/a/102/
/a/103/
/a/104/
/a/105/
/a/106/
/a/107/
/a/108/
/a/109/
/a/110/
/a/111/
/a/112/
/a/113/
/a/114/
/a/115/
/a/116/
/a/117/
/a/118/
/a/119/
/a/120/
/a/121/
/a/122/
/a/123/
/a/124/
/a/125/
/a/126/
/a/127/
/a/128/
/a/129/
/a/130/
/a/131/
/a/132/
/a/133/
/a/134/
/a/135/
/a/136/
/a/137/
/a/138/
/a/1
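
To see what the "Page #" pattern accepts and rejects, here is a small standalone check. The list of candidate titles is made up for illustration; only the pattern itself comes from the cell above. Writing the pattern as a raw string (`r"..."`) keeps the `\d` escape intact.

```python
import re

# ^ and $ anchor the match to the whole title; \d+ means one or more digits
page_title = re.compile(r"^Page \d+$")

# Made-up candidate titles, including the "NOTITLE" fallback value
titles = ["Page 1", "Page 42", "Next Page 3", "Page 3 of 9", "Page", "NOTITLE"]
matches = [t for t in titles if page_title.search(t)]

print(matches)
```

The fallback string `"NOTITLE"` never matches, which is why it is a safe default for links that have no `title` attribute.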

#### 5.3.2.2 Following the site's directory tree
Now that we've got the pages, it's time to put URLs together off of the main site. Notice the string concatenation with `+` in the `requests.get()` call, which joins the site root to each relative page path. From there, we can see what links are in the first page! Looks like the listing is there again, but the artists' pages follow the full-URL pattern:

+ http://www.songlyrics.com/ARTISTNAME-lyrics/


In [28]:
for page in pages:
    html = requests.get("http://www.songlyrics.com"+page).text
    soup = BeautifulSoup(html, 'html.parser')
    for x in soup.find_all('a'):
        if x.get("href") is not None:
            print(x['href'])
    break

/
/top-songs-lyrics.html
/top100.php
/top-upcoming-songs.html
/latestAddedSongs
/news/top-songs/2011/
/news/top-songs/2010/
/news/top-songs/2009/
/news/top-songs/all-time/
/top-artists-lyrics.html
/top-artists-lyrics.html
/a/
/top-albums-lyrics.html
/top-upcoming-albums.html
/adele-lyrics/
/rihanna-lyrics/
/katy-perry-lyrics/
/lady-gaga-lyrics/
/lil-wayne-lyrics/
/musicgenres.php
/rock-lyrics.php
/r-and-b-lyrics.php
/country-music-lyrics.php
/hip-hop-rap-lyrics.php
/pop-lyrics.php
/christian-lyrics.php
/dance-lyrics.php
/latin-lyrics.php
/musicgenres.php
/news/
/news/
/news/category/news-roundup/
/news/album-reviews/
/news/album-reviews/
/news/category/song-reviews/
/news/category/spotlight/
/member-login.php
/member-register.php
https://www.facebook.com/SongLyrics
/news/advertise/
/news/submit-lyrics/
#nav
#
/
/top-songs-lyrics.html
/top100.php
/top-upcoming-songs.html
/latestAddedSongs
/news/top-songs/2011/
/news/top-songs/2010/
/news/top-songs/2009/
/news/top-songs/all-time/
/top-ar

#### 5.3.2.3 Filtering for full URLs
Notice also that artists' URLs end with `-lyrics/`. We'll use this regularity too. Also, since we are digging into the data for a given artist, it's time to create our data structure. Each artist will be a separate data dictionary, and we'll start by storing their name and the URL for their main page. For an individual artist, the data schema will be:

```
{
    "Artist": name,
    "url": artist-url,
    "Songs": {
        title1: {
            "Title": title1,
            "url": title1-url,
            "Lyrics": title1-lyrics,
            "Artist": title1-artist,
            "Genre": title1-genre,
            "Album": title1-album,
            ...
        },
        ...
    }
}
```

where as it turns out, we will be able to store all additional meta-data attributes that are present, like "Genre", "Album", "Note", etc.
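
As a concrete instance of this schema, the sketch below fills in one artist record by hand and round-trips it through JSON. The artist name, song title, URLs, and the `add_song` helper are all hypothetical placeholders for illustration; nothing here is scraped.

```python
import json

# Hypothetical artist record following the schema above
artist = {
    "Artist": "Example Artist",
    "url": "http://www.songlyrics.com/example-artist-lyrics/",
    "Songs": {},
}


def add_song(artist_record, title, url, **metadata):
    """Store one song under its title, merging in any extra metadata."""
    song = {"Title": title, "url": url}
    song.update(metadata)
    artist_record["Songs"][title] = song
    return song


# Hypothetical song with two optional metadata attributes
add_song(
    artist,
    "Example Song",
    "http://www.songlyrics.com/example-artist/example-song-lyrics/",
    Genre="Rock",
    Album="Miscellaneous",
)

# The record contains only dicts and strings, so it survives a JSON round trip
restored = json.loads(json.dumps(artist))
print(json.dumps(artist, indent=2))
```

Keying songs by title keeps lookups simple, at the cost of overwriting an earlier entry if an artist has two songs with the same title.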

In [30]:
for page in pages:
    html = requests.get("http://www.songlyrics.com"+page).text
    soup = BeautifulSoup(html, 'html.parser')
    for x in soup.find_all('a'):
        if re.search("^http://.*?-lyrics/$",x.get("href", "NOLINK")):
            data = {
                "Artist": x.text,
                "url": x['href'],
                "Songs": {}
            }
            break
    break
print(data["url"])

http://www.songlyrics.com/a--lyrics/


#### 5.3.2.4 Links for an artist
Now it's time to find the songs for the individual artists. These are once again the links, so let's see. It looks like the song links tend to have the `itemprop` attribute, but is this enough to cleanly filter them?

In [31]:
html = requests.get(data["url"]).text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link)

<a href="/">Lyrics</a>
<a href="/top-songs-lyrics.html">Popular Song Lyrics</a>
<a href="/top100.php">Billboard Hot 100</a>
<a href="/top-upcoming-songs.html">Upcoming Lyrics</a>
<a href="/latestAddedSongs">Recently Added</a>
<a href="/news/top-songs/2011/">Top Lyrics of 2011</a>
<a href="/news/top-songs/2010/">Top Lyrics of 2010</a>
<a href="/news/top-songs/2009/">Top Lyrics of 2009</a>
<a href="/news/top-songs/all-time/">More »</a>
<a href="/top-artists-lyrics.html">Artists</a>
<a href="/top-artists-lyrics.html">Popular Artists</a>
<a href="/a/">Artists A-Z</a>
<a href="/top-albums-lyrics.html">Popular Albums</a>
<a href="/top-upcoming-albums.html">Upcoming Albums</a>
<a href="/adele-lyrics/">Adele</a>
<a href="/rihanna-lyrics/">Rihanna</a>
<a href="/katy-perry-lyrics/">Katy Perry</a>
<a href="/lady-gaga-lyrics/">Lady Gaga</a>
<a href="/lil-wayne-lyrics/">Lil Wayne</a>
<a href="/musicgenres.php">Genres</a>
<a href="/rock-lyrics.php" title="Rock Lyrics">Rock</a>
<a href="/r-and-b-lyri

#### 5.3.2.5 Filtering for the songs

Filtering for songs by the `itemprop` attribute turns out not to be good enough. Several other links have this attribute, specifically the breadcrumb links for the two steps back in the directory structure.

In [32]:
for link in soup.find_all('a'):
    if link.get("itemprop") is not None:
        print(link)

<a href="http://www.songlyrics.com/" itemprop="item"><span itemprop="name">Song Lyrics</span></a>
<a href="http://www.songlyrics.com/a/" itemprop="item"><span itemprop="name">Artists - A</span></a>
<a href="http://www.songlyrics.com/a/sing-a-long-lyrics/" itemprop="url" title="Sing-A-Long Lyrics A">Sing-A-Long</a>
<a href="http://www.songlyrics.com/a/hi-fi-serious-lyrics/" itemprop="url" title="Hi-Fi Serious Lyrics A">Hi-Fi Serious</a>
<a href="http://www.songlyrics.com/a/pacific-ocean-blue-lyrics/" itemprop="url" title="Pacific Ocean Blue Lyrics A">Pacific Ocean Blue</a>
<a href="http://www.songlyrics.com/a/shut-yer-face-lyrics/" itemprop="url" title="Shut Yer Face Lyrics A">Shut Yer Face</a>
<a href="http://www.songlyrics.com/a/bad-idea-lyrics/" itemprop="url" title="Bad Idea Lyrics A">Bad Idea</a>
<a href="http://www.songlyrics.com/a/going-down-lyrics/" itemprop="url" title="Going Down Lyrics A">Going Down</a>
<a href="http://www.songlyrics.com/a/something-s-going-on-lyrics/" itempr

#### 5.3.2.6 Honing in on the songs

Since song links also have a `title` attribute, we can additionally require one to get just the relevant links.

In [33]:
for link in soup.find_all('a'):
    if link.get("itemprop", "NOITEMPROP") == "url" and link.get("title") is not None:        
        data["Songs"][link.text] = {"Title": link.text}
        data["Songs"][link.text]["url"] = link['href']
        print(link['href'], link.text)

http://www.songlyrics.com/a/sing-a-long-lyrics/ Sing-A-Long
http://www.songlyrics.com/a/hi-fi-serious-lyrics/ Hi-Fi Serious
http://www.songlyrics.com/a/pacific-ocean-blue-lyrics/ Pacific Ocean Blue
http://www.songlyrics.com/a/shut-yer-face-lyrics/ Shut Yer Face
http://www.songlyrics.com/a/bad-idea-lyrics/ Bad Idea
http://www.songlyrics.com/a/going-down-lyrics/ Going Down
http://www.songlyrics.com/a/something-s-going-on-lyrics/ Something's Going On
http://www.songlyrics.com/a/starbucks-lyrics/ Starbucks
http://www.songlyrics.com/a/a-z-lyrics/ A+Z
http://www.songlyrics.com/a/ag-clubstar-you-me-tonight-lyrics/ AG & Clubstar - You & Me (Tonight)
http://www.songlyrics.com/a/owner-of-a-lonely-heart-lyrics/ Owner Of A Lonely Heart
http://www.songlyrics.com/a/6-o-clock-on-a-tube-stop-lyrics/ 6 O'Clock On a Tube Stop
http://www.songlyrics.com/a/the-distance-lyrics/ The Distance
http://www.songlyrics.com/a/a-lyrics/ "A"
http://www.songlyrics.com/a/old-folks-lyrics/ Old Folks
http://www.songlyric
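
The same two conditions can also be handed to `find_all` directly through its `attrs` argument, where a value of `True` means "the attribute must be present". This is a sketch of an equivalent filter on a hand-written three-link snippet modeled on the output above, not a replacement for the cell.

```python
from bs4 import BeautifulSoup

# Hand-written snippet: a breadcrumb link, a song link, and a menu link
snippet = BeautifulSoup(
    """
    <a href="http://www.songlyrics.com/a/" itemprop="item">Artists - A</a>
    <a href="http://www.songlyrics.com/a/bad-idea-lyrics/" itemprop="url"
       title="Bad Idea Lyrics A">Bad Idea</a>
    <a href="/top100.php">Billboard Hot 100</a>
    """,
    "html.parser",
)

# itemprop must equal "url"; title=True means "has a title attribute at all"
song_links = snippet.find_all("a", attrs={"itemprop": "url", "title": True})
songs = {a.text: {"Title": a.text, "url": a["href"]} for a in song_links}

print(songs)
```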

#### 5.3.2.7 Finally, the data
Now that we have the song links, let's see if we can find our target data, the song lyrics, along with anything else that's good, like the genre and album. Going straight for the body, this is still a mess.

In [34]:
for title in data["Songs"]:
    html = requests.get(data["Songs"][title]["url"]).text
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find('body'))
    break

<body>
<div id="fb-root"></div>
<div id="header">
<div id="headernav">
<div class="headinner">
<ul class="menu floatleft">
<li class="mega">
<h2><a href="/">Lyrics</a></h2>
<div>
<p><a href="/top-songs-lyrics.html">Popular Song Lyrics</a></p>
<p><a href="/top100.php">Billboard Hot 100</a></p>
<p><a href="/top-upcoming-songs.html">Upcoming Lyrics</a></p>
<p><a href="/latestAddedSongs">Recently Added</a></p>
<p class="menu-hr"></p>
<p><a href="/news/top-songs/2011/">Top Lyrics of 2011</a></p>
<p><a href="/news/top-songs/2010/">Top Lyrics of 2010</a></p>
<p><a href="/news/top-songs/2009/">Top Lyrics of 2009</a></p>
<p><a href="/news/top-songs/all-time/">More »</a></p>
</div>
</li>
<li class="mega">
<h2><a href="/top-artists-lyrics.html">Artists</a></h2>
<div>
<p><a href="/top-artists-lyrics.html">Popular Artists</a></p>
<p><a href="/a/">Artists A-Z</a></p>
<p class="menu-hr"></p>
<p><a href="/top-albums-lyrics.html">Popular Albums</a></p>
<p><a href="/top-upcoming-albums.html">Upcoming Al

#### 5.3.2.8 Extracting the data
Scanning closely (or matching against the rendered website's text), we can see that the actual lyrics are inside of a div with the `songLyricsDiv-outer` id. That's easy to collect, since we just want the text from this div (and there's only one). Getting the meta-data, e.g., album, is a bit more complicated. The meta-data comes just under the `h1` title as links between `p` tags. Notice additionally that meta-data attributes come in the text format `key: value`. So, filtering `p` tag text for colon-space, ": ", we get what we want. However, these entries are much more useful if we actually parse them into keys and values, e.g., transform "Genre: Rock" into a Python dictionary `{"Genre": "Rock"}`. This is done with `re.split()`: we split on ": ", take the first piece as the key, and re-join the remaining pieces so that a value containing ": " stays intact. Any meta-data attributes and the lyrics themselves are then stored in the artist's data dictionary, and we're done with the song! Phew!

In [35]:
for title in data["Songs"]:
    html = requests.get(data["Songs"][title]["url"]).text
    soup = BeautifulSoup(html, 'html.parser')
    for par in soup.find_all("p"):
        if re.search(": ", par.text):
            pieces = re.split(": ", par.text)
            key = pieces[0]
            value = ": ".join(pieces[1:len(pieces)])
            data["Songs"][title][key] = value    
    for div in soup.find('body').find_all('div'):
        if div.get("id","NOCLASS") == "songLyricsDiv-outer":
            data["Songs"][title]["Lyrics"]=div.text
    for key in data["Songs"][title]:
        print(key+": ", data["Songs"][title][key])
        print("")
    break

Title:  Sing-A-Long

url:  http://www.songlyrics.com/a/sing-a-long-lyrics/

Artist:  A

Album:  Miscellaneous

Genre:  Rock

Note:  When you embed the widget in your site, it will match your site's styles (CSS). This is just a preview!

Lyrics:  
Everybody in the building
Has to sing the song
All the boys and all the girlfriends
Sing the sing-a-long

Think I'm in trouble
There's always a couple
Around me wherever I go

They're out there to bug me
I don't think it's funny
Everybody's laughing at me, yeah

I wanna go out
But there's no one about
All my friends
Want a quiet one at home

The same age as me and
They're husbands to be, yeah
Everybody in the building

Everybody in the building
Has to sing the song
All the boys and all the girlfriends
Sing the sing-a-long

I make a move here
And I make a move there
I've got millions of things
On the go

I write me the best lines
They're too corny sometimes
Everybody's laughing at me

Everybody in the building
Has to sing the song
All the boys 
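
The split-then-rejoin step for `key: value` text can also be written with the built-in `str.partition`, which cuts a string at the first occurrence of a separator only. The helper name `parse_metadata` and the sample lines are assumptions for illustration; this is a sketch of an alternative, not the code used above.

```python
def parse_metadata(line):
    """Split 'Key: value' at the first ': ' only; return None if absent."""
    key, sep, value = line.partition(": ")
    if not sep:  # the separator was not found
        return None
    return key.strip(), value.strip()


# Made-up sample lines, one with a second ": " inside the value
examples = ["Genre: Rock", "Note: Preview: styles may differ", "No metadata here"]
parsed = [parse_metadata(line) for line in examples]

for line, result in zip(examples, parsed):
    print(repr(line), "->", result)
```

Because only the first ": " is used as the boundary, a value that itself contains a colon and a space is kept whole.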

### 5.3.3 Exercise: scraping structured content from a personal website
Do you have a personal website, whose terms of use are not an obstacle for scraping? If your website has any structured content, e.g., photos, documents, blog posts, write a web scraper to collect this content systematically in a convenient data structure. If you don't have a website that is convenient for this, review this website:
- http://www.pages.drexel.edu/~jw3477/

Can you scrape the publications list? Specifically, attempt to scrape a database of:
- titles
- author lists
- cover photos
- captions/abstracts
- full text hyperlinks
