### Web page scraping
The code below scrapes a set of web pages to collect the titles and lyrics of songs  
by the Talking Heads. The task has two parts. First, an index page is scraped  
to find the links to the song pages. Then those URLs are used to fetch each song's title and lyrics.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import codecs
import re

#### Scrape index page
We start by opening the page that links to all Talking Heads songs on the site.  
After that page is parsed, every "href" value is collected in a list.

In [None]:
data_links = urlopen("http://www.allthelyrics.com/lyrics/talking_heads")
bsobj_links = bs(data_links.read(), 'html.parser')

In [None]:
# get the links -- href data to list
# href=True skips <a> tags without an href, which would otherwise add None
page_links = []
for link in bsobj_links.find_all('a', href=True):
    page_links.append(link.get('href'))
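
An `<a>` tag without an `href` attribute yields `None` from `.get('href')`, and a `None` in the list makes a later `str.join` fail. The sketch below shows the difference between collecting every `<a>` tag and collecting only those that carry an `href`. It uses a small made-up HTML snippet, not the real index page, and the file names in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the index page (an assumption for illustration).
sample_html = """
<a name="top">Top</a>
<a href="/lyrics/talking_heads/song_one.html">Song One</a>
<a href="/about.html">About</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Every <a> tag: the named anchor has no href, so .get() gives None.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# Only <a> tags that actually carry an href attribute.
real_hrefs = [a.get("href") for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```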

#### Getting page names
The list of page links is then joined into a single comma-separated string.  
A regular expression extracts only the links to song pages.  
Because applying a regular expression returns the matches as a list,  
that list is joined into another string, and a second regular expression extracts the  
page name, including the "html" file type.

In [None]:
# join the hrefs into one string and extract the song-page links
# \S*? is non-greedy and stops at whitespace, so each link is matched separately
string = ' , '.join(page_links)
page_list = re.findall(r'lyrics/talking_heads/\S*?\.html', string)

In [None]:
# join the matches into one string and extract the song page names
# to be used for individual page scraping
# (raw string so that \. is not treated as a string escape)
page_strings = ','.join(page_list)
page = re.findall(r'[A-Za-z0-9_-]*\.html', page_strings)
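
The two-step extraction is easier to follow on a small example. The sketch below runs it on made-up href values (the file names are invented). A greedy `.*` in the first pattern matches from the first song link through the last `html` in the string, so unrelated pages can slip through. The non-greedy `\S*?` used here is one way to get a separate match per link.

```python
import re

# Made-up href values standing in for the scraped list (illustration only).
sample_links = [
    "/index.html",
    "/lyrics/talking_heads/song_one.html",
    "/lyrics/talking_heads/song_two.html",
    "/contact.html",
]

joined = " , ".join(sample_links)

# Greedy version: a single long match that runs to the last "html".
greedy_span = re.findall(r"lyrics/talking_heads/.*html", joined)

# Non-greedy version: one match per song link.
song_paths = re.findall(r"lyrics/talking_heads/\S*?\.html", joined)

# Second step: pull the bare file name out of each path.
page_names = re.findall(r"[A-Za-z0-9_-]*\.html", ",".join(song_paths))

print(greedy_span)
print(song_paths)
print(page_names)
```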

#### Scraping song title and lyrics
Now that we have a list of page names, I create two empty lists that  
will hold the song titles and song lyrics as they are scraped from  
the pages. I then define a function that opens a page, scrapes the text,  
and appends the results to the appropriate list.

In [None]:
song_title = []
song_lyric = []

In [None]:
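# function to scrape the pages
def getWords(page):
    """Fetch one song page and append its title and lyrics to the global lists."""
    data = urlopen('http://www.allthelyrics.com/lyrics/talking_heads/' + page)
    bsobj = bs(data.read(), 'html.parser')

    # the song title is in the element(s) with class "page-title"
    for item in bsobj.find_all(class_='page-title'):
        song_title.append(item.get_text())

    # the lyrics are in the element(s) with class "content-text-inner"
    for item in bsobj.find_all(class_='content-text-inner'):
        song_lyric.append(item.get_text())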
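
Because `getWords` appends to module-level lists and fetches from the network, it is awkward to test offline. As a sketch of an alternative (not the approach used in this notebook), the parsing step can be split into a function that takes HTML text and returns the results. The name `parse_song_page` and the sample HTML are hypothetical; the class names `page-title` and `content-text-inner` are the ones used above.

```python
from bs4 import BeautifulSoup

def parse_song_page(html):
    """Return (titles, lyrics) found in one song page's HTML text."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [tag.get_text() for tag in soup.find_all(class_="page-title")]
    lyrics = [tag.get_text() for tag in soup.find_all(class_="content-text-inner")]
    return titles, lyrics

# Made-up page fragment that reuses the class names from the real pages.
sample_page = """
<h1 class="page-title">Example Song Lyrics</h1>
<div class="content-text-inner"><p>First line</p><p>Second line</p></div>
"""

titles, lyrics = parse_song_page(sample_page)
print(titles)
print(lyrics)
```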


#### Scraping song title and lyrics part 2
The loop below calls the scraping function on each page to capture the desired text.  
Each list is then joined into a single string, with a pipe "|" as the delimiter because that  
character is unlikely to appear in the titles or lyrics. Finally, those strings  
are written to text files so they can be stored in a database and used in analysis.

In [None]:
# loop through all pages and apply the scraping function
for i in page:
    getWords(i)
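
A plain loop like this stops at the first page that fails to load. A more forgiving variant, sketched here under assumptions, records the failures and pauses between requests. `scrape_all`, its `fetch` parameter, and the one-second default delay are illustrative choices and not part of the original notebook. In practice `fetch` could be `getWords`.

```python
import time
from urllib.error import URLError

def scrape_all(pages, fetch, delay=1.0):
    """Call fetch(page) for each page and return the pages that failed."""
    failed = []
    for name in pages:
        try:
            fetch(name)
        except URLError as err:
            # HTTPError is a subclass of URLError, so HTTP errors land here too.
            print(f"skipping {name}: {err}")
            failed.append(name)
        time.sleep(delay)  # pause between requests to go easy on the server
    return failed

# Offline demonstration with a stand-in fetch function.
def fake_fetch(name):
    if name == "missing.html":
        raise URLError("not found")

failed_pages = scrape_all(["a.html", "missing.html", "b.html"], fake_fetch, delay=0)
print(failed_pages)
```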

In [None]:
# create strings from the lists and join with pipe
song_title = '|'.join(song_title)
song_lyric = '|'.join(song_lyric)
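
Joining with "|" only round-trips cleanly if the character never appears in the scraped text, which is assumed here rather than checked. The sketch below, using made-up titles, shows one way to verify that assumption before joining and then split the string back into a list.

```python
DELIMITER = "|"

# Made-up titles for illustration.
sample_titles = ["Song One Lyrics", "Song Two Lyrics"]

# Check the assumption that the delimiter is absent from every item.
clashes = [t for t in sample_titles if DELIMITER in t]
if clashes:
    raise ValueError(f"delimiter found in: {clashes}")

joined_titles = DELIMITER.join(sample_titles)
restored_titles = joined_titles.split(DELIMITER)

print(joined_titles)
print(restored_titles == sample_titles)
```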

In [None]:
# write title and lyric strings to text files
# the with blocks close each file even if the write raises an error
with codecs.open('/media/jim/Samsung USB/Talking_Heads/TH_titles.txt', mode="w", encoding="UTF-8") as title_data:
    title_data.write(song_title)

with codecs.open('/media/jim/Samsung USB/Talking_Heads/TH_lyrics.txt', mode="w", encoding="UTF-8") as lyric_data:
    lyric_data.write(song_lyric)
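
To load the saved text back for analysis, the file can be read and split on the same delimiter. The sketch below writes to a temporary directory instead of the USB path above, and the contents are placeholder data, so it can be run anywhere.

```python
import os
import tempfile

placeholder_titles = "Song One Lyrics|Song Two Lyrics"  # placeholder data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TH_titles.txt")

    # Write the pipe-delimited string as UTF-8.
    with open(path, mode="w", encoding="utf-8") as handle:
        handle.write(placeholder_titles)

    # Read it back and split it into a list of titles.
    with open(path, encoding="utf-8") as handle:
        loaded_titles = handle.read().split("|")

print(loaded_titles)
```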