# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read some documentation on Beautiful Soup and seen it in use, it's time to put that knowledge to work! In this lab, you'll formalize some of our example code into functions and scrape the lyrics of an artist of your choice.

## Objectives
You will be able to:
* Scrape static webpages
* Select specific elements from the DOM

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [74]:
import pandas as pd

In [None]:
#Starter Code

from bs4 import BeautifulSoup
import requests


url = 'https://www.azlyrics.com/s/sylvanesso.html' #Put the URL of your AZLyrics Artist Page here!

html_page = requests.get(url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
albums = soup.find_all("div", class_="album")
#The example from our lecture/reading
data = [] #Create a storage container
for album_n in range(len(albums)):
    #On the last album, we won't be able to look forward
    if album_n == len(albums)-1:
        cur_album = albums[album_n]
        album_songs = cur_album.findNextSiblings('a')
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
    else:
        cur_album = albums[album_n]
        next_album = albums[album_n+1]
        saca = cur_album.findNextSiblings('a') #songs after current album
        sbna = next_album.findPreviousSiblings('a') #songs before next album
        album_songs = [song for song in saca if song in sbna] #album songs are those listed after the current album but before the next one!
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
data
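
One way to formalize the starter loop above into a function is sketched below. It assumes the same layout the starter code relies on, where each `div.album` heading is followed by its songs as sibling `a` tags; the live page may be structured differently. The name `collect_song_links` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def collect_song_links(html):
    """Return (title, href, album) tuples parsed from an artist page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    for album in soup.find_all('div', class_='album'):
        # Walk the tags that follow this album heading
        for sibling in album.find_next_siblings():
            # The next album heading marks the end of this album's songs
            if sibling.name == 'div' and 'album' in (sibling.get('class') or []):
                break
            if sibling.name == 'a' and sibling.get('href'):
                data.append((sibling.text, sibling.get('href'), album.text))
    return data


# Hypothetical sample markup standing in for a downloaded artist page
sample_html = """
<div id="listAlbum">
<div class="album">album: <b>"First"</b> (2014)</div>
<a href="../lyrics/artist/one.html">One</a>
<a href="../lyrics/artist/two.html">Two</a>
<div class="album">album: <b>"Second"</b> (2017)</div>
<a href="../lyrics/artist/three.html">Three</a>
</div>
"""

links = collect_song_links(sample_html)
for title, href, album in links:
    print(title, href, album)
```

Because the function takes HTML text rather than a URL, it can be checked against a saved copy of the page without sending repeated requests to the site.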

## Text Scraping
Write a secondary function that scrapes the lyrics for each song page.
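
The lyrics selector used in this section relies on two CSS combinators: `>` (direct child) and `~` (later sibling). The sketch below shows how they behave on a small, made-up HTML fragment; the markup is an assumption for illustration and is not copied from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that loosely mimics a lyrics page layout
sample = """
<body>
  <div class="container main-page">
    <div class="row">
      <div class="ringtone">ad</div>
      <div>first sibling</div>
      <p>not a div</p>
      <div><div>nested</div>second sibling</div>
    </div>
  </div>
</body>
"""

demo_soup = BeautifulSoup(sample, 'html.parser')

# '>' keeps only divs that are direct children of div.row
children = demo_soup.select('div.row > div')

# '~' keeps divs that come after div.ringtone and share its parent
siblings = demo_soup.select('div.ringtone ~ div')

print(len(children), len(siblings))
print(siblings[0].text)
```

Note that an element with two classes, such as `class="container main-page"`, is matched by chaining the class names without quotes: `.container.main-page`.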

In [53]:
#Example to show it working
url = 'https://www.azlyrics.com/lyrics/sylvanesso/dress.html'
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
test = soup.select("body > .container.main-page > .row div.ringtone ~ div") # Look at the website for structure
# > means directly underneath (child), ~ means a later sibling; chained classes are written without quotes
test[0].text

"\n\r\nWe come to grips with our wrists, we come to sound with our mouth\nWe sing of what we think we know, mother, father, skin and flaws\nand we move just like the birds, moving amidst the other birds,\nand we move just like the fish, rolling away from larger\nand I know I'm protecting the light of lichen\nBut oh you look like a morning star, to see who we are, oooooo\nand I know you're protecting the light of lichen\nBut oh you look like a morning star, just see who we are, oooooo\nYou look good in the west, see how you clap those hands\nYou look good in the south, see how you use your mouth\nYou look good in the east, all elbows and knees\nTo the honey dipper, to the sound shifter\nOh don't you know you want to\nSee that moonrise in the rear view\nJust like you had wanted it to\nTemperature drops, the hot tart cools\nReady oh the radio calling to you\nRight out loud she said it\nFor crying out loud they meant it\nSing that song like I know you can\nWork your jaw like a blind man\nC

In [75]:
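# Create a function to retrieve the lyrics text for one song page
base_url = 'https://www.azlyrics.com' # Defined here so the example call below runs on its own

def lyrics_text(base_url, song_url_ext):
    url = base_url + song_url_ext
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    # The lyrics sit in the first div that follows the ringtone div
    raw = soup.select("body > .container.main-page > .row div.ringtone ~ div")
    text = raw[0].text
    return text

lyrics_text(base_url, '/lyrics/sylvanesso/dress.html')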

"\n\r\nWe come to grips with our wrists, we come to sound with our mouth\nWe sing of what we think we know, mother, father, skin and flaws\nand we move just like the birds, moving amidst the other birds,\nand we move just like the fish, rolling away from larger\nand I know I'm protecting the light of lichen\nBut oh you look like a morning star, to see who we are, oooooo\nand I know you're protecting the light of lichen\nBut oh you look like a morning star, just see who we are, oooooo\nYou look good in the west, see how you clap those hands\nYou look good in the south, see how you use your mouth\nYou look good in the east, all elbows and knees\nTo the honey dipper, to the sound shifter\nOh don't you know you want to\nSee that moonrise in the rear view\nJust like you had wanted it to\nTemperature drops, the hot tart cools\nReady oh the radio calling to you\nRight out loud she said it\nFor crying out loud they meant it\nSing that song like I know you can\nWork your jaw like a blind man\nC

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.


In [78]:
#Reusing example from above to get albums
url = 'https://www.azlyrics.com/s/sylvanesso.html' #Put the URL of your AZLyrics Artist Page here!
base_url = 'https://www.azlyrics.com'

html_page = requests.get(url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
albums = soup.find_all("div", class_="album")
#The example from our lecture/reading
data = [] #Create a storage container
for album_n in range(len(albums)):
    #On the last album, we won't be able to look forward
    if album_n == len(albums)-1:
        cur_album = albums[album_n]
        album_songs = cur_album.findNextSiblings('a')
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
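            # page[2:] drops the leading '..' from the relative href
            # Record None when a link has no href instead of reusing the previous song's lyrics
            lyrics = lyrics_text(base_url, page[2:]) if page else None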
            data.append({'Title':title, 'URL':page, 'Album':album, 'Lyrics':lyrics})
    else:
        cur_album = albums[album_n]
        next_album = albums[album_n+1]
        saca = cur_album.findNextSiblings('a') #songs after current album
        sbna = next_album.findPreviousSiblings('a') #songs before next album
        album_songs = [song for song in saca if song in sbna] #album songs are those listed after the current album but before the next one!
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
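            # page[2:] drops the leading '..' from the relative href
            # Record None when a link has no href instead of reusing the previous song's lyrics
            lyrics = lyrics_text(base_url, page[2:]) if page else None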
            data.append({'Title':title, 'URL':page, 'Album':album, 'Lyrics':lyrics})
df = pd.DataFrame(data)



ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
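
A `ConnectionError` like the one above can occur when a server drops a connection, which may happen if many requests arrive in quick succession. A minimal sketch of one common mitigation, pausing and retrying, is shown below. The helper name `fetch_with_retry`, its default values, and the injectable `get` and `sleep` parameters are assumptions made for illustration; they are not part of the lab's starter code.

```python
import time

import requests


def fetch_with_retry(url, retries=3, delay=5.0, get=requests.get, sleep=time.sleep):
    """Request a URL, waiting longer after each failed attempt."""
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()  # Treat HTTP error codes as failures too
            return response
        except requests.exceptions.RequestException as err:
            last_error = err
            # Wait before the next attempt, but not after the final one
            if attempt < retries - 1:
                sleep(delay * (attempt + 1))
    raise last_error
```

Inside `lyrics_text`, the call `requests.get(url)` could be swapped for `fetch_with_retry(url)`. Adding a short `time.sleep` between songs in the main loop is another way to space out requests.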

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your choice. For example, the two bar charts could compare the lyrics of two different songs or two different albums.
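
As a starting point for the comparison, the sketch below counts the most frequent words in a block of lyrics. The helper name `top_words` and the simple lowercase tokenization are assumptions made for illustration; a real analysis might also remove common stop words.

```python
import re
from collections import Counter


def top_words(lyrics, n=10):
    """Return the n most common words in a lyrics string as (word, count) pairs."""
    # Lowercase the text and keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words).most_common(n)


# Hypothetical snippet standing in for one song's scraped lyrics
sample_lyrics = "We move just like the birds, we move"
counts = top_words(sample_lyrics, n=2)
print(counts)
```

The word and count pairs from two songs or two albums could then be passed to a plotting call such as matplotlib's `plt.bar`, one chart per set of lyrics.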

## Level Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.

In [None]:
#Use this block for your code!

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!