# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read some documentation on Beautiful Soup, it's time to put it to work! In this lab, you'll formalize some of our example code into functions and scrape the lyrics for an artist of your choice.

## Objectives
You will be able to:
* Scrape static webpages
* Select specific elements from the DOM

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
chosen_url = 'https://www.azlyrics.com/b/billieeilish.html' #Put the URL of your AZLyrics Artist Page here!

html_page = requests.get(chosen_url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
  <meta content="Billie Eilish lyrics - 40 song lyrics sorted by album, including &quot;Bad Guy&quot;, &quot;When The Party's Over&quot;, &quot;Wish You Were Gay&quot;." name="description"/>
  <meta content="Billie Eilish, Billie Eilish lyrics, discography, albums, songs" name="keywords"/>
  <meta content="noarchive" name="robots"/>
  <title>
   Billie Eilish Lyrics
  </title>
  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>
  <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
  <!--[if lt IE 9]>
<script src="https://oss.maxcdn.

In [3]:
albums = soup.find_all("div", class_ = "album")
print('Number of matches: {}'.format(len(albums)))
print('Object type: {}'.format(type(albums)))
print('Preview of objects:\n{}'.format(albums[:2]))

Number of matches: 3
Object type: <class 'bs4.element.ResultSet'>
Preview of objects:
[<div class="album" id="49503">album: <b>"Dont Smile At Me"</b> (2017)</div>, <div class="album" id="65430">album: <b>"When We All Fall Asleep, Where Do We Go?"</b> (2019)</div>]


In [4]:
album = albums[0]
album

<div class="album" id="49503">album: <b>"Dont Smile At Me"</b> (2017)</div>

In [5]:
album.findNextSiblings('a')

[<a href="../lyrics/billieeilish/copycat.html" target="_blank">Copycat</a>,
 <a href="../lyrics/billieeilish/idontwannabeyouanymore.html" target="_blank">Idontwannabeyouanymore</a>,
 <a href="../lyrics/billieeilish/myboy.html" target="_blank">My Boy</a>,
 <a href="../lyrics/billieeilish/watch.html" target="_blank">Watch</a>,
 <a href="../lyrics/billieeilish/partyfavor.html" target="_blank">Party Favor</a>,
 <a href="../lyrics/billieeilish/bellyache.html" target="_blank">Bellyache</a>,
 <a href="../lyrics/billieeilish/oceaneyes.html" target="_blank">Ocean Eyes</a>,
 <a href="../lyrics/billieeilish/hostage.html" target="_blank">Hostage</a>,
 <a href="../lyrics/billieeilish/burn.html" target="_blank">&amp;burn</a>,
 <a href="../lyrics/billieeilish/748832.html" target="_blank">!!!!!!!</a>,
 <a href="../lyrics/billieeilish/badguy.html" target="_blank">Bad Guy</a>,
 <a href="../lyrics/billieeilish/xanny.html" target="_blank">Xanny</a>,
 <a href="../lyrics/billieeilish/youshouldseemeinacrown.

In [6]:
def song_list(artist_url):
    """Return a list of (title, href, album) tuples for every song on an artist page."""
    html_page = requests.get(artist_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    
    # List to house our results:
    data = []
    
    albums = soup.find_all("div", class_="album")
    for album_n, cur_album in enumerate(albums):
        # All song links that appear after this album header
        album_songs = cur_album.find_next_siblings('a')
        if album_n < len(albums) - 1:
            # Keep only the links that also come before the next album header
            before_next = albums[album_n + 1].find_previous_siblings('a')
            album_songs = [song for song in album_songs if song in before_next]
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
                
    return data
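
The `href` values collected by `song_list` are relative paths such as `'../lyrics/billieeilish/copycat.html'`, so they cannot be passed to `requests.get` as they are. As a minimal sketch, the standard library's `urljoin` can resolve one against the artist page URL. The variable names `artist_url`, `relative_href`, and `song_url` are chosen here for illustration only.

```python
from urllib.parse import urljoin

artist_url = 'https://www.azlyrics.com/b/billieeilish.html'
relative_href = '../lyrics/billieeilish/copycat.html'  # second item of a song_list tuple

# urljoin resolves the '..' segment relative to the artist page's location
song_url = urljoin(artist_url, relative_href)
print(song_url)
```

Resolving the link this way avoids rebuilding song URLs by hand from the song title, which does not always match the file name on the site.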

In [7]:
chosen_url = 'https://www.azlyrics.com/b/billieeilish.html'
song_list(chosen_url)

[('Copycat',
  '../lyrics/billieeilish/copycat.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Idontwannabeyouanymore',
  '../lyrics/billieeilish/idontwannabeyouanymore.html',
  'album: "Dont Smile At Me" (2017)'),
 ('My Boy',
  '../lyrics/billieeilish/myboy.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Watch',
  '../lyrics/billieeilish/watch.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Party Favor',
  '../lyrics/billieeilish/partyfavor.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Bellyache',
  '../lyrics/billieeilish/bellyache.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Ocean Eyes',
  '../lyrics/billieeilish/oceaneyes.html',
  'album: "Dont Smile At Me" (2017)'),
 ('Hostage',
  '../lyrics/billieeilish/hostage.html',
  'album: "Dont Smile At Me" (2017)'),
 ('&burn',
  '../lyrics/billieeilish/burn.html',
  'album: "Dont Smile At Me" (2017)'),
 ('!!!!!!!',
  '../lyrics/billieeilish/748832.html',
  'album: "When We All Fall Asleep, Where Do We Go?" (2019)'),
 ('Bad Guy',


## Text Scraping
Write a secondary function that scrapes the lyrics for each song page.

In [8]:
#Remember to open up the webpage in a browser and control-click/right-click and go to inspect!
from bs4 import BeautifulSoup
import requests

#Example page
url = 'https://www.azlyrics.com/b/billieeilish.html'


html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->\n  <meta content="Billie Eilish lyrics - 40 song lyrics sorted by album, including &quot;Bad Guy&quot;, &quot;When The Party\'s Over&quot;, &quot;Wish You Were Gay&quot;." name="description"/>\n  <meta content="Billie Eilish, Billie Eilish lyrics, discography, albums, songs" name="keywords"/>\n  <meta content="noarchive" name="robots"/>\n  <title>\n   Billie Eilish Lyrics\n  </title>\n  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>\n  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>\n  <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n  <!--[if lt IE 9]>\r\n<script src=

In [9]:
soup.findAll('div')

[<div id="fb-root"></div>, <div class="container">
 <!-- Brand and toggle get grouped for better mobile display -->
 <div class="navbar-header">
 <button class="navbar-toggle collapsed" data-target="#search-collapse" data-toggle="collapse" type="button">
 <span class="glyphicon glyphicon-search"></span>
 </button>
 <button class="navbar-toggle collapsed" data-target="#artists-collapse" data-toggle="collapse" type="button">
 <span class="glyphicon glyphicon-th-list"></span>
 </button>
 <a class="navbar-brand" href="//www.azlyrics.com"><img alt="AZLyrics.com" class="pull-left" src="//www.azlyrics.com/az_logo_tr.png" style="max-height:40px; margin-top:-10px;"/></a>
 </div>
 <ul class="collapse navbar-collapse nav navbar-nav" id="artists-collapse">
 <li>
 <div class="btn-group text-center" role="group">
 <a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>
 <a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>
 <a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</

In [None]:
#  </div>, <div class="container main-page">
#  <div class="row">
#  <div class="col-xs-12 col-md-6 text-center"

In [10]:
main = soup.find('div', {"class": "container main-page"})
main()[:10]

[<div class="row">
 <div class="col-md-3 text-center hidden-sm hidden-xs">
 <div class="sky-ad"></div>
 </div>
 <!-- content -->
 <div class="col-xs-12 col-md-6 text-center">
 <form action="../add.php" id="addsong" method="post">
 <input name="what" type="hidden" value="add_song"/>
 <input name="artist" type="hidden" value="Billie Eilish"/>
 <input name="artist_id" type="hidden" value="7208"/>
 </form>
 <div class="div-share noprint">
 <div class="fb-like" data-action="like" data-href="//www.azlyrics.com/b/billieeilish.html" data-layout="button_count" data-share="false" data-show-faces="false" style="float:left;"></div>
 </div>
 <h1><strong>Billie Eilish Lyrics</strong></h1>
 <div class="ringtone">
 <span id="cf_text_top"></span>
 </div>
 <!-- start of song list -->
 <a class="btn btn-xs btn-default sorting" href="#" onclick="return showAlbum();"><span class="glyphicon glyphicon-sort-by-order"></span> sort by album</a><a class="btn btn-xs btn-default sorting" href="#" onclick="return s

In [11]:
main2 = main.find('div', {"class": "row"})
main2()[:10]

[<div class="col-md-3 text-center hidden-sm hidden-xs">
 <div class="sky-ad"></div>
 </div>,
 <div class="sky-ad"></div>,
 <div class="col-xs-12 col-md-6 text-center">
 <form action="../add.php" id="addsong" method="post">
 <input name="what" type="hidden" value="add_song"/>
 <input name="artist" type="hidden" value="Billie Eilish"/>
 <input name="artist_id" type="hidden" value="7208"/>
 </form>
 <div class="div-share noprint">
 <div class="fb-like" data-action="like" data-href="//www.azlyrics.com/b/billieeilish.html" data-layout="button_count" data-share="false" data-show-faces="false" style="float:left;"></div>
 </div>
 <h1><strong>Billie Eilish Lyrics</strong></h1>
 <div class="ringtone">
 <span id="cf_text_top"></span>
 </div>
 <!-- start of song list -->
 <a class="btn btn-xs btn-default sorting" href="#" onclick="return showAlbum();"><span class="glyphicon glyphicon-sort-by-order"></span> sort by album</a><a class="btn btn-xs btn-default sorting" href="#" onclick="return showSong

In [12]:
main3 = main2.find('div', {"class": "col-xs-12 col-md-6 text-center"})
main3()[:100]

[<form action="../add.php" id="addsong" method="post">
 <input name="what" type="hidden" value="add_song"/>
 <input name="artist" type="hidden" value="Billie Eilish"/>
 <input name="artist_id" type="hidden" value="7208"/>
 </form>,
 <input name="what" type="hidden" value="add_song"/>,
 <input name="artist" type="hidden" value="Billie Eilish"/>,
 <input name="artist_id" type="hidden" value="7208"/>,
 <div class="div-share noprint">
 <div class="fb-like" data-action="like" data-href="//www.azlyrics.com/b/billieeilish.html" data-layout="button_count" data-share="false" data-show-faces="false" style="float:left;"></div>
 </div>,
 <div class="fb-like" data-action="like" data-href="//www.azlyrics.com/b/billieeilish.html" data-layout="button_count" data-share="false" data-show-faces="false" style="float:left;"></div>,
 <h1><strong>Billie Eilish Lyrics</strong></h1>,
 <strong>Billie Eilish Lyrics</strong>,
 <div class="ringtone">
 <span id="cf_text_top"></span>
 </div>,
 <span id="cf_text_top"

In [13]:
main4 = main3.findAll('div', {"class": "album"})
main4

[<div class="album" id="49503">album: <b>"Dont Smile At Me"</b> (2017)</div>,
 <div class="album" id="65430">album: <b>"When We All Fall Asleep, Where Do We Go?"</b> (2019)</div>,
 <div class="album">other songs:</div>]

In [14]:
album = main4[0]
# Note: this rebinds the name song_list, shadowing the song_list() function defined earlier
song_list = album.findNextSiblings('a')
song_list

[<a href="../lyrics/billieeilish/copycat.html" target="_blank">Copycat</a>,
 <a href="../lyrics/billieeilish/idontwannabeyouanymore.html" target="_blank">Idontwannabeyouanymore</a>,
 <a href="../lyrics/billieeilish/myboy.html" target="_blank">My Boy</a>,
 <a href="../lyrics/billieeilish/watch.html" target="_blank">Watch</a>,
 <a href="../lyrics/billieeilish/partyfavor.html" target="_blank">Party Favor</a>,
 <a href="../lyrics/billieeilish/bellyache.html" target="_blank">Bellyache</a>,
 <a href="../lyrics/billieeilish/oceaneyes.html" target="_blank">Ocean Eyes</a>,
 <a href="../lyrics/billieeilish/hostage.html" target="_blank">Hostage</a>,
 <a href="../lyrics/billieeilish/burn.html" target="_blank">&amp;burn</a>,
 <a href="../lyrics/billieeilish/748832.html" target="_blank">!!!!!!!</a>,
 <a href="../lyrics/billieeilish/badguy.html" target="_blank">Bad Guy</a>,
 <a href="../lyrics/billieeilish/xanny.html" target="_blank">Xanny</a>,
 <a href="../lyrics/billieeilish/youshouldseemeinacrown.

In [15]:
test_url = 'https://www.azlyrics.com/lyrics/billieeilish/badguy.html'

In [16]:
def scrape_lyrics(song_page_url):
    html_page = requests.get(song_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    main_page = soup.find('div', {"class": "container main-page"})
    main_l2 = main_page.find('div', {"class" : "row"})
    main_l3 = main_l2.find('div', {"class" : "col-xs-12 col-lg-8 text-center"})
    lyrics = main_l3.findAll('div')[6].text
    return lyrics
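
`scrape_lyrics` assumes that every lookup succeeds and that the lyrics are always the seventh `div` in the column. If the site returns a different page (for example, after blocking a request), `find` returns `None` and the function fails with an `AttributeError`. The sketch below is one possible defensive variant, not a replacement for the function above: it parses an HTML string that has already been downloaded and returns `None` when the layout is not as expected. The name `extract_lyrics` and the `div_index` parameter are assumptions made for this example.

```python
from bs4 import BeautifulSoup

def extract_lyrics(html, div_index=6):
    """Return the lyrics text from a song page's HTML, or None if the layout differs."""
    soup = BeautifulSoup(html, 'html.parser')
    # Same column that scrape_lyrics targets, matched on the exact class string
    column = soup.find('div', {"class": "col-xs-12 col-lg-8 text-center"})
    if column is None:
        return None
    divs = column.find_all('div')
    if len(divs) <= div_index:
        return None
    # Strip the leading and trailing whitespace seen in the raw lyrics text
    return divs[div_index].text.strip()

# Tiny hand-written page for illustration; it is not real site markup
sample = ('<div class="col-xs-12 col-lg-8 text-center">'
          + '<div>x</div>' * 6
          + '<div>\n\r\nWhite shirt\n</div></div>')
print(extract_lyrics(sample))
print(extract_lyrics('<html></html>'))
```

Separating parsing from downloading also makes the logic testable without sending a request to the site.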

In [17]:
scrape_lyrics(test_url)

"\n\r\nWhite shirt, now red my bloody nose\nSleeping, you're on your tippy toes\nCreeping around like no one knows\nThink you're so criminal\n\nBruises on both my knees for you\nDon't say thank you or please\nI do what I want when I'm wanting to\nMy soul? So cynical\n\nSo you're a tough guy\nLike-it-really-rough guy\nJust-can't-get-enough guy\nChest-always-so-puffed guy\nI'm that bad type\nMake-your-mama-sad type\nMake-your-girlfriend-mad type\nMight-seduce-your-dad type\nI'm the bad guy, duh\n\nI'm the bad guy\n\nI like it when you take control\nEven if you know that you don't\nOwn me, I'll let you play the role\nI'll be your animal\n\nMy mommy likes to sing along\nWith me but she won't sing this song\nIf she reads all the lyrics\nShe'll pity the men I know\n\nSo you're a tough guy\nLike-it-really-rough guy\nJust-can't-get-enough guy\nChest-always-so-puffed guy\nI'm that bad type\nMake-your-mama-sad type\nMake-your-girlfriend-mad type\nMight-seduce-your-dad type\nI'm the bad guy, duh\

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.
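
One way such a script could be organized is sketched below. It assumes the songs are `(title, href, album)` tuples like those returned by the `song_list` function, and it takes the scraping function as a parameter so that it can be tried without touching the network. The names `collect_lyrics`, `scrape`, and `delay` are hypothetical and chosen for this example.

```python
import time
from urllib.parse import urljoin

def collect_lyrics(songs, artist_url, scrape, delay=2.0):
    """Scrape every song in `songs`, a list of (title, href, album) tuples.

    `scrape` is any function that takes a song URL and returns its lyrics.
    Returns a list of dicts, one per song, in the same order as `songs`.
    """
    results = []
    for title, href, album in songs:
        song_url = urljoin(artist_url, href)  # resolve the relative link
        try:
            lyrics = scrape(song_url)
        except Exception:
            lyrics = None  # record the failure instead of stopping the whole run
        results.append({"title": title, "album": album,
                        "url": song_url, "lyrics": lyrics})
        time.sleep(delay)  # pause between requests to avoid overloading the site
    return results

# Dry run with a stand-in scraper, so no request is sent
demo_songs = [('Copycat', '../lyrics/billieeilish/copycat.html', 'album: "Dont Smile At Me" (2017)')]
demo = collect_lyrics(demo_songs, 'https://www.azlyrics.com/b/billieeilish.html',
                      scrape=lambda u: 'placeholder lyrics', delay=0)
print(demo[0]['url'])
```

In a real run you would pass `scrape_lyrics` as the `scrape` argument and keep a non-zero `delay`.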


In [20]:
test_url

'https://www.azlyrics.com/lyrics/billieeilish/badguy.html'

In [69]:
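base_url = 'https://www.azlyrics.com/billieeilish/'  # Note: song pages live under /lyrics/billieeilish/ (see test_url), so URLs built from this base will not resolve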

In [32]:
song_list[1].text

'Idontwannabeyouanymore'

In [71]:
s1 = base_url + song_list[1].text

In [77]:
s2 = s1 + '.html'
s2

'https://www.azlyrics.com/billieeilish/Idontwannabeyouanymore.html'

In [78]:
scrape_lyrics(s2)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

In [79]:
scrape_lyrics(test_url)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

In [48]:
songs[2].text

'My Boy'

In [57]:
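from urllib.parse import urljoin

songs = song_list  # the list of <a> tags collected earlier
artist_url = 'https://www.azlyrics.com/b/billieeilish.html'
lyrics = []
for song in songs:
    try:
        # Each href is relative (e.g. '../lyrics/billieeilish/copycat.html'),
        # so resolve it against the artist page instead of rebuilding it from the title
        song_url = urljoin(artist_url, song.get('href'))
        lyrics.append(scrape_lyrics(song_url))
    except Exception:
        # Keep lyrics aligned with songs even when a request or parse fails
        lyrics.append("N/A")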

In [58]:
len(lyrics)

40

In [59]:
len(songs)

40

In [60]:
songs

[<a href="../lyrics/billieeilish/copycat.html" target="_blank">Copycat</a>,
 <a href="../lyrics/billieeilish/idontwannabeyouanymore.html" target="_blank">Idontwannabeyouanymore</a>,
 <a href="../lyrics/billieeilish/myboy.html" target="_blank">My Boy</a>,
 <a href="../lyrics/billieeilish/watch.html" target="_blank">Watch</a>,
 <a href="../lyrics/billieeilish/partyfavor.html" target="_blank">Party Favor</a>,
 <a href="../lyrics/billieeilish/bellyache.html" target="_blank">Bellyache</a>,
 <a href="../lyrics/billieeilish/oceaneyes.html" target="_blank">Ocean Eyes</a>,
 <a href="../lyrics/billieeilish/hostage.html" target="_blank">Hostage</a>,
 <a href="../lyrics/billieeilish/burn.html" target="_blank">&amp;burn</a>,
 <a href="../lyrics/billieeilish/748832.html" target="_blank">!!!!!!!</a>,
 <a href="../lyrics/billieeilish/badguy.html" target="_blank">Bad Guy</a>,
 <a href="../lyrics/billieeilish/xanny.html" target="_blank">Xanny</a>,
 <a href="../lyrics/billieeilish/youshouldseemeinacrown.

In [61]:
lyrics

['N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A']

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your choice. For example, the two bar charts could compare the lyrics of two different songs or two different albums.
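
As one possible starting point, the sketch below counts the most common words in two lyric strings and draws them as side-by-side bar charts. The helper names `top_words` and `plot_top_words`, the simple regular-expression tokenizer, and the use of matplotlib are all assumptions made for this example, so adapt them to your own data.

```python
from collections import Counter
import re

def top_words(lyrics, n=10):
    """Return the n most common lowercase words in a lyrics string as (word, count) pairs."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words).most_common(n)

def plot_top_words(lyrics_a, lyrics_b, labels=('Song A', 'Song B'), n=10):
    """Draw two side-by-side bar charts of the most common words in two lyric strings."""
    import matplotlib.pyplot as plt  # imported here so top_words works without matplotlib
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, text, label in zip(axes, (lyrics_a, lyrics_b), labels):
        counts = top_words(text, n)
        ax.bar([word for word, _ in counts], [count for _, count in counts])
        ax.set_title(label)
        ax.tick_params(axis='x', rotation=45)
    fig.tight_layout()
    return fig

# Placeholder text for illustration; it is not scraped data
print(top_words("la la la da da di", 2))
```

Calling `plot_top_words(lyrics_one, lyrics_two)` with two scraped lyric strings would then produce the pair of charts.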

In [62]:
import pandas as pd

In [63]:
df = pd.DataFrame(list(zip(song_list, lyrics)))
df.head(2)

Unnamed: 0,0,1
0,"<a href=""../lyrics/billieeilish/copycat.html"" ...",
1,"<a href=""../lyrics/billieeilish/idontwannabeyo...",


## Level - Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.
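
To make the trade-off concrete, here is a small sketch of two of the possible representations: the verbatim text stored as a JSON record, and a bag of word counts. The record below is a hypothetical placeholder rather than scraped data, and the field names are assumptions for this example.

```python
from collections import Counter
import json

# Hypothetical record for one song; the lyric text is a placeholder
verbatim = {"title": "Example Song", "album": "Example Album",
            "lyrics": "la la la\nda da"}

# Option 1: keep the full text. Largest to store, but any later analysis
# (line structure, repeated phrases, sentiment) remains possible.
as_json = json.dumps(verbatim)

# Option 2: keep only word counts. Compact and ready for frequency charts,
# but word order and line breaks are lost.
word_counts = Counter(verbatim["lyrics"].lower().split())

print(as_json)
print(word_counts)
```

Storing the verbatim text and deriving counts when needed keeps both kinds of analysis available, at the cost of more storage.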

In [None]:
#Use this block for your code!

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!