# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read some documentation on Beautiful Soup and seen it in use, it's time to practice and put that knowledge to work! In this lab, you'll formalize some of our example code into functions and scrape the lyrics of an artist of your choice.

## Objectives
You will be able to:
* Scrape static webpages
* Select specific elements from the DOM

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [1]:
#Starter Code
!pip install bs4 
from bs4 import BeautifulSoup
import requests

In [2]:
url = 'https://www.azlyrics.com/b/bobseger.html' #Put the URL of your AZLyrics Artist Page here!

html_page = requests.get(url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing


#The example from our lecture/reading
print(soup.prettify()[:1000])

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
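
A `ConnectionError` like this one means the server closed the connection before sending a response. The sketch below shows one way to make the request step more forgiving: it sets a timeout, checks the status code, and retries a few times with a pause in between. `fetch_html`, its parameters, and the retry settings are assumptions for illustration and not part of the original lab; retrying may not help if the site is refusing automated requests.

```python
import time
import requests

def fetch_html(url, retries=3, delay=2.0, get=requests.get):
    """Return the page content, retrying on request errors (assumes retries >= 1)."""
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=10)   # give up on a hung connection
            response.raise_for_status()       # turn 4xx/5xx responses into exceptions
            return response.content
        except requests.RequestException as err:
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error
```

The returned bytes can be passed to `BeautifulSoup(..., 'html.parser')` exactly as `html_page.content` is above. The `get` parameter exists only so the function can be exercised without a network connection.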

In [4]:
albums = soup.find_all("div", class_="album")
print('Number of matches: {}'.format(len(albums)))
print('Object type: {}'.format(type(albums)))
print('Preview of objects:\n{}'.format(albums[:3]))

Number of matches: 23
Object type: <class 'bs4.element.ResultSet'>
Preview of objects:
[<div class="album">album: <b>"Ramblin' Gamblin' Man"</b> (1969)</div>, <div class="album">album: <b>"Noah"</b> (1969)</div>, <div class="album">album: <b>"Mongrel"</b> (1970)</div>]
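
Each match's text follows the pattern `album: "Title" (Year)` in the preview above. If you want the title and year as separate values (for example, to group songs by year later), a small regular expression is one option. `parse_album_label` is a hypothetical helper, and it assumes every label follows that exact pattern; labels that do not match return `None`.

```python
import re

# Matches text such as: album: "Noah" (1969)
ALBUM_PATTERN = re.compile(r'^\s*(\w+):\s*"(.*)"\s*\((\d{4})\)\s*$')

def parse_album_label(label):
    """Split an album label into (kind, title, year), or return None if it does not match."""
    match = ALBUM_PATTERN.match(label)
    if match is None:
        return None
    kind, title, year = match.groups()
    return kind, title, int(year)

print(parse_album_label('album: "Ramblin\' Gamblin\' Man" (1969)'))
```

With the `albums` result set from the cell above, this could be applied as `[parse_album_label(album.text) for album in albums]`.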


In [5]:
data = [] #Create a storage container
for album_n, cur_album in enumerate(albums):
    album_songs = cur_album.findNextSiblings('a') #songs after current album
    #On the last album, we won't be able to look forward
    if album_n < len(albums)-1:
        next_album = albums[album_n+1]
        sbna = next_album.findPreviousSiblings('a') #songs before next album
        #album songs are those listed after the current album but before the next one!
        album_songs = [song for song in album_songs if song in sbna]
    for song in album_songs:
        page = song.get('href')
        title = song.text
        album = cur_album.text
        data.append((title, page, album))
data[:23]

[("Ramblin' Gamblin' Man",
  '../lyrics/bobseger/ramblingamblinman.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Tales Of Lucy Blue',
  '../lyrics/bobseger/talesoflucyblue.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Ivory',
  '../lyrics/bobseger/ivory.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Gone',
  '../lyrics/bobseger/gone.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Down Home',
  '../lyrics/bobseger/downhome.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Train Man',
  '../lyrics/bobseger/trainman.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('White Wall',
  '../lyrics/bobseger/whitewall.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Black Eyed Girl',
  '../lyrics/bobseger/blackeyedgirl.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('2+2=?',
  '../lyrics/bobseger/22.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('The Last Song (Love Needs To Be Loved)',
  '../lyrics/bobseger/thelastsongloveneed
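
The exploration above can be wrapped into the function this section asks for. The following is a minimal sketch, not the lab's reference solution: `get_song_links` and its parameters are hypothetical names, and it swaps the previous/next sibling intersection for a single pass over the siblings while remembering the most recent album heading. It assumes the album `div` tags and song links are siblings, as in the page explored above, and it uses `urljoin` to turn the relative `href` values into full URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_song_links(artist_html, artist_url):
    """Return (title, absolute_url, album) tuples from an artist page's HTML."""
    soup = BeautifulSoup(artist_html, 'html.parser')
    current = soup.find('div', class_='album')
    if current is None:
        return []  # no album headings found, so there is nothing to walk
    album = current.text
    songs = []
    # Walk the siblings in page order, tracking the most recent album heading
    for tag in current.find_next_siblings(['div', 'a']):
        if tag.name == 'div':
            if 'album' in tag.get('class', []):
                album = tag.text
        elif tag.get('href'):  # skip anchors such as <a id="..."> with no link
            songs.append((tag.text, urljoin(artist_url, tag['href']), album))
    return songs

# Tiny made-up page fragment, only to show the shape of the result
sample_html = '''
<div class="album">album: <b>"First"</b> (1969)</div>
<a href="../lyrics/artist/one.html">One</a><br>
<a id="123"></a>
<div class="album">album: <b>"Second"</b> (1970)</div>
<a href="../lyrics/artist/two.html">Two</a><br>
'''
for row in get_song_links(sample_html, 'https://www.azlyrics.com/a/artist.html'):
    print(row)
```

For a real page, the first argument would be the downloaded page content and the second the artist page URL it came from.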

## Text Scraping
Write a secondary function that scrapes the lyrics for each song page.

In [7]:
#Remember to open up the webpage in a browser and control-click/right-click and go to inspect!
from bs4 import BeautifulSoup
import requests

#Example page
url = 'https://www.azlyrics.com/lyrics/bobseger/beautifulloser119359.html'


html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify()[:2000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content=\'Lyrics to "Beautiful Loser" song by Bob Seger: He wants to dream like a young man With the wisdom of an old man He wants his home and security He w...\' name="description"/>\n  <meta content="Beautiful Loser lyrics, Bob Seger Beautiful Loser lyrics, Bob Seger lyrics" name="keywords"/>\n  <meta content="noarchive" name="robots"/>\n  <meta content="//www.azlyrics.com/az_logo_tr.png" property="og:image"/>\n  <title>\n   Bob Seger - Beautiful Loser Lyrics | AZLyrics.com\n  </title>\n  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>\n  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>\n  <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n  <!--[if lt IE 9]>\r\n<script src="h

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.


In [7]:
#Use this block for your code!
album = albums[0]
album.findNextSiblings('a')

[<a href="../lyrics/bobseger/ramblingamblinman.html" target="_blank">Ramblin' Gamblin' Man</a>,
 <a href="../lyrics/bobseger/talesoflucyblue.html" target="_blank">Tales Of Lucy Blue</a>,
 <a href="../lyrics/bobseger/ivory.html" target="_blank">Ivory</a>,
 <a href="../lyrics/bobseger/gone.html" target="_blank">Gone</a>,
 <a href="../lyrics/bobseger/downhome.html" target="_blank">Down Home</a>,
 <a href="../lyrics/bobseger/trainman.html" target="_blank">Train Man</a>,
 <a href="../lyrics/bobseger/whitewall.html" target="_blank">White Wall</a>,
 <a href="../lyrics/bobseger/blackeyedgirl.html" target="_blank">Black Eyed Girl</a>,
 <a href="../lyrics/bobseger/22.html" target="_blank">2+2=?</a>,
 <a href="../lyrics/bobseger/thelastsongloveneedstobeloved.html" target="_blank">The Last Song (Love Needs To Be Loved)</a>,
 <a id="11268"></a>,
 <a href="../lyrics/bobseger/noah.html" target="_blank">Noah</a>,
 <a href="../lyrics/bobseger/innervenuseyes.html" target="_blank">Innervenus Eyes</a>,
 <

In [10]:
data = [] #Create a storage container
for album_n in range(len(albums)):
    #On the last album, we won't be able to look forward
    if album_n == len(albums)-1:
        cur_album = albums[album_n]
        album_songs = cur_album.findNextSiblings('a')
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
    else:
        cur_album = albums[album_n]
        next_album = albums[album_n+1]
        saca = cur_album.findNextSiblings('a') #songs after current album
        sbna = next_album.findPreviousSiblings('a') #songs before next album
        album_songs = [song for song in saca if song in sbna] #album songs are those listed after the current album but before the next one!
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            #lyrics = 
            data.append((title, page, album,))
data[:3]

[("Ramblin' Gamblin' Man",
  '../lyrics/bobseger/ramblingamblinman.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Tales Of Lucy Blue',
  '../lyrics/bobseger/talesoflucyblue.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)'),
 ('Ivory',
  '../lyrics/bobseger/ivory.html',
  'album: "Ramblin\' Gamblin\' Man" (1969)')]
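
Given a list of `(title, page, album)` tuples like `data`, the remaining step is to visit each song page and attach its lyrics. The sketch below is one possible shape for that loop, under stated assumptions: `scrape_all_lyrics` is a hypothetical name, `get_lyrics` stands for whatever lyrics-scraping function you write, and the one-second pause is an arbitrary choice to avoid sending requests too quickly. Failures on individual songs are recorded as `None` so that one bad page does not stop the whole run.

```python
import time
from urllib.parse import urljoin

def scrape_all_lyrics(songs, artist_url, get_lyrics, delay=1.0):
    """Fetch lyrics for each (title, page, album) tuple and return a list of dicts."""
    results = []
    for title, page, album in songs:
        if not page:
            continue  # entries without a link cannot be scraped
        song_url = urljoin(artist_url, page)  # resolve '../lyrics/...' style links
        try:
            lyrics = get_lyrics(song_url)
        except (AttributeError, OSError) as err:
            # AttributeError: the page layout was not what the scraper expected
            # OSError: covers requests' connection and HTTP exceptions
            print('Skipping {}: {}'.format(song_url, err))
            lyrics = None
        results.append({'title': title, 'album': album, 'url': song_url, 'lyrics': lyrics})
        time.sleep(delay)  # pause between requests
    return results
```

A call might look like `scrape_all_lyrics(data, url, scrape_lyrics)`, where `url` is the artist page address and `scrape_lyrics` is your own lyrics function.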

In [8]:
divs = soup.findAll('div')

In [9]:
div = divs[0]

In [10]:
for n, div in enumerate(divs):
    if "<!-- Usage of azlyrics.com content by any " in div.text:
        print(n)
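
A text search like the loop above can come up empty because, in recent Beautiful Soup versions, `.text` generally leaves out HTML comments. Comments are still in the parse tree as `Comment` objects, so one alternative is to search for the comment itself and take its parent tag. This is a sketch under assumptions: `find_lyrics_div` and `MARKER` are hypothetical names, and the sample HTML is made up to mimic a page where the comment sits inside the lyrics `div`.

```python
from bs4 import BeautifulSoup, Comment

MARKER = 'Usage of azlyrics.com content'  # text taken from the comment searched for above

def find_lyrics_div(soup):
    """Return the tag that directly contains the marker comment, or None."""
    comment = soup.find(string=lambda s: isinstance(s, Comment) and MARKER in s)
    return comment.parent if comment is not None else None

# Made-up fragment for illustration only
sample_html = '''
<div class="ringtone">not lyrics</div>
<div><!-- Usage of azlyrics.com content by any (rest of comment omitted) -->
First line<br>Second line</div>
'''
lyrics_div = find_lyrics_div(BeautifulSoup(sample_html, 'html.parser'))
print(lyrics_div.get_text().strip())
```

Locating the `div` through the comment avoids depending on its position among its siblings, though it does depend on the site continuing to include that comment.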

In [11]:
main_page = soup.find('div', {"class": "container main-page"})
main_l2 = main_page.find('div', {"class" : "row"})
main_l3 = main_l2.find('div', {"class" : "col-xs-12 col-lg-8 text-center"})

In [12]:
lyrics = main_l3.findAll('div')[6].text
lyrics

"\n\r\nHe wants to dream like a young man\nWith the wisdom of an old man\nHe wants his home and security\nHe wants to live like a sailor at sea\n\nBeautiful loser\nWhere you gonna fall?\nWhen you realize, you just can't have it all\n\nHe's your oldest and your best friend\nIf you need him, he'll be there again\nHe's always willing to be second-best\nA perfect lodger, a perfect guest\n\nBeautiful loser\nRead it on the wall\nAnd realize\nYou just can't have it all\n\nYou just can't have it all\nYou just can't have it all\nOhh, ohh, can't have it all\n\nYou can try, you can try, but you can't have it all\noh yeah\n\nHe'll never make any enemies, enemies, no\nHe won't complain if he's caught in a freeze\nHe'll always ask, he'll always say please\n\nBeautiful loser\nNever take it all\n'Cause it's easier\nAnd faster when you fall\n\nYou just don't need it all\nYou just don't need it all\nYou just don't need it all\nJust don't need it all\n"

In [13]:
def scrape_lyrics(song_page_url):
    html_page = requests.get(song_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    main_page = soup.find('div', {"class": "container main-page"})
    main_l2 = main_page.find('div', {"class" : "row"})
    main_l3 = main_l2.find('div', {"class" : "col-xs-12 col-lg-8 text-center"})
    lyrics = main_l3.findAll('div')[6].text
    return lyrics
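
`find` returns `None` when nothing matches, so a chain of `find` calls raises an `AttributeError` as soon as a page does not have the expected layout. A more defensive variant checks each step and returns `None` instead. The sketch below is a hypothetical helper (`extract_lyrics`) that works on HTML you have already downloaded; it keeps the same class name and the same index of 6 used above, both of which are specific to the page layout at the time this lab was written.

```python
from bs4 import BeautifulSoup

def extract_lyrics(song_html):
    """Return the lyrics text from a song page's HTML, or None if the layout differs."""
    soup = BeautifulSoup(song_html, 'html.parser')
    column = soup.find('div', class_='col-xs-12 col-lg-8 text-center')
    if column is None:
        return None  # not a song page, or the layout has changed
    divs = column.find_all('div')
    if len(divs) <= 6:
        return None  # fewer nested divs than the index above relies on
    return divs[6].text.strip()

# Made-up HTML with seven nested divs, only to show both outcomes
sample = ('<div class="col-xs-12 col-lg-8 text-center">'
          + ''.join('<div>block {}</div>'.format(i) for i in range(7))
          + '</div>')
print(extract_lyrics(sample))
print(extract_lyrics('<p>not a song page</p>'))
```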

In [15]:
#scrape_lyrics expects the URL of a single song page, not an artist page
lyrics = scrape_lyrics("https://www.azlyrics.com/lyrics/bobseger/beautifulloser119359.html")
print(len(lyrics))
print(lyrics[:200])

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your choice. For example, the two bar charts could compare the lyrics of two different songs or two different albums.
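
One way to approach this is to count word frequencies for each set of lyrics and plot the most common words side by side. The sketch below is only a starting point: `top_words` and `plot_top_words` are hypothetical helpers, the tokenization is a simple regular expression on lowercase letters and apostrophes, and plotting assumes matplotlib is installed.

```python
import re
from collections import Counter

def top_words(lyrics, n=10):
    """Return the n most common words in the lyrics as (word, count) pairs."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words).most_common(n)

def plot_top_words(lyrics_by_label, n=10):
    """Draw one bar chart per entry of a {label: lyrics} dictionary."""
    import matplotlib.pyplot as plt  # imported here so top_words works without matplotlib
    fig, axes = plt.subplots(1, len(lyrics_by_label), figsize=(12, 4), squeeze=False)
    for ax, (label, lyrics) in zip(axes[0], lyrics_by_label.items()):
        pairs = top_words(lyrics, n)
        if not pairs:
            continue  # nothing to plot for empty lyrics
        words, counts = zip(*pairs)
        ax.bar(words, counts)
        ax.set_title(label)
        ax.tick_params(axis='x', rotation=45)
    fig.tight_layout()
    return fig
```

A call such as `plot_top_words({'Song A': lyrics_a, 'Song B': lyrics_b})` would then produce the two charts, where `lyrics_a` and `lyrics_b` are lyric strings you scraped.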

In [None]:
#Use this block for your code!

## Level Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.
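
To make the trade-off concrete, the sketch below stores the same song two ways: verbatim text and derived word counts. The record fields and the `word_counts` helper are made up for illustration. Verbatim text can always be turned into counts later, while counts cannot be turned back into the original lines, so line-level or ordering analyses are only possible with the first form.

```python
import json
from collections import Counter

# Option 1: keep the lyrics verbatim (placeholder text, not real lyrics)
song = {'title': 'Example Song', 'album': 'Example Album', 'lyrics': "la la la\nhey la"}

# Option 2: keep only word frequencies derived from the text
def word_counts(lyrics):
    """Count whitespace-separated, lowercased words."""
    return Counter(lyrics.lower().split())

counts = word_counts(song['lyrics'])

# Both forms serialize to JSON; the counts form drops word order and line breaks
verbatim_json = json.dumps(song)
counts_json = json.dumps({'title': song['title'], 'counts': counts})
print(verbatim_json)
print(counts_json)
```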

In [None]:
#Use this block for your code!

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!