# ADS 509 Module 1: APIs and Web Scraping

This notebook has two parts. In the first part, you will scrape lyrics from AZLyrics.com. In the second part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who each have at least 20 songs with lyrics on AZLyrics.com. We start by pulling their lyrics and then summarize what was collected.


# Importing Libraries

In [1]:
import os
import datetime
import re

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random

In [None]:
# Use this cell for any import statements you add



---

# Lyrics Scrape

This section asks you to pull data by scraping www.AZLyrics.com. The cells below ask you to store the data in specific ways. 

In [2]:
artists = {'one direction':"https://www.azlyrics.com/o/onedirection.html",
           'sabrina carpenter':"https://www.azlyrics.com/s/sabrinacarpenter.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not publish an explicit limit on the number of requests in a given period, but in our testing too many requests in too short a time caused the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to contain only the title of [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This pauses for between 5 and 15 seconds after each request and helps you avoid being blocked. If you _do_ get blocked, which you can tell because the returned pages are not lyrics pages, request a lyrics page through your browser. You'll be asked to complete a CAPTCHA, after which your requests should start working again. 
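
As a minimal sketch of that pattern, the delay can be wrapped in a small helper. The names `polite_delay`, `base`, and `jitter` are illustrative and not part of the assignment; the helper only computes the wait time, so it can be tried without touching the network.

```python
import random


def polite_delay(base=5.0, jitter=10.0):
    """Return a wait time in seconds, matching 5 + 10*random.random()."""
    # random.random() is in [0, 1), so the result is in [base, base + jitter).
    return base + jitter * random.random()


# Draw a batch of delays to see the spread.
delays = [polite_delay() for _ in range(1000)]
print(f"shortest: {min(delays):.1f}s, longest: {max(delays):.1f}s")
```

In the scraping loops, calling `time.sleep(polite_delay())` on the line after `requests.get` would behave the same as the inline expression.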

## Part 1: Finding Links to Song Lyrics

Each artist's general page lists all songs for that artist, with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: The `robots.txt` file prohibits all bots from accessing or scraping the `/lyricsdb/` and `/song/` directories. It also names one specific user agent that is not allowed to scrape any part of the website. The scraping we are about to perform is therefore allowed: the artist pages and the song pages under `/lyrics/` are not under any path that is disallowed for general bots. 
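
The same check can be sketched in code with the standard library's `urllib.robotparser`. The rules below are an illustrative sample modeled on the answer above, not a copy of the live file, and the `/song/` URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only, modeled on the answer above. They are NOT a copy
# of the live robots.txt, which should be checked directly.
sample_rules = """
User-agent: *
Disallow: /lyricsdb/
Disallow: /song/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

artist_page = "https://www.azlyrics.com/o/onedirection.html"
lyrics_page = "https://www.azlyrics.com/lyrics/onedirection/onething.html"
blocked_page = "https://www.azlyrics.com/song/example.html"  # hypothetical URL

# can_fetch reports whether the given user agent may request the URL.
for url in (artist_page, lyrics_page, blocked_page):
    print(url, "->", parser.can_fetch("*", url))
```

To test against the real file instead of the sample, `parser.set_url(...)` followed by `parser.read()` would download and parse it; that costs one request, so the usual sleep still applies.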


In [3]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    time.sleep(5 + 10*random.random())

    # use BeautifulSoup to parse the page
    soup = BeautifulSoup(r.content, 'html.parser')

    # now extract the links to lyrics pages from this page
    for link in soup.find_all('a', href=True):
        if '/lyrics/' in link['href']:  # lyrics links path that contain "/lyrics/"
            full_link = requests.compat.urljoin(artist_page, link['href'])  # construct full URL
            lyrics_pages[artist].append(full_link)  # store the links where the key is the artist and value is a list of links
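
To illustrate how the link handling in this cell behaves, here is a small offline sketch. The `href` values are hypothetical stand-ins for what the artist page might contain. On Python 3, `requests.compat.urljoin` is the standard library's `urllib.parse.urljoin`, which is used directly here. The de-duplication step is an optional addition, not something the cell above does.

```python
from urllib.parse import urljoin

artist_page = "https://www.azlyrics.com/o/onedirection.html"

# Hypothetical href values, as they might appear in <a> tags on the page.
hrefs = [
    "/lyrics/onedirection/onething.html",
    "../lyrics/onedirection/taken.html",
    "/lyrics/onedirection/onething.html",  # duplicate link
    "/o/onedirection.html",                # not a lyrics page
]

# Keep only lyrics links and resolve them against the artist page URL.
full_links = [urljoin(artist_page, h) for h in hrefs if "/lyrics/" in h]

# dict.fromkeys keeps the first occurrence of each link, in order.
unique_links = list(dict.fromkeys(full_links))
print(unique_links)
```

Dropping duplicates before scraping would avoid requesting the same page twice, which matters when every request costs several seconds of sleep.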

Let's make sure we have enough lyrics pages to scrape. 

In [4]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

In [6]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds (10 seconds on average)
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)} links.")
    print(f"The full pull for this artist will take about {round(len(links)*10/3600,2)} hours.")

For one direction we have 112 links.
The full pull for this artist will take about 0.31 hours.
For sabrina carpenter we have 115 links.
The full pull for this artist will take about 0.32 hours.
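
The estimate uses 10 seconds per request because `random.random()` averages 0.5, so the mean sleep is 5 + 10 × 0.5 = 10 seconds. As a rough sketch, the same arithmetic can also give lower and upper bounds. The helper name `estimate_hours` is illustrative, and the link counts are copied from the output above.

```python
def estimate_hours(n_links, seconds_per_request):
    """Convert a link count and per-request wait into hours."""
    return n_links * seconds_per_request / 3600


link_counts = {"one direction": 112, "sabrina carpenter": 115}

for artist, n in link_counts.items():
    low = estimate_hours(n, 5)    # every sleep at its minimum
    mean = estimate_hours(n, 10)  # average sleep: 5 + 10 * 0.5
    high = estimate_hours(n, 15)  # every sleep near its maximum
    print(f"{artist}: {low:.2f} to {high:.2f} hours, about {mean:.2f} expected")
```

These figures count sleep time only; the time spent on the requests themselves and on parsing comes on top, so the actual run is likely to be somewhat longer.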


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 
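
As a quick illustration of the last two steps, this sketch shows where one song would be saved. The helper is repeated from the next cell so the snippet runs on its own, and the page URL is an assumed example that follows the site's `/lyrics/<artist>/<song>.html` pattern.

```python
import os


def generate_filename_from_url(url):
    # Same logic as the notebook's helper: take the last path segment,
    # drop ".html", swap hyphens for underscores, and add ".txt".
    return url.split('/')[-1].replace('.html', '').replace('-', '_') + '.txt'


artist = "one direction"
# Assumed example URL, not taken from an actual scrape.
page_url = "https://www.azlyrics.com/lyrics/onedirection/onething.html"

# Folder names are lower-cased with spaces replaced by underscores.
artist_folder = os.path.join("lyrics", artist.lower().replace(" ", "_"))
filepath = os.path.join(artist_folder, generate_filename_from_url(page_url))
print(filepath)
```

The path separator in the printed result depends on the operating system, which is why the saved paths further down show backslashes.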


In [15]:
start = time.time()

# Function to generate a valid filename from a URL
def generate_filename_from_url(url):
    return url.split('/')[-1].replace('.html', '').replace('-', '_') + '.txt'

# Create main 'lyrics' directory (no error if it already exists)
os.makedirs('lyrics', exist_ok=True)

# Iterate over the artists
for artist, pages in lyrics_pages.items():
    
    # Create a subfolder in 'lyrics' with the artist's name
    artist_folder = os.path.join('lyrics', artist.lower().replace(' ', '_'))
    os.makedirs(artist_folder, exist_ok=True)
    
    # Iterate over the lyrics pages for the artist
    for page_url in pages:
        
        # Request the page and sleep to avoid overwhelming the server
        r = requests.get(page_url)
        time.sleep(5 + 10 * random.random())
        
        # Parse the HTML page with BeautifulSoup
        soup = BeautifulSoup(r.content, 'html.parser')
        
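        # Extract lyrics from the HTML - use inspect element to identify the tag or class that holds the lyrics
        lyrics_div = soup.find('div', class_='col-xs-12 col-lg-8 text-center')  # Adjust class name based on website structure
        if lyrics_div is None:
            # A blocked or unexpected page has no lyrics container; skip it rather than crash
            print(f"No lyrics container found at {page_url}; skipping.")
            continue
        lyrics = lyrics_div.get_text(separator='\n').strip()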

        # Generate the filename based on the URL of the lyrics page
        filename = generate_filename_from_url(page_url)
        
        # Write the lyrics to a text file in the artist's subfolder
        filepath = os.path.join(artist_folder, filename)
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(lyrics)
            
        print(f"Saved lyrics for {artist} to {filepath}")

Saved lyrics for one direction to lyrics\one_direction\whatmakesyoubeautiful.txt
Saved lyrics for one direction to lyrics\one_direction\gottabeyou.txt
Saved lyrics for one direction to lyrics\one_direction\onething.txt
Saved lyrics for one direction to lyrics\one_direction\morethanthis.txt
Saved lyrics for one direction to lyrics\one_direction\upallnight.txt
Saved lyrics for one direction to lyrics\one_direction\iwish.txt
Saved lyrics for one direction to lyrics\one_direction\tellmealie.txt
Saved lyrics for one direction to lyrics\one_direction\taken.txt
Saved lyrics for one direction to lyrics\one_direction\iwant.txt
Saved lyrics for one direction to lyrics\one_direction\everythingaboutyou.txt
Saved lyrics for one direction to lyrics\one_direction\samemistakes.txt
Saved lyrics for one direction to lyrics\one_direction\saveyoutonight.txt
Saved lyrics for one direction to lyrics\one_direction\stolemyheart.txt
Saved lyrics for one direction to lyrics\one_direction\standup.txt
Saved lyric

In [17]:
# Check the contents of saved file to ensure correct class is pulled 
# Function to view the contents of a saved file
def view_lyrics(artist, filename):
    artist_folder = os.path.join('lyrics', artist.lower().replace(' ', '_'))
    filepath = os.path.join(artist_folder, filename)
    
    if os.path.exists(filepath):
        with open(filepath, 'r', encoding='utf-8') as f:
            contents = f.read()
            print(contents)  # Print the file's contents to the console
    else:
        print(f"File {filepath} not found.")

# Example usage: View the lyrics for a particular artist and file
view_lyrics('one_direction', 'onething.txt')  # Replace with actual artist and file

"One Thing" lyrics




One Direction Lyrics










"One Thing"








[Liam:]

I've tried playing it cool

But when I'm looking at you

I can't ever be brave

'Cause you make my heart race




[Harry:]

Shot me out of the sky

You're my kryptonite

You keep making me weak

Yeah, frozen and can't breathe




[Zayn:]

Something's gotta give now

'Cause I'm dying just to make you see

That I need you here with me now

'Cause you've got that one thing




[Chorus:]

So get out, get out, get out of my head

And fall into my arms instead

I don't, I don't, don't know what it is

But I need that one thing

And you've got that one thing




[Niall:]

Now I'm climbing the walls

But you don't notice at all

That I'm going out of my mind

All day and all night




[Louis:]

Something's gotta give now

'Cause I'm dying just to know your name

And I need you here with me now

'Cause you've got that one thing




[Chorus:]

So get out, get out, get out of my head

And fall into my arms instead


In [18]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 0.74 hours.


---

# Evaluation

This assignment asks you to pull data by scraping www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [19]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())
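
To see what this extractor counts as a word, here is a small self-contained sketch on a made-up sentence. The function is repeated so the snippet runs on its own. Because `\w+` does not match apostrophes, contractions are split into two tokens, which is one reason the word counts reported below are described as rough.

```python
import re
from collections import Counter


def words(text):
    return re.findall(r'\w+', text.lower())


sample = "Don't stop, don't STOP!"  # made-up text, not from the scrape
tokens = words(sample)
print(tokens)

# Counter tallies how often each token appears.
counts = Counter(tokens)
print(len(tokens), "tokens,", len(counts), "unique")
```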

## Checking Lyrics 

The output from your lyrics scrape should be stored in files at this path, relative to the repo directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [20]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name, encoding="utf-8") as infile :  # files were written as UTF-8
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For one_direction we have 112 files.
For one_direction we have roughly 78680 words, 4192 are unique.
For sabrina_carpenter we have 115 files.
For sabrina_carpenter we have roughly 78616 words, 4768 are unique.
