URL of Billboard Website: https://www.billboard.com/charts/decade-end/hot-100/

This notebook scrapes the Billboard decade-end Hot 100 chart, which lists the top 100 songs of the 2010s. We collect three fields: song_name, artist, and rank. Each field is first scraped into its own list (a song list, an artist list, and a rank list). The lists are then combined into a dictionary, and the dictionary is converted to a dataframe. The dataframe is saved as billboard_raw.csv, which holds the original data as it appeared on the website. We then merge it, using an inner join, with a Kaggle dataset of top Spotify songs from the same decade, which also includes features of each song. The merged data lets us examine which features of a song are associated with it becoming a highly renowned piece of music.
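
The inner join described above can be sketched as follows. This is a minimal illustration, not the merge used in the analysis workbook: the demo rows, the Kaggle column names ("title", "bpm"), and the helper normalize are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for billboard_raw.csv and the Kaggle Spotify dataset.
# The rows and the Kaggle column names ("title", "bpm") are assumptions.
billboard_demo = pd.DataFrame({
    "song_name": ["Song A", "Song B", "Song C"],
    "artist": ["Artist 1", "Artist 2", "Artist 3"],
    "rank": [1, 2, 3],
})
spotify_demo = pd.DataFrame({
    "title": ["song a ", "Song C", "Song D"],
    "bpm": [120, 95, 110],
})

def normalize(titles: pd.Series) -> pd.Series:
    """Strip whitespace and lower-case titles so small formatting differences still match."""
    return titles.str.strip().str.lower()

billboard_demo["key"] = normalize(billboard_demo["song_name"])
spotify_demo["key"] = normalize(spotify_demo["title"])

# An inner join keeps only the songs that appear in both tables
merged_demo = billboard_demo.merge(spotify_demo, on="key", how="inner")
print(merged_demo[["song_name", "artist", "rank", "bpm"]])
```

Normalizing the key before merging is a design choice for the sketch: titles from two sources rarely match character for character, and an inner join silently drops every row that fails to match.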

Unfortunately, a few days before this project was due, this specific page on Billboard began requiring a $200 subscription. Therefore, with the permission of Professor Colbert, we entered our data manually in an Excel spreadsheet and did all of the analysis in the workbook 01_analysisofproject. However, this code worked as written every day prior to December 13th, 2024.


In [2]:
# All necessary libraries
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By # used to import different ways to access data in the XML or HTML file
from selenium.webdriver.chrome.service import Service # no longer need to download a driver file, use service
from webdriver_manager.chrome import ChromeDriverManager # used to manage the Chrome driver to emulate a Chrome web browser
import time
import random

In [3]:
# function to scroll from the top to the bottom of the web page
def random_scroll(browser, total_wait_time):
    # get the total height of the page
    total_height = browser.execute_script("return document.body.scrollHeight")
    
    # number of steps to scroll (you can adjust this number)
    scroll_steps = random.randint(3, 10) # randomize how many scroll steps we will use
    
    # calculate the height to scroll on each step
    scroll_increment = total_height // scroll_steps

    # calculate the total time available for scrolling each step
    time_per_step = total_wait_time / scroll_steps
    
    # random scrolling across time
    for step in range(scroll_steps):
        # scroll by the increment (dividing total height by number of steps)
        browser.execute_script(f"window.scrollBy(0, {scroll_increment});")
        
        # random wait time between scrolls to simulate varying speed
        random_wait = random.uniform(0.5 * time_per_step, 1.5 * time_per_step)  # randomize the wait within a range
        time.sleep(random_wait)
        
    # final scroll to make sure you are at the very bottom (in case it didn't exactly match)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")


# initialize the Selenium web driver (Chrome in this case)
# browser = webdriver.Chrome(service=Service(ChromeDriverManager().install())) # this occasionally causes "Status code was: -9" error.
browser = webdriver.Chrome()

song = []
rank = []
artist = []

# edit loop range to (1, 2) after testing
for i in range(1, 2): # first value is inclusive, second value is not inclusive
    url = "https://www.billboard.com/charts/decade-end/hot-100/"

    # navigate to the web page using the URL
    browser.get(url)
    browser.maximize_window()

    # add a random delay before scraping
    total_wait_time = random.uniform(2, 20)  # random wait time between 2 and 20 seconds
    random_scroll(browser, total_wait_time)

    # scrape the individual elements
    # scrape the song
    song_name = browser.find_elements(By.ID, "title-of-a-story")

    for song_element in song_name:     # append song text from this page to the song list
        song.append(song_element.text)

    # scrape the artist
    artist_name = browser.find_elements(By.CSS_SELECTOR, "span.c-label.a-font-primary-s")

    for artist_element in artist_name:     # append artist from this page to the artist list
        artist.append(artist_element.text)

    # scrape the rank
    rank_number = browser.find_elements(By.CSS_SELECTOR, "span.c-label.a-font-primary-bold-l")

    for rank_element in rank_number:     # append rank from this page to the rank list
        rank.append(rank_element.text)
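
Selenium's .text property returns only the text that is rendered and visible, so elements that exist in the page but are hidden come back as empty strings. Whether that explains the empty strings in the outputs below is not known (the subscription wall may be the cause), but a fallback to the textContent attribute is one thing to try. The helper element_text and the FakeElement class are hypothetical names used only for this sketch; FakeElement stands in for a real WebElement so the example runs without a browser.

```python
def element_text(element):
    """Return the visible text, falling back to the raw textContent attribute."""
    visible = element.text.strip()
    if visible:
        return visible
    raw = element.get_attribute("textContent")
    return raw.strip() if raw else ""


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, used only for this demo."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


hidden = FakeElement("", "  Song A \n")   # present in the DOM but not rendered
shown = FakeElement("Song B", "Song B")   # rendered normally
print(element_text(hidden), "|", element_text(shown))
```

In the scraping loop, song.append(element_text(song_element)) would replace the direct use of .text.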




In [4]:
# Print the song list
print(song)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', "Oscar Predictions via Feinberg Forecast: Scott's Updated Picks Post AFI, L.A. Film Critics and Golden Globes Announcements", 'Look Mom!: When Photographers and Their Parents Collaborate', 'Jamie Foxx Directly Addresses Rumor That Diddy "Tried To Kill" Him', 'Mets’ Steve Cohen Sells Brooklyn, Syracuse Clubs in Wake of Soto Deal', '‘Malcolm in the Middle’ Revived at Disney+', 'Papoose Says He’s Requested Divorce “Numerous Times,” Remy Ma Exposes His “GF” Claressa Shields', '‘Real Housewives’ Star Sutton Stracke Embraces Viral ‘Name ‘Em’ Moment With Sustainable Collection and Talks New ‘Audienc

In [5]:
# Print length of song list to make sure it is 100
len(song)

113

In [None]:
# Get rid of unnecessary information in the last 13 items in the list
song = song[:-13]
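
The hard-coded 13 only holds while the page carries exactly 13 extra headlines. A sketch of an alternative is to keep the first 100 items instead, assuming (as in the output above) that the chart entries come first and the unrelated headlines follow. The name trim_to_chart and the constant EXPECTED_ROWS are hypothetical.

```python
EXPECTED_ROWS = 100  # the decade-end Hot 100 has 100 entries


def trim_to_chart(items, expected=EXPECTED_ROWS):
    """Return the first `expected` items, raising if the scrape came up short."""
    if len(items) < expected:
        raise ValueError(f"expected at least {expected} items, got {len(items)}")
    return items[:expected]


# Stand-in list shaped like the output above: 100 chart slots plus 13 extras
demo = [""] * 100 + ["unrelated headline"] * 13
trimmed = trim_to_chart(demo)
print(len(trimmed))  # 100
```

Raising on a short list is deliberate: a scrape that returns fewer than 100 items should stop the notebook rather than produce a misaligned dataframe.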

In [6]:
# Print artist list
print(artist)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [7]:
# Print length of artist list to make sure it is 100
len(artist)

100

In [8]:
# Print rank list
print(rank)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [11]:
# Print length of rank list to make sure it is 100
len(rank)

100
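
The three length checks above can be folded into one step that also counts blank entries, since a list of 100 empty strings passes a length check but carries no data. This is a minimal sketch; summarize_scrape is a hypothetical helper and the demo values are placeholders.

```python
def summarize_scrape(**columns):
    """Return {name: (length, blank_count)} and raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return {
        name: (len(values), sum(1 for value in values if not value.strip()))
        for name, values in columns.items()
    }


demo_report = summarize_scrape(
    song_name=["", "Song A"],
    artist=["", "Artist 1"],
    rank=["", "1"],
)
print(demo_report)
```

With the real lists, the call would be summarize_scrape(song_name=song, artist=artist, rank=rank); a blank count equal to the length signals that the page returned no usable text.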

In [12]:
# Create a dictionary from the three lists
billboard = {
    "song_name": song,
    "artist": artist,
    "rank": rank
}

In [13]:
# Make the dictionary a dataframe
billboard = pd.DataFrame(billboard)

In [14]:
# Display dataframe
display(billboard)

Unnamed: 0,song_name,artist,rank
0,,,
1,,,
2,,,
3,,,
4,,,
...,...,...,...
95,,,
96,,,
97,,,
98,,,


In [None]:
# Save this as the raw csv file
billboard.to_csv("billboard_raw.csv", encoding='utf-8')
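
Because to_csv writes the dataframe index by default, reading the file back without further options produces an extra "Unnamed: 0" column. The sketch below shows the round trip on a one-row placeholder frame, using an in-memory buffer in place of billboard_raw.csv; the demo values are assumptions.

```python
import io

import pandas as pd

demo = pd.DataFrame({"song_name": ["Song A"], "artist": ["Artist 1"], "rank": [1]})

buffer = io.StringIO()
demo.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)            # index comes back as "Unnamed: 0"

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)    # treat the first column as the index

print(list(with_extra.columns))
print(list(clean.columns))
```

Passing index_col=0 when reading keeps the saved file unchanged; passing index=False to to_csv is the other common option if the index is not needed in the file.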