In [3]:
import billboard
import time
import pandas as pd
import pickle

## Scraper

To obtain song lyrics, we first need a large set of song names. We take them from the Billboard Hot 100 chart, which lists the 100 most popular songs in a given week.

The next step is to scrape the lyrics. Since MusixMatch and the other APIs we found return only a portion of each song's lyrics on the free tier, we scrape azlyrics.com directly instead.

### Song titles

We use the `billboard.py` package (imported as `billboard`) to grab the most popular songs of the last few decades. You can install it using `pip`:

`$ pip install billboard.py`

The goal is to scrape about 10,000 songs, which leaves room for error in reaching a satisfactory number of training examples: neither this package nor the lyrics scraper is a "true API" (both simply make HTTP requests), so we can't be sure that every lookup will succeed. The code below scrapes the song names. Our heuristic is to step back four months at a time, balancing two aims: collecting as many new song names as possible per request, and still capturing most of the unique songs that charted within each year.
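
The four-month stepping can be sketched on its own, separate from any network calls. This is only an illustration: `chart_dates` is a hypothetical helper that does not appear in the scraper, and it assumes the starting date of 2019-11-20 used as the default later on.

```python
from itertools import islice


def chart_dates(year=2019, month=11, day=20, step_months=4):
    """Yield 'YYYY-MM-DD' strings, stepping back `step_months` at a time."""
    while True:
        yield f"{year:04d}-{month:02d}-{day:02d}"
        month -= step_months
        if month < 1:
            # Roll over into the previous year.
            month += 12
            year -= 1


# The first four dates the scraper would request.
first_dates = list(islice(chart_dates(), 4))
print(first_dates)
```

With a step of four months, each year contributes three charts (November, July, March).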

We first define a helper that saves the current dictionary of unique songs, in case of errors or crashes:

In [5]:
def save_dict(songs_dict):
    # The context manager closes the file even if pickling fails.
    with open("songs_dict.pkl", 'wb') as f:
        pickle.dump(songs_dict, f)
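
Saving is only half of crash recovery; to resume, the pickle has to be read back. A minimal sketch of a matching loader follows. `load_dict` is a hypothetical helper, not part of the original notebook, and it assumes the same `songs_dict.pkl` file name as above.

```python
import os
import pickle


def load_dict(path="songs_dict.pkl"):
    """Return the saved songs dictionary, or an empty dict if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)
```

After a crash, the scraper could start from `load_dict()` rather than from an empty dictionary.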

Next, we have code to save the dictionary as a CSV file, which is later processed and used to scrape lyrics:

In [6]:
def songs_to_csv(songs_dict):
    songs = songs_dict.keys()
    songs1 = [song for song in songs]
    artists = [songs_dict[song][0] for song in songs]
    weeks = [songs_dict[song][1] for song in songs]
    poss = [songs_dict[song][2] for song in songs]

    songs_df = pd.DataFrame({'songs' : songs1, 'artists' : artists, 'weeks' : weeks, 'peak position' : poss})
    songs_df.to_csv("songs_list.csv", index=False)

Finally, we have the main function, whose loop gathers songs from past Billboard Hot 100 charts. A try/except block handles HTTP errors caused by too many requests:

In [None]:
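def get_billboard_100(iterations, year=2019, month='11', day=20, dict_file='songs_dict.pkl'):
    # Note: `iterations` and `dict_file` are not used yet. The loop runs until
    # 10,000 unique titles are collected, and save_dict() writes to a fixed path.
    songs = dict()

    temp_year = year
    temp_month = int(month)
    date = f"{temp_year}-{temp_month:02d}-{int(day):02d}"

    # Start from the chart for the requested date.
    chart = billboard.ChartData('hot-100', date)

    while len(songs) < 10000:
        try:
            for song in chart:
                # The dictionary is keyed by title, so check the title itself.
                if song.title not in songs:
                    songs[song.title] = (song.artist, song.weeks, song.peakPos)

            save_dict(songs)
            time.sleep(4)

            # Step back four months, rolling over into the previous year.
            next_year, next_month = temp_year, temp_month - 4
            if next_month < 1:
                next_month += 12
                next_year -= 1

            next_date = f"{next_year}-{next_month:02d}-{int(day):02d}"
            chart = billboard.ChartData('hot-100', next_date)

            # Commit the new date only after the request succeeded, so a
            # failed request is retried instead of skipped.
            temp_year, temp_month, date = next_year, next_month, next_date

        except Exception:
            # Most likely an HTTP error from too many requests: back off and retry.
            print("Waiting...")
            time.sleep(60)
            print("Finished waiting.")

    return songs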
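
One caveat with keying the dictionary by title alone is that two different songs sharing a title collapse into a single entry, which costs some recall. A minimal sketch of a variation keyed by (title, artist) follows. The `Entry` namedtuple and the `add_entries` helper are hypothetical stand-ins for illustration, not part of the `billboard` package, and the sample rows are made up; `songs_to_csv` would also need adapting to this layout.

```python
from collections import namedtuple

# Stand-in for a chart entry, with only the attributes the scraper reads.
Entry = namedtuple("Entry", ["title", "artist", "weeks", "peakPos"])


def add_entries(songs, chart):
    """Add unseen entries to `songs`, keyed by (title, artist)."""
    for entry in chart:
        key = (entry.title, entry.artist)
        if key not in songs:
            songs[key] = (entry.weeks, entry.peakPos)
    return songs


# Two made-up songs share a title but have different artists.
sample_chart = [
    Entry("Same Title", "Artist A", 10, 1),
    Entry("Same Title", "Artist B", 4, 37),
    Entry("Same Title", "Artist A", 11, 1),  # repeat of the first song
]
sample_songs = add_entries({}, sample_chart)
print(len(sample_songs))
```

Both songs are kept here, while the repeated entry is still skipped.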

Running all the functions gives us the finished CSV.