## There will be three modules here
- Module 1: scrape the UK Year-End Singles Top 100 chart and turn it into a dataframe (Artist, Song)
- Module 2: scrape the Billboard Year-End Singles Top 100 chart and turn it into a dataframe (Artist, Song)
- Module 3: use that information to get the lyrics to each song with the LyricsGenius API (warning: takes a very long time)

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import pickle

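import os
import lyricsgenius
# Read the API token from the environment rather than hardcoding it
# (the variable name GENIUS_ACCESS_TOKEN is an assumption; set it before running)
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])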

### Module 1: UK Top 100 song info into dataframe

In [2]:
def ukchart_to_df(url):
    '''
    Scrapes a webpage that has year-end top 100 singles of the uk charts, turns into df

    Input: str. A url of 'uk-charts.top-source.info' page that has the year-end 100 singles chart
    Output: df. Dataframe of 100 songs (artist, song)
    '''
    temp = requests.get(url)
    soup = BeautifulSoup(temp.text)
    
    artists = []
    songs = []
    table = soup.find("table").find("tbody").find_all("tr")
    for r in range(100):
        artist = table[r].find_all("td")[1].get_text()
        song = table[r].find_all("td")[2].get_text()
        artists.append(artist)
        songs.append(song)

    tuples_data = list(zip(artists, songs))
    return pd.DataFrame(tuples_data, columns=['Artist','Song'])

In [3]:
# testing above function on a single year (i.e. single page)
ukchart_to_df('http://www.uk-charts.top-source.info/top-100-2019.shtml')

Unnamed: 0,Artist,Song
0,Lewis Capaldi,Someone You Loved
1,Lil Nas X,Old Town Road
2,Billie Eilish,bad guy
3,Calvin Harris & Rag'n'Bone Man,Giant
4,AJ Tracey,Ladbroke Grove
...,...,...
95,Taylor Swift Ft Brendon Urie,ME!
96,Jax Jones & Bebe Rexha,Harder
97,Stormzy,Crown
98,Lauv Ft Anne-Marie,"fuck, i'm lonely"


In [4]:
# Use ukchart_to_df to get songs not just from one year, but from 1990 - 2019
# (DataFrame.append was removed in pandas 2.0, so collect the frames and concat once)
frames = []
for yr in range(1990, 2020):
    url = 'http://www.uk-charts.top-source.info/top-100-' + str(yr) + '.shtml'
    frames.append(ukchart_to_df(url))
uk_pops = pd.concat(frames, ignore_index=True)

In [5]:
uk_pops.shape

(3000, 2)

In [6]:
# We now have 100 * 30 yrs = 3000 songs. Some songs appear in multiple years, so drop those.
uk_pops = uk_pops.drop_duplicates(ignore_index=True)
uk_pops.shape

(2798, 2)
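
The de-duplication step above can be illustrated on a toy example. This is a minimal sketch with made-up rows (`year_a`, `year_b` and their contents are placeholders, not chart data): a row counts as a duplicate only when both Artist and Song match an earlier row.

```python
import pandas as pd

# Two made-up "years" that share one (Artist, Song) pair
year_a = pd.DataFrame({'Artist': ['Artist A', 'Artist B'],
                       'Song': ['Song 1', 'Song 2']})
year_b = pd.DataFrame({'Artist': ['Artist B', 'Artist C'],
                       'Song': ['Song 2', 'Song 3']})

combined = pd.concat([year_a, year_b], ignore_index=True)
# Keep the first occurrence of each (Artist, Song) row and renumber the index
deduped = combined.drop_duplicates(ignore_index=True)

print(combined.shape, deduped.shape)  # (4, 2) (3, 2)
```

Because the comparison is exact, the same song listed with a differently spelled artist credit would not be treated as a duplicate.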

### Module 2: Billboard Top 100 song info into dataframe

In [2]:
# similar workflow to Module 1
def uschart_to_df(url):
    '''
    Scrapes a Wikipedia page that has the Billboard Year-End Hot 100 singles, turns into df

    Input: str, url of an 'en.wikipedia.org' Billboard_Year-End_Hot_100_singles_of_<year> page
    Output: df, dataframe of 100 songs (artist, song)
    '''
    temp = requests.get(url)
    soup = BeautifulSoup(temp.text)
    
    artists = []
    songs = []
    table = soup.find("table", {'class': "wikitable sortable"}).find("tbody").find_all("tr")
    for r in range(1,101):
        song_raw = table[r].find_all("td")[1].get_text()
        artist_raw = table[r].find_all("td")[2].get_text()
        # drop the trailing line break from the artist and the surrounding quotation marks from the song
        artist = artist_raw[:-1]
        song = song_raw[1:-1]
        artists.append(artist)
        songs.append(song)

    tuples_data = list(zip(artists, songs))
    return pd.DataFrame(tuples_data, columns=['Artist','Song'])
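
The fixed slicing in `uschart_to_df` assumes every song cell is exactly quote + title + quote and every artist cell ends in exactly one line break. A slightly more defensive variant might look like the sketch below; `clean_cell` is a hypothetical helper, not part of the notebook, and it only removes quotes when they actually wrap the whole text.

```python
def clean_cell(text):
    """Trim surrounding whitespace/line breaks, then one pair of wrapping double quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return text


print(clean_cell('"Hold On"'))          # Hold On
print(clean_cell('Wilson Phillips\n'))  # Wilson Phillips
```

With a helper like this, the same call could be applied to both the song and the artist cell, since unquoted text passes through unchanged.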

In [3]:
uschart_to_df('https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1990')

Unnamed: 0,Artist,Song
0,Wilson Phillips,Hold On
1,Roxette,It Must Have Been Love
2,Sinéad O'Connor,Nothing Compares 2 U
3,Bell Biv DeVoe,Poison
4,Madonna,Vogue
...,...,...
95,Mötley Crüe,Without You
96,Jive Bunny and the Mastermixers,Swing the Mood
97,Prince,Thieves in the Temple
98,Mellow Man Ace,Mentirosa


In [4]:
# get all songs from 1990 - 2019 and concat
# (DataFrame.append was removed in pandas 2.0, so collect the frames and concat once)
frames = []
for yr in range(1990, 2020):
    url = 'https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_' + str(yr)
    frames.append(uschart_to_df(url))
us_pops = pd.concat(frames, ignore_index=True)

In [5]:
us_pops.shape

(3000, 2)

In [6]:
# We now have 100 * 30 yrs = 3000 songs. Some songs appear in multiple years, so drop those.
us_pops = us_pops.drop_duplicates(ignore_index=True)
us_pops.shape

(2755, 2)

### Module 3: Get lyrics from Genius API

In [7]:
# This cell configures the LyricsGenius client so that lyrics come back in a cleaner format

# Turn off status messages
genius.verbose = False
# Remove section headers (e.g. [Chorus]) from lyrics when searching
genius.remove_section_headers = True
# Exclude songs with these words in their title
genius.excluded_terms = ["(Remix)", "(Live)"]

In [8]:
# Let's try the API on a single song to see what it returns
song = genius.search_song("Vogue", "Madonna")
print(song.lyrics)

Strike a pose
Strike a pose
Vogue (vogue, vogue)
Vogue (vogue, vogue)

Look around, everywhere you turn is heartache
It's everywhere that you go (look around)
You try everything you can to escape
The pain of life that you know (life that you know)
When all else fails and you long to be
Something better than you are today
I know a place where you can get away
It's called a dance floor, and here's what it's for, so

Come on, vogue
Let your body move to the music (move to the music)
Hey, hey, hey
Come on, vogue
Let your body go with the flow (go with the flow)
You know you can do it

All you need is your own imagination
So use it that's what it's for (that's what it's for)
Go inside, for your finest inspiration
Your dreams will open the door (open up the door)
It makes no difference if you're black or white
If you're a boy or a girl
If the music's pumping it will give you life
You're a superstar, yes, that's what you are, you know it

Come on, vogue
Let your body groove to the music (groo

#### Looks pretty good. Next, add a 'Lyrics' column (raw lyrics) to both the us_pops and uk_pops dataframes, and save both as pickled files.
#### Running this section in its entirety takes many hours.

In [None]:
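# for the us_pops portion
# create the column up front so every row has a value, even if the first lookup fails
if 'Lyrics' not in us_pops.columns:
    us_pops['Lyrics'] = None

for i in range(len(us_pops)):
    song = genius.search_song(us_pops.iloc[i, 1], us_pops.iloc[i, 0])

    if song is not None:
        us_pops.at[i, 'Lyrics'] = song.lyrics
    else:
        # search_song returned None, meaning the artist/song combination has not been found
        print('passed ' + str(i))

    # progress printer: row number, and whether lyrics are still missing for that row
    if i % 10 == 0:
        print(i)
        print(pd.isna(us_pops.at[i, 'Lyrics']))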

In [27]:
# the context manager closes the file even if pickling fails
with open('us_pops_raw', 'wb') as file:
    pickle.dump(us_pops, file)
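
Since a full run takes hours, it can help to save progress along the way and to skip rows that already have lyrics when restarting. The following is a minimal sketch under assumptions: `fill_lyrics`, `fetch`, `checkpoint_path` and `every` are hypothetical names, and `fetch(title, artist)` stands in for a lookup that returns the lyrics text, or `None` when nothing is found.

```python
import pickle

import pandas as pd


def fill_lyrics(df, fetch, checkpoint_path, every=50):
    """Fill df['Lyrics'] row by row, skipping rows that already have lyrics."""
    if 'Lyrics' not in df.columns:
        df['Lyrics'] = None
    for i in range(len(df)):
        if pd.notna(df.at[i, 'Lyrics']):
            continue  # fetched in an earlier run
        df.at[i, 'Lyrics'] = fetch(df.at[i, 'Song'], df.at[i, 'Artist'])
        if (i + 1) % every == 0:
            # periodic checkpoint so an interrupted run loses little work
            with open(checkpoint_path, 'wb') as f:
                pickle.dump(df, f)
    # final save once every row has been visited
    with open(checkpoint_path, 'wb') as f:
        pickle.dump(df, f)
    return df
```

The sketch assumes a default 0..n-1 index, as produced by `drop_duplicates(ignore_index=True)`. Rows where nothing was found stay empty, so they are retried on the next run.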


In [None]:
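# for the uk_pops portion (note: this run starts at row 2000)
# keep any lyrics already present; only create the column if it is missing
if 'Lyrics' not in uk_pops.columns:
    uk_pops['Lyrics'] = None

for i in range(2000, len(uk_pops)):
    song = genius.search_song(uk_pops.iloc[i, 1], uk_pops.iloc[i, 0])

    if song is not None:
        uk_pops.at[i, 'Lyrics'] = song.lyrics
    else:
        # search_song returned None, meaning the artist/song combination has not been found
        print('passed ' + str(i))

    # progress printer: row number, and whether lyrics are still missing for that row
    if i % 10 == 0:
        print(i)
        print(pd.isna(uk_pops.at[i, 'Lyrics']))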

In [26]:
# the context manager closes the file even if pickling fails
with open('uk_pops_raw', 'wb') as file:
    pickle.dump(uk_pops, file)
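
To pick the data up again in a later session, the pickled dataframes can be loaded back. Below is a minimal round-trip sketch on a made-up one-row frame; the file name `demo_pops_raw` and the row contents are placeholders, and a temporary directory is used so nothing real is overwritten.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up frame with the same three columns as the saved dataframes
demo = pd.DataFrame({'Artist': ['Artist A'],
                     'Song': ['Song 1'],
                     'Lyrics': ['la la la']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_pops_raw')  # placeholder file name
    with open(path, 'wb') as f:
        pickle.dump(demo, f)

    # Option 1: the pickle module, mirroring how the file was written
    with open(path, 'rb') as f:
        restored = pickle.load(f)

    # Option 2: the pandas convenience reader
    restored_pd = pd.read_pickle(path)

print(restored.equals(demo), restored_pd.equals(demo))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.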