# Billboard Year-End Hot 100 singles 
### Web scraping Wikipedia for Billboard's year-end top 100 singles over 50 years, from 1969 through 2018

<img src="https://upload.wikimedia.org/wikipedia/commons/2/2b/Billboard_Hot_100_logo.jpg">

The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.

The weekly tracking period for sales was initially Monday to Sunday when Nielsen started tracking sales in 1991, but was changed to Friday to Thursday in July 2015. This tracking period also applies to compiling online streaming data. Radio airplay, which, unlike sales figures and streaming, is readily available on a real-time basis, is tracked on a Monday to Sunday cycle (previously Wednesday to Tuesday). A new chart is compiled and officially released to the public by Billboard on Tuesdays.

The first number-one song on the Billboard Hot 100 was "Poor Little Fool" by Ricky Nelson, on August 4, 1958. As of the issue for the week ending on July 20, 2019, the Billboard Hot 100 has had 1,086 different number-one entries. As of that issue, the chart's number-one song is "Old Town Road" by Lil Nas X featuring Billy Ray Cyrus.

Billboard magazine publishes a top-100 list of singles every week. That weekly data, together with music sales, radio airplay, and other sources, is used to compile a year-end list of the top 100 singles. A single is typically one song, but it can sometimes be two songs released on one "single" record.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time 
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
u1970 = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1970" 
t1970 = requests.get(u1970)

In [4]:
soup = BeautifulSoup(t1970.text,"html.parser")

In [5]:
# print(soup.prettify())  # uncomment to inspect the parsed HTML

In [6]:
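# The first table with class "wikitable" holds the chart; skip its header row.
table = soup.find("table", attrs={"class": "wikitable"})
rows = table.find_all("tr")[1:]

def cleaner(r):
    """Turn the <td> cells of one row into [ranking, title, band_singer, url]."""
    ranking = int(r[0].text)
    title = r[1].text
    band_singer = r[2].text.strip()
    url = r[2].find("a").get("href")  # link of the first credited artist only
    return [ranking, title, band_singer, url]

fields = ["ranking", "title", "band_singer", "url"]

# find_all replaces the deprecated findAll alias.
songs = [dict(zip(fields, cleaner(row.find_all("td")))) for row in rows]
df1 = pd.DataFrame(songs)
df1.head()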

| | band_singer | ranking | title | url |
|---|---|---|---|---|
| 0 | Simon & Garfunkel | 1 | "Bridge Over Troubled Water" | /wiki/Simon_%26_Garfunkel |
| 1 | The Carpenters | 2 | "(They Long to Be) Close to You" | /wiki/The_Carpenters |
| 2 | The Guess Who | 3 | "American Woman" | /wiki/The_Guess_Who |
| 3 | B.J. Thomas | 4 | "Raindrops Keep Fallin' on My Head" | /wiki/B.J._Thomas |
| 4 | Edwin Starr | 5 | "War" | /wiki/Edwin_Starr |
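
The row-parsing step above can be tried without any network access. The following is a minimal sketch that assumes a small, hand-written HTML table shaped like the Wikipedia chart; `SAMPLE_HTML` and `parse_sample` are hypothetical names used only for illustration, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics the layout of the chart table.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
  <tr><td>1</td><td>"Song A"</td><td><a href="/wiki/Artist_A">Artist A</a></td></tr>
  <tr><td>2</td><td>"Song B"</td><td><a href="/wiki/Artist_B">Artist B</a></td></tr>
</table>
"""

def parse_sample(html):
    """Return one dict per data row of the first wikitable in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "wikitable"})
    parsed = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        parsed.append({
            "ranking": int(cells[0].get_text()),
            "title": cells[1].get_text().strip(),
            "band_singer": cells[2].get_text().strip(),
            "url": cells[2].find("a").get("href"),
        })
    return parsed

sample_rows = parse_sample(SAMPLE_HTML)
print(sample_rows[0])
```

Testing the parser on a fixed string like this makes it easier to see which cell holds which field before pointing it at a live page.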


In [7]:
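years = range(1969, 2019)  # 1969 through 2018 inclusive: 50 years
yearstext = {}
print(len(years))
for y in years:
    yreq = requests.get("https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%i" % y)
    yreq.raise_for_status()  # stop on a failed request instead of storing an error page
    yearstext[y] = yreq.text
    time.sleep(1)  # pause between requests to go easy on the server
print("Completed Successfully")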

50
Completed Successfully
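
One possible way to package the download loop is sketched below. It assumes the same URL pattern as the cell above; the names `year_url` and `fetch_years`, the `delay` and `timeout` parameters, and the choice of a shared `requests.Session` are illustrative assumptions, not something the notebook itself uses.

```python
import time
import requests

# URL pattern taken from the notebook cell above.
BASE = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%i"

def year_url(year):
    """Build the Wikipedia URL for one year-end chart."""
    return BASE % year

def fetch_years(years, delay=1.0, timeout=10):
    """Download each year's page and return {year: html_text}."""
    pages = {}
    with requests.Session() as session:  # reuse one connection for all requests
        for year in years:
            resp = session.get(year_url(year), timeout=timeout)
            resp.raise_for_status()  # fail loudly on HTTP errors
            pages[year] = resp.text
            time.sleep(delay)  # pause between requests
    return pages

print(year_url(1970))
```

Calling `fetch_years(range(1969, 2019))` would perform the 50 downloads; it is left uncalled here so the sketch runs without network access.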


In [8]:
yearstext[2018][:200]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Billboard Year-End Hot 100 singles of 2018 - Wikipedia</title>\n<script>document.documentElement.cla'

In [9]:
fields = ["ranking", "song", "songurl", "titletext", "band_singer", "url"]

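# Helper functions.
def get_cols(row):
    # Return the row's <th> cells followed by its <td> cells.
    return row.find_all("th") + row.find_all("td")

def break_a(col):
    # One (title, href) pair per link in the cell; fall back to the link text
    # when an <a> tag has no title attribute.
    pairs = [((a.get("title") or a.get_text()).strip('"'), a.get("href"))
             for a in col.find_all("a")]
    # zip(*pairs) transposes the pairs into [titles, hrefs]. A cell with no
    # links gives an empty list, so use the cell text and a None URL instead.
    return list(map(list, zip(*pairs))) or [[col.get_text().strip('"')], [None]]

def parse_cols(cols):
    # ranking text + [song titles, song URLs] + raw title text + [artist names, artist URLs]
    return [cols[0].get_text().strip()] + break_a(cols[1]) + [cols[1].get_text()] + break_a(cols[2])

def create_dict(cols):
    # Pair each parsed value with its field name.
    return dict(zip(fields, cols))

# Parser function.
def parse_year(year, yearstext):
    # Parse one year's page into a list of dicts, one per chart row (header row skipped).
    soup = BeautifulSoup(yearstext[year], 'html.parser')
    rows = soup.find("table", attrs={"class": "wikitable"}).find_all("tr")[1:]
    return [create_dict(parse_cols(get_cols(row))) for row in rows]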

In [10]:
parse_year(1997, yearstext)[:3]

[{'ranking': '1',
  'song': ['Something About the Way You Look Tonight',
   'Candle in the Wind 1997'],
  'songurl': ['/wiki/Something_About_the_Way_You_Look_Tonight',
   '/wiki/Candle_in_the_Wind_1997'],
  'titletext': '"Something About the Way You Look Tonight" / "Candle in the Wind 1997"',
  'band_singer': ['Elton John'],
  'url': ['/wiki/Elton_John']},
 {'ranking': '2',
  'song': ['Foolish Games', 'You Were Meant for Me (Jewel song)'],
  'songurl': ['/wiki/Foolish_Games',
   '/wiki/You_Were_Meant_for_Me_(Jewel_song)'],
  'titletext': '"Foolish Games" / "You Were Meant for Me"',
  'band_singer': ['Jewel (singer)'],
  'url': ['/wiki/Jewel_(singer)']},
 {'ranking': '3',
  'song': ["I'll Be Missing You"],
  'songurl': ['/wiki/I%27ll_Be_Missing_You'],
  'titletext': '"I\'ll Be Missing You"',
  'band_singer': ['Sean Combs', 'Faith Evans', '112 (band)'],
  'url': ['/wiki/Sean_Combs', '/wiki/Faith_Evans', '/wiki/112_(band)']}]
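
The list-valued fields in this output come from the transposition inside `break_a`. The sketch below isolates that idiom on plain tuples; `split_pairs` is a hypothetical stand-in written for illustration, and it takes ready-made `(title, href)` pairs instead of BeautifulSoup tags.

```python
def split_pairs(pairs, fallback_text):
    """Transpose (title, href) pairs into [titles, hrefs].

    With no pairs, zip(*pairs) yields nothing, the list is empty (falsy),
    and the fallback [[text], [None]] is returned instead.
    """
    transposed = list(map(list, zip(*pairs)))
    return transposed or [[fallback_text], [None]]

linked = [("Sean Combs", "/wiki/Sean_Combs"), ("Faith Evans", "/wiki/Faith_Evans")]
print(split_pairs(linked, "unused"))
# [['Sean Combs', 'Faith Evans'], ['/wiki/Sean_Combs', '/wiki/Faith_Evans']]

print(split_pairs([], "Plain text artist"))
# [['Plain text artist'], [None]]
```

Either way the result has exactly two inner lists, which is what lets `parse_cols` concatenate it with the other columns and keep six values per row.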

In [11]:
yearinfo = {y:parse_year(y,yearstext) for y in years}

In [12]:
import json

In [13]:
# JSON object keys are always strings, so the integer year keys are saved as "1969", "1970", ...
with open("yearinfo.json", "w") as fd:
    json.dump(yearinfo, fd)
del yearinfo

In [14]:
with open("yearinfo.json","r") as fd:
    yearinfo = json.load(fd)
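
One side effect of this round trip is worth seeing in isolation: integer dictionary keys come back as strings, which is why later cells index with `'1969'` and not `1969`. The sketch below uses a small made-up dictionary (`by_year` is a hypothetical name) and an in-memory round trip in place of a file.

```python
import json

# Small stand-in for the scraped data, keyed by integer year.
by_year = {1969: ["Sugar, Sugar"], 1970: ["Bridge Over Troubled Water"]}

# Serialize and parse again, as writing and re-reading a file would.
restored = json.loads(json.dumps(by_year))
print(list(restored))  # ['1969', '1970']

# If integer keys are wanted again, convert them explicitly.
restored_int = {int(year): songs for year, songs in restored.items()}
print(list(restored_int))  # [1969, 1970]
```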

In [15]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined frame.
# The year keys are strings here because the data was reloaded from JSON.
df = pd.concat([pd.DataFrame(yearinfo[year]) for year in yearinfo],
               ignore_index=True, sort=True)
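
The combined frame built here does not record which year each row came from. If that is needed, one option is to tag each per-year frame before concatenating, as in the sketch below. It assumes a tiny made-up `sample_info` dictionary in the same shape as `yearinfo`; the `year` column is an addition for illustration and does not appear in the notebook.

```python
import pandas as pd

# Made-up data in the same shape as yearinfo: {year string: list of row dicts}.
sample_info = {
    "1969": [{"ranking": "1", "titletext": '"Sugar, Sugar"'}],
    "1970": [
        {"ranking": "1", "titletext": '"Bridge Over Troubled Water"'},
        {"ranking": "2", "titletext": '"(They Long to Be) Close to You"'},
    ],
}

# Build one frame per year, add a year column, then stack them.
frames = [pd.DataFrame(rows).assign(year=int(year)) for year, rows in sample_info.items()]
combined = pd.concat(frames, ignore_index=True)
print(combined[["year", "ranking", "titletext"]])
```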

In [16]:
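# ranking is still a string at this point, so this sort is lexicographic ('1', '10', '100', '11', ...).
df = df.sort_values(by=['ranking'])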

In [17]:
df = df.reset_index()

In [18]:
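# Drop the row whose ranking is the text 'Tie' rather than a number
# (the string sort above places it after every numeric ranking).
df = df.drop(5000)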

In [19]:
df.ranking = df.ranking.astype(int)
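
Dropping a hard-coded row label works for this particular scrape but depends on where the `'Tie'` row happens to land. A less position-dependent alternative is sketched below, assuming a small made-up frame (`rankings` is a hypothetical name): `pd.to_numeric` with `errors="coerce"` turns non-numeric text into `NaN`, which can then be dropped before converting to integers. This is a swapped-in approach, not the notebook's method.

```python
import pandas as pd

# Made-up rankings as scraped text, including one non-numeric entry.
rankings = pd.DataFrame({"ranking": ["1", "10", "2", "Tie", "100"]})

# Sorting the strings is lexicographic, not numeric.
print(sorted(rankings["ranking"]))  # ['1', '10', '100', '2', 'Tie']

# Coerce non-numeric text to NaN, drop those rows, then convert and sort.
rankings["ranking"] = pd.to_numeric(rankings["ranking"], errors="coerce")
cleaned = (
    rankings.dropna(subset=["ranking"])
    .astype({"ranking": int})
    .sort_values("ranking")
    .reset_index(drop=True)
)
print(cleaned["ranking"].tolist())  # [1, 2, 10, 100]
```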

In [20]:
df.head()

| | index | band_singer | ranking | song | songurl | titletext | url |
|---|---|---|---|---|---|---|---|
| 0 | 0 | [The Archies] | 1 | [Sugar, Sugar] | [/wiki/Sugar,_Sugar] | "Sugar, Sugar" | [/wiki/The_Archies] |
| 1 | 2601 | [Coolio, L.V. (singer)] | 1 | [Gangsta's Paradise] | [/wiki/Gangsta%27s_Paradise] | "Gangsta's Paradise" | [/wiki/Coolio, /wiki/L.V._(singer)] |
| 2 | 2701 | [Los del Río] | 1 | [Macarena (song)] | [/wiki/Macarena_(song)] | "Macarena (Bayside Boys Mix)" | [/wiki/Los_del_R%C3%ADo] |
| 3 | 2801 | [Elton John] | 1 | [Something About the Way You Look Tonight, Can... | [/wiki/Something_About_the_Way_You_Look_Tonigh... | "Something About the Way You Look Tonight" / "... | [/wiki/Elton_John] |
| 4 | 901 | [Andy Gibb] | 1 | [Shadow Dancing (song)] | [/wiki/Shadow_Dancing_(song)] | "Shadow Dancing" | [/wiki/Andy_Gibb] |


In [21]:
df.dtypes

index           int64
band_singer    object
ranking         int32
song           object
songurl        object
titletext      object
url            object
dtype: object