# UK Number 1 Singles of 1999

1999 was the last big 'hurrah' for the music industry during the CD era before the rise of the internet, file-sharing and illegal downloads through sites such as Napster decimated record sales. It was only once downloads became recognised as an official method of purchasing music, and streaming services such as Spotify really took hold, that sales for singles began to turn around (album sales, on the hand, still have a long way to go).   

For this project, the aim is scrape data found in a Wikipedia table and store it in a new dataframe built in Python. The data represent the songs that topped the UK Singles Chart each week of 1999 and the number of sales the Number 1 single accumulated that particular week. Once that's done, some initial exploratory data analysis will be performed to spot any significant sales trends.  

In [1]:
# Import libraries 
from bs4 import BeautifulSoup
import requests
import time
import datetime
import pandas as pd
import lxml

In [108]:
# Request permission from the Wikipedia page to fetch data
url = "https://en.wikipedia.org/wiki/1999_in_British_music_charts#Year-end_charts"
response = requests.get(url)

# Verify that permission has been grated - the response code returned should be 200 if so
response.status_code

200

In [109]:
# Print user-friendly or readable code representing all the data from the Wikipedia page
soup = BeautifulSoup(response.text, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   1999 in British music charts - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.

## Understanding the data that needs to be scraped

In the original Wikipedia table, song/artist pairs that stayed at Number 1 for more than 1 week are listed once only, so we need to 'fill in the gaps' to ensure that the dataframe we want to create correctly labels a particular sales week with the right song/artist pair.  

In the list created below, we can group items like so - all information relating to a particular week starts with an item containing the last date of the chart week, followed by the name of the song and its artist(s), and ends with an item containing the number of sales accumulated. 

So that we don't get ahead of ourselves, let's test our approach on a couple of examples within our list relating to songs that spent more than 1 week at Number 1.

In [6]:
# Create a list containing scraped data about each Number 1 song in 1999    
data = [td.text for td in soup.find('table', class_='wikitable plainrowheaders').tbody.findAll('td')]
data

['2 January',
 '"Chocolate Salty Balls"',
 'Chef',
 '320,000\n',
 '9 January',
 '"Heartbeat" / "Tragedy"',
 'Steps',
 '98,000\n',
 '16 January',
 '"Praise You"',
 'Fatboy Slim',
 '80,913\n',
 '23 January',
 '"A Little Bit More"',
 '911',
 '75,400\n',
 '30 January',
 '"Pretty Fly (for a White Guy)"',
 'The Offspring',
 '140,000\n',
 '6 February',
 '"You Don\'t Know Me"',
 'Armand Van Helden featuring Duane Harden',
 '118,500\n',
 '13 February',
 '"Maria"',
 'Blondie',
 '128,000\n',
 '20 February',
 '"Fly Away"',
 'Lenny Kravitz',
 '123,000\n',
 '27 February',
 '"...Baby One More Time"',
 'Britney Spears',
 '463,722\n',
 '6 March',
 '231,000\n',
 '13 March',
 '"When the Going Gets Tough"',
 'Boyzone',
 '213,000\n',
 '20 March',
 '197,000\n',
 '27 March',
 '"Blame It on the Weatherman"',
 'B*Witched',
 '90,000\n',
 '3 April',
 '"Flat Beat"',
 'Mr. Oizo',
 '283,000\n',
 '10 April',
 '184,000\n',
 '17 April',
 '"Perfect Moment"',
 'Martine McCutcheon',
 '200,000\n',
 '24 April',
 '140,000\n

In [107]:
# "...Baby One More Time" by Britney Spears 
britney = ['27 February', 
           '"...Baby One More Time"', 
           'Britney Spears', 
           '463,722\n', 
           '6 March', 
           '231,000\n']

# Any item relating to the chart week contains a month of the year
# Any item relating to the number of sales that week contains the special character '\n'
# We can use this information to iterate through the following loop in order to fill in missing data

months = ['January', 
          'February', 
          'March', 
          'April', 
          'May', 
          'June', 
          'July', 
          'August', 
          'September', 
          'October', 
          'November', 
          'December']

# Iterate through the available indices of the list
for i in range(len(britney)):
    
    # Print the current item of the list
    print(f"Check the following item: {britney[i]}")
    
    # Check if the current item is the chart week (i.e. contains a month of the year)
    for m in months:
        # If so, assign a new variable to the given month and break the inner for loop
        if m in britney[i]:
            month = m
            break
            
    # Move the outer for loop along to the next item if the current item is the chart week        
    if month in britney[i]:
        continue
        
    # Check if the current item is NOT the number of sales
    if '\n' not in britney[i]:
        # If so and the next item also isn't the number of sales, 
        # prepare variables for the song and artist before continuing the loop to the next item
        if '\n' not in britney[i+1]:
            song = britney[i]
            artist = britney[i+1]
            continue
        # If so and the next item IS the number of sales, 
        # continue the loop to the next item without creating new variables
        else:
            continue       
    else:
        # If the current item IS the number of sales, 
        # and none of the previous 2 items relate to the chart week, continue the loop to the next item
        if (month not in britney[i-1]) and (month not in britney[i-2]):
            continue
        # If at least one of the 2 previous items is the chart week, insert the song and artist names into the list
        else:
            britney.insert(i, song)
            britney.insert(i+1, artist)
            continue
            
britney           

Check the following item: 27 February
Check the following item: "...Baby One More Time"
Check the following item: Britney Spears
Check the following item: 463,722

Check the following item: 6 March
Check the following item: 231,000



['27 February',
 '"...Baby One More Time"',
 'Britney Spears',
 '463,722\n',
 '6 March',
 '"...Baby One More Time"',
 'Britney Spears',
 '231,000\n']

In [96]:
# "Livin' la Vida Loca" by Ricky Martin 
ricky = ['17 July', 
         '"Livin\' la Vida Loca"', 
         'Ricky Martin', 
         '131,000\n', 
         '24 July', 
         '125,000\n', 
         '31 July', 
         '96,600\n']

# Iterate through the available indices of the list
for i in range(len(ricky)):
    
    # Print the current item of the list
    print(f"Check the following item: {ricky[i]}")
    
    # Check if the current item is the chart week (i.e. contains a month of the year)
    for m in months:
        # If so, assign a new variable to the given month and break the inner for loop
        if m in ricky[i]:
            month = m
            break
            
    # Move the outer for loop along to the next item if the current item is the chart week
    if month in ricky[i]: 
        print(ricky[i])
        continue
        
    # Check if the current element is NOT the number of sales
    if '\n' not in ricky[i]:
        # If so and the next item also isn't the number of sales, 
        # prepare variables for the song and artist before continuing the loop to the next item
        if '\n' not in ricky[i+1]:
            song = ricky[i]
            artist = ricky[i+1]
            print(ricky[i])
            continue
        # If so and the next item IS the number of sales, 
        # continue the loop to the next item without creating new variables
        else:
            print(ricky[i])
            continue
    else:
        # If the current item IS the number of sales, 
        # and none of the previous 2 items relate to the chart week, continue the loop to the next item
        if (month not in ricky[i-1]) and (month not in ricky[i-2]):
            print(ricky[i])
            continue
        # If at least one of the 2 previous items is the chart week, insert the song and artist names into the list
        else:
            ricky.insert(i, song)
            ricky.insert(i+1, artist)
            print(ricky[i])
            
ricky

Check the following item: 17 July
17 July
Check the following item: "Livin' la Vida Loca"
"Livin' la Vida Loca"
Check the following item: Ricky Martin
Ricky Martin
Check the following item: 131,000

131,000

Check the following item: 24 July
24 July
Check the following item: 125,000

"Livin' la Vida Loca"
Check the following item: Ricky Martin
Ricky Martin
Check the following item: 125,000

125,000



['17 July',
 '"Livin\' la Vida Loca"',
 'Ricky Martin',
 '131,000\n',
 '24 July',
 '"Livin\' la Vida Loca"',
 'Ricky Martin',
 '125,000\n',
 '31 July',
 '96,600\n']

We have a problem in the last example - "Ricky Martin" and "Livin' La Vida Loca" wasn't added to the list in relation to its last week at Number 1 (week ending 31st July). 

Why is this happening? Notice that the outer loop doesn't iterate through the original list but through its available indices. Since the loop aims to update the list, this method may not account for this or multiple occurences of the same item (for the case of the song and artist). 

Let's try again but this time iterate through the actual list of items instead. 

In [106]:
ricky = ['17 July', '"Livin\' la Vida Loca"', 'Ricky Martin', '131,000\n', '24 July', '125,000\n', '31 July', '96,600\n']

# Iterate through the actual items of the list
for r in ricky:
    
    # Print out the current item in the list
    print(f"Check the following item: {r}")
    
    # Find the index of the last occurence of the current item in the list
    index = len(ricky) - [i for i in reversed(ricky)].index(r) - 1 
    
    # Check if the current item is the chart week 
    for m in months: # check if any of the months are contained within an element of the main list
        if m in r:
            month = m
            break
    if month in r: 
        continue
    # Check if the current element is NOT the number of sales
    if '\n' not in r:
        # If so and the next element also isn't the number of sales, 
        # prepare variables for the song and artist before continuing the loop to the next element
        if '\n' not in ricky[index + 1]:
            song = r
            artist = ricky[index + 1]
            continue
        # If so and the next element IS the number of sales, 
        # continue the loop to the next element without creating new variables
        else:
            continue
    else:
        # If the current element IS the number of sales, 
        # and none of the previous 2 elements relate to the chart week, continue the loop to the next element
        if (month not in ricky[index - 1]) and (month not in ricky[index - 2]):
            continue
        # If at least one of the 2 previous elements is the chart week, insert the song and artist names into the list
        else:
            ricky.insert(index, song)
            ricky.insert(index + 1, artist)
            
ricky

Check the following item: 17 July
Check the following item: "Livin' la Vida Loca"
Check the following item: Ricky Martin
Check the following item: 131,000

Check the following item: 24 July
Check the following item: 125,000

Check the following item: Ricky Martin
Check the following item: 125,000

Check the following item: 31 July
Check the following item: 96,600

Check the following item: Ricky Martin
Check the following item: 96,600



['17 July',
 '"Livin\' la Vida Loca"',
 'Ricky Martin',
 '131,000\n',
 '24 July',
 '"Livin\' la Vida Loca"',
 'Ricky Martin',
 '125,000\n',
 '31 July',
 '"Livin\' la Vida Loca"',
 'Ricky Martin',
 '96,600\n']

Now that this new program works on these examples, let's apply it to the entire list of scraped data. If successful, the updated list should return 208 items - 4 items (chart week, song, artist(s), sales) for each week of the year.

In [104]:
data = [td.text for td in soup.find('table', class_='wikitable plainrowheaders').tbody.findAll('td')]
months = ['January', 
          'February', 
          'March', 
          'April', 
          'May', 
          'June', 
          'July', 
          'August', 
          'September', 
          'October', 
          'November', 
          'December']

for d in data:
    index = len(data) - [i for i in reversed(data)].index(d) - 1
    for m in months: 
        if m in d:
            month = m
            break
    if month in d: 
        continue  
    if '\n' not in d:
        if '\n' not in data[index + 1]:
            song = d
            artist = data[index + 1]
            continue
        else:
            continue
    else:
        if (month not in data[index - 1]) and (month not in data[index - 2]):
            continue
        else:
            data.insert(index, song)
            data.insert(index + 1, artist)
            
print(f"There are {len(data)} items in the updated list")            
data

There are 206 items in the updated list


['2 January',
 '"Chocolate Salty Balls"',
 'Chef',
 '320,000\n',
 '9 January',
 '"Heartbeat" / "Tragedy"',
 'Steps',
 '98,000\n',
 '16 January',
 '"Praise You"',
 'Fatboy Slim',
 '80,913\n',
 '23 January',
 '"A Little Bit More"',
 '911',
 '75,400\n',
 '30 January',
 '"Pretty Fly (for a White Guy)"',
 'The Offspring',
 '140,000\n',
 '6 February',
 '"You Don\'t Know Me"',
 'Armand Van Helden featuring Duane Harden',
 '118,500\n',
 '13 February',
 '"Maria"',
 'Blondie',
 '128,000\n',
 '20 February',
 '"Fly Away"',
 'Lenny Kravitz',
 '123,000\n',
 '27 February',
 '"...Baby One More Time"',
 'Britney Spears',
 '463,722\n',
 '6 March',
 '"...Baby One More Time"',
 'Britney Spears',
 '231,000\n',
 '13 March',
 '"When the Going Gets Tough"',
 'Boyzone',
 '213,000\n',
 '20 March',
 '"When the Going Gets Tough"',
 'Boyzone',
 '197,000\n',
 '27 March',
 '"Blame It on the Weatherman"',
 'B*Witched',
 '90,000\n',
 '3 April',
 '"Flat Beat"',
 'Mr. Oizo',
 '283,000\n',
 '10 April',
 '"Flat Beat"',
 '

Almost there but not quite right! The list returned 206 items, so we are missing 2 items.

Upon further investigation, the issue relates to the song "Sweet Like Chocolate" by Shanks & Bigfoot. Let's take a closer a look to see what's going wrong.

In [105]:
shanks = ['29 May',
 '"Sweet Like Chocolate"',
 'Shanks & Bigfoot',
 '251,000\n',
 '5 June',
 '141,000\n']

for d in shanks:
    print(f"Check the following item: {d}")
    index = len(shanks) - [i for i in reversed(shanks)].index(d) - 1
    for m in months:
        if m in d:
            month = m
            break
    if month in d: 
        continue
    if '\n' not in d:
        if '\n' not in shanks[index + 1]:
            song = d
            artist = shanks[index + 1]
            continue
        else:
            continue
    else:
        if (month not in shanks[index - 1]) and (month not in shanks[index - 2]):
            continue
        else:
            data.insert(index, song)
            data.insert(index + 1, artist)
            
shanks

Check the following item: 29 May
Check the following item: "Sweet Like Chocolate"
Check the following item: Shanks & Bigfoot
Check the following item: 251,000

Check the following item: 5 June
Check the following item: 141,000



['29 May',
 '"Sweet Like Chocolate"',
 'Shanks & Bigfoot',
 '251,000\n',
 '5 June',
 '141,000\n']

One solution is to insert the correct missing data into the list. Note that this only works for this specific case - a more general solution would have to be applied in case we wanted to scrape data from similar Wikipedia tables and came across similar issues. Once a more general solution is found, this notebook will be updated. 

In [15]:
# Find the index of '5 June' within the list of scraped data
shanks_date = data.index("5 June")

# Insert 'Sweet Like Chocolate' and 'Shanks & Bigfoot' into the list to complete the dataset
data.insert(shanks_date + 1, "Shanks & Bigfoot")
data.insert(shanks_date + 1, "Sweet Like Chocolate")

print(f"There are {len(data)} items in the updated list")            
data

There are 208 items in the updated list


['2 January',
 '"Chocolate Salty Balls"',
 'Chef',
 '320,000\n',
 '9 January',
 '"Heartbeat" / "Tragedy"',
 'Steps',
 '98,000\n',
 '16 January',
 '"Praise You"',
 'Fatboy Slim',
 '80,913\n',
 '23 January',
 '"A Little Bit More"',
 '911',
 '75,400\n',
 '30 January',
 '"Pretty Fly (for a White Guy)"',
 'The Offspring',
 '140,000\n',
 '6 February',
 '"You Don\'t Know Me"',
 'Armand Van Helden featuring Duane Harden',
 '118,500\n',
 '13 February',
 '"Maria"',
 'Blondie',
 '128,000\n',
 '20 February',
 '"Fly Away"',
 'Lenny Kravitz',
 '123,000\n',
 '27 February',
 '"...Baby One More Time"',
 'Britney Spears',
 '463,722\n',
 '6 March',
 '"...Baby One More Time"',
 'Britney Spears',
 '231,000\n',
 '13 March',
 '"When the Going Gets Tough"',
 'Boyzone',
 '213,000\n',
 '20 March',
 '"When the Going Gets Tough"',
 'Boyzone',
 '197,000\n',
 '27 March',
 '"Blame It on the Weatherman"',
 'B*Witched',
 '90,000\n',
 '3 April',
 '"Flat Beat"',
 'Mr. Oizo',
 '283,000\n',
 '10 April',
 '"Flat Beat"',
 '

Perfect! We now have our complete dataset to create the final dataframe.

In [52]:
# Create lists for each column
weeks = [data[i] for i in range(0,205,4)]
songs = [data[i] for i in range(1,206,4)]
artists = [data[i] for i in range(2,207,4)]
sales = [data[i].strip("\n").replace(",", "") for i in range(3,208,4)]

# Combine columns to make the dataframe
all_columns = list(zip(weeks, songs, artists, sales))
df = pd.DataFrame(all_columns, columns = ["Chart date (week ending)", "Song", "Artist(s)", "Sales"])

# Clean the 'Sales' column so that it contains numerical data only
df = df.astype({'Sales': int})
df

Unnamed: 0,Chart date (week ending),Song,Artist(s),Sales
0,2 January,"""Chocolate Salty Balls""",Chef,320000
1,9 January,"""Heartbeat"" / ""Tragedy""",Steps,98000
2,16 January,"""Praise You""",Fatboy Slim,80913
3,23 January,"""A Little Bit More""",911,75400
4,30 January,"""Pretty Fly (for a White Guy)""",The Offspring,140000
5,6 February,"""You Don't Know Me""",Armand Van Helden featuring Duane Harden,118500
6,13 February,"""Maria""",Blondie,128000
7,20 February,"""Fly Away""",Lenny Kravitz,123000
8,27 February,"""...Baby One More Time""",Britney Spears,463722
9,6 March,"""...Baby One More Time""",Britney Spears,231000


In [53]:
# Check the data types of the dataframe's columns
df.dtypes

Chart date (week ending)    object
Song                        object
Artist(s)                   object
Sales                        int32
dtype: object

## Exploratory Data Analysis

Now that we have successfully scraped all the data we need, let's do some initial exploratory data analysis to round off this project.

In [103]:
# Show summary statistics
df.describe()

Unnamed: 0,Sales
count,52.0
mean,160745.769231
std,70162.727152
min,75400.0
25%,119625.0
50%,141000.0
75%,197141.25
max,463722.0


In [58]:
# Find the biggest sales week of 1999
df[df['Sales'] == df['Sales'].aggregate('max')]

Unnamed: 0,Chart date (week ending),Song,Artist(s),Sales
8,27 February,"""...Baby One More Time""",Britney Spears,463722


In [59]:
# Find the smallest sales week of 1999
df[df['Sales'] == df['Sales'].aggregate('min')]

Unnamed: 0,Chart date (week ending),Song,Artist(s),Sales
3,23 January,"""A Little Bit More""",911,75400


In [89]:
# Find the mean number of sales for a Number 1 song in 1999
print('{:.0f}'.format(df['Sales'].aggregate('mean')))

160746


In [84]:
# Find the longest-running Number 1 song(s) of 1999
result_1 = df.groupby(['Song', 'Artist(s)'])[['Chart date (week ending)']].count().sort_values('Chart date (week ending)', ascending=False).head(3)
result_1.rename(columns = {'Chart date (week ending)':'Number of weeks'}, inplace=True)
result_1

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of weeks
Song,Artist(s),Unnamed: 2_level_1
"""Livin' la Vida Loca""",Ricky Martin,3
"""Blue (Da Ba Dee)""",Eiffel 65,3
"""The Millennium Prayer""",Cliff Richard,3


In [69]:
# Find the songs that accumulated the most sales whilst at Number 1
result_2 = df.groupby(['Song', 'Artist(s)'])[['Song', 'Artist(s)', 'Sales']].aggregate('sum').sort_values('Sales', ascending=False).head(10)
result_2

Unnamed: 0_level_0,Unnamed: 1_level_0,Sales
Song,Artist(s),Unnamed: 2_level_1
"""...Baby One More Time""",Britney Spears,694722
"""Blue (Da Ba Dee)""",Eiffel 65,532000
"""Flat Beat""",Mr. Oizo,467000
"""The Millennium Prayer""",Cliff Richard,464000
"""When the Going Gets Tough""",Boyzone,410000
"""Mambo No. 5 (A Little Bit of...)""",Lou Bega,399000
"""9pm (Till I Come)""",ATB,378000
"""Livin' la Vida Loca""",Ricky Martin,352600
"""Perfect Moment""",Martine McCutcheon,340000
"""Chocolate Salty Balls""",Chef,320000


In [70]:
# Find the songs that accumulated the least sales whilst at Number 1
result_3 = df.groupby(['Song', 'Artist(s)'])[['Song', 'Artist(s)', 'Sales']].aggregate('sum').sort_values('Sales').head(10)
result_3

Unnamed: 0_level_0,Unnamed: 1_level_0,Sales
Song,Artist(s),Unnamed: 2_level_1
"""A Little Bit More""",911,75400
"""Praise You""",Fatboy Slim,80913
"""Blame It on the Weatherman""",B*Witched,90000
"""If I Let You Go""",Westlife,90491
"""Flying Without Wings""",Westlife,92000
"""I Want It That Way""",Backstreet Boys,93000
"""Heartbeat"" / ""Tragedy""",Steps,98000
"""You Don't Know Me""",Armand Van Helden featuring Duane Harden,118500
"""She's the One"" / ""It's Only Us""",Robbie Williams,120000
"""Fly Away""",Lenny Kravitz,123000


In [87]:
# Find the artist with the most weeks at Number 1 in 1999
result_4 = df.groupby('Artist(s)')[['Chart date (week ending)']].count().sort_values('Chart date (week ending)', ascending=False).head(1)
result_4.rename(columns = {'Chart date (week ending)':'Number of weeks'}, inplace=True)
result_4

Unnamed: 0_level_0,Number of weeks
Artist(s),Unnamed: 1_level_1
Westlife,5


In [94]:
# Find the total sales for a Number 1 song per month in descending order
df['Chart month'] = df['Chart date (week ending)'].str.strip("0123456789 ")
df[['Chart date (week ending)', 'Chart month', 'Song', 'Artist(s)', 'Sales']]

result_5 = df.groupby('Chart month')[['Sales']].sum().sort_values('Sales', ascending=False)
result_5

Unnamed: 0_level_0,Sales
Chart month,Unnamed: 1_level_1
February,833222
April,807000
September,768309
March,731000
July,730600
January,714313
October,693100
December,677000
June,671279
May,668901


In [102]:
# Find the total sales for a Number 1 song per quarter in 1999 in descending order
def get_quarter(month):
    if month in ['January', 'February', 'March']:
        return 'Q1'
    elif month in ['April', 'May', 'June']:
        return 'Q2'
    elif month in ['July', 'August', 'September']:
        return 'Q3'
    else:
        return 'Q4'
    
df['Quarter'] = df['Chart month'].apply(get_quarter)
result_6 = df.groupby('Quarter')[['Sales']].sum().sort_values('Sales', ascending=False)
result_6 

Unnamed: 0_level_0,Sales
Quarter,Unnamed: 1_level_1
Q1,2278535
Q2,2147180
Q3,2018965
Q4,1914100
