# Notebook 1 - Establishing the Data


**Author:** Kavan Wills

**Computing ID:** meu5cg

**Course:** DS 2023 - Communicating with Data  

---
## Purpose
This notebook establishes the BiMMuDa dataset for analysis of lyrical complexity trends in Billboard's top 5 songs from 1950 to 2022.

---
## Contents
1. Import libraries
2. Load data
3. Describe data source
4. Create COLS table
5. Summary

---
## Import Libraries 

In [10]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

---
## Load Data

In [11]:
df = pd.read_csv('bimmuda_per_song_full.csv')
df.head()

| | Title | Artist | Year | Position | Link to Audio | Tonic 1 | Tonic 2 | Tonic 3 | Mode 1 | Mode 2 | Mode 3 | BPM 1 | BPM 2 | BPM 3 | Number of Parts | Number of Words | Number of Unique Words | Unique Word Ratio | Number of Syllables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Goodnight Irene | Gordon Jenkins & The Weavers | 1950 | 1 | https://open.spotify.com/track/3GtfeLBXe15nyEM... | F | | | Major | | | 94 | 141.0 | | 2.0 | 134.0 | 58.0 | 0.43 | 207.0 |
| 1 | Mona Lisa | Nat King Cole | 1950 | 2 | https://open.spotify.com/track/5dae01pKNjRQtgO... | Db | | | Major | | | 65 | | | 2.0 | 145.0 | 55.0 | 0.38 | 192.0 |
| 2 | Third Man Theme | Anton Karas | 1950 | 3 | https://open.spotify.com/track/7rRGujA12UJcRUz... | G | | | Major | | | 120 | 152.0 | | 7.0 | | | | |
| 3 | Sam's Song | Gary & Bing Crosby | 1950 | 4 | https://open.spotify.com/track/1Wnlagmoyo7M7In... | F | | | Minor | | | 118 | | | 2.0 | 317.0 | 134.0 | 0.42 | 381.0 |
| 4 | Simple Melody | Gary & Bing Crosby | 1950 | 5 | https://open.spotify.com/track/75lpxrV9sLZIRvz... | Bb | | | Major | | | 151 | | | 2.0 | 199.0 | 55.0 | 0.28 | 265.0 |
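
A quick structural check right after loading can confirm the frame's shape and show which columns contain missing values. The sketch below rebuilds a few columns from the five preview rows as a stand-in DataFrame named `sample` (an assumption, so that the example runs without the CSV file); in the notebook the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded data: a subset of columns from the five preview rows.
sample = pd.DataFrame({
    "Title": ["Goodnight Irene", "Mona Lisa", "Third Man Theme",
              "Sam's Song", "Simple Melody"],
    "Year": [1950] * 5,
    "Position": [1, 2, 3, 4, 5],
    "BPM 1": [94, 65, 120, 118, 151],
    "BPM 2": [141.0, np.nan, 152.0, np.nan, np.nan],
    "Number of Words": [134.0, 145.0, np.nan, 317.0, 199.0],
})

# Number of rows and columns in the frame.
n_rows, n_cols = sample.shape

# Count of missing values per column, largest first.
missing_per_column = sample.isna().sum().sort_values(ascending=False)

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
```

Columns such as `BPM 2` are expected to be sparse, since only some songs have a second tempo recorded.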


---

## Data Source

### Dataset: Billboard Melodic Music Dataset (BiMMuDa)

**Citation:**  
Hamilton, M., Clemente, A., Hall, E., & Pearce, M. (2024). The Billboard Melodic Music Dataset (BiMMuDa). *Transactions of the International Society for Music Information Retrieval*, 7(1), 113-128. https://doi.org/10.5334/tismir.168

**Source Link:**  
https://github.com/madelinehamilton/BiMMuDa/

---

### Who Produced the Data?

The BiMMuDa dataset was created by researchers at Queen Mary University of London's Music Cognition Lab:
- **Madeline Hamilton** (lead author)
- **Ana Clemente**
- **Edward Hall**  
- **Marcus Pearce**

---

### How Was the Data Produced?

1. **Song Selection:**  
   - Top 5 songs from Billboard's year-end singles chart (1950-2022)
   - 371 songs total, representing the most popular music in America each year

2. **Transcription Process:**
   - Manual transcription of vocal melodies by experienced musicians
   - Lyrics obtained from free lyrics websites
   - Metadata verified using Tunebat.com and Spotify

3. **Quality Assurance:**
   - Cross-validated against the CoCoPops dataset for 14 overlapping songs
   - Statistical similarity confirmed (compression distance analysis)
   - Manual review of all transcriptions

4. **Lyrical Metrics:**
   - **Number of Words:** Total word count including repetitions
   - **Number of Unique Words:** Distinct vocabulary count
   - **Unique Word Ratio:** Unique words ÷ Total words (measure of diversity)
   - **Number of Syllables:** Total syllable count
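
The word-based metrics above can be illustrated with a minimal sketch. The function name `lyric_metrics` and the tokenization rule (lowercasing, then treating runs of letters and apostrophes as words) are assumptions made for illustration; the dataset's authors may have counted words differently.

```python
import re


def lyric_metrics(lyrics: str) -> dict:
    """Return word count, unique word count, and unique word ratio."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", lyrics.lower())
    n_words = len(words)
    n_unique = len(set(words))
    # Ratio is undefined for text with no words (e.g. instrumental tracks).
    ratio = round(n_unique / n_words, 2) if n_words else float("nan")
    return {
        "Number of Words": n_words,
        "Number of Unique Words": n_unique,
        "Unique Word Ratio": ratio,
    }


# Hypothetical lyric fragment, not taken from the dataset.
example = lyric_metrics("la la la, sing it with me, la la la")
print(example)
```

The stored values follow the same arithmetic: the first preview row lists 58 unique words out of 134, and 58 / 134 ≈ 0.43, which matches its Unique Word Ratio.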

---

### Data Characteristics

- **Temporal Range:** 73 years (1950-2022)
- **Population:** Top 5 Billboard year-end singles per year
- **Format:** CSV file with per-song attributes
- **Completeness:** Instrumental tracks (for example, "Third Man Theme") have no lyrical metrics, so their word and syllable columns are empty
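
The completeness and temporal-range points can be checked directly. This is a sketch under stated assumptions: `sample` is a stand-in built from three of the preview rows shown earlier in this notebook, and a song is treated as lacking lyrical data only when all four lyric metrics are missing.

```python
import numpy as np
import pandas as pd

# Stand-in rows taken from the preview earlier in this notebook.
sample = pd.DataFrame({
    "Title": ["Goodnight Irene", "Mona Lisa", "Third Man Theme"],
    "Year": [1950, 1950, 1950],
    "Number of Words": [134.0, 145.0, np.nan],
    "Number of Unique Words": [58.0, 55.0, np.nan],
    "Unique Word Ratio": [0.43, 0.38, np.nan],
    "Number of Syllables": [207.0, 192.0, np.nan],
})

lyric_cols = [
    "Number of Words",
    "Number of Unique Words",
    "Unique Word Ratio",
    "Number of Syllables",
]

# Assumption: a song lacks lyric data when every lyric metric is missing.
no_lyrics = sample[lyric_cols].isna().all(axis=1)
titles_without_lyrics = sample.loc[no_lyrics, "Title"].tolist()

# Inclusive count of years from 1950 through 2022.
n_years = len(range(1950, 2022 + 1))

print(titles_without_lyrics)
print(n_years)
```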

--- 
## COLS Table: Feature Descriptions 

The table below describes ten of the dataset's features (the link, tonic, mode, and secondary BPM columns are omitted), following the COLS (Columns) format:
- **Column Name:** Variable name in dataset
- **Type:** Data type (Categorical, Numeric, etc.)
- **Description:** What the variable represents
- **Range/Values:** Possible values or range

In [16]:
# Define COLS table
cols_table_data = [
    {
        'Column Name': 'Title',
        'Type': 'Categorical',
        'Description': 'Song title',
        'Range/Values': f'{df["Title"].nunique()} unique songs'
    },
    {
        'Column Name': 'Artist',
        'Type': 'Categorical',
        'Description': 'Artist name(s), including features',
        'Range/Values': f'{df["Artist"].nunique()} unique artists'
    },
    {
        'Column Name': 'Year',
        'Type': 'Numeric',
        'Description': 'Year song appeared in Billboard top 5',
        'Range/Values': f'{df["Year"].min()}-{df["Year"].max()}'
    },
    {
        'Column Name': 'Position',
        'Type': 'Numeric',
        'Description': 'Chart position (1=highest)',
        'Range/Values': '1-5'
    },
    {
        'Column Name': 'Number of Words',
        'Type': 'Numeric',
        'Description': 'Total words in lyrics (with repetitions)',
        'Range/Values': f'{df["Number of Words"].min():.0f}-{df["Number of Words"].max():.0f}'
    },
    {
        'Column Name': 'Number of Unique Words',
        'Type': 'Numeric',
        'Description': 'Count of distinct vocabulary words',
        'Range/Values': f'{df["Number of Unique Words"].min():.0f}-{df["Number of Unique Words"].max():.0f}'
    },
    {
        'Column Name': 'Unique Word Ratio',
        'Type': 'Numeric',
        'Description': 'Unique Words ÷ Total Words (diversity metric)',
        'Range/Values': f'{df["Unique Word Ratio"].min():.2f}-{df["Unique Word Ratio"].max():.2f}'
    },
    {
        'Column Name': 'Number of Syllables',
        'Type': 'Numeric',
        'Description': 'Total syllables in lyrics',
        'Range/Values': f'{df["Number of Syllables"].min():.0f}-{df["Number of Syllables"].max():.0f}'
    },
    {
        'Column Name': 'Number of Parts',
        'Type': 'Numeric',
        'Description': 'Number of distinct melodic sections (verse, chorus, etc.)',
        'Range/Values': f'{df["Number of Parts"].min():.0f}-{df["Number of Parts"].max():.0f}'
    },
    {
        'Column Name': 'BPM 1',
        'Type': 'Numeric',
        'Description': 'Tempo in beats per minute (BPM)',
        'Range/Values': f'{df["BPM 1"].min():.0f}-{df["BPM 1"].max():.0f}'
    }
]

cols_table = pd.DataFrame(cols_table_data)
display(cols_table)

| | Column Name | Type | Description | Range/Values |
|---|---|---|---|---|
| 0 | Title | Categorical | Song title | 369 unique songs |
| 1 | Artist | Categorical | Artist name(s), including features | 307 unique artists |
| 2 | Year | Numeric | Year song appeared in Billboard top 5 | 1950-2022 |
| 3 | Position | Numeric | Chart position (1=highest) | 1-5 |
| 4 | Number of Words | Numeric | Total words in lyrics (with repetitions) | 12-896 |
| 5 | Number of Unique Words | Numeric | Count of distinct vocabulary words | 11-312 |
| 6 | Unique Word Ratio | Numeric | Unique Words ÷ Total Words (diversity metric) | 0.10-1.00 |
| 7 | Number of Syllables | Numeric | Total syllables in lyrics | 22-1064 |
| 8 | Number of Parts | Numeric | Number of distinct melodic sections (verse, ch... | 1-8 |
| 9 | BPM 1 | Numeric | Tempo in beats per minute (BPM) | 57-174 |
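
The numeric rows of the COLS table all repeat the same min-max formatting, so a small helper could reduce the duplication. The sketch below is one possible approach, not the method used above: `value_range` and `cols_row` are hypothetical names, and `sample` is a stand-in built from the preview rows.

```python
import numpy as np
import pandas as pd


def value_range(series: pd.Series, decimals: int = 0) -> str:
    """Format a numeric column's min-max range; min()/max() skip NaN by default."""
    return f"{series.min():.{decimals}f}-{series.max():.{decimals}f}"


def cols_row(df: pd.DataFrame, name: str, col_type: str,
             description: str, decimals: int = 0) -> dict:
    """Build one COLS-table row for a numeric column of df."""
    return {
        "Column Name": name,
        "Type": col_type,
        "Description": description,
        "Range/Values": value_range(df[name], decimals),
    }


# Stand-in data taken from the five preview rows.
sample = pd.DataFrame({
    "BPM 1": [94, 65, 120, 118, 151],
    "Unique Word Ratio": [0.43, 0.38, np.nan, 0.42, 0.28],
})

cols_preview = pd.DataFrame([
    cols_row(sample, "BPM 1", "Numeric", "Tempo in beats per minute (BPM)"),
    cols_row(sample, "Unique Word Ratio", "Numeric",
             "Unique Words ÷ Total Words (diversity metric)", decimals=2),
])
print(cols_preview)
```

Categorical rows such as Title and Artist would still be written by hand, since they report `nunique()` counts rather than ranges.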


---

## Summary

### Data Establishment Complete ✅

**Dataset:** BiMMuDa (Billboard Melodic Music Dataset)  
**Songs:** 371 from Billboard's year-end top 5 (1950-2022)  
**Focus:** Lyrical complexity analysis

---

### Key Features:
- **Number of Words** - Total word count
- **Number of Unique Words** - Vocabulary size
- **Unique Word Ratio** - Diversity metric (0-1 scale)
- **Year** - Temporal variable for trend analysis

---

### Next: Notebook 2 - Data Exploration