# Lyrics and Chords from Ultimate-Guitar

## Dataset loading

The dataset I will be using is [Lyrics and Chords from Ultimate-Guitar](https://www.kaggle.com/datasets/taylorflandro/lyrics-and-chords-from-ultimateguitar?resource=download), available on Kaggle.

In this dataset, each song is split into sections, with one row per section. The goal is to end up with a single string of chords per song.

In [101]:
import os
import pandas as pd

In [102]:
data_path = "data"
raw_data_path = os.path.join(data_path, "lyrics-and-chords-from-ultimate-guitar")

In [103]:
raw_data = []
for f in os.listdir(raw_data_path):
    f_path = os.path.join(raw_data_path, f)
    print(f"Loading {f_path}")
    df = pd.read_csv(f_path, index_col=0)
    raw_data.append(df)
raw_data = pd.concat(raw_data, ignore_index=True)

Loading data/lyrics-and-chords-from-ultimate-guitar/hiphop_lyrics_df.csv
Loading data/lyrics-and-chords-from-ultimate-guitar/rnb_lyrics_df.csv
Loading data/lyrics-and-chords-from-ultimate-guitar/rock_lyrics_df.csv
Loading data/lyrics-and-chords-from-ultimate-guitar/country_lyrics_df.csv
Loading data/lyrics-and-chords-from-ultimate-guitar/pop_lyrics_df.csv


In [104]:
raw_data.head()

| | Song Artist | Song Title | Part | Chords | Lyrics | Genre |
|---|---|---|---|---|---|---|
| 0 | Post Malone | FEELING WHITNEY | intro | C Am Fmaj7 C | Oo oo oo oo oo oo oo Oo oo oo oo oo oo oo Oo o... | hiphop |
| 1 | Post Malone | FEELING WHITNEY | verse 1 | C Am Fmaj7 C C Am F C C7 | And Ive been looking for someone to put up wit... | hiphop |
| 2 | Post Malone | FEELING WHITNEY | chorus | Fmaj7 C Fmaj7 C G G7 N.C. | To each their own and found peace in knowing A... | hiphop |
| 3 | Post Malone | FEELING WHITNEY | post-chorus | C Am Fmaj7 C | Oo oo oo oo oo oo oo Oo oo oo oo oo oo oo Oo o... | hiphop |
| 4 | Post Malone | FEELING WHITNEY | verse 2 | C Am F C C Am F C C7 | And Ive been looking for someone that I can bu... | hiphop |


In [105]:
raw_data["Chords"][0]

'C Am Fmaj7 C'

In [106]:
total_rows = len(raw_data)
raw_data = raw_data.drop_duplicates()
deduplicated_rows = len(raw_data)

print(f"Found {total_rows - deduplicated_rows} duplicated rows")

Found 310 duplicated rows


## Data discovery

In [107]:
raw_data["Part"].value_counts(normalize=True).head(20)

chorus          0.238766
intro           0.089537
verse           0.089202
verse 2         0.088196
verse 1         0.086854
bridge          0.070423
pre-chorus      0.057009
outro           0.056673
instrumental    0.026492
verse 3         0.020791
interlude       0.016097
hook            0.013414
chorus 2        0.009725
chorus 1        0.009054
refrain         0.008384
post-chorus     0.006036
solo            0.005701
chorus 3        0.004359
pre-chorus 2    0.004359
verse 4         0.004024
Name: Part, dtype: float64

Some rows have no chords. This could be due to parsing errors when the original dataset was created. In some cases the chords ended up in the 'Lyrics' column instead.

These errors usually occur in parts that have no lyrics, such as intros, riffs, solos, or instrumental sections.

In [108]:
is_null_chords = raw_data["Chords"].isnull()
null_chords = raw_data[is_null_chords]
print(f"Found {len(null_chords)} rows with no chords")

Found 44 rows with no chords


In [109]:
null_chords["Part"].value_counts()

intro                                                                         11
solo                                                                           3
verse                                                                          2
interlude                                                                      2
pre-hook                                                                       2
outro                                                                          2
instrumental                                                                   2
tuning                                                                         1
intro: ariana grande                                                           1
intro c#m e a e c#m e a                                                        1
note                                                                           1
general riff                                                                   1
and the chords continue on f

In [110]:
null_chords["Song Title"].value_counts()

BATTLE SCARS                      6
NEVER BE ALONE                    3
27                                3
HOW TO LOVE                       2
IRIS                              2
ON MELANCHOLY HILL                1
LITTLE TALKS                      1
MAKE IT TO ME                     1
BE MY MISTAKE                     1
TREES                             1
FATHER AND SON                    1
LOVE ME AGAIN                     1
DANGEROUS WOMAN                   1
WHERE IS MY MIND                  1
R U MINE                          1
XANNY                             1
TRAUERFEIER LIED                  1
MY FAVORITE PART                  1
REHAB                             1
GRAND PIANO                       1
DO I WANNA KNOW                   1
WHATS UP                          1
RADIOACTIVE                       1
I DONT EVEN KNOW YOUR NAME        1
REDBONE                           1
SET FIRE TO THE RAIN              1
REDEMPTION SONG                   1
SAME LOVE                   

In [111]:
raw_data[raw_data["Song Title"]=="BATTLE SCARS"]

| | Song Artist | Song Title | Part | Chords | Lyrics | Genre |
|---|---|---|---|---|---|---|
| 245 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | intro | Em C G D | The wound heals but it never does Thats cause ... | hiphop |
| 246 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | hook | Em C G D | These battle scars dont look like theyre fadin... | hiphop |
| 247 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | and the chords continue on for the rest of the... | | | hiphop |
| 248 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | verse | | I wish I never looked I wish I never touched I... | hiphop |
| 249 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | pre-hook | | These battle scars dont look like theyre fadin... | hiphop |
| 250 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | hook | | I wish I could feel I wish I can love I wish t... | hiphop |
| 251 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | verse | | These battle scars dont look like theyre fadin... | hiphop |
| 252 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | pre-hook | | These battle scars dont look like theyre fadin... | hiphop |
| 467 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | intro | Em C G D | Hope the wound heals but it never does Thats c... | hiphop |
| 468 | Lupe Fiasco & Guy Sebastian | BATTLE SCARS | hook | Em C G D | These battle scars dont look like theyre fadin... | hiphop |
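
The BATTLE SCARS listing above contains two intro rows (245 and 467) with the same chords but slightly different lyrics, so an exact-match `drop_duplicates()` keeps both. The sketch below shows one way such repeats could be surfaced for inspection, by comparing only a subset of columns. The toy frame and the `key_cols` choice are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Song Artist": ["Artist A", "Artist A", "Artist A", "Artist B"],
        "Song Title": ["SONG X", "SONG X", "SONG X", "SONG Y"],
        "Part": ["intro", "intro", "hook", "intro"],
        "Chords": ["Em C G D", "Em C G D", "Em C G D", "C G"],
        "Lyrics": ["The wound heals", "Hope the wound heals", "These scars", "La la"],
    }
)

# Exact-match deduplication keeps both intro rows because their lyrics differ.
exact = toy.drop_duplicates()

# Comparing only the identifying columns (an assumed choice) flags them as repeats.
key_cols = ["Song Artist", "Song Title", "Part", "Chords"]
near_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]

print(len(exact))
print(near_dupes.index.tolist())
```

A check like this also flags legitimate repeats, such as a chorus played twice with identical chords, so it is better suited to manual review than to automatic removal.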


## Special characters

First, let's look at the different chord tokens present in the original dataset. Besides regular chords, the column contains other information, such as annotations and symbols like x2 that indicate repetition.

In [112]:
from collections import Counter
chords = []
for s in raw_data[~is_null_chords]["Chords"]:
    chords.extend(s.split())
chords = Counter(chords)

In [113]:
len(chords)

445

In [114]:
chords.most_common()

[('G', 6396),
 ('C', 5617),
 ('D', 3492),
 ('Am', 3435),
 ('Em', 3043),
 ('F', 2851),
 ('A', 1579),
 ('Dm', 1151),
 ('E', 1021),
 ('Bm', 1016),
 ('F#m', 525),
 ('Bb', 495),
 ('B', 482),
 ('Gm', 388),
 ('Eb', 370),
 ('C#m', 333),
 ('Cm', 285),
 ('Em7', 269),
 ('F#', 254),
 ('Cadd9', 254),
 ('Am7', 229),
 ('Cmaj7', 203),
 ('Fmaj7', 198),
 ('Ab', 197),
 ('G#m', 189),
 ('N.C.', 185),
 ('Dsus4', 167),
 ('Fm', 150),
 ('B7', 116),
 ('D7', 116),
 ('D#m', 110),
 ('A7', 107),
 ('-', 88),
 ('C#', 84),
 ('E7', 74),
 ('Dm7', 73),
 ('Dmaj7', 69),
 ('Gmaj7', 61),
 ('G#', 58),
 ('D#', 54),
 ('G7', 52),
 ('Asus2', 51),
 ('C#m7', 51),
 ('A7sus4', 48),
 ('x2', 47),
 ('A5', 46),
 ('D6', 46),
 ('Dsus2', 45),
 ('Amaj7', 43),
 ('G6', 42),
 ('F7', 38),
 ('C#5', 37),
 ('Gsus4', 35),
 ('Bm7', 35),
 ('Bbm', 33),
 ('Db', 31),
 ('C6', 31),
 ('Gb', 29),
 ('C7', 28),
 ('the', 28),
 ('Amaj9', 25),
 ('Dmaj9', 25),
 ('A#', 25),
 ('E6', 25),
 ('Cm7', 24),
 ('F#13', 24),
 ('Am9', 24),
 ('F#m11', 24),
 ('D7sus4', 23),
 ('

In [115]:
# na=False treats missing chords as non-matches; regex=False matches a literal "-"
raw_data[raw_data["Chords"].str.contains("-", regex=False, na=False)].head()

| | Song Artist | Song Title | Part | Chords | Lyrics | Genre |
|---|---|---|---|---|---|---|
| 62 | twenty one pilots | TEAR IN MY HEART | intro | D - D - D - D F#m - F#m G - G | | hiphop |
| 327 | twenty one pilots | NICO AND THE NINERS | verse 1 | - Am Dm Am Dm Am Dm Am | They want to make you forget They want to make... | hiphop |
| 328 | twenty one pilots | NICO AND THE NINERS | chorus | F Dm Am G F Dm Am G - F Dm Am G F Dm Am G - Dm... | Im heavy my jumpsuit is on steady Im lighter w... | hiphop |
| 330 | twenty one pilots | NICO AND THE NINERS | chorus | F Dm Am G F Dm Am G - F Dm Am G F Dm Am G - | Im heavy my jumpsuit is on steady Im lighter w... | hiphop |
| 331 | twenty one pilots | NICO AND THE NINERS | bridge | F Am G F Am G F Am G F Am G - | Im heavy jumpsuit is on steady Lighter when Im... | hiphop |
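
The token counts in this section mix real chords with separators (`-`), repetition markers (`x2`), the `N.C.` marker (conventionally "no chord"), and stray words such as `the`. Below is a minimal sketch of how tokens could be sorted into those groups with a regular expression. The pattern, the `classify` and `keep_chords` helpers, and the category names are all hypothetical; the pattern covers the chord spellings seen above but is not a complete chord grammar.

```python
import re
from collections import Counter

# Hypothetical chord pattern; it does not cover every chord spelling.
CHORD_RE = re.compile(
    r"[A-G][#b]?"                      # root note with optional accidental
    r"(?:maj|min|m|dim|aug|sus|add)?"  # optional quality
    r"\d*"                             # optional extension such as 7 or 11
    r"(?:(?:sus|add|maj|b|#)\d+)*"     # optional alterations such as sus4
    r"(?:/[A-G][#b]?)?"                # optional slash bass note
)
REPEAT_RE = re.compile(r"x\d+")        # repetition markers such as x2


def classify(token: str) -> str:
    """Return an assumed category name for a whitespace-separated token."""
    if token == "N.C.":
        return "no_chord"
    if REPEAT_RE.fullmatch(token):
        return "repeat"
    if CHORD_RE.fullmatch(token):
        return "chord"
    return "other"


def keep_chords(chord_string: str) -> str:
    """Drop every token that is not classified as a chord."""
    return " ".join(t for t in chord_string.split() if classify(t) == "chord")


tokens = "C Am Fmaj7 x2 - the N.C. A7sus4 F#m11".split()
print(Counter(classify(t) for t in tokens))
print(keep_chords("D - D - D - D F#m - F#m G - G"))
```

Single-letter words such as "A" would still pass as chords, so the "other" bucket is worth reviewing by hand before anything is discarded.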


## Data aggregation

Now we aggregate the data so that each song has a single string of chords, dropping the rows that have no chords.

In [116]:
data = raw_data[~is_null_chords].groupby(["Genre", "Song Artist", "Song Title"])['Chords'].apply(' '.join).reset_index()

In [117]:
len(data)

336
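
To make the aggregation step concrete, here is the same `groupby` and `' '.join` pattern applied to a small toy frame. The rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy sections for two made-up songs; one row per section, as in the dataset.
toy = pd.DataFrame(
    {
        "Genre": ["hiphop", "hiphop", "hiphop", "pop"],
        "Song Artist": ["Artist A", "Artist A", "Artist A", "Artist B"],
        "Song Title": ["SONG X", "SONG X", "SONG X", "SONG Y"],
        "Part": ["intro", "verse 1", "chorus", "intro"],
        "Chords": ["C Am Fmaj7 C", "C Am F C", "Fmaj7 C G", "G D"],
    }
)

# Join the chord strings of each song's sections, keeping the original row order.
songs = (
    toy.groupby(["Genre", "Song Artist", "Song Title"])["Chords"]
    .apply(" ".join)
    .reset_index()
)

print(songs["Chords"].tolist())
```

Because `Genre` is part of the grouping key, a song listed under two genres would produce two rows. Rows are joined in the order they appear in the frame, so the result depends on the sections being stored in song order.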

In [118]:
data["Chords"].sample(10).values.tolist()

['Am Em7 G6 Dsus2 F6 Am Em7 G6 Dsus2 F6 Am Em7 G6 Dsus2 F6 Am Em7 G6 Dsus2 F6 Am Em7 G6 Dsus2 F6 Am Em G G F Am Em G G F Am Em7 G6 Dsus2 F6 G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C G Am Dm7 C',
 'Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7 Dmaj7 Am7 Gmaj7 Bbmaj7',
 'C C Am Am C C Am Am F G C F G C F G C F G Am C F G C C C Am Am C C Am Am F G C C C Am G C C Am G C C Am G C C Am G F G C F G C F G C F G C F G Am C F G Am G F G C',
 'Am C G6 x3 G6sus2 Am C G D Am C G N.C. Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am C G D Am Cmaj7 