<h1><center>ABBA: Lyrics Emotions</center></h1>
<img src="../images/gettyimages-abba.jpeg" width="360" align="center"/>

This notebook documents the steps used to build a **song lyrics dataset**. The dataset covers studio albums only and keeps English-language songs; instrumental tracks, remixes, and live recordings are left out.

- Some documentation: https://en.wikipedia.org/wiki/ABBA_discography#Studio_albums

In [None]:
import pandas as pd
import numpy as np
import datetime
import lyricsgenius

Get a token to download song lyrics
- lyricsgenius: https://pypi.org/project/lyricsgenius/
- api client: https://genius.com/api-clients
- https://towardsdatascience.com/song-lyrics-genius-api-dcc2819c29

In [None]:
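# Read the token from an environment variable rather than hardcoding it in the notebook
# (the variable name GENIUS_ACCESS_TOKEN is an assumption; set it before starting Jupyter)
import os
token = os.environ['GENIUS_ACCESS_TOKEN']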
genius = lyricsgenius.Genius(token)

# genius.verbose = False # Turn off status messages
genius.remove_section_headers = True # Remove section headers (e.g. [Chorus]) from lyrics when searching
genius.skip_non_songs = False # Include hits thought to be non-songs (e.g. track lists)
genius.excluded_terms = ["(Remix)", "(Live)"] # Exclude songs with these words in their title

In [None]:
# Convert the release date components returned by Genius (a dict) into a datetime
def get_release_date(date_dict):
    # Read the components by key rather than relying on the order of the dict items
    return datetime.datetime(int(date_dict['year']), int(date_dict['month']), int(date_dict['day']))

# Function to create a lyrics dataset based on a given album name
def lyrics_df(album_name, id_album):
    album = genius.search_album(name=album_name, artist="ABBA", get_full_info=True, text_format=True)
    album_dict = album.to_dict()
    n_tracks = len(album_dict['tracks'])
    lyrics = [(track['song']['title'], track['song']['lyrics']) for track in album_dict['tracks']]
    df_out = pd.DataFrame(
        data={
            'id': id_album,
            'album': album.full_title,
            'release_date': [get_release_date(album_dict['release_date_components'])]*n_tracks,
            'n_tracks_original': n_tracks,
            'id_track': range(1, 1+n_tracks),
            'track': [single for single, _ in lyrics],
            'lyrics': [text for _, text in lyrics]
        }
    )
    print(album_dict['url'])
    return df_out
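
If an album's release date components were incomplete (for example a missing day), the conversion to `datetime` would fail. The sketch below is a hypothetical, more tolerant variant: it assumes the dictionary uses `year`, `month`, and `day` keys and falls back to 1 for a missing month or day. The name `parse_release_components` and the fallback rule are illustrative assumptions, not part of lyricsgenius.

```python
import datetime


def parse_release_components(components):
    """Build a datetime from a dict with 'year', 'month' and 'day' keys.

    Missing or None month/day values default to 1 (an assumption made
    for this sketch); a missing dict or year returns None.
    """
    if not components or components.get('year') is None:
        return None
    month = components.get('month') or 1
    day = components.get('day') or 1
    return datetime.datetime(int(components['year']), int(month), int(day))


# Hand-written sample values, not downloaded data
full = parse_release_components({'year': 1973, 'month': 3, 'day': 26})
partial = parse_release_components({'year': 1973, 'month': None, 'day': None})
print(full.date(), partial.date())
```

Whether defaulting to the first day of the month or year is acceptable depends on how precisely the release date is used later on.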

Albums are downloaded one at a time (8 in total) so that each track list can be inspected and cleaned individually.

## 1. Ring Ring
Released: 26 March 1973
- https://genius.com/albums/Abba/Ring-ring-international-edition
- https://en.wikipedia.org/wiki/Ring_Ring_(album)

In [None]:
# Download the album
album = genius.search_album(
    name='Ring Ring (International Edition)', artist="ABBA", get_full_info=True, text_format=True
)

In [None]:
# Explore its features
album_dict = album.to_dict()
album_dict

In [None]:
# Get the Genius URL of the album
print(album_dict['url'])

In [None]:
# Check the number of tracks
n_tracks = len(album_dict['tracks'])
n_tracks

In [None]:
# Get the song lyrics: a list with one (song title, song lyrics) tuple per track
lyrics = [(track['song']['title'], track['song']['lyrics']) for track in album_dict['tracks']]
lyrics

In [None]:
[track for track, _ in lyrics]
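
To make the nested structure easier to follow without calling the API, here is a minimal sketch on a hand-written stand-in for `album.to_dict()`. Only the keys used in this notebook (`tracks`, `song`, `title`, `lyrics`) are reproduced, and the sample titles and lyrics are made up.

```python
# Hypothetical stand-in for album.to_dict(); only the keys used in this
# notebook are reproduced, and the values are invented.
sample_album = {
    'tracks': [
        {'song': {'title': 'Song A', 'lyrics': 'la la la'}},
        {'song': {'title': 'Song B', 'lyrics': 'na na na'}},
    ]
}

# One (title, lyrics) tuple per track, iterating over the track entries
sample_lyrics = [
    (track['song']['title'], track['song']['lyrics'])
    for track in sample_album['tracks']
]

# Unpack the tuples to get only the titles
sample_titles = [title for title, _ in sample_lyrics]
print(sample_titles)
```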

In [None]:
# Collect the relevant information in a df
df_ring_ring = pd.DataFrame(
    data={
        'id': 1,
        'album': album.full_title,
        'release_date': [get_release_date(album_dict['release_date_components'])]*n_tracks,
        'n_tracks_original': n_tracks,
        'id_track': range(1, 1+n_tracks),
        'track': [single for single, _ in lyrics],
        'lyrics': [text for _, text in lyrics]
    }
)

In [None]:
# Drop the Swedish song
df_ring_ring = df_ring_ring.iloc[:-1].copy()  # copy so that the assignments below act on an independent frame
df_ring_ring['n_tracks'] = 14

In [None]:
# Add some additional notes
df_ring_ring['notes'] = np.nan

df_ring_ring.loc[df_ring_ring.id_track==1, 'notes'] = 'Third at the 1973 Melodifestivalen'
df_ring_ring.loc[df_ring_ring.id_track==4, 'notes'] = 'First single ever'
df_ring_ring.loc[df_ring_ring.id_track==13, 'notes'] = '2001 CD edition bonus tracks'
df_ring_ring.loc[df_ring_ring.id_track==14, 'notes'] = '2001 CD edition bonus tracks'

In [None]:
df_ring_ring
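
The per-track notes can also be written as a single mapping from track id to note, which keeps them in one place. This is only a sketch on a toy frame with invented notes; `notes_by_track` is a hypothetical name.

```python
import pandas as pd

# Toy frame standing in for an album DataFrame
toy = pd.DataFrame({'id_track': [1, 2, 3, 4]})

# Hypothetical notes keyed by track id
notes_by_track = {1: 'note for track 1', 4: 'note for track 4'}

# Tracks without an entry in the mapping get NaN
toy['notes'] = toy['id_track'].map(notes_by_track)
print(toy)
```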

## 2. Waterloo
Released: 4 March 1974
- https://genius.com/albums/Abba/Waterloo
- https://en.wikipedia.org/wiki/Waterloo_(album)

In [None]:
# Repeat the same initial steps using the lyrics_df helper defined above
df_waterloo = lyrics_df('Waterloo', 2)
df_waterloo

In [None]:
# Drop the remixed song (last row)
df_waterloo = df_waterloo.iloc[:-1].copy()
df_waterloo['n_tracks'] = 11

In [None]:
# Notes
df_waterloo['notes'] = np.nan
df_waterloo.loc[df_waterloo.id_track==1, 'notes'] = '1st at the Eurovision Song Contest on April 6, 1974'

In [None]:
df_waterloo

## 3. ABBA
Released: 21 April 1975
- https://genius.com/albums/Abba/Abba
- https://en.wikipedia.org/wiki/ABBA_(album)

In [None]:
df_abba = lyrics_df('Abba', 3)
df_abba

In [None]:
# In this case, we add two songs and drop the instrumental one ('Intermezzo No. 1')
df_abba['n_tracks'] = 12
df_abba['notes'] = np.nan
df_abba = df_abba.drop(index=8).reset_index(drop=True)

In [None]:
# 1st song to be added: 'Crazy World'
song_crazy_world = genius.search_song(title='Crazy World', artist='ABBA', get_full_info=True)

to_append = [3, df_abba.album[0], df_abba.release_date[0], df_abba.n_tracks_original[0], 11, song_crazy_world.title,
             song_crazy_world.lyrics, df_abba.n_tracks[0], '2001 CD edition bonus tracks']

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
to_append_df = pd.DataFrame([to_append], columns=df_abba.columns)
df_abba = pd.concat([df_abba, to_append_df], ignore_index=True)

In [None]:
# 2nd song: "Pick a Bale of Cotton"/"On Top of Old Smokey"/"Midnight Special" (medley)
song_medley = genius.search_song(title='Pick a Bale of Cotton"/"On Top of Old Smokey"/"Midnight Special',
                                 artist='ABBA', get_full_info=True)

to_append = [3, df_abba.album[0], df_abba.release_date[0], df_abba.n_tracks_original[0], 12, song_medley.title,
             song_medley.lyrics, df_abba.n_tracks[0], '2001 CD edition bonus tracks']

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
to_append_df = pd.DataFrame([to_append], columns=df_abba.columns)
df_abba = pd.concat([df_abba, to_append_df], ignore_index=True)

In [None]:
df_abba.loc[:, 'id_track'] = range(1, 1+df_abba.shape[0]) # reset the id_track
df_abba
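
Adding a row and then renumbering the tracks can be tried out on a toy frame first. The sketch below uses `pd.concat`, which is the replacement for `DataFrame.append` (removed in pandas 2.0); the track names are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'id_track': [1, 2, 4],
    'track': ['Song A', 'Song B', 'Song D'],
})

# One new row, expressed as a single-row DataFrame with the same columns
new_row = pd.DataFrame([[99, 'Bonus song']], columns=toy.columns)
toy = pd.concat([toy, new_row], ignore_index=True)

# Renumber the tracks consecutively after the change
toy['id_track'] = range(1, len(toy) + 1)
print(toy)
```

Building the row as a one-row DataFrame keeps the column order explicit, so a value cannot silently land in the wrong column.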

## 4. Arrival
Released: 11 October 1976
- https://genius.com/albums/Abba/Arrival
- https://en.wikipedia.org/wiki/Arrival_(ABBA_album)

In [None]:
df_arrival = lyrics_df('Arrival', 4)
df_arrival

In [None]:
# Again, we add 2 songs and drop 'Arrival', which is an instrumental with vocalisations
df_arrival['n_tracks'] = 11
df_arrival['notes'] = np.nan
df_arrival = df_arrival.drop(index=9).reset_index(drop=True)

In [None]:
# 1st song: 'Fernando'
song_fernando = genius.search_song(title='Fernando', artist='ABBA', get_full_info=True)

to_append = [4, df_arrival.album[0], df_arrival.release_date[0], df_arrival.n_tracks_original[0], 10,
             song_fernando.title, song_fernando.lyrics, df_arrival.n_tracks[0], '1997 CD edition bonus track']

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
to_append_df = pd.DataFrame([to_append], columns=df_arrival.columns)
df_arrival = pd.concat([df_arrival, to_append_df], ignore_index=True)

In [None]:
# 2nd song: 'Happy Hawaii'
song_happy_hawaii = genius.search_song(title='Happy Hawaii', artist='ABBA', get_full_info=True)

to_append = [4, df_arrival.album[0], df_arrival.release_date[0], df_arrival.n_tracks_original[0], 11,
             song_happy_hawaii.title, song_happy_hawaii.lyrics, df_arrival.n_tracks[0], '2001 CD edition bonus tracks']

# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
to_append_df = pd.DataFrame([to_append], columns=df_arrival.columns)
df_arrival = pd.concat([df_arrival, to_append_df], ignore_index=True)

In [None]:
df_arrival

## 5. ABBA: The Album
Released: 12 December 1977
- https://genius.com/albums/Abba/Abba-the-album
- https://en.wikipedia.org/wiki/ABBA:_The_Album

In [None]:
df_abba_album = lyrics_df('ABBA: The Album', 5)
df_abba_album

In [None]:
df_abba_album

In [None]:
df_abba_album['n_tracks'] = 9
df_abba_album['notes'] = np.nan

## 6. Voulez-Vous
Released: 23 April 1979
- https://genius.com/albums/Abba/Voulez-vous
- https://en.wikipedia.org/wiki/Voulez-Vous

In [None]:
df_voulez_vous = lyrics_df('Voulez-Vous', 6)
df_voulez_vous

In [None]:
# Adjustments
df_voulez_vous['n_tracks'] = 14
df_voulez_vous['notes'] = np.nan
df_voulez_vous.loc[df_voulez_vous.id_track==11, 'notes'] = '1997 CD edition bonus tracks'
df_voulez_vous.loc[df_voulez_vous.id_track==12, 'notes'] = 'The Definitive Collection 2001 CD edition bonus tracks'
df_voulez_vous.loc[df_voulez_vous.id_track==13, 'notes'] = 'The Definitive Collection 2001 CD edition bonus tracks'
df_voulez_vous.loc[df_voulez_vous.id_track==17, 'notes'] = '2010 deluxe edition (The Dynamic Album) bonus tracks'
df_voulez_vous = df_voulez_vous[~df_voulez_vous.id_track.isin([14, 15, 16, 18])].copy()

In [None]:
# Reset index and id track column
df_voulez_vous.loc[:, 'id_track'] = range(1, 1+df_voulez_vous.shape[0])
df_voulez_vous = df_voulez_vous.reset_index(drop=True)
df_voulez_vous
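
The drop-and-renumber pattern used here can be checked on a toy frame. The sketch takes an explicit `.copy()` after filtering, so the later assignment acts on an independent DataFrame and avoids pandas' `SettingWithCopyWarning`; the track names are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'id_track': [1, 2, 3, 4, 5],
    'track': ['A', 'B (alt mix)', 'C', 'D (alt mix)', 'E'],
})

# Keep every track whose id is not in the exclusion list
kept = toy[~toy['id_track'].isin([2, 4])].copy()

# Renumber the remaining tracks and rebuild the index
kept['id_track'] = range(1, len(kept) + 1)
kept = kept.reset_index(drop=True)
print(kept)
```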

## 7. Super Trouper
Released: 3 November 1980
- https://genius.com/albums/Abba/Super-trouper
- https://en.wikipedia.org/wiki/Super_Trouper_(album)

In [None]:
df_super_trouper = lyrics_df('Super Trouper', 7)
df_super_trouper

In [None]:
# Adjustments
df_super_trouper['n_tracks'] = 12
df_super_trouper['notes'] = np.nan
df_super_trouper = df_super_trouper.iloc[:-2].copy()
df_super_trouper.loc[df_super_trouper.id_track==11, 'notes'] = '1997 CD edition bonus tracks'
df_super_trouper.loc[df_super_trouper.id_track==12, 'notes'] = '1997 CD edition bonus tracks'

In [None]:
df_super_trouper

## 8. The Visitors
Released: 30 November 1981
- https://genius.com/albums/Abba/The-visitors
- https://en.wikipedia.org/wiki/The_Visitors_(ABBA_album)

In [None]:
df_visitors = lyrics_df('The Visitors', 8)
df_visitors

In [None]:
# Adjustments
df_visitors['n_tracks'] = 16
df_visitors['notes'] = np.nan

df_visitors = df_visitors[~df_visitors.id_track.isin([11, 12])].copy() # Drop Spanish songs

df_visitors.loc[df_visitors.id_track==10, 'notes'] = '1997 CD edition bonus tracks'
df_visitors.loc[df_visitors.id_track==14, 'notes'] = '1997 CD edition bonus tracks'
df_visitors.loc[df_visitors.id_track==16, 'notes'] = '1997 CD edition bonus tracks'
df_visitors.loc[df_visitors.id_track==17, 'notes'] = '1997 CD edition bonus tracks'

df_visitors.loc[df_visitors.id_track==15, 'notes'] = '2001 CD edition bonus tracks'
df_visitors.loc[df_visitors.id_track==13, 'notes'] = '2012 deluxe edition (The Final Album) bonus tracks'
df_visitors.loc[df_visitors.id_track==18, 'notes'] = '2012 deluxe edition (The Final Album) bonus tracks'

In [None]:
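# Reset the index and renumber the tracks, as done for the other edited albums
df_visitors = df_visitors.reset_index(drop=True)
df_visitors['id_track'] = range(1, 1+df_visitors.shape[0])
df_visitors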

## Concatenate all 8 datasets and save the resulting DataFrame

In [None]:
df = pd.concat([
    df_ring_ring,
    df_waterloo,
    df_abba,
    df_arrival,
    df_abba_album,
    df_voulez_vous,
    df_super_trouper,
    df_visitors
])

df.reset_index(drop=True, inplace=True)

In [None]:
df.shape
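
Because `n_tracks` is typed in by hand for each album, it may be worth comparing it with the number of rows actually kept. The sketch below shows one way to do this on a toy frame with invented values; applying it to `df` assumes the `id`, `album`, and `n_tracks` columns built above.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'album': ['Album X', 'Album X', 'Album Y', 'Album Y', 'Album Y'],
    'n_tracks': [2, 2, 4, 4, 4],  # Album Y is deliberately wrong
})

# Count the rows per album and pick up the declared track count
check = (
    toy.groupby(['id', 'album'])
       .agg(rows=('n_tracks', 'size'), declared=('n_tracks', 'first'))
       .reset_index()
)

# Albums where the declared count differs from the actual row count
mismatches = check[check['rows'] != check['declared']]
print(mismatches)
```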

In [None]:
df.to_csv('../data/df_abba_lyrics.csv', index=False)
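
When the CSV is read back, `release_date` comes in as plain text unless it is parsed explicitly. The sketch below round-trips a toy frame through an in-memory buffer to illustrate this, including a lyric with a line break; the values are invented and the buffer stands in for the file on disk.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'track': ['Song A'],
    'release_date': [pd.Timestamp('1973-03-26')],
    'lyrics': ['first line\nsecond line'],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype of the release date column
restored = pd.read_csv(buffer, parse_dates=['release_date'])
print(restored.dtypes)
```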