# Taylor Swift Discography: Part I - Data Collection

## Introduction

This notebook is the first in a series analyzing Taylor Swift's discography. Here, I collect and compile the data by scraping [Genius](https://genius.com/), an online music encyclopedia, with [parsel](https://parsel.readthedocs.io/en/latest/), and use pandas to build the dataframe.

Genius does have an [API](https://docs.genius.com/) for application development. However, the API does not expose song lyrics, which are a crucial component of this project. A third-party Python package, [lyricsgenius](https://lyricsgenius.readthedocs.io/en/master/), works in tandem with the API and scrapes the lyrics on the user's behalf. I tested both the API and lyricsgenius before ultimately deciding to collect the data myself.

The resulting dataframe from this notebook is available in the [datasets folder](https://github.com/madroscla/taylor_swift_discography/tree/main/datasets) as a JSON file, CSV file, and pickle file.

In [1]:
import os
import re

import pandas as pd
import requests
from parsel import Selector

## Webscraping and Initial Dataframe

Genius is very thorough in cataloguing an artist's discography, tracking all physical and digital variants of a release with different tracklists, as well as singles, EPs, demo CDs, official playlists, and special releases. As a result, Genius lists Taylor Swift as having over [100 different albums](https://genius.com/artists/Taylor-swift/albums), most of which contain repeated songs. To keep processing time down, I manually select the releases used to create the initial dataframe: deluxe versions and "Taylor's Version" rerecordings of her studio albums are preferred because they include all existing bonus tracks and "From The Vault" songs for each album, and non-album singles and streaming-specific EPs that contain previously unreleased tracks are included as well. Since this project includes analyses of her lyrics, the goal is to capture every *unique* song in her discography, not necessarily every *possible* song.

The dataframe is structured as follows:

* **album_title**: the title of the album (or other release) containing each song
* **album_url**: the URL to the Genius page for the album
* **album_era**: the musical "era" the album was released in
  * Note: this terminology was brought about via [The Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour) and usually refers to the time period around a studio album's release. For songs not on studio albums, the era is whichever one was closest to when the song was written or released.
* **album_track_number**: the track number for the song on the given album
* **song_title**: the title of the song
* **song_url**: the URL to the Genius page for the song
* **song_lyrics**: the lyrics to the song (returned as a single string)
* **song_writers**: the writer(s) for the song (returned as a list)
* **song_producers**: the producer(s) for the song (returned as a list)
* **song_tags**: the genre tag(s) for the song (returned as a list)
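
As a rough illustration of this layout, a single row can be sketched as a plain Python dictionary. Every value below is a placeholder rather than scraped data, and the `example_row` name is hypothetical; the point is only to show which fields hold single values and which hold lists.

```python
# Placeholder values only; nothing here is scraped from Genius
example_row = {
    'album_title': '<album title>',
    'album_url': '<Genius album URL>',
    'album_era': '<era name>',
    'album_track_number': '<track number>',
    'song_title': '<song title>',
    'song_url': '<Genius song URL>',
    'song_lyrics': '<all lyrics joined into one string>',
    'song_writers': ['<writer 1>', '<writer 2>'],
    'song_producers': ['<producer 1>'],
    'song_tags': ['<tag 1>', '<tag 2>'],
}

# The three list-valued fields are the ones that need care when exporting
list_fields = [key for key, value in example_row.items() if isinstance(value, list)]
print(list_fields)
```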

Several helper functions scrape and compile the data before it is put into a dataframe.

In [2]:
# Manually selected albums/EPs from Genius' list
# Maximizes the number of unique songs while reducing cleanup
# dict = {'album_title': 'album_era'}
albums = {'Taylor Swift': 'Taylor Swift',
          'Beautiful Eyes - EP': 'Taylor Swift',
          "Fearless (Taylor's Version)": 'Fearless',
          "Speak Now (Taylor's Version)": 'Speak Now',
          "Red (Taylor's Version)": 'Red',
          "1989 (Taylor's Version) [Tangerine Edition]": '1989',
          'reputation': 'reputation',
          'Lover': 'Lover',
          'folklore (deluxe version)': 'folklore',
          'Christmas Tree Farm - 12" Single Picture Disc': 'Lover',
          'evermore (deluxe version)': 'evermore',
          'Carolina (From The Motion Picture "Where The Crawdads Sing")': 'folklore',
          'Midnights (3am Edition)': 'Midnights',
          'Midnights (The Late Night Edition)': 'Midnights',
          "The More Red (Taylor's Version) Chapter": 'Red',
          "The More Fearless (Taylor's Version) Chapter": 'Fearless',
          'The More Lover Chapter': 'Lover'}

In [3]:
def album_clean_titles(album_list):
    """Formats the album title for Genius URLs."""
    cleaned = []
    for title in album_list:
        # Raw strings avoid invalid-escape warnings in the regex patterns
        title = re.sub(r"[^\w\s]|\s-|'", '', title)
        title = re.sub(r'\s', '-', title)
        cleaned.append(title)
    return cleaned

def album_get_tracklist(url):
    """Returns tracklist of given Genius album URL.

       Includes track number, song title, and link to the lyrics page
       for each song.
    """
    album_page = requests.get(url).text
    selector = Selector(text=album_page)

    numbers = selector.xpath(
        '//div[@class="chart_row-number_container chart_row-number_container--align_left"]/span/span/text()'
    ).getall()
    raw_titles = selector.xpath('//div[@class="chart_row-content"]/a/h3/text()').getall()
    song_urls = selector.xpath('//div[@class="chart_row-content"]/a/@href').getall()
    clean_titles = []

    for title in raw_titles:
        # Strips newlines and zero-width spaces, converts non-breaking spaces
        title = re.sub('\n|\u200b', '', title)
        title = re.sub('\xa0', ' ', title)
        title = title.strip()
        if title != '':
            clean_titles.append(title)

    # Distinct names keep the 'url' argument from being shadowed
    tracklist = [{'album_track_number': number,
                  'song_title': title,
                  'song_url': song_url}
                 for number, title, song_url in zip(numbers, clean_titles, song_urls)]
    return tracklist

def song_get_lyrics(url):
    """Returns lyrics of given Genius song URL."""
    song_page = requests.get(url).text
    selector = Selector(text=song_page)

    raw_lyrics = selector.xpath('//div[@data-lyrics-container="true"]//text()').getall()
    lyrics_list = [re.sub('\u2005', ' ', lyric) for lyric in raw_lyrics]
    lyrics = ' '.join(lyrics_list)
    # Removes section headers like [Chorus], then tidies leftover whitespace
    lyrics = re.sub(r'\[.*?\]', '', lyrics)
    lyrics = re.sub(r'\s\s', ' ', lyrics)
    lyrics = re.sub(r'\(\s', '(', lyrics)
    lyrics = re.sub(r'\s\)', ')', lyrics)
    lyrics = re.sub(r'^\s', '', lyrics)

    return lyrics

def song_get_tags(url):
    """Returns genre tags of given Genius song URL."""
    song_page = requests.get(url).text
    selector = Selector(text=song_page)

    tags = selector.xpath('//div[@class="SongTags__Container-xixwg3-1 bZsZHM"]//text()').getall()
    return tags

def song_get_credits(url, credit):
    """Returns list of writers/producers of given Genius song URL.

       Variable 'credit' has to be either 'producers' or 'writers' and will
       return list of names.
    """
    if credit == 'writers':
        query = 'Written By'
    elif credit == 'producers':
        query = 'Produced By'
    else:
        raise ValueError("credit must be 'writers' or 'producers'")
    
    song_page = requests.get(url).text
    selector = Selector(text=song_page)
    raw_list = selector.xpath(
        '//div[@class="SongInfo__Credit-nekw6x-3 fognin" and contains(., "{}")]//text()'.format(query)
    ).getall()
    
    dropped = ['Written By', 'Produced By', ' & ', ', ']
    credits = [name for name in raw_list if name not in dropped]
    return credits
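
The regex cleanup in these helpers can be checked offline, without requesting any pages. The sketch below copies the title and lyrics patterns into two standalone functions (`slugify_title` and `tidy_lyrics` are hypothetical names, not part of the notebook) and runs them on short sample strings; the lyric text is a made-up placeholder.

```python
import re


def slugify_title(title):
    """Standalone copy of the album title cleanup, for one title."""
    title = re.sub(r"[^\w\s]|\s-|'", '', title)  # drop punctuation and ' -'
    return re.sub(r'\s', '-', title)             # spaces become hyphens


def tidy_lyrics(text):
    """Standalone copy of the lyrics cleanup steps."""
    text = re.sub(r'\[.*?\]', '', text)  # remove [Verse 1]-style headers
    text = re.sub(r'\s\s', ' ', text)    # collapse doubled whitespace
    text = re.sub(r'\(\s', '(', text)    # no space after '('
    text = re.sub(r'\s\)', ')', text)    # no space before ')'
    return re.sub(r'^\s', '', text)      # trim one leading space


print(slugify_title("Fearless (Taylor's Version)"))    # Fearless-Taylors-Version
print(slugify_title('Beautiful Eyes - EP'))            # Beautiful-Eyes-EP
print(tidy_lyrics('[Verse 1] Hello ( world ) again'))  # Hello (world) again
```

Checking the patterns this way makes it easier to spot a title whose slug would not match its Genius URL before any scraping is done.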

In [4]:
def data_collection(albums_dict):
    """Compiles all webscraping data into one list of dictionaries."""
    albums = list(albums_dict.keys())
    eras = list(albums_dict.values())
    cleaned_albums = album_clean_titles(albums)
    album_urls = ['https://genius.com/albums/Taylor-swift/{}'.format(title) for title in cleaned_albums]

    tracklists = [album_get_tracklist(url) for url in album_urls]

    song_urls = [track['song_url'] for tracklist in tracklists for track in tracklist]

    song_lyrics = [song_get_lyrics(song) for song in song_urls]
    song_writers = [song_get_credits(song, 'writers') for song in song_urls]
    song_producers = [song_get_credits(song, 'producers') for song in song_urls]
    song_tags = [song_get_tags(song) for song in song_urls]

    list_index = 0

    for album in tracklists:
        for track in album:
            track.update({'song_lyrics':song_lyrics[list_index], 
                          'song_writers': song_writers[list_index], 
                          'song_producers': song_producers[list_index], 
                          'song_tags': song_tags[list_index]})
            list_index += 1
        
    collection = [{'album_title': album,
                   'album_url': url,
                   'album_era': era,
                   'album_tracklist': tracklist}
                  for album, url, era, tracklist in zip(albums, album_urls, eras, tracklists)]
    return collection

In [5]:
# Initial dataframe creation, reorders columns
data = data_collection(albums)
raw_tswift = pd.json_normalize(data=data, record_path='album_tracklist', meta=['album_title', 
                                                                           'album_url', 
                                                                           'album_era'])
raw_tswift = raw_tswift.reindex(columns=['album_title', 'album_url', 'album_era', 'album_track_number', 'song_title', 
                                 'song_url', 'song_lyrics', 'song_writers', 'song_producers', 'song_tags'])
raw_tswift['album_era'].value_counts()

album_era
Midnights       41
Red             37
Fearless        32
Lover           25
Speak Now       22
1989            22
Taylor Swift    21
folklore        19
evermore        17
reputation      15
Name: count, dtype: int64
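
To make the flattening step more concrete, here is a minimal sketch of how `pd.json_normalize` turns a nested list of albums into one row per song. The `toy` data is entirely made up and only mimics the shape of what `data_collection` returns, with fewer fields.

```python
import pandas as pd

# Made-up stand-in for the nested structure: one dict per album,
# each holding a list of track dicts under 'album_tracklist'
toy = [
    {'album_title': 'Album A',
     'album_url': 'https://example.com/album-a',
     'album_era': 'Era A',
     'album_tracklist': [
         {'album_track_number': '1', 'song_title': 'Song One'},
         {'album_track_number': '2', 'song_title': 'Song Two'}]},
    {'album_title': 'Album B',
     'album_url': 'https://example.com/album-b',
     'album_era': 'Era B',
     'album_tracklist': [
         {'album_track_number': '1', 'song_title': 'Song Three'}]},
]

# record_path picks the nested list to expand into rows;
# meta repeats the album-level fields on every row
flat = pd.json_normalize(data=toy,
                         record_path='album_tracklist',
                         meta=['album_title', 'album_url', 'album_era'])

print(flat.shape)  # (3, 5)
print(flat['album_title'].tolist())  # ['Album A', 'Album A', 'Album B']
```

The track-level columns come first in the result and the `meta` columns are appended after them, which is why the notebook reorders the columns with `reindex` afterwards.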

## Data Cleaning and Additional Entries

As noted earlier, many of Taylor Swift's releases contain repeated songs. For example, the streaming-specific EP *The More Lover Chapter* contains five songs: four also appear on the standard release of the parent album, *Lover*, and only one is unique to the EP. There are also numerous remixes and alternative versions that are lyrically identical to the originals. Both the repeated songs and the remixes need to be removed from the dataframe; the repeats can be filtered out easily by title, while the remixes require some manual work. I also remove the five-minute version of "All Too Well" from the dataframe, since it's a shortened version of the ten-minute version and could cause problems during analysis.

Taylor also has several songs that she wrote for movies or co-wrote with other artists that didn't end up on one of her own releases. The albums containing them weren't included in the initial data pull, so as not to require more cleanup after the fact (i.e. removing the other songs on the release that Taylor didn't write). Those songs get added to the dataframe individually.

In [6]:
# Drops duplicate songs based on titles (e.g. rereleases on EPs)
tswift = raw_tswift.drop_duplicates(subset=['song_title'])

# Drops specific songs (alternative productions/remixes of existing songs)
dropped_song_titles = ['Teardrops on My Guitar (Pop Version)',
                       "Should've Said No (Alternate Version)",
                       'Teardrops On My Guitar (Acoustic)',
                       'Picture To Burn (Radio Edit)',
                       "Forever & Always (Piano Version) [Taylor's Version]",
                       "All Too Well (Taylor's Version)",
                       "State Of Grace (Acoustic Version) (Taylor’s Version)",
                       'A Message From Taylor',
                       'Carolina (Video Version)',
                       'Christmas Tree Farm (Recorded Live at the 2019 iHeartRadio Jingle Ball)',
                       'Karma (Remix) (Ft. Ice Spice)']
tswift = tswift[~tswift.song_title.isin(dropped_song_titles)]
tswift['album_era'].value_counts()

album_era
Red             30
Fearless        27
Speak Now       22
1989            22
Midnights       22
Lover           20
folklore        18
evermore        17
Taylor Swift    16
reputation      15
Name: count, dtype: int64
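
The two-step cleanup can be sketched on a toy dataframe. All titles below are invented; the example shows that `drop_duplicates` keeps the first occurrence of a title, so a repeated song stays attached to whichever release comes first in the data, and that variants with a different title still have to be listed by hand.

```python
import pandas as pd

# Invented data: 'Song One' appears on two releases, plus a remix
toy = pd.DataFrame({
    'album_title': ['Album A', 'Album A', 'EP B', 'EP B'],
    'song_title': ['Song One', 'Song One (Remix)', 'Song One', 'Song Two'],
})

# Step 1: exact title repeats are dropped, keeping the first occurrence
deduped = toy.drop_duplicates(subset=['song_title'])

# Step 2: remixes have distinct titles, so they are removed by name
dropped_titles = ['Song One (Remix)']
cleaned = deduped[~deduped['song_title'].isin(dropped_titles)]

print(cleaned['song_title'].tolist())   # ['Song One', 'Song Two']
print(cleaned['album_title'].tolist())  # ['Album A', 'EP B']
```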

In [7]:
def df_add_row(df, album_url, album_era, song_url):
    """Adds new row to given dataframe.

       Data is collected using the given variables (album_url, album_era, and song_url),
       added to a temporary new dataframe before being concatenated to the original dataframe.
    """
    album_page = requests.get(album_url).text
    album_selector = Selector(text=album_page)

    song_page = requests.get(song_url).text
    song_selector = Selector(text=song_page)

    album_title = album_selector.xpath('//h1[contains(@class, "header_with_cover_art")]//text()').get()
    song_title = song_selector.xpath('//h1[contains(@class, "SongHeaderdesktop")]//text()').get()

    number_string = song_selector.xpath('//div[contains(@class, "HeaderArtistAndTracklist")]/text()').get()
    number = int(re.sub(r'\D', '', number_string))

    lyrics = song_get_lyrics(song_url)
    writers = song_get_credits(song_url, 'writers')
    producers = song_get_credits(song_url, 'producers')
    tags = song_get_tags(song_url)

    new_row = [album_title, album_url, album_era, number, song_title, song_url, lyrics, writers, producers, tags]
    new_df = pd.DataFrame([new_row], columns=df.columns)
    df = pd.concat([df, new_df], ignore_index=True)
    return df
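
One detail of the track number extraction is worth illustrating: removing every non-digit character joins all digits in the string, so any extra number in the header text would end up in the result. The sketch below compares that approach with taking only the first run of digits. The `sample` string and the `first_number` helper are hypothetical, and the actual text on a Genius song page may look different.

```python
import re


def first_number(text):
    """Returns the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Hypothetical header text; the real Genius markup may differ
sample = 'Track 3 on Album 2'

all_digits = int(re.sub(r'\D', '', sample))  # every digit is kept and joined
first_only = first_number(sample)            # only the leading number is kept

print(all_digits, first_only)  # 32 3
```

If the scraped track numbers ever look implausibly large, this is one possible cause to check.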

In [8]:
# Adds 'Christmases When You Were Mine' (Christmas EP)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Taylor-swift/The-taylor-swift-holiday-collection-ep',
    'Taylor Swift',
    'https://genius.com/Taylor-swift-christmases-when-you-were-mine-lyrics')

# Adds 'Christmas Must Be Something More' (Christmas EP)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Taylor-swift/The-taylor-swift-holiday-collection-ep',
    'Taylor Swift',
    'https://genius.com/Taylor-swift-christmas-must-be-something-more-lyrics')

# Adds 'Beautiful Ghosts' (Cats soundtrack)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Andrew-lloyd-webber/Cats-highlights-from-the-motion-picture-soundtrack',
    'Lover',
    'https://genius.com/Taylor-swift-beautiful-ghosts-lyrics')

# Adds 'Crazier' (Hannah Montana Movie soundtrack)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Taylor-swift/Itunes-essentials',
    'Fearless',
    'https://genius.com/Taylor-swift-crazier-lyrics')

# Adds 'You'll Always Find Your Way Back Home' (Hannah Montana Movie soundtrack)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Hannah-montana/Hannah-montana-the-movie-original-motion-picture-soundtrack',
    'Fearless',
    'https://genius.com/Hannah-montana-youll-always-find-your-way-back-home-lyrics')

# Adds 'I Don't Wanna Live Forever' (Fifty Shades Darker soundtrack)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Various-artists/Fifty-shades-darker-original-motion-picture-soundtrack',
    'reputation',
    'https://genius.com/Zayn-and-taylor-swift-i-dont-wanna-live-forever-lyrics')

# Adds 'Two Is Better Than One' (Boys Like Girls)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Boys-like-girls/Love-drunk',
    'Fearless',
    'https://genius.com/Boys-like-girls-two-is-better-than-one-lyrics')

# Adds 'This Is What You Came For' (Calvin Harris)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Now-thats-what-i-call-music/Now-thats-what-i-call-music-94-uk',
    '1989',
    'https://genius.com/Calvin-harris-this-is-what-you-came-for-lyrics')

# Adds 'The Alcott' (The National)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/The-national/First-two-pages-of-frankenstein',
    'Midnights',
    'https://genius.com/The-national-the-alcott-lyrics')

# Adds 'Renegade' (Big Red Machine)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Big-red-machine/How-long-do-you-think-its-gonna-last',
    'evermore',
    'https://genius.com/Big-red-machine-renegade-lyrics')

# Adds 'Bein' With My Baby' (Shea Fisher)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Shea-fisher/Shea',
    'Fearless',
    'https://genius.com/Shea-fisher-bein-with-my-baby-lyrics')

# Adds 'Best Days of Your Life' (Kellie Pickler)
tswift = df_add_row(
    tswift,
    'https://genius.com/albums/Kellie-pickler/Kellie-pickler-deluxe-edition',
    'Fearless',
    'https://genius.com/Kellie-pickler-best-days-of-your-life-lyrics')

tswift.tail(15)

| | album_title | album_url | album_era | album_track_number | song_title | song_url | song_lyrics | song_writers | song_producers | song_tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 206 | The More Red (Taylor's Version) Chapter | https://genius.com/albums/Taylor-swift/The-Mor... | Red | 6 | Safe & Sound (Taylor's Version) by Taylor Swif... | https://genius.com/Taylor-swift-joy-williams-a... | I remember tears streaming down your face when... | [Taylor Swift, Joy Williams, John Paul White, ... | [Christopher Rowe, Taylor Swift] | [Country, Pop, Country Pop, Alternative Pop, A... |
| 207 | The More Fearless (Taylor's Version) Chapter | https://genius.com/albums/Taylor-swift/The-Mor... | Fearless | 5 | If This Was a Movie (Taylor's Version) | https://genius.com/Taylor-swift-if-this-was-a-... | Last night, I heard my own heart beatin' Sound... | [Taylor Swift, Martin Johnson] | [Christopher Rowe, Taylor Swift] | [Country, Pop, Country Pop, Singer-Songwriter,... |
| 208 | The More Lover Chapter | https://genius.com/albums/Taylor-swift/The-Mor... | Lover | 5 | All Of The Girls You Loved Before | https://genius.com/Taylor-swift-all-of-the-gir... | When you think of all the late nights Lame fig... | [Taylor Swift, Ging, Louis Bell] | [Louis Bell, Ging, Taylor Swift] | [Pop, Ballad, Adult Contemporary, Singer-Songw... |
| 209 | The Taylor Swift Holiday Collection - EP | https://genius.com/albums/Taylor-swift/The-tay... | Taylor Swift | 2 | Christmases When You Were Mine | https://genius.com/Taylor-swift-christmases-wh... | Please take down the mistletoe 'Cause I don't ... | [Nathan Chapman, Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Pop, Country Pop, Holiday, Singer-So... |
| 210 | The Taylor Swift Holiday Collection - EP | https://genius.com/albums/Taylor-swift/The-tay... | Taylor Swift | 14 | Christmas Must Be Something More | https://genius.com/Taylor-swift-christmas-must... | What if ribbons and bows didn't mean a thing? ... | [Taylor Swift] | [Nathan Chapman] | [Country, Pop, Country Pop, Holiday, Singer-So... |
| 211 | Cats: Highlights From the Motion Picture Sound... | https://genius.com/albums/Andrew-lloyd-webber/... | Lover | 17 | Beautiful Ghosts | https://genius.com/Taylor-swift-beautiful-ghos... | Follow me home if you dare to I wouldn't know ... | [Andrew Lloyd Webber, Taylor Swift] | [Nile Rodgers, Greg Wells] | [Pop, Singer-Songwriter, Soundtrack, Adult Con... |
| 212 | iTunes Essentials | https://genius.com/albums/Taylor-swift/Itunes-... | Fearless | 13 | Crazier | https://genius.com/Taylor-swift-crazier-lyrics | I'd never gone with the wind, just let it flow... | [Taylor Swift, Robert Ellis Orrall] | [Taylor Swift, Nathan Chapman] | [Country, Country Pop, Soundtrack] |
| 213 | Hannah Montana: The Movie (Original Motion Pic... | https://genius.com/albums/Hannah-montana/Hanna... | Fearless | 1 | You’ll Always Find Your Way Back Home | https://genius.com/Hannah-montana-youll-always... | Woo! You wake up It’s raining and it’s Monday ... | [Taylor Swift, Martin Johnson] | [Matthew Gerrard] | [Pop, Teen Pop, TV, Disney, Soundtrack] |
| 214 | Fifty Shades Darker (Original Motion Picture S... | https://genius.com/albums/Various-artists/Fift... | reputation | 1 | I Don’t Wanna Live Forever | https://genius.com/Zayn-and-taylor-swift-i-don... | Been sittin' eyes wide open Behind these four ... | [Sam Dew, Taylor Swift, Jack Antonoff] | [Jack Antonoff] | [R&B, Pop, Adult Contemporary, Gospel, Alterna... |
| 215 | Love Drunk | https://genius.com/albums/Boys-like-girls/Love... | Fearless | 4 | Two Is Better Than One | https://genius.com/Boys-like-girls-two-is-bett... | I remember what you wore on the first day You ... | [Sam Hollander, Dave Katz, Taylor Swift, Marti... | [Brian Howes] | [Country, Pop, Adult Contemporary, Duet, Count... |


## Data Exporting

To save time in later notebooks, I export the dataframe to a pickle file. I also create JSON and CSV versions for future projects or in case others want to use the dataset themselves.

In [9]:
# Creates the output folder first so the exports below don't fail
os.makedirs('datasets', exist_ok=True)

tswift.to_pickle('datasets/taylor_swift_discography.pkl')

tswift.to_json('datasets/taylor_swift_discography.json', orient='table')

tswift.to_csv('datasets/taylor_swift_discography.csv')
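
For anyone loading the CSV version later, two things are worth knowing: `to_csv` writes the index as an unnamed first column, and list-valued columns such as `song_writers` come back as plain strings. The sketch below shows one way to handle both, using a made-up one-row dataframe and an in-memory buffer in place of the real file.

```python
import ast
import io

import pandas as pd

# Made-up row with one list-valued column
toy = pd.DataFrame({'song_title': ['Song One'],
                    'song_writers': [['Writer A', 'Writer B']]})

# Round trip through CSV text (a buffer stands in for the file on disk)
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 reads the saved index back instead of creating 'Unnamed: 0'
restored = pd.read_csv(buffer, index_col=0)

# The list was stored as its string form, so it is parsed back into a list
restored['song_writers'] = restored['song_writers'].apply(ast.literal_eval)

print(restored.columns.tolist())         # ['song_title', 'song_writers']
print(restored['song_writers'].iloc[0])  # ['Writer A', 'Writer B']
```

The pickle file avoids this conversion entirely, which is one reason it is the more convenient format for the later notebooks.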