# Taylor Swift Discography: Part I - Data Collection

## Introduction

This notebook is the first in a series analyzing Taylor Swift's discography. Here, I collect and compile data by scraping [Genius](https://genius.com/), an online music encyclopedia, with [parsel](https://parsel.readthedocs.io/en/latest/). From there, I use pandas to build the dataframe and SQLite3 to build the database.

Genius does have an [API](https://docs.genius.com/) for application development. However, the API does not expose song lyrics, which are a crucial component of this project. A third-party Python package, [lyricsgenius](https://lyricsgenius.readthedocs.io/en/master/), works in tandem with the API and scrapes the lyrics on the user's behalf. I tested both the API and lyricsgenius before ultimately deciding to collect the data myself.

The resulting dataframe and database are available in the [data folder](./data).

In [1]:
from scripts.clean_df import *
from scripts.create_db import *
from scripts.scrape_df import *

## Webscraping and Initial Dataframe

Genius is very thorough in cataloguing an artist's discography: it tracks every physical and digital variant of a release with a different tracklist, as well as singles, EPs, demo CDs, official playlists, and special releases. As a result, Genius lists Taylor Swift as having over [100 different albums](https://genius.com/artists/Taylor-swift/albums), most of which contain repeated songs. For the sake of processing time, I manually select the releases used to create the initial dataframe. Deluxe versions and "Taylor's Version" rerecordings of her studio albums are preferred because they include all existing bonus tracks and "From The Vault" songs for each album; non-album singles and streaming-specific EPs that contain previously unreleased tracks are also included. Since this project includes analyses of her lyrics, the goal is to capture every *unique* song of her discography, not necessarily every *possible* song.

The dataframe is structured as follows:

* **album_title**: the title of the album (or other release) containing each song
* **album_url**: the URL to the Genius page for the album
* **album_era**: the musical "era" the album was released in
  * Note: this terminology comes from [The Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour) and usually refers to the time period around a studio album's release. For songs not on studio albums, the era is the one closest to when the song was written or released.
* **album_track_number**: the track number for the song on the given album
* **song_title**: the title of the song
* **song_url**: the URL to the Genius page for the song
* **song_artists**: the credited artist(s) for the song (returned as a list)
* **song_lyrics**: the lyrics to the song (returned as a single string)
* **song_writers**: the writer(s) for the song (returned as a list)
* **song_producers**: the producer(s) for the song (returned as a list)
* **song_tags**: the genre tag(s) for the song (returned as a list)
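
As a rough illustration of the column types, the sketch below models a single row as a dataclass. `SongRecord` is a hypothetical name used only here; the fields follow the columns shown in the output of `raw_tswift.head()`, and the values are copied from the first row of that output (strings ending in `...` are truncated in the display and left as displayed).

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SongRecord:
    """One dataframe row; hypothetical helper, not part of the project's scripts."""
    album_title: str
    album_url: str
    album_era: str
    album_track_number: int
    song_title: str
    song_url: str
    song_artists: List[str]
    song_lyrics: str            # the whole song as a single string
    song_writers: List[str]
    song_producers: List[str]
    song_tags: List[str]


# Values as displayed for row 0; "..." marks text the notebook display truncated.
example = SongRecord(
    album_title="Taylor Swift",
    album_url="https://genius.com/albums/Taylor-Swift/Taylor-...",
    album_era="Taylor Swift",
    album_track_number=1,
    song_title="Tim McGraw",
    song_url="https://genius.com/Taylor-swift-tim-mcgraw-lyrics",
    song_artists=["Taylor Swift"],
    song_lyrics="He said the way my blue eyes shined Put those ...",
    song_writers=["Liz Rose", "Taylor Swift"],
    song_producers=["Nathan Chapman"],
    # Only the tags fully visible in the display; the rest are truncated there.
    song_tags=["Country", "Country Rock", "American Folk", "Folk"],
)

column_names = [f.name for f in fields(SongRecord)]
print(column_names)
```

Keeping writers, producers, and tags as lists (rather than comma-joined strings) makes it straightforward to split them into separate tables later.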

Several helper functions scrape and compile the data before it is put into a dataframe. The module for this section is [`scrape_df.py`](./scripts/scrape_df.py).
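
To give a feel for the scraping step, here is a minimal sketch of pulling track titles and URLs out of an album page. It is not the code in `scrape_df.py`: it uses the standard library's `html.parser` as a stand-in for parsel so that it runs without extra installs, and both the sample markup and the `track` class name are invented for illustration (real Genius pages are structured differently). With parsel, the same idea would be expressed with CSS selectors, for example `Selector(text=html).css('a.track::attr(href)').getall()`.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Invented markup for illustration only; not the structure of a real Genius page.
SAMPLE_ALBUM_HTML = """
<div class="tracklist">
  <a class="track" href="https://example.com/song-a-lyrics">Song A</a>
  <a class="track" href="https://example.com/song-b-lyrics"> Song B </a>
  <a class="nav" href="https://example.com/">Home</a>
</div>
"""


class TrackLinkParser(HTMLParser):
    """Collects (track_number, title, url) for each <a class="track"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.tracks: List[Tuple[int, str, str]] = []
        self._in_track = False
        self._href = ""
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "track":
            self._in_track = True
            self._href = attr_map.get("href") or ""
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a track link.
        if self._in_track:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_track:
            title = "".join(self._text).strip()
            # Track numbers are assigned by order of appearance.
            self.tracks.append((len(self.tracks) + 1, title, self._href))
            self._in_track = False


parser = TrackLinkParser()
parser.feed(SAMPLE_ALBUM_HTML)
for number, title, url in parser.tracks:
    print(number, title, url)
```

Each song URL collected this way would then be fetched and parsed in turn for its lyrics, writers, producers, and tags.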

In [2]:
# Manually selected albums/EPs from Genius' list
# Maximizes the number of unique songs while reducing cleanup
# dict = {'album_title': 'album_era'}
albums = {'Taylor Swift': 'Taylor Swift',
          'Beautiful Eyes - EP': 'Taylor Swift',
          "Fearless (Taylor's Version)": 'Fearless',
          "Speak Now (Taylor's Version)": 'Speak Now',
          "Red (Taylor's Version)": 'Red',
          "1989 (Taylor's Version) [Tangerine Edition]": '1989',
          'reputation': 'reputation',
          'Lover': 'Lover',
          'folklore (deluxe version)': 'folklore',
          'Christmas Tree Farm - 12" Single Picture Disc': 'Lover',
          'evermore (deluxe version)': 'evermore',
          'Carolina (From The Motion Picture "Where The Crawdads Sing")': 'folklore',
          'Midnights (3am Edition)': 'Midnights',
          'Midnights (The Late Night Edition)': 'Midnights',
          "The More Red (Taylor's Version) Chapter": 'Red',
          "The More Fearless (Taylor's Version) Chapter": 'Fearless',
          'The More Lover Chapter': 'Lover'}

In [3]:
# Initial dataframe creation
raw_tswift = data_collection('Taylor Swift', albums)
raw_tswift.head()

| | album_title | album_url | album_era | album_track_number | song_title | song_url | song_artists | song_lyrics | song_writers | song_producers | song_tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 1 | Tim McGraw | https://genius.com/Taylor-swift-tim-mcgraw-lyrics | [Taylor Swift] | He said the way my blue eyes shined Put those ... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Country Rock, American Folk, Folk, B... |
| 1 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 2 | Picture to Burn | https://genius.com/Taylor-swift-picture-to-bur... | [Taylor Swift] | State the obvious, I didn't get my perfect fan... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Pop, Rock, Country, Country Rock, Pop-Punk, P... |
| 2 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 3 | Teardrops On My Guitar | https://genius.com/Taylor-swift-teardrops-on-m... | [Taylor Swift] | Drew looks at me I fake a smile so he won't se... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Adult Contemporary, American Folk, A... |
| 3 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 4 | A Place In This World | https://genius.com/Taylor-swift-a-place-in-thi... | [Taylor Swift] | I don't know what I want, so don't ask me 'Cau... | [Angelo Petraglia, Robert Ellis Orrall, Taylor... | [Nathan Chapman] | [Country, Pop, Teen Pop, Country Pop, Soundtra... |
| 4 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 5 | Cold as You | https://genius.com/Taylor-swift-cold-as-you-ly... | [Taylor Swift] | You have a way of coming easily to me And when... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Pop, Ballad, Country Pop, Singer-Son... |


## Data Cleaning and Additional Entries

As noted above, many of Taylor Swift's releases contain repeated songs. For example, the streaming-specific EP *The More Lover Chapter* contains five songs: four are also on the standard release of the parent album, *Lover*, and only one is a unique release. There are also numerous remixes and alternative versions that are lyrically identical to the original songs. Both the remixes and the repeated songs need to be removed from the dataframe; the repeated songs can be filtered out easily, while the remixes require some manual work. I also choose to remove the five-minute version of "All Too Well" from the dataframe, as it is a shorter cut of the ten-minute version and could cause problems during analysis.
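
The two kinds of removal can be sketched in plain Python. This is an illustration of the idea rather than the implementation of `drop_songs`: the function name, the toy rows, and the rule that a repeat means an exact `song_title` match (keeping the first occurrence) are all assumptions. In pandas, the same two steps would roughly correspond to filtering with `Series.isin` and then calling `DataFrame.drop_duplicates`.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def drop_named_and_repeated(rows: Iterable[Row], titles_to_drop: Iterable[str]) -> List[Row]:
    """Remove explicitly listed titles, then keep only the first row per song title."""
    blocked = set(titles_to_drop)   # manual list: remixes and alternative versions
    seen_titles = set()             # titles already kept from an earlier release
    kept: List[Row] = []
    for row in rows:
        title = row["song_title"]
        if title in blocked or title in seen_titles:
            continue
        seen_titles.add(title)
        kept.append(row)
    return kept


# Toy data with placeholder titles; row order stands in for release order.
rows = [
    {"album_title": "Album", "song_title": "Song A"},
    {"album_title": "Album", "song_title": "Song A (Remix)"},
    {"album_title": "Bonus EP", "song_title": "Song A"},
    {"album_title": "Bonus EP", "song_title": "Song B"},
]

cleaned = drop_named_and_repeated(rows, ["Song A (Remix)"])
print([(r["album_title"], r["song_title"]) for r in cleaned])
```

Because the first occurrence wins, the order in which releases are scraped decides which album a repeated song stays attached to.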

Taylor also has several songs that she wrote for movies or cowrote with other artists that ultimately didn't end up on one of her own releases. The albums containing those songs weren't included in the initial data pull, so as not to require more cleanup after the fact (i.e., removing the other songs on each release that Taylor didn't write). Instead, those songs are added to the dataframe individually. The module for this section is [`clean_df.py`](./scripts/clean_df.py).

In [4]:
# Drops specific songs (alternative productions/remixes of existing songs)
# and duplicate songs
dropped_song_titles = ['Teardrops on My Guitar (Pop Version)',
                       "Should've Said No (Alternate Version)",
                       'Teardrops On My Guitar (Acoustic)',
                       'Picture To Burn (Radio Edit)',
                       "Forever & Always (Piano Version) [Taylor's Version]",
                       "All Too Well (Taylor's Version)",
                       "State Of Grace (Acoustic Version) (Taylor’s Version)",
                       'A Message From Taylor',
                       'Carolina (Video Version)',
                       'Christmas Tree Farm (Recorded Live at the 2019 iHeartRadio Jingle Ball)',
                       'Karma (Remix) (Ft. Ice Spice)'
                       ]
tswift = drop_songs(raw_tswift, dropped_song_titles)
tswift['album_era'].value_counts()

album_era
Red             31
Fearless        27
Speak Now       22
1989            22
Midnights       22
Lover           20
folklore        18
evermore        17
Taylor Swift    16
reputation      15
Name: count, dtype: int64

In [5]:
# Adds 'Christmases When You Were Mine' (Christmas EP)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Taylor-swift/The-taylor-swift-holiday-collection-ep',
    'Taylor Swift',
    'https://genius.com/Taylor-swift-christmases-when-you-were-mine-lyrics')

# Adds 'Christmas Must Be Something More' (Christmas EP)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Taylor-swift/The-taylor-swift-holiday-collection-ep',
    'Taylor Swift',
    'https://genius.com/Taylor-swift-christmas-must-be-something-more-lyrics')

# Adds 'Beautiful Ghosts' (Cats soundtrack)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Andrew-lloyd-webber/Cats-highlights-from-the-motion-picture-soundtrack',
    'Lover',
    'https://genius.com/Taylor-swift-beautiful-ghosts-lyrics')

# Adds 'Crazier' (Hannah Montana Movie soundtrack)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Taylor-swift/Itunes-essentials',
    'Fearless',
    'https://genius.com/Taylor-swift-crazier-lyrics')

# Adds 'You'll Always Find Your Way Back Home' (Hannah Montana Movie soundtrack)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Hannah-montana/Hannah-montana-the-movie-original-motion-picture-soundtrack',
    'Fearless',
    'https://genius.com/Hannah-montana-youll-always-find-your-way-back-home-lyrics')

# Adds 'I Don't Wanna Live Forever' (Fifty Shades Darker soundtrack)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Various-artists/Fifty-shades-darker-original-motion-picture-soundtrack',
    'reputation',
    'https://genius.com/Zayn-and-taylor-swift-i-dont-wanna-live-forever-lyrics')

# Adds 'Two Is Better Than One' (Boys Like Girls)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Boys-like-girls/Love-drunk',
    'Fearless',
    'https://genius.com/Boys-like-girls-two-is-better-than-one-lyrics')

# Adds 'This Is What You Came For' (Calvin Harris)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Now-thats-what-i-call-music/Now-thats-what-i-call-music-94-uk',
    '1989',
    'https://genius.com/Calvin-harris-this-is-what-you-came-for-lyrics')

# Adds 'The Alcott' (The National)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/The-national/First-two-pages-of-frankenstein',
    'Midnights',
    'https://genius.com/The-national-the-alcott-lyrics')

# Adds 'Renegade' (Big Red Machine)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Big-red-machine/How-long-do-you-think-its-gonna-last',
    'evermore',
    'https://genius.com/Big-red-machine-renegade-lyrics')

# Adds 'Bein' With My Baby' (Shea Fisher)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Shea-fisher/Shea',
    'Fearless',
    'https://genius.com/Shea-fisher-bein-with-my-baby-lyrics')

# Adds 'Best Days of Your Life' (Kellie Pickler)
tswift = add_single_song(
    tswift,
    'https://genius.com/albums/Kellie-pickler/Kellie-pickler-deluxe-edition',
    'Fearless',
    'https://genius.com/Kellie-pickler-best-days-of-your-life-lyrics')

tswift.tail(15)

| | album_title | album_url | album_era | album_track_number | song_title | song_url | song_artists | song_lyrics | song_writers | song_producers | song_tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 207 | The More Red (Taylor's Version) Chapter | https://genius.com/albums/Taylor-Swift/The-Mor... | Red | 6 | Safe & Sound (Taylor's Version) by Taylor Swif... | https://genius.com/Taylor-swift-joy-williams-a... | [Taylor Swift, Joy Williams, John Paul White] | I remember tears streaming down your face when... | [Taylor Swift, Joy Williams, John Paul White, ... | [Christopher Rowe, Taylor Swift] | [Country, Pop, Country Pop, Alternative Pop, A... |
| 208 | The More Fearless (Taylor's Version) Chapter | https://genius.com/albums/Taylor-Swift/The-Mor... | Fearless | 5 | If This Was a Movie (Taylor's Version) | https://genius.com/Taylor-swift-if-this-was-a-... | [Taylor Swift] | Last night, I heard my own heart beatin' Sound... | [Taylor Swift, Martin Johnson] | [Christopher Rowe, Taylor Swift] | [Country, Pop, Country Pop, Singer-Songwriter,... |
| 209 | The More Lover Chapter | https://genius.com/albums/Taylor-Swift/The-Mor... | Lover | 5 | All Of The Girls You Loved Before | https://genius.com/Taylor-swift-all-of-the-gir... | [Taylor Swift] | When you think of all the late nights Lame fig... | [Taylor Swift, Ging, Louis Bell] | [Louis Bell, Ging, Taylor Swift] | [Pop, Ballad, Adult Contemporary, Singer-Songw... |
| 210 | The Taylor Swift Holiday Collection - EP | https://genius.com/albums/Taylor-swift/The-tay... | Taylor Swift | 2 | Christmases When You Were Mine | https://genius.com/Taylor-swift-christmases-wh... | [Taylor Swift] | Please take down the mistletoe 'Cause I don't ... | [Nathan Chapman, Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Pop, Country Pop, Holiday, Singer-So... |
| 211 | The Taylor Swift Holiday Collection - EP | https://genius.com/albums/Taylor-swift/The-tay... | Taylor Swift | 14 | Christmas Must Be Something More | https://genius.com/Taylor-swift-christmas-must... | [Taylor Swift] | What if ribbons and bows didn't mean a thing? ... | [Taylor Swift] | [Nathan Chapman] | [Country, Pop, Country Pop, Holiday, Singer-So... |
| 212 | Cats: Highlights From the Motion Picture Sound... | https://genius.com/albums/Andrew-lloyd-webber/... | Lover | 17 | Beautiful Ghosts | https://genius.com/Taylor-swift-beautiful-ghos... | [Taylor Swift] | Follow me home if you dare to I wouldn't know ... | [Andrew Lloyd Webber, Taylor Swift] | [Nile Rodgers, Greg Wells] | [Pop, Singer-Songwriter, Soundtrack, Adult Con... |
| 213 | iTunes Essentials | https://genius.com/albums/Taylor-swift/Itunes-... | Fearless | 13 | Crazier | https://genius.com/Taylor-swift-crazier-lyrics | [Taylor Swift] | I'd never gone with the wind, just let it flow... | [Taylor Swift, Robert Ellis Orrall] | [Taylor Swift, Nathan Chapman] | [Country, Country Pop, Soundtrack] |
| 214 | Hannah Montana: The Movie (Original Motion Pic... | https://genius.com/albums/Hannah-montana/Hanna... | Fearless | 1 | You’ll Always Find Your Way Back Home | https://genius.com/Hannah-montana-youll-always... | [Hannah Montana] | Woo! You wake up It’s raining and it’s Monday ... | [Taylor Swift, Martin Johnson] | [Matthew Gerrard] | [Pop, Teen Pop, TV, Disney, Soundtrack] |
| 215 | Fifty Shades Darker (Original Motion Picture S... | https://genius.com/albums/Various-artists/Fift... | reputation | 1 | I Don’t Wanna Live Forever | https://genius.com/Zayn-and-taylor-swift-i-don... | [ZAYN, Taylor Swift] | Been sittin' eyes wide open Behind these four ... | [Sam Dew, Taylor Swift, Jack Antonoff] | [Jack Antonoff] | [R&B, Pop, Adult Contemporary, Gospel, Alterna... |
| 216 | Love Drunk | https://genius.com/albums/Boys-like-girls/Love... | Fearless | 4 | Two Is Better Than One | https://genius.com/Boys-like-girls-two-is-bett... | [BOYS LIKE GIRLS, Taylor Swift] | I remember what you wore on the first day You ... | [Sam Hollander, Dave Katz, Taylor Swift, Marti... | [Brian Howes] | [Country, Pop, Adult Contemporary, Duet, Count... |


## Database Creation and Exporting

At this point, the dataframe is complete and ready to be used in later notebooks. However, I also want a database version of the data so I can query it with SQL. Using SQLite3, I convert the dataframe into a five-table database, whose schema is shown here (courtesy of [dbdiagram.io](https://dbdiagram.io/)):

![db_schema.png](db_schema.png)

The module for this section is [`create_db.py`](./scripts/create_db.py).
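
To show why a multi-table layout helps, the sketch below normalizes one list-valued column (`song_writers`) into a junction table and queries it. The table and column names are assumptions made for this example and are not taken from `create_db.py` or the schema diagram; the two rows reuse song titles and writer credits from the dataframe output above, and the database lives in memory only.

```python
import sqlite3

# Two rows with credits as shown in the dataframe output.
rows = [
    {"song_title": "Tim McGraw", "song_writers": ["Liz Rose", "Taylor Swift"]},
    {"song_title": "Christmas Must Be Something More", "song_writers": ["Taylor Swift"]},
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one table per entity plus a many-to-many junction table.
    CREATE TABLE songs (
        song_id    INTEGER PRIMARY KEY,
        song_title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE writers (
        writer_id   INTEGER PRIMARY KEY,
        writer_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE song_writers (
        song_id   INTEGER NOT NULL REFERENCES songs(song_id),
        writer_id INTEGER NOT NULL REFERENCES writers(writer_id),
        PRIMARY KEY (song_id, writer_id)
    );
""")

for row in rows:
    song_id = conn.execute(
        "INSERT INTO songs (song_title) VALUES (?)", (row["song_title"],)
    ).lastrowid
    for name in row["song_writers"]:
        # Insert each writer once, then look up the id to link song and writer.
        conn.execute("INSERT OR IGNORE INTO writers (writer_name) VALUES (?)", (name,))
        writer_id = conn.execute(
            "SELECT writer_id FROM writers WHERE writer_name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO song_writers (song_id, writer_id) VALUES (?, ?)",
            (song_id, writer_id),
        )
conn.commit()

# Count songs per writer, which is awkward to do on a list-valued dataframe column.
credits = conn.execute("""
    SELECT w.writer_name, COUNT(*) AS n_songs
    FROM song_writers AS sw
    JOIN writers AS w ON w.writer_id = sw.writer_id
    GROUP BY w.writer_name
    ORDER BY n_songs DESC, w.writer_name
""").fetchall()
print(credits)
conn.close()
```

The same pattern would apply to producers and tags, since each of those columns also holds a list per song.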

In [6]:
# Export dataframe to pkl
tswift.to_pickle('data/taylor_swift_discography.pkl')

# Creates database
convert_to_db(tswift, 'taylor_swift.db')