# Taylor Swift Discography: Part I - Data Collection

## Introduction

This notebook is the first in a series analyzing Taylor Swift's discography. Here, I collect and compile data by scraping [Genius](https://genius.com/), an online music encyclopedia, with [parsel](https://parsel.readthedocs.io/en/latest/). From there, I use pandas to build the dataframe and SQLite3 to build the database.

Genius does have an [API](https://docs.genius.com/) for application development. However, the API does not expose song lyrics, which are a crucial component of this project. A third-party Python package, [lyricsgenius](https://lyricsgenius.readthedocs.io/en/master/), works in tandem with the API and scrapes the lyrics on the user's behalf. I tested both the API and lyricsgenius before ultimately deciding to collect the data myself.

The resulting dataframes and database from this notebook are available in the [data folder](./data). CSV files used in this notebook are also included in the [data folder](./data/csv).

This notebook uses the modules [`genius_scrape.py`](./src/genius_scrape.py) for webscraping and discography dataframe creation and [`discog_mods.py`](./src/discog_mods.py) for additional modifications and transformations.

In [1]:
import pandas as pd

from src import discog_mods
from src import genius_scrape

## Webscraping and Initial Dataframe

Genius is very thorough in cataloguing an artist's discography, tracking all physical and digital variants of a release with different tracklists as well as singles, EPs, demo CDs, official playlists, and special releases. As a result, Genius catalogues Taylor Swift as having over [100 different albums](https://genius.com/artists/Taylor-swift/albums), most of which contain repeated songs. To keep processing times down, I manually select the releases used to create the initial dataframe: deluxe versions of original albums and "Taylor's Version" rerecordings of her studio albums are preferred, and non-album singles and streaming-specific EPs that contain previously unreleased tracks are also included.

The dataframe is structured as follows:

* **album_title**: the title of the album (or other release) containing each song (string)
* **album_url**: the URL to the Genius page for the album (string)
* **category**: the categorical classification of an album/song (string)
  * Note: In this project, the categories are studio album names, rerecording names, "Non-Album Songs" for songs made for soundtracks or Taylor Swift songs not associated with albums, and "Other Artist Songs" for collaborations and remixes on other artists' albums. I'll often refer to these categories as "eras," a term popularized by [The Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour).
* **album_track_number**: the track number for the song on the given album (integer)
* **song_title**: the title of the song (string)
* **song_url**: the URL to the Genius page for the song (string)
* **song_release_date**: the original release date for the song (date)
* **song_page_views**: the number of views the Genius song page has (integer)
* **song_artists**: the performing artist(s) for the song (list)
* **song_lyrics**: the lyrics to the song (string)
* **song_writers**: the writer(s) for the song (list)
* **song_producers**: the producer(s) for the song (list)
* **song_tags**: the genre tag(s) for the song (list)
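
To make the column types above concrete, here is a minimal sketch that models one row as a Python dataclass. The `SongRow` class is hypothetical and exists only for illustration; the notebook itself stores these fields as pandas columns. The sample values come from the first row of the output shown below, and strings that are truncated in that output are kept truncated here.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List


@dataclass
class SongRow:
    """Hypothetical record type mirroring the thirteen dataframe columns."""
    album_title: str
    album_url: str
    category: str
    album_track_number: int
    song_title: str
    song_url: str
    song_release_date: date
    song_page_views: int
    song_artists: List[str]
    song_lyrics: str
    song_writers: List[str]
    song_producers: List[str]
    song_tags: List[str]


row = SongRow(
    album_title="Taylor Swift",
    album_url="https://genius.com/albums/Taylor-Swift/Taylor-...",  # truncated in the output
    category="Taylor Swift",
    album_track_number=1,
    song_title="Tim McGraw",
    song_url="https://genius.com/Taylor-swift-tim-mcgraw-lyrics",
    song_release_date=date(2006, 6, 19),
    song_page_views=240500,
    song_artists=["Taylor Swift"],
    song_lyrics="He said the way my blue eyes shined Put those ...",  # truncated in the output
    song_writers=["Liz Rose", "Taylor Swift"],
    song_producers=["Nathan Chapman"],
    # Only the tags visible before the output truncates the list
    song_tags=["Country", "Country Rock", "American Folk", "Folk"],
)

print([f.name for f in fields(SongRow)])
```

The four list-typed fields are worth noting: because a song can have several artists, writers, producers, and tags, those columns hold Python lists rather than scalar values.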

In [2]:
# Manually selected albums/EPs from Genius' list
# Maximizes the number of unique songs while reducing cleanup
albums = genius_scrape.create_dict_from_file('album_list.csv')
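
The layout of `album_list.csv` and the body of `create_dict_from_file` are not shown in this notebook. As a sketch of the general idea, the example below reads a two-column CSV into a dictionary using only the standard library. The column names, the sample rows, and the `dict_from_csv` helper are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file contents: the real album_list.csv layout is not shown
# in this notebook, so the two columns here are an assumption.
SAMPLE_CSV = """album_title,category
Taylor Swift,Taylor Swift
Fearless (Taylor's Version),Fearless (TV)
Lover,Lover
"""


def dict_from_csv(handle) -> dict:
    """Map the first column to the second, skipping the header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if row}


# io.StringIO stands in for an open file so the sketch is self-contained
albums_example = dict_from_csv(io.StringIO(SAMPLE_CSV))
print(albums_example)
```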

In [3]:
# Initial dataframe creation
raw_tswift = genius_scrape.create_discography('Taylor Swift', albums)
raw_tswift.to_pickle('data/taylor_swift_raw.pkl')
raw_tswift.head()

Unnamed: 0,album_title,album_url,category,album_track_number,song_title,song_url,song_artists,song_release_date,song_page_views,song_lyrics,song_writers,song_producers,song_tags
0,Taylor Swift,https://genius.com/albums/Taylor-Swift/Taylor-...,Taylor Swift,1,Tim McGraw,https://genius.com/Taylor-swift-tim-mcgraw-lyrics,[Taylor Swift],2006-06-19,240500,He said the way my blue eyes shined Put those ...,"[Liz Rose, Taylor Swift]",[Nathan Chapman],"[Country, Country Rock, American Folk, Folk, B..."
1,Taylor Swift,https://genius.com/albums/Taylor-Swift/Taylor-...,Taylor Swift,2,Picture to Burn,https://genius.com/Taylor-swift-picture-to-bur...,[Taylor Swift],2006-10-24,257800,"State the obvious, I didn't get my perfect fan...","[Liz Rose, Taylor Swift]",[Nathan Chapman],"[Pop, Rock, Country, Country Rock, Pop-Punk, P..."
2,Taylor Swift,https://genius.com/albums/Taylor-Swift/Taylor-...,Taylor Swift,3,Teardrops On My Guitar,https://genius.com/Taylor-swift-teardrops-on-m...,[Taylor Swift],2006-10-24,226700,Drew looks at me I fake a smile so he won't se...,"[Liz Rose, Taylor Swift]",[Nathan Chapman],"[Country, Adult Contemporary, American Folk, A..."
3,Taylor Swift,https://genius.com/albums/Taylor-Swift/Taylor-...,Taylor Swift,4,A Place In This World,https://genius.com/Taylor-swift-a-place-in-thi...,[Taylor Swift],2006-10-24,75500,"I don't know what I want, so don't ask me 'Cau...","[Angelo Petraglia, Robert Ellis Orrall, Taylor...",[Nathan Chapman],"[Country, Pop, Teen Pop, Country Pop, Soundtra..."
4,Taylor Swift,https://genius.com/albums/Taylor-Swift/Taylor-...,Taylor Swift,5,Cold as You,https://genius.com/Taylor-swift-cold-as-you-ly...,[Taylor Swift],2006-10-24,126500,You have a way of coming easily to me And when...,"[Liz Rose, Taylor Swift]",[Nathan Chapman],"[Country, Pop, Ballad, Country Pop, Singer-Son..."


## Data Cleaning and Additional Entries

As noted above, many of Taylor Swift's releases contain repeated songs. For example, the streaming-specific EP *The More Lover Chapter* contains five songs: four are also on the standard release of the parent album, *Lover*, and only one is a unique release. There are also numerous remixes and alternative versions of songs that are lyrically identical to the originals. Both the repeated songs and these remixes need to be removed from the dataframe; repeated songs can be filtered out automatically, while the remixes require some manual work. Remixes or versions of songs that include new features, such as "Karma (Remix) (Ft. Ice Spice)," are kept for the sake of tracking collaborations, though they will be removed later in the project.

Taylor also has several songs that she wrote for movies or cowrote/performed with other artists that ultimately don't appear on one of her own releases. The albums containing those songs weren't included in the initial data pull, to avoid additional cleanup after the fact. Instead, those songs are added to the dataframe individually.

Another small change I make is reverting instances of "Joe Alwyn" in both `song_writers` and `song_producers` to "William Bowery," the original pseudonym used for these writing/producing credits. This is a personal preference, as I'd like the William Bowery song credits to be consistent across albums/eras.

In [4]:
# Drops specific songs (alternative productions/remixes of existing songs)
# and duplicate songs
tswift = discog_mods.drop_songs_from_file(raw_tswift, 'songs_to_drop_part1.csv', drop_duplicates=True)
tswift['category'].value_counts()

category
Red (TV)           32
Fearless (TV)      26
Midnights          23
Speak Now (TV)     22
1989 (TV)          22
Red                20
Lover              19
Fearless           18
Speak Now          17
folklore           17
evermore           17
1989               16
reputation         15
Taylor Swift       14
Non-Album Songs     2
Name: count, dtype: int64
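
The drop step above combines two filters: removing specific titles listed in a file and removing repeated songs. A minimal pandas sketch of that pattern follows. It is not the actual `drop_songs_from_file` implementation; the `drop_songs` helper, the toy data, and the choice of `song_title` as the duplicate key are all assumptions (the real module may match on a different column, such as the song URL).

```python
import pandas as pd

# Toy data with generic names; stands in for the raw discography dataframe
df = pd.DataFrame({
    "album_title": ["Album", "Album (EP)", "Album", "Album"],
    "song_title": ["Song A", "Song A", "Song A (Remix)", "Song B"],
})

# Stands in for the titles listed in songs_to_drop_part1.csv
titles_to_drop = ["Song A (Remix)"]


def drop_songs(frame: pd.DataFrame, titles: list, drop_duplicates: bool = True) -> pd.DataFrame:
    """Remove listed titles, then optionally remove repeated titles."""
    out = frame[~frame["song_title"].isin(titles)]
    if drop_duplicates:
        # Keep the first occurrence of each title (assumed duplicate key)
        out = out.drop_duplicates(subset="song_title", keep="first")
    return out.reset_index(drop=True)


cleaned = drop_songs(df, titles_to_drop)
print(cleaned)
```

Keeping the first occurrence means the order of the input matters: whichever release is scraped first is the one a repeated song stays attached to.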

In [5]:
# Adds numerous songs to the dataframe at once
tswift = discog_mods.add_songs_from_file(tswift, 'songs_to_add.csv')
tswift.tail(15)

Unnamed: 0,album_title,album_url,category,album_track_number,song_title,song_url,song_artists,song_release_date,song_page_views,song_lyrics,song_writers,song_producers,song_tags
295,1989 (Taylor’s Version) [Deluxe],https://genius.com/albums/Taylor-swift/1989-ta...,1989 (TV),22,Bad Blood (Remix) (Taylor’s Version),https://genius.com/Taylor-swift-bad-blood-remi...,"[Taylor Swift, Kendrick Lamar]",2023-10-27,24700,"'Cause, baby, now we've got bad blood You know...","[Taylor Swift, Max Martin, Shellback, Kendrick...","[Christopher Rowe, Taylor Swift]","[R&B, Rap, Pop, Electronic, Adult Contemporary..."
296,Bigger,https://genius.com/albums/Sugarland/Bigger,Other Artist Songs,7,Babe,https://genius.com/Sugarland-babe-lyrics,"[Sugarland, Taylor Swift]",2018-04-20,164800,What a shame Didn't wanna be the one that got ...,"[Pat Monahan, Taylor Swift]","[Kristian Bush, Julian Raymond, Jennifer Nettles]","[Country, Pop, Adult Contemporary, Duet, Balla..."
297,Two Lanes of Freedom (Accelerated Deluxe),https://genius.com/albums/Tim-mcgraw/Two-lanes...,Other Artist Songs,13,Highway Don’t Care,https://genius.com/Tim-mcgraw-highway-dont-car...,"[Tim McGraw, Keith Urban, Taylor Swift]",2013-03-25,62800,Bet your window's rolled down and your hair's ...,"[Brett Warren, Brad Warren, Josh Kear, Mark Ir...","[Tim McGraw, Byron Gallimore]","[Country, Rock, Country Rock, Duet, Ballad]"
298,Battle Studies,https://genius.com/albums/John-mayer/Battle-st...,Other Artist Songs,3,Half of My Heart,https://genius.com/John-mayer-half-of-my-heart...,"[John Mayer, Taylor Swift]",2010-06-21,94300,I was born in the arms of imaginary friends Fr...,[John Mayer],"[John Mayer, Steve Jordan]","[Rock, Country Rock, Duet, Soft Rock, Adult Co..."
299,Women in Music Pt. III (Expanded Edition),https://genius.com/albums/Haim/Women-in-music-...,Other Artist Songs,14,Gasoline (Remix),https://genius.com/Haim-and-taylor-swift-gasol...,"[HAIM, Taylor Swift]",2021-02-19,71600,You took me back But you shouldn't have Now it...,"[Ariel Rechtshaid, Este Haim, Alana Haim, Rost...","[Danielle Haim, Ariel Rechtshaid, Rostam]","[Rock, Country, Pop, Alternative Pop, Alternat..."
300,Strange Clouds,https://genius.com/albums/Bob/Strange-clouds,Other Artist Songs,4,Both of Us,https://genius.com/Bob-both-of-us-lyrics,"[B.o.B, Taylor Swift]",2012-05-22,124000,I wish I was Strong enough to Lift not one but...,"[B.o.B, Taylor Swift, Cirkut, Dr. Luke, Ammar ...","[Cirkut, Dr. Luke]","[Pop, Rap, Country Rap]"
301,,,Non-Album Songs,0,Only The Young,https://genius.com/Taylor-swift-only-the-young...,[Taylor Swift],2020-01-31,217500,"It keeps me awake, the look on your face The m...","[Joel Little, Taylor Swift]","[Joel Little, Taylor Swift]","[Pop, Synth-Pop, Electronic, Protest Songs, El..."
302,NOW That’s What I Call Music! 73 [US],https://genius.com/albums/Now-thats-what-i-cal...,Lover,2,Lover (Remix),https://genius.com/Taylor-swift-lover-remix-ly...,"[Taylor Swift, Shawn Mendes]",2019-11-13,454800,We could leave the Christmas lights up 'til Ja...,"[Scott Harris, Shawn Mendes, Taylor Swift]","[Taylor Swift, Jack Antonoff]","[Country, Pop, Canada, Duet, Singer-Songwriter..."
303,= (Equals) (Tour Edition),https://genius.com/albums/Ed-sheeran/Equals-to...,Other Artist Songs,20,The Joker And The Queen (Remix),https://genius.com/Ed-sheeran-the-joker-and-th...,"[Ed Sheeran, Taylor Swift]",2022-02-11,47700,How was I to know? It's a crazy thing I showed...,"[Taylor Swift, Fred again.., Johnny McDaid, RØ...","[RØMANS, Johnny McDaid, Ed Sheeran, Fred again..]","[Pop, Adult Contemporary, Duet, Singer-Songwri..."
304,Speak Now: World Tour Live (Brazilian Edition),https://genius.com/albums/Taylor-swift/Speak-n...,Speak Now,17,Long Live (Remix),https://genius.com/Taylor-swift-long-live-remi...,"[Taylor Swift, Paula Fernandes]",2012-05-29,0,Lembrei desse sentimento Gritando dentro de nó...,"[Paula Fernandes, Taylor Swift]",[Márcio Monteiro],"[Country, Pop, Duet, Remix, Concert, Live, Ser..."


In [6]:
# Changing instances of 'Joe Alwyn' to 'William Bowery'
tswift['song_writers'] = discog_mods.change_credit_name(tswift['song_writers'], 'Joe Alwyn', 'William Bowery')
tswift['song_producers'] = discog_mods.change_credit_name(tswift['song_producers'], 'Joe Alwyn', 'William Bowery')
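
Because the credit columns hold lists, a plain string replacement on the column is not enough; each list has to be rewritten element by element. The sketch below shows one way to do that with `Series.apply`. The `rename_credit` helper is hypothetical and is not the body of `change_credit_name`, which is not shown in this notebook.

```python
import pandas as pd


def rename_credit(series: pd.Series, old: str, new: str) -> pd.Series:
    """Replace `old` with `new` inside each list in a Series of lists."""
    return series.apply(
        lambda names: [new if name == old else name for name in names]
    )


# Toy column of writer credits
writers = pd.Series([
    ["Taylor Swift", "Joe Alwyn"],
    ["Liz Rose", "Taylor Swift"],
])

renamed = rename_credit(writers, "Joe Alwyn", "William Bowery")
print(renamed.tolist())
```

The list comprehension builds a new list for every row, so the original Series is left unchanged.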

## Database Creation and Exporting

At this point, the dataframe is complete and ready to be used in later notebooks. However, I also want a database version of the dataframe so I can query via SQL. Using SQLite3, I convert the dataframe into a six-table database, the schema of which can be seen here (courtesy of [dbdiagram.io](https://dbdiagram.io/)):

![db_schema.png](./figures/db_schema.png)

In [7]:
# Export dataframe to pkl
tswift.to_pickle('data/taylor_swift_clean.pkl')

# Creates database
discog_mods.convert_to_db(tswift, 'taylor_swift.db')
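
The actual six-table schema is the one in the figure above, and `convert_to_db` is not reproduced here. To illustrate why a list-valued column such as `song_writers` calls for more than one table, the sketch below builds a much smaller, hypothetical three-table layout in an in-memory SQLite database: one table for songs, one for writers, and a junction table linking them. All table and column names are assumptions for this example.

```python
import sqlite3

# Toy rows shaped like the dataframe: a title plus a list of writers
rows = [
    ("Tim McGraw", ["Liz Rose", "Taylor Swift"]),
    ("Only The Young", ["Joel Little", "Taylor Swift"]),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (
        song_id INTEGER PRIMARY KEY,
        song_title TEXT NOT NULL
    );
    CREATE TABLE writers (
        writer_id INTEGER PRIMARY KEY,
        writer_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE song_writers (
        song_id INTEGER REFERENCES songs(song_id),
        writer_id INTEGER REFERENCES writers(writer_id),
        PRIMARY KEY (song_id, writer_id)
    );
""")

for title, writer_names in rows:
    song_id = conn.execute(
        "INSERT INTO songs (song_title) VALUES (?)", (title,)
    ).lastrowid
    for name in writer_names:
        # UNIQUE + OR IGNORE stores each writer only once
        conn.execute(
            "INSERT OR IGNORE INTO writers (writer_name) VALUES (?)", (name,)
        )
        writer_id = conn.execute(
            "SELECT writer_id FROM writers WHERE writer_name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO song_writers VALUES (?, ?)", (song_id, writer_id)
        )
conn.commit()

# Count songs per writer by joining through the junction table
credits = conn.execute("""
    SELECT w.writer_name, COUNT(*) AS n_songs
    FROM song_writers sw
    JOIN writers w USING (writer_id)
    GROUP BY w.writer_name
    ORDER BY n_songs DESC, w.writer_name
""").fetchall()
print(credits)
```

Splitting the lists into a junction table is what makes per-writer queries like the last one possible with plain SQL.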