# Taylor Swift Discography - Data Collection

## Introduction

In this notebook, I use [parsel](https://parsel.readthedocs.io/en/latest/) to scrape data from [Genius](https://genius.com/), an online music encyclopedia. From there, I use pandas to build the dataframes and SQLite3 to build the database.

Genius does have an [API](https://docs.genius.com/) for application development. However, the API does not provide song lyrics, which are a crucial component of this project. A third-party Python package, [lyricsgenius](https://lyricsgenius.readthedocs.io/en/master/), works in tandem with the API and webscrapes on the user's behalf. I tested both the API and lyricsgenius before ultimately deciding to collect the data myself.

The resulting dataframes and database from this notebook are available in the [data folder](../data). CSV files used in this notebook are also included in the [data folder](../data/csv).

This notebook uses the modules [`genius_scrape.py`](../src/genius_scrape.py) for webscraping and discography dataframe creation and [`discog_mods.py`](../src/discog_mods.py) for additional modifications and transformations.

In [1]:
%cd -q ..

import pandas as pd

from src import discog_mods
from src import genius_scrape

## Webscraping and Initial Dataframe

Genius is very thorough in cataloguing an artist's discography, tracking all physical and digital variants of a release with different tracklists as well as singles, EPs, demo CDs, official playlists, and special releases. As a result, Genius lists Taylor Swift as having over [100 different albums](https://genius.com/artists/Taylor-swift/albums), most of which contain repeated songs. To keep processing times down, I manually select the releases used to create the initial dataframe: deluxe versions of original albums and "Taylor's Version" rerecordings of her studio albums are preferred, and non-album singles and streaming-specific EPs that contain previously unreleased tracks are included.

The dataframe is structured as follows:

* **album_title**: the title of the album (or other release) containing each song (string)
* **album_url**: the URL to the Genius page for the album (string)
* **category**: the categorical classification of an album/song (string)
  * Note: In this project, the categories are studio album names, rerecording names, "Non-Album Songs" for songs made for soundtracks or Taylor Swift songs not associated with albums, and "Other Artist Songs" for collaborations and remixes on other artists' albums. I'll often refer to these categories as "eras," a term popularized by [The Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour).
* **album_track_number**: the track number for the song on the given album (integer)
* **song_title**: the title of the song (string)
* **song_url**: the URL to the Genius page for the song (string)
* **song_release_date**: the original release date for the song (date)
* **song_page_views**: the number of views the Genius song page has (integer)
* **song_artists**: the performing artist(s) for the song (list)
* **song_lyrics**: the lyrics to the song (list)
* **song_writers**: the writer(s) for the song (list)
* **song_producers**: the producer(s) for the song (list)
* **song_tags**: the genre tag(s) for the song (list)
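
To make the list-valued columns above more concrete, here is a minimal sketch that builds a one-row dataframe with a subset of these columns and expands one list column into one row per writer. The values are copied from the untruncated fields of the first row of the output further down; the variable names `example` and `by_writer` are only illustrative and are not part of `genius_scrape.py`.

```python
import pandas as pd

# One song as a plain dict; list-valued fields stay Python lists inside the cell.
row = {
    "album_title": "Taylor Swift",
    "category": "Taylor Swift",
    "album_track_number": 1,
    "song_title": "Tim McGraw",
    "song_url": "https://genius.com/Taylor-swift-tim-mcgraw-lyrics",
    "song_release_date": "2006-06-19",
    "song_page_views": 253600,
    "song_artists": ["Taylor Swift"],
    "song_writers": ["Liz Rose", "Taylor Swift"],
    "song_producers": ["Nathan Chapman"],
}

example = pd.DataFrame([row])

# Parse the release date so it behaves like a date rather than a string.
example["song_release_date"] = pd.to_datetime(example["song_release_date"])

# explode() turns each element of a list cell into its own row,
# which is handy for per-writer or per-tag counts.
by_writer = example.explode("song_writers")
print(by_writer[["song_title", "song_writers"]])
```

List columns are stored with the generic `object` dtype, so exploding them first is usually the easiest way to group or count by individual credit.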

In [2]:
# Manually selected albums/EPs from Genius' list
# Maximizes the number of unique songs while reducing cleanup
albums = genius_scrape.create_dict_from_file('data/csv/album_list.csv')
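
The internals of `create_dict_from_file` are not shown in this notebook. As a rough sketch of the idea, assuming a two-column CSV that maps an album title to its URL (the column names `album_title` and `album_url`, the helper `dict_from_csv`, and the placeholder rows are all hypothetical), a file can be turned into a dictionary like this:

```python
import csv
import io

# Hypothetical file contents: the real album_list.csv may use different columns.
sample_csv = io.StringIO(
    "album_title,album_url\n"
    "Album One,https://example.com/albums/album-one\n"
    "Album Two,https://example.com/albums/album-two\n"
)


def dict_from_csv(handle, key_col="album_title", value_col="album_url"):
    """Map one CSV column to another, keeping the file's row order."""
    return {row[key_col]: row[value_col] for row in csv.DictReader(handle)}


albums_example = dict_from_csv(sample_csv)
print(albums_example)
```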

In [3]:
# Initial dataframe creation
raw_tswift = genius_scrape.create_discography('Taylor Swift', albums)
raw_tswift.to_pickle('data/taylor_swift_raw.pkl')
raw_tswift.head()

| | album_title | album_url | category | album_track_number | song_title | song_url | song_artists | song_release_date | song_page_views | song_lyrics | song_writers | song_producers | song_tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 1 | Tim McGraw | https://genius.com/Taylor-swift-tim-mcgraw-lyrics | [Taylor Swift] | 2006-06-19 | 253600 | [He said the way my blue eyes shined, Put thos... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, In English, USA, Country Rock, Ameri... |
| 1 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 2 | Picture to Burn | https://genius.com/Taylor-swift-picture-to-bur... | [Taylor Swift] | 2006-10-24 | 267600 | [State the obvious, I didn't get my perfect fa... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Pop, Rock, Country, In English, USA, Country ... |
| 2 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 3 | Teardrops On My Guitar | https://genius.com/Taylor-swift-teardrops-on-m... | [Taylor Swift] | 2006-10-24 | 236500 | [Drew looks at me, I fake a smile so he won't ... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, In English, USA, Adult Contemporary,... |
| 3 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 4 | A Place In This World | https://genius.com/Taylor-swift-a-place-in-thi... | [Taylor Swift] | 2006-10-24 | 80000 | [I don't know what I want, so don't ask me, 'C... | [Angelo Petraglia, Robert Ellis Orrall, Taylor... | [Nathan Chapman] | [Country, Pop, In English, USA, Teen Pop, Coun... |
| 4 | Taylor Swift | https://genius.com/albums/Taylor-Swift/Taylor-... | Taylor Swift | 5 | Cold as You | https://genius.com/Taylor-swift-cold-as-you-ly... | [Taylor Swift] | 2006-10-24 | 132500 | [You have a way of coming easily to me, And wh... | [Liz Rose, Taylor Swift] | [Nathan Chapman] | [Country, Pop, In English, USA, Ballad, Countr... |


## Data Cleaning and Additional Entries

As noted above, many of Taylor Swift's releases contain repeating songs. For example, the streaming-specific EP, *The More Lover Chapter*, contains five songs: four songs are also on the standard release of the parent album, *Lover*, while only one song is a unique release. There are also numerous remixes and alternative versions of the same songs. Both the remixes and the repeated songs need to be removed from the dataframe; the repeated songs can be easily filtered out, while the remixes require some manual work. Remixes or versions of songs that include new features, such as "Karma (Remix) (Ft. Ice Spice)," are kept for the sake of tracking collaborations.
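
The two removal steps can be sketched in pandas as follows. This is only an illustration on toy rows: the helper name `drop_songs`, the placeholder titles and URLs, and the choice of `song_url` as the duplicate key are assumptions, and `discog_mods.drop_songs_from_file` may work differently.

```python
import pandas as pd

# Toy rows: "Song A" appears on both an album and an EP, and one track is a remix.
songs = pd.DataFrame(
    {
        "album_title": ["Album X", "Album X", "EP Y", "EP Y", "EP Y"],
        "song_title": ["Song A", "Song B", "Song A", "Song C", "Song B (Remix)"],
        "song_url": ["url/a", "url/b", "url/a", "url/c", "url/b-remix"],
    }
)

# Remixes have their own pages, so they must be listed by hand.
titles_to_drop = ["Song B (Remix)"]


def drop_songs(df, titles, drop_duplicates=True):
    """Remove hand-picked titles, then optionally drop repeated songs."""
    kept = df[~df["song_title"].isin(titles)]
    if drop_duplicates:
        # Keep the first occurrence of each song page (assumed duplicate key).
        kept = kept.drop_duplicates(subset="song_url", keep="first")
    return kept.reset_index(drop=True)


cleaned = drop_songs(songs, titles_to_drop)
print(cleaned)
```

Because the first occurrence is kept, the order of releases in the dataframe decides which album a repeated song stays attached to.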

Taylor also has several songs that she wrote for movies or cowrote/performed with other artists that ultimately didn't end up on one of her releases. Those albums weren't included in the initial data pull so as not to require more cleanup after the fact. Instead, those songs are added to the dataframe individually.

See the [README](../README.md#Constraints-and-Limitations-of-Discography) for more information on what is and isn't included in the discography.

Another small change I make is reverting instances of "Joe Alwyn" in both `song_writers` and `song_producers` to "William Bowery," the original pseudonym used for these writing/producing credits. This is a personal preference, as I'd like the William Bowery song credits to be consistent across albums/eras.
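
Since the credit columns hold lists, the rename has to happen inside each list. A minimal sketch of one way to do this, using toy credit lists and a hypothetical helper `rename_credit` (not the actual `discog_mods.change_credit_name` implementation):

```python
import pandas as pd


def rename_credit(credits, old, new):
    """Return a copy of a list-valued Series with `old` replaced by `new`."""
    return credits.apply(
        lambda names: [new if name == old else name for name in names]
    )


# Toy credit lists for illustration only.
writers = pd.Series(
    [
        ["Taylor Swift", "Joe Alwyn"],
        ["Taylor Swift", "Jack Antonoff"],
        [],
    ]
)

renamed = rename_credit(writers, "Joe Alwyn", "William Bowery")
print(renamed.tolist())
```

The helper builds new lists rather than editing them in place, so the original Series is left unchanged.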

In [4]:
# Drops specific songs (alternative productions/remixes of existing songs)
# and duplicate songs
tswift = discog_mods.drop_songs_from_file(raw_tswift, 'data/csv/songs_to_drop_part1.csv', drop_duplicates=True)
tswift['category'].value_counts()

category
The Tortured Poets Department    31
Red (TV)                         30
Fearless (TV)                    25
Midnights                        23
Speak Now (TV)                   22
1989 (TV)                        22
Red                              20
Fearless                         18
Lover                            18
evermore                         17
Speak Now                        17
folklore                         17
1989                             16
reputation                       15
Taylor Swift                     14
Non-Album Songs                  12
Name: count, dtype: int64

In [5]:
# Adds numerous songs to the dataframe at once
tswift = discog_mods.add_songs_from_file(tswift, 'data/csv/songs_to_add.csv')
tswift.tail(15)

| | album_title | album_url | category | album_track_number | song_title | song_url | song_artists | song_release_date | song_page_views | song_lyrics | song_writers | song_producers | song_tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 343 | | | Non-Album Songs | 0 | Ronan | https://genius.com/Taylor-swift-ronan-lyrics | [Taylor Swift] | 2012-09-08 | 143100 | [I remember you bare feet, down the hallway, I... | [Taylor Swift, Maya Thompson] | [Taylor Swift] | [Country, Pop, Ballad, Memorial, Soft Rock, Co... |
| 344 | Valentine’s Day (Original Motion Picture Sound... | https://genius.com/albums/Various-artists/Vale... | Non-Album Songs | 1 | Today Was a Fairytale | https://genius.com/Taylor-swift-today-was-a-fa... | [Taylor Swift] | 2010-01-19 | 13200 | [Today was a fairytale, you were the prince, I... | [Taylor Swift] | [Nathan Chapman, Taylor Swift] | [Country, Pop, Adult Contemporary, Country Pop... |
| 345 | Speak Now: World Tour Live (Deluxe) | https://genius.com/albums/Taylor-swift/Speak-n... | Non-Album Songs | 11 | Drops of Jupiter (Live/2011) | https://genius.com/Taylor-swift-drops-of-jupit... | [Taylor Swift] | 2011-11-21 | 10400 | [You know, you guys have a lot of amazing band... | [Charlie Colin, Jimmy Stafford, Pat Monahan, R... | [Taylor Swift] | [Rock, Pop, In English, USA, Ballad, Adult Con... |
| 346 | Speak Now: World Tour Live (Brazilian Edition) | https://genius.com/albums/Taylor-swift/Speak-n... | Non-Album Songs | 11 | Bette Davis Eyes (Live/2011) | https://genius.com/Taylor-swift-bette-davis-ey... | [Taylor Swift] | 2011-11-21 | 0 | [There is some unbelievable music that has com... | [Donna Weiss, Jackie DeShannon] | [Taylor Swift] | [Country, Pop, USA, In English, Live, Concert,... |
| 347 | Anti-Hero (Remixes) | https://genius.com/albums/Taylor-swift/Anti-he... | Midnights | 1 | Anti-Hero (Remix) | https://genius.com/Taylor-swift-anti-hero-remi... | [Taylor Swift, Bleachers] | 2022-11-07 | 69200 | [(I'm the problem, it's me), I have this thing... | [Taylor Swift, Jack Antonoff] | [Taylor Swift, Jack Antonoff, Mikey Freedom Hart] | [Pop, Duet, Singer-Songwriter, Synth-Pop, Remix] |
| 348 | Speak Now: World Tour Live (Brazilian Edition) | https://genius.com/albums/Taylor-swift/Speak-n... | Non-Album Songs | 12 | I Want You Back (Live/2011) | https://genius.com/Taylor-swift-i-want-you-bac... | [Taylor Swift] | 2011-11-21 | 7200 | [Ooh baby give me one more chance, (I'll show ... | [] | [Taylor Swift] | [Country, Pop, Singer-Songwriter, Concert, Liv... |
| 349 | From the Vault: Live 2007-2009 | https://genius.com/albums/Jack-ingram/From-the... | Other Artist Songs | 5 | Hold On (Live) | https://genius.com/Jack-ingram-hold-on-live-ly... | [Jack Ingram, Taylor Swift] | 2018-08-31 | 0 | [I've been following your ghost, Running circl... | [Blu Sanders] | [Jack Ingram] | [Country, Live] |
| 350 | Live In No Shoes Nation | https://genius.com/albums/Kenny-chesney/Live-i... | Other Artist Songs | 3 | Big Star (Live) | https://genius.com/Kenny-chesney-big-star-with... | [Kenny Chesney, Taylor Swift] | 2017-10-27 | 0 | [This song is about a girl who had a dream and... | [Stephony Smith] | [] | [Country, Duet, Live] |
| 351 | Hope for Haiti Now | https://genius.com/albums/Various-artists/Hope... | Non-Album Songs | 8 | Breathless | https://genius.com/Taylor-swift-breathless-lyrics | [Taylor Swift] | 2010-01-23 | 6000 | [Here you are now, Fresh from your war, Back f... | [Kevin Griffin] | [] | [Country, Pop, Cover] |
| 352 | Cats: Highlights From the Motion Picture Sound... | https://genius.com/albums/Andrew-lloyd-webber/... | Non-Album Songs | 13 | Macavity | https://genius.com/Taylor-swift-macavity-lyrics | [Taylor Swift, Idris Elba] | 2019-12-20 | 33800 | [Macavity's a mystery cat, he's called the Hid... | [T.S. Eliot, Andrew Lloyd Webber] | [Greg Wells, Andrew Lloyd Webber] | [Pop, Soundtrack, Musicals, Screen, Adult Cont... |


In [6]:
# Changing instances of 'Joe Alwyn' to 'William Bowery'
tswift['song_writers'] = discog_mods.change_credit_name(tswift['song_writers'], 'Joe Alwyn', 'William Bowery')
tswift['song_producers'] = discog_mods.change_credit_name(tswift['song_producers'], 'Joe Alwyn', 'William Bowery')

# Export dataframe to pkl
tswift.to_pickle('data/taylor_swift_clean.pkl')

## Database Creation and Exporting

At this point, the dataframe is complete and ready to be used in later notebooks. However, I also want a database version of the dataframe so I can query via SQL. Using SQLite3, I convert the dataframe into a seven-table database, the schema of which can be seen here (courtesy of [dbdiagram.io](https://dbdiagram.io/)):

![db_schema.png](../figures/db_schema.png)
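
To illustrate why a dataframe with list columns turns into several tables, here is a small sketch that normalizes one list column (`song_writers`) into a songs table, a writers table, and a junction table. The table and column names here are hypothetical and cover only a fraction of the real schema; `discog_mods.convert_to_db` may be organized differently. The two songs and their writers are taken from the outputs above.

```python
import sqlite3

songs = [
    {"song_title": "Tim McGraw", "song_writers": ["Liz Rose", "Taylor Swift"]},
    {"song_title": "Today Was a Fairytale", "song_writers": ["Taylor Swift"]},
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(
    """
    CREATE TABLE songs (
        song_id INTEGER PRIMARY KEY,
        song_title TEXT NOT NULL
    );
    CREATE TABLE writers (
        writer_id INTEGER PRIMARY KEY,
        writer_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE song_writers (
        song_id INTEGER NOT NULL REFERENCES songs(song_id),
        writer_id INTEGER NOT NULL REFERENCES writers(writer_id),
        PRIMARY KEY (song_id, writer_id)
    );
    """
)

for song in songs:
    song_id = conn.execute(
        "INSERT INTO songs (song_title) VALUES (?)", (song["song_title"],)
    ).lastrowid
    for name in song["song_writers"]:
        # Each writer is stored once; the junction table links writers to songs.
        conn.execute("INSERT OR IGNORE INTO writers (writer_name) VALUES (?)", (name,))
        writer_id = conn.execute(
            "SELECT writer_id FROM writers WHERE writer_name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO song_writers VALUES (?, ?)", (song_id, writer_id))

# Example query: number of songs credited to each writer.
writer_counts = conn.execute(
    """
    SELECT w.writer_name, COUNT(*) AS n_songs
    FROM song_writers AS sw
    JOIN writers AS w ON w.writer_id = sw.writer_id
    GROUP BY w.writer_name
    ORDER BY n_songs DESC, w.writer_name
    """
).fetchall()
print(writer_counts)
```

The same pattern would repeat for any other list column (artists, producers, tags), which is how a single dataframe can expand into a multi-table database.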

In [7]:
# Creates CSV for Kaggle distribution
tswift.to_csv('data/kaggle/ts_discography_released.csv', index=False)

# Creates database
discog_mods.convert_to_db(tswift, 'data/taylor_swift.db')