# Taylor Swift Discography - Data Collection

## Introduction

In this notebook, I collect and compile data using [parsel](https://parsel.readthedocs.io/en/latest/) to webscrape from [Genius](https://genius.com/), an online music encyclopedia. From there, I use pandas for dataframe creation and SQLite3 for database creation. 

It should be noted that Genius does have an [API](https://docs.genius.com/) for application development. However, the API does not provide song lyrics, which are a crucial component of this project. A third-party Python package, [lyricsgenius](https://lyricsgenius.readthedocs.io/en/master/), works in tandem with the API and webscrapes the lyrics on the user's behalf. I tested both the API and lyricsgenius before ultimately deciding to collect the data myself.

The resulting dataframes and database from this notebook are available in the [data folder](../data). CSV files used in this notebook are also included in the [data folder](../data/csv).

This notebook uses the modules [`genius_scrape.py`](../src/genius_scrape.py) for webscraping and discography dataframe creation and [`discog_mods.py`](../src/discog_mods.py) for additional modifications and transformations.

In [1]:
%cd -q ..

import pandas as pd

from src import discog_mods
from src import genius_scrape

## Webscraping and Initial Dataframe

Genius is very thorough in cataloguing an artist's discography, tracking all physical and digital variants of a release with different tracklists as well as singles, EPs, demo CDs, official playlists, and special releases. As a result, Genius catalogues Taylor Swift as having over [100 different albums](https://genius.com/artists/Taylor-swift/albums), most of which contain repeated songs. For the sake of processing time, I manually select the releases used to create the initial dataframe: deluxe versions of original albums and "Taylor's Version" rerecordings of her studio albums are preferred, and non-album singles and streaming-specific EPs that contain previously unreleased tracks are included.

The dataframe is structured as follows:

* **album_title**: the title of the album (or other release) containing each song (string)
* **album_url**: the URL to the Genius page for the album (string)
* **category**: the categorical classification of an album/song (string)
  * Note: In this project, the categories are studio album names, rerecording names, "Non-Album Songs" for songs made for soundtracks or Taylor Swift songs not associated with albums, and "Other Artist Songs" for collaborations and remixes on other artists' albums. I'll often refer to these categories as "eras," a term popularized by [The Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour).
* **album_track_number**: the track number for the song on the given album (integer)
* **song_title**: the title of the song (string)
* **song_url**: the URL to the Genius page for the song (string)
* **song_release_date**: the original release date for the song (date)
* **song_page_views**: the number of views the Genius song page has (integer)
* **song_artists**: the performing artist(s) for the song (list)
* **song_lyrics**: the lyrics to the song (list)
* **song_writers**: the writer(s) for the song (list)
* **song_producers**: the producer(s) for the song (list)
* **song_tags**: the genre tag(s) for the song (list)
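
As a rough illustration of the structure above, the sketch below builds a one-row dataframe with the same column names. All values (titles, URLs, names, dates) are hypothetical placeholders, not scraped data, and the exact dtypes produced by `genius_scrape.create_discography` may differ.

```python
import pandas as pd

# Hypothetical single-row example; every value is a placeholder.
row = {
    "album_title": "Example Album",
    "album_url": "https://example.com/albums/example-album",
    "category": "Example Album",
    "album_track_number": 1,
    "song_title": "Example Song",
    "song_url": "https://example.com/example-song-lyrics",
    "song_release_date": pd.Timestamp("2020-01-01"),
    "song_page_views": 12345,
    "song_artists": ["Example Artist"],
    "song_lyrics": ["First line", "Second line"],
    "song_writers": ["Writer A", "Writer B"],
    "song_producers": ["Producer A"],
    "song_tags": ["Pop"],
}

example = pd.DataFrame([row])

# List-valued columns are stored with the generic object dtype.
print(example.dtypes)
```

Because several columns hold Python lists, a format such as pickle keeps them intact, whereas a plain CSV round trip would turn each list into a string.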

In [2]:
# Manually selected albums/EPs from Genius' list
# Maximizes the number of unique songs while reducing cleanup
albums = genius_scrape.create_dict_from_file('data/csv/album_list.csv')

In [3]:
# Initial dataframe creation
raw_tswift = genius_scrape.create_discography('Taylor Swift', albums)
raw_tswift.to_pickle('data/taylor_swift_raw.pkl')
raw_tswift.head()

ConnectionError: HTTPSConnectionPool(host='genius.com', port=443): Max retries exceeded with url: /Taylor-swift-the-moment-i-knew-taylors-version-lyrics (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f7137e8df70>: Failed to resolve 'genius.com' ([Errno -3] Temporary failure in name resolution)"))

## Data Cleaning and Additional Entries

As noted earlier, many of Taylor Swift's releases contain repeated songs. For example, the streaming-specific EP, *The More Lover Chapter*, contains five songs: four are also on the standard release of the parent album, *Lover*, while only one is a unique release. There are also numerous remixes and alternative versions of the same songs. Both these remixes and repeated songs need to be removed from the dataframe; the latter can be easily filtered out while the former requires some manual work. Remixes or versions of songs that include new features, such as "Karma (Remix) (Ft. Ice Spice)," are kept for the sake of tracking collaborations.
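
The two kinds of removal can be sketched with plain pandas. This is only an assumption about the general approach, not the actual implementation of `discog_mods.drop_songs_from_file`; the rows and the `titles_to_drop` list are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for scraped data.
songs = pd.DataFrame({
    "album_title": ["Album A", "Album A", "EP B", "EP B"],
    "song_title": ["Song 1", "Song 2", "Song 1", "Song 3"],
    "song_writers": [["W1"], ["W1", "W2"], ["W1"], ["W3"]],
})

# Manual step: titles chosen by hand (e.g. remixes) are filtered out.
titles_to_drop = ["Song 2"]
filtered = songs[~songs["song_title"].isin(titles_to_drop)]

# Automatic step: keep the first occurrence of each title.
# `subset` is required here because list-valued columns are unhashable.
deduped = (
    filtered
    .drop_duplicates(subset="song_title", keep="first")
    .reset_index(drop=True)
)

print(deduped["song_title"].tolist())
```

Deduplicating on the title alone is a simplification; a real discography would likely need a stricter key so that different songs sharing a title are not merged.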

Taylor also has several songs that she wrote for movies or cowrote/performed with other artists that ultimately don't appear on any of her releases. The albums containing those songs weren't included in the initial data pull so as not to require more cleanup after the fact. Instead, those songs are added to the dataframe individually.

See the [README](../README.md#Constraints-and-Limitations-of-Discography) for more information on what is and isn't included in the discography.

Another small change I make is reverting instances of "Joe Alwyn" in both `song_writers` and `song_producers` to "William Bowery," the original pseudonym used for these writing/producing credits. This is a personal preference, as I'd like the William Bowery song credits to be consistent across albums/eras.
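
Since the credits are stored as lists, the rename amounts to a per-list substitution. The helper below, `rename_credit`, is a hypothetical stand-in written for illustration; it is not the code behind `discog_mods.change_credit_name`, and the credit lists are placeholders.

```python
def rename_credit(credits, old_name, new_name):
    """Return a copy of a credit list with old_name replaced by new_name."""
    return [new_name if name == old_name else name for name in credits]


# Placeholder credit lists, one per song.
writers = [
    ["Taylor Swift", "Joe Alwyn"],
    ["Taylor Swift", "Aaron Dessner"],
]

renamed = [rename_credit(c, "Joe Alwyn", "William Bowery") for c in writers]
print(renamed)
```

With a pandas column, the same helper could be applied per row, for example `series.apply(rename_credit, args=("Joe Alwyn", "William Bowery"))`. Returning a new list rather than mutating in place leaves the original data untouched.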

In [None]:
# Drops specific songs (alternative productions/remixes of existing songs)
# and duplicate songs
tswift = discog_mods.drop_songs_from_file(raw_tswift, 'data/csv/songs_to_drop_part1.csv', drop_duplicates=True)
tswift['category'].value_counts()

In [None]:
# Adds numerous songs to the dataframe at once
tswift = discog_mods.add_songs_from_file(tswift, 'data/csv/songs_to_add.csv')
tswift.tail(15)

In [None]:
# Changing instances of 'Joe Alwyn' to 'William Bowery'
tswift['song_writers'] = discog_mods.change_credit_name(tswift['song_writers'], 'Joe Alwyn', 'William Bowery')
tswift['song_producers'] = discog_mods.change_credit_name(tswift['song_producers'], 'Joe Alwyn', 'William Bowery')

# Export dataframe to pkl
tswift.to_pickle('data/taylor_swift_clean.pkl')

## Database Creation and Exporting

At this point, the dataframe is complete and ready to be used in later notebooks. However, I also want a database version of the dataframe so I can query via SQL. Using SQLite3, I convert the dataframe into a seven-table database, the schema of which can be seen here (courtesy of [dbdiagram.io](https://dbdiagram.io/)):

![db_schema.png](../figures/db_schema.png)
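
To show why a dataframe with list columns maps onto several tables, here is a minimal sketch using only the standard-library `sqlite3` module. It models a single many-to-many relationship (songs and writers) with a junction table. The table and column names are assumptions made for this example and do not reproduce the seven-table schema pictured above or the behavior of `discog_mods.convert_to_db`.

```python
import sqlite3

# Hypothetical rows; each song carries a list of writers.
songs = [
    {"song_title": "Song 1", "song_writers": ["Writer A", "Writer B"]},
    {"song_title": "Song 2", "song_writers": ["Writer A"]},
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (
        song_id INTEGER PRIMARY KEY,
        song_title TEXT NOT NULL
    );
    CREATE TABLE writers (
        writer_id INTEGER PRIMARY KEY,
        writer_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE song_writers (
        song_id INTEGER NOT NULL REFERENCES songs(song_id),
        writer_id INTEGER NOT NULL REFERENCES writers(writer_id),
        PRIMARY KEY (song_id, writer_id)
    );
""")

for song in songs:
    song_id = conn.execute(
        "INSERT INTO songs (song_title) VALUES (?)", (song["song_title"],)
    ).lastrowid
    for name in song["song_writers"]:
        # Each writer is stored once; repeat names are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO writers (writer_name) VALUES (?)", (name,)
        )
        writer_id = conn.execute(
            "SELECT writer_id FROM writers WHERE writer_name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO song_writers VALUES (?, ?)", (song_id, writer_id)
        )
conn.commit()

# Example query: number of songs credited to each writer.
rows = conn.execute("""
    SELECT w.writer_name, COUNT(*) AS n_songs
    FROM song_writers AS sw
    JOIN writers AS w ON w.writer_id = sw.writer_id
    GROUP BY w.writer_name
    ORDER BY n_songs DESC, w.writer_name
""").fetchall()
print(rows)
```

The same pattern would presumably repeat for the other list columns (artists, producers, tags), which is one plausible way a single dataframe expands into multiple tables.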

In [None]:
# Creates CSV for Kaggle distribution
tswift.to_csv('data/kaggle/ts_discography_released.csv', index=False)

# Creates database
discog_mods.convert_to_db(tswift, 'data/taylor_swift.db')