When presale tickets went on sale for Taylor Swift’s “Eras” tour, her first tour since 2018, a mad rush ensued. “Historically unprecedented demand” snapped up 2 million tickets, the most ever sold for one artist in a single day. The aftermath, dubbed "Taylor Swift’s Ticketmaster meltdown", has become a hot topic and has even taken on a political dimension, drawing bipartisan outrage from Democrats and Republicans who have questioned whether Ticketmaster handled the ticket rollout appropriately. This whole fiasco was a great opportunity for my beloved gf to introduce me to Taylor's mighty deep and meaningful music, and it sparked our curiosity to dive a bit deeper into the lyrics. In this Jupyter Notebook, we'll analyze the most commonly used words across all of Taylor Swift's major albums released so far. We will then extend our analysis to a few albums which we (basically my gf) deem more important.

First, let’s explore Taylor’s discography with the Spotify API. To do that, we'll connect through the spotipy library.

In [1]:
# pip install spotipy

In [2]:
import spotipy

In [3]:
from spotipy.oauth2 import SpotifyOAuth

In [4]:
# client_id = "your_client_id"
# client_secret = "your_client_secret"
# redirect_uri = "http://localhost:9000"

In [5]:
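import os

# Read the credentials from environment variables instead of hard-coding them in the notebook
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]
redirect_uri = "http://localhost:9000"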

In [6]:
# Connect with API Keys created earlier
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=client_id,
                                              client_secret=client_secret,
                                              redirect_uri=redirect_uri))

Let's look up Taylor Swift on Spotify: go to spotify.com and open the artist page; you should see the artist ID in the URL.

In [7]:
taylor_swift = sp.artist("06HL4z0CvFAxyc27GXpf02")

In [8]:
taylor_albums = sp.artist_albums(taylor_swift['id'],album_type='album',limit=50)
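A single request returns at most `limit` albums (50 here). If a discography ever spans more than one page, the remaining pages have to be fetched too. A minimal sketch, assuming `sp` is the authenticated client from above; `fetch_all_albums` is a hypothetical helper name, not part of spotipy:

```python
def fetch_all_albums(sp, artist_id):
    """Collect album items across all result pages (hypothetical helper)."""
    page = sp.artist_albums(artist_id, album_type='album', limit=50)
    items = list(page['items'])
    # A paged result carries a 'next' URL until the last page is reached
    while page and page.get('next'):
        page = sp.next(page)  # spotipy's helper for following the 'next' link
        if page:
            items.extend(page['items'])
    return items
```

It would be called as `fetch_all_albums(sp, taylor_swift['id'])` and returns a plain list of album dictionaries.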

In [9]:
for album in taylor_albums['items']:
    print(f"Album: {album['name']}")

Album: Midnights (3am Edition)
Album: Midnights (3am Edition)
Album: Midnights
Album: Midnights
Album: Red (Taylor's Version)
Album: Red (Taylor's Version)
Album: Fearless (Taylor's Version)
Album: evermore (deluxe version)
Album: evermore (deluxe version)
Album: evermore
Album: evermore
Album: folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]
Album: folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]
Album: folklore (deluxe version)
Album: folklore (deluxe version)
Album: folklore
Album: folklore
Album: Lover
Album: Taylor Swift Karaoke: reputation
Album: reputation
Album: reputation (Big Machine Radio Release Special)
Album: reputation Stadium Tour Surprise Song Playlist
Album: Taylor Swift Karaoke: 1989 (Deluxe)
Album: 1989 (Big Machine Radio Release Special)
Album: 1989
Album: 1989 (Deluxe)
Album: Red (Deluxe Edition)
Album: Red (Big Machine Radio Release Special)
Album: Red
Album: Speak Now World Tour Live
Album: Sp

Now we have all the albums from Spotify. Let's remove the duplicates and the releases we don't want (karaoke, live, and remix versions).

In [10]:
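album_names = []  # three-character prefixes of the titles already kept
albums = []
for album in taylor_albums['items']:
    album_name = album['name']
    # Treat titles that share their first three characters as editions of the same album,
    # and skip karaoke, live, and remix releases
    if album_name[:3] not in album_names and 'remix' not in album_name and 'Karaoke' not in album_name and 'Live' not in album_name:
        album_names.append(album_name[:3])
        albums.append(album_name)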

In [11]:
albums

['Midnights (3am Edition)',
 "Red (Taylor's Version)",
 "Fearless (Taylor's Version)",
 'evermore (deluxe version)',
 'folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]',
 'Lover',
 'reputation',
 '1989 (Big Machine Radio Release Special)',
 'Speak Now',
 'Taylor Swift']

OK, now we have kept only the albums we are interested in.
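Matching on the first three characters works for this list, but it could merge unrelated albums whose titles happen to start alike. As a hedged alternative, the sketch below groups editions by the part of the title that precedes the first colon, parenthesis, or bracket. The helper names `base_title` and `dedupe_albums` and the sample list are ours, for illustration only; like the cell above, it keeps the first edition it sees.

```python
import re

def base_title(name):
    """Return the lowercased title up to the first ':', '(' or '['."""
    return re.split(r'[:(\[]', name, maxsplit=1)[0].strip().lower()

def dedupe_albums(names, skip=('remix', 'karaoke', 'live')):
    """Keep the first edition of each album and drop titles containing a skip word."""
    seen = set()
    kept = []
    for name in names:
        lowered = name.lower()
        # Simple substring check, so 'live' would also match inside longer words
        if any(word in lowered for word in skip):
            continue
        key = base_title(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept

sample = ['Midnights (3am Edition)', 'Midnights', "Red (Taylor's Version)",
          'Taylor Swift Karaoke: 1989 (Deluxe)', '1989 (Big Machine Radio Release Special)',
          '1989', 'Red (Deluxe Edition)', 'Speak Now World Tour Live']
print(dedupe_albums(sample))
```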

Next, we have to fetch the lyrics of all of Taylor's songs. For that, we are going to use LyricsGenius, a Python client for the Genius.com API that provides a simple interface to the song, artist, and lyrics data stored on Genius.com.
Genius.com is a cool website. If you aren’t familiar with it, Genius hosts a bunch of song lyrics and lets users highlight and annotate passages with interpretations, explanations, and references.
For more details, see the LyricsGenius documentation: https://lyricsgenius.readthedocs.io/en/master/

In [12]:
# We can use pip to install lyricsgenius:
# pip install lyricsgenius

But before using the library we need to get an access token. The setup is super easy and quick. All the instructions can be found here: https://lyricsgenius.readthedocs.io/en/master/setup.html#setup

In [13]:
import os

# token = 'my_access_token_here'
token = os.environ['GENIUS_ACCESS_TOKEN']  # read the token from the environment instead of hard-coding it

In [14]:
# Now let's import our library
from lyricsgenius import Genius

Let's fetch the lyrics of all the albums and save them in JSON format. This might take a while...

In [15]:
genius = Genius(access_token=token, timeout=13, sleep_time=0.6)
non_saved_albums = []
for album in albums:
    # search_album returns None when Genius finds no match
    album_genius = genius.search_album(album, 'Taylor Swift')
    if album_genius is None:
        non_saved_albums.append(album)
    else:
        album_genius.save_lyrics()

Searching for "Midnights (3am Edition)" by Taylor Swift...
Wrote Lyrics_Midnights3amEdition.json.
Searching for "Red (Taylor's Version)" by Taylor Swift...
Wrote Lyrics_RedTaylorsVersion.json.
Searching for "Fearless (Taylor's Version)" by Taylor Swift...
Wrote Lyrics_FearlessTaylorsVersion.json.
Searching for "evermore (deluxe version)" by Taylor Swift...
Wrote Lyrics_evermoredeluxeversion.json.
Searching for "folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]" by Taylor Swift...
Wrote Lyrics_folklorethelongpondstudiosessionsfromtheDisneyspecialdeluxeedition.json.
Searching for "Lover" by Taylor Swift...
Wrote Lyrics_Lover.json.
Searching for "reputation" by Taylor Swift...
Wrote Lyrics_reputation.json.
Searching for "1989 (Big Machine Radio Release Special)" by Taylor Swift...
No results found for: '1989 (Big Machine Radio Release Special) Taylor Swift'
Searching for "Speak Now" by Taylor Swift...
Wrote Lyrics_SpeakNow.json.
Searching for "Taylor Swif

In [16]:
non_saved_albums

['1989 (Big Machine Radio Release Special)']

It seems that the album 1989 was not downloaded, because Genius found no match for the Spotify title. It's a rather important album though, so let's retry manually with the text in parentheses removed.

In [17]:
album_genius = genius.search_album('1989', 'Taylor Swift')
album_genius.save_lyrics()

Searching for "1989" by Taylor Swift...
Wrote Lyrics_1989.json.


Nice, it worked! 
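The manual retry can also be automated. The sketch below assumes that dropping a trailing parenthetical is a sensible fallback for any album Genius cannot find. `candidate_titles` and `search_with_fallback` are hypothetical helpers, and `search` stands for any callable that returns `None` on a miss, for example `lambda t: genius.search_album(t, 'Taylor Swift')`.

```python
import re

def candidate_titles(title):
    """Yield the title as given, then with trailing '(...)' or '[...]' groups removed."""
    yield title
    stripped = re.sub(r'(\s*[(\[][^)\]]*[)\]])+\s*$', '', title).strip()
    if stripped and stripped != title:
        yield stripped

def search_with_fallback(title, search):
    """Try each candidate title in turn and return the first hit, or None."""
    for candidate in candidate_titles(title):
        result = search(candidate)
        if result is not None:
            return result
    return None

print(list(candidate_titles('1989 (Big Machine Radio Release Special)')))
```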

Time to import the other Python libraries that we will need.

In [18]:
import json
import glob
import re
import string
from nltk.corpus import stopwords

Let's create a .txt file with all the lyrics.

In [19]:
files = glob.glob("*.json")

In [20]:
files

['Lyrics_1989.json',
 'Lyrics_evermoredeluxeversion.json',
 'Lyrics_FearlessTaylorsVersion.json',
 'Lyrics_folklorethelongpondstudiosessionsfromtheDisneyspecialdeluxeedition.json',
 'Lyrics_Lover.json',
 'Lyrics_Midnights3amEdition.json',
 'Lyrics_RedTaylorsVersion.json',
 'Lyrics_reputation.json',
 'Lyrics_SpeakNow.json',
 'Lyrics_TaylorSwift.json']

In [21]:
def get_file_names(file_name): 
    return file_name[7:][:-5]

In [22]:
files_name_only = list(map(get_file_names, files))

In [23]:
files_name_only

['1989',
 'evermoredeluxeversion',
 'FearlessTaylorsVersion',
 'folklorethelongpondstudiosessionsfromtheDisneyspecialdeluxeedition',
 'Lover',
 'Midnights3amEdition',
 'RedTaylorsVersion',
 'reputation',
 'SpeakNow',
 'TaylorSwift']
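Slicing by fixed offsets depends on the exact lengths of the `Lyrics_` prefix and the `.json` suffix. A slightly more explicit variant, shown here only as a sketch, uses `pathlib` and `str.removeprefix` (Python 3.9 or later); `album_key` is a hypothetical name.

```python
from pathlib import Path

def album_key(file_name):
    """Turn 'Lyrics_SpeakNow.json' into 'SpeakNow'."""
    # .stem drops the extension, removeprefix drops the 'Lyrics_' marker if present
    return Path(file_name).stem.removeprefix('Lyrics_')

print([album_key(f) for f in ['Lyrics_1989.json', 'Lyrics_SpeakNow.json']])
```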

In [24]:
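for album in files_name_only:
    with open('Lyrics_' + album + '.json') as json_file:
        data = json.load(json_file)
    # Collect the lyrics of every track on this album
    lyrics = [item['song']['lyrics'] for item in data['tracks']]
    # Append to the combined file; the trailing newline keeps consecutive albums from running together
    with open('taylor_all_lyrics.txt', 'a') as f:
        f.write('\n'.join(lyrics) + '\n')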

Cool, now we have one txt file with all the lyrics!
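One caveat: the file is opened in append mode, so re-running the cell adds every lyric a second time and would inflate the word counts. A hedged alternative gathers everything first and writes the file once. It assumes the JSON layout used above (`data['tracks'][i]['song']['lyrics']`) and UTF-8 files, so the combined file would then need to be read back with the same encoding; `collect_lyrics` is a hypothetical helper.

```python
import json

def collect_lyrics(json_paths, out_path):
    """Write the lyrics of every track in json_paths to out_path, overwriting it."""
    all_lyrics = []
    for path in json_paths:
        with open(path, encoding='utf-8') as json_file:
            data = json.load(json_file)
        for item in data['tracks']:
            lyrics = item['song'].get('lyrics')
            if lyrics:  # skip tracks that have no lyrics stored
                all_lyrics.append(lyrics)
    # 'w' mode replaces any previous file, so repeated runs give the same result
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(all_lyrics) + '\n')
    return len(all_lyrics)
```

With the notebook's variables it could be called as `collect_lyrics(files, 'taylor_all_lyrics.txt')`.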

Let's clean our data a little bit to prepare for the final and most interesting part: finding out the most commonly used words in all of Taylor's songs.

In [25]:
with open('taylor_all_lyrics.txt') as f: 
    lines = f.read().splitlines()

In [26]:
lines[:10]

['TranslationsEspañolPortuguêsPolskiWelcome to New York Lyrics[Verse 1]',
 'Walking through a crowd, the village is aglow',
 'Kaleidoscope of loud heartbeats under coats',
 'Everybody here wanted something more',
 "Searching for a sound we hadn't heard before",
 'And it said',
 '',
 '[Chorus]',
 "Welcome to New York, it's been waiting for you",
 'Welcome to New York, welcome to New York']

We notice that there are section labels like [Verse 1] or [Chorus] that we don't need for our analysis, so let's remove them.

In [27]:
lyrics_str = ' '.join(map(str, lines))

In [28]:
words_to_remove = re.findall(r'\[([^]]*)\]', lyrics_str) # Finds the words in []
custom_stopwords = set(words_to_remove)
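To make the pattern concrete: `\[([^]]*)\]` captures whatever sits between a pair of square brackets, so the custom stopwords are whole labels such as 'Verse 1', not single words. The short sketch below runs it on a made-up line and also shows a possible alternative, which is to delete the bracketed labels with `re.sub` before splitting into words. The sample string is ours.

```python
import re

sample = "Welcome to New York Lyrics[Verse 1] Walking through a crowd [Chorus] Welcome to New York"

# findall returns only the captured group, i.e. the text inside the brackets
labels = re.findall(r'\[([^]]*)\]', sample)
print(labels)  # ['Verse 1', 'Chorus']

# Alternative: replace each bracketed label with a space, then split into words
without_labels = re.sub(r'\[[^]]*\]', ' ', sample)
print(without_labels.split()[:6])
```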

In [29]:
#custom_stopwords

Some more data cleaning. 

In [30]:
def text_process(mess, custom_stopwords):
    """
    Takes in a string of text, then performs the following:
    1. Remove English stopwords
    2. Remove all punctuation
    3. Remove all custom stopwords
    4. Return a list of the words we want to keep
    """
    # Build the stopword set once; set lookups are much faster than re-reading the list per word
    english_stopwords = set(stopwords.words('english'))

    # Remove all English stopwords (uses the function argument, not the global lyrics_str)
    no_stopwords = [word for word in mess.split() if word.lower() not in english_stopwords]
    no_stopwords_join = ' '.join(no_stopwords)

    # Remove all punctuation
    nopunc = [char for char in no_stopwords_join if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # Remove custom stopwords
    words = [word for word in nopunc.split() if word not in custom_stopwords]

    # Return the final list of words
    return [word for word in words if word.lower() not in english_stopwords] # Removing any English stopwords again after the punctuation is gone

In [31]:
words = text_process(lyrics_str,custom_stopwords)

Let's quickly test the result

In [32]:
'Chorus' in words

False

In [33]:
words[:30]

['TranslationsEspañolPortuguêsPolskiWelcome',
 'New',
 'York',
 'LyricsVerse',
 '1',
 'Walking',
 'crowd',
 'village',
 'aglow',
 'Kaleidoscope',
 'loud',
 'heartbeats',
 'coats',
 'Everybody',
 'wanted',
 'something',
 'Searching',
 'sound',
 'heard',
 'said',
 'Welcome',
 'New',
 'York',
 'waiting',
 'Welcome',
 'New',
 'York',
 'welcome',
 'New',
 'York']

OK, it seems that the verse numbers have remained, along with the translation languages.

In [34]:
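# Final cleaning: drop single digits, overlong tokens (the fused translation headers), and the leftover 'LyricsVerse'
digits = set(map(str, range(10)))
words = [word for word in words if word not in digits and len(word) < 35 and word.lower() != 'lyricsverse']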

Sweet! Now we can finally search for the most commonly used words. Let's see whether there are any surprises...

In [35]:
from collections import Counter
# Count case-insensitively; a separate name avoids shadowing the Counter class
word_counts = Counter(map(str.lower, words))
most_occur = word_counts.most_common(5)
print('Top-5:')
for word, count in most_occur:
    print(f"{word} -- with {count} occurrences")

Top-5:
im -- with 631 occurrences
like -- with 608 occurrences
know -- with 561 occurrences
oh -- with 547 occurrences
never -- with 434 occurrences
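The top entry, 'im', is most likely what remains of "I'm" once the apostrophe is stripped: the stopword filter runs before punctuation removal and evidently did not catch the contraction, and the bare form 'im' is not a stopword either. If such leftovers are unwanted, one option (our assumption, not part of the analysis above) is to filter a small hand-made set of apostrophe-free contractions before counting. `contraction_leftovers` and the sample words are illustrative only.

```python
from collections import Counter

# Illustrative, not exhaustive: contractions as they look after punctuation removal
contraction_leftovers = {'im', 'ive', 'youre', 'cant'}

sample_words = ['Im', 'never', 'like', 'ive', 'know', 'never', 'Im']
filtered = [w.lower() for w in sample_words if w.lower() not in contraction_leftovers]
print(Counter(filtered).most_common(2))
```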
