# Analyze Twenty One Pilots Lyrics with Python
### Inspired by Codecademy's "Analyze Taylor Swift Lyrics with Python"

Double check that you have all required packages/libraries installed and accessible before running the import cell. \
Install Seaborn: https://seaborn.pydata.org/installing.html \
Install NLTK (Natural Language ToolKit): https://www.nltk.org/install.html

In [1]:
%matplotlib inline

import pandas as pd
import string
import seaborn as sns
import matplotlib.pyplot as plt
import collections
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

### 1. Build and load the dataset
After running the first cell to load necessary libraries, next we'll build and load the dataset ("dataframe") to be analyzed.

Our handiest functions to get insight into our data: \
(See more info on https://pandas.pydata.org/docs/reference/index.html)
- Use pd.read_csv to load a file into a dataframe and save it as a variable in our Jupyter notebook. Use the dataframe's .to_csv() method to write it to a file.
>
>Keep in mind as you code that a dataframe assigned to a variable in Jupyter lives only in your current session: any modifications you make to it do not edit the csv file directly. If you want your edits to be permanent, you have to call .to_csv() on the dataframe to write it to a selected file.
>
>**Be careful when overwriting content!** Check what is in the file before writing to it. You don't want to lose content!
- Use .head() to inspect the first few rows of a dataframe.
- Use .info() to determine how many rows there are, check for missing values, and check the variable types.
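
As a rough illustration of the functions listed above, here is a minimal sketch of a load, modify, and write cycle. The two sample lines and the in-memory buffers are assumptions made for the demo, standing in for real csv files, so nothing on disk is touched.

```python
import io
import pandas as pd

# Hypothetical two-line csv standing in for a lyrics file
csv_text = "lyric\nFirst sample line\nSecond sample line\n"

# pd.read_csv is a top-level pandas function that returns a dataframe
demo = pd.read_csv(io.StringIO(csv_text))

# This change only affects the in-memory dataframe, not the csv source
demo["line"] = range(1, 1 + len(demo))

# .to_csv is a dataframe method; here it writes to a buffer instead of a file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
written = buffer.getvalue()

print(demo.head())
demo.info()
```

The `csv_text` string is never changed by the assignment to `demo["line"]`; only `written`, produced by the explicit `.to_csv()` call, contains the new column.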

How do we get from copied-and-pasted lyrics to a broad dataset with custom information for each lyric line?

Below is the process:
1. Copy and paste the lyrics from your chosen song into a new csv file. I've called this "one_song_lyrics_only.csv".
2. Skim the lyrics in the csv file and look for any lines with commas. Commas are the default delimiter, so they will be interpreted not as part of the lyric but as an indicator of a new cell of content. If you use an IDE like Visual Studio Code, extensions like Rainbow CSV can help these jump out immediately. It's your choice how to manage these lines so the dataframe can be properly read and compiled. Here are some options:
   > - My usual preference: simply count every comma in a song's lyrics as a new lyric line, and manually delete it and add a new line. (Just backspace and return after each comma when you see one.) It takes a bit of extra time and effort, but I prefer to interpret many song lyrics like this manually.
   > - Wrap the lines with commas in double-quotes. Note that this will force your whole line to be read as a string object, but this should be the default for the rest of your lines anyway. I use this as well. The decision between this option and the last is an artistic one, regarding what you qualify as a singular lyric. Consider rhyming scheme when trying to decide what should remain one line or can be separated.
   > - If you want to avoid the issue altogether upfront and leave commas inside the lyric as part of a single lyric line without bothering with double-quotes, you can change the field delimiter used when reading the file to something else, like a semicolon. E.g., pd.read_csv("one_song_lyrics_only.csv", sep=';'). I haven't implemented this personally so the approach may need some adjusting. Just be aware it can get dicey if you're trying to combine files that use different delimiters, and the default is a comma.
3. Starting with a file with just one column containing a song's lyrics, build a dataframe with more columns that include details about the song, to match our comprehensive lyrics dataframe. Step through the cells below for the functions to do this.
4. Concatenate/append this one song's dataframe to our comprehensive lyrics dataframe.
5. Check our files and variables and then write the new and improved, extended comprehensive dataframe to our main lyrics data csv file. (Again, be careful when writing over big files. You can write to a temporary file to check accuracy beforehand; I used one called "testingoutput.csv".)
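
To make the comma options from step 2 concrete, here is a small sketch comparing the double-quote approach with the alternative-delimiter approach. The sample lyrics are made up and read from in-memory strings rather than real files; the semicolon option is untested against the actual lyrics files, so treat it as an assumption.

```python
import io
import pandas as pd

# Hypothetical lyric file: the line containing a comma is wrapped in double quotes
quoted_csv = 'lyric\nNo commas on this line\n"One comma, kept inside the lyric"\n'
quoted = pd.read_csv(io.StringIO(quoted_csv))

# Same lyrics without quotes; a semicolon is passed as the field delimiter,
# so the comma is read as ordinary text
plain_csv = "lyric\nNo commas on this line\nOne comma, kept inside the lyric\n"
semicolon = pd.read_csv(io.StringIO(plain_csv), sep=";")

print(quoted["lyric"].tolist())
print(semicolon["lyric"].tolist())
```

Both reads should produce the same single `lyric` column. The semicolon variant only works if no lyric contains a semicolon.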

*If you want to inspect the lyrics I have built with this method, you can find them in the file tree. To find these through Jupyter, click on the Jupyter logo in the upper left corner. You can, of course, also view these in the file directory in the GitHub repository.*

In [2]:
# Load datasets

# Comprehensive, main lyrics list to build upon
lyrics = pd.read_csv("twenty_one_pilots_lyrics_2009-2024.csv")

# Song to add to the comprehensive lyrics list
one_song = pd.read_csv("one_song_lyrics_only.csv")

# Don't forget to put the header "lyric" on the first line of your "one_song_lyrics_only.csv"!
# CSV files are read with the first line as the column name(s).

In [3]:
# Inspect the first few rows of our target comprehensive dataframe
lyrics.head()

Unnamed: 0,album_name,track_title,track_n,lyric,line
0,Johnny Boy - EP,Johnny Boy,1,He stays home from work this time,1
1,Johnny Boy - EP,Johnny Boy,1,He never really told his wife,2
2,Johnny Boy - EP,Johnny Boy,1,He never really told a lie but this time he de...,3
3,Johnny Boy - EP,Johnny Boy,1,It's alright,4
4,Johnny Boy - EP,Johnny Boy,1,No one really knows his mind and no one knows ...,5


In [4]:
# A sample of what we can do with our data
# Look at the names of all songs in this dataframe, without repeats
print(lyrics.track_title.unique())

['Johnny Boy' 'Air Catcher' 'Time to Say Goodbye' 'Addict with a Pen'
 'Friend, Please' 'Taxi Cab' 'Implicit Demand for Proof' 'Fall Away'
 'The Pantaloon' 'March to the Sea' 'Oh, Ms. Believer' 'Trapdoor'
 'A Car, a Torch, a Death' 'Before You Start Your Day'
 'Isle of Flightless Birds' 'Guns for Hands' 'Holding on to You'
 'Ode to Sleep' 'Slowtown' 'Car Radio' 'Forest' 'Glowing Eyes'
 'Kitchen Sink' 'Anathema' 'Lovely' 'Ruby' 'Trees' 'Be Concerned' 'Clear'
 'House of Gold' 'Two' 'Migraine' 'Semi-Automatic' 'Screen'
 'The Run and Go' 'Fake You Out' 'Truce' 'Heavydirtysoul' 'Stressed Out'
 'Ride' 'Fairly Local' 'Tear in My Heart' 'Lane Boy' 'The Judge' 'Doubt'
 'Polarize' "We Don't Believe What's on TV" 'Message Man' 'Hometown'
 'Not Today' 'Goner' 'Jumpsuit' 'Levitate' 'Morph' 'My Blood' 'Chlorine'
 'Smithereens' 'Neon Gravestones' 'The Hype' 'Nico and the Niners'
 'Cut My Lip' 'Bandito' 'Pet Cheetah' 'Legend' 'Leave the City'
 'Christmas Saves the Year' 'Level of Concern' 'Good Day' '

In [5]:
# Inspect the first few rows of our lyrics to build upon
one_song.head()

Unnamed: 0,lyric
0,I ponder of something great
1,My lungs will fill and then deflate
2,"They fill with fire, exhale desire"
3,"I know it's dire, my time today"
4,"I have these thoughts so often, I ought"


In [6]:
# Get info about the song's dataframe: check that the non-null count matches the number of entries (i.e., no missing lyrics)
one_song.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   lyric   68 non-null     object
dtypes: object(1)
memory usage: 672.0+ bytes


In [7]:
# Get info about our target dataframe, noticing again the columns we want to add to our one song.
lyrics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5405 entries, 0 to 5404
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   album_name   5405 non-null   object
 1   track_title  5405 non-null   object
 2   track_n      5405 non-null   int64 
 3   lyric        5405 non-null   object
 4   line         5405 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 211.3+ KB


## STOP here!
Update the variables in the next cell to match the current song before running it.

In [8]:
# Build new information around our song lyrics to match the target dataframe.

# First, we can use variables to make it even easier and more obvious where to modify content per song:
one_song_album_name = "Sample Album Name - Vessel"
one_song_track_title = "Sample Song Name - Car Radio"
one_song_track_n = 5

# Then, we'll assign these values to cells in our dataframe.

# If you don't care about the columns sitting at the same positions as in the target dataframe, you can simply assign, e.g.:
# one_song['album_name'] = one_song_album_name
# one_song['track_n'] = one_song_track_n
# If you do care about that, you can use .insert(), like this:
one_song.insert(0,'album_name', one_song_album_name)
one_song.insert(1,'track_title', one_song_track_title)
one_song.insert(2,'track_n', one_song_track_n)
# You can add what is basically an incrementing index per lyric line with a range func, like this:
one_song.insert(4, 'line', range(1, 1+len(one_song)))

# NOTE that you can run a simple assignment many times without any consequence,
# because it will just re-assign the same values,
# but .insert() creates a new column and raises a ValueError if a column with that name already exists.
# So don't run a cell with .insert() more than once per session (reload one_song first if you need to re-run it).
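
If re-running the cell is likely, one possible workaround is to guard each `.insert()` call. The sketch below uses a hypothetical helper, `insert_once`, which is not part of pandas or of this notebook, and a made-up two-line song.

```python
import pandas as pd

# Made-up stand-in for the one_song dataframe
demo_song = pd.DataFrame({"lyric": ["First sample line", "Second sample line"]})

def insert_once(df, loc, column, value):
    """Hypothetical helper: insert a column only if it is not already present."""
    if column not in df.columns:
        df.insert(loc, column, value)

# Running the same inserts twice simulates re-running the notebook cell
for _ in range(2):
    insert_once(demo_song, 0, "album_name", "Sample Album")
    insert_once(demo_song, 1, "track_title", "Sample Song")
    insert_once(demo_song, 2, "track_n", 5)
    insert_once(demo_song, 4, "line", range(1, 1 + len(demo_song)))

print(demo_song)
```

The second pass changes nothing, because every column already exists. Note that this guard also means updated values are silently ignored on a re-run, so it trades one surprise for another.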

In [9]:
# Now take another look at our updated one song's dataframe
one_song.head()

Unnamed: 0,album_name,track_title,track_n,lyric,line
0,Sample Album Name - Vessel,Sample Song Name - Car Radio,5,I ponder of something great,1
1,Sample Album Name - Vessel,Sample Song Name - Car Radio,5,My lungs will fill and then deflate,2
2,Sample Album Name - Vessel,Sample Song Name - Car Radio,5,"They fill with fire, exhale desire",3
3,Sample Album Name - Vessel,Sample Song Name - Car Radio,5,"I know it's dire, my time today",4
4,Sample Album Name - Vessel,Sample Song Name - Car Radio,5,"I have these thoughts so often, I ought",5


In [10]:
# Write the new song's dataframe to its own csv, if wanted
one_song.to_csv('one_song_lyrics_finished.csv', index=False)

In [11]:
# Concatenate new song's dataframe with the comprehensive list
combined_lyrics = pd.concat([lyrics,one_song])
combined_lyrics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 67
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   album_name   5473 non-null   object
 1   track_title  5473 non-null   object
 2   track_n      5473 non-null   int64 
 3   lyric        5473 non-null   object
 4   line         5473 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 256.5+ KB


You can also concatenate it to a csv for just that album, if you'd like to have that on hand too.
Use a similar process to add these kinds of extra pieces to your dataset.
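
Notice in the output above that the combined index runs "0 to 67": `pd.concat` keeps each dataframe's original row labels by default, so labels repeat. This doesn't affect the csv written with `index=False`, but if you keep working with the combined dataframe in the same session, a sketch like the following (with made-up data) shows how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Made-up stand-ins for the main lyrics list and the new song
main = pd.DataFrame({"lyric": ["a", "b", "c"], "line": [1, 2, 3]})
new_song = pd.DataFrame({"lyric": ["d", "e"], "line": [1, 2]})

# Default behavior: original row labels are kept, so 0 and 1 appear twice
kept = pd.concat([main, new_song])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([main, new_song], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 2, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```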

In [12]:
#combined_lyrics.to_csv('testingoutput.csv', index=False)

## STOP here!
Check your one_song_lyrics_finished.csv or testingoutput.csv file before making your csv edits permanent! If there is an error, you will need to fix and then read the file and manipulate the data again.
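
Besides eyeballing the file, you could also check it programmatically by reading the test output back in and comparing it with the dataframe you wrote. The following is only a sketch with made-up rows, and it writes to an in-memory buffer where the notebook would use "testingoutput.csv".

```python
import io
import pandas as pd

# Made-up stand-in for combined_lyrics, including a lyric with a comma
combined_demo = pd.DataFrame({
    "album_name": ["Sample Album", "Sample Album"],
    "track_title": ["Sample Song", "Sample Song"],
    "track_n": [5, 5],
    "lyric": ["No commas here", "One comma, inside the lyric"],
    "line": [1, 2],
})

# Write, then read back (a buffer stands in for the test csv file)
buffer = io.StringIO()
combined_demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Three quick sanity checks before overwriting anything important
same_shape = check.shape == combined_demo.shape
no_missing = not check.isna().any().any()
same_content = check.equals(combined_demo)

print(same_shape, no_missing, same_content)
```

If any of the three flags is False, something was lost or altered on the way through the csv, for example a lyric being split at an unquoted comma.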

In [13]:
# If this test output looks good, make the change permanent by overwriting your main file.
combined_lyrics.to_csv('twenty_one_pilots_lyrics_2009-2024.csv', index=False)

### Now you can check your csv files to see the new data! 
If it all looks good, clean up your working files and prepare to repeat the process for your next song. (Technically you don't have to, since you'll overwrite these files when you start over anyway; it's simply good practice and saves space if you're not coming back to this for a while.) You can either do this manually, by deleting the files or their contents (easy and fast enough by hand, since you'll be copying and pasting into these files again anyway), or you can wipe the files through Jupyter by truncating them with something like this:

In [14]:
# Delete the triple quotes to automate this feature.
'''
f = open("testingoutput.csv", "w+")
f.close()
g = open("one_song_lyrics_only.csv", "w+")
g.close()
h = open("one_song_lyrics_finished.csv", "w+")
h.close()
'''

'\nf = open("testingoutput.csv", "w+")\nf.close()\ng = open("one_song_lyrics_only.csv", "w+")\ng.close()\nh = open("one_song_lyrics_finished.csv", "w+")\nh.close()\n'
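
A slightly more defensive variant of the truncation above uses a `with` block, so the file handle is closed even if something goes wrong. This sketch runs against a temporary directory (an assumption for the demo) so no real working file is wiped.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch:
    # Hypothetical working file inside a throwaway directory
    target = Path(scratch) / "testingoutput.csv"
    target.write_text("lyric\nSample line\n")
    size_before = target.stat().st_size

    # Opening in "w" mode truncates the file; the with-block closes it
    with open(target, "w"):
        pass

    size_after = target.stat().st_size

print(size_before, size_after)
```

An emptied "one_song_lyrics_only.csv" also loses its "lyric" header line, so remember to add it back before pasting in the next song.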

### 2. Add Useful Data

In [8]:
print(lyrics.album_name.unique())

['Johnny Boy - EP' 'Twenty One Pilots' 'Regional at Best' 'Vessel'
 'Blurryface' 'Trench' 'Single: Christmas Saves the Year'
 'Single: Level of Concern' 'Scaled and Icy' 'Clancy' 'Single: Heathens']


In [9]:
# This function maps the name of the album/single to its release date (YYYY-MM-DD)
def release_date(row):  
    if row['album_name'] == 'Johnny Boy - EP':
        return '2009-05-04'
    elif row['album_name'] == 'Twenty One Pilots':
        return '2009-12-29'
    elif row['album_name'] == 'Regional at Best':
        return '2011-07-08'
    elif row['album_name'] == 'Vessel':
        return '2013-01-08'
    elif row['album_name'] == 'Blurryface':
        return '2015-05-17'
    elif row['album_name'] == 'Trench':
        return '2018-10-05'
    elif row['album_name'] == 'Scaled and Icy':
        return '2021-05-21'
    elif row['album_name'] == 'Clancy':
        return '2024-05-24'
    # Now for singles
    elif row['album_name'] == 'Single: Heathens':
        return '2016-06-16'
    elif row['album_name'] == 'Single: Level of Concern':
        return '2020-04-09'
    elif row['album_name'] == 'Single: Christmas Saves the Year':
        return '2020-12-08'    
    return 'No Date'

In [10]:
lyrics['album_release_date'] = lyrics.apply(release_date, axis=1)
# Inspect the first few rows of the DataFrame
lyrics.head()

Unnamed: 0,album_name,track_title,track_n,lyric,line,album_release_date
0,Johnny Boy - EP,Johnny Boy,1,He stays home from work this time,1,2009-05-04
1,Johnny Boy - EP,Johnny Boy,1,He never really told his wife,2,2009-05-04
2,Johnny Boy - EP,Johnny Boy,1,He never really told a lie but this time he de...,3,2009-05-04
3,Johnny Boy - EP,Johnny Boy,1,It's alright,4,2009-05-04
4,Johnny Boy - EP,Johnny Boy,1,No one really knows his mind and no one knows ...,5,2009-05-04
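
As an alternative to the if/elif chain, the same lookup can be sketched with a dictionary and `Series.map`. Only three of the albums are included here to keep the example short, and the demo dataframe with its "Unknown Album" row is made up.

```python
import pandas as pd

# Dates copied from three branches of release_date above
release_dates = {
    "Vessel": "2013-01-08",
    "Blurryface": "2015-05-17",
    "Trench": "2018-10-05",
}

# Made-up dataframe with one album that is missing from the dictionary
demo = pd.DataFrame({"album_name": ["Vessel", "Trench", "Unknown Album"]})

# .map looks each album up in the dictionary; unmatched names become NaN,
# which fillna then replaces with the same 'No Date' fallback
demo["album_release_date"] = demo["album_name"].map(release_dates).fillna("No Date")

print(demo)
```

With this layout, adding a new album or single means adding one dictionary entry rather than another `elif` branch.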


### 3. Clean the lyric text

To accurately count keyword mentions, we need to:

- lowercase everything,
- remove punctuation,
- and exclude stop words.
  
Save this in a new column called clean_lyric and check to be sure you have what you expect by viewing the first few rows.

In [11]:
# Make lowercase
lyrics['clean_lyric'] = lyrics['lyric'].str.lower()
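# Remove punctuation (regex=True must be passed explicitly: since pandas 2.0, str.replace treats the pattern as literal text by default)
lyrics['clean_lyric'] = lyrics['clean_lyric'].str.replace(r'[^\w\s]', '', regex=True)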

# Remove stopwords (see the next cell for illustration)
# Create a small list of English stop words, feel free to edit this list
stop = ['the', 'a', 'this', 'that', 'is', 'am', 'was', 'were', 'be', 'being', 'been']

# There are three steps in one here - explained below
# We make a list of words with `.split()`
# Then we remove all the words in our list
# Then we join the words back together into a string
lyrics['clean_lyric'] = lyrics['clean_lyric'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))



In [12]:
lyrics.head()

Unnamed: 0,album_name,track_title,track_n,lyric,line,album_release_date,clean_lyric
0,Johnny Boy - EP,Johnny Boy,1,He stays home from work this time,1,2009-05-04,he stays home from work time
1,Johnny Boy - EP,Johnny Boy,1,He never really told his wife,2,2009-05-04,he never really told his wife
2,Johnny Boy - EP,Johnny Boy,1,He never really told a lie but this time he de...,3,2009-05-04,he never really told lie but time he decides i...
3,Johnny Boy - EP,Johnny Boy,1,It's alright,4,2009-05-04,its alright
4,Johnny Boy - EP,Johnny Boy,1,No one really knows his mind and no one knows ...,5,2009-05-04,no one really knows his mind and no one knows ...
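
To see the three cleaning steps in isolation, here is a small self-contained sketch applied to two of the lyric lines shown above. It uses the same stop word list, and passes `regex=True` explicitly so the punctuation pattern is treated as a regular expression.

```python
import pandas as pd

# Two lyric lines taken from the head() output above
demo = pd.DataFrame({"lyric": ["It's alright", "He stays home from work this time"]})

stop = ['the', 'a', 'this', 'that', 'is', 'am', 'was', 'were', 'be', 'being', 'been']

# Step 1: lowercase everything
clean = demo["lyric"].str.lower()

# Step 2: drop every character that is not a word character or whitespace
clean = clean.str.replace(r"[^\w\s]", "", regex=True)

# Step 3: split into words, drop stop words, and join back into a string
clean = clean.apply(lambda text: " ".join(word for word in text.split() if word not in stop))

demo["clean_lyric"] = clean
print(demo["clean_lyric"].tolist())
# ['its alright', 'he stays home from work time']
```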
