# Creating the OHCO for our Corpus of Song Lyrics
### By: Nick Bruno and Karan Kant

#### Our goal in this project is to analyze a corpus of 57,650 songs by 643 artists, obtained from Kaggle. The corpus includes recent songs by modern artists as well as classic songs. Our main purpose was to compare how similar artists are based on their song lyrics; we also compared the similarity of individual songs. Along with the similarity analysis, we ran a MALLET topic model on the entire corpus to get a general sense of the main topics in the lyrics. We also ran sentiment analysis to compare artists whose music is the most uplifting with artists whose lyrical content is generally negative.

In [1]:
# Import libraries
import pandas as pd
import os
os.chdir('/Users/nickbruno/Documents/spring_2019/DS5559/project/code')

In [2]:
# Upload raw corpus
data = pd.read_csv('songdata.csv')

In [3]:
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


The data includes four columns: artist, song, link, and text (the lyrics). The 'link' column is not needed for our analysis and is dropped below.

In [4]:
artists = data.artist.unique().tolist() # creates a list of unique artists

In [5]:
len(artists) # 643 total artists

643

In [6]:
data.shape[0]

57650

This corpus includes 57,650 songs from 643 different artists.

In [7]:
artists_df = pd.DataFrame(artists)

In [8]:
artists_df.head() # can be useful in creating a database later

Unnamed: 0,0
0,ABBA
1,Ace Of Base
2,Adam Sandler
3,Adele
4,Aerosmith


In [9]:
artists_df.insert(0, 'artist_id', range(len(artists_df)))
    # creates a unique id for each artist

In [10]:
artists_df.head()

Unnamed: 0,artist_id,0
0,0,ABBA
1,1,Ace Of Base
2,2,Adam Sandler
3,3,Adele
4,4,Aerosmith


In [11]:
artists_df = artists_df.rename(columns={0: 'artist'})

In [12]:
new_df = pd.merge(data, artists_df, on='artist') # attaches 'artist_id' to each song

In [13]:
new_df = new_df.drop('link', axis=1)

In [15]:
new_df.head() # each song now carries its artist's 'artist_id'

Unnamed: 0,artist,song,text,artist_id
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd...",0
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0
3,ABBA,Bang,Making somebody happy is a question of give an...,0
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0
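
The artist ids above come from `unique()`, `range()`, and a merge. As a minimal sketch of a shorter route (not what the notebook ran), `pd.factorize` numbers values in order of first appearance and should give the same ids. The `toy` frame below is hypothetical stand-in data, not rows from the corpus.

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw corpus
toy = pd.DataFrame({
    'artist': ['ABBA', 'ABBA', 'Adele', 'Aerosmith', 'Adele'],
    'song': ['Bang', 'Cassandra', 'Hello', 'Dream On', 'Skyfall'],
})

# factorize() returns integer codes numbered by first appearance,
# plus the unique values those codes refer to
codes, uniques = pd.factorize(toy['artist'])
toy['artist_id'] = codes

print(toy)
print(list(uniques))
```

Here `uniques` plays the role of `artists_df`: position `i` holds the artist whose `artist_id` is `i`.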


### Get rid of duplicate song titles
#### This allows each song to have a unique song_id

In [16]:
# Find duplicate songs #
all_dup = new_df[new_df['song'].duplicated()] # rows whose title already appeared earlier in the corpus

In [17]:
all_dup_list = all_dup.song.unique().tolist() # creates a song list of duplicates

In [18]:
# Remove duplicate songs #
songs = new_df.song

In [19]:
# Create list of songs whose titles are not duplicated in the corpus #
all_dup_set = set(all_dup_list) # set membership is much faster than scanning a list
no_dup_song_list = [x for x in songs if x not in all_dup_set]

In [20]:
# Creates a corpus of songs that are not duplicated in the corpus #
newer_df = new_df[new_df['song'].isin(no_dup_song_list)]

In [21]:
newer_df.head()
    # The song titled 'Bang' by ABBA is no longer in our corpus because its title appears more than once

Unnamed: 0,artist,song,text,artist_id
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd...",0
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0
5,ABBA,Burning My Bridges,"Well, you hoot and you holler and you make me ...",0


In [22]:
len(newer_df.song.unique()) # removing duplicate titles dropped many songs (57,650 -> 38,690)
    # this should make it easier to build an OHCO that includes every word

38690
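
The filter above drops every song whose title occurs more than once, including the first occurrence. A minimal sketch of the same idea in one step uses `duplicated(keep=False)`, which flags all rows that share a title. The `toy` frame is hypothetical data, and this is an alternative to the cells above rather than the code that produced the counts.

```python
import pandas as pd

# Hypothetical toy rows: 'Bang' is used as a title by two different artists
toy = pd.DataFrame({
    'artist': ['ABBA', 'ABBA', 'Ace Of Base', 'Adele'],
    'song': ['Bang', 'Cassandra', 'Bang', 'Hello'],
})

# keep=False marks every row whose title appears more than once,
# not just the second and later occurrences
shared_title = toy['song'].duplicated(keep=False)

# Keep only songs whose title is unique in the corpus
unique_titles = toy[~shared_title]
print(unique_titles)
```

Both copies of 'Bang' are removed, which matches the behavior of the list-based filter.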

In [23]:
# Create clearer dataframe #
songs = newer_df.song

In [24]:
songs_df = pd.DataFrame(songs)

In [25]:
# Assign a unique song id to each song #
songs_df.insert(0, 'song_id', range(len(songs_df)))

In [26]:
songs_df = songs_df.rename(columns={0: 'song'}) # no-op here: the column is already named 'song'

In [27]:
final_df = pd.merge(newer_df, songs_df)

In [28]:
# Cleaner dataframe #
final_df.head()

Unnamed: 0,artist,song,text,artist_id,song_id
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd...",0,0
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0,1
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0,2
3,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0,3
4,ABBA,Burning My Bridges,"Well, you hoot and you holler and you make me ...",0,4


In [28]:
# Write csv out for mallet analysis #
final_df.to_csv('mallet_start_df.csv')

### Creating the OHCO and splitting each song by verse and line

In [29]:
# Set index to artist_id and song_id to create an OHCO
final_df = final_df.set_index(['artist_id', 'song_id'])

In [30]:
final_df = final_df.drop(['artist','song'], axis=1)

In [31]:
final_df = final_df.rename(columns={'text': 'lyrics'})

In [32]:
# New OHCO
final_df.head()

artist_id,song_id,lyrics
0,0,"Look at her face, it's a wonderful face \nAnd..."
0,1,"Take it easy with me, please \nTouch me gentl..."
0,2,I'll never know why I had to go \nWhy I had t...
0,3,Making somebody happy is a question of give an...
0,4,"Well, you hoot and you holler and you make me ..."


In [33]:
# write out to a .csv (this is a good starting point)
final_df.to_csv('artist_song_OHCO_df.csv', index=True)

#### Split song lyrics by verse

In [34]:
# Split song lyrics by verse #
verses = final_df.lyrics.str.split('  \n  \n', expand=True)\
            .stack()\
            .to_frame()\
            .rename(columns={0: 'Verse'})

In [35]:
verses.index.names = ['artist_id','song_id','verse_num']

In [36]:
verses.head()

artist_id,song_id,verse_num,Verse
0,0,0,"Look at her face, it's a wonderful face \nAnd..."
0,0,1,"She's just my kind of girl, she makes me feel ..."
0,0,2,And when we go for a walk in the park \nAnd s...
0,0,3,"She's just my kind of girl, she makes me feel ..."
0,1,0,"Take it easy with me, please \nTouch me gentl..."


In [42]:
verses.shape[0]
    # There are 237,737 total verses in the 38,690 songs that remain after removing duplicate titles

237737

In [60]:
# Write out to .csv #
verses.to_csv('artist_song_verse_OHCO_df.csv') # write out to a .csv

#### Split verses by line

In [38]:
# Split by line
lines = verses.Verse.str.split('  \n', expand=True)\
    .stack()\
    .to_frame()\
    .rename(columns={0:'Line'})

In [39]:
# Recreate the index to match the new OHCO
lines.index.names = ['artist_id','song_id','verse_num','line_num']

In [40]:
lines.head()

artist_id,song_id,verse_num,line_num,Line
0,0,0,0,"Look at her face, it's a wonderful face"
0,0,0,1,And it means something special to me
0,0,0,2,Look at the way that she smiles when she sees me
0,0,0,3,How lucky can one fellow be?
0,0,1,0,"She's just my kind of girl, she makes me feel ..."


In [43]:
lines.shape[0]
    # There are 1,366,608 lines in the 38,690 songs that remain after removing duplicate titles

1366608
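
To make the two splits easier to follow, here is a minimal sketch of the verse and line steps on one toy song. The lyrics string is hypothetical and assumes the same separators used above: two spaces plus a newline between lines, and that sequence twice between verses.

```python
import pandas as pd

# Hypothetical toy song with two verses of two lines each
toy = pd.DataFrame(
    {'lyrics': ['v1 l1  \nv1 l2  \n  \nv2 l1  \nv2 l2']},
    index=pd.MultiIndex.from_tuples([(0, 0)], names=['artist_id', 'song_id']),
)

# Split into verses: one column per verse, then stack columns into rows
verses = (toy.lyrics.str.split('  \n  \n', expand=True)
          .stack()
          .to_frame('Verse'))
verses.index.names = ['artist_id', 'song_id', 'verse_num']

# Split each verse into lines the same way, adding one more index level
lines = (verses.Verse.str.split('  \n', expand=True)
         .stack()
         .to_frame('Line'))
lines.index.names = ['artist_id', 'song_id', 'verse_num', 'line_num']

print(lines)
```

Each `split` and `stack` pair adds one level to the index, so the index itself records the position of every line within its verse, song, and artist.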

In [63]:
# Write to csv
lines.to_csv('artist_song_verse_line_OHCO_df.csv') # write out to a .csv

Creating the final OHCO with a bag of words (BOW) is problematic because our corpus is so large: we could not build a token table or a BOW for the full dataset. Throughout our analysis we therefore work with subsets of the data, keeping as much of it as possible, and we note the size of each subset as we go. We often used Rivanna to run memory-heavy commands. Below is the code we tried to run on the entire corpus but could not, because of the corpus's size.

In [None]:
# Apply it to the whole dataframe #
TOKEN_PAT = r'(\W+)'
tokens = lines.Line.str.split(TOKEN_PAT, expand=True)\
    .stack()\
    .to_frame()\
    .rename(columns={0:'token_str'})
    # DOES NOT RUN BECAUSE IT IS SO LARGE

In [None]:
tokens.index.names = ['artist_id', 'song_id', 'verse_num', 'line_num', 'token_num']
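
One possible reason the split above is so memory-hungry is that `expand=True` builds a wide table with one column per token position, padded out to the longest line. The sketch below is an assumption-laden alternative, not something we ran on the full corpus: it collects word tokens with `str.findall` and turns them into rows with `explode`, which avoids the padded table. The helper `tokenize_lines` and the two sample lines are hypothetical, and unlike the `(\W+)` split it keeps only word tokens, not the separators. Whether this fits in memory for all 1,366,608 lines is untested.

```python
import pandas as pd

# Hypothetical two-line sample with the same index levels as `lines`
idx = pd.MultiIndex.from_tuples(
    [(0, 0, 0, 0), (0, 0, 0, 1)],
    names=['artist_id', 'song_id', 'verse_num', 'line_num'],
)
lines = pd.DataFrame(
    {'Line': ['Look at her face', 'How lucky can one fellow be?']},
    index=idx,
)

def tokenize_lines(df):
    """Return one row per word token, indexed by the OHCO plus token_num."""
    tokens = (df.Line.str.findall(r'\w+')   # list of word tokens per line
                .explode()                  # one row per token
                .dropna()                   # lines with no tokens yield NaN
                .to_frame('token_str'))
    # Number the tokens within each line to extend the OHCO by one level
    levels = list(range(df.index.nlevels))
    tokens['token_num'] = tokens.groupby(level=levels).cumcount()
    return tokens.set_index('token_num', append=True)

tokens = tokenize_lines(lines)
print(tokens)
```

The same function could be applied to one slice of `lines` at a time, with each result written to disk, if the full result is still too large to hold at once.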