The first step for feature-based models is to EXTRACT FEATURES, so let's start there!

- We already have the HuggingFace features (using siebert/sentiment-roberta-large-english on the lyrics)
    - we can explore other implementations of this as well
- But let's also extract other things:
    - number of words
    - number of unique words
    - N-grams: bigrams, trigrams, etc. (sequences of 2 words, 3 words, etc.)
        - "hi my name is chelsea"
        - bigrams = ["hi my", "my name", "name is", "is chelsea"]
    - type token ratio (number of unique words / number of words) -> *lexical richness*
        - "the dog chased the other dog"
        - 6 tokens (the dog chased the other dog)
        - 4 types (the dog chased other)
    - counts and frequencies of various part of speech tags
    - named entities

- Also just predict on raw embeddings with traditional models
    - converts "hi my name is chelsea" -> [<768 numbers>], for instance [0.87 0.91 1.2 ...]
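
As a quick sanity check of the definitions above, here is a minimal sketch that computes the bigrams and the type-token ratio for the two example sentences in plain Python. The helper names `word_ngrams` and `type_token_ratio` are made up for illustration and are not part of this notebook, and plain whitespace splitting is assumed to be a good enough tokenizer for the example.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in `text` (hypothetical helper)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(text):
    """Unique words / total words; returns 0.0 for empty text (assumed convention)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


bigrams = word_ngrams("hi my name is chelsea", 2)
ttr = type_token_ratio("the dog chased the other dog")

print(bigrams)        # the four bigrams listed above
print(round(ttr, 3))  # 4 types / 6 tokens
```

A lower ratio means more repetition, which is why it is often read as a rough measure of lexical richness.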

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

In [None]:
train = pd.read_csv('/Users/chelseachandler/Desktop/Lara/Data/HuggingFace Data/TRAIN lyrics data.csv')
val = pd.read_csv('/Users/chelseachandler/Desktop/Lara/Data/HuggingFace Data/VALIDATE lyrics data.csv')
test = pd.read_csv('/Users/chelseachandler/Desktop/Lara/Data/HuggingFace Data/TEST lyrics data.csv')

test['SPLIT'] = 'test'
train['SPLIT'] = 'train'
val['SPLIT'] = 'val'

df = pd.concat([train, test, val]).drop(columns=['Unnamed: 0'])

In [None]:
df

Unnamed: 0,lastfm_url,track,artist,seeds,number_of_emotion_tags,valence_tags,arousal_tags,dominance_tags,mbid,spotify_id,...,HF_ROBERTA_byline_maximum_sentiment,HF_ROBERTA_byline_median_sentiment,HF_ROBERTA_byline_stdv_sentiment,HF_ROBERTA_byline_firstquartile_sentiment,HF_ROBERTA_byline_thirdquartile_sentiment,HF_ROBERTA_byline_ratio_negative,HF_ROBERTA_byline_ratio_positive,HF_ROBERTA_fullinput512_one_sentiment_number,HF_ROBERTA_fullinput512_one_sentiment_label,SPLIT
0,https://www.last.fm/music/metallica/_/st.%2banger,St. Anger,Metallica,['aggressive'],8,3.710000,5.833000,5.427250,727a2529-7ee8-4860-aef6-7959884895cb,3fOc9x06lKJBhz435mInlH,...,0.998394,-0.964930,0.987632,-0.998400,0.994197,0.541667,0.458333,-0.991625,NEGATIVE,train
1,https://www.last.fm/music/m.i.a./_/bamboo%2bbanga,Bamboo Banga,M.I.A.,"['aggressive', 'fun', 'sexy', 'energetic']",13,6.555071,5.537214,5.691357,99dd2c8c-e7c1-413e-8ea4-4497a00ffa18,6tqFC1DIOphJkCwrjVzPmg,...,0.998659,0.997418,0.822465,0.985194,0.998252,0.222222,0.777778,0.991217,POSITIVE,train
2,https://www.last.fm/music/drowning%2bpool/_/st...,Step Up,Drowning Pool,['aggressive'],9,2.971389,5.537500,4.726389,49e7b4d2-3772-4301-ba25-3cc46ceb342e,4Q1w4Ryyi8KNxxaFlOQClK,...,0.998025,-0.994096,0.991149,-0.999036,0.982711,0.536585,0.463415,-0.994474,NEGATIVE,train
3,https://www.last.fm/music/deftones/_/7%2bwords,7 Words,Deftones,"['aggressive', 'angry']",10,3.807121,5.473939,4.729091,1a826083-5585-445f-a708-415dc90aa050,6DoXuH326aAYEN8CnlLmhP,...,0.998425,-0.995764,0.845690,-0.997119,-0.983623,0.758065,0.241935,-0.995440,NEGATIVE,train
4,https://www.last.fm/music/deftones/_/when%2bgi...,When Girls Telephone Boys,Deftones,"['aggressive', 'angry', 'driving', 'energetic']",8,3.910741,4.915556,4.631852,3bc2c1a9-43bc-45b2-87fc-4313eb2534fe,6xK3sBdRm99g9T8Ov0gjdF,...,0.998660,-0.995459,0.849322,-0.999005,-0.537081,0.750000,0.250000,-0.998056,NEGATIVE,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1533,https://www.last.fm/music/billy%2bjoel/_/to%2b...,To Make You Feel My Love,Billy Joel,['quiet'],6,4.047212,2.953077,4.020577,1bf51e3f-dd54-43d0-9ec0-1f6a57a86d6b,6KaNr5y4GAlHGp2dQiZNPs,...,0.998726,0.995103,0.925544,-0.957914,0.998471,0.333333,0.666667,0.998069,POSITIVE,val
1534,https://www.last.fm/music/jon%2bforeman/_/equa...,Equally Skilled,Jon Foreman,['quiet'],5,6.491667,3.460000,5.688333,f2b00b37-ab87-480c-bef3-238f5c336193,3QvVWld6eVVn2osdNwML93,...,0.998476,-0.991718,0.983697,-0.997286,0.995252,0.547945,0.452055,-0.990866,NEGATIVE,val
1535,https://www.last.fm/music/warren%2bzevon/_/my%...,My Shit's Fucked Up,Warren Zevon,['cynical'],6,2.758333,3.813333,3.856667,8a6c2225-480a-4d5a-b6f1-c8b892fbaf97,26douMAqNELour6sKd2oR7,...,0.998692,-0.996378,0.862175,-0.998759,-0.496763,0.750000,0.250000,-0.997924,NEGATIVE,val
1536,https://www.last.fm/music/prince/_/i%2bhate%2bu,I Hate U,Prince,['cynical'],5,7.003258,5.863864,5.831894,4d3c271c-08a5-472d-8e5a-7b70604ec288,1hJc23JlQlCAs2SUVDAVWL,...,0.998871,-0.815917,0.984479,-0.998030,0.994888,0.525424,0.474576,-0.996826,NEGATIVE,val


In [None]:
df.columns

Index(['lastfm_url', 'track', 'artist', 'seeds', 'number_of_emotion_tags',
       'valence_tags', 'arousal_tags', 'dominance_tags', 'mbid', 'spotify_id',
       'genre', 'track edited', 'artist edited', 'Lyrics',
       'HF_byline_average_sentiment', 'HF_byline_minimum_sentiment',
       'HF_byline_maximum_sentiment', 'HF_byline_median_sentiment',
       'HF_byline_stdv_sentiment', 'HF_byline_firstquartile_sentiment',
       'HF_byline_thirdquartile_sentiment', 'HF_byline_ratio_negative',
       'HF_byline_ratio_positive', 'HF_fullinput512_one_sentiment_number',
       'HF_fullinput512_one_sentiment_label', 'number_lines',
       'HF_ROBERTA_byline_average_sentiment',
       'HF_ROBERTA_byline_minimum_sentiment',
       'HF_ROBERTA_byline_maximum_sentiment',
       'HF_ROBERTA_byline_median_sentiment',
       'HF_ROBERTA_byline_stdv_sentiment',
       'HF_ROBERTA_byline_firstquartile_sentiment',
       'HF_ROBERTA_byline_thirdquartile_sentiment',
       'HF_ROBERTA_byline_ratio_negativ

In [None]:
# we might want to do some preprocessing, so let's get an idea of what the lyrics look like now
df['Lyrics']

0       Saint Anger 'round my neck\nSaint Anger 'round...
1       Road runner, road runner\nGoing hundred mile p...
2       Broken - You been livin' on the edge of a brok...
3       I'll never be the same, breaking decency\nDon'...
4       "...it's hella sensitive..."\n\nAlways the sam...
                              ...                        
1533    When the rain is blowing in your face\nAnd the...
1534    How miserable I am\nI feel like a fruit-picker...
1535    Well, I went to the doctor\nI said, "I'm feeli...
1536    U have just accessed the Hate Experience\nDo U...
1537    We're only making plans for Nigel\nWe only wan...
Name: Lyrics, Length: 10254, dtype: object

In [None]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/chelseachandler/nltk_data...


True

In [None]:
# sentiments by line

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

VADER_byline_average_sentiment = []
VADER_byline_minimum_sentiment = []
VADER_byline_maximum_sentiment = []
VADER_byline_median_sentiment = []
VADER_byline_stdv_sentiment = []
VADER_byline_firstquartile_sentiment = []
VADER_byline_thirdquartile_sentiment = []
VADER_byline_ratio_negative = []
VADER_byline_ratio_positive = []
VADER_byline_ratio_neutral = []

i=0

for index, row in df.iterrows():
  print('song', i)

  current_sentiments = []
  VADER_byline_count_negative = 0
  VADER_byline_count_positive = 0
  VADER_byline_count_neutral = 0

  current_song_lyrics = [x for x in row['Lyrics'].split('\n') if x != '' and x[0] != '[']

  for song_line in current_song_lyrics:

    comp = sid.polarity_scores(song_line)

    comp = comp['compound']

    if i == 0:
       print(song_line, comp)

    if comp >= 0.5:
        VADER_byline_count_positive += 1
    elif comp > -0.5 and comp < 0.5:
        VADER_byline_count_neutral += 1
    else:
        VADER_byline_count_negative += 1

    current_sentiments.append(comp)

  # total number of scored lines for this song (denominator for the ratios below);
  # computed once per song, after the inner loop over lines has finished
  num_total = len(current_sentiments)

  if len(current_sentiments) > 0:
    VADER_byline_average_sentiment.append(np.average(current_sentiments))
    VADER_byline_minimum_sentiment.append(np.amin(current_sentiments))
    VADER_byline_maximum_sentiment.append(np.amax(current_sentiments))
    VADER_byline_stdv_sentiment.append(np.std(current_sentiments))

    VADER_byline_firstquartile_sentiment.append(np.percentile(current_sentiments, 25))
    VADER_byline_median_sentiment.append(np.percentile(current_sentiments, 50))
    VADER_byline_thirdquartile_sentiment.append(np.percentile(current_sentiments, 75))

    VADER_byline_ratio_negative.append(VADER_byline_count_negative/num_total)
    VADER_byline_ratio_positive.append(VADER_byline_count_positive/num_total)
    VADER_byline_ratio_neutral.append(VADER_byline_count_neutral/num_total)

  else:
    VADER_byline_average_sentiment.append(np.nan)
    VADER_byline_minimum_sentiment.append(np.nan)
    VADER_byline_maximum_sentiment.append(np.nan)
    VADER_byline_stdv_sentiment.append(np.nan)

    VADER_byline_firstquartile_sentiment.append(np.nan)
    VADER_byline_median_sentiment.append(np.nan)
    VADER_byline_thirdquartile_sentiment.append(np.nan)

    VADER_byline_ratio_negative.append(np.nan)
    VADER_byline_ratio_positive.append(np.nan)
    VADER_byline_ratio_neutral.append(np.nan)

  i+=1


song 0
Saint Anger 'round my neck -0.5719
Saint Anger 'round my neck -0.5719
She never gets respect -0.3724
Saint Anger 'round my neck -0.5719
(You flush it out, you flush it out) 0.0
Saint Anger 'round my neck -0.5719
(You flush it out, you flush it out) 0.0
He never gets respect -0.3724
(You flush it out, you flush it out) 0.0
Saint Anger 'round my neck -0.5719
(You flush it out, you flush it out) 0.0
She never gets respect -0.3724
Fuck it all and no regrets -0.802
I hit the lights on these dark sets 0.0
I need a voice to let myself 0.0
To let myself go free 0.5106
Fuck it all and fuckin' no regrets -0.802
I hit the lights on these dark sets 0.0
Medallion noose, I hang myself 0.0
Saint Anger 'round my neck -0.5719
I feel my world shake -0.1779
Like an earthquake 0.3612
Hard to see clear 0.296
Is it me? Is it fear? -0.5514
I'm madly in anger with you (x4) -0.7506
Saint Anger 'round my neck -0.5719
Saint Anger 'round my neck -0.5719
She never gets respect -0.3724
Saint Anger 'round my 
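
The line filtering and the positive/neutral/negative bucketing in the loop above can be looked at in isolation. The sketch below is only an illustration: `keep_line` and `bucket` are hypothetical helper names, the `[Verse 1]` marker is an invented example of a section header, and the compound scores are copied from the printed output rather than recomputed with VADER.

```python
def keep_line(line):
    """Drop empty lines and section markers such as "[Chorus]"."""
    return line != "" and not line.startswith("[")


def bucket(compound, threshold=0.5):
    """Label a compound score with the same cut-offs as the loop above."""
    if compound >= threshold:
        return "positive"
    if compound > -threshold:
        return "neutral"
    return "negative"


lyrics = "[Verse 1]\nSaint Anger 'round my neck\n\nTo let myself go free\nLike an earthquake"
lines = [x for x in lyrics.split("\n") if keep_line(x)]

# Compound scores taken from the notebook output, not computed here.
scores = {
    "Saint Anger 'round my neck": -0.5719,
    "To let myself go free": 0.5106,
    "Like an earthquake": 0.3612,
}
labels = [bucket(scores[line]) for line in lines]
ratios = {name: labels.count(name) / len(labels)
          for name in ("negative", "neutral", "positive")}

print(lines)
print(labels)
print(ratios)
```

With a cut-off of 0.5 most lines end up in the neutral bucket, so the threshold is worth treating as a tunable parameter.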

In [None]:
df['VADER_byline_average_sentiment'] = VADER_byline_average_sentiment
df['VADER_byline_minimum_sentiment'] = VADER_byline_minimum_sentiment
df['VADER_byline_maximum_sentiment'] = VADER_byline_maximum_sentiment
df['VADER_byline_median_sentiment'] = VADER_byline_median_sentiment
df['VADER_byline_stdv_sentiment'] = VADER_byline_stdv_sentiment
df['VADER_byline_firstquartile_sentiment'] = VADER_byline_firstquartile_sentiment
df['VADER_byline_thirdquartile_sentiment'] = VADER_byline_thirdquartile_sentiment
df['VADER_byline_ratio_negative'] = VADER_byline_ratio_negative
df['VADER_byline_ratio_positive'] = VADER_byline_ratio_positive
df['VADER_byline_ratio_neutral'] = VADER_byline_ratio_neutral

In [None]:
df

Unnamed: 0,lastfm_url,track,artist,seeds,number_of_emotion_tags,valence_tags,arousal_tags,dominance_tags,mbid,spotify_id,...,content_density,word_length_average,word_length_minimum,word_length_maximum,word_length_median,word_length_stdv,word_length_firstquartile,word_length_thirdquartile,BERT embedding,VADER_byline_ratio_neutral
0,https://www.last.fm/music/metallica/_/st.%2banger,St. Anger,Metallica,['aggressive'],8,3.710000,5.833000,5.427250,727a2529-7ee8-4860-aef6-7959884895cb,3fOc9x06lKJBhz435mInlH,...,0.575058,3.586605,1,10,3.0,1.640285,2.0,5.00,"[-0.55305624, 0.29175866, -0.14178172, 0.03181...",0.555556
1,https://www.last.fm/music/m.i.a./_/bamboo%2bbanga,Bamboo Banga,M.I.A.,"['aggressive', 'fun', 'sexy', 'energetic']",13,6.555071,5.537214,5.691357,99dd2c8c-e7c1-413e-8ea4-4497a00ffa18,6tqFC1DIOphJkCwrjVzPmg,...,0.723176,4.128755,1,10,4.0,1.540773,3.0,5.00,"[-0.7495488, 0.29000255, -0.018937092, 0.09212...",0.986111
2,https://www.last.fm/music/drowning%2bpool/_/st...,Step Up,Drowning Pool,['aggressive'],9,2.971389,5.537500,4.726389,49e7b4d2-3772-4301-ba25-3cc46ceb342e,4Q1w4Ryyi8KNxxaFlOQClK,...,0.683398,4.034749,1,9,4.0,1.460797,3.0,5.00,"[-0.43036273, 0.39085698, -0.18843493, 0.23764...",0.975610
3,https://www.last.fm/music/deftones/_/7%2bwords,7 Words,Deftones,"['aggressive', 'angry']",10,3.807121,5.473939,4.729091,1a826083-5585-445f-a708-415dc90aa050,6DoXuH326aAYEN8CnlLmhP,...,0.680101,3.937028,1,10,4.0,1.586646,3.0,5.00,"[-0.19958176, 0.3396095, 0.18542421, -0.180070...",0.629032
4,https://www.last.fm/music/deftones/_/when%2bgi...,When Girls Telephone Boys,Deftones,"['aggressive', 'angry', 'driving', 'energetic']",8,3.910741,4.915556,4.631852,3bc2c1a9-43bc-45b2-87fc-4313eb2534fe,6xK3sBdRm99g9T8Ov0gjdF,...,0.691244,3.857143,1,10,4.0,1.807016,3.0,5.00,"[0.0062948195, 0.38152394, 0.11528617, 0.09792...",0.857143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1533,https://www.last.fm/music/billy%2bjoel/_/to%2b...,To Make You Feel My Love,Billy Joel,['quiet'],6,4.047212,2.953077,4.020577,1bf51e3f-dd54-43d0-9ec0-1f6a57a86d6b,6KaNr5y4GAlHGp2dQiZNPs,...,0.568182,3.710227,1,8,4.0,1.595843,2.0,4.25,"[-0.4978788, 0.2379213, -0.021126416, 0.124058...",0.625000
1534,https://www.last.fm/music/jon%2bforeman/_/equa...,Equally Skilled,Jon Foreman,['quiet'],5,6.491667,3.460000,5.688333,f2b00b37-ab87-480c-bef3-238f5c336193,3QvVWld6eVVn2osdNwML93,...,0.575439,4.421053,1,13,4.0,2.130377,3.0,6.00,"[-0.290306, 0.37494284, 0.1537129, 0.02078041,...",0.863014
1535,https://www.last.fm/music/warren%2bzevon/_/my%...,My Shit's Fucked Up,Warren Zevon,['cynical'],6,2.758333,3.813333,3.856667,8a6c2225-480a-4d5a-b6f1-c8b892fbaf97,26douMAqNELour6sKd2oR7,...,0.541985,3.419847,1,7,3.0,1.392204,2.0,4.00,"[-0.3560257, 0.43088594, 0.46364734, 0.0182349...",0.666667
1536,https://www.last.fm/music/prince/_/i%2bhate%2bu,I Hate U,Prince,['cynical'],5,7.003258,5.863864,5.831894,4d3c271c-08a5-472d-8e5a-7b70604ec288,1hJc23JlQlCAs2SUVDAVWL,...,0.653543,3.311024,1,13,3.0,1.785141,2.0,4.00,"[-0.43760136, 0.07545318, 0.5136234, 0.2418227...",0.661017


In [None]:
! pip install textblob

Collecting textblob
  Using cached textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1


In [None]:
# sentiments by line

# rule based
# https://github.com/aesuli/sentiwordnet

from textblob import TextBlob

TEXTBLOB_byline_average_sentiment = []
TEXTBLOB_byline_minimum_sentiment = []
TEXTBLOB_byline_maximum_sentiment = []
TEXTBLOB_byline_median_sentiment = []
TEXTBLOB_byline_stdv_sentiment = []
TEXTBLOB_byline_firstquartile_sentiment = []
TEXTBLOB_byline_thirdquartile_sentiment = []


i=0

for index, row in df.iterrows():
    print('song', i)

    current_song_lyrics = '. '.join([x for x in row['Lyrics'].split('\n') if x != '' and x[0] != '['])

    blob = TextBlob(current_song_lyrics)
    current_sentiments = [sentence.sentiment.polarity for sentence in blob.sentences]
    if i == 0:
        print(current_song_lyrics)
        print(blob)
        print(current_sentiments)

    num_total = len(current_sentiments)


    if len(current_sentiments) > 0:
        TEXTBLOB_byline_average_sentiment.append(np.average(current_sentiments))
        TEXTBLOB_byline_minimum_sentiment.append(np.amin(current_sentiments))
        TEXTBLOB_byline_maximum_sentiment.append(np.amax(current_sentiments))
        TEXTBLOB_byline_stdv_sentiment.append(np.std(current_sentiments))

        TEXTBLOB_byline_firstquartile_sentiment.append(np.percentile(current_sentiments, 25))
        TEXTBLOB_byline_median_sentiment.append(np.percentile(current_sentiments, 50))
        TEXTBLOB_byline_thirdquartile_sentiment.append(np.percentile(current_sentiments, 75))


    else:
        TEXTBLOB_byline_average_sentiment.append(np.nan)
        TEXTBLOB_byline_minimum_sentiment.append(np.nan)
        TEXTBLOB_byline_maximum_sentiment.append(np.nan)
        TEXTBLOB_byline_stdv_sentiment.append(np.nan)

        TEXTBLOB_byline_firstquartile_sentiment.append(np.nan)
        TEXTBLOB_byline_median_sentiment.append(np.nan)
        TEXTBLOB_byline_thirdquartile_sentiment.append(np.nan)

    i+=1


song 0
Saint Anger 'round my neck. Saint Anger 'round my neck. She never gets respect. Saint Anger 'round my neck. (You flush it out, you flush it out). Saint Anger 'round my neck. (You flush it out, you flush it out). He never gets respect. (You flush it out, you flush it out). Saint Anger 'round my neck. (You flush it out, you flush it out). She never gets respect. Fuck it all and no regrets. I hit the lights on these dark sets. I need a voice to let myself. To let myself go free. Fuck it all and fuckin' no regrets. I hit the lights on these dark sets. Medallion noose, I hang myself. Saint Anger 'round my neck. I feel my world shake. Like an earthquake. Hard to see clear. Is it me? Is it fear?. I'm madly in anger with you (x4). Saint Anger 'round my neck. Saint Anger 'round my neck. She never gets respect. Saint Anger 'round my neck. (You flush it out, you flush it out). Saint Anger 'round my neck. (You flush it out, you flush it out). She never gets respect. (You flush it out, you f

In [None]:
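# replace newlines with spaces, strip punctuation (raw string avoids an invalid-escape warning), then lowercase
df['Lyrics without newline or punct lower'] = df['Lyrics'].str.replace('\n', ' ', regex=False).str.replace(r'[^\w\s]', '', regex=True).str.lower()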

In [None]:
df['Lyrics without newline or punct lower']

0       saint anger round my neck saint anger round my...
1       road runner road runner going hundred mile per...
2       broken  you been livin on the edge of a broken...
3       ill never be the same breaking decency dont be...
4       its hella sensitive  always the same old taste...
                              ...                        
1533    when the rain is blowing in your face and the ...
1534    how miserable i am i feel like a fruitpicker w...
1535    well i went to the doctor i said im feeling ki...
1536    u have just accessed the hate experience do u ...
1537    were only making plans for nigel we only want ...
Name: Lyrics without newline or punct lower, Length: 10254, dtype: object
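
To see what this normalization does to a single string, here is a small standalone sketch using only the standard library `re` module. The function name `normalize_lyrics` is hypothetical; it is meant to mirror the three steps applied to the `Lyrics` column (newlines to spaces, punctuation removed, lowercased), under the assumption that `re.sub` and the pandas regex replacement behave the same on these inputs.

```python
import re


def normalize_lyrics(text):
    """Replace newlines with spaces, strip punctuation, and lowercase."""
    text = text.replace("\n", " ")
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return text.lower()


sample = "Saint Anger 'round my neck\nShe never gets respect"
normalized = normalize_lyrics(sample)
print(normalized)

# Removing a free-standing dash leaves two spaces behind, as in row 2 above.
print(repr(normalize_lyrics("Broken - You")))
```

The doubled spaces are harmless later on because `str.split()` with no arguments collapses runs of whitespace.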

In [None]:
df['Lyrics without newline']

0       Saint Anger 'round my neck Saint Anger 'round ...
1       Road runner, road runner Going hundred mile pe...
2       Broken - You been livin' on the edge of a brok...
3       I'll never be the same, breaking decency Don't...
4       "...it's hella sensitive..."  Always the same ...
                              ...                        
1533    When the rain is blowing in your face And the ...
1534    How miserable I am I feel like a fruit-picker ...
1535    Well, I went to the doctor I said, "I'm feelin...
1536    U have just accessed the Hate Experience Do U ...
1537    We're only making plans for Nigel We only want...
Name: Lyrics without newline, Length: 10254, dtype: object

In [None]:
df['Lyrics without newline or punct']

0       Saint Anger round my neck Saint Anger round my...
1       Road runner road runner Going hundred mile per...
2       Broken  You been livin on the edge of a broken...
3       Ill never be the same breaking decency Dont be...
4       its hella sensitive  Always the same old taste...
                              ...                        
1533    When the rain is blowing in your face And the ...
1534    How miserable I am I feel like a fruitpicker W...
1535    Well I went to the doctor I said Im feeling ki...
1536    U have just accessed the Hate Experience Do U ...
1537    Were only making plans for Nigel We only want ...
Name: Lyrics without newline or punct, Length: 10254, dtype: object

In [None]:
set([1, 1, 2, 2, 3])

{1, 2, 3}

In [None]:
df['num_words'] = df.apply(lambda row: len([x for x in row['Lyrics without newline or punct lower'].split() if x != '']), axis=1)
df['num_types'] = df.apply(lambda row: len(set([x for x in row['Lyrics without newline or punct lower'].split() if x != ''])), axis=1)


In [None]:
# type-token ratio = unique words / total words, computed column-wise
# (a song with zero words yields NaN/inf here instead of raising an error)
df['type_token_ratio'] = df['num_types'] / df['num_words']

In [None]:
# word length statistics (in characters) per song
word_length_average = []
word_length_minimum = []
word_length_maximum = []
word_length_median = []
word_length_stdv = []
word_length_firstquartile = []
word_length_thirdquartile = []

for index, row in df.iterrows():
    this_lyrics_lengths = [len(x) for x in row['Lyrics without newline or punct lower'].split()]
    word_length_average.append(np.average(this_lyrics_lengths))
    word_length_minimum.append(np.amin(this_lyrics_lengths))
    word_length_maximum.append(np.amax(this_lyrics_lengths))
    word_length_stdv.append(np.std(this_lyrics_lengths))

    word_length_firstquartile.append(np.percentile(this_lyrics_lengths, 25))
    word_length_median.append(np.percentile(this_lyrics_lengths, 50))
    word_length_thirdquartile.append(np.percentile(this_lyrics_lengths, 75))

df['word_length_average'] =word_length_average
df['word_length_minimum'] =word_length_minimum
df['word_length_maximum'] =word_length_maximum
df['word_length_median'] =word_length_median
df['word_length_stdv'] =word_length_stdv
df['word_length_firstquartile'] =word_length_firstquartile
df['word_length_thirdquartile'] =word_length_thirdquartile
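
For a feel of what these word-length features look like on one short text, the sketch below computes the same seven statistics with the standard library `statistics` module instead of NumPy. The function name `word_length_stats` is hypothetical. The assumptions are that `statistics.pstdev` (population standard deviation) corresponds to `np.std` with its default settings, and that `statistics.quantiles(..., method="inclusive")` interpolates the same way as `np.percentile` by default.

```python
import statistics


def word_length_stats(text):
    """Summarize word lengths (in characters) for whitespace-split `text`.

    Assumes at least two words; older Python versions raise
    StatisticsError from quantiles() on shorter inputs.
    """
    lengths = [len(word) for word in text.split()]
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "average": statistics.mean(lengths),
        "minimum": min(lengths),
        "maximum": max(lengths),
        "median": median,
        "stdv": statistics.pstdev(lengths),
        "firstquartile": q1,
        "thirdquartile": q3,
    }


stats = word_length_stats("the dog chased the other dog")
for name, value in stats.items():
    print(name, round(value, 3))
```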


In [None]:
from nltk.util import ngrams

# total n-gram counts use len(list(...)); unique counts use len(set(...))
df['num_bigrams'] = df.apply(lambda row: len(list(ngrams(row['Lyrics without newline or punct lower'].split(), 2))), axis=1)
df['num_unique_bigrams'] = df.apply(lambda row: len(set(ngrams(row['Lyrics without newline or punct lower'].split(), 2))), axis=1)

df['num_trigrams'] = df.apply(lambda row: len(list(ngrams(row['Lyrics without newline or punct lower'].split(), 3))), axis=1)
df['num_unique_trigrams'] = df.apply(lambda row: len(set(ngrams(row['Lyrics without newline or punct lower'].split(), 3))), axis=1)

df['num_4grams'] = df.apply(lambda row: len(list(ngrams(row['Lyrics without newline or punct lower'].split(), 4))), axis=1)
df['num_unique_4grams'] = df.apply(lambda row: len(set(ngrams(row['Lyrics without newline or punct lower'].split(), 4))), axis=1)

df['num_5grams'] = df.apply(lambda row: len(list(ngrams(row['Lyrics without newline or punct lower'].split(), 5))), axis=1)
df['num_unique_5grams'] = df.apply(lambda row: len(set(ngrams(row['Lyrics without newline or punct lower'].split(), 5))), axis=1)
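
The difference between total and unique n-gram counts is easiest to see on a repetitive line. This is a minimal sketch with a hypothetical helper, `ngram_counts`, that builds n-grams with `zip` as a stand-in for `nltk.util.ngrams` so that it runs without NLTK; the sample line is adapted from the first song's lyrics.

```python
def ngram_counts(text, n):
    """Return (total, unique) n-gram counts for whitespace-split `text`."""
    tokens = text.split()
    # zip over n shifted copies of the token list yields tuples of n consecutive words
    grams = list(zip(*(tokens[i:] for i in range(n))))
    return len(grams), len(set(grams))


lyric = "you flush it out you flush it out"
for n in (2, 3, 4, 5):
    total, unique = ngram_counts(lyric, n)
    print(f"{n}-grams: total={total}, unique={unique}")
```

For repetitive lyrics the unique count stays well below the total, so the ratio of the two can serve as a repetition feature alongside the raw counts.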



In [None]:
[("Chelsea", "NN"), ...]

In [None]:
# PARTS OF SPEECH: per-song frequencies (tag count / word count)

noun_freq = []
determiner_freq = []
preposition_freq = []
base_verb_freq = []
pasttense_verb_freq = []
gerund_presentparticiple_verb_freq = []
pastparticiple_verb_freq = []
non3rdpersonsingularpresent_verb_freq = []
thirdpersonsingularpresent_verb_freq = []
TOTAL_verb_freq = []
to_freq = []
adverb_freq = []
adjective_freq = []
modal_freq = []
coordinating_conjunctions_freq = []
cardinals_freq = []
particle_freq = []
personal_pronoun_freq = []
wh_adverbs_freq = []
possessive_pronoun_freq = []
wh_determiner_freq = []
predeterminer_freq = []
interjection_freq = []
existential_there_freq = []
wh_pronoun_freq = []
content_density = []

for lyrics, wc in zip(df['Lyrics without newline or punct lower'], df.num_words):

    blob = TextBlob(lyrics)

    nouns = 0
    determiners = 0
    prepositions = 0
    base_verbs = 0
    pasttense_verbs = 0
    verb_gerund_presentparticiple = 0
    verb_pastparticiple = 0
    verb_non3rdpersonsingularpresent = 0
    verb_3rdpersonsingularpresent = 0
    tos = 0
    adverbs = 0
    adjectives = 0
    modals = 0
    coordinating_conjunctions = 0
    cardinals = 0
    particles = 0
    personal_pronouns = 0
    wh_adverbs = 0
    possessive_pronouns = 0
    wh_determiners = 0
    predeterminers = 0
    interjections = 0
    existential_theres = 0
    wh_pronouns = 0

    for word, tag in blob.tags:
        #all nouns grouped together: singular, plural, proper singular, proper plural
        if tag == 'NN' or tag == 'NNS' or tag == 'NNP' or tag == 'NNPS':
            nouns += 1
        elif tag == 'DT':
            determiners += 1
        elif tag == 'IN':
            prepositions += 1
        elif tag == 'VB':
            base_verbs +=1
        elif tag == 'VBD':
            pasttense_verbs += 1
        elif tag == 'VBG':
            verb_gerund_presentparticiple += 1
        elif tag == 'VBN':
            verb_pastparticiple += 1
        elif tag == 'VBP':
            verb_non3rdpersonsingularpresent += 1
        elif tag == 'VBZ':
            verb_3rdpersonsingularpresent += 1
        elif tag == 'TO':
            tos += 1
        #all adverbs grouped together: normal, comparative, superlative
        elif tag == 'RB' or tag == 'RBR' or tag == 'RBS':
            adverbs += 1
        #all adjectives grouped together: normal, comparative, superlative
        elif tag == 'JJ' or tag == 'JJR' or tag == 'JJS':
            adjectives += 1
        elif tag == 'MD':
            modals += 1
        elif tag == 'CC':
            coordinating_conjunctions += 1
        elif tag == 'RP':
            particles += 1
        elif tag == 'CD':
            cardinals += 1
        elif tag == 'PRP':
            personal_pronouns += 1
        #when
        elif tag == 'WRB':
            wh_adverbs += 1
        elif tag == 'PRP$':
            possessive_pronouns += 1
        #that
        elif tag == 'WDT':
            wh_determiners += 1
        elif tag == 'PDT':
            predeterminers += 1
        elif tag == 'UH':
            interjections += 1
        elif tag == 'EX':
            existential_theres += 1
        #who, what, whose
        elif tag == 'WP' or tag == 'WP$':
            wh_pronouns += 1

    total_verbs = base_verbs+pasttense_verbs+verb_gerund_presentparticiple+verb_pastparticiple+verb_non3rdpersonsingularpresent+verb_3rdpersonsingularpresent
    noun_freq.append(nouns/wc)
    determiner_freq.append(determiners/wc)
    preposition_freq.append(prepositions/wc)
    base_verb_freq.append(base_verbs/wc)
    pasttense_verb_freq.append(pasttense_verbs/wc)
    gerund_presentparticiple_verb_freq.append(verb_gerund_presentparticiple/wc)
    pastparticiple_verb_freq.append(verb_pastparticiple/wc)
    non3rdpersonsingularpresent_verb_freq.append(verb_non3rdpersonsingularpresent/wc)
    thirdpersonsingularpresent_verb_freq.append(verb_3rdpersonsingularpresent/wc)
    TOTAL_verb_freq.append(total_verbs/wc)
    to_freq.append(tos/wc)
    adverb_freq.append(adverbs/wc)
    adjective_freq.append(adjectives/wc)
    modal_freq.append(modals/wc)
    coordinating_conjunctions_freq.append(coordinating_conjunctions/wc)
    cardinals_freq.append(cardinals/wc)
    particle_freq.append(particles/wc)
    personal_pronoun_freq.append(personal_pronouns/wc)
    wh_adverbs_freq.append(wh_adverbs/wc)
    possessive_pronoun_freq.append(possessive_pronouns/wc)
    wh_determiner_freq.append(wh_determiners/wc)
    predeterminer_freq.append(predeterminers/wc)
    interjection_freq.append(interjections/wc)
    existential_there_freq.append(existential_theres/wc)
    wh_pronoun_freq.append(wh_pronouns/wc)
    content_density.append((total_verbs+nouns+adjectives+adverbs)/wc)


df['noun_freq'] = noun_freq
df['determiner_freq'] = determiner_freq
df['preposition_freq'] = preposition_freq
df['base_verb_freq'] = base_verb_freq
df['pasttense_verb_freq'] = pasttense_verb_freq
df['gerund_presentparticiple_verb_freq'] = gerund_presentparticiple_verb_freq
df['pastparticiple_verb_freq'] = pastparticiple_verb_freq
df['non3rdpersonsingularpresent_verb_freq'] = non3rdpersonsingularpresent_verb_freq
df['3rdpersonsingularpresent_verb_freq'] = thirdpersonsingularpresent_verb_freq
df['TOTAL_verb_freq'] = TOTAL_verb_freq
df['to_freq'] = to_freq
df['adverb_freq'] = adverb_freq
df['adjective_freq'] = adjective_freq
df['modal_freq'] = modal_freq
df['coordinating_conjunctions_freq'] = coordinating_conjunctions_freq
df['cardinals_freq'] = cardinals_freq
df['particle_freq'] = particle_freq
df['personal_pronoun_freq'] = personal_pronoun_freq
df['wh_adverbs_freq'] = wh_adverbs_freq
df['possessive_pronoun_freq'] = possessive_pronoun_freq
df['wh_determiner_freq'] = wh_determiner_freq
df['predeterminer_freq'] = predeterminer_freq
df['interjection_freq'] = interjection_freq
df['existential_there_freq'] = existential_there_freq
df['wh_pronoun_freq'] = wh_pronoun_freq
df['content_density'] = content_density
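
The tag grouping and the content-density calculation can also be expressed with a lookup table instead of a long `if`/`elif` chain. The sketch below is one possible way to do that, not the notebook's method: `TAG_GROUPS`, `TAG_TO_GROUP` and `summarize_tags` are hypothetical names, only the four content-word groups are shown, the sample tags are written by hand rather than produced by a tagger, and density is divided by the number of tagged tokens rather than by `num_words`.

```python
from collections import Counter

# Penn Treebank tags for the four content-word groups (same grouping as above).
TAG_GROUPS = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjective": {"JJ", "JJR", "JJS"},
    "adverb": {"RB", "RBR", "RBS"},
}
TAG_TO_GROUP = {tag: group for group, tags in TAG_GROUPS.items() for tag in tags}


def summarize_tags(tagged_words):
    """Count tags per group and return (counts, content density)."""
    counts = Counter(TAG_TO_GROUP.get(tag, "other") for _, tag in tagged_words)
    content_words = sum(counts[group] for group in TAG_GROUPS)
    density = content_words / len(tagged_words) if tagged_words else 0.0
    return counts, density


# Hand-written (word, tag) pairs for illustration only.
tagged = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
          ("the", "DT"), ("other", "JJ"), ("dog", "NN")]
counts, density = summarize_tags(tagged)
print(dict(counts))
print(round(density, 3))
```

Keeping the grouping in one dictionary means frequencies and raw counts can be derived from the same `Counter` without duplicating the tagging loop.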


In [None]:
# PARTS OF SPEECH: per-song raw counts (same tagging as the frequency cell, without dividing by word count)

noun_count = []
determiner_count = []
preposition_count = []
base_verb_count = []
pasttense_verb_count = []
gerund_presentparticiple_verb_count = []
pastparticiple_verb_count = []
non3rdpersonsingularpresent_verb_count = []
thirdpersonsingularpresent_verb_count = []
TOTAL_verb_count = []
to_count = []
adverb_count = []
adjective_count = []
modal_count = []
coordinating_conjunctions_count = []
cardinals_count = []
particle_count = []
personal_pronoun_count = []
wh_adverbs_count = []
possessive_pronoun_count = []
wh_determiner_count = []
predeterminer_count = []
interjection_count = []
existential_there_count = []
wh_pronoun_count = []
content_density = []

for lyrics, wc in zip(df['Lyrics without newline or punct lower'], df.num_words):

    blob = TextBlob(lyrics)

    nouns = 0
    determiners = 0
    prepositions = 0
    base_verbs = 0
    pasttense_verbs = 0
    verb_gerund_presentparticiple = 0
    verb_pastparticiple = 0
    verb_non3rdpersonsingularpresent = 0
    verb_3rdpersonsingularpresent = 0
    tos = 0
    adverbs = 0
    adjectives = 0
    modals = 0
    coordinating_conjunctions = 0
    cardinals = 0
    particles = 0
    personal_pronouns = 0
    wh_adverbs = 0
    possessive_pronouns = 0
    wh_determiners = 0
    predeterminers = 0
    interjections = 0
    existential_theres = 0
    wh_pronouns = 0

    for word, tag in blob.tags:
        #all nouns grouped together: singular, plural, proper singular, proper plural
        if tag == 'NN' or tag == 'NNS' or tag == 'NNP' or tag == 'NNPS':
            nouns += 1
        elif tag == 'DT':
            determiners += 1
        elif tag == 'IN':
            prepositions += 1
        elif tag == 'VB':
            base_verbs +=1
        elif tag == 'VBD':
            pasttense_verbs += 1
        elif tag == 'VBG':
            verb_gerund_presentparticiple += 1
        elif tag == 'VBN':
            verb_pastparticiple += 1
        elif tag == 'VBP':
            verb_non3rdpersonsingularpresent += 1
        elif tag == 'VBZ':
            verb_3rdpersonsingularpresent += 1
        elif tag == 'TO':
            tos += 1
        #all adverbs grouped together: normal, comparative, superlative
        elif tag == 'RB' or tag == 'RBR' or tag == 'RBS':
            adverbs += 1
        #all adjectives grouped together: normal, comparative, superlative
        elif tag == 'JJ' or tag == 'JJR' or tag == 'JJS':
            adjectives += 1
        elif tag == 'MD':
            modals += 1
        elif tag == 'CC':
            coordinating_conjunctions += 1
        elif tag == 'RP':
            particles += 1
        elif tag == 'CD':
            cardinals += 1
        elif tag == 'PRP':
            personal_pronouns += 1
        #when
        elif tag == 'WRB':
            wh_adverbs += 1
        elif tag == 'PRP$':
            possessive_pronouns += 1
        #that
        elif tag == 'WDT':
            wh_determiners += 1
        elif tag == 'PDT':
            predeterminers += 1
        elif tag == 'UH':
            interjections += 1
        elif tag == 'EX':
            existential_theres += 1
        #who, what, whose
        elif tag == 'WP' or tag == 'WP$':
            wh_pronouns += 1

    total_verbs = base_verbs+pasttense_verbs+verb_gerund_presentparticiple+verb_pastparticiple+verb_non3rdpersonsingularpresent+verb_3rdpersonsingularpresent
    noun_count.append(nouns)
    determiner_count.append(determiners)
    preposition_count.append(prepositions)
    base_verb_count.append(base_verbs)
    pasttense_verb_count.append(pasttense_verbs)
    gerund_presentparticiple_verb_count.append(verb_gerund_presentparticiple)
    pastparticiple_verb_count.append(verb_pastparticiple)
    non3rdpersonsingularpresent_verb_count.append(verb_non3rdpersonsingularpresent)
    thirdpersonsingularpresent_verb_count.append(verb_3rdpersonsingularpresent)
    TOTAL_verb_count.append(total_verbs)
    to_count.append(tos)
    adverb_count.append(adverbs)
    adjective_count.append(adjectives)
    modal_count.append(modals)
    coordinating_conjunctions_count.append(coordinating_conjunctions)
    cardinals_count.append(cardinals)
    particle_count.append(particles)
    personal_pronoun_count.append(personal_pronouns)
    wh_adverbs_count.append(wh_adverbs)
    possessive_pronoun_count.append(possessive_pronouns)
    wh_determiner_count.append(wh_determiners)
    predeterminer_count.append(predeterminers)
    interjection_count.append(interjections)
    existential_there_count.append(existential_theres)
    wh_pronoun_count.append(wh_pronouns)
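    # NOTE: despite the name, this is a raw count of content words
    # (verbs + nouns + adjectives + adverbs); it is not normalized by song length
    content_density.append(total_verbs + nouns + adjectives + adverbs)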


df['noun_count'] = noun_count
df['determiner_count'] = determiner_count
df['preposition_count'] = preposition_count
df['base_verb_count'] = base_verb_count
df['pasttense_verb_count'] = pasttense_verb_count
df['gerund_presentparticiple_verb_count'] = gerund_presentparticiple_verb_count
df['pastparticiple_verb_count'] = pastparticiple_verb_count
df['non3rdpersonsingularpresent_verb_count'] = non3rdpersonsingularpresent_verb_count
df['3rdpersonsingularpresent_verb_count'] = thirdpersonsingularpresent_verb_count
df['TOTAL_verb_count'] = TOTAL_verb_count
df['to_count'] = to_count
df['adverb_count'] = adverb_count
df['adjective_count'] = adjective_count
df['modal_count'] = modal_count
df['coordinating_conjunctions_count'] = coordinating_conjunctions_count
df['cardinals_count'] = cardinals_count
df['particle_count'] = particle_count
df['personal_pronoun_count'] = personal_pronoun_count
df['wh_adverbs_count'] = wh_adverbs_count
df['possessive_pronoun_count'] = possessive_pronoun_count
df['wh_determiner_count'] = wh_determiner_count
df['predeterminer_count'] = predeterminer_count
df['interjection_count'] = interjection_count
df['existential_there_count'] = existential_there_count
df['wh_pronoun_count'] = wh_pronoun_count
df['content_density'] = content_density
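
The long `if`/`elif` chain above can also be written as a lookup table plus a `Counter`. This is just a minimal sketch of that idea, not a drop-in replacement: `TAG_GROUPS`, `VERB_TAGS` and `count_pos_groups` are hypothetical names, only some of the groupings are shown, and the example tags are hand-written rather than produced by a tagger.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tags to grouped feature names.
# Only the tags that the loop above merges into one bucket are listed here.
TAG_GROUPS = {
    'NN': 'noun', 'NNS': 'noun', 'NNP': 'noun', 'NNPS': 'noun',
    'RB': 'adverb', 'RBR': 'adverb', 'RBS': 'adverb',
    'JJ': 'adjective', 'JJR': 'adjective', 'JJS': 'adjective',
    'WP': 'wh_pronoun', 'WP$': 'wh_pronoun',
}

# The six verb tags that are summed into the total verb count.
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def count_pos_groups(tags):
    """Count grouped POS tags; tags without a group are counted under their own name."""
    counts = Counter(TAG_GROUPS.get(tag, tag) for tag in tags)
    # A Counter returns 0 for missing keys, so absent verb tags are harmless.
    counts['TOTAL_verb'] = sum(counts[tag] for tag in VERB_TAGS)
    return counts

# Hand-written tags for illustration only (not real tagger output).
example_tags = ['DT', 'NN', 'VBD', 'DT', 'JJ', 'NN']
pos_counts = count_pos_groups(example_tags)
print(pos_counts['noun'], pos_counts['DT'], pos_counts['TOTAL_verb'])
```

One nice side effect of the table is that adding or regrouping a tag is a one-line change instead of a new `elif` branch plus a new list.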


In [None]:
# Named entities!!!!!

import spacy


In [None]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 39.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
NER = spacy.load("en_core_web_sm")

In [None]:
# named entity counts and frequencies per song

ORG_count = [] # organizations
PERSON_count = [] # people
NORP_count = [] # nationalities, religious and political groups
FAC_count = [] # buildings, airports etc.
GPE_count = [] # countries, cities etc.
LOC_count = [] # mountain ranges, water bodies etc.
PRODUCT_count = [] # products
EVENT_count = [] # event names
WORK_OF_ART_count = [] # books, song titles
LAW_count = [] # legal document titles
LANGUAGE_count = [] # named languages
DATE_count = []
TIME_count = []
PERCENT_count = []
MONEY_count = []
QUANTITY_count = []
ORDINAL_count = []
CARDINAL_count = []
TOTAL_count = []

ORG_freq = []
PERSON_freq = []
NORP_freq = []
FAC_freq = []
GPE_freq = []
LOC_freq = []
PRODUCT_freq = []
EVENT_freq = []
WORK_OF_ART_freq = []
LAW_freq = []
LANGUAGE_freq = []
DATE_freq = []
TIME_freq = []
PERCENT_freq = []
MONEY_freq = []
QUANTITY_freq = []
ORDINAL_freq = []
CARDINAL_freq = []
TOTAL_freq = []

i=0
labels = []
for index, row in df.iterrows():

    print('song', i)

    current_song_lyrics = [x for x in row['Lyrics'].split('\n') if x != '' and x[0] != '[']

    CARDINAL = 0
    DATE = 0
    EVENT = 0
    FAC = 0
    GPE = 0
    LAW = 0
    LOC = 0
    MONEY = 0
    NORP = 0
    ORDINAL = 0
    ORG = 0
    PERSON = 0
    PRODUCT = 0
    QUANTITY = 0
    TIME = 0
    WORK_OF_ART = 0
    LANGUAGE = 0
    PERCENT = 0

    for line in current_song_lyrics:
        w_e_pairs = NER(line)
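        for word in w_e_pairs.ents:
            # record every label we see so that set(labels) can list them afterwards
            labels.append(word.label_)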

            if word.label_ == 'ORG':
                ORG += 1
            elif word.label_ == 'PERSON':
                PERSON += 1
            elif word.label_ == 'CARDINAL':
                CARDINAL += 1
            elif word.label_ == 'DATE':
                DATE += 1
            elif word.label_ == 'EVENT':
                EVENT += 1
            elif word.label_ == 'FAC':
                FAC += 1
            elif word.label_ == 'GPE':
                GPE += 1
            elif word.label_ == 'LAW':
                LAW += 1
            elif word.label_ == 'LOC':
                LOC += 1
            elif word.label_ == 'MONEY':
                MONEY += 1
            elif word.label_ == 'NORP':
                NORP += 1
            elif word.label_ == 'ORDINAL':
                ORDINAL += 1
            elif word.label_ == 'PRODUCT':
                PRODUCT += 1
            elif word.label_ == 'QUANTITY':
                QUANTITY += 1
            elif word.label_ == 'TIME':
                TIME += 1
            elif word.label_ == 'WORK_OF_ART':
                WORK_OF_ART += 1
            elif word.label_ == 'LANGUAGE':
                LANGUAGE += 1
            elif word.label_ == 'PERCENT':
                PERCENT += 1
            else:
                print(word.label_)

    # computed once per song, after the line loop, so a song with no usable lines still gets total = 0
    total = ORG + PERSON + NORP + FAC + GPE + LOC + PRODUCT + EVENT + WORK_OF_ART + LAW + LANGUAGE + DATE + TIME + PERCENT + MONEY + QUANTITY + ORDINAL + CARDINAL

    ORG_count.append(ORG)
    PERSON_count.append(PERSON)
    NORP_count.append(NORP) # nationalities, religious and political groups
    FAC_count.append(FAC) # buildings, airports etc.
    GPE_count.append(GPE) # countries, cities etc.
    LOC_count.append(LOC) # mountain ranges, water bodies etc.
    PRODUCT_count.append(PRODUCT) # products
    EVENT_count.append(EVENT) # event names
    WORK_OF_ART_count.append(WORK_OF_ART) # books, song titles
    LAW_count.append(LAW) # legal document titles
    LANGUAGE_count.append(LANGUAGE) # named languages
    DATE_count.append(DATE)
    TIME_count.append(TIME)
    PERCENT_count.append(PERCENT)
    MONEY_count.append(MONEY)
    QUANTITY_count.append(QUANTITY)
    ORDINAL_count.append(ORDINAL)
    CARDINAL_count.append(CARDINAL)
    TOTAL_count.append(total)

    ORG_freq.append(ORG/row['num_words'])
    PERSON_freq.append(PERSON/row['num_words'])
    NORP_freq.append(NORP/row['num_words'])
    FAC_freq.append(FAC/row['num_words'])
    GPE_freq.append(GPE/row['num_words'])
    LOC_freq.append(LOC/row['num_words'])
    PRODUCT_freq.append(PRODUCT/row['num_words'])
    EVENT_freq.append(EVENT/row['num_words'])
    WORK_OF_ART_freq.append(WORK_OF_ART/row['num_words'])
    LAW_freq.append(LAW/row['num_words'])
    LANGUAGE_freq.append(LANGUAGE/row['num_words'])
    DATE_freq.append(DATE/row['num_words'])
    TIME_freq.append(TIME/row['num_words'])
    PERCENT_freq.append(PERCENT/row['num_words'])
    MONEY_freq.append(MONEY/row['num_words'])
    QUANTITY_freq.append(QUANTITY/row['num_words'])
    ORDINAL_freq.append(ORDINAL/row['num_words'])
    CARDINAL_freq.append(CARDINAL/row['num_words'])
    TOTAL_freq.append(total/row['num_words'])

    i += 1


song 0
song 1
song 2
...

In [None]:
df['NER_ORG_count'] = ORG_count
df['NER_PERSON_count'] = PERSON_count
df['NER_NORP_count'] = NORP_count
df['NER_FAC_count'] = FAC_count
df['NER_GPE_count'] = GPE_count
df['NER_LOC_count'] = LOC_count
df['NER_PRODUCT_count'] = PRODUCT_count
df['NER_EVENT_count'] = EVENT_count
df['NER_WORK_OF_ART_count'] = WORK_OF_ART_count
df['NER_LAW_count'] = LAW_count
df['NER_LANGUAGE_count'] = LANGUAGE_count
df['NER_DATE_count'] = DATE_count
df['NER_TIME_count'] = TIME_count
df['NER_PERCENT_count'] = PERCENT_count
df['NER_MONEY_count'] = MONEY_count
df['NER_QUANTITY_count'] = QUANTITY_count
df['NER_ORDINAL_count'] = ORDINAL_count
df['NER_CARDINAL_count'] = CARDINAL_count
df['NER_TOTAL_count'] = TOTAL_count
df['NER_ORG_freq'] = ORG_freq
df['NER_PERSON_freq'] = PERSON_freq
df['NER_NORP_freq'] = NORP_freq
df['NER_FAC_freq'] = FAC_freq
df['NER_GPE_freq'] = GPE_freq
df['NER_LOC_freq'] = LOC_freq
df['NER_PRODUCT_freq'] = PRODUCT_freq
df['NER_EVENT_freq'] = EVENT_freq
df['NER_WORK_OF_ART_freq'] = WORK_OF_ART_freq
df['NER_LAW_freq'] = LAW_freq
df['NER_LANGUAGE_freq'] = LANGUAGE_freq
df['NER_DATE_freq'] = DATE_freq
df['NER_TIME_freq'] = TIME_freq
df['NER_PERCENT_freq'] = PERCENT_freq
df['NER_MONEY_freq'] = MONEY_freq
df['NER_QUANTITY_freq'] = QUANTITY_freq
df['NER_ORDINAL_freq'] = ORDINAL_freq
df['NER_CARDINAL_freq'] = CARDINAL_freq
df['NER_TOTAL_freq'] = TOTAL_freq
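
The counting and the `df[...] = ...` assignments above repeat the same pattern for all 18 labels. A more compact version might look like the sketch below. It is only an illustration: `NER_LABELS` and `ner_features` are hypothetical names, and the guard against `num_words == 0` is an added assumption (the loop above divides directly).

```python
from collections import Counter

# The 18 entity labels counted in the loop above.
NER_LABELS = ['ORG', 'PERSON', 'NORP', 'FAC', 'GPE', 'LOC', 'PRODUCT', 'EVENT',
              'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT',
              'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

def ner_features(entity_labels, num_words):
    """Turn a list of entity labels for one song into count and frequency features."""
    counts = Counter(entity_labels)
    features = {}
    for label in NER_LABELS:
        features[f'NER_{label}_count'] = counts[label]
        # Assumption: report a frequency of 0.0 for an empty song instead of dividing by zero.
        features[f'NER_{label}_freq'] = counts[label] / num_words if num_words else 0.0
    total = sum(counts[label] for label in NER_LABELS)
    features['NER_TOTAL_count'] = total
    features['NER_TOTAL_freq'] = total / num_words if num_words else 0.0
    return features

# Made-up labels for a 10-word "song", for illustration only.
example = ner_features(['PERSON', 'GPE', 'PERSON'], num_words=10)
print(example['NER_PERSON_count'], example['NER_TOTAL_freq'])
```

With spaCy, the label list for a song could be collected as `[ent.label_ for line in current_song_lyrics for ent in NER(line).ents]`, and the per-song dictionaries could then be turned into columns with `pd.DataFrame(list_of_feature_dicts)`.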

In [None]:
set(labels)

{'CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

In [None]:
df

Unnamed: 0,lastfm_url,track,artist,seeds,number_of_emotion_tags,valence_tags,arousal_tags,dominance_tags,mbid,spotify_id,...,NER_LAW_freq,NER_LANGUAGE_freq,NER_DATE_freq,NER_TIME_freq,NER_PERCENT_freq,NER_MONEY_freq,NER_QUANTITY_freq,NER_ORDINAL_freq,NER_CARDINAL_freq,NER_TOTAL_freq
0,https://www.last.fm/music/metallica/_/st.%2banger,St. Anger,Metallica,['aggressive'],8,3.710000,5.833000,5.427250,727a2529-7ee8-4860-aef6-7959884895cb,3fOc9x06lKJBhz435mInlH,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.032333
1,https://www.last.fm/music/m.i.a./_/bamboo%2bbanga,Bamboo Banga,M.I.A.,"['aggressive', 'fun', 'sexy', 'energetic']",13,6.555071,5.537214,5.691357,99dd2c8c-e7c1-413e-8ea4-4497a00ffa18,6tqFC1DIOphJkCwrjVzPmg,...,0.0,0.0,0.002146,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.023605
2,https://www.last.fm/music/drowning%2bpool/_/st...,Step Up,Drowning Pool,['aggressive'],9,2.971389,5.537500,4.726389,49e7b4d2-3772-4301-ba25-3cc46ceb342e,4Q1w4Ryyi8KNxxaFlOQClK,...,0.0,0.0,0.011583,0.000000,0.0,0.0,0.0,0.007722,0.000000,0.027027
3,https://www.last.fm/music/deftones/_/7%2bwords,7 Words,Deftones,"['aggressive', 'angry']",10,3.807121,5.473939,4.729091,1a826083-5585-445f-a708-415dc90aa050,6DoXuH326aAYEN8CnlLmhP,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.005038
4,https://www.last.fm/music/deftones/_/when%2bgi...,When Girls Telephone Boys,Deftones,"['aggressive', 'angry', 'driving', 'energetic']",8,3.910741,4.915556,4.631852,3bc2c1a9-43bc-45b2-87fc-4313eb2534fe,6xK3sBdRm99g9T8Ov0gjdF,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.009217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1533,https://www.last.fm/music/billy%2bjoel/_/to%2b...,To Make You Feel My Love,Billy Joel,['quiet'],6,4.047212,2.953077,4.020577,1bf51e3f-dd54-43d0-9ec0-1f6a57a86d6b,6KaNr5y4GAlHGp2dQiZNPs,...,0.0,0.0,0.005682,0.005682,0.0,0.0,0.0,0.000000,0.000000,0.017045
1534,https://www.last.fm/music/jon%2bforeman/_/equa...,Equally Skilled,Jon Foreman,['quiet'],5,6.491667,3.460000,5.688333,f2b00b37-ab87-480c-bef3-238f5c336193,3QvVWld6eVVn2osdNwML93,...,0.0,0.0,0.003509,0.000000,0.0,0.0,0.0,0.000000,0.003509,0.007018
1535,https://www.last.fm/music/warren%2bzevon/_/my%...,My Shit's Fucked Up,Warren Zevon,['cynical'],6,2.758333,3.813333,3.856667,8a6c2225-480a-4d5a-b6f1-c8b892fbaf97,26douMAqNELour6sKd2oR7,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.007634
1536,https://www.last.fm/music/prince/_/i%2bhate%2bu,I Hate U,Prince,['cynical'],5,7.003258,5.863864,5.831894,4d3c271c-08a5-472d-8e5a-7b70604ec288,1hJc23JlQlCAs2SUVDAVWL,...,0.0,0.0,0.001969,0.000000,0.0,0.0,0.0,0.003937,0.045276,0.068898


In [None]:
for x in df.columns:
    print(x)

lastfm_url
track
artist
seeds
number_of_emotion_tags
valence_tags
arousal_tags
dominance_tags
mbid
spotify_id
genre
track edited
artist edited
Lyrics
HF_byline_average_sentiment
HF_byline_minimum_sentiment
HF_byline_maximum_sentiment
HF_byline_median_sentiment
HF_byline_stdv_sentiment
HF_byline_firstquartile_sentiment
HF_byline_thirdquartile_sentiment
HF_byline_ratio_negative
HF_byline_ratio_positive
HF_fullinput512_one_sentiment_number
HF_fullinput512_one_sentiment_label
number_lines
HF_ROBERTA_byline_average_sentiment
HF_ROBERTA_byline_minimum_sentiment
HF_ROBERTA_byline_maximum_sentiment
HF_ROBERTA_byline_median_sentiment
HF_ROBERTA_byline_stdv_sentiment
HF_ROBERTA_byline_firstquartile_sentiment
HF_ROBERTA_byline_thirdquartile_sentiment
HF_ROBERTA_byline_ratio_negative
HF_ROBERTA_byline_ratio_positive
HF_ROBERTA_fullinput512_one_sentiment_number
HF_ROBERTA_fullinput512_one_sentiment_label
SPLIT
VADER_byline_average_sentiment
VADER_byline_minimum_sentiment
VADER_byline_maximum_sentim

In [None]:
# save our new features

df[df['SPLIT'] == 'train'].to_excel('TRAIN language data.xlsx')
df[df['SPLIT'] == 'val'].to_excel('VAL language data.xlsx')
df[df['SPLIT'] == 'test'].to_excel('TEST language data.xlsx')

Then let's see if any of them correlate with our variables of interest - ONLY ON THE TRAIN + DEV SETS

## Dominance
Our largest correlation for dominance from acoustics was 0.18

In [None]:
# for feature selection, we can combine the train and validation sets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, r_regression

feature_selection_train = df[(df['SPLIT'] == 'train') | (df['SPLIT'] == 'val')]
feature_selection_X = feature_selection_train.drop(columns=['lastfm_url','track','artist','seeds','number_of_emotion_tags','valence_tags','arousal_tags','dominance_tags','mbid','spotify_id','genre','track edited','artist edited','Lyrics'])

# let's just do the single values for now, we will have to
# do something special for lists or strings
ignore_me_for_now = []
for column in feature_selection_X.columns:
  if feature_selection_X[column].dtype != 'float64' and feature_selection_X[column].dtype != 'int64':
    #print(feature_selection_X[column].dtype)
    ignore_me_for_now.append(column)

feature_selection_X = feature_selection_X.drop(columns=ignore_me_for_now)

In [None]:
# see if there are any NaN or infinite values in the df

for index, row in feature_selection_X.iterrows():
  for c in feature_selection_X.columns:
    # np.isfinite is False for NaN as well as +/- infinity, so one test covers both
    if not np.isfinite(row[c]):
        print(c, row[c])

# looks like the by-line sentiment features (HF, HF_ROBERTA, VADER) are NaN for at least one song
# that is fine - we will handle it below!

HF_byline_average_sentiment nan
HF_byline_minimum_sentiment nan
HF_byline_maximum_sentiment nan
HF_byline_median_sentiment nan
HF_byline_stdv_sentiment nan
HF_byline_firstquartile_sentiment nan
HF_byline_thirdquartile_sentiment nan
HF_byline_ratio_negative nan
HF_byline_ratio_positive nan
HF_ROBERTA_byline_average_sentiment nan
HF_ROBERTA_byline_minimum_sentiment nan
HF_ROBERTA_byline_maximum_sentiment nan
HF_ROBERTA_byline_median_sentiment nan
HF_ROBERTA_byline_stdv_sentiment nan
HF_ROBERTA_byline_firstquartile_sentiment nan
HF_ROBERTA_byline_thirdquartile_sentiment nan
HF_ROBERTA_byline_ratio_negative nan
HF_ROBERTA_byline_ratio_positive nan
VADER_byline_average_sentiment nan
VADER_byline_minimum_sentiment nan
VADER_byline_maximum_sentiment nan
VADER_byline_median_sentiment nan
VADER_byline_stdv_sentiment nan
VADER_byline_firstquartile_sentiment nan
VADER_byline_thirdquartile_sentiment nan
VADER_byline_ratio_negative nan
VADER_byline_ratio_positive nan
VADER_byline_ratio_neutral nan


In [None]:
# here is where we deal with the rows that have a NaN or inf value
feature_selection_X = feature_selection_X.replace((np.inf, -np.inf, np.nan), 0)

In [None]:
feature_selection_y = feature_selection_train['dominance_tags']

# configure to select all features
skb = SelectKBest(score_func = r_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest score
# NOTE: r_regression scores are signed, so strongly NEGATIVE correlations end up at the bottom of the list
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by correlation
for i, ind in enumerate(res):
    column = feature_selection_X.columns[ind]
    print('INDEX', str(i), ':', column, 'SCORE:', scores[ind], 'CORRELATION:', pearsonr(feature_selection_X[column], feature_selection_y))

INDEX 0 : VADER_byline_average_sentiment SCORE: 0.17764430770990897 CORRELATION: PearsonRResult(statistic=0.17764430770990816, pvalue=1.026741028872314e-62)
INDEX 1 : HF_byline_average_sentiment SCORE: 0.15431569547190846 CORRELATION: PearsonRResult(statistic=0.15431569547190818, pvalue=1.3847392443946945e-47)
INDEX 2 : HF_byline_ratio_positive SCORE: 0.15380022237167704 CORRELATION: PearsonRResult(statistic=0.15380022237167756, pvalue=2.822352174668705e-47)
INDEX 3 : HF_ROBERTA_byline_ratio_positive SCORE: 0.14927708388578098 CORRELATION: PearsonRResult(statistic=0.14927708388578179, pvalue=1.3124942587223025e-44)
INDEX 4 : HF_ROBERTA_byline_average_sentiment SCORE: 0.1489732521004843 CORRELATION: PearsonRResult(statistic=0.14897325210048445, pvalue=1.9693472390701076e-44)
INDEX 5 : base_verb_count SCORE: 0.14669282017518484 CORRELATION: PearsonRResult(statistic=0.14669282017518379, pvalue=4.0284127407439574e-43)
INDEX 6 : VADER_byline_maximum_sentiment SCORE: 0.1368421697257087 CORRE

## Valence
Our largest correlation for valence from acoustics was 0.21

In [None]:
feature_selection_y = feature_selection_train['valence_tags']

# configure to select all features
skb = SelectKBest(score_func = r_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest score
# NOTE: r_regression scores are signed, so strongly NEGATIVE correlations end up at the bottom of the list
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by correlation
for i, ind in enumerate(res):
    column = feature_selection_X.columns[ind]
    print('INDEX', str(i), ':', column, 'SCORE:', scores[ind], 'CORRELATION:', pearsonr(feature_selection_X[column], feature_selection_y))

INDEX 0 : HF_byline_average_sentiment SCORE: 0.21168941938500377 CORRELATION: PearsonRResult(statistic=0.21168941938500394, pvalue=7.267501001788164e-89)
INDEX 1 : HF_byline_ratio_positive SCORE: 0.2104400899192048 CORRELATION: PearsonRResult(statistic=0.21044008991920463, pvalue=8.0959707571434e-88)
INDEX 2 : VADER_byline_average_sentiment SCORE: 0.2036338039344327 CORRELATION: PearsonRResult(statistic=0.2036338039344321, pvalue=3.115925373229476e-82)
INDEX 3 : HF_ROBERTA_byline_ratio_positive SCORE: 0.20227564684053706 CORRELATION: PearsonRResult(statistic=0.2022756468405371, pvalue=3.839968561263931e-81)
INDEX 4 : HF_ROBERTA_byline_average_sentiment SCORE: 0.2022683379703372 CORRELATION: PearsonRResult(statistic=0.20226833797033753, pvalue=3.892029468515418e-81)
INDEX 5 : HF_byline_median_sentiment SCORE: 0.18384236993326963 CORRELATION: PearsonRResult(statistic=0.18384236993327052, pvalue=4.115267481146993e-67)
INDEX 6 : HF_ROBERTA_byline_median_sentiment SCORE: 0.17563942800162155

## Arousal
Our largest correlation for arousal from acoustics was 0.42

In [None]:
feature_selection_y = feature_selection_train['arousal_tags']

# configure to select all features
skb = SelectKBest(score_func = r_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest score
# NOTE: r_regression scores are signed, so strongly NEGATIVE correlations end up at the bottom of the list
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by correlation
for i, ind in enumerate(res):
    column = feature_selection_X.columns[ind]
    print('INDEX', str(i), ':', column, 'SCORE:', scores[ind], 'CORRELATION:', pearsonr(feature_selection_X[column], feature_selection_y))

INDEX 0 : content_density SCORE: 0.23467502049351263 CORRELATION: PearsonRResult(statistic=0.23467502049351238, pvalue=2.4227594565299844e-109)
INDEX 1 : TOTAL_verb_count SCORE: 0.23064980061795257 CORRELATION: PearsonRResult(statistic=0.23064980061795026, pvalue=1.37533647671277e-105)
INDEX 2 : number_lines SCORE: 0.2304298987812174 CORRELATION: PearsonRResult(statistic=0.2304298987812175, pvalue=2.194947187385357e-105)
INDEX 3 : num_5grams SCORE: 0.2262872255583596 CORRELATION: PearsonRResult(statistic=0.22628722555835964, pvalue=1.336469949774808e-101)
INDEX 4 : num_unique_5grams SCORE: 0.2262872255583596 CORRELATION: PearsonRResult(statistic=0.22628722555835964, pvalue=1.336469949774808e-101)
INDEX 5 : num_4grams SCORE: 0.22628677300010927 CORRELATION: PearsonRResult(statistic=0.2262867730001112, pvalue=1.3377300704732601e-101)
INDEX 6 : num_unique_4grams SCORE: 0.22628677300010927 CORRELATION: PearsonRResult(statistic=0.2262867730001112, pvalue=1.3377300704732601e-101)
INDEX 7 : n
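
One thing to keep in mind with the three rankings above: they are sorted by the signed correlation, so a feature with a strong negative correlation lands at the very end of the list even though it may be just as useful as a positive one. Below is a small sketch of ranking by absolute correlation instead. It uses plain Python and toy numbers so it runs on its own; `pearson_r` and `rank_by_abs_correlation` are hypothetical helpers, and returning 0.0 for a constant column is an assumption made here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumption: treat a constant column as uncorrelated rather than raising an error.
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_by_abs_correlation(features, target):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy data, for illustration only.
toy_features = {
    'pos_feature': [1, 2, 3, 4],
    'neg_feature': [4, 3, 2, 1],
    'weak_feature': [1, 3, 2, 2],
}
toy_target = [1.0, 2.0, 3.0, 4.0]

ranking = rank_by_abs_correlation(toy_features, toy_target)
for name, r in ranking:
    print(name, round(r, 3))
```

In the cells above, the same effect could be had by sorting with `key = lambda sub: abs(scores[sub])`.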

Now let's look at mutual information

In [None]:
from sklearn.feature_selection import mutual_info_regression

feature_selection_y = feature_selection_train['dominance_tags']

# configure to select all features
skb = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest mutual information
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by mutual information
for i, ind in enumerate(res):
    print('INDEX', str(i), ':', feature_selection_X.columns[ind], 'SCORE:', scores[ind])

INDEX 0 : TOTAL_verb_count SCORE: 0.0369766012647883
INDEX 1 : HF_ROBERTA_byline_median_sentiment SCORE: 0.0323697487240695
INDEX 2 : HF_ROBERTA_byline_average_sentiment SCORE: 0.03089705594588743
INDEX 3 : number_lines SCORE: 0.02738068900040247
INDEX 4 : base_verb_freq SCORE: 0.025424330579574494
INDEX 5 : HF_ROBERTA_fullinput512_one_sentiment_number SCORE: 0.024593101616992463
INDEX 6 : VADER_byline_average_sentiment SCORE: 0.022853155889398735
INDEX 7 : base_verb_count SCORE: 0.022583554198184252
INDEX 8 : num_unique_trigrams SCORE: 0.022411148049984675
INDEX 9 : num_5grams SCORE: 0.022365244066357093
INDEX 10 : VADER_byline_maximum_sentiment SCORE: 0.02213890123798823
INDEX 11 : HF_ROBERTA_byline_ratio_negative SCORE: 0.022050179566609707
INDEX 12 : HF_ROBERTA_byline_ratio_positive SCORE: 0.021724403677830573
INDEX 13 : num_unique_5grams SCORE: 0.021689708002269548
INDEX 14 : num_bigrams SCORE: 0.021607253693519013
INDEX 15 : num_unique_bigrams SCORE: 0.02117635196410994
INDEX 16 

In [None]:
feature_selection_y = feature_selection_train['valence_tags']

# configure to select all features
skb = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest mutual information
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by mutual information
for i, ind in enumerate(res):
    print('INDEX', str(i), ':', feature_selection_X.columns[ind], 'SCORE:', scores[ind])

INDEX 0 : num_5grams SCORE: 0.039105293725746115
INDEX 1 : num_unique_4grams SCORE: 0.03848529633635689
INDEX 2 : num_trigrams SCORE: 0.038449910607366355
INDEX 3 : num_words SCORE: 0.03780691760139643
INDEX 4 : num_4grams SCORE: 0.037759290078604124
INDEX 5 : num_unique_5grams SCORE: 0.03767899983412004
INDEX 6 : num_unique_bigrams SCORE: 0.03739506364452527
INDEX 7 : num_bigrams SCORE: 0.03712649806488155
INDEX 8 : HF_ROBERTA_fullinput512_one_sentiment_number SCORE: 0.03634794246266049
INDEX 9 : num_unique_trigrams SCORE: 0.03599740275192875
INDEX 10 : type_token_ratio SCORE: 0.03399851734510673
INDEX 11 : HF_byline_firstquartile_sentiment SCORE: 0.03163255225832229
INDEX 12 : HF_ROBERTA_byline_thirdquartile_sentiment SCORE: 0.030106526323122385
INDEX 13 : HF_ROBERTA_byline_average_sentiment SCORE: 0.02941034830271505
INDEX 14 : HF_byline_average_sentiment SCORE: 0.0292043042757113
INDEX 15 : number_lines SCORE: 0.029023310607135144
INDEX 16 : base_verb_count SCORE: 0.028992544310930

In [None]:
feature_selection_y = feature_selection_train['arousal_tags']

# configure to select all features
skb = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
fit = skb.fit(feature_selection_X, feature_selection_y)

# summarize scores
scores = fit.scores_

# INDICES of the features, sorted from largest to smallest mutual information
res = sorted(range(len(scores)), key = lambda sub: scores[sub], reverse=True)

#printing features ranked by mutual information
for i, ind in enumerate(res):
    print('INDEX', str(i), ':', feature_selection_X.columns[ind], 'SCORE:', scores[ind])

INDEX 0 : adjective_count SCORE: 0.05082632896920991
INDEX 1 : num_words SCORE: 0.04876484449326757
INDEX 2 : num_5grams SCORE: 0.04837045280215513
INDEX 3 : num_4grams SCORE: 0.04821393048227751
INDEX 4 : num_unique_bigrams SCORE: 0.047744072700954376
INDEX 5 : num_unique_trigrams SCORE: 0.04765633572761274
INDEX 6 : num_unique_5grams SCORE: 0.04755501286334418
INDEX 7 : num_trigrams SCORE: 0.047505371216749026
INDEX 8 : number_lines SCORE: 0.04738084578128987
INDEX 9 : num_bigrams SCORE: 0.046992205977806734
INDEX 10 : non3rdpersonsingularpresent_verb_count SCORE: 0.04570499626396973
INDEX 11 : num_unique_4grams SCORE: 0.04524936674246405
INDEX 12 : noun_count SCORE: 0.042097992805620876
INDEX 13 : base_verb_count SCORE: 0.03753682045568851
INDEX 14 : TOTAL_verb_count SCORE: 0.03736745136731923
INDEX 15 : content_density SCORE: 0.03458158557082136
INDEX 16 : pastparticiple_verb_freq SCORE: 0.032048842533438204
INDEX 17 : personal_pronoun_count SCORE: 0.03104155923545271
INDEX 18 : 3r

So for machine learning, we want to drop the features that have very low mutual information or small correlations (in absolute value). These features just add undesirable noise to a model.

- we can take the top 20-50 features (play around with this number) by correlation and use them as our feature set
- we can take the top 20-50 features (play around with this number) by mutual information and use them as our feature set
- we can take the top 20-50 by correlation AND the top 20-50 by mutual information and use the combined set as our feature set
- we can use a percentile cutoff instead of a fixed count, e.g. the 20th percentile of the scores (in the same way as the 3 points above)
- we can fit a model that doesn't do any feature selection at all.
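
The options above can be sketched roughly like this. It is a minimal illustration with made-up scores and placeholder feature names: `top_k_features`, `top_fraction_features` and `union_preserving_order` are hypothetical helpers, and the correlation ranking uses absolute values, which is an assumption about how "small correlations" should be read.

```python
import math

def top_k_features(scores, k, key=lambda value: value):
    """Return the names of the k features with the largest key(score)."""
    ranked = sorted(scores.items(), key=lambda item: key(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

def top_fraction_features(scores, fraction, key=lambda value: value):
    """Keep the top `fraction` of features (at least one), e.g. 0.2 for the top 20%."""
    k = max(1, math.ceil(len(scores) * fraction))
    return top_k_features(scores, k, key=key)

def union_preserving_order(*name_lists):
    """Merge several feature lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for names in name_lists:
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

# Made-up scores for placeholder features, for illustration only.
corr_scores = {'feat_a': 0.21, 'feat_b': -0.18, 'feat_c': 0.02, 'feat_d': 0.10}
mi_scores = {'feat_a': 0.01, 'feat_b': 0.03, 'feat_c': 0.04, 'feat_d': 0.00}

by_corr = top_k_features(corr_scores, k=2, key=abs)   # rank by |correlation|
by_mi = top_k_features(mi_scores, k=2)                # rank by mutual information
combined = union_preserving_order(by_corr, by_mi)
print(by_corr, by_mi, combined)
```

The resulting list of names can be used directly to subset the feature table, e.g. `feature_selection_X[combined]`.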

Let's also just get the embedding of each song and see if that predicts sentiment

In [None]:
from transformers import BertTokenizer, BertModel
import torch

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

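def get_BERT_embedding(text):
    # truncation=True cuts the input at BERT's 512-token limit, so long lyrics get clipped
    tokenized_text = bert_tokenizer.encode(text, truncation=True)
    input_ids = torch.tensor(tokenized_text).unsqueeze(0)  # add a batch dimension
    # we are only doing inference, so skip gradient tracking to save memory
    with torch.no_grad():
        outputs = bert_model(input_ids)
    last_hidden_state = outputs[0]  # shape: (1, number of tokens, 768)
    this_batch = last_hidden_state[0]
    cls_vector = this_batch[0]  # the vector for the [CLS] token
    return cls_vector.detach().numpy()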


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
df['BERT embedding'] = df['Lyrics without newline'].apply(get_BERT_embedding)
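
Because `get_BERT_embedding` truncates its input, anything past BERT's token limit is ignored for long songs. One possible workaround (not what this notebook does) is to split the token ids into windows, embed each window, and average the resulting vectors. The sketch below only shows the splitting and averaging steps in plain Python; `chunk_token_ids` and `mean_pool` are hypothetical helpers, and the default of 510 assumes each window still needs room for the `[CLS]` and `[SEP]` tokens.

```python
def chunk_token_ids(token_ids, max_len=510):
    """Split a list of token ids into consecutive windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError('max_len must be positive')
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def mean_pool(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    if not vectors:
        raise ValueError('need at least one vector')
    n = len(vectors)
    return [sum(dimension) / n for dimension in zip(*vectors)]

# Fake token ids standing in for a long song, for illustration only.
chunks = chunk_token_ids(list(range(1200)))
print([len(chunk) for chunk in chunks])

# Two tiny fake "embeddings" standing in for per-window [CLS] vectors.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
print(pooled)
```

Whether averaging windows actually helps over plain truncation is an open question for this data; it would need to be checked on the validation set.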